# Generalization Analysis for Label-Specific Representation Learning

Yi-Fan Zhang1,3, Min-Ling Zhang2,3
1 School of Cyber Science and Engineering, Southeast University, Nanjing 210096, China
2 School of Computer Science and Engineering, Southeast University, Nanjing 210096, China
3 Key Laboratory of Computer Network and Information Integration (Southeast University), Ministry of Education, China
{yfzh, zhangml}@seu.edu.cn

38th Conference on Neural Information Processing Systems (NeurIPS 2024).

Label-specific representation learning (LSRL), i.e., constructing the representation with specific discriminative properties for each class label, is an effective strategy to improve the performance of multi-label learning. However, the generalization analysis of LSRL is still in its infancy. The existing theoretical bounds for multi-label learning, which preserve the coupling among different components, are invalid for LSRL. To overcome this challenge and fill the gap in the generalization theory of LSRL, we develop a novel vector-contraction inequality and derive a generalization bound for the general function class of LSRL with a weaker dependency on the number of labels than the state of the art. In addition, we derive generalization bounds for typical LSRL methods, and these theoretical results reveal the impact of different label-specific representations on the generalization analysis. The mild bounds without strong assumptions explain the good generalization ability of LSRL.

1 Introduction

Multi-label learning has received continued attention in the machine learning community due to its widespread encounters in real-world applications, where each object is represented by a single instance associated with multiple class labels [Zhang and Zhou, 2014, Liu et al., 2022]. The goal of multi-label learning is to model real-world objects with multiple semantics. It has made important advances in multimedia content annotation [Cabral et al., 2011, You et al., 2020], text categorization [Rubin et al., 2012, Xun et al., 2020], bioinformatics [Cesa-Bianchi et al., 2012, Chen et al., 2017] and other fields [Yu et al., 2005].

The most straightforward strategy, which exploits an identical representation of an instance for dealing with multi-label data, is perhaps suboptimal because it fails to take into account that each class label may possess its own discriminative properties [Zhang and Wu, 2015, Huang et al., 2016, Hang and Zhang, 2022]. In recent years, label-specific representation learning (LSRL) [Zhang and Wu, 2015] has been proposed to facilitate the discrimination of each class label by tailoring its own representations. Due to its ability to model distinct characteristics for each class label, LSRL has become an effective strategy to improve the performance of multi-label learning [Huang et al., 2015, 2016, 2018, Yu and Zhang, 2022, Hang and Zhang, 2022, Hang et al., 2022].

Although LSRL has achieved impressive empirical advances in multi-label learning, the problem of understanding LSRL theoretically remains completely under-explored. In recent years, explaining why multi-label models generalize well has become an important open problem in the multi-label learning community. The empirical success of LSRL makes the generalization analysis of LSRL an important problem in multi-label learning.
However, existing theoretical results and analysis methods are not applicable to LSRL, which leads to a serious lack of progress in its generalization analysis. A satisfactory and complete study of the generalization analysis for LSRL should address two aspects: 1) the dependency of the generalization bounds on the number of labels (i.e., $c$) should be weaker than square-root, and 2) the relationship among components should be decoupled in the generalization analysis.

First, existing bounds with a linear or square-root dependency on $c$ are difficult to reconcile with the empirical success of multi-label learning. For example, bounds with a linear dependency on $c$ are vacuous (i.e., no longer less than 1) for commonly used multi-label datasets such as CAL500, corel5k, rcv1-s1, Corel16k-s1, delicious, iaprtc12, espgame, etc., since $c$ is larger than $\sqrt{n}$ ($n$ being the number of examples) for these datasets. Bounds with a linear or square-root dependency on $c$ are vacuous for commonly used extreme multi-label datasets such as Wiki10, Amazon-670K, etc., since $c$ is larger than $n$. The above failure of existing bounds for multi-label learning also applies to LSRL. This suggests that only bounds with a weaker dependency on $c$ can provide effective theoretical guarantees. Second, existing theoretical results preserve the coupling among different components [Lei et al., 2015, Wu and Zhu, 2020, Wu et al., 2021a,b]. However, LSRL decomposes the multi-label learning problem into multiple binary classification problems, which means that the relationship among different components needs to be decoupled in the generalization analysis. Hence, we need to develop new analysis methods for LSRL. As a matter of fact, theoretical research on LSRL can also promote a better understanding of multi-label learning.

In this paper, we derive novel and tighter bounds based on the Rademacher complexity for LSRL. Specifically, we develop a novel vector-contraction inequality under the assumption that the loss function is Lipschitz continuous w.r.t. the $\ell_\infty$ norm, and then derive a bound for general function classes of LSRL with no dependency on $c$, up to logarithmic terms, which is tighter than the state of the art. In addition, we also analyze the bounds for several typical LSRL methods, and we show that the construction method of label-specific representations affects the generalization bound. Our bounds for LSRL improve the dependency on $c$. Major contributions of the paper include:

- We develop a novel vector-contraction inequality for $\ell_\infty$ norm Lipschitz continuous losses, which overcomes the limitations of existing theoretical results and provides a theoretical tool for the generalization analysis of LSRL.
- We derive bounds for general function classes of LSRL with a weaker dependency on $c$ than the state of the art, which provides a general theoretical guarantee for LSRL.
- We derive bounds for typical LSRL methods, which reveal the impact of different label-specific representations on the generalization analysis. The theoretical techniques and results on k-means clustering, Lasso, and DNNs involved here may be of independent interest.

2 Related Work

2.1 Multi-Label Learning

Multi-label learning is one of the most studied and important machine learning paradigms in practice [Zhang and Zhou, 2014, Liu et al., 2022]. To cope with the challenge that the output space is exponential in size with respect to the number of class labels, modeling label correlations is adopted as a feasible strategy to facilitate the learning process.
Generally speaking, existing methods can be roughly grouped into three major categories based on the order of label correlations being considered, namely first-order methods [Boutell et al., 2004, Zhang and Zhou, 2007, Zhang et al., 2018] which tackle multi-label learning problem by decomposing it into a number of independent binary classification problems, second-order methods [Elisseeff and Weston, 2001, Zhu et al., 2018, Sun and Zhang, 2021] which tackle multi-label learning problem by considering pairwise relationships between labels, and high-order methods [Ji et al., 2010, Xu and Guo, 2021] which tackle multi-label learning problem by exploiting high-order relationships among labels. Recent years, benefiting from the good generalization performance of deep learning, deep methods such as recurrent neural networks [Yazici et al., 2020], graph neural networks[Chen et al., 2020], and embedding models [Bai et al., 2020, Dahiya et al., 2021] have been explored to model label correlations. As a complement to the exploitation of label correlations, label-specific representation learning (LSRL) have been proven to be another effective strategy to improve multi-label learning by manipulating the input space. Existing methods can be roughly grouped into three major categories based on the construction method of label-specific representations, i.e., prototype-based label-specific representation transformation methods [Zhang and Wu, 2015, Zhang et al., 2015, Weng et al., 2018, Guo et al., 2019, Zhang and Li, 2021] which generate label-specific representations by treating the prototypes of each class label as the transformation bases, label-specific feature selection methods [Huang et al., 2015, 2016, 2018, Weng et al., 2020, Yu and Zhang, 2022] which generate label-specific representations by retaining a feature subset as the most pertinent features for each class label, and deep label-specific representation methods [Hang and Zhang, 2022, Hang et al., 2022] which learn label-specific representations in an end-to-end manner by exploiting deep models. In particular, k-means clustering-based LSRL, i.e., LIFT [Zhang and Wu, 2015], is the representative method of prototype-based label-specific representation transformation methods, which constructs label-specific representations by querying the distances between the original inputs and the cluster centers for each class label. Lasso-based LSRL, i.e., LLSF [Huang et al., 2016], is the representative method of label-specific feature selection methods, which presents a Lasso-based framework with the constraint of pairwise label correlations for each class label. DNN-based LSRL, i.e., CLIF [Hang and Zhang, 2022], is the representative method of deep label-specific representation methods, which proposes to learn label semantics and label-specific representations in a collaborative way. In this paper, we will analyze the generalization bounds of these three representative LSRL methods. 2.2 Generalization Bounds for Multi-Label Learning Dembczynski et al. [2010] derived the relationship between the expectations of Hamming and Subset loss based on the regret analysis on Hamming and Subset loss, and Dembczynski et al. [2012] further performed regret analysis on Ranking loss, which provided preliminary theoretical insights for understanding Hamming, Subset and Ranking loss. With the typical vector-contraction inequality [Maurer, 2016] (i.e., assume that the loss function is Lipschitz continuous w.r.t. 
the $\ell_2$ norm), one can obtain generalization bounds of order $O(c/\sqrt{n})$ for multi-label learning. Wu and Zhu [2020], Wu et al. [2021a] showed that the order of the generalization bounds for Subset loss, Hamming loss and reweighted convex surrogate univariate loss can be improved to $O(\sqrt{c/n})$, which preserved the coupling among different components and exploited the relationship between loss functions. Liu et al. [2018] also obtained a generalization bound of order $O(\sqrt{c/n})$ for dual set multi-label learning, which was analyzed under the margin loss and kernel function classes. Wu et al. [2021b] derived a generalization bound of order $O(\log(nc)/(n\sigma))$ for norm-regularized kernel function classes under the assumptions that the loss function is Lipschitz continuous w.r.t. the $\ell_\infty$ norm and the regularizer is $\sigma$-strongly convex with respect to some norm. Yu et al. [2014] obtained a generalization bound of order $O(1/\sqrt{n})$ for trace-norm-regularized linear function classes with Decomposable loss. Xu et al. [2016] used the local Rademacher complexity to derive a generalization bound of order $\widetilde{O}(1/n)$ for trace-norm-regularized linear function classes under the assumption that the singular values of the weight matrix decay exponentially. These theoretical results all preserved the coupling among different components. In addition, Wu et al. [2023] obtained $O(1/\sqrt{n})$ bounds for Macro-Averaged AUC and gave thorough discussions about their relationship with label-wise class imbalance, which transformed the macro-averaged maximization problem in multi-label learning into the problem of learning multiple tasks with graph-dependent examples. Here we obtain $\widetilde{O}(1/\sqrt{n})$ bounds with the state-of-the-art dependency on the number of labels for LSRL function classes under the assumption that the loss function is Lipschitz continuous w.r.t. the $\ell_\infty$ norm.

3 Preliminaries

Let $[n] := \{1, \ldots, n\}$ for any natural number $n$. In the context of multi-label learning, we are given a dataset $D = \{(x_1, y_1), \ldots, (x_n, y_n)\}$ with $n$ examples drawn identically and independently (i.i.d.) from a probability distribution $P$ on $\mathcal{X} \times \mathcal{Y}$, where $\mathcal{X} \subseteq \mathbb{R}^d$ denotes the $d$-dimensional input space and $\mathcal{Y} = \{-1, +1\}^c$ denotes the label space with $c$ class labels, $x \in \mathcal{X}$, $y \in \mathcal{Y}$, i.e., each $y = (y_1, \ldots, y_c)$ is a binary vector and $y_j = +1$ ($y_j = -1$) denotes that the $j$-th label is relevant (irrelevant), $j \in [c]$. The task of multi-label learning is to learn a multi-label classifier $h \in \mathcal{H} : \mathcal{X} \mapsto \{-1, +1\}^c$ which assigns each instance a set of relevant labels. A common strategy is to learn a vector-valued function $f = (f_1, \ldots, f_c) : \mathcal{X} \mapsto \mathbb{R}^c$ and derive the classifier by a thresholding function which divides the label space into relevant and irrelevant label sets. For any vector-valued function $f : \mathcal{X} \mapsto \mathbb{R}^c$, the prediction quality on an example $(x, y)$ is measured by a loss function $\ell : \mathbb{R}^c \times \{-1, +1\}^c \mapsto \mathbb{R}_+$. The goal of learning is to find a hypothesis $f \in \mathcal{F}$ with good generalization performance from the dataset $D$ by optimizing the loss $\ell$. The generalization performance is measured by the expected risk $R(f) = \mathbb{E}_{(x,y) \sim P}[\ell(f(x), y)]$. We denote the empirical risk w.r.t. the training dataset $D$ as $\widehat{R}_D(f) = \frac{1}{n}\sum_{i=1}^{n}\ell(f(x_i), y_i)$. We denote the optimal risk as $R^* = \inf_{f \in \mathcal{F}} R(f)$, the minimizer of the optimal risk as $f^* = \arg\min_{f \in \mathcal{F}} R(f)$, and the minimizer of the empirical risk as $\hat{f} = \arg\min_{f \in \mathcal{F}} \widehat{R}_D(f)$. In addition, we define the loss function space as $\mathcal{L} = \{\ell(f(x), y) : f \in \mathcal{F}\}$, where $\mathcal{F}$ is the vector-valued function class.
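To make the notation above concrete, the following minimal Python sketch computes the empirical risk $\widehat{R}_D(f)$ and the thresholded multi-label prediction for a generic vector-valued scorer. The linear scorer, the logistic base loss, the threshold at zero, and all function names are illustrative assumptions, not part of the paper's setup.

```python
import numpy as np

def predict_labels(f_scores, threshold=0.0):
    """Derive the multi-label prediction h(x) in {-1,+1}^c by thresholding f(x)."""
    return np.where(f_scores >= threshold, 1, -1)

def logistic_base_loss(margin):
    """Illustrative base surrogate loss ell_b(t) = log(1 + exp(-t))."""
    return np.log1p(np.exp(-margin))

def empirical_risk(f, X, Y):
    """R_hat_D(f) = (1/n) * sum_i ell(f(x_i), y_i), here with the decomposable surrogate."""
    scores = f(X)                                      # shape (n, c)
    margins = Y * scores                               # entries y_{ij} * f_j(x_i)
    losses = logistic_base_loss(margins).sum(axis=1)   # per-example decomposable loss
    return losses.mean()

# toy usage: a random linear scorer f(x) = W x on synthetic data
rng = np.random.default_rng(0)
n, d, c = 100, 20, 5
X = rng.normal(size=(n, d))
W = rng.normal(size=(c, d))
Y = np.where(rng.normal(size=(n, c)) > 0, 1, -1)
f = lambda X: X @ W.T
print("empirical risk:", empirical_risk(f, X, Y))
print("predicted labels for x_1:", predict_labels(f(X)[0]))
```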
3.1 Label-Specific Representation Learning

Label-specific representation learning aims to construct a representation with specific discriminative properties for each class label to facilitate its discrimination process. The basic idea of LSRL is to decompose the multi-label learning problem into $c$ binary classification problems, i.e., decoupling the relationship among different components, where each binary classification problem corresponds to a possible label in the label space. We consider a prediction function for each label of the general form $f_j(x) = \langle w_j, \zeta_j(\phi_j(x))\rangle$, where the inner nonlinear mapping $\phi_j$ corresponds to the nonlinear transformation induced by the construction method of the label-specific representation, while the outer nonlinear mapping $\zeta_j$ refers to the nonlinear mapping corresponding to the classifier learned on the generated label-specific representation. We define a vector-valued function class of LSRL as follows:
$$\mathcal{F} = \big\{x \mapsto f(x) : f(x) = (f_1(x), \ldots, f_c(x)),\ f_j(x) = h_j(\phi_j(x)) = w_j^\top \zeta_j(\phi_j(x)),\ w = (w_1, \ldots, w_c) \in \mathbb{R}^{d \times c},\ \alpha(w) \le \Lambda,\ \beta(\zeta_j(\cdot)) \le A,\ \gamma(\phi_j(\cdot)) \le C,\ x \in \mathcal{X},\ j \in [c],\ \Lambda, A, C > 0\big\}, \tag{1}$$
where $\alpha$ represents a functional that constrains the weights, $\beta$ represents a functional that constrains the nonlinear mappings $\zeta_j$, and $\gamma$ represents a functional that constrains the nonlinear mappings $\phi_j$.

3.2 Related Evaluation Metrics

A number of evaluation metrics have been proposed to measure the generalization performance of different multi-label learning methods. Here we focus on commonly used evaluation metrics, i.e., Hamming Loss, Subset Loss, Ranking Loss and Decomposable Loss. However, the above-mentioned loss functions are typically 0/1 losses, which are hard to handle in optimization. Hence, one usually considers their surrogate losses, defined as follows:

Hamming Loss: $\ell_H(f(x), y) = \frac{1}{c}\sum_{j=1}^{c}\ell_b(y_j f_j(x))$, where the base convex surrogate loss $\ell_b$ can take various popular forms, such as the hinge loss, the logistic loss, the exponential loss and the squared loss.

Subset Loss: $\ell_S(f(x), y) = \max_{j \in [c]}\{\ell_b(y_j f_j(x))\}$.

Ranking Loss: $\ell_R(f(x), y) = \frac{1}{|Y^+||Y^-|}\sum_{p \in Y^+}\sum_{q \in Y^-}\ell_b(f_p(x) - f_q(x))$, where $Y^+$ ($Y^-$) denotes the relevant (irrelevant) label index set induced by $y$, and $|\cdot|$ denotes the cardinality of a set.

Decomposable Loss: $\ell_D(f(x), y) = \sum_{j=1}^{c}\ell_b(y_j f_j(x))$.

3.3 Related Complexity Measures

Here we introduce the complexity measures involved in the theoretical analysis. The Rademacher complexity is used to perform generalization analysis for LSRL.

Definition 1 (Rademacher complexity). Let $\mathcal{G}$ be a class of real-valued functions mapping from $\mathcal{X}$ to $\mathbb{R}$. Let $D = \{x_1, \ldots, x_n\}$ be a set with $n$ i.i.d. samples. The empirical Rademacher complexity of $\mathcal{G}$ is defined by $\widehat{\mathfrak{R}}_D(\mathcal{G}) = \mathbb{E}_\epsilon\big[\sup_{g \in \mathcal{G}}\frac{1}{n}\sum_{i=1}^{n}\epsilon_i g(x_i)\big]$, where $\epsilon_1, \ldots, \epsilon_n$ are i.i.d. Rademacher random variables, and we refer to the expectation $\mathfrak{R}(\mathcal{G}) = \mathbb{E}_D[\widehat{\mathfrak{R}}_D(\mathcal{G})]$ as the Rademacher complexity of $\mathcal{G}$. In addition, we define the worst-case Rademacher complexity as $\widetilde{\mathfrak{R}}_n(\mathcal{G}) = \sup_{D \in \mathcal{X}^n}\widehat{\mathfrak{R}}_D(\mathcal{G})$, and its expectation is denoted as $\widetilde{\mathfrak{R}}(\mathcal{G}) = \mathbb{E}_D[\widetilde{\mathfrak{R}}_n(\mathcal{G})]$.

In multi-label learning, $f \in \mathcal{F}$ is a vector-valued function, which makes traditional Rademacher complexity analysis methods invalid. Hence, we need to convert the Rademacher complexity of the loss function space associated with the vector-valued function class $\mathcal{F}$ into the Rademacher complexity of a tractable scalar-valued function class. The Rademacher complexity can be bounded by other scale-sensitive complexity measures, such as the covering number and fat-shattering dimension [Srebro et al., 2010, Zhang and Zhang, 2023]. The relevant definitions are provided in the appendix.
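As a concrete illustration of Subsection 3.2, the sketch below evaluates the four surrogate losses for a single example, using the hinge function as the base loss $\ell_b$. The choice of base loss and the function names are illustrative assumptions only.

```python
import numpy as np

def hinge(t):
    """Illustrative 1-Lipschitz base loss ell_b(t) = max(0, 1 - t)."""
    return np.maximum(0.0, 1.0 - t)

def hamming_loss(scores, y, base=hinge):
    """ell_H = (1/c) * sum_j ell_b(y_j f_j(x))."""
    return base(y * scores).mean()

def subset_loss(scores, y, base=hinge):
    """ell_S = max_j ell_b(y_j f_j(x))."""
    return base(y * scores).max()

def ranking_loss(scores, y, base=hinge):
    """ell_R = (1/(|Y+||Y-|)) * sum over relevant p and irrelevant q of ell_b(f_p - f_q)."""
    pos, neg = scores[y == 1], scores[y == -1]
    if len(pos) == 0 or len(neg) == 0:
        return 0.0
    return base(pos[:, None] - neg[None, :]).mean()

def decomposable_loss(scores, y, base=hinge):
    """ell_D = sum_j ell_b(y_j f_j(x))."""
    return base(y * scores).sum()

scores = np.array([0.8, -0.3, 1.5, -2.0])   # f(x) for c = 4 labels
y = np.array([1, -1, 1, -1])                # relevant / irrelevant labels
for loss in (hamming_loss, subset_loss, ranking_loss, decomposable_loss):
    print(loss.__name__, float(loss(scores, y)))
```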
4 General Bounds for LSRL

In this section, we first introduce the assumptions used. Then, we develop a novel vector-contraction inequality for the Rademacher complexity of the loss function space associated with the vector-valued function class $\mathcal{F}$. Finally, with the novel vector-contraction inequality, we derive the generalization bound for general function classes of LSRL with no dependency on the number of labels, up to logarithmic terms, which is tighter than the state of the art. The detailed proofs of the theoretical results in this paper are provided in the appendix.

Assumption 1. Assume that the input features, the loss function and the components of the vector-valued function are bounded: $\|x_i\|_2 \le R$, $\ell(\cdot, \cdot) \le M$, $|f_j(\cdot)| \le B$ for $i \in [n]$, $j \in [c]$, where $R > 0$, $M > 0$ and $B > 0$ are constants.

Assumption 2. Assume that the loss function $\ell$ is $\rho$-Lipschitz continuous w.r.t. the $\ell_\infty$ norm, that is, $|\ell(f(x), \cdot) - \ell(f'(x), \cdot)| \le \rho\,\|f(x) - f'(x)\|_\infty$, where $\rho > 0$ and $\|t\|_\infty = \max_{j \in [c]}|t_j|$ for $t = (t_1, \ldots, t_c)$.

Assumptions 1 and 2 are mild. For Assumption 1, normalization of input features is a common data preprocessing operation. When we consider the function class (1), we often use the assumptions $\|w_j\|_2 \le \Lambda$ and $\|\zeta_j(\cdot)\|_2 \le A$ for any $j \in [c]$ to replace the boundedness of the components of the vector-valued function, i.e., $B := \Lambda A$. For Assumption 2, Lipschitz continuity w.r.t. the $\ell_\infty$ norm has been considered in the literature [Foster and Rakhlin, 2019, Lei et al., 2019, Wu et al., 2021b]. The following Proposition 1 further shows that the commonly used loss functions in multi-label learning actually satisfy Assumption 2.

Proposition 1. Assume that the base loss $\ell_b$ defined in Subsection 3.2 is $\mu$-Lipschitz continuous. Then the surrogate Hamming Loss is $\mu$-Lipschitz w.r.t. the $\ell_\infty$ norm, the surrogate Subset Loss is $\mu$-Lipschitz w.r.t. the $\ell_\infty$ norm, the surrogate Ranking Loss is $2\mu$-Lipschitz w.r.t. the $\ell_\infty$ norm, and the surrogate Decomposable Loss is $c\mu$-Lipschitz w.r.t. the $\ell_\infty$ norm.

We define a function class $P$ consisting of projection operators $p_j : \mathbb{R}^c \mapsto \mathbb{R}$ for any $j \in [c]$ which project a $c$-dimensional vector onto its $j$-th coordinate. Then, we have $P(\mathcal{F}) = \{(j, x) \mapsto p_j(f(x)) : p_j(f(x)) = f_j(x),\ f \in \mathcal{F},\ (j, x) \in [c] \times \mathcal{X}\}$. The projection function class decouples the relationship among different components. With the assumption of $\ell_\infty$ norm Lipschitz loss and the above definitions, we show that the Rademacher complexity of the loss function space associated with $\mathcal{F}$ can be bounded by the worst-case Rademacher complexity of the projection function class $P(\mathcal{F})$. We develop the following novel vector-contraction inequality:

Lemma 1. Let $\mathcal{F}$ be a vector-valued function class of LSRL defined by (1). Let Assumptions 1 and 2 hold. Given a dataset $D$ of size $n$. Then, we have
$$\widehat{\mathfrak{R}}_D(\mathcal{L}) \le 12\sqrt{2}\,\rho\,\sqrt{c}\;\widetilde{\mathfrak{R}}_{nc}(P(\mathcal{F}))\Big(1 + \log^{\frac{1}{2}}\big(8e^2n^3c^3\big)\log\big(M\sqrt{n}\big)\Big),$$
where $\widehat{\mathfrak{R}}_D(\mathcal{L}) = \mathbb{E}_\epsilon\big[\sup_{\ell \in \mathcal{L}, f \in \mathcal{F}}\frac{1}{n}\sum_{i=1}^{n}\epsilon_i\ell(f(x_i), y_i)\big]$ is the empirical Rademacher complexity of the loss function space associated with $\mathcal{F}$, and $\widetilde{\mathfrak{R}}_{nc}(P(\mathcal{F}))$ is the worst-case Rademacher complexity of the projection function class.

Proof Sketch. First, the Rademacher complexity of the loss function space associated with $\mathcal{F}$ can be bounded by the empirical $\ell_\infty$ norm covering number with the refined Dudley entropy integral inequality. Second, according to the Lipschitz continuity w.r.t. the $\ell_\infty$ norm, the empirical $\ell_\infty$ norm covering number of $\mathcal{F}$ can be bounded by that of $P(\mathcal{F})$.
Third, the empirical $\ell_\infty$ norm covering number of $P(\mathcal{F})$ can be bounded by the fat-shattering dimension, and the fat-shattering dimension can be bounded by the worst-case Rademacher complexity of $P(\mathcal{F})$. Hence, the problem is transferred to the estimation of the worst-case Rademacher complexity. Finally, we estimate the lower bound of the worst-case Rademacher complexity of $P(\mathcal{F})$; combined with the above steps, the Rademacher complexity of the loss function space associated with $\mathcal{F}$ can be bounded.

With the vector-contraction inequality above, we can derive the following tight bound for LSRL:

Theorem 1. Let $\mathcal{F}$ be a vector-valued function class of LSRL defined by (1). Let Assumptions 1 and 2 hold. Given a dataset $D$ of size $n$. Then, for any $0 < \delta < 1$, with probability at least $1 - \delta$, the following holds for any $f \in \mathcal{F}$:
$$R(f) \le \widehat{R}_D(f) + \frac{24\sqrt{2}\,\rho\Lambda A\Big(1 + \log^{\frac{1}{2}}\big(8e^2n^3c^3\big)\log\big(M\sqrt{n}\big)\Big)}{\sqrt{n}} + 3M\sqrt{\frac{\log\frac{2}{\delta}}{2n}}.$$

Proof Sketch. We first upper bound the worst-case Rademacher complexity $\widetilde{\mathfrak{R}}_{nc}(P(\mathcal{F}))$, and then, combined with Lemma 1, the desired bound can be derived.

Remark 1. Although Lemma 1 contains a factor of $\sqrt{c}$, the term $\widetilde{\mathfrak{R}}_{nc}(P(\mathcal{F})) \le \Lambda A/\sqrt{nc}$, which makes the Rademacher complexity of the loss function space associated with $\mathcal{F}$ (i.e., $\widehat{\mathfrak{R}}_D(\mathcal{L})$) actually independent of $c$, up to logarithmic terms, and results in a tighter bound than the existing $O(c/\sqrt{n})$ and $O(\sqrt{c/n})$ bounds, with a faster convergence rate $\widetilde{O}(1/\sqrt{n})$. The bound in Theorem 1 with no dependency on $c$ can provide a general theoretical guarantee for LSRL, even for extreme multi-label learning where the number of labels far exceeds the number of examples [Yu et al., 2014, Prabhu and Varma, 2014, Yen et al., 2016, Liu and Shen, 2019], since it is easy to see that $\log c$ is much smaller than $n$.

The main challenge of the generalization analysis for LSRL is that existing theoretical results and existing generalization analysis methods for multi-label learning are not applicable to LSRL. Specifically, existing theoretical bounds often rely on the typical vector-contraction inequality [Maurer, 2016] for $\ell_2$ Lipschitz losses:
$$\mathbb{E}_\epsilon\Big[\sup_{f \in \mathcal{F}}\frac{1}{n}\sum_{i=1}^{n}\epsilon_i\ell(f(x_i), y_i)\Big] \le \sqrt{2}\,\mu\,\mathbb{E}_\epsilon\Big[\sup_{f \in \mathcal{F}}\frac{1}{n}\sum_{i=1}^{n}\sum_{j=1}^{c}\epsilon_{ij}f_j(x_i)\Big],$$
which leads to bounds with a linear dependency on $c$ for multi-label learning. Lei et al. [2015], Wu and Zhu [2020], Wu et al. [2021a] improve the dependency of the bounds on $c$ to square-root, which preserves the coupling among different components reflected by the constraint $\|w\| \le \Lambda$. Lei et al. [2019] also improves the bounds of multi-class classification to be independent of $c$ (up to logarithmic terms) for $\ell_\infty$ Lipschitz losses by preserving the coupling among different components, and Wu et al. [2021b] further generalizes these results to multi-label learning. However, LSRL decomposes the multi-label learning problem into $c$ binary classification problems, which means that the relationship among different components needs to be decoupled in the generalization analysis. Hence, how to develop novel vector-contraction inequalities that can induce $\widetilde{O}(1/\sqrt{n})$ bounds and how to deal with decoupling the relationship among different components are the two most critical difficulties in deriving tighter bounds for LSRL. The introduction of the projection function class plays an important role in solving these two difficulties. It improves the vector-contraction inequalities by a factor of $\sqrt{c}$ and decouples the relationship among different components (which is also reflected by the constraint $\|w_j\| \le \Lambda$ for any $j \in [c]$).
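The complexity measure appearing in Lemma 1 is the (worst-case) Rademacher complexity of the projection class $P(\mathcal{F})$. The short Monte Carlo sketch below estimates its empirical counterpart on a given sample for a toy linear LSRL class with $\|w_j\|_2 \le \Lambda$ and identity mappings $\zeta_j, \phi_j$. The sampling-based estimator, the normalization over the $nc$ index pairs, and the closed-form inner supremum are illustrative assumptions, not part of the paper's analysis.

```python
import numpy as np

def empirical_rademacher_projection(X, c, Lam, n_trials=2000, seed=0):
    """Monte Carlo estimate of the empirical Rademacher complexity of P(F)
    over the nc index pairs (j, x_i), for the linear class f_j(x) = <w_j, x>
    with ||w_j||_2 <= Lam (zeta_j and phi_j taken as identities).

    For fixed Rademacher signs eps_{ij}, the supremum over each w_j has the
    closed form Lam * || sum_i eps_{ij} x_i ||_2 (Cauchy-Schwarz), so the
    expectation over eps is estimated by averaging this expression."""
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    vals = []
    for _ in range(n_trials):
        eps = rng.choice([-1.0, 1.0], size=(n, c))               # signs eps_{ij}
        sup_per_label = Lam * np.linalg.norm(eps.T @ X, axis=1)  # sup over each w_j
        vals.append(sup_per_label.sum() / (n * c))
    return float(np.mean(vals))

# toy usage on synthetic inputs normalized so that ||x_i||_2 <= R = 1
rng = np.random.default_rng(1)
X = rng.normal(size=(200, 10))
X /= np.maximum(np.linalg.norm(X, axis=1, keepdims=True), 1.0)
print(empirical_rademacher_projection(X, c=8, Lam=1.0))
```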
Our tighter $\widetilde{O}(1/\sqrt{n})$ bound in Theorem 1 with no dependency on $c$ (up to logarithmic terms) overcomes the limitations of existing theoretical results for multi-label learning and can provide a general theoretical guarantee for LSRL. The differences in the generalization bounds of different LSRL methods are mainly reflected in two aspects. On the one hand, the Lipschitz constant of the loss function: as we proved in Proposition 1, the Lipschitz constants $\rho$ corresponding to different loss functions vary. On the other hand, the nonlinear mappings induced by different LSRL methods. In fact, when we analyze the generalization of LSRL, we will further have $\|\zeta(\cdot)\|_2 \le A := \kappa R$ ($\kappa$ is the Lipschitz constant of the nonlinear mapping $\zeta(\cdot)$) to take into account the differences and characteristics of different LSRL methods. We provide a detailed analysis of $A$ for typical LSRL methods in the next section.

5 Generalization Bounds for Typical LSRL Methods

In this section, we analyze the generalization bounds for several typical LSRL methods, i.e., the k-means clustering-based [Zhang and Wu, 2015], Lasso-based [Huang et al., 2016] and DNN-based [Hang and Zhang, 2022] LSRL methods. We show that different construction methods of label-specific representations lead to significant differences in the constant $A$ of the generalization bound in Theorem 1. For each LSRL method, we first give a brief introduction, then give its formal definition corresponding to the class of LSRL defined in (1), and finally derive the generalization bound.

5.1 Generalization Bounds for k-Means Clustering-Based LSRL Method

As a seminal work, the k-means clustering-based LSRL method, i.e., LIFT [Zhang and Wu, 2015], uses k-means clustering to construct label-specific representations that effectively capture the specific characteristics of each label. Specifically, first, for each label, k-means clustering is used to divide the training instances into $K$ clusters, and the centers of the $K$ clusters are obtained, denoted as $c_k^j$ for the $j$-th label, $k \in [K]$, $j \in [c]$. Then, these $K$ centers are used to construct the label-specific representation, i.e., in the vector-valued function class of LSRL defined by (1), $\phi_j(x) = \big[d(x, c_1^j), \ldots, d(x, c_K^j)\big]$ with $d(x, c_k^j) = \|x - c_k^j\|_2$. Finally, a family of $c$ classifiers $f_j$ with $\kappa$-Lipschitz nonlinear mappings is induced based on the generated label-specific representations.

Next, we formally define the process of k-means clustering. Here we follow some of the settings and definitions in [Li and Liu, 2021]. Assume that $V : \mathcal{X}^2 \to \mathbb{R}_+$ is a pairwise distance-based function used to measure the dissimilarity between pairs of observations, and $Z = [Z_1, \ldots, Z_K]$ is a collection of $K$ partition functions $Z_k : \mathcal{X}^2 \to \mathbb{R}_+$ for $k \in [K]$. The clustering framework can be cast as the problem of minimizing the criterion
$$\widehat{R}_D(V, Z) = \frac{1}{n(n-1)}\sum_{i \ne j}\sum_{k=1}^{K}V(x_i, x_j)Z_k(x_i, x_j)$$
over all possible functions $V$ and $Z_k$ for $k \in [K]$. In k-means clustering, we have $V(x_i, x_j) = \|x_i - x_j\|_2^2$ and $Z_k(x_i, x_j) = \mathbb{I}\big[(x_i, x_j) \in C_k^2\big]$ with $c_k = \frac{1}{|C_k|}\sum_{x \in C_k}x$, where $C_1, \ldots, C_K$ are the partitions of the feature space $\mathcal{X}$. Let $g_k(x_i, x_j) = V(x_i, x_j)Z_k(x_i, x_j)$, let $g = (g_1, \ldots, g_K)$ be a vector-valued function, and let $\ell_{clu}(g(x, x')) = \sum_{k=1}^{K}g_k(x, x')$; then $\widehat{R}_D(V, Z)$ can be written as
$$\widehat{R}_D(V, Z) = \frac{1}{n(n-1)}\sum_{i \ne j}\ell_{clu}(g(x_i, x_j)) = \frac{1}{n(n-1)}\sum_{i \ne j}\sum_{k=1}^{K}g_k(x_i, x_j).$$
We then define a vector-valued function class of k-means clustering as follows:
$$\mathcal{G} = \big\{(x, x') \mapsto g(x, x') : g(x, x') = (g_1(x, x'), \ldots, g_K(x, x')),\ g_k(x, x') = \|x - x'\|_2^2\,\mathbb{I}\big[(x, x') \in C_k^2\big],\ x, x' \in \mathcal{X},\ k \in [K]\big\},$$
where $|g_k(\cdot, \cdot)| \le G$ for $k \in [K]$ and $G > 0$ is a constant. We denote the function class of k-means clustering corresponding to the $j$-th label as $\mathcal{G}_j$. With the above definitions, we can derive the tight bound for the k-means clustering-based LSRL method:

Theorem 2. Let $\mathcal{F}$ be a vector-valued function class of k-means clustering-based LSRL defined by (1). Let Assumptions 1 and 2 hold. Given a dataset $D$ of size $n$. Then, for any $0 < \delta < 1$, with probability at least $1 - \delta$, the following holds for any $f \in \mathcal{F}$:
R(f) b RD(f) + 3M KR 1 + log 1 2 (8e2n3c3) log M n 1 + log 1 2 (e2n3c3) log M nc 1 + log 1 2 (8e2n3c3) log M n

Remark 2. There are three key points in the generalization analysis of the k-means clustering-based LSRL method. 1) The k-means clustering-based LSRL method is two-stage: the centers of the clusters are generated by k-means clustering in the first stage, and then these centers are exploited to generate the label-specific representations on which a multi-label classifier is learned in the second stage. Hence we cannot formally express these two stages in a closed-form expression through a composite function (the $K$ centers are generated by an $\arg\min$). Furthermore, the $K$ centers generated in the first stage are actually used as fixed parameters rather than inputs in the second stage. Hence, in order to fully account for the capacity of the model corresponding to the first stage, it is reasonable to define the whole function class as the sum $\mathcal{F} + \ell_{clu} \circ \mathcal{G}$ of the function classes corresponding to the methods of these two stages. Then, combined with Lemma 1, the generalization analysis is transformed into bounding the complexity of the projection function class $P(\mathcal{F} + \ell_{clu} \circ \mathcal{G})$. The introduction of the class $\mathcal{G}$ induces an additional increase in complexity, i.e., the last term in Theorem 2. 2) The generalization analysis of the k-means clustering-based LSRL method involves the generalization analysis of k-means clustering. However, since the k-means clustering framework involves pairwise functions, a sequence of pairs built from i.i.d. individual observations is no longer independent, which makes the standard techniques of the i.i.d. case for the traditional Rademacher complexity inapplicable to k-means clustering. We convert the non-sum-of-i.i.d. pairwise function to a sum-of-i.i.d. form by using permutations in the U-process [Clémençon et al., 2008]. We show that the empirical Rademacher complexity of the loss function space associated with the vector-valued function class $\mathcal{G}$ can be bounded by $\widehat{\mathfrak{R}}_D(\ell_{clu} \circ \mathcal{G}) := \mathbb{E}_\epsilon\big[\sup_{g \in \mathcal{G}}\frac{1}{\lfloor n/2\rfloor}\sum_{i=1}^{\lfloor n/2\rfloor}\epsilon_i\,\ell_{clu}(g(x_i, x_{i+\lfloor n/2\rfloor}))\big]$. 3) In order to derive tight bounds for k-means clustering, we develop a novel vector-contraction inequality that can induce bounds with a square-root dependency on the number of clusters. The theoretical techniques and results involved here may be of independent interest. The generalization bound of the k-means clustering-based LSRL method is tighter than the state of the art, with a faster convergence rate $\widetilde{O}(\sqrt{K/n})$, and is independent of the number of labels. Since the lower bound for clustering is $\Omega(\sqrt{K/n})$ [Bartlett et al., 1998], our bound is (nearly) optimal, up to logarithmic terms, even from the perspective of clustering. The constant $A$ of the generalization bound in Theorem 1 corresponds to $2\kappa\sqrt{K}R$ here.
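To connect Theorem 2 to the construction it analyzes, the sketch below builds LIFT-style label-specific representations: for each label, k-means centers are computed and $\phi_j(x)$ is the vector of distances to those centers. Using scikit-learn's KMeans and clustering only each label's positive instances are simplifying assumptions for illustration (LIFT itself forms clusters from the positive and negative instances of each label separately).

```python
import numpy as np
from sklearn.cluster import KMeans

def lift_representations(X, Y, K, seed=0):
    """For each label j, cluster its positive instances into K groups and map every
    instance x to phi_j(x) = [ ||x - c^j_1||_2, ..., ||x - c^j_K||_2 ],
    the distances to that label's cluster centers (cf. Section 5.1)."""
    c = Y.shape[1]
    reps = []
    for j in range(c):
        pos = X[Y[:, j] == 1]
        data = pos if len(pos) >= K else X          # fall back if too few positives
        km = KMeans(n_clusters=K, n_init=10, random_state=seed).fit(data)
        centers = km.cluster_centers_               # c^j_1, ..., c^j_K
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        reps.append(dists)                          # shape (n, K): row i is phi_j(x_i)
    return reps

# toy usage: 3 labels, each with its own (n, K) representation
rng = np.random.default_rng(0)
X = rng.normal(size=(60, 8))
Y = np.where(rng.normal(size=(60, 3)) > 0, 1, -1)
reps = lift_representations(X, Y, K=4)
print(len(reps), reps[0].shape)
```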
5.2 Generalization Bounds for Lasso-Based LSRL Method

The Lasso-based LSRL method, i.e., LLSF [Huang et al., 2016], assumes that the label-specific representation of each label should have sparsity and sharing properties. For sparsity, LLSF uses a Lasso as the model corresponding to each label. The sharing property is achieved by considering that two strongly correlated labels will share more features with each other than two uncorrelated or weakly correlated labels, so that the corresponding weights will be similar, i.e., their inner product will be large. Formally, since each label corresponds to a Lasso, this means that in the class of LSRL defined by (1), the base loss $\ell_b$ is the squared loss, the nonlinear mappings $\zeta_j(\cdot)$ and $\phi_j(\cdot)$ are both identity transformations for any $j \in [c]$, and the constraint $\alpha(w)$ is $\|w_j\|_1 \le \Lambda$ for any $j \in [c]$. For the $j$-th label, the sharing property is reflected by the additionally introduced constraint $\sum_{i=1}^{c}(1 - s_{ji})\,w_j^\top w_i \le \tau$, where $s_{ji}$ is the cosine similarity between labels $y_j$ and $y_i$; we refer to it as the sharing constraint. The loss function used by the Lasso-based LSRL method is the Decomposable loss.

Since the squared loss is not Lipschitz continuous, the theoretical results on the Lipschitz continuity of the loss functions in Proposition 1 cannot be applied to the Lasso-based LSRL method. To overcome this challenge, we introduce the notion of a pseudo-Lipschitz function, which is also used in the theoretical analysis of approximate message passing algorithms [Bayati and Montanari, 2011].

Definition 2. For $k \ge 1$, we say that a function $f : \mathbb{R} \to \mathbb{R}$ is pseudo-Lipschitz of order $k$ if there exists a constant $L > 0$ such that the following inequality holds for all $x, y \in \mathbb{R}$: $|f(x) - f(y)| \le L\big(1 + |x|^{k-1} + |y|^{k-1}\big)|x - y|$.

Note that any pseudo-Lipschitz function of order 1 is Lipschitz continuous. The following proposition shows that the Decomposable loss is still Lipschitz continuous if the base loss is the squared loss.

Proposition 2. The squared loss is 1-pseudo-Lipschitz of order 2, and the surrogate Decomposable Loss is $(3 + 2B)c$-Lipschitz w.r.t. the $\ell_\infty$ norm if the base loss $\ell_b$ is the squared loss.

With the above definitions, we can derive the generalization bound for the Lasso-based LSRL method:

Theorem 3. Let $\mathcal{F}$ be a vector-valued function class of Lasso-based LSRL defined by (1). Let Assumptions 1 and 2 hold. Given a dataset $D$ of size $n$. Then, for any $0 < \delta < 1$, with probability at least $1 - \delta$, the following holds for any $f \in \mathcal{F}$:
$$R(f) \le \widehat{R}_D(f) + \frac{24\sqrt{2}\,(3 + 2B)c\Lambda R\Big(1 + \log^{\frac{1}{2}}\big(8e^2n^3c^3\big)\log\big(M\sqrt{n}\big)\Big)}{\sqrt{n}} + 3M\sqrt{\frac{\log\frac{2}{\delta}}{2n}}.$$

Remark 3. The complexity of the LLSF function class can be bounded by the complexity of the LSRL function class in which each label corresponds to a Lasso, since the introduction of the sharing constraint in the LLSF function class reduces the complexity of the function class compared with the LSRL function class in which each label corresponds to a Lasso. Hence, the complexity analysis of the LLSF function class can be converted into upper bounding the Rademacher complexity of the LSRL function class in which each label corresponds to a Lasso. The constant $A$ of the generalization bound in Theorem 1 corresponds to $R$ here, and the value of $\rho$ is $(3 + 2B)c$, which induces the $\widetilde{O}(c/\sqrt{n})$ bound here. If other loss functions are used, e.g., Hamming, Subset or Ranking loss, instead of the Decomposable loss, the dependency of the bounds for the Lasso-based LSRL method on $c$ can be improved from linear to independent, up to logarithmic terms.
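The bound in Theorem 3 is stated for the per-label Lasso view of LLSF. The sketch below fits one $\ell_1$-regularized least-squares model per label and inspects which features each label retains; fitting the $c$ problems independently (i.e., dropping the sharing constraint that couples the weight vectors) and the particular regularization strength are deliberate simplifications for illustration, not the LLSF algorithm itself.

```python
import numpy as np
from sklearn.linear_model import Lasso

def per_label_lasso(X, Y, alpha=0.1):
    """Fit one Lasso per label on targets in {-1,+1}; the squared base loss summed
    over labels corresponds to the Decomposable surrogate analyzed in Theorem 3."""
    c = Y.shape[1]
    W = np.zeros((c, X.shape[1]))
    for j in range(c):
        W[j] = Lasso(alpha=alpha, max_iter=10000).fit(X, Y[:, j]).coef_
    return W

# toy usage: sparse ground-truth weights, thresholded linear labels
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 30))
W_true = np.where(rng.random((4, 30)) < 0.2, rng.normal(size=(4, 30)), 0.0)
Y = np.where(X @ W_true.T + 0.1 * rng.normal(size=(200, 4)) > 0, 1, -1)
W = per_label_lasso(X, Y)
for j, w in enumerate(W):
    # the label-specific features are the nonzero coordinates of w_j
    print(f"label {j}: {np.count_nonzero(w)} selected features")
```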
5.3 Generalization Bounds for DNN-Based LSRL Method

The DNN-based LSRL method, i.e., CLIF [Hang and Zhang, 2022], exploits the powerful representation learning capability of deep neural networks (DNNs) to learn label-specific representations in an end-to-end manner. Since the construction of label-specific representations involves graph convolutional networks (GCNs), we first introduce the relevant definitions for GCNs. Let $G = \{V, E\}$ be a given undirected graph, where $V = \{x_1, x_2, \ldots, x_n\}$ is the set of nodes with size $|V| = n$ and $E$ is the set of edges. Let $A$ and $D$ be the adjacency matrix and the diagonal degree matrix, respectively, where $D_{ii} = \sum_{j=1}^{n}A_{ij}$. Let $\hat{A} = (D + I_n)^{-\frac{1}{2}}(A + I_n)(D + I_n)^{-\frac{1}{2}}$ denote the normalized adjacency matrix with self-connections, where $I_n$ is the identity matrix. The feature propagation process of a two-layer GCN is $\sigma(\hat{A}\sigma(\hat{A}XW_1)W_2)$, where $W_1$ and $W_2$ are parameter matrices, $X$ is the node feature matrix, and the $i$-th row $X_i$ is the node feature $x_i$.

Specifically, for the DNN-based LSRL method, first, a graph (here called the label graph) is constructed over the label space and a GCN is used to generate the label embeddings, i.e., the label embeddings can be denoted by $\psi(Y) = \sigma_{\mathrm{ReLU}}(\hat{A}\,\sigma_{\mathrm{ReLU}}(\hat{A}YW_1)W_2)$, where $Y$ is the node feature matrix of the label graph with size $c$, the nodes are also bounded by $R$, and $\sigma_{\mathrm{ReLU}}$ is the ReLU activation. Second, the label embedding of the $j$-th label is decoded into an importance vector by a one-layer fully-connected neural network, i.e., $\sigma_{\mathrm{sig}}(W_3\psi(Y)_j)$, where $\sigma_{\mathrm{sig}}$ is the sigmoid activation, and the input feature is mapped into a latent representation through a one-layer fully-connected neural network, i.e., $\sigma_{\mathrm{ReLU}}(W_4x)$. Third, for the $j$-th label, the Hadamard product of the importance vector and the latent representation is defined as the pertinent representation, and then the label-specific representation for the $j$-th label is obtained by feeding the pertinent representation into another one-layer fully-connected neural network. Hence, the label-specific representation for the $j$-th label is $\phi_j(x) = \sigma_{\mathrm{ReLU}}\{W_5[\sigma_{\mathrm{ReLU}}(W_4x) \odot \sigma_{\mathrm{sig}}(W_3\psi(Y)_j)]\}$. Finally, the $j$-th model is implemented by a fully-connected layer, i.e., $f_j(x) = \sigma_{\mathrm{sig}}(w_j^\top\phi_j(x))$.

In the generalization analysis of the above deep neural network, we introduce the following assumption, which is common in the generalization analysis for DNNs [Bartlett et al., 2017, Golowich et al., 2018, Zhang and Zhang, 2023, Tang and Liu, 2023].

Assumption 3. Assume that the parameter matrices in the DNN-based LSRL method are bounded, i.e., $\|W_i\| \le D$ for $i \in [5]$, where $D > 0$ is a constant.

With the above definitions, we can derive the generalization bound for the DNN-based LSRL method:

Theorem 4. Let $\mathcal{F}$ be a vector-valued function class of DNN-based LSRL defined by (1). Let Assumptions 1, 2, and 3 hold. Given a dataset $D$ of size $n$. Then, for any $0 < \delta < 1$, with probability at least $1 - \delta$, the following holds for any $f \in \mathcal{F}$:
$$R(f) \le \widehat{R}_D(f) + \frac{6\sqrt{2}\,\rho\Lambda D^5R^2(g_{\max} + 1)\Big(1 + \log^{\frac{1}{2}}\big(8e^2n^3c^3\big)\log\big(M\sqrt{n}\big)\Big)}{\sqrt{n}} + 3M\sqrt{\frac{\log\frac{2}{\delta}}{2n}},$$
where $g_{\max}$ is the maximum node degree of the label graph.

Remark 4. The term $\widetilde{\mathfrak{R}}_{nc}(P(\mathcal{F})) \le \Lambda D^5R^2(g_{\max} + 1)/(4\sqrt{nc})$, which makes the Rademacher complexity $\widehat{\mathfrak{R}}_D(\mathcal{L})$ actually independent of $c$, up to logarithmic terms, and results in a tight bound for the DNN-based LSRL method with a faster convergence rate $\widetilde{O}(1/\sqrt{n})$. The constant $A$ of the generalization bound in Theorem 1 corresponds to $D^5R^2(g_{\max} + 1)/4$ here.
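To make the architecture behind Theorem 4 concrete, the following numpy sketch runs the forward pass described above with randomly initialized weights: a two-layer GCN over a toy label graph produces label embeddings $\psi(Y)$, which gate a shared latent representation to yield $\phi_j(x)$ and the per-label score $f_j(x)$. The dimensions, the chain-shaped label graph, and the random weights are illustrative assumptions, not CLIF's trained model.

```python
import numpy as np

relu = lambda z: np.maximum(z, 0.0)
sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

def normalized_adjacency(A):
    """A_hat = (D + I)^(-1/2) (A + I) (D + I)^(-1/2), as in Section 5.3."""
    n = A.shape[0]
    d_inv_sqrt = np.diag(1.0 / np.sqrt(A.sum(axis=1) + 1.0))
    return d_inv_sqrt @ (A + np.eye(n)) @ d_inv_sqrt

def clif_forward(x, Yfeat, A, W1, W2, W3, W4, W5, w):
    A_hat = normalized_adjacency(A)
    psi = relu(A_hat @ relu(A_hat @ Yfeat @ W1) @ W2)       # label embeddings psi(Y)
    latent = relu(W4 @ x)                                    # shared latent representation
    scores = []
    for j in range(Yfeat.shape[0]):
        importance = sigmoid(W3 @ psi[j])                    # importance vector for label j
        phi_j = relu(W5 @ (latent * importance))             # label-specific representation
        scores.append(sigmoid(w[j] @ phi_j))                 # f_j(x)
    return np.array(scores)

# toy usage: d = 16 input features, c = 5 labels, chain-shaped label graph
rng = np.random.default_rng(0)
d, c, e, m, p = 16, 5, 8, 12, 10
A = np.diag(np.ones(c - 1), 1) + np.diag(np.ones(c - 1), -1)
Yfeat = rng.normal(size=(c, e))                              # label node features
W1, W2 = rng.normal(size=(e, e)), rng.normal(size=(e, e))
W3, W4, W5 = rng.normal(size=(m, e)), rng.normal(size=(m, d)), rng.normal(size=(p, m))
w = rng.normal(size=(c, p))
print(clif_forward(rng.normal(size=d), Yfeat, A, W1, W2, W3, W4, W5, w))
```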
For deep GCNs, the increase in depth means that the generalization performance will deteriorate, which is consistent with empirical performance and guides us not to design too many layers of GCN. In addition, the bound is linearly dependent on the maximum node degree of the label graph, which suggests that when the performance of the model is always unsatisfactory, we can check whether the maximum node degree is large and consider using some techniques to remove some edges, e.g., Drop Edge [Rong et al., 2020], to alleviate the over-fitting problem. We will further explore more network structures to learn more effective label-specific representations, e.g., hypernetworks [Galanti and Wolf, 2020, Chen et al., 2023, Shen et al., 2023], deep kernel networks [Zhang and Liao, 2020, Zhang and Zhang, 2023], and provide generalization analysis for the corresponding DNN-based LSRL methods. 6 Discussion Compared with existing methods considering label correlations, which mainly focus on the processing of the label space by exploiting or modeling relationships between labels, LSRL mainly focuses on the operation on the input space and implicitly considers label correlations in the process of constructing label-specific representations. For example, in the construction of label-specific representations in LLSF and CLIF, the label correlation information is embedded into the label-specific representations in the input space by introducing the sharing constraint and using a GCN over the label graph to generate label embeddings, respectively. LSRL methods are more effective since the label correlation information is considered in the construction of label-specific representations. Our theoretical results explain why LSRL is an effective strategy to improve the generalization performance of multi-label learning. On the one hand, existing results can improve the dependency of the bound on c from linear to square-root by preserving the coupling among different components, which corresponds to high-order label correlations induced by norm regularizers. However, the improvement in the preservation of coupling by a factor of c benefits from replacing cΛ with Λ in the constraint to some extent, and preserving the coupling corresponds to the stricter assumption [Zhang and Zhang, 2024]. Our results for LSRL decouple the relationship among different components, and the bounds with a weaker dependency on c are tighter than the existing results that preserve the coupling, which also explains why LSRL methods outperform the multi-label methods that consider high-order label correlations induced by norm regularizers. On the other hand, based on our results, we can find that LSRL methods substantially increase the data processing, i.e., the process of constructing label-specific representations. From the perspective of model capacity, compared with traditional multi-label methods, since the introduction of construction methods of label-specific representations, the capacity of the model is significantly increased, especially if deep learning methods are used to generate label-specific representations, which improves the representation ability of the model. Or more intuitively, LSRL means an increase in model capacity and stronger representation ability, which makes it easier to find the hypotheses with better generalization in the function class. 
The vector-contraction inequality and the theoretical tools developed here are applicable to the theoretical analysis of other problem settings, such as multi-class classification, or more general vectorvalued learning problem. For multi-class classification, multi-class margin-based loss, multinomial logistic loss, Top-k hinge loss, etc. are all ℓ Lipschitz [Lei et al., 2019]. For multi-label learning, the surrogate loss for Macro-Averaged AUC is also ℓ Lipschitz [Zhang and Zhang, 2024]. 7 Conclusion In this paper, we propose a novel vector-contraction inequality for ℓ norm Lipschitz continuous loss, and derive bounds for general function classes of LSRL with a weaker dependency on c than the state of the art. In addition, we analyze the bounds for several typical LSRL methods, and study the impact of different label-specific representations on the generalization analysis. In future work, we will extend our bounds to more LSRL methods, and derive tighter bounds for LSRL with a faster convergence rate w.r.t. the number of examples, and further design efficient models and algorithms to construct label-specific representations with good generalization performance. Acknowledgements The authors wish to thank the anonymous reviewers for their helpful comments and suggestions. This work was supported by the National Science Foundation of China (62225602) and the Big Data Computing Center of Southeast University. Junwen Bai, Shufeng Kong, and Carla P. Gomes. Disentangled variational autoencoder based multilabel classification with covariance-aware multivariate probit model. In Christian Bessiere, editor, Proceedings of the 29th International Joint Conference on Artificial Intelligence, number IJCAI 2020, pages 4313 4321, 2020. Peter L. Bartlett and Shahar Mendelson. Rademacher and gaussian complexities: Risk bounds and structural results. Journal of Machine Learning Research, 3:463 482, 2002. Peter L. Bartlett, Tamás Linder, and Gábor Lugosi. The minimax distortion redundancy in empirical quantizer design. IEEE Transactions on Information Theory, 44(5):1802 1813, 1998. Peter L. Bartlett, Dylan J. Foster, and Matus Telgarsky. Spectrally-normalized margin bounds for neural networks. Advances in Neural Information Processing Systems, 30(NIPS 2017):6240 6249, 2017. Mohsen Bayati and Andrea Montanari. The dynamics of message passing on dense graphs, with applications to compressed sensing. IEEE Transactions on Information Theory, 57(2):764 785, 2011. Matthew R. Boutell, Jiebo Luo, Xipeng Shen, and Christopher M. Brown. Learning multi-label scene classification. Pattern Recognition, 37(9):1757 1771, 2004. Ricardo Silveira Cabral, Fernando De la Torre, João Paulo Costeira, and Alexandre Bernardino. Matrix completion for multi-label image classification. Advances in Neural Information Processing Systems, 24(NIPS 2011):190 198, 2011. Nicolò Cesa-Bianchi, Matteo Re, and Giorgio Valentini. Synergy of multi-label hierarchical ensembles, data fusion, and cost-sensitive methods for gene functional inference. Machine Learning, 88 (1-2):209 241, 2012. Di Chen, Yexiang Xue, Daniel Fink, Shuo Chen, and Carla P. Gomes. Deep multi-species embedding. In Proceedings of the 26th International Joint Conference on Artificial Intelligence, number IJCAI 2017, pages 3639 3646. ijcai.org, 2017. Sirui Chen, Yuan Wang, Zijing Wen, Zhiyu Li, Changshuo Zhang, Xiao Zhang, Quan Lin, Cheng Zhu, and Jun Xu. Controllable multi-objective re-ranking with policy hypernetworks. 
In Proceedings of the 29th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, number KDD 2023, pages 3855 3864, 2023. Tianshui Chen, Liang Lin, Riquan Chen, Xiaolu Hui, and Hefeng Wu. Knowledge-guided multi-label few-shot learning for general image recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence, 44(3):1371 1384, 2020. Stéphan Clémençon, Gábor Lugosi, and Nicolas Vayatis. Ranking and empirical minimization of u-statistics. The Annals of Statistics, 36(2):844 874, 2008. Kunal Dahiya, Ananye Agarwal, Deepak Saini, K Gururaj, Jian Jiao, Amit Singh, Sumeet Agarwal, Purushottam Kar, and Manik Varma. Siamesexml: Siamese networks meet extreme classifiers with 100m labels. In Proceedings of the 38th International Conference on Machine Learning, pages 2330 2340, 2021. Krzysztof Dembczynski, Willem Waegeman, Weiwei Cheng, and Eyke Hüllermeier. Regret analysis for performance metrics in multi-label classification: The case of hamming and subset zero-one loss. In Joint European Conference on Machine Learning and Knowledge Discovery in Databases, volume 6321, pages 280 295, 2010. Krzysztof Dembczynski, Wojciech Kotlowski, and Eyke Hüllermeier. Consistent multilabel ranking through univariate losses. ar Xiv:1206.6401, 2012. André Elisseeff and Jason Weston. A kernel method for multi-labelled classification. In Advances in Neural Information Processing Systems, number NIPS 2001, pages 681 687, 2001. Dylan J. Foster and Alexander Rakhlin. ℓ vector contraction for rademacher complexity. ar Xiv:1911.06468v1, 2019. Tomer Galanti and Lior Wolf. On the modularity of hypernetworks. In Advances in Neural Information Processing Systems, volume 33, pages 10409 10419, 2020. Noah Golowich, Alexander Rakhlin, and Ohad Shamir. Size-independent sample complexity of neural networks. International Conference on Computational Learning Theory, 75(COLT 2018): 297 299, 2018. Yumeng Guo, Fulai Chung, Guozheng Li, Jiancong Wang, and James C. Gee. Leveraging labelspecific discriminant mapping features for multi-label learning. ACM Transactions on Knowledge Discovery from Data, 13(2):24:1 24:23, 2019. Jun-Yi Hang and Min-Ling Zhang. Collaborative learning of label semantics and deep label-specific features for multi-label classification. IEEE Transactions on Pattern Analysis and Machine Intelligence, 44(12):9860 9871, 2022. Jun-Yi Hang, Min-Ling Zhang, Yanghe Feng, and Xiaocheng Song. End-to-end probabilistic labelspecific feature learning for multi-label classification. In Proceedings of the 36th AAAI Conference on Artificial Intelligence, number AAAI 2022, pages 6847 6855, 2022. Jun Huang, Guorong Li, Qingming Huang, and Xindong Wu. Learning label specific features for multi-label classification. In Charu C. Aggarwal, Zhi-Hua Zhou, Alexander Tuzhilin, Hui Xiong, and Xindong Wu, editors, Proceedings of the 15th IEEE International Conference on Data Mining, number ICDM 2015, pages 181 190, 2015. Jun Huang, Guorong Li, Qingming Huang, and Xindong Wu. Learning label-specific features and class-dependent labels for multi-label classification. IEEE Transactions on Knowledge and Data Engineering, 28(12):3309 3323, 2016. Jun Huang, Guorong Li, Qingming Huang, and Xindong Wu. Joint feature selection and classification for multilabel learning. IEEE Transactions on Cybernetics, 48(3):876 889, 2018. Shuiwang Ji, Lei Tang, Shipeng Yu, and Jieping Ye. A shared-subspace learning framework for multi-label classification. ACM Transactions on Knowledge Discovery from Data, 4(2):1 29, 2010. 
Antoine Ledent, Waleed Mustafa, Yunwen Lei, and Marius Kloft. Norm-based generalisation bounds for deep multi-class convolutional neural networks. In Proceedings of the 35th AAAI Conference on Artificial Intelligence, volume 35, pages 8279 8287, 2021. Yunwen Lei, Ürün Dogan, Alexander Binder, and Marius Kloft. Multi-class svms: From tighter data-dependent generalization bounds to novel algorithms. In Advances in Neural Information Processing Systems, volume 28, pages 2035 2043, 2015. Yunwen Lei, Ürün Dogan, Ding-Xuan Zhou, and Marius Kloft. Data-dependent generalization bounds for multi-class classification. IEEE Transactions on Information Theory, 65(5):2995 3021, 2019. Shaojie Li and Yong Liu. Sharper generalization bounds for clustering. In Proceedings of the 38th International Conference on Machine Learning, volume 139, pages 6392 6402, 2021. Chong Liu, Peng Zhao, Sheng-Jun Huang, Yuan Jiang, and Zhi-Hua Zhou. Dual set multi-label learning. In Proceedings of the 32nd AAAI Conference on Artificial Intelligence, number AAAI 2018, pages 3635 3642, 2018. Weiwei Liu and Xiaobo Shen. Sparse extreme multi-label learning with oracle property. In Proceedings of the 36th International Conference on Machine Learning, volume 97, pages 4032 4041, 2019. Weiwei Liu, Haobo Wang, Xiaobo Shen, and Ivor W. Tsang. The emerging trends of multi-label learning. IEEE Transactions on Pattern Analysis and Machine Intelligence, 44(11):7955 7974, 2022. Françoise Lust-Piquard and Gilles Pisier. Non commutative khintchine and paley inequalities. Arkiv för matematik, 29:241 260, 1991. Andreas Maurer. A vector-contraction inequality for rademacher complexities. In Proceedings of the 27th International Conference on Algorithmic Learning Theory, volume 9925, pages 3 17, 2016. Yashoteja Prabhu and Manik Varma. Fastxml: a fast, accurate and stable tree-classifier for extreme multi-label learning. In Proceedings of the 20th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, number KDD 2014, pages 263 272, 2014. Yu Rong, Wenbing Huang, Tingyang Xu, and Junzhou Huang. Dropedge: Towards deep graph convolutional networks on node classification. In Proceedings of the 8th International Conference on Learning Representations, number ICLR 2020, 2020. Timothy N. Rubin, America Chambers, Padhraic Smyth, and Mark Steyvers. Statistical topic models for multi-label document classification. Machine Learning, 88(1-2):157 208, 2012. Chenglei Shen, Xiao Zhang, Wei Wei, and Jun Xu. Hyperbandit: Contextual bandit with hypernewtork for time-varying user preferences in streaming recommendation. In Proceedings of the 32nd ACM International Conference on Information and Knowledge Management, pages 2239 2248, 2023. Nathan Srebro, Karthik Sridharan, and Ambuj Tewari. Smoothness, low noise and fast rates. In Advances in Neural Information Processing Systems, volume 23, pages 2199 2207, 2010. Yan-Ping Sun and Min-Ling Zhang. Compositional metric learning for multi-label classification. Frontiers of Computer Science, 15(5):155320, 2021. Huayi Tang and Yong Liu. Towards understanding the generalization of graph neural networks. ar Xiv:2305.08048v1, 2023. Wei Weng, Yaojin Lin, Shunxiang Wu, Yuwen Li, and Yun Kang. Multi-label learning based on label-specific features and local pairwise label correlation. Neurocomputing, 273:385 394, 2018. Wei Weng, Yan-Nan Chen, Chin-Ling Chen, Shunxiang Wu, and Jinghua Liu. Non-sparse label specific features selection for multi-label classification. Neurocomputing, 377:85 94, 2020. 
Guoqiang Wu and Jun Zhu. Multi-label classification: do hamming loss and subset accuracy really conflict with each other? Advances in Neural Information Processing Systems, 33(Neur IPS 2020), 2020. Guoqiang Wu, Chongxuan Li, Kun Xu, and Jun Zhu. Rethinking and reweighting the univariate losses for multi-label ranking: Consistency and generalization. Advances in Neural Information Processing Systems, 34(Neur IPS 2021):14332 14344, 2021a. Guoqiang Wu, Chongxuan Li, and Yilong Yin. Towards understanding generalization of macro-auc in multi-label learning. In Proceedings of the 40th International Conference on Machine Learning, volume 202, pages 37540 37570, 2023. Liang Wu, Antoine Ledent, Yunwen Lei, and Marius Kloft. Fine-grained generalization analysis of vector-valued learning. In Proceedings of the 35th AAAI Conference on Artificial Intelligence, number AAAI 2021, pages 10338 10346, 2021b. Chang Xu, Tongliang Liu, Dacheng Tao, and Chao Xu. Local rademacher complexity for multi-label learning. IEEE Transactions on Image Processing, 25(3):1495 1507, 2016. Miao Xu and Lan-Zhe Guo. Learning from group supervision: the impact of supervision deficiency on multi-label learning. Science China Information Sciences, 64(3), 2021. Guangxu Xun, Kishlay Jha, Jianhui Sun, and Aidong Zhang. Correlation networks for extreme multi-label text classification. In Proceedings of the 26th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, number KDD 2020, pages 1074 1082, 2020. Vacit Oguz Yazici, Abel Gonzalez-Garcia, Arnau Ramisa, Bartlomiej Twardowski, and Joost van de Weijer. Orderless recurrent models for multi-label classification. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, number CVPR 2020, pages 13440 13449, 2020. Ian En-Hsu Yen, Xiangru Huang, Pradeep Ravikumar, Kai Zhong, and Inderjit S. Dhillon. Pdsparse : A primal and dual sparse approach to extreme multiclass and multilabel classification. In Proceedings of the 33nd International Conference on Machine Learning, volume 48, pages 3069 3077, 2016. Renchun You, Zhiyao Guo, Lei Cui, Xiang Long, Yingze Bao, and Shilei Wen. Cross-modality attention with semantic graph embedding for multi-label classification. In Proceedings of the 34th AAAI Conference on Artificial Intelligence, number AAAI 2020, pages 12709 12716, 2020. Hsiang-Fu Yu, Prateek Jain, Purushottam Kar, and Inderjit S. Dhillon. Large-scale multi-label learning with missing labels. In Proceedings of the 31th International Conference on Machine Learning, volume 32, pages 593 601, 2014. Kai Yu, Shipeng Yu, and Volker Tresp. Multi-label informed latent semantic indexing. In Ricardo A. Baeza-Yates, Nivio Ziviani, Gary Marchionini, Alistair Moffat, and John Tait, editors, Proceedings of the 28th Annual International Conference on Research and Development in Information Retrieval, number SIGIR 2005, pages 258 265, 2005. Ze-Bang Yu and Min-Ling Zhang. Multi-label classification with label-specific feature generation: A wrapped approach. IEEE Transactions on Pattern Analysis and Machine Intelligence, 44(9): 5199 5210, 2022. Chunyu Zhang and Zhanshan Li. Multi-label learning with label-specific features via weighting and label entropy guided clustering ensemble. Neurocomputing, 419:59 69, 2021. Jujie Zhang, Min Fang, and Xiao Li. Multi-label learning with discriminative features for each label. Neurocomputing, 154:305 316, 2015. Min-Ling Zhang and Lei Wu. Lift: Multi-label learning with label-specific features. 
IEEE Transactions on Pattern Analysis and Machine Intelligence, 37(1):107 120, 2015. Min-Ling Zhang and Zhi-Hua Zhou. ML-KNN: A lazy learning approach to multi-label learning. Pattern Recognition, 40(7):2038 2048, 2007. Min-Ling Zhang and Zhi-Hua Zhou. A review on multi-label learning algorithms. IEEE Transactions on Knowledge and Data Engineering, 26(8):1819 1837, 2014. Min-Ling Zhang, Yu-Kun Li, Xu-Ying Liu, and Xin Geng. Binary relevance for multi-label learning: an overview. Frontiers of Computer Science, 12(2):191 202, 2018. Yi-Fan Zhang and Shizhong Liao. A kernel perspective for the decision boundary of deep neural networks. In Proceedings of the 32nd IEEE International Conference on Tools with Artificial Intelligence, number ICTAI 2020, pages 653 660, 2020. Yi-Fan Zhang and Min-Ling Zhang. Nearly-tight bounds for deep kernel learning. In Proceedings of the 40th International Conference on Machine Learning, volume 202, pages 41861 41879, 2023. Yi-Fan Zhang and Min-Ling Zhang. Generalization analysis for multi-label learning. In Proceedings of the 41st International Conference on Machine Learning, number ICML 2024, 2024. Yue Zhu, James T. Kwok, and Zhi-Hua Zhou. Multi-label learning with global and local label correlation. IEEE Transactions on Knowledge and Data Engineering, 30(6):1081 1094, 2018. A.1 Appendix Outline In the appendix, we give the detailed proofs of those theoretical results in the main paper. Our main proofs include: The ℓ Lipschitz continuity of the commonly used losses for LSRL (Proposition 1). The novel vector-contraction inequality for ℓ Lipschitz loss (Lemma 1). The generalization bound of the general LSRL class with no dependency on c (Theorem 1). The generalization bound for k-means clustering-based LSRL method (Theorem 2). The (pseudo-) Lipschitz continuity of the squared and Decomposable loss (Proposition 2). The generalization bound for Lasso-based LSRL method (Theorem 3). The generalization bound for DNN-based LSRL method (Theorem 4). A.2 Preliminaries A.2.1 Definitions of the corresponding complexity measures Definition 3 (ℓ norm covering number). Let F be a class of real-valued functions mapping from X to R. Let D = {x1, . . . , xn} be a set with n i.i.d. samples. For any ϵ > 0, the empirical ℓ norm covering number N (ϵ, F, D) w.r.t. D is defined as the minimal number m of a collection of vectors v1, . . . , vm Rn such that maxi [n] f (xi) vj i ϵ (vj i is the i-th component of the vector vj). In this case, we call v1, . . . , vm an (ϵ, ℓ )-cover of F with respect to D. We also define N (ϵ, F, n) = sup D N (ϵ, F, D). Definition 4 (Fat-shattering dimension). Let F be a class of real-valued functions mapping from X to R. We define the fat-shattering dimension fatϵ(F) at scale ϵ > 0 as the largest p N such that there exist p points x1, . . . , xp X and witnesses s1, . . . , sp R satisfying: for any δ1, . . . , δp { 1, +1} there exists f F with δi (f(xi) si) ϵ, i = 1, . . . , p. A.2.2 The Bound for the loss function space For any training dataset D = {(xi, yi) : i [n]}, let D = {(xi, yi) : i [n]} be the training dataset with only one sample different from D, where the k-th sample is replaced by (x k, y k). 
A.2.2 The Bound for the loss function space

For any training dataset $D = \{(x_i, y_i) : i \in [n]\}$, let $D' = \{(x'_i, y'_i) : i \in [n]\}$ be the training dataset with only one sample different from $D$, where the $k$-th sample is replaced by $(x'_k, y'_k)$.

Let $\Phi(D) = \sup_{f \in \mathcal{F}} \big[ \mathbb{E}_{(x,y) \sim P}[\ell(f(x), y)] - \frac{1}{n} \sum_{i=1}^n \ell(f(x_i), y_i) \big] = \sup_{f \in \mathcal{F}} [R(f) - \hat{R}_D(f)]$. Then
$$\Phi(D') - \Phi(D) = \sup_{f \in \mathcal{F}} [R(f) - \hat{R}_{D'}(f)] - \sup_{f \in \mathcal{F}} [R(f) - \hat{R}_D(f)] \le \sup_{f \in \mathcal{F}} [\hat{R}_D(f) - \hat{R}_{D'}(f)] = \sup_{f \in \mathcal{F}} \frac{\ell(f(x_k), y_k) - \ell(f(x'_k), y'_k)}{n} \le \frac{M}{n}.$$

According to McDiarmid's inequality, for any $0 < \delta < 1$, with probability at least $1 - \frac{\delta}{2}$ over the training dataset $D$, the following holds:
$$\Phi(D) \le \mathbb{E}_D[\Phi(D)] + M \sqrt{\frac{\log(2/\delta)}{2n}}. \tag{2}$$

Then, we estimate the upper bound of $\mathbb{E}_D[\Phi(D)]$, where $D'$ now denotes an independent (ghost) copy of $D$ and $\epsilon_1, \dots, \epsilon_n$ are independent Rademacher variables:
$$\begin{aligned}
\mathbb{E}_D[\Phi(D)] &= \mathbb{E}_D \Big[ \sup_{f \in \mathcal{F}} \mathbb{E}_{D'} \big[ \hat{R}_{D'}(f) - \hat{R}_D(f) \big] \Big] \le \mathbb{E}_{D, D'} \Big[ \sup_{f \in \mathcal{F}} \big[ \hat{R}_{D'}(f) - \hat{R}_D(f) \big] \Big] \\
&= \mathbb{E}_{D, D'} \Big[ \sup_{f \in \mathcal{F}} \frac{1}{n} \sum_{i=1}^n \big( \ell(f(x'_i), y'_i) - \ell(f(x_i), y_i) \big) \Big] = \mathbb{E}_{D, D', \epsilon} \Big[ \sup_{f \in \mathcal{F}} \frac{1}{n} \sum_{i=1}^n \epsilon_i \big( \ell(f(x'_i), y'_i) - \ell(f(x_i), y_i) \big) \Big] \\
&\le \mathbb{E}_{D', \epsilon} \Big[ \sup_{f \in \mathcal{F}} \frac{1}{n} \sum_{i=1}^n \epsilon_i \,\ell(f(x'_i), y'_i) \Big] + \mathbb{E}_{D, \epsilon} \Big[ \sup_{f \in \mathcal{F}} \frac{1}{n} \sum_{i=1}^n (-\epsilon_i)\, \ell(f(x_i), y_i) \Big] \\
&= 2\, \mathbb{E}_{D, \epsilon} \Big[ \sup_{f \in \mathcal{F}} \frac{1}{n} \sum_{i=1}^n \epsilon_i\, \ell(f(x_i), y_i) \Big] = 2\,\mathfrak{R}(\mathcal{L}).
\end{aligned} \tag{3}$$

Then, applying McDiarmid's inequality to $\mathbb{E}_\epsilon \big[ \sup_{f \in \mathcal{F}} \frac{1}{n} \sum_{i=1}^n \epsilon_i \,\ell(f(x_i), y_i) \big]$, with probability at least $1 - \frac{\delta}{2}$ we have
$$\mathfrak{R}(\mathcal{L}) \le \hat{\mathfrak{R}}_D(\mathcal{L}) + M \sqrt{\frac{\log(2/\delta)}{2n}}. \tag{4}$$

Combining (2), (3) and (4), with probability at least $1 - \delta$,
$$R(f) \le \hat{R}_D(f) + 2\, \hat{\mathfrak{R}}_D(\mathcal{L}) + 3M \sqrt{\frac{\log(2/\delta)}{2n}}. \tag{5}$$
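To make the roles of the three terms in (5) concrete, the following sketch evaluates the right-hand side for a toy bounded loss class, with a crude Monte Carlo estimate standing in for $\hat{\mathfrak{R}}_D(\mathcal{L})$. The clipped hinge loss, the norm-bounded linear scores, and all names are assumptions of this example, not anything prescribed by the paper.

```python
import numpy as np

# Illustrative sketch: evaluate a bound of the form (5),
#   R(f) <= R_hat(f) + 2 * Rademacher(L) + 3 * M * sqrt(log(2/delta) / (2n)),
# using a Monte Carlo stand-in for the empirical Rademacher complexity of the
# loss class over a toy bounded class (an assumption of this example).
rng = np.random.default_rng(1)
n, d, M, delta = 200, 10, 1.0, 0.05
X = rng.normal(size=(n, d)) / np.sqrt(d)
y = rng.choice([-1.0, 1.0], size=n)

def loss(w):                               # clipped hinge loss, bounded by M
    return np.minimum(np.maximum(1.0 - y * (X @ w), 0.0), M)

def mc_rademacher(n_eps=200, n_w=500):
    """Monte Carlo estimate of E_eps sup_w (1/n) sum_i eps_i * loss_i(w),
    where the sup is approximated over randomly sampled unit-norm w."""
    W = rng.normal(size=(n_w, d))
    W /= np.linalg.norm(W, axis=1, keepdims=True)   # ||w||_2 <= 1
    L = np.stack([loss(w) for w in W])              # (n_w, n) loss matrix
    eps = rng.choice([-1.0, 1.0], size=(n_eps, n))
    return np.mean(np.max(L @ eps.T / n, axis=0))   # sup over the sampled w

w_hat = np.zeros(d)                                 # any fixed predictor
empirical_risk = loss(w_hat).mean()
rad = mc_rademacher()
bound = empirical_risk + 2 * rad + 3 * M * np.sqrt(np.log(2 / delta) / (2 * n))
print(f"empirical risk {empirical_risk:.3f}, MC Rademacher {rad:.3f}, bound {bound:.3f}")
```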
A.3 General Bounds for LSRL

A.3.1 Proof of Proposition 1

We first prove that the surrogate Hamming loss is $\mu$-Lipschitz continuous with respect to the $\ell_\infty$ norm:
$$\begin{aligned}
\big| \ell_H(f(x), y) - \ell_H(f'(x), y) \big| &= \Big| \frac{1}{c} \sum_{j=1}^c \ell_b(y_j f_j(x)) - \frac{1}{c} \sum_{j=1}^c \ell_b(y_j f'_j(x)) \Big| \le \frac{1}{c} \sum_{j=1}^c \big| \ell_b(y_j f_j(x)) - \ell_b(y_j f'_j(x)) \big| \\
&\le \frac{1}{c} \sum_{j=1}^c \mu\, \big| f_j(x) - f'_j(x) \big| \le \frac{1}{c}\,\mu\, c \max_{j \in [c]} \big| f_j(x) - f'_j(x) \big| = \mu\, \| f(x) - f'(x) \|_\infty.
\end{aligned}$$

Second, with the elementary inequality
$$\big| \max\{a_1, \dots, a_c\} - \max\{b_1, \dots, b_c\} \big| \le \max\{ |a_1 - b_1|, \dots, |a_c - b_c| \},$$
we prove that the surrogate Subset loss is $\mu$-Lipschitz continuous with respect to the $\ell_\infty$ norm:
$$\begin{aligned}
\big| \ell_S(f(x), y) - \ell_S(f'(x), y) \big| &= \Big| \max_{j \in [c]} \ell_b(y_j f_j(x)) - \max_{j \in [c]} \ell_b(y_j f'_j(x)) \Big| \le \max_{j \in [c]} \big| \ell_b(y_j f_j(x)) - \ell_b(y_j f'_j(x)) \big| \\
&\le \mu \max_{j \in [c]} \big| f_j(x) - f'_j(x) \big| = \mu\, \| f(x) - f'(x) \|_\infty.
\end{aligned}$$

Third, we prove that the surrogate Ranking loss is $2\mu$-Lipschitz continuous with respect to the $\ell_\infty$ norm:
$$\begin{aligned}
\big| \ell_R(f(x), y) - \ell_R(f'(x), y) \big| &= \Big| \frac{1}{|Y^+||Y^-|} \sum_{p \in Y^+, q \in Y^-} \big[ \ell_b(f_p(x) - f_q(x)) - \ell_b(f'_p(x) - f'_q(x)) \big] \Big| \\
&\le \max_{p \in Y^+, q \in Y^-} \big| \ell_b(f_p(x) - f_q(x)) - \ell_b(f'_p(x) - f'_q(x)) \big| \le \mu \max_{p \in Y^+, q \in Y^-} \big| (f_p(x) - f_q(x)) - (f'_p(x) - f'_q(x)) \big| \\
&\le \mu \Big( \max_{p \in Y^+} \big| f_p(x) - f'_p(x) \big| + \max_{q \in Y^-} \big| f_q(x) - f'_q(x) \big| \Big) \le 2\mu \max_{j \in [c]} \big| f_j(x) - f'_j(x) \big| = 2\mu\, \| f(x) - f'(x) \|_\infty.
\end{aligned}$$

Finally, we prove that the surrogate Decomposable loss is $c\mu$-Lipschitz continuous with respect to the $\ell_\infty$ norm:
$$\begin{aligned}
\big| \ell_D(f(x), y) - \ell_D(f'(x), y) \big| &= \Big| \sum_{j=1}^c \ell_b(y_j f_j(x)) - \sum_{j=1}^c \ell_b(y_j f'_j(x)) \Big| \le \sum_{j=1}^c \big| \ell_b(y_j f_j(x)) - \ell_b(y_j f'_j(x)) \big| \\
&\le \sum_{j=1}^c \mu\, \big| f_j(x) - f'_j(x) \big| \le c\mu \max_{j \in [c]} \big| f_j(x) - f'_j(x) \big| = c\mu\, \| f(x) - f'(x) \|_\infty.
\end{aligned}$$
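The Lipschitz constants established in Proposition 1 can be checked numerically. The following sketch uses the hinge function as the base loss (so $\mu = 1$) and verifies on random score vectors that the surrogate Hamming and Subset losses change by at most $\|f - f'\|_\infty$ and the surrogate Ranking loss by at most $2\|f - f'\|_\infty$; the base loss and the sampling scheme are assumptions of this example.

```python
import numpy as np

# Illustrative numerical check (assumption: hinge base loss, mu = 1) of the
# l_inf Lipschitz constants proved in Proposition 1: mu for the surrogate
# Hamming and Subset losses, 2*mu for the surrogate Ranking loss.
rng = np.random.default_rng(2)
c = 8
base = lambda t: np.maximum(0.0, 1.0 - t)          # hinge, 1-Lipschitz

def hamming(f, y):  return base(y * f).mean()
def subset(f, y):   return base(y * f).max()
def ranking(f, y):
    pos, neg = f[y > 0], f[y < 0]
    return base(pos[:, None] - neg[None, :]).mean()

ratios = {"hamming": [], "subset": [], "ranking": []}
for _ in range(20000):
    y = np.where(rng.random(c) < 0.5, 1.0, -1.0)
    if not ((y > 0).any() and (y < 0).any()):
        continue                                    # ranking loss needs both signs
    f, g = rng.normal(size=c), rng.normal(size=c)
    gap = np.max(np.abs(f - g))                     # ||f - g||_inf
    for name, fn in (("hamming", hamming), ("subset", subset), ("ranking", ranking)):
        ratios[name].append(abs(fn(f, y) - fn(g, y)) / gap)

for name, r in ratios.items():
    print(f"{name:8s}: max |loss(f)-loss(g)| / ||f-g||_inf ~ {max(r):.3f}")
# Expected (up to sampling): hamming, subset <= 1; ranking <= 2.
```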
A.3.2 Proof of Lemma 1

Proof Sketch: First, the Rademacher complexity of the loss function space associated with $\mathcal{F}$ can be bounded by the empirical $\ell_\infty$ norm covering number via the refined Dudley entropy integral inequality. Second, by the Lipschitz continuity w.r.t. the $\ell_\infty$ norm, the empirical $\ell_\infty$ norm covering number of the loss class can be bounded by the empirical $\ell_\infty$ norm covering number of $P(\mathcal{F})$. Third, the empirical $\ell_\infty$ norm covering number of $P(\mathcal{F})$ can be bounded by the fat-shattering dimension, and the fat-shattering dimension can be bounded by the worst-case Rademacher complexity of $P(\mathcal{F})$. Hence, the problem is transferred to the estimation of the worst-case Rademacher complexity. Finally, we estimate the lower bound of the worst-case Rademacher complexity of $P(\mathcal{F})$ and, combining the above steps, bound the Rademacher complexity of the loss function space associated with $\mathcal{F}$.

We first introduce the following lemmas:

Lemma 2 (Khintchine-Kahane inequality [Lust-Piquard and Pisier, 1991]). Let $v_1, \dots, v_n \in H$, where $H$ is a Hilbert space with $\|\cdot\|$ the associated norm, and let $\epsilon_1, \dots, \epsilon_n$ be a sequence of independent Rademacher variables. Then, for any $1 \le p \le 2$, there holds
$$\frac{1}{\sqrt{2}}\Big[\sum_{i=1}^n \|v_i\|^2\Big]^{\frac12} \le \Big[\mathbb{E}_\epsilon\Big\|\sum_{i=1}^n \epsilon_i v_i\Big\|^p\Big]^{\frac1p} \le \Big[\sum_{i=1}^n \|v_i\|^2\Big]^{\frac12}.$$

Lemma 3 (Lemma A.2 in [Srebro et al., 2010]). For any function class $\mathcal{F}$, any $S$ with a finite sample of size $n$ and any $\epsilon > \hat{\mathfrak{R}}_S(\mathcal{F})$, we have
$$\mathrm{fat}_\epsilon(\mathcal{F}) \le \frac{4n\, \hat{\mathfrak{R}}_S^2(\mathcal{F})}{\epsilon^2}.$$

Lemma 4 ([Srebro et al., 2010, Lei et al., 2019]). If any function in the class $\mathcal{F}$ takes values in $[-B, B]$, then for any $S$ with a finite sample of size $n$ and any $\epsilon > 0$ with $\mathrm{fat}_\epsilon(\mathcal{F}) < n$, we have
$$\log \mathcal{N}_\infty(\epsilon, \mathcal{F}, S) \le \mathrm{fat}_\epsilon(\mathcal{F}) \log \frac{2eBn}{\epsilon}.$$

The following lemma is a refined result of Proposition 22 in [Ledent et al., 2021], where we replace the function class taking values in $[0, 1]$ with a $B$-bounded function class; the refinement is straightforward.

Lemma 5 (Refined Dudley entropy integral inequality). Let $\mathcal{F}$ be a real-valued function class with $\|f\|_\infty \le B$ for all $f \in \mathcal{F}$, $B > 0$, and assume that $0 \in \mathcal{F}$. Let $S$ be a finite sample of size $n$. For any $2 \le p \le \infty$, we have the following relationship between the Rademacher complexity $\hat{\mathfrak{R}}_S(\mathcal{F})$ and the covering number $\mathcal{N}_p(\epsilon, \mathcal{F}, S)$:
$$\hat{\mathfrak{R}}_S(\mathcal{F}) \le \inf_{\alpha > 0}\Big(4\alpha + \frac{12}{\sqrt{n}}\int_\alpha^{B}\sqrt{\log \mathcal{N}_p(\epsilon, \mathcal{F}, S)}\,d\epsilon\Big).$$

Step 1: We first derive the relationship between the empirical $\ell_\infty$ norm covering number $\mathcal{N}_\infty(\epsilon, \mathcal{L}, D)$ and the empirical $\ell_\infty$ norm covering number $\mathcal{N}_\infty(\epsilon, P(\mathcal{F}), [c]\cdot D)$. For the dataset $D = \{(x_1, y_1), \dots, (x_n, y_n)\}$ with $n$ i.i.d. examples,
$$\begin{aligned}
\max_i \big| \ell(f(x_i), y_i) - \ell(f'(x_i), y_i) \big| &\le \rho \max_i \| f(x_i) - f'(x_i) \|_\infty &&\text{(use Assumption 2)}\\
&= \rho \max_i \max_j \big| f_j(x_i) - f'_j(x_i) \big| = \rho \max_i \max_j \big| p_j(f(x_i)) - p_j(f'(x_i)) \big|. &&\text{(definition of the projection function class } P(\mathcal{F})\text{)}
\end{aligned}$$
Then, by the definition of the empirical $\ell_\infty$ covering number, an empirical $\ell_\infty$ cover of $P(\mathcal{F})$ at radius $\epsilon/\rho$ is also an empirical $\ell_\infty$ cover of the loss function space associated with $\mathcal{F}$ at radius $\epsilon$, and we conclude that
$$\mathcal{N}_\infty(\epsilon, \mathcal{L}, D) \le \mathcal{N}_\infty\Big(\frac{\epsilon}{\rho}, P(\mathcal{F}), [c]\cdot D\Big). \tag{6}$$

Step 2: We show that the empirical $\ell_\infty$ norm covering number of $P(\mathcal{F})$ can be bounded by the fat-shattering dimension, and the fat-shattering dimension can be bounded by the worst-case Rademacher complexity of $P(\mathcal{F})$. According to Lemma 3, for any $\epsilon > \hat{\mathfrak{R}}_{[c]\cdot D}(P(\mathcal{F}))$, we have
$$\mathrm{fat}_\epsilon(P(\mathcal{F})) \le \frac{4nc\, \hat{\mathfrak{R}}^2_{[c]\cdot D}(P(\mathcal{F}))}{\epsilon^2}.$$
Then, combining with Lemma 4, we have
$$\log \mathcal{N}_\infty(\epsilon, P(\mathcal{F}), [c]\cdot D) \le \frac{4nc\, \hat{\mathfrak{R}}^2_{[c]\cdot D}(P(\mathcal{F}))}{\epsilon^2}\log\frac{2eBnc}{\epsilon}.$$
Using the inequality $\epsilon > \hat{\mathfrak{R}}_{[c]\cdot D}(P(\mathcal{F}))$, we have
$$\log \mathcal{N}_\infty(\epsilon, P(\mathcal{F}), [c]\cdot D) \le \frac{4nc\, \hat{\mathfrak{R}}^2_{[c]\cdot D}(P(\mathcal{F}))}{\epsilon^2}\log\frac{2eBnc}{\hat{\mathfrak{R}}_{[c]\cdot D}(P(\mathcal{F}))}. \tag{7}$$

Step 3: According to Assumption 1 in the main paper, we obtain the lower bound of the worst-case Rademacher complexity $\widetilde{\mathfrak{R}}_{nc}(P(\mathcal{F}))$ by the Khintchine-Kahane inequality with $p = 1$:
$$\begin{aligned}
\widetilde{\mathfrak{R}}_{nc}(P(\mathcal{F})) &= \sup_{[c]\cdot D \subseteq [c]\times\mathcal{X}^n}\hat{\mathfrak{R}}_{[c]\cdot D}(P(\mathcal{F})) = \sup_{[c]\cdot D}\mathbb{E}_\epsilon\sup_{p_j(f(x_i))\in P(\mathcal{F})}\frac{1}{nc}\sum_{i=1}^n\sum_{j=1}^c\epsilon_{ij}\,p_j(f(x_i)) = \sup_{[c]\cdot D}\mathbb{E}_\epsilon\sup_{f\in\mathcal{F}}\frac{1}{nc}\sum_{i=1}^n\sum_{j=1}^c\epsilon_{ij}\,f_j(x_i)\\
&= \sup_{\|\zeta_j(\varphi_j(x_i))\|_2\le A:\,i\in[n],j\in[c]}\mathbb{E}_\epsilon\sup_{w}\frac{1}{nc}\sum_{i=1}^n\sum_{j=1}^c\epsilon_{ij}\,\langle w_j, \zeta_j(\varphi_j(x_i))\rangle\\
&\ge \sup_{\|\zeta_j(\varphi_j(x_i))\|_2\le A}\frac{\Lambda}{nc}\,\mathbb{E}_\epsilon\Big\|\sum_{i=1}^n\sum_{j=1}^c\epsilon_{ij}\,\zeta_j(\varphi_j(x_i))\Big\| \ge \sup_{\|\zeta_j(\varphi_j(x_i))\|_2\le A}\frac{\Lambda}{\sqrt{2}\,nc}\Big[\sum_{i=1}^n\sum_{j=1}^c\|\zeta_j(\varphi_j(x_i))\|^2\Big]^{\frac12}. &&\text{(use Lemma 2)}
\end{aligned}$$
Since $\|\zeta_j(\varphi_j(x_i))\|_2 \le A$, we take $\|\zeta_j(\varphi_j(x_i))\|_2 = A$ for all $i\in[n]$, $j\in[c]$, so that the supremum above equals $A\sqrt{nc}$, and hence
$$\widetilde{\mathfrak{R}}_{nc}(P(\mathcal{F})) \ge \frac{\Lambda A}{\sqrt{2nc}}. \tag{8}$$

Step 4: According to Lemma 5 and combining the above steps, we have
$$\begin{aligned}
\hat{\mathfrak{R}}_D(\mathcal{L}) &\le \inf_{\alpha>0}\Big(4\alpha + \frac{12}{\sqrt{n}}\int_\alpha^{M}\sqrt{\log\mathcal{N}_\infty(\epsilon, \mathcal{L}, D)}\,d\epsilon\Big) \le \inf_{\alpha>0}\Big(4\alpha + \frac{12}{\sqrt{n}}\int_\alpha^{M}\sqrt{\log\mathcal{N}_\infty\Big(\frac{\epsilon}{\rho}, P(\mathcal{F}), [c]\cdot D\Big)}\,d\epsilon\Big) &&\text{(use inequality (6))}\\
&\le \inf_{\alpha>0}\Big(4\alpha + \frac{12}{\sqrt{n}}\int_\alpha^{M}\sqrt{\frac{4nc\rho^2\hat{\mathfrak{R}}^2_{[c]\cdot D}(P(\mathcal{F}))}{\epsilon^2}\log\frac{2eBnc}{\hat{\mathfrak{R}}_{[c]\cdot D}(P(\mathcal{F}))}}\,d\epsilon\Big) &&\text{(use inequality (7))}\\
&\le \inf_{\alpha>0}\Big(4\alpha + \frac{12}{\sqrt{n}}\int_\alpha^{M}\frac{1}{\epsilon}\sqrt{4nc\rho^2\widetilde{\mathfrak{R}}^2_{nc}(P(\mathcal{F}))\log\big(2\sqrt{2}\,e\,n^{\frac32}c^{\frac32}\big)}\,d\epsilon\Big) &&\text{(use inequality (8), the definition of the worst-case Rademacher complexity, and } B\le\Lambda A\text{)}\\
&= \inf_{\alpha>0}\Big(4\alpha + 12\sqrt{2}\,\rho\sqrt{c}\,\widetilde{\mathfrak{R}}_{nc}(P(\mathcal{F}))\log^{\frac12}(8e^2n^3c^3)\int_\alpha^{M}\frac{d\epsilon}{\epsilon}\Big)\\
&\le 12\sqrt{2}\,\rho\sqrt{c}\,\widetilde{\mathfrak{R}}_{nc}(P(\mathcal{F})) + 12\sqrt{2}\,\rho\sqrt{c}\,\widetilde{\mathfrak{R}}_{nc}(P(\mathcal{F}))\log^{\frac12}(8e^2n^3c^3)\log\frac{M}{3\sqrt{2}\,\rho\sqrt{c}\,\widetilde{\mathfrak{R}}_{nc}(P(\mathcal{F}))} &&\Big(\text{choose }\alpha = 3\sqrt{2}\,\rho\sqrt{c}\,\widetilde{\mathfrak{R}}_{nc}(P(\mathcal{F}))\Big)\\
&\le 12\sqrt{2}\,\rho\sqrt{c}\,\widetilde{\mathfrak{R}}_{nc}(P(\mathcal{F}))\Big(1 + \log^{\frac12}(8e^2n^3c^3)\log\frac{M\sqrt{n}}{\rho B}\Big). &&\text{(use inequality (8))}
\end{aligned}$$

A.3.3 Proof of Theorem 1

We upper bound the worst-case Rademacher complexity $\widetilde{\mathfrak{R}}_{nc}(P(\mathcal{F}))$ as follows:
$$\begin{aligned}
\widetilde{\mathfrak{R}}_{nc}(P(\mathcal{F})) &= \sup_{[c]\cdot D}\hat{\mathfrak{R}}_{[c]\cdot D}(P(\mathcal{F})) = \sup_{[c]\cdot D}\mathbb{E}_\epsilon\sup_{p_j(f(x_i))\in P(\mathcal{F})}\frac{1}{nc}\sum_{i=1}^n\sum_{j=1}^c\epsilon_{ij}\,p_j(f(x_i)) = \sup_{[c]\cdot D}\mathbb{E}_\epsilon\sup_{f\in\mathcal{F}}\frac{1}{nc}\sum_{i=1}^n\sum_{j=1}^c\epsilon_{ij}\,f_j(x_i)\\
&= \sup_{\|\zeta_j(\varphi_j(x_i))\|_2\le A:\,i\in[n],j\in[c]}\mathbb{E}_\epsilon\sup_{w}\frac{1}{nc}\sum_{i=1}^n\sum_{j=1}^c\epsilon_{ij}\,\langle w_j, \zeta_j(\varphi_j(x_i))\rangle \le \sup_{\|\zeta_j(\varphi_j(x_i))\|_2\le A}\frac{\Lambda}{nc}\,\mathbb{E}_\epsilon\Big\|\sum_{i=1}^n\sum_{j=1}^c\epsilon_{ij}\,\zeta_j(\varphi_j(x_i))\Big\|\\
&\le \sup_{\|\zeta_j(\varphi_j(x_i))\|_2\le A}\frac{\Lambda}{nc}\Big[\mathbb{E}_\epsilon\Big\|\sum_{i=1}^n\sum_{j=1}^c\epsilon_{ij}\,\zeta_j(\varphi_j(x_i))\Big\|^2\Big]^{\frac12} &&\text{(Jensen's inequality)}\\
&\le \sup_{\|\zeta_j(\varphi_j(x_i))\|_2\le A}\frac{\Lambda}{nc}\Big[\sum_{i=1}^n\sum_{j=1}^c\|\zeta_j(\varphi_j(x_i))\|^2\Big]^{\frac12} \le \frac{\Lambda A}{\sqrt{nc}}. &&\text{(use Lemma 2)}
\end{aligned}$$
Then, we have
$$\hat{\mathfrak{R}}_D(\mathcal{L}) \le 12\sqrt{2}\,\rho\sqrt{c}\,\widetilde{\mathfrak{R}}_{nc}(P(\mathcal{F}))\Big(1 + \log^{\frac12}(8e^2n^3c^3)\log\frac{M\sqrt{n}}{\rho B}\Big) \le \frac{12\sqrt{2}\,\rho\Lambda A}{\sqrt{n}}\Big(1 + \log^{\frac12}(8e^2n^3c^3)\log\frac{M\sqrt{n}}{\rho B}\Big).$$
Combining with (5), with probability at least $1-\delta$,
$$R(f) \le \hat{R}_D(f) + \frac{24\sqrt{2}\,\rho\Lambda A}{\sqrt{n}}\Big(1 + \log^{\frac12}(8e^2n^3c^3)\log\frac{M\sqrt{n}}{\rho B}\Big) + 3M\sqrt{\frac{\log(2/\delta)}{2n}}.$$
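As a sanity check on the scaling that drives Theorem 1, the following sketch Monte Carlo estimates the Rademacher complexity of a toy linear class with label-specific representations and compares it with the $\Lambda A/\sqrt{nc}$ upper bound used above. The jointly norm-bounded constraint $\sum_j \|w_j\|_2^2 \le \Lambda^2$, the Gaussian toy representations, and all names are assumptions made for this illustration rather than the paper's exact function class.

```python
import numpy as np

# Illustrative Monte Carlo check of the scaling used in Theorem 1: for a toy
# class f_j(x) = <w_j, z_j(x)> with sum_j ||w_j||_2^2 <= Lambda^2 (an assumed
# joint constraint) and ||z_j(x)||_2 <= A, the Rademacher complexity over the
# augmented sample of size n*c should be at most Lambda * A / sqrt(n*c).
rng = np.random.default_rng(3)
Lambda, A, d, n = 1.0, 1.0, 20, 100

def mc_rademacher_nc(c, n_eps=300):
    # fixed label-specific representations z_j(x_i), each with norm exactly A
    Z = rng.normal(size=(n, c, d))
    Z *= A / np.linalg.norm(Z, axis=2, keepdims=True)
    vals = []
    for _ in range(n_eps):
        eps = rng.choice([-1.0, 1.0], size=(n, c))
        # sup over the jointly bounded w of (1/nc) sum_ij eps_ij <w_j, z_j(x_i)>
        # equals (Lambda/nc) * sqrt( sum_j || sum_i eps_ij z_j(x_i) ||^2 )
        per_label = np.linalg.norm((eps[:, :, None] * Z).sum(axis=0), axis=1)
        vals.append(Lambda * np.sqrt((per_label ** 2).sum()) / (n * c))
    return np.mean(vals)

for c in (5, 20, 80):
    est = mc_rademacher_nc(c)
    print(f"c = {c:3d}: MC estimate {est:.4f}  vs  Lambda*A/sqrt(nc) = {Lambda*A/np.sqrt(n*c):.4f}")
```

The estimates stay below the $\Lambda A/\sqrt{nc}$ reference for every $c$, illustrating why the complexity term in Theorem 1 carries no polynomial dependence on the number of labels.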
A.4 Generalization Bounds for Typical LSRL Methods

A.4.1 Proof of Theorem 2

The generalization analysis of LIFT involves the generalization analysis of k-means clustering. According to the definitions of k-means clustering in Subsection 5.1, the empirical Rademacher complexity of the loss function space associated with the vector-valued function class $\mathcal{G}$ is
$$\hat{\mathfrak{R}}_D(\mathcal{L}_{clu}\circ\mathcal{G}) = \mathbb{E}_\epsilon\Big[\sup_{g\in\mathcal{G}}\frac{1}{n(n-1)}\sum_{i,j=1,\,i\ne j}^{n}\epsilon_i\,\ell_{clu}\big(g(x_i,x_j)\big)\Big].$$
Rademacher complexity has proved to be a powerful data-dependent measure of hypothesis space complexity. However, since the k-means clustering framework involves pairwise functions, a sequence of pairs built from i.i.d. individual observations is no longer independent, which makes the standard i.i.d. techniques for traditional Rademacher complexity inapplicable to k-means clustering. We convert the non-sum-of-i.i.d. pairwise function into a sum-of-i.i.d. form by using permutations in the U-process [Clémençon et al., 2008]. We first prove the following lemma:

Lemma 6. Let $q_\tau: \mathcal{X}\times\mathcal{X}\to\mathbb{R}$ be real-valued functions indexed by $\tau\in T$, where $T$ is some set. If $x_1,\dots,x_s$ and $x'_1,\dots,x'_t$ are i.i.d. and $r=\min\{s,t\}$, then for any convex non-decreasing function $\psi$,
$$\mathbb{E}\,\psi\Big(\sup_{\tau\in T}\frac{1}{st}\sum_{i=1}^{s}\sum_{j=1}^{t}q_\tau(x_i,x'_j)\Big)\le\mathbb{E}\,\psi\Big(\sup_{\tau\in T}\frac1r\sum_{i=1}^{r}q_\tau(x_i,x'_i)\Big).$$

Proof. The proof of this lemma is inspired by [Clémençon et al., 2008]. The two-sample mean can be written as an average, over random index selections $\pi$, of averages of $r$ pairs built from distinct indices, so that
$$\begin{aligned}
\mathbb{E}\,\psi\Big(\sup_{\tau\in T}\frac{1}{st}\sum_{i=1}^{s}\sum_{j=1}^{t}q_\tau(x_i,x'_j)\Big)
&=\mathbb{E}\,\psi\Big(\sup_{\tau\in T}\mathbb{E}_\pi\Big[\frac1r\sum_{i=1}^{r}q_\tau\big(x_{\pi(i)},x'_{\pi(i)}\big)\Big]\Big)\\
&\le\mathbb{E}\,\psi\Big(\mathbb{E}_\pi\Big[\sup_{\tau\in T}\frac1r\sum_{i=1}^{r}q_\tau\big(x_{\pi(i)},x'_{\pi(i)}\big)\Big]\Big) &&(\psi\text{ is non-decreasing})\\
&\le\mathbb{E}\,\mathbb{E}_\pi\Big[\psi\Big(\sup_{\tau\in T}\frac1r\sum_{i=1}^{r}q_\tau\big(x_{\pi(i)},x'_{\pi(i)}\big)\Big)\Big] &&\text{(Jensen's inequality)}\\
&=\mathbb{E}\,\psi\Big(\sup_{\tau\in T}\frac1r\sum_{i=1}^{r}q_\tau(x_i,x'_i)\Big),
\end{aligned}$$
where the last equality holds because, for each fixed index selection $\pi$, $\big(x_{\pi(i)},x'_{\pi(i)}\big)_{i=1}^{r}$ has the same distribution as $\big(x_i,x'_i\big)_{i=1}^{r}$.

According to Lemma 6, $\hat{\mathfrak{R}}_D(\mathcal{L}_{clu}\circ\mathcal{G})$ can be bounded by
$$\hat{\mathfrak{R}}'_D(\mathcal{L}_{clu}\circ\mathcal{G}) := \mathbb{E}_\epsilon\Big[\sup_{g\in\mathcal{G}}\frac{1}{\lfloor n/2\rfloor}\sum_{i=1}^{\lfloor n/2\rfloor}\epsilon_i\,\ell_{clu}\big(g(x_i,x_{i+\lfloor n/2\rfloor})\big)\Big],$$
where each $\epsilon_i$ is an independent Rademacher random variable. In order to obtain tighter generalization bounds for k-means clustering, we develop the following novel vector-contraction inequality:

Lemma 7. Let $\mathcal{G}$ be the vector-valued function class of k-means clustering defined in Subsection 5.1, and let $D$ be a dataset of size $n$. Then,
$$\hat{\mathfrak{R}}_D(\mathcal{L}_{clu}\circ\mathcal{G}) \le 12\sqrt{2K}\max_k\widetilde{\mathfrak{R}}_{\frac n2}(\mathcal{G}_k)\Big(1+\log^{\frac12}(e^2n^3)\log\frac{M}{\max_k\widetilde{\mathfrak{R}}_{\frac n2}(\mathcal{G}_k)}\Big), \tag{9}$$
where $\widetilde{\mathfrak{R}}_{\frac n2}(\mathcal{G}_k)$ is the worst-case Rademacher complexity and $\mathcal{G}_k$ is the restriction of the function class along the $k$-th coordinate, $g_k\in\mathcal{G}_k$.

Proof. We first introduce the following lemma:

Lemma 8 (Lemma 1 in [Foster and Rakhlin, 2019]). Let $\mathcal{F}\subseteq\{f:\mathcal{X}\to\mathbb{R}^K\}$, and let $\phi_1,\dots,\phi_n$ each be $L$-Lipschitz with respect to the $\ell_\infty$ norm. For any $D$ with a finite sample of size $n$ and any $\epsilon>0$, we have
$$\log\mathcal{N}_2(\epsilon,\phi\circ\mathcal{F},D)\le K\max_k\log\mathcal{N}_\infty\Big(\frac\epsilon L,\mathcal{F}_k,D\Big), \tag{10}$$
where $\mathcal{F}_k$ is the restriction of the function class along the $k$-th coordinate, $f_k\in\mathcal{F}_k$, $k\in[K]$.

The empirical $\ell_\infty$ norm covering number of $\mathcal{G}_k$ can be bounded by the fat-shattering dimension, and the fat-shattering dimension can be bounded by the worst-case Rademacher complexity of $\mathcal{G}_k$. Writing $D'=\{(x_i,x_{i+n/2})\}_{i=1}^{n/2}$ and combining the above steps, we have
$$\begin{aligned}
\hat{\mathfrak{R}}_D(\mathcal{L}_{clu}\circ\mathcal{G}) &\le \hat{\mathfrak{R}}'_D(\mathcal{L}_{clu}\circ\mathcal{G}) \le \inf_{\alpha>0}\Big(4\alpha+\frac{12}{\sqrt{n/2}}\int_\alpha^{M}\sqrt{\log\mathcal{N}_2(\epsilon,\mathcal{L}_{clu}\circ\mathcal{G},D')}\,d\epsilon\Big) &&\text{(use Lemma 5)}\\
&\le \inf_{\alpha>0}\Big(4\alpha+\frac{12}{\sqrt{n/2}}\int_\alpha^{M}\sqrt{K\max_k\log\mathcal{N}_\infty(\epsilon,\mathcal{G}_k,D')}\,d\epsilon\Big) &&\text{(use (10); } \ell_{clu}\text{ is 1-Lipschitz w.r.t. the }\ell_\infty\text{ norm for k-means clustering)}\\
&\le \inf_{\alpha>0}\Big(4\alpha+\frac{12}{\sqrt{n/2}}\int_\alpha^{M}\sqrt{K\max_k\mathrm{fat}_\epsilon(\mathcal{G}_k)\log\frac{2eB'n}{2\epsilon}}\,d\epsilon\Big) &&\text{(use Lemma 4)}\\
&\le \inf_{\alpha>0}\Big(4\alpha+\frac{12}{\sqrt{n/2}}\int_\alpha^{M}\sqrt{K\max_k\frac{4(n/2)\,\hat{\mathfrak{R}}^2_{D'}(\mathcal{G}_k)}{\epsilon^2}\log\frac{2eB'(n/2)}{\hat{\mathfrak{R}}_{D'}(\mathcal{G}_k)}}\,d\epsilon\Big) &&\text{(use Lemma 3)}\\
&\le \inf_{\alpha>0}\Big(4\alpha+\frac{12}{\sqrt{n/2}}\int_\alpha^{M}\frac1\epsilon\sqrt{2nK\max_k\widetilde{\mathfrak{R}}^2_{\frac n2}(\mathcal{G}_k)\log\big(e\,n^{\frac32}\big)}\,d\epsilon\Big) &&\text{(definition of the worst-case Rademacher complexity; lower bound of }\widetilde{\mathfrak{R}}_{\frac n2}(\mathcal{G}_k)\text{ as in (8))}\\
&= \inf_{\alpha>0}\Big(4\alpha+12\sqrt{2K}\max_k\widetilde{\mathfrak{R}}_{\frac n2}(\mathcal{G}_k)\log^{\frac12}(e^2n^3)\int_\alpha^{M}\frac{d\epsilon}\epsilon\Big)\\
&\le 12\sqrt{2K}\max_k\widetilde{\mathfrak{R}}_{\frac n2}(\mathcal{G}_k)+12\sqrt{2K}\max_k\widetilde{\mathfrak{R}}_{\frac n2}(\mathcal{G}_k)\log^{\frac12}(e^2n^3)\log\frac{M}{3\sqrt{2K}\max_k\widetilde{\mathfrak{R}}_{\frac n2}(\mathcal{G}_k)} &&\Big(\text{choose }\alpha=3\sqrt{2K}\max_k\widetilde{\mathfrak{R}}_{\frac n2}(\mathcal{G}_k)\Big)\\
&\le 12\sqrt{2K}\max_k\widetilde{\mathfrak{R}}_{\frac n2}(\mathcal{G}_k)\Big(1+\log^{\frac12}(e^2n^3)\log\frac{M}{\max_k\widetilde{\mathfrak{R}}_{\frac n2}(\mathcal{G}_k)}\Big).
\end{aligned}$$

Since the k-means clustering-based LSRL method is two-stage (k-means clustering generates the label-specific representations in the first stage, and the second stage is conventional multi-label learning), the corresponding whole function class is $\mathcal{H}=\mathcal{F}+\mathcal{L}_{clu}\circ\mathcal{G}$. Since $\widetilde{\mathfrak{R}}_{nc}(P(\mathcal{H}))\le\widetilde{\mathfrak{R}}_{nc}(P(\mathcal{F}))+\widetilde{\mathfrak{R}}_{nc}(P(\mathcal{L}_{clu}\circ\mathcal{G}))$ [Bartlett and Mendelson, 2002], together with Lemma 1 we have
$$\hat{\mathfrak{R}}_D(\mathcal{L}\circ\mathcal{H})\le 12\sqrt2\,\rho\sqrt c\,\big(\widetilde{\mathfrak{R}}_{nc}(P(\mathcal{F}))+\widetilde{\mathfrak{R}}_{nc}(P(\mathcal{L}_{clu}\circ\mathcal{G}))\big)\Big(1+\log^{\frac12}(8e^2n^3c^3)\log\frac{M\sqrt n}{\rho B}\Big).$$
According to Lemma 7, the worst-case Rademacher complexity of the loss function space associated with $P(\mathcal{G})$ can be bounded by the worst-case Rademacher complexity of the restriction of the function class along each coordinate, $P(\mathcal{G}_k)$. Hence, for the k-means clustering-based LSRL method, Lemma 7 involves the upper and lower bounds of $\widetilde{\mathfrak{R}}_{\frac{nc}2}(P(\mathcal{G}_k))$.

We first obtain the lower bound of $\widetilde{\mathfrak{R}}_{\frac{nc}2}(P(\mathcal{G}_k))$ by the Khintchine-Kahane inequality with $p=1$. Writing each $g_k^j\in\mathcal{G}_k^j$ as a product $V(x_i,x_{i+\frac n2})\,Z_k^j(x_i,x_{i+\frac n2})$ with $V(\cdot,\cdot)\le 4R^2$ and $|Z_k^j(\cdot,\cdot)|\le 1$, we have
$$\begin{aligned}
\widetilde{\mathfrak{R}}_{\frac{nc}2}(P(\mathcal{G}_k)) &= \sup_{[c]\cdot D'}\hat{\mathfrak{R}}_{[c]\cdot D'}(P(\mathcal{G}_k)) = \sup_{[c]\cdot D'}\mathbb{E}_\epsilon\sup_{p_j(g_k(x_i,x_{i+\frac n2}))\in P(\mathcal{G}_k)}\frac{2}{nc}\sum_{i=1}^{n/2}\sum_{j=1}^{c}\epsilon_{ij}\,p_j\big(g_k(x_i,x_{i+\frac n2})\big)\\
&= \sup_{[c]\cdot D'}\mathbb{E}_\epsilon\sup_{g_k^j\in\mathcal{G}_k^j}\frac{2}{nc}\sum_{i=1}^{n/2}\sum_{j=1}^{c}\epsilon_{ij}\,V(x_i,x_{i+\frac n2})\,Z_k^j(x_i,x_{i+\frac n2})\\
&\ge \frac{8R^2}{nc}\sup_{Z_k^j}\mathbb{E}_\epsilon\Big|\sum_{i=1}^{n/2}\sum_{j=1}^{c}\epsilon_{ij}\,Z_k^j(x_i,x_{i+\frac n2})\Big| \ge \frac{8R^2}{\sqrt2\,nc}\sup_{Z_k^j}\Big[\sum_{i=1}^{n/2}\sum_{j=1}^{c}\big(Z_k^j(x_i,x_{i+\frac n2})\big)^2\Big]^{\frac12}. &&\text{(use Lemma 2)}
\end{aligned}$$
Since $|Z_k^j(x_i,x_{i+\frac n2})|\le 1$, we take $|Z_k^j(x_i,x_{i+\frac n2})|=1$ for all $i\in[n/2]$, $j\in[c]$, so that
$$\widetilde{\mathfrak{R}}_{\frac{nc}2}(P(\mathcal{G}_k)) \ge \frac{8R^2}{\sqrt2\,nc}\sqrt{\frac{nc}2} = \frac{4R^2}{\sqrt{nc}}.$$
We then upper bound the worst-case Rademacher complexity $\widetilde{\mathfrak{R}}_{\frac{nc}2}(P(\mathcal{G}_k))$ as follows:
$$\begin{aligned}
\widetilde{\mathfrak{R}}_{\frac{nc}2}(P(\mathcal{G}_k)) &= \sup_{[c]\cdot D'}\mathbb{E}_\epsilon\sup_{g_k^j\in\mathcal{G}_k^j}\frac{2}{nc}\sum_{i=1}^{n/2}\sum_{j=1}^{c}\epsilon_{ij}\,V(x_i,x_{i+\frac n2})\,Z_k^j(x_i,x_{i+\frac n2})\\
&\le \frac{8R^2}{nc}\sup_{Z_k^j}\mathbb{E}_\epsilon\Big|\sum_{i=1}^{n/2}\sum_{j=1}^{c}\epsilon_{ij}\,Z_k^j(x_i,x_{i+\frac n2})\Big| \le \frac{8R^2}{nc}\sup_{Z_k^j}\Big[\mathbb{E}_\epsilon\Big(\sum_{i=1}^{n/2}\sum_{j=1}^{c}\epsilon_{ij}\,Z_k^j(x_i,x_{i+\frac n2})\Big)^2\Big]^{\frac12} &&\text{(Jensen's inequality)}\\
&= \frac{8R^2}{nc}\sup_{Z_k^j}\Big[\sum_{i=1}^{n/2}\sum_{j=1}^{c}\big(Z_k^j(x_i,x_{i+\frac n2})\big)^2\Big]^{\frac12} \le \frac{8R^2}{nc}\sqrt{\frac{nc}2} = \frac{4\sqrt2\,R^2}{\sqrt{nc}}. &&\text{(use Lemma 2)}
\end{aligned}$$
Replacing the lower bound of $\widetilde{\mathfrak{R}}_{\frac n2}(\mathcal{G}_k)$ in the proof of Lemma 7 with the lower bound of $\widetilde{\mathfrak{R}}_{\frac{nc}2}(P(\mathcal{G}_k))$, we have
$$\widetilde{\mathfrak{R}}_{nc}(P(\mathcal{L}_{clu}\circ\mathcal{G})) \le 12\sqrt{2K}\max_k\widetilde{\mathfrak{R}}_{\frac{nc}2}(P(\mathcal{G}_k))\Big(1+\log^{\frac12}(e^2n^3c^3)\log\frac{M\sqrt{nc}}{4R^2}\Big).$$
Hence, combining with the upper bound of $\widetilde{\mathfrak{R}}_{\frac{nc}2}(P(\mathcal{G}_k))$, we have
$$\widetilde{\mathfrak{R}}_{nc}(P(\mathcal{L}_{clu}\circ\mathcal{G})) \le \frac{96\sqrt K\,R^2}{\sqrt{nc}}\Big(1+\log^{\frac12}(e^2n^3c^3)\log\frac{M\sqrt{nc}}{4R^2}\Big).$$
Using a similar technique to the proof of the inequality above, the upper bound of $\widetilde{\mathfrak{R}}_{nc}(P(\mathcal{F}))$ is $\frac{2\Lambda\sqrt K R}{\sqrt{nc}}$, since the label-specific representation satisfies
$$\|\varphi_j(x)\| = \sqrt{\sum_{k=1}^{K}\big(d(x,c_k^j)\big)^2} \le \sqrt K\max_k\big\|x-c_k^j\big\| \le \sqrt K\max_k\big(\|x\|+\|c_k^j\|\big) \le 2\sqrt K R.$$
Finally, combining the above inequalities with (5), with probability at least $1-\delta$ we have
$$R(f) \le \hat R_D(f) + 3M\sqrt{\frac{\log(2/\delta)}{2n}} + \frac{24\sqrt2\,\rho\sqrt K}{\sqrt n}\Big[2\Lambda R + 96R^2\Big(1+\log^{\frac12}(e^2n^3c^3)\log\frac{M\sqrt{nc}}{4R^2}\Big)\Big]\Big(1+\log^{\frac12}(8e^2n^3c^3)\log\frac{M\sqrt n}{\rho B}\Big).$$
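For concreteness, the following sketch builds label-specific representations by clustering the positive and negative instances of each label and mapping every instance to its distances to the resulting centers, in the spirit of the k-means clustering-based construction (LIFT [Zhang and Wu, 2015]) analyzed in Theorem 2. The ratio parameter, the use of scikit-learn's KMeans, and the fallback choices are illustrative assumptions, not the paper's exact recipe.

```python
import numpy as np
from sklearn.cluster import KMeans

def lift_representations(X, Y, ratio=0.1, random_state=0):
    """Minimal LIFT-style sketch (illustrative): for each label j, cluster its
    positive and negative instances separately and map every instance to its
    distances to the resulting centers, yielding a label-specific phi_j(x)."""
    n, c = X.shape[0], Y.shape[1]
    reps = []
    for j in range(c):
        pos, neg = X[Y[:, j] == 1], X[Y[:, j] != 1]
        # clusters per part: a fraction of the smaller part, at least one
        K = max(1, int(np.ceil(ratio * min(len(pos), len(neg)))))
        centers = []
        for part in (pos, neg):
            if len(part) >= K:
                km = KMeans(n_clusters=K, n_init=10, random_state=random_state).fit(part)
                centers.append(km.cluster_centers_)
        centers = np.vstack(centers) if centers else X[:1]
        # phi_j(x): vector of Euclidean distances to the label-specific centers
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        reps.append(dists)
    return reps  # list of c arrays, one (n, K_j)-shaped representation per label

# toy usage
rng = np.random.default_rng(0)
X = rng.normal(size=(60, 5))
Y = (rng.random((60, 3)) < 0.4).astype(int)
reps = lift_representations(X, Y)
print([r.shape for r in reps])
```

The distance-to-center form of $\varphi_j(x)$ is exactly what the bound above exploits through $\|\varphi_j(x)\| \le 2\sqrt{K}R$.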
A.4.2 Proof of Proposition 2

First, we prove that the squared loss is 1 pseudo-Lipschitz of order 2:
$$\begin{aligned}
\big|\ell_b(y_jf_j(x)) - \ell_b(y_jf'_j(x))\big| &= \big|(y_j - f_j(x))^2 - (y_j - f'_j(x))^2\big| = \big|(y_j - f_j(x)) + (y_j - f'_j(x))\big|\cdot\big|(y_j - f_j(x)) - (y_j - f'_j(x))\big| \\
&\le \big(|y_j - f_j(x)| + |y_j - f'_j(x)|\big)\,\big|(y_j - f_j(x)) - (y_j - f'_j(x))\big| \\
&\le \big(1 + |y_j - f_j(x)| + |y_j - f'_j(x)|\big)\,\big|(y_j - f_j(x)) - (y_j - f'_j(x))\big|.
\end{aligned}$$
According to the definition of the pseudo-Lipschitz function, the squared loss is 1 pseudo-Lipschitz of order 2. Then, we have
$$\begin{aligned}
\big|\ell_b(y_jf_j(x)) - \ell_b(y_jf'_j(x))\big| &\le \big(1 + |y_j - f_j(x)| + |y_j - f'_j(x)|\big)\,\big|(y_j - f_j(x)) - (y_j - f'_j(x))\big| \\
&\le \big(1 + 2|y_j| + |f_j(x)| + |f'_j(x)|\big)\,\big|f_j(x) - f'_j(x)\big| \le (3 + 2B)\,\big|f_j(x) - f'_j(x)\big|.
\end{aligned}$$
Second, we prove that the surrogate Decomposable loss is $(3+2B)c$-Lipschitz continuous with respect to the $\ell_\infty$ norm if the base loss $\ell_b$ is the squared loss:
$$\begin{aligned}
\big|\ell_D(f(x), y) - \ell_D(f'(x), y)\big| &= \Big|\sum_{j=1}^c \ell_b(y_jf_j(x)) - \sum_{j=1}^c \ell_b(y_jf'_j(x))\Big| \le \sum_{j=1}^c \big|\ell_b(y_jf_j(x)) - \ell_b(y_jf'_j(x))\big| \\
&\le \sum_{j=1}^c (3+2B)\,\big|f_j(x) - f'_j(x)\big| \le c(3+2B)\max_{j\in[c]}\big|f_j(x)-f'_j(x)\big| = (3+2B)c\,\|f(x)-f'(x)\|_\infty.
\end{aligned}$$

A.4.3 Proof of Theorem 3

Compared with the function class of LSRL where each label corresponds to a Lasso, the function class of LLSF introduces an additional sharing constraint. In fact, the introduction of the sharing constraint reduces the complexity of the function class compared with the original class, so the complexity of the function class of LLSF can be bounded by the complexity of the function class of LSRL where each label corresponds to a Lasso. Hence, the complexity analysis of the function class of LLSF can be converted into bounding the Rademacher complexity of the LSRL function class where each label corresponds to a Lasso. According to the definition, this class means that, in the class of LSRL defined by (1), the base loss $\ell_b$ is the squared loss, the nonlinear mappings $\zeta_j(\cdot)$ and $\varphi_j(\cdot)$ are both identity transformations for any $j\in[c]$, and the constraint $\alpha(w)$ is $\|w_j\|_1\le\Lambda$ for any $j\in[c]$. The proof process is similar to Lemma 1 and Theorem 1, but the upper bound of the worst-case Rademacher complexity $\widetilde{\mathfrak{R}}_{nc}(P(\mathcal{F}))$ here differs from Theorem 1. According to the above definitions, we have
$$\begin{aligned}
\widetilde{\mathfrak{R}}_{nc}(P(\mathcal{F})) &= \sup_{[c]\cdot D}\hat{\mathfrak{R}}_{[c]\cdot D}(P(\mathcal{F})) = \sup_{[c]\cdot D}\mathbb{E}_\epsilon\sup_{p_j(f(x_i))\in P(\mathcal{F})}\frac{1}{nc}\sum_{i=1}^n\sum_{j=1}^c\epsilon_{ij}\,p_j(f(x_i)) = \sup_{[c]\cdot D}\mathbb{E}_\epsilon\sup_{f\in\mathcal{F}}\frac{1}{nc}\sum_{i=1}^n\sum_{j=1}^c\epsilon_{ij}\,f_j(x_i)\\
&= \sup_{\|x^j_i\|_2\le R:\,i\in[n],j\in[c]}\mathbb{E}_\epsilon\sup_{\|w_j\|_1\le\Lambda}\frac{1}{nc}\sum_{i=1}^n\sum_{j=1}^c\epsilon_{ij}\,\langle w_j, x^j_i\rangle \le \sup_{\|x^j_i\|_2\le R}\frac{\Lambda}{nc}\,\mathbb{E}_\epsilon\Big\|\sum_{i=1}^n\sum_{j=1}^c\epsilon_{ij}\,x^j_i\Big\|_\infty &&\text{(H\"older's inequality)}\\
&\le \sup_{\|x^j_i\|_2\le R}\frac{\Lambda}{nc}\,\mathbb{E}_\epsilon\Big\|\sum_{i=1}^n\sum_{j=1}^c\epsilon_{ij}\,x^j_i\Big\|_2 \le \sup_{\|x^j_i\|_2\le R}\frac{\Lambda}{nc}\Big[\mathbb{E}_\epsilon\Big\|\sum_{i=1}^n\sum_{j=1}^c\epsilon_{ij}\,x^j_i\Big\|_2^2\Big]^{\frac12} &&\text{(Jensen's inequality)}\\
&= \sup_{\|x^j_i\|_2\le R}\frac{\Lambda}{nc}\Big[\sum_{i=1}^n\sum_{j=1}^c\big\|x^j_i\big\|_2^2\Big]^{\frac12} \le \frac{\Lambda R}{\sqrt{nc}}. &&\text{(use Lemma 2)}
\end{aligned} \tag{11}$$
Combining Lemma 1 with inequalities (11) and (5), and $\rho = (3+2B)c$, with probability at least $1-\delta$ we have
$$R(f) \le \hat R_D(f) + \frac{24\sqrt2\,(3+2B)c\,\Lambda R}{\sqrt n}\Big(1+\log^{\frac12}(8e^2n^3c^3)\log\frac{M\sqrt n}{\rho B}\Big) + 3M\sqrt{\frac{\log(2/\delta)}{2n}}.$$
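As an illustration of the decoupled Lasso class used to bound LLSF in Theorem 3, the following sketch fits one $\ell_1$-regularized linear model per label. The regularization strength and the toy data are arbitrary assumptions, and LLSF's label-correlation sharing term is deliberately omitted here.

```python
import numpy as np
from sklearn.linear_model import Lasso

def lasso_lsrl(X, Y, alpha=0.05):
    """Minimal sketch (illustrative): learn one l1-regularized linear model per
    label, so each label selects its own sparse, label-specific features.
    This is the decoupled class used to bound LLSF in Theorem 3; the actual
    LLSF objective additionally couples labels through a sharing term."""
    n, c = Y.shape
    W = np.zeros((X.shape[1], c))
    for j in range(c):
        model = Lasso(alpha=alpha, max_iter=5000)
        model.fit(X, Y[:, j])
        W[:, j] = model.coef_
    return W  # column j = sparse label-specific weights for label j

# toy usage: inspect how many features each label keeps
rng = np.random.default_rng(1)
X = rng.normal(size=(120, 30))
Y = (X[:, :4] @ rng.normal(size=(4, 5)) + 0.1 * rng.normal(size=(120, 5)) > 0).astype(float)
W = lasso_lsrl(X, Y)
print("selected features per label:", (np.abs(W) > 1e-8).sum(axis=0))
```

The per-label $\ell_1$ constraint $\|w_j\|_1 \le \Lambda$ is exactly the quantity that enters the bound through Hölder's inequality in (11).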
A.4.4 Proof of Theorem 4

First, we upper bound the worst-case Rademacher complexity $\widetilde{\mathfrak{R}}_{nc}(P(\mathcal{F}))$ for the DNN-based LSRL method. With the definitions in the main paper, $f_j(x)=\sigma_{sig}\big(w_j^\top\varphi_j(x)\big)$ with $\|w_j\|\le\Lambda$, and the label-specific representation is $\varphi_j(x_i)=\sigma_{ReLU}\big\{W_5\big[\sigma_{ReLU}(W_4x_i)\odot\sigma_{sig}(W_3\psi(Y)_j)\big]\big\}$, where each weight matrix satisfies $\|W_l\|\le D$ and $\|x_i\|\le R$. We have
$$\begin{aligned}
\widetilde{\mathfrak{R}}_{nc}(P(\mathcal{F})) &= \sup_{[c]\cdot D}\hat{\mathfrak{R}}_{[c]\cdot D}(P(\mathcal{F})) = \sup_{[c]\cdot D}\mathbb{E}_\epsilon\sup_{p_j(f(x_i))\in P(\mathcal{F})}\frac{1}{nc}\sum_{i=1}^{n}\sum_{j=1}^{c}\epsilon_{ij}\,p_j(f(x_i)) = \sup_{[c]\cdot D}\mathbb{E}_\epsilon\sup\frac{1}{nc}\sum_{i=1}^{n}\sum_{j=1}^{c}\epsilon_{ij}\,\sigma_{sig}\big(w_j^\top\varphi_j(x_i)\big)\\
&\le \frac12\sup_{[c]\cdot D}\mathbb{E}_\epsilon\sup_{\|w_j\|\le\Lambda,\,\varphi_j}\frac{1}{nc}\sum_{i=1}^{n}\sum_{j=1}^{c}\epsilon_{ij}\,w_j^\top\varphi_j(x_i) &&\text{(the Lipschitz constant of the sigmoid activation is bounded by $\tfrac14$)}\\
&\le \frac\Lambda2\sup_{[c]\cdot D}\mathbb{E}_\epsilon\sup_{\varphi_j}\frac{1}{nc}\Big\|\sum_{i=1}^{n}\sum_{j=1}^{c}\epsilon_{ij}\,\sigma_{ReLU}\big\{W_5\big[\sigma_{ReLU}(W_4x_i)\odot\sigma_{sig}(W_3\psi(Y)_j)\big]\big\}\Big\|\\
&\le \Lambda D\,\sup_j\big\|\sigma_{sig}(W_3\psi(Y)_j)\big\|\,\sup_{\|x_i\|\le R}\mathbb{E}_\epsilon\sup_{\|W_4\|\le D}\frac{1}{nc}\Big\|\sum_{i=1}^{n}\sum_{j=1}^{c}\epsilon_{ij}\,W_4x_i\Big\| &&\text{(the ReLU activation is 1-Lipschitz; $\|W_5\|\le D$)}\\
&\le \Lambda D^2\,\sup_j\big\|\sigma_{sig}(W_3\psi(Y)_j)\big\|\,\sup_{\|x_i\|\le R}\frac{1}{nc}\Big[\mathbb{E}_\epsilon\Big\|\sum_{i=1}^{n}\sum_{j=1}^{c}\epsilon_{ij}\,x_i\Big\|^2\Big]^{\frac12} &&\text{(Jensen's inequality)}\\
&\le \Lambda D^2\,\sup_j\big\|\sigma_{sig}(W_3\psi(Y)_j)\big\|\,\frac{R}{\sqrt{nc}}. \tag{12}
\end{aligned}$$
Then, we bound $\sup_j\|\sigma_{sig}(W_3\psi(Y)_j)\|$, where $\psi(Y)=\sigma_{ReLU}\big(\tilde A\,\sigma_{ReLU}(\tilde A\,Y W_1)W_2\big)$ is the label-embedding branch and $\tilde A$ is the normalized adjacency matrix (with self-loops) of the label graph:
$$\begin{aligned}
\sup_j\big\|\sigma_{sig}(W_3\psi(Y)_j)\big\| &\le \frac14\sup_{\|W_3\|\le D}\sup_j\big\|W_3\,\psi(Y)_j\big\| \le \frac D4\sup_j\big\|\psi(Y)_j\big\| &&\text{(sigmoid is $\tfrac14$-Lipschitz; Cauchy-Schwarz inequality)}\\
&= \frac D4\sup_j\sup_{\|W_2\|\le D}\Big\|\sigma_{ReLU}\Big(\sum_i\tilde A_{ji}\,\sigma_{ReLU}\Big(\sum_k\tilde A_{ik}Y_kW_1\Big)W_2\Big)\Big\| \le \frac{D^2}4\sup_j\sum_i\tilde A_{ji}\,\Big\|\sigma_{ReLU}\Big(\sum_k\tilde A_{ik}Y_kW_1\Big)\Big\|\\
&\le \frac{D^3}4\sup_j\sum_i\tilde A_{ji}\,\Big\|\sum_k\tilde A_{ik}Y_k\Big\| &&\text{(Cauchy-Schwarz inequality)}\\
&\le \frac{D^3}4\Big(\max_j\sum_i\tilde A_{ji}\Big)^2\max_k\|Y_k\| \le \frac{D^3R\,(g_{\max}+1)}4. \tag{13}
\end{aligned}$$
We denote by $N(j)$ the index set of the one-hop neighbors of the $j$-th node, by $g_j$ the degree of node $j$, and by $g_{\max}$ the maximum node degree. Then, the row sums of $\tilde A$ are bounded as
$$\sum_i\tilde A_{ji} = \sum_{i\in N(j)\cup\{j\}}\frac{1}{\sqrt{g_j+1}\sqrt{g_i+1}} \le \frac{1}{\sqrt{g_j+1}}\Big(1+\sum_{i\in N(j)}1\Big) = \sqrt{g_j+1} \le \sqrt{g_{\max}+1}, \tag{14}$$
which is used in the last step of (13).

Combining Lemma 1 with inequalities (12), (13), (14) and (5), with probability at least $1-\delta$ we have
$$R(f)\le \hat R_D(f)+\frac{6\sqrt2\,\rho\Lambda D^5R^2(g_{\max}+1)}{\sqrt n}\Big(1+\log^{\frac12}(8e^2n^3c^3)\log\frac{M\sqrt n}{\rho B}\Big)+3M\sqrt{\frac{\log(2/\delta)}{2n}}.$$

NeurIPS Paper Checklist

1. Claims
Question: Do the main claims made in the abstract and introduction accurately reflect the paper's contributions and scope?
Answer: [Yes]
Justification: The paper's contributions and scope are reflected accurately by the main claims made in the abstract and introduction. Please see the Abstract and Introduction sections.
2. Limitations
Question: Does the paper discuss the limitations of the work performed by the authors?
Answer: [Yes]
Justification: This is a purely theoretical work which improves existing theoretical results. The assumptions involved in the theoretical results are explained in detail and are satisfied in the relevant settings. Please see the Conclusion section.
3. Theory Assumptions and Proofs
Question: For each theoretical result, does the paper provide the full set of assumptions and a complete (and correct) proof?
Answer: [Yes]
Justification: This is a purely theoretical work which improves existing theoretical results. The assumptions involved in the theoretical results are explained in detail and are satisfied in the relevant settings. In the appendix, we give the detailed proofs of those theoretical results in the main paper.
4. Experimental Result Reproducibility
Question: Does the paper fully disclose all the information needed to reproduce the main experimental results of the paper to the extent that it affects the main claims and/or conclusions of the paper (regardless of whether the code and data are provided or not)?
Answer: [NA]
Justification: This is a purely theoretical work. This paper does not include experiments.
5. Open access to data and code
Question: Does the paper provide open access to the data and code, with sufficient instructions to faithfully reproduce the main experimental results, as described in supplemental material?
Answer: [NA]
Justification: This is a purely theoretical work. This paper does not include experiments requiring code.
6. Experimental Setting/Details
Question: Does the paper specify all the training and test details (e.g., data splits, hyperparameters, how they were chosen, type of optimizer, etc.) necessary to understand the results?
Answer: [NA]
Justification: This is a purely theoretical work. This paper does not include experiments.
7. Experiment Statistical Significance
Question: Does the paper report error bars suitably and correctly defined or other appropriate information about the statistical significance of the experiments?
Answer: [NA]
Justification: This is a purely theoretical work. This paper does not include experiments.
8. Experiments Compute Resources
Question: For each experiment, does the paper provide sufficient information on the computer resources (type of compute workers, memory, time of execution) needed to reproduce the experiments?
Answer: [NA]
Justification: This is a purely theoretical work. This paper does not include experiments.
9. Code of Ethics
Question: Does the research conducted in the paper conform, in every respect, with the NeurIPS Code of Ethics https://neurips.cc/public/EthicsGuidelines?
Answer: [Yes]
Justification: This is theoretical research, and the research conducted in the paper conforms with the NeurIPS Code of Ethics.
10. Broader Impacts
Question: Does the paper discuss both potential positive societal impacts and negative societal impacts of the work performed?
Answer: [NA]
Justification: This is theoretical research that does not have a potential negative societal impact.
11. Safeguards
Question: Does the paper describe safeguards that have been put in place for responsible release of data or models that have a high risk for misuse (e.g., pretrained language models, image generators, or scraped datasets)?
Answer: [NA]
Justification: This is a purely theoretical work. This paper poses no such risks.
12. Licenses for existing assets
Question: Are the creators or original owners of assets (e.g., code, data, models), used in the paper, properly credited and are the license and terms of use explicitly mentioned and properly respected?
Answer: [Yes]
Justification: This is a purely theoretical work.
13. New Assets
Question: Are new assets introduced in the paper well documented and is the documentation provided alongside the assets?
Answer: [NA]
Justification: This is a purely theoretical work.
14. Crowdsourcing and Research with Human Subjects
Question: For crowdsourcing experiments and research with human subjects, does the paper include the full text of instructions given to participants and screenshots, if applicable, as well as details about compensation (if any)?
Answer: [NA]
Justification: This is a purely theoretical work. This paper does not involve crowdsourcing nor research with human subjects.
15.
Institutional Review Board (IRB) Approvals or Equivalent for Research with Human Subjects
Question: Does the paper describe potential risks incurred by study participants, whether such risks were disclosed to the subjects, and whether Institutional Review Board (IRB) approvals (or an equivalent approval/review based on the requirements of your country or institution) were obtained?
Answer: [NA]
Justification: This is a purely theoretical work. This paper does not involve crowdsourcing nor research with human subjects.