# Multi-Dimensional Classification via Sparse Label Encoding

Bin-Bin Jia¹ ², Min-Ling Zhang¹ ³

¹School of Computer Science and Engineering, Southeast University, Nanjing 210096, China. ²College of Electrical and Information Engineering, Lanzhou University of Technology, Lanzhou 730050, China. ³Key Lab. of Computer Network and Information Integration (Southeast University), Ministry of Education, China. Correspondence to: Min-Ling Zhang.

Proceedings of the 38th International Conference on Machine Learning, PMLR 139, 2021. Copyright 2021 by the author(s).

Abstract. In multi-dimensional classification (MDC), there are multiple class variables in the output space, each of them corresponding to one heterogeneous class space. Due to the heterogeneity of class spaces, it is quite challenging to consider the dependencies among class variables when learning from MDC examples. In this paper, we propose a novel MDC approach named SLEM which learns the predictive model in an encoded label space instead of the original heterogeneous one. Specifically, SLEM works in an encoding-training-decoding framework. In the encoding phase, each class vector is mapped into a real-valued one via three cascaded operations, including pairwise grouping, one-hot conversion and sparse linear encoding. In the training phase, a multi-output regression model is learned within the encoded label space. In the decoding phase, the predicted class vector is obtained by adapting orthogonal matching pursuit over the outputs of the learned multi-output regression model. Experimental results clearly validate the superiority of SLEM against state-of-the-art MDC approaches.

## 1. Introduction

In traditional supervised learning, the semantics of objects are usually characterized by only one output variable, e.g., multi-class classification. However, in some real-world applications, the semantics of objects need to be characterized along different dimensions. For example, e-commerce websites should categorize laptops along different dimensions (e.g., brand, operating system, CPU, GPU, etc.) to make it more convenient for consumers to choose the right laptop. In fact, similar requirements widely exist in various fields, e.g., text classification (Shatkay et al., 2008), bioinformatics (Rodríguez et al., 2012), resource allocation (Al Muktadir et al., 2019), ecology (Verma et al., 2021), etc. These applications can be naturally formalized under the multi-dimensional classification (MDC) learning framework (Read et al., 2014a; Ma & Chen, 2018; Jia & Zhang, 2020a; Wang et al., 2020).

In MDC, each example is represented by a single instance while associated with multiple class variables. Here, each class variable corresponds to one specific class space which characterizes the semantics of objects from one dimension. Formally speaking, let $\mathcal{X} = \mathbb{R}^d$ be the input (feature) space and $\mathcal{Y} = \mathcal{C}_1 \times \mathcal{C}_2 \times \cdots \times \mathcal{C}_q$ be the output space, which corresponds to the Cartesian product of $q$ class spaces. Here, each class space $\mathcal{C}_j$ ($1 \le j \le q$) consists of $K_j$ possible class labels, i.e., $\mathcal{C}_j = \{c_1^j, c_2^j, \ldots, c_{K_j}^j\}$. Given the MDC training set $D = \{(\mathbf{x}_i, \mathbf{y}_i) \mid 1 \le i \le m\}$ with $m$ training examples, for each example $(\mathbf{x}_i, \mathbf{y}_i) \in D$, $\mathbf{x}_i = [x_{i1}, x_{i2}, \ldots, x_{id}] \in \mathcal{X}$ is a $d$-dimensional feature vector and $\mathbf{y}_i = [y_{i1}, y_{i2}, \ldots, y_{iq}] \in \mathcal{Y}$ is the class vector associated with $\mathbf{x}_i$, where each component $y_{ij}$ is one possible class label in $\mathcal{C}_j$, i.e., $y_{ij} \in \mathcal{C}_j$.
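To make the notation concrete, the following minimal Python snippet (ours, not part of the paper) instantiates the laptop example above: one instance is a feature vector in $\mathbb{R}^d$ paired with a class vector whose components come from heterogeneous class spaces. The class spaces and values are purely hypothetical.

```python
# A hypothetical MDC example: one instance described by a d-dimensional
# feature vector and a class vector over q = 3 heterogeneous class spaces.
class_spaces = {
    "brand": ["A", "B", "C"],            # C_1, K_1 = 3
    "os": ["Win", "Linux"],              # C_2, K_2 = 2
    "cpu": ["i5", "i7", "i9", "ryzen"],  # C_3, K_3 = 4
}

x = [0.3, 1.2, -0.7, 0.0, 2.1]   # feature vector in R^d, d = 5
y = ["B", "Linux", "i7"]         # class vector, y_j must lie in C_j

# sanity check: each component belongs to its own class space
for yj, labels in zip(y, class_spaces.values()):
    assert yj in labels
```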
The MDC task aims to learn a predictive model $f: \mathcal{X} \mapsto \mathcal{Y}$ from $D$ which can return a proper class vector $f(\mathbf{x}) \in \mathcal{Y}$ for an unseen instance $\mathbf{x}$.

To solve the MDC problem, we can independently deal with each dimension, which is actually a multi-class classification problem. Nonetheless, this strategy does not consider potential dependencies among class spaces, which would degenerate its generalization ability. Therefore, most existing MDC studies focus on how to model class dependencies more appropriately, e.g., by specifying a chaining structure over class variables (Zaragoza et al., 2011; Read et al., 2014b), partitioning the class spaces into several groups (Read et al., 2014a), or learning a directed acyclic graph structure over class variables (Bielza et al., 2011; Gil-Begue et al., 2021).

Due to the heterogeneity of class spaces, it is quite challenging to directly consider the dependencies among class variables in the original output space, as most existing MDC approaches do. In this paper, we attempt to learn a predictive model which solves the MDC problem in a transformed label space. Accordingly, we propose a novel approach named SLEM, i.e., Sparse Label Encoding for Multi-dimensional classification, which works in an encoding-training-decoding framework by utilizing the sparse property of the transformed label space. In the encoding phase, the output space is transformed into a new one via three cascaded operations, including pairwise grouping, one-hot conversion and sparse linear encoding. To be more specific, pairwise grouping aims at making the results of one-hot conversion sparser, and sparse linear encoding further maps the sparse label vector into a real-valued one. In the training phase, SLEM learns a multi-output regression model in the encoded label space to deal with the resulting problem. In the decoding phase, based on the predicted real-valued label vector obtained by feeding the unseen instance into the learned multi-output regressor, SLEM conducts three inverse operations, corresponding to the encoding phase in reverse order, to predict the class vector for the unseen instance. The inverse operation of sparse linear encoding is implemented by adapting the orthogonal matching pursuit algorithm (Tropp & Gilbert, 2007), while the other two inverse operations can be implemented directly. Experimental results clearly validate the superiority of SLEM against state-of-the-art MDC approaches.

The rest of this paper is organized as follows. Firstly, related works on MDC are briefly discussed. Secondly, technical details of the proposed approach are presented. Thirdly, experimental results of comparative studies are reported. Finally, we conclude this paper.

## 2. Related Work

To solve the MDC problem, we can either decompose it into a number of independent multi-class classification problems, one per class space, or transform it into a single multi-class classification problem where each distinct class combination in the training set is regarded as one new class. However, the first strategy cannot consider the dependencies among class spaces (i.e., underfitting), while the second strategy cannot return class combinations not appearing in the training set (i.e., overfitting).
To alleviate the underfitting problem, the classifier chains model trains a chain of multi-class classifiers, one per class space, where each subsequent classifier on the chain augments its feature space with the class variables used to train the preceding classifiers (Zaragoza et al., 2011; Read et al., 2014b; Liu et al., 2017). To alleviate the overfitting problem, the super-class model partitions the class spaces into several groups, where each group is treated as a new class space (Read et al., 2014a). Besides, due to the powerful modeling abilities of probabilistic graphical models, we can also learn a directed acyclic graph over the class variables to explicitly model the class dependencies (Zhu et al., 2016; Bolt & van der Gaag, 2017; Benjumeda et al., 2018).

The gMML approach solves the MDC problem in a binary-valued label space which is obtained by concatenating the one-vs-rest decompositions of the class spaces (Ma & Chen, 2018). Because this simple concatenation cannot blend the heterogeneous class spaces into an integrated label space, there is no intrinsic difference between the decomposed label space and the original one. Thus, the responsibility for considering class dependencies has to be taken by the predictive model subsequently induced in the decomposed label space, which gMML accomplishes via a metric approach. Besides, it is less reasonable to directly align predictive outputs of class labels from different class spaces due to the heterogeneity assumption in MDC. To address or alleviate the aforementioned issues, the approach proposed in this paper attempts to encode the multi-dimensional class spaces into an integrated label space, within which a predictive model is induced. It is worth noting that the label encoding strategy (Tai & Lin, 2012; Shen et al., 2018; Liu & Shen, 2019; Liu et al., 2019) has been successfully applied to solve the multi-label classification (MLC) problem (Zhang & Zhou, 2014; Gibaja & Ventura, 2015), which can be regarded as a degenerate version of MDC obtained by restricting each class variable to be binary-valued.

## 3. The SLEM Approach

The workflow of the proposed SLEM approach is shown in Figure 1, where the whole process can be divided into three parts: encoding, training and decoding. In the rest of this section, we present their technical details.

Figure 1. The workflow of the proposed SLEM approach. Here, $\mathbf{u}_i = P_{\mathcal{Y}}(\mathbf{y}_i)$, $\mathbf{v}_i = \Phi_{\mathcal{U}}(\mathbf{u}_i)$ and $\mathbf{z}_i = A\mathbf{v}_i$ correspond to the three cascaded operations, i.e., pairwise grouping in output space $\mathcal{Y}$, one-hot conversion in output space $\mathcal{U}$, and sparse linear encoding with matrix $A$, respectively; $\widetilde{D} = \{(\mathbf{x}_i, \mathbf{z}_i) \mid 1 \le i \le m\}$ is a multi-output regression data set; $h = \mathcal{M}(\widetilde{D})$ is the learned multi-output regressor, where $\mathcal{M}$ corresponds to the employed multi-output regression algorithm; $\hat{\mathbf{z}} = h(\mathbf{x})$ and $\hat{\mathbf{v}} = \mathcal{R}(\hat{\mathbf{z}})$, where $\mathcal{R}$ is the sparsity reconstruction algorithm; $\hat{\mathbf{u}} = \Phi_{\mathcal{U}}^{-1}(\hat{\mathbf{v}})$ and $\hat{\mathbf{y}} = P_{\mathcal{Y}}^{-1}(\hat{\mathbf{u}})$, where $\Phi_{\mathcal{U}}^{-1}(\cdot)$ and $P_{\mathcal{Y}}^{-1}(\cdot)$ correspond to the inverse functions of $\Phi_{\mathcal{U}}(\cdot)$ and $P_{\mathcal{Y}}(\cdot)$, respectively.

### 3.1. Encoding Phase

For each training example $(\mathbf{x}_i, \mathbf{y}_i) \in D$, we generally convert the nominal class vector $\mathbf{y}_i \in \mathcal{Y}$ into its one-hot form $\mathbf{y}_i' \in \{0,1\}^{\sum_{j=1}^{q} K_j}$ whenever we need to perform numeric computations on it. (Specifically, the one-hot form of the $j$-th entry $y_{ij}$ in $\mathbf{y}_i$ is a length-$K_j$ vector $\mathbf{y}_{ij}' \in \{0,1\}^{K_j}$, where the $a$-th entry of $\mathbf{y}_{ij}'$ equals 1 if $y_{ij} = c_a^j$ and 0 otherwise; the vector $\mathbf{y}_i'$ corresponds to the concatenation of the $q$ vectors $\mathbf{y}_{ij}'$, $1 \le j \le q$.) We denote the one-hot conversion in output space $\mathcal{Y}$ as $\mathbf{y}_i' = \Phi_{\mathcal{Y}}(\mathbf{y}_i)$ and its inverse as $\mathbf{y}_i = \Phi_{\mathcal{Y}}^{-1}(\mathbf{y}_i')$. It is easy to see that the length-$\sum_{j=1}^{q} K_j$ vector $\mathbf{y}_i'$ is always $q$-sparse (a vector is $q$-sparse if it has at most $q$ non-zero entries), and that there is one and only one 1 among its $(1+\sum_{j=1}^{a-1} K_j)$-th to $(\sum_{j=1}^{a} K_j)$-th entries, which we call the $a$-th local group of $\mathbf{y}_i'$ ($1 \le a \le q$). In this paper, we refer to this property as local sparsity. For the proposed SLEM approach, the decoding process to be introduced in Subsection 3.3 will utilize this sparse property. However, in light of our observations on existing real-world MDC data sets (cf. Table 1), the one-hot vector is usually not sparse enough to ensure that the sparse reconstruction algorithm works properly. To tackle this issue, motivated by the idea of super-class
partition (Read et al., 2014a), we further group the $q$ class spaces $\mathcal{C}_j$ ($1 \le j \le q$) into $\lfloor q/2 \rfloor$ pairs (plus a singleton class space if $q$ is odd). Let $\mathcal{U} = \mathcal{C}_1' \times \mathcal{C}_2' \times \cdots \times \mathcal{C}_{\lceil q/2 \rceil}'$ be the newly-obtained output space derived from $\mathcal{Y} = \mathcal{C}_1 \times \mathcal{C}_2 \times \cdots \times \mathcal{C}_q$, and let $[\tau(1), \ldots, \tau(q)]$ be the ascending order of the class spaces according to their numbers of class labels, i.e., $\mathcal{C}_{\tau(1)}$ and $\mathcal{C}_{\tau(q)}$ include the least and the most class labels, respectively. Each class space $\mathcal{C}_j'$ ($1 \le j \le \lfloor q/2 \rfloor$) corresponds to the powerset transformation of $\mathcal{C}_{\tau(j)}$ and $\mathcal{C}_{\tau(q'-j+1)}$, i.e., each class label in $\mathcal{C}_j'$ corresponds to one distinct class combination in the Cartesian product $\mathcal{C}_{\tau(j)} \times \mathcal{C}_{\tau(q'-j+1)}$. Here, $q'$ equals $q$ if $q$ is even and $q-1$ if $q$ is odd. Let $s_j$ be the number of class labels in $\mathcal{C}_j'$; generally we have $s_j = K_{\tau(j)} \cdot K_{\tau(q'-j+1)}$. Besides, if $q$ is odd, then $\mathcal{C}_{\lceil q/2 \rceil}' = \mathcal{C}_{\tau(q)}$ and $s_{\lceil q/2 \rceil} = K_{\tau(q)}$. We refer to this operation as pairwise grouping in this paper.

With the pairwise grouping operation, the output space $\mathcal{Y}$ is transformed into $\mathcal{U}$, and the length-$q$ class vector $\mathbf{y}_i \in \mathcal{Y}$ is accordingly transformed into a length-$\lceil q/2 \rceil$ class vector $\mathbf{u}_i \in \mathcal{U}$. We denote the pairwise grouping operation in output space $\mathcal{Y}$ as $\mathbf{u}_i = P_{\mathcal{Y}}(\mathbf{y}_i)$ and its inverse as $\mathbf{y}_i = P_{\mathcal{Y}}^{-1}(\mathbf{u}_i)$. Here, $P_{\mathcal{Y}}(\cdot)$ just heuristically aims at making the numbers of class labels in the class spaces of $\mathcal{U}$ as balanced as possible, and more subtle designs can be explored in the future. In summary, this operation brings two benefits. The first is that $\Phi_{\mathcal{U}}(\mathbf{u}_i)$ is sparser than $\Phi_{\mathcal{Y}}(\mathbf{y}_i)$, which is also the original intention of this operation. (Because $a \cdot b$ must be greater than $a + b$ if $a, b > 2$, the length of $\Phi_{\mathcal{U}}(\mathbf{u}_i)$, i.e., $\sum_{j=1}^{\lceil q/2 \rceil} s_j$, must be greater than the length of $\Phi_{\mathcal{Y}}(\mathbf{y}_i)$, i.e., $\sum_{j=1}^{q} K_j$, while the sparsity level of $\Phi_{\mathcal{U}}(\mathbf{u}_i)$, i.e., $\lceil q/2 \rceil$, is less than the sparsity level of $\Phi_{\mathcal{Y}}(\mathbf{y}_i)$, i.e., $q$.) The second can be regarded as a bonus of this operation: it considers the dependencies between the class spaces grouped into one pair, which can be modeled more reliably than dependencies among many class spaces given a limited number of training examples.

We denote the one-hot conversion of $\mathbf{u}_i$ in output space $\mathcal{U}$ as $\mathbf{v}_i$, i.e., $\mathbf{v}_i = \Phi_{\mathcal{U}}(\mathbf{u}_i)$. It is easy to see that the length-$\sum_{j=1}^{\lceil q/2 \rceil} s_j$ vector $\mathbf{v}_i$ is always $\lceil q/2 \rceil$-sparse, and there is one and only one 1 among its $(1+\sum_{j=1}^{a-1} s_j)$-th to $(\sum_{j=1}^{a} s_j)$-th entries, i.e., the $a$-th local group of $\mathbf{v}_i$ ($1 \le a \le \lceil q/2 \rceil$). For the sake of brevity, we further denote the length of $\mathbf{v}_i$ as $s = \sum_{j=1}^{\lceil q/2 \rceil} s_j$ and its sparsity level as $k = \lceil q/2 \rceil$ in the rest of this paper.

Let $A \in \mathbb{R}^{s' \times s}$ be an encoding matrix which linearly encodes any length-$s$ vector $\mathbf{v}$ into a length-$s'$ vector $\mathbf{z}$, i.e., $\mathbf{z} = A\mathbf{v}$. Here, we require $s' \le s$ and expect $s'$ to be close to $s$. In this paper, we focus more on label encoding rather than label compression, but the whole proposed framework can also work properly under the label compression scenario.
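For illustration, the following sketch (our own NumPy rendering, not the authors' code) walks through the three cascaded encoding operations on a toy example. The integer encoding of pair labels, the function names, and the Gaussian scaling of $A$ are our assumptions; a random Gaussian matrix with $s' = s - 1$ is consistent with the setting used later in the experiments.

```python
import numpy as np

def pairwise_grouping(y, K):
    """P_Y (sketch): group q class spaces into pairs (plus a singleton if q is odd).

    y : per-dimension label indices (0-based), length q
    K : class-space sizes [K_1, ..., K_q]
    Returns grouped label indices u and grouped space sizes [s_1, ..., s_k].
    """
    q = len(K)
    tau = np.argsort(K)                      # ascending order of class-space sizes
    qp = q if q % 2 == 0 else q - 1          # q' in the paper
    u, sizes = [], []
    for j in range(qp // 2):                 # pair C_{tau(j)} with C_{tau(q'-j+1)}
        a, b = tau[j], tau[qp - j - 1]
        u.append(y[a] * K[b] + y[b])         # hypothetical index for the pair label
        sizes.append(K[a] * K[b])
    if q % 2 == 1:                           # largest class space stays a singleton
        a = tau[q - 1]
        u.append(y[a]); sizes.append(K[a])
    return u, sizes

def one_hot(u, sizes):
    """Phi_U (sketch): concatenate one-hot vectors of the grouped labels."""
    v = np.zeros(sum(sizes))
    offset = 0
    for uj, sj in zip(u, sizes):
        v[offset + uj] = 1.0
        offset += sj
    return v

# sparse linear encoding with a random Gaussian matrix A (s' = s - 1)
rng = np.random.default_rng(0)
K = [4, 2, 5]                   # hypothetical class-space sizes
y = [2, 0, 4]                   # hypothetical class vector in index form
u, sizes = pairwise_grouping(y, K)
v = one_hot(u, sizes)           # k-sparse with k = ceil(q/2)
s, s_prime = v.size, v.size - 1
A = rng.normal(0.0, 1.0 / np.sqrt(s_prime), size=(s_prime, s))  # scaling is our choice
z = A @ v                       # encoded real-valued label vector
```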
According to the theory of compressed sensing (Donoho, 2006), there exist valid reconstruction algorithms which can recover the vector $\mathbf{v}$ from the compressed observation $\mathbf{z}$ if $\mathbf{v}$ is $k$-sparse and $A$ satisfies the $k$-RIP given in the following Definition 1.

Definition 1 (Restricted Isometry Property). For a matrix $A$, if there is a constant $\delta_k \in [0, 1)$ which satisfies

$$(1-\delta_k)\|\mathbf{v}\|_2^2 \;\le\; \|A\mathbf{v}\|_2^2 \;\le\; (1+\delta_k)\|\mathbf{v}\|_2^2 \qquad (1)$$

where $\mathbf{v}$ is any $k$-sparse vector, then $A$ is said to satisfy the $k$-order Restricted Isometry Property ($k$-RIP).

It has been proved that some random matrices, e.g., Gaussian matrices and Bernoulli matrices, satisfy the $k$-RIP with large probability (Baraniuk et al., 2008). With the encoding matrix $A$, each binary-valued vector $\mathbf{v}_i \in \{0,1\}^s$ can be mapped into a real-valued vector $\mathbf{z}_i \in \mathbb{R}^{s'}$. Obviously, each entry of $\mathbf{z}_i$ is related to all the entries of $\mathbf{y}_i$, each of which belongs to one heterogeneous class space $\mathcal{C}_j$ ($1 \le j \le q$). Therefore, even if the subsequently induced predictive model deals with each entry of $\mathbf{z}_i$ independently, the $q$ class spaces in the original output space $\mathcal{Y}$ are still tackled in a joint manner.

### 3.2. Training Phase

With the three cascaded operations, i.e., pairwise grouping, one-hot conversion and sparse linear encoding, each class vector $\mathbf{y}_i$ can be transformed into a real-valued vector $\mathbf{z}_i$. We then obtain a new data set $\widetilde{D} = \{(\mathbf{x}_i, \mathbf{z}_i) \mid 1 \le i \le m\}$ based on $D$, which actually corresponds to a multi-output regression problem (Borchani et al., 2015; Reeve & Kaban, 2020). To solve the resulting problem, we learn a multi-output regressor $h(\mathbf{x}) = W^\top \mathbf{x} + \mathbf{b}$ by optimizing the following formulation:

$$\min_{W, \mathbf{b}, \hat{V}} \ \frac{1}{2}\|W\|_F^2 + \lambda \sum_{i=1}^{m} \left( \|h(\mathbf{x}_i) - \mathbf{z}_i\|_2^2 + \gamma_1 \|h(\mathbf{x}_i) - A\hat{\mathbf{v}}_i\|_2^2 + \gamma_2 \|\hat{\mathbf{v}}_i - \mathbf{v}_i\|_1 \right) \qquad (2)$$

Here, $W = [\mathbf{w}_1, \mathbf{w}_2, \ldots, \mathbf{w}_{s'}] \in \mathbb{R}^{d \times s'}$ and $\mathbf{b} = [b_1, b_2, \ldots, b_{s'}]^\top$ are the model parameters of $h$ to be determined, $\hat{V} = [\hat{\mathbf{v}}_1, \ldots, \hat{\mathbf{v}}_m]^\top \in \mathbb{R}^{m \times s}$ with $\hat{\mathbf{v}}_i$ corresponding to the recovered sparse vector for $\mathbf{v}_i$ based on its prediction $h(\mathbf{x}_i)$, and $\lambda$, $\gamma_1$ and $\gamma_2$ are three trade-off parameters. Specifically, the term $\|W\|_F^2$ penalizes the model's complexity, the term $\|h(\mathbf{x}_i) - \mathbf{z}_i\|_2^2$ corresponds to the squared error loss, and the two terms $\|h(\mathbf{x}_i) - A\hat{\mathbf{v}}_i\|_2^2$ and $\gamma_2\|\hat{\mathbf{v}}_i - \mathbf{v}_i\|_1$ require that the regressor $h$ facilitates the subsequent sparse reconstruction procedure in the decoding phase. To solve problem (2), we derive an alternating method in which the two sets of parameters $\{W, \mathbf{b}\}$ and $\hat{V}$ are optimized alternately until convergence.

Optimizing $W$ and $\mathbf{b}$ with $\hat{V}$ fixed. When $\hat{V}$ is fixed, the optimization problem (2) can be reformulated as follows:

$$\min_{W, \mathbf{b}} \ \frac{1}{2}\|W\|_F^2 + \lambda \sum_{i=1}^{m} \left( \|h(\mathbf{x}_i) - \mathbf{z}_i\|_2^2 + \gamma_1 \|h(\mathbf{x}_i) - A\hat{\mathbf{v}}_i\|_2^2 \right) \qquad (3)$$

Theorem 1. The optimization problem (3) can be equivalently reformulated as follows:

$$\min_{W, \mathbf{b}} \ \frac{1}{2}\|W\|_F^2 + \frac{\bar{\lambda}}{2} \sum_{i=1}^{m} \|h(\mathbf{x}_i) - \bar{\mathbf{z}}_i\|_2^2 \qquad (4)$$

where $\bar{\lambda} = 2\lambda(1+\gamma_1)$ and $\bar{\mathbf{z}}_i = \frac{\mathbf{z}_i + \gamma_1 A\hat{\mathbf{v}}_i}{1+\gamma_1}$.

Proof. In Eq.(3), each summation term of the second part of the objective function can be reformulated as follows:

$$\begin{aligned}
&\|h(\mathbf{x}_i) - \mathbf{z}_i\|_2^2 + \gamma_1 \|h(\mathbf{x}_i) - A\hat{\mathbf{v}}_i\|_2^2 \\
&\quad = \|h(\mathbf{x}_i)\|_2^2 - 2\langle h(\mathbf{x}_i), \mathbf{z}_i\rangle + \|\mathbf{z}_i\|_2^2 + \gamma_1\left(\|h(\mathbf{x}_i)\|_2^2 - 2\langle h(\mathbf{x}_i), A\hat{\mathbf{v}}_i\rangle + \|A\hat{\mathbf{v}}_i\|_2^2\right) \\
&\quad = (1+\gamma_1)\left(\|h(\mathbf{x}_i)\|_2^2 - 2\left\langle h(\mathbf{x}_i), \tfrac{\mathbf{z}_i + \gamma_1 A\hat{\mathbf{v}}_i}{1+\gamma_1}\right\rangle + \left\|\tfrac{\mathbf{z}_i + \gamma_1 A\hat{\mathbf{v}}_i}{1+\gamma_1}\right\|_2^2\right) + C_i \\
&\quad = (1+\gamma_1)\left\|h(\mathbf{x}_i) - \tfrac{\mathbf{z}_i + \gamma_1 A\hat{\mathbf{v}}_i}{1+\gamma_1}\right\|_2^2 + C_i
\end{aligned} \qquad (5)$$

where $\langle \cdot, \cdot \rangle$ returns the inner product of two vectors and $C_i = \|\mathbf{z}_i\|_2^2 + \gamma_1\|A\hat{\mathbf{v}}_i\|_2^2 - \frac{\|\mathbf{z}_i + \gamma_1 A\hat{\mathbf{v}}_i\|_2^2}{1+\gamma_1}$ is a constant which does not depend on the variables $W$ and $\mathbf{b}$. Plugging Eq.(5) into Eq.(3) completes the proof.
According to Theorem 1, formulation (3) can be optimized by equivalently transforming it into the general regression problem (4), whose closed-form solution can be obtained by solving the following linear equations:

$$\begin{bmatrix} I_d + \bar{\lambda} X^\top X & \bar{\lambda} X^\top \mathbf{1}_m \\ \mathbf{1}_m^\top X & m \end{bmatrix} \begin{bmatrix} W \\ \mathbf{b}^\top \end{bmatrix} = \begin{bmatrix} \bar{\lambda} X^\top \bar{Z} \\ \mathbf{1}_m^\top \bar{Z} \end{bmatrix}$$

where $I_d$ is the $d \times d$ identity matrix, $\mathbf{1}_m$ is a length-$m$ column vector of all ones, $X = [\mathbf{x}_1, \ldots, \mathbf{x}_m]^\top \in \mathbb{R}^{m \times d}$ is the instance matrix, and $\bar{Z} = [\bar{\mathbf{z}}_1, \ldots, \bar{\mathbf{z}}_m]^\top \in \mathbb{R}^{m \times s'}$ is the real-valued label matrix.

Moreover, if we transform the $d$-dimensional feature space into a $d'$-dimensional feature space with a nonlinear mapping function $\phi: \mathbb{R}^d \to \mathbb{R}^{d'}$, we can learn a nonlinear multi-output regressor, i.e., $h(\mathbf{x}) = W^\top \phi(\mathbf{x}) + \mathbf{b}$, where now $W = [\mathbf{w}_1, \mathbf{w}_2, \ldots, \mathbf{w}_{s'}] \in \mathbb{R}^{d' \times s'}$. According to the Representer Theorem (Schölkopf & Smola, 2002), the predictive model can be expressed as a linear combination of the training instances under fairly general conditions, i.e., $\mathbf{w}_j = \sum_{i=1}^{m} \theta_{ji}\phi(\mathbf{x}_i)$. Let $\boldsymbol{\theta}_j = [\theta_{j1}, \ldots, \theta_{jm}]^\top \in \mathbb{R}^{m \times 1}$, $\Theta = [\boldsymbol{\theta}_1, \ldots, \boldsymbol{\theta}_{s'}] \in \mathbb{R}^{m \times s'}$, and $\Phi = [\phi(\mathbf{x}_1), \ldots, \phi(\mathbf{x}_m)]^\top \in \mathbb{R}^{m \times d'}$. The closed-form solution of the kernelized problem (4) can then be obtained by solving the following linear equations:

$$\begin{bmatrix} I_m + \bar{\lambda} K & \bar{\lambda} \mathbf{1}_m \\ \mathbf{1}_m^\top K & m \end{bmatrix} \begin{bmatrix} \Theta \\ \mathbf{b}^\top \end{bmatrix} = \begin{bmatrix} \bar{\lambda} \bar{Z} \\ \mathbf{1}_m^\top \bar{Z} \end{bmatrix}$$

where $K = \Phi\Phi^\top \in \mathbb{R}^{m \times m}$ with $(i,j)$-th element $K_{ij} = \langle \phi(\mathbf{x}_i), \phi(\mathbf{x}_j)\rangle = \kappa(\mathbf{x}_i, \mathbf{x}_j)$. Here, $\kappa(\cdot, \cdot)$ is the kernel function, which is utilized to avoid directly computing inner products in the $d'$-dimensional feature space.
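As a concrete illustration of this first alternating step, the following NumPy sketch (ours; the function and variable names are assumptions) assembles and solves the block linear system above for $W$ and $\mathbf{b}$ in the linear (non-kernelized) case, given the current targets $\bar{Z}$.

```python
import numpy as np

def solve_regressor(X, Z_bar, lam_bar):
    """Closed-form solution of problem (4) for the linear model h(x) = W^T x + b.

    X       : (m, d) instance matrix
    Z_bar   : (m, s') targets with rows zbar_i = (z_i + gamma1 * A vhat_i) / (1 + gamma1)
    lam_bar : scalar lam_bar = 2 * lambda * (1 + gamma1)
    Returns W of shape (d, s') and b of shape (s',).
    """
    m, d = X.shape
    ones = np.ones((m, 1))
    # Setting the gradient of (4) to zero gives the block system
    # [ I_d + lam_bar*X'X   lam_bar*X'1 ] [ W  ]   [ lam_bar*X'Zbar ]
    # [ 1'X                 m           ] [ b' ] = [ 1'Zbar         ]
    lhs = np.block([
        [np.eye(d) + lam_bar * (X.T @ X), lam_bar * (X.T @ ones)],
        [ones.T @ X,                      np.array([[float(m)]])],
    ])
    rhs = np.vstack([lam_bar * (X.T @ Z_bar), ones.T @ Z_bar])
    sol = np.linalg.solve(lhs, rhs)
    return sol[:d, :], sol[d, :]

# hypothetical shapes: m = 100 examples, d = 8 features, s' = 12 encoded outputs
rng = np.random.default_rng(0)
X, Z_bar = rng.normal(size=(100, 8)), rng.normal(size=(100, 12))
W, b = solve_regressor(X, Z_bar, lam_bar=4.0)   # lambda = gamma1 = 1 gives lam_bar = 4
```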
Optimizing $\hat{V}$ with $W$ and $\mathbf{b}$ fixed. When $W$ and $\mathbf{b}$ are fixed, the optimization problem (2) can be reformulated as a total of $m$ independent problems:

$$\min_{\hat{\mathbf{v}}_i} \ \|h(\mathbf{x}_i) - A\hat{\mathbf{v}}_i\|_2^2 + \gamma_2 \|\hat{\mathbf{v}}_i - \mathbf{v}_i\|_1 \qquad (6)$$

In this paper, we solve this problem via the accelerated proximal gradient (APG) method (Huang et al., 2016).

Theorem 2. Let $g(\hat{\mathbf{v}}_i) = \|h(\mathbf{x}_i) - A\hat{\mathbf{v}}_i\|_2^2$. For the differentiable function $g(\hat{\mathbf{v}}_i)$, its gradient $\nabla g(\hat{\mathbf{v}}_i)$ is Lipschitz continuous with Lipschitz constant $L_f = 2\|A^\top A\|_F$, where $\nabla$ denotes the gradient operator.

Proof. The gradient of $g(\hat{\mathbf{v}}_i)$ can be calculated as

$$\nabla g(\hat{\mathbf{v}}_i) = 2A^\top A\hat{\mathbf{v}}_i - 2A^\top h(\mathbf{x}_i)$$

Given any $\hat{\mathbf{v}}_i'$ and $\hat{\mathbf{v}}_i$, we have

$$\frac{\|\nabla g(\hat{\mathbf{v}}_i') - \nabla g(\hat{\mathbf{v}}_i)\|_2}{\|\hat{\mathbf{v}}_i' - \hat{\mathbf{v}}_i\|_2} = \frac{\|2A^\top A(\hat{\mathbf{v}}_i' - \hat{\mathbf{v}}_i)\|_2}{\|\hat{\mathbf{v}}_i' - \hat{\mathbf{v}}_i\|_2} \le \frac{2\|A^\top A\|_F \,\|\hat{\mathbf{v}}_i' - \hat{\mathbf{v}}_i\|_2}{\|\hat{\mathbf{v}}_i' - \hat{\mathbf{v}}_i\|_2} = 2\|A^\top A\|_F$$

which completes the proof.

According to Theorem 2, given any initial value $\hat{\mathbf{v}}_i^{(t)}$ of $\hat{\mathbf{v}}_i$, the following inequality always holds:

$$\|\nabla g(\hat{\mathbf{v}}_i) - \nabla g(\hat{\mathbf{v}}_i^{(t)})\|_2 \le L_f \,\|\hat{\mathbf{v}}_i - \hat{\mathbf{v}}_i^{(t)}\|_2$$

Then, the quadratic approximation of $g(\hat{\mathbf{v}}_i)$ around $\hat{\mathbf{v}}_i^{(t)}$ can be given as follows:

$$\hat{g}(\hat{\mathbf{v}}_i) = g(\hat{\mathbf{v}}_i^{(t)}) + \langle \nabla g(\hat{\mathbf{v}}_i^{(t)}),\, \hat{\mathbf{v}}_i - \hat{\mathbf{v}}_i^{(t)}\rangle + \frac{L_f}{2}\|\hat{\mathbf{v}}_i - \hat{\mathbf{v}}_i^{(t)}\|_2^2 = \frac{L_f}{2}\|\hat{\mathbf{v}}_i - \boldsymbol{\nu}^{(t)}\|_2^2 + C_{L_f}$$

where $C_{L_f} = g(\hat{\mathbf{v}}_i^{(t)}) - \frac{1}{2L_f}\|\nabla g(\hat{\mathbf{v}}_i^{(t)})\|_2^2$ is a constant which does not depend on the variable $\hat{\mathbf{v}}_i$, and

$$\boldsymbol{\nu}^{(t)} = \hat{\mathbf{v}}_i^{(t)} - \frac{1}{L_f}\nabla g(\hat{\mathbf{v}}_i^{(t)}) \qquad (7)$$

According to the descent lemma (Bauschke et al., 2017), $\hat{g}(\hat{\mathbf{v}}_i)$ is an upper bound of $g(\hat{\mathbf{v}}_i)$, i.e., $g(\hat{\mathbf{v}}_i) \le \hat{g}(\hat{\mathbf{v}}_i)$ always holds. Therefore, $g(\hat{\mathbf{v}}_i)$ can be minimized by iteratively minimizing its approximation $\hat{g}(\hat{\mathbf{v}}_i)$. Plugging $\hat{g}(\hat{\mathbf{v}}_i)$ into the optimization problem (6), we obtain the following iterative update:

$$\hat{\mathbf{v}}_i^{(t+1)} = \arg\min_{\hat{\mathbf{v}}_i} \ \frac{L_f}{2}\|\hat{\mathbf{v}}_i - \boldsymbol{\nu}^{(t)}\|_2^2 + \gamma_2\|\hat{\mathbf{v}}_i - \mathbf{v}_i\|_1 = \mathrm{soft_c}\!\left(\boldsymbol{\nu}^{(t)}, \frac{\gamma_2}{L_f}, \mathbf{v}_i\right) \qquad (8)$$

where the (element-wise) function $\mathrm{soft_c}(\cdot, \cdot, \cdot)$ is defined as follows:

$$\mathrm{soft_c}(x, \mu, c) = \begin{cases} x - \mu & \text{if } x - c > \mu \\ x + \mu & \text{if } x - c < -\mu \\ c & \text{otherwise} \end{cases} \qquad (9)$$

Note that $\mathrm{soft_c}$ is different from the well-known soft-thresholding function, defined as

$$\mathrm{soft}(x, \mu) = \begin{cases} x - \mu & \text{if } x > \mu \\ x + \mu & \text{if } x < -\mu \\ 0 & \text{otherwise} \end{cases}$$

in that $\mathrm{soft_c}$ involves an additional constant term $c$. In a nutshell, the solution in Eq.(8) can be obtained via the variable substitution method, where the variables $\hat{\mathbf{v}}_i$ are substituted with $\boldsymbol{\chi}_i = \hat{\mathbf{v}}_i - \mathbf{v}_i$. In (Beck & Teboulle, 2009), it is shown that the convergence rate of Eq.(8) can be improved from $O(t^{-1})$ to $O(t^{-2})$ if we replace $\hat{\mathbf{v}}_i^{(t)}$ in Eq.(7) with the following $\boldsymbol{\zeta}^{(t)}$:

$$\boldsymbol{\zeta}^{(t)} = \hat{\mathbf{v}}_i^{(t)} + \frac{r_{t-1} - 1}{r_t}\left(\hat{\mathbf{v}}_i^{(t)} - \hat{\mathbf{v}}_i^{(t-1)}\right) \qquad (10)$$

where $r_0 = r_1 = 1$ and $r_t = \frac{1 + \sqrt{1 + 4r_{t-1}^2}}{2}$ for $t > 1$. In summary, Algorithm 1 shows the procedure for solving problem (6) via APG.

Algorithm 1. Solving problem (6) via APG.
Require: the encoding matrix $A \in \mathbb{R}^{s' \times s}$, the ground-truth sparse vector $\mathbf{v}_i \in \{0,1\}^s$, the trade-off parameter $\gamma_2$.
Ensure: the recovered sparse vector $\hat{\mathbf{v}}_i$.
1: Calculate the Lipschitz constant $L_f = 2\|A^\top A\|_F$;
2: Initialize $\hat{\mathbf{v}}_i^{(0)} = \hat{\mathbf{v}}_i^{(1)} = \mathbf{v}_i$, $r_0 = r_1 = 1$, $t = 1$;
3: repeat
4: Obtain $\boldsymbol{\zeta}^{(t)}$ according to Eq.(10);
5: Compute $\boldsymbol{\nu}^{(t)} = \boldsymbol{\zeta}^{(t)} - \frac{1}{L_f}\nabla g(\boldsymbol{\zeta}^{(t)})$;
6: Obtain $\hat{\mathbf{v}}_i^{(t+1)}$ according to Eq.(8);
7: Compute $r_{t+1} = \frac{1 + \sqrt{1 + 4r_t^2}}{2}$;
8: $t = t + 1$;
9: until convergence
10: Return $\hat{\mathbf{v}}_i = \hat{\mathbf{v}}_i^{(t)}$.

We alternately optimize the formulations in Eq.(3) and Eq.(6) until convergence, and then obtain the optimal model parameters $(W, \mathbf{b})$ of the multi-output regressor.

### 3.3. Decoding Phase

Given an unseen instance $\mathbf{x}$, suppose its ground-truth class vector is $\mathbf{y} \in \mathcal{Y}$; then its grouped class vector is $\mathbf{u} = P_{\mathcal{Y}}(\mathbf{y}) \in \mathcal{U}$, its one-hot conversion is $\mathbf{v} = \Phi_{\mathcal{U}}(\mathbf{u}) \in \{0,1\}^s$, and the encoded real-valued vector is $\mathbf{z} = A\mathbf{v} \in \mathbb{R}^{s'}$. The decoding phase corresponds to the inverse operations of these steps so as to obtain the predicted class vector $\hat{\mathbf{y}}$ for $\mathbf{x}$. Specifically, we first predict the encoded vector $\hat{\mathbf{z}} = h(\mathbf{x})$ with the learned multi-output regression model $h$. Then, we recover the $\lceil q/2 \rceil$-sparse vector $\hat{\mathbf{v}} = \mathcal{R}(\hat{\mathbf{z}})$ with some reconstruction algorithm $\mathcal{R}$. After that, the corresponding grouped class vector is returned by $\hat{\mathbf{u}} = \Phi_{\mathcal{U}}^{-1}(\hat{\mathbf{v}})$. Finally, the predicted class vector is obtained by $\hat{\mathbf{y}} = P_{\mathcal{Y}}^{-1}(\hat{\mathbf{u}})$. Obviously, the first two steps are the key operations which could introduce predictive errors, while the last two steps can be done accurately as long as their inputs are accurate. Therefore, we optimize formulation (2) to accomplish the first step, i.e., the multi-output regression learning task. For the sparse reconstruction step, we design a reconstruction algorithm which takes the local sparsity property into account by adapting the orthogonal matching pursuit (OMP) algorithm (Tropp & Gilbert, 2007). We name the adapted algorithm local orthogonal matching pursuit (LOMP). Algorithm 2 shows the pseudo-code of LOMP, where $\mathbf{v}(j^*)$ denotes the $j^*$-th entry of vector $\mathbf{v}$ and $A_{:J}$ denotes the submatrix consisting of the columns of $A$ indexed by $J$.

Algorithm 2. LOMP: $\mathbf{v} = \mathcal{R}(\mathbf{z}, A, k, I)$.
Input: the encoding matrix $A \in \mathbb{R}^{s' \times s}$, the real-valued vector $\mathbf{z} \in \mathbb{R}^{s'}$, the sparsity level $k$, the local sparsity information $I: s_1, s_2, \ldots, s_k$.
Output: the recovered $k$-sparse vector $\mathbf{v}$.
1: Initialize $\mathbf{v}$ as the zero vector with length $s = \sum_{j=1}^{k} s_j$;
2: Initialize $\mathbf{r}_0 = \mathbf{z}$, $J = \emptyset$, $B = A$;
3: for $i = 1$ to $k$ do
4: &nbsp;&nbsp; $j^* = \arg\max_j |\langle \mathbf{r}_{i-1}, B_{:j}\rangle|$;
5: &nbsp;&nbsp; $\mathbf{v}(j^*) = 1$;
6: &nbsp;&nbsp; $J = J \cup \{j^*\}$;
7: &nbsp;&nbsp; $\mathbf{r}_i = \mathbf{z} - A_{:J}(A_{:J}^\top A_{:J})^{-1}A_{:J}^\top \mathbf{z}$;
8: &nbsp;&nbsp; for $\kappa = 1$ to $k$ do
9: &nbsp;&nbsp;&nbsp;&nbsp; $t_f = \sum_{t=1}^{\kappa} s_t$;
10: &nbsp;&nbsp;&nbsp;&nbsp; if $j^* \le t_f$ then
11: &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; $t_b = t_f - s_\kappa$;
12: &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; $T = \{t_b+1, t_b+2, \ldots, t_f\}$;
13: &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; $B_{:T} = \mathbf{0}$;
14: &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; break;
15: &nbsp;&nbsp;&nbsp;&nbsp; end if
16: &nbsp;&nbsp; end for
17: end for
18: Return $\mathbf{v}$.

In contrast with the standard OMP, the key adaptation in LOMP is to zero out the columns (of the working copy $B$) that belong to the same local group as the currently selected column of $A$, i.e., steps 8-16. This adaptation ensures that there is one and only one 1 in each local group of the recovered $\hat{\mathbf{v}}$.
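To make the decoding step concrete, here is a minimal NumPy sketch of LOMP (our own rendering of Algorithm 2, using a least-squares solve for the projection in step 7); the subsequent inverse one-hot conversion and inverse pairwise grouping are straightforward lookups and are omitted. The round-trip demo at the end uses hypothetical group sizes.

```python
import numpy as np

def lomp(A, z, sizes):
    """Local orthogonal matching pursuit (sketch of Algorithm 2).

    A     : (s', s) encoding matrix
    z     : (s',) predicted real-valued vector z_hat = h(x)
    sizes : [s_1, ..., s_k] lengths of the k local groups (s = sum(sizes))
    Returns a k-sparse binary vector with exactly one 1 per selected group.
    """
    s = sum(sizes)
    bounds = np.cumsum([0] + list(sizes))     # group g spans bounds[g]:bounds[g+1]
    v = np.zeros(s)
    B = A.copy()                               # columns of exhausted groups get zeroed
    r, J = z.copy(), []
    for _ in range(len(sizes)):
        j_star = int(np.argmax(np.abs(B.T @ r)))   # most correlated remaining column
        v[j_star] = 1.0
        J.append(j_star)
        # residual after projecting z onto the selected columns (step 7)
        A_J = A[:, J]
        r = z - A_J @ np.linalg.lstsq(A_J, z, rcond=None)[0]
        # zero out all columns in the same local group as j_star (steps 8-16)
        g = int(np.searchsorted(bounds, j_star, side="right") - 1)
        B[:, bounds[g]:bounds[g + 1]] = 0.0
    return v

# hypothetical round trip: encode a 2-sparse grouped one-hot vector, then recover it
rng = np.random.default_rng(0)
sizes = [8, 5]
s = sum(sizes); s_prime = s - 1
A = rng.normal(0.0, 1.0 / np.sqrt(s_prime), size=(s_prime, s))
v_true = np.zeros(s); v_true[2] = 1.0; v_true[8 + 4] = 1.0
v_hat = lomp(A, A @ v_true, sizes)
recovered = np.array_equal(v_hat, v_true)   # expected True in this noiseless case
```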
## 4. Experiments

### 4.1. Experimental Setup

#### 4.1.1. Benchmark Data Sets

In this paper, the experiments are conducted over a total of 11 benchmark data sets, whose detailed characteristics are summarized in Table 1, including the number of examples (#Exam.), the number of dimensions (#Dim.), the number of class labels per dimension (#Labels/Dim.), and the number of features (#Features). For #Labels/Dim., the numbers for all dimensions are recorded in turn; if all numbers are the same, only one of them is recorded. For #Features, n and x denote numeric and nominal features, respectively.

Table 1. Characteristics of the benchmark data sets.

| Data Set | #Exam. | #Dim. | #Labels/Dim. | #Features |
| --- | --- | --- | --- | --- |
| Jura | 359 | 2 | 4,5 | 9n |
| Oes10 | 403 | 16 | 3 | 298n |
| Voice | 3136 | 2 | 4,2 | 19n |
| Scm20d | 8966 | 16 | 4 | 61n |
| Rf1 | 8987 | 8 | 4,4,3,4,4,3,4,3 | 64n |
| Scm1d | 9803 | 16 | 4 | 280n |
| CoIL2000 | 9822 | 5 | 6,10,10,4,2 | 81x |
| Flickr | 12198 | 5 | 3,4,3,4,4 | 1536n |
| Disfa | 13095 | 12 | 5,5,6,3,4,4,5,4,4,4,6,4 | 136n |
| Fera | 14052 | 5 | 6 | 136n |
| Adult | 18419 | 4 | 7,7,5,2 | 5n,5x |

#### 4.1.2. Evaluation Metrics

In this paper, the performance of MDC approaches is measured by three widely-used metrics (Ma & Chen, 2018; Jia & Zhang, 2020a;b;c; Wang et al., 2020), i.e., Hamming Score (HS), Exact Match (EM) and Sub-Exact Match (SEM), whose formal definitions are given as follows:

$$\mathrm{HS}_S(f) = \frac{1}{p}\sum_{i=1}^{p}\frac{r(i)}{q}, \qquad \mathrm{EM}_S(f) = \frac{1}{p}\sum_{i=1}^{p}\mathbb{1}_{r(i)=q}, \qquad \mathrm{SEM}_S(f) = \frac{1}{p}\sum_{i=1}^{p}\mathbb{1}_{r(i)\ge q-1}$$

Here, $S = \{(\mathbf{x}_i, \mathbf{y}_i) \mid 1 \le i \le p\}$ is the test set with $p$ examples, $f: \mathcal{X} \mapsto \mathcal{Y}$ is the MDC model to be evaluated, $r(i) = \sum_{j=1}^{q}\mathbb{1}_{y_{ij}=\hat{y}_{ij}}$ is the number of class labels for which $f$ returns the correct prediction for $\mathbf{x}_i$, where $y_{ij}$ and $\hat{y}_{ij}$ correspond to the ground-truth and predicted class label in $\mathbf{x}_i$'s $j$-th class space, and the predicate $\mathbb{1}_\pi$ returns 1 if $\pi$ holds and 0 otherwise. Obviously, for all three metrics, the larger the metric value, the better the performance. In the experiments, we conduct ten-fold cross-validation over each data set for all compared approaches, and both the mean metric value and the standard deviation are recorded for performance comparison.
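The three metrics reduce to simple counts over the test set; the following small sketch (ours, with toy data) makes the definitions concrete.

```python
import numpy as np

def mdc_metrics(Y_true, Y_pred):
    """Hamming Score, Exact Match and Sub-Exact Match for MDC predictions.

    Y_true, Y_pred : (p, q) arrays of class labels (any comparable dtype).
    """
    Y_true, Y_pred = np.asarray(Y_true), np.asarray(Y_pred)
    p, q = Y_true.shape
    r = (Y_true == Y_pred).sum(axis=1)   # r(i): number of correctly predicted dimensions
    hs = np.mean(r / q)                  # Hamming Score
    em = np.mean(r == q)                 # Exact Match
    sem = np.mean(r >= q - 1)            # Sub-Exact Match
    return hs, em, sem

# toy example with q = 3 dimensions and p = 3 test examples
Y_true = [["B", "Linux", "i7"], ["A", "Win", "i5"], ["C", "Win", "i9"]]
Y_pred = [["B", "Linux", "i7"], ["A", "Win", "i7"], ["A", "Linux", "i9"]]
print(mdc_metrics(Y_true, Y_pred))       # roughly (0.667, 0.333, 0.667)
```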
#### 4.1.3. Compared Approaches

In this paper, the proposed SLEM approach is compared with five state-of-the-art MDC baselines: BR, CP, BCC, ESC and gMML. Specifically, to solve the MDC problem, BR trains $q$ independent multi-class classifiers, one per class space, while CP trains a single multi-class classifier by regarding the $q$ class spaces as a compound one, where each distinct class combination corresponds to one new class in the compound class space. BCC trains $q$ chain-structured multi-class classifiers, one per class space, where each subsequent classifier augments its feature space with the predictions of the preceding ones and the chain structure is determined by Bayesian learning techniques (Zaragoza et al., 2011). ESC partitions the class spaces into several groups which are treated as compound class spaces, and solves the resulting problem via classifier chains (Read et al., 2014a). gMML decomposes the class spaces into a binary-valued label space via the one-vs-rest strategy and solves the resulting problem via a metric approach (Ma & Chen, 2018).

For BR, CP, BCC and ESC, a support vector machine (SVM) is used as the base multi-class classifier; specifically, the Libsvm package (Chang & Lin, 2011) with default parameter settings is used in the experiments. For ESC, the ensemble is constructed with ten base models whose results are combined via majority voting (Read et al., 2014a). For gMML, its parameters $\lambda$, $t$, $\gamma$ and $k$ are tuned as suggested in (Ma & Chen, 2018). For the proposed SLEM approach, we use a random Gaussian matrix as the encoding matrix $A$ with $s' = s - 1$, and the three trade-off parameters in formulation (2) are set as $\lambda = 1$, $\gamma_1 = 1$ and $\gamma_2 = 1$.

Table 2. Experimental results (mean±std.) of each MDC approach. Pairwise t-tests at the 0.05 significance level are conducted to examine whether SLEM is statistically superior/inferior to each compared approach on each data set; results marked as N/A are unavailable due to out-of-memory errors of the Libsvm package.

Hamming Score:

| Data Set | SLEM | BR | CP | BCC | ESC | gMML |
| --- | --- | --- | --- | --- | --- | --- |
| Jura | 0.717±0.049 | 0.586±0.069 | 0.570±0.061 | 0.568±0.065 | 0.558±0.055 | 0.606±0.072 |
| Oes10 | 0.763±0.014 | 0.664±0.019 | 0.179±0.041 | 0.674±0.028 | 0.633±0.020 | 0.775±0.017 |
| Voice | 0.944±0.010 | 0.940±0.010 | 0.916±0.010 | 0.938±0.010 | 0.931±0.009 | 0.842±0.009 |
| Scm20d | 0.873±0.004 | 0.632±0.006 | N/A | 0.605±0.008 | N/A | 0.600±0.007 |
| Rf1 | 0.927±0.002 | 0.852±0.005 | 0.813±0.010 | 0.855±0.004 | 0.794±0.007 | 0.730±0.007 |
| Scm1d | 0.884±0.003 | 0.725±0.007 | N/A | 0.691±0.007 | N/A | 0.697±0.007 |
| CoIL2000 | 0.899±0.005 | 0.874±0.005 | 0.738±0.006 | 0.875±0.005 | 0.851±0.008 | 0.894±0.004 |
| Flickr | 0.737±0.018 | 0.715±0.006 | 0.658±0.008 | 0.715±0.006 | 0.651±0.007 | 0.779±0.004 |
| Disfa | 0.935±0.002 | 0.885±0.003 | N/A | 0.883±0.003 | 0.878±0.003 | 0.884±0.003 |
| Fera | 0.765±0.007 | 0.599±0.008 | N/A | 0.593±0.008 | N/A | 0.589±0.007 |
| Adult | 0.696±0.006 | 0.701±0.004 | 0.682±0.005 | 0.680±0.006 | 0.675±0.006 | 0.705±0.004 |

Exact Match:

| Data Set | SLEM | BR | CP | BCC | ESC | gMML |
| --- | --- | --- | --- | --- | --- | --- |
| Jura | 0.552±0.056 | 0.329±0.110 | 0.326±0.099 | 0.304±0.099 | 0.298±0.098 | 0.368±0.119 |
| Oes10 | 0.054±0.036 | 0.064±0.035 | 0.077±0.041 | 0.079±0.045 | 0.067±0.037 | 0.079±0.040 |
| Voice | 0.907±0.012 | 0.884±0.017 | 0.841±0.016 | 0.881±0.017 | 0.867±0.016 | 0.699±0.017 |
| Scm20d | 0.247±0.015 | 0.054±0.006 | N/A | 0.080±0.009 | N/A | 0.052±0.007 |
| Rf1 | 0.581±0.008 | 0.322±0.011 | 0.319±0.025 | 0.336±0.010 | 0.275±0.012 | 0.138±0.011 |
| Scm1d | 0.278±0.015 | 0.115±0.010 | N/A | 0.123±0.013 | N/A | 0.102±0.009 |
| CoIL2000 | 0.608±0.016 | 0.515±0.012 | 0.273±0.012 | 0.520±0.010 | 0.468±0.019 | 0.576±0.015 |
| Flickr | 0.248±0.017 | 0.187±0.011 | 0.125±0.016 | 0.187±0.011 | 0.114±0.014 | 0.287±0.009 |
| Disfa | 0.539±0.015 | 0.378±0.011 | N/A | 0.377±0.011 | 0.374±0.011 | 0.379±0.011 |
| Fera | 0.400±0.013 | 0.199±0.013 | N/A | 0.196±0.013 | N/A | 0.196±0.013 |
| Adult | 0.288±0.011 | 0.228±0.006 | 0.282±0.012 | 0.272±0.007 | 0.269±0.011 | 0.230±0.009 |

Sub-Exact Match:

| Data Set | SLEM | BR | CP | BCC | ESC | gMML |
| --- | --- | --- | --- | --- | --- | --- |
| Jura | 0.883±0.061 | 0.844±0.059 | 0.813±0.040 | 0.833±0.056 | 0.819±0.045 | 0.844±0.049 |
| Oes10 | 0.144±0.051 | 0.119±0.059 | 0.107±0.044 | 0.142±0.055 | 0.117±0.048 | 0.176±0.038 |
| Voice | 0.981±0.011 | 0.996±0.005 | 0.991±0.005 | 0.996±0.005 | 0.995±0.005 | 0.985±0.011 |
| Scm20d | 0.486±0.015 | 0.105±0.007 | N/A | 0.135±0.013 | N/A | 0.100±0.009 |
| Rf1 | 0.877±0.011 | 0.655±0.017 | 0.580±0.022 | 0.669±0.015 | 0.542±0.014 | 0.375±0.014 |
| Scm1d | 0.523±0.012 | 0.223±0.016 | N/A | 0.206±0.012 | N/A | 0.198±0.015 |
| CoIL2000 | 0.902±0.013 | 0.873±0.016 | 0.576±0.016 | 0.875±0.014 | 0.820±0.017 | 0.903±0.010 |
| Flickr | 0.611±0.035 | 0.543±0.015 | 0.426±0.018 | 0.544±0.017 | 0.414±0.017 | 0.689±0.016 |
| Disfa | 0.791±0.008 | 0.596±0.011 | N/A | 0.588±0.009 | 0.575±0.010 | 0.590±0.009 |
| Fera | 0.654±0.012 | 0.387±0.012 | N/A | 0.380±0.013 | N/A | 0.378±0.013 |
| Adult | 0.624±0.014 | 0.657±0.010 | 0.599±0.008 | 0.597±0.012 | 0.586±0.011 | 0.669±0.008 |
Figure 2. Performance comparison between SLEM and its two variants (DeV1 and DeV2) over all benchmark data sets in terms of (a) Hamming Score, (b) Exact Match, and (c) Sub-Exact Match.

### 4.2. Experimental Results

Table 2 shows the experimental results of SLEM and all compared approaches over the benchmark data sets in terms of each evaluation metric. (In Table 2, a total of 21 cases are marked as N/A, whose experimental results are unavailable due to out-of-memory errors of the Libsvm package; these errors are caused by the high computational complexity of CP and ESC over the corresponding data sets.) We also conduct pairwise t-tests (at the 0.05 significance level) to show whether SLEM is statistically superior/inferior to the other compared MDC approaches on each data set, and Table 3 summarizes the corresponding win/tie/loss counts.

Table 3. Win/tie/loss counts of pairwise t-test (at 0.05 significance level) between SLEM and each MDC approach.

| Evaluation metric | SLEM against BR | SLEM against CP | SLEM against BCC | SLEM against ESC | SLEM against gMML |
| --- | --- | --- | --- | --- | --- |
| HS | 9/1/1 | 7/0/0 | 10/1/0 | 8/0/0 | 8/0/3 |
| EM | 10/1/0 | 5/2/0 | 10/1/0 | 7/1/0 | 9/1/1 |
| SEM | 7/2/2 | 5/1/1 | 8/2/1 | 6/1/1 | 5/4/2 |
| In Total | 26/4/3 | 17/3/1 | 28/4/1 | 21/2/1 | 22/5/6 |

According to the reported experimental results, the following observations can be made:

- Among all the 144 configurations (11 data sets × 5 compared approaches × 3 evaluation metrics, excluding the 21 N/A cases), SLEM achieves superior or at least comparable performance against the five compared approaches in 132 cases.
- The two approaches BR and CP represent two extreme MDC baselines which consider none or exhaustive class dependencies, respectively. As shown in Table 3, SLEM achieves 26 and 17 superior cases against BR and CP respectively, which clearly validates the effectiveness of SLEM's dependency modeling strategy.
- The two approaches BCC and ESC specially consider the class dependencies via a chain structure or super-class partition in the original label space. The overwhelming advantage of SLEM over BCC and ESC indicates that it is beneficial to learn predictive models in the encoded label space.
- SLEM records 6 loss cases against gMML, relatively more than against the other compared approaches. The gMML approach learns its predictive model by utilizing a distance metric learning mechanism, which could also be introduced into label encoding in the future.

### 4.3. Further Analysis

In this subsection, we further compare the performance of SLEM with its two degenerated versions to analyze the effectiveness of SLEM's label encoding strategy. We denote the two variants as DeV1 and DeV2, respectively:

- DeV1: This variant directly obtains $\mathbf{v}_i$ by conducting one-hot conversion on $\mathbf{y}_i$, i.e., $\mathbf{v}_i = \Phi_{\mathcal{Y}}(\mathbf{y}_i)$, and the decoding phase is changed accordingly. In other words, DeV1 omits the pairwise grouping operation $P_{\mathcal{Y}}(\cdot)$.
- DeV2: This variant separately deals with the $\lceil q/2 \rceil$ entries in $\mathbf{u}_i$ with the same encoding-decoding procedure between $\mathbf{v}_i$ and $\hat{\mathbf{v}}$ (cf. Figure 1), each of which corresponds to one class space in $\mathcal{U}$.
In other words, DeV2 does not encode the heterogeneous class spaces into an integrated label space and only considers the dependencies between the class spaces grouped into one pair.

The detailed comparative results are shown in Figure 2. Moreover, we also employ the Wilcoxon signed-ranks test at the 0.05 significance level (Demšar, 2006) to examine the statistical relationship between SLEM and its two variants; the corresponding test results, with p-values shown in brackets, are summarized in Table 4.

Table 4. Wilcoxon signed-ranks test for SLEM against its two degenerated versions in terms of each evaluation metric (significance level α = 0.05; p-values shown in the brackets).

| SLEM versus | HS | EM | SEM |
| --- | --- | --- | --- |
| DeV1 | win[9.77e-04] | win[9.77e-04] | win[9.77e-04] |
| DeV2 | win[3.91e-03] | win[3.91e-03] | win[3.91e-03] |

As shown in Table 4, SLEM achieves statistically superior performance against its two degenerated versions in terms of each evaluation metric, which clearly validates the effectiveness of SLEM's label encoding strategy. To be more specific, the superiority of SLEM against DeV1 shows the benefits of the pairwise grouping operation, and the superiority of SLEM against DeV2 shows the benefits of encoding the heterogeneous class spaces into an integrated label space.

## 5. Conclusion

In this paper, we investigate label encoding techniques for multi-dimensional classification. Different from most existing MDC approaches, the proposed SLEM approach learns its predictive model in a transformed label space instead of the original one. Specifically, SLEM transforms the output space into a new one via three cascaded operations, including pairwise grouping, one-hot conversion and sparse linear encoding. Within the transformed label space, SLEM learns a multi-output regression model based on the training examples, and then obtains a real-valued label vector for an unseen instance by feeding it into the learned multi-output regressor. The final predicted class vector is obtained by conducting the decoding procedure, corresponding to the inverse operations of the encoding steps, based on the predicted real-valued label vector. We compare the performance of SLEM with five state-of-the-art MDC approaches over eleven benchmark data sets, and the experimental results clearly show the superiority of SLEM against the baselines.

## Acknowledgments

The authors wish to thank the anonymous reviewers for their helpful comments and suggestions. This work was supported by the National Key R&D Program of China (2018YFB1004300), and the China University S&T Innovation Plan Guided by the Ministry of Education. We thank the Big Data Center of Southeast University for providing the facility support on the numerical calculations in this paper.

## References

Al Muktadir, A. H., Miyazawa, T., Martinez-Julia, P., Harai, H., and Kafle, V. P. Multi-target classification based automatic virtual resource allocation scheme. IEICE Transactions on Information and Systems, 102(5):898-909, 2019.

Baraniuk, R., Davenport, M., Devore, R., and Wakin, M. A simple proof of the restricted isometry property for random matrices. Constructive Approximation, 28(3):253-263, 2008.

Bauschke, H. H., Bolte, J., and Teboulle, M. A descent lemma beyond Lipschitz gradient continuity: First-order methods revisited and applications. Mathematics of Operations Research, 42(2):330-348, 2017.

Beck, A. and Teboulle, M. A fast iterative shrinkage-thresholding algorithm for linear inverse problems. SIAM Journal on Imaging Sciences, 2(1):183-202, 2009.

Benjumeda, M., Bielza, C., and Larrañaga, P. Tractability of most probable explanations in multidimensional Bayesian network classifiers. International Journal of Approximate Reasoning, 93:74-87, 2018.
Bielza, C., Li, G., and Larrañaga, P. Multi-dimensional classification with Bayesian networks. International Journal of Approximate Reasoning, 52(6):705-727, 2011.

Bolt, J. H. and van der Gaag, L. C. Balanced sensitivity functions for tuning multi-dimensional Bayesian network classifiers. International Journal of Approximate Reasoning, 80:361-376, 2017.

Borchani, H., Varando, G., Bielza, C., and Larrañaga, P. A survey on multi-output regression. WIREs Data Mining and Knowledge Discovery, 5:216-233, 2015.

Chang, C.-C. and Lin, C.-J. LIBSVM: A library for support vector machines. ACM Transactions on Intelligent Systems and Technology, 2(3):Article 27, 2011.

Demšar, J. Statistical comparisons of classifiers over multiple data sets. Journal of Machine Learning Research, 7:1-30, 2006.

Donoho, D. L. Compressed sensing. IEEE Transactions on Information Theory, 52(4):1289-1306, 2006.

Gibaja, E. and Ventura, S. A tutorial on multilabel learning. ACM Computing Surveys, 47(3):Article 52, 2015.

Gil-Begue, S., Bielza, C., and Larrañaga, P. Multi-dimensional Bayesian network classifiers: A survey. Artificial Intelligence Review, 54:519-559, 2021.

Huang, J., Li, G., Huang, Q., and Wu, X. Learning label-specific features and class-dependent labels for multi-label classification. IEEE Transactions on Knowledge and Data Engineering, 28(12):3309-3323, 2016.

Jia, B.-B. and Zhang, M.-L. Multi-dimensional classification via kNN feature augmentation. Pattern Recognition, 106:Article 107423, 2020a.

Jia, B.-B. and Zhang, M.-L. Maximum margin multi-dimensional classification. In Proceedings of the 34th AAAI Conference on Artificial Intelligence, pp. 4312-4319, New York, NY, USA, 2020b.

Jia, B.-B. and Zhang, M.-L. Multi-dimensional classification via stacked dependency exploitation. Science China Information Sciences, 63(12):Article 222102, 2020c.

Liu, W. and Shen, X. Sparse extreme multi-label learning with oracle property. In Proceedings of the 36th International Conference on Machine Learning, pp. 4032-4041, Long Beach, CA, USA, 2019.

Liu, W., Tsang, I. W., and Müller, K. An easy-to-hard learning paradigm for multiple classes and multiple labels. Journal of Machine Learning Research, 18(94):1-38, 2017.

Liu, W., Xu, D., Tsang, I. W., and Zhang, W. Metric learning for multi-output tasks. IEEE Transactions on Pattern Analysis and Machine Intelligence, 41(2):408-422, 2019.

Ma, Z. and Chen, S. Multi-dimensional classification via a metric approach. Neurocomputing, 275:1121-1131, 2018.

Read, J., Bielza, C., and Larrañaga, P. Multi-dimensional classification with super-classes. IEEE Transactions on Knowledge and Data Engineering, 26(7):1720-1733, 2014a.

Read, J., Martino, L., and Luengo, D. Efficient Monte Carlo methods for multi-dimensional learning with classifier chains. Pattern Recognition, 47(3):1535-1546, 2014b.

Reeve, H. and Kaban, A. Optimistic bounds for multi-output learning. In Proceedings of the 37th International Conference on Machine Learning, pp. 8030-8040, Virtual Event, 2020.

Rodríguez, J. D., Pérez, A., Arteta, D., Tejedor, D., and Lozano, J. A. Using multidimensional Bayesian network classifiers to assist the treatment of multiple sclerosis. IEEE Transactions on Systems, Man, and Cybernetics - Part C: Applications and Reviews, 42(6):1705-1715, 2012.

Schölkopf, B. and Smola, A. J. Learning with Kernels: Support Vector Machines, Regularization, Optimization, and Beyond. MIT Press, Cambridge, MA, USA, 2002.
Shatkay, H., Pan, F., Rzhetsky, A., and Wilbur, W. J. Multi-dimensional classification of biomedical text: Toward automated, practical provision of high-utility text to diverse users. Bioinformatics, 24(18):2086-2093, 2008.

Shen, X., Liu, W., Tsang, I. W., Sun, Q.-S., and Ong, Y.-S. Compact multi-label learning. In Proceedings of the 32nd AAAI Conference on Artificial Intelligence, pp. 4066-4073, New Orleans, LA, USA, 2018.

Tai, F. and Lin, H.-T. Multilabel classification with principal label space transformation. Neural Computation, 24(9):2508-2542, 2012.

Tropp, J. A. and Gilbert, A. C. Signal recovery from random measurements via orthogonal matching pursuit. IEEE Transactions on Information Theory, 53(12):4655-4666, 2007.

Verma, S. P., Uscanga-Junco, O. A., and Díaz-González, L. A statistically coherent robust multidimensional classification scheme for water. Science of the Total Environment, 750:Article 141704, 2021.

Wang, H., Chen, C., Liu, W., Chen, K., Hu, T., and Chen, G. Incorporating label embedding and feature augmentation for multi-dimensional classification. In Proceedings of the 34th AAAI Conference on Artificial Intelligence, pp. 6178-6185, New York, NY, USA, 2020.

Zaragoza, J. H., Sucar, L. E., Morales, E. F., Bielza, C., and Larrañaga, P. Bayesian chain classifiers for multidimensional classification. In Proceedings of the 22nd International Joint Conference on Artificial Intelligence, pp. 2192-2197, Barcelona, Spain, 2011.

Zhang, M.-L. and Zhou, Z.-H. A review on multi-label learning algorithms. IEEE Transactions on Knowledge and Data Engineering, 26(8):1819-1837, 2014.

Zhu, M., Liu, S., and Jiang, J. A hybrid method for learning multi-dimensional Bayesian network classifiers based on an optimization model. Applied Intelligence, 44(1):123-148, 2016.