# Deep Multi-Dimensional Classification with Pairwise Dimension-Specific Features

Teng Huang¹,³, Bin-Bin Jia², Min-Ling Zhang¹,³
¹School of Computer Science and Engineering, Southeast University, Nanjing 210096, China
²College of Electrical and Information Engineering, Lanzhou University of Technology, Lanzhou 730050, China
³Key Lab. of Computer Network and Information Integration (Southeast University), MOE, China
{tengh, zhangml}@seu.edu.cn, jiabinbin@lut.edu.cn

Abstract

In multi-dimensional classification (MDC), each instance is associated with multiple class variables characterizing the semantics of objects from different dimensions. To account for the dependencies among class variables as well as the specific characteristics contained in different semantic dimensions, a novel deep MDC approach named PIST is proposed, which jointly addresses the two issues by learning pairwise dimension-specific features. Specifically, PIST conducts pairwise grouping to model the dependencies between each pair of class variables, which is more reliable when training samples are limited. To extract pairwise dimension-specific features, PIST weights the feature embedding with a feature importance vector, which is learned with a global loss based on intra-class and inter-class covariance. The final prediction w.r.t. each dimension is determined by combining the joint probabilities related to this dimension. Comparative studies on eleven real-world MDC data sets clearly validate the effectiveness of the proposed approach.

1 Introduction

In multi-dimensional classification (MDC), each object is represented by a single instance while being associated with multiple class variables. Each class variable corresponds to one label space characterizing the rich semantics of objects from some specific dimension. Take landscape painting classification as an example: each picture can be classified along the time dimension (with possible labels morning, afternoon, night, etc.), the weather dimension (with possible labels sunny, rainy, cloudy, etc.), and the scene dimension (with possible labels desert, mountain, grass, etc.). The need to learn from MDC objects arises in diverse real-world applications, including text mining [Lertnattee and Theeramunkong, 2004; Shatkay et al., 2008], computer vision [Song et al., 2018; Lian et al., 2020; Shi et al., 2025], bioinformatics [Borchani et al., 2013; Fernandez Gonzalez et al., 2015], etc.

Formally, let $\mathcal{X} = \mathbb{R}^d$ denote the feature space and $\mathcal{Y} = C_1 \times C_2 \times \cdots \times C_q$ the output space corresponding to the Cartesian product of $q$ label spaces, where each label space $C_j = \{c^j_1, c^j_2, \ldots, c^j_{K_j}\}$ includes $K_j$ possible labels ($1 \le j \le q$) characterizing the semantics along one dimension. Let $\mathcal{D} = \{(\boldsymbol{x}_i, \boldsymbol{y}_i) \mid 1 \le i \le m\}$ be the training set, where each sample $(\boldsymbol{x}_i, \boldsymbol{y}_i)$ consists of a $d$-dimensional feature vector $\boldsymbol{x}_i = [x_{i1}, x_{i2}, \ldots, x_{id}]^\top \in \mathcal{X}$ and a $q$-dimensional label vector $\boldsymbol{y}_i = [y_{i1}, y_{i2}, \ldots, y_{iq}]^\top \in \mathcal{Y}$. Given an unseen instance $\boldsymbol{x}_*$, the task of MDC is to learn a mapping function $f : \mathcal{X} \mapsto \mathcal{Y}$ from the training set $\mathcal{D}$ which returns a proper label vector $f(\boldsymbol{x}_*)$. One popular solution to MDC tasks is to deal with each dimension independently as a traditional multi-class classification problem. Nonetheless, this strategy completely ignores the dependencies among class variables, and the performance of the induced predictive models may therefore degenerate.
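To make the formal setup concrete, here is a minimal sketch (ours, not from the paper; the label spaces, sizes, and random data are illustrative assumptions) of how an MDC training set can be represented: each sample pairs a $d$-dimensional feature vector with one class index per dimension.

```python
import numpy as np

# Hypothetical label spaces for the landscape-painting example (q = 3 dimensions).
label_spaces = {
    "time":    ["morning", "afternoon", "night"],
    "weather": ["sunny", "rainy", "cloudy"],
    "scene":   ["desert", "mountain", "grass"],
}
q = len(label_spaces)                         # number of dimensions
K = [len(v) for v in label_spaces.values()]   # K_j: size of each label space C_j

m, d = 4, 8                                   # toy training-set size and feature dimensionality
rng = np.random.default_rng(0)
X = rng.normal(size=(m, d))                   # feature vectors x_i in R^d
Y = np.stack([rng.integers(0, K_j, size=m) for K_j in K], axis=1)  # y_i in C_1 x ... x C_q

print(X.shape, Y.shape)   # (4, 8) (4, 3): one class index per dimension for every sample
```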
To tackle the issue of ignored class dependencies, existing MDC approaches consider such dependencies either in an explicit manner with certain structures (e.g., directed acyclic graphs [Bielza et al., 2011; Gil-Begue et al., 2021], chaining orders [Zaragoza et al., 2011] and pairwise interactions [Jia and Zhang, 2020b]) or in an implicit manner by manipulating the feature space [Jia and Zhang, 2020a; Jia and Zhang, 2022] or the label space [Jia and Zhang, 2021b; Tang et al., 2024]. Although these existing approaches successfully consider class dependencies, they may achieve suboptimal performance since the predictive models for different dimensions are induced from the same feature space [Jia et al., 2023]. However, different semantics in each dimension may prefer different feature characteristics. Take the aforementioned landscape paintings as an example: the level of luminance would be preferred for discriminating labels in the time dimension; abrupt color changes are more likely to reveal the labels in the scene dimension; and the upper part of a picture is expected to be more related to the weather dimension. Moreover, while samples belonging to the same class should generally be similar in the feature space, it is very common that two MDC samples belong to the same class in one dimension but to different classes in another dimension.

To account for the specific characteristics contained in different semantic dimensions as well as the dependencies among class variables, we propose a novel MDC approach named PIST (i.e., Pairwise dImension-Specific feaTures) based on deep learning techniques. Specifically, PIST models class dependencies between each pair of class variables via pairwise grouping. First, to construct pairwise dimension embeddings, a combinatorial encoding procedure is conducted for pairwise class variables by optimizing the intra-class and inter-class covariance. Then, an element-wise selection mechanism is used to extract pairwise dimension-specific features, which are expected to better capture the correlation between the feature space and the heterogeneous label semantics in the respective dimension pairs. Finally, the joint probabilities predicted by the pairwise neural networks are integrated to accomplish the final discrimination collectively. To the best of our knowledge, PIST serves as the first attempt towards learning dimension-specific features while also considering class dependencies. Comprehensive experiments on eleven benchmark data sets show that PIST performs better than existing well-established MDC approaches.

The rest of this paper is organized as follows. Section 2 briefly reviews related work. Section 3 presents the proposed PIST approach in detail. Section 4 reports the results of empirical studies over a wide range of MDC data sets. Finally, Section 5 concludes this paper.

2 Related Work

On the one hand, MDC can be regarded as a set of multi-class classification problems, one per dimension. Thus, the MDC problem can be solved by learning an independent multi-class classifier for each dimension, which is known as binary relevance (BR) [Zhang et al., 2018] but ignores all possible class dependencies. On the other hand, to exploit class dependencies among dimensions, one straightforward strategy is to regard each distinct label combination as a new class, which is known as class powerset (CP).
However, the combinatorial nature of CP induces high complexity and is also prone to class-imbalance and overfitting problems. According to the strategy of dependency modeling, existing MDC works can be roughly divided into two categories: explicit and implicit dependency modeling methods.

The explicit-category MDC methods attempt to model class dependencies with explicit structures. Chaining-based classifiers improve BR by learning a chain of multi-class classifiers, where each subsequent classifier on the chain augments its features with the predictions of the preceding ones [Zaragoza et al., 2011; Read et al., 2014b]. Multi-dimensional Bayesian classifiers construct a directed acyclic graph over the class variables to explicitly consider class dependencies [Bielza et al., 2011; Gil-Begue et al., 2021]. Generally, the dependencies among many class variables are hard to model due to the limited samples in the training set. SEEM [Jia and Zhang, 2020b] and MDKNN [Jia and Zhang, 2021a] suggest that pairwise dependencies can be modeled more reliably than the dependencies among all class variables jointly.

The implicit-category MDC methods transform the original MDC problem into a new one without any explicit dependency modeling structure in the transformation procedure. gMML [Ma and Chen, 2018] transforms the original categorical output space into a binary one and then induces the predictive model based on metric learning. SLEM [Jia and Zhang, 2021b] encodes the original class vectors into real-valued ones and decodes the predicted class vectors from the outputs of a learned multi-output regression model. To extract more powerful features, KRAM [Jia and Zhang, 2020a] manipulates the feature space by utilizing kNN information to enrich the original features. LEFA [Wang et al., 2020] is the first MDC approach that utilizes deep learning techniques; it learns an augmented feature vector for each instance by assuming that the representations of features and labels should be aligned in some latent space. ADVAE-FLOW [Zhang et al., 2022] encodes both features and class variables into probabilistic latent spaces via normalizing flows, in which the one-hot representations of the label vectors w.r.t. different dimensions are directly stacked. DSOC [Saleh and Li, 2023] consists of multiple neural networks and a hypercube classifier, where the former are responsible for feature selection and the latter accommodates the model for rare-sample classification.

However, all these works only aim to consider class dependencies and cannot account for the specific characteristics contained in different semantic dimensions. In the next section, we present the technical details of the proposed PIST approach, which considers not only class dependencies but also pairwise dimension-specific characteristics.

3 The PIST Approach

As shown in Figure 1, PIST includes two key modules: pairwise dimension encoding and dimension-specific feature extraction. Briefly, a weighted sum-pooling is conducted to obtain pairwise dimension embeddings in the first module, and its outputs further guide the dimension-specific feature extraction in the second module. The final classification is enabled by the probabilities returned by softmax regression.

[Figure 1: The workflow of the proposed PIST approach, taking the pair of label spaces $C_1$ and $C_2$ as an example.]

3.1 Pairwise Dimension Encoding

To consider pairwise interactions, PIST treats each pair of dimensions as a whole. Without loss of generality, we carry out the following discussion for the case of $C_1$ and $C_2$.
To extract dimension-specific features for this dimension pair, PIST learns a corresponding pairwise dimension embedding with a weighted sum-pooling:

$$\boldsymbol{l}^{(12)} = \sum_{a_1=1}^{K_1}\sum_{a_2=1}^{K_2}\frac{f_{a_1 a_2}}{m}\,\boldsymbol{l}_{a_1 a_2} \qquad (1)$$

where $f_{a_1 a_2}$ is the number of training samples labeled by both $c^1_{a_1}$ and $c^2_{a_2}$, and $\boldsymbol{l}_{a_1 a_2} \in \mathbb{R}^t$ is a latent label embedding vector related to $c^1_{a_1}$ and $c^2_{a_2}$. Here, $t$ is a hyper-parameter to be set (cf. Section 4.3 for further discussion).

For the components $\boldsymbol{l}_{a_1 a_2}$ in Eq.(1), a natural cluster assumption is that $\boldsymbol{l}_{a_1 1}, \boldsymbol{l}_{a_1 2}, \ldots, \boldsymbol{l}_{a_1 K_2}$ should be close to each other and far from $\boldsymbol{l}_{a'_1 1}, \boldsymbol{l}_{a'_1 2}, \ldots, \boldsymbol{l}_{a'_1 K_2}$, where $a_1 \in \{1, 2, \ldots, K_1\}$ and $a'_1 \in \{1, 2, \ldots, K_1\} \setminus \{a_1\}$. For this purpose, good embeddings $\boldsymbol{l}_{a_1 a_2}$ should minimize the intra-class covariance and maximize the inter-class covariance, which can be implemented by minimizing the following objective $\mathcal{L}^{(12)}_{le,\mathrm{part1}}$:

$$\mathcal{L}^{(12)}_{le,\mathrm{part1}} = \frac{\sum_{a_1=1}^{K_1}\sum_{a_2=1}^{K_2}\big\|\boldsymbol{l}_{a_1 a_2} - \bar{\boldsymbol{l}}_{a_1}\big\|_2^2}{\sum_{a_1=1}^{K_1} K_2\,\big\|\bar{\boldsymbol{l}}_{a_1} - \bar{\boldsymbol{l}}\big\|_2^2} \qquad (2)$$

where $\bar{\boldsymbol{l}}_{a_1} = \frac{1}{K_2}\sum_{a_2=1}^{K_2}\boldsymbol{l}_{a_1 a_2}$ is the intra-class mean and $\bar{\boldsymbol{l}} = \frac{1}{K_1 K_2}\sum_{a_1=1}^{K_1}\sum_{a_2=1}^{K_2}\boldsymbol{l}_{a_1 a_2}$ is the global mean. It is worth noting that the above discussion on the similarity of latent label embeddings has a dual form obtained by exchanging the subscripts $a_1$ and $a_2$. The dual objective $\mathcal{L}^{(12)}_{le,\mathrm{part2}}$ is:

$$\mathcal{L}^{(12)}_{le,\mathrm{part2}} = \frac{\sum_{a_2=1}^{K_2}\sum_{a_1=1}^{K_1}\big\|\boldsymbol{l}_{a_1 a_2} - \bar{\boldsymbol{l}}'_{a_2}\big\|_2^2}{\sum_{a_2=1}^{K_2} K_1\,\big\|\bar{\boldsymbol{l}}'_{a_2} - \bar{\boldsymbol{l}}\big\|_2^2} \qquad (3)$$

where $\bar{\boldsymbol{l}}'_{a_2} = \frac{1}{K_1}\sum_{a_1=1}^{K_1}\boldsymbol{l}_{a_1 a_2}$ is the dual intra-class mean. By combining the two objectives, the final label embedding loss $\mathcal{L}^{(12)}_{le}$ w.r.t. $C_1$ and $C_2$ is defined as:

$$\mathcal{L}^{(12)}_{le} = \mathcal{L}^{(12)}_{le,\mathrm{part1}} + \mathcal{L}^{(12)}_{le,\mathrm{part2}} \qquad (4)$$

3.2 Dimension-Specific Feature Extraction

For each example $\boldsymbol{x}$, numerous earlier works [Yeh et al., 2017; Wang et al., 2016; Zhang et al., 2023] have shown that it is beneficial to exploit powerful feature embeddings in latent spaces. Thus, we first encode $\boldsymbol{x}$ into a latent space with a neural network, yielding $\phi(\boldsymbol{x}) \in \mathbb{R}^{d'}$. To extract pairwise dimension-specific features, PIST decodes $\boldsymbol{l}^{(12)}$ into a feature importance vector with an attention network:

$$\boldsymbol{\theta}^{(12)} = \sigma(\mathbf{W}_l\,\boldsymbol{l}^{(12)} + \boldsymbol{b}_l) \qquad (5)$$

where $\boldsymbol{\theta}^{(12)}, \boldsymbol{b}_l \in \mathbb{R}^{d'}$, $\mathbf{W}_l \in \mathbb{R}^{d' \times t}$, and $\sigma$ is the ReLU activation function. Note that $\mathbf{W}_l$ and $\boldsymbol{b}_l$ are shared across all dimension pairs.

PIST further assumes that $\phi(\boldsymbol{x})$ can be transformed via an element-wise selection mechanism. For dimensions $C_1$ and $C_2$, the latent feature embedding is transformed into $\phi(\boldsymbol{x}) \odot \boldsymbol{\theta}^{(12)}$, where $\odot$ denotes the Hadamard product. With an additional fully-connected layer, the final pairwise dimension-specific features are obtained:

$$\boldsymbol{s}^{(12)} = \sigma\big[\mathbf{W}_s\big(\phi(\boldsymbol{x}) \odot \boldsymbol{\theta}^{(12)}\big) + \boldsymbol{b}_s\big] \qquad (6)$$

where $\boldsymbol{s}^{(12)}, \boldsymbol{b}_s \in \mathbb{R}^{d'}$ and $\mathbf{W}_s \in \mathbb{R}^{d' \times d'}$.

In addition, it is possible that one dimension is irrelevant to the others. PIST therefore also extracts single dimension-specific features by a similar procedure. Take the $j$-th dimension as an example ($1 \le j \le q$):

$$\boldsymbol{l}^{(jj)} = \sum_{a=1}^{K_j}\frac{f_a}{m}\,\boldsymbol{l}_a \qquad (7)$$

$$\boldsymbol{\theta}^{(jj)} = \sigma(\mathbf{W}_l\,\boldsymbol{l}^{(jj)} + \boldsymbol{b}_l) \qquad (8)$$

$$\boldsymbol{s}^{(jj)} = \sigma\big[\mathbf{W}_s\big(\phi(\boldsymbol{x}) \odot \boldsymbol{\theta}^{(jj)}\big) + \boldsymbol{b}_s\big] \qquad (9)$$

where $f_a$ is the number of training samples labeled by $c^j_a$ and $\boldsymbol{l}_a \in \mathbb{R}^t$ is the corresponding latent label embedding. Here, we denote the single dimension $j$ by the superscript $(jj)$ for notational consistency with the aforementioned pairwise case.
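To make the two modules concrete, the following PyTorch-style sketch (ours, not the authors' released code; the layer sizes, the parameterization, and the ratio form of the scatter loss follow the reconstruction above and are therefore assumptions) computes a pairwise dimension embedding by frequency-weighted sum-pooling as in Eq.(1), decodes it into a feature importance vector as in Eq.(5), gates the feature embedding with a Hadamard product as in Eq.(6), and evaluates the label embedding loss of Eqs.(2)-(4).

```python
import torch
import torch.nn as nn

class PairwiseDimensionSpecific(nn.Module):
    """Sketch of PIST's pairwise dimension encoding and dimension-specific
    feature extraction for one dimension pair (C1, C2); sizes are illustrative."""

    def __init__(self, d_latent, t, K1, K2):
        super().__init__()
        # Latent label embeddings l_{a1 a2} in R^t, one per label combination (Eq. 1).
        self.label_emb = nn.Parameter(torch.randn(K1, K2, t))
        # Attention network decoding l^{(12)} into a feature importance vector (Eq. 5).
        self.attn = nn.Linear(t, d_latent)
        # Fully-connected layer producing the dimension-specific features (Eq. 6).
        self.fc = nn.Linear(d_latent, d_latent)

    def forward(self, phi_x, freq):
        # phi_x: (batch, d_latent) latent feature embeddings phi(x)
        # freq:  (K1, K2) empirical counts f_{a1 a2}; freq.sum() equals m
        w = freq / freq.sum()                                       # f_{a1 a2} / m
        l_12 = (w.unsqueeze(-1) * self.label_emb).sum(dim=(0, 1))   # Eq. (1): weighted sum-pooling
        theta = torch.relu(self.attn(l_12))                         # Eq. (5): importance vector
        s_12 = torch.relu(self.fc(phi_x * theta))                   # Eq. (6): Hadamard gating + FC
        return s_12, l_12

def label_embedding_loss(label_emb):
    """Eqs. (2)-(4) as reconstructed above: intra-class over inter-class scatter
    ratio, plus its dual with the roles of the two dimensions exchanged."""
    K1, K2, _ = label_emb.shape
    g_mean = label_emb.mean(dim=(0, 1))        # global mean
    row_mean = label_emb.mean(dim=1)           # (K1, t) intra-class means
    col_mean = label_emb.mean(dim=0)           # (K2, t) dual intra-class means
    part1 = ((label_emb - row_mean.unsqueeze(1)) ** 2).sum() / \
            (K2 * ((row_mean - g_mean) ** 2).sum())
    part2 = ((label_emb - col_mean.unsqueeze(0)) ** 2).sum() / \
            (K1 * ((col_mean - g_mean) ** 2).sum())
    return part1 + part2
```

In a full model, the label embeddings would be instantiated once per dimension pair, while the attention and fully-connected parameters are shared across pairs, as stated after Eq.(5).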
3.3 Classification

For classification, we obtain the probabilities of all $K_1 K_2$ class combinations w.r.t. the first two dimensions with a softmax regression:

$$\boldsymbol{o}^{(12)} = \mathbf{W}^{(12)}_o\,\boldsymbol{s}^{(12)} + \boldsymbol{b}^{(12)}_o \qquad (10)$$

where $\boldsymbol{o}^{(12)}, \boldsymbol{b}^{(12)}_o \in \mathbb{R}^{K_1 K_2}$ and $\mathbf{W}^{(12)}_o \in \mathbb{R}^{K_1 K_2 \times d'}$. Define an injective function $\psi(\cdot,\cdot) : \{1, 2, \ldots, K_1\} \times \{1, 2, \ldots, K_2\} \to \{1, 2, \ldots, K_1 K_2\}$ and further assume that $\psi(a_1, a_2) = w$. The predicted probability vector for any instance $\boldsymbol{x}$ is:

$$\hat{\boldsymbol{p}}^{(12)} = \mathrm{softmax}(\boldsymbol{o}^{(12)}) \qquad (11)$$

where the $w$-th element $\hat{p}^{(12)}_w$ of $\hat{\boldsymbol{p}}^{(12)}$ corresponds to:

$$\hat{p}^{(12)}_w = \frac{\exp(o^{(12)}_w)}{\sum_{a=1}^{K_1 K_2}\exp(o^{(12)}_a)} \qquad (12)$$

Here, $o^{(12)}_a$ denotes the $a$-th element of the vector $\boldsymbol{o}^{(12)}$. It is easy to see that $\hat{p}^{(12)}_w$ indicates the probability that $\boldsymbol{x}$ is labeled by $c^1_{a_1}$ and $c^2_{a_2}$ w.r.t. $C_1$ and $C_2$, respectively. A similar derivation applies to the case of a single dimension. Take the $j$-th dimension as an example ($1 \le j \le q$):

$$\boldsymbol{o}^{(jj)} = \mathbf{W}^{(jj)}_o\,\boldsymbol{s}^{(jj)} + \boldsymbol{b}^{(jj)}_o \qquad (13)$$

where $\boldsymbol{o}^{(jj)}, \boldsymbol{b}^{(jj)}_o \in \mathbb{R}^{K_j}$ and $\mathbf{W}^{(jj)}_o \in \mathbb{R}^{K_j \times d'}$. The corresponding predicted probability vector for any instance $\boldsymbol{x}$ is:

$$\hat{\boldsymbol{p}}^{(jj)} = \mathrm{softmax}(\boldsymbol{o}^{(jj)}) \qquad (14)$$

where the $a_j$-th element $\hat{p}^{(jj)}_{a_j}$ of $\hat{\boldsymbol{p}}^{(jj)}$ corresponds to:

$$\hat{p}^{(jj)}_{a_j} = \frac{\exp(o^{(jj)}_{a_j})}{\sum_{a=1}^{K_j}\exp(o^{(jj)}_a)} \qquad (15)$$

After traversing all dimension pairs, we obtain $\binom{q}{2} + q$ predicted probability vectors $\{\hat{\boldsymbol{p}}^{(rs)} \mid 1 \le r \le s \le q\}$. The final confidence score $\rho^{(r)}_{a_r}$ for the $a_r$-th label in the $r$-th dimension is determined as follows ($a_r \in \{1, 2, \ldots, K_r\}$, $1 \le r \le q$):

$$\rho^{(r)}_{a_r} = \hat{p}^{(rr)}_{a_r} + \sum_{s=1}^{r-1}\sum_{a_s=1}^{K_s}\hat{p}^{(sr)}_{\psi(a_s, a_r)} + \sum_{s=r+1}^{q}\sum_{a_s=1}^{K_s}\hat{p}^{(rs)}_{\psi(a_r, a_s)} \qquad (16)$$

It is not hard to verify that $\sum_{a_r=1}^{K_r}\rho^{(r)}_{a_r} = q$ holds. To render $\rho^{(r)}_{a_r}$ probabilistic and facilitate the cross-entropy loss, we further normalize it with a softmax operation:

$$Q^{r}_{a_r} = \frac{\exp(\rho^{(r)}_{a_r})}{\sum_{a=1}^{K_r}\exp(\rho^{(r)}_a)} \qquad (17)$$

Based on $Q^{r}_{a_r}$, and assuming that the ground-truth label of $\boldsymbol{x}$ in the $r$-th dimension is $c^r_{\gamma}$, the cross-entropy loss w.r.t. the $r$-th dimension is defined as follows ($1 \le r \le q$):

$$\mathcal{L}^{(r)}_{ce} = -\sum_{a_r=1}^{K_r}[\![a_r = \gamma]\!]\,\log(Q^{r}_{a_r}) \qquad (18)$$

where $[\![\pi]\!]$ returns 1 if $\pi$ holds and 0 otherwise. The final loss corresponds to the sum of the average of the dimension-wise cross-entropy losses $\mathcal{L}^{(r)}_{ce}$ in Eq.(18) and the average of the pairwise label embedding losses $\mathcal{L}^{(rs)}_{le}$ in Eq.(4):

$$\mathcal{L} = \frac{1}{q}\sum_{1 \le r \le q}\mathcal{L}^{(r)}_{ce} + \frac{2}{q(q-1)}\sum_{1 \le r < s \le q}\mathcal{L}^{(rs)}_{le} \qquad (19)$$
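As a rough illustration of Eqs.(16)-(17), the sketch below (ours; it stores each pairwise distribution as a $K_r \times K_s$ matrix rather than a flat vector indexed by $\psi(\cdot,\cdot)$, which is purely a representational choice) combines the single and pairwise predicted probabilities into per-dimension confidence scores and normalizes them.

```python
import torch

def aggregate_dimension_scores(p_joint, K):
    """Sketch of Eqs.(16)-(17): combine single and pairwise predicted probabilities
    into per-dimension confidence scores rho, then normalize them with softmax.
    p_joint maps (r, s) with r <= s to a tensor of shape (K[r],) when r == s
    and (K[r], K[s]) when r < s."""
    q = len(K)
    Q = []
    for r in range(q):
        rho = p_joint[(r, r)].clone()                  # single-dimension term
        for s in range(q):
            if s < r:                                  # pair (s, r): marginalize over a_s
                rho = rho + p_joint[(s, r)].sum(dim=0)
            elif s > r:                                # pair (r, s): marginalize over a_s
                rho = rho + p_joint[(r, s)].sum(dim=1)
        Q.append(torch.softmax(rho, dim=0))            # Eq.(17): normalized scores
    return Q

# Toy usage with q = 2 dimensions of sizes K_1 = 3 and K_2 = 2.
K = [3, 2]
p_joint = {
    (0, 0): torch.softmax(torch.randn(3), dim=0),             # single-dimension p^{(11)}
    (1, 1): torch.softmax(torch.randn(2), dim=0),             # single-dimension p^{(22)}
    (0, 1): torch.softmax(torch.randn(6), dim=0).view(3, 2),  # pairwise p^{(12)} over label pairs
}
Q = aggregate_dimension_scores(p_joint, K)
pred = [int(torch.argmax(Q_r)) for Q_r in Q]   # predicted label index per dimension
```

In this toy setting with $q = 2$, each unnormalized score vector $\rho^{(r)}$ sums to 2, matching the identity stated after Eq.(16).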