# partial_label_clustering__c5694962.pdf

Partial Label Clustering

Yutong Xie1, Fuchao Yang2, Yuheng Jia 3,4

1 Chien-Shiung Wu College, Southeast University, Nanjing 210096, China 2 College of Software Engineering, Southeast University, Nanjing 210096, China 3 School of Computer Science and Engineering, Southeast University, Nanjing 210096, China 4 Key Laboratory of New Generation Artificial Intelligence Technology and Its Interdisciplinary Applications (Southeast University), Ministry of Education, China ytxie@seu.edu.cn, yangfc@seu.edu.cn, yhjia@seu.edu.cn

Partial label learning (PLL) is a significant weakly supervised learning framework, where each training example corresponds to a set of candidate labels and only one label is the ground-truth label. For the first time, this paper investigates the partial label clustering problem, which takes advantage of the limited available partial labels to improve the clustering performance. Specifically, we first construct a weight matrix of examples based on their relationships in the feature space and disambiguate the candidate labels to estimate the ground-truth label based on the weight matrix. Then, we construct a set of must-link and cannot-link constraints based on the disambiguation results. Moreover, we propagate the initial must-link and cannot-link constraints based on an adversarial prior promoted dual-graph learning approach. Finally, we integrate weight matrix construction, label disambiguation, and pairwise constraints propagation into a joint model to achieve mutual enhancement. We also theoretically prove that a better disambiguated label matrix can help improve clustering performance. Comprehensive experiments demonstrate our method realizes superior performance when comparing with state-of-the-art constrained clustering methods, and outperforms PLL and semisupervised PLL methods when only limited samples are annotated. The code and appendix are publicly available at https://github.com/xyt-ml/PLC.

1 Introduction

Partial label learning (PLL) [Tian et al., 2023; Jia et al., 2023b; Xu et al., 2024; Jiang et al., 2024; Yang et al., 2025] is a significant weakly supervised learning framework, where each training example corresponds to a set of candidate labels, but only one of which is the ground-truth label. This form of weak supervision appears in a variety of real-world scenarios, e.g., web mining [Luo and Orabona, 2010], multimedia content analysis [Chen et al., 2018], and

Corresponding author

ecoinformatics [Liu and Dietterich, 2012]. Previous studies in PLL mainly focus on learning a multi-class classifier based on the available weak supervision [Wang et al., 2022; Jia et al., 2023c; Jia et al., 2024]. However, in many real-world scenarios, obtaining enough training examples with candidate labels may be time and resource costly as it requires a large amount of manual annotation, while sufficient unlabeled samples are readily obtainable. Given the effectiveness of clustering methods in dealing with unlabeled samples, we believe that utilizing limited partial labeled samples and sufficient unlabeled samples for clustering is an effective solution for this scenario. However, how to effectively exploit the partial labels for clustering is still under-explored. The ambiguity of the candidate labels prevents existing constrained clustering methods [Jia et al., 2018; Jia et al., 2020a] from directly utilizing the ground-truth labels. A naive method is to adopt label disambiguation strategies [Wang et al., 2022; Jia et al., 2023c] to estimate the ground-truth labels, and then incorporate the estimated labels to constrained clustering methods. However, because of the limited effectiveness of existing methods in disambiguation, the performance improvement observed in experiments is marginal. To address the above issue, we propose a novel method, named PLC (Partial Label Clustering), to take advantage of the available partial labels to construct a clustering model. Specifically, we first establish a weight matrix for examples based on their similarity in the feature space and perform label disambiguation based on the weight matrix to obtain label information. Then we use the label information to establish initial pairwise constraints for the examples, i.e., must-link and cannot-links and propagate the initial inaccurate pairwise constraints through the weight matrix to obtain dense and precise pairwise constraints. Moreover, these two types of constraints form an adversarial relationship which are augmented accordingly. The augmented pairwise constraints are also used to adjust the weight matrix, ultimately improving the quality of the weight matrix. Finally, we use the optimized weight matrix for spectral clustering (SC) to obtain clustering results. The entire model is formulated as a joint optimization problem and solved by alternating optimization. More importantly, we also theoretically prove that a better disambiguated label matrix can improve clustering performance. Our contributions can be summarized as follows:

Proceedings of the Thirty-Fourth International Joint Conference on Artiﬁcial Intelligence (IJCAI-25)

For the first time, we propose the partial label clustering problem and leverage the information from both the feature space and the label space of labeled and unlabeled samples to address this problem. We theoretically prove that a better disambiguated label matrix can improve clustering performance, which indicates that utilizing label disambiguation to improve clustering performance is effective. Extensive experiments demonstrate that the proposed PLC method achieves superior performances than the constrained clustering, PLL and semi-supervised PLL methods.

2 Related Work In this section, we briefly introduce the preliminary work and studies related to our PLC method. Partial Label Learning. In PLL, we cannot access the ground-truth labels of training examples because the ambiguity of candidate labels poses a significant challenge. The main approach to solve this problem is to disambiguate the candidate label set. For example, some methods establish a parameter model based on the maximum likelihood criterion [Liu and Dietterich, 2012; Jin and Ghahramani, 2002] or maximum margin criterion [Nguyen and Caruana, 2008], and identify the ground-truth label through iterative optimization. [H ullermeier and Beringer, 2005] treats candidate labels equally and makes predictions through weighted voting based on the candidate labels of the example s neighbors. [Zhang et al., 2016a; Wang et al., 2022] construct the weighted graph of the feature space of examples for disambiguation, which utilizes the similarity of examples. [Feng and An, 2019; Jia et al., 2023c] construct the dissimilarity matrix for candidate labels to guide label disambiguation. Although the above PLL methods have achieved satisfactory outcomes, they need a large number of labeled examples for training to achieve the expected results, and perform poorly with a fewer number of labeled examples. To solve this problem, we introduce a constrained clustering model, achieving superior performance with a limited number of labeled examples and many easily available unlabeled examples. Constrained Clustering. As an important type of semisupervised learning method, constrained clustering utilizes some supervision information to enhance clustering performance. Pairwise constraint is a kind of weak supervision, which represents whether two training samples belonging to the same class, i.e., must-links and cannot-links. Several methods have been proposed to exploit pairwise constraints in constrained clustering, for example, [Zhang et al., 2016b] performs low dimensional embedding of pairwise constraints through non-negative matrix factorization, [Jia et al., 2021] exploits the dissimilarity between pairwise constraints to improve the performance of constrained clustering, [Jia et al., 2023a] stacks the pairwise constraint matrix and affinity matrix into a 3-D tensor, and promotes the construction of the affinity matrix through global low-rank constraints. In addition, pairwise constraint propagation (PCP) methods [Lu and Ip, 2010; Jia et al., 2020b] are proposed to address the sparsity problem of pairwise constraints. However, how to ef-

fectively exploit the partial labels to construct a constrained clustering model is still under-explored.

3 The Proposed Method

In this section, we propose the details of PLC, including weight matrix construction, label disambiguation, and pairwise constraint propagation. Afterwards, we integrate them into a unified optimization objective and solve it through alternating optimization.

3.1 Notations

Denote X = [x1, x2, ..., xn] Rn d as the feature matrix, where n and d represent the number of examples and the dimension of features. Y = [y1, y2, ..., yn] {0, 1}n q represents the partial label matrix, where q is the number of classes and yij = 1 means that the j-th label is one of the candidate labels of the sample xi. Specifically, we only have the candidate labels for training examples, and for test examples, the set of candidate labels consists of all labels, i.e., ytest = 1q for all test examples xtest, where 1q is an all ones vector. Given a training set Dtrain and a test set Dtest, the goal of PLC is to construct a weight matrix W Rn n representing the similarity of the examples for the dataset D = Dtrain Dtest. For any example xi in D, we construct the k-nearest neighbors of xi, i.e E = {(xi, xj)|xj KNN(xi), i = j}.

3.2 Constructing Weight Matrix

Given a dataset D and its corresponding neighbor set E, our objective is to derive a weight matrix W = (wij)n n that adequately represents the similarities among the examples. This has been proven to be effective in PLL and SC [Zhang and Yu, 2015; Kang et al., 2018]. Considering that higher weights should be assigned to examples with higher similarity, the weight matrix W can be constructed by solving the following linear least square problem:

(xi,xj) E wijxi

s.t. W 1n = 1n, 0n n W N,

where 1n Rn is an all ones vector and 0n n Rn n is an all zeros matrix. Moreover, N {0, 1}n n is defined as: nij = 1, if (xi, xj) E and nij = 0 otherwise, which indicates that wij 0 ( (xi, xj) E) and wij = 0 ( (xi, xj) / E). The first constraint W 1n = 1n normalizes the weight matrix W and the second constraint 0n n W N ensures the non-negativity of the weight matrix W, while simultaneously ensuring that the weight matrix has values only for the k-nearest neighbors. By solving Eq. (1), we can obtain a weight matrix1 containing the similarities of examples in the feature space.

1Strictly, when applied to clustering, the matrix W should be symmetric. To simplify the model, we ignore symmetry and will subsequently apply symmetry processing to W, i.e., (W + W )/2.

Proceedings of the Thirty-Fourth International Joint Conference on Artiﬁcial Intelligence (IJCAI-25)

3.3 Label Disambiguation Denote F = [f 1, f 2, ..., f n] Rn q as the label confidence matrix, where fij represents the probability of the j-th label being the ground-truth label of sample xi. We initialize the label confidence matrix as follows: Fij = 1/ P j yij if yij = 1, otherwise Fij = 0. The similarity relationships of examples in the feature space should be consistent in the label space, i.e., if two samples are similar in the feature space, they are more likely to share the same ground-truth label. We leverage the weight matrix W to disambiguate the label confidence matrix F, which can be obtained by solving the following problem:

(xi,xj) E wijf i

s.t. F1q = 1n, 0n q F Y,

where 1q Rq is an all ones vector and 0n q Rn q is an all zeros matrix. After the label disambiguation process, we select the class with the highest label confidence of example xi as the pseudo-label y i = arg maxj Y f ij.

3.4 Pairwise Constraint Propagation Based on the pseudo-labels obtained from label disambiguation, we denote the set of must-links (MLs) as M and the set of cannot-links (CLs) as C. If two examples xi and xj have the same pseudo-label, i.e., y i = y j , then (xi, xj) M, otherwise (xi, xj) C. Then the indicator matrices of the MLs and CLs can be defined as:

Mij = 1, if (xi, xj) M, 0, otherwise,

Cij = 1, if (xi, xj) C, 0, otherwise,

MLs and CLs essentially represent the relationships between sample classes, i.e., the similarity relationship and the dissimilarity relationship respectively. However, the initial similarity and dissimilarity relationships are imprecise. To address this issue, we propagate the similarity matrix S Rn n and the dissimilarity matrix D Rn n through the weight matrix W constructed in the feature space, i.e., if two samples are similar in the feature space, their similarity and dissimilarity codings should also be similar. Moreover, the similarity matrix S and the dissimilarity matrix D naturally form an adversarial relationship, i.e., if two samples have high (low) similarity, their dissimilarity should be low (high). Combining propagation and adversarial relationship together, we have the following objective:

min D,S D S 1 +αTr(DLD ) + βTr(SLS )

s.t. S 0n n, D 0n n, Sij = Mij, Dij = Cij, if (xi, xj) M C,

where represents the elementwise product and || ||1 refers to the l1 norm, i.e., ||D S||1 = P ij Dij Sij. L Rn n is the Laplacian matrix of weight matrix W, i.e., L = A W, where A is a diagonal matrix with the i-th diagonal element

Algorithm 1 Pseudo-code of PLC Input: D : the partial label training dataset {(xi, Yi)|1 i n}. k : the number of nearest neighbors. l : the number of clusters. α, β, γ : the hyper-parameters in Eq.(5) and Eq.(10). x : the unseen test example for prediction. Output: C : the clusters {C1, C2, ..., Cl}.

1: Construct partial label dataset of all samples D = D S{(x , Y i )|Y i = 1q} ; 2: Initialize the weight matrix W by solving Eq. (1); 3: Initialize the label confidence matrix F by solving Eq. (2); 4: Initialize pairwise constraints according to Eq. (3); 5: Initialize the similarity matrix S and dissimilarity matrix D by solving Eq. (4); 6: repeat 7: for j = 1 to n do 8: Update the vector ˆwj Rk by solving Eq. (7); 9: end for 10: Update the label confidence matrix F by solving Eq. (8) or iteratively solving the subproblems as Eq. (9); 11: Update the matrix S and D by solving Eq. (10) with KKT conditions; 12: until convergence or maximum number of iterations being reached. 13: Perform SC on W to obtain the final clustering results C. 14: return the final clustering results C = {C1, C2, ..., Cl}.

Aii = Pn j=1 Wij. α, β 0 are two hype-parameters to balance different terms. Through Eq. (4), we can enhance the accuracy and density of information in S and D by incorporating adversarial and propagation terms.

3.5 Overall Model Combining the optimization problems mentioned above, the final optimization objective of our proposed PLC model can be summarized as follows:

min W,F,D,S

(xi,xj) E wijxi

2 + αTr(DLD )

(xi,xj) E wijf i

2 + βTr(SLS )+ D S 1

s.t. W 1n = 1n, 0n n W N, F1q = 1n, 0n q F Y, S 0n n, D 0n n, Sij = Mij, Dij = Cij, if (xi, xj) M C.

3.6 Alternating Optimization Since the optimization objective Eq. (5) consists of multiple linear and quadratic equations and has multiple optimization variables, it is difficult to optimize directly. In this subsection, we solve this problem through alternating optimization methods, i.e., alternately optimizing one variable and fixing the other variables until convergence or reaching the predetermined maximum iterations.

Proceedings of the Thirty-Fourth International Joint Conference on Artiﬁcial Intelligence (IJCAI-25)

Figure 1: ACCs and NMIs when compared with constrained clustering methods under different proportions of partial label training examples on synthetic UCI datasets.

Update W With fixed F, S and D, the subproblem of variable W can be formulated as:

(xi,xj) E wijxi

2 + αTr(DLD )

(xi,xj) E wijf i

2 + βTr(SLS )

s.t. W 1n = 1n, 0n n W N.

Eq. (6) can be separated column-wisely and we can optimize the j-th column vector W j while fixing other column vectors. For the vector W j, only k non-negative elements located in the neighborhood of the sample xj need to be updated. We can simplify the optimization vector W j as ˆwj Rk. Denote the matrix Ofj = [f j f Nj(1), f j f Nj(2), ..., f j f Nj(k)] Rk q and the matrix Oxj = [xj x Nj(1), xj x Nj(2), ..., xj x Nj(k)] Rk d, where Nj( ) represents the k nearest neighbors of the sample xj. The subproblem of Eq. (6) can be reformulated as:

min ˆwj ˆw j (Gxj + Gfj)ˆwj + (αH j + βK j)ˆwj

s.t. ˆw j 1k = 1, 0k ˆwj 1k, (7)

where Gxj = Oxj(Oxj) Rk k, Gfj = Ofj(Ofj) Rk k are two Gram matrices for feature vector xj and label confidence vector f j respectively. H j = [H1j, H2j, ..., Hkj] Rk 1, where Hij = D Nj(i) D j 2 2 represents the difference in dissimilarity encoding between the sample xj and the neighboring sample xi. K j = [K1j, K2j, ..., Kkj] Rk 1, where Kij = S Nj(i) S j 2 2 represents the difference in similarity encoding between the sample xj and the neighboring sample xi. Obviously, Eq. (7) is a standard Quadratic Programming (QP) problem and it can be efficiently solved by any QP tools. Update F With fixed W, S and D, the subproblem of variable F is the same as Eq. (2). Note that P (xi,xj) E wij = 1, we

can define T = WW +In n W 1n n W 2W, where In n is an n-order identity matrix and 1n n is an matrix of all ones. The Eq. (2) can be rewritten as:

j=1 tijf i f j

s.t. F1q = 1n, 0n q F Y.

This is also a QP problem, which can be solved by any QP tools. However, this QP problem contains nq variables and n(q + 1) constraints, which leads to excessive computational

Proceedings of the Thirty-Fourth International Joint Conference on Artiﬁcial Intelligence (IJCAI-25)

Compared Method Lost ρ = 0.05 ρ = 0.10 ρ = 0.15 ρ = 0.20 ρ = 0.30 ρ = 0.40 PLC (Ours) 0.399 0.027 0.497 0.024 0.512 0.024 0.553 0.026 0.608 0.021 0.641 0.022 K-means 0.205 0.013 0.204 0.011 0.222 0.013 0.203 0.013 0.202 0.010 0.209 0.017 SC 0.285 0.019 0.309 0.027 0.305 0.012 0.306 0.011 0.317 0.018 0.303 0.021 SSC-TLRR 0.266 0.029 0.233 0.014 0.243 0.010 0.273 0.014 0.328 0.008 0.397 0.015 DP-GLPCA 0.208 0.018 0.202 0.024 0.211 0.027 0.220 0.014 0.200 0.018 0.209 0.026 SSSC 0.296 0.027 0.299 0.015 0.304 0.027 0.304 0.021 0.298 0.020 0.304 0.018 Compared Method MSRCv2 ρ = 0.05 ρ = 0.10 ρ = 0.15 ρ = 0.20 ρ = 0.30 ρ = 0.40 PLC (Ours) 0.337 0.026 0.359 0.020 0.372 0.021 0.404 0.016 0.425 0.015 0.433 0.017 K-means 0.243 0.022 0.229 0.008 0.245 0.022 0.266 0.017 0.235 0.015 0.265 0.015 SC 0.297 0.011 0.293 0.006 0.293 0.010 0.295 0.012 0.284 0.010 0.291 0.008 SSC-TLRR 0.298 0.013 0.263 0.012 0.261 0.010 0.269 0.010 0.301 0.018 0.355 0.013 DP-GLPCA 0.329 0.016 0.331 0.011 0.314 0.018 0.350 0.021 0.321 0.014 0.317 0.014 SSSC 0.297 0.015 0.337 0.023 0.354 0.012 0.337 0.026 0.294 0.023 0.280 0.016 Compared Method Mirflickr ρ = 0.05 ρ = 0.10 ρ = 0.15 ρ = 0.20 ρ = 0.30 ρ = 0.40 PLC (Ours) 0.567 0.021 0.585 0.014 0.579 0.020 0.587 0.022 0.589 0.022 0.596 0.022 K-means 0.299 0.008 0.296 0.007 0.299 0.017 0.304 0.008 0.298 0.008 0.303 0.015 SC 0.282 0.006 0.283 0.001 0.283 0.003 0.285 0.002 0.284 0.004 0.284 0.003 SSC-TLRR 0.415 0.022 0.423 0.021 0.359 0.018 0.385 0.042 0.435 0.010 0.474 0.012 DP-GLPCA 0.465 0.013 0.460 0.012 0.460 0.003 0.464 0.004 0.465 0.013 0.459 0.005 SSSC 0.352 0.002 0.350 0.004 0.351 0.014 0.363 0.021 0.364 0.012 0.366 0.020 Compared Method Bird Song ρ = 0.05 ρ = 0.10 ρ = 0.15 ρ = 0.20 ρ = 0.30 ρ = 0.40 PLC (Ours) 0.612 0.018 0.627 0.020 0.632 0.024 0.638 0.031 0.643 0.027 0.654 0.023 K-means 0.280 0.041 0.303 0.022 0.283 0.031 0.285 0.026 0.286 0.028 0.281 0.029 SC 0.437 0.003 0.437 0.004 0.437 0.004 0.442 0.004 0.440 0.004 0.440 0.003 SSC-TLRR 0.403 0.016 0.431 0.042 0.480 0.045 0.491 0.038 0.569 0.012 0.590 0.016 DP-GLPCA 0.476 0.002 0.482 0.003 0.485 0.003 0.488 0.007 0.494 0.007 0.490 0.006 SSSC 0.453 0.013 0.471 0.016 0.479 0.020 0.484 0.023 0.442 0.032 0.423 0.017

Table 1: Experimental results on ACC when compared with constrained clustering methods under different proportions of partial label training examples on real-world datasets, where bold and underlined indicate the best and second best results respectively.

overhead when nq is large. According to [Zhang et al., 2016a], we can update the label confidence vector f j while fixing other vectors:

min f j tjjf j f j + ( Xn

i=1,i =j(tij + tji)f i )f j

s.t. f j1q = 1, 0q f j yj. (9)

Eq. (9) is one of the subproblems for updating F with q variables and q + 1 constraints and it can be efficiently solved. Update S and D With fixed W and F, variable S and D can be updated simultaneously by solving Eq. (4). To handle the equality constraint efficiently, Eq. (4) is approximated as:

min D,S D S 1 +αTr(DLD ) + βTr(SLS )

+ γ( P (S M) 2 F + P (D C) 2 F ) s.t. S 0n n, D 0n n,

where P represents the position of PCs, and Pij = 1 if (xi, xj) M C, otherwise Pij = 0. Eq. (10) is convex to S (or D) when D (or S) is fixed. We can introduce the Lagrange multiplier matrices ΦS Rn n and ΦD Rn n to deal with the non-negative constraint and the Lagrangian function is expressed as:

L = D S 1 +αTr(DLD ) + βTr(SLS )

+ γ( P (S M) 2 F + P (D C) 2 F )

+ Tr(Φ S S) + Tr(Φ D D),

according to the Karush Kuhn Tucker (KKT) conditions, we have S ΦS = 0 and D ΦD = 0. Let L/ Sij = 0 and L/ Dij = 0, the updating formula are shown as follows:

Sij = Sij (2γP M + 2βSW)ij (D + 2βSA + 2γP S)ij , (12)

Dij = Dij (2γP C + 2αDW)ij (S + 2αDA + 2γP D)ij . (13)

After alternating optimization, we obtain a weight matrix that can fully represent the similarity relationship of the samples. Finally, we apply SC on the weight matrix W to get the final clustering results. The overall pseudo-code of our PLC method is summarized in Algorithm 1 and the computational complexity analysis of Algorithm 1 can be found in Appendix C.

4 Theoretical Analysis

Theorem 1. Denote F [0, 1]n q and W [0, 1]n n as the label confidence matrix and the weight matrix to be optimized. Let FG and WG be the ground-truth label matrix and the optimal weight matrix under the ground-truth labels. We assume that WG is constructed on the premise that the ground-truth labels of neighboring examples are the same. Let λ be the smallest eigenvalue of F GFG and || W||F be the average distance of each corresponding position between

Proceedings of the Thirty-Fourth International Joint Conference on Artiﬁcial Intelligence (IJCAI-25)

Compared Method Lost MSRCv2 ρ = 0.01 ρ = 0.02 ρ = 0.05 ρ = 0.10 ρ = 0.01 ρ = 0.02 ρ = 0.05 ρ = 0.10 PLC (Ours) 0.341 0.009 0.353 0.026 0.399 0.027 0.497 0.024 0.329 0.008 0.325 0.009 0.337 0.026 0.359 0.020

DPCLS 0.190 0.029 0.219 0.023 0.377 0.033 0.456 0.059 0.185 0.035 0.234 0.038 0.299 0.039 0.356 0.029 AGGD 0.184 0.026 0.216 0.012 0.375 0.035 0.447 0.040 0.186 0.037 0.229 0.040 0.280 0.044 0.334 0.019 IPAL 0.163 0.019 0.230 0.023 0.314 0.025 0.409 0.046 0.181 0.058 0.227 0.051 0.280 0.025 0.346 0.034 PL-KNN 0.147 0.047 0.170 0.017 0.185 0.015 0.207 0.015 0.144 0.012 0.171 0.034 0.229 0.033 0.273 0.027 PL-SVM 0.189 0.020 0.192 0.020 0.242 0.038 0.277 0.029 0.185 0.048 0.202 0.042 0.227 0.046 0.287 0.020 PARM 0.112 0.037 0.143 0.030 0.267 0.024 0.348 0.043 0.225 0.065 0.243 0.047 0.286 0.035 0.334 0.041 SSPL 0.132 0.063 0.180 0.038 0.253 0.043 0.300 0.041 0.265 0.061 0.267 0.056 0.314 0.052 0.386 0.043 Compared Method Mirflickr Bird Song ρ = 0.01 ρ = 0.02 ρ = 0.05 ρ = 0.10 ρ = 0.01 ρ = 0.02 ρ = 0.05 ρ = 0.10 PLC (Ours) 0.485 0.041 0.502 0.040 0.567 0.021 0.585 0.014 0.520 0.047 0.539 0.034 0.612 0.018 0.627 0.020 DPCLS 0.369 0.066 0.440 0.046 0.507 0.035 0.579 0.034 0.492 0.045 0.526 0.035 0.618 0.019 0.653 0.018 AGGD 0.366 0.082 0.417 0.055 0.490 0.042 0.555 0.017 0.488 0.044 0.534 0.037 0.612 0.019 0.638 0.014 IPAL 0.268 0.098 0.330 0.056 0.426 0.041 0.475 0.016 0.476 0.043 0.510 0.044 0.586 0.024 0.634 0.018 PL-KNN 0.222 0.056 0.252 0.058 0.324 0.033 0.380 0.024 0.365 0.072 0.434 0.038 0.532 0.021 0.571 0.014 PL-SVM 0.286 0.054 0.302 0.057 0.436 0.070 0.494 0.035 0.424 0.054 0.449 0.025 0.559 0.029 0.586 0.022 PARM 0.399 0.073 0.424 0.073 0.451 0.050 0.500 0.026 0.379 0.029 0.357 0.024 0.353 0.013 0.379 0.026 SSPL 0.379 0.084 0.397 0.049 0.399 0.038 0.428 0.024 0.467 0.036 0.479 0.037 0.521 0.034 0.576 0.025

Table 2: Experimental results on ACC when compared with PLL and semi-supervised PLL methods under different proportions of partial label training examples on real-world datasets, where bold and underlined indicate the best and second best results respectively.

WG and W, i.e., || W||F = 1 n2 ||WG W||F . Then we have

|| W||F n + 2

λn ||F F F GFG||F + 2n + q 1

The proof can be found in Appendix B. From Theorem 1, we find that a smaller difference between F and FG can reduce the upper bound of || W||F , indicating that better disambiguation results help achieve a better weight matrix, thereby improving clustering performance.In summary, we prove that, under some general assumptions, a better disambiguated label matrix improves clustering performance.

5 Experiments 5.1 Experimental Setup Datasets To conduct a comprehensive evaluation of our proposed method, we compare our PLC method with other methods on both controlled UCI datasets and real-world datasets. The characteristics of controlled UCI datasets and real-world datasets can be found in Appendix D. Following the widelyused partial label data generation protocol [Cour et al., 2011], we generate the artificial partial label datasets under the controlling parameter r which controls the number of falsepositive labels. For each example, we randomly select r other labels as false-positive labels. Compared Methods To demonstrate the effectiveness of our PLC method, we compare it with two baseline clustering methods, three state-of-the-art constrained clustering methods, five well-established PLL methods, and two semisupervised PLL methods. (1) Baseline clustering method: KMeans [Macqueen, 1967] and SC [Ng et al., 2001]. (2) Constrained clustering methods: SSC-TLRR [Jia et al., 2023a], DP-GLPCA [Jia et al., 2021] and SSSC [Jia et al., 2018]. (3) PLL methods: PL-KNN [Cour et al., 2009], PL-SVM [Nguyen and Caruana, 2008], DPCLS [Jia et al., 2023c], AGGD [Wang et al., 2022] and IPAL [Zhang and Yu, 2015]. (4) Semi-supervised PLL methods: SSPL [Wang et al., 2019]

and PARM [Wang and Zhang, 2020]. Each compared method is implemented with the default hyper-parameter setup suggested in the respective literature. Parameters for our PLC method are set as α, β {0.01, 0.1, 1}, γ = 10 and k {10, 15, 20, 25, 30, 40}. Implementation Details For constrained clustering methods, we randomly sample the partial label examples based on the proportion ρ {0.05, 0.10, 0.15, 0.20, 0.30, 0.40} and the remaining samples are used as test data. For PLL and semi-supervised PLL methods, we randomly sample the partial label examples based on the proportion ρ {0.01, 0.02, 0.05, 0.10}. For each experiment, we implemented 10 times with random partitions and reported the average performance with the standard deviation. We used ACC and NMI as evaluation metrics. For detailed information, please refer to Appendix A.

5.2 Experimental Results Comparison with Constrained Clustering Methods Fig. 1 illustrates the ACCs and the NMIs of our PLC method compared to constrained clustering methods under different proportions of partial label training examples on synthetic UCI datasets. According to these figures, our PLC method ranks first in 88.9% (64/72) cases. Table 1 reports the ACCs of our PLC method compared to constrained clustering methods under different proportions of partial label training examples on real-world datasets. Our PLC method ranks first in 100% (24/24) cases. Moreover, we can find that constrained clustering methods perform poorly in some cases, and there is a decline in performance despite an increase in the proportion of partial labeled samples, indicating that constrained clustering methods are difficult to handle the ambiguous labels. In contrast, our PLC method performs better in most cases, indicating that the PLC method can effectively utilize the partial label samples to predict the unlabeled samples.

Comparison with PLL & Semi-supervised PLL Methods Table 2 reports the ACCs of our PLC method compared to PLL and semi-supervised PLL methods under different

Proceedings of the Thirty-Fourth International Joint Conference on Artiﬁcial Intelligence (IJCAI-25)

Compared Method Lost ρ = 0.05 ρ = 0.10 ρ = 0.15 ρ = 0.20 ρ = 0.30 ρ = 0.40 PLC 0.399 0.027 0.497 0.024 0.512 0.024 0.553 0.026 0.608 0.021 0.641 0.022 PLC-CW 0.349 0.021 0.352 0.015 0.349 0.021 0.351 0.021 0.361 0.018 0.348 0.015 PLC-LD 0.355 0.021 0.379 0.019 0.371 0.024 0.393 0.018 0.395 0.016 0.402 0.014 PLC-SD 0.350 0.021 0.344 0.032 0.307 0.026 0.265 0.026 0.211 0.011 0.211 0.016 Compared Method MSRCv2 ρ = 0.05 ρ = 0.10 ρ = 0.15 ρ = 0.20 ρ = 0.30 ρ = 0.40 PLC 0.337 0.026 0.359 0.020 0.372 0.021 0.404 0.016 0.425 0.015 0.433 0.017 PLC-CW 0.318 0.015 0.315 0.007 0.317 0.014 0.320 0.011 0.318 0.012 0.317 0.013 PLC-LD 0.322 0.015 0.320 0.014 0.321 0.009 0.323 0.013 0.324 0.014 0.342 0.016 PLC-SD 0.319 0.009 0.313 0.013 0.297 0.017 0.270 0.021 0.209 0.028 0.198 0.015

Table 3: Ablation study of the proposed method PLC, where bold indicates the best result.

(a) Varying α

(b) Varying β

(c) Varying k

Figure 2: Parameter sensitivity analysis for PLC. (a) ACCs of PLC on Lost and MSRCv2 by varying α; (b) ACCs of PLC on Lost and MSRCv2 by varying β; (c) ACCs of PLC on Lost and MSRCv2 by varying k;

proportions of partial label training examples on real-world datasets. Our PLC method achieves superior performance against PL-KNN, PL-SVM, IPAL, PARM in 100% (16/16) cases, against SSPL in 93.75% (15/16) cases, and against AGGD and DPCLS in 87.5% (14/16) cases. We can find that our PLC method performs better in the case of fewer partial labeled samples than PLL methods and semi-supervised PLL methods, indicating that our PLC method is helpful in addressing the issue of insufficient partial labels. More experimental results can be found in Appendix E. We also conduct a significance analysis to prove that our PLC method is significantly superior to the comparing methods. The results can be found in Appendix E.4.

5.3 Further Analysis Ablation Study. In Table 3, we conduct an ablation study on Lost and MSRCv2 datasets with different proportion ρ {0.05, 0.10, 0.15, 0.20, 0.30, 0.40} to check the necessity of the terms involved in our method. Specifically, PLC-CW represents the PLC version that only utilizes sample features to construct a weight matrix and perform the SC, PLC-LD represents the PLC version that utilizes sample features and disambiguated partial labels to construct a weight matrix and perform SC, and PLC-SD represents the PLC version that utilizes pairwise constraint propagation to densify the initial pairwise constraint and perform SC. The results in Table 3 indicate that label disambiguation and pairwise constraint propagation are helpful in improving classification accuracy, and taking both into account is the best choice. Parameter Sensitivity Fig. 2 reports the performance of

our PLC method under different parameter configurations on datasets Lost and MSRCv2. According to Fig. 2, the performance of our PLC method is relative stable in the different settings of α and β (Fig. 2(a) and Fig. 2(b)). We can directly fix α, β = 0.1 in practice. In addition, the accuracy curves in Fig. 2(c) indicate that our PLC is relative sensitive to the parameter k. In practice, we find that setting a smaller k {10, 15} for small-scale datasets and a larger k {20, 30, 40} for large-scale datasets can improve the performance of our PLC method.

6 Conclusion

In this paper, we proposed a novel method named PLC. For the first time, we explore whether partial labels can help improve the performance of clustering methods. Specifically, we first construct a weight matrix based on the relationships in the feature space and disambiguate the candidate labels to estimate the ground-truth label. Then we establish the must-link constraint and the cannot-link constraint based on the disambiguation results. Finally, we propagate the initial pairwise constraints and form an adversarial relationship between these two constraints to improve clustering performance. Moreover, we theoretically prove that a better disambiguated label matrix improves clustering performance, confirming that using label disambiguation to enhance clustering performance is effective. Our PLC method achieves superior performance compared to conventional constrained clustering, PLL and semi-supervised PLL methods on both three synthetic datasets and four real-world datasets.

Proceedings of the Thirty-Fourth International Joint Conference on Artiﬁcial Intelligence (IJCAI-25)

Acknowledgments This work was supported in part by the National Natural Science Foundation of China under Grants U24A20322 and in part by the Big Data Computing Center of Southeast University.

References [Chen et al., 2018] Ching-Hui Chen, Vishal M. Patel, and Rama Chellappa. Learning from ambiguously labeled face images. IEEE Transactions on Pattern Analysis and Machine Intelligence, 40(7):1653 1667, 2018. [Cour et al., 2009] Timothee Cour, Benjamin Sapp, Chris Jordan, and Ben Taskar. Learning from ambiguously labeled images. In IEEE Conference on Computer Vision & Pattern Recognition, pages 919 926, 2009. [Cour et al., 2011] Timothee Cour, Benjamin Sapp, and Ben Taskar. Learning from partial labels. Journal of Machine Learning Research, 12(4):1501 1536, 2011. [Feng and An, 2019] Lei Feng and Bo An. Partial label learning by semantic difference maximization. In Twenty Eighth International Joint Conference on Artificial Intelligence IJCAI-19, 2019. [H ullermeier and Beringer, 2005] Eyke H ullermeier and J urgen Beringer. Learning from ambiguously labeled examples. Intell. Data Anal., 10:419 439, 2005. [Jia et al., 2018] Yuheng Jia, Sam Kwong, and Junhui Hou. Semi-supervised spectral clustering with structured sparsity regularization. IEEE Signal Processing Letters, PP(3):1 1, 2018. [Jia et al., 2020a] Yuheng Jia, Junhui Hou, and Sam Tak Wu Kwong. Constrained clustering with dissimilarity propagation-guided graph-laplacian pca. IEEE Transactions on Neural Networks and Learning Systems, 32:3985 3997, 2020. [Jia et al., 2020b] Yuheng Jia, Hui Liu, Junhui Hou, and Sam Kwong. Pairwise constraint propagation with dual adversarial manifold regularization. IEEE Transactions on Neural Networks and Learning Systems, PP(99):1 13, 2020. [Jia et al., 2021] Yuheng Jia, Junhui Hou, and Sam Kwong. Constrained clustering with dissimilarity propagationguided graph-laplacian pca. IEEE Transactions on Neural Networks and Learning Systems, 32(9):3985 3997, 2021. [Jia et al., 2023a] Yuheng Jia, Guanxing Lu, Hui Liu, and Junhui Hou. Semi-supervised subspace clustering via tensor low-rank representation. IEEE Transactions on Circuits and Systems for Video Technology, 33(7):3455 3461, 2023. [Jia et al., 2023b] Yuheng Jia, Chongjie Si, and Min-Ling Zhang. Complementary classifier induced partial label learning. In Proceedings of the 29th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, pages 974 983, 2023. [Jia et al., 2023c] Yuheng Jia, Fuchao Yang, and Yongqiang Dong. Partial label learning with dissimilarity propagation guided candidate label shrinkage. In Advances in

Neural Information Processing Systems, volume 36, pages 34190 34200. Curran Associates, Inc., 2023. [Jia et al., 2024] Yuheng Jia, Xiaorui Peng, Ran Wang, and Min-Ling Zhang. Long-tailed partial label learning by head classifier and tail classifier cooperation. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 38, pages 12857 12865, 2024. [Jiang et al., 2024] Jiahao Jiang, Yuheng Jia, Hui Liu, and Junhui Hou. Fairmatch: Promoting partial label learning by unlabeled samples. In Proceedings of the 30th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, pages 1269 1278, 2024. [Jin and Ghahramani, 2002] Rong Jin and Zoubin Ghahramani. Learning with multiple labels. In Neural Information Processing Systems, 2002. [Kang et al., 2018] Zhao Kang, Chong Peng, Qiang Cheng, and Zenglin Xu. Unified spectral clustering with optimal graph. In Proceedings of the AAAI conference on artificial intelligence, volume 32, 2018. [Liu and Dietterich, 2012] Li-Ping Liu and Thomas G. Dietterich. A conditional multinomial mixture model for superset label learning. In Neural Information Processing Systems, 2012. [Lu and Ip, 2010] Zhiwu Lu and Horace Ho-Shing Ip. Constrained spectral clustering via exhaustive and efficient constraint propagation. In European Conference on Computer Vision, 2010. [Luo and Orabona, 2010] Jie Luo and Francesco Orabona. Learning from candidate labeling sets. In Neural Information Processing Systems, 2010. [Macqueen, 1967] J. Macqueen. Some methods for classification and analysis of multivariate observations. Proc. Symp. Math. Statist. and Probability, 5th, 1, 1967. [Ng et al., 2001] A. Ng, Michael I. Jordan, and Yair Weiss. On spectral clustering: Analysis and an algorithm. In Neural Information Processing Systems, 2001. [Nguyen and Caruana, 2008] Nam Nguyen and Rich Caruana. Classification with partial labels. In Knowledge Discovery and Data Mining, 2008. [Tian et al., 2023] Yingjie Tian, Xiaotong Yu, and Saiji Fu. Partial label learning: Taxonomy, analysis and outlook. Neural Networks, 161:708 734, 2023. [Wang and Zhang, 2020] Wei Wang and Min-Ling Zhang. Semi-supervised partial label learning via confidencerated margin maximization. In Neural Information Processing Systems, 2020. [Wang et al., 2019] Qian-Wei Wang, Yu-Feng Li, and Zhi Hua Zhou. Partial label learning with unlabeled data. In International Joint Conference on Artificial Intelligence, 2019. [Wang et al., 2022] Deng-Bao Wang, Min-Ling Zhang, and Li Li. Adaptive graph guided disambiguation for partial label learning. IEEE Transactions on Pattern Analysis and Machine Intelligence, 44(12):8796 8811, 2022.

Proceedings of the Thirty-Fourth International Joint Conference on Artiﬁcial Intelligence (IJCAI-25)

[Xu et al., 2024] Ning Xu, Congyu Qiao, Yuchen Zhao, Xin Geng, and Min-Ling Zhang. Variational label enhancement for instance-dependent partial label learning. IEEE Transactions on Pattern Analysis and Machine Intelligence, 46(12):11298 11313, 2024. [Yang et al., 2025] Fuchao Yang, Jianhong Cheng, Hui Liu, Yongqiang Dong, Yuheng Jia, and Junhui Hou. Mixed blessing: Class-wise embedding guided instancedependent partial label learning. In Proceedings of the 31st ACM SIGKDD Conference on Knowledge Discovery and Data Mining, 2025. [Zhang and Yu, 2015] Min-Ling Zhang and Fei Yu. Solving the partial label learning problem: An instance-based approach. In International Joint Conference on Artificial Intelligence, 2015. [Zhang et al., 2016a] Min-Ling Zhang, Bin-Bin Zhou, and Xu-Ying Liu. Partial label learning via feature-aware disambiguation. pages 1335 1344, 08 2016. [Zhang et al., 2016b] Xianchao Zhang, Linlin Zong, Xinyue Liu, and Jiebo Luo. Constrained clustering with nonnegative matrix factorization. IEEE Transactions on Neural Networks and Learning Systems, 27(7):1514 1526, 2016.

Proceedings of the Thirty-Fourth International Joint Conference on Artiﬁcial Intelligence (IJCAI-25)