# semisupervised_clustering_via_pairwise_constrained_optimal_graph__69440eed.pdf Semi-supervised Clustering via Pairwise Constrained Optimal Graph Feiping Nie1, , Han Zhang1 , Rong Wang2,1 and Xuelong Li1 1School of Computer Science and Center for OPTical IMagery Analysis and Learning (OPTIMAL), Northwestern Polytechnical University, Xi an, 710072, Shaanxi, P. R. China 2School of Cybersecurity, Northwestern Polytechnical University, Xi an, 710072, P. R. China {feipingnie,zhanghan9937}@gmail.com, wangrong07@tsinghua.org.cn, li@nwpu.edu.cn In this paper, we present a technique of definitely addressing the pairwise constraints in the semisupervised clustering. Our method contributes to formulating the cannot-link relations and propagating them over the affinity graph flexibly. The pairwise constrained instances are provably guaranteed to be in the same or different connected components of the graph. Combined with the Laplacian rank constraint, the proposed model learns a Pairwise Constrained structured Optimal Graph (PCOG), from which the specified c clusters supporting the known pairwise constraints are directly obtained. An efficient algorithm invoked by the label propagation is designed to solve the formulation. Additionally, we also provide a compact criterion to acquire the key pairwise constraints for prompting the semi-supervised graph clustering. Substantial experimental results show that the proposed method achieves the significant improvements by using a few prior pairwise constraints. 1 Introduction Traditional clustering targets at grouping instances according to their inherent similarities without any supervision. However, the attributes of each cluster obtained from unsupervised clustering are mostly unpredictable (e.g., Figure 1). In light of this, semi-supervised clustering attracts lots of interests, which not only guides the clustering results according to preference, but also improves the clustering performance significantly. Semi-supervised clustering frequently uses the prior information of pairwise constraints, including must-link constraints (ML, specify that the pair of instances must be in the same cluster), and the cannot-link constraints (CL, specify that the pair of instances must be in different clusters), since the pairwise constraints are much easier to acquire and more flexible than labels in practice when the cluster number is unavailable. For instance, to distinguish portraits, the images of one person are must linked, whereas the images of different persons are cannot linked. A lot of researches on pairwise constrained clustering have been made and implemented in Contact Author many applications, such as medical diagnosis [Thangavel and Mohideen, 2010], image segmentation [Saha et al., 2016], and information networks [Li et al., 2017], etc. Graph based clustering is always an important branch in the clustering and attracts more and more attentions in recent years, due to the high tractability of graph representations of data, such as the well-known normalized cut (Ncut) [Ng et al., 2002; Shi and Malik, 2000], and ratio cut [Hagen and Kahng, 1992]. The performance of graph clustering depends on a given graph where the pairwise relations of samples are depicted by an affinity matrix, thus it is natural to incorporate pairwise constraints into graphs. Many constrained graph clustering [Lu and Carreira-Perpinan, 2008; Kulis et al., 2009; Smieja et al., 2018] were raised accordingly that first refined the affinity matrix by given pairwise constraints, then performed spectral clustering on the modified graph. However, due to the cannot-link s non-transitive property, they only dealt with the must-link constraints effectively. Instead of modifying graphs, Wang et al. [Wang et al., 2014] solved a constrained spectral clustering under the constraint that the number of satisfied pairwise constraints was greater than a threshold value. Later, Cucuringu [Cucuringu et al., 2016] put forward that traditional spectral clustering was a special case of constrained clustering with implicit pairwise constraints, and they modified the denominator of Ncut s objective function to be the graph Laplacians associated with the defined cannot-link matrix, such that the pairwise constraints were considered into partitioning. Nevertheless, existing approaches still encounter two major issues: i) the cannot-link problem, how to provably ensure the instances under the cannot-link constraint to be in different clusters; ii) the multi-class problem, how to flexibly incorporate pairwise constraints into the affinity matrix for direct multi-class clustering. In other words, how to perform the multi-clustering task under the provably valid supervision of cannot-link constraints remains a crucial challenge. In this paper, we simultaneously address the cannot-link problem as well as the multi-class problem in the constrained graph clustering. To be specific, we contribute to presenting a novel cannot-link graph regularization that provably guarantees each cannot-link constrained instance to be in different clusters, as demonstrated in Theorem 1 and illustrated in Figure 2(a); Proceedings of the Twenty-Ninth International Joint Conference on Artificial Intelligence (IJCAI-20) Figure 1: Grouping images into two clusters: from the view of aggressivity, the three left columns form a cluster, and the rest columns form the other cluster; from biology, each row is more like a cluster. optimizing the graph to support the given must-link as well as cannot-link constraints and has the specified c connected components for direct multi-class clustering, as illustrated in Figure 2(b); providing a simple yet efficient pairwise constraints selection criterion that specifically for improving semisupervised graph clustering. We further conduct substantial experiments and verify that the proposed method reaches the excellent performance by using a few key pairwise constraints. Notations. Throughout the paper, Tr(M) denotes the trace of a matrix M. v 2 is the ℓ2-norm of a vector v. M F denotes the Frobenius norm of M. M T is the transpose of M, and rank(M) is the rank of M. 2 Preliminary The prevalent graph based spectral clustering is a two-step process that first seeks the intrinsic low-dimensional embedding from the pre-constructed affinity graph, and then performs k-means on the embedding to obtain the cluster labels, since the graphs built from the original feature subspace lack of the explicit cluster structure. To deal with it, in this section, we revisit a constrained Laplacian rank algorithm proposed by [Nie et al., 2016] for direct multi-class graph clustering, formulated as: min S S A 2 F , s.t. S 0, S1 = 1, rank(LS) = n c, (1) where 1 = [1, , 1]T Rn. A Rn n is the preconstructed affinity matrix of n data points, and S Rn n is the optimized affinity matrix. LS = DS S is the Laplacian matrix of S, and DS is the degree matrix of S with dii = P j sij. With inputting a rough similarity matrix A, this model obtains a non-negative and normalized approximation S which possesses the exact c connected components. As a result, the quality of original graph is improved by removing excrescent connections and the instances are exactly partitioned into c clusters. This formulation provides how to achieve a c-connected graph. However, whether the incorrect connections are removed and the valid connections are preserved is uncontrolled. In the next section, we elaborate how to overcome this deficiency by tactically incorporating the pairwise constraints and obtaining the desired clusters. 3 Methodology 3.1 Problem Formulation Suppose an affinity matrix A = [aij]n n associated with the dataset X = {x1, x2, , xn}, where xi is a d-dimensional data point, aij represents the similarity between i-th sample and j-th sample. We utilize a few pairwise constraints to guide the partition of X, denoted as: M = {(xi, xj)| i, j > i, xi must link xj}; C = {(xi, xj)| i, j > i, xi cannot link xj}. (2) From the viewpoint of the graph connectivity, when the mustlink constrained instances locate in one connected component while the cannot-link constrained instances locate in different connect components, the constraints can be satisfied intuitively. Moreover, if the edge of must-linked instances exists, they naturally belong to one connected component. However, removing the connections between cannot-linked instances does not work since they still probably connect to each other through intermediate nodes. To address it, we first put forward a theorem to definitely isolate the cannot-link constrained instances from each other, as below. Theorem 1. Suppose a nonnegative graph S, and there exists a cannot-link constraint between xa and xb. y Rn is the cannot-link indicator where ya = 1 and yb = 1. When y T LSy = 0, (3) xa and xb must be in different connected components of S. Proof. By reduction to absurdity. If xa and xb are in the same connected component of S, there is at least a path P = {xa, xk1, , xkt, xb} from xa to xb. Denote J = y T LSy = P i P j(yi yj)2sij. We split J = J (P ) + J ( e P ), where J (P ) 0 is the objectives associated with P and J ( e P ) 0 is the objectives of all paths apart from P . When J = 0 is required, we have J (P ) = 0 as well. Since J (P ) = (ya yk1)2sak1 + + (ykt yb)2sktb, and sak1 > 0, , sktb > 0, it is inferred that ya = yk1 = = ykt = yb, which is contradictory to the conditions of ya = 1, yb = 1. As a result, xa and xb must be in different connected components of S. According to Theorem 1, we present the cannot-link graph regularization to learn the pairwise constrained structured optimal graph S Rn n from the given affinity A under the supervision of the pairwise constraints, formulated as: min S Ω,Y Ψ S A 2 F + γ k=1 y T k LSyk, (4) where γ > 0 is a regularization parameter, and p is the number of the constraints in C. Ω= Ω Ω is the feasible zone of the learned graph S, where Ω inherits from model (1) and Ω is designed for addressing must-link constraints. Y Ψ is the zone feasible of all yk equipped for the cannot-link regularization, i.e., Ω : {S|S 0, S1 = 1, rank(LS) = n c}, Ω : {S| (xi, xj) M : sij = τ, sji = τ}, Ψ : {Y | (xik, xjk) C : yk(ik) = 1, yk(jk) = 1}, Proceedings of the Twenty-Ninth International Joint Conference on Artificial Intelligence (IJCAI-20) subset 1 subset 2 cannot-link must-link 3 4 5 6 7 8 2 cluster 1 cluster 2 cluster 3 Figure 2: The illustration to the proposed method. (a) The cannot-link graph regularization. Suppose a cannot-link constraint between xi and xj. According to Theorem 1, when y T LSy = 0, the paths from xi to xj in the graph S are removed thoroughly; (b) The multi-class clustering. Given three cannot-link constraints between {x1, x8}, {x3, x4} and {x5, x7}. By virtue of the Laplacian rank constraint and the cannot-link graph regularization, the graph S learned from A would has three connected components satisfying cannot-link constraints. Invoked by model (1), S Ω compels that S has exact c connected components. S Ω sets the connections between the must-linked instances by a small value τ (where τ = 0.1 simply). Y Ψ fixes yk(ik) = 1 and yk(jk) = 1 for each yk to label the k-th couple of cannot-link constrained xik and xjk. Benefited from label propagation [Zhu and Ghahramani, 2002], the rest entries in yk are learned and thus the cannot-link relations are propagated over the affinity graph, as shown in Figure 2(a). According to Theorem 1, when γ is large enough, minimizing the regularization term in model 4 makes the cannot-link constraints satisfied simultaneously. By optimizing model (4), we can obtain the Pairwise Constrained Optimal Graph (PCOG) that directly indicates c clusters, where the must-link constrained instances must be in the same cluster and the cannot-link constrained instances must be in different clusters (see Figure 2(b)). In this way, two aforementioned problems of semi-supervised clustering are addressed simultaneously. To our best knowledge, this is the first work to provably address the cannot-linked instances. Next, how to solve model (4) takes the first priority. 3.2 Optimization of Model (4) Following the optimization of model (1), the non-convex rank constraint in model (4) is tackled by solving its counterpart. Denote the i-th smallest eigenvalue of LS as σi(LS). Since LS is positive semi-definite, we have σi(LS) 0 for each i. When the first c smallest σi(LS) equal zero, the constraint rank(LS) = n c is actually achieved. Based on this, model (4) is transformed into: min S,Y S A 2 F + γ k=1 y T k LSyk + λ i=1 σi(LS), s.t. S 0, S1 = 1, S Ω , Y Ψ, where λ is large enough, and Pc i=1 σi(LS) is minimized to be zero. According to Ky Fan s Theorem [Fan, 1950], problem (6) could be further equivalent to: min S,F,Y S A 2 F + γTr(Y T LSY ) + λTr(F T LSF), s.t. S 0, S1T = 1, S Ω , F T F = I, Y Ψ, (7) where F = {f 1, f 2, , f n} Rn c is a manifold embedding of data. Subsequently, we optimize three variables {S, F, Y } in an iterative manner, known as block-coordinate descent method [Tseng, 2001]. (i). When updating S with the fixed F and Y , problem (7) is transformed into: min S S A 2 F + γTr(Y T LSY ) + λTr(F T LSF), s.t. S 0, S1 = 1, S Ω . (8) To address problem above, we introduce an important equation in the spectral analysis [Ng et al., 2002], described as: Tr(Y T LSY ) = 1 yi yj 2 2 sij. (9) Then, we could reformulate problem (8) as i,j (sij aij)2 + X 2 dy ij + λ s.t. S 0, S1 = 1, S Ω , where dy ij = yi yj 2 2 and df ij = f i f j 2 2. Note that the optimization of each column in S is independent, so we decompose problem (10) into solving s.t. si 0, s T i 1 = 1, j, (xi, xj) M : sij = τ, where di = γdy i +λdf i . According to M, we denote the mustlinked objects of the i-th samples as κi = {j|(xi, xj) M}. Fix si,j:j κi = τ, sj:j κi,i = τ, and let si = si,j:j / κi, ai = ai,j:j / κi, di = di,j:j / κi, then problem (11) could be transformed into 2 , s.t. si 0, s T i 1 = 1 |κi|τ, (12) where |κi| denotes the number of elements in κi. Problem (12) could be efficiently addressed by referring to problem (9) in Reference [Nie et al., 2016]. Proceedings of the Twenty-Ninth International Joint Conference on Artificial Intelligence (IJCAI-20) Algorithm 1 Algorithm to solve problem (4) Require: the pairwise constraints M and C, an affinity matrix A, a constant value τ, the parameters λ and γ. Ensure: the graph S with c connected components that supports the given constraints. Initialize F Rn c with the eigenvectors of LA = DA AT +A 2 corresponding to its c smallest eigenvalues; Y Rn p with a random matrix ranging in [ 1, 1] and let yk(ik) = 1, yk(jk) = 1 for k-th pair of cannot-linked samples (xik, xjk) C; κi records the must-linked samples of i-th sample according to M; Let ai = ai,j:j / κi. For each i, set si,j:j κi = τ, sj:j κi,i = τ. while not converge do For each i, j, compute dij = γdy ij + λdf ij where dy ij = yi yj 2 2, df ij = f i f j 2 2; For each i, update si by solving problem (12); For each k, update y(k) u by solving problem (14); Update F by solving problem (15); end while (ii). When updating Y with the fixed F and S, the columns of Y in problem (7) are also independent to each other and could be achieved by solving min yk y T k LSyk, s.t. (xik, xjk) C : yk(ik) = 1, yk(jk) = 1. (13) As stated before, the optimization of yk could be regarded as the label propagation process over the graph S [Zhu and Ghahramani, 2002]. To be specific, we rearrange all samples as X (k) = {xik, xjk, x1, , xn}. Accordingly, the rearranged yk is expressed as y(k) = [y(k) l , y(k) u ]T Rn, where y(k) l = [1, 1]T . The similarity S is rearranged into S(k), whose Laplacian matrix is L(k) S . So, problem (13) is reformulated as min y(k) y(k)T L(k) S y(k), s.t. y(k) l = 1 1 , where L(k) S = " L(k) ll L(k) lu L(k) ul L(k) uu (14) According to the label propagation algorithm [Zhu and Ghahramani, 2002], the closed-form solution to problem (14) is y(k) u = L(k) uu 1L(k) ul y(k) l . (iii). When updating F with the fixed S and Y , problem (7) is equivalent to min F Tr(F T LSF), s.t. F T F = I. (15) Problem (15) degrades to a spectral clustering problem [Ng et al., 2002], whose solution consists of the eigenvectors of LS corresponding to its c smallest eigenvalues. 3.3 Key Pairwise Constraints Selection In semi-supervised clustering approaches, a compromised way to acquire the pairwise constraints is to query pairs of samples as much as possible. Since excessive constraints is labor-intensive, many constraint selection approaches [Craenendonck et al., 2017; Xiong et al., 2014] have been proposed. Different from them, we put forward a compact scheme specifically for semi-supervised graph clustering. Considering that A is sparse that each sample only connects to its k neighbors in the feature space, the rough neighbors could be obtained easily. Based on this, we query pairwise constraints following that: i) the must-link constraints are obtained by querying the non-neighbors of the samples; ii) the cannot-link constraints are obtained by querying the neighbors of the samples. In this way, we can find the key pairwise samples which are easily mis-linked or mis-unlinked in the original feature space. Compared to the random querying, the key querying simply makes the best of the pre-constructed graph and has a great significance in prompting clustering. 3.4 Theoretical Analysis Firstly, we discuss the computational complexity. Since A is sparse and our algorithm only computes the k nonzero connections of each row in A, updating S requires O(n2k). In terms of solving yk, the inverse operation and the label propagation in model (14) requires the complexity of O((n 2)2k). The optimization of F is an eigen-decomposition, requiring O(n2c). Totally, the main computational complexity of our algorithm is O(n2ct + (n 2)2pt), where p is the number of cannot-link constraints and t is the iteration number. Secondly, we talk about the effect of the parameters λ, τ and γ respectively. Actually, τ is a constant that sets the edges between must-linked instances. Since we partition data points according to the graph connectivity which is independent of the intensity of edges, the value of τ does not effect the performance of our algorithm. In terms of λ, as we said, a large enough λ ensures that S possesses c connected components exactly. However, how large λ should be is difficult to seek. Thus, we adopt a widely used manner to determine λ heuristically [Nie et al., 2014]. Specifically, we first initialize λ with a small value like 0.1, and update it according to the number of eigenvalue zero of LS in the iterations. If this number is smaller than c, λ is multiplied by 2; or if it is greater than c + 1, λ is divided by 2, otherwise we terminate the iterations. In terms of γ, when γ , the regularization term infinitely approaches to zero and all of the cannot-link constraints would be satisfied. However, it cannot be neglected that when the cannot-link regularization is addressed too seriously, it would cause meticulous cluster results, and the clustering performance could be degraded. So, γ should be appropriate that is neither too small nor too large. 4 Experiments We conducted extensive experiments to validate the advantages of the proposed PCOG and the proposed key pairwise constraints selection strategy. Settings. The proposed PCOG is evaluated in two aspects: one is the clustering performance, and the other is the ability of satisfying the given pairwise constraints. The clustering performance is calculated by clustering ACCuracy (ACC) and the Normalized Mutual Information (NMI) [Strehl and Proceedings of the Twenty-Ninth International Joint Conference on Artificial Intelligence (IJCAI-20) Nam. Siz. Dim. Cla. UCI Datasets Dermatology 366 34 6 Control 600 60 6 Monk1 432 6 2 Glass 214 9 6 Image Datasets ORL 400 1024 40 COIL20 1440 1024 20 UMIST 1400 1024 200 USPS 2007 256 10 YALE 165 1024 15 Table 1: The numerical introduction to real datasets. Ghosh, 2003]. ACC computes the percentage of correctly clustered samples. and NMI computes the mutual information between cluster labels and real labels. Datasets. The proposed PCOG along with the compared approaches are tested on both toy data and real-world datasets. The real world datasets include four UCI datasets [Dua and Graff, 2017] (Dermatology, Control, Monk1 and Glass) and five image datasets (ORL [Samaria and Harter, 1994], COIL20 [Nene et al., 1996], UMIST [Graham and Allinson, 1998], USPS [Hull, 1994] and YALE [Minear and Park, 2004], as described in Table 1. Comparisons. We compare PCOG with five representative methods including three most related semi-supervised graph clustering approaches (SSGCK, CSCAP and CSP) [Lu and Carreira-Perpinan, 2008; Kulis et al., 2009; Wang et al., 2014]; the unsupervised graph clustering via Laplacian rank constraint (CLR) [Nie et al., 2016], and the classical spectral clustering (Ncut) [Shi and Malik, 2000]. 4.1 Experimental Results on Toy data To verify the effectiveness of the proposed cannot-link graph regularization in PCOG, the toy experiment was Figure 3(a) 1 must-link cannot-link (a) Two-moon data 1 must-link cannot-link Figure 3: Clustering results on two-moon toy data: (a) original samples and pairwise constraints; (b) the given affinity graph A; (c) not using pairwise constraints; (d) using pairwise constraints. 20 40 60 80 100 120 140 160 # of pairwise constraints Accuracy (%) PCOG CSCAP SSGCK CSP CLR Ncut (a) Dermatology 40 80 120 160 200 240 280 320 # of pairwise constraints Accuracy (%) PCOG CSCAP SSGCK CSP CLR Ncut (b) Control 40 80 120 160 200 240 280 320 # of pairwise constraints Accuracy (%) PCOG CSCAP SSGCK CSP CLR Ncut 40 80 120 160 200 240 280 320 # of pairwise constraints Accuracy (%) PCOG CSCAP SSGCK CSP CLR Ncut Figure 4: Comparisons of clustering performance w.r.t. the number of pairwise constraints on four UCI datasets . illustrates the toy data distributed in the shape of two closed half-moons as well as the used must-link constraints and cannot-link constraints. Figure 3(b) illustrates the used graph A constructed by PKN (k = 5) (i.e., a graph construction manner proposed in CLR [Nie et al., 2016] on the original feature space. Figure 3(c) and Figure 3(d) respectively represent the results of CLR and PCOG. It is concluded that 1) the achieved pairwise constraints are between either easily mis-linked points or easily mis-unlinked points as expected; 2) the input graph A constructed in the original feature space cannot reflect any cluster structure; 3) by only using the original features, the unsupervised graph clustering CLR fails in partitioning the pairs of points who are close in the geometrical space but different in terms of labels; 4) via the cannot-link graph regularization, the proposed PCOG ensures each cannot-link constraint, making the associated points into different clusters; 5) incorporated with the must-link constraints, PCOG dramatically enhances the clustering performance where a mass of connected points in the ambiguous areas could be correctly clustered. 4.2 Experimental Results on Real data In this part, we firstly evaluate the clustering performance on four UCI datasets w.r.t. the pairwise constraints, as shown in Figure 4. We fix that a quarter of the total pairwise constraints are CL constraints, and the rest are ML constraints. From the results, it is observed that the proposed method outperforms other methods distinctly. As the number of the pairwise constraints increasing, the clustering accuracy of PCOG and CSCAP has remarkable improvement while CSP and SSGCK are steady. This is because PCOG and CSCAP propagate the cannot-link affinity in the graph, and thus they can achieve Proceedings of the Twenty-Ninth International Joint Conference on Artificial Intelligence (IJCAI-20) ACC Method ORL COIL20 UMIST USPS YALE CLR 0.585 0.865 0.725 0.599 0.393 SSGCK 0.663 0.794 0.820 0.534 0.480 CSCAP 0.725 0.885 0.780 0.583 0.539 CSP 0.595 0.834 0.878 0.630 0.451 PCOG 0.835 1.000 0.930 0.710 0.600 NMI Method ORL COIL20 UMIST USPS YALE CLR 0.768 0.941 0.874 0.665 0.430 SSGCK 0.787 0.742 0.853 0.579 0.541 CSCAP 0.860 0.891 0.801 0.618 0.607 CSP 0.752 0.862 0.901 0.655 0.480 PCOG 0.925 1.000 0.957 0.730 0.750 Table 2: The clustering ACC and NMI on five image datasets. the valid supervision via a few constraints. The methods like CSP and SSGCK have to depend on a lot of given pairwise constraints. Noting that despite an unsupervised manner, the performance of CLR is dramatic compared to other semisupervised methods, showing the validity of Laplacian rank constraint on the graph. Obviously, PCOG defeats CLR in the real datasets as well. Secondly, we evaluate the proposed PCOG in five image datasets, covering the face images (i.e., ORL, UMIST and YALE), the object images (i.e., COIL20) and also the handwritten images (i.e., USPS). All of these algorithms use the same key pairwise constraint sets consisting of 80 cannotlink constraints and 240 must-link constraints, except for the unsupervised CLR. The clustering ACC and NMI of different approaches are recorded in Table 2. The best results are highlighted and the second best results are underlined. It is observed that the proposed PCOG outperforms all the competitors. Moreover, the clustering results in terms of ACC and NMI are improved about 10 percent compared to the second best results, showing the dramatic performance of the proposed algorithm. Additionally, to compare the ability of algorithms in satisfying the given CL constraints, Figure 5(a) recorded the number of violated cannot-link of four pairwise constrained clustering approaches by using ORL. It is obvious that the performance of PCOG surpasses others distinctly. Although CSCAP is the most well known affinity propagation algorithm, its performance of guaranteeing the constraints is very limited compared to the proposed method. 20 100 40 60 80 # of violated constraints PCOG CSCAP SSGCK CSP # of total cannot-link constraints 0.1 0.3 0.5 0.7 . Accuracy(%) Figure 5: (a) The number of the violated CLs w.r.t. the number of total CLs on ORL. (b) The clustering accuracy of PCOG w.r.t. γ. 1 2 3 4 5 6 7 8 9 1011121314151617181920 Iteration number Objective value 1 2 3 4 5 6 7 8 9 1011121314151617181920 Iteration number Objective value (b) Dermatology Figure 6: The convergence demonstration of Algorithm 1. 4.3 Parameter Selection and Convergence Study We have theoretically analyzed that γ intensively impacts our clustering performance. So, we first seek γ in a large range of {10 3, 10 2, 10 1, 100, 101, 102, 103}, and we find that our algorithm works well in a small range [0.1, 1]. As a result, we further search γ from 0.1 to 1 with the interval of 0.2, as shown in Figure 5. The experimental curves obviously reflect the consistent conclusion with the theoretical analysis before. It is also shown that we can obtain the well performance in the range of [0.3, 0.7], and thus we search γ from 0.3 to 0.7 to obtain the best results in other experiments. We fix λ as a constant to investigate the convergence of Algorithm 1 experimentally. Figure 6 shows the objective value of model (7) within twenty iterations on two datasets. From the results, we observe that the proposed algorithm not only decreases the objective value but also converges fast, which demonstrates the high efficiency of the proposed algorithm. 5 Conclusion and Future Works Pairwise constrained clustering is a vital technique in many realistic tasks. However, how to make the best of the cannotlink constraints is a long-term challenge since the cannot-link constraints are difficult to transform and propagate efficiently. In this paper, we present a method that addresses the cannotlink constraints problem via a specific cannot-link graph regularization, provably tackling the cannot-link constraints in the graph learning. We accordingly provide a matchable pairwise constraint selection for graph-based clustering methods. The superiority of the proposed method is verified in both toy data and nine real world datasets, improving the performance of semi-supervised clustering significantly. In the future, the research will proceed and we focus on two remained challenges: i) automatically determining γ in PCOG, releasing the algorithm from parameters; ii) enhancing the efficiency, such that the algorithm is applicable to large-scale data. Acknowledgements This work was supported in part by the Innovation Foundation for Doctor Dissertation of Northwestern Polytechnical University under Grant CX201918, in part by the National Key Research and Development Program of China under Grant 2018AAA0101902, in part by the National Natural Science Foundation of China under Grants 61936014, 61772427 and 61751202, in part by the Fundamental Research Funds for the Central Universities under Grant G2019KY0501. Proceedings of the Twenty-Ninth International Joint Conference on Artificial Intelligence (IJCAI-20) [Craenendonck et al., 2017] Toon Van Craenendonck, Sebastijan Dumancic, and Hendrik Blockeel. Cobra: A fast and simple method for active clustering with pairwise constraints. In Proceedings of International Joint Conference on Artificial Intelligence, 2017. [Cucuringu et al., 2016] Mihai Cucuringu, Ioannis Koutis, Sanjay Chawla, Gary Miller, and Richard Peng. Simple and scalable constrained clustering: a generalized spectral method. In Artificial Intelligence and Statistics, pages 445 454, 2016. [Dua and Graff, 2017] Dheeru Dua and Casey Graff. Uci machine learning repository. http://archive.ics.uci.edu/ml, 2017. [Fan, 1950] Ky Fan. On a theorem of weyl concerning eigenvalues of linear transformations: Ii. Proceedings of the National Academy of Sciences of the United States of America, 36(1):31, 1950. [Graham and Allinson, 1998] Daniel B. Graham and Nigel M. Allinson. Characterising virtual eigensignatures for general purpose face recognition. https: //www.sheffield.ac.uk/eee/research/iel/research/face, 1998. [Hagen and Kahng, 1992] Lars Hagen and Andrew B Kahng. New spectral methods for ratio cut partitioning and clustering. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, 11(9):1074 1085, 1992. [Hull, 1994] Jonathan J. Hull. A database for handwritten text recognition research. https://www.usps.com/ nationalpremieraccounts/findzipcodes.htm, 1994. [Kulis et al., 2009] Brian Kulis, Sugato Basu, Inderjit Dhillon, and Raymond Mooney. Semi-supervised graph clustering: a kernel approach. Machine learning, 74(1):1 22, 2009. [Li et al., 2017] Xiang Li, Yao Wu, Martin Ester, Ben Kao, Xin Wang, and Yudian Zheng. Semi-supervised clustering in attributed heterogeneous information networks. In Proceedings of the 26th International Conference on World Wide Web, pages 1621 1629. International World Wide Web Conferences Steering Committee, 2017. [Lu and Carreira-Perpinan, 2008] Zhengdong Lu and Miguel A Carreira-Perpinan. Constrained spectral clustering through affinity propagation. In 2008 IEEE Conference on Computer Vision and Pattern Recognition, pages 1 8. IEEE, 2008. [Minear and Park, 2004] Meredith Minear and Denise C Park. A lifespan database of adult facial stimuli. http: //vision.ucsd.edu/content/yale-face-database, 2004. [Nene et al., 1996] Sameer A Nene, Shree K Nayar, Hiroshi Murase, et al. Columbia object image library (coil20). http://www.cs.columbia.edu/CAVE/software/softlib/ coil-20.php, 1996. [Ng et al., 2002] Andrew Y Ng, Michael I Jordan, and Yair Weiss. On spectral clustering: Analysis and an algorithm. In Advances in neural information processing systems, pages 849 856, 2002. [Nie et al., 2014] Feiping Nie, Xiaoqian Wang, and Heng Huang. Clustering and projected clustering with adaptive neighbors. In Proceedings of the 20th ACM SIGKDD international conference on Knowledge discovery and data mining, pages 977 986. ACM, 2014. [Nie et al., 2016] Feiping Nie, Xiaoqian Wang, Michael I. Jordan, and Heng Huang. The constrained laplacian rank algorithm for graph-based clustering. In Proceedings of the Thirtieth AAAI Conference on Artificial Intelligence, 2016. [Saha et al., 2016] Sriparna Saha, Abhay Kumar Alok, and Asif Ekbal. Brain image segmentation using semisupervised clustering. Expert Systems with Applications, 52:50 63, 2016. [Samaria and Harter, 1994] Ferdinando S Samaria and Andy C Harter. Parameterisation of a stochastic model for human face identification. https://www.cl.cam.ac.uk/ research/dtg/attarchive/facedatabase.html, 1994. [Shi and Malik, 2000] Jianbo Shi and Jitendra Malik. Normalized cuts and image segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 22(8):888 905, 2000. [ Smieja et al., 2018] Marek Smieja, Oleksandr Myronov, and Jacek Tabor. Semi-supervised discriminative clustering with graph regularization. Knowledge-Based Systems, 151:24 36, 2018. [Strehl and Ghosh, 2003] Alexander Strehl and Joydeep Ghosh. Cluster ensembles - a knowledge reuse framework for combining multiple partitions. JMLR.org, 2003. [Thangavel and Mohideen, 2010] K Thangavel and A Kaja Mohideen. Semi-supervised k-means clustering for outlier detection in mammogram classification. In Trendz in Information Sciences & Computing (TISC2010), pages 68 72. IEEE, 2010. [Tseng, 2001] Paul Tseng. Convergence of a block coordinate descent method for nondifferentiable minimization. Journal of optimization theory and applications, 109(3):475 494, 2001. [Wang et al., 2014] Xiang Wang, Buyue Qian, and Ian Davidson. On constrained spectral clustering and its applications. Data Mining and Knowledge Discovery, 28(1):1 30, 2014. [Xiong et al., 2014] Sicheng Xiong, Javad Azimi, and Xiaoli Z. Fern. Active learning of constraints for semisupervised clustering. IEEE Transactions on Knowledge & Data Engineering, 26(1):43 54, 2014. [Zhu and Ghahramani, 2002] Xiaojin Zhu and Zoubin Ghahramani. Learning from labeled and unlabeled data with label propagation. Technical report, Citeseer, 2002. Proceedings of the Twenty-Ninth International Joint Conference on Artificial Intelligence (IJCAI-20)