Self-paced Consensus Clustering with Bipartite Graph

Peng Zhou1,2, Liang Du3, Xuejun Li1
1 School of Computer Science and Technology, Anhui University
2 State Key Laboratory of Computer Science, Institute of Software, Chinese Academy of Sciences
3 School of Computer and Information Technology, Shanxi University
zhoupeng@ahu.edu.cn, duliang@sxu.edu.cn, xjli@ahu.edu.cn
* Corresponding author

Abstract

Consensus clustering provides a framework for ensembling multiple clustering results to obtain a consensus and robust result. Most existing consensus clustering methods apply all data to ensemble learning, ignoring the side effects caused by difficult or unreliable instances. To tackle this problem, we propose a novel self-paced consensus clustering method that gradually involves instances, from more reliable to less reliable ones, in the ensemble learning. We first construct an initial bipartite graph from the multiple base clustering results, where the nodes represent the instances and clusters and the edges indicate that an instance belongs to a cluster. Then, we learn a structured bipartite graph from the initial one by self-paced learning, i.e., we automatically decide the reliability of each edge and involve the edges in graph learning in order of their reliability. At last, we obtain the final consensus clustering result from the learned bipartite graph. Extensive experimental results demonstrate the effectiveness and superiority of the proposed method.

1 Introduction

Clustering is a fundamental problem in unsupervised learning and has attracted much attention in the past decades. However, according to [Wang et al., 2009a], traditional clustering methods suffer from stability and robustness problems. To address these problems, consensus clustering has been proposed. Consensus clustering provides an elegant framework for integrating multiple weak base clusterings to generate a consensus clustering result [Topchy et al., 2004]. In recent years, many consensus clustering methods have been proposed [Strehl and Ghosh, 2003; Zhou and Tang, 2006; Zhou et al., 2015a; Liu et al., 2018]. For example, [Strehl and Ghosh, 2003] and [Topchy et al., 2003] proposed information-theoretic consensus clustering methods; [Zhou and Tang, 2006] proposed an alignment method to combine multiple k-means clustering results; [Liu et al., 2015] provided a spectral clustering based ensemble method; [Wang et al., 2009b] and [Huang et al., 2016a] introduced probabilistic graphical models into consensus clustering. Besides these works, which ensemble all base clustering results, some works try to select informative and non-redundant base clustering results for the ensemble. For example, [Azimi and Fern, 2009] proposed an adaptive clustering ensemble selection method; [Zhao et al., 2017] applied internal validity indices to select base clustering results.

Note that these methods always apply all instances to ensemble learning. However, since the base results may not be fully reliable, it is inappropriate to always use all data for the ensemble. Intuitively, some instances are unreliable for clustering or are even outliers, which leads to the poor performance of the base clusterings. At the beginning of learning, these unreliable instances may mislead the model, because the early model may not yet have the ability to handle them.
To address this issue, we propose a novel Self-paced Consensus Clustering with Bipartite Graph (SCCBG) method, which involves instances from more reliable to less reliable ones in the ensemble learning. The basic idea is that we train the model with the easier or more reliable instances during the ensemble process, until it is strong enough to handle the difficult ones. Firstly, we construct an initial bipartite graph from the base clustering results, where a node represents an instance or a cluster and an edge indicates that an instance belongs to a cluster. Then, from it, we learn a structured bipartite graph which contains exactly c connected components, where c is the number of clusters. As introduced before, since the base clustering results are imperfect, some edges in the graph are also unreliable. Therefore, we apply the self-paced learning framework to learn the structured graph, i.e., we automatically decide the reliability of each edge and use the edges for learning in order of their reliability. On the one hand, the reliable edges are helpful to the graph learning; on the other hand, as graph learning proceeds, the edges become more and more reliable. By introducing a carefully designed regularization term to characterize the reliability, we seamlessly integrate the reliability evaluation and the graph learning into a unified self-paced learning framework. At last, we obtain the final clustering result from the learned structured graph by finding its connected components. Extensive experiments show that our method often outperforms the state-of-the-art consensus clustering methods.

[Figure 1: An illustration of constructing the bipartite graph (left panel: base clustering results; right panel: the corresponding bipartite graph).]

2 Preliminaries

Throughout this paper, we use boldface uppercase and lowercase letters to denote matrices and vectors, respectively. For a matrix $\mathbf{M}$, we use $\mathbf{M}_{i.}$ and $\mathbf{M}_{.i}$ to denote the i-th row and the i-th column of $\mathbf{M}$, respectively. We denote the (i, j)-th element of $\mathbf{M}$ as $M_{ij}$.

2.1 Consensus Clustering

Let $\mathcal{X} = \{x_1, \dots, x_n\}$ be a data set with n instances. Given a set of m base clusterings $\mathcal{C} = \{C_1, \dots, C_m\}$ of $\mathcal{X}$, each clustering $C_i$ consists of a set of clusters $\{\pi_1^i, \dots, \pi_{k_i}^i\}$, where $k_i$ is the number of clusters in $C_i$ and $\mathcal{X} = \bigcup_{j=1}^{k_i} \pi_j^i$. Consensus clustering aims to learn a consensus partition of $\mathcal{X}$ from the m base clusterings $\mathcal{C} = \{C_1, \dots, C_m\}$ [Strehl and Ghosh, 2003; Topchy et al., 2003; Topchy et al., 2004].

To learn the consensus partition, in this paper we construct a bipartite graph $G = \{V_1, V_2, E\}$ from $\mathcal{C} = \{C_1, \dots, C_m\}$. In more detail, $V_1$ contains n nodes and each node represents an instance. $V_2$ contains $k = \sum_{i=1}^{m} k_i$ nodes and each node represents a cluster $\pi_j^i$ ($i = 1, \dots, m$ and $j = 1, \dots, k_i$). $E$ is the set of edges linking nodes between $V_1$ and $V_2$: if instance $x_i$ belongs to cluster $\pi_p^q$, then there is an edge between $x_i$ and $\pi_p^q$. Fig. 1 shows an illustration of constructing the bipartite graph. In this example, we have 5 instances $x_1, \dots, x_5$ and 3 base clusterings. For example, in the first clustering $C_1$, $x_1$ and $x_2$ belong to cluster $\pi_1^1$, and $x_3$, $x_4$ and $x_5$ belong to cluster $\pi_2^1$. The right side of Fig. 1 shows the corresponding bipartite graph $G$: $V_1$ contains the blue nodes, $V_2$ contains the red nodes, and $E$ denotes the set of edges. After obtaining the bipartite graph $G$, we aim to learn a partition of $G$ as the consensus clustering result.
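To make this construction concrete, the following is a minimal sketch (our own illustrative code and naming, not the authors' implementation) that assembles the $n \times k$ incidence matrix $\mathbf{Y}$ of the bipartite graph from the label vectors of the m base clusterings, with one column block per base clustering:

```python
import numpy as np

def build_bipartite_incidence(base_labels):
    """Assemble the n x k incidence matrix Y of the instance-cluster bipartite graph.

    base_labels: list of m label arrays, each of length n, one per base clustering.
    """
    blocks = []
    for labels in base_labels:
        labels = np.asarray(labels)
        clusters = np.unique(labels)                     # the k_i clusters of this base clustering
        block = (labels[:, None] == clusters[None, :])   # n x k_i block: one column per cluster
        blocks.append(block.astype(float))
    return np.hstack(blocks)                             # Y in {0,1}^{n x k}, k = sum_i k_i

# Toy usage mirroring Fig. 1: 5 instances, 3 base clusterings (2 + 2 + 3 = 7 cluster nodes).
Y = build_bipartite_incidence([[0, 0, 1, 1, 1], [0, 0, 0, 1, 1], [0, 1, 1, 2, 2]])
print(Y.shape)  # (5, 7)
```

With $\mathbf{Y}$ in hand, the adjacency matrix of $G$ is the block matrix defined in Section 3.1 below.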
2.2 Self-paced Learning

The basic idea of self-paced learning is to incrementally involve instances in learning, where easy ones are involved first and difficult ones are involved gradually afterwards [Kumar et al., 2010]. More formally, given a data set $D = \{(x_1, y_1), \dots, (x_n, y_n)\}$, where $x_i \in \mathbb{R}^d$ is the feature vector of the i-th instance and $y_i$ is its label, we denote by $L(g(x_i, \theta), y_i)$ the loss of the i-th instance, where $g(x_i, \theta)$ is the decision function and $\theta$ is the model parameter. According to [Zhang et al., 2017], self-paced learning introduces a weighted loss term on the instances and a general regularization term on the weights as follows:

$$\min_{\theta, \mathbf{w}} \ \sum_{i=1}^{n} \left[ w_i L(g(x_i, \theta), y_i) + f(w_i, \lambda) \right], \quad (1)$$

where $\lambda$ is the age parameter, which controls the learning pace and grows with the learning process, and $f(w_i, \lambda)$ is the self-paced regularization term. In Eq.(1), if we fix the model parameter $\theta$ and let $w_i^*(\lambda, l_i)$ denote the optimal weight of $x_i$, where $l_i = L(g(x_i, \theta), y_i)$, then $f(w_i, \lambda)$ should be chosen such that $w_i^*(\lambda, l_i)$ is monotonically decreasing with $l_i$ and increasing with $\lambda$, as suggested in [Jiang et al., 2015; Meng et al., 2017]. Note that, on the one hand, since $w_i^*(\lambda, l_i)$ is decreasing with $l_i$, easy instances, which have low loss, receive large weights, which means they are involved in learning first. On the other hand, $w_i^*(\lambda, l_i)$ is increasing with $\lambda$, which means that as learning proceeds ($\lambda$ grows), more and more instances are involved in learning. Therefore, self-paced learning optimizes Eq.(1) via alternating minimization: fixing $\theta$ and solving for $\mathbf{w}$ learns the weight of each instance and finds the easy ones; solving for $\theta$ with $\mathbf{w}$ fixed learns the model using the easy instances.

Due to its promising performance and its benefit of alleviating the local-optimum problem in non-convex optimization [Basu and Christensen, 2013], self-paced learning has been widely used in many scenarios, such as multi-task learning [Li et al., 2017], robust classification [Ren et al., 2017] and subspace learning [Jiang et al., 2018]. In this paper, we extend it to unsupervised ensemble learning.
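As a concrete instance of Eq.(1) (a worked step added here for illustration; it is not spelled out in the text), consider the widely used hard regularizer $f(w_i, \lambda) = -\lambda w_i$ with $w_i \in [0, 1]$, which is the per-entry form of the $-\lambda \|\mathbf{W}\|_1$ term adopted later in Eq.(5). With $\theta$ fixed, each weight can be solved independently in closed form:

$$w_i^*(\lambda, l_i) = \arg\min_{w_i \in [0,1]} \ w_i l_i - \lambda w_i = \begin{cases} 1, & l_i < \lambda, \\ 0, & l_i \ge \lambda. \end{cases}$$

This closed form satisfies both required monotonicity properties: instances whose loss falls below the current age threshold $\lambda$ are fully involved in learning, and raising $\lambda$ admits more instances into training.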
3 Self-paced Consensus Clustering with Bipartite Graph

In this section, we introduce the proposed SCCBG method in detail.

3.1 Formulation

Given m base clustering results, we first construct the initial bipartite graph as introduced in Section 2.1. Note that this initial bipartite graph may not have a very clear clustering structure, since each base clustering may not be perfect. Taking Fig. 1 as an example, we find that there is only one connected component in the graph, i.e., all instances are entangled together. To obtain a clearer clustering structure and the final consensus partition, we need to learn a structured graph $G'$ which has exactly c connected components, where c is the number of clusters. Clustering on $G'$ is then trivial, because we just need to put the instances in the same connected component into the same cluster.

More formally, we define the adjacency matrix of $G$ as
$$\mathbf{G} = \begin{bmatrix} \mathbf{0} & \mathbf{Y} \\ \mathbf{Y}^T & \mathbf{0} \end{bmatrix}, \quad (2)$$
where $\mathbf{Y} \in \{0, 1\}^{n \times k}$ and $k = \sum_{i=1}^{m} k_i$ is the total number of clusters over all base clustering results. $Y_{ij} = 1$ means there is an edge linking the i-th instance $x_i$ and the j-th cluster $\pi_j$, and $Y_{ij} = 0$ means there is no edge between them. Similarly, we define the adjacency matrix of the structured bipartite graph $G'$ as
$$\mathbf{G}' = \begin{bmatrix} \mathbf{0} & \mathbf{S} \\ \mathbf{S}^T & \mathbf{0} \end{bmatrix}, \quad (3)$$
where $\mathbf{S} \in [0, 1]^{n \times k}$. To make $G'$ preserve $G$ as well as possible, we should minimize $\|\mathbf{S} - \mathbf{Y}\|_F^2$.

Moreover, we also need to impose some constraints on $\mathbf{S}$ to make sure that $G'$ has c connected components. Given $G'$, we first obtain its normalized Laplacian matrix $\mathbf{L} = \mathbf{I} - \mathbf{D}^{-\frac{1}{2}} \mathbf{G}' \mathbf{D}^{-\frac{1}{2}}$, where $\mathbf{I}$ is an identity matrix and $\mathbf{D}$ is a diagonal matrix whose diagonal elements are $D_{ii} = \sum_{j=1}^{n+k} G'_{ij}$. Then, according to [Nie et al., 2017], the number of connected components in $G'$ is equal to n + k minus the rank of $\mathbf{L}$, so we require $\mathrm{rank}(\mathbf{L}) = n + k - c$. To this end, we obtain the following formulation:
$$\min_{\mathbf{S}} \ \|\mathbf{S} - \mathbf{Y}\|_F^2, \quad \text{s.t.} \ 0 \le S_{ij} \le 1, \ \mathrm{rank}(\mathbf{L}) = n + k - c. \quad (4)$$

As introduced before, since the base clusterings may not be perfect, each edge obtained from the base clusterings may also be unreliable. To characterize the reliability of each edge, we introduce a weight matrix $\mathbf{W} \in [0, 1]^{n \times k}$, where the larger $W_{ij}$ is, the more reliable the corresponding edge is. With $\mathbf{W}$ we can integrate our consensus clustering task into the self-paced learning framework seamlessly: we involve edges gradually, from more reliable edges to less reliable ones. As suggested in [Jiang et al., 2015], we set $f(w_i, \lambda)$ in Eq.(1) to $-\lambda \|\mathbf{W}\|_1$ and obtain:
$$\min_{\mathbf{S}, \mathbf{W}} \ \|\mathbf{W} \odot (\mathbf{S} - \mathbf{Y})\|_F^2 - \lambda \|\mathbf{W}\|_1, \quad \text{s.t.} \ 0 \le S_{ij} \le 1, \ \mathrm{rank}(\mathbf{L}) = n + k - c, \ 0 \le W_{ij} \le 1, \quad (5)$$
where $\odot$ is the Hadamard product, i.e., the element-wise product of two matrices, the second term is the self-paced regularization term, and $\lambda$ is the age parameter, which becomes increasingly larger in the process of optimization.

Unfortunately, the first term in Eq.(5) alone is not enough to characterize the reliability of the edges; we need to take a closer look at $\mathbf{W}$. Note that, if two clusters $\pi_p$ and $\pi_q$ are similar, then for any instance $x_i$, either $x_i$ belongs to both clusters or it belongs to neither. Therefore, if $(S_{ip} - S_{iq})^2$ is large, which means $x_i$ tends to belong to only one of the two clusters, then at least one of $S_{ip}$ and $S_{iq}$ is unreliable, i.e., at least one of $W_{ip}$ and $W_{iq}$ should be small. More formally, we use the following carefully designed regularization term to characterize the reliability of the edges:
$$\sum_{i=1}^{n} \sum_{p,q=1}^{k} C_{pq} (S_{ip} - S_{iq})^2 W_{ip} W_{iq}, \quad (6)$$
where $C_{pq}$ is the (p, q)-th element of $\mathbf{C} \in \mathbb{R}^{k \times k}$, which characterizes the similarity of two clusters. $\mathbf{C}$ can easily be obtained as $\mathbf{C} = \mathbf{Y}^T \mathbf{Y}$. We can see that, if $C_{pq}$ is large (i.e., $\pi_p$ and $\pi_q$ are similar) and $(S_{ip} - S_{iq})^2$ is large (i.e., $x_i$ only belongs to one of the two clusters), then by minimizing Eq.(6), at least one of $W_{ip}$ and $W_{iq}$ is forced to be small. Taking this term into our self-paced framework (Eq.(5)), we obtain the final objective function:
$$\min_{\mathbf{S}, \mathbf{W}} \ \|\mathbf{W} \odot (\mathbf{S} - \mathbf{Y})\|_F^2 - \lambda \|\mathbf{W}\|_1 + \gamma \sum_{i=1}^{n} \sum_{p,q=1}^{k} C_{pq} (S_{ip} - S_{iq})^2 W_{ip} W_{iq}, \quad (7)$$
$$\text{s.t.} \ 0 \le S_{ij} \le 1, \ \mathrm{rank}(\mathbf{L}) = n + k - c, \ 0 \le W_{ij} \le 1,$$
where $\gamma$ is a balance parameter.

3.2 Optimization

Eq.(7) involves the constraint $\mathrm{rank}(\mathbf{L}) = n + k - c$, which is hard to optimize, so we handle this constraint first. By introducing an auxiliary orthogonal matrix $\mathbf{F} \in \mathbb{R}^{(n+k) \times c}$ and a large enough parameter $\rho$, Eq.(7) is equivalent to the following formulation:
$$\min_{\mathbf{S}, \mathbf{W}, \mathbf{F}} \ \|\mathbf{W} \odot (\mathbf{S} - \mathbf{Y})\|_F^2 - \lambda \|\mathbf{W}\|_1 + \gamma \sum_{i=1}^{n} \sum_{p,q=1}^{k} C_{pq} (S_{ip} - S_{iq})^2 W_{ip} W_{iq} + \rho \, \mathrm{tr}(\mathbf{F}^T \mathbf{L} \mathbf{F}), \quad (8)$$
$$\text{s.t.} \ 0 \le S_{ij} \le 1, \ 0 \le W_{ij} \le 1, \ \mathbf{F}^T \mathbf{F} = \mathbf{I}.$$
We then optimize $\mathbf{W}$, $\mathbf{F}$ and $\mathbf{S}$ alternately, each with the other variables fixed.

Optimizing W

When optimizing $\mathbf{W}$, Eq.(8) can be decoupled into n independent subproblems by rows. Considering the i-th subproblem, we have
$$\min_{\mathbf{W}_{i.}} \ \sum_{p=1}^{k} W_{ip}^2 A_{ip}^2 - \lambda \sum_{p=1}^{k} W_{ip} + \gamma \sum_{p,q=1}^{k} W_{ip} B_{pq} W_{iq}, \quad \text{s.t.} \ 0 \le W_{ip} \le 1, \quad (9)$$
where $A_{ip} = S_{ip} - Y_{ip}$ and $B_{pq} = C_{pq} (S_{ip} - S_{iq})^2$. Note that Eq.(9) is a quadratic program with bound constraints and can be solved by standard optimization methods, such as the trust-region-reflective algorithm. In our implementation, we use the quadprog function provided in MATLAB.
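For illustration, the sketch below solves one row subproblem of Eq.(9) with SciPy's bound-constrained L-BFGS-B in place of MATLAB's quadprog. It is our own illustrative code and naming, under the assumption that γ is small enough for the subproblem to be convex, as discussed in Section 3.3:

```python
import numpy as np
from scipy.optimize import minimize

def update_w_row(a_row, b_mat, lam, gamma):
    """Solve the i-th row subproblem of Eq.(9):
    min_w  sum_p w_p^2 a_p^2 - lam * sum_p w_p + gamma * sum_{p,q} w_p B_pq w_q,
    s.t. 0 <= w_p <= 1, where a_p = S_ip - Y_ip and B_pq = C_pq (S_ip - S_iq)^2."""
    k = a_row.shape[0]
    Q = np.diag(a_row ** 2) + gamma * b_mat              # symmetric quadratic term (B is symmetric)
    fun = lambda w: w @ Q @ w - lam * w.sum()            # objective of Eq.(9)
    jac = lambda w: 2.0 * Q @ w - lam * np.ones(k)       # its gradient
    res = minimize(fun, x0=np.full(k, 0.5), jac=jac,
                   bounds=[(0.0, 1.0)] * k, method="L-BFGS-B")
    return res.x
```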
Optimizing F

When optimizing $\mathbf{F}$, we need to solve the following subproblem:
$$\min_{\mathbf{F}^T \mathbf{F} = \mathbf{I}} \ \mathrm{tr}(\mathbf{F}^T \mathbf{L} \mathbf{F}). \quad (10)$$
Eq.(10) can be solved by computing the eigen-decomposition of $\mathbf{L}$. However, an eigen-decomposition of an (n + k) × (n + k) matrix usually takes $O((n+k)^3)$ time and is very time consuming. Fortunately, since $\mathbf{L}$ is the Laplacian matrix of a bipartite graph, according to Lemma 1 in [Nie et al., 2017], Eq.(10) can be solved by conducting a Singular Value Decomposition (SVD) of a small rectangular matrix. In more detail, define diagonal matrices $\hat{\mathbf{D}} \in \mathbb{R}^{n \times n}$ and $\tilde{\mathbf{D}} \in \mathbb{R}^{k \times k}$ whose diagonal elements are $\hat{D}_{ii} = \sum_{j=1}^{k} S_{ij}$ and $\tilde{D}_{jj} = \sum_{i=1}^{n} S_{ij}$, respectively, and then compute the SVD $\hat{\mathbf{D}}^{-\frac{1}{2}} \mathbf{S} \tilde{\mathbf{D}}^{-\frac{1}{2}} = \mathbf{U} \boldsymbol{\Sigma} \mathbf{V}^T$, where $\mathbf{U}$ and $\mathbf{V}$ contain the left and right singular vectors, respectively, and $\boldsymbol{\Sigma}$ is a diagonal matrix containing the singular values. Let $\mathbf{U}' \in \mathbb{R}^{n \times c}$ and $\mathbf{V}' \in \mathbb{R}^{k \times c}$ contain the c left and right singular vectors of $\mathbf{U}$ and $\mathbf{V}$ corresponding to the c largest singular values (equivalently, the c smallest eigenvalues of $\mathbf{L}$). Then, the closed-form solution of Eq.(10) is
$$\mathbf{F} = \frac{1}{\sqrt{2}} \begin{bmatrix} \mathbf{U}' \\ \mathbf{V}' \end{bmatrix}. \quad (11)$$
Note that since $\hat{\mathbf{D}}^{-\frac{1}{2}} \mathbf{S} \tilde{\mathbf{D}}^{-\frac{1}{2}}$ is an n × k matrix and often n ≫ k, the time complexity of computing its SVD is $O(nk^2)$, which is much smaller than $O((n+k)^3)$.

Optimizing S

When optimizing $\mathbf{S}$, Eq.(8) can also be decoupled into n subproblems by rows. Considering the i-th subproblem, we have
$$\min_{\mathbf{S}_{i.}} \ \sum_{p=1}^{k} W_{ip}^2 (S_{ip} - Y_{ip})^2 + \gamma \sum_{p,q=1}^{k} E_{pq} (S_{ip} - S_{iq})^2 + \sum_{p=1}^{k} H_{ip} S_{ip}, \quad \text{s.t.} \ 0 \le S_{ip} \le 1, \quad (12)$$
where $E_{pq} = C_{pq} W_{ip} W_{iq}$, and $H_{ip} = \frac{\rho}{2} \left\| \frac{\mathbf{F}_{i.}}{\sqrt{d_i}} - \frac{\mathbf{F}_{(n+p).}}{\sqrt{d_{n+p}}} \right\|_2^2$ collects the terms contributed by $\rho \, \mathrm{tr}(\mathbf{F}^T \mathbf{L} \mathbf{F})$, with the degrees $d_i = \sum_{j=1}^{k} S_{ij}$ and $d_{n+p} = \sum_{j=1}^{n} S_{jp}$ taken from the previous iteration. Similar to Eq.(9), Eq.(12) is also a quadratic program with bound constraints and can be solved in the same way.

Algorithm 1 summarizes the whole process of SCCBG.

Algorithm 1 SCCBG
Input: m base clustering results, number of clusters c, parameter γ.
Output: Consensus clustering result.
1: Construct the initial bipartite graph from the m base clusterings and obtain Y; initialize the age parameter λ = 0.5 and S = Y.
2: Compute the cluster similarity matrix C = Y^T Y.
3: while not converged do
4:   Update W by solving Eq.(9).
5:   Update F by solving Eq.(10).
6:   Update S by solving Eq.(12).
7:   Update the age parameter by λ = 2λ.
8: end while
9: Obtain the final bipartite graph G' from S.
10: Obtain the final clustering result from the c connected components of G'.

3.3 Discussion

Firstly, we analyze the space and time complexity of our method. Since the graph we use is a bipartite graph, the space complexity of our method is O(nk). For the time complexity, we analyze it step by step. Computing C takes O(nk^2) time. Then, in each iteration, when updating W, we need to solve n quadratic programming problems, each involving k variables. Note that, in practice, we set γ to a small value to make sure the quadratic programming problems are convex. Therefore, each subproblem costs O(k^3) time and updating W costs O(nk^3) time. Updating F takes O(nk^2) time, as introduced in the previous subsection. Updating S also requires solving n quadratic programming problems with k variables each, so its time complexity is also O(nk^3). Supposing the number of iterations is l, the overall time complexity is O(lnk^3).
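To make the F-update of Eqs.(10)-(11) and its O(nk^2) cost mentioned above concrete, here is a minimal sketch assuming the construction $\mathbf{F} = \frac{1}{\sqrt{2}} [\mathbf{U}'; \mathbf{V}']$ from [Nie et al., 2017]; the code and names are ours, not the authors' implementation:

```python
import numpy as np
from scipy.linalg import svd

def update_f(S, c, eps=1e-10):
    """F-update of Eq.(10): top-c singular vectors of the degree-normalized S (cf. Eq.(11))."""
    d_row = np.maximum(S.sum(axis=1), eps)           # diagonal of \hat{D} (instance-node degrees)
    d_col = np.maximum(S.sum(axis=0), eps)           # diagonal of \tilde{D} (cluster-node degrees)
    S_norm = S / np.sqrt(d_row)[:, None] / np.sqrt(d_col)[None, :]
    U, sigma, Vt = svd(S_norm, full_matrices=False)  # O(n k^2) when n >> k
    # The c largest singular values of S_norm correspond to the c smallest eigenvalues of L.
    F = np.vstack([U[:, :c], Vt[:c, :].T]) / np.sqrt(2.0)
    return F                                         # (n + k) x c, with F^T F = I
```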
Last but not least, we discuss robust consensus clustering, which is closely related to our self-paced ensemble. Robust consensus clustering extracts the noise from the data or the base results and recovers clean results for the ensemble. For example, [Tao et al., 2016; Tao et al., 2019] proposed robust consensus clustering methods based on spectral clustering; [Huang et al., 2016b] used probability trajectories for robust consensus clustering; [Wang et al., 2019] provided an ensemble method for incomplete data. These methods focus only on the noises or outliers, without distinguishing among the uncontaminated instances. In our method, however, the contaminated instances can be regarded as the most unreliable ones, and in addition, the uncontaminated instances are also handled in order of their reliability. Therefore, our method provides a more fine-grained framework for handling instances. Moreover, in our self-paced framework, the reliability of the edges changes during the learning process: with the growth of λ, W becomes increasingly large until all its entries reach 1, which means the edges become increasingly reliable as learning proceeds.

4 Experiments

In this section, we conduct extensive experiments comparing our SCCBG with several state-of-the-art consensus clustering methods on benchmark data sets.

4.1 Data Sets

We use 8 data sets, including ALLAML [1], GLIOMA [1], K1b [Zhao and Karypis, 2004], Lung [1], Medical [Zhou et al., 2015b], Tdt2 [2], Tr41 [Zhao and Karypis, 2004], and TOX [1]. The details of these data sets are summarized in Table 1.

Table 1: Description of the data sets.

Data set   #instances   #features   #classes
ALLAML     72           7129        2
GLIOMA     50           4434        4
K1b        2340         21839       6
Lung       203          3312        5
Medical    706          1449        17
Tr41       878          7454        10
Tdt2       10212        36771       96
TOX        171          5748        4

[1] http://featureselection.asu.edu/datasets.php
[2] http://www.cad.zju.edu.cn/home/dengcai/Data/TextData.html

4.2 Experimental Setup

Following the experimental setup in [Wang et al., 2009b; Zhou et al., 2015b], we use k-means to generate the base clusterings. In more detail, we run k-means 200 times with different initializations to obtain 200 base results. Then we divide them into 10 subsets, with 20 base results in each subset. Next, we apply the consensus clustering methods to each subset and report the average results over the 10 subsets.

We compare our SCCBG with the following methods:
- KM, which is the average result of all base clusterings.
- KM-best, which is the best result of all base results.
- Cluster-based Similarity Partitioning Algorithm (CSPA) [Strehl and Ghosh, 2003], which uses the relationship between instances in the same cluster to establish a measure of pairwise similarity for the ensemble.
- Hyper Graph Partitioning Algorithm (HGPA) [Strehl and Ghosh, 2003], which integrates the base results with a constrained minimum-cut objective.
- Meta-CLustering Algorithm (MCLA) [Strehl and Ghosh, 2003], which transforms the ensemble problem into a cluster correspondence problem.
- Nonnegative Matrix Factorization based Consensus clustering (NMFC) [Li and Ding, 2008], which uses NMF to aggregate the base results.
- Bayesian Clustering Ensemble (BCE) [Wang et al., 2009b], which is a Bayesian model for the ensemble.
- Robust Clustering Ensemble (RCE) [Zhou et al., 2015b], which learns a robust consensus result by minimizing the KL divergence between the consensus result and each base result.
- Multi-view Ensemble Clustering (MEC) [Tao et al., 2017], which is a robust multi-view consensus clustering method that uses low-rank and sparse decomposition to ensemble the base clusterings and detect the noise.
- Locally Weighted Evidence Accumulation (LWEA) [Huang et al., 2018], which is a hierarchical agglomerative consensus clustering method based on uncertainty estimation and a local weighting strategy.
- Locally Weighted Graph Partitioning (LWGP) [Huang et al., 2018], which is a graph partitioning method based on the local weighting strategy.
- Robust Spectral Ensemble Clustering (RSEC) [Tao et al., 2019], which is a robust clustering ensemble method based on spectral clustering.
- Dense Representation Ensemble Clustering (DREC) [Zhou et al., 2019], which learns a dense representation for the clustering ensemble.
- SCCBG-W, which is our ensemble method without self-paced learning; in more detail, we fix the weight matrix W in SCCBG to all ones and do not update it.

The number of clusters is set to the true number of classes for all data sets and algorithms. Our method adjusts λ automatically as described in Algorithm 1. The parameter ρ is also decided automatically: we initialize ρ = 1; then, if the rank of L is larger than n + k - c, we double ρ, and if the rank is smaller than n + k - c, we halve it. The only hyper-parameter that needs to be tuned manually is γ. As discussed before, γ should not be too large, so that the subproblems remain convex; thus we tune it in the range [10^{-5}, 10^{0}].

We use Accuracy (ACC) and Normalized Mutual Information (NMI) to evaluate the clustering performance. To validate the statistical significance of the results, we also calculate the p-value of a t-test.
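For reference, ACC and NMI can be computed as follows. This is a standard recipe (ACC via a Hungarian matching between predicted clusters and true classes, NMI via scikit-learn), not the authors' evaluation script, and the names are ours:

```python
import numpy as np
from scipy.optimize import linear_sum_assignment
from sklearn.metrics import normalized_mutual_info_score

def clustering_accuracy(y_true, y_pred):
    """ACC: best one-to-one mapping between predicted clusters and true classes."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    classes, clusters = np.unique(y_true), np.unique(y_pred)
    cost = np.zeros((clusters.size, classes.size))
    for i, c in enumerate(clusters):
        for j, t in enumerate(classes):
            cost[i, j] = -np.sum((y_pred == c) & (y_true == t))  # negative overlap, to minimize
    row, col = linear_sum_assignment(cost)                        # Hungarian algorithm
    return -cost[row, col].sum() / y_true.size

def nmi(y_true, y_pred):
    return normalized_mutual_info_score(y_true, y_pred)
```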
4.3 Experimental Results

Tables 2 and 3 show the ACC and NMI results of all the ensemble methods. Bold font means that the difference is statistically significant (i.e., the p-value is smaller than 0.05). Note that, due to their high time and space complexity, RCE and MEC run out of memory on the largest data set, Tdt2.

Table 2: ACC results on all the data sets.

Methods   ALLAML  GLIOMA  K1b     Lung    Medical Tr41    Tdt2    Tox
KM        0.6545  0.4239  0.6726  0.7114  0.3996  0.5626  0.4104  0.4229
KM-best   0.7292  0.4880  0.8559  0.8675  0.4707  0.6946  0.4460  0.4825
CSPA      0.6583  0.4100  0.4531  0.4138  0.3500  0.5213  0.2850  0.4246
HGPA      0.5444  0.4180  0.5326  0.5025  0.2950  0.4894  0.2959  0.3854
MCLA      0.6722  0.4000  0.7383  0.7084  0.4017  0.5698  0.4000  0.4152
NMFC      0.6722  0.4140  0.5860  0.6764  0.3789  0.6323  0.3716  0.4269
BCE       0.6708  0.4280  0.6345  0.6700  0.3965  0.6205  0.1806  0.4140
RCE       0.6708  0.4260  0.6887  0.7143  0.3851  0.6391  -       0.4105
MEC       0.6056  0.3940  0.8190  0.7379  0.3627  0.6559  -       0.4304
LWEA      0.6736  0.4320  0.8279  0.7458  0.4208  0.6719  0.5744  0.4234
LWGP      0.6750  0.4320  0.7172  0.6498  0.4047  0.6483  0.4288  0.4193
RSEC      0.5917  0.4180  0.8409  0.8217  0.3490  0.6367  0.4222  0.4041
DREC      0.6819  0.4280  0.6462  0.6379  0.3926  0.6243  0.3684  0.4205
SCCBG-W   0.6681  0.4080  0.8405  0.8094  0.3980  0.6136  0.5011  0.4053
SCCBG     0.6861  0.4500  0.8663  0.8961  0.4592  0.6973  0.7164  0.4339

Table 3: NMI results on all the data sets.

Methods   ALLAML  GLIOMA  K1b     Lung    Medical Tr41    Tdt2    Tox
KM        0.0882  0.1629  0.5493  0.5284  0.4209  0.5843  0.6111  0.1374
KM-best   0.1772  0.2347  0.6853  0.6558  0.4806  0.6713  0.6240  0.2164
CSPA      0.0815  0.1716  0.4071  0.3712  0.3992  0.5919  0.5589  0.1436
HGPA      0.0110  0.1509  0.3917  0.3372  0.3613  0.5084  0.5385  0.1083
MCLA      0.0909  0.1327  0.5944  0.5258  0.4296  0.6044  0.6070  0.1329
NMFC      0.0909  0.1550  0.4995  0.5202  0.4259  0.6512  0.5930  0.1434
BCE       0.0821  0.1658  0.5414  0.4977  0.4499  0.6398  0.0000  0.1370
RCE       0.0899  0.1624  0.6068  0.5248  0.4475  0.6499  -       0.1344
MEC       0.0485  0.1312  0.6818  0.5617  0.4089  0.6758  -       0.1313
LWEA      0.0935  0.1686  0.6948  0.5364  0.4185  0.6666  0.7183  0.1236
LWGP      0.0932  0.1682  0.6115  0.4993  0.4266  0.6535  0.6266  0.1333
RSEC      0.0495  0.1544  0.6615  0.6027  0.4036  0.6449  0.5243  0.1184
DREC      0.1006  0.1641  0.5774  0.4647  0.4510  0.6514  0.5971  0.1394
SCCBG-W   0.0894  0.1567  0.6888  0.5785  0.3220  0.6039  0.6433  0.1239
SCCBG     0.1252  0.2163  0.7262  0.6930  0.3918  0.6847  0.7568  0.2131

From these tables, we find that: (1) Many ensemble methods outperform KM, which indicates the effectiveness of consensus clustering. (2) Compared with the other ensemble methods, our algorithm outperforms them on most data sets, which demonstrates its superiority. Even compared with the robust methods (RCE, MEC, and RSEC), ours also performs better, because our self-paced framework handles the data more finely, as discussed in Section 3.3. Moreover, to further demonstrate the effectiveness of the self-paced learning framework, we also compare with SCCBG-W, the version without self-paced learning. From the tables, we can see that our method significantly outperforms it, which also demonstrates the necessity of the self-paced learning framework. (3) On most data sets, our method is close to or even better than KM-best. Note that ours does not need to perform an exhaustive search over the predefined pool of base clusterings, which further demonstrates the effectiveness of SCCBG.

4.4 Parameter Study

The only hyper-parameter that needs to be tuned manually is γ. As discussed before, γ should not be too large, in order to guarantee the convexity of the subproblems, so we tune it in [10^{-5}, 10^{0}]. Fig. 2 shows the results on the Lung and Tr41 data sets. We can see that our method works well when γ is in the range [10^{-5}, 10^{-3}]. When γ becomes larger, the performance deteriorates, which is in line with our previous discussion.

[Figure 2: Clustering results on Lung and Tr41 with respect to different values of γ; panels: (a) ACC on Lung, (b) NMI on Lung, (c) ACC on Tr41, (d) NMI on Tr41, with γ varying from 10^{-5} to 10^{0}.]

5 Conclusion

In this paper, we proposed a novel self-paced consensus clustering method with a bipartite graph. We constructed an initial bipartite graph based on the base results and then learned a structured graph from it. In the process of graph learning, we adopted the idea of self-paced learning, which automatically decides the reliability of each edge and involves the edges in graph learning in order of their reliability. At last, we obtained the final consensus clustering result by finding the connected components of the learned graph. We conducted extensive experiments on benchmark data sets. Compared with other state-of-the-art consensus clustering methods, our method often performs better, which demonstrates the effectiveness and superiority of the proposed method.

Acknowledgements

This work is supported by the National Natural Science Foundation of China grants 61806003, 61976129, and 61972001.

References

[Azimi and Fern, 2009] Javad Azimi and Xiaoli Fern. Adaptive cluster ensemble selection. In IJCAI, pages 992–997, 2009.
[Basu and Christensen, 2013] Sumit Basu and Janara Christensen. Teaching classification boundaries to humans. In AAAI, pages 109–115, 2013.
[Huang et al., 2016a] Dong Huang, Jianhuang Lai, and Changdong Wang. Ensemble clustering using factor graph. Pattern Recognition, 50:131–142, 2016.
[Huang et al., 2016b] Dong Huang, Jianhuang Lai, and Changdong Wang. Robust ensemble clustering using probability trajectories. IEEE TKDE, 28(5):1312–1326, 2016.
[Huang et al., 2018] Dong Huang, Changdong Wang, and Jianhuang Lai. Locally weighted ensemble clustering. IEEE Transactions on Systems, Man, and Cybernetics, 48(5):1460–1473, 2018.
[Jiang et al., 2015] Lu Jiang, Deyu Meng, Qian Zhao, Shiguang Shan, and Alexander G. Hauptmann. Self-paced curriculum learning. In AAAI, pages 2694–2700, 2015.
[Jiang et al., 2018] Yangbangyan Jiang, Zhiyong Yang, Qianqian Xu, Xiaochun Cao, and Qingming Huang. When to learn what: Deep cognitive subspace clustering. In ACM Multimedia, pages 718–726, 2018.
[Kumar et al., 2010] M. P. Kumar, Benjamin Packer, and Daphne Koller. Self-paced learning for latent variable models. In NIPS, pages 1189–1197, 2010.
[Li and Ding, 2008] Tao Li and Chris H. Q. Ding. Weighted consensus clustering. In SDM, pages 798–809, 2008.
[Li et al., 2017] Changsheng Li, Junchi Yan, Fan Wei, Weishan Dong, Qingshan Liu, and Hongyuan Zha. Self-paced multi-task learning. In AAAI, pages 2175–2181, 2017.
[Liu et al., 2015] Hongfu Liu, Tongliang Liu, Junjie Wu, Dacheng Tao, and Yun Fu. Spectral ensemble clustering. In SIGKDD, pages 715–724, 2015.
[Liu et al., 2018] Xinwang Liu, Xinzhong Zhu, Miaomiao Li, Lei Wang, Chang Tang, Jianping Yin, Dinggang Shen, Huaimin Wang, and Wen Gao. Late fusion incomplete multi-view clustering. IEEE TPAMI, pages 2410–2423, 2018.
[Meng et al., 2017] Deyu Meng, Qian Zhao, and Lu Jiang. A theoretical understanding of self-paced learning. Information Sciences, 414:319–328, 2017.
[Nie et al., 2017] Feiping Nie, Xiaoqian Wang, Cheng Deng, and Heng Huang. Learning a structured optimal bipartite graph for co-clustering. In NIPS, pages 4129–4138, 2017.
[Ren et al., 2017] Yazhou Ren, Peng Zhao, Yongpan Sheng, Dezhong Yao, and Zenglin Xu. Robust softmax regression for multi-class classification with self-paced learning. In IJCAI, pages 2641–2647, 2017.
[Strehl and Ghosh, 2003] Alexander Strehl and Joydeep Ghosh. Cluster ensembles - a knowledge reuse framework for combining multiple partitions. Journal of Machine Learning Research, 3(3):583–617, 2003.
[Tao et al., 2016] Zhiqiang Tao, Hongfu Liu, Sheng Li, and Yun Fu. Robust spectral ensemble clustering. In CIKM, pages 367–376, 2016.
[Tao et al., 2017] Zhiqiang Tao, Hongfu Liu, Sheng Li, Zhengming Ding, and Yun Fu. From ensemble clustering to multi-view clustering. In IJCAI, pages 2843–2849, 2017.
[Tao et al., 2019] Zhiqiang Tao, Hongfu Liu, Sheng Li, Zhengming Ding, and Yun Fu. Robust spectral ensemble clustering via rank minimization. ACM Transactions on Knowledge Discovery from Data, 13(1):1–25, 2019.
[Topchy et al., 2003] Alexander Topchy, Anil K. Jain, and William F. Punch. Combining multiple weak clusterings. In ICDM, pages 331–338, 2003.
[Topchy et al., 2004] Alexander Topchy, Anil K. Jain, and William F. Punch. A mixture model for clustering ensembles. In SDM, pages 379–390, 2004.
[Wang et al., 2009a] Fei Wang, Xin Wang, and Tao Li. Generalized cluster aggregation. In IJCAI, pages 1279–1284, 2009.
[Wang et al., 2009b] Hongjun Wang, Hanhuai Shan, and Arindam Banerjee. Bayesian cluster ensembles. In SDM, pages 211–222, 2009.
[Wang et al., 2019] Siwei Wang, Xinwang Liu, En Zhu, Chang Tang, Jiyuan Liu, Jingtao Hu, Jingyuan Xia, and Jianping Yin. Multi-view clustering via late fusion alignment maximization. In IJCAI, pages 3778–3784, 2019.
[Zhang et al., 2017] Dingwen Zhang, Deyu Meng, and Junwei Han. Co-saliency detection via a self-paced multiple-instance learning framework. IEEE TPAMI, 39(5):865–878, 2017.
[Zhao and Karypis, 2004] Ying Zhao and George Karypis. Empirical and theoretical comparisons of selected criterion functions for document clustering. Machine Learning, 55(3):311–331, 2004.
[Zhao et al., 2017] Xingwang Zhao, Jiye Liang, and Chuangyin Dang. Clustering ensemble selection for categorical data based on internal validity indices. Pattern Recognition, 69:150–168, 2017.
[Zhou and Tang, 2006] Zhihua Zhou and Wei Tang. Clusterer ensemble. KBS, 19(1):77–83, 2006.
[Zhou et al., 2015a] Peng Zhou, Liang Du, Lei Shi, Hanmo Wang, and Yi-Dong Shen. Recovery of corrupted multiple kernels for clustering. In IJCAI, pages 4105–4111, 2015.
[Zhou et al., 2015b] Peng Zhou, Liang Du, Hanmo Wang, Lei Shi, and Yi-Dong Shen. Learning a robust consensus matrix for clustering ensemble via Kullback-Leibler divergence minimization. In IJCAI, pages 4112–4118, 2015.
[Zhou et al., 2019] Jie Zhou, Hongchan Zheng, and Lulu Pan. Ensemble clustering based on dense representation. Neurocomputing, pages 66–76, 2019.