# Clustering Ensemble Meets Low-rank Tensor Approximation

Yuheng Jia1,2, Hui Liu2, Junhui Hou2, Qingfu Zhang2
1School of Computer Science and Engineering, Southeast University, Nanjing 210096, China
2Department of Computer Science, City University of Hong Kong, Kowloon, Hong Kong SAR
yhjia@seu.edu.cn, hliu99-c@my.cityu.edu.hk, {jh.hou, qingfu.zhang}@cityu.edu.hk

This paper explores the problem of clustering ensemble, which aims to combine multiple base clusterings to produce better performance than that of any individual one. Existing clustering ensemble methods generally construct a co-association matrix, which indicates the pairwise similarity between samples, as the weighted linear combination of the connective matrices from different base clusterings, and the resulting co-association matrix is then adopted as the input of an off-the-shelf clustering algorithm, e.g., spectral clustering. However, the co-association matrix may be dominated by poor base clusterings, resulting in inferior performance. In this paper, we propose a novel low-rank tensor approximation based method to solve the problem from a global perspective. Specifically, by inspecting whether two samples are clustered to an identical cluster under different base clusterings, we derive a coherent-link matrix, which contains limited but highly reliable relationships between samples. We then stack the coherent-link matrix and the co-association matrix to form a three-dimensional tensor, whose low-rankness is further explored to propagate the information of the coherent-link matrix to the co-association matrix, producing a refined co-association matrix. We formulate the proposed method as a convex constrained optimization problem and solve it efficiently. Experimental results over 7 benchmark data sets show that the proposed model achieves a breakthrough in clustering performance, compared with 11 state-of-the-art methods. To the best of our knowledge, this is the first work to explore the potential of low-rank tensor approximation for clustering ensemble, which is fundamentally different from previous approaches. Last but not least, our method only contains one parameter, which can be easily tuned.

## Introduction

Clustering is an important but very challenging unsupervised task, the goal of which is to partition a set of samples into homogeneous groups (Jia et al. 2018). Numerous applications can be formulated as a clustering problem, such as recommender systems (Song, Tekin, and Van Der Schaar 2014), community detection (Wu et al. 2018), and image segmentation (Li et al. 2019). Over the past decades, a large number of clustering techniques have been proposed, e.g., K-means (Jain 2010), spectral clustering (Ulrike 2007), matrix factorization (Jia et al. 2020a; Wu et al. 2018; Jia et al. 2020c), hierarchical clustering (Johnson 1967), Gaussian mixture models (Moore 1999), and so on. As each method has its own advantages as well as drawbacks, no single method consistently outperforms the others (Vega-Pons and Ruiz-Shulcloper 2011). Additionally, a clustering method usually contains a few hyper-parameters, on which its performance heavily depends (Jia, Hou, and Kwong 2020). Moreover, the hyper-parameters are difficult to tune, and some methods, like K-means, are quite sensitive to initialization.
Those dilemmas increase the difficulty of choosing an appropriate clustering method for a particular clustering task. To this end, clustering ensemble was introduced: given a set of base clusterings produced by different methods, or by the same method with different hyper-parameters/initializations, it aims to generate a consensus clustering with better clustering performance than the base clusterings (Sagi and Rokach 2018; Boongoen and Iam-On 2018). Unlike supervised ensemble learning, clustering ensemble is more difficult (Tao et al. 2017, 2016), as the commonly used strategies in supervised ensemble learning, such as voting, cannot be directly applied when labels of samples are unavailable.

To realize clustering ensemble, existing methods generally first learn a pairwise relationship matrix from the base clusterings, and then apply an off-the-shelf clustering method like spectral clustering to the resulting matrix to produce the final clustering result (Tao, Liu, and Fu 2017). Based on how the pairwise relationship matrix is generated, we roughly divide the existing methods into two categories. (1) The first kind of methods treat the base clusterings as new feature representations (as shown in Fig. 1-A) to learn a pairwise relationship matrix. For example, (Gao et al. 2016) formulated clustering ensemble as a convex low-rank matrix representation problem. (Zhou, Zheng, and Pan 2019) used a Frobenius norm regularized self-representation model to seek a dense affinity matrix for clustering ensemble. (2) The second kind of methods rely on the co-association matrix (as shown in Fig. 1-C), which summarizes the co-occurrence of samples in the same cluster across the base clusterings. The concept of the co-association matrix was first proposed by (Fred and Jain 2005), and it has since become a fundamental tool in clustering ensemble. (Liu et al. 2017) theoretically bridged the co-association based method to weighted K-means clustering, which largely reduces the computational complexity. Recently, many advanced co-association matrix construction methods were proposed. For example, (Huang, Wang, and Lai 2018) considered the uncertainty of each base clustering and proposed a locally weighted co-association matrix. (Huang et al. 2018) used the cluster-wise similarities to enhance the traditional co-association matrix. (Zhou et al. 2020) proposed a self-paced strategy to learn the co-association matrix. A detailed discussion of the related works is given in the next section.

We observe that the co-association matrices constructed by prior works are variants of a weighted linear combination of the connective matrices (as shown in Fig. 1-B) from different base clusterings. When some base clusterings perform poorly, they will dominate the co-association matrix and degrade the clustering performance severely. In this paper, we propose a novel constrained low-rank tensor approximation (LTA) model to refine the co-association matrix from a global perspective. Specifically, as shown in Fig. 1-D, we first construct a coherent-link matrix, whose elements indicate whether two samples belong to the same cluster in all the base clusterings. We then stack the coherent-link matrix and the conventional co-association matrix to form a 3-dimensional (3-D) tensor, shown in Fig. 1-E, on which a low-rank approximation is further imposed.
By exploring the low-rankness, the proposed model can propagate the highly reliable information of the coherent-link matrix to the co-association matrix, producing a refined co-association matrix, which is adopted as the input of an off-the-shelf clustering method to produce the final clustering result. Technically, the proposed model is formulated as a convex optimization problem and solved by an alternating iterative method. We evaluate the proposed model on 7 benchmark data sets and compare it with 11 state-of-the-art clustering ensemble methods. The experimental comparisons substantiate that the proposed model significantly outperforms state-of-the-art methods. To the best of our knowledge, this is the first work to explore the potential of low-rank tensor approximation for clustering ensemble.

[Figure 1 panels: (A) Base Clusterings; (B) Connective Matrices of π1, π2 and π3; (C) Co-association Matrix; (D) Coherent-Link Matrix; (E) Stack (C) and (D) to Form a 3-D Tensor; (F) The Proposed Low-Rank Tensor Approximation Model; (G) Refined Co-association Matrix.]
Figure 1: Illustration of the proposed method by taking 3 base clusterings denoted by π1, π2 and π3, and 6 input samples denoted by x1, ..., x6 as an example. By exploring the low-rankness of the formed 3-D tensor, the limited but highly reliable information contained in the coherent-link matrix can be leveraged to enhance the quality of the co-association matrix.

## Related Work

Notation. We denote tensors by boldface swash letters, e.g., $\mathcal{A}$, matrices by boldface capital letters, e.g., $\mathbf{A}$, vectors by boldface lowercase letters, e.g., $\mathbf{a}$, and scalars by lowercase letters, e.g., $a$. Let $\mathcal{A}(i, j, k)$ denote the $(i, j, k)$-th element of a 3-D tensor $\mathcal{A}$, $\mathbf{A}(i, j)$ denote the $(i, j)$-th element of matrix $\mathbf{A}$, and $\mathbf{a}(i)$ denote the $i$-th entry of vector $\mathbf{a}$. The $i$-th frontal slice of tensor $\mathcal{A}$ is denoted as $\mathcal{A}(:, :, i)$.

Rank of tensors. In this paper, we use the tensor nuclear norm induced by the tensor singular value decomposition (t-SVD) (Kilmer et al. 2013) to measure the rank of a tensor. Specifically, the t-SVD of a 3-D tensor $\mathcal{A} \in \mathbb{R}^{n_1 \times n_2 \times n_3}$ can be represented as
$$\mathcal{A} = \mathcal{U} * \mathcal{S} * \mathcal{V}^{T}, \tag{1}$$
where $\mathcal{U} \in \mathbb{R}^{n_1 \times n_1 \times n_3}$ and $\mathcal{V} \in \mathbb{R}^{n_2 \times n_2 \times n_3}$ are two orthogonal tensors, $\mathcal{S} \in \mathbb{R}^{n_1 \times n_2 \times n_3}$ is an f-diagonal tensor, and $*$ and $^{T}$ denote the tensor product and the tensor transpose, respectively.

Algorithm 1: t-SVD of a 3-D tensor (Zhang et al. 2014)
Input: 3-D tensor $\mathcal{A} \in \mathbb{R}^{n_1 \times n_2 \times n_3}$.
1: Perform FFT on $\mathcal{A}$, i.e., $\mathcal{A}_f = \mathrm{fft}(\mathcal{A}, [\,], 3)$;
2: for $k = 1 : n_3$ do
3:   Perform SVD on each frontal slice of $\mathcal{A}_f$, i.e., $[\mathbf{U}, \mathbf{S}, \mathbf{V}] = \mathrm{SVD}(\mathcal{A}_f(:, :, k))$;
4:   $\mathcal{U}_f(:, :, k) = \mathbf{U}$, $\mathcal{S}_f(:, :, k) = \mathbf{S}$, $\mathcal{V}_f(:, :, k) = \mathbf{V}$;
5: end for
6: Perform inverse FFT on $\mathcal{U}_f$, $\mathcal{S}_f$ and $\mathcal{V}_f$, i.e., $\mathcal{U} = \mathrm{ifft}(\mathcal{U}_f, [\,], 3)$, $\mathcal{S} = \mathrm{ifft}(\mathcal{S}_f, [\,], 3)$ and $\mathcal{V} = \mathrm{ifft}(\mathcal{V}_f, [\,], 3)$;
Output: $\mathcal{U}$, $\mathcal{S}$ and $\mathcal{V}$.
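To make Algorithm 1 concrete, here is a minimal NumPy sketch of the t-SVD computation. The function name `t_svd` and the dense per-slice SVD are our own illustrative choices and are not taken from the authors' released implementation.

```python
import numpy as np

def t_svd(A):
    """t-SVD of a 3-D tensor via FFT along the third mode (cf. Algorithm 1)."""
    n1, n2, n3 = A.shape
    Af = np.fft.fft(A, axis=2)                       # step 1: FFT along the third dimension
    Uf = np.zeros((n1, n1, n3), dtype=complex)
    Sf = np.zeros((n1, n2, n3), dtype=complex)
    Vf = np.zeros((n2, n2, n3), dtype=complex)
    r = min(n1, n2)
    for k in range(n3):                              # steps 2-5: SVD of every frontal slice
        U, s, Vh = np.linalg.svd(Af[:, :, k])
        Uf[:, :, k] = U
        Sf[:r, :r, k] = np.diag(s)
        Vf[:, :, k] = Vh.conj().T
    # step 6: inverse FFT brings the factors back to the original domain
    U = np.real(np.fft.ifft(Uf, axis=2))
    S = np.real(np.fft.ifft(Sf, axis=2))
    V = np.real(np.fft.ifft(Vf, axis=2))
    return U, S, V

# Example usage; the sum of the absolute diagonal entries of S gives the tensor
# nuclear norm defined in Eq. (2) below.
A = np.random.rand(5, 4, 3)
U, S, V = t_svd(A)
tnn = sum(abs(S[i, i, k]) for k in range(3) for i in range(4))
```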
The detailed definitions of the above-mentioned tensor-related operators can be found in (Zhang et al. 2014). Since the tensor product can be efficiently computed in the Fourier domain (Kilmer et al. 2013), the t-SVD form of a tensor can be obtained efficiently with the fast Fourier transform (FFT), as shown in Algorithm 1. Given the t-SVD, the tensor nuclear norm (Zhang et al. 2014) is defined as the sum of the absolute values of the diagonal entries of $\mathcal{S}$, i.e.,
$$\|\mathcal{A}\|_{*} = \sum_{k=1}^{n_3} \sum_{i=1}^{\min(n_1, n_2)} |\mathcal{S}(i, i, k)|. \tag{2}$$

Formulation of Clustering Ensemble. Given a data set $\mathbf{X} = [\mathbf{x}_1, \mathbf{x}_2, \ldots, \mathbf{x}_n] \in \mathbb{R}^{d \times n}$ of $n$ samples with each sample $\mathbf{x}_i \in \mathbb{R}^{d \times 1}$, and $m$ base clusterings $\Pi = [\boldsymbol{\pi}_1, \boldsymbol{\pi}_2, \ldots, \boldsymbol{\pi}_m] \in \mathbb{R}^{n \times m}$, where each base clustering $\boldsymbol{\pi}_i \in \mathbb{R}^{n \times 1}$ is an $n$-dimensional vector whose $j$-th element $\boldsymbol{\pi}_i(j)$ indicates the cluster membership of the $j$-th sample $\mathbf{x}_j$ in $\boldsymbol{\pi}_i$. For clustering ensemble, the cluster indicators in different base clusterings are generally different. Fig. 1-A shows a toy example of 6 samples and 3 base clusterings. The objective of clustering ensemble is to combine the multiple base clusterings to produce better performance than that of any individual one.

Prior Art. Based on how $\Pi$ is used, we roughly divide previous clustering ensemble methods into two categories. The methods in the first category treat $\Pi$ as a representation of the samples and then construct a pairwise affinity matrix $\mathbf{P} \in \mathbb{R}^{n \times n}$ accordingly, which can be generally expressed as
$$\min_{\mathbf{P}}\ f(\Pi, \mathbf{P}) + \lambda \varphi(\mathbf{P}), \tag{3}$$
where $f(\Pi, \mathbf{P})$ is the fidelity term and $\varphi(\mathbf{P})$ imposes a specific regularization on $\mathbf{P}$. For example, in (Gao et al. 2016), $f(\cdot, \cdot)$ and $\varphi(\cdot)$ denote the Frobenius norm and the nuclear norm, respectively, while those in (Zhou, Zheng, and Pan 2019) are both the Frobenius norm. The second kind of methods first transform each base clustering into a connective matrix (as shown in Fig. 1-B), i.e.,
$$\mathbf{A}_k(i, j) = \delta(\boldsymbol{\pi}_k(i), \boldsymbol{\pi}_k(j)), \tag{4}$$
where $\mathbf{A}_k \in \mathbb{R}^{n \times n}$ is the $k$-th connective matrix constructed from $\boldsymbol{\pi}_k$, and
$$\delta(\boldsymbol{\pi}_k(i), \boldsymbol{\pi}_k(j)) = \begin{cases} 1 & \text{if } \boldsymbol{\pi}_k(i) = \boldsymbol{\pi}_k(j), \\ 0 & \text{otherwise}. \end{cases} \tag{5}$$
Then, the methods in the second category build a co-association matrix $\mathbf{A} \in \mathbb{R}^{n \times n}$ (Fred and Jain 2005) from the connective matrices, i.e.,
$$\mathbf{A}(i, j) = \frac{1}{m} \sum_{k=1}^{m} \mathbf{A}_k(i, j). \tag{6}$$
As the co-association matrix naturally converts the base clusterings into a pairwise similarity measure, it has become the cornerstone of clustering ensemble. Recently, many advanced co-association matrix construction methods were proposed to enhance the clustering performance, which can be generally unified in the following formula:
$$\mathbf{A}(i, j) = \sum_{k=1}^{m} \omega(k)\, \mathbf{A}_k(i, j), \tag{7}$$
where $\boldsymbol{\omega} \in \mathbb{R}^{m \times 1}$ is a weight vector constructed with different strategies. For example, (Zhou et al. 2020) used a self-paced learning strategy to construct $\boldsymbol{\omega}$. (Huang, Wang, and Lai 2018) considered the uncertainties of the base clusterings and proposed a locally-weighted weight vector. (Huang et al. 2018) used the cluster-wise similarities to construct the weight vector.
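As an illustration of Eqs. (4)-(6), the following NumPy sketch (our own naming, not the paper's code) builds the connective matrices and the co-association matrix; the toy labels mimic the 6-sample, 3-base-clustering example of Fig. 1.

```python
import numpy as np

def connective_matrix(pi):
    """Eqs. (4)-(5): A_k(i, j) = 1 iff samples i and j share a label in base clustering pi."""
    pi = np.asarray(pi).reshape(-1, 1)
    return (pi == pi.T).astype(float)

def co_association(Pi):
    """Eq. (6): average of the m connective matrices; Pi is the n x m base-clustering matrix."""
    n, m = Pi.shape
    return sum(connective_matrix(Pi[:, k]) for k in range(m)) / m

# Toy example in the spirit of Fig. 1: 6 samples, 3 base clusterings (cluster indices are arbitrary).
Pi = np.array([[1, 1, 1],
               [1, 1, 1],
               [2, 1, 2],
               [2, 2, 2],
               [3, 2, 3],
               [3, 2, 4]])
A = co_association(Pi)   # e.g., A[0, 1] = 1 (always together), A[1, 2] = 1/3
```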
## Proposed Method

As shown in Eq. (7), previous methods construct the co-association matrix as a linear combination of the connective matrices, and are thus vulnerable to poor base clusterings. To this end, we propose a novel low-rank tensor approximation based method to refine the initial co-association matrix from a global perspective.

### Problem Formulation

To refine the co-association matrix, we first construct a coherent-link matrix (as shown in Fig. 1-D), which inspects whether two samples are clustered into the same category under all the base clusterings. It is worth pointing out that the elements of the coherent-link matrix are the most reliable information we can infer from the base clusterings. Specifically, we can directly obtain the coherent-link matrix $\mathbf{M} \in \mathbb{R}^{n \times n}$ from the co-association matrix in Eq. (6), i.e.,
$$\mathbf{M}(i, j) = \begin{cases} 1 & \text{if } \mathbf{A}(i, j) = 1, \\ 0 & \text{otherwise}. \end{cases} \tag{8}$$
We then stack the coherent-link matrix and the co-association matrix to form a 3-D tensor $\mathcal{P} \in \mathbb{R}^{n \times n \times 2}$, with P(:, :, 1) = M and P(:, :, 2) = A. As the elements of both the coherent-link matrix and the co-association matrix express the pairwise similarity between samples, ideally, the formed tensor should be low-rank. Moreover, the non-zero elements of M are limited in number but express highly reliable similarity between samples, and we thus try to complement its zero elements with reference to the non-zero ones and the co-association matrix. On the contrary, the co-association matrix is dense but contains many erroneous connections, and we try to refine it by removing the incorrect connections, which are modeled by an error matrix $\mathbf{E} \in \mathbb{R}^{n \times n}$, by leveraging the information from the coherent-link matrix. In addition, the elements of P should be bounded in [0, 1], and each frontal slice of P should be symmetric. Taking all the above analyses into account, the proposed method is mathematically formulated as a constrained optimization problem, written as
$$\begin{aligned}
\min_{\mathcal{P}, \mathbf{E}}\ & \|\mathcal{P}\|_{*} + \lambda \|\mathbf{E}\|_F^2 \\
\text{s.t.}\ & \mathcal{P}(i, j, 1) = \mathbf{M}(i, j) \ \text{if } \mathbf{M}(i, j) = 1, \\
& \mathcal{P}(:, :, 1) = \mathcal{P}(:, :, 1)^{T}, \quad 0 \le \mathcal{P}(i, j, 1) \le 1, \ \forall i, j, \\
& \mathcal{P}(:, :, 2) + \mathbf{E} = \mathbf{A}, \\
& \mathcal{P}(:, :, 2) = \mathcal{P}(:, :, 2)^{T}, \quad 0 \le \mathcal{P}(i, j, 2) \le 1, \ \forall i, j,
\end{aligned} \tag{9}$$
where λ > 0 is the coefficient to balance the error matrix, and a Frobenius norm is imposed on E to avoid the trivial solution P(:, :, 2) = 0. By optimizing Eq. (9), it is expected that the limited but highly reliable information in M is propagated to the co-association matrix, while the coherent-link matrix is complemented according to the information from the co-association matrix at the same time. After solving the problem in Eq. (9), we obtain a refined co-association matrix P*(:, :, 2), with P* being the optimized solution. Then, one can apply any clustering method based on pairwise similarity to P*(:, :, 2) to generate the final clustering result. In this paper, we investigate two popular clustering methods, i.e., spectral clustering (Ng, Jordan, and Weiss 2002) and agglomerative hierarchical clustering (Fred and Jain 2005).
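Continuing in the same spirit, a minimal sketch of Eq. (8) and of stacking M and A into the tensor P is given below. The helper names are ours, and the snippet uses 0-based slice indices, whereas the paper indexes slices from 1.

```python
import numpy as np

def coherent_link(A):
    """Eq. (8): M(i, j) = 1 iff A(i, j) = 1, i.e., i and j co-occur in every base clustering."""
    return np.isclose(A, 1.0).astype(float)

def build_tensor(M, A):
    """Stack M and A into the 3-D tensor P of size n x n x 2 (P[:, :, 0] = M, P[:, :, 1] = A)."""
    return np.stack([M, A], axis=2)

# Example co-association matrix (e.g., the output of the previous snippet for 3 samples).
A = np.array([[1.0, 1.0, 1/3],
              [1.0, 1.0, 1/3],
              [1/3, 1/3, 1.0]])
M = coherent_link(A)
P0 = build_tensor(M, A)   # the tensor to be refined by solving Eq. (9)
```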
### Numerical Solution

We propose an optimization method to solve Eq. (9) based on the inexact augmented Lagrangian method (Jia, Kwong, and Hou 2018). Specifically, we first introduce two auxiliary matrices $\mathbf{B}, \mathbf{C} \in \mathbb{R}^{n \times n}$ to deal with the bounded and symmetric constraints on P(:, :, 1) and P(:, :, 2), respectively, and Eq. (9) can be equivalently rewritten as
$$\begin{aligned}
\operatorname*{argmin}_{\mathcal{P}, \mathbf{E}, \mathbf{B}, \mathbf{C}}\ & \|\mathcal{P}\|_{*} + \lambda \|\mathbf{E}\|_F^2 \\
\text{s.t.}\ & \mathbf{B}(i, j) = \mathbf{M}(i, j) \ \text{if } \mathbf{M}(i, j) = 1, \quad \mathbf{B} = \mathbf{B}^{T}, \quad 0 \le \mathbf{B}(i, j) \le 1, \ \forall i, j, \\
& \mathbf{B} = \mathcal{P}(:, :, 1), \quad \mathcal{P}(:, :, 2) + \mathbf{E} = \mathbf{A}, \quad \mathbf{C} = \mathcal{P}(:, :, 2), \\
& \mathbf{C} = \mathbf{C}^{T}, \quad 0 \le \mathbf{C}(i, j) \le 1, \ \forall i, j.
\end{aligned} \tag{10}$$
To handle the equality constraints, we introduce three Lagrange multipliers $\Lambda_1$, $\Lambda_2$, $\Lambda_3 \in \mathbb{R}^{n \times n}$, and the augmented Lagrangian form of Eq. (10) becomes
$$\begin{aligned}
\operatorname*{argmin}_{\mathcal{P}, \mathbf{E}, \mathbf{B}, \mathbf{C}}\ & \|\mathcal{P}\|_{*} + \lambda \|\mathbf{E}\|_F^2 + \frac{\mu}{2} \Big( \big\|\mathcal{P}(:, :, 2) + \mathbf{E} - \mathbf{A} + \tfrac{\Lambda_2}{\mu}\big\|_F^2 \\
& + \big\|\mathcal{P}(:, :, 1) - \mathbf{B} + \tfrac{\Lambda_1}{\mu}\big\|_F^2 + \big\|\mathcal{P}(:, :, 2) - \mathbf{C} + \tfrac{\Lambda_3}{\mu}\big\|_F^2 \Big) \\
\text{s.t.}\ & \mathbf{B}(i, j) = \mathbf{M}(i, j) \ \text{if } \mathbf{M}(i, j) = 1, \quad 0 \le \mathbf{B}(i, j) \le 1, \ \forall i, j, \\
& \mathbf{B} = \mathbf{B}^{T}, \quad \mathbf{C} = \mathbf{C}^{T}, \quad 0 \le \mathbf{C}(i, j) \le 1, \ \forall i, j,
\end{aligned} \tag{11}$$
where µ > 0 is the penalty coefficient. Eq. (11) can then be optimized by solving the following four subproblems iteratively and alternately, i.e., only one variable is updated at a time with the remaining ones fixed.

The P subproblem. Removing the irrelevant terms, Eq. (11) with respect to P is written as
$$\min_{\mathcal{P}}\ \|\mathcal{P}\|_{*} + \frac{\mu}{2} \|\mathcal{P} - \mathcal{T}\|_F^2, \tag{12}$$
where
$$\mathcal{T}(:, :, 1) = \mathbf{B} - \frac{\Lambda_1}{\mu}, \qquad \mathcal{T}(:, :, 2) = \frac{1}{2}\Big(\mathbf{A} + \mathbf{C} - \mathbf{E} - \frac{\Lambda_2 + \Lambda_3}{\mu}\Big). \tag{13}$$
According to (Zhang et al. 2014), Eq. (12) has a closed-form solution given by the soft-thresholding operator applied to the tensor singular values. Moreover, according to Algorithm 1, the t-SVD computes the FFT and the SVD with respect to the frontal slices of the input 3-D tensor T(:, :, i) and its FFT version Tf(:, :, i), respectively, which mainly emphasizes the low-rankness of the frontal slices. Differently, we aim to take advantage of the correlation between the original co-association matrix and the coherent-link matrix. Therefore, we perform the FFT and the SVD on the lateral slices of the tensors, T(:, i, :) and Tf(:, i, :), respectively, to obtain the t-SVD representation.

The E subproblem. Without the irrelevant terms, the E subproblem becomes
$$\min_{\mathbf{E}}\ \lambda \|\mathbf{E}\|_F^2 + \frac{\mu}{2} \Big\|\mathcal{P}(:, :, 2) + \mathbf{E} - \mathbf{A} + \frac{\Lambda_2}{\mu}\Big\|_F^2. \tag{14}$$
Since Eq. (14) is a quadratic function of E, we obtain its global minimum by setting its derivative to 0, i.e.,
$$\mathbf{E} = \frac{\mu \mathbf{A} - \Lambda_2 - \mu \mathcal{P}(:, :, 2)}{2\lambda + \mu}. \tag{15}$$

The B subproblem. The B subproblem is written as
$$\min_{\mathbf{B}}\ \Big\|\mathbf{B} - \mathcal{P}(:, :, 1) - \frac{\Lambda_1}{\mu}\Big\|_F^2 \quad \text{s.t.}\ \mathbf{B}(i, j) = \mathbf{M}(i, j) \ \text{if } \mathbf{M}(i, j) = 1, \ \mathbf{B} = \mathbf{B}^{T}, \ 0 \le \mathbf{B}(i, j) \le 1, \ \forall i, j, \tag{16}$$
which is a symmetric and bounded constrained least-squares problem and has an element-wise optimal solution (Jia et al. 2020d), i.e.,
$$\mathbf{B}(i, j) = \begin{cases} \mathbf{M}(i, j) & \text{if } \mathbf{M}(i, j) = 1, \\ 0 & \text{if } \mathbf{T}_1(i, j) \le 0 \ \text{and } \mathbf{M}(i, j) \neq 1, \\ 1 & \text{if } \mathbf{T}_1(i, j) \ge 1 \ \text{and } \mathbf{M}(i, j) \neq 1, \\ \mathbf{T}_1(i, j) & \text{if } 0 \le \mathbf{T}_1(i, j) \le 1 \ \text{and } \mathbf{M}(i, j) \neq 1, \end{cases} \tag{17}$$
where
$$\mathbf{T}_1 = \frac{1}{2}\Big(\mathcal{P}(:, :, 1) + \mathcal{P}(:, :, 1)^{T} + \frac{\Lambda_1 + \Lambda_1^{T}}{\mu}\Big). \tag{18}$$

The C subproblem. The C subproblem is identical to the B subproblem except for the set of element-wise equality constraints, and is written as
$$\min_{\mathbf{C}}\ \Big\|\mathbf{C} - \mathcal{P}(:, :, 2) - \frac{\Lambda_3}{\mu}\Big\|_F^2 \quad \text{s.t.}\ \mathbf{C} = \mathbf{C}^{T}, \ 0 \le \mathbf{C}(i, j) \le 1, \ \forall i, j, \tag{19}$$
whose optimal solution is
$$\mathbf{C}(i, j) = \begin{cases} \mathbf{T}_2(i, j) & \text{if } 0 \le \mathbf{T}_2(i, j) \le 1, \\ 0 & \text{if } \mathbf{T}_2(i, j) \le 0, \\ 1 & \text{if } \mathbf{T}_2(i, j) \ge 1, \end{cases} \tag{20}$$
where
$$\mathbf{T}_2 = \frac{1}{2}\Big(\mathcal{P}(:, :, 2) + \mathcal{P}(:, :, 2)^{T} + \frac{\Lambda_3 + \Lambda_3^{T}}{\mu}\Big). \tag{21}$$

Update of Λ1, Λ2, Λ3 and µ. The Lagrange multipliers and µ are updated by
$$\begin{cases} \Lambda_1 = \Lambda_1 + \mu (\mathcal{P}(:, :, 1) - \mathbf{B}), \\ \Lambda_2 = \Lambda_2 + \mu (\mathcal{P}(:, :, 2) + \mathbf{E} - \mathbf{A}), \\ \Lambda_3 = \Lambda_3 + \mu (\mathcal{P}(:, :, 2) - \mathbf{C}), \\ \mu = \min(1.1\mu, \mu_{\max}), \end{cases} \tag{22}$$
where µ is initialized to 0.0001 (Liu et al. 2019) and µmax is the upper bound for µ.

The overall numerical solution is summarized in Algorithm 2, where the stopping condition is $\max\big(\|\mathbf{B} - \mathcal{P}(:, :, 1)\|_{\infty}, \|\mathbf{C} - \mathcal{P}(:, :, 2)\|_{\infty}, \|\mathbf{A} - \mathbf{E} - \mathcal{P}(:, :, 2)\|_{\infty}\big) < 10^{-8}$, with $\|\cdot\|_{\infty}$ denoting the maximum of the absolute values of the entries of a matrix.

Algorithm 2: Numerical solution to Eq. (9)
Input: Base clusterings matrix Π;
Initialize: P = 0, E = 0, B = 0, C = 0, and µmax = 10^8;
1: Construct the co-association matrix A by Eq. (6);
2: Construct the coherent-link matrix M by Eq. (8);
3: while not converged do
4:   Update P by solving Eq. (12);
5:   Update E by Eq. (15);
6:   Update B by Eq. (17);
7:   Update C by Eq. (20);
8:   Update Λ1, Λ2, Λ3 and µ by Eq. (22);
9:   Check the convergence conditions;
10: end while
Output: P(:, :, 2) as the refined co-association matrix.
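For illustration only, the sketch below mirrors the loop of Algorithm 2 in NumPy under our own naming and 0-based slice indexing. The singular-value threshold 1/µ in the P update and the FFT normalization are plausible assumptions rather than details stated in the paper, so treat this as a starting point, not the reference implementation (which is available at the repository linked below).

```python
import numpy as np

def lateral_tsvt(T, tau):
    """Soft-threshold the tensor singular values of T, taking the t-SVD over lateral
    slices (as described for the P subproblem) by permuting them to the frontal position."""
    Tp = np.transpose(T, (0, 2, 1))                  # lateral slices of T -> frontal slices of Tp
    Tf = np.fft.fft(Tp, axis=2)
    Pf = np.zeros_like(Tf)
    for k in range(Tf.shape[2]):
        U, s, Vh = np.linalg.svd(Tf[:, :, k], full_matrices=False)
        Pf[:, :, k] = (U * np.maximum(s - tau, 0.0)) @ Vh
    return np.transpose(np.real(np.fft.ifft(Pf, axis=2)), (0, 2, 1))

def refine_co_association(A, M, lam=0.002, mu=1e-4, mu_max=1e8, rho=1.1, tol=1e-8, max_iter=500):
    """Inexact ALM loop of Algorithm 2 (a sketch under the assumptions noted above)."""
    n = A.shape[0]
    P = np.zeros((n, n, 2))
    E = np.zeros((n, n)); B = np.zeros((n, n)); C = np.zeros((n, n))
    L1 = np.zeros((n, n)); L2 = np.zeros((n, n)); L3 = np.zeros((n, n))
    link = np.isclose(M, 1.0)
    for _ in range(max_iter):
        # P subproblem, Eqs. (12)-(13): tensor soft-thresholding of T
        T = np.empty_like(P)
        T[:, :, 0] = B - L1 / mu
        T[:, :, 1] = 0.5 * (A + C - E - (L2 + L3) / mu)
        P = lateral_tsvt(T, 1.0 / mu)
        # E subproblem, Eq. (15)
        E = (mu * A - L2 - mu * P[:, :, 1]) / (2 * lam + mu)
        # B subproblem, Eqs. (17)-(18): symmetrize, clip to [0, 1], then enforce the coherent links
        T1 = 0.5 * (P[:, :, 0] + P[:, :, 0].T + (L1 + L1.T) / mu)
        B = np.clip(T1, 0.0, 1.0); B[link] = 1.0
        # C subproblem, Eqs. (20)-(21)
        T2 = 0.5 * (P[:, :, 1] + P[:, :, 1].T + (L3 + L3.T) / mu)
        C = np.clip(T2, 0.0, 1.0)
        # Multiplier and penalty updates, Eq. (22)
        L1 += mu * (P[:, :, 0] - B)
        L2 += mu * (P[:, :, 1] + E - A)
        L3 += mu * (P[:, :, 1] - C)
        mu = min(rho * mu, mu_max)
        # Stopping condition
        err = max(np.abs(B - P[:, :, 0]).max(), np.abs(C - P[:, :, 1]).max(),
                  np.abs(A - E - P[:, :, 1]).max())
        if err < tol:
            break
    return P[:, :, 1]   # refined co-association matrix
```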
## Experiment

We conducted extensive experiments to evaluate the proposed model. To reproduce the results, we made the code publicly available at https://github.com/jyhlearning/Tensor Clustering Ensemble.

Data Sets. Following recent clustering ensemble papers (Huang, Wang, and Lai 2018; Huang, Lai, and Wang 2016; Zhou, Zheng, and Pan 2019), we adopted 7 commonly used data sets, i.e., Bin Alpha, Multiple Features (MF), MNIST, Semeion, Cal Tech, Texture and ISOLET. Following (Huang, Wang, and Lai 2018), we randomly selected 5000 samples from MNIST and used this subset in the experiments; for Cal Tech, we used 20 representative categories out of the 101 categories and denote the resulting subset as Cal Tech20.

Generation of Base Clusterings. Following (Huang, Wang, and Lai 2018), we first generated a pool of 100 candidate base clusterings for all the data sets by applying the K-means algorithm with the value of K randomly varying in the range of [2, n], where n is the number of input data samples.

Methods under Comparison. We compared the proposed model with 11 state-of-the-art clustering ensemble methods, including TA-CL, TA-SL and PTGP (Huang, Lai, and Wang 2016), LWSC, LWEA and LWGP (Huang, Wang, and Lai 2018), E-HC and E-MC (Huang et al. 2018), DREC (Zhou, Zheng, and Pan 2019), SPCE (Zhou et al. 2020), and SEC (Liu et al. 2017). The codes of all the compared methods were provided by their authors. Ours-EA and Ours-SC denote the proposed model equipped with agglomerative hierarchical clustering and spectral clustering, respectively, to generate the final clustering result.

Evaluation Metrics. We adopted 7 commonly used metrics to evaluate clustering performance, i.e., clustering accuracy (ACC), normalized mutual information (NMI), purity, adjusted Rand index (ARI), F1-score, precision, and recall. For all the metrics, a larger value indicates better clustering performance, and the values of all the metrics are upper-bounded by 1. The detailed definitions of those metrics can be found in (Zhang et al. 2020; Jia et al. 2020b).
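As an aside, ACC requires matching predicted clusters to ground-truth classes; a common way to compute it, sketched below with our own helper name, uses the Hungarian algorithm, while NMI and ARI are available directly from scikit-learn. This is a generic sketch and not necessarily the evaluation code used in the paper.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment
from sklearn.metrics import normalized_mutual_info_score, adjusted_rand_score

def clustering_accuracy(y_true, y_pred):
    """ACC: best one-to-one matching between predicted clusters and classes (Hungarian algorithm)."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    classes, clusters = np.unique(y_true), np.unique(y_pred)
    overlap = np.array([[np.sum((y_pred == c) & (y_true == t)) for t in classes] for c in clusters])
    row, col = linear_sum_assignment(-overlap)       # maximize the total overlap
    return overlap[row, col].sum() / y_true.size

# NMI and ARI come directly from scikit-learn:
# nmi = normalized_mutual_info_score(y_true, y_pred)
# ari = adjusted_rand_score(y_true, y_pred)
```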
Experiment Settings. For each data set, we randomly selected 10 base clusterings from the candidate base clustering pool and performed the different clustering ensemble methods on the selected base clusterings. To reduce the influence of the selected base clusterings, we repeated the random selection 20 times and report the average performance over the 20 repetitions. For the compared methods, we set the hyper-parameters according to their original papers. If there were no suggested values, we exhaustively searched the hyper-parameters and used the ones producing the best performance. The proposed model only contains one hyper-parameter λ, which was set to 0.002 for all the data sets.

Figure 2: The NMI of our methods against the average NMI of the base clusterings in the candidate base clustering pool.
Figure 3: The NMI of our methods against different λ.
Figure 4: The NMI of our methods with different numbers of base clusterings, where the vertical error bar indicates the standard deviation over 20 repetitions.

### Analysis of the Clustering Performance

Tables 1-7 show the clustering performance of all the methods over the 7 data sets, from which we have the following observations. First, the proposed methods, including both Ours-EA and Ours-SC, almost always outperform all the compared methods under the various metrics, which proves the universality of the refined co-association matrix of the proposed model with respect to different clustering methods. Moreover, Ours-SC usually performs better than Ours-EA, which means the refined co-association matrix is more suitable for spectral clustering. Second, the improvements of the proposed methods are significant. For example, on Bin Alpha, compared with the best method under comparison, Ours-SC increases the ACC from 0.454 to 0.858. On Cal Tech20, the highest ACC of the compared methods is 0.495, while the ACC of Ours-EA is 0.726. The improvements of the proposed methods in terms of the other metrics are also significant. Moreover, the performance of Ours-SC on MF, MNIST, Semeion and Texture is extremely good, i.e., all the metrics are quite close to 1. Those phenomena suggest that the proposed model brings a breakthrough in clustering ensemble. Third, the highly competitive performance of the proposed model is achieved with a fixed hyper-parameter, proving the practicability of the proposed model. Besides, the proposed model is also robust to different data sets, as both Ours-EA and Ours-SC consistently produce superior clustering performance on all the data sets.

| Bin Alpha | TA-CL | TA-SL | PTGP | LWSC | LWEA | LWGP | E-HC | E-MC | DREC | SPCE | SEC | Ours-EA | Ours-SC |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| ACC | 0.429 | 0.186 | 0.429 | 0.424 | 0.403 | 0.431 | 0.375 | 0.454 | 0.375 | 0.298 | 0.443 | 0.712 | 0.858 |
| NMI | 0.577 | 0.300 | 0.574 | 0.570 | 0.553 | 0.575 | 0.537 | 0.592 | 0.518 | 0.541 | 0.585 | 0.824 | 0.916 |
| Purity | 0.451 | 0.197 | 0.446 | 0.444 | 0.413 | 0.457 | 0.383 | 0.478 | 0.396 | 0.285 | 0.470 | 0.718 | 0.876 |
| ARI | 0.291 | 0.081 | 0.291 | 0.284 | 0.289 | 0.287 | 0.269 | 0.300 | 0.248 | 0.227 | 0.291 | 0.643 | 0.817 |
| F1-score | 0.312 | 0.126 | 0.313 | 0.306 | 0.313 | 0.308 | 0.295 | 0.320 | 0.271 | 0.302 | 0.311 | 0.654 | 0.822 |
| Precision | 0.276 | 0.071 | 0.277 | 0.272 | 0.248 | 0.277 | 0.220 | 0.305 | 0.238 | 0.294 | 0.296 | 0.559 | 0.801 |
| Recall | 0.361 | 0.635 | 0.361 | 0.349 | 0.426 | 0.348 | 0.451 | 0.337 | 0.323 | 0.314 | 0.327 | 0.791 | 0.845 |

Table 1: Clustering Performance on Bin Alpha (# samples: 1404, dimension: 320, # clusters: 36)

| MF | TA-CL | TA-SL | PTGP | LWSC | LWEA | LWGP | E-HC | E-MC | DREC | SPCE | SEC | Ours-EA | Ours-SC |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| ACC | 0.606 | 0.507 | 0.648 | 0.671 | 0.609 | 0.649 | 0.589 | 0.652 | 0.362 | 0.581 | 0.592 | 0.718 | 0.990 |
| NMI | 0.638 | 0.536 | 0.654 | 0.655 | 0.650 | 0.655 | 0.618 | 0.652 | 0.347 | 0.621 | 0.602 | 0.790 | 0.979 |
| Purity | 0.644 | 0.533 | 0.677 | 0.690 | 0.650 | 0.673 | 0.616 | 0.676 | 0.387 | 0.615 | 0.623 | 0.719 | 0.990 |
| ARI | 0.500 | 0.371 | 0.523 | 0.533 | 0.514 | 0.530 | 0.481 | 0.526 | 0.257 | 0.459 | 0.472 | 0.685 | 0.979 |
| F1-score | 0.554 | 0.457 | 0.575 | 0.583 | 0.567 | 0.582 | 0.541 | 0.576 | 0.370 | 0.527 | 0.528 | 0.724 | 0.981 |
| Precision | 0.511 | 0.344 | 0.534 | 0.551 | 0.517 | 0.530 | 0.472 | 0.541 | 0.311 | 0.424 | 0.496 | 0.586 | 0.981 |
| Recall | 0.608 | 0.712 | 0.627 | 0.619 | 0.628 | 0.647 | 0.637 | 0.618 | 0.739 | 0.713 | 0.566 | 0.960 | 0.981 |

Table 2: Clustering Performance on MF (# samples: 2000, dimension: 649, # clusters: 10)

| MNIST | TA-CL | TA-SL | PTGP | LWSC | LWEA | LWGP | E-HC | E-MC | DREC | SPCE | SEC | Ours-EA | Ours-SC |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| ACC | 0.654 | 0.207 | 0.665 | 0.613 | 0.658 | 0.573 | 0.609 | 0.656 | 0.480 | 0.543 | 0.539 | 0.797 | 0.977 |
| NMI | 0.610 | 0.133 | 0.622 | 0.612 | 0.635 | 0.594 | 0.608 | 0.635 | 0.434 | 0.482 | 0.521 | 0.806 | 0.979 |
| Purity | 0.668 | 0.209 | 0.685 | 0.663 | 0.676 | 0.626 | 0.624 | 0.691 | 0.498 | 0.557 | 0.585 | 0.798 | 0.980 |
| ARI | 0.504 | 0.051 | 0.522 | 0.483 | 0.531 | 0.460 | 0.495 | 0.524 | 0.342 | 0.429 | 0.384 | 0.735 | 0.969 |
| F1-score | 0.557 | 0.219 | 0.572 | 0.540 | 0.582 | 0.522 | 0.558 | 0.574 | 0.427 | 0.445 | 0.450 | 0.767 | 0.972 |
| Precision | 0.523 | 0.124 | 0.541 | 0.490 | 0.536 | 0.459 | 0.448 | 0.543 | 0.373 | 0.316 | 0.420 | 0.666 | 0.968 |
| Recall | 0.596 | 0.952 | 0.607 | 0.603 | 0.641 | 0.609 | 0.745 | 0.610 | 0.576 | 0.831 | 0.485 | 0.918 | 0.977 |

Table 3: Clustering Performance on MNIST (# samples: 5000, dimension: 784, # clusters: 10)

| Semeion | TA-CL | TA-SL | PTGP | LWSC | LWEA | LWGP | E-HC | E-MC | DREC | SPCE | SEC | Ours-EA | Ours-SC |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| ACC | 0.700 | 0.425 | 0.692 | 0.682 | 0.739 | 0.620 | 0.638 | 0.679 | 0.450 | 0.571 | 0.594 | 0.846 | 0.983 |
| NMI | 0.634 | 0.418 | 0.631 | 0.630 | 0.656 | 0.598 | 0.601 | 0.635 | 0.386 | 0.571 | 0.569 | 0.831 | 0.962 |
| Purity | 0.707 | 0.449 | 0.703 | 0.702 | 0.739 | 0.651 | 0.645 | 0.705 | 0.460 | 0.607 | 0.634 | 0.847 | 0.983 |
| ARI | 0.510 | 0.248 | 0.507 | 0.507 | 0.540 | 0.465 | 0.480 | 0.508 | 0.290 | 0.401 | 0.418 | 0.790 | 0.962 |
| F1-score | 0.563 | 0.360 | 0.560 | 0.559 | 0.588 | 0.525 | 0.540 | 0.560 | 0.391 | 0.477 | 0.481 | 0.813 | 0.966 |
| Precision | 0.522 | 0.246 | 0.522 | 0.523 | 0.552 | 0.466 | 0.468 | 0.527 | 0.329 | 0.381 | 0.448 | 0.748 | 0.966 |
| Recall | 0.611 | 0.712 | 0.606 | 0.601 | 0.631 | 0.603 | 0.644 | 0.599 | 0.664 | 0.660 | 0.524 | 0.893 | 0.966 |

Table 4: Clustering Performance on Semeion (# samples: 1593, dimension: 256, # clusters: 10)

| Cal Tech20 | TA-CL | TA-SL | PTGP | LWSC | LWEA | LWGP | E-HC | E-MC | DREC | SPCE | SEC | Ours-EA | Ours-SC |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| ACC | 0.343 | 0.421 | 0.345 | 0.324 | 0.423 | 0.336 | 0.450 | 0.363 | 0.340 | 0.495 | 0.297 | 0.726 | 0.418 |
| NMI | 0.402 | 0.269 | 0.401 | 0.396 | 0.454 | 0.406 | 0.455 | 0.428 | 0.350 | 0.452 | 0.381 | 0.620 | 0.621 |
| Purity | 0.639 | 0.520 | 0.637 | 0.642 | 0.665 | 0.646 | 0.645 | 0.660 | 0.590 | 0.664 | 0.633 | 0.730 | 0.788 |
| ARI | 0.265 | 0.184 | 0.267 | 0.222 | 0.359 | 0.224 | 0.351 | 0.258 | 0.225 | 0.395 | 0.202 | 0.785 | 0.328 |
| F1-score | 0.337 | 0.363 | 0.338 | 0.291 | 0.432 | 0.298 | 0.437 | 0.332 | 0.316 | 0.457 | 0.269 | 0.823 | 0.384 |
| Precision | 0.561 | 0.284 | 0.563 | 0.529 | 0.612 | 0.510 | 0.538 | 0.543 | 0.479 | 0.503 | 0.525 | 0.764 | 0.743 |
| Recall | 0.241 | 0.562 | 0.243 | 0.201 | 0.335 | 0.211 | 0.373 | 0.239 | 0.253 | 0.449 | 0.181 | 0.898 | 0.259 |

Table 5: Clustering Performance on Cal Tech20 (# samples: 2386, dimension: 30,000, # clusters: 20)
| Texture | TA-CL | TA-SL | PTGP | LWSC | LWEA | LWGP | E-HC | E-MC | DREC | SPCE | SEC | Ours-EA | Ours-SC |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| ACC | 0.714 | 0.410 | 0.732 | 0.719 | 0.793 | 0.686 | 0.675 | 0.675 | 0.416 | 0.634 | 0.614 | 0.863 | 0.993 |
| NMI | 0.721 | 0.438 | 0.731 | 0.742 | 0.782 | 0.739 | 0.703 | 0.718 | 0.419 | 0.693 | 0.638 | 0.868 | 0.995 |
| Purity | 0.729 | 0.427 | 0.746 | 0.744 | 0.798 | 0.728 | 0.685 | 0.698 | 0.441 | 0.658 | 0.647 | 0.864 | 0.995 |
| ARI | 0.600 | 0.237 | 0.619 | 0.628 | 0.696 | 0.609 | 0.569 | 0.585 | 0.298 | 0.534 | 0.486 | 0.816 | 0.993 |
| F1-score | 0.639 | 0.350 | 0.656 | 0.663 | 0.724 | 0.648 | 0.614 | 0.626 | 0.397 | 0.590 | 0.537 | 0.834 | 0.993 |
| Precision | 0.598 | 0.228 | 0.627 | 0.631 | 0.700 | 0.592 | 0.543 | 0.582 | 0.337 | 0.465 | 0.500 | 0.780 | 0.991 |
| Recall | 0.690 | 0.897 | 0.689 | 0.700 | 0.752 | 0.719 | 0.710 | 0.677 | 0.752 | 0.827 | 0.586 | 0.895 | 0.996 |

Table 6: Clustering Performance on Texture (# samples: 5500, dimension: 20, # clusters: 11)

| ISOLET | TA-CL | TA-SL | PTGP | LWSC | LWEA | LWGP | E-HC | E-MC | DREC | SPCE | SEC | Ours-EA | Ours-SC |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| ACC | 0.540 | 0.394 | 0.539 | 0.556 | 0.578 | 0.527 | 0.451 | 0.581 | 0.324 | 0.574 | 0.554 | 0.575 | 0.675 |
| NMI | 0.718 | 0.587 | 0.715 | 0.721 | 0.743 | 0.710 | 0.667 | 0.743 | 0.413 | 0.818 | 0.719 | 0.752 | 0.831 |
| Purity | 0.572 | 0.407 | 0.567 | 0.594 | 0.605 | 0.564 | 0.467 | 0.619 | 0.350 | 0.301 | 0.590 | 0.583 | 0.707 |
| ARI | 0.498 | 0.316 | 0.495 | 0.485 | 0.552 | 0.472 | 0.449 | 0.516 | 0.251 | 0.367 | 0.483 | 0.563 | 0.639 |
| F1-score | 0.520 | 0.356 | 0.517 | 0.506 | 0.571 | 0.495 | 0.477 | 0.536 | 0.303 | 0.384 | 0.505 | 0.584 | 0.654 |
| Precision | 0.467 | 0.237 | 0.462 | 0.458 | 0.511 | 0.437 | 0.352 | 0.496 | 0.253 | 0.584 | 0.463 | 0.481 | 0.625 |
| Recall | 0.591 | 0.787 | 0.590 | 0.568 | 0.648 | 0.573 | 0.748 | 0.584 | 0.690 | 0.327 | 0.555 | 0.752 | 0.685 |

Table 7: Clustering Performance on ISOLET (# samples: 7791, dimension: 617, # clusters: 26)

Figure 5: Visual comparison of the learned pairwise similarity matrices for different methods. All the matrices share the same color bar, and a brighter color indicates a larger value.

Comparison Against Base Clusterings. We compared the average NMI of our methods with that of all the base clusterings from the candidate clustering pool in Fig. 2.
It is clear that, on all the data sets, both Ours-SC and Ours-EA can significantly improve the NMI of the base clusterings, and Ours-SC outperforms Ours-EA in the majority of cases.

Sensitivity to Hyper-parameter. Fig. 3 shows the NMI of the proposed methods with different λ on all the data sets, from which we can conclude that: first, a smaller λ usually leads to better clustering performance for both Ours-EA and Ours-SC, which demonstrates the importance of removing the incorrect connections from the original co-association matrix; and second, for the majority of data sets, the highest NMI occurs at λ = 0.002 for both Ours-EA and Ours-SC, which proves the high robustness of the proposed model to different data sets.

Performance with Different Numbers of Base Clusterings. Fig. 4 illustrates the influence of different numbers of base clusterings on the proposed model, from which we have the following observations. First, as the number of base clusterings increases, the NMIs of both Ours-EA and Ours-SC generally increase, indicating that more base clusterings are beneficial to the clustering performance. Second, with more base clusterings, the standard deviations generally become smaller for all the data sets, which suggests that more base clusterings can enhance the stability of our methods. Third, for the majority of data sets, 20 base clusterings are sufficient for our methods to achieve a high NMI.

Comparison of the Learned Pairwise Similarity Matrix. Fig. 5 presents the coherent-link matrix, the traditional co-association matrix, the co-association matrices learned by LWCA (Huang, Wang, and Lai 2018), SPCE (Zhou et al. 2020) and the proposed model, and the ideal affinity matrix of Bin Alpha, where all the matrices are normalized to [0, 1] and share the same color bar. From Fig. 5, we can observe that the coherent-link matrix is sparse, but the majority of its connections are correct; on the contrary, the co-association matrix is dense, but with many incorrect connections in it. By exploiting the low-rankness of the 3-D tensor stacked from the coherent-link matrix and the co-association matrix, the refined co-association matrix of the proposed model is quite close to the ideal one. Although it contains some erroneous connections, almost all the relationships between two samples belonging to the same cluster have been correctly recovered, leading to high clustering performance. In contrast, the affinity matrices of LWCA and SPCE contain many incorrect connections but not enough correct ones, which explains why they produced inferior clustering performance to that of the proposed model.

## Conclusion

In this work, we introduced low-rank tensor approximation to clustering ensemble for the first time. Different from previous methods, the proposed model solves clustering ensemble from a global perspective, i.e., by exploiting the low-rankness of a 3-D tensor formed by the coherent-link matrix and the co-association matrix, such that the valuable information of the coherent-link matrix can be effectively propagated to the co-association matrix. Extensive experiments have shown that i) the proposed model improves the current state-of-the-art performance of clustering ensemble to a new level; ii) the recommended value for the hyper-parameter of the proposed model is robust to different data sets; and iii) only a few base clusterings are required to generate high clustering performance.

## Acknowledgements

This work was supported by the Hong Kong Research Grants Council under Grants 21211518, 11219019 and 11202320.
## References

Boongoen, T.; and Iam-On, N. 2018. Cluster Ensembles: A Survey of Approaches with Recent Extensions and Applications. Computer Science Review 28: 1–25.
Fred, A. L.; and Jain, A. K. 2005. Combining Multiple Clusterings Using Evidence Accumulation. IEEE Transactions on Pattern Analysis and Machine Intelligence 27(6): 835–850.
Gao, J.; Yamada, M.; Kaski, S.; Mamitsuka, H.; Zhu, S.; and Kambhampati, S. 2016. A Robust Convex Formulation for Ensemble Clustering. In Proceedings of the 25th International Joint Conference on Artificial Intelligence (IJCAI 2016), 1476–1482. AAAI Press / International Joint Conferences on Artificial Intelligence.
Huang, D.; Lai, J.-H.; and Wang, C.-D. 2016. Robust Ensemble Clustering Using Probability Trajectories. IEEE Transactions on Knowledge and Data Engineering 28(5): 1312–1326.
Huang, D.; Wang, C.; Peng, H.; Lai, J.; and Kwoh, C. 2018. Enhanced Ensemble Clustering via Fast Propagation of Cluster-Wise Similarities. IEEE Transactions on Systems, Man, and Cybernetics: Systems 1–13.
Huang, D.; Wang, C.-D.; and Lai, J.-H. 2018. Locally Weighted Ensemble Clustering. IEEE Transactions on Cybernetics 48(5): 1460–1473.
Jain, A. K. 2010. Data Clustering: 50 Years Beyond K-means. Pattern Recognition Letters 31(8): 651–666.
Jia, Y.; Hou, J.; and Kwong, S. 2020. Constrained Clustering With Dissimilarity Propagation-Guided Graph-Laplacian PCA. IEEE Transactions on Neural Networks and Learning Systems 1–13. doi:10.1109/TNNLS.2020.3016397.
Jia, Y.; Kwong, S.; and Hou, J. 2018. Semi-Supervised Spectral Clustering With Structured Sparsity Regularization. IEEE Signal Processing Letters 25(3): 403–407.
Jia, Y.; Kwong, S.; Hou, J.; and Wu, W. 2018. Convex Constrained Clustering with Graph-Laplacian PCA. In 2018 IEEE International Conference on Multimedia and Expo (ICME), 1–6.
Jia, Y.; Kwong, S.; Hou, J.; and Wu, W. 2020a. Semi-Supervised Non-Negative Matrix Factorization With Dissimilarity and Similarity Regularization. IEEE Transactions on Neural Networks and Learning Systems 31(7): 2510–2521.
Jia, Y.; Liu, H.; Hou, J.; and Kwong, S. 2020b. Clustering-Aware Graph Construction: A Joint Learning Perspective. IEEE Transactions on Signal and Information Processing over Networks 6: 357–370.
Jia, Y.; Liu, H.; Hou, J.; and Kwong, S. 2020c. Semi-supervised Adaptive Symmetric Non-Negative Matrix Factorization. IEEE Transactions on Cybernetics 1–1. doi:10.1109/TCYB.2020.2969684.
Jia, Y.; Wu, W.; Wang, R.; Hou, J.; and Kwong, S. 2020d. Joint Optimization for Pairwise Constraint Propagation. IEEE Transactions on Neural Networks and Learning Systems 1–13.
Johnson, S. C. 1967. Hierarchical Clustering Schemes. Psychometrika 32(3): 241–254.
Kilmer, M. E.; Braman, K.; Hao, N.; and Hoover, R. C. 2013. Third-order Tensors as Operators on Matrices: A Theoretical and Computational Framework with Applications in Imaging. SIAM Journal on Matrix Analysis and Applications 34(1): 148–172.
Li, H.; Kwong, S.; Chen, C.; Jia, Y.; and Cong, R. 2019. Superpixel Segmentation based on Square-wise Asymmetric Partition and Structural Approximation. IEEE Transactions on Multimedia 21(10): 2625–2637.
Liu, H.; Jia, Y.; Hou, J.; and Zhang, Q. 2019. Imbalance-aware Pairwise Constraint Propagation. In Proceedings of the 27th ACM International Conference on Multimedia, 1605–1613.
Liu, H.; Wu, J.; Liu, T.; Tao, D.; and Fu, Y. 2017. Spectral Ensemble Clustering via Weighted K-Means: Theoretical and Practical Evidence. IEEE Transactions on Knowledge and Data Engineering 29(5): 1129–1143.
Moore, A. W. 1999. Very Fast EM-based Mixture Model Clustering Using Multiresolution KD-Trees. In Advances in Neural Information Processing Systems, 543–549.
Ng, A. Y.; Jordan, M. I.; and Weiss, Y. 2002. On Spectral Clustering: Analysis and an Algorithm. In Advances in Neural Information Processing Systems, 849–856.
Sagi, O.; and Rokach, L. 2018. Ensemble Learning: A Survey. Wiley Interdisciplinary Reviews: Data Mining and Knowledge Discovery 8(4): e1249.
Song, L.; Tekin, C.; and Van Der Schaar, M. 2014. Online Learning in Large-scale Contextual Recommender Systems. IEEE Transactions on Services Computing 9(3): 433–445.
Tao, Z.; Liu, H.; and Fu, Y. 2017. Simultaneous Clustering and Ensemble. In Proceedings of the Thirty-First AAAI Conference on Artificial Intelligence, 1546–1552.
Tao, Z.; Liu, H.; Li, S.; Ding, Z.; and Fu, Y. 2017. From Ensemble Clustering to Multi-view Clustering. In Proceedings of the 26th International Joint Conference on Artificial Intelligence, 2843–2849.
Tao, Z.; Liu, H.; Li, S.; and Fu, Y. 2016. Robust Spectral Ensemble Clustering. In Proceedings of the 25th ACM International Conference on Information and Knowledge Management, 367–376.
Ulrike, L. 2007. A Tutorial on Spectral Clustering. Statistics and Computing 17(4): 395–416. doi:10.1007/s11222-007-9033-z.
Vega-Pons, S.; and Ruiz-Shulcloper, J. 2011. A Survey of Clustering Ensemble Algorithms. International Journal of Pattern Recognition and Artificial Intelligence 25(03): 337–372.
Wu, W.; Jia, Y.; Kwong, S.; and Hou, J. 2018. Pairwise Constraint Propagation-Induced Symmetric Nonnegative Matrix Factorization. IEEE Transactions on Neural Networks and Learning Systems 29(12): 6348–6361.
Wu, W.; Kwong, S.; Zhou, Y.; Jia, Y.; and Gao, W. 2018. Nonnegative Matrix Factorization with Mixed Hypergraph Regularization for Community Detection. Information Sciences 435: 263–281.
Zhang, C.; Fu, H.; Hu, Q.; Cao, X.; Xie, Y.; Tao, D.; and Xu, D. 2020. Generalized Latent Multi-View Subspace Clustering. IEEE Transactions on Pattern Analysis and Machine Intelligence 42(1): 86–99.
Zhang, Z.; Ely, G.; Aeron, S.; Hao, N.; and Kilmer, M. 2014. Novel Methods for Multilinear Data Completion and Denoising based on Tensor-SVD. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 3842–3849.
Zhou, J.; Zheng, H.; and Pan, L. 2019. Ensemble Clustering based on Dense Representation. Neurocomputing 357: 66–76.
Zhou, P.; Du, L.; Liu, X.; Shen, Y.; Fan, M.; and Li, X. 2020. Self-Paced Clustering Ensemble. IEEE Transactions on Neural Networks and Learning Systems 1–15.