# Sparse Subspace Clustering with Entropy-Norm

Liang Bai, Jiye Liang (equal contribution)

Key Laboratory of Computational Intelligence and Chinese Information Processing of Ministry of Education, School of Computer and Information Technology, Shanxi University, Taiyuan, Shanxi Province, China. Correspondence to: Jiye Liang.

Proceedings of the 37th International Conference on Machine Learning, Online, PMLR 119, 2020. Copyright 2020 by the author(s).

## Abstract

In this paper, we provide an explicit theoretical connection between sparse subspace clustering (SSC) and spectral clustering (SC) from the perspective of learning a data similarity matrix. We show that spectral clustering with a Gaussian kernel can be viewed as sparse subspace clustering with entropy-norm (SSC+E). Compared to SSC, SSC+E obtains an analytical, symmetric, nonnegative and nonlinearly-representational similarity matrix. Moreover, SSC+E uses the Gaussian kernel to compute the sparse similarity matrix of objects, which avoids the complex computation of the sparse optimization program of SSC. Finally, we provide an experimental analysis comparing the efficiency and effectiveness of sparse subspace clustering and spectral clustering on ten benchmark data sets. The theoretical and experimental analysis can help users select among high-dimensional data clustering algorithms.

## 1. Introduction

Clustering is an important problem in statistical multivariate analysis, data mining and machine learning (Han & Kamber, 2001). The goal of clustering is to group a set of objects into clusters so that objects in the same cluster are highly similar to each other but remarkably dissimilar from objects in other clusters (Jain, 2008). To tackle this problem, various types of clustering algorithms have been developed in the literature (e.g., (Aggarwal & Reddy, 2014) and references therein), including partitional, hierarchical, density-based and grid-based clustering, among others.

Recently, increasing attention has been paid to clustering high-dimensional data, which is ubiquitous in real-world data mining applications such as image processing, text analysis and bioinformatics. Sparsity is an accompanying phenomenon of high-dimensional data and leads to the curse of dimensionality, i.e., all pairs of points tend to be almost equidistant from one another. This poses a special challenge for clustering high-dimensional data. To address it, many clustering algorithms have been developed in the literature (e.g., (Parsons et al., 2004; Elhamifar & Vidal, 2013) and references therein). Among them, spectral clustering and sparse subspace clustering are two state-of-the-art methods that effectively separate high-dimensional data in accordance with the underlying subspaces.

Spectral clustering (Shi & Malik, 2000; Ng et al., 2001) is a representative graph-based clustering method, which first converts a data set into a graph or a data similarity matrix and then uses a graph cutting method to identify clusters. However, the clustering results of spectral clustering are sensitive to the converted graph. In the classical spectral clustering algorithm, the graph is often constructed by kernel functions (Dhillon et al., 2007) or k-nearest neighbors (KNN) (Zhu et al., 2014). Besides, some scholars have developed graph-learning methods to obtain a high-quality graph from the data set. For example, Nie et al.
proposed a clustering algorithm with adaptive neighbors (Nie et al., 2014), which learns the data similarity matrix by assigning adaptive and optimal neighbors to each data point based on local connectivity.

Sparse subspace clustering can also be seen as a special case of spectral clustering. It makes use of the self-representation of the data to construct a sparse similarity graph and applies spectral clustering to this graph to obtain the final clustering result. In (Elhamifar & Vidal, 2009), Elhamifar and Vidal presented the SSC algorithm with the L1-norm in detail. Furthermore, several variants of SSC have been proposed to find the sparse representation of the data under different assumptions. Wang and Xu proposed noisy SSC to handle noisy data that lie close to disjoint or overlapping subspaces (Wang & Xu, 2016). Yang et al. proposed SSC with the L0-norm (Yang et al., 2016). Liu et al. proposed a low-rank representation of all data jointly by using a structured sparsity loss (Liu et al., 2013). Hu et al. theoretically investigated the grouping effect of self-representation based approaches and presented a smooth representation model (Hu et al., 2014).

Although the existing SSC methods have made good theoretical and practical contributions, they still suffer from some deficiencies. For example, since their optimization programs need many iterations, their computational cost is very expensive, exceeding O(n^3) even when fast solvers are used, where n is the number of objects in a data set. Besides, they cannot guarantee the symmetry and nonnegativity of the obtained sparse similarity matrix, which is required when implementing spectral clustering. To enhance the efficiency of SSC, several scalable sparse subspace clustering algorithms have been proposed in (Peng et al., 2013; You et al., 2016; Matsushima & Brbic, 2019; Zhang et al., 2019).

Although scholars have provided many studies on spectral clustering and sparse subspace clustering, the relation between them is rarely discussed. Therefore, in this paper, we provide an explicit theoretical connection between them. We propose a sparse subspace clustering model with entropy-norm. In this optimization model, we transform the sparse subspace clustering problem into an optimization problem of learning a sparse similarity matrix and use the entropy-norm as the regularization term. We derive its optimal solution, which is equivalent to the Gaussian kernel, as the sparse representation. Thus, we can conclude that spectral clustering with a Gaussian kernel can be viewed as sparse subspace clustering with entropy-norm (SSC+E). Compared to SSC, SSC+E avoids the complex computation of the sparse optimization program and obtains a sparse, analytical, symmetric and nonnegative solution. Finally, we analyze the efficiency and effectiveness of sparse subspace clustering and spectral clustering on ten benchmark data sets. The theoretical and experimental analysis provided in this paper can guide users in the selection of high-dimensional data clustering algorithms.

The outline of the rest of this paper is as follows. Section 2 introduces spectral clustering and sparse subspace clustering. Section 3 presents the theoretical connection between them. Section 4 presents the experimental comparison of the two. Section 5 concludes the paper with some remarks.
## 2. Spectral Clustering and Sparse Subspace Clustering

Let X be an m × n data matrix with n objects and m attributes, and let x_i be the i-th column of X, which represents the i-th object. The optimization problem of spectral clustering is described as follows:

$$\min_{H} \; \Theta = \mathrm{Tr}(H^T L H), \quad \text{s.t. } H^T H = I, \tag{1}$$

where L = D − W is a Laplacian matrix, W is an n × n data similarity matrix, D is a diagonal matrix whose entries are the column (or row, since W is symmetric) sums of W, and H is an n × k membership matrix. W is often defined by the Gaussian kernel as

$$w_{ij} = \exp\left(-\frac{\|x_i - x_j\|^2}{\gamma}\right), \tag{2}$$

where γ is a kernel parameter. Besides, $\hat{W} = D^{-1/2} W D^{-1/2}$ is the normalized similarity matrix of W; in this case, spectral clustering becomes normalized spectral clustering. The spectral clustering problem is a standard trace minimization problem, which is solved by taking H to contain, as its columns, the first k eigenvectors of L (those with the smallest eigenvalues).

The formulation of sparse subspace clustering is

$$\min_{Z} \; F = L(X, XZ) + \lambda\, \Omega(Z), \tag{3}$$

where λ > 0 is a tradeoff factor, L(X, XZ) is a loss function that encourages X = XZ, and Ω(·) is a regularization term used to sparsify Z. If X = XZ is required and Ω(Z) = ||Z||_1, the optimization problem is the classical SSC problem, i.e.,

$$\min_{Z} \|Z\|_1, \quad \text{s.t. } X = XZ,\ \mathrm{diag}(Z) = 0. \tag{4}$$

In the noisy SSC algorithm, which allows some tolerance for inexact representation, the optimization problem is defined as

$$\min_{Z} \|X - XZ\|_F^2 + \lambda \|Z\|_1, \quad \text{s.t. } \mathrm{diag}(Z) = 0. \tag{5}$$

Besides, in order to make the data lie in a union of subspaces, the constraint $1^T Z = 1^T$ is added to the sparse subspace problem (Elhamifar & Vidal, 2013). The optimization problems of SSC and noisy SSC can be solved efficiently using convex programming tools. With the reconstruction coefficient matrix Z, the sparse similarity matrix of objects is computed by

$$W = |Z| + |Z^T|. \tag{6}$$

Finally, with the converted similarity matrix W as input, the final clustering result is obtained by running standard spectral clustering.

According to the above introductions of spectral clustering and sparse subspace clustering, their main difference lies in the definition of the similarity matrix W. In spectral clustering, W is directly computed by the Gaussian kernel function, with computational complexity O(n^2). In sparse subspace clustering, W is computed by learning a sparse similarity matrix, with computational complexity greater than O(n^3). Compared to spectral clustering, the computational cost of sparse subspace clustering is therefore very expensive, which makes it unsuitable for large-scale data. Furthermore, we can see from Eq. (6) that the SSC algorithm explicitly converts the coefficient matrix Z into a symmetric and nonnegative matrix W. This conversion may introduce misleading information. For example, one would expect that the lower the value of z_ij, the more dissimilar x_i and x_j are. However, if z_ij is negative, a lower z_ij yields a higher |z_ij|, so the similarity between x_i and x_j is regarded as high. Thus, the similarity matrix W obtained from this conversion is not necessarily effective. Besides, the Z or W obtained by the SSC algorithm is not an analytical solution. This means that the explicit similarity relation between objects cannot be read off from Z or W, which prevents users from understanding the similarity. Compared to SSC, the similarity matrix obtained with the Gaussian kernel function in spectral clustering is analytical, symmetric and nonnegative. Although we can see the difference between the two algorithms, a theoretical connection between them has been lacking. This is the motivation of our work.
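To make the pipeline of Eqs. (1)-(2) concrete, the following minimal sketch builds the Gaussian-kernel similarity matrix, forms the normalized Laplacian, and clusters the spectral embedding with k-means, following the standard recipe of Ng et al. (2001). The function name, the numerical safeguards and the row normalization of the embedding are our own illustrative choices, not code released with this paper.

```python
# Minimal sketch of normalized spectral clustering with a Gaussian kernel,
# following Eqs. (1)-(2); `gamma` is the kernel parameter of Eq. (2).
import numpy as np
from sklearn.cluster import KMeans

def spectral_clustering_gaussian(X, n_clusters, gamma):
    """X: m x n data matrix whose columns are the objects, as in Section 2."""
    # Pairwise squared Euclidean distances between the columns of X.
    sq = np.sum(X ** 2, axis=0)
    dist2 = np.maximum(sq[:, None] + sq[None, :] - 2.0 * X.T @ X, 0.0)
    W = np.exp(-dist2 / gamma)                     # Eq. (2)
    np.fill_diagonal(W, 0.0)
    d = W.sum(axis=1)                              # degrees (row sums of W)
    d_inv_sqrt = 1.0 / np.sqrt(np.maximum(d, 1e-12))
    # Normalized Laplacian I - D^{-1/2} W D^{-1/2}; its bottom eigenvectors
    # solve the trace minimization of Eq. (1) in the normalized variant.
    L_sym = np.eye(W.shape[0]) - d_inv_sqrt[:, None] * W * d_inv_sqrt[None, :]
    _, eigvecs = np.linalg.eigh(L_sym)             # eigenvalues in ascending order
    H = eigvecs[:, :n_clusters]                    # first k eigenvectors as columns
    H = H / np.maximum(np.linalg.norm(H, axis=1, keepdims=True), 1e-12)
    return KMeans(n_clusters=n_clusters, n_init=10).fit_predict(H)
```

Building W costs O(n^2) time and memory, in line with the complexity comparison above; for large n the eigendecomposition dominates.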
## 3. Sparse Subspace Clustering with Entropy-Norm

In this section, we analyze the theoretical connection between spectral clustering and sparse subspace clustering from the perspective of learning the data similarity matrix W. In the analysis, we first transform the sparse subspace clustering problem into an optimization problem of learning a sparse similarity matrix. We then use the entropy-norm as the regularization term and derive its optimal solution to show the relation between spectral clustering and sparse subspace clustering. In the following, we provide the theoretical analysis in detail.

According to the definition of W in sparse subspace clustering, if Z is a symmetric and nonnegative matrix, then W is equivalent to Z. In this case, the minimization problem F can be converted as follows:

$$\min_{Z} \; F, \quad \text{s.t. } Z = Z^T,\ Z \ge 0,\ \mathrm{diag}(Z) = 0. \tag{7}$$

According to Eq. (7), the sparse subspace clustering problem can be viewed as learning a similarity matrix. However, the SSC algorithm cannot guarantee that the optimal solution Z is symmetric and nonnegative. Some scholars enforce a symmetric positive semi-definite (PSD) constraint on Z to explicitly obtain a symmetric PSD matrix, such as (Ni et al., 2010). However, it is very complex for the optimization program of SSC to obtain a symmetric and nonnegative Z.

Therefore, we select information entropy as the regularization term of the objective function F and propose sparse subspace clustering with entropy-norm to solve this problem. The objective function F is redefined as

$$F = L(X, XZ) + \lambda \sum_{i=1}^{n}\sum_{j=1}^{n} z_{ij} \ln z_{ij}. \tag{8}$$

Setting the derivative of F with respect to $z_{ij}$ to zero in order to directly minimize F, we obtain

$$f_{ij} + \lambda(\ln z_{ij} + 1) = 0 \;\Longrightarrow\; z_{ij} = \beta \exp\left(-\frac{f_{ij}}{\lambda}\right), \tag{9}$$

where $\beta = \exp(-1)$ and $f_{ij} = \frac{\partial L(X, XZ)}{\partial z_{ij}}$. According to Eq. (9), Z is nonnegative, and if $f_{ij} = f_{ji}$ for $1 \le i, j \le n$, then Z is symmetric, i.e., $z_{ij} = z_{ji}$.

Furthermore, we add the constraint $1^T Z = 1^T$ to the optimization problem. Using a Lagrange multiplier α for this constraint and minimizing F, the stationarity condition becomes $f_{ij} + \lambda(\ln z_{ij} + 1) + \alpha = 0$, and solving for α via the constraint gives $\exp\left(-\frac{\lambda + \alpha}{\lambda}\right) = 1 \big/ \sum_{h=1, h\neq i}^{n} \exp\left(-\frac{f_{ih}}{\lambda}\right)$, so that

$$z_{ij} = \frac{\exp\left(-\frac{f_{ij}}{\lambda}\right)}{\sum_{h=1, h\neq i}^{n} \exp\left(-\frac{f_{ih}}{\lambda}\right)}. \tag{10}$$

However, in Eq. (10), even if $f_{ij} = f_{ji}$, we cannot guarantee that $z_{ij}$ equals $z_{ji}$. To remedy this, we relax the constraint $1^T Z = 1^T$ and replace it with the new constraint $\sum_{h=1, h\neq i}^{n} z_{ih} + \sum_{h=1, h\neq j}^{n} z_{hj} = 2$ for $1 \le i \neq j \le n$; note that the sum of any row and any column of Z is then 2. Minimizing F with Lagrange multipliers under this new constraint, we obtain

$$z_{ij} = \frac{2\exp\left(-\frac{f_{ij}}{\lambda}\right)}{\sum_{h=1, h\neq i}^{n}\exp\left(-\frac{f_{ih}}{\lambda}\right) + \sum_{h=1, h\neq j}^{n}\exp\left(-\frac{f_{hj}}{\lambda}\right)}. \tag{11}$$

We can conclude that if $f_{ij} = f_{ji}$ in Eq. (11), then $z_{ij} = z_{ji}$ for $1 \le i, j \le n$. We can also observe that the expressions for $z_{ij}$ in Eqs. (9) and (11) are similar to the Gaussian kernel.

According to Eqs. (9) and (11), we need to compute $f_{ij}$ in order to obtain $z_{ij}$ for $1 \le i, j \le n$. To make $f_{ij}$ easy to compute, we wish it to be independent of $z_{ij}$. Therefore, we first assume that $L(X, XZ)$ is a linear function of the $z_{ij}$, i.e.,

$$L(X, XZ) = \sum_{i=1}^{n}\sum_{j=1}^{n} \left(a_{ij} z_{ij} + b_{ij}\right). \tag{12}$$

In this case, $f_{ij} = \frac{\partial L}{\partial z_{ij}} = a_{ij}$.
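The closed forms in Eqs. (9) and (11) can be evaluated directly from any cost matrix F = (f_ij), which is what makes the entropy-norm solution analytical. The small sketch below is illustrative only: the function names are ours, and the toy check simply confirms the properties derived above (nonnegativity, and symmetry whenever f_ij = f_ji).

```python
# Closed-form entropy-norm solutions of Eqs. (9) and (11) for a cost matrix F.
import numpy as np

def entropy_norm_solution(F, lam):
    """Unconstrained solution, Eq. (9): z_ij = exp(-1) * exp(-f_ij / lam)."""
    Z = np.exp(-1.0) * np.exp(-F / lam)
    np.fill_diagonal(Z, 0.0)          # diag(Z) = 0
    return Z

def entropy_norm_solution_normalized(F, lam):
    """Solution under the relaxed row/column-sum constraint, Eq. (11)."""
    E = np.exp(-F / lam)
    np.fill_diagonal(E, 0.0)          # the sums in Eq. (11) exclude h = i, h = j
    row = E.sum(axis=1)               # sum over h != i of exp(-f_ih / lam)
    col = E.sum(axis=0)               # sum over h != j of exp(-f_hj / lam)
    Z = 2.0 * E / (row[:, None] + col[None, :])
    np.fill_diagonal(Z, 0.0)
    return Z

# Toy check: with a symmetric cost matrix, Z is nonnegative and symmetric.
rng = np.random.default_rng(0)
A = rng.random((6, 6))
Z = entropy_norm_solution_normalized(A + A.T, lam=0.5)
assert np.all(Z >= 0) and np.allclose(Z, Z.T)
```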
This linearity assumption reduces the cost of computing $z_{ij}$ and $f_{ij}$. According to Eq. (12), the key issue in computing $f_{ij}$ is how to define $a_{ij}$. Based on the self-representation constraint X = XZ, we have

$$x_i = \sum_{j=1, j\neq i}^{n} z_{ij}\, x_j, \tag{13}$$

for $1 \le i \le n$. According to Eq. (13), $z_{ij}$ should reflect the similarity between $x_i$ and $x_j$: the more similar they are, the larger $z_{ij}$ should be. In order to make $z_{ij}$ reflect this similarity, we assume that $a_{ij}$ is a distance metric between $x_i$ and $x_j$, i.e., $a_{ij} = d(x_i, x_j)$. By the symmetry of the distance metric, $a_{ij} = a_{ji}$.

Next, we discuss how to use the constraint X = XZ to define $d(x_i, x_j)$. For object $x_i$ ($1 \le i \le n$), we have the following relation:

$$x_i^T x_i = x_i^T \sum_{j=1, j\neq i}^{n} z_{ij}\, x_j = \sum_{j=1, j\neq i}^{n} z_{ij}\, x_i^T x_j. \tag{14}$$

Based on the constraint $1^T Z = 1^T$, we have

$$x_i^T x_i = 2\sum_{j=1, j\neq i}^{n} z_{ij}\, x_i^T x_j - \sum_{j=1, j\neq i}^{n} z_{ij}\, x_i^T x_i. \tag{15}$$

Therefore, the constraint $x_i - \sum_{j=1, j\neq i}^{n} z_{ij}x_j = 0$ is transformed into

$$\sum_{j=1, j\neq i}^{n} z_{ij}\, x_i^T x_i - \sum_{j=1, j\neq i}^{n} z_{ij}\, x_i^T x_j = 0. \tag{16}$$

Furthermore, we have

$$\sum_{j=1, j\neq i}^{n} z_{ij}\, x_i^T x_i - 2\sum_{j=1, j\neq i}^{n} z_{ij}\, x_i^T x_j \;\le\; \sum_{j=1}^{n} \|x_i - x_j\|^2 z_{ij}, \tag{17}$$

since the right-hand side equals the left-hand side plus $\sum_{j\neq i} z_{ij}\, x_j^T x_j \ge 0$. Based on Eq. (17), the objective function F of the optimization problem can be rewritten as

$$Q = \sum_{i=1}^{n}\sum_{j=1}^{n} \|x_i - x_j\|^2 z_{ij} + \lambda \sum_{i=1}^{n}\sum_{j=1}^{n} z_{ij} \ln z_{ij}. \tag{18}$$

In this case, $d(x_i, x_j)$ is defined as the squared Euclidean distance. According to Eq. (9), if $d(x_i, x_j)$ is the squared Euclidean distance, then Z in Eq. (9) is equivalent (up to the constant factor β) to the Gaussian kernel, i.e.,

$$z_{ij} = \exp\left(-\frac{\|x_i - x_j\|^2}{\lambda}\right). \tag{19}$$

According to Eq. (19), spectral clustering with the Gaussian kernel can be viewed as sparse subspace clustering with entropy-norm. Furthermore, if we use $d(x_i, x_j)$ in place of $f_{ij}$ in Eq. (11), Z is defined by

$$z_{ij} = \frac{2\exp\left(-\frac{\|x_i - x_j\|^2}{\lambda}\right)}{\sum_{h=1, h\neq i}^{n}\exp\left(-\frac{\|x_i - x_h\|^2}{\lambda}\right) + \sum_{h=1, h\neq j}^{n}\exp\left(-\frac{\|x_j - x_h\|^2}{\lambda}\right)}. \tag{20}$$

In this case, we have the relation

$$Z \le D^{-1/2} W D^{-1/2}, \tag{21}$$

where W is the Gaussian kernel matrix of Eq. (19) and D its degree matrix. Eq. (21) illustrates that Z is a lower bound of the normalized Gaussian kernel and can be used to approximate it. According to the above analysis, we can see the relation between the Gaussian kernel and sparse subspace clustering.

Furthermore, we discuss the selection of the regularization term in the objective function Q. If we define the regularization term $\Omega(Z) = \|Z\|_2^2$, the objective function Q becomes

$$Q = \sum_{i=1}^{n}\sum_{j=1}^{n} \|x_i - x_j\|^2 z_{ij} + \lambda \|Z\|_2^2. \tag{22}$$

In this case, the optimization problem Q is equivalent to that of Clustering with Adaptive Neighbors (CAN) (Nie et al., 2014), proposed by Nie et al., who improved spectral clustering by using this optimization model to learn the data similarity matrix. The optimal solution Z is computed by

$$z_{ij} = -\frac{\|x_i - x_j\|^2}{2\lambda} + \eta_i, \qquad \eta_i = \frac{1}{n} + \frac{\sum_{h=1}^{n}\|x_i - x_h\|^2}{2\lambda n}. \tag{23}$$

Since each $\eta_i$ may be different, the nonnegativity of Z depends on $\eta_i$. Besides, we can see that when the L2 norm is used as the regularizer, the similarity $z_{ij}$ is a linear function of the distance $\|x_i - x_j\|^2$.

If the regularization term is $\Omega(Z) = \|Z\|_1$, the optimal solution Z is computed by

$$z_{ij} = \begin{cases} 1, & j = \arg\min_{t \neq i} \|x_i - x_t\|^2, \\ 0, & \text{otherwise}. \end{cases} \tag{24}$$

In this case, the solution is trivial, since each object is represented only by its nearest object. According to Eqs. (23) and (24), we can see the advantages of the entropy-norm compared to these other norms.

According to the above theoretical analysis, we can draw the following conclusions:

- From the perspective of learning a data similarity matrix, spectral clustering with the Gaussian kernel can be viewed as sparse subspace clustering with entropy-norm (SSC+E).
- Compared to SSC, SSC+E can directly compute the data similarity matrix with the Gaussian kernel, which reduces the computational cost.
- SSC+E can be used to obtain an analytical, nonnegative, symmetric, and nonlinear representation of a data set.
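As a sanity check on the equivalence above, the sketch below computes the Gaussian-kernel similarity of Eq. (19), the symmetrically normalized similarity of Eq. (20), and verifies the elementwise bound of Eq. (21) on random data (the bound follows from the arithmetic-geometric mean inequality applied to the row sums). The variable names, the random test data and the zeroed diagonal (reflecting diag(Z) = 0) are our illustrative choices.

```python
# SSC+E similarities implied by Eqs. (19)-(21); `lam` is the kernel width.
import numpy as np

def ssc_entropy_similarity(X, lam):
    """X: m x n data matrix (columns are objects). Returns W of Eq. (19)
    and the normalized Z of Eq. (20), both with zero diagonal."""
    sq = np.sum(X ** 2, axis=0)
    dist2 = np.maximum(sq[:, None] + sq[None, :] - 2.0 * X.T @ X, 0.0)
    W = np.exp(-dist2 / lam)                        # Eq. (19): Gaussian kernel
    np.fill_diagonal(W, 0.0)                        # diag(Z) = 0
    row = W.sum(axis=1)                             # sums over h != i
    Z = 2.0 * W / (row[:, None] + row[None, :])     # Eq. (20)
    return W, Z

# Eq. (21): Z is dominated elementwise by the normalized Gaussian kernel.
X = np.random.default_rng(0).normal(size=(5, 40))
W, Z = ssc_entropy_similarity(X, lam=1.0)
d_inv_sqrt = 1.0 / np.sqrt(W.sum(axis=1))
W_hat = d_inv_sqrt[:, None] * W * d_inv_sqrt[None, :]
assert np.all(Z <= W_hat + 1e-12)
```

Both matrices cost O(n^2) to build, which is the source of the speed advantage of SSC+E reported in Section 4.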
However, it is worth noting that the purpose of the proposed theoretical connection is not to prove that spectral clustering with the Gaussian kernel is better than sparse subspace clustering, but to help users understand the solution of the SSC algorithm through the Gaussian kernel. Besides, although the theoretical connection is built under certain constraints, we believe it is still very valuable as a guide for users in selecting and understanding different algorithms.

## 4. Experiment Analysis

In the experiments, we analyze and compare the effectiveness and efficiency of the sparse subspace clustering algorithm (SSC) (Elhamifar & Vidal, 2009), clustering with adaptive neighbors (CAN) (Nie et al., 2014), and spectral clustering with the Gaussian kernel (SSC+E) (Ng et al., 2001). Unlike SSC and CAN, which compute the data similarity matrix by optimization methods, SSC+E uses the Gaussian kernel to compute the similarity matrix. The codes of these algorithms have been publicly shared by their authors. The experiments are conducted on an Intel i9-7940X CPU @ 3.10 GHz with 128 GB of RAM.

We run these algorithms on 10 benchmark data sets (Bache & Lichman; Cai), described in Table 1. It is worth noting that we did not select large-scale data sets to test these algorithms, mainly because the computational cost of the SSC algorithm is very high. For example, it needs more than 60 hours on the Landsat Satellite data set, which contains 6,435 points. Furthermore, we employ two widely used external indices (Aggarwal & Reddy, 2014), namely normalized mutual information (NMI) and the adjusted Rand index (ARI), to measure the similarity between a clustering result and the true partition of a data set. The closer a clustering result is to the true partition, the higher its NMI and ARI values.

Table 1. Description of data sets: number of data objects (n), number of dimensions (m), number of clusters (k).

| Data set | n | m | k |
|---|---|---|---|
| Iris | 150 | 4 | 3 |
| Wine | 178 | 13 | 3 |
| Heart Statlog | 569 | 30 | 2 |
| Yale | 165 | 1024 | 15 |
| ORL | 400 | 1024 | 40 |
| Banknote | 1,372 | 4 | 2 |
| COIL | 1,440 | 1024 | 20 |
| Isolet | 1,560 | 617 | 26 |
| Handwritten Digits | 5,620 | 63 | 10 |
| Landsat Satellite | 6,435 | 36 | 6 |

Before the comparisons, we set the parameters of these algorithms as follows. The number of clusters k is set to the true number of classes of each data set. For the parameter λ, we test each algorithm with different λ values selected from the set {λ1 = δ/50, λ2 = δ/40, λ3 = δ/30, λ4 = δ/20, λ5 = δ/10, λ6 = δ}, where δ is the covariance of the data set. Besides, the SSC and CAN algorithms need the number of nearest neighbors K, which we set to 10 in our experiments.

We first compare the effectiveness of the three algorithms with different λ values on the benchmark data sets. The comparison results are shown in Fig. 1. According to the figures, the performance of the SSC+E algorithm is superior to SSC and CAN on the tested data sets, except Heart Statlog. Furthermore, we compare the efficiency of the three algorithms with different λ values on these data sets. The comparison results are shown in Table 2. According to the table, the SSC algorithm incurs very high computational costs compared to SSC+E and CAN. We can also observe that the clustering speed of the SSC+E algorithm is slightly faster than that of CAN on the tested data sets. The main reason is that the SSC+E algorithm does not need to learn the sparse representation but computes it directly with the Gaussian kernel. According to the above analysis, we conclude that the SSC+E algorithm achieves a better balance between the effectiveness and efficiency of obtaining a high-quality clustering result than the SSC and CAN algorithms.
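For the SSC+E baseline only, the evaluation loop can be sketched as below. It assumes scikit-learn's SpectralClustering as the solver and uses the mean feature variance as the δ from which the λ grid {δ/50, ..., δ} is built; both choices are our assumptions for illustration rather than details confirmed by the paper, and SSC and CAN are run from their authors' released code instead.

```python
# Hedged sketch of scoring SSC+E (spectral clustering with a Gaussian kernel)
# with NMI and ARI over the paper's lambda grid.
import numpy as np
from sklearn.cluster import SpectralClustering
from sklearn.metrics import adjusted_rand_score, normalized_mutual_info_score

def evaluate_ssc_e(X, y_true, n_clusters):
    """X: n x m array (rows are objects, as scikit-learn expects)."""
    delta = X.var(axis=0).mean()        # illustrative stand-in for delta
    results = {}
    for div in (50, 40, 30, 20, 10, 1):
        lam = delta / div
        # scikit-learn's rbf affinity is exp(-gamma * ||x - y||^2),
        # so gamma = 1 / lambda reproduces Eq. (2) with kernel width lambda.
        model = SpectralClustering(n_clusters=n_clusters, affinity="rbf",
                                   gamma=1.0 / lam, assign_labels="kmeans",
                                   random_state=0)
        y_pred = model.fit_predict(X)
        results[lam] = (normalized_mutual_info_score(y_true, y_pred),
                        adjusted_rand_score(y_true, y_pred))
    return results
```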
Figure 1. Clustering accuracies of different algorithms: ARI and NMI against λ for SSC, CAN and SSC+E on the data sets Iris, Wine, Heart Statlog, Yale, ORL, Banknote, COIL, Isolet, Handwritten Digits and Landsat Satellite (plots not reproduced here).
Table 2. Clustering speeds (seconds) of different algorithms.

| Data set | Algorithm | λ1 | λ2 | λ3 | λ4 | λ5 | λ6 |
|---|---|---|---|---|---|---|---|
| Iris | SSC | 31.727 | 32.176 | 31.724 | 31.445 | 30.836 | 32.336 |
| Iris | CAN | 0.078 | 0.069 | 0.046 | 0.078 | 0.047 | 0.047 |
| Iris | SSC+E | 0.031 | 0.047 | 0.032 | 0.032 | 0.016 | 0.016 |
| Wine | SSC | 37.975 | 37.897 | 38.007 | 40.100 | 39.366 | 39.132 |
| Wine | CAN | 0.094 | 0.078 | 0.125 | 0.094 | 0.078 | 0.094 |
| Wine | SSC+E | 0.031 | 0.016 | 0.015 | 0.015 | 0.015 | 0.015 |
| Heart Statlog | SSC | 59.689 | 58.674 | 59.923 | 60.267 | 59.767 | 61.689 |
| Heart Statlog | CAN | 0.156 | 0.188 | 0.188 | 0.156 | 0.156 | 0.125 |
| Heart Statlog | SSC+E | 0.015 | 0.015 | 0.016 | 0.016 | 0.015 | 0.015 |
| Yale | SSC | 204.686 | 203.968 | 204.749 | 202.843 | 204.328 | 201.687 |
| Yale | CAN | 0.125 | 0.219 | 0.125 | 0.125 | 0.156 | 0.140 |
| Yale | SSC+E | 0.047 | 0.047 | 0.031 | 0.062 | 0.047 | 0.047 |
| ORL | SSC | 2245.200 | 2258.955 | 2258.564 | 2237.354 | 2266.546 | 2265.959 |
| ORL | CAN | 0.625 | 0.671 | 0.671 | 0.687 | 0.641 | 0.703 |
| ORL | SSC+E | 0.218 | 0.203 | 0.204 | 0.188 | 0.219 | 0.203 |
| Banknote | SSC | 387.909 | 390.128 | 392.174 | 386.707 | 391.206 | 393.673 |
| Banknote | CAN | 5.343 | 4.702 | 4.624 | 4.452 | 4.264 | 4.749 |
| Banknote | SSC+E | 0.187 | 0.187 | 0.218 | 0.187 | 0.188 | 0.187 |
| COIL | SSC | 70766.438 | 70601.720 | 69062.015 | 67736.005 | 63388.664 | 54684.533 |
| COIL | CAN | 4.434 | 6.264 | 5.905 | 3.546 | 3.359 | 2.952 |
| COIL | SSC+E | 0.534 | 0.547 | 0.563 | 0.563 | 0.578 | 0.594 |
| Isolet | SSC | 56079.227 | 55180.180 | 54446.940 | 64799.694 | 72375.284 | 76137.546 |
| Isolet | CAN | 7.545 | 4.187 | 7.467 | 7.108 | 7.436 | 4.405 |
| Isolet | SSC+E | 0.657 | 0.703 | 0.657 | 0.734 | 0.766 | 0.734 |
| Handwritten Digits | SSC | 13990.391 | 13495.460 | 13250.659 | 12951.093 | 12793.112 | 14684.592 |
| Handwritten Digits | CAN | 216.325 | 154.261 | 148.512 | 148.73 | 140.842 | 196.876 |
| Handwritten Digits | SSC+E | 3.702 | 3.577 | 3.702 | 3.625 | 3.843 | 3.999 |
| Landsat Satellite | SSC | 14387.362 | 14463.630 | 14510.508 | 15019.646 | 15736.703 | 14685.439 |
| Landsat Satellite | CAN | 265.11 | 317.706 | 204.983 | 318.269 | 315.863 | 316.786 |
| Landsat Satellite | SSC+E | 4.467 | 4.374 | 4.452 | 4.562 | 4.452 | 4.545 |

It should be noted that, because the number of tested data sets is limited, we cannot conclude that the data similarity matrices obtained by the Gaussian kernel are more effective than those of SSC and CAN. Our experimental results only illustrate that spectral clustering with the Gaussian kernel is still a good choice for rapidly obtaining a good sparse representation of a data set, even though it is a long-established and simple method.

## 5. Conclusions

In this paper, we have analyzed the theoretical connection between spectral clustering and sparse subspace clustering from the perspective of learning the data similarity matrix. We have shown that spectral clustering with the Gaussian kernel can be viewed as sparse subspace clustering with entropy-norm, which we call SSC+E. We have analyzed the advantages and disadvantages of SSC+E and SSC. Compared to SSC, the SSC+E algorithm can rapidly obtain a sparse, analytical, symmetric, nonnegative and nonlinearly-representational similarity matrix. Finally, we have compared the efficiency and effectiveness of sparse subspace clustering and spectral clustering on ten benchmark data sets. The experimental results show that spectral clustering with the Gaussian kernel remains a good choice for rapidly obtaining a high-quality clustering result.

## Acknowledgements

The authors are very grateful to the editors and reviewers for their valuable comments and suggestions. This work is supported by the National Natural Science Foundation of China (Nos. 61876103, 61773247, 61976128), the Technology Research Development Projects of Shanxi (No. 201901D211192) and the 1331 Engineering Project of Shanxi Province, China.

## References

Aggarwal, C. C. and Reddy, C. K. (eds.). Data Clustering: Algorithms and Applications. CRC Press, 2014. ISBN 978-1-46-655821-2.

Bache, K. and Lichman, M. UCI machine learning repository. http://archive.ics.uci.edu/ml.

Cai, D. Codes and datasets for feature learning.
http://www.cad.zju.edu.cn/home/dengcai/Data/data.html.

Dhillon, I. S., Guan, Y., and Kulis, B. Weighted graph cuts without eigenvectors: A multilevel approach. IEEE Transactions on Pattern Analysis and Machine Intelligence, 29(11):1944-1957, 2007.

Elhamifar, E. and Vidal, R. Sparse subspace clustering. In IEEE Conference on Computer Vision and Pattern Recognition, 2009.

Elhamifar, E. and Vidal, R. Sparse subspace clustering: Algorithm, theory, and applications. IEEE Transactions on Pattern Analysis and Machine Intelligence, 35(11):2765-2781, 2013.

Han, J. and Kamber, M. Data Mining: Concepts and Techniques. Morgan Kaufmann, 2001.

Hu, H., Lin, Z., Feng, J., and Zhou, J. Smooth representation clustering. In IEEE Conference on Computer Vision and Pattern Recognition, pp. 3834-3841, 2014.

Jain, A. K. Data clustering: 50 years beyond k-means. In Daelemans, W., Goethals, B., and Morik, K. (eds.), Machine Learning and Knowledge Discovery in Databases, pp. 3-4, Berlin, Heidelberg, 2008. Springer Berlin Heidelberg.

Liu, G., Lin, Z., Yan, S., Sun, J., Yu, Y., and Ma, Y. Robust recovery of subspace structures by low-rank representation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 35(1):171-184, 2013.

Matsushima, S. and Brbic, M. Selective sampling-based scalable sparse subspace clustering. In Advances in Neural Information Processing Systems, 2019.

Ng, A. Y., Jordan, M. I., and Weiss, Y. On spectral clustering: Analysis and an algorithm. In Advances in Neural Information Processing Systems, pp. 849-856. MIT Press, 2001.

Ni, Y., Sun, J., Yuan, S., and Cheong, L. Robust low-rank subspace segmentation with semidefinite guarantees. In Proceedings of the IEEE International Conference on Data Mining Workshops, pp. 1179-1188, 2010.

Nie, F., Wang, X., and Huang, H. Clustering and projected clustering with adaptive neighbors. In Proceedings of the 20th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 977-986, 2014.

Parsons, L., Haque, E., and Liu, H. Subspace clustering for high dimensional data: A review. ACM SIGKDD Explorations Newsletter, pp. 90-105, 2004.

Peng, X., Zhang, L., and Yi, Z. Scalable sparse subspace clustering. In IEEE Conference on Computer Vision and Pattern Recognition, pp. 430-437, 2013.

Shi, J. and Malik, J. Normalized cuts and image segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 22(8):888-905, 2000.

Wang, Y. and Xu, H. Noisy sparse subspace clustering. Journal of Machine Learning Research, 17:1-14, 2016.

Yang, Y., Feng, J., Jojic, N., Yang, J., and Huang, T. L0-sparse subspace clustering. In 14th European Conference on Computer Vision, pp. 731-747, 2016.

You, C., Robinson, D. P., and Vidal, R. Scalable sparse subspace clustering by orthogonal matching pursuit. In IEEE Conference on Computer Vision and Pattern Recognition, pp. 3918-3927, 2016.

Zhang, T., Ji, P., Harandi, M., Huang, W., and Li, H. Neural collaborative subspace clustering. In International Conference on Machine Learning, 2019.

Zhu, X., Loy, C. C., and Gong, S. Constructing robust affinity graphs for spectral clustering. In IEEE Conference on Computer Vision and Pattern Recognition, pp. 1450-1457, 2014.