# A Simple Approach to Automated Spectral Clustering

Jicong Fan^{1,2}, Yiheng Tu^{3,4}, Zhao Zhang^{5}, Mingbo Zhao^{6}, Haijun Zhang^{7}

^{1}The Chinese University of Hong Kong, Shenzhen; ^{2}Shenzhen Research Institute of Big Data; ^{3}Chinese Academy of Sciences, Beijing; ^{4}University of Chinese Academy of Sciences, Beijing; ^{5}Hefei University of Technology, Hefei; ^{6}Donghua University, Shanghai; ^{7}Harbin Institute of Technology, Shenzhen

fanjicong@cuhk.edu.cn, yihengtu@gmail.com, cszzhang@gmail.com, mzhao4@dhu.edu.cn, hjzhang@hit.edu.cn

## Abstract

The performance of spectral clustering heavily relies on the quality of the affinity matrix. A variety of affinity-matrix-construction (AMC) methods have been proposed, but they have hyperparameters that must be determined beforehand, which requires strong experience and leads to difficulty in real applications, especially when the inter-cluster similarity is high and/or the dataset is large. In addition, we often need to choose different AMC methods for different datasets, which again depends on experience. To solve these two challenging problems, in this paper we present a simple yet effective method for automated spectral clustering. First, we propose to find the most reliable affinity matrix via grid search or Bayesian optimization among a set of candidates given by different AMC methods with different hyperparameters, where the reliability is quantified by the relative-eigen-gap of the graph Laplacian introduced in this paper. Second, we propose a fast and accurate AMC method based on least squares representation and thresholding and prove its effectiveness theoretically. Finally, we provide a large-scale extension of the automated spectral clustering method, whose time complexity is linear in the number of data points. Extensive experiments on natural image clustering show that our method is more versatile, accurate, and efficient than baseline methods.

## 1 Introduction

Clustering is an important approach to data mining and knowledge discovery. In particular, spectral clustering [Weiss, 1999; Shi and Malik, 2000; Ng et al., 2002; Von Luxburg, 2007] outperforms k-means clustering [Steinhaus and others, 1956], hierarchical clustering [Johnson, 1967], DBSCAN [Ester et al., 1996], and mixtures of probabilistic principal component analyzers [Tipping and Bishop, 1999] in many applications. Roughly speaking, spectral clustering consists of two steps: 1) construct an affinity matrix in which each element denotes the similarity between two data points; 2) perform normalized cut [Shi and Malik, 2000] on the graph corresponding to the affinity matrix. K-nearest neighbors (K-NN) and the Gaussian kernel $k(x, y) = \exp(-\|x - y\|^2/(2\varsigma^2))$ are two popular methods for constructing affinity matrices, where $K$ and $\varsigma$ are hyperparameters.

As the performance of spectral clustering heavily relies on the quality of the affinity matrix, a variety of methods have been proposed in recent years to construct or learn affinity matrices for spectral clustering. Many of them fall into the framework of the self-expressive model [Roweis and Saul, 2000; Elhamifar and Vidal, 2013], i.e.,
$$\operatorname*{minimize}_{C}\ \tfrac{1}{2}\|X - XC\|_F^2 + \lambda R(C).$$
Here the columns of $X \in \mathbb{R}^{m\times n}$ are the data points drawn from a union of subspaces, $C \in \mathbb{R}^{n\times n}$ is a coefficient matrix, $R(C)$ denotes a regularization operator on $C$, and $\lambda$ is a hyperparameter to be determined in advance.
Elhamifar and Vidal [2013] proposed to use $R(C) = \|C\|_1 := \sum_{i=1}^{n}\sum_{j=1}^{n}|c_{ij}|$ under the constraint $\operatorname{diag}(C) = 0$. In [Elhamifar and Vidal, 2013], the affinity matrix for spectral clustering is given by $A = |C| + |C|^\top$. The method is called Sparse Subspace Clustering (SSC). Some theoretical results on SSC can be found in [Wang and Xu, 2013; Soltanolkotabi et al., 2014]. Following the self-expressive framework, Liu et al. [2013] let $R(C) = \|C\|_*$ (the nuclear norm of $C$) and proposed a Low-Rank Representation (LRR) method for subspace clustering. Lu et al. [2012] and Pan Ji et al. [2014] used the least squares representation (LSR) model for subspace clustering. A few variants of LRR and SSC can be found in [Patel and Vidal, 2014; Li and Vidal, 2015; Patel et al., 2015; Shen and Li, 2016; Li and Vidal, 2016; Fan and Chow, 2017; Fan et al., 2018; Lu et al., 2018; Pan and Kang, 2021; Kang et al., 2022]. Recently, deep learning methods have also been used to learn affinity matrices for spectral clustering [Ji et al., 2017; Zhang et al., 2019b,a; Lv et al., 2021] and have achieved state-of-the-art performance on many benchmark datasets.

One common limitation of these spectral or subspace clustering methods is that they have at least one hyperparameter to determine. In the code of SSC² and its variants provided by their authors, there is usually one more thresholding parameter for the affinity matrix, which affects the clustering accuracy considerably. In deep learning clustering methods such as [Ji et al., 2017] and [Zhang et al., 2019b], we need to determine the network structures and regularization parameters, which is much more difficult. Since clustering is an unsupervised learning problem, the hyperparameters cannot be tuned by the cross-validation widely used in supervised learning. Thus we have to tune the hyperparameters of spectral clustering by experience, which is difficult when the dataset is quite different from those in our experience and/or the inter-cluster similarity is high compared to the intra-cluster similarity. Note that SSC, LRR, and their kernel or deep learning extensions have quadratic or even cubic time complexity (per iteration), which further increases the difficulty of hyperparameter selection when clustering large datasets, though there have been a few works improving the computational efficiency [Peng et al., 2013; Cai and Chen, 2014; Wang et al., 2014; Peng et al., 2015; You et al., 2016a,b; Li and Zhao, 2017; You et al., 2018; Matsushima and Brbic, 2019; Li et al., 2020; Chen et al., 2020; Kang et al., 2020; Fan, 2021; Cai et al., 2022]. On the other hand, different datasets often require different AMC methods, which is hard to handle by experience alone.

This paper addresses model and hyperparameter selection for spectral clustering, aiming to improve the convenience, accuracy, and efficiency of spectral clustering. Our contributions are as follows.

- We propose a relative-eigen-gap based automated spectral clustering (AutoSC) method. It finds the Laplacian matrix with the largest relative-eigen-gap among a set of candidates constructed by different models with different hyperparameters. We also implement the AutoSC method via Bayesian optimization. The method can select the possibly best model and optimize the hyperparameters automatically. Note that any AMC method (e.g. SSC) can be included in the framework of AutoSC.
- To improve the accuracy and efficiency of AutoSC, we propose a new AMC method based on least squares representation and thresholding and prove its effectiveness theoretically.
- We provide an extension of AutoSC to cluster large-scale datasets. Experiments on seven benchmark image datasets demonstrate the effectiveness of our method. In particular, our method outperforms state-of-the-art methods for large-scale clustering.

## 2 Related work

**Exploiting eigenvalue information for clustering.** As the number of zero eigenvalues of a Laplacian matrix is equal to the number of connected components of the graph [Von Luxburg, 2007], a few researchers took advantage of eigenvalue information in spectral clustering [Meila et al., 2005; Meila and Shortreed, 2006; Ji et al., 2015; Hu et al., 2017; Lu et al., 2018]. For instance, Ji et al. [2015] utilized the eigen-gap to determine the rank of the Shape Interaction Matrix, but the method requires determining another hyperparameter $\gamma$ beforehand and needs to perform spectral clustering multiple times. The methods of [Meila et al., 2005; Hu et al., 2017; Lu et al., 2018] are based on iterative optimization (they need to perform eigenvalue decomposition at every iteration) and hence are not effective in handling large-scale datasets. In addition, the BDR method of [Lu et al., 2018] has two hyperparameters $(\lambda, \gamma)$ to determine by experience, although it outperformed SSC and LRR on some datasets. A comparison is shown in Figure 1.

²Wang and Xu [2013] and Soltanolkotabi et al. [2014] provided lower and upper bounds for the $\lambda$ in SSC theoretically, which however depend on the unknown noise level.

**Automated machine learning.** Automated model and hyperparameter selection for supervised learning has been extensively studied [Hutter et al., 2019]. In contrast, the study for unsupervised learning is very limited. The reason is that in unsupervised learning there is no ground truth or reliable metric to evaluate the performance of algorithms. Concurrently to our work, Poulakis [2020] also attempted automated clustering. Specifically, Poulakis [2020] proposed to use meta-learning to select the clustering algorithm and to maximize, via grid search or Bayesian optimization [Jones et al., 1998], a heuristic combination of clustering validity metrics such as the Silhouette coefficient [Liu et al., 2010] and S_Dbw [Halkidi and Vazirgiannis, 2001]. One problem is that these metrics are mainly based on Euclidean distances or densities and hence may not be suitable for evaluating the clustering performance of non-distance or non-density based clustering algorithms. Another is that there is no unified metric to compare different clustering algorithms.

## 3 Automated Spectral Clustering (AutoSC)

### 3.1 Preliminary Knowledge

Let $A \in \mathbb{R}^{n\times n}$ be an affinity matrix constructed from a given data matrix $X \in \mathbb{R}^{m\times n}$. The corresponding graph is denoted by $G = (V, E)$, where $V = \{v_1, \ldots, v_n\}$ is the vertex set and $E = \{e_1, \ldots, e_l\}$ is the edge set. The degree matrix of a graph $G$ is defined as $D = \operatorname{diag}(A\mathbf{1})$, where $\mathbf{1} = [1, \ldots, 1]^\top$. Our goal is to partition the vertices into $k$ disjoint nonempty subsets $C_1, \ldots, C_k$. Let $\mathcal{C} = \{C_1, \ldots, C_k\}$. We wish to find a partition $\mathcal{C}$ that minimizes the following metric.

Definition 3.1 (MNCut). The multiway normalized cut (MNCut) [Meila, 2001] is defined as
$$\mathrm{MNCut}(\mathcal{C}) = \sum_{i=1}^{k}\sum_{j\neq i}\frac{\mathrm{Cut}(C_i, C_j)}{\mathrm{Vol}(C_i)}, \qquad (1)$$
where $\mathrm{Cut}(C_i, C_j) = \sum_{u\in C_i}\sum_{v\in C_j} A_{uv}$ and $\mathrm{Vol}(C_i)$ denotes the sum of the vertex degrees of $C_i$.

The normalized graph Laplacian matrix is defined as
$$L = I - D^{-1/2} A D^{-1/2}, \qquad (2)$$
where $I$ is an identity matrix.
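As a concrete illustration of (2) and of the Laplacian eigenvalues used throughout Section 3, the following minimal NumPy sketch builds the normalized Laplacian from a symmetric affinity matrix and returns its smallest eigenvalues. This is an illustrative sketch only (the authors' released implementation is in MATLAB), and the function names are ours.

```python
import numpy as np

def normalized_laplacian(A, eps=1e-10):
    """L = I - D^{-1/2} A D^{-1/2} for a symmetric, nonnegative affinity matrix A (Eq. (2))."""
    d = A.sum(axis=1)                                  # vertex degrees, D = diag(A 1)
    d_inv_sqrt = 1.0 / np.sqrt(np.maximum(d, eps))     # guard against isolated vertices
    L = np.eye(A.shape[0]) - (d_inv_sqrt[:, None] * A) * d_inv_sqrt[None, :]
    return (L + L.T) / 2                               # symmetrize against round-off

def smallest_eigenvalues(L, num):
    """The `num` smallest eigenvalues of the symmetric matrix L, in ascending order."""
    return np.linalg.eigvalsh(L)[:num]
```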
The normalized graph Laplacian is often more effective than the unnormalized one in spectral clustering (some theoretical justification was given by [Von Luxburg, 2007]). Let $\sigma_i(L)$ be the $i$-th smallest eigenvalue of $L$. The following claim shows the connection between $\mathrm{MNCut}(\mathcal{C})$ and $L$.

Claim 3.2. The sum of the $k$ smallest eigenvalues of $L$ quantifies the potential connectivity among $C_1, \ldots, C_k$:
$$\mathrm{MNCut}(\mathcal{C}) \geq \sum_{i=1}^{k}\sigma_i(L).$$

The claim can be easily proved using Lemma 4 of [Meila, 2001]. We defer all proofs of this paper to the appendices. Because the multiplicity $k$ of the eigenvalue 0 of $L$ equals the number of connected components in $G$ [Von Luxburg, 2007], we expect to construct an affinity matrix $A$ from $X$ such that $L$ has $k$ zero eigenvalues. Thus the optimal partition means $\mathrm{MNCut}(\mathcal{C}) = \sum_{i=1}^{k}\sigma_i(L) = 0$.

### 3.2 Relative Eigen-Gap Guided Search

In practice, we may construct an $A$ such that $\sum_{i=1}^{k}\sigma_i(L)$ is as small as possible, because guaranteeing zero eigenvalues is difficult. But this is not enough, because $L$ may have $k+1$ or more very small or even zero eigenvalues. The second smallest eigenvalue of the Laplacian matrix of a graph $G$ is called the algebraic connectivity of $G$ (denoted by $\mathrm{ac}(G)$) [Fiedler, 1973]. We have $\mathrm{ac}(G) = 0$ if and only if $G$ is not connected. When $G$ has $k$ disjoint components, there are $k$ algebraic connectivities, denoted by $\mathrm{ac}(C_1), \ldots, \mathrm{ac}(C_k)$. Based on this, we have

Claim 3.3. The $(k+1)$-th smallest eigenvalue of $L$ quantifies the least potential connectivity of the partitions $C_1, \ldots, C_k$ of $\mathcal{C}$:
$$\min_{1\leq i\leq k} \mathrm{MNCut}(C_i) \geq \sigma_{k+1}(L). \qquad (3)$$

In other words, $\sigma_{k+1}(L)$ measures the difficulty of segmenting each $C_i$ into two subsets. Hence, when $\sigma_{k+1}(L)$ is large, the partitions $C_1, \ldots, C_k$ are stable. Based on Claim 3.2 and Claim 3.3, we may construct an $A$ that has small $\sum_{i=1}^{k}\sigma_i(L)$ and large $\sigma_{k+1}(L)$ simultaneously, by solving
$$\operatorname*{maximize}_{\theta}\ \ \sigma_{k+1}(L) - \frac{1}{k}\sum_{i=1}^{k}\sigma_i(L) \quad \text{subject to } L = I - D^{-1/2}AD^{-1/2},\ A = f_\theta(X), \qquad (4)$$
where $f_\theta: \mathbb{R}^{m\times n} \to \mathbb{R}^{n\times n}$ is a function with parameter $\theta$, e.g. $A = [\exp(-\|x_i - x_j\|^2/(2\varsigma^2))]_{ij} - I$. It is difficult to solve (4) because of the composition of $f_\theta$, the symmetric normalized Laplacian, and the eigenvalue decomposition. On the other hand, in (4) we have to choose $f$ in advance, which requires domain expertise or strong experience because different datasets usually need different $f$. Note that different $f$ can result in very different distributions of eigenvalues, and the small eigenvalues are sensitive to $f$, $\theta$, and noise. Hence the objective in (4) is not effective for comparing different $f$ and $\theta$. In this paper, we define a new metric, the relative-eigen-gap, as follows:
$$\mathrm{reg}(L) := \frac{\sigma_{k+1}(L) - \frac{1}{k}\sum_{i=1}^{k}\sigma_i(L)}{\frac{1}{k}\sum_{i=1}^{k}\sigma_i(L) + \varepsilon}, \qquad (5)$$
where $\varepsilon$ is a small constant (e.g. $10^{-6}$) to avoid a zero denominator. $\mathrm{reg}(L)$ is not sensitive to the scale of the small eigenvalues. Therefore, instead of (4), we propose to solve
$$\operatorname*{maximize}_{(f,\theta)\,\in\,\mathcal{F}\times\Theta}\ \mathrm{reg}(L), \quad \text{subject to } L = I - D^{-1/2}AD^{-1/2},\ A = f_\theta(X), \qquad (6)$$
where $\mathcal{F}$ is a set of pre-defined functions and $\Theta$ is a set of hyperparameters. In fact, (6) is equivalent to choosing, from a set of candidates constructed by different $f$ with different $\theta$, the $A$ (or $L$) whose relative-eigen-gap is largest. The best $\theta$ can be found using grid search (or even random search). For convenience, we call this method AutoSC-GD. Table 1 shows a few examples of $f$ and its parameters. One may use a weighted sum of affinity matrices given by different $f$, as in [Huang et al., 2012], which however introduces more hyperparameters.

Table 1: A few examples of f and its θ for AMC (AASC: [Huang et al., 2012])

| f | K-NN | ϵ-neighborhood | Gaussian kernel | SSC | LRR | LSR | KSSC | AASC |
|---|------|----------------|-----------------|-----|-----|-----|------|------|
| θ | K | ϵ | σ | λ | λ | λ | λ, σ | σ₁, σ₂, … |
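To make (5) and (6) concrete, here is a small NumPy sketch of the relative-eigen-gap and of the grid search behind AutoSC-GD. It reuses `normalized_laplacian` from the sketch in Section 3.1; the candidate list and its hyperparameter grids are placeholders to be filled with any AMC methods from Table 1.

```python
import numpy as np

def relative_eigen_gap(L, k, eps=1e-6):
    """reg(L) of Eq. (5): (sigma_{k+1} - mean of the k smallest) / (mean of the k smallest + eps)."""
    sigma = np.linalg.eigvalsh((L + L.T) / 2)      # eigenvalues in ascending order
    mean_small = sigma[:k].mean()
    return (sigma[k] - mean_small) / (mean_small + eps)

def autosc_gd(X, k, candidates):
    """Grid search of Eq. (6). `candidates` is an iterable of (name, f, theta) triples,
    where f(X, **theta) returns a symmetric affinity matrix with zero diagonal."""
    best_score, best_choice = -np.inf, None
    for name, f, theta in candidates:
        A = f(X, **theta)
        L = normalized_laplacian(A)
        score = relative_eigen_gap(L, k)
        if score > best_score:
            best_score, best_choice = score, (name, theta, A)
    return best_score, best_choice                 # largest reg(L) and the winning (f, theta, A)
```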
The following theorem³ shows the connection between $\mathrm{reg}(L)$ and the stability of the clustering $\mathcal{C}$.

Theorem 3.4. Let $\mathcal{C}$ and $\mathcal{C}'$ be two partitions of the vertices of $G$, where $|\mathcal{C}| = |\mathcal{C}'| = k$. Define the distance between $\mathcal{C}$ and $\mathcal{C}'$ as
$$\mathrm{dist}(\mathcal{C}, \mathcal{C}') = 1 - \sum_{C_i\in\mathcal{C}}\sum_{C'_j\in\mathcal{C}'}\frac{\big(\mathrm{Vol}(C_i\cap C'_j)\big)^2}{\mathrm{Vol}(C_i)\,\mathrm{Vol}(C'_j)}.$$
Suppose $\eta k\varepsilon \leq \sum_{i=1}^{k}\sigma_i(L) \leq k\varepsilon$ and $\mathrm{reg}(L) > (k-1)\eta/2$. Let $\delta = \max\big\{\mathrm{MNCut}(\mathcal{C}) - \sum_{i=1}^{k}\sigma_i(L),\ \mathrm{MNCut}(\mathcal{C}') - \sum_{i=1}^{k}\sigma_i(L)\big\}$. Then
$$\mathrm{dist}(\mathcal{C}, \mathcal{C}') < \frac{1.5\,\delta\,\varepsilon^{-1}}{\mathrm{reg}(L) + (1-k)\eta/2}. \qquad (7)$$

This indicates that when $\mathrm{reg}(L)$ is large and $\delta$ is small, the partitions $\mathcal{C}$ and $\mathcal{C}'$ are close to each other; thus the clustering has high stability. When $\sum_{i=1}^{k}\sigma_i(L) = k\varepsilon$, we have $\mathrm{dist}(\mathcal{C}, \mathcal{C}') < 6\delta/(\sigma_{k+1}(L) - k\varepsilon)$, which means that the larger $\sigma_{k+1}(L)$, the more stable the clustering.

³This theorem is a modified version of Theorem 1 in [Meila et al., 2005], which is for the eigen-gap $\sigma_{k+1} - \sigma_k$ of $L$. Here we consider $\mathrm{reg}(L)$ instead.

**Compare reg(L) with [Meila and Shortreed, 2006].** It is worth noting that Meila and Shortreed [2006] proposed to minimize $J(L) := \frac{1}{k}\sum_{i=1}^{k}\sigma_i(L) + \alpha\,(\sigma_k(L) - \sigma_{k+1}(L))^2$ to find a good affinity matrix for spectral clustering. Although $\mathrm{reg}(L)$ and $J(L)$ seem similar, they are essentially different. As mentioned before, different AMC methods may lead to different scales for the small eigenvalues, which makes it difficult to compare different AMC methods using $J(L)$. In addition, $J(L)$ has a hyperparameter $\alpha$ to determine beforehand, which violates our goal of searching over models and hyperparameters. A comparative study ($\alpha \geq 0$) is in Section 4.1 (Table 3).

### 3.3 AutoSC via Bayesian Optimization

Bayesian optimization (BO) [Jones et al., 1998] has become a promising tool for hyperparameter optimization of supervised machine learning algorithms [Snoek et al., 2012; Klein et al., 2017]. Given a black-box function $g: \mathcal{X} \to \mathbb{R}$, BO aims to find an $x \in \mathcal{X}$ that globally minimizes $g$, and usually has three steps. The first step is finding the most promising point $x_{t+1} \in \arg\max_x a_{p(g)}(x)$ by numerical optimization, where $a_{p(g)}: \mathcal{X}\to\mathbb{R}$ is an acquisition function (e.g. Expected Improvement) relying on a prior $p(g)$ (e.g. Gaussian processes [Williams and Rasmussen, 2006]). The second step is evaluating the expensive and possibly noisy function, $y_{t+1} = g(x_{t+1}) + \mathcal{N}(0, \sigma^2)$, and adding the new sample $(x_{t+1}, y_{t+1})$ to the observation set $\mathcal{D}_t = \{(x_1, y_1), \ldots, (x_t, y_t)\}$. The last step is updating $p(g)$ and $a_{p(g)}$ using $\mathcal{D}_{t+1}$.

As an alternative to the grid search for (6), we can maximize $\mathrm{reg}(L)$ via BO. Suppose we have a set of different AMC models, i.e., $\mathcal{F} = \{f_1, f_2, \ldots, f_M\}$. For $i = 1, 2, \ldots, M$, let $g_i(\theta^{(i)}) := -\mathrm{reg}\big(L(f_i(\theta^{(i)}\mid X))\big)$, where $\theta^{(i)}$ denotes the hyperparameters of $f_i$ and $X$ denotes the dataset. Then we use BO to find
$$\theta^{(i)}_\star = \operatorname*{argmin}_{\theta^{(i)}\in\,\mathcal{S}^{(i)}}\ g_i(\theta^{(i)}), \qquad (8)$$
where $\mathcal{S}^{(i)}$ denotes the set of constraints. Finally, we get the best model with its best hyperparameters, $f_{i^\star}(\theta^{(i^\star)}_\star\mid X)$, where
$$i^\star = \operatorname*{argmin}_{1\leq i\leq M}\ g_i(\theta^{(i)}_\star). \qquad (9)$$
For convenience, we denote this method by AutoSC-BO. Note that $\mathcal{F}$ can include any AMC methods, such as those in Table 1 and even DSC [Ji et al., 2017] (see Appendix D.5).
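The paper does not tie AutoSC-BO to a particular BO implementation. As an illustration only, the sketch below runs the per-model search (8) for the Gaussian-kernel AMC model (one of the $f_i$ in Table 1) using scikit-optimize, which is our assumption, not the authors' tooling. Following the definition of $g_i$ above, it minimizes the negated relative-eigen-gap; it reuses `normalized_laplacian` and `relative_eigen_gap` from the earlier sketches, and the bandwidth bounds and toy data are placeholders.

```python
import numpy as np
from scipy.spatial.distance import cdist
from skopt import gp_minimize          # assumes scikit-optimize is installed
from skopt.space import Real

def gaussian_affinity(X, sigma):
    """A_ij = exp(-||x_i - x_j||^2 / (2 sigma^2)) with zero diagonal; X is m x n (columns are points)."""
    D2 = cdist(X.T, X.T, "sqeuclidean")
    A = np.exp(-D2 / (2.0 * sigma ** 2))
    np.fill_diagonal(A, 0.0)
    return A

def objective(theta, X, k):
    """g(theta) = -reg(L) for the Gaussian-kernel model, so that BO minimization maximizes reg."""
    A = gaussian_affinity(X, sigma=theta[0])
    L = normalized_laplacian(A)
    return -relative_eigen_gap(L, k)

X, k = np.random.randn(20, 200), 4                             # placeholder data: m = 20, n = 200
space = [Real(1e-2, 1e2, prior="log-uniform", name="sigma")]   # illustrative bounds for sigma
res = gp_minimize(lambda t: objective(t, X, k), space,
                  acq_func="EI", n_calls=30, random_state=0)   # EI acquisition, as in Eq. (10)
best_sigma, best_reg = res.x[0], -res.fun
```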
In AutoSC-BO, we use the Expected Improvement (EI) acquisition function
$$a_{\mathrm{EI}}(s\mid\mathcal{D}_t) = \mathbb{E}_{p}\big[\max(g_{\min} - g(s), 0)\big], \qquad (10)$$
where $g_{\min}$ is the best function value known so far. The closed-form formulation is
$$a_{\mathrm{EI}}(s\mid\mathcal{D}_t) = (g_{\min} - \mu)\,\Phi\!\left(\frac{g_{\min} - \mu}{\sigma}\right) + \sigma\,\phi\!\left(\frac{g_{\min} - \mu}{\sigma}\right), \qquad (11)$$
where $\mu = \mu(s\mid\mathcal{D}_t, \theta_K)$ and $\sigma = \sigma(s\mid\mathcal{D}_t, \theta_K)$ are the predictive mean and standard deviation of the Gaussian process, $\Phi$ and $\phi$ are the standard Gaussian cumulative distribution function and probability density function respectively, and $\theta_K$ denotes the hyperparameters of the Gaussian process. For the covariance function, we use the automatic relevance determination (ARD) Matérn 5/2 kernel [Matérn, 2013]
$$k_{\mathrm{M52}}(s, s') = \theta_0\left(1 + \sqrt{5r^2(s, s')} + \tfrac{5}{3}r^2(s, s')\right)\exp\!\left(-\sqrt{5r^2(s, s')}\right), \qquad (12)$$
where $r^2(s, s') = \sum_{j=1}^{d}(s_j - s'_j)^2/\theta_j^2$.

### 3.4 Discussion on AMC Methods for AutoSC and LSR with Thresholding

In AutoSC, the size of the search space is $|\mathcal{F}|\prod_j|\theta_j|$, which should be large enough to include effective models and their hyperparameters. A large number of works have shown that self-expressive models [Elhamifar and Vidal, 2013; Liu et al., 2013; Lu et al., 2018; Ji et al., 2017] often outperform other AMC models such as the Gaussian kernel. However, self-expressive AMC methods usually require iterative optimization and have at least quadratic time complexity per iteration, which leads to a huge time cost in AutoSC. Although LSR [Lu et al., 2012] has a closed-form solution, its clustering accuracy is not satisfactory [Lu et al., 2018]. In this work, we show that LSR with a simple post-processing operation can be a good AMC method and can outperform SSC, LRR, and BDR [Lu et al., 2018]. Specifically, the LSR model is given as
$$\operatorname*{minimize}_{C}\ \tfrac{1}{2}\|X - XC\|_F^2 + \tfrac{\lambda}{2}\|C\|_F^2, \qquad (13)$$
whose closed-form solution is $C = (X^\top X + \lambda I)^{-1}X^\top X$. Setting $\operatorname{diag}(C) = 0$ and $C \leftarrow |C|$, the affinity matrix can be constructed as $A = (C + C^\top)/2$. One problem is that the off-diagonal elements of $A$ are dense (leading to a connected graph), which can result in low clustering accuracy. Therefore, we propose to truncate $C$ by keeping only the largest $\tau$ elements of each column of $C$. Nevertheless, it is not easy to determine $\tau$ beforehand: when $\tau$ is too small, the corresponding graph will have $k+1$ or more connected components; when $\tau$ is too large, the corresponding graph will have $k-1$ or fewer connected components. However, $\tau$ can be determined automatically by our AutoSC-GD and AutoSC-BO.

In the case that the data have some low-dimensional nonlinear structures, the similarity between pairwise columns of $X$ cannot be well recognized by the linear regression (13). Therefore, we also consider the following nonlinear regression model
$$\operatorname*{minimize}_{C}\ \tfrac{1}{2}\|\phi(X) - \phi(X)C\|_F^2 + \tfrac{\lambda}{2}\|C\|_F^2, \qquad (14)$$
where $\phi$ denotes a nonlinear feature map performed on each column of the matrix, i.e. $\phi(X) = [\phi(x_1), \ldots, \phi(x_n)]$. In (14), letting $\phi$ be some feature map induced by a kernel function $k(\cdot,\cdot)$ (e.g. the polynomial kernel $k(x_i, x_j) = (x_i^\top x_j + b)^q$ and the Gaussian kernel), we get the kernel LSR (KLSR):
$$\operatorname*{minimize}_{C}\ \tfrac{1}{2}\operatorname{Tr}\big(K - 2KC + C^\top KC\big) + \tfrac{\lambda}{2}\|C\|_F^2, \qquad (15)$$
where $K = \phi(X)^\top\phi(X)$ and $[K]_{ij} = k(x_i, x_j)$. The closed-form solution is $C = (K + \lambda I)^{-1}K$. The post-processing is the same as that for the solution of LSR.
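The closed-form solutions of (13) and (15), together with the column-wise truncation, can be written in a few lines. The NumPy sketch below is our illustrative reading of this construction (the released code is MATLAB), with our own function names.

```python
import numpy as np

def _truncate_columns(C, tau):
    """Keep only the tau largest entries of each column of the nonnegative matrix C."""
    n = C.shape[0]
    cutoff = np.partition(C, n - tau, axis=0)[n - tau]   # per-column tau-th largest value
    C = C.copy()
    C[C < cutoff[None, :]] = 0.0
    return C

def lsr_threshold_affinity(X, lam, tau):
    """LSR (13): C = (X^T X + lam I)^{-1} X^T X, then diag(C)=0, C<-|C|, truncation, A=(C+C^T)/2."""
    n = X.shape[1]
    G = X.T @ X
    C = np.linalg.solve(G + lam * np.eye(n), G)
    np.fill_diagonal(C, 0.0)
    C = _truncate_columns(np.abs(C), tau)
    return (C + C.T) / 2.0

def klsr_threshold_affinity(K, lam, tau):
    """KLSR (15): C = (K + lam I)^{-1} K for a precomputed kernel matrix K, same post-processing."""
    n = K.shape[0]
    C = np.linalg.solve(K + lam * np.eye(n), K)
    np.fill_diagonal(C, 0.0)
    C = _truncate_columns(np.abs(C), tau)
    return (C + C.T) / 2.0
```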
We introduce the following property, a necessary condition for successful subspace clustering, which is similar to the one used in [Wang and Xu, 2013; Soltanolkotabi et al., 2014].

Definition 3.5 (Subspace Detection Property). A symmetric affinity matrix $A$ obtained from $X$ has the subspace detection property if, for all $i$, the nonzero elements of $a_i$ correspond to the columns of $X$ in the same subspace as $x_i$.

For convenience, let $\pi(i)$ be the index of the subspace $x_i$ belongs to and $\mathcal{C}_j$ be the index set of the columns of $X$ in subspace $j$. We consider the following deterministic model.

Definition 3.6 (Deterministic Model). The columns of $X\in\mathbb{R}^{m\times n}$ are drawn from a union of $k$ different subspaces and are further corrupted by noise, where $\dim(\mathcal{S}_1\oplus\cdots\oplus\mathcal{S}_k) = d < m \leq n$. Let $X = U\Sigma V^\top$ be the SVD of $X$, where $\Sigma = \operatorname{diag}(\sigma_1, \ldots, \sigma_n)$ and $\sigma_1\geq\sigma_2\geq\cdots\geq\sigma_n$. Let $\gamma = \sigma_{d+1}/\sigma_d$. Denote by $v_i = (v_{i1}, \ldots, v_{in})$ the $i$-th row of $V$ and let $\bar{v}_i = (v_{i1}, \ldots, v_{id})$. Suppose the following conditions hold⁴: 1) for every $i\in[n]$, the $\tau$-th largest element of $\{|\bar{v}_i^\top\bar{v}_j| : j\in\mathcal{C}_{\pi(i)}\}$ is greater than $\alpha$; 2) $\max_{i\in[n]}\max_{j\in[n]\setminus\mathcal{C}_{\pi(i)}}|\bar{v}_i^\top\bar{v}_j| \leq \beta$; 3) $\max_{i,j,l}|v_{il}v_{jl}| \leq \mu$.

Then the following theorem verifies the effectiveness of (13), followed by the truncation (thresholding) operation, in subspace detection.

Theorem 3.7. Suppose $X$ is given by Definition 3.6 and $C$ is given by (13) with
$$\frac{\rho - \sqrt{\rho^2 - 4(2\mu d - \Delta)(2\mu m - 2\mu d + \Delta)}}{4\mu d - 2\Delta}\,\sigma_d^2 \;<\; \lambda \;<\; \frac{\rho + \sqrt{\rho^2 - 4(2\mu d - \Delta)(2\mu m - 2\mu d + \Delta)}}{4\mu d - 2\Delta}\,\sigma_d^2, \qquad (16)$$
where $\rho = 2\mu m\gamma^2/(1+\gamma^2)$ and $\Delta = \alpha - \beta$. Then the $C$ truncated by any $\tau'\leq\tau$ has the subspace detection property.

⁴$\gamma$ measures the noise level, $\beta$ is dominated by the difference between subspaces, and $\mu$ quantifies the incoherence in the singular vectors.

In Theorem 3.7, the width of the range of $\lambda$ is $w = \frac{\sqrt{\rho^2 - 4(2\mu d - \Delta)(2\mu m - 2\mu d + \Delta)}\,\sigma_d^2}{2\mu d - \Delta}$. We see that a larger $\sigma_d$ or $\Delta$, or a smaller $\gamma$ or $d$, leads to a wider range of $\lambda$, which corresponds to a simpler clustering problem. When $\rho^2 < 4(2\mu d - \Delta)(2\mu m - 2\mu d + \Delta)$, such a $\lambda$ does not exist. Theorem 3.7 can be extended to the kernel case (15), without the restriction $d < m$, even when the columns of $X$ are drawn from a union of nonlinear low-dimensional manifolds; see Definition C.1, Definition C.2, and Theorem C.3 in Appendix C. Based on Theorem 3.7 and Theorem C.3, the following proposition indicates that AutoSC can cluster the data correctly.

Proposition 3.8. Suppose the affinity matrix $A$ given by AutoSC has the subspace or manifold detection property (defined in Appendix C) and $\mathrm{reg}(L) = \sigma_{k+1}/\varepsilon > 0$. Then each connected component of $G$ consists of all columns of $X$ in the same subspace or manifold.

Now we see that LSR and KLSR with thresholding can provide effective self-expressive affinity matrices for AutoSC without performing iterative optimization. On the other hand, the relative-eigen-gap is able to compare LSR with KLSR, compare different kernels, and evaluate $\lambda$, $\tau$, and the kernel parameters. Note that if we use SSC and KSSC instead of LSR and KLSR, AutoSC will be very time-consuming; if we use LSR and KLSR without thresholding, AutoSC may not provide high clustering accuracy. We want AutoSC to be not only automatic but also accurate and efficient.

### 3.5 AutoSC+NSE for Large-Scale Data

Since the time and space complexity of AutoSC are quadratic in $n$, it cannot be directly applied to large-scale datasets. To solve this problem, we propose to perform Algorithm 1 on a set of $s$ landmarks of the data (denoted by $\hat{X}$) to get a $\hat{Z}$. The landmarks can be generated by k-means or randomly. Then we regard $\hat{Z}$ as a feature matrix and learn a map $g: \mathbb{R}^m\to\mathbb{R}^k$ from $\hat{X}$ to $\hat{Z}$. According to the universal approximation theorem of neural networks [Sonoda and Murata, 2017], we approximate $g$ by a two-layer neural network and solve
$$\operatorname*{minimize}_{W_1, W_2, b_1, b_2}\ \frac{1}{2s}\big\|\hat{Z} - W_2\operatorname{ReLU}(W_1\hat{X} + b_1\mathbf{1}_s^\top) - b_2\mathbf{1}_s^\top\big\|_F^2 + \frac{\gamma}{2}\big(\|W_1\|_F^2 + \|W_2\|_F^2\big), \qquad (17)$$
where $W_1\in\mathbb{R}^{d\times m}$, $W_2\in\mathbb{R}^{k\times d}$, $b_1\in\mathbb{R}^d$, and $b_2\in\mathbb{R}^k$. Since $A$ is sparse, $k$ is often less than $m$, and a neural network is used, we call (17) Neural Sparse Embedding (NSE). We use mini-batch Adam [Kingma and Ba, 2014] to solve NSE.
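As a rough illustration of how (17) can be optimized with mini-batch Adam, here is a PyTorch sketch. It is our own reading, not the authors' implementation (which is in MATLAB); the `weight_decay` option only approximates the Frobenius-norm penalty because it also shrinks the biases, and the batch-mean loss stands in for the $\frac{1}{2s}\|\cdot\|_F^2$ term.

```python
import torch

def fit_nse(X_hat, Z_hat, d_hidden, gamma=1e-4, lr=1e-3, epochs=200, batch=256):
    """Fit the two-layer network of Eq. (17): landmarks X_hat are m x s, their embedding Z_hat is k x s."""
    s = X_hat.shape[1]
    Xt = torch.as_tensor(X_hat.T, dtype=torch.float32)          # s x m (rows are landmarks)
    Zt = torch.as_tensor(Z_hat.T, dtype=torch.float32)          # s x k
    net = torch.nn.Sequential(torch.nn.Linear(Xt.shape[1], d_hidden),
                              torch.nn.ReLU(),
                              torch.nn.Linear(d_hidden, Zt.shape[1]))
    opt = torch.optim.Adam(net.parameters(), lr=lr, weight_decay=gamma)
    for _ in range(epochs):
        perm = torch.randperm(s)
        for i in range(0, s, batch):
            idx = perm[i:i + batch]
            loss = 0.5 * torch.mean((net(Xt[idx]) - Zt[idx]) ** 2)   # mini-batch regression loss
            opt.zero_grad()
            loss.backward()
            opt.step()
    return net   # later applied to the full data as in Eq. (18), followed by k-means
```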
It is worth noting that NSE is different from the method proposed by [Li et al., 2020], where the regression is for an affinity matrix, which leads to high computational cost. The network learned from (17) is applied to $X$ to extract a $k$-dimensional feature matrix $Z$:
$$Z = \hat{g}(X) = W_2\operatorname{ReLU}(W_1 X + b_1\mathbf{1}_n^\top) + b_2\mathbf{1}_n^\top. \qquad (18)$$
Finally, we perform k-means on $Z$ to get the clusters. The procedure is summarized in Algorithm 2 (see Appendix B). Note that the time complexity of AutoSC+NSE is $O(dmn + ds^2)$, where $d$ depends on the specific AMC method. When $s\ll n$, the time complexity of AutoSC+NSE is linear in the number of data points $n$. Proposition C.4 in Appendix C.2 shows that a small number of hidden nodes in NSE is sufficient to make the clustering succeed.

## 4 Experiments

We test our AutoSC on Extended Yale B Face [Kuang-Chih et al., 2005], ORL Face [Samaria and Harter, 1994], COIL20 [Nene et al., 1996], AR Face [Martínez and Kak, 2001], MNIST [LeCun et al., 1998], Fashion-MNIST [Xiao et al., 2017], GTSRB [Stallkamp et al., 2012], and subsets and extracted features of MNIST and Fashion-MNIST. Descriptions of the datasets are in Appendix D.1. Our MATLAB code is available at https://github.com/jicongfan/Automated-Spectral-Clustering.

### 4.1 Intuitive validation of AutoSC

**Performance of Relative-Eigen-Gap.** First, we use LSR and KLSR to show the effectiveness of the proposed $\mathrm{reg}(L)$. Figure 1(i) shows an intuitive example of the performance of LSR and KLSR in clustering a subset of the Extended Yale B database, where for (15) we use the Gaussian kernel with $\varsigma = \frac{1}{n^2}\sum_{ij}\|x_i - x_j\|$. We see that: 1) in LSR and KLSR, for a fixed $\lambda$ (or $\tau$), the $\tau$ (or $\lambda$) with larger $\mathrm{reg}(L)$ provides higher clustering accuracy; 2) for a fixed $\lambda$ and a fixed $\tau$, if LSR has a larger $\mathrm{reg}(L)$, its clustering accuracy is higher than that of KLSR, and vice versa. We conclude that, roughly, a larger $\mathrm{reg}(L)$ indeed leads to higher clustering accuracy, which is consistent with our theoretical analysis in Section 3.2.

Figure 1: Examples of reg(L), clustering accuracy, and hyperparameters on Extended Yale B. (i) reg(L) (rescaled between 0 and 1) and clustering accuracy of LSR and KLSR on the first 5 subjects of the Extended Yale B database. (ii) Clustering accuracies of SSC, LRR, and BDR-B with different hyperparameter λ, in comparison to AutoSC-GD with LSR and KLSR, on the Extended Yale B database. The value of λ used in BDR-B has been divided by 10. The γ in BDR-B is chosen from {0.01, 0.1, 1} and the best one is used for each λ. In Case (b), the time costs of SSC, LRR, BDR-B, and AutoSC-GD are 9.5s, 33.0s, 7.6s, and 1.6s, respectively.

Now we show the superiority of LSR and KLSR compared to a few important AMC methods. Figure 1(ii) shows the clustering accuracy of SSC [Elhamifar and Vidal, 2013], LRR [Liu et al., 2013], and BDR-B [Lu et al., 2018] with different hyperparameters, and of our AutoSC-GD with LSR and KLSR (detailed in Algorithm 1 in the supplementary material), on the Extended Yale B subset. SSC and BDR-B are sensitive to the value of λ, especially for the relatively difficult task of Figure 1(ii-b). LRR is not sensitive to the value of λ but its accuracy is low. LSR and KLSR are more accurate and efficient than the other methods.

**The performance of AutoSC-BO.** We show the performance of AutoSC-BO with many AMC methods such as SSC [Elhamifar and Vidal, 2013] and LSR.
Taking the KLSR model (15) with the polynomial kernel as an example, the parameters are $\theta = (\lambda, b, q, \tau)$ and the constraints are given by $\mathcal{S} = \{\lambda\in\mathbb{R}: \lambda_{\min}\leq\lambda\leq\lambda_{\max};\ b\in\mathbb{R}: b_{\min}\leq b\leq b_{\max};\ q\in\mathbb{Z}_+: q_{\min}\leq q\leq q_{\max};\ \tau\in\mathbb{Z}_+: \tau_{\min}\leq\tau\leq\tau_{\max}\}$. More details are in Appendix D.5. As shown in Table 2, a larger reg corresponds to higher clustering accuracy, and KLSR with the polynomial kernel (the optimal q is 1) performs best. Figure 2 shows the performance of KSSC [Patel and Vidal, 2014] and KLSR in each iteration of AutoSC-BO.

Table 2: Clustering accuracy of AutoSC-BO with many methods on the Yale B Face dataset (first 10 subjects). All hyperparameters of the kernel functions were optimized via Bayesian optimization.

| AMC | ϵ-neighborhood | Polynomial kernel | Gaussian kernel | KSSC (Gauss) | KSSC (Poly) | KLSR (Gauss) | KLSR (Poly) |
|---|---|---|---|---|---|---|---|
| reg_max | 0.776 | 1.294 | 1.307 | 0.892 | 1.388 | 2.217 | 2.379 |
| Accuracy | 0.325 | 0.389 | 0.393 | 0.584 | 0.859 | 0.963 | 0.966 |

**Relative-Eigen-Gap versus Eigen-Gap.** We compare the proposed relative-eigen-gap ($\mathrm{reg}(L)$) with the eigen-gap (denoted by $\mathrm{eg}(L)$) and the regularizer $J(L, \alpha)$ proposed by [Meila and Shortreed, 2006] with different $\alpha$. Note that in AutoSC we need to minimize $J(L, \alpha)$ instead. The clustering results of AutoSC-GD are reported in Table 3 (details about the datasets are in Section 4). We see that our $\mathrm{reg}(L)$ outperforms $\mathrm{eg}(L)$ and $J(L, \alpha)$ in all cases except AR.

Figure 2: AutoSC-BO with KSSC and KLSR on the first 10 subjects of the Yale B Face dataset. (i) reg(L) and clustering accuracy of KLSR and KSSC in each iteration of BO. (ii) reg(L) and hyperparameters of KLSR and KSSC in each iteration of BO.

Table 3: The comparison of AutoSC-GD with reg(L), eg(L), and J(L, α)

| | Yale B | ORL | COIL20 | AR | MNIST | F-MNIST |
|---|---|---|---|---|---|---|
| eg(L) | 0.790 | 0.768 | 0.619 | 0.805 | 0.663 | 0.562 |
| J(L, 0) | 0.823 | 0.795 | 0.750 | 0.804 | 0.726 | 0.516 |
| J(L, 0.1) | 0.818 | 0.788 | 0.768 | 0.826 | 0.718 | 0.519 |
| J(L, 1) | 0.812 | 0.785 | 0.769 | 0.817 | 0.722 | 0.523 |
| J(L, 10) | 0.804 | 0.788 | 0.768 | 0.832 | 0.731 | 0.539 |
| J(L, 100) | 0.788 | 0.765 | 0.769 | 0.817 | 0.735 | 0.545 |
| reg(L) | 0.897 | 0.795 | 0.782 | 0.786 | 0.755 | 0.595 |

### 4.2 Comparative studies of AutoSC and baselines

First we compare AutoSC with SSC, LRR [Liu et al., 2013], LSR [Lu et al., 2012], EDSC [Pan Ji et al., 2014], KSSC, SSC-OMP [You et al., 2016b], BDR-Z [Lu et al., 2018], and BDR-B [Lu et al., 2018] on the six smaller datasets. The clustering accuracy and time cost are reported in Table 4. AutoSC-GD and AutoSC-BO outperformed the other methods significantly in almost all cases. SSC-OMP and AutoSC-GD are more efficient than SSC, LRR, EDSC, and KSSC. The time cost of AutoSC-BO is much higher than that of AutoSC-GD because the former optimizes all hyperparameters, including λ, τ, and kernel parameters (e.g. b, q).

Table 4: Clustering performance on the six small datasets. For MNIST-1k and Fashion-MNIST-1k, we report the average results of 20 trials because the subsets are formed randomly. AutoSC chose LSR for Yale B and AR and chose KLSR for the other datasets. The NMI results are in Table 7 (see Appendix D.3).
| | | SSC | LRR | LSR | EDSC | KSSC | SSC-OMP | BDR-Z | BDR-B | AutoSC-GD | AutoSC-BO |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Yale B | acc | 0.723 | 0.643 | 0.592 | 0.806 | 0.649 | 0.768 | 0.596 | 0.719 | 0.897 | 0.909 |
| | time | 273.8 | 928.1 | 7.3 | 58.6 | 464.3 | 8.9 | 368.8 | 368.8 | 19.1 | 78.3 |
| ORL | acc | 0.711 | 0.762 | 0.680 | 0.712 | 0.707 | 0.665 | 0.739 | 0.735 | 0.795 | 0.803 |
| | time | 2.7 | 8.8 | 0.5 | 2.0 | 2.6 | 0.4 | 3.9 | 3.9 | 2.3 | 20.5 |
| COIL20 | acc | 0.871 | 0.729 | 0.695 | 0.759 | 0.912 | 0.658 | 0.713 | 0.791 | 0.782 | 0.878 |
| | time | 61.8 | 221.2 | 1.4 | 15.4 | 100.6 | 2.5 | 86.8 | 86.8 | 7.6 | 39.2 |
| AR | acc | 0.718 | 0.769 | 0.665 | 0.673 | 0.726 | 0.669 | 0.745 | 0.751 | 0.786 | 0.826 |
| | time | 317.5 | 1220.6 | 14.5 | 69.1 | 627.4 | 57.6 | 578.7 | 578.7 | 43.4 | 130.6 |
| MNIST-1k | acc | 0.596 (0.054) | 0.513 (0.037) | 0.554 (0.041) | 0.536 (0.035) | 0.577 (0.053) | 0.542 (0.038) | 0.576 (0.037) | 0.578 (0.043) | 0.615 (0.041) | 0.619 (0.038) |
| | time | 24.9 | 69.1 | 0.8 | 5.2 | 32.8 | 1.3 | 26.9 | 26.9 | 2.5 | 21.7 |
| Fashion-MNIST-1k | acc | 0.553 (0.025) | 0.515 (0.014) | 0.563 (0.023) | 0.544 (0.017) | 0.548 (0.016) | 0.566 (0.034) | 0.574 (0.019) | 0.563 (0.031) | 0.581 (0.025) | 0.584 (0.021) |
| | time | 24.1 | 68.5 | 0.7 | 5.1 | 35.9 | 1.2 | 25.7 | 25.7 | 2.6 | 22.7 |

We compare our method with LSR, LSC-K [Chen and Cai, 2011], SSC-OMP [You et al., 2016b], S5C [Matsushima and Brbic, 2019], and S3COMP-C [Chen et al., 2020] on the larger datasets. The parameter settings are in Appendix D.6. Table 5 shows the clustering accuracy and standard deviation of 10 repeated trials on the raw-pixel data of MNIST and Fashion-MNIST. Table 6 shows the results on MNIST and Fashion-MNIST features and GTSRB. Our methods have the highest clustering accuracy in every case. Note that if NSE is ablated, the accuracies of AutoSC-GD on MNIST(features)-10k, -20k, and -30k are 0.9772, 0.9783, and 0.9864 respectively, higher than those of AutoSC-GD+NSE, though the time costs increased. It is worth mentioning that, to the best of our knowledge, the deep clustering method of [Zhang et al., 2019a] has SOTA performance on Yale B (acc = 0.98) and ORL (acc = 0.89), the method proposed by [Zhang et al., 2019b] has SOTA performance on Fashion-MNIST (acc = 0.72), the method proposed by [Mahon and Lukasiewicz, 2021] has SOTA performance on MNIST (acc = 0.99), and our method has SOTA performance on GTSRB. Nevertheless, we focus on automated spectral clustering.

Table 5: Clustering accuracy and time cost (seconds) on MNIST and Fashion-MNIST. "–" means the computation is out of memory.

| | | LSR | LSC-K | SSC-OMP | S5C | S3COMP-C | AutoSC-GD+NSE | AutoSC-BO+NSE |
|---|---|---|---|---|---|---|---|---|
| MNIST-10k | acc | 0.583(0.007) | 0.652(0.037) | 0.431(0.014) | 0.646(0.045) | 0.623(0.028) | 0.687(0.035) | 0.679(0.034) |
| | time | 154.9 | 18.9 | 26.4 | 82.3 | 710.4/20 | 16.2 | 48.3 |
| MNIST | acc | – | 0.665(0.021) | 0.453(0.017) | 0.627(0.025) | – | 0.755(0.022) | 0.750(0.009) |
| | time | – | 329.2 | 1178.3 | 961.5 | – | 86.9 | 123.6 |
| Fashion-MNIST-10k | acc | 0.561(0.008) | 0.571(0.025) | 0.509(0.038) | 0.565(0.021) | 0.569(0.024) | 0.576(0.011) | 0.572(0.019) |
| | time | 153.6 | 18.6 | 26.8 | 107.3 | 707.2/20 | 17.3 | 50.9 |
| Fashion-MNIST | acc | – | 0.561(0.015) | 0.359(0.017) | 0.559(0.013) | – | 0.586(0.008) | 0.578(0.012) |
| | time | – | 335.1 | 1156.6 | 932.6 | – | 88.7 | 122.8 |

Table 6: Clustering accuracy (mean value and standard deviation) and time cost (seconds) on MNIST and Fashion-MNIST with feature extraction, and on GTSRB. "/" means the algorithm was performed on a computational platform not comparable to ours. The values reported without standard deviations are quoted from [Chen et al., 2020].
| | | LSC-K | SSC-OMP | S5C | S3COMP-C | AutoSC-GD+NSE | AutoSC-BO+NSE |
|---|---|---|---|---|---|---|---|
| MNIST | acc | 0.8659(0.0215) | 0.8159 | 0.7829(0.0283) | 0.9632 | 0.9775(0.0034) | 0.9741(0.0044) |
| | time | 273.6 | 280.6 | 907.5 | 416.8 | 59.2 | 115.3 |
| Fashion-MNIST | acc | 0.6131(0.0298) | 0.3796(0.0217) | 0.6057(0.0227) | | 0.6398(0.0133) | 0.6461(0.0104) |
| | time | 251.8 | 1013.9 | 913.2 | | 61.9 | 112.6 |
| GTSRB | acc | 0.8711(0.0510) | 0.8252 | 0.9044(0.0267) | 0.9554 | 0.9873(0.0126) | 0.9881(0.0078) |
| | time | 31.2 | / | 98.7 | / | 16.8 | 69.4 |

## 5 Conclusion

We have proposed an automated spectral clustering method. Extensive experiments showed the effectiveness and superiority of our methods over baseline methods. The efficiency improvement comes from the closed-form solutions of the least squares regressions. The accuracy improvement comes from the effectiveness of LSR and KLSR with thresholding and from the automation of model and hyperparameter selection. One limitation of this work is that we only considered automated spectral clustering, while there are many other clustering methods (e.g. [Fan, 2021]) that do not rely on an affinity matrix.

## Acknowledgments

The work of Jicong Fan was supported in part by the Youth Program 62106211 of the National Natural Science Foundation of China and the research funding T00120210002 of Shenzhen Research Institute of Big Data. The work of Yiheng Tu was supported in part by the National Natural Science Foundation of China under Grant no. 32171078. The work of Zhao Zhang was supported in part by the National Natural Science Foundation of China under Grant no. 62072151 and the Anhui Provincial Natural Science Fund for Distinguished Young Scholars (2008085J30). The work of Mingbo Zhao was supported in part by the National Natural Science Foundation of China under Grant no. 61971121. The work of Haijun Zhang was supported in part by the National Natural Science Foundation of China under Grants no. 61972112 and no. 61832004, and the Guangdong Basic and Applied Basic Research Foundation under Grant no. 2021B1515020088. The authors appreciate the reviewers' comments and time.

## References

Stephen Boyd, Neal Parikh, Eric Chu, Borja Peleato, and Jonathan Eckstein. Distributed optimization and statistical learning via the alternating direction method of multipliers. Foundations and Trends in Machine Learning, 3(1):1–122, 2011.

Joan Bruna and Stéphane Mallat. Invariant scattering convolution networks. IEEE Transactions on Pattern Analysis and Machine Intelligence, 35(8):1872–1886, 2013.

Deng Cai and Xinlei Chen. Large scale spectral clustering via landmark-based sparse representation. IEEE Transactions on Cybernetics, 45(8):1669–1680, 2014.

Jinyu Cai, Jicong Fan, Wenzhong Guo, Shiping Wang, Yunhe Zhang, and Zhao Zhang. Efficient deep embedded subspace clustering. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 1–10, June 2022.

Xinlei Chen and Deng Cai. Large scale spectral clustering with landmark-based representation. In Twenty-Fifth AAAI Conference on Artificial Intelligence, 2011.

Ying Chen, Chun-Guang Li, and Chong You. Stochastic sparse subspace clustering. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 4155–4164, 2020.

E. Elhamifar and R. Vidal. Sparse subspace clustering: Algorithm, theory, and applications. IEEE Transactions on Pattern Analysis and Machine Intelligence, 35(11):2765–2781, 2013.

Martin Ester, Hans-Peter Kriegel, Jörg Sander, Xiaowei Xu, et al. A density-based algorithm for discovering clusters in large spatial databases with noise. In KDD, volume 96, pages 226–231, 1996.

Jicong Fan and Tommy W.S. Chow.
Sparse subspace clustering for data with missing entries and high-rank matrix completion. Neural Networks, 93:36 44, 2017. Jicong Fan, Zhaoyang Tian, Mingbo Zhao, and Tommy W.S. Chow. Accelerated low-rank representation for subspace clustering and semi-supervised classification on large-scale data. Neural Networks, 100:39 48, 2018. Jicong Fan, Yuqian Zhang, and Madeleine Udell. Polynomial matrix completion for missing data imputation and transductive learning. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 34, pages 3842 3849, 2020. Jicong Fan. Large-scale subspace clustering via k-factorization. In Proceedings of the 27th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, KDD 21, page 342 352, New York, NY, USA, 2021. Association for Computing Machinery. Miroslav Fiedler. Algebraic connectivity of graphs. Czechoslovak mathematical journal, 23(2):298 305, 1973. M. Halkidi and M. Vazirgiannis. Clustering validity assessment: finding the optimal partitioning of a data set. In Proceedings 2001 IEEE International Conference on Data Mining, pages 187 194, 2001. N. Halko, P. G. Martinsson, and J. A. Tropp. Finding structure with randomness: Probabilistic algorithms for constructing approximate matrix decompositions. SIAM Review, 53(2):217 288, 2011. Harold V Henderson and Shayle R Searle. On deriving the inverse of a sum of matrices. Siam Review, 23(1):53 60, 1981. Juhua Hu, Qi Qian, Jian Pei, Rong Jin, and Shenghuo Zhu. Finding multiple stable clusterings. Knowledge and Information Systems, 51(3):991 1021, 2017. Hsin-Chien Huang, Yung-Yu Chuang, and Chu-Song Chen. Affinity aggregation for spectral clustering. In 2012 IEEE Conference on Computer Vision and Pattern Recognition, pages 773 780. IEEE, 2012. Frank Hutter, Lars Kotthoff, and Joaquin Vanschoren. Automated machine learning: methods, systems, challenges. Springer Nature, 2019. Pan Ji, Mathieu Salzmann, and Hongdong Li. Shape interaction matrix revisited and robustified: Efficient subspace clustering with corrupted and incomplete data. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), December 2015. Pan Ji, Tong Zhang, Hongdong Li, Mathieu Salzmann, and Ian Reid. Deep subspace clustering networks. In Advances in Neural Information Processing Systems, pages 24 33, 2017. Stephen C Johnson. Hierarchical clustering schemes. Psychometrika, 32(3):241 254, 1967. Donald R Jones, Matthias Schonlau, and William J Welch. Efficient global optimization of expensive black-box functions. Journal of Global optimization, 13(4):455 492, 1998. Zhao Kang, Wangtao Zhou, Zhitong Zhao, Junming Shao, Meng Han, and Zenglin Xu. Large-scale multi-view subspace clustering in linear time. In Proceedings of the AAAI conference on artificial intelligence, volume 34, pages 4412 4419, 2020. Zhao Kang, Zhiping Lin, Xiaofeng Zhu, and Wenbo Xu. Structured graph learning for scalable subspace clustering: From single view to multiview. IEEE Transactions on Cybernetics, 52(9):8976 8986, 2022. Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. ar Xiv preprint ar Xiv:1412.6980, 2014. Aaron Klein, Stefan Falkner, Simon Bartels, Philipp Hennig, and Frank Hutter. Fast bayesian optimization of machine learning hyperparameters on large datasets. In Artificial Intelligence and Statistics, pages 528 536. PMLR, 2017. Lee Kuang-Chih, J. Ho, and D. J. Kriegman. Acquiring linear subspaces for face recognition under variable lighting. 
IEEE Transactions on Pattern Analysis and Machine Intelligence, 27(5):684 698, 2005. Yann Le Cun, L eon Bottou, Yoshua Bengio, and Patrick Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278 2324, 1998. Chun-Guang Li and Rene Vidal. Structured sparse subspace clustering: A unified optimization framework. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 277 286, 2015. Chun-Guang Li and Rene Vidal. A structured sparse plus structured low-rank framework for subspace clustering and completion. IEEE Transactions on Signal Processing, 64(24):6557 6570, 2016. Jun Li and Handong Zhao. Large-scale subspace clustering by fast regression coding. In IJCAI, 2017. Jun Li, Hongfu Liu, Zhiqiang Tao, Handong Zhao, and Yun Fu. Learnable subspace clustering. IEEE Transactions on Neural Networks and Learning Systems, pages 1 15, 2020. Yanchi Liu, Zhongmou Li, Hui Xiong, Xuedong Gao, and Junjie Wu. Understanding of internal clustering validation measures. In 2010 IEEE international conference on data mining, pages 911 916. IEEE, 2010. G. Liu, Z. Lin, S. Yan, J. Sun, Y. Yu, and Y. Ma. Robust recovery of subspace structures by low-rank representation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 35(1):171 184, 2013. Can-Yi Lu, Hai Min, Zhong-Qiu Zhao, Lin Zhu, De-Shuang Huang, and Shuicheng Yan. Robust and efficient subspace segmentation via least squares regression. In European conference on computer vision, pages 347 360. Springer, 2012. Canyi Lu, Jiashi Feng, Zhouchen Lin, Tao Mei, and Shuicheng Yan. Subspace clustering by block diagonal representation. IEEE transactions on pattern analysis and machine intelligence, 41(2):487 501, 2018. Juncheng Lv, Zhao Kang, Xiao Lu, and Zenglin Xu. Pseudo-supervised deep subspace clustering. IEEE Transactions on Image Processing, 30:5252 5263, 2021. Louis Mahon and Thomas Lukasiewicz. Selective pseudo-label clustering. In German Conference on Artificial Intelligence (K unstliche Intelligenz), pages 158 178. Springer, 2021. Aleix M Mart ınez and Avinash C Kak. PCA versus LDA. IEEE transactions on pattern analysis and machine intelligence, 23(2):228 233, 2001. Bertil Mat ern. Spatial variation, volume 36. Springer Science & Business Media, 2013. Shin Matsushima and Maria Brbic. Selective sampling-based scalable sparse subspace clustering. In Advances in Neural Information Processing Systems, pages 12416 12425, 2019. Marian Meila and Susan Shortreed. Regularized spectral learning. Journal of machine learning research, 1(1):1 20, 2006. Marian Meila, Susan Shortreed, and Liang Xu. Regularized spectral learning. In AISTATS. PMLR, 2005. Marina Meila. The multicut lemma. UW Statistics Technical Report, 417, 2001. S. A. Nene, S. K. Nayar, and H. Murase. Columbia object image library (coil-20). Report, Columbia University, 1996. Andrew Y Ng, Michael I Jordan, and Yair Weiss. On spectral clustering: Analysis and an algorithm. In Advances in neural information processing systems, pages 849 856, 2002. Erlin Pan and Zhao Kang. Multi-view contrastive graph clustering. Advances in Neural Information Processing Systems, 34, 2021. Pan Ji, M. Salzmann, and Hongdong Li. Efficient dense subspace clustering. In IEEE Winter Conference on Applications of Computer Vision, pages 461 468, 2014. V. M. Patel and R. Vidal. Kernel sparse subspace clustering. In 2014 IEEE International Conference on Image Processing (ICIP), pages 2849 2853, Oct 2014. V. M. Patel, Nguyen Hien Van, and R. Vidal. 
Latent space sparse and low-rank subspace clustering. Selected Topics in Signal Processing, IEEE Journal of, 9(4):691 701, 2015. Xi Peng, Lei Zhang, and Zhang Yi. Scalable sparse subspace clustering. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 430 437, 2013. Xi Peng, Huajin Tang, Lei Zhang, Zhang Yi, and Shijie Xiao. A unified framework for representation-based subspace clustering of out-of-sample and large-scale data. IEEE transactions on neural networks and learning systems, 27(12):2499 2512, 2015. Giannis Poulakis. Unsupervised automl: a study on automated machine learning in the context of clustering. Master s thesis, Πανεπιστ ηµιo Πειραι ως, 2020. Sam T Roweis and Lawrence K Saul. Nonlinear dimensionality reduction by locally linear embedding. Science, 290(5500):2323 2326, 2000. F. S. Samaria and A. C. Harter. Parameterisation of a stochastic model for human face identification. In Proceedings of 1994 IEEE Workshop on Applications of Computer Vision, pages 138 142, Dec 1994. Jie Shen and Ping Li. Learning structured low-rank representation via matrix factorization. In Proceedings of the 19th International Conference on Artificial Intelligence and Statistics, volume 51 of Proceedings of Machine Learning Research, pages 500 509, Cadiz, Spain, 09 11 May 2016. PMLR. Jianbo Shi and Jitendra Malik. Normalized cuts and image segmentation. IEEE Transactions on pattern analysis and machine intelligence, 22(8):888 905, 2000. Jasper Snoek, Hugo Larochelle, and Ryan P Adams. Practical bayesian optimization of machine learning algorithms. Advances in neural information processing systems, 25, 2012. Mahdi Soltanolkotabi, Ehsan Elhamifar, and Emmanuel J Candes. Robust subspace clustering. The Annals of Statistics, 42(2):669 699, 2014. Sho Sonoda and Noboru Murata. Neural network with unbounded activation functions is universal approximator. Applied and Computational Harmonic Analysis, 43(2):233 268, 2017. Johannes Stallkamp, Marc Schlipsing, Jan Salmen, and Christian Igel. Man vs. computer: Benchmarking machine learning algorithms for traffic sign recognition. Neural networks, 32:323 332, 2012. Hugo Steinhaus et al. Sur la division des corps mat eriels en parties. Bull. Acad. Polon. Sci, 1(804):801, 1956. Michael E Tipping and Christopher M Bishop. Mixtures of probabilistic principal component analyzers. Neural computation, 11(2):443 482, 1999. Ulrike Von Luxburg. A tutorial on spectral clustering. Statistics and computing, 17(4):395 416, 2007. Yu-Xiang Wang and Huan Xu. Noisy sparse subspace clustering. In International Conference on Machine Learning, pages 89 97. PMLR, 2013. Shusen Wang, Bojun Tu, Congfu Xu, and Zhihua Zhang. Exact subspace clustering in linear time. In Proceedings of the Twenty-Eighth AAAI Conference on Artificial Intelligence, pages 2113 2120, 2014. Y. Weiss. Segmentation using eigenvectors: a unifying view. In Proceedings of the Seventh IEEE International Conference on Computer Vision, volume 2, pages 975 982 vol.2, 1999. Christopher K Williams and Carl Edward Rasmussen. Gaussian processes for machine learning, volume 2. MIT press Cambridge, MA, 2006. Han Xiao, Kashif Rasul, and Roland Vollgraf. Fashion-mnist: a novel image dataset for benchmarking machine learning algorithms, 2017. Chong You, Chun-Guang Li, Daniel P Robinson, and Ren e Vidal. Oracle based active set algorithm for scalable elastic net subspace clustering. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 3928 3937, 2016. 
Chong You, Daniel Robinson, and René Vidal. Scalable sparse subspace clustering by orthogonal matching pursuit. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3918–3927, 2016.

Chong You, Chi Li, Daniel P Robinson, and René Vidal. Scalable exemplar-based subspace clustering on class-imbalanced data. In Proceedings of the European Conference on Computer Vision (ECCV), pages 67–83, 2018.

Junjian Zhang, Chun-Guang Li, Chong You, Xianbiao Qi, Honggang Zhang, Jun Guo, and Zhouchen Lin. Self-supervised convolutional subspace clustering network. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 5473–5482, 2019.

Tong Zhang, Pan Ji, Mehrtash Harandi, Wenbing Huang, and Hongdong Li. Neural collaborative subspace clustering. In International Conference on Machine Learning, pages 7384–7393. PMLR, 2019.

## Checklist

1. For all authors...
   (a) Do the main claims made in the abstract and introduction accurately reflect the paper's contributions and scope? [Yes]
   (b) Did you describe the limitations of your work? [N/A]
   (c) Did you discuss any potential negative societal impacts of your work? [N/A]
   (d) Have you read the ethics review guidelines and ensured that your paper conforms to them? [Yes]
2. If you are including theoretical results...
   (a) Did you state the full set of assumptions of all theoretical results? [Yes]
   (b) Did you include complete proofs of all theoretical results? [Yes]
3. If you ran experiments...
   (a) Did you include the code, data, and instructions needed to reproduce the main experimental results (either in the supplemental material or as a URL)? [Yes]
   (b) Did you specify all the training details (e.g., data splits, hyperparameters, how they were chosen)? [Yes]
   (c) Did you report error bars (e.g., with respect to the random seed after running experiments multiple times)? [Yes]
   (d) Did you include the total amount of compute and the type of resources used (e.g., type of GPUs, internal cluster, or cloud provider)? [Yes]
4. If you are using existing assets (e.g., code, data, models) or curating/releasing new assets...
   (a) If your work uses existing assets, did you cite the creators? [Yes]
   (b) Did you mention the license of the assets? [N/A]
   (c) Did you include any new assets either in the supplemental material or as a URL? [N/A]
   (d) Did you discuss whether and how consent was obtained from people whose data you're using/curating? [N/A]
   (e) Did you discuss whether the data you are using/curating contains personally identifiable information or offensive content? [N/A]
5. If you used crowdsourcing or conducted research with human subjects...
   (a) Did you include the full text of instructions given to participants and screenshots, if applicable? [N/A]
   (b) Did you describe any potential participant risks, with links to Institutional Review Board (IRB) approvals, if applicable? [N/A]
   (c) Did you include the estimated hourly wage paid to participants and the total amount spent on participant compensation? [N/A]