# subspace_clustering_through_subclusters__b879c7de.pdf

Journal of Machine Learning Research 22 (2021) 1-37 Submitted 11/18; Revised 1/21; Published 2/21

Subspace Clustering through Sub-Clusters

Weiwei Li WEIWEILI@LIVE.UNC.EDU Department of Statistics and Operations Research University of North Carolina at Chapel Hill Chapel Hill, NC 27514, USA

Jan Hannig JAN.HANNIG@UNC.EDU Department of Statistics and Operations Research University of North Carolina at Chapel Hill Chapel Hill, NC 27514, USA

Sayan Mukherjee SAYAN@STAT.DUKE.EDU Department of Statistical Science Mathematics, Computer Science, Biostatistics & Bioinformatics Duke University Durham, NC 27708, USA

Editor: David Blei

The problem of dimension reduction is of increasing importance in modern data analysis. In this paper, we consider modeling the collection of points in a high dimensional space as a union of low dimensional subspaces. In particular we propose a highly scalable sampling based algorithm that clusters the entire data via ﬁrst spectral clustering of a small random sample followed by classifying or labeling the remaining out-of-sample points. The key idea is that this random subset borrows information across the entire dataset and that the problem of clustering points can be replaced with the more eﬃcient problem of clustering sub-clusters . We provide theoretical guarantees for our procedure. The numerical results indicate that for large datasets the proposed algorithm outperforms other state-of-the-art subspace clustering algorithms with respect to accuracy and speed. Keywords: dimension reduction, subspace clustering, sub-cluster, random sampling, scalability, handwritten digits, spectral clustering

1. Introduction

In data analysis, researchers are often given datasets with large volume and high dimensionality. To reduce the computational complexity arising in these settings, researchers resort to dimension reduction techniques. To this end, traditional methods like PCA Hotelling (1933) use few principal components to represent the original dataset; factor analysis (Cattell, 1952) seeks to get linear combinations of latent factors; subsequent works of PCA include kernel PCA (Sch olkopf et al., 1998), generalized PCA (Vidal et al., 2005); manifold learning (Belkin and Niyogi, 2003) assumes data points collected from a high dimensional ambient space lie around a low dimensional manifold, and muli-manifold learning (Liu et al.,

c 2021 Weiwei Li, Jan Hannig, Sayan Mukherjee.

License: CC-BY 4.0, see https://creativecommons.org/licenses/by/4.0/. Attribution requirements are provided at http://jmlr.org/papers/v22/18-780.html.

Li, Hannig, Mukherjee

2011) considers the setting of a mixture of manifolds. In this paper, we focus on one of the simplest manifold, a subspace, and consider the subspace clustering problem. Speciﬁcally, we approximate the original dataset as an union of subspaces. Representing the data as a union of subspaces allows for more computationally eﬃcient downstream analysis on various problems such as motion segmentation (Elhamifar and Vidal, 2009), handwritten digits recognition (You et al., 2016a), and image compression (Hong et al., 2006).

1.1 Related Work

Many techniques have been developed for subspace clustering, see Elhamifar and Vidal (2013) for a review. The mainstream methods usually include two phases: (1) calculating the aﬃnity matrix; (2) applying spectral clustering (Ng et al., 2002) to the aﬃnity matrix to compute a label for each data point. For phase (1), the property of self-representation is often used to calculate the aﬃnity matrix: self-representation states that a point can be represented by a linear combination of other points in the same subspace. Speciﬁcally, Elhamifar and Vidal (2009) proposed the sparse subspace clustering (SSC) algorithm which solves the lasso minimization problem N times, where N is the total number of data points, the theoretical property of SSC was further studied in Soltanolkotabi et al. (2012). Similarly, Rahmani and Atia (2017) proposed the direction search algorithm (DSC) which uses ℓ1 minimization to ﬁnd the optimal direction for each data point, these directions are then used to cluster the data points. One of the main drawbacks of SSC and DSC is their computational complexity of O(N2) in both time and space, which limits its application to large datasets. To address this limitation, a variety of methods have been proposed to avoid solving complicated optimization problems in constructing the aﬃnity matrix. Heckel and B olcskei (2015) used inner products with thresholding (TSC) to calculate the aﬃnity between each pair of points, Park et al. (2014) used a greedy algorithm to ﬁnd for each point the linear space spanned by its neighbors, similarly Dyer et al. (2013) and You et al. (2016c) used orthogonal matching pursuit (OMP), You et al. (2016b) used elastic the net for subspace clustering (ENSC) and proposed an eﬃcient solver by active set method. However, these methods require running spectral clustering on the full N N aﬃnity matrix. A Bayesian mixture model was proposed for subspace clustering in Thomas et al. (2014), but its parameter inference is not scalable to large dataset. Zhou et al. (2018) used a deep learning based method which does not have theoretical guarantee.

Recently, there have been two methods that increase the scalability of sparse subspace clustering. The SSSC algorithm and its varieties (Peng et al., 2015) clusters a random subset of the whole dataset and then uses this clustering to classify or label the out-of-sample data points. This method scales well when the random subset is small, however a great deal of information is discarded as only the information in the subset is used. In You et al. (2016a) a divide and conquer strategy is used for SSC the dataset is split into several small subsets on which SSC is run, and clustering results are merged. This method cannot reduce the computational complexity of the SSC by an order of magnitude so is limited in its ability to scale to large dataset.

Subspace Clustering through Sub-Clusters

1.2 Contribution

In this paper, we propose a novel, eﬃcient sampling based algorithm with provable guarantees that extends the ideas in previous scalable methods (Peng et al., 2015; You et al., 2016a). The motivation for using sampling based algorithm is twofold. From the theoretical perspective, Luxburg et al. (2005) showed that under certain assumptions, the spectral clustering results on the sampled subset will converge to the results on the whole dataset. This gives us the insight that as the size of the sampled subset increases properly, the subset becomes almost as informative as the whole dataset. From the computational perspective, traditional spectral clustering based algorithms need to build a neighborhood for each data point. Thus the complexity (both in time and memory) is usually at least quadratic in the total number of data points, while sampling based algorithms need to ﬁnd neighboring points only within the subset. This greatly reduces the computational resources needed incurring some loss of information. Our algorithm seeks to combine strengths from both approaches. In particular, for each point in the subset we ﬁnd its nearest neighbors in the complete dataset and use these points to construct a sub-cluster, these sub-clusters contain information from the entire dataset and not just the random sample. Finding neighboring points among the whole dataset makes it possible to get a neighborhood with big enough size and few false connections for each sampled point. The aﬃnity matrix for the subset is then constructed from these sub-clusters. The idea is that we change the problem from clustering of data points to clustering of sub-clusters , which integrates information across the dataset and should deliver better clustering results. We provide theoretical guarantees for our procedure in Section 3. The analysis reveals that under mild conditions, the subspaces can share arbitrarily many intersections as long as most of their principal angles are larger than a certain threshold. While our algorithm for ﬁnding neighboring points is similar to that of Heckel and B olcskei (2015), the data generation model and assumptions underlying our theorems are diﬀerent we take into account the fact that after normalization the noisy terms will no longer follow a multivariate Gaussian distribution. While our work is originally designed for linear subspace clustering problems. The idea of clustering through sub-clusters can be easily extended to general clustering problems. Finally, we study empirical properties of the proposed algorithm on both synthetic and real-world datasets selected to have diverse sizes. We show that the clustering through sub-clusters algorithm is highly scalable and can signiﬁcantly boost the clustering accuracy on both the subset and whole dataset. The advantage of our algorithm over other state-ofthe-art algorithms changes from marginal to signiﬁcant as the size of the dataset increases.

1.3 Paper Organization

The rest of this paper is organized as follows: in Section 2, we describe the implementation of our clustering procedure, in Section 3 we state the model setting and theoretical guarantees for our procedure and explain in some details the geometric and distributional intuitions underlying our procedure. The detailed proofs can be found in Appendix C. In Section 4 we present numerical experiments and compare our method with other state-of-the-art methods, a comprehensive report of the numerical results can be found in Appendix E.

Li, Hannig, Mukherjee

1.4 Notation

Unless speciﬁed otherwise, we use capital bold letter to denote data matrix, and corresponding lower bold letter to denote the columns of it. In this paper, we are given a dataset Y with N data points in RD. We use both yi and [Y]i to denoted the i-th column of Y, and Y i is the matrix Y with the i-th column removed. Similarly, we write y j as vector y with the j-th entry removed. The ij-th entry of a matrix Y is denoted as [Y]ij. The complement of event E is denoted by E . The cardinality of E is denoted by card(E), and the mode of E is mode(E). We use subscript with parenthesis to represent the order statistics of entries in a vector, for example a(i) is the i-th smallest entry in vector a, while without ambiguity both a(i) and ai refer to the i-th element of vector a. The unit sphere in Rd is denoted by Sd 1. We assume each data point of Y lies on one of K linear subspaces denoted by {Sk}K k=1. Here K is a known constant and Sk is the k-th linear subspace. The subspace clustering problem aims assigning to each point in Y the membership to a subspace (cluster) Sk. We write dk as the dimension of subspace Sk and Uk RD dk as its corresponding orthogonal base. The number of points belong to cluster Sk is Nk. We use yi(k) RD

to represent the i-th point from the k-th cluster, the set {y1(k), ..., y Nk (k)} contains all points that belong to Sk. Finally, we write Fm,n as the F-distribution with parameters (m, n), Dir(α) as the Dirichlet distribution with parameter vector α, β(a, b) as the beta distribution with parameters (a, b), N(µ, σ) as the Gaussian distribution with mean (vector) µ and variance (covariance matrix) σ2, χ2 d as the chi-square distribution with d degrees of freedom, and U(Sd 1) as the uniform distribution on the surface of unit sphere Sd 1.

2. The Algorithm for Sampling Based Subspace Clustering

In this section, we introduce our sampling based algorithm for subspace clustering (SBSC). In Appendix A we will discuss issues regarding hyper-parameters. Throughout this section, we assume the columns of Y have unit ℓ2 norm. Our main algorithm takes the raw dataset Y that has N observations and several parameters as inputs and outputs the clustering assignment for each point in the dataset, it proceeds in two stages (see the matched steps in Algorithm 1 for further details):

Stage 1: In-sample clustering

1. Draw a subset Y of n N points.

2. For each point yi Y, ﬁnd its (dmax + 1) nearest neighbor points in Y and use Ci to denote the index set of these points. We call YCi the sub-cluster of yi.

3. Compute the aﬃnity matrix D where each element [D]ij is the similarity calculated between YCi and YCj.

4. Sparsify the aﬃnity matrix by removing possible spurious connections.

5. Conduct spectral clustering on Y with the sparsiﬁed aﬃnity matrix.

Stage 2: Out-of-sample classiﬁcation

6. Fit a classiﬁer to the clustered points in Y and classify the points in Y \ Y.

Subspace Clustering through Sub-Clusters

In Stage 1, we formulated n sub-clusters out of the sampled dataset. These n subclusters are then further grouped into K clusters. An algorithm based on a similar idea was proposed for Gaussian mixture models by Aragam et al. (2020), they ﬁrst divide the dataset into L( K) mixtures, and then cluster these mixtures into K mixtures. In our paper the linear structure of subspaces is central to clustering while in their paper the distributional properties of mixture models play the key role in grouping, this is a signiﬁcant diﬀerence. On a high level, we transfer the problem from clustering points to clustering sub-clusters . Step 2 computes the neighborhood of points around each sampled points by thresholding a similarity score based on inner products, this method was also used in Heckel and B olcskei (2015). The intuitive reason for this step is that for normalized data, two vectors are more likely to lie in the same linear subspace if the absolute magnitude of the inner product between the points is large. One may use other measure of similarities in Step 2 to ﬁnd the neighboring points. In addition to the standard algorithm, we also present experimental results based on other similarity measures in Section 4 and Appendix E. The idea of using distance between the sub-clusters to construct an aﬃnity matrix in Step 3 relies on the self-representative property of linear subspaces see Theorem 2 for technical details. Please note that each entry of the aﬃnity matrix measures the closeness between data points, hence it is a decreasing function of distance. There is both theoretical and empirical evidence that sparsiﬁcation of an aﬃnity matrix by setting smaller elements to zero improves clustering results (Belkin and Niyogi, 2003; Von Luxburg, 2007). For this reason in Step 4 we threshold the aﬃnity matrix. In Stage 2, the remaining points are labeled via a classiﬁer where a regression model is ﬁtted on the clustered data, speciﬁcally a residual minimization ridge regression procedure. If both n, dmax and D are linear in log N, the complexity of our algorithm is O(N log N). Note that any classiﬁer can be used to do the out-of-sample classiﬁcation. While ridge regression worked well for linear subspace clustering problems in this paper, we encourage users of Algorithm 1 to choose their own favorite classiﬁer, e.g., svm, random forest, or even deep neural networks, based on their understandings of the data.

3. Clustering Accuracy

3.1 Model Speciﬁcation

We assume all subspaces have the same dimension d and the data generating process is

ˆyi(k) = ζ(k) i Uka(k) i + ˆe(k) i , i = 1, . . . , Nk, k = 1, ..., K,

where a(k) i Rd is sampled from the uniform distribution on the surface of Sd 1, ζ(k) i is a random scalar such that ζ(k)2 i χ2 d, and ˆe(k) i N(0, dσ2ID). However ˆyi(k) are unobserved and we only observe the normalized version y(k) i = ˆyi(k)/ ˆyi(k) 2. We then have

y(k) i = Uka(k) i + σe(k) i Uka(k) i + σe(k) i 2

Consequently, each entry in e(k) i = ˆe(k) i /(σζ(k) i ) follows multivariate t-distribution with

d degrees of freedom, and e(k) i 2

2 2 2/D FD,d. Numerically, the normalizing constant

Li, Hannig, Mukherjee

input : Data Y, number of subspaces K, sampling size n, neighbor threshold dmax, regularization parameters λ1 and λ2, residual minimization parameter m, aﬃnity threshold tmax. output: The label vector ℓof all points in Y 1. Uniformly sample n points Y from Y. 2. Construct the sub-clusters: for i = 1 to n do

p = | yi, Y |; Ci := {j : | yi, yj | p(N dmax)}. end

3. Construct aﬃnity matrix [D]ij = e d(YCi,YCj )/2 for i = j {1, ..., n} and

d(YCi, YCj) = ||YCi YCj(YT Cj YCj + λ1I) 1YT Cj YCi||F

+||YCj YCi(YT Ci YCi + λ1I) 1YT Ci YCj||F .

4. Sparsify the adjacency matrix: for j = 1 to n do

v := [D]j; for i = 1 to n do

if [D]ij v(n dmax) then

[D]ij := 0 end end end 5. Cluster Y : set D := D + DT and cluster the in-sample points in Y by applying spectral clustering on D, use ℓin to denote the labels of Y. 6. Label the remaining points: use the Residual Minimization by Ridge Regression (RMRR) algorithm in Appendix D to classify the remaining points in Y \ Y, speciﬁcally for the out-of-sample label we have

ℓout = RMRR(Y \ Y, Y, ℓin, λ2, m)

7. Combine ℓin and ℓout to get ℓ, the label of the whole dataset Y. Algorithm 1: Sub-cluster Based Subspace Clustering (SBSC) algorithm.

Uka(k) i + σe(k) i 2 will be approximately 1. In Heckel and B olcskei (2015), the normalizing constants are treated directly as 1 and their noise vector is a multivariate Gaussian vector. In developing theoretical guarantees of this paper, we explicitly account for the normalizing constant Uka(k) i + σe(k) i 2 and its eﬀects. Let λ(ij) 1 λ(ij) 2 ... λ(ij) d correspond to the cosine values of principal angles between Si and Sj, hence λ(ij) 1 1 and λ(ij) d 0. Note that λ(ij) k = λ(ji) k for 1 k d and 1 i < j K. For each subspace Sk, we deﬁne the uniformly maximal aﬃnity vector to quantify its closeness with respect to all other subspaces.

Subspace Clustering through Sub-Clusters

Deﬁnition 1 For each subspace Sk, its uniformly maximal aﬃnity vector with respect to the other subspaces is [λ(k) 1 , ..., λ(k) d ] with

λ(k) i = max j =k λ(kj) i .

Deﬁnition 2 The sub-cluster preserving property holds for an algorithm if each sub-cluster output contains only points from the same subspace.

If the uniformly maximal aﬃnity vectors have small entries, corresponding to large angles, we would expect that Algorithm 1 (SBSC) satisﬁes the sub-cluster preserving property. In constructing the aﬃnity matrix D, we want the following property: two sub-clusters that belong to the same subspace have bigger aﬃnities, hence smaller distances, than subclusters that belong to diﬀerent subspaces.

Deﬁnition 3 We say YCi has the correct neighborhood property with distance function d( , ) if

d(YCi, YCj) < d(YCi, YCk),

for any 1 j = k n such that YCi and YCj belong to the same subspace, and YCk belongs to a diﬀerent subspace than YCi.

3.2 Theoretical Properties of SBSC

In this section we provide a theoretical analysis of Algorithm 1 providing conditions under which we have provable guarantees.

3.2.1 Assumptions

In this section, we list the assumptions used in the lemmas and theorems. On a high level, the theoretical properties developed in this paper require two groups of conditions. First, the subspaces need to be separated, this is Assumption A2. Second, the sub-clusters should be informative and carry enough information about the subspaces they belong to, this is A3. Notice both A2 and A3 include A1. Assumption A4 subsumes assumptions A1-A3 and adds slightly stronger conditions on amount of noise, this allows us to prove Theorem 2. The correct neighborhood property for sub-clusters stated in Theorem 2 is to the best of our knowledge novel for subspace clustering. An explanation of each assumption is provided at the end of this section.

A1. There exist positive constants Tl and ρ such that

T 2 l min k=1,...,K Q1 dmax

N1 ρ k , (2)

where Qp denotes the p quantile of β(1

Li, Hannig, Mukherjee

A2. There exist positive constants {gi}2 i=1, η and ρ (0, 1), such that if we write T = 4g2+2g2 2 1 g2 + 1+g2

1 g2 g1, the following inequalities hold: (2) with Tl replaced by T, and

g2 1 λ(k)2 i 2

g2 1 λ(k)2 i 2

g2 1 λ(k)2 i

g2 1 λ(k)2 i

g2 2 Dσ2 > 3 + 6

η, d log N (2 + 2η)2. (4)

A3. There exist positive constants Tl, q0, ρ and t such that the following inequalities hold: (2), dmax > d and (T 2 l dmax C2)C2 C2 1 4 T 2 l dmax q0, (5)

2dmax π(d 1) 2 t

A4. There exist positive constants Tl, g, λ, η, q0, ρ and t such that the following inequalities hold: (2), (3) with g1 replaced by Tl, (4) with g2 replaced by g, (5) and

f(d) := (2g g2) (dmax + 1)

f(d) dmax + 1

d(1 + g)4λ2

q2 0 + D d λ(1 + g)2p

d(dmax + 1) q0(1 g) ,

d(1 + g)4λ2

q2 0 + D d λ(1 + g)2p

d(dmax + 1) q0(1 g) ,

6λ(1 + g)2p

d(dmax + 1) q0(1 g) q

Assumption A1: The inner product is used to measure the distance between data points giving rise to the order statistics of a Beta distribution which is bounded in Assumption A1. The lower bound of the order statistics is used to control the separation between diﬀerent subspaces. The upper bound controls the information carried in each sub-cluster. Mathematically, it implicitly controls the ratio between d and log N. If we write Nk = 10000, N = 10Nk, dmax = 3d, T 2 l = 0.09, and ρ = 0.01, then it suﬃces to have d log Nk 5 for inequality (2). Please see the related derivation in Appendix C. Assumption A2: This is the subspace separation assumption. We use it for the proof of

Theorem 1. In Appendix C, we show that SBSC requires most of {λ(k) i } d, K

i=1,k=1 to be smaller than g1. This means large g1 implies an easier clustering problem for SBSC, and vice versa. Throughout this paper we call g1 the aﬃnity threshold. Note that T is a upper bound

Subspace Clustering through Sub-Clusters

of the aﬃnity threshold g1, speciﬁcally if there was no noise T = g1. From (2) we know that large d log N implies a small T and g1. Therefore, large d makes the clustering problem harder. This agrees with our intuition. Consider the extreme case where the subspaces are orthogonal, λ(k) i = 0 (i = 1, . . . , d, k = 1, . . . , K), and Equations (3) are naturally true with any positive constant g1. Finally, the constant g2 in (4) controls the noise term. From the ﬁrst condition in (4) we have σ < g2

D. Assumption A3: This guarantees the sub-clusters {YCi}n i=1 are informative. We use it mainly for the proof of Lemma 5. Here C1 and C2 are closely related to the permeance statistics (Lerman et al., 2012), which measures how well a set of vectors is scattered across a space. Therefore a large dmax

d implies that these vectors are well scattered. If T 2 l equals to its upper bound in (2), ρ = 0.01, Nk = 100000, N = 10Nk, dmax = 160d, t = 0.05 and we want q0 0.5, A3 requires d log N 5 1

Assumption A4: Combines all previous assumptions, with slightly stronger conditions on subspace similarities and noise level; we use it for the proof of Theorem 2. The ﬁrst three conditions in (6) essentially control the value of g, which in turn controls the magnitude of the norm of noise terms. The last condition in (6) controls the value of the regularization parameter in a distance function that will be deﬁned latter.

3.2.2 Theoretical Properties of SBSC

Two theorems regarding Stage 1 of SBSC are stated and discussed in this section. Theorem 1 states a lower bound on the probability that SBSC satisﬁes the sub-cluster preserving property. Theorem 2 proves that SBSC has the correct-neighborhood property with high probability. Detailed proofs can be found in Appendix C.

Theorem 1 Under Assumption A2, SBSC has sub-cluster preserving property with probability at least

nk(Nk dmax) dmax(Nk + 1)(Nρ k 1)2 2(K 1)ne ϵ2 1 2N

1+ η 2+η 2 , (7)

Pd i=1 g2 1 λ(k)2 i

+ Pd i=1 g2 1 λ(k)2 i

2 r Pd i=1 g2 1 λ(k)2 i 2

4 Pd i=1 g2 1 λ(k)2 i 2

+ + 2 Pd i=1 g2 1 λ(k)2 i

If the subspaces are orthogonal with each other, i.e. {λ(k) i }d, K i=1,k=1 = 0. Equation (8) reduces to

1. In this example, dmax

d is fairly large. In the numerical section we found it is usually not necessary to choose large dmax. A better bound in Corollary 3 might be helpful the bridge the gap between numerical experiments and theoretical guarantee.

Li, Hannig, Mukherjee

This shows ϵ1 is linear in

d and monotonically increasing in g2 1. Appendix F establishes general conditions on g1 and {λ(k) i }d, K i=1,k=1 under which ϵ1 grows like

d. Combining this with Assumption A2, we observe that the third term of (7) is small for large N. When the number of data points in the sub-sample n is growing linearly in the log of the number of data points in the entire dataset log(N) (see Section 4.1.2), the second and fourth terms in Equation (7) go to 0 as N increase. This means (7) is close to 1 for large N. Next, we use the sub-cluster preserving property established in Theorem 1 to prove the theoretical guarantee for correct neighborhood property (see Deﬁnition 3). We use the distance function proposed in You et al. (2016a)

d(YCi, YCj) = YCi YCj(YT Cj YCj + λI) 1YT Cj YCi F +

YCj YCi(YT Ci YCi + λI) 1YT Ci YCj F , (9)

where λ > 0 is a regularization parameter. This distance function is used to decide if two sets of points belong to the same cluster. Intuitively, the ﬁrst term in Equation (9) computes the ℓ2 norm of the residuals from a ridge regression where YCi is the response and YCj is the design matrix. The second term exchanges the roles of YCi and YCj.

Theorem 2 Assume sub-cluster preserving property is true for SBSC with probability at least 1 ps, and Assumption A4 is satisﬁed. Then {YCi}n i=1 have the correct neighborhood property with the distance function (9) with probability at least

1 4n(n 1)e ϵ2 1 2n Nt2/2

Nk dmax dmax(Nk + 1)(Nρ k 1)2 + 2(Nk 1)e ϵ2 2

1+ η 2+η 2 ps, (10)

where ϵ1 is deﬁned in (8) with g1 replaced by Tl and

d 1 1 2 + 1

d 1+1 . (11)

Similarly as before, one can show that Equation (10) goes to 1 with large N. Speciﬁcally, from Assumption A2 we know the term nk(Nk 1)e ϵ2 2 grows linearly in nk

4. Experimental Results

In this section, we test the performance of SBSC on both synthetic and benchmark datasets. In addition to Algorithm 1, we also consider two modiﬁcations of the SBSC algorithm. The SBSC-DSC algorithm uses the optimal direction search algorithm (Rahmani and Atia, 2017) instead of correlations to ﬁnd neighboring points in Step 2 of Algorithm 1. The SBSC-SSC algorithm uses lasso minimization in Step 2 of Algorithm 1; its numerical results are reported in Appendix E. The performance of the three versions of SBSC is compared to other state-of-the art algorithms. These include classic subspace clustering method: Sparse Subspace Clustering

Subspace Clustering through Sub-Clusters

(SSC, Elhamifar and Vidal, 2009; You et al., 2016a), Thresholding Subspace Clustering (TSC, Heckel and B olcskei, 2015), Direction Search Subspace Clustering (DSC, Rahmani and Atia, 2017), Least Square Regression (LSR, Lu et al., 2012), Low-Rank Representation (LRR, Liu et al., 2010), Subspace Clustering by Orthogonal Matching Pursuit (SSC-OMP, You et al., 2016c), Elastic Net Subspace Clustering (ENSC, You et al., 2016b); and sampling based algorithms (Peng et al., 2013): Scalable Sparse Subspace Clustering (SSSC, reported in Appendix E), Scalable Thresholding Subspace Clustering (STSC), Scalable Direction Search (SDSC), Scalable Least Square Regression (SLSR), Scalable Low-Rank Representation (SLRR). To make fair comparisons, where possible we replicated results on our machine. Some results are copied from the original paper due to the unavailability of code.

Throughout this section, we use clustering accuracy, normalized mutual information (NMI) and running time as the metrics for performance evaluation. The formulas for clustering accuracy and NMI are presented in Appendix B. To demonstrate the advantage of using sub-clusters (i.e. borrowing information from the whole dataset) to cluster the data points in the subset. For sampling based algorithms we also report their clustering accuracy on the subset. In the rest of this paper, we call the clustering accuracy on the whole dataset as accuracy, and the clustering accuracy on the subset as accuracy-sub. For randomized algorithms, reported results are averaged over 10 trials. Additional numerical results are presented in Appendix E. The code used to generate these results can be found in the supplementary material.

4.1 Results on Synthetic Dataset

In this section we evaluate tolerance to noise and scalability on synthetic data generated using the model speciﬁed in Section 3.1.

4.1.1 Tolerance to Noise

In this section, we test the tolerance to noise of the various algorithms. From (1) we know the un-normalized signal part Uka(k) i has unit ℓ2 norm, and the expected squared norm of the noise is σ2E[ e(k) i 2 2] = σ2Dd/(d 2). Therefore throughout this paper we deﬁne (d 2)/(σ2Dd) as the signal strength (signal to noise ratio). The noise captured the amount of variation of points in RD.

We change the signal strength from 10 to 2. For each value of signal strength, we simulate 10 datasets with K = 20 subspaces, where each subspace contains Ni = 10000 data points. For all the sampling based algorithms we ﬁxed the sampling size as n = 200.

The results are presented in Figure 1. Accuracy and accuracy-sub are plotted on the left and right hand side panels respectively. The small discrepancy between two sides shows both sampling based algorithms can deliver consistent results between in-sample clustering and out-of-sample classiﬁcation. At the same time, the SBSC based algorithms constantly deliver much higher accuracy-sub than the other sampling based algorithms, this means for the synthetic datasets, borrowing information from the whole dataset signiﬁcantly enhanced the clustering results for subset.

Li, Hannig, Mukherjee

Figure 1: Tolerance to Noise: Plot of accuracy (left panel) and accuracy on subsets (right panel) for algorithms applied to synthetic datasets of Section 4.1.1. The x-axis is the signal strength and the y-axis is the accuracy averaged over 10 diﬀerent datasets. SBSC performs very well.

4.1.2 Scalability

In this section, we test the scalability of SBSC. Speciﬁcally, we randomly generate K = 20 subspaces in an ambient space with dimension D = 30, each of the subspaces has dimension d = 5. We increase Nk from 100 to 51200, so the corresponding N increases from 2000 to 1024000. The sampling size n is 2K log(N) . The result is presented in Figure 2. On the right hand side y-axis, we show the average accuracy, which is around 95% across all experiments, against the number of data points N, this could justify our choice of n. On the left hand side y-axis, we show the scale plot between running time and N, the linear pattern here agrees with our complexity analysis. As we increase the number of data points N, the accuracy on the whole dataset slightly gets higher, this implies our algorithm is particularly useful for large datasets.

4.2 Results on Benchmark Datasets

In this section, we test SBSC on three benchmark datasets. These datasets were selected to have small, medium and large data size respectively. As expected, the advantage of SBSC over other state-of-the-art algorithms changes from marginal to signiﬁcant as the size of the dataset increases.

4.2.1 The Extended Yale B dataset

The Extended Yale B dataset (Yale B) contains N = 2432 face images of K = 38 individuals. Each image is a front view photo of the corresponding individual with diﬀerent illumination

Subspace Clustering through Sub-Clusters

Figure 2: Scalability: Plot of running time (blue curve) and accuracy (red curve), averaged over 10 independent datasets, versus the number of data points for SBSC applied to the data of Section 4.1.2. Shows that the algorithm scales well.

condition. To speed up the running time, a dimension reduction step is taken to pre-process the dataset (Rahmani and Atia, 2017), hence in our experiment D = 500. From Table 1, we see that the DSC algorithm delivered the highest accuracy and NMI. As expected, for this small dataset sampling based algorithms did not perform as well. The reason is, there is not enough observations to form suﬃciently large homogeneous sub-clusters. This issue is worse for datasets with large number of clusters.

4.2.2 The Zipcode dataset

The Zipcode dataset (Le Cun et al., 1990) is a medium-size dataset with N = 9298 data points and D = 256, each point represents an image of handwritten digit, hence K = 10. From Table 2 we see, that SBSC delivers the best results in all metrics except running time. However the diﬀerences in running time are marginal for sampling based algorithms. The accuracy-sub of SBSC is again better than that of traditional sampling based algorithms (see SBSC versus STSC, and SBSC-DSC versus SDSC).

4.2.3 The MNIST dataset

The MNIST dataset (MNIST) contains N = 70000 data points, each point represents an image of handwritten digit. The original data was transferred into R500 by convolutional neural network and PCA (You et al., 2016c). Again K = 10. From Table 3 we see, that SBSC with bagging described in Appendix A.2 dominates in nearly every aspect. The large data size of MNIST makes the sampling based algorithms

Li, Hannig, Mukherjee

Method Accuracy (%) Accuracy-Sub (%) NMI (%) Runtime (sec.)

SBSC 26.53 (1.8)

31.56 (1.94)

41.67 (1.44) 27

SBSC-DSC 60.46 (1.62)

62.76 (1.88)

70.15 (0.97) 35

STSC 17.45 (1.27)

21.72 (1.7)

29.17 (1.28) 0.8

SDSC 52.18 (1.96)

60.78 (2.04)

54.61 (1.85) 3

SLRR 18.35 (0.64)

28.6 (1.64)

26.33 (0.57) 7

SLSR 26.48 (1.98)

37.78 (2.33)

35.21 (2.22) 1.5

TSC 26.19 NA 39.31 1 DSC 91.69 NA 93.43 45 SSC 52.96 NA 60.15 169 SSC-OMP 73.88 NA 80.1 1.51 SSC-ENSC 60.81 NA 69.4 3

Table 1: Performance on Extended Yale B: The results of sampling based algorithms are averaged over 10 independent runs and the corresponding standard deviations are presented in parentheses. The remaining algorithms do not use random subsampling and were run only once. The metric reported as NA is not deﬁned for these algorithms. The best result of each performance metric is in bold. DSC delivers the highest accuracy and NMI.

run much faster than traditional methods. Due to their slow speed we did not use bagging on non-sampling algorithms.

5. Conclusion and Future Research

While the idea of subsampling was discussed before (Peng et al., 2015), the main contribution of this paper is ﬁnding neighborhood points in the whole dataset and using cluster-wise distance to cluster points in the subset. This results in a higher clustering accuracy. In calculating cluster-wise distances and classifying out-of-sample points, ridge regression seems to be the most direct method. However, the algorithm itself is highly ﬂexible. Users are encouraged to try diﬀerent distance functions, classiﬁcation methods, and metrics in ﬁnding neighboring points. We propose the following directions for future research:

1. In this paper we select the subsamples that are used for initial clustering uniformly at random. It would be interesting to investigate if selecting these points using a more deterministic method such as Coresets (Agarwal et al., 2005) or a quasi-random method such as a Langevin based method (Roberts et al., 1996) could improve the performance of the algorithm.

Subspace Clustering through Sub-Clusters

Method Accuracy (%) Accuracy-Sub (%) NMI (%) Runtime (sec.)

SBSC 69.4 (5.17)

72.04 (5.37)

70.3 (1.75) 10

SBSC-DSC 60.84 (2.87)

64.92 (3.37)

62.92 (0.65) 71

STSC 55.28 (4.25)

60.86 (3.8)

53.1 (2.49) 2

SDSC 45.62 (6.43)

51.16 (7.31)

45.99 (3.88) 3

SLRR 63.21 (3.96)

65.16 (4.03)

66.09 (1.39) 10

SLSR 58.66 (0.99)

59.85 (0.98)

62.54 (1.38) 4

TSC 65.73 NA 78.97 115 DSC 60.92 NA 68.43 800 SSC 48.16 NA 52.37 2165 SSC-OMP 23.58 NA 25.51 3 SSC-ENSC 44.65 NA 50.08 36

Table 2: Performance on Zipcode: See the caption of Table 1 for description. We see SBSC delivers the highest accuracy and accuracy-sub, while TSC delivers the highest NMI.

2. While the correct neighborhood property developed in Theorem 2 assures sub-clusters from the same subspaces are close to each other. It would be interesting to further explore, from the theoretical perspective, the impacts of this property on clustering accuracy. For example, it would be interesting to explore the relationship between the identiﬁability discussed in Aragam et al. (2020) and the correct neighborhood property in this paper.

3. It would be interesting to extend our algorithm to non-linear manifold clustering problems. For example, one could project the dataset into a RKHS and apply our algorithm on the projected features.

Appendix A. Practical Recommendations for Parameter Setting

In Algorithm 1 (SBSC), we assume the number of clusters is known. Several methods have been developed for the estimation of the number of clusters from data (Ng et al., 2002). Intuitively, n should be large enough to represent the structure of the whole dataset while still being relatively small to reduce the computational complexity. In our numerical experiments, we choose n to be linear in K log N. Ideally, each sub-cluster YCi should well represent the subspace it belongs to, i.e., contain at least one basis of that subspace. Therefore we want dmax to be larger than maxk=1,...,K dk which is unknown. For this reason we set dmax to grow linearly with D. Similarly the residual minimization parameter m should also be linear in D.

Li, Hannig, Mukherjee

Method Accuracy (%) Accuracy-Sub (%) NMI (%) Runtime (sec.)

SBSC(1) 95.74 (0.28)

96.44 (1.14)

89.9 (0.47) 38

SBSC(6) 97.15 (0.16)

95.25 (1.78)

92.6 (0.3) 246

STSC(1) 30.2 (2.13)

67.8 (3.95)

11.52 (2.12) 28

STSC(6) 40.12 (2.84)

65.23 (2.2)

22.53 (2.36) 172

SLRR(1) 79.5 (1.19)

79.46 (1.3)

79.9 (1.52) 59

SLRR(6) 81 (0.67)

79.6 (0.44)

83.75 (0.74) 378

SLSR(1) 75.06 (6.11)

74.62 (5.99)

76.21 (3.63) 54

SLSR(6) 79.64 (0.85)

76.43 (1.85)

81.24 (0.95) 326

TSC 84.63 NA 87.47 1184 SSC (DC1) 96.55 NA NA 5254 SSC (DC2) 96.1 NA NA 4390 SSC (DC5) 94.9 NA NA 1596 SSC-OMP 81.51 NA 84.45 232 SSC-ENSC 93.79 NA 88.8 500

Table 3: Performance on MNIST: See description in Table 1. Additionally, the results of methods with star marks are copied from the original paper that did not report NMI. The number in the parenthesis next to the algorithm name is the number of bags. We see SBSC dominates other algorithms in nearly every aspect.

A.1 Threshold Selection

The spectral clustering algorithm can deliver exact clustering result (Von Luxburg, 2007) if the graph induced by the aﬃnity matrix (D+DT ) has no false connections; and has exactly K connected components. For a large threshold parameter tmax on the aﬃnity matrix more entries in D will be kept and our algorithm is more likely to have false connections, while small tmax eliminates false connections but might incur non-connectivity. Let us consider a heuristic situation: the subset we sampled contains exactly the same points (hence n

K points) for each cluster. Then if we choose the threshold index tmax to be n 2K , the induced graph from our aﬃnity matrix will have no false connection (given that points from same subspace have bigger similarities between each other) and the clusters themselves will be connected, therefore the spectral clustering algorithm will deliver the exact clustering result (Luxburg et al., 2005). In reality clusters do not usually have same points in ˆY, hence we choose tmax to start from a relatively large number n 0.5K and gradually increase it. Based on diﬀerent threshold values, we can generate diﬀerent label vectors on the subset ˆY, intuitively label vectors

Subspace Clustering through Sub-Clusters

that can deliver highly accurate results should be similar to each other or stable. Based on this intuition, we developed a simple adaptive algorithm for ﬁnding an optimal aﬃnity threshold tmax; see supplementary code for details. Based on our observation, choosing tmax adaptively works well with datasets where each cluster has large amount of points.

A.2 Combining Runs of the Algorithm

Thanks to the speed of our algorithm, we can conduct several independent runs for one experiment (for sampling based algorithms, the results between independent runs might be diﬀerent) with acceptable running time. In order to make full use of such advantage, we designed an algorithm to combine the results via bagging from several runs of SBSC (Breiman, 1996). Unlike the classiﬁcation problem, we need to conduct label switching, see Algorithm 2 for details on how label switching is addressed. Please note that bagging can be used for any clustering algorithms. In Section 4 and Appendix E we report the results, both with and without bagging, for all sampling based algorithms.

input : The label vectors {lj}b j=1 RN from b independent runs. The number of clusters K. Note that each entry of lj is a positive integer from 1 to K. output: The ﬁnal label vector l0 RN. for m = 1 to b do

Write Mj = {r : lm(r) = j}, j = 1, ..., K. for i = 1 to b and i = m do

1. Write Iq = {r : li(r) = q}, q = 1, ..., K. Let S RK K be a score matrix where

[S]jq = card(Mj Iq) min(card(Mj), card(Iq)), 1 j, q K.

2. Switch the labels in li based on score matrix S: for k = 1 to K do

Let q = arg maxj[S]jk. For r Iq set li(r) := k . end end end for n = 1 to N do

Set l0(n) := mode({lj(n)}b j=1). end

Algorithm 2: Bagging of Clustering Labels

The Step 2 of Algorithm 2 has a subtle issue: there might exists an integer q such that q = arg maxj[S]jk = arg maxj[S]jr. Please see our supplementary codes on how to tackle this problem.

Li, Hannig, Mukherjee

Appendix B. Clustering Accuracy and Normalized Mutual Information

The clustering accuracy measures the percentage of correctly labeled data points (You et al., 2016c). It is calculated by

accr = max π 100

i,j lest π(i)jltrue ij , 1 i K, 1 j N.

Here π is the permutation of K labels. The estimated label indicator lest π(i)j equals to 1 if and only if we assign label π(i) to the j-th point, and 0 otherwise. The ground-truth label indicator ltrue ij equals to 1 if and only if the j-th point has label i, and 0 otherwise. The normalized mutual information (Strehl and Ghosh, 2002) is calculated by

NMI(lest, ltrue) = I(lest, ltrue) p

H(lest)H(ltrue) .

Here lest and ltrue are estimated/ground-truth label vectors, respectively. We use I(lest, ltrue) to denote the mutual information between lest and ltrue, and H(lest) to denote the entropy of lest. Similarly for H(ltrue).

Appendix C. Proofs of Main Theorems

In this section, we will prove the theorems from Section 3. The following Lemmas are used to prove Theorem 1.

Lemma 1 Let b be a vector sampled uniformly from Sd 1, and λk (k = 1, .., d) be constants such that 1 λ1 λ2 ... λd 0. For constant g1 (λd, λ1), we write ri = (g2 1 λ2 i )+ and si = (g2 1 λ2 i ) . Assuming that Pd i=1 ri > Pd i=1 si, then

i=1 (λibi)2 < g2 1

ϵ = Pd i=1 (ri si) q Pd i=1 r2 i + q Pd i=1 s2 i

s q Pd i=1 r2 i + q Pd i=1 s2 i

2 + 2s1 Pd i=1 (ri si)

Proof We write bi = zi q Pd j=1 z2 j , where {zi}d i=1 are i.i.d. N(0, 1) random variables. The

goal is to bound

i=1 si z2 i

i=1 ri z2 i

Note that g1 (λd, λ1), hence both Pd i=1 ri and Pd i=1 si are strictly positive.

Subspace Clustering through Sub-Clusters

Now we write X = Pd i=1 si z2 i and Y = Pd i=1 ri z2 i . Applying Lemma 1 in Laurent and Massart (2000) we have for positive constants ϵa and ϵb the following inequalities are true

i=1 s2 i ϵa + 2s1ϵ2 a

i=1 r2 i ϵb

We set ϵa = ϵb and

i=1 s2 i ϵa + 2s1ϵ2 a =

i=1 r2 i ϵb.

Solving the above quadratic equation we have

ϵa = ϵb = Pd i=1 (ri si) q Pd i=1 r2 i + q Pd i=1 s2 i

s q Pd i=1 r2 i + q Pd i=1 s2 i

2 + 2s1 Pd i=1 (ri si)

Consequently

i=1 s2 i ϵa + 2s1ϵ2 a

i=1 r2 i ϵb

e ϵ2 a + e ϵ2 b.

Substituting ϵa and ϵb into the inequality above yields the result.

The following bound on F-distributed random variables follows from Lemma 1.

Corollary 1 Let X F(m, n), and m, n 2. Then for constant q > 1, we have

P [X q] 2e ϵ2,

where ϵ = 1

2 + 2m (q 1)

Proof We write bi = zi Pm+n i=1 z2 i , and X = (Pm i=1 z2 i )/m (Pm+n i=m+1 z2 i )/n, where {zi}m+n i=1 are i.i.d. N(0, 1)

random variables. Then we have

P [X q] = P

The corollary follows by selecting λi = q

1 2 + 1 mq for i = 1, ..., m, λi = q

i = m + 1, ..., m + n, and g1 = q

1 2 in Lemma 1.

Lemma 2 states a bound on the order statistics of Beta distributed random variables.

Li, Hannig, Mukherjee

Lemma 2 Suppose Tl satisfy Assumption A1. For any k = 1, ..., K, let {B(i)}Nk 1 i=1 be the order statistics from a sample of (Nk 1) i.i.d β(1

2 ) random variables, then

P B(Nk 1) 1

2(Nk 1)e ϵ22,

P B(Nk dmax) T 2 l (Nk dmax)

dmax (Nk + 1) Nρ k 1 2 .

Here ϵ2 is deﬁned in (11). Proof Let B β(1

2, 1 d 1). Then we can write B = z2 1 Pd i=1 z2 i , where {zi}d i=1 are i.i.d. N(0, 1)

random variables. Select λ1 = 1, λi = 0 (i = 2, .., d) and and g1 = 1

2 in Lemma 1. Note the following fact

d 1 1 2 + 1

d 1 + 1 + q

d 1 + 1 + 2(d 2) .

From Lemma 1 we have P B 1

2 2e ϵ22. Therefore by union bound inequality we have

P B(Nk 1) 1

2(Nk 1)e ϵ22.

This proves the ﬁrst part of Lemma 2. Next we prove the second part of Lemma 2. Let U(i) = F( 1

2 )(B(i)), here F( 1

the CDF of the Beta distribution β(1

2 ). Note that {U(i)}Nk 1 i=1 are the order statistics of the uniform distribution. From Assumption A1 we know F( 1

2 )(T 2 l ) 1 dmax

N1 ρ k and hence

P B(Nk dmax) T 2 l P

U(Nk dmax) 1 dmax

By Chebyshev s inequality and basic properties of the uniform order statistics we have

U(Nk dmax) 1 dmax

V ar U(Nk dmax)

2 = (Nk dmax) dmax(Nk + 1)(Nρ k 1)2 . (13)

Combine (12) and (13) we know

P B(Nk dmax) T 2 l (Nk dmax) dmax(Nk + 1)(Nρ k 1)2 .

This completes the proof.

Lemma 3 to Lemma 5 are used to prove Theorem 2.

Subspace Clustering through Sub-Clusters

Lemma 3 Let v be a random vector that uniformly distributed on Sd 1. Then we can decompose v into v = gs, 1 gu , where g β(1

2 ), u U(Sd 2) and P [s = 1] = P [s = 1] = 0.5 are three independent random variables.

Proof It is straightforward to see v, v = [v2 1, ..., v2 d] follows the Dirichlet distribution with parameters α = (1

2) Rd. We can decompose v, v into the following concatenation of two random components

v2 1, ..., v2 d = v2 1, (1 v2 1) v 1, v 1

Since Dirichlet distribution is completely neutral (Lin, 2016), we know that v2 1 is independent of v 1,v 1

1 v2 1 , where v2 1 β(1

2 ) and v 1,v 1

1 v2 1 Dir(α 1). From symmetry, we can set gs := v1 and u := v 1

1 v2 1 , where the distributions of g, u and s are speciﬁed in the

statement of Lemma 3. This completes the proof.

Let {ai}Nk 1 i=1 be (Nk 1) vectors that are uniformly sampled from Sd 1. From Lemma 3, we know that for any i = 1, ..., Nk 1, the value of ai1 is independent of [ai2,..,aid]

1 a2 i1 . The

following corollary is then a direct result from this fact.

Corollary 2 Let {a(i)}Nk 1 i=1 be a permutation of {ai}Nk 1 i=1 sorted in ascending order of the absolute value of the ﬁrst coordinate. Then we can write

a(i) = h a(i)1, q

1 a2 (i)1b Nk i i ,

where {bi}Nk 1 i=1 are i.i.d. uniform samples on Sd 2.

Lemma 4 (Lerman et al., 2012, Lemma B.3) Let {bi}dmax i=1 be i.i.d. uniform samples from Sd 2, d 3. Then for any t 0

i=1 | u, bi |

with probability at least 1 e t2/2.

Corollary 3 Use the same deﬁnition of {bi}dmax i=1 from Lemma 4. Then for any t 0:

i=1 u, bi 2 p

with probability at least 1 e t2/2.

Proof Note that E [ u, b ] = 0 for any b U(Sd 2) and u Rd 1. Therefore by Lemma 6.3 in Ledoux and Talagrand (2013) we have:

2 sup u 2=1

Li, Hannig, Mukherjee

Here {ϵi}dmax i=1 are i.i.d. Rademacher random variables. The lemma is proved by following similar steps after equation (B.11) in Lerman et al. (2012).

Lemma 5 Suppose Assumption A3. Write a0 = [1, 0, ..., 0] Rd, and use the deﬁnitions for {ai}Nk 1 i=1 and {a(i)}Nk 1 i=1 from Corollary 2. Let B Rd (dmax+1) be a matrix where its ﬁrst column is a0 and its i-th column (2 i dmax + 1) is a(Nk i+1). Let the largest d singular values of B be s1 s2 sd. Then we have

P s2 d q0 1 2 Nt2/2 (Nk dmax) dmax(Nk + 1)(Nρ k 1)2 2(Nk 1)e ϵ22,

where ϵ2 is deﬁned in (11).

Proof From Corollary 2, we know B can be re-written as

1, a(Nk 1)1, ... a(Nk dmax)1 0, q

1 a2 (Nk 1)1b1, ... q

1 a2 (Nk dmax)1bdmax

where {bi}dmax i=1 are i.i.d. uniform samples from Sd 2. Given the dimensions of B, we know sd = inf x 2=1 BT x 2. For convenience, we write

1 x2 1 [x2, ..., xd] ,

where x 2 = 1, ci = x , bi , a(Nk)1 = 1. Let E1 be the event that {s2 d q0}, and E2 be the event that {a2 (Nk i)1 T 2 l , 1

2 , i = 1, .., dmax}. From Lemma 2 we know

P [E2] 1 (Nk dmax) dmax(Nk + 1)(Nρ k 1)2 2(Nk 1)e ϵ22. (14)

Conditioning on E2, we have the following relations

1, a(Nk 1)1, ... a(Nk dmax)1 0, q

1 a2 (Nk 1)1b1, ... q

1 a2 (Nk dmax)1bdmax

a(Nk i)1x1 +

r 1 a2 (N i)1 1 x2 1 ci

i=0 a2 (N i)1x2 1 + 2

a2 (N i)1(1 a2 (Nk i)1)(1 x2 1)cix1

i=1 (1 a2 (Nk i)1)(1 x2 1)c2 i

T 2 l dmax x2 1 q

(1 x2 1)x2 1 sup u 2=1

Subspace Clustering through Sub-Clusters

+ 1 x2 1 2 inf u 2=1

i=1 u, bi 2. (15)

From Lemma 4 and Corollary 3 and conditional on E2, we have the following inequality

(15) (1 x2 1)C2 q

(1 x2 1)x2 1C1 + T 2 l dmax x2 1, (16)

with probability at least 1 2 Nt2/2 . Since 1 x2 1 1, a lower bound of the RHS of (16) is

T 2 l dmax C2 x2 1 C1x1 + C2

T 2 l dmax C2 C2 C2 1 4 T 2 l dmax q0,

where the q0 comes from Assumption A3. Finally, note the following fact

P [E1] P [E1|E2] + P [E2] 1 (17)

= 1 2 Nt2/2 (Nk dmax) dmax(Nk + 1)(Nρ k 1)2 2(Nk 1)e ϵ22.

This completes the proof.

Proof of Theorem 1 Let the event E1i = {YCi only contains points in same subspace}, then E1 = n i=1E1i is the event that Algorithm 1 has sub-cluster preserving property. Let

the event E2 = {σ e(k) i 2 < g2, i, k}, where g2 is from Assumption A2. Our goal is to ﬁnd

a lower bound on P [E1]. Note the following fact

i=1 P h E 1i|E2 i + P [E2] 1 = P [E2]

i=1 P h E 1i|E2 i . (18)

Therefore, it suﬃces to ﬁnd a lower bound on P [E2] Pn i=1 P h E 1i|E2 i .

We start by ﬁnding a preliminary upper bound on P h E 11|E2 i . WLOG assume that y1(1)

is one of the sampled points, and YC1 is the sub-cluster associated with it. Recall that in Step 2 of Algorithm 1, we use | y(1) 1 , y(k) i | to measure the aﬃnity between y(1) 1 and y(k) i , the nearest (dmax + 1) points are then used to construct the sub-cluster associated with y(1) 1 . Write ˆAk = {| y(1) 1 , y(k) i |}Nk i=1, for E11 to happen we need the largest (dmax + 1) values among K k=1 ˆAk to be from the set ˆA1. Mathematically this means

E 11 = ˆA1 (N1 dmax) max k =1 max i=1,..,Nk ˆAk i

where ˆAk i is the i-th element in ˆAk and ˆAk (i) is the i-th smallest element in ˆAk.

Li, Hannig, Mukherjee

Recall from (1) that yi(k) = Uka(k) i +σe(k) i Uka(k) i +σe(k) i 2 . The triangle inequality tells us that

Uka(k) i 2 σe(k) i 2 Uka(k) i + σe(k) i 2 Uka(k) i 2 + σe(k) i 2 .

Therefore conditional on E2, we know the normalizing constants Uka(k) i + σe(k) i 2 are

bounded in [1 g2, 1+g2]. We can write Ak i = y(1) 1 2 y(k) i 2 ˆAk i . It is fairly straightforward to get the following relation

P A1 (N1 dmax) 1 + g2

1 g2 max k =1 max 1 i Nk Ak i

P h E 11 E2 i . (19)

Conditioning on E2 and write Bi = a(1) 1 , a(1) i 2, i = 2, ..., N1 1. We have the following inequalities

A1 (N1 dmax) = q

B(N1 dmax) + σ U1a(1) 1 , e(1) i + σ U1a(1) i , e(1) 1 + σ2 e(1) 1 , e(1) i

B(N1 dmax) σ e(1) i 2 σ e(1) 1 2 σ2 e(1) 1 2 max i =1

B(N1 dmax) 2g2 g2 2.

Similarly we have

max k =1 max 1 i Nk Ak i = max k =1 max 1 i Nk

U1a(1) 1 , Uka(k) i + σ U1a(1) 1 , e(k) i + σ Uka(k) i , e(1) 1 + σ2 e(1) 1 , e(k) i

max k =1 max 1 i Nk

U1a(1) 1 , Uka(k) i + σ max k =1 max 1 i Nk

+ σ e(1) 1 2 + σ2 e(1) 1 2 max k =1 max 1 i Nk

max k =1 max 1 i Nk

U1a(1) 1 , Uka(k) i + 2g2 + g2 2.

Pick T from Assumption A2, then the LHS of (19) has the following upper bound

P [T Q|E2] + P B(N1 dmax) T 2 E2 , (20)

Q = 1 + 1 + g2

2g2 + g2 2 + 1 + g2

1 g2 max k =1 max 1 i Nk

U1a(1) 1 , Uka(k) i .

Now we are going to complete our proof in 3 steps. Step 1: For the ﬁrst term in (20) we have

P [T Q|E2] = P g1 max k =1 max 1 i Nk

U1a(1) 1 , Uka(k) i .

Subspace Clustering through Sub-Clusters

From singular value decomposition we can write

U1a(1) 1 , Uka(k) i = a(1)T 1 W1kΛ1k VT 1ka(k) i := b T k Λ1k VT 1kai(k),

where both {bk}K k=2 and {VT 1ka(k) i }K k=2 are sampled uniformly from Sd 1. Therefore

P g1 max k =1 max 1 i Nk

U1a(1) 1 , Uka(k) i =P g2 1 max k =1 max 1 i Nk

b T k Λ1k VT 1ka(k) i 2

k=2 P g2 1 max 1 i Nk

b T k Λ1k VT 1ka(k) i 2 (21)

λ(1k) i bki 2 #

λ(1) i bki 2 #

where inequality (21) uses the union bound inequality, (22) comes from Cauchy-Schwarz inequality, and (23) uses Deﬁnition 1. Since {bk}K k=2 U(Sd 1), we can write (23) as

λ(1) i bi 2 #

where b is uniformly distributed on Sd 1. Now we apply Lemma 1 directly to the quantity

above and get P g2 1 Pd i=1 λ(1) i bi 2 2e ϵ 2 where

ϵ = Pd i=1(ri si)

( q Pd i=1 r2 i + q Pd i=1 s2 i ) +

( q Pd i=1 r2 i + q Pd i=1 s2 i )2 + 2s1 Pd i=1(ri si)

= ( q Pd i=1 r2 i + q Pd i=1 s2 i ) +

( q Pd i=1 r2 i + q Pd i=1 s2 i )2 + 2s1 Pd i=1(ri si)

( q Pd i=1 r2 i + q Pd i=1 s2 i ) +

( q Pd i=1 r2 i + q Pd i=1 s2 i )2 + 2 Pd i=1(ri si)

= Pd i=1(ri si)

( q Pd i=1 r2 i + q Pd i=1 s2 i ) +

( q Pd i=1 r2 i + q Pd i=1 s2 i )2 + 2 Pd i=1(ri si)

Pd i=1(ri si)

2 q Pd i=1 r2 i + q

4 Pd i=1 r2 i + 2 Pd i=1 ri ϵ1.

Here ϵ1 is deﬁned in (8), ri and si are deﬁned in Lemma 1, and (24) comes from the following fact for positive constants a, b and s (0, 1)

a2 + 2sb 2s a +

a2 + 2b 2 .

Li, Hannig, Mukherjee

Therefore we have

P g1 max k =1 max 1 i nk

U1a(1) 1 , Uka(k) i 2(K 1)e ϵ2 1.

Step 2: For the second term of (20), we just need to use Lemma 2. Note that for ﬁxed a(1) 1 , one can show that Bi = a(1) 1 , a(1) i 2 can be treated as a sample from a Beta distribution with parameters (1

2 ). From Lemma 2 and Assumption A2 we have

P B(N1 dmax) T 2 E2 (N1 dmax) dmax(N1 + 1)(Nρ 1 1)2 .

Combine the results above we know

P h E 11 E2 i 2(K 1)e ϵ12 + (N1 dmax) dmax(N1 + 1)(Nρ 1 1)2 . (25)

Step 3: Now we are going to ﬁnd the lower bound on P[E2]. Let e be an independent copy of e(1) 1 , note that e 2 2 /D FD,d. From Corollary 1 we have

P [g2 σ||e||2] = P g2 2 Dσ2 ||e||2 2 D

where t can be calculated from Corollary 1. Using Assumption A2 we have

t > D g2 2 Dσ2 1

D + g2 2 σ2

D + g2 2 Dσ2

1 + η 2 + η

Therefore we have P [g2 σ||e||2] 2

N(1+ η 2+η) 2 . Now we note that

P g2 > σ max k=1,...,K max 1 i Nk

= ΠN i=1 1 P h g2 σ e(k) i 2

(1 2e t2)N 1 2N

1+ η 2+η 2 ,

where the last inequality comes from the Bernoulli s inequality. Therefore

P [E2] 1 2N

1+ η 2+η 2 . (26)

Finally, the above arguments hold for any y(k) i . Putting (25) and (26) together and applying the union bound inequality yields the result

nk(Nk dmax) dmax(Nk + 1)(Nρ k 1)2 2(K 1)ne ϵ2 1 2N

1+ η 2+η 2 . (27)

Subspace Clustering through Sub-Clusters

To prove Theorem 2, we will use the following equation

(WT W + λId2) 1WT = WT (WWT + λId1) 1, (28)

where W Rd1 d2 and λ is a positive constant (Murphy, 2012, Chapter 4). Throughout the proof of Theorem 2, the subscript of identity matrix I will be omitted as its dimension is clear from the context.

Proof of Theorem 2 Similar to the proof of Theorem 1, let E1 be the event that correct neighborhood property holds for all {YCj}n j=1, let E2 be the event {σ e(k) i 2 < g, i, k} (g is from Assumption A4), E3 is the event that the smallest singular value of BBT is at least q0, i = 1, ..., n, and E4 is the event that the sub-cluster preserving property is satisﬁed. Deﬁne I = {(i, j) : the i-th and the j-th sampled points belong to diﬀerent clusters, 1 i < j n}, and J = {(i, j) : the i-th and the j-th sampled points belong to the same cluster, 1 i < j n}. Conditional on E4, we know that YCi and YCj belong to diﬀerent clusters if (i, j) I, and belong to the same cluster if (i, j) J . We will show that conditioning on E2, E3 and E4, there exists a constant l such that

P [E1|E2, E3, E4] P d(YCi, YCj) (i,j) I > l E2, E3, E4 1 X

(i,j) I P d(YCi, YCj) l E2, E3, E4 .

Then we obtain an upper bound on P[d(YCi, YCk) l | E2, E3, E4], (i, k) I. The theorem will follow by using the union bound inequality. WLOG assume that YC1 and YC2 belong to S1, and YC3 belongs to S2. The distance function d(YC1, YC2) can be explicitly written as YC1 YC2(YT C2YC2 + λI) 1YT C2YC1 F + YC2 YC1(YT C1YC1 + λI) 1YT C1YC2 F . (29)

Conditional on E4, we can write YC1 = U1 ˆB1 + ˆE1, where [U1 ˆB1]j + [ˆE1]j 2 = 1. Let B1 and E1 be the un-normalized version of ˆB1 and ˆE1 respectively. Here each column of B1 is a sample from the uniform distribution on Sd 1. We have the following relation

[ ˆB1]j = [B1]j [U1B1]j + [E1]j 2 , [ˆE1]j = [E1]j [U1B1]j + [E1]j 2 , j = 1, ..., dmax + 1.

Similarly we can write YC2 = U1 ˆB2 + ˆE2 and YC3 = U2 ˆB3 + ˆE3. Using Equation (28), the ﬁrst term in (29) can be rewritten as YC1 YC2(YT C2YC2 + λI) 1YT C2YC1 F = YC1 (YC2YT C2 + λI λI)(YC2YT C2 + λI) 1YC1 F =λ (YC2YT C2 + λI) 1YC1 F

<λ [(YC2YT C2 + λI) 1 (U1 ˆB2 ˆBT 2 UT 1 + λI) 1]||F ||YC1 F

+ λ (U1 ˆB2 ˆBT 2 UT 1 + λI) 1YC1 F

Li, Hannig, Mukherjee

<λ (YC2YT C2 + λI) 1 (U1 ˆB2 ˆBT 2 UT 1 + λI) 1 F

+ λ (U1 ˆB2 ˆBT 2 UT 1 + λI) 1U1 ˆB1 F

+ λ (U1 ˆB2 ˆBT 2 UT 1 + λI) 1 F

ˆE1 F . (30)

Now we are going to complete our proof in 3 steps. Unless speciﬁed otherwise, the following Step 1 to Step 3 are derived conditioning on E2, E3 and E4. Step 1: We can rewrite the ﬁrst term in (30) as the following term

λ (G2 + H) 1 H 1 F p

where H = U1 ˆB2 ˆBT 2 UT 1 + λI, and G2 = YC2YT C2 U1 ˆB2 ˆBT 2 UT 1 = ˆE2 ˆBT 2 UT 1 + U1 ˆB2 ˆET 2 + ˆE2 ˆET 2 . Note that the normalizing constant of each column of {ˆEi}n i=1 are bounded in [1 g, 1 + g]. We then have the following relations

ˆBT 2 UT 1 F + U1 ˆB2 + ˆE2 F

(2g g2)(dmax + 1)

(1 g)2 . (31)

The above analysis used triangle inequalities and the bounds of normalizing constants. Using the fact that

and inequality (31), we have the following inequalities

H 1G2 F H 1 F G2 F = (2g g2) (dmax + 1)

λ2 =: f(d) < 1

Therefore limm (H 1G2)m = 0. From Theorem 4.29 in Schott (2016) we know (I + H 1G2) 1 = P j=0(H 1G2)j and

(G2 + H) 1 H 1 F = H 1G(I + H 1G2) 1H 1 F

j=1 (H 1G2)j

< f(d) 1 f(d)

We then have for the ﬁrst term in (30)

λ (G2 + H) 1 H 1 F p

dmax + 1 < f(d) dmax + 1

d(1 + g)4λ2

q2 0 + D d.

Subspace Clustering through Sub-Clusters

For the second term in (30) we have

λ U1 ˆB2 ˆBT 2 UT 1 + λI 1 ˆB1

F = ˆB1 ˆB2 ˆBT 2 ( ˆB2 ˆBT 2 + λI) 1 ˆB1 F

I ˆB2 ˆBT 2 ( ˆB2 ˆBT 2 + λI) 1 F

ˆB1 F λ(1 + g)2p

d(dmax + 1) q0(1 g) .

For the third term in (30) we have

λ (U1 ˆB2 ˆBT 2 UT 1 + λI) 1 F

d(1 + g)4λ2

q2 0 + D d.

Hence by our assumption, Equation (30) can be upper bounded by the following term

3λ(1 + g)2p

d(dmax + 1) q0(1 g) ,

which is deterministic and does not depend on the choices of {Bi}n i=1 and {Uk}K k=1. The distance function in (29) has two parts which are symmetric, therefore we set

l := 6λ(1 + g)2p

d(dmax + 1) q0(1 g) > d(YCi, YCj)(i,j) J .

Step 2: Now we consider P [d (YC1, YC3) l|E2, E3, E4]. We explicitly write d(YC1, YC3) as YC1 YC3(YT C3YC3 + λI) 1YT C3YC1 F + YC3 YC1(YT C1YC1 + λI) 1YT C1YC3 F . (32)

Note the following relation

P [d(YC1, YC3) l|E2, E3] P YC1 YC3(YT C3YC3 + λI) 1YT C3YC1 F l

+ P YC3 YC1(YT C1YC1 + λI) 1YT C1YC3 F l

To bound the ﬁrst term in (32), the following facts come from the triangle inequality YC1 YC3 YT C3YC3 + λI 1 YT C3YC1 F

=λ YC3YT C3 + λI 1 YC1 F

>λ U2 ˆB3 ˆBT 3 UT 2 + λI 1 U1 ˆB1

F λ U2 ˆB3 ˆBT 3 UT 2 + λI 1 F

YC3YT C3 + λI 1 U2 ˆB3 ˆBT 3 UT 2 + λI 1 F

The last two terms of the line above are upper bounded by λ(1+g)2

d(dmax+1) q0(1 g) as before, and the ﬁrst term can be bounded by the following relations

λ U2 ˆB3 ˆBT 3 UT 2 + λI 1 U1 ˆB1

Li, Hannig, Mukherjee

U1 ˆB1 U2UT 2 U1 ˆB1 F λ U2 ˆB3 ˆBT 3 + λI 1 UT 2 U1 ˆB1

> U1 ˆB1 U2UT 2 U1 ˆB1 F λ(1 + g)2p

d(dmax + 1) q0(1 g) , (34)

where inequality (34) comes from the following relations

λ U2 ˆB3 ˆBT 3 + λI 1 UT 2 U1 ˆB1

F λ ˆB3 ˆBT 3 + λI 1 F

1 g = λ(1 + g)2p

d(dmax + 1) q0(1 g) .

For the ﬁrst term in (34) we have

U1 ˆB1 U2UT 2 U1 ˆB1 F =

Tr h ˆBT 1 ˆB1 ˆBT 1 UT 1 U2UT 2 U1 ˆB1 i

I Λ2 12 B1W F

I Λ2 12 B1 F 1 + g ,

where W is the diagonal matrix such that Wjj = 1 [ ˆB1]j 2 ([ ˆB1]j is the j-th column of

ˆB1, j = 1, .., dmax + 1), B1 = VB1 is a orthogonal transformation of B1 (here V is the right orthogonal matrix in the svd of UT 2 U1), and Λ12 is the diagonal matrix such that [Λ12]ii = λ(12) i , i = 1, ..., d. Therefore, eventually the ﬁrst term at the RHS of (33) can be upper bounded by

F 6λ(1 + g)2p

d(dmax + 1) q0(1 g)

Using Assumption A4, Lemma 1 and arguments similar to the proof of Theorem 1, we know the quantity above is upper bounded by

I Λ2 12v F q

where ϵ1 is deﬁned in (8) with g1 replaced by Tl, and v is the ﬁrst column of B1. Using analogous manipulations we obtain similar results for the second term in (33). Therefore P [d(YC1, YC3) l|E2, E3, E4] 4e ϵ12. Step 3: Now we are going to lower bound P [E2, E3, E4] from the fact P [E2, E3, E4] 1 P[E 2] P[E 3] P[E 4].

Just as in the proof of Theorem 1 we have P h σe(k) i g i 2e t2, where

t = D g2 Dσ2 1

Subspace Clustering through Sub-Clusters

From Assumption A4 we know 2e t2 2

N(1+ η 2+η) 2 . Using union bound inequality we have

P h E 2 i 2N

1+ η 2+η 2 . (35)

From Lemma 5 we have

P h E 3 i 2n Nt2/2 +

Nk dmax dmax(Nk + 1)(Nρ k 1)2 + 2(Nk 1)e ϵ2 2 . (36)

From our assumption we have

P h E 4 i ps. (37)

Combing (35), (36) and (37) we know

(i,j) I P d(YCi, YCj) l E2, E3, E4 P h E 2 i P h E 3 i P h E 4 i

1 4n(n 1)e ϵ2 1 2n Nt2/2

Nk dmax dmax(Nk + 1)(Nρ k 1)2 + 2(Nk 1)e ϵ2 2

1+ η 2+η 2 ps.

This completes the proof.

Li, Hannig, Mukherjee

Appendix D. Residual Minimization by Ridge Regression

In this section we provide the algorithm for classifying the out-of-sample points.

input : Y to be classiﬁed, R and ℓare the training data and labels, m and λ are the residual minimization and regularization parameters output: The label vector ℓof all points in Y 1. Generate subsets of training data for k = 1 to K do

Uniformly sample m points from the k-th cluster in the training set R, denote this sampled set as Rk; end 2. Compute the projection matrix for each cluster for k = 1 to K do

Pk := Rk(RT k Rk + λI) 1RT k end 3. Compute residuals for points in Y , here N is the number of points in Y for i = 1 to N do

for k = 1 to K do

ri(k) := (I Pk)yi; end end 4. Assign labels through minimum residual for i = 1 to N do

ℓi = arg mink ri(k); end

Algorithm 3: Residual Minimization by Ridge Regression (RMRR) algorithm.

Appendix E. Additional Numerical Results

In this section, we present additional numerical results. Results for some algorithms are omitted for certain datasets due to the limitations on computational resources. Speciﬁcally, the additional results are presented in Table 4, Table 5 and Table 6.

Appendix F. Additional Technical Discussion

F.1 The ϵ1 in Theorem 1

In this section, we will show that under mild conditions, ϵ1 in (8) grows at least linear in

d. For ease of notation, we write ri = g2 1 λ(1)2 i

+ and si = g2 1 λ(1)2 i

, i = 1, .., d.

WLOG assume ϵ1 is evaluated at k = 1.

Subspace Clustering through Sub-Clusters

Method Accuracy (%) Accuracy-Sub (%) NMI (%) Runtime (sec.)

SBSC-SSC 20.99 (1.02)

22.5 (1.29)

34.24 (1.15) 56

SSSC 49.33 (2.51)

56.54 (1,77)

52.82 (2.21) 22

LRR 55.63 NA 64.02 29 LSR 54.11 NA 65.12 8

Table 4: Additional Results on Extended Yale B

Method Accuracy (%) Accuracy-Sub (%) NMI (%) Runtime (sec.)

SBSC(6) 75.16 (3.28)

72.62 (1.52)

78.79 (1.67) 63

SBSC-DSC(6) 64.25 (1.25)

65.34 (1.86)

72.34 (0.76) 388

SBSC-SSC(1) 55.24 (1.34)

61.44 (2.36)

45.18 (1.42) 117

SBSC-SSC(6) 71.07 (0.94)

68.65 (1.21)

67.39 (1.19) 703

STSC(6) 57.76 (1.15)

60.4 (1.59) 13

SDSC(6) 52 (2.63)

51.59 (1.28)

63.28 (1.26) 51

SSSC(1) 41.52 (5.92)

44.86 (7.06)

38.22 (3.7) 25

SSSC(6) 44.43 (4)

44.06 (2.53)

42.61 (2.23) 150

SLRR(6) 63.7 (3.74)

63.85 (1.74)

69.25 (1.86) 46

SLSR(6) 60.71 (1.04)

59.43 (0.8)

66.39 (1.08) 26

LRR 53.25 NA 53.53 401 LSR 58.91 NA 61.56 192

Table 5: Additional Results on Zipcode

Method Accuracy (%) Accuracy-Sub (%) NMI (%) Runtime (sec.)

SBSC-SSC 84.95 (4.51)

86.48 (4.2)

73.71 (2.06) 834

SSSC(1) 33.26 (2.15)

77.22 (3.9)

13.59 (1.41) 43

SSSC(6) 48.49 (2.75)

79.06 (1.63)

30.41 (2.04) 259

Table 6: Additional Results on MNIST

Li, Hannig, Mukherjee

Main result: If there exist constants c1 (0, g1], c2 (0, 1) and c3 > 0 such that Pd i=1 ri c1d, Pd i=1 si Pd i=1 ri c2 and c3d > g1, then we have

Proof Note that

ϵ1 = 1 Pd i=1 si Pd i=1 ri

2 r Pd i=1 r2 i (Pd i=1 ri)2 + r

4 Pd i=1 r2 i (Pd i=1 ri)2 + 2 Pd i=1 ri

Deﬁne f : V R, where f(v) = Pd i=1 v2 i (Pd i=1 vi)2 , and V = {v [0, g1]d : Pd i=1 vi = Pd i=1 ri}.

Consider the following r V

g1, if i Pd i=1 ri

g1 , Pd i=1 ri Pd i=1 ri

g1 g1, i = Pd i=1 ri

0, otherwise.

We will prove by contradiction that any maximizer of f( ) is a permutation of r . In fact, assume r V also maximizes f( ) but is not a permutation of r . Assume there are m terms in {r i}d i=1 that are equal to g1. Let r 1 r 2 be the two smallest positive terms of {r i}d i=1. It is straightforward to see r 2 < g1. Consequently, we can ﬁnd a constant δ > 0 such that r 1 δ, r 2 + δ (0, g1). Note r = [r 1 δ, r 2 + δ, r 3, ..., r d] V, but f(r ) > f(r ), which is a contradiction. Note that r V, we plug r into f( ) and get

f(r) = Pd i=1 r2 i (Pd i=1 ri)2 ( Pd i=1 ri

g1 + 1)g2 1 (Pd i=1 ri)2 (c1d

g1 + 1)g2 1 (c1d)2 (c1 + c3)g1

Finally, from the inequality above and (38) we have

c2 1d + 2 c1d = (1 c2)

Subspace Clustering through Sub-Clusters

Pankaj K Agarwal, Sariel Har-Peled, Kasturi R Varadarajan, et al. Geometric approximation via coresets. Combinatorial and computational geometry, 52:1 30, 2005.

Bryon Aragam, Chen Dan, Eric P Xing, Pradeep Ravikumar, et al. Identiﬁability of nonparametric mixture models and bayes optimal clustering. Annals of Statistics, 48(4): 2277 2302, 2020.

Mikhail Belkin and Partha Niyogi. Laplacian eigenmaps for dimensionality reduction and data representation. Neural computation, 15(6):1373 1396, 2003.

Leo Breiman. Bagging predictors. Machine learning, 24(2):123 140, 1996.

Raymond B Cattell. Factor Analysis: An Introduction and Manual for the Psychologist and Social Scientist. Harper, 1952.

Eva L Dyer, Aswin C Sankaranarayanan, and Richard G Baraniuk. Greedy feature selection for subspace clustering. The Journal of Machine Learning Research, 14(1):2487 2517, 2013.

Ehsan Elhamifar and Ren e Vidal. Sparse subspace clustering. In 2009 IEEE Conference on Computer Vision and Pattern Recognition, pages 2790 2797. IEEE, 2009.

Ehsan Elhamifar and Rene Vidal. Sparse subspace clustering: Algorithm, theory, and applications. IEEE transactions on pattern analysis and machine intelligence, 35(11): 2765 2781, 2013.

Reinhard Heckel and Helmut B olcskei. Robust subspace clustering via thresholding. IEEE Transactions on Information Theory, 61(11):6320 6342, 2015.

Wei Hong, John Wright, Kun Huang, and Yi Ma. Multiscale hybrid linear models for lossy image representation. IEEE Transactions on Image Processing, 15(12):3655 3671, 2006.

Harold Hotelling. Analysis of a complex of statistical variables into principal components. Journal of educational psychology, 24(6):417, 1933.

GJO Jameson. Inequalities for gamma function ratios. The American Mathematical Monthly, 120(10):936 940, 2013.

Beatrice Laurent and Pascal Massart. Adaptive estimation of a quadratic functional by model selection. Annals of Statistics, 28(5):1302 1338, 2000.

Yann Le Cun, Ofer Matan, Bernhard Boser, John S Denker, Don Henderson, Richard E Howard, Wayne Hubbard, LD Jacket, and Henry S Baird. Handwritten zip code recognition with multilayer networks. In [1990] Proceedings. 10th International Conference on Pattern Recognition, volume 2, pages 35 40. IEEE, 1990.

Michel Ledoux and Michel Talagrand. Probability in Banach Spaces: isoperimetry and processes. Springer Science & Business Media, 2013.

Li, Hannig, Mukherjee

Gilad Lerman, Michael Mc Coy, Joel A Tropp, and Teng Zhang. Robust computation of linear models, or how to ﬁnd a needle in a haystack. Technical report, California Inst of Tech Pasadena Dept of Computing and Mathematical Sciences, 2012.

Jiayu Lin. On the Dirichlet Distribution. Ph D thesis, 2016.

Guangcan Liu, Zhouchen Lin, and Yong Yu. Robust subspace segmentation by low-rank representation. In Proceedings of the 27th international conference on machine learning (ICML-10), pages 663 670, 2010.

Risheng Liu, Ruru Hao, and Zhixun Su. Mixture of manifolds clustering via low rank embedding. Journal of Information and Computational Science, 8:725 737, 2011.

Can-Yi Lu, Hai Min, Zhong-Qiu Zhao, Lin Zhu, De-Shuang Huang, and Shuicheng Yan. Robust and eﬃcient subspace segmentation via least squares regression. In European conference on computer vision, pages 347 360. Springer, 2012.

Ulrike V Luxburg, Olivier Bousquet, and Mikhail Belkin. Limits of spectral clustering. In Advances in neural information processing systems, pages 857 864, 2005.

Kevin P Murphy. Machine Learning: A Probabilistic Perspective. MIT press, 2012.

Andrew Y Ng, Michael I Jordan, and Yair Weiss. On spectral clustering: Analysis and an algorithm. In Advances in neural information processing systems, pages 849 856, 2002.

Dohyung Park, Constantine Caramanis, and Sujay Sanghavi. Greedy subspace clustering. In Advances in Neural Information Processing Systems, pages 2753 2761, 2014.

Xi Peng, Lei Zhang, and Zhang Yi. Scalable sparse subspace clustering. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 430 437. IEEE, 2013.

Xi Peng, Huajin Tang, Lei Zhang, Zhang Yi, and Shijie Xiao. A uniﬁed framework for representation-based subspace clustering of out-of-sample and large-scale data. IEEE transactions on neural networks and learning systems, 27(12):2499 2512, 2015.

Mostafa Rahmani and George K Atia. Subspace clustering via optimal direction search. IEEE Signal Processing Letters, 24(12):1793 1797, 2017.

Gareth O Roberts, Richard L Tweedie, et al. Exponential convergence of langevin distributions and their discrete approximations. Bernoulli, 2(4):341 363, 1996.

Bernhard Sch olkopf, Alexander Smola, and Klaus-Robert M uller. Nonlinear component analysis as a kernel eigenvalue problem. Neural computation, 10(5):1299 1319, 1998.

James R Schott. Matrix Analysis for Statistics. John Wiley & Sons, 2016.

Mahdi Soltanolkotabi, Emmanuel J Candes, et al. A geometric analysis of subspace clustering with outliers. The Annals of Statistics, 40(4):2195 2238, 2012.

Subspace Clustering through Sub-Clusters

Alexander Strehl and Joydeep Ghosh. Cluster ensembles a knowledge reuse framework for combining multiple partitions. Journal of Machine Learning Research, 3(Dec):583 617, 2002.

Brian St Thomas, Lizhen Lin, Lek-Heng Lim, and Sayan Mukherjee. Learning subspaces of diﬀerent dimension. Ar Xiv preprint ar Xiv:1404.6841, 2014.

Rene Vidal, Yi Ma, and Shankar Sastry. Generalized principal component analysis (gpca). IEEE transactions on pattern analysis and machine intelligence, 27(12):1945 1959, 2005.

Ulrike Von Luxburg. A tutorial on spectral clustering. Statistics and computing, 17(4): 395 416, 2007.

Chong You, Claire Donnat, Daniel P Robinson, and Ren e Vidal. A divide-and-conquer framework for large-scale subspace clustering. In Signals, Systems and Computers, 2016 50th Asilomar Conference on, pages 1014 1018. IEEE, 2016a.

Chong You, Chun-Guang Li, Daniel P Robinson, and Ren e Vidal. Oracle based active set algorithm for scalable elastic net subspace clustering. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3928 3937. IEEE, 2016b.

Chong You, Daniel Robinson, and Ren e Vidal. Scalable sparse subspace clustering by orthogonal matching pursuit. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3918 3927. IEEE, 2016c.

Pan Zhou, Yunqing Hou, and Jiashi Feng. Deep adversarial subspace clustering. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1596 1604, 2018.