Journal of Machine Learning Research 18 (2017) 1-57. Submitted 8/14; Revised 7/16; Published 3/17.

Spectral Clustering Based on Local PCA

Ery Arias-Castro (eariasca@ucsd.edu), Department of Mathematics, University of California, San Diego, La Jolla, CA 92093, USA
Gilad Lerman (lerman@umn.edu), Department of Mathematics, University of Minnesota, Twin Cities, Minneapolis, MN 55455, USA
Teng Zhang (teng.zhang@ucf.edu), Department of Mathematics, University of Central Florida, Orlando, FL 32816, USA

Editor: Mikhail Belkin

© 2017 Ery Arias-Castro, Gilad Lerman, and Teng Zhang. License: CC-BY 4.0, see https://creativecommons.org/licenses/by/4.0/. Attribution requirements are provided at http://jmlr.org/papers/v18/14-318.html.

Abstract

We propose a spectral clustering method based on local principal components analysis (PCA). After performing local PCA in selected neighborhoods, the algorithm builds a nearest-neighbor graph weighted according to a discrepancy between the principal subspaces in the neighborhoods, and then applies spectral clustering. As opposed to standard spectral methods based solely on pairwise distances between points, our algorithm is able to resolve intersections. We establish theoretical guarantees for simpler variants within a prototypical mathematical framework for multi-manifold clustering, and evaluate our algorithm on various simulated data sets.

Keywords: multi-manifold clustering, spectral clustering, local principal component analysis, intersecting clusters

1. Introduction

The task of multi-manifold clustering, where the data are assumed to be located near surfaces embedded in Euclidean space, is relevant in a variety of applications. In cosmology, it arises as the extraction of galaxy clusters in the form of filaments (curves) and walls (surfaces) (Valdarnini, 2001; Martínez and Saar, 2002); in motion segmentation, moving objects tracked along different views form affine or algebraic surfaces (Ma et al., 2008; Fu et al., 2005; Vidal and Ma, 2006; Chen et al., 2009); the same is true in face recognition, in the context of images of faces in fixed pose under varying illumination conditions (Ho et al., 2003; Basri and Jacobs, 2003; Epstein et al., 1995).

We consider a stylized setting where the underlying surfaces are nonparametric in nature, with a particular emphasis on situations where the surfaces intersect. Specifically, we assume the surfaces are smooth, for otherwise the notion of continuation is potentially ill-posed. For example, without smoothness assumptions, an L-shaped cluster is indistinguishable from the union of two line segments meeting at a right angle.

Figure 1: Two rectangular clusters intersecting at a right angle. Left: the original data. Center: a typical output of the standard spectral clustering method of Ng et al. (2002), which is generally unable to resolve intersections. Right: a typical output of our method.

Spectral methods (Luxburg, 2007) are particularly suited for nonparametric settings, where the underlying clusters are usually far from convex, making standard methods like K-means irrelevant. However, a drawback of standard spectral approaches, such as the well-known variant of Ng, Jordan, and Weiss (2002), is their inability to separate intersecting clusters. Indeed, consider the simplest situation, where two straight clusters intersect at a right angle, pictured in Figure 1.
The algorithm of Ng et al. (2002) is based on pairwise affinities that are decreasing in the distances between data points, making it insensitive to smoothness and, therefore, to intersections. And indeed, this algorithm typically fails to separate intersecting clusters, even in the easiest setting of Figure 1. As argued in (Agarwal et al., 2005, 2006; Shashua et al., 2006), a multiway affinity is needed to capture complex structure in data (here, smoothness) beyond proximity attributes. For example, Chen and Lerman (2009a) use a flatness affinity in the context of hybrid linear modeling, where the surfaces are assumed to be affine subspaces, an approach subsequently extended to algebraic surfaces via the kernel trick (Chen, Atev, and Lerman, 2009). Moving beyond parametric models, Arias-Castro, Chen, and Lerman (2011) introduce a localized measure of flatness. Continuing this line of work, we suggest a spectral clustering method based on the estimation of the local linear structure (tangent bundle) via local principal component analysis (PCA).

The idea of using local PCA combined with spectral clustering has precedents in the literature. In particular, our method is inspired by the work of Goldberg, Zhu, Singh, Xu, and Nowak (2009), where the authors develop a spectral clustering method within a semi-supervised learning framework. This approach is in the zeitgeist: while writing this paper, we became aware of two concurrent publications, by Wang, Jiang, Wu, and Zhou (2011) and by Gong, Zhao, and Medioni (2012), both proposing approaches very similar to ours. (The present paper was developed in parallel with these two, first as a short version submitted to ICML in 2011.) We also mention the multiscale, spectral-flavored algorithm of Kushnir, Galun, and Brandt (2006), which is also based on local PCA. We comment on these spectral methods in more detail later on. In fact, an early proposal also based on local PCA appears in the literature on subspace clustering (Fan et al., 2006), although the need for localization is perhaps less intuitive in that setting.

The basic proposition of local PCA combined with spectral clustering has two main stages. The first one forms an affinity between a pair of data points that takes into account both their Euclidean distance and a measure of discrepancy between their tangent spaces; each tangent space is estimated by PCA in a local neighborhood around each point. The second stage applies standard spectral clustering with this affinity. As a reality check, this relatively simple algorithm succeeds at separating the straight clusters in Figure 1. We tested our algorithm in more elaborate settings, some of them described in Section 4.

Other methods with a spectral component include those of Polito and Perona (2001) and Goh and Vidal (2007), which (roughly speaking) embed the points by a variant of LLE (Saul and Roweis, 2003) and then group the points by K-means clustering. There is also the method of Elhamifar and Vidal (2011), which chooses the neighborhood of each point by computing a sparse linear combination of the remaining points, followed by an application of the spectral graph partitioning algorithm of Ng et al. (2002). Note that these methods work under the assumption that the surfaces do not intersect.

Besides spectral-type approaches to multi-manifold clustering, other methods appear in the literature.
Among these, Gionis et al. (2005) and Haro et al. (2007) allow for intersecting surfaces, but assume that the surfaces have different intrinsic dimension or density, and the proposed methodology is entirely based on such differences. We also mention the K-manifolds method of Souvenir and Pless (2005), which proposes an EM-type algorithm, and that of Guo et al. (2007), which proposes to minimize an energy functional based on pairwise distances and local curvatures, leading to a combinatorial optimization.

Our contribution is the design and detailed study of a prototypical spectral clustering algorithm based on local PCA, tailored to settings where the underlying clusters come from sampling in the vicinity of smooth surfaces that may intersect. We endeavored to simplify the algorithm as much as possible without sacrificing performance. We provide theoretical results for simpler variants within a standard mathematical framework for multi-manifold clustering. To our knowledge, these are the first mathematically backed successes at the task of resolving intersections in the context of multi-manifold clustering, with the exception of (Arias-Castro et al., 2011), where the corresponding algorithm is shown to succeed at separating intersecting curves. The salient features of our algorithm are illustrated via numerical experiments.

The rest of the paper is organized as follows. In Section 2, we introduce our method in various variants. In Section 3, we analyze the simpler variants in a standard mathematical framework for multi-manifold learning. In Section 4, we perform some numerical experiments illustrating several features of our approach. In Section 5, we discuss possible extensions.

2. The Methodology

We introduce our algorithm and simpler variants that are later analyzed in a mathematical framework. We start with some review of the literature, zooming in on the most closely related publications.

2.1 Some Precedents

Using local PCA within a spectral clustering algorithm was implemented in four other publications we know of (Goldberg et al., 2009; Kushnir et al., 2006; Gong et al., 2012; Wang et al., 2011).

As a first stage in their semi-supervised learning method, Goldberg, Zhu, Singh, Xu, and Nowak (2009) design a spectral clustering algorithm. The method starts by subsampling the data points, obtaining centers in the following way: draw $y_1$ at random from the data and remove its $\ell$ nearest neighbors from the data; then repeat with the remaining data, obtaining centers $y_1, y_2, \ldots$. Let $C_i$ denote the sample covariance of the neighborhood of $y_i$ made of its $\ell$ nearest neighbors. An m-nearest-neighbor graph is then defined on the centers in terms of Mahalanobis distances. Explicitly, the centers $y_i$ and $y_j$ are connected in the graph if $y_j$ is among the m nearest neighbors of $y_i$ in the Mahalanobis distance

$\|C_i^{-1/2}(y_i - y_j)\|$,   (1)

or vice versa. The parameters $\ell$ and m are both chosen of order log n. An existing edge between $y_i$ and $y_j$ is then weighted by $\exp(-H_{ij}^2/\eta^2)$, where $H_{ij}$ denotes the Hellinger distance between the probability distributions $N(0, C_i)$ and $N(0, C_j)$. The spectral graph partitioning algorithm of Ng, Jordan, and Weiss (2002), detailed in Algorithm 1, is then applied to the resulting affinity matrix, together with some form of constrained K-means. We note that Goldberg et al. (2009) evaluate their method in the context of semi-supervised learning, where the clustering routine is only required to return subclusters of actual clusters. In particular, the data points other than the centers are discarded.
Note also that their evaluation is empirical.

Algorithm 1: Spectral Graph Partitioning (Ng, Jordan, and Weiss, 2002)
Input: affinity matrix $W = (W_{ij})$; size of the partition K.
Steps:
1: Compute $Z = (Z_{ij})$ according to $Z_{ij} = W_{ij}/\sqrt{\Delta_i \Delta_j}$, with $\Delta_i := \sum_{j=1}^n W_{ij}$.
2: Extract the top K eigenvectors of Z.
3: Renormalize each row of the resulting $n \times K$ matrix.
4: Apply K-means to the row vectors.

The algorithm proposed by Kushnir, Galun, and Brandt (2006) is multiscale and works by coarsening the neighborhood graph, computing sampling density and geometric information (such as obtained via PCA in local neighborhoods) along the way. This bottom-up flow is then followed by a top-down pass, and the two are iterated a few times. The algorithm is too complex to be described in detail here, and probably too complex to be analyzed mathematically. The clustering method of Goldberg et al. (2009) and ours can be seen as simpler variants that only go bottom-up and coarsen the graph only once.

In the last stages of writing this paper, we learned of the works of Wang, Jiang, Wu, and Zhou (2011) and Gong, Zhao, and Medioni (2012), who propose algorithms very similar to our Algorithm 4 detailed below. Note that these publications do not provide any theoretical guarantees for their methods; such guarantees are one of our main contributions here.

2.2 Our Algorithms

We now describe our method and propose several variants. Our setting is standard: we observe data points $x_1, \ldots, x_n \in \mathbb{R}^D$ that we assume were sampled in the vicinity of K smooth surfaces embedded in $\mathbb{R}^D$. The setting is formalized later, in Section 3.1.

2.2.1 Connected Component Extraction: Comparing Local Covariances

We start with our simplest variant, which is also the most natural. The method depends on a neighborhood radius r > 0, a spatial scale parameter $\varepsilon > 0$ and a covariance (relative) scale $\eta > 0$. For a vector x, $\|x\|$ denotes its Euclidean norm, and for a (square) matrix A, $\|A\|$ denotes its spectral norm. For $n \in \mathbb{N}$, we denote by $[n]$ the set $\{1, \ldots, n\}$. Given a data set $x_1, \ldots, x_n$, for any point $x \in \mathbb{R}^D$ and r > 0, define the neighborhood $N_r(x) = \{x_j : \|x - x_j\| \le r\}$.

Algorithm 2: Connected Component Extraction: Comparing Covariances
Input: data points $x_1, \ldots, x_n$; neighborhood radius r > 0; spatial scale $\varepsilon > 0$; covariance scale $\eta > 0$.
Steps:
1: For each $i \in [n]$, compute the sample covariance matrix $C_i$ of $N_r(x_i)$.
2: Remove $x_i$ when there is $x_j$ such that $\|x_j - x_i\| \le r$ and $\|C_j - C_i\| > \eta r^2$.
3: Compute the following affinities between data points:
$W_{ij} = \mathbb{1}\{\|x_i - x_j\| \le \varepsilon\} \cdot \mathbb{1}\{\|C_i - C_j\| \le \eta r^2\}$.   (2)
4: Extract the connected components of the resulting graph.
5: Each point removed in Step 2 is grouped with the closest point that survived Step 2.

In words, the algorithm first computes local covariances (Step 1). It removes points that are believed to be very near an intersection (Step 2); we elaborate on this below. With the remaining points, it creates an unweighted graph: the nodes of this graph are the data points, and an edge is formed between two nodes if both the distance between these nodes and the distance between the local covariance structures at these nodes are sufficiently small (Step 3). The connected components of the resulting graph are extracted (Step 4) and the points that survived Step 2 are labeled accordingly. Each point removed in Step 2 receives the label of its closest labeled point (Step 5).
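To make the procedure concrete, here is a minimal Python sketch of Algorithm 2 (numpy/scipy). It is our own illustration, not the authors' code: it assumes dense $n \times n$ distance matrices fit in memory, and the function names are ours.

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform
from scipy.sparse import csr_matrix
from scipy.sparse.csgraph import connected_components

def local_covariances(X, r):
    """Step 1: sample covariance of the r-neighborhood of each point."""
    D2 = squareform(pdist(X))
    covs = [np.cov(X[D2[i] <= r], rowvar=False, bias=True) for i in range(len(X))]
    return np.array(covs), D2

def algorithm2(X, r, eps, eta):
    covs, D2 = local_covariances(X, r)
    n = len(X)
    # Pairwise spectral-norm distances between local covariance matrices
    cov_dist = np.array([[np.linalg.norm(covs[i] - covs[j], 2)
                          for j in range(n)] for i in range(n)])
    # Step 2: remove points believed to be near an intersection (or boundary)
    removed = ((D2 <= r) & (cov_dist > eta * r**2)).any(axis=1)
    keep = np.where(~removed)[0]
    # Step 3: affinity (2), restricted to the surviving points
    W = (D2[np.ix_(keep, keep)] <= eps) & (cov_dist[np.ix_(keep, keep)] <= eta * r**2)
    # Step 4: extract connected components of the resulting unweighted graph
    _, keep_labels = connected_components(csr_matrix(W), directed=False)
    # Step 5: removed points inherit the label of the closest surviving point
    labels = np.empty(n, dtype=int)
    labels[keep] = keep_labels
    for i in np.where(removed)[0]:
        labels[i] = labels[keep[np.argmin(D2[i, keep])]]
    return labels
```

The quadratic cost in n is acceptable here because Algorithm 2 is a conceptual device for the analysis rather than the practical method (that role falls to Algorithm 4 below).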
In principle, the neighborhood radius r is chosen just large enough that performing PCA in each neighborhood yields a reliable estimate of the local covariance structure. For this, the number of points inside the neighborhood needs to be large enough, which depends on the sample size n, the sampling density, the intrinsic dimension of the surfaces and their surface area (Hausdorff measure), how far the points are from the surfaces (i.e., the noise level), and the regularity of the surfaces. The spatial scale parameter $\varepsilon$ depends on the sampling density and on r. It is meant to be larger than r, and needs to be large enough that a point has plenty of points within distance $\varepsilon$, including some across an intersection, so that each cluster is strongly connected. At the same time, $\varepsilon$ needs to be small enough that a local linear approximation to the surfaces is a relevant feature of proximity. Its choice is rather similar to the choice of the scale parameter in standard spectral clustering (Ng et al., 2002; Zelnik-Manor and Perona, 2005). The covariance scale $\eta$ needs to be large enough that centers from the same cluster and within distance $\varepsilon$ of each other have local covariance matrices within distance $\eta r^2$, but small enough that points from different clusters near their intersection have local covariance matrices separated by a distance substantially larger than $\eta r^2$. This depends on the curvature of the surfaces and on the incidence angle at the intersection of two (or more) surfaces. Note that a typical covariance matrix over a ball of radius r has norm of order $r^2$, which justifies our choice of parametrization. In the mathematical framework we introduce later on, these parameters can be chosen automatically, as done in (Arias-Castro et al., 2011), at least when the points are sampled exactly on the surfaces. We will not elaborate on this, since in practice it does not inform our choice of parameters.

The rationale behind Step 2 is as follows. As we just discussed, the parameters need to be tuned so that points from the same cluster and within distance $\varepsilon$ have local covariance matrices within distance $\eta r^2$. Strictly speaking, this is true of points away from the boundary of the underlying surface. Hence, although $x_i$ and $x_j$ in Step 2 are meant to be from different clusters, they could instead be from the same surface near its boundary. In the former situation, since they are near each other, our model implies that they are close to an intersection. Therefore, roughly speaking, Step 2 removes points near an intersection, but also points near the boundaries of the underlying surfaces. The reason we require $x_i$ and $x_j$ to be within distance r, as opposed to $\varepsilon$, is that otherwise removing the points in Step 2 would create a gap of width $\varepsilon$ near an intersection, which could then not be bridged in Steps 3-4. (Alternatively, one could replace r with some $\xi \in (r, \varepsilon)$ in Step 2, but this would add the additional parameter $\xi$ to the algorithm.) This step is in fact crucial, as the local covariance varies smoothly along the intersection of two smooth surfaces.

Although this method works in simple situations like that of two intersecting segments (Figure 1), it is not meant to be practical. Indeed, extracting connected components is known to be sensitive to spurious points and therefore unstable. Furthermore, we found that comparing local covariance matrices as in affinity (2) tends to be less stable than comparing local projections as in affinity (3), which brings us to our next variant.
2.2.2 Connected Component Extraction: Comparing Local Projections

We present another variant, also based on extracting the connected components of a neighborhood graph, that compares orthogonal projections onto the largest principal directions; see Algorithm 3. We note that the local intrinsic dimension is determined by thresholding the eigenvalues of the local covariance matrix, keeping the directions with eigenvalues within some range of the largest eigenvalue. The same strategy is used by Kushnir et al. (2006), but with a different threshold. The method is a hard version of what we actually implemented, which we describe in Algorithm 4.

We note that neither algorithm includes an intersection-removal step like Step 2 in Algorithm 2. The reason is that the algorithms work without such a step. Indeed, we show in Theorem 1 that Algorithm 3 is able to separate the clusters; the only drawback is that it may possibly treat the intersection of two surfaces as a cluster of its own. And we show via numerical experiments in Section 4 that Algorithm 4 performs well in a number of situations.

Algorithm 3: Connected Component Extraction: Comparing Projections
Input: data points $x_1, \ldots, x_n$; neighborhood radius r > 0; spatial scale $\varepsilon > 0$; projection scale $\eta > 0$.
Steps:
1: For each $i \in [n]$, compute the sample covariance matrix $C_i$ of $N_r(x_i)$.
2: Compute the projection $Q_i$ onto the eigenvectors of $C_i$ with corresponding eigenvalue exceeding $\eta\|C_i\|$.
3: Compute the following affinities between data points:
$W_{ij} = \mathbb{1}\{\|x_i - x_j\| \le \varepsilon\} \cdot \mathbb{1}\{\|Q_i - Q_j\| \le \eta\}$.   (3)
4: Extract the connected components of the resulting graph.

2.2.3 Covariances or Projections?

In our numerical experiments, we tried working both directly with covariance matrices as in (2) and with projections as in (3). (In these experiments we used spectral graph partitioning with soft versions of these affinities, as described in Section 2.2.4.) We found working with projections to be more reliable. The problem comes, in part, from boundaries. When a surface has a boundary, local covariances over neighborhoods that overlap with the boundary are quite different from local covariances over nearby neighborhoods that do not touch the boundary.

Consider the example of two segments, $S_1$ and $S_2$, intersecting at an angle $\theta \in (0, \pi/2)$ at their middle points, specifically

$S_1 = [-1, 1] \times \{0\}, \qquad S_2 = \{(x, x\tan\theta) : x \in [-\cos\theta, \cos\theta]\}$.

Assume there is no noise and that the sampling is uniform. Assume $r \in (0, \frac{1}{2}\sin\theta)$, so that the disc centered at $x_1 := (1/2, 0)$ does not intersect $S_2$, and the disc centered at $x_2 := \frac{1}{2}(\cos\theta, \sin\theta)$ does not intersect $S_1$. Let $x_0 = (1, 0)$. For $x \in S_1 \cup S_2$, let $C_x$ denote the local covariance at x over a ball of radius $r < \frac{1}{2}\sin\theta$, so that the neighborhoods of $x_0, x_1, x_2$ are "pure". The situation is depicted in Figure 2. Simple calculations yield

$C_{x_1} = \frac{r^2}{3}\begin{pmatrix} 1 & 0 \\ 0 & 0 \end{pmatrix}, \qquad C_{x_2} = \frac{r^2}{3}\begin{pmatrix} \cos^2\theta & \sin\theta\cos\theta \\ \sin\theta\cos\theta & \sin^2\theta \end{pmatrix},$

so that

$\|C_{x_0} - C_{x_1}\| = \frac{r^2}{4}, \qquad \|C_{x_1} - C_{x_2}\| = \frac{r^2}{3}\sin\theta.$

Figure 2: Two segments intersecting. The local covariances (within the disc neighborhoods drawn) at $x_1$ and $x_2$ are closer than the local covariances at $x_1$ and $x_0$, even though $x_0$ and $x_1$ are on the same segment.

Therefore, when $\sin\theta \le 3/4$ (roughly, $\theta \le 48°$), the local covariances at $x_0, x_1 \in S_1$ are farther apart (in operator norm) than those at $x_1 \in S_1$ and $x_2 \in S_2$.
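These closed-form discrepancies are easy to verify numerically. Below is a quick check (our own illustration, not from the paper) that estimates the three local covariances by densely sampling the two segments under the assumptions above; the projection analogue discussed next can be checked the same way.

```python
import numpy as np

theta = np.pi / 5            # intersection angle; here sin(theta) < 3/4
r = 0.2                      # neighborhood radius, < sin(theta)/2
u = np.array([np.cos(theta), np.sin(theta)])

# Dense uniform samples on S1 = [-1,1] x {0} and S2 = {t*u : t in [-1,1]}
t = np.linspace(-1, 1, 200001)
S1 = np.stack([t, np.zeros_like(t)], axis=1)
S2 = t[:, None] * u

def local_cov(X, x, r):
    nbrs = X[np.linalg.norm(X - x, axis=1) <= r]
    return np.cov(nbrs, rowvar=False, bias=True)

C0 = local_cov(S1, np.array([1.0, 0.0]), r)   # boundary point x0 of S1
C1 = local_cov(S1, np.array([0.5, 0.0]), r)   # interior point x1 of S1
C2 = local_cov(S2, 0.5 * u, r)                # interior point x2 of S2

spec = lambda A: np.linalg.norm(A, 2)
print(spec(C0 - C1), r**2 / 4)                  # both approx 0.0100
print(spec(C1 - C2), r**2 / 3 * np.sin(theta))  # both approx 0.0078
```

With these values, the same-segment discrepancy (0.0100) indeed exceeds the cross-segment one (0.0078), which is exactly the boundary effect the text describes.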
As for projections, however,

$Q_{x_0} = Q_{x_1} = \begin{pmatrix} 1 & 0 \\ 0 & 0 \end{pmatrix}, \qquad Q_{x_2} = \begin{pmatrix} \cos^2\theta & \sin\theta\cos\theta \\ \sin\theta\cos\theta & \sin^2\theta \end{pmatrix},$

so that

$\|Q_{x_0} - Q_{x_1}\| = 0, \qquad \|Q_{x_1} - Q_{x_2}\| = \sin\theta.$

While in theory points within distance r of the boundary account for a small portion of the sample, in practice this is not the case, at least not with the sample sizes that we are able to rapidly process. In fact, we find that spectral graph partitioning is challenged by having points near the boundary that are far (in affinity) from nearby points of the same cluster. This may explain why the (soft version of) affinity (3) yields better results than the (soft version of) affinity (2) in our experiments.

2.2.4 Spectral Clustering Based on Local PCA

The following variant is more robust in practice and is the algorithm we actually implemented; see Algorithm 4. The method assumes that the surfaces are of the same dimension d and that there are K of them, with both parameters K and d known. We note that $y_1, \ldots, y_{n_0}$ form an r-packing of the data. The underlying rationale for this coarsening is justified in (Goldberg et al., 2009) by the fact that the covariance matrices, and also the top principal directions, change smoothly with the location of the neighborhood, so that without subsampling these characteristics would not help detect the abrupt event of an intersection. The affinity (4) is of course a soft version of (3).

Algorithm 4: Spectral Clustering Based on Local PCA
Input: data points $x_1, \ldots, x_n$; neighborhood radius r > 0; spatial scale $\varepsilon > 0$; projection scale $\eta > 0$; intrinsic dimension d; number of clusters K.
Steps:
0: Pick one point $y_1$ at random from the data. Pick another point $y_2$ among the data points not included in $N_r(y_1)$, and repeat the process, selecting centers $y_1, \ldots, y_{n_0}$.
1: For each $i = 1, \ldots, n_0$, compute the sample covariance matrix $C_i$ of $N_r(y_i)$. Let $Q_i$ denote the orthogonal projection onto the space spanned by the top d eigenvectors of $C_i$.
2: Compute the following affinities between center pairs:
$W_{ij} = \exp\left(-\frac{\|y_i - y_j\|^2}{\varepsilon^2}\right)\exp\left(-\frac{\|Q_i - Q_j\|^2}{\eta^2}\right)$.   (4)
3: Apply spectral graph partitioning (Algorithm 1) to W.
4: The data points are clustered according to the closest center in Euclidean distance.

2.2.5 Comparison with Closely Related Methods

We highlight some differences with the other proposals in the literature. We first compare our approach to that of Goldberg et al. (2009), which was our main inspiration.

Neighborhoods. Compared with Goldberg et al. (2009), we define neighborhoods over r-balls instead of $\ell$-nearest neighbors, and connect points over $\varepsilon$-balls instead of m-nearest neighbors. This choice is for convenience, as the two constructions are essentially equivalent when the sampling density is fairly uniform. This is elaborated at length in (Maier et al., 2009; Brito et al., 1997; Arias-Castro, 2011).

Mahalanobis distances. Goldberg et al. (2009) use the Mahalanobis distances (1) between centers. In our version, we could for example replace the Euclidean distance $\|x_i - x_j\|$ in the affinity (2) with the average Mahalanobis distance

$\|C_i^{-1/2}(x_i - x_j)\| + \|C_j^{-1/2}(x_j - x_i)\|$.

We actually tried this and found that the algorithm was less stable, particularly under low noise. Introducing a regularization in this distance (which requires the introduction of another parameter) solves this problem, at least partially. That said, using Mahalanobis distances makes the procedure less sensitive to the choice of $\varepsilon$, in that neighborhoods may include points from different clusters.
Think of two parallel line segments separated by a distance $\delta$, and assume there is no noise, so the points are sampled exactly from these segments. Assuming an infinite sample size, the local covariance is the same everywhere, so that points within distance $\varepsilon$ are connected by the affinity (2). Hence, Algorithm 2 requires that $\varepsilon < \delta$. In terms of Mahalanobis distances, points on different segments are infinitely separated, so a version based on these distances would work with any $\varepsilon > 0$. In the case of curved surfaces and/or noise, the situation is similar, though not as evident. Even then, the gain in performance is not obvious, since we only require that $\varepsilon$ be slightly larger in order of magnitude than r.

Hellinger distances. As we mentioned earlier, Goldberg et al. (2009) use the Hellinger distance between the probability distributions $N(0, C_i)$ and $N(0, C_j)$ to compare covariance matrices, specifically

$H_{ij}^2 = 1 - \frac{2^{D/2}\det(C_i C_j)^{1/4}}{\det(C_i + C_j)^{1/2}}$   (5)

if $C_i$ and $C_j$ are full-rank. While using these distances or the Frobenius distances makes little difference in practice, we find it easier to work with the latter when it comes to proving theoretical guarantees. Moreover, it seems more natural to assume a uniform sampling distribution in each neighborhood rather than a normal distribution, so that using the more sophisticated similarity (5) does not seem justified.

K-means. We use K-means++ for a good initialization. We found that the more sophisticated size-constrained K-means (Bradley et al., 2000) used in (Goldberg et al., 2009) did not improve the clustering results.

As we mentioned above, our work was developed in parallel to that of Wang et al. (2011) and Gong et al. (2012). We highlight some differences. First, there is no subsampling in their methods; rather, the local tangent space is estimated at each data point $x_i$. Wang et al. (2011) fit a mixture of d-dimensional affine subspaces to the data using MPPCA (Tipping and Bishop, 1999), which is then used to estimate the tangent subspace at each data point. Gong et al. (2012) develop a form of robust local PCA. While Wang et al. (2011) assume all surfaces are of the same dimension, known to the user, Gong et al. (2012) estimate the dimension locally by looking at the largest gap in the spectrum of the estimated local covariance matrix. This is similar in spirit to what is done in Step 2 of Algorithm 3, but we did not include this step in Algorithm 4 because we did not find it reliable in practice. We also tried estimating the local dimensionality using the method of Little et al. (2009), but this failed in the most complex cases.

Wang et al. (2011) use a nearest-neighbor graph, and their affinity is defined as

$W_{ij} = \Delta_{ij}\Big(\prod_{s=1}^{d}\cos\theta_s(i,j)\Big)^{\alpha}$,

where $\Delta_{ij} = 1$ if $x_i$ is among the $\ell$ nearest neighbors of $x_j$, or vice versa, while $\Delta_{ij} = 0$ otherwise; $\theta_1(i,j), \ldots, \theta_d(i,j)$ are the principal (a.k.a. canonical) angles (Stewart and Sun, 1990) between the estimated tangent subspaces at $x_i$ and $x_j$; $\ell$ and $\alpha$ are parameters of the method. Gong et al. (2012) define an affinity that incorporates the self-tuning method of Zelnik-Manor and Perona (2005); in our notation, their affinity is

$\exp\left(-\frac{\|x_i - x_j\|^2\,(\sin^{-1}(\|Q_i - Q_j\|))^2}{\eta^2} - \frac{\|x_i - x_j\|^2}{\varepsilon_i\varepsilon_j}\right)$,

where $\varepsilon_i$ is the distance from $x_i$ to its $\ell$-th nearest neighbor; $\ell$ is a parameter. Although we do not analyze their respective ways of estimating the tangent subspaces, our analysis provides essential insights into their methods and, for that matter, any other method built on spectral clustering based on tangent subspace comparisons.
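Summarizing Section 2.2, here is a compact Python sketch of Algorithm 4, including the spectral graph partitioning of Algorithm 1. It is a minimal implementation under our own choices (a greedy random r-packing for Step 0, dense eigendecomposition, scikit-learn's K-means with its default K-means++ initialization), not the authors' reference code.

```python
import numpy as np
from scipy.spatial.distance import cdist
from sklearn.cluster import KMeans

def algorithm4(X, r, eps, eta, d, K, rng=None):
    rng = np.random.default_rng(0) if rng is None else rng
    # Step 0: greedy random r-packing of the data to select centers
    centers = []
    for i in rng.permutation(len(X)):
        if all(np.linalg.norm(X[i] - X[c]) > r for c in centers):
            centers.append(i)
    Y = X[centers]

    # Step 1: local PCA -> projection onto the top-d principal directions
    Qs = []
    for y in Y:
        nbrs = X[np.linalg.norm(X - y, axis=1) <= r]
        C = np.cov(nbrs, rowvar=False, bias=True)
        _, vecs = np.linalg.eigh(C)            # eigenvalues in ascending order
        V = vecs[:, -d:]                       # top-d eigenvectors
        Qs.append(V @ V.T)
    Qs = np.array(Qs)

    # Step 2: affinity (4), a soft version of (3)
    n0 = len(Y)
    dist2 = cdist(Y, Y, "sqeuclidean")
    proj2 = np.array([[np.linalg.norm(Qs[i] - Qs[j], 2) ** 2 for j in range(n0)]
                      for i in range(n0)])
    W = np.exp(-dist2 / eps**2) * np.exp(-proj2 / eta**2)

    # Step 3: spectral graph partitioning (Algorithm 1)
    Delta = W.sum(axis=1)
    Z = W / np.sqrt(np.outer(Delta, Delta))
    _, vecs = np.linalg.eigh(Z)
    U = vecs[:, -K:]                           # top-K eigenvectors of Z
    U /= np.linalg.norm(U, axis=1, keepdims=True)
    center_labels = KMeans(n_clusters=K, n_init=10).fit_predict(U)

    # Step 4: each data point inherits the label of its closest center
    return center_labels[cdist(X, Y).argmin(axis=1)]
```

Because the eigendecomposition runs on the $n_0 \times n_0$ matrix of centers rather than on all n points, this version scales to the larger data sets discussed in Section 4.2.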
3. Mathematical Analysis

While the analysis of Algorithm 4 seems within reach, there are some complications due to the fact that points near the intersection may form a cluster of their own; we were not able to discard this possibility. Instead, we study the simpler variants described in Algorithm 2 and Algorithm 3. Even then, the arguments are rather complex and interestingly involved. The theoretical guarantees that we obtain for these variants are stated in Theorem 1 and proved in Section 6. We comment on the analysis of Algorithm 4 right after that. We note that there are very few theoretical results on resolving intersecting manifolds; in fact, we are only aware of (Arias-Castro et al., 2011), which operates under severe restrictions on the dimension of the intersection. Some such results have been established for a number of methods for subspace clustering (affine surfaces), for example, in (Chen and Lerman, 2009b; Soltanolkotabi and Candès, 2012; Soltanolkotabi et al., 2014; Wang and Xu, 2013; Heckel and Bölcskei, 2013; Tsakiris and Vidal, 2015; Ma et al., 2008).

The generative model we assume is a natural mathematical framework for multi-manifold learning, where points are sampled in the vicinity of smooth surfaces embedded in Euclidean space. For concreteness and ease of exposition, we focus on the situation where two surfaces (i.e., K = 2) of the same dimension $1 \le d \le D$ intersect. This special situation already contains all the geometric intricacies of separating intersecting clusters. On the one hand, clusters of different intrinsic dimension may be separated with an accurate estimation of the local intrinsic dimension, without further geometry involved (Haro et al., 2007). On the other hand, more complex intersections (3-way and higher) complicate the situation without offering truly new challenges. For simplicity of exposition, we assume that the surfaces are submanifolds without boundary, though it will be clear from the analysis (and the experiments) that the method can handle surfaces with (smooth) boundaries that may self-intersect. We discuss other possible extensions in Section 5.

Within this framework, we show that Algorithm 2 and Algorithm 3 are able to identify the clusters accurately, except for points near the intersection. Specifically, with high probability with respect to the sampling distribution, Algorithm 2 divides the data points into two groups such that, except for points within distance $C\varepsilon$ of the intersection, all points from the first cluster are in one group and all points from the second cluster are in the other group. The constant C depends on the surfaces, including their curvatures, the separation between them and the intersection angle. The situation for Algorithm 3 is more complex, as it may return more than two clusters, but most of each of the two clusters (again, away from the intersection) lie in separate connected components.

3.1 Generative Model

Each surface we consider is a connected, $C^2$ and compact submanifold without boundary, of dimension d, embedded in $\mathbb{R}^D$. Any such surface has a positive reach, which is what we use to quantify smoothness. The notion of reach was introduced by Federer (1959). Intuitively, a surface has reach exceeding r if and only if one can roll a ball of radius r on the surface without obstruction (Walther, 1997; Cuevas et al., 2012). Formally, for $x \in \mathbb{R}^D$ and $S \subset \mathbb{R}^D$, let

$\mathrm{dist}(x, S) = \inf_{s \in S}\|x - s\|, \qquad B(S, r) = \{x : \mathrm{dist}(x, S) < r\}$,

the latter often called the r-tubular neighborhood (or r-neighborhood) of S.
The reach of S is the supremum over r > 0 such that, for each $x \in B(S, r)$, there is a unique point in S nearest to x. It is well known that, for $C^2$ submanifolds, the reach bounds the radius of curvature from below (Federer, 1959, Lem. 4.17). For submanifolds without boundary, the reach coincides with the condition number introduced in (Niyogi et al., 2008).

When two surfaces $S_1$ and $S_2$ intersect, meaning $S_1 \cap S_2 \ne \emptyset$, we define their incidence angle as

$\theta(S_1, S_2) := \inf\big\{\theta_{\min}(T_{S_1}(s), T_{S_2}(s)) : s \in S_1 \cap S_2\big\}$,   (6)

where $T_S(s)$ denotes the tangent subspace of the submanifold S at the point $s \in S$, and $\theta_{\min}(T_1, T_2)$ is the smallest nonzero principal (a.k.a. canonical) angle between the subspaces $T_1$ and $T_2$ (Stewart and Sun, 1990).

The clusters are generated as follows. Each data point $x_i$ is drawn according to

$x_i = s_i + z_i$,   (7)

where $s_i$ is drawn from the uniform distribution over $S_1 \cup S_2$ and $z_i$ is an additive noise term satisfying $\|z_i\| \le \tau$; thus $\tau$ represents the noise or jitter level. When $\tau = 0$, the points are sampled exactly on the surfaces. We assume the points are sampled independently of each other. We let $I_k = \{i : s_i \in S_k\}$, and the goal is to recover the groups $I_1$ and $I_2$, up to some errors.

3.2 Performance Guarantees

We state some performance guarantees for Algorithm 2 and Algorithm 3.

Theorem 1. Consider two connected, compact, twice continuously differentiable submanifolds without boundary, of the same dimension d, intersecting at a strictly positive angle, with the intersection set having strictly positive reach. Assume the parameters are set so that

$\tau \le r\eta/C, \quad r \le \varepsilon/C, \quad \varepsilon \le \eta/C, \quad \eta \le 1/C$,   (8)

for a large-enough constant $C \ge 1$ that depends on the configuration. Then with probability at least $1 - Cn\exp(-nr^d\eta^2/C)$:

- Algorithm 2 returns exactly two groups such that two points from different clusters are not grouped together, unless one of them is within distance Cr of the intersection.
- Algorithm 3 returns at least two groups, such that two points from different clusters are not grouped together, unless one of them is within distance Cr of the intersection.

Thus, as long as (8) is satisfied, the algorithms have the above properties with high probability when $r^d\eta^2 \ge C'\log(n)/n$, where $C' > C$ is fixed. In particular, we may choose $\eta \asymp 1$ and $\varepsilon \asymp r \asymp \tau \vee (\log(n)/n)^{1/d}$, which lightens the computational burden. We note that, while the constant C > 0 does not depend on the sample size n, it depends in somewhat complicated ways on characteristics of the surfaces and their position relative to each other, such as their reach and intersection angle, but also on aspects that are harder to quantify, like their separation away from the intersection. We note, however, that it behaves as expected: C is decreasing in the reach and the intersection angle, and increasing in the intrinsic dimension of the surfaces, for example.

The algorithms may make clustering mistakes within distance Cr of the intersection, where $Cr \asymp \tau \vee (\log(n)/n)^{1/d}$ with the choice of parameters just described. Whether this is optimal in the nonparametric setting that we consider (for example, in a minimax sense) we do not know.

We now comment on the challenge of proving a similar result for Algorithm 4. This algorithm relies on knowledge of the intrinsic dimension d of the surfaces and of the number of clusters (here K = 2), but these may be estimated as in (Arias-Castro et al., 2011), at least in theory, so we assume these parameters are known.
The subsampling done in Step 0 does not pose any problem whatsoever, since the centers are well spread when the points themselves are. The difficulty resides in the application of the spectral graph partitioning, Algorithm 1. If we were to include the intersection-removal step (Step 2 of Algorithm 2) before applying spectral graph partitioning, then a simple adaptation of arguments in (Arias-Castro, 2011) would suffice. The real difficulty, and the potential pitfall of the method in this framework (without the intersection-removal step), is that the points near the intersection may form their own cluster. For example, in the simplest case of two affine surfaces intersecting at a positive angle with no sampling noise, the projection matrix at a point near the intersection (meaning a point whose r-ball contains a substantial piece of both surfaces) would be the projection matrix onto $S_1 + S_2$ seen as a linear subspace. We were not able to discard this possibility, although we do not observe it happening in practice. A possible remedy is to constrain the K-means step to only return large-enough clusters. However, a proper analysis of this would require a substantial amount of additional work, and we did not engage seriously in this pursuit.

4. Numerical Experiments

4.1 Some Illustrative Examples

We started by applying our method (the code is available online at https://math.cos.ucf.edu/tengz) on a few artificial examples to illustrate the theory. As we argued earlier, the methods of Wang et al. (2011) and Gong et al. (2012) are quite similar to ours, and we encourage the reader to also look at the numerical experiments they performed. Our numerical experiments should be regarded as a proof of concept, only here to show that our method can be implemented and works on some stylized examples. In all experiments, the number of clusters K and the dimension of the manifolds d are assumed known. We choose the spatial scale $\varepsilon$ and the projection scale $\eta$ automatically as follows: we let

$\varepsilon = \max_{1 \le i \le n_0}\ \min_{j \ne i}\ \|y_i - y_j\|$,   (9)

and

$\eta = \underset{(i,j):\ \|y_i - y_j\| < \varepsilon}{\mathrm{median}}\ \|Q_i - Q_j\|$.

Here, we implicitly assume that the union of all the underlying surfaces forms a connected set. In that case, the idea behind choosing $\varepsilon$ as in (9) is that we want the $\varepsilon$-graph on the centers $y_1, \ldots, y_{n_0}$ to be connected. Then $\eta$ is chosen so that a center $y_i$ remains connected in the $(\varepsilon, \eta)$-graph to most of its neighbors in the $\varepsilon$-graph.

The neighborhood radius r is chosen by hand for each situation. Although we do not know how to choose r automatically, there are some general ad hoc guidelines. When r is too large, the local linear approximation to the underlying surfaces may not hold in neighborhoods of radius r, making local PCA inappropriate. When r is too small, there might not be enough points in a neighborhood of radius r to accurately estimate the local tangent subspace of a surface at that location, making local PCA inaccurate. From a computational point of view, the smaller r, the larger the number of neighborhoods and the heavier the computations, particularly at the level of spectral graph partitioning. In our numerical experiments, we find that our algorithm is more sensitive to the choice of r when the clustering problem is more difficult. We note that the automatic choice of tuning parameters remains a challenge in clustering, and in machine learning at large, especially when no labels are available whatsoever; see (Zelnik-Manor and Perona, 2005; Zhang et al., 2012; Little et al., 2009; Kaslovsky and Meyer, 2011).
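The two scale heuristics translate directly into code. Below is a small sketch (ours), assuming the centers Y and the projection matrices Qs have already been computed as in Algorithm 4; the function name is hypothetical.

```python
import numpy as np
from scipy.spatial.distance import cdist

def auto_scales(Y, Qs):
    """Heuristic (9) for the spatial scale, and the median rule for eta."""
    D = cdist(Y, Y)
    np.fill_diagonal(D, np.inf)
    eps = D.min(axis=1).max()   # max over i of min over j != i of ||y_i - y_j||
    # eta: median projection discrepancy over center pairs closer than eps
    i_idx, j_idx = np.where(D < eps)
    proj = [np.linalg.norm(Qs[i] - Qs[j], 2) for i, j in zip(i_idx, j_idx)]
    return eps, float(np.median(proj))
```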
Since the algorithm is randomized (see Step 0 in Algorithm 4), we repeat each simulation 100 times and report the median misclustering rate, together with the number of runs in which the misclustering rate is smaller than 5%, 10%, and 15%.

We first run Algorithm 4 on several artificial data sets, shown in the left-hand columns of Figures 3 and 4. Table 1 reports the local radius r used for each data set (R denotes the global radius of the data set), together with the misclustering statistics. Typical clustering results are shown in the right-hand columns of Figures 3 and 4. It is evident that Algorithm 4 performs well in these simulations.

Table 1: Choices of r and misclustering statistics for the artificial data sets shown in Figures 3 and 4. The statistics are based on 100 repeats and include the median misclustering rate and the number of repeats where the misclustering rate is smaller than 5%, 10% and 15%.

data set                 | r             | median misclustering rate | <5% | <10% | <15%
Three curves             | 0.02 (0.034R) | 4.16%                     | 76  | 89   | 89
Self-intersecting curves | 0.1 (0.017R)  | 1.16%                     | 85  | 85   | 86
Two spheres              | 0.2 (0.059R)  | 3.98%                     | 100 | 100  | 100
Möbius strips            | 0.1 (0.028R)  | 2.22%                     | 85  | 86   | 88
Monkey saddle            | 0.1 (0.069R)  | 9.73%                     | 0   | 67   | 97
Paraboloids              | 0.07 (0.048R) | 10.42%                    | 0   | 12   | 91

Figure 3: Performance of Algorithm 4 on the data sets "Three curves" and "Self-intersecting curves". The left column shows the input data sets, and the right column shows typical clusterings.

In another simulation, we study how the success of our algorithm depends on the angle at which two curves intersect; see Table 2 and Figure 5. Here, we fix two curves intersecting at a point and gradually decrease the intersection angle by rotating one of them while holding the other fixed. The angles are $\pi/2$, $\pi/4$, $\pi/6$ and $\pi/8$. From the table we can see that our algorithm performs well when the angle is $\pi/4$, but the performance deteriorates as the angle becomes smaller, and the algorithm almost always fails when the angle is $\pi/8$.

Table 2: Choices of r and misclustering statistics for the instances of two intersecting curves shown in Figure 5. The statistics are based on 100 repeats and include the median misclustering rate and the number of repeats where the misclustering rate is smaller than 5%, 10% and 15%.

intersecting angle | r             | median misclustering rate | <5% | <10% | <15%
π/2                | 0.02 (0.034R) | 2.08%                     | 98  | 98   | 98
π/4                | 0.02 (0.034R) | 3.33%                     | 92  | 94   | 94
π/6                | 0.02 (0.034R) | 5.53%                     | 32  | 59   | 59
π/8                | 0.02 (0.033R) | 27.87%                    | 0   | 2    | 2

Figure 4: Performance of Algorithm 4 on the data sets "Two spheres", "Möbius strips", "Monkey saddle" and "Paraboloids". The left column shows the input data sets, and the right column shows typical clusterings.

Figure 5: Performance of Algorithm 4 on two curves intersecting at angles π/2, π/4, π/6 and π/8.

4.2 Comparison with Other Multi-Manifold Clustering Algorithms

In this section, we compare our algorithm with several existing algorithms for multi-manifold clustering. While many algorithms have been proposed, we focus on methods based on spectral clustering, including Sparse Manifold Clustering and Embedding (SMCE) (Elhamifar and Vidal, 2011) and the Local Linear Embedding approach of (Polito and Perona, 2001; Goh and Vidal, 2007).
Compared to these methods, a major difference of Algorithm 4 is the size of the affinity matrix W: SMCE and LLE each create an $n \times n$ affinity matrix, while our method creates a smaller affinity matrix of size $n_0 \times n_0$, based on the centers chosen in Step 0 of Algorithm 4. This difference enables our algorithm to handle large data sets (say $n > 10^4$), while the other two methods are computationally expensive due to the eigendecomposition of the $n \times n$ affinity matrix. In order to make a fair comparison, we run the following simulations and experiments on small data sets. We modified our algorithm to make it more competitive in this setting: we change Steps 0 and 1 of Algorithm 4 slightly and use all data points $\{x_i\}_{i=1}^n$ as centers, that is, $y_i = x_i$ for all $1 \le i \le n$. The motivation is that, when there is no computational constraint on the eigendecomposition (due to small n), we may improve Algorithm 4 by constructing a larger affinity matrix that includes all points as centers. The resulting algorithm is summarized in Algorithm 5.

Algorithm 5: Spectral Clustering Based on Local PCA (for small data sets)
Input: data points $x_1, \ldots, x_n$; neighborhood size N > 0; spatial scale $\varepsilon > 0$; projection scale $\eta > 0$; intrinsic dimension d; number of clusters K.
Steps:
1: For each $i = 1, \ldots, n$, compute the sample covariance matrix $C_i$ from the N nearest neighbors of $x_i$. Let $Q_i$ denote the orthogonal projection onto the space spanned by the top d eigenvectors of $C_i$.
2: Compute the following affinities between the n data points:
$W_{ij} = \exp\left(-\frac{\|x_i - x_j\|^2}{\varepsilon^2}\right)\exp\left(-\frac{\|Q_i - Q_j\|^2}{\eta^2}\right)$.
3: Apply spectral graph partitioning (Algorithm 1) to W.

First, we test the algorithms on a simulated data set of two curves, subsampled (300 data points) from the first data set in Figure 5. We plot the clustering results of Algorithm 5, SMCE and LLE in Figure 6. For Algorithm 5, the neighborhood size N is set to 10. For SMCE, we set $\lambda = 10$ and $L = 60$ (in (Elhamifar and Vidal, 2011), $\lambda$ is the $\ell_1$-penalty parameter, and at each point the optimization is restricted to the L nearest neighbors); we remark that similar results are obtained over a wide range of parameters. For LLE, we follow the implementation in (Polito and Perona, 2001): we use 10 nearest neighbors to embed the data set into $\mathbb{R}^2$ and run K-means on the embedded data set. It is clear from the figure that Algorithm 5 resolves the intersection well by using the affinity built from estimated local subspaces, while SMCE and LLE tend to give larger affinities between nearby data points and have difficulty handling the intersection.

Figure 6: Performance of Algorithm 5 on two intersecting curves. Top left: the input data set. Top right: the clustering result of Algorithm 5. Bottom left: the clustering result of SMCE. Bottom right: the clustering result of LLE.

Next, we run experiments on the Extended Yale Face Database B (Lee et al., 2005), with the goal of clustering face images of two different subjects. This data set contains face images of 39 subjects, each with 64 images of 192 × 168 pixels under varying lighting conditions. In our experiments, we found that the images of a person in this database lie roughly in a 4-dimensional subspace. We preprocess the data set by applying PCA to reduce the dimension to 8. We also normalize the covariance of the data set when performing the dimension reduction, so that the projected data set has unit covariance. We record the misclustering rates of Algorithm 5, SMCE and LLE in Table 3.
For SMCE, we follow (Elhamifar and Vidal, 2011) in setting $\lambda = 10$, and we let $L = 30$. For Algorithm 5, we let the neighborhood size be 40. From the table, we can see that the two methods perform similarly.

Table 3: Some misclustering rates for the Extended Yale Face Database B.

subjects  | [1,2] | [1,3]  | [1,4]  | [1,5] | [1,6] | [1,7] | [1,8] | [1,9]
Local PCA | 8.59% | 11.72% | 10.94% | 4.69% | 8.59% | 7.81% | 5.47% | 7.03%
SMCE      | 8.59% | 11.72% | 8.59%  | 0.00% | 4.69% | 8.59% | 9.38% | 4.69%

5. Discussion

We distilled the ideas of Goldberg et al. (2009) and of Kushnir et al. (2006) to cluster points sampled near smooth surfaces. The key ingredient is the use of local PCA to learn about the local spread and orientation of the data, and the use of that information in an affinity when building a neighborhood graph. In a typical stylized setting for multi-manifold clustering, we established performance bounds for the simple variants described in Algorithm 2 and Algorithm 3, which essentially consist of connecting points that are close in both space and orientation, and then extracting the connected components of the resulting graph. Both are shown to resolve general intersections as long as the incidence angle is strictly positive and the parameters are carefully chosen. As is commonly the case in such analyses, our setting can be generalized to other sampling schemes, to multiple intersections, to some features of the surfaces changing with the sample size, and so on, in the spirit of (Arias-Castro et al., 2011; Arias-Castro, 2011; Chen and Lerman, 2009b). We chose to simplify the setup as much as possible while retaining the essential features that make resolving intersecting clusters challenging. The resulting arguments are nevertheless rich enough to satisfy the mathematically thirsty reader. Whether the conditions required in Theorem 1 are optimal in some sense is an interesting and challenging open question for future research. Note that very few optimality results exist for manifold clustering; see (Arias-Castro, 2011) for an example.

We implemented a spectral version of Algorithm 3, described in Algorithm 4, which assumes that the intrinsic dimension and the number of clusters are known. The resulting approach is very similar to what is offered by Wang et al. (2011) and Gong et al. (2012), although it was developed independently of these works. Algorithm 4 is shown to perform well in some simulated experiments, although it is somewhat sensitive to the choice of parameters. This is the case for all other methods for multi-manifold clustering we know of, and choosing the parameters automatically remains an open challenge in the field.

6. Proofs

We start with some additional notation. The ambient space is $\mathbb{R}^D$ unless noted otherwise. For a vector $v \in \mathbb{R}^D$, $\|v\|$ denotes its Euclidean norm, and for a real matrix $M \in \mathbb{R}^{D \times D}$, $\|M\|$ denotes the corresponding operator norm. For a point $x \in \mathbb{R}^D$ and r > 0, B(x, r) denotes the open ball with center x and radius r, i.e., $B(x, r) = \{y \in \mathbb{R}^D : \|y - x\| < r\}$. For a set S and a point x, define $\mathrm{dist}(x, S) = \inf\{\|x - y\| : y \in S\}$. For two points a, b in the same Euclidean space, b − a denotes the vector moving a to b. For a point a and a vector v in the same Euclidean space, a + v denotes the translate of a by v. We identify an affine subspace T with its corresponding linear subspace, for example, when saying that a vector belongs to T.
For two subspaces T and T′, of possibly different dimensions, let $\theta_{\max}(T, T')$, with $0 \le \theta_{\max}(T, T') \le \pi/2$, denote the largest, and $\theta_{\min}(T, T')$ the smallest, nonzero principal angle between T and T′ (Stewart and Sun, 1990). When v is a vector and T is a subspace, $\angle(v, T) := \theta_{\max}(\mathbb{R}v, T)$; this is the usual definition of the angle between v and T.

For a subset $A \subset \mathbb{R}^D$ and a positive integer d, $\mathrm{vol}_d(A)$ denotes the d-dimensional Hausdorff measure of A, and vol(A) is defined as $\mathrm{vol}_{\dim(A)}(A)$, where dim(A) is the Hausdorff dimension of A. For a Borel set A, let $\lambda_A$ denote the uniform distribution on A.

For a set $S \subset \mathbb{R}^D$ with reach at least $1/\kappa$, and x with $\mathrm{dist}(x, S) < 1/\kappa$, let $P_S(x)$ denote the metric projection of x onto S, that is, the point of S closest to x. Note that, if T is an affine subspace, then $P_T$ is the usual orthogonal projection onto T, and we let $\bar{P}_T$ denote the orthogonal projection onto the linear subspace of the same dimension and parallel to T.

Let $\mathcal{S}_d(\kappa)$ denote the class of connected, $C^2$ and compact d-dimensional submanifolds without boundary embedded in $\mathbb{R}^D$, with reach at least $1/\kappa$. For a submanifold $S \subset \mathbb{R}^D$, let $T_S(x)$ denote the tangent space of S at $x \in S$. We will often identify a linear map with its matrix in the canonical basis. For a symmetric (real) matrix M, let $\beta_1(M) \ge \beta_2(M) \ge \cdots$ denote its eigenvalues in decreasing order. We say that $f : \Omega \subset \mathbb{R}^D \to \mathbb{R}^D$ is C-Lipschitz if $\|f(x) - f(y)\| \le C\|x - y\|$ for all $x, y \in \Omega$. For two reals a and b, $a \vee b = \max(a, b)$ and $a \wedge b = \min(a, b)$. Additional notation will be introduced as needed.

6.1 Preliminaries

This section gathers a number of general results from geometry and probability. We took the time to package them into standalone lemmas that could be of independent interest, particularly to researchers working in machine learning and computational geometry. Where needed, we use C to denote a constant that may change with each appearance.

6.1.1 Smooth Surfaces and Their Tangent Subspaces

The following comes directly from (Federer, 1959, Th. 4.18(12)). It gives us a simple criterion for identifying the metric projection of a point onto a surface with given reach.

Lemma 1. Consider $S \in \mathcal{S}_d(\kappa)$ and $x \in \mathbb{R}^D$ such that $\mathrm{dist}(x, S) < 1/\kappa$. Then $s = P_S(x)$ if and only if $\|x - s\| < 1/\kappa$ and $x - s \perp T_S(s)$.

The following result is on approximating a smooth surface near a point by the tangent subspace at that point. It is based on (Federer, 1959, Th. 4.18(2)).

Lemma 2. For $S \in \mathcal{S}_d(\kappa)$ and any two points $s, s' \in S$,

$\mathrm{dist}(s', T_S(s)) \le \frac{\kappa}{2}\|s' - s\|^2$,   (10)

and when $\mathrm{dist}(s', T_S(s)) \le 1/\kappa$,

$\mathrm{dist}(s', T_S(s)) \le \kappa\|P_{T_S(s)}(s') - s\|^2$.   (11)

Moreover, for $t \in T_S(s)$ such that $\|s - t\| \le \frac{1}{3\kappa}$,

$\mathrm{dist}(t, S) \le \kappa\|t - s\|^2$.   (12)

Proof. Let T be short for $T_S(s)$. (Federer, 1959, Th. 4.18(2)) says that

$\mathrm{dist}(s' - s, T) \le \frac{\kappa}{2}\|s' - s\|^2$.   (13)

Immediately, we have $\mathrm{dist}(s' - s, T) = \|s' - P_T(s')\| = \mathrm{dist}(s', T)$, and (10) follows from that. Based on that and Pythagoras' theorem, we have

$\mathrm{dist}(s', T) = \|P_T(s') - s'\| \le \frac{\kappa}{2}\|s' - s\|^2 = \frac{\kappa}{2}\big(\|P_T(s') - s'\|^2 + \|P_T(s') - s\|^2\big)$,

so that

$\mathrm{dist}(s', T)\Big(1 - \frac{\kappa}{2}\mathrm{dist}(s', T)\Big) \le \frac{\kappa}{2}\|P_T(s') - s\|^2$,

and (11) follows easily from that. For (12), let $r = 1/(3\kappa)$ and $s' = P_T^{-1}(t)$, the latter being well-defined by Lemma 5 below, with $s'$ belonging to $B(s, r(1 + \kappa r)) \subset B(s, 4/(9\kappa))$. By (10), $\mathrm{dist}(s', T) \le 8/(81\kappa) < 1/\kappa$, and by (11),

$\mathrm{dist}(t, S) \le \|t - s'\| = \mathrm{dist}(s', T) \le \kappa\|t - s\|^2$.

This concludes the proof of (12).

We will need a bound on the angle between tangent subspaces of a smooth surface as a function of the distance between the corresponding points of contact.

Lemma 3 (Boissonnat et al. (2013)). For $S \in \mathcal{S}_d(\kappa)$ and any $s, s' \in S$,

$\sin\theta_{\max}(T_S(s), T_S(s')) \le 6\kappa\|s - s'\|$.   (14)
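As a quick sanity check of Lemma 3 (our illustration, not part of the paper), consider the unit circle, where $\kappa = 1$ and both the tangent directions and the bound can be evaluated in closed form:

```python
import numpy as np

# Unit circle: reach 1/kappa = 1. The point at angle t has tangent direction
# (-sin t, cos t), so the acute angle between the tangent lines at t1 and t2
# has sine |sin(t1 - t2)|, while the chord length is 2|sin((t1 - t2)/2)|.
kappa = 1.0
rng = np.random.default_rng(0)
t1, t2 = rng.uniform(0, 2 * np.pi, size=(2, 10000))
s1 = np.stack([np.cos(t1), np.sin(t1)], axis=1)
s2 = np.stack([np.cos(t2), np.sin(t2)], axis=1)

sin_angle = np.abs(np.sin(t1 - t2))          # sin theta_max between tangents
chord = np.linalg.norm(s1 - s2, axis=1)      # ||s - s'||
assert np.all(sin_angle <= 6 * kappa * chord + 1e-12)   # Lemma 3 holds
```

On the circle the bound holds with a large margin (the sharp constant there is 1 rather than 6, since $|\sin\Delta| \le 2|\sin(\Delta/2)|$); the constant 6 accommodates general submanifolds.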
The following lemma bounds the difference between the metric projection onto a surface and the orthogonal projection onto one of its tangent subspaces.

Lemma 4. Consider $S \in \mathcal{S}_d(\kappa)$ and $s \in S$. Then for any $x \in B(S, r)$ with $r < 1/\kappa$, we have

$\|P_S(x) - P_{T_S(s)}(x)\| \le C_4\kappa\|x - s\|^2$,

for a numeric constant $C_4 > 0$.

Proof. Let $T = T_S(s)$ and define $t = P_T(x)$, $s' = P_S(x)$ and $t' = P_T(s')$; also let $T' = T_S(s')$ and $\theta = \theta_{\max}(T, T')$. We have

$P_S(x) = P_{T'}(x) = \bar{P}_{T'}(x - s') + s'$, and $P_T(x) = \bar{P}_T(x - t') + t' = \bar{P}_T(x - s') + \bar{P}_T(s' - t') + t'$,

where the first identity uses Lemma 1, which gives $\bar{P}_{T'}(x - s') = 0$. Hence,

$\|P_S(x) - P_T(x)\| \le \|\bar{P}_{T'}(x - s') - \bar{P}_T(x - s')\| + \|\bar{P}_T(s' - t')\| + \|s' - t'\|$.

On the one hand,

$\|\bar{P}_T(s' - t')\| \le \|s' - t'\| = \mathrm{dist}(s', T) \le \frac{\kappa}{2}\|s' - s\|^2 \le 2\kappa\|x - s\|^2$,

by (10) and the fact that $\|s' - s\| \le \|s' - x\| + \|x - s\| = \mathrm{dist}(x, S) + \|x - s\| \le 2\|x - s\|$. On the other hand, applying Lemma 18 (see further down),

$\|\bar{P}_{T'}(x - s') - \bar{P}_T(x - s')\| \le \|\bar{P}_{T'} - \bar{P}_T\|\,\|x - s'\| = (\sin\theta)\|x - s'\|$,

with $\|x - s'\| \le \|x - s\|$ and, applying Lemma 3, $\sin\theta \le 6\kappa\|s - s'\| \le 12\kappa\|x - s\|$. All together, we conclude.

Below we state some properties of the projection onto a tangent subspace. A result similar to the first part was proved in (Arias-Castro et al., 2011, Lem. 2) based on results in (Niyogi et al., 2008), but the arguments are simpler here and the constants are sharper.

Lemma 5. There is a numeric constant $C_5 \ge 1$ such that the following holds. Take $S \in \mathcal{S}_d(\kappa)$, $s \in S$ and $r \le 1/(C_5\kappa)$, and let T be short for $T_S(s)$. Then $P_T$ is injective on $B(s, r) \cap S$, and its image contains $B(s, r') \cap T$, where $r' := (1 - C_5(\kappa r)^2)r$. Moreover, $P_T^{-1}$ has Lipschitz constant bounded by $1 + C_5(\kappa r)^2 \le 1 + \kappa r$ over $B(s, r) \cap T$.

Proof. Take $s', s'' \in S$ distinct such that $P_T(s') = P_T(s'')$. Equivalently, $s'' - s'$ is perpendicular to T. Let T′ be short for $T_S(s')$. By (13) and the fact that $\mathrm{dist}(v, T) = \|v\|\sin\angle(v, T)$ for any vector v and any linear subspace T, we have

$\sin\angle(s'' - s', T') \le \frac{\kappa}{2}\|s'' - s'\|$,

and by (14),

$\sin\theta_{\max}(T, T') \le 6\kappa\|s - s'\|$.

Now, by the triangle inequality,

$\frac{\pi}{2} = \angle(s'' - s', T) \le \angle(s'' - s', T') + \theta_{\max}(T, T')$,

so that

$\frac{\kappa}{2}\|s'' - s'\| \wedge 1 \ge \sin\Big(\frac{\pi}{2} - \sin^{-1}\big(6\kappa\|s - s'\| \wedge 1\big)\Big)$.

When $\|s - s'\| \le 1/(12\kappa)$, the RHS is bounded from below by $\sin(\pi/2 - \sin^{-1}(1/2)) = \sqrt{3}/2$, which then implies that $\frac{\kappa}{2}\|s'' - s'\| \ge \sqrt{3}/2$, that is, $\|s'' - s'\| \ge \sqrt{3}/\kappa$. This precludes the situation where $s', s'' \in B(s, 1/(12\kappa))$, so that $P_T$ is injective on $B(s, r) \cap S$ when $r \le 1/(12\kappa)$.

The same arguments imply that $P_T$ is an open map on $R := B(s, r) \cap S$. In particular, $P_T(R)$ contains an open ball in T centered at s, and $P_T(\partial R) = \partial P_T(R)$, with $\partial R = S \cap \partial B(s, r)$ since $\partial S = \emptyset$. Now take any ray out of s within T, which is necessarily of the form $s + \mathbb{R}_+ v$, where v is a unit vector in T. Let $t_a = s + av \in T$ for $a \in [0, \infty)$. Let $a^*$ be the infimum over all $a > 0$ such that $t_a \notin P_T(R)$. Note that $a^* > 0$ and $t_{a^*} \in P_T(\partial R)$, so that there is $s^* \in \partial R$ such that $P_T(s^*) = t_{a^*}$. Let $s_a = P_T^{-1}(t_a)$, which is well-defined on $[0, a^*]$ by definition of $a^*$ and the fact that $P_T$ is injective on R. Let $J_t$ denote the differential of $P_T^{-1}$ at t. We have that $\dot{s}_a = J_{t_a}v$ is the unique vector in $T_a := T_S(s_a)$ such that $\bar{P}_T(\dot{s}_a) = v$. Elementary geometry shows that

$\|\bar{P}_T(\dot{s}_a)\| = \|\dot{s}_a\|\cos\angle(\dot{s}_a, T) \ge \|\dot{s}_a\|\cos\theta_{\max}(T_a, T)$,

with

$\cos\theta_{\max}(T_a, T) \ge \cos\sin^{-1}\big(6\kappa\|s_a - s\| \wedge 1\big) \ge \zeta := \sqrt{1 - (6\kappa r)^2}$,

by (14) and the fact that $\|s_a - s\| \le r$ (and assuming $6\kappa r \le 1$). Since $\|\bar{P}_T(\dot{s}_a)\| = \|v\| = 1$, we have $\|\dot{s}_a\| \le 1/\zeta$, and this holds for all $a < a^*$. So we can extend $a \mapsto s_a$ to $[0, a^*]$ as a Lipschitz function with constant $1/\zeta$. Together with the fact that $s^* \in \partial B(s, r)$, this implies that $r = \|s^* - s\| = \|s_{a^*} - s_0\| \le a^*/\zeta$. Hence, $a^* \ge \zeta r$, and therefore $P_T(R)$ contains $B(s, \zeta r) \cap T$, as stated.

For the last part, assume $r \le 1/(C\kappa)$, with C large enough that there is a unique $h \le 1/(12\kappa)$ such that $\zeta h = r$, where $\zeta$ is redefined as $\zeta := \sqrt{1 - (6\kappa h)^2}$. Take $t' \in B(s, r) \cap T$ and let $s' = P_T^{-1}(t')$ and $T' = T_S(s')$.
We saw that $P_T^{-1}$ is Lipschitz with constant $1/\zeta$ along any ray emanating from s of length $\zeta h = r$, so that $\|s' - s\| \le (1/\zeta)\|t' - s\| \le r/\zeta = h$. The differential of $P_T$ at s′ is $\bar{P}_T$ itself, seen as a linear map between T′ and T. Then for any vector $u \in T'$, we have

$\|\bar{P}_T(u)\| = \|u\|\cos\angle(u, T) \ge \|u\|\cos\theta_{\max}(T', T)$,

with

$\cos\theta_{\max}(T', T) \ge \cos\sin^{-1}\big(6\kappa\|s' - s\| \wedge 1\big) \ge \sqrt{1 - (6\kappa h)^2} = \zeta$,

as before. Hence, $\|J_{t'}\| \le 1/\zeta$, and we proved this for all $t' \in B(s, r) \cap T$. This last set being convex, we can apply Taylor's theorem and get that $P_T^{-1}$ is Lipschitz on that set with constant $1/\zeta$. We then note that $1/\zeta = 1 + O(\kappa r)^2$.

6.1.2 Volumes and Uniform Distributions

Below is a result that quantifies how much the volume of a set changes when applying a Lipschitz map. This is well known in measure theory, and we provide a proof only for completeness.

Lemma 6. Suppose $\Omega$ is a measurable subset of $\mathbb{R}^D$ and $f : \Omega \subset \mathbb{R}^D \to \mathbb{R}^D$ is C-Lipschitz. Then for any measurable set $A \subset \Omega$ and real d > 0, $\mathrm{vol}_d(f(A)) \le C^d\,\mathrm{vol}_d(A)$.

Proof. By definition,

$\mathrm{vol}_d(A) = \lim_{t \to 0} V_d^t(A), \qquad V_d^t(A) := \inf_{(R_i) \in \mathcal{R}_t(A)} \sum_{i \in \mathbb{N}} \mathrm{diam}(R_i)^d$,

where $\mathcal{R}_t(A)$ is the class of countable sequences $(R_i : i \in \mathbb{N})$ of subsets of $\mathbb{R}^D$ such that $A \subset \bigcup_i R_i$ and $\mathrm{diam}(R_i) < t$ for all i. Since f is C-Lipschitz, $\mathrm{diam}(f(R)) \le C\,\mathrm{diam}(R)$ for any $R \subset \Omega$. Hence, for any $(R_i) \in \mathcal{R}_t(A)$, $(f(R_i)) \in \mathcal{R}_{Ct}(f(A))$. This implies that

$V_d^{Ct}(f(A)) \le \sum_{i \in \mathbb{N}} \mathrm{diam}(f(R_i))^d \le C^d \sum_{i \in \mathbb{N}} \mathrm{diam}(R_i)^d$.

Taking the infimum over $(R_i) \in \mathcal{R}_t(A)$, we get $V_d^{Ct}(f(A)) \le C^d V_d^t(A)$, and we conclude by taking the limit as $t \to 0$, noticing that $V_d^{Ct}(f(A)) \to \mathrm{vol}_d(f(A))$ while $V_d^t(A) \to \mathrm{vol}_d(A)$.

We compare below two uniform distributions. For two Borel probability measures P and Q on $\mathbb{R}^D$, TV(P, Q) denotes their total variation distance, meaning

$\mathrm{TV}(P, Q) = \sup\{|P(A) - Q(A)| : A \text{ Borel}\}$.

Remember that for a Borel set A, $\lambda_A$ denotes the uniform distribution on A.

Lemma 7. Suppose A and B are two Borel subsets of $\mathbb{R}^D$. Then

$\mathrm{TV}(\lambda_A, \lambda_B) \le \frac{4\,\mathrm{vol}(A \,\triangle\, B)}{\mathrm{vol}(A \cup B)}$.

Proof. If A and B are not of the same dimension, say dim(A) > dim(B), then $\mathrm{TV}(\lambda_A, \lambda_B) = 1$, since $\lambda_A(B) = 0$ while $\lambda_B(B) = 1$. And we also have $\mathrm{vol}(A \,\triangle\, B) = \mathrm{vol}_{\dim(A)}(A \,\triangle\, B) = \mathrm{vol}_{\dim(A)}(A) = \mathrm{vol}(A)$, and $\mathrm{vol}(A \cup B) = \mathrm{vol}_{\dim(A)}(A \cup B) = \mathrm{vol}_{\dim(A)}(A) = \mathrm{vol}(A)$, in both cases because $\mathrm{vol}_{\dim(A)}(B) = 0$. So the result holds in that case.

Therefore assume that A and B are of the same dimension, and WLOG that $\mathrm{vol}(A) \ge \mathrm{vol}(B)$. For any Borel set U,

$\lambda_A(U) - \lambda_B(U) = \frac{\mathrm{vol}(A \cap U)}{\mathrm{vol}(A)} - \frac{\mathrm{vol}(B \cap U)}{\mathrm{vol}(B)}$,   (15)

so that

$|\lambda_A(U) - \lambda_B(U)| \le \frac{|\mathrm{vol}(A \cap U) - \mathrm{vol}(B \cap U)|}{\mathrm{vol}(A)} + \frac{\mathrm{vol}(B \cap U)}{\mathrm{vol}(B)} \cdot \frac{|\mathrm{vol}(A) - \mathrm{vol}(B)|}{\mathrm{vol}(A)} \le \frac{2\,\mathrm{vol}(A \,\triangle\, B)}{\mathrm{vol}(A)}$,   (16)

and we conclude with the fact that $\mathrm{vol}(A \cup B) \le \mathrm{vol}(A) + \mathrm{vol}(B) \le 2\,\mathrm{vol}(A)$.

We now look at the projection of the uniform distribution on a neighborhood of a surface onto a tangent subspace. For a Borel probability measure P and a measurable function $f : \mathbb{R}^D \to \mathbb{R}^D$, $P^f$ denotes the push-forward (Borel) measure defined by $P^f(A) = P(f^{-1}(A))$.

Lemma 8. Suppose $A \subset \mathbb{R}^D$ is Borel and $f : A \to \mathbb{R}^D$ is invertible on f(A), and that both f and $f^{-1}$ are C-Lipschitz. Then

$\mathrm{TV}(\lambda_A^f, \lambda_{f(A)}) \le 2\big(C^{\dim(A)} - 1\big)$.

Proof. First, note that A and f(A) are both of the same dimension, and that $C \ge 1$ necessarily. Let d be short for dim(A). Take $U \subset f(A)$ Borel and let $V = f^{-1}(U)$. Then

$\lambda_A^f(U) = \frac{\mathrm{vol}(A \cap V)}{\mathrm{vol}(A)}, \qquad \lambda_{f(A)}(U) = \frac{\mathrm{vol}(f(A) \cap U)}{\mathrm{vol}(f(A))}$,

and as in (16),

$|\lambda_A^f(U) - \lambda_{f(A)}(U)| \le \frac{|\mathrm{vol}(A \cap V) - \mathrm{vol}(f(A) \cap U)|}{\mathrm{vol}(A)} + \frac{|\mathrm{vol}(A) - \mathrm{vol}(f(A))|}{\mathrm{vol}(A)}$.

f being invertible, we have $f(A \cap V) = f(A) \cap U$ and $f^{-1}(f(A) \cap U) = A \cap V$. Therefore, applying Lemma 6, we get

$C^{-d} \le \frac{\mathrm{vol}(f(A) \cap U)}{\mathrm{vol}(A \cap V)} \le C^d$,

so that

$|\mathrm{vol}(A \cap V) - \mathrm{vol}(f(A) \cap U)| \le (C^d - 1)\,\mathrm{vol}(A \cap V) \le (C^d - 1)\,\mathrm{vol}(A)$.

And taking V = A, we also get $|\mathrm{vol}(A) - \mathrm{vol}(f(A))| \le (C^d - 1)\,\mathrm{vol}(A)$.
From this we conclude. Now comes a technical result on the intersection of a smooth surface and a ball. Lemma 9. There is a constant C9 3 depending only on d such that the following is true. Take S Sd(κ), r < 1 C9κ and x RD such that dist(x, S) < κ. Let s = PS(x) and T = TS(s). Then vol PT (S B(x, r)) (T B(x, r)) C9κ( x s + κr2) vol(T B(x, r)). Proof. If dist(x, S) > r, then S B(x, r) = T B(x, r) = , and if dist(x, S) = r, then S B(x, r) = T B(x, r) = {s}, and in both cases the inequality holds trivially. So it suffices to consider the case where dist(x, S) < r. Let Ar = B(s, r), Br = B(x, r) and g = PT for short. Note that T Br = T Ar0 where r0 := (r2 δ2)1/2 and δ := x s . Take s1 S Br such that g(s1) is farthest from s, so that g(S Br) Ar1 where r1 := s g(s1) note that r1 r. Let ℓ1 = s1 g(s1) and y1 be the orthogonal projection of s1 onto the line (x, s). By Pythagoras theorem, Spectral Clustering Based on Local PCA we have x s1 2 = x y1 2 + y1 s1 2. We have x s1 r and y1 s1 = s g(s1) = r1. And because ℓ1 κr2 1 < r by (11), either y1 is between x and s, in which case x y1 = δ ℓ1, or s is between x and y1, in which case x y1 = δ+ℓ1. In any case, r2 r2 1 + (δ ℓ1)2, which together with ℓ1 κr2 1 implies r2 1 r2 δ2 + 2δℓ1 r2 0 + 2κr2 1δ, leading to r1 (1 2κδ) 1/2r0 (1 + 4κδ)r0 after noticing that δ r < 1/(3κ). From g(S Br) T Ar1, we get vol g(S Br) \ (T Br) vol(T Ar1) vol(T Ar0) = ((r1/r0)d 1) vol(T Ar0). We follow similar arguments to get a sort of reverse relationship. Take s2 S Br such that g(S Br) T Ar2, where r2 := s g(s2) is largest. Assuming r is small enough, by Lemma 5, g 1 is well-defined on T Ar, so that necessarily s2 Br. Let ℓ2 = s2 g(s2) and y2 be the orthogonal projection of s2 onto the line (x, s). By Pythagoras theorem, we have x s2 2 = x y2 2+ y2 s2 2. We have x s2 = r and y2 s2 = s g(s2) = r2. And by the triangle inequality, x y2 x s + y2 s = δ + ℓ2. Hence, r2 r2 2 + (δ + ℓ2)2, which together with ℓ2 κr2 2 by (11), implies r2 2 r2 δ2 2δℓ2 ℓ2 2 r2 0 (2δ + κr2)κr2 2, leading to r2 (1 + 2κδ + κ2r2) 1/2r0 (1 2κδ κ2r2)r0. From g(S Br) T Ar2, we get vol (T Br) \ g(S Br) vol(T Ar0) vol(T Ar2) = (1 (r2/r0)d) vol(T Ar0). All together, we have vol g(S Br) (T Br) (r1/r0)d (r2/r0)d vol(T Ar0) (1 + 4κδ)d (1 2κδ κ2r2)d vol((T Br)) Cdκ(δ + κr2) vol((T Br)), when δ r 1/(Cκ) and C is a large-enough numerical constant. We bound below the d-volume of a the intersection of a ball with a smooth surface. Although it could be obtained as a special case of Lemma 9, we provide a direct proof because this result is at the cornerstone of many results in the literature on sampling points uniformly on a smooth surface. Lemma 10. Suppose S Sd(κ). Then for any s S and r < 1 (d C5)κ, we have 1 2dκr vol(S B(s, r)) vol(T B(s, r)) 1 + 2dκr, where T := TS(s) is the tangent subspace of S at s. Proof. Let T = TS(s), Br = B(s, r) and g = PT for short. By Lemma 5, g is 1-Lipschitz and g 1 is (1 + κr)-Lipschitz on T Br, so by Lemma 6 we have (1 + κr) d vol(g(S Br)) vol(S Br) 1. Arias-Castro, Lerman, and Zhang That g 1 is Lipschitz with constant 1+κr on g(S Br) also implies that g(S Br) contains T Br where r := r/(1 + κr). From this, and the fact that g(S Br) T Br, we get 1 vol(T Br) vol(g(S Br)) vol(T Br) vol(T Br ) = rd r d = (1 + κr)d. We therefore have vol(S Br) vol(g(S Br)) (1 + κr) d vol(T Br), and vol(S Br) (1 + κr)d vol(g(S Br)) (1 + κr)d vol(T Br). And we conclude with the inequality (1 + x)d 1 + 2dx valid for any x [0, 1/d] and any d 1. 
We now look at the density of a sample from the uniform on a smooth, compact surface. Lemma 11. Consider S Sd(κ). There is a constant C11 > 0 depending on S such that the following is true. Sample n points s1, . . . , sn independently and uniformly at random from S. Take 0 < r < 1/(C11κ). Then with probability at least 1 C11r d exp( nrd/C11), any ball of radius r with center on S has between nrd/C11 and C11nrd sample points. Proof. For a set R, let N(R) denote the number of sample points in R. For any R measurable, N(R) Bin(n, p R), where p R := vol(R S)/ vol(S). Let x1, . . . , xm be a maximal (r/2)-packing of S, and let Bi = B(xi, r/4) S. For any s S, there is i such that s xi r/2, which implies Bi B(s, r) by the triangle inequality. Hence, mins S N(B(s, r)) mini N(Bi). By the fact that Bi Bj = for i = j, i=1 vol(Bi) m min i vol(Bi), and assuming that κr is small enough, we have min i vol(Bi) ωd by Lemma 10, where ωd is the volume of the d-dimensional unit ball. This leads to m Cr d and p := mini p Bi rd/C, when κr 1/C, where C > 0 depends only on S. Now, applying Bernstein s inequality to the binomial distribution, we get P (N(Bi) np/2) P (N(Bi) np Bi/2) e (3/32)np Bi e (3/32)np. We follow this with the union bound, to get P min s S N(B(s, r)) nrd/(2C) me (3/32)np Cr de 3 32C nrd. From this the lower bound follows. The proof of the upper bound is similar. Spectral Clustering Based on Local PCA Next, we bound the volume of the symmetric difference between two balls. Lemma 12. Take x, y Rd and 0 < δ 1. Then vol(B(x, δ) B(y, 1)) 2 vol(B(0, 1)) 1 (1 x y )d + δd. Proof. It suffices to prove the result when x y < 1. In that case, with γ := (1 x y ) δ, we have B(x, γ) B(x, δ) B(y, 1), so that vol(B(x, δ) B(y, 1)) = vol(B(x, δ)) + vol(B(y, 1)) 2 vol(B(x, δ) B(y, 1)) 2 vol(B(y, 1)) 2 vol(B(x, γ)) = 2 vol(B(0, 1))(1 γd). This concludes the proof. 6.1.3 Covariances The result below describes explicitly the covariance matrix of the uniform distribution over the unit ball of a subspace. Lemma 13. Let T be a subspace of dimension d. Then the covariance matrix of the uniform distribution on T B(0, a) (seen as a linear map) is equal to a2c P T , where c := 1 d+2. Proof. Assume WLOG that T = Rd {0}. Let X be distributed according to the uniform distribution on T B(0, 1) and let R = X . Note that P (R r) = vol(T B(0, r)) vol(T B(0, 1)) = rd, r [0, 1]. By symmetry, E(Xi Xj) = 0 if i = j, while E(X2 1) = 1 d E(X2 1 + + X2 d) = 1 d E(R2) = 1 0 r2(drd 1)dr = 1 d + 2. Now the covariance matrix of the uniform distribution on T B(0, a) equals E[(a X)(a X) ] = a2 E[XX ], which is exactly the representation of a2c P T in the canonical basis of RD. We now show that a bound on the total variation distance between two compactly supported distributions implies a bound on the difference between their covariance matrices. For a measure P on RD and an integrable function f, let P(f) denote the integral of f with respect to P, that is, P(f) = Z f(x)P(dx), and let E(P) = P(x) and Cov(P) = P(xx ) P(x)P(x) denote the mean and covariance matrix of P, respectively. Lemma 14. Suppose λ and ν are two Borel probability measures on Rd supported on B(0, 1). Then E(λ) E(ν) d TV(λ, ν), Cov(λ) Cov(ν) 3d TV(λ, ν). Arias-Castro, Lerman, and Zhang Proof. Let fk(t) = tk when t = (t1, . . . , td), and note that |fk(t)| 1 for all k and all t B(0, 1). By the fact that TV(λ, ν) = sup{λ(f) ν(f) : f : Rd R measurable with |f| 1}, we have |λ(fk) ν(fk)| TV(λ, ν), k = 1, . . . , d. 
E(λ) E(ν) 2 = k=1 (λ(fk) ν(fk))2 d TV(λ, ν)2, which proves the first part. Similarly, let fkℓ(t) = tktℓ. Since |fkℓ(t)| 1 for all k, ℓand all t B(0, 1), we have |λ(fkℓ) ν(fkℓ)| TV(λ, ν), k, ℓ= 1, . . . , d. Since for any probability measure µ on Rd, Cov(µ) = µ(fkℓ) µ(fk)µ(fℓ) : k, ℓ= 1, . . . , d , Cov(λ) Cov(ν) d max k,ℓ |λ(fkℓ) ν(fkℓ)| + |λ(fk)λ(fℓ) ν(fk)ν(fℓ)| d max k,ℓ |λ(fkℓ) ν(fkℓ)| + |λ(fk)||λ(fℓ) ν(fℓ)| +|ν(fℓ)||λ(fk) ν(fk)| 3d TV(λ, ν), using the fact that |λ(fk)| 1 and |ν(fk)| 1 for all k. The following compares the mean and covariance matrix of a distribution before and after transformation by a function. Lemma 15. Let λ be a Borel distribution with compact support A RD and consider a measurable function g : A 7 RD such that g(x) x η. Then E(λg) E(λ) η, Cov(λg) Cov(λ) η(diam(A) + η). Proof. Take X λ and let Y = g(X) λg. For the means, by Jensen s inequality, E X E Y E X Y = E X g(X) η. For the covariances, we first note that Cov(X) Cov(Y ) = 1 2 Cov(X Y, X + Y ) + Cov(X + Y, X Y ) , Spectral Clustering Based on Local PCA where Cov(U, V ) := E((U E U)(V E V )T ) is the cross-covariance of random vectors U and V . By Jensen s inequality, the fact uv T = u v for any pair of vectors u, v, and then the Cauchy-Schwarz inequality, Cov(U, V ) E( U E U V E V ) E( U E U 2)1/2 E( V E V 2)1/2. Cov(X) Cov(Y ) Cov(X Y, X + Y ) E X Y E X + E Y 2 1/2 E X + Y E X E Y 2 1/2. By the fact that the mean minimizes the mean-squared error, E X Y E X + E Y 2 1/2 E X Y 2 1/2 = E X g(X) 2 1/2 η. In the same vein, letting z RD be such that x z 1 2 diam(A) for all x A, we get E X + Y E X E Y 2 1/2 E X z + Y z 2 1/2 E X z 2 1/2 + E Y z 2 1/2 2 E X z 2 1/2 + E X g(X) 2 1/2 diam(A) + η. Using the triangle inequality twice. From this, we conclude. Next we compare the covariance matrix of the uniform distribution on a small piece of smooth surface with that of the uniform distribution on the projection of that piece onto a nearby tangent subspace. Lemma 16. There is a constant C16 > 0 depending only on d such that the following is true. Take S Sd(κ), r < 1 C16κ and x RD such that dist(x, S) r. Let s = PS(x) and T = TS(s). If ζ and ξ are the means, and M and N are the covariance matrices, of the uniform distributions on S B(x, r) and T B(x, r), respectively, then ζ ξ C16κr2, M N C16κr3. Proof. We focus on proving the bound on the covariances, and leave the bound on the means whose proof is both similar and simpler as an exercise to the reader. Let T = TS(s), Br = B(x, r) and g = PT for short. Let A = S Br and A = T Br. First, applying Lemma 15, assuming κr is sufficiently small, we get that Cov(λA) Cov(λg A) κ 2r2 2κr3, (38) because diam(A) 2r and g(x) x κ 2r2 for all x A by (10). Let λg(A) denote the uniform distribution on g(A). λg A and λg(A) are both supported on Br, so that applying Lemma 14 with proper scaling, we get Cov(λg A) Cov(λg(A)) 3dr2 TV(λg A, λg(A)). Arias-Castro, Lerman, and Zhang When κr is small enough, we know that g is 1-Lipschitz, and by Lemma 5, g 1 is welldefined and is (1 + κr)-Lipschitz on A . Hence, by Lemma 8 and the fact that dim(A) = d, we have TV(λg A, λg(A)) 2((1 + κr)d 1) 4dκr, using the inequality (1 + x)d 1 + 2dx, valid for any x [0, 1/d] and any d 1. Noting that λA is also supported on Br, applying Lemma 14 with proper scaling, we get Cov(λg(A)) Cov(λA ) 3dr2 TV(λg(A), λA ), TV(λg(A), λA ) 4vol(g(A) A ) vol(A ) 4C9κ(r + κr2) 8C9κr, by Lemma 7 and Lemma 9, and assuming that κr is small enough. 
By the triangle inequality, M N = Cov(λA) Cov(λA ) Cov(λA) Cov(λg A) + Cov(λg A) Cov(λg(A)) + Cov(λg(A)) Cov(λA ) 2κr3 + 12d2κr3 + 24d C9κr3. From this, we conclude. Next is a lemma on the estimation of a covariance matrix. The result is a simple consequence of the matrix Hoeffding inequality of Tropp (2012). Note that simply bounding the operator norm by the Frobenius norm, and then applying the classical Hoeffding inequality (Hoeffding, 1963) would yield a bound sufficient for our purposes, but this is a good opportunity to use a more recent and sophisticated result. Lemma 17. Let Cm denote the empirical covariance matrix based on an i.i.d. sample of size m from a distribution with support in B(0, r) Rd with covariance Σ. Then P Cm Σ > r2t 4d exp mt , t 0. (43) Proof. Without loss of generality, we assume that the distribution has zero mean and is now supported on B(0, 2r). In fact, by a simple rescaling, we may also assume that r = 1. Let x1, . . . , xm denote the sample, with xi = (xi,1, . . . , xi,d). We have i=1 xix T i , x := 1 Note that Cm Σ C m Σ + 1 Spectral Clustering Based on Local PCA Applying the union bound and then Hoeffding s inequality to each coordinate which is in [ 2, 2] we get j=1 P(| xj| > t/ d) 2d exp mt2 Noting that 1 m(xix T i Σ), i = 1, . . . , m, are independent, zero-mean, self-adjoint matrices with spectral norm bounded by 4/m, we may apply the matrix Hoeffding inequality (Tropp, 2012, Th. 1.3), to get P ( C m Σ > t) 2d exp t2 , σ2 := m(4/m)2 = 16/m. Applying the union bound and using the previous inequalities, we arrive at P ( Cm Σ > t) P ( C m Σ > t/2) + P x > p + 2d exp m2t which concludes the proof. 6.1.4 Projections We relate below the difference of two orthogonal projections with the largest principal angle between the corresponding subspaces. Lemma 18. For two affine non-null subspaces T, T , ( sin θmax(T, T ), if dim(T) = dim(T ), 1, otherwise. Proof. For two affine subspaces T, T RD of same dimension, let π 2 θ1 θD 0, denote the principal angles between them. By (Stewart and Sun, 1990, Th. I.5.5), the singular values of P T P T are {sin θj : j = 1, . . . , q}, so that P T P T = maxj sin θj = sin θ1 = sin θmax(T, T ). Suppose now that T and T are of different dimension, say dim(T) > dim(T ). We have P T P T P T P T = 1, since P T and P T are orthogonal projections and therefore positive semidefinite with operator norm equal to 1. Let L = P T (T ). Since dim(L) dim(T ) < dim(T), there is u T L with u = 0. Then v u = P T (v) u = 0 for all v T , implying that P T (u) = 0 and consequently (P T P T )u = u, so that P T P T 1. The lemma below is a perturbation result for eigenspaces and widely known as the sin Θ Theorem of Davis and Kahan (1970). See also (Luxburg, 2007, Th. 7) or (Stewart and Sun, 1990, Th. V.3.6). Arias-Castro, Lerman, and Zhang Lemma 19. Let M be positive semi-definite with eigenvalues β1 β2 . Suppose that βd βd+1 > 0. Then for any other positive semi-definite matrix N, P (d) N P (d) M 2 N M βd βd+1 , where P (d) M and P (d) M denote the orthogonal projections onto the top d eigenvectors of M and N, respectively. 6.1.5 Intersections We start with an elementary result on points near the intersection of two affine subspaces. Lemma 20. Take any two linear subspaces T1, T2 RD. For any point t1 T1 \ T2, we have dist(t1, T2) dist(t1, T1 T2) sin θmin(T1, T2). Proof. We may reduce the problem to the case where T1 T2 = {0}. Indeed, let T1 = T1 T 2 , T2 = T 1 T2 and t1 = t1 PT1 T2(t1). 
Then t1 PT2(t1) = t1 P T2( t1) , t1 PT1 T2(t1) = t1 , and sin θmin(T1, T2) = sin θmin( T1, T2). So assume that T1 T2 = {0}. By (Afriat, 1957, Th. 10.1), the angle formed by t1 and PT2(t1) is at least as large as the smallest principal angle between T1 and T2, which is θmin(T1, T2) since T1 T2 = {0}. From this the result follows immediately. The following result says that a point cannot be close to two compact and smooth surfaces intersecting at a positive angle without being close to their intersection. Note that the constant there cannot be solely characterized by κ, as it also depends on the separation between the surfaces away from their intersection. Lemma 21. Suppose S1, S2 Sd(κ) intersect at a strictly positive angle and that reach(S1 S2) 1/κ. Then there is a constant C21 such that dist(x, S1 S2) C21 max dist(x, S1), dist(x, S2) , x RD. Proof. Assume the result is not true, so there is a sequence (xn) RD such that dist(xn, S1 S2) > n maxk dist(xn, Sk). Because the surfaces are bounded, we may assume WLOG that the sequence is bounded. Then dist(xn, S1 S2) is bounded, which implies that maxk dist(xn, Sk) = O(1/n). This also forces dist(xn, S1 S2) 0. Indeed, otherwise there is a constant C > 0 and a subsequence (xn ) (xn) such that dist(xn , S1 S2) C. Since (xn ) is bounded, there is a subsequence (xn ) (xn ) that converges, and by the fact that maxk dist(xn , Sk) = o(1), and by compactness of Sk, the limit is necessarily in S1 S2, which is incompatible with the fact that dist(xn , S1 S2) C. Since dist(xn, S1 S2) 1, we have n max k dist(xn, Sk) < dist(xn, S1 S2) = o(1). (47) Spectral Clustering Based on Local PCA Assume n is large enough that dist(xn, S1 S2) < 1/κ and let sk n be the projection of xn onto Sk, and s n the projection of xn onto S1 S2. Let Tk = TSk(s n) and note that θmin(T1, T2) θ := θ(S1, S2) see definition (6). Let tk n be the projection of sk n onto Tk. Assume WLOG that t1 n s1 n t2 n s2 n . Let tn denote the projection of t1 n onto T1 T2, and then let sn = PS1 S2(tn). By (47), we have n max k xn sk n xn s n = o(1). (48) We start with the RHS: xn s n = min s S1 S2 xn s xn sn , (49) and first show that xn sn = o(1) too. We use the triangle inequality multiple times in what follows. We have xn sn xn s1 n + s1 n t1 n + t1 n tn + tn sn . (50) From (48), xn s1 n xn s n = o(1), so that s1 n s n = o(1), and by (10), s1 n t1 n κ 2 s1 n s n 2 κ( s1 n xn 2 + xn s n 2) = o(1). (51) We also have t1 n tn = min t T1 T2 t1 n t t1 n s n t1 n s1 n + s1 n xn + xn s n = o(1). (52) tn sn = min s S1 S2 tn s tn s n tn t1 n + t1 n s n = o(1). We now proceed. The last upper bound is rather crude. Indeed, using (12) for S = S1 S2 and s = s n, and noting that TS1 S2(s n) = T1 T2, and using the fact that tn s n = o(1), we get tn sn κ tn s n 2 κ( tn sn + sn xn + xn s n )2. Using (49), tn sn κ( tn sn + 2 sn xn )2 5κ xn sn 2, (53) eventually, since tn sn = o(1). Combining (49), (50), (51) and (53), we get xn sn xn s1 n + O( xn s1 n 2 + xn sn 2) + t1 n tn + O( xn sn 2), which leads to xn sn 2 xn s1 n + 2 t1 n tn , (54) when n is large enough. Using this bound in (48) combined with (49), we get t1 n tn n 2 2 max k xn sk n . Arias-Castro, Lerman, and Zhang We then have max k xn sk n 1 2 s1 n s2 n 1 2( t1 n t2 n s1 n t1 n s2 n t2 n ) 1 2 dist(t1 n, T2) s1 n t1 n , s1 n t1 n = O( xn s1 n 2 + xn s n 2) = O( xn sn 2) = O( t1 n tn 2), due (in the same order) to (51), (48)-(49), and (54). 
Recalling that t1 n tn = dist(t1 n, T1 T2), we conclude that dist(t1 n, T2) = O(1/n) dist(t1 n, T1 T2) + O(1) dist(t1 n, T1 T2)2. However, by Lemma 20, we have dist(t1 n, T2) (sin θ) dist(t1 n, T1 T2), so that dividing by dist(t1 n, T1 T2) above leads to 1 = O(1/n) + O(1) dist(t1 n, T1 T2), which is in contradiction with the fact that dist(t1 n, T2) t1 n tn = o(1), established in (52). 6.1.6 Covariances near an Intersection The following compares a covariance matrix of a distribution supported on the union of two smooth surfaces with that of the projection of that distribution on tangent subspaces. Lemma 22. Consider S1, S2 Sd(κ) intersecting at a positive angle, with reach(S1 S2) 1/κ. Fix s S1 and let A = B(s, r) (S1 S2), where r < 1/κ. Let C denote the empirical covariance of an i.i.d. sample from a distribution ν supported on A of size m and define Σ = Cov(ν). Then P Cm Σ > r2t + 3κr3 C22 exp mt C22 min(t, m) , t 0, where C22 depends only on d. Proof. Let T1 = TS1(s), and if dist(s, S2) r, let s2 = PS2(s) and T2 = TS2(s2). Define g : A 7 T1 T2 where g(x) = PT1(x) if x S1, and g(x) = PT2(x) otherwise. By (10), g(x) x κ 2r2 for all x A. Let Σg = Cov(νg). Also, let νm denote the sample distribution and note that Cm = Cov(νm). Let Cg m = Cov(νg m). Note that νg m is the empirical distribution of an i.i.d. sample of size m from νg, so the notation is congruent. If dist(s, S2) > r, νg is supported on B(s, r) T1, which is a d-dimensional ball of radius r, so that (43) applies to Cg m and Σg. If dist(s, S2) r, νg is supported on (B(s, r) T1) (B(s2, r) T2) B(s, 2r) (T1 T2) B(s, 2r) (T1 + T2), Spectral Clustering Based on Local PCA where the latter is a ball in dimension d := dim(T1 + T2) 2d of radius 2r, so that (43) with 2r in place of r and 2d in place of d applies to Cg m and Σg. Then, by the triangle inequality and Lemma 15, Cm Σ Cm Cg m + Cg m Σg + Σg Σ 2κr3 + κ2r4/2 + Cg m Σg , as in (38). Here is a continuity result. Lemma 23. Let T1 and T2 be two linear subspaces of same dimension d. For x T1, denote by ΣT (x) the covariance matrix of the uniform distribution over B(x, r) (T1 T2). Then, for all x, y T1, ΣT (x) ΣT (y) ( 5dr x y , if d 2, r x y + 2 p r3 x y , if d = 1. We note that, when d = 1, we also have the bound ΣT (x) ΣT (y) 4γr x y , if min(dist(x, T2), dist(y, T2)) r p Proof. Assume without loss of generality that r = 1. Let Aj x = B(x, 1) Tj for any x and j = 1, 2. By Lemma 14 and then Lemma 7, we have ΣT (x) ΣT (y) = Cov(λA1x A2x) Cov(λA1y A2y) 3d TV(λA1x A2x, λA1y A2y) 12d vol (A1 x A2 x) (A1 y A2 y) vol (A1x A2x) (A1y A2y) 12d vol(A1 x A1 y) + vol(A2 x A2 y) vol(A1x) . Note that A2 x = B(x2, η) T2 where x2 := PT2(x) and η := p 1 x x2 2 1. Similarly, A2 y = B(y2, δ) T2 where y2 := PT2(y) and δ := p 1 y y2 2 1. Therefore, applying Lemma 12, we get vol(A1 x A1 y) 2 vol(A1x) 1 (1 t)d +, (64) and assuming WLOG that δ η (i.e., y y2 x x2 ) and after proper scaling, we get vol(A2 x A2 y) 2 vol(A1x) ζ := ηd (η t2)d + δd, (65) where t := x y and t2 := x2 y2 . Note that t2 t by the fact that PT2 is 1-Lipschitz. For (64), we have 1 (1 t)d + dt. This is obvious when t 1, while when t 1 it is obtained using the fact that, for any 0 a < b 1, bd ad = (b a)(bd 1 + abd 2 + + ad 2b + ad 1) dbd 1(b a) d(b a). (66) For (65), we consider several cases. Arias-Castro, Lerman, and Zhang When η t2, then ζ = ηd η t2 t. When t2 < η t2 + δ, then ζ = ηd (η t2)d dt2 dt, by (66). 
When η t2 + δ and d 2, we have ζ = ηd δd dηd 1(η δ) dη(η δ) d(η2 δ2) = d( y y2 2 1 x x2 2 1) = d( y y2 1 + x x2 1)( y y2 1 x x2 1) 2d(t + t2) 4dt, where (66) was applied in the first line and the triangle inequality was applied in the last line, in the form of y y2 y x + x x2 + x2 y2 = x x2 + t + t2. When η t2 + δ and d = 1, we have y y2 2 1 x x2 2 1 2 using (70), preceded by the fact that, for any 0 a < b 1, 1 b = b a 1 a + 1 b + b a b a In summary, when d 2, we can therefore bound ΣT (x) ΣT (y) by dt + 4dt = 5dt; and when d = 1, we bound that by t + 2 The following is in some sense a converse to Lemma 23, in that we lower-bound the distance between covariance matrices near an intersection of linear subspaces. Note that the covariance matrix does not change when moving parallel to the intersection; however, it does when moving perpendicular to the intersection. Lemma 24. Let T1 and T2 be two linear subspaces of same dimension d with θmin(T1, T2) θ0 > 0. Fix a unit norm vector v T1 (T1 T2) . With ΣT (hv) denoting the covariance of the uniform distribution over B(hv, r) (T1 T2), we have inf h sup ℓ ΣT (hv) ΣT (ℓv) r2/C24, where the infimum is over 0 h 1/ sin θ0 and the supremum over max(0, h 1/2) ℓ min(1/ sin θ0, h + 1/2), and C24 1 depends only on d and θ0. Proof. Assume without loss of generality that r = 1. If the statement of the lemma is not true, there are subspaces T1 and T2 of same dimension d, a unit length vector v T1 (T1 T2) and 0 h 1/ sin θ0, such that ΣT (ℓv) = ΣT (hv) for all max(0, h 1/2) ℓ min(1/ sin θ0, h + 1/2). (71) Spectral Clustering Based on Local PCA This is because ℓ7 ΣT (ℓv) is continuous once T1, T2 and v are chosen. By projecting onto (T1 T2) , we may assume that T1 T2 = {0} without loss of generality. Let θ = (v, T2) and note that θ θ0 since T1 T2 = {0}. Define u = (v PT2v)/ sin θ and also w = PT2v/ cos θ when θ < π/2, and w T2 is any vector perpendicular to v when θ = π/2. B(hv, 1) T1 is the d-dimensional ball of T1 of radius 1 and center hv, while using Pythagoras theorem B(hv, 1) T2 is the d-dimensional ball of T2 of radius t := (1 (h sin θ)2)1/2 and center (h cos θ)w. Let λ denote the uniform distribution over B(hv, 1) (T1 T2) and λk the uniform distribution over B(hv, 1) Tk. Note that λ = αλ1 + (1 α)λ2, where α := vol(B(hv, 1) T1) vol(B(hv, 1) (T1 T2)) = vol(B(hv, 1) T1) vol(B(hv, 1) T1) + vol(B((h cos θ)w, t) T2) = 1 1 + td . By the law of total covariance, Cov(λ) = α Cov(λ1) + (1 α) Cov(λ2) + α(1 α)(E(λ1) E(λ2))(E(λ1) E(λ2)) . (72) with E(λ1) = hv and E(λ2) = h cos(θ)w, and Cov(λ1) = c P T1 and Cov(λ2) = t2c P T2, by Lemma 13. Hence, ΣT (hv) = αc P T1 + (1 α)t2c P T2 + α(1 α)(1 t2)uu , using the fact that v (cos θ)w = (sin θ)u and the definition of t. Let θ1 = θmax(T1, T2) and let v1 T1 be of unit length and such that (v1, T2) = θ1. Then for any 0 h, ℓ 1/ sin θ0, we have ΣT (hv) ΣT (ℓv) |v 1 ΣT (hv)v1 v 1 ΣT (ℓv)v1| = |f(th) f(tℓ)|, (73) where th := (1 (h sin θ)2)1/2 and f(t) = c 1 + td + ctd+2(cos θ1)2 1 + td + td(1 t2)(u v1)2 (1 + td)2 . It is easy to see that the interval Ih = {tℓ: (h 1/2)+ ℓ (1/ sin θ0) (h + 1/2)} is non empty. Because of (71) and (73), f(t) is constant over t Ih, but this is not possible since f is a rational function not equal to a constant and therefore cannot be constant over an interval of positive length. We now look at the eigenvalues of the covariance matrix. Lemma 25. Let T1 and T2 be two linear subspaces of same dimension d. For x T1, denote by ΣT (x) the covariance matrix of the uniform distribution over B(x, 1) (T1 T2). 
Then, for all x T1, c 1 (1 δ2(x))d/2 + βd(ΣT (x)), β1(ΣT (x)) c + δ2(x)(1 δ2(x))d/2 + , (75) c 8(1 cos θmax(T1, T2))2(1 δ2(x))d/2+1 + βd+1(ΣT (x)) (c + δ2(x))(1 δ2(x))d/2 + , (76) where c := 1/(d + 2) and δ(x) := dist(x, T2). Arias-Castro, Lerman, and Zhang Proof. Applying (72), we have ΣT (x) = αc P T1 + (1 α)ct2P T2 + α(1 α)(x x2)(x x2) , (77) where x2 := PT2(x) and α := (1 + td) 1 with t := (1 δ2(x))1/2 + . Because all the matrices in this display are positive semidefinite, we have βd(ΣT (x)) αc P T1 = αc, with α 1 td. And because of the triangle inequality, we have β1(ΣT (x)) αc P T1 + (1 α)ct2 P T2 + α(1 α) x x2 2 c + α(1 α)δ2(x), with α(1 α) td. Hence, (75) is proved. For the upper bound in (76), by Weyl s inequality (Stewart and Sun, 1990, Cor. IV.4.9) and the fact that βd+1(P T1) = 0, and then the triangle inequality, we get βd+1(ΣT (x)) ΣT (x) αc P T1 c(1 α)t2 P T2 + α(1 α)δ2(x) (1 α)(c + δ2(x)), and we then use the fact that 1 α td. For the lower bound in (76), let θ1 θ2 θd denote the principal angles between T1 and T2. By the definition of principal angles, there are orthonormal bases for T1 and T2, denoted v1, . . . , vd and w1, . . . , wd, such that v j wk = 1Ij=k cos θj. Take u span(v1, . . . , vd, w1), that is, of the form u = av1 +v+bw1, with v span(v2, . . . , vd). Since P T1 = v1v 1 + + vdv d and P T2 = w1w 1 + + wdw d , when u = 1, we have 1 cu ΣT (x)u αu P T1u + (1 α)t2u P T2u with u P T1u = a2 + v 2 + 2ab cos θ1 + b2 cos2 θ1 = 1 b2 sin2 θ1 0, and u P T2u = b2 + 2ab cos θ1 + a2 cos2 θ1 = (a cos θ1 + b)2. If |b| 1/2, then u P T1u 3/4, implying that 1 cu ΣT (x)u 3α/4 3/8. Otherwise, u P T1u + u P T2u (a + b cos θ1)2 + (a cos θ1 + b)2 (1 cos θ1)2(a2 + b2) (1 cos θ1)2/4, and using the fact that α 1 α (1 α)t2, we get 1 cu ΣT (x)u (1 α)t2(1 cos θ1)2/4. This last bound is always worst. Hence, by the Courant-Fischer theorem (Stewart and Sun, 1990, Cor. IV.4.7), we have βd+1(ΣT (x)) c 4(1 α)t2(1 cos θ1)2, and we conclude with 1 α td/2. Spectral Clustering Based on Local PCA Below are two technical results on the covariance matrix of the uniform distribution on the intersection of a ball and the union of two smooth surfaces, near where the surfaces intersect. The first one is in fact a stepping stone to the second. The latter will enable us to apply Lemma 23. Lemma 26. There is a constant C26 3 such that the following holds. Consider S1, S2 Sd(κ). Fix r 1 C26κ, and for s S1 with dist(s, S2) < 1/κ, let Σ(s) and ΣT (s) denote the covariance matrices of the uniform distributions over B(s, r) (S1 S2) and B(s, r) (T1 T2), respectively, where T1 := TS1(s) and T2 := TS2(PS2(s)). Then Σ(s) ΣT (s) C26κr3. (83) Proof. We note that it is enough to prove the result when r is small enough. Let δ = dist(s, S2) and let s2 = PS2(s), so that s s2 = δ. We first treat the case where δ r. Let Br be short for B(s, r) and define Ak = Br Sk, µk = E(λAk) and Dk = Cov(λAk), for k = 1, 2. Applying (72), we have Σ(s) = αD1 + (1 α)D2 + α(1 α)(µ1 µ2)(µ1 µ2) , α := vol(A1) vol(A1) + vol(A2). Recall that T1 = TS1(s) and T2 = TS2(s2), and define A k = Br Tk, so that Br (T1 T2) = A 1 A 2. Note that E(λA 1) = s and E(λA 2) = s2, and by Lemma 13, D 1 := Cov(λA 1) = cr2P T1 and D 2 := Cov(λA 2) = c(r2 δ2)P T2, where c := 1/(d+2). Applying (72), we have ΣT (s) = α D 1 + (1 α )D 2 + α (1 α )(s s2)(s s2) , α := vol(A 1) vol(A 1) + vol(A 2). By the triangle inequality, we have αD1 α D 1 |α α| D 1 + α D1 D 1 cr2|α α| + D1 D 1 , and similarly, (1 α)D2 (1 α )D 2 cr2|α α| + (1 α) D2 D 2 . 
Assuming that κr 1/C16, by Lemma 16, we have D1 D 1 D2 D 2 C16κr3. Because |α (1 α ) α(1 α)| |α α| and the identity aa bb ( a + b ) a b , we also have α(1 α)(µ1 µ2)(µ1 µ2) α (1 α )(s s2)(s s2) |α α| s s2 2 + α(1 α)3r µ1 s + µ2 s2 . Arias-Castro, Lerman, and Zhang By Lemma 16, we have µ1 s µ2 s2 C16κr2. Assuming that κr 1/3, P 1 Tk is well-defined and (1 + κr)-Lipschitz on Sk Br. And being an orthogonal projection, PTk is 1-Lipschitz . Hence, applying Lemma 6, we have 1 vol(Ak) vol(PTk(Ak)) 1 + κr, k = 1, 2. Then by Lemma 9, vol(PTk(Ak) A k) vol(A k) C9κ(r + κr2) 2C9κr, k = 1, 2, when κr is small enough, implying 1 2C9κr vol(PTk(Ak)) vol(A k) 1 + 2C9κr, k = 1, 2. 1 Cκr 1 2C9κr vol(Ak) vol(A k) (1 + κr)(1 + 2C9κr) 1 + Cκr, k = 1, 2, for some constant C > 0. Since for all a, b, a , b > 0 we have a a + b a |a a | |b b | (a + b) (a + b ) 1 a/a | |1 b/b |, we then get |α α | Cκr. Applying the triangle inequality to Σ(s) ΣT (s) and using the bounds above, we conclude. We assumed above that δ r. When δ > r, A2 = A 2 = and α = α = 1. In this case, µ2, D2, D 2 are not well-defined, but one can easily check that the above calculations, and the resulting bound, remain valid. In fact, in this case Σ(s) = D1 and ΣT (s) = D 1. Lemma 27. Consider S1, S2 Sd(κ) intersecting at a positive angle, with reach(S1 S2) 1/κ. There is a constant C27 1, depending on S1 and S2, such that the following holds. For s S1 with dist(s, S2) 2r, assuming r 1 C27 , let Σ(s) and Σ0 T (s) denote the covariance matrices of the uniform distributions over B(s, r) (S1 S2) and B(PT 0 1 (s), r) (T 0 1 T 0 2 ), where T 0 k := TSk(s0) and s0 := PS1 S2(s). Then Σ(s) Σ0 T (s) C27 r5/2. Proof. We use the notation of Lemma 26 and its proof. In addition, we let C > 0 denote a constant that depends only on S1 and S2 whose value increases with each appearance. Define t1 = PT 0 1 (s) and t2 = PT 0 2 (t1). Let δ0 = t1 t2 , δ1 = s t1 , δ2 = s2 t2 . Define A0 1 = T 0 1 B(t1, r) and A0 2 = T 0 2 B(t1, r). Note that A0 2 = T 0 2 B(t2, p Spectral Clustering Based on Local PCA when δ0 r, and A0 2 = when δ0 > r. We have s s0 Cr by Lemma 21, which then implies δ1 Cr2 by Lemma 2. Assuming that κr is small enough, δ2 = s2 t2 PS2(s) PT 0 2 (s) + PT 0 2 (s) PT 0 2 (t1) C4κ s s0 2 + s t1 Cr2, where in the second line we used Lemma 4 and the fact that any metric projection is 1-Lipschitz. Hence, |δ0 δ| t1 t2 s s2 t1 s + t2 s2 = δ1 + δ2 Cr2. Note that E(λA0 1) = t1 and, by Lemma 13 (with c denoting the constant defined there), D0 1 := Cov(λA0 1) = cr2P T 0 1 . If δ0 r, E(λA0 2) = t2 and D0 2 := Cov(λA0 2) = c(r2 δ2 0)P T 0 2 ; if δ0 > r, then A0 2 = , and we define D0 2 = 0 in that case. Applying (72), we have Σ0 T (s) = α0D0 1 + (1 α0)D0 2 + α0(1 α0)(t1 t2)(t1 t2) , α0 := vol(A0 1) vol(A0 1) + vol(A0 2). Note that α0 = 1 when δ0 > r (i.e., when A0 2 = ). We focus on the case where max(δ, δ0) r, which is the most complex one. We have D 1 D0 1 = cr2P T1 cr2P T 0 1 = cr2 P T1 P T 0 1 , D 2 D0 2 = c(r2 δ2)P T2 c(r2 δ2 0)+P T 0 2 cr2 P T2 P T 0 2 + c|δ2 δ2 0|, by the triangle inequality and the fact that P T 1 for any affine subspace T. By Lemma 3 and Lemma 18, we have P T1 P T 0 1 6κ s s0 Cr, P T2 P T 0 2 6κ s2 s0 Cr, since s2 s0 s2 s + s s0 2r + Cr. And since |δ2 δ2 0| 2r|δ δ0| Cr3, we have D k D0 k Cr3 for k = 1, 2. Let ωd denote the volume of the d-dimensional unit ball. Then vol(A 1) = ωdrd, vol(A 2) = ωd(r2 δ2)d/2, vol(A0 1) = ωdr2, vol(A0 2) = ωd(r2 δ02)d/2, |α α0| = 1 1 + (1 δ/r)d/2 1 1 + (1 δ0/r)d/2 (1 δ/r)d/2 (1 δ0/r)d/2 . 
Arias-Castro, Lerman, and Zhang Proceeding as when we bounded ζ in the proof of Lemma 23, we get |δ δ0|/r C r. Hence, we proved that ΣT (s) Σ0 T (s) Cr5/2. We use this inequality, (83), and the triangle inequality, to get Σ(s) Σ0 T (s) Σ(s) ΣT (s) + ΣT (s) Σ0 T (s) Cr5/2, which is what we needed to prove. 6.2 Performance Guarantees for Algorithm 2 We deal with the case where there is no noise, that is, τ = 0 in (7), so that the data points are s1, . . . , s N and sampled exactly on S1 S2 according to the uniform distribution. We explain how things change when there is noise, meaning τ > 0, in Section 6.4. We start with some notation. Let Ξi = {j = i : sj Nr(si)}, with (random) cardinality Ni = |Ξi|. For i [n], let Ki = 1 if si S1 and = 2 otherwise, and let Ti = TSKi(si), which is the tangent subspace associated with data point si. Let P i be short for P Ti. Let I = {i : Kj = Ki, j Ξi}, or equivalently, Ic = {i : j s.t. Kj = Ki and sj si r}. Thus I indexes the points whose neighborhoods do not contain points from the other cluster. We will soon define a constant C and will require the following C C, r ε/C, ε η/C, η 1/C, (90) for a large enough constant C 1 that depends only on S1 and S2. 6.2.1 A Concentration Bound for Local Covariances Objective. We derive a concentration bound for the local covariances. Let Ξi = {j = i : sj Nr(si)}, with (random) cardinality Ni = |Ξi|. When there is no noise, Ci is the sample covariance of {sj : j Ξi}. Also, let Σ(s) denote the covariance matrix of the uniform distribution over B(s, r) (S1 S2), and note that Σi := Σ(si) = E(Ci|si). Given Ni, {sj : j Ξi} are uniformly distributed on B(si, r) (S1 S2), and applying Lemma 22, we get that for any t > 0 P Ci Σi > r2t + 3κr3 si, Ni C22 exp Nit C22 min t, Ni . (91) Assume that r < 1/(C11κ) and let n := nrd/(2C11/χ) where χ := vold(S1) vold(S2) vold(S1)+vold(S2). Define Ω1 = { i : Ni > n }. Spectral Clustering Based on Local PCA Note that, for any i, Ni N i := #{j = i : Ki = Kj and sj B(si, r)}, meaning that N i counts the number of neighboring points from the same surface. We have that #{i : Ki = k} Bin(n, χk) where χk := vold(Sk) vold(S1)+vold(S2), and using a standard concentration inequality for the binomial distribution, there is a numeric constant C > 0 such that P(#{i : Ki = k} χn/2) 1 exp( χn/C). (We also used the fact that χ = χ1 χ2.) With this, we may apply Lemma 11, to get P(Ωc 1) 2 exp( χn/C) + C11r d exp( (χn/2)rd/C11) Cn exp( nrd/C), for some other constant C > 0. (We used the fact that the left-hand side can be made greater than 1 when nrd 1 by choosing C large enough, so that we may focus on the case where nrd 1.) Define the event Ω2 = n i : Ci Σi r2η/C + 3κr3o , where C will be chosen large enough, but fixed, later on. Assume that n η/C . Note that η 1 by assumption, and we may assume n to be larger than any given constant, for if not, the statement in Theorem 1 is void. Based on the union bound and (91), we have P(Ωc 2) P(Ωc 2|Ω1) + P(Ωc 1) n C22 exp( n (η/C )2/C22) + Cn exp( nrd/C) Cn exp( nrdη2/C), for another constant C that depends on C and d. 6.2.2 Away from the Intersection Objective. We show that there is an appropriate choice of parameters as in Theorem 1 such that, away from the intersection when confined to I two points belonging to the same surface and within distance ε are neighbors, while two points belonging to different surfaces are not neighbors. For i [n], let Ki = 1 if si S1 and = 2 otherwise. Let Ti = TSKi(si), which is the tangent subspace associated with data point si. 
Let I = {i : Kj = Ki, j Ξi}, or equivalently, Ic = {i : j s.t. Kj = Ki and sj si r}. By definition, I indexes the points whose neighborhoods do not contain points from the other cluster. Arias-Castro, Lerman, and Zhang For i I , let ΣT,i denote the covariance of the uniform distribution on Ti B(si, r). Applying Lemma 16, this leads to Σi ΣT,i C16κr3, i I . (95) Under Ω2, by the triangle inequality and (95), we have Ci ΣT,i Ci Σi + Σi ΣT,i ζr2, i I , (96) where ζ := η/C + (3 + C16)κr. (97) The inequality (96) leads, via the triangle inequality, to the decisive bound Ci Cj ΣT,i ΣT,j + 2ζr2, i, j I . (98) Take i, j I such that Ki = Kj and si sj ε. Then by Lemma 13, Lemma 18 and Lemma 3, and the triangle inequality, we have 1 cr2 ΣT,i ΣT,j = sin θmax(Ti, Tj) 6κ si sj 6κε, (99) where c := 1/(d + 2). This implies that 1 r2 Ci Cj 6cκε + 2ζ, by the triangle inequality and (98). Therefore, if η > 6cκε + 2ζ, then any pair of points indexed by i, j I from the same cluster and within distance ε are direct neighbors in the graph built by Algorithm 2. Take i, j I such that Ki = Kj and si sj ε. By Lemma 21, max dist(si, S1 S2), dist(sj, S1 S2) C21 si sj . Let z be the mid-point of si and sj. By convexity and the display above, dist(z, S1 S2) 1 2 dist(si, S1 S2) + 1 2 dist(sj, S1 S2) C21ε. Assuming C21ε < 1/κ, let s = PS1 S2(z). Then, by the triangle inequality again, max s si , s sj dist(z, S1 S2) + 1 2 si sj C21ε + 1 Let T i denote the tangent subspace of SKi at s and let Σ i be the covariance of the uniform distribution over T i B(s, r). Define T j and Σ j similarly. Then, as in (99) we have 1 cr2 ΣT,i Σ i 6κ si s 6κ(C21 + 1/2)ε, and similarly, 1 cr2 ΣT,j Σ j 6κ(C21 + 1/2)ε. Spectral Clustering Based on Local PCA Moreover, by Lemma 13 and Lemma 18, 1 cr2 Σ i Σ j = sin θmax(T i, T j) sin θS, where θS is short for θ(S1, S2), defined in (6). Hence, by the triangle inequality, 1 cr2 ΣT,i ΣT,j sin θS 12κ(C21 + 1/2)ε, and then 1 r2 Ci Cj c sin θS 12cκ(C21 + 1/2)ε 2ζ, (100) by the triangle inequality and (98). Therefore, if η < c sin θS 12cκ(C21 + 1/2)ε 2ζ, then any pair of points indexed by i, j I from different clusters are not direct neighbors in the graph built by Algorithm 2. In summary, we would like to choose η such that cκε + 2ζ < η < c sin θS 12cκ(C21 + 1/2)ε 2ζ. Recalling the definition of ζ in (97), this holds under (90). 6.2.3 The Different Clusters are Disconnected in the Graph Objective. We show that Step 2 in Algorithm 2 eliminates all points i / I , implying by our conclusion in Section 6.2.2 that the two clusters are not connected in the graph. Hence, take i / I with Ki = 1 (say), so that dist(si, S2) r. By Lemma 21, we have dist(si, S1 S2) C21r. Assume r is small enough that C21κr < 1/2. Let s0 = PS1 S2(si) and define T 0 k = TSk(s0) for k {1, 2}. For s S1 such that dist(s, S2) 2r, define Σ0 T (s) as in Lemma 27. Assuming that si = s0 (which is true with probability one) and s0 = 0 (by translation invariance), let h = si s0 and v = (si s0)/h. Note that si = hv. Because v T 0 1 T 0 2 by Lemma 1, and that θmin(T 0 1 , T 0 2 ) θS, we apply Lemma 24 to find ℓ h r/2 such that Σ0 T (ℓv) Σ0 T (hv) r2/C24, where C24 1 depends only on θS and d. Letting s = ℓv, we have s si = |h ℓ| r/2, so that dist( s, S2) s si + dist(si, S2) r/2 + r 3r/2. By Lemma 21, we have dist( s, S1 S2) C212r < 1/κ, and consequently, PS1 S2( s) = s0, by Lemma 1. Hence, by the triangle inequality and Lemma 27, Σ(si) Σ( s) Σ0 T (s) Σ0 T ( s) Σ(si) Σ0 T (si) Σ( s) Σ0 T ( s) r2/C24 2C27r5/2. Take any s S1 such that s s r/2. 
By Lemma 23, Σ0 T ( s) Σ0 T ( s) 5dr s s + 2 p r3 s s (5d + 2)r3/2p Noting that dist( s, S2) s s + dist( s, S2) r/2 + 3r/2 2r, Arias-Castro, Lerman, and Zhang we apply the triangle inequality and Lemma 27, and we get Σ( s) Σ( s) Σ0 T ( s) Σ0 T ( s) + Σ( s) Σ0 T ( s) + Σ( s) Σ0 T ( s) (5d + 2)r3/2p s s + 2C27r5/2. When s s r/C , we have Σ( s) Σ(si) Σ(si) Σ( s) Σ( s) Σ( s) r2 1/C24 (5d + 2)/ p C 4C27r1/2 r2/(2C24), assuming C is large enough and r is small enough. Fixing such a constant C 2, define the event n s Sk : #{i : Ki = k and si B(s, r/C )} 2 o . By the same arguments bounding P(Ωc 1) in Section 6.2.1, we have that there is a constant C such that, when r 1/(Cκ), P(Ωc 3) Cn exp( nrd/C). Under Ω3, there is sj S1 B( s, r/C ). Taking s = sj above, we have that Σj Σi r2/(2C24). Under Ω2, by the triangle inequality, Cj Ci Σj Σi Ci Σi Cj Σj r2/(2C24) 2r2η/C r2/(3C24), choosing C sufficiently large. (Recall that η 1 by assumption.) Therefore, choosing η such that η < 1/(3C24), we see that Cj Ci > ηr2, even though sj si sj s + s si r/C + r/2 r. 6.2.4 The Points that are Removed are close to the Intersection Objective. We show that the points removed by Step 2 are within distance Cr of the intersection. Let I = {i I : Ξi I }. By our choice of parameters in (90), we see that i I is neighbor with any j Ξi, so that i survives Step 2. Hence, the nodes removed at Step 2 are in Ic = {i I : Ξi Ic = } Ic , with the possibility that some nodes in Ic survive. Now, for any i Ic , there is j with Kj = Ki such that si sj r, so by Lemma 21, dist(si, S1 S2) C21 si sj C21r. And for any i I \ Ic , there is j Ic such that sj si r, by definition of Ξi, so that dist(si, S1 S2) C21r + r, by the triangle inequality. In fact, we just proved that the last inequality holds for all i Ic . Hence, the points removed in Step 2 are within distance (C21 + 1)r of S1 S2. Spectral Clustering Based on Local PCA 6.2.5 Each Cluster is (essentially) Connected in the Graph Objective. We show that the points that survive Step 2 and belong to the same cluster are connected in the graph, except for possible spurious points within distance Cr of the intersection. Take, for example, the cluster generated from sampling S1. Before we start, we recall that Ic was eliminated in Step 2, so that by our choice of η in (90), to show that two points si, sj sampled from S1 are neighbors it suffices to show that si sj ε. Define R = S1\B(S1 S2, r ), where r := (C21+2)r. By our conclusion in Section 6.2.4, we have that any si S1 such that dist(si, R) < r survives Step 2, (109) so that it suffices to show that the subset of nodes {i : si R} is connected. Define W = S1 \ S2 and let {Wm} denote the connected components of W. (Note that there might only be one, for example, when S1 and S2 are circles in dimension D = 3, not in the same plane, that intersect at a single point.) Define Rm = Wm \B(S1 S2, r ). Since W = m Wm = S1 \ (S1 S2), we have R = m Rm. In fact, when r (and therefore r ) is small enough, Rm is connected, and assuming this is the case, {Rm} are the connected components of R. Suppose R1 and R2 are two adjacent connected components of R, meaning that R1 R2 S2 is connected. We show that there is at least one pair j1, j2 of direct neighbors in the graph such that sjm Rm. Take s on the connected component of S1 S2 separating R1 and R2. We construct two points, s1 R1 and s2 R2, that are within distance of order ε from s, and then select sjm as the closest data point to sm. 
Let T k = TSk(s) and let H be the affine subspace parallel to (T 1 T 2) passing through s. Take tm PT 1(Rm) H B(s, ε1), where ε1 := ε/4, which exists when ε is sufficiently small. Note that t1, t2 T 1 and assuming that ε is sufficiently small that ε1 < 1/κ, we have PS1 S2(tm) = s by Lemma 1. Define sm = P 1 T 1 (tm) and note that s1, s2 S1. Lemma 5 not only justifies this construction when κε1 < 1/C5, it also says that P 1 T 1 has Lipschitz constant bounded by 1 + κε1, which implies that sm s (1 + κε1) tm s = (1 + κε1)ε1 ε/3. We also have dist(sm, S1 S2) dist(tm, S1 S2) sm tm = tm s sm tm 2(1 + κε1)2ε1 ε1 when ε is sufficiently small. We used (10) in the second inequality. We assume r/ε is sufficiently small that ε/5 r +r. Then under Ω3, there are j1, j2 such that sjm B(sm, r) S1, and by the triangle inequality, we then have that dist(sjm, S1 S2) ε/5 r r , so Arias-Castro, Lerman, and Zhang that sjm Rm. Moreover, sj1 sj2 sj1 s1 + s1 s + s s2 + s2 sj2 r + ε/3 + ε/3 + r = 2 3ε + 2r ε, assuming r/ε is sufficiently small. Now, we show that the points sampled from any connected component, say R1, form a connected subgraph. Take s1, . . . , s M an r-packing of R1, so that G m (R1 B(sm, r/2)) R1 [ m (R1 B(sm, r)). Because R1 is connected, m B(sm, r) is necessarily connected. Under Ω1, and C26 2, all the balls B(sm, r), m = 1, . . . , M, contain at least one si S1, and any such point survives Step 2, because of dist(si, R1) < r and (109). Assume r/ε 1/4. Then two points si and sj such that si, sj B(sm, r) are connected, since si sj 2r ε. And when B(sm1, r) B(sm2, r) = , si B(sm1, r) and sj B(sm2, r) are such that si sj 4r ε. Hence, the points sampled from R1 are connected in the graph under Ω1. 6.2.6 Summary and Conclusion We worked under the assumption that (90), with a constant C large enough and depending only on S1 and S2. This is compatible with the statement of Theorem 1. In Section 6.2.1, we derived a concentration inequality for local covariances. This lead to a uniform bound (defining the event Ω2) which holds with probability at least 1 Cn exp( nrdη2/C) for some constant C > 0. In Section 6.2.2, we showed that among points away from the intersection (indexed by I ), those that are in the same cluster are neighbors in the graph if they are within distance ε, while those in different clusters cannot be neighbors in the graph. In Section 6.2.3, we showed that Step 2 removes points not indexed by I , thus implying that the two clusters are not connected in the graph. In the process, we assumed that an event (denoted Ω3) held, which happens with probability at least 1 Cn exp( nrd/C) for some constant C > 0. In Section 6.2.4, we showed that the points removed in Step 2 are within distance O(r) from the intersection of the surfaces. In Section 6.2.5, we showed that, except for points within distance O(r) of the intersection, two points in the same cluster are connected in the graph. We conclude that, after Step 4, Algorithm 2 returns two groups, and each group covers an entire cluster except for points within distance O(r) of the intersection. In Step 5, each point removed in Step 2 is included in the group which contains its closest point (in Euclidean distance). Theorem 1 is silent on the accuracy of doing this step. So the proof of Theorem 1, as it relates to Algorithm 2, is complete. 6.3 Performance Guarantees for Algorithm 3 We keep the same notation as in Section 6.2 and go a little faster here as the arguments are parallel. 
Let di denote the estimated dimensionality at point si, meaning the number Spectral Clustering Based on Local PCA of eigenvalues of Ci exceeding η Ci . Recall that Qi denotes the orthogonal projection onto the top di eigenvectors of Ci. 6.3.1 Estimation of the Intrinsic Dimension Objective. We show that, for i I , di = d (the correct intrinsic dimension) with high probability. Take i I . Under Ω2, (96) holds, and applying Weyl s inequality (Stewart and Sun, 1990, Cor. IV.4.9), we have |βm(Ci) βm(ΣT,i)| ζr2, m = 1, . . . , D. By Lemma 13, ΣT,i = cr2P i, so that βm(ΣT,i) = cr2 when m d and βm(ΣT,i) = 0 when m > d. Hence, β1(Ci) (c + ζ)r2, βd(Ci) (c ζ)r2, βd+1(Ci) ζr2. This implies that βd(Ci) β1(Ci) c ζ c + ζ > η, βd+1(Ci) β1(Ci) ζ c + ζ < η, when ζ η/2 and η is sufficiently small, which is the case under (90). When this is so, di = d by definition of di. (Note that Ci = β1(Ci).) 6.3.2 Each Cluster is (essentially) Connected in the Graph Objective. We show that each cluster is connected in the graph, except possibly for some points near the intersection. Note that the top d eigenvectors of ΣT,i generate Ti. Hence, applying Lemma 19, and (96) again, we get that (recall that c = 1/(d + 1)) 2(d + 2)ζ, i I . This is the equivalent of (96), which leads to the equivalent of (98): Qi Qj 1 cr2 ΣT,i ΣT,j + 2ζ , i, j I , using the fact that ΣT,i = cr2P i. When i, j I are such that Ki = Kj, based on (99), we have Qi Qj 6κε + 2ζ . Hence, when η > 6κε + 2ζ (which is the case under (90)), two nodes i, j I such that Ki = Kj and si sj ε are neighbors in the graph. The arguments provided in Section 6.2.5 now apply in exactly the same way to show that nodes i I such that Ki = 1 are connected in the graph. The same is true of nodes i I such that Ki = 2. Arias-Castro, Lerman, and Zhang 6.3.3 The Clusters are (essentially) Disconnected in the Graph Objective. We show that the two sets of nodes {i I : Ki = 1} and {i I : Ki = 2} shown to be connected above are disconnected in the graph. When we take i, j I such that Ki = Kj, we have the equivalent of (100), meaning, Qi Qj sin θS 12κ(C21 + 1/2)ε 2ζ . We choose η smaller than the RHS, so that these nodes are not neighbors in the graph. We next prove that a point indexed by I is not neighbor to a point near the intersection because of different estimates for the local dimension. Let C denote a positive constant which increases with each appearance. Take s S1 such that δ(s) := dist(s, S2) r. Reinstate the notation used in Section 6.2.3, in particular s0, T 0 1 and T 0 2 , as well as Σ0 T (s). Define ΣT (s) as in Lemma 26. Also, let T 1 = TS1(s) and T 2 = TS2(s2) where s2 := PS2(s). By Lemma 26 and Weyl s inequality, we have βd+1(Σ(s)) βd+1(ΣT (s)) C26r3, β1(Σ(s)) β1(ΣT (s)) + C26r3, which together with Lemma 25 (and proper scaling), implies that c 8(1 cos θmax(T 1, T 2))2(1 δ(s)2/r2)d/2+1 C26r3 c + (δ(s)/r)2(1 δ(s)2/r2)d/2 + C26r3 . Then, by the triangle inequality, θmax(T 1, T 2) θmax(T 0 1 , T 0 2 ) θmax(T 1, T 0 1 ) θmax(T 2, T 0 2 ). By definition, θmax(T 0 1 , T 0 2 ) θS, and by Lemma 3, θmax(T 1, T 0 1 ) sin 1 6κ s s0 1 Cr, and similarly, θmax(T 2, T 0 2 ) sin 1 6κ s2 s0 1 Cr, because s s0 Cr by Lemma 21 and then s2 s0 r + s s0 Cr. Hence, for r small enough, θmax(T 1, T 2) θS/2, and furthermore, β1(Σ(s)) η when 1 δ(s)2/r2 ξ := C η1/(d+2), (118) where C is a large enough constant, assuming r/η and η are small enough. The same is true for points on s S2 if we redefine δ(s) = dist(s, S1). 
Hence, for si close enough to the intersection that δ(si) satisfies (118), di > d. Then, by Lemma 18, Qi Qj = 1 for any j I . By our choice of η < 1, this means that i and j are not neighbors. So the only way {i I : Ki = 1} and {i I : Ki = 2} are connected in the graph is if there are si S1 and sj S2 such that si sj ε and both δ(si) and δ(sj) fail to satisfy (118). We now show this is not possible. Let ΣT,i = ΣT (si) and define ΣT,j similarly. (These are well-defined when ε < 1/κ, which we assume.) By Lemma 26, we have Σi ΣT,i C26 r3. (119) Spectral Clustering Based on Local PCA Define αi = (1 + td i ) 1 with ti := (1 δ2(si))1/2 + . As in (77), with the triangle inequality, we get ΣT,i αicr2P i c(1 αi)t2 i r2 + αi(1 αi)δ2(si) 2(1 αi)r2 2(1 δ(si)2/r2)d/2 + r2 2ξd/2r2, where the second inequality we used the fact that αi = 1 if δ(si) > r, and in very last inequality comes from δ(si) not satisfying (118). Hence, under Ω2, with (119) and the triangle inequality, we get Ci αicr2P i r2η/C + 3κr3 + 2ξd/2r2 + C26 r3. Since βd(P i) = 1 and βd+1(P i) = 0, by Lemma 19, we have Qi P i 1 αicr2 r2η/C + 2ξd/2r2 + (3κ + C26)r3 C(η/C + ξd/2 + r). Similarly, Qj P j C(η/C + ξd/2 + r). By Lemma 18, P i P j = sin θmax(Ti, Tj). Let s0 = PS1 S2(si), and define T 0 1 and T 0 2 as before. Not that si s0 Cε by Lemma 21 and the fact that dist(si, S2) si sj ε. Also, sj s0 si s0 + sj si Cε. We then have θmax(Ti, Tj) θmax(T 0 1 , T 0 2 ) θmax(Ti, T 0 1 ) θmax(Tj, T 0 2 ) θS Cε, by Lemma 3. Hence, by the triangle inequality, Qi Qj P i P j Qi P i Qj P j sin(θS Cε) C(η/C + ξd/2 + r) 1 2 sin θS > η, when the constant in (90) is large enough. (Recall the definition of ξ above.) Therefore i and j are not neighbors, as we needed to show. 6.4 Noisy Case So far we only dealt with the case where τ = 0 in (7). When τ > 0, a sample point xi is in general different than its corresponding point si sampled from one of the surfaces. However, when τ/r is small, this does not change things much. For one thing, the points are close to each other, since we have xi si τ by assumption, and τ is small compared to r. And the corresponding covariance matrices are also close to each other. To see this, redefine Ξi = {j = i : xj Nr(xi)} and Ci as the sample covariance of {xj : j Ξi}. Let Di denote the sample covariance of {sj : j Ξi}. Let X be uniform over {xj : j Ξi} and define Y = P j sj1I{X=xj}. Starting from (33), we have Di Ci = Cov(X) Cov(Y ) E X Y 2 1/2 E X xi 2 1/2 + E Y xi 2 1/2 τ (r + (r + τ)) = r2 2τ/r + (τ/r)2 , Arias-Castro, Lerman, and Zhang which is small compared to r2, and r2 is the operating scale for covariance matrices in our setting. Using these facts, the arguments are virtually the same, except for some additional terms due to triangle inequalities, for example, si sj 2τ xi xj si sj +2τ. In particular, this results in ζ in (97) being now redefined as ζ = 3τ r + η/C + (3 + C16)κr. We omit further technical details. Acknowledgments This work was partially supported by grants from the National Science Foundation (DMS 0915160, 0915064, 0956072, 1418386, 1513465). We would like to thank Jan Rataj for helpful discussion around Lemma 3 and Xu Wang for his sharp proofreading. We also gratefully acknowledge the comments, suggestions, and scrutiny of an anonymous referee. We would also like to acknowledge support from the Institute for Mathematics and its Applications (IMA). For one thing, the authors first learned about the research of Goldberg et al. 
(2009) there, at the Multi-Manifold Data Modeling and Applications workshop in the Fall of 2008, and this was the main inspiration for our paper. Also, part of our work was performed while TZ was a postdoctoral fellow at the IMA, and also while EAC and GL were visiting the IMA. S. N. Afriat. Orthogonal and oblique projectors and the characteristics of pairs of vector spaces. Proceedings of the Cambridge Philosophical Society, 53:800 816, 1957. S. Agarwal, J. Lim, L. Zelnik-Manor, P. Perona, D. Kriegman, and S. Belongie. Beyond pairwise clustering. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), volume 2, pages 838 845, 2005. S. Agarwal, K. Branson, and S. Belongie. Higher order learning with graphs. In International Conference on Machine Learning (ICML), pages 17 24, 2006. E. Arias-Castro. Clustering based on pairwise distances when the data is of mixed dimensions. IEEE Transactions on Information Theory, 57(3):1692 1706, 2011. E. Arias-Castro, G. Chen, and G. Lerman. Spectral clustering based on local linear approximations. Electronic Journal of Statistics, 5:1537 1587, 2011. R. Basri and D. W. Jacobs. Lambertian reflectance and linear subspaces. IEEE Transactions on Pattern Analysis and Machine Intelligence, 25(2):218 233, 2003. J.-D. Boissonnat, R. Dyer, and A. Ghosh. Constructing intrinsic delaunay triangulations of submanifolds. ar Xiv preprint ar Xiv:1303.6493, 2013. P. Bradley, K. Bennett, and A. Demiriz. Constrained k-means clustering. Technical Report MSR-TR-2000-65, Microsoft Research, 2000. Spectral Clustering Based on Local PCA M. R. Brito, E. L. Ch avez, A. J. Quiroz, and J. E. Yukich. Connectivity of the mutual k-nearest-neighbor graph in clustering and outlier detection. Statistics & Probability Letters, 35(1):33 42, 1997. G. Chen and G. Lerman. Spectral curvature clustering (scc). International Journal of Computer Vision, 81(3):317 330, 2009a. G. Chen and G. Lerman. Foundations of a multi-way spectral clustering framework for hybrid linear modeling. Foundations of Computational Mathematics, 9(5):517 558, 2009b. G. Chen, S. Atev, and G. Lerman. Kernel spectral curvature clustering (KSCC). In IEEE International Conference on Computer Vision (ICCV) Workshops, pages 765 772, 2009. A. Cuevas, R. Fraiman, and B. Pateiro-L opez. On statistical properties of sets fulfilling rolling-type conditions. Advances in Applied Probability, 44(2):311 329, 2012. C. Davis and W. M. Kahan. The rotation of eigenvectors by a perturbation. III. SIAM Journal on Numerical Analysis, 7:1 46, 1970. E. Elhamifar and R. Vidal. Sparse manifold clustering and embedding. In Advances in Neural Information Processing Systems (NIPS), volume 24, pages 55 63, 2011. R. Epstein, P. Hallinan, and A. Yuille. 5 2 eigenimages suffice: An empirical investigation of low-dimensional lighting models. In IEEE Workshop on Physics-based Modeling in Computer Vision, pages 108 116, 1995. Z. Fan, J. Zhou, and Y. Wu. Multibody grouping by inference of multiple subspaces from high-dimensional data using oriented-frames. IEEE Transactions on Pattern Analysis and Machine Intelligence, 28(1):91 105, 2006. H. Federer. Curvature measures. Transactions of the American Mathematical Society, 93: 418 491, 1959. Z. Fu, W. Hu, and T. Tan. Similarity based vehicle trajectory clustering and anomaly detection. In IEEE International Conference on Image Processing, volume 2, pages 602 605, 2005. A. Gionis, A. Hinneburg, S. Papadimitriou, and P. Tsaparas. Dimension induced clustering. 
In International Conference on Knowledge Discovery in Data Mining (KDD), pages 51 60, 2005. A. Goh and R. Vidal. Segmenting motions of different types by unsupervised manifold clustering. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 1 6, 2007. A.B. Goldberg, X. Zhu, A. Singh, Z. Xu, and R. Nowak. Multi-manifold semi-supervised learning. In International Conference on Artificial Intelligence and Statistics (AISTATS), pages 169 176, 2009. D. Gong, X. Zhao, and G. Medioni. Robust multiple manifolds structure learning. In International Conference on Machine Learning (ICML), pages 321 328, 2012. Arias-Castro, Lerman, and Zhang Q. Guo, H. Li, W. Chen, I-F. Shen, and J. Parkkinen. Manifold clustering via energy minimization. In International Conference on Machine Learning and Applications (ICMLA), pages 375 380, 2007. G. Haro, G. Randall, and G. Sapiro. Stratification learning: Detecting mixed density and dimensionality in high dimensional point clouds. Advances in Neural Information Processing Systems (NIPS), 19:553, 2007. R. Heckel and H. B olcskei. Noisy subspace clustering via thresholding. In IEEE International Symposium on Information Theory (ISIT), pages 1382 1386, 2013. J. Ho, M.-H. Yang, J. Lim, K.-C. Lee, and D. Kriegman. Clustering appearances of objects under varying illumination conditions. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), volume 1, pages 1 11, 2003. W. Hoeffding. Probability inequalities for sums of bounded random variables. Journal of the American Statistical Association, 58(301):13 30, 1963. D. N. Kaslovsky and F. G. Meyer. Optimal tangent plane recovery from noisy manifold samples. ar Xiv preprint ar Xiv:1111.4601v2, 2011. D. Kushnir, M. Galun, and A. Brandt. Fast multiscale clustering and manifold identification. Pattern Recognition, 39(10):1876 1891, 2006. K.-C. Lee, J. Ho, and D. J. Kriegman. Acquiring linear subspaces for face recognition under variable lighting. IEEE Transactions on Pattern Analysis and Machine Intelligence, 27 (5):684 698, 2005. A. V. Little, Y. M. Jung, and M. Maggioni. Multiscale estimation of intrinsic dimensionality of data sets. In Manifold Learning and its Applications, pages 26 33, 2009. U. Luxburg. A tutorial on spectral clustering. Statistics and Computing, 17(4):395 416, 2007. Y. Ma, A. Y. Yang, H. Derksen, and R. Fossum. Estimation of subspace arrangements with applications in modeling and segmenting mixed data. SIAM Review, 50(3):413 458, 2008. M. Maier, M. Hein, and U. von Luxburg. Optimal construction of k-nearest-neighbor graphs for identifying noisy clusters. Theoretical Computer Science, 410(19):1749 1764, 2009. V. J. Mart ınez and E. Saar. Statistics of the Galaxy Distribution. CRC press, Boca Raton, 2002. A. Y. Ng, M. I. Jordan, and Y. Weiss. On spectral clustering: Analysis and an algorithm. Advances in Neural Information Processing Systems (NIPS), 2:849 856, 2002. P. Niyogi, S. Smale, and S. Weinberger. Finding the homology of submanifolds with high confidence from random samples. Discrete & Computational Geometry, 39(1):419 441, 2008. Spectral Clustering Based on Local PCA M. Polito and P. Perona. Grouping and dimensionality reduction by locally linear embedding. Advances in Neural Information Processing Systems (NIPS), 14:1255 1262, 2001. L. K. Saul and S. T. Roweis. Think globally, fit locally: unsupervised learning of low dimensional manifolds. Journal of Machine Learning Research, 4:119 155, 2003. A. Shashua, R. Zass, and T. Hazan. 
Multi-way clustering using super-symmetric nonnegative tensor factorization. In European Conference on Computer Vision (ECCV), pages 595 608, 2006. M. Soltanolkotabi and E. J. Cand es. A geometric analysis of subspace clustering with outliers. Annals of Statistics, 40(4):2195 2238, 2012. M. Soltanolkotabi, E. Elhamifar, and E. J. Cand es. Robust subspace clustering. Annals of Statistics, 42(2):669 699, 2014. R. Souvenir and R. Pless. Manifold clustering. In IEEE International Conference on Computer Vision (ICCV), volume 1, pages 648 653, 2005. G. W. Stewart and J. G. Sun. Matrix Perturbation Theory. Computer Science and Scientific Computing. Academic Press Inc., Boston, MA, 1990. M. Tipping and C. Bishop. Mixtures of probabilistic principal component analysers. Neural Computation, 11(2):443 482, 1999. J. A. Tropp. User-friendly tail bounds for sums of random matrices. Foundations of Computational Mathematics, 12:389 434, 2012. M. C. Tsakiris and R. Vidal. Filtrated spectral algebraic subspace clustering. In IEEE International Conference on Computer Vision Workshops, pages 28 36, 2015. R. Valdarnini. Detection of non-random patterns in cosmological gravitational clustering. Astronomy & Astrophysics, 366:376 386, 2001. R. Vidal and Y. Ma. A unified algebraic approach to 2-D and 3-D motion segmentation and estimation. Journal of Mathematical Imaging and Vision, 25(3):403 421, 2006. G. Walther. Granulometric smoothing. Annals of Statistics, 25(6):2273 2299, 1997. Y. Wang, Y. Jiang, Y. Wu, and Z.-H. Zhou. Spectral clustering on multiple manifolds. IEEE Transactions on Neural Networks, 22(7):1149 1161, 2011. Y.-X. Wang and H. Xu. Noisy sparse subspace clustering. In International Conference on Machine Learning (ICML), pages 89 97, 2013. L. Zelnik-Manor and P. Perona. Self-tuning spectral clustering. In Advances in Neural Information Processing Systems (NIPS), pages 1601 1608, 2005. T. Zhang, A. Szlam, Y. Wang, and G. Lerman. Hybrid linear modeling via local best-fit flats. International Journal of Computer Vision, pages 1 24, 2012.