# Subspace Clustering via Tangent Cones

Amin Jalali (Wisconsin Institute for Discovery, University of Wisconsin, Madison, WI 53715; amin.jalali@wisc.edu)
Rebecca Willett (Department of Electrical and Computer Engineering, University of Wisconsin, Madison, WI 53706; willett@discovery.wisc.edu)

Given samples lying on any of a number of subspaces, subspace clustering is the task of grouping the samples based on their corresponding subspaces. Many subspace clustering methods operate by assigning a measure of affinity to each pair of points and feeding these affinities into a graph clustering algorithm. This paper proposes a new paradigm for subspace clustering that computes affinities based on the corresponding conic geometry. The proposed conic subspace clustering (CSC) approach considers the convex hull of a collection of normalized data points and the corresponding tangent cones. The union of subspaces underlying the data imposes a strong association between the tangent cone at a sample $x$ and the original subspace containing $x$. In addition to describing this novel geometric perspective, this paper provides a practical algorithm for subspace clustering that leverages this perspective, where a tangent cone membership test is used to estimate the affinities. The algorithm is accompanied by deterministic and stochastic guarantees on the properties of the learned affinity matrix (on the true and false positive rates and their spread), which directly translate into the overall clustering accuracy.

## 1 Introduction

Finding a low-dimensional representation of high-dimensional data is central to many tasks in science and engineering. Union-of-subspaces models have been a popular data representation tool for the past decade. These models, while still parsimonious, offer more flexibility and better approximations to non-linear data manifolds than single-subspace models. To fully leverage union-of-subspaces models, we must be able to determine which data point lies in which subspace. This subproblem is referred to as subspace clustering [16]. Formally, given a set of points $x_1, \dots, x_N \in \mathbb{R}^n$ lying on $k$ linear subspaces $S_1, \dots, S_k \subset \mathbb{R}^n$, subspace clustering is the task of partitioning those points into $k$ clusters so that all points in each cluster lie within the same subspace among $S_1, \dots, S_k$. Once the points have been clustered into subspaces, standard dimensionality reduction methods such as principal component analysis can be used to identify the underlying subspaces.

A generic approach in the literature is to construct a graph with each vertex corresponding to one of the given samples and each edge indicating whether (or the degree to which) a pair of points could have come from the same subspace. We refer to the (weighted) adjacency matrix of this graph as the affinity matrix. An ideal affinity matrix $A$ would have $A(i, j) = 1$ if and only if $x_i$ and $x_j$ are in the same subspace, and $A(i, j) = 0$ otherwise. Given an estimated affinity matrix, a variety of graph clustering methods, such as spectral clustering [17], can be used to cluster the samples, so forming the affinity matrix is a critical step. Many existing methods for subspace clustering with provable guarantees leverage the self-expressive property of the data. Such approaches pursue a representation of each data point in terms of the other data points, and the representation coefficients are then used to construct an affinity matrix.
For example, the celebrated sparse subspace clustering (SSC) approach of [3] seeks a representation of each sample as a weighted combination of the other points, with minimal $\ell_1$ norm. However, such sparse self-expression can lead to graph connectivity issues, e.g., see [10, 8, 20, 5, 19, 18], where clusters can be arbitrarily broken into separate components.

This paper proposes a new paradigm for devising subspace clustering algorithms:

Conic Subspace Clustering (CSC): exploiting the association of the tangent cones to the convex hull of normalized samples with the original subspaces, for computing affinities and subsequent clustering.

CSC leverages new insights into the geometry of subspace clustering. One of the key effects of this approach is that the learned affinity matrix is generally denser among samples from the same subspace, which in turn can mitigate graph connectivity issues. In Proposition 1 below, we hint at what we mean by the strong association of the tangent cones with the underlying subspaces for an ideal dataset. In Section 2, we show how a similar idea can be implemented with a finite number of samples.

Given a set of nonzero samples from a union of linear subspaces, we normalize them to fall on the unit sphere and henceforth assume $X = \{x_1, \dots, x_N\} \subset \mathbb{S}^{n-1}$ is the set of samples. We further overload the notation to define $X = [x_1, x_2, \dots, x_N] \in \mathbb{R}^{n \times N}$. The data hull refers to the convex hull of the samples. The tangent cone at $x \in \mathrm{conv}(X)$ with respect to $\mathrm{conv}(X)$ is defined as
$$
T(x) := \mathrm{cl\,conv\,cone}(X + \{-x\}) = \mathrm{cl}\,\Big\{ \textstyle\sum_{x' \in X} \lambda_{x'}(x' - x) \;:\; \lambda_{x'} \ge 0,\ x' \in X \Big\},
$$
where the Minkowski sum of two sets $A$ and $B$ is denoted by $A + B$, while $A + \{x\}$ may be simplified to $A + x$. The linear space of a cone $C$ is defined as $\mathrm{lin}\, C := C \cap (-C)$. We term the intersection of a subspace $S$ with the unit sphere a ring $R = S \cap \mathbb{S}^{n-1}$.

Proposition 1. For a union of rings, namely $X = (S_1 \cup \dots \cup S_k) \cap \mathbb{S}^{n-1}$, and for every $x \in X$,
$$
S(x) = \mathrm{span}\{x\} + \mathrm{lin}\, T(x),
$$
where $S(x)$ is the convex hull of the union of all subspaces $S_i$, $i = 1, \dots, k$, to which $x$ belongs.

### 1.1 Our contributions

We introduce a new paradigm for subspace clustering, conic subspace clustering (CSC), inspired by ideas from convex geometry. More specifically, we propose to consider the convex hull of normalized samples, and to exploit the structure of the tangent cone to this convex body at each sample in order to estimate the relationships between pairs of samples (and hence construct an affinity matrix for clustering). We provide an algorithm which implements CSC (Section 2), along with deterministic guarantees on how to choose the single parameter in this algorithm, $\beta$, so as to guarantee no false positives (Section 5) and any desired true positive rate (Section 4), within the range allowed by the provided samples. We specialize our results to random models, to showcase our guarantees in terms of the few parameters defining these random generative models and to compare with existing methods. Aside from statistical guarantees, we also provide different optimization programs for implementing our algorithm that can be used for faster computation and increased robustness (Section 7). In Section 6, we elaborate on the true positive rate and spread for CSC and compare it to what is known about a sparsity-based subspace clustering approach, namely sparse subspace clustering (SSC) [3].
This comparison provides insight into situations where methods such as SSC face the so-called graph connectivity issue, and demonstrates the advantage of CSC in such situations.

## 2 Conic Subspace Clustering (CSC) via Rays: Intuition and Algorithm

In this section, we discuss an intuitive algorithm for subspace clustering under the proposed conic subspace clustering paradigm. We present the underlying idea without worrying about computational aspects, and relegate such discussions to Section 7. All proofs are presented in the Appendix. Henceforth, lower case letters represent vectors, while specific letters such as $x$ and $x'$ are reserved to represent columns of $X$, and $x$ is commonly used as the reference point.

Start by considering Figure 1(a) and the point $x \in R := (S_1 \cup \dots \cup S_k) \cap \mathbb{S}^{n-1}$ from which all the rays are emanating. Moreover, define $R_t := S_t \cap \mathbb{S}^{n-1}$ for $t = 1, \dots, k$, which gives $R = R_1 \cup \dots \cup R_k$. Only two subspaces are shown, and the reference point $x$ is in $R_1$. The thin red and blue rays correspond to elements of $x + \mathrm{cone}(R - x) = x + \mathrm{cone}(\{x' - x : x' \in R\})$, where $\mathrm{cone}(A) := \{\lambda y : y \in A,\ \lambda \ge 0\}$ (note that this is not the same as a conic hull).

Figure 1: Illustration of the idea behind our implementation of Conic Subspace Clustering (CSC) via rays. The union of the red and blue rings is $R$, and $x$ is the point from which all the rays are emanating. The orange wedge represents $x + W(x)$. (a) The union of the red and blue surfaces is $x + \mathrm{cl\,cone}(R - x)$. (b) When $x$ and $x'$ are from the same subspace, the points $d_\beta(x, x')$ for different values of $\beta \ge 0$ lie within $\mathrm{cl\,cone}(R - x)$; specifically, in the blue shaded cone associated with the blue ring. (c) When $x$ and $x'$ are from different subspaces, the points $d_\beta(x, x')$ lie outside $\mathrm{cl\,cone}(R - x)$ for large enough values of $\beta$.

We leverage the geometry of this cone to determine subspace membership. Specifically, Figure 1(b) considers a point $x' \in R_1$ different from $x$. The dashed line segment represents the points $\mathrm{sign}(\langle x, x'\rangle)\,\beta x'$ for different values of $\beta \ge 0$, where $\mathrm{sign}(0)$ can be arbitrarily chosen as $1$. The vectors emanating from $x$ and reaching these points represent
$$
d_\beta(x, x') := \mathrm{sign}(\langle x, x'\rangle)\,\beta x' - x. \tag{1}
$$
For $x, x' \in R_1$, this illustration shows that $d_\beta(x, x') \in \mathrm{cl\,cone}(R - x)$ for any $\beta \ge 0$. In contrast, Figure 1(c) considers $x' \in R_2$, while $x \in R_1$. In this case, there exists a $\beta > 0$ such that $d_\beta(x, x') \notin \mathrm{cl\,cone}(R - x)$, indicating that $x' \notin S(x)$. Formally,

Proposition 2. For any $x, x' \in R$, we have $x' \in S(x)$ if and only if
$$
\{ d_\beta(x, x') : \beta \ge 0 \} \subset \mathrm{cl\,cone}(R - x). \tag{2}
$$
Equivalently, $x' \in S(x)$ if and only if $\{\beta \in \mathbb{R} : \beta x' - x \in \mathrm{cl\,cone}(R - x)\}$ is unbounded.

In other words, we can test whether or not $x' \in S(x)$ by testing the cone membership of $d_\beta(x, x')$. Of course, such a test would not be practical: we cannot compute $d_\beta(x, x')$ for an infinite set of $\beta$ values, the set $\mathrm{cl\,cone}(R - x)$ is generally non-convex (in Figure 1(a), the cone is the union of the red and blue surfaces), and $\mathrm{cl\,cone}(R - x)$ is not known exactly because we only observe a finite collection of points from $R$ instead of all of $R$. We now develop an alternative test to (2) that addresses these challenges and can be computed within a convex optimization framework. We first address the convexity issue:

Proposition 3. For the closed convex cone $W(x) := \mathrm{conv\,cl\,cone}(R - x)$, and for any $x, x' \in R$ with $x' \in S(x)$,
$$
\{ d_\beta(x, x') : \beta \ge 0 \} \subset W(x).
$$
In other words, $x' \in S(x)$ implies that $\{\beta \in \mathbb{R} : d_\beta(x, x') \in W(x)\}$ is unbounded.
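To make the ray construction concrete, here is a minimal NumPy sketch of the direction vector in (1); the function name `d_beta` and the NumPy setup are our own illustrative choices rather than anything prescribed by the paper.

```python
import numpy as np

def d_beta(x, x_prime, beta):
    """Direction d_beta(x, x') = sign(<x, x'>) * beta * x' - x from Eq. (1).

    x, x_prime : unit-norm sample vectors in R^n.
    beta       : nonnegative scale parameter.
    """
    s = np.sign(float(x @ x_prime))
    if s == 0.0:        # sign(0) is arbitrarily taken to be +1, as in the text
        s = 1.0
    return s * beta * x_prime - x
```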
Next, we formulate the test as a convex optimization program, for the case when a finite number of samples are given. Specifically, using the samples in $X \subset R$ instead of all the points in $R$, we can define an approximation of $W(x)$ as
$$
W_N(x) := \mathrm{conv\,cone}(X + \{-x\}) = \big\{ (X - x\mathbf{1}_N^T)\lambda \;:\; \lambda \ge 0_N \big\},
$$
which is the tangent cone (also known as the descent cone) at $x$ with respect to the data hull $\mathrm{conv}(X)$. The implementation of CSC via rays, as sketched above and detailed below, is based on testing the membership of $d_\beta(x, x')$ in the tangent cone $W_N(x)$ for all pairs of samples $x, x'$ to determine their affinity. More specifically, the cone membership test can be stated as a feasibility program, tagged as the Cone Representability (CR) program:
$$
\min_{\lambda \in \mathbb{R}^N} \ 0 \quad \text{subject to} \quad d_\beta(x, x') = (X - x\mathbf{1}_N^T)\lambda, \quad \lambda \ge 0_N. \tag{CR}
$$
If there exists a $\beta \ge 0$ for which (CR) is infeasible, then we conclude $x' \notin S(x)$. Later, in our theoretical results in Sections 4 and 5, we characterize a range (dependent on a target error rate) of possible values of $\beta$ such that, for any single $\beta$ from this range, checking the feasibility of (CR) for all $x, x'$ reveals the true relationships within the target error rate.

In Section 7, we discuss a number of variations of the above optimization program. While our upcoming guarantees are all concerned with the cone membership test itself and not the specific implementation, these variations provide better algorithmic options and are more robust to noise. Specifically, for our implementation of the cone membership test we choose a variation (in the box below) that is a bounded and feasible linear program. We refer to solving any of the variations of the cone membership test for an ordered pair of samples $(x, x')$ and a fixed value of $\beta$ as CSC1($\beta, x, x'$):

CSC1($\beta, x, x'$): Compute $\hat\gamma(x, x') = \min \{\gamma : (1 - \gamma)(\beta x' - x) = (X - x\mathbf{1}_N^T)\lambda,\ \gamma \ge 0,\ \lambda \ge 0\}$. Set $A(x, x') \in \{0, 1\}$ by rounding $1 - \hat\gamma(x, x')$ to either 0 or 1, whichever is closest.

We refer to the optimization program used in the above as the Robust Cone Membership (RCM) program. Similarly, solving a collection of these tests for all samples $x'$ and a fixed $x$, or for all pairs $x, x'$, is referred to as CSC1($\beta, x$) and CSC1($\beta$), respectively. When CSC1($\beta$) is followed by spectral clustering on the constructed affinity matrix, we refer to the whole process as CSC($\beta$). It is worth mentioning that the linear program used in CSC1($\beta, x, x'$) is equivalent to (CR) in a sense made clear in Section 7, and provides the same affinity matrix while offering a variety of algorithmic advantages.
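As a concrete illustration of the boxed test, the following is a minimal sketch of CSC1($\beta, x, x'$) via the RCM linear program, using SciPy's generic LP solver; the function name, the solver choice (`scipy.optimize.linprog`), and the use of the sign convention from Eq. (1) are our own assumptions, not prescribed by the paper.

```python
import numpy as np
from scipy.optimize import linprog

def csc1_pair(X, x, x_prime, beta):
    """One cone-membership test CSC1(beta, x, x') via the RCM linear program.

    X          : (n, N) matrix whose columns are the unit-norm samples.
    x, x_prime : the reference sample and the sample being tested (columns of X).
    beta       : the single CSC parameter.
    Returns the 0/1 affinity A(x, x').
    """
    n, N = X.shape
    A = X - np.outer(x, np.ones(N))       # generators of the tangent cone W_N(x)
    s = np.sign(float(x @ x_prime))
    if s == 0.0:                          # sign convention of Eq. (1)
        s = 1.0
    b = s * beta * x_prime - x            # d_beta(x, x')

    # Variables z = [gamma, lambda_1, ..., lambda_N], all nonnegative.
    # RCM: minimize gamma subject to (1 - gamma) b = A lambda, i.e. gamma*b + A lambda = b.
    c = np.zeros(N + 1)
    c[0] = 1.0
    A_eq = np.hstack([b.reshape(-1, 1), A])
    res = linprog(c, A_eq=A_eq, b_eq=b, bounds=(0, None), method="highs")

    gamma_hat = res.x[0]
    return int(round(1.0 - gamma_hat))    # round 1 - gamma_hat to {0, 1}
```

Running this test for every ordered pair, symmetrizing the resulting 0/1 matrix, and feeding it to spectral clustering would give one possible end-to-end CSC($\beta$) pipeline.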
## 3 Theoretical Guarantees

In this section, we discuss our approach to providing theoretical guarantees for the aforementioned implementation of CSC via rays. Let us first set some conventions. We refer to a declaration $x' \in S(x)$ (or $x' \notin S(x)$) as a positive (or a negative), regardless of the ground truth. Hence, a true positive is an affinity of 1 when the samples are from the same subspace, and a false positive is an affinity of 1 when the samples are from different subspaces. We provide guarantees for CSC1($\beta$) to give no false positives, which makes the affinity matrix a permuted block diagonal matrix. In this case, if there are enough well-spread ones in each row of the affinity matrix, spectral clustering or any other reasonable clustering algorithm will be able to perfectly recover the underlying grouping; see graph connectivity in the spectral clustering literature [17]. These two phenomena, no false positives and enough well-spread true positives per sample, are the focus of our theoretical results in Sections 4 and 5.

In a nutshell, the guarantees boil down to characterizing a range of $\beta$'s for which CSC has controlled degrees of error: no false positives and a certain true positive rate per row. We also examine the distribution of true positives recovered by our method and illustrate a favorable spread. From the intuition behind the cone membership test, namely (CR), it is easy to observe that the number of true positives and the number of false positives are both monotonically non-increasing in $\beta$ (which can also be observed in Figure 2). Hence, to have a high number of true positives we need $\beta$ to be bounded above, and to have few false positives we need $\beta$ to be bounded below. To assess the strength of our deterministic results, we assume probabilistic models on the subspaces and/or samples and study the ranges of $\beta$ for which CSC1($\beta$) has controlled errors of both types, with high probability. For the random models, we take the number of subspaces, namely $k$, to be fixed. However, CSC1($\beta$) need not know the number of subspaces, and spectral clustering can use the gap in the eigenvalues of the Laplacian matrix to determine the number of clusters; e.g., see [15]. In the random-sample model, we assume $k$ subspaces are given and the samples from each subspace are drawn uniformly at random from the unit sphere in that subspace. In the random-subspace model, each subspace is chosen independently and uniformly at random with respect to the Haar measure.

### 3.1 Examples

In this section, we illustrate the performance of the CSC method on some small examples. First, we examine the role of the parameter $\beta$ in CSC1($\beta, x, x'$) and its effect on the false positive and true positive rates in practice. In the first experiment, we have $k = 5$ subspaces, each of dimension $d = 5$, in an $n = 10$ dimensional space, and we draw 30 samples from each of the $k$ 5-dimensional subspaces. We then run CSC1($\beta, x, x'$) for a variety of values of $\beta$ between one and six, over 15 random trials; in Figures 2(b), 2(c), and 2(d), we show the results of each trial in thin lines and the means across trials in thick lines (Figure 2(c) shows the median). Figure 2(a) shows, for each value of $\beta$, the histogram of true positive rates across rows. Superimposed on this histogram plot are the empirical mean of the histogram (green solid line) and our theoretical bound from Theorem 6 (purple dashed curve, corresponding to the purple dashed curve in Figure 2(b)): for each value of $\beta$, the true positive rate will be above this curve with high probability. Our theoretical bounds correspond to sufficient but not necessary conditions. While we observe the tightness of the theory for the minimum per-row true positive rate in relation to $\beta$, the wide distribution of per-row true positive rates above the theoretical bound (Figures 2(a) and 2(b)), as well as the spectral clustering step, provide good error rates (Figure 2(c)) even outside the range of $\beta$'s for which we have guarantees (Figure 2(d)).
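For reproducing experiments of this kind, the following is a minimal sketch of a data generator for the two random models above; the function name and the QR-based construction of Haar-distributed bases are our own choices, not taken from the paper.

```python
import numpy as np

def sample_union_of_subspaces(n, d, k, n_per, rng=None):
    """Draw samples from k random d-dimensional subspaces of R^n.

    Subspaces follow the random-subspace model (Haar-distributed bases via QR),
    and samples within each subspace follow the random-sample model
    (uniform on the unit sphere of that subspace).
    Returns X of shape (n, k * n_per) and the ground-truth labels.
    """
    rng = np.random.default_rng(rng)
    cols, labels = [], []
    for t in range(k):
        basis, _ = np.linalg.qr(rng.standard_normal((n, d)))   # orthonormal basis of S_t
        coeffs = rng.standard_normal((d, n_per))
        coeffs /= np.linalg.norm(coeffs, axis=0)                # uniform on the sphere in S_t
        cols.append(basis @ coeffs)
        labels += [t] * n_per
    return np.hstack(cols), np.array(labels)

# Example matching the experiment in Section 3.1: n = 10, d = 5, k = 5, 30 samples each.
# X, y = sample_union_of_subspaces(n=10, d=5, k=5, n_per=30, rng=0)
```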
Figure 2: Illustration of the role of $\beta$ (horizontal axis) in determining (a) the histogram of true positive rates across rows, with the empirical mean (solid line) and the theoretical bound (dashed curve, corresponding to the dashed curve in (b)); (b) the maximum, mean, and minimum (across rows of the affinity matrix) true positive rates, along with the theoretical bound (dashed curve); (c) the clustering mismatch rate after performing spectral clustering; and (d) the false positive rate. This experiment is in a 10-dimensional space, with 30 random samples from each of 5 random 5-dimensional subspaces, over 15 random trials. Bold curves correspond to averages across trials in (a), (b), and (d), but to the median in (c).

Next, we look at the learned affinity matrices output by the proposed CSC method and by SSC [3], which is a widely used benchmark and the foundation of much current subspace clustering research. As described at length in Section D, the true positive rate of SSC is necessarily bounded because of the $\ell_1$ regularization used to learn the affinity matrix. This is not true of CSC; in fact, $\beta$ can be used to control the true positive rate (in an admissible range) as long as it exceeds some lower bound ($\beta \ge \beta_L$). The difference between the true positive rates of SSC and CSC is illustrated in Figure 3. In this experiment, CSC naturally outputs a 0/1 affinity matrix, while the affinity matrix of SSC has a broader diversity of values. We show this matrix and a thresholded version for comparison purposes, where the threshold is set to correspond to a 5% false positive rate.

Figure 3: Affinity matrices for two toy models in an ambient dimension $n = 12$. (a-c) $k = 3$ subspaces, each of rank $d = 4$ and each with $3d = 12$ samples: (a) result of CSC1($\beta$); (b) result of SSC; (c) thresholded version of (b). (d-f) $k = 3$ subspaces, each of rank $d = 6$ and each with $3d = 18$ samples: (d) result of CSC1($\beta$); (e) result of SSC; (f) thresholded version of (e), with the threshold set so that the false positive rate is 5%. As predicted by the theory, CSC achieves higher true positive rates than SSC can.

## 4 Guarantees on True Positive Rates

We study conditions under which a fraction $\rho \in (0, 1)$ of the samples $x' \in S(x)$ are declared as such. As discussed before, the number of true positives is non-increasing in $\beta$. Therefore, we are interested in an upper bound $\beta_{U,\rho}$ on $\beta$ so that CSC1($\beta, x$) with $\beta \le \beta_{U,\rho}$ returns at least $\rho N_t$ true positives ($N_t$ is the number of samples from $S_t$) for any $x \in X_t := X \cap S_t$ and $t = 1, \dots, k$. Consider $\{x\}^\perp := \{y : \langle x, y\rangle = 0\}$. For a closed convex set $A$ containing the origin, denote by $r(A)$ the radius of the largest Euclidean sphere in $\mathrm{span}(A)$ that is centered at the origin and is a subset of $A$.

Theorem 4 (Deterministic condition for any true positive rate). The conic subspace clustering algorithm at $x$ with parameter $\beta$, namely CSC1($\beta, x$), returns a ratio $\rho \in (0, 1)$ of the relationships between $x \in X_t$ and the other samples as true positives, provided that $\beta \le \beta^x_{U,\rho}$, an explicit threshold determined by $\sin^2(\theta^x_t)$ and $\cos(\tilde\theta^x_t)$. Here, for $m := \lceil \rho(N-1) \rceil$, $\cos(\tilde\theta^x_t)$ is the $(m+1)$-st largest value among $|\langle x, x'\rangle|$ for $x' \in X_t$, and, for $r(\cdot)$ denoting the inner radius, $\theta^x_t := \arctan\big(r\big((x + W^t_N(x)) \cap \{x\}^\perp\big)\big)$. Then, CSC1($\beta$) is guaranteed to return a fraction $\rho$ of true positives per sample provided that $\beta \le \beta_{U,\rho} := \min_{x \in X} \beta^x_{U,\rho}$.

As can be seen from the above characterization, $\beta^x_{U,\rho}$ can vary from sample to sample, even within the same subspace. When the samples are drawn uniformly at random from a given subspace (the random-sample model), the next theorem provides a uniform lower bound on the inner radius, and hence on $\theta^x_t$, for all such samples. Note that $\beta^x_{U,\rho}$ is non-decreasing in $\theta^x_t$ and non-increasing in $\tilde\theta^x_t$.

Theorem 5. Under the random-sample model, and for a choice $p_t \in (0, 1)$, with probability at least $1 - p_t$, a solution $\theta$ of
$$
\frac{(\cos\theta)^{d_t - 1}}{6\sqrt{d_t}\,\sin\theta} = \frac{\log(N_t / p_t)}{N_t}
$$
is a lower bound on $\theta^x_t$, which is defined in Theorem 4 and is a function of the inradius of a base of the $t$-th cone $W^t_N(x)$. Theorem 5 is proved in the Appendix using ideas from inversive geometry [1].
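To illustrate how Theorem 5 could be used numerically, the following sketch solves the displayed equation for $\theta$ by bracketing root-finding; it assumes the reading of that equation given above, and the use of `scipy.optimize.brentq` and the function name are our own choices.

```python
import numpy as np
from scipy.optimize import brentq

def theta_lower_bound(d_t, N_t, p_t):
    """Solve the Theorem 5 equation for theta (a high-probability lower bound on theta^x_t).

    d_t : subspace dimension, N_t : number of samples from S_t, p_t : failure probability.
    """
    rhs = np.log(N_t / p_t) / N_t

    # The left-hand side is decreasing in theta on (0, pi/2): it blows up as theta -> 0
    # and vanishes as theta -> pi/2, so a bracketing root-finder applies.
    def lhs_minus_rhs(theta):
        return (np.cos(theta) ** (d_t - 1)) / (6.0 * np.sqrt(d_t) * np.sin(theta)) - rhs

    return brentq(lhs_minus_rhs, 1e-8, np.pi / 2 - 1e-8)

# Example with the Section 3.1 setup: d_t = 5, N_t = 30, p_t = 0.05.
# theta_lb = theta_lower_bound(5, 30, 0.05)
```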
In the random-sample model, we can quantify the aforementioned $m$-th order statistic. Therefore, we can explicitly compute the upper bound $\beta_{U,\rho}$ (with high probability) in terms of the quantities $d_t$ and $N_t$. The final result is given in Theorem 6. Note that both the inradius and the $m$-th order statistic are random variables defined through the samples, and hence are dependent; therefore, a union bound is used.

Theorem 6. Under the random-sample model, CSC1($\beta, x$) for any $x \in X_t$ yields a fraction $\rho$ of true positives with high probability, provided that $\beta \le \beta^x_{u,\rho}$, where $\beta^x_{u,\rho}$ is computed as the threshold in Theorem 4, using the lower bound on $\theta^x_t$ from Theorem 5 and $\tilde\theta^x_t = \frac{\pi}{2} - (\frac{m}{N} + \epsilon)$. The probability is at least $I(\frac{m}{N} + \epsilon;\ m,\ N - m) - p_t$, where $I(\cdot\,;\cdot,\cdot)$ denotes the incomplete Beta function.

## 5 Guarantees for Zero False Positives

In this section, we provide guarantees for CSC1($\beta, x$) to return no false positives, in terms of the value of $\beta$. Specifically, we guarantee this by establishing a lower bound $\beta_L$ for $\beta$ in CSC1($\beta, x$). For a fixed column $x$ of the data matrix $X$, we will use $x'$ as a pointer to any other column of $X$. Recall $d_\beta(x, x')$ from (1) and consider
$$
\beta_L(x) := \inf \{\beta \ge 0 : d_\beta(x, x') \notin W_N(x)\ \ \forall\, x' \notin S(x)\} = \sup \{\beta \ge 0 : d_\beta(x, x') \in W_N(x)\ \text{for some}\ x' \notin S(x)\}. \tag{6}
$$
If this value is finite, then using any value of $\beta$ even slightly larger than it would correctly declare every $x' \notin S(x)$ as such, hence no false positives. However, the above infimum may not be finite for a general configuration. In other words, there might be a sample $x' \notin S(x)$ for which $d_\beta(x, x') \in W_N(x)$ for all values of $\beta \ge 0$. The following condition prohibits such a situation.

Theorem 7 (Deterministic condition for zero false positives). For $x \in X$, without loss of generality, suppose $S(x) = S_1 \cup \dots \cup S_j$ for some $j < k$. Provided that all of the columns of $X$ that are not in $S(x)$ are also not in $W_N(x)$, the quantity $\beta_L(x)$ in (6) is finite. This condition is equivalent to $S_t \cap W_N(x) = \{0\}$ for all $t = j + 1, \dots, k$ and all $x \in X \setminus S_t$. In case this condition holds for all $x \in X$, we define
$$
\beta_L := \max_{x \in X} \beta_L(x). \tag{7}
$$
If this condition is met and $\beta > \beta_L$ is used, then CSC1($\beta$) will return no false positives.

We note that the condition of Theorem 7 becomes harder to satisfy as the number of samples grows (which makes $W_N(x)$ larger). While this is certainly not desirable, such an artifact is present in other subspace clustering algorithms as well; see the discussion after Theorem 1 in [11] for examples. Next, we specialize Theorem 7 to the random-subspace model. Under such a model, for $t = j + 1, \dots, k$, the subspace $S_t$ and the cone $W_N(x)$ are two random objects and are dependent (all samples, including those from $S_t$, take part in forming $W_N(x)$; hence the orientation and the dimension of $S_t$ affect the definition of $W_N(x)$), which makes the analysis harder. However, these two can be decoupled by massaging the condition of Theorem 7 from $S_t \cap W_N(x) = \{0\}$ into an equivalent condition $S_t \cap \widetilde{W}^t_N(x) = \{0\}$, where $\widetilde{W}^t_N(x) = \mathrm{conv\,cone}\{x' - x : x' \notin S_t\}$; see Lemma 10 in the Appendix. Next, the event of a random subspace and a cone having trivial intersection can be studied using the notion of the statistical dimension of the cone and Gordon's celebrated escape-through-a-mesh lemma [6]. The statistical dimension of a closed convex cone $C \subseteq \mathbb{R}^n$ is defined as $\delta(C) := \mathbb{E}\big[\sup_{y \in C \cap \mathbb{S}^{n-1}} \langle y, g\rangle^2\big]$, where $g \sim N(0, I_n)$. Now, we can state the following lemma based on Gordon's Lemma.

Lemma 8. With the notation of Theorem 7, and under the random-subspace model, $\beta_L$ is finite provided that $\delta(\widetilde{W}^t_N(x)) + \dim(S_t) \le n$ for $t = 1, \dots, k$.
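The statistical dimension of a finitely generated cone such as $\widetilde{W}^t_N(x)$ can be estimated by Monte Carlo. The sketch below assumes the standard equivalent characterization $\delta(C) = \mathbb{E}\,\|\Pi_C(g)\|^2$ for closed convex cones and computes the projection by nonnegative least squares; the function name and the solver choice are ours, not from the paper.

```python
import numpy as np
from scipy.optimize import nnls

def statistical_dimension_mc(G, num_trials=200, rng=None):
    """Monte Carlo estimate of the statistical dimension of C = {G @ lam : lam >= 0}.

    Uses delta(C) = E ||Pi_C(g)||^2 with g ~ N(0, I_n); projection onto the
    finitely generated cone is a nonnegative least-squares problem.
    G : (n, M) matrix whose columns generate the cone (e.g., x' - x for x' not in S_t).
    """
    rng = np.random.default_rng(rng)
    n = G.shape[0]
    vals = []
    for _ in range(num_trials):
        g = rng.standard_normal(n)
        lam, _ = nnls(G, g)          # argmin_{lam >= 0} ||G lam - g||
        proj = G @ lam               # Pi_C(g)
        vals.append(proj @ proj)
    return float(np.mean(vals))
```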
Furthermore, for the condition of Lemma 8 to hold, it is sufficient to have $\sum_{t=1}^{k} \dim(S_t) < n$ (Lemma 11 in the Appendix). Under the above conditions, we are guaranteed that a finite $\beta$ exists such that, with high probability, CSC1($\beta$) results in zero false positives.

It is easy to compute $\beta_L$ for certain configurations of subspaces. For example, when the subspaces are independent (the dimension of their Minkowski sum is equal to the sum of their dimensions), we have $\beta_L = 1$ (Lemma 13 in the Appendix). Independent subspaces have been assumed before in the subspace clustering literature for providing guarantees, e.g., [2, 3]; also see [21] and the references therein. However, it remains an open question how one should compute this value for more general configurations. We provide some theoretical tools for such computations in Appendix B.5. Finally, if $\beta_L$ does not exceed $\beta_{U,\rho}$ from above, then, for any $\beta$ between the two, CSC1($\beta$) successfully returns a (permuted) block diagonal matrix with a density of ones (per row) of at least $\rho$. This allows us to have a good idea about the performance of the post-processing step (e.g., spectral clustering), and hence of CSC($\beta$).

## 6 True Positive Rate and Distribution

Because sparse subspace clustering (SSC) relies upon sparse representations, the number of true positives it produces is inherently limited. In fact, it can be shown that SSC will find a representation of each column $x$ as a weighted sum of columns that correspond to the extreme rays of $W_N(x)$, as shown in Lemma 17 in the Appendix. This phenomenon is closely linked to the graph connectivity issues associated with SSC mentioned before. In particular, under the random-sample model, the true positive rate for SSC will go to zero as $N_t/d_t$ grows, where $N_t$ is the number of samples from $S_t$ and $\dim(S_t) = d_t$.

In contrast, the true positive behavior of CSC has several favorable characteristics. First, if the subspaces are all independent, then the true positive rate $\rho$ can approach one. Second, in unfavorable settings in which the true positive rate is low, it can be shown that the true positives are distributed in a way that precludes graph connectivity issues (see Section D.3 for more details). Specifically, in the random-sample model, for each subspace $S_t$ there is a matrix $A_{\mathrm{sub}}$, defined below, whose support is contained within the true positive support of the output of CSC($\beta$) for $\beta \in (\beta_L, \beta_{U,\rho})$. Let $X_t$ have i.i.d. standard normal entries, and let $\epsilon$ be the $m$-th largest element of $|X_t^T X_t|$. Then, $A_{\mathrm{sub}}$ is defined by $(A_{\mathrm{sub}})_{i,j} = |X_t^T X_t|_{i,j}$ when $|X_t^T X_t|_{i,j} > \epsilon$, and zero otherwise. The distribution of $A_{\mathrm{sub}}$, when the columns of $X_t$ are drawn uniformly at random from the unit sphere, ensures that graph connectivity issues are avoided with high probability as soon as the true positive rate $\rho$ exceeds $O(\log N_t / N_t)$. As a result, even if the values of $\rho$ for which $\beta_{U,\rho} > \beta_L$ are small, there is still the potential for perfect clustering. These distributional arguments cannot be made for sparsity-based methods like SSC. We refer to Appendix D for more details.
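Forming the matrix $A_{\mathrm{sub}}$ above is straightforward; the following NumPy sketch is one plausible reading, where excluding the diagonal of $|X_t^T X_t|$ when picking the $m$-th largest entry is our own assumption, as is the function name.

```python
import numpy as np

def a_sub(X_t, m):
    """Thresholded Gram matrix whose support is claimed to lie within CSC's true positives.

    X_t : (d_t, N_t) matrix of samples from one subspace.
    m   : keep entries strictly larger than the m-th largest off-diagonal value of |X_t^T X_t|.
    """
    G = np.abs(X_t.T @ X_t)
    off_diag = G[~np.eye(G.shape[0], dtype=bool)]   # assumption: diagonal excluded
    eps = np.sort(off_diag)[-m]                      # m-th largest off-diagonal entry
    A = np.where(G > eps, G, 0.0)
    np.fill_diagonal(A, 0.0)
    return A
```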
## 7 CSC Optimization and Variations

In Table 1, we provide a number of optimization programs that implement the cone membership test. These formulations possess different computational and robustness properties. Let us introduce a notion of equivalence. We say an optimization program P, implementing the cone membership test, is in the CR-class if the set of its possible optimal values can be divided into two disjoint sets $O_{\mathrm{in}}$ and $O_{\mathrm{out}}$ corresponding to whether $d_\beta(x, x') \in W_N(x)$ or $d_\beta(x, x') \notin W_N(x)$, respectively. We then write [[P : $O_{\mathrm{in}}$, $O_{\mathrm{out}}$]]; e.g., [[(CR) : $\{0\}$, infeasible]]. All of the problems in Table 1 are in the CR-class.

Table 1: Different formulations of the cone membership test (second column), with their sets of optimal values when $d_\beta(x, x') \in W_N(x)$ (third column) and when $d_\beta(x, x') \notin W_N(x)$ (fourth column). In all of the variations, $A = X - x\mathbf{1}_N^T$, the vectors $y$ and $b = d_\beta(x, x')$ live in $\mathbb{R}^n$, and $\lambda$ lives in $\mathbb{R}^N$.

| Tag | Formulation | $O_{\mathrm{in}}$ | $O_{\mathrm{out}}$ |
|-----|-------------|-------------------|--------------------|
| P1 | $\min_{\lambda}\ 0$ s.t. $b = A\lambda,\ \lambda \ge 0$ | $\{0\}$ | infeasible |
| P2 | $\min_{y}\ \langle y, b\rangle$ s.t. $y^T A \ge 0$ | $\{0\}$ | unbounded |
| P3 | $\min_{y}\ \langle y, b\rangle$ s.t. $y^T A \ge 0,\ \langle y, b\rangle \ge -\epsilon$ | $\{0\}$ | $\{-\epsilon\}$ |
| P4 | $\min_{\gamma, \lambda}\ \gamma$ s.t. $(1-\gamma)b = A\lambda,\ \gamma \ge 0,\ \lambda \ge 0$ | $\{0\}$ | $\{1\}$ |

The first optimization problem (P1) is merely the statement of the cone membership test as a linear feasibility program, and (P2) is its Lagrangian dual: (P2) looks for a certificate $y$ in the dual cone $W_N^*(x)$ that rejects the membership of $d_\beta(x, x')$ in $W_N(x)$. However, neither (P1) nor (P2) is robust or computationally appealing. Next, observe that restricting $y$ to any set with the origin in its relative interior yields a program that is still in the CR-class. (P3) is defined by augmenting (P2) with a linear constraint, which not only makes the problem bounded and feasible, but the freedom in choosing $\epsilon$ also allows for controlling the bit-length of the optimal solution, and hence for optimizing the computational complexity of solving (P3) via interior point methods. Furthermore, this program can be solved approximately, up to a precision $\epsilon' \in (0, \epsilon)$, and still provides the same desired set of results: an $\epsilon'$-inexact solution of (P3) has a nonnegative objective value if and only if $d_\beta(x, x') \in W_N(x)$.

If we dualize (P3) and divide the objective by $\epsilon$, we get (P4), which can also be derived by hand-tweaking (P1). However, the duality relationship with (P3) is helpful in understanding the dual space and devising efficient optimization algorithms. Notice that $(\gamma, \lambda) = (1, 0_N)$ is always feasible for (P4), and the optimal value lies in $[0, 1]$. The latter property makes (P4) a suitable candidate for noisy setups without modification. Moreover, [[(P4) : $\{0\}$, $\{1\}$]], which makes it a desirable candidate as a proxy for the indicator of $x' \in S(x)$. We use this program in our experiments reported in Section 3.1.
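For completeness, here is a minimal sketch of the certificate view in (P3): it looks for a dual vector $y$ with $A^T y \ge 0$ whose inner product with $b$ is as negative as the constraint allows; the function name and the use of `scipy.optimize.linprog` are our own choices, not prescribed by the paper.

```python
import numpy as np
from scipy.optimize import linprog

def cone_membership_certificate(A, b, eps=1.0):
    """(P3)-style test: returns True if b appears to lie in the cone {A @ lam : lam >= 0}.

    Minimizes <y, b> over {y : A^T y >= 0, <y, b> >= -eps}. The optimal value is 0 when
    b is in the cone and -eps otherwise; a strictly negative optimum yields a certificate y.
    """
    c = b                                            # objective: minimize b^T y
    A_ub = np.vstack([-A.T, -b.reshape(1, -1)])      # encodes A^T y >= 0 and b^T y >= -eps
    b_ub = np.concatenate([np.zeros(A.shape[1]), [eps]])
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=(None, None), method="highs")
    return res.fun > -eps / 2.0                      # 0 -> inside; -eps -> outside
```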
## 8 Discussions and Future Directions

This paper describes a new paradigm for understanding subspace clustering in relation to the underlying conic geometry. With this new perspective, we design an algorithm, CSC via rays, with guarantees on false and true positive rates and spreads, that sidesteps the graph connectivity issues that arise with methods like sparse subspace clustering. This paper should be seen as a first introduction to the ideas of, and tools for, conic subspace clustering, rather than as establishing CSC as the new state of the art, and as a means to ignite future work in several directions in subspace clustering. We focus on our novel geometric perspective and its potential to lead to new algorithms by providing a rigorous theoretical understanding (statistical and computational), and we hope that this publication will spur discussions and insights that can inform the suggested future work. A cone membership test is just one approach to exploit this geometry and implement the conic subspace clustering paradigm.

Remaining Questions. While more extensive theoretical comparisons with existing methods are necessary, many comparisons are non-trivial because CSC reveals important properties of subspace clustering methods (e.g., the spread of true positives) that are not understood for other methods. The limited small-scale experiments were simply intended to illustrate these properties. Our study of the parameter choice is theoretical in nature and goes beyond heuristics for implementation, but some questions remain open. Firstly, while we have a clear deterministic characterization of $\beta_U$, tighter characterizations would lead to a larger range for $\beta$. In Figure 2(b), such a pursuit would result in a new theoretical curve (instead of the current dashed purple curve) that stays closer to the minimum true positive rate across rows (the lowest thick solid curve). On the other hand, outside of the case of independent subspaces, where $\beta_L = 1$, we only have a deterministic guarantee on the finiteness of $\beta_L$, and computing it under the random-sample model is a topic of current research. Therefore, we do not have a guarantee on the non-triviality of the resulting range $(\beta_L, \beta_U)$. However, as observed in the small numerical examples in Section 3.1, as well as in our more extensive experiments that are not reported here, there often exists a large range of $\beta$ with which we can get perfect clustering.

Extensions. While the presented algorithm assumes noiseless data points from the underlying subspaces, our intuition and simulations (on synthetic and real data) indicate stability with respect to stochastic noise. Moreover, the current analysis is suggestive of algorithmic variants that exhibit robust empirical performance in the presence of stochastic noise. This is why, similarly to advances in other subspace clustering methods, we hope that the analysis for the noiseless setup provides essential insights to provably generalize the method to noisy settings. Furthermore, there remain several other open avenues for exploration, particularly with respect to theoretical and large-scale empirical comparisons with other methods, and extensions to measurements corrupted by adversarial perturbations, with outliers among the data points, as well as with missing entries in the data points. By design, SSC and other similar methods require full knowledge of the data points. CSC imposes the same requirement, and an open question is how to extend the CSC framework when some entries are missing from the data points.

## References

[1] David E. Blair. Inversion Theory and Conformal Mapping, volume 9 of Student Mathematical Library. American Mathematical Society, Providence, RI, 2000.

[2] João Paulo Costeira and Takeo Kanade. A multibody factorization method for independently moving objects. Int. J. Comput. Vis., 29(3):159–179, 1998.

[3] Ehsan Elhamifar and René Vidal. Sparse subspace clustering. In 2009 IEEE Conference on Computer Vision and Pattern Recognition, pages 2790–2797. IEEE, 2009.

[4] Ehsan Elhamifar and René Vidal. Clustering disjoint subspaces via sparse representation. In 2010 IEEE International Conference on Acoustics, Speech and Signal Processing, pages 1926–1929. IEEE, 2010.

[5] Ehsan Elhamifar and René Vidal. Sparse subspace clustering: Algorithm, theory, and applications. IEEE Trans. Pattern Anal. Mach. Intell., 35(11):2765–2781, 2013.

[6] Yehoram Gordon. On Milman's inequality and random subspaces which escape through a mesh in R^n. In Geometric Aspects of Functional Analysis, volume 1317 of Lecture Notes in Mathematics, pages 84–106. Springer, 1988.

[7] Reinhard Heckel and Helmut Bölcskei. Robust subspace clustering via thresholding. IEEE Trans. Inf. Theory, 61(11):6320–6342, 2015.

[8] Can-Yi Lu, Hai Min, Zhong-Qiu Zhao, Lin Zhu, De-Shuang Huang, and Shuicheng Yan. Robust and efficient subspace segmentation via least squares regression. In European Conference on Computer Vision, pages 347–360. Springer, 2012.
[9] Jean-Jacques Moreau. Décomposition orthogonale d'un espace hilbertien selon deux cônes mutuellement polaires. C. R. Acad. Sci. Paris, 255:238–240, 1962.

[10] Behrooz Nasihatkon and Richard Hartley. Graph connectivity in sparse subspace clustering. In 2011 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 2137–2144. IEEE, 2011.

[11] Dohyung Park, Constantine Caramanis, and Sujay Sanghavi. Greedy subspace clustering. In Advances in Neural Information Processing Systems, pages 2753–2761, 2014.

[12] James Renegar. Efficient subgradient methods for general convex optimization. SIAM J. Optim., 26(4):2649–2676, 2016.

[13] Fritz Scholz. Confidence bounds and intervals for parameters relating to the binomial, negative binomial, Poisson and hypergeometric distributions with applications to rare events. 2008.

[14] Mahdi Soltanolkotabi and Emmanuel J. Candès. A geometric analysis of subspace clustering with outliers. Ann. Statist., 40(4):2195–2238, 2012.

[15] Mahdi Soltanolkotabi, Ehsan Elhamifar, and Emmanuel J. Candès. Robust subspace clustering. Ann. Statist., 42(2):669–699, 2014.

[16] René Vidal. Subspace clustering. IEEE Signal Process. Mag., 28(2):52–68, 2011.

[17] Ulrike von Luxburg. A tutorial on spectral clustering. Statistics and Computing, 17(4):395–416, 2007.

[18] Yining Wang, Yu-Xiang Wang, and Aarti Singh. Graph connectivity in noisy sparse subspace clustering. In Artificial Intelligence and Statistics, pages 538–546, 2016.

[19] Yu-Xiang Wang and Huan Xu. Noisy sparse subspace clustering. J. Mach. Learn. Res., 17: Paper No. 12, 41 pp., 2016.

[20] Yu-Xiang Wang, Huan Xu, and Chenlei Leng. Provable subspace clustering: When LRR meets SSC. In Advances in Neural Information Processing Systems, pages 64–72, 2013.

[21] Jingyu Yan and Marc Pollefeys. A general framework for motion segmentation: Independent, articulated, rigid, non-rigid, degenerate and non-degenerate. In European Conference on Computer Vision, pages 94–106. Springer, 2006.