# Hyperplane Clustering via Dual Principal Component Pursuit

Manolis C. Tsakiris, René Vidal

Center for Imaging Science, Johns Hopkins University, Baltimore, MD, USA. Correspondence to: Manolis C. Tsakiris.

Proceedings of the 34th International Conference on Machine Learning, Sydney, Australia, PMLR 70, 2017. Copyright 2017 by the author(s).

Abstract

State-of-the-art methods for clustering data drawn from a union of subspaces are based on sparse and low-rank representation theory and convex optimization algorithms. Existing results guaranteeing the correctness of such methods require the dimension of the subspaces to be small relative to the dimension of the ambient space. When this assumption is violated, as is the case, e.g., for hyperplanes, existing methods are either computationally too intensive (e.g., algebraic methods) or lack sufficient theoretical support (e.g., K-Hyperplanes or RANSAC). In this paper we provide theoretical and algorithmic contributions to the problem of clustering data from a union of hyperplanes, by extending a recent subspace learning method called Dual Principal Component Pursuit (DPCP) to the multi-hyperplane case. We give theoretical guarantees under which the non-convex ℓ1 problem associated with DPCP admits a unique global minimizer equal to the normal vector of the most dominant hyperplane. Inspired by this insight, we propose sequential (RANSAC-style) and iterative (K-Hyperplanes-style) hyperplane learning DPCP algorithms, which, via experiments on synthetic and real data, are shown to outperform or be competitive with the state-of-the-art.

1. Introduction

1.1. Hyperplane clustering

Subspace clustering, the problem of clustering data drawn from a union of linear subspaces, is an important problem in machine learning, pattern recognition and computer vision (Vidal et al., 2016). A particular case of this problem is hyperplane clustering, which arises when the data lie in a union of hyperplanes, as in, e.g., projective motion segmentation (Vidal et al., 2006), 3D point cloud analysis (Sampath & Shan, 2010) and hybrid system identification (Vidal et al., 2003; Bako, 2011). Even though in some ways hyperplane clustering is simpler than general subspace clustering, since, e.g., the dimensions of the subspaces are equal and known a priori, modern self-expressiveness-based methods (Liu et al., 2013; Lu et al., 2012; Elhamifar & Vidal, 2013; Wang et al., 2013; You et al., 2016) in principle do not apply in this case, because they require small relative subspace dimensions d/D, where d and D are the dimensions of the subspace and the ambient space, respectively.

From a theoretical point of view, one of the most appropriate methods for hyperplane clustering is Algebraic Subspace Clustering (ASC), which gives closed-form solutions by means of factorization or differentiation of polynomials (Vidal et al., 2005). However, the main drawback of ASC is its exponential complexity, which makes it impractical in many settings. (The issue of robustness to noise for ASC has been addressed recently in Tsakiris & Vidal, 2015b; 2017c.) Another method that is theoretically justifiable for clustering hyperplanes is Spectral Curvature Clustering (SCC) (Chen & Lerman, 2009), which computes a D-fold affinity between all D-tuples of points in the dataset. As in the case of ASC, SCC has combinatorial complexity and becomes cumbersome for large D.
On the other hand, the intuitive classical method of K-Hyperplanes (KH) (Bradley & Mangasarian, 2000), which alternates between assigning clusters and fitting a new normal vector to each cluster with PCA, is perhaps the most practical method for hyperplane clustering, since it is simple to implement and robust to noise. However, KH is sensitive to outliers and is guaranteed to converge only to a local minimum; hence multiple restarts are in general required. Median K-Flats (MKF) (Zhang et al., 2009) shares a similar objective function with KH, but uses the ℓ1-norm instead of the ℓ2-norm, in an attempt to gain robustness to outliers. The minimization is done via a stochastic gradient descent scheme, and searches directly for a basis of each subspace, which makes it slower to converge for hyperplanes. Finally, any single robust subspace learning method suitable for high relative dimensions, such as RANSAC (Fischler & Bolles, 1981) or REAPER (Lerman et al., 2015), can be applied either i) in a sequential fashion, by first learning the most dominant hyperplane, removing the points lying close to it, learning the second most dominant hyperplane, and so on, or ii) in an iterative fashion, by assigning points to clusters, fitting a hyperplane per cluster, reassigning the points to new clusters, and so on.

1.2. Dual principal component pursuit

Dual Principal Component Pursuit (DPCP) (Tsakiris & Vidal, 2015a; 2017a) is an ℓ1 single subspace learning method, which aims at recovering the orthogonal complement of the subspace in the presence of outliers, and as such it is particularly suited for hyperplanes. DPCP searches for the normal vector to a hyperplane by solving a non-convex ℓ1 minimization problem on the sphere, or a recursion of linear programming relaxations; under certain conditions, the normal to the hyperplane is the unique global solution to this non-convex ℓ1 problem, as well as the limit point of the LP recursion. Motivated by the robustness of DPCP to outliers, one could naively use it for hyperplane clustering by recovering the normal vector to one hyperplane at a time, while treating points from other hyperplanes as outliers. However, such a scheme is not a priori guaranteed to succeed, because the theorems establishing the correctness of DPCP assume that the outliers are uniformly distributed on the sphere, an assumption which is violated when the data come from a union of hyperplanes.

1.3. Paper contributions

In this paper we provide a theoretical analysis of the non-convex ℓ1 DPCP problem for data drawn from a union of hyperplanes. We show that as long as the hyperplanes are sufficiently separated, the dominant hyperplane is sufficiently dominant and the points are uniformly distributed (in a deterministic sense) inside their associated hyperplanes, the normal vector of the dominant hyperplane is the unique (up to sign) global minimizer of the DPCP problem. This suggests a DPCP-based sequential hyperplane learning algorithm, which uses DPCP to compute a dominant hyperplane, then a second dominant hyperplane, and so on. Experiments on synthetic data show that this DPCP-based algorithm significantly improves over similar sequential algorithms based on RANSAC or REAPER.
Finally, 3D plane clustering experiments on real 3D point clouds show that an iterative (KH-style) DPCP scheme is very competitive with RANSAC, which is the predominant state-of-the-art method for such applications.

2. Preliminaries

2.1. Data model and the hyperplane clustering problem

Consider a collection X = [x_1, . . . , x_N] ∈ R^{D×N} of N points of the unit sphere S^{D−1} of R^D that lie in a union (arrangement) A of n hyperplanes H_1, . . . , H_n of R^D, i.e., X ⊂ A = ∪_{i=1}^n H_i, where each hyperplane H_i is the set of points of R^D that are orthogonal to a normal vector b_i ∈ S^{D−1}, i.e., H_i = {x ∈ R^D : x^⊤ b_i = 0}, i ∈ [n] := {1, . . . , n}. We assume that the data X lie in general position in A, by which we mean two things. First, we mean that there are no linear relations among the points other than the ones induced by their membership to the hyperplanes. In particular, every D−1 points coming from H_i form a basis for H_i, and any D points of X that come from at least two distinct hyperplanes are linearly independent. Second, we mean that the points X uniquely define the hyperplane arrangement A, in the sense that A is the only arrangement of n hyperplanes that contains X. This can be verified computationally by checking that there is only one, up to scale, homogeneous polynomial of degree n that fits the data; see Vidal et al. 2005; Tsakiris & Vidal 2017c for details.

We assume that for every i ∈ [n], precisely N_i points of X, denoted by X_i = [x_1^{(i)}, . . . , x_{N_i}^{(i)}], belong to H_i, with Σ_{i=1}^n N_i = N. With that notation, X = [X_1, . . . , X_n]Γ, where Γ is an unknown permutation matrix, indicating that the hyperplane membership of the points is unknown. Moreover, we assume an ordering N_1 ≥ N_2 ≥ · · · ≥ N_n, and we refer to H_1 as the dominant hyperplane. After these preparations, the problem of hyperplane clustering can be stated as follows: given the data X, find the number n of hyperplanes associated to X, a normal vector to each hyperplane, and a clustering of the data X = ∪_{i=1}^n X_i according to hyperplane membership.

2.2. Review of dual principal component pursuit

Dual Principal Component Pursuit (DPCP) (Tsakiris & Vidal, 2017a) is a robust single subspace learning method. Given unlabeled data X, which consist of inliers in a single subspace S of R^D of dimension d < D, together with outliers to the subspace, DPCP computes a basis for the orthogonal complement of the inlier subspace S. The key idea of DPCP is to identify a single hyperplane H with normal vector b that is maximal with respect to the data X. Such a maximal hyperplane is defined by the property that it must contain a maximal number of points N_H from the dataset, i.e., N_H ≥ N_{H'} for any other hyperplane H' of R^D. Notice that such a maximal hyperplane can be characterized as a solution to the combinatorial problem

  min_b ‖X^⊤ b‖_0  s.t.  b ≠ 0,   (1)

since ‖X^⊤ b‖_0 is the number of non-zero entries of X^⊤ b, which is precisely the number of data points in X that lie outside the hyperplane defined by b. If S is a hyperplane, i.e., dim S = D−1, and if there are at least dim S + 1 = D inliers, it is straightforward to show that (1) has a unique up to scale global minimizer, the normal vector to the inlier hyperplane. Since (1) is hard to solve, we relax it to

  min_b ‖X^⊤ b‖_1  s.t.  ‖b‖_2 = 1,   (2)

which is still challenging to solve, since it is a non-smooth non-convex optimization problem on the sphere. Problem (2) has appeared several times in the literature (Späth & Watson, 1987; Spielman et al., 2013; Qu et al., 2014; Sun et al., 2015).
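To make the role of problems (1) and (2) concrete, the following small numerical illustration (our own sketch, not code from the paper) builds a single inlier hyperplane plus outliers and evaluates both objectives at the true normal and at a random direction; the ℓ0 count drops from (almost surely) N at a random direction to the number of outliers at the true normal, and the ℓ1 value behaves analogously.

```python
import numpy as np

rng = np.random.default_rng(1)
D, N_in, N_out = 5, 200, 50

# One inlier hyperplane with unit normal b_true, plus outliers on the sphere.
b_true = rng.standard_normal(D); b_true /= np.linalg.norm(b_true)
U = np.linalg.svd(np.eye(D) - np.outer(b_true, b_true))[0][:, :D - 1]  # basis of the hyperplane
inliers = U @ rng.standard_normal((D - 1, N_in))
outliers = rng.standard_normal((D, N_out))
X = np.concatenate([inliers, outliers], axis=1)
X /= np.linalg.norm(X, axis=0)                      # unit-norm columns, as in the data model

def l0_objective(b, tol=1e-9):
    # Problem (1): number of points lying outside the hyperplane with normal b.
    return int(np.sum(np.abs(X.T @ b) > tol))

def l1_objective(b):
    # Problem (2): the l1 surrogate, evaluated on the sphere.
    return float(np.sum(np.abs(X.T @ b)))

b_rand = rng.standard_normal(D); b_rand /= np.linalg.norm(b_rand)
print(l0_objective(b_true), l0_objective(b_rand))   # N_out versus (almost surely) N_in + N_out
print(l1_objective(b_true), l1_objective(b_rand))   # the true normal typically gives the smaller value
```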
In fact, Späth & Watson 1987 proved the following fascinating result.

Proposition 1 (Späth & Watson, 1987) Let X be a D × N matrix of rank D. Then any global minimizer of (2) must be orthogonal to D−1 linearly independent columns of X.

Proposition 1 establishes an encouraging property of problem (2) towards recovering the normal vector of the inlier hyperplane as its global minimizer. Indeed, it would suffice that the D−1 linearly independent points of X to which a global minimizer is orthogonal be points of the inlier hyperplane. Even though a priori it is not clear under what conditions this is the case, Tsakiris & Vidal 2015a provided an answer, informally stated as follows.

Proposition 2 (Tsakiris & Vidal, 2015a) Suppose that the inliers are sufficiently uniformly distributed (in a deterministic sense defined in Grabner et al. 1997) inside the intersection of the inlier hyperplane and the (unit) sphere, and that the outliers are sufficiently uniformly distributed on the sphere. Then (2) has a unique up to sign global minimizer, equal to the normal vector of the inlier hyperplane.

It was further shown in Tsakiris & Vidal 2015a that under the conditions of Proposition 2, and assuming that n̂_0 is a unit ℓ2-norm vector sufficiently far from the inlier hyperplane, the recursion of linear programs

  n̂_{k+1} := argmin_{b : b^⊤ n̂_k = 1} ‖X^⊤ b‖_1,   (3)

will converge in a finite number of iterations to the global minimizer of (2). (Remarkably, (3) was first proposed in Späth & Watson 1987 as a means of solving (2), where it was established that it converges in a finite number of iterations to a critical point of (2).)
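To make the recursion (3) concrete, here is a minimal sketch of one way to implement it with an off-the-shelf LP solver (SciPy's linprog; the paper itself uses GUROBI for the related DPCP-r algorithm of Section 4.1). The absolute values are handled with the standard auxiliary variables t, and the iterate is renormalized to unit length, which does not change the sequence of directions. This is our own illustrative sketch, not the authors' code, and the stopping rule is an assumption.

```python
import numpy as np
from scipy.optimize import linprog

def dpcp_lp_recursion(X, n_hat, iters=20):
    # Recursion (3): n_{k+1} = argmin ||X^T b||_1  s.t.  b^T n_k = 1,
    # written as an LP in (b, t) with constraints -t <= X^T b <= t.  X is D x N.
    D, N = X.shape
    c = np.concatenate([np.zeros(D), np.ones(N)])           # minimize 1^T t
    A_ub = np.block([[X.T, -np.eye(N)], [-X.T, -np.eye(N)]])
    b_ub = np.zeros(2 * N)
    bounds = [(None, None)] * D + [(0, None)] * N
    for _ in range(iters):
        A_eq = np.concatenate([n_hat, np.zeros(N)])[None, :]
        res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=[1.0],
                      bounds=bounds, method="highs")
        b = res.x[:D]
        b /= np.linalg.norm(b)                               # renormalize the iterate
        if abs(b @ n_hat) > 1.0 - 1e-9:                      # direction no longer changes
            return b
        n_hat = b
    return n_hat
```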
2.3. Hyperplane clustering via DPCP?

Given the discussion in §2.2, the DPCP problem (2) seems a natural mechanism for retrieving the normal vectors to the hyperplanes associated with a dataset X lying in a hyperplane arrangement. Indeed, one may be tempted to solve problem (2) for such a dataset with the hope of obtaining a unique global minimizer which is orthogonal to one of the underlying hyperplanes. In this case, points coming from the remaining hyperplanes are treated as outliers. Such an idea would give rise to sequential and iterative DPCP hyperplane clustering algorithms as described in §1.1. A sufficient condition for the correctness of this procedure would be that the global minimizer of the DPCP problem (2) be orthogonal to the inlier subspace. However, the conditions for this to be the case do not immediately follow from the work of Tsakiris & Vidal 2015a, since in the case of hyperplane clustering the outliers lie in a union of n−1 hyperplanes, and thus cannot be uniformly distributed on the sphere, as the conditions of Tsakiris & Vidal 2015a require. The rest of the paper is devoted to providing such theoretical guarantees (§3), as well as introducing DPCP-based hyperplane clustering algorithms (§4).

3. Theoretical Contributions

3.1. Theoretical analysis of the continuous problem

As it turns out, one can gain important insights about the analysis of the DPCP problem (2) for data in a hyperplane arrangement by first analyzing a certain continuous problem. To see what that problem is, let Ĥ_i = H_i ∩ S^{D−1}, and note first that for any b ∈ S^{D−1} we have

  (1/N_i) Σ_{j=1}^{N_i} |b^⊤ x_j^{(i)}| ≈ ∫_{x∈Ĥ_i} |b^⊤ x| dμ_{Ĥ_i},   (4)

where the LHS of (4) is precisely (1/N_i)‖X_i^⊤ b‖_1 and can be viewed as an approximation (≈) via the point set X_i of the integral on the RHS of (4), with μ_{Ĥ_i} denoting the uniform measure on Ĥ_i.

Letting θ_i be the principal angle between b and b_i, i.e., the unique angle θ_i ∈ [0, π/2] such that cos(θ_i) = |b^⊤ b_i|, and π_{H_i} : R^D → H_i the orthogonal projection onto H_i, we have for any x ∈ H_i that

  b^⊤ x = b^⊤ π_{H_i}(x) = (π_{H_i}(b))^⊤ x   (5)
  = h_{i,b}^⊤ x = sin(θ_i) ĥ_{i,b}^⊤ x,   (6)

with h_{i,b} := π_{H_i}(b) and ĥ_{i,b} := h_{i,b}/‖h_{i,b}‖_2. Hence,

  ∫_{x∈Ĥ_i} |b^⊤ x| dμ_{Ĥ_i} = sin(θ_i) ∫_{x∈Ĥ_i} |ĥ_{i,b}^⊤ x| dμ_{Ĥ_i}   (7)
  = sin(θ_i) ∫_{x∈S^{D−2}} |x_1| dμ_{S^{D−2}} = c sin(θ_i),   (8)

c being the average height of the unit hemisphere of R^{D−1}. We can now view the objective function of (2),

  ‖X^⊤ b‖_1 = Σ_{i=1}^n ‖X_i^⊤ b‖_1,   (9)

as an approximation via X of the function

  J(b) := Σ_{i=1}^n N_i ∫_{x∈Ĥ_i} |b^⊤ x| dμ_{Ĥ_i} = Σ_{i=1}^n N_i c sin(θ_i).   (10)

In that sense, the continuous counterpart of problem (2) is (dropping the constant factor c, which does not affect the minimizers)

  min_{b∈S^{D−1}} J(b) := N_1 sin(θ_1) + · · · + N_n sin(θ_n).   (11)

Note that sin(θ_i) is the distance between the line spanned by b and the line spanned by b_i (recall that θ_i is a principal angle, i.e., θ_i ∈ [0, π/2]). Hence, every global minimizer b* of problem (11) minimizes the sum of the weighted distances of Span(b*) from Span(b_1), . . . , Span(b_n), and thus represents a weighted geometric median of these lines. Even though medians in Riemannian/Grassmannian manifolds are an active subject of research (Draper et al., 2014; Ghalieh & Hajja, 1996), we are not aware of any literature that studies (11). The advantage of working with (11) instead of (2) is that the global minimizers of (11) depend solely on the weights N_i as well as on the geometry of the arrangement, captured by the principal angles θ_ij between b_i and b_j. In contrast, the global minimizers of the discrete problem (2) will in principle also depend on the distribution of the points X. From that perspective, understanding when problem (11) has b_1 as its unique solution is essential for understanding the potential of (2) for hyperplane clustering. Towards that end, we next provide a series of results pertaining to (11); all proofs can be found in Tsakiris & Vidal 2017b.

The first configuration that we examine is that of two hyperplanes. In that case the weighted geometric median of the two lines spanned by the normals to the hyperplanes always corresponds to one of the two normals:

Proposition 3 Consider an arrangement of two hyperplanes in R^D with normal vectors b_1, b_2 and weights N_1 ≥ N_2. Then the set B of global minimizers of (11) satisfies:
1. If N_1 = N_2, then B = {±b_1, ±b_2}.
2. If N_1 > N_2, then B = {±b_1}.

When N_1 > N_2, problem (11) recovers the normal b_1 to the dominant hyperplane irrespectively of how separated the two hyperplanes are, since, according to Proposition 3, the principal angle θ_12 between b_1, b_2 does not play a role. The continuous problem (11) is equally favorable in recovering normal vectors as global minimizers in another extreme situation, where the arrangement consists of up to D perfectly separated (orthogonal) hyperplanes:

Proposition 4 Consider n ≤ D hyperplanes in R^D with orthogonal normal vectors b_1, . . . , b_n, and weights N_1 ≥ N_2 ≥ · · · ≥ N_n. Then the set B of global minimizers of (11) can be characterized as follows:
1. If N_1 = · · · = N_n, then B = {±b_1, . . . , ±b_n}.
2. If N_1 = · · · = N_ℓ > N_{ℓ+1} ≥ · · · ≥ N_n, for some ℓ ∈ [n−1], then B = {±b_1, . . . , ±b_ℓ}.

Propositions 3 and 4 are not hard to prove, since for two hyperplanes the objective function is strictly concave, while for orthogonal hyperplanes it is separable. In contrast, the problem becomes harder for n > 2 arbitrary hyperplanes. Even when n = 3, characterizing the global minimizers of (11) as a function of the geometry and the weights seems challenging.
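Although a general characterization is difficult, the behavior of (11) for a given configuration is easy to probe numerically. The following sketch (ours, with arbitrary example values) evaluates J on a dense random sample of the sphere for two hyperplanes at 45 degrees with N_1 > N_2, and recovers a minimizer aligned with ±b_1, as Proposition 3 predicts; the same brute-force probe can be applied to the n = 3 configurations discussed next.

```python
import numpy as np

def J(b, B, N):
    # Continuous objective (11): sum_i N_i sin(theta_i), with theta_i the principal
    # angle between b and the normal b_i (the i-th column of B).
    cos_t = np.clip(np.abs(B.T @ b), 0.0, 1.0)
    return float(N @ np.sqrt(1.0 - cos_t**2))

rng = np.random.default_rng(2)
D = 4
Q = np.linalg.qr(rng.standard_normal((D, 2)))[0]            # two orthonormal directions
b1, b2 = Q[:, 0], (Q[:, 0] + Q[:, 1]) / np.sqrt(2.0)        # normals 45 degrees apart
B, N = np.column_stack([b1, b2]), np.array([300.0, 200.0])  # N_1 > N_2

# Brute-force probe of (11) over a dense random sample of the sphere.
samples = rng.standard_normal((D, 200000))
samples /= np.linalg.norm(samples, axis=0)
cos_T = np.abs(B.T @ samples)
vals = N @ np.sqrt(np.clip(1.0 - cos_T**2, 0.0, None))
b_star = samples[:, vals.argmin()]
print(abs(b_star @ b1))   # close to 1: the (approximate) minimizer is +- b_1
```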
Nevertheless, when the three hyperplanes are equiangular and their weights are equal, the symmetry of the configuration allows us to analytically characterize the median as a function of the angle of the arrangement.

Proposition 5 Consider three hyperplanes of R^D, with normal vectors b_1, b_2, b_3 such that b_i^⊤ b_j = cos(θ) > 0 for i ≠ j, and N_1 = N_2 = N_3. Then the set B of global minimizers of (11) satisfies the following phase transition:
1. If θ > 60°, then B = {±b_1, ±b_2, ±b_3}.
2. If θ = 60°, then B = {±b_1, ±b_2, ±b_3, ±μ}.
3. If θ < 60°, then B = {±μ},
where μ := (b_1 + b_2 + b_3)/‖b_1 + b_2 + b_3‖_2.

Proposition 5, whose proof uses nontrivial arguments from spherical and algebraic geometry, is particularly enlightening, since it suggests that the global minimizers of (11) are associated to the normals of the underlying arrangement when the hyperplanes are sufficiently separated, while otherwise they seem to be capturing the median hyperplane of the arrangement. This is in striking similarity with the results regarding the Fermat point of planar and spherical triangles (Ghalieh & Hajja, 1996). However, when the symmetry in Proposition 5 is removed, our proof technique no longer applies, and the problem seems even harder. Even so, one intuitively expects an interplay between the angles and the weights of the arrangement under which, if the hyperplanes are sufficiently separated and H_1 is sufficiently dominant, then (11) should have a unique global minimizer equal to b_1. Our next theorem formalizes this intuition.

Theorem 6 Consider n ≥ 3 hyperplanes in R^D, with normals b_1, . . . , b_n of pairwise principal angles θ_ij and weights N_i. Define an (n−1) × (n−1) matrix Θ with (i−1, j−1) entry given by N_i N_j cos(θ_ij), 1 < i, j ≤ n, and maximum eigenvalue σ_max(Θ). If

  N_1 > √(α² + β²),   (12)

where

  α := Σ_{i>1} N_i sin(θ_{1i}) − √( Σ_{i>1} N_i² − σ_max(Θ) ) ≥ 0,   (13)

  β := √( Σ_{i>1} N_i² + 2 Σ_{1<i<j≤n} Θ_{i−1,j−1} ),   (14)

and

  γ := min_{j≠1} Σ_{i≠j} N_i sin(θ_ij) − Σ_{i>1} N_i sin(θ_{1i}) > 0,   (15)

then problem (11) admits a unique up to sign global minimizer b* = ±b_1.

Let us provide some intuition about the meaning of the quantities α, β and γ in Theorem 6. To begin with, the first term in α is precisely equal to J(b_1), while the second term in α can be shown to be a lower bound on the objective function N_2 sin(θ_2) + · · · + N_n sin(θ_n) that one obtains if one discards hyperplane H_1. Moving on, the quantity β/N_1 admits a nice geometric interpretation: cos⁻¹(β/N_1) is a lower bound on how small the principal angle of a critical point b ≠ ±b_1 from b_1 can be. Interestingly, the larger N_1 is, the larger this minimum angle is, which shows that critical hyperplanes H that are distinct from H_1 must be sufficiently separated from H_1. Finally, the second term in γ is J(b_1), while the first term is the smallest objective value that corresponds to b = ±b_i, i > 1, and so (15) simply guarantees that J(b_1) < J(b_i), i > 1.

Next, condition N_1 > √(α² + β²) of Theorem 6 is easier to satisfy when H_1 is close to the rest of the hyperplanes (which leads to small α), while the rest of the hyperplanes are sufficiently separated (which leads to small α and small β); we emphasize that the interpretation of "close" and "sufficiently separated" here is relative to N_1 and θ_12, . . . , θ_1n. Regardless, one can show that √(α² + β²) ≤ Σ_{i>1} N_i, and so if N_1 > Σ_{i>1} N_i, then any global minimizer of (11) has to be one of the normals, irrespectively of what the angles θ_ij are. Finally, condition (15) is consistent with condition (12) in that it requires H_1 to be close to H_i, i > 1, and H_i, H_j to be sufficiently separated for i, j > 1. Once again, (15) can always be satisfied irrespectively of the θ_ij by choosing N_1 sufficiently large, since only the positive term in the definition of γ depends on N_1.
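The quantities of Theorem 6 are straightforward to evaluate numerically for a concrete arrangement. The sketch below is our own helper, written against the definitions (12)-(15) as stated above (for the authoritative statement see Tsakiris & Vidal 2017b), and reports α, β, γ together with whether the sufficient conditions hold. For instance, with three nearly orthogonal normals and weights (500, 200, 200) the check passes, while making H_2 and H_3 nearly coincident eventually drives √(α² + β²) above N_1.

```python
import numpy as np

def theorem6_quantities(B, N):
    # Evaluate alpha, beta, gamma of Theorem 6 for an arrangement, following the
    # definitions (13)-(15) above (see Tsakiris & Vidal 2017b for the exact statement).
    # B: D x n array of unit normals b_1, ..., b_n;  N: weights, with N[0] = N_1.
    C = np.abs(B.T @ B)                            # cos(theta_ij), principal angles
    S = np.sqrt(np.clip(1.0 - C**2, 0.0, None))    # sin(theta_ij)
    Theta = np.outer(N[1:], N[1:]) * C[1:, 1:]     # the (n-1) x (n-1) matrix of Theorem 6
    sigma_max = np.linalg.eigvalsh(Theta).max()
    J1 = float(N[1:] @ S[0, 1:])                   # J(b_1) = sum_{i>1} N_i sin(theta_1i)
    alpha = J1 - np.sqrt(max(np.sum(N[1:] ** 2) - sigma_max, 0.0))
    beta = np.sqrt(Theta.sum())                    # (14): the sum of all entries of Theta
    Jj = [float(np.delete(N, j) @ np.delete(S[:, j], j)) for j in range(1, len(N))]
    gamma = min(Jj) - J1                           # (15)
    holds = (N[0] > np.hypot(alpha, beta)) and (gamma > 0)
    return alpha, beta, gamma, holds
```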
3.2. Theoretical analysis of the discrete problem

We now study the discrete formulation of DPCP, i.e., problem (2), for the case where X = [X_1, . . . , X_n]Γ, with X_i being N_i points in H_i, as described in §2.1 (more detailed arguments and proofs can be found in Tsakiris & Vidal 2017b). For any i ∈ [n] and b ∈ S^{D−1}, we can write the quantity ‖X_i^⊤ b‖_1 as

  ‖X_i^⊤ b‖_1 = Σ_{j=1}^{N_i} |b^⊤ x_j^{(i)}| = b^⊤ Σ_{j=1}^{N_i} Sign(b^⊤ x_j^{(i)}) x_j^{(i)}   (16)
  = N_i b^⊤ x̄_{i,b},  where  x̄_{i,b} := (1/N_i) Σ_{j=1}^{N_i} Sign(b^⊤ x_j^{(i)}) x_j^{(i)},   (17)

with x̄_{i,b} being the average point of X_i with respect to the orthogonal projection h_{i,b} := π_{H_i}(b) of b onto H_i. The vector x̄_{i,b} can be viewed as an approximation to the vector integral

  ∫_{x∈Ĥ_i} Sign(b^⊤ x) x dμ_{Ĥ_i} = c ĥ_{i,b}.   (18)

This leads us to define the maximal approximation error

  ε_i := max_{b∈S^{D−1}} ‖x̄_{i,b} − c ĥ_{i,b}‖_2,   (19)

as b ranges over the entire unit sphere S^{D−1}. Intuitively, the more uniformly distributed the points X_i are inside Ĥ_i, the smaller ε_i is. This intuition can be formalized by means of the spherical cap discrepancy (Grabner et al., 1997; Grabner & Tichy, 1993) of X_i, given by

  S_D(X_i) := sup_C | (1/N_i) Σ_{j=1}^{N_i} I_C(x_j^{(i)}) − μ_{Ĥ_i}(C) |.   (20)

In (20) the supremum is taken over all spherical caps C of the sphere Ĥ_i ≅ S^{D−2}, where a spherical cap is the intersection of S^{D−2} with a half-space of R^{D−1}, and I_C(·) is the indicator function of C, which takes the value 1 inside C and zero otherwise. S_D(X_i) is a deterministic measure of the uniformity of the point set X_i. By adjusting an argument of Harman 2010, one can show that

  ε_i ≤ 5 S_D(X_i),  i ∈ [n],   (21)

which confirms that uniformly distributed points X_i correspond to small ε_i. We note here that S_D(X_i) decreases with a rate of (Dick, 2014; Beck, 1984)

  N_i^{−(1/2 + 1/(2(D−2)))}.   (22)

To state the main theorem of this section (Theorem 8) we need a definition.

Definition 7 For a set Y = [y_1, . . . , y_L] ⊂ S^{D−1} and a positive integer K ≤ L, define R_{Y,K} to be the maximum circumradius among the circumradii of all polytopes { Σ_{i=1}^K α_{j_i} y_{j_i} : α_{j_i} ∈ [−1, 1] }, where j_1, . . . , j_K are distinct integers in [L], and the circumradius of a closed bounded set is the minimum radius among all spheres that contain the set.

We now define the quantity of interest as

  R := max_{K_1+···+K_n = D−1, 0 ≤ K_i ≤ D−2}  Σ_{i=1}^n R_{X_i, K_i}.   (23)

We note that it is always the case that R_{X_i,K_i} ≤ K_i, with this upper bound achieved when X_i contains K_i colinear points. Combining this fact with the constraint Σ_i K_i = D−1 in (23), we get that R ≤ D−1, and the more uniformly distributed the points X are inside the hyperplanes, the smaller R is (even though R does not go to zero).

Theorem 8 Let b* be a global minimizer of (2) with X = [X_1, . . . , X_n]Γ, and suppose that

  N_1 > √(ᾱ² + β̄²),   (24)

where

  ᾱ := α + c⁻¹(· · ·),   (25)
  β̄ := β + c⁻¹( R + Σ_i ε_i N_i ),   (26)

with α, β as in Theorem 6, and suppose further that

  γ̄ := γ − c⁻¹( ε_1 N_1 + ε_2 N_2 + 2 Σ(· · ·) ) > 0,   (27)

with γ as in Theorem 6. Then problem (2) admits a unique up to sign global minimizer equal to ±b_1.

Notice the similarity of the conditions N_1 > √(ᾱ² + β̄²), γ̄ > 0 of Theorem 8 with the conditions N_1 > √(α² + β²), γ > 0 of Theorem 6. In fact ᾱ > α, β̄ > β and γ̄ < γ, which implies that the conditions of Theorem 8 are strictly stronger than those of Theorem 6. This is no surprise since, as we have already remarked, the global minimizers of (2) depend not only on the geometry {θ_ij} and the weights {N_i} of the hyperplane arrangement, but also on the distribution of the data points (parameters ε_i and R). In contrast to condition (12) of Theorem 6, N_1 now appears on both sides of condition (24) of Theorem 8, which is however harmless: under the assumption c > 2ε_1, (24) is equivalent to the positivity of a quadratic polynomial in N_1 whose leading coefficient is positive, and hence (24) can always be satisfied for sufficiently large N_1. Another interesting connection of Theorem 6 to Theorem 8 is that, assuming lim_{N_i→∞} S_D(X_i) = 0, Theorem 6 can be seen as a limit version of Theorem 8: dividing (24) and (27) by N_1, letting N_1, . . . , N_n go to infinity while keeping each ratio N_i/N_1 fixed, recalling that R ≤ D−1, and noting that in view of (21) we have lim_{N_i→∞} ε_i = 0, we see that in the limit we recover the conditions of Theorem 6. Finally, a theorem of the same flavor gives conditions under which (3) converges in a finite number of iterations to b_1 or −b_1; see Theorem 7 in Tsakiris & Vidal 2017b.
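The approximation behind (17)-(19) is easy to visualize numerically. The following self-contained sketch (ours, with arbitrary dimensions) draws points uniformly from a hyperplane intersected with the sphere, forms the average point x̄_{i,b} of (17), and checks that it is close to c·ĥ_{i,b} as in (18), with c estimated by Monte Carlo as the average height of the unit hemisphere.

```python
import numpy as np

rng = np.random.default_rng(0)
D, Ni = 10, 5000

# A hyperplane H_i with unit normal b_i, and an orthonormal basis U of H_i.
b_i = rng.standard_normal(D); b_i /= np.linalg.norm(b_i)
U = np.linalg.svd(np.eye(D) - np.outer(b_i, b_i))[0][:, :D - 1]

# N_i points drawn uniformly from H_i intersected with the unit sphere.
G = rng.standard_normal((D - 1, Ni))
X_i = U @ (G / np.linalg.norm(G, axis=0))

# A generic direction b and its normalized projection h_hat onto H_i (cf. (6)).
b = rng.standard_normal(D); b /= np.linalg.norm(b)
h = b - (b_i @ b) * b_i
h_hat = h / np.linalg.norm(h)

# Average point (17) and the constant c of (8), estimated by Monte Carlo.
x_bar = (np.sign(X_i.T @ b)[None, :] * X_i).mean(axis=1)
V = rng.standard_normal((D - 1, 200000)); V /= np.linalg.norm(V, axis=0)
c = np.abs(V[0]).mean()                     # average height of the unit hemisphere of R^{D-1}

print(np.linalg.norm(x_bar - c * h_hat))    # small; an instance of the error bounded in (19)
```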
4. Algorithmic Contributions

4.1. DPCP via iteratively reweighted least squares

Späth & Watson 1987 and Tsakiris & Vidal 2015a propose solving the non-convex problem (2) by means of the recursion of convex optimization problems (3), referred to as DPCP-r. This is computationally equivalent to a recursion of linear programs, which can be solved efficiently by an optimized LP solver such as GUROBI. However, these linear programs are in principle not sparse, which may render the running time of this approach prohibitive for big-data applications. To alleviate this issue, we solve (2) by standard Iteratively Reweighted Least Squares (IRLS) applied to ℓ1 minimization problems (Candès et al., 2008; Chartrand & Yin, 2008; Daubechies et al., 2010; Lerman et al., 2015). The resulting algorithm, referred to as DPCP-IRLS, is dramatically faster than solving DPCP-r by GUROBI: a MATLAB implementation on a standard MacBook Pro with a dual-core 2.5GHz processor and a total of 4GB cache memory is able to handle 6000 points of R^{1000} in about one minute, while in such a regime DPCP-r seems, as of now, inapplicable. Moreover, the performance of DPCP-IRLS, investigated in §5, suggests that DPCP-IRLS converges most of the time to a global minimizer of (2); the theoretical justification of this claim is ongoing research.
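As an illustration of the IRLS idea, here is a minimal sketch (ours, not the authors' MATLAB implementation): each iteration reweights the points by the inverse of their current distance to the candidate hyperplane and re-solves the weighted least-squares problem on the sphere, whose solution is the eigenvector of X W X^⊤ associated with the smallest eigenvalue. The damping constant delta, the initialization and the stopping rule are assumptions; the exact scheme used by DPCP-IRLS may differ.

```python
import numpy as np

def dpcp_irls(X, delta=1e-9, max_iter=100, tol=1e-3):
    # IRLS sketch for the DPCP problem (2): min ||X^T b||_1 s.t. ||b||_2 = 1.
    # X is D x N; returns a unit-norm candidate normal vector.
    D, N = X.shape
    b = np.linalg.svd(X)[0][:, -1]                # least significant direction of X (PCA-style init)
    for _ in range(max_iter):
        w = 1.0 / np.maximum(np.abs(X.T @ b), delta)          # IRLS weights
        M = (X * w) @ X.T                                     # X diag(w) X^T
        eigvals, eigvecs = np.linalg.eigh(M)
        b_new = eigvecs[:, 0]                                 # eigenvector of smallest eigenvalue
        if 1.0 - abs(b_new @ b) < tol:                        # direction has stabilized
            return b_new
        b = b_new
    return b
```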
4.2. Hyperplane clustering algorithms via DPCP

Sequential Hyperplane Learning (SHL) via DPCP. Since at its core DPCP is a single subspace learning method (Tsakiris & Vidal, 2015a), we may as well use it to learn n hyperplanes in the same way that RANSAC (Fischler & Bolles, 1981) is used: learn one hyperplane from the entire dataset, remove the points close to it, then learn a second hyperplane, remove the points close to it, and so on. The main weakness of this approach is well known, and consists of its sensitivity to a thresholding parameter, which is necessary in order to be able to remove points. To alleviate the need for knowing a good threshold, we propose to replace the process of removing points by a process of appropriately weighting the points. In particular, suppose we solve the DPCP problem (2) on the entire dataset X and obtain a unit ℓ2-norm vector b_1. Now, instead of removing the points of X that are close to the hyperplane with normal vector b_1 (which would require a threshold parameter), we weight each and every point x_j of X by its distance |b_1^⊤ x_j| from that hyperplane. Then, to compute a second hyperplane with normal b_2, we apply DPCP on the weighted dataset {|b_1^⊤ x_j| x_j}. To compute a third hyperplane, the weight of point x_j is chosen as the smallest distance of x_j from the already computed two hyperplanes, i.e., DPCP is now applied to {min_{i=1,2} |b_i^⊤ x_j| x_j}. After n hyperplanes have been computed, the clustering of the points is obtained based on their distances to the n hyperplanes; a code sketch of this weighting scheme is given at the end of this subsection. We note here that the theoretical correctness of this weighted sequential scheme does not follow automatically from the theory presented in this paper, since the latter applies only to unit ℓ2-norm points; studying DPCP for weighted points is ongoing research.

Iterative Hyperplane Learning (IHL) via DPCP. Another way to do hyperplane learning and clustering via DPCP is to modify the classic K-Hyperplanes scheme, which we will be referring to as IHL-SVD (Bradley & Mangasarian, 2000; Tseng, 2000; Zhang et al., 2009) (see §1.1), by computing the normal vector of each cluster by DPCP instead of, e.g., SVD; see §5.2 for more details. The resulting algorithm, IHL-DPCP, minimizes (up to a local minimum) the sum of the distances of the points to the estimated hyperplane arrangement, which corresponds to replacing the ℓ2-norm in the objective of IHL-SVD with the ℓ1-norm, precisely as in the case of MKF (Zhang et al., 2009).
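The following sketch (ours) spells out the weighted sequential scheme described above; dpcp_solver stands for any routine that returns a unit normal from a weighted data matrix, for instance the dpcp_irls sketch of §4.1.

```python
import numpy as np

def shl_dpcp(X, n, dpcp_solver):
    # Sequential Hyperplane Learning via DPCP with point weighting instead of removal.
    # X is D x N with unit-norm columns; returns n normals and a clustering of the points.
    D, N = X.shape
    normals = []
    for _ in range(n):
        if normals:
            # Weight of x_j: its distance to the closest hyperplane found so far.
            dists = np.abs(np.stack(normals) @ X)          # len(normals) x N
            w = dists.min(axis=0)
        else:
            w = np.ones(N)                                  # first hyperplane: unweighted data
        normals.append(dpcp_solver(X * w))
    B = np.stack(normals)                                   # n x D
    labels = np.abs(B @ X).argmin(axis=0)                   # assign each point to its closest hyperplane
    return B, labels
```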
5. Experimental Evaluation

5.1. Experiments using synthetic data

We evaluate SHL-DPCP (§4.2) using synthetic data, and compare it with similar sequential algorithms in which, instead of solving the DPCP problem (2) (either via DPCP-r or via DPCP-IRLS), one uses REAPER or RANSAC. (We have also compared with methods such as SCC and MKF, but we do not report on them since they perform significantly worse than RANSAC, REAPER or DPCP.) For fairness, RANSAC does not remove any points as it sequentially learns the hyperplanes; rather, it selects them randomly using the probability distribution induced by weights defined in a similar way as in §4. Moreover, it is configured to run at least as long as DPCP-r, which uses a maximum of 20 iterations in (3), while REAPER and DPCP-IRLS use a maximum of 100 iterations and a convergence accuracy of 10^{-3}. The ambient dimension is set to D = 4, 9, 30, as inspired by major applications where hyperplane arrangements appear, e.g., D = 4 in 3D point cloud analysis (in homogeneous coordinates), and D = 9 in two-view geometry (Cheng et al., 2015). For each choice of D we randomly generate n = 2, 3, 4 hyperplanes and sample them as follows. Given n, we set N = 300n, with N_i = α N_{i−1}, i > 1, where α ∈ (0, 1] is a parameter that controls the balancing of the clusters: α = 1 means the clusters are perfectly balanced, while smaller values of α lead to less balanced clusters. We set α = 0.6 (for α = 0.8, 1 see Tsakiris & Vidal 2017b). Each cluster is sampled from a zero-mean unit-variance Gaussian distribution with support in the corresponding hyperplane. To make the experiment more realistic, we corrupt points from each hyperplane by adding white Gaussian noise of standard deviation σ = 0.01 with support in the direction orthogonal to the hyperplane. Moreover, we corrupt the dataset by adding M/(M + N) = 10% outliers sampled from a standard zero-mean unit-variance Gaussian distribution supported in the ambient space, where M is the number of outliers.

[Figure 1. Sequential hyperplane learning: the left (right) column shows clustering accuracy (white corresponds to 1, black to 0) as a function of the number of hyperplanes n and the relative dimension d/D (the percentage M/(N + M) of outliers). Panels (e) and (f) correspond to DPCP-IRLS.]

The left column of Figure 1 plots the clustering accuracy over 50 independent experiments as a function of the relative dimension (D−1)/D and the number of hyperplanes n. As expected, the performance degrades as either the relative dimension or the number of hyperplanes increases. There are at least two interesting things to notice. First, RANSAC is the best method when D = 4, irrespectively of the number of hyperplanes, since for such a low ambient dimension the probability that D−1 = 3 randomly selected points lie in the same hyperplane is very high. Indeed, for D = 4 RANSAC's accuracy ranges from 99% (n = 2) to 97% (n = 4), as opposed to (for n = 4) REAPER (56%) or even DPCP-IRLS (89%) and DPCP-r (94%). On the other hand, DPCP-r is overall the best method, with an 86% accuracy in the challenging scenario (D−1)/D = 0.97, n = 4, as opposed to 81% for DPCP-IRLS, 42% for REAPER and 28% for RANSAC. The right column of Figure 1 plots the clustering accuracy as a function of n and of the percentage of outliers, for D = 9 and additive noise as before. Evidently, DPCP-r and DPCP-IRLS are the best methods, with, e.g., a clustering accuracy of 97% and 94% respectively for n = 2 and 50% outliers, as opposed to 66% for RANSAC and 74% for REAPER.

5.2. Experiments using real Kinect data

In this section we explore various hyperplane clustering algorithms using the benchmark dataset NYUdepth V2 (Silberman et al., 2012). This dataset consists of 1449 RGBd data instances acquired using the Microsoft Kinect sensor. Each instance corresponds to an indoor scene, and consists of the 480 × 640 × 3 RGB data together with depth data for each pixel. The depth data can be used to reconstruct a 3D point cloud associated to the scene. In this experiment we use such 3D point clouds to learn plane arrangements and segment the pixels of the corresponding images based on their plane membership. This is an important problem in robotics, where estimating the geometry of a scene is essential for successful robot navigation. In such 3D applications RANSAC is the predominant state-of-the-art method, since the probability of sampling three points from the dominant plane is very large. Thus we compare a sequential hyperplane learning RANSAC algorithm (SHL-RANSAC), which uses a threshold (0.1, 0.01, or 0.001) for removing points, to iterative K-Hyperplanes-like algorithms based on SVD, REAPER, RANSAC and DPCP-IRLS, to be referred to as IHL-SVD, IHL-REAPER, and so on. These algorithms randomly initialize n hyperplanes, cluster the points according to their distance to these hyperplanes, refine the hyperplanes by fitting a new hyperplane to each cluster, re-assign points based on the new hyperplanes, and so on, until the objective function converges or 100 iterations are reached (a sketch of this alternation is given below). We use 10 independent restarts, and we control the running time of SHL-RANSAC and IHL-RANSAC to be not less than that of IHL-DPCP-IRLS.
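A sketch (ours) of the iterative scheme just described follows; fit_normal is a placeholder for the per-cluster normal estimator (the least significant singular vector for IHL-SVD, or a DPCP solver such as the dpcp_irls sketch of §4.1 for IHL-DPCP-IRLS). The random initialization and stopping rule are simplified assumptions; the experiments additionally use multiple restarts.

```python
import numpy as np

def ihl(X, n, fit_normal, max_iter=100, seed=0):
    # Iterative (K-Hyperplanes-style) hyperplane learning: alternate between
    # assigning each point to its closest hyperplane and refitting each normal.
    rng = np.random.default_rng(seed)
    D, N = X.shape
    B = rng.standard_normal((n, D))
    B /= np.linalg.norm(B, axis=1, keepdims=True)          # random initial unit normals
    labels = -np.ones(N, dtype=int)
    for _ in range(max_iter):
        new_labels = np.abs(B @ X).argmin(axis=0)          # distance |b_i^T x_j| to each hyperplane
        if np.array_equal(new_labels, labels):             # assignments have stabilized
            break
        labels = new_labels
        for i in range(n):
            if np.any(labels == i):
                B[i] = fit_normal(X[:, labels == i])       # refit the normal of cluster i
    return B, labels
```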
The algorithms do not operate on the raw 3D data, but rather on standard superpixel representations of the data, where each superpixel is represented by its median 3D point, weighted by the size of the superpixel. Moreover, since 3D planes in an indoor scene usually do not pass through a common origin, the algorithms work with homogeneous coordinates. Finally, the algorithms as described so far are purely geometric, in the sense that they do not take into account the spatial coherence of the 3D point cloud (nearby points are likely to lie in the same plane), and so we expect their output segmentation to be spatially incoherent. To associate a spatially smooth image segmentation to each algorithm, we use the normal vectors b_1, . . . , b_n that the algorithm produced to minimize a Conditional-Random-Field (Sutton & McCallum, 2006) energy function

  E(y_1, . . . , y_N) := Σ_{j=1}^N |b_{y_j}^⊤ x_j| + λ Σ_{j=1}^N Σ_{k∈N_j} CB_{j,k} exp(−‖x_j − x_k‖_2² / (2σ_d²)) δ(y_j ≠ y_k).   (28)

In (28) the first and second terms are known as unary and pairwise potentials, y_j ∈ {1, . . . , n} is the plane label of 3D point x_j, which is the variable to optimize over, CB_{j,k} is the length of the common boundary between superpixels j and k, and N_j indexes the neighbors of x_j. The parameter λ in (28) is set to the inverse of twice the maximal row-sum of the pairwise matrix, in order to achieve a balance between unary and pairwise terms. Minimization of (28) is done via Graph-Cuts (Boykov et al., 2001).

Since NYUdepth V2 does not come with a ground truth annotation based on plane membership, we manually annotated 92 of the 1449 scenes in the dataset, in which dominant planes such as floors, walls, ceilings, tables and so on are present. Table 1 shows the clustering errors of various algorithms on these 92 annotated scenes for the identification of the first n dominant planes of each scene (if a scene has m < n annotated planes, the clustering error is computed only with respect to the first m dominant clusters identified by the algorithm), where for SHL-RANSAC the error is averaged over the three different choices of a threshold.

Table 1. Clustering error in % of 3D planes from Kinect data without (CRF(0)) and with (CRF(1)) spatial smoothing.

| method | n = 4, CRF(0) | n = 4, CRF(1) | n = 9, CRF(0) | n = 9, CRF(1) |
|---|---|---|---|---|
| SHL-RANSAC | 22.78 | 14.07 | 29.42 | 17.47 |
| IHL-RANSAC | 16.80 | 10.71 | 22.78 | 14.24 |
| IHL-SVD | 21.85 | 14.40 | 26.22 | 16.71 |
| IHL-REAPER | 20.94 | 13.71 | 25.52 | 16.27 |
| IHL-DPCP-IRLS | 20.77 | 13.64 | 25.38 | 16.10 |

As expected, the clustering error increases for all methods as the number of planes to be identified increases. Again as expected, the performance of all algorithms improves significantly if one includes spatial smoothing. Notice that the best method is IHL-RANSAC, and not the sequential SHL-RANSAC, which seems a rather interesting finding. On the other hand, the rest of the methods perform similarly to each other, with IHL-SVD being slightly inferior, since it is less robust to outliers, and IHL-DPCP-IRLS being overall the second best method.

6. Conclusions

In this paper we extended the framework of Dual Principal Component Pursuit (DPCP) to the case of data lying in a union of hyperplanes. We provided theoretical conditions under which the normal vector of the dominant hyperplane is the unique global minimizer of the non-convex ℓ1 DPCP optimization problem. Moreover, we proposed a fast implementation of DPCP, as well as DPCP-based hyperplane clustering algorithms, which were shown to outperform or be competitive with state-of-the-art algorithms.

Acknowledgements

This work was supported by NSF grants 1447822 and 1618637. The first author thanks Dr. Glyn Harman for his help in deriving equation (21), and Dr. Bijan Afsari for interesting conversations on geometric medians in nonlinear manifolds. The authors also thank two of the anonymous reviewers for providing constructive comments.
References

Bako, L. Identification of switched linear systems via sparse optimization. Automatica, 47(4):668-677, 2011.

Beck, J. Sums of distances between points on a sphere: an application of the theory of irregularities of distribution to discrete geometry. Mathematika, 31(01):33-41, 1984.

Boykov, Y., Veksler, O., and Zabih, R. Fast approximate energy minimization via graph cuts. IEEE Transactions on Pattern Analysis and Machine Intelligence, 23(11):1222-1239, 2001.

Bradley, P. S. and Mangasarian, O. L. k-plane clustering. Journal of Global Optimization, 16(1):23-32, 2000. ISSN 0925-5001.

Candès, E., Wakin, M., and Boyd, S. Enhancing sparsity by reweighted ℓ1 minimization. Journal of Fourier Analysis and Applications, 14(5):877-905, 2008.

Chartrand, R. and Yin, W. Iteratively reweighted algorithms for compressive sensing. In 2008 IEEE International Conference on Acoustics, Speech and Signal Processing, pp. 3869-3872. IEEE, 2008.

Chen, G. and Lerman, G. Spectral curvature clustering (SCC). International Journal of Computer Vision, 81(3):317-330, 2009. ISSN 0920-5691.

Cheng, Y., Lopez, J. A., Camps, O., and Sznaier, M. A convex optimization approach to robust fundamental matrix estimation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2170-2178, 2015.

Daubechies, I., DeVore, R., Fornasier, M., and Güntürk, C. S. Iteratively reweighted least squares minimization for sparse recovery. Communications on Pure and Applied Mathematics, 63(1):1-38, 2010.

Dick, J. Applications of geometric discrepancy in numerical analysis and statistics. Applied Algebra and Number Theory, 2014.

Draper, B., Kirby, M., Marks, J., Marrinan, T., and Peterson, C. A flag representation for finite collections of subspaces of mixed dimensions. Linear Algebra and its Applications, 451:15-32, 2014.

Elhamifar, E. and Vidal, R. Sparse subspace clustering: Algorithm, theory, and applications. IEEE Transactions on Pattern Analysis and Machine Intelligence, 35(11):2765-2781, 2013.

Fischler, M. A. and Bolles, R. C. RANSAC random sample consensus: A paradigm for model fitting with applications to image analysis and automated cartography. Communications of the ACM, 26:381-395, 1981.

Ghalieh, K. and Hajja, M. The Fermat point of a spherical triangle. The Mathematical Gazette, 80(489):561-564, 1996.

Grabner, P. J. and Tichy, R. F. Spherical designs, discrepancy and numerical integration. Math. Comp., 60(201):327-336, 1993. ISSN 0025-5718. doi: 10.2307/2153170. URL http://dx.doi.org/10.2307/2153170.

Grabner, P. J., Klinger, B., and Tichy, R. F. Discrepancies of point sequences on the sphere and numerical integration. Mathematical Research, 101:95-112, 1997.

Harman, G. Variations on the Koksma-Hlawka inequality. Uniform Distribution Theory, 5(1):65-78, 2010.

Lerman, G., McCoy, M. B., Tropp, J. A., and Zhang, T. Robust computation of linear models by convex relaxation. Foundations of Computational Mathematics, 15(2):363-410, 2015.

Liu, G., Lin, Z., Yan, S., Sun, J., and Ma, Y. Robust recovery of subspace structures by low-rank representation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 35(1):171-184, Jan 2013.

Lu, C.-Y., Min, H., Zhao, Z.-Q., Zhu, L., Huang, D.-S., and Yan, S. Robust and efficient subspace segmentation via least squares regression. In European Conference on Computer Vision, 2012.
Qu, Q., Sun, J., and Wright, J. Finding a sparse vector in a subspace: Linear sparsity using alternating directions. In Advances in Neural Information Processing Systems, pp. 3401-3409, 2014.

Sampath, A. and Shan, J. Segmentation and reconstruction of polyhedral building roofs from aerial lidar point clouds. Geoscience and Remote Sensing, IEEE Transactions on, 48(3):1554-1567, 2010.

Silberman, N., Kohli, P., Hoiem, D., and Fergus, R. Indoor segmentation and support inference from RGBD images. In European Conference on Computer Vision, 2012.

Späth, H. and Watson, G. A. On orthogonal linear ℓ1 approximation. Numerische Mathematik, 51(5):531-543, 1987.

Spielman, D. A., Wang, H., and Wright, J. Exact recovery of sparsely-used dictionaries. In Proceedings of the 23rd International Joint Conference on Artificial Intelligence, pp. 3087-3090. AAAI Press, 2013.

Sun, J., Qu, Q., and Wright, J. Complete dictionary recovery using nonconvex optimization. In Proceedings of the 32nd International Conference on Machine Learning (ICML-15), pp. 2351-2360, 2015.

Sutton, C. and McCallum, A. An introduction to conditional random fields for relational learning, volume 2. Introduction to Statistical Relational Learning. MIT Press, 2006.

Tsakiris, M. C. and Vidal, R. Dual principal component pursuit. arXiv:1510.04390v2 [cs.CV], 2017a.

Tsakiris, M. C. and Vidal, R. Hyperplane clustering via dual principal component pursuit. arXiv:1706.01604 [cs.CV], 2017b.

Tsakiris, M. C. and Vidal, R. Filtrated algebraic subspace clustering. SIAM Journal on Imaging Sciences, 10(1):372-415, 2017c.

Tsakiris, M. C. and Vidal, R. Dual principal component pursuit. In ICCV Workshop on Robust Subspace Learning and Computer Vision, pp. 10-18, 2015a.

Tsakiris, M. C. and Vidal, R. Filtrated spectral algebraic subspace clustering. In ICCV Workshop on Robust Subspace Learning and Computer Vision, pp. 28-36, 2015b.

Tseng, P. Nearest q-flat to m points. Journal of Optimization Theory and Applications, 105(1):249-252, 2000.

Vidal, R., Soatto, S., Ma, Y., and Sastry, S. An algebraic geometric approach to the identification of a class of linear hybrid systems. In IEEE Conference on Decision and Control, pp. 167-172, 2003.

Vidal, R., Ma, Y., and Sastry, S. Generalized Principal Component Analysis (GPCA). IEEE Transactions on Pattern Analysis and Machine Intelligence, 27(12):1-15, 2005.

Vidal, R., Ma, Y., Soatto, S., and Sastry, S. Two-view multibody structure from motion. International Journal of Computer Vision, 68(1):7-25, 2006.

Vidal, R., Ma, Y., and Sastry, S. Generalized Principal Component Analysis. Springer Verlag, 2016.

Wang, Y.-X., Xu, H., and Leng, C. Provable subspace clustering: When LRR meets SSC. In Neural Information Processing Systems, 2013.

You, C., Li, C.-G., Robinson, D., and Vidal, R. Oracle based active set algorithm for scalable elastic net subspace clustering. In IEEE Conference on Computer Vision and Pattern Recognition, pp. 3928-3937, 2016.

Zhang, T., Szlam, A., and Lerman, G. Median k-flats for hybrid linear modeling with many outliers. In Workshop on Subspace Methods, pp. 234-241, 2009.