# Sparse Subspace Clustering with Missing Entries

Congyuan Yang (YANGCY@JHU.EDU), Daniel Robinson (DANIEL.P.ROBINSON@JHU.EDU), René Vidal (RVIDAL@CIS.JHU.EDU)
Johns Hopkins University, 3400 N Charles St., Baltimore, MD, USA

Proceedings of the 32nd International Conference on Machine Learning, Lille, France, 2015. JMLR: W&CP volume 37. Copyright 2015 by the author(s).

We consider the problem of clustering incomplete data drawn from a union of subspaces. Classical subspace clustering methods are not applicable to this problem because the data are incomplete, while classical low-rank matrix completion methods may not be applicable because data in multiple subspaces may not be low rank. This paper proposes and evaluates two new approaches for subspace clustering and completion. The first one generalizes the sparse subspace clustering algorithm so that it can obtain a sparse representation of the data using only the observed entries. The second one estimates a suitable kernel matrix by assuming a random model for the missing entries and obtains the sparse representation from this kernel. Experiments on synthetic and real data show the advantages and disadvantages of the proposed methods, which all outperform the natural approach (low-rank matrix completion followed by sparse subspace clustering) when the data matrix is high rank or the percentage of missing entries is large.

1. Introduction

In many real-world applications, we are faced with the problem of clustering incomplete data drawn from a union of low-dimensional subspaces. For example, in the motion segmentation problem in computer vision, feature point trajectories extracted from a video sequence with multiple moving objects lie in multiple low-dimensional subspaces, one per moving object. However, due to occlusions or objects entering or leaving the field of view, such trajectories are often incomplete. Therefore, we are faced with the problem of clustering these incomplete trajectories in order to segment the video into multiple motions.

Prior work. The problem of clustering complete data in a union of subspaces has received increasing attention over the past decade (see Vidal (2011) for a tutorial). Classical methods, such as K-subspaces (Bradley & Mangasarian, 2000; Tseng, 2000) and mixtures of probabilistic PCAs (Tipping & Bishop, 1999), suffer from local minima, while algebraic methods such as Generalized Principal Component Analysis (Vidal et al., 2005) are sensitive to data corruptions. This has motivated the development of convex optimization methods based on sparse (Elhamifar & Vidal, 2009; 2010; 2013) and low-rank (Liu et al., 2010; 2013; Lu et al., 2012; Favaro et al., 2011; Vidal & Favaro, 2014) representation techniques. For example, the Sparse Subspace Clustering (SSC) algorithm of Elhamifar & Vidal (2009) is based on expressing each data point as a sparse linear combination of all other data points. When the subspaces are sufficiently separated and the data points are sufficiently well distributed inside the subspaces, the nonzero coefficients of one point correspond to other points in the same subspace. Hence the sparse representation can be used to construct an affinity for clustering the data using spectral clustering.
Theoretical conditions that guarantee the correctness of the clustering have been established for noiseless data (Elhamifar & Vidal, 2013), data corrupted by outliers (Soltanolkotabi & Candès, 2013; Soltanolkotabi et al., 2014), and data corrupted by noise (Wang & Xu, 2013). In sharp contrast, the case where the data points are incomplete has received significantly less attention. Gruber & Weiss (2004) model the data with a mixture of probabilistic PCAs and use the EM algorithm to both segment the data and find the missing entries. However, this approach suffers from local minima and cannot guarantee exact matrix completion or exact clustering. Vidal & Hartley (2004) assume that, although the data lie in a union of subspaces, the data matrix is still low rank, hence they use low-rank matrix completion techniques followed by subspace clustering techniques. In practice, the number of subspaces and their dimensions may not be small enough for the data matrix to be low rank. Eriksson et al. (2012) use a local neighborhood of each incomplete point to complete it, and refine the estimated subspaces to recover the full matrix. Under certain conditions, this method can complete the data with high accuracy. However, the conditions require an arbitrarily large number of points, which makes the method impractical. Finally, in an unpublished abstract, Candès et al. (2014) propose to extend SSC to the case of missing data by replacing a certain incomplete kernel matrix by its expected value and then computing the sparse representation of each point using a bias-corrected Dantzig selector. While the first step is very clever and promising, the precise incomplete data model and the computation of the expectation are not given. Moreover, using a bias-corrected Dantzig selector to find the sparse representation is computationally more complex than classical SSC, which is based on an ADMM implementation of LASSO.

Paper contributions. We present two new approaches for subspace clustering and completion. The first one uses the knowledge of which entries are observed to formulate a convex optimization problem that estimates a sparse representation of the data. The second one is based on the observation that the LASSO version of SSC does not require one to know the full data matrix $X$, but rather the kernel matrix $X^\top X$. Under a certain probability model for the missing entries, we show that it is possible to obtain an unbiased estimator for $X^\top X$ from the observed entries of $X$ and the fraction of missing entries $\delta$. (This approach was outlined in (Candès et al., 2014) without proof.) The estimator is then used to define a convex optimization problem for computing a sparse representation of the data. The last step of both approaches is the same as that of SSC: apply spectral clustering to an affinity matrix built from the sparse representation. Experiments on synthetic and real data show that the proposed algorithms are effective, with each having advantages and disadvantages, as well as complementary strengths when compared to the basic approach of matrix completion followed by SSC. In particular, when $X$ is low rank and the fraction of missing entries $\delta$ is small, matrix completion works extremely well and our approaches are inferior. However, when either $X$ is high rank or $\delta$ is large, matrix completion fails and our approaches are superior.

2. SSC with Complete Data: A Review

Consider $n$ linear or affine subspaces $\{S_l \subset \mathbb{R}^D\}_{l=1}^{n}$, each of dimension $d_l < D$ for $l = 1, \dots, n$.
Assume we are given $N$ data points $\{x_j \in \mathbb{R}^D\}_{j=1}^{N}$ lying in the union of the $n$ subspaces. Then the observed data matrix is

$$X = [x_1, \dots, x_N] \in \mathbb{R}^{D \times N}. \quad (1)$$

The goal of subspace clustering is to identify the number of subspaces, their dimensions, a basis for each subspace, and the membership of each data point to its correct subspace.

The SSC algorithm (Elhamifar & Vidal, 2009) is based on the observation that data in a union of subspaces are self-expressive. That is, each data point can be represented as a linear combination of other points that lie in the same subspace. Therefore, we can write each data point as

$$x_j = X c_j, \quad c_{jj} = 0, \quad (\mathbf{1}^\top c_j = 1), \quad (2)$$

where the vector of coefficients $c_j$ has at most $d_l$ nonzero entries if $x_j \in S_l$. The additional constraint $\mathbf{1}^\top c_j = 1$, where $\mathbf{1}$ is the vector of all ones, is used in the case of affine subspaces, because the coefficients must add up to 1 to give an affine combination rather than a linear combination. When the subspaces are sufficiently separated and the data points are well distributed inside each subspace, the theoretical analysis in (Elhamifar & Vidal, 2013; Soltanolkotabi & Candès, 2013; Soltanolkotabi et al., 2014) shows that the solutions of the set of $\ell_1$ minimization problems

$$\min_{c_j} \|c_j\|_1 \quad \text{s.t.} \quad x_j = X c_j, \; c_{jj} = 0, \qquad j = 1, \dots, N, \quad (3)$$

satisfy $c_{ij} \neq 0$ only if points $x_i$ and $x_j$ are in the same subspace. In matrix form, we can write this problem as

$$\min_{C} \|C\|_1 \quad \text{s.t.} \quad X = XC, \; \operatorname{diag}(C) = 0, \quad (4)$$

where $C = [c_1, \dots, c_N] \in \mathbb{R}^{N \times N}$ is the coefficient matrix. When the data are contaminated by noise, the self-expressiveness constraint $X = XC$ is relaxed and the following LASSO problem is solved:

$$\min_{C} \|C\|_1 + \frac{\lambda}{2}\|X - XC\|_F^2 \quad \text{s.t.} \quad \operatorname{diag}(C) = 0, \quad (5)$$

where $\lambda > 0$ is a parameter. In this case, Wang & Xu (2013) show that when the amount of noise is small enough, the subspaces are sufficiently separated, and the data are well distributed, the matrix of coefficients gives the correct clustering with high probability. Given the matrix of sparse coefficients $C$, SSC constructs a weighted graph with affinity matrix $A = |C| + |C|^\top$ and uses spectral clustering algorithms to cluster the data.

3. SSC with Missing Entries

Let us now consider the case where some entries of the data matrix $X = [x_{ij}] \in \mathbb{R}^{D \times N}$ are missing. Specifically, let $W = [w_{ij}] \in \{0,1\}^{D \times N}$ be a matrix such that $w_{ij} = 1$ if $x_{ij}$ is observed, and $w_{ij} = 0$ otherwise. The locations of the observed entries for the $j$th data point or for the entire data matrix can then be indexed, respectively, by the sets

$$\Omega_j = \{i : w_{ij} = 1\} \quad \text{and} \quad \Omega = \{(i,j) : w_{ij} = 1\}. \quad (6)$$

Given the observed entries of $X$, $\{x_{ij}\}_{(i,j) \in \Omega}$, our goal is to determine which columns of $X$ belong to the same subspace. SSC does so by solving for the matrix of coefficients $C$ in (5). However, since some entries of $X$ are missing, we cannot directly solve the optimization problem in (5). In this section, we present various approaches for addressing this problem and discuss their strengths and weaknesses.

3.1. Matrix Completion + SSC (MC+SSC)

A first approach is to apply a matrix completion (MC) algorithm to recover the missing entries in $X$ and then solve (5). More specifically, let $P_\Omega(X)$ denote the entries of $X$ in $\Omega$. The MC+SSC approach to subspace clustering with missing entries uses the MC approach in (Cai et al., 2008) followed by the SSC approach in (5), that is:

$$\hat{A} = \arg\min_{A} \|A\|_* + \frac{\tau}{2}\|A\|_F^2 \quad \text{s.t.} \quad P_\Omega(X) = P_\Omega(A),$$
$$\hat{C} = \arg\min_{C} \|C\|_1 + \frac{\lambda}{2}\|\hat{A} - \hat{A}C\|_F^2 \quad \text{s.t.} \quad \operatorname{diag}(C) = 0,$$

for appropriately chosen positive constants $\tau$ and $\lambda$.
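The two steps above are easy to prototype. The following is a minimal sketch of the MC+SSC pipeline, not the authors' implementation: the completion step is a rough singular-value-thresholding loop in the spirit of (Cai et al., 2008), and the SSC step solves (5) column by column with scikit-learn's Lasso before applying spectral clustering to $|C| + |C|^\top$. The function names and all numerical values (`tau`, `step`, `lam`, iteration counts) are illustrative assumptions, and scikit-learn's `alpha` corresponds to $\lambda$ only up to its internal $1/(2\,n_{\text{samples}})$ scaling.

```python
import numpy as np
from sklearn.linear_model import Lasso
from sklearn.cluster import SpectralClustering

def svt_complete(X_zf, W, tau=5.0, step=1.2, n_iter=200):
    """Rough singular-value-thresholding completion; X_zf is the zero-filled data."""
    Y = np.zeros_like(X_zf)
    for _ in range(n_iter):
        U, s, Vt = np.linalg.svd(Y, full_matrices=False)
        A = U @ np.diag(np.maximum(s - tau, 0.0)) @ Vt   # shrink singular values
        Y = Y + step * W * (X_zf - A)                    # correct only the observed entries
    return A

def ssc_coefficients(A, lam=20.0):
    """Column-wise LASSO version of (5); alpha ~ 1/lam up to sklearn's scaling."""
    D, N = A.shape
    C = np.zeros((N, N))
    for j in range(N):
        idx = [k for k in range(N) if k != j]            # enforce c_jj = 0
        lasso = Lasso(alpha=1.0 / lam, fit_intercept=False, max_iter=5000)
        lasso.fit(A[:, idx], A[:, j])
        C[idx, j] = lasso.coef_
    return C

def mc_plus_ssc(X, W, n_clusters, lam=20.0):
    A = svt_complete(X * W, W)                           # step 1: matrix completion
    C = ssc_coefficients(A, lam)                         # step 2: sparse self-expression
    Aff = np.abs(C) + np.abs(C).T                        # affinity used by SSC
    return SpectralClustering(n_clusters=n_clusters,
                              affinity='precomputed').fit_predict(Aff)
```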
The MC+SSC approach is likely to be a good strategy when the rank of the data matrix and the percentage of missing entries are sufficiently small. In this case, existing results (Cai et al., 2008; Candès & Recht, 2009; Candès & Tao, 2010; Gross, 2011; Keshavan et al., 2010; Zhou et al., 2010; Recht, 2011) guarantee the correctness of the completion, while existing results (Elhamifar & Vidal, 2013; Soltanolkotabi & Candès, 2013; Soltanolkotabi et al., 2014) guarantee the correctness of the clustering. However, MC+SSC is likely to fail as soon as the data matrix is full rank (e.g., a data matrix from 20 subspaces of dimension 5 in $\mathbb{R}^{100}$ can have full rank 100) or the percentage of missing entries is high, as in both cases the MC step may fail.

3.2. SSC by Entry-Wise Zero-Fill (SSC-EWZF)

A second approach is to fill the missing entries in $X$ with zeros, i.e., to replace $X$ by $X_{\text{miss}} = W \odot X$, where $\odot$ is the Hadamard product, and then solve (5). We call this approach Zero-Fill+SSC (ZF+SSC). However, this heuristic may fail because it may not provide the correct completion of the data, hence SSC may not provide the correct clustering either. Moreover, the ZF approach does not minimize the correct error. Specifically, the self-expressiveness error

$$\|X - XC\|_F^2 = \sum_{i=1}^{D}\sum_{j=1}^{N}\Big(x_{ij} - \sum_{k=1}^{N} x_{ik}c_{kj}\Big)^2 \quad (7)$$

is replaced by

$$\|X_{\text{miss}} - X_{\text{miss}}C\|_F^2 = \sum_{i=1}^{D}\sum_{j=1}^{N}\Big(w_{ij}x_{ij} - \sum_{k=1}^{N} w_{ik}x_{ik}c_{kj}\Big)^2. \quad (8)$$

When $w_{ij} = 0$, this encourages the term $\sum_{k=1}^{N} w_{ik}x_{ik}c_{kj}$ to be close to zero, while this term should not be penalized because we do not observe $x_{ij}$ when $w_{ij} = 0$. We could address this by summing only over the observed entries of $X$:

$$\|P_\Omega(X - XC)\|_F^2 = \sum_{i=1}^{D}\sum_{j=1}^{N} w_{ij}\Big(x_{ij} - \sum_{k=1}^{N} x_{ik}c_{kj}\Big)^2. \quad (9)$$

However, $P_\Omega(XC)$ cannot be computed when $X$ has missing entries. To address this issue, we propose to replace (9) by $\|P_\Omega(X_{\text{miss}} - X_{\text{miss}}C)\|_F^2$, which is equal to

$$\sum_{i=1}^{D}\sum_{j=1}^{N} w_{ij}\Big(x_{ij} - \sum_{k=1}^{N} w_{ik}x_{ik}c_{kj}\Big)^2. \quad (10)$$

This modified self-expressiveness error was proposed in (Balzano et al., 2010) for a different but related problem (column subset selection with missing entries). In principle, this modified error is not correct either, since it replaces the linear combination $\sum_{k=1}^{N} x_{ik}c_{kj}$ by $\sum_{k=1}^{N} w_{ik}x_{ik}c_{kj}$. However, notice that the two sums coincide if, for all $k$ such that $w_{ik} = 0$, it happens to be the case that $c_{kj} = 0$. Since the $\ell_1$ term will bias $c_{kj}$ to be zero by penalizing $|c_{kj}|$, the SSC by Entry-Wise Zero-Fill (SSC-EWZF) approach,

$$\min_{C} \|C\|_1 + \frac{\lambda}{2}\|P_\Omega(X_{\text{miss}} - X_{\text{miss}}C)\|_F^2 \quad \text{s.t.} \quad \operatorname{diag}(C) = 0, \quad (11)$$

will effectively try to express the $j$th column of $X$, $x_j$, as a linear combination of other columns of $X$ that are in the same subspace as $x_j$ and whose entries in $\Omega_j$ are the most complete. Ideally, this approach will work perfectly if, for each point $x_j \in S_l$, there are at least $d_l$ other data points $x_{j_1}, \dots, x_{j_{d_l}}$ in the same subspace whose $i$th entries are known for all $i \in \Omega_j$. However, the use of the projection $P_\Omega$ could affect the correctness of the clustering, e.g., if two different subspaces become indistinguishable after projection. As is customary in classical matrix completion results, we need to assume that each subspace is incoherent with the pattern of missing entries. We conjecture that SSC-EWZF is guaranteed to give the correct clustering and completion provided that (1) the subspaces are sufficiently separated, (2) the data points are well distributed inside the subspaces, and (3) the subspaces are incoherent with the pattern of missing entries.
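Because the error in (10) decouples over columns, (11) can be prototyped with an off-the-shelf LASSO solver. The sketch below is not the paper's ADMM implementation: for each column $j$, both the target and the dictionary are restricted to the observed rows $\Omega_j$ of the zero-filled matrix, which is exactly the $j$th term of (10). The function name and the value of `lam` are illustrative assumptions.

```python
import numpy as np
from sklearn.linear_model import Lasso

def ssc_ewzf_coefficients(X, W, lam=20.0):
    """Column-wise sketch of the SSC-EWZF problem (11)."""
    D, N = X.shape
    Xmiss = W * X                                      # entry-wise zero fill
    C = np.zeros((N, N))
    for j in range(N):
        omega = W[:, j].astype(bool)                   # observed rows Omega_j
        idx = [k for k in range(N) if k != j]          # enforce c_jj = 0
        lasso = Lasso(alpha=1.0 / lam, fit_intercept=False, max_iter=5000)
        lasso.fit(Xmiss[np.ix_(omega, idx)], Xmiss[omega, j])
        C[idx, j] = lasso.coef_
    return C
```

The resulting coefficient matrix would then feed the same affinity and spectral clustering step used by standard SSC.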
3.3. SSC by Expectation-based Completion (SSC-EC)

This approach exploits the fact that the self-expressiveness error depends on $X^\top X$ rather than $X$, because

$$\|X - XC\|_F^2 = \operatorname{trace}\big((I - C)^\top X^\top X (I - C)\big). \quad (12)$$

This observation was the basis for the Kernel SSC method of Patel & Vidal (2014). Here, we use this idea to complete the kernel matrix $X^\top X$ in lieu of the data matrix $X$. To this end, we assume a random model in which each entry of $X$ is missing independently and with equal probability $\delta \in [0,1]$. Under this model, the following lemma, which was stated in (Candès et al., 2014) without proof, shows how to obtain an unbiased estimator for $X^\top X$ from $X_{\text{miss}}$ and $\delta$.

Lemma 1. Let $Z = [z_{ij}] \in \{0,1\}^{D \times N}$ be a random matrix whose entries are i.i.d. Bernoulli with parameter $\delta$, i.e.,

$$P[z_{ij} = k] = \delta^{1-k}(1-\delta)^{k}, \quad k = 0, 1, \quad (13)$$

for all $i = 1, \dots, D$ and all $j = 1, \dots, N$. Then the matrix

$$\Gamma \triangleq Y^\top Y - \delta \operatorname{diag}(Y^\top Y), \quad (14)$$

where $Y \triangleq \frac{1}{1-\delta} Z \odot X$, is an unbiased estimator for $X^\top X$.

Proof. If $j \neq i$, then

$$\gamma_{ij} = (Y^\top Y)_{ij} = y_i^\top y_j = \frac{1}{(1-\delta)^2}\sum_{k=1}^{D} x_{ki}x_{kj}z_{ki}z_{kj},$$

where $y_i$ is the $i$th column of $Y$, $i = 1, \dots, N$. Then,

$$E[\gamma_{ij}] = \frac{1}{(1-\delta)^2}\sum_{k=1}^{D} x_{ki}x_{kj}\,E[z_{ki}z_{kj}] = \sum_{k=1}^{D} x_{ki}x_{kj} = x_i^\top x_j = (X^\top X)_{ij},$$

where $E[z_{ki}z_{kj}] = E[z_{ki}]E[z_{kj}]$ because of independence, and $E[z_{ki}] = 1-\delta$ by (13). If $j = i$, then

$$\gamma_{ii} = (Y^\top Y)_{ii} - \delta (Y^\top Y)_{ii} = (1-\delta)(Y^\top Y)_{ii} = (1-\delta)\sum_{k=1}^{D} y_{ki}y_{ki} = \frac{1}{1-\delta}\sum_{k=1}^{D} x_{ki}^2 z_{ki}^2.$$

So the expectation of $\gamma_{ii}$ for all $i = 1, \dots, N$ is

$$E[\gamma_{ii}] = \frac{1}{1-\delta}\sum_{k=1}^{D} x_{ki}^2\,E[z_{ki}^2] = (X^\top X)_{ii},$$

where we use the fact that $E[z_{ki}^2] = 1-\delta$ by (13). Combining the above two cases, we have $E[\gamma_{ij}] = (X^\top X)_{ij}$ for all $i, j = 1, \dots, N$, hence $\Gamma$ is an unbiased estimator for $X^\top X$, as claimed.

Lemma 1 suggests a simple approach for solving the incomplete SSC problem, in which we solve the problem in (5) after replacing the kernel matrix $X^\top X$ by

$$\hat{\Gamma} = \frac{1}{(1-\delta)^2}\big(X_{\text{miss}}^\top X_{\text{miss}} - \delta \operatorname{diag}(X_{\text{miss}}^\top X_{\text{miss}})\big). \quad (15)$$

Notice, however, that while the matrix $X^\top X$ is positive semidefinite, the matrix $\hat{\Gamma}$ in (15) might not be, hence the optimization problem in (5) may no longer be convex. To address this issue, we regularize $\hat{\Gamma}$ as follows:

$$\tilde{\Gamma} = \hat{\Gamma} + \frac{\delta}{(1-\delta)^2}\max\{\operatorname{diag}(X_{\text{miss}}^\top X_{\text{miss}})\}\, I, \quad (16)$$

where $\max\{\operatorname{diag}(X_{\text{miss}}^\top X_{\text{miss}})\}$ is the maximum value of the diagonal entries of $X_{\text{miss}}^\top X_{\text{miss}}$. One can check that $\tilde{\Gamma}$ is positive semidefinite. This leads to the following SSC by Expectation-based Completion (SSC-EC) algorithm:

$$\min_{C} \|C\|_1 + \frac{\lambda}{2}\operatorname{trace}\big((I - C)^\top \tilde{\Gamma} (I - C)\big) \quad \text{s.t.} \quad \operatorname{diag}(C) = 0. \quad (17)$$

We use the Alternating Direction Method of Multipliers (ADMM) (Boyd et al., 2010) to solve (17). The steps of the ADMM algorithm for solving the original problem (5) are given in (Elhamifar & Vidal, 2013). Since the ADMM algorithm for SSC depends only on $X^\top X$, to solve (17) we simply replace $X^\top X$ by $\tilde{\Gamma}$ in the ADMM algorithm of Elhamifar & Vidal (2013).
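For concreteness, the kernel estimate in (15) and the positive semidefinite correction in (16) amount to a few lines of linear algebra. The sketch below assumes the missing rate $\delta$ is known, as in the random model of Lemma 1; the function name is an illustrative choice, and the returned matrix is what would be substituted for $X^\top X$ in the ADMM solver.

```python
import numpy as np

def ssc_ec_kernel(X, W, delta):
    """Build the regularized kernel estimate Gamma_tilde of (15)-(16)."""
    Xmiss = W * X                                    # zero-filled data
    G = Xmiss.T @ Xmiss
    Gamma_hat = (G - delta * np.diag(np.diag(G))) / (1.0 - delta) ** 2   # unbiased estimate of X^T X, eq. (15)
    shift = delta / (1.0 - delta) ** 2 * np.max(np.diag(G))              # PSD correction of eq. (16)
    return Gamma_hat + shift * np.eye(G.shape[0])
```

Since the correction in (16) is a multiple of the identity, it only shifts the diagonal of the estimate; the off-diagonal entries are unchanged.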
3.4. SSC by Column-wise Expectation-based Completion (SSC-CEC)

A concern with the SSC-EC method is that it is based on the self-expressiveness error $\|X - XC\|_F^2$, rather than its incomplete version $\|P_\Omega(X - XC)\|_F^2$. As discussed in Section 3.2 for the ZF+SSC method, this introduces an undesired bias in the estimation of $C$. Moreover, observe that when $\delta$ is small enough, the expression for $\hat{\Gamma}$ in (15) becomes $\hat{\Gamma} \approx \frac{1}{(1-\delta)^2} X_{\text{miss}}^\top X_{\text{miss}}$. Thus, the SSC-EC method effectively reduces to the ZF+SSC method with $\lambda$ replaced by $\frac{\lambda}{(1-\delta)^2}$. To address this issue, we modify the self-expressiveness error to account only for observed entries, similar to our SSC-EWZF method. As it turns out, rather than solving for the entire matrix of coefficients as per (5), it will be more convenient to equivalently solve $N$ optimization problems, one for each column $c_j$, $j = 1, \dots, N$, of $C$:

$$\min_{c_j} \|c_j\|_1 + \frac{\lambda}{2}\|x_{\Omega_j,j} - X_{\Omega_j}c_j\|_2^2 \quad \text{s.t.} \quad c_{jj} = 0. \quad (18)$$

Here $x_{\Omega_j,j}$ is a vector with the entries of $x_j$ indexed by $\Omega_j$ (i.e., the observed entries of $x_j$), and $X_{\Omega_j}$ is a matrix with the rows of $X$ indexed by $\Omega_j$. After expanding the second term in the objective function above, we obtain

$$\|x_{\Omega_j,j}\|_2^2 - 2 c_j^\top X_{\Omega_j}^\top x_{\Omega_j,j} + c_j^\top X_{\Omega_j}^\top X_{\Omega_j} c_j. \quad (19)$$

Therefore, to solve the above optimization problem, we only need to estimate $X_{\Omega_j}^\top x_{\Omega_j,j}$ and $X_{\Omega_j}^\top X_{\Omega_j}$, which are specifically tuned to each data point and its missing entries. The following lemma gives unbiased estimators for them.

Lemma 2. For an arbitrary matrix $A$, let $A_\omega$ denote the submatrix of $A$ whose rows are indexed by $\omega \subseteq \{1, \dots, D\}$. Then, under the random model in Lemma 1, the kernel matrix $\Gamma(\omega) \in \mathbb{R}^{N \times N}$ defined as

$$\Gamma(\omega) \triangleq Y_\omega^\top Y_\omega - \delta \operatorname{diag}(Y_\omega^\top Y_\omega) \quad (20)$$

is an unbiased estimator for $X_\omega^\top X_\omega$.

Proof. For this proof, we let $\gamma_{ij}$ denote the $(i,j)$th entry of $\Gamma(\omega)$. Similar to the proof of Lemma 1, we have

$$\gamma_{ij} = (Y_\omega^\top Y_\omega)_{ij} = \frac{1}{(1-\delta)^2}\sum_{k \in \omega} x_{ki}x_{kj}z_{ki}z_{kj} \quad (j \neq i), \qquad \gamma_{ii} = (1-\delta)(Y_\omega^\top Y_\omega)_{ii} = \frac{1}{1-\delta}\sum_{k \in \omega} x_{ki}^2 z_{ki}^2.$$

Then, since $E[z_{ki}z_{kj}] = E[z_{ki}]E[z_{kj}] = (1-\delta)^2$ if $j \neq i$, and $E[z_{ki}^2] = 1-\delta$, we have $E[\gamma_{ij}] = (X_\omega^\top X_\omega)_{ij}$ for all $i, j = 1, \dots, N$. Thus $\Gamma(\omega)$ is an unbiased estimator for $X_\omega^\top X_\omega$.

It follows from Lemma 2 that the matrix

$$\hat{\Gamma}^{(j)} = \frac{1}{(1-\delta)^2}\Big((X_{\text{miss}})_{\Omega_j}^\top (X_{\text{miss}})_{\Omega_j} - \delta \operatorname{diag}\big((X_{\text{miss}})_{\Omega_j}^\top (X_{\text{miss}})_{\Omega_j}\big)\Big)$$

and its $j$th column $\hat{\gamma}^{(j)}$, respectively, estimate the matrix $X_{\Omega_j}^\top X_{\Omega_j}$ and the vector $X_{\Omega_j}^\top x_{\Omega_j,j}$. However, since $\hat{\Gamma}^{(j)}$ may not be positive semidefinite, as before we regularize it as

$$\tilde{\Gamma}^{(j)} = \hat{\Gamma}^{(j)} + \frac{\delta}{(1-\delta)^2}\max\big\{\operatorname{diag}\big((X_{\text{miss}})_{\Omega_j}^\top (X_{\text{miss}})_{\Omega_j}\big)\big\}\, I.$$

Similarly, we define $\tilde{\gamma}^{(j)}$ as the $j$th column of $\tilde{\Gamma}^{(j)}$. This leads to the convex optimization problem

$$\min_{c_j} \|c_j\|_1 + \frac{\lambda}{2}\big(c_j^\top \tilde{\Gamma}^{(j)} c_j - 2 c_j^\top \tilde{\gamma}^{(j)}\big) \quad \text{s.t.} \quad c_{jj} = 0, \quad (21)$$

which we solve using the ADMM algorithm, as before.

3.5. Discussion and Bias-corrected Dantzig Selector

Notice that solving the problem in (17) is more efficient than solving the $N$ problems in (21), because (17) involves inverting a matrix that is common to all columns of $C$, while (21) involves inverting a different matrix for each column of $C$. Notice also that both methods are different from the bias-corrected Dantzig selector (BCDS)

$$\min_{c_j} \|c_j\|_1 \quad \text{s.t.} \quad \|\hat{\gamma}^{(j)} - \hat{\Gamma}^{(j)} c_j\|_\infty \leq \lambda \ \text{ and } \ c_{jj} = 0 \quad (22)$$

proposed by Candès et al. (2014) because, instead of trying to penalize the standard reconstruction error $\|x_j - Xc_j\|_2^2$, this approach tries to penalize $\|X^\top x_j - X^\top X c_j\|_\infty$. In our experience, this approach is computationally more expensive than SSC-CEC and sensitive to the choice of $\lambda$: a value that is too small leads to infeasible problems, while a value that is too large leads to poor clustering performance.
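To make the column-wise formulation concrete, the sketch below builds $\tilde{\Gamma}^{(j)}$ and $\tilde{\gamma}^{(j)}$ for one column and minimizes (21) with a plain proximal-gradient (ISTA) loop. The paper solves (21) with ADMM instead; the ISTA substitution, the function names, and the numerical defaults are assumptions made only for illustration, and the leading $1/(1-\delta)^2$ normalization merely rescales $\lambda$.

```python
import numpy as np

def soft_threshold(v, t):
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def ssc_cec_column(Xmiss, W, j, delta, lam=20.0, n_iter=500):
    """Proximal-gradient sketch of the per-column problem (21)."""
    omega = W[:, j].astype(bool)                      # observed rows Omega_j
    Xo = Xmiss[omega, :]                              # (X_miss)_{Omega_j}
    G = Xo.T @ Xo
    Gamma = (G - delta * np.diag(np.diag(G))) / (1.0 - delta) ** 2      # estimate of X_{Omega_j}^T X_{Omega_j}
    Gamma += delta / (1.0 - delta) ** 2 * np.max(np.diag(G)) * np.eye(G.shape[0])   # PSD shift
    gamma = Gamma[:, j]                               # gamma_tilde^(j): j-th column of the regularized kernel
    step = 1.0 / (lam * np.linalg.eigvalsh(Gamma)[-1] + 1e-12)          # 1 / Lipschitz constant
    c = np.zeros(Gamma.shape[0])
    for _ in range(n_iter):
        grad = lam * (Gamma @ c - gamma)              # gradient of the smooth term in (21)
        c = soft_threshold(c - step * grad, step)     # l1 proximal step
        c[j] = 0.0                                    # enforce c_jj = 0
    return c
```

Because the nonsmooth part of (21) is separable, zeroing the $j$th coordinate after the soft-thresholding step implements the exact proximal operator of the $\ell_1$ norm plus the constraint $c_{jj} = 0$.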
4. Experiments

In this section, we evaluate the performance of MC+SSC, ZF+SSC, SSC-EWZF, SSC-EC, SSC-CEC, and BCDS on both synthetic data and the Hopkins 155 motion segmentation dataset (Tron & Vidal, 2007). All algorithms involve a penalty parameter $\lambda$ that should be carefully chosen so as to balance reconstruction error and sparsity: a small $\lambda$ may lead to sparse solutions but a large reconstruction error, while a large $\lambda$ may give very good reconstruction but non-sparse solutions. In (Elhamifar & Vidal, 2013), an adaptive choice for $\lambda$ in (5) is given for a complete data matrix $X$ as

$$\lambda = \alpha \,/\, \min_j \max_{i \neq j} |X^\top X|_{ij}, \quad (23)$$

where $\alpha > 1$ is a new tuning parameter. The justification for this choice of $\lambda$ is that, if $\alpha > 1$, every column of $C$ is guaranteed to be nonzero.

Considering the specific form of ZF+SSC, SSC-EWZF, SSC-EC, and SSC-CEC, the denominator in (23) changes to $\min_j \max_{i \neq j} |X_{\text{miss}}^\top X_{\text{miss}}|_{ij}$, $\max_{i \neq j} |(X_{\text{miss}})_{\Omega_j}^\top (X_{\text{miss}})_{\Omega_j}|_{ij}$, $\min_j \max_{i \neq j} |\tilde{\Gamma}_{ij}|$, and $\max_{i \neq j} |\tilde{\Gamma}^{(j)}_{ij}|$, respectively. The clustering accuracy, or alternatively the clustering error, is used as the metric for comparing the performance of the different methods. Specifically, the clustering accuracy is the ratio of correctly classified data points to the total number of data points, and the clustering error is 1 minus the clustering accuracy.

4.1. Synthetic Data

The synthetic data are generated by drawing $N_0 \geq d$ points per subspace from a union of $n$ subspaces of equal dimension $d < D$ in $\mathbb{R}^D$, as follows. Let $A_l \in \mathbb{R}^{D \times d}$ and $B_l \in \mathbb{R}^{d \times N_0}$ be independent random Gaussian matrices, for $l = 1, \dots, n$. Since $\operatorname{rank}(A_l B_l) = d$ with high probability, the columns of $A_l B_l$ lie in a $d$-dimensional linear subspace $S_l = \operatorname{span}(A_l B_l)$. Also, the randomness guarantees that the subspaces $\{S_l\}_{l=1}^{n}$ are independent with high probability. Therefore, the complete data matrix can be constructed as $X = [A_1 B_1, \dots, A_n B_n] \in \mathbb{R}^{D \times N}$, where $N = nN_0$. We also normalize the columns of $X$ to have unit $\ell_2$ norm. To generate the incomplete data, we draw a random matrix $W$ whose entries are i.i.d. Bernoulli with missing rate $\delta$.

In order to study the influence of the rank of the data matrix on the clustering performance, we evaluate all methods in both the low-rank and the high-rank regimes. Note that the rank of $X$ will be approximately equal to the sum of the dimensions of the subspaces, i.e., $\operatorname{rank}(X) \approx nd$, since the subspaces are independent with high probability. In our experiments, we use $D = 100$, $N_0 = 50$, $d = 5$, $n = 2$ to simulate the low-rank case ($dn \ll D$), and $D = 25$, $N_0 = 50$, $d = 5$, $n = 5$ to simulate the high-rank case ($dn = D$).
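The synthetic data described above can be generated in a few lines. The following sketch follows the construction $X = [A_1B_1, \dots, A_nB_n]$ with unit-norm columns and an i.i.d. Bernoulli observation mask; the function name, the default argument values (which match the low-rank regime), and the seeding convention are illustrative choices rather than the authors' code.

```python
import numpy as np

def synthetic_union_of_subspaces(D=100, d=5, n=2, N0=50, delta=0.5, seed=None):
    """Sample N0 points from each of n random d-dimensional subspaces of R^D,
    normalize the columns, and draw a Bernoulli mask with missing rate delta."""
    rng = np.random.default_rng(seed)
    blocks = [rng.standard_normal((D, d)) @ rng.standard_normal((d, N0)) for _ in range(n)]
    X = np.hstack(blocks)                                # X = [A_1 B_1, ..., A_n B_n]
    X /= np.linalg.norm(X, axis=0, keepdims=True)        # unit l2-norm columns
    W = (rng.random(X.shape) >= delta).astype(float)     # w_ij = 1 with probability 1 - delta
    labels = np.repeat(np.arange(n), N0)                 # ground-truth subspace labels
    return X, W, labels
```

The high-rank regime used in the experiments corresponds to calling this with $D = 25$ and $n = 5$.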
Effect of the penalty parameter. Figures 1(a)-1(b) show the clustering error versus $\alpha$ in the low-rank case for missing rates of $\delta = 0.5$ and $0.8$. When $\delta = 0.5$, we can see that the methods that use the complete self-expressiveness error (ZF+SSC and SSC-EC) are very sensitive to the choice of $\alpha$. In contrast, the methods based on the incomplete error (SSC-EWZF and SSC-CEC) work perfectly for almost all values of $\alpha$. Also, SSC-EC is better than ZF+SSC, as expected. The performance of MC+SSC is not very sensitive to $\alpha$, but as $\delta$ increases from 0.5 to 0.8 the clustering error increases drastically. This may be explained by the fact that MC fails when the missing rate is too high. The clustering errors of SSC-EWZF and SSC-CEC also increase with $\delta$, but less dramatically. Overall, SSC-EWZF and SSC-CEC are the most accurate and robust methods against changes in $\alpha$.

[Figure 1. Clustering error versus $\alpha$ (log scale). (a)-(b) show the low-rank case ($D = 100$, $N_0 = 50$, $d = 5$, $n = 2$) for $\delta = 0.5$ and $\delta = 0.8$; (c)-(d) show the high-rank case ($D = 25$, $N_0 = 50$, $d = 5$, $n = 5$) for $\delta = 0.3$ and $\delta = 0.5$. Curves: MC+SSC, ZF+SSC, SSC-EWZF, SSC-EC, SSC-CEC.]

Figures 1(c)-1(d) show the corresponding results in the high-rank case for missing rates of $\delta = 0.3$ and $0.5$. By comparing Figures 1(d) and 1(a), we see that the clustering errors of all methods increase, arguably due to the increase in the number of subspaces. Notice also that, as before, ZF+SSC and SSC-EC give the highest errors, but this time MC+SSC also gives high errors because the MC step fails, as the data matrix is not low rank. SSC-CEC performs better than the previous methods, but its sensitivity with respect to $\alpha$ increases. Overall, SSC-EWZF performs best, as it gives lower errors for a broader range of values of $\alpha$.

Effect of the missing rate. To evaluate performance as a function of the missing rate, we first need to decide how to do a fair comparison. This is because performance depends on the choice of the parameter $\alpha$, and the optimal range for $\alpha$ may not be the same for different methods. One possibility is to choose the optimal $\alpha$ for each method and then compare their performance. In practice, however, we may not know the best $\alpha$, hence fixing $\alpha$ for all methods is another possibility. We evaluate performance in both ways.

Figures 2(a)-2(b) show the clustering error versus $\delta$ in the low-rank case. Observe that all methods give nearly perfect clustering for $\delta < 0.6$. As before, the performance of ZF+SSC and SSC-EC deteriorates very quickly as $\delta$ increases, while the performance of MC+SSC, SSC-EWZF, and SSC-CEC remains very good. For this experiment only, we also evaluate the BCDS method. While BCDS uses the same completion strategy as SSC-CEC, the sparse representation is found by solving (22) instead of (21). As we can see, this change results in a higher clustering error. As discussed before, we believe this is because the performance of (22) is highly dependent on the choice of $\lambda$ in (22). We choose $\lambda$ as a function of $n(1-\delta)$, as suggested in (Candès et al., 2014).

[Figure 2. Clustering error versus missing rate $\delta$. (a)-(b) show the low-rank case ($D = 100$, $N_0 = 50$, $d = 5$, $n = 2$) for the optimal $\alpha$ and for $\alpha = 5$; (c)-(d) show the high-rank case ($D = 25$, $N_0 = 50$, $d = 5$, $n = 5$) for the optimal $\alpha$ and for $\alpha = 5$. Curves: MC+SSC, ZF+SSC, SSC-EWZF, SSC-EC, SSC-CEC, BCDS.]

Figures 2(c)-2(d) show the results for the high-rank case. Observe that all methods give nearly perfect clustering for $\delta < 0.2$, but their performance deteriorates quickly as $\delta$ increases. In particular, MC+SSC fails since the data matrix is no longer low rank, giving a clustering error that is similar to that of ZF+SSC, SSC-EC, and BCDS. SSC-EWZF and SSC-CEC give clustering errors that are similar to each other and clearly lower than those of ZF+SSC, SSC-EC, and BCDS.

Effect of the number of subspaces, the dimension of the subspaces, and the number of samples per subspace. Figure 3 evaluates the performance of SSC-CEC as a function of the number of subspaces $n$, the dimension of the subspaces $d$, and the number of samples $N_0$ per subspace. In Figure 3(a), we fix $N_0 = 30$, $D = 100$, $d = 5$, and vary $n$ as 2, 3, 5, 7, 10. We can see that the break point for perfect clustering ($\delta \approx 0.55$) does not change much as $n$ varies. This means that SSC-CEC is robust to variations in the number of clusters when the missing rate is low.
For a high missing rate, however, the clustering error increases very quickly as the number of subspaces increases.

[Figure 3. Clustering error versus missing rate $\delta$ for SSC-CEC, as a function of the number of subspaces $n$, the dimension of the subspaces $d$, and the number of samples $N_0$ per subspace. (a) Varying $n$ when $N_0 = 30$, $D = 100$, $d = 5$; (b) varying $d$ when $N_0 = 30$, $D = 100$, $n = 2$; (c) varying $N_0$ when $D = 100$, $d = 5$, $n = 2$; (d) varying $d$ and $n$ while fixing their product $nd = D$ when $N_0 = 30$ and $D = 30$, with $(d, n) \in \{(15,2), (10,3), (6,5), (3,10)\}$.]

In Figure 3(b), we fix $N_0 = 30$, $D = 100$, $n = 2$, and vary $d$ as 1, 3, 5, 7, 9. We can see that as the dimension of each subspace increases, the clustering error increases. This is expected because the data are not as well distributed inside the subspaces when $N_0/d$ decreases, hence SSC may fail. In Figure 3(c), we fix $D = 100$, $d = 5$, $n = 2$, and vary $N_0$ as 5, 10, 20, 50, 100. Observe that as more samples are added, the performance of SSC-CEC becomes more and more stable. This is because the self-expressiveness property may fail when the number of samples per subspace is too small. Also, increasing the number of samples per subspace improves the break point for perfect clustering. In Figure 3(d), we fix $N_0 = 30$, $D = 30$, and vary both $d$ and $n$ while keeping their product constant at $nd = D$. By construction, the subspaces are independent with high probability. Hence, this choice for $n$ and $d$ makes the data matrix almost full rank. It is interesting to see that when the missing rate is low, SSC-CEC performs best for low-dimensional subspaces, but when the missing rate is high, it performs best for high-dimensional subspaces.

SSR error and connectivity. While the clustering error is the ultimate performance metric for subspace clustering, such a metric depends on the spectral clustering step. To evaluate the direct output of the SSC algorithms, it is also customary to evaluate the quality of the sparse representation matrix $C$ by measuring the subspace sparse representation (SSR) error and the connectivity of each cluster.
For the SSR error, we compute, for each column of $C$, the proportion of the total absolute weight contributed by entries corresponding to points from other subspaces, and then average over all the columns. Since separability of the different subspaces is desired, a lower SSR error is preferable. For connectivity, we calculate the second smallest eigenvalue $\lambda_2$ of the normalized Laplacian matrix of each cluster, which measures the connectivity of the corresponding subgraph. We take the smallest $\lambda_2$ across all groups as a measure of the connectivity for all groups. Note that higher connectivity is preferable, since it prevents over-clustering. We report both the SSR error and the connectivity while varying the penalty parameter $\alpha$. We observe that connectivity usually goes up when the SSR error increases. Thus we plot them together in an ROC-like curve, as shown in Figure 4.

[Figure 4. SSR error versus connectivity obtained by varying the parameter $\alpha$, for MC+SSC, ZF+SSC, SSC-EWZF, SSC-EC, and SSC-CEC. (a) Missing rate $\delta = 0.5$ in the low-rank case; (b) $\delta = 0.8$ in the low-rank case; (c) $\delta = 0.3$ in the high-rank case; (d) $\delta = 0.5$ in the high-rank case.]

Figures 4(a) and 4(b) plot the SSR error versus connectivity for a missing rate $\delta$ of 0.5 and 0.8, respectively, in the low-rank case. Observe that MC+SSC is slightly better than SSC-EWZF and SSC-CEC when $\delta = 0.5$, since its curve is above and to the left of those for SSC-EWZF and SSC-CEC. Observe also that MC+SSC produces very low SSR errors but also very low connectivity for all values of $\alpha$. When $\delta = 0.8$, the performance of MC+SSC deteriorates, which is consistent with the fact that MC fails for very high missing rates. On the other hand, SSC-EWZF and SSC-CEC continue to be the best methods, with SSC-EWZF being slightly better for small SSR errors and SSC-CEC being slightly better for high SSR errors.

Figures 4(c) and 4(d) plot the SSR error versus connectivity for a missing rate $\delta$ of 0.3 and 0.5, respectively, in the high-rank case. As before, SSC-EWZF and SSC-CEC continue to be the best methods, showing that they are clearly advantageous in the high-rank case, where MC does not work.

4.2. Motion Segmentation

In this experiment, we evaluate the performance of the different subspace clustering and completion methods on the motion segmentation problem in computer vision. Given a video with multiple moving objects, feature extraction and tracking methods are used to extract $N$ feature points and track them across $F$ frames of the video. Under the affine projection model, the trajectories associated with one moving object lie in an affine subspace of $\mathbb{R}^{2F}$ of dimension $d = 1, 2, 3$. The task is to cluster these trajectories according to their corresponding motion subspaces. We evaluate the different methods on the Hopkins 155 dataset, which contains 155 video sequences with 2 or 3 moving objects. Since in this case the dimension of each motion subspace is $d = 1, 2, 3$, the number of motions is $n = 2, 3$, and the dimension of the ambient space is $D = 2F \geq 30$, the data matrix is always low rank, and so we expect MC+SSC to work very well. Thus, to simulate the high-rank case, we also run experiments on trajectories subsampled to 3 or 6 frames, so that the dimension of the ambient space is $D = 6$ or $12$. The sampled frames are chosen to be as equally spaced and spread out as possible. For example, if the original data has 20 frames and the number of sampled frames is 3, then we sample the 1st, 10th, and 19th frames. Thus, the full data, the 6-frame data, and the 3-frame data represent the low-rank, mid-rank, and high-rank cases, respectively. To make the data incomplete, we generate a mask matrix $W \in \{0,1\}^{D \times N}$ whose entries are i.i.d. Bernoulli with missing rate $\delta$ and let $X_{\text{miss}} = W \odot X$.
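As an illustration of this preprocessing, the sketch below keeps a given number of frames spread roughly evenly across the sequence and then applies the Bernoulli mask. It assumes the usual $2F \times N$ trajectory layout in which frame $f$ occupies rows $2f$ and $2f+1$; the function name, the rounding of the frame indices (which may differ by one frame from the example in the text), and the seeding are illustrative assumptions.

```python
import numpy as np

def subsample_frames_and_mask(X, F, n_frames, delta, seed=None):
    """Keep n_frames evenly spread frames out of F, then hide entries i.i.d. with rate delta."""
    rng = np.random.default_rng(seed)
    frames = np.round(np.linspace(0, F - 1, n_frames)).astype(int)    # roughly equally spaced frames
    rows = np.sort(np.concatenate([2 * frames, 2 * frames + 1]))      # x and y rows of each kept frame
    Xs = X[rows, :]
    W = (rng.random(Xs.shape) >= delta).astype(float)                 # Bernoulli observation mask
    return W * Xs, W
```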
Unlike in the synthetic data experiments, in the Hopkins 155 dataset the subspaces are affine. We handle this by adding the constraint $\mathbf{1}^\top C = \mathbf{1}^\top$ to each convex optimization problem, which enforces the sparse coefficients of each point to add up to 1.

Figure 5 shows the clustering error of the different methods as a function of the missing rate $\delta$ with fixed $\alpha = 5$. As we can see, SSC-CEC outperforms ZF+SSC and SSC-EC in all cases. For the low-rank case shown in Figure 5(a), MC+SSC does a very good job, as expected. However, as the dimension of the ambient space is reduced, MC starts to fail. As a result, the clustering error of MC+SSC increases rapidly in Figure 5(b), and even more so in Figure 5(c). Meanwhile, SSC-EWZF and SSC-CEC are less sensitive to the dimension of the ambient space or the rank of the data matrix, and outperform MC+SSC in the high-rank case (Figure 5(c)) when the missing rate is higher than 0.3. Therefore, SSC-EWZF and SSC-CEC are potentially better methods for the high-rank, high-missing-rate scenario.

[Figure 5. Clustering error versus missing rate $\delta$ on the Hopkins 155 dataset, with the penalty parameter fixed at $\alpha = 5$. Experiments are conducted on (a) the full data, (b) data subsampled to 6 frames, and (c) data subsampled to 3 frames. Curves: MC+SSC, ZF+SSC, SSC-EWZF, SSC-EC, SSC-CEC.]

5. Conclusions

We have proposed two algorithms for subspace clustering with missing entries, called sparse subspace clustering by entry-wise zero fill (SSC-EWZF) and sparse subspace clustering by column-wise expectation-based completion (SSC-CEC). SSC-EWZF is a natural generalization of SSC in which the self-expressiveness error is restricted to the observed entries, while SSC-CEC is a natural generalization of SSC in which the kernel matrix $X^\top X$ restricted to the observed entries is replaced by an unbiased estimator under a random model. Both algorithms were compared against the natural approaches of first filling in the missing entries either with zeros (ZF+SSC) or using matrix completion (MC+SSC). The results show that MC+SSC is competitive only when the data matrix is low rank and the percentage of missing entries is low. Otherwise, SSC-EWZF and SSC-CEC perform significantly better. Moreover, we conjectured that SSC-EWZF should give perfect clustering provided that the subspaces are sufficiently separated, the data are well distributed inside the subspaces, and the subspaces are incoherent with respect to the missing entries. Such conditions for correctness will be investigated in future work.

Acknowledgements

This work was supported by NSF grant 1447822. The authors thank Li Chunguang for his help with the design of the experiment reported in Figure 3(d). René Vidal also thanks Li Chunguang, Chong You, Mahdi Soltanolkotabi, Lester Mackey, and Emmanuel Candès for extremely insightful discussions on the problem of subspace clustering with missing data.

References

Balzano, L., Nowak, R., and Bajwa, W. Column subset selection with missing data. In NIPS Workshop on Low Rank Methods for Large-Scale Machine Learning, 2010.

Boyd, S., Parikh, N., Chu, E., Peleato, B., and Eckstein, J. Distributed optimization and statistical learning via the alternating direction method of multipliers. Foundations and Trends in Machine Learning, 3(1):1-122, 2010.
Bradley, P. S. and Mangasarian, O. L. k-plane clustering. Journal of Global Optimization, 16(1):23-32, 2000.

Cai, J.-F., Candès, E. J., and Shen, Z. A singular value thresholding algorithm for matrix completion. SIAM Journal on Optimization, 20(4):1956-1982, 2008.

Candès, E. and Recht, B. Exact matrix completion via convex optimization. Foundations of Computational Mathematics, 9:717-772, 2009.

Candès, E. and Tao, T. The power of convex relaxation: Near-optimal matrix completion. IEEE Transactions on Information Theory, 56(5):2053-2080, 2010.

Candès, E. J., Mackey, L., and Soltanolkotabi, M. From robust subspace clustering to full-rank matrix completion. Unpublished abstract, 2014.

Elhamifar, E. and Vidal, R. Sparse subspace clustering. In IEEE Conference on Computer Vision and Pattern Recognition, 2009.

Elhamifar, E. and Vidal, R. Clustering disjoint subspaces via sparse representation. In IEEE International Conference on Acoustics, Speech, and Signal Processing, 2010.

Elhamifar, E. and Vidal, R. Sparse subspace clustering: Algorithm, theory, and applications. IEEE Transactions on Pattern Analysis and Machine Intelligence, 35(11):2765-2781, 2013.

Eriksson, B., Balzano, L., and Nowak, R. High-rank matrix completion. Journal of Machine Learning Research, Proceedings Track, 22:373-381, 2012.

Favaro, P., Vidal, R., and Ravichandran, A. A closed form solution to robust subspace estimation and clustering. In IEEE Conference on Computer Vision and Pattern Recognition, 2011.

Gross, D. Recovering low-rank matrices from few coefficients in any basis. IEEE Transactions on Information Theory, 57(3):1548-1566, 2011.

Gruber, A. and Weiss, Y. Multibody factorization with uncertainty and missing data using the EM algorithm. In IEEE Conference on Computer Vision and Pattern Recognition, volume I, pp. 707-714, 2004.

Keshavan, R., Montanari, A., and Oh, S. Matrix completion from a few entries. IEEE Transactions on Information Theory, 2010.

Liu, G., Lin, Z., and Yu, Y. Robust subspace segmentation by low-rank representation. In International Conference on Machine Learning, 2010.

Liu, G., Lin, Z., Yan, S., Sun, J., and Ma, Y. Robust recovery of subspace structures by low-rank representation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 35(1):171-184, January 2013.

Lu, C.-Y., Min, H., Zhao, Z.-Q., Zhu, L., Huang, D.-S., and Yan, S. Robust and efficient subspace segmentation via least squares regression. In European Conference on Computer Vision, 2012.

Patel, V. M. and Vidal, R. Kernel sparse subspace clustering. In IEEE International Conference on Image Processing, 2014.

Recht, B. A simpler approach to matrix completion. Journal of Machine Learning Research, 12:3413-3430, 2011.

Soltanolkotabi, M. and Candès, E. J. A geometric analysis of subspace clustering with outliers. Annals of Statistics, 2013.

Soltanolkotabi, M., Elhamifar, E., and Candès, E. J. Robust subspace clustering. Annals of Statistics, 42(2):669-699, 2014.

Tipping, M. and Bishop, C. Mixtures of probabilistic principal component analyzers. Neural Computation, 11(2):443-482, 1999.

Tron, R. and Vidal, R. A benchmark for the comparison of 3-D motion segmentation algorithms. In IEEE Conference on Computer Vision and Pattern Recognition, 2007.

Tseng, P. Nearest q-flat to m points. Journal of Optimization Theory and Applications, 105(1):249-252, 2000.
Vidal, R. Subspace clustering. IEEE Signal Processing Magazine, 28(3):52-68, March 2011.

Vidal, R. and Favaro, P. Low rank subspace clustering (LRSC). Pattern Recognition Letters, 43:47-61, 2014.

Vidal, R. and Hartley, R. Motion segmentation with missing data by Power Factorization and Generalized PCA. In IEEE Conference on Computer Vision and Pattern Recognition, volume II, pp. 310-316, 2004.

Vidal, R., Ma, Y., and Sastry, S. Generalized Principal Component Analysis (GPCA). IEEE Transactions on Pattern Analysis and Machine Intelligence, 27(12):1-15, 2005.

Wang, Y.-X. and Xu, H. Noisy sparse subspace clustering. In International Conference on Machine Learning, 2013.

Zhou, M., Wang, C., Chen, M., Paisley, J., Dunson, D., and Carin, L. Nonparametric Bayesian matrix completion. In Sensor Array and Multichannel Signal Processing Workshop, 2010.