Published as a conference paper at ICLR 2021

MIXED-FEATURES VECTORS & SUBSPACE SPLITTING

Alejandro Pimentel-Alarcón
IIMAS, Universidad Nacional Autónoma de México, UNAM
Mexico City, Mexico
pimentel@comunidad.unam.mx

Daniel Pimentel-Alarcón
Department of Biostatistics
Wisconsin Institute for Discovery
UW-Madison, WI, 53715
pimentelalar@wisc.edu

ABSTRACT

Motivated by metagenomics, recommender systems, dictionary learning, and related problems, this paper introduces subspace splitting (SS): the task of clustering the entries of what we call a mixed-features vector, that is, a vector whose subsets of coordinates agree with a collection of subspaces. We derive precise identifiability conditions under which SS is well-posed, thus providing the first fundamental theory for this problem. We also propose the first three practical SS algorithms, each with advantages and disadvantages: a random sampling method, a projection-based greedy heuristic, and an alternating Lloyd-type algorithm; all allow noise, outliers, and missing data. Our extensive experiments outline the performance of our algorithms, and in the absence of other SS algorithms, for reference we compare against methods for tightly related problems, like robust matched subspace detection and maximum feasible subsystem, which are simpler special cases of SS.

1 INTRODUCTION

As the reach of data science expands, and as we continuously improve our sensing, storage, and computing capabilities, data in virtually all fields of science keeps becoming increasingly high-dimensional. For example, the CERN Large Hadron Collider currently generates so much data that scientists must discard the overwhelming majority of it, hoping that they have not thrown away anything useful [1], and the upcoming Square Kilometer Array is expected to produce 100 times that [2]. Fortunately, high-dimensional data often has an underlying low-dimensional structure. Inferring such structure not only cuts memory and computational burdens, but also reduces noise and improves learning and prediction. However, higher dimensionality not only increases computational requirements; it also increases the complexity of the data's structure. In light of this, several research lines have explored new low-dimensional models that best summarize data, going from principal component analysis (PCA) [3-11] and single subspaces [12-21] to unions of subspaces [22-40] and algebraic varieties [41].

This paper introduces mixed-features vectors (MFVs): a new model that describes the underlying structure of data arising from several modern applications that is not captured by existing low-dimensional models. The main idea is that each entry of an MFV comes from one out of several classes, and that the entries of the same class lie in an underlying subspace. In particular, MFVs are motivated by metagenomics [42-46] and recommender systems [47-59]: in metagenomics each gene segment comes from one of the several taxa present in a microbiome; in recommender systems each rating may come from one of several users sharing the same account. However, MFVs also have applications in robust estimation (e.g., robust PCA [3-11] and robust dictionary learning [60-67]), matrix completion [48-59], subspace clustering [22-39], and more.

This paper also introduces subspace splitting (SS): the task of clustering the entries of an MFV according to its underlying subspaces. SS is tightly related to other machine learning problems.
In particular, SS can be thought of as a generalization of robust matched subspace detection (RMSD) [12-17] and maximum feasible subsystem (MAXFS) [68-74]. However, the added complexity of SS renders existing approaches for these problems inapplicable, which calls for specialized SS theory and methods. In these regards, (i) we derive precise identifiability conditions under which SS is well-posed, and (ii) we propose the first three SS algorithms.

2 PROBLEM STATEMENT AND FUNDAMENTAL THEORY

Let U^1, ..., U^K be subspaces of R^d, and let Ω_0, Ω_1, ..., Ω_K denote a partition of [d] := {1, ..., d}. For any subspace, matrix or vector that is compatible with a set of indices Ω ⊆ [d], we will use the subscript Ω to denote its restriction to the coordinates/rows in Ω. For example, U^1_{Ω_K} ⊆ R^{|Ω_K|} denotes the restriction of U^1 to the coordinates in Ω_K. Define x ∈ R^d as the mixed-features vector (MFV) such that x_{Ω_k} ∈ U^k_{Ω_k} for each k = 1, ..., K, and the entries of x_{Ω_0} are outliers. Let ϵ ∈ R^d denote a noise vector with variance σ^2. Given U^1, ..., U^K and an incomplete observation y_Ω = x_Ω + ϵ_Ω, the goal of subspace splitting (SS) is to determine the subsets Ω_1 ∩ Ω, ..., Ω_K ∩ Ω indicating the observed coordinates of y that match with each subspace.

Example 1. Consider the following setup, with 1-dimensional subspaces U^1, U^2 spanned by U^1, U^2 (the basis vectors and the mixed-features vector x are displayed in the original). It is easy to see that Ω_1 = {1, 2}, Ω_2 = {3, 4}, Ω_0 = {5}, because x_{Ω_1} = (1/2) U^1_{Ω_1} and x_{Ω_2} = 2 U^2_{Ω_2}.

The keen reader will immediately wonder: is there another partition {Ω'_1, ..., Ω'_K} different from {Ω_1, ..., Ω_K} such that x_{Ω'_k} ∈ U^k_{Ω'_k} for every k? In other words, is this problem well-posed, and if so, under what conditions? Our main theoretical result answers this question, showing that under the next assumptions, Ω_k can be recovered if and only if it has more elements than the dimension of U^k.

A1 Each U^k is drawn independently with respect to the uniform measure over the Grassmannian.
A2 Each x_{Ω_k} is drawn independently according to an absolutely continuous distribution with respect to the Lebesgue measure on U^k_{Ω_k}.

In words, A1 essentially requires that U^1, ..., U^K are in general position, with no particular relation to one another. Similarly, A2 requires that each piece of x is in general position over its corresponding piece of subspace. This type of genericity assumption is becoming increasingly common in compressed sensing, matrix completion, subspace clustering, tensor theory, and related problems [10, 21, 30, 31, 33-38, 41, 56-59]. All our statements hold with probability 1 with respect to the measures in A1 and A2. We point out that A1 and A2 do not imply coherence or affinity (other typical assumptions in related theory that quantify alignment with the canonical axes or between subspaces [3, 6, 11, 23, 26, 27, 48-50, 54, 55]), nor vice versa. For example, bounded coherence and affinity assumptions indeed allow subspaces perfectly aligned on some coordinates. However, they rule out cases that our assumptions allow, for example the non-zero-measure set of highly coherent or affine subspaces that are somewhat aligned with the canonical axes or with one another. To sum up, these assumptions are different, neither stronger nor weaker than the usual coherence and affinity assumptions.
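To make the data model concrete, the following is a minimal Python/NumPy sketch (not from the paper; all variable names and sizes are illustrative) that builds a generic mixed-features vector in the spirit of A1-A2: random bases, a partition of [d] into an outlier block and K subspace blocks, and a vector x whose kth block lies in the restriction of the kth subspace.

```python
import numpy as np

rng = np.random.default_rng(0)
d, K, r = 15, 2, 2                       # ambient dim, # subspaces, subspace dim

# Generic bases U^1, ..., U^K (A1) and generic coefficients theta^k (A2)
U = [rng.standard_normal((d, r)) for _ in range(K)]
theta = [rng.standard_normal(r) for _ in range(K)]

# Partition [d] into outlier coordinates Omega_0 and blocks Omega_1, ..., Omega_K
perm = rng.permutation(d)
Omega0, blocks = perm[:3], np.array_split(perm[3:], K)

# Mixed-features vector: x_{Omega_k} = U^k_{Omega_k} theta^k, outliers elsewhere
x = np.empty(d)
x[Omega0] = rng.standard_normal(Omega0.size)
for k, Om in enumerate(blocks):
    x[Om] = U[k][Om] @ theta[k]

# Sanity check: each block has (numerically) zero residual on its own subspace
for k, Om in enumerate(blocks):
    resid = x[Om] - U[k][Om] @ np.linalg.lstsq(U[k][Om], x[Om], rcond=None)[0]
    print(k, np.linalg.norm(resid))      # ~0 for the true partition
```

Subspace splitting asks the inverse question: given only x (or a noisy, partially observed y_Ω) and the bases, recover which coordinates belong to which block.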
With this, we are ready to state our main theorem, showing that subspace splitting is possible if and only if x contains more than dim(U^k) entries from each subspace U^k.

Theorem 1. Suppose A1 and A2 hold. Given x and U^1, ..., U^K, one can identify Ω_1, ..., Ω_K if and only if |Ω_k| > dim(U^k) for every k.

Example 1 shows a case where the conditions of Theorem 1 are met (|Ω_1| = |Ω_2| = 2 > dim(U^1) = dim(U^2) = 1), and consequently subspace splitting is well-posed (there exists no partition other than the true {Ω_1, Ω_2} that splits x into U^1 and U^2). Conversely, the following example shows a case where the conditions of Theorem 1 are not satisfied, and at least some Ω_k is unidentifiable.

Example 2. Consider the following setting (construction displayed in the original): now |Ω_2| = |Ω_3| = 1 = dim(U^2) = dim(U^3). As Theorem 1 shows, there exist multiple ways to split x into U^1, U^2, and U^3. Here the partitions could be {Ω_1, Ω_2, Ω_3} = {{1, 2}, {3}, {4}} or {{1, 2}, {4}, {3}}, and there is no way of telling which is the true partition from U^1, U^2, U^3, and x. In other words, Ω_2 and Ω_3 are unidentifiable.

Remark 1. We point out that the constructions in our examples are not generic (i.e., they were not constructed according to A1 and A2). We chose them for their simplicity, to build intuition and make a point. However, our results still apply to these constructions, showing that in addition to all generic cases, our theory also holds for some non-generic ones. An exact characterization of all the non-generic cases that our theory covers requires a careful study of notions of sketching and partial coordinate discrepancy [31], which are out of the scope of this paper.

The proof of Theorem 1 follows from the next two lemmas. Lemma 1 shows that any subset of r or fewer entries of any vector will always match with any r-dimensional subspace in general position. Lemma 2 shows that r + 1 entries of a vector won't match with a random r-dimensional subspace by chance. Said in other words, r + 1 entries of a vector will match with an r-dimensional subspace if and only if such entries truly come from that subspace. Lemma 2 is effectively the key to Theorem 1, as it allows us to try all combinations of r + 1 entries that fit in an r-dimensional subspace, knowing that we will never get a false match. All proofs are in Appendix A.

Lemma 1. Suppose A1 holds. Let Ω ⊆ [d]. If |Ω| ≤ dim(U^k), then x_Ω ∈ U^k_Ω for every x ∈ R^d.

Lemma 2. Suppose A1 and A2 hold. Let Ω be an arbitrary subset of [d] with exactly dim(U^k) + 1 elements. Then x_Ω ∈ U^k_Ω if and only if Ω ⊆ Ω_k.

Corollary 1 extends these results to account for noise and missing data, replacing A1 and A2 with:

A1' The coherence of U^k is upper bounded by µ, and the geodesic distance over the Grassmannian between U^k and U^ℓ is lower bounded by ϕ, for every k, ℓ = 1, ..., K.
A2' The coherence of x_{Ω_k} is upper bounded by ν, and its norm is lower bounded by ψ, for every k = 1, ..., K.

Corollary 1. Suppose A1' and A2' hold with µ, ν < Cσ and ϕ, ψ > cσ for some constants C and c. Let y_Ω and U^1, ..., U^K be given. Suppose |Ω_k ∩ Ω| > dim(U^k) for every k. Then with probability decreasing in C and increasing in c, one can identify Ω_1 ∩ Ω, ..., Ω_K ∩ Ω.

To guarantee identifiability, Corollary 1 requires that subspaces and samples are sufficiently incoherent and separated to overcome the noise level σ. Notice that since there is now missing data (we observe y_Ω instead of x, as in Theorem 1), Corollary 1 requires that there are enough observed entries per subspace, i.e., that the |Ω_k ∩ Ω| are sufficiently large (rather than the |Ω_k|). Similarly, only the observed entries can be classified, so only the intersections Ω_k ∩ Ω are identifiable (rather than Ω_k).
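The dichotomy behind Lemmas 1 and 2 is easy to check numerically. Below is a small Python/NumPy sketch (not from the paper; sizes and index sets are illustrative): any r entries of any vector fit a generic r-dimensional subspace, whereas r + 1 entries fit only when they truly come from that subspace.

```python
import numpy as np

rng = np.random.default_rng(1)
d, r = 12, 3
U1 = rng.standard_normal((d, r))          # generic r-dimensional subspace (A1)
x = np.empty(d)
x[:6] = U1[:6] @ rng.standard_normal(r)   # the first 6 entries come from U^1
x[6:] = rng.standard_normal(d - 6)        # the remaining entries do not

def fit_residual(Om):
    """Residual of projecting x_Om onto the restriction of U^1 to Om."""
    theta, *_ = np.linalg.lstsq(U1[Om], x[Om], rcond=None)
    return np.linalg.norm(x[Om] - U1[Om] @ theta)

print(fit_residual([7, 8, 10]))     # |Om| = r: ~0 even for foreign entries (Lemma 1)
print(fit_residual([0, 1, 2, 3]))   # |Om| = r+1, Om inside Omega_1: ~0 (Lemma 2)
print(fit_residual([0, 1, 2, 7]))   # |Om| = r+1 with one foreign entry: clearly nonzero
```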
3 RELATED WORK

Arguably, the closest link that subspace splitting has with other popular machine learning problems is through robust matched subspace detection (RMSD) [12-17] and maximum feasible subsystem (MAXFS) [68-74]. In RMSD one observes a vector x ∈ R^d of the form x = Uθ + z, where z ∈ R^d is a sparse vector of outlier entries, and θ ∈ R^r is the coefficient of the entries of x that match with the subspace spanned by U ∈ R^{d×r}. Given x and U, the goal is to equivalently find θ, z, or the support of z, which in turn determine the coordinates Ω where x matches U. Subspace splitting can be thought of as the generalization of RMSD to the case where there are multiple subspaces, and we want to find the entries that match each one. Similarly, in MAXFS one has an inconsistent system of equations of the form x = Uθ, and wants to determine the solution θ that produces the largest subset Ω of consistent equations. Subspace splitting can be thought of as the generalization of MAXFS to the case where there are multiple consistent subsystems (one per subspace), and we want to find each of them.

The prevalent approaches to solve these problems can be classified into three groups: (i) ℓ1 minimization, (ii) mixed integer programs, and (iii) random sampling. The first group essentially encompasses variants of minimizing ‖x − Uθ‖_1 over θ ∈ R^r, which is tightly related to the Lasso [91]. The intuition is that the ℓ1-norm will favor solutions that produce a sparse vector z = x − Uθ whose zero entries reveal Ω. While formulations like this work great for just one subspace (or subsystem), if there are two or more, then for each subspace the entries of all other subspaces are outliers, in which case z^k := x − U^k θ^k will no longer be sparse, and the solution to the ℓ1-norm minimization will no longer reveal Ω_k. The second group aims to directly recover Ω with variants of the mixed integer formulation that minimizes ∑_i z_i subject to |x − Uθ| ≤ Mz, over z ∈ {0, 1}^d and θ ∈ R^r, where M is a tuning parameter. Here the main idea is to maximize the number of zero entries in z (which reveal Ω) while forcing x to match span{U} on those entries, because the constraint guarantees that x_i = U_i θ whenever z_i = 0. Unfortunately, because of the combinatorial nature of this approach, this method becomes exponentially slow as the fraction of outliers/inconsistencies increases, rendering it prohibitive in practice even for small values of K. The third group essentially uses the same principle as random sample consensus (RANSAC) [92-98]. That is, one iteratively selects a candidate subset of entries at random, until one finds a subset on which x agrees with span{U}. This is in fact the approach that we used in the proof of Theorem 1, and the main idea behind our first SS algorithm, which we present in Section 5.1 and study in Section 6.

Remark 2. We point out that SS is also related to subspace clustering [22-40]. The main difference is that in subspace clustering all entries of each column come from the same subspace.
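For reference, the ℓ1-minimization baseline discussed above can be written as a least-absolute-deviations problem and solved as a linear program. The sketch below (Python with SciPy; not the paper's implementation, and the function name, tolerance, and LP encoding are our own illustrative choices) returns the coordinates where the residual z = x − Uθ is (near) zero; with a single subspace these reveal Ω, while with K ≥ 2 subspaces the residual is no longer sparse and this baseline breaks down, as in Figure 1(a).

```python
import numpy as np
from scipy.optimize import linprog

def rmsd_l1(x, U, tol=1e-6):
    """min_theta ||x - U theta||_1 via an LP; returns theta and the inlier coordinates."""
    d, r = U.shape
    # Variables: [theta (r), t (d)]; minimize sum(t) subject to -t <= x - U theta <= t
    c = np.concatenate([np.zeros(r), np.ones(d)])
    A_ub = np.block([[ U, -np.eye(d)],      #  U theta - t <=  x
                     [-U, -np.eye(d)]])     # -U theta - t <= -x
    b_ub = np.concatenate([x, -x])
    res = linprog(c, A_ub=A_ub, b_ub=b_ub,
                  bounds=[(None, None)] * r + [(0, None)] * d)
    theta = res.x[:r]
    inliers = np.where(np.abs(x - U @ theta) <= tol)[0]
    return theta, inliers
```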
4 MOTIVATING APPLICATIONS

This section details the main motivations behind SS, namely metagenomics, recommender systems, and robust estimation, relevant to dictionary learning, matrix completion, and related problems.

Metagenomics. Microbial communities affect the balance of entire ecosystems, and play a crucial role in the planet's health [42]. Consequently, understanding microbiome compositions, interactions, and evolution holds the key to understanding the deep interconnectivity factors that shape Earth's biodiversity, our agricultural sustainability, and our adaptation to global threats like climate change. One cornerstone tool for microbiome understanding is genomic analysis. In practice, to obtain an organism's genome (e.g., a person's genome), biologists feed a DNA sample (e.g., blood or hair) to a sequencer machine that produces a series of reads, which are short genomic sequences that can later be assembled and aligned to recover the entire genome. The challenge arises when the sequencer is provided a sample with DNA from multiple organisms, as is the case in any microbiome (e.g., the human gut microbiome), where any sample will contain a mixture of DNA from multiple taxa that cannot be trivially classified [43]. In this case, each read produced by the sequencer may correspond to a different taxon, resulting in a DNA sample with a mixture of genes. Existing approaches depend on the correct classification of taxonomic units, which in turn relies on the existence of reference tables and often requires human intervention [44]. However, reference sequences remain unknown for the vast majority of microbial biodiversity, which in turn precludes holistic metagenomic analyses, as scientists often have to discard unidentified data that cannot currently be categorized with existing reference tables [45]. In contrast, our work pioneers the theoretical foundations of a novel model tailored to automatically classify taxonomic units without any human intervention. Moreover, our model naturally allows for missing data, and hence is robust to taxa with length-varying sequences and fast-occurring mutations, which are in fact quite common in several microorganisms such as certain types of bacteria [46].

Recommender Systems and Image Inpainting. In recommender systems like Amazon or Netflix, one aims to infer users' preferences in order to make good recommendations [47]. Arguably the most renowned model for this task is low-rank matrix completion [48-58], which assumes that each user's preferences vector can be explained as a linear combination of a few others. Said in other words, preferences lie in a linear subspace. However, often multiple types of users share the same account. In this case the vector of preferences will contain a mixture of entries from multiple subspaces. It is easy to see (details in the proof of Theorem 1 in Appendix A) that the coefficient corresponding to the entries in the kth subspace is given by θ^k = (U^k_{Ω'_k})^{-1} x_{Ω'_k}, where Ω'_k is any subset of Ω_k with exactly dim(U^k) elements. This insight can be used to estimate the values corresponding to the kth subspace that would appear in the locations where there are observations from other subspaces, or where data is missing. For example, in recommender systems this allows one to infer the preferences of all users sharing an account. In image inpainting this allows one to estimate the missing, corrupted, or occluded pixels in an image [60-67].
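The imputation step just described is a one-line computation once some entries have been attributed to a subspace. Below is a small Python/NumPy sketch (not from the paper; the function name and arguments are illustrative) that estimates θ^k from the identified entries via least squares and fills in the remaining coordinates of the kth subspace's column.

```python
import numpy as np

def complete_from_block(U_k, y, Omega_k):
    """Estimate theta^k from entries attributed to the k-th subspace and impute the rest.

    U_k     : (d, r) basis of the k-th subspace
    y       : (d,) observed mixed-features vector (other entries may be missing/foreign)
    Omega_k : indices already attributed to the k-th subspace (at least r of them)
    """
    theta_k, *_ = np.linalg.lstsq(U_k[Omega_k], y[Omega_k], rcond=None)
    return U_k @ theta_k   # the k-th subspace's values on all d coordinates
```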
Robust Learning. In Section 3 we showed that SS is a generalization of RMSD and MAXFS, both of which are crucial subtasks in many important problems, including robust PCA [3-11], dictionary learning [60-67], matrix completion [48-58], and more. For example, in subspace tracking one needs to identify outliers in each new vector (that is, perform RMSD or MAXFS) before updating the underlying subspace [18-20]. Or in subspace clustering one needs to identify outliers in each vector before finding sparse representations [26, 28, 29, 32, 39]. These in turn are core techniques in target localization [76], medical imaging [79], communications [80], anomaly detection [81, 82], hyperspectral imaging [16], and network estimation [84-89], among many others [78]. Being a generalization of RMSD and MAXFS, subspace splitting is also applicable in these broader problems, bringing possibilities for improved performance. Moreover, as these problems evolve into more sophisticated models that allow mixed-features vectors (such as mixture matrix completion [59]), subspace splitting will gain importance and become a crucial subroutine of these emerging problems.

5 SUBSPACE SPLITTING ALGORITHMS

We now present our three subspace splitting algorithms: a random sampling method, an iterative projection-based greedy heuristic, and an alternating approach inspired by Lloyd's algorithm [75].

5.1 RANDOM SAMPLING SPLITTING

As Theorem 1 suggests, for each k we can iteratively try sets Ω'_k with dim(U^k) + 1 entries selected at random until we find one such set where y_{Ω'_k} lies close to U^k_{Ω'_k} (within the noise level σ). At that point we can estimate θ^k = (U^k_{Ω'_k})^{-1} y_{Ω'_k}, and subsequently Ω_k as the entries of y_Ω − U^k_Ω θ^k that are within σ. We call this approach random sampling splitting (RANSAS). The main advantage of RANSAS is that, as a consequence of Theorem 1, it is guaranteed to work after enough iterations. To see this, observe that if the entries of x are uniformly distributed across subspaces and outliers, then the probability of randomly selecting r + 1 entries from the kth subspace is 1/(K+1)^{r+1}. Consequently, the probability that r + 1 random entries come from the same subspace is K/(K+1)^{r+1}. A simple Chernoff bound shows that the expected number of iterations before this happens is O(K^r). Since we want this to happen K times, we conclude that the expected number of iterations is upper bounded by O(K^{r+1}). We summarize this in the next theorem.

Theorem 2. Let A1 and A2 hold. Suppose |Ω_k| > dim(U^k) for every k. Suppose P(i ∈ Ω_k) = 1/(K+1) independently for every i, k. Then the expected number of iterations before RANSAS identifies Ω_1, ..., Ω_K is lower and upper bounded by O(K^{min_k dim(U^k)+1}) and O(K^{max_k dim(U^k)+1}), respectively.

We point out that a scenario where all entries are equally distributed is in fact the most difficult setting. Otherwise, identifying a subspace that receives entries with probability p_k > 1/(K+1) takes only O((1/p_k)^{r+1}) < O((K+1)^{r+1}) iterations. The main downside of this approach is that its computational complexity scales exponentially. So if K and r are small, or there is one dominant subspace, RANSAS may be a good idea. However, as we see in our experiments, this approach quickly becomes computationally prohibitive as K and r grow.
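The following is a minimal Python/NumPy sketch of the random sampling idea just described (not the paper's implementation, whose code is available at [100]); the function name, the acceptance threshold, and the iteration cap are illustrative. It samples dim(U^k) + 1 coordinates at a time, accepts when the projection residual is within the noise level, labels all entries consistent with the recovered coefficients, and prunes them before moving on.

```python
import numpy as np

def ransas(y, Om, bases, sigma=0.0, max_iter=100_000, rng=None):
    """Random sampling splitting sketch: returns a dict of estimated Omega_k's."""
    rng = rng or np.random.default_rng()
    remaining = np.array(sorted(Om))
    labels = {}
    for k, U in enumerate(bases):
        r = U.shape[1]
        for _ in range(max_iter):
            if remaining.size < r + 1:
                break
            S = rng.choice(remaining, size=r + 1, replace=False)
            theta, *_ = np.linalg.lstsq(U[S], y[S], rcond=None)
            if np.linalg.norm(y[S] - U[S] @ theta) <= np.sqrt(r + 1) * sigma + 1e-9:
                resid = np.abs(y[remaining] - U[remaining] @ theta)
                labels[k] = remaining[resid <= sigma + 1e-9]
                remaining = remaining[resid > sigma + 1e-9]   # prune this subspace's entries
                break
    return labels
```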
5.2 GREEDY SPLITTING

The main idea of our greedy splitting (GREEDYS) heuristic is to iteratively remove the most disruptive entry from each subspace. More precisely, for each k we start with Ω^0_k = [d], and for each t > 0 we define Ω^t_k := Ω^{t-1}_k \ {i^{t-1}}, where i^{t-1} is the entry whose removal from Ω^{t-1}_k minimizes the difference between y_{Ω^t_k} and its projection onto U^k_{Ω^t_k}. The intuition is that removing i^{t-1} brings y_{Ω^t_k} closer to U^k_{Ω^t_k}. We repeat this procedure for each k until y_{Ω^t_k} is close to U^k_{Ω^t_k} (within the noise level σ). We know from Lemma 1 that this procedure will terminate at some point, because any subset of r or fewer entries of any vector will always match with any r-dimensional subspace in general position. If for some k this procedure terminates with a set Ω^t_k with more than r_k = dim(U^k) entries, Theorem 1 suggests that y_{Ω^t_k} truly lies in U^k_{Ω^t_k}, that Ω^t_k ⊆ Ω_k, that we can estimate θ^k = (U^k_{Ω^t_k})^{-1} y_{Ω^t_k}, and that we can identify Ω_k as the entries of y_Ω − U^k_Ω θ^k that are within σ. At this point we can prune all the entries corresponding to this subspace, and repeat the greedy procedure from the start without these entries. If at some point none of the remaining sets Ω^t_k has more than r_k entries, we know by Lemma 1 that we cannot determine whether y_{Ω^t_k} truly lies in U^k_{Ω^t_k} with these entries alone, so we stop and declare a failure.

Notice that for each k, GREEDYS removes the most disruptive entry among all d entries, then the next most disruptive among the remaining d − 1, until either all the remaining entries lie in the same subspace, or there remain no more than r_k entries. This requires no more than d + (d − 1) + (d − 2) + ... + (r_k + 1) = O(d^2) iterations. After identifying the entries of a subspace, if pruning is necessary, we potentially need to repeat the iterative removal process for the remaining K − 1 subspaces, and so on. In total, we potentially need to repeat the iterative removal process K + (K − 1) + (K − 2) + ... + 1 = O(K^2) times. Putting these two observations together, we obtain the following theorem, showing that GREEDYS has polynomial computational complexity.

Theorem 3. GREEDYS will terminate after no more than O(K^2 d^2) iterations.

The main caveat of this greedy approach, as the next example shows, is that even in a noiseless setting, the most disruptive entry for U^k (that is, the entry whose removal minimizes the distance between y_{Ω^t_k} and its projection onto U^k_{Ω^t_k}) may in fact correspond to U^k.

Example 3. Consider the following construction, where the first two entries of x correspond to U^1, the last two correspond to U^2, there are no outliers (Ω_0 = ∅), there is no noise (ϵ = 0), and there is no missing data (Ω = [d]), so that we observe all entries of y = x (the bases and the vector are displayed in the original). Pay special attention to U^1 and y. Let Ω_{-i} := [d] \ {i} denote the set of all coordinates except the ith, and let ŷ^k_{Ω_{-i}} := U^k_{Ω_{-i}} (U^{kT}_{Ω_{-i}} U^k_{Ω_{-i}})^{-1} U^{kT}_{Ω_{-i}} y_{Ω_{-i}} denote the projection of y_{Ω_{-i}} onto U^k_{Ω_{-i}}; for example, here ŷ^1_{Ω_{-1}} = [2.2759, 3.0345, ...]^T. Proceeding according to our greedy heuristic, we can compute the projection residuals when removing each entry:

‖y_{Ω_{-1}} − ŷ^1_{Ω_{-1}}‖ = 2.7038,  ‖y_{Ω_{-2}} − ŷ^1_{Ω_{-2}}‖ = 2.7038,  ‖y_{Ω_{-3}} − ŷ^1_{Ω_{-3}}‖ = 3.4641,  ‖y_{Ω_{-4}} − ŷ^1_{Ω_{-4}}‖ = 2.7440.

Here we see that removing either of the first two entries makes y appear closest to U^1, even though the first two entries in fact correspond to U^1. We thus conclude that while our greedy heuristic works well in practice, it has no theoretical guarantee of success.
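For concreteness, here is a small Python/NumPy sketch of the greedy removal loop for a single subspace (not the paper's implementation; the function names and the acceptance threshold are illustrative). It repeatedly drops the entry whose removal most reduces the projection residual, stopping once the residual is within the noise level or only r entries remain.

```python
import numpy as np

def proj_residual(U, y, Om):
    """Norm of y_Om minus its projection onto the restriction of U to Om."""
    theta, *_ = np.linalg.lstsq(U[Om], y[Om], rcond=None)
    return np.linalg.norm(y[Om] - U[Om] @ theta)

def greedys_one(U, y, Om, sigma=0.0):
    """Greedily shrink Om until y_Om (nearly) lies in U_Om; may end with |Om| <= r."""
    Om = list(Om)
    r = U.shape[1]
    while proj_residual(U, y, Om) > np.sqrt(len(Om)) * sigma + 1e-9 and len(Om) > r:
        # remove the "most disruptive" entry: the one whose removal helps the most
        best = min(range(len(Om)),
                   key=lambda i: proj_residual(U, y, Om[:i] + Om[i + 1:]))
        Om.pop(best)
    return Om
```

As discussed above, each call performs at most O(d^2) projections, and Example 3 shows the selection rule can discard entries that actually belong to the subspace.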
5.3 K-SPLITS

As the name suggests, our last algorithm, K-SPLITS, is inspired by the K-means algorithm. The main idea is to initialize by randomly partitioning the entries of y_Ω into Ω̂_0, Ω̂_1, ..., Ω̂_K, and then alternate until convergence between (i) estimating coefficients θ̂^1, ..., θ̂^K in light of the assignment of entries, and (ii) reassigning entries according to these coefficients. More precisely, at each step (after pruning when possible) we:

(i) estimate coefficients θ̂^k = (U^k_{Ω̂_k})^{-1} y_{Ω̂_k};
(ii) compute ŷ^k := U^k θ̂^k, and assign entries as Ω̂_k = {i ∈ [d] : |y_i − ŷ^k_i| ≤ τ(σ)}, where τ(σ) is a tuning threshold parameter that depends on the noise level.

The assignment step is done in a cyclic manner: in every iteration, each cluster is assigned the entry that agrees the most with it, and receives no new entries until the next iteration. This replaces the isotropic assumption of K-means with the assumption that each cluster has the same number of entries. As mentioned in Section 5.1, this is the most difficult setting for this problem; whenever the assumption is not fulfilled, the algorithm will still be able to complete at least the most dominant subspace, even if entries of that subspace initially contaminate other subspaces, and from the next iteration onward that will no longer be the case. Thanks to this, the assumption that every subspace has the same number of entries does not affect the effectiveness of the algorithm. As we will see in our experiments, this method outperforms the rest. Its main downside, however, is that like most alternating methods, it suffers from local minima and depends heavily on initialization. To mitigate the former, in our experiments we run several independent random initializations. We observe that in practice this has excellent performance. We conjecture that this is because such a strategy will at some point spread the centers θ^k enough, in light of the subspaces. This sort of spread is crucial in other related clustering algorithms, such as K-means [75]. Our future work will further investigate this, along with convergence, complexity, and initialization strategies (such as parallels of K-means++ [99]), which require careful analysis and are out of the scope of this paper.
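The core alternation is easy to sketch. The following Python/NumPy snippet (not the paper's implementation; the cyclic assignment and pruning refinements described above are omitted, and the function name, threshold tau, and iteration cap are illustrative) alternates between estimating coefficients from the current assignment and reassigning entries within the threshold.

```python
import numpy as np

def k_splits(y, Om, bases, tau, n_iter=50, rng=None):
    """K-SPLITS sketch: alternate coefficient estimation and entry reassignment."""
    rng = rng or np.random.default_rng()
    Om = np.array(sorted(Om))
    K = len(bases)
    assign = rng.integers(0, K + 1, size=Om.size)      # 0 = outlier bucket
    for _ in range(n_iter):
        new_assign = np.zeros_like(assign)             # default: outlier
        for k, U in enumerate(bases, start=1):
            idx = Om[assign == k]
            if idx.size <= U.shape[1]:                 # not enough entries to fit
                continue
            theta, *_ = np.linalg.lstsq(U[idx], y[idx], rcond=None)
            close = np.abs(y[Om] - U[Om] @ theta) <= tau
            new_assign[close] = k                      # reassign entries within tau
        if np.array_equal(new_assign, assign):
            break                                      # converged
        assign = new_assign
    return {k: Om[assign == k] for k in range(K + 1)}
```

In practice one would run this from several random initializations and keep the result with the smallest total residual, as discussed above.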
6 EXPERIMENTS

In this section we study the performance of our three methods above, as a function of the number of subspaces K, the number of entries per subspace |Ω_k| (which in turn is a proxy for the ambient dimension d), and the subspaces' dimension r. We measure error as the fraction of misclassified entries. In the absence of other existing subspace splitting methods, for reference we compare them against the ℓ1 minimization described in Section 3. Our results show that our SS methods perform as well as existing methods in the single-subspace case, and succeed in the presence of multiple subspaces, where existing methods fail dramatically. In the interest of reproducibility, all our code is available at [100].

In our experiments we first generate matrices U^1, ..., U^K ∈ R^{d×r} with i.i.d. N(0, 1) entries, to use as bases of U^1, ..., U^K, with d = (K + 1)|Ω_k|. Similarly, we generate θ^1, ..., θ^K ∈ R^r, also with i.i.d. N(0, 1) entries, to use as coefficient vectors. Next we create the partition of entries {Ω_0, Ω_1, ..., Ω_K}, where each Ω_k has the exact same number of elements. As discussed in Section 5.1, this is in fact the most difficult setting; otherwise the dominant subspace is easier to find, and one can simply prune its corresponding entries to simplify the problem. Then we create x as the vector such that x_{Ω_k} = U^k_{Ω_k} θ^k for each k, and the entries of x_{Ω_0} are i.i.d. N(0, 1), representing outliers. Finally, we generate ϵ ∈ R^d with i.i.d. N(0, σ^2) entries and Ω ⊆ [d] with a fraction p of observed entries, to construct our observed data y_Ω = x_Ω + ϵ_Ω.

In our first experiment we study the behavior of our algorithms as a function of K, with r = 5, |Ω_k| = 40, p = 1/2 (so that on average 20 observed entries correspond to each subspace), and σ = 0, i.e., no noise. Figure 1 (a) shows that even for K = 2 (the smallest value of K), existing methodology is unable to split y. This is not surprising, as existing methods are meant for settings where most of the entries of y agree with a single subspace and there are only a few outliers. In our setting, for each subspace, the entries of all other subspaces are outliers. Consequently, a fraction 1 − 1/K of the entries (at least half with K = 2, and a dominant majority as K increases) are outliers for each subspace, which is well-documented to make existing methods fail (see the discussion in Section 3 for details). In contrast, GREEDYS is better with K = 6 (which amounts to 83.3% outliers) than ℓ1 is with K = 2, while K-SPLITS and RANSAS maintain a 100% success rate for every K we tried. Nonetheless, increasing K comes at a price (time), as seen in Figure 1 (a'), which shows how the number of iterations of each algorithm grows with K, relative to the simplest case (K = 2). Notice that the speed of K-SPLITS decays nicely with K. For instance, with K = 6, K-SPLITS degrades an order of magnitude slower than RANSAS, while still maintaining the same 100% success rate.

In our second experiment we study the behavior of our algorithms as a function of |Ω_k|, with r = 5, p = 1/2, σ = 0, and K = 2 fixed. Figures 1 (b) and (b') show that as |Ω_k| grows, the performance of all methods improves: ℓ1 and GREEDYS achieve higher success rates, while RANSAS and K-SPLITS become faster and maintain the same 100% success rate.

Figure 1: Average success rate and iteration growth over 100 trials, as a function of the number of subspaces K, the number of entries per subspace |Ω_k| (which in turn is a proxy for the ambient dimension d), the subspaces' dimension r, the gap γ := |Ω_k| − (r + 1), and the noise level σ. Panel settings: (a, a') r = 5, |Ω_k| = 40, σ = 0; (b, b') K = 2, r = 5, σ = 0; (c, c') K = 2, |Ω_k| = 40, σ = 0; (d, d') K = 2, γ = 20, σ = 0; (e, e') K = 2, r = 5, |Ω_k| = 40.

In our third experiment we study performance as a function of r, with p = 1/2, σ = 0, and K = 2 fixed. Recall from Lemma 1 that as r increases, so does the number of entries per subspace required for identifiability (r + 1). Hence, to allow a wide range of r, we fixed |Ω_k| = 40. Figures 1 (c) and (c') show that (conversely to our previous experiment) the performance of all methods declines as r grows: ℓ1 and GREEDYS achieve lower success rates, while RANSAS and K-SPLITS take more time (though they maintain the same 100% success rate).
Our next experiment is consistent with the intuition that cases with less data (small |Ω_k|) relative to the problem's complexity (the dimension r) are harder. Here we study performance as a function of r, fixing p = 1/2, σ = 0, K = 2, and the gap γ := p(|Ω_k| − (r + 1)) = 20. This setting represents the case where we always have a few more (20) observed entries per subspace than strictly required (r + 1) for splitting. This is arguably a better experiment to test the effect of the subspaces' dimension in isolation from the sampling rate. Figures 1 (d) and (d') now show that r slightly affects GREEDYS and K-SPLITS (GREEDYS in terms of accuracy, and K-SPLITS in terms of time), but dramatically affects RANSAS, causing an increase in iterations four orders of magnitude higher than the rest. This is consistent with our discussion in Section 5.1, showing that RANSAS will always succeed given enough time, which grows exponentially in r. Interestingly, K-SPLITS achieves the same 100% performance, but orders of magnitude faster, suggesting that K-SPLITS is in general the most promising SS algorithm.

In our last experiment we study the accuracy of our algorithms as a function of the noise level σ, with r = 5, p = 1/2, |Ω_k| = 40, and K = 2 fixed. Consistent with Corollary 1, Figure 1 (e) shows that the performance of our algorithms decays gracefully with noise, with K-SPLITS tolerating noise up to σ = 10^-3. The price, however, can be seen in execution time (Figure 1 (e')), which grows up to the point where the algorithms' accuracy starts to decay, after which their execution time also starts decreasing.

We point out that the fraction of missing data and outliers only affects results and performance in the sense that it reduces the number of available observed entries per subspace, so their effects are also captured in the experiments on |Ω_k|. For example, a case where there are |Ω_k| = 40 samples per subspace but only half of the entries are observed (p = 1/2) is equivalent to observing |Ω_k| = 20 samples per subspace with no missing data (p = 1). Appendix B shows additional results with K = 4, where our methods show even more dramatic improvements.

BROADER IMPACT

This paper introduces a new dimensionality reduction model, and a series of methods to infer such a model, with broad applicability ranging from metagenomics to recommender systems. From a pragmatic standpoint, practitioners can use this research to better predict drug-target interactions, analyze the composition and dynamics of soil or human microbiomes, and more. Ultimately, this can impact the well-being of society at many levels, improving drugs, medical diagnoses, treatments, agricultural sustainability, etc. From a learning perspective, we hope this work spurs discussions and insights, and motivates the data science community to explore new directions that stem from this initial work (for example near-optimal initialization strategies for our K-SPLITS algorithm, similar to K-means++), leading to better methods and understanding of subspace splitting.

REFERENCES

[1] Edd Wilder-James, What is big data? An introduction to the big data landscape, available at: https://www.oreilly.com/ideas/what-is-big-data, 2018.
[2] SKA Signs Big Data Cooperation Agreement With CERN, available at: https://www.skatelescope.org/news/ska-signs-big-data-cooperation-agreement-cern/, 2018.
[3] X. Yuan and J. Yang, Sparse and low-rank matrix decomposition via alternating direction methods, available at http://www.optimization-online.org/DB_HTML/2009/11/2447.html, 2009.
[4] Z. Lin, A. Ganesh, J. Wright, L. Wu, M. Chen and Y. Ma, Fast convex optimization algorithms for exact recovery of a corrupted low-rank matrix, Computational Advances in Multi-Sensor Adaptive Processing, 2009.
[5] X. Ding, L. He and L. Carin, Bayesian robust principal component analysis, IEEE Transactions on Image Processing, 2011.
[6] E. Candès, X. Li, Y. Ma and J. Wright, Robust principal component analysis?, Journal of the ACM, 2011.
[7] T. Bouwmans and E. Zahzah, Robust PCA via principal component pursuit: a review for a comparative evaluation in video surveillance, Computer Vision and Image Understanding, 2014.
[8] Y. Ma, Low-rank matrix recovery and completion via convex optimization, available at http://perception.csl.illinois.edu/matrix-rank/home.html.
[9] T. Bouwmans, A. Sobral, S. Javed, S. Jung and E. Zahzah, Decomposition into low-rank plus additive matrices for background/foreground separation: A review for a comparative evaluation with a large-scale dataset, Computer Science Review, 2016.
[10] D. Pimentel-Alarcón and R. Nowak, Random consensus robust PCA, Electronic Journal of Statistics, 2017.
[11] M. C. Tsakiris and R. Vidal, Dual principal component pursuit, Journal of Machine Learning Research, 2018.
[12] L. Scharf and B. Friedlander, Matched subspace detectors, IEEE Transactions on Signal Processing, 1994.
[13] S. Kraut, L. Scharf and L. McWhorter, Adaptive subspace detectors, IEEE Transactions on Signal Processing, 2001.
[14] M. Desai and R. Mangoubi, Robust Gaussian and non-Gaussian matched subspace detection, IEEE Transactions on Signal Processing, 2003.
[15] O. Besson, L. Scharf and F. Vincent, Matched direction detectors and estimators for array processing with subspace steering vector uncertainties, IEEE Transactions on Signal Processing, 2005.
[16] H. Kwon and N. Nasrabadi, Kernel matched subspace detectors for hyperspectral target detection, IEEE Transactions on Pattern Analysis and Machine Intelligence, 2006.
[17] L. Balzano, B. Recht and R. Nowak, High-dimensional matched subspace detection when data are missing, IEEE International Symposium on Information Theory, 2010.
[18] L. Balzano, R. Nowak and B. Recht, Online identification and tracking of subspaces from highly incomplete information, Allerton Conference on Communication, Control and Computing, 2010.
[19] J. He, L. Balzano and A. Szlam, Incremental gradient on the Grassmannian for online foreground and background separation in subsampled video, IEEE Conference on Computer Vision and Pattern Recognition, 2012.
[20] H. Mansour and X. Jiang, A robust online subspace estimation and tracking algorithm, IEEE International Conference on Acoustics, Speech, and Signal Processing, 2015.
[21] D. Pimentel-Alarcón, N. Boston and R. Nowak, Deterministic conditions for subspace identifiability from incomplete sampling, IEEE International Symposium on Information Theory, 2015.
[22] K. Kanatani, Motion segmentation by subspace separation and model selection, IEEE International Conference on Computer Vision, 2001.
[23] B. Eriksson, L. Balzano and R. Nowak, High-rank matrix completion and subspace clustering with missing data, Artificial Intelligence and Statistics, 2012.
[24] L. Balzano, A. Szlam, B. Recht and R. Nowak, K-subspaces with missing data, IEEE Statistical Signal Processing, 2012.
[25] G. Liu, Z. Lin, S. Yan, J. Sun, Y. Yu and Y. Ma, Robust recovery of subspace structures by low-rank representation, IEEE Transactions on Pattern Analysis and Machine Intelligence, 2012.
[26] E. Elhamifar and R. Vidal, Sparse subspace clustering: algorithm, theory, and applications, IEEE Transactions on Pattern Analysis and Machine Intelligence, 2013.
[27] D. Pimentel-Alarcón, L. Balzano and R. Nowak, On the sample complexity of subspace clustering with missing data, IEEE Statistical Signal Processing, 2014.
[28] C. Yang, D. Robinson and R. Vidal, Sparse subspace clustering with missing entries, International Conference on Machine Learning, 2015.
[29] D. Pimentel-Alarcón, L. Balzano, R. Marcia, R. Nowak and R. Willett, Group-sparse subspace clustering with missing data, IEEE Statistical Signal Processing, 2016.
[30] D. Pimentel-Alarcón and R. Nowak, The information-theoretic requirements of subspace clustering with missing data, International Conference on Machine Learning, 2016.
[31] D. Pimentel-Alarcón, L. Balzano and R. Nowak, Necessary and sufficient conditions for sketched subspace clustering, Allerton Conference on Communication, Control and Computing, 2016.
[32] Y. Chong, D. Robinson and R. Vidal, Provable self-representation based outlier detection in a union of subspaces, IEEE Conference on Computer Vision and Pattern Recognition, 2017.
[33] M. Ashraphijuo, X. Wang and V. Aggarwal, A characterization of sampling patterns for low-rank multi-view data completion problem, IEEE International Symposium on Information Theory, 2017.
[34] M. Ashraphijuo, V. Aggarwal and X. Wang, A characterization of sampling patterns for low-Tucker-rank tensor completion problem, IEEE International Symposium on Information Theory, 2017.
[35] M. Ashraphijuo and X. Wang, Fundamental conditions for low-CP-rank tensor completion, Journal of Machine Learning Research, 2017.
[36] M. Ashraphijuo, V. Aggarwal and X. Wang, Deterministic and probabilistic conditions for finite completability of low-Tucker-rank tensor, IEEE Transactions on Information Theory, 2019.
[37] M. Ashraphijuo and X. Wang, Clustering a union of low-rank subspaces of different dimensions with missing data, Pattern Recognition Letters, 2019.
[38] M. Ashraphijuo and X. Wang, Fundamental conditions on the sampling pattern for union of low-rank subspaces retrieval, Annals of Mathematics and Artificial Intelligence, 2019.
[39] E. Elhamifar, High-rank matrix completion and clustering under self-expressive models, Advances in Neural Information Processing Systems, 2016.
[40] G. Ongie, R. Willett, R. Nowak and L. Balzano, Algebraic variety models for high-rank matrix completion, International Conference on Machine Learning, 2017.
[41] G. Ongie, D. Pimentel, L. Balzano, R. Willett and R. Nowak, Low algebraic dimension matrix completion, Allerton Conference on Communication, Control and Computing, 2017.
[42] S. Highlander, High throughput sequencing methods for microbiome profiling: application to food animal systems, Animal Health Research Reviews, 2012.
[43] S. Mande, M. Mohammed and T. Ghosh, Classification of metagenomic sequences: methods and challenges, Briefings in Bioinformatics, 2012.
[44] R. Ranjan, A. Rani, A. Metwally, H. McGee and D. Perkins, Analysis of the microbiome: Advantages of whole genome shotgun versus 16S amplicon sequencing, Biochemical and Biophysical Research Communications, 2016.
[45] N. Nguyen, T. Warnow, M. Pop and B. White, A perspective on 16S rRNA operational taxonomic unit clustering using sequence similarity, Biofilms and Microbiomes, 2016.
[46] G. Marçais, A. Delcher, A. Phillippy, R. Coston, S. Salzberg and A. Zimin, MUMmer4: A fast and versatile genome alignment system, PLoS Computational Biology, 2018.
[47] Netflix, Inc., The Netflix Prize, http://www.netflixprize.com.
[48] E. Candès and B. Recht, Exact matrix completion via convex optimization, Foundations of Computational Mathematics, 2009.
[49] E. Candès and T. Tao, The power of convex relaxation: near-optimal matrix completion, IEEE Transactions on Information Theory, 2010.
[50] B. Recht, A simpler approach to matrix completion, Journal of Machine Learning Research, 2011.
[51] Z. Lin, R. Liu and Z. Su, Linearized alternating direction method with adaptive penalty for low-rank representation, Advances in Neural Information Processing Systems, 2011.
[52] Z. Wen, W. Yin and Y. Zhang, Solving a low-rank factorization model for matrix completion by a non-linear successive over-relaxation algorithm, Mathematical Programming Computation, 2012.
[53] P. Jain, P. Netrapalli and S. Sanghavi, Low-rank matrix completion using alternating minimization, ACM Symposium on Theory of Computing, 2013.
[54] Y. Chen, S. Bhojanapalli, S. Sanghavi and R. Ward, Coherent matrix completion, International Conference on Machine Learning, 2014.
[55] Y. Chen, Incoherence-optimal matrix completion, IEEE Transactions on Information Theory, 2015.
[56] F. Király, L. Theran and R. Tomioka, The algebraic combinatorial approach for low-rank matrix completion, Journal of Machine Learning Research, 2015.
[57] D. Pimentel-Alarcón, N. Boston and R. Nowak, A characterization of deterministic sampling patterns for low-rank matrix completion, IEEE Journal of Selected Topics in Signal Processing, 2016.
[58] D. Pimentel-Alarcón and R. Nowak, A converse to low-rank matrix completion, IEEE International Symposium on Information Theory, 2016.
[59] D. Pimentel-Alarcón, Mixture matrix completion, Advances in Neural Information Processing Systems, 2018.
[60] M. Aharon, M. Elad and A. Bruckstein, K-SVD: An algorithm for designing overcomplete dictionaries for sparse representation, IEEE Transactions on Signal Processing, 2006.
[61] J. Mairal, F. Bach, J. Ponce and G. Sapiro, Online dictionary learning for sparse coding, International Conference on Machine Learning, 2009.
[62] C. Zhao, X. Wang and W. Cham, Background subtraction via robust dictionary learning, EURASIP Journal on Image and Video Processing, 2011.
[63] C. Lu, J. Shi and J. Jia, Online robust dictionary learning, IEEE Conference on Computer Vision and Pattern Recognition, 2013.
[64] A. Iqbal and A. Seghouane, An α-divergence-based approach for robust dictionary learning, IEEE Transactions on Image Processing, 2019.
[65] Y. Xu, Z. Li, C. Tian and J. Yang, Multiple vector representations of images and robust dictionary learning, Pattern Recognition Letters, 2019.
[66] S. Mahdizadehaghdam, A. Panahi, H. Krim and L. Dai, Deep dictionary learning: A parametric network approach, IEEE Transactions on Image Processing, 2019.
[67] J. Ren, Z. Zhang, S. Li, Y. Wang, G. Liu, S. Yan and M. Wang, Learning hybrid representation by robust dictionary learning in factorized compressed space, IEEE Transactions on Image Processing, 2020.
[68] Y. Zheng, S. Sugimoto and M. Okutomi, Deterministically maximizing feasible subsystem for robust model fitting with unit norm constraint, IEEE Conference on Computer Vision and Pattern Recognition, 2011.
[69] P. Purkait, C. Zach and A. Eriksson, Maximum consensus parameter estimation by reweighted ℓ1 methods, International Workshop on Energy Minimization Methods in Computer Vision and Pattern Recognition, Springer, Cham, 2017.
[70] T.-J. Chin, P. Purkait, A. Eriksson and D. Suter, Efficient globally optimal consensus maximisation with tree search, IEEE Conference on Computer Vision and Pattern Recognition, 2015.
[71] T.-J. Chin, Y. Heng Kee, A. Eriksson and F. Neumann, Guaranteed outlier removal with mixed integer linear programs, IEEE Conference on Computer Vision and Pattern Recognition, 2016.
[72] P. Speciale, D. Pani, M. Oswald, T. Kroeger, L. Van Gool and M. Pollefeys, Consensus maximization with linear matrix inequality constraints, IEEE Conference on Computer Vision and Pattern Recognition, 2017.
[73] Z. Cai, T.-J. Chin, H. Le and D. Suter, Deterministic consensus maximization with biconvex programming, arXiv preprint, 2018.
[74] C. Yu and D.Y. Ju, A maximum feasible subsystem for globally optimal 3D point cloud registration, Sensors, 2018.
[75] T. Kanungo, D. Mount, N. Netanyahu, C. Piatko, R. Silverman and A. Wu, An efficient k-means clustering algorithm: Analysis and implementation, IEEE Transactions on Pattern Analysis and Machine Intelligence, 2002.
[76] S. Shahbazpanahi, S. Valaee and M. Bastani, Distributed source localization using ESPRIT algorithm, IEEE Transactions on Signal Processing, 2001.
[77] M. Rangaswamy, F. Lin and K. Gerlach, Robust adaptive signal processing methods for heterogeneous radar clutter scenarios, Signal Processing, 2004.
[78] H. Krim and M. Viberg, Two decades of array signal processing research: the parametric approach, IEEE Signal Processing Magazine, 1996.
[79] B. Ardekani, J. Kershaw, K. Kashikura and I. Kanno, Activation detection in functional MRI using subspace modeling and maximum likelihood estimation, IEEE Transactions on Medical Imaging, 1999.
[80] M. McCloud and L. Scharf, Interference estimation with applications to blind multiple-access communication over fading channels, IEEE Transactions on Information Theory, 2000.
[81] D. Stein, S. Beaven, L. Hoff, E. Winter, A. Schaum and A. Stocker, Anomaly detection from hyperspectral imagery, IEEE Signal Processing Magazine, 2002.
[82] T. Ahmed, M. Coates and A. Lakhina, Multivariate online anomaly detection using kernel recursive least squares, IEEE INFOCOM, 2007.
[83] B. Eriksson, P. Barford, J. Sommers and R. Nowak, DomainImpute: inferring unseen components in the Internet, IEEE INFOCOM, 2011.
[84] R. Govindan and H. Tangmunarunkit, Heuristics for Internet map discovery, IEEE INFOCOM, 2000.
[85] P. Barford, A. Bestavros, J. Byers and M. Crovella, On the marginal utility of network topology measurements, ACM Internet Measurement Workshop, 2001.
[86] N. Spring, R. Mahajan, D. Wetherall and T. Anderson, Measuring ISP topologies with Rocketfuel, IEEE/ACM Transactions on Networking, 2004.
[87] D. Alderson, L. Li, W. Willinger and J. Doyle, Understanding Internet topology: Principles, models and validation, IEEE/ACM Transactions on Networking, 2005.
[88] R. Sherwood, A. Bender and N. Spring, DisCarte: A disjunctive Internet cartographer, ACM SIGCOMM, 2008.
[89] B. Eriksson, P. Barford and R. Nowak, Network discovery from passive measurements, ACM SIGCOMM, 2008.
[90] E. Learned-Miller, M. Narayana and A. Hanson, Coherent motion segmentation in moving camera videos using optical flow orientations, International Conference on Computer Vision, 2013.
[91] R. Tibshirani, Regression shrinkage and selection via the Lasso, Journal of the Royal Statistical Society, 1996.
[92] M.A. Fischler and R.C. Bolles, Random sample consensus: a paradigm for model fitting with applications to image analysis and automated cartography, Communications of the ACM, 1981.
[93] R. Raguram, J.M. Frahm and M. Pollefeys, A comparative analysis of RANSAC techniques leading to adaptive real-time random sample consensus, European Conference on Computer Vision, 2008.
[94] R. Raguram, O. Chum, M. Pollefeys, J. Matas and J.M. Frahm, USAC: a universal framework for random sample consensus, IEEE Transactions on Pattern Analysis and Machine Intelligence, 2013.
[95] M. Feigin, B.J. Ranger and B.W. Anthony, Statistical consensus matching framework for image registration, International Conference on Pattern Recognition, 2016.
[96] H. Le, T.-J. Chin and D. Suter, An exact penalty method for locally convergent maximum consensus, IEEE Conference on Computer Vision and Pattern Recognition, 2017.
[97] R. Li, J. Sun, D. Gong, Y. Zhu, H. Li and Y. Zhang, ARSAC: Efficient model estimation via adaptively ranked sample consensus, Neurocomputing, 2018.
[98] K.H. Lee, C. Yu and S.W. Lee, Deterministic hypothesis generation for robust fitting of multiple structures, arXiv preprint arXiv:1807.09408, 2018.
[99] D. Arthur and S. Vassilvitskii, k-means++: The advantages of careful seeding, ACM-SIAM Symposium on Discrete Algorithms, 2007.
[100] https://danielpimentel.github.io/publications.html

A PROOFS

In this section we present the proofs of all of our statements. We begin with the proofs of Lemmas 1 and 2, which are required for the proof of Theorem 1.

Proof. (Lemma 1) Since U^k is in general position, if |Ω| ≤ dim(U^k), then U^k_Ω = R^{|Ω|}. In other words, U^k_Ω is the entire |Ω|-dimensional space.

Proof. (Lemma 2) If Ω ⊆ Ω_k, then x_Ω trivially lies in U^k_Ω by A2. To see the reverse implication, let r := dim(U^k). We know that x_Ω ∈ U^k_Ω if and only if there exists a coefficient vector θ ∈ R^r such that

$$x_\Omega = U^k_\Omega\,\theta, \qquad (1)$$

where U^k ∈ R^{d×r} denotes a basis of U^k. Since |Ω| = r + 1 by assumption, equation (1) defines a system of r + 1 equations (the rows of x_Ω) and r unknowns (the entries of θ). Assume without loss of generality that Ω = {1, 2, ..., r + 1} (otherwise simply permute the rows accordingly), and let Ω' = {1, 2, ..., r} be the subset of Ω with the first r elements. Then we can rewrite equation (1) as

$$\begin{bmatrix} x_{\Omega'} \\ x_{r+1} \end{bmatrix} \;=\; \begin{bmatrix} U^k_{\Omega'} \\ U^k_{r+1} \end{bmatrix} \theta.$$

Notice that U^k_{Ω'} ∈ R^{r×r} is invertible because A1 guarantees that U^k is in general position with probability 1. Hence we can use the top block to obtain θ = (U^k_{Ω'})^{-1} x_{Ω'}, and we can plug this into the last row to obtain:

$$x_{r+1} = U^k_{r+1}\,(U^k_{\Omega'})^{-1} x_{\Omega'}.$$

Let k_i indicate the subspace corresponding to the ith entry of x. Then by A2 there exist coefficients θ^1, ..., θ^K such that x_i = U^{k_i}_i θ^{k_i}. In light of this we can rewrite the last equation as

$$U^{k_{r+1}}_{r+1}\,\theta^{k_{r+1}} \;=\; U^k_{r+1}\,(U^k_{\Omega'})^{-1} \begin{bmatrix} U^{k_1}_1\,\theta^{k_1} \\ U^{k_2}_2\,\theta^{k_2} \\ \vdots \\ U^{k_r}_r\,\theta^{k_r} \end{bmatrix}. \qquad (2)$$

At this point equation (2) has no variables left (all the U^k's and θ^k's are fixed), which means that equation (2) is a consistency condition. Notice that if Ω ⊆ Ω_k, then k_i = k, and equation (2) transforms into U^k_{r+1} θ^k = U^k_{r+1} (U^k_{Ω'})^{-1} U^k_{Ω'} θ^k, which further simplifies to U^k_{r+1} θ^k = U^k_{r+1} θ^k. In other words, if Ω ⊆ Ω_k, the consistency condition (2) is met. On the other hand, if Ω ⊄ Ω_k, equation (2) does not simplify any further. Recall from A1 that U^1, ..., U^K are drawn independently. This implies that the entries of U^1, ..., U^K keep no relation with one another.
Similarly, from A2 we know that x_{Ω_k} is drawn independently. Equivalently, θ^1, ..., θ^K keep no relation with one another. Intuitively, we can think of the entries of the U^k's and θ^k's as independent random numbers drawn from a distribution with full measure. We thus conclude that if Ω ⊄ Ω_k, with probability 1 the consistency condition (2) will not hold. To summarize, we know that x_Ω ∈ U^k_Ω if and only if there is a vector θ that satisfies equation (1). We showed that this will be the case if and only if equation (2) holds, which in turn will happen if and only if Ω ⊆ Ω_k, as claimed.

Notice that if Ω has dim(U^k) + ℓ entries, equation (1) becomes a system with dim(U^k) + ℓ equations, which results in ℓ consistency conditions similar to equation (2). By the same arguments as in the proof of Lemma 2, x_Ω will agree with U^k_Ω if and only if these consistency conditions hold, which will happen if and only if Ω ⊆ Ω_k. We thus obtain the following corollary.

Corollary 2. Suppose A1 and A2 hold. Let Ω be an arbitrary subset of [d] with more than dim(U^k) elements. Then x_Ω ∈ U^k_Ω if and only if Ω ⊆ Ω_k.

With this, we are ready to give a proof of Theorem 1.

Proof. (Theorem 1) Let r_k := dim(U^k). To identify Ω_1, ..., Ω_K we exhaustively search for combinations Ω' of r_k + 1 entries of x that match with U^k. By Lemma 2 we know that x_{Ω'} will match with U^k if and only if Ω' ⊆ Ω_k. By assumption |Ω_k| > r_k for every k, so for each k we will at some point find such a subset Ω'. Once we do, we can recover θ^k = (U^k_{Ω''})^{-1} x_{Ω''}, where Ω'' can be any subset of Ω' with exactly r_k entries (recall that A1 guarantees that U^k_{Ω''} is invertible). Finally, we can compute x^k := U^k θ^k, and recover Ω_k by inspection (the set of zero entries of x^k − x), where Corollary 2 guarantees that there will be no additional false matching entries. Repeating this procedure for every k, we can recover Ω_1, ..., Ω_K, as claimed.

To obtain the converse, suppose that r_k or fewer entries of x correspond to U^k for some k. By Lemma 1, any subset Ω' of Ω_k ∪ Ω_0 with |Ω'| ≤ r_k satisfies x_{Ω'} ∈ U^k_{Ω'}. Hence we can split the entries of x corresponding to Ω_k and Ω_0 in a non-unique manner and still have them agree with U^k, which makes Ω_k unidentifiable.

We now use the same strategy to prove Corollary 1, which extends these ideas to noisy settings.

Proof. (Corollary 1) Let P^k_{Ω'} denote the projection operator onto U^k_{Ω'}. Under the assumptions of the corollary, if Ω' ⊄ Ω_k, then with high probability (decreasing in C and increasing in c), ‖x_{Ω'} − P^k_{Ω'} x_{Ω'}‖^2 > |Ω'| σ^2 (see Theorem 1 in [17]). In other words, the residual of projecting x_{Ω'} onto U^k_{Ω'} will be too large if the entries of Ω' do not come from the kth subspace. Hence, instead of searching for a set Ω' where x_{Ω'} perfectly fits in U^k_{Ω'} (as in the proof of Theorem 1), we can search for subsets Ω' where the residual is within the noise level, i.e., ‖x_{Ω'} − P^k_{Ω'} x_{Ω'}‖^2 ≤ |Ω'| σ^2, which with high probability satisfy Ω' ⊆ Ω_k. The rest of the proof follows by the same arguments as the proof of Theorem 1.
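The projection-residual test used in the noisy argument is simple to implement. The following Python/NumPy sketch (not from the paper; the function name and the constant c in the threshold are illustrative) accepts a candidate set Ω' for the kth subspace only when the squared residual is at the noise level, mirroring the criterion ‖x_{Ω'} − P^k_{Ω'} x_{Ω'}‖^2 ≤ |Ω'|σ^2 above.

```python
import numpy as np

def matches(U, y, Om, sigma, c=1.0):
    """Accept Om for span(U) iff the projection residual is within the noise level."""
    theta, *_ = np.linalg.lstsq(U[Om], y[Om], rcond=None)
    residual_sq = np.sum((y[Om] - U[Om] @ theta) ** 2)
    return residual_sq <= c * len(Om) * sigma ** 2
```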
B ADDITIONAL EXPERIMENTS

Figure 2: Complementary results to Figure 1 with K = 4. Average success rate and iteration growth over 100 trials, as a function of the number of entries per subspace |Ω_k| (which in turn is a proxy for the ambient dimension d), the subspaces' dimension r, the gap γ := |Ω_k| − (r + 1), and the noise level σ. Panel settings: (a, a') K = 4, r = 5, σ = 0; (b, b') K = 4, |Ω_k| = 40, σ = 0; (c, c') K = 4, γ = 20, σ = 0; (e, e') K = 2, r = 5, |Ω_k| = 40.