# Mixture Matrix Completion

Daniel Pimentel-Alarcón
Department of Computer Science, Georgia State University, Atlanta, GA, 30303
pimentel@gsu.edu

32nd Conference on Neural Information Processing Systems (NeurIPS 2018), Montréal, Canada.

Completing a data matrix X has become a ubiquitous problem in modern data science, with motivations in recommender systems, computer vision, and network inference, to name a few. One typical assumption is that X is low-rank. A more general model assumes that each column of X corresponds to one of several low-rank matrices. This paper generalizes these models to what we call mixture matrix completion (MMC): the case where each entry of X corresponds to one of several low-rank matrices. MMC is a more accurate model for recommender systems, and brings more flexibility to other completion and clustering problems. We make four fundamental contributions about this new model. First, we show that MMC is theoretically possible (well-posed). Second, we give its precise information-theoretic identifiability conditions. Third, we derive the sample complexity of MMC. Finally, we give a practical algorithm for MMC with performance comparable to the state-of-the-art for simpler related problems, both on synthetic and real data.

1 Introduction

Matrix completion aims to estimate the missing entries of an incomplete data matrix X. One of its main motivations arises in recommender systems, where each row represents an item and each column represents a user. We only observe an entry in X whenever a user rates an item, and the goal is to predict unseen ratings in order to make good recommendations.

Related Work. In 2009, Candès and Recht [1] introduced low-rank matrix completion (LRMC), arguably the most popular model for this task. LRMC assumes that each column (user) can be represented as a linear combination of a few others, whence X is low-rank. Later, in 2012, Eriksson et al. [2] introduced high-rank matrix completion (HRMC), also known as subspace clustering with missing data. This more general model assumes that each column of X comes from one of several low-rank matrices, thus allowing several types of users. Since their inceptions, both LRMC and HRMC have attracted a tremendous amount of attention (see [1-27] for a very incomplete list).

Paper contributions. This paper introduces an even more general model: mixture matrix completion (MMC), which assumes that each entry in X (rather than each column) comes from one of several low-rank matrices, and the goal is to recover the matrices in the mixture. Figure 1 illustrates the generalization from LRMC to HRMC and to MMC. One of the main motivations behind MMC is that users often share the same account, and so each column in X may contain ratings from several users. Nonetheless, as we show in Section 2, MMC is also a more accurate model for many other contemporary applications, including network inference, computer vision, and metagenomics. This paper makes several fundamental contributions about MMC:

Well-posedness. First, we show that MMC is theoretically possible if we observe the right entries and the mixture is generic (precise definitions below).

Figure 1: In LRMC, X is a low-rank matrix. In HRMC, each column of X comes from one of several low-rank matrices. In MMC, each entry comes from one of several low-rank matrices X1, . . . , XK; we only observe XΩ, and our goal is to recover the columns of X1, . . . , XK that have observations in XΩ.
Identifiability conditions. We provide precise information-theoretic conditions on the entries that need to be observed such that a mixture of K low-rank matrices is identifiable. These extend similar recent results for LRMC [3] and HRMC [4] to the setting of MMC. The subtlety in proving these results is that there could exist false mixtures that agree with the observed entries, even if the sampling is uniquely completable for LRMC and HRMC (see Example 1). In other words, there exist samplings that are identifiable for LRMC (and HRMC) but are not identifiable for MMC, and so in general it is not enough to simply have K times more samples. Hence, it was necessary to derive identifiability conditions for MMC, similar to those of LRMC in [3] and HRMC in [4]. We point out that in contrast to typical completion theory [1, 2, 5-20], these identifiability conditions are deterministic (not restricted to uniform sampling) and make no coherence assumptions.

Sample complexity. If X ∈ R^{d×n} is a mixture of K rank-r matrices, we show that with high probability our identifiability conditions will be met if each entry is observed with probability O((K/d) max{r, log d}), thus deriving the sample complexity of MMC. This is the same as the sample complexity of HRMC [4], and simplifies to O((1/d) max{r, log d}) in the case K = 1, which corresponds to the sample complexity of LRMC [3]. Intuitively, this means that information-theoretically, we virtually pay no price for mixing low-rank matrices.

Practical algorithm. Our identifiability results follow from a combinatorial analysis that is infeasible in practice. To address this, we give a practical alternating algorithm for MMC whose performance (in the more difficult problem of MMC) is comparable to state-of-the-art algorithms for the much simpler problems of HRMC and LRMC.

2 Motivating Applications

Besides recommender systems, there are many important applications where data can be modeled as a mixture of low-rank matrices. Here are a few examples motivated by current data science challenges.

Network Inference. Estimating the topology of a network (internet, sensor networks, biological networks, social networks) has been the subject of a large body of research in recent years [28-34]. To this end, companies routinely collect distances between nodes (e.g., computers) that connect with monitors (e.g., Google, Amazon, Facebook) in a data matrix X. In a simplified model, if node j is in subnet k, then the jth column can be modeled as the sum of (i) the distance between node j and router k, and (ii) the distance between router k and each of the monitors. Hence, the columns (nodes) corresponding to each subnet form a low-rank matrix, which is precisely the model assumed by HRMC. However, depending on the network's traffic, each node may use different routes to communicate at different times. Consequently, the same column in X may contain measurements from different low-rank matrices. In other words, distance matrices of networks are a mixture of low-rank matrices.

Computer Vision. Background segmentation is one of the most fundamental and crucial tasks in computer vision, yet it can be tremendously challenging. The vectorized frames of a video can be modeled as columns with some entries (pixels) in a low-rank background, and some outlier entries corresponding to the foreground. Typical methods, like the acclaimed Robust PCA (principal component analysis) [35-46], assume that the foreground is sparse and has no particular structure.
However, in many situations this is not the case. For instance, since the location of an object in consecutive frames is highly correlated, the foreground can be highly structured. Similarly, the foreground may not be sparse, especially if there are foreground objects moving close to the camera (e.g., in a selfie). Even state-of-the-art methods fail in scenarios like these, which are not covered by current models (see Figure 3 for an example). In contrast, MMC allows us to use one matrix in the mixture to represent the background, other matrices to represent foreground objects (small or large, even dominant), and even other matrices to account for occlusions and other illumination/visual artifacts. Hence, MMC can be a more accurate model for video segmentation and other image processing tasks, including inpainting [47] and face clustering, which we explore in our experiments.

Metagenomics. One contemporary challenge in biology is to quantify the presence of different types of bacteria in a system (e.g., the human gut microbiome) [48-52]. The main idea is to collect several DNA samples from such a system, and use their genomic information to count the number of bacteria of each type (the genome of each bacterium determines its type). In practice, to obtain an organism's genome (e.g., a person's genome), biologists feed a DNA sample (e.g., blood or hair) to a sequencer machine that produces a series of reads, which are short genomic sequences that can later be assembled and aligned to recover the entire genome. The challenge arises when the sequencer is provided a sample with DNA from multiple organisms, as is the case in the human gut microbiome, where any sample will contain a mixture of DNA from multiple bacteria that cannot be disentangled into individual bacteria. In this case, each read produced by the sequencer may correspond to a different type of bacteria. Consequently, each DNA sample (column) may contain genes (rows) from different types of bacteria, which is precisely the model that MMC describes.

3 Problem Statement

Let X1, . . . , XK ∈ R^{d×n} be a set of rank-r matrices, and let Ω1, . . . , ΩK ∈ {0, 1}^{d×n} indicate disjoint sets of observed entries. Suppose X1, . . . , XK and Ω1, . . . , ΩK are unknown, and we only observe XΩ, defined as follows: if the (i, j)th entry of Ωk is 1, then the (i, j)th entry of XΩ is equal to the (i, j)th entry of Xk; if the (i, j)th entry of Ωk is 0 for every k = 1, . . . , K, then the (i, j)th entry of XΩ is missing. This way Ωk indicates the entries of XΩ that correspond to Xk, and Ω := Σ_{k=1}^K Ωk indicates the set of all observed entries. Since Ω1, . . . , ΩK are disjoint, Ω ∈ {0, 1}^{d×n}. Equivalently, each observed entry of XΩ corresponds to an entry in either X1 or X2 or . . . or XK (i.e., there are no collisions). In words, XΩ contains a mixture of entries from several low-rank matrices. The goal of MMC is to recover all the columns of X1, . . . , XK that have observations in XΩ (see Figure 1 to build some intuition).

In our recommendations example, a column xω of XΩ will contain entries from Xk whenever xω contains ratings from a user of the kth type. Similarly, the same column will contain entries from Xℓ whenever it also contains ratings from a user of the ℓth type. We would like to predict the preferences of both users, or more generally, of all users that have ratings in xω. On the other hand, if xω has no entries from Xk, then xω involves no users of the kth type, and so it would be impossible (and futile) to try to recover such a column of Xk.
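To make the observation model concrete, the following minimal sketch (Python/NumPy; all function and variable names are ours, not from the paper) assembles XΩ from K low-rank matrices and disjoint observation masks, with missing entries marked as NaN:

```python
import numpy as np

def make_mixture(Xs, Omegas):
    """Assemble the observed matrix X_Omega from low-rank matrices Xs = [X1, ..., XK]
    and disjoint 0/1 masks Omegas = [Omega1, ..., OmegaK]; NaN marks missing entries."""
    d, n = Xs[0].shape
    X_obs = np.full((d, n), np.nan)
    for Xk, Ok in zip(Xs, Omegas):
        X_obs[Ok.astype(bool)] = Xk[Ok.astype(bool)]
    return X_obs

# Toy instance: K = 2 rank-1 matrices; each entry comes from one matrix or is missing.
rng = np.random.default_rng(0)
d, n, r, K, p = 6, 5, 1, 2, 0.7
Xs = [rng.standard_normal((d, r)) @ rng.standard_normal((r, n)) for _ in range(K)]
labels = rng.choice(K + 1, size=(d, n), p=[p / K] * K + [1 - p])  # label K means "missing"
Omegas = [(labels == k).astype(int) for k in range(K)]
X_obs = make_mixture(Xs, Omegas)
```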
In MMC, the matrices Ω1, . . . , ΩK play the role of the hidden variables constantly present in mixture problems. Notice that if we knew Ω1, . . . , ΩK, then we could partition XΩ accordingly, and estimate X1, . . . , XK using standard LRMC. The challenge is that we do not know Ω1, . . . , ΩK.

3.1 The Subtleties of MMC

The main theoretical difficulty of MMC is that, depending on the pattern of missing data, there could exist false mixtures: matrices X̄1, . . . , X̄K, other than X1, . . . , XK, that agree with XΩ, even if X1, . . . , XK are observed on uniquely completable patterns for LRMC.

Example 1. Consider the following rank-1 matrices X1, X2, and their partially observed mixture XΩ (dots denote missing entries):

          [ 1  2  3  4 ]         [ 1   2   3   4 ]          [ 1   ·   3   4 ]
          [ 1  2  3  4 ]         [ 2   4   6   8 ]          [ 1   2   ·   8 ]
    X1 =  [ 1  2  3  4 ]   X2 =  [ 3   6   9  12 ]   XΩ =   [ 3   2   3   · ]
          [ 1  2  3  4 ]         [ 4   8  12  16 ]          [ 4   8   3   4 ]
          [ 1  2  3  4 ]         [ 5  10  15  20 ]          [ ·  10  15   4 ]

We can verify that X1 and X2 are observed on uniquely completable sampling patterns for LRMC [3]. Nonetheless, we can construct the following false rank-1 matrices that agree with XΩ:

          [ 60   40   15    4  ]          [  1   1/4    3    1 ]
          [  1  2/3  1/4  1/15 ]          [  8    2    24    8 ]
    X̄1 =  [  3    2  3/4   1/5 ]   X̄2 =  [  1   1/4    3    1 ]
          [ 12    8    3   4/5 ]          [  4    1    12    4 ]
          [ 60   40   15    4  ]          [ 40   10   120   40 ]

This shows that even with unlimited computational power, if we exhaustively search all the identifiable patterns for LRMC, we can end up with false mixtures. Hence the importance of studying the identifiable patterns for MMC.

False mixtures arise because we do not know a priori which entries of XΩ correspond to each Xk. Hence, it is possible that a rank-r matrix X̄ agrees with some entries from X1, other entries from X2, and so on. Furthermore, X̄ may even be the only rank-r matrix that agrees with such a combination of entries, as in Example 1.
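The claims in Example 1 are easy to check numerically. The sketch below (Python/NumPy, written for this presentation rather than taken from the paper) verifies that X1, X2, X̄1, X̄2 are all rank 1 and that both the true pair and the false pair agree with every observed entry of XΩ:

```python
import numpy as np

# True rank-1 matrices and the observed mixture from Example 1 (NaN = missing).
X1 = np.outer(np.ones(5), [1, 2, 3, 4])          # every row is [1 2 3 4]
X2 = np.outer([1, 2, 3, 4, 5], [1, 2, 3, 4])     # multiplication table
X_obs = np.array([[1, np.nan, 3, 4],
                  [1, 2, np.nan, 8],
                  [3, 2, 3, np.nan],
                  [4, 8, 3, 4],
                  [np.nan, 10, 15, 4]], dtype=float)

# False rank-1 matrices that also agree with X_obs.
F1 = np.outer([1, 1/60, 1/20, 1/5, 1], [60, 40, 15, 4])
F2 = np.outer([1, 8, 1, 4, 40], [1, 1/4, 3, 1])

obs = ~np.isnan(X_obs)
# Every observed entry agrees with X1 or X2 (the true mixture) ...
assert np.all(np.isclose(X_obs[obs], X1[obs]) | np.isclose(X_obs[obs], X2[obs]))
# ... but also with F1 or F2 (a false mixture), and all four matrices have rank 1.
assert np.all(np.isclose(X_obs[obs], F1[obs]) | np.isclose(X_obs[obs], F2[obs]))
assert all(np.linalg.matrix_rank(M) == 1 for M in (X1, X2, F1, F2))
```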
Remark 1. Recall that LRMC and HRMC are tantamount to identifying the subspace(s) containing the columns of X [3, 4]. In fact, if we knew such subspaces, LRMC and HRMC would become almost trivial problems (see Appendix A for details). Similarly, if no data is missing, HRMC simplifies to subspace clustering, which has been studied extensively and is now reasonably well understood [53-62]. In contrast, MMC remains challenging even if the subspaces corresponding to the low-rank matrices in the mixture are known, and even if X is fully observed. We refer the curious reader to Appendix A, and point out the bottom row and the last column in Figure 2, which show the MMC error when the underlying subspaces are known, and when X is fully observed.

4 Main Theoretical Results

Example 1 shows the importance of studying the identifiable patterns for MMC, which we do now. First recall that r + 1 samples per column are necessary for LRMC [3]. This implies that even if an oracle told us Ω1, . . . , ΩK, if we intend to recover a column of Xk, we need to observe it on at least r + 1 entries. Hence we assume without loss of generality that:

(A1) Each column of Ωk has either 0 or r + 1 non-zero entries.

In words, A1 requires that each column of Xk to be recovered is observed on exactly r + 1 entries. Of course, observing more entries may only aid completion. Hence, rather than an assumption, A1 describes the most difficult scenario, where we have the bare minimum amount of information required for completion. We use A1 to ease notation, exposition and analysis. All our results can be easily extended to the case where A1 is dropped (see Remark 2).

Without further assumptions on X, completion (of any kind) may be impossible. To see this, consider the simple example where X is only supported on the ith row. Then it would be impossible to recover X unless all columns were observed on the ith row. In most completion applications this would be unlikely. For example, in a movie recommender system like Netflix, this would require that all the users watched (and rated) the same movie. To rule out scenarios like these, typical completion theory requires incoherence and uniform sampling. Incoherence guarantees that the information is well spread over the matrix. Uniform sampling guarantees that all rows and columns are sufficiently sampled. However, it is usually unclear (and generally unverifiable) whether an incomplete matrix is coherent. Furthermore, observations are hardly ever uniformly distributed. For instance, we do not expect children to watch adult movies. To avoid these issues, instead of incoherence we will assume that X is a generic mixture of low-rank matrices. More precisely, we assume that:

(A2) X1, . . . , XK are drawn independently according to an absolutely continuous distribution with respect to the Lebesgue measure on the determinantal variety (the set of all d × n, rank-r matrices).

A2 essentially requires that each Xk is a generic rank-r matrix. This type of genericity assumption is becoming increasingly common in studies of LRMC, HRMC, and related problems [3, 4, 23-27, 46]. See Appendix C for a further discussion of A2 and its relation to other common assumptions from the literature.

With this, we are ready to present our main theorem. It gives a deterministic condition on Ω to guarantee that X1, . . . , XK can be identified from XΩ. This provides information-theoretic requirements for MMC. The proof is in Appendix B.

Theorem 1. Let A1-A2 hold. Suppose there exist matrices {Ωτ}_{τ=1}^{r+1}, formed with disjoint subsets of (d − r + 1) columns of Ωk, such that for every τ:

(†) Every matrix Ω′ formed with a proper subset of the columns in Ωτ has at least r fewer columns than non-zero rows.

Then all the columns of Xk that have observations in XΩ are identifiable.

In words, Theorem 1 states that MMC is possible as long as we observe the right entries in each Xk. The intuition is that each of these entries imposes a constraint on what X1, . . . , XK may be, and the pattern in Ω determines whether these constraints are redundant. Patterns satisfying the conditions of Theorem 1 guarantee that X1, . . . , XK is the only mixture that satisfies the constraints produced by the observed entries.

Remark 2. Recall that r + 1 samples per column are strictly necessary for completion. A1 requires that we have exactly that minimum number of samples. If Xk is observed on more than r + 1 entries per column, it suffices that Ωk contains a pattern satisfying the conditions of Theorem 1.

Theorem 1 shows that MMC is possible if the samplings satisfy certain combinatorial conditions. Our next result shows that if each entry of Xk is observed in XΩ with probability O((1/d) max{r, log d}), then with high probability Ωk will satisfy such conditions. The proof is in Appendix B.

Theorem 2. Suppose r ≤ d/6 and n ≥ (r + 1)(d − r + 1). Let ε > 0 be given. Suppose that an entry of XΩ is equal to the corresponding entry of Xk with probability

    p ≥ (2/d) max{2r, 12 log(d/ε)}.

Then Ωk satisfies the sampling conditions of Theorem 1 with probability at least 1 − 2(r + 1)ε.

Theorem 2 shows that the sample complexity of MMC is O(K max{r, log d}) observations per column of XΩ. This is exactly the same as the sample complexity of HRMC [4], and it simplifies to O(max{r, log d}) if K = 1, corresponding to the sample complexity of LRMC [3]. Intuitively, this means that information-theoretically, we virtually pay no price for mixing low-rank matrices.
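Condition (†) of Theorem 1 is purely combinatorial, so it can be tested directly on small patterns. The following brute-force sketch (Python/NumPy; exponential in the number of columns, so only meant for toy examples, and the function name is ours) checks one candidate Ωτ; Theorem 1 asks for r + 1 such disjoint matrices per Ωk:

```python
import numpy as np
from itertools import combinations

def check_dagger(Omega_tau, r):
    """Brute-force check of condition (†): every nonempty proper subset of the columns of
    Omega_tau must touch at least r more non-zero rows than it has columns."""
    d, m = Omega_tau.shape
    for size in range(1, m):                       # proper subsets only
        for cols in combinations(range(m), size):
            nonzero_rows = np.count_nonzero(Omega_tau[:, list(cols)].any(axis=1))
            if nonzero_rows < size + r:
                return False
    return True

# Example with d = 4, r = 1 (so d - r + 1 = 4 columns and r + 1 = 2 observations per column):
Omega_tau = np.array([[1, 0, 0, 1],
                      [1, 1, 0, 0],
                      [0, 1, 1, 0],
                      [0, 0, 1, 1]])
print(check_dagger(Omega_tau, r=1))  # True
```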
5 Alternating Algorithm for MMC

Theorems 1 and 2 show that MMC is theoretically possible under reasonable conditions (virtually the same as LRMC and HRMC). However, these results follow from a combinatorial analysis that is infeasible in practice (see Appendix B for details). To address this, we derive a practical alternating algorithm for MMC, which we call AMMC (alternating mixture matrix completion).

The main idea is that MMC, like most mixture problems, can be viewed as a clustering task: if we could determine the entries of XΩ that correspond to each Xk, then we would be able to partition XΩ into K incomplete low-rank matrices, and then complete them using standard LRMC. The question is how to determine which entries of XΩ correspond to each Xk, i.e., how to determine Ω1, . . . , ΩK.

To address this, let U^k ∈ R^{d×r} be a basis for the subspace containing the columns of Xk, and let xω denote the jth column of XΩ, observed only on the entries indexed by ω ⊆ {1, . . . , d}. For any subspace, matrix or vector compatible with a set of indices, we use that set as a subscript to denote the restriction to the corresponding coordinates/rows. For example, U^k_ω ∈ R^{|ω|×r} denotes the restriction of U^k to the rows indexed by ω.

Suppose xω contains entries from Xk, and let ω^k ⊆ ω index such entries. Then our goal is to determine ω^k, as that would tell us the jth column of Ωk. Since x_{ω^k} ∈ span{U^k_{ω^k}}, we can restate our goal as finding the set ω^k ⊆ ω such that x_{ω^k} ∈ span{U^k_{ω^k}}.

To find ω^k, let υ ⊆ ω, and let P^k_υ := U^k_υ ((U^k_υ)^T U^k_υ)^{-1} (U^k_υ)^T denote the projection operator onto span{U^k_υ}. Recall that ‖P^k_υ x_υ‖ ≤ ‖x_υ‖, with equality if and only if x_υ ∈ span{U^k_υ}. It follows that ω^k is the largest set υ such that ‖P^k_υ x_υ‖ = ‖x_υ‖. In other words, ω^k is the solution to

    arg max_{υ ⊆ ω}  ‖P^k_υ x_υ‖ − ‖x_υ‖ + |υ|.        (1)

However, (1) is non-convex. Hence, in order to find the solution to (1), we propose the following erasure strategy. The main idea is to start our search with υ = ω, and then iteratively remove the entries (coordinates) of υ that most increase the gap between ‖P^k_υ x_υ‖ and ‖x_υ‖ (hence the term erasure). We stop this procedure when ‖P^k_υ x_υ‖ is equal to ‖x_υ‖ (or close enough). More precisely, we initialize υ = ω, and then iteratively redefine υ as the set υ\i, where

    i = arg max_{i ∈ υ}  ‖P^k_{υ\i} x_{υ\i}‖ − ‖x_{υ\i}‖.        (2)

In words, i is the coordinate of the vector x_υ such that, if ignored, the gap between the remaining vector x_{υ\i} and its projection P^k_{υ\i} x_{υ\i} is reduced the most. At each iteration we remove (erase) such a coordinate i from υ. The intuition behind this approach is that the coordinates of x_υ that do not correspond to Xk are the most likely to increase the gap between ‖P^k_υ x_υ‖ and ‖x_υ‖.

Notice that if U^k is in general position (guaranteed by A2) and |υ| ≤ r, then span{U^k_υ} = R^{|υ|} (because U^k is r-dimensional). In that case it is trivially true that x_υ ∈ span{U^k_υ}, whence ‖P^k_υ x_υ‖ = ‖x_υ‖. Hence the procedure above is guaranteed to terminate after at most |ω| − r iterations. At that point |υ| = r, and we know that we were unable to find ω^k (or a subset of it). One alternative is to start with a different υ_0 ⊆ ω, and search again.

This procedure may remove some entries from ω^k along the way, so in general the output of this process will be a set υ ⊆ ω^k. However, finding a subset of ω^k is enough to find ω^k. To see this, recall that since x_{ω^k} ∈ span{U^k_{ω^k}}, there is a coefficient vector θ^k ∈ R^r such that x_{ω^k} = U^k_{ω^k} θ^k. Since υ ⊆ ω^k, it follows that x_υ = U^k_υ θ^k. Furthermore, since |υ| ≥ r, we can find θ^k as θ^k = ((U^k_υ)^T U^k_υ)^{-1} (U^k_υ)^T x_υ. Since x_{ω^k} = U^k_{ω^k} θ^k, at this point we can identify ω^k by simple inspection (the matching entries in x_ω and U^k_ω θ^k). Recall that ω^k determines the jth column of Ωk. Hence, if we repeat the procedure above for each column in XΩ and each k, we can recover Ω1, . . . , ΩK. After this, we can use standard LRMC on XΩ1, . . . , XΩK to recover X1, . . . , XK (which is the ultimate goal of MMC).
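When U^k is known, the erasure procedure amounts to a few lines of code. The sketch below (Python/NumPy; names are ours) greedily removes the coordinate whose deletion most reduces the residual gap, a closely related selection rule to (2): it minimizes the residual ‖x_{υ\i} − P^k_{υ\i} x_{υ\i}‖ rather than literally maximizing (2), and both quantities vanish exactly when x_υ lies in span{U^k_υ}:

```python
import numpy as np

def erase(x, omega, U, r, tol=1e-9):
    """Greedy erasure: start with upsilon = omega and repeatedly drop the coordinate whose
    removal most reduces the gap between x_upsilon and its projection onto span(U_upsilon),
    stopping once the gap is (numerically) zero or only r coordinates remain.
    x is a length-d vector (entries outside omega are ignored); U is a d-by-r basis."""

    def gap(idx):  # residual norm ||x_v - P_v x_v|| for v = idx
        Uv, xv = U[idx, :], x[idx]
        coef, *_ = np.linalg.lstsq(Uv, xv, rcond=None)
        return np.linalg.norm(xv - Uv @ coef)

    upsilon = list(omega)
    while len(upsilon) > r and gap(upsilon) > tol:
        # coordinate whose removal leaves the smallest remaining gap, cf. (2)
        i = min(upsilon, key=lambda c: gap([j for j in upsilon if j != c]))
        upsilon.remove(i)
    return upsilon
```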
The catch here is that this procedure requires knowing U^k, which we do not know. So essentially we have a chicken-and-egg problem: (i) if we knew U^k, we would be able to find Ωk; (ii) if we knew Ωk, we would be able to find U^k (and Xk, using standard LRMC on XΩk). Since we know neither, we use a common technique for this kind of problem: alternate between finding Ωk and U^k. More precisely, we start with some initial guesses Û^1, . . . , Û^K, and then alternate between the following two steps until convergence:

(i) Cluster. Let xω be the jth column in XΩ. For each k = 1, . . . , K, we first erase entries from ω to obtain a set υ ⊆ ω indicating entries likely to correspond to Xk. This erasure procedure initializes υ = ω, and then repeats (2) (replacing P^k with P̂^k, which denotes the projection operator onto span{Û^k}) until we obtain a set υ ⊆ ω such that the projection P̂^k_υ x_υ is close to x_υ. This way, the entries of x_υ are likely to correspond to Xk. Using these entries, we can estimate the coefficient of the jth column of Xk with respect to U^k, given by θ̂^k = ((Û^k_υ)^T Û^k_υ)^{-1} (Û^k_υ)^T x_υ. With θ̂^k we can also estimate the jth column of Xk as x̂^k := Û^k θ̂^k.

Notice that both υ and x̂^k are obtained using Û^k, which may be different from U^k. It follows that υ may contain some entries that do not correspond to Xk, and x̂^k may be inaccurate. Hence, in general, xω and x̂^k_ω will have no matching entries, and so we cannot identify ω^k by simple inspection, as before. However, we can repeat our procedure for each k to obtain estimates x̂^1_ω, . . . , x̂^K_ω, and then assign each entry of xω to its closest match. More precisely, our estimate ω̂^k ⊆ ω (indicating the entries of xω that we estimate correspond to Xk) will contain entry i ∈ ω if |x_i − x̂^k_i| ≤ |x_i − x̂^ℓ_i| for every ℓ = 1, . . . , K. Repeating this procedure for each column of XΩ produces estimates Ω̂1, . . . , Ω̂K. Specifically, the jth column of Ω̂k ∈ {0, 1}^{d×n} will contain a 1 in the rows indicated by ω̂^k.

(ii) Complete. For each k, complete X_{Ω̂k} using your favorite LRMC algorithm. Then compute a new estimate Û^k given by the leading r left singular vectors of the completion of X_{Ω̂k}.

Figure 2: Left: Success rate (average over 100 trials) of AMMC as a function of the fraction of observed entries p and the distance δ between the true subspaces U^k and their initial estimates; lightest represents 100% success rate, darkest represents 0%. Right: Comparison of state-of-the-art algorithms for LRMC (LMaFit), HRMC (GSSC), and MMC (AMMC, this paper), each in its respective setting (see Figure 1), as a function of the sampling rate per matrix p/K. The performance of AMMC (in the more difficult problem of MMC) is comparable to the performance of state-of-the-art algorithms in the simpler problems of LRMC and HRMC.

The entire procedure is summarized in Algorithm 1 in Appendix D, where we also discuss initialization, generalizations to noise and outliers, and other simple extensions to improve performance.
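Putting the two steps together, here is one possible reading of the AMMC alternation as a sketch (Python/NumPy, reusing the erase function from the previous sketch; the rank-r SVD imputation below is only a crude stand-in for "your favorite LRMC algorithm", and all names are ours):

```python
import numpy as np

def svd_impute(X_nan, r, n_iter=100):
    """Crude LRMC stand-in for the Complete step: iterative rank-r SVD imputation."""
    mask = ~np.isnan(X_nan)
    X = np.where(mask, X_nan, 0.0)
    for _ in range(n_iter):
        U, s, Vt = np.linalg.svd(X, full_matrices=False)
        X_low = (U[:, :r] * s[:r]) @ Vt[:r]
        X = np.where(mask, X_nan, X_low)
    return X_low, U[:, :r]

def ammc(X_obs, K, r, U_hats, n_iter=20):
    """Sketch of the AMMC alternation: Cluster entries to the K subspaces, then Complete.
    U_hats is a list of K initial d-by-r bases; X_obs uses NaN for missing entries."""
    d, n = X_obs.shape
    mask = ~np.isnan(X_obs)
    for _ in range(n_iter):
        # Cluster: estimate each column under each subspace, assign entries to closest match.
        resid = np.full((K, d, n), np.inf)
        for k, U in enumerate(U_hats):
            for j in range(n):
                omega = np.flatnonzero(mask[:, j])
                if omega.size <= r:
                    continue                               # too few observations to use
                ups = erase(X_obs[:, j], omega, U, r)      # erasure sketch from Section 5
                coef, *_ = np.linalg.lstsq(U[ups, :], X_obs[ups, j], rcond=None)
                resid[k][omega, j] = np.abs(X_obs[omega, j] - (U @ coef)[omega])
        labels = np.argmin(resid, axis=0)                  # entry-wise cluster assignment
        # Complete: LRMC on each estimated Omega_k, then refresh the bases U_hat.
        X_hats = []
        for k in range(K):
            Xk_obs = np.where(mask & (labels == k), X_obs, np.nan)
            Xk_hat, U_hats[k] = svd_impute(Xk_obs, r)
            X_hats.append(Xk_hat)
    return X_hats, labels
```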
6 Experiments

Simulations. We first present a series of synthetic experiments to study the performance of AMMC (Algorithm 1). In our simulations we first generate matrices U^k ∈ R^{d×r} and Θ^k ∈ R^{r×n} with i.i.d. N(0, 1) entries to use as bases and coefficients of the low-rank matrices in the mixture, i.e., Xk = U^k Θ^k ∈ R^{d×n}. Here d = n = 100, r = 5 and K = 2. With probability (1 − p) the (i, j)th entry of XΩ will be missing, and with probability p/K it will be equal to the corresponding entry in Xk.

Recall that, similar to EM and other alternating approaches, AMMC depends on initialization. Hence, we study the performance of AMMC as a function of both p and the distance δ ∈ [0, 1] between {U^k} and their initial estimates (measured as the normalized Frobenius norm of the difference between their projection operators). We measure accuracy using the normalized Frobenius norm of the difference between each Xk and its completion, and we count a trial as a success if this quantity is below 10^{-8}. The results of 100 trials are summarized in Figure 2.

Notice that the performance of AMMC decays gracefully with the distance δ between the true subspaces U^k and their initial estimates. We see this type of behavior in similar state-of-the-art alternating algorithms for the simpler problem of HRMC [19]. Since MMC is highly non-convex, it is not surprising that if the initial estimates are poor (far from the truth), then AMMC may converge to a local minimum. Similarly, the performance of AMMC decays gracefully with the fraction of observed entries p. Notice that even if X is fully observed (p = 1), if the initial estimates are very far from the true subspaces (δ = 1), then AMMC performs poorly. This shows, consistent with our discussion in Remark 1, that in practice MMC is a challenging problem even if X is fully observed. Hence, it is quite remarkable that AMMC works most of the time with as little as p = 0.6, corresponding to observing only 0.3 of the entries in each Xk. To put this in perspective, notice (Figure 2) that this is comparable to the amount of missing data tolerated by GSSC [19] and LMaFit [11], which are state-of-the-art for the simpler problems of HRMC (the special case of MMC where all entries in each column of X correspond to the same Xk) and LRMC (the special case where there is only one Xk).

Figure 3: Left 3 (mixture and reconstructions): reconstructed images from a mixture. Right 3 (original, Robust PCA, MMC, this paper): original frame and segmented foreground.

To obtain Figure 2 we replicated the same setup as above, but with data generated according to the HRMC and LRMC models. Hence, we conclude that the performance of AMMC (in the more difficult problem of MMC) is comparable to the performance of state-of-the-art algorithms for the much simpler problems of HRMC and LRMC. We point out that according to Theorems 1 and 2, MMC is theoretically possible with p close to 1/2. However, we can see that (even if U^1, . . . , U^K are known, corresponding to δ = 0 in Figure 2) the performance of AMMC is quite poor if p < 0.6. This shows two things: (i) MMC is challenging even if U^1, . . . , U^K are known (as discussed in Remark 1), and (ii) there is a gap between what is information-theoretically possible and what is currently achievable in practice (with AMMC). In future work we will explore algorithms that can approach the information-theoretic limits.
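For reference, the subspace distance δ used throughout Figure 2 can be computed as follows (a minimal sketch; the paper specifies a normalized Frobenius norm of the difference of projection operators but not the constant, so the √(2r) normalization, which maps orthogonal subspaces to 1, is our assumption):

```python
import numpy as np

def subspace_distance(U, V):
    """Distance between span(U) and span(V) (both d-by-r): Frobenius norm of the
    difference of their orthogonal projection operators, normalized to lie in [0, 1]
    (0 = identical subspaces, 1 = orthogonal subspaces)."""
    def proj(A):
        Q, _ = np.linalg.qr(A)
        return Q @ Q.T
    r = U.shape[1]
    return np.linalg.norm(proj(U) - proj(V), "fro") / np.sqrt(2 * r)

rng = np.random.default_rng(0)
U = rng.standard_normal((100, 5))
print(subspace_distance(U, U))                               # 0.0
print(subspace_distance(U, rng.standard_normal((100, 5))))   # close to 1: random subspaces
```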
Real Data: Face Clustering and Inpainting. It is well known that images of an individual's face are approximately low-rank [63]. Natural images, however, usually contain faces of multiple individuals, often partially occluding each other, resulting in a mixture of low-rank matrices. In this experiment we demonstrate the power of MMC in two tasks: first, classifying partially occluded faces in an image, and second, image inpainting [47]. To this end, we use the Yale B dataset [64], containing 2432 photos of 38 subjects (64 photos per subject), each photo of size 48 × 42. We randomly select two subjects, and vectorize and concatenate their images to obtain two approximately rank-10 matrices X1, X2 ∈ R^{2016×64}. Next we combine them into X ∈ R^{2016×64}, in which each entry is equal to the corresponding entry of X1 or X2 with equal probability. This way, each column of X contains a mixed image with pixels from multiple individuals. We pursue two goals: (i) classify the entries in X according to X1 and X2, which in turn means locating and classifying the face of each individual in each image, and (ii) recover X1 and X2 from X, thus reconstructing the unobserved pixels in each image (inpainting). We repeat this experiment 30 times using AMMC (with Gaussian random initialization, known to produce near-orthogonal subspaces with high probability), obtaining a pixel classification error of 2.98% and a reconstruction error of 4.1%, which is remarkable in light of the fact that the ideal rank-10 approximation (no mixture, and full data) achieves 1.8%. Figure 3 shows an example, with more in Figure 4 in Appendix E. Notice that in this case we cannot compare against other methods, as AMMC is the first, and currently the only, method for MMC.

Real Data: MMC for Background Segmentation. As discussed in Section 2, Robust PCA models a video as the superposition of a low-rank background plus a sparse foreground with no structure. MMC brings more flexibility, allowing multiple low-rank matrices to model the background, structured foreground objects (sparse or abundant) and illumination artifacts, while at the same time also accounting for outliers (the entries/pixels that were assigned to no matrix in the mixture). In fact, contrary to Robust PCA, MMC allows a very large (even dominant) fraction of outliers. In this experiment we test AMMC on the task of background segmentation, using the Wallflower [65] and the I2R [66] datasets, containing videos of traffic cameras, lobbies, and pedestrians in the street. For each video, we compare AMMC (with Gaussian random initialization) against the best result among the following state-of-the-art algorithms for Robust PCA: [35-39]. We chose these methods based on the comprehensive review in [40], and on previous reports [41-43] indicating that these algorithms typically perform as well as or better than several others, including [44, 45]. In most cases, Robust PCA and AMMC perform quite similarly (see Figure 5 in Appendix E). However, in one case AMMC achieves 87.67% segmentation accuracy (compared with the manually segmented ground truth), while Robust PCA only achieves 74.88% (Figure 3). Our hypothesis is that this is due to the large portion of outliers (foreground). It is out of the scope of this paper, but of interest for future work, to collect real datasets with similar properties, where AMMC can be further tested. We point out, however, that AMMC is orders of magnitude slower than Robust PCA. Our future work will also focus on developing faster methods for MMC.

References

[1] E. Candès and B. Recht, Exact matrix completion via convex optimization, Foundations of Computational Mathematics, 2009. [2] B. Eriksson, L.
Balzano and R. Nowak, High-rank matrix completion and subspace clustering with missing data, Artificial Intelligence and Statistics, 2012. [3] D. Pimentel-Alarcón, N. Boston and R. Nowak, A characterization of deterministic sampling patterns for low-rank matrix completion, IEEE Journal of Selected Topics in Signal Processing, 2016. [4] D. Pimentel-Alarcón and R. Nowak, The information-theoretic requirements of subspace clustering with missing data, International Conference on Machine Learning, 2016. [5] E. Candès and T. Tao, The power of convex relaxation: near-optimal matrix completion, IEEE Transactions on Information Theory, 2010. [6] J. Cai, E. Candès and Z. Shen, A singular value thresholding algorithm for matrix completion, SIAM Journal on Optimization, 2010. [7] R. Keshavan, A. Montanari and S. Oh, Matrix completion from a few entries, IEEE Transactions on Information Theory, 2010. [8] L. Balzano, R. Nowak, and B. Recht, Online identification and tracking of subspaces from highly incomplete information, Allerton Conference on Communication, Control and Computing, 2010. [9] B. Recht, A simpler approach to matrix completion, Journal of Machine Learning Research, 2011. [10] S. Ma, D. Goldfarb, L. Chen, Fixed point and Bregman iterative methods for matrix rank minimization, Mathematical Programming, 2011. [11] Z. Wen, W. Yin and Y. Zhang, Solving a low-rank factorization model for matrix completion by a non-linear successive over-relaxation algorithm, Mathematical Programming Computation, 2012. [12] Y. Shen, Z. Wen and Y. Zhang, Augmented lagrangian alternating direction method for matrix separation based on low-rank factorization, International Conference on Numerical Optimization and Numerical Linear Algebra, 2014. [13] E. Chunikhina, R. Raich and T. Nguyen, Performance analysis for matrix completion via iterative hard-thresholded SVD, IEEE Statistical Signal Processing, 2014. [14] Y. Chen, S. Bhojanapalli, S. Sanghavi and R. Ward, Coherent matrix completion, International Conference on Machine Learning, 2014. [15] Y. Chen, Incoherence-optimal matrix completion, IEEE Transactions on Information Theory, 2015. [16] P. Jain, P. Netrapalli and S. Sanghavi, Low-rank matrix completion using alternating minimization, ACM symposium on Theory of computing, 2013. [17] L. Balzano, A. Szlam, B. Recht and R. Nowak, K-subspaces with missing data, IEEE Statistical Signal Processing, 2012. [18] D. Pimentel-Alarcón, L. Balzano and R. Nowak, On the sample complexity of subspace clustering with missing data, IEEE Statistical Signal Processing, 2014. [19] D. Pimentel-Alarcón, L. Balzano, R. Marcia, R. Nowak and R. Willett, Group-sparse subspace clustering with missing data, IEEE Statistical Signal Processing, 2016. [20] C. Yang, D. Robinson and R. Vidal, Sparse subspace clustering with missing entries, International Conference on Machine Learning, 2015. [21] E. Elhamifar, High-rank matrix completion and clustering under self-expressive models, Advances in Neural Information Processing Systems, 2016. [22] G. Ongie, R. Willett, R. Nowak and L. Balzano, Algebraic variety models for high-rank matrix completion, International Conference on Machine Learning, 2017. [23] D. Pimentel-Alarcón, N. Boston and R. Nowak, Deterministic conditions for subspace identifiability from incomplete sampling, IEEE International Symposium on Information Theory, 2015. [24] F. Király, L. Theran and R. Tomioka, The algebraic combinatorial approach for low-rank matrix completion, Journal of Machine Learning Research, 2015. [25] D. 
Pimentel-Alarcón and R. Nowak, A converse to low-rank matrix completion, IEEE International Symposium on Information Theory, 2016. [26] M. Ashraphijuo, X. Wang and V. Aggarwal, A characterization of sampling patterns for low-rank multiview data completion problem, IEEE International Symposium on Information Theory, 2017. [27] M. Ashraphijuo, V. Aggarwal and X. Wang, A characterization of sampling patterns for low-Tucker-rank tensor completion problem, IEEE International Symposium on Information Theory, 2017. [28] R. Govindan and H. Tangmunarunkit, Heuristics for Internet Map Discovery, IEEE INFOCOM, 2000. [29] P. Barford, A. Bestavros, J. Byers and M. Crovella, On the marginal utility of network topology measurements, Proceedings of ACM Internet Measurement Workshop, 2001. [30] N. Spring, R. Mahajan, D. Wetherall and T. Anderson, Measuring ISP topologies with Rocketfuel, IEEE/ACM Transactions on Networking, 2004. [31] D. Alderson, L. Li, W. Willinger and J. Doyle, Understanding internet topology: Principles, models and validation, IEEE/ACM Transactions on Networking, 2005. [32] R. Sherwood, A. Bender and N. Spring, DisCarte: A disjunctive Internet cartographer, ACM SIGCOMM, 2008. [33] B. Eriksson, P. Barford and R. Nowak, Network Discovery from Passive Measurements, ACM SIGCOMM, 2008. [34] B. Eriksson, P. Barford, J. Sommers and R. Nowak, DomainImpute: inferring unseen components in the Internet, IEEE INFOCOM, 2011. [35] Z. Lin, M. Chen, L. Wu and Y. Ma, The augmented Lagrange multiplier method for exact recovery of corrupted low-rank matrices, University of Illinois at Urbana-Champaign Technical Report, 2009. [36] Z. Lin, R. Liu and Z. Su, Linearized alternating direction method with adaptive penalty for low-rank representation, Advances in Neural Information Processing Systems, 2011. [37] X. Ding, L. He and L. Carin, Bayesian robust principal component analysis, IEEE Transactions on Image Processing, 2011. [38] X. Shu, F. Porikli and N. Ahuja, Robust orthonormal subspace learning: Efficient recovery of corrupted low-rank matrices, International Conference on Computer Vision and Pattern Recognition, 2014. [39] Y. Yang, Y. Feng and J. Suykens, A nonconvex relaxation approach to robust matrix completion, Preprint, 2014. [40] T. Bouwmans, A. Sobral, S. Javed, S. Jung and E. Zahzah, Decomposition into low-rank plus additive matrices for background/foreground separation: A review for a comparative evaluation with a large-scale dataset, Computer Science Review, 2016. [41] E. Candès, X. Li, Y. Ma and J. Wright, Robust principal component analysis?, Journal of the ACM, 2011. [42] T. Bouwmans and E. Zahzah, Robust PCA via principal component pursuit: a review for a comparative evaluation in video surveillance, Computer Vision and Image Understanding, 2014. [43] Y. Ma, Low-rank matrix recovery and completion via convex optimization, available at http://perception.csl.illinois.edu/matrix-rank/home.html. [44] X. Yuan and J. Yang, Sparse and low-rank matrix decomposition via alternating direction methods, available at http://www.optimization-online.org/DB_HTML/2009/11/2447.html, 2009. [45] Z. Lin, A. Ganesh, J. Wright, L. Wu, M. Chen and Y. Ma, Fast convex optimization algorithms for exact recovery of a corrupted low-rank matrix, Computational Advances in Multi-Sensor Adaptive Processing, 2009. [46] D. Pimentel-Alarcón and R. Nowak, Random consensus robust PCA, Electronic Journal of Statistics, 2017. [47] J. Mairal, F. Bach, J. Ponce and G. Sapiro, Online dictionary learning for sparse coding, International Conference on Machine Learning, 2009.
[48] S. Highlander, High throughput sequencing methods for microbiome profiling: application to food animal systems, Animal Health Research Reviews, 2012. [49] S. Mande, M. Mohammed and T. Ghosh, Classification of metagenomic sequences: methods and challenges, Briefings in Bioinformatics, 2012. [50] R. Ranjan, A. Rani, A. Metwally, H. McGee and D. Perkins, Analysis of the microbiome: Advantages of whole genome shotgun versus 16S amplicon sequencing, Biochemical and Biophysical Research Communications, 2016. [51] N. Nguyen, T. Warnow, M. Pop and B. White, A perspective on 16S rRNA operational taxonomic unit clustering using sequence similarity, Biofilms and Microbiomes, 2016. [52] G. Marçais, A. Delcher, A. Phillippy, R. Coston, S. Salzberg and A. Zimin, MUMmer4: A fast and versatile genome alignment system, PLoS Computational Biology, 2018. [53] R. Vidal, Subspace clustering, IEEE Signal Processing Magazine, 2011. [54] G. Liu, Z. Lin and Y. Yu, Robust subspace segmentation by low-rank representation, International Conference on Machine Learning, 2010. [55] G. Liu, Z. Lin, S. Yan, J. Sun, Y. Yu and Y. Ma, Robust recovery of subspace structures by low-rank representation, IEEE Transactions on Pattern Analysis and Machine Intelligence, 2013. [56] M. Soltanolkotabi, E. Elhamifar and E. Candès, Robust subspace clustering, Annals of Statistics, 2014. [57] C. Qu and H. Xu, Subspace clustering with irrelevant features via robust Dantzig selector, Advances in Neural Information Processing Systems, 2015. [58] X. Peng, Z. Yi and H. Tang, Robust subspace clustering via thresholding ridge regression, AAAI Conference on Artificial Intelligence, 2015. [59] Y. Wang and H. Xu, Noisy sparse subspace clustering, International Conference on Machine Learning, 2013. [60] Y. Wang, Y. Wang and A. Singh, Differentially private subspace clustering, Advances in Neural Information Processing Systems, 2015. [61] H. Hu, J. Feng and J. Zhou, Exploiting unsupervised and supervised constraints for subspace clustering, IEEE Transactions on Pattern Analysis and Machine Intelligence, 2015. [62] E. Elhamifar and R. Vidal, Sparse subspace clustering: algorithm, theory, and applications, IEEE Transactions on Pattern Analysis and Machine Intelligence, 2013. [63] R. Basri and D. Jacobs, Lambertian reflectance and linear subspaces, IEEE Transactions on Pattern Analysis and Machine Intelligence, 2003. [64] A. Georghiades, P. Belhumeur and D. Kriegman, From few to many: Illumination cone models for face recognition under variable lighting and pose, IEEE Transactions on Pattern Analysis and Machine Intelligence, 2001. [65] K. Toyama, J. Krumm, B. Brumitt and B. Meyers, Wallflower: principles and practice of background maintenance, IEEE International Conference on Computer Vision, 1999. Dataset available at: http://research.microsoft.com/en-us/um/people/jckrumm/wallflower/testimages.htm [66] L. Li, W. Huang, I. Gu and Q. Tian, Statistical modeling of complex backgrounds for foreground object detection, IEEE Transactions on Image Processing, 2004. Dataset available at: http://perception.i2r.a-star.edu.sg/bk_model/bk_index.html [67] A. Dempster, N. Laird and D. Rubin, Maximum likelihood from incomplete data via the EM algorithm, Journal of the Royal Statistical Society, 1977. [68] M. Tipping and C. Bishop, Mixtures of probabilistic principal component analysers, Neural Computation, 1999. [69] X. Yi, C. Caramanis and S.
Sanghavi, Alternating Minimization for Mixed Linear Regression, International Conference on Machine Learning, 2014.