Linear-Sample Learning of Low-Rank Distributions

Ayush Jain and Alon Orlitsky
Dept. of Electrical and Computer Engineering, University of California, San Diego
{ayjain, aorlitsky}@eng.ucsd.edu

Abstract

Many latent-variable applications, including community detection, collaborative filtering, genomic analysis, and NLP, model data as generated by low-rank matrices. Yet despite considerable research, except for very special cases, the number of samples required to efficiently recover the underlying matrices has not been known. We determine the onset of learning in several common latent-variable settings. For all of them, we show that learning k × k, rank-r, matrices to normalized L1 distance ε requires Ω(kr/ε²) samples, and propose an algorithm that uses Õ(kr/ε²) samples, a number linear in the high dimension, and nearly linear in the, typically low, rank. The algorithm improves on existing spectral techniques and runs in polynomial time. The proofs establish new results on the rapid convergence of the spectral distance between the model and observation matrices, and may be of independent interest.

1 Introduction

1.1 Motivation

A great many scientific and technological applications concern relations between two objects that range over large domains, yet are linked via a low-dimensional latent space. Often this relation is precisely, or nearly, linear, and hence can be modeled by a low-rank matrix. The problem of recovering low-rank model matrices from the observations they generate has therefore been studied extensively. Following are five of the most common settings considered, each introduced via a typical application.

Distribution matrices  In probabilistic latent semantic analysis, samples are co-occurrences (w, d) of words and documents, assumed independent given one of r latent topic classes t [Hof99]. The joint probability matrix [p_{w,d}] is therefore a mixture of at most r product matrices p(d|t)·p(w|t), and has rank at most r. This setting also arises in many hidden-Markov applications, e.g., [MR05, HKZ12]. For this setting, the model matrix is a distribution, hence its elements are non-negative and sum to 1. The model is sampled independently n times, and X_{i,j} is the number of times pair (i, j) was observed.

In the remaining settings, the model matrix consists of arbitrary parameters, it is sampled just once, and each parameter is reflected in an independent observation X_{i,j}.

Poisson Parameters  Recommendation systems infer consumer preferences from their consumption patterns. The number of times customer i purchases product j is typically modeled as an independent Poisson random variable X_{i,j} ~ Poi(λ_{i,j}). Often λ_{i,j} is the inner product of the consumer's disposition towards, and the product's expression of, r latent features [SKKR00], so the parameter matrix [λ_{i,j}] has rank at most r.

Bernoulli Parameters  In inhomogeneous Erdős-Rényi graphs, the edge between nodes i and j is an independent random variable X_{i,j} ~ Ber(p_{i,j}). In community detection, the Stochastic Block Model (SBM), e.g., [Abb17, MNS18, ABH15, BLM15], assumes that graph nodes fall into a few communities C_1, ..., C_r such that p_{i,j} is one constant if i and j are in the same community, and a different constant if they are in two distinct communities. Clearly, the parameter matrix [p_{i,j}] has rank at most r.
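As a small numerical illustration of this last point (our own sketch, not part of the paper), the SBM interaction-probability matrix indeed has rank at most the number of communities:

```python
import numpy as np

# SBM sketch: r communities over k nodes, edge probability p_in within a
# community and p_out across communities (the values here are arbitrary).
k, r, p_in, p_out = 12, 3, 0.6, 0.1
labels = np.random.default_rng(1).integers(r, size=k)   # community of each node
same = labels[:, None] == labels[None, :]
P = np.where(same, p_in, p_out)                          # parameter matrix [p_ij]

print(np.linalg.matrix_rank(P))                          # at most r (here, 3)
```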
Binomial Parameters  The probability that gene pair (i, j) will express as a phenotype is often viewed as the result of a few factors, resulting in an expression-probability matrix [p_{i,j}] of rank at most r [KMA16]. In a study of t phenotypic patients, the number X_{i,j} of patients expressing gene pair (i, j) will therefore be distributed binomially, X_{i,j} ~ Bin(t, p_{i,j}).

Collaborative filtering  Let Y_{i,j} ∈ [0, 1] be the random rating user i assigns to movie j. If the ratings are based on a small number r of intrinsic film features, the matrix [F_{i,j}] of expected ratings F_{i,j} = E[Y_{i,j}] may be viewed as having low rank r. Since only some ratings are reported, the matrix-completion and collaborative-filtering literature, e.g., [BCLS17], assumes that 1_{i,j} ~ Ber(p) are independent indicator random variables, and upon observing X_{i,j} = 1_{i,j}·Y_{i,j} for all (i, j), we wish to recover the mean matrix [F_{i,j}].

Each of these five settings has been studied in many additional contexts, including word embedding [SCH15], genomic analysis [ZLWY14, ZHPA13], and more [AGH+14]. As observed above, in all these settings the observations are generated by low-rank matrices. In the rest of the paper we show that relatively few observations suffice to learn the underlying matrices.

1.2 Unified formulation

We first unify the five settings, facilitating the interpretation of their matrix norm as the number of observations, and allowing us to subsequently show that all models have essentially the same answer.

Let R^{k×m} be the collection of k × m real matrices, and let R^{k×m}_r be its subset of matrices with rank at most r. Both our lower and upper bounds for recovering k × m matrices depend only on the larger of k and m. Hence, without loss of generality, we assume square model matrices of size k × k.

Note that in the first four models, X_{i,j} is the number of times pair (i, j) was observed. For example, in the Poisson model it is the number of times customer i bought product j. Hence ||X||_1 = Σ_{i,j} X_{i,j} is the total number of observations. In collaborative filtering, X_{i,j} = 1_{i,j}·Y_{i,j} does not carry the same interpretation; still, ||X||_1 = Σ_{i,j} X_{i,j} concentrates around p·k²·F_avg, where F_avg = (1/k²)·Σ_{i,j} F_{i,j}. Since F_{i,j} ∈ [0, 1], F_avg is typically a constant, hence ||X||_1 is proportional to the expected number of observations, p·k².

We therefore scale the matrix M so that M_{i,j} = E[X_{i,j}] reflects the expected number of observations of pair (i, j). We let M be n·[p_{i,j}] for distribution matrices, [λ_{i,j}] for Poisson parameters, [p_{i,j}] for Bernoulli parameters, [t·p_{i,j}] for binomial parameters, and [p·F_{i,j}] for collaborative filtering. Consequently, for all models M = E[X], hence ||M||_1 = ||E[X]||_1 is the expected number of observations.

Let M ∈ R^{k×k}_r be an unknown model matrix for one of the five settings, and let X ~ M be a resulting observation matrix. We would like to recover M from X. A model estimator is a mapping M^est : R^{k×k} → R^{k×k} that associates with an observation X an estimated model matrix M^est := M^est(X).

Different communities have used different measures of how well M^est approximates M. For example, community detection concerns the recovery of labels, a criterion quite specific to that particular application. Many other works considered recovery in squared-error, or Frobenius, norm, but as argued in Section 1.4, this measure is less meaningful for the applications we consider. Perhaps the most apt estimation measure, and the one we adopt, is L1 distance, the standard machine-learning accuracy measure.
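To make the unified scaling concrete, the following minimal sketch (the helper names and the simplifications in it are ours, not the paper's) draws an observation matrix X with E[X] = M for each setting:

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_observations(M, model, **kw):
    """Draw X with E[X] = M under the scaling of Section 1.2 (illustrative only)."""
    if model in ("distribution", "poisson"):   # M = n*[p_ij] (Poissonized) or M = [lambda_ij]
        return rng.poisson(M)
    if model == "bernoulli":                   # M = [p_ij]
        return rng.binomial(1, M)
    if model == "binomial":                    # M = t*[p_ij]; pass t via kw
        t = kw["t"]
        return rng.binomial(t, M / t)
    if model == "collaborative":               # M = p*[F_ij]; pass p via kw;
        p = kw["p"]                            # ratings taken deterministic at their mean F
        F = M / p
        return rng.binomial(1, p, size=M.shape) * F
    raise ValueError(model)

# Example: a rank-2 Poisson parameter matrix with ||M||_1 = k * n_avg.
k, r, n_avg = 100, 2, 5.0
M = rng.random((k, r)) @ rng.random((r, k))
M *= k * n_avg / M.sum()                       # scale so that ||M||_1 = k * n_avg
X = sample_observations(M, "poisson")
print(X.sum(), M.sum())                        # ||X||_1 concentrates around ||M||_1
```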
L1 distance arises naturally in numerous applications and is the main criterion used for learning distributions. Since M may not have unit norm, we define the loss of the estimate M^est as the normalized L1 distance between M and M^est,
L(M^est) := L_M(M^est) := ||M^est − M||_1 / ||M||_1.
Note that for distribution matrices this reduces to the standard L1 norm. Also, similar to total variation distance, L(M^est) upper bounds the absolute difference between the expected and predicted number of occurrences of pairs (i, j) in any subset S ⊆ [k] × [k], normalized by the total number of observations.

While the multitude of applications has drawn a considerable amount of work on these latent-variable models, most of this work either assumes that samples are plentiful, well beyond the information-theoretic limit, or imposes assumptions on the model matrix stronger than just low rank, thereby limiting its applicability. We show that in all these models, recovery is possible with a number of samples only linear in k and near-linear in r, with no assumptions on the low-rank matrix M beyond its rank.

1.3 Overview of the background, main results and techniques

Recovery of distribution matrices (the first model) in L1 distance was first addressed in [HKKV18]. They considered matrices M that can be factored as U·W·U^T, where W is r̂ × r̂ and positive semi-definite, and U is non-negative. They derived a polynomial-time algorithm that requires O(w_M·k·r̂²/ε⁵) samples, where w_M ≥ r̂², hence at best guaranteeing sample complexity O(k·r̂⁴/ε⁵), and potentially much higher. Note, however, that their definition applies only to positive semi-definite, and not general, matrices M, and even then r̂ is at least the rank of M, and can be significantly higher.

We first lower bound the loss of any estimator. An array of kr elements can be viewed as a special case of a k × k matrix of rank r: simply place the array's kr elements in the matrix's first r rows, and set the remaining rows to zero. A well-known lower bound for learning discrete distributions, or arrays of Poisson parameters, in L1 distance [KOPS15, HJW15] therefore implies the following. Recall that X ~ M means that the observation matrix X was generated by an underlying model matrix M, with E[X] = M. Unless specified otherwise, the results apply to all five models described in Section 1.1.

Theorem 1. For any k, r, ε < 1, and M ∈ R^{k×k}_r, let X ~ M via the Poisson-parameter or distribution-matrix model. Then for any, possibly random, estimator M^est,
sup_{M ∈ R^{k×k}_r : ||M||_1 ≤ kr/ε²} E_{X~M}[L(M^est(X))] = Ω(ε).

The bound implies that achieving expected normalized L1 error < ε requires an expected number of observations ||M||_1 = Ω(kr/ε²). Equivalently, any estimator incurs an expected normalized L1 error of at least Ω(min{√(kr/||M||_1), 1}).

Our main result is a polynomial-time algorithm, Curated SVD, that returns an estimate M^cur := M^cur(X) that essentially achieves the above lower bound for all five models and all matrices M.

Theorem 2. Curated SVD runs in polynomial time, and for every k, r, ε > 0, and M ∈ R^{k×k}_r with ||M||_1 ≥ (kr/ε²)·log²(r/ε), if X ~ M, then with probability ≥ 1 − k^{−2}, L(M^cur(X)) = O(ε).

A few observations are in order. While [HKKV18] provides weaker guarantees, and only for a special subclass of matrices, Curated SVD essentially achieves the lower bound for all matrices. It also holds for all five models. For any constant ε, it recovers M with O(kr·log² r) observations. This number is linear in the large matrix dimension k, and near-linear in the typically small rank r. This is the first such result for general matrices.
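For concreteness, the normalized L1 loss, and the array-to-matrix embedding behind the lower bound, can be written as follows (a minimal numpy sketch; the function names are ours):

```python
import numpy as np

def normalized_l1_loss(M_est, M):
    """L(M_est) = ||M_est - M||_1 / ||M||_1, the loss used throughout the paper."""
    return np.abs(M_est - M).sum() / np.abs(M).sum()

def array_as_rank_r_matrix(a, k, r):
    """Embed kr array elements into a k x k matrix of rank at most r:
    they fill the first r rows, the remaining rows are zero."""
    M = np.zeros((k, k))
    M[:r, :] = np.asarray(a).reshape(r, k)
    return M

# Learning the kr parameters of the embedded array is no harder than learning a
# rank-r matrix, which is how Theorem 1 inherits the distribution-learning bound.
k, r = 6, 2
M = array_as_rank_r_matrix(np.random.rand(k * r), k, r)
assert np.linalg.matrix_rank(M) <= r
```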
In many applications, only a small number of observations is available per row and column. For example, on average each viewer may rate only a few movies, and a person typically has few friends. Hence the number of observations is near-linear in the dimension k. Our results are the first to enable learning general low-rank matrices in these regimes.

With n samples, general discrete distributions over k elements can be learned to L1 distance Θ(√(k/n)). Theorems 1 and 2 show that essentially the same result holds for L1 learning of low-rank matrices: the number of parameters is kr, the number of observations is ||M||_1, and the normalized L1 error is between √(kr/||M||_1) and √(kr/||M||_1)·log(r·||M||_1/k).

To obtain these results we generalize a recent work [LLV17] that bounds the spectral distance between M and X. Their bound requires the strong condition that each entry of the corresponding Bernoulli parameter matrix M be within a constant factor of the average. The paper asked whether such results can be achieved for more general sparse graphs, possibly with the aid of regularization. We provide a counterexample showing that such strong guarantees cannot hold for sparse graphs. Instead, we derive a new spectral result (Theorem 3) that helps recover M even when few observations are available, and that may be of independent interest. The result applies to all models, not just Bernoulli, and shows that even when the number of observations is only linear in k, zeroing out a small submatrix in M and X, and then regularizing the two matrices, results in a small spectral distance between them. This is the first result to imply nontrivial concentration in spectral norm, in the linear-sample regime, for general matrices in any of these settings. We believe this new result could potentially imply learning in the sparse-sample regime for other settings not explicitly considered here.

Theorem 3 (Informal, full version in Section 3). There is a small unknown set of rows that, when zeroed out from regularized versions of X and M, results in a small spectral distance between them.

Although this set of rows is unknown, we derive an algorithm that recovers M to a small L1 distance.

Curated SVD (Informal, full version in Section 4). Successively zeroes out a few suspicious rows, ensuring that no small set of rows of X has too large an influence on the recovered matrix.

We show that Theorem 3 implies that Curated SVD achieves the recovery guarantees in Theorem 2.

1.4 Implications of the results and related work

As mentioned in the introduction, many communities have considered recovering model matrices from samples. Here we describe a few more of the works most relevant to this paper.

In community detection, the goal is to infer communal structure from pairwise interactions between individuals. Much work has focused on the Stochastic Block Model (SBM), where individuals fall into a few communities C_1, ..., C_r, and the interaction probabilities p_{i,j} are one constant if i and j are in the same community, and a different constant if they are in two distinct communities. Precise guarantees were provided for both exact recovery and detection [Abb17, AS15, ABH15, BLM15], even in the sparse regime. However, note that the SBM allows only interaction-probability matrices that are a very special case of low-rank matrices, not general low-rank matrices.
A more general mixed-membership SBM associates each individual with an unknown r-dimensional vector reflecting weighted membership in each of r communities, and the pairwise interaction probability is determined by the inner product of the respective membership vectors. The resulting interaction-probability matrix is a rank-r Bernoulli parameter matrix. Recently, [HS17] considered a Bayesian setting where the resulting membership weights are both sparse and evenly distributed, and achieved weak detection with only O(kr²) samples. The additional assumptions needed in [HS17] imply that the entries of the resulting model matrix are within a constant factor of their average. By contrast, our analysis applies to all low-rank matrices M, and yields L1 recovery using O(kr·log² r) samples. We note that the goal in the community-detection setting is somewhat more specific than ours, and the two guarantees may not be directly comparable.

Another recent work [MD19] considered recovering Poisson and distribution matrices under the Frobenius norm. They showed that matrices M with moderately-sized entries can be recovered with O(k·log^{3/2} k) samples, but the error depends on M's maximum row and column sum. However, they note that the Frobenius norm "might not always be the most appropriate error metric" and point out that the L1 norm is "much stronger" for these settings. A similar sentiment about L1 is echoed by [HKKV18] in relation to spectral norms. By contrast, our results apply to the L1 norm, grow only linearly with k, and the error depends on M's average, not maximum, row and column sum.

Our collaborative-filtering setting uses the same general bounded-noise model as [BCLS17]. However, they assume that the mean matrix F is generated by a Lipschitz latent-variable model. They recover F to mean squared error Σ_{i,j}(F_{i,j} − F̂_{i,j})²/k² = O(r²/(pk)^{2/5}), implying that recovering F requires pk² = Ω(r⁵·k) samples. They also provide a nice survey of related matrix-completion work, and show that their result is the first to achieve recovery linear in k for the general bounded-noise model. By contrast, we make no additional assumptions, and recover F to L1 distance with O(kr·log² r) samples. Note also that 0 ≤ F_{i,j} ≤ 1, hence |F_{i,j} − F̂_{i,j}| ≥ |F_{i,j} − F̂_{i,j}|². Therefore the normalized L1 error upper bounds the mean squared error. In Appendix D we show that our estimator achieves a better error bound, Σ_{i,j} |F_{i,j} − F̂_{i,j}|/k² = O(√(r/(pk))·log(rpk)), even for the stronger L1 norm.

Learning latent-variable models has been addressed in several other communities that typically focus on computational efficiency when data is abundant. This includes work on topic modeling [AGH+13, KW17, BBW18], hidden Markov models [HKZ12, MR05, AGH+14], word embedding [SCH15], and Gaussian mixture models [Das99, GHK15, VW04].

1.5 Arrangement of the paper

The remainder of the paper is organized as follows. Section 2 defines some useful notation and recalls some useful properties of these models. Section 3 defines the regularization and establishes bounds on the regularized spectral distance between X and M. Section 4 describes the Curated SVD algorithm that recovers M and gives an overview of its analysis.

2 Preliminaries

2.1 Notation

We will use the following formulation for the rank, singular values, and decompositions of a matrix.
Every matrix A ∈ R^{k×m} can be expressed in terms of its singular-value decomposition (SVD), A = Σ_{i=1}^{min{k,m}} σ_i·u_i·v_i^T, where the singular values σ_i := σ_i(A) are non-increasing and non-negative, and the right singular vectors v_i := v_i(A) are orthogonal, as are the left singular vectors u_i := u_i(A). A matrix has rank r iff its first r singular values are positive and the rest are zero. A t-truncated SVD of a matrix is one where only the first t singular values in the SVD are retained and the rest are discarded. For t ≤ min{k, m}, let A^{(t)} denote the t-truncated SVD of a matrix A ∈ R^{k×m}; it is easy to see that the rank of A^{(t)} is min{t, rank(A)}.

The L1 "entry-wise" norm, or L1 norm, of a matrix A is ||A||_1 := Σ_{i,j} |A_{i,j}|. Let ||A_{i,·}||_1 := Σ_j |A_{i,j}| and ||A_{·,j}||_1 := Σ_i |A_{i,j}| denote the L1 norm of the i-th row and j-th column of A, respectively. The L2 norm of a vector v = (v(1), ..., v(m)) ∈ R^m is ||v|| := √(Σ_{i=1}^m v(i)²). The spectral norm, or norm for short, of a matrix A ∈ R^{k×m} is ||A|| := max_{v ∈ R^m : ||v||=1} ||Av|| = max_{u ∈ R^k : ||u||=1} ||A^T u||.

2.2 A unified framework

We first describe a common unified framework for all five problems. To unify distribution matrices with Poisson-parameter matrices, we apply the well-known Poisson trick [Szp01], where instead of a fixed sample size n, we take Poi(n) samples. The resulting random variables X_{i,j} are Poisson and independent. Furthermore, since with probability ≥ 1 − 1/n³ the difference between n and Poi(n) is O(√(n·log n)), this modification contributes only smaller-order terms, and the algorithm and guarantees for Poisson parameters carry over to distribution matrices.

Having unified distribution and Poisson-parameter matrices, we focus on the remaining four settings. In all of them, the observations X_{i,j} are independent non-negative random variables with E[X_{i,j}] = M_{i,j} and Var(X_{i,j}) ≤ M_{i,j}. The last inequality clearly holds for the Poisson, Bernoulli, and binomial matrices. For collaborative filtering it follows since Var(X_{i,j}) ≤ E[X_{i,j}²] ≤ E[X_{i,j}] = M_{i,j}.

Define the noise matrix N := X − M as the difference between X and its expectation M. Note that the spectral distance between X and M is the same as the spectral norm of the noise matrix N. Let n_avg := ||M||_1/k denote the average expected number of observations in each row and column. For simplicity, we assume that ||M||_1, the total expected number of observations, is known. Otherwise, since E[||X||_1] = ||M||_1, it can be estimated very accurately.

3 Spectral norm of the regularized noise matrix

Recall that the noise matrix is N = X − M, and its spectral norm ||N|| is the spectral distance between the observation matrix X and M. When ||N|| = O(√n_avg), a simple truncated SVD of X can be shown to recover M. Unfortunately, ||N|| can be ≫ √n_avg, and one reason for this is that some row or column sums of M may far exceed the average. This is because the expected squared norm of row i of N is Σ_j Var(X_{i,j}), which is roughly ||M_{i,·}||_1, and if for some row i of M the row sum ||M_{i,·}||_1 is much larger than the average value n_avg across the rows, then ||N|| will be large as well. Similarly for the columns.

The difficulty caused by the heavy rows and columns of M can be mitigated by a regularization that reduces their weight. Let w = (w_f, w_b), where w_f = (w_f(1), ..., w_f(k)) and w_b = (w_b(1), ..., w_b(k)) are the row and column regularization weights, all at least 1, and let D(u) be the diagonal matrix with entries u(i).
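The following numpy sketch (our own code, for illustration) collects these basic operations; the regularization defined next is built from them:

```python
import numpy as np

def truncated_svd(A, t):
    """A^(t): keep only the t largest singular values of A."""
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    return U[:, :t] @ np.diag(s[:t]) @ Vt[:t, :]

def l1_norm(A):
    """Entry-wise L1 norm ||A||_1 = sum_{i,j} |A_ij|."""
    return np.abs(A).sum()

def spectral_norm(A):
    """||A||, the largest singular value of A."""
    return np.linalg.norm(A, 2)

def D(u):
    """Diagonal matrix with entries u(i)."""
    return np.diag(u)

# Noise matrix and average expected row/column weight of Section 2.2,
# for an observation matrix X with E[X] = M (see the sampling sketch in Section 1.2).
def noise_and_navg(X, M):
    N = X - M
    n_avg = l1_norm(M) / M.shape[0]     # n_avg = ||M||_1 / k
    return N, n_avg
```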
The w-regularized version of A ∈ R^{k×k} is
R(A, w) := D^{−1/2}(w_f) · A · D^{−1/2}(w_b).
Upon multiplying the i-th row of matrix A by (w_f(i))^{−1/2}, its expected squared norm shrinks by a factor of w_f(i), and similarly for the columns. Therefore, selecting regularization weights w̄ = (w̄_f, w̄_b), where w̄_f(i) = max{1, ||M_{i,·}||_1/n_avg} and w̄_b(j) = max{1, ||M_{·,j}||_1/n_avg}, would reduce the expected squared norm of the heavy rows and columns of R(N, w̄) to n_avg.

Unfortunately, the ||M_{i,·}||_1's and ||M_{·,j}||_1's are not known, hence neither is w̄. However, E[Σ_j X_{i,j}] = ||M_{i,·}||_1, hence we approximate w̄ by ŵ = (ŵ_f, ŵ_b), where
ŵ_f(i) := max{1, ||X_{i,·}||_1/n_avg} and ŵ_b(j) := max{1, ||X_{·,j}||_1/n_avg}.
Unless specified otherwise, we use the weights ŵ, and refer to them simply as weights. This is one of several commonly used regularizations in spectral methods for community detection [CCT12, QR13, JY13]. When n_avg = o(log k), some rows and columns may have regularization weights below the ideal weights, ŵ_f(i) < w̄_f(i) and ŵ_b(j) < w̄_b(j), and will not be regularized properly. Yet, as shown in Theorem 3, our technique can handle these problematic rows and columns as well.

The regularized spectral distance between two matrices A and B is ||R(A − B, w)||, the spectral norm of their regularized difference. Lemma 4 in the next section relates ||R(N, ŵ)|| to L1 recovery guarantees, and implies that when ||R(N, ŵ)|| = O(√n_avg), a simple variation of truncated SVD can recover M from X up to the minimax lower bound on L1 recovery in Theorem 1. When the number of samples is at least a few log k factors more than k, this spectral concentration can probably be achieved with the help of the regularization. The more interesting, challenging, and prevalent setting is when few observations are available, and this is the main focus of the paper.

For Bernoulli parameter matrices, [LLV17] obtained the tight bound ||R(N, ŵ)|| = O(√n_avg), which holds even for sparse graphs, or equivalently few observations. However, it requires that every M_{i,j} = O(||M||_1/k²), hence it holds only for a very limited and often impractical subclass of parameter matrices M. They posed the question of whether this bound also holds for general M. In Appendix F, we provide a counterexample that answers the question in the negative: we construct an explicit Bernoulli parameter matrix M such that w.h.p. ||R(N, ŵ)|| = Ω(n_avg), much larger than O(√n_avg).

Yet upper bounding ||R(N, ŵ)|| is just one approach to achieving optimal sample complexity. One of this paper's contributions is an alternative approach that decomposes the noise matrix N into two parts: a large part with small spectral norm, and a small part, in fact a submatrix, that may have a large spectral norm. While we cannot identify the "noisy" part, as shown in Section 4, we can ensure that no small part has a large influence on the estimate.

The next theorem establishes the above partition for all parameter matrices, and for all settings. To specify the matrix decomposition, both here and later, for A ∈ R^{k×m} and a subset S ⊆ [k] × [m], let A_S be the projection of matrix A onto S, which agrees with A on indices in S and is zero elsewhere. Further, for a row subset I ⊆ [k], let A_I := A_{I×[m]} and A_{I^c} := A_{([k]\I)×[m]} be the matrices derived from A by zeroing out all rows outside, and inside, the set I, respectively. For the weights ŵ = (ŵ_f, ŵ_b) above, let ŵ_f(I) := Σ_{i∈I} ŵ_f(i) denote the weight of a row subset I. In the next theorem and thereafter, we refer to the rows that need to be zeroed out to achieve spectral concentration as contaminated rows.
Theorem 3. For X ~ M and any ε ≥ (1/n_avg)·max{(log⁴ k)/k, exp(−n_avg/8)}, with probability ≥ 1 − 6k^{−3}, there is a row subset I_cn ⊆ [k] of possibly contaminated rows with weight ŵ_f(I_cn) ≤ εk and
||R(N, ŵ)_{I_cn^c}|| = O(√n_avg · log(2/ε)).

Since ŵ_f(i) is the maximum of 1 and the number of observations in row i divided by n_avg, the theorem shows that for some I_cn ⊆ [k] with only a few rows, X_{I_cn} contains only a few observations, and zeroing out the rows I_cn from the regularized noise matrix R(N, ŵ) results in a small spectral norm. The above result is the first to imply nontrivial concentration in spectral norm, in the linear-sample regime, for general matrices in any of the settings we consider. We derive an algorithm that uses this result to provably recover low-rank parameter matrices, even when the number of samples is only O(k).

To prove Theorem 3, we extend a technique used in [LLV17] for specialized Bernoulli matrices, with entries all below O(||M||_1/k²), to bound the spectral noise norm for general models and matrices. In Appendix E, we use standard probabilistic methods and concentration inequalities to establish concentration in the ℓ∞→ℓ2 norm for all submatrices of R(N, ŵ). We then recursively apply a form of Grothendieck–Pietsch factorization [LT13] and incorporate these bounds to partition R(N, ŵ) into successively smaller submatrices and upper bound their spectral norms, until the remaining submatrix is very small. Finally, we show that the squared spectral norm of any matrix is at most the sum of the squared spectral norms of its decomposition parts, and thus upper bound the spectral norm of R(N, ŵ), except for the small excluded submatrix.

4 Recovery Algorithms

We start by describing a simple generalization of truncated SVD, on which our algorithm builds. Recall that for any A ∈ R^{k×k}, R(A, w) denotes the regularized matrix A, and that R(A, w)^{(t)} is its rank-t truncated SVD. For any t > 0, regularization weights w, and matrix A, let the (t, w)-SVD of A be the de-regularized, rank-t-truncated SVD of the regularized matrix A,
A^{(t,w)} := D^{1/2}(w_f) · R(A, w)^{(t)} · D^{1/2}(w_b).

Let A be rank-r and B be any matrix. The next lemma bounds the L1 distance between A and B^{(r,w)} in terms of the regularized spectral distance between A and B. Appendix G provides a simple proof.

Lemma 4. For any rank-r matrix A ∈ R^{k×k}_r, matrix B ∈ R^{k×k}, and weights w = (w_f, w_b),
||A − B^{(r,w)}||_1 ≤ O(√(r · (Σ_i w_f(i)) · (Σ_j w_b(j)))) · ||R(A − B, w)||.

Recall that N = X − M. The lemma implies that when ||R(N, ŵ)|| is small, X^{(r,ŵ)}, the (r, ŵ)-SVD of X, would recover M. However, ||R(N, ŵ)|| could be large. Instead, Theorem 3 in the last section implies that w.h.p. the following essential property holds.

Essential property. There is an unknown contaminated set I_cn ⊆ [k] such that after zeroing out all rows I_cn in X and M, their regularized spectral distance is at most
||R(N, ŵ)_{I_cn^c}|| ≤ τ := O(√n_avg · log(r·n_avg)),   (1)
and the weight of the set I_cn is at most
ŵ_f(I_cn) ≤ W_cn := O(k/(r·n_avg)²).

The above property holds with probability ≥ 1 − 6k^{−3}, by choosing ε = 1/(r·n_avg)² in Theorem 3. As the property holds with high probability, the remainder of this section assumes that it holds. This property implies a small regularized spectral distance between X_{I_cn^c} and M_{I_cn^c}, and since ŵ_f(I_cn) ≤ W_cn, the noisier part of the observation matrix, namely X_{I_cn}, is limited to just a few rows and observations. Recall that the (r, w)-SVD of any matrix A is the de-regularized rank-r-truncated SVD of the regularized matrix R(A, w).
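The regularization weights and the (t, w)-SVD can be sketched as follows (our own numpy code, reusing M, X, and r from the earlier sketches; it illustrates the definitions, not the paper's implementation):

```python
import numpy as np

def weights(X, n_avg):
    """Data-driven weights of Section 3: w_f(i) = max{1, ||X_i,.||_1 / n_avg}, similarly w_b."""
    wf = np.maximum(1.0, X.sum(axis=1) / n_avg)   # X is entry-wise non-negative in all five models
    wb = np.maximum(1.0, X.sum(axis=0) / n_avg)
    return wf, wb

def regularize(A, wf, wb):
    """R(A, w) = D^{-1/2}(w_f) A D^{-1/2}(w_b)."""
    return A / np.sqrt(wf)[:, None] / np.sqrt(wb)[None, :]

def t_w_svd(A, t, wf, wb):
    """(t, w)-SVD: de-regularized, rank-t-truncated SVD of the regularized matrix."""
    U, s, Vt = np.linalg.svd(regularize(A, wf, wb), full_matrices=False)
    Rt = U[:, :t] @ np.diag(s[:t]) @ Vt[:t, :]
    return np.sqrt(wf)[:, None] * Rt * np.sqrt(wb)[None, :]

# Without contaminated rows, the plain (r, w)-SVD of X already approximates M;
# Curated SVD, described next, additionally guards against contaminated rows.
# wf, wb = weights(X, M.sum() / M.shape[0])
# M_hat = t_w_svd(X, r, wf, wb)
```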
Equation (1) and Lemma 4 imply that a simple (r, ŵ)-SVD of X_{I_cn^c} would recover M_{I_cn^c} to a small error, and, since the rows I_cn of X contain only a few observations, recover M as well. But the set I_cn of contaminated rows is unknown. Building on this simple (r, ŵ)-SVD, we derive our main algorithm, Curated SVD, which achieves the same performance guarantee up to a constant factor, even when the set I_cn is unknown. In Curated SVD, for every row subset I ⊆ [k] with weight ŵ_f(I) ≤ W_cn, we limit the maximum impact the submatrix R(X, ŵ)_I can have on the truncated SVD of the regularized observation matrix R(X, ŵ). In particular, this limits the impact of the unknown noisier submatrix R(X, ŵ)_{I_cn}.

To describe the algorithm, we first formally define the impact of a row subset on the SVD components. For a general matrix A ∈ R^{k×k}, let A = Σ_{j=1}^k σ_j·u_j·v_j^T be its SVD. Then for any i, j ∈ [k], let the impact of row i of A on the j-th component of the SVD of A be H(A, i, j) := σ_j²·u_j(i)². Similarly, for a row subset I, let the impact of A_I, or simply the impact of I, on the j-th component of the SVD of A be H(A, I, j) := Σ_{i∈I} H(A, i, j), the sum of the impacts of its rows. From the standard properties of the SVD, it is easy to see that H(A, I, j) = σ_j²·Σ_{i∈I} u_j(i)² = ||A_I·v_j||².

Next, we present an overview of the algorithm and its analysis. Essentially, Curated SVD finds a set I_zr ⊆ [k] of row indices such that the following objectives are fulfilled:
(i) Let (R(X, ŵ)_{I_zr^c})^{(2r)} = Σ_{j=1}^{2r} σ_j·u_j·v_j^T be the rank-2r-truncated SVD of the regularized observation matrix upon zeroing out the rows I_zr. For every row subset I of weight ŵ_f(I) ≤ W_cn and every j ∈ [2r], the impact of I on the j-th component is H(R(X, ŵ)_{I_zr^c}, I, j) = σ_j²·Σ_{i∈I} u_j(i)² ≤ 16τ².
(ii) The total weight of the set I_zr is small, ŵ_f(I_zr) = Σ_{i∈I_zr} ŵ_f(i) = O(k/n_avg).

After finding such a row subset I_zr, Curated SVD zeroes out the rows I_zr from X and simply returns the (2r, ŵ)-SVD of X_{I_zr^c} as the estimate of M. Objective (i) ensures that for every collection of rows I of weight ŵ_f(I) ≤ W_cn, the impact of R(X, ŵ)_I on each of the first 2r components of the SVD is small. As the weight of I_cn is at most W_cn, this in particular limits the impact of the noisier submatrix R(X, ŵ)_{I_cn}. Objective (ii) ensures that the weight of I_zr, and hence the number of observations in X_{I_zr} that are ignored in the final truncated regularized SVD, is small. This limits the loss due to the ignored rows I_zr. Lemma 13 in the appendix uses these two observations to show that the (2r, ŵ)-SVD of X_{I_zr^c} recovers M.

Next, we describe the algorithm Curated SVD and show that it finds a set I_zr that achieves Objectives (i) and (ii). The pseudo-code of the algorithm is in Appendix B.

4.1 Description of Curated SVD

Curated SVD starts with I_zr = ∅. In each iteration it calculates (R(X, ŵ)_{I_zr^c})^{(2r)} = Σ_{j=1}^{2r} σ_j·u_j·v_j^T, the rank-2r-truncated SVD of R(X, ŵ)_{I_zr^c}. Then it calls the subroutine Row-Deletion for each j ∈ [2r]. Row-Deletion checks for row subsets I ⊆ I_zr^c with small weight and high impact that violate Objective (i). It then adds rows from such subsets I to I_zr, so as to reach Objective (i) in a way that I_zr does not end up too heavy. To do this, Row-Deletion tries to find a row subset I ⊆ I_zr^c with weight ŵ_f(I) ≤ W_cn and maximum impact H(R(X, ŵ)_{I_zr^c}, I, j) = σ_j²·Σ_{i∈I} u_j(i)². This, however, is essentially the well-known NP-hard 0/1-knapsack problem.
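In this knapsack instance, item i has value H(R(X, ŵ)_{I_zr^c}, i, j), weight ŵ_f(i), and the capacity is W_cn. The sketch below (our own illustration of a standard greedy 0.5-approximation, not necessarily the exact variant of [SCGDS92] used next) shows how the impacts and such an approximate maximizer can be computed:

```python
import numpy as np

def row_impacts(R_zeroed, j):
    """H(A, i, j) = sigma_j^2 * u_j(i)^2, the impact of each row i on the j-th SVD component."""
    U, s, _ = np.linalg.svd(R_zeroed, full_matrices=False)
    return (s[j] * U[:, j]) ** 2                   # one value per row i

def greedy_knapsack(values, weights, capacity):
    """Standard greedy 0.5-approximation for 0/1 knapsack: take the better of
    (a) items added in decreasing value/weight order while they still fit, and
    (b) the single most valuable item that fits."""
    chosen, total_w, total_v = [], 0.0, 0.0
    for i in np.argsort(-values / weights):
        if total_w + weights[i] <= capacity:
            chosen.append(int(i)); total_w += weights[i]; total_v += values[i]
    fits = np.where(weights <= capacity)[0]
    if len(fits) and values[fits].max() > total_v:
        chosen = [int(fits[np.argmax(values[fits])])]
    return chosen                                  # row subset I with weight at most the capacity
```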
Instead, we use a greedy 0.5-approximation algorithm [SCGDS92] for 0/1 knapsack to obtain a row subset I with weight ŵ_f(I) ≤ W_cn, such that its impact is at least half of the maximum possible, namely
H(R(X, ŵ)_{I_zr^c}, I, j) ≥ 0.5·max{ H(R(X, ŵ)_{I_zr^c}, I′, j) : I′ ⊆ I_zr^c, Σ_{i∈I′} ŵ_f(i) ≤ W_cn }.
If the impact of the row collection I found by the greedy algorithm is H(R(X, ŵ)_{I_zr^c}, I, j) ≤ 8τ², the sub-procedure Row-Deletion terminates. Otherwise, it adds a row i ∈ I to I_zr, choosing row i ∈ I with probability proportional to its impact-to-weight ratio, H(R(X, ŵ)_{I_zr^c}, i, j)/ŵ_f(i). Row-Deletion repeats this procedure on the remaining rows in I_zr^c until it terminates. After calling Row-Deletion for each j ∈ [2r], Curated SVD checks whether new rows were added to I_zr in this iteration, in which case it repeats the same procedure in the next iteration with the updated I_zr; otherwise it returns the (2r, ŵ)-SVD of X_{I_zr^c} as the estimate of M.

Curated SVD achieves Objective (i). The iterations of Curated SVD stop when, for the current choice of I_zr, Row-Deletion fails to add any row to I_zr for all j ∈ [2r]. This happens when, for each j ∈ [2r], the greedy approximation algorithm in Row-Deletion finds a row subset I with impact ≤ 8τ², which implies that the impact of every row subset I′ ⊆ I_zr^c of weight ŵ_f(I′) ≤ W_cn is at most 16τ². Therefore, the iterations of the algorithm stop only when I_zr meets Objective (i).

Curated SVD achieves Objective (ii) w.p. ≥ 1 − O(k^{−2}). This is the more challenging of the two objectives. The proof is based on the following key observation, proved in Appendix C.1: there exists a row subset I_hv such that for every subset I ⊆ I_hv^c of weight ŵ_f(I) ≤ W_cn, we have ||R(X_I, ŵ)|| ≤ 2τ; moreover, the weight of I_hv is ŵ_f(I_hv) = O(k/n_avg). It is easy to see that ||R(X_I, ŵ)||² upper bounds the impact the submatrix R(X_I, ŵ) can have on the components of the SVD. Therefore, from the previous observation, any row subset I of weight ŵ_f(I) ≤ W_cn with impact ≥ 8τ² must have more than half of its impact due to the rows in I ∩ I_hv. Recall that Row-Deletion adds a row from I to I_zr only when I has impact ≥ 8τ². We show that in each step, in expectation, it adds more weight to I_zr from I ∩ I_hv than from I ∩ I_hv^c. Using this, we show that the expected total weight of the rows added to I_zr is at most a constant times the weight of I_hv, that is, O(k/n_avg), and Objective (ii) is achieved with constant probability. Repeating the procedure O(log k) times on the same data, Objective (ii) holds with probability ≥ 1 − O(k^{−2}).

Finally, by combining Lemma 13 and the fact that Curated SVD achieves both objectives, we prove Theorem 2 in Appendix D.

Broader impact

This work does not present any foreseeable societal consequences.

Acknowledgements

We thank Vaishakh Ravindrakumar and Yi Hao for their helpful comments in the preparation of this manuscript. We are grateful to the National Science Foundation (NSF) for supporting this work through grants CIF-1564355 and CIF-1619448.

References

[Abb17] Emmanuel Abbe. Community detection and stochastic block models: recent developments. The Journal of Machine Learning Research, 18(1):6446–6531, 2017.
[ABH15] Emmanuel Abbe, Afonso S Bandeira, and Georgina Hall. Exact recovery in the stochastic block model. IEEE Transactions on Information Theory, 62(1):471–487, 2015.
[AGH+13] Sanjeev Arora, Rong Ge, Yonatan Halpern, David Mimno, Ankur Moitra, David Sontag, Yichen Wu, and Michael Zhu. A practical algorithm for topic modeling with provable guarantees. In International Conference on Machine Learning, pages 280–288, 2013.
[AGH+14] Animashree Anandkumar, Rong Ge, Daniel Hsu, Sham M Kakade, and Matus Telgarsky. Tensor decompositions for learning latent variable models. Journal of Machine Learning Research, 15:2773–2832, 2014.
[AS15] Emmanuel Abbe and Colin Sandon. Community detection in general stochastic block models: Fundamental limits and efficient algorithms for recovery. In 2015 IEEE 56th Annual Symposium on Foundations of Computer Science, pages 670–688. IEEE, 2015.
[BBW18] Xin Bing, Florentina Bunea, and Marten Wegkamp. A fast algorithm with minimax optimal guarantees for topic models with an unknown number of topics. arXiv preprint arXiv:1805.06837, 2018.
[BCLS17] Christian Borgs, Jennifer Chayes, Christina E Lee, and Devavrat Shah. Thy friend is my friend: Iterative collaborative filtering for sparse matrix estimation. In Advances in Neural Information Processing Systems, pages 4715–4726, 2017.
[Bha13] Rajendra Bhatia. Matrix Analysis, volume 169. Springer Science & Business Media, 2013.
[BLM15] Charles Bordenave, Marc Lelarge, and Laurent Massoulié. Non-backtracking spectrum of random graphs: community detection and non-regular Ramanujan graphs. In 2015 IEEE 56th Annual Symposium on Foundations of Computer Science, pages 1347–1357. IEEE, 2015.
[CCT12] Kamalika Chaudhuri, Fan Chung, and Alexander Tsiatas. Spectral clustering of graphs with general degrees in the extended planted partition model. In Conference on Learning Theory, 2012.
[Das99] Sanjoy Dasgupta. Learning mixtures of Gaussians. In 40th Annual Symposium on Foundations of Computer Science, pages 634–644. IEEE, 1999.
[GHK15] Rong Ge, Qingqing Huang, and Sham M Kakade. Learning mixtures of Gaussians in high dimensions. In Proceedings of the Forty-Seventh Annual ACM Symposium on Theory of Computing, pages 761–770. ACM, 2015.
[HJW15] Yanjun Han, Jiantao Jiao, and Tsachy Weissman. Minimax estimation of discrete distributions. In 2015 IEEE International Symposium on Information Theory (ISIT), pages 2291–2295. IEEE, 2015.
[HKKV18] Qingqing Huang, Sham M Kakade, Weihao Kong, and Gregory Valiant. Recovering structured probability matrices. In 9th Innovations in Theoretical Computer Science Conference (ITCS 2018), volume 94, 2018.
[HKZ12] Daniel Hsu, Sham M Kakade, and Tong Zhang. A spectral algorithm for learning hidden Markov models. Journal of Computer and System Sciences, 78(5):1460–1480, 2012.
[Hof99] Thomas Hofmann. Probabilistic latent semantic analysis. In Proceedings of the Fifteenth Conference on Uncertainty in Artificial Intelligence, pages 289–296. Morgan Kaufmann Publishers Inc., 1999.
[HS17] Samuel B Hopkins and David Steurer. Bayesian estimation from few samples: community detection and related problems. arXiv preprint arXiv:1710.00264, 2017.
[JY13] Antony Joseph and Bin Yu. Impact of regularization on spectral clustering. arXiv preprint arXiv:1312.1733, 2013.
[KMA16] Arnav Kapur, Kshitij Marwah, and Gil Alterovitz. Gene expression prediction using low-rank matrix completion. BMC Bioinformatics, 17(1):243, 2016.
[KOPS15] Sudeep Kamath, Alon Orlitsky, Dheeraj Pichapati, and Ananda Theertha Suresh. On learning distributions from their samples. In Conference on Learning Theory, pages 1066–1100, 2015.
[KW17] Zheng Tracy Ke and Minzhe Wang. A new SVD approach to optimal topic estimation. arXiv preprint arXiv:1704.07016, 2017.
[LLV17] Can M Le, Elizaveta Levina, and Roman Vershynin.
Concentration and regularization of random graphs. Random Structures & Algorithms, 51(3):538–561, 2017.
[LT13] Michel Ledoux and Michel Talagrand. Probability in Banach Spaces: Isoperimetry and Processes. Springer Science & Business Media, 2013.
[MD19] Andrew D McRae and Mark A Davenport. Low-rank matrix completion and denoising under Poisson noise. arXiv preprint arXiv:1907.05325, 2019.
[MNS18] Elchanan Mossel, Joe Neeman, and Allan Sly. A proof of the block model threshold conjecture. Combinatorica, 38(3):665–708, 2018.
[MR05] Elchanan Mossel and Sébastien Roch. Learning nonsingular phylogenies and hidden Markov models. In Proceedings of the Thirty-Seventh Annual ACM Symposium on Theory of Computing, pages 366–375. ACM, 2005.
[QR13] Tai Qin and Karl Rohe. Regularized spectral clustering under the degree-corrected stochastic blockmodel. In Advances in Neural Information Processing Systems, pages 3120–3128, 2013.
[Que87] João Filipe Queiró. On the interlacing property for singular values and eigenvalues. Linear Algebra and Its Applications, 97:23–28, 1987.
[SCGDS92] UK Sarkar, PP Chakrabarti, Sujoy Ghose, and SC De Sarkar. A simple 0.5-bounded greedy algorithm for the 0/1 knapsack problem. Information Processing Letters, 42(3):173–177, 1992.
[SCH15] Karl Stratos, Michael Collins, and Daniel Hsu. Model-based word embeddings from decompositions of count matrices. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), volume 1, pages 1282–1291, 2015.
[SKKR00] Badrul Sarwar, George Karypis, Joseph Konstan, and John Riedl. Application of dimensionality reduction in recommender system: a case study. Technical report, Minnesota Univ Minneapolis Dept of Computer Science, 2000.
[Szp01] Wojciech Szpankowski. Average Case Analysis of Algorithms on Sequences, chapter Analytic Poissonization and Depoissonization, pages 442–519. John Wiley & Sons, Inc., 2001.
[VW04] Santosh Vempala and Grant Wang. A spectral algorithm for learning mixture models. Journal of Computer and System Sciences, 68(4):841–860, 2004.
[ZHPA13] James Y Zou, Daniel J Hsu, David C Parkes, and Ryan P Adams. Contrastive learning using spectral methods. In Advances in Neural Information Processing Systems, pages 2238–2246, 2013.
[ZLWY14] Xiaowei Zhou, Jiming Liu, Xiang Wan, and Weichuan Yu. Piecewise-constant and low-rank approximation for identification of recurrent copy number variations. Bioinformatics, 30(14):1943–1949, 2014.