Robust Subspace Approximation in a Stream

Roie Levin¹ (roiel@cs.cmu.edu), Anish Sevekari² (asevekar@andrew.cmu.edu), David P. Woodruff¹ (dwoodruf@cs.cmu.edu)
¹ Computer Science Department, Carnegie Mellon University, Pittsburgh, PA 15213
² Department of Mathematical Sciences, Carnegie Mellon University, Pittsburgh, PA 15213

32nd Conference on Neural Information Processing Systems (NeurIPS 2018), Montréal, Canada.

We study robust subspace estimation in the streaming and distributed settings. Given a set of $n$ data points $\{a_i\}_{i=1}^n$ in $\mathbb{R}^d$ and an integer $k$, we wish to find a linear subspace $S$ of dimension $k$ for which $\sum_i M(\mathrm{dist}(S, a_i))$ is minimized, where $\mathrm{dist}(S, x) := \min_{y \in S} \|x - y\|_2$ and $M(\cdot)$ is some loss function. When $M$ is the identity function, $S$ gives a subspace that is more robust to outliers than that provided by the truncated SVD. Though the problem is NP-hard, it is approximable within a $(1+\epsilon)$ factor in polynomial time when $k$ and $\epsilon$ are constant. We give the first sublinear approximation algorithm for this problem in the turnstile streaming and arbitrary partition distributed models, achieving the same time guarantees as in the offline case. Our algorithm is the first based entirely on oblivious dimensionality reduction, and it significantly simplifies prior methods for this problem, which held in neither the streaming nor the distributed model.

1 Introduction

A fundamental problem in large-scale machine learning is that of subspace approximation. Given a set of $n$ data points $\{a_i\}_{i=1}^n$ in $\mathbb{R}^d$ and an integer $k$, we wish to find a linear subspace $S$ of dimension $k$ for which $\sum_i M(\mathrm{dist}(S, a_i))$ is minimized, where $\mathrm{dist}(S, x) := \min_{y \in S} \|x - y\|_2$ and $M(\cdot)$ is some loss function. When $M(\cdot) = (\cdot)^2$, this is the well-studied least squares subspace approximation problem, whose minimizer can be computed exactly via the truncated SVD of the data matrix. Otherwise $M$ is often chosen from $(\cdot)^p$ for some $p \geq 0$, or from a class of functions called M-estimators, with the goal of providing an estimate that is more robust to outliers than least squares. Indeed, for $p < 2$, since one is not squaring the distances to the subspace, one places less emphasis on outliers and therefore captures more of the remaining data points. For example, when $M$ is the identity function, we are finding a subspace that minimizes the sum of distances to it, which is arguably more natural than finding a subspace that minimizes the sum of squared distances. We can write this problem in the following form:

$$\min_{S : \dim(S) \leq k} \sum_i \mathrm{dist}(S, a_i) \;=\; \min_{X : \mathrm{rank}(X) \leq k} \sum_i \|(A - AX)_i\|_2,$$

where $A$ is the matrix whose $i$-th row is the vector $a_i$. This is the form of robust subspace approximation that we study in this work. We will be interested in the approximate version of the problem, in which the goal is to output a $k$-dimensional subspace $S'$ for which, with high probability,

$$\sum_i \mathrm{dist}(S', a_i) \;\leq\; (1 + \epsilon) \min_{S : \dim(S) \leq k} \sum_i \mathrm{dist}(S, a_i). \quad (1)$$

The particular form with $M$ equal to the identity was introduced to the machine learning community by Ding et al. [10], though these authors employed heuristic solutions. The series of works [7], [15] and [8, 12, 20, 5] shows that if $M(\cdot) = |\cdot|^p$ for $p \neq 2$, there is no algorithm that outputs a $(1 + 1/\mathrm{poly}(d))$ approximation to this problem unless P = NP. However, [5] also show that for any $p$ there is an algorithm that runs in $O(\mathrm{nnz}(A) + (n + d)\,\mathrm{poly}(k/\epsilon) + \exp(\mathrm{poly}(k/\epsilon)))$ time and outputs a $k$-dimensional subspace whose cost is within a $(1+\epsilon)$ factor of the optimal solution cost.
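To make the objective concrete, the following is a small numpy sketch (ours, not from the paper) that evaluates the $\ell_{2,1}$ cost $\sum_i \|(A - AX)_i\|_2$ and the squared (Frobenius) cost for a candidate subspace; on data with a single heavy outlier row the two objectives can rank subspaces very differently.

```python
import numpy as np

def l21_cost(A, V):
    """Sum of Euclidean row distances ||(A - AX)_i||_2 for X = V^T V,
    the projection onto the subspace spanned by the orthonormal rows of V."""
    residual = A - A @ V.T @ V
    return np.linalg.norm(residual, axis=1).sum()

def frob_cost(A, V):
    """Sum of squared row distances (the least squares objective)."""
    residual = A - A @ V.T @ V
    return (np.linalg.norm(residual, axis=1) ** 2).sum()

rng = np.random.default_rng(0)
A = rng.normal(size=(200, 3)) @ rng.normal(size=(3, 50))   # rank-3 signal
A[0] += 1000.0                                             # one heavy outlier row

_, _, Vt = np.linalg.svd(A, full_matrices=False)
V = Vt[:3]                                 # k = 3 subspace from the truncated SVD
print(l21_cost(A, V), frob_cost(A, V))
```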
The algorithm of [5] provides a considerable computational savings, since in most applications $k \ll d \ll n$. Their work builds upon techniques developed in [13] and [11], which give $O(nd\,\mathrm{poly}(k/\epsilon) + \exp((k/\epsilon)^{O(p)}))$ time algorithms for the $p \geq 1$ case. These in turn build on the weak coreset construction of [9]. In other related work, [6] give algorithms for performing regression with a variety of M-estimator loss functions.

Our Contributions. We give the first sketching-based solution to this problem. Namely, we show it suffices to compute $ZA$, where $Z$ is a $\mathrm{poly}(\log(nd)\,k\,\epsilon^{-1}) \times n$ random matrix with entries chosen obliviously to the entries of $A$. The matrix $Z$ is a block matrix in which some blocks consist of independent Gaussian entries, other blocks consist of independent Cauchy random variables, and yet other blocks are sparse matrices with non-zero entries in $\{-1, +1\}$. Previously such sketching-based solutions were known only for $M(\cdot) = (\cdot)^2$. Prior algorithms [8, 12, 20, 5] also could not be implemented as single-shot sketching algorithms, since they require first making a pass over the data to obtain a crude approximation, and then using (often adaptive) sampling methods in future passes to refine it to a $(1+\epsilon)$-approximation. Our sketching-based algorithm, achieving $O(\mathrm{nnz}(A) + (n + d)\,\mathrm{poly}(\log(nd)\,k/\epsilon) + \exp(\mathrm{poly}(k\epsilon^{-1})))$ time, matches the running time of previous algorithms and has considerable benefits as described below.

Streaming Model. Since $Z$ is linear and oblivious, one can maintain $ZA$ in the presence of insertions and deletions to the entries of $A$. Indeed, given the update $A_{i,j} \leftarrow A_{i,j} + \Delta$ for some $\Delta \in \mathbb{R}$, we simply update the $j$-th column of our sketch from $(ZA)_{*j}$ to $(ZA)_{*j} + \Delta \cdot Z e_i$, where $e_i$ is the $i$-th standard unit vector. Also, the entries of $Z$ can be represented with limited independence, and so $Z$ can be stored with a short random seed. Consequently, we obtain the first algorithm with $d\,\mathrm{poly}(\log(nd)\,k\,\epsilon^{-1})$ memory for this problem in the standard turnstile data stream model [19]. In this model, $A \in \mathbb{R}^{n \times d}$ is initially the zero matrix, and we receive a stream of updates to $A$, where the $i$-th update has the form $(x_i, y_i, c_i)$ and means that $A_{x_i, y_i}$ should be incremented by $c_i$. We are allowed one pass over the stream, and should output a rank-$k$ matrix $X'$ which is a $(1+\epsilon)$ approximation to the robust subspace estimation problem, namely

$$\sum_i \|(A - AX')_i\|_2 \;\leq\; (1 + \epsilon) \min_{X : \mathrm{rank}(X) \leq k} \sum_i \|(A - AX)_i\|_2.$$

The space complexity of the algorithm is the total number of words required to store this information during the stream, where each word is $O(\log(nd))$ bits. Our algorithm achieves $d\,\mathrm{poly}(\log(nd)\,k\,\epsilon^{-1})$ memory, and so depends only logarithmically on $n$. This is comparable to the memory of streaming algorithms for $M(\cdot) = (\cdot)^2$ [3, 14], which was the only case for which streaming algorithms were previously known.

Distributed Model. Since our algorithm maintains $ZA$ for an oblivious linear sketch $Z$, it is parallelizable, and can be used to solve the problem in the distributed setting in which there are $s$ machines holding $A^1, A^2, \ldots, A^s$, respectively, and $A = \sum_{i=1}^{s} A^i$. This is called the arbitrary partition model [17]. In this model, we can solve the problem in one round with $s \cdot d\,\mathrm{poly}(\log(nd)\,k\,\epsilon^{-1})$ communication by having each machine agree upon (a short seed describing) $Z$, and sending $ZA^i$ to a central coordinator, who computes and runs our algorithm on $ZA = \sum_i ZA^i$. The arbitrary partition model is stronger than the so-called row partition model, in which the points (rows of $A$) are partitioned across machines. For example, if each machine corresponds to a shop, the rows of $A$ correspond to customers, the columns of $A$ correspond to items, and $A^i_{c,d}$ indicates how many times customer $c$ purchased item $d$ at shop $i$, then the row partition model requires customers to make purchases at a single shop. In contrast, in the arbitrary partition model, customers can purchase items at multiple shops.
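Returning to the streaming model above: because the sketch is linear, each turnstile update touches only one column of $ZA$. A minimal sketch of this bookkeeping (with a dense Gaussian $Z$ standing in for the paper's block construction, and sizes chosen arbitrarily):

```python
import numpy as np

rng = np.random.default_rng(1)
n, d, m = 1000, 50, 40             # m plays the role of poly(log(nd) k / eps)

Z = rng.normal(size=(m, n))        # drawn once, obliviously to the data
ZA = np.zeros((m, d))              # the only state kept during the stream

def process_update(x, y, c):
    """Turnstile update (x, y, c): A[x, y] += c.
    By linearity, ZA changes by c * Z[:, x] in column y only."""
    ZA[:, y] += c * Z[:, x]

# A toy stream of insertions and deletions.
for (x, y, c) in [(3, 7, 2.5), (998, 0, -1.0), (3, 7, -2.5)]:
    process_update(x, y, c)
```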
2 Notation and Terminology

For a matrix $A$, let $A_i$ denote the $i$-th row of $A$, and $A_{*j}$ denote the $j$-th column of $A$.

Definition 2.1 ($\|\cdot\|_{2,1}$, $\|\cdot\|_{1,2}$, $\|\cdot\|_{1,1}$, $\|\cdot\|_{\mathrm{med},1}$, $\|\cdot\|_F$). For a matrix $A \in \mathbb{R}^{n \times m}$, let
$$\|A\|_{2,1} = \sum_i \|A_i\|_2, \qquad \|A\|_{1,2} = \Big(\sum_i \|A_i\|_1^2\Big)^{1/2}, \qquad \|A\|_{1,1} = \sum_i \|A_i\|_1,$$
$$\|A\|_F = \Big(\sum_i \|A_i\|_2^2\Big)^{1/2}, \qquad \|A\|_{\mathrm{med},1} = \sum_j \mathrm{med}(A_{*j}),$$
where $\mathrm{med}$ denotes the function that takes the median of the absolute values of the entries of a vector.

Definition 2.2 ($X^*$, $\Delta^*$). Let
$$\Delta^* := \min_{X : \mathrm{rank}(X) \leq k} \|A - AX\|_{2,1}, \qquad X^* := \mathop{\mathrm{argmin}}_{X : \mathrm{rank}(X) \leq k} \|A - AX\|_{2,1}.$$

Definition 2.3 ($(\alpha, \beta)$-coreset). For a matrix $A \in \mathbb{R}^{n \times d}$ and a target rank $k$, $W$ is an $(\alpha, \beta)$-coreset if its row space is an $\alpha$-dimensional subspace of $\mathbb{R}^d$ that contains a $\beta$-approximation to $X^*$. Formally:
$$\min_{X : \mathrm{rank}(X) \leq k} \|A - AXW\|_{2,1} \;\leq\; \beta\,\Delta^*.$$

Definition 2.4 (Count-Sketch Matrix). A random matrix $S \in \mathbb{R}^{r \times t}$ is a Count-Sketch matrix if it is constructed via the following procedure. For each of the $t$ columns $S_{*i}$, we first independently choose a uniformly random row $h(i) \in \{1, 2, \ldots, r\}$. Then, we choose a uniformly random sign $\sigma(i) \in \{-1, +1\}$. We set $S_{h(i),i} = \sigma(i)$ and set $S_{j,i} = 0$ for all $j \neq h(i)$.

For the applications of Count-Sketch matrices in this paper, it suffices to use $O(1)$-wise instead of full independence for the hash and sign functions. Thus these can be stored in $O(1)$ space, and the multiplication $SA$ can be computed in $\mathrm{nnz}(A)$ time. For more background on such sketching matrices, we refer the reader to the monograph [22].

We also use the following notation: $[n]$ denotes the set $\{1, 2, 3, \ldots, n\}$. $[[E]]$ denotes the indicator function for event $E$. $\mathrm{nnz}(A)$ denotes the number of non-zero entries of $A$. $A^\dagger$ denotes the pseudoinverse of $A$. $I$ denotes the identity matrix.
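A minimal construction of the Count-Sketch matrix of Definition 2.4 (using fully independent hash and sign functions for simplicity, whereas the paper only needs $O(1)$-wise independence); because each column has a single non-zero, each non-zero of $A$ touches a single entry of $SA$:

```python
import numpy as np
from scipy.sparse import csr_matrix, random as sparse_random

def count_sketch(r, t, rng):
    """r x t Count-Sketch matrix: column i has a single entry sigma(i) in row h(i)."""
    h = rng.integers(0, r, size=t)                 # uniformly random rows h(i)
    sigma = rng.choice([-1.0, 1.0], size=t)        # uniformly random signs sigma(i)
    return csr_matrix((sigma, (h, np.arange(t))), shape=(r, t))

rng = np.random.default_rng(0)
n, d, r = 10_000, 50, 64
A = sparse_random(n, d, density=0.01, random_state=1, format="csr")

S = count_sketch(r, n, rng)
SA = S @ A                     # time proportional to nnz(A)
print(SA.shape)                # (64, 50)
```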
3 Algorithm Overview

At a high level we follow the framework put forth in [5], which gives the first input sparsity time algorithm for the robust subspace approximation problem. In their work, Clarkson and Woodruff first find a crude $(\mathrm{poly}(k), K)$-coreset for the problem. They then use a non-adaptive implementation of a residual sampling technique from [9] to improve the approximation quality at the cost of increasing the dimension, yielding a $(K\,\mathrm{poly}(k), 1+\epsilon)$-coreset. From here they further use dimension reducing sketches to reduce to an instance with parameters that depend only polynomially on $k/\epsilon$. Finally, they pay a cost exponential only in $\mathrm{poly}(k/\epsilon)$ to solve the small problem via a black box algorithm of [2].

There are several major obstacles to directly porting this technique to the streaming setting. For one, the construction of the crude approximation subspace uses leverage score sampling matrices, which are non-oblivious and thus not usable in 1-pass turnstile model algorithms. We circumvent this difficulty in Section 4.1 by showing that if $T$ is a sparse $\mathrm{poly}(k) \times n$ matrix of Cauchy random variables, the row span of $TA$ contains a rank-$k$ matrix which is a $\log(d)\,\mathrm{poly}(k)$ approximation to the best rank-$k$ matrix under the $\|\cdot\|_{2,1}$ norm.

Second, the residual sampling step requires sampling rows of $A$ with respect to probabilities proportional to their distance to the crude approximation (in our case $TA$). This is challenging because one does not know $TA$ until the end of the stream, much less the distances of the rows of $A$ to $TA$. We handle this in Section 4.2 using a row-sampling data structure of [18] developed for regression, which for a matrix $B$ maintains a sketch $HB$ in a stream from which one can extract samples of rows of $B$ according to probabilities given by their norms. By linearity, it suffices to maintain $HA$ and $TA$ in parallel in the stream, and apply the sample extraction procedure to $HA(I - P_{TA})$, where $P_{TA} = (TA)^\top (TA(TA)^\top)^{-1} TA$ is the projection onto the row space of $TA$. Unfortunately, the extraction procedure only returns noisy perturbations of the original rows, which invalidates the analysis in [5] of the residual sampling. In Section 4.2 we give an analysis of non-adaptive noisy residual sampling, which we name BOOTSTRAPCORESET. This gives a procedure for transforming our $\mathrm{poly}(k)$-dimensional space containing a $\mathrm{poly}(k)\log(d)$ approximation into a $\mathrm{poly}(k)\log(d)$-dimensional space containing a $3/2$ factor approximation.

Third, requiring the initial crude approximation to be oblivious yields a coarser $\log(d)\,\mathrm{poly}(k)$ initial approximation than the constant factor approximation of [5]. Thus the dimension of the subspace after residual sampling is $\mathrm{poly}(k)\log(d)$. Applying dimension reduction techniques reduces the problem to an instance with $\mathrm{poly}(k)$ rows and $\log(d)\,\mathrm{poly}(k)$ columns. Here the black box algorithm of [2] would take time $d^{\mathrm{poly}(k)}$, which is no longer fixed parameter tractable as desired. Our key insight is that finding the best rank-$k$ matrix under the Frobenius norm, which can be done efficiently, is a $\log d\,(\log\log d)\,\mathrm{poly}(k)$ approximation to the $\|\cdot\|_{2,1}$ norm minimizer. From here we can repeat the residual sampling argument, which this time yields a small instance with $\mathrm{poly}(k)$ rows and $\log d\,(\log\log d)\,\mathrm{poly}(k/\epsilon)$ columns. This sublogarithmic dependence on $d$ makes all the difference, and now the enumeration can be done in time $(n + d)\,\mathrm{poly}(k/\epsilon) + \exp(\mathrm{poly}(k/\epsilon))$. All this is done in parallel in a single pass over the stream.

Lastly, the sketching techniques applied after the residual sampling are not oblivious in [5]. We instead use an oblivious median-based embedding in Section 5.1, and show that we can still use the black box algorithm of [2] to find the minimizer under the $\|\cdot\|_{\mathrm{med},1}$ norm in Section 5.2.

We present our results as two algorithms for the robust subspace approximation problem. The first runs in fully polynomial time but gives a coarse approximation guarantee, which corresponds to stopping before repeating the residual sampling a second time. The second algorithm captures the entire procedure, and uses the first as a subroutine.

Algorithm 1 COARSEAPPROX
Input: $A \in \mathbb{R}^{n \times d}$ as a stream
Output: $X' \in \mathbb{R}^{d \times d}$ such that $\|A - AX'\|_{2,1} \leq \log d\,(\log\log d)\,\mathrm{poly}(k)\,\Delta^*$
1: $T \in \mathbb{R}^{\mathrm{poly}(k) \times n}$ Sparse Cauchy matrix // as in Thm. 4.1
2: $C_1 \in \mathbb{R}^{\mathrm{poly}(k) \times n}$ Sparse Cauchy matrix // as in Thm. 4.4
3: $S_1 \in \mathbb{R}^{\log d\,\mathrm{poly}(k) \times d}$ Count-Sketch composed with Gaussian // as in Thm. 4.3
4: $R_1 \in \mathbb{R}^{\mathrm{poly}(k) \times d}$ Count-Sketch matrix // as in Thm. 4.3
5: $G_1 \in \mathbb{R}^{\log d\,\mathrm{poly}(k) \times \log d\,\mathrm{poly}(k)}$ Gaussian matrix // as in Thm. 4.4
6: Compute $TA$ online
7: Compute $C_1 A$ online
8: $U_1 \in \mathbb{R}^{\log d\,\mathrm{poly}(k) \times d} \leftarrow$ BOOTSTRAPCORESET$(A, TA, 1/2)$ // as in Alg. 3
9: $\hat X \in \mathbb{R}^{\mathrm{poly}(k) \times \log d\,\mathrm{poly}(k)} \leftarrow \mathop{\mathrm{argmin}}_{X : \mathrm{rank}(X) \leq k} \|C_1(A - AR_1^\top X U_1)S_1^\top G_1\|_F$ // as in Fact 4.2
10: return $R_1^\top \hat X U_1$
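The sketching matrices drawn in lines 1–5 can all be generated from short seeds. As one illustration, here is a minimal construction of a sparse Cauchy sketch in the spirit of Section J of [21]; placing a single Cauchy-valued non-zero per column is our simplifying assumption, and applying the sketch costs $O(\mathrm{nnz}(A))$ time.

```python
import numpy as np
from scipy.sparse import csr_matrix, random as sparse_random

def sparse_cauchy(r, n, rng):
    """r x n sketch with one non-zero per column: Count-Sketch positions whose
    values are standard Cauchy random variables (signs are absorbed by symmetry)."""
    rows = rng.integers(0, r, size=n)
    vals = rng.standard_cauchy(size=n)
    return csr_matrix((vals, (rows, np.arange(n))), shape=(r, n))

rng = np.random.default_rng(0)
n, d, r = 5000, 40, 30                     # r plays the role of poly(k)
A = sparse_random(n, d, density=0.05, random_state=1, format="csr")
T = sparse_cauchy(r, n, rng)
TA = T @ A                                 # computed in O(nnz(A)) time
```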
Theorem 3.1 (Coarse Approximation in Polynomial Time). Given a matrix $A \in \mathbb{R}^{n \times d}$, Algorithm 1 with constant probability computes a rank-$k$ matrix $X' \in \mathbb{R}^{d \times d}$ such that
$$\|A - AX'\|_{2,1} \;\leq\; \log d\,(\log\log d)\,\mathrm{poly}(k)\,\Delta^*,$$
and runs in time $O(\mathrm{nnz}(A)) + d\,\mathrm{poly}(k\log(nd))$. Furthermore, it can be implemented as a one-pass streaming algorithm with space $O(d\,\mathrm{poly}(k\log(nd)))$ and time per update $O(\mathrm{poly}(\log(nd)\,k))$.

Proof Sketch. We show the following are true in subsequent sections:

1. The row span of $TA$ is a $(\mathrm{poly}(k), \log d\,\mathrm{poly}(k))$-coreset for $A$ with probability 24/25 (Section 4.1).
2. BOOTSTRAPCORESET$(A, TA, 1/2)$ is a $(\log d\,\mathrm{poly}(k), 3/2)$-coreset with probability 49/50 (Section 4.2).
3. If
$$\hat X = \mathop{\mathrm{argmin}}_{X : \mathrm{rank}(X) \leq k} \|C_1 A S_1^\top G_1 - C_1 A R_1^\top X U_1 S_1^\top G_1\|_F,$$
then with probability 47/50,
$$\|A - AR_1^\top \hat X U_1\|_{2,1} \;\leq\; \mathrm{poly}(k)\,\log d\,(\log\log d)\,\Delta^*$$
(Sections 4.3 and 4.4, with $\epsilon = 1/2$).

By a union bound, with probability 88/100 all the statements above hold, and the theorem is proved. BOOTSTRAPCORESET requires $d\,\mathrm{poly}(k\log(nd))$ space and time. Left multiplications by the Sparse Cauchy matrices, $TA$ and $C_1A$, can be done in $O(\mathrm{nnz}(A))$ time (see Section J of [21] for a full description of Sparse Cauchy matrices). Computing the remaining matrix products and $\hat X$ requires time $d\,\mathrm{poly}(k\log d)$.

Algorithm 2 (1+ε)-APPROX
Input: $A \in \mathbb{R}^{n \times d}$ as a stream
Output: $X'' \in \mathbb{R}^{d \times d}$ such that $\|A - AX''\|_{2,1} \leq (1+\epsilon)\,\Delta^*$
1: $\hat X \in \mathbb{R}^{\mathrm{poly}(k) \times \log d\,\mathrm{poly}(k)} \leftarrow$ COARSEAPPROX$(A)$ // as in Thm. 3.1
2: $C_2 \in \mathbb{R}^{\log d\,(\log\log d)\,\mathrm{poly}(k/\epsilon) \times n}$ Cauchy matrix // as in Thm. 5.1
3: $S_2 \in \mathbb{R}^{\log d\,(\log\log d)\,\mathrm{poly}(k/\epsilon) \times d}$ Count-Sketch composed with Gaussian // as in Thm. 4.3
4: $R_2 \in \mathbb{R}^{\mathrm{poly}(k/\epsilon) \times d}$ Count-Sketch matrix // as in Thm. 4.3
5: $G_2 \in \mathbb{R}^{\log d\,(\log\log d)\,\mathrm{poly}(k/\epsilon) \times \log d\,(\log\log d)\,\mathrm{poly}(k/\epsilon)}$ Gaussian matrix // as in Thm. 5.1
6: Compute $AR_2^\top$ online
7: Compute $AS_2^\top$ online
8: Let $V \in \mathbb{R}^{\log d\,\mathrm{poly}(k) \times k}$ be such that $\hat X = W V^\top$ is the rank-$k$ decomposition of $\hat X$
9: $U_2 \in \mathbb{R}^{\mathrm{poly}(k/\epsilon)\,\log d\,(\log\log d) \times d} \leftarrow$ BOOTSTRAPCORESET$(A, V^\top U_1, \epsilon')$ // as in Alg. 3, with $U_1$ as computed during COARSEAPPROX in line 1
10: $\hat X' \in \mathbb{R}^{\mathrm{poly}(k/\epsilon) \times \mathrm{poly}(k/\epsilon)\,\log d\,(\log\log d)} \leftarrow \mathop{\mathrm{argmin}}_{X : \mathrm{rank}(X) \leq k} \|C_2(A - AR_2^\top X U_2)S_2^\top G_2\|_{\mathrm{med},1}$ // as in Thm. 5.2
11: return $R_2^\top \hat X' U_2$

Theorem 3.2 ((1+ε)-Approximation). Given a matrix $A \in \mathbb{R}^{n \times d}$, Algorithm 2 with constant probability computes a rank-$k$ matrix $X'' \in \mathbb{R}^{d \times d}$ such that
$$\|A - AX''\|_{2,1} \;\leq\; (1+\epsilon)\,\|A - AX^*\|_{2,1},$$
and runs in time $O(\mathrm{nnz}(A)) + (n + d)\,\mathrm{poly}(k\log(nd)/\epsilon) + \exp(\mathrm{poly}(k/\epsilon))$. Furthermore, it can be implemented as a one-pass streaming algorithm with space $O(d\,\mathrm{poly}(k\log(nd)/\epsilon))$ and time per update $O(\mathrm{poly}(\log(nd)\,k/\epsilon))$.

Proof Sketch. We show the following are true in subsequent sections:

1. If $V$ is such that $\hat X = W V^\top$, then $V^\top U_1$ is a $(\mathrm{poly}(k), \mathrm{poly}(k)\,\log d\,(\log\log d))$-coreset with probability 88/100 (Theorem 3.1).
2. BOOTSTRAPCORESET$(A, V^\top U_1, \epsilon')$ is a $(\mathrm{poly}(k/\epsilon')\,\log d\,(\log\log d), 1+\epsilon')$-coreset with probability 49/50 (reusing Section 4.2).
3. If
$$\hat X' = \mathop{\mathrm{argmin}}_{X : \mathrm{rank}(X) \leq k} \|C_2(A - AR_2^\top X U_2)S_2^\top G_2\|_{\mathrm{med},1},$$
then with probability 19/20,
$$\|A - AR_2^\top \hat X' U_2\|_{2,1} \;\leq\; (1 + O(\epsilon'))\,\Delta^*$$
(reusing Section 4.3, together with Section 5.1).
4. A black box algorithm of [2] computes $\hat X'$ to within a $(1 + O(\epsilon'))$ factor (Section 5.2).

By a union bound, with probability 81/100 all the statements above hold. Setting $\epsilon'$ appropriately small as a function of $\epsilon$, the theorem is proved. COARSEAPPROX and BOOTSTRAPCORESET together require $d\,\mathrm{poly}(k\log(nd)/\epsilon)$ space and $O(\mathrm{nnz}(A)) + d\,\mathrm{poly}(k\log(nd)/\epsilon)$ time. Right multiplication by the sketching matrices, $AS_2^\top$ and $AR_2^\top$, can be done in time $\mathrm{nnz}(A)$. Computing the remaining matrix products and $\hat X'$ requires time $(n+d)\,\mathrm{poly}(\log(d)\,k/\epsilon) + \exp(\mathrm{poly}(k/\epsilon))$ (see the end of Section 5.2 for details on this last bound).

We give further proofs and details of these theorems in subsequent sections. Refer to the full version of the paper for complete proofs.
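Both algorithms post-multiply by an appropriately scaled Gaussian matrix to convert row $\ell_2$ norms into entrywise $\ell_1$ norms (Fact 4.1 in the next section). A quick numerical sanity check of that conversion follows; the $\sqrt{\pi/2}/m$ entry scaling below is our own choice of normalization, not taken from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
d, eps = 50, 0.1
m = int(d * np.log(1 / eps) / eps**2)                 # number of columns, as in Fact 4.1
G = rng.normal(size=(d, m)) * np.sqrt(np.pi / 2) / m  # scaled so that E||yG||_1 = ||y||_2

for _ in range(3):
    y = rng.normal(size=d)
    print(np.linalg.norm(y @ G, 1) / np.linalg.norm(y))  # ratios should be close to 1
```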
4 Coarse Approximation

4.1 Initial Coreset Construction

We construct a $(\mathrm{poly}(k), \log d\,\mathrm{poly}(k))$-coreset which will serve as our starting point.

Theorem 4.1. If $T \in \mathbb{R}^{\mathrm{poly}(k) \times n}$ is a Sparse Cauchy matrix, then the row space of $TA$ contains a $k$-dimensional subspace with corresponding projection matrix $X'$ such that, with probability 24/25,
$$\|A - AX'\|_{2,1} \;\leq\; \log d\,\mathrm{poly}(k) \min_{X : \mathrm{rank}(X) \leq k} \|A - AX\|_{2,1} \;=\; \log d\,\mathrm{poly}(k)\,\Delta^*.$$

In order to deal with the awkward $\|\cdot\|_{2,1}$ norm, here and several times elsewhere we make use of a well known theorem due to Dvoretzky to convert it into an entrywise 1-norm.

Fact 4.1 (Dvoretzky's Theorem (Special Case), Section 3.3 of [16]). There exists an appropriately scaled Gaussian matrix $G \in \mathbb{R}^{d \times d\log(1/\epsilon)/\epsilon^2}$ such that w.h.p. the following holds for all $y \in \mathbb{R}^d$ simultaneously:
$$(1-\epsilon)\,\|y\|_2 \;\leq\; \|yG\|_1 \;\leq\; (1+\epsilon)\,\|y\|_2.$$

Thus the row space of $TA$, with $T$ as in Theorem 4.1 above, is a $(\mathrm{poly}(k), \log d\,\mathrm{poly}(k))$-coreset for $A$.

4.2 Bootstrapping a Coreset

Given a poor coreset $Q$ for $A$, we now show how to leverage known results about residual sampling from [9] and [5] to obtain a better coreset of slightly larger dimension. Throughout this subsection, $P_Q$ denotes the orthogonal projection onto the row space of $Q$.

Algorithm 3 BOOTSTRAPCORESET
Input: $A \in \mathbb{R}^{n \times d}$, an $(\alpha, \beta)$-coreset $Q \in \mathbb{R}^{\alpha \times d}$, $\epsilon \in (0, 1)$
Output: $U \in \mathbb{R}^{(\alpha + \beta\,\mathrm{poly}(k/\epsilon)) \times d}$, an $(\alpha + \beta\,\mathrm{poly}(k/\epsilon), 1+\epsilon)$-coreset
1: Compute $HA$ online // as in Lem. 4.2.2
2: $P \leftarrow \beta\,\mathrm{poly}(k/\epsilon)$ samples of rows of $A(I - P_Q)$ according to $\mathcal{P}(HA(I - P_Q))$ // as in Lem. 4.2.2
3: $U \leftarrow$ orthonormal basis for $\mathrm{RowSpan}\!\left(\begin{bmatrix} Q \\ P \end{bmatrix}\right)$
4: return $U$

Theorem 4.2. Given $Q$, an $(\alpha, \beta)$-coreset for $A$, with probability 49/50 BOOTSTRAPCORESET returns an $(\alpha + \beta\,\mathrm{poly}(k/\epsilon), 1+\epsilon)$-coreset for $A$. Furthermore, BOOTSTRAPCORESET runs in space and time $O(d\,\mathrm{poly}(\beta\log(nd)\,k/\epsilon))$, with $\mathrm{poly}(\beta\log(nd)\,k/\epsilon)$ time per update in the streaming setting.

Proof. Consider the following idealized noisy sampling process $\mathcal{P}'(B)$ that samples rows of a matrix $B$: sample a row $B_i$ of $B$ with probability $\|B_i\|_2 / \|B\|_{2,1}$, and add an arbitrary noise vector $E_i$ such that $\|E_i\|_2 \leq \nu\,\|B_i\|_2$, where we fix the parameter $\nu = \frac{\epsilon}{100 k \beta}$. Supposing we had such a process $\mathcal{P}'(B)$, we can prove the following lemma.

Lemma 4.2.1. Suppose $Q$ is an $(\alpha, \beta)$-coreset for $A$, and $P$ is a noisy subset of rows of the residual $A(I - P_Q)$ of size $\beta\,\mathrm{poly}(k/\epsilon)$, each sampled according to $\mathcal{P}'(A(I - P_Q))$. Then with probability 99/100, $\mathrm{RowSpan}(Q) \cup \mathrm{RowSpan}(P)$ spans an $(\alpha + \beta\,\mathrm{poly}(k/\epsilon))$-dimensional subspace containing a $k$-dimensional subspace with corresponding projection matrix $X'$ such that
$$\|A - AX'\|_{2,1} \;\leq\; (1+\epsilon)\,\Delta^*.$$

It remains to show that we can sample from $\mathcal{P}'$ in a stream.

Lemma 4.2.2. Let $B \in \mathbb{R}^{n \times d}$ be a matrix, and let $\delta, \nu \in (0, 1)$ be given. Also let $s$ be a given integer. Then there is an oblivious sketching matrix $H \in \mathbb{R}^{\mathrm{poly}(s/(\delta\nu)) \times n}$ and a sampling process $\mathcal{P}$, such that $\mathcal{P}(HB)$ returns a collection of $s' = O(s)$ distinct row indices $i_1, \ldots, i_{s'} \in [n]$ and approximations $\tilde B_{i_j} = B_{i_j} + E_{i_j}$ with $\|E_{i_j}\|_2 \leq \nu\,\|B_{i_j}\|_2$ for $j = 1, \ldots, s'$. With probability $1 - \delta$ over the choice of $H$, the probability that an index $i$ appears in the sampled set $\{i_1, \ldots, i_{s'}\}$ is at least the probability that $i$ appears in a set of $s$ samples drawn without replacement from the distribution $\left(\frac{\|B_1\|_2}{\|B\|_{2,1}}, \ldots, \frac{\|B_n\|_2}{\|B\|_{2,1}}\right)$. Furthermore, the multiplication $HB$ and sampling process $\mathcal{P}$ can be done in $\mathrm{nnz}(B) + d\,\mathrm{poly}(s/(\delta\nu))$ time, and can be implemented in the streaming model with $d\,\mathrm{poly}(s/(\delta\nu))$ bits of space.

Setting $b = \log(nd)$, $\delta = 1/100$, $\gamma = \nu = \frac{\epsilon}{100 k \beta}$, and $s = \beta\,\mathrm{poly}(k/\epsilon)$, it follows that $P$ contains $\beta\,\mathrm{poly}(k/\epsilon)$ samples from $\mathcal{P}'(A(I - P_Q))$ with probability 99/100. By Lemma 4.2.1 and a union bound, the projection onto $\mathrm{RowSpan}(Q) \cup \mathrm{RowSpan}(P)$ is an $(\alpha + \beta\,\mathrm{poly}(k/\epsilon), 1+\epsilon)$-coreset for $A$ with probability 49/50.

BOOTSTRAPCORESET takes total time $O(\mathrm{nnz}(A)) + O(d\,\mathrm{poly}(\beta\log(nd)\,k/\epsilon))$ and space $O(d\,\mathrm{poly}(\beta\log(nd)\,k/\epsilon))$. Note that in our main algorithm we cannot compute the projection $A(I - P_Q)$ until after the stream is finished. Fortunately, since $H$ is oblivious, we can right multiply $HA$ by $(I - P_Q)$ once $Q$ is available, and only then perform the sampling procedure $\mathcal{P}$.
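The idealized process $\mathcal{P}'$ is easy to state directly. The following is a minimal offline sketch of it (our own illustration, with Gaussian noise standing in for the arbitrary bounded perturbation); the streaming algorithm instead extracts such samples from the linear sketch $HA(I - P_Q)$.

```python
import numpy as np

def noisy_residual_samples(A, Q, s, nu, rng):
    """Idealized noisy residual sampling P'(B) for B = A(I - P_Q):
    pick s distinct row indices with probability ||B_i||_2 / ||B||_{2,1},
    and return each picked row perturbed by noise of norm at most nu * ||B_i||_2."""
    P_Q = np.linalg.pinv(Q) @ Q                 # projection onto RowSpan(Q)
    B = A - A @ P_Q                             # residual of A against the coreset
    norms = np.linalg.norm(B, axis=1)
    probs = norms / norms.sum()                 # ||B_i||_2 / ||B||_{2,1}
    idx = rng.choice(A.shape[0], size=s, replace=False, p=probs)
    noise = rng.normal(size=(s, A.shape[1]))
    noise *= (nu * norms[idx] / np.linalg.norm(noise, axis=1))[:, None]
    return idx, B[idx] + noise

rng = np.random.default_rng(0)
A = rng.normal(size=(500, 3)) @ rng.normal(size=(3, 20)) + 0.01 * rng.normal(size=(500, 20))
Q = rng.normal(size=(2, 20))                    # a crude 2-dimensional coreset candidate
idx, rows = noisy_residual_samples(A, Q, s=10, nu=0.05, rng=rng)
```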
4.3 Right Dimension Reduction

We show how to reduce the right dimension of our problem. This result is used in both Algorithm 1 and Algorithm 2.

Theorem 4.3. If $U$ is an $(\alpha, \beta)$-coreset, $S \in \mathbb{R}^{\alpha\,\mathrm{poly}(k/\epsilon) \times d}$ is a Count-Sketch matrix composed with a matrix of i.i.d. Gaussians, and $R \in \mathbb{R}^{d \times \mathrm{poly}(k/\epsilon)}$ is a Count-Sketch matrix, then with probability 49/50, if
$$\tilde X = \mathop{\mathrm{argmin}}_{X : \mathrm{rank}(X) \leq k} \|AS^\top - ARXUS^\top\|_{2,1},$$
then
$$\|A - AR\tilde X U\|_{2,1} \;\leq\; (1 + O(\epsilon)) \min_{X : \mathrm{rank}(X) \leq k} \|A - AXU\|_{2,1}.$$

4.4 Left Dimension Reduction

We show how to reduce the left dimension of our problem. Together with the results from Section 4.3, this preserves the solution $X^*$ to within a coarse $\log d\,(\log\log d)\,\mathrm{poly}(k/\epsilon)$ factor.

Theorem 4.4. Suppose the matrices $S_1$, $R_1$ and $U_1$ are as in Algorithm 1. If $C_1 \in \mathbb{R}^{\mathrm{poly}(k/\epsilon) \times n}$ is a Sparse Cauchy matrix, $G_1 \in \mathbb{R}^{\log d\,\mathrm{poly}(k/\epsilon) \times \log d\,\mathrm{poly}(k/\epsilon)}$ is a matrix of appropriately scaled i.i.d. Gaussians (as in Fact 4.1), and
$$\hat X = \mathop{\mathrm{argmin}}_{X : \mathrm{rank}(X) \leq k} \|C_1 A S_1^\top G_1 - C_1 A R_1^\top X U_1 S_1^\top G_1\|_F,$$
then with probability 24/25,
$$\|AS_1^\top - AR_1^\top \hat X U_1 S_1^\top\|_{2,1} \;\leq\; \log d\,(\log\log d)\,\mathrm{poly}(k/\epsilon) \min_{X : \mathrm{rank}(X) \leq k} \|AS_1^\top - AR_1^\top X U_1 S_1^\top\|_{2,1}.$$

The rank constrained Frobenius norm minimization problem above has a closed form solution.

Fact 4.2. For a matrix $M$, let $U_M \Sigma_M V_M^\top$ be the SVD of $M$. Then
$$\mathop{\mathrm{argmin}}_{X : \mathrm{rank}(X) \leq k} \|Y - ZXW\|_F \;=\; Z^\dagger\,[U_Z U_Z^\top\, Y\, V_W V_W^\top]_k\, W^\dagger,$$
where $[\cdot]_k$ denotes the best rank-$k$ approximation under the Frobenius norm.
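Fact 4.2 translates directly into code. A small numpy implementation of the closed form (the helper name is ours), together with a sanity check on a random instance:

```python
import numpy as np

def rank_k_min_frobenius(Y, Z, W, k):
    """Closed-form minimizer of ||Y - Z X W||_F over rank-k X (Fact 4.2):
    X = Z^+ [U_Z U_Z^T Y V_W V_W^T]_k W^+, with [.]_k the truncated SVD."""
    U_Z, _, _ = np.linalg.svd(Z, full_matrices=False)    # basis for ColSpan(Z)
    _, _, Vt_W = np.linalg.svd(W, full_matrices=False)   # rows span RowSpan(W)
    P = U_Z @ (U_Z.T @ Y @ Vt_W.T) @ Vt_W                # project Y onto ColSpan(Z) x RowSpan(W)
    U, s, Vt = np.linalg.svd(P, full_matrices=False)
    Pk = (U[:, :k] * s[:k]) @ Vt[:k]                     # best rank-k approximation of P
    return np.linalg.pinv(Z) @ Pk @ np.linalg.pinv(W)

rng = np.random.default_rng(0)
Y, Z, W = rng.normal(size=(30, 20)), rng.normal(size=(30, 8)), rng.normal(size=(6, 20))
X = rank_k_min_frobenius(Y, Z, W, k=2)
print(np.linalg.matrix_rank(X), np.linalg.norm(Y - Z @ X @ W))
```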
5 (1+ε)-Approximation

5.1 Left Dimension Reduction

The following median based embedding allows us to reduce the left dimension of our problem. Together with the results from Section 4.3, this preserves the solution $X^*$ to within a $(1 + O(\epsilon))$ factor.

Theorem 5.1. Suppose $S_2$, $R_2$ and $U_2$ are as in Algorithm 2. If $C_2 \in \mathbb{R}^{\log d\,(\log\log d)\,\mathrm{poly}(k/\epsilon) \times n}$ is a Cauchy matrix, $G_2 \in \mathbb{R}^{\log d\,(\log\log d)\,\mathrm{poly}(k/\epsilon) \times \log d\,(\log\log d)\,\mathrm{poly}(k/\epsilon)}$ is a matrix of appropriately scaled i.i.d. Gaussians (as in Fact 4.1), and
$$\hat X' = \mathop{\mathrm{argmin}}_{X : \mathrm{rank}(X) \leq k} \|C_2 A S_2^\top G_2 - C_2 A R_2^\top X U_2 S_2^\top G_2\|_{\mathrm{med},1},$$
then with probability 99/100,
$$\|AS_2^\top G_2 - AR_2^\top \hat X' U_2 S_2^\top G_2\|_{1,1} \;\leq\; (1+\epsilon) \min_{X : \mathrm{rank}(X) \leq k} \|AS_2^\top G_2 - AR_2^\top X U_2 S_2^\top G_2\|_{1,1}.$$

Proof. The following fact is known.

Fact 5.1 (Lemma F.1 from [1]). Let $L$ be a $t$-dimensional subspace of $\mathbb{R}^s$. Let $C \in \mathbb{R}^{m \times s}$ be a matrix with $m = \mathrm{poly}(t/\epsilon)$ rows and i.i.d. standard Cauchy entries. With probability 99/100, for all $x \in L$ we have
$$(1-\epsilon)\,\|x\|_1 \;\leq\; \|Cx\|_{\mathrm{med}} \;\leq\; (1+\epsilon)\,\|x\|_1.$$

The theorem statement is simply the lemma applied to $L = \mathrm{ColSpan}([AS_2^\top \mid AR_2^\top])$.

5.2 Solving Small Instances

Given problems of the form $\hat X' = \mathop{\mathrm{argmin}}_{X : \mathrm{rank}(X) \leq k} \|Y - ZXW\|_{\mathrm{med},1}$, we leverage an algorithm for checking the feasibility of a system of polynomial inequalities as a black box.

Lemma 5.1 ([2]). Given a set $K = \{\beta_1, \ldots, \beta_s\}$ of polynomials of degree $d$ in $k$ variables with coefficients in $\mathbb{R}$, the problem of deciding whether there exist $X_1, \ldots, X_k \in \mathbb{R}$ for which $\beta_i(X_1, \ldots, X_k) \geq 0$ for all $i \in [s]$ can be solved deterministically with $(sd)^{O(k)}$ arithmetic operations over $\mathbb{R}$.

Theorem 5.2. Fix any $\epsilon \in (0, 1)$ and $k \in [0, \min(m_1, m_2)]$. Let $Y \in \mathbb{R}^{n \times m'}$, $Z \in \mathbb{R}^{n \times m_1}$, and $W \in \mathbb{R}^{m_2 \times m'}$ be any matrices. Let $C \in \mathbb{R}^{m \times n}$ be a matrix of i.i.d. Cauchy random variables, and let $G \in \mathbb{R}^{m' \times m'\,\mathrm{poly}(1/\epsilon)}$ be a matrix of scaled i.i.d. Gaussian random variables. Then, conditioned on $C$ satisfying Fact 5.1 for the adjoined matrix $[Y, Z]$ and $G$ satisfying the condition of Fact 4.1, a rank-$k$ projection matrix $X$ can be found that minimizes $\|C(Y - ZXW)G\|_{\mathrm{med},1}$ up to a $(1+\epsilon)$ factor in time $\mathrm{poly}(m\,m'/\epsilon)^{O(\bar m k)} + (m + m')\,\mathrm{poly}(1/\epsilon)$, where $\bar m = \max(m_1, m_2)$.

We remark that if, as we do in our algorithm, we set all of the parameters $m$, $m'$ and $\bar m$ to be $\log d\,(\log\log d)\,\mathrm{poly}(k/\epsilon)$, we can write the runtime of this step (line 10 of Algorithm 2) as $(n+d)\,\mathrm{poly}(k/\epsilon) + \exp(\mathrm{poly}(k/\epsilon))$. If $\mathrm{poly}(k/\epsilon) \leq \log d/(\log\log d)^2$, then this step is captured in the $(n + d)\,\mathrm{poly}(k/\epsilon)$ term; otherwise this step is captured in the $\exp(\mathrm{poly}(k/\epsilon))$ term.

6 Experiments

In this section we empirically demonstrate the effectiveness of COARSEAPPROX compared to the truncated SVD. We experiment on synthetic and real world data sets. Since the algorithm is randomized, we run it 20 times and take the best performing run. For a fair comparison, we use an input sparsity time approximate SVD as in [4].

For the synthetic data, we use two example matrices, both of dimension 1000 × 100. In Figure 1a we use a rank-3 matrix with additional large outlier noise. First we sample a random 1000 × 3 matrix $U$ and a random 3 × 100 matrix $V$. Then we create a random sparse matrix $W$ in which each entry is zero with probability 0.9999, and each non-zero entry is scaled by a uniform random variable between 0 and $10000n$. We use $10UV + W$. In Figure 1b we create a simple rank-2 matrix with a large outlier: the first row is $n$ followed by all zeros, and all subsequent rows are $0$ followed by all ones.

Figure 1: Comparison of Algorithm 1 on synthetic and real world examples. (a) Random rank-3 matrix plus large outliers, (b) large outlier rank-2 matrix, (c) Glass, (d) E. Coli.

While the approximation guarantee of COARSEAPPROX is weak, we find that it performs well against the SVD baseline in practice on our examples, namely when the data has large outlier rows. The second example in particular serves as a good demonstration of the robustness of the $(2,1)$-norm to outliers in comparison to the Frobenius norm. When $k = 1$, the truncated SVD, which is the Frobenius norm minimizer, recovers the first row of large magnitude, whereas our algorithm recovers the subsequent rows. Note that both our algorithm and the SVD recover the matrix exactly when $k$ is greater than or equal to the rank.

We have additionally compared our algorithm against the SVD on two real world datasets from the UCI Machine Learning Repository: Glass is a 214 × 9 matrix representing attributes of glass samples, and E.Coli is a 336 × 7 matrix representing attributes of various proteins. For this set of experiments, we use a heuristic extension of our algorithm that performs well in practice. After running COARSEAPPROX, we iterate solving $Y_t = \mathop{\mathrm{argmin}}_{Y} \|CAS^\top G - Y Z_{t-1}\|_{1,1}$ and $Z_t = \mathop{\mathrm{argmin}}_{Z} \|CAS^\top G - Y_t Z\|_{1,1}$ (via Iteratively Reweighted Least Squares for speed). Finally we output the rank-$k$ Frobenius minimizer constrained to $\mathrm{RowSpace}(Y_t Z_t)$. In Figure 1c we consistently outperform the SVD by between 5% and 15% for nearly all $k$, and nearly match the SVD otherwise. In Figure 1d we are worse than the SVD by no more than 5% for $k = 1$ to $4$, and beat the SVD by up to 50% for $k = 5$ and $6$. We have additionally implemented a greedy column selection algorithm, which performs worse than the SVD on all of our datasets.
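The alternating IRLS heuristic used for the real-world experiments can be sketched as follows. This is our own minimal illustration of alternating entrywise-$\ell_1$ factorization on a generic matrix $M$ (standing in for the sketched matrix $CAS^\top G$); the function name, initialization, and small ridge term are our choices.

```python
import numpy as np

def irls_l1_factorization(M, k, n_iter=20, eps=1e-6, seed=0):
    """Alternating IRLS heuristic for min_{Y,Z} ||M - Y Z||_{1,1},
    with Y of shape (n, k) and Z of shape (k, d)."""
    rng = np.random.default_rng(seed)
    n, d = M.shape
    Z = rng.normal(size=(k, d))
    Y = np.zeros((n, k))
    ridge = 1e-9 * np.eye(k)
    for _ in range(n_iter):
        # Entrywise IRLS weights: w_ij = 1 / max(|residual_ij|, eps).
        W = 1.0 / np.maximum(np.abs(M - Y @ Z), eps)
        # Update each row of Y by a weighted least squares fit against Z.
        for i in range(n):
            Y[i] = np.linalg.solve((Z * W[i]) @ Z.T + ridge, (Z * W[i]) @ M[i])
        W = 1.0 / np.maximum(np.abs(M - Y @ Z), eps)
        # Update each column of Z by a weighted least squares fit against Y.
        for j in range(d):
            Yw = Y * W[:, j][:, None]
            Z[:, j] = np.linalg.solve(Yw.T @ Y + ridge, Yw.T @ M[:, j])
    return Y, Z

# Example: factor a noisy low-rank matrix with a few large outlier entries.
rng = np.random.default_rng(1)
M = rng.normal(size=(100, 3)) @ rng.normal(size=(3, 30))
M[rng.integers(0, 100, 20), rng.integers(0, 30, 20)] += 50.0
Y, Z = irls_l1_factorization(M, k=3)
print(np.abs(M - Y @ Z).sum())   # entrywise l1 error after the heuristic
```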
Acknowledgements. We would like to thank Ainesh Bakshi for many helpful discussions. D. Woodruff acknowledges partial support from the National Science Foundation under Grant No. CCF-1815840. Part of this work was also done while D. Woodruff was visiting the Simons Institute for the Theory of Computing.

References

[1] Arturs Backurs, Piotr Indyk, Ilya P. Razenshteyn, and David P. Woodruff. Nearly-optimal bounds for sparse recovery in generic norms, with applications to k-median sketching. In SODA, 2016.
[2] Saugata Basu, Richard Pollack, and Marie-Françoise Roy. On the combinatorial and algebraic complexity of quantifier elimination. J. ACM, 1994.
[3] Kenneth L. Clarkson and David P. Woodruff. Numerical linear algebra in the streaming model. In Proceedings of the 41st Annual ACM Symposium on Theory of Computing, STOC 2009, Bethesda, MD, USA, May 31 - June 2, 2009, pages 205-214, 2009.
[4] Kenneth L. Clarkson and David P. Woodruff. Low rank approximation and regression in input sparsity time. In Proceedings of the Forty-Fifth Annual ACM Symposium on Theory of Computing, STOC '13, pages 81-90, New York, NY, USA, 2013. ACM.
[5] Kenneth L. Clarkson and David P. Woodruff. Input sparsity and hardness for robust subspace approximation. In 2015 IEEE 56th Annual Symposium on Foundations of Computer Science, pages 310-329, 2015.
[6] Kenneth L. Clarkson and David P. Woodruff. Sketching for M-estimators: A unified approach to robust regression. In SODA, 2015.
[7] Amit Deshpande, Madhur Tulsiani, and Nisheeth K. Vishnoi. Algorithms and hardness for subspace approximation. In SODA, 2011.
[8] Amit Deshpande and Kasturi R. Varadarajan. Sampling-based dimension reduction for subspace approximation. In Proceedings of the 39th Annual ACM Symposium on Theory of Computing, San Diego, California, USA, June 11-13, 2007, pages 641-650, 2007.
[9] Amit Deshpande and Kasturi R. Varadarajan. Sampling-based dimension reduction for subspace approximation. In STOC, 2007.
[10] Chris H. Q. Ding, Ding Zhou, Xiaofeng He, and Hongyuan Zha. R1-PCA: rotational invariant L1-norm principal component analysis for robust subspace factorization. In ICML, 2006.
[11] Dan Feldman and Michael Langberg. A unified framework for approximating and clustering data. In STOC, 2011.
[12] Dan Feldman, Morteza Monemizadeh, Christian Sohler, and David P. Woodruff. Coresets and sketches for high dimensional subspace approximation problems. In Proceedings of the Twenty-First Annual ACM-SIAM Symposium on Discrete Algorithms, SODA 2010, Austin, Texas, USA, January 17-19, 2010, pages 630-649, 2010.
[13] Dan Feldman, Morteza Monemizadeh, Christian Sohler, and David P. Woodruff. Coresets and sketches for high dimensional subspace approximation problems. In SODA, 2010.
[14] Mina Ghashami, Edo Liberty, Jeff M. Phillips, and David P. Woodruff. Frequent directions: Simple and deterministic matrix sketching. SIAM J. Comput., 45(5):1762-1792, 2016.
[15] Venkatesan Guruswami, Prasad Raghavendra, Rishi Saket, and Yi Wu. Bypassing UGC from some optimal geometric inapproximability results. ACM Trans. Algorithms, 12:6:1-6:25, 2010.
[16] P. Indyk. Algorithmic applications of low-distortion geometric embeddings. In Proceedings of the 42nd IEEE Symposium on Foundations of Computer Science, FOCS '01, Washington, DC, USA, 2001. IEEE Computer Society.
[17] Ravi Kannan, Santosh Vempala, and David P. Woodruff. Principal component analysis and higher correlations for distributed data. In Proceedings of The 27th Conference on Learning Theory, COLT 2014, Barcelona, Spain, June 13-15, 2014, pages 1040-1057, 2014.
[18] Morteza Monemizadeh and David P. Woodruff. 1-pass relative-error lp-sampling with applications. In SODA, 2010.
[19] S. Muthukrishnan. Data streams: Algorithms and applications. Foundations and Trends in Theoretical Computer Science, 1(2), 2005.
[20] Nariankadu D. Shyamalkumar and Kasturi R. Varadarajan. Efficient subspace approximation algorithms. Discrete & Computational Geometry, 47(1):44-63, 2012.
[21] Zhao Song, David P. Woodruff, and Peilin Zhong. Low rank approximation with entrywise l1-norm error. CoRR, abs/1611.00898, 2016.
[22] David P. Woodruff. Sketching as a tool for numerical linear algebra. Foundations and Trends in Theoretical Computer Science, 10(1-2):1-157, 2014.