# Single Pass Entrywise-Transformed Low Rank Approximation

Yifei Jiang¹, Yi Li², Yiming Sun², Jiaxin Wang³, David P. Woodruff⁴

In applications such as natural language processing or computer vision, one is given a large $n\times d$ matrix $A = (a_{i,j})$ and would like to compute a matrix decomposition, e.g., a low rank approximation, of a function $f(A) = (f(a_{i,j}))$ applied entrywise to $A$. A very important special case is the likelihood function $f(A) = \log(|a_{ij}| + 1)$. A natural way to do this would be to simply apply $f$ to each entry of $A$, and then compute the matrix decomposition, but this requires storing all of $A$ as well as multiple passes over its entries. Recent work of Liang et al. shows how to find a rank-$k$ factorization to $f(A)$ for an $n\times n$ matrix $A$ using only $n\cdot\operatorname{poly}(\epsilon^{-1}k\log n)$ words of memory, with overall error $10\,\|f(A)-[f(A)]_k\|_F^2 + \operatorname{poly}(\epsilon/k)\,\|f(A)\|_{1,2}^2$, where $[f(A)]_k$ is the best rank-$k$ approximation to $f(A)$ and $\|f(A)\|_{1,2}^2$ is the square of the sum of Euclidean lengths of rows of $f(A)$. Their algorithm uses three passes over the entries of $A$. The authors pose the open question of obtaining an algorithm with $n\cdot\operatorname{poly}(\epsilon^{-1}k\log n)$ words of memory using only a single pass over the entries of $A$. In this paper we resolve this open question, obtaining the first single-pass algorithm for this problem and for the same class of functions $f$ studied by Liang et al. Moreover, our error is $\|f(A)-[f(A)]_k\|_F^2 + \operatorname{poly}(\epsilon/k)\,\|f(A)\|_F^2$, where $\|f(A)\|_F^2$ is the sum of squares of Euclidean lengths of rows of $f(A)$. Thus our error is significantly smaller, as it removes the factor of 10 and also $\|f(A)\|_F^2 \le \|f(A)\|_{1,2}^2$. We also give an algorithm for regression, pointing out an error in previous work, and empirically validate our results.

¹Tianjin University, China. ²School of Physical and Mathematical Sciences, Nanyang Technological University, Singapore. ³Wuhan University of Technology, China. ⁴Department of Computer Science, Carnegie Mellon University, USA. Correspondence to: Yi Li, David P. Woodruff.

Proceedings of the 38th International Conference on Machine Learning, PMLR 139, 2021. Copyright 2021 by the author(s).

## 1. Introduction

There are numerous applications with matrices that are too large to fit in main memory, such as matrices arising in machine learning, image clustering, natural language processing, network analysis, and recommendation systems. This makes it difficult to process such matrices, and one common way of doing so is to stream through the entries of the matrix one at a time and maintain a short summary or sketch that allows for further processing. In some of these applications, such as network analysis or distributed recommendation systems, matters are further complicated because one would like to take the difference or sum of two or more matrices over time.

A common goal in such applications is to compute a low rank approximation to a large matrix $A\in\mathbb{R}^{n\times d}$. If the rank of the low rank approximation is $k$, then one can approximate $A$ as $U\cdot V$, where $U\in\mathbb{R}^{n\times k}$ and $V\in\mathbb{R}^{k\times d}$. This results in a parameter reduction, as $U$ and $V$ only have $(n+d)k$ parameters in total, as compared to the $nd$ parameters required of $A$. Since $k\ll\min(n,d)$, this parameter reduction is significant. Not only does it result in much smaller storage; when multiplying $A$ by a vector $x$, it also now only takes $O((n+d)k)$ time instead of $O(nd)$ time, since one can first compute $Vx$ and then $U(Vx)$.
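To make the savings concrete, here is a minimal NumPy sketch (ours, with arbitrary illustrative sizes) of the factored matrix-vector product just described.

```python
import numpy as np

# Minimal illustration of the savings from a rank-k factorization:
# multiplying by A costs O(nd), while multiplying by U (V x) costs O((n + d) k).
n, d, k = 5000, 3000, 10          # illustrative sizes, k << min(n, d)
rng = np.random.default_rng(0)
U = rng.standard_normal((n, k))   # n x k factor
V = rng.standard_normal((k, d))   # k x d factor
x = rng.standard_normal(d)

y_fast = U @ (V @ x)              # O((n + d) k) work
y_slow = (U @ V) @ x              # forms the n x d matrix explicitly
assert np.allclose(y_fast, y_slow)
```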
A challenge in the above applications is that one often wants to compute a low rank approximation not to $A$ itself, but to an entrywise transformation of $A$ by a function $f$. Namely, if $A = (a_{i,j})$, then we define $f(A) = (f(a_{i,j}))$, where we apply the function $f$ to each entry of $A$. Common functions $f$ include $f(x) = \log_2^c(|x|+1)$ and $f(x) = |x|^\alpha$ for $0\le\alpha\le 2$. Indeed, for word embeddings in natural language processing (NLP), an essential subroutine is to perform a low rank approximation of a matrix after applying the log-likelihood function to each entry, which corresponds to $f(x) = \log_2(|x|+1)$. Note that in NLP the input matrices are often word co-occurrence count matrices, which can be created, e.g., from the entire Wikipedia database. Thus, such matrices are huge, with millions of rows and columns, and hard to store in memory. This necessitates models such as the streaming model for processing such data.

We can indeed capture the above scenarios formally with the streaming model of computation. In this model, there is a large underlying matrix $A$, and we see a long sequence of updates to its coordinates in the form $(i, j, \delta)$ with $\delta\in\{-1, +1\}$, representing the update $A_{i,j} \leftarrow A_{i,j} + \delta$. Each pass over the data stream is very expensive, and thus one would like to minimize the number of such passes. Also, one would like to use as little memory as possible to compute a low rank approximation of the transformed matrix $f(A)$ in this model. In this paper we will consider approximately optimal low rank approximations, meaning factorizations $U\cdot V$ for which
$$\|U\cdot V - f(A)\|_F^2 \le \|[f(A)]_k - f(A)\|_F^2 + \operatorname{poly}(\epsilon/k)\,\|f(A)\|_F^2,$$
where $[f(A)]_k$ is the optimal rank-$k$ matrix approximating $f(A)$ in Frobenius norm. Recall that the Frobenius norm $\|B\|_F$ of a matrix $B$ is defined to be $\bigl(\sum_{i,j} B_{i,j}^2\bigr)^{1/2}$, which is the entrywise Euclidean norm of $B$.

Although there is a body of work in the streaming model on computing low rank approximations of matrices (Boutsidis et al., 2016; Clarkson & Woodruff, 2009; Drineas et al., 2012; Upadhyay, 2014; Woodruff, 2014), such methods no longer apply in our setting due to the non-linearity of the function $f$. Indeed, a number of existing methods are based on dimensionality reduction, or sketching, whereby one stores $S\cdot A$ for a random matrix $S$ with a small number of rows. If there were no entrywise transformation applied to $A$, then given an update $A_{i,j} \leftarrow A_{i,j} + \delta$, one could simply update the sketch $(S\cdot A)_j \leftarrow (S\cdot A)_j + \delta\cdot S_i$, where $(S\cdot A)_j$ denotes the $j$-th column of $S\cdot A$ and $S_i$ denotes the $i$-th column of $S$. However, given an entrywise transformation $f$, which may be non-linear, and given that we may see multiple updates to an entry of $A$, e.g., in the difference of two datasets, it is not clear how to maintain $S\cdot f(A)$ in a stream.
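To make the obstruction concrete, the following small sketch (ours, with illustrative dimensions and a random sign matrix for $S$) maintains the linear sketch $S\cdot A$ under turnstile updates exactly as described above; the closing comment notes why no analogous one-line update exists for $S\cdot f(A)$.

```python
import numpy as np

# Maintaining S @ A under turnstile updates (i, j, delta) by linearity.
n, d, m = 1000, 800, 50
rng = np.random.default_rng(1)
S = rng.choice([-1.0, 1.0], size=(m, n)) / np.sqrt(m)  # random sign sketch
SA = np.zeros((m, d))                                  # maintained sketch of A

def update(i, j, delta):
    # (S A)_j <- (S A)_j + delta * S_i, the linear-sketch update from the text
    SA[:, j] += delta * S[:, i]

A = np.zeros((n, d))
for (i, j, delta) in [(3, 7, 1.0), (3, 7, 1.0), (5, 2, -1.0)]:
    update(i, j, delta)
    A[i, j] += delta
assert np.allclose(SA, S @ A)
# No analogous update exists for S @ f(A): the change
# f(A[i, j] + delta) - f(A[i, j]) depends on the current (unstored) value of A[i, j].
```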
A natural question is: can we compute a low-rank approximation to $f(A)$ in the streaming model with a small number of passes, ideally one, and a small amount of memory, ideally $n\cdot\operatorname{poly}(k/\epsilon)$ memory? Motivated by the applications above, this question was asked by Liang et al. (2020); see also earlier work which studies entrywise low rank approximation in the distributed model (Woodruff & Zhong, 2016). The work of Liang et al. (2020) studies the function $f(x) = \log_2(|x|+1)$ and gives a three-pass algorithm for $n\times n$ matrices $A$ achieving $n\cdot\operatorname{poly}(\epsilon^{-1}k\log n)$ memory and outputting factors $U$ and $V$ with the following error guarantee:
$$\|U\cdot V - f(A)\|_F^2 \le 10\,\|[f(A)]_k - f(A)\|_F^2 + \operatorname{poly}(\epsilon/k)\,\|f(A)\|_{1,2}^2,$$
where for a matrix $B$, $\|B\|_{1,2}$ is the sum of the Euclidean lengths of its columns.

We note that this error guarantee is considerably weaker than what we would like, as there is a multiplicative factor of 10 and an additive error that depends on $\|f(A)\|_{1,2}$. Using the relationship between the 1-norm and the 2-norm, $\|f(A)\|_{1,2}$ can be as large as $\sqrt{n}\,\|f(A)\|_F$, and so their additive error can be a factor of $n$ larger than what is desired. Also, although the memory is of the desired order, the fact that the algorithm requires three passes can significantly slow it down. Moreover, when data is crawled from the internet, e.g., in applications of network traffic, it may be impractical to store the entire data set (Durme & Lall, 2009). Therefore, in these settings it is impossible to make more than one pass over the data. Liang et al. (2020) say "Whether there exists a one-pass algorithm is still an open problem, and is left for future work."

### 1.1. Our Contributions

In this paper, we resolve this main open question of (Liang et al., 2020), obtaining a one-pass algorithm achieving $(n+d)\cdot\operatorname{poly}(\epsilon^{-1}k\log n)$ memory for outputting a low rank approximation for the function $f(x) = \log_2(|x|+1)$, and achieving the stronger error guarantee
$$\|U\cdot V - f(A)\|_F^2 \le \|[f(A)]_k - f(A)\|_F^2 + \operatorname{poly}(\epsilon/k)\,\|f(A)\|_F^2.$$
We note that the $\operatorname{poly}(\epsilon/k)$ factor in both the algorithm of (Liang et al., 2020) and our algorithm can be made arbitrarily small by increasing the memory by a $\operatorname{poly}(k/\epsilon)$ factor, and thus it suffices to consider error of the form $\|[f(A)]_k - f(A)\|_F^2 + \epsilon\,\|f(A)\|_F^2$. We also note that our algorithm can be trivially adapted to rectangular matrices, so for ease of notation we focus on the case $n = d$.

At a conceptual level, the algorithm of (Liang et al., 2020) uses one pass to obtain so-called approximate leverage scores of $f(A)$, then a second pass to sample columns of $f(A)$ according to these, and finally a third pass to do so-called adaptive sampling. In contrast, we observe that one can just do squared column-norm sampling of $f(A)$ to obtain the above error guarantee, which is a common method for low rank approximation of $A$. However, in one pass it is not possible to sample actual columns of $A$ or of $f(A)$ according to these probabilities, so we build a data structure to sample noisy columns according to approximations of their squared norms in a single pass. This is related to block $\ell_2$-sampling in a stream; see, e.g., (Mahabadi et al., 2020). However, the situation here is complicated by the fact that we must sample according to the sum of squares of $f$ values of the entries in a column of $A$, rather than the squared length of the column of $A$ itself. The non-linearity of the transformation function $f$ makes many of the techniques considered in (Mahabadi et al., 2020) inapplicable. To this end we build new hashing and sub-sampling data structures, generalizing data structures for length or squared-length sampling from (Andoni et al., 2009; 2011; Levin et al., 2018), and we give a novel analysis for sampling noisy columns of $A$ proportionally to the sum of squares of $f$ values of their entries.

Additionally, we empirically validate our algorithm on a real-world data set, demonstrating significant advantages over the algorithm of Liang et al. (2020) in practice. Finally, we apply our new sampling techniques to the regression problem, showing that our techniques are more broadly applicable. Although Liang et al. (2020) claim a result for regression, we point out an error in their analysis¹, showing that their algorithm obtains larger error than claimed, and the error of our regression algorithm is considerably smaller.

¹We have confirmed this with the authors.
All omitted proofs can be found in the full version of this paper, which is given in the supplementary material.

## 2. Preliminaries

Notation. We use $[n]$ to denote the set $\{1, 2, \ldots, n\}$. For a vector $x\in\mathbb{R}^n$, we denote by $|x|$ the vector whose $i$-th entry is $|x_i|$. For a matrix $A\in\mathbb{R}^{N\times N}$, let $A_{i,*}$ be the $i$-th row of $A$ and $A_{*,j}$ be the $j$-th column of $A$. We also sometimes abbreviate $A_{i,*}$ or $A_{*,i}$ as $A_i$. We also define the norms $\|A\|_{p,q} = \bigl(\sum_i \|A_{i,*}\|_q^p\bigr)^{1/p}$, of which the Frobenius norm $\|A\|_F = \|A\|_{2,2}$ is a special case. Note that $\|A\|_{p,q} \ne \|A\|_{q,p}$ in general. We use $\sigma_i(A)$ to denote the $i$-th singular value of $A$ and $[A]_k$ to denote the best rank-$k$ approximation to $A$. For a function $f$, let $f(A)$ denote the entrywise-transformed matrix $(f(A))_{ij} = f(A_{ij})$. For a matrix $A$, we denote by $\|A\|_2$ its operator norm, which is equal to its largest singular value.

Low-rank Approximation. Given a matrix $M\in\mathbb{R}^{n\times n}$, an integer $k\le n$ and an accuracy parameter $\epsilon > 0$, our goal is to output an $n\times k$ matrix $L$ with orthonormal columns for which
$$\|M - LL^\top M\|_F^2 \le \|M - [M]_k\|_F^2 + \epsilon\,\|M\|_F^2,$$
where $[M]_k = \arg\min_{\operatorname{rank}(M')\le k}\|M - M'\|_F^2$. Thus, $LL^\top$ provides a rank-$k$ matrix to project the columns of $M$ onto. The rows of $L^\top M$ can be thought of as the latent features in applications, and the rank-$k$ matrix $LL^\top M$ can be factored as $L\cdot(L^\top M)$, where $L^\top M$ is a $k\times n$ matrix and $L$ is an $n\times k$ matrix.

Best Low-rank Approximation. Consider the singular value decomposition $G = \sum_{t=1}^r \sigma_t u_t v_t^\top$, where $\sigma_1 \ge \sigma_2 \ge \cdots \ge \sigma_r > 0$ are the nonzero singular values of $G$, and $\{u_t\}$ and $\{v_t\}$ are orthonormal sets of column vectors such that $G^\top u_t = \sigma_t v_t$ and $G v_t = \sigma_t u_t$ for each $t\le r$. The vectors $\{u_t\}$ and $\{v_t\}$ are called the left and the right singular vectors of $G$, respectively. By the Eckart-Young theorem, for any rotationally invariant norm $\|\cdot\|$, the matrix $D_k$ that minimizes $\|G - D\|$ among all matrices $D$ of rank at most $k$ is given by $D_k = \sum_{t=1}^k G v_t v_t^\top$. This implies that $\|G - D_k\|_F^2 = \sum_{t=k+1}^r \sigma_t^2$.

Useful Inequalities. The first one is the matrix Bernstein inequality.

Lemma 1 (Matrix Bernstein). Let $X_1,\ldots,X_s$ be $s$ independent copies of a symmetric random matrix $X\in\mathbb{R}^{d\times d}$ with $\mathbb{E}(X) = 0$, $\|X\|_2 \le \rho$ and $\|\mathbb{E}(X^2)\|_2 \le \sigma^2$. Let $W = \frac{1}{s}\sum_{i=1}^s X_i$. For any $\epsilon > 0$, it holds that
$$\Pr\{\|W\|_2 > \epsilon\} \le 2d\cdot e^{-s\epsilon^2/(\sigma^2 + \rho\epsilon/3)}.$$

Below we list a few useful inequalities regarding the function $f(x) = \log(1+|x|)$.

Proposition 2. For $x > 0$, it holds that $\ln(1+x) > x/(1+2x)$.

Proposition 3. It holds for all $x, y\in\mathbb{R}$ and all $a\ge 1$ that $f(x+y) \le f(x) + f(y)$ and $f(ax) \le a f(x)$. As a consequence, for $x, y\in\mathbb{R}^n$ it holds that $\|f(x+y)\|_2^2 \le \bigl(\|f(x)\|_2 + \|f(y)\|_2\bigr)^2$.

Proposition 4. It holds for all $x, y\ge 0$ that $f\bigl(\sqrt{x^2+y^2}\bigr)^2 \le f(x)^2 + f(y)^2$.

Lemma 5. Let $a_1,\ldots,a_m$ be real numbers and $\epsilon_1,\ldots,\epsilon_m$ be 4-wise independent random variables on the set $\{-1,1\}$ (i.e., Rademacher random variables). It holds that
$$\mathbb{E}\,f\Bigl(\Bigl|\sum_{i=1}^m \epsilon_i a_i\Bigr|\Bigr)^2 \le C\sum_{i=1}^m f(|a_i|)^2,$$
where $C > 0$ is an absolute constant.

Lemma 6. For arbitrary vectors $y, z\in\mathbb{R}^n$ such that $\|f(y)\|_2^2 \ge \xi^{-2}\,\|f(z)\|_2^2$ for some $\xi\in(0,1)$, it holds that
$$(1 - 3\xi^{2/3})\,\|f(y)\|_2^2 \le \|f(y+z)\|_2^2 \le (1 + 3\xi)\,\|f(y)\|_2^2.$$
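As a quick numerical illustration (ours, not part of the paper), the following snippet spot-checks the scalar inequalities of Propositions 3 and 4 on random inputs; it is a sanity check rather than a proof.

```python
import numpy as np

# Spot-check of f(x + y) <= f(x) + f(y) (Proposition 3) and
# f(sqrt(x^2 + y^2))^2 <= f(x)^2 + f(y)^2 (Proposition 4) for f(x) = log(1 + |x|).
f = lambda x: np.log1p(np.abs(x))

rng = np.random.default_rng(0)
x, y = rng.standard_normal(10**6), rng.standard_normal(10**6)
print(np.max(f(x + y) - (f(x) + f(y))))                 # should be <= 0 (up to rounding)

a, b = np.abs(x), np.abs(y)                             # Proposition 4 assumes x, y >= 0
print(np.max(f(np.sqrt(a**2 + b**2))**2 - (f(a)**2 + f(b)**2)))   # should be <= 0
```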
## 3. Algorithm

Our algorithm uses two important subroutines: a subsampling data structure called an H-Sketch, and a sketch, called LogSum, for approximating the inner product of a transformed vector and a raw vector. The former is inspired by a subsampling algorithm of (Levin et al., 2018) and is meant to sample a noisy approximation to a column from a distribution which is close to the desired distribution. In fact, one can show that it is impossible to sample the actual columns in a single pass; hence, we have to resort to noisy approximations and show that they suffice. The latter LogSum sketch is the same as in (Liang et al., 2020), and approximates the inner product $\langle f(x), y\rangle$ for vectors $x, y$. Executing these sketches in parallel is highly non-trivial, since the subsampling algorithm of (Levin et al., 2018) samples columns of $A$ according to their $\ell_2$ norms, but here we must sample them according to the squares of their $\ell_2$ norms after applying $f$ to each entry.

Roughly speaking, the above combination gives us a small set of $\operatorname{poly}(k/\epsilon)$ noisy columns of $f(A)$, sampled approximately according to the squared $\ell_2$ norms of the columns of $f(A)$, after which we can appeal to the squared column-norm sampling results for low rank approximation in (Frieze et al., 2004), which argue that if one then computes the top-$k$ left singular vectors of $f(A)$, forming the columns of the $n\times k$ matrix $L$, then $LL^\top f(A)$ is a good rank-$k$ approximation to $f(A)$.

The final output of the low-rank approximation will be two factors, $L$ and $L^\top f(A)$. The algorithm in (Liang et al., 2020) first computes $L$ by an involved algorithm in three passes, and then computes $L^\top f(A)$ in another pass using LogSum sketches. Our algorithm follows the same outline, but we shall demonstrate how to compute $L$ in only one pass, which is our sole focus for low-rank approximation in this paper. Note that our ultimate goal, which we only achieve approximately, is to sample columns of $f(A)$ proportionally to their squared $\ell_2$ norms. This is a fundamentally different sampling scheme from that of (Liang et al., 2020), which performs leverage score sampling followed by adaptive sampling, neither of which is amenable to a one-pass implementation.

### 3.1. H-Sketch

We first present a basic heavy hitter substructure in Algorithm 1, and a complete heavy hitter structure in Algorithm 2, obtained by repeating the basic structure $R$ times. The complete heavy hitter structure supports a query function.

Algorithm 1: Basic heavy hitter substructure
Input: $A\in\mathbb{R}^{n\times n}$, $\nu$, $\phi$
Output: a data structure $H$
1: $w \leftarrow O(1/(\phi^2\nu^3))$
2: Prepare a pairwise independent hash function $h: [n]\to[w]$
3: Prepare 4-wise independent random signs $\{\epsilon_i\}_{i=1}^n$
4: Prepare a hash table $H$ with $w$ buckets, where each bucket stores a vector in $\mathbb{R}^n$
5: for each $v\in[w]$ do
6:   $H_v \leftarrow \sum_{i\in h^{-1}(v)} \epsilon_i A_i$
7: end for
8: return $H$

Algorithm 2: Complete heavy hitter structure $D$
Input: $A\in\mathbb{R}^{n\times n}$, $\nu$, $\phi$, $\delta$
Output: a data structure $H$
1: $R \leftarrow O(\log(n/\delta))$
2: for each $r\in[R]$ do
3:   Initialize a basic substructure $H^{(r)}$ (Algorithm 1) with parameters $\nu$ and $\phi$
4: end for
5: function QUERY($i$)
6:   for each $r\in[R]$ do
7:     $v_r \leftarrow H^{(r)}_{h_r(i)}$, where $h_r$ is the hash function in $H^{(r)}$
8:   end for
9:   $r^* \leftarrow$ the index $r$ of the median of $\{\|f(v_r)\|_2\}_{r\in[R]}$
10:  return $v_{r^*}$
11: end function
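To make the data structure concrete, here is a minimal, non-streaming Python sketch (ours) of the idea behind Algorithms 1 and 2; for simplicity it uses fully random hash functions and signs in place of the pairwise and 4-wise independent families assumed in the paper.

```python
import numpy as np

# Hash columns into w buckets with random signs, and estimate ||f(A_i)||_2^2 by
# the f-norm of the bucket containing i, taking a median over R repetitions.
f = lambda x: np.log1p(np.abs(x))

def build(A, w, R, rng):
    n = A.shape[1]
    tables = []
    for _ in range(R):
        h = rng.integers(0, w, size=n)            # bucket assignment per column
        eps = rng.choice([-1.0, 1.0], size=n)     # random sign per column
        H = np.zeros((A.shape[0], w))
        for i in range(n):                        # H_v = sum over h(i) = v of eps_i * A_i
            H[:, h[i]] += eps[i] * A[:, i]
        tables.append((h, eps, H))
    return tables

def query(tables, i):
    # Median over repetitions of ||f(bucket of i)||_2^2 (cf. QUERY in Algorithm 2).
    ests = [np.sum(f(H[:, h[i]]) ** 2) for (h, _, H) in tables]
    return np.median(ests)

rng = np.random.default_rng(0)
A = rng.standard_normal((200, 300))
A[:, 17] *= 50.0                                  # make column 17 heavy
tables = build(A, w=64, R=9, rng=rng)
print(query(tables, 17), np.sum(f(A[:, 17]) ** 2))  # estimate vs. true f-norm
```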
Below we analyze the guarantee of this heavy hitter data structure. Let $M = \|f(A)\|_F^2$. We define $I_\epsilon = \{i\in[n] : \|f(A_i)\|_2^2 \ge \epsilon M\}$, the set of indices of the $\epsilon$-heavy columns. Let $\alpha$ be a small constant to be determined later.

Lemma 7. With probability at least 0.9, all columns in $I_{\alpha\phi}$ are isolated from each other under $h$.

Lemma 8. For each $u\in[n]$, it holds with probability at least $2/3$ that
$$\Bigl\|f\Bigl(\sum_{i\notin I_{\alpha\phi}\cup\{u\}} \mathbf{1}_{\{h(i)=h(u)\}}\,\epsilon_i A_i\Bigr)\Bigr\|_2^2 \le \frac{3CM}{w},$$
where $C > 0$ is an absolute constant.

Lemma 9. Suppose that $\nu\in(0, 0.05]$ and $\alpha = 0.3/C > \beta$, where $C$ is the absolute constant in Lemma 8. With probability at least $1-\delta$, for all $i\in[n]$, the output $v_{r^*}$ of Algorithm 2 satisfies:
(a) $(1-\nu)\,\|f(A_i)\|_2^2 \le \|f(v_{r^*})\|_2^2 \le (1+\nu)\,\|f(A_i)\|_2^2$ for all $i\in I_\phi$;
(b) $\|f(v_{r^*})\|_2^2 \le 0.92\,\phi M$ for all $i\notin I_{\alpha\phi}$;
(c) $\|f(v_{r^*})\|_2^2 \le (1+\nu^{3/2})^2\,\phi M$.

Proof. Fix $u\in[n]$. With probability at least $0.9 - 1/3 > 0.5$, the events in Lemmata 7 and 8 both happen. Condition on those events. From the proof of Lemma 8, we know that the columns $i\in I_\phi$ do not collide with other elements of $I_{\alpha\phi}$. Hence, it follows from Lemma 6 (with $\xi^2 \le 3C/(\phi w) \le (\nu/3)^3$) that
$$(1-\nu)\,\|f(A_i)\|_2^2 \le \|f(H_{h(i)})\|_2^2 \le (1+\nu)\,\|f(A_i)\|_2^2,$$
provided that $w \ge 3^4 C/(\phi\nu^3)$. When $i\notin I_{\alpha\phi}$, we have $\|f(H_{h(i)})\|_2^2 \le (0.9+\nu^{3/2})\,\phi M \le 0.92\,\phi M$. When $i\in I_{\alpha\phi}\setminus I_\phi$, we have that $H_{h(i)}$ contains only column $i$ and columns from $[n]\setminus I_{\alpha\phi}$. Hence, by Proposition 3, $\|f(H_{h(i)})\|_2^2 \le (1+\nu^{3/2})^2\,\phi M$, provided that $w \ge 3C/(\phi\nu^3)$. Finally, repeating $O(\log(n/\delta))$ times, taking the median, and applying a union bound over all $n$ columns gives the claimed result.

Next we analyze the sampling algorithm, presented in Algorithm 3, which simulates sampling a column of $A$ according to the transformed column norms.

Algorithm 3: Sampling using the H-Sketch
Require: (i) an estimate $\widehat M$ such that $M \le \widehat M \le KM$; (ii) a complete heavy hitter structure $D_0$ with parameters $(O(1), O(\epsilon^3/(KL^3)), 1/(\hat L+1))$; (iii) $\hat L$ complete heavy hitter structures (see Algorithm 2), denoted $D_1,\ldots,D_{\hat L}$, where $D_j$ ($j\in[\hat L]$) has parameters $(O(1), O(\epsilon^3/L^3), 1/(\hat L+1))$ and is applied to the columns of $A$ downsampled at rate $2^{-j}$
1: $L \leftarrow \log(Kn/\epsilon)$, $\hat L \leftarrow \log n$
2: $\zeta \leftarrow$ a random variable uniformly distributed in $[1/2, 1]$
3: for $j = 0, \ldots, \hat L$ do
4:   $\Lambda_j \leftarrow$ top $\Theta(L^3/\epsilon^3)$ heavy hitters from $D_j$
5: end for
6: $j_0 \leftarrow \log(4K\epsilon^{-3}L^3)$
7: $\zeta \leftarrow$ uniform variable in $[1/2, 1]$
8: for $j = 1, \ldots, j_0$ do
9:   Let $\lambda^{(j)}_1, \ldots, \lambda^{(j)}_s$ be the elements of $\Lambda_0$ contained in $[(1+\epsilon)\zeta\widehat M/2^j,\ (2-\epsilon)\zeta\widehat M/2^j]$
10:  $\widetilde M_j \leftarrow |\lambda^{(j)}_1| + \cdots + |\lambda^{(j)}_s|$
11: end for
12: for $j = j_0+1, \ldots, L$ do
13:  Find the largest $\ell$ for which $\Lambda_\ell$ contains $s_j$ elements $\lambda^{(j)}_1, \ldots, \lambda^{(j)}_{s_j}$ in $[(1+\epsilon)\zeta\widehat M/2^j,\ (2-\epsilon)\zeta\widehat M/2^j]$ with $(1-20\epsilon)L^2/\epsilon^2 \le s_j \le 2(1+O(\epsilon))L^2/\epsilon^2$
14:  if such an $\ell$ exists then
15:    $\widetilde M_j \leftarrow \bigl(|\lambda^{(j)}_1| + \cdots + |\lambda^{(j)}_{s_j}|\bigr)\cdot 2^\ell$
16:    $W_j \leftarrow \Lambda_\ell$
17:  else
18:    $\widetilde M_j \leftarrow 0$
19:  end if
20: end for
21: $j^* \leftarrow$ sample from $[L]$ according to $\Pr(j^* = j) = \widetilde M_j / \sum_{j'} \widetilde M_{j'}$
22: $i^* \leftarrow$ sample from $W_{j^*}$ according to $\Pr(i^* = i) = |\lambda^{(j^*)}_i| / \widetilde M_{j^*}$
23: $v_{j^*,i^*} \leftarrow$ the vector returned by QUERY($i^*$) on $D_{j^*}$
24: return $v_{j^*,i^*}$
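The following offline illustration (ours) mimics the two-stage, level-set sampling that Algorithm 3 approximates in a stream; it uses exact column norms in place of the downsampled heavy hitter structures, so it conveys only the sampling logic, not the space bound.

```python
import numpy as np

# Partition columns into geometric weight levels, pick a level proportional to
# its total contribution, then pick a column within the level proportionally.
f = lambda x: np.log1p(np.abs(x))

def two_stage_sample(A, rng, L=32):
    w = np.sum(f(A) ** 2, axis=0)          # w_i = ||f(A_i)||_2^2 (exact, for illustration)
    M = w.sum()
    zeta = rng.uniform(0.5, 1.0)           # random threshold shift, as in Algorithm 3
    levels = [np.where((w > zeta * M / 2**j) & (w <= 2 * zeta * M / 2**j))[0]
              for j in range(L + 1)]
    contrib = np.array([w[S].sum() for S in levels])
    j = rng.choice(L + 1, p=contrib / contrib.sum())   # pick a level by contribution
    S = levels[j]
    return rng.choice(S, p=w[S] / w[S].sum())          # pick a column within the level

rng = np.random.default_rng(0)
A = rng.standard_normal((100, 500)) * rng.gamma(1.0, 2.0, size=500)
counts = np.bincount([two_stage_sample(A, rng) for _ in range(2000)], minlength=500)
# Empirically, column i is drawn roughly in proportion to ||f(A_i)||_2^2.
print(np.corrcoef(counts, np.sum(f(A) ** 2, axis=0))[0, 1])
```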
The following theorem is our guarantee.

Theorem 10. Let $\epsilon > 0$ be a sufficiently small constant. Algorithm 3 outputs $v_{j^*,i^*}$ which satisfies that, with probability at least 0.9, there exists $u\in[n]$ such that
$$(1-O(\epsilon))\,\|f(A_u)\|_2^2 \le \|f(v_{j^*,i^*})\|_2^2 \le (1+O(\epsilon))\,\|f(A_u)\|_2^2.$$
Furthermore, there exists an absolute constant $c\in(0,1/2]$ such that
$$\Pr\{u = i\} \ge c\,\frac{\|f(A_i)\|_2^2}{\|f(A)\|_F^2}$$
for all $i$ belonging to some set $I\subseteq[n]$ with $\sum_{i\in I}\|f(A_i)\|_2^2 \ge (1-6\epsilon)M$, provided that $\epsilon$ further satisfies $\epsilon \le c/C$ for some absolute constant $C > 0$.

Proof. The analysis of the algorithm is largely classical, for which we define the following notions: (1) $T_j = \zeta M/2^j$; (2) $S_j = \{i\in[n] : \|f(A_i)\|_2^2 \in (T_j, 2T_j]\}$ is the $j$-th level set of $A$; (3) a level $j\in[L]$ is important if $|S_j| \ge \epsilon 2^j/L$; (4) $J\subseteq[L]$ is the set of all important levels. It follows from the argument in (Li et al., 2021), or an argument similar to (Andoni et al., 2009), that the columns we miss contribute only an $O(\epsilon)$-fraction of the norm, and for each level $j\in\{1,\ldots,j_0\}\cup J$, each of the recovered columns $\lambda_i$ ($i\in[s_j]$) corresponds to some $u = u(i)\in S_j$ and satisfies
$$(1-O(\epsilon))\,\|f(A_{u(i)})\|_2^2 \le \lambda_i \le (1+O(\epsilon))\,\|f(A_{u(i)})\|_2^2.$$

Next we prove the second part. For a fixed $i\in[n]$, define the event $E_i = \{i \text{ falls in a level } j\in[j_0]\cup J\}$ and the set of good columns $I = \{i : \Pr\{E_i\} \ge \beta\}$ for some constant $\beta \ge 1/2$. Since the non-important levels always contribute at most a $2\epsilon$-fraction of $M$, it follows that the bad columns contribute at most a $2\epsilon/(1-\beta)$-fraction of $M$, that is,
$$\sum_{i\notin I}\|f(A_i)\|_2^2 \le \frac{2\epsilon}{1-\beta}\,M.$$
Next we define the event $F_i = \{E_i \text{ holds and } \|f(A_i)\|_2^2 \in [(1+\epsilon)T_j,\ 2(1-\epsilon)T_j]\}$; then it holds for all $i\in I$ that $\Pr\{F_i\} \ge \Pr\{E_i\} - O(\epsilon) \ge 0.9\beta$ for $\epsilon$ sufficiently small. Let $G_j$ denote the event that magnitude level $j$ is chosen, and let $j(i)$ be the index of the magnitude level containing column $i$. Then, for those $i$'s with $j = j(i)\in[j_0]\cup J$, writing $M_j$ for the contribution of level $j$,
$$\Pr\{G_j \mid F_i\} = \frac{\widetilde M_j}{\sum_{j'}\widetilde M_{j'}} = (1\pm O(\epsilon))\,\frac{M_j}{M}, \qquad
\Pr\{u = i \mid G_j \cap F_i\} = \frac{\lambda^{(j)}_t}{\widetilde M_j} = (1\pm O(\epsilon))\,\frac{\|f(A_i)\|_2^2}{M_j},$$
and hence
$$\Pr\{u = i\} \ge \Pr\{u = i \mid G_j \cap F_i\}\cdot\Pr\{G_j \mid F_i\}\cdot\Pr\{F_i\} \ge 0.9\beta\,(1-O(\epsilon))\,\frac{\|f(A_i)\|_2^2}{M} \ge 0.8\beta\,\frac{\|f(A_i)\|_2^2}{M},$$
provided that $\epsilon$ is sufficiently small.

Now we show how to obtain an overestimate $\widehat M$ of $M$. We assume that all entries of $A$ are integer multiples of $\eta = 1/\operatorname{poly}(n)$ and are bounded by $\operatorname{poly}(n)$, which is a common and necessary assumption for streaming algorithms, since otherwise storing a single number would take too much space. Let $\tilde f(x) = \log^2(1+|\eta x|)$; then $\|f(A)\|_F^2 = \sum_{i,j}\tilde f\bigl((\eta^{-1}A)_{i,j}\bigr)$, where $\eta^{-1}A$ has integer entries. Hence, we can run the algorithm implied by Theorem 2 of (Braverman et al., 2016) on $\eta^{-1}A$ in parallel in order to obtain a constant-factor estimate of $\|f(A)\|_F^2$. To justify this application of the theorem, we verify in Appendix B.3 that the function $\tilde f(|x|)$ is slow-jumping, slow-dropping and predictable on the nonnegative integers, as defined by (Braverman et al., 2016).

Finally, we calculate the sketch length. The overall sketch length is dominated by that of Algorithm 3. In Algorithm 3, there are $\hat L = O(\epsilon^{-1}\log n)$ heavy hitter structures $D_1,\ldots,D_{\hat L}$, each of which has a sketch length of $O\bigl(\tfrac{1}{\phi^2\nu^3}\log(nL)\bigr) = \operatorname{poly}(L, 1/\epsilon, \log n) = \operatorname{poly}(\log n, 1/\epsilon)$. There is an additional heavy hitter structure $D_0$ of sketch length $O(\operatorname{poly}(K, L, 1/\epsilon, \log n)) = \operatorname{poly}(\log n, 1/\epsilon)$. Hence the overall sketch length is $\operatorname{poly}(\log n, 1/\epsilon)$. Each cell of the sketch stores an $n$-dimensional vector. We summarize this in the following theorem.

Theorem 11. Suppose that $A\in(\eta\mathbb{Z})^{n\times n}$ with $|A_{ij}| \le \operatorname{poly}(n)$ is given in a turnstile stream, where $\eta = 1/\operatorname{poly}(n)$. There exists a randomized sketching algorithm which maintains a sketch of $n\cdot\operatorname{poly}(\epsilon^{-1}\log n)$ space and outputs a vector $v_{j^*,i^*}\in\mathbb{R}^n$ which satisfies the same guarantee as given in Theorem 10.

### 3.2. Low-Rank Approximation

Suppose that we have an approximate sampling of the rows of $f(A)$ so that we obtain a sample $f(A_i) + E_i$ with probability $p_i$ satisfying
$$p_i \ge c\,\frac{\|f(A_i)\|_2^2}{\|f(A)\|_F^2} \qquad (1)$$
for some absolute constant $c \le 1$. The $p_i$'s are known to us (if $c = 1$, then we do not need to know the $p_i$). The following is our main theorem in this section, which is analogous to Theorem 2 of (Frieze et al., 2004).
Theorem 12. Let $V$ denote the subspace spanned by $s$ samples drawn independently according to the distribution (1), where each sample has the form $f(A_i) + E_i$ for some $i\in[n]$. Suppose that $\|E_i\|_2 \le \gamma\,\|f(A_i)\|_2$ for some $\gamma > 0$. Then with probability at least $9/10$, there exists an orthonormal set of vectors $y_1, y_2, \ldots, y_k$ in $V$ such that
$$\Bigl\|f(A) - f(A)\sum_{t=1}^k y_t y_t^\top\Bigr\|_F^2 \le \min_{D:\,\operatorname{rank}(D)\le k}\|f(A) - D\|_F^2 + \frac{10k}{sc}\,(1+\gamma)^2\,\|f(A)\|_F^2.$$

The theorem shows that the subspace spanned by a sample of columns chosen according to (1) contains an approximation to $f(A)$ that is nearly the best possible. Note that if the top $k$ right singular vectors of $S$ belonged to this subspace, then $f(A)\sum_{t=1}^k v_t v_t^\top$ would provide the required approximation to $f(A)$ and we would be done.

Now, the difference between Theorem 10 and the assumption (1) is that we do not have control over $p_i$ for an $O(\epsilon)$-fraction of the rows (in squared row-norm contribution) in Theorem 10. Let $A'$ be the submatrix of $A$ after removing those rows; then $\|f(A)\|_F \le (1+O(\epsilon))\,\|f(A')\|_F$. We can apply Theorem 12 to $A'$ and take more samples so that we obtain $s$ rows from $A'$ (which holds with probability $1 - \exp(-\Omega(s))$ by a Chernoff bound). We therefore have the following corollary.

Corollary 13. Let the $y_i$'s be as in Algorithm 4 and let $c$ and $\epsilon$ be as in Theorem 10. It holds with probability at least 0.7 that
$$\Bigl\|f(A) - f(A)\sum_{t=1}^k y_t y_t^\top\Bigr\|_F^2 \le \min_{D:\,\operatorname{rank}(D)\le k}\|f(A) - D\|_F^2 + \Bigl(\frac{30k}{sc} + \epsilon\Bigr)\,\|f(A)\|_F^2.$$

Algorithm 4: Rank-$k$ Approximation
Input: $A\in\mathbb{R}^{n\times n}$, rank parameter $k$, number of samples $s$
1: Initialize $s$ parallel instances of (modified) Algorithm 3
2: Let $(h_1, \hat p_1), \ldots, (h_s, \hat p_s)$ be the returned vectors and the sampling probabilities from the $s$ instances of (modified) Algorithm 3
3: $F \leftarrow$ the concatenated matrix $\Bigl[\frac{h_1}{\sqrt{s\hat p_1}}\ \cdots\ \frac{h_s}{\sqrt{s\hat p_s}}\Bigr]$
4: Compute the top $k$ left singular vectors of $F$, forming $L\in\mathbb{R}^{n\times k}$
5: return $L$

Note that Algorithm 3 can easily be modified to return the sampling probability of the sampled column, which is just $\lambda^{(j)}_i / \sum_{j'}\widetilde M_{j'}$. However, for each sample, we may lose control of it with a small constant probability. To overcome this, inspecting the proof of the H-Sketch, we see that for fixed stream downsampling and fixed $\zeta$ in Algorithm 3, repeating each heavy hitter structure $O(\log(L/\delta))$ times and taking the median of each $\widetilde M_j$ lowers the failure probability of estimating the contribution of each important level to $\delta/(L+1)$, allowing for a union bound over all levels. Hence, with probability at least $1-\delta$, we can guarantee that we obtain a $(1\pm O(\epsilon))$-approximation to $\|(f(A))_u\|_2^2$ and thus a $(1\pm O(\epsilon))$-approximation to $\|f(A)\|_F^2$. Hence the returned $\hat p_u$ is a $(1\pm O(\epsilon))$-approximation to the true row-sampling probability $p_u = \|(f(A))_u\|_2^2 / \|f(A)\|_F^2$. Different runs of the sampling algorithm may produce different values of $\hat p_u$ for the same $u$, but they are all $(1\pm O(\epsilon))$-approximations to $p_u$. We can guarantee this for all our $s$ samples by setting $\delta = O(1/s)$, which allows for a union bound over all $s$ samples. Therefore, at the cost of an extra $O(\log s)$ factor in space, we can assume that $\hat p_u = (1\pm O(\epsilon))\,p_u$ for all $s$ samples.

The overall algorithm is presented in Algorithm 4. The following main theorem follows from Corollary 13 and the argument in (Frieze et al., 2004).

Theorem 14. Let $s = O(k/\epsilon)$ be the number of samples and let $y_1,\ldots,y_k$ be the output of Algorithm 4. It holds with probability at least 0.7 that
$$\|f(A) - LL^\top f(A)\|_F^2 \le \min_{D:\,\operatorname{rank}(D)\le k}\|f(A) - D\|_F^2 + \epsilon\,\|f(A)\|_F^2.$$
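To make the pipeline concrete, the following offline sketch (ours; exact squared-column-norm sampling stands in for the one-pass H-Sketch, and `rank_k_from_samples` is an illustrative name, not from the paper) carries out the rescaling and top-$k$ SVD step of Algorithm 4 and compares the resulting error with the best rank-$k$ error.

```python
import numpy as np

# Given s sampled columns and their sampling probabilities, rescale, stack, and
# take the top-k left singular vectors, as in Algorithm 4.
f = lambda x: np.log1p(np.abs(x))

def rank_k_from_samples(A, k, s, rng):
    fA = f(A)
    p = np.sum(fA ** 2, axis=0) / np.sum(fA ** 2)   # squared-column-norm distribution
    idx = rng.choice(A.shape[1], size=s, p=p)
    F = fA[:, idx] / np.sqrt(s * p[idx])            # columns h_i / sqrt(s * p_i)
    L, _, _ = np.linalg.svd(F, full_matrices=False)
    return L[:, :k]                                 # n x k, orthonormal columns

rng = np.random.default_rng(0)
A = rng.standard_normal((300, 300))
L = rank_k_from_samples(A, k=10, s=200, rng=rng)
err = np.linalg.norm(f(A) - L @ (L.T @ f(A)), "fro") ** 2
best = np.sum(np.linalg.svd(f(A), compute_uv=False)[10:] ** 2)  # ||f(A) - [f(A)]_k||_F^2
print(err, best)
```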
## 4. Experiments

To demonstrate the benefits of our algorithm empirically, we conducted experiments on low-rank approximation with a real NLP data set and used the function $f(x) = \log(|x|+1)$. The data we use is based on the Wikipedia data used by Liang et al. (2020). The data matrix $A'\in\mathbb{R}^{n\times n}$ ($n = 10^4$) contains information about the correlation among the $n$ words. Its entries are
$$A'_{i,j} = p_j\,\log\!\Bigl(\frac{N_{i,j}\,N}{N_i N_j} + 1\Bigr),$$
where $N_{i,j}$ is the number of times words $i$ and $j$ co-occur in a window of size 10, $N_i$ is the number of times word $i$ appears (the $N_i$'s are in decreasing order), $N = \sum_i N_i$, and $p_j = \max\{1, (N_j/N_{10})^2\}$ is a weighting factor which adds larger weight to more frequent words. Since $A'$, and thus $f(A')$, has almost the same column norms, we instead consider $A = A' - \mathbf{1}\mathbf{1}^\top A'$, where $\mathbf{1}$ is the vector of all 1 entries.

We compare the accuracy and runtime with the previous three-pass algorithm of Liang et al. (2020). The task is to find $L\in\mathbb{R}^{n\times k}$ with orthonormal columns to minimize the error ratio
$$e(L) = \frac{\|f(A) - LL^\top f(A)\|_F}{\|f(A) - UU^\top f(A)\|_F},$$
where $U\in\mathbb{R}^{n\times k}$ has the top $k$ left singular vectors of $f(A)$ as columns. The numerator $\|f(A) - LL^\top f(A)\|_F$ is the approximation error and the denominator $\|f(A) - UU^\top f(A)\|_F$ is the best approximation error, both in Frobenius norm.

### 4.1. Algorithm Implementation

We present our empirical results for the one-pass algorithm and a faster implementation of the two-pass algorithm in Figure 1 and Figure 2. Both algorithms run in 0.1% of the runtime of the three-pass algorithm of Liang et al. (2020). The one-pass algorithm is less accurate than the two-pass algorithm when the space usage is small, which is not unexpected, because the second pass enables noiseless column samples. Still, as discussed below, one-pass algorithms are essential in certain internet NLP applications. Even our two-pass algorithm has a considerable advantage over the prior three-pass algorithm, matching its accuracy in significantly less time and with one fewer pass.

For the two-pass algorithm, we sample the columns of $A$ using Algorithm 3 in the first pass and only recover the positions of the heavy columns in each magnitude level. Taking $s$ samples incurs $O(s\cdot\operatorname{poly}(\epsilon^{-1}\log n))$ heavy hitters in total. In the second pass we obtain the precise $f(A)$ values for our samples, and thus noiseless column samples. Then we calculate the sampling probability $\hat p_u$ according to the error-free column norms.

To reduce the runtime, we do not run $s$ independent copies of Algorithm 3 for $s$ samples; instead, we take $m$ samples from each single run of Algorithm 3 and run $s/m$ independent copies of Algorithm 3. Hence the $s$ samples we obtain are not fully independent.

Figure 1. Error ratios of the one-pass, two-pass and three-pass algorithms. The x-axis is the ratio between the space of the sketch maintained by the tested algorithm and the space to store the input matrix. The y-axis is the error ratio $e(L)$. Solid dots denote the mean of the error ratios over 10 independent trials and the vertical bars denote the standard deviation of the one-pass and two-pass algorithms.

All experiments are conducted under MATLAB 2019b on a laptop with a 2.20GHz CPU and 16GB RAM. We set $k = 10$, $m = 100$ and plot the error ratios of our algorithm in Figure 1. For each value of space usage, the mean and standard deviation are reported over 10 independent runs. In the same figure, we also plot the results of the three-pass algorithm of Liang et al. (2020) at comparable levels of space usage. Since the three-pass algorithm is considerably slower, we run it only once. Additionally, we plot the runtimes of all algorithms in Figure 2.
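For concreteness, the error ratio $e(L)$ reported in the figures can be computed as in the following offline sketch (ours, in Python; the paper's experiments are implemented in MATLAB).

```python
import numpy as np

# Compare a candidate projector L against the optimal rank-k projector U
# obtained from the SVD of f(A).
def error_ratio(A, L, k):
    fA = np.log1p(np.abs(A))                       # f applied entrywise
    U, _, _ = np.linalg.svd(fA, full_matrices=False)
    U = U[:, :k]                                   # top-k left singular vectors
    num = np.linalg.norm(fA - L @ (L.T @ fA), "fro")
    den = np.linalg.norm(fA - U @ (U.T @ fA), "fro")
    return num / den                               # >= 1, and equal to 1 for L = U
```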
We can observe that even at a space usage of approximately 12% of the input data, the error ratio of our two-pass algorithm is stably around 1.05. The one-pass algorithm is less accurate than the two-pass algorithm when the space usage is less than 20%, which is not unexpected, because the second pass enables noiseless column samples. Overall, the error ratio of the one- and two-pass algorithms is similar to that of the three-pass algorithm for space usage levels of at least 0.2, while the runtime of both algorithms is at most 0.1% of that of the three-pass algorithm, which is a significant improvement.

Figure 2. Runtimes of the one-pass, two-pass and three-pass algorithms. The x-axis is the ratio between the space of the sketch maintained by the tested algorithm and the space to store the input matrix. The y-axis is the runtime in seconds. Solid dots denote the mean of 10 independent trials of the one-pass and two-pass algorithms. The standard deviations are all less than 1.5 seconds and thus omitted.

## 5. Application to Linear Regression

In this section, we consider approximately solving the linear regression problem using the H-Sketch from Section 3.1. We shall need to sample rows from the concatenated matrix $Q = \begin{bmatrix} f(A) & b \end{bmatrix}$, where $A\in\mathbb{R}^{n\times d}$ and $b\in\mathbb{R}^n$ are given in a turnstile stream, and $f(x) = \ln(1+|x|)$ is the transformation function. This can still be achieved using the same H-Sketch as in Section 3.1, applied to the concatenated matrix $\begin{bmatrix} A & b \end{bmatrix}$, with the transformation function $f(x)$ ($x\in\mathbb{R}^{d+1}$) replaced with
$$g(x) = \begin{pmatrix} f(x_{1:d}) \\ x_{d+1} \end{pmatrix},$$
where $x_{1:d}$ denotes the first $d$ coordinates of $x$ and $x_{d+1}$ the last coordinate of $x$. Then, using arguments identical to those in Section 3.1 for the squared $\ell_2$-norm on the first $d$ coordinates and a standard CountSketch argument for the last coordinate, it is straightforward to show the following analogue of Theorem 10. We omit the identical proof.

Theorem 15. Let $\epsilon > 0$, and let $H_{j_0,i_0}$ be the vector returned by Algorithm 3. With probability at least 0.9, there exists $u\in[n]$ such that
$$(1-O(\epsilon))\bigl(\|f(A_u)\|_2^2 + |b_u|^2\bigr) \le \|g(H_{j_0,i_0})\|_2^2 \le (1+O(\epsilon))\bigl(\|f(A_u)\|_2^2 + |b_u|^2\bigr).$$

Theorem 15 states that each sample is a noisy version of the $u$-th row of $Q$. Let $p_u = \|Q_u\|_2^2 / \|Q\|_F^2$ be the true sampling probability of the $u$-th row. As argued at the beginning of Section 3.2, we may assume, at the cost of an $O(\log s)$ factor in space, that every sample is good, i.e., the returned sampling probability $\hat p_u$ satisfies $\hat p_u = (1\pm O(\epsilon))\,p_u$ and the noise in each sample is at most an $O(\epsilon)$-fraction of its $\ell_2$ norm.

Below we present our algorithm for linear regression in Algorithm 5, assuming that every sample is good in this sense. The guarantee is given in Theorem 16 and the proof is deferred to Section D.
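As an illustration of the approach (ours; exact row sampling stands in for the one-pass H-Sketch, and all names are illustrative), the following offline sketch performs the rescaled sampled least-squares solve carried out by Algorithm 5 below.

```python
import numpy as np

# Sample rows of Q = [f(A) b] proportional to their squared norms, rescale by
# 1/sqrt(s p_i), and solve the regression on the sampled matrix.
f = lambda x: np.log1p(np.abs(x))

def sampled_regression(A, b, s, rng):
    Q = np.hstack([f(A), b[:, None]])                  # Q = [f(A) b], rows are samples
    p = np.sum(Q ** 2, axis=1) / np.sum(Q ** 2)        # row-sampling probabilities
    idx = rng.choice(Q.shape[0], size=s, p=p)
    T = Q[idx] / np.sqrt(s * p[idx])[:, None]          # rescaled sampled rows
    M, b_prime = T[:, :-1], T[:, -1]                   # split back into [M b']
    x, *_ = np.linalg.lstsq(M, b_prime, rcond=None)
    return x

rng = np.random.default_rng(0)
A = rng.standard_normal((5000, 20))
b = f(A) @ rng.standard_normal(20) + 0.01 * rng.standard_normal(5000)
x_hat = sampled_regression(A, b, s=2000, rng=rng)
x_opt, *_ = np.linalg.lstsq(f(A), b, rcond=None)       # offline optimum for comparison
print(np.linalg.norm(x_hat - x_opt))
```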
Algorithm 5: Linear Regression
Require: $A\in\mathbb{R}^{n\times d}$, number of samples $s$
1: Initialize $s$ parallel instances of Algorithm 3
2: Run Algorithm 3 on the concatenated matrix $\begin{bmatrix} f(A) & b\end{bmatrix}$ to obtain $s$ row samples $h_1,\ldots,h_s$ and the corresponding sampling probabilities $\hat p_1,\ldots,\hat p_s$
3: $T \leftarrow$ the vertical concatenation of $\frac{h_1}{\sqrt{s\hat p_1}}, \ldots, \frac{h_s}{\sqrt{s\hat p_s}}$
4: $M \leftarrow$ the first $d$ columns of $T$, $b' \leftarrow$ the last column of $T$
5: $x' \leftarrow \arg\min_{x\in\mathbb{R}^d}\|Mx - b'\|_2$

Theorem 16. Given a matrix $A\in\mathbb{R}^{n\times d}$ and a vector $b\in\mathbb{R}^n$, let $\kappa = \kappa(f(A))$ be the condition number of the transformed matrix. Let $s = O\bigl(\tfrac{d\kappa^2}{\delta}\bigr)$. Then Algorithm 5 outputs a vector $x'\in\mathbb{R}^d$ which, with probability at least $1-\delta$, satisfies
$$\|Mx' - b'\|_2 \le (1+\epsilon)\,\min_{x\in\mathbb{R}^d}\|f(A)x - b\|_2 + \Delta, \qquad \Delta = \epsilon\Bigl(\sqrt{d + \tfrac{\|b\|_2^2}{\|f(A)\|_2^2}}\;\kappa\,\|b\|_2 + \sqrt{\|f(A)\|_F^2 + \|b\|_2^2}\Bigr).$$
The total space used by Algorithm 5 is $O\bigl(d^2\kappa^2\log\tfrac{1}{\delta}\bigr)\cdot\operatorname{poly}(\log n, \tfrac{1}{\epsilon})$.

Finally, we note an error in (Liang et al., 2020). Let $G = f(A)$ and let $S$ be a subspace embedding sketching matrix with $s = \operatorname{poly}(d/\epsilon)$ rows for the column space of $G$. In their proof of the regression problem in Section E, the upper bound on $C_2$ is wrong, as it claims that $\|g(SG)\|_2 \le C\,\|SG\|_2$; a correct bound should be $\|g(SG)\|_2 \le 10nd\kappa\eta\,\|SG\|_2$, where $\eta = \max_{i,j}|(g(SG))_{i,j} - (SG)_{i,j}|$ could be linear in $n$ by their LOGSUM guarantee. The proof in (Liang et al., 2020) does not account for such a dependence on $n$ and $\kappa$. Correcting the proof would yield a bound similar to ours but with an additive error of $n^2\kappa^3\operatorname{poly}(\tfrac{\epsilon}{d})\,\|b\|_2$, which depends polynomially on $n$. Our additive error has no dependence on $n$, but it depends on $\|b\|_2^2/\|f(A)\|_2^2$ and has an additional additive term of $\epsilon\,\|Q\|_2$, which is an artefact of sampling the rows of $Q$.

## Acknowledgements

Y. Li was partially supported and Y. Sun was supported by Singapore Ministry of Education (AcRF) Tier 2 grant MOE2018-T2-1-013. D. Woodruff thanks support from NSF grant No. CCF-1815840, Office of Naval Research grant N00014-18-1-256, and a Simons Investigator Award.

## References

Andoni, A., Ba, K. D., Indyk, P., and Woodruff, D. P. Efficient sketches for earth-mover distance, with applications. In 50th Annual IEEE Symposium on Foundations of Computer Science, FOCS 2009, October 25-27, 2009, Atlanta, Georgia, USA, pp. 324-330, 2009.

Andoni, A., Krauthgamer, R., and Onak, K. Streaming algorithms via precision sampling. In 2011 IEEE 52nd Annual Symposium on Foundations of Computer Science, pp. 363-372, 2011. doi: 10.1109/FOCS.2011.82.

Boutsidis, C., Woodruff, D. P., and Zhong, P. Optimal principal component analysis in distributed and streaming models. In Proceedings of the 48th Annual ACM SIGACT Symposium on Theory of Computing, STOC 2016, Cambridge, MA, USA, June 18-21, 2016, pp. 236-249, 2016.

Braverman, V., Chestnut, S. R., Woodruff, D. P., and Yang, L. F. Streaming space complexity of nearly all functions of one variable on frequency vectors. In Proceedings of the 35th ACM SIGMOD-SIGACT-SIGAI Symposium on Principles of Database Systems, PODS '16, pp. 261-276, New York, NY, USA, 2016. Association for Computing Machinery. ISBN 9781450341912. doi: 10.1145/2902251.2902282. URL https://doi.org/10.1145/2902251.2902282.

Clarkson, K. L. and Woodruff, D. P. Numerical linear algebra in the streaming model. In Proceedings of the 41st Annual ACM Symposium on Theory of Computing, STOC 2009, Bethesda, MD, USA, May 31 - June 2, 2009, pp. 205-214, 2009.

Drineas, P., Magdon-Ismail, M., Mahoney, M. W., and Woodruff, D. P. Fast approximation of matrix coherence and statistical leverage. J. Mach. Learn. Res., 13:3475-3506, 2012.

Durme, B. V. and Lall, A. Streaming pointwise mutual information. In Proceedings of the 22nd International Conference on Neural Information Processing Systems, NIPS'09, pp. 1892-1900, Red Hook, NY, USA, 2009. Curran Associates Inc. ISBN 9781615679119.

Frieze, A. M., Kannan, R., and Vempala, S. S. Fast Monte-Carlo algorithms for finding low-rank approximations. J. ACM, 51(6):1025-1041, 2004.

Levin, R., Sevekari, A. P., and Woodruff, D. P. Robust subspace approximation in a stream. In Advances in Neural Information Processing Systems 31: Annual Conference on Neural Information Processing Systems 2018, NeurIPS 2018, 3-8 December 2018, Montréal, Canada, pp. 10706-10716, 2018.
Li, Y., Woodruff, D., and Yasuda, T. Exponentially improved dimensionality reduction for ℓ1: Subspace embeddings and independence testing. Accepted to COLT 2021. Full version available at arXiv:2104.12946 [cs.DS], 2021.

Liang, Y., Song, Z., Wang, M., Yang, L., and Yang, X. Sketching transformed matrices with applications to natural language processing. In The 23rd International Conference on Artificial Intelligence and Statistics, AISTATS 2020, 26-28 August 2020, Online [Palermo, Sicily, Italy], pp. 467-481, 2020.

Mahabadi, S., Razenshteyn, I. P., Woodruff, D. P., and Zhou, S. Non-adaptive adaptive sampling on turnstile streams. In Proceedings of the 52nd Annual ACM SIGACT Symposium on Theory of Computing, STOC 2020, Chicago, IL, USA, June 22-26, 2020, pp. 1251-1264, 2020.

Upadhyay, J. Differentially private linear algebra in the streaming model. arXiv preprint arXiv:1409.5414, 2014.

Woodruff, D. P. Low rank approximation lower bounds in row-update streams. In Advances in Neural Information Processing Systems 27: Annual Conference on Neural Information Processing Systems 2014, December 8-13 2014, Montreal, Quebec, Canada, pp. 1781-1789, 2014.

Woodruff, D. P. and Zhong, P. Distributed low rank approximation of implicit functions of a matrix. In 2016 IEEE 32nd International Conference on Data Engineering (ICDE), pp. 847-858. IEEE, 2016.