Sublinear Time Low-Rank Approximation of Distance Matrices

Ainesh Bakshi
Department of Computer Science, Carnegie Mellon University, Pittsburgh, PA 15213
abakshi@cs.cmu.edu

David P. Woodruff
Department of Computer Science, Carnegie Mellon University, Pittsburgh, PA 15213
dwoodruf@cs.cmu.edu

Abstract

Let $P = \{p_1, p_2, \ldots, p_m\}$ and $Q = \{q_1, q_2, \ldots, q_n\}$ be two point sets in an arbitrary metric space. Let $A$ represent the $m \times n$ pairwise distance matrix with $A_{i,j} = d(p_i, q_j)$. Such distance matrices are commonly computed in software packages and have applications to learning image manifolds, handwriting recognition, and multi-dimensional unfolding, among other things. In an attempt to reduce their description size, we study low-rank approximation of such matrices. Our main result is to show that for any underlying distance metric $d$, it is possible to achieve an additive error low-rank approximation in sublinear time. We note that it is provably impossible to achieve such a guarantee in sublinear time for arbitrary matrices $A$, and our proof exploits special properties of distance matrices. We develop a recursive algorithm based on additive projection-cost preserving sampling. We then show that in general, relative error approximation in sublinear time is impossible for distance matrices, even if one allows for bicriteria solutions. Additionally, we show that if $P = Q$ and $d$ is the squared Euclidean distance, which is not a metric but rather the square of a metric, then a relative error bicriteria solution can be found in sublinear time. Finally, we empirically compare our algorithm with the singular value decomposition (SVD) and input-sparsity time algorithms. Our algorithm is several hundred times faster than the SVD, and about 8-20 times faster than input-sparsity methods, on real-world and synthetic datasets of size $10^8$.
Accuracy-wise, our algorithm is only slightly worse than that of the SVD (optimal) and input-sparsity time algorithms.

32nd Conference on Neural Information Processing Systems (NeurIPS 2018), Montréal, Canada.

1 Introduction

We study low-rank approximation of matrices $A$ formed by the pairwise distances between two (possibly equal) sets of points or observations $P = \{p_1, \ldots, p_m\}$ and $Q = \{q_1, \ldots, q_n\}$ in an arbitrary underlying metric space. That is, $A$ is an $m \times n$ matrix for which $A_{i,j} = d(p_i, q_j)$. Such distance matrices are the outputs of routines in commonly used software packages, such as the pairwise command in Julia, the pdist2 command in Matlab, or the crossdist command in R. Distance matrices have found many applications in machine learning, where Weinberger and Saul use them to learn image manifolds [18], Tenenbaum, De Silva, and Langford use them for image understanding and handwriting recognition [17], Jain and Saul use them for speech and music [12], and Demaine et al. use them for music and musical rhythms [7]. For an excellent tutorial on Euclidean distance matrices, we refer the reader to [8], which lists applications to nuclear magnetic resonance (NMR), crystallography, visualizing protein structure, and multi-dimensional unfolding.

We consider the most general case, for which $P$ and $Q$ are not necessarily the same set of points. For example, one may have two large unordered sets of samples from some distribution and may want to determine how similar (or dissimilar) the sample sets are to each other. Such problems arise in hierarchical clustering and phylogenetic analysis¹. Formally, let $P = \{p_1, p_2, \ldots, p_m\}$ and $Q = \{q_1, q_2, \ldots, q_n\}$ be two sets of points in an arbitrary metric space, and let $A$ represent the $m \times n$ pairwise distance matrix with $A_{i,j} = d(p_i, q_j)$. Since the matrix $A$ may be very large, it is often desirable to reduce the number of parameters needed to describe it.
Two standard methods of doing this are via sparsity and low-rank approximation. In the distance matrix setting, if one first filters $P$ and $Q$ to contain only distinct points, then each row and column can contain at most a single zero entry, so typically such matrices $A$ are dense. Low-rank approximation, on the other hand, can be highly beneficial: if the point sets can be clustered into a small number of clusters, then each cluster can be used to define an approximately rank-1 component, and so $A$ is an approximately low-rank matrix.

To find a low-rank factorization of $A$, one can compute its singular value decomposition (SVD), though in practice this takes $\min(mn^2, m^2n)$ time. One can do slightly better with theoretical algorithms for fast matrix multiplication, though not only are they impractical, but there exist much faster randomized approximation algorithms. Indeed, one can use Fast Johnson-Lindenstrauss transforms (FJLT) [16], or CountSketch matrices [4, 13, 15, 1, 5], which for dense matrices $A$ run in $O(mn) + (m+n)\,\mathrm{poly}(k/\epsilon)$ time. At first glance the $O(mn)$ time seems like it could be optimal. Indeed, for arbitrary $m \times n$ matrices $A$, outputting a rank-$k$ matrix $B$ for which

$\|A - B\|_F^2 \le \|A - A_k\|_F^2 + \epsilon \|A\|_F^2$   (1.1)

can be shown to require $\Omega(mn)$ time. Here $A_k$ denotes the best rank-$k$ approximation to $A$ in Frobenius norm, and recall that for an $m \times n$ matrix $C$, $\|C\|_F^2 = \sum_{i=1}^{m} \sum_{j=1}^{n} C_{i,j}^2$. The additive error guarantee above is common in the low-rank approximation literature and appears in [10]. To see this lower bound, note that if one does not read nearly all the entries of $A$, then with good probability one may miss an entry of $A$ which is arbitrarily large, and therefore cannot achieve (1.1). Perhaps surprisingly, [14] show that for positive semidefinite (PSD) $n \times n$ matrices $A$, one can achieve (1.1) in sublinear time, namely, in $n \cdot k \cdot \mathrm{poly}(1/\epsilon)$ time.
Moreover, they achieve the stronger notion of relative error; that is, they output a rank-$k$ matrix $B$ for which

$\|A - B\|_F^2 \le (1+\epsilon)\,\|A - A_k\|_F^2.$   (1.2)

The intuition behind their result is that the large entries causing the $\Omega(mn)$ lower bound cannot hide in a PSD matrix, since they necessarily create large diagonal entries. A natural question is whether it is possible to obtain low-rank approximation algorithms for distance matrices in sublinear time as well. A driving intuition that it may be possible is that no matter which metric the underlying points reside in, they necessarily satisfy the triangle inequality. Therefore, if $A_{i,j} = d(p_i, q_j)$ is large, then since $d(p_i, q_j) \le d(p_i, q_1) + d(q_1, p_1) + d(p_1, q_j)$, at least one of $d(p_i, q_1)$, $d(q_1, p_1)$, $d(p_1, q_j)$ is large, and further, all these distances can be found by reading the first row and column of $A$. Thus, large entries cannot hide in the matrix. Are there sublinear time algorithms achieving (1.1)? Are there sublinear time algorithms achieving (1.2)? These are the questions we put forth and study in this paper.

1.1 Our Results

Our main result is that we obtain sublinear time algorithms achieving the additive error guarantee (1.1) for distance matrices, which is impossible for general matrices $A$. We show that for every metric $d$ this is indeed possible. Namely, for an arbitrarily small constant $\gamma > 0$, we give an algorithm running in $\tilde{O}\big((m^{1+\gamma} + n^{1+\gamma})\,\mathrm{poly}(k\epsilon^{-1})\big)$ time and achieving the guarantee $\|A - MN^T\|_F^2 \le \|A - A_k\|_F^2 + \epsilon \|A\|_F^2$ for any distance matrix with metric $d$. Note that our running time is significantly sublinear in the description size of $A$. Indeed, thinking of the shortest path metric on an unweighted bipartite graph in which $P$ corresponds to the left set of vertices and $Q$ corresponds to the right set of vertices, for each pair of points $p_i \in P$ and $q_j \in Q$, one can choose $d(p_i, q_j) = 1$ or $d(p_i, q_j) > 1$ independently of all other distances by deciding whether to include the edge $\{p_i, q_j\}$.
¹ See, e.g., https://en.wikipedia.org/wiki/Distance_matrix

Consequently, there are at least $2^{\Omega(mn)}$ possible distance matrices $A$, and since our algorithm reads $o(mn)$ entries of $A$, it cannot learn whether $d(p_i, q_j) = 1$ or $d(p_i, q_j) > 1$ for each $i$ and $j$. Nevertheless, it still learns enough information to compute a low-rank approximation to $A$. We note that a near-matching lower bound holds just to write down the output of a factorization of a rank-$k$ matrix $B$ into an $m \times k$ and a $k \times n$ matrix. Thus, up to an $(m^\gamma + n^\gamma)\,\mathrm{poly}(k\epsilon^{-1})$ factor, our algorithm is also optimal among those achieving the additive error guarantee of (1.1).

A natural followup question is to consider achieving relative error (1.2) in sublinear time. Although large entries in a distance matrix $A$ cannot hide, we show it is still impossible to achieve the relative error guarantee in less than $mn$ time for distance matrices. That is, we show for the $\ell_\infty$ distance metric that there are instances of distance matrices $A$ with unequal $P$ and $Q$ for which, even for $k = 2$ and any constant accuracy $\epsilon$, an algorithm must read $\Omega(mn)$ entries of $A$. In fact, our lower bound holds even if the algorithm is allowed to output a rank-$k'$ approximation for any $2 \le k' = o(\min(m, n))$ whose cost is at most that of the best rank-2 approximation to $A$. We call the latter a bicriteria algorithm, since its output rank $k'$ may be larger than the desired rank $k$. Therefore, in some sense, obtaining additive error (1.1) is the best we can hope for.

We next consider the important class of Euclidean matrices for which the entries correspond to the square of the Euclidean distance, and for which $P = Q$. In this case, we are able to show that if we allow the output low-rank matrix $B$ to be of rank $k + 4$, then one can achieve the relative error guarantee of (1.2) with respect to the best rank-$k$ approximation, namely, that $\|A - B\|_F^2 \le (1+\epsilon)\|A - A_k\|_F^2$. Further, our algorithm runs in a sublinear $n \cdot k \cdot \mathrm{poly}(1/\epsilon)$ amount of time.
Thus, our lower bound ruling out sublinear time algorithms achieving (1.2) for bicriteria algorithms cannot hold for this class of matrices. Finally, we empirically compare our algorithm with the SVD and input-sparsity time algorithms [4, 13, 15, 1, 5]. Our algorithm is several hundred times faster than the SVD, and about 8-20 times faster than input-sparsity methods, on real-world datasets such as MNIST, Gisette and Poker, and a synthetic clustering dataset of size $10^8$. Accuracy-wise, our algorithm is only slightly worse than that of the SVD (optimal error) and input-sparsity time algorithms. Due to space constraints, we defer all of our proofs to the Supplementary Material².

2 Row and Column Norm Estimation

We observe that we can obtain a rough estimate for the row or column norms of a distance matrix by uniformly sampling a small number of elements of each row or column. The only structural property we need to obtain such an estimate is approximate triangle inequality.

Definition 2.1 (Approximate Triangle Inequality). Let $A$ be an $m \times n$ matrix. Then, $A$ satisfies approximate triangle inequality if, for a parameter $\epsilon \in [0, 1]$, for all $p \in [m]$ and $q, r \in [n]$,

$A_{p,r} - \max_{i \in [m]} |A_{i,q} - A_{i,r}| \;\le\; (1+\epsilon)\,A_{p,q} \;\le\; (1+\epsilon)\,A_{p,r} + \max_{i \in [m]} |A_{i,q} - A_{i,r}|$   (2.1)

$|A_{p,q} - A_{p,r}| \;\le\; (1+\epsilon) \max_{i \in [m]} |A_{i,q} - A_{i,r}| \;\le\; (1+\epsilon)\,(A_{p,q} + A_{p,r})$   (2.2)

Further, similar equations hold for $A^T$.

The above definition captures distance matrices if we set $\epsilon = 0$. To see this, recall that each entry of an $m \times n$ matrix $A$ is associated with a distance between the point sets $P$ and $Q$, with $|P| = m$ and $|Q| = n$. Then, for points $p \in P$ and $q \in Q$, $A_{p,q}$ represents $d(p, q)$, where $d$ is an arbitrary distance metric. Further, for arbitrary points $q, r \in Q$, $\max_{i \in [m]} |A_{i,q} - A_{i,r}| = \max_{i \in P} |d(i, q) - d(i, r)|$. Intuitively, we would like to highlight that, in the case where $A$ is a distance matrix, $\max_{i \in [m]} |A_{i,q} - A_{i,r}|$ is a lower bound on the distance $d(q, r)$. Since $d$ is a metric, it satisfies the triangle inequality, and $d(p, q) \le d(p, r) + d(q, r)$.
Further, by the reverse triangle inequality, for all $i \in [m]$, $d(q, r) \ge |d(i, q) - d(i, r)|$. Therefore, $A_{p,q} \le A_{p,r} + \max_i |A_{i,q} - A_{i,r}|$, and distance matrices satisfy equation (2.1). Next, $\max_{i \in [m]} |A_{i,q} - A_{i,r}| = \max_{i \in [m]} |d(i, q) - d(i, r)| \le d(q, r)$, and $d(q, r) \le d(p, r) + d(p, q) = A_{p,r} + A_{p,q}$; therefore, equation (2.2) is satisfied. We note that approximate triangle inequality is a relaxation of the traditional triangle inequality and is sufficient to obtain coarse estimates of the row and column norms of $A$ in sublinear time.

Algorithm 1: Row Norm Estimation.
Input: a distance matrix $A \in \mathbb{R}^{m \times n}$, sampling parameter $b$.
1. Let $x = \mathrm{argmin}_{i \in [m]} A_{i,1}$.
2. Let $d = \max_{j \in [n]} A_{x,j}$.
3. For $i \in [m]$, let $T_i$ be a uniformly random sample of $\Theta(b)$ indices in $[n]$.
4. Let $\tilde{X}_i = d^2 + \frac{n}{|T_i|} \sum_{j \in T_i} A_{i,j}^2$.
Output: the set $\{\tilde{X}_1, \tilde{X}_2, \ldots, \tilde{X}_m\}$.

Lemma 2.1 (Row Norm Estimation). Let $A$ be an $m \times n$ matrix that satisfies approximate triangle inequality. For $i \in [m]$, let $A_{i,*}$ be the $i$-th row of $A$. Algorithm 1 uniformly samples $\Theta(b)$ elements from $A_{i,*}$ and, with probability at least $9/10$, outputs an estimator which is an $O(n/b)$-approximation to $\|A_{i,*}\|_2^2$. Further, Algorithm 1 runs in $O(bm + n)$ time.

To obtain an $O(n/b)$-approximation for all $m$ rows simultaneously with high probability, we can compute $O(\log(m))$ estimators for each row and take their median. We also observe that column and row norm estimation are symmetric operations, and a slight modification to Algorithm 1 yields a column norm estimation algorithm with the following guarantee:

Corollary 2.1 (Column Norm Estimation). There exists an algorithm that uniformly samples $\Theta(b)$ elements from $A_{*,j}$ and, with probability $9/10$, outputs an estimator which is an $O(m/b)$-approximation to $\|A_{*,j}\|_2^2$. Further, this algorithm runs in $O(bn + m)$ time.

² The full version of our paper is available at https://arxiv.org/pdf/1809.06986.pdf

3 Projection-Cost Preserving Sketches

We would like to reduce the dimensionality of the matrix in a way that approximately preserves low-rank structure.
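Before developing these sketches, here is a small numpy sketch of the row-norm estimator of Algorithm 1 from the previous section. Function and variable names are ours, and the code is an illustrative stand-in rather than the paper's exact implementation:

```python
import numpy as np

def estimate_row_norms(A, b, rng=None):
    """Sketch of Algorithm 1 (Row Norm Estimation): crude estimates of the
    squared row norms of a distance matrix A, reading only O(b*m + n)
    entries. Names are ours."""
    rng = np.random.default_rng(rng)
    m, n = A.shape
    x = int(np.argmin(A[:, 0]))       # step 1: row closest to the first column
    d = float(np.max(A[x, :]))        # step 2: a coarse scale for all distances
    X = np.empty(m)
    for i in range(m):
        T = rng.integers(0, n, size=b)                  # step 3: Theta(b) uniform indices
        X[i] = d ** 2 + (n / b) * np.sum(A[i, T] ** 2)  # step 4: rescaled sample sum
    return X
```

Each call reads one column, one row, and $\Theta(b)$ entries per row, i.e., $O(bm + n)$ entries in total, matching the running time in Lemma 2.1.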
The main insight is that if we can approximately preserve all rank-$k$ subspaces in the column and row space of the matrix, then we can recursively sample rows and columns to obtain a much smaller matrix. To this end, we introduce a relaxation of projection-cost preserving sketches [6] that satisfies an additive error guarantee. We show that sampling columns of $A$ according to approximate column norms yields a matrix that preserves projection cost for all rank-$k$ projections up to additive error. We give a proof similar to the relative error guarantees in [6], but need to replace certain parts with our different distribution, which is based only on row and column norms rather than leverage scores; consequently, we obtain additive error in places instead. As our lower bound shows, this is necessary in our setting.

Theorem 3.1 (Column Projection-Cost Preservation). Let $A$ be an $m \times n$ matrix that satisfies approximate triangle inequality. For $j \in [n]$, let $\tilde{X}_j$ be an $O(m/b)$-approximate estimate for the squared norm of the $j$-th column of $A$, satisfying the guarantee of Corollary A.1. Then, let $q = \{q_1, q_2, \ldots, q_n\}$ be a probability distribution over the columns of $A$ such that $q_j = \tilde{X}_j / \sum_{j'} \tilde{X}_{j'}$. Let $t = c \cdot \frac{m k^2 \log(m/\delta)}{b \epsilon^2}$ for some constant $c$. Then, construct $C$ using $t$ columns of $A$, setting each one to $A_{*,j}/\sqrt{t q_j}$ with probability $q_j$. With probability at least $1 - \delta$, for any rank-$k$ orthogonal projection $X$,

$\|C - XC\|_F^2 = \|A - XA\|_F^2 \pm \epsilon \|A\|_F^2.$

We observe that we can also estimate the row norms in sublinear time and immediately obtain a similar guarantee for row projection-cost preservation. Next, we describe how to apply projection-cost preserving sketching to low-rank approximation. Let $C$ be a column pcp for $A$. Then, an approximate solution for the best rank-$k$ approximation to $C$ is an approximate solution for the best rank-$k$ approximation to $A$. Formally,

Lemma 3.1. Let $C$ be a column pcp for $A$ satisfying the guarantee of Theorem A.2.
Let $P^*_C$ be the minimizing projection matrix for $\min_X \|C - XC\|_F^2$, and let $P^*_A$ be the projection matrix that minimizes $\min_X \|A - XA\|_F^2$. Then, for any projection matrix $P$ such that $\|C - PC\|_F^2 \le \|C - P^*_C C\|_F^2 + \epsilon \|C\|_F^2$, with probability at least $98/100$, $\|A - PA\|_F^2 \le \|A - P^*_A A\|_F^2 + \epsilon \|A\|_F^2$. A similar guarantee holds if $C$ is a row pcp of $A$.

4 A Sublinear Time Algorithm

Algorithm 2: First Sublinear Time Algorithm.
Input: a distance matrix $A \in \mathbb{R}^{m \times n}$, integer $k$, and $\epsilon > 0$.
1. Set $b_1 = \epsilon n^{0.34} \log(n)$ and $b_2 = \epsilon m^{0.34} \log(m)$. Set $s_1 = \Theta\big(\frac{m k^2 \log(m)}{b_1 \epsilon^2}\big)$ and $s_2 = \Theta\big(\frac{n k^2 \log(n)}{b_2 \epsilon^2}\big)$.
2. Let $\tilde{X}_j$ be the estimate for $\|A_{*,j}\|_2^2$ returned by Column Norm Estimation($A$, $b_1$). Recall, $\tilde{X}_j$ is an $O(m/b_1)$-approximation to $\|A_{*,j}\|_2^2$.
3. Let $q = \{q_1, q_2, \ldots, q_n\}$ denote a distribution over columns of $A$ such that $q_j = \tilde{X}_j / \sum_{j'} \tilde{X}_{j'} \ge \frac{b_1}{m} \cdot \frac{\|A_{*,j}\|_2^2}{\|A\|_F^2}$. Construct a column pcp for $A$ by sampling $s_1$ columns of $A$, setting each sampled column to $A_{*,j}/\sqrt{s_1 q_j}$ with probability $q_j$. Let $AS$ be the resulting $m \times s_1$ matrix, which follows the guarantees of Theorem A.2.
4. To account for the rescaling, consider $O(\epsilon^{-1}\log(n))$ weight classes for the scaling parameters of the columns of $AS$. Let $AS_{|W_g}$ be the columns of $AS$ restricted to the weight class $W_g$ (defined below).
5. Run Row Norm Estimation($AS_{|W_g}$, $b_2$) for each weight class independently and sum up the estimates for a given row. Let $\tilde{X}_i$ be the resulting $O(n/b_2)$-approximate estimator for $\|AS_{i,*}\|_2^2$.
6. Let $p = \{p_1, p_2, \ldots, p_m\}$ denote a distribution over rows of $AS$ such that $p_i = \tilde{X}_i / \sum_{i'} \tilde{X}_{i'} \ge \frac{b_2}{n} \cdot \frac{\|AS_{i,*}\|_2^2}{\|AS\|_F^2}$. Construct a row pcp for $AS$ by sampling $s_2$ rows of $AS$, setting each sampled row to $AS_{i,*}/\sqrt{s_2 p_i}$ with probability $p_i$. Let $TAS$ be the resulting $s_2 \times s_1$ matrix, which follows the guarantees of Corollary A.2.
7. Run the input-sparsity time low-rank approximation algorithm (corresponding to Theorem 4.2) on $TAS$ with rank parameter $k$ to obtain a rank-$k$ approximation to $TAS$, output in factored form: $L, D, W^T$. Note, $LD$ is an $s_2 \times k$ matrix and $W^T$ is a $k \times s_1$ matrix.
8.
Consider the regression problem $\min_X \|AS - XW^T\|_F^2$. Sketch the problem using the leverage scores of $W^T$, as shown in Theorem 4.3, to obtain a sampling matrix $E$ with $\mathrm{poly}(k/\epsilon)$ columns. Compute $X_{AS} = \mathrm{argmin}_X \|AS\,E - XW^T E\|_F^2$. Let $X_{AS} W^T = P' N'^T$ be such that $P'$ has orthonormal columns.
9. Consider the regression problem $\min_X \|A - P'X\|_F^2$. Sketch the problem using the leverage scores of $P'$, following Theorem 4.3, to obtain a sampling matrix $E'$ with $\mathrm{poly}(k/\epsilon)$ rows. Compute $X_A = \mathrm{argmin}_X \|E'A - E'P'X\|_F^2$.
Output: $M = P'$, $N^T = X_A$.

In this section, we give a sublinear time algorithm which relies on constructing column and row pcps, which in turn rely on our column and row norm estimators. Intuitively, we begin by obtaining coarse estimates of the column norms. Next, we sample a subset of the columns of $A$ with probability proportional to their column norm estimates to obtain a column pcp for $A$. We show that the rescaled matrix still has enough structure to get a coarse estimate of its row norms. Then, we compute the row norm estimates of the sampled, rescaled matrix, and subsample its rows to obtain a small matrix that is a row pcp. We run an input-sparsity time algorithm ([4]) on the small matrix to obtain a low-rank approximation. The main theorem we prove is as follows:

Theorem 4.1 (Sublinear Low-Rank Approximation). Let $A \in \mathbb{R}^{m \times n}$ be a matrix that satisfies approximate triangle inequality. Then, for any $\epsilon > 0$ and integer $k$, Algorithm 2 runs in time $O\big((m^{1.34} + n^{1.34})\,\mathrm{poly}(k/\epsilon)\big)$ and outputs matrices $M \in \mathbb{R}^{m \times k}$, $N \in \mathbb{R}^{n \times k}$ such that, with probability at least $9/10$,

$\|A - MN^T\|_F^2 \le \|A - A_k\|_F^2 + \epsilon \|A\|_F^2.$

Column Sampling. We observe that constructing column and row projection-cost preserving sketches requires sampling columns proportional to their relative norms and subsequently subsampling rows proportional to the relative row norms of the sampled, rescaled matrix.
In the previous section, we obtained coarse approximations to these norms, and we thus use our estimates as a proxy for the real distribution. For $j \in [n]$, let $\tilde{X}_j$ be our estimate for the squared norm of column $A_{*,j}$. We define a probability distribution over the columns of $A$ as $q_j = \tilde{X}_j / \sum_{i \in [n]} \tilde{X}_i$. Given that we can estimate column norms up to an $O(m/b_1)$-factor, $q_j \ge \frac{b_1}{m} \cdot \frac{\|A_{*,j}\|_2^2}{\|A\|_F^2}$, where $b_1$ is a parameter to be set later. Therefore, we oversample columns of $A$ by a $\Theta(m/b_1)$-factor to construct a column pcp for $A$. Let $AS$ be a scaled sample of $s_1 = \Theta\big(\frac{m k^2 \log(m)}{b_1 \epsilon^2}\big)$ columns of $A$ such that each sampled column is set to $A_{*,j}/\sqrt{s_1 q_j}$ with probability $q_j$. Then, by Theorem A.2, for any $\epsilon > 0$, with probability at least $99/100$, for all rank-$k$ projection matrices $X$, $\|AS - X\,AS\|_F^2 = \|A - XA\|_F^2 \pm \epsilon \|A\|_F^2$.

Handling Rescaled Columns. We note that during the construction of the column pcp, the $j$-th column, if sampled, is rescaled by $\frac{1}{\sqrt{s_1 q_j}}$. Therefore, the resulting matrix $AS$ may no longer be a distance matrix. To address this issue, we partition the columns of $AS$ into weight classes such that the $g$-th weight class contains column index $j$ if the corresponding scaling parameter $\frac{1}{q_j}$ lies in the interval $\big[(1+\epsilon)^g, (1+\epsilon)^{g+1}\big)$. Note, we can ignore the $\frac{1}{\sqrt{s_1}}$-factor since every entry is rescaled by that same constant. Formally, $W_g = \big\{ j \in [s_1] : \frac{1}{q_j} \in \big[(1+\epsilon)^g, (1+\epsilon)^{g+1}\big) \big\}$. Next, with high probability, for all $j \in [n]$, if column $j$ is sampled then $1/q_j \le n^c$ for a large constant $c$: if instead $q_j \le 1/n^c$, the probability that the $j$-th column is ever sampled would be at most $s_1/n^c \le 1/n^{c'}$ for some constant $c' > 1$, and we can union bound over such events for all $n$ columns. Hence the number of weight classes is at most $\log_{1+\epsilon}(n^c) = O(\epsilon^{-1}\log(n))$. Let $AS_{|W_g}$ denote the columns of $AS$ restricted to the set of indices in $W_g$. Observe that all entries in $AS_{|W_g}$ are scaled to within a $(1+\epsilon)$-factor of each other and therefore satisfy approximate triangle inequality (equation 2.1).
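To make the column-sampling and weight-class bucketing concrete, here is a small numpy sketch; for simplicity it uses exact column norms in place of the estimates $\tilde{X}_j$, and all names are ours:

```python
import numpy as np

def sample_columns_with_weight_classes(A, s, eps=0.5, rng=None):
    """Sample s columns with probability proportional to squared column
    norms, rescale each by 1/sqrt(s*q_j), and bucket the sampled columns
    into weight classes by the size of 1/q_j. Illustrative sketch with
    exact norms standing in for the estimates; names are ours."""
    rng = np.random.default_rng(rng)
    q = (A ** 2).sum(axis=0)
    q = q / q.sum()                          # sampling distribution over columns
    idx = rng.choice(A.shape[1], size=s, p=q)
    AS = A[:, idx] / np.sqrt(s * q[idx])     # rescaled sampled matrix
    # weight class g collects samples whose 1/q_j lies in [(1+eps)^g, (1+eps)^{g+1})
    g = np.floor(np.log(1.0 / q[idx]) / np.log1p(eps)).astype(int)
    classes = {}
    for t, gt in enumerate(g):
        classes.setdefault(int(gt), []).append(t)
    return AS, classes
```

With exact norms, each rescaled column contributes exactly $\|A\|_F^2/s$ to $\|AS\|_F^2$, so the Frobenius norm is preserved; with the coarse estimates of Section 2 it is preserved only in expectation.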
Therefore, the row norms of $AS_{|W_g}$ can be computed using Algorithm 1, and the estimator is an $O(n/b_2)$-approximation (for some parameter $b_2$), since the guarantee of Lemma 2.1 blows up by a factor of at most $(1+\epsilon)$. Summing the estimates from each partition above, with probability at least $99/100$, we obtain an $O(n/b_2)$-approximate estimate of $\|AS_{i,*}\|_2^2$, simultaneously for all $i \in [m]$. However, we note that each iteration of Algorithm 1 reads $b_2 m + n$ entries of $A$, and there are at most $O(\epsilon^{-1}\log(n))$ iterations.

Row Sampling. Next, we construct a row pcp for $AS$. For $i \in [m]$, let $\tilde{X}_i$ be an $O(n/b_2)$-approximate estimate for $\|AS_{i,*}\|_2^2$. Let $p = \{p_1, p_2, \ldots, p_m\}$ be a distribution over the rows of $AS$ such that $p_i = \tilde{X}_i / \sum_{i'} \tilde{X}_{i'} \ge \frac{b_2}{n} \cdot \frac{\|AS_{i,*}\|_2^2}{\|AS\|_F^2}$. Therefore, we oversample rows by a $\Theta(n/b_2)$-factor to obtain a row pcp for $AS$. Let $TAS$ be a scaled sample of $s_2 = \Theta\big(\frac{n k^2 \log(n)}{b_2 \epsilon^2}\big)$ rows of $AS$ such that each sampled row is set to $AS_{i,*}/\sqrt{s_2 p_i}$ with probability $p_i$. By the row analogue of Theorem A.2, with probability at least $99/100$, for all rank-$k$ projection matrices $X$, $\|TAS - TAS\,X\|_F^2 = \|AS - AS\,X\|_F^2 \pm \epsilon \|AS\|_F^2$.

Input-Sparsity Time Low-Rank Approximation. Next, we compute a low-rank approximation for the smaller matrix $TAS$ in input-sparsity time. To this end we use the following theorem from [4]:

Theorem 4.2 (Clarkson-Woodruff LRA). For $A \in \mathbb{R}^{m \times n}$, there is an algorithm that, with failure probability at most $1/10$, finds $L \in \mathbb{R}^{m \times k}$, $W \in \mathbb{R}^{n \times k}$, and a diagonal matrix $D \in \mathbb{R}^{k \times k}$, such that $\|A - LDW^T\|_F^2 \le (1+\epsilon)\|A - A_k\|_F^2$, and runs in time $O\big(\mathrm{nnz}(A) + (n+m)\,\mathrm{poly}(k/\epsilon)\big)$, where $\mathrm{nnz}(A)$ is the number of non-zero entries of $A$.

Running the input-sparsity time algorithm with the above guarantee on the matrix $TAS$, we obtain a rank-$k$ matrix $LDW^T$ such that $\|TAS - LDW^T\|_F^2 \le (1+\epsilon)\|TAS - (TAS)_k\|_F^2$, where $(TAS)_k$ is the best rank-$k$ approximation to $TAS$ under the Frobenius norm.
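The sketch-and-solve pattern behind such input-sparsity algorithms can be illustrated with a CountSketch embedding. The following is a simplified CW13-style routine in numpy (a dense sketch matrix for clarity; all names are ours), which restricts $A$ to the row space of $SA$ and then takes the best rank-$k$ approximation inside that subspace:

```python
import numpy as np

def countsketch(t, m, rng):
    """A t x m CountSketch: one random +-1 entry per column."""
    S = np.zeros((t, m))
    S[rng.integers(0, t, size=m), np.arange(m)] = rng.choice([-1.0, 1.0], size=m)
    return S

def sketched_lowrank(A, k, t, rng=None):
    """Simplified sketch of input-sparsity low-rank approximation:
    project A onto the row space of S @ A, then take the best rank-k
    approximation within that subspace. Names are ours."""
    rng = np.random.default_rng(rng)
    S = countsketch(t, A.shape[0], rng)
    Q = np.linalg.qr((S @ A).T)[0]         # n x t orthonormal basis containing rowspace(SA)
    U, s, Vt = np.linalg.svd(A @ Q, full_matrices=False)
    L = U[:, :k] * s[:k]                   # m x k
    W = Q @ Vt[:k].T                       # n x k
    return L, W                            # A is approximated by L @ W.T
```

In a true input-sparsity implementation, $S$ is applied implicitly in $O(\mathrm{nnz}(A))$ time rather than materialized as a dense matrix.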
Since $TAS$ is a small matrix, we can afford to read all of it, querying at most $O\big(\frac{nm \log(n)\log(m)}{b_1 b_2}\,\mathrm{poly}(k/\epsilon)\big)$ entries of $A$, and the algorithm runs in time $O\big(s_1 s_2 + (s_1 + s_2)\,\mathrm{poly}(k/\epsilon)\big)$.

Constructing a Solution for A. Note, while $LDW^T$ is an approximate rank-$k$ solution for $TAS$, it does not have the same dimensions as $A$. If we do not consider running time, we could construct a low-rank approximation to $A$ as follows: since projecting $TAS$ onto $W^T$ is approximately optimal, it follows from Lemma A.1 that, with probability $98/100$,

$\|AS - AS\,WW^T\|_F^2 \le \|AS - (AS)_k\|_F^2 + \epsilon \|AS\|_F^2$   (4.1)

Let $(AS)_k = PN^T$ be such that $P$ has orthonormal columns. Then, $\|AS - PP^T AS\|_F^2 = \|AS - (AS)_k\|_F^2$, and by Lemma A.1 it follows that, with probability $98/100$, $\|A - PP^T A\|_F^2 \le \|A - A_k\|_F^2 + \epsilon \|A\|_F^2$. However, even approximately computing a column space $P$ for $(AS)_k$ using an input-sparsity time algorithm is no longer sublinear. To get around this issue, we observe that an approximate solution for $TAS$ lies in the row space of $W^T$, and therefore an approximately optimal solution for $AS$ lies in the row space of $W^T$. We then set up the following regression problem: $\min_X \|AS - XW^T\|_F^2$. Note, this regression problem is still too big to be solved in sublinear time. Therefore, we sketch it by sampling columns of $AS$ according to the leverage scores of $W^T$ to set up a smaller regression problem. Formally, we use a theorem of [4] (Theorem 38) to approximately solve this regression problem (also see [9] for previous work):

Theorem 4.3 (Fast Regression). Given a matrix $A \in \mathbb{R}^{m \times n}$ and a rank-$k$ matrix $B \in \mathbb{R}^{m \times k}$ such that $B$ has orthonormal columns, the regression problem $\min_X \|A - BX\|_F^2$ can be solved up to $(1+\epsilon)$ relative error, with probability at least $2/3$, in time $O\big((m\log(m) + n)\,\mathrm{poly}(k/\epsilon)\big)$, by constructing a sketch $E$ with $\mathrm{poly}(k/\epsilon)$ rows and solving $\min_X \|EA - EBX\|_F^2$. Note, a similar guarantee holds for solving $\min_X \|A - XB\|_F^2$.

Since $W^T$ has orthonormal rows, the leverage scores are precomputed.
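As a small numpy illustration of this sketched regression step for $\min_X \|AS - XW^T\|_F^2$: for a $W^T$ with orthonormal rows, the leverage score of column $j$ is simply its squared column norm. All names below are ours, and the routine is a sketch of the technique rather than the exact procedure of Theorem 4.3:

```python
import numpy as np

def leverage_score_regression(A, Wt, t, rng=None):
    """Approximately solve min_X ||A - X @ Wt||_F^2 by sampling t columns
    with probability proportional to the leverage scores of Wt. Since Wt
    has orthonormal rows, its leverage scores are its squared column
    norms. Names are ours."""
    rng = np.random.default_rng(rng)
    lev = (Wt ** 2).sum(axis=0)              # leverage scores; they sum to k
    p = lev / lev.sum()
    idx = rng.choice(Wt.shape[1], size=t, p=p)
    scale = 1.0 / np.sqrt(t * p[idx])        # standard importance rescaling
    M = (Wt[:, idx] * scale).T               # t x k sketched design matrix
    Y = (A[:, idx] * scale).T                # t x m sketched targets
    X = np.linalg.lstsq(M, Y, rcond=None)[0].T
    return X                                  # m x k, so A is approximated by X @ Wt
```

Only the $t = \mathrm{poly}(k/\epsilon)$ sampled columns of $A$ are ever read, which is what keeps this step sublinear.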
With probability at least $99/100$, we can compute $X_{AS} = \mathrm{argmin}_X \|AS\,E - XW^T E\|_F^2$, where $E$ is a leverage score sketching matrix with $\mathrm{poly}(k/\epsilon)$ columns, as shown in Theorem 4.3. Then,

$\|AS - X_{AS} W^T\|_F^2 \le (1+\epsilon) \min_X \|AS - XW^T\|_F^2 \le (1+\epsilon) \|AS - AS\,WW^T\|_F^2 \le \|AS - (AS)_k\|_F^2 + O(\epsilon) \|AS\|_F^2$   (4.2)

where the last two inequalities follow from equation (4.1). Recall, $AS$ is an $m \times s_1$ matrix, and thus the running time is $O\big((m + s_1)\log(m)\,\mathrm{poly}(k/\epsilon)\big)$. Let $X_{AS} W^T = P' N'^T$ be such that $P'$ has orthonormal columns. Then, the column space of $P'$ contains an approximately optimal solution for $A$, since $\|AS - P'N'^T\|_F^2 \le \|AS - (AS)_k\|_F^2 + O(\epsilon)\|AS\|_F^2$ and $AS$ is a column pcp for $A$. It follows from Lemma A.1 that, with probability at least $98/100$,

$\|A - P'P'^T A\|_F^2 \le \|A - A_k\|_F^2 + O(\epsilon) \|A\|_F^2$   (4.3)

Therefore, there exists a good solution for $A$ in the column space of $P'$. Since we cannot compute this explicitly, we set up the following regression problem: $\min_X \|A - P'X\|_F^2$. Again, we sketch the regression problem above by sampling rows of $A$ according to the leverage scores of $P'$. We can then compute $X_A = \mathrm{argmin}_X \|E'A - E'P'X\|_F^2$ with probability at least $99/100$, where $E'$ is a leverage score sketching matrix with $\mathrm{poly}(k/\epsilon)$ rows. Then, using the properties of leverage score sampling from Theorem 4.3,

$\|A - P'X_A\|_F^2 \le (1+\epsilon) \min_X \|A - P'X\|_F^2 \le (1+\epsilon) \|A - P'P'^T A\|_F^2 \le \|A - A_k\|_F^2 + O(\epsilon) \|A\|_F^2$   (4.4)

where the second inequality follows from $P'^T A$ being some (not necessarily minimizing) candidate for $X$, and the last inequality follows from equation (4.3). Recall, $P'$ is an $m \times k$ matrix and, by Theorem 38 of [4], the time taken to solve the regression problem is $O\big((m\log(m) + n)\,\mathrm{poly}(k/\epsilon)\big)$. Therefore, we observe that $P'X_A$ suffices, and we output it in factored form by setting $M = P'$ and $N = X_A^T$. Union bounding over the probabilistic events and rescaling $\epsilon$, with probability at least $9/10$, Algorithm 2 outputs $M \in \mathbb{R}^{m \times k}$ and $N \in \mathbb{R}^{n \times k}$ such that the guarantees of Theorem 4.1 are satisfied. Finally, we analyze the overall running time of Algorithm 2.
Computing the estimates for the column norms and constructing the column pcp for $A$ takes $O\big(m\log(m)\,\mathrm{poly}(k/\epsilon) + b_1 n + m\big)$ time, and a similar guarantee holds for the rows. The input-sparsity time low-rank approximation runs in $O\big(s_1 s_2 + (s_1 + s_2)\,\mathrm{poly}(k/\epsilon)\big)$, and constructing a solution for $A$ is dominated by $O\big((m\log(m) + n)\,\mathrm{poly}(k/\epsilon)\big)$. Setting $b_1 = \epsilon n^{0.34}\log(n)$ and $b_2 = \epsilon m^{0.34}\log(m)$, we note that the overall running time is $\tilde{O}\big((m^{1.34} + n^{1.34})\,\mathrm{poly}(k/\epsilon)\big)$, where $\tilde{O}$ hides $\log(m)$ and $\log(n)$ factors. This completes the proof of Theorem 4.1.

Table 1: Running time (in seconds) on the Clustering dataset for rank 20.

Metric   SVD      IS      Sublinear
ℓ2       398.76   8.94    1.69
ℓ1       410.60   8.15    1.81
ℓ∞       427.90   9.18    1.63
ℓc       452.16   8.49    1.76

Table 2: Running time (in seconds) on the MNIST dataset for rank 40.

Metric   SVD      IS      Sublinear
ℓ2       398.50   34.32   4.16
ℓ1       560.90   39.51   3.72
ℓ∞       418.01   39.32   3.99
ℓc       390.07   38.33   3.91

We extend the above result to obtain a better exponent in the running time. The critical observation here is that we can recursively sketch rows and columns of $A$ such that all the rank-$k$ projections are preserved. Intuitively, it is important to preserve all rank-$k$ subspaces, since we do not know which projection will be approximately optimal when we recurse back up. At a high level, the algorithm then recursively subsamples columns and rows of $A$ such that we obtain projection-cost preserving sketches at each step. We defer the details to the Supplementary Material.

Theorem 4.4. Let $A \in \mathbb{R}^{m \times n}$ be a matrix that satisfies approximate triangle inequality. Then, for any $\epsilon > 0$, integer $k$, and a small constant $\gamma > 0$, there exists an algorithm that runs in time $\tilde{O}\big((m^{1+\gamma} + n^{1+\gamma})\,\mathrm{poly}(k/\epsilon)\big)$ and outputs matrices $M \in \mathbb{R}^{m \times k}$ and $N \in \mathbb{R}^{n \times k}$ such that, with probability at least $9/10$, $\|A - MN^T\|_F^2 \le \|A - A_k\|_F^2 + \epsilon \|A\|_F^2$.

5 Relative Error Guarantees

In this section, we consider the relative error guarantee (1.2) for distance matrices.
We begin by showing a lower bound for any relative error approximation for distance matrices, which also precludes the possibility of a sublinear bicriteria algorithm outputting a rank-$\mathrm{poly}(k)$ matrix satisfying the rank-$k$ relative error guarantee.

Theorem 5.1. Let $A$ be an $n \times n$ distance matrix, and let $B$ be a rank-$\mathrm{poly}(k)$ matrix such that $\|A - B\|_F^2 \le c\,\|A - A_k\|_F^2$ for any constant $c > 1$. Then, any algorithm that outputs such a $B$ requires $\Omega(\mathrm{nnz}(A))$ time.

Euclidean Distance Matrices. We show that in the special case of Euclidean distances, when the entries correspond to squared distances, there exists a bicriteria algorithm that outputs a rank-$(k+4)$ matrix satisfying the relative error rank-$k$ low-rank approximation guarantee. Note, here the point sets $P$ and $Q$ are identical. Let $A$ be such a matrix, i.e., $A_{i,j} = \|x_i - x_j\|_2^2 = \|x_i\|_2^2 + \|x_j\|_2^2 - 2\langle x_i, x_j\rangle$. Then, we can write $A$ as $A_1 + A_2 - 2B$, such that each entry in the $i$-th row of $A_1$ is $\|x_i\|_2^2$, each entry in the $j$-th column of $A_2$ is $\|x_j\|_2^2$, and $B$ is a PSD matrix with $B_{i,j} = \langle x_i, x_j\rangle$. The main ingredient we use is the sublinear low-rank approximation of PSD matrices from [14]. We show that there exists an algorithm (see Supplementary Material) that outputs the description of a rank-$(k+4)$ matrix $AWW^T$ in sublinear time such that it satisfies the relative-error rank-$k$ low-rank approximation guarantee.

Theorem 5.2. Let $A$ be a Euclidean distance matrix. Then, for any $\epsilon > 0$ and integer $k$, there exists an algorithm that, with probability at least $9/10$, outputs a rank-$(k+4)$ matrix $WW^T$ such that $\|A - AWW^T\|_F \le (1+\epsilon)\|A - A_k\|_F$, where $A_k$ is the best rank-$k$ approximation to $A$, and runs in $O(n\,\mathrm{poly}(k/\epsilon))$ time.

6 Experiments

In this section, we benchmark the performance of our sublinear time algorithm against the conventional SVD algorithm (optimal error), iterative SVD methods, and the input-sparsity time algorithm from [4]. We use the built-in svd function in numpy's linear algebra package to compute the truncated SVD.
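For reference, the SVD baseline amounts to truncating numpy's dense SVD at rank $k$; a minimal sketch (names ours):

```python
import numpy as np

def svd_baseline(A, k):
    """The (optimal) SVD baseline: best rank-k approximation of A under
    the Frobenius norm, via numpy's dense SVD. Names are ours."""
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    Ak = (U[:, :k] * s[:k]) @ Vt[:k]
    err = np.linalg.norm(A - Ak, 'fro')
    return Ak, err
```

The returned error equals $\sqrt{\sum_{i > k} \sigma_i^2}$, the optimal Frobenius error of any rank-$k$ approximation, which is why we treat this baseline as the gold standard for accuracy.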
We also consider the iterative SVD algorithm implemented by the svds function in scipy's sparse linear algebra package; however, the error it achieves is typically 3 orders of magnitude worse than computing the SVD, and thus we defer those results to the Supplementary Material. We implement the input-sparsity time low-rank approximation algorithm from [4] using a CountSketch matrix [3]. Finally, we implement Algorithm 2, as this has small recursive overhead. The experiments are run on a MacBook Pro 2017 with 16GB RAM and a 2.8GHz quad-core 7th-generation Intel Core i7 processor.

[Figure 6.1: Error on a synthetic dataset with 20 clusters (left) and on the MNIST dataset (right). The distance matrices are created using the ℓ2, ℓ1, ℓ∞ and ℓc metrics. We compare the error achieved by the SVD (optimal), our Sublinear Algorithm, and the Input Sparsity Algorithm from [4].]

The first dataset is a synthetic clustering dataset generated in scikit-learn using the make_blobs function. We generate 10000 points with 200 features, split up into 20 clusters. We note that given the clustered structure, this dataset is expected to have a good rank-20 approximation, as observed in our experiments. The second dataset is the popular MNIST dataset, a collection of 70000 handwritten digits, from which we sample 10000 points. In the Supplementary Material we also consider the Gisette and Poker datasets. Given $n$ points $\{p_1, p_2, \ldots, p_n\}$ in $\mathbb{R}^d$, we compute an $n \times n$ distance matrix $A$ such that $A_{i,j} = d(p_i, p_j)$, where $d$ is the Manhattan (ℓ1), Euclidean (ℓ2), Chebyshev (ℓ∞), or Canberra³ (ℓc) distance. We compare the Frobenius norm error of the algorithms in Figure 6.1 and their corresponding running times in Tables 1 and 2. We note that our sublinear time algorithm is only marginally worse in terms of absolute error, but runs 100-250 times faster than the SVD and 8-20 times faster than the input-sparsity time algorithm.
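These distance matrices can be materialized with library routines such as scipy's cdist; a pure-numpy equivalent for three of the four metrics (Canberra omitted for brevity; names ours):

```python
import numpy as np

def distance_matrix(P, Q, metric="l2"):
    """Pairwise distance matrix A with A[i, j] = d(p_i, q_j) for the
    Euclidean, Manhattan, and Chebyshev metrics. A pure-numpy stand-in
    for library routines like scipy's cdist; names are ours."""
    diff = P[:, None, :] - Q[None, :, :]     # shape (m, n, d) of coordinate differences
    if metric == "l2":
        return np.sqrt((diff ** 2).sum(-1))
    if metric == "l1":
        return np.abs(diff).sum(-1)
    if metric == "linf":
        return np.abs(diff).max(-1)
    raise ValueError(f"unknown metric: {metric}")
```

For the 10000-point datasets above, this materializes all $10^8$ entries; the point of the sublinear algorithm is precisely to avoid forming most of this matrix.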
Acknowledgments: The authors thank the National Science Foundation for partial support under Grant No. CCF-1815840. Part of this work was done while the author was visiting the Simons Institute for the Theory of Computing.

³See https://en.wikipedia.org/wiki/Canberra_distance

References

[1] Jean Bourgain, Sjoerd Dirksen, and Jelani Nelson. Toward a unified theory of sparse dimensionality reduction in Euclidean space. In Proceedings of the Forty-Seventh Annual ACM on Symposium on Theory of Computing, STOC 2015, Portland, OR, USA, June 14-17, 2015, pages 499–508, 2015.

[2] Robert Cattral, Franz Oppacher, and Dwight Deugo. Evolutionary data mining with automatic rule generalization. Recent Advances in Computers, Computing and Communications, 1(1):296–300, 2002.

[3] Moses Charikar, Kevin Chen, and Martin Farach-Colton. Finding frequent items in data streams. In International Colloquium on Automata, Languages, and Programming, pages 693–703. Springer, 2002.

[4] Kenneth L. Clarkson and David P. Woodruff. Low rank approximation and regression in input sparsity time. In Proceedings of the Forty-Fifth Annual ACM Symposium on Theory of Computing, pages 81–90. ACM, 2013.

[5] Michael B. Cohen. Nearly tight oblivious subspace embeddings by trace inequalities. In Proceedings of the Twenty-Seventh Annual ACM-SIAM Symposium on Discrete Algorithms, SODA 2016, Arlington, VA, USA, January 10-12, 2016, pages 278–287, 2016.

[6] Michael B. Cohen, Cameron Musco, and Christopher Musco. Input sparsity time low-rank approximation via ridge leverage score sampling. In Proceedings of the Twenty-Eighth Annual ACM-SIAM Symposium on Discrete Algorithms, SODA 2017, Barcelona, Spain, January 16-19, pages 1758–1777, 2017.

[7] Erik D. Demaine, Francisco Gomez-Martin, Henk Meijer, David Rappaport, Perouz Taslakian, Godfried T. Toussaint, Terry Winograd, and David R. Wood. The distance geometry of music. Comput. Geom., 42(5):429–454, 2009.
[8] Ivan Dokmanic, Reza Parhizkar, Juri Ranieri, and Martin Vetterli. Euclidean distance matrices: essential theory, algorithms, and applications. IEEE Signal Processing Magazine, 32(6):12–30, 2015.

[9] Petros Drineas, Michael W. Mahoney, and S. Muthukrishnan. Relative-error CUR matrix decompositions. SIAM Journal on Matrix Analysis and Applications, 30(2):844–881, 2008.

[10] Alan M. Frieze, Ravi Kannan, and Santosh Vempala. Fast Monte-Carlo algorithms for finding low-rank approximations. J. ACM, 51(6):1025–1041, 2004.

[11] Isabelle Guyon, Steve Gunn, Asa Ben-Hur, and Gideon Dror. Result analysis of the NIPS 2003 feature selection challenge. In Advances in Neural Information Processing Systems, pages 545–552, 2005.

[12] Viren Jain and Lawrence K. Saul. Exploratory analysis and visualization of speech and music by locally linear embedding. In Departmental Papers (CIS), pages iii–984, June 2004.

[13] Xiangrui Meng and Michael W. Mahoney. Low-distortion subspace embeddings in input-sparsity time and applications to robust linear regression. In Symposium on Theory of Computing Conference, STOC 13, Palo Alto, CA, USA, June 1-4, 2013, pages 91–100, 2013.

[14] Cameron Musco and David P. Woodruff. Sublinear time low-rank approximation of positive semidefinite matrices. In 58th IEEE Annual Symposium on Foundations of Computer Science, FOCS 2017, Berkeley, CA, USA, October 15-17, 2017, pages 672–683, 2017.

[15] Jelani Nelson and Huy L. Nguyen. OSNAP: faster numerical linear algebra algorithms via sparser subspace embeddings. In 54th Annual IEEE Symposium on Foundations of Computer Science, FOCS 2013, 26-29 October, 2013, Berkeley, CA, USA, pages 117–126, 2013.

[16] Tamas Sarlos. Improved approximation algorithms for large matrices via random projections. In FOCS, pages 143–152, 2006.

[17] Joshua B. Tenenbaum, Vin De Silva, and John C. Langford. A global geometric framework for nonlinear dimensionality reduction. Science, 290(5500):2319–2323, 2000.

[18] Kilian Q. Weinberger and Lawrence K. Saul. Unsupervised learning of image manifolds by semidefinite programming. In 2004 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR 2004), 27 June - 2 July 2004, Washington, DC, USA, pages 988–995, 2004.