# Faster Linear Algebra for Distance Matrices

Piotr Indyk (MIT, indyk@mit.edu) and Sandeep Silwal (MIT, silwal@mit.edu)

36th Conference on Neural Information Processing Systems (NeurIPS 2022).

**Abstract.** The distance matrix of a dataset $X$ of $n$ points with respect to a distance function $f$ represents all pairwise distances between points in $X$ induced by $f$. Due to their wide applicability, distance matrices and related families of matrices have been the focus of many recent algorithmic works. We continue this line of research and take a broad view of algorithm design for distance matrices with the goal of designing fast algorithms, which are specifically tailored for distance matrices, for fundamental linear algebraic primitives. Our results include efficient algorithms for computing matrix-vector products for a wide class of distance matrices, such as the $\ell_1$ metric for which we get a linear runtime, as well as an $\Omega(n^2)$ lower bound for any algorithm which computes a matrix-vector product for the $\ell_\infty$ case, showing a separation between the $\ell_1$ and $\ell_\infty$ metrics. Our upper bound results, in conjunction with recent works on the matrix-vector query model, have many further downstream applications, including the fastest algorithm for computing a relative error low-rank approximation for the distance matrix induced by $\ell_1$ and $\ell_2^2$ functions and the fastest algorithm for computing an additive error low-rank approximation for the $\ell_2$ metric, in addition to applications for fast matrix multiplication among others. We also give algorithms for constructing distance matrices and show that one can construct an approximate $\ell_2$ distance matrix in time faster than the bound implied by the Johnson-Lindenstrauss lemma.

## 1 Introduction

Given a set of $n$ points $X = \{x_1, \ldots, x_n\}$, the distance matrix of $X$ with respect to a distance function $f$ is defined as the $n \times n$ matrix $A$ satisfying $A_{i,j} = f(x_i, x_j)$. Distance matrices are ubiquitous objects arising in various applications ranging from learning image manifolds [TSL00, WS06], signal processing [SY07], and biological analysis [HS93], to non-linear dimensionality reduction [Kru64, Kru78, TSL00, CC08], to name a few¹. Unfortunately, explicitly computing and storing $A$ requires at least $\Omega(n^2)$ time and space. Such complexities are prohibitive for scaling to large datasets. A silver lining is that in many settings, the matrix $A$ is not explicitly required. Indeed, in many applications it suffices to compute some underlying function or property of $A$, such as the eigenvalues and eigenvectors of $A$ or a low-rank approximation of $A$. Thus an algorithm designer can hope to use the special geometric structure encoded by $A$ to design faster algorithms tailored for such tasks. Therefore, it is not surprising that many recent works explicitly take advantage of the underlying geometric structure of distance matrices, and other related families of matrices, to design fast algorithms (see Section 1.2 for a thorough discussion of prior works). In this work, we continue this line of research and take a broad view of algorithm design for distance matrices. Our main motivating question is the following:

*Can we design algorithms for fundamental linear algebraic primitives which are specifically tailored for distance matrices and related families of matrices?*

¹We refer the reader to the survey [DPRV15] for a more thorough discussion of applications of distance matrices.
We make progress towards the motivating question by studying three of the most fundamental primitives in algorithmic linear algebra. Specifically:

1. We study upper and lower bounds for computing matrix-vector products for a wide array of distance matrices,
2. We give algorithms for multiplying distance matrices faster than general matrices, and
3. We give fast algorithms for constructing distance matrices.

### 1.1 Our Results

We now describe our contributions in more detail.

**1. We study upper and lower bounds for constructing matrix-vector queries for a wide array of distance matrices.** A matrix-vector query algorithm accepts a vector $z$ as input and outputs the vector $Az$. There is substantial motivation for studying such queries. Indeed, there is now a rich literature for fundamental linear algebra algorithms which are in the "matrix-free" or "implicit" model. These algorithms only assume access to the underlying matrix via matrix-vector queries. Some well known algorithms in this model include the power method for computing eigenvalues and the conjugate gradient method for solving a system of linear equations. For many fundamental functions of $A$, nearly optimal bounds in terms of the number of queries have been achieved [MM15, BHSW20, BCW22]. Furthermore, having access to matrix-vector queries also allows the simulation of any randomized sketching algorithm, a well studied algorithmic paradigm in its own right [Woo14]. This is because randomized sketching algorithms operate on the matrix $\Pi A$ or $A\Pi$, where $\Pi$ is a suitably chosen random matrix, such as a Gaussian matrix. Typically, $\Pi$ is chosen so that the sketches $\Pi A$ or $A\Pi$ have significantly smaller row or column dimension compared to $A$. If $A$ is symmetric, we can easily acquire both types of matrix sketches via a small number of matrix-vector queries. Therefore, creating efficient versions of matrix-vector queries for distance matrices automatically lends itself to many further downstream applications. We remark that our algorithms have access to the set of input points but do not explicitly create the distance matrix.

A canonical example of our upper bound results is the construction of matrix-vector queries for the function $f(x, y) = \|x - y\|_p^p$.

**Theorem 1.1.** Let $p \ge 1$ be an integer. Suppose we are given a dataset of $n$ points $X = \{x_1, \ldots, x_n\} \subseteq \mathbb{R}^d$. $X$ implicitly defines the matrix $A_{i,j} = \|x_i - x_j\|_p^p$. Given a query $z \in \mathbb{R}^n$, we can compute $Az$ exactly in time $O(ndp)$. If $p$ is odd, we also require $O(nd \log n)$ preprocessing time.

We give similar guarantees for a wide array of functions $f$, and we refer the reader to Table 1, which summarizes our matrix-vector query upper bound results. Note that some of the functions $f$ we study in Table 1 do not necessarily induce a metric in the strict mathematical sense (for example, the function $f(x, y) = \|x - y\|_2^2$ does not satisfy the triangle inequality). Nevertheless, we still refer to such functions under the broad umbrella term of "distance functions" for ease of notation. We always explicitly state the function $f$ we are referring to. Crucially, most of our bounds have a linear dependency on $n$, which allows for scalable computation as the size of the dataset $X$ grows. Our upper bounds are optimal in many cases; see Theorem A.13. Combining our upper bound results with optimized matrix-free methods, immediate corollaries of our results include faster algorithms for eigenvalue and singular value computations and low-rank approximations.
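To make Theorem 1.1 concrete, consider the $p = 2$ case: by the well-known decomposition $\|x_i - x_j\|_2^2 = \|x_i\|_2^2 + \|x_j\|_2^2 - 2\langle x_i, x_j \rangle$ (recalled in Section 1.2), the query reduces to two rank-one terms plus a Gram-matrix product, each computable in $O(nd)$ time. The following minimal NumPy sketch is our own illustration under that decomposition, not the paper's code.

```python
import numpy as np

def l2_squared_matvec(X, z):
    """Exact A @ z for A_ij = ||x_i - x_j||_2^2 in O(nd) time, using
    A_ij = s_i + s_j - 2 <x_i, x_j> with s_i = ||x_i||_2^2, so that
    A z = s * (1^T z) + (s^T z) * 1 - 2 * X (X^T z)."""
    s = np.einsum('ij,ij->i', X, X)                        # squared row norms, O(nd)
    return s * z.sum() + np.full(len(z), s @ z) - 2.0 * (X @ (X.T @ z))

# Sanity check against the explicitly formed matrix on a small instance.
rng = np.random.default_rng(0)
X, z = rng.standard_normal((300, 10)), rng.standard_normal(300)
A = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=-1)
print(np.allclose(A @ z, l2_squared_matvec(X, z)))         # True
```

Any matrix-free routine (power iteration, block Krylov methods, sketching) can then be driven by such a query without ever materializing $A$.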
Low-rank approximation is of special interest as it has been widely studied for distance matrices; for low-rank approximation, our bounds outperform prior results for specific distance functions. For example, for the $\ell_1$ and $\ell_2^2$ case (and in general PSD matrices), [BCW20] showed that a rank-$k$ approximation can be found in time $O(ndk/\epsilon + nk^{\omega-1}/\epsilon^{\omega-1})$. This bound has extra $\mathrm{poly}(1/\epsilon)$ overhead compared to our bound stated in Table 2. The work of [IVWW19] has a worse $\mathrm{poly}(k, 1/\epsilon)$ overhead for an additive error approximation for the $\ell_2$ case. See Section 1.2 for further discussion of prior works. The downstream applications of matrix-vector queries are summarized in Table 2.

| Function | $f(x, y)$ | Preprocessing | Query Time | Reference |
|---|---|---|---|---|
| $\ell_p^p$ for $p$ even | $\lVert x - y \rVert_p^p$ | -- | $O(ndp)$ | Thms. A.1 / A.3 |
| $\ell_p^p$ for $p$ odd | $\lVert x - y \rVert_p^p$ | $O(nd \log n)$ | $O(ndp)$ | Thms. 2.2 / A.4 |
| Mixed $\ell_\infty$ | $\max_{i,j} \lvert x_i - y_j \rvert$ | $O(nd \log n)$ | $O(n^2)$ | Thm. A.5 |
| Mahalanobis distance$^2$ | $x^\top M y$ | $O(nd^2)$ | $O(nd)$ | Thm. A.6 |
| Polynomial kernel | $\langle x, y \rangle^p$ | -- | $O(nd^p)$ | Thm. A.7 |
| Total variation distance | $\mathrm{TV}(x, y)$ | $O(nd \log n)$ | $O(nd)$ | Thm. A.8 |
| KL divergence | $D_{KL}(x \Vert y)$ | -- | $O(nd)$ | Thm. A.2 |
| Symmetric divergence | $D_{KL}(x \Vert y) + D_{KL}(y \Vert x)$ | -- | $O(nd)$ | Thm. A.9 |
| Cross entropy | $H(x, y)$ | -- | $O(nd)$ | Thm. A.9 |
| Hellinger distance$^2$ | $\sum_{i=1}^d \sqrt{x(i)\, y(i)}$ | -- | $O(nd)$ | Thm. A.10 |

Table 1: A summary of our results for exact matrix-vector queries.

We also study fundamental limits for any upper bound algorithms. In particular, we show that no algorithm can compute a matrix-vector query for general inputs for the $\ell_\infty$ metric in subquadratic time, assuming a standard complexity-theoretic assumption called the Strong Exponential Time Hypothesis (SETH) [IP01, IPZ01].

**Theorem 1.2.** For any $\delta > 0$ and $d = \omega(\log n)$, any algorithm for exactly computing $Az$ for any input $z$, where $A$ is the $\ell_\infty$ distance matrix, requires $\Omega(n^{2-\delta})$ time (assuming SETH).

This shows a separation between the functions listed in Table 1 and the $\ell_\infty$ metric. Surprisingly, we can answer approximate matrix-vector queries in substantially faster time.

**Theorem 1.3.** Suppose $X \subseteq \{0, 1, \ldots, O(1)\}^d$. We can compute $By$ in time $O\big(n \cdot d^{\,O(\sqrt{d} \log(d/\epsilon))}\big)$ where $\|A - B\|_\infty \le \epsilon$.

To put the above result into context, the lower bound of Theorem 1.2 holds for point sets in $\{0, 1, 2\}^d$ with $d = \omega(\log n)$ dimensions. In contrast, if we relax to an approximation guarantee, we can obtain a subquadratic-time algorithm for $d$ up to roughly $\log^2(n)/\log\log(n)$. Finally, we provide a general understanding of the limits of our upper bound techniques. In Theorem B.1, we show that essentially the only $f$ for which our upper bound techniques apply have a "linear structure" after a suitable transformation. We refer to Appendix Section B for details.

**2. We give algorithms for multiplying distance matrices faster than general matrices.** Fast matrix-vector queries also automatically imply fast matrix multiplication, which can be reduced to a series of matrix-vector queries (a short sketch of this column-by-column reduction is given below). For concreteness, if $f$ is the $\ell_p^p$ function which induces $A$, and $B$ is any $n \times n$ matrix, we can compute $AB$ in time $O(n^2 dp)$. This is substantially faster than the general matrix multiplication bound of $n^\omega \approx n^{2.37}$. We also give an improvement of this result for the case where we are multiplying two distance matrices arising from $\ell_2^2$. See Table 2 for a summary.

**3. We give fast algorithms for constructing distance matrices.** Finally, we give fast algorithms for constructing approximate distance matrices.
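As referenced under the second contribution above, the reduction from matrix multiplication to matrix-vector queries is simply column-by-column. The sketch below is our own minimal illustration (function and variable names are ours), not the appendix's exact procedure.

```python
import numpy as np

def multiply_via_queries(matvec, B):
    """Compute A @ B given only a matrix-vector query oracle for A:
    one query per column of B. With an O(nd)-time query (e.g. the l1
    query of Theorem 2.2), this multiplies a distance matrix by an
    arbitrary n x n matrix B in O(n^2 d) total time."""
    return np.column_stack([matvec(B[:, j]) for j in range(B.shape[1])])

# Sanity check with an explicit matrix standing in for the oracle.
rng = np.random.default_rng(0)
A, B = rng.standard_normal((50, 50)), rng.standard_normal((50, 60))
print(np.allclose(multiply_via_queries(lambda z: A @ z, B), A @ B))   # True
```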
To establish some context, recall the classical Johnson-Lindenstrauss (JL) lemma, which (roughly) states that a random projection of a dataset $X \subseteq \mathbb{R}^d$ of size $n$ onto $O(\log n)$ dimensions approximately preserves all pairwise distances [JL84]. A common application of this lemma is to instantiate the $\ell_2$ distance matrix. A naive algorithm which computes the distance matrix after performing the JL projection requires approximately $O(n^2 \log n)$ time. Surprisingly, we show that the JL lemma is not tight with respect to creating an approximate $\ell_2$ distance matrix; we show that one can initialize the $\ell_2$ distance matrix in an asymptotically better runtime.

**Theorem 1.4** (Informal; see Theorem D.5). We can calculate an $n \times n$ matrix $B$ such that each $(i, j)$ entry $B_{ij}$ of $B$ satisfies $(1 - \epsilon)\|x_i - x_j\|_2 \le B_{ij} \le (1 + \epsilon)\|x_i - x_j\|_2$ in time $O(\epsilon^{-2} n^2 \log^2(\epsilon^{-1} \log n))$.

| Problem | $f(x, y)$ | Our runtime | Prior work |
|---|---|---|---|
| $(1+\epsilon)$ relative error rank-$k$ low-rank approximation | $\ell_1$, $\ell_2^2$ | $O(ndk/\epsilon^{1/3} + nk^{\omega-1})$ (Thm. C.4) | $O(ndk/\epsilon + nk^{\omega-1}/\epsilon^{\omega-1})$ [BCW20] |
| Additive error $\epsilon \lVert A \rVert_F$ rank-$k$ low-rank approximation | $\ell_2$ | $O(ndk/\epsilon^{1/3} + nk^{\omega-1})$ (Thm. C.6) | $O(nd \, \mathrm{poly}(k, 1/\epsilon))$ [IVWW19] |
| $(1+\epsilon)$ relative error rank-$k$ low-rank approximation | Any in Table 1 | $O(Tk/\epsilon^{1/3} + nk^{\omega-1})$ (Thm. C.7) | -- |
| $(1-\epsilon)$ approximation to top $k$ singular values | Any in Table 1 | $O(Tk/\epsilon^{1/2} + nk^2)$ (Thm. C.8) | [MM15] |
| Multiply distance matrix $A$ with any $B \in \mathbb{R}^{n \times n}$ | Any in Table 1 | $O(Tn)$ (Lem. C.9) | $O(n^\omega)$ |
| Multiply two distance matrices $A$ and $B$ | $\ell_2^2$ | $O(n^2 d^{\omega-2})$ (Lem. C.11) | $O(n^\omega)$ |

Table 2: Applications of our matrix-vector query results. $T$ denotes the matrix-vector query time, given in Table 1. $\omega \approx 2.37$ is the matrix multiplication constant [AW21].

Our result can be viewed as the natural runtime bound which would follow if the JL lemma implied an embedding dimension bound of $O(\mathrm{poly}(\log \log n))$. While this is impossible, as it would imply an exponential improvement over the JL bound, which is tight [LN17], we achieve our speedup by carefully reusing distance calculations via tools from metric compression [IRW17]. Our results also extend to the $\ell_1$ distance matrix; see Theorem D.5 for details.

**Notation.** Our dataset will be the $n$ points $X = \{x_1, \ldots, x_n\} \subseteq \mathbb{R}^d$. For points in $X$, we denote by $x_i(j)$ the $j$th coordinate of point $x_i$ for clarity. For all other vectors $v$, $v_i$ denotes the $i$th coordinate. We are interested in matrices of the form $A_{i,j} = f(x_i, x_j)$ for $f : \mathbb{R}^d \times \mathbb{R}^d \to \mathbb{R}$ which measures the similarity between any pair of points. $f$ might not necessarily be a distance function, but we use the terminology "distance function" for ease of notation. We will always explicitly state the function $f$ as needed. $\omega \approx 2.37$ denotes the matrix multiplication constant, i.e., the exponent of $n$ in the time required to compute the product of two $n \times n$ matrices [AW21].

### 1.2 Related Works

**Matrix-Vector Product Queries.** Our work can be understood as being part of a long line of classical works on the matrix-free or implicit model, as well as the active recent line of works on the matrix-vector query model. Many widely used linear algebraic algorithms such as the power method, the Lanczos algorithm [Lan50], conjugate gradient descent [S+94], and Wiedemann's coordinate recurrence algorithm [Wie86], to name a few, all fall into this paradigm. Recent works such as [MM15, BHSW20, BCW22] have succeeded in precisely nailing down the query complexity of these classical algorithms, in addition to various other algorithmic tasks such as low-rank approximation [BCW22], trace estimation [MMMW21], and other linear-algebraic functions [SWYZ21b, RWZ20].
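One concrete instance of the sketch-simulation observation from Section 1.1 (and the one invoked below to simulate the sampling step of [IVWW19]) is estimating row norms: $O(\log n)$ matrix-vector queries with Gaussian vectors yield the sketch $AG$, whose row norms estimate the row norms of $A$ up to a constant factor by the Johnson-Lindenstrauss lemma. The sketch below is our own illustration; the parameter choices are arbitrary.

```python
import numpy as np

def approx_row_norms(matvec, n, k=32, seed=0):
    """Estimate the l2 norms of the rows of a symmetric n x n matrix A that is
    only accessible through `matvec`: k Gaussian queries give the sketch A @ G,
    and by the Johnson-Lindenstrauss lemma k = O(log n) columns preserve every
    row norm up to a constant factor with high probability."""
    G = np.random.default_rng(seed).standard_normal((n, k))
    AG = np.column_stack([matvec(G[:, j]) for j in range(k)])   # k queries
    return np.linalg.norm(AG, axis=1) / np.sqrt(k)

# Example: row norms of an l2^2 distance matrix, here using an explicit oracle.
rng = np.random.default_rng(1)
X = rng.standard_normal((400, 5))
A = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=-1)
est, true = approx_row_norms(lambda z: A @ z, n=400), np.linalg.norm(A, axis=1)
print((est / true).min(), (est / true).max())   # ratios concentrate around 1
```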
There is also a rich literature on query-based algorithms in other contexts with the goal of minimizing the number of queries used. Examples include graph queries [Gol17], distribution queries [Can20], and constraint-based queries [ES20] in property testing, inner product queries in compressed sensing [EK12], and quantum queries [LSZ21, CHL21]. Most prior works on query-based models assume black-box access to matrix-vector queries. While this is a natural model which allows for the design of non-trivial algorithms and lower bounds, it is not always clear how such queries can be initialized. In contrast, the focus of our work is not on obtaining query complexity bounds, but rather on complementing prior works by creating an efficient matrix-vector query for a natural class of matrices.

**Subquadratic Algorithms for Distance Matrices.** Most work on subquadratic algorithms for distance matrices has focused on the problem of computing a low-rank approximation. [BW18, IVWW19] both obtain an additive error low-rank approximation applicable for all distance matrices. These works only assume access to the entries of the distance matrix, whereas we assume we also have access to the underlying dataset. [BCW20] study the problem of computing the low-rank approximation of PSD matrices, also with sample access to the entries of the matrix. Their results extend to low-rank approximation for the $\ell_1$ and $\ell_2^2$ distance matrices, in addition to other more specialized metrics such as spherical metrics. Table 2 lists the runtime comparisons between their results and ours. Practically, the algorithm of [IVWW19] is the easiest to implement and has outstanding empirical performance. We note that we can easily simulate their algorithm with no overall asymptotic runtime overhead using $O(\log n)$ matrix-vector queries. Indeed, their algorithm proceeds by sampling rows of the matrix according to their squared $\ell_2$ norms and then post-processing these rows. The sampling probabilities only need to be accurate up to a factor of two. We can acquire these sampling probabilities by performing $O(\log n)$ matrix-vector queries which sketch the rows onto dimension $O(\log n)$ and preserve all row norms up to a factor of two with high probability, due to the Johnson-Lindenstrauss lemma [JL84]. This procedure only incurs an additional runtime of $O(T \log n)$, where $T$ is the time required to perform a matrix-vector query.

The paper [ILLP04] shows that the exact $\ell_1$ distance matrix can be created in time $O(n^{(\omega+3)/2}) \approx n^{2.69}$ in the case of $d = n$, which is asymptotically faster than the naive bound of $O(n^2 d) = O(n^3)$. In contrast, we focus on creating an (entry-wise) approximate distance matrix for all values of $d$.

We also compare to the paper of [ACSS20]. In summary, their main upper bounds are approximation algorithms, while we mainly focus on exact algorithms. Concretely, they study matrix-vector products for matrices of the form $A_{i,j} = f(\|x_i - x_j\|_2^2)$ for some function $f : \mathbb{R} \to \mathbb{R}$. They present results on approximating the matrix-vector product of $A$ where the approximation error is additive. They also consider a wide range of $f$, including polynomials and other kernels, but the input to $f$ is always the squared $\ell_2$ distance. In contrast, we also present exact algorithms, i.e., with no approximation errors. For example, one of our main upper bounds is an exact algorithm when $A_{i,j} = \|x_i - x_j\|_1$ (see Table 1 for the full list).
Since it is possible to approximately embed the $\ell_1$ distance into $\ell_2^2$, their methods could be used to derive approximate algorithms for $\ell_1$, but not exact ones. Furthermore, we also study a wide variety of other distance functions, such as $\ell_\infty$ and $\ell_p^p$ (and others listed in Table 1), which are not studied in Alman et al. In terms of technique, the main upper bound technique of Alman et al. is to expand $f(\|x_i - x_j\|_2^2)$ and approximate the resulting quantity via a polynomial. This is related to our upper bound results for $\ell_p^p$ for even $p$, where we also use polynomials. However, our results are exact, while theirs are approximate. Our $\ell_1$ upper bound technique is orthogonal to the polynomial approximation techniques used in Alman et al. We also employ polynomial techniques to give upper bounds for the approximate $\ell_\infty$ distance function, which is not studied in Alman et al. Lastly, Alman et al. also focus on the Laplacian matrix of the weighted graph represented by the distance matrix, studying problems such as spectral sparsification and Laplacian system solving. In contrast, we study different problems, including low-rank approximations, eigenvalue estimation, and the task of initializing an approximate distance matrix. We do not consider the distance matrix as a graph or consider the associated Laplacian matrix.

It is also easy to verify the "folklore" fact that for a Gram matrix $AA^\top$ with $A \in \mathbb{R}^{n \times d}$, we can compute $AA^\top v$ in time $O(nd)$ by computing $A^\top v$ first and then $A(A^\top v)$. Our upper bound for the $\ell_2^2$ function can be reduced to this folklore fact by noting that $\|x - y\|_2^2 = \|x\|_2^2 + \|y\|_2^2 - 2\langle x, y \rangle$. Thus the $\ell_2^2$ matrix can be decomposed into two rank-one components, due to the terms $\|x\|_2^2$ and $\|y\|_2^2$, and a Gram matrix, due to the term $\langle x, y \rangle$. This decomposition of the $\ell_2^2$ matrix is well known (see Section 2 in [DPRV15]). Hence, a matrix-vector query for the $\ell_2^2$ matrix easily reduces to the Gram matrix case. Nevertheless, we explicitly state the $\ell_2^2$ upper bound for completeness, since we also consider all $\ell_p^p$ functions for any integer $p \ge 1$.

**Polynomial Kernels.** There have also been works on faster algorithms for approximating a kernel matrix $K$, defined as the $n \times n$ matrix with entries $K_{i,j} = k(x_i, x_j)$ for a kernel function $k$. Specifically for the polynomial kernel $k(x_i, x_j) = \langle x_i, x_j \rangle^p$, recent works such as [ANW14, AKK+20, WZ20, SWYZ21a] have shown how to find a sketch $K'$ of $K$ which approximately satisfies $\|K'z\|_2 \approx \|Kz\|_2$ for all $z$. In contrast, we can exactly simulate the matrix-vector product $Kz$. Our runtime is $O(nd^p)$, which has a linear dependence on $n$ but an exponential dependence on $p$, while the aforementioned works have at least a quadratic dependence on $n$ but a polynomial dependence on $p$. Thus our results are mostly applicable to the setting where our dataset is large, i.e., $n \gg d$, and $p$ is a small constant. For example, $p = 2$ is a common choice in practice [CHC+10]. Algorithms with polynomial dependence on $d$ and $p$ but quadratic dependence on $n$ are suited for smaller datasets which have very large $d$ and large $p$. Note that a large $p$ might arise if one approximates a non-polynomial kernel using a polynomial kernel via a Taylor expansion. We refer to the references within [ANW14, AKK+20, WZ20, SWYZ21a] for additional related work. There is also work on kernel density estimation (KDE) data structures which, upon query $y$, allow for estimation of the sum $\sum_{x \in X} k(x, y)$ in time sublinear in $|X|$ after some preprocessing on the dataset $X$.
For widely used kernels such as the Gaussian and Laplacian kernels, KDE data structures were used in [BIMW21] to create a matrix-vector query algorithm for kernel matrices in time subquadratic in $|X|$ for input vectors which are entry-wise non-negative. We refer the reader to [CS17, BCIS18, SRB+19, BIW19, CKNS20] and references within for prior works on KDE data structures.

## 2 Faster Matrix-Vector Product Queries for $\ell_1$

We derive faster matrix-vector queries for distance matrices for a wide array of distance metrics. First we consider the case of the $\ell_1$ metric, where $A_{i,j} = f(x_i, x_j)$ with $f(x, y) = \|x - y\|_1 = \sum_{i=1}^d |x_i - y_i|$.

    Algorithm 1: Preprocessing
     1: Input: Dataset X ⊆ R^d
     2: procedure PREPROCESSING
     3:   for i ∈ [d] do
     4:     T_i ← sorted array of the ith coordinates of all x ∈ X
     5:   end for
     6: end procedure

    Algorithm 2: Matrix-vector query for p = 1
     1: Input: Dataset X ⊆ R^d
     2: Output: z = Ay
     3: procedure QUERY({T_i}_{i ∈ [d]}, y)
     4:   y_1, ..., y_n ← coordinates of y
     5:   Associate every x_i ∈ X with the scalar y_i
     6:   for i ∈ [d] do
     7:     Compute two arrays B_i, C_i as follows:
     8:     B_i contains the partial sums of y_j x_j(i), computed in the order induced by T_i
     9:     C_i contains the partial sums of y_j, computed in the order induced by T_i
    10:   end for
    11:   z ← 0^n
    12:   for k ∈ [n] do
    13:     for i ∈ [d] do
    14:       q ← position of x_k(i) in the order of T_i
    15:       S_1 ← B_i[q]
    16:       S_2 ← B_i[n] − B_i[q]
    17:       S_3 ← C_i[q]
    18:       S_4 ← C_i[n] − C_i[q]
    19:       z(k) += x_k(i) · (S_3 − S_4) + S_2 − S_1
    20:     end for
    21:   end for
    22: end procedure

We first analyze the correctness of Algorithm 2.

**Theorem 2.1.** Let $A_{i,j} = \|x_i - x_j\|_1$. Algorithm 2 computes $Ay$ exactly.

*Proof.* Consider any coordinate $k \in [n]$. We show that $(Ay)_k$ is computed exactly. We have
$$(Ay)_k = \sum_{j=1}^n y_j \|x_k - x_j\|_1 = \sum_{j=1}^n \sum_{i=1}^d y_j |x_k(i) - x_j(i)| = \sum_{i=1}^d \sum_{j=1}^n y_j |x_k(i) - x_j(i)|.$$
Let $\pi_i$ denote the order of $[n]$ induced by $T_i$. We have
$$\sum_{j=1}^n y_j |x_k(i) - x_j(i)| = \sum_{j:\, \pi_i(k) \le \pi_i(j)} y_j (x_j(i) - x_k(i)) + \sum_{j:\, \pi_i(k) > \pi_i(j)} y_j (x_k(i) - x_j(i)).$$
We now consider the inner sum. It rearranges to the following:
$$x_k(i) \left( \sum_{j:\, \pi_i(k) > \pi_i(j)} y_j - \sum_{j:\, \pi_i(k) \le \pi_i(j)} y_j \right) + \sum_{j:\, \pi_i(k) \le \pi_i(j)} y_j x_j(i) - \sum_{j:\, \pi_i(k) > \pi_i(j)} y_j x_j(i) = x_k(i)(S_3 - S_4) + S_2 - S_1,$$
where $S_1, S_2, S_3$, and $S_4$ are defined in lines 15-18 of Algorithm 2 and the last equality follows from the definition of the arrays $B_i$ and $C_i$ (the boundary term $j = k$ is grouped differently in $S_1, S_3$, but its contributions cancel). Summing over all $i \in [d]$ gives us the desired result. ∎

The following theorem readily follows.

**Theorem 2.2.** Suppose we are given a dataset $\{x_1, \ldots, x_n\}$ which implicitly defines the distance matrix $A_{i,j} = \|x_i - x_j\|_1$. Given a query $y \in \mathbb{R}^n$, we can compute $Ay$ exactly in $O(nd)$ query time. We also require a one-time $O(nd \log n)$ preprocessing step, which can be reused for all queries.

## 3 Lower and Upper Bounds for $\ell_\infty$

In this section we give a proof of Theorem 1.2. Specifically, we give a reduction from the so-called Orthogonal Vectors Problem (OVP) [Wil05] to the problem of computing the matrix-vector product $Az$, where $A_{i,j} = \|x_i - x_j\|_\infty$, for a given set of points $X = \{x_1, \ldots, x_n\}$. The orthogonal vectors problem is defined as follows: given two sets of vectors $A = \{a_1, \ldots, a_n\}$ and $B = \{b_1, \ldots, b_n\}$, $A, B \subseteq \{0, 1\}^d$, $|A| = |B| = n$, determine whether there exist $x \in A$ and $y \in B$ such that the dot product $x \cdot y = \sum_{j=1}^d x_j y_j$ (taken over the reals) is equal to $0$. It is known that if OVP can be solved in strongly subquadratic time $O(n^{2-\delta})$ for any constant $\delta > 0$ and $d = \omega(\log n)$, then SETH is false. Thus, an efficient reduction from OVP to the matrix-vector product problem yields Theorem 1.2.

**Lemma 3.1.** If the matrix-vector product problem for $\ell_\infty$ distance matrices induced by $n$ vectors of dimension $d$ can be solved in time $T(n, d)$, then OVP (with the same parameters) can be solved in time $O(T(n, d))$.
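Before turning to the proof of Lemma 3.1, we note that Algorithms 1 and 2 translate directly into a short, runnable NumPy routine. The sketch below is our own transcription (variable names ours); it is not the Numba-accelerated implementation evaluated in Section 4.

```python
import numpy as np

def l1_matvec(X, y):
    """Exact A @ y for A_ij = ||x_i - x_j||_1, following Algorithms 1 and 2.
    For clarity the per-coordinate sort (Algorithm 1) is redone inside the
    query; caching the sorted orders across queries gives O(nd log n)
    preprocessing and O(nd) per query, as in Theorem 2.2."""
    n, d = X.shape
    z = np.zeros(n)
    for i in range(d):
        order = np.argsort(X[:, i], kind="stable")   # T_i: sort by coordinate i
        xs, ys = X[order, i], y[order]
        B = np.cumsum(ys * xs)                        # prefix sums of y_j * x_j(i)
        C = np.cumsum(ys)                             # prefix sums of y_j
        q = np.empty(n, dtype=int)
        q[order] = np.arange(n)                       # position of each point in T_i
        S1, S3 = B[q], C[q]                           # sums over positions <= q
        S2, S4 = B[-1] - S1, C[-1] - S3               # sums over positions  > q
        z += X[:, i] * (S3 - S4) + S2 - S1            # line 19 of Algorithm 2
    return z

# Sanity check against the explicitly formed distance matrix.
rng = np.random.default_rng(0)
X, y = rng.standard_normal((300, 8)), rng.standard_normal(300)
A = np.abs(X[:, None, :] - X[None, :, :]).sum(axis=-1)
print(np.allclose(A @ y, l1_matvec(X, y)))            # True
```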
*Proof.* Define two functions $f, g : \{0, 1\} \to [0, 1]$ such that $f(0) = g(0) = 1/2$, $f(1) = 0$, and $g(1) = 1$. We extend both functions to vectors by applying $f$ and $g$ coordinate-wise, and to sets by letting $f(\{a_1, \ldots, a_n\}) = \{f(a_1), \ldots, f(a_n)\}$; the function $g$ is extended in the same way for $B$. Observe that, for any pair of non-zero vectors $a, b \in \{0, 1\}^d$, we have $\|f(a) - g(b)\|_\infty = 1$ if and only if $a \cdot b > 0$, and $\|f(a) - g(b)\|_\infty = 1/2$ otherwise.

Consider two sets of binary vectors $A$ and $B$. Without loss of generality we can assume that the vectors are non-zero, since otherwise the problem is trivial. Define three distance matrices: the matrix $M_A$ defined by the set $f(A)$, the matrix $M_B$ defined by the set $g(B)$, and $M_{AB}$ defined by the set $f(A) \cup g(B)$. Furthermore, let $M$ be the cross-distance matrix, such that $M_{i,j} = \|f(a_i) - g(b_j)\|_\infty$. Observe that the matrix $M_{AB}$ contains the blocks $M_A$ and $M_B$ on its diagonal, and the blocks $M$ and $M^\top$ off-diagonal. Thus, $\mathbf{1}^\top M_{AB} \mathbf{1} = \mathbf{1}^\top M_A \mathbf{1} + \mathbf{1}^\top M_B \mathbf{1} + 2 \cdot \mathbf{1}^\top M \mathbf{1}$, where $\mathbf{1}$ denotes an all-ones vector of the appropriate dimension. Since $\mathbf{1}^\top M \mathbf{1} = (\mathbf{1}^\top M_{AB} \mathbf{1} - \mathbf{1}^\top M_A \mathbf{1} - \mathbf{1}^\top M_B \mathbf{1})/2$, we can calculate $\mathbf{1}^\top M \mathbf{1}$ in time $O(T(n, d))$. Since all entries of $M$ are either $1$ or $1/2$, we have that $\mathbf{1}^\top M \mathbf{1} < n^2$ if and only if there is an entry $M_{i,j} = 1/2$. However, this only occurs if $a_i \cdot b_j = 0$. ∎

### 3.1 Approximate $\ell_\infty$ Matrix-Vector Queries

In light of the lower bound given above, we consider initializing approximate matrix-vector queries for the $\ell_\infty$ function. Note that the lower bound holds for points in $\{0, 1, 2\}^d$, and thus it is natural to consider approximate upper bounds for the case of a limited alphabet.

**Binary case.** We first consider the case that all points $x \in X$ are from $\{0, 1\}^d$. We first claim the existence of a polynomial $T$ with the following properties; indeed, standard Chebyshev polynomials satisfy the following lemma (see, e.g., Chapter 2 in [SV+14]).

**Lemma 3.2.** There exists a polynomial $T : \mathbb{R} \to \mathbb{R}$ of degree $O(\sqrt{d} \log(1/\epsilon))$ such that $T(0) = 0$ and $|T(x) - 1| \le \epsilon$ for all $x \in [1/d, 1]$.

Now note that $\|x - y\|_\infty$ can only take on two values, $0$ or $1$. Furthermore, $\|x - y\|_\infty = 0$ if and only if $\|x - y\|_2^2 = 0$, and $\|x - y\|_\infty = 1$ if and only if $\|x - y\|_2^2 \ge 1$. Therefore, $\|x - y\|_\infty = 0$ if and only if $T(\|x - y\|_2^2 / d) = 0$, and $\|x - y\|_\infty = 1$ if and only if $|T(\|x - y\|_2^2 / d) - 1| \le \epsilon$. Thus, we have that
$$|A_{i,j} - T(\|x_i - x_j\|_2^2 / d)| = \big|\, \|x_i - x_j\|_\infty - T(\|x_i - x_j\|_2^2 / d) \,\big| \le \epsilon$$
for all entries $A_{i,j}$ of $A$. Note that $T(\|x - y\|_2^2 / d)$ is a polynomial with $O((2d)^t)$ monomials in the variables $x(1), \ldots, x(d)$, where $t = O(\sqrt{d} \log(1/\epsilon))$ is the degree of $T$. Consider the matrix $B$ satisfying $B_{i,j} = T(\|x_i - x_j\|_2^2 / d)$. Using the same ideas as our upper bound results for $f(x, y) = \langle x, y \rangle^p$, it is straightforward to calculate the matrix-vector product $By$ (see Section A.2). To summarize, for each $k \in [n]$, we write the $k$th coordinate of $By$ as a polynomial in the $d$ coordinates of $x_k$. This polynomial has $O((2d)^t)$ monomials and can be constructed in $O(n(2d)^t)$ time. Once constructed, we can evaluate the polynomial at $x_1, \ldots, x_n$ to obtain all $n$ coordinates of $By$. Each evaluation requires $O((2d)^t)$ time, resulting in an overall time bound of $O(n(2d)^t)$.

**Theorem 3.3.** Let $A_{i,j} = \|x_i - x_j\|_\infty$. We can compute $By$ in time $O\big(n (2d)^{O(\sqrt{d} \log(1/\epsilon))}\big)$ where $\|A - B\|_\infty \le \epsilon$.

**Entries in $\{0, \ldots, M\}$.** We now consider the case that all points $x \in X$ are from $\{0, \ldots, M\}^d$. Our argument will be a generalization of the previous section. At a high level, our goal is to detect which of the $M + 1$ possible values in $\{0, \ldots, M\}$ is equal to the $\ell_\infty$ norm. To do so, we appeal to the prior section and design estimators which approximate the indicator functions $\mathbf{1}[\|x - y\|_\infty \ge i]$. By summing up these indicators, we can approximate $\|x - y\|_\infty$.
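For concreteness, the threshold polynomial of Lemma 3.2 can be realized by a scaled-and-shifted Chebyshev polynomial. The normalization in the sketch below is one standard construction and is our own choice (not necessarily the exact one used in the analysis); the script simply checks the two guarantees numerically.

```python
import numpy as np
from numpy.polynomial import chebyshev as C

def threshold_poly(d, eps):
    """Return T with T(0) = 0 and |T(x) - 1| <= eps on [1/d, 1], of degree
    s = O(sqrt(d) log(1/eps)), via T(x) = 1 - C_s(m(x)) / C_s(m(0)), where
    C_s is the degree-s Chebyshev polynomial and m maps [1/d, 1] onto [-1, 1]."""
    s = int(np.ceil(np.sqrt(d) * np.log(2.0 / eps))) + 1
    a = 2.0 / (1.0 - 1.0 / d)                  # m(x) = a * x + b
    b = -(1.0 + 1.0 / d) / (1.0 - 1.0 / d)     # so m(1/d) = -1, m(1) = 1, m(0) < -1
    coeffs = np.zeros(s + 1)
    coeffs[s] = 1.0                            # C_s in the Chebyshev basis
    denom = C.chebval(b, coeffs)               # C_s(m(0)); |denom| >= 1/eps
    return lambda x: 1.0 - C.chebval(a * np.asarray(x) + b, coeffs) / denom

d, eps = 100, 1e-3
T = threshold_poly(d, eps)
xs = np.linspace(1.0 / d, 1.0, 1000)
print(T(0.0))                                  # 0 (up to floating point error)
print(np.abs(T(xs) - 1.0).max())               # <= eps on [1/d, 1]
```

Applying such a $T$ entrywise to $\|x_i - x_j\|_2^2 / d$, expanded into monomials as described above, yields the matrix $B$ of Theorem 3.3.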
Our estimators will again be designed using Chebyshev polynomials. To motivate them, suppose that we want to detect whether $\|x - y\|_\infty \ge i$ or $\|x - y\|_\infty < i$. In the first case, some entry of $x - y$ will have absolute value at least $i$, whereas in the other case, all entries of $x - y$ will be bounded by $i - 1$ in absolute value. Thus if we can boost this "signal", we can apply a polynomial which performs thresholding to distinguish the two cases. This motivates considering the function $\|x - y\|_k^k$ for a larger power $k$. In particular, in the case that $\|x - y\|_\infty \ge i$, we have $\|x - y\|_k^k \ge i^k$, and otherwise $\|x - y\|_k^k \le d(i-1)^k$. By setting $k \approx \log(d)$, the first value is much larger than the latter, which we can detect using the threshold polynomials of the previous section. We now formalize our intuition. It is known that appropriately scaled Chebyshev polynomials satisfy the following guarantees (see, e.g., Chapter 2 in [SV+14]).

**Lemma 3.4.** There exists a polynomial $T : \mathbb{R} \to \mathbb{R}$ of degree $O(\sqrt{t} \log(t/\epsilon))$ such that $|T(x)| \le \epsilon/t$ for all $x \in [0, 1/(10t)]$ and $|T(x) - 1| \le \epsilon/t^2$ for all $x \in [1/t, 1]$.

Given $x, y \in \mathbb{R}^d$, our estimator will first try to detect if $\|x - y\|_\infty \ge i$. Let $T_1$ be a polynomial from Lemma 3.4 with $t = O(M^k)$ for $k = O(M \log(Md))$, assuming $k$ is even. Let $T_2$ be a polynomial from Lemma 3.4 with $t = O(d)$, of degree $O(\sqrt{d} \log(Md/\epsilon))$. Our estimator will be
$$T_2\!\left( \frac{1}{d} \sum_{j=1}^d T_1\!\left( \frac{(x(j) - y(j))^k}{i^k M^k} \right) \right).$$
If coordinate $j$ is such that $|x(j) - y(j)| \ge i$, then
$$\frac{(x(j) - y(j))^k}{i^k M^k} \ge \frac{1}{M^k}$$
and so $T_1$ will evaluate to a value very close to $1$. Otherwise, we know that
$$\frac{(x(j) - y(j))^k}{i^k M^k} \le \frac{(i-1)^k}{i^k M^k} = \frac{1}{M^k}\left(1 - \frac{1}{i}\right)^k \le \frac{1}{M^k} \cdot \frac{1}{\mathrm{poly}(M, d)}$$
by our choice of $k$, which means that $T_1$ will evaluate to a value close to $0$. Formally, $\frac{1}{d} \sum_{j=1}^d T_1\big( (x(j) - y(j))^k / (i^k M^k) \big)$ will be at least $1/d$ if there is a $j \in [d]$ with $|x(j) - y(j)| \ge i$, and otherwise will be at most $1/(10d)$. By our choice of $T_2$, the overall estimator outputs a value of at least $1 - \epsilon$ in the first case and a value of at most $\epsilon$ in the second case. The polynomial which is the composition of $T_2$ with $T_1$ has $O\big((dk \deg(T_1))^{\deg(T_2)}\big) = (dM)^{O(M \sqrt{d} \log(Md/\epsilon))}$ monomials, if we consider the expression as a polynomial in the variables $x(1), \ldots, x(d)$. Our final estimator will be the sum of these estimators across all $i \ge 1$. Following our upper bound techniques for matrix-vector products for polynomials, e.g., in Section A.2, and as outlined in the prior section, we get the following overall query time:

**Theorem 3.5.** Suppose we are given $X = \{x_1, \ldots, x_n\} \subseteq \{0, \ldots, M\}^d$ which implicitly defines the matrix $A_{i,j} = \|x_i - x_j\|_\infty$. For any query $y$, we can compute $By$ in time $n \cdot (dM)^{O(M \sqrt{d} \log(Md/\epsilon))}$ where $\|A - B\|_\infty \le \epsilon$.

## 4 Empirical Evaluation

We perform an empirical evaluation of our matrix-vector query for the $\ell_1$ distance function. We chose to implement our $\ell_1$ upper bound since it is a clean algorithm which possesses many of the same underlying algorithmic ideas as some of our other upper bound results. We envision that similar empirical results hold for most of our upper bounds in Table 1. Furthermore, matrix-vector queries are the dominating subroutine in many key practical linear algebra algorithms, such as the power method for eigenvalue estimation or iterative methods for linear regression: a fast matrix-vector query runtime automatically translates to faster algorithms for downstream applications.

| Dataset | $(n, d)$ | Algo. | Preprocessing Time | Avg. Query Time |
|---|---|---|---|---|
| Gaussian Mixture | $(5 \times 10^4, 50)$ | Naive | 453.7 s | 43.3 s |
| | | Ours | 0.55 s | 0.09 s |
| MNIST | $(5 \times 10^4, 784)$ | Naive | 2672.5 s | 38.6 s |
| | | Ours | 5.5 s | 1.9 s |
| GloVe | $(1.2 \times 10^6, 50)$ | Naive | -- | 2.6 days (estimated) |
| | | Ours | 16.8 s | 3.4 s |

Table 3: Dataset description and empirical results.
$(n, d)$ denotes the number of points and the dimension of the dataset, respectively. Query times are averaged over 10 trials with Gaussian vectors as queries.

**Experimental Design.** We chose two real and one synthetic dataset for our experiments. We have two "small" datasets and one "large" dataset. The two small datasets have $5 \times 10^4$ points, whereas the large dataset has approximately $10^6$ points. The first dataset is points drawn from a mixture of three spherical Gaussians in $\mathbb{R}^{50}$. The second dataset is the standard MNIST dataset [LeC98], and finally, our large dataset is GloVe word embeddings² in $\mathbb{R}^{50}$ [PSM14]. The two small datasets are small enough that one can feasibly initialize the full $n \times n$ distance matrix in memory in reasonable time. A $5 \times 10^4 \times 5 \times 10^4$ matrix with each entry stored using 32 bits requires 10 gigabytes of space. This is simply impossible for the GloVe dataset, as approximately 5.8 terabytes of space would be required to initialize the distance matrix (in contrast, the dataset itself only requires $< 0.3$ gigabytes to store).

²Can be accessed here: http://github.com/erikbern/ann-benchmarks/

The naive algorithm for the small datasets is the following: we initialize the full distance matrix (which counts towards preprocessing), and then we use the full distance matrix to perform a matrix-vector query. Note that having the full matrix to perform a matrix-vector product only helps the naive algorithm, since it can now take advantage of optimized linear algebra subroutines for matrix multiplication and does not need to explicitly calculate the matrix entries. Since we cannot initialize the full distance matrix for the large dataset, the naive algorithm in this case computes the matrix-vector product in a standalone fashion by generating the entries of the distance matrix on the fly. We compare the naive algorithm to our Algorithms 1 and 2. Our experiments are done on a 2021 M1 MacBook Pro with 32 gigabytes of RAM. We implement all algorithms in Python 3.9 using NumPy, with Numba acceleration to speed up all algorithms whenever possible.

**Results.** Results are shown in Table 3. We show the preprocessing and query times for both the naive algorithm and our algorithm in seconds. The query time is averaged over 10 trials using Gaussian vectors as queries. For the GloVe dataset, it was infeasible to calculate even a single matrix-vector product, even using fast Numba-accelerated code. We thus estimated the full query time by calculating the time on a subset of $5 \times 10^4$ points of the GloVe dataset and extrapolating to the full dataset by multiplying the query time by $(n/(5 \times 10^4))^2$, where $n$ is the total number of points. We see that in all cases, our algorithm outperforms the naive algorithm in both preprocessing time and query time, and the gains become increasingly substantial as the dataset size increases, as predicted by our theoretical results.

**Acknowledgements.** This research was supported by the NSF TRIPODS program (award DMS-2022448), a Simons Investigator Award, the MIT-IBM Watson AI Lab, a GIST-MIT Research Collaboration grant, and an NSF Graduate Research Fellowship under Grant No. 1745302.

## References

[ACSS20] Josh Alman, Timothy Chu, Aaron Schild, and Zhao Song. Algorithms and hardness for linear algebra on geometric graphs. In Sandy Irani, editor, 61st IEEE Annual Symposium on Foundations of Computer Science, FOCS 2020, Durham, NC, USA, November 16-19, 2020, pages 541-552. IEEE, 2020.

[AG11] Uri M. Ascher and Chen Greif. A First Course on Numerical Methods. SIAM, 2011.
[AKK+20] Thomas D. Ahle, Michael Kapralov, Jakob B. T. Knudsen, Rasmus Pagh, Ameya Velingker, David P. Woodruff, and Amir Zandieh. Oblivious sketching of high-degree polynomial kernels. In Proceedings of the Fourteenth Annual ACM-SIAM Symposium on Discrete Algorithms, pages 141-160. SIAM, 2020.

[ANW14] Haim Avron, Huy Nguyen, and David Woodruff. Subspace embeddings for the polynomial kernel. Advances in Neural Information Processing Systems, 27, 2014.

[AW21] Josh Alman and Virginia Vassilevska Williams. A refined laser method and faster matrix multiplication. In Proceedings of the 2021 ACM-SIAM Symposium on Discrete Algorithms (SODA), pages 522-539. SIAM, 2021.

[BCIS18] Arturs Backurs, Moses Charikar, Piotr Indyk, and Paris Siminelakis. Efficient density evaluation for smooth kernels. In 2018 IEEE 59th Annual Symposium on Foundations of Computer Science (FOCS), pages 615-626, 2018.

[BCW20] Ainesh Bakshi, Nadiia Chepurko, and David P. Woodruff. Robust and sample optimal algorithms for PSD low rank approximation. In 2020 IEEE 61st Annual Symposium on Foundations of Computer Science (FOCS), pages 506-516. IEEE, 2020.

[BCW22] Ainesh Bakshi, Kenneth L. Clarkson, and David P. Woodruff. Low-rank approximation with $1/\epsilon^{1/3}$ matrix-vector products. arXiv preprint arXiv:2202.05120, 2022.

[BHSW20] Mark Braverman, Elad Hazan, Max Simchowitz, and Blake Woodworth. The gradient complexity of linear regression. In Conference on Learning Theory, pages 627-647. PMLR, 2020.

[BIMW21] Arturs Backurs, Piotr Indyk, Cameron Musco, and Tal Wagner. Faster kernel matrix algebra via density estimation. In Proceedings of the 38th International Conference on Machine Learning, pages 500-510, 2021.

[BIW19] Arturs Backurs, Piotr Indyk, and Tal Wagner. Space and time efficient kernel density estimation in high dimensions. In Advances in Neural Information Processing Systems 32: Annual Conference on Neural Information Processing Systems, NeurIPS, pages 15773-15782, 2019.

[BW18] Ainesh Bakshi and David Woodruff. Sublinear time low-rank approximation of distance matrices. Advances in Neural Information Processing Systems, 31, 2018.

[Can20] Clément L. Canonne. A survey on distribution testing: Your data is big. But is it blue? Theory of Computing, pages 1-100, 2020.

[CC08] Michael A. A. Cox and Trevor F. Cox. Multidimensional scaling. In Handbook of Data Visualization, pages 315-347. Springer, 2008.

[CHC+10] Yin-Wen Chang, Cho-Jui Hsieh, Kai-Wei Chang, Michael Ringgaard, and Chih-Jen Lin. Training and testing low-degree polynomial data mappings via linear SVM. Journal of Machine Learning Research, 11(4), 2010.

[CHL21] Andrew M. Childs, Shih-Han Hung, and Tongyang Li. Quantum query complexity with matrix-vector products. arXiv preprint arXiv:2102.11349, 2021.

[CKNS20] Moses Charikar, Michael Kapralov, Navid Nouri, and Paris Siminelakis. Kernel density estimation through density constrained near neighbor search. In 2020 IEEE 61st Annual Symposium on Foundations of Computer Science (FOCS), pages 172-183, 2020.

[CS17] Moses Charikar and Paris Siminelakis. Hashing-based estimators for kernel density in high dimensions. In Chris Umans, editor, 58th IEEE Annual Symposium on Foundations of Computer Science, FOCS, pages 1032-1043, 2017.

[DPRV15] Ivan Dokmanic, Reza Parhizkar, Juri Ranieri, and Martin Vetterli. Euclidean distance matrices: essential theory, algorithms, and applications. IEEE Signal Processing Magazine, 32(6):12-30, 2015.

[EK12] Yonina C. Eldar and Gitta Kutyniok. Compressed Sensing: Theory and Applications. Cambridge University Press, 2012.
[ES20] Rogers Epstein and Sandeep Silwal. Property testing of LP-type problems. In 47th International Colloquium on Automata, Languages, and Programming (ICALP 2020). Schloss Dagstuhl-Leibniz-Zentrum für Informatik, 2020.

[Gol17] Oded Goldreich. Introduction to Property Testing. Cambridge University Press, 2017.

[HS93] Liisa Holm and Chris Sander. Protein structure comparison by alignment of distance matrices. Journal of Molecular Biology, 233(1):123-138, 1993.

[ILLP04] Piotr Indyk, Moshe Lewenstein, Ohad Lipsky, and Ely Porat. Closest pair problems in very high dimensions. In International Colloquium on Automata, Languages, and Programming, pages 782-792. Springer, 2004.

[IP01] Russell Impagliazzo and Ramamohan Paturi. On the complexity of k-SAT. Journal of Computer and System Sciences, 62(2):367-375, 2001.

[IPZ01] Russell Impagliazzo, Ramamohan Paturi, and Francis Zane. Which problems have strongly exponential complexity? Journal of Computer and System Sciences, 63(4):512-530, 2001.

[IRW17] Piotr Indyk, Ilya Razenshteyn, and Tal Wagner. Practical data-dependent metric compression with provable guarantees. Advances in Neural Information Processing Systems, 30, 2017.

[IVWW19] Piotr Indyk, Ali Vakilian, Tal Wagner, and David P. Woodruff. Sample-optimal low-rank approximation of distance matrices. In Conference on Learning Theory, pages 1723-1751. PMLR, 2019.

[JL84] W. Johnson and J. Lindenstrauss. Extensions of Lipschitz maps into a Hilbert space. Contemporary Mathematics, 26:189-206, 1984.

[Kru64] Joseph B. Kruskal. Multidimensional scaling by optimizing goodness of fit to a nonmetric hypothesis. Psychometrika, 29(1):1-27, 1964.

[Kru78] Joseph B. Kruskal. Multidimensional Scaling. Number 11. Sage, 1978.

[Kuc09] Marek Kuczma. An Introduction to the Theory of Functional Equations and Inequalities: Cauchy's Equation and Jensen's Inequality. Springer Science & Business Media, 2009.

[Lan50] Cornelius Lanczos. An iteration method for the solution of the eigenvalue problem of linear differential and integral operators. 1950.

[LeC98] Yann LeCun. The MNIST database of handwritten digits. http://yann.lecun.com/exdb/mnist/, 1998.

[LN17] Kasper Green Larsen and Jelani Nelson. Optimality of the Johnson-Lindenstrauss lemma. In 2017 IEEE 58th Annual Symposium on Foundations of Computer Science (FOCS), pages 633-638. IEEE, 2017.

[LSZ21] Troy Lee, Miklos Santha, and Shengyu Zhang. Quantum algorithms for graph problems with cut queries. In Proceedings of the 2021 ACM-SIAM Symposium on Discrete Algorithms (SODA), pages 939-958. SIAM, 2021.

[MM15] Cameron Musco and Christopher Musco. Randomized block Krylov methods for stronger and faster approximate singular value decomposition. Advances in Neural Information Processing Systems, 28, 2015.

[MMMW21] Raphael A. Meyer, Cameron Musco, Christopher Musco, and David P. Woodruff. Hutch++: Optimal stochastic trace estimation. In Symposium on Simplicity in Algorithms (SOSA), pages 142-155. SIAM, 2021.

[PSM14] Jeffrey Pennington, Richard Socher, and Christopher D. Manning. GloVe: Global vectors for word representation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 1532-1543, 2014.

[RWZ20] Cyrus Rashtchian, David P. Woodruff, and Hanlin Zhu. Vector-matrix-vector queries for solving linear algebra, statistics, and graph problems. arXiv preprint arXiv:2006.14015, 2020.

[S+94] Jonathan Richard Shewchuk et al. An introduction to the conjugate gradient method without the agonizing pain, 1994.
[SRB+19] Paris Siminelakis, Kexin Rong, Peter Bailis, Moses Charikar, and Philip Alexander Levis. Rehashing kernel evaluation in high dimensions. In Proceedings of the 36th International Conference on Machine Learning, ICML, pages 5789-5798, 2019.

[SV+14] Sushant Sachdeva, Nisheeth K. Vishnoi, et al. Faster algorithms via approximation theory. Foundations and Trends in Theoretical Computer Science, 9(2):125-210, 2014.

[SWYZ21a] Zhao Song, David Woodruff, Zheng Yu, and Lichen Zhang. Fast sketching of polynomial kernels of polynomial degree. In International Conference on Machine Learning, pages 9812-9823. PMLR, 2021.

[SWYZ21b] Xiaoming Sun, David P. Woodruff, Guang Yang, and Jialin Zhang. Querying a matrix through matrix-vector products. ACM Transactions on Algorithms (TALG), 17(4):1-19, 2021.

[SY07] Anthony Man-Cho So and Yinyu Ye. Theory of semidefinite programming for sensor network localization. Mathematical Programming, 109(2):367-384, 2007.

[TSL00] Joshua B. Tenenbaum, Vin de Silva, and John C. Langford. A global geometric framework for nonlinear dimensionality reduction. Science, 290(5500):2319-2323, 2000.

[Wie86] Douglas Wiedemann. Solving sparse linear equations over finite fields. IEEE Transactions on Information Theory, 32(1):54-62, 1986.

[Wil05] Ryan Williams. A new algorithm for optimal 2-constraint satisfaction and its implications. Theoretical Computer Science, 348(2-3):357-365, 2005.

[Woo14] David P. Woodruff. Sketching as a tool for numerical linear algebra. arXiv preprint arXiv:1411.4357, 2014.

[WS06] Kilian Q. Weinberger and Lawrence K. Saul. Unsupervised learning of image manifolds by semidefinite programming. International Journal of Computer Vision, 70(1):77-90, 2006.

[WZ20] David Woodruff and Amir Zandieh. Near input sparsity time kernel embeddings via adaptive sampling. In International Conference on Machine Learning, pages 10324-10333. PMLR, 2020.

## Checklist

1. For all authors...
   (a) Do the main claims made in the abstract and introduction accurately reflect the paper's contributions and scope? [Yes]
   (b) Did you describe the limitations of your work? [Yes] We include formal upper and lower bounds which naturally state the applicability and limitations of our theorems.
   (c) Did you discuss any potential negative societal impacts of your work? [N/A]
   (d) Have you read the ethics review guidelines and ensured that your paper conforms to them? [Yes]
2. If you are including theoretical results...
   (a) Did you state the full set of assumptions of all theoretical results? [Yes]
   (b) Did you include complete proofs of all theoretical results? [Yes] All proofs are included in the main body or in the Appendix.
3. If you ran experiments...
   (a) Did you include the code, data, and instructions needed to reproduce the main experimental results (either in the supplemental material or as a URL)? [Yes] The code is included in the supplementary.
   (b) Did you specify all the training details (e.g., data splits, hyperparameters, how they were chosen)? [Yes]
   (c) Did you report error bars (e.g., with respect to the random seed after running experiments multiple times)? [N/A]
   (d) Did you include the total amount of compute and the type of resources used (e.g., type of GPUs, internal cluster, or cloud provider)? [Yes]
4. If you are using existing assets (e.g., code, data, models) or curating/releasing new assets...
   (a) If your work uses existing assets, did you cite the creators? [Yes]
   (b) Did you mention the license of the assets? [N/A]
   (c) Did you include any new assets either in the supplemental material or as a URL? [N/A]
   (d) Did you discuss whether and how consent was obtained from people whose data you're using/curating? [N/A]
   (e) Did you discuss whether the data you are using/curating contains personally identifiable information or offensive content? [N/A]
5. If you used crowdsourcing or conducted research with human subjects...
   (a) Did you include the full text of instructions given to participants and screenshots, if applicable? [N/A]
   (b) Did you describe any potential participant risks, with links to Institutional Review Board (IRB) approvals, if applicable? [N/A]
   (c) Did you include the estimated hourly wage paid to participants and the total amount spent on participant compensation? [N/A]