Faster Kernel Matrix Algebra via Density Estimation

Arturs Backurs *1, Piotr Indyk *2, Cameron Musco *3, Tal Wagner *4

Abstract

We study fast algorithms for computing fundamental properties of a positive semidefinite kernel matrix $K \in \mathbb{R}^{n \times n}$ corresponding to $n$ points $x_1, \dots, x_n \in \mathbb{R}^d$. In particular, we consider estimating the sum of kernel matrix entries, along with its top eigenvalue and eigenvector. We show that the sum of matrix entries can be estimated to $1+\epsilon$ relative error in time sublinear in $n$ and linear in $d$ for many popular kernels, including the Gaussian, exponential, and rational quadratic. For these kernels, we also show that the top eigenvalue (and an approximate eigenvector) can be approximated to $1+\epsilon$ relative error in time subquadratic in $n$ and linear in $d$. Our results represent significant advances in the best known runtimes for these problems. They leverage the positive definiteness of the kernel matrix, along with a recent line of work on efficient kernel density estimation.

*Equal contribution. 1 Toyota Technological Institute at Chicago, Chicago, IL, USA. 2 Massachusetts Institute of Technology, Cambridge, MA, USA. 3 University of Massachusetts Amherst, Amherst, MA, USA. 4 Microsoft Research Redmond, Redmond, WA, USA. Correspondence to: Arturs Backurs, Piotr Indyk, Cameron Musco, Tal Wagner.

Proceedings of the 38th International Conference on Machine Learning, PMLR 139, 2021. Copyright 2021 by the author(s).

1 This should be contrasted with distance functions that measure the dissimilarity between two vectors.

1. Introduction

Kernels are a ubiquitous notion in statistics, machine learning, and other fields. A kernel is a function $k: \mathbb{R}^d \times \mathbb{R}^d \to \mathbb{R}$ that measures the similarity¹ between two $d$-dimensional vectors. Many statistical and machine learning methods, such as support vector machines, kernel ridge regression and kernel density estimation, rely on appropriate choices of kernels. A prominent example of a kernel function is the Radial Basis Function, a.k.a. Gaussian kernel, defined as $k(x, y) = \exp(-\|x - y\|^2)$. Other popular choices include the Laplace kernel, exponential kernel, etc. See Shawe-Taylor et al. (2004); Hofmann et al. (2008) for an overview.

Kernel methods typically operate using a kernel matrix. Given $n$ vectors $x_1, \dots, x_n \in \mathbb{R}^d$, the kernel matrix $K \in \mathbb{R}^{n \times n}$ is defined as $K_{i,j} = k(x_i, x_j)$. For most popular kernels, e.g., the Gaussian kernel, $k$ is a positive definite function, and so $K$ is positive semidefinite (PSD). Furthermore, it is often the case that $K$'s entries are in the range $[0,1]$, with 1's on the diagonal; we assume this throughout.

Although popular, the main drawback of kernel methods is their efficiency. Most kernel-based algorithms have running times that are at least quadratic in $n$; in fact, many start by explicitly materializing the kernel matrix $K$ in the preprocessing stage. This quadratic runtime is likely necessary as long as exact (or high-precision) answers are desired. Consider perhaps the simplest kernel problem, where the goal is to compute the sum of matrix entries, i.e., $s(K) = \sum_{i,j} K_{i,j}$. It was shown in Backurs et al. (2017) that, for the Gaussian kernel, computing $s(K)$ up to $1+\epsilon$ relative error requires $n^{2-o(1)}$ time under the Strong Exponential Time Hypothesis (SETH), as long as the dimension $d$ is at least polylogarithmic in $n$ and $\epsilon = \exp(-\omega(\log^2 n))$. The same limitations were shown to apply to kernel support vector machines, kernel ridge regression, and other kernel problems.
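To make the problem concrete, here is a minimal sketch (ours, not from the paper) of the naive quadratic-time computation of $s(K)$ for the Gaussian kernel; this is the brute-force baseline that the hardness result above concerns and that the algorithms below avoid.

```python
import numpy as np

def gaussian_kernel_matrix(X):
    """Materialize K with K[i, j] = exp(-||x_i - x_j||^2) for rows x_i of X."""
    sq_dists = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    return np.exp(-sq_dists)

def kernel_sum_exact(X):
    """Quadratic-time computation of s(K) = sum_{i,j} K[i, j]."""
    return gaussian_kernel_matrix(X).sum()

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    X = rng.normal(size=(500, 10))   # n = 500 points in d = 10 dimensions
    print(kernel_sum_exact(X))       # Theta(n^2 d) time and Theta(n^2) memory
```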
Fortunately, the aforementioned lower bound does not preclude faster algorithms for larger values of $\epsilon$ (say, $\epsilon = \Theta(1)$). Over the last decade many such algorithms have been proposed. In our context, the most relevant ones are those solving kernel density evaluation (Charikar & Siminelakis, 2017; Backurs et al., 2018; 2019; Siminelakis et al., 2019; Charikar et al., 2020). Here, we are given two sets of vectors $X = \{x_1, \dots, x_m\}$ and $Y = \{y_1, \dots, y_m\}$, and the goal is to compute the values of $\bar{k}(y_i) = \frac{1}{m}\sum_j k(x_j, y_i)$ for $i = 1, \dots, m$. For the Gaussian kernel, the best known algorithm, due to Charikar et al. (2020), estimates these values to $1+\epsilon$ relative error in time $O(dm/(\mu^{0.173+o(1)}\epsilon^2))$, where $\mu$ is a lower bound on the densities $\bar{k}(y_i)$. Applying this algorithm directly to approximating the Gaussian kernel sum yields a runtime of roughly $O(dn^{1.173+o(1)}/\epsilon^2)$, since we can set $\mu = 1/n$ and still achieve a $(1+\epsilon)$ approximation, as $k(x_i, x_i) = 1$ for all $x_i$. It is a natural question whether this bound can be improved, and if progress on fast kernel density estimation can be extended to other fundamental kernel matrix problems.

Our results. In this paper we give much faster algorithms for approximating two fundamental quantities: the kernel matrix sum and the kernel matrix top eigenvector/eigenvalue. Consider a kernel matrix $K$ induced by $n$ points $x_1, \dots, x_n \in \mathbb{R}^d$ and a kernel function with values in $[0, 1]$ such that the matrix $K$ is PSD and has 1's on the diagonal. Furthermore, suppose that the kernel $k$ is supported by a kernel density evaluation algorithm with running time of the form $O(dm/(\mu^p\epsilon^2))$ for $m$ points, relative error $(1+\epsilon)$, and density lower bound $\mu$. Then we give:

1. An algorithm for $(1+\epsilon)$-approximating $s(K)$ in time:
$$O\!\left(d n^{\frac{2+5p}{4+2p}} \log^2(n) \big/ \epsilon^{\frac{8+6p}{2+p}}\right).$$
For many popular kernels the above runtime is sublinear in $n$; see Table 1. Our algorithm is based on subsampling $O(\sqrt{n})$ points from $x_1, \dots, x_n$ and then applying fast kernel density evaluation to these points. We complement the algorithm with a (very simple) lower bound showing that sampling $\Omega(\sqrt{n})$ points is necessary to estimate $s(K)$ up to a constant factor. This shows that our sampling complexity is optimal.

2. An algorithm that returns an approximate top eigenvector $z \in \mathbb{R}^n$ with $\|z\|_2 = 1$ and $z^T K z \ge (1-\epsilon)\cdot\lambda_1(K)$, where $\lambda_1(K)$ is $K$'s top eigenvalue, running in time:
$$O\!\left(d n^{1+p} \log^{2+p}(n/\epsilon) \big/ \epsilon^{7+4p}\right).$$
For many popular kernels, this runtime is subquadratic in $n$; see Table 1. This is the first subquadratic time algorithm for top eigenvalue approximation and a major improvement over forming the full kernel matrix. By a simple argument, $\Omega(dn)$ time is necessary even for constant factor approximation, and thus our algorithm is within an $O(n^p)$ factor of optimal. Our algorithm is also simple and practical, significantly outperforming baseline methods empirically; see Sec. 5.
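As a concrete instantiation of these two bounds (matching the first row of Table 1): for the Gaussian kernel, the KDE algorithm of Charikar et al. (2020) has exponent $p = 0.173 + o(1)$, so
$$n^{\frac{2+5p}{4+2p}} = n^{\frac{2 + 5\cdot 0.173}{4 + 2\cdot 0.173} + o(1)} \approx n^{0.66}
\qquad\text{and}\qquad
n^{1+p} = n^{1.173+o(1)},$$
giving a kernel sum algorithm running in time $O(dn^{0.66})$ and a top eigenvalue algorithm running in time $O(dn^{1.173+o(1)})$, up to polylogarithmic factors and for constant $\epsilon$.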
Application. An immediate application of our kernel sum algorithm is a faster algorithm for estimating the kernel alignment (Cristianini et al., 2002), a popular measure of similarity between kernel matrices. Given $K$ and $K'$, the alignment between $K$ and $K'$ is defined as
$$\hat{A}(K, K') = \frac{\langle K, K'\rangle}{\sqrt{\langle K, K\rangle\,\langle K', K'\rangle}},$$
where $\langle K, K'\rangle = \sum_{i,j} K_{i,j} K'_{i,j}$ is the inner product between the matrices $K$ and $K'$ interpreted as vectors. Our algorithm yields an efficient algorithm for estimating $\hat{A}(K, K')$ as long as the product kernels $K \cdot K$, $K' \cdot K'$ and $K \cdot K'$ are supported by fast kernel density evaluation algorithms as described earlier. This is the case for, e.g., the Gaussian or Laplace kernels.

Related work. The problem of evaluating kernel densities, especially for the Gaussian kernel, has been studied extensively. In addition to the recent randomized algorithms discussed in the introduction, there has been a considerable amount of work on algorithms in low dimensional spaces, including (Greengard & Strain, 1991; Yang et al., 2003; Lee et al., 2006; Lee & Gray, 2009; March et al., 2015). We present a streamlined version of the Fast Gauss Transform algorithm of Greengard & Strain (1991) in the appendix. In addition, there has been a considerable effort designing core-sets for this problem (Phillips & Tai, 2018a;b).

The sum of kernel values can be viewed as a similarity analog of the sum of pairwise distances in metric spaces. The latter quantity can be approximated in time linear in the number $n$ of points in the metric space (Indyk, 1999; Chechik et al., 2015; Cohen et al., 2018). Note that it is not possible to achieve an $o(n)$-time algorithm for this problem, as a single point can greatly affect the overall value. To the best of our knowledge, our algorithms for the kernel matrix sum are the first that achieve sublinear in $n$ running time and give $O(1)$-approximation for a nontrivial kernel problem.

Computing the top eigenvectors of a kernel matrix is a central problem: it is the primary operation behind kernel principal component analysis (Schölkopf et al., 1997). Projection onto these eigenvectors also yields an optimal low-rank approximation to the kernel matrix; such low-rank approximations are widely used to speed up kernel learning methods (Williams & Seeger, 2001; Fine & Scheinberg, 2001). Significant work has focused on fast low-rank approximation algorithms for kernel matrices or related distance matrices (Musco & Musco, 2017; Musco & Woodruff, 2017; Bakshi & Woodruff, 2018; Indyk et al., 2019; Bakshi et al., 2020). These algorithms have runtime scaling roughly linearly in $n$. However, they do not give any nontrivial approximation to the top eigenvalues or eigenvectors of the kernel matrix themselves, unless we assume that the matrix is near low-rank. To the best of our knowledge, prior to our work, no subquadratic time approximation algorithms for the top eigenvalue were known, even for $\Theta(1)$ approximation.

Our techniques: kernel sum. We start by noting that, via a Chernoff bound, a simple random sampling of kernel matrix entries provides the desired estimation in linear time.

Claim 1. For a positive definite kernel $k: \mathbb{R}^d \times \mathbb{R}^d \to [0, 1]$ with $k(x, x) = 1$ for all $x$, uniformly sample $t = O\!\left(\frac{n \log(1/\delta)}{\epsilon^2}\right)$ off-diagonal entries of $K$, $K_{i_1,j_1}, \dots, K_{i_t,j_t}$, and let $\tilde{s}(K) = n + \frac{n(n-1)}{t}\sum_{\ell=1}^{t} K_{i_\ell,j_\ell}$. Then with probability $1-\delta$, $\tilde{s}(K) \in (1\pm\epsilon)\cdot s(K)$.
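The following is a short sketch (ours) of the Claim 1 estimator for the Gaussian kernel: it samples $t$ off-diagonal entries uniformly at random, rescales their mean, and adds the diagonal contribution $n$. Its runtime is $O(td)$, i.e., linear in $n$ for constant $\epsilon$ and $\delta$.

```python
import numpy as np

def gaussian_k(x, y):
    """Gaussian kernel k(x, y) = exp(-||x - y||^2)."""
    return np.exp(-np.sum((x - y) ** 2))

def kernel_sum_claim1(X, t, rng):
    """Estimate s(K) = n + (off-diagonal sum) via t uniformly sampled entries."""
    n = X.shape[0]
    total = 0.0
    for _ in range(t):
        i, j = rng.choice(n, size=2, replace=False)   # uniform off-diagonal entry
        total += gaussian_k(X[i], X[j])
    return n + (n * (n - 1) / t) * total

if __name__ == "__main__":
    rng = np.random.default_rng(1)
    X = rng.normal(size=(2000, 5))
    print(kernel_sum_claim1(X, t=20000, rng=rng))
```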
Our goal is to do better, giving a $1\pm\epsilon$ approximation to $s(K)$ in sublinear time. We achieve this by (a) performing a more structured random sampling, i.e., sampling a principal submatrix as opposed to individual entries, and (b) providing an efficient algorithm for processing this submatrix.

Kernel | $k(x, y)$ | KDE algorithm | KDE runtime | Kernel Sum | Top Eigenvector
Gaussian | $e^{-\|x-y\|_2^2}$ | Charikar et al. (2020) | $dm/\mu^{0.173+o(1)}$ | $O(dn^{0.66})$ | $O(dn^{1.173+o(1)})$
Gaussian | | Greengard & Strain (1991) (see appendix) | $m \log(m)^{O(d)}$ | $n^{0.5}\log(n)^{O(d)}$ | $n\log(n)^{O(d)}$
Exponential | $e^{-\|x-y\|_2}$ | Charikar et al. (2020) | $dm/\mu^{0.1+o(1)}$ | $O(dn^{0.6})$ | $O(dn^{1.1+o(1)})$
Rational quadratic | $1/(1+\|x-y\|_2^2)^{\beta}$ | Backurs et al. (2018) | $dm$ | $O(dn^{0.5})$ | $O(dn)$
All of the above (lower bound) | | | | $\Omega(dn^{0.5})$ | $\Omega(dn)$

Table 1. Instantiations of our main results, giving sublinear time kernel sum approximation and subquadratic time top eigenvector approximation. All running times are up to polylogarithmic factors, assuming constant accuracy parameter $\epsilon$ and kernel parameter $\beta$. The KDE runtime depends on $m$, the number of query points, and $\mu$, a lower bound on the density for each query point.

Subsampling the matrix requires a more careful analysis of the variance of the estimator. To accomplish this, we use the fact that the kernel matrix is PSD, which implies that its mass cannot be too concentrated. For the second task, we use fast kernel density estimation algorithms, combined with random row sampling to reduce the running time.

Our techniques: top eigenvector. Our algorithm for top eigenvector approximation is a variant on the classic power method, with fast approximate matrix vector multiplication implemented through kernel density evaluation. Kernel density evaluation on a set of $n$ points with corresponding kernel matrix $K \in \mathbb{R}^{n \times n}$ can be viewed as approximating the vector $Kz \in \mathbb{R}^n$ where $z(i) = 1/n$ for all $i$. Building on this primitive, it is possible to implement approximate multiplication with general $z \in \mathbb{R}^n$ (Charikar & Siminelakis, 2017). We can then hope to leverage work on the noisy power method, which approximates the top eigenvectors of a matrix using just approximate matrix vector multiplications with that matrix (Hardt & Price, 2014). However, existing analysis assumes random noise on each matrix vector multiplication, which does not align with the top eigenvector. This cannot be guaranteed in our setting. Fortunately, we can leverage additional structure: if the kernel $k$ is nonnegative, then by the Perron-Frobenius theorem, $K$'s top eigenvector is entrywise non-negative. This ensures that, if our noise in approximating $Kz$ at each step of the power method is entrywise non-negative, then this noise will have non-negative dot product with the top eigenvector. We are able to guarantee this property, and show convergence of the method to an approximate top eigenvector, even when the error might align significantly with the top eigenvector.

2. Preliminaries

Throughout, we focus on nice kernels satisfying:

Definition 2 (Nice Kernel Function). A kernel function $k: \mathbb{R}^d \times \mathbb{R}^d \to [0, 1]$ is nice if it is positive definite and satisfies $k(x, x) = 1$ for all $x \in \mathbb{R}^d$.

Many popular kernels, such as the Gaussian kernel, the exponential kernel and the rational quadratic kernel described in the introduction, are indeed nice. We also assume that $k$ admits a fast KDE algorithm. Specifically:

Definition 3 (Fast KDE). A kernel function $k$ admits an $O(dm/(\mu^p\epsilon^2))$ time KDE algorithm if, given a set of $m$ points $x_1, \dots, x_m \in \mathbb{R}^d$, we can process them in $O(dm/(\mu^p\epsilon^2))$ time for some $p \ge 0$, such that we can answer queries of the form $\frac{1}{m}\sum_i k(y, x_i)$ up to $(1+\epsilon)$ relative error in $O(d/(\mu^p\epsilon^2))$ time for any query point $y \in \mathbb{R}^d$, with probability $2/3$, assuming that $\frac{1}{m}\sum_i k(y, x_i) \ge \mu$.

Note that for a kernel function satisfying Def. 3, evaluating $\bar{k}(y_j)$ for $m$ points $Y = \{y_1, \dots, y_m\}$ against $m$ points $X = \{x_1, \dots, x_m\}$ requires $O(dm/(\mu^p\epsilon^2))$ total time.
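For reference, the interface assumed by Definition 3 can be pictured as follows. The class and method names in this sketch are ours (hypothetical), and the exact-scan implementation is only a stand-in so that the later sketches are runnable; it is not a fast KDE algorithm.

```python
import numpy as np

class KDEOracle:
    """Interface assumed by Definition 3 (names here are ours, for illustration).

    The constructor preprocesses the points x_1, ..., x_m; query(y) returns a
    (1 + eps)-approximation of (1/m) * sum_i k(y, x_i), valid whenever the true
    density is at least mu.
    """
    def __init__(self, points, mu, eps):
        self.points, self.mu, self.eps = points, mu, eps

    def query(self, y):
        raise NotImplementedError

class ExactKDE(KDEOracle):
    """Placeholder 'oracle' for the Gaussian kernel, answering exactly in O(dm)
    time per query. A real Fast KDE (e.g., a hashing-based estimator) would
    answer in O(d / (mu^p * eps^2)) time instead."""
    def query(self, y):
        return float(np.mean(np.exp(-np.sum((self.points - y) ** 2, axis=1))))

if __name__ == "__main__":
    pts = np.random.default_rng(7).normal(size=(1000, 8))
    oracle = ExactKDE(pts, mu=1e-3, eps=0.5)
    print(oracle.query(pts[0]))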
3. Sublinear time algorithm for kernel sum

Our proposed kernel sum approximation algorithm will sample a set $A$ of $s = \Theta(\sqrt{n})$ input points and look at the principal submatrix $K_A$ of $K$ corresponding to those points. We prove that the sum of off-diagonal entries in $K_A$ (appropriately scaled) is a good estimator of the sum of off-diagonal entries of $K$. Since for a nice kernel the sum of diagonal entries is always $n$, this is enough to give a good estimate of the full kernel sum $s(K)$. Furthermore, we show how to estimate the sum of off-diagonal entries of $K_A$ quickly via kernel density evaluation, in $ds^{2-\delta}/\epsilon^{O(1)} = dn^{1-\delta/2}/\epsilon^{O(1)}$ time for a constant $\delta > 0$. Overall this yields:

Theorem 4. Let $K \in \mathbb{R}^{n \times n}$ be a kernel matrix defined by a set of $n$ points and a nice kernel function $k$ (Def. 2) that admits $O(dm/(\mu^p\epsilon^2))$ time approximate kernel density evaluation (Def. 3). After sampling a total of $O(\sqrt{n}/\epsilon^2)$ points, we can in
$$O\!\left(d n^{\frac{2+5p}{4+2p}} \log^2(n) \big/ \epsilon^{\frac{8+6p}{2+p}}\right)$$
time approximate the sum of entries of $K$ within a factor of $1+\epsilon$, with high probability $1 - 1/n^{\Theta(1)}$.

For any square matrix $K$, let $s_o(K)$ be the sum of its off-diagonal entries. Crucial to our analysis will be the following lemma:

Lemma 5 (PSD Mass is Spread Out). Let $K \in \mathbb{R}^{n \times n}$ be a PSD matrix with diagonal entries all equal to 1. Let $s_{o,i}(K)$ be the sum of off-diagonal entries in the $i$th row of $K$. If $s_o(K) \ge \epsilon n$ for some $\epsilon \le 1$, then for all $i$: $s_{o,i}(K) \le 2\sqrt{s_o(K)/\epsilon}$.

Lemma 5 implies that if the off-diagonal elements contribute significantly to $s(K)$ (i.e., $s_o(K) \ge \epsilon n$), then the off-diagonal weight is spread relatively evenly across at least $\Omega\!\left(\sqrt{\epsilon \cdot s_o(K)}\right) = \Omega(\sqrt{n})$ rows/columns (for constant $\epsilon$), allowing our strategy of sampling a principal submatrix with just $\Theta(\sqrt{n})$ rows to work.

Proof. Assume for the sake of contradiction that there is a row with $s_{o,i}(K) > 2\sqrt{s_o(K)/\epsilon}$. Let $x$ be the vector that has value $-\sqrt{s_o(K)/\epsilon}$ at index $i$ and $1$ elsewhere. Then:
$$x^T K x \le \frac{s_o(K)}{\epsilon} - 2\sqrt{\frac{s_o(K)}{\epsilon}}\cdot s_{o,i}(K) + s_o(K) + n < -\frac{3 s_o(K)}{\epsilon} + s_o(K) + n < 0,$$
where the last inequality follows from the assumptions that $s_o(K) \ge \epsilon n$ and $\epsilon \le 1$. The above contradicts $K$ being PSD, completing the lemma.

3.1. Our estimator

For a subset $A \subseteq [n]$, let $K_A$ be the corresponding kernel matrix (which is a principal submatrix of $K$). Suppose that $A$ is chosen by adding every element $i \in [n]$ to $A$ independently at random with probability $p$ (we will later set $p = 1/\sqrt{n}$). Then $Z := n + s_o(K_A)/p^2$ is an unbiased estimator of $s(K)$. That is, $\mathbb{E}[Z] = s(K)$. We would like to show that the variance $\mathrm{Var}[Z]$ is small. In fact, we will show that $\mathrm{Var}[Z] = O(s(K)^2)$. Thus, taking $\mathrm{Var}[Z]/(\epsilon^2\,\mathbb{E}[Z]^2) = O(1/\epsilon^2)$ samples of $Z$ and returning the average yields a $1+\epsilon$ approximation of $s(K)$ with constant probability. To amplify the probability of success to $1-\delta$ for any $\delta > 0$, we take the median of $O(\log(1/\delta))$ such estimates and apply a Chernoff bound in the standard way. In the appendix, we leverage Lemma 5 to prove the following upper bound on the variance.

Lemma 6. $\mathrm{Var}[Z] = O(s(K)^2)$.

3.2. Approximating the value of the estimator

To turn the argument from the previous section into an algorithm, we need to approximate the value of $Z = n + s_o(K_A)/p^2$ for $p = 1/\sqrt{n}$ efficiently. It is sufficient to efficiently approximate $Z = n + n\cdot s_o(K_A)$ when $s_o(K_A) = \Omega(\epsilon)$, as otherwise the induced loss in approximating $s(K)$ is negligible, since we always have $s(K) \ge n$. Let $K'$ be a kernel matrix of size $m \times m$ for which $s_o(K') = \Omega(\epsilon)$. We show that for such a kernel matrix it is possible to approximate $s_o(K')$ in time roughly $m^{2-\delta}$ for a constant $\delta > 0$. This is enough to yield a sublinear time algorithm for estimating $Z$, since $K_A$ is $m \times m$ with $m \approx pn = \sqrt{n}$.
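Below is a brute-force sketch (ours) of the Section 3.1 estimator: it samples each point independently with probability $p = 1/\sqrt{n}$ and returns $Z = n + s_o(K_A)/p^2$, computing $s_o(K_A)$ exactly. Replacing this exact $O(d\,|A|^2)$ computation with KDE-based estimation is exactly what the rest of this section is about.

```python
import numpy as np

def offdiag_sum_exact(A_points):
    """Exact sum of off-diagonal Gaussian kernel entries of the submatrix K_A."""
    sq = np.sum((A_points[:, None, :] - A_points[None, :, :]) ** 2, axis=-1)
    K_A = np.exp(-sq)
    return K_A.sum() - np.trace(K_A)

def kernel_sum_submatrix(X, rng):
    """One sample of Z = n + s_o(K_A) / p^2 with p = 1 / sqrt(n) (Section 3.1)."""
    n = X.shape[0]
    p = 1.0 / np.sqrt(n)
    mask = rng.random(n) < p            # include each point independently w.p. p
    return n + offdiag_sum_exact(X[mask]) / p ** 2

if __name__ == "__main__":
    rng = np.random.default_rng(2)
    X = rng.normal(size=(4000, 5))
    estimates = [kernel_sum_submatrix(X, rng) for _ in range(50)]  # O(1/eps^2) copies
    print(np.mean(estimates))
```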
A simple algorithm. We note that it is sufficient to approximate the contribution to $s_o(K')$ from the rows $i$ for which the row sum $s_{o,i}$ is $\Omega(\epsilon/m)$, as the contribution from the rest of the rows is negligible under the assumption that $s_o(K') = \Omega(\epsilon)$. So fix an $i$ and assume that $s_{o,i} = \Omega(\epsilon/m)$. To estimate $s_{o,i}$, we use a kernel density evaluation algorithm. Our goal is to approximate
$$\sum_{j: j \neq i} K'_{i,j} = \sum_{j: j \neq i} k(x_i, x_j).$$
The approach is to first process the points $x_1, \dots, x_m$ using the preprocessing algorithm of the KDE data structure. The KDE query algorithm then allows us to answer queries of the form $\frac{1}{m}\sum_j k(q, x_j)$ for an arbitrary query point $q$ in time $O(d/(\mu^p\epsilon^2))$, where $\mu$ is a lower bound on $\frac{1}{m}\sum_j k(q, x_j)$. So we set $\mu = \Omega(\epsilon/m^2)$ and query the KDE data structure on all $q = x_1, \dots, x_m$.

The above does not quite work, however: to estimate the off-diagonal sum we need to answer queries of the form $\frac{1}{m}\sum_{j: j\neq i} k(x_i, x_j)$ instead of $\frac{1}{m}\sum_{j} k(x_i, x_j)$. This could be solved if the KDE data structure were dynamic, so that we could remove any point $x_i$ from it. Some of the data structures indeed have this property. To provide a general reduction, however, we avoid this requirement by building several static data structures and then answering a single query of the form $\frac{1}{m}\sum_{j: j\neq i} k(x_i, x_j)$ by querying $O(\log m)$ static data structures.

Assume w.l.o.g. that $m$ is an integer power of 2. Then we build a data structure for the points $x_1, \dots, x_{m/2}$ and another for $x_{m/2+1}, \dots, x_m$. We also build 4 data structures, for the sets $x_1, \dots, x_{m/4}$ and $x_{m/4+1}, \dots, x_{m/2}$ and $x_{m/2+1}, \dots, x_{3m/4}$ and $x_{3m/4+1}, \dots, x_m$. And so forth, for $\log m$ levels. Suppose that we want to estimate $\frac{1}{m}\sum_{j: j\neq 1} k(x_1, x_j)$. For that we query the data structures on the sets $x_{m/2+1}, \dots, x_m$ and $x_{m/4+1}, \dots, x_{m/2}$ and $x_{m/8+1}, \dots, x_{m/4}$ and so forth, for a total of $\log m$ data structures, one from each level. Similarly we can answer queries for an arbitrary $i$. The threshold $\mu$ for all the data structures is the same as before: $\mu = \Omega(\epsilon/m^2)$. Since we need to query $O(\log m)$ data structures for every $x_i$, and we also need to amplify the probability of success to, say, $1 - 1/m^2$, the final runtime of the algorithm is $O(m \cdot d/(\mu^p\epsilon^2) \cdot \log^2 m) = O(dm^{1+2p}\log^2(m)/\epsilon^{2+p})$.
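The following sketch (ours) illustrates the dyadic decomposition just described: one structure per block at each of the $\log m$ levels, and a leave-one-out query answered by combining one sibling block per level. For simplicity each "structure" here answers exactly by a linear scan; in the actual algorithm it would be a static Fast KDE data structure (Def. 3).

```python
import numpy as np

def build_block_structures(X):
    """Build one 'KDE structure' per dyadic block of the points, at every level.

    Here each structure just stores its block and answers queries by an exact
    O(d * block_size) scan; in the real algorithm each block would instead be
    preprocessed by a static Fast KDE data structure.
    """
    m = X.shape[0]                      # assumed to be a power of two
    levels = int(np.log2(m))
    blocks = {}
    for lev in range(1, levels + 1):
        size = m >> lev                 # block size at this level
        for b in range(1 << lev):
            blocks[(lev, b)] = X[b * size:(b + 1) * size]
    return blocks, levels

def leave_one_out_sum(blocks, levels, X, i):
    """Estimate sum_{j != i} k(x_i, x_j) by querying one sibling block per level."""
    total = 0.0
    for lev in range(1, levels + 1):
        b = i >> (levels - lev)         # index of the block containing i at this level
        sibling = blocks[(lev, b ^ 1)]  # its sibling is disjoint from i's block
        total += np.sum(np.exp(-np.sum((sibling - X[i]) ** 2, axis=1)))
    return total

if __name__ == "__main__":
    rng = np.random.default_rng(3)
    X = rng.normal(size=(64, 4))        # m = 64 points
    blocks, levels = build_block_structures(X)
    i = 5
    exact = sum(np.exp(-np.sum((X[j] - X[i]) ** 2)) for j in range(64) if j != i)
    print(leave_one_out_sum(blocks, levels, X, i), exact)   # the two values agree
```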
A faster algorithm. We note that in the previous case, if $s_{o,i} = \Theta(\epsilon/m)$ for every $i$, then we can approximate $s_o(K')$ efficiently by sampling a few $i$, evaluating the corresponding $s_{o,i}$ exactly (in $O(dm)$ time each), and returning the empirical mean of the evaluated $s_{o,i}$, scaled by $m$. This works since the variance is small. There can be, however, values of $i$ for which $s_{o,i}$ is large. For these values of $i$, we can run a kernel density evaluation algorithm. More formally, we define a threshold $t > 0$ and run a kernel density evaluation algorithm on every $i$ with $\mu = t/m^2$ (similarly to the previous algorithm). This reports all $i$ for which $s_{o,i} \ge t/m$, and takes time $O(dm\log^2(m) \cdot 1/(\mu^p\epsilon^2)) = O(dm^{1+2p}\log^2(m)/(\epsilon^2 t^p))$. Let $I$ be the set of remaining $i$. To estimate the contribution from $s_{o,i}$ with $i \in I$, we repeatedly sample $i \in I$ and evaluate $s_{o,i}$ exactly using a linear scan, and output the average of the evaluated $s_{o,i}$, scaled by $|I|$, as an estimate of the contribution of $s_{o,i}$ from $i \in I$. Since we can ignore the contribution from $i \in I$ with $s_{o,i} = o(\epsilon/m)$, we can assume that $t/m \ge s_{o,i} \ge \Omega(\epsilon/m)$ for every $i \in I$, and then $\mathrm{Var}_i[s_{o,i}]/(\epsilon^2(\mathbb{E}_i[s_{o,i}])^2) = O(t^2/\epsilon^4)$ samples are sufficient to get a $1+\epsilon$ approximation. This step takes time $O(t^2 dm/\epsilon^4)$.

The final runtime is
$$O\!\left(\frac{dm^{1+2p}\log^2 m}{\epsilon^2 t^p} + \frac{d t^2 m}{\epsilon^4}\right) = O\!\left(d m^{\frac{2+5p}{2+p}} \log^2(m) \big/ \epsilon^{\frac{4+4p}{2+p}}\right)$$
by setting $t = m^{\frac{2p}{2+p}}\epsilon^{\frac{2}{2+p}}$. Since $m = \Theta(\sqrt{n})$ with high probability, we achieve an $O\!\left(d n^{\frac{2+5p}{4+2p}}\log^2(n)/\epsilon^{\frac{4+4p}{2+p}}\right)$ runtime for approximating the random variable $Z = n + n\cdot s_o(K_A)$. Since we evaluate the random variable $Z$ $O(1/\epsilon^2)$ times, the final runtime to approximate the sum of entries of the kernel matrix $K$ within a factor of $1+\epsilon$ is $O\!\left(d n^{\frac{2+5p}{4+2p}}\log^2(n)/\epsilon^{\frac{8+6p}{2+p}}\right)$.

3.3. Sample complexity lower bound

We next prove a lower bound, which shows that sampling just $O(\sqrt{n})$ data points, as is done in our algorithm, is optimal up to constant factors.

Theorem 7. Consider any nice kernel $k$ such that $k(x, y) \to 0$ as $\|x - y\| \to \infty$. In order to estimate $\sum_{i,j=1}^{n} k(x_i, x_j)$ within any constant factor, we need to sample at least $\Omega(\sqrt{n})$ points from the input $x_1, \dots, x_n$.

Proof. Suppose that we want to approximate $s(K) = \sum_{i,j=1}^{n} k(x_i, x_j)$ within a factor of $C > 0$. Consider two distributions over $x_1, \dots, x_n$. For the first distribution, $x_1, \dots, x_n$ are sampled independently from a large enough domain such that $k(x_i, x_j) \approx 0$ for all $i \neq j$ with high probability. In this case $s(K) \approx n$. For the other distribution, we sample again $x_1, \dots, x_n$ independently at random as before, and then sample indices $s_1, \dots, s_{\sqrt{2Cn}}$ from $\{1, \dots, n\}$ without repetition. Then we set $x_{s_t} = x_{s_1}$ for $t = 2, \dots, \sqrt{2Cn}$. For this distribution $s(K) \ge 2Cn$. To distinguish between these two distributions we need to sample at least $\Omega(\sqrt{n/C}) = \Omega(\sqrt{n})$ points from $x_1, \dots, x_n$.

3.4. Application to kernel alignment

Our algorithm can be immediately used to estimate the value of the kernel alignment $\hat{A}(K, K')$ as defined in the introduction. The only requirement is that the submatrices of the product kernel matrices $K \cdot K$, $K' \cdot K'$ and $K \cdot K'$ are supported by fast kernel density evaluation algorithms. We formalize this as follows. Let $C$ be a set of nice kernels defined over pairs of points in $\mathbb{R}^d$. For any two kernel functions $k, k' \in C$, the product kernel $k \cdot k': \mathbb{R}^{2d} \times \mathbb{R}^{2d} \to [0, 1]$ is such that for any $p, q, p', q' \in \mathbb{R}^d$ we have $(k \cdot k')((p, p'), (q, q')) = k(p, q) \cdot k'(p', q')$.

Definition 8. Let $\mathcal{C} = C_1, C_2, \dots$ be a sequence of sets of nice kernels, such that $C_d$ is defined over pairs of points in $\mathbb{R}^d$. We say that $\mathcal{C}$ is closed under product if for any two $k, k' \in C_d$, the product kernel $k \cdot k'$ belongs to $C_{2d}$.

It is immediate that the Gaussian kernel, interpreted as a sequence of kernels for different values of the dimension $d$, is closed under product. Thus we obtain the following:

Corollary 9. Given two Gaussian kernel matrices $K, K' \in \mathbb{R}^{n \times n}$, each defined by a set of $n$ points in $\mathbb{R}^d$, and $\epsilon \in (0, 1)$, $\hat{A}(K, K')$ can be estimated to $1\pm\epsilon$ relative error in time $O(dn^{0.66}\log^2(n)/\epsilon^{O(1)})$, with high probability $1 - 1/n^{\Theta(1)}$.
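A small numerical check (ours) of the product-kernel reduction behind Section 3.4 for the Gaussian kernel: concatenating the two point sets coordinate-wise turns the matrix inner product $\langle K, K'\rangle$ into the plain kernel sum of a single Gaussian kernel matrix over $\mathbb{R}^{2d}$, so each of the three inner products in $\hat{A}(K, K')$ is an instance of the kernel sum problem.

```python
import numpy as np

def gaussian_gram(Z):
    """Gaussian kernel matrix K[i, j] = exp(-||z_i - z_j||^2)."""
    sq = np.sum((Z[:, None, :] - Z[None, :, :]) ** 2, axis=-1)
    return np.exp(-sq)

rng = np.random.default_rng(4)
X, Xp = rng.normal(size=(300, 6)), rng.normal(size=(300, 6))   # two point sets in R^d
K, Kp = gaussian_gram(X), gaussian_gram(Xp)

# <K, K'> equals the kernel sum of the Gaussian kernel on concatenated points
Z = np.concatenate([X, Xp], axis=1)                            # z_i = (x_i, x'_i) in R^{2d}
inner_via_product_kernel = gaussian_gram(Z).sum()
inner_direct = np.sum(K * Kp)
print(np.isclose(inner_via_product_kernel, inner_direct))      # True

# Hence the alignment is a ratio of three kernel sums, each estimable sublinearly
alignment = inner_direct / np.sqrt(np.sum(K * K) * np.sum(Kp * Kp))
print(alignment)
```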
4. Subquadratic time top eigenvector

We now present our top eigenvector approximation algorithm, which is a variant on the noisy power method with approximate matrix vector multiplication implemented through kernel density evaluation. Existing analysis of the noisy power method assumes random noise on each matrix vector multiplication, which has little correlation with the top eigenvector (Hardt & Price, 2014; Balcan et al., 2016). This prevents the top direction from being washed out by the noise. In our setting this cannot be guaranteed: the noise distribution arising from implementing matrix multiplication with $K$ using kernel density evaluation is complex. To avoid this issue, we use that since the kernel $k$ is nice, $K$ is entrywise non-negative, and by the Perron-Frobenius theorem, so is its top eigenvector. Thus, if our noise in approximating $Kz$ is entrywise non-negative (i.e., if we overestimate each weighted kernel density), then the noise will have non-negative dot product with the top eigenvector, and will not wash it out, even if it is highly correlated with it. We formalize this analysis in Theorem 11, first giving the required approximate matrix vector multiplication primitive that we use in Definition 10. We give a full description of our noisy power method variant in Algorithm 1. In Section 4.1 we discuss how to implement the matrix vector multiplication primitive efficiently using existing KDE algorithms.

Definition 10 (Non-negative Approximate Matrix Vector Multiplication). An $\epsilon$-non-negative approximate MVM algorithm for a matrix $K \in \mathbb{R}^{n \times n}$ takes as input a non-negative vector $x \in \mathbb{R}^n$ and returns $y = Kx + e$, where $e$ is an entrywise non-negative error vector satisfying $\|e\|_2 \le \epsilon \|Kx\|_2$.

Algorithm 1 Kernel Noisy Power Method
input: Error parameter $\epsilon \in (0, 1)$. Iteration count $I$. An $\epsilon^2/12$-non-negative approximate MVM algorithm (Def. 10) $\tilde{K}(\cdot)$ for a nice kernel matrix $K \in \mathbb{R}^{n \times n}$.
output: $z \in \mathbb{R}^n$ with $\|z\|_2 = 1$.
1: Initialize $z_0 \in \mathbb{R}^n$ with $z_0(i) = \frac{1}{\sqrt{n}}$ for all $i$.
2: Initialize $\tilde{\lambda} := 0$.
3: for $i = 0$ to $I$ do
4: &nbsp;&nbsp; $z_{i+1} := \tilde{K}(z_i)$.
5: &nbsp;&nbsp; if $z_i^T z_{i+1} > \tilde{\lambda}$ then
6: &nbsp;&nbsp;&nbsp;&nbsp; $z := z_i$.
7: &nbsp;&nbsp;&nbsp;&nbsp; $\tilde{\lambda} := z_i^T z_{i+1}$.
8: &nbsp;&nbsp; end if
9: &nbsp;&nbsp; $z_{i+1} := z_{i+1}/\|z_{i+1}\|_2$.
10: end for
11: return $z$.

Theorem 11. The kernel noisy power method (Algorithm 1), run for $I = O\!\left(\frac{\log(n/\epsilon)}{\epsilon}\right)$ iterations, outputs a unit vector $z$ with $z^T K z \ge (1-\epsilon)\cdot\lambda_1(K)$.

Proof. Let $V\Lambda V^T = K$ be $K$'s eigendecomposition. $\Lambda$ is diagonal, containing the eigenvalues in decreasing order $\lambda_1 \ge \dots \ge \lambda_n \ge 0$. $V$ is orthonormal, with columns equal to the corresponding eigenvectors of $K$: $v_1, \dots, v_n$. Let $m$ be the largest index such that $\lambda_m \ge (1-\epsilon/4)\cdot\lambda_1$. Let $c_i = V^T z_i$ be the $i$th iterate, written in the eigenvector basis. Let $c_{i,m}$ be its first $m$ components and $c_{i,n-m}$ be the last $n-m$ components.

We will argue that for $I = O\!\left(\frac{\log(n/\epsilon)}{\epsilon}\right)$ iterations, there is at least one iteration $i \le I$ where $\|c_{i,m}\|_2^2 \ge (1-\epsilon/4)$, and so $z_i$ aligns mostly with large eigenvectors. Formally this gives
$$z_i^T K z_i = c_i^T \Lambda c_i \ge \|c_{i,m}\|_2^2 \cdot (1-\epsilon/4)\lambda_1 \ge (1-\epsilon/4)^2 \lambda_1 \ge (1-\epsilon/2)\lambda_1.$$
Further, when we check whether to set $z := z_i$ at lines 5-7, we have $z_i^T z_{i+1} = z_i^T (K z_i + e) = z_i^T K z_i + z_i^T e$. Since $z_0$, $K$, and $e$ are all entrywise non-negative, $z_i$ is entrywise non-negative for all $i$. Thus, $z_i^T e \ge 0$. By our bound on $\|e\|_2$ we also have $z_i^T e \le \|e\|_2 \le \epsilon^2/12 \cdot \|K z_i\|_2 \le \epsilon^2/12 \cdot \lambda_1$. Overall this gives
$$z_i^T K z_i \le z_i^T z_{i+1} \le z_i^T K z_i + \epsilon^2/12 \cdot \lambda_1.$$
So, if there is an $i$ with $\|c_{i,m}\|_2^2 \ge (1-\epsilon/4)$, and thus $z_i^T K z_i \ge (1-\epsilon/2)\lambda_1$, we will not output a $z = z_j$ with $z_j^T K z_j < (1-\epsilon/2-\epsilon^2/12)\lambda_1$; since $(1-\epsilon/2-\epsilon^2/12)\lambda_1 \ge (1-\epsilon)\lambda_1$, this ensures our final error bound.

To prove that there is some iterate with $\|c_{i,m}\|_2^2 \ge (1-\epsilon/4)$, since $\|z_i\|_2^2 = \|c_i\|_2^2 = 1$, it suffices to argue that $\|c_{i,n-m}\|_2^2 \le \epsilon/4$. Assume for the sake of contradiction that for all $i \le I$ we have $\|c_{i,n-m}\|_2^2 > \epsilon/4$. Under this assumption we can argue that $c_i(1)^2$ grows significantly with respect to $\|c_{i,n-m}\|_2^2$ in each step. Specifically, we can show by induction that $\frac{c_i(1)}{\|c_{i,n-m}\|_2} \ge \frac{(1+\epsilon/6)^i}{\sqrt{n}}$. This gives a contradiction, since for $I = O(\log(n/\epsilon)/\epsilon)$ it would imply that $1 \ge c_I(1)^2 > \frac{4}{\epsilon}\|c_{I,n-m}\|_2^2$, and thus we must have $\|c_{I,n-m}\|_2^2 < \epsilon/4$. This contradiction proves the theorem.

Base case. Initially, $z_0$ has all entries equal to $1/\sqrt{n}$. Thus
$$\frac{c_0(1)}{\|c_{0,n-m}\|_2} \ge c_0(1) = v_1^T z_0 = \frac{1}{\sqrt{n}}\sum_{j=1}^{n} v_1(j) = \frac{\|v_1\|_1}{\sqrt{n}} \ge \frac{1}{\sqrt{n}},$$
where we use that $v_1$ is a non-negative unit vector by the Perron-Frobenius theorem, so $\sum_{j=1}^{n} v_1(j) = \|v_1\|_1 \ge \|v_1\|_2 = 1$.

Inductive step. Assume inductively that $\frac{c_i(1)}{\|c_{i,n-m}\|_2} \ge \frac{(1+\epsilon/6)^i}{\sqrt{n}}$. Before normalization at line 9 we have $z_{i+1} = K z_i + e$. Normalization doesn't affect the ratio between $c_{i+1}(1)$ and $\|c_{i+1,n-m}\|_2$, thus we can ignore this step. For all $j \in \{1, \dots, n\}$ we have $c_{i+1}(j) = \lambda_j c_i(j) + v_j^T e$. Since both $e$ and $v_1$ are entrywise non-negative, this gives $c_{i+1}(1) \ge \lambda_1 c_i(1)$. Further, by the triangle inequality and the fact that for $j > m$, $\lambda_j < (1-\epsilon/4)\lambda_1$,
$$\|c_{i+1,n-m}\|_2 \le (1-\epsilon/4)\lambda_1\|c_{i,n-m}\|_2 + \|e\|_2 \le (1-\epsilon/4)\lambda_1\|c_{i,n-m}\|_2 + \frac{\epsilon^2}{12}\|K z_i\|_2 \le (1-\epsilon/4)\lambda_1\|c_{i,n-m}\|_2 + \frac{\epsilon^2}{12}\lambda_1.$$
By our assumption (for contradiction) that $\|c_{i,n-m}\|_2^2 > \epsilon/4$, so that $\|c_{i,n-m}\|_2 \ge \sqrt{\epsilon}/2$, we then have
$$\|c_{i+1,n-m}\|_2 \le \left(1-\frac{\epsilon}{4}+\frac{\epsilon^{3/2}}{6}\right)\|c_{i,n-m}\|_2 \cdot \lambda_1 \le \left(1-\frac{\epsilon}{6}\right)\|c_{i,n-m}\|_2 \cdot \lambda_1.$$
Overall, this gives that
$$\frac{c_{i+1}(1)}{\|c_{i+1,n-m}\|_2} \ge \frac{\lambda_1 c_i(1)}{(1-\epsilon/6)\|c_{i,n-m}\|_2 \cdot \lambda_1} \ge (1+\epsilon/6)\cdot\frac{c_i(1)}{\|c_{i,n-m}\|_2} \ge \frac{(1+\epsilon/6)^{i+1}}{\sqrt{n}}$$
by our inductive assumption. This gives our contradiction, completing the proof.
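The sketch below (ours) transcribes Algorithm 1 against an abstract MVM oracle. To keep it self-contained, the oracle is instantiated here by exact multiplication with a small explicit $K$ plus an artificial entrywise non-negative error vector; in the paper the oracle is instead implemented via KDE, as in Theorem 12.

```python
import numpy as np

def kernel_noisy_power_method(approx_mvm, n, iters):
    """Algorithm 1: power method using an eps^2/12-non-negative approximate MVM oracle."""
    z = np.full(n, 1.0 / np.sqrt(n))         # z_0(i) = 1/sqrt(n)
    best_z, best_val = z.copy(), 0.0
    for _ in range(iters):
        w = approx_mvm(z)                     # w = K z + e with e >= 0 entrywise
        val = z @ w                           # estimate of z^T K z (never an underestimate)
        if val > best_val:
            best_z, best_val = z.copy(), val
        z = w / np.linalg.norm(w)
    return best_z

if __name__ == "__main__":
    rng = np.random.default_rng(5)
    X = rng.normal(size=(400, 3))
    K = np.exp(-np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1))

    def noisy_mvm(z, eps=0.3):
        """Stand-in oracle: exact product plus a small non-negative error vector."""
        exact = K @ z
        e = rng.random(K.shape[0])
        e *= (eps ** 2 / 12) * np.linalg.norm(exact) / np.linalg.norm(e)
        return exact + e

    z = kernel_noisy_power_method(noisy_mvm, n=400, iters=30)
    print(z @ K @ z, np.linalg.eigvalsh(K)[-1])   # close to the true top eigenvalue
```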
4.1. Approximate kernel matrix vector multiplication

In the appendix, we show how to use a fast KDE algorithm to instantiate the non-negative approximate MVM primitive (Def. 10) required by Algorithm 1. A similar approach was taken by Charikar et al. (2020). We provide our own analysis, which applies black-box to any KDE algorithm.

Theorem 12. Let $k$ be a nice kernel (Def. 2) admitting $O(dm/(\mu^p\epsilon^2))$ time approximate kernel density evaluation (Def. 3). Let $K \in \mathbb{R}^{n \times n}$ be the associated kernel matrix for $n$ points in $d$ dimensions. There is an $\epsilon$-non-negative approximate MVM algorithm for $K$ running in time $O\!\left(\frac{d n^{1+p}\log^{1+p}(n/\epsilon)}{\epsilon^{3+2p}}\right)$.

Combined with Theorem 11, Theorem 12 immediately gives our final subquadratic time eigenvalue approximation result:

Corollary 13. Let $k$ be a nice kernel (Def. 2) admitting $O(dm/(\mu^p\epsilon^2))$ time approximate kernel density evaluation (Def. 3). Let $K \in \mathbb{R}^{n \times n}$ be the associated kernel matrix for $n$ points in $d$ dimensions. There is an algorithm running in time $O\!\left(\frac{d n^{1+p}\log^{2+p}(n/\epsilon)}{\epsilon^{7+4p}}\right)$ which outputs a unit vector $z$ with $z^T K z \ge (1-\epsilon)\cdot\lambda_1(K)$.

Proof. Algorithm 1 requires $I = O\!\left(\frac{\log(n/\epsilon)}{\epsilon}\right)$ approximate MVMs, each with error parameter $\epsilon^2/12$. By Theorem 12, each matrix vector multiply requires time $O\!\left(\frac{d n^{1+p}\log^{1+p}(n/\epsilon)}{\epsilon^{6+4p}}\right)$. Multiplying by $I$ gives the final bound.

4.2. Lower bound

It is easy to see that $\Omega(dn)$ time is necessary to estimate $\lambda_1(K)$ to even a constant factor. Thus, for $\epsilon = \Theta(1)$, the runtime of Cor. 13 is tight up to an $n^p\log^{2+p}(n)$ factor.

Theorem 14. Consider any nice kernel $k$ such that $k(x, y) \to 0$ as $\|x - y\| \to \infty$. Let $K \in \mathbb{R}^{n \times n}$ be the associated kernel matrix of $n$ points $x_1, \dots, x_n \in \mathbb{R}^d$. Estimating $\lambda_1(K)$ to any constant factor requires $\Omega(nd)$ time.

Proof. Let $c$ be any constant. Consider two input cases. In the first, no two points in $x_1, \dots, x_n$ are identical. In the second, a random set of $c$ points are exact duplicates of each other. Scaling up these point sets by an arbitrarily large constant value, we can see that their kernel matrices are arbitrarily close to $K = I$ in the first case, and $K = I + E$ in the second, where $E$ has a 1 at positions $(i, j)$ and $(j, i)$ whenever $i \neq j$ are both in the duplicate set, and zeros everywhere else. We can check that $\lambda_1(I) = 1$ while $\lambda_1(I + E) = c$. Thus, to approximate the top eigenvalue up to a $c$ factor we must distinguish the two cases. If we read $o(n/c)$ points, then with good probability we will see no duplicates. Thus, we must read $\Omega(n/c)$ points, requiring $\Omega(nd/c)$ time.

5. Empirical evaluation

In this section we empirically evaluate the kernel noisy power method (Algorithm 1) for approximating the top eigenvalue of the kernel matrix $K \in \mathbb{R}^{n \times n}$ associated with an input dataset of $n$ points.
The reason we focus our evaluation on approximating the top eigenvector (Section 4), and not on approximating the kernel matrix sum (Section 3), is that the latter problem admits a fast algorithm by vanilla random sampling (in time $O(n)$, see Claim 1), which is very fast in practice. We observed that this baseline obtains empirical running times comparable to our method, so the latter yields no speedup. We thus focus our experiments on the kernel power method. All previous algorithms for approximating the top eigenvector have running time $\Omega(n^2)$, which is very slow in practice even for moderate $n$, raising a stronger need for empirical improvement.

For evaluating the kernel noisy power method, we use the Laplacian kernel, $k(x, y) = \exp(-\|x - y\|_1/\sigma)$. It is a nice kernel (by Def. 2), and furthermore, a Fast KDE implementation for it (as per Def. 3) with $p = 0.5$ was recently given in (Backurs et al., 2019), based on the Hashing-Based Estimators (HBE) technique of (Charikar & Siminelakis, 2017). This can be plugged into Corollary 13 to obtain a provably fast and accurate instantiation of the kernel noisy power method. The resulting algorithm is referred to in this section as KNPM.

Evaluated methods. We compare KNPM to two baselines: the usual (Full) power method, and a Uniform noisy power method. The latter is similar to KNPM, except that the KDE subroutine in Corollary 13 is evaluated by vanilla uniform sampling instead of a Fast KDE. In more detail, denote by UniKDE(y, X) a randomized algorithm for KDE(y, X) on a point $y$ and a point set $X$, that draws a uniformly random sample $X'$ of $X$ and returns the mean of $k(y, x)$ over all $x \in X'$. By Bernstein's inequality, if the sample size is $\Omega(1/(\mu\epsilon^2))$ then this returns a $(1\pm\epsilon)$-approximation of KDE(y, X) in time $O(d/(\mu\epsilon^2))$. Therefore, it is a Fast KDE algorithm as defined in Definition 3, with $p = 1$. This algorithm is indeed used and analyzed in many prior works on KDE approximation. Since Algorithm 1 reduces the kernel power method to a sequence of approximate KDE computations (cf. Corollary 13), we may use UniKDE for each of them, thus obtaining the natural baseline we call Uniform. While it asymptotically does not lead to sub-quadratic running time, it empirically performs significantly better than the full power method, as our experiments will show.
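A minimal sketch (ours) of the UniKDE subroutine underlying the Uniform baseline, for the Laplacian kernel used in the experiments: it averages the kernel over $\Theta(1/(\mu\epsilon^2))$ uniformly sampled points. The synthetic data below is only there to make the demo's densities non-negligible; it is not one of the paper's datasets.

```python
import numpy as np

def laplacian_k(y, X, sigma=0.05):
    """Laplacian kernel values k(y, x) = exp(-||y - x||_1 / sigma) for rows x of X."""
    return np.exp(-np.sum(np.abs(X - y), axis=1) / sigma)

def uni_kde(y, X, mu, eps, rng):
    """Uniform-sampling KDE estimate of (1/|X|) * sum_x k(y, x) (the Uniform baseline)."""
    m = X.shape[0]
    t = min(m, int(np.ceil(1.0 / (mu * eps ** 2))))   # Theta(1/(mu * eps^2)) samples
    sample = X[rng.choice(m, size=t, replace=True)]
    return float(np.mean(laplacian_k(y, sample)))

if __name__ == "__main__":
    rng = np.random.default_rng(6)
    X = 0.05 * rng.random(size=(5000, 3))             # small-scale synthetic data
    y = X[0]
    print(uni_kde(y, X, mu=0.1, eps=0.25, rng=rng))
    print(float(np.mean(laplacian_k(y, X))))          # exact density, for comparison
```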
Evaluation metrics. The computational cost of each algorithm is measured by the number of kernel evaluations it performs (i.e., how many entries of $K$ it computes). Computed kernel values are not carried over across iterations. The full power method computes the entire matrix row-by-row in each iteration, since it is too large to be stored in memory. The other two methods use a small sample of entries. Our reasons for using the number of kernel evaluations as a proxy for the running time are the following:

- The three algorithms we evaluate implement the power method via a sequence of kernel evaluations, differing in the choice of points for evaluation (Full computes $k(x, y)$ for all $x, y$, while Uniform and KNPM choose pairs at random according to their different sampling schemes). Thus this measure of efficiency allows for a direct comparison between them.
- This measure is software- and architecture-free, unaffected by access to specialized libraries (BLAS, MATLAB, etc.) or hardware (SIMD, GPU, etc.). This is particularly important with linear algebraic operations, which behave very differently in different environments (many are specifically designed to handle them in a particular way), and can result in many artifacts when measuring runtimes.
- Compatibility with prior literature (e.g., Backurs et al. 2019).

Nonetheless, we believe that methodologically sound wall-clock time experiments would be valuable, and we leave this for future work.

The accuracy of each method is evaluated in each iteration by the relative error, $1 - z^T K z/\lambda_1(K)$, where $z$ is the unit vector computed by the algorithm in that iteration, and $\lambda_1(K)$ is the true top eigenvalue. $\lambda_1(K)$ is computed by letting the full power method run until convergence. This error measure corresponds directly to $\epsilon$ from Corollary 13.

Datasets. We use classes of the Forest Covertype dataset (Blackard & Dean, 1999), which is a 54-dimensional dataset often used to evaluate kernel methods in high dimensions (Siminelakis et al., 2019; Backurs et al., 2019). We use 5 of the 7 classes (namely classes 3-7), whose sizes range from 2.7K to 35.7K points. We have omitted the two larger classes since we could not compute an accurate ground-truth $\lambda_1(K)$ for them, and hence could not measure accuracy. We also use the full training set of the MNIST dataset (60K points in 784 dimensions).

Parameter setting. We use bandwidth $\sigma = 0.05$ (other choices produce similar results). The full power method has no parameters. The Uniform and KNPM methods each have a single parameter that governs the sampling rate. For both, we start with a small sampling rate, and gradually increase it by multiplying by 1.1 in each iteration. In this way, the approximate matrix multiplication becomes more accurate as the method converges closer to the true top eigenvalue.

Results. All results are reported in Figure 1. Both the Uniform and KNPM variants of the noisy power method give a much better tradeoff in terms of accuracy vs. computation than the full power method. Additionally, KNPM consistently outperforms Uniform.

6. Conclusion

We have shown that fast kernel density evaluation methods can be used to give much faster algorithms for approximating the kernel matrix sum and its top eigenvector. Our work leaves open a number of directions. For top eigenvector computation it is open if the gaps between the linear in $n$ lower bound of Theorem 14 and our slightly superlinear runtimes for the Gaussian and exponential kernels can be closed. Extending our techniques to approximate the top $k$ eigenvectors/values in subquadratic time would also be very interesting, as this is a key primitive in kernel PCA and related methods. Finally, it would be interesting to identify other natural kernel matrix problems that can be solved in sublinear or subquadratic time using fast KDE methods. Conversely, one might hope to prove lower bounds ruling this out. Lower bounds that hold for error $\epsilon = \Theta(1)$ would be especially interesting; known lower bounds against, e.g., subquadratic time algorithms for the kernel sum only hold when high accuracy $\epsilon = \exp(-\omega(\log^2 n))$ is demanded.

Acknowledgments

This research was supported in part by the NSF TRIPODS program (awards CCF-1740751 and DMS-2022448); NSF award CCF-2006798; NSF award CCF-2006806; the MIT-IBM Watson collaboration; a Simons Investigator Award; NSF award IIS-1763618; and an Adobe Research grant.
Figure 1. Results for the Full, Uniform and Kernel Noisy variants of the power method, on classes 3-7 of the Forest Covertype dataset, and on the MNIST dataset. We can see that the noisy power method implemented with KDE (KNPM) achieves a significantly better tradeoff between accuracy and number of kernel evaluations, as compared to the baselines.

References

Backurs, A., Indyk, P., and Schmidt, L. On the fine-grained complexity of empirical risk minimization: Kernel methods and neural networks. In Advances in Neural Information Processing Systems 27 (NIPS), pp. 4308-4318, 2017.
Backurs, A., Charikar, M., Indyk, P., and Siminelakis, P. Efficient density evaluation for smooth kernels. In Proceedings of the 59th Annual IEEE Symposium on Foundations of Computer Science (FOCS), pp. 615-626, 2018.
Backurs, A., Indyk, P., and Wagner, T. Space and time efficient kernel density estimation in high dimensions. In Advances in Neural Information Processing Systems 32 (NIPS), pp. 15799-15808, 2019.
Bakshi, A. and Woodruff, D. Sublinear time low-rank approximation of distance matrices. In Advances in Neural Information Processing Systems 31 (NIPS), pp. 3782-3792, 2018.
Bakshi, A., Chepurko, N., and Jayaram, R. Testing positive semi-definiteness via random submatrices. In Proceedings of the 61st Annual IEEE Symposium on Foundations of Computer Science (FOCS), 2020.
Balcan, M.-F., Du, S. S., Wang, Y., and Yu, A. W. An improved gap-dependency analysis of the noisy power method. In Proceedings of the 29th Annual Conference on Computational Learning Theory (COLT), pp. 284-309, 2016.
Blackard, J. A. and Dean, D. J. Comparative accuracies of artificial neural networks and discriminant analysis in predicting forest cover types from cartographic variables. Computers and Electronics in Agriculture, 24(3):131-151, 1999.
Charikar, M. and Siminelakis, P. Hashing-based-estimators for kernel density in high dimensions. In Proceedings of the 58th Annual IEEE Symposium on Foundations of Computer Science (FOCS), pp. 1032-1043, 2017.
Charikar, M., Kapralov, M., Nouri, N., and Siminelakis, P. Kernel density estimation through density constrained near neighbor search. In Proceedings of the 61st Annual IEEE Symposium on Foundations of Computer Science (FOCS), 2020.
Chechik, S., Cohen, E., and Kaplan, H. Average distance queries through weighted samples in graphs and metric spaces: High scalability with tight statistical guarantees. In Approximation, Randomization, and Combinatorial Optimization. Algorithms and Techniques, pp. 659, 2015.
Cohen, E., Chechik, S., and Kaplan, H. Clustering small samples with quality guarantees: Adaptivity with one2all PPS. In Thirty-Second AAAI Conference on Artificial Intelligence, 2018.
Cristianini, N., Shawe-Taylor, J., Elisseeff, A., and Kandola, J. S. On kernel-target alignment. In Advances in Neural Information Processing Systems 15 (NIPS), pp. 367-373, 2002.
Fine, S. and Scheinberg, K. Efficient SVM training using low-rank kernel representations. Journal of Machine Learning Research, 2(Dec):243-264, 2001.
Greengard, L. and Strain, J. The fast Gauss transform. SIAM Journal on Scientific and Statistical Computing, 12(1):79-94, 1991.
Hardt, M. and Price, E. The noisy power method: a meta algorithm with applications. In Advances in Neural Information Processing Systems 27 (NIPS), pp. 2861-2869, 2014.
Hofmann, T., Schölkopf, B., and Smola, A. J. Kernel methods in machine learning. The Annals of Statistics, pp. 1171-1220, 2008.
Indyk, P. Sublinear time algorithms for metric space problems. In Proceedings of the 31st Annual ACM Symposium on Theory of Computing (STOC), pp. 428-434, 1999.
Indyk, P., Vakilian, A., Wagner, T., and Woodruff, D. Sample-optimal low-rank approximation of distance matrices. In Proceedings of the 32nd Annual Conference on Computational Learning Theory (COLT), 2019.
Lee, D. and Gray, A. G. Fast high-dimensional kernel summations using the Monte Carlo multipole method. In Advances in Neural Information Processing Systems 22 (NIPS), 2009.
Lee, D., Moore, A. W., and Gray, A. G. Dual-tree fast Gauss transforms. In Advances in Neural Information Processing Systems 19 (NIPS), 2006.
March, W. B., Xiao, B., and Biros, G. ASKIT: Approximate skeletonization kernel-independent treecode in high dimensions. SIAM Journal on Scientific Computing, 37(2):A1089-A1110, 2015.
Musco, C. and Musco, C. Recursive sampling for the Nyström method. In Advances in Neural Information Processing Systems 30 (NIPS), pp. 3836-3848, 2017.
Musco, C. and Woodruff, D. P. Sublinear time low-rank approximation of positive semidefinite matrices. In Proceedings of the 58th Annual IEEE Symposium on Foundations of Computer Science (FOCS), pp. 672-683, 2017.
Phillips, J. M. and Tai, W. M. Improved coresets for kernel density estimates. In Proceedings of the 29th Annual ACM-SIAM Symposium on Discrete Algorithms (SODA), pp. 2718-2727, 2018a.
Phillips, J. M. and Tai, W. M. Near-optimal coresets of kernel density estimates. In 34th International Symposium on Computational Geometry (SoCG), 2018b.
Schölkopf, B., Smola, A., and Müller, K.-R. Kernel principal component analysis. In International Conference on Artificial Neural Networks, pp. 583-588. Springer, 1997.
Shawe-Taylor, J., Cristianini, N., et al. Kernel methods for pattern analysis. Cambridge University Press, 2004.
Siminelakis, P., Rong, K., Bailis, P., Charikar, M., and Levis, P. Rehashing kernel evaluation in high dimensions. In Proceedings of the 36th International Conference on Machine Learning (ICML), pp. 5789-5798, 2019.
Williams, C. and Seeger, M. Using the Nyström method to speed up kernel machines. In Advances in Neural Information Processing Systems 14 (NIPS), pp. 682-688, 2001.
Yang, C., Duraiswami, R., Gumerov, N. A., and Davis, L. Improved fast Gauss transform and efficient kernel density estimation. In Proceedings of the 2003 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, 2003.