# Optimal Sketching for Trace Estimation

Shuli Jiang (Robotics Institute, Carnegie Mellon University, shulij@andrew.cmu.edu) · Hai Pham (Language Technologies Institute, Carnegie Mellon University, htpham@cs.cmu.edu) · David P. Woodruff (Computer Science Department, Carnegie Mellon University, dwoodruf@cs.cmu.edu) · Qiuyi (Richard) Zhang (Google Brain, qiuyiz@google.com)

Matrix trace estimation is ubiquitous in machine learning applications and has traditionally relied on Hutchinson's method, which requires $O(\log(1/\delta)/\epsilon^2)$ matrix-vector product queries to achieve a $(1\pm\epsilon)$-multiplicative approximation to $\mathrm{tr}(A)$ with failure probability $\delta$ on positive semi-definite input matrices $A$. Recently, the Hutch++ algorithm was proposed, which reduces the number of matrix-vector queries from $O(1/\epsilon^2)$ to the optimal $O(1/\epsilon)$, and the algorithm succeeds with constant probability. However, in the high-probability setting, the non-adaptive Hutch++ algorithm suffers an extra $O(\sqrt{\log(1/\delta)})$ multiplicative factor in its query complexity. Non-adaptive methods are important, as they correspond to sketching algorithms, which are mergeable, highly parallelizable, and provide low-memory streaming algorithms as well as low-communication distributed protocols. In this work, we close the gap between non-adaptive and adaptive algorithms, showing that even non-adaptive algorithms can achieve $O(\sqrt{\log(1/\delta)}/\epsilon + \log(1/\delta))$ matrix-vector products. In addition, we prove matching lower bounds demonstrating that, up to a $\log\log(1/\delta)$ factor, no further improvement in the dependence on $\delta$ or $\epsilon$ is possible by any non-adaptive algorithm. Finally, our experiments demonstrate the superior performance of our sketch over the adaptive Hutch++ algorithm, which is less parallelizable, as well as over the non-adaptive Hutchinson's method.

## 1 Introduction

The problem of implicit matrix trace estimation arises naturally in a wide range of applications [1]. For example, when training a Gaussian process, a popular non-parametric kernel-based method, the calculation of the marginal log-likelihood contains a computationally heavy term, namely the log-determinant of the covariance matrix, $\log(\det(K))$, where $K \in \mathbb{R}^{n\times n}$ and $n$ is the number of data points. The canonical way of computing $\log(\det(K))$ is via a Cholesky decomposition of $K$, whose time complexity is $O(n^3)$. Since $\log(\det(K)) = \sum_{i=1}^n \log(\lambda_i)$, where the $\lambda_i$ are the eigenvalues of $K$, one can compute $\mathrm{tr}(\log(K))$ instead. Trace estimation combined with a polynomial approximation to $\log$ (e.g., Chebyshev polynomials or stochastic Lanczos quadrature) [2], or trace estimation combined with maximum entropy estimation [3], provides a fast way of estimating $\mathrm{tr}(\log(K))$ for large-scale data. Other popular applications of implicit trace estimation include counting triangles and computing the Estrada index in graphs [4, 5], approximating the generalized rank of a matrix [6], and studying non-convex loss landscapes via the Hessian matrix of large neural networks (NNs) [7, 8].

To define the problem, we consider the matrix-vector product model as formalized in [9, 10], where there is a real symmetric input matrix $A \in \mathbb{R}^{n\times n}$ that cannot be explicitly represented, but one has oracle access to $A$ via matrix-vector queries, i.e., one can obtain $Aq$ for any desired query vector $q \in \mathbb{R}^n$.
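To make the query model concrete, here is a minimal sketch of what "oracle access" means in code (illustrative only; the matrix, sizes, and names are ours, not the paper's): the algorithm only sees the map $q \mapsto Aq$ and never the entries of $A$.

```python
import numpy as np

# A minimal illustration of the matrix-vector query model: the algorithm only
# sees q -> A q, never the entries of A.  Here A = B B^T is a PSD matrix that is
# never materialized; each query costs two dense matrix-vector products.
rng = np.random.default_rng(0)
B = rng.standard_normal((10_000, 50))

def matvec(q):
    """Oracle access to A = B B^T: returns A q without forming A."""
    return B @ (B.T @ q)

q = rng.standard_normal(10_000)
print(matvec(q)[:3])   # everything the algorithm learns about A comes from such queries
```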
For example, due to the tremendous number of trainable parameters in large NNs, it is often prohibitive to compute or store the entire Hessian matrix $H$ of a loss function with respect to the parameters [7], which is often used to study the non-convex loss landscape. However, with Pearlmutter's trick [11] one can compute $Hq$ for any chosen vector $q$. The goal is to efficiently estimate the trace of $A$, denoted by $\mathrm{tr}(A)$, up to $\epsilon$ error, i.e., to compute a quantity within $(1\pm\epsilon)\mathrm{tr}(A)$. For efficiency, such algorithms are randomized and succeed with probability at least $1-\delta$. The minimum number of queries $q$ required to solve the problem is referred to as the query complexity.

Computing matrix-vector products $Aq$ through oracle access, however, can be costly. For example, computing Hessian-vector products $Hq$ on large NNs takes approximately twice the time of backpropagation. When estimating the eigendensity of $H$, one computes $\mathrm{tr}(f(H))$ for some density function $f$, and needs repeated access to the matrix-vector product oracle. As a result, even with Pearlmutter's trick and distributed computation on modern GPUs, it takes 20 hours to compute the eigendensity of a single Hessian $H$ with respect to the cross-entropy loss on the CIFAR-10 dataset [12], from a set of fixed weights for ResNet-18 [13], which has approximately 11 million parameters [7]. Thus, it is important to understand the fundamental limits of implicit trace estimation, i.e., the query complexity in terms of the desired approximation error $\epsilon$ and the failure probability $\delta$.

Hutchinson's method [14], a simple yet elegant randomized algorithm, is the ubiquitous workhorse for implicit trace estimation. Letting $Q = [q_1, \ldots, q_q] \in \mathbb{R}^{n\times q}$ be $q$ vectors with i.i.d. Gaussian or Rademacher (i.e., $\pm 1$ with equal probability) entries, Hutchinson's method returns an estimate of $\mathrm{tr}(A)$ as $\frac{1}{q}\sum_{i=1}^q q_i^T A q_i = \frac{1}{q}\mathrm{tr}(Q^T A Q)$. Although Hutchinson's method dates back to 1990, it is surprisingly not well understood on positive semi-definite (PSD) matrices. It was originally shown that for PSD matrices $A$, with the $q_i$ being Gaussian random vectors, $O(\log(1/\delta)/\epsilon^2)$ matrix-vector queries suffice to obtain a multiplicative $(1\pm\epsilon)$ approximation to $\mathrm{tr}(A)$ with probability at least $1-\delta$ [15]. A recent work [16] proposes a variance-reduced version of Hutchinson's method and shows that only $O(1/\epsilon)$ matrix-vector queries are needed to achieve a $(1\pm\epsilon)$-approximation for any PSD matrix with constant success probability, in contrast to the $O(1/\epsilon^2)$ matrix-vector queries needed by Hutchinson's original method. The key observation is that the variance of the estimated trace in Hutchinson's method is largest when there is a large gap between the top few eigenvalues and the remaining ones. Thus, by splitting the matrix-vector queries between approximating the top $O(1/\epsilon)$ eigenvalues, i.e., computing a rank-$O(1/\epsilon)$ approximation to $A$, and performing trace estimation on the remaining part of the spectrum, one needs only $O(1/\epsilon)$ queries in total to achieve a $(1\pm\epsilon)$ approximation to $\mathrm{tr}(A)$. Furthermore, [16] shows that $\Omega(1/\epsilon)$ queries are in fact necessary, up to a logarithmic factor, for any trace estimation algorithm that succeeds with constant probability. While [16] mainly focuses on improving the dependence on $\epsilon$ in the query complexity at constant failure probability, we focus on the dependence on the failure probability $\delta$.
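Hutchinson's estimator itself is only a few lines given a matrix-vector oracle. The following NumPy sketch is illustrative (function names and sizes are ours, not the paper's released code); the two algorithms shown next refine it.

```python
import numpy as np

def hutchinson(matvec, n, num_queries, rng):
    """Plain Hutchinson estimator: average of q^T (A q) over i.i.d. Rademacher
    query vectors q, using only non-adaptive matrix-vector products."""
    total = 0.0
    for _ in range(num_queries):
        q = rng.choice([-1.0, 1.0], size=n)   # Rademacher query vector
        total += q @ matvec(q)
    return total / num_queries

# Usage on an explicit PSD matrix (real applications only expose matvec).
rng = np.random.default_rng(0)
B = rng.standard_normal((300, 300))
A = B @ B.T
print(np.trace(A), hutchinson(lambda v: A @ v, 300, 100, rng))
```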
**Algorithm 1** Hutch++: stochastic trace estimation with adaptive matrix-vector queries

1. **Input:** matrix-vector multiplication oracle for a PSD matrix $A \in \mathbb{R}^{n\times n}$; number of queries $m$.
2. **Output:** approximation to $\mathrm{tr}(A)$.
3. Sample $S \in \mathbb{R}^{n\times m/3}$ and $G \in \mathbb{R}^{n\times m/3}$ with i.i.d. $N(0,1)$ entries.
4. Compute an orthonormal basis $Q \in \mathbb{R}^{n\times m/3}$ for the span of $AS$ via a QR decomposition.
5. **return** $t = \mathrm{tr}(Q^T A Q) + \frac{3}{m}\,\mathrm{tr}\!\big(G^T (I - QQ^T) A (I - QQ^T) G\big)$.

**Algorithm 2** NA-Hutch++: stochastic trace estimation with non-adaptive matrix-vector queries

1. **Input:** matrix-vector multiplication oracle for a PSD matrix $A \in \mathbb{R}^{n\times n}$; number of queries $m$.
2. **Output:** approximation to $\mathrm{tr}(A)$.
3. Fix constants $c_1, c_2, c_3$ such that $c_1 < c_2$ and $c_1 + c_2 + c_3 = 1$.
4. Sample $S \in \mathbb{R}^{n\times c_1 m}$, $R \in \mathbb{R}^{n\times c_2 m}$, and $G \in \mathbb{R}^{n\times c_3 m}$ with i.i.d. $N(0,1)$ entries.
5. $Z = AR$, $W = AS$.
6. **return** $t = \mathrm{tr}\!\big((S^T Z)^{+} (W^T Z)\big) + \frac{1}{c_3 m}\big(\mathrm{tr}(G^T A G) - \mathrm{tr}(G^T Z (S^T Z)^{+} W^T G)\big)$.

Achieving a low failure probability $\delta$ is important in applications where failures are highly undesirable, and the low failure probability regime is well studied in related areas such as compressed sensing [17], data stream algorithms [18, 19], distribution testing [20], and so on. While one can always reduce the failure probability from a constant to $\delta$ by performing $O(\log(1/\delta))$ independent repetitions and taking the median, this multiplicative overhead of $O(\log(1/\delta))$ can cause a huge slowdown in practice, e.g., in the examples above involving large Hessians.

Two algorithms were proposed in [16]: Hutch++ (Algorithm 1), which requires adaptively chosen matrix-vector queries, and NA-Hutch++ (Algorithm 2), which requires only non-adaptively chosen queries. We call the matrix-vector queries adaptively chosen if subsequent queries depend on previous queries $q$ and observations $Aq$, whereas the algorithm is non-adaptive if all queries can be chosen at once without any prior information about $A$. Note that Hutchinson's method uses only non-adaptive queries. [16] shows that Hutch++ can use $O(\sqrt{\log(1/\delta)}/\epsilon + \log(1/\delta))$ adaptive matrix-vector queries to achieve a $(1\pm\epsilon)$ approximation with probability at least $1-\delta$, while NA-Hutch++ can use $O(\log(1/\delta)/\epsilon)$ non-adaptive queries. Thus, in many parameter regimes the non-adaptive algorithm suffers an extra $\sqrt{\log(1/\delta)}$ multiplicative factor over the adaptive algorithm.

It is important to understand the query complexity of non-adaptive algorithms for trace estimation because the advantages of non-adaptivity are plentiful: algorithms that require only non-adaptive queries can be easily parallelized across multiple machines, while algorithms with adaptive queries are inherently sequential. Furthermore, non-adaptive algorithms correspond to sketching algorithms, which are the basis for many streaming algorithms with low memory [21] and distributed protocols with low communication overhead (for an example application to low-rank approximation, see [22]). We note that there are numerous works on estimating matrix norms in a data stream [23, 24, 25, 26], most of which use trace estimation as a subroutine.

### 1.1 Our Contributions

**Improving the Non-adaptive Query Complexity.** We give an improved analysis of the query complexity of the non-adaptive trace estimation algorithm NA-Hutch++ (Algorithm 2), based on a new low-rank approximation algorithm and analysis in the high-probability regime, instead of applying an off-the-shelf low-rank approximation algorithm as in [16] (a NumPy rendering of Algorithms 1 and 2 is sketched below).
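For concreteness, the two algorithms above can be rendered in a few lines of NumPy. This is a simplified sketch under our own naming and rounding conventions (the split $c_1 = c_3 = 1/4$, $c_2 = 1/2$ is the one used later in the experiments), not the authors' released implementation.

```python
import numpy as np

def hutchpp(matvec, n, m, rng):
    """Sketch of Hutch++ (Algorithm 1): m/3 queries build AS, m/3 queries are spent
    on the orthonormal basis Q, and m/3 Hutchinson queries handle the residual."""
    k = m // 3
    S = rng.standard_normal((n, k))
    G = rng.standard_normal((n, k))
    AS = np.column_stack([matvec(S[:, i]) for i in range(k)])
    Q, _ = np.linalg.qr(AS)                        # orthonormal basis for range(AS)
    AQ = np.column_stack([matvec(Q[:, i]) for i in range(Q.shape[1])])
    Gp = G - Q @ (Q.T @ G)                         # project queries off the top subspace
    AGp = np.column_stack([matvec(Gp[:, i]) for i in range(k)])
    return np.trace(Q.T @ AQ) + np.trace(Gp.T @ AGp) / k

def na_hutchpp(matvec, n, m, rng, c1=0.25, c2=0.5, c3=0.25):
    """Sketch of NA-Hutch++ (Algorithm 2): all query vectors are drawn up front."""
    s, r, g = int(c1 * m), int(c2 * m), int(c3 * m)
    S = rng.standard_normal((n, s))
    R = rng.standard_normal((n, r))
    G = rng.standard_normal((n, g))
    Z = np.column_stack([matvec(R[:, i]) for i in range(r)])   # Z = A R
    W = np.column_stack([matvec(S[:, i]) for i in range(s)])   # W = A S
    AG = np.column_stack([matvec(G[:, i]) for i in range(g)])
    P = np.linalg.pinv(S.T @ Z)                                # (S^T A R)^+
    low_rank_trace = np.trace(P @ (W.T @ Z))
    resid_trace = (np.trace(G.T @ AG) - np.trace(G.T @ Z @ P @ W.T @ G)) / g
    return low_rank_trace + resid_trace

# Example usage on an explicit PSD matrix:
# rng = np.random.default_rng(0); B = rng.standard_normal((500, 500)); A = B @ B.T
# print(np.trace(A), hutchpp(lambda v: A @ v, 500, 120, rng),
#       na_hutchpp(lambda v: A @ v, 500, 120, rng))
```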
Instead of the $O(\log(1/\delta)/\epsilon)$ queries shown in [16], we show that $O(\sqrt{\log(1/\delta)}/\epsilon + \log(1/\delta))$ non-adaptive queries suffice to achieve a multiplicative $(1\pm\epsilon)$ approximation of the trace with probability at least $1-\delta$, which matches the query complexity of the adaptive trace estimation algorithm Hutch++. Since our algorithm is non-adaptive, it can be used as a subroutine in streaming and distributed settings for estimating the trace, with lower memory than was previously possible for the same failure probability.

**Theorem 1.1** (Restatement of Theorem 3.1). Let $A$ be any PSD matrix. If NA-Hutch++ is implemented with $m = O\!\left(\frac{\sqrt{\log(1/\delta)}}{\epsilon} + \log(1/\delta)\right)$ matrix-vector multiplication queries, then with probability $1-\delta$, the output $t$ of NA-Hutch++ satisfies $(1-\epsilon)\mathrm{tr}(A) \le t \le (1+\epsilon)\mathrm{tr}(A)$.

The improved dependence on $\delta$ is perhaps surprising in the non-adaptive setting, as simply repeating a constant-probability algorithm would give an $O(\log(1/\delta)/\epsilon)$ dependence. Our non-adaptive algorithm is as good as the best known adaptive algorithm, and much better than previous non-adaptive algorithms [16, 14]. The key difference between our analysis and the analysis in [16] is in the number of non-adaptive matrix-vector queries needed to obtain an $O(1)$-approximate rank-$k$ approximation to $A$ in Frobenius norm. Specifically, to reduce the total number of matrix-vector queries, our queries are split between (1) computing $\tilde{A}$, a rank-$k$ approximation to the matrix $A$, and (2) performing trace estimation on $A - \tilde{A}$. Let $A_k = \arg\min_{\mathrm{rank}(A')\le k}\|A - A'\|_F$ be the best rank-$k$ approximation to $A$ in Frobenius norm. For our algorithm to work, we require $\|A - \tilde{A}\|_F \le O(1)\|A - A_k\|_F$ with probability $1-\delta$. Previous results from [27] show that the number of non-adaptive queries required to compute such an $\tilde{A}$ is $O(k\log(1/\delta))$, where each query is an i.i.d. Gaussian or Rademacher vector. We prove that $O(k + \log(1/\delta))$ non-adaptive Gaussian query vectors suffice to compute $\tilde{A}$. Low-rank approximation requires both a so-called subspace embedding and an approximate matrix product guarantee (see, e.g., [28] for a survey on sketching for low-rank approximation), and we show that both hold with the desired probability, with some case analysis, for Gaussian queries. A technical overview can be found in Section 3.

The improvement in the number of non-adaptive queries needed to achieve an $O(1)$-approximate rank-$k$ approximation has many other implications, which may be of independent interest. For example, since low-rank approximation algorithms are extensively used in streaming algorithms suitable for low-memory settings, this new result directly improves the space complexity of the state-of-the-art streaming algorithm for Principal Component Analysis (PCA) [29] from $O(d \cdot k\log(1/\delta))$ to $O(d \cdot (k + \log(1/\delta)))$ for constant approximation error $\epsilon$, where $d$ is the dimension of the input.

**Lower Bound.** Previously, no lower bounds were known on the query complexity in terms of $\delta$ in the high-probability setting. In this work, we give a matching lower bound for non-adaptive (i.e., sketching) algorithms for trace estimation, with novel techniques based on a new family of hard input distributions, showing that our improved $O(\sqrt{\log(1/\delta)}/\epsilon + \log(1/\delta))$ upper bound is optimal, up to a $\log\log(1/\delta)$ factor, for any $\epsilon \in (0, 1)$. The methods previously used to prove an $\Omega(1/\epsilon)$ lower bound with constant success probability (up to logarithmic factors) in [16] do not apply in the high-probability setting.
Indeed, [16] gives two lower bound methods based on reductions from two types of problems: (1) a communication complexity problem, and (2) a distribution testing problem between clean and negatively spiked random covariance matrices. Technique (1) does not apply, since there is no multi-round lower bound for the Gap-Hamming communication problem used in [16] that depends on $\delta$. One might think that since we are proving a non-adaptive lower bound, we could use a non-adaptive lower bound for Gap-Hamming (which exists, see [18]), but this does not work, because even the non-adaptive lower bound in [16] uses a 2-round lower bound for Gap-Hamming, and no such lower bound is known in terms of $\delta$. Technique (2) also does not apply, as it involves a $1/\epsilon \times 1/\epsilon$ matrix, which can be recovered exactly with $1/\epsilon$ queries; further, increasing the matrix dimensions would break the lower bound, as their two cases would no longer need to be distinguished. Thus, such a hard input distribution fails to show the additive $\Omega(\log(1/\delta))$ term in the lower bound.

Our starting point for a hard instance is a family of Wigner matrices (see Definition 2.1) shifted by a multiple of the identity matrix so that they are PSD. However, due to strong concentration properties of these matrices, they can only be used to provide a lower bound of $\Omega(\sqrt{\log(1/\delta)}/\epsilon)$ when $\epsilon < 1/\sqrt{\log(1/\delta)}$. Indeed, setting $\delta$ to be a constant in this case recovers the $\Omega(1/\epsilon)$ lower bound shown in [16], but via a completely different technique. For larger $\epsilon$, we consider a new distribution testing problem between clean Wigner matrices and the same distribution with a large rank-1 noisy PSD matrix added, and then argue that with probability roughly $\delta$, all non-adaptive queries have unusually tiny correlation with this rank-1 matrix, making the two distributions indistinguishable. This gives the desired additive $\Omega(\log(1/\delta))$ lower bound, up to a $\log\log(1/\delta)$ factor.

**Theorem 1.2** (Restatement of Theorem 4.1). Suppose $\mathcal{A}$ is a non-adaptive query-based algorithm that returns a $(1\pm\epsilon)$-multiplicative estimate of $\mathrm{tr}(A)$ for any PSD matrix $A$ with probability at least $1-\delta$. Then the number of matrix-vector queries must be at least
$$m = \Omega\!\left(\frac{\sqrt{\log(1/\delta)}}{\epsilon} + \frac{\log(1/\delta)}{\log\log(1/\delta)}\right).$$

### 1.2 Related Work

A summary of prior work on the query complexity of trace estimation for PSD matrices is given in Table 1. For the upper bounds, prior to the work of [30], analyses of implicit trace estimation mainly focused on the variance of the estimator for different types of query vectors. [30] gave the first upper bound on the query complexity, and the work of [15] improved the bounds in [30]. On the lower bound side, although [15] gives a necessary condition on the query complexity for Gaussian query vectors, this condition does not directly translate into a bound on the minimum number of query vectors. The work of [16] gives the first lower bound on the query complexity in terms of $\epsilon$, but it only holds for constant failure probability.
**Upper Bounds**

| Prior Work | Query Complexity | Query Vector Type | Failure Probability | Algorithm Type |
|---|---|---|---|---|
| [30] | $O(\log(1/\delta)/\epsilon^2)$ | Gaussian | $\delta$ | non-adaptive |
| [30] | $O(\log(\mathrm{rank}(A)/\delta)/\epsilon^2)$ | Rademacher | $\delta$ | non-adaptive |
| [15] | $O(\log(1/\delta)/\epsilon^2)$ | Gaussian, Rademacher | $\delta$ | non-adaptive |
| [16] | $O(\sqrt{\log(1/\delta)}/\epsilon + \log(1/\delta))$ | Gaussian, Rademacher | $\delta$ | adaptive |
| [16] | $O(\log(1/\delta)/\epsilon)$ | Gaussian, Rademacher | $\delta$ | non-adaptive |
| This Work | $O(\sqrt{\log(1/\delta)}/\epsilon + \log(1/\delta))$ | Gaussian | $\delta$ | non-adaptive |

**Lower Bounds**

| Prior Work | Query Complexity | Failure Probability | Algorithm Type |
|---|---|---|---|
| [16] | $\Omega(1/(\epsilon\log(1/\epsilon)))$ | constant | adaptive |
| [16] | $\Omega(1/\epsilon)$ | constant | non-adaptive |
| This Work | $\Omega\big(\sqrt{\log(1/\delta)}/\epsilon + \log(1/\delta)/\log\log(1/\delta)\big)$ | $\delta$ | non-adaptive |

Table 1: Upper and lower bounds on the query complexity for trace estimation of PSD matrices.

## 2 Problem Setting

**Notation.** A matrix $A \in \mathbb{R}^{n\times n}$ is symmetric positive semi-definite (PSD) if it is real, symmetric, and has non-negative eigenvalues; hence $x^T A x \ge 0$ for all $x \in \mathbb{R}^n$. Let $\mathrm{tr}(A) = \sum_{i=1}^n A_{ii}$ denote the trace of $A$. Let $\|A\|_F = \big(\sum_{i=1}^n \sum_{j=1}^n A_{ij}^2\big)^{1/2}$ denote the Frobenius norm and $\|A\|_{\mathrm{op}} = \sup_{\|v\|_2 = 1}\|Av\|_2$ the operator norm of $A$. Let $N(\mu, \sigma^2)$ denote the Gaussian distribution with mean $\mu$ and variance $\sigma^2$. Our analysis extensively relies on the following facts.

**Definition 2.1** (Gaussian and Wigner Random Matrices). We let $G \sim \mathcal{N}(n)$ denote an $n\times n$ random Gaussian matrix with i.i.d. $N(0,1)$ entries. We let $W \sim \mathcal{W}(n)$, $W = G + G^T$, denote an $n\times n$ Wigner matrix, where $G \sim \mathcal{N}(n)$.

**Fact 2.1** (Rotational Invariance of a Standard Gaussian). Let $R \in \mathbb{R}^{n\times n}$ be an orthonormal matrix, and let $g \in \mathbb{R}^n$ be a random vector with i.i.d. $N(0,1)$ entries. Then $Rg$ has the same distribution as $g$.

**Fact 2.2** (Upper and Lower Gaussian Tail Bounds). Let $Z \sim N(0,1)$ be a univariate Gaussian random variable. For any $t \ge 1$, $\Pr[|Z| \ge t] = \Theta\big(t^{-1}\exp(-t^2/2)\big)$.

## 3 An Improved Analysis of NA-Hutch++

Suppose we wish to compute a sketch so as to estimate the trace of a matrix $A$ up to a $(1\pm\epsilon)$ factor with success probability at least $1-\delta$; recall that we focus on the case where matrix-vector queries are made non-adaptively. For any algorithm that accomplishes this with small constant failure probability, one can simply repeat the procedure $O(\log(1/\delta))$ times to amplify the success probability to $1-\delta$. Since non-adaptive queries must be fixed before any observations are made, it may seem intuitive that the number of non-adaptive queries of NA-Hutch++ (Algorithm 2) should be $O(\log(1/\delta)/\epsilon)$, as shown in [16]. In this section, we give a proof sketch of why this can be reduced to $O(\sqrt{\log(1/\delta)}/\epsilon + \log(1/\delta))$, as stated in Theorem 3.1. All proof details are provided in the supplementary material.

**Theorem 3.1.** Let $A$ be a PSD matrix. If NA-Hutch++ is implemented with $m = O(\sqrt{\log(1/\delta)}/\epsilon + \log(1/\delta))$ matrix-vector multiplication queries, then with probability $1-\delta$, the output of NA-Hutch++, denoted by $t$, satisfies $(1-\epsilon)\mathrm{tr}(A) \le t \le (1+\epsilon)\mathrm{tr}(A)$.

NA-Hutch++ splits its matrix-vector queries between computing an $O(1)$-approximate rank-$k$ approximation $\tilde{A}$ and performing Hutchinson's estimate on the residual matrix $A - \tilde{A}$, which contains the small eigenvalues. The trade-off between the rank $k$ and the number $l$ of queries spent on estimating the small eigenvalues is summarized in Theorem 3.2.

**Theorem 3.2** (Theorem 4 of [16]). Let $A \in \mathbb{R}^{n\times n}$ be PSD, $\delta \in (0, 1/2)$, and $l, k \in \mathbb{N}$. Let $\tilde{A}$ and $\Delta$ be any matrices with $\mathrm{tr}(A) = \mathrm{tr}(\tilde{A}) + \mathrm{tr}(\Delta)$ and $\|\Delta\|_F \le O(1)\|A - A_k\|_F$, where $A_k = \arg\min_{\mathrm{rank}(A')\le k}\|A - A'\|_F$. Let $H_l(M)$ denote Hutchinson's trace estimator with $l$ queries on a matrix $M$.
For fixed constants $c, C$, if $l \ge c\log(1/\delta)$, then with probability $1-\delta$, for $Z = \mathrm{tr}(\tilde{A}) + H_l(\Delta)$, we have
$$|Z - \mathrm{tr}(A)| \le C\,\sqrt{\frac{\log(1/\delta)}{l\,k}}\;\mathrm{tr}(A).$$

The total number of matrix-vector queries directly depends on the number of non-adaptive queries required to compute an $O(1)$-approximate rank-$k$ approximation $\tilde{A}$. Consider $S \in \mathbb{R}^{n\times c_1 m}$ and $R \in \mathbb{R}^{n\times c_2 m}$ for constants $c_1, c_2 > 0$ as defined in Algorithm 2, and set the low-rank approximation of $A$ to be $\tilde{A} = AR(S^T A R)^{+}(AS)^T$. The standard analysis [16] applies a result on streaming low-rank approximation from [27], which requires $m = O(k\log(1/\delta))$ to guarantee $\|A - \tilde{A}\|_F \le O(1)\|A - A_k\|_F$ with probability $1-\delta$. [16] then sets $k = O(1/\epsilon)$ and $l = O(\log(1/\delta)/\epsilon)$ in Theorem 3.2 to obtain a $(1\pm\epsilon)$ approximation to $\mathrm{tr}(A)$. However, the right-hand side of Theorem 3.2 suggests that the optimal split between $k$ and $l$ should be $k = l$. The reason [16] cannot achieve this optimal split is the large number $m$ of queries it needs to compute the $O(1)$-approximate rank-$k$ approximation. We give an improved analysis of this step, which may be of independent interest.

To obtain $O(1)$ low-rank approximation error, we need the non-adaptive query matrices $S, R$ to satisfy two properties: a subspace embedding property (see Lemma 3.3) and an approximate matrix product guarantee for orthogonal subspaces (see Lemma 3.4). While it is known that $m = O(k + \log(1/\delta))$ suffices to achieve the first property, we show that $m = O(k + \log(1/\delta))$ also suffices for the second property when $S, R$ are matrices of i.i.d. Gaussian random variables, as stated in Lemma 3.4.

**Lemma 3.3** (Subspace Embedding (Theorem 6 of [28])). Let $\delta \in (0, 1/2)$ and $\epsilon \in (0, 1)$. Let $S \in \mathbb{R}^{r\times n}$ be a random matrix with i.i.d. $N(0, 1/r)$ entries. Then for any fixed matrix $A \in \mathbb{R}^{n\times d}$ spanning a $d$-dimensional subspace, and for $r = O((d + \log(1/\delta))/\epsilon^2)$, the following holds with probability $1-\delta$ simultaneously for all $x \in \mathbb{R}^d$: $\|SAx\|_2 = (1\pm\epsilon)\|Ax\|_2$.

**Lemma 3.4** (Approximate Matrix Product for Orthogonal Subspaces). Given $\delta \in (0, 1/2)$, let $U \in \mathbb{R}^{n\times k}$ and $W \in \mathbb{R}^{n\times p}$ be two matrices with orthonormal columns such that $U^T W = 0$, $p \ge \max(k, \log(1/\delta))$, $\mathrm{rank}(U) = k$, and $\mathrm{rank}(W) = p$. Let $S \in \mathbb{R}^{r\times n}$ be a random matrix with i.i.d. $N(0, 1/r)$ entries. For $r = O(k + \log(1/\delta))$, the following holds with probability $1-\delta$: $\|U^T S^T S W\|_F \le O(1)\|W\|_F$.

Note that we apply the above two lemmas with constant $\epsilon$. The proof intuition is as follows. Consider a sketching matrix $S$ with $r$ rows of i.i.d. $N(0, 1/r)$ entries, as in Lemma 3.4. The range of $U \in \mathbb{R}^{n\times k}$ corresponds to an orthonormal basis of a rank-$k$ low-rank approximation to $A$, and the range of $W \in \mathbb{R}^{n\times p}$ is its orthogonal complement. Both $SU$ and $SW$ are random matrices consisting of i.i.d. $N(0, 1/r)$ entries, so the task is to bound, in Frobenius norm, the product of two random Gaussian matrices with high probability. Intuitively, the size of the matrix product is proportional to the rank $k$ and inversely proportional to the sketch size $r$. The overall failure probability $\delta$, however, decreases as $k$ grows: the matrix product involves summing over more squared Gaussian random variables, i.e., $\chi^2$ random variables, and thus becomes more concentrated. We show that for $k \ge \log(1/\delta)$, a sketch size of $O(k)$ suffices, since the failure probability of each $\chi^2$ random variable is small enough to pay for a union bound over $k$ terms. On the other hand, when $k < \log(1/\delta)$, we show that $r = O(\log(1/\delta))$ suffices for the union bound. Combining the two cases gives $r = O(k + \log(1/\delta))$.
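As a quick sanity check of Lemma 3.4 (an illustrative experiment with sizes we chose; it is not part of the paper's proofs), one can sample the two orthogonal subspaces and a Gaussian sketch with $r = k + \log(1/\delta)$ rows and observe that the sketched cross term remains a constant fraction of $\|W\|_F$:

```python
import numpy as np

rng = np.random.default_rng(0)
n, k, delta = 2000, 20, 1e-6
p = n - k
r = int(k + np.log(1 / delta))              # sketch size r = O(k + log(1/delta))

# Orthonormal U (rank-k range) and W (its orthogonal complement), so U^T W = 0.
full, _ = np.linalg.qr(rng.standard_normal((n, n)))
U, W = full[:, :k], full[:, k:]

# Gaussian sketch with N(0, 1/r) entries, as in Lemmas 3.3 and 3.4.
ratios = []
for _ in range(20):
    S = rng.standard_normal((r, n)) / np.sqrt(r)
    ratios.append(np.linalg.norm(U.T @ S.T @ (S @ W), "fro")
                  / np.linalg.norm(W, "fro"))
print(max(ratios))   # stays bounded by a small constant across trials
```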
Having established the above, we next show that the low-rank approximation error $\|A - \tilde{A}\|_F$ is upper bounded in terms of: 1) the inflation of eigenvalues from applying a sketching matrix $S$, as in Lemma 3.3; and 2) the approximate product between the range of a low-rank approximation to $A$ and its orthogonal complement, as in Lemma 3.4. Together these show that $m = O(k + \log(1/\delta))$ suffices for $\tilde{A}$ to be an $O(1)$-approximate rank-$k$ approximation to $A$ with probability $1-\delta$, as stated in Theorem 3.5. Note that in both Lemma 3.3 and Lemma 3.4, the entries of the random matrix are scaled Gaussian random variables $N(0, 1/r)$. However, when one sets the low-rank approximation to $\tilde{A} = AR(S^T A R)^{+}(AS)^T$, the scaling cancels, and one can use standard Gaussians in the sketching matrices for convenience, as in Theorem 3.5.

**Theorem 3.5.** Let $A \in \mathbb{R}^{n\times n}$ be an arbitrary PSD matrix, and let $A_k = \arg\min_{\mathrm{rank}(A')\le k}\|A - A'\|_F$ be the optimal rank-$k$ approximation to $A$ in Frobenius norm. If $S \in \mathbb{R}^{n\times m}$ and $R \in \mathbb{R}^{n\times cm}$ are random matrices with i.i.d. $N(0, 1)$ entries for some fixed constant $c > 0$, with $m = O(k + \log(1/\delta))$, then with probability $1-\delta$ the matrix $\tilde{A} = (AR)(S^T A R)^{+}(AS)^T$ satisfies $\|A - \tilde{A}\|_F \le O(1)\|A - A_k\|_F$.

This improved result enables us to choose $k = l = O(\sqrt{\log(1/\delta)}/\epsilon)$ in Theorem 3.2; combined with Theorem 3.5, this shows that only $O(\sqrt{\log(1/\delta)}/\epsilon + \log(1/\delta))$ matrix-vector queries are needed to output a number in $(1\pm\epsilon)\mathrm{tr}(A)$ with probability $1-\delta$, as we conclude in Theorem 3.1.

## 4 Lower Bounds

In this section, we show that our upper bound on the query complexity of non-adaptive trace estimation is tight, up to a factor of $O(\log\log(1/\delta))$.

**Theorem 4.1** (Lower Bound for Non-Adaptive Queries). Let $\epsilon \in (0,1)$. Any algorithm that accesses a real PSD matrix $A$ through matrix-vector multiplication queries $Aq_1, Aq_2, \ldots, Aq_m$, where $q_1, \ldots, q_m$ are real-valued, non-adaptively chosen vectors, requires
$$m = \Omega\!\left(\frac{\sqrt{\log(1/\delta)}}{\epsilon} + \frac{\log(1/\delta)}{\log\log(1/\delta)}\right)$$
queries to output an estimate $t$ such that with probability at least $1-\delta$, $(1-\epsilon)\mathrm{tr}(A) \le t \le (1+\epsilon)\mathrm{tr}(A)$.

Our lower bound hinges on two separate cases. We first show an $\Omega(\sqrt{\log(1/\delta)}/\epsilon)$ lower bound in Section 4.1 whenever $\epsilon = O(1/\sqrt{\log(1/\delta)})$. Second, we show an $\Omega\big(\frac{\log(1/\delta)}{\log\log(1/\delta)}\big)$ lower bound in Section 4.2 that applies to any $\epsilon \in (0,1)$. Observe that for $\epsilon < 1/\sqrt{\log(1/\delta)}$ the first lower bound holds, while for $\epsilon \ge 1/\sqrt{\log(1/\delta)}$ the second lower bound dominates. Therefore, combining the two implies that for every $\epsilon$ and $\delta$, the query complexity of $O(\sqrt{\log(1/\delta)}/\epsilon + \log(1/\delta))$ for non-adaptive trace estimation is tight, up to a $\log\log(1/\delta)$ factor. We now give a proof sketch of the two lower bounds; all details are in the supplementary material.

Our lower bounds crucially make use of the rotational invariance of the Gaussian distribution (see Fact 2.1) to argue that the first $q$ queries are, w.l.o.g., the standard basis vectors $e_1, \ldots, e_q$; note that the queries can be assumed to be orthonormal. Both lower bounds use the family of $n\times n$ Wigner matrices (see Definition 2.1) with shifted mean, i.e., $W + C\cdot I$ for some $C > 0$ depending on $\|W\|_{\mathrm{op}}$, as part of the hard input distribution. The mean shift ensures that the final instance is PSD with high probability.

### 4.1 Case 1: Lower Bound for Small ε

The first lower bound is based on the observation that, due to rotational invariance, the not-yet-queried part of $W$ is distributed almost identically to $W$ itself, up to a mean shift, conditioned on the queried part, no matter how the queries are chosen.
The sum of the diagonal entries of the not-yet-queried part is Gaussian, and it still has too much deviation to determine the overall trace of the input up to a $(1\pm\epsilon)$ factor when $n = \sqrt{\log(1/\delta)}/\epsilon$ and $\epsilon < 1/\sqrt{\log(1/\delta)}$.

**Theorem 4.2** (Lower Bound for Small ε). For all $\epsilon = O(1/\sqrt{\log(1/\delta)})$, any algorithm that, for every PSD matrix $A$, succeeds with probability at least $1-\delta$ in outputting an estimate $t$ such that $(1-\epsilon)\mathrm{tr}(A) \le t \le (1+\epsilon)\mathrm{tr}(A)$ requires $m = \Omega(\sqrt{\log(1/\delta)}/\epsilon)$ matrix-vector queries.

### 4.2 Case 2: Lower Bound for Every ε

The second lower bound, presented in Theorem 4.3, is shown via a reduction to a distribution testing problem between the two distributions presented in Problem 4.4.

**Theorem 4.3** (Lower Bound on Non-adaptive Queries for PSD Matrices). Let $\epsilon \in (0,1)$. Any algorithm that accesses a real PSD matrix $A$ through matrix-vector queries $Aq_1, Aq_2, \ldots, Aq_m$, where $q_1, \ldots, q_m$ are real-valued, non-adaptively chosen vectors, requires $m = \Omega\big(\frac{\log(1/\delta)}{\log\log(1/\delta)}\big)$ queries to output an estimate $t$ such that with probability at least $1-\delta$, $(1-\epsilon)\mathrm{tr}(A) \le t \le (1+\epsilon)\mathrm{tr}(A)$.

In the distribution testing problem, we consider Wigner matrices $W \sim \mathcal{W}(\log(1/\delta))$ shifted by $\Theta(\sqrt{\log(1/\delta)})\cdot I$. The problem requires an algorithm to distinguish between a sample $Q$ from this shifted Wigner distribution and a sample $P$ from the same distribution shifted by an additional random rank-1 PSD matrix. The rank-1 matrix is the outer product of a random vector with itself, scaled so as to provide a constant-factor gap between the traces under $P$ and $Q$.

**Problem 4.4** (Hard PSD Matrix Distribution Test). Given $\delta \in (0, 1/2)$, set $n = \log(1/\delta)$, and choose $g \in \mathbb{R}^n$ to be an independent random vector with i.i.d. $N(0,1)$ entries. Consider two distributions:
- Distribution $P$ on matrices $C\log^{3/2}\!\big(\tfrac{1}{\delta}\big)\,\frac{1}{\|g\|_2^2}\,gg^T + W + 2\sqrt{\log\!\big(\tfrac{1}{\delta}\big)}\,I$, for some fixed constant $C > 1$;
- Distribution $Q$ on matrices $W + 2\sqrt{\log\!\big(\tfrac{1}{\delta}\big)}\,I$,

where $W \sim \mathcal{W}(n)$ as in Definition 2.1. Let $A$ be a random matrix drawn from either $P$ or $Q$ with equal probability. Consider any algorithm which, for a fixed query matrix $Q \in \mathbb{R}^{n\times q}$, observes $AQ$ and guesses whether $A \sim P$ or $A \sim Q$ with success probability at least $1-\delta$.

We then show in Lemma 4.5 that any algorithm which succeeds with probability $1-\delta$ in distinguishing $P$ from $Q$ requires $\Omega\big(\frac{\log(1/\delta)}{\log\log(1/\delta)}\big)$ non-adaptive matrix-vector queries. Due to rotational invariance, and since the queries are non-adaptive, the first $q$ queries are w.l.o.g. the first $q$ standard unit vectors. By Fact 2.2, with probability at least $1/\log(1/\delta)$, a single coordinate of $g$ has absolute value at most $1/\log(1/\delta)$. By independence, with probability at least $(1/\log(1/\delta))^q$, all of the first $q$ coordinates of $g$ are simultaneously small, and thus give the algorithm almost no information to distinguish $P$ from $Q$; this probability is at least $\delta$ if $q = O\big(\frac{\log(1/\delta)}{\log\log(1/\delta)}\big)$.

**Lemma 4.5** (Hardness of Problem 4.4). For a non-adaptive query matrix $Q \in \mathbb{R}^{n\times q}$ as in Problem 4.4, given $\delta \in (0, 1/2)$ and $n = \log(1/\delta)$, if $q = o\big(\frac{\log(1/\delta)}{\log\log(1/\delta)}\big)$, then no algorithm can solve Problem 4.4 with success probability $1-\delta$.

## 5 Experiments

**Part I: Comparison of Failure Probability and Running Time.** We give sequential and parallel implementations of the non-adaptive trace estimation algorithm NA-Hutch++ (Algorithm 2), the adaptive algorithm Hutch++ (Algorithm 1), and Hutchinson's method [14]. We specifically explore the benefits of the non-adaptive algorithm in a parallel setting, where all algorithms have parallel access to a matrix-vector oracle. All code is included in the supplementary material and is publicly released at https://github.com/11hifish/OptSketchTraceEst.
**Metrics.** We say an estimate failed if, on input matrix $A$, the estimate $t$ returned by an algorithm satisfies either $t < (1-\epsilon)\mathrm{tr}(A)$ or $t > (1+\epsilon)\mathrm{tr}(A)$. We measure the performance of each algorithm by: 1) the number of failed estimates across 100 random trials, 2) the total wall-clock time to perform 100 trials with sequential execution, and 3) the total wall-clock time to perform 100 trials with parallel execution.

**Datasets and Applications.** We consider different applications of trace estimation, on synthetic as well as real-world datasets. In many applications, trace estimation is used to estimate not only $\mathrm{tr}(A)$ but also $\mathrm{tr}(f(A))$ for some function $f: \mathbb{R} \to \mathbb{R}$. Letting $A = V\Sigma V^T$ be the eigendecomposition of $A$, we have $f(A) := V f(\Sigma) V^T$, where $f(\Sigma)$ denotes applying $f$ to each of the eigenvalues. Because eigendecompositions of large matrices are expensive, the matrix-vector product $f(A)v$ is often approximated by polynomials computed implicitly via an oracle algorithm applied to a random vector $v$. The Lanczos algorithm is a very popular choice due to its superior performance (e.g., [31, 2, 7]). We compare the performance of the trace estimation algorithms on the following applications and datasets, using the Lanczos algorithm as the matrix-vector oracle on a random vector $v$ where indicated.

**Fast Decay Spectrum.** We first consider a synthetic dataset of size 5000 with a fast-decaying spectrum, following [16]: a diagonal matrix $A$ with $i$-th diagonal entry $A_{ii} = 1/i^2$. Matrices with fast-decaying spectra cause high variance in Hutchinson's estimated trace, but low variance for Hutch++ and NA-Hutch++. The matrix-vector oracle is simply $Av$.

**Graph Estrada Index.** Given a binary adjacency matrix $A \in \{0,1\}^{n\times n}$ of a graph, the graph Estrada index is defined as $\mathrm{tr}(\exp(A))$, which measures the strength of connectivity within the graph. Following [16], we use Roget's Thesaurus semantic graph (http://vlado.fmf.uni-lj.si/pub/networks/data/) with 1022 nodes, originally studied in [5], and use the Lanczos algorithm with 40 steps to approximate $\exp(A)v$ as the matrix-vector oracle.

**Graph Triangle Counting.** Given a binary adjacency matrix $A \in \{0,1\}^{n\times n}$ of a graph, the number of triangles in the graph is $\frac{1}{6}\mathrm{tr}(A^3)$. This is an important graph summary with numerous applications in graph mining and social network analysis (e.g., [32, 33]). We use arxiv_cm, the Condensed Matter collaboration network dataset from arXiv (https://snap.stanford.edu/data/ca-CondMat.html), a common benchmark graph with 23,133 nodes and 173,361 triangles. The matrix-vector oracle is $A^3 v$. Note that $A^3$ in this case is not necessarily a PSD matrix.

**Log-likelihood Estimation for Gaussian Processes.** When performing maximum likelihood estimation (MLE) to optimize the hyperparameters of a kernel matrix $A$ for Gaussian processes, one needs to compute the gradient of the log-determinant of $A$, which involves estimating $\mathrm{tr}(A^{-1})$ [2]. Following [2], we use the precipitation dataset (https://catalog.data.gov/dataset/u-s-hourly-precipitation-data), which consists of the measured amount of precipitation per day collected from 5,500 weather stations in the US in 2010. We sample 1,000 data points and construct a covariance matrix $A$ using the RBF kernel with length scale 1. We use the Lanczos algorithm with 40 steps, as in [2], to approximate $A^{-1}v$ as the matrix-vector oracle.
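Returning to the triangle-counting application for a concrete instance of the $\mathrm{tr}(f(A))$ pattern: the oracle is just $v \mapsto A(A(Av))$. The following small sketch uses a random toy graph of our own choosing (not the arxiv_cm pipeline), and the exact count is computed directly for comparison.

```python
import numpy as np

# Illustrative only: count triangles of a small random graph as tr(A^3)/6,
# estimated with Hutchinson queries against the oracle v -> A(A(A v)).
rng = np.random.default_rng(0)
n = 400
A = (rng.random((n, n)) < 0.02).astype(float)
A = np.triu(A, 1); A = A + A.T                      # symmetric 0/1 adjacency matrix

def cube_matvec(v):
    return A @ (A @ (A @ v))                        # A^3 v via three matvecs

num_queries = 200
est = 0.0
for _ in range(num_queries):
    q = rng.choice([-1.0, 1.0], size=n)
    est += q @ cube_matvec(q)

exact = np.trace(A @ A @ A) / 6                     # exact count, feasible at this size
print(exact, est / num_queries / 6)                 # estimate is noisy but in the right ballpark
```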
Figure 1: Performance comparison of Hutch++, NA-Hutch++, and Hutchinson over 4 datasets (mean ± 1 std. across 10 random runs). The approximation error for all settings is set to ε = 0.01. Both Hutch++ and NA-Hutch++ outperform Hutchinson in terms of failed estimates. The parallel version of the non-adaptive NA-Hutch++ is significantly faster than the adaptive Hutch++, making it more practical in real-world applications.

**Implementation.** We use random vectors with i.i.d. $N(0,1)$ entries as the query vectors for all algorithms. NA-Hutch++ requires additional hyperparameters specifying how the queries are split among the random matrices $S, R, G$ (see Algorithm 2); we set $c_1 = c_3 = 1/4$ and $c_2 = 1/2$, as [16] suggests. For each setting, we conduct 10 random runs and report the mean number of failed estimates across 100 trials and the mean total wall-clock time (in seconds) for 100 trials, with one standard deviation. For all experiments, we fix the error parameter $\epsilon = 0.01$ and measure the performance of each algorithm with $\{10, 30, 50, \ldots, 130, 150\}$ queries on synthetic, roget, and precipitation, and with $\{100, 200, \ldots, 700, 800\}$ queries on arxiv_cm, which is significantly larger. The parallel versions are implemented using the Python multiprocessing package (https://docs.python.org/3/library/multiprocessing.html). Due to the large size of arxiv_cm, we use sparse_dot_mkl (https://github.com/flatironinstitute/sparse_dot), a Python wrapper for the Intel Math Kernel Library (MKL) that supports fast sparse matrix-vector multiplication, to implement the matrix-vector oracle for this dataset. During the experiments, we launch a pool of 40 worker processes for parallel execution. All experiments are conducted on machines with 40 CPU cores.

**Results and Discussion.** The results of Hutch++, NA-Hutch++, and Hutchinson over the 4 datasets are presented in Figure 1. The performance of all algorithms is consistent across different datasets with different matrix-vector oracles, even on the non-PSD instance from arxiv_cm. Given the same number of queries, Hutch++ and NA-Hutch++ both give significantly fewer failed estimates than Hutchinson, particularly on PSD instances. It is not surprising that Hutchinson fails to achieve a $(1\pm\epsilon)$-approximation to the trace most of the time, due to the high variance of its estimates, given a small number of queries and a strict accuracy requirement ($\epsilon = 0.01$). For computational cost, the difference in running time among the algorithms is insignificant in sequential execution. In parallel execution, however, Hutch++ becomes significantly slower than NA-Hutch++ and Hutchinson, which show very little difference in their parallel running time. Hutch++ suffers from a slow running time due to its adaptively chosen queries, despite the fact that it consistently gives the fewest failed estimates. NA-Hutch++ thus gives the best trade-off between a high success probability in estimating the trace accurately with only a few queries, and a fast parallel running time due to its non-adaptive queries, which makes NA-Hutch++ more practical on large, real-world datasets.
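The parallel advantage of non-adaptive queries comes simply from the fact that all query vectors exist up front, so the expensive matrix-vector products can be answered concurrently. A minimal sketch of the idea using Python multiprocessing (our own toy example with an explicit matrix; it is not the released experiment code):

```python
import numpy as np
from multiprocessing import Pool

n = 2000
rng = np.random.default_rng(0)
B = rng.standard_normal((n, n))
A = B @ B.T                                     # stands in for the matrix-vector oracle

def answer_query(q):
    """One matrix-vector query; in the experiments this is the expensive step."""
    return A @ q

if __name__ == "__main__":
    m = 64
    Q = rng.standard_normal((n, m))             # all non-adaptive queries, drawn up front
    with Pool(processes=8) as pool:             # queries answered in parallel
        answers = pool.map(answer_query, [Q[:, i] for i in range(m)])
    AQ = np.column_stack(answers)
    # Any sketching-based estimator (e.g., NA-Hutch++) can now be computed from Q and AQ.
    print(np.trace(Q.T @ AQ) / m, np.trace(A))  # plain Hutchinson estimate vs. exact trace
```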
We remark that although the Lanczos algorithm is itself adaptive, even with a sequential matrix-vector oracle our non-adaptive trace estimation can still exploit much more parallelism than adaptive methods, as shown by our experiments.

**Part II: Comparison of Performance on Log-Determinant Estimation.** We give an additional experiment comparing the performance of Hutch++, NA-Hutch++, and Hutchinson on estimating $\log(\det(K)) = \mathrm{tr}(\log(K))$ for a covariance matrix $K$. Estimating $\log(\det(K))$ is required when computing the marginal log-likelihood in large-scale Gaussian process models. Recently, [3] proposed a maximum-entropy-based method for log-determinant estimation, which uses Hutchinson's trace estimator as a subroutine to estimate up to the $k$-th moments of the eigenvalues, for a fixed $k$. The $i$-th moment of the eigenvalues is $\mathbb{E}[\lambda^i] = \frac{1}{n}\mathrm{tr}(K^i)$, where $K$ is an $n\times n$ PSD matrix and $\lambda$ is the vector of eigenvalues. [3] shows that their approach outperforms traditional Chebyshev/Lanczos polynomial methods for computing $\log(\det(K))$ in terms of the absolute value of the relative error, i.e., $|\text{estimated log-determinant} - \text{true log-determinant}| / |\text{true log-determinant}|$.

We compare the estimated log-determinant of a covariance matrix when different trace estimation subroutines are used to estimate the moments of the eigenvalues. We use 2 PSD matrices from the UFL Sparse Matrix Collection (https://sparse.tamu.edu/): bcsstk20 (size 485 × 485) and bcsstm08 (size 1074 × 1074), with maximum moments varying over $\{10, 15, \ldots, 30\}$ and 30 matrix-vector queries. We repeat each run 100 times and report the mean estimated log-determinant for each trace estimation subroutine. While an improved estimate of the eigenvalue moments does not necessarily lead to an improved estimate of the log-determinant, it is not hard to show that accurate moment estimation does lead to improved log-determinant estimation in extreme cases where the eigenspectrum of $K$ contains a few very large eigenvalues. Such cases cause Hutchinson's method to have very large variance, while our method reduces the variance by first removing the large eigenvalues. The eigenspectra of both input matrices and the results are presented in Figure 2.

Figure 2: The eigenspectra of the two datasets and the performance comparison of Hutch++, NA-Hutch++, and Hutchinson on maximum-entropy-based log-determinant estimation.

## 6 Conclusion

We determine an optimal $\Theta(\sqrt{\log(1/\delta)}/\epsilon + \log(1/\delta))$ bound, up to a $\log\log(1/\delta)$ factor, on the number of queries needed by non-adaptive trace estimation algorithms to achieve a $(1\pm\epsilon)$ approximation of the trace with probability $1-\delta$. This involves both designing a new algorithm and proving a new lower bound. We conduct experiments on synthetic and real-world datasets and confirm that our non-adaptive algorithm has a higher success probability than Hutchinson's method for the same sketch size, and a significantly faster parallel running time than adaptive algorithms.

## Acknowledgments and Disclosure of Funding

We would like to thank the anonymous reviewers for their feedback. We are also grateful to Raphael Meyer for many detailed comments on the lower bound proofs. D. Woodruff was supported by NSF CCF-1815840, Office of Naval Research grant N00014-18-1-2562, and a Simons Investigator Award.

## References

[1] Shashanka Ubaru and Yousef Saad. Applications of trace estimation techniques. In Tomáš Kozubek, Martin Čermák, Petr Tichý, Radim Blaheta, Jakub Šístek, Dalibor Lukáš, and Jiří Jaroš, editors, High Performance Computing in Science and Engineering, pages 19–33, Cham, 2018. Springer International Publishing.
[2] Kun Dong, David Eriksson, Hannes Nickisch, David Bindel, and Andrew G. Wilson. Scalable log determinants for Gaussian process kernel learning. In Advances in Neural Information Processing Systems, volume 30. Curran Associates, Inc., 2017.

[3] Jack K. Fitzsimons, Diego Granziol, Kurt Cutajar, Michael A. Osborne, Maurizio Filippone, and Stephen J. Roberts. Entropic trace estimates for log determinants. In ECML/PKDD (1), volume 10534 of Lecture Notes in Computer Science, pages 323–338. Springer, 2017.

[4] Haim Avron. Counting triangles in large graphs using randomized matrix trace estimation. August 2010.

[5] Ernesto Estrada and Naomichi Hatano. Communicability in complex networks. Physical Review E, 77:036111, March 2008.

[6] Yuchen Zhang, Martin Wainwright, and Michael Jordan. Distributed estimation of generalized matrix rank: Efficient algorithms and lower bounds. In International Conference on Machine Learning, pages 457–465. PMLR, 2015.

[7] Behrooz Ghorbani, Shankar Krishnan, and Ying Xiao. An investigation into neural net optimization via Hessian eigenvalue density. In Proceedings of the 36th International Conference on Machine Learning, volume 97 of Proceedings of Machine Learning Research, pages 2232–2241. PMLR, 2019.

[8] Zhewei Yao, Amir Gholami, Kurt Keutzer, and Michael Mahoney. PyHessian: Neural networks through the lens of the Hessian, 2020.

[9] Xiaoming Sun, David P. Woodruff, Guang Yang, and Jialin Zhang. Querying a matrix through matrix-vector products. ACM Transactions on Algorithms, 17(4), October 2021.

[10] Cyrus Rashtchian, David P. Woodruff, and Hanlin Zhu. Vector-matrix-vector queries for solving linear algebra, statistics, and graph problems. In APPROX/RANDOM, 2020.

[11] Barak A. Pearlmutter. Fast exact multiplication by the Hessian. Neural Computation, 6:147–160, 1994.

[12] Alex Krizhevsky, Geoffrey Hinton, et al. Learning multiple layers of features from tiny images. 2009.

[13] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 770–778, 2016.

[14] Michael F. Hutchinson. A stochastic estimator of the trace of the influence matrix for Laplacian smoothing splines. Volume 19, pages 433–450, 1990.

[15] Farbod Roosta-Khorasani and Uri Ascher. Improved bounds on sample size for implicit matrix trace estimators. Foundations of Computational Mathematics, 15(5):1187–1212, October 2015.

[16] Raphael A. Meyer, Cameron Musco, Christopher Musco, and David P. Woodruff. Hutch++: Optimal stochastic trace estimation, 2020.

[17] Anna C. Gilbert, Hung Q. Ngo, Ely Porat, Atri Rudra, and Martin J. Strauss. l2/l2-foreach sparse recovery with low risk. CoRR, abs/1304.6232, 2013.

[18] T. S. Jayram and David P. Woodruff. Optimal bounds for Johnson-Lindenstrauss transforms and streaming problems with sub-constant error. In Proceedings of the Twenty-Second Annual ACM-SIAM Symposium on Discrete Algorithms (SODA 2011), pages 1–10, 2011.

[19] Akshay Kamath, Eric Price, and David P. Woodruff. A simple proof of a new set disjointness with applications to data streams, 2021.

[20] Ilias Diakonikolas, Themis Gouleakis, Daniel M. Kane, John Peebles, and Eric Price. Optimal testing of discrete distributions with high probability. CoRR, abs/2009.06540, 2020.

[21] S. Muthukrishnan. Data streams: Algorithms and applications. Foundations and Trends in Theoretical Computer Science, 1(2), 2005.
[22] Christos Boutsidis, David P. Woodruff, and Peilin Zhong. Optimal principal component analysis in distributed and streaming models. In Proceedings of the 48th Annual ACM SIGACT Symposium on Theory of Computing (STOC 2016), pages 236–249, 2016.

[23] Yi Li, Huy L. Nguyen, and David P. Woodruff. On sketching matrix norms and the top singular vector. In Proceedings of the Twenty-Fifth Annual ACM-SIAM Symposium on Discrete Algorithms (SODA 2014), pages 1562–1581, 2014.

[24] Yi Li and David P. Woodruff. On approximating functions of the singular values in a stream. In Proceedings of the 48th Annual ACM SIGACT Symposium on Theory of Computing (STOC 2016), pages 726–739, 2016.

[25] Vladimir Braverman, Stephen R. Chestnut, Robert Krauthgamer, Yi Li, David P. Woodruff, and Lin F. Yang. Matrix norms in data streams: Faster, multi-pass and row-order. In Proceedings of the 35th International Conference on Machine Learning (ICML 2018), pages 648–657, 2018.

[26] Vladimir Braverman, Robert Krauthgamer, Aditya Krishnan, and Roi Sinoff. Schatten norms in matrix streams: Hello sparsity, goodbye dimension. In Proceedings of the 37th International Conference on Machine Learning (ICML 2020), pages 1100–1110, 2020.

[27] Kenneth L. Clarkson and David P. Woodruff. Numerical linear algebra in the streaming model. In Proceedings of the Forty-First Annual ACM Symposium on Theory of Computing (STOC 2009), pages 205–214, 2009. Association for Computing Machinery.

[28] David P. Woodruff. Sketching as a tool for numerical linear algebra. arXiv preprint arXiv:1411.4357, 2014.

[29] Christos Boutsidis, David P. Woodruff, and Peilin Zhong. Optimal principal component analysis in distributed and streaming models. In Proceedings of the Forty-Eighth Annual ACM Symposium on Theory of Computing (STOC 2016), pages 236–249, 2016. Association for Computing Machinery.

[30] Haim Avron and Sivan Toledo. Randomized algorithms for estimating the trace of an implicit symmetric positive semi-definite matrix. Journal of the ACM, 58(2), April 2011.

[31] Lin Lin, Yousef Saad, and Chao Yang. Approximating spectral densities of large matrices. SIAM Review, 58, August 2013.

[32] Mihail N. Kolountzakis, Gary L. Miller, Richard Peng, and Charalampos E. Tsourakakis. Efficient triangle counting in large graphs via degree-based vertex partitioning. Lecture Notes in Computer Science, pages 15–24, 2010.

[33] A. Pavan, Kanat Tangwongsan, Srikanta Tirthapura, and Kun-Lung Wu. Counting and sampling triangles from a graph stream. Proceedings of the VLDB Endowment, 6(14):1870–1881, September 2013.

## Checklist

1. For all authors...
   (a) Do the main claims made in the abstract and introduction accurately reflect the paper's contributions and scope? [Yes]
   (b) Did you describe the limitations of your work? [Yes] We clearly state the scope in which our theorems and lower bounds hold. See Section 1 for an overview and Sections 3 and 4 for more details.
   (c) Did you discuss any potential negative societal impacts of your work? [N/A] As our work is fundamental and theoretical, we do not see immediate potential negative societal impacts.
   (d) Have you read the ethics review guidelines and ensured that your paper conforms to them? [Yes]
2. If you are including theoretical results...
   (a) Did you state the full set of assumptions of all theoretical results? [Yes] See Section 3 and Section 4 for our theoretical results, including all assumptions.
   (b) Did you include complete proofs of all theoretical results? [Yes] In the main paper, we give intuitive proof sketches due to limited space; see the supplementary material for all proof details.
3. If you ran experiments...
   (a) Did you include the code, data, and instructions needed to reproduce the main experimental results (either in the supplemental material or as a URL)? [Yes] See Section 5 for a complete description of the datasets and all implementation details. All datasets used in the experiments are public, and links to the datasets are provided in Section 5. All code for the experiments is included in the supplementary material.
   (b) Did you specify all the training details (e.g., data splits, hyperparameters, how they were chosen)? [Yes] See Section 5 for all details on how the experiments are conducted and how the hyperparameters are chosen.
   (c) Did you report error bars (e.g., with respect to the random seed after running experiments multiple times)? [Yes] We run every experimental setting for 10 random runs and report the mean and one standard deviation for each metric. See Section 5 for details.
   (d) Did you include the total amount of compute and the type of resources used (e.g., type of GPUs, internal cluster, or cloud provider)? [Yes] See Section 5 for details on total computational time and the types of machines we use.
4. If you are using existing assets (e.g., code, data, models) or curating/releasing new assets... Our work does not use existing assets.
   (a) If your work uses existing assets, did you cite the creators? [N/A]
   (b) Did you mention the license of the assets? [N/A]
   (c) Did you include any new assets either in the supplemental material or as a URL? [N/A]
   (d) Did you discuss whether and how consent was obtained from people whose data you're using/curating? [N/A]
   (e) Did you discuss whether the data you are using/curating contains personally identifiable information or offensive content? [N/A]
5. If you used crowdsourcing or conducted research with human subjects... Our work does not involve crowdsourcing or human subjects.
   (a) Did you include the full text of instructions given to participants and screenshots, if applicable? [N/A]
   (b) Did you describe any potential participant risks, with links to Institutional Review Board (IRB) approvals, if applicable? [N/A]
   (c) Did you include the estimated hourly wage paid to participants and the total amount spent on participant compensation? [N/A]