# Dynamic Trace Estimation

Prathamesh Dharangutte, Dept. of Computer Science & Engineering, New York University, ptd244@nyu.edu
Christopher Musco, Dept. of Computer Science & Engineering, New York University, cmusco@nyu.edu

Abstract. We study a dynamic version of the implicit trace estimation problem. Given access to an oracle for computing matrix-vector multiplications with a dynamically changing matrix A, our goal is to maintain an accurate approximation to A's trace using as few multiplications as possible. We present a practical algorithm for solving this problem and prove that, in a natural setting, its complexity is quadratically better than the standard solution of repeatedly applying Hutchinson's stochastic trace estimator. We also provide an improved algorithm under slightly stronger assumptions on the dynamic matrix A. We support our theory with empirical results, showing significant computational improvements on three applications in machine learning and network science: tracking moments of the Hessian spectral density during neural network optimization, counting triangles, and estimating natural connectivity in a dynamically changing graph.

1 Introduction

Implicit or matrix-free trace estimation is a ubiquitous computational primitive in linear algebra, which has become increasingly important in machine learning and data science. Given access to an oracle for computing matrix-vector products Ax1, ..., Axm between an n × n matrix A and chosen vectors x1, ..., xm, the goal is to compute an approximation to A's trace, tr(A) = Σ_{i=1}^n A_ii. This problem arises when A's diagonal entries cannot be accessed explicitly, usually because forming A is computationally prohibitive. As an example, consider the case where A is the Hessian matrix of a loss function involving a neural network. While forming the Hessian is infeasible when the network is large, backpropagation can be used to efficiently compute Hessian-vector products [32]. In other applications, A is a matrix function of another matrix B. For example, if B is a graph adjacency matrix, tr(B³) equals six times the number of triangles in the graph [2]. Computing A = B³ explicitly to evaluate the trace would require O(n³) time, while the matrix-vector multiplication Ax = B(B(Bx)) only requires O(n²) time. Similarly, in log-determinant approximation, useful in e.g. Bayesian log-likelihood computation or determinantal point process (DPP) methods, we want to approximate the trace of A = log(B) [5, 17, 36]. Again, A takes O(n³) time to form explicitly, but Ax = log(B)x can be computed in roughly O(n²) time using iterative methods like the Lanczos algorithm [21]. Dynamic versions of the log-determinant estimation problem have been studied due to applications in greedy methods for DPP inference [15].

In data science and machine learning, other applications of implicit trace estimation include matrix norm and spectral sum estimation [16, 41, 31], as well as methods for eigenvalue counting [9] and spectral density estimation [45, 27]. Spectral density estimation methods typically use implicit trace estimation to estimate moments of a matrix's eigenvalue distribution, i.e., tr(A), tr(A²), tr(A³), etc., which can then be used to compute an approximation to that entire distribution. In deep learning, spectral density estimation is used to quickly analyze the spectra of weight matrices [33, 28] or to probe information about the Hessian matrix during optimization [13, 49].
Trace estimation has also been used for neural network weight quantization [10, 34] and to understand training dynamics [44].

1.1 Static Trace Estimation

The most widely used implicit trace estimation algorithm is Hutchinson's estimator [14, 22]. Letting g1, ..., gℓ ∈ Rⁿ be random vectors with i.i.d. mean-0, variance-1 entries (e.g., standard Gaussians or ±1 Rademachers), Hutchinson's estimator approximates tr(A) via the average

hℓ(A) = (1/ℓ) Σ_{i=1}^{ℓ} gi^T (A gi).

This estimator requires ℓ matrix-vector multiplications to compute. Its variance can be shown to be O(‖A‖_F² / ℓ) and, with high probability, when ℓ = O(1/ϵ²), we have the error guarantee [3, 30]:

|hℓ(A) − tr(A)| < ϵ‖A‖_F. (1)

While improvements on Hutchinson's estimator have been studied for restricted classes of matrices (positive semidefinite, sparse, nearly low-rank, etc.) [30, 40, 38, 36, 19], the method is the best known for general matrices: no technique achieves guarantee (1) with o(1/ϵ²) matrix-vector products.

1.2 Dynamic Trace Estimation

We explore a natural and widely applicable dynamic version of the implicit trace estimation problem: given access to a matrix-vector multiplication oracle for a dynamically changing matrix A, maintain an approximation to A's trace. This problem arises in applications involving optimization in machine learning, where we need to estimate the trace of a constantly changing Hessian matrix H (or some function of H) during model training. In other applications, A is dynamic because it is repeatedly modified by some algorithmic process. E.g., in the transit planning method of [43], edges are added to a network to optimally increase the Estrada index [11]. Evaluating this connectivity measure requires computing tr(exp(B)), where B is the dynamically changing network adjacency matrix and exp(B) is a matrix exponential.

A naive solution to the dynamic problem is to simply apply Hutchinson's estimator to every snapshot of A as it changes over time. To achieve a guarantee like (1) for m time steps, we require O(m/ϵ²) matrix-vector multiplies. The goal of this paper is to improve on this bound when the changes to A are bounded. Formally, we abstract the problem as follows:

Problem 1 (Dynamic trace estimation). Let A1, ..., Am be n × n matrices satisfying:
1. ‖Ai‖_F ≤ 1, for all i ∈ [1, m].
2. ‖Ai+1 − Ai‖_F ≤ α, for all i ∈ [1, m − 1].
Given implicit matrix-vector multiplication access to each Ai in sequence, the goal is to compute trace approximations t1, ..., tm for tr(A1), ..., tr(Am) such that, for each i ∈ {1, ..., m},

P[|ti − tr(Ai)| ≥ ϵ] ≤ δ. (2)

Above, A1, ..., Am represent different snapshots of a dynamic matrix at m time steps. We require ‖Ai‖_F ≤ 1 only to simplify the form of our error bounds; no explicit rescaling is necessary for matrices with larger norm. If we assume ‖Ai‖_F ≤ U for some (unknown) upper bound U, the guarantee of (2) would simply involve an ϵU term instead of ϵ. The second condition bounds how much the matrices change over time. Again for simplicity, we assume a fixed upper bound α on the difference at each time step, but the algorithms presented in this paper will be adaptive to changing gaps between Ai and Ai+1, and will perform better when these gaps are small on average. By the triangle inequality, α ≤ 2, but in applications we typically have α ≪ 1, meaning that the changes in the dynamic matrix are small relative to its Frobenius norm.
If this is not the case, there is no hope of improving on the naive method of applying Hutchinson's estimator repeatedly to each Ai.

Note on Matrix Functions. In many applications Ai = f(Bi) for a dynamically changing matrix B. While we may have ‖Ai+1 − Ai‖_F = ‖f(Bi+1) − f(Bi)‖_F ≫ ‖Bi+1 − Bi‖_F for functions like the matrix exponential, this is not an immediate issue. To improve on Hutchinson's estimator, the important requirement is simply that ‖Ai+1 − Ai‖_F is small in comparison to ‖Ai+1‖_F. As discussed in Section 5, this is typically the case for applications involving matrix functions.

We will measure the complexity of any algorithm for solving Problem 1 in the matrix-vector multiplication oracle model of computation, meaning that we consider the cost of matrix-vector products (which are the only way A1, ..., Am can be accessed) to be significantly larger than other computational costs. We thus seek solely to minimize the number of such products used [39]. The matrix-vector oracle model has seen growing interest in recent years, as it generalizes both the matrix sketching and Krylov subspace models in linear algebra, naturally captures the true computational cost of algorithms in these classes, and is amenable to proving strong lower bounds [37, 6].

1.3 Main Result

Our main result is an algorithm for solving Problem 1 more efficiently than Hutchinson's estimator:

Theorem 1.1. For any ϵ, δ, α ∈ (0, 1), the DeltaShift algorithm (Algorithm 1) solves Problem 1 with O(mα·log(1/δ)/ϵ² + log(1/δ)/ϵ²) total matrix-vector multiplications involving A1, ..., Am.

For large m, the first term dominates the complexity in Theorem 1.1. For comparison, a tight analysis of Hutchinson's estimator [29] establishes that the naive approach requires O(m·log(1/δ)/ϵ²), which is worse than the leading term of Theorem 1.1 by a factor of 1/α. A natural setting is when α = O(ϵ), in which case Algorithm 1 requires O(log(1/δ)/ϵ) matrix-vector multiplications on average per time step, in comparison to O(log(1/δ)/ϵ²) for Hutchinson's estimator: a quadratic improvement in ϵ.

To prove Theorem 1.1, we introduce a dynamic variance reduction scheme. By linearity of the trace, tr(Ai+1) = tr(Ai) + tr(Δi), where Δi = Ai+1 − Ai. Instead of directly estimating tr(Ai+1), we combine our previous estimate for tr(Ai) with an estimate for tr(Δi), computed via Hutchinson's estimator. Each sample for Hutchinson's estimator applied to Δi requires just two matrix-vector multiplies: one with Ai and one with Ai+1. At the same time, when Δi has small Frobenius norm (bounded by α), we can estimate its trace more accurately than tr(Ai+1).¹ While intuitive, this approach requires care to make work. In a naive implementation, error in estimating tr(Δ1), tr(Δ2), ... compounds over time, eliminating any computational savings. To avoid this issue, we introduce a novel damping strategy that actually estimates tr(Ai+1 − (1 − γ)Ai) for a positive damping factor γ.

¹ While we consider the general, unstructured problem setting, we note that, if Ai has additional structure, it is not necessarily easier to estimate the trace of Δi than that of Ai. For example, if Ai is a PSD matrix, then specialized trace estimation algorithms that improve on Hutchinson's method can be used [30]. Understanding dynamic trace estimation methods for sequences of structured matrices is a natural direction for future work.

We complement our main result with a nearly matching conditional lower bound: in Section 4 we argue that our DeltaShift method cannot be improved in the dynamic setting unless Hutchinson's estimator can be improved in the static setting. We also present an improvement to DeltaShift under bounds on A1, ..., Am and each Ai+1 − Ai that are more stringent than those of Problem 1, but commonly satisfied in practice.

1.4 Related Work

Prior work on implicit trace estimation and its applications in machine learning is discussed in the beginning of this section.
While there are no other methods that improve on Hutchinson's estimator in the dynamic setting, the idea of variance reduction has found applications in other work on implicit trace estimation [1, 12, 26, 30]. In these results, the trace of a matrix A is estimated by decomposing A = B + Δ, where B has an easily computed trace (e.g., because it is low-rank) and ‖Δ‖_F ≪ ‖A‖_F, so tr(Δ) is more easily approximated with Hutchinson's estimator than tr(A) directly.

2 Preliminaries

Notation. We let B ∈ R^{m×k} denote a real-valued matrix with m rows and k columns, and x ∈ R^n a real-valued vector with n entries. Subscripts like Bi or xj typically denote a matrix or vector in a sequence, but we use double subscripts with matrices to denote entries: Bij is the entry in the ith row and jth column. Let σℓ(B) denote the ℓth singular value of B. ‖B‖_F denotes the Frobenius norm of B, ‖B‖_F = √(Σ_{i,j} Bij²) = √(Σ_ℓ σℓ(B)²), and ‖B‖_* denotes the nuclear norm, ‖B‖_* = Σ_ℓ σℓ(B). We let E[v] and Var[v] denote the expectation and variance of a random variable v.

Hutchinson's Estimator. Our algorithm uses Hutchinson's trace estimator with Rademacher ±1 random variables as a subroutine. Specifically, let g1, ..., gℓ ∈ R^n be independent random vectors, with each entry equal to +1 or −1 with probability 1/2. Let A ∈ R^{n×n}. Hutchinson's estimator for tr(A) is:

hℓ(A) = (1/ℓ) Σ_{i=1}^{ℓ} gi^T (A gi). (3)

Fact 2.1 (Hutchinson's expectation and variance). For any positive integer ℓ and matrix A we have:

E[hℓ(A)] = tr(A),  Var[hℓ(A)] ≤ (2/ℓ)·‖A‖_F².

Fact 2.1 follows from simple calculations, found e.g. in [3]. Similar bounds hold when Hutchinson's estimator is implemented with different random variables. For example, random Gaussians also lead to a variance bound of (2/ℓ)·‖A‖_F². However, Rademachers tend to work better empirically. Given Fact 2.1, Chebyshev's inequality immediately implies a concentration bound for Hutchinson's estimator.

Fact 2.2 (Chebyshev's Inequality). For a random variable X with mean E[X] = µ and variance Var[X] = σ², for any k ≥ 1, P(|X − µ| ≥ kσ) ≤ 1/k².

Claim 2.3. For any ϵ, δ ∈ (0, 1), if ℓ = 2/(ϵ²δ) then Pr[|hℓ(A) − tr(A)| ≥ ϵ‖A‖_F] ≤ δ.

The δ dependence in Claim 2.3 can be improved from 1/δ to log(1/δ) via the Hanson-Wright inequality, which shows that hℓ(A) is a sub-exponential random variable [30, 35]. We also require Hanson-Wright to obtain our bound involving log(1/δ). From this tighter result, Hutchinson's estimator yields a total matrix-vector multiplication bound of O(m·log(1/δ)/ϵ²) for solving Problem 1 by simply applying the estimator in sequence to A1, ..., Am.
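To make the estimator concrete, below is a minimal NumPy sketch of Hutchinson's estimator (3) with Rademacher probes, written against a generic matrix-vector oracle. The function names and the example matrix are ours and purely illustrative.

```python
import numpy as np

def hutchinson(matvec, n, ell, rng=None):
    """Estimate tr(A) using ell matrix-vector products with A.

    matvec: callable returning A @ g for a vector g (the only access to A).
    n:      dimension of A.
    ell:    number of random +/-1 (Rademacher) probe vectors.
    """
    rng = np.random.default_rng() if rng is None else rng
    estimates = []
    for _ in range(ell):
        g = rng.choice([-1.0, 1.0], size=n)   # Rademacher probe
        estimates.append(g @ matvec(g))       # g^T (A g)
    return np.mean(estimates)

# Example: the estimate concentrates around tr(A), with standard deviation
# on the order of ||A||_F * sqrt(2/ell), matching Fact 2.1.
if __name__ == "__main__":
    rng = np.random.default_rng(0)
    A = rng.standard_normal((500, 500))
    est = hutchinson(lambda g: A @ g, n=500, ell=200, rng=rng)
    print(est, np.trace(A))
```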
3 Main Algorithmic Result

As discussed in Section 1.2, a natural idea for solving Problem 1 with fewer than O(m/ϵ²) queries is to take advantage of the small differences between Ai+1 and Ai to compute a running estimate of the trace. In particular, instead of estimating tr(A1), tr(A2), ..., tr(Am) individually using Hutchinson's estimator, we denote Δi = Ai − Ai−1 and use linearity of the trace to write:

tr(Aj) = tr(A1) + Σ_{i=2}^{j} tr(Δi). (4)

By choosing a large ℓ0, we can compute an accurate approximation hℓ0(A1) to tr(A1). Then, for j > 1 and ℓ ≪ ℓ0, we can approximate tr(Aj) via the following unbiased estimator:

tr(Aj) ≈ hℓ0(A1) + Σ_{i=2}^{j} hℓ(Δi). (5)

Since ‖Δi‖_F ≤ α ≪ ‖Ai+1‖_F, we expect to approximate tr(Δ2), ..., tr(Δm) much more accurately than tr(A2), ..., tr(Am) directly. At the same time, the estimator in (5) only incurs a factor-2 overhead in matrix-vector multiplies in comparison to Hutchinson's: it requires 2(m − 1)ℓ products to compute hℓ(Δ2), ..., hℓ(Δm) versus (m − 1)ℓ to compute hℓ(A2), ..., hℓ(Am). The cost of the initial estimate hℓ0(A1) is necessarily higher, but can be amortized over time.

3.1 Our Approach

While intuitive, the problem with the approach above is that error compounds due to the sum in (5). Each hℓ(Δi) is roughly α/√ℓ away from tr(Δi), so after j steps we naively expect total error O(jα/√ℓ). We can do slightly better by arguing that, because the errors are random, they actually accumulate as O(√j·α/√ℓ), but regardless, there is accumulation. One option is to restart the estimation process: after some number of steps q, throw out all previous trace approximations, compute an accurate estimate for tr(Aq), and for j > q construct an estimator based on tr(Aj) = tr(Aq) + Σ_{i=q+1}^{j} tr(Δi). While possible to analyze theoretically, this approach turns out to be difficult to implement in practice due to several competing parameters (see details in Section 5).

Instead, we introduce a more effective approach based on a damped variance reduction strategy, detailed in Algorithm 1, which we call DeltaShift. Instead of being based on (4), DeltaShift uses the following recursive identity involving a fixed parameter 0 ≤ γ < 1 (to be chosen later):

tr(Aj) = (1 − γ)·tr(Aj−1) + tr(Δ̂j), where Δ̂j = Aj − (1 − γ)Aj−1. (6)

Algorithm 1: DeltaShift
Input: Implicit matrix-vector multiplication access to A1, ..., Am ∈ R^{n×n}, positive integers ℓ0, ℓ, damping factor γ ∈ [0, 1].
Output: t1, ..., tm approximating tr(A1), ..., tr(Am).
  Initialize t1 ← (1/ℓ0) Σ_{i=1}^{ℓ0} gi^T A1 gi, where g1, ..., gℓ0 ∈ R^n are random ±1 vectors.
  for j ← 2 to m do
    Draw ℓ random ±1 vectors g1, ..., gℓ ∈ R^n.
    z1 ← Aj−1 g1, ..., zℓ ← Aj−1 gℓ and w1 ← Aj g1, ..., wℓ ← Aj gℓ.
    tj ← (1 − γ)·tj−1 + (1/ℓ) Σ_{i=1}^{ℓ} gi^T (wi − (1 − γ)zi).
  end for

Given an estimate tj−1 for tr(Aj−1), DeltaShift estimates tr(Aj) by (1 − γ)tj−1 + hℓ(Δ̂j). This approach has several useful properties: 1) if tj−1 is an unbiased estimate for tr(Aj−1), then tj is an unbiased estimate for tr(Aj); 2) ‖Δ̂j‖_F is not much larger than ‖Δj‖_F if γ is small; and 3) by shrinking tj−1 by a factor of (1 − γ) when computing tj, we reduce the variance of this leading term. The last property ensures that error does not accumulate over time.
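For concreteness, here is a minimal NumPy sketch of Algorithm 1. For illustration it takes an explicit list of matrices in place of the matrix-vector oracle; the names are ours, and a real implementation would access each Ai only through products.

```python
import numpy as np

def delta_shift(As, ell0, ell, gamma, rng=None):
    """DeltaShift (Algorithm 1) on an explicit sequence of matrices As[0..m-1].

    Returns estimates t[0..m-1] of tr(As[0]), ..., tr(As[m-1]).
    Uses ell0 matvecs for the initial estimate and 2*ell matvecs per later step.
    """
    rng = np.random.default_rng() if rng is None else rng
    n = As[0].shape[0]
    # Initial estimate: plain Hutchinson on A_1 with ell0 Rademacher probes.
    G0 = rng.choice([-1.0, 1.0], size=(n, ell0))
    t = [np.mean(np.einsum('ij,ij->j', G0, As[0] @ G0))]
    for j in range(1, len(As)):
        G = rng.choice([-1.0, 1.0], size=(n, ell))
        Z = As[j - 1] @ G                      # ell products with A_{j-1}
        W = As[j] @ G                          # ell products with A_j
        # Hutchinson estimate of tr(A_j - (1-gamma) A_{j-1}) from the same probes.
        delta_hat = np.mean(np.einsum('ij,ij->j', G, W - (1 - gamma) * Z))
        t.append((1 - gamma) * t[-1] + delta_hat)
    return t
```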
These properties lead to our main result:

Theorem 1.1 (Restated). For any ϵ, δ, α ∈ (0, 1), Algorithm 1 run with γ = α, ℓ0 = O(log(1/δ)/ϵ²), and ℓ = O(α·log(1/δ)/ϵ²) solves Problem 1. In total, it requires O(mα·log(1/δ)/ϵ² + log(1/δ)/ϵ²) matrix-vector multiplications with A1, ..., Am.

The full proof of Theorem 1.1 relies on the Hanson-Wright inequality, and is given in Appendix B. Here, we give a simple proof of essentially the same statement, but with a slightly weaker dependence on the failure probability δ.

Proof. Let γ = α, ℓ0 = 2/(ϵ²δ), and ℓ = 8α/(ϵ²δ). The proof is based on an inductive analysis of the variance of tj, the algorithm's estimate for tr(Aj). Specifically, we claim that for j = 1, ..., m:

Var[tj] ≤ δϵ². (7)

For the base case, j = 1, (7) follows directly from Fact 2.1 because t1 is simply Hutchinson's estimator applied to A1, and ‖A1‖_F ≤ 1. For the inductive case, tj is the sum of two independent estimators, tj−1 and hℓ(Δ̂j). So, to bound its variance, we just need to bound the variance of these two terms. To address the second, note that by the triangle inequality,

‖Δ̂j‖_F = ‖Aj − (1 − γ)Aj−1‖_F ≤ ‖Aj − Aj−1‖_F + γ‖Aj−1‖_F ≤ 2α.

Thus, by Fact 2.1, Var[hℓ(Δ̂j)] ≤ 8α²/ℓ. Combined with the inductive assumption that Var[tj−1] ≤ δϵ², we have:

Var[tj] = (1 − γ)²·Var[tj−1] + Var[hℓ(Δ̂j)] ≤ (1 − α)²δϵ² + 8α²/ℓ ≤ (1 − α)δϵ² + αδϵ² = δϵ².

This proves (7), and by Chebyshev's inequality we thus have Pr[|tj − tr(Aj)| ≥ ϵ] ≤ δ for all j.

3.2 Selecting γ in Practice

While DeltaShift is simple to implement, in practice its performance is sensitive to the choice of γ. For the Theorem 1.1 analysis, we assume γ = α, but α may not be known a priori, and may change over time. To address this issue, we describe a way to select a near-optimal γ at each time step j (the choice may vary over time) with very little additional computational overhead.

Let vj−1 = Var[tj−1] be the variance of our estimator for tr(Aj−1). We have that

vj = (1 − γ)²·vj−1 + Var[hℓ(Aj − (1 − γ)Aj−1)] ≤ (1 − γ)²·vj−1 + (2/ℓ)·‖Aj − (1 − γ)Aj−1‖_F².

At time step j, a natural goal is to choose the damping parameter γ that minimizes this upper bound on the variance of tj:

γ* = argmin_γ (1 − γ)²·vj−1 + (2/ℓ)·‖Δ̂j‖_F², (8)

where Δ̂j = Aj − (1 − γ)Aj−1 as before. While (8) cannot be computed directly, observing that ‖B‖_F² = tr(B^T B) for any matrix B, the quantity being minimized can be estimated as

v̂j = (1 − γ)²·v̂j−1 + (2/ℓ)·hℓ(Δ̂j^T Δ̂j),

where v̂j−1 is an estimate for vj−1. The estimate hℓ(Δ̂j^T Δ̂j) can be computed using exactly the same ℓ matrix-vector products with Aj and Aj−1 that are used to estimate tr(Δ̂j), so there is little computational overhead. Moreover, since Δ̂j^T Δ̂j is positive semidefinite, as long as ℓ ≳ log(1/δ), we will obtain a relative-error approximation to its trace with probability 1 − δ [3]. An alternative approach to estimating vj would be to simply compute the empirical variance of the average hℓ(Δ̂j), but this requires fixing γ. An advantage of our closed-form approximation is that it can be used to analytically optimize γ. Specifically, expanding Δ̂j^T Δ̂j, we have that:

v̂j = (1 − γ)²·v̂j−1 + (2/ℓ)·[ hℓ(Aj^T Aj) + (1 − γ)²·hℓ(Aj−1^T Aj−1) − 2(1 − γ)·hℓ(Aj−1^T Aj) ]. (9)

Above, each estimate hℓ is understood to use the same set of random vectors. Taking the derivative and setting it to zero, we have that the minimizer of (9), denoted γ*, equals:

γ* = 1 − 2·hℓ(Aj−1^T Aj) / (ℓ·v̂j−1 + 2·hℓ(Aj−1^T Aj−1)). (10)

This formula for γ* motivates an essentially parameter-free version of DeltaShift, which is used in our experimental evaluation (Algorithm 2 in Appendix A). The only input to the algorithm is the number of matrix-vector multiplies used at each time step, ℓ. For simplicity, unlike Algorithm 1, we do not use a larger number of matrix-vector multiplies when estimating tr(A1). This leads to somewhat higher error for the first matrices in the sequence A1, ..., Am, but error quickly falls for large j.
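As an illustration of how (9) and (10) would be used inside the parameter-free variant, here is a sketch of the per-step selection of γ, reusing the products zi = Aj−1·gi and wi = Aj·gi that DeltaShift already computes. Variable names are ours, and this is only a sketch of the formulas, not the exact Algorithm 2 from Appendix A.

```python
import numpy as np

def choose_gamma(G, Z, W, ell, v_prev):
    """Select the damping factor via (10) and return the variance estimate (9).

    G, Z, W: (n, ell) arrays of probes g_i and products Z = A_{j-1} G, W = A_j G.
    v_prev:  running variance estimate v_{j-1} for the previous trace estimate.
    """
    # Hutchinson estimates of tr(A_j^T A_j), tr(A_{j-1}^T A_{j-1}), tr(A_{j-1}^T A_j),
    # all from the same ell matrix-vector products already used for the trace update.
    h_jj = np.mean(np.einsum('ij,ij->j', W, W))   # g^T A_j^T A_j g = ||w||^2
    h_pp = np.mean(np.einsum('ij,ij->j', Z, Z))   # g^T A_{j-1}^T A_{j-1} g = ||z||^2
    h_pj = np.mean(np.einsum('ij,ij->j', Z, W))   # g^T A_{j-1}^T A_j g = z^T w
    # Closed-form minimizer (10) of the variance estimate (9);
    # in practice one might clip the result to [0, 1).
    gamma_star = 1.0 - 2.0 * h_pj / (ell * v_prev + 2.0 * h_pp)
    u = 1.0 - gamma_star
    v_hat = u**2 * v_prev + (2.0 / ell) * (h_jj + u**2 * h_pp - 2.0 * u * h_pj)
    return gamma_star, v_hat
```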
4 Algorithm Improvements and Lower Bound

In this section, we prove a lower bound showing that, in general, Theorem 1.1 is likely optimal. On the other hand, we show that, if we make a slightly stronger assumption on A1, ..., Am and the dynamic updates Δi = Ai+1 − Ai, an improvement on DeltaShift is possible.

4.1 Lower Bound

As noted, for a large number of time steps m, the matrix-vector multiplication complexity of DeltaShift is dominated by the leading term in Theorem 1.1, O(mα·log(1/δ)/ϵ²). We show that it is unlikely an improvement on this term can be obtained in general:

Lemma 4.1. Suppose there is an algorithm S that solves Problem 1 with o(mα·log(1/δ)/ϵ²) total matrix-vector multiplies with A2, ..., Am, and any number of matrix-vector multiplies with A1, when α = 1/(m − 1). Then there is an algorithm T that achieves (1) for a single matrix A with o(log(1/δ)/ϵ²) matrix-vector multiplies.

Proof. The proof is via a direct reduction. Given a matrix A, a positive integer m > 1, and parameter α = 1/(m − 1), construct the sequence of matrices:

A1 = 0, A2 = α·A, ..., Am−1 = (1 − α)·A, Am = A.

Since A1 = 0, and every A2, ..., Am is a scaling of A, any algorithm S satisfying the assumption of Lemma 4.1 can be implemented with o((m − 1)·α·log(1/δ)/ϵ²) = o(log(1/δ)/ϵ²) matrix-vector multiplications with A. Moreover, if S is run on this sequence of matrices, on the last step it outputs an approximation tm to tr(A) with Pr[|tm − tr(A)| ≥ ϵ] ≤ δ. So algorithm T can simply simulate S on A1, ..., Am and return its final estimate to satisfy (1).

Lemma 4.1 is a conditional lower bound on matrix-vector query algorithms for solving Problem 1: if Hutchinson's estimator cannot be improved for static trace estimation (and it has not been in 30 years), then DeltaShift cannot be improved for dynamic trace estimation. We believe the bound could be made unconditional through a slight generalization of existing lower bounds on trace estimation in the matrix-vector multiplication model [30, 46].

4.2 Improved Algorithm

A recent improvement on Hutchinson's estimator, called Hutch++, was described in [30]. For the static trace estimation problem, Hutch++ achieves a matrix-vector multiplication complexity of O(1/ϵ) to compute a relative-error (1 ± ϵ) approximation to the trace of any positive semidefinite (PSD) matrix, improving on the O(1/ϵ²) required by Hutchinson's. It does so via a variance reduction method (also used e.g. in [12]) which allocates some matrix-vector products to a randomized SVD algorithm that approximates the top singular vector subspace of A. This approximate subspace is projected off of A, and Hutchinson's is used to estimate the trace of the remainder.

In our setting it is not realistic to assume PSD matrices: while in many applications A1, ..., Am are all PSD, it is rarely the case that Δ1, ..., Δm−1 are. Nevertheless, we can take advantage of a more general bound proven in [30] for any matrix:

Fact 4.2 (Hutch++ expectation and variance). Let hℓ++(A) be the Hutch++ estimator of [30] applied to any matrix A with ℓ matrix-vector multiplications. We have:

E[hℓ++(A)] = tr(A),  Var[hℓ++(A)] ≤ (16/ℓ²)·‖A‖_*².

Recall that ‖A‖_* denotes the nuclear norm of A. Comparing to the variance of Hutchinson's estimator from Fact 2.1, notice that the variance of Hutch++ depends on 1/ℓ² instead of 1/ℓ, implying faster convergence as the number of matrix-vector products ℓ increases. A trade-off is that the variance scales with ‖A‖_*² instead of ‖A‖_F². ‖A‖_*² is larger, and possibly a factor of n larger, than ‖A‖_F². However, for matrices of rank k, ‖A‖_*² ≤ k·‖A‖_F², so the norms are typically much closer for low-rank or nearly low-rank matrices. In many problems, Δ1, ..., Δm−1 may have low-rank structure, in which case an alternative based on Hutch++ provides better performance.
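For reference, below is a compact sketch of a Hutch++-style estimator in the spirit of [30]: one third of the matrix-vector products sketch A's range, the trace of the projected part is computed exactly, and Hutchinson's estimator handles the residual. The splitting and constants here are illustrative rather than the exact algorithm from [30].

```python
import numpy as np

def hutch_pp(matvec, n, ell, rng=None):
    """Hutch++-style trace estimator using roughly ell products with A.

    k = ell/3 products build an orthonormal basis Q for a sketch of A's range,
    tr(Q^T A Q) is computed exactly, and Hutchinson's estimator handles the
    residual (I - QQ^T) A (I - QQ^T) with k fresh probes.
    """
    rng = np.random.default_rng() if rng is None else rng
    k = max(ell // 3, 1)
    S = rng.choice([-1.0, 1.0], size=(n, k))
    AS = np.column_stack([matvec(S[:, i]) for i in range(k)])               # k matvecs
    Q, _ = np.linalg.qr(AS)
    AQ = np.column_stack([matvec(Q[:, i]) for i in range(Q.shape[1])])      # k matvecs
    t_low = np.trace(Q.T @ AQ)                  # exact trace of the sketched part
    G = rng.choice([-1.0, 1.0], size=(n, k))
    G_perp = G - Q @ (Q.T @ G)                  # project probes off the sketch
    AG = np.column_stack([matvec(G_perp[:, i]) for i in range(k)])          # k matvecs
    t_res = np.mean(np.einsum('ij,ij->j', G_perp, AG))   # Hutchinson on the residual
    return t_low + t_res
```

The decomposition is exact in expectation because tr(A) = tr(Q^T A Q) + tr((I − QQ^T)A(I − QQ^T)) for any orthonormal Q.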
Formally, we introduce a new variant of Problem 1 to capture this potential improvement.

Problem 2 (Dynamic trace estimation with nuclear norm assumption). Let A1, ..., Am satisfy:
1. ‖Ai‖_* ≤ 1, for all i ∈ [1, m].
2. ‖Ai+1 − Ai‖_* ≤ α, for all i ∈ [1, m − 1].
Given matrix-vector multiplication access to each Ai in sequence, the goal is to compute trace approximations t1, ..., tm for tr(A1), ..., tr(Am) such that, for all i, P[|ti − tr(Ai)| ≥ ϵ] ≤ δ.

In Appendix C we prove the following result on a variant of DeltaShift that we call DeltaShift++:

Theorem 4.3. For any ϵ, δ, α ∈ (0, 1), DeltaShift++ (Algorithm 4) solves Problem 2 with O(m√α·log(1/δ)/ϵ + log(1/δ)/ϵ) total matrix-vector multiplications involving A1, ..., Am.

Theorem 4.3 is stronger than Theorem 1.1 for vanilla DeltaShift in that it has a linear instead of a quadratic dependence on ϵ. In particular, its leading term scales as √(α/ϵ²) = √α/ϵ, whereas the leading term of Theorem 1.1 scales as α/ϵ². However, the result does require stronger assumptions on A1, ..., Am and each Δi = Ai+1 − Ai, in that Problem 2 requires these matrices to have bounded nuclear norm instead of Frobenius norm. Since the nuclear norm of a matrix is always at least as large as its Frobenius norm, these requirements are stronger than those of Problem 1. As we will show in Section 5, the benefit of improved ϵ dependence often outweighs the cost of these more stringent assumptions.
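One natural way to realize such a variant, sketched below, is to replace the plain Hutchinson estimate of tr(Δ̂j) inside DeltaShift with a Hutch++-style estimate applied to the oracle x ↦ Aj·x − (1 − γ)Aj−1·x. The actual DeltaShift++ (Algorithm 4 in Appendix C) may allocate matrix-vector products differently; this is only an illustration of the idea.

```python
def delta_shift_pp_step(matvec_prev, matvec_cur, t_prev, n, ell, gamma, trace_estimator):
    """One step of a DeltaShift++-style update: t_prev ~ tr(A_{j-1}) -> t_j ~ tr(A_j).

    trace_estimator(matvec, n, ell) should return an unbiased trace estimate using
    roughly ell products with the given oracle, e.g. the hutch_pp sketch above.
    Each call to the shifted-difference oracle below costs two matrix-vector products.
    """
    def delta_hat_mv(x):
        # Oracle for the shifted difference Delta_hat_j = A_j - (1 - gamma) * A_{j-1},
        # which is hopefully nearly low-rank and hence has small nuclear norm.
        return matvec_cur(x) - (1.0 - gamma) * matvec_prev(x)

    t_delta = trace_estimator(delta_hat_mv, n, ell)
    return (1.0 - gamma) * t_prev + t_delta
```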
5 Experiments

We show that our proposed algorithm outperforms three alternatives on both synthetic and real-world trace estimation problems. Specifically, we evaluate the following methods:

Hutchinson. The naive method of estimating each tr(A1), ..., tr(Am) using an independent Hutchinson's estimator, as discussed in Section 2.
NoRestart. The estimator of (5), which uses the same variance reduction strategy as DeltaShift for all j ≥ 2, but does not restart or add damping to reduce error accumulation.
Restart. The estimator discussed in Section 3.1, which periodically restarts the variance reduction strategy, using Hutchinson's to obtain a fresh estimate for tr(Aj). Pseudocode is in Appendix A.
DeltaShift. Our parameter-free, damped variance reduction estimator detailed in Appendix A.

We allocated a fixed number of matrix-vector queries, Q, to be used over all time steps 1, ..., m. For Hutchinson and DeltaShift, the same number of vectors, Q/m, was used at each step. For Restart and NoRestart, the distribution was non-uniform, and parameter selections are described in Appendix D.

Synthetic data: To simulate the dynamic setting, we generate a random matrix A ∈ R^{n×n} and add random perturbations for each of 100 time steps. We consider two cases: low (Fig. 1(a)) and significant (Fig. 1(b)) perturbations, the exact details of which, as well as the allocation of matrix-vector products for NoRestart and Restart, are discussed in Appendix D. We report the scaled absolute error between the estimate at each time step and the true trace, |tr(Ai) − ti| / max_i tr(Ai).

[Figure 1: Comparison of DeltaShift with Hutchinson, Restart, and NoRestart on synthetic data with Q = 10⁴. Panels: (a) low perturbation, (b) significant perturbation. Each panel plots scaled error vs. time step, averaged over 100 trials (±0.25σ).]

[Figure 2: Comparison of DeltaShift with Hutchinson and NoRestart on synthetic data. Panels: (a) Q = 2·10³, (b) Q = 8·10³.]

As expected, Hutchinson is outperformed even by NoRestart when perturbations are small. DeltaShift performs best, and its error actually improves slightly over time. DeltaShift also performs best in the large perturbation experiment. We note that choosing the multiple parameters for the Restart method was a challenge in comparison to DeltaShift. Tuning the method becomes infeasible for larger experiments, so we exclude it from our other experiments, including the plots in Fig. 2, which show that DeltaShift continues to outperform Hutchinson and NoRestart for lower values of Q.

Counting triangles: Our first real-data experiment is on counting triangles in a dynamic unweighted, undirected graph G, via the fact that the number of triangles equals (1/6)·tr(B³), where B is the adjacency matrix. The graph dataset we use is the Wikipedia vote network with 7115 nodes [25, 24]. At each time step we perturb the graph by adding a random k-clique, for k chosen uniformly between 10 and 150. After 75 time steps, we start randomly deleting among the subgraphs added. We follow the same setup for the number of matrix-vector products used by the estimators and the reported error as in the synthetic experiments (Appendix D). Note that for this particular application, the actual number of matrix-vector multiplications with B is 3Q, since each oracle call computes B(B(Bx)). As seen in Fig. 3, DeltaShift provides the best estimates overall.

[Figure 3: Comparison of DeltaShift with Hutchinson and NoRestart for the triangle counting experiment (scaled error vs. time step, averaged over 50 trials). Panels: (a) graph data with Q = 2·10³, (b) graph data with Q = 10⁴.]

Estimating natural connectivity: To evaluate the DeltaShift++ algorithm introduced in Section 4, we address an application from [43] on estimating natural connectivity in a dynamic graph, which is a function of tr(exp(B)) for adjacency matrix B. This problem has also been explored in [7, 8, 4]. We use the road network data Gleich/minnesota (available at https://sparse.tamu.edu/Gleich/minnesota). We perturb the graph over time by choosing two nodes at random and adding an edge between them, and use the Lanczos method to approximate matrix-vector products with exp(B). We find that DeltaShift++ performs better than DeltaShift (Fig. 4), as the change in exp(B) tends to be nearly low-rank, and thus has small nuclear norm (see [4] for details). Both DeltaShift and DeltaShift++ perform significantly better than naive Hutchinson's when 100 matrix-vector products are used per time step.

[Figure 4: Error of DeltaShift and DeltaShift++ for dynamic estimation of natural connectivity (scaled error vs. time step, averaged over 100 trials).]
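For concreteness, here are minimal sketches of the two matrix-vector oracles used in these graph experiments, assuming B is a SciPy sparse adjacency matrix. SciPy's expm_multiply stands in for the Lanczos-based product with exp(B) used in the paper; the function names and the toy random graph are ours.

```python
import numpy as np
import scipy.sparse as sp
from scipy.sparse.linalg import expm_multiply

def cube_matvec(B):
    """Oracle x -> B^3 x for triangle counting: #triangles = tr(B^3)/6, 3 products per call."""
    return lambda x: B @ (B @ (B @ x))

def expm_matvec(B):
    """Oracle x -> exp(B) x for natural connectivity, which depends on tr(exp(B))."""
    return lambda x: expm_multiply(B, x)

# Example: feed either oracle to the hutchinson / delta_shift sketches above.
if __name__ == "__main__":
    rng = np.random.default_rng(0)
    B = sp.random(300, 300, density=0.02, random_state=0)
    B = B + B.T
    B.data[:] = 1.0          # symmetric 0/1 matrix (self-loops ignored for this toy example)
    probes = rng.choice([-1.0, 1.0], size=(50, 300))
    est = np.mean([g @ cube_matvec(B)(g) for g in probes])
    print("estimated 6 * #triangles:", est)
```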
The key takeaway from the experiments above is that DeltaShift and DeltaShift++ are able to obtain good dynamic trace approximations with far fewer matrix-vector products than Hutchinson's and other methods, resulting in considerable computational savings. This is made evident in Figure 5, which plots average relative error across all time steps against the total number of matrix-vector products Q, for various values of Q. In order to achieve the same accuracy level as DeltaShift, Hutchinson's requires substantially more matrix-vector products.

[Figure 5: Average relative error vs. computational cost (total matrix-vector products Q) for synthetic data with large perturbations, averaged across 100 time steps over 100 trials, for Hutchinson, NoRestart, and DeltaShift.]

Hessian spectral density: Finally, we evaluate the performance of DeltaShift on an application involving a dynamically changing Hessian matrix, H, arising while training a neural network. As discussed in Section 1, a common goal is to approximate the spectral density of H. Most methods for doing so, like the popular Kernel Polynomial Method [45], require computing the trace of polynomials of the matrix H. We consider the sequence of Chebyshev polynomials T0, ..., Tq, and estimate tr(T0(H)), ..., tr(Tq(H)). Other polynomial bases can also be used (e.g., Legendre polynomials). Experimental details are discussed in Appendix D, but we summarize the results here. We implement matrix-vector products with H using the PyHessian library [47], and report average error over 25 training epochs for the Hessian of a ResNet model with 269,722 parameters trained on the CIFAR-10 dataset. As it is impossible to compute the true trace of these matrices, we use Hutchinson's estimator with a greater number of queries as a placeholder for ground truth, and compare performance against those values.

Table 1: Average relative error for the trace of Chebyshev polynomials of the Hessian.

|       | Hutchinson | NoRestart | DeltaShift |
|-------|------------|-----------|------------|
| T1(H) | 2.5e-02    | 3.7e-02   | 1.7e-02    |
| T2(H) | 1.2e-06    | 1.7e-06   | 8.0e-07    |
| T3(H) | 4.0e-02    | 4.1e-02   | 3.1e-02    |
| T4(H) | 1.5e-06    | 1.7e-06   | 1.0e-06    |
| T5(H) | 2.1e-02    | 4.3e-02   | 1.9e-02    |

Table 2: Average relative error for the trace of Chebyshev polynomials of the Hessian.

|       | Hutchinson | NoRestart | DeltaShift |
|-------|------------|-----------|------------|
| T1(H) | 1.9e-02    | 5.0e-02   | 1.5e-02    |
| T2(H) | 1.2e-06    | 2.9e-06   | 9.9e-07    |
| T3(H) | 7.7e-02    | 9.4e-02   | 6.1e-02    |
| T4(H) | 1.7e-06    | 2.8e-06   | 1.5e-06    |
| T5(H) | 2.1e-02    | 4.2e-02   | 1.8e-02    |

As can be seen in Tables 1 and 2, DeltaShift obtains uniformly better approximations to the trace values, although the improvement is small. This makes sense: more progress in each training epoch implies a greater change in the Hessian over time, meaning α is larger and thus DeltaShift's advantage over Hutchinson's is smaller.

Acknowledgements

We would like to thank Cameron Musco for helpful discussions, as well as the paper referees for detailed feedback. This work was supported by NSF Award #2045590.

References

[1] Ryan P. Adams, Jeffrey Pennington, Matthew J. Johnson, Jamie Smith, Yaniv Ovadia, Brian Patton, and James Saunderson. Estimating the spectral density of large implicit matrices. arXiv:1802.03451, 2018.
[2] Haim Avron. Counting triangles in large graphs using randomized matrix trace estimation. In Proceedings of the 16th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD), 2010.
[3] Haim Avron and Sivan Toledo. Randomized algorithms for estimating the trace of an implicit symmetric positive semi-definite matrix. Journal of the ACM, 58(2), 2011.
[4] Bernhard Beckermann, Daniel Kressner, and Marcel Schweitzer. Low-rank updates of matrix functions. SIAM Journal on Matrix Analysis and Applications, 39(1):539–565, 2018.
[5] Christos Boutsidis, Petros Drineas, Prabhanjan Kambadur, and Anastasios Zouzias. A randomized algorithm for approximating the log determinant of a symmetric positive definite matrix. Linear Algebra and its Applications, 533, 2015.
[6] Mark Braverman, Elad Hazan, Max Simchowitz, and Blake Woodworth. The gradient complexity of linear regression. In Proceedings of the 33rd Annual Conference on Computational Learning Theory (COLT), volume 125, pages 627–647, 2020.
[7] Hau Chan, Leman Akoglu, and Hanghang Tong. Make it or break it: Manipulating robustness in large networks. In Proceedings of the 2014 SIAM International Conference on Data Mining, pages 325–333. SIAM, 2014.
[8] Chen Chen, Ruiyue Peng, Lei Ying, and Hanghang Tong. Network connectivity optimization: Fundamental limits and effective algorithms. In Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, pages 1167–1176, 2018.
[9] Edoardo Di Napoli, Eric Polizzi, and Yousef Saad. Efficient estimation of eigenvalue counts in an interval. Numerical Linear Algebra with Applications, 2016.
[10] Zhen Dong, Zhewei Yao, Yaohui Cai, Daiyaan Arfeen, Amir Gholami, Michael W. Mahoney, and Kurt Keutzer. HAWQ-V2: Hessian aware trace-weighted quantization of neural networks. Advances in Neural Information Processing Systems 33 (NeurIPS), 2020.
[11] Ernesto Estrada and Naomichi Hatano. Communicability in complex networks. Physical Review E, 77:036111, 2008.
[12] Arjun Singh Gambhir, Andreas Stathopoulos, and Kostas Orginos. Deflation as a method of variance reduction for estimating the trace of a matrix inverse. SIAM Journal on Scientific Computing, 39(2):A532–A558, 2017.
[13] Behrooz Ghorbani, Shankar Krishnan, and Ying Xiao. An investigation into neural net optimization via Hessian eigenvalue density. In Proceedings of the 36th International Conference on Machine Learning (ICML), volume 97, pages 2232–2241, 2019.
[14] Didier Girard. Un algorithme simple et rapide pour la validation croisée généralisée sur des problèmes de grande taille. Technical report, 1987.
[15] Insu Han, Prabhanjan Kambadur, Kyoungsoo Park, and Jinwoo Shin. Faster greedy MAP inference for determinantal point processes. In International Conference on Machine Learning, pages 1384–1393. PMLR, 2017.
[16] Insu Han, Dmitry Malioutov, Haim Avron, and Jinwoo Shin. Approximating the spectral sums of large-scale matrices using stochastic Chebyshev approximations. SIAM Journal on Scientific Computing, 2017.
[17] Insu Han, Dmitry Malioutov, and Jinwoo Shin. Large-scale log-determinant computation through stochastic Chebyshev expansions. In Proceedings of the 32nd International Conference on Machine Learning (ICML), pages 908–917, 2015.
[18] D. L. Hanson and F. T. Wright. A bound on tail probabilities for quadratic forms in independent random variables. Annals of Mathematical Statistics, 42(3):1079–1083, 1971.
[19] Hanyu Li and Yuanyang Zhu. Randomized block Krylov space methods for trace and log-determinant estimators. arXiv:2003.00212, 2020.
[20] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 770–778, 2016.
[21] Nicholas J. Higham. Functions of Matrices: Theory and Computation. Society for Industrial and Applied Mathematics, 2008.
[22] Michael F. Hutchinson. A stochastic estimator of the trace of the influence matrix for Laplacian smoothing splines. Communications in Statistics - Simulation and Computation, 19(2):433–450, 1990.
[23] Alex Krizhevsky, Geoffrey Hinton, et al. Learning multiple layers of features from tiny images. 2009.
[24] Jure Leskovec, Daniel Huttenlocher, and Jon Kleinberg. Predicting positive and negative links in online social networks. In Proceedings of the 19th International Conference on World Wide Web (WWW '10), pages 641–650, New York, NY, USA, 2010. Association for Computing Machinery.
[25] Jure Leskovec, Daniel Huttenlocher, and Jon Kleinberg. Signed networks in social media, pages 1361–1370. Association for Computing Machinery, New York, NY, USA, 2010.
[26] Lin Lin. Randomized estimation of spectral densities of large matrices made accurate. Numerische Mathematik, 136(1):183–213, 2017.
[27] Lin Lin, Yousef Saad, and Chao Yang. Approximating spectral densities of large matrices. SIAM Review, 58(1):34–65, 2016.
[28] Michael Mahoney and Charles Martin. Traditional and heavy tailed self regularization in neural network models. In Kamalika Chaudhuri and Ruslan Salakhutdinov, editors, Proceedings of the 36th International Conference on Machine Learning (ICML), volume 97, pages 4284–4293, 2019.
[29] Per-Gunnar Martinsson and Joel A. Tropp. Randomized numerical linear algebra: Foundations and algorithms. Acta Numerica, 29:403–572, 2020.
[30] Raphael A. Meyer, Cameron Musco, Christopher Musco, and David Woodruff. Hutch++: Optimal stochastic trace estimation. Proceedings of the 4th Symposium on Simplicity in Algorithms (SOSA), 2021.
[31] Cameron Musco, Praneeth Netrapalli, Aaron Sidford, Shashanka Ubaru, and David P. Woodruff. Spectrum approximation beyond fast matrix multiplication: Algorithms and hardness. Proceedings of the 9th Conference on Innovations in Theoretical Computer Science (ITCS), 2018.
[32] Barak A. Pearlmutter. Fast exact multiplication by the Hessian. Neural Computation, 6(1):147–160, 1994.
[33] Jeffrey Pennington, Samuel Schoenholz, and Surya Ganguli. The emergence of spectral universality in deep networks. In Proceedings of the 21st International Conference on Artificial Intelligence and Statistics (AISTATS), pages 1924–1932, 2018.
[34] Xu Qian, Victor Li, and Darren Crews. Channel-wise Hessian aware trace-weighted quantization of neural networks. arXiv preprint arXiv:2008.08284, 2020.
[35] Mark Rudelson and Roman Vershynin. Hanson-Wright inequality and sub-Gaussian concentration. Electronic Communications in Probability, 18, 2013.
[36] Arvind K. Saibaba, Alen Alexanderian, and Ilse C. F. Ipsen. Randomized matrix-free trace and log-determinant estimators. Numerische Mathematik, 137(2):353–395, 2017.
[37] Max Simchowitz, Ahmed El Alaoui, and Benjamin Recht. Tight query complexity lower bounds for PCA via finite sample deformed Wigner law. In Proceedings of the 50th Annual ACM Symposium on Theory of Computing (STOC), pages 1249–1259, 2018.
[38] Andreas Stathopoulos, Jesse Laeuchli, and Kostas Orginos. Hierarchical probing for estimating the trace of the matrix inverse on toroidal lattices. SIAM Journal on Scientific Computing, 35(5):S299–S322, 2013.
[39] Xiaoming Sun, David P. Woodruff, Guang Yang, and Jialin Zhang. Querying a matrix through matrix-vector products. In Proceedings of the 46th International Colloquium on Automata, Languages and Programming (ICALP), volume 132, pages 94:1–94:16, 2019.
[40] Jok M. Tang and Yousef Saad. Domain-decomposition-type methods for computing the diagonal of a matrix inverse. SIAM Journal on Scientific Computing, 33(5):2823–2847, 2011.
[41] Shashanka Ubaru and Yousef Saad. Applications of trace estimation techniques. In High Performance Computing in Science and Engineering, pages 19–33, 2018.
[42] Martin J. Wainwright. Basic tail and concentration bounds, pages 21–57. Cambridge Series in Statistical and Probabilistic Mathematics. Cambridge University Press, 2019.
[43] Sheng Wang, Yuan Sun, Christopher Musco, and Zhifeng Bao. Public transport planning: When transit network connectivity meets commuting demand. In Proceedings of the 2021 ACM SIGMOD International Conference on Management of Data, 2021.
[44] Mingwei Wei and David J. Schwab. How noise affects the Hessian spectrum in overparameterized neural networks. arXiv preprint arXiv:1910.00195, 2019.
[45] Alexander Weiße, Gerhard Wellein, Andreas Alvermann, and Holger Fehske. The kernel polynomial method. Reviews of Modern Physics, 78(1):275, 2006.
[46] Karl Wimmer, Yi Wu, and Peng Zhang. Optimal query complexity for estimating the trace of a matrix. In Proceedings of the 41st International Colloquium on Automata, Languages and Programming (ICALP), pages 1051–1062, 2014.
[47] Zhewei Yao, Amir Gholami, Kurt Keutzer, and Michael Mahoney. PyHessian: Neural networks through the lens of the Hessian. arXiv preprint arXiv:1912.07145, 2019.
[48] Zhewei Yao, Amir Gholami, Kurt Keutzer, and Michael W. Mahoney. PyHessian: Neural networks through the lens of the Hessian. In IEEE Big Data, 2020.
[49] Zhewei Yao, Amir Gholami, Sheng Shen, Kurt Keutzer, and Michael W. Mahoney. ADAHESSIAN: An adaptive second order optimizer for machine learning. AAAI Conference on Artificial Intelligence (AAAI), 2021.