# Average-Case Acceleration Through Spectral Density Estimation

Fabian Pedregosa *1  Damien Scieur *2

*Equal contribution. 1 Google Research. 2 Samsung SAIT AI Lab, Montreal. Correspondence to: Fabian Pedregosa.

Proceedings of the 37th International Conference on Machine Learning, Vienna, Austria, PMLR 119, 2020. Copyright 2020 by the author(s).

Abstract

We develop a framework for the average-case analysis of random quadratic problems and derive algorithms that are optimal under this analysis. This yields a new class of methods that achieve acceleration given a model of the Hessian's eigenvalue distribution. We develop explicit algorithms for the uniform, Marchenko-Pastur, and exponential distributions. These methods have a simple momentum-like update, in which each update only makes use of the current gradient and the previous two iterates. Furthermore, the momentum and step-size parameters can be estimated without knowledge of the Hessian's smallest singular value, in contrast with classical accelerated methods like Nesterov acceleration and Polyak momentum. Through empirical benchmarks on quadratic and logistic regression problems, we identify regimes in which the proposed methods improve over classical (worst-case) accelerated methods.

1. Introduction

The traditional analysis of optimization algorithms is a worst-case analysis (Nemirovski, 1995; Nesterov, 2004). This type of analysis provides a complexity bound for any input from a function class, no matter how unlikely. However, hard-to-solve inputs might rarely occur in practice, in which case these complexity bounds might not be representative of the observed running time. Average-case analysis provides instead the expected complexity of an algorithm over a class of problems, and is more representative of its typical behavior. While average-case analysis is standard for analyzing sorting (Knuth, 1997) and cryptography (Katz & Lindell, 2014) algorithms, little is known about the average-case complexity of optimization algorithms.

Figure 1. A worst-case analysis can lead to misleading results, where the worst-case running time is much worse than the observed running time. Above: convergence of individual random least squares problems in gray, while the average suboptimality (orange) is well below the worst-case bound (blue).

Our first contribution (Section 2) is to develop an average-case analysis of optimization methods on quadratic objectives. A crucial difference with the worst-case analysis is that it depends on the expected spectrum of the Hessian, rather than just on the extremal eigenvalues. While it is typically infeasible to assume knowledge of the full spectrum, recent works have shown that the spectrum of large deep learning models is highly predictable and can be approximated using classical models from random matrix theory, such as the Marchenko-Pastur distribution (Sagun et al., 2017; Martin & Mahoney, 2018).

A second contribution (Section 3) is to exploit this regularity to develop practical methods that are optimal under the average-case analysis. We consider different parametric models for the expected spectrum, such as the Marchenko-Pastur, uniform, and exponential distributions, and derive for each model an average-case optimal algorithm. These are all momentum-like methods in which the hyper-parameters of the distribution can be estimated without knowledge of the smallest eigenvalue. Finally, we compare these average-case optimal methods on synthetic and real datasets.
We identify regimes in which the average-case optimal methods exhibit a large computational gain with respect to traditional accelerated methods.

1.1. Related Work

We comment on two of the main ideas behind the proposed methods.

Polynomial-Based Iterative Methods. Our work draws heavily from the classical framework of polynomial-based iterative methods (Fischer, 1996), which can be traced back to the Chebyshev iterative method of Flanders & Shortley (1950) and was later instrumental in the development of the celebrated conjugate gradient method (Hestenes et al., 1952). More recently, this framework has been used to derive accelerated gossip algorithms (Berthier et al., 2018) and accelerated algorithms for smooth games (Azizian et al., 2020), to name a few. Although originally developed for the worst-case analysis of optimization algorithms, in this work we extend this framework to also analyze the average-case runtime. In concurrent work, Lacotte & Pilanci (2020) derive average-case optimal methods for a different class of methods, in which the update can be multiplied by a preconditioner matrix.

Spectral Density Estimation. The realization that deep learning networks behave as linear models in the infinite-width limit (Jacot et al., 2018; Novak et al., 2018) has sparked a renewed interest in the study of the spectral density of large matrices. The study of the spectral properties of these very large models is made possible thanks to improved tools (Ghorbani et al., 2019) and more precise theoretical models (Jacot et al., 2019; Pennington & Worah, 2017).

Notation. Throughout the paper we denote vectors in lowercase boldface ($x$) and matrices in uppercase boldface letters ($H$). Probability density functions and eigenvalues are written in Greek letters ($\mu$, $\lambda$), while polynomials are written in uppercase Latin letters ($P$, $Q$). We will sometimes omit the integration variable, with the understanding that $\int \varphi \, d\mu$ is shorthand for $\int \varphi(\lambda)\, d\mu(\lambda)$.

2. Average-Case Analysis

In this section we introduce the average-case analysis framework for random quadratic problems. The main result is Theorem 2.1, which relates the expected error to other quantities that are easier to manipulate, such as the residual polynomial. This is a convenient representation of an optimization method that will allow us, in the next section, to pose the problem of finding an optimal method as a best approximation problem in the space of polynomials.

Let $H \in \mathbb{R}^{d \times d}$ be a random symmetric positive-definite matrix and $x^\star \in \mathbb{R}^d$ a random vector. These elements determine the following (random) quadratic minimization problem

$f(x) \stackrel{\text{def}}{=} \tfrac{1}{2}(x - x^\star)^\top H (x - x^\star)$ .   (OPT)

Our goal is to quantify the expected error $\mathbb{E}\,\|x_t - x^\star\|^2$, where $x_t$ is the $t$-th update of a first-order method starting from $x_0$ and $\mathbb{E}$ is the expectation over the random $H$, $x_0$, $x^\star$.

Remark 1. Problem (OPT) subsumes the quadratic minimization problem $\min_x\, x^\top H x + b^\top x + c$, but the notation above will be more convenient for our purposes.

Remark 2. The expectation in the expected error $\mathbb{E}\,\|x_t - x^\star\|^2$ is over the inputs and not over any randomness of the algorithm, as would be common in the stochastic literature. In this paper we will only consider deterministic algorithms.

To solve (OPT), we will consider first-order methods. These are methods in which the sequence of iterates $x_t$ lies in the span of previous gradients, i.e.,

$x_{t+1} \in x_0 + \operatorname{span}\{\nabla f(x_0), \ldots, \nabla f(x_t)\}$ .   (1)
This class of algorithms includes for instance gradient descent and momentum, but not quasi-Newton methods, since the preconditioner could allow the iterates to go outside of the span. Furthermore, we will only consider oblivious methods, that is, methods in which the coefficients of the update are known in advance and don't depend on previous updates. This leaves out some methods that are specific to quadratic problems, like conjugate gradient.

From First-Order Methods to Polynomials. There is an intimate link between first-order methods and polynomials that simplifies the analysis on quadratic objectives. Using this link, we will be able to assign to each optimization method a polynomial that determines its convergence. The next proposition gives a precise statement.

Proposition 2.1 (Hestenes et al., 1952). Let $x_t$ be generated by a first-order method. Then there exists a polynomial $P_t$ of degree $t$ with $P_t(0) = 1$ that verifies

$x_t - x^\star = P_t(H)(x_0 - x^\star)$ .   (2)

Following Fischer (1996), we will refer to this polynomial $P_t$ as the residual polynomial.

Example 1 (Gradient descent). The residual polynomial associated with the gradient descent method has a remarkably simple form. Subtracting $x^\star$ on both sides of the gradient descent update $x_{t+1} = x_t - \gamma H(x_t - x^\star)$ gives

$x_{t+1} - x^\star = (I - \gamma H)(x_t - x^\star)$   (3)

and, iterating from $x_0$,

$x_t - x^\star = (I - \gamma H)^t (x_0 - x^\star)$ ,   (4)

and so the residual polynomial is $P_t(\lambda) = (1 - \gamma\lambda)^t$.

A convenient way to collect statistics on the spectrum of a matrix is through its empirical spectral distribution, which we now define.

Definition 3 (Empirical/Expected spectral distribution). Let $H$ be a random matrix with eigenvalues $\{\lambda_1, \ldots, \lambda_d\}$. The empirical spectral distribution of $H$, denoted $\mu_H$, is the probability measure

$\mu_H(\lambda) \stackrel{\text{def}}{=} \tfrac{1}{d}\sum_{i=1}^{d} \delta_{\lambda_i}(\lambda)$ ,   (5)

where $\delta_{\lambda_i}$ is the Dirac delta, a distribution equal to zero everywhere except at $\lambda_i$ and whose integral over the entire real line is equal to one. Since $H$ is random, the empirical spectral distribution $\mu_H$ is a random measure. Its expectation over $H$,

$\mu \stackrel{\text{def}}{=} \mathbb{E}_H[\mu_H]$ ,   (6)

is called the expected spectral distribution.

Example 2 (Marchenko-Pastur distribution). Consider a matrix $A \in \mathbb{R}^{n \times d}$, where each entry is an i.i.d. random variable with mean zero and variance $\sigma^2$. Then it is known that the expected spectral distribution of $H = \tfrac{1}{n} A^\top A$ converges to the Marchenko-Pastur distribution (Marchenko & Pastur, 1967) as $n, d \to \infty$ at a rate in which $d/n \to r \in (0, \infty)$. The Marchenko-Pastur distribution $\mu_{MP}$ is defined as

$d\mu_{MP}(\lambda) \stackrel{\text{def}}{=} \max\{1 - \tfrac{1}{r},\, 0\}\, \delta_0(\lambda) + \frac{\sqrt{(L - \lambda)(\lambda - \ell)}}{2\pi\sigma^2 r \lambda}\, 1_{\lambda \in [\ell, L]}\, d\lambda$ ,   (7)

where $\ell \stackrel{\text{def}}{=} \sigma^2(1 - \sqrt{r})^2$ and $L \stackrel{\text{def}}{=} \sigma^2(1 + \sqrt{r})^2$ are the extreme nonzero eigenvalues, and $\delta_0$ is a Dirac delta at zero (which disappears for $r \leq 1$).

In practice, the Marchenko-Pastur distribution is often a good approximation to the spectral distribution of high-dimensional models, even for data that might not verify the i.i.d. assumption, like the one in Figure 2.

Before presenting the main result of this section, we state one simplifying assumption on the initialization that we make throughout the rest of the paper.

Assumption 1. We assume that $x_0 - x^\star$ is independent of $H$ and

$\mathbb{E}\big[(x_0 - x^\star)(x_0 - x^\star)^\top\big] = R^2 I$ .   (8)

Figure 2. The eigenvalue distribution of high-dimensional problems can often be approximated with simple models. Above, a fit of the Marchenko-Pastur distribution (blue) to the histogram of the Hessian's eigenvalues (yellow) in a least squares problem on the News20 dataset (20k features; Keerthi & DeCoste, 2005).
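To make Example 2 and Figure 2 concrete, the short Python sketch below (an illustration added to this write-up, not code from the paper) samples a Gaussian data matrix, forms $H = \tfrac{1}{n}A^\top A$, and compares a histogram of its eigenvalues with the continuous part of the Marchenko-Pastur density in (7). The parameters $\sigma^2$ and $r$ are set by hand here; Section 5 discusses how to estimate them from data.

```python
import numpy as np

# Sketch: empirical spectral density of H = A^T A / n vs. the Marchenko-Pastur law (7).
rng = np.random.default_rng(0)
n, d, sigma2 = 2000, 1000, 1.0          # aspect ratio r = d / n = 0.5
r = d / n

A = np.sqrt(sigma2) * rng.standard_normal((n, d))
H = A.T @ A / n
eigs = np.linalg.eigvalsh(H)            # empirical spectrum of H

# Continuous part of the Marchenko-Pastur density on [l, L], cf. equation (7)
l = sigma2 * (1 - np.sqrt(r)) ** 2
L = sigma2 * (1 + np.sqrt(r)) ** 2
lam = np.linspace(l + 1e-6, L - 1e-6, 200)
mp_density = np.sqrt((L - lam) * (lam - l)) / (2 * np.pi * sigma2 * r * lam)

# Compare a normalized histogram of the eigenvalues with the MP density.
hist, edges = np.histogram(eigs, bins=50, range=(l, L), density=True)
centers = 0.5 * (edges[:-1] + edges[1:])
print("max |histogram - MP density| on the grid:",
      np.max(np.abs(hist - np.interp(centers, lam, mp_density))))
```

For $r < 1$ there is no atom at zero, so the histogram and the density should agree up to sampling noise, mirroring the fit shown in Figure 2.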
This assumption is verified for instance when both $x_0$ and $x^\star$ are drawn independently from a distribution with scaled identity covariance. It is also verified in a least squares problem of the form $\min_x \|Ax - b\|^2$, where the target vector verifies $b = Ax^\star$.¹

¹ After the first version of this paper appeared online, Paquette et al. (2020) generalized these results to least squares problems with a noisy target vector $b = Ax^\star + \eta$.

Theorem 2.1. Let $x_t$ be generated by a first-order method, associated to the polynomial $P_t$. Then we can decompose the expected error at iteration $t$ as

$\mathbb{E}\,\|x_t - x^\star\|^2 = \underbrace{R^2}_{\text{initialization}} \int \underbrace{P_t^2}_{\text{algorithm}}\, \underbrace{d\mu}_{\text{problem}}$ .   (9)

This identity represents the expected error of an algorithm in terms of three interpretable quantities:

1. The distance to the optimum at initialization enters through the constant $R$, which is the diagonal scaling in Assumption 1.
2. The optimization method enters in the formula through its residual polynomial $P_t$. The main purpose of the rest of the paper will be to find optimal choices for this polynomial.
3. The difficulty of the problem class enters through the expected spectral distribution $\mu \stackrel{\text{def}}{=} \mathbb{E}_H[\mu_H]$.

Remark 4. Although we will not use it in this paper, similar formulas can be derived for the objective and gradient suboptimality:

$\mathbb{E}[f(x_t) - f(x^\star)] = R^2 \int_{\mathbb{R}} P_t^2(\lambda)\, \lambda\, d\mu(\lambda)$   (10)
$\mathbb{E}\,\|\nabla f(x_t)\|^2 = R^2 \int_{\mathbb{R}} P_t^2(\lambda)\, \lambda^2\, d\mu(\lambda)$ .   (11)

3. Average-Case Acceleration

The framework developed in the previous section opens the door to exploring the question of optimality with respect to the average-case complexity. Does a method exist that is optimal under this analysis? If so, what is this method? The purpose of this section is to give a constructive answer. We first introduce some concepts from the theory of orthogonal polynomials that will be necessary to develop optimal methods.

Definition 5. Let $\alpha$ be a non-decreasing function such that $\int Q\, d\alpha$ is finite for all polynomials $Q$. We say that the sequence of polynomials $P_0, P_1, \ldots$ is orthogonal with respect to $d\alpha$ if $P_i$ has degree $i$ and

$\int P_i P_j\, d\alpha \;\begin{cases} = 0 & \text{if } i \neq j \\ > 0 & \text{if } i = j \end{cases}$   (12)

Furthermore, if they verify $P_i(0) = 1$ for all $i$, we call them residual orthogonal polynomials.

While a degree $t$ polynomial needs in principle $t$ scalars to be fully specified, residual orthogonal polynomials admit a compressed recursive representation in which each polynomial can be represented using only two scalars and the previous two residual orthogonal polynomials. This compressed representation is known as the three-term recurrence and will be crucial to ensure that optimal methods enjoy the low memory and low per-iteration cost of momentum methods. The following lemma states this property more precisely.

Lemma 6 (Three-term recurrence; Fischer, 1996, §2.4). Any sequence of residual orthogonal polynomials $P_1, P_2, \ldots$ verifies the following three-term recurrence

$P_t(\lambda) = (a_t + b_t\lambda)\, P_{t-1}(\lambda) + (1 - a_t)\, P_{t-2}(\lambda)$   (13)

for some scalars $a_t, b_t$, with $a_{-1} = a_0 = 1$ and $b_{-1} = 0$.

Remark 7. Although there exist algorithms to compute the coefficients $a_t$ and $b_t$ recursively (e.g., Fischer (1996, Algorithm 2.4.2)), numerical stability and computational costs make these methods unfeasible for our purposes. In Sections 4-6 we will see how to compute these coefficients for specific distributions $\mu$.

We now have all the ingredients to state the main result of this section: a simple algorithm with optimal average-case complexity.
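Before stating it, the decomposition of Theorem 2.1 can be sanity-checked numerically. The sketch below (an added illustration, not part of the paper) runs gradient descent, whose residual polynomial is $P_t(\lambda) = (1 - \gamma\lambda)^t$ (Example 1), on random Gaussian quadratics and compares a Monte Carlo estimate of $\mathbb{E}\,\|x_t - x^\star\|^2$ with the spectral expression $R^2\, \mathbb{E}\big[\sum_i P_t(\lambda_i)^2\big]$, which is the right-hand side of (9) written with the un-normalized spectral measure (i.e., up to the $1/d$ factor in Definition 3).

```python
import numpy as np

# Sketch: numerical check of the error decomposition of Theorem 2.1 for gradient
# descent, whose residual polynomial is P_t(lambda) = (1 - gamma * lambda)^t.
rng = np.random.default_rng(0)
n, d, t, gamma, R = 300, 50, 30, 0.05, 1.0

mc_error, spectral_term = [], []
for _ in range(200):
    A = rng.standard_normal((n, d))
    H = A.T @ A / n                                  # random Hessian
    x_star = rng.standard_normal(d)
    x0 = x_star + R * rng.standard_normal(d)         # E[(x0-x*)(x0-x*)^T] = R^2 I
    x = x0.copy()
    for _ in range(t):
        x = x - gamma * H @ (x - x_star)             # gradient step on (OPT)
    mc_error.append(np.sum((x - x_star) ** 2))

    eigs = np.linalg.eigvalsh(H)
    spectral_term.append(np.sum((1 - gamma * eigs) ** (2 * t)))

print("Monte Carlo  E||x_t - x*||^2      :", np.mean(mc_error))
print("Spectral form R^2 E[sum P_t(l_i)^2]:", R ** 2 * np.mean(spectral_term))
```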
Theorem 3.1. Let $P_t$ be the residual orthogonal polynomial of degree $t$ with respect to the weight function $\lambda\, d\mu(\lambda)$, and let $a_t, b_t$ be the constants associated with its three-term recurrence. Then the algorithm

$x_t = x_{t-1} + (1 - a_t)(x_{t-2} - x_{t-1}) + b_t\, \nabla f(x_{t-1})$   (14)

has the smallest expected error $\mathbb{E}\,\|x_t - x^\star\|^2$ over the class of oblivious first-order methods. Moreover, its expected error is

$\mathbb{E}\,\|x_t - x^\star\|^2 = R^2 \int_{\mathbb{R}} P_t^2\, d\mu$ .   (15)

Remark 8. Algorithm (14), although optimal over the space of all first-order methods, does not require storage of previous gradients. Instead, it has a very convenient momentum-like form and only requires storing two $d$-dimensional vectors.

Remark 9 (Relationship with conjugate gradient). The derivation of the proposed method bears a strong resemblance with the conjugate gradient method (Hestenes et al., 1952). One key conceptual difference is that conjugate gradient constructs the optimal polynomial for the empirical spectral distribution, while we construct a polynomial that is optimal only for the expected spectral distribution. A more practical advantage is that the proposed method is more amenable to non-quadratic minimization.

The previous theorem gives a recipe to construct an optimal algorithm from a sequence of residual orthogonal polynomials. However, in practice we may not have access to the expected spectral distribution $\mu$, let alone its sequence of residual orthogonal polynomials. The next sections are devoted to solving this problem through different parametric assumptions on the spectral distribution: exponential (Section 4), Marchenko-Pastur (Section 5), and uniform (Section 6).

4. Optimal Method under the Exponential Distribution

In this section we assume that the expected spectral distribution follows an exponential distribution,

$d\mu(\lambda) = \lambda_0\, e^{-\lambda_0 \lambda}\, d\lambda, \quad \lambda \in [0, \infty)$ ,   (16)

where $\lambda_0$ is a free parameter of the distribution. In this case the largest eigenvalue (i.e., the Lipschitz constant of the gradient) is unbounded, and so this setting can be identified with the convex but non-smooth setting.

A consequence of Theorem 3.1 is that deriving the optimal algorithm is equivalent to finding the sequence of residual orthogonal polynomials with respect to the weight function $\lambda\, d\mu(\lambda)$. Orthogonal (non-residual) polynomials for this weight function, when $d\mu$ is an exponential distribution, have been studied under the name of generalized Laguerre polynomials (Abramowitz et al., 1972, p. 781). There is a constant factor between the orthogonal and residual orthogonal polynomials, and so the latter can be constructed from the former by finding the appropriate normalization that makes the polynomial residual. This normalization is given by the following lemma.

Lemma 10. The sequence of scaled Laguerre polynomials

$P_0(\lambda) = 1, \quad P_1(\lambda) = 1 - \tfrac{\lambda_0 \lambda}{2}$ ,
$P_t(\lambda) = \Big(\tfrac{2t}{t+1} - \tfrac{\lambda_0 \lambda}{t+1}\Big)\, P_{t-1}(\lambda) - \tfrac{t-1}{t+1}\, P_{t-2}(\lambda)$

is a family of residual orthogonal polynomials with respect to the measure $\lambda\, \lambda_0 e^{-\lambda_0\lambda}\, d\lambda$.

From the above lemma, we can derive the method with the best expected error for the decaying exponential distribution. The resulting algorithm is surprisingly simple:

Decaying Exponential Acceleration
Input: initial guess $x_0$, parameter $\lambda_0 > 0$.
Algorithm: iterate over $t = 1, 2, \ldots$
  $x_t = x_{t-1} + \tfrac{t-1}{t+1}(x_{t-1} - x_{t-2}) - \tfrac{\lambda_0}{t+1}\, \nabla f(x_{t-1})$   (EXP)

Remark 11. Under this distribution, and unlike the Marchenko-Pastur model of the next section, the largest eigenvalue is not bounded, so this method is more akin to subgradient descent than to gradient descent. Note that, because of this, the step-size is decreasing.
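As a concrete illustration (not from the paper), the sketch below applies the (EXP) update to a toy least squares problem, fitting $\lambda_0 = d/\operatorname{tr}(H)$ as in the parameter-estimation paragraph below; the problem instance, the scaling $f(x) = \|Ax - b\|^2 / (2n)$ and the iteration count are arbitrary choices made for the example.

```python
import numpy as np

# Sketch of the decaying exponential acceleration (EXP) on a least squares problem.
# Toy illustration, not the authors' code; it implements the update
#   x_t = x_{t-1} + (t-1)/(t+1) (x_{t-1} - x_{t-2}) - lambda0/(t+1) grad f(x_{t-1}).
rng = np.random.default_rng(0)
n, d = 500, 200
A = rng.standard_normal((n, d))
b = A @ rng.standard_normal(d)
H = A.T @ A / n

def grad(x):
    return A.T @ (A @ x - b) / n          # gradient of f(x) = ||Ax - b||^2 / (2n)

lambda0 = d / np.trace(H)                 # fit lambda0 = d / tr(H)
x_prev = np.zeros(d)
x = np.zeros(d)
for t in range(1, 501):
    momentum = (t - 1) / (t + 1)
    step = lambda0 / (t + 1)              # decreasing step-size, cf. Remark 11
    x, x_prev = x + momentum * (x - x_prev) - step * grad(x), x

print("final gradient norm:", np.linalg.norm(grad(x)))
```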
Remark 12. Note the striking similarity with the averaged gradient method of Polyak & Juditsky (1992), in which the iterates of plain gradient descent, $\theta_t = \theta_{t-1} - \gamma \nabla f(\theta_{t-1})$, are averaged to produce $x_t$. This algorithm admits the following momentum-based form, derived by Flammarion & Bach (2015, §2.2):

$y_t = t\, x_{t-1} - (t-1)\, x_{t-2}$ ,
$x_t = x_{t-1} + \tfrac{t-1}{t+1}(x_{t-1} - x_{t-2}) - \tfrac{\gamma}{t+1}\, \nabla f(y_t)$ .

The difference between (EXP) and this formula is where the gradient is computed. In (EXP), the gradient is evaluated at the current iterate $x_{t-1}$, while in Polyak averaging the gradient is computed at the extrapolated iterate $y_t$.

Parameters Estimation. The exponential distribution has a free parameter $\lambda_0$. Since the expected value of this distribution is $1/\lambda_0$, we can estimate the parameter $\lambda_0$ from the sample mean of the eigenvalues, which is $\tfrac{1}{d}\operatorname{tr}(H)$. Hence, we fit this parameter as $\lambda_0 = d/\operatorname{tr}(H)$.

4.1. Rate of Convergence

For this algorithm we are able to give a simple expression for the expected error. The next lemma shows that it converges at rate $O(1/t)$.

Lemma 13. If we apply Algorithm (EXP) to problem (OPT), where the spectral density of $H$ is the decaying exponential $d\mu(\lambda) = e^{-\lambda_0 \lambda}\, d\lambda$, then

$\mathbb{E}\,\|x_t - x^\star\|^2 = \frac{R^2}{\lambda_0 (t+1)}$ .   (18)

In the case of convex non-smooth functions, the optimal algorithm achieves a (worst-case) rate of $O(1/\sqrt{t})$ (see for instance Nesterov (2009)). We thus achieve acceleration by assuming that the function is quadratic with the exponential as average spectral density.

5. Optimal Method under the Marchenko-Pastur Distribution

In this section we derive the optimal method under the Marchenko-Pastur distribution $\mu_{MP}$ introduced in Example 2. As in the previous section, the first step is to construct a sequence of residual orthogonal polynomials.

Theorem 5.1. The following sequence of polynomials is orthogonal with respect to the weight function $\lambda\, d\mu_{MP}(\lambda)$: with $\rho = \tfrac{1+r}{\sqrt{r}}$, $\delta_0 = 0$ and $\delta_t = -(\rho + \delta_{t-1})^{-1}$,

$P_0(\lambda) = 1, \quad P_1(\lambda) = 1 - \frac{\lambda}{(1+r)\sigma^2}$ ,
$P_t(\lambda) = \Big(-\rho\delta_t + \frac{\delta_t}{\sigma^2\sqrt{r}}\,\lambda\Big)\, P_{t-1}(\lambda) + (1 + \rho\delta_t)\, P_{t-2}(\lambda)$ .

We show in Appendix C.2 that these polynomials are shifted Chebyshev polynomials of the second kind. Surprisingly, it is the Chebyshev polynomials of the first kind that are used to minimize over the worst case. From the optimal polynomial, we can write the optimal algorithm for (OPT) when the spectral density of $H$ is the Marchenko-Pastur distribution. By using Theorem 3.1, we obtain the Marchenko-Pastur accelerated method, described in Algorithm (MP-OPT).

Marchenko-Pastur Acceleration
Input: initial guess $x_0$, MP parameters $r$, $\sigma^2$.
Preprocessing: $\rho = \tfrac{1+r}{\sqrt{r}}$, $\delta_0 = 0$, $x_1 = x_0 - \tfrac{1}{(1+r)\sigma^2}\, \nabla f(x_0)$.
Algorithm (Optimal): iterate over $t = 1, 2, \ldots$
  $\delta_t = -(\rho + \delta_{t-1})^{-1}$ ,
  $x_t = x_{t-1} + (1 + \rho\delta_t)(x_{t-2} - x_{t-1}) + \frac{\delta_t}{\sigma^2\sqrt{r}}\, \nabla f(x_{t-1})$   (MP-OPT)
Algorithm (Asymptotic variant): iterate over $t = 1, 2, \ldots$
  $x_t = x_{t-1} - \min\{r^{-1}, r\}(x_{t-2} - x_{t-1}) - \frac{1}{\sigma^2}\min\{1, r^{-1}\}\, \nabla f(x_{t-1})$   (MP-ASY)
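The sketch below (an illustration written for this note, using the form of (MP-OPT) given above; not the authors' code) applies the Marchenko-Pastur accelerated method to a synthetic least squares problem whose spectrum approximately follows the MP law. At $t = 1$ the momentum coefficient $1 + \rho\delta_1$ vanishes and the gradient coefficient equals $-1/((1+r)\sigma^2)$, so the first iteration reproduces the preprocessing step of the box.

```python
import numpy as np

# Sketch of the Marchenko-Pastur accelerated method (MP-OPT) on a least squares
# problem; a toy illustration, not the authors' implementation.
rng = np.random.default_rng(0)
n, d, sigma2 = 800, 400, 1.0
r = d / n
A = np.sqrt(sigma2) * rng.standard_normal((n, d))
b = A @ rng.standard_normal(d)

def grad(x):
    return A.T @ (A @ x - b) / n          # gradient of f(x) = ||Ax - b||^2 / (2n)

rho = (1 + r) / np.sqrt(r)
delta = 0.0
x_prev = np.zeros(d)
x = x_prev.copy()
for t in range(1, 201):
    delta = -1.0 / (rho + delta)                     # delta_t = -(rho + delta_{t-1})^{-1}
    momentum = 1 + rho * delta                       # coefficient of (x_{t-2} - x_{t-1})
    step = delta / (sigma2 * np.sqrt(r))             # coefficient of grad f(x_{t-1})
    # At t = 1: momentum = 0 and step = -1/((1+r) sigma^2), the preprocessing step.
    x, x_prev = x + momentum * (x_prev - x) + step * grad(x), x

print("final gradient norm:", np.linalg.norm(grad(x)))
```

Asymptotically, the coefficients approach those of (MP-ASY): for $r < 1$ the coefficient of $(x_{t-2} - x_{t-1})$ tends to $-r$ and the gradient coefficient to $-1/\sigma^2$, which matches the classical Polyak momentum tuning for the interval $[\ell, L]$ of (7).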
Figure 3. Residual polynomials for gradient descent, the minimax (worst-case) optimal polynomial (Chebyshev of the first kind), and the optimal polynomial for the MP distribution, overlaid with an MP distribution in dashed lines. Gradient descent has the highest value at the edges, which is in line with its known suboptimal dependency on the condition number. The MP polynomial is closer to zero than the Chebyshev polynomial except near the borders, where the Chebyshev polynomial is optimal by construction.

It is instructive to compare the residual polynomials of different methods to better understand their convergence properties. In Figure 3 we plot the residual polynomials for gradient descent, its classical worst-case accelerated variant (Chebyshev polynomials) and its average-case accelerated variant (under the MP distribution). In numbers, the worst-case convergence rate (maximum value on the interval) is $0.88^t$ for Chebyshev and $0.93^t$ for MP, so Chebyshev's worst-case rate is 1.7 times smaller. However, the expected rate is $0.67^t$ for MP and $0.74^t$ for Chebyshev, making MP 1.4 times faster on average.

Asymptotic Behavior. The step-size and momentum terms of the Marchenko-Pastur accelerated method (MP-OPT) depend on the time-varying parameter $\delta_t$. It is possible to compute the limit of these coefficients as $t \to \infty$. This requires solving $\delta_\infty = -(\rho + \delta_\infty)^{-1}$, thus $\delta_\infty = -\sqrt{r}$ or $\delta_\infty = -1/\sqrt{r}$. The sequence $\delta_t$ converges to $-\sqrt{r}$ when $r < 1$, otherwise it converges to $-1/\sqrt{r}$. Replacing the value of $\delta_t$ in Algorithm (MP-OPT) by its asymptotic value gives the simplified variant (MP-ASY). Algorithm (MP-OPT) thus corresponds to gradient descent with time-varying momentum and step-size terms, all converging to constants, which yields the simpler Algorithm (MP-ASY). However, even if we assume that the spectral density of the Hessian is the Marchenko-Pastur distribution, we still need to estimate the hyper-parameters $\sigma$ and $r$. The next paragraph proposes a way to estimate those parameters.

Hyper-Parameters Estimation. Algorithms (MP-OPT) and (MP-ASY) require knowledge of the parameters $r$ and $\sigma$. To ensure convergence, it is desirable to scale the parameters such that the largest eigenvalue $\lambda_{\max}$ lies inside the support of the distribution, which is bounded by $\sigma^2(1+\sqrt{r})^2$. This gives us one equation to set the two parameters. We obtain another equation by matching the first moment of the MP distribution, which can be estimated from the trace of $H$. More formally, the two conditions read

$\lambda_{\max}(H) = \sigma^2(1+\sqrt{r})^2$ , $\quad \tfrac{1}{d}\operatorname{tr}(H) = \sigma^2$ .   (20)

With the notation $\tau \stackrel{\text{def}}{=} \tfrac{1}{d}\operatorname{tr}(H)$ and $L \stackrel{\text{def}}{=} \lambda_{\max}(H)$, the parameters $\sigma^2$ and $r$ are given by

$\sigma^2 = \tau$ , $\quad r = \big(\sqrt{L/\tau} - 1\big)^2$ .   (21)

6. Optimal Method under the Uniform Distribution

We now focus on the (unnormalized) uniform distribution over $[\ell, L]$: the weight function is equal to 1 if $\lambda \in [\ell, L]$ and 0 otherwise. We show in Appendix C.3 that a sequence of residual orthogonal polynomials with respect to this density is a sequence of shifted Legendre polynomials. Legendre polynomials are orthogonal with respect to the uniform distribution on $[-1, 1]$ and are defined by the recurrence

$t\, Q_t(\lambda) = (2t-1)\,\lambda\, Q_{t-1}(\lambda) - (t-1)\, Q_{t-2}(\lambda)$ .   (22)

We detail in Appendix C.3 the necessary translation and normalization steps to obtain the sequence of residual orthogonal polynomials.

Uniform Acceleration
Input: initial guess $x_0 = x_1$, parameters $\ell$ and $L$. Init: $e_{-1} = m_0 = 0$.
Algorithm: iterate over $t = 1, 2, \ldots$
  $e_t = \frac{(L-\ell)^2 t^2}{4\, d_t\, (4t^2 - 1)}$ ,
  $m_t = -\big(d_t - e_t + m_{t-1} d_t e_{t-1}\big)^{-1}$ ,   (UNIF)
  $x_t = x_{t-1} + \big(1 + m_t(d_t - e_t)\big)(x_{t-2} - x_{t-1}) + m_t\, \nabla f(x_{t-1})$

As for the Marchenko-Pastur accelerated gradient, the parameters $\ell$ and $L$ can be estimated through the moments of the uniform distribution.
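As one concrete instantiation of this moment matching (an added sketch, not the paper's procedure), the edges $\ell$ and $L$ of the uniform model can be fitted from the first two spectral moments of $H$, using that a uniform distribution on $[\ell, L]$ has mean $(\ell + L)/2$ and variance $(L - \ell)^2/12$:

```python
import numpy as np

# Sketch: fitting the uniform model's edges (l, L) by matching the first two
# spectral moments of H; one possible instantiation of the moment matching
# mentioned above (an added illustration, not the authors' code).
rng = np.random.default_rng(0)
d = 500
true_l, true_L = 0.5, 4.0
eigs = rng.uniform(true_l, true_L, size=d)          # spectrum drawn from Uniform[l, L]
Q, _ = np.linalg.qr(rng.standard_normal((d, d)))
H = (Q * eigs) @ Q.T                                # random H with that spectrum

m1 = np.trace(H) / d                                # (1/d) tr(H)   ~ (l + L) / 2
m2 = np.trace(H @ H) / d                            # (1/d) tr(H^2) ~ (l^2 + l*L + L^2) / 3
half_width = np.sqrt(3 * max(m2 - m1 ** 2, 0.0))    # (L - l) / 2 from the variance
l_hat, L_hat = m1 - half_width, m1 + half_width

print("l:", true_l, "estimated", round(l_hat, 3))
print("L:", true_L, "estimated", round(L_hat, 3))
```

The Marchenko-Pastur parameters can be fitted analogously from $\lambda_{\max}(H)$ and $\tfrac{1}{d}\operatorname{tr}(H)$, as in (20)-(21).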
7. Experiments

We compare the proposed methods and classical accelerated methods in settings with varying degrees of mismatch with our assumptions. We first compare them on quadratics generated from a synthetic dataset, where the empirical spectral density is (approximately) a Marchenko-Pastur distribution. We then compare these methods on another quadratic problem, generated using two non-synthetic datasets, where the MP assumption breaks down. Finally, we compare some of the applicable methods on a logistic regression problem. We will see that the proposed methods perform reasonably well in this scenario, even though it is far from their original deterministic quadratic setting. A full description of datasets and methods can be found in Appendix D.

Methods. We consider worst-case (solid lines) and average-case (dashed lines) accelerated methods. The method Modified Polyak runs Polyak momentum in the strongly convex case and defaults to the momentum method of Ghadimi et al. (2015) in the non-strongly convex regime.

Synthetic Quadratics. We consider the least squares problem with objective function $f(x) = \|Ax - b\|^2$, where each entry of $A \in \mathbb{R}^{n \times d}$ and $b \in \mathbb{R}^n$ is generated from an i.i.d. Gaussian distribution. Using different ratios $d/n$ we generate problems that are convex ($\ell = 0$) and strongly convex ($\ell > 0$).

Non-Synthetic Quadratics. The last two columns of Figure 4 compare the same methods, this time on two real (non-synthetic) datasets. We use two UCI datasets: digits² ($n = 1797$, $d = 64$) and breast cancer³ ($n = 569$, $d = 32$).

Logistic Regression. Using synthetic i.i.d. data, we compare the proposed methods on a logistic regression problem. We generate two datasets, the first with $n > d$ and the second with $n < d$. Results are shown in Figure 5.

7.1. Discussion

Importance of Spectral Distributions. Average-case accelerated methods exhibit the largest gain when the eigenvalues follow the assumed expected spectral distribution. This is the case for the Marchenko-Pastur acceleration in the first two columns of Figure 4, where its convergence rate is almost identical to that of conjugate gradient, a method that is optimal for the empirical (instead of expected) spectral distribution.

Beyond Quadratic Optimization. Results on the 4 different logistic regression problems were mixed. On the one hand, as for quadratic objectives, the Uniform acceleration method has a good overall performance. However, contrary to the quadratic optimization results, Marchenko-Pastur acceleration has only mediocre performance in 3 out of 4 experiments.

² https://archive.ics.uci.edu/ml/datasets/Optical+Recognition+of+Handwritten+Digits
³ https://archive.ics.uci.edu/ml/datasets/Breast+Cancer+Wisconsin+%28Diagnostic%29

Figure 4. Benchmark on quadratic objectives. The top row shows the Hessians' spectral densities overlaid with the MP distribution, while the bottom row shows the objective suboptimality as a function of the number of iterations. The first and second columns show the strongly convex and convex (respectively) synthetic problems with MP distribution. In all plots, the Marchenko-Pastur accelerated algorithm performs better than the classical worst-case optimal method (Modified Polyak), and is close in performance to conjugate gradient, which is optimal for the empirical (instead of expected) spectral density. Note from the first plot that the Marchenko-Pastur method maintains a linear convergence rate even in the presence of zero eigenvalues.
Figure 5. Benchmark on logistic regression. Objective suboptimality as a function of the number of iterations on a logistic regression problem. Similarly to the results for quadratic objectives, we observe that average-case accelerated methods have an overall good performance, with the Uniform acceleration being the best in 3 out of 4 problems. This provides some empirical evidence that average-case accelerated methods are applicable to problems beyond their original quadratic setting.

These results should also be contrasted with the fact that we did not adapt the methods to the non-quadratic setting. For example, a first step would be to restart the algorithm to accommodate a varying Hessian, or to perform a backtracking line-search to choose the magnitude of the update. We leave these adaptations as a subject for future work.

8. Conclusion and Future Work

In this work we first developed an average-case analysis of optimization algorithms, and then used it to develop a family of novel algorithms that are optimal under this average-case analysis. We outline potential future work directions.

Modeling outlier eigenvalues. Often the MP distribution fails to model accurately the outlier eigenvalues that arise in real data (e.g., the digits and breast cancer problems in Figure 4). A potential area for improvement is to consider distributions that can model these outlier eigenvalues.

Non-quadratic and stochastic extension. As seen in Figure 5, the methods are applicable and perform well beyond the quadratic setting in which they were conceived. We currently lack the theory to explain this observation.

Convergence rates and asymptotic behavior. We have only been able to derive the average-case rate for the decaying exponential distribution. It would be interesting to derive average-case convergence rates for more methods. Regarding the asymptotic behavior, we have noted that some average-case optimal methods like Marchenko-Pastur acceleration converge asymptotically to Polyak momentum. After the first version of this work appeared online, Scieur & Pedregosa (2020) showed that this is the case for almost all average-case optimal methods.

Acknowledgements

We would like to thank our colleagues Adrien Taylor for thought-provoking early discussions; Cristóbal Guzmán, Jeffrey Pennington, Francis Bach, Pierre Gaillard and Raphaël Berthier for many pointers and discussions; and Hossein Mobahi, Nicolas Le Roux, Courtney Paquette, Geoffrey Negiar, Nicolas Loizou, Rémi Le Priol and Reza Babanezhad for fruitful discussions and feedback on the manuscript.

References

Abramowitz, M., Stegun, I. A., et al. Handbook of mathematical functions with formulas, graphs, and mathematical tables, 1972.

Azizian, W., Scieur, D., Mitliagkas, I., Lacoste-Julien, S., and Gidel, G. Accelerating smooth games by manipulating spectral shapes. arXiv preprint arXiv:2001.00602, 2020.

Bai, Z. and Silverstein, J. W. Spectral analysis of large dimensional random matrices. Springer, 2010.

Berthier, R., Bach, F., and Gaillard, P. Accelerated gossip in networks of given dimension using Jacobi polynomial iterations. arXiv preprint arXiv:1805.08531, 2018.

Fischer, B. Polynomial based iteration methods for symmetric linear systems. Springer, 1996.
Flammarion, N. and Bach, F. From averaging to acceleration, there is only a step-size. In Conference on Learning Theory, 2015.

Flanders, D. A. and Shortley, G. Numerical determination of fundamental modes. Journal of Applied Physics, 1950.

Gautschi, W. Orthogonal polynomials: applications and computation. Acta Numerica, 1996.

Ghadimi, E., Feyzmahdavian, H. R., and Johansson, M. Global convergence of the heavy-ball method for convex optimization. In 2015 European Control Conference (ECC). IEEE, 2015.

Ghorbani, B., Krishnan, S., and Xiao, Y. An investigation into neural net optimization via Hessian eigenvalue density. In Proceedings of the 36th International Conference on Machine Learning (ICML), 2019.

Hestenes, M. R. and Stiefel, E. Methods of conjugate gradients for solving linear systems. Journal of Research of the National Bureau of Standards, 1952.

Jacot, A., Gabriel, F., and Hongler, C. Neural tangent kernel: Convergence and generalization in neural networks. In Advances in Neural Information Processing Systems, 2018.

Jacot, A., Gabriel, F., and Hongler, C. The asymptotic spectrum of the Hessian of DNN throughout training. arXiv preprint arXiv:1910.02875, 2019.

Katz, J. and Lindell, Y. Introduction to modern cryptography. CRC Press, 2014.

Keerthi, S. S. and DeCoste, D. A modified finite Newton method for fast solution of large scale linear SVMs. Journal of Machine Learning Research, 2005.

Knuth, D. E. The art of computer programming, volume 3. Pearson Education, 1997.

Lacotte, J. and Pilanci, M. Optimal randomized first-order methods for least-squares problems. arXiv preprint arXiv:2002.09488, 2020.

Marchenko, V. A. and Pastur, L. A. Distribution of eigenvalues for some sets of random matrices. Matematicheskii Sbornik, 1967.

Martin, C. H. and Mahoney, M. W. Implicit self-regularization in deep neural networks: Evidence from random matrix theory and implications for learning. arXiv preprint arXiv:1810.01075, 2018.

Nemirovski, A. Information-based complexity of convex programming. Lecture Notes, 1995.

Nesterov, Y. Introductory lectures on convex optimization. Springer, 2004.

Nesterov, Y. Primal-dual subgradient methods for convex problems. Mathematical Programming, 2009.

Novak, R., Xiao, L., Lee, J., Bahri, Y., Yang, G., Hron, J., Abolafia, D. A., Pennington, J., and Sohl-Dickstein, J. Bayesian deep convolutional networks with many channels are Gaussian processes. arXiv preprint arXiv:1810.05148, 2018.

Paquette, C., van Merriënboer, B., and Pedregosa, F. Halting time is predictable for large models: A universality property and average-case analysis. arXiv preprint arXiv:2006.04299, 2020.

Pennington, J. and Worah, P. Nonlinear random matrix theory for deep learning. In Advances in Neural Information Processing Systems, 2017.

Polyak, B. T. and Juditsky, A. B. Acceleration of stochastic approximation by averaging. SIAM Journal on Control and Optimization, 1992.

Sagun, L., Evci, U., Guney, V. U., Dauphin, Y., and Bottou, L. Empirical analysis of the Hessian of over-parametrized neural networks. arXiv preprint arXiv:1706.04454, 2017.

Scieur, D. and Pedregosa, F. Universal average-case optimality of Polyak momentum. In Proceedings of the 37th International Conference on Machine Learning (ICML), 2020.

Tao, T. Topics in random matrix theory, volume 132. American Mathematical Society, 2012.