# Universal Average-Case Optimality of Polyak Momentum

Damien Scieur* (Samsung SAIT AI Lab, Montreal), Fabian Pedregosa* (Google Research). *Equal contribution.

Abstract

Polyak momentum (PM), also known as the heavy-ball method, is a widely used optimization method that enjoys an asymptotically optimal worst-case complexity on quadratic objectives. However, its remarkable empirical success is not fully explained by this optimality, as the worst-case analysis, contrary to the average-case one, is not representative of the expected complexity of an algorithm. In this work we establish a novel link between PM and the average-case analysis. Our main contribution is to prove that any optimal average-case method converges, as the number of iterations grows, to PM, under mild assumptions. This brings a new perspective on this classical method, showing that PM is asymptotically both worst-case and average-case optimal.

1. Introduction

Polyak momentum (PM), also known as the heavy-ball method, is a widely used optimization method. Originally developed to solve linear equations (Frankel, 1950; Rutishauser, 1959), it was generalized to smooth functions and popularized in the optimization community by Boris Polyak (Polyak, 1964; 1987). This method has seen a renewed interest in recent years, as its stochastic variant, which replaces the gradient with a stochastic estimate, is effective on deep learning problems (Sutskever et al., 2013; Zhang et al., 2020).

PM also enjoys a locally optimal rate of convergence for strongly convex and twice differentiable objectives. As is common within the optimization literature, this optimality is relative to the worst-case analysis, which provides complexity bounds for any input from a function class, no matter how unlikely. Despite its widespread use, the worst-case is not representative of the typical behavior of optimization methods. The simplex method, for example, has a worst-case exponential complexity that becomes polynomial when considering the average-case (Spielman & Teng, 2004).

A more representative analysis of the typical behavior is given by the average-case complexity, which averages the algorithm's complexity over all possible inputs. The average-case analysis is standard for analyzing sorting (Knuth, 1997) and cryptography (Katz & Lindell, 2014) algorithms, to name a few. However, little is known of the average-case complexity of optimization algorithms, whose analysis depends on the often unknown probability distribution over the inputs.

The recent work of Pedregosa & Scieur (2020); Lacotte & Pilanci (2020) overcame this dependency on the input probability distribution through the use of random matrix theory techniques. In the same papers, the authors noticed the convergence of some optimal average-case methods to PM as the number of iterations grows (see Figure 1). This is rather surprising given their crucial differences. For instance, average-case optimal methods use knowledge of the full spectral distribution, while PM only requires knowledge of its edges (i.e., smallest and largest eigenvalue).
Since this convergence was only shown for specific methods, it raises the question of whether this is a spurious phenomenon or if it holds more generally: As the number of iterations grows, all average-case optimal methods converge to Polyak momentum. The main contribution of this paper is to give a positive answer to this conjecture. The main, but not so restrictive, assumption is that the probability density function of the eigenvalues is non-zero on the interval containing its support. With this we can show the previously unknown property that PM is asymptotically optimal under the average-case analysis, bringing a new perspective on the remarkable empirical performance of this classical method. Furthermore, this statement is universal, i.e., independent of the probability distribution over the inputs.

Figure 1. Convergence of optimal average-case methods to Polyak momentum. For the Marchenko-Pastur and uniform distributions of eigenvalues (left), we construct the method that has optimal average-case complexity and plot the momentum (middle) and step-size (right) parameters. For the two methods considered, the momentum and step-size parameters converge as the number of iterations grows to those of Polyak momentum, displayed here as a straight line.

1.1. Related work

This work draws from the fields of optimization, complexity analysis and orthogonal polynomials, of which we comment on the most closely related ideas.

Average-case analysis. The average-case analysis has a long history in computer science and numerical analysis. Often it is used to justify the superior performance of algorithms such as Quicksort (Hoare, 1962) and the simplex method in linear programming (Spielman & Teng, 2004). Despite this rich history, it is challenging to transfer these ideas into continuous optimization due to the ill-defined notion of a typical continuous optimization problem.

In the context of optimization, Pedregosa & Scieur (2020) derived a framework for analyzing the average-case complexity of gradient-based methods and developed algorithms that are non-asymptotically optimal with respect to the average-case. Such average-case analysis finds applications in various domains. For instance, Lacotte & Pilanci (2020) use this framework to derive optimal average-case algorithms to minimize least-squares with random matrix sketching. Prior to this stream of papers, Berthier et al. (2018) used methods based on Jacobi polynomials to design average-case optimal gossip methods, but without generalizing the framework. In the numerical analysis literature, Deift & Trogdon (2019) have recently developed an average-case complexity analysis of conjugate gradient.

Asymptotics of orthonormal polynomials. A key ingredient of the proof is the asymptotics of orthonormal polynomials. This is a vast subject with applications in stochastic processes (Grenander & Szegő, 1958), random matrix theory (Deift, 1999) and numerical integration (Mhaskar, 1997), to name a few. The monograph of Lubinsky (2000) discusses all results used in this paper.

Notation. Throughout the paper we denote vectors in lowercase boldface (x), matrices in uppercase boldface letters (H), and polynomials in uppercase Latin letters (P, Q). We will sometimes omit the integration variable, with the understanding that $\int \varphi \, d\mu$ is a shorthand for $\int \varphi(\lambda) \, d\mu(\lambda)$.
2. Average-Case Analysis of Gradient-Based Methods

The goal of the average-case analysis is to quantify the expected error $\mathbb{E}\,\|x_t - x^\star\|^2$, where $x_t$ is the $t$-th update of some optimization method and the expectation is taken over all possible problem instances. To make this analysis tractable, and following (Pedregosa & Scieur, 2020), we consider quadratic optimization problems of the form
$$\min_{x \in \mathbb{R}^d} \Big\{ f(x) \;\stackrel{\text{def}}{=}\; \tfrac{1}{2}(x - x^\star)^\top H (x - x^\star) \Big\}\,, \tag{OPT}$$
where $H \in \mathbb{R}^{d \times d}$ is a random symmetric positive-definite matrix and $x^\star$ is a random $d$-dimensional vector which is a solution of (OPT).

Remark 1. Problem (OPT) subsumes the quadratic minimization problem $\min_x\, x^\top H x + b^\top x + c$, but the notation above will be more convenient for our purposes.

Remark 2. The expectation in $\mathbb{E}\,\|x_t - x^\star\|^2$ is over the inputs and not over any randomness of the algorithm, as is common in the stochastic literature. In this paper we only consider deterministic algorithms.

We consider in this paper the class of first-order methods, which build $x_t$ using a pre-defined linear combination of an initial guess and previous gradients:
$$x_{t+1} \in x_0 + \operatorname{span}\{\nabla f(x_0), \ldots, \nabla f(x_t)\}\,. \tag{1}$$
This wide class includes most gradient-based optimization methods, such as gradient descent and momentum. However, it excludes quasi-Newton methods, preconditioned gradient descent or Adam (to cite a few), as the preconditioning allows the iterates to leave this span.

2.1. Tools of the trade: orthogonal polynomials and spectral densities

Average-case optimal methods rely on two key concepts that we now introduce: residual orthogonal polynomials and the expected spectral distribution.

2.1.1. ORTHOGONAL (RESIDUAL) POLYNOMIALS

This section defines orthogonal polynomials and residual polynomials.

Definition 1. Let $\alpha$ be a non-decreasing function such that $\int Q \, d\alpha$ is finite for all polynomials $Q$. We will say that the sequence of polynomials $P_0, P_1, \ldots$ is orthogonal with respect to $d\alpha$ if $P_i$ has degree $i$ and
$$\int P_i P_j \, d\alpha \;\begin{cases} = 0 & \text{if } i \neq j \\ > 0 & \text{if } i = j \end{cases}\,. \tag{2}$$
Furthermore, if they verify $P_i(0) = 1$ for all $i$, we call these residual orthogonal polynomials.

Residual orthogonal polynomials verify a three-term recurrence (Fischer, 1996, §2.4), that is, there exist sequences of real values $a_t, b_t$ such that
$$P_t(\lambda) = (a_t + b_t \lambda)\,P_{t-1}(\lambda) + (1 - a_t)\,P_{t-2}(\lambda)\,, \tag{3}$$
where $P_0(\lambda) = 1$ and $P_1(\lambda) = 1 + b_1 \lambda$.

2.1.2. EXPECTED SPECTRAL DISTRIBUTION

The expected spectral distribution and the extreme eigenvalues of the matrix $H$ play similar roles in the case of, respectively, average-case and worst-case optimal methods. They measure the problem's difficulty and define the optimal method's parameters.

Definition 2 (Empirical/Expected Spectral Measure). Let $H$ be a random matrix with eigenvalues $\{\lambda_1, \ldots, \lambda_d\}$. The empirical spectral measure of $H$, called $\mu_H$, is the probability measure
$$\mu_H(\lambda) \;\stackrel{\text{def}}{=}\; \tfrac{1}{d} \textstyle\sum_{i=1}^{d} \delta_{\lambda_i}(\lambda)\,, \tag{4}$$
where $\delta_{\lambda_i}$ is the Dirac delta, a distribution equal to zero everywhere except at $\lambda_i$ and whose integral over the entire real line is equal to one.

Since $H$ is random, the empirical spectral measure $\mu_H$ is a random measure. Its expectation over $H$,
$$\mu \;\stackrel{\text{def}}{=}\; \mathbb{E}_H[\mu_H]\,, \tag{5}$$
is called the expected spectral distribution.
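To make Definition 2 concrete, here is a minimal numerical sketch (ours, not code from the paper; the function names are our own) that approximates the expected spectral distribution by Monte Carlo: it draws several random matrices $H$, collects the eigenvalues of each draw (the atoms of the empirical spectral measure (4)), and histograms them to approximate the expectation (5).

```python
import numpy as np

def empirical_spectrum(H):
    """Eigenvalues of H: the empirical spectral measure (4) puts mass 1/d on each."""
    return np.linalg.eigvalsh(H)

def expected_spectrum_histogram(sample_H, n_draws=200, bins=50):
    """Monte Carlo approximation of the expected spectral distribution (5):
    pool the eigenvalues of independent draws of H and histogram them."""
    eigs = np.concatenate([empirical_spectrum(sample_H()) for _ in range(n_draws)])
    return np.histogram(eigs, bins=bins, density=True)

# Example: H = A^T A / n with iid Gaussian entries of variance sigma^2
# (the setting of Example 1 below).
n, d, sigma = 300, 100, 1.0
rng = np.random.default_rng(0)

def sample_H():
    A = sigma * rng.standard_normal((n, d))
    return A.T @ A / n

hist, edges = expected_spectrum_histogram(sample_H)
print(edges[0], edges[-1])   # approximate edges of the support of the spectrum
```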
Example 1 (Marchenko-Pastur density and large least squares problems). Consider a matrix $A \in \mathbb{R}^{n \times d}$, where each entry is an iid random variable with mean zero and variance $\sigma^2$. Then it is known that the expected spectral distribution of $H = \frac{1}{n} A^\top A$ converges to the Marchenko-Pastur distribution (Marchenko & Pastur, 1967) as $n, d \to \infty$ at a rate in which the asymptotic ratio $d/n \to r$ is finite. The Marchenko-Pastur distribution $d\mu_{\mathrm{MP}}$ is defined as
$$d\mu_{\mathrm{MP}}(\lambda) \;=\; \max\Big\{1 - \tfrac{1}{r},\, 0\Big\}\,\delta_0(\lambda) \;+\; \frac{\sqrt{(L - \lambda)(\lambda - \ell)}}{2\pi \sigma^2 r \lambda}\,\mathbf{1}_{\lambda \in [\ell, L]}\,d\lambda\,. \tag{6}$$
Here $\ell \stackrel{\text{def}}{=} \sigma^2 (1 - \sqrt{r})^2$ and $L \stackrel{\text{def}}{=} \sigma^2 (1 + \sqrt{r})^2$ are the extreme nonzero eigenvalues, $\delta_0$ is a Dirac delta at zero (which disappears if $r \leq 1$) and $\mathbf{1}_{\lambda \in [\ell, L]}$ is a rectangular window function, equal to 1 for $\lambda \in [\ell, L]$ and 0 elsewhere.

2.2. Average-case optimal methods

With these two ingredients, we can construct the method with optimal average-case complexity. We rewrite the expected error as an integral using the expected spectral density $\mu$ as a weight function.

Theorem 1 (Pedregosa & Scieur, 2020). Assume $x_0, x^\star$ are random variables independent of $H$, satisfying $\mathbb{E}[(x_0 - x^\star)(x_0 - x^\star)^\top] = R^2 I$. Let $x_t$ be generated by a first-order method, associated to the polynomial $P_t$. Then the expected error at iteration $t$ reads
$$\mathbb{E}\,\|x_t - x^\star\|^2 \;=\; \overbrace{R^2}^{\text{initialization}} \int \underbrace{P_t^2}_{\text{algorithm}} \;\overbrace{d\mu}^{\text{problem}}\,. \tag{7}$$

The optimal first-order method is obtained by minimizing (7) over the space of residual polynomials of degree $t$. This is equivalent to finding a sequence of residual polynomials $\{P_i\}$ orthogonal w.r.t. the weight function $\lambda\,\mu(\lambda)$, as shown in the following theorem.

Theorem 2 (Pedregosa & Scieur, 2020). Let $a_t$ and $b_t$ be the coefficients of the three-term recurrence (3) for the sequence of residual polynomials orthogonal w.r.t. $\lambda \, d\mu(\lambda)$. Then the following method has optimal average-case complexity over the class of problems (OPT):
$$x_1 = x_0 + b_1 \nabla f(x_0)\,, \qquad x_t = x_{t-1} + (1 - a_t)(x_{t-2} - x_{t-1}) + b_t \nabla f(x_{t-1})\,. \tag{8}$$

Due to the dependency of the coefficients $a_t, b_t$ on the expected spectral distribution, equation (8) does not represent a single scheme, but rather a family of algorithms: each different expected spectral distribution generates a different optimal method. Below is an example of such an optimal algorithm w.r.t. the Marchenko-Pastur expected spectral distribution (Example 2).

Example 2 (Marchenko-Pastur acceleration, (Pedregosa & Scieur, 2020)). Let $d\mu$ be the density associated with the Marchenko-Pastur distribution. Then the recurrence of the optimal average-case method associated with this distribution is
$$\rho = \frac{1 + r}{\sqrt{r}}\,, \qquad \delta_0 = 0\,, \qquad x_1 = x_0 - \frac{1}{(1 + r)\sigma^2}\,\nabla f(x_0)\,;$$
$$\delta_t = -(\rho + \delta_{t-1})^{-1}\,, \qquad x_t = x_{t-1} + (1 + \rho\,\delta_t)(x_{t-2} - x_{t-1}) + \frac{\delta_t}{\sigma^2 \sqrt{r}}\,\nabla f(x_{t-1})\,.$$
The coefficients come from the orthogonal polynomials w.r.t. $\lambda\, d\mu(\lambda)$, which are shifted Chebyshev polynomials of the second kind.

2.3. Polyak Momentum and worst-case optimality

The Polyak momentum algorithm (Polyak, 1964) has an optimal worst-case convergence rate over the class of first-order methods with constant coefficients (Polyak, 1987; Scieur et al., 2017). The method requires knowledge of the smallest and largest eigenvalues of the Hessian $H$ (denoted $\ell$ and $L$ respectively) and iterates as follows:
$$x_1 = x_0 - \frac{2}{L + \ell}\,\nabla f(x_0)\,, \qquad x_{t+1} = x_t + \Big(\frac{\sqrt{L} - \sqrt{\ell}}{\sqrt{L} + \sqrt{\ell}}\Big)^2 (x_t - x_{t-1}) - \Big(\frac{2}{\sqrt{L} + \sqrt{\ell}}\Big)^2 \nabla f(x_t)\,. \tag{PM}$$

Remark 3. Unlike the Marchenko-Pastur accelerated method of Example 2, the coefficients of this method are constant across iterations. Furthermore, these coefficients only depend on the edges of the spectral distribution and not on the full density.
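As an illustration (a sketch of ours, not code from the paper), the template (8) can be implemented once and instantiated with different coefficient sequences. Polyak momentum is recovered with the constant choices $a_t = 1 + \big(\frac{\sqrt{L}-\sqrt{\ell}}{\sqrt{L}+\sqrt{\ell}}\big)^2$ and $b_t = -\big(\frac{2}{\sqrt{L}+\sqrt{\ell}}\big)^2$ (in (8) the momentum multiplies $x_{t-2} - x_{t-1}$ and $b_t$ multiplies the gradient directly, which explains the signs), with $b_1 = -\frac{2}{L+\ell}$ for the first step as in (PM).

```python
import numpy as np

def run_first_order(H, x_star, x0, a, b, n_iter):
    """Template (8) on the quadratic f(x) = 0.5 (x - x_star)^T H (x - x_star):
    x_1 = x_0 + b(1) grad f(x_0), then
    x_t = x_{t-1} + (1 - a(t)) (x_{t-2} - x_{t-1}) + b(t) grad f(x_{t-1})."""
    grad = lambda x: H @ (x - x_star)
    xs = [x0, x0 + b(1) * grad(x0)]
    for t in range(2, n_iter + 1):
        x_pprev, x_prev = xs[-2], xs[-1]
        xs.append(x_prev + (1 - a(t)) * (x_pprev - x_prev) + b(t) * grad(x_prev))
    return np.array(xs)

# A synthetic quadratic; ell and L are the edges of its spectrum.
rng = np.random.default_rng(0)
d = 50
eigs = rng.uniform(0.1, 1.0, size=d)
ell, L = eigs.min(), eigs.max()
U, _ = np.linalg.qr(rng.standard_normal((d, d)))
H = U @ np.diag(eigs) @ U.T
x_star, x0 = rng.standard_normal(d), np.zeros(d)

# Polyak momentum (PM) as a member of the family (8), with constant coefficients.
m = ((np.sqrt(L) - np.sqrt(ell)) / (np.sqrt(L) + np.sqrt(ell))) ** 2   # momentum
gamma = (2.0 / (np.sqrt(L) + np.sqrt(ell))) ** 2                       # step-size
xs = run_first_order(H, x_star, x0,
                     a=lambda t: 1.0 + m,
                     b=lambda t: -2.0 / (L + ell) if t == 1 else -gamma,
                     n_iter=100)
print(np.linalg.norm(xs - x_star, axis=1)[::20])   # error decays geometrically
```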
3. All Roads Lead to Polyak Momentum

Main result.

Theorem 3. Assume the density function $d\mu$ is strictly positive in the interval $[\ell, L]$, with $\ell > 0$. Then the parameters of the optimal average-case method (8) converge to those of (PM):
$$\lim_{t \to \infty}\, (1 - a_t) \;=\; -\Big(\frac{\sqrt{L} - \sqrt{\ell}}{\sqrt{L} + \sqrt{\ell}}\Big)^2\,, \qquad \lim_{t \to \infty}\, b_t \;=\; -\Big(\frac{2}{\sqrt{L} + \sqrt{\ell}}\Big)^2\,,$$
i.e., the limiting momentum and step-size are exactly those of (PM) (the signs reflect the fact that in (8) the momentum coefficient multiplies $x_{t-2} - x_{t-1}$ and $b_t$ multiplies the gradient directly).

The key insight of the proof is to cast the three-term recurrence of residual orthogonal polynomials into one for orthonormal polynomials² on the interval $[-1, 1]$. Once this is done, we will use asymptotic properties of these polynomials. The proof is split into three steps. Step 0 introduces notation and some known results. Step 1 writes the coefficients of optimal average-case methods in terms of properties of a class of orthonormal polynomials on the interval $[-1, 1]$. Step 2 computes the limits of the expressions derived in the previous step by using known asymptotic properties of orthonormal polynomials.

Step 0: Definitions. In the classical theory of orthogonal polynomials, the weight function associated with orthogonal polynomials is defined on the interval $[-1, 1]$. However, in our case the spectral densities are instead defined on $[\ell, L]$. To translate results from one setting to the other we define the following linear mapping from $[\ell, L]$ to $[-1, 1]$:
$$m(\lambda) \;=\; \frac{L + \ell}{L - \ell} - \frac{2}{L - \ell}\,\lambda\,. \tag{11}$$
For notational convenience, we will also use the shorthand $m_0 \stackrel{\text{def}}{=} m(0)$. We also define $Q_i$ as the $i$-th degree orthonormal polynomial with respect to the density function $d\nu : [-1, 1] \to \mathbb{R}$, where $d\nu$ satisfies
$$d\nu(m(\lambda)) \;=\; \lambda \, d\mu(\lambda)\,. \tag{12}$$
Similar to the three-term recurrence of residual orthogonal polynomials, orthonormal polynomials also verify a three-term recurrence, with coefficients $\alpha_t, \beta_t$:
$$\alpha_t\, Q_t(\xi) \;=\; (\xi - \beta_t)\, Q_{t-1}(\xi) - \alpha_{t-1}\, Q_{t-2}(\xi)\,. \tag{13}$$

²A sequence $Q_1, Q_2, \ldots$ of orthogonal polynomials with respect to $d\omega$ is orthonormal if $\int Q_i^2 \, d\omega = 1$.
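For a generic expected spectral density, the coefficients $\alpha_t, \beta_t$ of (13) can be computed numerically by the classical Stieltjes procedure on a discretization of $d\nu$. The sketch below (our code, not the paper's; `orthonormal_recurrence` is our own name) does this for the measure induced by a Marchenko-Pastur density through (11)-(12); in that case $\lambda\, d\mu(\lambda)$ is a semicircle-type weight, so the $Q_t$ are, up to the affine map, Chebyshev polynomials of the second kind (cf. Example 2) and the computed coefficients come out essentially constant.

```python
import numpy as np

def orthonormal_recurrence(nodes, weights, n_steps):
    """Stieltjes procedure: coefficients (alpha_t, beta_t) of the three-term
    recurrence (13) for the polynomials orthonormal w.r.t. the discrete measure
    sum_k weights[k] * delta_{nodes[k]}."""
    ip = lambda f, g: np.sum(weights * f * g)
    ones = np.ones_like(nodes)
    q_prev, q_curr = np.zeros_like(nodes), ones / np.sqrt(ip(ones, ones))  # Q_{-1}, Q_0
    alphas, betas = [0.0], []        # alpha_0 = 0 since Q_{-1} = 0
    for _ in range(n_steps):
        beta = ip(nodes * q_curr, q_curr)
        v = (nodes - beta) * q_curr - alphas[-1] * q_prev
        alpha = np.sqrt(ip(v, v))
        q_prev, q_curr = q_curr, v / alpha
        alphas.append(alpha)
        betas.append(beta)
    return np.array(alphas[1:]), np.array(betas)

# Discretize d_nu for a Marchenko-Pastur density (r < 1, so no mass at zero):
# atoms of lambda * d_mu(lambda), pushed through the map m of (11).
r, sigma = 0.5, 1.0
ell, L = sigma**2 * (1 - np.sqrt(r))**2, sigma**2 * (1 + np.sqrt(r))**2
lam = np.linspace(ell, L, 20001)
dlam = lam[1] - lam[0]
mp_density = np.sqrt((L - lam) * (lam - ell)) / (2 * np.pi * sigma**2 * r * lam)
xi = (L + ell) / (L - ell) - 2 * lam / (L - ell)       # xi = m(lambda)
alphas, betas = orthonormal_recurrence(xi, lam * mp_density * dlam, n_steps=20)
print(alphas[:5], betas[:5])   # approximately 1/2 and 0: Chebyshev of the second kind
```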
Step 1: Parameters of the optimal method and orthonormal polynomials. Theorem 2 links the coefficients $a_t, b_t$ of the optimal average-case method to the residual orthogonal polynomials. Instead, this step relates these coefficients to a different, orthonormal polynomial on the interval $[-1, 1]$. This eases the usage of the rich theory that studies asymptotic properties of orthonormal polynomials. This relationship is stated in the following lemma:

Lemma 1. Let $a_t, b_t$ be the parameters associated with the optimal average-case method (Theorem 2). These coefficients verify the following identities,
$$1 - a_t \;=\; -\frac{\alpha_{t-1}}{\alpha_t}\,\frac{Q_{t-2}(m_0)}{Q_t(m_0)}\,, \quad\text{and} \tag{14}$$
$$b_t \;=\; -\frac{2}{\alpha_t (L - \ell)}\,\frac{Q_{t-1}(m_0)}{Q_t(m_0)}\,. \tag{15}$$

Proof. We first note that $\{Q_i(m(\lambda))\}_i$ is a sequence of orthogonal polynomials with respect to the weight function $\lambda \, d\mu(\lambda)$. Indeed, if $i \neq j$, using the substitution $\xi = m(\lambda)$ gives
$$\int_\ell^L Q_i(m(\lambda))\, Q_j(m(\lambda))\, \underbrace{\lambda \, d\mu(\lambda)}_{= d\nu(m(\lambda))} \;=\; \int_{-1}^{1} Q_i\, Q_j \, d\nu \;=\; 0\,, \tag{17}$$
where the last identity follows by orthogonality. Since orthogonality is preserved after multiplication by a scalar, the polynomial $P_t(\lambda) \stackrel{\text{def}}{=} Q_t(m(\lambda)) / Q_t(m_0)$ is also orthogonal with respect to the weight function $\lambda \, d\mu(\lambda)$. The normalization $1/Q_t(m_0)$ ensures the polynomial is a residual polynomial. Note that $Q_t(m_0)$ cannot be zero, and so the normalization is well defined. This is because the roots of orthogonal polynomials are always contained in their support, and by construction $m_0$ is outside $[-1, 1]$.

Using Theorem 2, the coefficients of the optimal average-case method can be derived from the three-term recurrence of this polynomial. Indeed, starting from the three-term recurrence of the $Q_i$ (13), we obtain for $P_t$
$$P_t(\lambda) \;=\; \frac{(m(\lambda) - \beta_t)\, Q_{t-1}(m(\lambda))}{\alpha_t\, Q_t(m_0)} \;-\; \frac{\alpha_{t-1}\, Q_{t-2}(m(\lambda))}{\alpha_t\, Q_t(m_0)}$$
$$\phantom{P_t(\lambda)} \;=\; \underbrace{\frac{1}{\alpha_t}\Big(\frac{L + \ell}{L - \ell} - \beta_t - \frac{2}{L - \ell}\lambda\Big)\frac{Q_{t-1}(m_0)}{Q_t(m_0)}}_{=\,(a_t + b_t \lambda)}\, P_{t-1}(\lambda) \;\underbrace{-\; \frac{\alpha_{t-1}}{\alpha_t}\,\frac{Q_{t-2}(m_0)}{Q_t(m_0)}}_{=\,(1 - a_t)}\, P_{t-2}(\lambda)\,,$$
where in the last line we used the definition of $m$ and the identity $P_i = Q_i(m(\lambda)) / Q_i(m_0)$ for $i = t-1$ and $i = t-2$. Finally, matching the coefficients of this recurrence with (3) yields the identities in the Lemma.

Step 2: Asymptotics of orthonormal polynomials. This step uses known results on asymptotics of orthonormal polynomials to compute the limit $t \to \infty$ of the expressions derived in the previous step. We use the following theorem on the asymptotic ratio between two successive orthonormal polynomials.

Theorem 4 (Rakhmanov (1983);³ Ratio Asymptotics). Let $\{Q_i\}$ be a sequence of orthonormal polynomials with respect to a weight function strictly positive in $]-1, 1[$, and zero elsewhere. Then we have the following limit for the ratio of polynomials evaluated outside of the support,
$$\lim_{t \to \infty} \frac{Q_t(\xi)}{Q_{t-1}(\xi)} \;=\; \xi + \sqrt{\xi^2 - 1} \qquad \text{for } \xi \notin [-1, 1]\,. \tag{18}$$

We can use this result to compute the limit of the ratios of successive $Q$'s evaluated at $m_0$, which appear in (14)-(15), as $m(0) > 1$ (and thus is not in the interval $[-1, 1]$):
$$\lim_{t \to \infty} \frac{Q_t(m_0)}{Q_{t-1}(m_0)} \;\stackrel{(18)}{=}\; \frac{L + \ell}{L - \ell} + \sqrt{\Big(\frac{L + \ell}{L - \ell}\Big)^2 - 1} \;=\; \frac{\sqrt{L} + \sqrt{\ell}}{\sqrt{L} - \sqrt{\ell}}\,. \tag{19}$$
The other dependency of Eqs. (14)-(15) on the iteration $t$ is through the coefficients $\alpha_t, \beta_t$. To compute the limits of these we use the following known asymptotics:⁴

Theorem 5 (Máté et al. (1985); Limits of recurrence coefficients). Under the same assumptions as Theorem 4, the limits of the coefficients $\alpha_t, \beta_t$ in the orthonormal three-term recurrence are
$$\lim_{t \to \infty} \alpha_t = \tfrac{1}{2}\,, \qquad \lim_{t \to \infty} \beta_t = 0\,. \tag{21}$$

Using this last theorem together with (19), we have
$$\lim_{t \to \infty}\, (1 - a_t) \;\stackrel{(14)}{=}\; -\lim_{t \to \infty} \frac{\alpha_{t-1}}{\alpha_t}\; \lim_{t \to \infty} \frac{Q_{t-2}(m_0)}{Q_t(m_0)} \;\stackrel{(19),(21)}{=}\; -\Big(\frac{\sqrt{L} - \sqrt{\ell}}{\sqrt{L} + \sqrt{\ell}}\Big)^2\,,$$
which is the claimed limit. To conclude the proof, we compute the same limit for the step-size $b_t$:
$$\lim_{t \to \infty}\, b_t \;\stackrel{(15)}{=}\; -\frac{2}{L - \ell}\; \lim_{t \to \infty} \alpha_t^{-1}\; \lim_{t \to \infty} \frac{Q_{t-1}(m_0)}{Q_t(m_0)} \;\stackrel{(19),(21)}{=}\; -\frac{2}{L - \ell} \cdot 2 \cdot \frac{\sqrt{L} - \sqrt{\ell}}{\sqrt{L} + \sqrt{\ell}} \;=\; -\Big(\frac{2}{\sqrt{L} + \sqrt{\ell}}\Big)^2\,.$$

³The original version of this theorem was stated for monic orthogonal polynomials but is valid for polynomials with other normalizations like orthonormal, see for instance (Lubinsky, 2000; Denisov, 2004).

⁴It can be shown that the last two theorems are equivalent (Nevai, 1979, Theorem 13), although for simplicity we will treat them as independent results.
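As a toy numerical check of Theorem 3 (a sketch under our own choice of spectral density, not an experiment from the paper), take $d\mu(\lambda) \propto 1/\lambda$ on $[\ell, L]$. Then $\lambda\, d\mu(\lambda)$ is constant, the mapped weight $d\nu$ is proportional to the Lebesgue measure on $[-1, 1]$, and the $Q_t$ are Legendre polynomials, whose recurrence coefficients $\alpha_t = t/\sqrt{4t^2 - 1}$, $\beta_t = 0$ are classical. Plugging these into (14)-(15) gives the coefficients $a_t, b_t$ of the optimal method, which indeed approach the limits in Theorem 3.

```python
import numpy as np

# Choice of spectral density for this check: d_mu(lambda) proportional to 1/lambda
# on [ell, L]. Then lambda d_mu is constant, d_nu is the Lebesgue measure on [-1, 1],
# and the Q_t are Legendre polynomials: alpha_t = t / sqrt(4 t^2 - 1), beta_t = 0.
ell, L = 1.0, 4.0
m0 = (L + ell) / (L - ell)            # m(0), outside [-1, 1]

T = 40
alphas = np.zeros(T + 1)              # alpha_0 = 0 (convention Q_{-1} = 0)
Q = np.zeros(T + 1)                   # values Q_t(m0); normalization of Q_0 cancels below
Q[0] = 1.0
alphas[1] = 1.0 / np.sqrt(3.0)
Q[1] = m0 * Q[0] / alphas[1]
for t in range(2, T + 1):
    alphas[t] = t / np.sqrt(4.0 * t**2 - 1.0)
    Q[t] = (m0 * Q[t - 1] - alphas[t - 1] * Q[t - 2]) / alphas[t]

# Coefficients of the optimal average-case method via (14)-(15), and their limits.
one_minus_a = -(alphas[T - 1] / alphas[T]) * Q[T - 2] / Q[T]
b = -(2.0 / (alphas[T] * (L - ell))) * Q[T - 1] / Q[T]
momentum_lim = -((np.sqrt(L) - np.sqrt(ell)) / (np.sqrt(L) + np.sqrt(ell))) ** 2
step_lim = -(2.0 / (np.sqrt(L) + np.sqrt(ell))) ** 2
print(one_minus_a, momentum_lim)      # both close to -1/9 for [ell, L] = [1, 4]
print(b, step_lim)                    # both close to -4/9
```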
4. Asymptotic Expected Convergence Rates

The previous section showed convergence of the method's parameters to PM, but said nothing about its rate of convergence. This section fills this gap by providing the asymptotic behavior of the expected error $\mathbb{E}\,\|x_t - x^\star\|^2$. More precisely, we show that the expected convergence rate converges to the rate of convergence of Polyak momentum, and that this convergence rate is independent of the probability distribution.

Theorem 4.1. Under the same assumptions as Theorem 3, the asymptotic expected rate of convergence of the optimal method converges to the worst-case rate of convergence,
$$\lim_{t \to \infty} \sqrt[t]{\mathbb{E}\,\|x_t - x^\star\|^2} \;=\; \Big(\frac{\sqrt{L} - \sqrt{\ell}}{\sqrt{L} + \sqrt{\ell}}\Big)^2\,.$$

Proof. Let $P_t$ be the residual orthogonal polynomial w.r.t. $\lambda \, d\mu(\lambda)$. Pedregosa & Scieur (2020, Theorem 3.1) showed that the expected error for average-case optimal methods admits the following simple form⁵
$$\mathbb{E}\,\|x_t - x^\star\|^2 \;=\; R^2 \int_{\mathbb{R}} P_t \, d\mu\,. \tag{25}$$
This form is particularly convenient for us, as we can then use the three-term recurrence to obtain a recurrence for this expression. Let $r_t = \int_{\mathbb{R}} P_t \, d\mu$. Using the recurrence (3) over $P_t$,
$$r_t \;=\; \int_{\mathbb{R}} \big[(a_t + b_t \lambda) P_{t-1}(\lambda) + (1 - a_t) P_{t-2}(\lambda)\big]\, d\mu(\lambda) \;=\; a_t \underbrace{\int_{\mathbb{R}} P_{t-1}\, d\mu}_{= r_{t-1}} \;+\; (1 - a_t) \underbrace{\int_{\mathbb{R}} P_{t-2}\, d\mu}_{= r_{t-2}}\,,$$
where in the last identity we have used the orthogonality between $P_{t-1}$ and $P_0(\lambda) = 1$ w.r.t. $\lambda \, d\mu(\lambda)$ to drop the term $b_t \int \lambda P_{t-1} \, d\mu = 0$.

In all, we have that the convergence rate $r_t$ is described by the recurrence
$$r_t = a_t\, r_{t-1} + (1 - a_t)\, r_{t-2}\,, \qquad r_0 = 1\,, \quad r_1 = \int_{\mathbb{R}} P_1 \, d\mu\,.$$
Replacing $a_t$ by its asymptotic value $a_\infty$, which reads
$$a_\infty \;\stackrel{\text{def}}{=}\; 1 + \Big(\frac{\sqrt{L} - \sqrt{\ell}}{\sqrt{L} + \sqrt{\ell}}\Big)^2\,,$$
we obtain a different recurrence, which we denote $\tilde r_t$, with the same limit:
$$\tilde r_t = a_\infty\, \tilde r_{t-1} + (1 - a_\infty)\, \tilde r_{t-2}\,. \tag{29}$$
This recurrence is easier to analyze than the previous one because the coefficients are constant. It is a property of linear recurrence relations with constant coefficients that there exist constants $c_1, c_2$ such that
$$\tilde r_t = c_1\, (a_\infty - 1)^t + c_2\,. \tag{30}$$
Since the method is convergent, we must have $r_\infty = 0$, which implies $c_2 = 0$. Taking limits of the $t$-th root, we recover the desired rate:
$$\lim_{t \to \infty} \sqrt[t]{r_t} \;=\; \lim_{t \to \infty} \sqrt[t]{\tilde r_t} \;=\; a_\infty - 1 \;=\; \Big(\frac{\sqrt{L} - \sqrt{\ell}}{\sqrt{L} + \sqrt{\ell}}\Big)^2\,.$$

⁵In particular, the result shows that the integral of the squared polynomial is equal to the integral of the polynomial if the polynomial is optimal w.r.t. $\lambda \, d\mu(\lambda)$.
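The conclusion above can also be checked numerically. The sketch below (ours, using the same $1/\lambda$ density as the previous sketch, so that the $Q_t$ are Legendre polynomials) evaluates the residual polynomials $P_t = Q_t(m(\cdot))/Q_t(m_0)$ on a grid, approximates $\mathbb{E}\,\|x_t - x^\star\|^2 = R^2 \int P_t^2\, d\mu$ from (7) by quadrature (with $R = 1$), and compares its $t$-th root with the worst-case rate of Theorem 4.1.

```python
import numpy as np

# Same toy density as before: d_mu(lambda) proportional to 1/lambda on [ell, L],
# so the mapped orthonormal polynomials Q_t are Legendre (alpha_t = t / sqrt(4 t^2 - 1),
# beta_t = 0).
ell, L = 1.0, 4.0
lam = np.linspace(ell, L, 20001)
dlam = lam[1] - lam[0]
w = 1.0 / lam
w /= np.sum(w * dlam)                       # normalized density of d_mu on the grid
xi = (L + ell) / (L - ell) - 2.0 * lam / (L - ell)     # xi = m(lambda), map (11)
m0 = (L + ell) / (L - ell)

rate = ((np.sqrt(L) - np.sqrt(ell)) / (np.sqrt(L) + np.sqrt(ell))) ** 2
Q_prev, Q_curr = np.zeros_like(xi), np.ones_like(xi)   # Q_{-1}, Q_0 on the grid
q_prev, q_curr = 0.0, 1.0                              # same recurrence at the point m0
alpha_prev = 0.0
for t in range(1, 41):
    alpha = t / np.sqrt(4.0 * t**2 - 1.0)
    Q_prev, Q_curr = Q_curr, (xi * Q_curr - alpha_prev * Q_prev) / alpha
    q_prev, q_curr = q_curr, (m0 * q_curr - alpha_prev * q_prev) / alpha
    alpha_prev = alpha
    if t in (10, 20, 40):
        P_t = Q_curr / q_curr                          # residual polynomial on the grid
        r_t = np.sum(P_t**2 * w * dlam)                # E||x_t - x*||^2 for R = 1, eq. (7)
        print(t, r_t ** (1.0 / t), rate)               # t-th root approaches the rate
```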
5. Discussion and Simulations: Speed of Convergence to PM

The main result (Theorem 3) shows that, asymptotically, any average-case optimal method converges towards Polyak momentum. This could be interpreted as evidence against average-case optimal methods, as average-case optimal methods are not essentially different from PM. However, simulations show other dynamics at play. In Figure 2 we plot the speed of convergence of the parameters of the optimal average-case method for the Marchenko-Pastur distribution with different ratios $r = d/n$ (and hence condition numbers) and for the uniform distribution with different intervals. We see a clear effect of the condition number on the speed of convergence. The more ill-conditioned the problem, the slower the convergence of the optimal method to PM, implying that PM behaves sub-optimally for a larger number of iterations. This observation is consistent with the results of (Pedregosa & Scieur, 2020), who showed important speedups in the ill-conditioned regime.

Figure 2. Speed of convergence to Polyak momentum. For different parametrizations of the Marchenko-Pastur (top row) and uniform (bottom row) distributions, we plot the absolute difference between the average-case optimal momentum parameter (middle) and the average-case optimal step-size (right) and the momentum and step-size of the Polyak method. The plots show a high anti-correlation between the speed of convergence of optimal average-case methods to PM and problem conditioning: for well-conditioned problems (small condition number) the parameters converge faster to PM than for ill-conditioned (large condition number) problems. Thus, in a regime where we perform only a few iterations, Polyak momentum may not be the best choice.

6. Conclusion and Perspectives

In this work, we have shown that optimal average-case methods for minimizing quadratics converge to PM under mild assumptions on the expected spectral distribution. This universality over the probability measure is somewhat surprising, as the Polyak momentum method only depends on the edges of the spectrum, while on the other hand optimal average-case methods depend on the whole spectrum.

A potential area for future work is the analysis of the rate of convergence of the optimal method to the Polyak momentum algorithm. It seems that the convergence of the step-size and momentum parameters is bounded polynomially in the number of iterations. This observation indicates the potential benefit of optimal methods over PM in the case where we perform a small number of iterations, typical in machine-learning problems.

A second research direction is the study of optimal polynomials on the complex plane. In this case, we are no longer solving the optimization problem (OPT). Instead, we aim to solve the linear system $Ax = b$, where the matrix $A$ is non-symmetric, with potentially complex eigenvalues. This has implications for the study of optimal algorithms in game theory (Azizian et al., 2020) or in the acceleration of primal-dual algorithms (Bollapragada et al., 2018).

Finally, our results are only valid in the strongly convex regime ($\ell > 0$), ruling out the important case $r = 1$ in the Marchenko-Pastur distribution, which corresponds to large least squares problems with a square matrix. After the first version of this paper appeared, Paquette et al. (2020) derived an average-case analysis for gradient descent and showed a gap between the asymptotic average-case and worst-case convergence rates. The development of average-case optimal methods and the study of their asymptotic limits in this regime remains an open problem.

Acknowledgements

We would like to thank our colleagues Reza Babanezad, Simon Lacoste-Julien, Remi Lepriol, Nicolas Loizou, Adam Ibrahim, Nicolas Le Roux and Courtney Paquette for their insightful discussions and relevant remarks. We also thank Francis Bach and Raphaël Berthier for their useful remarks and pointers.

References

Azizian, W., Scieur, D., Mitliagkas, I., Lacoste-Julien, S., and Gidel, G. Accelerating smooth games by manipulating spectral shapes. Proceedings of the 23rd International Conference on Artificial Intelligence and Statistics (AISTATS), 2020.

Berthier, R., Bach, F., and Gaillard, P. Accelerated Gossip in Networks of Given Dimension using Jacobi Polynomial Iterations. arXiv preprint arXiv:1805.08531, 2018.

Bollapragada, R., Scieur, D., and d'Aspremont, A. Nonlinear acceleration of momentum and primal-dual algorithms. arXiv preprint arXiv:1810.04539, 2018.

Deift, P. Orthogonal polynomials and random matrices: a Riemann-Hilbert approach, volume 3. American Mathematical Soc., 1999.

Deift, P. and Trogdon, T. The conjugate gradient algorithm on well-conditioned Wishart matrices is almost deterministic. arXiv preprint arXiv:1901.09007, 2019.

Denisov, S. On Rakhmanov's theorem for Jacobi matrices. Proceedings of the American Mathematical Society, 2004.

Fischer, B. Polynomial based iteration methods for symmetric linear systems. Springer, 1996.

Frankel, S. P. Convergence rates of iterative treatments of partial differential equations. Mathematical Tables and Other Aids to Computation, 1950.

Grenander, U. and Szegő, G. Toeplitz forms and their applications. American Mathematical Soc., 1958.

Hoare, C. A. R. Quicksort. The Computer Journal, 1962.

Katz, J. and Lindell, Y. Introduction to modern cryptography. CRC press, 2014.

Knuth, D. E. The art of computer programming, volume 3. Pearson Education, 1997.

Lacotte, J. and Pilanci, M. Optimal randomized first-order methods for least-squares problems. Proceedings of the 37th International Conference on Machine Learning (ICML), 2020.

Lubinsky, D. Asymptotics of orthogonal polynomials: some old, some new, some identities. Acta Applicandae Mathematica, 2000.

Marchenko, V. A. and Pastur, L. A. Distribution of eigenvalues for some sets of random matrices. Matematicheskii Sbornik, 1967.
Máté, A., Nevai, P., and Totik, V. Asymptotics for the ratio of leading coefficients of orthonormal polynomials on the unit circle. Constructive Approximation, 1985.

Mhaskar, H. N. Introduction to the theory of weighted polynomial approximation. World Scientific, 1997.

Nevai, P. G. Orthogonal polynomials, volume 213. American Mathematical Soc., 1979.

Paquette, C., van Merriënboer, B., and Pedregosa, F. Halting Time is Predictable for Large Models: A Universality Property and Average-case Analysis. arXiv preprint arXiv:2006.04299, 2020.

Pedregosa, F. and Scieur, D. Average-case Acceleration through spectral density estimation. Proceedings of the 37th International Conference on Machine Learning (ICML), 2020.

Polyak, B. T. Some methods of speeding up the convergence of iteration methods. USSR Computational Mathematics and Mathematical Physics, 1964.

Polyak, B. T. Introduction to optimization. Optimization Software, Inc. Publications Division, New York, 1987.

Rakhmanov, E. A. On the asymptotics of the ratio of orthogonal polynomials. II. Mathematics of the USSR-Sbornik, 1983.

Rutishauser, H. Theory of gradient methods. In Refined iterative methods for computation of the solution and the eigenvalues of self-adjoint boundary value problems. Springer, 1959.

Scieur, D., Roulet, V., Bach, F., and d'Aspremont, A. Integration methods and optimization algorithms. In Advances in Neural Information Processing Systems, 2017.

Spielman, D. A. and Teng, S.-H. Smoothed analysis of algorithms: Why the simplex algorithm usually takes polynomial time. Journal of the ACM (JACM), 2004.

Sutskever, I., Martens, J., Dahl, G., and Hinton, G. On the importance of initialization and momentum in deep learning. In International Conference on Machine Learning, 2013.

Zhang, A., Lipton, Z., Li, M., and Smola, A. J. Dive into deep learning. https://d2l.ai, 2020.