# Polynomial Preconditioning for Gradient Methods

Nikita Doikov 1, Anton Rodomanov 2

We study first-order methods with preconditioning for solving structured nonlinear convex optimization problems. We propose a new family of preconditioners generated by symmetric polynomials. They provide first-order optimization methods with a provable improvement of the condition number, cutting the gaps between the highest eigenvalues, without explicit knowledge of the actual spectrum. We give a stochastic interpretation of this preconditioning in terms of coordinate volume sampling and compare it with other classical approaches, including the Chebyshev polynomials. We show how to incorporate a polynomial preconditioning into the Gradient and Fast Gradient Methods and establish the corresponding global complexity bounds. Finally, we propose a simple adaptive search procedure that automatically chooses the best possible polynomial preconditioning for the Gradient Method, minimizing the objective along a low-dimensional Krylov subspace. Numerical experiments confirm the efficiency of our preconditioning strategies for solving various machine learning problems.

1EPFL, Switzerland. 2UCLouvain, Belgium. Correspondence to: Nikita Doikov, Anton Rodomanov. Proceedings of the 40th International Conference on Machine Learning, Honolulu, Hawaii, USA. PMLR 202, 2023. Copyright 2023 by the author(s).

1. Introduction

Motivation. Preconditioning is an important tool for improving the performance of numerical algorithms. The classical example is the preconditioned Conjugate Gradient Method (Hestenes & Stiefel, 1952) for solving a system of linear equations. It modifies the initial system so as to improve its eigenvalue distribution and thus to accelerate the convergence of the method. The question of choosing the right preconditioner heavily depends on the problem structure, and there exist many problem-specific recommendations which provide a good trade-off between the computational cost and the spectral properties of the new system. Some notable examples include Jacobi or diagonal preconditioners, symmetric successive over-relaxation, the incomplete Cholesky factorization (Golub & Van Loan, 2013), Laplacian preconditioning for graph problems (Vaidya, 1991; Spielman & Teng, 2004), and preconditioners for discretizations of systems of partial differential equations (Mardal & Winther, 2011).

Another important class of numerical algorithms is the second-order methods, or Newton's Method (see, e.g., (Nesterov, 2018)), which aim to solve difficult ill-conditioned problems by using local curvature information (the Hessian matrix) as a preconditioner at every step. However, powerful as it is, each iteration of Newton's Method is very expensive: it requires solving a system of linear equations with the Hessian matrix, which in the case of a quadratic objective is equivalent to solving the original problem.

In this paper, our goal is to solve a general nonlinear optimization problem with a structured convex objective by efficient first-order methods. Thus, in the case of unconstrained minimization of a smooth function, $\min_x f(x)$, the simplest method that we study is as follows, for $k \geq 0$:

$$x_{k+1} = x_k - \alpha_k P \nabla f(x_k), \qquad (1)$$

where $\alpha_k > 0$ is a stepsize and $P$ is a fixed preconditioning matrix. $P := I$ corresponds to the classical gradient descent.
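To make iteration (1) concrete, here is a minimal numpy sketch for the smooth case with a fixed step size; the function names and the toy quadratic are ours, not from the paper.

```python
import numpy as np

def preconditioned_gd(grad_f, x0, P, alpha, num_iters=500):
    """Iteration (1): x_{k+1} = x_k - alpha * P @ grad_f(x_k), with a fixed step."""
    x = x0.copy()
    for _ in range(num_iters):
        x = x - alpha * (P @ grad_f(x))
    return x

# Toy quadratic f(x) = 0.5 <Bx, x> - <a, x>, so grad f(x) = B x - a.
rng = np.random.default_rng(0)
G = rng.standard_normal((50, 50))
B = G @ G.T + np.eye(50)                 # positive definite curvature matrix
a = rng.standard_normal(50)
grad_f = lambda x: B @ x - a

x_id = preconditioned_gd(grad_f, np.zeros(50), np.eye(50),
                         alpha=1.0 / np.linalg.eigvalsh(B)[-1])   # P = I
print(np.linalg.norm(grad_f(x_id)))      # gradient norm after 500 steps
```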
Another natural choice is $P := B^{-1}$, where $B$ is a curvature matrix of our problem1, which is directly available to the algorithm. This resembles the Newton-type direction, and the method with this preconditioner tends to converge much faster in practice (see Figure 1). However, computing $B^{-1}$ (or solving the corresponding linear system with $B$) is a very expensive operation in the large-scale setting.

Instead of using $B^{-1}$, we propose a new family of Symmetric Polynomial Preconditioners that provably improve the spectrum of the objective. The first member of our family is

$$P := \operatorname{tr}(B)\, I - B. \qquad (2)$$

We prove that using preconditioner (2) within method (1) makes the condition number insensitive to the gap between the top two eigenvalues of the curvature matrix. Since it is quite common for real data to have a highly nonuniform spectrum with several large gaps between the top eigenvalues (see Figure 2), our preconditioning can significantly improve the convergence of the first-order methods. At the same time, one step of the form (1),(2) is still cheap to compute. It involves just standard matrix operations (the trace and a matrix-vector product), without the need to solve linear systems with the curvature matrix as in Newton's Method.

1See the definition of B in our Assumption 2.1 and the corresponding Examples 2.2, 2.3, 2.4 of different problems.

[Figure 1 panels: functional residual vs. iterations for logistic regression on several datasets, including phishing and breast-cancer, for P = I and P = B⁻¹.] Figure 1: Training logistic regression with the standard gradient descent (P = I), and using the inverse of the curvature matrix (P = B⁻¹) as a preconditioner in (1). The latter method works much faster, while it can be very expensive to compute B⁻¹ for large-scale problems.

Figure 2: Leading eigenvalues (in the logarithmic scale) of the curvature matrix B, for several typical datasets2. There are large gaps between the top eigenvalues.

2https://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/

This approach works for general structured nonlinear problems (not necessarily quadratics) and also for problems with possible composite parts (e.g., constrained minimization or non-smooth regularization).

Our new family of Symmetric Polynomial Preconditioners gradually interpolates between the first preconditioner (2) and $P \approx B^{-1}$ as the other extreme case. We show that, by increasing the order of the preconditioner, we are able to cut off several top eigenvalues of the curvature matrix, without knowing the actual spectrum. We can incorporate these preconditioners both into the Gradient Method and into the accelerated Fast Gradient Method (Nesterov, 1983), with a further provable improvement of the condition number.

Finally, we address the common question of choosing the best possible preconditioner. We propose a new adaptive strategy for the basic nonlinear Gradient Method based on the Krylov subspace minimization. In this approach, the preconditioner $P$ is defined as a general polynomial of the curvature matrix $B$ of a fixed (small) degree $\tau$: $P := a_0 I + a_1 B + \ldots + a_\tau B^\tau$, where the vector of coefficients $a \in \mathbb{R}^{\tau+1}$ is found by solving a certain linear system of size $\tau+1$ in each iteration of the method.
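Whatever the coefficients are, applying such a polynomial preconditioner to a vector needs only τ matrix-vector products with B; a small sketch (the helper name is ours, not the paper's):

```python
import numpy as np

def apply_polynomial_preconditioner(B, coeffs, g):
    """Compute (a0*I + a1*B + ... + a_tau*B^tau) @ g using only
    matrix-vector products with B; no matrix powers are formed."""
    result = coeffs[0] * g
    v = g
    for a_i in coeffs[1:]:
        v = B @ v                  # v becomes B^i g at step i
        result = result + a_i * v
    return result
```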
Finding these coefficients has a plain interpretation: it amounts to projecting the direction $-B^{-1}\nabla f(x_k)$ onto the Krylov subspace $\mathcal{K}^{\tau}_{x_k}$, so that the new point belongs to the affine set $x_k + \mathcal{K}^{\tau}_{x_k}$, where

$$\mathcal{K}^{\tau}_{x} \overset{\text{def}}{=} \operatorname{span}\bigl\{\nabla f(x),\, B\nabla f(x),\, \ldots,\, B^{\tau}\nabla f(x)\bigr\}. \qquad (3)$$

In the case of small τ, we can solve this linear system easily and obtain the best preconditioning guarantee for our method, which is adaptive for each iteration.

Related Work. It is widely known that the standard Conjugate Gradient Method is optimal in the class of first-order algorithms for unconstrained minimization of convex quadratic functions (Nemirovski, 1995). The $k$th iteration of the Conjugate Gradients finds the full minimum of the objective over the $k$-dimensional Krylov subspace, and thus it provably solves the problem after $k = n$ iterations, where $n$ is the dimension of the problem. Quadratic minimization is equivalent to solving a system of linear equations, therefore it is often referred to as the linear case. Polynomial preconditioning for solving large linear systems has been extensively studied during the last several decades; see (Dubois et al., 1979; Johnson et al., 1983; Saad, 1985; Van Gijzen, 1995; Liu et al., 2015; Loe & Morgan, 2022) and references therein. See also Section 5.3 for the comparison of our preconditioning strategies with the linear Conjugate Gradient Method.

The situation with nonlinear problems is more difficult. Along with the basic Gradient Method, the classical approaches include the Nonlinear Conjugate Gradients and Quasi-Newton Methods (see, e.g., (Nocedal & Wright, 2006)), which typically demonstrate decent practical performance, while replicating the standard Conjugate Gradients in the linear case. However, these methods lack good global complexity bounds, and thus in the worst-case scenario they can actually perform even worse than the Gradient Method (Gupta et al., 2023).

| Preconditioner | Condition number, β/α | Methods | Cost |
|---|---|---|---|
| Classical Gradient Method, P = I | λ1/λn | GM, FGM | cheap |
| Full Preconditioning, P = B⁻¹ | 1 | GM, FGM | expensive |
| Symmetric Polynomial Preconditioning (ours), P = Pτ (11) | (λ1/λn) · ξτ(λ) | GM, FGM | cheap for small τ |
| Krylov Subspace Minimization (ours), optimal polynomial | λτ+1/λn | GM | cheap for small τ |

Table 1: The value β/α for different preconditioning strategies, λ = λ(B). Note that ξτ(λ) ≤ 1, and ξτ(λ) → 0 in the case of large spectral gaps, namely when λ1/λτ+1 → ∞ (see Section 4). For solving the problem with accuracy ϵ > 0, GM needs k(ϵ) = O((β/α) · (1/ϵ)) and k(ϵ) = O((β/α) · (L/µ) · log(1/ϵ)) iterations for convex and strongly convex functions, respectively. FGM needs only √k(ϵ) iterations (Theorems 3.1 and 3.2).

At the same time, the Fast Gradient Method developed by (Nesterov, 1983) is optimal for the class of nonlinear problems with uniformly bounded eigenvalues of the Hessian (Nemirovski & Yudin, 1983). This assumption does not take into account the actual distribution of the spectrum. Hence, it cannot distinguish the problems with large gaps between the top eigenvalues, as in Figure 2.

There have been several attempts to study more specific problem formulations, and so to gain a provable advantage for the optimization algorithms by leveraging the spectrum of the Hessian. Thus, quadratic minimization problems were studied under the assumption of a particular probability distribution for the eigenvalues (Scieur & Pedregosa, 2020; Cunha et al., 2022), or assuming a certain fixed spectral gap (Goujaud et al., 2022), revealing the advantages of employing the Heavy-ball Method (Polyak, 1987) in these cases. Another example is the Stochastic Spectral Descent (Kovalev et al., 2018), which improves the condition number for quadratic problems if we know some of the eigenvectors.

In this work, we consider a refined smoothness characterization of the objective with the curvature matrix B (Assumption 2.1). It is similar in spirit to the one used in Stochastic Dual Newton Ascent (Qu et al., 2016). An important particular instance of this class of algorithms is the Randomized Coordinate Descent with Volume Sampling (Rodomanov & Kropotov, 2020). In the latter method, it was proposed to select subsets of variables of a certain size m proportionally to the determinants of principal submatrices of B. While this approach was practically implementable only for subsets of size m = 1 or 2, it was shown that, in theory, the method is insensitive to a large spectral gap after the top m − 1 eigenvalues. Surprisingly, our new family of Symmetric Polynomial Preconditioners can be viewed as a deterministic version of the Volume Sampling technique (with m = τ + 1, where τ is the degree of the preconditioning polynomial; preconditioner (2) corresponds to τ = 1). Thus, we provide Volume Sampling with a novel deterministic interpretation, which also leads to new accelerated and composite optimization algorithms (see Section 4.3 for a detailed comparison).

Contributions. We propose several polynomial preconditioning strategies for first-order methods for solving a general composite convex optimization problem, and prove their better global complexity guarantees. Specifically:

- We study the convergence of the basic Gradient Method (GM, Algorithm 1) and the accelerated Fast Gradient Method (FGM, Algorithm 2) with a general (arbitrarily fixed) preconditioning matrix. We introduce two condition numbers that are designated to the different parts of the objective (L/µ for the nonlinearity and β/α for the curvature matrix), and show that they serve as the main complexity factors.
- We develop a new family of Symmetric Polynomial Preconditioners (Section 4). Combining them with the preconditioned Gradient Methods, we establish a significant improvement of the curvature condition number β/α in the case of large gaps between the top eigenvalues of the matrix (see Table 1).
- Then, we propose a new adaptive procedure based on the Krylov subspace minimization (Algorithm 3) that achieves the best polynomial preconditioning. We present the guarantees we can get, including cutting off the top eigenvalues directly and by employing the Chebyshev polynomials, and compare this approach with the Symmetric Polynomial Preconditioning.
- Numerical experiments are provided.

2. Notation and Assumptions

We consider the following optimization problem given in the composite form:

$$F^{*} = \min_{x \in \mathbb{R}^n} \Bigl\{ F(x) \overset{\text{def}}{=} f(x) + \psi(x) \Bigr\}, \qquad (4)$$

where $f : \mathbb{R}^n \to \mathbb{R}$ is a differentiable convex function which is the main part of the problem, and $\psi : \mathbb{R}^n \to \mathbb{R} \cup \{+\infty\}$ is a proper closed convex function that can be nondifferentiable but has a simple structure. For example, it can be an indicator of a convex set, or the ℓ1-regularizer. Additionally, we fix some symmetric positive-definite matrix $B \in \mathbb{R}^{n \times n}$ (notation $B = B^{\top} \succ 0$). This matrix plays the key role in our characterization of the smoothness properties of f. Namely, we assume the following (considering for simplicity twice-differentiable functions):

Assumption 2.1. The Hessian of f is uniformly bounded, for some constants $L \geq \mu \geq 0$:

$$\mu B \preceq \nabla^2 f(x) \preceq L B, \quad \forall x \in \mathbb{R}^n.$$
(5) Having fixed the matrix B, we define the corresponding induced norm by x B def = Bx, x 1/2, x Rn. Thus, matrix B is responsible for fixing the coordinate system in the problem. Then, condition (5) can be rewritten in terms of the global lower and upper bound on the first-order approximation of f (Nesterov, 2018): µ 2 y x 2 B f(y) f(x) f(x), y x 2 y x 2 B, x, y Rn. (6) In what follows, we denote by λ = λ(B) Rn the vector of eigenvalues for the matrix B, sorted in a nonincreasing order: λ1 λ2 . . . λn. The classical example is B := I (identity matrix). Then, condition (5) implies that the function f is (strongly) convex and has the Lipschitz continuous gradient. However, by choosing a specific B, we tend to achieve a better granularity of the description of our problem class and thus to improve the convergence properties of the methods. Example 2.2. Let a Rn. Then, the quadratic function f(x) = 1 2 Bx, x a, x , satisfies condition (5) with L = µ = 1. We see that in this case, the so-called condition number L/µ is just 1, which means that preconditioning the Gradient Method (1) with the matrix P := B 1 would give an immediate convergence to the solution. However, inverting the matrix is prohibitively expensive for large scale problems. Our aim is to find a suitable trade-off between improving the condition number and the arithmetic cost of algorithm steps. Let us consider the following important example which can be met in many practical applications. Example 2.3. Let A Rm n be a given data matrix, and b Rm be a given vector. Denote, f(x) = g(Ax + b) Then, the derivatives are as follows: f(x) = A g(Ax + b) and 2f(x) = A 2g(Ax + b)A. Hence, assuming: µIm 2g(x) LIm, x, with some L µ 0, condition (5) is satisfied 3 with At the same time, for B := In (the standard Euclidean norm), the Lipschitz constant increases by the factor λ1(A A), which makes the problem extremely illconditioned. A particular case of this example is separable optimization, or generalized linear models (Bishop, 2006), which covers the classical regression and classification models. Example 2.4. Let i=1 ϕ( ai, x ), x Rn, where ϕ : R R is a loss function satisfying: µ ϕ (t) L, t R, with some L µ 0. Then, forming the matrix A Rm n whose rows are a 1 , . . . , a m and setting B := A A, condition (5) holds. 3. Preconditioned Gradient Methods A natural intention would be to use the global upper bound (6) as a model for the smooth part of the objective. However, the direct minimization of such upper model requires to solve the linear system with the matrix B, which can computationally unfeasible for large scale problems. Instead, let us fix for our preconditioner some positive definite symmetric matrix P = P 0, which satisfies the following bound, for some α := α(P ) and β := β(P ) α > 0: αB 1 P βB 1. (7) We are going to use this matrix instead of B 1 in our methods. For a fixed symmetric positive definite matrix P and parameter M > 0, we denote the gradient step from a point x dom ψ along a gradient direction g Rn by Grad Step M,P (x, g) def = argmin y dom ψ n g, y + ψ(y) + M 2 y x 2 P 1 o . This operation is well-defined since the objective function in the above minimization problem is strongly convex. We assume that both ψ and P are reasonably simple so that the corresponding gradient step can be efficiently computed. An important case is ψ = 0 for which we have 3Here, we assume that A A 0 which is typically the case when m n. Otherwise, we can reduce the dimensionality. 
Polynomial Preconditioning for Gradient Methods Grad Step M,P (x, g) = x 1 M P g. The latter expression can be efficiently computed whenever one can cheaply multiply the matrix P by any vector. 3.1. Preconditioned Basic Gradient Method First, we consider the basic first-order scheme shown in Algorithm 1 for solving the composite problem (4). For simplicity, in this section, we only present a version of this method with a fixed step size and assume that all necessary constants are known. An adaptive version of Algorithm 1 which does not have these limitations and is more efficient in practice can be found in Appendix C. Algorithm 1 Preconditioned Basic Gradient Method Input: x0 dom ψ, P = P 0, M > 0. for k = 0, 1, . . . do Compute xk+1 = Grad Step M,P xk, f(xk) . end for For Algorithm 1, we can prove the following results. Theorem 3.1. Consider Algorithm 1 with M = βL. Then, at each iteration k 1, we have F(xk) F β α L x0 x 2 B k . (8) When µ > 0, the convergence is linear: for all k 1, F(xk) F 1 1 k [F(x0) F ]. (9) We see that one of the principal complexity factors in the above estimates is the condition number β/α which depends on the choice of our preconditioner P (see (7)). For the basic choice P = I, we have β/α = λ1/λn. However, as we show in the following sections, it is possible to use more efficient (and still quite cheap) preconditioners which improve this condition number. 3.2. Preconditioned Fast Gradient Method Now let us consider an accelerated scheme shown in Algorithm 2. This algorithm is one of the standard variants of the Fast Gradient Method (FGM) known as the Method of Similar Triangles (see, e.g., Section 6.1.3 in (Nesterov, 2018)) but adapted to our assumptions (5) and (7). As in other versions of FGM, to properly handle strongly convex problems, Algorithm 2 requires the knowledge of the strong convexity parameter α and µ (or, more precisely, their product ρ = αµ). For non-strongly convex problems, we can always choose α = µ = 0. See also Appendix C for a variant of Algorithm 2 which can automatically adjust the constant M in iterations. Algorithm 2 Preconditioned Fast Gradient Method Input: x0 dom ψ, P = P 0, M > 0, ρ 0. Set v0 = x0, A0 = 0. for k = 0, 1, . . . do Find ak+1 from eq. Ma2 k+1 Ak+ak+1 = 1 + ρ(Ak + ak+1). Set Ak+1 = Ak + ak+1, Hk = 1+ρAk+1 Set θk = ak+1 Ak+1 , ωk = ρ Hk , γk = ωk(1 θk) 1 ωkθk . Set ˆvk = (1 γk)vk + γkxk. Set yk = (1 θk)xk + θk ˆvk. Compute vk+1 = Grad Step Hk,P ˆvk, f(yk) . Set xk+1 = (1 θk)xk + θkvk+1. end for The convergence results for Algorithm 2 are as follows. Theorem 3.2. Consider Algorithm 2 with M = βL and ρ = αµ. Then, at each iteration k 1, we have F(xk) F 2 β α L x0 x 2 B k2 . (10) When µ > 0, the convergence is linear: for all k 1, F(xk) F 1 rα 2 x0 x 2 B. Comparing these estimates with those from Theorem 3.1, we see that the accelerated scheme is much more efficient. For instance, to reach accuracy ϵ > 0 in terms of the objective function in the non-strongly convex case, Algorithm 1 needs k(ϵ) = β α L x0 x 2 B ϵ iterations, while for Algorithm 2 this number is only k2(ϵ) = p 2k(ϵ). Similar conclusions are valid in the strongly convex case. Despite having much weaker dependency on the condition number β/α, Algorithm 2 is still quite sensitive to it. Thus, the proper choice of the preconditioner P is important for both our methods. 4. Symmetric Polynomial Preconditioning We would like to have a family of preconditioners Pτ for our problem indexed by some parameter τ. 
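Before constructing this family, the following minimal sketch (our own, for the simplest setting ψ = 0 and ρ = αµ = 0, with M = βL) shows how Algorithm 2 uses a generic preconditioner P; any of the matrices Pτ introduced below can be plugged in for P.

```python
import numpy as np

def preconditioned_fgm(grad_f, x0, P, M, num_iters=1000):
    """Sketch of Algorithm 2 for psi = 0 and rho = alpha*mu = 0
    (non-strongly convex case), with a generic preconditioner P."""
    x, v, A = x0.copy(), x0.copy(), 0.0
    for _ in range(num_iters):
        # With rho = 0 the step-size equation M*a^2/(A + a) = 1 becomes M*a^2 = A + a.
        a = (1.0 + np.sqrt(1.0 + 4.0 * M * A)) / (2.0 * M)
        A_next = A + a
        theta = a / A_next
        y = (1.0 - theta) * x + theta * v     # hat v_k = v_k, since gamma_k = 0
        v = v - a * (P @ grad_f(y))           # GradStep with H_k = 1 / a_{k+1}
        x = (1.0 - theta) * x + theta * v
        A = A_next
    return x
```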
Varying τ should provide us with a trade off between the spectral quality of approximation (7) of the inverse matrix and the arithmetical cost of computing the preconditioner. Surprisingly, such a family of preconditioners can be built by using symmetric polynomials, the classical objects of Algebra. We prove that our preconditioning improves the condition number β α of the problem, by automatically cutting off the large gaps between the top eigenvalues. Polynomial Preconditioning for Gradient Methods 4.1. Definition and Basic Properties We define the family of symmetric matrices {Pτ}0 τ n 1 recursively. We start with identity matrix: P0 def = I. Then, Pτ def = 1 τ i=1 ( 1)i 1Pτ i Ui, (11) where Uτ def = tr (Bτ)I Bτ are the auxiliary matrices. It turns out that matrices (11) serve as a good approximation of the inverse matrix: Pτ B 1, up to some multiplicative constant, and the quality of such approximation is gradually improving when increasing parameter τ. Let us look at several first members. Clearly, P1 = tr (B)I B, (12) which is very easy to handle, by having computed the trace of the curvature matrix. Then, multiplying P1 by any vector would require just one matrix-vector multiplication with our original B. Further, P2 = 1 2tr (P1B)I P1B = 1 2 [tr (B)]2 tr (B2) I tr (B)B + B2, (13) thus its use would cost just two matrix-vector products with B, having evaluated4,5 the numbers tr (B) and tr (B2). It is clear that in general Pτ = pτ(B), where pτ is a polynomial of a fixed degree τ with coefficients that can be found recursively from (11). Let us give a useful interpretation for our family of preconditioners, that also explains their name. For a Rn 1, we denote by σ0(a), . . . , σn 1(a) the elementary symmetric polynomials in n 1 variables6. It is known that every symmetric polynomial (that is invariant to any permutation of the variables) can be represented as a weighed sum of elementary symmetric polynomials (Dummit & Foote, 2004). We establish the following important characterization. Lemma 4.1. Let B = QDiag (λ)Q be the spectral decomposition. Then, Pτ = QDiag (στ(λ 1), . . . , στ(λ n))Q , (14) where λ i Rn 1 is the vector that contains all elements of λ except λi. In particular, we justify Pτ 0. For τ = n 1, we get 4Note that tr (B2) = Pn i=1 B[:, i] 2 2, where B[:, i] Rn is the ith column of B. 5For general τ, we can also use a stochastic estimate of the trace: ξτ def = n Bτu, u , where u Rn is uniformly distributed on the unit sphere. It would give an unbiased estimate: E[ξτ] = n E[tr (u Bτu)] = ntr (E[uu ]Bτ) = tr (Bτ). 6That is στ(a) def = P 1 i1<... 0, we use the matrix P = pτ(B) as a preconditioner, where pτ is a specifically constructed polynomial of degree τ such that P 0. A natural question is how optimal is this choice of a polynomial? Indeed, the problem of polynomial approximation has a long and rich history with an affirmative answer provided by the classical Chebyshev polynomials (Mason & Handscomb, 2002) for the uniform approximation bound. We present a new adaptive algorithm that automatically achieves the best polynomial preconditioning. Then, we study what are the complexity guarantees that we can get with this optimal approach. In this section, we focus on the non-composite case only, i.e. the problem of unconstrained minimization of a smooth function: minx Rn f(x). 5.1. Gradient Method with Krylov Preconditioning We denote Pa def = a0I + a1B + . . . + aτBτ, where vector a = (a0, . . . , aτ) Rτ+1 is a parameter. 
In each iteration, it is found by solving the linear system: a = A 1 τ gτ Rτ+1, (18) where Aτ = Aτ(x) R(τ+1) (τ+1) is the Gram matrix with the following structure (0 i, j τ): Aτ(x) (i,j) def = L f(x), Bi+j+1 f(x) , (19) and gτ = gτ(x) Rτ+1 is defined by (0 i τ): gτ(x) (i) def = f(x), Bi f(x) . (20) Note that this operation is exactly the projection of the direction 1 LB 1 f(xk) onto the Krylov subspace (3): xk+1 xk := argmin h Kτxk h + 1 LB 1 f(xk) 2 B. Polynomial Preconditioning for Gradient Methods Fortunately, for computing this projection we indeed do not need to invert the curvature matrix B, but to solve only a small linear system (18) of size τ + 1. We are ready to formulate our new adaptive method. Algorithm 3 Gradient Method with Krylov Preconditioning Initialization: x0 Rn, τ 0, L > 0. for k = 0, 1, . . . do Form matrix Aτ(xk) and vector gτ(xk) by (19), (20). Compute ak = Aτ(xk) 1gτ(xk) Rτ+1. Set xk+1 = xk Pak f(xk). end for We prove the following optimality result. Theorem 5.1. Let P 0 be any preconditioner that is a polynomial of degree τ of the curvature matrix: P = pτ(B), pτ R[s], deg(pτ) = τ. Then, for the iteration of Algorithm 3 we have the global rates (8),(10) with the condition number that is attributed to P (7): β Hence, our method automatically chooses the best possible preconditioning matrix from the polynomial class. Let us understand what are the bounds for β α that we can achieve in this case. 5.2. Bounds for the Condition Number Let us assume that the top τ > 0 eigenvalues of B are all separated. Then, we can easily cut them off with the following simple construction. Define qτ(s) def = 1 s λ1 1 s λ2 . . . 1 s λτ . (21) Proposition 5.2. For any τ, taking P = pτ(B), where pτ(s) := 1+qτ (s) (αs 1) s with qτ defined by (21) and α = 2 λτ+1+λn , the condition number is bounded by β The worst case instance for the cutting strategy is when all the eigenvalues except one share the same value. A better approach in such a situation would be to find a bound from the uniform polynomial approximation for the whole interval [λn, λ1], which is achieved with the Chebyshev polynomials (Nemirovski, 1995). Proposition 5.3. For a fixed 0 < ϵ < 1, let τ := q ϵ . Then, taking P = pτ(B), where pτ(s) := s with Qτ is a normalized Chebyshev polynomial7of the first kind of degree τ + 1, the condition number is bounded by β 7See Appendix B.9 for the precise definition. 5.3. Discussion We see that in the case of unconstrained smooth minimization, it is possible to achieve the guarantee of the best polynomial of a fixed degree τ, by computing a certain projection onto the corresponding Krylov subspace. Namely, we can achieve β λn (Proposition 5.2), which cuts off the top τ eigenvalues of the spectrum completely, if they are separated from the others. At the same time, by using the Chebyshev polynomials, we can contract a part of the spectrum uniformly, with an appropriate degree τ (Proposition 5.3). It remains to be an open question where we can incorporate adaptive Krylov preconditioning into the Fast Gradient Method, which would give us a further improvement of the condition number. 6. Experiments Huber Loss. Let us present an illustrative experiment, with the regression model (Example 2.4) with the Huber loss function: ( t2 2µ, if |t| µ, |t| µ 2 , otherwise, where µ := 0.1 is a parameter. The data is generated with a fixed distribution of eigenvalues: λ1 > λ2 > λ3 = . . . λn = 1. Thus, we have two gaps between the leading eigenvalues. 
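In all such experiments, the preconditioners P1 and P2 are never formed explicitly: by (12) and (13), applying them to a vector costs one and two matrix-vector products with B, respectively, once tr(B) and tr(B²) have been evaluated. A minimal sketch (the helper names are ours, not the paper's):

```python
import numpy as np

def apply_P1(B, v, trB):
    """P1 @ v for P1 = tr(B) I - B from (12): one matvec with B."""
    return trB * v - B @ v

def apply_P2(B, v, trB, trB2):
    """P2 @ v for P2 = 0.5*(tr(B)^2 - tr(B^2)) I - tr(B) B + B^2 from (13):
    two matvecs with B, given the precomputed scalars tr(B) and tr(B^2)."""
    Bv = B @ v
    return 0.5 * (trB ** 2 - trB2) * v - trB * Bv + B @ Bv

# The two traces are computed once; for symmetric B, tr(B^2) = sum_ij B_ij^2:
# trB, trB2 = np.trace(B), np.sum(B * B)
```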
We use the Gradient Method (Algorithm 1), with the adaptive search to fit the parameter M. The results are shown in Figure 4. Using the preconditioner P1 helps the method to deal with the large gap between λ1 and λ2, while P2 makes the method to be insensitive to the gap between λ1 and λ3, as predicted by our theory. 0 200 400 600 Iterations Func. residual Huber, 1 = 100, 2 = 10 0 2000 4000 6000 Iterations Func. residual Huber, 1 = 1000, 2 = 10 0 2000 4000 6000 Iterations Func. residual Huber, 1 = 1000, 2 = 800 0 20000 40000 Iterations Func. residual Huber, 1 = 8000, 2 = 800 Figure 4: Minimizing the Huber loss. Logistic Regression. We examine the training of logistic regression on real data. In Figure 5, we see that the best convergence is achieved by the Fast Gradient Method (FGM, Algorithm 2) with P2. Using Symmetric Polynomial Preconditioning makes the methods to converge much Polynomial Preconditioning for Gradient Methods better (two times faster for GM using P2 instead of P0 I, and about 1.5 times faster for FGM). Among the versions of GM, the most encouraging performance belongs to the Krylov preconditioning, which is consistent with the theory. 0 2000 4000 Iterations Func. residual GM, P0 GM, P1 GM, P2 0 2000 4000 Iterations 100 Krylov, mnist Krylov, = 0 Krylov, = 1 Krylov, = 2 Krylov, = 3 0 250 500 750 1000 Iterations 100 FGM, mnist FGM, P0 FGM, P1 FGM, P2 Figure 5: Training logistic regression. Acknowledgements The work of the first author was supported by the Swiss State Secretariat for Education, Research and Innovation (SERI) under contract number 22.00133. The work of the second author received funding from the European Research Council (ERC) under the European Union s Horizon 2020 research and innovation programme (grant agreement No. 788368). Bishop, C. M. Pattern recognition and machine learning, volume 4. Springer, 2006. Bullins, B. Highly smooth minimization of non-smooth problems. In Conference on Learning Theory, pp. 988 1030. PMLR, 2020. Cunha, L., Gidel, G., Pedregosa, F., Scieur, D., and Paquette, C. Only tails matter: Average-case universality and robustness in the convex regime. In International Conference on Machine Learning, pp. 4474 4491. PMLR, 2022. Deshpande, A., Rademacher, L., Vempala, S. S., and Wang, G. Matrix approximation and projective clustering via volume sampling. Theory of Computing, 2(12):225 247, 2006. Dubois, P. F., Greenbaum, A., and Rodrigue, G. H. Approximating the inverse of a matrix for use in iterative algorithms on vector processors. Computing, 22(3):257 268, 1979. Dummit, D. S. and Foote, R. M. Abstract algebra, volume 3. Wiley Hoboken, 2004. Golub, G. H. and Van Loan, C. F. Matrix computations. JHU press, 2013. Goujaud, B., Scieur, D., Dieuleveut, A., Taylor, A. B., and Pedregosa, F. Super-acceleration with cyclical step-sizes. In International Conference on Artificial Intelligence and Statistics, pp. 3028 3065. PMLR, 2022. Gupta, S. D., Freund, R. M., Sun, X. A., and Taylor, A. Nonlinear conjugate gradient methods: worst-case convergence rates via computer-assisted analyses. ar Xiv preprint ar Xiv:2301.01530, 2023. Hardy, G. H., Littlewood, J. E., and P olya, G. Inequalities. Cambridge University Press, second edition, 1952. Hestenes, M. R. and Stiefel, E. Methods of conjugate gradients for solving linear systems. Journal of research of the National Bureau of Standards, 49(6):409, 1952. Johnson, O. G., Micchelli, C. A., and Paul, G. Polynomial preconditioners for conjugate gradient calculations. 
SIAM Journal on Numerical Analysis, 20(2):362 376, 1983. Kalman, D. A matrix proof of Newton s identities. Mathematics Magazine, 73(4):313 315, 2000. Kovalev, D., Richt arik, P., Gorbunov, E., and Gasanov, E. Stochastic spectral and conjugate descent methods. Advances in Neural Information Processing Systems, 31, 2018. Liu, Q., Morgan, R. B., and Wilcox, W. Polynomial preconditioned gmres and gmres-dr. SIAM Journal on Scientific Computing, 37(5):S407 S428, 2015. Loe, J. A. and Morgan, R. B. Toward efficient polynomial preconditioning for gmres. Numerical Linear Algebra with Applications, 29(4):e2427, 2022. Mardal, K.-A. and Winther, R. Preconditioning discretizations of systems of partial differential equations. Numerical Linear Algebra with Applications, 18(1):1 40, 2011. Mason, J. C. and Handscomb, D. C. Chebyshev polynomials. Chapman and Hall/CRC, 2002. Nemirovski, A. Information-based complexity of convex programming. Lecture Notes, 834, 1995. Nemirovski, A. and Yudin, D. Problem complexity and method efficiency in optimization. 1983. Nesterov, Y. A method for solving the convex programming problem with convergence rate O(1/kˆ2). In Dokl. akad. nauk Sssr, volume 269, pp. 543 547, 1983. Nesterov, Y. Smooth minimization of non-smooth functions. Mathematical programming, 103(1):127 152, 2005. Polynomial Preconditioning for Gradient Methods Nesterov, Y. Gradient methods for minimizing composite functions. Mathematical Programming, 140(1):125 161, 2013. Nesterov, Y. Lectures on convex optimization, volume 137. Springer, 2018. Nocedal, J. and Wright, S. Numerical optimization. Springer Science & Business Media, 2006. Polyak, B. T. Introduction to optimization. Optimization Software, 1987. Qu, Z., Richt arik, P., Tak ac, M., and Fercoq, O. Sdna: stochastic dual newton ascent for empirical risk minimization. In International Conference on Machine Learning, pp. 1823 1832. PMLR, 2016. Rodomanov, A. and Kropotov, D. A randomized coordinate descent method with volume sampling. SIAM Journal on Optimization, 30(3):1878 1904, 2020. Saad, Y. Practical use of polynomial preconditionings for the conjugate gradient method. SIAM Journal on Scientific and Statistical Computing, 6(4):865 881, 1985. Scieur, D. and Pedregosa, F. Universal average-case optimality of polyak momentum. In International conference on machine learning, pp. 8565 8572. PMLR, 2020. Spielman, D. A. and Teng, S.-H. Nearly-linear time algorithms for graph partitioning, graph sparsification, and solving linear systems. In Proceedings of the thirty-sixth annual ACM symposium on Theory of computing, pp. 81 90, 2004. Vaidya, P. M. Solving linear equations with symmetric diagonally dominant matrices by constructing good preconditioners. A talk based on this manuscript, 2(3.4):2 4, 1991. Van Gijzen, M. A polynomial preconditioner for the gmres algorithm. Journal of Computational and Applied Mathematics, 59(1):91 107, 1995. Vishnoi, N. K. Lx=b. Foundations and Trends in Theoretical Computer Science, 8(1 2):1 141, 2013. Supplementary Material A. Extra Experiments Logistic Regression. Let us present experimental results for our preconditioning strategies, for the training of Logistic Regression with several real datasets. We investigate both the number of iterations and the number of matrix-vector products (the most difficult operation) required to reach a certain accuracy level in the functional residual. The results are shown in Figure 6. 0 2000 4000 Iterations Func. 
[Figure 6 panels: a9a, phishing, breast-cancer, ionosphere, and covtype; functional residual against the number of iterations and against the number of matrix-vector products, for GM and FGM with P0, P1, P2 and for the Krylov method with τ = 0, 1, 2, 3.]

Figure 6: Training logistic regression with Algorithm 1 (GM) and Algorithm 2 (FGM) employing Symmetric Polynomial Preconditioning (11), and with Algorithm 3 (Krylov).

We see that using Symmetric Polynomial Preconditioning (P1 and P2) significantly accelerates both the Gradient Method (GM) and the Fast Gradient Method (FGM), without extra arithmetic effort during each iteration. Using the Krylov preconditioning is more costly, while it equips GM with the best possible iteration rates.

Typical Distributions of the Data Spectrum. Let us provide the plots with the distributions of the leading eigenvalues (in the logarithmic scale) of the curvature matrix B for a selection of typical machine learning datasets8. We see (Figure 7) that it is quite common to have several top eigenvalues well separated from the others. In these cases, our new Symmetric Polynomial Preconditioners provide the gradient methods with the best acceleration.
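For reference, spectra like those in Figure 7 can be computed directly from the data matrix; a small sketch (the data-loading step and the file name are our assumption, the paper only points to the LIBSVM collection):

```python
import numpy as np
from sklearn.datasets import load_svmlight_file

# Hypothetical file name; any dataset from the LIBSVM collection works the same way.
A, _ = load_svmlight_file("breast-cancer_scale.txt")
A = np.asarray(A.todense())

B = A.T @ A                              # curvature matrix as in Example 2.4
spectrum = np.linalg.eigvalsh(B)[::-1]   # eigenvalues in nonincreasing order
print(spectrum[:10])                     # the leading eigenvalues plotted in Figure 7
```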
Figure 7: Leading eigenvalues (in the logarithmic scale) of the curvature matrix for several real datasets.

8https://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/

Comparison with BFGS and L-BFGS. In the following experiment, we compare the performance of the Gradient Method (GM) and the Fast Gradient Method (FGM) equipped with our Symmetric Polynomial Preconditioning, and of GM with Krylov preconditioning, against the classical BFGS and L-BFGS optimization schemes (Nocedal & Wright, 2006). The results are presented in Figure 8. We show both the number of matrix-vector products (the most expensive operation) and the total computational time9 required to reach a given accuracy in terms of the functional residual.

[Figure 8 panels: breast-cancer; functional residual against matrix-vector products and against clock time, for GM and FGM (orders 0, 1, 2), Krylov (τ = 0), BFGS, and L-BFGS.]

Figure 8: Comparison of our methods with Quasi-Newton methods: training Logistic Regression.

We see that the performance of the Gradient Method and the Fast Gradient Method with the Symmetric Polynomial preconditioner of order τ = 2 (GM, 2 and FGM, 2, respectively) is comparable to that of the BFGS and L-BFGS methods.

9Clock time was evaluated using a machine with an Intel Core i5 CPU, 1.6 GHz, and 8 GB RAM. All methods were implemented in Python.

In the next experiment, we consider the problem of minimizing the Soft Maximum objective (log-sum-exp):

$$\min_{x \in \mathbb{R}^n} f(x) := \mu \ln \Bigl( \sum_{i=1}^{m} \exp\Bigl( \tfrac{\langle a_i, x \rangle - b_i}{\mu} \Bigr) \Bigr) \;\approx\; \max_{1 \le i \le m} \bigl[ \langle a_i, x \rangle - b_i \bigr], \qquad (22)$$

where µ > 0 is a sufficiently small number. Problems of this type are important in applications with minimax strategies for matrix games and for training ℓ∞-regression (Nesterov, 2005; Bullins, 2020). The vectors $a_1, \ldots, a_m \in \mathbb{R}^n$ and $b \in \mathbb{R}^m$ form our data, with n = 100, m = 200. Note that the structure of this objective satisfies Example 2.3, and the corresponding curvature matrix has the following distribution of the spectrum (in the double-logarithmic scale): [spectrum plot with a few dominant leading eigenvalues]. Note that the function (22) does not have a finite-sum structure, thus it is impossible to apply stochastic optimization methods to this problem. We use the value µ = 0.005 for the smoothing parameter. The results are shown in Figure 9.
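For completeness, here is a small sketch of how the objective (22) and its gradient can be evaluated in a numerically stable way; the max-subtraction trick is our implementation choice, not something specified in the paper.

```python
import numpy as np

def soft_max_value_and_grad(A, b, x, mu):
    """f(x) = mu * log(sum_i exp((<a_i, x> - b_i)/mu)) and its gradient A^T p,
    where p is the softmax of the scaled residuals (cf. Example 2.3)."""
    z = (A @ x - b) / mu
    z_max = z.max()
    w = np.exp(z - z_max)            # subtract the max for numerical stability
    f = mu * (z_max + np.log(w.sum()))
    p = w / w.sum()
    return f, A.T @ p

rng = np.random.default_rng(0)
A = rng.standard_normal((200, 100))  # m = 200, n = 100, as in the experiment
b = rng.standard_normal(200)
f0, g0 = soft_max_value_and_grad(A, b, np.zeros(100), mu=0.005)
```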
residual log-sum-exp, = 0.005 BFGS L-BFGS FGM, 0 FGM, 1 FGM, 2 Figure 9: Comparison of our methods with Quasi-Newton methods: training Soft Maximum (log-sum-exp objective) We see that using our Symmetric Polynomial Preconditioning significantly helps the Gradient Method (GM) and the Fast Gradient Method (FGM). The performance of FGM with preconditioner of order τ = 2 is comparable to that one of the BFGS algorithm, both in terms of the matrix-vector products and total computational time. Polynomial Preconditioning for Gradient Methods B.1. Proof of Theorem 3.1 Let us consider one iteration of the method, for some k 0. By definition, xk+1 = argmin y dom ψ Ωk(y) , where Ωk(y) def = f(xk) + f(xk), y xk + M 2 y xk 2 P 1 + ψ(y) is strongly convex with respect to P 1 norm with parameter M := βL. Thus, we have, for any y dom ψ: 2 y xk 2 P 1 + F(y) Ωk(y) Ωk(xk+1) + M 2 y xk+1 2 P 1 f(xk) + f(xk), xk+1 xk + L 2 xk+1 xk 2 B + ψ(xk+1) + M 2 y xk+1 2 P 1 (6) F(xk+1) + M 2 y xk+1 2 P 1. Hence, substituting y := x (solution to the problem), we establish the boundness for all iterates: xk+1 x P 1 xk x P 1. (24) Further, let us take y := γkx + (1 γk)xk, for some γk [0, 1]. We obtain F(xk+1) (23) F(γkx + (1 γk)xk) + γ2 k M 2 x xk 2 P 1 γk F + (1 γk)F(xk) + γ2 k M 2 x xk 2 P 1. Now, setting Ak def = k (k + 1), ak+1 def = Ak+1 Ak = 2(k + 1), and γk := ak+1 Ak+1 = 2 k+2, we obtain Ak+1 F(xk+1) F (25) Ak F(xk) F + a2 k+1 Ak+1 M 2 x xk 2 P 1 (24) Ak F(xk) F + a2 k+1 Ak+1 M 2 x x0 2 P 1 (7) Ak F(xk) F + a2 k+1 Ak+1 β 2 x x0 2 B. Telescoping this bound for the first k iterations, we get F(xk) F (26) β α L 2 x x0 2 B 1 Ak a2 i Ai = O β α L 2k x x0 2 B . To prove the linear rate for the strongly convex case, we continue as follows F(xk+1) (25),(6) γk F + (1 γk)F(xk) + γ2 k βL αµ F(xk) F . Choosing γk := αµ 2βL < 1, we get the exponential rate F(xk+1) F 1 αµ 4βL F(xk) F , which completes the proof. B.2. Proof of Theorem 3.2 Let x dom ψ and k 0 be arbitrary. From (6), (7), and the fact that ρ = αµ, it follows that F(x) = f(x) + ψ(x) ℓk(x) + ρ 2 x yk 2 P 1, ℓk(x) def = f(yk) + f(yk), x yk + ψ(x). Polynomial Preconditioning for Gradient Methods Ak F(xk) + ak+1F(x) + 1 + ρAk 2 x vk 2 P 1 Akℓk(xk) + ak+1ℓk(x) + 1 + ρAk 2 x vk 2 P 1 + ρak+1 2 x yk 2 P 1 Akℓk(xk) + ak+1ℓk(x) + 1 + ρAk+1 2 x ˆvk 2 P 1 def = ζk(x), where the final inequality follows from the convexity of the squared norm and the fact that, according to our definitions, (1 + ρAk)vk + ρak+1yk 1 + ρAk+1 = (1 ωk)vk + ωkyk = (1 ωk)vk + ωk[(1 θk)xk + θk ˆvk] = ˆvk. Note that ζk is a (1 + ρAk+1)-strongly convex function w.r.t. P 1, and vk+1 is precisely its minimizer. Therefore, ζk(x) ζk(vk+1) + 1 + ρAk+1 2 x vk+1 2 P 1. (28) Since ℓk is a convex function, we have, by our definition of xk+1, Akℓk(xk) + ak+1ℓk(vk+1) Ak+1ℓk(xk+1). On the other hand, by the definition of xk+1 and yk, xk+1 yk = θk(vk+1 ˆvk) = ak+1 Ak+1 (vk+1 ˆvk). ζk(vk+1) = Akℓk(xk) + ak+1ℓk(vk+1) + 1 + ρAk+1 2 vk+1 ˆvk 2 P 1 Ak+1 h ℓk(xk+1) + Ak+1(1 + ρAk+1) 2a2 k+1 xk+1 yk 2 P 1 i . In view of our choice of ak+1, we have the following identity: Ma2 k+1 Ak+1 = 1 + ρAk+1. (29) Combining this with the fact that M = βL and using (7) and (6), we get ζk(vk+1) Ak+1 h ℓk(xk+1) + M 2 xk+1 yk 2 P 1 i Ak+1 h ℓk(xk+1) + L 2 xk+1 yk 2 B i = Ak+1 h f(yk) + f(yk), xk+1 yk + L 2 xk+1 yk 2 B + ψ(xk+1) i Ak+1F(xk+1). Substituting the above bound into (28), and that one into (28), we thus obtain Ak F(xk) + ak+1F(x) + 1 + ρAk 2 x vk 2 P 1 Ak+1F(xk+1) + 1 + ρAk+1 2 x vk+1 2 P 1. This inequality is valid for any k 0. 
Fixing an arbitrary k 1 and summing up the previous inequalities for all indices k = 0, . . . , k 1, we get Ak F(xk) Ak F(x) + 1 + ρA0 2 x v0 2 P 1 = Ak F(x) + 1 2 x x0 2 P 1. Substituting further x = x (an optimal solution) and using (7), gives us the following convergence rate estimate: F(xk) F x x0 2 P 1 2Ak x x0 2 B 2αAk . (30) Polynomial Preconditioning for Gradient Methods To complete the proof, it remains to use standard lower bounds on Ak (see (Nesterov, 2018)). Specifically, dropping the second term from the right-hand side of (29) and rearranging, we obtain, for any k 0, M ak+1 = Ak+1 Ak = ( p Cancelling p Ak+1 on both sides and using the fact that A0 = 0, we obtain, for any k 1, Squaring both sides, substituting the resulting inequality into (30) and replacing M = βL, we get (10). When µ > 0, we can drop the first term from the right-hand side of (29). This gives us Hence, for any k 0, Ak+1 Ak = ak+1 q Ak+1, q def = r ρ or, equivalently, Ak+1 Ak 1 q . Consequently, for any k 1, Ak A1 (1 q)k 1 1 M(1 q)k 1 , where the final inequality is due to (29) combined with the fact that A0 = 0. Substituting this inequality into (30) and replacing M = βL, ρ = αµ, we get the second bound from Theorem 3.2. B.3. Proof of Lemma 4.1 Let us denote by uk(a) the k-th power sum of the variables: uk(a) def = n 1 P i=1 ak i , a Rn 1. Then, the classical Newton-Girard identities (see, e.g. (Kalman, 2000)) state the following relation between the elementary symmetric polynomials: i=1 ( 1)i 1στ i(a) ui(a). (31) Note that for the matrix Uτ def = tr (Bτ)I Bτ, the following spectral decomposition holds: Uτ = QDiag n P i=1 λτ i λτ 1, n P i=1 λτ i λτ 2, . . . , n P i=1 λτ i λτ n Q = QDiag uτ(λ 1), uτ(λ 2), . . . , uτ(λ n) Q . Now, the identity that we need to prove is Pτ = QDiag στ(λ 1), στ(λ 2), . . . , στ(λ n) Q . (33) Polynomial Preconditioning for Gradient Methods We justify (33) by induction. By definition, P0 def = I and σ0(a) 1, therefore (33) holds for τ = 0, which is our base. Let us fix τ 1 and assume that (33) is true for all smaller indices. Then, Pτ def = 1 τ i=1 ( 1)i 1Pτ i Ui (33),(32) = QDiag τP i=1 ( 1)i 1στ i(λ 1) ui(λ 1), . . . , τP i=1 ( 1)i 1στ i(λ n) ui(λ n) Q (31) = QDiag στ(λ 1), . . . , στ(λ n) Q . Hence, (33) is proven for all 0 τ n 1. B.4. Proof of Theorem 4.2 By Lemma 4.1, we have the following representation of our preconditioner: Pτ = QDiag στ(λ 1), στ(λ 2), . . . , στ(λ n) Q . Further, for the spectrum of the matrix B1/2PτB1/2 = QDiag λ1 στ(λ 1), λ2 στ(λ 2), . . . , λn στ(λ n) Q , it holds, according to Lemma D.10, that λ1 στ(λ 1) λ2 στ(λ 2) . . . λn στ(λ n). (34) Consequently, λn στ(λ n)I B1/2PτB1/2 λ1 στ(λ 1)I, which proves the required bound. B.5. Proof of Theorem 4.5 Let B = Q Diag(λ)Q be a spectral decomposition of B, where λ = λ(B) and Q Rn n is an orthogonal matrix. Formula (3.5) in (Rodomanov & Kropotov, 2020) states that ES Volτ+1(B)[IS(BS S) 1I S ] = 1 στ+1(λ)Q Diag στ(λ 1), . . . , στ(λ n) Q . (Their στ is στ+1 in our notation.) But, according to Lemma 4.1, Q Diag στ(λ 1), . . . , στ(λ n) Q = Pτ, and the claim follows. B.6. Proof of Lemma 4.3 Clearly, when τ = 0, we have ξ0(λ) 1 . For τ = n 1, inequalities (16) are in fact identities, and ξn 1(λ) (15) = λn λ1 . To prove that ξτ(λ) is decreasing in τ, we need to justify that, for any 0 τ n 2, we have ξτ(λ) def = στ (λ 1) στ (λ n) ξτ+1(λ) def = στ+1(λ 1) στ+1(λ n). This follows from Lemma D.11 (recall that, by our assumptions, λ1 . . . λn > 0). 
To prove the limit, let us divide the right hand side of ξτ(λ) = στ (λ 1) στ (λ n) 2 i1<... 0 by our assumption), we obtain ξτ(λ) = στ(λ 1) στ(λ n) λn + sτ 1 λ1 + sτ 1 , where sτ 1 is the sum of all but the τ 1 largest elements of λ {1,n} = (λ2, . . . , λn 1): Substituting this into the previous display, we get ξτ(λ) λn + Pn 1 i=τ+1 λi λ1 + Pn 1 i=τ+1 λi = Pn i=τ+1 λi λ1 + Pn 1 i=τ+1 λi , which is exactly the desired inequality. B.8. Proof of Proposition 5.2 The problem of finding the best polynomial preconditioner can be reformulated as minimizing the norm of a symmetric matrix over the set of (positive) polynomials of a fixed degree τ 0: n γ(pτ) def = Bpτ(B) I o , where Pτ def = pτ R[s] : deg(pτ) = τ, pτ(B) 0 . Here we use the spectral norm to measure the size of a symmetric matrix, and the objective can be rewritten as γ(pτ) = max s Spec (B) |spτ(s) 1|, (35) where Spec (B) is the discrete set of eigenvalues of the curvature matrix. For any value of γ := γ(pτ), our original approximation guarantee (7) clearly satisfied with β = 1 + γ and α = 1 γ, and the condition number becomes10 β α = 1+γ 1 γ . (36) Now, we take qτ(s) := 1 s λ1 1 s λ2 . . . 1 s λτ . (37) First, note that qτ(0) = 1 and thus the polynomial 1 + qτ(s) (αs 1) is divisible by s. Hence the degree of the polynomial pτ(s) := 1+qτ (s) (αs 1) is exactly τ. Then, we obtain γ = max s Spec (B) |spτ(s) 1| = max s Spec (B) |qτ(s) (αs 1)| max s {λτ+1,...,λn} |αs 1| = λτ+1 λn λτ+1+λn , (38) and the optimal value is α = 2 λτ+1+λn , where we put formally λn+1 λn. It remains to substitute this bound into β α = 1+γ 1 γ , which is monotone in γ. 10We are interested in γ < 1, since γ = 1 trivially holds for zero polynomial. Polynomial Preconditioning for Gradient Methods B.9. Proof of Proposition 5.3 Let us use an upper bound on γ from (35), which is the uniform polynomial approximation for the whole interval [λn, λ1]: γ(pτ) max s [λn,λ1] |spτ(s) 1|. (39) Then, we use Qτ(s) def = Tτ+1 λ1+λn 2s λ1 λn Tτ+1 λ1+λn λ1 λn 1, (40) where Tτ+1( ) is the standard Chebyshev polynomial of the first kind of degree τ + 1. Namely, we can define them recursively: T0(x) def = 1, T1(x) def = x, Tk+1(x) def = 2x Tk(x) Tk 1(x), k 1. Note that Qτ(0) = 1, thus the polynomial 1 Qτ(s) is divisible by s. Then, we take pτ(s) := 1 Qτ (s) which is the polynomial of degree τ. This choice ensures that γ (39),(40) max x [ 1,1] |Tτ+1(x)| Tτ+1 λ1+λn = Tτ+1 λ1+λn λ1 λn 1 2 λ1 λn λ1+ λn where the last inequality is the classical bound for the Chebyshev polynomials (see, e.g. Section 16.4 in (Vishnoi, 2013)). Thus, the condition number β α = 1+γ 1 γ decreases exponentially with τ. B.10. Proof of Theorem 5.1 Let us fix an arbitrary P = P 0 such that P = pτ(B), for some polynomial pτ R[s] and deg(pτ) = τ. We take β := β(P ) and α := α(P ) (from the definition (7)) and denote P := 1 βLP . Let us consider an arbitrary iteration k 0, and denote the following step T := xk P f(xk). Recall also that xk+1 := xk Pak f(xk). By the optimality of ak as the projection of B 1 f(xk) onto the Krylov subspace, we have: f(xk), xk+1 xk + L 2 xk+1 xk 2 B L 2 xk+1 xk + 1 LB 1 f(xk) 2 B 1 2L f(xk) 2 B 1 LB 1 f(xk) 2 B 1 2L f(xk) 2 B 1 f(xk), T xk + L 2 T xk 2 B. (41) Hence, we obtain, for any y Rn: f(xk+1) (6) f(xk) + f(xk), xk+1 xk + L 2 xk+1 xk 2 B (41) f(xk) + f(xk), T xk + L f(xk) + f(xk), T xk + βL 2 T xk 2 P 1 f(xk) + f(xk), y xk + βL 2 y xk 2 P 1. Polynomial Preconditioning for Gradient Methods where we used that T is the minimizer of the last upper bound in y. 
Thus, using convexity, we get, for any y Rn: f(xk+1) f(y) + βL 2 y xk 2 P 1 f(y) + β 2 y xk 2 B. (42) In particular, substituting y := xk we justify that the method is monotone: f(xk+1) f(xk), k 0. Therefore, xk F0 def = n x Rn : f(x) f(x0), o and we assume that the initial level set F0 is bounded, denoting D0 def = sup x F0 x x B < + . Substituting y := γkx + (1 γk)xk, γk [0, 1] into (42), we obtain f(xk+1) γkf + (1 γk)f(xk) + γ2 k β α L γkf + (1 γk)f(xk) + γ2 k β α L 2 D2 0. (43) Substituting γk := 2 k+1 and using the standard technique (see the proof of Theorem 3.1), we establish the global rate for the convex case: f(xk) f O β α LD2 0 k . For strongly convex functions (µ > 0), we continue as f(xk+1) (43),(6) γkf + (1 γk)f(xk) + γ2 k βL αµ f(xk) f , and choosing γk := αµ 2βL we establish the exponential rate. C. Adaptive Search In this section, we briefly present adaptive versions of Algorithms 1 and 2 which do not require the knowledge of the constant M = βL and can automatically tune it in iterations yet preserving the original worst-case efficiency estimates. This is achieved by using a standard backtracking line search which can be found, e.g., in (Nesterov, 2013). In what follows, for any x, y dom ψ, M > 0 and P = P 0, we define the following predicate: Quad Growth M,P (x, y): f(y) f(x) + f(x), y x + M 2 y x 2 P 1. According to our assumptions (6) and (7), we know that this predicate is surely satisfied for any pair of points once M βL. The adaptive version of Algorithm 1 is presented in Algorithm 4. This method starts with a certain initial guess M0 for the constant βL and then, at every iteration, repeatedly increases the current guess in two times until the predicate becomes satisfied. This process is guaranteed to terminate (when Mk becomes bigger or equal to βL, or even sooner). After that, we accept the new point xk+1 and choose a new optimistic guess of the constant M for the next iteration by halving the value of Mk that we have accepted at the current iteration. We assume that the preconditioner P is sufficiently simple so that we can efficiently check the predicate Quad Growth Mk,P (xk, xk+1). For example, if ψ = 0, then xk+1 = xk 1 Mk P f(xk) and Mk xk+1 xk 2 P 1 = f(xk), xk xk+1 can be efficiently computed. For Algorithm 4, we can prove exactly the same rates as in Theorem 3.1 (up to absolute constants) provided that M0 βL. (44) The proof is essentially the same as in Appendix B.1 with only two minor differences: 1) inequality (23) is now guaranteed by our predicate; 2) instead of using M = βL in (26), we should use the bound Mk 2βL which follows from (44) and Polynomial Preconditioning for Gradient Methods Algorithm 4 Adaptive Preconditioned GM Input: x0 dom ψ, P = P 0, M0 > 0. for k = 0, 1, . . . do Find smallest integer ik 0 such that xk+1 = Grad Step Mk,P xk, f(xk) , Mk = 2ik Mk satisfies the predicate Quad Growth Mk,P (xk, xk+1). Set Mk+1 = Mk/2. end for the fact that any value of M βL is always acceptable in the line search. Using a classical argument from (Nesterov, 2013), it is not difficult to show that, on average, Algorithm 4 makes only 2 gradient steps at each iteration. In contrast to an upper estimate of the constant βL, an initial guess satisfying (44) can be easily generated. One simple recipe is to make a trial step x 1 = Grad Step M 0,P x0, f(x0) for an arbitrarily chosen M 0 > 0 and then compute M0 = f(x 1) f(x0) f(x0), x 1 x0 1 2 x 1 x0 2 P 1 . 
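With any such initial guess M0, the main loop of Algorithm 4 in the case ψ = 0 reduces to a few scalar comparisons per trial step, since the QuadGrowth predicate simplifies as noted above. A minimal sketch (assuming callables f and grad_f; all names are ours):

```python
import numpy as np

def adaptive_preconditioned_gm(f, grad_f, x0, P, M0, num_iters=1000):
    """Sketch of Algorithm 4 for psi = 0.  The trial point is
    x_plus = x - (1/M) P g, so QuadGrowth_{M,P}(x, x_plus) reduces to
    f(x_plus) <= f(x) - (1/(2M)) <g, P g>."""
    x, M = x0.copy(), M0
    for _ in range(num_iters):
        g = grad_f(x)
        Pg = P @ g
        gPg = g @ Pg
        fx = f(x)
        while True:
            x_plus = x - Pg / M
            if f(x_plus) <= fx - gPg / (2.0 * M):
                break
            M *= 2.0                 # increase the guess until QuadGrowth holds
        x = x_plus
        M /= 2.0                     # optimistic guess for the next iteration
    return x
```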
Alternatively, we can find a suitable M0 be choosing an arbitrary M 0 > 0 and then repeatedly halving it until the predicate Quad Growth(x0, x 1(M)) stops being satisfied for x 1(M) = Grad Step M,P (x0, f(x0)). This auxiliary procedure either terminates in a logarithmic number of steps, in which case we get a suitable M0, or, otherwise, we quickly find an approximate solution of our problem. Similar technique can be applied for the Fast Gradient Method. Specifically, let us introduce an auxiliary procedure shown in Algorithm 5 for computing one iteration of Algorithm 2 for a given value of M. Then, the adaptive FGM method can be constructed as shown in Algorithm 6. As in the basic method, we can show that the rates from Theorem 3.2 still remain valid (up to absolute constants) for Algorithm 6, provided that M0 satisfies (44). For generating the initial guess M0, we can use exactly the same techniques as before. Algorithm 5 (x+, v+, A+; y) = Fast Grad Step M,ρ,P (x, v, A) Require: M > 0; ρ 0; P = P 0; x, v dom ψ; A > 0. Find a+ from eq. Ma2 + A+a+ = 1 + ρ(A + a+). Set A+ = A + a+, H = 1+ρA+ a+ , θ = a+ H , γ = ω(1 θ) 1 ωθ . Set ˆv = (1 γ)v + γx, y = (1 θ)x + θˆv. Compute v+ = Grad Step H,P ˆv, f(y) . Set x+ = (1 θ)x + θv+. Return (x+, v+, A+; y). Algorithm 6 Adaptive Preconditioned FGM Input: x0 dom ψ, P = P 0, M0 > 0. Set v0 = x0, A0 = 0. for k = 0, 1, . . . do Find smallest integer ik 0 such that (xk+1, vk+1, Ak+1; yk) = Fast Grad Step Mk,P xk, vk, Ak , Mk = 2ik Mk satisfies the predicate Quad Growth Mk,P (yk, xk+1). Set Mk+1 = Mk/2. end for Polynomial Preconditioning for Gradient Methods D. Auxiliary Results D.1. Elementary Symmetric Polynomials The elementary symmetric polynomial in variables x Rn of degree k (integer, 1 k n) is defined as 1 i1<... n. Thus, σk(x) is defined for any any x Rn with n 0 and any integer k. The following three properties are obvious from the definition. Observation D.1 (symmetry). For any x Rn with n 0, any integer k, and any permutation π := (π1, . . . , πn) of indices {1, . . . , n}, we have σk(x) = σk(xπ), where xπ := (xπ1, . . . , xπn) Rn is the vector obtained from x by rearranging11 its components according to π. Observation D.2. For any x Rn + with n 0 and any integer k, we have σk(x) 0. Observation D.3. For any x Rn + with at least 1 k n strictly positive elements, we have σk(x) > 0. Let us now establish a number of other properties that will be useful in our analysis. In what follows, given a vector x with n 1 elements (indexed by 1, . . . , n), and an index 1 i n, we use the notation x i to denote the (n 1)-dimensional vector obtained from x by removing its ith element. More generally, for a set of indices I {1, . . . , n}, we denote by x I the (n |I|)-dimensional vector obtained from x by removing the elements with indices from I. Also, for a vector x with n 0 elements, we use the notation x 0 to express the fact that each element of x is nonnegative. For an empty vector x (i.e., when n = 0), this inequality is always assumed to be satisfied (vacuously). We start with a simple but very useful recursive decomposition. Lemma D.4. For any x Rn, any index 1 i n, and any integer k, we have σk(x) = xiσk 1(x i) + σk(x i). Proof. In view of Observation D.1, it suffices to prove the identity only for i = 1. If k < 0 or k > n, then σk(x) = σk 1(x 1) = σk(x 1) = 0, and we get the identity 0 = 0 which is indeed valid. If k = 0, then σk 1(x 1) = 0, while σk(x) = σk(x 1) = 1, and we get the identity 1 = 1 which is also valid. 
We start with a simple but very useful recursive decomposition.

Lemma D.4. For any x ∈ R^n, any index 1 ≤ i ≤ n, and any integer k, we have σ_k(x) = x_i σ_{k−1}(x_{−i}) + σ_k(x_{−i}).

Proof. In view of Observation D.1, it suffices to prove the identity only for i = 1. If k < 0 or k > n, then σ_k(x) = σ_{k−1}(x_{−1}) = σ_k(x_{−1}) = 0, and we get the identity 0 = 0, which is indeed valid. If k = 0, then σ_{k−1}(x_{−1}) = 0, while σ_k(x) = σ_k(x_{−1}) = 1, and we get the identity 1 = 1, which is also valid. If k = 1, then σ_1(x) = Σ_{j=1}^n x_j = x_1 + Σ_{j=2}^n x_j = x_1 + σ_1(x_{−1}), so the claim is valid since σ_0(x_{−1}) = 1 by definition. Finally, in the general case when 2 ≤ k ≤ n, we split the sum defining σ_k(x) according to whether or not i_1 = 1:

σ_k(x) = Σ_{1 ≤ i_1 < ⋯ < i_k ≤ n} x_{i_1} ⋯ x_{i_k} = x_1 Σ_{2 ≤ i_2 < ⋯ < i_k ≤ n} x_{i_2} ⋯ x_{i_k} + Σ_{2 ≤ i_1 < ⋯ < i_k ≤ n} x_{i_1} ⋯ x_{i_k} = x_1 σ_{k−1}(x_{−1}) + σ_k(x_{−1}).

Lemma D.5. For any x ∈ R^n_+ with n ≥ 0 and any integer k, we have σ_k(x)² ≥ σ_{k−1}(x)σ_{k+1}(x).

Proof. If k ≤ 0 or k ≥ n, then σ_{k−1}(x)σ_{k+1}(x) = 0 ≤ σ_k(x)², so we can assume that 1 ≤ k ≤ n − 1 (in particular, n ≥ 2). We prove the statement by induction on n, the case of vectors with at most one element being covered by the trivial cases above. Since both sides of the inequality are polynomials in x, by continuity we can assume w.l.o.g. that all the components of x are strictly positive. For brevity, denote σ_{k'} := σ_{k'}(x_{−1}) for k' ∈ {k−2, . . . , k+1}; according to Lemma D.4, σ_{k'}(x) = x_1σ_{k'−1} + σ_{k'}. Note that σ_k > 0 (see Observation D.3 and recall that 1 ≤ k ≤ n − 1 by our assumption). The inequality we need to prove is then

(x_1σ_{k−1} + σ_k)² ≥ (x_1σ_{k−2} + σ_{k−1})(x_1σ_k + σ_{k+1}).

After the expansion of both sides, the above inequality becomes

x_1²σ_{k−1}² + 2x_1σ_{k−1}σ_k + σ_k² ≥ x_1²σ_{k−2}σ_k + x_1(σ_{k−2}σ_{k+1} + σ_{k−1}σ_k) + σ_{k−1}σ_{k+1}.

Making cancellations and rearranging, we come to the following inequality we need to prove:

x_1²(σ_{k−1}² − σ_{k−2}σ_k) + x_1(σ_{k−1}σ_k − σ_{k−2}σ_{k+1}) + (σ_k² − σ_{k−1}σ_{k+1}) ≥ 0.

Since x_1 ≥ 0, it suffices to prove the following three inequalities: σ_{k−1}² ≥ σ_{k−2}σ_k, σ_{k−1}σ_k ≥ σ_{k−2}σ_{k+1}, and σ_k² ≥ σ_{k−1}σ_{k+1}. But this is simple: the first and the third ones are valid in view of our inductive assumption, while the second one follows from the other two (indeed, σ_{k−1}σ_k² ≥ σ_{k−1}²σ_{k+1} ≥ σ_{k−2}σ_kσ_{k+1}, and it remains to cancel σ_k, which is assumed to be positive).

Lemma D.7. For any x ∈ R^n, any index 1 ≤ i ≤ n such that x_{−i} ≥ 0, and any integer k, we have σ_k(x)σ_k(x_{−i}) ≥ σ_{k+1}(x)σ_{k−1}(x_{−i}).

Proof. Let us denote for brevity σ_{k'} := σ_{k'}(x_{−i}) for any k' ∈ {k−1, k, k+1}. Then, according to Lemma D.4, σ_k(x) = x_iσ_{k−1} + σ_k and σ_{k+1}(x) = x_iσ_k + σ_{k+1}. The inequality we need to justify is then (x_iσ_{k−1} + σ_k)σ_k ≥ (x_iσ_k + σ_{k+1})σ_{k−1}, or, equivalently, σ_k² ≥ σ_{k−1}σ_{k+1}. But this is indeed true according to Lemma D.5 (and our assumption that x_{−i} ≥ 0).

Lemma D.8. For any x ∈ R^n_+ with n ≥ 0 and any integer 0 ≤ k ≤ n, we have σ_{k+1}(x) ≤ s_{−k}(x)σ_k(x), where s_{−k}(x) is the sum of all but the k largest elements of x (i.e., s_{−k}(x) = Σ_{i=k+1}^n x_{[i]}, where x_{[1]} ≥ . . . ≥ x_{[n]} are the components of x sorted in nonincreasing order, and s_{−k}(x) := 0 for the empty vector x).

Proof. We can assume that n ≥ 1 since otherwise k = n = 0, the vector x is empty, σ_{k+1}(x) = s_{−k}(x) = 0, and we get the trivial inequality 0 ≤ 0. Further, in view of Observation D.1, we can assume w.l.o.g. that x_1 ≥ . . . ≥ x_n. Then, we need to prove that σ_{k+1}(x) ≤ (Σ_{i=k+1}^n x_i) σ_k(x). Since both sides of the above inequality are continuous in x (as certain polynomials), we can assume w.l.o.g. that all the components of x are strictly positive. Applying Lemma D.7 repeatedly, we obtain

σ_{k+1}(x)/σ_k(x) ≤ σ_k(x_{−1})/σ_{k−1}(x_{−1}) ≤ σ_{k−1}(x_{−{1,2}})/σ_{k−2}(x_{−{1,2}}) ≤ . . . ≤ σ_1(x_{−{1,...,k}})/σ_0(x_{−{1,...,k}}) = Σ_{i=k+1}^n x_i,

where the final identity follows from the definitions of σ_1 and σ_0. (Note that each denominator in the above display is strictly positive, see Observation D.3, and recall that, by our assumptions, x has strictly positive components and 0 ≤ k ≤ n.) This is exactly the desired inequality.
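As an informal numerical sanity check (ours, not part of the paper), the snippet below verifies Lemmas D.5 and D.8 on a random nonnegative vector, reading the values σ_k(x) off the coefficients of ∏_i (t − x_i).

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.random(8)                                # random nonnegative entries
n = len(x)

# prod_i (t - x_i) = sum_k (-1)^k sigma_k(x) t^(n-k), so sigma_k = (-1)^k * (k-th coefficient).
coeffs = np.poly(x)
sigma = [(-1) ** k * coeffs[k] for k in range(n + 1)]

# Lemma D.5: sigma_k(x)^2 >= sigma_{k-1}(x) * sigma_{k+1}(x) for x >= 0.
assert all(sigma[k] ** 2 >= sigma[k - 1] * sigma[k + 1] - 1e-12 for k in range(1, n))

# Lemma D.8: sigma_{k+1}(x) <= s_{-k}(x) * sigma_k(x), with s_{-k}(x) the sum of all but the k largest entries.
x_sorted = np.sort(x)[::-1]
for k in range(n):
    assert sigma[k + 1] <= x_sorted[k:].sum() * sigma[k] + 1e-12
```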
Lemma D.9. For any x ∈ R^n, any indices 1 ≤ i, j ≤ n, any integer k, and any a, b ∈ R, we have the implication

x_i ≥ x_j and aσ_k(x_{−{i,j}}) & bσ_{k−1}(x_{−{i,j}})  ⟹  (ax_i + b)σ_k(x_{−i}) & (ax_j + b)σ_k(x_{−j}),

where & is either ≥ or ≤. Furthermore, if x_i > x_j, then the reverse implication is also true.

Proof. We can assume that i ≠ j since otherwise the claim is trivial. In particular, we can assume that n ≥ 2. According to Lemma D.4, we can decompose σ_k(x_{−i}) = x_jσ_{k−1} + σ_k and σ_k(x_{−j}) = x_iσ_{k−1} + σ_k, where σ_{k'} := σ_{k'}(x_{−{i,j}}) for any k' ∈ {k−1, k}. The inequality after the implication sign is then

(ax_i + b)(x_jσ_{k−1} + σ_k) & (ax_j + b)(x_iσ_{k−1} + σ_k).

After the expansion of both sides, this inequality reads

x_ix_j aσ_{k−1} + x_i aσ_k + x_j bσ_{k−1} + bσ_k & x_ix_j aσ_{k−1} + x_i bσ_{k−1} + x_j aσ_k + bσ_k.

Making cancellations and rearranging, we obtain the following equivalent inequality:

(x_i − x_j) aσ_k & (x_i − x_j) bσ_{k−1}.

This inequality is obviously true if x_i ≥ x_j and aσ_k & bσ_{k−1}. On the other hand, if x_i > x_j, we can cancel the factor (x_i − x_j) on both sides and conclude that aσ_k & bσ_{k−1}.

Lemma D.10. For any x ∈ R^n, any indices 1 ≤ i, j ≤ n such that x_{−{i,j}} ≥ 0, and any integer k, we have the implication x_i ≥ x_j ⟹ x_iσ_k(x_{−i}) ≥ x_jσ_k(x_{−j}).

Proof. Follows from Lemma D.9 (applied to a = 1 and b = 0) since σ_k(x_{−{i,j}}) ≥ 0 in view of our assumption that x_{−{i,j}} ≥ 0 (see Observation D.2).

Lemma D.11. For any x ∈ R^n, any indices 1 ≤ i, j ≤ n such that x_{−{i,j}} ≥ 0, and any integer k, we have the implication x_i ≥ x_j ⟹ σ_k(x_{−i})σ_{k+1}(x_{−j}) ≥ σ_{k+1}(x_{−i})σ_k(x_{−j}).

Proof. For brevity, let us denote σ_{k'} := σ_{k'}(x_{−{i,j}}) for any k' ∈ {k−1, k, k+1}. According to Lemma D.5 and our assumption that x_{−{i,j}} ≥ 0, we have σ_k² ≥ σ_{k−1}σ_{k+1}. Since we also assume that x_i ≥ x_j, we can therefore apply Lemma D.9 with a = σ_k and b = σ_{k+1} to get

(x_iσ_k + σ_{k+1})σ_k(x_{−i}) ≥ (x_jσ_k + σ_{k+1})σ_k(x_{−j}).

This is exactly the desired inequality since, according to Lemma D.4, σ_{k+1}(x_{−i}) = x_jσ_k + σ_{k+1} and σ_{k+1}(x_{−j}) = x_iσ_k + σ_{k+1}.

Lemma D.12. For any x ∈ R^n, any indices 1 ≤ i, j ≤ n with x_{−{i,j}} ≥ 0, and any integer¹² 1 ≤ k ≤ dim(x_{−{i,j}}) + 1, the following implication holds:

x_i ≥ x_j  ⟹  σ_k(x_{−i}) / ( x_i + s_{−(k−1)}(x_{−{i,j}}) ) ≤ σ_k(x_{−j}) / ( x_j + s_{−(k−1)}(x_{−{i,j}}) ),

where s_{−(k−1)}(x_{−{i,j}}) is the sum of all but the k − 1 largest elements of the vector x_{−{i,j}}.

Proof. Follows from Lemma D.9 applied to a = 1 and b = s_{−(k−1)}(x_{−{i,j}}) since σ_k(x_{−{i,j}}) ≤ bσ_{k−1}(x_{−{i,j}}) according to Lemma D.8 (applied to k' = k − 1 and x' = x_{−{i,j}}; note that dim(x') = n − 2 ≥ 0 and 0 ≤ k' ≤ dim(x') by our assumption).

¹² Here, dim(x) denotes the number of elements in the vector x.
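In the same informal spirit, the following snippet (ours) checks the comparison inequalities of Lemmas D.10 and D.11 on a random nonnegative vector, taking i as the index of the largest entry and j as the index of the smallest one, so that x_i ≥ x_j holds.

```python
import numpy as np

def elem_sym_all(v):
    """sigma_k(v) for k = 0..len(v), read off the coefficients of prod_i (t - v_i)."""
    c = np.poly(v)
    return [(-1) ** k * c[k] for k in range(len(v) + 1)]

rng = np.random.default_rng(1)
x = rng.random(7)                                 # nonnegative entries, so x_{-{i,j}} >= 0
i, j = int(np.argmax(x)), int(np.argmin(x))       # guarantees x_i >= x_j
s_i = elem_sym_all(np.delete(x, i))               # sigma_k(x_{-i})
s_j = elem_sym_all(np.delete(x, j))               # sigma_k(x_{-j})

for k in range(len(x) - 1):
    # Lemma D.10: x_i * sigma_k(x_{-i}) >= x_j * sigma_k(x_{-j})
    assert x[i] * s_i[k] >= x[j] * s_j[k] - 1e-12
    # Lemma D.11: sigma_k(x_{-i}) * sigma_{k+1}(x_{-j}) >= sigma_{k+1}(x_{-i}) * sigma_k(x_{-j})
    assert s_i[k] * s_j[k + 1] >= s_i[k + 1] * s_j[k] - 1e-12
```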