Regularized Nonlinear Acceleration

Damien Scieur, INRIA & D.I., UMR 8548, École Normale Supérieure, Paris, France. damien.scieur@inria.fr
Alexandre d'Aspremont, CNRS & D.I., UMR 8548, École Normale Supérieure, Paris, France. aspremon@di.ens.fr
Francis Bach, INRIA & D.I., UMR 8548, École Normale Supérieure, Paris, France. francis.bach@inria.fr

Abstract. We describe a convergence acceleration technique for generic optimization problems. Our scheme computes estimates of the optimum from a nonlinear average of the iterates produced by any optimization method. The weights in this average are computed via a simple, small linear system whose solution can be updated online. The acceleration scheme runs in parallel to the base algorithm, providing improved estimates of the solution on the fly, while the original optimization method is running. Numerical experiments are detailed on classical classification problems.

1 Introduction

Suppose we want to solve the optimization problem
$$\min_{x \in \mathbb{R}^n} f(x) \qquad (1)$$
in the variable $x \in \mathbb{R}^n$, where $f$ is strongly convex with parameter $\mu$ with respect to the Euclidean norm, and has a Lipschitz continuous gradient with parameter $L$ with respect to the same norm. This class of functions is often encountered, for example in regression, where $f$ has the form $f(x) = \mathcal{L}(x) + \Omega(x)$, with $\mathcal{L}(x)$ a smooth convex loss function and $\Omega(x)$ a smooth strongly convex penalty function. Assume we solve this problem using an iterative algorithm of the form
$$x_{i+1} = g(x_i), \quad i = 1, \dots, k, \qquad (2)$$
where $x_i \in \mathbb{R}^n$ and $k$ is the number of iterations. Here, we focus on estimating the solution of (1) by tracking only the sequence of iterates $x_i$ produced by an optimization algorithm. This leads to accelerated convergence, since we can extrapolate more accurate solutions without any calls to the oracle $g(x)$.
Since the publication of Nesterov's optimal first-order smooth convex minimization algorithm [1], a significant effort has focused on either providing more explicit, interpretable views of current acceleration techniques, or on replicating their complexity gains using different, more intuitive schemes. Early efforts were devoted to directly extending the original acceleration result in [1] to broader function classes [2], allowing for generic metrics, line searches or simpler proofs [5, 6], producing adaptive accelerated algorithms [7], etc. More recently however, several authors [8, 9] have started using classical results from control theory to obtain numerical bounds on convergence rates that match the optimal rates. Others have studied the second-order ODEs obtained as the small-step-size limit of classical accelerated schemes, to better understand their convergence [10, 11]. Finally, recent results have also shown how to wrap classical algorithms in an outer optimization loop to accelerate convergence [12] and reach optimal complexity bounds.

30th Conference on Neural Information Processing Systems (NIPS 2016), Barcelona, Spain.

Here, we take a significantly different approach to convergence acceleration, stemming from classical results in numerical analysis. We use the iterates produced by any (converging) optimization algorithm and estimate the solution directly from this sequence, assuming only some regularity conditions on the function to minimize. Our scheme is based on the idea behind Aitken's $\Delta^2$ algorithm [13], generalized as the Shanks transform [14], whose recursive formulation is known as the ε-algorithm [15] (see e.g. [16, 17] for a survey). In a nutshell, these methods fit geometric models to linearly converging sequences, then extrapolate their limit from the fitted model. In a sense, this approach is more statistical in nature.
It assumes that an approximately linear model holds for iterates near the optimum, and estimates this model using the iterates. In fact, Wynn's algorithm [15] is directly connected to the Levinson-Durbin algorithm [18, 19] used to solve Toeplitz systems recursively and fit autoregressive models (the Shanks transform solves Hankel systems, but this is essentially the same problem [20]). The key difference here is that estimating the autocovariance operator is not required, as we only focus on the limit. Moreover, the method has strong links with the conjugate gradient method when applied to unconstrained quadratic optimization. We start from a slightly different formulation of these techniques, known as minimal polynomial extrapolation (MPE) [17, 21], which uses the minimal polynomial of the linear operator driving the iterations to estimate the optimum by nonlinear averaging (i.e., using weights in the average that are nonlinear functions of the iterates). So far, for all the techniques cited above, no proof of convergence of these estimates was given for the case where the iterates make the estimation process unstable. Our contribution here is to add a regularization term, in order to produce explicit bounds on the distance to optimality by controlling stability through the regularization parameter, thus explicitly quantifying the acceleration provided by these techniques. We show in several numerical examples that these stabilized estimates often speed up convergence by an order of magnitude. Furthermore, this acceleration scheme runs in parallel to the original algorithm, providing improved estimates of the solution on the fly, while the original method is progressing. The paper is organized as follows. In Section 2.1 we recall basic results behind MPE for linear iterations; in Section 2.2 we introduce an approximate version of MPE and link it with the conjugate gradient method.
Then, in Section 2.3, we generalize these results to generic nonlinear iterations and show, in Section 2.4, how to fully control the impact of nonlinearity. We use these results to derive explicit bounds on the acceleration performance of our estimates.

2 Approximate Minimal Polynomial Extrapolation

In what follows, we recall the key arguments behind minimal polynomial extrapolation (MPE) as derived in [22] and [21]. We also describe a variant, called approximate minimal polynomial extrapolation (AMPE), which allows us to control the number of iterates used in the extrapolation, hence reducing its computational complexity. We begin with a simple description of the method for linear iterations, then extend these results to the generic nonlinear case. Finally, we fully characterize the acceleration factor provided by a regularized version of AMPE, using regularity properties of the function $f(x)$ and the result of a Chebyshev-like, tractable polynomial optimization problem.

2.1 Linear Iterations

Here, we assume that the iterative algorithm in (2) is in fact linear, with
$$x_i = A(x_{i-1} - x^*) + x^*, \qquad (3)$$
where $A \in \mathbb{R}^{n \times n}$ (not necessarily symmetric) and $x^* \in \mathbb{R}^n$. We assume that 1 is not an eigenvalue of $A$, implying that (3) admits a unique fixed point $x^*$. Moreover, if we assume that $\|A\|_2 < 1$, then $x_k$ converges to $x^*$ at the rate
$$\|x_k - x^*\|_2 \leq \|A\|_2^k \, \|x_0 - x^*\|_2.$$
We now recall the minimal polynomial extrapolation (MPE) method as described in [21], starting with the following definition.

Definition 2.1. Given $A \in \mathbb{R}^{n \times n}$ such that 1 is not an eigenvalue of $A$, and $v \in \mathbb{R}^n$, the minimal polynomial of $A$ with respect to the vector $v$ is the lowest-degree polynomial $p(x)$ such that
$$p(A)v = 0, \quad p(1) = 1.$$

Note that the degree of $p(x)$ is always at most $n$, and the condition $p(1) = 1$ makes $p$ unique. Notice that, because we assumed that 1 is not an eigenvalue of $A$, requiring $p(1) = 1$ is not restrictive, since we can normalize each minimal polynomial by the sum of its coefficients (see Lemma A.1 in the supplementary material).
Given an initial iterate $x_0$, MPE starts by forming a matrix $U$ whose columns are the increments $x_{i+1} - x_i$, with
$$u_i = x_{i+1} - x_i = (A - I)(x_i - x^*) = (A - I)A^i(x_0 - x^*). \qquad (4)$$
Now, let $p$ be the minimal polynomial of $A$ with respect to the vector $u_0$ (where $p$ has coefficients $c_i$ and degree $d$), and $U = [u_0, u_1, \dots, u_d]$. Then
$$\sum_{i=0}^d c_i u_i = \sum_{i=0}^d c_i A^i u_0 = p(A)u_0 = 0, \qquad p(1) = \sum_{i=0}^d c_i = 1. \qquad (5)$$
We can thus solve the system $Uc = 0$, $\sum_i c_i = 1$ to find $p$. In this case, the fixed point $x^*$ can be computed exactly as follows:
$$0 = \sum_{i=0}^d c_i A^i u_0 = \sum_{i=0}^d c_i A^i (A - I)(x_0 - x^*) = (A - I)\sum_{i=0}^d c_i A^i (x_0 - x^*) = (A - I)\sum_{i=0}^d c_i (x_i - x^*).$$
Hence, using the fact that 1 is not an eigenvalue of $A$ and $p(1) = 1$,
$$(A - I)\sum_{i=0}^d c_i (x_i - x^*) = 0 \;\Longrightarrow\; \sum_{i=0}^d c_i (x_i - x^*) = 0 \;\Longrightarrow\; \sum_{i=0}^d c_i x_i = x^*.$$
This means that $x^*$ is obtained by averaging iterates using the coefficients in $c$. The averaging here is called nonlinear, since the coefficients in $c$ vary with the iterates themselves.

2.2 Approximate Minimal Polynomial Extrapolation (AMPE)

Suppose now that we only compute a fraction of the iterates $x_i$ used in the MPE procedure. While the number of iterates $k$ might be smaller than the degree of the minimal polynomial of $A$ with respect to $u_0$, we can still try to make the quantity $p_k(A)u_0$ small, where $p_k(x)$ is now a polynomial of degree at most $k$. The corresponding difference matrix $U = [u_0, u_1, \dots, u_k] \in \mathbb{R}^{n \times (k+1)}$ is rectangular. This is also known as the Eddy-Mešina method [3, 4], or reduced rank extrapolation with arbitrary $k$ (see [21, 10]). The objective here is similar to (5), but the system is now overdetermined because $k < \deg(p)$. We thus choose $c$ to make $\|Uc\|_2 = \|p(A)u_0\|_2$ as small as possible, for some polynomial $p$ such that $p(1) = 1$, which means solving
$$c^* = \operatorname*{argmin}_{1^T c = 1} \|Uc\|_2 \qquad \text{(AMPE)}$$
in the variable $c \in \mathbb{R}^{k+1}$. The optimal value $\|Uc^*\|_2$ of this problem decreases with $k$, satisfies $\|Uc^*\|_2 = 0$ when $k$ exceeds the degree of the minimal polynomial, and controls the approximation error in $x^*$ through equation (4).
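As a concrete check of the derivation above, the following NumPy sketch (my own illustration, not the authors' code) runs a linear iteration, solves $Uc = 0$ with $1^T c = 1$ by stacking the constraint onto the least-squares system, and recovers the fixed point:

```python
import numpy as np

# Build a symmetric linear iteration x_{i+1} = A(x_i - x*) + x* with
# spectrum inside (0, 1), so 1 is not an eigenvalue of A.
rng = np.random.default_rng(0)
n = 5
Q = np.linalg.qr(rng.standard_normal((n, n)))[0]
A = Q @ np.diag(rng.uniform(0.1, 0.9, n)) @ Q.T
x_star = rng.standard_normal(n)

# Run n+1 iterations, enough to reach the degree of the minimal polynomial.
xs = [rng.standard_normal(n)]
for _ in range(n + 1):
    xs.append(A @ (xs[-1] - x_star) + x_star)
U = np.column_stack([xs[i + 1] - xs[i] for i in range(n + 1)])  # n x (n+1)

# Solve U c = 0 subject to 1^T c = 1 via the stacked least-squares system.
M = np.vstack([U, np.ones((1, n + 1))])
rhs = np.concatenate([np.zeros(n), [1.0]])
c = np.linalg.lstsq(M, rhs, rcond=None)[0]

x_extrapolated = np.column_stack(xs[:n + 1]) @ c
print(np.linalg.norm(x_extrapolated - x_star))  # small on this toy example
```

Because the iteration is exactly linear and enough iterates are used, the nonlinear average $\sum_i c_i x_i$ recovers $x^*$ up to numerical error.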
Setting $u_i = (A - I)(x_i - x^*)$, we have
$$\Big\|\sum_{i=0}^k c_i^* x_i - x^*\Big\|_2 = \Big\|(I - A)^{-1}\sum_{i=0}^k c_i^* u_i\Big\|_2 \leq \|(I - A)^{-1}\|_2 \, \|Uc^*\|_2.$$
We can get a crude bound on $\|Uc^*\|_2$ from Chebyshev polynomials, using only an assumption on the range of the spectrum of the matrix $A$. Assume $A$ symmetric, $0 \preceq A \preceq \sigma I \prec I$ and $\deg(p) \leq k$. Indeed,
$$\|Uc^*\|_2 = \|p^*(A)u_0\|_2 \leq \|u_0\|_2 \min_{p:\,p(1)=1} \|p(A)\|_2 \leq \|u_0\|_2 \min_{p:\,p(1)=1}\;\max_{A:\,0 \preceq A \preceq \sigma I} \|p(A)\|_2, \qquad (6)$$
where $p^*$ is the polynomial with coefficients $c^*$. Since $A$ is symmetric, we have $A = Q\Lambda Q^T$ where $Q$ is unitary. We can thus simplify the objective function:
$$\max_{A:\,0 \preceq A \preceq \sigma I} \|p(A)\|_2 = \max_{\Lambda:\,0 \preceq \Lambda \preceq \sigma I} \|p(\Lambda)\|_2 = \max_{\Lambda:\,0 \preceq \Lambda \preceq \sigma I}\;\max_i |p(\lambda_i)| = \max_{0 \leq \lambda \leq \sigma} |p(\lambda)|.$$
We thus have
$$\|Uc^*\|_2 \leq \|u_0\|_2 \min_{p:\,p(1)=1}\;\max_{0 \leq \lambda \leq \sigma} |p(\lambda)|,$$
so we must find a polynomial that takes small values on the interval $[0, \sigma]$. Chebyshev polynomials are precisely the polynomials whose maximal value on the interval $[-1, 1]$ is smallest. Let $C_k$ be the Chebyshev polynomial of degree $k$. By definition, $C_k(x)$ is the monic polynomial¹ that solves
$$C_k(x) = \operatorname*{argmin}_{p \text{ monic}}\;\max_{x \in [-1, 1]} |p(x)|.$$
We can thus use a variant of $C_k(x)$ to solve the minimax problem
$$\min_{p:\,p(1)=1}\;\max_{0 \leq \lambda \leq \sigma} |p(\lambda)|. \qquad (7)$$
The solution of this problem is given in [23] and admits an explicit formulation:
$$T_k(x) = \frac{C_k(t(x))}{C_k(t(1))}, \qquad t(x) = \frac{2x - \sigma}{\sigma}.$$
Note that $t(x)$ is simply the linear mapping from the interval $[0, \sigma]$ to $[-1, 1]$. Moreover,
$$\min_{p:\,p(1)=1}\;\max_{0 \leq \lambda \leq \sigma} |p(\lambda)| = \max_{0 \leq \lambda \leq \sigma} |T_k(\lambda)| = |T_k(\sigma)| = \frac{2\zeta^k}{1 + \zeta^{2k}}, \qquad (8)$$
where
$$\zeta = \frac{1 - \sqrt{1 - \sigma}}{1 + \sqrt{1 - \sigma}} < \sigma. \qquad (9)$$
Since $\|u_0\|_2 = \|(A - I)(x_0 - x^*)\|_2 \leq \|A - I\|_2 \|x_0 - x^*\|_2$, we can bound (6) by
$$\|Uc^*\|_2 \leq \|u_0\|_2 \min_{p:\,p(1)=1}\;\max_{0 \leq \lambda \leq \sigma} |p(\lambda)| \leq \|A - I\|_2 \, \frac{2\zeta^k}{1 + \zeta^{2k}} \, \|x_0 - x^*\|_2.$$
This leads to the following proposition.

Proposition 2.2. Let $A$ be symmetric with $0 \preceq A \preceq \sigma I \prec I$, and let $c^*$ be the solution of (AMPE). Then
$$\Big\|\sum_{i=0}^k c_i^* x_i - x^*\Big\|_2 \leq \kappa(A - I)\, \frac{2\zeta^k}{1 + \zeta^{2k}}\, \|x_0 - x^*\|_2, \qquad (10)$$
where $\kappa(A - I)$ is the condition number of the matrix $A - I$ and $\zeta$ is defined in (9).

Note that, when solving quadratic optimization problems, the rate in this bound matches that obtained using the optimal method in [6].
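To make the gap between the plain contraction rate $\sigma^k$ and the Chebyshev rate $2\zeta^k/(1+\zeta^{2k})$ concrete, here is a small numerical check (my own sketch, with $\sigma = 0.99$ an assumed spectral bound):

```python
import numpy as np

# Compare the naive rate sigma^k against the Chebyshev-accelerated rate
# 2*zeta^k / (1 + zeta^(2k)) from equations (8)-(9).
sigma = 0.99  # assumed bound 0 <= A <= sigma*I < I
zeta = (1 - np.sqrt(1 - sigma)) / (1 + np.sqrt(1 - sigma))  # ~0.818 here

for k in [5, 10, 20, 40]:
    plain = sigma ** k
    cheb = 2 * zeta ** k / (1 + zeta ** (2 * k))
    print(k, plain, cheb)
```

Since $\zeta < \sigma$ always holds, the Chebyshev bound decays geometrically at a strictly better rate; for ill-conditioned problems ($\sigma$ close to 1) the improvement is dramatic, mirroring the usual $\sqrt{\kappa}$ speedup of accelerated methods.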
Also, the bound on the rate of convergence of this method is exactly that of the conjugate gradient method, up to an additional factor $\kappa(A - I)$.

Remark. This method has a strong link with the conjugate gradient method. Denote $\|v\|_B = \sqrt{v^T B v}$ the norm induced by a positive definite matrix $B$. By definition, at the $k$-th iteration the conjugate gradient computes an approximation $s$ of $x^*$ satisfying
$$s = \operatorname*{argmin}_{x \in \mathcal{K}_k} \|x - x^*\|_A,$$
where $\mathcal{K}_k = \operatorname{span}\{Ax_0, A^2 x_0, \dots, A^k x_0\}$ is called a Krylov subspace. Since $x \in \mathcal{K}_k$, $x$ is a linear combination of the elements of $\mathcal{K}_k$, so $x = \sum_{i=1}^k c_i A^i x_0 = q(A)x_0$, where $q(x)$ is a polynomial of degree $k$ with $q(0) = 0$. The conjugate gradient thus solves
$$s = \operatorname*{argmin}_{q:\,q(0)=0} \|q(A)x_0 - x^*\|_A,$$
which is very similar to equation (AMPE). However, while the conjugate gradient has access to an oracle giving the product of the matrix $A$ with any vector $v$, the AMPE procedure only uses the iterates produced by (3) (meaning that AMPE does not need to know $A$). Moreover, we analyze the convergence of AMPE in another norm ($\|\cdot\|_2$ instead of $\|\cdot\|_A$). These two differences explain why a condition number appears in the convergence rate of AMPE in (10).

¹A monic polynomial is a univariate polynomial whose leading coefficient is equal to 1.

2.3 Nonlinear Iterations

We now go back to the general case where the iterative algorithm is nonlinear, with
$$\tilde x_{i+1} = g(\tilde x_i), \quad i = 1, \dots, k, \qquad (11)$$
where $\tilde x_i \in \mathbb{R}^n$ and the function $g$ has a symmetric Jacobian at the point $x^*$. We also assume that the method has a unique fixed point, written $x^*$, and linearize these iterations around $x^*$ to get
$$\tilde x_i - x^* = A(\tilde x_{i-1} - x^*) + e_i, \qquad (12)$$
where $A$ is now the Jacobian matrix (i.e., the first derivative) of $g$ taken at the fixed point $x^*$, and $e_i \in \mathbb{R}^n$ is a second-order error term, $\|e_i\|_2 = O(\|\tilde x_{i-1} - x^*\|_2^2)$. Note that, by construction, the linear and nonlinear models share the same fixed point $x^*$.
We write $x_i$ the iterates that would be obtained using the asymptotic linear model (starting at $x_0$),
$$x_i - x^* = A(x_{i-1} - x^*).$$
Running the algorithm described in (11), we thus observe the iterates $\tilde x_i$ and build $\tilde U$ from their differences. As in (AMPE), we then compute $\tilde c$ using the matrix $\tilde U$, and finally estimate
$$x^* \approx \sum_{i=0}^k \tilde c_i \tilde x_i.$$
In this case, our estimate of $x^*$ is based on the coefficients $\tilde c$, computed using the iterates $\tilde x_i$. We now decompose the error of this estimate by comparing it with the estimate obtained from the linear model:
$$\Big\|\sum_{i=0}^k \tilde c_i \tilde x_i - x^*\Big\|_2 \leq \Big\|\sum_{i=0}^k (\tilde c_i - c_i^*)\, x_i\Big\|_2 + \Big\|\sum_{i=0}^k \tilde c_i (\tilde x_i - x_i)\Big\|_2 + \Big\|\sum_{i=0}^k c_i^* x_i - x^*\Big\|_2. \qquad (13)$$
This expression shows that the precision is comparable to that of the AMPE process in the linear case (third term), up to perturbations. Moreover, if $\|e_i\|_2$ is small, then $\|\tilde x_i - x_i\|_2$ is small as well. But we need more information about $\|\tilde c\|_2$ and $\|\tilde c - c^*\|_2$ to go further. The following proposition computes the perturbation $\Delta c = \tilde c - c^*$ of the optimal solution of (AMPE) induced by $E = \tilde U - U$. It allows us to bound the first term on the right-hand side of (13) (see proof A.2 in the Appendix). For simplicity, we write $P = \tilde U^T \tilde U - U^T U$.

Proposition 2.3. Let $c^*$ be the optimal solution to (AMPE),
$$c^* = \operatorname*{argmin}_{1^T c = 1} \|Uc\|_2,$$
for some matrix $U \in \mathbb{R}^{n \times k}$. Suppose $U$ becomes $\tilde U = U + E$, and write $c^* + \Delta c$ the perturbed solution to (AMPE). Let $\tilde M = \tilde U^T \tilde U$ and the perturbation matrix $P = \tilde U^T \tilde U - U^T U$. Then
$$\Delta c = -\Big(I - \frac{\tilde M^{-1} 1 1^T}{1^T \tilde M^{-1} 1}\Big)\tilde M^{-1} P\, c^*. \qquad (14)$$

We see here that the perturbation can be potentially large. Even when $\|c^*\|_2$ and $\|P\|_2$ are small, $\|\tilde M^{-1}\|_2$ is huge in general. It can be shown that $U^T U$ (the square of a Krylov-like matrix) has an exponentially large condition number (see [24]), because its minimal eigenvalue decays very fast. Moreover, the eigenvalues are perturbed by $P$, leading to a potentially huge perturbation $\Delta c$, especially when $\|P\|_2$ is comparable to (or larger than) $\lambda_{\min}(U^T U)$.
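The exponential ill-conditioning of $U^T U$ mentioned above is easy to observe numerically. The following sketch (my own illustration, under the stated random-matrix assumptions) builds the difference matrix of a linear iteration and tracks the condition number of its Gram matrix as the number of columns grows:

```python
import numpy as np

# Condition number of U^T U for a Krylov-type difference matrix U.
rng = np.random.default_rng(0)
n = 200
Q = np.linalg.qr(rng.standard_normal((n, n)))[0]
A = Q @ np.diag(rng.uniform(0.0, 0.95, n)) @ Q.T   # symmetric, 0 <= A < I
x_star = rng.standard_normal(n)

xs = [rng.standard_normal(n)]
for _ in range(12):
    xs.append(A @ (xs[-1] - x_star) + x_star)
U = np.diff(np.column_stack(xs), axis=1)           # n x 12 difference matrix

conds = {k: np.linalg.cond(U[:, :k].T @ U[:, :k]) for k in [2, 4, 8, 12]}
print(conds)  # grows by many orders of magnitude as k increases
```

The columns $u_i = (A - I)A^i(x_0 - x^*)$ align with the dominant eigenvectors of $A$ as $i$ grows, which is why even a dozen columns already produce a nearly singular Gram matrix and why the unregularized coefficients $\tilde c$ become unstable.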
2.4 Regularized AMPE

The condition number of the matrix $U^T U$ in problem (AMPE) can be arbitrarily large. Indeed, this condition number is related to that of Krylov matrices, which was proved in [24] to be exponential in $k$. Consequently, this conditioning problem, coupled with the nonlinear errors, leads to highly unstable solutions $\tilde c$ (which we observe in our experiments). We thus study a regularized formulation of problem (AMPE), which reads
$$\begin{array}{ll} \text{minimize} & c^T(U^T U + \lambda I)\,c \\ \text{subject to} & 1^T c = 1. \end{array} \qquad \text{(RMPE)}$$
The solution of this problem can be computed from a linear system, and the regularization parameter controls the norm of the solution, as shown in the following lemma (see proof A.3 in the Appendix).

Lemma 2.4. Let $c^*_\lambda$ be the optimal solution of problem (RMPE). Then
$$c^*_\lambda = \frac{(U^T U + \lambda I)^{-1} 1}{1^T (U^T U + \lambda I)^{-1} 1} \qquad\text{and}\qquad \|c^*_\lambda\|_2 \leq \sqrt{\frac{\lambda + \|U\|_2^2}{k\lambda}}. \qquad (15)$$

This allows us to obtain the following corollary, extending Proposition 2.3 to the regularized problem (RMPE) and showing that the perturbation of $c$ is now controlled by the regularization parameter $\lambda$.

Corollary 2.5. Let $c^*_\lambda$, defined in (15), be the solution of problem (RMPE). Then the solution of problem (RMPE) for the perturbed matrix $\tilde U = U + E$ is given by $c^*_\lambda + \Delta c_\lambda$, where
$$\Delta c_\lambda = -W \tilde M_\lambda^{-1} P\, c^*_\lambda = -\tilde M_\lambda^{-1} W^T P\, c^*_\lambda,$$
and $\|\Delta c_\lambda\|_2$ is controlled by $\|P\|_2$, where $\tilde M_\lambda = U^T U + P + \lambda I$ and $W = I - \frac{\tilde M_\lambda^{-1} 1 1^T}{1^T \tilde M_\lambda^{-1} 1}$ is a projector of rank $k - 1$.

These results lead us to the following simple algorithm.

Algorithm 1: Regularized Approximate Minimal Polynomial Extrapolation (RMPE)
Input: sequence $\{x_0, x_1, \dots, x_{k+1}\}$, parameter $\lambda > 0$
Compute $U = [x_1 - x_0, \dots, x_{k+1} - x_k]$
Solve the linear system $(U^T U + \lambda I)z = 1$
Set $c = z / (z^T 1)$
Output: $\sum_{i=0}^k c_i x_i$, the approximation of the fixed point $x^*$

The computational complexity of the algorithm (with online updates or in batch mode) is $O(nk^2)$; batch and online strategies are discussed in Appendix A.3. Note that the algorithm never calls the oracle $g(x)$.
This means that, in an optimization context, the extrapolation requires neither $f(x)$ nor $\nabla f(x)$. Moreover, it needs no a priori information on the function, such as $L$ and $\mu$ (unlike Nesterov's method).

2.5 Convergence Bounds on Regularized AMPE

To fully characterize the convergence of our estimate sequence, we still need to bound the last term on the right-hand side of (13), namely $\|\sum_{i=0}^k c_i x_i - x^*\|_2$. A coarse bound can be provided using Chebyshev polynomials; however, the norm of the Chebyshev coefficients grows exponentially with $k$. Here we refine this bound to better control the quality of our estimate. Let $g'(x^*) \preceq \sigma I$. Consider the following Chebyshev-like optimization problem, written
$$S(k, \alpha) = \min_{q \in \mathbb{R}_k[x]:\, q(1)=1} \Big\{ \max_{x \in [0, \sigma]} \big((1 - x)\,q(x)\big)^2 + \alpha \|q\|_2^2 \Big\}, \qquad (16)$$
where $\mathbb{R}_k[x]$ is the set of polynomials of degree at most $k$ and $q \in \mathbb{R}^{k+1}$ is the vector of coefficients of the polynomial $q(x)$. This problem can be solved exactly, since it reduces to a semidefinite program (see Appendix A.4 for the details of the reduction). Our main result below shows how $S(k, \alpha)$ bounds the error between our estimate of the optimum, constructed from the iterates $\tilde x_i$ in (RMPE), and the optimum $x^*$ of problem (1).

Proposition 2.6. Let $\tilde X = [\tilde x_0, \tilde x_1, \dots, \tilde x_k]$, $X = [x_0, x_1, \dots, x_k]$, $E = \tilde X - X$ and $\kappa = \|(A - I)^{-1}\|_2$. Suppose $\tilde c^*_\lambda$ solves problem (RMPE),
$$\begin{array}{ll}\text{minimize} & c^T(\tilde U^T \tilde U + \lambda I)\,c\\ \text{subject to} & 1^T c = 1,\end{array} \qquad \tilde c^*_\lambda = \frac{(\tilde U^T \tilde U + \lambda I)^{-1} 1}{1^T (\tilde U^T \tilde U + \lambda I)^{-1} 1}, \qquad (17)$$
in the variable $c \in \mathbb{R}^{k+1}$, with parameter $\tilde U \in \mathbb{R}^{n \times (k+1)}$. Assume $A$ symmetric with $0 \preceq A \prec I$. Then the estimation error $\|\tilde X \tilde c^*_\lambda - x^*\|_2$ is bounded by the sum of a perturbation term, controlled by $\|E\|_2$ and $\kappa\|P\|_2$, and a linear-model term proportional to $\kappa\, S(k, \lambda / \|x_0 - x^*\|_2^2)^{1/2}\, \|x_0 - x^*\|_2$, where $P$ is defined in Corollary 2.5 and $S(k, \alpha)$ in (16).

The quantity $S(k, \lambda/\|x_0 - x^*\|_2^2)^{1/2}$ behaves like the value $T_k(\sigma)$ in (8), so our algorithm achieves a rate similar to Chebyshev acceleration, up to a multiplicative scalar. We thus need to choose $\lambda$ so that this multiplicative scalar is not too large, while keeping $S(k, \lambda/\|x_0 - x^*\|_2^2)^{1/2}$ small.
We can analyze the behavior of this bound when we start close to the optimum. Assume
$$\|E\|_2 = O(\|x_0 - x^*\|_2^2), \qquad \|U\|_2 = O(\|x_0 - x^*\|_2), \qquad \|P\|_2 = O(\|x_0 - x^*\|_2^3).$$
This case is encountered, for instance, when minimizing a smooth, strongly convex function with Lipschitz-continuous Hessian using the fixed-step gradient method (this case is discussed in detail in Appendix A.6). Also, let $\lambda = \beta \|P\|_2$ for some $\beta > 0$, and let $\|x_0 - x^*\|_2$ be small. The perturbation terms in the bound then vanish as $\|x_0 - x^*\|_2 \to 0$, so the bound on the precision of the extrapolation is approximately
$$\|\tilde X \tilde c^*_\lambda - x^*\|_2 \lesssim \kappa\, S\Big(k, \frac{\beta \|P\|_2}{\|x_0 - x^*\|_2^2}\Big)^{1/2} \|x_0 - x^*\|_2.$$
Also, using equation (8), it is easy to see that
$$S(k, 0)^{1/2} \leq \min_{q \in \mathbb{R}_k[x]:\, q(1)=1}\;\max_{x \in [0, \sigma]} |q(x)| = \frac{2\zeta^k}{1 + \zeta^{2k}},$$
where $\zeta$ is defined in (9). So, when $\|x_0 - x^*\|_2$ is close to zero, the regularized version of AMPE converges as fast as AMPE (see equation (10)), up to a small constant.

3 Numerical Experiments

We test our method on a regularized logistic regression problem, written
$$f(w) = \sum_{i=1}^m \log\big(1 + \exp(-y_i \xi_i^T w)\big) + \frac{\tau}{2}\|w\|_2^2,$$
where $\Xi = [\xi_1, \dots, \xi_m]^T \in \mathbb{R}^{m \times n}$ is the design matrix and $y \in \{-1, 1\}^m$ is the vector of labels. We used the Madelon UCI dataset, setting $\tau = 10^2$ (in order to have a ratio $L/\tau$ equal to $10^9$). We solve this problem using several algorithms: the fixed-step gradient method for strongly convex functions [6, Th. 2.1.15] with step size $2/(L + \mu)$, where $L = \|\Xi\|_2^2/4 + \tau$ and $\mu = \tau$; the accelerated gradient method for strongly convex functions [6, Th. 2.2.3]; and our nonlinear acceleration of the gradient method iterates, using RMPE in Proposition 2.6 with restarts. This last algorithm is implemented as follows. We perform $k$ steps of the gradient method (in the numerical experiments, $k$ is typically equal to 5), then extrapolate a solution $\tilde X \tilde c^*_\lambda$, where $\tilde c^*_\lambda$ is computed by solving the RMPE problem (17) on the gradient iterates $\tilde X$, with regularization parameter $\lambda$ determined by a grid search. This extrapolation then becomes the new starting point of the gradient method.
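The restart scheme just described can be sketched in a few lines. The toy version below is a hypothetical illustration, not the authors' implementation: it swaps the logistic loss for a ridge-regularized least-squares problem, and the helper names (`grad_f`, `rmpe_point`), the grid of $\lambda$ values and the problem sizes are all my own choices:

```python
import numpy as np

def grad_f(w, A, b, tau):
    # Gradient of f(w) = 0.5*||Aw - b||^2 + 0.5*tau*||w||^2, a stand-in
    # for the logistic loss used in the paper's experiments.
    return A.T @ (A @ w - b) + tau * w

def rmpe_point(xs, lam):
    # Algorithm 1: extrapolate from the iterates x_0..x_{k+1}.
    X = np.column_stack(xs)
    U = np.diff(X, axis=1)
    z = np.linalg.solve(U.T @ U + lam * np.eye(U.shape[1]), np.ones(U.shape[1]))
    return X[:, :-1] @ (z / z.sum())

rng = np.random.default_rng(1)
m, n, tau = 50, 20, 1e-3
A, b = rng.standard_normal((m, n)), rng.standard_normal(m)
L = np.linalg.norm(A, 2) ** 2 + tau          # gradient Lipschitz constant
f = lambda w: 0.5 * np.linalg.norm(A @ w - b) ** 2 + 0.5 * tau * (w @ w)

w, k = np.zeros(n), 5
for restart in range(20):
    xs = [w]
    for _ in range(k + 1):                   # k+1 fixed-step gradient steps
        xs.append(xs[-1] - grad_f(xs[-1], A, b, tau) / L)
    # Grid search on lambda; keep whichever point has the smallest objective.
    candidates = [rmpe_point(xs, lam) for lam in (1e-12, 1e-8, 1e-4, 1.0)]
    w = min(candidates + [xs[-1]], key=f)    # restart from the best point
print(f(w))
```

Keeping the last gradient iterate among the candidates guarantees the loop never does worse than plain gradient descent, while the extrapolated points typically dominate after the first few restarts.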
We count this as one iteration of RMPE-$k$, using $k$ gradient oracle calls. We also analyze the impact of an inexact line search (described in Appendix A.7) performed after this procedure. The results are reported in Figure 1. Using very few iterates, the solutions computed from our estimate (a nonlinear average of the gradient iterates) are markedly better than those produced by the Nesterov-accelerated method. This is only partially reflected by the theoretical bound from Proposition 2.6, which shows significant speedup in some regimes but remains highly conservative (see Figure 3 in Section A.6). Also, Figure 2 shows the impact of regularization: the AMPE process becomes unstable because of the condition number of the matrix $\tilde M$, which degrades the precision of the estimate.

4 Conclusion and Perspectives

In this paper, we developed a method able to accelerate, under some regularity conditions, the convergence of a sequence $\{x_i\}$ without any knowledge of the algorithm that generates it. The regularization parameter used in the acceleration method can be computed easily using an inexact line search.

[Figure 1: two panels plotting $f(x_k) - f(x^*)$ versus gradient oracle calls (left) and CPU time in seconds (right), for the gradient method, Nesterov's method, Nesterov with backtracking, RMPE-5, and RMPE-5 with line search.] Figure 1: Solving logistic regression on the UCI Madelon dataset (500 features, 2000 data points) using the gradient method, Nesterov's accelerated method and RMPE with $k = 5$ (with and without line search over the step size), with penalty parameter $\tau = 10^2$ (the condition number is $1.2 \times 10^9$). Here, we see that our algorithm behaves similarly to the conjugate gradient: unlike Nesterov's method, which needs the parameters $\mu$ and $L$, the RMPE algorithm adapts itself to the spectrum of $g'(x^*)$ (so it can exploit a good local strong convexity parameter), without any prior specification.
We can, for example, observe this behavior when the global strong convexity parameter is poor but the local one is not.

[Figure 2: $f(x_k) - f(x^*)$ versus gradient oracle calls.] Figure 2: Logistic regression on the Madelon UCI dataset, solved using the gradient method, Nesterov's method and AMPE (i.e., RMPE with $\lambda = 0$). The condition number is $1.2 \times 10^9$. We see that, without regularization, AMPE is unstable because $\|(\tilde U^T \tilde U)^{-1}\|_2$ is huge (see Proposition 2.3).

The algorithm itself is simple: by solving only a small linear system, we can compute a good estimate of the limit of the sequence $\{x_i\}$. We also showed (using the gradient method on logistic regression) that the strategy of alternating between the algorithm and the extrapolation method can lead to impressive results, significantly improving the rate of convergence. Future work will consist in improving the performance of the algorithm by exploiting the structure of the noise matrix $E$ in some cases (for example, with the gradient method, the norm of the column $E_k$ of $E$ decreases as $k$ grows), and in extending the algorithm to the constrained case, the stochastic case and the non-symmetric case. We will also try to refine the term (16) appearing in the theoretical bound.

Acknowledgment. The research leading to these results has received funding from the European Union's Seventh Framework Programme (FP7-PEOPLE-2013-ITN) under grant agreement no. 607290 SpaRTaN, as well as support from ERC SIPA and the chaire Économie des nouvelles données with the data science joint research initiative with the fonds AXA pour la recherche.

References

[1] Y. Nesterov. A method of solving a convex programming problem with convergence rate O(1/k²). Soviet Mathematics Doklady, 27(2):372–376, 1983.
[2] A. S. Nemirovskii and Y. E. Nesterov. Optimal methods of smooth convex minimization. USSR Computational Mathematics and Mathematical Physics, 25(2):21–30, 1985.
[3] M. Mešina. Convergence acceleration for the iterative solution of the equations X = AX + f. Computer Methods in Applied Mechanics and Engineering, 10(2):165–173, 1977.
[4] R. Eddy. Extrapolating to the limit of a vector sequence. In Information Linkage between Applied Mathematics and Industry, pages 387–396, 1979.
[5] A. Beck and M. Teboulle. A fast iterative shrinkage-thresholding algorithm for linear inverse problems. SIAM Journal on Imaging Sciences, 2(1):183–202, 2009.
[6] Y. Nesterov. Introductory Lectures on Convex Optimization. Springer, 2003.
[7] Y. Nesterov. Universal gradient methods for convex optimization problems. Mathematical Programming, 152(1-2):381–404, 2015.
[8] Y. Drori and M. Teboulle. Performance of first-order methods for smooth convex minimization: a novel approach. Mathematical Programming, 145(1-2):451–482, 2014.
[9] L. Lessard, B. Recht, and A. Packard. Analysis and design of optimization algorithms via integral quadratic constraints. SIAM Journal on Optimization, 26(1):57–95, 2016.
[10] W. Su, S. Boyd, and E. Candès. A differential equation for modeling Nesterov's accelerated gradient method: theory and insights. In Advances in Neural Information Processing Systems, pages 2510–2518, 2014.
[11] A. Wibisono and A. C. Wilson. On accelerated methods in optimization. Technical report, 2015.
[12] H. Lin, J. Mairal, and Z. Harchaoui. A universal catalyst for first-order optimization. In Advances in Neural Information Processing Systems, pages 3366–3374, 2015.
[13] A. C. Aitken. On Bernoulli's numerical solution of algebraic equations. Proceedings of the Royal Society of Edinburgh, 46:289–305, 1927.
[14] D. Shanks. Non-linear transformations of divergent and slowly convergent sequences. Journal of Mathematics and Physics, 34(1):1–42, 1955.
[15] P. Wynn. On a device for computing the e_m(S_n) transformation. Mathematical Tables and Other Aids to Computation, 10(54):91–96, 1956.
[16] C. Brezinski. Accélération de la convergence en analyse numérique. Lecture Notes in Mathematics, (584), 1977.
[17] A. Sidi, W. F. Ford, and D. A. Smith. Acceleration of convergence of vector sequences. SIAM Journal on Numerical Analysis, 23(1):178–196, 1986.
[18] N. Levinson. The Wiener RMS error criterion in filter design and prediction. Appendix B of N. Wiener, Extrapolation, Interpolation, and Smoothing of Stationary Time Series, 1949.
[19] J. Durbin. The fitting of time-series models. Revue de l'Institut International de Statistique, pages 233–244, 1960.
[20] G. Heinig and K. Rost. Fast algorithms for Toeplitz and Hankel matrices. Linear Algebra and its Applications, 435(1):1–59, 2011.
[21] D. A. Smith, W. F. Ford, and A. Sidi. Extrapolation methods for vector sequences. SIAM Review, 29(2):199–233, 1987.
[22] S. Cabay and L. W. Jackson. A polynomial extrapolation method for finding limits and antilimits of vector sequences. SIAM Journal on Numerical Analysis, 13(5):734–752, 1976.
[23] G. H. Golub and R. S. Varga. Chebyshev semi-iterative methods, successive overrelaxation iterative methods, and second order Richardson iterative methods. Numerische Mathematik, 3(1):157–168, 1961.
[24] E. E. Tyrtyshnikov. How bad are Hankel matrices? Numerische Mathematik, 67(2):261–269, 1994.
[25] Y. Nesterov. Squared functional systems and optimization problems. In High Performance Optimization, pages 405–440. Springer, 2000.
[26] J. B. Lasserre. Global optimization with polynomials and the problem of moments. SIAM Journal on Optimization, 11(3):796–817, 2001.
[27] P. Parrilo. Structured Semidefinite Programs and Semialgebraic Geometry Methods in Robustness and Optimization. PhD thesis, California Institute of Technology, 2000.
[28] A. Ben-Tal and A. Nemirovski. Lectures on Modern Convex Optimization: Analysis, Algorithms, and Engineering Applications. MPS-SIAM Series on Optimization. SIAM, 2001.