# On the Iteration Complexity of Hypergradient Computation

Riccardo Grazzi¹², Luca Franceschi¹², Massimiliano Pontil¹², Saverio Salzo¹

¹Computational Statistics and Machine Learning, Istituto Italiano di Tecnologia, Genoa, Italy. ²Department of Computer Science, University College London, London, UK. Correspondence to: Riccardo Grazzi.

Proceedings of the 37th International Conference on Machine Learning, Online, PMLR 119, 2020.

## Abstract

We study a general class of bilevel problems, consisting in the minimization of an upper-level objective which depends on the solution to a parametric fixed-point equation. Important instances arising in machine learning include hyperparameter optimization, meta-learning, and certain graph and recurrent neural networks. Typically the gradient of the upper-level objective (hypergradient) is hard or even impossible to compute exactly, which has raised the interest in approximation methods. We investigate some popular approaches to compute the hypergradient, based on reverse mode iterative differentiation and approximate implicit differentiation. Under the hypothesis that the fixed-point equation is defined by a contraction mapping, we present a unified analysis which allows for the first time to quantitatively compare these methods, providing explicit bounds for their iteration complexity. This analysis suggests a hierarchy in terms of computational efficiency among the above methods, with approximate implicit differentiation based on conjugate gradient performing best. We present an extensive experimental comparison among the methods which confirms the theoretical findings.

## 1. Introduction

Several problems arising in machine learning and related disciplines can be formulated as bilevel problems, where the lower-level problem is a fixed-point equation whose solution is part of an upper-level objective. Instances of this framework include hyperparameter optimization (Maclaurin et al., 2015; Franceschi et al., 2017; Liu et al., 2018; Lorraine et al., 2019; Elsken et al., 2019), meta-learning (Andrychowicz et al., 2016; Finn et al., 2017; Franceschi et al., 2018), as well as recurrent and graph neural networks (Almeida, 1987; Pineda, 1987; Scarselli et al., 2008).

In large scale scenarios, there are thousands or even millions of parameters to find in the upper-level problem, making black-box approaches like grid and random search (Bergstra & Bengio, 2012) or Bayesian optimization (Snoek et al., 2012) impractical. This has made gradient-based methods (Domke, 2012; Maclaurin et al., 2015; Pedregosa, 2016) popular in such settings, but it has also raised the issue of designing efficient procedures to approximate the gradient of the upper-level objective (hypergradient) when finding a solution to the lower-level problem is costly. The principal goal of this paper is to study the degree of approximation to the hypergradient of certain iterative schemes based on iterative or implicit differentiation (the reader interested in the convergence analysis of gradient-based algorithms for bilevel optimization is referred to Pedregosa, 2016; Rajeswaran et al., 2019, and references therein). In the rest of the introduction we present the bilevel framework, alongside some relevant examples in machine learning. We then outline the gradient approximation methods that we analyse in the paper and highlight our main contributions. Finally, we discuss and compare our results with previous work in the field.

**The bilevel framework.** In this work, we consider the following bilevel problem.
$$
\min_{\lambda\in\Lambda}\ f(\lambda) := E(w(\lambda), \lambda) \quad \text{subject to} \quad w(\lambda) = \Phi(w(\lambda), \lambda), \tag{1}
$$

where $\Lambda$ is a closed convex subset of $\mathbb{R}^n$ and $E:\mathbb{R}^d\times\Lambda\to\mathbb{R}$ and $\Phi:\mathbb{R}^d\times\Lambda\to\mathbb{R}^d$ are continuously differentiable functions. We assume that the lower-level problem in (1) (which is a fixed-point equation) admits a unique solution. However, in general, explicitly computing such a solution is either impossible or expensive. When $f$ is differentiable, this issue affects the evaluation of the hypergradient $\nabla f(\lambda)$, which at best can only be approximately computed.

A prototypical example of the bilevel problem (1) is
$$
\min_{\lambda\in\Lambda}\ f(\lambda) := E(w(\lambda), \lambda) \quad \text{subject to} \quad w(\lambda) = \operatorname*{arg\,min}_{u\in\mathbb{R}^d} \ell(u, \lambda), \tag{2}
$$
where $\ell:\mathbb{R}^d\times\Lambda\to\mathbb{R}$ is a loss function, twice continuously differentiable and strictly convex w.r.t. the first variable. Indeed, if we let $\Phi$ be such that $\Phi(w,\lambda) = w - \alpha(\lambda)\nabla_1\ell(w,\lambda)$, where $\alpha:\Lambda\to\mathbb{R}_{++}$ is differentiable, then problem (2) and problem (1) are equivalent. Specific examples of problem (2), which include hyperparameter optimization and meta-learning, are discussed in Section 3.1.

Other instances of the bilevel problem (1), which are not of the special form of problem (2), arise in the context of so-called equilibrium models (EQM). Notably, these comprise some types of connectionist models employed in domains with structured data. Stable recurrent neural networks (Miller & Hardt, 2019), graph neural networks (Scarselli et al., 2008) and the formulations by Bai et al. (2019) belong to this class. EQM differ from standard (deep) neural networks in that the internal representations are given by fixed points of learnable dynamics rather than compositions of a finite number of layers. The learning problem for such types of models can be written as
$$
\min_{\lambda=(\gamma,\theta)\in\Lambda}\ f(\lambda) := \sum_{i=1}^n E_i(w_i(\gamma), \theta) \quad \text{subject to} \quad w_i(\gamma) = \phi_i(w_i(\gamma), \gamma), \ \ 1\le i\le n, \tag{3}
$$
where the operators $\phi_i:\mathbb{R}^d\times\Lambda\to\mathbb{R}^d$ (here $\Phi = (\phi_i)_{i=1}^n$) are associated to the training points $x_i$'s, and the error functions $E_i$ are the losses incurred by a standard supervised algorithm on the transformed dataset $\{(w_i(\gamma), y_i)\}_{i=1}^n$. A specific example is discussed in Section 3.2.

In this paper, we present a unified analysis which allows for the first time to quantitatively compare popular methods to approximate $\nabla f(\lambda)$ in the general setting of problem (1). The strategies we consider can be divided into two categories:

1. **Iterative Differentiation (ITD)** (Maclaurin et al., 2015; Franceschi et al., 2017; 2018; Finn et al., 2017). One defines the sequence of functions $f_t(\lambda) = E(w_t(\lambda), \lambda)$, where the $w_t(\lambda)$ are the fixed-point iterates generated by the map $\Phi(\cdot, \lambda)$. Then $\nabla f(\lambda)$ is approximated by $\nabla f_t(\lambda)$, which in turn is computed using forward (FMAD) or reverse (RMAD) mode automatic differentiation (Griewank & Walther, 2008).
2. **Approximate Implicit Differentiation (AID)** (Pedregosa, 2016; Rajeswaran et al., 2019; Lorraine et al., 2019). First, an (implicit) equation for $\nabla f(\lambda)$ is obtained through the implicit function theorem. Then, this equation is approximately solved by using a two stage scheme.
We analyse two specific methods in this class: the fixed-point method (Lorraine et al., 2019), also referred to as recurrent backpropagation in the context of recurrent neural networks (Almeida, 1987; Pineda, 1987), and the conjugate gradient method (Pedregosa, 2016). Both schemes can be efficiently implemented using automatic differentiation (Griewank & Walther, 2008; Baydin et al., 2018), achieving a similar cost in time, while ITD usually has a larger memory cost than AID (this is true when ITD is implemented using RMAD, which is the standard approach when $\lambda$ is high dimensional).

**Contributions.** Although there is a vast amount of literature on the two hypergradient approximation strategies previously described, it remains unclear whether it is better to use one or the other. In this work, we shed some light on this issue both theoretically and experimentally. Specifically, our contributions are the following:

- We provide iteration complexity results for ITD and AID when the mapping defining the fixed-point equation is a contraction. In particular, we prove non-asymptotic linear rates for the approximation errors of both approaches.
- We make a theoretical and numerical comparison among different ITD and AID strategies considering several experimental scenarios.

We note that, to the best of our knowledge, non-asymptotic rates of convergence for AID were only recently given in the case of meta-learning (Rajeswaran et al., 2019). Furthermore, we are not aware of any previous results about non-asymptotic rates of convergence for ITD.

**Related Work.** Iterative differentiation for functions defined implicitly has been extensively studied in the automatic differentiation literature. In particular, (Griewank & Walther, 2008, Chap. 15) derives asymptotic linear rates for ITD under the assumption that $\Phi(\cdot,\lambda)$ is a contraction. Another attempt to theoretically analyse ITD is made by Franceschi et al. (2018) in the context of the bilevel problem (2). There, the authors provide sufficient conditions for the asymptotic convergence of the set of minimizers of the approximate problem to the set of minimizers of the bilevel problem. In contrast, in this work, we give rates for the convergence of the approximate hypergradient $\nabla f_t(\lambda)$ to the true one (i.e. $\nabla f(\lambda)$). ITD is also considered in (Shaban et al., 2019), where $\nabla f_t(\lambda)$ is approximated via a procedure which is reminiscent of truncated backpropagation. The authors bound the norm of the difference between $\nabla f_t(\lambda)$ and its truncated version as a function of the truncation steps. This is different from our analysis, which directly considers the problem of estimating the gradient of $f$. In the case of AID, an asymptotic analysis is presented in (Pedregosa, 2016), where the author proves the convergence of an inexact gradient projection algorithm for the minimization of the function $f$ defined in problem (2), using increasingly accurate estimates of $\nabla f(\lambda)$, whereas Rajeswaran et al. (2019) present complexity results in the setting of meta-learning with biased regularization. Here, we extend this line of work by providing complexity results for AID in the more general setting of problem (1). We also mention the papers by Amos & Kolter (2017) and Amos (2019), which present techniques to differentiate through the solutions of quadratic and cone programs respectively.
Using such techniques allows one to treat these optimization problems as layers of a neural network and to use backpropagation for the end-to-end training of the resulting learning model. In the former work, the gradient is obtained by implicitly differentiating through the KKT conditions of the lower-level problem, while the latter performs implicit differentiation on the residual map of Minty's parametrization. A different approach to solve bilevel problems of the form (2) is presented by Mehra & Hamm (2019), who consider a sequence of single-level objectives involving a quadratic regularization term penalizing violations of the lower-level first-order stationary conditions. The authors provide asymptotic convergence guarantees for the method, as the regularization parameter tends to infinity, and show that it outperforms both ITD and AID on different settings where the lower-level problem is non-convex. All previously mentioned works except (Griewank & Walther, 2008) consider bilevel problems of the form (2). Another exception is (Liao et al., 2018), which proposes two improvements to recurrent backpropagation, one based on conjugate gradient on the normal equations, and another based on Neumann series approximation of the inverse.

## 2. Theoretical Analysis

In this section we establish non-asymptotic bounds on the hypergradient (i.e. $\nabla f(\lambda)$) approximation errors for both ITD and AID schemes (proofs can be found in Appendix A). In particular, in Section 2.1 we address the iteration complexity of ITD, while in Section 2.2, after giving a general bound for AID, we focus on two popular implementations of the AID scheme: one based on the conjugate gradient method and the other on the fixed-point method.

**Notations.** We denote by $\|\cdot\|$ the Euclidean norm when applied to a vector and the spectral norm when applied to a matrix. For a differentiable function $f:\mathbb{R}^n\to\mathbb{R}^m$ we denote by $f'(x)\in\mathbb{R}^{m\times n}$ the derivative of $f$ at $x$. When $m=1$, we denote by $\nabla f:\mathbb{R}^n\to\mathbb{R}^n$ the gradient of $f$. For a real-valued function $g:\mathbb{R}^n\times\mathbb{R}^m\to\mathbb{R}$ we denote by $\nabla_1 g(x,y)\in\mathbb{R}^n$ and $\nabla_2 g(x,y)\in\mathbb{R}^m$ the partial derivatives w.r.t. the first and second variable respectively. We also denote by $\nabla_1^2 g(x,y)\in\mathbb{R}^{n\times n}$ and $\nabla_{12}^2 g(x,y)\in\mathbb{R}^{n\times m}$ the second derivative of $g$ w.r.t. the first variable and the mixed second derivative of $g$ w.r.t. the first and second variable. For a vector-valued function $h:\mathbb{R}^n\times\mathbb{R}^m\to\mathbb{R}^k$ we denote by $\partial_1 h(x,y)\in\mathbb{R}^{k\times n}$ and $\partial_2 h(x,y)\in\mathbb{R}^{k\times m}$ the partial Jacobians w.r.t. the first and second variable respectively at $(x,y)\in\mathbb{R}^n\times\mathbb{R}^m$.

In the rest of the section, referring to problem (1), we will group the assumptions as follows. Assumption A is general, while Assumptions B and C are specific enrichments for ITD and AID respectively.

**Assumption A.** For every $\lambda\in\Lambda$,
(i) $w(\lambda)$ is the unique fixed point of $\Phi(\cdot,\lambda)$;
(ii) $I - \partial_1\Phi(w(\lambda),\lambda)$ is invertible;
(iii) $\partial_1\Phi(\cdot,\lambda)$ and $\partial_2\Phi(\cdot,\lambda)$ are Lipschitz continuous with constants $\nu_{1,\lambda}$ and $\nu_{2,\lambda}$ respectively;
(iv) $\nabla_1 E(\cdot,\lambda)$ and $\nabla_2 E(\cdot,\lambda)$ are Lipschitz continuous with constants $\eta_{1,\lambda}$ and $\eta_{2,\lambda}$ respectively.

A direct consequence of Assumption A(i)-(ii) and of the implicit function theorem is that $w(\cdot)$ and $f(\cdot)$ are differentiable on $\Lambda$. Specifically, for every $\lambda\in\Lambda$, it holds that
$$
w'(\lambda) = \big(I - \partial_1\Phi(w(\lambda),\lambda)\big)^{-1}\,\partial_2\Phi(w(\lambda),\lambda), \tag{4}
$$
$$
\nabla f(\lambda) = \nabla_2 E(w(\lambda),\lambda) + w'(\lambda)^\top \nabla_1 E(w(\lambda),\lambda). \tag{5}
$$
See Theorem A.1 for details. In the special case of problem (2), equation (4) reduces (see Corollary A.1) to
$$
w'(\lambda) = -\nabla_1^2\ell(w(\lambda),\lambda)^{-1}\,\nabla_{12}^2\ell(w(\lambda),\lambda).
$$
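As a quick numerical sanity check of (4) and (5), the following snippet compares the implicit-differentiation formula with a central finite-difference approximation of $\nabla f$ on a tiny synthetic instance of problem (2) with a quadratic lower-level objective, so that $w(\lambda)$ is available in closed form. The specific quadratic $\ell$, the upper-level objective $E$ and all sizes are illustrative choices of ours, not taken from the paper.

```python
import torch

torch.manual_seed(0)
torch.set_default_dtype(torch.float64)

d, n = 5, 3
Q = torch.randn(d, d); Q = Q @ Q.T + d * torch.eye(d)   # Hessian of l w.r.t. w (SPD)
C = torch.randn(d, n)                                    # mixed second derivative of l
a = torch.randn(d)

def w_star(lam):
    # lower level of (2):  w(lam) = argmin_w 1/2 w^T Q w + w^T (C lam + a)
    return torch.linalg.solve(Q, -(C @ lam + a))

def E(w, lam):
    # an arbitrary smooth upper-level objective
    return 0.5 * (w ** 2).sum() + (lam ** 2).sum()

lam = torch.randn(n)
w = w_star(lam)

# (4) specialised to (2):  w'(lam) = -[grad_1^2 l]^{-1} grad_12^2 l = -Q^{-1} C
w_prime = -torch.linalg.solve(Q, C)
# (5):  grad f(lam) = grad_2 E + w'(lam)^T grad_1 E
grad_f = 2 * lam + w_prime.T @ w

# central finite differences on f(lam) = E(w(lam), lam)
eps, fd = 1e-4, torch.zeros(n)
for j in range(n):
    e = torch.zeros(n); e[j] = eps
    fd[j] = (E(w_star(lam + e), lam + e) - E(w_star(lam - e), lam - e)) / (2 * eps)

print(torch.allclose(grad_f, fd, atol=1e-6))  # expected: True
```

Because the lower-level solution is exact here, the check isolates formulas (4)-(5) from any fixed-point approximation error, which is the quantity studied in the rest of this section.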
Before starting with the study of the two methods ITD and AID, we give a lemma which introduces three additional constants that will occur in the complexity bounds.

**Lemma 2.1.** Let $\lambda\in\Lambda$ and let $D_\lambda > 0$ be such that $\|w(\lambda)\| \le D_\lambda$. Then there exist $L_{E,\lambda}, L_{\Phi,\lambda}\in\mathbb{R}_+$ such that
$$
\sup_{\|w\|\le 2D_\lambda}\|\nabla_1 E(w,\lambda)\| \le L_{E,\lambda}, \qquad \sup_{\|w\|\le 2D_\lambda}\|\partial_2\Phi(w,\lambda)\| \le L_{\Phi,\lambda}.
$$

The proof exploits the fact that the image of a compact set under a continuous function remains compact.

### 2.1. Iterative Differentiation

In this section we replace $w(\lambda)$ in (1) by the $t$-th iterate of $\Phi(\cdot,\lambda)$, for which we additionally require the following.

**Assumption B.** For every $\lambda\in\Lambda$, $\Phi(\cdot,\lambda)$ is a contraction with constant $q_\lambda\in(0,1)$.

The approximation of the hypergradient $\nabla f(\lambda)$ is then obtained as in Algorithm 1.

**Algorithm 1 (Iterative Differentiation, ITD).**
1. Let $t\in\mathbb{N}$, set $w_0(\lambda) = 0$, and compute, for $i = 1, 2, \dots, t$, $\ w_i(\lambda) = \Phi(w_{i-1}(\lambda), \lambda)$.
2. Set $f_t(\lambda) = E(w_t(\lambda), \lambda)$.
3. Compute $\nabla f_t(\lambda)$ using automatic differentiation.

Assumption B looks quite restrictive; however, it is satisfied in a number of interesting cases:

(a) In the setting of the bilevel optimization problem (2), suppose that for every $\lambda\in\Lambda$, $\ell(\cdot,\lambda)$ is $\mu_\ell(\lambda)$-strongly convex and $L_\ell(\lambda)$-Lipschitz smooth for some continuously differentiable functions $\mu_\ell:\Lambda\to\mathbb{R}_{++}$ and $L_\ell:\Lambda\to\mathbb{R}_{++}$. Set
$$
\kappa(\lambda) = \frac{L_\ell(\lambda)}{\mu_\ell(\lambda)}, \qquad \alpha(\lambda) = \frac{2}{\mu_\ell(\lambda) + L_\ell(\lambda)}, \qquad q_\lambda = \frac{\kappa(\lambda)-1}{\kappa(\lambda)+1}. \tag{6}
$$
Then, $\Phi(w,\lambda) = w - \alpha(\lambda)\nabla_1\ell(w,\lambda)$ is a contraction w.r.t. $w$ with constant $q_\lambda$ (see Appendix B).

(b) For strongly convex quadratic functions, accelerated methods like Nesterov's (Nesterov, 1983) or heavy-ball (Polyak, 1987) can be formulated as fixed-point iterations of a contraction in the norm defined by a suitable positive definite matrix.

(c) In certain graph and recurrent neural networks of the form (3), where the transition function is assumed to be a contraction (Scarselli et al., 2008; Almeida, 1987; Pineda, 1987).

The following lemma is a simple consequence of the theory on Neumann series and shows that Assumption B is stronger than Assumption A(i)-(ii). For the reader's convenience the proof is given in Appendix A.

**Lemma 2.2.** Let Assumption B be satisfied. Then, for every $\lambda\in\Lambda$, $\Phi(\cdot,\lambda)$ has a unique fixed point and, for every $w\in\mathbb{R}^d$, $I - \partial_1\Phi(w,\lambda)$ is invertible and
$$
\big\|\big(I - \partial_1\Phi(w,\lambda)\big)^{-1}\big\| \le \frac{1}{1-q_\lambda}.
$$
In particular, (i) and (ii) in Assumption A hold.

With Assumption B in force and if $w_t(\lambda)$ is defined as at point 1 in Algorithm 1, we have the following proposition, which is essential for the final bound.

**Proposition 2.1.** Suppose that Assumptions A(iii) and B hold and let $t\in\mathbb{N}$, with $t\ge 1$. Moreover, for every $\lambda\in\Lambda$, let $w_t(\lambda)$ be computed by Algorithm 1 and let $D_\lambda$ and $L_{\Phi,\lambda}$ be as in Lemma 2.1. Then, $w_t(\cdot)$ is differentiable and, for every $\lambda\in\Lambda$,
$$
\|w_t'(\lambda) - w'(\lambda)\| \le \Big(\nu_{2,\lambda} + \frac{\nu_{1,\lambda}L_{\Phi,\lambda}}{1-q_\lambda}\Big) D_\lambda\, t\, q_\lambda^{t-1} + \frac{L_{\Phi,\lambda}}{1-q_\lambda}\, q_\lambda^{t}. \tag{7}
$$

Leveraging Proposition 2.1, we give the main result of this section.

**Theorem 2.1 (ITD bound).** Suppose that Assumptions A(iii)-(iv) and B hold and let $t\in\mathbb{N}$ with $t\ge 1$. Moreover, for every $\lambda\in\Lambda$, let $w_t(\lambda)$ and $f_t$ be defined according to Algorithm 1 and let $D_\lambda$, $L_{E,\lambda}$, and $L_{\Phi,\lambda}$ be as in Lemma 2.1. Then, $f_t$ is differentiable and, for every $\lambda\in\Lambda$,
$$
\|\nabla f_t(\lambda) - \nabla f(\lambda)\| \le \Big(c_1(\lambda) + c_2(\lambda)\,\frac{t}{q_\lambda} + c_3(\lambda)\Big)\, q_\lambda^{t}, \tag{8}
$$
where
$$
c_1(\lambda) = \Big(\eta_{2,\lambda} + \frac{\eta_{1,\lambda}L_{\Phi,\lambda}}{1-q_\lambda}\Big) D_\lambda, \qquad
c_2(\lambda) = \Big(\nu_{2,\lambda} + \frac{\nu_{1,\lambda}L_{\Phi,\lambda}}{1-q_\lambda}\Big) L_{E,\lambda} D_\lambda, \qquad
c_3(\lambda) = \frac{L_{E,\lambda}L_{\Phi,\lambda}}{1-q_\lambda}.
$$

In this generality this is a new result that provides a non-asymptotic linear rate of convergence for the gradient of $f_t$ towards that of $f$.
### 2.2. Approximate Implicit Differentiation

In this section we study another approach to approximate the gradient of $f$. We derive from (4) and (5) that
$$
\nabla f(\lambda) = \nabla_2 E(w(\lambda),\lambda) + \partial_2\Phi(w(\lambda),\lambda)^\top v(\lambda), \tag{9}
$$
where $v(\lambda)$ is the solution of the linear system
$$
\big(I - \partial_1\Phi(w(\lambda),\lambda)^\top\big)\, v = \nabla_1 E(w(\lambda),\lambda). \tag{10}
$$
However, in the above formulas $w(\lambda)$ is usually not known explicitly or is expensive to compute exactly. To solve this issue, $\nabla f(\lambda)$ is estimated as in Algorithm 2.

**Algorithm 2 (Approximate Implicit Differentiation, AID).**
1. Let $t\in\mathbb{N}$ and compute $w_t(\lambda)$ by $t$ steps of an algorithm converging to $w(\lambda)$, starting from $w_0(\lambda)=0$.
2. Compute $v_{t,k}(\lambda)$ after $k$ steps of a solver for the linear system
$$
\big(I - \partial_1\Phi(w_t(\lambda),\lambda)^\top\big)\, v = \nabla_1 E(w_t(\lambda),\lambda). \tag{11}
$$
3. Compute the approximate gradient as
$$
\hat\nabla f(\lambda) := \nabla_2 E(w_t(\lambda),\lambda) + \partial_2\Phi(w_t(\lambda),\lambda)^\top v_{t,k}(\lambda).
$$

Note that, unlike ITD, this procedure is agnostic about the algorithms used to compute the sequences $w_t(\lambda)$ and $v_{t,k}(\lambda)$. Interestingly, in the context of problem (2), choosing $\Phi(w,\lambda) = w - \nabla_1\ell(w,\lambda)$ in Algorithm 2 yields the same procedure studied by Pedregosa (2016). The numbers of iterations $t$ and $k$ in Algorithm 2 give a direct way of trading off accuracy and speed. To quantify this trade-off we consider the following assumptions.

**Assumption C.** For every $\lambda\in\Lambda$,
(i) for every $w\in\mathbb{R}^d$, $I - \partial_1\Phi(w,\lambda)$ is invertible;
(ii) $\|w_t(\lambda) - w(\lambda)\| \le \rho_\lambda(t)\,\|w(\lambda)\|$, with $\rho_\lambda(t)\le 1$ and $\rho_\lambda(t)\to 0$ as $t\to+\infty$;
(iii) $\|v_{t,k}(\lambda) - v_t(\lambda)\| \le \sigma_\lambda(k)\,\|v_t(\lambda)\|$, with $\sigma_\lambda(k)\to 0$ as $k\to+\infty$, where $v_t(\lambda)$ denotes the solution of the linear system (11).

If Assumption C(i) holds, then, for every $\lambda\in\Lambda$, since the map $w\mapsto (I - \partial_1\Phi(w,\lambda))^{-1}$ is continuous, we have
$$
\sup_{\|w\|\le 2D_\lambda}\big\|\big(I - \partial_1\Phi(w,\lambda)\big)^{-1}\big\| \le \frac{1}{\mu_\lambda} < +\infty, \tag{12}
$$
for some $\mu_\lambda > 0$. We note that, in view of Lemma 2.2, Assumption B implies Assumption C(i) (which in turn implies Assumption A(ii)) and in (12) one can take $\mu_\lambda = 1 - q_\lambda$. We stress that Assumptions C(ii)-(iii) are general and do not specify the type of algorithms solving the fixed-point equation $w = \Phi(w,\lambda)$ and the linear system (11). It is only required that such algorithms have explicit rates of convergence $\rho_\lambda(t)$ and $\sigma_\lambda(k)$ respectively. Finally, we note that Assumption C(ii) is less restrictive than Assumption B and encompasses the procedure at point 1 in Algorithm 1: indeed, in such case C(ii) holds with $\rho_\lambda(t) = q_\lambda^t$. It is also worth noting that the AID procedure requires only to store the last lower-level iterate, i.e. $w_t(\lambda)$. This is a considerable advantage over ITD, which instead requires storing the entire lower-level optimization trajectory $(w_i(\lambda))_{0\le i\le t}$, if implemented using RMAD.

The iteration complexity bound for AID is given below. This is a general bound which depends on the rate of convergence $\rho_\lambda(t)$ of the sequence $(w_t(\lambda))_{t\in\mathbb{N}}$ and the rate of convergence $\sigma_\lambda(k)$ of the sequence $(v_{t,k}(\lambda))_{k\in\mathbb{N}}$.

**Theorem 2.2 (AID bound).** Suppose that Assumptions A(i)(iii)(iv) and C(i)-(iii) hold. Let $\lambda\in\Lambda$, $t\in\mathbb{N}$, $k\in\mathbb{N}$. Let $D_\lambda$, $L_{E,\lambda}$, and $L_{\Phi,\lambda}$ be as in Lemma 2.1 and let $\mu_\lambda$ be defined according to (12). Let $\hat\nabla f(\lambda)$ be defined as in Algorithm 2 and let $\hat\Delta = \|\hat\nabla f(\lambda) - \nabla f(\lambda)\|$. Then,
$$
\hat\Delta \le \Big(\eta_{2,\lambda} + \frac{\eta_{1,\lambda}L_{\Phi,\lambda}}{\mu_\lambda} + \frac{\nu_{2,\lambda}L_{E,\lambda}}{\mu_\lambda} + \frac{\nu_{1,\lambda}L_{\Phi,\lambda}L_{E,\lambda}}{\mu_\lambda^2}\Big)\, D_\lambda\,\rho_\lambda(t) + \frac{L_{\Phi,\lambda}L_{E,\lambda}}{\mu_\lambda}\,\sigma_\lambda(k). \tag{13}
$$
Furthermore, if Assumption B holds, then $\mu_\lambda = 1 - q_\lambda$ and
$$
\hat\Delta \le \Big(c_1(\lambda) + \frac{c_2(\lambda)}{1-q_\lambda}\Big)\rho_\lambda(t) + c_3(\lambda)\,\sigma_\lambda(k), \tag{14}
$$
where $c_1(\lambda)$, $c_2(\lambda)$ and $c_3(\lambda)$ are defined in Theorem 2.1.

Theorem 2.2 provides a non-asymptotic rate of convergence for $\hat\nabla f$, which contrasts with the asymptotic result given in Pedregosa (2016). In this respect, making Assumption C(i) instead of the weaker Assumption A(ii) is critical. Depending on the choice of the solver for the linear system (11), different AID methods are obtained. In the following we consider two cases.

**AID with the Conjugate Gradient Method (AID-CG).** For the sake of brevity we set $A_{\lambda,t} = I - \partial_1\Phi(w_t(\lambda),\lambda)^\top$ and $b_{\lambda,t} = \nabla_1 E(w_t(\lambda),\lambda)$.
Then, the linear system (11) is equivalent to the following minimization problem
$$
\min_{v\in\mathbb{R}^d}\ \frac{1}{2}\,\|A_{\lambda,t} v - b_{\lambda,t}\|^2, \tag{15}
$$
which, if $\partial_1\Phi(w_t(\lambda),\lambda)$ is symmetric (so that $A_{\lambda,t}$ is also symmetric), is in turn equivalent to
$$
\min_{v\in\mathbb{R}^d}\ \frac{1}{2}\, v^\top A_{\lambda,t}\, v - v^\top b_{\lambda,t}. \tag{16}
$$
Several first-order methods solving problems (15) or (16) satisfy Assumption C(iii) with linear rates and require only Jacobian-vector products. In particular, for the symmetric case (16), the conjugate gradient method features the following linear rate:
$$
\|v_{t,k}(\lambda) - v_t(\lambda)\| \le \left(\frac{\sqrt{\kappa(A_{\lambda,t})}-1}{\sqrt{\kappa(A_{\lambda,t})}+1}\right)^{k} \|v_{t,0}(\lambda) - v_t(\lambda)\|, \tag{17}
$$
where $\kappa(A_{\lambda,t})$ is the condition number of $A_{\lambda,t}$. In the setting of case (a) outlined in Section 2.1, $A_{\lambda,t} = \alpha(\lambda)\,\nabla_1^2\ell(w_t(\lambda),\lambda)$ and $\mu_\ell(\lambda) I \preceq \nabla_1^2\ell(w_t(\lambda),\lambda) \preceq L_\ell(\lambda) I$. Therefore the condition number of $A_{\lambda,t}$ satisfies $\kappa(A_{\lambda,t}) \le L_\ell(\lambda)/\mu_\ell(\lambda) = \kappa(\lambda)$ and hence
$$
\frac{\sqrt{\kappa(A_{\lambda,t})}-1}{\sqrt{\kappa(A_{\lambda,t})}+1} \le \frac{\sqrt{\kappa(\lambda)}-1}{\sqrt{\kappa(\lambda)}+1} \le \frac{\kappa(\lambda)-1}{\kappa(\lambda)+1} = q_\lambda. \tag{18}
$$

**AID with the Fixed-Point Method (AID-FP).** In this paragraph we make a specific choice for the sequence $(v_{t,k}(\lambda))_{k\in\mathbb{N}}$ in Assumption C(iii). We let Assumption B be satisfied and consider the following algorithm. For every $\lambda\in\Lambda$ and $t\in\mathbb{N}$, we choose $v_{t,0}(\lambda) = 0\in\mathbb{R}^d$ and, for $k = 1, 2, \dots$,
$$
v_{t,k}(\lambda) = \partial_1\Phi(w_t(\lambda),\lambda)^\top v_{t,k-1}(\lambda) + \nabla_1 E(w_t(\lambda),\lambda). \tag{19}
$$
In such case the rate of convergence $\sigma_\lambda(k)$ is linear. More precisely, since $\|\partial_1\Phi(w_t(\lambda),\lambda)\| \le q_\lambda < 1$ (from Assumption B), the mapping $T: v\mapsto \partial_1\Phi(w_t(\lambda),\lambda)^\top v + \nabla_1 E(w_t(\lambda),\lambda)$ is a contraction with constant $q_\lambda$. Moreover, the fixed point of $T$ is the solution of (11). Therefore,
$$
\|v_{t,k}(\lambda) - v_t(\lambda)\| \le q_\lambda^k\, \|v_{t,0}(\lambda) - v_t(\lambda)\|.
$$
In the end the following result holds.

**Theorem 2.3.** If Assumption B holds and $(v_{t,k}(\lambda))_{k\in\mathbb{N}}$ is defined according to (19), then Assumption C(iii) is satisfied with $\sigma_\lambda(k) = q_\lambda^k$.

Now, plugging the rate $\sigma_\lambda(k) = q_\lambda^k$ into the general bound (14) yields
$$
\hat\Delta \le \Big(c_1(\lambda) + \frac{c_2(\lambda)}{1-q_\lambda}\Big)\rho_\lambda(t) + c_3(\lambda)\, q_\lambda^k. \tag{20}
$$
However, an analysis similar to the one in Section 2.1 shows that the above result can be slightly improved as follows.

**Theorem 2.4 (AID-FP bound).** Suppose that Assumptions A(i)(iii)(iv) and Assumption B hold. Suppose also that (19) holds. Let $\hat\nabla f(\lambda)$ be defined according to Algorithm 2 and $\hat\Delta = \|\hat\nabla f(\lambda) - \nabla f(\lambda)\|$. Then, for every $t, k\in\mathbb{N}$,
$$
\hat\Delta \le \Big(c_1(\lambda) + c_2(\lambda)\,\frac{1-q_\lambda^k}{1-q_\lambda}\Big)\rho_\lambda(t) + c_3(\lambda)\, q_\lambda^k, \tag{21}
$$
where $c_1(\lambda)$, $c_2(\lambda)$ and $c_3(\lambda)$ are given in Theorem 2.1.

We end this section with a discussion about the consequences of the presented results.

### 2.3. Discussion

Theorem 2.2 shows that Algorithm 2 computes an approximate gradient of $f$ with a linear convergence rate (in $t$ and $k$), provided that the solvers for the lower-level problem and the linear system converge linearly. Furthermore, under Assumption B, both AID-FP and ITD converge linearly. However, if in Algorithm 2 we define $w_t(\lambda)$ as at point 1 in Algorithm 1 (so that $\rho_\lambda(t) = q_\lambda^t$), and take $k = t$, then the bound for AID-FP (21) is lower than that of ITD (8), since $q_\lambda(1-q_\lambda^t)/(1-q_\lambda) = \sum_{i=1}^t q_\lambda^i < t$ for every $t\ge 1$. This analysis suggests that AID-FP converges faster than ITD.

We now discuss the choice of the algorithm to solve the linear system (11) in Algorithm 2. Theorem 2.4 provides a bound for AID-FP, which considers procedure (19). However, we see from (14) in Theorem 2.2 that a solver for the linear system with rate of convergence $\sigma_\lambda(k)$ faster than $q_\lambda^k$ may give a better bound. The above discussion, together with (17) and (18), proves that AID-CG has a better asymptotic rate than AID-FP for instances of problem (2) where the lower-level objective $\ell(\cdot,\lambda)$ is Lipschitz smooth and strongly convex (case (a) outlined in Section 2.1).
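To make the comparison above concrete, here is a minimal PyTorch sketch of ITD (Algorithm 1) and AID-FP (Algorithm 2 with the solver (19)) for a generic map `phi(w, lam)` and upper-level objective `E(w, lam)`, using only vector-Jacobian products obtained through `torch.autograd.grad`. This is an illustration under our own interface assumptions (`lam` is a tensor with `requires_grad=True`, and `phi`, `E` are differentiable PyTorch functions), not the authors' hypertorch implementation; AID-CG would replace the inner loop with a conjugate gradient solver for (16), or for the normal equations of (15) when $\partial_1\Phi$ is not symmetric.

```python
import torch

def itd_hypergrad(phi, E, lam, w0, t):
    # ITD (Algorithm 1): unroll t fixed-point iterations and apply reverse-mode AD.
    # The whole trajectory w_1, ..., w_t is kept in memory.
    w = w0
    for _ in range(t):
        w = phi(w, lam)
    f_t = E(w, lam)
    return torch.autograd.grad(f_t, lam)[0]                  # = grad f_t(lam)

def aid_fp_hypergrad(phi, E, lam, w0, t, k):
    # AID-FP (Algorithm 2 with iteration (19)); memory does not grow with t.
    # 1) t lower-level iterations without tracking gradients
    w = w0
    with torch.no_grad():
        for _ in range(t):
            w = phi(w, lam)
    w = w.detach().requires_grad_(True)

    # gradients of the upper-level objective at (w_t, lam)
    outer = E(w, lam)
    grad1_E = torch.autograd.grad(outer, w, retain_graph=True)[0]
    grad2_E = torch.autograd.grad(outer, lam, allow_unused=True)[0]
    if grad2_E is None:                                      # E may not depend on lam directly
        grad2_E = torch.zeros_like(lam)

    # 2) k steps of  v <- d1Phi(w_t, lam)^T v + grad1_E  (iteration (19)),
    #    each step is a single vector-Jacobian product
    phi_w = phi(w, lam)
    v = torch.zeros_like(w)
    for _ in range(k):
        v = torch.autograd.grad(phi_w, w, grad_outputs=v, retain_graph=True)[0] + grad1_E

    # 3) hypergradient estimate of Algorithm 2:  grad2_E + d2Phi(w_t, lam)^T v
    return grad2_E + torch.autograd.grad(phi_w, lam, grad_outputs=v)[0]
```

The `no_grad` block in step 1 is what gives AID its memory advantage: only the last iterate $w_t(\lambda)$ is stored, in line with the discussion after Assumption C.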
Finally, we note that both ITD and AID consider the initialization $w_0(\lambda) = 0$. However, in a gradient-based bilevel optimization algorithm, it might be more convenient to use a warm start strategy where $w_0(\lambda)$ is set based on previous upper-level iterations. Our analysis can be applied also in this case, but the related upper bounds will depend on the upper-level dynamics. This aspect makes it difficult to theoretically analyse the benefit of a warm start strategy, which remains an open question.

## 3. Experiments

In the first part of this section we focus on the hypergradient approximation error and show that the upper bounds presented in the previous section give a good estimate of the actual convergence behaviour of ITD and AID strategies on a variety of settings. In the second part we present a series of experiments pertaining to optimization in both the settings of hyperparameter optimization, as in problem (2), and of learning equilibrium models, as in problem (3). The algorithms have been implemented in PyTorch (Paszke et al., 2019); the code is freely available at https://github.com/prolearner/hypertorch. In the following, we shorthand AID-FP and AID-CG with FP and CG, respectively.

### 3.1. Hypergradient Approximation

In this section, we consider several problems of type (2) with synthetically generated data (see Appendix C.1 for more details), where $D = (X, y)$ and $D' = (X', y')$ are the training and validation sets respectively, with $X\in\mathbb{R}^{n_e\times p}$, $X'\in\mathbb{R}^{n'_e\times p}$, being $n_e$, $n'_e$ the number of examples in each set and $p$ the number of features. Specifically we consider the following settings, which are representative instances of relevant bilevel problems in machine learning.

**Logistic Regression with ℓ2 Regularization (LR).** This setting is similar to the one in Pedregosa (2016), but we introduce multiple regularization parameters:
$$
f(\lambda) = \sum_{(x'_e, y'_e)\in D'} \psi\big(y'_e\, {x'_e}^{\top} w(\lambda)\big), \qquad
w(\lambda) = \operatorname*{arg\,min}_{w\in\mathbb{R}^p} \sum_{(x_e, y_e)\in D} \psi\big(y_e\, x_e^\top w\big) + \frac{1}{2}\, w^\top \mathrm{diag}(\lambda)\, w,
$$
where $\lambda\in\mathbb{R}^p_{++}$, $\psi(x) = \log(1 + e^{-x})$ and $\mathrm{diag}(\lambda)$ is the diagonal matrix formed by the elements of $\lambda$.

[Figure 1. Convergence of different hypergradient approximations on the four settings (Logistic Regression, Kernel Ridge Regression, Biased Regularization, Hyper Representation): $\|\nabla f(\lambda) - g(\lambda)\|$ versus the number of lower-level iterations $t$, for ITD, FP ($k=t$ and $k=10$) and CG ($k=t$ and $k=10$), where $g(\lambda)$ is equal to $\nabla f_t(\lambda)$ for ITD and to $\hat\nabla f(\lambda)$ for CG and FP. Mean and standard deviation (shaded areas) are computed over 20 values of $\lambda$ sampled uniformly from $[\lambda_{\min}, \lambda_{\max}]^n$.]

**Kernel Ridge Regression (KRR).** We extend the setting presented by Pedregosa (2016) considering a $p$-dimensional Gaussian kernel parameter $\gamma$ in place of the usual one:
$$
f(\beta, \gamma) = \frac{1}{2}\,\|y' - K'(\gamma)\, w(\beta, \gamma)\|^2, \qquad
w(\beta, \gamma) = \operatorname*{arg\,min}_{w\in\mathbb{R}^{n_e}} \frac{1}{2}\, w^\top \big(K(\gamma) + \beta I\big)\, w - w^\top y, \tag{22}
$$
where $\beta\in(0,\infty)$, $\gamma\in\mathbb{R}^p_{++}$ and $K'(\gamma)$, $K(\gamma)$ are respectively the validation and training kernel matrices (see Appendix C.1).

**Biased Regularization (BR).** Inspired by Denevi et al. (2019) and Rajeswaran et al. (2019), we consider the following:
$$
f(\lambda) = \frac{1}{2}\,\|X' w(\lambda) - y'\|^2, \qquad
w(\lambda) = \operatorname*{arg\,min}_{w\in\mathbb{R}^p} \frac{1}{2}\,\|Xw - y\|^2 + \frac{\beta}{2}\,\|w - \lambda\|^2,
$$
where $\beta\in\mathbb{R}_{++}$ and $\lambda\in\mathbb{R}^p$.

**Hyper-representation (HR).** The last setting, reminiscent of (Franceschi et al., 2018; Bertinetto et al., 2019), concerns learning a (common) linear transformation of the data and is formulated as
$$
f(H) = \frac{1}{2}\,\|X' H\, w(H) - y'\|^2, \qquad
w(H) = \operatorname*{arg\,min}_{w\in\mathbb{R}^d} \frac{1}{2}\,\|X H w - y\|^2 + \frac{\beta}{2}\,\|w\|^2,
$$
where $H\in\mathbb{R}^{p\times d}$ and $\beta\in\mathbb{R}_{++}$.
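As an illustration of how these settings fit the fixed-point formulation, the snippet below writes the BR lower-level problem as the contraction $\Phi(w,\lambda) = w - \alpha(\lambda)\nabla_1\ell(w,\lambda)$ of case (a) in Section 2.1, with $\alpha = 2/(\mu_\ell + L_\ell)$ obtained from the extreme eigenvalues of $X^\top X + \beta I$. The data are random stand-ins and the last line reuses the hypothetical `itd_hypergrad` interface sketched in Section 2.3; this is not the code used for the experiments.

```python
import torch

torch.manual_seed(0)
n_tr, n_val, p, beta = 80, 40, 30, 1.0
X,  y  = torch.randn(n_tr,  p), torch.randn(n_tr)      # synthetic training split
Xv, yv = torch.randn(n_val, p), torch.randn(n_val)     # synthetic validation split

# strong convexity / smoothness constants of the BR lower-level objective
eigs = torch.linalg.eigvalsh(X.T @ X + beta * torch.eye(p))
mu, L = eigs.min().item(), eigs.max().item()
alpha = 2.0 / (mu + L)                                  # step size of case (a), eq. (6)
q = (L / mu - 1.0) / (L / mu + 1.0)                     # contraction constant q_lambda
print(f"contraction constant q_lambda = {q:.4f}")

def phi(w, lam):
    # one gradient step on  l(w, lam) = 1/2 ||Xw - y||^2 + beta/2 ||w - lam||^2
    grad1_l = X.T @ (X @ w - y) + beta * (w - lam)
    return w - alpha * grad1_l

def E(w, lam):
    # upper-level (validation) objective; lam enters only through w(lam)
    return 0.5 * ((Xv @ w - yv) ** 2).sum()

lam = torch.zeros(p, requires_grad=True)
w0 = torch.zeros(p)
# e.g.  hypergrad = itd_hypergrad(phi, E, lam, w0, t=100)
```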
LR and KRR are high dimensional extensions of classical hyperparameter optimization problems, while BR and HR are typically encountered in multi-task/meta-learning as single-task objectives (in multi-task/meta-learning the upper-level objectives are averaged over multiple tasks and the hypergradient is simply the average of the single-task ones). Note that Assumption B (i.e. $\Phi(\cdot,\lambda)$ is a contraction) can be satisfied for each of the aforesaid scenarios, since they all belong to case (a) of Section 2.1 (KRR, BR and HR also to case (b)).

We solve the lower-level problem in the same way for both ITD and AID methods. In particular, in LR we use the gradient descent method with optimal step size as in case (a) of Section 2.1, while for the other cases we use the heavy-ball method with optimal step size and momentum constants. Note that this last method is not a contraction in the original norm, but only in a suitable norm depending on the lower-level problem itself. To compute the exact hypergradient, we differentiate $f(\lambda)$ directly using RMAD for KRR, BR and HR, where the closed form expression for $w(\lambda)$ is available, while for LR we use CG with $t = k = 2000$ in place of the (unavailable) analytic gradient.

Figure 1 shows how the approximation error is affected by the number of lower-level iterations $t$. As suggested by the iteration complexity bounds in Section 2, all the approximations, after a certain number of iterations, converge linearly to the true hypergradient (the asymptotic error can be quite large, probably due to numerical errors; more details in Appendix C). Furthermore, in line with our analysis (see Section 2.3), CG gives the best gradient estimate (on average), followed by FP, while ITD performs the worst. For HR, the error of all the methods increases significantly at the beginning, which can be explained by the fact that the heavy-ball method is not a contraction in the original norm and may diverge at first. CG $k = 10$ outperforms FP $k = 10$ on 3 out of 4 settings, but both remain far from convergence.

### 3.2. Bilevel Optimization

In this section, we aim to solve instances of the bilevel problem (1) in which $\lambda$ has a high dimensionality.

**Kernel Ridge Regression on Parkinson.** We take $f(\beta, \gamma)$ as defined in problem (22), where the data is taken from the UCI Parkinson dataset (Little et al., 2008), containing 195 biomedical voice measurements (22 features) from people with Parkinson's disease. To avoid projections, we replace $\beta$ and $\gamma$ respectively with $\exp(\beta)$ and $\exp(\gamma)$ in the RHS of the two equations in (22). We split the data randomly into three equal parts to make the train, validation and test sets.

Table 1. Objective (test accuracy) values after $s$ gradient descent steps, where $s$ is 1000, 500 and 4000 for Parkinson, 20 newsgroup and Fashion MNIST respectively. Test accuracy values are in %. $k_r = 10$ for Parkinson and 20 newsgroup, while for Fashion MNIST $k_r = 5$.

| Parkinson | t = 100 | t = 150 |
|---|---|---|
| ITD | 2.39 (75.8) | 2.11 (69.7) |
| FP k = t | 2.37 (81.8) | 2.20 (77.3) |
| CG k = t | 2.37 (78.8) | 2.20 (77.3) |
| FP k = k_r | 2.71 (80.3) | 2.60 (78.8) |
| CG k = k_r | 2.33 (77.3) | 2.02 (77.3) |

| 20 newsgroup | t = 10 | t = 25 | t = 50 |
|---|---|---|---|
| ITD | 1.08 (61.3) | 0.97 (62.8) | 0.89 (64.2) |
| FP k = t | 1.03 (62.1) | 1.02 (62.3) | 0.84 (64.4) |
| CG k = t | 0.93 (63.7) | 0.78 (63.3) | 0.64 (63.1) |
| FP k = k_r | - | 0.94 (63.6) | 0.97 (63.0) |
| CG k = k_r | - | 0.82 (64.3) | 0.75 (64.2) |

| Fashion MNIST | t = 5 | t = 10 |
|---|---|---|
| ITD | 0.41 (84.1) | 0.43 (83.8) |
| FP k = t | 0.41 (84.1) | 0.43 (83.8) |
| CG k = t | 0.42 (83.9) | 0.42 (84.0) |
| FP k = k_r | - | 0.42 (83.9) |
| CG k = k_r | - | 0.42 (84.0) |
**Logistic Regression on 20 Newsgroup.** This dataset (http://qwone.com/~jason/20Newsgroups/) contains 18000 news articles divided into 20 topics, and the features consist of 101631 tf-idf sparse vectors. We split the data randomly into three equal parts for training, validation and testing. We aim to solve the bilevel problem
$$
\min_{\lambda\in\mathbb{R}^p} \mathrm{CE}\big(X' w(\lambda), y'\big), \qquad
w(\lambda) = \operatorname*{arg\,min}_{w\in\mathbb{R}^{p\times c}} \mathrm{CE}(Xw, y) + \frac{1}{2cp}\sum_{i,j} \exp(\lambda_j)\, w_{ij}^2,
$$
where CE is the average cross-entropy loss, $c = 20$ and $p = 101631$. To improve the performance, we use warm-starts on the lower-level problem, i.e. we take $w_0(\lambda_i) = w_t(\lambda_{i-1})$ for all methods, where $(\lambda_i)_{i=1}^s$ are the upper-level iterates.

**Training Data Optimization on Fashion MNIST.** Similarly to (Maclaurin et al., 2015), we optimize the features of a set of 10 training points, each with a different class label, on the Fashion MNIST dataset (Xiao et al., 2017). More specifically, we define the bilevel problem as
$$
\min_{X\in\mathbb{R}^{c\times p}} \mathrm{CE}\big(X' w(X), y'\big), \qquad
w(X) = \operatorname*{arg\,min}_{w\in\mathbb{R}^{p\times c}} \mathrm{CE}(Xw, y) + \frac{\beta}{2}\,\|w\|^2,
$$
where $\beta = 1$, $c = 10$, $p = 784$, $y = (0, 1, \dots, c-1)$ and $(X', y')$ contains the usual training set.

We solve each problem using (hyper)gradient descent with fixed step size selected via grid search (additional details are provided in Appendix C.2). The results in Table 1 show the upper-level objective and test accuracy, both computed on the approximate lower-level solution $w_t(\lambda)$ after bilevel optimization (for completeness, we also report in the Appendix, Table 2, the upper-level objective and test accuracy computed on the exact lower-level solution $w(\lambda)$). For Parkinson and Fashion MNIST, there is little difference among the methods for a fixed $t$. For 20 newsgroup, CG $k = t$ reaches the lowest objective value, followed by CG $k = 10$. We recall that ITD has a memory cost which is linear in $t$ and that, in the case of 20 newsgroup, for some $t$ between 50 and 100 this cost exceeded the 11GB of memory on the GPU. AID methods instead require little memory and, by setting $k < t$, yield similar or even better performance at a lower computation time. Finally, we stress that since the upper-level objective is nonconvex, possibly with several minima, gradient descent with a more precise estimate of the hypergradient may more easily get trapped in a bad local minimum.

**Equilibrium Models.** Our last set of experiments investigates the behaviour of the hypergradient approximation methods on a simple instance of EQM (see problem (3)) on non-structured data. EQM are an attractive class of models due to their mathematical simplicity, enhanced interpretability and memory efficiency. A number of works (Miller & Hardt, 2019; Bai et al., 2019) have recently shown that EQM can perform on par with standard deep nets on a variety of complex tasks, renewing the interest in this kind of models. We use a subset of $n_e = 5000$ instances randomly sampled from the MNIST dataset as training data and employ a multiclass logistic classifier paired with a cross-entropy loss. We picked a small training set and purposefully avoided stochastic optimization methods to better focus on issues related to the computation of the hypergradient itself, avoiding the introduction of other sources of noise. We parametrize $\phi_i$ as
$$
\phi_i(w_i, \gamma) = \tanh(A w_i + B x_i + c) \quad \text{for } 1\le i\le n_e, \tag{23}
$$
where $x_i\in\mathbb{R}^p$ is the $i$-th example, $w_i\in\mathbb{R}^h$ and $\gamma = (A, B, c)\in\mathbb{R}^{h\times h}\times\mathbb{R}^{h\times p}\times\mathbb{R}^h$. Such a model may be viewed as an (infinite-layer) feed-forward neural network with tied weights, or as a recurrent neural network with static inputs. Additional experiments with convolutional equilibrium models may be found in Appendix C.3.
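A minimal sketch of the transition map (23) and of the singular-value projection used to keep it contractive (the constraint discussed in the next paragraph). Sizes, initialization scales and the value of $\varepsilon$ are illustrative, and the parameters $(A, B, c)$ are kept as module-level tensors for brevity.

```python
import torch

h, p = 200, 784                          # hidden size as in the experiments; p = MNIST pixels
A = torch.randn(h, h) * 0.05
B = torch.randn(h, p) * 0.05
c = torch.zeros(h)

def phi_batch(w, x):
    # transition function (23) applied row-wise to a batch; a contraction in w if ||A|| < 1
    return torch.tanh(w @ A.T + x @ B.T + c)

def project_spectral(M, eps=1e-2):
    # clip the singular values of M to [0, 1 - eps], so that ||M|| <= 1 - eps < 1
    U, S, Vh = torch.linalg.svd(M)
    return U @ torch.diag(S.clamp(max=1.0 - eps)) @ Vh

x = torch.randn(32, p)                   # a batch of inputs (stand-in for MNIST digits)
w = torch.zeros(32, h)
for _ in range(20):                      # t = 20 lower-level iterations as in the experiments
    w = phi_batch(w, x)
A = project_spectral(A)                  # in training, applied after each upper-level update
```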
Imposing $\|A\| < 1$ ensures that the transition functions (23), and hence $\Phi$, are contractions. This can be achieved during optimization by projecting the singular values of $A$ onto the interval $[0, 1-\varepsilon]$ for $\varepsilon > 0$. We note that regularizing the norm of $\partial_1\phi_i$ or adding L1 or L∞ penalty terms on $A$ may encourage, but does not strictly enforce, $\|A\| < 1$.

We conducted a series of experiments to ascertain the importance of the contractiveness of the map $\Phi$, as well as to understand which of the analysed methods is to be preferred in this setting. Since here $\partial_1\Phi$ is not symmetric, the conjugate gradient method must be applied on the normal equations of problem (15). We set $h = 200$ and use $t = 20$ fixed-point iterations to solve the lower-level problem in all the experiments. The first three plots of Figure 2 report training objectives, test accuracies and norms of the estimated hypergradient for each of the three methods, either applying or not the constraint on $A$, while the last explores the sensitivity of the methods to the choice of the learning rate. Unconstrained runs are marked accordingly in the figure legend.

[Figure 2. Experiments on EQM problems, with four panels: objective vs time (s), test accuracy vs iterations, hypergradient norm $\|g(\lambda)\|$ vs iterations, and test accuracy vs learning rate. Mean (solid or dashed lines) and point-wise minimum-maximum range (shaded regions) across 5 random seeds that only control the initialization of $\lambda$. The estimated hypergradient $g(\lambda)$ is equal to $\nabla f_t(\lambda)$ for ITD and $\hat\nabla f(\lambda)$ for AID. We used $t = k = 20$ for all methods and Nesterov momentum for optimizing $\lambda$, applying a projection operator at each iteration except for the unconstrained runs. When performing projection, the curves produced by the three approximation schemes mostly overlap, indicating essentially the same performance (although at a different computational cost).]

Referring to the rightmost plot, it is clear (large shaded regions) that not constraining the spectral norm results in unstable behaviour of the memory-less AID methods (green and blue lines) for all but a few learning rates, while ITD (violet), as expected, suffers comparatively less. On the contrary, when $\|A\| < 1$ is enforced, all the approximation methods are successful and stable, with FP to be preferred, being faster than CG on the normal equations and requiring substantially less memory than ITD. As a side note, referring to the left and center-left panels of Figure 2, we observe that projecting onto the spectral ball acts as a powerful regularizer, in line with the findings of Sedghi et al. (2019).

## 4. Conclusions

We studied a general class of bilevel problems where at the lower level we seek a solution to a parametric fixed-point equation. This formulation encompasses several learning algorithms recently considered in the literature. We established results on the iteration complexity of two strategies to compute the hypergradient (ITD and AID) under the assumption that the fixed-point equation is defined by a contraction mapping. Our practical experience with these methods on a number of bilevel problems indicates that there is a trade-off between the methods, with AID based on the conjugate gradient method being preferable due to a potentially better approximation of the hypergradient and lower space complexity. When the contraction assumption is not satisfied, however, our experiments on equilibrium models suggest that ITD is more reliable than AID methods.
In the future, it would be valuable to extend the ideas presented here to other challenging machine learning scenarios not covered by our theoretical analysis. These include bilevel problems in which the lower-level problem is only locally contractive, nonsmooth, possibly nonexpansive, or can only be solved via a stochastic procedure. At the same time, there is a need to clarify the tightness of the iteration complexity bounds presented here.

**Acknowledgments.** This work was supported in part by a grant from SAP SE.

## References

Almeida, L. B. A learning rule for asynchronous perceptrons with feedback in a combinatorial environment. In Proceedings, 1st First International Conference on Neural Networks, volume 2, pp. 609-618, 1987.

Amos, B. Differentiable optimization-based modeling for machine learning. PhD thesis, Carnegie Mellon University, 2019.

Amos, B. and Kolter, J. Z. OptNet: Differentiable optimization as a layer in neural networks. In Proceedings of the 34th International Conference on Machine Learning - Volume 70, pp. 136-145, 2017.

Andrychowicz, M., Denil, M., Gomez, S., Hoffman, M. W., Pfau, D., Schaul, T., Shillingford, B., and De Freitas, N. Learning to learn by gradient descent by gradient descent. In Advances in Neural Information Processing Systems, pp. 3981-3989, 2016.

Bai, S., Kolter, J. Z., and Koltun, V. Deep equilibrium models. In Advances in Neural Information Processing Systems, pp. 688-699, 2019.

Baydin, A. G., Pearlmutter, B. A., Radul, A. A., and Siskind, J. M. Automatic differentiation in machine learning: a survey. Journal of Machine Learning Research, 18(153), 2018.

Bergstra, J. and Bengio, Y. Random search for hyperparameter optimization. Journal of Machine Learning Research, 13(Feb):281-305, 2012.

Bertinetto, L., Henriques, J. F., Torr, P. H., and Vedaldi, A. Meta-learning with differentiable closed-form solvers. ICLR, 2019.

Denevi, G., Ciliberto, C., Grazzi, R., and Pontil, M. Learning-to-learn stochastic gradient descent with biased regularization. In Proc. 36th International Conference on Machine Learning, volume 97 of PMLR, pp. 1566-1575, 2019.

Domke, J. Generic methods for optimization-based modeling. In Artificial Intelligence and Statistics, pp. 318-326, 2012.

Elsken, T., Metzen, J. H., and Hutter, F. Neural architecture search: A survey. Journal of Machine Learning Research, 20(55):1-21, 2019.

Finn, C., Abbeel, P., and Levine, S. Model-agnostic meta-learning for fast adaptation of deep networks. In Proceedings of the 34th International Conference on Machine Learning - Volume 70, pp. 1126-1135, 2017.

Franceschi, L., Donini, M., Frasconi, P., and Pontil, M. Forward and reverse gradient-based hyperparameter optimization. In Proceedings of the 34th International Conference on Machine Learning - Volume 70, pp. 1165-1173, 2017.

Franceschi, L., Frasconi, P., Salzo, S., Grazzi, R., and Pontil, M. Bilevel programming for hyperparameter optimization and meta-learning. In International Conference on Machine Learning, pp. 1563-1572, 2018.

Griewank, A. and Walther, A. Evaluating Derivatives: Principles and Techniques of Algorithmic Differentiation, volume 105. SIAM, 2008.

Liao, R., Xiong, Y., Fetaya, E., Zhang, L., Yoon, K., Pitkow, X., Urtasun, R., and Zemel, R. Reviving and improving recurrent back-propagation. In International Conference on Machine Learning, pp. 3088-3097, 2018.

Little, M., McSharry, P., Hunter, E., Spielman, J., and Ramig, L.
Suitability of dysphonia measurements for telemonitoring of Parkinson's disease. Nature Precedings, pp. 1-1, 2008.

Liu, H., Simonyan, K., and Yang, Y. DARTS: Differentiable architecture search. arXiv preprint arXiv:1806.09055, 2018.

Lorraine, J., Vicol, P., and Duvenaud, D. Optimizing millions of hyperparameters by implicit differentiation. arXiv preprint arXiv:1911.02590, 2019.

Maclaurin, D., Duvenaud, D., and Adams, R. Gradient-based hyperparameter optimization through reversible learning. In International Conference on Machine Learning, pp. 2113-2122, 2015.

Mehra, A. and Hamm, J. Penalty method for inversion-free deep bilevel optimization. arXiv preprint arXiv:1911.03432, 2019.

Miller, J. and Hardt, M. Stable recurrent models. ICLR, 2019.

Nesterov, Y. E. A method for solving the convex programming problem with convergence rate O(1/k^2). In Dokl. Akad. Nauk SSSR, volume 269, pp. 543-547, 1983.

Paszke, A., Gross, S., Massa, F., Lerer, A., Bradbury, J., Chanan, G., Killeen, T., Lin, Z., Gimelshein, N., Antiga, L., Desmaison, A., Kopf, A., Yang, E., DeVito, Z., Raison, M., Tejani, A., Chilamkurthy, S., Steiner, B., Fang, L., Bai, J., and Chintala, S. PyTorch: An imperative style, high-performance deep learning library. In Advances in Neural Information Processing Systems 32, pp. 8024-8035, 2019.

Pedregosa, F. Hyperparameter optimization with approximate gradient. In International Conference on Machine Learning, pp. 737-746, 2016.

Pineda, F. J. Generalization of back-propagation to recurrent neural networks. Physical Review Letters, 59(19):2229, 1987.

Polyak, B. T. Introduction to Optimization. Optimization Software Inc. Publication Division, New York, NY, USA, 1987.

Rajeswaran, A., Finn, C., Kakade, S. M., and Levine, S. Meta-learning with implicit gradients. In Advances in Neural Information Processing Systems, pp. 113-124, 2019.

Scarselli, F., Gori, M., Tsoi, A. C., Hagenbuchner, M., and Monfardini, G. The graph neural network model. IEEE Transactions on Neural Networks, 20(1):61-80, 2008.

Sedghi, H., Gupta, V., and Long, P. M. The singular values of convolutional layers. ICLR, 2019.

Shaban, A., Cheng, C.-A., Hatch, N., and Boots, B. Truncated back-propagation for bilevel optimization. In The 22nd International Conference on Artificial Intelligence and Statistics, pp. 1723-1732, 2019.

Snoek, J., Larochelle, H., and Adams, R. P. Practical Bayesian optimization of machine learning algorithms. In Advances in Neural Information Processing Systems, pp. 2951-2959, 2012.

Xiao, H., Rasul, K., and Vollgraf, R. Fashion-MNIST: a novel image dataset for benchmarking machine learning algorithms, 2017.