# One-step differentiation of iterative algorithms

Jérôme Bolte, Toulouse School of Economics, Université Toulouse Capitole, Toulouse, France.
Edouard Pauwels, Toulouse School of Economics (IUF), Toulouse, France.
Samuel Vaiter, CNRS & Université Côte d'Azur, Laboratoire J. A. Dieudonné, Nice, France.

In appropriate frameworks, automatic differentiation is transparent to the user, at the cost of being a significant computational burden when the number of operations is large. For iterative algorithms, implicit differentiation alleviates this issue but requires a custom implementation of Jacobian evaluation. In this paper, we study one-step differentiation, also known as Jacobian-free backpropagation, a method that is as easy as automatic differentiation and as efficient as implicit differentiation for fast algorithms (e.g., superlinear optimization methods). We provide a complete theoretical approximation analysis with specific examples (Newton's method, gradient descent) along with its consequences in bilevel optimization. Several numerical examples illustrate the well-foundedness of the one-step estimator.

1 Introduction

Differentiating the solution of a machine learning problem is an important task, e.g., in hyperparameter optimization [9], in neural architecture search [31], and when using convex layers [3]. There are two main ways to achieve this goal: automatic differentiation (AD) and implicit differentiation (ID). Automatic differentiation implements the idea of evaluating derivatives through the compositional rules of differential calculus in a user-transparent way. It is a mature concept [27] implemented in several machine learning frameworks [36, 16, 1]. However, the time and memory complexity it incurs may become prohibitive as soon as the computational graph grows large, a typical example being the unrolling of iterative optimization algorithms such as gradient descent [5]. The alternative, implicit differentiation, is not always accessible: it does not rely solely on the compositional rules of differential calculus (Jacobian multiplication) and usually requires solving a linear system. The user needs to implement custom rules in an automatic differentiation framework (as done, for example, in [4]) or use dedicated libraries such as [11, 3, 10] implementing these rules for given models. Provided that the implementation is carefully done, this is most of the time the gold standard for the task of differentiating problem solutions.

Contributions. We study a one-step Jacobian approximator based on a simple principle: differentiate only the last iteration of an iterative process and drop all previous derivatives. The idea of dropping derivatives, or single-step differentiation, was explored, for example, in [24, 23, 39, 22, 40, 29]; our main contribution is a general account and approximation analysis for Jacobians of iterative algorithms. One-step estimation constitutes a rough approximation at first sight, and our motivation for studying it is its ease of use within automatic differentiation frameworks: no custom Jacobian-vector products (JVPs) or vector-Jacobian products (VJPs) are needed as long as a stop_gradient primitive is available (see Table 1).

Algorithm 1: Automatic
Input: θ ↦ x_0(θ) ∈ X, k > 0.
Eval (with_gradient): for i = 1, ..., k do x_i(θ) = F(x_{i−1}(θ), θ). Return: x_k(θ).
Differentiation: native autodiff on Eval.

Algorithm 2: Implicit
Input: x_0 ∈ X, k > 0.
Eval (stop_gradient): for i = 1, ..., k do x_i = F(x_{i−1}, θ). Return: x_k.
Differentiation: custom implicit VJP / JVP.

Algorithm 3: One-step
Input: x_0 ∈ X, k > 0.
Eval (stop_gradient): for i = 1, ..., k−1 do x_i = F(x_{i−1}, θ); (with_gradient) x_k(θ) = F(x_{k−1}, θ). Return: x_k(θ).
Differentiation: native autodiff on Eval.

| Algorithm | Implementation | Efficient |
| --- | --- | --- |
| Automatic | Native autodiff | no |
| Implicit | Custom | yes |
| One-step | Native autodiff | yes |

Table 1: Qualitative comparison of differentiation strategies. Native autodiff refers to widespread primitives in differentiable programming (e.g., grad in JAX). Custom JVP/VJP refers to specialized libraries such as jaxopt [12] or qpth [4] implementing implicit differentiation in specific contexts.

We conduct an approximation analysis of the one-step Jacobian (Corollary 1). The distance to the true Jacobian is driven by the distance to the solution (i.e., the quality of completion of the optimization phase) and by the lack of contractivity of the iteration mapping. This imprecision has to be balanced against the ease of implementation of the one-step Jacobian. The analysis suggests that one-step differentiation is efficient for small contraction factors, which correspond to fast algorithms. We indeed show that the one-step Jacobian is asymptotically correct for super-linearly convergent algorithms (Corollary 2) and enjoys approximation rates similar to those of implicit differentiation for quadratically convergent algorithms (Corollary 3). We exemplify these results with hypergradients in bilevel optimization, conduct a detailed complexity analysis, and highlight in Corollary 4 the estimation of approximate critical points. Finally, numerical illustrations show the practicality of the method on logistic regression using Newton's algorithm, an interior point solver for quadratic programming, and weighted ridge regression using gradient descent.

Related works. Automatic differentiation [27] was first proposed in its forward mode in [44] and in its reverse mode in [30]. The behaviour of differentiating iterative procedures by automatic differentiation was first analyzed in [25] and [8] in the optimization community. It was studied in the machine learning community for smooth methods [35, 32, 37] and for nonsmooth methods [14]. Implicit differentiation is a recent highlight in machine learning. It was shown to be a good way to estimate the Jacobian of problem solutions, for deep equilibrium networks [5], optimal transport [33], and also for nonsmooth problems [13], such as sparse models [10]. Inexact implicit differentiation was explored in [19]. Truncated estimation of the "Neumann series" for implicit differentiation is routinely used [32, 34], and truncated backpropagation was investigated in [42] for gradient descent iterations. In the very specific case of min-min problems, [2] studied the speed of convergence of automatic, implicit, and analytical differentiation. The closest work to ours is [22], under the name Jacobian-free backpropagation, but it differs significantly in the following ways. Their focus is on single implicit layer networks, and the guarantees are qualitative (descent direction in [22, Theorem 3.1]). In contrast, we provide quantitative results on abstract fixed points, applicable to any architecture. The idea of dropping derivatives was proposed in [21] for meta-learning; one-step differentiation was also investigated to train Transformer architectures [23] and to solve bilevel problems with quasi-Newton methods [39].
2 One-step differentiation

2.1 Automatic, implicit and one-step differentiation

Throughout the text, F : R^n × R^m → R^n denotes a recursive algorithmic map from R^n to R^n with m parameters. For any θ ∈ R^m, we write F_θ : x ↦ F(x, θ) and let F_θ^k denote the k-fold recursive composition of F_θ, for k ∈ N. The map F_θ defines a recursive algorithm as follows:

x_0(θ) ∈ R^n and x_{k+1}(θ) = F(x_k(θ), θ).  (1)

We denote by J_x F_θ the Jacobian matrix with respect to the variable x. The following assumption is sufficient to ensure a non-degenerate asymptotic behavior.

Assumption 1 (Contraction). Let F : R^n × R^m → R^n be C^1, 0 ≤ ρ < 1, and X ⊂ R^n be nonempty, convex and closed, such that for any θ ∈ R^m, F_θ(X) ⊂ X and ‖J_x F_θ‖_op ≤ ρ.

Remark 1. The main algorithms considered in this paper fall within the scope of smooth optimization. The algorithmic map F_θ is associated with a smooth parametric optimization problem given by f : R^n × R^m → R such that f_θ : x ↦ f(x, θ) is strongly convex, uniformly in θ. Two examples of algorithmic maps are gradient descent, F_θ(x) = x − α∇f_θ(x), and Newton's method, F_θ(x) = x − α∇²f_θ(x)^{-1}∇f_θ(x), for a positive step α > 0. For small step sizes, gradient descent provides a contraction and Newton's method provides a local contraction; both fall within the scope of Assumption 1. Other examples include inertial algorithms such as the Heavy Ball method [38], which has to be considered in phase space and for which a single iteration is not contracting, but a large number of iterations is (see e.g. [38, 35]).

The following lemma gathers known properties of the fixed point of F_θ, denoted by x̄(θ), whose derivatives we will be interested in estimating.

Lemma 1. Under Assumption 1, for each θ in R^m there is a unique fixed point of F_θ in X, denoted by x̄(θ), which is a C^1 function of θ. Furthermore, for all k ∈ N, we have

‖x_k(θ) − x̄(θ)‖ ≤ ρ^k ‖x_0(θ) − x̄(θ)‖ ≤ (ρ^k / (1 − ρ)) ‖x_0 − F_θ(x_0)‖.

This is well known; we briefly sketch the proof. The mapping F_θ is a ρ-contraction on X (use the convexity of X and the mean value theorem). Banach's fixed point theorem ensures existence and uniqueness, and differentiability is due to the implicit function theorem. The convergence rate is classical.

We are interested in the numerical evaluation of the Jacobian J_θ x̄(θ), which is well defined under Assumption 1. The automatic differentiation estimator J_AD x_k(θ) = J_θ x_k(θ) propagates the derivatives (either in forward or reverse mode) through the iterations, based on the piggyback recursion [26]: for i = 1, ..., k − 1,

J_θ x_{i+1}(θ) = J_x F(x_i(θ), θ) J_θ x_i(θ) + J_θ F(x_i(θ), θ).  (2)

Under Assumption 1 we have J_AD x_k(θ) → J_θ x̄(θ), and the convergence is asymptotically linear [25, 8, 35, 41, 14]. J_AD x_k(θ) is available in differentiable programming frameworks implementing common primitives such as backpropagation. The implicit differentiation estimator J_ID x_k(θ) is given by application of the implicit function theorem, using x_k as a surrogate for the fixed point x̄:

J_ID x_k(θ) = (I − J_x F(x_k(θ), θ))^{-1} J_θ F(x_k(θ), θ).  (3)

By continuity of the derivatives of F, we also have J_ID x_k(θ) → J_θ x̄(θ) as k → ∞ (see e.g. [27, Lemma 15.1] or [12, Theorem 1]). Implementing J_ID x_k(θ) requires either manual implementation or dedicated techniques or libraries [4, 3, 45, 28, 20, 12], as the matrix inversion operation is not directly expressed with common differentiable programming primitives. A related estimator is the inexact automatic differentiation (IAD) estimator, which implements (2) but with Jacobians evaluated at the last iterate x_k; it can be seen as an approximation of J_ID [35, 19].
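As a concrete illustration of the two estimators above, here is a minimal JAX sketch (ours, not the authors' code) of the piggyback recursion (2) and of the implicit estimator (3) for a generic map F. It assumes x_0 does not depend on θ, builds dense Jacobians with jax.jacfwd, and uses a direct linear solve for (3); F, theta, x0 are placeholders.

```python
import jax
import jax.numpy as jnp

def J_AD(F, theta, x0, k):
    # Piggyback recursion (2): propagate J_theta x_i alongside the iterates
    # (x0 is assumed independent of theta, so the recursion starts from zero).
    x, J = x0, jnp.zeros((x0.shape[0], theta.shape[0]))
    for _ in range(k):
        Jx = jax.jacfwd(F, argnums=0)(x, theta)  # J_x F(x_i, theta)
        Jt = jax.jacfwd(F, argnums=1)(x, theta)  # J_theta F(x_i, theta)
        x, J = F(x, theta), Jx @ J + Jt
    return x, J

def J_ID(F, theta, xk):
    # Implicit estimator (3): solve (I - J_x F) J = J_theta F at the last iterate.
    Jx = jax.jacfwd(F, argnums=0)(xk, theta)
    Jt = jax.jacfwd(F, argnums=1)(xk, theta)
    return jnp.linalg.solve(jnp.eye(xk.shape[0]) - Jx, Jt)
```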
The one-step estimator J_OS x_k(θ) is the Jacobian of the fixed point map for the last iteration only:

J_OS x_k(θ) = J_θ F(x_{k−1}(θ), θ).  (4)

```python
from jax import jacfwd
from jax.lax import stop_gradient

def F(x, theta):  # here a gradient step; alpha and grad_f (gradient of f) are given
    return x - alpha * grad_f(x, theta)

def iterative_procedure(theta, x0):
    x = x0
    for _ in range(100):
        x = F(x, theta)
    return x

# Automatic differentiation
J_ad = jacfwd(iterative_procedure)(theta, x0)

# Implicit differentiation
J_id = ...  # Custom implementation (e.g. jaxopt)

# One-step differentiation
def iterative_procedure_with_stop_grad(theta, x0):
    x = x0
    for _ in range(99):
        x = F(stop_gradient(x), stop_gradient(theta))
    x = F(x, theta)
    return x

J_os = jacfwd(iterative_procedure_with_stop_grad)(theta, x0)
```

Figure 1: Implementation of Algorithms 1, 2 and 3 in jax. F is a gradient step on some function f. The custom implementation of implicit differentiation is not explicitly stated. The function stop_gradient is available in jax.lax and jacfwd computes the full Jacobian using forward-mode AD.

Contrary to the automatic and implicit differentiation estimates, we do not have J_OS x_k(θ) → J_θ x̄(θ) in general as k → ∞, but we will see that the error is essentially proportional to ρ, and thus negligible for fast algorithms, for which the estimate is accurate. From a practical viewpoint, the three estimators J_AD, J_ID and J_OS can be implemented in a differentiable programming framework, such as jax, thanks to a stop_gradient primitive, as illustrated by Algorithms 1, 2 and 3. The computational effect of the stop_gradient primitive is to replace the actual Jacobian J_x F(x_i(θ), θ) by zero for chosen iterations i ∈ {1, ..., k}. Using it for all iterations except the last one allows one to implement J_OS in (4) via Algorithm 3. This illustrates the main interest of the one-step estimator: it can be implemented in any differentiable programming framework that provides a stop_gradient primitive and does not require a custom implementation of implicit differentiation. Figure 1 illustrates an implementation in jax for gradient descent.

2.2 Approximation analysis of one-step differentiation for linearly convergent algorithms

The following lemma is elementary. It describes the main mathematical mechanism at stake behind our analysis of one-step differentiation.

Lemma 2. Let A ∈ R^{n×n} with ‖A‖_op ≤ ρ < 1 and B, B̂ ∈ R^{n×m}. Then

(I − A)^{-1} B − B̂ = A (I − A)^{-1} B + B − B̂.

Moreover, we have the following estimate:

‖(I − A)^{-1} B − B̂‖_op ≤ (ρ / (1 − ρ)) ‖B‖_op + ‖B − B̂‖_op.

Proof: First, for any v ∈ R^n, we have ‖(I − A)v‖ ≥ ‖Iv‖ − ‖Av‖ ≥ (1 − ρ)‖v‖, which shows that I − A is invertible (the kernel of I − A is trivial). We also deduce that ‖(I − A)^{-1}‖_op ≤ 1/(1 − ρ). Second, we have (I − A)^{-1} − I = A(I − A)^{-1}, since ((I − A)^{-1} − I)(I − A) = A, and therefore

A(I − A)^{-1} B + B − B̂ = ((I − A)^{-1} − I) B + B − B̂ = (I − A)^{-1} B − B̂.

The norm bound follows from the submultiplicativity of the operator norm, the triangle inequality, and the facts that ‖A‖_op ≤ ρ and ‖(I − A)^{-1}‖_op ≤ 1/(1 − ρ).

Corollary 1. Let F and X be as in Assumption 1, such that θ ↦ F(x, θ) is L_F-Lipschitz for every x ∈ R^n and x ↦ J_θ F(x, θ) is L_J-Lipschitz (in operator norm) for every θ ∈ R^m. Then, for all θ ∈ R^m,

‖J_OS x_k(θ) − J_θ x̄(θ)‖_op ≤ ρ L_F / (1 − ρ) + L_J ‖x_{k−1} − x̄(θ)‖.  (5)

Proof: The result follows from Lemma 2 with A = J_x F(x̄(θ), θ), B = J_θ F(x̄(θ), θ) and B̂ = J_θ F(x_{k−1}, θ), using the facts that ‖B‖_op ≤ L_F and ‖B − B̂‖_op ≤ L_J ‖x̄(θ) − x_{k−1}‖.

Remark 2 (Comparison with implicit differentiation). In [12] a similar bound is described for J_ID: roughly, under the additional assumption that x ↦ F(x, θ) also has an L_J-Lipschitz Jacobian, one has

‖J_ID x_k(θ) − J_θ x̄(θ)‖_op ≤ (L_J L_F / (1 − ρ)²) ‖x_k − x̄(θ)‖ + (L_J / (1 − ρ)) ‖x_k − x̄(θ)‖.  (6)

For small ρ and large k, the main difference between the two bounds (5) and (6) lies in their first terms, which are of the same order whenever ρ and L_J ‖x̄(θ) − x_{k−1}‖ are of the same order.
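As a small self-contained sanity check of the bound (5) (ours, not from the paper), consider the toy contraction F(x, θ) = x − α(x − θ): the fixed point is x̄(θ) = θ with J_θ x̄(θ) = I, ρ = 1 − α, L_F = α and L_J = 0, so the one-step error should equal ρ L_F / (1 − ρ) = 1 − α exactly.

```python
import jax
import jax.numpy as jnp

alpha = 0.3
F = lambda x, theta: x - alpha * (x - theta)  # contraction with rho = 1 - alpha

def one_step(theta, x0, k):
    # Algorithm 3 on the toy map: gradients blocked on all but the last iteration.
    x = x0
    for _ in range(k - 1):
        x = F(jax.lax.stop_gradient(x), jax.lax.stop_gradient(theta))
    return F(x, theta)

theta, x0 = jnp.array([1.0, -2.0]), jnp.zeros(2)
J_os = jax.jacfwd(one_step)(theta, x0, 50)
error = jnp.linalg.norm(J_os - jnp.eye(2), ord=2)  # observed error of J_OS
bound = (1 - alpha) * alpha / alpha                # rho * L_F / (1 - rho) from (5)
print(error, bound)                                # both equal 1 - alpha = 0.7
```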
Corollary 1 provides a bound on ‖J_OS x_k(θ) − J_θ x̄(θ)‖_op which is asymptotically proportional to ρ. This means that for fast linearly convergent algorithms, i.e. ρ ≪ 1, one-step differentiation provides a good approximation of the actual derivative. Besides, given F satisfying Assumption 1 with a given ρ < 1, not necessarily small, one can set F̃_θ = F_θ^K for some K ∈ N. In this case, F̃ satisfies Assumption 1 with contraction factor ρ^K, and the one-step estimator (4) applied to F̃ becomes a K-step estimator on F itself: we only differentiate through the K last steps of the algorithm, which amounts to truncated backpropagation [42] (a sketch of this K-step variant is given after Corollary 2 below).

Example 1 (Gradient descent). Let f : R^n × R^m → R be such that f(·, θ) is μ-strongly convex (μ > 0) with L-Lipschitz gradient for all θ ∈ R^m. Then the gradient mapping F : (x, θ) ↦ x − α ∇_x f(x, θ) satisfies Assumption 1 with ρ = max{1 − αμ, αL − 1}, which is smaller than 1 as long as 0 < α < 2/L. The optimal step α = 2/(L + μ) leads to a contraction factor ρ = 1 − 2μ/(L + μ). Assuming that ∇²_{xθ} f is also L-Lipschitz, Corollary 1 holds with L_F = L_J = 2L/(μ + L) ≤ 2. For the step size 1/L, Corollary 1 holds with L_F = L_J = 1 and ρ = 1 − μ/L. In both cases, the contraction factor ρ is close to 0 when L/μ ≈ 1, that is, for well-conditioned problems. As outlined above, we may consider F̃_θ = F_θ^K, in which case Corollary 1 applies with a smaller value of ρ, recovering the result of [42, Proposition 3.1].

2.3 Superlinear and quadratic algorithms

The one-step Jacobian estimator (4), as implemented in Algorithm 3, is obviously not an exact estimator, in the sense that one does not necessarily have J_OS x_k(θ) → J_θ x̄(θ) as k → ∞. However, it is easy to see that this estimator is exact in the case of exact single-step algorithms, meaning that F satisfies F(x, θ) = x̄(θ) for all x, θ. Indeed, in this case one has J_x F(x, θ) = 0 and J_θ F(x, θ) = J_θ x̄(θ) for all x, θ. Such a situation occurs, for example, when applying Newton's method to an unconstrained quadratic problem. This is a very degenerate situation, as it does not really make sense to talk about an iterative algorithm in this case. It turns out that this property of being asymptotically correct remains valid for very fast algorithms, that is, algorithms that require few iterations to converge, the archetypal example being Newton's method, for which we obtain quantitative estimates.

2.3.1 Super-linear algorithms

The following is a typical property of fast converging algorithms.

Assumption 2 (Vanishing Jacobian). Assume that F : R^n × R^m → R^n is C^1 and that the recursion x_{k+1} = F_θ(x_k) converges globally, locally uniformly in θ, to the unique fixed point x̄(θ) of F_θ, which satisfies J_x F(x̄(θ), θ) = 0.

Note that under Assumption 2 it is always possible to find a small neighborhood of x̄ on which ‖J_x F_θ‖_op remains small; that is, Assumption 1 holds locally and Lemma 1 applies. Furthermore, the derivative estimate is asymptotically correct, as follows.

Corollary 2 (Jacobian convergence). Let F : R^n × R^m → R^n be as in Assumption 2. Then J_OS x_k(θ) → J_θ x̄(θ) as k → ∞, and J_OS x̄(θ) = J_θ x̄(θ).

Proof: Since J_x F(x̄(θ), θ) = 0, implicit differentiation of the fixed point equation reduces to J_θ x̄(θ) = J_θ F(x̄(θ), θ), and the result follows by continuity of the derivatives.
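As announced in Section 2.2, the K-step variant F̃_θ = F_θ^K can be implemented with the same stop_gradient mechanism. Below is a minimal JAX sketch (ours, under the assumptions above), where F, theta, x0 are placeholders and K = 1 recovers Algorithm 3.

```python
import jax
from jax.lax import stop_gradient

def truncated_procedure(F, theta, x0, k, K):
    # Differentiate only through the last K of k iterations (truncated backpropagation);
    # K = 1 recovers the one-step estimator of Algorithm 3.
    x = x0
    for _ in range(k - K):
        x = F(stop_gradient(x), stop_gradient(theta))
    for _ in range(K):
        x = F(x, theta)
    return x

# K-step Jacobian estimate, e.g. with K = 5 over k = 100 iterations:
# J_K = jax.jacfwd(lambda th: truncated_procedure(F, th, x0, 100, 5))(theta)
```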
Example 2 (Superlinearly convergent algorithm). Assume that F is C^1 and that for each ρ > 0 there is R > 0 such that ‖F_θ(x) − x̄(θ)‖ ≤ ρ ‖x − x̄(θ)‖ for all x, θ with ‖x − x̄(θ)‖ ≤ R. Then Corollary 2 applies since, for any v,

‖J_x F(x̄(θ), θ) v‖ = lim_{t→0} ‖F(x̄(θ) + tv, θ) − x̄(θ)‖ / t ≤ ρ ‖v‖

for every ρ > 0, so that J_x F(x̄(θ), θ) = 0.

2.3.2 Quadratically convergent algorithms

Figure 2: Newton's method: quadratic convergence, as a function of the iteration, of the distance to the optimum and of the one-step, forward (automatic) and implicit differentiation errors.

Under additional quantitative assumptions, it is possible to obtain more precise convergence estimates, similar to those obtained for implicit differentiation, see Remark 2.

Corollary 3. Let F be as in Assumption 2, such that x ↦ J_{(x,θ)} F(x, θ) (the joint Jacobian in (x, θ)) is L_J-Lipschitz (in operator norm). Then the recursion is asymptotically quadratically convergent and, for each k ≥ 1,

‖J_OS x_k(θ) − J_θ x̄(θ)‖_op ≤ L_J ‖x_{k−1}(θ) − x̄(θ)‖.  (7)

Proof: Following the same argument as in the proof of Corollary 2, we have

‖J_OS x_k(θ) − J_θ x̄(θ)‖_op = ‖J_θ F(x_{k−1}(θ), θ) − J_θ F(x̄(θ), θ)‖_op ≤ L_J ‖x_{k−1}(θ) − x̄(θ)‖.  (8)

As for the quadratic convergence, we may assume that x̄(θ) = 0 and drop the θ variable to simplify notation. We have F(0) = 0 and, for all x,

‖F(x)‖ = ‖∫_0^1 J_x F(tx) x dt‖ ≤ ‖x‖ ∫_0^1 ‖J_x F(tx)‖_op dt ≤ ‖x‖² L_J ∫_0^1 t dt = L_J ‖x‖² / 2.

Thus (L_J/2) ‖x_{k+1}‖ ≤ [(L_J/2) ‖x_k‖]², and asymptotic quadratic convergence follows.

Example 3 (Newton's algorithm). Assume that f : R^n × R^m → R is C^3 with Lipschitz derivatives and that, for each θ, x ↦ f(x, θ) is μ-strongly convex. Then Newton's algorithm with backtracking line search satisfies the hypotheses of Corollary 3, see [15, Sec. 9.5.3]. Indeed, it takes unit steps after a finite number of iterations; denoting by x̄(θ) the unique solution to ∇_x f(x, θ) = 0, for all x locally around x̄(θ),

F(x, θ) = x − ∇²_{xx} f(x, θ)^{-1} ∇_x f(x, θ).

We have ∇²_{xx} f(x, θ) F(x, θ) = ∇²_{xx} f(x, θ) x − ∇_x f(x, θ); differentiating, using tensor notation,

∇²_{xx} f(x, θ) J_x F(x, θ) = ∇³ f(x, θ)[x, ·, ·] − ∇³ f(x, θ)[F(x, θ), ·, ·],

so that J_x F(x̄(θ), θ) = 0, and Lipschitz continuity of the derivatives of f implies Lipschitz continuity of J_x F(x, θ), using the facts that ∇²_{xx} f(x, θ) ⪰ μI and that matrix inversion is smooth and Lipschitz on such matrices. The quadratic convergence of the three derivative estimates for Newton's method is illustrated on a logistic regression example in Figure 2.

3 Hypergradient descent for bilevel problems

Consider the following bilevel optimization problem

min_θ g(x̄(θ))  s.t.  x̄(θ) ∈ argmin_y f(y, θ),

where g and f are C^1 functions. We will consider bilevel problems such that the inner minimum is uniquely attained and can be described by a fixed point equation x = F(x, θ), where F is as in Assumption 1. The problem may then be rewritten as

min_θ g(x̄(θ))  s.t.  x̄(θ) = F(x̄(θ), θ),  (9)

see the illustrations in Remark 1.

| Method | Time | Memory | Error sources |
| --- | --- | --- | --- |
| Piggyback recursion | kn(ωC_F + nm) | n(n + m) | suboptimality + burn-in |
| AD forward-mode | kωC_F m | n + m | suboptimality + burn-in |
| AD reverse-mode | kωC_F | kn | suboptimality + burn-in |
| Implicit differentiation | ωC_F n + n³ | n | suboptimality |
| One-step differentiation | ωC_F | n | suboptimality + lack of contractivity |
| Forward algorithm | kC_F | n | suboptimality |

Table 2: Time and memory complexities of the estimators in Section 2.1 (up to multiplicative constants). F has time complexity C_F and we consider k iterations in R^n with m parameters. ω is the multiplicative overhead of evaluating a gradient (cheap gradient principle).

Gradient descent (or hypergradient descent) on (9) using our one-step estimator (4) consists in the following recursion:

θ_{l+1} = θ_l − α J_OS x_k(θ_l)^T ∇g(x_k(θ_l)),  (10)

where α > 0 is a step size parameter.
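A minimal JAX sketch (ours, not the authors' implementation) of the recursion (10): the hypergradient J_OS x_k(θ)^T ∇g(x_k(θ)) is obtained by applying jax.grad to the composition of g with Algorithm 3; F, g, theta0, x0 are placeholders.

```python
import jax
from jax.lax import stop_gradient

def one_step_solver(F, theta, x0, k):
    # Algorithm 3: only the last inner iteration is differentiated.
    x = x0
    for _ in range(k - 1):
        x = F(stop_gradient(x), stop_gradient(theta))
    return F(x, theta)

def hypergradient_descent(F, g, theta0, x0, k, n_outer, alpha):
    # Recursion (10): theta_{l+1} = theta_l - alpha * J_OS x_k(theta_l)^T grad g(x_k(theta_l)).
    outer_obj = lambda theta: g(one_step_solver(F, theta, x0, k))
    theta = theta0
    for _ in range(n_outer):
        theta = theta - alpha * jax.grad(outer_obj)(theta)
    return theta
```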
Note that the quantity J_OS x_k(θ)^T ∇g(x_k) is exactly what is obtained by applying backpropagation to the composition of g and Algorithm 3, without any further custom modification of backpropagation. Note also that in practice one may use amortized algorithms, such as [18]. This section is dedicated to the theoretical guarantees that can be obtained with such a procedure; proofs are postponed to Appendix A.

3.1 Complexity analysis of different hypergradient strategies

We essentially follow the complexity considerations of [27, Section 4.6]. Let C_F denote the computational time cost of evaluating the fixed-point map F, and let ω > 0 be the multiplicative overhead of gradient evaluation; in typical applications, ω ≤ 5 (cheap gradient principle [6]). The time cost of evaluating the Jacobian of F is nωC_F (n gradients). Forward algorithm evaluation (i.e., computing x_k) has time cost kC_F with a fixed memory cost of n. The vanilla piggyback recursion (2) requires k − 1 full Jacobians and matrix multiplications of cost n²m. The forward mode of AD has time complexity kωC_F m (m partial derivatives are computed, each costing ω times the time to evaluate the forward algorithm), and requires storing the iterate vector of size n and m partial derivatives. The reverse mode of AD has time complexity kωC_F (cheap gradient principle on F_θ^k) and requires storing k vectors of size n. Implicit differentiation requires one full Jacobian (ωC_F n) and the solution of one linear system of size n × n, that is roughly n³. Finally, one-step differentiation only requires differentiating a single step of the algorithm, at cost ωC_F. For each estimate, the distance to the solution results in derivative errors. In addition, automatic differentiation based estimates may suffer from the burn-in effect [41], while one-step differentiation suffers from a lack of contractivity as in Corollary 1. We summarize the discussion in Table 2. Let us remark that, if C_F ≥ n², then reverse AD has a computational advantage over implicit differentiation when k ≤ n, which makes sense for fast converging algorithms; but in this case one-step differentiation has a small error and a computational advantage compared to reverse AD.

Remark 3 (Implicit differentiation: but on which equation?). Table 2 is informative yet formal. In practical scenarios, implicit differentiation should be performed using the simplest equation available, not necessarily the fixed point equation of F. This can significantly affect the computational time required. For instance, when using Newton's method, F(x, θ) = x − [∇²_{xx} f(x, θ)]^{-1} ∇_x f(x, θ), implicit differentiation should be applied to the gradient mapping ∇_x f = 0, not to F. In typical applications C_{∇f} = O(n²), and the dominant cost of implicit differentiation is then O(n³), which is of the same order as one-step differentiation since C_F = O(n³) (a linear system needs to be solved). However, if the implicit step were performed on F instead of ∇f, it would incur a prohibitive cost of O(n⁴). In conclusion, the implicit differentiation phase is not only custom in terms of implementation, but also in the very choice of the equation.
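To make Remark 3 concrete, here is a minimal JAX sketch (ours) of implicit differentiation applied to the gradient mapping ∇_x f(x, θ) = 0 rather than to the Newton map F: the hypergradient then only requires one extra n × n linear solve; f, g, theta and x_star are placeholders.

```python
import jax
import jax.numpy as jnp

def implicit_hypergradient(f, g, theta, x_star):
    # Differentiate grad_x f(x, theta) = 0 at an (approximate) solution x_star:
    # grad_theta g(x(theta)) = - (d/dtheta grad_x f)^T [d/dx grad_x f]^{-1} grad g(x_star).
    grad_x = jax.grad(f, argnums=0)
    H = jax.jacfwd(grad_x, argnums=0)(x_star, theta)  # n x n Hessian in x
    C = jax.jacfwd(grad_x, argnums=1)(x_star, theta)  # n x m cross derivatives
    w = jnp.linalg.solve(H, jax.grad(g)(x_star))      # single linear solve
    return -C.T @ w
```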
3.2 Approximation analysis of one-step differentiation

The following corollary, close to [42, Prop. 3.1], provides a bound on the gradient estimator obtained by one-step differentiation (Algorithm 3) for (9). The bound depends on ρ, the contraction factor, and on the distance to the solution for the inner algorithm. The proof is given in Appendix A.

Corollary 4. Let F : R^n × R^m → R^n be as in Corollary 1 and consider the bilevel problem (9), where g is a C^1, l_g-Lipschitz function with l_∇-Lipschitz gradient. Then,

‖∇_θ(g∘x̄)(θ) − J_OS x_k(θ)^T ∇g(x_k)‖ ≤ ρ L_F l_g / (1 − ρ) + L_J l_g ‖x_{k−1} − x̄(θ)‖ + L_F l_∇ ‖x̄(θ) − x_k‖.

3.3 Approximate critical points

The following lemma is known, but we provide a proof for completeness in Appendix A.

Lemma 3. Assume that h : R^m → R is C^1 with L-Lipschitz gradient and lower bounded by h*. Assume that, for some ε > 0 and all l ∈ N, ‖θ_{l+1} − θ_l + (1/L)∇h(θ_l)‖ ≤ ε/L. Then, for all K ∈ N, K ≥ 1, we have

min_{l=0,...,K} ‖∇h(θ_l)‖² ≤ ε² + 2L(h(θ_0) − h*) / (K + 1).

Combined with Corollary 4, this provides a complexity estimate for the recursion in (10).

Corollary 5. Under the setting of Corollary 4, consider the iterates in (10) with k ≥ 2 and assume the following: sup_θ ‖x_0(θ) − F_θ(x_0(θ))‖ ≤ M for some M > 0; F is L_F-Lipschitz and J_{(x,θ)}F is L_J-Lipschitz, jointly in operator norm; g is C^1, l_g-Lipschitz with l_∇-Lipschitz gradient; 1/α ≥ L := [L_J/(1 − ρ) (L_F/(1 − ρ) + 1) l_g + l_∇ L_F/(1 − ρ)] L_F/(1 − ρ); and g∘x̄ is lower bounded by g*. Then, setting ε = (ρ/(1 − ρ)) (L_F l_g + (L_J l_g + L_F l_∇) M ρ^{k−2}), for all K ∈ N,

min_{l=0,...,K} ‖∇_θ(g∘x̄)(θ_l)‖² ≤ ε² + 2L((g∘x̄)(θ_0) − g*) / (K + 1).

The level of approximate criticality is the sum of a term proportional to ρ and a term inversely proportional to K. For large values of k (many steps on the inner problem) and K (many steps on the outer problem), approximate criticality is essentially proportional to ρ L_F l_g / (1 − ρ), which is small if ρ is close to 0 (e.g., for superlinear algorithms).

4 Numerical experiments

We illustrate our findings on three different problems. First, we consider Newton's method applied to regularized logistic regression, as well as an interior point solver for quadratic problems. These are two fast converging algorithms for which the results of Section 2.3 apply, and the one-step procedure provides accurate estimates of the derivative with a computational overhead that is negligible with respect to solution evaluation, as for implicit differentiation. We then consider the gradient descent algorithm applied to a ridge regression problem to illustrate the behavior of the one-step procedure for linearly convergent algorithms.

Logistic regression using Newton's algorithm. Let A ∈ R^{N×n} be a design matrix, whose first column is made of 1s to model an intercept. Its rows are denoted by (a_1, ..., a_N). Let x ∈ R^n and y ∈ {−1, 1}^N. We consider the regularized logistic regression problem

min_{x ∈ R^n} Σ_{i=1}^N θ_i ℓ(⟨a_i, x⟩ y_i) + λ ‖x_{−1}‖²,  (11)

Figure 3: Timing experiment to evaluate one gradient (two panels, "Interior point for QP" and "Newton for logistic regression"; x-axis: total number of parameters, in thousands). Left: differentiable QP in (12); the one-step and implicit estimators agree up to an error of order 10^{-16}. Right: Newton's method on logistic regression (11); the one-step and implicit estimators agree up to an error of order 10^{-12}. The label "None" represents solving time, while "Autodiff", "Implicit" and "One step" represent solving time plus gradient evaluation for each estimator of Section 2.1. The mean is depicted, with shading indicating the standard deviation estimated over 10 runs.

where ℓ is the logistic loss, ℓ : t ↦ log(1 + exp(−t)), λ > 0 is a regularization parameter, and x_{−1} denotes the vector made of the entries of x except the first coordinate (we do not penalize the intercept). This problem can be solved using Newton's method, which we implement in jax with backtracking line search (Wolfe condition). Gradient and Hessian are evaluated using jax automatic differentiation, and the matrix inversion operations are performed with an explicit call to a linear system solver. We denote by x̄(θ) the solution of problem (11) and evaluate the gradient of θ ↦ ‖x̄(θ)‖²/2 using the three algorithms presented in Section 2.1.
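A minimal JAX sketch (ours, not the authors' code) of the inner objective (11) and of a plain Newton step; A, y, lam are placeholders, and the backtracking (Wolfe) line search used in the experiment is omitted.

```python
import jax
import jax.numpy as jnp

def logistic_objective(x, theta, A, y, lam):
    # Weighted regularized logistic regression (11); the intercept x[0] is not penalized.
    margins = (A @ x) * y
    return jnp.sum(theta * jnp.logaddexp(0.0, -margins)) + lam * jnp.sum(x[1:] ** 2)

def newton_step(x, theta, A, y, lam):
    # One full Newton step; gradient and Hessian come from jax automatic differentiation.
    grad = jax.grad(logistic_objective)(x, theta, A, y, lam)
    hess = jax.hessian(logistic_objective)(x, theta, A, y, lam)
    return x - jnp.linalg.solve(hess, grad)
```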
We simulate data with Gaussian class-conditional distributions for different values of N and n. The results are presented in Figure 3, where we report the time required by the algorithms as a function of the number of parameters needed to specify problem (11), in our case the size of A plus the size of y, that is (n + 1)N. Figure 3 illustrates that both one-step and implicit differentiation incur a marginal computational overhead, contrary to algorithmic differentiation. In this experiment, the one-step estimator actually has a slight advantage in terms of computation time compared to implicit differentiation.

Interior point solver for quadratic programming. The content of this section is motivated by elements described in [4], which is associated with a pytorch library implementing a standard interior point solver. Consider the following quadratic program (QP):

min_{x ∈ R^n} (1/2) x^T Q x + q^T x  s.t.  Ax = θ, Gx ≤ h,  (12)

where Q ∈ R^{n×n} is positive definite, A ∈ R^{m×n} and G ∈ R^{p×n} are matrices, and q ∈ R^n, θ ∈ R^m and h ∈ R^p are vectors. We consider x̄(θ), the solution of problem (12), as a function of θ, the right-hand side of the equality constraint. We implemented in jax a standard primal-dual interior point solver for problem (12). Following [4], we use the implementation described in [43], and we solve linear systems with explicit calls to a dedicated solver. For generic inputs, this algorithm converges very fast, which we observed empirically. The differentiable programming capacities of jax can readily be used to implement the automatic differentiation and one-step derivative estimators without requiring custom interfaces as in [4]. Implicit differentiation for problem (12) was proposed in [4] with an efficient pytorch implementation; we implemented these elements in jax in order to evaluate J_θ x̄(θ) using implicit differentiation. More details on this experiment are given in Appendix B. We evaluate the gradient of the function θ ↦ ‖x̄(θ)‖²/2 using the three algorithms presented in Section 2.1. We generate random instances of the QP (12) of various sizes. The number of parameters needed to describe each instance is n(n + 1) + (n + 1)m + (n + 1)p. The results are presented in Figure 3, where we report the time required by the algorithms as a function of the number of parameters required to specify problem (12). In all our experiments, the implicit and one-step estimates agree up to order 10^{-6}. From Figure 3, we see that both one-step and implicit differentiation incur a marginal additional computational overhead, contrary to algorithmic differentiation.

Figure 4: Differentiation of gradient descent for solving weighted ridge regression on cpusmall. Top row: condition number of 1000. Bottom row: condition number of 100. Left column: small learning rate 1/L. Right column: big learning rate 2/(μ + L). Dotted (resp. solid) lines represent the error of the iterates (resp. of the Jacobians).

Weighted ridge using gradient descent. We consider a weighted ridge problem with A ∈ R^{N×n}, y ∈ R^N, λ > 0 and a vector of weights θ ∈ R^N:

x̄(θ) = argmin_{x ∈ R^n} f_θ(x) = (1/2) Σ_{i=1}^N θ_i (y_i − ⟨a_i, x⟩)² + (λ/2) ‖x‖².

We solve this problem using gradient descent with an adequate step size, F(x, θ) = x − α∇f_θ(x), with x_0(θ) = 0, and we consider the K-step truncated Jacobian propagation F̃ = F_θ^K with K = 1/κ, where κ is the effective condition number of the Hessian. Figure 4 benchmarks the use of automatic differentiation, one-step differentiation, and implicit differentiation on the data set cpusmall provided by LIBSVM [17] for two types of step sizes.
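A minimal JAX sketch (ours) of the pieces used in this experiment: the weighted ridge objective above, the two step sizes 1/L and 2/(μ + L) of Figure 4 computed from the Hessian spectrum, and the gradient map F(x, θ) = x − α∇f_θ(x); A, y, theta, lam are placeholders.

```python
import jax
import jax.numpy as jnp

def ridge_loss(x, theta, A, y, lam):
    # Weighted ridge objective; quadratic, so mu and L are its extreme Hessian eigenvalues.
    return 0.5 * jnp.sum(theta * (y - A @ x) ** 2) + 0.5 * lam * jnp.sum(x ** 2)

def step_sizes(theta, A, lam):
    # Hessian of the objective is A^T diag(theta) A + lam * I.
    H = A.T @ (theta[:, None] * A) + lam * jnp.eye(A.shape[1])
    eigs = jnp.linalg.eigvalsh(H)      # ascending eigenvalues
    mu, L = eigs[0], eigs[-1]
    return 1.0 / L, 2.0 / (mu + L)     # "small" and "big" learning rates of Figure 4

def gradient_step(x, theta, A, y, lam, alpha):
    # The algorithmic map F(x, theta) = x - alpha * grad f_theta(x).
    return x - alpha * jax.grad(ridge_loss)(x, theta, A, y, lam)
```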
We monitor both ‖x_k(θ) − x̄(θ)‖² for the iterates and ‖J_θ x_k(θ) − J_θ x̄(θ)‖² for the Jacobian matrices. As expected, implicit differentiation is faster and more precise; it is our gold standard, but it requires a custom implicit system to be implemented (or the use of an external library). For large steps, autodiff suffers from the burn-in phenomenon described in [41], which does not impact the one-step estimator. Therefore, for a fixed time budget, the one-step strategy allows one to obtain higher iterate accuracy and similar, or better, Jacobian accuracy. In the small step regime, one-step differentiation provides a trade-off: for a fixed time budget, one obtains better estimates of the iterates and worse estimates of the Jacobian matrices. Our results suggest that K-step truncated backpropagation allows one to save computation time, at the cost of possibly degraded derivatives compared to full backpropagation, in line with [42].

5 Conclusion

We studied one-step differentiation, also known as Jacobian-free backpropagation, of a generic iterative algorithm, and provided convergence guarantees depending on the convergence rate of the underlying algorithm. In particular, we showed that one-step differentiation of a quadratically convergent algorithm, such as Newton's method, leads to a quadratically convergent estimate of the Jacobian. A future direction of research would be to understand how to extend our findings to the nonsmooth world, as in [14] for linearly convergent algorithms.

Acknowledgements

The authors acknowledge the support of the AI Interdisciplinary Institute ANITI funding, through the French "Investments for the Future" PIA3 program under the grant agreement ANR-19-PI3A0004, of the Air Force Office of Scientific Research, Air Force Material Command, USAF, under grant number FA8655-22-1-7012, and of ANR MaSDOL 19-CE23-0017-0, ANR Chess, grant ANR-17-EURE-0010, ANR Regulia, and ANR GraVa ANR-18-CE40-0005. Jérôme Bolte also acknowledges the support of Centre Lagrange and TSE-P.

References

[1] Martín Abadi, Ashish Agarwal, Paul Barham, Eugene Brevdo, Zhifeng Chen, Craig Citro, Greg S. Corrado, Andy Davis, Jeffrey Dean, Matthieu Devin, Sanjay Ghemawat, Ian Goodfellow, Andrew Harp, Geoffrey Irving, Michael Isard, Yangqing Jia, Rafal Jozefowicz, Lukasz Kaiser, Manjunath Kudlur, Josh Levenberg, Dandelion Mané, Rajat Monga, Sherry Moore, Derek Murray, Chris Olah, Mike Schuster, Jonathon Shlens, Benoit Steiner, Ilya Sutskever, Kunal Talwar, Paul Tucker, Vincent Vanhoucke, Vijay Vasudevan, Fernanda Viégas, Oriol Vinyals, Pete Warden, Martin Wattenberg, Martin Wicke, Yuan Yu, and Xiaoqiang Zheng. TensorFlow: Large-scale machine learning on heterogeneous systems, 2015. Software available from tensorflow.org.

[2] Pierre Ablin, Gabriel Peyré, and Thomas Moreau. Super-efficiency of automatic differentiation for functions defined as a minimum. In ICML, volume 119, pages 32–41, 2020.

[3] Akshay Agrawal, Brandon Amos, Shane Barratt, Stephen Boyd, Steven Diamond, and J. Zico Kolter. Differentiable convex optimization layers. Advances in Neural Information Processing Systems, 32, 2019.

[4] Brandon Amos and J. Zico Kolter. OptNet: Differentiable optimization as a layer in neural networks. In International Conference on Machine Learning, pages 136–145. PMLR, 2017.

[5] Shaojie Bai, J. Zico Kolter, and Vladlen Koltun. Deep equilibrium models. Advances in Neural Information Processing Systems, 32, 2019.

[6] Walter Baur and Volker Strassen. The complexity of partial derivatives. Theoretical Computer Science, 22(3):317–330, 1983.

[7] Amir Beck. First-order methods in optimization. SIAM, 2017.
[8] Thomas Beck. Automatic differentiation of iterative processes. Journal of Computational and Applied Mathematics, 50(1-3):109–118, 1994.

[9] Yoshua Bengio. Gradient-based optimization of hyperparameters. Neural Computation, 12(8):1889–1900, 2000.

[10] Q. Bertrand, Q. Klopfenstein, M. Blondel, S. Vaiter, A. Gramfort, and J. Salmon. Implicit differentiation of Lasso-type models for hyperparameter optimization. In ICML, 2020.

[11] Mathieu Blondel, Quentin Berthet, Marco Cuturi, Roy Frostig, Stephan Hoyer, Felipe Llinares-López, Fabian Pedregosa, and Jean-Philippe Vert. Efficient and modular implicit differentiation. arXiv preprint arXiv:2105.15183, 2021.

[12] Mathieu Blondel, Quentin Berthet, Marco Cuturi, Roy Frostig, Stephan Hoyer, Felipe Llinares-López, Fabian Pedregosa, and Jean-Philippe Vert. Efficient and modular implicit differentiation. Advances in Neural Information Processing Systems, 35:5230–5242, 2022.

[13] Jérôme Bolte, Tam Le, Edouard Pauwels, and Tony Silveti-Falls. Nonsmooth implicit differentiation for machine-learning and optimization. In NeurIPS, volume 34, pages 13537–13549, 2021.

[14] Jérôme Bolte, Edouard Pauwels, and Samuel Vaiter. Automatic differentiation of nonsmooth iterative algorithms. In Advances in Neural Information Processing Systems, 2022.

[15] Stephen Boyd and Lieven Vandenberghe. Convex Optimization. Cambridge University Press, 2004.

[16] James Bradbury, Roy Frostig, Peter Hawkins, Matthew James Johnson, Chris Leary, Dougal Maclaurin, George Necula, Adam Paszke, Jake VanderPlas, Skye Wanderman-Milne, and Qiao Zhang. JAX: composable transformations of Python+NumPy programs, 2018.

[17] Chih-Chung Chang and Chih-Jen Lin. LIBSVM: A library for support vector machines. ACM Transactions on Intelligent Systems and Technology (TIST), 2(3):1–27, 2011.

[18] Mathieu Dagréou, Pierre Ablin, Samuel Vaiter, and Thomas Moreau. A framework for bilevel optimization that enables stochastic and global variance reduction algorithms. In NeurIPS, 2022.

[19] Matthias J. Ehrhardt and Lindon Roberts. Analyzing inexact hypergradients for bilevel learning. arXiv preprint arXiv:2301.04764, 2023.

[20] Laurent El Ghaoui, Fangda Gu, Bertrand Travacca, Armin Askari, and Alicia Tsai. Implicit deep learning. SIAM Journal on Mathematics of Data Science, 3(3):930–958, 2021.

[21] Chelsea Finn, Pieter Abbeel, and Sergey Levine. Model-agnostic meta-learning for fast adaptation of deep networks. In ICML, volume 70, pages 1126–1135, 2017.

[22] Samy Wu Fung, Howard Heaton, Qiuwei Li, Daniel McKenzie, Stanley Osher, and Wotao Yin. JFB: Jacobian-free backpropagation for implicit models. Proceedings of the AAAI Conference on Artificial Intelligence, 2022.

[23] Zhengyang Geng, Meng-Hao Guo, Hongxu Chen, Xia Li, Ke Wei, and Zhouchen Lin. Is attention better than matrix decomposition? In ICLR, 2021.

[24] Zhengyang Geng, Xin-Yu Zhang, Shaojie Bai, Yisen Wang, and Zhouchen Lin. On training implicit models. Advances in Neural Information Processing Systems, 34:24247–24260, 2021.

[25] Jean Charles Gilbert. Automatic differentiation and iterative processes. Optimization Methods and Software, 1(1):13–21, 1992.

[26] Andreas Griewank and Christèle Faure. Piggyback differentiation and optimization. In Large-scale PDE-constrained optimization, pages 148–164. Springer, 2003.

[27] Andreas Griewank and Andrea Walther. Evaluating derivatives: principles and techniques of algorithmic differentiation. SIAM, 2008.

[28] Fangda Gu, Heng Chang, Wenwu Zhu, Somayeh Sojoudi, and Laurent El Ghaoui. Implicit graph neural networks. Advances in Neural Information Processing Systems, 33:11984–11995, 2020.
[29] Howard Heaton, Daniel McKenzie, Qiuwei Li, Samy Wu Fung, Stanley Osher, and Wotao Yin. Learn to predict equilibria via fixed point networks. arXiv preprint arXiv:2106.00906, 2021.

[30] Seppo Linnainmaa. The representation of the cumulative rounding error of an algorithm as a Taylor expansion of the local rounding errors. Master's thesis, Univ. Helsinki, 1970.

[31] Hanxiao Liu, Karen Simonyan, and Yiming Yang. DARTS: Differentiable architecture search. In ICLR, 2019.

[32] Jonathan Lorraine, Paul Vicol, and David Duvenaud. Optimizing millions of hyperparameters by implicit differentiation. In AISTATS, pages 1540–1552, 2020.

[33] Giulia Luise, Alessandro Rudi, Massimiliano Pontil, and Carlo Ciliberto. Differential properties of Sinkhorn approximation for learning with Wasserstein distance. In NeurIPS, volume 31, 2018.

[34] Jelena Luketina, Mathias Berglund, Klaus Greff, and Tapani Raiko. Scalable gradient-based tuning of continuous regularization hyperparameters. In ICML, pages 2952–2960, 2016.

[35] Sheheryar Mehmood and Peter Ochs. Automatic differentiation of some first-order methods in parametric optimization. In International Conference on Artificial Intelligence and Statistics, pages 1584–1594. PMLR, 2020.

[36] Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory Chanan, Trevor Killeen, Zeming Lin, Natalia Gimelshein, Luca Antiga, Alban Desmaison, Andreas Kopf, Edward Yang, Zachary DeVito, Martin Raison, Alykhan Tejani, Sasank Chilamkurthy, Benoit Steiner, Lu Fang, Junjie Bai, and Soumith Chintala. PyTorch: An imperative style, high-performance deep learning library. In Advances in Neural Information Processing Systems 32, pages 8024–8035, 2019.

[37] Edouard Pauwels and Samuel Vaiter. The derivatives of Sinkhorn-Knopp converge. SIAM Journal on Optimization, 33(3):1494–1517, 2023.

[38] Boris Polyak. Introduction to Optimization. Optimization Software, Publications Division, 1987.

[39] Zaccharie Ramzi, Florian Mannel, Shaojie Bai, Jean-Luc Starck, Philippe Ciuciu, and Thomas Moreau. SHINE: SHaring the INverse Estimate from the forward pass for bi-level optimization and implicit models. In International Conference on Learning Representations, 2022.

[40] Subham Sekhar Sahoo, Anselm Paulus, Marin Vlastelica, Vít Musil, Volodymyr Kuleshov, and Georg Martius. Backpropagation through combinatorial algorithms: Identity with projection works. In The Eleventh International Conference on Learning Representations, 2023.

[41] Damien Scieur, Gauthier Gidel, Quentin Bertrand, and Fabian Pedregosa. The curse of unrolling: Rate of differentiating through optimization. Advances in Neural Information Processing Systems, 35:17133–17145, 2022.

[42] Amirreza Shaban, Ching-An Cheng, Nathan Hatch, and Byron Boots. Truncated back-propagation for bilevel optimization. In AISTATS, pages 1723–1732. PMLR, 2019.

[43] Lieven Vandenberghe. The CVXOPT linear and quadratic cone program solvers. Online: http://cvxopt.org/documentation/coneprog.pdf, 2010.

[44] R. E. Wengert. A simple automatic derivative evaluation program. Communications of the ACM, 7(8):463–464, August 1964.

[45] Zico Kolter, David Duvenaud, and Matt Johnson. Deep implicit layers - neural ODEs, deep equilibrium models, and beyond, 2020. NeurIPS Tutorial.
A Proofs of Section 3

Proof of Corollary 4: We have ∇_θ(g∘x̄)(θ) = J_θ x̄(θ)^T ∇g(x̄(θ)), and

J_θ x̄(θ)^T ∇g(x̄(θ)) − J_OS x_k(θ)^T ∇g(x_k) = (J_θ x̄(θ)^T − J_OS x_k(θ)^T) ∇g(x̄(θ)) − J_OS x_k(θ)^T (∇g(x_k) − ∇g(x̄(θ))).

The result follows from the triangle inequality combined with Corollary 1 and the following estimates:

‖(J_θ x̄(θ)^T − J_OS x_k(θ)^T) ∇g(x̄(θ))‖ ≤ ‖J_θ x̄(θ) − J_OS x_k(θ)‖_op ‖∇g(x̄(θ))‖ ≤ l_g ‖J_θ x̄(θ) − J_OS x_k(θ)‖_op,

‖J_OS x_k(θ)^T (∇g(x_k) − ∇g(x̄(θ)))‖ ≤ ‖J_OS x_k(θ)‖_op ‖∇g(x_k) − ∇g(x̄(θ))‖ ≤ L_F l_∇ ‖x̄(θ) − x_k‖.

Proof of Lemma 3: Using the descent lemma (see [7]), for all l = 0, ..., K,

h(θ_{l+1}) ≤ h(θ_l) + ⟨∇h(θ_l), θ_{l+1} − θ_l⟩ + (L/2) ‖θ_{l+1} − θ_l‖².

Writing θ_{l+1} − θ_l = −(1/L)∇h(θ_l) + e_l with ‖e_l‖ ≤ ε/L, and expanding the last two terms,

⟨∇h(θ_l), θ_{l+1} − θ_l⟩ + (L/2) ‖θ_{l+1} − θ_l‖² = −(1/(2L)) ‖∇h(θ_l)‖² + (L/2) ‖e_l‖² ≤ −(1/(2L)) (‖∇h(θ_l)‖² − ε²),

so that

h(θ_{l+1}) ≤ h(θ_l) − (1/(2L)) (min_{l'=0,...,K} ‖∇h(θ_{l'})‖² − ε²).

Summing for l = 0, ..., K, we have

h* − h(θ_0) ≤ h(θ_{K+1}) − h(θ_0) ≤ −((K + 1)/(2L)) (min_{l=0,...,K} ‖∇h(θ_l)‖² − ε²),

and we deduce that

min_{l=0,...,K} ‖∇h(θ_l)‖² ≤ ε² + 2L (h(θ_0) − h*) / (K + 1);

by concavity of the square root, this also yields min_{l=0,...,K} ‖∇h(θ_l)‖ ≤ ε + √(2L (h(θ_0) − h*) / (K + 1)).

Lemma 4. Let A : R^d → R^{n×m} and B : R^d → R^{m×p} be M_A- and M_B-bounded and L_A- and L_B-Lipschitz, respectively, in operator norm. Then AB is (M_A L_B + L_A M_B)-Lipschitz in operator norm.

Proof: For all x, y ∈ R^d,

‖A(x)B(x) − A(y)B(y)‖_op = ‖(A(x) − A(y))B(x) − A(y)(B(y) − B(x))‖_op ≤ ‖(A(x) − A(y))B(x)‖_op + ‖A(y)(B(y) − B(x))‖_op ≤ ‖A(x) − A(y)‖_op ‖B(x)‖_op + ‖A(y)‖_op ‖B(y) − B(x)‖_op ≤ (L_A M_B + M_A L_B) ‖x − y‖.

Lemma 5. The function M ↦ (I − M)^{-1} is (1 − ρ)^{-2}-Lipschitz in operator norm on the set of square matrices M such that ‖M‖_op ≤ ρ < 1.

Proof: For any two square matrices A, B with operator norm bounded by ρ < 1, I − A and I − B are invertible and the operator norm of their inverses is at most (1 − ρ)^{-1} (see the proof of Lemma 2). We have

(I − A) ((I − A)^{-1} − (I − B)^{-1}) (I − B) = (I − B) − (I − A) = A − B,

hence (I − A)^{-1} − (I − B)^{-1} = (I − A)^{-1} (A − B) (I − B)^{-1}, and

‖(I − A)^{-1} − (I − B)^{-1}‖_op = ‖(I − A)^{-1} (A − B) (I − B)^{-1}‖_op ≤ ‖(I − A)^{-1}‖_op ‖A − B‖_op ‖(I − B)^{-1}‖_op ≤ (1 − ρ)^{-2} ‖A − B‖_op.

Proof of Corollary 5: Let us evaluate a Lipschitz constant of ∇_θ(g∘x̄). We have

∇_θ(g∘x̄)(θ) = J_θ x̄(θ)^T ∇g(x̄(θ)) = [(I − J_x F(x̄(θ), θ))^{-1} J_θ F(x̄(θ), θ)]^T ∇g(x̄(θ)).

This can be seen as the composition of a matrix-valued function with x̄. Combining Lemma 4 and Lemma 5 with the fact that J_x F and J_θ F are both L_J-Lipschitz and L_F-bounded, the function (x, θ) ↦ (I − J_x F(x, θ))^{-1} J_θ F(x, θ) is L_J L_F (1 − ρ)^{-2} + L_J/(1 − ρ) = (L_J/(1 − ρ)) (L_F/(1 − ρ) + 1) Lipschitz. Invoking Lemma 4 again and using the fact that the operator norm of a vector is its Euclidean norm, the function (x, θ) ↦ [(I − J_x F(x, θ))^{-1} J_θ F(x, θ)]^T ∇g(x) is (L_J/(1 − ρ)) (L_F/(1 − ρ) + 1) l_g + l_∇ L_F/(1 − ρ) Lipschitz. From the implicit function theorem, ‖J_θ x̄(θ)‖_op ≤ L_F/(1 − ρ), so that x̄ is L_F/(1 − ρ)-Lipschitz and ∇_θ(g∘x̄) is

[(L_J/(1 − ρ)) (L_F/(1 − ρ) + 1) l_g + l_∇ L_F/(1 − ρ)] L_F/(1 − ρ)

Lipschitz, so that the choice of α ensures that 1/α is a Lipschitz constant for the gradient. Using Corollary 4, we are in the setting of Lemma 3 and the result follows.

B Experimental details for Section 4

B.1 QP experiment

Implicit differentiation: We explicitly formed the Jacobians of the iterative process (function f) and solved the underlying linear systems using a solver in jax. The inversion was not done based on VJPs (using fixed point iterations or conjugate gradient) because, for the relatively small scale of our problems, pure linear algebra was more efficient. We did not form the full implicit Jacobian in equation (3) using full matrix inversion; we only solved a linear system involving a left incoming vector in equation (3). In this sense, we did implement a VJP for implicit differentiation, although we explicitly formed Jacobians because this was the most efficient option.

Generation of random QPs: To generate the QP instances, the quantities n, m, p are varied on a logarithmic scale. We set Q = M^T M where M is of size n × n with entries uniform in [−1, 1].
Constraint matrices A and G also have entries uniform in [−1, 1], and we chose m and p to be smaller than n so that feasibility is generic and occurs with probability one.
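A minimal JAX sketch (ours) of this generation procedure; the distributions of q and h are not specified above, so those two choices are placeholders.

```python
import jax
import jax.numpy as jnp

def random_qp(key, n, m, p):
    # Random QP instance: Q = M^T M with M uniform in [-1, 1] (positive definite a.s.),
    # A (m x n) and G (p x n) uniform in [-1, 1], with m, p < n.
    k1, k2, k3, k4, k5 = jax.random.split(key, 5)
    M = jax.random.uniform(k1, (n, n), minval=-1.0, maxval=1.0)
    Q = M.T @ M
    A = jax.random.uniform(k2, (m, n), minval=-1.0, maxval=1.0)
    G = jax.random.uniform(k3, (p, n), minval=-1.0, maxval=1.0)
    # q and h are placeholder choices, not specified in the experimental details.
    q = jax.random.uniform(k4, (n,), minval=-1.0, maxval=1.0)
    h = jax.random.uniform(k5, (p,), minval=-1.0, maxval=1.0)
    return Q, q, A, G, h
```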