# Automatic differentiation of nonsmooth iterative algorithms

Jérôme Bolte (Toulouse School of Economics, University of Toulouse Capitole, Toulouse, France), Edouard Pauwels (IRIT, CNRS, Université de Toulouse; Institut Universitaire de France (IUF), Toulouse, France), Samuel Vaiter (CNRS & Université Côte d'Azur, Laboratoire J. A. Dieudonné, Nice, France).

**Abstract.** Differentiation along algorithms, i.e., piggyback propagation of derivatives, is now routinely used to differentiate iterative solvers in differentiable programming. Asymptotics is well understood for many smooth problems, but the nondifferentiable case is hardly considered. Is there a limiting object for nonsmooth piggyback automatic differentiation (AD)? Does it have any variational meaning and can it be used effectively in machine learning? Is there a connection with classical derivatives? All these questions are addressed under appropriate nonexpansivity conditions in the framework of conservative derivatives, which has proved useful in understanding nonsmooth AD. We characterize the attractor set of nonsmooth piggyback iterations as a set-valued fixed point which remains in the conservative framework. Among various consequences, we obtain almost everywhere convergence of classical derivatives. Our results are illustrated on parametric convex optimization with the forward-backward, Douglas-Rachford and Alternating Direction Method of Multipliers algorithms, as well as the Heavy-Ball method.

## 1 Introduction

[Figure 1 — schematic: the iterates $x_k(\theta)$, their Jacobians $Jx_k(\theta)$, and the question "derivative?"] Figure 1: We study existence and meaning of $J^{\mathrm{pb}}_{\bar x}$ as a derivative of $\bar x$, compatible with automatic differentiation of the iterates $(x_k(\theta))_{k \in \mathbb{N}}$.

**Differentiable programming.** We consider a Lipschitz function $F : \mathbb{R}^p \times \mathbb{R}^m \to \mathbb{R}^p$, representing an iterative algorithm parameterized by $\theta \in \mathbb{R}^m$, with Lipschitz initialization $x_0 : \theta \mapsto x_0(\theta)$ and

$$x_{k+1}(\theta) = F(x_k(\theta), \theta) = F_\theta(x_k(\theta)), \qquad (1)$$

where $F_\theta := F(\cdot, \theta)$, under the assumption that $x_k(\theta)$ converges to the unique fixed point of $F_\theta$: $\bar x(\theta) = \operatorname{fix}(F_\theta)$. Such recursions represent, for instance, algorithms solving a parametric optimization problem $\min_x h_\theta(x)$ (e.g., empirical risk minimization), such as gradient descent: $F(x, \theta) = x - \gamma \nabla h_\theta(x)$. But (1) could also be a fixed-point equation such as a deep equilibrium network [5]. In the last years, a paradigm shift occurred: such algorithms are now implemented in algorithmic differentiation (AD)-friendly frameworks such as TensorFlow [1], PyTorch [42] or JAX [13]. For a differentiable $F$, it is possible to compute iteratively the derivatives of $x_k$ with respect to $\theta$ using the rules of differential calculus, resulting in the so-called piggyback recursion:

$$\frac{\partial x_{k+1}}{\partial \theta}(\theta) = \partial_1 F(x_k(\theta), \theta)\, \frac{\partial x_k}{\partial \theta}(\theta) + \partial_2 F(x_k(\theta), \theta), \qquad (2)$$

where $\frac{\partial x_k}{\partial \theta}$ is the Jacobian of $x_k$ with respect to $\theta$.

[Figure 2 — four columns (Ridge (FB), Lasso (FB), Sparse Inv. Covar. (DR), Trend Filtering (ADMM)), x-axis: iteration $k$] Figure 2: Illustration of the linear convergence of proximal splitting methods. (First line) Distance of the iterates to the fixed point. (Second line) Distance of the piggyback Jacobians to the Jacobian of the fixed point. The acronyms are FB for Forward-Backward, DR for Douglas-Rachford and ADMM for Alternating Direction Method of Multipliers. In all cases, despite nonsmoothness, piggyback Jacobians converge, illustrating Corollary 2. Blue lines represent the median of 100 repetitions with random data, and the blue shaded area represents the first and last deciles.
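To make the smooth recursion (2) concrete, here is a minimal NumPy sketch (an illustrative example, not code from the paper) that propagates the Jacobian of gradient-descent iterates for a ridge-type quadratic, where the parameter $\theta$ is the regularization weight, and compares the result with the closed-form Jacobian of the solution.

```python
import numpy as np

rng = np.random.default_rng(0)
p, n = 5, 20
A, b = rng.standard_normal((n, p)), rng.standard_normal(n)
theta = 1.0
gamma = 1.0 / (np.linalg.norm(A, 2) ** 2 + 1.0)   # fixed step size, < 2/L for theta = 1

def F(x, theta):
    # One gradient step on h_theta(x) = 0.5*||Ax - b||^2 + 0.5*theta*||x||^2.
    return x - gamma * (A.T @ (A @ x - b) + theta * x)

def dF(x, theta):
    # Partial Jacobians of F: A_k = dF/dx and B_k = dF/dtheta (theta is scalar, m = 1).
    Ak = np.eye(p) - gamma * (A.T @ A + theta * np.eye(p))
    Bk = -gamma * x[:, None]
    return Ak, Bk

x, J = np.zeros(p), np.zeros((p, 1))        # x_0(theta) = 0, so J_0 = dx_0/dtheta = 0
for _ in range(500):
    Ak, Bk = dF(x, theta)
    x, J = F(x, theta), Ak @ J + Bk         # piggyback recursion (2)

# Closed-form solution and Jacobian of the ridge problem, for comparison.
H = A.T @ A + theta * np.eye(p)
x_bar = np.linalg.solve(H, A.T @ b)
J_bar = -np.linalg.solve(H, x_bar)[:, None]
print(np.linalg.norm(x - x_bar), np.linalg.norm(J - J_bar))
```

Both printed distances are close to zero, illustrating that the piggyback Jacobian converges to the Jacobian of the limit in this smooth case.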
In practice, automatic differentiation frameworks do not compute the full Jacobian, but rather vector-Jacobian products in reverse mode (backpropagation) [48] or Jacobian-vector products in forward mode [53]. We consider the full Jacobian, and therefore our findings apply to both modes. We focus on two issues arising with nonsmooth recursions, illustrated in Figure 1: (i) what can be said about the chain rule (2) and its asymptotics when the function $F$ is not smooth (for example, a projected gradient step)? (ii) how can its asymptotics be interpreted as a notion of derivative for $\bar x$, the fixed point of $F_\theta$? We propose a joint answer to both questions, providing a solid theoretical ground to the idea of algorithmic differentiation of numerical solvers involving nonsmooth components in a differentiable programming context.

**Related works.** Algorithmic use of the chain rule (2) to differentiate programs has its roots in [53], with forward differentiation, and later in reverse mode [35]. Along with the development of AD, convergence of the iterative sequence (2) was investigated, notably in the optimization community, as reviewed in [28]. This important survey paper gathers results in differentiable programming rediscovered/reused later: implicit differentiation [43, 45] and its stability [8], adjoint fixed-point iterations [5] (a key aspect of deep equilibrium networks) and linear convergence of (2). Notably, linear convergence of Jacobians was investigated in [25, 27] for the forward mode and in [15, Theorem 2.3] for the reverse mode. This was more recently investigated for $C^2$ functions in imaging for primal-dual algorithms [14, 9], and in machine learning for gradient descent [39, 36] and the Heavy-Ball method [39]. In the specific context of min-min problems, the authors in [2] proved the linear convergence of the Jacobians. The use of AD for nonsmooth functions was justified with the notion of conservative Jacobians [12, 11], with a nonsmooth version of the chain rule for compositional models. Correctness of AD was also investigated in [34] for a large class of piecewise analytic functions, and in [33], where a qualification condition is used to compute a Clarke Jacobian. Along with AD, a natural way to differentiate a model (1) is implicit differentiation, recently applied in several works [5, 3, 21]. In a nonsmooth context, an implicit function theorem [10] was proved for path-differentiable functions. In terms of applications, nonsmooth piggyback derivatives are applied to hyperparameter tuning for inverse problems in [8], while the case of the Lasso was investigated in [7]. Other relevant applications include plug-and-play denoising [32], parameter selection [19] and bilevel programming [41].

**Contributions.** Under suitable nonexpansivity assumptions, our contributions are as follows; we address both questions illustrated in Figure 1 for nonsmooth recursions.

- Set-valued extensions of the piggyback recursion (2) have a well-defined limit: the fixed point of a subset map (Theorem 1), which is conservative for the fixed point map $\bar x$. This is a nonsmooth infinite chain rule for AD (Theorem 2).
- For almost all $\theta$, despite nonsmoothness, recursion (2) is well defined using the classical Jacobian and converges to the classical Jacobian of the fixed point $\bar x$ (Corollary 2). This has implications for both forward and reverse modes of AD.
- For a large class of functions (Lipschitz gradient selections), it is possible to give a quantitative rate estimate (Corollary 3), namely to prove linear convergence of the derivatives.
- We show that these results apply to proximal splitting algorithms in nonsmooth convex optimization, including forward-backward (Proposition 2), Douglas-Rachford (Proposition 3) and ADMM; a numerical illustration of the convergence of derivatives is given in Figure 2.
- We also illustrate that, contrary to the smooth case, nonsmooth piggyback derivatives of momentum methods such as Heavy-Ball may diverge even if the iterates converge linearly (Proposition 4).

**Notations.** A function $f : \mathbb{R}^p \to \mathbb{R}^m$ is locally Lipschitz if, for each $x \in \mathbb{R}^p$, there exists a neighborhood of $x$ on which $f$ is Lipschitz. Denoting by $R \subset \mathbb{R}^p$ the full-measure set where $f$ is differentiable, the Clarke Jacobian [16] at $x \in \mathbb{R}^p$ is defined as

$$\operatorname{Jac}^c f(x) = \operatorname{conv}\left\{ M \in \mathbb{R}^{m \times p} :\ \exists\, (y_k)_{k \ge 0} \text{ s.t. } y_k \in R,\ \lim_{k \to \infty} y_k = x,\ \lim_{k \to \infty} \frac{\partial f}{\partial y}(y_k) = M \right\}.$$

The Clarke subdifferential $\partial^c f$ is defined similarly. Given two matrices $A, B$ with compatible dimensions, $[A, B]$ is their concatenation. For a set $X$, $\operatorname{conv} X$ is its convex hull. The symbol $B$ denotes a unit ball; the corresponding norm should be clear from the context.

## 2 Nonsmooth piggyback differentiation

We first show how the notion of conservative Jacobians allows us to justify rigorously the existence of a nonsmooth equivalent of the piggyback iterations in (2) that is compatible with AD.

**Conservative Jacobians.** Conservative Jacobians were introduced in [12] as a generalization of derivatives to study automatic differentiation of nonsmooth functions. Given a locally Lipschitz continuous function $f : \mathbb{R}^p \to \mathbb{R}^m$, the set-valued map $J : \mathbb{R}^p \rightrightarrows \mathbb{R}^{m \times p}$ is a conservative Jacobian for the path-differentiable $f$ if $J$ has a closed graph, is locally bounded and nowhere empty, with

$$\frac{\mathrm{d}}{\mathrm{d}t} f(\gamma(t)) = V \dot\gamma(t) \quad \text{for all } V \in J(\gamma(t)), \text{ for almost every } t, \qquad (4)$$

for any $\gamma : [0, 1] \to \mathbb{R}^p$ absolutely continuous with respect to the Lebesgue measure. Conservative gradients are defined similarly. We refer to [12] for extensive examples and properties of this class of functions; key ideas are recalled in Appendix A for completeness. Let us mention that the classes of convex functions, definable functions, or semialgebraic functions are contained in the set of path-differentiable functions. Given $D_f : \mathbb{R}^p \rightrightarrows \mathbb{R}^p$, a conservative gradient for $f : \mathbb{R}^p \to \mathbb{R}$, we have:

- (Clarke subgradient) for all $x \in \mathbb{R}^p$, $\partial^c f(x) \subset \operatorname{conv}(D_f(x))$.
- (Gradient almost everywhere) $D_f(x) = \{\nabla f(x)\}$ for almost all $x \in \mathbb{R}^p$.
- (Calculus) differential calculus rules preserve conservativity, e.g., sums and compositions of conservative Jacobians are conservative Jacobians.

Finally, $D_f$ can be used as a first-order optimization oracle for methods of gradient type [11].

**Piggyback differentiation of recursive algorithms.** The following assumption is standing throughout the text.

**Assumption 1** (The conservative Jacobian of the iteration mapping is a contraction). $F$ is locally Lipschitz and path differentiable, jointly in $(x, \theta)$, and $J_F$ is a conservative Jacobian for $F$. There exists $0 \le \rho < 1$ such that, for any $(x, \theta) \in \mathbb{R}^p \times \mathbb{R}^m$ and any pair $[A, B] \in J_F(x, \theta)$, with $A \in \mathbb{R}^{p \times p}$ and $B \in \mathbb{R}^{p \times m}$, the operator norm of $A$ is at most $\rho$. $J_{x_0}$ is a conservative Jacobian for the initialization function $\theta \mapsto x_0(\theta)$.

Under Assumption 1, $F_\theta$ is a strict contraction, so that $(x_k(\theta))_{k \in \mathbb{N}}$ converges linearly to $\bar x(\theta) = \operatorname{fix}(F_\theta)$, the unique fixed point of the iteration mapping $F_\theta$. More precisely, for all $k \in \mathbb{N}$, we have
$$\|x_k(\theta) - \bar x(\theta)\| \le \frac{\rho^k}{1 - \rho}\, \|x_0(\theta) - F_\theta(x_0(\theta))\|.$$
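Before introducing the nonsmooth piggyback recursion, here is a one-dimensional illustration of a conservative gradient (our own toy example, not from the paper): the Clarke subdifferential of the absolute value, returned as an interval, is a valid conservative gradient, and an AD framework simply selects one element of it at each call.

```python
def conservative_grad_abs(x):
    """A conservative gradient for f(x) = |x|, returned as an interval (lo, hi).

    Away from 0 it is the singleton {sign(x)}; at 0 it is the Clarke
    subdifferential [-1, 1].  Property (4) holds for any selection: on the set
    of times where an absolutely continuous path sits at the kink, the path's
    derivative vanishes almost everywhere, so any value in [-1, 1] is valid.
    """
    if x > 0:
        return (1.0, 1.0)
    if x < 0:
        return (-1.0, -1.0)
    return (-1.0, 1.0)

def ad_grad_abs(x):
    # The single element an AD framework would typically select (|.|'(0) := 0).
    return 1.0 if x > 0 else (-1.0 if x < 0 else 0.0)

print(conservative_grad_abs(0.0), ad_grad_abs(0.0))   # (-1.0, 1.0) 0.0
```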
Furthermore, for every $k \in \mathbb{N}$, let us define the following set-valued piggyback recursion:

$$J_{x_{k+1}}(\theta) = \left\{ A J + B,\ [A, B] \in J_F(x_k(\theta), \theta),\ J \in J_{x_k}(\theta) \right\}. \qquad \text{(PB)}$$

We will show in Section 3 that (PB) plays the same role as (2) in the nonsmooth setting. Note that one can recursively evaluate a sequence $J_k \in J_{x_k}(\theta)$, $k \in \mathbb{N}$, through operations actually implemented in nonsmooth AD frameworks, as follows:

$$J_{k+1} = A_k J_k + B_k, \quad \text{where } [A_k, B_k] \in J_F(x_k(\theta), \theta). \qquad (5)$$

**Remark 1** (Local contractions). Assumption 1 may be relaxed as follows: for all $\theta$, the fixed point set $\operatorname{fix}(F_\theta)$ is a singleton $\bar x(\theta)$ such that $x_k(\theta) \to \bar x(\theta)$ as $k \to \infty$, and the operator norm condition on $J_F$ in Assumption 1 holds at the point $(\bar x(\theta), \theta)$. By graph closedness of $J_F$, in a neighborhood of $(\bar x(\theta), \theta)$, $F_\theta$ is a strict contraction and the operator norm condition on $J_F$ holds, possibly with a larger contraction factor $\rho < 1$. After finitely many steps, the iterates $(x_k)_{k \in \mathbb{N}}$ remain in this neighborhood, and all our convergence results hold, due to their asymptotic nature.

**Remark 2** (Relation to existing work). For a smooth $F$, a natural conservative Jacobian is the classical one. The hypotheses in [39, 36] for gradient descent ($F$ is $C^1$) are exactly the classical counterpart of Assumption 1. On the other hand, [25, 27, 15] use a more general assumption on the spectral radius, which allows one to treat the Heavy-Ball method, e.g., in [39]. However, this does not generalize to sets of matrices, as shown in Section 5. Hence Assumption 1 bears on the operator norm and not on the spectral radius; this is sharp, contrary to the smooth case.

## 3 Asymptotics of nonsmooth piggyback differentiation

### 3.1 Fixed point of affine iterations

**Gap and Hausdorff distance.** Given two nonempty compact subsets $X, Y$ of $\mathbb{R}^p$, set $\operatorname{gap}(X, Y) = \max_{x \in X} d(x, Y)$, where $d(x, Y) = \min_{y \in Y} \|x - y\|$, and define the Hausdorff distance between $X$ and $Y$ by $\operatorname{dist}(X, Y) = \max(\operatorname{gap}(X, Y), \operatorname{gap}(Y, X))$. Note that $\operatorname{gap}(X, Y) = 0$ if, and only if, $X \subset Y$, whereas $\operatorname{dist}(X, Y) = 0$ if, and only if, $X = Y$. Moreover, $X \subset Y + \operatorname{gap}(X, Y)\, B$, where $B$ is the unit ball. This means that $\operatorname{gap}(X, Y)$ measures the default of inclusion of $X$ in $Y$; see [46, Chapter 4] for more details.

**Affine iterations by packets of matrices.** Let $\mathcal{J} \subset \mathbb{R}^{p \times (p + m)}$ be a compact subset of matrices such that any matrix of the form $[A, B] \in \mathcal{J}$, with $A \in \mathbb{R}^{p \times p}$, is such that $A$ has operator norm at most $\rho < 1$. We let $\mathcal{J}$ act naturally on the matrices of size $p \times m$: $\mathcal{J} : \mathbb{R}^{p \times m} \rightrightarrows \mathbb{R}^{p \times m}$ denotes the map from $\mathbb{R}^{p \times m}$ to subsets of $\mathbb{R}^{p \times m}$ defined, for each $X \in \mathbb{R}^{p \times m}$, by $\mathcal{J}(X) = \{ AX + B,\ [A, B] \in \mathcal{J} \}$. This defines a set-valued map through, for any $\mathcal{X} \subset \mathbb{R}^{p \times m}$,

$$\mathcal{J}(\mathcal{X}) = \{ AX + B,\ [A, B] \in \mathcal{J},\ X \in \mathcal{X} \}. \qquad (6)$$

Recursions of the form (PB) generate sequences $(\mathcal{X}_k)_{k \in \mathbb{N}}$ of subsets of $\mathbb{R}^{p \times m}$ satisfying

$$\mathcal{X}_{k+1} = \mathcal{J}(\mathcal{X}_k) \qquad \forall k \in \mathbb{N}. \qquad (7)$$

The following is an instance of the Banach-Picard theorem (whose proof is recalled in Appendix B).

**Theorem 1** (Set-valued affine contractions). Let $\mathcal{J} \subset \mathbb{R}^{p \times (p + m)}$ be a compact subset of matrices as above, with $\rho < 1$. Then there is a unique nonempty compact set $\operatorname{fix}(\mathcal{J}) \subset \mathbb{R}^{p \times m}$ satisfying $\operatorname{fix}(\mathcal{J}) = \mathcal{J}(\operatorname{fix}(\mathcal{J}))$, where the action of $\mathcal{J}$ is given in (6). Let $(\mathcal{X}_k)_{k \in \mathbb{N}}$ be a sequence of compact subsets of $\mathbb{R}^{p \times m}$, such that $\mathcal{X}_0 \neq \emptyset$, satisfying the recursion (7). We have, for all $k \in \mathbb{N}$,
$$\operatorname{dist}(\mathcal{X}_k, \operatorname{fix}(\mathcal{J})) \le \frac{\rho^k}{1 - \rho}\, \operatorname{dist}(\mathcal{X}_0, \mathcal{J}(\mathcal{X}_0)),$$
where $\operatorname{dist}$ is the Hausdorff distance associated with the Euclidean norm on $p \times m$ matrices.
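A small numerical illustration of Theorem 1 (our own sketch, with arbitrary matrices; not from the paper): starting from $\mathcal{X}_0 = \{0\}$ and iterating (7) with a packet of two matrices whose $A$-blocks have operator norm $\rho = 0.6$, the Hausdorff distance between successive iterates decreases geometrically, so $\mathcal{X}_k$ approaches the fixed set $\operatorname{fix}(\mathcal{J})$.

```python
import numpy as np
from itertools import product

rng = np.random.default_rng(1)
p, m, rho = 2, 2, 0.6

# A compact packet J of two matrices [A_i, B_i] with ||A_i||_op = rho < 1.
packet = []
for _ in range(2):
    A = rng.standard_normal((p, p))
    A *= rho / np.linalg.norm(A, 2)
    B = rng.standard_normal((p, m))
    packet.append((A, B))

def apply_J(X_set):
    # Set action (6)-(7): J(X) = {A X + B, [A, B] in J, X in X}.
    return [A @ X + B for (A, B), X in product(packet, X_set)]

def hausdorff(X_set, Y_set):
    X = np.array(X_set).reshape(len(X_set), -1)
    Y = np.array(Y_set).reshape(len(Y_set), -1)
    D = np.linalg.norm(X[:, None, :] - Y[None, :, :], axis=-1)
    return max(D.min(axis=1).max(), D.min(axis=0).max())

X = [np.zeros((p, m))]                  # X_0 = {0}
for k in range(10):
    X_next = apply_J(X)
    print(k, len(X_next), round(hausdorff(X, X_next), 6))
    X = X_next
# dist(X_k, X_{k+1}) shrinks roughly by a factor rho at each step, as in Theorem 1.
```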
### 3.2 An infinite chain rule and its consequences

Define the following set-valued map, based on the fix operator from Theorem 1:
$$J^{\mathrm{pb}}_{\bar x} : \theta \rightrightarrows \operatorname{fix}\left[ J_F(\bar x(\theta), \theta) \right],$$
where $\bar x(\theta)$ is the unique fixed point of the algorithmic recursion. Since $\bar x(\theta) = \operatorname{fix}(F_\theta)$, we have equivalently that $J^{\mathrm{pb}}_{\bar x}$ is the fixed point of the Jacobian at the fixed point:
$$J^{\mathrm{pb}}_{\bar x} : \theta \rightrightarrows \operatorname{fix}\left[ J_F(\operatorname{fix}(F_\theta), \theta) \right].$$
We have the following result (proved in Appendix C) and a consequence of Theorem 1.

**Theorem 2** (A conservative mapping for the fixed point map). Under Assumption 1, $J^{\mathrm{pb}}_{\bar x}$ is well defined and is a conservative Jacobian for the fixed point map $\bar x$.

**Corollary 1** (Convergence of the piggyback derivatives). Under Assumption 1, for all $\theta$, the recursion (PB) satisfies
$$\lim_{k \to \infty} \operatorname{gap}\big( J_{x_k}(\theta),\, J^{\mathrm{pb}}_{\bar x}(\theta) \big) = 0. \qquad (8)$$

Unrolling the expression of $J_{x_K}$, using (6) and (7), we can rewrite (8) with a compositional product:
$$\lim_{K \to \infty} \operatorname{gap}\Big( \big( J_F(x_{K-1}(\theta), \theta) \circ \cdots \circ J_F(x_0(\theta), \theta) \big)\big( J_{x_0}(\theta) \big),\; J^{\mathrm{pb}}_{\bar x}(\theta) \Big) = 0,$$
where each packet $J_F(x_k(\theta), \theta)$ acts on sets of matrices as in (6). In plain words, this is a limit-derivative exchange result: asymptotically, the gap between the automatic differentiation of $x_K$ and the derivative of the limit is zero. In particular, the recursion (5) produces bounded sequences whose accumulation points are in $J^{\mathrm{pb}}_{\bar x}$. Since conservative Jacobians equal classical Jacobians almost everywhere [12], we have convergence of classical derivatives.

**Corollary 2** (Convergence a.e. of the classical piggyback derivatives). Under Assumption 1, for almost all $\theta$, the classical Jacobian $\frac{\partial x_k}{\partial \theta}(\theta)$ is well defined for all $k$ and converges towards the classical Jacobian of $\bar x$; that is,
$$\frac{\partial x_k}{\partial \theta}(\theta) \xrightarrow[k \to \infty]{} \frac{\partial \bar x}{\partial \theta}(\theta), \quad \text{for almost all } \theta.$$

**Remark 3** (Connection to implicit differentiation). The authors in [10] proved a qualification-free version of the implicit function theorem. Assuming that, for every $[A, B] \in J_F(\bar x(\theta), \theta)$, the matrix $I - A$ is invertible, the map
$$J^{\mathrm{imp}}_{\bar x} : \theta \rightrightarrows \left\{ (I - A)^{-1} B,\ [A, B] \in J_F(\bar x(\theta), \theta) \right\}$$
is a conservative Jacobian for $\bar x$. Under Assumption 1, one has $J^{\mathrm{imp}}_{\bar x}(\theta) \subset J^{\mathrm{pb}}_{\bar x}(\theta)$ for all $\theta$. Unfortunately, as soon as $F$ is not differentiable, the inclusion may be strict; see details in Appendix D.

### 3.3 Consequence for algorithmic differentiation

Given $k \in \mathbb{N}$, $\dot\theta \in \mathbb{R}^m$ and $w_k \in \mathbb{R}^p$, the following algorithms compute $\dot x_k = J_k \dot\theta$ using the forward mode of automatic differentiation (Jacobian-vector products, JVP), and $\bar\theta_k^T = w_k^T J_k$ using the reverse mode of automatic differentiation (vector-Jacobian products, VJP). In a compositional model, $\dot\theta$ is the derivative of an inner function controlling the algorithm parameters $\theta$ with respect to another real variable $\lambda \in \mathbb{R}$, for example a hyperparameter. The goal is to combine $\frac{\partial \theta(\lambda)}{\partial \lambda}$ and $\frac{\partial x_k(\theta)}{\partial \theta}$ with the chain rule in a forward pass to obtain the total derivative $\frac{\partial x_k(\theta(\lambda))}{\partial \lambda}$. On the other hand, in a compositional model, $w_k$ is typically the gradient of an outer loss function $\ell$ evaluated at $x_k(\theta)$. In this case the goal is to combine the derivatives of the iterates $\frac{\partial x_k(\theta)}{\partial \theta}$ with $w_k = \frac{\partial \ell(x_k)}{\partial x_k}$ in a backward pass to obtain $\frac{\partial \ell(x_k(\theta))}{\partial \theta}$.

**Algorithm 1** (Algorithmic differentiation of recursion (1), forward and reverse modes).

*Input:* $k \in \mathbb{N}$, $\theta \in \mathbb{R}^m$, $\dot\theta \in \mathbb{R}^m$, $w_k \in \mathbb{R}^p$, initialization function $x_0(\theta)$, recursion function $F(x, \theta)$, conservative Jacobians $J_F(x, \theta)$ and $J_{x_0}(\theta)$. *Initialize:* $x_0 = x_0(\theta) \in \mathbb{R}^p$.

*Forward mode (JVP):*
- $\dot x_0 = J \dot\theta$, with $J \in J_{x_0}(\theta)$.
- For $i = 1, \ldots, k$: $x_i = F(x_{i-1}, \theta)$ and $\dot x_i = A_{i-1} \dot x_{i-1} + B_{i-1} \dot\theta$, with $[A_{i-1}, B_{i-1}] \in J_F(x_{i-1}, \theta)$.
- Return $\dot x_k$.

*Reverse mode (VJP):*
- Forward pass: for $i = 1, \ldots, k$: $x_i = F(x_{i-1}, \theta)$. Set $\bar\theta = 0$.
- Backward pass: for $i = k, \ldots, 1$: pick $[A_{i-1}, B_{i-1}] \in J_F(x_{i-1}, \theta)$, then set $\bar\theta \leftarrow \bar\theta + B_{i-1}^T w_i$ and $w_{i-1} = A_{i-1}^T w_i$.
- Return $\bar\theta_k = \bar\theta + J^T w_0$, with $J \in J_{x_0}(\theta)$.
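A minimal Python rendering of Algorithm 1 (a sketch under the assumption that the user supplies `F` and an oracle `dF` returning one element $[A, B]$ of $J_F$, e.g. as in the earlier ridge sketch; this is not the paper's implementation).

```python
def piggyback_jvp(F, dF, x0, dx0, theta, dtheta, k):
    """Forward mode (JVP) of Algorithm 1: returns (x_k, dx_k) with dx_k = J_k dtheta.

    dF(x, theta) must return one pair [A, B] of the conservative Jacobian
    J_F(x, theta); dx0 is J @ dtheta for some J in J_{x_0}(theta).
    """
    x, dx = x0, dx0
    for _ in range(k):
        A, B = dF(x, theta)
        x, dx = F(x, theta), A @ dx + B @ dtheta
    return x, dx

def piggyback_vjp(F, dF, x0, J0, theta, w_k, k):
    """Reverse mode (VJP) of Algorithm 1: returns theta_bar with theta_bar^T = w_k^T J_k.

    J0 is one element of J_{x_0}(theta), the Jacobian of the initialization.
    """
    xs = [x0]
    for _ in range(k):                 # forward pass, storing the iterates
        xs.append(F(xs[-1], theta))
    theta_bar, w = 0.0, w_k
    for i in range(k, 0, -1):          # backward pass
        A, B = dF(xs[i - 1], theta)
        theta_bar = theta_bar + B.T @ w
        w = A.T @ w
    return theta_bar + J0.T @ w        # contribution of the initialization
```

For instance, with `F` and `dF` as in the ridge sketch of the introduction, `piggyback_jvp(F, dF, np.zeros(p), np.zeros(p), theta, np.ones(1), 500)[1]` approximates the derivative of the solution with respect to the regularization weight.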
**Proposition 1** (Convergence of VJP and JVP). Let $k \in \mathbb{N}$, $\dot\theta \in \mathbb{R}^m$, $w_k \in \mathbb{R}^p$, $x_k \in \mathbb{R}^p$, $\dot x_k \in \mathbb{R}^p$, $\bar\theta_k \in \mathbb{R}^m$ be as in Algorithm 1, under Assumption 1. Then, for almost all $\theta \in \mathbb{R}^m$, $\dot x_k \to \frac{\partial \bar x}{\partial \theta}(\theta)\, \dot\theta$. Assume furthermore that, as $k \to \infty$, $w_k \to w$ (for example, $w_k = \nabla \ell(x_k)$ for a $C^1$ loss $\ell$); then, for almost all $\theta \in \mathbb{R}^m$, $\bar\theta_k^T \to w^T \frac{\partial \bar x}{\partial \theta}(\theta)$.

**Remark 4.** In addition to Proposition 1, in both cases, for all $\theta$, all accumulation points of $\dot x_k$ and $\bar\theta_k^T$ are elements of $J^{\mathrm{pb}}_{\bar x}(\theta)\, \dot\theta$ and $w^T J^{\mathrm{pb}}_{\bar x}(\theta)$, respectively. This is a consequence of Corollary 2 combined with algorithmic differentiation arguments; the proof is given in Appendix D.

### 3.4 Linear convergence rate for semialgebraic piecewise smooth selection functions

Semialgebraic functions are ubiquitous in machine learning (piecewise polynomials, $\ell_1$ and $\ell_2$ norms, determinant, matrix rank, ...). We refer the reader to [11] for a thorough discussion of their extensions and use in machine learning. For more technical details, see [17, 18] for introductory material on semialgebraic and o-minimal geometry.

**Lipschitz gradient selection functions.** Let $F : \mathbb{R}^p \to \mathbb{R}^q$ be semialgebraic and continuous. We say that $F$ has a Lipschitz gradient selection $(s, F_1, \ldots, F_m)$ if $s : \mathbb{R}^p \to \{1, \ldots, m\}$ is semialgebraic and there exists $L > 0$ such that, for $i = 1, \ldots, m$, $F_i : \mathbb{R}^p \to \mathbb{R}^q$ is semialgebraic with $L$-Lipschitz Jacobian, and for all $x \in \mathbb{R}^p$, $F(x) = F_{s(x)}(x)$. For any $x \in \mathbb{R}^p$, set $I(x) = \{ i \in \{1, \ldots, m\},\ F(x) = F_i(x) \}$. The set-valued map $J^s_F : \mathbb{R}^p \rightrightarrows \mathbb{R}^{q \times p}$ given by
$$J^s_F : x \rightrightarrows \operatorname{conv}\left\{ \frac{\partial F_i}{\partial x}(x),\ i \in I(x) \right\}$$
is a conservative Jacobian for $F$, as shown in [11]. Here $\frac{\partial F_i}{\partial x}$ denotes the classical Jacobian of $F_i$. Let us stress that such a structure is ubiquitous in applications [11, 34].

**Rate of convergence.** We may now strengthen Corollary 1 by proving the linear convergence of the piggyback derivatives towards the fixed point. The following is a consequence of the fact that the proposed selection conservative Jacobians of Lipschitz gradient selection functions are Lipschitz-like (Lemma 4 in Appendix E.1). Note that semialgebraicity is only used as a sufficient condition to ensure conservativity of the selection Jacobian together with this Lipschitz-like property. It could be relaxed if these can be guaranteed by other means; in particular, one could consider the broader class of definable functions in order to handle log-likelihood data-fitting terms.

**Corollary 3** (Linear convergence of piggyback derivatives). In addition to Assumption 1, assume that $F$ has a Lipschitz gradient selection structure as above. Then, for any $\theta$ and any $\epsilon > 0$, there exists $C > 0$ such that the recursion (PB) with $J_F = J^s_F$ satisfies, for all $k \in \mathbb{N}$,
$$\operatorname{gap}\big( J_{x_k}(\theta),\, J^{\mathrm{pb}}_{\bar x}(\theta) \big) \le C\,(\sqrt{\rho} + \epsilon)^k.$$
Moreover, the classical Jacobians in Corollary 2 converge at a linear rate for almost all $\theta$.
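To illustrate the selection structure on a familiar example (our choice, not the paper's), the scalar soft-thresholding operator is a selection of the three smooth pieces $x - \lambda$, $0$ and $x + \lambda$; the sketch below evaluates the active set $I(x)$ and the selection Jacobian $J^s_F(x)$, here an interval of slopes.

```python
def soft_threshold(x, lam):
    # F(x) = prox_{lam |.|}(x), a selection of three smooth pieces.
    return max(x - lam, 0.0) if x >= 0 else min(x + lam, 0.0)

def selection_jacobian(x, lam, tol=1e-12):
    """Selection conservative Jacobian J^s_F(x) for scalar soft-thresholding.

    Pieces: F_1(x) = x - lam, F_2(x) = 0, F_3(x) = x + lam, with derivatives
    1, 0, 1.  The active set I(x) collects the pieces equal to F(x), and
    J^s_F(x) is the convex hull of their derivatives, returned as an interval.
    """
    fx = soft_threshold(x, lam)
    pieces = [(x - lam, 1.0), (0.0, 0.0), (x + lam, 1.0)]
    active = [slope for value, slope in pieces if abs(value - fx) <= tol]
    return (min(active), max(active))

print(selection_jacobian(2.0, 1.0))    # (1.0, 1.0): smooth branch
print(selection_jacobian(0.3, 1.0))    # (0.0, 0.0): interior of the flat part
print(selection_jacobian(1.0, 1.0))    # (0.0, 1.0): kink, conv{0, 1}
```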
## 4 Application to proximal splitting methods in convex optimization

Consider the composite parametric convex optimization problem, where $\theta \in \mathbb{R}^m$ represents parameters and $x \in \mathbb{R}^p$ is the decision variable,
$$\bar x(\theta) \in \operatorname*{arg\,min}_{x}\ f(x, \theta) + g(x, \theta).$$
The purpose of this section is to construct examples of functions $F$ used in recursion (1), based on known algorithms. The following assumption will be standing throughout the section.

**Assumption 2.** $f$ is semialgebraic, convex, and its gradient with respect to $x$ for fixed $\theta$, $\nabla_x f$, is locally Lipschitz jointly in $(x, \theta)$ and $L$-Lipschitz in $x$ for fixed $\theta$. Semialgebraicity implies that $\nabla_x f$ is path differentiable jointly in $(x, \theta)$; we denote by $J^2_f$ its Clarke Jacobian. The function $g$ is semialgebraic, convex in $x$ for fixed $\theta$, and lower semicontinuous. For all $\gamma > 0$, we assume that $G : (x, \theta) \mapsto \operatorname{prox}_{\gamma g(\cdot, \theta)}(x)$ is locally Lipschitz jointly in $(x, \theta)$. Semialgebraicity implies that it is also path differentiable jointly in $(x, \theta)$; we denote by $J_G$ its Clarke Jacobian.

This assumption covers a very large diversity of problems in convex optimization, as most gradient and prox operations used in practice are semialgebraic (or definable). Under Assumption 2, we will provide sufficient conditions on $f$ and $g$ for Assumption 1 to hold, for different algorithmic recursions. These will therefore imply convergence as stated in Corollaries 1 and 2 and Proposition 1, as well as Corollary 3 in the piecewise selection case. The proofs are postponed to Appendix F.

### 4.1 Splitting algorithms

In this section we provide sufficient conditions for Assumption 1 to hold. The underlying conservative Jacobian is obtained by combining Clarke Jacobians of elementary algorithmic operations (gradient and proximal operator in Assumption 2), using the compositional rules of differential calculus [11] and implicit differentiation [10]. Using [12], such Jacobians are conservative by semialgebraicity, and their combination provides conservative Jacobians for the corresponding algorithmic recursion $F$. These objects are explicitly constructed in Appendix F.

**Forward-backward algorithm.** The forward-backward iterations are given, for $\gamma > 0$, by
$$x_{k+1} = \operatorname{prox}_{\gamma g(\cdot, \theta)}\big( x_k - \gamma \nabla_x f(x_k, \theta) \big). \qquad (10)$$

**Proposition 2.** Under Assumption 2 with $0 < \gamma < 2/L$, denote by $F : \mathbb{R}^p \times \mathbb{R}^m \to \mathbb{R}^p$ the forward-backward recursion in (10). For $\mu > 0$, if either $f$ or $g$ is $\mu$-strongly convex in $x$ for all $\theta$, then $F_\theta$ is a strict contraction and Assumption 1 holds.

It is well known that, if $f$ is $\mu$-strongly convex, choosing $\gamma = 2/(L + \mu)$ provides a contraction factor $\rho_{\mathrm{FB}} = (\kappa - 1)/(\kappa + 1)$, where $\kappa = L/\mu \ge 1$ is the condition number of the problem.

**Douglas-Rachford algorithm.** Given $\gamma > 0$, the algorithm iterates
$$y_{k+1} = \tfrac{1}{2}\big( I + R_{\gamma f(\cdot, \theta)}\, R_{\gamma g(\cdot, \theta)} \big)\, y_k, \qquad (11)$$
where $R_{\gamma f(\cdot, \theta)} = 2\operatorname{prox}_{\gamma f(\cdot, \theta)} - I$ is the reflected proximal operator, which is 1-Lipschitz (and similarly for $g$). Following [6, Theorem 26.11], if the problem has a minimizer, then $(y_k)_{k \in \mathbb{N}}$ converges to a fixed point $\bar y$ of (11) such that $\bar x = \operatorname{prox}_{\gamma g(\cdot, \theta)}(\bar y)$ is a solution to the optimization problem. Following [26, Theorem 1], if $f$ is strongly convex, then $R_{\gamma f(\cdot, \theta)}$ is $\rho$-Lipschitz for some $\rho < 1$, and our differentiation result applies to Douglas-Rachford splitting in this setting.

**Proposition 3.** Under Assumption 2 with $\gamma > 0$, denote by $F : \mathbb{R}^p \times \mathbb{R}^m \to \mathbb{R}^p$ the Douglas-Rachford recursion in (11). If $f$ is $\mu$-strongly convex in $x$ for all $\theta$, then $F_\theta$ is a strict contraction and Assumption 1 holds.

Following [26, Proposition 3], choosing $\gamma = 1/\sqrt{L\mu}$ provides a contraction factor of order $\rho_{\mathrm{DR}} = (\sqrt{\kappa} - 1)/(\sqrt{\kappa} + 1) \approx \sqrt{\rho_{\mathrm{FB}}}$, where again $\kappa = L/\mu$ is the condition number of the problem. In this respect, Douglas-Rachford iterations provide a faster asymptotic rate than those of forward-backward, which may also impact the convergence of derivatives in the context of Corollary 3.
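Before turning to ADMM, here is an illustrative NumPy sketch (not the paper's code) of piggyback differentiation through the forward-backward iterations (10), for $f(x, \theta) = \tfrac12\|Ax - b\|_2^2$ with an injective design (so that $f$ is strongly convex and Proposition 2 applies) and $g(x, \theta) = \theta\|x\|_1$; the Clarke Jacobian of the soft-thresholding prox is diagonal, and the selection $0$ is used at kinks.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 50, 10
A, b = rng.standard_normal((n, p)), rng.standard_normal(n)
L = np.linalg.norm(A, 2) ** 2              # Lipschitz constant of grad_x f
gamma, theta = 1.0 / L, 0.5                # step size and regularization weight

def prox_and_jac(z, t):
    """Soft-thresholding prox of t*||.||_1 with one element of its Clarke Jacobian.

    Returns (prox(z), d prox / d z as a diagonal, d prox / d t); the selection
    0 is used at kinks, as a nonsmooth AD framework typically would.
    """
    x = np.sign(z) * np.maximum(np.abs(z) - t, 0.0)
    dz = (np.abs(z) > t).astype(float)     # 1 on the smooth branches, 0 elsewhere
    dt = -np.sign(z) * dz                  # derivative w.r.t. the threshold t
    return x, dz, dt

x, J = np.zeros(p), np.zeros(p)            # J approximates dx_k/dtheta (m = 1 here)
for _ in range(2000):
    z = x - gamma * (A.T @ (A @ x - b))    # forward (gradient) step
    dz_dx = np.eye(p) - gamma * (A.T @ A)
    px, dpz, dpt = prox_and_jac(z, gamma * theta)
    # Piggyback step (5): J_{k+1} = A_k J_k + B_k, with A_k = diag(dpz) @ dz_dx
    # and B_k = gamma * dpt, since theta only enters through the threshold.
    x, J = px, dpz * (dz_dx @ J) + gamma * dpt

print(x)   # approximate solution x_bar(theta)
print(J)   # approximate derivative of the solution with respect to theta
```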
**Alternating Direction Method of Multipliers.** Consider the separable convex problem
$$\min_{u, v}\ \psi(u) + \varphi(v) \quad \text{subject to} \quad A u + B v = c. \qquad (12)$$
The alternating direction method of multipliers (ADMM) algorithm combines two partial minimizations of an augmented Lagrangian and a dual update: for a penalty parameter $\gamma > 0$,
$$\begin{aligned}
u_{k+1} &\in \operatorname*{arg\,min}_{u}\ \psi(u) + \tfrac{\gamma}{2}\, \| A u + B v_k - c + x_k \|^2, \\
v_{k+1} &\in \operatorname*{arg\,min}_{v}\ \varphi(v) + \tfrac{\gamma}{2}\, \| A u_{k+1} + B v - c + x_k \|^2, \\
x_{k+1} &= x_k + \big( A u_{k+1} + B v_{k+1} - c \big).
\end{aligned}$$
As observed in [23], the ADMM algorithm can be seen as the Douglas-Rachford splitting method applied to the Fenchel dual of problem (12) (see Appendix F.3 for more details). More precisely, ADMM updates are equivalent to Douglas-Rachford iterations applied to the Fenchel dual of (12), written as the sum of two convex functions of the dual variable $x$, the first one playing the role of $f(\cdot, \theta)$ and the second that of $g(\cdot, \theta)$ in (11). Therefore, if $\varphi$ is strongly convex with Lipschitz gradient and $A$ is injective, then ADMM converges linearly, and one is able to combine derivatives of proximal operators to differentiate ADMM.

### 4.2 Numerical illustrations

We now detail how Figure 2, discussed in the introduction, is obtained, and how it illustrates our theoretical results. We consider four scenarios (Ridge, Lasso, Sparse inverse covariance selection and Trend filtering), corresponding to the four columns. For each of them, the first line shows the empirical linear rate of the iterates $x_k$ and the second line shows the empirical linear rate of the derivatives $\frac{\partial x_k}{\partial \theta}$. All experiments are repeated 100 times and we report the median along with the first and last deciles.

**Forward-backward for the Ridge.** The Ridge estimator is defined, for $\theta > 0$, as
$$\bar x(\theta) = \operatorname*{arg\,min}_{x \in \mathbb{R}^p}\ \tfrac{1}{2}\|Ax - b\|_2^2 + \tfrac{\theta}{2}\|x\|_2^2.$$
Among several possibilities to solve it, one can use the forward-backward algorithm applied to $f : (x, \theta) \mapsto \tfrac{1}{2}\|Ax - b\|_2^2$ and $g : (x, \theta) \mapsto \tfrac{\theta}{2}\|x\|_2^2$. Since $g$ is strongly convex in $x$, Proposition 2 may be applied.

**Forward-backward algorithm for the Lasso.** Consider the forward-backward algorithm applied to the Lasso problem [49], with parameter $\theta > 0$,
$$\bar x(\theta) \in \operatorname*{arg\,min}_{x \in \mathbb{R}^p}\ \tfrac{1}{2}\|Ax - b\|_2^2 + \theta\|x\|_1 = \operatorname*{arg\,min}_{x}\ \tfrac{1}{2L}\|Ax - b\|_2^2 + \tfrac{\theta}{L}\|x\|_1,$$
where $L$ is any upper bound on the operator norm of $A^T A$. The gradient of the quadratic part is 1-Lipschitz, so we may consider the forward-backward algorithm (10) with unit step size, $f : (x, \theta) \mapsto \tfrac{1}{2L}\|Ax - b\|_2^2$ and $g : (x, \theta) \mapsto \tfrac{\theta}{L}\|x\|_1$. A well-known qualification condition involving a generalized support at optimality ensures uniqueness of the Lasso solution [20, 37]. It holds for generic problem data [50]. Following [10, Proposition 5], under this qualification condition, the conservative Jacobian of $F_\theta$ (with respect to $x$) is such that, at the solution $\bar x(\theta)$, the matrix set $I - J_{F_\theta}(\bar x(\theta))$ only contains invertible matrices. This means that there exists $\rho < 1$ such that any $M \in J_{F_\theta}(\bar x(\theta))$ has operator norm at most $\rho$. Following Remark 1, all our convergence results apply qualitatively. Note that we recover the results of [7, Proposition 2] for the Lasso.

**Douglas-Rachford for the Sparse Inverse Covariance Selection.** The sparse inverse covariance selection [52, 22] reads
$$\bar x(\theta) \in \operatorname*{arg\,min}_{x \in \mathbb{R}^{n \times n}}\ \operatorname{tr}(Cx) - \log\det x + \theta \sum_{i,j} |x_{i,j}|,$$
where $C$ is a symmetric positive matrix and $\theta > 0$. It is possible to apply the Douglas-Rachford method to $f : (x, \theta) \mapsto \operatorname{tr}(Cx) - \log\det x$ and $g : (x, \theta) \mapsto \theta\|x\|_{1,1}$. It is known that $f$ is locally strongly convex; indeed, $x \mapsto -\log\det x$ is the standard self-concordant barrier in semidefinite programming [40]. Following Remark 1, all our convergence results apply qualitatively.

**ADMM for Trend Filtering.** Introduced in [51] in statistics as a generalization of the total variation, the trend filtering estimator with observation $\theta \in \mathbb{R}^p$ reads
$$\bar x(\theta) = \operatorname*{arg\,min}_{x \in \mathbb{R}^p}\ \tfrac{1}{2}\|x - \theta\|_2^2 + \lambda \|D^{(k)} x\|_1,$$
where $D^{(k)}$ is a forward finite-difference approximation of a differential operator of order $k$ (here $k = 2$). Using $\psi : u \mapsto \lambda\|u\|_1$, $\varphi : v \mapsto \tfrac{1}{2}\|v - \theta\|_2^2$ (strongly convex), $A = I$ (injective), $B = D^{(k)}$, and $c = 0$, we can apply the ADMM to solve trend filtering.
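To complement Figure 2 with a runnable toy example (our own sketch on a small $\ell_1$-regularized least-squares problem, not one of the paper's four experiments), the following code piggybacks through the Douglas-Rachford iterations (11), combining the Jacobians of the two reflected proximal operators, and recovers the derivative of the solution with respect to the regularization weight $\theta$.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 30, 8
A, b = rng.standard_normal((n, p)), rng.standard_normal(n)
gamma, theta = 0.5, 0.4
M = np.linalg.inv(np.eye(p) + gamma * (A.T @ A))    # prox of gamma*f is affine

def reflected_prox_f(y):
    # R_{gamma f}(y) = 2 prox_{gamma f}(y) - y for f(x) = 0.5*||Ax - b||^2.
    px = M @ (y + gamma * (A.T @ b))
    return 2.0 * px - y, 2.0 * M - np.eye(p)         # value and Jacobian in y

def reflected_prox_g(y, t):
    # R_{gamma g}(y) for g(x, theta) = theta*||x||_1, with threshold t = gamma*theta.
    st = np.sign(y) * np.maximum(np.abs(y) - t, 0.0)
    d = (np.abs(y) > t).astype(float)                # Clarke selection 0 at kinks
    return 2.0 * st - y, np.diag(2.0 * d - 1.0), -2.0 * np.sign(y) * d

y, J = np.zeros(p), np.zeros(p)                      # DR variable and dy_k/dtheta
for _ in range(3000):
    rg, Jg_y, Jg_t = reflected_prox_g(y, gamma * theta)
    rf, Jf_y = reflected_prox_f(rg)
    # Piggyback (5) on y_{k+1} = 0.5*(y_k + R_f(R_g(y_k, theta))):
    Ak = 0.5 * (np.eye(p) + Jf_y @ Jg_y)
    Bk = 0.5 * (Jf_y @ (gamma * Jg_t))               # theta enters via t = gamma*theta
    y, J = 0.5 * (y + rf), Ak @ J + Bk

# Solution x_bar = prox_{gamma g}(y_bar) and its derivative with respect to theta.
d = (np.abs(y) > gamma * theta).astype(float)
x_bar = np.sign(y) * np.maximum(np.abs(y) - gamma * theta, 0.0)
dx_dtheta = d * J - gamma * np.sign(y) * d
print(x_bar, dx_dtheta)
```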
## 5 Failure of automatic differentiation for inertial methods

This section focuses on the Heavy-Ball method for strongly convex objectives, in its global linear convergence regime. For $C^2$ objectives, piggyback derivatives converge to the derivative of the solution map [28, 39, 36]. However, we provide a $C^{1,1}$ strongly convex parametric objective with path-differentiable derivative such that the piggyback derivatives of the Heavy-Ball algorithm contain diverging vectors for a given parameter value. In this example, other conservative differentiation means (implicit differentiation, piggyback on gradient descent) avoid this divergent behavior.

### 5.1 Heavy-ball algorithm and global convergence

Consider a function $f : \mathbb{R}^p \times \mathbb{R}^m \to \mathbb{R}$, a step size $\alpha > 0$ and a momentum parameter $\beta \ge 0$; for simplicity, when the second argument is fixed, we write $f_\theta : x \mapsto f(x, \theta)$. Set, for all $x, y, \theta$,
$$F(x, y, \theta) = \big( x - \alpha \nabla f_\theta(x) + \beta (x - y),\ x \big),$$
and consider the Heavy-Ball algorithm $(x_{k+1}, y_{k+1}) = F(x_k, y_k, \theta)$ for $k \in \mathbb{N}$. If $f_\theta$ is $\mu$-strongly convex with $L$-Lipschitz gradient, then, choosing $\alpha = 1/L$ and a suitable momentum $\beta$, the algorithm converges globally at a linear rate to the unique solution $\bar x(\theta)$ [24, Theorem 4]; local convergence is due to Polyak [44]. Furthermore, if in addition $f_\theta$ is $C^2$, forward propagation of derivatives converges to the derivative of the solution [28, 29, 39].

### 5.2 A diverging Jacobian accumulation

Details and proof of the following result are given in Appendix G.

**Proposition 4** (Piggyback differentiation fails for the Heavy-Ball method). Consider $f : \mathbb{R}^2 \to \mathbb{R}$ such that, for all $\theta \in \mathbb{R}$, $f(x, \theta) = x^2/2$ if $x \ge 0$ and $f(x, \theta) = x^2/8$ if $x \le 0$. Assume that $\alpha = 1$ and $\beta = 3/4$. Then the Heavy-Ball algorithm converges globally to $0$ and $\nabla f$ is path differentiable. The Clarke Jacobian of $F$ with respect to $(x, y)$ at $(0, 0, 0)$ is $J_F(0, 0, 0) = \operatorname{conv}\{M_1, M_2\}$, where $M_1$ and $M_2$ are the Jacobians of $F$ associated with the two branches of $\nabla f_\theta$ at $0$ (slopes $1$ and $1/4$), and the product $M_1 M_1 M_2 M_2$ has an eigenvalue of modulus $9/8$.

The presence of an eigenvalue with modulus greater than 1 may produce divergence in (PB). Set
$$f_1 : (x, \theta) \mapsto \begin{cases} x^2/2 & \text{if } x \ge 0, \\ x^2/8 & \text{if } x < 0, \end{cases} \qquad f_2 : (x, \theta) \mapsto \begin{cases} x^2/2 & \text{if } x > 0, \\ x^2/8 & \text{if } x \le 0. \end{cases}$$
Note that $f_1$ and $f_2$ are both equivalent to $f$, as they implement the same function. With initializations $x_0(\theta) = y_0(\theta) = \theta$, we run a few iterations of the Heavy-Ball algorithm for $\theta = 0$, and implement (PB) alternating between two steps on $f_1$ and two steps on $f_2$, differentiating the resulting sequence $(x_k)_{k \in \mathbb{N}}$ with respect to $\theta$ using algorithmic differentiation. The divergence phenomenon predicted by Proposition 4 is illustrated in Figure 3, while the true derivative is $0$ (the sequence is constant).

[Figure 3 — two panels, x-axis: iteration $k$] Figure 3: Behaviour of automatic differentiation for first-order methods on a quadratic function. (Left) Stability of the propagation of derivatives for fixed step-size gradient descent. (Right) Instability of the propagation of derivatives for Heavy-Ball. Both methods are initialized at the optimum.
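A compact reimplementation of this experiment (our own sketch with the branch Jacobians written out by hand, not the paper's notebook): at $\theta = 0$ the iterates stay at the kink, the piggyback Jacobian is multiplied alternately by the two branch matrices, and its norm blows up at rate $(9/8)^{k/4}$, although the solution map is constant.

```python
import numpy as np

alpha, beta = 1.0, 0.75                     # Heavy-Ball step size and momentum

def grad_f1(x):
    # Derivative of f_1 in x: the branch x^2/2 (slope 1) is selected at x = 0.
    return x if x >= 0 else x / 4.0

def grad_f2(x):
    # Same function, but the branch x^2/8 (slope 1/4) is selected at x = 0.
    return x if x > 0 else x / 4.0

def branch_jacobian(slope):
    # Jacobian of (x, y) -> (x - alpha*grad f(x) + beta*(x - y), x) on one branch.
    return np.array([[1.0 - alpha * slope + beta, -beta], [1.0, 0.0]])

M1, M2 = branch_jacobian(1.0), branch_jacobian(0.25)

x, y = 0.0, 0.0                             # iterates at theta = 0 stay at the kink
J = np.array([1.0, 1.0])                    # d(x_0, y_0)/dtheta for x_0 = y_0 = theta
for k in range(200):
    use_f1 = (k // 2) % 2 == 0              # two steps on f_1, then two on f_2
    g = grad_f1(x) if use_f1 else grad_f2(x)
    x, y = x - alpha * g + beta * (x - y), x
    J = (M1 if use_f1 else M2) @ J          # piggyback recursion (5), here B_k = 0
    if (k + 1) % 40 == 0:
        print(k + 1, np.linalg.norm(J))     # grows like (9/8)^(k/4)
```

Replacing the Heavy-Ball update by a plain gradient step keeps the propagated derivative bounded, matching the left panel of Figure 3.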
## 6 Conclusion

We have developed a flexible theoretical framework to describe the convergence of piggyback differentiation applied to nonsmooth recursions, providing, in particular, a rigorous meaning to the differentiation of nonsmooth solvers. The relevance of our approach is illustrated on composite convex optimization through widely used methods such as the forward-backward, Douglas-Rachford and ADMM algorithms. Our framework, however, allows one to consider many other abstract algorithmic recursions, and thus provides theoretical ground for more general problems such as variational inequalities or saddle point problems, as in [14, 9]. As a matter for future work, we shall consider relaxing Assumption 1 to study a wider class of methods, e.g., when $F_\theta$ is not a strict contraction.

## Acknowledgments

J. B. and E. P. acknowledge the financial support of the AI Interdisciplinary Institute ANITI under grant agreement ANR-19-PI3A-0004, of the Air Force Office of Scientific Research, Air Force Material Command, USAF, under grant numbers FA9550-19-1-7026 and FA8655-22-1-7012, and of ANR MaSDOL 19-CE23-0017-01. J. B. also acknowledges the support of ANR Chess, grant ANR-17-EURE-0010, TSE-P and the Centre Lagrange. S. V. acknowledges the support of ANR GraVa, grant ANR-18-CE40-0005.

## References

[1] Martín Abadi, Ashish Agarwal, Paul Barham, Eugene Brevdo, Zhifeng Chen, Craig Citro, Greg S. Corrado, Andy Davis, Jeffrey Dean, Matthieu Devin, Sanjay Ghemawat, Ian Goodfellow, Andrew Harp, Geoffrey Irving, Michael Isard, Yangqing Jia, Rafal Jozefowicz, Lukasz Kaiser, Manjunath Kudlur, Josh Levenberg, Dandelion Mané, Rajat Monga, Sherry Moore, Derek Murray, Chris Olah, Mike Schuster, Jonathon Shlens, Benoit Steiner, Ilya Sutskever, Kunal Talwar, Paul Tucker, Vincent Vanhoucke, Vijay Vasudevan, Fernanda Viégas, Oriol Vinyals, Pete Warden, Martin Wattenberg, Martin Wicke, Yuan Yu, and Xiaoqiang Zheng. TensorFlow: Large-scale machine learning on heterogeneous systems, 2015. Software available from tensorflow.org.

[2] Pierre Ablin, Gabriel Peyré, and Thomas Moreau. Super-efficiency of automatic differentiation for functions defined as a minimum. In Hal Daumé III and Aarti Singh, editors, Proceedings of the 37th International Conference on Machine Learning, volume 119 of Proceedings of Machine Learning Research, pages 32–41. PMLR, 2020.

[3] Akshay Agrawal, Brandon Amos, Shane Barratt, Stephen Boyd, Steven Diamond, and J. Zico Kolter. Differentiable convex optimization layers. In H. Wallach, H. Larochelle, A. Beygelzimer, F. d'Alché-Buc, E. Fox, and R. Garnett, editors, Advances in Neural Information Processing Systems, volume 32. Curran Associates, Inc., 2019.

[4] Charalambos D. Aliprantis and Kim C. Border. Infinite Dimensional Analysis: A Hitchhiker's Guide. Springer Science & Business Media, 2006.

[5] Shaojie Bai, J. Zico Kolter, and Vladlen Koltun. Deep equilibrium models. Advances in Neural Information Processing Systems, 32, 2019.

[6] Heinz H. Bauschke, Patrick L. Combettes, et al. Convex analysis and monotone operator theory in Hilbert spaces, volume 408. Springer, 2011.

[7] Quentin Bertrand, Quentin Klopfenstein, Mathieu Blondel, Samuel Vaiter, Alexandre Gramfort, and Joseph Salmon. Implicit differentiation of lasso-type models for hyperparameter optimization. In International Conference on Machine Learning, pages 810–821. PMLR, 2020.

[8] Mathieu Blondel, Quentin Berthet, Marco Cuturi, Roy Frostig, Stephan Hoyer, Felipe Llinares-López, Fabian Pedregosa, and Jean-Philippe Vert. Efficient and modular implicit differentiation, 2021.

[9] Lea Bogensperger, Antonin Chambolle, and Thomas Pock. Convergence of a piggyback-style method for the differentiation of solutions of standard saddle-point problems. Technical report, 2022.

[10] Jérôme Bolte, Tam Le, Edouard Pauwels, and Antonio Silveti-Falls. Nonsmooth implicit differentiation for machine learning and optimization. In Advances in Neural Information Processing Systems, 2021.

[11] Jérôme Bolte and Edouard Pauwels. A mathematical model for automatic differentiation in machine learning. In Conference on Neural Information Processing Systems, Vancouver, Canada, 2020.

[12] Jérôme Bolte and Edouard Pauwels. Conservative set valued fields, automatic differentiation, stochastic gradient method and deep learning. Mathematical Programming, 2020.
[13] James Bradbury, Roy Frostig, Peter Hawkins, Matthew James Johnson, Chris Leary, Dougal Maclaurin, George Necula, Adam Paszke, Jake VanderPlas, Skye Wanderman-Milne, and Qiao Zhang. JAX: composable transformations of Python+NumPy programs, 2018.

[14] Antonin Chambolle and Thomas Pock. Learning consistent discretizations of the total variation. SIAM Journal on Imaging Sciences, 14(2):778–813, 2021.

[15] Bruce Christianson. Reverse accumulation and attractive fixed points. Optimization Methods and Software, 3(4):311–326, 1994.

[16] Frank H. Clarke. Optimization and nonsmooth analysis. Canadian Mathematical Society Series of Monographs and Advanced Texts. John Wiley & Sons, Inc., New York, 1983. A Wiley-Interscience Publication.

[17] Michel Coste. An introduction to o-minimal geometry. Istituti editoriali e poligrafici internazionali, Pisa, 2000.

[18] Michel Coste. An introduction to semialgebraic geometry, 2000.

[19] Charles-Alban Deledalle, Samuel Vaiter, Jalal M. Fadili, and Gabriel Peyré. Stein Unbiased GrAdient estimator of the Risk (SUGAR) for multiple parameter selection. SIAM Journal on Imaging Sciences, 7(4):2448–2487, 2014.

[20] Bradley Efron, Trevor Hastie, Iain Johnstone, Robert Tibshirani, et al. Least angle regression. Annals of Statistics, 32(2):407–499, 2004.

[21] Laurent El Ghaoui, Fangda Gu, Bertrand Travacca, Armin Askari, and Alicia Tsai. Implicit deep learning. SIAM Journal on Mathematics of Data Science, 3(3):930–958, 2021.

[22] Jerome Friedman, Trevor Hastie, and Robert Tibshirani. Sparse inverse covariance estimation with the graphical lasso. Biostatistics, 9(3):432–441, 2007.

[23] Daniel Gabay. Applications of the method of multipliers to variational inequalities. In Studies in Mathematics and its Applications, volume 15, pages 299–331. Elsevier, 1983.

[24] Euhanna Ghadimi, Hamid Reza Feyzmahdavian, and Mikael Johansson. Global convergence of the heavy-ball method for convex optimization. In 2015 European Control Conference (ECC), pages 310–315. IEEE, 2015.

[25] Jean Charles Gilbert. Automatic differentiation and iterative processes. Optimization Methods and Software, 1(1):13–21, 1992.

[26] Pontus Giselsson and Stephen Boyd. Linear convergence and metric selection for Douglas-Rachford splitting and ADMM. IEEE Transactions on Automatic Control, 62(2):532–544, 2016.

[27] Andreas Griewank, Christian Bischof, George Corliss, Alan Carle, and Karen Williamson. Derivative convergence for iterative equation solvers. Optimization Methods and Software, 2(3-4):321–355, 1993.

[28] Andreas Griewank and Christèle Faure. Piggyback differentiation and optimization. In Large-scale PDE-constrained optimization, pages 148–164. Springer, 2003.

[29] Andreas Griewank and Andrea Walther. Evaluating derivatives: principles and techniques of algorithmic differentiation. SIAM, 2008.

[30] Charles R. Harris, K. Jarrod Millman, Stéfan J. van der Walt, Ralf Gommers, Pauli Virtanen, David Cournapeau, Eric Wieser, Julian Taylor, Sebastian Berg, Nathaniel J. Smith, Robert Kern, Matti Picus, Stephan Hoyer, Marten H. van Kerkwijk, Matthew Brett, Allan Haldane, Jaime Fernández del Río, Mark Wiebe, Pearu Peterson, Pierre Gérard-Marchant, Kevin Sheppard, Tyler Reddy, Warren Weckesser, Hameer Abbasi, Christoph Gohlke, and Travis E. Oliphant. Array programming with NumPy. Nature, 585(7825):357–362, September 2020.

[31] J. D. Hunter. Matplotlib: A 2D graphics environment. Computing in Science & Engineering, 9(3):90–95, 2007.

[32] Samuel Hurault, Arthur Leclaire, and Nicolas Papadakis.
Gradient step denoiser for convergent plug-and-play. In International Conference on Learning Representations, 2021.

[33] Sham M. Kakade and Jason D. Lee. Provably correct automatic sub-differentiation for qualified programs. In Advances in Neural Information Processing Systems, volume 31. Curran Associates, Inc., 2018.

[34] Wonyeol Lee, Hangyeol Yu, Xavier Rival, and Hongseok Yang. On correctness of automatic differentiation for non-differentiable functions. In H. Larochelle, M. Ranzato, R. Hadsell, M. F. Balcan, and H. Lin, editors, Advances in Neural Information Processing Systems, volume 33, pages 6719–6730. Curran Associates, Inc., 2020.

[35] Seppo Linnainmaa. The representation of the cumulative rounding error of an algorithm as a Taylor expansion of the local rounding errors. Master's thesis, Univ. Helsinki, 1970.

[36] Jonathan Lorraine, Paul Vicol, and David Duvenaud. Optimizing millions of hyperparameters by implicit differentiation. In Silvia Chiappa and Roberto Calandra, editors, Proceedings of the Twenty Third International Conference on Artificial Intelligence and Statistics, volume 108 of Proceedings of Machine Learning Research, pages 1540–1552. PMLR, 2020.

[37] Julien Mairal and Bin Yu. Complexity analysis of the lasso regularization path. In Proceedings of the 29th International Conference on Machine Learning, pages 1835–1842, 2012.

[38] Swann Marx and Edouard Pauwels. Path differentiability of ODE flows. arXiv preprint arXiv:2201.03819, 2022.

[39] Sheheryar Mehmood and Peter Ochs. Automatic differentiation of some first-order methods in parametric optimization. In International Conference on Artificial Intelligence and Statistics, pages 1584–1594. PMLR, 2020.

[40] Yurii Nesterov and Arkadii Nemirovskii. Interior-point polynomial algorithms in convex programming. SIAM, 1994.

[41] Peter Ochs, René Ranftl, Thomas Brox, and Thomas Pock. Techniques for gradient-based bilevel optimization with non-smooth lower level problems. Journal of Mathematical Imaging and Vision, 56(2):175–194, 2016.

[42] Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory Chanan, Trevor Killeen, Zeming Lin, Natalia Gimelshein, Luca Antiga, Alban Desmaison, Andreas Kopf, Edward Yang, Zachary DeVito, Martin Raison, Alykhan Tejani, Sasank Chilamkurthy, Benoit Steiner, Lu Fang, Junjie Bai, and Soumith Chintala. PyTorch: An imperative style, high-performance deep learning library. In H. Wallach, H. Larochelle, A. Beygelzimer, F. d'Alché-Buc, E. Fox, and R. Garnett, editors, Advances in Neural Information Processing Systems 32, pages 8024–8035. Curran Associates, Inc., 2019.

[43] Fabian Pedregosa. Hyperparameter optimization with approximate gradient. In International Conference on Machine Learning, pages 737–746. PMLR, 2016.

[44] Boris Polyak. Introduction to optimization. Optimization Software, Publications Division, 1987.

[45] Aravind Rajeswaran, Chelsea Finn, Sham M. Kakade, and Sergey Levine. Meta-learning with implicit gradients. Advances in Neural Information Processing Systems, 32, 2019.

[46] R. Tyrrell Rockafellar and Roger J.-B. Wets. Variational analysis, volume 317. Springer Science & Business Media, 2009.

[47] Halsey Lawrence Royden and Patrick Fitzpatrick. Real analysis, volume 32. Macmillan, New York, 1988.

[48] David E. Rumelhart, Geoffrey E. Hinton, and Ronald J. Williams. Learning representations by back-propagating errors. Nature, 323(6088):533–536, October 1986.

[49] Robert Tibshirani.
Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society: Series B (Methodological), 58(1):267–288, 1996.

[50] Ryan J. Tibshirani. The lasso problem and uniqueness. Electronic Journal of Statistics, 7:1456–1490, 2013.

[51] Ryan J. Tibshirani. Adaptive piecewise polynomial estimation via trend filtering. The Annals of Statistics, 42(1):285–323, 2014.

[52] Martin J. Wainwright, John Lafferty, and Pradeep Ravikumar. High-dimensional graphical model selection using $\ell_1$-regularized logistic regression. In B. Schölkopf, J. Platt, and T. Hoffman, editors, Advances in Neural Information Processing Systems, volume 19. MIT Press, 2006.

[53] R. E. Wengert. A simple automatic derivative evaluation program. Communications of the ACM, 7(8):463–464, August 1964.

## Checklist

1. For all authors...
   - (a) Do the main claims made in the abstract and introduction accurately reflect the paper's contributions and scope? [Yes] Our submission follows the order mentioned in the abstract and introduction.
   - (b) Did you describe the limitations of your work? [Yes] The two main assumptions are discussed in the main text.
   - (c) Did you discuss any potential negative societal impacts of your work? [N/A] We believe that our work, being fully theoretical, does not bring negative societal impacts, especially regarding points 1 to 7 of the ethics guidelines of NeurIPS 2022.
   - (d) Have you read the ethics review guidelines and ensured that your paper conforms to them? [Yes] Points 1 to 5 are satisfied.
2. If you are including theoretical results...
   - (a) Did you state the full set of assumptions of all theoretical results? [Yes] Our analysis requires a small number of assumptions: Assumption 1 on contraction and Assumption 2 on semialgebraic regularity.
   - (b) Did you include complete proofs of all theoretical results? [Yes] Most of the appendices are dedicated to them.
3. If you ran experiments...
   - (a) Did you include the code, data, and instructions needed to reproduce the main experimental results (either in the supplemental material or as a URL)? [Yes] The (self-contained) Jupyter notebook to generate the figures is submitted in the supplemental material.
   - (b) Did you specify all the training details (e.g., data splits, hyperparameters, how they were chosen)? [Yes] See Appendix H.
   - (c) Did you report error bars (e.g., with respect to the random seed after running experiments multiple times)? [Yes] See Appendix H. Every curve reports the median over 100 repetitions, along with first and last deciles.
   - (d) Did you include the total amount of compute and the type of resources used (e.g., type of GPUs, internal cluster, or cloud provider)? [Yes] See Appendix H.
4. If you are using existing assets (e.g., code, data, models) or curating/releasing new assets...
   - (a) If your work uses existing assets, did you cite the creators? [Yes] See Appendix I.
   - (b) Did you mention the license of the assets? [Yes] See Appendix I.
   - (c) Did you include any new assets either in the supplemental material or as a URL? [N/A] No new assets.
   - (d) Did you discuss whether and how consent was obtained from people whose data you're using/curating? [N/A] No nonpublic data used.
   - (e) Did you discuss whether the data you are using/curating contains personally identifiable information or offensive content? [N/A] No personally identifiable information.
5. If you used crowdsourcing or conducted research with human subjects...
   - (a) Did you include the full text of instructions given to participants and screenshots, if applicable? [N/A]
   - (b) Did you describe any potential participant risks, with links to Institutional Review Board (IRB) approvals, if applicable? [N/A]
   - (c) Did you include the estimated hourly wage paid to participants and the total amount spent on participant compensation? [N/A]