# Automatic differentiation of nonsmooth iterative algorithms

Jérôme Bolte (Toulouse School of Economics, University of Toulouse Capitole, Toulouse, France), Edouard Pauwels (IRIT, CNRS, Université de Toulouse; Institut Universitaire de France (IUF), Toulouse, France), Samuel Vaiter (CNRS & Université Côte d'Azur, Laboratoire J. A. Dieudonné, Nice, France).

**Abstract.** Differentiation along algorithms, i.e., piggyback propagation of derivatives, is now routinely used to differentiate iterative solvers in differentiable programming. Asymptotics is well understood for many smooth problems, but the nondifferentiable case is hardly considered. Is there a limiting object for nonsmooth piggyback automatic differentiation (AD)? Does it have any variational meaning and can it be used effectively in machine learning? Is there a connection with classical derivatives? All these questions are addressed under appropriate nonexpansivity conditions in the framework of conservative derivatives, which has proved useful in understanding nonsmooth AD. We characterize the attractor set of nonsmooth piggyback iterations as a set-valued fixed point which remains in the conservative framework. Among various consequences, we obtain almost everywhere convergence of classical derivatives. Our results are illustrated on parametric convex optimization with the forward-backward, Douglas-Rachford and Alternating Direction Method of Multipliers algorithms, as well as the Heavy-Ball method.

## 1 Introduction

[Figure 1 — schematic: the iterates $x_k(\theta)$, their Jacobians $Jx_k(\theta)$, and the question "derivative?"] Figure 1: We study existence and meaning of $J^{\mathrm{pb}}_{\bar x}$ as a derivative of $\bar x$, compatible with automatic differentiation of the iterates $(x_k(\theta))_{k \in \mathbb{N}}$.

**Differentiable programming.** We consider a Lipschitz function $F : \mathbb{R}^p \times \mathbb{R}^m \to \mathbb{R}^p$, representing an iterative algorithm parameterized by $\theta \in \mathbb{R}^m$, with Lipschitz initialization $x_0 : \theta \mapsto x_0(\theta)$ and

$$x_{k+1}(\theta) = F(x_k(\theta), \theta) = F_\theta(x_k(\theta)), \qquad (1)$$

where $F_\theta := F(\cdot, \theta)$, under the assumption that $x_k(\theta)$ converges to the unique fixed point of $F_\theta$: $\bar x(\theta) = \operatorname{fix}(F_\theta)$. Such recursions represent, for instance, algorithms solving a parametric optimization problem $\min_x h_\theta(x)$ (e.g., empirical risk minimization), such as gradient descent: $F(x, \theta) = x - \gamma \nabla h_\theta(x)$. But (1) could also be a fixed-point equation such as a deep equilibrium network [5]. In the last years, a paradigm shift occurred: such algorithms are now implemented in algorithmic differentiation (AD)-friendly frameworks such as TensorFlow [1], PyTorch [42] or JAX [13]. For a differentiable $F$, it is possible to compute iteratively the derivatives of $x_k$ with respect to $\theta$ using the rules of differential calculus, resulting in the so-called piggyback recursion:

$$\frac{\partial x_{k+1}}{\partial \theta}(\theta) = \partial_1 F(x_k(\theta), \theta)\, \frac{\partial x_k}{\partial \theta}(\theta) + \partial_2 F(x_k(\theta), \theta), \qquad (2)$$

where $\frac{\partial x_k}{\partial \theta}$ is the Jacobian of $x_k$ with respect to $\theta$.

[Figure 2 — four columns (Ridge (FB), Lasso (FB), Sparse Inv. Covar. (DR), Trend Filtering (ADMM)), x-axis: iteration $k$] Figure 2: Illustration of the linear convergence of proximal splitting methods. (First line) Distance of the iterates to the fixed point. (Second line) Distance of the piggyback Jacobians to the Jacobian of the fixed point. The acronyms are FB for Forward-Backward, DR for Douglas-Rachford and ADMM for Alternating Direction Method of Multipliers. In all cases, despite nonsmoothness, piggyback Jacobians converge, illustrating Corollary 2. Blue lines represent the median of 100 repetitions with random data, and the blue shaded area represents the first and last deciles.
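To make the smooth recursion (2) concrete, here is a minimal NumPy sketch (an illustrative example, not code from the paper) that propagates the Jacobian of gradient-descent iterates for a ridge-type quadratic, where the parameter $\theta$ is the regularization weight, and compares the result with the closed-form Jacobian of the solution.

```python
import numpy as np

rng = np.random.default_rng(0)
p, n = 5, 20
A, b = rng.standard_normal((n, p)), rng.standard_normal(n)
theta = 1.0
gamma = 1.0 / (np.linalg.norm(A, 2) ** 2 + 1.0)   # fixed step size, < 2/L for theta = 1

def F(x, theta):
    # One gradient step on h_theta(x) = 0.5*||Ax - b||^2 + 0.5*theta*||x||^2.
    return x - gamma * (A.T @ (A @ x - b) + theta * x)

def dF(x, theta):
    # Partial Jacobians of F: A_k = dF/dx and B_k = dF/dtheta (theta is scalar, m = 1).
    Ak = np.eye(p) - gamma * (A.T @ A + theta * np.eye(p))
    Bk = -gamma * x[:, None]
    return Ak, Bk

x, J = np.zeros(p), np.zeros((p, 1))        # x_0(theta) = 0, so J_0 = dx_0/dtheta = 0
for _ in range(500):
    Ak, Bk = dF(x, theta)
    x, J = F(x, theta), Ak @ J + Bk         # piggyback recursion (2)

# Closed-form solution and Jacobian of the ridge problem, for comparison.
H = A.T @ A + theta * np.eye(p)
x_bar = np.linalg.solve(H, A.T @ b)
J_bar = -np.linalg.solve(H, x_bar)[:, None]
print(np.linalg.norm(x - x_bar), np.linalg.norm(J - J_bar))
```

Both printed distances are close to zero, illustrating that the piggyback Jacobian converges to the Jacobian of the limit in this smooth case.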
In practice, automatic differentiation frameworks do not compute the full Jacobian, but rather vector-Jacobian products in reverse mode (backpropagation) [48] or Jacobian-vector products in forward mode [53]. We consider the full Jacobian, and therefore our findings apply to both modes. We focus on two issues arising with nonsmooth recursions, illustrated in Figure 1: (i) what can be said about the chain rule (2) and its asymptotics when the function $F$ is not smooth (for example, a projected gradient step)? (ii) how can its asymptotics be interpreted as a notion of derivative for $\bar x$, the fixed point of $F_\theta$? We propose a joint answer to both questions, providing a solid theoretical ground to the idea of algorithmic differentiation of numerical solvers involving nonsmooth components in a differentiable programming context.

**Related works.** Algorithmic use of the chain rule (2) to differentiate programs has its roots in [53], with forward differentiation, and later in reverse mode [35]. Along with the development of AD, convergence of the iterative sequence (2) was investigated, notably in the optimization community, as reviewed in [28]. This important survey paper gathers results in differentiable programming rediscovered/reused later: implicit differentiation [43, 45] and its stability [8], adjoint fixed-point iterations [5] (a key aspect of deep equilibrium networks) and linear convergence of (2). Notably, linear convergence of Jacobians was investigated in [25, 27] for the forward mode and in [15, Theorem 2.3] for the reverse mode. This was more recently investigated for $C^2$ functions in imaging for primal-dual algorithms [14, 9], and in machine learning for gradient descent [39, 36] and the Heavy-Ball method [39]. In the specific context of min-min problems, the authors in [2] proved the linear convergence of the Jacobians. The use of AD for nonsmooth functions was justified with the notion of conservative Jacobians [12, 11], with a nonsmooth version of the chain rule for compositional models. Correctness of AD was also investigated in [34] for a large class of piecewise analytic functions, and in [33], where a qualification condition is used to compute a Clarke Jacobian. Along with AD, a natural way to differentiate a model (1) is implicit differentiation, recently applied in several works [5, 3, 21]. In a nonsmooth context, an implicit function theorem [10] was proved for path-differentiable functions. In terms of applications, nonsmooth piggyback derivatives are applied to hyperparameter tuning for inverse problems in [8], while the case of the Lasso was investigated in [7]. Other relevant applications include plug-and-play denoising [32], parameter selection [19] and bilevel programming [41].

**Contributions.** Under suitable nonexpansivity assumptions, our contributions are as follows; we address both questions illustrated in Figure 1 for nonsmooth recursions.

- Set-valued extensions of the piggyback recursion (2) have a well-defined limit: the fixed point of a subset map (Theorem 1), which is conservative for the fixed point map $\bar x$. This is a nonsmooth infinite chain rule for AD (Theorem 2).
- For almost all $\theta$, despite nonsmoothness, recursion (2) is well defined using the classical Jacobian and converges to the classical Jacobian of the fixed point $\bar x$ (Corollary 2). This has implications for both forward and reverse modes of AD.
- For a large class of functions (Lipschitz gradient selections), it is possible to give a quantitative rate estimate (Corollary 3), namely to prove linear convergence of the derivatives.
- We show that these results apply to proximal splitting algorithms in nonsmooth convex optimization, including forward-backward (Proposition 2), Douglas-Rachford (Proposition 3) and ADMM; a numerical illustration of the convergence of derivatives is given in Figure 2.
- We also illustrate that, contrary to the smooth case, nonsmooth piggyback derivatives of momentum methods such as Heavy-Ball may diverge even if the iterates converge linearly (Proposition 4).

**Notations.** A function $f : \mathbb{R}^p \to \mathbb{R}^m$ is locally Lipschitz if, for each $x \in \mathbb{R}^p$, there exists a neighborhood of $x$ on which $f$ is Lipschitz. Denoting by $R \subset \mathbb{R}^p$ the full-measure set where $f$ is differentiable, the Clarke Jacobian [16] at $x \in \mathbb{R}^p$ is defined as

$$\operatorname{Jac}^c f(x) = \operatorname{conv}\left\{ M \in \mathbb{R}^{m \times p} :\ \exists\, (y_k)_{k \ge 0} \text{ s.t. } y_k \in R,\ \lim_{k \to \infty} y_k = x,\ \lim_{k \to \infty} \frac{\partial f}{\partial y}(y_k) = M \right\}.$$

The Clarke subdifferential $\partial^c f$ is defined similarly. Given two matrices $A, B$ with compatible dimensions, $[A, B]$ is their concatenation. For a set $X$, $\operatorname{conv} X$ is its convex hull. The symbol $B$ denotes a unit ball; the corresponding norm should be clear from the context.

## 2 Nonsmooth piggyback differentiation

We first show how the notion of conservative Jacobians allows us to justify rigorously the existence of a nonsmooth equivalent of the piggyback iterations in (2) that is compatible with AD.

**Conservative Jacobians.** Conservative Jacobians were introduced in [12] as a generalization of derivatives to study automatic differentiation of nonsmooth functions. Given a locally Lipschitz continuous function $f : \mathbb{R}^p \to \mathbb{R}^m$, the set-valued map $J : \mathbb{R}^p \rightrightarrows \mathbb{R}^{m \times p}$ is a conservative Jacobian for the path-differentiable $f$ if $J$ has a closed graph, is locally bounded and nowhere empty, with

$$\frac{\mathrm{d}}{\mathrm{d}t} f(\gamma(t)) = V \dot\gamma(t) \quad \text{for all } V \in J(\gamma(t)), \text{ for almost every } t, \qquad (4)$$

for any $\gamma : [0, 1] \to \mathbb{R}^p$ absolutely continuous with respect to the Lebesgue measure. Conservative gradients are defined similarly. We refer to [12] for extensive examples and properties of this class of functions; key ideas are recalled in Appendix A for completeness. Let us mention that the classes of convex functions, definable functions, or semialgebraic functions are contained in the set of path-differentiable functions. Given $D_f : \mathbb{R}^p \rightrightarrows \mathbb{R}^p$, a conservative gradient for $f : \mathbb{R}^p \to \mathbb{R}$, we have:

- (Clarke subgradient) for all $x \in \mathbb{R}^p$, $\partial^c f(x) \subset \operatorname{conv}(D_f(x))$.
- (Gradient almost everywhere) $D_f(x) = \{\nabla f(x)\}$ for almost all $x \in \mathbb{R}^p$.
- (Calculus) differential calculus rules preserve conservativity, e.g., sums and compositions of conservative Jacobians are conservative Jacobians.

Finally, $D_f$ can be used as a first-order optimization oracle for methods of gradient type [11].

**Piggyback differentiation of recursive algorithms.** The following assumption is standing throughout the text.

**Assumption 1** (The conservative Jacobian of the iteration mapping is a contraction). $F$ is locally Lipschitz and path differentiable, jointly in $(x, \theta)$, and $J_F$ is a conservative Jacobian for $F$. There exists $0 \le \rho < 1$ such that, for any $(x, \theta) \in \mathbb{R}^p \times \mathbb{R}^m$ and any pair $[A, B] \in J_F(x, \theta)$, with $A \in \mathbb{R}^{p \times p}$ and $B \in \mathbb{R}^{p \times m}$, the operator norm of $A$ is at most $\rho$. $J_{x_0}$ is a conservative Jacobian for the initialization function $\theta \mapsto x_0(\theta)$.

Under Assumption 1, $F_\theta$ is a strict contraction, so that $(x_k(\theta))_{k \in \mathbb{N}}$ converges linearly to $\bar x(\theta) = \operatorname{fix}(F_\theta)$, the unique fixed point of the iteration mapping $F_\theta$. More precisely, for all $k \in \mathbb{N}$, we have
$$\|x_k(\theta) - \bar x(\theta)\| \le \frac{\rho^k}{1 - \rho}\, \|x_0(\theta) - F_\theta(x_0(\theta))\|.$$
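Before introducing the nonsmooth piggyback recursion, here is a one-dimensional illustration of a conservative gradient (our own toy example, not from the paper): the Clarke subdifferential of the absolute value, returned as an interval, is a valid conservative gradient, and an AD framework simply selects one element of it at each call.

```python
def conservative_grad_abs(x):
    """A conservative gradient for f(x) = |x|, returned as an interval (lo, hi).

    Away from 0 it is the singleton {sign(x)}; at 0 it is the Clarke
    subdifferential [-1, 1].  Property (4) holds for any selection: on the set
    of times where an absolutely continuous path sits at the kink, the path's
    derivative vanishes almost everywhere, so any value in [-1, 1] is valid.
    """
    if x > 0:
        return (1.0, 1.0)
    if x < 0:
        return (-1.0, -1.0)
    return (-1.0, 1.0)

def ad_grad_abs(x):
    # The single element an AD framework would typically select (|.|'(0) := 0).
    return 1.0 if x > 0 else (-1.0 if x < 0 else 0.0)

print(conservative_grad_abs(0.0), ad_grad_abs(0.0))   # (-1.0, 1.0) 0.0
```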
Furthermore, for every $k \in \mathbb{N}$, let us define the following set-valued piggyback recursion:

$$J_{x_{k+1}}(\theta) = \left\{ A J + B,\ [A, B] \in J_F(x_k(\theta), \theta),\ J \in J_{x_k}(\theta) \right\}. \qquad \text{(PB)}$$

We will show in Section 3 that (PB) plays the same role as (2) in the nonsmooth setting. Note that one can recursively evaluate a sequence $J_k \in J_{x_k}(\theta)$, $k \in \mathbb{N}$, through operations actually implemented in nonsmooth AD frameworks, as follows:

$$J_{k+1} = A_k J_k + B_k, \quad \text{where } [A_k, B_k] \in J_F(x_k(\theta), \theta). \qquad (5)$$

**Remark 1** (Local contractions). Assumption 1 may be relaxed as follows: for all $\theta$, the fixed point set $\operatorname{fix}(F_\theta)$ is a singleton $\bar x(\theta)$ such that $x_k(\theta) \to \bar x(\theta)$ as $k \to \infty$, and the operator norm condition on $J_F$ in Assumption 1 holds at the point $(\bar x(\theta), \theta)$. By graph closedness of $J_F$, in a neighborhood of $(\bar x(\theta), \theta)$, $F_\theta$ is a strict contraction and the operator norm condition on $J_F$ holds, possibly with a larger contraction factor $\rho < 1$. After finitely many steps, the iterates $(x_k)_{k \in \mathbb{N}}$ remain in this neighborhood, and all our convergence results hold, due to their asymptotic nature.

**Remark 2** (Relation to existing work). For a smooth $F$, a natural conservative Jacobian is the classical one. The hypotheses in [39, 36] for gradient descent ($F$ is $C^1$) are exactly the classical counterpart of Assumption 1. On the other hand, [25, 27, 15] use a more general assumption on the spectral radius, which allows one to treat the Heavy-Ball method, e.g., in [39]. However, this does not generalize to sets of matrices, as shown in Section 5. Hence Assumption 1 bears on the operator norm and not on the spectral radius; this is sharp, contrary to the smooth case.

## 3 Asymptotics of nonsmooth piggyback differentiation

### 3.1 Fixed point of affine iterations

**Gap and Hausdorff distance.** Given two nonempty compact subsets $X, Y$ of $\mathbb{R}^p$, set $\operatorname{gap}(X, Y) = \max_{x \in X} d(x, Y)$, where $d(x, Y) = \min_{y \in Y} \|x - y\|$, and define the Hausdorff distance between $X$ and $Y$ by $\operatorname{dist}(X, Y) = \max(\operatorname{gap}(X, Y), \operatorname{gap}(Y, X))$. Note that $\operatorname{gap}(X, Y) = 0$ if, and only if, $X \subset Y$, whereas $\operatorname{dist}(X, Y) = 0$ if, and only if, $X = Y$. Moreover, $X \subset Y + \operatorname{gap}(X, Y)\, B$, where $B$ is the unit ball. This means that $\operatorname{gap}(X, Y)$ measures the default of inclusion of $X$ in $Y$; see [46, Chapter 4] for more details.

**Affine iterations by packets of matrices.** Let $\mathcal{J} \subset \mathbb{R}^{p \times (p + m)}$ be a compact subset of matrices such that any matrix of the form $[A, B] \in \mathcal{J}$, with $A \in \mathbb{R}^{p \times p}$, is such that $A$ has operator norm at most $\rho < 1$. We let $\mathcal{J}$ act naturally on the matrices of size $p \times m$: $\mathcal{J} : \mathbb{R}^{p \times m} \rightrightarrows \mathbb{R}^{p \times m}$ denotes the map from $\mathbb{R}^{p \times m}$ to subsets of $\mathbb{R}^{p \times m}$ defined, for each $X \in \mathbb{R}^{p \times m}$, by $\mathcal{J}(X) = \{ AX + B,\ [A, B] \in \mathcal{J} \}$. This defines a set-valued map through, for any $\mathcal{X} \subset \mathbb{R}^{p \times m}$,

$$\mathcal{J}(\mathcal{X}) = \{ AX + B,\ [A, B] \in \mathcal{J},\ X \in \mathcal{X} \}. \qquad (6)$$

Recursions of the form (PB) generate sequences $(\mathcal{X}_k)_{k \in \mathbb{N}}$ of subsets of $\mathbb{R}^{p \times m}$ satisfying

$$\mathcal{X}_{k+1} = \mathcal{J}(\mathcal{X}_k) \qquad \forall k \in \mathbb{N}. \qquad (7)$$

The following is an instance of the Banach-Picard theorem (whose proof is recalled in Appendix B).

**Theorem 1** (Set-valued affine contractions). Let $\mathcal{J} \subset \mathbb{R}^{p \times (p + m)}$ be a compact subset of matrices as above, with $\rho < 1$. Then there is a unique nonempty compact set $\operatorname{fix}(\mathcal{J}) \subset \mathbb{R}^{p \times m}$ satisfying $\operatorname{fix}(\mathcal{J}) = \mathcal{J}(\operatorname{fix}(\mathcal{J}))$, where the action of $\mathcal{J}$ is given in (6). Let $(\mathcal{X}_k)_{k \in \mathbb{N}}$ be a sequence of compact subsets of $\mathbb{R}^{p \times m}$, such that $\mathcal{X}_0 \neq \emptyset$, satisfying the recursion (7). We have, for all $k \in \mathbb{N}$,
$$\operatorname{dist}(\mathcal{X}_k, \operatorname{fix}(\mathcal{J})) \le \frac{\rho^k}{1 - \rho}\, \operatorname{dist}(\mathcal{X}_0, \mathcal{J}(\mathcal{X}_0)),$$
where $\operatorname{dist}$ is the Hausdorff distance associated with the Euclidean norm on $p \times m$ matrices.
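A small numerical illustration of Theorem 1 (our own sketch, with arbitrary matrices; not from the paper): starting from $\mathcal{X}_0 = \{0\}$ and iterating (7) with a packet of two matrices whose $A$-blocks have operator norm $\rho = 0.6$, the Hausdorff distance between successive iterates decreases geometrically, so $\mathcal{X}_k$ approaches the fixed set $\operatorname{fix}(\mathcal{J})$.

```python
import numpy as np
from itertools import product

rng = np.random.default_rng(1)
p, m, rho = 2, 2, 0.6

# A compact packet J of two matrices [A_i, B_i] with ||A_i||_op = rho < 1.
packet = []
for _ in range(2):
    A = rng.standard_normal((p, p))
    A *= rho / np.linalg.norm(A, 2)
    B = rng.standard_normal((p, m))
    packet.append((A, B))

def apply_J(X_set):
    # Set action (6)-(7): J(X) = {A X + B, [A, B] in J, X in X}.
    return [A @ X + B for (A, B), X in product(packet, X_set)]

def hausdorff(X_set, Y_set):
    X = np.array(X_set).reshape(len(X_set), -1)
    Y = np.array(Y_set).reshape(len(Y_set), -1)
    D = np.linalg.norm(X[:, None, :] - Y[None, :, :], axis=-1)
    return max(D.min(axis=1).max(), D.min(axis=0).max())

X = [np.zeros((p, m))]                  # X_0 = {0}
for k in range(10):
    X_next = apply_J(X)
    print(k, len(X_next), round(hausdorff(X, X_next), 6))
    X = X_next
# dist(X_k, X_{k+1}) shrinks roughly by a factor rho at each step, as in Theorem 1.
```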
### 3.2 An infinite chain rule and its consequences

Define the following set-valued map, based on the fix operator from Theorem 1:
$$J^{\mathrm{pb}}_{\bar x} : \theta \rightrightarrows \operatorname{fix}\left[ J_F(\bar x(\theta), \theta) \right],$$
where $\bar x(\theta)$ is the unique fixed point of the algorithmic recursion. Since $\bar x(\theta) = \operatorname{fix}(F_\theta)$, we have equivalently that $J^{\mathrm{pb}}_{\bar x}$ is the fixed point of the Jacobian at the fixed point:
$$J^{\mathrm{pb}}_{\bar x} : \theta \rightrightarrows \operatorname{fix}\left[ J_F(\operatorname{fix}(F_\theta), \theta) \right].$$
We have the following result (proved in Appendix C) and a consequence of Theorem 1.

**Theorem 2** (A conservative mapping for the fixed point map). Under Assumption 1, $J^{\mathrm{pb}}_{\bar x}$ is well defined and is a conservative Jacobian for the fixed point map $\bar x$.

**Corollary 1** (Convergence of the piggyback derivatives). Under Assumption 1, for all $\theta$, the recursion (PB) satisfies
$$\lim_{k \to \infty} \operatorname{gap}\big( J_{x_k}(\theta),\, J^{\mathrm{pb}}_{\bar x}(\theta) \big) = 0. \qquad (8)$$

Unrolling the expression of $J_{x_K}$, using (6) and (7), we can rewrite (8) with a compositional product:
$$\lim_{K \to \infty} \operatorname{gap}\Big( \big( J_F(x_{K-1}(\theta), \theta) \circ \cdots \circ J_F(x_0(\theta), \theta) \big)\big( J_{x_0}(\theta) \big),\; J^{\mathrm{pb}}_{\bar x}(\theta) \Big) = 0,$$
where each packet $J_F(x_k(\theta), \theta)$ acts on sets of matrices as in (6). In plain words, this is a limit-derivative exchange result: asymptotically, the gap between the automatic differentiation of $x_K$ and the derivative of the limit is zero. In particular, the recursion (5) produces bounded sequences whose accumulation points are in $J^{\mathrm{pb}}_{\bar x}$. Since conservative Jacobians equal classical Jacobians almost everywhere [12], we have convergence of classical derivatives.

**Corollary 2** (Convergence a.e. of the classical piggyback derivatives). Under Assumption 1, for almost all $\theta$, the classical Jacobian $\frac{\partial x_k}{\partial \theta}(\theta)$ is well defined for all $k$ and converges towards the classical Jacobian of $\bar x$; that is,
$$\frac{\partial x_k}{\partial \theta}(\theta) \xrightarrow[k \to \infty]{} \frac{\partial \bar x}{\partial \theta}(\theta), \quad \text{for almost all } \theta.$$

**Remark 3** (Connection to implicit differentiation). The authors in [10] proved a qualification-free version of the implicit function theorem. Assuming that, for every $[A, B] \in J_F(\bar x(\theta), \theta)$, the matrix $I - A$ is invertible, the map
$$J^{\mathrm{imp}}_{\bar x} : \theta \rightrightarrows \left\{ (I - A)^{-1} B,\ [A, B] \in J_F(\bar x(\theta), \theta) \right\}$$
is a conservative Jacobian for $\bar x$. Under Assumption 1, one has $J^{\mathrm{imp}}_{\bar x}(\theta) \subset J^{\mathrm{pb}}_{\bar x}(\theta)$ for all $\theta$. Unfortunately, as soon as $F$ is not differentiable, the inclusion may be strict; see details in Appendix D.

### 3.3 Consequence for algorithmic differentiation

Given $k \in \mathbb{N}$, $\dot\theta \in \mathbb{R}^m$ and $w_k \in \mathbb{R}^p$, the following algorithms compute $\dot x_k = J_k \dot\theta$ using the forward mode of automatic differentiation (Jacobian-vector products, JVP), and $\bar\theta_k^T = w_k^T J_k$ using the reverse mode of automatic differentiation (vector-Jacobian products, VJP). In a compositional model, $\dot\theta$ is the derivative of an inner function controlling the algorithm parameters $\theta$ with respect to another real variable $\lambda \in \mathbb{R}$, for example a hyperparameter. The goal is to combine $\frac{\partial \theta(\lambda)}{\partial \lambda}$ and $\frac{\partial x_k(\theta)}{\partial \theta}$ with the chain rule in a forward pass to obtain the total derivative $\frac{\partial x_k(\theta(\lambda))}{\partial \lambda}$. On the other hand, in a compositional model, $w_k$ is typically the gradient of an outer loss function $\ell$ evaluated at $x_k(\theta)$. In this case the goal is to combine the derivatives of the iterates $\frac{\partial x_k(\theta)}{\partial \theta}$ with $w_k = \frac{\partial \ell(x_k)}{\partial x_k}$ in a backward pass to obtain $\frac{\partial \ell(x_k(\theta))}{\partial \theta}$.

**Algorithm 1** (Algorithmic differentiation of recursion (1), forward and reverse modes).

*Input:* $k \in \mathbb{N}$, $\theta \in \mathbb{R}^m$, $\dot\theta \in \mathbb{R}^m$, $w_k \in \mathbb{R}^p$, initialization function $x_0(\theta)$, recursion function $F(x, \theta)$, conservative Jacobians $J_F(x, \theta)$ and $J_{x_0}(\theta)$. *Initialize:* $x_0 = x_0(\theta) \in \mathbb{R}^p$.

*Forward mode (JVP):*
- $\dot x_0 = J \dot\theta$, with $J \in J_{x_0}(\theta)$.
- For $i = 1, \ldots, k$: $x_i = F(x_{i-1}, \theta)$ and $\dot x_i = A_{i-1} \dot x_{i-1} + B_{i-1} \dot\theta$, with $[A_{i-1}, B_{i-1}] \in J_F(x_{i-1}, \theta)$.
- Return $\dot x_k$.

*Reverse mode (VJP):*
- Forward pass: for $i = 1, \ldots, k$: $x_i = F(x_{i-1}, \theta)$. Set $\bar\theta = 0$.
- Backward pass: for $i = k, \ldots, 1$: pick $[A_{i-1}, B_{i-1}] \in J_F(x_{i-1}, \theta)$, then set $\bar\theta \leftarrow \bar\theta + B_{i-1}^T w_i$ and $w_{i-1} = A_{i-1}^T w_i$.
- Return $\bar\theta_k = \bar\theta + J^T w_0$, with $J \in J_{x_0}(\theta)$.
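A minimal Python rendering of Algorithm 1 (a sketch under the assumption that the user supplies `F` and an oracle `dF` returning one element $[A, B]$ of $J_F$, e.g. as in the earlier ridge sketch; this is not the paper's implementation).

```python
def piggyback_jvp(F, dF, x0, dx0, theta, dtheta, k):
    """Forward mode (JVP) of Algorithm 1: returns (x_k, dx_k) with dx_k = J_k dtheta.

    dF(x, theta) must return one pair [A, B] of the conservative Jacobian
    J_F(x, theta); dx0 is J @ dtheta for some J in J_{x_0}(theta).
    """
    x, dx = x0, dx0
    for _ in range(k):
        A, B = dF(x, theta)
        x, dx = F(x, theta), A @ dx + B @ dtheta
    return x, dx

def piggyback_vjp(F, dF, x0, J0, theta, w_k, k):
    """Reverse mode (VJP) of Algorithm 1: returns theta_bar with theta_bar^T = w_k^T J_k.

    J0 is one element of J_{x_0}(theta), the Jacobian of the initialization.
    """
    xs = [x0]
    for _ in range(k):                 # forward pass, storing the iterates
        xs.append(F(xs[-1], theta))
    theta_bar, w = 0.0, w_k
    for i in range(k, 0, -1):          # backward pass
        A, B = dF(xs[i - 1], theta)
        theta_bar = theta_bar + B.T @ w
        w = A.T @ w
    return theta_bar + J0.T @ w        # contribution of the initialization
```

For instance, with `F` and `dF` as in the ridge sketch of the introduction, `piggyback_jvp(F, dF, np.zeros(p), np.zeros(p), theta, np.ones(1), 500)[1]` approximates the derivative of the solution with respect to the regularization weight.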
**Proposition 1** (Convergence of VJP and JVP). Let $k \in \mathbb{N}$, $\dot\theta \in \mathbb{R}^m$, $w_k \in \mathbb{R}^p$, $x_k \in \mathbb{R}^p$, $\dot x_k \in \mathbb{R}^p$, $\bar\theta_k \in \mathbb{R}^m$ be as in Algorithm 1, under Assumption 1. Then, for almost all $\theta \in \mathbb{R}^m$, $\dot x_k \to \frac{\partial \bar x}{\partial \theta}(\theta)\, \dot\theta$. Assume furthermore that, as $k \to \infty$, $w_k \to w$ (for example, $w_k = \nabla \ell(x_k)$ for a $C^1$ loss $\ell$); then, for almost all $\theta \in \mathbb{R}^m$, $\bar\theta_k^T \to w^T \frac{\partial \bar x}{\partial \theta}(\theta)$.

**Remark 4.** In addition to Proposition 1, in both cases, for all $\theta$, all accumulation points of $\dot x_k$ and $\bar\theta_k^T$ are elements of $J^{\mathrm{pb}}_{\bar x}(\theta)\, \dot\theta$ and $w^T J^{\mathrm{pb}}_{\bar x}(\theta)$, respectively. This is a consequence of Corollary 2 combined with algorithmic differentiation arguments; the proof is given in Appendix D.

### 3.4 Linear convergence rate for semialgebraic piecewise smooth selection functions

Semialgebraic functions are ubiquitous in machine learning (piecewise polynomials, $\ell_1$ and $\ell_2$ norms, determinant, matrix rank, ...). We refer the reader to [11] for a thorough discussion of their extensions and use in machine learning. For more technical details, see [17, 18] for introductory material on semialgebraic and o-minimal geometry.

**Lipschitz gradient selection functions.** Let $F : \mathbb{R}^p \to \mathbb{R}^q$ be semialgebraic and continuous. We say that $F$ has a Lipschitz gradient selection $(s, F_1, \ldots, F_m)$ if $s : \mathbb{R}^p \to \{1, \ldots, m\}$ is semialgebraic and there exists $L > 0$ such that, for $i = 1, \ldots, m$, $F_i : \mathbb{R}^p \to \mathbb{R}^q$ is semialgebraic with $L$-Lipschitz Jacobian, and for all $x \in \mathbb{R}^p$, $F(x) = F_{s(x)}(x)$. For any $x \in \mathbb{R}^p$, set $I(x) = \{ i \in \{1, \ldots, m\},\ F(x) = F_i(x) \}$. The set-valued map $J^s_F : \mathbb{R}^p \rightrightarrows \mathbb{R}^{q \times p}$ given by
$$J^s_F : x \rightrightarrows \operatorname{conv}\left\{ \frac{\partial F_i}{\partial x}(x),\ i \in I(x) \right\}$$
is a conservative Jacobian for $F$, as shown in [11]. Here $\frac{\partial F_i}{\partial x}$ denotes the classical Jacobian of $F_i$. Let us stress that such a structure is ubiquitous in applications [11, 34].

**Rate of convergence.** We may now strengthen Corollary 1 by proving the linear convergence of the piggyback derivatives towards the fixed point. The following is a consequence of the fact that the proposed selection conservative Jacobians of Lipschitz gradient selection functions are Lipschitz-like (Lemma 4 in Appendix E.1). Note that semialgebraicity is only used as a sufficient condition to ensure conservativity of the selection Jacobian together with this Lipschitz-like property. It could be relaxed if these can be guaranteed by other means; in particular, one could consider the broader class of definable functions in order to handle log-likelihood data-fitting terms.

**Corollary 3** (Linear convergence of piggyback derivatives). In addition to Assumption 1, assume that $F$ has a Lipschitz gradient selection structure as above. Then, for any $\theta$ and any $\epsilon > 0$, there exists $C > 0$ such that the recursion (PB) with $J_F = J^s_F$ satisfies, for all $k \in \mathbb{N}$,
$$\operatorname{gap}\big( J_{x_k}(\theta),\, J^{\mathrm{pb}}_{\bar x}(\theta) \big) \le C\,(\sqrt{\rho} + \epsilon)^k.$$
Moreover, the classical Jacobians in Corollary 2 converge at a linear rate for almost all $\theta$.
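To illustrate the selection structure on a familiar example (our choice, not the paper's), the scalar soft-thresholding operator is a selection of the three smooth pieces $x - \lambda$, $0$ and $x + \lambda$; the sketch below evaluates the active set $I(x)$ and the selection Jacobian $J^s_F(x)$, here an interval of slopes.

```python
def soft_threshold(x, lam):
    # F(x) = prox_{lam |.|}(x), a selection of three smooth pieces.
    return max(x - lam, 0.0) if x >= 0 else min(x + lam, 0.0)

def selection_jacobian(x, lam, tol=1e-12):
    """Selection conservative Jacobian J^s_F(x) for scalar soft-thresholding.

    Pieces: F_1(x) = x - lam, F_2(x) = 0, F_3(x) = x + lam, with derivatives
    1, 0, 1.  The active set I(x) collects the pieces equal to F(x), and
    J^s_F(x) is the convex hull of their derivatives, returned as an interval.
    """
    fx = soft_threshold(x, lam)
    pieces = [(x - lam, 1.0), (0.0, 0.0), (x + lam, 1.0)]
    active = [slope for value, slope in pieces if abs(value - fx) <= tol]
    return (min(active), max(active))

print(selection_jacobian(2.0, 1.0))    # (1.0, 1.0): smooth branch
print(selection_jacobian(0.3, 1.0))    # (0.0, 0.0): interior of the flat part
print(selection_jacobian(1.0, 1.0))    # (0.0, 1.0): kink, conv{0, 1}
```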
## 4 Application to proximal splitting methods in convex optimization

Consider the composite parametric convex optimization problem, where $\theta \in \mathbb{R}^m$ represents parameters and $x \in \mathbb{R}^p$ is the decision variable,
$$\bar x(\theta) \in \operatorname*{arg\,min}_{x}\ f(x, \theta) + g(x, \theta).$$
The purpose of this section is to construct examples of functions $F$ used in recursion (1), based on known algorithms. The following assumption will be standing throughout the section.

**Assumption 2.** $f$ is semialgebraic, convex, and its gradient with respect to $x$ for fixed $\theta$, $\nabla_x f$, is locally Lipschitz jointly in $(x, \theta)$ and $L$-Lipschitz in $x$ for fixed $\theta$. Semialgebraicity implies that $\nabla_x f$ is path differentiable jointly in $(x, \theta)$; we denote by $J^2_f$ its Clarke Jacobian. The function $g$ is semialgebraic, convex in $x$ for fixed $\theta$, and lower semicontinuous. For all $\gamma > 0$, we assume that $G : (x, \theta) \mapsto \operatorname{prox}_{\gamma g(\cdot, \theta)}(x)$ is locally Lipschitz jointly in $(x, \theta)$. Semialgebraicity implies that it is also path differentiable jointly in $(x, \theta)$; we denote by $J_G$ its Clarke Jacobian.

This assumption covers a very large diversity of problems in convex optimization, as most gradient and prox operations used in practice are semialgebraic (or definable). Under Assumption 2, we will provide sufficient conditions on $f$ and $g$ for Assumption 1 to hold, for different algorithmic recursions. These will therefore imply convergence as stated in Corollaries 1 and 2 and Proposition 1, as well as Corollary 3 in the piecewise selection case. The proofs are postponed to Appendix F.

### 4.1 Splitting algorithms

In this section we provide sufficient conditions for Assumption 1 to hold. The underlying conservative Jacobian is obtained by combining Clarke Jacobians of elementary algorithmic operations (gradient and proximal operator in Assumption 2), using the compositional rules of differential calculus [11] and implicit differentiation [10]. Using [12], such Jacobians are conservative by semialgebraicity, and their combination provides conservative Jacobians for the corresponding algorithmic recursion $F$. These objects are explicitly constructed in Appendix F.

**Forward-backward algorithm.** The forward-backward iterations are given, for $\gamma > 0$, by
$$x_{k+1} = \operatorname{prox}_{\gamma g(\cdot, \theta)}\big( x_k - \gamma \nabla_x f(x_k, \theta) \big). \qquad (10)$$

**Proposition 2.** Under Assumption 2 with $0 < \gamma < 2/L$, denote by $F : \mathbb{R}^p \times \mathbb{R}^m \to \mathbb{R}^p$ the forward-backward recursion in (10). For $\mu > 0$, if either $f$ or $g$ is $\mu$-strongly convex in $x$ for all $\theta$, then $F_\theta$ is a strict contraction and Assumption 1 holds.

It is well known that, if $f$ is $\mu$-strongly convex, choosing $\gamma = 2/(L + \mu)$ provides a contraction factor $\rho_{\mathrm{FB}} = (\kappa - 1)/(\kappa + 1)$, where $\kappa = L/\mu \ge 1$ is the condition number of the problem.

**Douglas-Rachford algorithm.** Given $\gamma > 0$, the algorithm iterates
$$y_{k+1} = \tfrac{1}{2}\big( I + R_{\gamma f(\cdot, \theta)}\, R_{\gamma g(\cdot, \theta)} \big)\, y_k, \qquad (11)$$
where $R_{\gamma f(\cdot, \theta)} = 2\operatorname{prox}_{\gamma f(\cdot, \theta)} - I$ is the reflected proximal operator, which is 1-Lipschitz (and similarly for $g$). Following [6, Theorem 26.11], if the problem has a minimizer, then $(y_k)_{k \in \mathbb{N}}$ converges to a fixed point $\bar y$ of (11) such that $\bar x = \operatorname{prox}_{\gamma g(\cdot, \theta)}(\bar y)$ is a solution to the optimization problem. Following [26, Theorem 1], if $f$ is strongly convex, then $R_{\gamma f(\cdot, \theta)}$ is $\rho$-Lipschitz for some $\rho < 1$, and our differentiation result applies to Douglas-Rachford splitting in this setting.

**Proposition 3.** Under Assumption 2 with $\gamma > 0$, denote by $F : \mathbb{R}^p \times \mathbb{R}^m \to \mathbb{R}^p$ the Douglas-Rachford recursion in (11). If $f$ is $\mu$-strongly convex in $x$ for all $\theta$, then $F_\theta$ is a strict contraction and Assumption 1 holds.

Following [26, Proposition 3], choosing $\gamma = 1/\sqrt{L\mu}$ provides a contraction factor of order $\rho_{\mathrm{DR}} = (\sqrt{\kappa} - 1)/(\sqrt{\kappa} + 1) \approx \sqrt{\rho_{\mathrm{FB}}}$, where again $\kappa = L/\mu$ is the condition number of the problem. In this respect, Douglas-Rachford iterations provide a faster asymptotic rate than those of forward-backward, which may also impact the convergence of derivatives in the context of Corollary 3.
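Before turning to ADMM, here is an illustrative NumPy sketch (not the paper's code) of piggyback differentiation through the forward-backward iterations (10), for $f(x, \theta) = \tfrac12\|Ax - b\|_2^2$ with an injective design (so that $f$ is strongly convex and Proposition 2 applies) and $g(x, \theta) = \theta\|x\|_1$; the Clarke Jacobian of the soft-thresholding prox is diagonal, and the selection $0$ is used at kinks.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 50, 10
A, b = rng.standard_normal((n, p)), rng.standard_normal(n)
L = np.linalg.norm(A, 2) ** 2              # Lipschitz constant of grad_x f
gamma, theta = 1.0 / L, 0.5                # step size and regularization weight

def prox_and_jac(z, t):
    """Soft-thresholding prox of t*||.||_1 with one element of its Clarke Jacobian.

    Returns (prox(z), d prox / d z as a diagonal, d prox / d t); the selection
    0 is used at kinks, as a nonsmooth AD framework typically would.
    """
    x = np.sign(z) * np.maximum(np.abs(z) - t, 0.0)
    dz = (np.abs(z) > t).astype(float)     # 1 on the smooth branches, 0 elsewhere
    dt = -np.sign(z) * dz                  # derivative w.r.t. the threshold t
    return x, dz, dt

x, J = np.zeros(p), np.zeros(p)            # J approximates dx_k/dtheta (m = 1 here)
for _ in range(2000):
    z = x - gamma * (A.T @ (A @ x - b))    # forward (gradient) step
    dz_dx = np.eye(p) - gamma * (A.T @ A)
    px, dpz, dpt = prox_and_jac(z, gamma * theta)
    # Piggyback step (5): J_{k+1} = A_k J_k + B_k, with A_k = diag(dpz) @ dz_dx
    # and B_k = gamma * dpt, since theta only enters through the threshold.
    x, J = px, dpz * (dz_dx @ J) + gamma * dpt

print(x)   # approximate solution x_bar(theta)
print(J)   # approximate derivative of the solution with respect to theta
```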
**Alternating Direction Method of Multipliers.** Consider the separable convex problem
$$\min_{u, v}\ \psi(u) + \varphi(v) \quad \text{subject to} \quad A u + B v = c. \qquad (12)$$
The alternating direction method of multipliers (ADMM) algorithm combines two partial minimizations of an augmented Lagrangian and a dual update: for a penalty parameter $\gamma > 0$,
$$\begin{aligned}
u_{k+1} &\in \operatorname*{arg\,min}_{u}\ \psi(u) + \tfrac{\gamma}{2}\, \| A u + B v_k - c + x_k \|^2, \\
v_{k+1} &\in \operatorname*{arg\,min}_{v}\ \varphi(v) + \tfrac{\gamma}{2}\, \| A u_{k+1} + B v - c + x_k \|^2, \\
x_{k+1} &= x_k + \big( A u_{k+1} + B v_{k+1} - c \big).
\end{aligned}$$
As observed in [23], the ADMM algorithm can be seen as the Douglas-Rachford splitting method applied to the Fenchel dual of problem (12) (see Appendix F.3 for more details). More precisely, ADMM updates are equivalent to Douglas-Rachford iterations applied to the Fenchel dual of (12), written as the sum of two convex functions of the dual variable $x$, the first one playing the role of $f(\cdot, \theta)$ and the second that of $g(\cdot, \theta)$ in (11). Therefore, if $\varphi$ is strongly convex with Lipschitz gradient and $A$ is injective, then ADMM converges linearly, and one is able to combine derivatives of proximal operators to differentiate ADMM.

### 4.2 Numerical illustrations

We now detail how Figure 2, discussed in the introduction, is obtained, and how it illustrates our theoretical results. We consider four scenarios (Ridge, Lasso, Sparse inverse covariance selection and Trend filtering), corresponding to the four columns. For each of them, the first line shows the empirical linear rate of the iterates $x_k$ and the second line shows the empirical linear rate of the derivatives $\frac{\partial x_k}{\partial \theta}$. All experiments are repeated 100 times and we report the median along with the first and last deciles.

**Forward-backward for the Ridge.** The Ridge estimator is defined, for $\theta > 0$, as
$$\bar x(\theta) = \operatorname*{arg\,min}_{x \in \mathbb{R}^p}\ \tfrac{1}{2}\|Ax - b\|_2^2 + \tfrac{\theta}{2}\|x\|_2^2.$$
Among several possibilities to solve it, one can use the forward-backward algorithm applied to $f : (x, \theta) \mapsto \tfrac{1}{2}\|Ax - b\|_2^2$ and $g : (x, \theta) \mapsto \tfrac{\theta}{2}\|x\|_2^2$. Since $g$ is strongly convex in $x$, Proposition 2 may be applied.

**Forward-backward algorithm for the Lasso.** Consider the forward-backward algorithm applied to the Lasso problem [49], with parameter $\theta > 0$,
$$\bar x(\theta) \in \operatorname*{arg\,min}_{x \in \mathbb{R}^p}\ \tfrac{1}{2}\|Ax - b\|_2^2 + \theta\|x\|_1 = \operatorname*{arg\,min}_{x}\ \tfrac{1}{2L}\|Ax - b\|_2^2 + \tfrac{\theta}{L}\|x\|_1,$$
where $L$ is any upper bound on the operator norm of $A^T A$. The gradient of the quadratic part is 1-Lipschitz, so we may consider the forward-backward algorithm (10) with unit step size, $f : (x, \theta) \mapsto \tfrac{1}{2L}\|Ax - b\|_2^2$ and $g : (x, \theta) \mapsto \tfrac{\theta}{L}\|x\|_1$. A well-known qualification condition involving a generalized support at optimality ensures uniqueness of the Lasso solution [20, 37]. It holds for generic problem data [50]. Following [10, Proposition 5], under this qualification condition, the conservative Jacobian of $F_\theta$ (with respect to $x$) is such that, at the solution $\bar x(\theta)$, the matrix set $I - J_{F_\theta}(\bar x(\theta))$ only contains invertible matrices. This means that there exists $\rho < 1$ such that any $M \in J_{F_\theta}(\bar x(\theta))$ has operator norm at most $\rho$. Following Remark 1, all our convergence results apply qualitatively. Note that we recover the results of [7, Proposition 2] for the Lasso.

**Douglas-Rachford for the Sparse Inverse Covariance Selection.** The sparse inverse covariance selection [52, 22] reads
$$\bar x(\theta) \in \operatorname*{arg\,min}_{x \in \mathbb{R}^{n \times n}}\ \operatorname{tr}(Cx) - \log\det x + \theta \sum_{i,j} |x_{i,j}|,$$
where $C$ is a symmetric positive matrix and $\theta > 0$. It is possible to apply the Douglas-Rachford method to $f : (x, \theta) \mapsto \operatorname{tr}(Cx) - \log\det x$ and $g : (x, \theta) \mapsto \theta\|x\|_{1,1}$. It is known that $f$ is locally strongly convex; indeed, $x \mapsto -\log\det x$ is the standard self-concordant barrier in semidefinite programming [40]. Following Remark 1, all our convergence results apply qualitatively.

**ADMM for Trend Filtering.** Introduced in [51] in statistics as a generalization of the total variation, the trend filtering estimator with observation $\theta \in \mathbb{R}^p$ reads
$$\bar x(\theta) = \operatorname*{arg\,min}_{x \in \mathbb{R}^p}\ \tfrac{1}{2}\|x - \theta\|_2^2 + \lambda \|D^{(k)} x\|_1,$$
where $D^{(k)}$ is a forward finite-difference approximation of a differential operator of order $k$ (here $k = 2$). Using $\psi : u \mapsto \lambda\|u\|_1$, $\varphi : v \mapsto \tfrac{1}{2}\|v - \theta\|_2^2$ (strongly convex), $A = I$ (injective), $B = D^{(k)}$, and $c = 0$, we can apply the ADMM to solve trend filtering.
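To complement Figure 2 with a runnable toy example (our own sketch on a small $\ell_1$-regularized least-squares problem, not one of the paper's four experiments), the following code piggybacks through the Douglas-Rachford iterations (11), combining the Jacobians of the two reflected proximal operators, and recovers the derivative of the solution with respect to the regularization weight $\theta$.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 30, 8
A, b = rng.standard_normal((n, p)), rng.standard_normal(n)
gamma, theta = 0.5, 0.4
M = np.linalg.inv(np.eye(p) + gamma * (A.T @ A))    # prox of gamma*f is affine

def reflected_prox_f(y):
    # R_{gamma f}(y) = 2 prox_{gamma f}(y) - y for f(x) = 0.5*||Ax - b||^2.
    px = M @ (y + gamma * (A.T @ b))
    return 2.0 * px - y, 2.0 * M - np.eye(p)         # value and Jacobian in y

def reflected_prox_g(y, t):
    # R_{gamma g}(y) for g(x, theta) = theta*||x||_1, with threshold t = gamma*theta.
    st = np.sign(y) * np.maximum(np.abs(y) - t, 0.0)
    d = (np.abs(y) > t).astype(float)                # Clarke selection 0 at kinks
    return 2.0 * st - y, np.diag(2.0 * d - 1.0), -2.0 * np.sign(y) * d

y, J = np.zeros(p), np.zeros(p)                      # DR variable and dy_k/dtheta
for _ in range(3000):
    rg, Jg_y, Jg_t = reflected_prox_g(y, gamma * theta)
    rf, Jf_y = reflected_prox_f(rg)
    # Piggyback (5) on y_{k+1} = 0.5*(y_k + R_f(R_g(y_k, theta))):
    Ak = 0.5 * (np.eye(p) + Jf_y @ Jg_y)
    Bk = 0.5 * (Jf_y @ (gamma * Jg_t))               # theta enters via t = gamma*theta
    y, J = 0.5 * (y + rf), Ak @ J + Bk

# Solution x_bar = prox_{gamma g}(y_bar) and its derivative with respect to theta.
d = (np.abs(y) > gamma * theta).astype(float)
x_bar = np.sign(y) * np.maximum(np.abs(y) - gamma * theta, 0.0)
dx_dtheta = d * J - gamma * np.sign(y) * d
print(x_bar, dx_dtheta)
```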
## 5 Failure of automatic differentiation for inertial methods

This section focuses on the Heavy-Ball method for strongly convex objectives, in its global linear convergence regime. For $C^2$ objectives, piggyback derivatives converge to the derivative of the solution map [28, 39, 36]. However, we provide a $C^{1,1}$ strongly convex parametric objective with path-differentiable derivative such that the piggyback derivatives of the Heavy-Ball algorithm contain diverging vectors for a given parameter value. In this example, other conservative differentiation means (implicit differentiation, piggyback on gradient descent) avoid this divergent behavior.

### 5.1 Heavy-ball algorithm and global convergence

Consider a function $f : \mathbb{R}^p \times \mathbb{R}^m \to \mathbb{R}$, a step size $\alpha > 0$ and a momentum parameter $\beta \ge 0$; for simplicity, when the second argument is fixed, we write $f_\theta : x \mapsto f(x, \theta)$. Set, for all $x, y, \theta$,
$$F(x, y, \theta) = \big( x - \alpha \nabla f_\theta(x) + \beta (x - y),\ x \big),$$
and consider the Heavy-Ball algorithm $(x_{k+1}, y_{k+1}) = F(x_k, y_k, \theta)$ for $k \in \mathbb{N}$. If $f_\theta$ is $\mu$-strongly convex with $L$-Lipschitz gradient, then, choosing $\alpha = 1/L$ and a suitable momentum $\beta$, the algorithm converges globally at a linear rate to the unique solution $\bar x(\theta)$ [24, Theorem 4]; local convergence is due to Polyak [44]. Furthermore, if in addition $f_\theta$ is $C^2$, forward propagation of derivatives converges to the derivative of the solution [28, 29, 39].

### 5.2 A diverging Jacobian accumulation

Details and proof of the following result are given in Appendix G.

**Proposition 4** (Piggyback differentiation fails for the Heavy-Ball method). Consider $f : \mathbb{R}^2 \to \mathbb{R}$ such that, for all $\theta \in \mathbb{R}$, $f(x, \theta) = x^2/2$ if $x \ge 0$ and $f(x, \theta) = x^2/8$ if $x \le 0$. Assume that $\alpha = 1$ and $\beta = 3/4$. Then the Heavy-Ball algorithm converges globally to $0$ and $\nabla f$ is path differentiable. The Clarke Jacobian of $F$ with respect to $(x, y)$ at $(0, 0, 0)$ is $J_F(0, 0, 0) = \operatorname{conv}\{M_1, M_2\}$, where $M_1$ and $M_2$ are the Jacobians of $F$ associated with the two branches of $\nabla f_\theta$ at $0$ (slopes $1$ and $1/4$), and the product $M_1 M_1 M_2 M_2$ has an eigenvalue of modulus $9/8$.

The presence of an eigenvalue with modulus greater than 1 may produce divergence in (PB). Set
$$f_1 : (x, \theta) \mapsto \begin{cases} x^2/2 & \text{if } x \ge 0, \\ x^2/8 & \text{if } x < 0, \end{cases} \qquad f_2 : (x, \theta) \mapsto \begin{cases} x^2/2 & \text{if } x > 0, \\ x^2/8 & \text{if } x \le 0. \end{cases}$$
Note that $f_1$ and $f_2$ are both equivalent to $f$, as they implement the same function. With initializations $x_0(\theta) = y_0(\theta) = \theta$, we run a few iterations of the Heavy-Ball algorithm for $\theta = 0$, and implement (PB) alternating between two steps on $f_1$ and two steps on $f_2$, differentiating the resulting sequence $(x_k)_{k \in \mathbb{N}}$ with respect to $\theta$ using algorithmic differentiation. The divergence phenomenon predicted by Proposition 4 is illustrated in Figure 3, while the true derivative is $0$ (the sequence is constant).

[Figure 3 — two panels, x-axis: iteration $k$] Figure 3: Behaviour of automatic differentiation for first-order methods on a quadratic function. (Left) Stability of the propagation of derivatives for fixed step-size gradient descent. (Right) Instability of the propagation of derivatives for Heavy-Ball. Both methods are initialized at the optimum.
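A compact reimplementation of this experiment (our own sketch with the branch Jacobians written out by hand, not the paper's notebook): at $\theta = 0$ the iterates stay at the kink, the piggyback Jacobian is multiplied alternately by the two branch matrices, and its norm blows up at rate $(9/8)^{k/4}$, although the solution map is constant.

```python
import numpy as np

alpha, beta = 1.0, 0.75                     # Heavy-Ball step size and momentum

def grad_f1(x):
    # Derivative of f_1 in x: the branch x^2/2 (slope 1) is selected at x = 0.
    return x if x >= 0 else x / 4.0

def grad_f2(x):
    # Same function, but the branch x^2/8 (slope 1/4) is selected at x = 0.
    return x if x > 0 else x / 4.0

def branch_jacobian(slope):
    # Jacobian of (x, y) -> (x - alpha*grad f(x) + beta*(x - y), x) on one branch.
    return np.array([[1.0 - alpha * slope + beta, -beta], [1.0, 0.0]])

M1, M2 = branch_jacobian(1.0), branch_jacobian(0.25)

x, y = 0.0, 0.0                             # iterates at theta = 0 stay at the kink
J = np.array([1.0, 1.0])                    # d(x_0, y_0)/dtheta for x_0 = y_0 = theta
for k in range(200):
    use_f1 = (k // 2) % 2 == 0              # two steps on f_1, then two on f_2
    g = grad_f1(x) if use_f1 else grad_f2(x)
    x, y = x - alpha * g + beta * (x - y), x
    J = (M1 if use_f1 else M2) @ J          # piggyback recursion (5), here B_k = 0
    if (k + 1) % 40 == 0:
        print(k + 1, np.linalg.norm(J))     # grows like (9/8)^(k/4)
```

Replacing the Heavy-Ball update by a plain gradient step keeps the propagated derivative bounded, matching the left panel of Figure 3.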
## 6 Conclusion

We have developed a flexible theoretical framework to describe the convergence of piggyback differentiation applied to nonsmooth recursions, providing, in particular, a rigorous meaning to the differentiation of nonsmooth solvers. The relevance of our approach is illustrated on composite convex optimization through widely used methods such as the forward-backward, Douglas-Rachford and ADMM algorithms. Our framework, however, allows one to consider many other abstract algorithmic recursions, and thus provides theoretical ground for more general problems such as variational inequalities or saddle point problems, as in [14, 9]. As a matter for future work, we shall consider relaxing Assumption 1 to study a wider class of methods, e.g., when $F_\theta$ is not a strict contraction.

## Acknowledgments

J. B. and E. P. acknowledge the financial support of the AI Interdisciplinary Institute ANITI under grant agreement ANR-19-PI3A-0004, of the Air Force Office of Scientific Research, Air Force Material Command, USAF, under grant numbers FA9550-19-1-7026 and FA8655-22-1-7012, and of ANR MaSDOL 19-CE23-0017-01. J. B. also acknowledges the support of ANR Chess, grant ANR-17-EURE-0010, TSE-P and the Centre Lagrange. S. V. acknowledges the support of ANR GraVa, grant ANR-18-CE40-0005.

## References

[1] Martín Abadi, Ashish Agarwal, Paul Barham, Eugene Brevdo, Zhifeng Chen, Craig Citro, Greg S. Corrado, Andy Davis, Jeffrey Dean, Matthieu Devin, Sanjay Ghemawat, Ian Goodfellow, Andrew Harp, Geoffrey Irving, Michael Isard, Yangqing Jia, Rafal Jozefowicz, Lukasz Kaiser, Manjunath Kudlur, Josh Levenberg, Dandelion Mané, Rajat Monga, Sherry Moore, Derek Murray, Chris Olah, Mike Schuster, Jonathon Shlens, Benoit Steiner, Ilya Sutskever, Kunal Talwar, Paul Tucker, Vincent Vanhoucke, Vijay Vasudevan, Fernanda Viégas, Oriol Vinyals, Pete Warden, Martin Wattenberg, Martin Wicke, Yuan Yu, and Xiaoqiang Zheng. TensorFlow: Large-scale machine learning on heterogeneous systems, 2015. Software available from tensorflow.org.

[2] Pierre Ablin, Gabriel Peyré, and Thomas Moreau. Super-efficiency of automatic differentiation for functions defined as a minimum. In Hal Daumé III and Aarti Singh, editors, Proceedings of the 37th International Conference on Machine Learning, volume 119 of Proceedings of Machine Learning Research, pages 32–41. PMLR, 2020.

[3] Akshay Agrawal, Brandon Amos, Shane Barratt, Stephen Boyd, Steven Diamond, and J. Zico Kolter. Differentiable convex optimization layers. In H. Wallach, H. Larochelle, A. Beygelzimer, F. d'Alché-Buc, E. Fox, and R. Garnett, editors, Advances in Neural Information Processing Systems, volume 32. Curran Associates, Inc., 2019.

[4] Charalambos D. Aliprantis and Kim C. Border. Infinite Dimensional Analysis: A Hitchhiker's Guide. Springer Science & Business Media, 2006.

[5] Shaojie Bai, J. Zico Kolter, and Vladlen Koltun. Deep equilibrium models. Advances in Neural Information Processing Systems, 32, 2019.

[6] Heinz H. Bauschke, Patrick L. Combettes, et al. Convex analysis and monotone operator theory in Hilbert spaces, volume 408. Springer, 2011.

[7] Quentin Bertrand, Quentin Klopfenstein, Mathieu Blondel, Samuel Vaiter, Alexandre Gramfort, and Joseph Salmon. Implicit differentiation of lasso-type models for hyperparameter optimization. In International Conference on Machine Learning, pages 810–821. PMLR, 2020.

[8] Mathieu Blondel, Quentin Berthet, Marco Cuturi, Roy Frostig, Stephan Hoyer, Felipe Llinares-López, Fabian Pedregosa, and Jean-Philippe Vert. Efficient and modular implicit differentiation, 2021.

[9] Lea Bogensperger, Antonin Chambolle, and Thomas Pock. Convergence of a piggyback-style method for the differentiation of solutions of standard saddle-point problems. Technical report, 2022.

[10] Jérôme Bolte, Tam Le, Edouard Pauwels, and Antonio Silveti-Falls. Nonsmooth implicit differentiation for machine learning and optimization. In Advances in Neural Information Processing Systems, 2021.

[11] Jérôme Bolte and Edouard Pauwels. A mathematical model for automatic differentiation in machine learning. In Conference on Neural Information Processing Systems, Vancouver, Canada, 2020.

[12] Jérôme Bolte and Edouard Pauwels. Conservative set valued fields, automatic differentiation, stochastic gradient method and deep learning. Mathematical Programming, 2020.
[13] James Bradbury, Roy Frostig, Peter Hawkins, Matthew James Johnson, Chris Leary, Dougal Maclaurin, George Necula, Adam Paszke, Jake VanderPlas, Skye Wanderman-Milne, and Qiao Zhang. JAX: composable transformations of Python+NumPy programs, 2018.

[14] Antonin Chambolle and Thomas Pock. Learning consistent discretizations of the total variation. SIAM Journal on Imaging Sciences, 14(2):778–813, 2021.

[15] Bruce Christianson. Reverse accumulation and attractive fixed points. Optimization Methods and Software, 3(4):311–326, 1994.

[16] Frank H. Clarke. Optimization and nonsmooth analysis. Canadian Mathematical Society Series of Monographs and Advanced Texts. John Wiley & Sons, Inc., New York, 1983. A Wiley-Interscience Publication.

[17] Michel Coste. An introduction to o-minimal geometry. Istituti editoriali e poligrafici internazionali, Pisa, 2000.

[18] Michel Coste. An introduction to semialgebraic geometry, 2000.

[19] Charles-Alban Deledalle, Samuel Vaiter, Jalal M. Fadili, and Gabriel Peyré. Stein Unbiased GrAdient estimator of the Risk (SUGAR) for multiple parameter selection. SIAM Journal on Imaging Sciences, 7(4):2448–2487, 2014.

[20] Bradley Efron, Trevor Hastie, Iain Johnstone, Robert Tibshirani, et al. Least angle regression. Annals of Statistics, 32(2):407–499, 2004.

[21] Laurent El Ghaoui, Fangda Gu, Bertrand Travacca, Armin Askari, and Alicia Tsai. Implicit deep learning. SIAM Journal on Mathematics of Data Science, 3(3):930–958, 2021.

[22] Jerome Friedman, Trevor Hastie, and Robert Tibshirani. Sparse inverse covariance estimation with the graphical lasso. Biostatistics, 9(3):432–441, 2007.

[23] Daniel Gabay. Applications of the method of multipliers to variational inequalities. In Studies in Mathematics and its Applications, volume 15, pages 299–331. Elsevier, 1983.

[24] Euhanna Ghadimi, Hamid Reza Feyzmahdavian, and Mikael Johansson. Global convergence of the heavy-ball method for convex optimization. In 2015 European Control Conference (ECC), pages 310–315. IEEE, 2015.

[25] Jean Charles Gilbert. Automatic differentiation and iterative processes. Optimization Methods and Software, 1(1):13–21, 1992.

[26] Pontus Giselsson and Stephen Boyd. Linear convergence and metric selection for Douglas-Rachford splitting and ADMM. IEEE Transactions on Automatic Control, 62(2):532–544, 2016.

[27] Andreas Griewank, Christian Bischof, George Corliss, Alan Carle, and Karen Williamson. Derivative convergence for iterative equation solvers. Optimization Methods and Software, 2(3-4):321–355, 1993.

[28] Andreas Griewank and Christèle Faure. Piggyback differentiation and optimization. In Large-scale PDE-constrained optimization, pages 148–164. Springer, 2003.

[29] Andreas Griewank and Andrea Walther. Evaluating derivatives: principles and techniques of algorithmic differentiation. SIAM, 2008.

[30] Charles R. Harris, K. Jarrod Millman, Stéfan J. van der Walt, Ralf Gommers, Pauli Virtanen, David Cournapeau, Eric Wieser, Julian Taylor, Sebastian Berg, Nathaniel J. Smith, Robert Kern, Matti Picus, Stephan Hoyer, Marten H. van Kerkwijk, Matthew Brett, Allan Haldane, Jaime Fernández del Río, Mark Wiebe, Pearu Peterson, Pierre Gérard-Marchant, Kevin Sheppard, Tyler Reddy, Warren Weckesser, Hameer Abbasi, Christoph Gohlke, and Travis E. Oliphant. Array programming with NumPy. Nature, 585(7825):357–362, September 2020.

[31] J. D. Hunter. Matplotlib: A 2D graphics environment. Computing in Science & Engineering, 9(3):90–95, 2007.

[32] Samuel Hurault, Arthur Leclaire, and Nicolas Papadakis.
Gradient step denoiser for convergent plug-and-play. In International Conference on Learning Representations, 2021.

[33] Sham M. Kakade and Jason D. Lee. Provably correct automatic sub-differentiation for qualified programs. In Advances in Neural Information Processing Systems, volume 31. Curran Associates, Inc., 2018.

[34] Wonyeol Lee, Hangyeol Yu, Xavier Rival, and Hongseok Yang. On correctness of automatic differentiation for non-differentiable functions. In H. Larochelle, M. Ranzato, R. Hadsell, M. F. Balcan, and H. Lin, editors, Advances in Neural Information Processing Systems, volume 33, pages 6719–6730. Curran Associates, Inc., 2020.

[35] Seppo Linnainmaa. The representation of the cumulative rounding error of an algorithm as a Taylor expansion of the local rounding errors. Master's thesis, Univ. Helsinki, 1970.

[36] Jonathan Lorraine, Paul Vicol, and David Duvenaud. Optimizing millions of hyperparameters by implicit differentiation. In Silvia Chiappa and Roberto Calandra, editors, Proceedings of the Twenty Third International Conference on Artificial Intelligence and Statistics, volume 108 of Proceedings of Machine Learning Research, pages 1540–1552. PMLR, 2020.

[37] Julien Mairal and Bin Yu. Complexity analysis of the lasso regularization path. In Proceedings of the 29th International Conference on Machine Learning, pages 1835–1842, 2012.

[38] Swann Marx and Edouard Pauwels. Path differentiability of ODE flows. arXiv preprint arXiv:2201.03819, 2022.

[39] Sheheryar Mehmood and Peter Ochs. Automatic differentiation of some first-order methods in parametric optimization. In International Conference on Artificial Intelligence and Statistics, pages 1584–1594. PMLR, 2020.

[40] Yurii Nesterov and Arkadii Nemirovskii. Interior-point polynomial algorithms in convex programming. SIAM, 1994.

[41] Peter Ochs, René Ranftl, Thomas Brox, and Thomas Pock. Techniques for gradient-based bilevel optimization with non-smooth lower level problems. Journal of Mathematical Imaging and Vision, 56(2):175–194, 2016.

[42] Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory Chanan, Trevor Killeen, Zeming Lin, Natalia Gimelshein, Luca Antiga, Alban Desmaison, Andreas Kopf, Edward Yang, Zachary DeVito, Martin Raison, Alykhan Tejani, Sasank Chilamkurthy, Benoit Steiner, Lu Fang, Junjie Bai, and Soumith Chintala. PyTorch: An imperative style, high-performance deep learning library. In H. Wallach, H. Larochelle, A. Beygelzimer, F. d'Alché-Buc, E. Fox, and R. Garnett, editors, Advances in Neural Information Processing Systems 32, pages 8024–8035. Curran Associates, Inc., 2019.

[43] Fabian Pedregosa. Hyperparameter optimization with approximate gradient. In International Conference on Machine Learning, pages 737–746. PMLR, 2016.

[44] Boris Polyak. Introduction to optimization. Optimization Software, Publications Division, 1987.

[45] Aravind Rajeswaran, Chelsea Finn, Sham M. Kakade, and Sergey Levine. Meta-learning with implicit gradients. Advances in Neural Information Processing Systems, 32, 2019.

[46] R. Tyrrell Rockafellar and Roger J.-B. Wets. Variational analysis, volume 317. Springer Science & Business Media, 2009.

[47] Halsey Lawrence Royden and Patrick Fitzpatrick. Real analysis, volume 32. Macmillan, New York, 1988.

[48] David E. Rumelhart, Geoffrey E. Hinton, and Ronald J. Williams. Learning representations by back-propagating errors. Nature, 323(6088):533–536, October 1986.

[49] Robert Tibshirani.
Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society: Series B (Methodological), 58(1):267–288, 1996.

[50] Ryan J. Tibshirani. The lasso problem and uniqueness. Electronic Journal of Statistics, 7:1456–1490, 2013.

[51] Ryan J. Tibshirani. Adaptive piecewise polynomial estimation via trend filtering. The Annals of Statistics, 42(1):285–323, 2014.

[52] Martin J. Wainwright, John Lafferty, and Pradeep Ravikumar. High-dimensional graphical model selection using $\ell_1$-regularized logistic regression. In B. Schölkopf, J. Platt, and T. Hoffman, editors, Advances in Neural Information Processing Systems, volume 19. MIT Press, 2006.

[53] R. E. Wengert. A simple automatic derivative evaluation program. Communications of the ACM, 7(8):463–464, August 1964.

## Checklist

1. For all authors...
   - (a) Do the main claims made in the abstract and introduction accurately reflect the paper's contributions and scope? [Yes] Our submission follows the order mentioned in the abstract and introduction.
   - (b) Did you describe the limitations of your work? [Yes] The two main assumptions are discussed in the main text.
   - (c) Did you discuss any potential negative societal impacts of your work? [N/A] We believe that our work, being fully theoretical, does not bring negative societal impacts, especially regarding points 1 to 7 of the ethics guidelines of NeurIPS 2022.
   - (d) Have you read the ethics review guidelines and ensured that your paper conforms to them? [Yes] Points 1 to 5 are satisfied.
2. If you are including theoretical results...
   - (a) Did you state the full set of assumptions of all theoretical results? [Yes] Our analysis requires a small number of assumptions: Assumption 1 on contraction and Assumption 2 on semialgebraic regularity.
   - (b) Did you include complete proofs of all theoretical results? [Yes] Most of the appendices are dedicated to them.
3. If you ran experiments...
   - (a) Did you include the code, data, and instructions needed to reproduce the main experimental results (either in the supplemental material or as a URL)? [Yes] The (self-contained) Jupyter notebook to generate the figures is submitted in the supplemental material.
   - (b) Did you specify all the training details (e.g., data splits, hyperparameters, how they were chosen)? [Yes] See Appendix H.
   - (c) Did you report error bars (e.g., with respect to the random seed after running experiments multiple times)? [Yes] See Appendix H. Every curve reports the median over 100 repetitions, along with first and last deciles.
   - (d) Did you include the total amount of compute and the type of resources used (e.g., type of GPUs, internal cluster, or cloud provider)? [Yes] See Appendix H.
4. If you are using existing assets (e.g., code, data, models) or curating/releasing new assets...
   - (a) If your work uses existing assets, did you cite the creators? [Yes] See Appendix I.
   - (b) Did you mention the license of the assets? [Yes] See Appendix I.
   - (c) Did you include any new assets either in the supplemental material or as a URL? [N/A] No new assets.
   - (d) Did you discuss whether and how consent was obtained from people whose data you're using/curating? [N/A] No nonpublic data used.
   - (e) Did you discuss whether the data you are using/curating contains personally identifiable information or offensive content? [N/A] No personally identifiable information.
5. If you used crowdsourcing or conducted research with human subjects...
   - (a) Did you include the full text of instructions given to participants and screenshots, if applicable? [N/A]
   - (b) Did you describe any potential participant risks, with links to Institutional Review Board (IRB) approvals, if applicable? [N/A]
   - (c) Did you include the estimated hourly wage paid to participants and the total amount spent on participant compensation? [N/A]