# on_the_complexity_of_nonsmooth_automatic_differentiation__f86327eb.pdf

Published as a conference paper at ICLR 2023

ON THE COMPLEXITY OF NONSMOOTH AUTOMATIC DIFFERENTIATION

J erˆome Bolte1,2, Ryan Boustany1,2,4, Edouard Pauwels 2,3 & B eatrice Pesquet-Popescu 4

1 Toulouse School of Economics 2 Universit e de Toulouse 3 IRIT, CNRS. Institut Universitaire de France (IUF). 4 Thales LAS France {jerome.bolte, ryan.boustany}@ut-capitole.fr, edouard.pauwels@irit.fr, beatrice.pesquet@thalesgroup.com

Using the notion of conservative gradient, we provide a simple model to estimate the computational costs of the backward and forward modes of algorithmic differentiation for a wide class of nonsmooth programs. The overhead complexity of the backward mode turns out to be independent of the dimension when using programs with locally Lipschitz semi-algebraic or definable elementary functions. This considerably extends Baur-Strassen s smooth cheap gradient principle. We illustrate our results by establishing fast backpropagation results of conservative gradients through feedforward neural networks with standard activation and loss functions. Nonsmooth backpropagation s cheapness contrasts with concurrent forward approaches, which have, to this day, dimensional-dependent worst-case overhead estimates. We provide further results suggesting the superiority of backward propagation of conservative gradients. Indeed, we relate the complexity of computing a large number of directional derivatives to that of matrix multiplication, and we show that finding two subgradients in the Clarke subdifferential of a function is an NP-hard problem.

1 INTRODUCTION

Automatic evaluation of derivatives: Algorithmic differentiation (AD) appeared around 60 years ago (Beda et al. (1959); Wengert (1964)), and has been since then constantly developed and used in many contexts, see Griewank et al. (1989); Griewank and Walther (2008) for a thorough discussion. Today, it is at the core of modern learning architectures (Rumelhart et al., 1986; Le Cun et al., 2015; Baydin et al., 2018), to the point that training a neural network (NN) is ultimately a way to combine the outputs of AD. There are many practical and theoretical developments available nowadays: flexible and efficient numerical libraries (Abadi et al., 2016; Paszke et al., 2019; Bradbury et al., 2018), an implicit differentiation theory (Griewank and Faure, 2003; Griewank and Walther, 2008) and its extensions (Agrawal et al., 2019; Bai et al., 2019; Bolte et al., 2021; Blondel et al., 2021), the adjoint method (Farrell et al., 2013; Pearlmutter, 1995; Plessix, 2006) with application to neural ODEs (Chen et al., 2018), piggyback style differentiation of optimization algorithms (Griewank and Faure, 2003; Mehmood and Ochs, 2020; Bertrand et al., 2020; Lorraine et al., 2020), or differentiation of conjugate gradient algorithms (Gratton et al., 2014).

Backward algorithmic differentiation, or backpropagation, plays a particular role when smooth optimization tasks are at stake, as it evaluates the gradient of a function with a cost proportional to that of function evaluations, independently of dimension. This property, called the cheap gradient principle (Wolfe, 1982; Griewank and Walther, 2008), is at the root of the machine learning libraries revolution. According to the key complexity theory version of this result due to Baur and Strassen (1983), arithmetic complexity of the evaluation of the derivative of a rational function is at most 5 times the complexity of function evaluation. Extensions exist for smooth differentiable functions Baur and Strassen (1983); Griewank and Walther (2008) but standard computational practice of AD consists of little known about the nonsmooth case.

Published as a conference paper at ICLR 2023

The objective of this paper is precisely to present a simple, general, nonsmooth cheap conservative principle and to explore other complexity results for evaluating nonsmooth derivatives. This extends the cheap gradient principle of smooth AD to the path differentiable world Bolte and Pauwels (2020b) which includes semi-algebraic and more generally definable functions Coste (2000a;b), a class that contains the vast majority of machine learning programs used in practice, see for example Bolte and Pauwels (2020b).

Nonsmooth AD & computational complexity: Sorting values, pooling data, thresholding functions, or determining closest points are some of the most essential numerical decision operations. They are ubiquitous in machine learning and modern optimization. All of them are nonsmooth, and most of them have a very desirable feature: they are cheap to compute, much cheaper than smoothed surrogates. For instance, the famous Re LU activation in deep learning, whose role is to threshold to zero negative values to allow for the inactivity of neurons, requires only one bit of encoding in theory. On the other hand, other nonlinear activations potentially require auxiliary algorithms for their evaluation, incurring a higher computational cost. This simplicity of use also comes with the issue of finding an adequate way of training models and, thus differentiating objects.

The standard computational practice of AD consists in applying differential calculus rules directly to nonsmooth objects, replacing gradients by surrogates, typically Clarke subgradients. This is how AD is performed within Tensor Flow, Py Torch or Jax. This approach has shown tremendous success (Le Cun et al., 2015) and has been massively used for the last 10 years. Yet, despite this empirical success, Barton et al. claimed in Barton et al. (2018) that there does not seem to exist [at this day] a true analogous reverse AD mode to compute generalized derivatives for nonsmooth functions , illustrating the difficulty of nonsmooth AD. Conservative gradients were introduced as a faithful mathematical model capturing the formal application of calculus rules to subdifferentials by Bolte and Pauwels (2020a;b); Bolte et al. (2021). The author unfamiliar with this notion may reduce, in a ML context, conservative gradients to outputs of calculus rules formally applied to Clarke subgradients and Jacobians. Our goal is to provide an adequate computational complexity theory for conservative calculus, a theory that will therefore match standard practical approaches.

Among other possible first-order options offered by nonsmooth calculus, we also investigate the properties of directional derivatives and those of the Clarke subdifferential. For directional derivatives, our motivation comes from the fact that this nonsmooth operation has general calculus rules, while the Clarke subdifferential is central in terms of variational interpretation.

Contributions: The main thesis of this work is that conservative gradients have computational properties similar to smooth derivatives, which are much more favorable than those of alternative nonsmooth oracles such as subgradients or directional derivatives.

We provide a simple computational model for addressing the question of complexity theory of nonsmooth numerical programs. For the backward mode, we prove a cheap conservative gradient principle a la Baur-Strassen, generalizing state of the art to nonsmooth programs modeling most NNs. We establish that, regardless of dimension, the computational cost of a conservative gradient is of the order of that of function evaluation. Our results provide a theoretical validation of the fact that the cost of backpropagation does not depend on the programs smoothness. For the forward mode, we relate the computational cost of p directional derivatives to that of p ˆ p matrix multiplication. We provide lower complexity bounds that illustrate the limits to which this deficiency may be improved. This applies to existing nonsmooth AD frameworks (Khan and Barton, 2012; 2013). We establish that computing two distinct elements in the Clarke subdifferential of a given point is NP-hard for simple Re LU programs. This result also applies to the lexicographic subdifferential. In contrast, we show that the problem can be solved in polynomial time for conservative gradients. This reflects the computational difficulty of dealing with the Clarke subdifferential. A result of independent interest: deciding differentiability of a Re LU program at a point is NP-hard.

Relation with existing work: Conservative gradients were introduced in Bolte and Pauwels (2020a;b) to model formal subdifferentiation used by practitioners and nonsmooth backpropagation . They were further studied in Lewis and Tian (2021); Davis and Drusvyatskiy (2021); Bolte et al. (2021) and empirically investigated in Bertoin et al. (2021). Computational complexity was

Published as a conference paper at ICLR 2023

only qualitatively considered. We provide a rigorous description of this aspect based an arithmetic computational cost framework capturing programming with nondifferentiable components. The quest for a computationally cheap nonsmooth derivative has a long history in AD literature. Existing works of Griewank (Griewank and Walther, 2008; Griewank, 2013; Griewank and Rojas, 2019; Griewank and Walther, 2020) are essentially based on piecewise smoothness structures (Scholtes, 2012). A cheap subgradient principle was also given in Kakade and Lee (2018), but it requires a very strong qualification condition. As illustrated in Griewank and Rojas (2019), such qualification conditions can be computationally hard to check in practice.

In another research line, based on chain rules for directional derivatives, Khan-Barton (Khan and Barton, 2012; 2013; 2015; Barton et al., 2018) studied the vector forward mode AD. In particular, they investigated the forward AD framework to evaluate elements of the lexicographic subdifferential (see Nesterov (2005)), which is contained in the Clarke subdifferential. In the worst case, the computational overhead ratio they obtain is proportional to the ambient dimension. This contrasts with our cheap gradient principle, whose constant is dimension-less. While these contributions are most relevant to nonsmooth AD, their applicability to large-scale learning models is limited, due to the central role of forward AD.

Organization of the paper: We introduce elements of nonsmooth analysis and, in particular, the notion of conservative gradient used throughout this work in Section 2. Section 3 describes a general model of computation that allows one to express the computational cost and complexity of programs, functions and their conservative gradients. This section also presents an abstract program algorithmic differentiation framework. These elements are gathered in Section 4 which presents our extension of the Baur-Strassen result with the cheap conservative gradient principle and its illustrations. To conclude, in Section 5, we describe computational lower bounds for evaluating directional derivatives and distinct subgradients for simple programs.

2 NONSMOOTH GENERALIZED GRADIENTS

They are fundamental to expressing variations of nonsmooth losses in Machine Learning. Given a locally Lipschitz continuous function F : Rp Ñ R, the Clarke subdifferential of F is

Bc Fpxq conv " lim kÑ 8 Fpxkq : xk P diff F , xk ÝÑ kÑ 8 x * (1)

where diff F is the full measure set where F is differentiable and F is the standard gradient (Clarke, 1983). The subdifferential is set-valued, which we write Bc F : Rp Ñ Rp. For each x P Rp, elements of Bc Fpxq are called Clarke subgradients of F. A selection d in Bc F, is a function d: Rp Ñ Rp such that for all x P Rp, dpxq P Bc Fpxq. If F is C1 then Bc F t Fu everywhere so the only possible selection is d F. We will manipulate derived dictionaries, which typically provide a selection in either the Clarke subdifferential, or more general set-valued maps.

Example 1 For Re LU: t ÞÑ maxp0, tq, we have Bc Re LUptq is t0u if t ă 0, t1u if t ą 0 and r0, 1s if t 0. We may define the function Re LU1 as a selection in Bc Re LU :

Re LU1ptq 1, if t ą 0, Re LU1ptq 0, otherwise.

The chain-rule, essential to AD, generally fails for Clarke subgradients. This is why we now consider the more flexible notion of conservative gradients.

Definition 1 (Conservative gradient) Let F : Rp Ñ R be a locally Lipschitz continuous function and DF : Rp Ñ Rp a locally bounded, nonempty and graph closed set-valued map. Then DF is a conservative gradient for F, if for any absolutely continuous curve γ : r0, 1s Ñ Rp,

d dt Fpγptqq xv, 9γptqy @v P DF pγptqq, for almost all t P r0, 1s. (2)

In this case, F is called path differentiable. Conservative Jacobians are defined similarly. As in Section 2, d: Rp Ñ Rp is a selection of DF if dpxq P DF pxq for all x P Rp.

Published as a conference paper at ICLR 2023

A rich class of path differentiable functions is given by locally Lipschitz continuous semi-algebraic functions with the Clarke subdifferential as a conservative gradient. Actually, virtually all functions used in machine learning are path differentiable (Bolte and Pauwels, 2020a;b). The most salient facts about path differentiable functions and their conservative gradients are:

(Clarke subgradient), for all x P Rp, Bc Fpxq Ă convp DF pxqq. (Gradient almost everywhere) Conservative gradients are gradients a.e (Bolte and Pauwels, 2020a). (First-order oracle) Selection in conservative gradients can be used as surrogate gradients, while preserving convergence guaranties (Bolte and Pauwels, 2020a;b; Bolte et al., 2021).

Conservative Jacobians can be composed while preserving conservativity (Bolte and Pauwels, 2020a), a feature which do not enjoy Clarke Jacobians: let F : Rp Ñ Rm, G: Rm Ñ Rl be locally Lipschitz continuous mappings, d F : Rp Ñ Rmˆp and d G : Rm Ñ Rlˆm be selections in conservative Jacobians for F and G respectively. Then the product mapping x ÞÑ d Gp Fpxqqˆd F pxq is a selection in a conservative Jacobian for G F. The use of conservative Jacobians provides a very convenient framework to model AD in the nonsmooth case, see Bolte and Pauwels (2020a;b).

A fundamental theorem is the following:

Theorem 1 (Path differentiable functions are ubiquitous) (Bolte and Pauwels (2020a)) Locally Lipchitz semialgebraic (or definable) functions are path differentiable.

3 PROGRAMS, COMPLEXITY AND AUTOMATIC DIFFERENTIATION

3.1 CALCULUS MODEL, PROGRAMS, COMPUTATIONAL COST AND COMPLEXITY

A dictionary D is a finite set of real functions (e.g. t , , ˆ, {u), it is paired with P0p Dq, a set of elementary programs implementing them in real arithmetic. Starting from P0p Dq, we aim at capturing the notion of program of programs at any depth. As this is an inductive process, we call k P N a program level , which is simply an induction counter needed for consistency. Recursively, programs of level k 1, in Pk 1p Dq, consist of combinations of outputs of programs of level k, in Pkp Dq. For example if P1 and P2 are elementary programs in P0p Dq, then the program which sums the outputs of P1 and P0 is of level 1. More precisely:

Let p, q be input and output sizes respectively and m ě p q a memory size. A predecessor relation is a set valued map pr: t1, . . . , mu Ñ t1, . . . , mu such that for i 1, . . . , m for j P prpiq, j ă i. prpiq is empty if i ď p and nonempty otherwise. An adapted program sequence pgiqm i p 1 in Pkp Dq, is a set of programs such that gi has |prpiq| input arguments and a single output, for all i p 1, . . . , m. Given p, q, m, pr, pgiqm i p 1 , the program given in Algorithm 1 is a level k 1 program on D .

Algorithm 1: Program data: p, q, m, pr, pgiqm i p 1 .

Input: x px1, . . . xpq

1: for i p 1, p 2, . . . m do 2: xi gipxprpiqq where 3: xprpiq pxjqj Pprpiq. 4: end for Return: y : pxjqm j m q 1.

The set of programs with dictionary D is Pp Dq Ť

kě0 Pkp Dq. We shall see however that Pkp Dq P1p Dq for all k, using modification of the computational graph.

A cost on a dictionary D is a nonnegative function on D, it extends additively by induction on programs on D through the rule costp Pq řm i 1 costpgiq where P is a program on D as described in Algorithm 1. A direct example is the dictionary of arithmetic functions t , , ˆ, {u, together with addition or multiplication by fixed constants, denoted by c and ˆc respectively1, see also Section A.1. Throughout the paper, we assume that dictionaries contain at least operations and ˆ.

Each program on D may be represented by a program in P1p Dq with the same cost, by expanding all subprograms until they reduce to an elementary program. Cost evaluation is thus well defined on such programs. As detailed in Appendix A.1, this model of computation is equivalently expressed using directed acyclic graphs.

1Constants need to be distinguished from variables (for instance to define a polynomial)

Published as a conference paper at ICLR 2023

To sum up, we have defined the set of programs Pp Dq on D, which includes programs of programs. The programs gi in Algorithm 1 may be taken in Pp Dq. The cost of a program is evaluated through the calls it makes to elementary programs in the dictionary.

Programs vs functions: A program P defines a unique input-output function f: we say that P computes f, or implements f, and with a slight abuse of notation, we will identify P and f when there is no ambiguity (e.g. derivative of P). We use the equivalence relation to relate programs computing the same function. The equivalence classes correspond to functions expressible by programs with a given dictionary D. Given a function f : Rp Ñ Rq and a program P on dictionary D, with p inputs and q outputs, we write f r Ps to denote the fact that P is in the equivalence class of programs computing f, that is, P implements f.

Complexity of a function: The complexity of a function f over a dictionary D is the quantity comppf, Dq inf tcostp Pq, s.t P P Pp Dq, f r Psu, the infimum being over all programs implementing f on dictionary D. It could be infinite, if it is finite then it is attained.

3.2 AUTOMATIC DIFFERENTIATION

We pertain to programs implementing functions, that is Algorithm 1 with single outputs q 1.

Given a dictionary D of locally Lipschitz path differentiable functions, a derived dictionary is a set of functions D1 Ą D which extends D and contains operations required to express at least an element in a conservative gradient for each of the functions in D, for example, an element in the Clarke subdifferential. We also consider a cost function on D1, which we denote by cost and which extends to programs over D1. Given programs gi on D, i p 1, . . . , m, we define di a derived program on D1, with |prpiq| inputs and outputs, which returns an element of a conservative gradient for gi (as for instance a Clarke subgradient, or simply a gradient in the C1 case). By gdi, we denote a program on D1 evaluating pgipxq, dipxqq jointly for a given x. We denote by Algorithm 1 , an extension of Algorithm 1 which additionally returns wi dipxprpiqq for i p 1, . . . , m, by replacing line 2 in Algorithm 1 with a call to gdi instead of gi. The backward (resp. forward) AD program backpropp Pq (resp. forpropp Pq) is defined as follows:

Algorithm 2: Algorithmic differentiation of P as in Section 3.1 Input: variables pxiqp i 1 Forward evaluation with derivatives: evaluate wi dipxprpiqq, i p 1, . . . , m, with Algorithm 1 : Algorithm 1 with gdi instead of gi on line 2.

1: Forward mode: 2: Initialize: Bxi

Bx ei , i 1, . . . , p, from canonical basis in Rp. 3: for i p 1, . . . m do 4: Bxi

where x px1, . . . , xpq. 5: end for Return: Bxm

Bx and eventually xm.

1: Backward mode: 2: Initialize: v em 3: for t m, . . . p 1 do 4: for j P prptq do 5: Update coordinate j of v:

vrjs : vrjs vrtswtrjs

6: end for 7: end for Return: pvrjsqp j 1 and eventually xm.

Note that Algorithm 2 starts with Algorithm 1 , i.e., Algorithm 1 with gdi instead of gi on line 2. Its computational cost, denoted costpgdiq, should be thought of as an exogenous parameter: it may model, for instance, the use of underlying software libraries or the hardware properties.

4 COMPUTATIONAL COMPLEXITY OF NONSMOOTH AD

We now evaluate the complexity of the forprop and backprop operations for conservative gradients in the path-differentiable case which encompasses, as mentioned earlier, all semi-algebraic and

Published as a conference paper at ICLR 2023

definable locally Lipschitz functions. We show, in particular, that backpropagation with conservative gradients has a computational overhead ratio that is independent of the dimension. This is in contrast with the best known algorithmic oracles for the Clarke subdifferential (see Khan and Barton (2012; 2013; 2015); Barton et al. (2018) and Appendix A.2), whose computational overhead ratio scales linearly with the dimension.

Theorem 2 (Complexity of nonsmooth AD) Let P be a program over a dictionary D of pathdifferentiable functions with p inputs as in Algorithm 1 & 2. Then, the corresponding function r Ps is path differentiable, there is a conservative gradient DP for the function r Ps such that:

(i) (Cost of backward mode) At each input point x P Rp, the output of program backpropp Pq is in DP pxq and we have costpbackpropp Pqq ď ωb costp Pq, where ωb max i p 1,m tpcostpgdiq 2 maxpcostp q, costpˆqq|prpiq|q { cost pgiqu . (3)

(ii) (Cost of forward mode) At each input point x P Rp, the output of program forpropp Pq is in DP pxq and we have costpforpropp Pqq ď ωf ˆ costp Pq where ωf max i p 1,m tpcost pgdiq p|prpiq|costpˆq pp|prpiq| 1qcostp qq { cost pgiqu .

There is a dissymmetry between the two modes since the constant ωb is independent of the dimension p. This is why property (i) is sometimes called the cheap conservative gradient principle extending the classical smooth one which was derived by Baur and Strassen (1983) for real rational functions. Theorem 2 describes worst case upper bounds (maximum over i), which are tight, for example if prpiq, costs of gi and gdi are independent of i.

We will consider several examples now.

The class of Re LU programs Let DRe LU be the dictionary composed of elementary arithmetic operations, logarithm, exponential and the Re LU function: DRe LU : t , ˆ, c, ˆc, inv, exp, log, Re LUu. (4) A Re LU program P is a program with dictionary DRe LU; it can be expressed in a compositional form (Section 3.1) with program sequences in DRe LU. Note that this yields path differentiable functions.

Assumption 1 (Computational Cost) In Algorithms (2), define the dictionary D1Re LU : DRe LUY t Re LU1u as in Example 1; then, all operations from D1Re LU have unit cost (see Remark 1).

Corollary 1 (Backprop complexity of Re LU programs) Let P be a Re LU program, under Assumption 1, we have: costpbackpropp Pqq ď 5costp Pq. This extends to more complex cost weighting schemes (Remark 1) and to selection functions which virtually capture all losses in ML (Remark 2).

Table 1: Complexity constant of ωb in Theorem 2 for elementary g in DRe LU and derived program with dictionary D1 Re LU. This proves Corollary 1 (more details in appendix B.1).

g p , ˆq p c, ˆcq log exp inv Re LU pcostpgdq 2costpˆq|pr|q { cost pgq 5 3 4 3 5 3

Remark 1 (On refined cost systems) Unit cost in Assumption 1 gives a simple interpretation to Corollary 1: the cost of a program is the total number of numerical operations. This rough estimate of computational complexity, could be refined with different weighting schemes. However, the obtained constant 5 is robust to many different weighting choices, far beyond Assumption 1. We detail an example in the Appendix B.2 for which the cost of all smooth nonlinear operations different from or ˆ is cnonlin ě 1 and we model the cost of sign branching in computation of Re LU and Re LU1

with constant c Re LU ě 0. This yields the same constant as in Corollary 1.

Remark 2 (Beyond Re LU programs) Many other dictionaries could be considered. Re LU is an example chosen for its simplicity, but Corollary 1 would hold similarly (with the same constant 5) for many different nonsmooth activations or components such as absolute value, max-pooling, ELU function, ℓ1 and ℓ8 norms. Similar results could be developed for the class of selection functions, which encompasses the vast majority of ML building blocks (see Bolte and Pauwels (2020b)). This is sketched in Appendix B.3.

Published as a conference paper at ICLR 2023

Chaining backpropagation derived programs Our approach is flexible enough to describe programs of programs and backpropagation chaining. Let P be a program as in Algorithm 1, with adapted Re LU program sequence tpgiqm i p 1u. If costpgiq " |prpiq|, gi is a long program , with many operations per input.We may set gdi backproppgiq using Algorithm 2, i p 1, . . . , m. From Corollary 1, we have costpgdiq{costpgiq ď 5, and for long programs ωb 5 in Theorem 2. This illustrates the versatility of our approach as it captures the complexity of chaining backprop operations, the resulting estimate being quite sharp in the regime of long programs.

Beyond backpropagation Programs may be differentiated by other means than backpropagation. Examples include, forward propagation, with applications in optimization and algorithmic unrolling (Mehmood and Ochs, 2020; Lorraine et al., 2020; Maclaurin et al., 2015), implicit differentiation Agrawal et al. (2018); Winston and Kolter (2020); Bai et al. (2019); Bolte et al. (2021) with application in optimization and hyperparameter optimization (Bertrand et al., 2020), adjoint differentiation (Plessix, 2006) in programs with components involving ordinary differential equations (Courtier and Rabier, 1997; Chen et al., 2018), differentiation of conjugate gradient (Gratton et al., 2014), Cholesky algorithm (Smith, 1995), approximation of Jacobian matrices involving a non-uniform FFT (Wang and Fessler, 2021).

Let P be a program as in Algorithm 1. Theorem 2 relates the complexity of combining derived programs in Algorithm 2 to the following quantities, for i p 1, . . . , m:

costpgdiq{costpgiq: the computational overhead ratio . |prpiq|costpˆq{costpgiq: the ratio between multiplication cost and average cost per input argument.

The first quantity depends on the technique used to obtain gdi. The second quantity is typically less than 2 (at least one arithmetic operation per input) and becomes negligible for long programs (many operations per input).

For example in Mehmood and Ochs (2020); Lorraine et al. (2020); Maclaurin et al. (2015), for one i, the program gdi is an optimization algorithm in Rp, a long program differentiated using forward propagation. The corresponding overhead ratio is in this case 3p 5 (Theorem 2). If combined with an outer backward pass, we obtain a dimension-dependent overhead ratio, in contrast with full backward differentiation. Our model provides computational cost estimates for mixed techniques, here a combination of inner forward and outer backward propagation.

5 ON THE COMPUTATIONAL HARDNESS OF GENERALIZED GRADIENTS

Let P and DP be two programs such that DP evaluates jointly P and a derivative of P. In the sequel, we use the term (computational) overhead ratio of DP to denote the quantity costp DP q

costp P q and

computational overhead ratio of derivatives of P to denote the quantity compp DP q

costp P q . As established in Theorem 2, this ratio is dimensionless in the case of backpropagation with conservative gradients. Are there other ways to compute cheap nonsmooth gradients? Toward an answer to this question, we discuss this ratio for other nonsmooth differentiation oracles: directional derivatives (for which we relate worst-case complexity to that of matrix multiplication), lexicographic derivatives with forward AD (with an overhead ratio of order p Barton et al. (2018)). As for the Clarke subdifferential, we prove the hardness of subgradients enumeration. Our motivation to estimate the complexity of these particular types of derivatives (directional, lexicographic and Clarke) is that they serve as a basis to alternative implementable AD approaches (see Barton et al. (2018) and references therein), and are thus concurrent strategies of conservative gradient backpropagation. The results presented below do not provide a definitive answer, but they strongly suggest that backpropagation of conservative gradients has a much more favorable complexity.

5.1 THE OVERHEAD RATIO FOR EVALUATING p DIRECTIONAL DERIVATIVES

Given G: Rp Ñ R locally Lipschitz and x, d P Rp, the directional derivative of G at x in direction d is given by limtÓ0p Gpx tdq Gpxqq{t when the limit exists. This section considers a family of functions with p inputs and q real parameters, represented by a locally Lipschitz function F : Rp ˆ Rq Ñ R, for which we investigate hardness of evaluation of p directional derivatives. The function F may describe, for instance, a Re LU feedforward neural network empirical loss, parameterized by

Published as a conference paper at ICLR 2023

q real weights, with p inputs. For functions represented by Re LU programs, we prove an overhead ratio of order pω 2 op1q where ω is the matrix multiplication exponent (see definition below). In all rigor, it is not known whether ω ą 2 or ω 2, so the derived ratio could be essentially dimensionless (if ω 2), though all practical evidences are against this so far. The best known lower bound is ω ă 2.37 , and in practice, the matrix multiplication exponent is closer to 2.7, both corresponding to a dimension-dependent overhead, in contrast with the smooth case with essentially dimensionless overhead ratio to evaluate p directional derivatives (essentially a gradient).

Complexity of matrix multiplication: Throughout this section, we set D t , ˆ, c, ˆcu, with unit costs (corresponding to polynomial functions). Denote by cppq complexity of p ˆ p matrix multiplication. More precisely, if f : Rpˆp ˆ Rpˆp Ñ Rpˆp is such that fp A, Bq AB for all, square matrices A, B P Rpˆp, we have cppq comppf, Dq, which we may write cppq pω op1q where ω is called the matrix multiplication exponent. Note that cppq ě p2, as one needs at least one operation for each of the 2p2 entries.

Directional derivatives: Given a function F : RpˆRq Ñ R, we denote by F 1 1 : RpˆRqˆRpˆp Ñ Rp the function which associates to x P Rp, y P Rq and a matrix A P Rpˆp the p directional derivatives with respect to x variable, for fixed y, in directions given by the columns of A. The proof of the following theorem is given in Section C.

Theorem 3 (Computational ratio for directional derivatives) There exists a function F : Rp ˆ Rq Ñ R and a program PF implementing F on dictionary t , ˆ, Re LU, c, ˆcu (all operations have unit cost), such that for any program P 1 implementing py, Aq ÞÑ F 1 1p0, y, Aq on derived dictionary t , ˆ, Re LU, Re LU1, c, ˆcu,

costp P 1q{costp PF q ě pcppq 5pq{p40p2q pω 2 op1q. (5)

Theorem 3 has q parameters, parametric dependency is required to express hardness. Indeed, for some parameter values, computation may be trivial (e.g. null values). Alternatively, it states that for some values of the q parameters, computing p directional derivatives has cost as in (5).

The bound in (5) is sharp up to multiplicative constants for linear Re LU networks, see Remark 5 in Appendix A.2.

Consequences: Our overhead estimate is roughly pω 2, it constitutes a bottleneck: a cheap nonsmooth p directional derivatives principle , would imply easy matrix multiplication, to the point that ω 2. Since the seminal work of Strassen et al. (1969), it is known that ω ď log2p7q 2.81. Determining the precise exponent ω has been an object of intense research Robinson (2005). Asymptotically, one has 2 ď ω ă 2.373, see Williams (2012); Le Gall (2014), the best known bound being given in Alman and Williams (2021). In this case, the estimate in (5) is roughly p0.373.

These estimates may involve non-constructive existence proofs, or suffer from the curse of recursion: meaningful efficiency occurs only for inaccessible sizes. According to Dumas and Pan (2016), for values p ď 106 the most efficient practical algorithms have a complexity of the order p2.774, resulting in an overhead of order p0.774, in contrast with the constant overhead incurred by nonsmooth backpropagation. More discussion is given in Appendix A.2.

Comparison with the smooth case: If F is C1, evaluating p directional derivatives is comparatively easier because F 1px, dq x Fpxq, dy for all x, d P Rp. Hence, one may first evaluate F (once), at a cost similar to that of F (cheap gradient principle), and then evaluate p scalar products, at a cost p2. If the cost of F is of order p2 at least (for example F is a feedforward neural network with p inputs and a layer of p hidden neurons), then this is overall proportional to the cost of computing F.

5.2 COMPUTING CLARKE SUBGRADIENTS USING FORWARD AUTOMATIC DIFFERENTIATION

In Khan and Barton (2012; 2013; 2015), several automatic differentiation strategies are proposed to evaluate elements of the Clarke subdifferential. These approaches are based on directional (Shapiro, 1990) and lexicographic derivatives (Nesterov, 2005) which satisfy a chain rule under structural assumptions. The chain rule may be implemented using the vector forward mode of automatic differentiation (Barton et al., 2018), which suffers from computational overhead scaling linearly

Published as a conference paper at ICLR 2023

in p, contrary to the reverse mode in Theorem 2. Reducing this factor is an open question, even for compositional functions involving only univariate nonsmoothness such as absolute value (Khan, 2018). More details are given in Appendix A.2.1.

5.3 COMPUTATIONAL HARDNESS OF SUBGRADIENT ENUMERATION

We investigate in this section the hardness finding subgradients for programs defined on the elementary dictionary D0 t , , Re LUu with unit costs. Let us denote by Pp D0q the set of such programs. We will, with a slight abuse of notation, identify a program P P D0 t , , Re LUu with the function it computes to state our complexity result (proof in Section D).

Theorem 4 (Clarke subgradients and NP-Hardness) (i) The problem of finding two distinct subgradients in the Clarke subdifferential of P P Pp D0q at given input (or one single subgradient if it is reduced to a singleton) is NP-hard. (ii) Deciding if P P Pp D0q is not differentiable at some given input is NP-hard.

Remark 3 In Theorem 4, numerical parameters and inputs are constrained to be in t 1, 0, 1u, so that the hardness result does not depend on numerical representation and only involves program size (strong NP-hardness). See Appendix D for more details.

The above problems (i)-(ii) enter the field of computational complexity as we consider programs P P Pp D0q with a natural notion of size, given by their cost, costp Pq, the number of operations (recall that we assumed unit costs). Since the considered programs implement piecewise linear functions, it follows from (Barton et al., 2018, Proposition 2.7) that, our hardness result also holds for the lexicographic subdifferential Nesterov (2005), which reduces in this case to the set of neighboring gradients (see Section D).

The counterpart of the above problem for AD conservative gradients as in Definition 2 is tractable, illustrating a major computational difference between Clarke subdifferential and AD conservative gradient. The proof is in Section D.4, by reduction to a graph shortest path problem.

Proposition 1 (Finding two elements in autodiff conservative gradients is tractable) Given P P Pp D0q, with conservative gradient DP given by Theorem 2, finding two elements in DP pxq at a given input x (or one single element if DP pxq is a singleton) is solvable in polynomial time.

6 CONCLUSION

We extended the cheap gradient principle to nonsmooth automatic differentiation with a flexible version of Baur-Strassen s result: the overhead ratio of conservative gradients is independent of the dimension. On the other hand, we showed that the potential gain in efficiency of forward AD for multiple directional derivatives is limited due to an intrinsic connection to matrix multiplication. Finally, we have shown that for simple Re LU networks, the enumeration of Clarke subgradients is computationally hard, in contrast to the enumeration of conservative gradients.

The global picture is significantly different from the smooth case, with a well understood cheap gradient principle that yields cheap p directional derivatives , illustrating the specificities of nonsmoothness. Our results confirm the centrality of conservative gradients in nonsmooth AD and machine learning: they generalize gradients with a clear cheap principle , contrary to concurrent notions. An important open question in this context is the complexity of subgradients, or, in other words, the existence of a cheap subgradient principle . We conjecture a negative answer in general.

ACKNOWLEDGMENTS AND DISCLOSURE OF FUNDING

The authors acknowledge the support of the AI Interdisciplinary Institute ANITI funding under the grant agreement ANR-19-PI3A-0004. The authors acknowledge the support of the Association nationale de la recherche et de la technologie (ANRT) and Thales LAS France, which contributed to Ryan B s grant. J erome B. and Edouard P. acknowledge the financial support of Air Force Office of Scientific Research, Air Force Material Command, USAF, under grant numbers FA9550-19-17026 FA8655-22-1-7012, and ANR Ma SDOL 19-CE23-0017-01. J erˆome B. also acknowledges the

Published as a conference paper at ICLR 2023

support of ANR Chess, grant ANR-17-EURE-0010, TSE-P and the Centre Lagrange. We thank our collaborators in the Thales LAS France, especially Andrei Purica, for helpful comments. We are grateful to Serge Gratton, Pierre Weiss and Pierre Boudier for useful reference suggestions.

Martin Abadi, Paul Barham, Jianmin Chen, Zhifeng Chen, Andy Davis, Jeffrey Dean, Matthieu Devin, Sanjay Ghemawat, Geoffrey Irving, Michael Isard, Manjunath Kudlur, Josh Levenberg, Rajat Monga, Sherry Moore, Derek G. Murray, Benoit Steiner, Paul Tucker, Vijay Vasudevan, Pete Warden, Martin Wicke, Yuan Yu, and Xiaoqiang Zheng. Tensorflow: A system for large-scale machine learning. In 12th USENIX Symposium on Operating Systems Design and Implementation (OSDI 16), pages 265 283, 2016. URL https://www.usenix.org/system/files/ conference/osdi16/osdi16-abadi.pdf.

Akshay Agrawal, Robin Verschueren, Steven Diamond, and Stephen Boyd. A rewriting system for convex optimization problems. Journal of Control and Decision, 5(1):42 60, 2018.

Akshay Agrawal, Shane Barratt, Stephen Boyd, Enzo Busseti, and Walaa M Moursi. Differentiating through a cone program. J. Appl. Numer. Optim, 1(2):107 115, 2019.

Josh Alman and Virginia Vassilevska Williams. A refined laser method and faster matrix multiplication. In Proceedings of the 2021 ACM-SIAM Symposium on Discrete Algorithms (SODA), pages 522 539. SIAM, 2021.

Raman Arora, Amitabh Basu, Poorya Mianjy, and Anirbit Mukherjee. Understanding deep neural networks with rectified linear units. In International Conference on Learning Representations, Conference Track Proceedings, 2018.

Shaojie Bai, J Zico Kolter, and Vladlen Koltun. Deep equilibrium models. Advances in Neural Information Processing Systems, 32, 2019.

Paul I. Barton, Kamil A. Khan, Peter Stechlinski, and Harry A.J. Watson. Computationally relevant generalized derivatives: theory, evaluation and applications. Optimization Methods and Software, 33(4-6):1030 1072, 2018. doi: 10.1080/10556788.2017.1374385. URL https://doi.org/ 10.1080/10556788.2017.1374385.

Walter Baur and Volker Strassen. The complexity of partial derivatives. Theoretical Computer Science, 22:317 330, 1983.

Atilim Gunes Baydin, Barak A Pearlmutter, Alexey Andreyevich Radul, and Jeffrey Mark Siskind. Automatic differentiation in machine learning: a survey. Journal of Marchine Learning Research, 18:1 43, 2018.

L. M. Beda, L. N. Korolev, N. V. Sukkikh, and T. S. Frolova. Programs for automatic differentiation for the machine BESM. Technical report, Institute for Precise Mechanics and Computation Techniques, Academy of Science, Moscow, USSR, 1959.

David Bertoin, J erˆome Bolte, S ebastien Gerchinovitz, and Edouard Pauwels. Numerical influence of relu (0) on backpropagation. Advances in Neural Information Processing Systems, 34, 2021.

Quentin Bertrand, Quentin Klopfenstein, Mathieu Blondel, Samuel Vaiter, Alexandre Gramfort, and Joseph Salmon. Implicit differentiation of lasso-type models for hyperparameter optimization. In International Conference on Machine Learning, pages 810 821. PMLR, 2020.

Mathieu Blondel, Quentin Berthet, Marco Cuturi, Roy Frostig, Stephan Hoyer, Felipe Llinares-L opez, Fabian Pedregosa, and Jean-Philippe Vert. Efficient and modular implicit differentiation. ar Xiv preprint ar Xiv:2105.15183, 2021.

Jacek Bochnak, Michel Coste, and Marie-Franc oise Roy. Real algebraic geometry, volume 36. Springer Science & Business Media, 2013.

J erˆome Bolte and Edouard Pauwels. Conservative set valued fields, automatic differentiation, stochastic gradient methods and deep learning. Mathematical Programming, pages 1 33, 2020a.

Published as a conference paper at ICLR 2023

J erˆome Bolte, Tam Le, Edouard Pauwels, and Tony Silveti-Falls. Nonsmooth implicit differentiation for machine-learning and optimization. Advances in Neural Information Processing Systems, 34, 2021.

J erˆome Bolte and Edouard Pauwels. A mathematical model for automatic differentiation in machine learning. In Conference on Neural Information Processing Systems, 2020b.

James Bradbury, Roy Frostig, Peter Hawkins, Matthew James Johnson, Chris Leary, Dougal Maclaurin, George Necula, Adam Paszke, Jake Vander Plas, Skye Wanderman-Milne, and Qiao Zhang. JAX: composable transformations of Python+Num Py programs, 2018. URL http://github.com/google/jax.

Ricky TQ Chen, Yulia Rubanova, Jesse Bettencourt, and David K Duvenaud. Neural ordinary differential equations. Advances in neural information processing systems, 31, 2018.

Frank H Clarke. Optimization and nonsmooth analysis. SIAM, 1983.

Michel Coste. An introduction to o-minimal geometry. Istituti editoriali e poligrafici internazionali Pisa, 2000a.

Michel Coste. An introduction to semialgebraic geometry, 2000b.

P. Courtier and F. Rabier. The use of adjoint equations in numerical weather prediction. Atmosphere Ocean, 35(sup1):303 322, 1997. doi: 10.1080/07055900.1997.9687354. URL https://doi. org/10.1080/07055900.1997.9687354.

Damek Davis and Dmitriy Drusvyatskiy. Conservative and semismooth derivatives are equivalent for semialgebraic maps. ar Xiv preprint ar Xiv:2102.08484, 2021.

Jean-Guillaume Dumas and Victor Pan. Fast matrix multiplication and symbolic computation. ar Xiv preprint ar Xiv:1612.05766, 2016.

Patrick E Farrell, David A Ham, Simon W Funke, and Marie E Rognes. Automated derivation of the adjoint of high-level transient finite element programs. SIAM Journal on Scientific Computing, 35 (4):C369 C393, 2013.

Serge Gratton, David Titley-Peloquin, Philippe Toint, and Jean Tshimanga Ilunga. Differentiating the method of conjugate gradients. SIAM Journal on Matrix Analysis and Applications, 35(1):110 126, 2014. doi: 10.1137/120889848. URL https://doi.org/10.1137/120889848.

A. Griewank and A. Rojas. Treating artificial neural net training as a nonsmooth global optimization problem. In International Conference on Machine Learning, Optimization, and Data Science (pp. 759-770). Springer, Cham., 2019.

A. Griewank and A. Walther. Beyond the oracle: Opportunities of piecewise differentiation. In Numerical Nonsmooth Optimization (pp. 331-361). Springer, Cham., 2020.

Andreas Griewank. On stable piecewise linearization and generalized algorithmic differentiation. Optimization Methods and Software, 28, 07 2013. doi: 10.1080/10556788.2013.796683.

Andreas Griewank and Christ ele Faure. Piggyback differentiation and optimization. In Large-scale PDE-constrained optimization, pages 148 164. Springer, 2003.

Andreas Griewank and Andrea Walther. Evaluating derivatives: principles and techniques of algorithmic differentiation. SIAM, 2008.

Andreas Griewank et al. On automatic differentiation. Mathematical Programming: recent developments and applications, 6(6):83 107, 1989.

Sham M Kakade and Jason D Lee. Provably correct automatic sub-differentiation for qualified programs. In Advances in Neural Information Processing Systems, volume 31. Curran Associates, Inc., 2018.

Kamil A Khan. Branch-locking ad techniques for nonsmooth composite functions and nonsmooth implicit functions. Optimization Methods and Software, 33(4-6):1127 1155, 2018.

Published as a conference paper at ICLR 2023

Kamil A Khan and Paul I Barton. Evaluating an element of the clarke generalized jacobian of a piecewise differentiable function. In Recent Advances in Algorithmic Differentiation, pages 115 125. Springer, 2012.

Kamil A Khan and Paul I Barton. Evaluating an element of the clarke generalized jacobian of a composite piecewise differentiable function. ACM Transactions on Mathematical Software (TOMS), 39(4):1 28, 2013.

Kamil A Khan and Paul I Barton. A vector forward mode of automatic differentiation for generalized derivative evaluation. Optimization Methods and Software, 30(6):1185 1212, 2015.

Franc ois Le Gall. Powers of tensors and fast matrix multiplication. In Proceedings of the 39th international symposium on symbolic and algebraic computation, pages 296 303, 2014.

Yann Le Cun, Yoshua Bengio, and Geoffrey Hinton. Deep learning. nature, 521(7553):436 444, 2015.

Adrian Lewis and Tonghua Tian. The structure of conservative gradient fields. ar Xiv preprint ar Xiv:2101.00699, 2021.

Jonathan Lorraine, Paul Vicol, and David Duvenaud. Optimizing millions of hyperparameters by implicit differentiation. In International Conference on Artificial Intelligence and Statistics, pages 1540 1552. PMLR, 2020.

Dougal Maclaurin, David Duvenaud, and Ryan Adams. Gradient-based hyperparameter optimization through reversible learning. In International conference on machine learning, pages 2113 2122. PMLR, 2015.

Sheheryar Mehmood and Peter Ochs. Automatic differentiation of some first-order methods in parametric optimization. In International Conference on Artificial Intelligence and Statistics, pages 1584 1594. PMLR, 2020.

Yu Nesterov. Lexicographic differentiation of nonsmooth functions. Mathematical programming, 104(2):669 700, 2005.

Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory Chanan, Trevor Killeen, Zeming Lin, Natalia Gimelshein, Luca Antiga, Alban Desmaison, Andreas Kopf, Edward Yang, Zachary De Vito, Martin Raison, Alykhan Tejani, Sasank Chilamkurthy, Benoit Steiner, Lu Fang, Junjie Bai, and Soumith Chintala. Pytorch: An imperative style, high-performance deep learning library. In H. Wallach, H. Larochelle, A. Beygelzimer, F. d'Alch e-Buc, E. Fox, and R. Garnett, editors, Advances in Neural Information Processing Systems 32, pages 8024 8035. Curran Associates, Inc., 2019. URL http://papers.neurips.cc/paper/ 9015-pytorch-an-imperative-style-high-performance-deep-learning-library. pdf.

Barak A Pearlmutter. Gradient calculations for dynamic recurrent neural networks: A survey. IEEE Transactions on Neural networks, 6(5):1212 1228, 1995.

R-E Plessix. A review of the adjoint-state method for computing the gradient of a functional with geophysical applications. Geophysical Journal International, 167(2):495 503, 2006.

Maithra Raghu, Ben Poole, Jon Kleinberg, Surya Ganguli, and Jascha Sohl-Dickstein. On the expressive power of deep neural networks. In international conference on machine learning, pages 2847 2854. PMLR, 2017.

Sara Robinson. Toward an optimal algorithm for matrix multiplication. SIAM news, 38(9):1 3, 2005.

David E. Rumelhart, Geoffrey E. Hinton, and Ronald J. Williams. Learning Representations by Back-propagating Errors. Nature, 323(6088):533 536, 1986. doi: 10.1038/323533a0. URL http://www.nature.com/articles/323533a0.

Stefan Scholtes. Introduction to piecewise differentiable equations. Springer Science & Business Media, 2012.

Published as a conference paper at ICLR 2023

Alexander Schrijver. Theory of linear and integer programming. John Wiley & Sons, 1998.

Alexander Shapiro. On concepts of directional differentiability. Journal of optimization theory and applications, 66(3):477 487, 1990.

Stephen P Smith. Differentiation of the cholesky algorithm. Journal of Computational and Graphical Statistics, 4(2):134 147, 1995.

Volker Strassen et al. Gaussian elimination is not optimal. Numerische mathematik, 13(4):354 356, 1969.

Guanhua Wang and Jeffrey A. Fessler. Efficient approximation of jacobian matrices involving a non-uniform fast fourier transform (nufft), 2021. URL https://arxiv.org/abs/2111. 02912.

Robert Edwin Wengert. A simple automatic derivative evaluation program. Communications of the ACM, 7(8):463 464, 1964.

Virginia Vassilevska Williams. Multiplying matrices faster than coppersmith-winograd. In Proceedings of the forty-fourth annual ACM symposium on Theory of computing, pages 887 898, 2012.

Ezra Winston and J Zico Kolter. Monotone operator equilibrium networks. Advances in neural information processing systems, 33:10718 10728, 2020.

Philip Wolfe. Checking the calculation of gradients. ACM Transactions on Mathematical Software (TOMS), 8(4):337 343, 1982.

Published as a conference paper at ICLR 2023

This is the appendix for On the complexity of nonsmooth automatic differentiation .

A Further comments, discussion and technical elements 14

B Proofs related to Section 4 15

C Proofs of Section 5.1 20

D Proofs of Section 5.3 21

A FURTHER COMMENTS, DISCUSSION AND TECHNICAL ELEMENTS

A.1 COMMENTS ON SECTION 3

A.1.1 COMPUTATIONAL MODEL IN SECTION 3.1

DAG representation and examples 3.1: We start with a remark regarding representations of programs as directed acyclic graphs and use them to illustrate the model of computation proposed in the main text. It reduces to that of arithmetic circuit complexity for a dictionary composed of elementary arithmetic operations.

Remark 4 (Programs as directed graphs) A predecessor relation trivially describes a directed acyclic graph (DAG). Therefore, a program is equivalently represented as a DAG, nodes corresponding either to input variables (empty predecessor) or computation (nonempty predecessor). Directed edges connect predecessor nodes to their successors. Each computation node contains a lower-level program (with a single output), with the number of input edges being coherent with the number of arguments. The cost of a node is that of the underlying program and the cost of P is the sum of the costs of its nodes. Nodes without outer edges are output nodes. See examples in Appendix A.1.

We represent programs using the DAG representation as in Remark 4. Let us define a simple dictionary D : t , ˆu and introduce a level 0 elementary program P0 such that P0pa, bq a b meaning that P0 computes the quantity a b. P0 is identified with from the dictionary. We also introduce a level 1 program P1 such that P1pa, b, cq a ˆ pb cq. We can construct an equivalent level 1 program, Q1 such that Q1pa, b, cq a ˆ b a ˆ c, in this case, we have P1 Q1, or r P1s r Q1s since they compute the same quantity. The level 2 program P2 is such that P2pa, b, c, dq pa bq ˆ pc dq Q1pa, c, dq P1pb, c, dq and uses level 1 programs Q1 and P1 in its computation nodes. The Directed Acyclic Graphs (DAGs) representing these programs are given in Figure 1. Assuming costp q costpˆq 1, we have costp P0q 1, costp P1q 2, costp Q1q 3 and costp P2q costp Q1q costp P1q costpˆq 6.

Figure 1: DAG illustrating different programs with dictionary D : t , ˆu. (a) P0pa, bq a b, of level 0 which is identified with from the dictionary, (b) P1pa, b, cq apb cq, of level 1, (c) Q1pa, b, cq ab ac, of level 1 and equivalent to P1, (d) P2pa, b, c, dq pa bqpc dq Q1pa, c, dq P1pb, c, dq, of level 2.

Published as a conference paper at ICLR 2023

A.2 COMMENTS ON SECTION 5

A.2.1 FORWARD AD AND CLARKE SUBGRADIENTS

Nesterov (2005) introduced the notion of lexicographic subdifferential, denoted here BLF for a Lipschitz function F : Rp Ñ R. The construction of BLF is based on successive local approximations of F with directional derivatives, and one has BLFpxq Ă Bc Fpxq for all x such that the first term is well defined.

It is known that automatic differentiation can be used to compute directional derivatives, particularly the forward mode of automatic differentiation (Griewank and Walther, 2008). Based on this observation, Khan and Barton developed several algorithms to evaluate elements of Bc F, based on directional derivatives (Khan and Barton, 2012; 2013; 2015). They concentrate on piecewise C1 functions, see for example Scholtes (2012), and propose to handle compositional structures with different restrictions on the function class considered, such as functions in abs-normal forms (Khan and Barton, 2012), or broader classes (Khan and Barton, 2013; Barton et al., 2018).

All these procedures either require to evaluate p directional derivatives (Khan and Barton, 2012; 2013), or rely on forward chain rule propagation for lexicographic derivatives (Khan and Barton, 2015; Barton et al., 2018), which also require to maintain p directional derivatives. For this reason, all these methods suffer from a multiplicative computational overhead ratio of the order of p in the worst case, and it is not known if this could be improved (Barton et al., 2018), although efforts have been made in this direction (Khan, 2018).

A.2.2 MATRIX MULTIPLICATIONS

Remark 5 The lower bound described in Theorem 3 is sharp for a linear Re LU network F as in (11) involving only square p ˆ p matrices. Indeed, p directional derivatives of F in directions a1, . . . , ap, can be computed with roughly Lcppq operations, using a matrix multiplication algorithm realizing the cppq bound, for example using the forward mode of AD Khan and Barton (2012; 2013). The naive PF algorithm for forward evaluation performs roughly 2Lp2 operations which results in the bound (neglecting terms of order one in numerator and denominator), compp Fd, D Y t Re LU, Re LU1uq

costp PF q ď cppq

for this class of networks, to be compared with (5). Finally, we remark that in the smooth case such complexity estimates reduce to gradient computation which can be done using backward algorithmic differentiation with a constant multiplicative overhead ratio.

We denote by Fd, the function Fd : py, Aq ÞÑ F 1 1p0, y, Aq which computes p directional derivatives at a given point. Setting ω lim suppÑ8 logpcppqq{ logppq, since P 1 is an arbitrary program implementing Fd, we have shown that asymptotically, for any ϵ ą 0

sup p,F r PF s,PF PPp DYt Re LUuq

compp Fd, D Y t Re LU, Re LU1uq

costp PF q ˆ p2 ω ϵ 8,

where the supremum is taken over all p and all functions F : Rpˆq Ñ R implemented by a program PF with dictionary D Y t Re LUu. It is not known whether ω ą 2.

B PROOFS RELATED TO SECTION 4

Proof of Theorem 2: Given a program P as in Section 3.1, the path differentiability of r Ps is immediate by composition and the chain rule property. The associated conservative gradient DP is constructed in Bolte and Pauwels (2020a).

We have the following cost estimates which can be deduced from the definition of the cost of a program in Section 3.1.

Algorithm 1 forward evaluation:

costp Pq costp Algorithm 1q

i p 1 cost pgiq (6)

Published as a conference paper at ICLR 2023

Algorithm 1 forward evaluation with derivatives: Algorithm 1 with gdi instead of gi on line 2

costp Algorithm 1 q

i p 1 cost pgdiq (7)

Algorithm 2 backward AD cost:

costpbackpropp Pqq costp Algorithm 1 q

i p 1 |prpiq|pcostp q costpˆqq

i p 1 cost pgdiq |prpiq|pcostp q costpˆqq. (8)

Algorithm 2 forward AD cost:

costpforpropp Pqq costp Algorithm 1 q

i p 1 p|prpiq|costpˆq pp|prpiq| 1qcostp q

i p 1 cost pgdiq p|prpiq|costpˆq pp|prpiq| 1qcostp q. (9)

Let us derive the complexity bound of Algorithm 1 according to Algorithm 2.

Backward AD complexity result: Using (8) and the fact that cost has value in R , we have

costpbackpropp Pqq

i p 1 cost pgdiq |prpiq|pcostp q costpˆqq

i p 1 costpgiq ˆ cost pgdiq |prpiq|pcostp q costpˆqq

ď max i p 1,m

ˆcost pgdiq |prpiq|pcostp q costpˆqq

i p 1 costpgiq,

where the inequality is due to factorization by the maximal value. Using (6), we obtain

costpbackpropp Pqq ď ωb ˆ costp Pq

where ωb is given in (3). This proves point (i).

Forward AD complexity result: Using (9) and the fact that cost has value in R , we have

costpforpropp Pqq

i p 1 cost pgdiq p|prpiq|costpˆq pp|prpiq| 1qcostp q

i p 1 costpgiq ˆ cost pgdiq p|prpiq|costpˆq pp|prpiq| 1qcostp q

ď max i p 1,m

ˆcost pgdiq p|prpiq|costpˆq pp|prpiq| 1qcostp q

i p 1 costpgiq,

where the inequality is due to factorization by the maximal value. Using (6), we obtain

costpforpropp Pqq ď ωf ˆ costp Pq

where ωf is given in (3).

Published as a conference paper at ICLR 2023

B.1 JUSTIFICATION OF THE COMPLEXITY TABLE 1 OF THE DRe LU-DICTIONARY.

The proof of Corollary 1 follows from Theorem 2 by computing the relevant constants. They are shown in Table 1, let us justify the proposed numbers.

Case 1 (costpˆq, costp q) Let us define gpa, bq a ˆ b. To evaluate g, we need one operation from DRe LU. The derived program d related to g, should satisfy dpa, bq pb, aq which does not require additional operation. Therefore, from Assumption 1 we can deduce that costpgq 1 and costpgdq 1. We get the same result for costp q by applying identical reasoning.

Case 2 (costpˆcq, costp cq) Let us define gpaq c ˆ a. To evaluate g, we need one operation from DRe LU. The derived program d related to g, should satisfy dpaq c which does not require additional operation from D1 Re LU. Therefore, from Assumption 1 we can deduce that costpgq 1 and costpgdq 1. We get the same result for costp cq by applying identical reasoning.

Case 3 (costplogq) Let us define gpaq logpaq. To evaluate g, we need one operation from DRe LU. The derived program d related to g, should satisfy dpaq 1{a, which requires the inverse operation from D1 Re LU. Therefore, from Assumption 1 we can deduce that costpgq 1 and costpgdq 2.

Case 4 (costpexpq) Let us define gpaq exppaq. To evaluate g, we need one operation from DRe LU. The derived program d related to g, should satisfy dpaq gpaq which does not require operation from D1 Re LU. Finally, from Assumption 1 we can deduce that costpgq 1 and costpgdq 1.

Case 5 (costpinvq) Let us define gpaq 1

a. To evaluate g, we need one operation from DRe LU. The derived program d related to g, should satisfy dpaq 1

a2 which requires one additional multiplication to compute the square and one p 1q multiplication operation from D1 Re LU. Finally, from Assumption 1 we can deduce that costpgq 1 and costpgdq 3.

Case 6 (costp Re LUq) Let us define gpxq Re LUpxq maxpx, 0q. To evaluate g, we need to evaluate the sign of x. The derived program Re LU1 can be computed also from the sign of x without further operation. We have costpgq 1 by hypothesis, but it is also reasonable to consider costpgdq 1 as both operations only require sign evaluation of the same object.

Remark 6 Since DRe LU dictionary contains the Re LU function, we can build other non-smooth functions such as the maximum and the absolute value. For example, maxtx, yu Re LUpx yq y Re LUpx yq Re LUpyq Re LUp yq.

B.2 AN EXTENSION OF TABLE 1

The justifications of the following are similar to Section B.1, simply taking into consideration different types of operations. Taking cnonlin c Re LU 1, we recover table 1. We replace Re LU by ˆRe LU which corresponds to its usage in practice and allows us to balance the cost of Re LU operations and that of multiplications.

The justification is the same as in Section B.1 taking into consideration different types of operations. For the ˆRe LU operation, the justification is as follows.

Case 7 (ˆcostp Re LUq) The operation has two argument and requires one sign evaluation and one multiplication in the worst case, so we assign it the cost 1 c Re LU. The differentiated program d should compute the function pa, bq ÞÑ p Re LUpbq, a ˆ Re LU1pbqq. One can write a program to compute jointly g and d as follows: return pa ˆ b, b, aq if b ě 0 and p0, 0, 0q if b ă 0. This only requires a bit sign check which cost is c Re LU and a multiplication. We therefore model this operation such that costpgdq costpgq 1 c Re LU.

Further refinements could be considered including various type of computational operations, such as memory moves, these are beyond the scope of the present paper.

B.3 ADDITIONAL ELEMENTARY NONSMOOTH PROGRAMS AND COST EXAMPLES

For simplicity, we do not discuss the dictionary and its related derived dictionary as there are many possibilities, one of them being DRe LU and D1 Re LU as all the considered operations can be equivalently

Published as a conference paper at ICLR 2023

Table 2: Extension of cost table. cnonlin ě 1 is the cost of nonlinear operations and c Re LU ě 0 is the cost of sign evaluation for Re LU or Re LU1.

g p , ˆq p c, ˆcq log exp inv ˆRe LU costpgq 1 1 cnonlin cnonlin cnonlin 1 c Re LU |pr| 2 1 1 1 1 2 costpgdq 1 1 2cnonlin cnonlin cnonlin 2 1 c Re LU costpgdq

cost pgq 1 1 2 1 cnonlin 2

costpˆq|pr|

cost pgq 4 2 1 cnonlin

2 1 c Re LU costpgdq 2costpˆq|pr|

cost pgq 5 3 ď 4 ď 3 ď 5 ď 5

expressed with Re LU. We use the same framework as in B.2 and we identify the cost of comparing two real numbers with c Re LU ą 0. For each program g and associated derived program d, we let

ω costpgdq 2costpˆq|pr|

Table 3: Extension of cost table. cnonlin ě 1 is the cost of nonlinear operations and c Re LU ě 0 is the cost of sign evaluation for Re LU or Re LU1. For simplicity c Re LU is abbreviated c R and cnonlin is abbreviated cnl

g p , ˆq | | ELU 3 ˆ 3-max-pool } }8 } }1 costpgq 1 1 c R 2 c R cnl 153 8c R n 2nc R 1 np2 c Rq 1 |pr| 2 1 1 9 n n costpd, gq 1 1 c R 2 c R cnl 153 8c R n 2nc R 1 np2 c Rq 1 costpgdq

cost pgq 1 1 1 1 1 1

costpˆq|pr|

cost pgq 4 1 1 c R

1 2 c R cnl

n n 2nc R 1

n np2 c Rq 1 ω 5 ď 3 ď 2 ď 1.12 ď 3 ď 2

Case 8 (Absolute value and Leaky-Re LU) Recall that |x| x if x ą 0 and x otherwise. Similarly Leaky-Re LUpxq x if x ą 0 and ax otherwise, for some parameter a P p0, 1q so that both cases are exactly the same. The reasoning and result are exactly the same for both operations so we treat the absolute value. The construction is similar as what was proposed for ˆcostp Re LUq treated in the previous section.

Let g be a program to evaluate | |, in the worst case it requires one sign evaluation and one multiplication so that costpgq 1 c Re LU. Similarly it is possible to built a program which returns px, 1q if x ą 0 and p x, 1q otherwise, this computes pgdq and require the exact same operations so that costpgdq costpgq 1 c Re LU.

Case 9 (ELU)

fpxq " x if x ě 0 apex 1q if x ă 0 with a ą 0.

Let g be a program to evaluate the ELU function, it requires a sign evaluation and in the worst case one nonlinear operation to evaluate ex, one multiplication to evaluate aex, and one substraction to evaluate aex a. Therefore, costpgq c Re LU cnonlin 2. The derived program d requires the same sign and returns 1 or aex depending on the sign. This does not require additional operation and therefore the joint computation of g and d satisfies costpgdq costpgq.

Published as a conference paper at ICLR 2023

Case 10 (max-m-linear) Set n a number of inputs and m ě 2 a number of linear functions which are parameters, represented by a matrix A and a fixed input vector of size n represented by x P Rn. Setting maxm : Rm to R the function which evaluates the maximum of m numbers, we consider g a program which evaluates the function A ÞÑ maxmp Axq. Recall that x is fixed so that the number of inputs is m ˆ n. The multiplication requires m ˆ p2n 1q multiplications and additions and the evaluation of maxm requires pm 1qc Re LU as it requires m 1 pairwise comparisons. We therefore have costpgq m ˆ p2n 1q pm 1qc Re LU.

As for the derived program d, setting Mi 0 except for row number i which attains the maximum in g which is set to x, we have an element of a conservative gradient for g. It is possible to jointly compute gp Aq and dp Aq by invoking a program which returns pp Axqris, Miq where i is any index realizing the max and Mi is as discussed. This does not require more operations and we have therefore costpgdq costpgq m ˆ 2n 1 pm 1qc Re LU

Case 11 (Two dimensional max-pooling (3 ˆ 3-max-pool)) We consider a kernel of size 3 ˆ 3 for simplicity. The goal is to differentiate with respect to the kernel weights for a fixed input. Let g denote a program implementing such a function, it is of the same form as max-m-linear except that the matrix A is of size 9 ˆ 25 (padding values at the boundary of the 3 ˆ 3 patch, this gives 5 ˆ 5 25 inputs and 9 outputs), but it is sparse and can be parametrized by only 9 values, and the evaluation of the linear function for a fixed 5ˆ5 input only requires 9ˆp9 8q 153 addition and multiplications. We then take the maximum of these 9 outputs so that and costpgq 153 8cnonlin. For the same reason as max-m-linear, we have costpgdq costpgq 153 8cnonlin.

Case 12 (l1-norm, } }8) Denote by g a program which evaluate the l1 norm on Rn. It has n inputs. In the worst case, its evaluation can be done with n 1 addition, n multiplication by 1 and n pairwise comparisons. Therefore we have costpgq 2n nc Re LU 1. For the same reasons as all examples before, it is possible to identify a derived program d without requiring additional operation so that costpgdq costpgq 2n nc Re LU 1.

Case 13 (Median of n numbers) Denote by g a program that evaluates the median of n numbers. This can be done by sorting the n numbers and outputting the value corresponding to t n

2 u, which requires roughly n logpnq operations, depending on the algorithm used. The sorting operation is a permutation, one could apply the same permutation to the vector p1, 2, . . . , nq without additional operation required. The number at position t n

2 u, call it i, is the index of the value associated with the median. Setting d to be the null vector in Rn with value 1 at position i only, we have a selection in a conservative gradient for the median with no additional operation required. Therefore in this case costpgq costpgdq.

Case 14 (Selection functions) This example encompasses virtually all examples used in machine learning and extends the median example above. Assume that f : Rp Ñ R is locally Lipschitz, given in the form fpxq fspxqpxq where s: Rp Ñ t1, . . . , mu is an index selection function, and for each i 1, . . . , m, fi : Rp Ñ R is a C2 function. Let g be a program computing f, one possibility is to first evaluate spxq at cost cs and then evaluate fspxqpxq at cost cf. As shown in Bolte and Pauwels (2020b), under very mild restrictions on s and f (which should be expressed with logarithms, polynomials, exponentials etc ...), the function x ÞÑ spxqfpxq is a conservative gradient for f. It can be seen that it is possible to evaluate jointly pg, gdq by first computing s, at a cost cs, then evaluate fs and fs jointly at a cost c . costpgq cs cf costpgdq cs c costpgdq

costpgq cs c

cs cf ď cs 5cf

cs cf where we used c ď 5cf, the cheap gradient principle for smooth programs. This ratio is close to 5 if cs is negligible, we recover the usual ratio for smooth programs. It is close to 1 if cs dominates, which is the case in the median example where fs just corresponds to coordinate number s of the input and has a constant derivative.

Published as a conference paper at ICLR 2023

C PROOFS OF SECTION 5.1

C.1 PROOF OF THE MAIN RESULT

Proof of Theorem 3: Let U P Rpˆp be an orthogonal matrix with entries in t 1, 1u which columns are denoted by u1, . . . , up (with squared norm p). Assume that we have as variables a matrix M P Rpˆp and two matrices A, B P Rpˆp with columns a1, . . . , ap and b1, . . . , bp respectively.

Consider the function

F : px, B, Mq ÞÑ 1

i 1 |r UBT Mxsi|.

The pair p M, Bq will be identified as y in the statement of the theorem. Considering the dictionary of elementary functions t , ˆ, Re LU, c, ˆcu, F has a representation as a program PF using the identity |t| Re LUptq Re LUp tq for all t P R. We may construct PF such that costp PF q 6p2 2p ď 8p2 where we count 2p2 p operation for each matrix vector multiplication to evaluate UBT Mx (there are three of them), p multiplication by 1 to evaluate UBT Mx , 2p application of Re LU (on UBT Mx and UBT Mx), p additions of Re LU outputs to evaluate p applications of the absolute value, p 1 for the outer sum and 1 for the division. Now consider the constraints

signp UBT Maiq ui, i 1, . . . p. (10)

The set of matrices A, B, M satisfying this constraint is an open set, call it S. We now restrict our attention to this open set and argue that costp P 1q does not change if the input variables are constrained to be in S.

We have for all i 1, . . . , p and p A, B, Mq P S, the following directional derivatives with respect to variable x

F 1 1p0, B, M, aiq 1

psignp UBT Maiq T UBT Mai 1

pu T i UBT Mai b T i Mai.

Setting the function G: p A, B, Mq ÞÑ řp i 1 F 1 1p0, B, M, aiq Trp MABT q, we have that G is a polynomial and MGp A, B, Mq řp i 1 bia T i BAT . Note that this does not depend on M.

Fix P 1 any program implementing the directional derivatives function py, Aq ÞÑ F 1 1p0, y, Aq of F described above, with dictionary t , ˆ, Re LU, Re LU1, c, ˆcu, as in the statement of the theorem.

Claim 1 There is a program P2 on dictionary D t , ˆ, c, ˆcu such that G r P2s (on the whole space) and costp P2q ď costp P 1q p.

We use the DAG representation of programs as in Remark 4. Therefore P 1 is described by a DAG which node are either input nodes or computation nodes implementing functions from D1 Re LU. We will modify the program by simple modifications of the computation nodes. We may obtain a program P0 implementing G on S with dictionary D1 Re LU with costp P0q ď costp P 1q p by summing the outputs of P 1. The Re LU1 nodes in P0 represent a semialgebraic function Coste (2000a;b) with values in a finite set. Therefore, there is a dense open semialgebraic set on which all Re LU1 nodes in P0 are locally constant (Coste, 2000a, Theorem 6.7). Reducing S if necessary, we obtain a program P1 on dictionary DRe LU such that P1 P0 on S by replacing each Re LU1 node in P0 by the corresponding constants. We have costp P1q ď costp P0q (we replace computing nodes by constants). By Lemma 1, there is a program P2 on D such that costp P2q costp P1q ď costp P0q ď costp P 1q p and G r P2s (on the whole space). This proves the claim.

We may obtain a program D2 implementing MG with dictionary D by backward algorithmic differentiation on P2, that is D2 backpropp P2q. we have therefore

compp BAT , Dq ď costp D2q ď costp P2, D2q ď 5costp P2q

ď 5p 5costp P 1q,

Published as a conference paper at ICLR 2023

where the first inequality is because D2 is a program computing BAT for all A, B on dictionary D, the second is because adding computation increases the cost, the third is a property of backward algorithmic differentiation on D and the last one is by construction of P2. Note that compp BAT , Dq cppq by definition, therefore we have the claimed lower bound

costp P 1q costp PF q ě cppq 5p

5costp PF q cppq 5p

C.2 AN ADDITIONAL LEMMA

Lemma 1 Let Q: Rp Ñ R be a polynomial and P1 be a program (without loss of generality of level 1) on the dictionary D1 t , ˆ, Re LU, c, ˆcu, such that Q r P1s for all inputs restricted to an open set S Ă Rp. Then there is a level 1 program P2 on the dictionary D D1zt Re LUu t , ˆ, c, ˆcu such that Q r P2s (for all inputs in Rp). Furthermore, if costp Re LUq costpˆcq, then, costp P2q costp P1q.

Proof : We use the DAG representation of programs as in Remark 4. Therefore P1 is described by a DAG which node are either input nodes or computation nodes implementing functions from D1. The function computed by P1 as well as each of its nodes are semi-algebraic Bochnak et al. (2013); Coste (2000a;b). For each Re LU node in the graph representing P1 (assume that there are N of them) we associate a number: the function Re LU1 evaluated on its input (with the convention that Re LU1p0q 0). This defines a semialgebraic function G: Rp Ñ t0, 1u N. As it has values in a finite set, by semialgebraicity, there is an open subset of S1 Ă S such that G is constant on S (Coste, 2000a, Theorem 6.7). Consider P2 which computation graph is the same as that of P1 except that each absolute value node is replaced by multiplication by the corresponding Re LU1 value (which is constant on S1). Then Q r P1s r P2s for all inputs in the open set S1. All computation nodes of programs on D are multivariate polynomials and two polynomials which agree on an open set are equal globally. This concludes the proof. l

D PROOFS OF SECTION 5.3

We investigate in this section the hardness of finding a Clarke subgradient for programs defined on the elementary dictionary D0 t , , Re LUu. We start with an equivalent representation of these programs as linear Re LU networks with skip connections and specific weight matrices. This equivalence preserve representation size up to polynomial factors. We will then prove a hardness result on such Re LU networks. This will provide proof arguments for Theorem 4 by the polynomial time equivalence of the two representation. We proceed similarly to prove Proposition 1, using the equivalence with the two representations.

D.1 POLYNOMIAL TIME EQUIVALENCE WITH LINEAR RELU NETWORKS WITH SKIP CONNECTIONS

Given a set of matrices M1 P t 1, 0, 1up1ˆp, M2 P t 1, 0, 1up2ˆp1, ...ML 1 P t 1, 0, 1up L 1ˆp L 2, ML P t 1, 0, 1u1ˆp L 1 we consider the function F : Rp Ñ R,

F : x ÞÑ MLΦL 1p ML 1ΦL 2p. . . Φ1p M1xqqq. (11)

where Φi : Rpi Ñ Rpi are given functions which apply to each coordinate, an activation function which is either the identity or the Re LU function. There is an obvious notion of size for this representation, corresponding to the number of free parameters (matrix entries and coordinates on which Re LU or identity is applied), the size of the representation is p L 1 řL 1 i 1 pi ˆ pi 1 pi.

A function F given in (11) can be represented by a program on D0 of equivalent size, this correspond to a naive implementation. Similarly, any program P P Pp D0q on p inputs and with a single output can be represented by a network as in (11) which size is at most 18costp Pq3. Indeed, we may assume that costp Pq ě p{2 without loss of generality, otherwise, the program would not perform operations on some of the input variables and it could be simplified by removing variables which do not affect the output. Recall that m in Algorithm 1 is the memory footprint of P, in our case, it

Published as a conference paper at ICLR 2023

is m p costp Pq, the number of inputs plus the total number of operations. Note that we have m ď 3costp Pq. Each operation , or Re LU in the program can be represented by a m ˆ m matrix composed with a certain Φ: Rm Ñ Rm which contribution to the Relu network size is at most pm2 mq ď 2m2 ď 18costp Pq2 since m is integer and m ď 3costp Pq. There are costp Pq such operations so that a program can be represented equivalently by linear Relu network, with L costp Pq layers which contribution to the network size is at most 18costp Pq2 so that the size of the resulting network is at most 18costp Pq3, which is the desired bound since.

We have shown that working with functions represented as in equation (11) is equivalent to work with programs in Pp D0q as it is possible to switch from one to the other at a cost of an increase of the representation size which is only cubic. Therefore we will from now on work with functions represented as linear relu networks with skip connections as in (11), and NP-hardness or polynomial time results on such function will be valid on Pp D0q by the construction above.

D.2 FURTHER PROPERTIES OF LINEAR Re LU NETWORKS

Throughout this section F denotes a with representation as in (11). This function is positively homogeneous, it satisfies Fp0q 0 and it. By piecewise linearity, its Clarke subdifferential is a polyhedron (see e.g., Arora et al. (2018); Raghu et al. (2017)). The Clarke subdifferential is a conservative gradient for this function, and we will associate to it a different conservative gradient, associated to Algorithm 2

Definition 2 (Autodiff conservative gradient) We consider a specific conservative gradient for F, it is given by Da F pxq t M T 1 D1M T 2 D2 . . . M T L 1DL 1M T L u, where for i 1, . . . , L 1, Di is a diagonal matrix which entries respects the sign pattern of the corresponding activation function: 1 if the activation is identity, 0 if the activation is Re LU and the input is negative, 1 if the input is positive and all elements in r0, 1s if the input is null. We have in particular

Da F p0q t M T 1 D1M T 2 D2 . . . M T L 1DL 1M T L u (12)

where in this case, diagonal entries of matrices Di corresponding to Re LU activations are arbitrary in r0, 1s and the remaining diagonal entries are 1 (corresponding to identity activations).

The autodiff conservative gradient is associated with the algorithmic differentiation of a natural numerical program implementing F as in Subsection 3.2. Furthermore, one can check that given a program P P Pp D0q, after the transformation outlined in Section D.1, we have that Dα F coincides with DP in Theorem 2. In the following definition, DF could be,for example, the Clarke subdifferential of F or the algorithmic differentiation conservative gradient Da F .

We consider the following problem.

Problem 1 (Conservative gradient enumeration) Given matrices M1 P Rp1ˆp, M2 P Rp2ˆp1, ... ML 1 P Rp L 1ˆp L 2, ML P R1ˆp L 1, and functions Φ1, . . . , ΦL 1, consider F : Rp Ñ R the associated linear Re LU network with skip connections in (11), x P Rp and DF : Rp Ñ Rp a conservative gradient for F. Compute two distinct elements in DF pxq or one element if it is a singleton.

This problem enters the field of computational complexity as we have associated to it a representation size corresponding to the number of free parameters to be chosen: each matrix entry and the activation (Re LU or identity) corresponding to each coordinate, resulting in a number of parameters p L 1 řL 1 i 1 pi ˆ pi 1 pi. In what follows, we will consider integral or rational entries for matrices and input x with the common notion of bit size. Schrijver (1998).

D.2.1 CLARKE ENUMERATION IS NP-HARD FOR RELU NETWORKS

The decision version of Problem 1, under the same assumptions, is to decide if there exists two distinct elements in DF pxq, that is, decide if DF pxq is not reduced to a singleton.

Theorem 5 (Finding two Clarke subgradients is NP-Hard) Decision version of problem (1) with matrix and vector entries in t 1, 0, 1u and DF Bc F is NP-hard.

Published as a conference paper at ICLR 2023

Sketch of proof: We encode a boolean formula π on p boolean variable, in a linear Re LU network with p inputs, of size proportional to that of π. We do so by replacing or operations by maxima, and operations by minima, negation by multiplication by 1 and adding Re LU operations to the result. Using Lemma 3 in appendix D.5, the resulting F is represented by a linear Re LU network. By construction, 0 is a global minimum of F so 0 P Bc Fp0q, and F takes positive values if and only if π is satisfiable if and only if Bc Fp0q t0u. We detail this proof in coming sections.

Theorem 5 illustrates the hardness enumerating Clarke subgradients of linear Re LU networks. For F as in (11) and x P Rp, Bc Fpxq is not a singleton if and only if F is not differentiable at x, therefore:

Corollary 2 (Deciding non-differentiability of a NN is NP-Hard) Given a linear Re LU network as in (11) with matrices as in Theorem 5 and x P Rp, deciding if F is not differentiable at x is NP-hard.

In the coming section, we will provide a proof for Theorem 5 and Corollary 2. By the polynomial time equivalence of the representation of programs in Pp D0q and functions as in (11) detailed in Section D.1, this proves Theorem 4.

We add a remark on lexicographic subdifferential. It follows from (Barton et al., 2018, Proposition 2.7) that, for linear Re LU network F as in (11), the lexicographic subdifferential Nesterov (2005) is the set of neighboring gradients and is contained in Clarke subdifferential.

Corollary 3 (Finding two lexicographic subgradients is NP-Hard) Theorem 5 remains true if DF is the lexicographic subdifferential.

D.3 PROOF OF THE MAIN HARDNESS RESULT

Preliminary on 3-SAT We will use reduction to 3-SAT problem which is among the most well known NP-complete problems. Recall that a boolean formula is built from boolean variables, and operators: AND (conjunction, denoted ) OR (disjunction, _) and NOT (negation, ). A literal, is either a variable or the negation of a variable. A clause is a disjunction of literals (or a single literal). A formula is in conjunctive normal form (CNF), if it is a conjunction of clauses or a clause. 3-SAT is the decidability problem associated to CNF formulas with clauses containing 3 literals, such formulas are called 3-CNF formulas.

Example 2 The formula pb1 _ b2 _ b3q pb1 _ b4 _ b5q p b2 _ b3 _ b6q is 3-CNF with 6 boolean variables b1, . . . , b6 and 3 clauses.

Problem 2 (3-SAT) Given p, n P N and a boolean function π with p boolean arguments b1, . . . , bp represented by a 3-CNF formula with n clauses, decide if there exists an assignment pb1, . . . , bpq P t0, 1up such that πpb1, . . . , bpq 1.

Proof of Theorem 5:

The reduction is to 3-SAT.

Consider a 3-CNF function π in p variables b1, . . . , bp with n clauses of size 3. We may assume without loss of generality that n is of the form 2k for k P N by adding clauses which are always true and increasing the number of clauses by a factor at most 2. We will consider p real variables x1, . . . , xp. Consider the first clause of π, say for example pb1 _ b2 _ b3q. We associate to each literal the corresponding variable x if the literal is equal to a variable, and x if it is the negation of the corresponding variable, for example x1, x2, x3. These are combined using Re LU max resulting in the expression Re LUpmaxtx1, x2, x3uq.

We proceed similarly for each clause, we obtain n 2k expressions involving Re LU max where the max is over three numbers. The max of 3 numbers is the same as the max of 4 numbers (by copying one of the inputs) and, according to Lemma 3, can be represented by a Re LU network with 2 Re LU layers of size at most 3 ˆ 2 6 with weight matrices in t 1, 0, 1u.

We may therefore represent the n Re LU max expressions with a network with p inputs and n outputs, with 3 Re LU layers (2 for each max and one for the outer Re LU) of size at most 6n (6 nodes for each max) involving matrices with entries in t 1, 0, 1u. These expressions are

Published as a conference paper at ICLR 2023

combined using the operator min applied to the n 2k clause. Thanks to Lemma 3 again, using minta, bu maxt a, bu, the max over the 2k numbers can be expressed with k layers of size at most 3 ˆ 2k 1 3

We call the resulting network F. It has a representation as in (11), with matrices with entries in Z3 t 1, 0, 1u as in Problem 1. It contains log2pnq 3 Re LU layers of size at most 6n and it has therefore a description which size is polynomially bounded in n which is proportional to the bit size representation of the 3-CNF formula π.

Example 3 If the 3-CNF formula is given by pb1 _ b2 _ b3q pb1 _ b4 _ b5q p b2 _ b3 _ b6q pb2 _ b2 _ b6q with p 6 boolean variables and n 4 clauses, we will obtain a network computing the following expression in 6 real variables x1, . . . , x6:

Fpx1, . . . , x6q minp Re LUpmaxpx1, x2, x3qq, Re LUpmaxpx1, x4, x5qq, Re LUpmaxp x2, x3, x6qq, Re LUpmaxpx2, x2, x6qqqq.

We have the following rules for min and max over real numbers a, b, c (we use the convention signp0q 0).

maxpa, b, cq ą 0 ô pa ą 0q _ pb ą 0q _ pc ą 0q.

maxpa, b, cq ą 0 ô maxpsignpaq, signpbq, signpcqq ą 0.

minpa, b, cq ą 0 ô pa ą 0q pb ą 0q pc ą 0q.

minpa, b, cq ą 0 ô minpsignpaq, signpbq, signpcqq ą 0.

a ą 0 ô p a ă 0q ô signpaq ą 0.

Re LUpmaxpsignpaq, signpbq, signpcqqq P t0, 1u.

Because of the min Re LU structure, we have Fpxq ě 0 for all x, furthermore, Fp0q 0, so that 0 is a global minimum of F and 0 P Bc Fp0q. For any x, we have Fpxq ą 0 if and only if the output of each max is positive, if and only if each max clause contains a positive argument. We therefore have that Fpxq ą 0 if and only if Fpsignpxqq ą 0 where sign is the coordinatewise application of the sign, taking value 0 at 0.

We have the following chain of equivalence

Bc Fp0q t0u ô Dx P Rp, Fpxq 0 ô Dx P Rp, Fpxq ą 0 ô Dx P Rp, xi 0 p@i 1, . . . , pq Fpxq ą 0 ô Dx P Rp, xi 0 p@i 1, . . . , pq Fpsignpxqq ą 0 ô Dx P t 1, 1up, Fpxq ą 0 ô Dx P t 1, 1up, πpbq 1, bi Ipxi 1q pi 1 . . . pq,

where I outputs 1 if the boolean argument is true, and 0 otherwise. The first equivalence is by Lemma 2, the second is because F ě 0, the third is because F is continuous, the fourth is by the discussion above and the fifth is obvious because all possible t 1, 1u patterns can be described as coordinatewise sign applied vectors in Rp with nonzero entries. For the last equivalence, for xi P t 1, 1u we set bi 0 if xi 1 and bi 1 if xi 1. Each Re LU max applied to the sign vector corresponds to a clause and its output is in t0, 1u. The output of each Re LU max clause is 1 if and only if at least one of its argument is 1, if and only if one of the litteral of the corresponding disjunction is 1 if and only if the disjunction applied to the corresponding boolean variables is true. Otherwise, it is 0. Similarly, the min combination has positive output if and only if all max outputs are 1 if and only if all the disjunctions applied to variables bi are true.

This shows that Problem 1 is NP-hard, because 0 P Bc Fp0q and Bc Fp0q t0u if and only if there exists two distinct elements in Bc Fp0q. l

Published as a conference paper at ICLR 2023

D.4 PROOF OF FEASIBILITY FOR AUTODIFF CONSERVATIVE GRADIENT

The counterpart of Problem 1 for AD conservative gradient in Definition 2 is tractable, illustrating a major computational difference between Clarke subdifferential and AD conservative gradient. The proof is in Section D.4, by reduction to a graph shortest path problem. By the polynomial time equivalence between linear Re LU network and programs on t , , Re LUu proved in Section D.1, this proves Proposition 1.

Proposition 2 Problem (1) with matrix entries in Q and DF Da F is polynomial time solvable.

Proof of Proposition 2: Consider the following polynomial expression:

M T 1 p Q1 Q1q . . . M T L 1p QL 1 QL 1q M T L , (13)

where we decomposed Di Qi Qi in Definition 2, such that Qi is constant, diagonal, with zero entries except for the 1 entries which are enforced by the network activation and sign pattern: strictly positive activation before application of Re LU when network is evaluated at x, or identity activations. Furthermore, Qi contains qi ď pi diagonal variables to be chosen in r0, 1s corresponding to the zero activation pattern before application of Re LU, for i 1, . . . , L 1. The strictly negative values before application of Re LU do not play an additional role, they correspond diagonal entries constrained to 0 in both Qi and Qi, i 1, . . . , L 1. Note that a polynomial is constant on a box if and only if it is constant so the polynomial expression in (13) is constant when diagonal entries are constrained in r0, 1s, if and only if it is constant. So the problem reduces to decide if the polynomial expression in (13) is non constant, with respect to variables Q1, . . . , QL 1. We show that this reduces to a graph connectivity problem over 2 řl 1 i 1 qi vertices and edge weight given by partial products in (13).

First, the problem can be reduced to finding a non-zero value in the expression in (13). Indeed, one can substract the value obtained choosing Qi 0, i 1, . . . , L 1 and use the following block representation: M T 1 M T 1 ˆ Q1 Q1 0 0 Q1

. . . ˆ M T L 1 0 0 M T L 1

ˆ QL 1 QL 1 0 0 QL 1

ˆ M T L M T L

M T 1 p Q1 Q1q . . . M T L 1p QL 1 QL 1q M T L M T 1 Q1 . . . M T L 1 QL 1M T L . (14)

Therefore, expression (13) is nonconstant if and only if expression in (14) takes a nonzero value for some assignment of Q1, . . . , QL 1. The number of variables in (13) and (14) is the same and they have exactly the same form. Therefore we assume without loss of generality that the problem is to decide if the polynomial expression in (13) is not equal to the null polynomial.

Expression (13) is a vector function each of its coordinates being a polynomial function. It is not uniformly null if and only if and only if there exists a coordinate which is not the null polynomial, so we may add a diagonal matrix Q0 with p0 p diagonal entries in r0, 1s (and Q0 0 for the sake of symmetry) and M0 P Rpˆ1 the vector of all ones and find a nonzero value for the product

M T 0 p Q0 Q0q M T 1 p Q1 Q1q . . . M T L 1p QL 1 QL 1q M T L , (15)

Expression (15) is now real valued and therefore defines a polynomial. For each 0 1 . . . L 1, denote by di P r0, 1sqi, the vector containing the diagonal entries of matrix Qi, this corresponds exactly to the variable diagonal elements of Di in Definition 2. Denote by Ppd0, . . . , d Lq the obtained polynomial, P is multilinear in d0, . . . , d L 1, that is, it has an affine dependency for one block vector if the others are fixed. Therefore the hessian of P has zero diagonal blocks and the function is harmonic (hessian has zero trace), as a consequence, the maximum principle for harmonic functions entails that its maximum and minimum on any polytope are attained at vertices.

For i 0, . . . , L 1 denote by i Ă Rqi, the convex hull of the origin and the canonical basis vectors, this is a qi dimensional simplex with nonempty interior. The polynomial P in (15) is identically zero if and only if it vanishes on the product of simplices 0 ˆ . . . ˆ L 1 (which has non empty interior), if and only if it vanishes on the product set of the edges of these simplices by the maximum principle. In other words, P is not identically zero, if and only if it contains a nonzero element when each di is restricted to be an element of the canonical basis (zero vector with exactly one nonzero entry) or the null vector.

Define a graph with a layer structure:

Published as a conference paper at ICLR 2023

The source layer V 1 contains a single source node, v 1,1. The zero-th layer V0 contains q0 p nodes v0,1 . . . v0,q0. Recursively, the i-th layer Vi contains qi nodes vi,1 . . . vi,qi, for i 1 . . . L 1. The sink layer VL contains a single node node v L,1.

We connect nodes between consecutive layers, respecting the order induced by the layer structure. For i 1, . . . L 1 and j 0, . . . , L, with j ą i, we connect layers Vi and Vj as follows

Compute the quantity

m i 1 M T m Qm

where if j i 1 the product reduces to the identity (R M T j ). For k 1, . . . , qi and l 1, . . . , qj, add an edge with between vi,k and vj,l if Rk,l 0.

The resulting graph has a number of nodes equal to the number of Re LU functions in F plus p additional nodes and the source and sink nodes. Computation of edges can be done in polynomial time: it requires at most 4p L 1q2 matrix products involving at most 2L 1 matrices. Indeed the product of m matrices has polynomial time complexity in the representation bit size of the m input matrices.

In this graph, a directed path from the source to the sink visits each layer at most once, and in that case it visits a single node. Each such path corresponds to a monomial with nonzero coefficient appearing in the polynomial P in (15) by construction of the graph structure. Conversely each nonzero coefficient of a given monomial in (15) is uniquely associated to a path in the graph which corresponds to the nodes associated to variables in the monomial. Therefore, the source is connected to the sink if and only if there is a nonzero monomial in (15), if and only if the corresponding polynomial is nonzero. Furthermore, each path corresponds to the evaluation of the program at an edge of the product 0 ˆ . . . ˆ L 1. Therefore finding a path connecting the source to the sink allows to compute a nonzero element in the product and if no such path exists, the polynomial is identically zero.

So we have shown that the truth value of problem 1 with DF Da F , is the same as the source being connected to the sink by a directed path in the graph we defined, which has size polynomialy bounded compared to network size. Connectivity can be solved, for example using Dijkstra s algorithm, in time Op|V |2q where |V | is the number of nodes (or vertices). A path represents a nonzero element of Dfp0q and if no such path exists, we conclude that DF p0q t0u. This shows that the problem is solvable in polynomial time and concludes the proof.

D.5 ADDITIONAL LEMMAS

The following lemma provides a characterization of singleton subgradient for linear Re LU networks.

Lemma 2 Let F be a linear Re LU network, then Bc Fp0q t0u if and only if F is constant.

Proof : If F is constant, the result is immediate because F 0. Now, suppose that Bc Fp0q t0u. We know that F is piecewise linear and there exists a finite set of polyhedron whose union is Rp, where F is affine linear over each polyhedron. Furthermore, F is positively homogeneous, therefore for each x P Rp, Bc Fpxq Bc Fpλxq with λ ą 0. Setting R Ă Rp, the full measure set where F is differentiable, one has that for all x P Rp and

Bc Fpxq conv " v P Rp, Dyk Ñ kÑ8 0 with yk P R, vk Fpykq Ñ kÑ8 v * t0u.

Therefore, each affine part has zero derivative on each polyhedra and by continuity we conclude that F is constant. l

The next lemma describes an explicit representation of maximum of finitely many numbers using a Re LU network with weights in t 1, 0, 1u.

Published as a conference paper at ICLR 2023

Lemma 3 Given k P N, k ą 0, there exists F, a Re LU network with k Re LU layers of size at most 3 ˆ 2k 1 and weight matrices with entries in t 1, 0, 1u, with p 2k inputs such that for any x P Rp,

Fpxq max i 1,...,2k xi.

Proof : We proceed by recursion on k. Note that for any x1, x2 P R

maxtx1, x2u Re LUpx1 x2q x2 Re LUpx1 x2q Re LUpx2q Re LUp x2q.

Set the matrices

1 1 0 1 0 1

B p1 1 1q .

The function F1 : R2 Ñ R given by

F1pxq BRe LUp Axq

satisfies F1pxq maxtx1, x2u. This proves the result for k 1.

Now assume that for k ě 1, we have a network with k Re LU layers of size at most 3ˆ2k represented by matrices M1, . . . , Mk 1 with entries in t 1, 0, 1u, such that the corresponding Re LU network, as in (11) Fk : R2k Ñ R satisfies for all x P R2k,

Fkpxq max i 1,...,2k xi.

Set Fk the concatenation of two copies of Fk, that is Fk : R2k 1 Ñ R2, such that for all x, y P R2k,

Fkpx, yq ˆ maxi 1,...,2k xi maxi 1,...,2k yi

The matrices representing Fk can be described in block form

Mi ˆ Mi 0 0 Mi

P Rp2piqˆp2pi 1q

for i 1, . . . , k 1, where p0 2k and pk 1. This network is made of k layers of size at most 3 ˆ 2k 1, it has 2k 1 inputs and two outputs and its weight matrices have elements in t 1, 0, 1u. The block representation of the last matrix of this network is of the form ˆ Mk 1 0 0 Mk 1

where l is the size of the row vector Mk 1. We have

1 1 0 1 0 1

ˆ ˆ Mk 1 0 0 Mk 1

Mk 1 Mk 1 0 Mk 1 0 Mk 1

We set Fk 1px, yq F1p Fkpxq, Fkpyqq F1p Fkpx, yqq for all x, y P R2k. In matrix notation we have

Fk 1px, yq BRe LUp A Fkpx, yqq.

The involved matrices are Mk 2 B, Aˆ Mk 1 and Mk . . . M1. They all have entries in t 1, 0, 1u and the corresponding network has layers of size at most 3ˆ2k 1. The result then holds by recursion. l