# Oracle Complexity in Nonsmooth Nonconvex Optimization

Guy Kornowski (Weizmann Institute of Science, guy.kornowski@weizmann.ac.il)
Ohad Shamir (Weizmann Institute of Science, ohad.shamir@weizmann.ac.il)

35th Conference on Neural Information Processing Systems (NeurIPS 2021).

**Abstract.** It is well-known that given a smooth, bounded-from-below, and possibly nonconvex function, standard gradient-based methods can find $\epsilon$-stationary points (with gradient norm less than $\epsilon$) in $O(1/\epsilon^2)$ iterations. However, many important nonconvex optimization problems, such as those associated with training modern neural networks, are inherently not smooth, making these results inapplicable. In this paper, we study nonsmooth nonconvex optimization from an oracle complexity viewpoint, where the algorithm is assumed to be given access only to local information about the function at various points. We provide two main results (under mild assumptions): First, we consider the problem of getting near $\epsilon$-stationary points. This is perhaps the most natural relaxation of finding $\epsilon$-stationary points, which is impossible in the nonsmooth nonconvex case. We prove that this relaxed goal cannot be achieved efficiently, for any distance and accuracy parameters smaller than some constants. Our second result deals with the possibility of tackling nonsmooth nonconvex optimization by reduction to smooth optimization: Namely, applying smooth optimization methods on a smooth approximation of the objective function. For this approach, we prove an inherent trade-off between oracle complexity and smoothness: On the one hand, smoothing a nonsmooth nonconvex function can be done very efficiently (e.g., by randomized smoothing), but with dimension-dependent factors in the smoothness parameter, which can strongly affect iteration complexity when plugging into standard smooth optimization methods. On the other hand, these dimension factors can be eliminated with suitable smoothing methods, but only by making the oracle complexity of the smoothing process exponentially large.

## 1 Introduction

We consider optimization problems associated with functions $f : \mathbb{R}^d \to \mathbb{R}$, where $f(\cdot)$ is Lipschitz continuous and bounded from below, but otherwise satisfies no special structure, such as convexity. Clearly, in high dimensions, it is generally impossible to efficiently find a global minimum of a nonconvex function. However, if we relax our goal to finding (approximate) stationary points, then the nonconvexity is no longer an issue. In particular, it is known that if $f(\cdot)$ is smooth (namely, differentiable with a Lipschitz continuous gradient), then for any $\epsilon > 0$, simple gradient-based algorithms can find $x$ such that $\|\nabla f(x)\| \le \epsilon$, using only $O(1/\epsilon^2)$ gradient computations, independent of the dimension (see for example [30, 22, 11]).

Unfortunately, many optimization problems of interest are inherently not smooth. For example, when training modern neural networks, involving max operations and rectified linear units, the associated optimization problem is virtually always nonconvex as well as nonsmooth. Thus, the positive results above, which crucially rely on smoothness, are inapplicable. Although there are positive results even for nonconvex nonsmooth functions, they tend to be either purely asymptotic in nature (e.g.,
[8, 23, 36, 16, 26]), depend on the internal structure and representation of the objective function, or require additional assumptions which many problems of interest do not satisfy, such as weak convexity or some separation between the nonconvex and nonsmooth components (e.g., [12, 13, 18, 10, 15, 17, 6]); a trivial example arising in deep learning, which does not satisfy most such structural assumptions, is the negative of the ReLU function, $x \mapsto -\max\{0, x\}$. This leads to the interesting question of developing black-box algorithms with non-asymptotic guarantees for nonsmooth nonconvex functions, without assuming any special structure.

In this paper, we study this question via the well-known framework of oracle complexity [27]: Given a class of functions $\mathcal{F}$, we associate with each $f \in \mathcal{F}$ an oracle, which for any $x$ in the domain of $f(\cdot)$ returns local information about $f(\cdot)$ at $x$, such as its value and gradient (for non-differentiable functions, we use a standard generalization of gradients following [14]; see Sec. 2 for details). We consider iterative algorithms which can be described via an interaction with such an oracle: At every iteration $t$, the algorithm chooses an iterate $x_t$, and receives from the oracle local information about $f(\cdot)$ at $x_t$, which is then used to choose the next iterate $x_{t+1}$. This framework captures essentially all iterative algorithms for black-box optimization. In this framework, we fix some iteration budget $T$, and ask what properties can be guaranteed for the iterates $x_1, \ldots, x_T$, as a function of $T$ and uniformly over all functions in $\mathcal{F}$ (for example, how close to optimal they are, whether they contain an approximately-stationary point, etc.).

Unfortunately, as recently pointed out in [37], neither small optimization error nor small gradients can be obtained for nonsmooth nonconvex functions with such local-information algorithms: Indeed, approximately-stationary points can be easily hidden inside some arbitrarily small neighborhood, which cannot be found in a bounded number of iterations. Instead, we consider here two alternative approaches to tackle nonsmooth nonconvex optimization, and provide new oracle complexity results for each. We note that Zhang et al. [37] recently proposed another promising approach, by defining a certain relaxation of approximate stationarity (so-called $(\delta, \epsilon)$-stationarity), and remarkably, proved that points satisfying this relaxed goal can be found via simple iterative algorithms with provable guarantees. However, there exist cases where their definition does not resemble a stationary point in any intuitive sense, and thus it remains to be seen whether it is the most appropriate one. We further discuss the pros and cons of their approach in Appendix B.

In our first contribution, we consider relaxing the goal of finding approximately-stationary points to that of finding near-approximately-stationary points: Namely, getting $\delta$-close to a point $x$ with a (generalized) gradient of norm at most $\epsilon$. This is arguably the most natural way to relax the goal of finding $\epsilon$-stationary points, while hopefully still getting meaningful algorithmic guarantees. Moreover, approaching stationary points is feasible in an asymptotic sense (see for instance [17]). Unfortunately, we formally prove that this relaxation already sets the bar too high: For a very large class of gradient-based algorithms (and in fact, all of them under a mild assumption), it is impossible to find near-approximately-stationary points with worst-case finite-time guarantees, for small enough constants $\delta, \epsilon$.
In our second contribution, we consider tackling nonsmooth nonconvex optimization by reduction to smooth optimization: Given the target function $f(\cdot)$, we first find a smooth function $\tilde{f}(\cdot)$ (with Lipschitz gradients) which uniformly approximates it up to some arbitrarily small parameter $\epsilon$, and then apply a smooth optimization method to $\tilde{f}(\cdot)$. Such reductions are common in convex optimization (e.g., [28, 7, 1]), and intuitively, should usually lead to points with meaningful properties with respect to the original function $f(\cdot)$, at least when $\epsilon$ is small enough. For example, it is known that stationary points of $f(\cdot)$ are the limit of approximately-stationary points of appropriate smoothings of $f(\cdot)$, as $\epsilon \to 0$ [32, Thm. 9.67].

This naturally leads to the question of how we can find a smooth approximation of a Lipschitz function $f(\cdot)$. Inspecting existing approaches for smoothing nonconvex functions, we notice an interesting trade-off between computational efficiency and the smoothness of the approximating function: On the one hand, there exist optimization-based methods from the functional analysis literature (in particular, Lasry-Lions regularization [25]) which yield essentially optimal gradient Lipschitz parameters, but are not computationally efficient. On the other hand, there exist simple, computationally tractable methods (such as randomized smoothing [19]), which unfortunately lead to much worse gradient Lipschitz parameters, with strong scaling in the input dimension. This in turn leads to larger iteration complexity when plugging into standard smooth optimization methods.

Is this kind of trade-off between computational efficiency and smoothness necessary? Considering this question from an oracle complexity viewpoint, we prove that this trade-off is indeed necessary under mild assumptions: If we want a smoothing method whose oracle complexity is polynomial in the problem parameters, we must accept that the gradient Lipschitz parameter may be no better than that obtained with randomized smoothing (up to logarithmic factors). Thus, in a sense, randomized smoothing is an optimal nonconvex smoothing method among computationally efficient ones.

Overall, we hope that our work motivates additional research into black-box algorithms for nonconvex, nonsmooth optimization problems, with rigorous finite-time guarantees.

Our paper is structured as follows. In Sec. 2, we formally introduce the notation and terminology that we use. In Sec. 3, we present our results for getting near approximately-stationary points. In Sec. 4, we present our results for smoothing nonsmooth nonconvex functions. We conclude in Sec. 5 with a discussion of open questions. The supplementary material contains, besides the full proofs, a further discussion of the notion of $(\delta, \epsilon)$-stationarity from [37] (Appendix B), and a proof that the dimension dependencies arising from randomized smoothing provably affect the iteration complexity of vanilla gradient descent (Appendix C).

## 2 Preliminaries

**Notation.** We let bold-face letters (e.g., $\mathbf{x}$) denote vectors. $\mathbf{0}$ is the zero vector in $\mathbb{R}^d$ (where $d$ is clear from context), and $e_1, e_2, \ldots$ are the standard basis vectors.
Given a vector $x$, $x_i$ denotes its $i$-th coordinate, and $\bar{x}$ denotes the normalized vector $x/\|x\|$ (assuming $x \neq 0$). $\langle\cdot,\cdot\rangle$ and $\|\cdot\|$ denote the standard Euclidean dot product and its induced norm over $\mathbb{R}^d$, respectively. For any real number $x$, we denote $[x]_+ := \max\{x, 0\}$. Given two functions $f(\cdot), g(\cdot)$ on the same domain $X$, we define $\|f - g\|_\infty = \sup_{x\in X} |f(x) - g(x)|$. We denote by $\mathbb{S}^{d-1} := \{x \in \mathbb{R}^d : \|x\| = 1\}$ the unit sphere. For any natural $N$, we abbreviate $[N] := \{1, \ldots, N\}$. We occasionally use standard big-O asymptotic notation, with $O(\cdot), \Omega(\cdot), \Theta(\cdot)$ hiding constants, $\tilde{O}(\cdot), \tilde{\Omega}(\cdot), \tilde{\Theta}(\cdot)$ hiding constants and logarithmic factors, and $\mathrm{poly}(\cdot)$ denoting polynomial factors.

**Gradients and generalized gradients.** If a function $f : \mathbb{R}^d \to \mathbb{R}$ is differentiable at $x$, we denote its gradient by $\nabla f(x)$. For possibly non-differentiable functions, we let $\partial f(x)$ denote the set of generalized gradients (following [14]), which is perhaps the most standard extension of gradients to nonsmooth nonconvex functions. For Lipschitz functions (which are almost everywhere differentiable by Rademacher's theorem), one simple way to define it is
$$\partial f(x) := \mathrm{conv}\left\{u : u = \lim_{k\to\infty} \nabla f(x_k),\ x_k \to x\right\}$$
(namely, the convex hull of all limit points of $\nabla f(x_k)$, over all sequences $x_1, x_2, \ldots$ of differentiable points of $f(\cdot)$ which converge to $x$). With this definition, a stationary point of $f(\cdot)$ is a point $x$ satisfying $0 \in \partial f(x)$. Also, given some $\epsilon \ge 0$, we say that $x$ is an $\epsilon$-stationary point of $f(\cdot)$ if there is some $u \in \partial f(x)$ such that $\|u\| \le \epsilon$. For example, for $f(x) = |x|$ on $\mathbb{R}$ we get $\partial f(0) = \mathrm{conv}\{-1, +1\} = [-1, 1]$, so $0 \in \partial f(0)$, and the origin is a stationary point.

**Oracles.** In this paper, we consider two types of oracles:

- A first-order oracle $\mathcal{O}^1_f(x)$, which given a function $f(\cdot)$ and a point $x$, returns $f(x)$ and $\partial f(x)$. We note that this differs somewhat from the standard formulation of first-order oracles in the literature, where it is assumed that at points of non-differentiability, the oracle only returns some element of the generalized gradient set $\partial f(x)$. However, this means that our oracle complexity lower bounds hold under weaker assumptions, since we only provide more information to the algorithm. We use this type of oracle in the results of Sec. 3, but discuss possible extensions in Remark 2.
- An all-derivatives oracle $\mathcal{O}^\infty_f(x)$, which returns $f(x)$, $\partial f(x)$, as well as all higher-order derivatives of $f(\cdot)$ at $x$, if they exist. This oracle provides more information than a first-order oracle, so any lower bound with respect to an all-derivatives oracle trivially applies with respect to a first-order oracle. Some of our bounds apply with respect to such a powerful oracle, and we present them in this generality, although we note that in the nonconvex-nonsmooth setting, it remains to be seen whether there is any use for higher-order derivative information. We use this type of oracle in the results of Sec. 4.
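To make the first-order oracle concrete, here is a minimal sketch (ours, with a hypothetical helper name; the interval representation of the gradient set is an illustrative choice that works in one dimension) of $\mathcal{O}^1_f$ for the 1-Lipschitz function $f(x) = |x|$:

```python
def first_order_oracle_abs(x, tol=1e-12):
    """First-order oracle O^1_f for f(x) = |x| on R.

    Returns f(x) together with the generalized (Clarke) gradient set
    partial f(x), represented as an interval (lo, hi). Away from 0 the
    function is differentiable and the set is a singleton; at 0 it is
    conv{-1, +1} = [-1, 1], the convex hull of limits of nearby gradients.
    """
    value = abs(x)
    if x > tol:
        grad_set = (1.0, 1.0)
    elif x < -tol:
        grad_set = (-1.0, -1.0)
    else:
        grad_set = (-1.0, 1.0)  # all convex combinations of the one-sided slopes
    return value, grad_set

value, (lo, hi) = first_order_oracle_abs(0.0)
print(value, lo, hi)       # 0.0 -1.0 1.0
print(lo <= 0.0 <= hi)     # True: 0 lies in partial f(0), so 0 is stationary
```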
## 3 Hardness of Getting Near Approximately-Stationary Points

In this section, we present our first main result, which establishes the hardness of getting near approximately-stationary points. To avoid trivialities, and following [37], we will focus on functions $f(\cdot)$ which are Lipschitz and bounded from below. In particular, we will assume that $f(0) - \inf_x f(x)$ is upper bounded by a constant. We note that this is without loss of generality, as $0$ can be replaced by any other fixed reference point.

Moreover, we will focus on two broad families of algorithms, which together span nearly all algorithms of interest: The first is the class of all deterministic algorithms (denoted as $\mathcal{A}_{\mathrm{det}}$), which are all oracle-based algorithms where $x_1$ is chosen deterministically, and for all $t > 1$, $x_t$ is some deterministic function of $x_1$ and the previously observed values and gradients. The second is the class of all zero-respecting algorithms (denoted as $\mathcal{A}_{\mathrm{zr}}$), which are all deterministic or randomized oracle-based algorithms whose sequence of query points $x_1, x_2, \ldots$ satisfies $\mathrm{supp}\{x_t\} \subseteq \bigcup_{i<t} \mathrm{supp}\{\partial f(x_i)\}$, where $\mathrm{supp}\{x\} := \{j : x_j \neq 0\}$ (in words, the algorithm only queries coordinates which already appeared in previously observed gradients; this class includes gradient descent and most other standard first-order methods).

**Theorem 1.** There exist an absolute constant $C > 0$ and a small enough absolute constant $c > 0$ such that the following holds: For any algorithm in $\mathcal{A}_{\mathrm{det}} \cup \mathcal{A}_{\mathrm{zr}}$ interacting with a first-order oracle, any $T \ge 1$, and any $d \ge 2T$, there is a function $f(\cdot)$ on $\mathbb{R}^d$ such that:

- $f(\cdot)$ is $C$-Lipschitz, and $f(0) - \inf_x f(x) \le C$.
- With probability at least $1 - T\exp(-cd)$ over the algorithm's randomness (or deterministically, if the algorithm is deterministic), the iterates $x_1, \ldots, x_T$ produced by the algorithm satisfy $\inf_{x\in S}\min_{t\in\{1,\ldots,T\}}\|x_t - x\| \ge c$, where $S$ is the set of $c$-stationary points of $f(\cdot)$.

The theorem implies that for a very large family of algorithms, it is impossible to obtain worst-case guarantees for finding near-approximately-stationary points of Lipschitz, bounded-from-below functions, unless $T$ has exponential dependence on $d$. Before continuing, we note that the result can be extended to any oracle-based algorithm (randomized or deterministic) under a widely-believed assumption; see Remark 5 in the proof for details.

**Remark 1** (More assumptions on $f(\cdot)$). The Lipschitz functions $f(\cdot)$ used to prove the theorem are based on a simple composition of affine functions, the Euclidean norm function $x \mapsto \|x\|$, and the max function. Thus, the result also holds for more specific families of functions considered in the literature, which satisfy additional regularity properties, as long as they contain any Lipschitz function composed as above (for example, Hadamard semi-differentiable functions [37], Whitney-stratifiable functions [9, 16], semi-algebraic functions, etc.).

The formal proof of the theorem appears in Appendix A in the supplementary material, but can be informally described as follows: First, we construct a Lipschitz function on $\mathbb{R}^d$, specified by a small vector $w$, which resembles the norm function $x \mapsto \|x\|$ in most of $\mathbb{R}^d$, but with a channel leading away from a neighborhood of the origin in the direction of $w$, and reaching a completely flat region (see Fig. 1). We emphasize that the graphical illustration is a bit misleading due to the low dimension: In high dimensions, the channel and flat region contain a vanishingly small portion of $\mathbb{R}^d$. This function has the property of having $\epsilon$-stationary points only in the flat region in the direction of $w$, even though the function appears in most places like the norm function $x \mapsto \|x\|$. As a result, any oracle-based algorithm that doesn't happen to hit the vanishingly small region where the function differs from $x \mapsto \|x\|$ receives no information about $w$, and thus cannot determine where the $\epsilon$-stationary points lie. Consequently, such an algorithm cannot return near-approximately-stationary points. Unfortunately, the construction described so far does not work as-is, since the algorithm can always query sufficiently close to the origin (closer than roughly $\|w\|$), where the gradients do provide information about $w$.
To prevent this, we compose the function with an algorithm-dependent affine mapping, which doesn't significantly change the function's properties, but ensures that the algorithm can never get too close to the (mapped) origin. We show that such an affine mapping must exist, based on standard oracle complexity results, which imply that oracle-based algorithms as above cannot get too close to the minimum of a generic convex quadratic function with a bounded number of queries. Using this mapping, and the useful property that $w$ can be chosen arbitrarily small, leads to our theorem. The full proof details appear in Appendix A in the supplementary material.

Figure 1: Mesh and contour plot of the function $x \mapsto \max\{-1, g_w(x)\}$ on $\mathbb{R}^2$, where $w = (0.3, 0)$ and $g_w(\cdot)$ is defined in Lemma 2 (as part of the proof of Thm. 1). The origin is marked with a red dot. Best viewed in color.

**Remark 2** (Extension to higher-order algorithms). Our proof approach is quite flexible, in the sense that for any algorithm, we really only need some function which cannot be optimized to arbitrarily high accuracy, composed with a channel construction as above. Since functions of this type also exist for algorithms employing higher-order derivatives beyond gradients (e.g., [2]), it is unlikely that such higher-order algorithms will circumvent our impossibility result.

## 4 Smoothing Nonsmooth Nonconvex Functions

In this section, we turn to our second main contribution, examining the possibility of reducing nonsmooth nonconvex optimization to smooth nonconvex optimization, by running a smooth optimization method on a smooth approximation of the objective function. In what follows, we focus our discussion on 1-Lipschitz functions: This is without loss of generality, since if our objective is $L$-Lipschitz, we can simply rescale it by $1/L$ (and a Lipschitz assumption is always necessary if we wish to obtain a Lipschitz-gradient smooth approximation). Also, we focus on smoothing functions over all of $\mathbb{R}^d$ for simplicity, but our results and proofs easily extend to the case where we are only interested in smoothing over some bounded domain on which the function is Lipschitz.

For a nonsmooth convex function $f(\cdot)$, a well-known smoothing approach is proximal smoothing (also known as the Moreau envelope, or Moreau-Yosida regularization [5]), defined as
$$P_\delta(f)(x) := \min_{y\in\mathbb{R}^d}\left\{f(y) + \frac{1}{2\delta}\|x - y\|^2\right\}. \tag{1}$$
By picking $\delta$ appropriately, $P_\delta(f)$ is an arbitrarily good smooth approximation of $f$: More formally, if $f$ is 1-Lipschitz, then for any $\epsilon > 0$, there exists a choice of $\delta = \Theta(\epsilon)$ such that $\|P_\delta(f) - f\|_\infty \le \epsilon$, with the gradients of $P_\delta(f)$ being $O(1/\epsilon)$-Lipschitz. This is essentially optimal, as no $\epsilon$-approximation can attain a gradient Lipschitz parameter better than $\Omega(1/\epsilon)$ (see Lemma 11 in Appendix E for a formal proof). Finally, computing gradients of $P_\delta(f)$ (which can then be fed into a gradient-based smooth optimization method) is feasible given a solution to Eq. (1), which is a convex optimization problem and hence efficiently solvable in general.
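As a one-dimensional illustration (our sketch; we rely on the standard closed form for this example, where the minimizer in Eq. (1) for $f(x) = |x|$ is given by soft-thresholding and $P_\delta(f)$ is the Huber function):

```python
import numpy as np

def moreau_envelope_abs(x, delta):
    """Proximal smoothing P_delta(f)(x) = min_y { f(y) + ||x-y||^2/(2*delta) }
    for f(x) = |x|. The minimizer is the soft-thresholding prox, and the
    envelope is the Huber function: quadratic near 0, linear far away."""
    prox = np.sign(x) * np.maximum(np.abs(x) - delta, 0.0)  # argmin over y
    env = np.abs(prox) + (x - prox) ** 2 / (2.0 * delta)
    grad = (x - prox) / delta  # = clip(x/delta, -1, 1); (1/delta)-Lipschitz
    return env, grad

eps = 0.1
delta = 2.0 * eps  # delta = Theta(eps): ||P_delta(f) - f||_inf = delta/2 = eps
xs = np.linspace(-1.0, 1.0, 10001)
env, _ = moreau_envelope_abs(xs, delta)
print(np.max(np.abs(env - np.abs(xs))))  # ~0.1, an eps-accurate approximation
```

With this choice of $\delta$, the gradient is $(1/2\epsilon)$-Lipschitz, matching the $O(1/\epsilon)$ rate stated above.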
Unfortunately, for nonconvex functions, proximal smoothing (or other smoothing methods from convex optimization) generally fails to produce smooth approximations. However, it turns out that similar guarantees can be obtained with a slightly more complicated procedure, known as Lasry-Lions regularization in the functional analysis literature [25, 3], which is essentially a double application of proximal smoothing, combined with function flipping. One way to define it is as follows:
$$P_{\delta,\mu}(f)(x) := -P_\delta(-P_\mu(f))(x) = \max_{z\in\mathbb{R}^d}\min_{y\in\mathbb{R}^d}\left\{f(y) + \frac{1}{2\mu}\|z - y\|^2 - \frac{1}{2\delta}\|x - z\|^2\right\}.$$
Once more, if $f$ is 1-Lipschitz, then choosing $\delta, \mu = \Theta(\epsilon)$ appropriately, we get an $\epsilon$-accurate approximation of $f$, with gradients which are $(c/\epsilon)$-Lipschitz for some absolute constant $c$. However, unlike the convex case, implementing this smoothing involves solving a nonconvex optimization problem, which may not be computationally tractable.

Alternatively, a very simple smoothing approach, which works equally well on convex and nonconvex problems, is randomized smoothing, or equivalently, convolving the objective function with a smoothness-inducing density function. Formally, given the objective function $f$ and a distribution $P$, we define $\tilde{f}(x) := \mathbb{E}_{y\sim P}[f(x + y)]$. In particular, letting $P$ be a uniform distribution on an origin-centered ball of radius $\epsilon$, the resulting function is an $\epsilon$-approximation of $f(\cdot)$, and its gradient Lipschitz parameter is $c\sqrt{d}/\epsilon$, where $c$ is an absolute constant and $d$ is the input dimension [19]. Moreover, given access to values and gradients of $f(\cdot)$, computing unbiased stochastic estimates of the values or gradients of $\tilde{f}(\cdot)$ is computationally very easy: We just sample a single $y \sim P$, and return $f(x + y)$ or $\nabla f(x + y)$ (by Rademacher's theorem, $f$ is differentiable almost everywhere, hence $\nabla f(x + y)$ exists almost surely). These stochastic estimates can then be plugged into stochastic methods for smooth optimization (see [19, 20]).
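A minimal sketch of this estimator (ours; the helper names and the 1-Lipschitz test function $f(x) = \|x\|_\infty$ are illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_uniform_ball(d, radius):
    """Uniform sample from the origin-centered radius-`radius` ball in R^d."""
    g = rng.standard_normal(d)
    return radius * rng.uniform() ** (1.0 / d) * g / np.linalg.norm(g)

def grad_f(x):
    # Gradient of the 1-Lipschitz test function f(x) = ||x||_inf, which is
    # differentiable almost everywhere (hence a.s. at the perturbed point).
    i = np.argmax(np.abs(x))
    g = np.zeros_like(x)
    g[i] = np.sign(x[i])
    return g

def randomized_smoothing_grad(x, eps):
    """One-query unbiased stochastic gradient of the smoothed function
    f_tilde(x) = E_y[f(x + y)], y uniform on the eps-ball."""
    return grad_f(x + sample_uniform_ball(len(x), eps))

d, eps = 100, 0.1
x = rng.standard_normal(d)
est = np.mean([randomized_smoothing_grad(x, eps) for _ in range(1000)], axis=0)
print(np.linalg.norm(est))  # the average of estimates approximates grad f_tilde(x)
```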
Comparing these two approaches, we see an interesting potential trade-off between the smoothness obtained and computational efficiency, summarized in the following table:

| Method | $\nabla\tilde{f}$ Lipschitz param. | Computationally efficient? |
| --- | --- | --- |
| Randomized smoothing | $c\sqrt{d}/\epsilon$ | Yes |
| Lasry-Lions regularization | $c/\epsilon$ | No |

In words, randomized smoothing is computationally efficient (unlike Lasry-Lions regularization), but at the cost of a much larger gradient Lipschitz parameter. Since the iteration complexity of smooth optimization methods strongly depends on this Lipschitz parameter, it follows that in high-dimensional problems, we pay a high price for computational tractability when reducing nonsmooth to smooth problems. We emphasize that this is a real phenomenon, and not just an artifact of iteration complexity analysis, as we demonstrate in Appendix C.

This discussion leads to a natural question: Is this trade-off necessary, or do there perhaps exist computationally efficient methods which can improve on randomized smoothing in terms of the gradient Lipschitz parameter? Using an oracle complexity framework, we prove that this trade-off is indeed necessary (under mild assumptions), and that randomized smoothing is essentially an optimal method under the constraint of black-box access to the objective $f(\cdot)$ and a reasonable oracle complexity. We note that Duchi et al. [19] proved that the Lipschitz constant cannot be improved by simple randomized smoothing schemes, but here we consider a much larger class of possible methods.

### 4.1 Smoothing Algorithms

Before presenting our main result for this section, we need to carefully formalize what we mean by an efficient smoothing method, since returning a smooth approximating function over $\mathbb{R}^d$ is not algorithmically well-defined. Recalling the motivation for our problem, we want a method that, given a nonsmooth objective function $f(\cdot)$, allows us to estimate values and gradients of a smooth approximation $\tilde{f}(\cdot)$ at arbitrary points, which can then be fed into standard black-box methods for smooth optimization (hence, we need a uniform approximation property). Moreover, for black-box optimization, it is desirable that this smoothing method operates in an oracle complexity framework, where it only requires local information about $f(\cdot)$ at various points. Finally, we are interested in controlling various parameters of the smoothing procedure, such as the degree of approximation, the smoothness of the resulting function, and the complexity of the procedure. In light of these considerations, a natural way to formalize smoothing methods is the following:

**Definition 1.** An algorithm $\mathcal{A}$ is an $(L, \epsilon, T, M, r)$-smoother if for any 1-Lipschitz function $f$ on $\mathbb{R}^d$, there exists a differentiable function $\tilde{f}$ on $\mathbb{R}^d$ with the following properties:

1. $\|\tilde{f} - f\|_\infty \le \epsilon$, and $\nabla\tilde{f}$ is $L$-Lipschitz.
2. Given any $x \in \mathbb{R}^d$, the algorithm produces a (possibly randomized) query sequence $x_1, \ldots, x_T \in \{y : \|y - x\| \le r\}$ of the form $x_{i+1} = \mathcal{A}^{(i)}\big(\xi, \mathcal{O}^\infty_f(x_1), \ldots, \mathcal{O}^\infty_f(x_i)\big)$, where $\mathcal{A}^{(i)}$ maps all the previous information gathered by the queries to a new query, possibly based on a draw of some random variable $\xi$ (we assume nothing about $\xi$, allowing the algorithm to utilize an arbitrary amount of randomness). Finally, the algorithm produces a vector $\mathcal{A}(f, x) := \mathcal{A}^{(\mathrm{out})}\big(\xi, \mathcal{O}^\infty_f(x_1), \ldots, \mathcal{O}^\infty_f(x_T)\big)$, where $\mathcal{A}^{(\mathrm{out})}$ is some mapping to $\mathbb{R}^d$, such that
$$\left\|\mathbb{E}[\mathcal{A}(f, x)] - \nabla\tilde{f}(x)\right\| \le \epsilon \quad\text{and}\quad \Pr\big[\|\mathcal{A}(f, x)\| \le M\big] = 1. \tag{2}$$

Some comments about this definition are in order. First, the definition only concerns the algorithm's ability to approximate gradients of $\tilde{f}(\cdot)$: It is quite possible that the algorithm also has additional output (such as an approximation of the value of $\tilde{f}(\cdot)$), but this is not required for our results. Second, we do not require the algorithm to return $\nabla\tilde{f}(x)$ exactly: It is enough that the expectation of the vector output is close to it (up to $\epsilon$). This formulation captures both deterministic optimization-type methods (such as Lasry-Lions regularization, where in general we can only hope to solve the auxiliary optimization problem up to some finite precision), as well as stochastic methods (such as randomized smoothing, which returns $\nabla\tilde{f}(x)$ in expectation). Third, we assume that the queries produced by the algorithm lie at a bounded distance $r$ from the input point $x$. In the context of randomized smoothing, this corresponds (for example) to using a uniform distribution over a ball of radius $r$ centered at $x$. As we discuss later on, some assumption on the magnitude of the queries is necessary for our proof technique. However, requiring almost-sure boundedness is merely for simplicity, and it can be replaced by a high-probability bound with appropriate tail assumptions (e.g., if we are performing randomized smoothing with a Gaussian distribution), at the cost of making the analysis a bit more complicated.
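For intuition, here is a sketch (ours; class and method names are hypothetical) of how randomized smoothing over a ball fits the interface of Definition 1:

```python
import numpy as np

rng = np.random.default_rng(1)

class BallSmoother:
    """Illustrative (L, eps, T, M, r)-smoother in the sense of Definition 1,
    instantiated by randomized smoothing over the eps-ball: T = 1 query at
    distance at most r = eps from the input point, output bounded by M = 1
    (for a 1-Lipschitz f), and E[output] = grad f_tilde(x)."""

    def __init__(self, eps):
        self.eps, self.T, self.M, self.r = eps, 1, 1.0, eps

    def query_points(self, x):
        # A^{(1)}: a single randomized query in {y : ||y - x|| <= r}.
        d = len(x)
        g = rng.standard_normal(d)
        y = x + self.r * rng.uniform() ** (1.0 / d) * g / np.linalg.norm(g)
        return [y]

    def output(self, oracle_grads):
        # A^{(out)}: return the (sub)gradient observed at the single query;
        # its expectation over the randomness is grad f_tilde(x).
        return oracle_grads[0]

sm = BallSmoother(eps=0.1)
x = rng.standard_normal(5)
[y] = sm.query_points(x)
print(np.linalg.norm(y - x) <= sm.r)  # True: the query stays within radius r
```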
**Remark 3.** We are mostly interested in the following parameter regimes:

- $T = \mathrm{poly}(d, L, 1/\epsilon)$, essentially meaning that a single call to $\mathcal{A}$ is computable in a reasonable amount of time.
- $M = \mathrm{poly}(L)$. As we formally prove in Lemma 12 in Appendix E, if we require $\tilde{f}$ to approximate $f$ and also have $L$-Lipschitz gradients, we must have $\|\nabla\tilde{f}(x)\| = O(L)$. In particular, whenever $M$ is sufficiently larger than $L$, Eq. (2) is interchangeable with the seemingly more natural condition $\Pr\big[\|\mathcal{A}(f, x) - \nabla\tilde{f}(x)\| \ge M\big] = O(\epsilon)$.
- $r = O(\epsilon)$. If we are interested in smoothing a 1-Lipschitz, nonconvex function up to accuracy $\epsilon$ around a given point $x$, we generally expect that only its $O(\epsilon)$-neighborhood will convey useful information for the smoothing process. We note that this is indeed satisfied by randomized smoothing (with a uniform distribution on a radius-$\epsilon$ ball, or with high probability if we use a Gaussian distribution), as well as by Lasry-Lions regularization (in the sense that the smooth approximation at $x$ does not change if we alter the function arbitrarily outside an $O(\epsilon)$-neighborhood of $x$).

We consider all three of the above to be quite permissive. In particular, notice that randomized smoothing over a ball satisfies the much stronger $T = 1$, $M = 1$, $r = \epsilon$.

Our result will require the assumption that the smoothing algorithm $\mathcal{A}$ is translation invariant with respect to constant functions, in the sense that it treats all constant functions, and all regions of the input space, in the same manner. In order to formalize this property, first notice that if $f$ is a constant function, then the all-derivatives oracle $\mathcal{O}^\infty_f(x)$ does not depend on $x$. Hence, for a constant function $f$, we can use the abbreviation $\mathcal{O}^\infty_f$ without specifying the input point. We use this to formalize our desired translation-invariance property as follows:

**Definition 2.** A smoothing algorithm $\mathcal{A}$ satisfies TICF (translation invariance w.r.t. constant functions) if for any two constant functions $f, g$, any $x \in \mathbb{R}^d$ and $i \in [T]$, and any realization of $\xi$,
$$\mathcal{A}^{(i)}_x\big(\xi, \mathcal{O}^\infty_f, \ldots, \mathcal{O}^\infty_f\big) = \mathcal{A}^{(i)}_{\mathbf{0}}\big(\xi, \mathcal{O}^\infty_g, \ldots, \mathcal{O}^\infty_g\big) + x, \tag{3}$$
and
$$\mathcal{A}^{(\mathrm{out})}_x\big(\xi, \mathcal{O}^\infty_f, \ldots, \mathcal{O}^\infty_f\big) = \mathcal{A}^{(\mathrm{out})}_{\mathbf{0}}\big(\xi, \mathcal{O}^\infty_g, \ldots, \mathcal{O}^\infty_g\big),$$
where the subscript denotes the input point at which the algorithm is invoked. In other words, if instead of a constant function $f$ and an input point $x$ we pick some other constant function $g$ and the origin, the distribution of the algorithm's sequence of queries remains the same (up to a shift by $x$), and the gradient estimate returned by the algorithm remains the same. We consider this to be a mild and natural assumption, which is clearly satisfied by standard smoothing techniques.

### 4.2 Main result

With these definitions in hand, we are finally ready to present our main result for this section:

**Theorem 2.** Let $\mathcal{A}$ be an $(L, \epsilon, T, M, r)$-smoother which satisfies TICF. Then there exist absolute constants $c_1, c_2 > 0$ such that
$$\log\big((M + 1)(T + 1)\big) \;\ge\; \frac{c_1 d}{\left(1 + rL/[c_2 - \epsilon]_+\right)^2}. \tag{4}$$

This theorem holds for general values of the parameters $L, \epsilon, T, M, r$. Concretely, for the parameter regimes of interest (see Remark 3), we have the following corollary:

**Corollary 1.** Suppose that the accuracy parameter satisfies $\epsilon \le c_2/2$. Then any smoothing algorithm which makes at most $T = \mathrm{poly}(d, L, 1/\epsilon)$ queries at distance at most $r = O(\epsilon)$ from the input point, and returns vectors of norm at most $M = \mathrm{poly}(d, L, 1/\epsilon)$, must correspond to a smooth approximation $\tilde{f}(\cdot)$ with gradient Lipschitz parameter at least $L = \tilde{\Omega}(\sqrt{d}/\epsilon)$.

We note that the lower bound on $L$ in this corollary matches (up to logarithmic factors) the upper bound attained by randomized smoothing. This implies that, at least under our framework and assumptions, randomized smoothing is an essentially optimal efficient smoothing method.
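To spell out the arithmetic behind Corollary 1 (and Corollary 2 below): with $T, M = \mathrm{poly}(d, L, 1/\epsilon)$, the left-hand side of Eq. (4) is $O(\log(dL/\epsilon))$, while $\epsilon \le c_2/2$ ensures $[c_2 - \epsilon]_+ \ge c_2/2 = \Omega(1)$. Rearranging Eq. (4),

$$1 + \frac{rL}{[c_2 - \epsilon]_+} \;\ge\; \sqrt{\frac{c_1 d}{O\big(\log(dL/\epsilon)\big)}}\,, \qquad\text{hence}\qquad L = \tilde{\Omega}\!\left(\frac{\sqrt{d}}{r}\right),$$

and substituting $r = O(\epsilon)$ yields $L = \tilde{\Omega}(\sqrt{d}/\epsilon)$.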
Another implication of Thm. 2 is that even if we relax our assumption that $r = O(\epsilon)$, then as long as $r$ does not scale with the dimension $d$, the gradient Lipschitz parameter of any efficient smoothing algorithm must scale with the dimension (even though there exist dimension-free smooth approximations, as evidenced by Lasry-Lions regularization):

**Corollary 2.** Fix any accuracy parameter $\epsilon \le c_2/2$ and any $r > 0$. Then as long as the number of queries is $T = \mathrm{poly}(d)$ and the output is of size $M = \mathrm{poly}(d)$, we must have $L = \tilde{\Omega}(\sqrt{d}/r)$.

A third corollary of our theorem is that (perhaps unsurprisingly) there is no way to implement Lasry-Lions regularization efficiently in an oracle complexity framework:

**Corollary 3.** If $\epsilon \le c_2/2$ and $M = \mathrm{poly}(d)$, then any smoothing algorithm for which $\tilde{f}(\cdot)$ corresponds to the Lasry-Lions regularization (which satisfies $L = O(1)$) must use a number of queries $T = \exp(\Omega(d))$.

**Remark 4** (Dependence on $M$). Our definition of a smoothing algorithm focuses on the expectation of the algorithm's output. This leads to a logarithmic dependence on $M$ (an upper bound on the algorithm's output) in Thm. 2, since in the proof we need to bound the influence of exponentially-small-probability events on the expectation. It is plausible that the dependence on $M$ can be eliminated altogether, by changing the definition of a smoothing algorithm to focus on the expectation of its output conditioned on all but highly-improbable events. However, that would complicate the definition and our proof.

We will now outline the proof idea, which is formally presented in Appendix A. Consider a one-dimensional monotonically increasing function $g$, which is locally constant in a $\Theta(1/\sqrt{d})$-neighborhood of each point of a grid $\Delta = \{0, \delta_1, \ldots, \delta_K\} \subseteq [0, 1]$, with $K$ roughly of order $\sqrt{d}$ (see Fig. 2 in Appendix A). We define $f(x) = g(w^\top x)$, where $w \in \mathbb{S}^{d-1}$ is a uniformly random unit vector. We note that $f$ is a simple function, easily implemented by (say) a one-hidden-layer ReLU neural network. We proceed to analyze what happens when a smoothing algorithm is given points of the form $\delta_i w$, $\delta_i \in \Delta$. Since $w$ is random, and the algorithm is assumed to be translation invariant, it can be shown (via a concentration of measure argument) that the algorithm is overwhelmingly likely to produce queries in directions which all have $O(1/\sqrt{d})$ correlation with $w$, as long as the number of queries is polynomial. Consequently, with high probability, the queries all lie in a region where the function $f(\cdot)$ is flat, and the algorithm cannot distinguish between it and a constant function. By the translation-invariance property, this implies that the gradient estimates $\nabla\tilde{f}(\delta_i w)$ must be of small norm, uniformly for all $\delta_i w$. Combining the observation that $\nabla\tilde{f}(\cdot)$ is small along order-of-$\sqrt{d}$-many points between $0$ and the unit vector $w$, together with the fact that $\nabla\tilde{f}(\cdot)$ is $L$-Lipschitz, we can derive an upper bound on how much $\tilde{f}(\cdot)$ can increase along the line segment between $0$ and $w$, roughly on the order of $L/\sqrt{d}$. On the other hand, $\tilde{f}(\cdot)$ is an approximation of $f$, which has a constant increase between $0$ and $w$. Overall, this allows us to deduce a lower bound on $L$ scaling as $\sqrt{d}$, which results in the theorem. (A toy implementation of this construction is sketched below.)
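The following toy sketch is ours: the grid size and plateau width are illustrative stand-ins for the constants chosen in Appendix A. It builds a monotone $g$, implemented as a sum of one-hidden-layer ReLU ramps, which is constant on a neighborhood of each grid point, and composes it with a uniformly random unit direction $w$:

```python
import numpy as np

rng = np.random.default_rng(2)
relu = lambda t: np.maximum(t, 0.0)

def make_hard_instance(d, K):
    """Hard instance from the proof sketch: f(x) = g(<w, x>) with w uniform
    on the unit sphere, and g monotone increasing but locally constant on a
    width-2h neighborhood of each grid point delta_i."""
    grid = np.linspace(0.0, 1.0, K + 1)   # {0, delta_1, ..., delta_K}
    h = 0.25 / K                          # ~Theta(1/sqrt(d)) when K ~ sqrt(d)
    w = rng.standard_normal(d)
    w /= np.linalg.norm(w)

    def g(t):
        # Sum of ReLU "ramps": each rises by 1/K on the gap between two
        # consecutive plateaus, and is constant outside that gap.
        total = 0.0
        for a, b in zip(grid[:-1] + h, grid[1:] - h):
            total += (relu(t - a) - relu(t - b)) / (K * (b - a))
        return total

    return (lambda x: g(x @ w)), w, grid, h

d = 1000
f, w, grid, h = make_hard_instance(d, K=int(np.sqrt(d)))
print(f(0.0 * w), f(1.0 * w))                   # constant increase: 0 -> 1
print(f(grid[3] * w), f((grid[3] + h / 2) * w))  # equal: locally flat at grid pts
```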
## 5 Discussion

In this paper, we studied the problem of nonconvex, nonsmooth optimization from an oracle complexity perspective, and provided two main results: The first (in Sec. 3) is an impossibility result for efficiently getting near approximately-stationary points, and the second (in Sec. 4) proves an inherent trade-off between oracle complexity and the smoothness parameter when smoothing nonsmooth functions. The second result also establishes the optimality of randomized smoothing as an efficient smoothing method, under mild assumptions.

Our work leaves open several questions. First, at a more technical level, there is the question of whether some or all of our assumptions can be relaxed. In particular, the result in Sec. 3 currently holds for deterministic and zero-respecting algorithms. Although this is already a very broad class, one may ask whether it can be made to hold for all algorithms. We believe this is true, and (as we note in Remark 5) it would directly follow from a more general form of first-order oracle complexity bounds for quadratics. As for the smoothness result in Sec. 4, it currently requires the algorithm to be translation invariant w.r.t. constant functions, as well as to query at some bounded distance from the input point $x$. We conjecture that the translation invariance assumption can be relaxed, possibly by a suitable reduction showing that any smoothing algorithm can be converted to a translation-invariant one. However, how to formally perform this remains unclear at the moment. As for the bounded distance of the queries, it is currently an essential assumption for our proof technique, which relies on a function that looks locally constant at many different points but is globally non-constant, and this can generally be detected by querying far enough away from the input point (even along some random direction). Thus, relaxing this assumption may necessitate a different proof technique.

Another open question is whether randomized smoothing can be "derandomized": Our results indicate that the gradient Lipschitz parameter of the smooth approximation cannot be improved, but they leave open the possibility of an efficient method returning the actual gradients of some smooth approximation (up to machine precision), in contrast to randomized smoothing, which only provides noisy stochastic estimates of the gradient. These could then be plugged into smooth optimization methods which assume access to exact gradients (rather than noisy stochastic estimates), generally improving the resulting iteration complexity. We note that naively computing the exact gradient of the $\tilde{f}(\cdot)$ arising from randomized smoothing is infeasible in general, as it involves a high-dimensional integral.

At a more general level, our work leaves open the question of what is the right metric to study for nonsmooth nonconvex optimization, where neither minimizing the optimization error nor finding approximately-stationary points is feasible. In this paper, we showed that getting near approximately-stationary points is not feasible either, at least in the worst case, whereas smoothing can be done efficiently, but not in a dimension-free manner. Can we find other compelling goals to consider? One very appealing notion is the $(\delta, \epsilon)$-stationarity of [37] that we mentioned in the introduction, which comes with clean, finite-time and dimension-free guarantees. Our negative result in Thm. 1 provides further motivation to consider it, by showing that a natural variation of this notion will not work. However, as we discuss in Appendix B in the supplementary material, we need to accept that this stationarity notion can have unexpected behavior, and there exist cases where it does not resemble a stationary point in any intuitive sense. In any case, using an oracle complexity framework to study this and other potential metrics for nonsmooth nonconvex optimization, which combine computational efficiency and finite-time guarantees, remains an interesting direction for future research.

**Broader impact.** In this work we discuss several approaches for providing formal guarantees when optimizing nonsmooth nonconvex functions, and shed light on their inherent limitations and trade-offs in an oracle complexity framework.
Such optimization objectives arise when training modern deep neural networks, a practice which is currently subject to many heuristics. We believe a better theoretical understanding of the field will eventually affect practice and help develop sound deep-learning-based technology. We find it difficult to point to more specific potential negative societal impacts, other than the well-known potential negative implications of such technology as a whole.

**Funding disclosure.** This research is supported by the European Research Council (ERC) grant 754705.

## References

[1] Zeyuan Allen-Zhu and Elad Hazan. Optimal black-box reductions between optimization objectives. arXiv preprint arXiv:1603.05642, 2016.
[2] Yossi Arjevani, Ohad Shamir, and Ron Shiff. Oracle complexity of second-order methods for smooth convex optimization. Mathematical Programming, 178(1-2):327-360, 2019.
[3] Hédy Attouch and Dominique Aze. Approximation and regularization of arbitrary functions in Hilbert spaces by the Lasry-Lions method. Annales de l'Institut Henri Poincaré (C) Non Linear Analysis, 10(3):289-312, 1993.
[4] Keith Ball. An elementary introduction to modern convex geometry. Flavors of Geometry, 31:1-58, 1997.
[5] Heinz H. Bauschke and Patrick L. Combettes. Convex Analysis and Monotone Operator Theory in Hilbert Spaces, volume 408. Springer, 2011.
[6] Amir Beck and Nadav Hallak. On the convergence to stationary points of deterministic and randomized feasible descent directions methods. SIAM Journal on Optimization, 30(1):56-79, 2020.
[7] Amir Beck and Marc Teboulle. Smoothing and first order methods: A unified framework. SIAM Journal on Optimization, 22(2):557-580, 2012.
[8] Michel Benaïm, Josef Hofbauer, and Sylvain Sorin. Stochastic approximations and differential inclusions. SIAM Journal on Control and Optimization, 44(1):328-348, 2005.
[9] Jérôme Bolte, Aris Daniilidis, Adrian Lewis, and Masahiro Shiota. Clarke subgradients of stratifiable functions. SIAM Journal on Optimization, 18(2):556-572, 2007.
[10] Jérôme Bolte, Shoham Sabach, Marc Teboulle, and Yakov Vaisbourd. First order methods beyond convexity and Lipschitz gradient continuity with applications to quadratic inverse problems. SIAM Journal on Optimization, 28(3):2131-2151, 2018.
[11] Yair Carmon, John C. Duchi, Oliver Hinder, and Aaron Sidford. Lower bounds for finding stationary points I. Mathematical Programming, pages 1-50, 2019.
[12] Coralia Cartis, Nicholas I. M. Gould, and Philippe L. Toint. On the evaluation complexity of composite function minimization with applications to nonconvex nonlinear programming. SIAM Journal on Optimization, 21(4):1721-1739, 2011.
[13] Xiaojun Chen. Smoothing methods for nonsmooth, nonconvex minimization. Mathematical Programming, 134(1):71-99, 2012.
[14] Frank H. Clarke. Optimization and Nonsmooth Analysis, volume 5. SIAM, 1990.
[15] Damek Davis and Dmitriy Drusvyatskiy. Stochastic model-based minimization of weakly convex functions. SIAM Journal on Optimization, 29(1):207-239, 2019.
[16] Damek Davis, Dmitriy Drusvyatskiy, Sham Kakade, and Jason D. Lee. Stochastic subgradient method converges on tame functions. Foundations of Computational Mathematics, pages 1-36, 2018.
[17] Dmitriy Drusvyatskiy and Courtney Paquette. Efficiency of minimizing compositions of convex functions and smooth maps. Mathematical Programming, 178(1-2):503-558, 2019.
[18] John C. Duchi and Feng Ruan. Stochastic methods for composite and weakly convex optimization problems. SIAM Journal on Optimization, 28(4):3229-3259, 2018.
[19] John C. Duchi, Peter L. Bartlett, and Martin J. Wainwright. Randomized smoothing for stochastic optimization. SIAM Journal on Optimization, 22(2):674-701, 2012.
[20] Saeed Ghadimi and Guanghui Lan. Stochastic first- and zeroth-order methods for nonconvex stochastic programming. SIAM Journal on Optimization, 23(4):2341-2368, 2013.
[21] A. A. Goldstein. Optimization of Lipschitz continuous functions. Mathematical Programming, 13(1):14-22, 1977.
[22] Chi Jin, Rong Ge, Praneeth Netrapalli, Sham M. Kakade, and Michael I. Jordan. How to escape saddle points efficiently. In Proceedings of the 34th International Conference on Machine Learning, pages 1724-1732. JMLR.org, 2017.
[23] Krzysztof C. Kiwiel. Convergence of the gradient sampling algorithm for nonsmooth nonconvex optimization. SIAM Journal on Optimization, 18(2):379-388, 2007.
[24] Guanghui Lan and Yi Zhou. An optimal randomized incremental gradient method. Mathematical Programming, 171(1-2):167-215, 2018.
[25] Jean-Michel Lasry and Pierre-Louis Lions. A remark on regularization in Hilbert spaces. Israel Journal of Mathematics, 55(3):257-266, 1986.
[26] Szymon Majewski, Błażej Miasojedow, and Eric Moulines. Analysis of nonsmooth stochastic approximation: the differential inclusion approach. arXiv preprint arXiv:1805.01916, 2018.
[27] Arkadii Semenovich Nemirovsky and David Borisovich Yudin. Problem Complexity and Method Efficiency in Optimization. Wiley, 1983.
[28] Yurii Nesterov. Smooth minimization of non-smooth functions. Mathematical Programming, 103(1):127-152, 2005.
[29] Yurii Nesterov. Efficiency of coordinate descent methods on huge-scale optimization problems. SIAM Journal on Optimization, 22(2):341-362, 2012.
[30] Yurii Nesterov. How to make the gradients small. Optima: Mathematical Optimization Society Newsletter, (88):10-11, 2012.
[31] Yurii Nesterov. Lectures on Convex Optimization, volume 137. Springer, 2018.
[32] R. Tyrrell Rockafellar and Roger J.-B. Wets. Variational Analysis, volume 317. Springer Science & Business Media, 2009.
[33] Max Simchowitz. On the randomized complexity of minimizing a convex quadratic function. arXiv preprint arXiv:1807.09386, 2018.
[34] Tomasz Tkocz. An upper bound for spherical caps. The American Mathematical Monthly, 119(7):606-607, 2012.
[35] Blake Woodworth and Nathan Srebro. Lower bound for randomized first order convex optimization. arXiv preprint arXiv:1709.03594, 2017.
[36] Chao Zhang and Xiaojun Chen. Smoothing projected gradient method and its application to stochastic linear complementarity problems. SIAM Journal on Optimization, 20(2):627-649, 2009.
[37] Jingzhao Zhang, Hongzhou Lin, Suvrit Sra, and Ali Jadbabaie. On complexity of finding stationary points of nonsmooth nonconvex functions. arXiv preprint arXiv:2002.04130, 2020.
[19] John C Duchi, Peter L Bartlett, and Martin J Wainwright. Randomized smoothing for stochastic optimization. SIAM Journal on Optimization, 22(2):674 701, 2012. [20] Saeed Ghadimi and Guanghui Lan. Stochastic first-and zeroth-order methods for nonconvex stochastic programming. SIAM Journal on Optimization, 23(4):2341 2368, 2013. [21] AA Goldstein. Optimization of lipschitz continuous functions. Mathematical Programming, 13 (1):14 22, 1977. [22] Chi Jin, Rong Ge, Praneeth Netrapalli, Sham M Kakade, and Michael I Jordan. How to escape saddle points efficiently. In Proceedings of the 34th International Conference on Machine Learning-Volume 70, pages 1724 1732. JMLR. org, 2017. [23] Krzysztof C Kiwiel. Convergence of the gradient sampling algorithm for nonsmooth nonconvex optimization. SIAM Journal on Optimization, 18(2):379 388, 2007. [24] Guanghui Lan and Yi Zhou. An optimal randomized incremental gradient method. Mathematical programming, 171(1-2):167 215, 2018. [25] Jean-Michel Lasry and Pierre-Louis Lions. A remark on regularization in hilbert spaces. Israel Journal of Mathematics, 55(3):257 266, 1986. [26] Szymon Majewski, Bła zej Miasojedow, and Eric Moulines. Analysis of nonsmooth stochastic approximation: the differential inclusion approach. ar Xiv preprint ar Xiv:1805.01916, 2018. [27] Arkadii Semenovich Nemirovsky and David Borisovich Yudin. Problem complexity and method efficiency in optimization. Wiley, 1983. [28] Yurii Nesterov. Smooth minimization of non-smooth functions. Mathematical programming, 103(1):127 152, 2005. [29] Yurii Nesterov. Efficiency of coordinate descent methods on huge-scale optimization problems. SIAM Journal on Optimization, 22(2):341 362, 2012. [30] Yurii Nesterov. How to make the gradients small. Optima. Mathematical Optimization Society Newsletter, (88):10 11, 2012. [31] Yurii Nesterov. Lectures on convex optimization, volume 137. Springer, 2018. [32] R. Tyrrell Rockafellar and Roger J.B. Wets. Variational analysis, volume 317. Springer Science & Business Media, 2009. [33] Max Simchowitz. On the randomized complexity of minimizing a convex quadratic function. ar Xiv preprint ar Xiv:1807.09386, 2018. [34] Tomasz Tkocz. An upper bound for spherical caps. The American Mathematical Monthly, 119 (7):606 607, 2012. [35] Blake Woodworth and Nathan Srebro. Lower bound for randomized first order convex opti- mization. ar Xiv preprint ar Xiv:1709.03594, 2017. [36] Chao Zhang and Xiaojun Chen. Smoothing projected gradient method and its application to stochastic linear complementarity problems. SIAM Journal on Optimization, 20(2):627 649, 2009. [37] Jingzhao Zhang, Hongzhou Lin, Suvrit Sra, and Ali Jadbabaie. On complexity of finding stationary points of nonsmooth nonconvex functions. ar Xiv preprint ar Xiv:2002.04130, 2020.