Journal of Machine Learning Research 23 (2022) 1-43. Submitted 12/21; Revised 10/22; Published 10/22.

Oracle Complexity in Nonsmooth Nonconvex Optimization

Guy Kornowski (guy.kornowski@weizmann.ac.il)
Department of Computer Science and Applied Mathematics, Weizmann Institute of Science, Rehovot, Israel

Ohad Shamir (ohad.shamir@weizmann.ac.il)
Department of Computer Science and Applied Mathematics, Weizmann Institute of Science, Rehovot, Israel

Editor: Tong Zhang

Abstract

It is well-known that given a smooth, bounded-from-below, and possibly nonconvex function, standard gradient-based methods can find ϵ-stationary points (with gradient norm less than ϵ) in O(1/ϵ²) iterations. However, many important nonconvex optimization problems, such as those associated with training modern neural networks, are inherently not smooth, making these results inapplicable. In this paper, we study nonsmooth nonconvex optimization from an oracle complexity viewpoint, where the algorithm is assumed to be given access only to local information about the function at various points. We provide two main results: First, we consider the problem of getting near ϵ-stationary points. This is perhaps the most natural relaxation of finding ϵ-stationary points, which is impossible in the nonsmooth nonconvex case. We prove that this relaxed goal cannot be achieved efficiently, for any distance and ϵ smaller than some constants. Our second result deals with the possibility of tackling nonsmooth nonconvex optimization by reduction to smooth optimization: Namely, applying smooth optimization methods on a smooth approximation of the objective function.
For this approach, we prove under a mild assumption an inherent trade-off between oracle complexity and smoothness: On the one hand, smoothing a nonsmooth nonconvex function can be done very efficiently (e.g., by randomized smoothing), but with dimension-dependent factors in the smoothness parameter, which can strongly affect iteration complexity when plugging into standard smooth optimization methods. On the other hand, these dimension factors can be eliminated with suitable smoothing methods, but only by making the oracle complexity of the smoothing process exponentially large.

Keywords: nonconvex nonsmooth optimization, optimization theory, oracle complexity, smoothing

(c) 2022 Guy Kornowski and Ohad Shamir. License: CC-BY 4.0, see https://creativecommons.org/licenses/by/4.0/. Attribution requirements are provided at http://jmlr.org/papers/v23/21-1507.html.

1. Introduction

We consider optimization problems associated with functions f : ℝ^d → ℝ, where f(·) is Lipschitz continuous and bounded from below, but otherwise satisfies no special structure, such as convexity. Clearly, in high dimensions, it is generally impossible to efficiently find a global minimum of a nonconvex function. However, if we relax our goal to finding (approximate) stationary points, then the nonconvexity is no longer an issue. In particular, it is known that if f(·) is smooth (namely, differentiable with a Lipschitz continuous gradient), then for any ϵ > 0, simple gradient-based algorithms can find x such that ‖∇f(x)‖ ≤ ϵ, using only O(1/ϵ²) gradient computations, independent of the dimension (see for example Nesterov 2012; Jin et al. 2017; Carmon et al. 2019). Unfortunately, many optimization problems of interest are inherently not smooth. For example, when training modern neural networks, involving max operations and rectified linear units, the associated optimization problem is virtually always nonconvex as well as nonsmooth.
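To make the classical O(1/ϵ²) guarantee concrete, here is a minimal sketch (our illustration, not taken from the paper) of gradient descent on a β-smooth function, using the textbook step size 1/β and iteration budget ⌈2β(f(x₀) − inf f)/ϵ²⌉; the function and parameter names are ours:

```python
import numpy as np

def find_stationary_point(grad, x0, eps, beta, gap):
    """Gradient descent with step size 1/beta on a beta-smooth function.

    By the descent lemma, each step decreases f by at least
    ||grad||^2 / (2*beta), so an eps-stationary point must be visited
    within 2*beta*gap/eps**2 steps, where gap >= f(x0) - inf f.
    """
    x = np.asarray(x0, dtype=float)
    for _ in range(int(np.ceil(2 * beta * gap / eps**2))):
        g = grad(x)
        if np.linalg.norm(g) <= eps:  # found an eps-stationary point
            break
        x = x - g / beta
    return x

# Example: f(x) = ||x||^2 / 2 is 1-smooth and bounded below by 0.
x = find_stationary_point(grad=lambda x: x, x0=np.ones(5),
                          eps=1e-3, beta=1.0, gap=2.5)
assert np.linalg.norm(x) <= 1e-3
```

Note that the budget depends on β and ϵ but not on the dimension d, which is exactly the dimension-independence that fails once smoothness is dropped.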
Thus, the positive results above, which crucially rely on smoothness, are inapplicable. Although there are positive results even for nonconvex nonsmooth functions, they tend to be either purely asymptotic in nature (e.g., Benaïm et al. 2005; Kiwiel 2007; Zhang and Chen 2009; Davis et al. 2018; Majewski et al. 2018), depend on the internal structure and representation of the objective function, or require additional assumptions which many problems of interest do not satisfy, such as weak convexity or some separation between nonconvex and nonsmooth components¹ (e.g., Cartis et al. 2011; Chen 2012; Duchi and Ruan 2018; Bolte et al. 2018; Davis and Drusvyatskiy 2019; Drusvyatskiy and Paquette 2019; Beck and Hallak 2020). This leads to the interesting question of developing black-box algorithms with non-asymptotic guarantees for nonsmooth nonconvex functions, without assuming any special structure. In this paper, we study this question via the well-known framework of oracle complexity (Nemirovski and Yudin, 1983): Given a class of functions F, we associate with each f ∈ F an oracle, which for any x in the domain of f(·) returns local information about f(·) at x (such as its value and gradient).² We consider iterative algorithms which can be described via an interaction with such an oracle: At every iteration t, the algorithm chooses an iterate x_t, and receives from the oracle local information about f(·) at x_t, which is then used to choose the next iterate x_{t+1}. This framework captures essentially all iterative algorithms for black-box optimization. In this framework, we fix some iteration budget T, and ask what properties can be guaranteed for the iterates x_1, …, x_T, as a function of T and uniformly over all functions in F (for example, how close to optimal they are, whether they contain an approximately-stationary point, etc.). Unfortunately, as recently pointed out in Zhang et al.
(2020), neither small optimization error (in terms of the function value) nor small gradients can be obtained for nonsmooth nonconvex functions with such local-information algorithms: Indeed, approximately-stationary points can easily be hidden inside some arbitrarily small neighborhood, which cannot be found in a bounded number of iterations. Instead, we consider here two alternative approaches to tackling nonsmooth nonconvex optimization, and provide new oracle complexity results for each. We note that Zhang et al. (2020) recently proposed another promising approach, by defining a certain relaxation of approximate stationarity (so-called (δ, ϵ)-stationarity), and remarkably, proved that points satisfying this relaxed goal can be found via simple iterative algorithms with provable guarantees. However, there exist cases where their definition does not resemble a stationary point in any intuitive sense, and thus it remains to be seen whether it is the most appropriate one. We further discuss the pros and cons of their approach in Appendix A.

1. A trivial example arising in deep learning, which does not satisfy most such structural assumptions, is the negative of the ReLU function, x ↦ −max{0, x}.
2. For non-differentiable functions, we use a standard generalization of gradients following Clarke (1990); see Sec. 2 for details.

In our first contribution, we consider relaxing the goal of finding approximately-stationary points to that of finding near-approximately-stationary points: Namely, getting δ-close to a point x with a (generalized) gradient of norm at most ϵ. This is arguably the most natural way to relax the goal of finding ϵ-stationary points, while hopefully still getting meaningful algorithmic guarantees. Moreover, approaching stationary points is feasible in an asymptotic sense (see for instance Drusvyatskiy and Paquette 2019).
Unfortunately, we formally prove that this relaxation already sets the bar too high: For any possibly randomized algorithm interacting with a local oracle, it is impossible to find near-approximately-stationary points with efficient worst-case guarantees, for small enough constants δ, ϵ. In our second contribution, we consider tackling nonsmooth nonconvex optimization by reduction to smooth optimization: Given the target function f(·), we first find a smooth function f̃(·) (with Lipschitz gradients) which uniformly approximates it up to some arbitrarily small parameter ϵ, and then apply a smooth optimization method on f̃(·). Such reductions are common in convex optimization (e.g., Nesterov 2005; Beck and Teboulle 2012; Allen-Zhu and Hazan 2016), and intuitively, should usually lead to points with meaningful properties with respect to the original function f(·), at least when ϵ is small enough. For example, it is known that stationary points of f(·) are the limits of approximately-stationary points of appropriate smoothings of f(·), as ϵ → 0 (Rockafellar and Wets, 2009, Theorem 9.67). This naturally leads to the question of how we can find a smooth approximation of a Lipschitz function f(·). Inspecting existing approaches for smoothing nonconvex functions, we notice an interesting trade-off between computational efficiency and the smoothness of the approximating function: On the one hand, there exist optimization-based methods from the functional analysis literature (in particular, Lasry-Lions regularization (Lasry and Lions, 1986)) which yield essentially optimal gradient Lipschitz parameters, but are not computationally efficient. On the other hand, there exist simple, computationally tractable methods (such as randomized smoothing; Duchi et al., 2012), which unfortunately lead to much worse gradient Lipschitz parameters, with strong scaling in the input dimension. This in turn leads to larger iteration complexity when plugging into standard smooth optimization methods.
Is this kind of trade-off between computational efficiency and smoothness necessary? Considering this question from an oracle complexity viewpoint, we prove that this trade-off is indeed necessary under mild assumptions: If we want a smoothing method whose oracle complexity is polynomial in the problem parameters, we must accept that the gradient Lipschitz parameter may be no better than that obtained with randomized smoothing (up to logarithmic factors). Thus, in a sense, randomized smoothing is an optimal nonconvex smoothing method among computationally efficient ones. It is important to stress that although we formalize our results for general Lipschitz functions, all of our results readily apply to more restricted classes of Lipschitz functions which are often studied in the optimization literature, such as Hadamard semi-differentiable, Whitney-stratifiable and semi-algebraic functions. We further discuss this in Remark 2. Overall, we hope that our work motivates additional research into black-box algorithms for nonconvex, nonsmooth optimization problems, with rigorous finite-time guarantees. Our paper is structured as follows. In Sec. 2, we formally introduce the notations and terminology that we use. In Sec. 3, we present our results for getting near approximately-stationary points. In Sec. 4, we present our results for smoothing nonsmooth nonconvex functions. In Sec. 5, we provide full proofs. We conclude in Sec. 6 with a discussion of open questions. Our paper also contains a few appendices, which besides technical proofs and lemmas, include further discussion of the notion of (δ, ϵ)-stationarity from Zhang et al. (2020) (Appendix A), and a proof that the dimension-dependency arising from randomized smoothing provably affects the iteration complexity of vanilla gradient descent (Appendix B).

2. Preliminaries

Notation. We let bold-face letters (e.g., x) denote vectors. 0 is the zero vector in ℝ^d (where d is clear from context), and e_1, e_2,
… are the standard basis vectors. Given a vector x, x_i denotes its i-th coordinate, and x̄ denotes the normalized vector x/‖x‖ (assuming x ≠ 0). ⟨·, ·⟩ and ‖·‖ denote the standard Euclidean dot product and its induced norm over ℝ^d, respectively. For any real number x, we denote [x]₊ := max{x, 0}. Given two functions f(·), g(·) on the same domain X, we define ‖f − g‖_∞ := sup_{x∈X} |f(x) − g(x)|. We denote by S^{d−1} := {x ∈ ℝ^d : ‖x‖ = 1} the unit sphere. For any natural N, we abbreviate [N] := {1, …, N}. We occasionally use standard big-O asymptotic notation: For functions f, g : X → [0, ∞) we write f = O(g) if there exists c > 0 such that f(x) ≤ c·g(x) for all x; f = Ω(g) if g = O(f); f = Θ(g) if f = O(g) and g = O(f). We occasionally hide logarithmic factors by writing f = Õ(g) if f = O(g·log(g + 1)), or f = Ω̃(g) if g = Õ(f), and also denote by poly(·) polynomial factors.

Gradients and generalized gradients. If a function f : ℝ^d → ℝ is differentiable at x, we denote its gradient by ∇f(x). For possibly non-differentiable functions, we let ∂f(x) denote the set of generalized gradients (following Clarke 1990), arguably the most standard extension of gradients to certain classes of nonsmooth nonconvex functions. For Lipschitz functions (which are almost everywhere differentiable by Rademacher's theorem), one simple way to define it is ∂f(x) := conv{u : u = lim_{k→∞} ∇f(x_k), x_k → x} (namely, the convex hull of all limit points of ∇f(x_k), over all sequences x_1, x_2, … of differentiable points of f(·) which converge to x). With this definition, a stationary point with respect to f(·) is a point x satisfying 0 ∈ ∂f(x). Also, given some ϵ ≥ 0, we say that x is an ϵ-stationary point with respect to f(·) if there is some u ∈ ∂f(x) such that ‖u‖ ≤ ϵ. Furthermore, when some small δ, ϵ are clear from context, we will say that x is a near-approximately-stationary point if ‖x − x⋆‖ < δ for some ϵ-stationary point x⋆.

Local oracles.
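As a quick numerical illustration of the Clarke construction (ours, not the paper's), one can approximate ∂f(x) for a one-dimensional Lipschitz function by collecting gradients at nearby differentiable points and taking their convex hull, which in 1-D is just an interval; for f(x) = |x| at the kink x = 0 this recovers [−1, 1]:

```python
import numpy as np

def clarke_interval(f, x, tol=1e-6, n=1000):
    """Crude numeric sketch of the Clarke subdifferential of a 1-D Lipschitz
    function: collect central-difference gradients at points near x (where f
    is differentiable) and return their convex hull, i.e. an interval.
    Illustrative only -- it assumes kinks are isolated."""
    h = 1e-8
    pts = x + np.linspace(-tol, tol, n)
    pts = pts[np.abs(pts - x) > 10 * h]          # stay away from the kink itself
    grads = (f(pts + h) - f(pts - h)) / (2 * h)  # gradients at nearby points
    return grads.min(), grads.max()

lo, hi = clarke_interval(np.abs, 0.0)
# For f(x) = |x|, the Clarke subdifferential at 0 is the interval [-1, 1],
# so 0 is a stationary point (0 is in [-1, 1]) even though |x| has a kink there.
assert abs(lo + 1) < 1e-3 and abs(hi - 1) < 1e-3
```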
We consider oracles that, given a function f(·) and a point x, return some quantity O_f(x) which conveys local information about the function near that point. Formally (following Braun et al. 2017), we call an oracle local if for any x and any two functions f, g that are equal over some neighborhood of x, it holds that O_f(x) = O_g(x). An important example is the first-order oracle O_f(x) = (f(x), ∇f(x)), but one can consider more sophisticated oracles, such as those which return all higher-order derivative information wherever it exists. We choose to work in this generality since our results hold for any local oracle whatsoever, although we note that in the general Lipschitz setting discussed in this paper, it remains to be seen whether there is any use for local information which is not first-order.

3. Hardness of Getting Near Approximately-Stationary Points

In this section, we present our first main result, which establishes the hardness of getting near approximately-stationary points. To avoid trivialities, and following Zhang et al. (2020), we will focus on functions f(·) which are Lipschitz and bounded from below. In particular, we will assume that f(0) − inf_x f(x) is upper bounded by a constant. We note that this is without loss of generality, as 0 can be replaced by any other fixed reference point. We consider possibly randomized algorithms which interact with some local oracle. Such an algorithm first produces x_1, possibly at random, receives O_f(x_1) for some local oracle O_f, and for every t > 1 produces x_t, possibly at random, based on previously observed responses. Our main result in this section is the following:

Theorem 1 There exist absolute constants c, C > 0 such that for any algorithm A interacting with a local oracle, and any T ∈ ℕ, d ≥ 2, there is a function f(·) on ℝ^d such that:

- f(·) is C-Lipschitz, f(0) − inf_x f(x) ≤ C, and inf{‖x‖ : ∂f(x) = {0}} ≤ C.
- With probability at least 1 − CT·exp(−cd) over the algorithm's randomness, the iterates x_1, …, x_T produced by the algorithm satisfy

  inf_{x∈Σ} min_{t∈[T]} ‖x_t − x‖ ≥ c ,

  where Σ is the set of c-stationary points of f(·).

The theorem implies that it is impossible to obtain worst-case guarantees for finding near-approximately-stationary points of Lipschitz, bounded-from-below functions, unless T has exponential dependence on d. Note that if an exponential dimension dependence is allowed, then under the theorem's assumptions, one can trivially produce points close to a stationary point (or to any point, for that matter) using an exhaustive grid search. The theorem implies that no oracle-based algorithm will be significantly more efficient than this trivial strategy. Further notice that since every deterministic algorithm can be viewed as a randomized algorithm with a trivial distribution over its decisions, the theorem above readily holds for deterministic algorithms as well. That being the case, the lower bound described in the second bullet holds deterministically.

Remark 2 (More assumptions on f(·)) The Lipschitz functions f(·) used to prove the theorem are based on a composition of affine functions, orthogonal projections, the Euclidean norm function x ↦ ‖x‖, and the max function. Thus, the result also holds for more specific families of functions considered in the literature, which satisfy additional regularity properties, as long as they contain Lipschitz functions composed as above. These include, for example, Hadamard semi-differentiable functions (Zhang et al., 2020), Whitney-stratifiable functions (Bolte et al., 2007; Davis et al., 2018) and semi-algebraic functions.

Figure 1: Illustration of the function used in the proof of Thm. 1. (The figure shows a channel in the direction of w ∈ span{e_1, …, e_{d−1}}, leading past x⋆ to a flat region; outside this region the function does not depend on w.)

The formal proof of the theorem appears in Sec.
5, but can be informally described as follows: First, we construct an algorithm-dependent one-dimensional "hard" Lipschitz function h : ℝ → ℝ that has large derivatives everywhere except at a single point x⋆. By "hard", we mean that after any finite number of steps of the algorithm, we can provide some small neighborhood of x⋆ which the algorithm is likely not to enter. Based on h, we construct a Lipschitz function on ℝ^d, specified by a small vector w ∈ span{e_1, …, e_{d−1}}, which resembles the function x ↦ ‖(x_1, …, x_{d−1})‖ + h(x_d) in most of ℝ^d, but with a "channel" leading away from a small neighborhood of x⋆·e_d in the direction of w, and reaching a completely flat region (see Fig. 1). In high dimensions, the channel and the flat region contain a vanishingly small portion of ℝ^d. This function has the property of having ϵ-stationary points only in the flat region, which lies in the direction of x⋆·e_d + w, even though the function appears in most places like a function which does not depend on w. As a result, any oracle-based algorithm that doesn't get close to x⋆ in the d-th coordinate and doesn't know w is unlikely to hit the vanishingly small region where the function differs from x ↦ ‖(x_1, …, x_{d−1})‖ + h(x_d), receiving no information about w, and thus cannot determine where the ϵ-stationary points lie. As a result, such an algorithm cannot return near-approximately-stationary points.

4. Smoothing Nonsmooth Nonconvex Functions

In this section, we turn to our second main contribution, examining the possibility of reducing nonsmooth nonconvex optimization to smooth nonconvex optimization, by running a smooth optimization method on a smooth approximation of the objective function. This longstanding approach for nonsmooth nonconvex optimization has been examined by many works, initially driven by practical applications (Blake and Zisserman, 1987; Wu, 1996) and complemented by further theoretical analyses (Mobahi and Fisher III, 2015; Hazan et al., 2016).
In what follows, we focus our discussion on 1-Lipschitz functions: This is without loss of generality, since if our objective is L-Lipschitz, we can simply rescale it by L (and a Lipschitz assumption is always necessary if we wish to obtain a Lipschitz-gradient smooth approximation). Also, we focus on smoothing functions over all of ℝ^d for simplicity, but our results and proofs easily extend to the case where we are only interested in smoothing over some bounded domain on which the function is Lipschitz. For a nonsmooth convex function f(·), a well-known smoothing approach is proximal smoothing (also known as the Moreau envelope or Moreau-Yosida regularization; see Moreau, 1965; Bauschke et al., 2011), defined as P_δ(f) where

P_δ(f)(x) := min_y { f(y) + (1/(2δ))·‖y − x‖² } .  (1)

By picking δ appropriately, P_δ(f) is an arbitrarily good smooth approximation of f; more formally, if f is 1-Lipschitz, then for any ϵ > 0, there exists a choice of δ = Θ(ϵ) such that ‖P_δ(f) − f‖_∞ ≤ ϵ, with the gradients of P_δ(f) being (1/ϵ)-Lipschitz (Beck and Teboulle, 2012, Section 4.2). This is essentially optimal, as no ϵ-approximation can attain a gradient Lipschitz parameter better than Ω(1/ϵ) (see Lemma 30 in Appendix D for a formal proof). Finally, computing gradients of P_δ(f) (which can then be fed into a gradient-based smooth optimization method) is feasible given a solution to Eq. (1), which is a convex optimization problem and hence efficiently solvable in general. Unfortunately, for nonconvex functions, proximal smoothing (or other smoothing methods from convex optimization) generally fails to produce smooth approximations. However, it turns out that similar guarantees can be obtained with a slightly more complicated procedure, known as Lasry-Lions regularization in the functional analysis literature (Lasry and Lions, 1986; Attouch and Aze, 1993), which is essentially a double application of proximal smoothing combined with function flipping.
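Eq. (1) can be evaluated numerically by brute force. The following sketch (our illustration; grid-based, hence only approximate) computes the Moreau envelope of the 1-Lipschitz function f(x) = |x| and checks the uniform-approximation property; for this f the exact gap is ‖P_δ(f) − f‖_∞ = δ/2, consistent with choosing δ = Θ(ϵ):

```python
import numpy as np

def moreau(f, x, delta, grid):
    """Proximal smoothing (Moreau envelope) by brute-force minimization over
    a fine grid: P_delta(f)(x) = min_y { f(y) + ||y - x||^2 / (2*delta) }."""
    return np.min(f(grid) + (grid - x) ** 2 / (2 * delta))

delta = 0.1
grid = np.linspace(-2, 2, 40001)   # candidate minimizers y
xs = np.linspace(-1, 1, 201)
env = np.array([moreau(np.abs, x, delta, grid) for x in xs])

# For f = |x| the envelope is x^2/(2*delta) near 0 and |x| - delta/2 away
# from 0, so the uniform error is exactly delta/2 and its gradient
# (x/delta near 0, sign(x) elsewhere) is (1/delta)-Lipschitz.
assert np.max(np.abs(env - np.abs(xs))) <= delta / 2 + 1e-3
```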
One way to define it is as follows:

P_{δ,ν}(f)(x) := −P_δ(−P_ν(f))(x) = max_y min_z { f(z) + (1/(2ν))·‖z − y‖² − (1/(2δ))·‖y − x‖² } .

Once more, if f is 1-Lipschitz, then choosing δ, ν = Θ(ϵ) appropriately, we get an ϵ-accurate approximation of f, with gradients which are (c/ϵ)-Lipschitz for some absolute constant c. However, unlike the convex case, implementing this smoothing involves solving a non-convex optimization problem, which may not be computationally tractable. Alternatively, a very simple smoothing approach, which works equally well on convex and non-convex problems, is randomized smoothing, or equivalently, convolving the objective function with a smoothness-inducing density function. Formally, given the objective function f and a distribution P, we define f̃(x) := E_{y∼P}[f(x + y)]. In particular, letting P be a uniform distribution on an origin-centered ball of radius ϵ, the resulting function is an ϵ-approximation of f(·), and its gradient Lipschitz parameter is c·√d/ϵ, where c is an absolute constant and d is the input dimension (Duchi et al., 2012, Lemma 8). Moreover, given access to values and gradients of f(·), computing unbiased stochastic estimates of the values or gradients of f̃(·) is computationally very easy: We just sample a single y ∼ P, and return f(x + y) or ∇f(x + y).³ These stochastic estimates can then be plugged into stochastic methods for smooth optimization (see Duchi et al. 2012; Ghadimi and Lan 2013).

3. Recall that by Rademacher's theorem, f is differentiable almost everywhere, hence ∇f(x + y) exists almost surely.

Comparing these two approaches, we see an interesting potential trade-off between the smoothness obtained and computational efficiency, summarized in the following table:

Method | ∇f̃ Lipschitz param. | Computationally efficient?
Randomized smoothing | c·√d/ϵ | Yes
Lasry-Lions regularization | c/ϵ | No

In words, randomized smoothing is computationally efficient (unlike Lasry-Lions regularization), but at the cost of a much larger gradient Lipschitz parameter.
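The one-sample gradient estimator of randomized smoothing is easy to sketch (our illustration): sample y uniformly from the ϵ-ball and return ∇f(x + y), an unbiased estimate of ∇f̃(x). Averaging many such samples for f(x) = ‖x‖₁, whose gradient is sign(x) almost everywhere, recovers the smoothed gradient:

```python
import numpy as np

rng = np.random.default_rng(0)

def ball_sample(d, radius, rng):
    """Uniform sample from the Euclidean ball of the given radius in R^d:
    a uniform direction scaled by radius * U^(1/d)."""
    v = rng.standard_normal(d)
    v /= np.linalg.norm(v)
    return radius * v * rng.random() ** (1.0 / d)

def smoothed_grad_estimate(grad_f, x, eps, rng):
    """One-sample unbiased estimate of the gradient of the randomized
    smoothing f~(x) = E[f(x + y)], y ~ Unif(eps-ball): evaluate grad_f at a
    single perturbed point (f is differentiable almost everywhere)."""
    return grad_f(x + ball_sample(len(x), eps, rng))

# f(x) = ||x||_1 has gradient sign(x) a.e.; average many one-sample estimates.
x = np.array([0.5, -0.5, 0.0])
est = np.mean([smoothed_grad_estimate(np.sign, x, 0.2, rng)
               for _ in range(4000)], axis=0)
# First two coordinates never cross 0 under a 0.2-perturbation, so their
# sampled gradients are always +1 and -1; the third averages out to roughly 0.
```

The estimator costs a single oracle call per sample, which is exactly why this method is computationally cheap despite its worse smoothness parameter.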
Since the iteration complexity of smooth optimization methods depends strongly on this Lipschitz parameter, it follows that in high-dimensional problems, we pay a high price for computational tractability when reducing nonsmooth to smooth problems. As we demonstrate in Appendix B, this is a real phenomenon and not just an artifact of iteration complexity analysis, at least for gradient descent. Roughly speaking, we prove in Proposition 24 that when gradient descent is combined with randomized smoothing, it cannot be guaranteed to reach ϵ-stationary points of the smoothed function in fewer than Ω(√d/ϵ) iterations. This discussion leads to a natural question: Is this trade-off necessary, or perhaps there exist computationally efficient methods which can improve on randomized smoothing, in terms of the gradient Lipschitz parameter? Using an oracle complexity framework, we prove that this trade-off is indeed necessary (under mild assumptions), and that randomized smoothing is essentially an optimal method under the constraint of black-box access to the objective f(·) and a reasonable oracle complexity. We note that Duchi et al. (2012) proved that the Lipschitz constant cannot be improved by simple randomized smoothing schemes, but here we consider a much larger class of possible methods.

4.1 Smoothing Algorithms

Before presenting our main result for this section, we need to carefully formalize what we mean by an efficient smoothing method, since returning a smooth approximating function over ℝ^d is not algorithmically well-defined. Recalling the motivation for our problem, we want a method that, given a nonsmooth objective function f(·), allows us to estimate values and gradients of a smooth approximation f̃(·) at arbitrary points, which can then be fed into standard black-box methods for smooth optimization (hence we need a uniform approximation property).
Moreover, for black-box optimization, it is desirable that this smoothing method operates in an oracle complexity framework, where it only requires local information about f(·) at various points. Finally, we are interested in controlling various parameters of the smoothing procedure, such as the degree of approximation, the smoothness of the resulting function, and the complexity of the procedure. In light of these considerations, a natural way to formalize smoothing methods is the following:

Definition 3 An algorithm S is an (L, ϵ, T, M, r)-smoother if for any 1-Lipschitz function f on ℝ^d, there exists a differentiable function f̃ on ℝ^d with the following properties:

1. ‖f̃ − f‖_∞ ≤ ϵ, and ∇f̃ is L-Lipschitz.
2. Given any x ∈ ℝ^d, the algorithm produces a (possibly randomized) query sequence {x_1, …, x_T} ⊆ {y : ‖y − x‖ ≤ r}, of the form

   x_{i+1} = S^{(i)}(ξ, x, O_f(x_1), …, O_f(x_i)) ,

   where S^{(i)} maps all the previous information gathered by the queries of some local oracle O_f to a new query, possibly based on a draw of some random variable ξ.⁴ Finally, the algorithm produces a vector

   S(f, x) := S^{(out)}(ξ, x, O_f(x_1), …, O_f(x_T)) ,

   where S^{(out)} is some mapping to ℝ^d, such that

   ‖E_ξ[S(f, x)] − ∇f̃(x)‖ ≤ ϵ  and  Pr_ξ[‖S(f, x)‖ ≤ M] = 1 .  (2)

Some comments about this definition are in order. First, the definition only concerns the ability of the algorithm to approximate gradients of f̃(·): It is quite possible that the algorithm also has additional output (such as an approximation of the value of f̃(·)), but this is not required for our results. Second, we do not require the algorithm to return ∇f̃(x): It is enough that the expectation of the output vector is close to it (up to ϵ).
This formulation captures both deterministic optimization-type methods (such as Lasry-Lions regularization, where in general we can only hope to solve the auxiliary optimization problem up to some finite precision) as well as stochastic methods (such as randomized smoothing, which returns ∇f̃(x) in expectation). Third, we assume that the queries produced by the algorithm lie at a bounded distance r from the input point x. In the context of randomized smoothing, this corresponds (for example) to using a uniform distribution over a ball of radius r centered at x. As we discuss later on, some assumption on the magnitude of the queries is necessary for our proof technique. However, requiring almost-sure boundedness is merely for simplicity, and it can be replaced by a high-probability bound with appropriate tail assumptions (e.g., if we are performing randomized smoothing with a Gaussian distribution), at the cost of making the analysis a bit more complicated.

Remark 4 We are mostly interested in (though do not limit our results to) the following parameter regimes:

- T = poly(d, L, ϵ⁻¹), essentially meaning that a single call to S is computable in a reasonable amount of time.
- M = poly(L). As we formally prove in Lemma 31 in Appendix D, if we require f̃ to approximate f and also have L-Lipschitz gradients, we must have ‖∇f̃(x)‖ = O(L). In particular, whenever M is sufficiently larger than L, Eq. (2) is interchangeable with the seemingly more natural condition Pr_ξ[‖S(f, x) − ∇f̃(x)‖ ≤ M] = 1.
- r = O(ϵ). If we are interested in smoothing a 1-Lipschitz, nonconvex function up to an accuracy ϵ around a given point x, we generally expect that only its O(ϵ)-neighborhood will convey useful information for the smoothing process.
We note that this regime is indeed satisfied by randomized smoothing (with a uniform distribution over a radius-ϵ ball, or with high probability if we use a Gaussian distribution), as well as by Lasry-Lions regularization (in the sense that the smooth approximation at x does not change if we alter the function arbitrarily outside an O(ϵ)-neighborhood of x).

4. We assume nothing about ξ, allowing the algorithm to utilize an arbitrary amount of randomness.

We consider all three of the above to be quite permissive. In particular, notice that randomized smoothing over a ball satisfies the much stronger T = 1, M = 1, r = ϵ.⁵ Our result will require the assumption that the smoothing algorithm S is translation invariant with respect to constant functions, in the sense that it treats all constant functions and regions of the input space in the same manner. We formalize our desired translation-invariance property as follows:

Definition 5 A smoothing algorithm S satisfies TICF (translation invariance w.r.t. constant functions) if for any two constant functions f, g, any x ∈ ℝ^d and i ∈ [T], and any realization of ξ,

S^{(i)}(ξ, x, O_f(x_1), …, O_f(x_i)) = S^{(i)}(ξ, 0, O_g(x_1 − x), …, O_g(x_i − x)) + x ,  (3)
S^{(out)}(ξ, x, O_f(x_1), …, O_f(x_T)) = S^{(out)}(ξ, 0, O_g(x_1 − x), …, O_g(x_T − x)) .

In other words, if instead of a constant function f and an input point x, we pick some other constant function g and the origin, the distribution of the algorithm's sequence of queries remains the same (up to a shift by x), and the gradient estimate returned by the algorithm remains the same. We consider this to be a mild and natural assumption, which is clearly satisfied by standard smoothing techniques.

4.2 Main result

With these definitions in hand, we are finally ready to present our main result for this section, which is the following:

Theorem 6 Let S be an (L, ϵ, T, M, r)-smoother which satisfies TICF.
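Definition 5 is easy to verify for randomized smoothing. The toy check below (ours, not the paper's) fixes the shared randomness ξ via a seed and verifies the query-shift property from Eq. (3): the queries made at x are exactly the queries made at the origin, shifted by x.

```python
import numpy as np

def rs_queries(x, eps, seed, T=5):
    """Query points of radius-eps randomized smoothing: x + eps*z_i, where
    the directions z_i come from a fixed random stream (the shared xi)."""
    rng = np.random.default_rng(seed)
    zs = rng.standard_normal((T, len(x)))
    zs /= np.linalg.norm(zs, axis=1, keepdims=True)  # unit directions
    return x + eps * zs

x = np.array([3.0, -1.0])
q_at_x = rs_queries(x, eps=0.1, seed=42)
q_at_0 = rs_queries(np.zeros(2), eps=0.1, seed=42)

# First condition in Eq. (3): for the same realization of xi, the queries
# at x are the queries at the origin shifted by x.
assert np.allclose(q_at_x - x, q_at_0)
# Second condition: on any constant function every sampled gradient is the
# zero vector, so the returned estimate (their average) is the same zero
# vector regardless of x and of the constant.
```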
Then

log((M + 1)(T + 1)) ≥ c₁·d·((c₂ − ϵ)/(r·L))²  (4)

for some absolute constants c₁, c₂ > 0. This theorem holds for general values of the parameters L, ϵ, T, M, r. Concretely, for parameter regimes of interest (see Remark 4) we have the following corollary:

Corollary 7 Suppose that the accuracy parameter satisfies ϵ ≤ c₂/2. Then any smoothing algorithm which makes at most T = poly(d, L, ϵ⁻¹) queries at a distance at most r = O(ϵ) from the input point, and returns vectors of norm at most M = poly(d, L, ϵ⁻¹), must correspond to a smooth approximation f̃(·) with gradient Lipschitz parameter at least L = Ω̃(√d/ϵ).

5. Indeed, S(f, x) := ∇f(x + z) where z ∼ Unif(ϵ·S^{d−1}) requires only a single query (namely T = 1) at x + z, which is at distance at most ϵ from x (hence r = ϵ), and satisfies ‖S(f, x)‖ ≤ 1 due to Lipschitzness (hence M = 1).

Figure 2: Illustration of g(x), where Δ = {0, δ₁, …, δ_K} ⊆ [0, 1].

We note that the lower bound on L in this corollary matches (up to logarithmic factors) the upper bound attained by randomized smoothing. This implies that, at least under our framework and assumptions, randomized smoothing is an essentially optimal efficient smoothing method. Another implication of Thm. 6 is that even if we relax our assumption that r = O(ϵ), then as long as r does not scale with the dimension d, the gradient Lipschitz parameter of any efficient smoothing algorithm must scale with the dimension (even though there exist dimension-free smooth approximations, as evidenced by Lasry-Lions regularization):

Corollary 8 Fix any accuracy parameter ϵ ≤ c₂/2, and any r > 0.
Then as long as the number of queries is T = poly(d) and the output is of size M = poly(d), we must have L = Ω(√d / (r·√(log d))).

A third corollary of our theorem is that (perhaps unsurprisingly) there is no way to implement Lasry-Lions regularization efficiently in an oracle complexity framework:

Corollary 9 If ϵ ≤ c_2/2 and M = poly(d), then any smoothing algorithm for which f̃(·) corresponds to the Lasry-Lions regularization (which satisfies L = O(1)) must use a number of queries T = exp(Ω(d)).

Remark 10 (Dependence on M) Our definition of a smoothing algorithm focuses on the expectation of the algorithm's output. This leads to a logarithmic dependence on M (an upper bound on the algorithm's output) in Thm. 6, since in the proof we need to bound the influence of exponentially-small-probability events on the expectation. It is plausible that the dependence on M can be eliminated altogether, by changing the definition of a smoothing algorithm to focus on the expectation of its output conditioned on all but highly-improbable events. However, that would complicate the definition and our proof.

Before presenting the formal proof of Thm. 6 in the next section, we outline the main proof idea. Consider a one-dimensional, monotonically increasing function g which is locally constant in an Ω(1/√d) neighborhood of each point of a grid Δ = {0, δ_1, . . . , δ_K} ⊆ [0, 1], with K roughly of order √d (see Fig. 2). We define f_w(x) = g(⟨w, x⟩), where w ∈ S^(d−1) is a uniformly random unit vector. We note that f_w is a simple function, easily implemented by (say) a one-hidden-layer ReLU neural network, and that it belongs to all of the restricted classes discussed in Remark 2. We now proceed to analyze what happens when a smoothing algorithm is given points of the form δ_i·w, for δ_i ∈ Δ.
Since w is random, and the algorithm is assumed to be translation invariant, it can be shown (via a concentration of measure argument) that the algorithm is overwhelmingly likely to produce queries in directions which all have O(1/√d) correlation with w, as long as the number of queries is polynomial. Consequently, with high probability, the queries all lie in a region where the function f_w(·) is flat, and the algorithm cannot distinguish between it and a constant function. By the translation-invariance property, this implies that the gradient estimates ∇f̃(δ_i·w) must be of small norm, uniformly over all δ_i·w. Combining the observation that ∇f̃(·) is small along order-of-√d-many points between 0 and the unit vector w, together with the fact that ∇f̃(·) is L-Lipschitz, we can derive an upper bound on how much f̃(·) can increase along the line segment between 0 and w, roughly on the order of L/√d. On the other hand, f̃(·) is an approximation of f_w, which has a constant increase between 0 and w. Overall, this allows us to deduce a lower bound on L scaling as √d, which results in the theorem.

5.1 Proof of Thm. 1

We start by claiming that when optimizing a one-dimensional Lipschitz function that has large derivatives everywhere apart from its global minimum, no algorithm can guarantee to even approach this minimum with some fixed positive probability within any finite number of iterations. The following proposition formalizes this claim.

Proposition 11 For any algorithm A interacting with a local oracle, any T ∈ ℕ and any δ > 0, there exists a function h : ℝ → [0, ∞) and ρ(T, δ) > 0 such that:
• h is 2-Lipschitz, with a single global minimum x* ∈ (0, 1), and h(0) = 1.
• ∀x ≠ x*, ∀g ∈ ∂h(x) : |g| ≥ 1.
• The first T iterates produced by A(h), x_1^{A(h)}, . . . , x_T^{A(h)}, satisfy Pr_A[min_{t∈[T]} |x_t − x*| < ρ] < δ.

Remark 12 The quantity ρ(T, δ) given by Proposition 11 depends only on T, δ. In particular, it is worth noticing that it is uniform over any algorithm A.
The proof of Proposition 11 is inspired by a classic lower bound construction for convex optimization due to Nemirovski (Nemirovski, 1995, Lemma 1.1.1). The basic idea is to construct two functions that are identical outside some segment, in which their distinct minima lie at distance 2ρ from one another. Any algorithm that queries only outside that segment throughout its first T − 1 iterates cannot distinguish between the two functions, and thus has probability at least 1/2 of producing its next iterate at least ρ away from the minimum of the function it is actually optimizing (see Fig. 5 in Appendix C for an illustration). By looking at smaller and smaller segments, this idea can be generalized to any number of queries. Unfortunately, the gradient size of Nemirovski's original construction shrinks exponentially with T, and therefore it cannot be applied to our setting, as it does not satisfy the second bullet of Proposition 11, which is crucial to the rest of our proof. Nonetheless, we were able to provide a nonconvex construction, similar in spirit, which satisfies our desired qualities. Due to the substantial length and technicality of the proof, we defer it to Appendix C. We continue by showing that given Proposition 11, this claim generalizes to any dimension.

Lemma 13 For any algorithm A interacting with a local oracle, any T ∈ ℕ, d ≥ 2 and any δ > 0, there exists a function h : ℝ → [0, ∞) and ρ(T, δ) > 0 such that:
• h is 2-Lipschitz, with a single global minimum x* ∈ (0, 1), and h(0) = 1.
• ∀x ≠ x*, ∀g ∈ ∂h(x) : |g| ≥ 1.
• If we define f_h(x_1, . . . , x_d) := h(x_d) + (1/4)·√(Σ_{i=1}^{d−1} x_i²), then the first T iterates produced by A(f_h), x_1^{A(f_h)}, . . . , x_T^{A(f_h)}, satisfy Pr_A[min_{t∈[T]} ‖x_t^{A(f_h)} − x̄*‖ < ρ] < δ, where x̄* := (0, . . . , 0, x*).
Proof Since our goal is to provide a lower bound ρ for optimizing f_h with the algorithm A, we can assume without loss of generality that A interacts with an even stronger oracle, which provides more than just local information about f_h. Specifically, suppose A has access to an oracle of the form

O_{f_h}(x) = ( {((z_1, . . . , z_{d−1}, x_d), f_h(z_1, . . . , z_{d−1}, x_d)) : (z_1, . . . , z_{d−1}) ∈ ℝ^(d−1)}, O_h(x_d) )

for some local oracle O. Namely, a full global description of the function f_h over the affine subspace {z : z_d = x_d}, coupled with local information with respect to the last coordinate. Note that by the definition of f_h(x), changing x_d only affects h(x_d), thus all the information about x_d is indeed conveyed through h(x_d). Assuming A interacts with the described O, we turn to describe another algorithm A′, which, given local oracle access to a one-dimensional function h, works as follows: A′ simulates A and receives its first iterate x_1. It then produces the first iterate (x_1)_d. Given A's current iterate x_t, A′ queries O_h((x_t)_d). Then, A′ feeds into A

( {((z_1, . . . , z_{d−1}, (x_t)_d), f_h(z_1, . . . , z_{d−1}, (x_t)_d)) : (z_1, . . . , z_{d−1}) ∈ ℝ^(d−1)}, O_h((x_t)_d) ),

and receives from A its next iterate x_{t+1}. A′ then produces its next iterate (x_{t+1})_d. It is clear that A′ is indeed a well-defined algorithm which interacts with a local oracle. Hence, let h(·), ρ(T, δ) be the function and the positive parameter given by Proposition 11 for A′, T, δ. We will show that these h, ρ satisfy all three bullets in the lemma. The first two bullets are immediate, thus it only remains to prove the third. To that end, note that by construction of A′, the iterates of A′ when applied to h are exactly the d-th coordinates of the iterates of A when applied to f_h. That is, (x_t^{A(f_h)})_d = x_t^{A′(h)}. Consequently,

Pr_A[ min_{t∈[T]} ‖x_t^{A(f_h)} − x̄*‖ < ρ ] ≤ Pr_A[ min_{t∈[T]} |(x_t^{A(f_h)})_d − (x̄*)_d| < ρ ] = Pr_{A′}[ min_{t∈[T]} |x_t^{A′(h)} − x*| < ρ ] < δ ,

where the last inequality follows by the definition of h, ρ.
From now on, we fix some algorithm A interacting with a local oracle, T N, d 2 and set δ = T exp( d/36). We denote by h( ), ρ > 0, x Rd their associated function, positive parameter and point given by Lemma 13. Given any nonzero vector w Rd such that wd = 0, we define fw(x1, . . . , xd) := h(xd + x ) + 1 i=1 x2 i w, x + w 1 where w := w/ w . This function looks like the hard function given by Lemma 13 (up to a shift along the d th axis) as long as x is not highly correlated with w, which makes the Re LU term vanish. However, unlike the hard function which has a stationary point, the Re LU term adds at that point a large gradient component in the w direction, preventing it from being even ϵ-stationary. Moreover, the following lemma shows that this function has no ϵ-stationary points for small ϵ. Lemma 14 For any nonzero w Rd such that wd = 0, fw is 15 4 -Lipschitz and has no ϵ-stationary points for any ϵ < 1 4 Proof Throughout the proof we omit the w subscript and refer to fw( ) as f( ). The functions x 7 h(xd + x ), x 7 1 q Pd 1 i=1 x2 i , x 7 w, w + x , x 7 1 2 x + w and x 7 [x]+ are 2-Lipschitz, 1 4-Lipschitz, 1-Lipschitz, 1 2-Lipschitz and 1-Lipschitz respectively, from which it follows that f is 2 + 1 4 -Lipschitz. In order to prove that no point x is ϵ-stationary for any ϵ < 1 4 2, we will use the facts that (g1 + g2) g1 + g2, and that if g1 is univariate, (g1 g2)(x) conv{r1r2 : r1 g1(g2(x)), r2 g2(x)} (see Clarke, 1990, Proposition 2.3.3 and Theorem 2.3.9). We examine six exhaustive cases: xd = 0. In this case we have f(x) t ed + 1 2v |t| 1, ud = 0, s [0, 1], v 1 . For any g f(x) corresponding to some t, u, s, v, using the fact that ud = wd = 0 we get g g, sign(t) ed |t| 1 Oracle Complexity in Nonsmooth Nonconvex Optimization x = 0. In this case we have i=1 u2 i 1, ud = 0 For any vector in f(x) corresponding to some t, u, we use the fact that projecting any vector onto span{e1, . . . 
, ed 1} cannot increase it s norm, in order to get t ed + 1 2w = 1 2w 1 x = w. In this case we have f(x) t ed 1 2u t [ 2, 2], s [0, 1], u 1 . For any g f(x) corresponding to some t, s, u, using the fact that wd = 0 we get 4w, w + sw, w + Ds xd = 0, x / {0, w}, w, x + w < 1 2. Note that w, x + w < 1 2 = w, x + w 1 2 x + w < 0 , and that the set x w, x + w < 1 2 \ {0, w} is an open set in Rd. Thus for every such point, the function x 7 f(x) is locally identical to x 7 h(xd+x )+ 1 q Pd 1 i=1 x2 i , which in particular implies that their gradient sets are identical. Furthermore, combining the assumptions xd = 0, x = 0 reveals that x1, . . . , xd 1 are not all zeroes. Consequently, we get f(x) t ed + 1 4x t [ 2, 2] . For any g f(x) corresponding to some t, we get xd = 0, x / {0, w}, w, x + w > 1 2. Note that w, x + w > 1 2 = w, x + w 1 2 x + w > 0 , Kornowski and Shamir and that the set x w, x + w > 1 2 \ {0, w} is an open set in Rd. Thus for every such point, the function x 7 f(x) is locally identical to x 7 h(xd + x ) + 1 i=1 x2 i w, x + w + 1 which in particular implies that their gradient sets are identical. Furthermore, combining the assumptions xd = 0, x = 0 reveals that x1, . . . , xd 1 are not all zeroes. Consequently, we get f(x) t ed + 1 2 (x + w) t [ 2, 2] . For any g f(x) corresponding to some t, using the fact that wd = 0 we get 4 x, w + w, w + 1 xd = 0, x / {0, w}, w, x + w = 1 2. In this case we have f(x) t ed + 1 2(x + w) t [ 2, 2], s [0, 1] (5) = 1 4 x + s 2 x + w x + s 2 x + w s w t [ 2, 2], s [0, 1] . Denote x = x| + x where x = (I w w T )x is the orthogonal projection of x onto span(w) , and x| span(w). For any g f(x) corresponding to some t, s, using the fact that xd = wd = 0 we get for some scalar α: 1 4 x + s 2 x + w x + s 2 x + w s w 1 4 x + s 2 x + w 1 4 x + s 2 x + w Since (I w w T ) is an orthogonal projection, in particular symmetric, we also have x 2 = x, (I w w T )2x = x, (I w w T )x = x 2 w, x 2 = x 2(1 w, x 2) . 
Plugging into the above, it follows that g is at least 1 1 w, x 2. Assuming that there exists such g f(x) with norm at most ϵ, it follows that 1 w, x 2 ϵ . (6) Oracle Complexity in Nonsmooth Nonconvex Optimization However, we will show that for any ϵ < 1 4 2, we must arrive at a contradiction. To that end, let us consider two cases: If w, x > 0, then by rearranging Eq. (6), we have w, x 1 16ϵ2. Hence, 1 16ϵ2+ w ( x + w ) p 1 16ϵ2 x+w p However, dividing both sides by x + w , we get that w, x + w 1 16ϵ2. If ϵ < 1 4 2, we get that w, x + w > 1 2, contradicting our assumption on x. If w, x 0, then by Eq. (6), we must have w, x 1 16ϵ2. But then, by recalling that wd = 0, we use Eq. (5) and our assumption that w, x + w = 1 2 in order to obtain 4 w, x s 1 1 This implies that 1 1 16ϵ2 g ϵ, which does not hold for any ϵ < 1 4 Finally, given some nonzero w Rd such that wd = 0, we are ready to consider the function Fw(x1, . . . , xd) := max { 1, fw(x x )} 1, h(xd) + 1 i=1 x2 i w, x x + w 1 Lemma 15 The following hold: Fw( ) is 15 4 -Lipschitz, Fw(0) infx Fw(x) 2 and inf { x | Fw(x) = {0}} 13. Any ϵ-stationary point x for ϵ < 1 4 2 satisfies Fw(x) = 1. There exists a choice of w, such that if we run A on Fw( ), then with probability at least 1 2T exp( d/36) the algorithm s iterates x Fw 1 , . . . , x Fw T satisfy mint [T] Fw(x Fw t ) > 0. Proof First, recall that fw( ) is 15 4 -Lipschitz by Lemma 14. Combining this with the fact that x 7 x x , z 7 max{ 1, z} are both 1-Lipschitz yields the desired Lipschitz bound. Moreover, we see that infx Fw(x) 1, and by definition of Fw( ) and Lemma 13: Fw(0) h(0) = 1. Combining the two observations gives Fw(0) inf x Fw(x) 1 + 1 = 2 . For the remaining claim in the first bullet, consider v = 12w + x . By Lemma 13 we have v 12+1 = 13, thus it is enough to show that v is a stationary point of Fw. 
In particular, Kornowski and Shamir it is enough to show that fw(v x ) < 1, since by the continuity of fw this will imply that Fw 1 in a neighborhood of v. Indeed, using the facts that v x = 12w, wd = 0 we get fw(v x ) = h(x ) + 1 4 12 w w, 12w + w 1 < h(0) + 3 1 2 12w + w < 1 + 3 1 As to the second bullet, suppose x is an ϵ-stationary point for some ϵ < 1 4 2. Namely, there exists g Fw(x) such that g ϵ. Assume by contradiction that Fw(x) > 1. Since the set {y| Fw(y) > 1} is an open set, it follows that for all y in some neighborhood of x: Fw(y) > 1. Hence, for all y in some neighborhood of x: Fw(y) = fw(y x ), which in particular implies that Fw(x) = fw(x x ). We conclude that g fw(x x ) and satisfies g < 1 4 2. Thus (x x ) is an ϵ-stationary point of fw( ) for some ϵ < 1 4 2, which is a contradiction to Lemma 14. In order to prove the third bullet, we start by noticing that x x : w, x x + w 1 {x w} = w, x x + w 1 2 x x + w 0 , from which it follows that x x : w, x x + w 1 {x w} : Fw(x) = f(x) := h(xd) + 1 i=1 x2 i . (7) Indeed, for any such x the Re LU term in the definition of Fw( ) vanishes, and the remaining function (which is non-negative) is greater than 1. We continue by showing that Eq. (7) holds over a set of more convenient form. In order to do that, fix some x such that w, x x + w > 1 2 (i.e. the opposite condition). Multiplying by x x + w gives w, x x + w = w, x x + w > 1 2 x x + w 1 2 ( x x w ) . For x = x the inequality above is trivially satisfied. For x = x , dividing by x x and rearranging yields 2 3 w 2 x x . The contrapositive allows us to deduce that any x which does not satisfy the condition above belongs to {x : w, x x + w 1 2} {x w}, so we get by Eq. (7): x = x : w, x x 1 2 3 w 2 x x = Fw(x) = f(x) := h(xd) + 1 i=1 x2 i . (8) With this equation in hand, we turn to describe how w should be set in order to establish the third bullet. Consider a random vector w Rd which is distributed as follows: (w1, . . . 
, wd 1) Unif ρ 99 Sd 2 , Pr[wd = 0] = 1 , (9) Oracle Complexity in Nonsmooth Nonconvex Optimization where ρ 99 Sd 2 := {(y1, . . . , yd 1)| Pd 1 i=1 y2 i = ρ 99} is the (d 2)-dimensional sphere of radius ρ 99. Note that w = ρ 99, which by plugging into Eq. (8) gives x = x : w, x x 1 2 ρ 66 x x = Fw(x) = f(x) := h(xd) + 1 i=1 x2 i . (10) Let x f 1, . . . , x f T be the (possibly random) iterates produced by A when ran on f( ). Note that if min t [T] x f t x ρ > 0 max t [T] w, (x f t x ) < 1 then for all t [T] : w, (x f t x ) < 1 66 x f t x , as well as x f t = x . Thus, by Eq. (10), this means that Eq. (11) implies that Fw(x f t ) = f(x f t ) for all t [T]. Moreover, using the fact that x f t is bounded away from x , it is easily verified that the condition in Eq. (10) also holds for all x in some neighborhood of x f t , so actually Fw( ) is identical to f( ) on these neighborhoods, implying the same local oracle response. Hence, assuming the event in Eq. (11) occurs, if we run the algorithm on Fw( ) rather than f( ), then the produced iterates x Fw 1 , . . . , x Fw T are identical to x f 1, . . . , x f T . That being the case, we would get min t [T] Fw(x Fw t ) = min t [T] Fw(x f t ) = min t [T] f(x f t ) > 0 , where the last inequality utilizes the fact that x f t x > 0. Overall we see that min t [T] x f t x ρ max t [T] w, (x f t x ) < 1 = min t [T] Fw(x Fw t ) > 0 . Thus, in order to finish the proof, it is enough to show that there exists w, such that min t [T] x f t x ρ max t [T] w, (x f t x ) < 1 1 2T exp( d/36) . (12) In order to prove this claim, we observe that: 1. By Lemma 13 we know that Pr A[mint [T] x f t x ρ] 1 δ = 1 T exp( d/36). 2. If we fix some vectors u1, . . . , u T in Rd 1 such that t : ut 1, and pick a unit vector u Rd 1 uniformly at random, then by a union bound and a standard concentration of measure on the sphere argument (e.g., Ball et al., 1997, Lemma 2.2), Pr(maxt u, ut α) T Pr( u, u1 α) T exp( (d 1)α2/2). 
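The concentration-of-measure step in the second observation can be illustrated empirically (a sketch; the dimension, number of query directions, and trial count are arbitrary choices): with α = 1/3, a uniformly random unit vector essentially never has correlation α with any of polynomially many fixed directions.

```python
import numpy as np

rng = np.random.default_rng(0)
d, T, trials = 400, 50, 200

# T fixed unit vectors u_1, ..., u_T, playing the role of the normalized iterates
U = rng.standard_normal((T, d))
U /= np.linalg.norm(U, axis=1, keepdims=True)

count = 0
for _ in range(trials):
    w = rng.standard_normal(d)
    w /= np.linalg.norm(w)                  # w uniform on the unit sphere
    if np.max(np.abs(U @ w)) >= 1 / 3:      # alpha = 1/3, as in the proof
        count += 1

# Ball et al.-style bound: Pr <= T * exp(-(d-1) * alpha^2 / 2), negligible for d = 400
print(count / trials)
```

Typical correlations scale like √(log T / d) ≈ 0.14 here, far below the threshold 1/3, so no trial exceeds it.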
For any realization of A s randomness such that such that for all t [T] : x f t x ρ, by setting α = 1/3, u = (w1, . . . , wd 1), ut = 1 x f t x ((x f t x )1, . . . , (x f t x )d 1) , Kornowski and Shamir while noticing that u, ut = w, (x f t x ) since wd = 0, we get max t [T] w, (x f t x ) 1/3 T exp( d/36) . Combining the two observations in a formal manner results in h Pr w [EA,w|A] 1 T exp( d/36) i 1 T exp( d/36) , (13) EA,w := min t [T] x f t x ρ max t [T] w, (x f t x ) < 1 Finally, using the law of total expectation and Eq. (13) we get Pr A,w[EA,w] = EA[Pr w [EA,w|A]] h EA,w|A : Pr w [EA,w|A] 1 T exp( d/36) i Pr A h Pr w [EA,w|A] 1 T exp( d/36) i EA [(1 T exp( d/36) (1 T exp( d/36)] (1 T exp( d/36))2 1 2T exp( d/36) . Consequently, by the probabilistic method, there exists some fixed choice of w such that Pr A [EA,w] 1 2T exp( d/36) , which is exactly Eq. (12), finishing the proof. The theorem is an immediate corollary of the previous lemma: With the specified high probability, mint Fw(xt) > 0, even though all ϵ-stationary points (for any ϵ < 1 4 value of 1. Since Fw is also 15 4 -Lipschitz, we get that the distance of any xt from an ϵstationary point must be at least 0 ( 1) 15. Simplifying the numerical terms by choosing a large enough constant C and a small enough constant c, and relabeling Fw as f, the theorem follows. 5.2 Proof of Thm. 6 We start the proof by showing that without loss of generality we can impose certain assumptions on the parameters of interest. First, if ϵ 1 then the right hand side of Eq. (4) is negative for any c2 < 1, which makes the theorem trivial. Consequently, we can assume ϵ < 1. Using Lemma 30 in Appendix D, this also implies that L 1 8 since otherwise an L-smooth ϵ-approximation does not exist in the first place in case of 1-Lipschitz function x 7 |x1| (in particular, no such smoother exists). 
Therefore, if √(log((M + 1)(T + 1))) ≥ √d/(32r), then

L · √(log((M + 1)(T + 1))) ≥ (1/8) · √d/(32r) = (1/256) · (√d/r) > (1/256) · (√d/r) · (1 − ϵ) ,

which proves the theorem. Thus we can assume throughout the proof that

√(log((M + 1)(T + 1))) < √d/(32r) , equivalently (1/(16r)) · √(d / log((M + 1)(T + 1))) > 2 .  (14)

Our strategy is to define a distribution over a family of hard 1-Lipschitz functions over ℝ^d, for which we will show that Eq. (4) must hold for some function supported by this distribution. Before we turn to do so, we will show that when a smoother which satisfies TICF acts on a constant function, it returns a gradient estimate of small norm. This will be crucial later on, since we will construct a function which looks locally constant at many points of interest, thus deceiving the smoother.

Lemma 16 If S is an (L, ϵ, T, M, r)-smoother satisfying TICF, then for any constant function f and any x ∈ ℝ^d: ‖E[S(f, x)]‖ ≤ ϵ.

Proof Denote v := E[S(f, x)], and note that by the TICF property v does not depend on x. Let f̃ be the ϵ-approximation of f implicitly computed by S. Then by the definition of a smoothing algorithm, we have for all x ∈ ℝ^d:

‖v‖² − ⟨∇f̃(x), v⟩ = ⟨v − ∇f̃(x), v⟩ ≤ ‖v − ∇f̃(x)‖·‖v‖ ≤ ϵ‖v‖ , hence ⟨∇f̃(x), v⟩ ≥ ‖v‖² − ϵ‖v‖ .

Define the one-dimensional projected function f̃_v(t) := f̃(t·v). Then for all t ≥ 0,

f̃_v(t) − f̃_v(0) = ∫₀ᵗ f̃_v′(z) dz = ∫₀ᵗ ⟨∇f̃(z·v), v⟩ dz ≥ ∫₀ᵗ (‖v‖² − ϵ‖v‖) dz = t(‖v‖² − ϵ‖v‖) = t‖v‖(‖v‖ − ϵ) .  (15)

On the other hand, f̃_v(t), f̃_v(0) are both within ϵ of the same constant, since f is a constant function. Thus, |f̃_v(t) − f̃_v(0)| ≤ 2ϵ. Combining this with Eq. (15) yields, for all t ≥ 0: 2ϵ ≥ t‖v‖(‖v‖ − ϵ). This can hold for all t ≥ 0 only if ‖v‖(‖v‖ − ϵ) ≤ 0, implying the lemma.

We are now ready to define a family of functions, with the rest of the proof devoted to analyzing how a smoother acts on them. Relying on Eq. (14), we can define the grid

Δ := { 16r·√(log((M + 1)(T + 1))/d) · k : k = 0, 1, . . . , ⌊(1/(16r))·√(d / log((M + 1)(T + 1)))⌋ } .

That is, a grid on [0, 1] which consists of points at distance 16r·√(log((M+1)(T+1))/d) from one another. We further define the inflation of Δ by 4r·√(log((M+1)(T+1))/d) around every point:6

Δ′ := { p : ∃x ∈ Δ, |p − x| ≤ 4r·√(log((M + 1)(T + 1))/d) } .

Now we define the function g : ℝ → ℝ as the unique continuous function which satisfies g(0) = 0 and (see Fig. 2 for an illustration)

g′(x) = 1 for x ∉ Δ′ ,  g′(x) = 0 for x ∈ Δ′ .

Finally, we are ready to consider f_w(x) = g(⟨x, w⟩), where w ∈ S^(d−1) is drawn uniformly from the unit sphere. The distribution over w specifies a distribution over the functions f_w. We start by claiming that these functions are indeed in our function class of interest:

Lemma 17 For all w ∈ S^(d−1), f_w(·) is 1-Lipschitz.

Proof It is clear by construction that g is 1-Lipschitz. Thus

|f_w(x) − f_w(y)| = |g(⟨x, w⟩) − g(⟨y, w⟩)| ≤ |⟨x, w⟩ − ⟨y, w⟩| = |⟨x − y, w⟩| ≤ ‖x − y‖ .

The following lemma is the key lemma of the proof. It shows that there exists a function supported by the distribution we defined, such that many points mislead the smoother by appearing as if the function were constant; hence, the smoother returns gradient estimates of small norm. Formally:

Lemma 18 There exists w ∈ S^(d−1) such that for all δ ∈ Δ: ‖E_ξ[S(f_w, δw)]‖ ≤ ϵ + 1/32.

Proof Let x_1^(w), . . . , x_T^(w) be the (possibly randomized) queries produced by S(f_w, 0). Consider the event E_w, in which for all i ∈ [T]: |⟨x_i^(w), w⟩| < 4r·√(log((M+1)(T+1))/d). Note that if E_w occurs, then for all δ ∈ Δ, i ∈ [T], v ∈ ℝ^d:

f_w(x_i^(w) + δw + v) = g(⟨x_i^(w) + δw + v, w⟩) = g(δ + ⟨x_i^(w), w⟩ + ⟨v, w⟩) .  (16)

6. Note that we use the quantities T + 1, M + 1 instead of the seemingly more natural T, M, since otherwise the logarithmic term in Eq. (4) could vanish, resulting in an invalid theorem. This would have occurred for randomized smoothing, where T = M = 1.
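For intuition, the staircase function g(·) defined above can be sketched in code (hypothetical names: h stands in for the grid spacing 16r·√(log((M+1)(T+1))/d), and flat = h/4 for the inflation radius, so plateaus occupy half of every period):

```python
import numpy as np

def g(t, h, flat):
    """Continuous, monotone, 1-Lipschitz staircase (cf. Fig. 2): locally constant
    within distance `flat` of every grid point k*h, slope 1 in between."""
    t = np.asarray(t, dtype=float)
    k = np.round(t / h)                          # index of the nearest grid point
    # plateau value k*(h - 2*flat), plus the slope-1 part beyond the flat region
    return k * (h - 2 * flat) + np.sign(t - k * h) * np.maximum(0.0, np.abs(t - k * h) - flat)

def f_w(x, w, h, flat):
    """The hard instance f_w(x) = g(<x, w>) for a unit vector w."""
    return g(np.dot(x, w), h, flat)

h, flat = 0.1, 0.025                             # illustrative spacing, flat = h/4
ts = np.linspace(0.0, 1.0, 2001)
vals = g(ts, h, flat)
assert np.all(np.abs(np.diff(vals)) <= np.diff(ts) + 1e-12)          # 1-Lipschitz
assert abs(g(0.1 + 0.02, h, flat) - g(0.1 - 0.02, h, flat)) < 1e-12  # flat near a grid point
assert abs(vals[-1] - 0.5) < 1e-12               # total increase 1/2 over [0, 1]
```

With flat = h/4 the function gains h/2 per period, consistent with the direct calculation f(w) = g(1) ≥ 1/2 used at the end of the proof.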
Oracle Complexity in Nonsmooth Nonconvex Optimization In particular, as long as v < 4r q log((M+1)(T+1)) d D x(w) i , w E , which by Cauchy-Schwarz implies D x(w) i , w E + v, w < 4r log ((M + 1) (T + 1)) we get by construction of g and Eq. (16) that fw x(w) i + δw + v = g (δ) . In other words, if Ew occurs then inside some neighborhood of x(w) i +δw, the function fw is identical to the constant function g (δ). Therefore, if Ew occurs the local oracle O satisfies for any δ , i [T]: Ofw x(w) i + δw = Ox7 g(δ) x(w) i + δw . (17) Fix some δ0 , and let x(w) 1 , . . . , x(w) T be the (possibly randomized) queries produced by S (fw, δ0w). We will now show that conditioned on Ew, for all i [T]: x(w) i = x(w) i + δ0w , (18) in the sense that for every realization of S s randomness ξ they are equal. We show this by induction on i. For i = 1, using TICF: x(w) 1 = S(1) (ξ, δ0w) = S(1) (ξ, 0) + δ0w = x(w) 1 + δ0w . Assuming this is true up until i, then by the induction hypothesis, Eq. (17) and TICF: x(w) i+1 = S(i) ξ, δ0w, Ofw x(w) 1 , . . . , Ofw x(w) i = S(i) ξ, δ0w, Ofw x(w) 1 + δ0w , . . . , Ofw x(w) i + δ0w = S(i) ξ, δ0w, Ox7 g(δ0) x(w) 1 + δ0w , . . . , Ox7 g(δ0) x(w) i + δ0w = S(i) ξ, 0, Ox7 g(0) x(w) 1 , . . . , Ox7 g(0) x(w) 1 + δ0w = S(i) ξ, 0, Ofw x(w) 1 , . . . , Ofw x(w) i + δ0w = x(w) i+1 + δ0w . Having established Eq. (18) for any δ , we turn to show that for all δ : Eξ [S (fw, δw)|Ew] = Eξ [S (x 7 0, 0)|Ew] . (19) Indeed, by Eq. (18), Eq. (17) and TICF: Eξ [S (fw, δw)|Ew] = Eξ h S(out) ξ, δw, Ofw x(w) 1 , . . . , Ofw x(w) T Ew i = Eξ h S(out) ξ, δw, Ofw x(w) 1 + δw , . . . , Ofw x(w) T + δw Ew i = Eξ h S(out) ξ, δw, Ox7 g(δ) x(w) 1 + δw , . . . , Ox7 g(δ) x(w) T + δw Ew i = Eξ h S(out) ξ, 0, Ox7 0 x(w) 1 , . . . , Ox7 0 x(w) T Ew i = Eξ [S (x 7 0, 0)|Ew] . Kornowski and Shamir We now turn to show that Ew is likely to occur. Fix some realization of S s randomness ξ, and let qξ 1, . . . , qξ T be the (deterministic) queries produced by S (y 7 0, 0). 
We claim that if for all i [T] : qξ i , w < 4r q log((M+1)(T+1)) d then qξ 1, . . . , qξ T = x(w) 1 , . . . , x(w) T independently of w. We show this by induction on i. For i = 1: qξ 1 = S(1) (ξ, 0) = x(w) 1 . Assuming true up until i, then qξ i+1 = S(i) ξ, 0, Ox7 0 qξ 1 , . . . , Ox7 0 qξ i = S(i) ξ, 0, Ofw qξ 1 , . . . , Ofw qξ i = S(i) ξ, 0, Ofw x(w) 1 , . . . , Ofw x(w) i = x(w) i+1 , where we used the assumption on qξ i and the induction hypothesis. Recall that by assump- tion on the algorithm qξ i r for all i [T]. Using the union bound and concentration of measure on the sphere (e.g., Ball et al., 1997, Lemma 2.2) we can bound the probability of the complementary event Pr w [Ec w | ξ] = Pr w i [T] : qξ i , w 4r log ((M + 1) (T + 1)) rqξ i , w 4 log ((M + 1) (T + 1)) log((M+1)(T+1)) = 2T (M + 1)8 (T + 1)8 2 (M + 1)8 (T + 1)7 . This inequality holds for any realization of S s randomness ξ, hence by the law of total probability Pr ξ,w [Ec w] 2 (M + 1)8 (T + 1)7 . In particular, since Prξ,w [Ec w] = Ew [Prξ [Ec w|w]], there exists w Sd 1 such that Pr ξ [Ec w] 2 (M + 1)8 (T + 1)7 . (20) Oracle Complexity in Nonsmooth Nonconvex Optimization For this fixed w, we have for all δ by the law of total expectation and the triangle inequality: Eξ [S (fw, δw)] Eξ [S (fw, δw)|Ew] Pr ξ [Ew] | {z } ( ) Eξ [S (fw, δw)|Ec w] Pr ξ [Ec w] | {z } ( ) On one hand, by Eq. (19): ( ) = Eξ [S (x 7 0, 0)|Ew] Pr ξ [Ew] = Eξ [S (x 7 0, 0)] Eξ [S (x 7 0, 0)|Ec w] Pr ξ [Ec w] . Using Lemma 16, and by incorporating the definition of M in Eq. (2) and Eq. (20) we get ( ) ϵ + M 2 (M + 1)8 (T + 1)7 ϵ + 2 (M + 1)7 (T + 1)7 . (22) On the other hand, by Eq. (2) and Eq. (20) again we have ( ) Eξ [S (fw, δw)|Ec w] Pr ξ [Ec w] M 2 (M + 1)8 (T + 1)7 2 (M + 1)7 (T + 1)7 . (23) Overall, plugging Eq. (22) and Eq. (23) into Eq. (21), gives Eξ [S (fw, δw)] ϵ + 4 (M + 1)7 (T + 1)7 ϵ + 1 where the last inequality simply follows from the fact that M > 0, T 1. 
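For concreteness, the TICF property (Definition 5) that drives the chains of equalities above is satisfied by the one-query randomized-smoothing estimator; a minimal sketch (hypothetical function names), checking both lines of Eq. (3) on constant functions:

```python
import numpy as np

def smoother_query(xi, x, eps):
    """S^(1): the single query point, a deterministic function of xi and x."""
    z = xi / np.linalg.norm(xi) * eps   # xi mapped to the radius-eps sphere
    return x + z

def smoother_out(xi, x, eps, grad_oracle):
    """S^(out): return the oracle's gradient at the query point."""
    return grad_oracle(smoother_query(xi, x, eps))

rng = np.random.default_rng(1)
xi = rng.standard_normal(4)             # shared randomness
x = np.array([2.0, -1.0, 0.5, 3.0])
zero_grad = lambda y: np.zeros_like(y)  # local oracle of a constant function

# Eq. (3), first line: queries at input x are queries at 0, shifted by x
q_x = smoother_query(xi, x, 0.1)
q_0 = smoother_query(xi, np.zeros(4), 0.1)
assert np.allclose(q_x, q_0 + x)
# Eq. (3), second line: the returned estimate is unchanged
assert np.allclose(smoother_out(xi, x, 0.1, zero_grad),
                   smoother_out(xi, np.zeros(4), 0.1, zero_grad))
```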
From now on, we fix the w ∈ S^(d−1) given by the previous lemma, and denote f = f_w. Denote by f̃ the ϵ-approximation of f with L-Lipschitz gradients implicitly computed by S. We turn our focus to the directional projection

ϕ : [0, 1] → ℝ ,  ϕ(t) = f̃(t·w) .

Note that by the assumption on f̃, ϕ is differentiable, and ϕ′ is L-Lipschitz. Lemma 18 ensures that ϕ′ is relatively close to zero on the grid Δ, as shown in the following lemma.

Lemma 19 ∀δ ∈ Δ : |ϕ′(δ)| ≤ 2ϵ + 1/32.

Proof By Cauchy-Schwarz, Lemma 18 and the definition of a smoother, we get that for all δ ∈ Δ:

|ϕ′(δ)| = |⟨∇f̃(δw), w⟩| ≤ ‖∇f̃(δw)‖ ≤ ‖∇f̃(δw) − E[S(f, δw)]‖ + ‖E[S(f, δw)]‖ ≤ ϵ + ϵ + 1/32 .

By combining the fact that ϕ′ has small values along the grid Δ with the fact that ϕ′ is L-Lipschitz, we can bound the oscillation of ϕ along the unit interval.

Figure 3: Illustration of l(t)

Lemma 20 |ϕ(1) − ϕ(0)| ≤ 2ϵ + 1/32 + 4rL·√(log((M + 1)(T + 1))/d).

Proof Denote δ_i = 16r·√(log((M+1)(T+1))/d) · i, and note that δ_i ∈ Δ for all i ∈ {0, 1, . . . , ⌊(1/(16r))·√(d/log((M+1)(T+1)))⌋}. Then

|ϕ(1) − ϕ(0)| ≤ ∫₀¹ |ϕ′(t)| dt ≤ Σ_i ∫_{δ_i}^{δ_{i+1}} |ϕ′(t)| dt .  (24)

By Lemma 19 we have |ϕ′(δ_i)|, |ϕ′(δ_{i+1})| ≤ 2ϵ + 1/32. Recall that ϕ′ is L-Lipschitz, so |ϕ′(t)| is majorized on the interval [δ_i, δ_{i+1}] by the piecewise linear function (see Fig. 3)

l(t) = 2ϵ + 1/32 + L(t − δ_i) for δ_i ≤ t ≤ (δ_i + δ_{i+1})/2 ,  l(t) = 2ϵ + 1/32 + L(δ_{i+1} − t) for (δ_i + δ_{i+1})/2 < t ≤ δ_{i+1} .

Consequently,

∫_{δ_i}^{δ_{i+1}} |ϕ′(t)| dt ≤ ∫_{δ_i}^{δ_{i+1}} l(t) dt = (2ϵ + 1/32)(δ_{i+1} − δ_i) + (L/4)(δ_{i+1} − δ_i)² ,  (25)

where the last equality is a direct calculation. Plugging Eq. (25) into Eq. (24), and noting that the intervals [δ_i, δ_{i+1}] have common length 16r·√(log((M+1)(T+1))/d) and total length at most 1, we get

|ϕ(1) − ϕ(0)| ≤ 2ϵ + 1/32 + 4rL·√(log((M + 1)(T + 1))/d) .

We are now ready to finish the proof. Notice that ϕ(0) = f̃(0) and ϕ(1) = f̃(w). Additionally, a direct calculation shows that f(0) = 0, f(w) ≥ 1/2. Using the fact that |f̃(x) − f(x)| ≤ ϵ for all x, Lemma 20 reveals

1/2 ≤ |f(w) − f(0)| ≤ |f̃(w) − f̃(0)| + 2ϵ = |ϕ(1) − ϕ(0)| + 2ϵ ≤ 4ϵ + 1/32 + 4rL·√(log((M + 1)(T + 1))/d) .

Rearranging yields L·√(log((M + 1)(T + 1))) ≥ (√d/(4r))·(15/32 − 4ϵ) = (√d/r)·(15/128 − ϵ), which establishes Eq. (4) with suitable absolute constants c_1, c_2, finishing the proof.
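The per-interval estimate behind Lemma 20 (if ϕ′ is L-Lipschitz and bounded by c at the endpoints of an interval of length h, then the integral of |ϕ′| over the interval is at most c·h + L·h²/4) can be checked numerically against the worst-case tent-shaped majorant l(t); a sketch with illustrative constants:

```python
import numpy as np

L_, c, h = 3.0, 0.1, 0.5          # gradient-Lipschitz constant, endpoint bound, grid gap
t = np.linspace(0.0, h, 100001)

# worst case for |phi'|: value c at both endpoints, rising at slope L_ toward the middle
tent = c + L_ * np.minimum(t, h - t)

# trapezoid rule (exact up to float error here, since the integrand is piecewise linear)
integral = np.sum((tent[1:] + tent[:-1]) * 0.5 * np.diff(t))

closed_form = c * h + L_ * h ** 2 / 4   # the per-interval bound used in the proof
assert abs(integral - closed_form) < 1e-9
```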
6. Discussion

In this paper, we studied the problem of nonconvex, nonsmooth optimization from an oracle complexity perspective, and provided two main results: The first (in Sec. 3) is an impossibility result for efficiently getting near approximately-stationary points, and the second (in Sec. 4) proves an inherent trade-off between oracle complexity and the smoothness parameter when smoothing nonsmooth functions. The second result also establishes the optimality of randomized smoothing as an efficient smoothing method, under mild assumptions. Our work leaves open several questions. First, at a more technical level, there is the question of whether some or all of our assumptions in Sec. 4 can be relaxed. The result currently requires the algorithm to be translation invariant w.r.t. constant functions, as well as to query at some bounded distance from the input point x. We conjecture that the translation invariance assumption can be relaxed, possibly by a suitable reduction showing that any smoothing algorithm can be converted to a translation invariant one. However, how to formally perform this remains unclear at the moment. As to the bounded distance of the queries, it is currently an essential assumption for our proof technique, which relies on a function which looks locally constant at many different points but is globally nonconstant, and this can generally be detected by querying far enough away from the input point (even along some random direction). Thus, relaxing this assumption may necessitate a different proof technique. Another open question is whether randomized smoothing can be "derandomized": Our results indicate that the gradient Lipschitz parameter of the smooth approximation cannot be improved, but they leave open the possibility of an efficient method returning the actual gradients of some smooth approximation (up to machine precision), in contrast to randomized smoothing, which only provides noisy stochastic estimates of the gradient.
These could then be plugged into smooth optimization methods which assume access to exact gradients (rather than noisy stochastic estimates), generally improving the resulting iteration complexity. We note that naively computing the exact gradient of the f̃(·) arising from randomized smoothing is infeasible in general, as it involves a high-dimensional integral. At a more general level, our work leaves open the question of what is the right metric to study for nonsmooth nonconvex optimization, where neither minimizing the optimization error nor finding approximately-stationary points is feasible. In this paper, we showed that the goal of getting near approximately-stationary points is not feasible, at least in the worst case, whereas smoothing can be done efficiently, but not in a dimension-free manner. Can we find other compelling goals to consider? One very appealing notion is the (δ, ϵ)-stationarity of Zhang et al. (2020) that we mentioned in the introduction, which comes with clean, finite-time and dimension-free guarantees. Our negative result in Thm. 1 provides further motivation to consider it, by showing that a natural variation of this notion will not work. However, as we discuss in Appendix A, we need to accept that this stationarity notion can have unexpected behavior, and that there exist cases where it will not resemble a stationary point in any intuitive sense. In any case, using an oracle complexity framework to study this and other potential metrics for nonsmooth nonconvex optimization, which combine computational efficiency and finite-time guarantees, remains an interesting direction for future research.

Acknowledgements

This research is supported by the European Research Council (ERC) grant 754705.

Appendix A.
(δ, ϵ)-Stationarity (Zhang et al., 2020)

In the recent work by Zhang, Lin, Jegelka, Sra and Jadbabaie (Zhang et al., 2020), the authors prove that for nonconvex nonsmooth functions, finding ϵ-approximately stationary points is infeasible in general. Instead, they study the following relaxation (based on the notion of the δ-subdifferential introduced by Goldstein 1977): Letting $\partial f(x)$ denote the generalized gradient set (as defined in Sec. 2) of $f(\cdot)$ at $x$, we say that a point $x$ is a (δ, ϵ)-stationary point if
$$\min\Big\{\|u\| \,:\, u \in \mathrm{conv}\Big\{\textstyle\bigcup_{y:\|y-x\|\le\delta} \partial f(y)\Big\}\Big\} \le \epsilon\,, \qquad (26)$$
where $\mathrm{conv}\{\cdot\}$ is the convex hull. In words: there exists a convex combination of gradients in a δ-neighborhood of $x$ whose norm is at most ϵ. Remarkably, the authors then proceed to provide a dimension-free, gradient-based algorithm for finding (δ, ϵ)-stationary points using $O(1/(\delta\epsilon^3))$ gradient and value evaluations, and study related settings as well. Subsequently, several works have continued the study of optimization in terms of (δ, ϵ)-stationary points (Davis et al., 2022; Tian et al., 2022).

Although this constitutes a very useful algorithmic contribution to nonsmooth optimization, it is important to note that a (δ, ϵ)-stationary point $x$ (as defined above) is not necessarily δ-close to an ϵ-stationary point of $f(\cdot)$, nor does it necessarily resemble a stationary point. Intuitively, this is because the convex hull of the gradients might contain a small vector without any of the gradients being particularly small. This is formally demonstrated in the following proposition:

Proposition 21 For any δ > 0, there exists a differentiable function $f(\cdot)$ on $\mathbb{R}^2$ which is 2π-Lipschitz on a ball of radius 2δ around the origin, and the origin is a (δ, 0)-stationary point, yet $\min_{x:\|x\|\le\delta} \|\nabla f(x)\| \ge 1$.

Proof Fixing some δ > 0, consider the function $f(u, v) := (2\delta + u)\sin\!\big(\frac{\pi v}{2\delta}\big)$ (see Figure 4 for an illustration).
This function is differentiable, and its gradient satisfies
$$\nabla f(u, v) = \left(\sin\Big(\frac{\pi v}{2\delta}\Big),\; \frac{\pi}{2\delta}(2\delta + u)\cos\Big(\frac{\pi v}{2\delta}\Big)\right).$$

Figure 4: The function used in the proof of Proposition 21, for δ = 1. The origin (which fulfills the definition of a (1, 0)-stationary point) is marked with a red dot. Best viewed in color.

First, we note that
$$\frac{1}{2}\nabla f(0, \delta) + \frac{1}{2}\nabla f(0, -\delta) = \frac{1}{2}\big((1, 0) + (-1, 0)\big) = (0, 0)\,,$$
which implies that $(0, 0)$ is in the convex hull of the gradients at distance at most δ from the origin; hence the origin is a (δ, 0)-stationary point. Second, we have that
$$\|\nabla f(u, v)\|^2 = \sin^2\Big(\frac{\pi v}{2\delta}\Big) + \frac{\pi^2}{4\delta^2}(2\delta + u)^2\cos^2\Big(\frac{\pi v}{2\delta}\Big)\,. \qquad (27)$$
For any $(u, v)$ of norm at most 2δ, we must have $|u| \le 2\delta$, and therefore the above is at most
$$\sin^2\Big(\frac{\pi v}{2\delta}\Big) + \frac{\pi^2}{4\delta^2}(2\delta + 2\delta)^2\cos^2\Big(\frac{\pi v}{2\delta}\Big) \le 4\pi^2\Big(\sin^2\Big(\frac{\pi v}{2\delta}\Big) + \cos^2\Big(\frac{\pi v}{2\delta}\Big)\Big) = 4\pi^2\,,$$
which implies that the function is 2π-Lipschitz on a ball of radius 2δ around the origin. Finally, for any $(u, v)$ of norm at most δ, we have $|u| \le \delta$, so Eq. (27) is at least
$$\sin^2\Big(\frac{\pi v}{2\delta}\Big) + \frac{\pi^2}{4\delta^2}(2\delta - \delta)^2\cos^2\Big(\frac{\pi v}{2\delta}\Big) \ge \sin^2\Big(\frac{\pi v}{2\delta}\Big) + \cos^2\Big(\frac{\pi v}{2\delta}\Big) = 1\,.$$

Remark 22 (Extension to globally Lipschitz functions) Although the function $f(\cdot)$ in the proof has a constant Lipschitz parameter only close to the origin, it can easily be modified to be globally Lipschitz and bounded, for example by considering the function
$$\tilde{f}(x) = \begin{cases} f(x) & \|x\| \le 2\delta \\ \max\big\{0,\, 2 - \frac{\|x\|}{2\delta}\big\} \cdot f\big(\frac{2\delta x}{\|x\|}\big) & \|x\| > 2\delta \end{cases}\,,$$
which is identical to $f(\cdot)$ in a ball of radius 2δ around the origin, but decays to 0 for larger $\|x\|$, and can be verified to be globally bounded and Lipschitz independent of δ.

Remark 23 (Extension to constant distances) The proof of Thm. 1 uses a (more complicated) construction that actually strengthens Proposition 21: It implies that for any δ, ϵ smaller than some constants, there is a Lipschitz, bounded-from-below function on $\mathbb{R}^d$, such that the origin is (δ, 0)-stationary, yet there are no ϵ-stationary points even at a constant distance from the origin.
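The properties established in the proof of Proposition 21 are easy to verify numerically. The following sketch uses the reconstruction $f(u,v) = (2\delta + u)\sin(\pi v/2\delta)$ with δ = 1 (the grid resolution and names are our own choices), and checks the convex-hull property and the two gradient-norm bounds:

```python
import math

delta = 1.0  # the construction's parameter; delta = 1 as in Figure 4

def grad_f(u, v):
    # Gradient of f(u, v) = (2*delta + u) * sin(pi * v / (2*delta)).
    s = math.pi * v / (2 * delta)
    return (math.sin(s), (math.pi / (2 * delta)) * (2 * delta + u) * math.cos(s))

def norm(g):
    return math.hypot(g[0], g[1])

# (0, 0) is the average of the gradients at (0, delta) and (0, -delta),
# so it lies in the convex hull of gradients within distance delta of the origin.
gp, gm = grad_f(0.0, delta), grad_f(0.0, -delta)
avg = ((gp[0] + gm[0]) / 2, (gp[1] + gm[1]) / 2)
assert norm(avg) < 1e-12

# On a grid: gradient norm >= 1 on the ball of radius delta,
# and <= 2*pi on the ball of radius 2*delta.
n = 200
pts = [(i * 4 * delta / n - 2 * delta, j * 4 * delta / n - 2 * delta)
       for i in range(n + 1) for j in range(n + 1)]
inner = [norm(grad_f(u, v)) for (u, v) in pts if math.hypot(u, v) <= delta]
outer = [norm(grad_f(u, v)) for (u, v) in pts if math.hypot(u, v) <= 2 * delta]
assert min(inner) >= 1 - 1e-9
assert max(outer) <= 2 * math.pi + 1e-9
```

The extreme values are attained exactly where the proof predicts: norm 1 at $(0, \pm\delta)$ and norm 2π at $(2\delta, 0)$.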
In more detail, consider the function
$$\hat{g}_w(x) := \max\{g_w(0) - 1,\; g_w(x)\}\,, \qquad g_w(x) = |x_d| + \sqrt{\sum_{i=1}^{d-1} x_i^2} - \langle w, x\rangle + \|w\|\,.$$
Using exactly the same proof as for Lemma 14, one can show that $g_w(\cdot)$ is $\frac{15}{4}$-Lipschitz and has no ϵ-stationary points for $\epsilon < \frac{1}{4\sqrt{2}}$. Therefore, it is easily verified that for any $w$, $\hat{g}_w(\cdot)$ is $\frac{15}{4}$-Lipschitz, bounded from below, and any ϵ-stationary point is at a distance of at least 4/15 from the origin.7 However, we also claim that the origin is a (δ, 0)-stationary point for any $\delta \in (0, 4/15)$. To see this, note first that for such δ, by the Lipschitz property of $g_w(\cdot)$, we have $\hat{g}_w(x) = g_w(x)$ in a δ-neighborhood of the origin. Fix any $w$ such that $\|w\| = \frac{\delta}{2}$, $w_d = 0$, and let $v \in \{-\delta e_d, \delta e_d\}$. It is easily verified that $\langle w, v + w\rangle - \frac{1}{2}\|v + w\| < 0$, in which case $\mathrm{sign}(v_d)\, e_d \in \partial g_w(v) = \partial\hat{g}_w(v)$, and therefore $\frac{1}{2}\big(\tilde{\nabla}\hat{g}_w(v) + \tilde{\nabla}\hat{g}_w(-v)\big) = 0$, where $\tilde{\nabla}\hat{g}_w(v)$ denotes the subgradient defined above.

We end by noting that if we drop the $\mathrm{conv}\{\cdot\}$ operator from the definition of (δ, ϵ)-stationarity in Eq. (26), the goal becomes equivalent to finding points which are δ-close to ϵ-approximately stationary points, which is exactly the goal we study in Sec. 3, and for which we show a strong impossibility result. This impossibility result implies that a natural strengthening of the notion of (δ, ϵ)-stationarity is already too strong to be feasible in general.

Appendix B. Runtime of smoothed GD suffers from a dimension dependency

In this appendix, we formally prove that randomized smoothing can indeed lead to strong dimension dependencies in the iteration complexity of simple gradient methods, in particular vanilla gradient descent with constant step size, even for simple convex functions.

7. The last point follows from the fact that if $y$ is an ϵ-stationary point of $\hat{g}_w(\cdot)$, then we can find a point $x$ arbitrarily close to $y$ such that $\hat{g}_w(x) = g_w(x)$, hence $g_w(x) < g_w(0) - 1$, and as a result $g_w(0) - g_w(x) > 1$. But $g_w(\cdot)$ is $\frac{15}{4}$-Lipschitz, hence $\|x\| > 4/15$, and therefore $\|y\| \ge 4/15$.
Thus, the dimension dependency arising from applying gradient descent on a randomly-smoothed function is real, and not merely an artifact of the analysis (where the standard upper bound on the number of iterations scales with the gradient Lipschitz parameter). We note that we focus on constant step-size gradient descent for simplicity; a similar analysis can be performed for other gradient-based methods, such as variable step-size gradient descent or stochastic gradient descent.

Given a 1-Lipschitz function $f : \mathbb{R}^d \to \mathbb{R}$, denote the smooth approximation $\tilde{f}(x) = \mathbb{E}_{\|v\|\le 1}[f(x + \epsilon v)]$, where $v$ is distributed uniformly over the unit ball. Let $x_0$ be a point which is of distance at most 1 to an ϵ-stationary point of $\tilde{f}$, and consider vanilla gradient descent with a constant step size η > 0:
$$x_{t+1} = x_t - \eta\nabla\tilde{f}(x_t)\,.$$
The following proposition shows that for any step size, applying gradient descent to find an approximately-stationary point of $\tilde{f}$ will necessitate a number of iterations scaling strongly with the dimension:

Proposition 24 There exists a 1-Lipschitz function $f : \mathbb{R}^d \to \mathbb{R}$ such that the following holds: For any $\epsilon < \frac{1}{2}$, η > 0, there exists $x_0$ as above such that $\min\{t : \|\nabla\tilde{f}(x_t)\| \le \epsilon\} = \Omega\big(\sqrt{d}/\epsilon\big)$.

Proof We will show the claim holds for $f(x) := |x_1|$. In a nutshell, the proof is based on the observation that $\nabla\tilde{f}(x)$ is close to zero only when $|x_1| = O(1/\sqrt{d})$. Thus, gradient descent must hit an interval of size $O(1/\sqrt{d})$. But in order to guarantee this, with an arbitrary bounded starting point, the step size must be small, and hence the number of iterations required will be large.

Proceeding with the formal proof, note that $\tilde{f}(x) = \mathbb{E}_{\|v\|\le 1}[|x_1 + \epsilon v_1|]$, hence
$$\nabla\tilde{f}(x) = \mathbb{E}_{\|v\|\le 1}[\mathrm{sign}(x_1 + \epsilon v_1)]\, e_1 = \Big(\Pr_{\|v\|\le 1}[x_1 + \epsilon v_1 > 0] - \Pr_{\|v\|\le 1}[x_1 + \epsilon v_1 < 0]\Big)\, e_1 = \Big(1 - 2\Pr_{\|v\|\le 1}\Big[v_1 < -\frac{x_1}{\epsilon}\Big]\Big)\, e_1\,. \qquad (28)$$
We draw several consequences from Eq. (28). First, if $x_1 = 0$ then $\Pr_{\|v\|\le 1}[v_1 < 0] = \frac{1}{2}$ due to symmetry around the origin, so in particular
$$\nabla\tilde{f}(0) = 0\,. \qquad (29)$$
Second, if $x_1 \ge \epsilon$ then $\Pr_{\|v\|\le 1}[v_1 < -x_1/\epsilon] = 0$, and if $x_1 \le -\epsilon$ then $\Pr_{\|v\|\le 1}[v_1 < -x_1/\epsilon] = 1$. Overall,
$$|x_1| \ge \epsilon \implies \nabla\tilde{f}(x) = \mathrm{sign}(x_1)\, e_1\,. \qquad (30)$$
Third, since probabilities are bounded between zero and one, we obtain the global upper estimate
$$\|\nabla\tilde{f}(x)\| \le 1\,. \qquad (31)$$
Lastly, $\Pr_{\|v\|\le 1}[v_1 < -x_1/\epsilon]$ equals the volume of the intersection of the halfspace $\{v \in \mathbb{R}^d \mid v_1 < -x_1/\epsilon\}$ with the unit ball, normalized by the unit ball volume. In particular, since this intersection is a subset of the spherical sector associated with the spherical cap $\{v \in \mathbb{S}^{d-1} \mid v_1 < -x_1/\epsilon\}$, its normalized volume is less than the normalized surface area of the cap. By well-known estimates of spherical caps (for example Ball et al., 1997, Lemma 2.2):
$$\Pr_{\|v\|\le 1}\Big[v_1 < -\frac{x_1}{\epsilon}\Big] \le \exp\Big(-\frac{d x_1^2}{2\epsilon^2}\Big)\,. \qquad (32)$$
By combining Eq. (28) and Eq. (32) we get
$$\|\nabla\tilde{f}(x)\| \ge 1 - 2\exp\Big(-\frac{d x_1^2}{2\epsilon^2}\Big)\,. \qquad (33)$$
In particular,
$$|x_1| \ge \epsilon\sqrt{\frac{2\log(10)}{d}} \implies \|\nabla\tilde{f}(x)\| \ge \frac{4}{5}\,.$$
We are now ready to describe the choice of $x_0$ which will prove the claim, depending on the value of η.

Case I: $\eta \le \frac{5\epsilon}{2}\sqrt{\frac{2\log(10)}{d}}$. We set $x_0 = e_1$, which is indeed at distance 1 from 0, and 0 is a stationary point by Eq. (29). Furthermore, by the definition of gradient descent and Eq. (31), $\|x_t - x_0\| \le \eta t$ for all $t$, so for every $t \le \frac{1}{\eta}\big(1 - \epsilon\sqrt{2\log(10)/d}\big)$ the first coordinate of $x_t$ is at least $\epsilon\sqrt{2\log(10)/d}$, and thus $\|\nabla\tilde{f}(x_t)\| \ge \frac{4}{5} > \epsilon$ by Eq. (33). Consequently (assuming, say, $d \ge 5$, so that $\epsilon\sqrt{2\log(10)/d} \le \frac{1}{2}$), the minimal $t$ for which the gradient norm is less than ϵ satisfies
$$t > \frac{1}{\eta}\Big(1 - \epsilon\sqrt{\frac{2\log(10)}{d}}\Big) \ge \frac{1}{5\epsilon}\sqrt{\frac{d}{2\log(10)}} = \Omega\Big(\frac{\sqrt{d}}{\epsilon}\Big)\,.$$

Case II: $\frac{5\epsilon}{2}\sqrt{\frac{2\log(10)}{d}} < \eta \le 2$. In this case, we define the real function
$$\varphi(s) := 2s - \eta\,\partial_1\tilde{f}(s\, e_1)\,,$$
where $\partial_1$ denotes the first coordinate of the gradient. On the one hand, by the assumption on η and Eq. (33),
$$\varphi\Big(\epsilon\sqrt{\tfrac{2\log(10)}{d}}\Big) \le 2\epsilon\sqrt{\tfrac{2\log(10)}{d}} - \tfrac{4}{5}\eta < 0\,.$$
On the other hand, by Eq. (31),
$$\varphi\Big(\frac{\eta}{2}\Big) = \eta - \eta\,\partial_1\tilde{f}\Big(\frac{\eta}{2}e_1\Big) \ge 0\,.$$
Notice that φ is continuous since $\tilde{f}$ is smooth, so by the intermediate value theorem there exists $s^* \in \big[\epsilon\sqrt{2\log(10)/d},\, \frac{\eta}{2}\big]$ such that $\varphi(s^*) = 0$. Equivalently,
$$s^* - \eta\,\partial_1\tilde{f}(s^* e_1) = -s^*\,. \qquad (34)$$
We set $x_0 = s^* e_1$. First, $x_0$ is of distance at most $\frac{\eta}{2} \le 1$ from 0, which by Eq. (29) is a stationary point. Furthermore, by the definition of gradient descent and Eq. (34) we get
$$x_1 = s^* e_1 - \eta\nabla\tilde{f}(s^* e_1) = -s^* e_1 = -x_0\,.$$
Inductively, due to the symmetry of $\tilde{f}$ with respect to the origin, we obtain $x_t = (-1)^t x_0$. In particular, since $s^* \ge \epsilon\sqrt{2\log(10)/d}$, Eq. (33) ensures that for all $t \in \mathbb{N}$: $\|\nabla\tilde{f}(x_t)\| \ge \frac{4}{5} > \epsilon$.

Case III: η > 2. Set $x_0 = e_1$, which satisfies the distance assumption as explained in Case I. By the definition of gradient descent and Eq. (30): $x_1 = e_1 - \eta\nabla\tilde{f}(e_1) = (1 - \eta)e_1$. Notice that $(1 - \eta) < -1$, so by invoking Eq. (30) we get $x_2 = x_1 - \eta\nabla\tilde{f}(x_1) = (1 - \eta)e_1 + \eta e_1 = x_0$. We deduce that for all $t \in \mathbb{N}$: $x_{t+2} = x_t$, and in particular by Eq. (30): $\|\nabla\tilde{f}(x_t)\| = 1 > \epsilon$.

Appendix C. Proof of Proposition 11

Denote by $\mathcal{H}$ the set of non-negative 2-Lipschitz functions $h$ such that $h(0) = 1$, $x^* := \arg\min_{x\in\mathbb{R}} h(x) \in (0, 1)$ is unique, and for all $x \ne x^*$ and $g \in \partial h(x)$: $|g| \ge 1$. We start by showing that if the proposition does not hold, then there exists an algorithm that finds the minimum of any function in $\mathcal{H}$ within some finite time, with high probability. We will use this implication in order to obtain a contradiction.

Assume by contradiction that Proposition 11 does not hold; that is, that there exist $\mathcal{A}, T, \delta$ such that for any $h \in \mathcal{H}$, ρ > 0,
$$\Pr_{\mathcal{A}}\Big[\min_{t\in[T]} |x_t^h - x^*| < \rho\Big] \ge \delta\,. \qquad (35)$$
Let $(\rho_n)_{n=1}^\infty > 0$ be a decreasing sequence such that $\lim_{n\to\infty}\rho_n = 0$. By continuity of probability measures with respect to decreasing events, assuming Eq. (35) for any ρ > 0 implies
$$\Pr_{\mathcal{A}}\big[\exists t \in [T] : x_t^h = x^*\big] = \Pr_{\mathcal{A}}\Big[\min_{t\in[T]} |x_t^h - x^*| = 0\Big] = \Pr_{\mathcal{A}}\Big[\bigcap_{n=1}^\infty \Big\{\min_{t\in[T]} |x_t^h - x^*| < \rho_n\Big\}\Big] = \lim_{n\to\infty}\Pr_{\mathcal{A}}\Big[\min_{t\in[T]} |x_t^h - x^*| < \rho_n\Big] \ge \delta\,. \qquad (36)$$
Namely, there exists some algorithm $\mathcal{A}$ that gets to the exact minimum of any function in $\mathcal{H}$ within some finite time $T$, with some positive probability δ. But note that a classic confidence-boosting argument shows that if Eq. (36) holds for some 0 < δ < 1, then it actually holds for any 0 < δ < 1 (with an appropriate blow-up in running time). Indeed, assuming it is true for some $\delta_0, \mathcal{A}_0, T_0$, we define an algorithm $\mathcal{A}$ which simulates $N$ independent copies of $\mathcal{A}_0$ and returns the point with the smallest function value over all seen iterates along all the independent copies. By standard Chernoff-Hoeffding bounds one easily gets that for $N$ being some function of $\delta_0$ and δ, $\mathcal{A}$ satisfies Eq.
(36) for $T = N \cdot T_0$. Hence,
$$\forall\, \delta < 1\ \ \exists\, \mathcal{A}_\delta, T_\delta\ \ \forall\, h \in \mathcal{H} : \quad \Pr_{\mathcal{A}_\delta}\big[\exists t \in [T_\delta] : x_t^h = x^*\big] \ge \delta\,. \qquad (37)$$
We fix $\delta = \frac{1}{2}$, and let $\mathcal{A}, T_0$ be its associated algorithm and iteration number. By Yao's lemma (Yao, 1977), we can assume $\mathcal{A}$ is deterministic and provide a distribution over hard functions. Namely, in order to arrive at a contradiction it is enough to show that
$$\Pr_\sigma\big[\exists t \in [T_0] : x_t^{h_\sigma} = x^*\big] < \frac{1}{2} \qquad (38)$$
for some distribution over σ, such that for all σ: $h_\sigma \in \mathcal{H}$.

Before delving into the technical details, we turn to explain the intuition behind the construction. We consider functions $h^N_\sigma$ indexed by $\sigma \in \{0,1\}^N$, $N \in \mathbb{N}$. The function $h^1_{\sigma_1}$ tilts to the left if $\sigma_1 = 0$, and tilts to the right if $\sigma_1 = 1$ (see Fig. 5). Given these two functions, we define the functions for $N = 2$ such that $\sigma_1$ determines an outer tilt, while $\sigma_2$ determines an inner tilt which behaves like $h^1_{\sigma_2}$ (once again, see Fig. 5). These functions are such that for any point outside the outer tilted segment, the function value does not depend on the inner tilt, that is, on $\sigma_2$. We continue this process recursively for any $N \in \mathbb{N}$, such that the larger $i$ is, the finer the tilts determined by $\sigma_i$, in smaller segments. The construction has the property that for all points outside the tilt determined by $\sigma_i$, the function values do not depend on $\sigma_{i+1}, \ldots, \sigma_N$. Thus, by setting $N$ large enough relative to $T$, we will be able to ensure that $x_1, \ldots, x_T$ are not likely to depend on $\sigma_N$, therefore missing the minimum, which does depend on it. To that end, we define the following functions:
$$h^1_0(x) = \begin{cases} 1 - x & x \in (-\infty, 0) \\ 1 - 2x & x \in \big[0, \frac{1}{3}\big) \\ x & x \in \big[\frac{1}{3}, \infty\big) \end{cases}\,, \qquad h^1_1(x) = \begin{cases} 1 - x & x \in (-\infty, 0) \\ 1 - \frac{6}{5}x & x \in \big[0, \frac{5}{8}\big) \\ 2x - 1 & x \in \big[\frac{5}{8}, \infty\big) \end{cases}\,.$$
Next, in a recursive manner, for any $\hat\sigma := (\sigma_2, \ldots, \sigma_N) \in \{0,1\}^{N-1}$ we define (see Fig. 5):
$$h^N_{0,\hat\sigma}(x) = \begin{cases} 1 - x & x \in (-\infty, 0) \\ 1 - 2x & x \in \big[0, \frac{1}{4}\big) \\ \frac{1}{4}h^{N-1}_{\hat\sigma}(4x - 1) + \frac{1}{4} & x \in \big[\frac{1}{4}, \frac{1}{2}\big) \\ x & x \in \big[\frac{1}{2}, \infty\big) \end{cases}\,, \qquad h^N_{1,\hat\sigma}(x) = \begin{cases} 1 - x & x \in (-\infty, \frac{1}{2}) \\ \frac{1}{4}h^{N-1}_{\hat\sigma}(4x - 2) + \frac{1}{4} & x \in \big[\frac{1}{2}, \frac{3}{4}\big) \\ 2x - 1 & x \in \big[\frac{3}{4}, \infty\big) \end{cases}\,.$$

Lemma 25 For any $N \in \mathbb{N}$, it holds that for all $\sigma \in \{0,1\}^N$: $h^N_\sigma \in \mathcal{H}$.

Proof The proof is by induction on $N$.
For $N = 1$ it is easy to verify that $h^1_0, h^1_1 \in \mathcal{H}$ and $h^1_0(1) = h^1_1(1) = 1$, as well as the fact that $h^1_0, h^1_1$ are both piecewise linear. Assume the claim is true for some $N - 1$, and that for all $\hat\sigma \in \{0,1\}^{N-1}$: $h^{N-1}_{\hat\sigma}(1) = 1$, as well as that they are all piecewise linear. Consider $h^N_{0,\hat\sigma}$ for some $\hat\sigma \in \{0,1\}^{N-1}$. First, $h^N_{0,\hat\sigma}(0) = 1 - 2 \cdot 0 = 1$ as required. Moreover, it is clear by definition that $h^N_{0,\hat\sigma}(x) \ge 0$ for all $x \notin \big[\frac{1}{4}, \frac{1}{2}\big]$. For $x \in \big[\frac{1}{4}, \frac{1}{2}\big]$, we have by the induction hypothesis
$$h^N_{0,\hat\sigma}(x) = \frac{1}{4}\underbrace{h^{N-1}_{\hat\sigma}(4x - 1)}_{\ge 0} + \frac{1}{4} > 0\,,$$
which establishes the required non-negativity property.

Figure 5: Plots of the functions $h^N_\sigma$, $\sigma \in \{0,1\}^N$, for $N = 1$ (on the left) and $N = 2$ (on the right).

Note that $h^N_{0,\hat\sigma}$ is continuous. Indeed, it is easy to verify continuity for any $x \notin \{\frac{1}{4}, \frac{1}{2}\}$, and for those two points we have
$$\frac{1}{4}h^{N-1}_{\hat\sigma}(0) + \frac{1}{4} = \frac{1}{2} = \lim_{x\uparrow 1/4} h^N_{0,\hat\sigma}(x)\,, \qquad \frac{1}{4}h^{N-1}_{\hat\sigma}(1) + \frac{1}{4} = \frac{1}{2} = \lim_{x\downarrow 1/2} h^N_{0,\hat\sigma}(x)\,.$$
Using our induction hypothesis, we see that $h^N_{0,\hat\sigma}$ is a piecewise-linear continuous function. Thus, in order to prove the remaining properties it is enough to inspect the subgradients of $h^N_{0,\hat\sigma}$. It is easy to verify that for all $x \notin \big[\frac{1}{4}, \frac{1}{2}\big]$: $\partial h^N_{0,\hat\sigma}(x) \subseteq \{-1, -2, 1\}$. For $x \in \big[\frac{1}{4}, \frac{1}{2}\big]$, we have by the chain rule
$$\partial h^N_{0,\hat\sigma}(x) = \frac{1}{4} \cdot 4\, \partial h^{N-1}_{\hat\sigma}(4x - 1) = \big\{g : g \in \partial h^{N-1}_{\hat\sigma}(4x - 1)\big\}\,. \qquad (39)$$
By the induction hypothesis, all elements of the set above are of magnitude at most 2, which establishes the desired Lipschitz property. Furthermore, note that if $x \in \big[\frac{1}{4}, \frac{1}{2}\big]$ then $4x - 1 \in [0, 1]$. Using the induction hypothesis, let $x^* \in \big[\frac{1}{4}, \frac{1}{2}\big]$ be the unique number such that $4x^* - 1 = \arg\min h^{N-1}_{\hat\sigma}$. Then by the induction hypothesis and Eq. (39), for all $x \in \big[\frac{1}{4}, \frac{1}{2}\big] \setminus \{x^*\}$ and $g \in \partial h^N_{0,\hat\sigma}(x)$: $|g| \ge 1$, which is exactly the desired property, assuming we show that $x^* = \arg\min h^N_{0,\hat\sigma}$.
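The recursive construction is straightforward to implement and probe numerically. The following sketch evaluates $h^N_\sigma$ under the piecewise-linear reconstruction given above (the function name `h` and the specific test points are our own choices):

```python
def h(sigma, x):
    """Evaluate h^N_sigma(x) for a bit tuple sigma = (sigma_1, ..., sigma_N),
    following the recursive 'tilt' construction (piecewise-linear sketch)."""
    s, rest = sigma[0], sigma[1:]
    if not rest:  # base case, N = 1
        if s == 0:
            if x < 0:
                return 1 - x
            return 1 - 2 * x if x < 1 / 3 else x
        if x < 0:
            return 1 - x
        return 1 - 1.2 * x if x < 5 / 8 else 2 * x - 1
    if s == 0:  # outer tilt to the left; inner part lives on [1/4, 1/2]
        if x < 0:
            return 1 - x
        if x < 1 / 4:
            return 1 - 2 * x
        if x < 1 / 2:
            return 0.25 * h(rest, 4 * x - 1) + 0.25
        return x
    # outer tilt to the right; inner part lives on [1/2, 3/4]
    if x < 1 / 2:
        return 1 - x
    if x < 3 / 4:
        return 0.25 * h(rest, 4 * x - 2) + 0.25
    return 2 * x - 1
```

For example, evaluating `h` over all $\sigma \in \{0,1\}^3$ confirms $h^N_\sigma(0) = h^N_\sigma(1) = 1$, and that outside $I_{\sigma_1} = [1/4, 1/2]$ (for $\sigma_1 = 0$) the value is independent of the inner bits, while inside it is not.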
Finally, noticing that $h^N_{0,\hat\sigma}$ is decreasing for all $x < x^*$ and increasing for all $x > x^*$ (which inside the interval $\big[\frac{1}{4}, \frac{1}{2}\big]$ follows once again from the induction hypothesis and our choice of $x^*$) finishes the proof for $h^N_{0,\hat\sigma}$. The proof for $h^N_{1,\hat\sigma}$ is essentially the same.

For $\sigma_1, \ldots, \sigma_k \in \{0, 1\}$, define the interval
$$I_{\sigma_1,\ldots,\sigma_k} := \left[\sum_{i=1}^k \frac{\sigma_i + 1}{4^i},\; \sum_{i=1}^k \frac{\sigma_i + 1}{4^i} + \frac{1}{4^k}\right].$$

Lemma 26 For any $l < k$ and any $\sigma_1, \ldots, \sigma_l, \ldots, \sigma_k \in \{0, 1\}$: $I_{\sigma_1,\ldots,\sigma_l} \supseteq I_{\sigma_1,\ldots,\sigma_l,\ldots,\sigma_k}$.

Proof On the one hand, $\sum_{i=1}^k \frac{\sigma_i + 1}{4^i} \ge \sum_{i=1}^l \frac{\sigma_i + 1}{4^i}$. On the other hand,
$$\sum_{i=1}^k \frac{\sigma_i + 1}{4^i} + \frac{1}{4^k} \le \sum_{i=1}^l \frac{\sigma_i + 1}{4^i} + \sum_{i=l+1}^\infty \frac{2}{4^i} + \frac{1}{4^{l+1}} = \sum_{i=1}^l \frac{\sigma_i + 1}{4^i} + \frac{2}{3}\cdot\frac{1}{4^l} + \frac{1}{4}\cdot\frac{1}{4^l} < \sum_{i=1}^l \frac{\sigma_i + 1}{4^i} + \frac{1}{4^l}\,.$$

Lemma 27 If $(\sigma_1, \ldots, \sigma_{k-1}) \ne (\sigma'_1, \ldots, \sigma'_{k-1})$ then for all $\sigma_k, \sigma'_k \in \{0, 1\}$: $I_{\sigma_1,\ldots,\sigma_k} \cap I_{\sigma'_1,\ldots,\sigma'_k} = \emptyset$.

Proof Let $i_0 \le k - 1$ be the minimal index $i$ for which $\sigma_i \ne \sigma'_i$. Assume without loss of generality that $\sigma_{i_0} = 0$, $\sigma'_{i_0} = 1$. If $x \in I_{\sigma_1,\ldots,\sigma_k}$ then by definition
$$x \le \sum_{i=1}^{i_0}\frac{\sigma_i + 1}{4^i} + \sum_{i=i_0+1}^{k}\frac{2}{4^i} + \frac{1}{4^k} < \sum_{i=1}^{i_0-1}\frac{\sigma_i + 1}{4^i} + \frac{1}{4^{i_0}} + \frac{1}{4^{i_0}} = \sum_{i=1}^{i_0-1}\frac{\sigma_i + 1}{4^i} + \frac{2}{4^{i_0}}\,. \qquad (40)$$
Using the fact that $i_0 < k$, we also get that any $x' \in I_{\sigma'_1,\ldots,\sigma'_k}$ satisfies
$$x' \ge \sum_{i=1}^{i_0}\frac{\sigma'_i + 1}{4^i} = \sum_{i=1}^{i_0-1}\frac{\sigma_i + 1}{4^i} + \frac{2}{4^{i_0}}\,. \qquad (41)$$
Plugging Eq. (41) into Eq. (40) reveals that any $x \in I_{\sigma_1,\ldots,\sigma_k}$ satisfies $x < x'$ for every $x' \in I_{\sigma'_1,\ldots,\sigma'_k}$. Hence $x \notin I_{\sigma'_1,\ldots,\sigma'_k}$, which finishes the proof.

Lemma 28 For any $N \ge 2$, any $1 \le k < N$ and any local oracle $\mathcal{O}$, it holds that $\mathcal{O}_{h^N_{\sigma_1,\ldots,\sigma_N}}(x)$ does not depend on $\sigma_{k+1}, \ldots, \sigma_N$ for $x \notin I_{\sigma_1,\ldots,\sigma_k}$.

Proof First, notice that since $\mathbb{R} \setminus I_{\sigma_1,\ldots,\sigma_k}$ is an open set, it is enough to prove that the function value $h^N_{\sigma_1,\ldots,\sigma_N}(x)$ does not depend on $\sigma_{k+1}, \ldots, \sigma_N$ for $x \notin I_{\sigma_1,\ldots,\sigma_k}$, which implies the desired claim about $\mathcal{O}_{h^N_{\sigma_1,\ldots,\sigma_N}}(x)$ by the definition of a local oracle. We will prove the claim for all natural pairs $(N, k)$ such that $k < N$, using the following inductive argument:
- For any $N \ge 2$, the claim holds for $(N, 1)$.
- If the claim holds for $(N - 1, k - 1)$ then it holds for $(N, k)$.
Combining both bullets proves the claim for any pair $(N, k)$, through the chain of implications
$$(N - k + 1, 1) \implies (N - k + 2, 2) \implies \cdots \implies (N, k)\,.$$
For the first bullet, fix any $N \ge 2$. We need to prove that $h^N_{\sigma_1,\ldots,\sigma_N}(x)$ does not depend on $\sigma_2, \ldots, \sigma_N$ for $x \notin \big[\frac{\sigma_1+1}{4}, \frac{\sigma_1+1}{4} + \frac{1}{4}\big]$. Indeed, if $\sigma_1 = 0$ then by construction $h^N_{0,\sigma_2,\ldots,\sigma_N}(x)$ does not depend on $\sigma_2, \ldots, \sigma_N$ for $x \notin \big[\frac{1}{4}, \frac{1}{2}\big]$.
Similarly, if $\sigma_1 = 1$ then by construction $h^N_{1,\sigma_2,\ldots,\sigma_N}(x)$ does not depend on $\sigma_2, \ldots, \sigma_N$ for $x \notin \big[\frac{1}{2}, \frac{3}{4}\big]$. For the second bullet, assume the claim is true for some pair $(N - 1, k - 1)$. By renaming the variables $\sigma_i \to \sigma_{i+1}$, the induction hypothesis states that $h^{N-1}_{\sigma_2,\ldots,\sigma_N}(x)$ does not depend on $\sigma_{k+1}, \ldots, \sigma_N$ for $x \notin \big[\sum_{i=1}^{k-1}\frac{\sigma_{i+1}+1}{4^i}, \sum_{i=1}^{k-1}\frac{\sigma_{i+1}+1}{4^i} + \frac{1}{4^{k-1}}\big]$. Thus, $h^{N-1}_{\sigma_2,\ldots,\sigma_N}(4x - 1 - \sigma_1)$ does not depend on $\sigma_{k+1}, \ldots, \sigma_N$ for
$$(4x - 1 - \sigma_1) \notin \left[\sum_{i=1}^{k-1}\frac{\sigma_{i+1}+1}{4^i},\; \sum_{i=1}^{k-1}\frac{\sigma_{i+1}+1}{4^i} + \frac{1}{4^{k-1}}\right] \iff x \notin I_{\sigma_1,\ldots,\sigma_k}\,.$$
Noticing that, by definition, $h^N_{\sigma_1,\sigma_2,\ldots,\sigma_N}(x)$ depends on $\sigma_2, \ldots, \sigma_N$ only through the term $h^{N-1}_{\sigma_2,\ldots,\sigma_N}(4x - 1 - \sigma_1)$ finishes the proof.

From now on we fix some $N$ to be specified later, and abbreviate $h_\sigma = h^N_\sigma$. Let $\sigma \sim \mathrm{Unif}(\{0,1\}^N)$, and consider the random sequence $x_1, x_2, \ldots$ produced by $\mathcal{A}$ when applied to $h_\sigma$ (where the randomness comes from the random choice of σ).

Lemma 29 For any $t \in \mathbb{N}$, $1 \le l < k \le N$:
$$\Pr_\sigma\big[x_{t+1} \in I_{\sigma_1,\ldots,\sigma_l,\ldots,\sigma_k} \,\big|\, x_1, \ldots, x_t \notin I_{\sigma_1,\ldots,\sigma_l}\big] \le \frac{1}{2^{k-l-1}}\,.$$

Proof If $x_1, \ldots, x_t \notin I_{\sigma_1,\ldots,\sigma_l}$, then Lemma 28 tells us that $\mathcal{O}_{h_\sigma}(x_1), \ldots, \mathcal{O}_{h_\sigma}(x_t)$ do not depend on $\sigma_{l+1}, \ldots, \sigma_N$. Since $x_{t+1}$ is some deterministic function of $\mathcal{O}_{h_\sigma}(x_1), \ldots, \mathcal{O}_{h_\sigma}(x_t)$, which is the information that $\mathcal{A}$ obtains along its first $t$ iterations, we can deduce that in this case:
1. Conditioned on the past information, $\sigma_{l+1}, \ldots, \sigma_N \sim \mathrm{Unif}(\{0,1\})$; in particular, $(\sigma_{l+1}, \ldots, \sigma_{k-1}) \sim \mathrm{Unif}(\{0,1\}^{k-l-1})$.
2. $x_{t+1}$ is chosen independently of $\sigma_{l+1}, \ldots, \sigma_{k-1}$.
Note that by Lemma 27, $I_{\sigma_1,\ldots,\sigma_l,\sigma_{l+1},\ldots,\sigma_{k-1},\sigma_k}$ and $I_{\sigma_1,\ldots,\sigma_l,\sigma'_{l+1},\ldots,\sigma'_{k-1},\sigma'_k}$ are disjoint for any $(\sigma_{l+1}, \ldots, \sigma_{k-1}) \ne (\sigma'_{l+1}, \ldots, \sigma'_{k-1})$; thus the events $\{x_{t+1} \in I_{\sigma_1,\ldots,\sigma_k}\}$ over the different values of $(\sigma_{l+1}, \ldots, \sigma_{k-1})$ are disjoint. Since there are $2^{(k-1)-(l+1)+1} = 2^{k-l-1}$ such binary vectors, we obtain $2^{k-l-1}$ disjoint events with equal probabilities. Thus, their probability is at most $\frac{1}{2^{k-l-1}}$, which proves the claim.

Recall that $I_{\sigma_1} \supseteq I_{\sigma_1,\sigma_2} \supseteq \cdots \supseteq I_{\sigma_1,\ldots,\sigma_N}$ by Lemma 26.
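Both interval lemmas are easy to confirm numerically for small $k$ under the reconstructed definition of $I_{\sigma_1,\ldots,\sigma_k}$; the following is a sketch (the helper name `interval` is ours):

```python
from itertools import product

def interval(sigma):
    """I_{sigma_1,...,sigma_k} = [sum_i (sigma_i+1)/4^i, same + 4^{-k}]."""
    a = sum((s + 1) / 4 ** (i + 1) for i, s in enumerate(sigma))
    return a, a + 4 ** (-len(sigma))

# Lemma 26: prefixes give nested intervals.
for sigma in product((0, 1), repeat=4):
    for l in range(1, 4):
        a, b = interval(sigma[:l])
        c, d = interval(sigma)
        assert a <= c and d <= b

# Lemma 27: sequences with different length-(k-1) prefixes give
# disjoint intervals, regardless of the last bit.
for s1 in product((0, 1), repeat=4):
    for s2 in product((0, 1), repeat=4):
        if s1[:3] != s2[:3]:
            a, b = interval(s1)
            c, d = interval(s2)
            assert b < c or d < a
```

For instance, $I_0 = [1/4, 1/2]$ and $I_{1,0} = [9/16, 5/8] \subseteq I_1 = [1/2, 3/4]$, matching the segments on which the inner tilts live.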
Combined with the previous lemma, we get the following. First,
$$\Pr_\sigma\big[\exists t \in [T_0] : x_t = x^*\big] \le \Pr_\sigma\big[\exists t \in [T_0] : x_t \in I_{\sigma_1,\ldots,\sigma_N}\big] \le \sum_{t=1}^{T_0}\Pr_\sigma\big[x_t \in I_{\sigma_1,\ldots,\sigma_N}\big]\,,$$
since the minimizer $x^*$ lies in the innermost interval. Moreover, since the intervals are nested and there are only $T_0$ iterates, the event $\{x_t \in I_{\sigma_1,\ldots,\sigma_N}\}$ implies that some iterate crossed at least $\lfloor N/T_0\rfloor$ levels at once, namely that there exist $s \in [t]$, $l \in [N]$ such that $x_1, \ldots, x_s \notin I_{\sigma_1,\ldots,\sigma_l}$ and $x_{s+1} \in I_{\sigma_1,\ldots,\sigma_{l+\lfloor N/T_0\rfloor}}$. Hence, by the union bound and Lemma 29,
$$\Pr_\sigma\big[x_t \in I_{\sigma_1,\ldots,\sigma_N}\big] \le \sum_{s=1}^{t}\sum_{l=1}^{N}\Pr_\sigma\Big[x_{s+1} \in I_{\sigma_1,\ldots,\sigma_{l+\lfloor N/T_0\rfloor}} \,\Big|\, x_1, \ldots, x_s \notin I_{\sigma_1,\ldots,\sigma_l}\Big] \le \frac{t\, N}{2^{\lfloor N/T_0\rfloor - 1}}\,,$$
and overall
$$\Pr_\sigma\big[\exists t \in [T_0] : x_t = x^*\big] \le \frac{N T_0^2}{2^{\lfloor N/T_0\rfloor - 1}}\,.$$
Since $N T_0^2 / 2^{\lfloor N/T_0\rfloor - 1} \to 0$ as $N \to \infty$, there exists some finite $N$ (which depends on $T_0$) such that $N T_0^2 / 2^{\lfloor N/T_0\rfloor - 1} < \frac{1}{2}$. With this $N$, we get
$$\Pr_\sigma\big[\exists t \in [T_0] : x_t = x^*\big] < \frac{1}{2}\,,$$
proving Eq. (38) and finishing the proof.

Appendix D. Technical lemmas

Lemma 30 Denote by $f(\cdot)$ the $L_0$-Lipschitz function $x \mapsto L_0|x_1|$. Assume $\tilde{f}(\cdot)$ has $L$-Lipschitz gradients and satisfies $\|\tilde{f} - f\|_\infty \le \epsilon$. Then $L \ge \frac{L_0}{8\epsilon}$.

Proof Due to rescaling, we can assume without loss of generality that $L_0 = 1$. Denoting by $e_1$ the first standard basis vector, we have
$$\tilde{f}(-4\epsilon e_1) \ge f(-4\epsilon e_1) - \epsilon = 4\epsilon - \epsilon = 3\epsilon\,, \qquad \tilde{f}(4\epsilon e_1) \ge f(4\epsilon e_1) - \epsilon = 4\epsilon - \epsilon = 3\epsilon\,, \qquad \tilde{f}(0) \le f(0) + \epsilon = \epsilon\,.$$
By the mean value theorem, there exist $-4\epsilon < t_0 < 0$ and $0 < t_1 < 4\epsilon$ such that
$$\partial_{x_1}\tilde{f}(t_0 e_1) = \frac{\tilde{f}(0) - \tilde{f}(-4\epsilon e_1)}{4\epsilon} \le \frac{\epsilon - 3\epsilon}{4\epsilon} = -\frac{1}{2}\,, \qquad \partial_{x_1}\tilde{f}(t_1 e_1) = \frac{\tilde{f}(4\epsilon e_1) - \tilde{f}(0)}{4\epsilon} \ge \frac{3\epsilon - \epsilon}{4\epsilon} = \frac{1}{2}\,.$$
So by Cauchy-Schwarz and $L$-smoothness of $\tilde{f}$:
$$1 \le \partial_{x_1}\tilde{f}(t_1 e_1) - \partial_{x_1}\tilde{f}(t_0 e_1) = \big\langle\nabla\tilde{f}(t_1 e_1) - \nabla\tilde{f}(t_0 e_1),\, e_1\big\rangle \le \big\|\nabla\tilde{f}(t_1 e_1) - \nabla\tilde{f}(t_0 e_1)\big\| \le L|t_1 - t_0| \le L \cdot 8\epsilon\,.$$

Lemma 31 If $\tilde{f}(\cdot)$ has $L$-Lipschitz gradients and satisfies $\|\tilde{f} - f\|_\infty \le \epsilon$ for some 1-Lipschitz function $f(\cdot)$, then for all $x \in \mathbb{R}^d$: $\|\nabla\tilde{f}(x)\| \le 1 + 2\epsilon + \frac{L}{2}$.

Proof Let $x, y \in \mathbb{R}^d$. Denote $\gamma(t) := (1 - t)x + ty$, and notice that
$$\tilde{f}(y) - \tilde{f}(x) = \tilde{f}(\gamma(1)) - \tilde{f}(\gamma(0)) = \int_0^1 \frac{d}{dt}\tilde{f}(\gamma(t))\,dt = \int_0^1 \big\langle\nabla\tilde{f}(\gamma(t)),\, y - x\big\rangle\,dt\,. \qquad (42)$$
Combining Cauchy-Schwarz with the fact that $\nabla\tilde{f}$ is $L$-Lipschitz, we get
$$\big\langle\nabla\tilde{f}(x) - \nabla\tilde{f}(\gamma(t)),\, y - x\big\rangle \le \big\|\nabla\tilde{f}(x) - \nabla\tilde{f}(\gamma(t))\big\|\,\|y - x\| \le L\|x - \gamma(t)\|\,\|y - x\|\,,$$
hence
$$\big\langle\nabla\tilde{f}(\gamma(t)),\, y - x\big\rangle \ge \big\langle\nabla\tilde{f}(x),\, y - x\big\rangle - L\|\gamma(t) - x\|\,\|y - x\|\,.$$
Plugging this into Eq.
(42) gives
$$\tilde{f}(y) - \tilde{f}(x) \ge \int_0^1 \Big(\big\langle\nabla\tilde{f}(x),\, y - x\big\rangle - L\|\gamma(t) - x\|\,\|y - x\|\Big)\,dt = \big\langle\nabla\tilde{f}(x),\, y - x\big\rangle - L\|y - x\|\int_0^1 t\|y - x\|\,dt = \big\langle\nabla\tilde{f}(x),\, y - x\big\rangle - \frac{L}{2}\|y - x\|^2\,.$$
We assume $\nabla\tilde{f}(x) \ne 0$, since otherwise the desired claim is trivial. In particular, if $y = x + \frac{\nabla\tilde{f}(x)}{\|\nabla\tilde{f}(x)\|}$ then $\|y - x\| = 1$, and the inequality above reveals
$$\|\nabla\tilde{f}(x)\| = \big\langle\nabla\tilde{f}(x),\, y - x\big\rangle \le \tilde{f}(y) - \tilde{f}(x) + \frac{L}{2} \le f(y) - f(x) + 2\epsilon + \frac{L}{2} \le \|y - x\| + 2\epsilon + \frac{L}{2} = 1 + 2\epsilon + \frac{L}{2}\,,$$
where we used the fact that $\|\tilde{f} - f\|_\infty \le \epsilon$, and that $f$ is 1-Lipschitz.

References

Zeyuan Allen-Zhu and Elad Hazan. Optimal black-box reductions between optimization objectives. Advances in Neural Information Processing Systems, 29, 2016.
Hédy Attouch and Dominique Azé. Approximation and regularization of arbitrary functions in Hilbert spaces by the Lasry-Lions method. Annales de l'Institut Henri Poincaré (C) Non Linear Analysis, 10(3):289-312, 1993.
Keith Ball et al. An elementary introduction to modern convex geometry. Flavors of Geometry, 31:1-58, 1997.
Heinz H. Bauschke, Patrick L. Combettes, et al. Convex Analysis and Monotone Operator Theory in Hilbert Spaces, volume 408. Springer, 2011.
Amir Beck and Nadav Hallak. On the convergence to stationary points of deterministic and randomized feasible descent directions methods. SIAM Journal on Optimization, 30(1):56-79, 2020.
Amir Beck and Marc Teboulle. Smoothing and first order methods: A unified framework. SIAM Journal on Optimization, 22(2):557-580, 2012.
Michel Benaïm, Josef Hofbauer, and Sylvain Sorin. Stochastic approximations and differential inclusions. SIAM Journal on Control and Optimization, 44(1):328-348, 2005.
Andrew Blake and Andrew Zisserman. Visual Reconstruction. MIT Press, 1987.
Jérôme Bolte, Aris Daniilidis, Adrian Lewis, and Masahiro Shiota. Clarke subgradients of stratifiable functions. SIAM Journal on Optimization, 18(2):556-572, 2007.
Jérôme Bolte, Shoham Sabach, Marc Teboulle, and Yakov Vaisbourd.
First order methods beyond convexity and Lipschitz gradient continuity with applications to quadratic inverse problems. SIAM Journal on Optimization, 28(3):2131-2151, 2018.
Gábor Braun, Cristóbal Guzmán, and Sebastian Pokutta. Lower bounds on the oracle complexity of nonsmooth convex optimization via information theory. IEEE Transactions on Information Theory, 63(7):4709-4724, 2017.
Yair Carmon, John C. Duchi, Oliver Hinder, and Aaron Sidford. Lower bounds for finding stationary points I. Mathematical Programming, pages 1-50, 2019.
Coralia Cartis, Nicholas I. M. Gould, and Philippe L. Toint. On the evaluation complexity of composite function minimization with applications to nonconvex nonlinear programming. SIAM Journal on Optimization, 21(4):1721-1739, 2011.
Xiaojun Chen. Smoothing methods for nonsmooth, nonconvex minimization. Mathematical Programming, 134(1):71-99, 2012.
Frank H. Clarke. Optimization and Nonsmooth Analysis, volume 5. SIAM, 1990.
Damek Davis and Dmitriy Drusvyatskiy. Stochastic model-based minimization of weakly convex functions. SIAM Journal on Optimization, 29(1):207-239, 2019.
Damek Davis, Dmitriy Drusvyatskiy, Sham Kakade, and Jason D. Lee. Stochastic subgradient method converges on tame functions. Foundations of Computational Mathematics, pages 1-36, 2018.
Damek Davis, Dmitriy Drusvyatskiy, Yin Tat Lee, Swati Padmanabhan, and Guanghao Ye. A gradient sampling method with complexity guarantees for general Lipschitz functions in high and low dimensions. arXiv preprint arXiv:2112.06969, 2022.
Dmitriy Drusvyatskiy and Courtney Paquette. Efficiency of minimizing compositions of convex functions and smooth maps. Mathematical Programming, 178(1-2):503-558, 2019.
John C. Duchi and Feng Ruan. Stochastic methods for composite and weakly convex optimization problems. SIAM Journal on Optimization, 28(4):3229-3259, 2018.
John C. Duchi, Peter L. Bartlett, and Martin J. Wainwright. Randomized smoothing for stochastic optimization.
SIAM Journal on Optimization, 22(2):674-701, 2012.
Saeed Ghadimi and Guanghui Lan. Stochastic first- and zeroth-order methods for nonconvex stochastic programming. SIAM Journal on Optimization, 23(4):2341-2368, 2013.
A. A. Goldstein. Optimization of Lipschitz continuous functions. Mathematical Programming, 13(1):14-22, 1977.
Elad Hazan, Kfir Yehuda Levy, and Shai Shalev-Shwartz. On graduated optimization for stochastic non-convex problems. In International Conference on Machine Learning, pages 1833-1841. PMLR, 2016.
Chi Jin, Rong Ge, Praneeth Netrapalli, Sham M. Kakade, and Michael I. Jordan. How to escape saddle points efficiently. In Proceedings of the 34th International Conference on Machine Learning - Volume 70, pages 1724-1732. JMLR.org, 2017.
Krzysztof C. Kiwiel. Convergence of the gradient sampling algorithm for nonsmooth nonconvex optimization. SIAM Journal on Optimization, 18(2):379-388, 2007.
Jean-Michel Lasry and Pierre-Louis Lions. A remark on regularization in Hilbert spaces. Israel Journal of Mathematics, 55(3):257-266, 1986.
Szymon Majewski, Błażej Miasojedow, and Éric Moulines. Analysis of nonsmooth stochastic approximation: the differential inclusion approach. arXiv preprint arXiv:1805.01916, 2018.
Hossein Mobahi and John W. Fisher III. A theoretical analysis of optimization by Gaussian continuation. In Twenty-Ninth AAAI Conference on Artificial Intelligence, 2015.
Jean-Jacques Moreau. Proximité et dualité dans un espace hilbertien. Bulletin de la Société mathématique de France, 93:273-299, 1965.
Arkadi Nemirovski. Information-based complexity of convex programming. Lecture Notes, 1995.
Arkadi Semenovich Nemirovski and David Borisovich Yudin. Problem Complexity and Method Efficiency in Optimization. Wiley, 1983.
Yurii Nesterov. Smooth minimization of non-smooth functions. Mathematical Programming, 103(1):127-152, 2005.
Yurii Nesterov. How to make the gradients small. Optima.
Mathematical Optimization Society Newsletter, (88):10-11, 2012.
R. Tyrrell Rockafellar and Roger J. B. Wets. Variational Analysis, volume 317. Springer Science & Business Media, 2009.
Lai Tian, Kaiwen Zhou, and Anthony Man-Cho So. On the finite-time complexity and practical computation of approximate stationarity concepts of Lipschitz functions. In Proceedings of the 39th International Conference on Machine Learning, pages 21360-21379. PMLR, 2022.
Zhijun Wu. The effective energy transformation scheme as a special continuation approach to global optimization with application to molecular conformation. SIAM Journal on Optimization, 6(3):748-768, 1996.
Andrew Chi-Chih Yao. Probabilistic computations: Toward a unified measure of complexity. In 18th Annual Symposium on Foundations of Computer Science (SFCS 1977), pages 222-227. IEEE Computer Society, 1977.
Chao Zhang and Xiaojun Chen. Smoothing projected gradient method and its application to stochastic linear complementarity problems. SIAM Journal on Optimization, 20(2):627-649, 2009.
Jingzhao Zhang, Hongzhou Lin, Stefanie Jegelka, Suvrit Sra, and Ali Jadbabaie. Complexity of finding stationary points of nonsmooth nonconvex functions. In Proceedings of the 37th International Conference on Machine Learning, pages 11173-11182, 2020.