Published as a conference paper at ICLR 2020

GRADIENTLESS DESCENT: HIGH-DIMENSIONAL ZEROTH-ORDER OPTIMIZATION

Daniel Golovin, John Karro, Greg Kochanski, Chansoo Lee, Xingyou Song, Qiuyi (Richard) Zhang (author list in alphabetical order)
Google Brain
{dgg,karro,gpk,chansoo,xingyousong,qiuyiz}@google.com

ABSTRACT

Zeroth-order optimization is the process of minimizing an objective $f(x)$ given oracle access to evaluations at adaptively chosen inputs $x$. In this paper, we present two simple yet powerful GradientLess Descent (GLD) algorithms that do not rely on an underlying gradient estimate and are numerically stable. We analyze our algorithms from a novel geometric perspective and show convergence to within an $\epsilon$-ball of the optimum in $O(kQ\log(n)\log(R/\epsilon))$ evaluations for any monotone transform of a smooth and strongly convex objective with latent dimension $k < n$, where $n$ is the input dimension, $R$ is the diameter of the input space, and $Q$ is the condition number. Our rates are the first of their kind to be both 1) poly-logarithmically dependent on dimensionality and 2) invariant under monotone transformations. We further leverage our geometric perspective to show that our analysis is optimal. Both monotone invariance and the ability to exploit low latent dimensionality are key to the empirical success of our algorithms, as demonstrated on BBOB and MuJoCo benchmarks.

1 INTRODUCTION

We consider the problem of zeroth-order optimization (also known as gradient-free optimization, or bandit optimization), where our goal is to minimize an objective function $f : \mathbb{R}^n \to \mathbb{R}$ with as few evaluations of $f(x)$ as possible. For many practical and interesting objective functions, gradients are difficult to compute, so there is still a need for zeroth-order optimization in applications such as reinforcement learning (Mania et al., 2018; Salimans et al., 2017; Choromanski et al., 2018), attacking neural networks (Chen et al., 2017; Papernot et al., 2017), hyperparameter tuning of deep networks (Snoek et al., 2012), and network control (Liu et al., 2017).

The standard approach to zeroth-order optimization is, ironically, to estimate the gradients from function values and apply a first-order optimization algorithm (Flaxman et al., 2005). Nesterov & Spokoiny (2011) analyze this class of algorithms as gradient descent on a Gaussian smoothing of the objective and give an accelerated $O(n\sqrt{Q}\log((LR^2 + F)/\epsilon))$ iteration complexity for an $L$-Lipschitz convex function with condition number $Q$, where $R = \|x_0 - x^*\|$ and $F = f(x_0) - f(x^*)$. They propose a two-point evaluation scheme that constructs gradient estimates from the difference between function values at two points that are close to each other. This scheme was extended by Duchi et al. (2015) to stochastic settings, by Ghadimi & Lan (2013) to nonconvex settings, and by Shamir (2017) to non-smooth and non-Euclidean norm settings. Since then, first-order techniques such as variance reduction (Liu et al., 2018), conditional gradients (Balasubramanian & Ghadimi, 2018), and diagonal preconditioning (Mania et al., 2018) have been successfully adopted in this setting. This class of algorithms is also known as stochastic search, random search, or (natural) evolutionary strategies, and has been augmented with a variety of heuristics, such as the popular CMA-ES (Auger & Hansen, 2005).
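To ground the discussion, the following is a minimal sketch (ours, not the authors' code) of the two-point, Gaussian-smoothing style gradient estimator described above; the function, the sample count, and the smoothing scale are illustrative choices.

```python
import numpy as np

def gaussian_smoothing_gradient(f, x, sigma=1e-3, num_samples=32, rng=None):
    """Estimate the gradient of a Gaussian smoothing of f at x from
    differences of function values at pairs of nearby points."""
    rng = np.random.default_rng(0) if rng is None else rng
    g = np.zeros_like(x)
    for _ in range(num_samples):
        u = rng.standard_normal(x.shape)
        g += (f(x + sigma * u) - f(x)) / sigma * u
    return g / num_samples

# The estimate then feeds an ordinary first-order method:
f = lambda x: 0.5 * np.dot(x, x)
x = np.ones(10)
for _ in range(200):
    x = x - 0.05 * gaussian_smoothing_gradient(f, x)
print(np.linalg.norm(x))  # near the optimum at 0, up to estimator noise
```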
These algorithms, however, suffer from high variance due to non-robust local minima or highly non-smooth objectives, which are common in deep learning and reinforcement learning. Mania et al. (2018) note that gradient variance increases as training progresses due to higher variance in the objective functions, since parameters must often be tuned precisely to achieve reasonable models. Therefore, some attention has shifted to direct search algorithms, which typically find a descent direction $u$ and move to $x + \delta u$, where the step size is not scaled by the function difference.

The first approaches to direct search were deterministic, based on positive spanning sets, and date back to the 1950s (Brooks, 1958). Only recently have theoretical bounds surfaced, with Gratton et al. (2015) giving an iteration complexity that is a large polynomial of $n$ and Dodangeh & Vicente (2016) giving an improved $O(n^2L^2/\epsilon)$. Stochastic approaches tend to have better complexities: Stich et al. (2013) use line search to give an $O(nQ\log(F/\epsilon))$ iteration complexity for convex functions with condition number $Q$, and most recently, Gorbunov et al. (2019) use importance sampling to give an $O(n\bar{Q}\log(F/\epsilon))$ complexity for convex functions with average condition number $\bar{Q}$, assuming access to sampling probabilities. Stich et al. (2013) note that direct search algorithms are invariant under monotone transforms of the objective, a property that might explain their robustness in high-variance settings.

In general, zeroth-order optimization suffers an at least linear dependence on the input dimension $n$, and recent works have tried to address this limitation when $n$ is large but $f(x)$ admits low-dimensional structure. Some papers assume that $f(x)$ depends only on $k$ coordinates: Wang et al. (2017) apply the Lasso to find the important set of coordinates, whereas Balasubramanian & Ghadimi (2018) simply change the step size to achieve an $O(k(\log(n)/\epsilon)^2)$ iteration complexity. Other papers assume more generally that $f(x) = g(P_A x)$ depends only on a $k$-dimensional subspace given by the range of $P_A$: Djolonga et al. (2013) apply low-rank approximation to find the low-dimensional subspace, while Wang et al. (2013) use random embeddings. Hazan et al. (2017) assume that $f(x)$ is a sparse collection of $k$-degree monomials on the Boolean hypercube and apply sparse recovery to achieve an $O(n^k)$ runtime bound. We will show that when $f(x) = g(P_A x)$, our algorithm inherently picks up any low-dimensional structure in $f(x)$ and achieves a convergence rate that depends on $k\log(n)$. This initial convergence rate survives even under the perturbation $f(x) = g(P_A x) + h(x)$, so long as $h(x)$ is sufficiently small.

We do not attempt to cover the whole variety of black-box optimization methods, such as Bayesian optimization or genetic algorithms. In general, these methods attempt to solve a broader problem (e.g. finding multiple optima), have weaker theoretical guarantees, and may require substantial computation at each step: e.g. Bayesian optimization generally has theoretical iteration complexities that grow exponentially in dimension, and CMA-ES lacks provable complexity bounds beyond convex quadratic functions. In addition to the slow runtime and weaker guarantees, Bayesian optimization assumes the success of an inner optimization loop over the acquisition function. This inner optimization is often implemented with many iterations of a simpler zeroth-order method, justifying the need to understand gradientless descent algorithms in their own context.
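As a concrete illustration of the latent low-dimensional assumption $f(x) = g(P_A x)$ used throughout, here is a small sketch of ours; the choices of $n$, $k$, and $g$ are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)
n, k = 1000, 5

# Random rank-k projector P_A = A A^T with orthonormal A (n x k).
A, _ = np.linalg.qr(rng.standard_normal((n, k)))
P_A = A @ A.T

g = lambda z: 0.5 * np.dot(z, z)   # well-conditioned within the subspace
f = lambda x: g(P_A @ x)           # n-dimensional objective, latent dim k

# f only responds to the k-dimensional component of x:
x = rng.standard_normal(n)
x_perp = x - P_A @ x
assert abs(f(x) - f(P_A @ x)) < 1e-8
assert abs(f(x + 10 * x_perp) - f(x)) < 1e-6
```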
1.1 OUR CONTRIBUTIONS

In this paper, we present GradientLess Descent (GLD), a class of truly gradient-free algorithms (also known as direct search algorithms) that are parameter-free and provably fast. Our algorithms are based on a simple intuition: for well-conditioned functions, if we start from a point and take a small step in a randomly chosen direction, there is a significant probability that we will reduce the objective function value. We present a novel analysis that relies on facts of high-dimensional geometry and can thus be viewed as a geometric analysis of gradient-free algorithms, recovering the standard convergence rates and step sizes. Specifically, we show that if the step size is on the order of $O(\frac{1}{\sqrt{n}})$, we can guarantee an expected decrease of $1 - \Omega(\frac{1}{n})$ in the optimality gap, based on geometric properties of the sublevel sets of a smooth and strongly convex function.

Our results are invariant under monotone transformations of the objective function, so our convergence results also hold for a large class of non-convex functions that form a subclass of quasi-convex functions. Note that monotone transformations of convex functions are not necessarily convex; however, a monotone transformation of a convex function is always quasi-convex. The maximization of quasi-concave utility functions, which is equivalent to the minimization of quasi-convex functions, is an important topic of study in economics (e.g. Arrow & Enthoven (1961)).

Table 1: Comparison of zeroth-order optimization for well-conditioned convex functions, where $R = \|x_0 - x^*\|$ and $F = f(x_0) - f(x^*)$. The Monotone column indicates invariance under monotone transformations (Definition 4). The k-Sparse and k-Affine columns indicate that the iteration complexity is poly$(k, \log(n))$ when $f(x)$ depends only on a $k$-sparse subset of coordinates or on a rank-$k$ affine subspace.

Algorithm | Iteration Complexity | Monotone | k-Sparse | k-Affine
Nesterov & Spokoiny (2011) | $n\sqrt{Q}\log((LR^2 + F)/\epsilon)$ | No | No | No
Balasubramanian & Ghadimi (2018) | $n\sqrt{Q}\log(F/\epsilon)$ | No | Yes | No
Stich et al. (2013) | $nQ\log(F/\epsilon)$ | Yes | No | No
Gorbunov et al. (2019) | $n\bar{Q}\log(F/\epsilon)$ | Yes | No | No
This paper (GLD) | $nQ\log(R/\epsilon)$ | Yes | Yes | Yes

Intuition suggests that the step-size dependence on dimensionality can be improved when $f(x)$ admits a low-dimensional structure. With a careful choice of sampling distribution, we show that if $f(x) = g(P_A x)$, where $P_A$ is a rank-$k$ matrix, then our step size can be on the order of $O(\frac{1}{\sqrt{k}})$, as our optimization behavior is preserved under projections. We call this property affine invariance and show that the number of function evaluations needed for convergence depends only logarithmically on $n$. Unlike most previous algorithms in the high-dimensional setting, no expensive sparse recovery or subspace finding methods are needed. Furthermore, by novel perturbation arguments, we show that our fast convergence rates are robust and hold even under the more realistic assumption that $f(x) = g(P_A x) + h(x)$ with $h(x)$ sufficiently small.
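The step-size claims above are easy to probe numerically. The following Monte Carlo sketch (ours) uses $f(x) = \|x\|^2$ from a point at unit distance from the optimum: steps of length $\Theta(1/\sqrt{n})$ succeed with roughly constant probability, while dimension-independent step lengths succeed with probability vanishing in $n$.

```python
import numpy as np

def success_probability(n, step, trials=10_000, seed=0):
    """Fraction of random steps of length `step` from x (with ||x|| = 1)
    that decrease f(x) = ||x||^2; directions are uniform on the sphere."""
    rng = np.random.default_rng(seed)
    x = np.zeros(n); x[0] = 1.0
    u = rng.standard_normal((trials, n))
    u /= np.linalg.norm(u, axis=1, keepdims=True)
    y = x + step * u
    return np.mean(np.sum(y * y, axis=1) < 1.0)

for n in [10, 100, 1000]:
    print(n, success_probability(n, 0.5 / np.sqrt(n)),  # ~0.4, stable in n
          success_probability(n, 0.5))                   # decays with n
```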
Theorem 1 (Convergence of GLD: informal restatement of Theorem 7 and Theorem 14). Let $f(x)$ be any monotone transform of a convex function with condition number $Q$ and $R = \|x_0 - x^*\|$. Let $y$ be a sample from an appropriate distribution centered at $x$. Then, with constant probability,
$$f(y) - f(x^*) \leq \Big(1 - \frac{1}{5nQ}\Big)\big(f(x) - f(x^*)\big).$$
Therefore, we can find $x_T$ such that $\|x_T - x^*\| \leq \epsilon$ after $T = \widetilde{O}(nQ\log(R/\epsilon))$ function evaluations. Furthermore, for functions $f(x) = g(P_A x) + h(x)$ with a rank-$k$ matrix $P_A$ and sufficiently small $h(x)$, we only require $\widetilde{O}(kQ\log(n)\log(R/\epsilon))$ evaluations.

Another advantage of our non-standard geometric analysis is that it allows us to deduce that our rates are optimal, with a matching lower bound (up to logarithmic factors), presenting theoretical evidence that gradient-free methods inherently require $\Omega(nQ)$ function evaluations to converge. While gradient-estimation algorithms can achieve a better theoretical iteration complexity of $O(n\sqrt{Q})$, they lack the monotone and affine invariance properties. Empirically, we see that these invariance properties are important to successful optimization, as validated by experiments on synthetic BBOB and MuJoCo benchmarks that show the competitiveness of GLD against standard optimization procedures.

2 PRELIMINARIES

We first fix notation for the rest of the paper. Let $\mathcal{X}$ be a compact subset of $\mathbb{R}^n$ and let $\|\cdot\|$ denote the Euclidean norm. The diameter of $\mathcal{X}$, denoted $\|\mathcal{X}\| = \max_{x, x' \in \mathcal{X}} \|x - x'\|$, is the maximum distance between elements of $\mathcal{X}$. Let $f : \mathcal{X} \to \mathbb{R}$ be a real-valued function that attains its minimum at $x^*$. We use $f(X) = \{f(x) : x \in X\}$ to denote the image of $f$ on a subset $X$ of $\mathbb{R}^n$, and $B(c, r) = \{x \in \mathbb{R}^n : \|c - x\| \leq r\}$ to denote the ball of radius $r$ centered at $c$.

Definition 2. The level set of $f$ at a point $x \in \mathcal{X}$ is $L_x(f) = \{y \in \mathcal{X} : f(y) = f(x)\}$. The sub-level set of $f$ at $x \in \mathcal{X}$ is $L^{\downarrow}_x(f) = \{y \in \mathcal{X} : f(y) \leq f(x)\}$. When the function $f$ is clear from the context, we omit it.

Definition 3. We say that $f$ is $\alpha$-strongly convex for $\alpha > 0$ if $f(y) \geq f(x) + \langle \nabla f(x), y - x\rangle + \frac{\alpha}{2}\|y - x\|^2$ for all $x, y \in \mathcal{X}$, and $\beta$-smooth for $\beta > 0$ if $f(y) \leq f(x) + \langle \nabla f(x), y - x\rangle + \frac{\beta}{2}\|y - x\|^2$ for all $x, y \in \mathcal{X}$.

Definition 4. We say that $g \circ f$ is a monotone transformation of $f$ if $g : f(\mathcal{X}) \to \mathbb{R}$ is a monotonically (and strictly) increasing function.

Monotone transformations preserve the level sets of a function in the sense that $L_x(f) = L_x(g \circ f)$. Because our algorithms depend only on level-set properties, our results generalize to any monotone transformation of a strongly convex and strongly smooth function. This leads to our extended notion of condition number.

Definition 5. A function $f$ has condition number $Q \geq 1$ if $Q$ is the minimum ratio $\beta/\alpha$ over all functions $g$ such that $f$ is a monotone transformation of $g$ and $g$ is $\alpha$-strongly convex and $\beta$-smooth.

When we work with low-rank extensions of $f$, we only care about the condition number of $f$ within a rank-$k$ subspace. Indeed, if $f$ only varies along a rank-$k$ subspace, then it has a strong convexity value of $0$, making its condition number undefined. If $f$ is $\alpha$-strongly convex and $\beta$-smooth, then its Hessian matrix always has eigenvalues bounded between $\alpha$ and $\beta$. Therefore, we need a notion of a projected condition number. Let $A \in \mathbb{R}^{d \times k}$ be some orthonormal matrix and let $P_A = AA^{\top}$ be the projection matrix onto the column space of $A$.

Definition 6. For some orthonormal $A \in \mathbb{R}^{d \times k}$ with $d > k$, a function $f$ has condition number restricted to $A$, $Q(A) \geq 1$, if $Q(A)$ is the minimum ratio $\beta/\alpha$ over all functions $g$ such that $f$ is a monotone transformation of $g$ and $h(y) = g(Ay)$ is $\alpha$-strongly convex and $\beta$-smooth.
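Since GLD only ever compares function values, its iterates are unchanged under any strictly increasing transform of the objective (Definitions 4 and 5). A quick sanity check of this invariance, as a sketch of ours with a simplified single-radius step rule:

```python
import numpy as np

def gld_step(f, x, radius, rng):
    """One comparison-based step: move only if the sample improves f."""
    y = x + radius * rng.standard_normal(x.shape) / np.sqrt(x.size)
    return y if f(y) < f(x) else x

f = lambda x: np.dot(x, x)
g_of_f = lambda x: np.exp(f(x))        # strictly increasing transform of f

x1 = np.ones(20); x2 = np.ones(20)
rng1, rng2 = np.random.default_rng(7), np.random.default_rng(7)
for _ in range(200):
    x1 = gld_step(f, x1, 0.3, rng1)
    x2 = gld_step(g_of_f, x2, 0.3, rng2)
assert np.allclose(x1, x2)             # identical iterates on f and g(f)
```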
3 ANALYSIS OF DESCENT STEPS

The GLD template can be summarized as follows: given a sampling distribution $\mathcal{D}$, we start at $x_0$, and in iteration $t$ we choose a scalar radius $r_t$ and sample $y_t$ from the distribution $r_t\mathcal{D}$ centered at $x_t$, where $r_t$ provides the scaling of $\mathcal{D}$. If $f(y_t) < f(x_t)$, we update $x_{t+1} = y_t$; otherwise, we set $x_{t+1} = x_t$. The analysis of GLD follows from the main observation that the sublevel set of a monotone transformation of a strongly convex and strongly smooth function contains a ball of sufficiently large radius tangent to the level set (Lemma 15). In this section, we show that this property, combined with facts of high-dimensional geometry, implies that moving in a random direction from any point has a good chance of significantly improving the objective.

As mentioned before, the key to fast convergence is the careful choice of step sizes, which we describe in Theorem 7. The intuition is that we would like to take as large a step as possible while keeping the probability of improving the objective reasonably high, so by insights from high-dimensional geometry, we choose a step size of $\Theta(1/\sqrt{n})$. We also show that if $f(x)$ admits a latent rank-$k$ structure, then this step size can be increased to $\Theta(1/\sqrt{k})$ and is therefore dependent only on the latent dimensionality of $f(x)$, allowing fast high-dimensional optimization. Lastly, our geometric understanding allows us to show that our convergence rates are optimal, with a matching lower bound. Without loss of generality, this section assumes that $f(x)$ is strongly convex and smooth with condition number $Q$.

3.1 STEP SIZE

Theorem 7. For any $x$ such that $\frac{3}{5Q}\|x - x^*\| \in [C_1, C_2]$, we can find integers $0 \leq k_1, k_2 < \log\frac{C_2}{C_1}$ such that if $r = 2^{k_1}C_1$ or $r = 2^{-k_2}C_2$, then a random sample $y$ from the uniform distribution over $B_x = B(x, \frac{r}{\sqrt{n}})$ satisfies
$$f(y) - f(x^*) \leq \Big(1 - \frac{1}{5nQ}\Big)\big(f(x) - f(x^*)\big)$$
with probability at least $\frac{1}{4}$.

Proving the above theorem requires the following lemma about the intersection of balls in high dimensions; it is proved in the appendix.

Lemma 8. Let $B_1$ and $B_2$ be two balls in $\mathbb{R}^n$ of radii $r_1$ and $r_2$ respectively, and let $\ell$ be the distance between their centers. If $r_1 \in [\frac{\ell}{2\sqrt{n}}, \frac{\ell}{\sqrt{n}}]$ and $r_2 \geq \ell - \frac{\ell}{4n}$, then $\mathrm{vol}(B_1 \cap B_2) \geq c_n\,\mathrm{vol}(B_1)$, where $c_n$ is a dimension-dependent constant lower bounded by $\frac{1}{4}$ at $n = 1$.

3.2 GAUSSIAN SAMPLING AND LOW-RANK STRUCTURE

A direct application of Lemma 8 seems to require uniform sampling from a high-dimensional ball. Upon further inspection, this can easily be replaced with a much simpler Gaussian sampling procedure that concentrates its mass close to the surface of the ball. This procedure lends itself to better analysis when $f(x)$ admits a latent low-dimensional structure, since any affine projection of a Gaussian is still Gaussian.

Lemma 9. Let $B_1$ and $B_2$ be two balls in $\mathbb{R}^n$ of radii $r_1$ and $r_2$ respectively, and let $\ell$ be the distance between their centers. If $r_1 \in [\frac{\ell}{2\sqrt{n}}, \frac{\ell}{\sqrt{n}}]$, $r_2 \geq \ell - \frac{\ell}{n}$, and $X = (X_1, \dots, X_n)$ has independent Gaussian coordinates with mean at the center of $B_1$ and variance $\frac{r_1^2}{n}$, then $\Pr[X \in B_2] > c$, where $c$ is a dimension-independent constant.

Assume that there exists some rank-$k$ projection matrix $P_A$ such that $f(x) = g(P_A x)$, where $k$ is much smaller than $n$. Because Gaussians projected onto a $k$-dimensional subspace remain Gaussian, we show that our algorithm has a dimension dependence on $k$. We let $Q_g(A)$ be the condition number of $g$ restricted to the subspace $A$ that drives the dominant changes in $f(x)$.
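Lemma 9 is easy to check numerically. In the following sketch (ours) we take $\ell = 1$, $r_1 = 1/\sqrt{n}$, and $r_2 = 1 - 1/n$; the estimated probability is small but roughly stable across dimensions, consistent with a dimension-independent constant $c$.

```python
import numpy as np

def hit_probability(n, trials=10_000, seed=0):
    """Estimate Pr[X in B2] for X ~ N(center(B1), (r1^2/n) I), with unit
    center distance, r1 = 1/sqrt(n), r2 = 1 - 1/n (the Lemma 9 setting)."""
    rng = np.random.default_rng(seed)
    r1, r2 = 1.0 / np.sqrt(n), 1.0 - 1.0 / n
    X = rng.standard_normal((trials, n)) * (r1 / np.sqrt(n))
    X[:, 0] += 1.0              # B1 is centered at e_1, B2 at the origin
    return np.mean(np.sum(X * X, axis=1) <= r2 * r2)

for n in [10, 100, 1000]:
    print(n, hit_probability(n))   # small but roughly constant in n
```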
Theorem 10. Let $f(x) = g(P_A x)$ for some unknown rank-$k$ matrix $P_A$ with $k < n$, and suppose $\frac{3}{5Q}\|P_A(x - x^*)\| \in [C_1, C_2]$ for some $C_1, C_2 \in \mathbb{R}^+$. Then, there exist integers $0 \leq k_1, k_2 < \log\frac{C_2}{C_1}$ such that if $r = 2^{k_1}C_1$ or $r = 2^{-k_2}C_2$, then a random sample $y$ from the Gaussian distribution $\mathcal{N}(x, \frac{r^2}{k}I)$ satisfies
$$f(y) - f(x^*) \leq \Big(1 - \frac{1}{5kQ_g(A)}\Big)\big(f(x) - f(x^*)\big)$$
with constant probability.

Note that the speed-up in progress comes from the fact that we can now tolerate the larger sampling radius of $\Omega(1/\sqrt{k})$ while maintaining a high probability of making progress. If $k$ is unknown, we can simply use binary search to find the correct radius, at the cost of an extra factor of $\log(n)$ in our runtime. The low-rank assumption alone is too restrictive to be realistic; however, our fast rates still hold, at least in the early stages of optimization, if we instead assume that $f(x) = g(P_A x) + h(x)$, where $h(x)$ is a full-rank function bounded by $|h(x)| \leq \delta$. In this setting, we show that convergence remains fast, at least until the optimality gap approaches $\delta$.

Theorem 11. Let $f(x) = g(P_A x) + h(x)$ for some unknown rank-$k$ matrix $P_A$ with $k < n$, where $g, h$ are convex and $|h| \leq \delta$. Suppose $\frac{3}{5Q}\|P_A x - z^*\| \in [C_1, C_2]$ for some $C_1, C_2 \in \mathbb{R}^+$, where $z^*$ minimizes $g(z)$. Then, there exist integers $0 \leq k_1, k_2 < \log\frac{C_2}{C_1}$ such that if $r = 2^{k_1}C_1$ or $r = 2^{-k_2}C_2$, then a random sample $y$ from the Gaussian distribution $\mathcal{N}(x, \frac{r^2}{k}I)$ satisfies
$$f(y) - f(x^*) \leq \Big(1 - \frac{1}{10kQ_g(A)}\Big)\big(f(x) - f(x^*)\big)$$
with constant probability, whenever $f(x) - f(x^*) \geq 60\delta k Q_g(A)$.

3.3 LOWER BOUNDS

We show that the upper bounds of the previous section are tight up to logarithmic factors for any symmetric sampling distribution $\mathcal{D}$. These lower bounds follow readily from our geometric perspective: a sampling distribution with a large radius gives an extremely low probability of intersecting the desired sub-level set. Therefore, while gradient-approximation algorithms can be accelerated to achieve a runtime that depends on the square root of the condition number $Q$, gradientless methods that rely on random sampling likely cannot be accelerated, according to our lower bound. However, we emphasize that monotone invariance lets these results apply to a broader class of objective functions, beyond smooth and convex ones, so the results can be useful in practice despite the seemingly worse theoretical bounds.

Theorem 12. Let $y = x + v$, where $v$ is a random sample from $r\mathcal{D}$ for some radius $r > 0$ and $\mathcal{D}$ is a standard Gaussian or any rotationally symmetric distribution. Then there exists a region $\mathcal{X}$ of positive measure such that for any $x \in \mathcal{X}$,
$$f(y) - f(x^*) \geq \Big(1 - O\Big(\frac{\sqrt{\log(nQ)}}{nQ}\Big)\Big)\big(f(x) - f(x^*)\big)$$
with probability at least $1 - \frac{1}{\mathrm{poly}(nQ)}$.

4 GRADIENTLESS ALGORITHMS

In this section, we present two algorithms that follow the same GradientLess Descent (GLD) template: GLD-Search and GLD-Fast, the latter being an optimized version of the former when an upper bound on the condition number of the function is known. Since both algorithms are monotone-invariant, we can appeal to the previous section to derive fast convergence rates for any monotone transform of a convex $f(x)$ with good condition number. We show the efficacy of both algorithms experimentally in the Experiments section.

Algorithm 1: Gradientless Descent with Binary Search (GLD-Search)
Input: function $f : \mathbb{R}^n \to \mathbb{R}$; number of iterations $T \in \mathbb{Z}_+$; starting point $x_0$; sampling distribution $\mathcal{D}$; maximum search radius $R$; minimum search radius $r$.
1: Set $K = \log(R/r)$.
2: for $t = 0, \ldots, T$ do
3:   Ball sampling trial:
4:   for $k = 0, \ldots, K$ do
5:     Set $r_k = 2^{-k}R$.
6:     Sample $v_k \sim r_k\mathcal{D}$.
7:   Update: $x_{t+1} = \arg\min_y \{f(y) : y = x_t \text{ or } y = x_t + v_k,\ 0 \leq k \leq K\}$.
8: return $x_T$
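The pseudocode translates directly into a short implementation. The following is our sketch, not the authors' released code; we instantiate $\mathcal{D}$ so that a sample at radius $r_k$ is distributed as $\mathcal{N}(x_t, \frac{r_k^2}{n}I)$, matching the Gaussian sampling analyzed in Section 3.

```python
import numpy as np

def gld_search(f, x0, T, R, r, rng=None):
    """GLD-Search (Algorithm 1): a binary sweep over radii r_k = 2^{-k} R,
    keeping the best of {x_t} and {x_t + v_k}."""
    rng = np.random.default_rng(0) if rng is None else rng
    n = x0.size
    K = int(np.ceil(np.log2(R / r)))
    x = x0.copy()
    for _ in range(T):
        candidates = [x]
        for k in range(K + 1):
            r_k = 2.0 ** (-k) * R
            candidates.append(x + r_k * rng.standard_normal(n) / np.sqrt(n))
        x = min(candidates, key=f)
    return x

# Example on a 20-dimensional quadratic with condition number 8:
f = lambda x: x[0] ** 2 + 8.0 * np.sum(x[1:] ** 2)
x_T = gld_search(f, np.ones(20), T=300, R=4.0, r=1e-6)
```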
4.1 GRADIENTLESS DESCENT WITH BINARY SEARCH

Although the sampling distribution $\mathcal{D}$ is fixed, we have a choice of radius at each iteration of the algorithm, and we can apply a binary search procedure to ensure progress. The most straightforward version of our algorithm is thus a naive binary sweep across an interval $[r, R]$ that is unchanged throughout the algorithm. This allows us to give convergence guarantees without prior knowledge of the condition number, at the cost of an extra factor of $\log(n/\epsilon)$.

Theorem 13. Let $x_0$ be any starting point and $f$ a blackbox function with condition number $Q$. Running Algorithm 1 with $r = \frac{\epsilon}{\sqrt{n}}$, $R = \|\mathcal{X}\|$, and $\mathcal{D} = \mathcal{N}(0, I)$ a standard Gaussian returns a point $x_T$ such that $\|x_T - x^*\| \leq 2Q^{3/2}\epsilon$ after $O(nQ\log(n\|\mathcal{X}\|/\epsilon)^2)$ function evaluations with high probability. Furthermore, if $f(x) = g(P_A x)$ admits a low-rank structure with $P_A$ a rank-$k$ matrix, then we only require $O(kQ_g(A)\log(n\|\mathcal{X}\|/\epsilon)^2)$ function evaluations to guarantee $\|P_A(x_T - x^*)\| \leq \epsilon$. This holds analogously even if $f(x) = g(P_A x) + h(x)$ is almost low-rank, where $|h| \leq \delta$ and $\epsilon \geq 60\delta k Q_g(A)$.

4.2 GRADIENTLESS DESCENT WITH FAST BINARY SEARCH

GLD-Search (Algorithm 1) uses naive lower and upper bounds for the search radius $\|x_t - x^*\|$, which incurs an extra factor of $\log(1/\epsilon)$ in the runtime bound. In GLD-Fast, we remove this extra dependence on $\log(1/\epsilon)$ by drastically reducing the range of the binary search. This is done by exploiting the assumption that $f$ has a good condition number upper bound $\hat{Q}$ and by slowly halving the diameter of the search space every few iterations, since we expect $x_t \to x^*$ as $t \to \infty$.

Algorithm 2: Gradientless Descent with Fast Binary Search (GLD-Fast)
Input: function $f : \mathbb{R}^n \to \mathbb{R}$; number of iterations $T \in \mathbb{Z}_+$; starting point $x_0$; sampling distribution $\mathcal{D}$; diameter of search space $R$; condition number bound $Q$.
1: Set $K = \log(4Q)$ and $H = nQ\log(Q)$.
2: for $t = 1, \ldots, T$ do
3:   Set $R = R/2$ when $t \equiv 0 \pmod{H}$ (every $H$ iterations).
4:   Ball sampling trial:
5:   for $k = -K, \ldots, 0, \ldots, K$ do
6:     Set $r_k = 2^{-k}R$.
7:     Sample $v_k \sim r_k\mathcal{D}$.
8:   Update: $x_{t+1} = \arg\min_y \{f(y) : y = x_t \text{ or } y = x_t + v_k,\ -K \leq k \leq K\}$.
9: return $x_T$

Theorem 14. Let $x_0$ be any starting point and $f$ a blackbox function with condition number upper bounded by $Q$. Running Algorithm 2 with suitable parameters returns a point $x_T$ such that $f(x_T) - f(x^*) \leq \epsilon$ after $O(nQ\log^2(Q)\log(\|\mathcal{X}\|/\epsilon))$ function evaluations with high probability. Furthermore, if $f(x) = g(P_A x)$ admits a low-rank structure with $P_A$ a rank-$k$ matrix, then we only require $O(kQ_g(A)\log(n)\log^2(Q_g(A))\log(\|\mathcal{X}\|/\epsilon))$ function evaluations to guarantee $\|P_A(x_T - x^*)\| \leq \epsilon$. This holds analogously even if $f(x) = g(P_A x) + h(x)$ is almost low-rank, where $|h| \leq \delta$ and $\epsilon \geq 60\delta k Q_g(A)$.
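Under the same sampling convention as the GLD-Search sketch above, here is a sketch of ours for Algorithm 2; the constants in $K$ and $H$ follow the pseudocode literally and are not tuned.

```python
import numpy as np

def gld_fast(f, x0, T, R, Q, rng=None):
    """GLD-Fast (Algorithm 2): radii 2^{-k} R for k = -K..K, halving the
    search diameter R every H iterations."""
    rng = np.random.default_rng(0) if rng is None else rng
    n = x0.size
    K = int(np.ceil(np.log2(4 * Q)))         # K = log(4Q) from Algorithm 2
    H = max(1, int(n * Q * np.log2(Q)))      # H = nQ log(Q) from Algorithm 2
    x = x0.copy()
    for t in range(1, T + 1):
        if t % H == 0:
            R /= 2.0                          # shrink the search space
        candidates = [x]
        for k in range(-K, K + 1):
            r_k = 2.0 ** (-k) * R
            candidates.append(x + r_k * rng.standard_normal(n) / np.sqrt(n))
        x = min(candidates, key=f)
    return x
```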
5 EXPERIMENTS

We tested the GLD algorithms on a simple class of objective functions and compared them to Accelerated Random Search (ARS) of Nesterov & Spokoiny (2011), which has linear convergence guarantees on strongly convex and strongly smooth functions. To our knowledge, ARS makes the weakest assumptions among zeroth-order algorithms that have linear convergence guarantees and perform only a constant number of operations per iteration. Our main conclusion is that GLD-Fast is comparable to ARS and tends to achieve a reasonably low error much faster than ARS in high dimensions ($n \geq 50$). In low dimensions, GLD-Search is competitive with GLD-Fast and ARS, though it requires no information about the function.

We let $H_{\alpha,\beta,n} \in \mathbb{R}^{n \times n}$ be the diagonal matrix whose $i$-th diagonal entry equals $\alpha + (\beta - \alpha)\frac{i-1}{n-1}$; in simple words, its diagonal elements form an evenly spaced sequence of numbers from $\alpha$ to $\beta$. Our objective function is then $f_{\alpha,\beta,n} : \mathbb{R}^n \to \mathbb{R}$ with $f_{\alpha,\beta,n}(x) = \frac{1}{2}x^{\top}H_{\alpha,\beta,n}x$, which is $\alpha$-strongly convex and $\beta$-smooth. We always use the same starting point $x = \frac{1}{\sqrt{n}}(1, \dots, 1)$, which requires $\|\mathcal{X}\| = Q$ for our algorithms. We plot the optimality gap $f(b_t) - f(x^*)$ against the number of function evaluations, where $b_t$ is the best point observed so far after $t$ evaluations. Although all tested algorithms are stochastic, they have low variance on the objective functions we use; hence we average the results over 10 runs and omit error bars in the plots.

We ran experiments on $f_{1,8,n}$ with imperfect curvature information $\hat{\alpha}$ and $\hat{\beta}$ (see Figure 3 in the appendix). GLD-Search is independent of the condition number. GLD-Fast takes only one parameter, the upper bound on the condition number; if the approximation factor is $z$, we pass $8z$ as the upper bound. ARS requires both the strong convexity and the smoothness parameters. We test three different distributions of the approximation error: when the approximation factor is $z$, ARS-alpha gets $(\alpha/z, \beta)$, ARS-beta gets $(\alpha, z\beta)$, and ARS-even gets $(\alpha/\sqrt{z}, \sqrt{z}\beta)$ as input. GLD-Fast is more robust and faster than ARS when the condition number is over-approximated; when the condition number is underestimated, GLD-Fast still converges steadily.

Figure 1: The average optimality gap on a quadratic objective that is strongly convex and smooth (top) and on its monotone transformation (bottom); each panel plots the log optimality gap against evaluations (in thousands) for dimensions 10, 20, 50, and 100, comparing GLD-Search, GLD-Fast, and ARS. Further experiments on non-convex BBOB functions show similar behavior and are in the appendix.

5.1 MONOTONE TRANSFORMATIONS

In Figure 1, we ran experiments on $f_{1,8,n}$ for different settings of the dimensionality $n$, and on its monotone transformation with $g(y) = \exp(y)$. For this experiment, we assume a perfect oracle for the strong convexity and smoothness parameters of $f$. The convergence of GLD is entirely unaffected by the monotone transformation. For the low-dimensional cases of the transformed function (bottom half of the figure), we note inflection points in the convergence curve of ARS: ARS initially struggles to gain momentum and then struggles to stop its momentum when it gets close to the optimum. Another observation is that, unlike ARS, which needs to build up momentum, GLD-Fast starts with a large search radius and therefore achieves a reasonably low error much faster than ARS, especially in higher dimensions.
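The experimental setup above is easy to reproduce; here is a sketch of ours constructing $f_{\alpha,\beta,n}$, the fixed starting point, and the transformed objective used in this subsection:

```python
import numpy as np

def make_objective(alpha, beta, n):
    """f(x) = 0.5 x^T H x with H diagonal and entries evenly spaced in
    [alpha, beta]: alpha-strongly convex and beta-smooth."""
    h = alpha + (beta - alpha) * np.arange(n) / (n - 1)
    return lambda x: 0.5 * np.dot(x, h * x)

n = 50
f = make_objective(1.0, 8.0, n)       # the f_{1,8,n} used in Section 5
x0 = np.ones(n) / np.sqrt(n)          # the fixed starting point
g_of_f = lambda x: np.exp(f(x))       # transformed objective for Section 5.1
```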
5.2 BBOB BENCHMARKS

To show the practicality of GLD in practical, non-convex settings, we also test the GLD algorithms on a variety of Black-Box Optimization Benchmarking (BBOB) functions (Hansen et al., 2009). For each function the optimum is known, and we use the log optimality gap as the measure of competence. Because each function can exhibit varying forms of non-smoothness and non-convexity, all algorithms are run with a smoothness constant of 10 and a strong convexity constant of 0.1. All other setup details are the same as before, such as the fixed starting point. The plots, given in Appendix C, underscore the superior performance of the GLD algorithms on various BBOB functions, demonstrating that GLD can successfully optimize a diverse set of functions even without explicit knowledge of the condition number. We note that BBOB functions are far from convex and smooth, with many exhibiting high conditioning, multi-modal valleys, and weak global structure. Due to our radius search procedure, our algorithm appears more robust to non-ideal settings with non-convexity and ill-conditioning. As expected, GLD-Fast tends to outperform GLD-Search, especially as the dimension increases, matching our theoretical understanding of GLD.

5.3 MUJOCO CONTROL BENCHMARKS AND AFFINE TRANSFORMATIONS

We also ran experiments on the MuJoCo benchmarks with varying architectures, both linear and nonlinear, demonstrating the viability of our approach even in the non-convex, high-dimensional setting. We note, however, that unlike e.g. ES, which uses all queries to form a gradient direction, our algorithm discards queries that produce less reward than the current argmax, which can be an information handicap. Nevertheless, our algorithm still achieves competitive performance in terms of maximum reward. We used a horizon of 1000 for all experiments.

We further tested the affine invariance of GLD with Gaussian ball sampling on the policy parameters under the HalfCheetah benchmark, by projecting the state $s$ of the MDP with a linear policy to a higher-dimensional state $Ws$, using multiplication by an orthonormal matrix $W$. Specifically, in this setting, for a linear policy parametrized by a matrix $K$, the objective function is $J(KW)$, where $\pi_K(Ws) = KWs$. Note that when projecting into a higher dimension, there is a slowdown factor of $\log\frac{d_{\mathrm{new}}}{d_{\mathrm{old}}}$, where $d_{\mathrm{new}}$ and $d_{\mathrm{old}}$ are the new high dimension and the previous base dimension respectively, due to the binary search of our algorithm over a higher-dimensional space. For our HalfCheetah case, we projected the 17-dimensional base state to a 200-dimensional state, which suggests a slowdown factor of $\log\frac{200}{17} \approx 3.5$. This is shown in our plots in the appendix (Figure 15).
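The lifted-state construction can be sketched as follows (our code, with dimensions following the HalfCheetah experiment):

```python
import numpy as np

def make_lift(d_old=17, d_new=200, seed=0):
    """Orthonormal W (d_new x d_old) lifting the MDP state s to Ws; a
    linear policy then acts on the lifted state as K(Ws)."""
    rng = np.random.default_rng(seed)
    W, _ = np.linalg.qr(rng.standard_normal((d_new, d_old)))
    return W

W = make_lift()
s = np.random.default_rng(1).standard_normal(17)
assert np.allclose(np.linalg.norm(W @ s), np.linalg.norm(s))  # isometry
```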
Table 2: Final rewards of GLD with linear (L) and deep (H41) policies on MuJoCo benchmarks show that GLD is competitive. We apply an affine projection on HalfCheetah to test affine invariance. We use the reward thresholds from Mania et al. (2018), with Reacher's threshold from Schulman et al. (2017), as a reasonable baseline.

Env. | Arch. | Reward at (10^4, 10^5, Max) Queries | Reward Threshold
HalfCheetah-v1 | L | 3799, 3903, 4064 | 3430
HalfCheetah-v1, Proj-200 | L | 1594, 3509, 4342 | 3430
HalfCheetah-v1 | H41 | 2741, 3074, 3392 | 3430
Hopper-v1 | L | 1017, 3359, 3375 | 3120
Hopper-v1 | H41 | 2708, 3370, 3566 | 3120
Reacher-v1 | L | -70, -5, -4 | -10
Reacher-v1 | H41 | -231, -17, -15 | -10
Swimmer-v1 | L | 365, 369, 369 | 325
Swimmer-v1 | H41 | 353, 369, 369 | 325
Walker2d-v1 | L | 1027, 2201, 2201 | 4390
Walker2d-v1 | H41 | 1630, 1963, 2146 | 4390

6 CONCLUSION

We introduced GLD, a robust zeroth-order optimization algorithm that is simple and efficient, and we showed strong theoretical convergence bounds via a novel geometric analysis. As demonstrated by our experiments on BBOB and MuJoCo benchmarks, GLD performs very robustly even in the non-convex setting, and its monotone and affine invariance properties give theoretical insight into its practical efficiency.

GLD is very flexible and allows easy modifications. For example, it could use momentum terms to keep moving in the same direction that improved the objective, or sample from adaptively chosen ellipsoids, similarly to adaptive gradient methods (Duchi et al., 2011; McMahan & Streeter, 2010). Just as one may decay or adaptively vary learning rates for gradient descent, one might similarly vary the distribution from which the ball-sampling radii are chosen, perhaps shrinking the minimum radius as the algorithm progresses, or concentrating more probability mass on smaller radii. Likewise, GLD could be combined with random restarts or other restart policies developed for gradient descent. Analogously to adaptive per-coordinate learning rates (Duchi et al., 2011; McMahan & Streeter, 2010), one could adaptively change the shape of the sampled balls into ellipsoids with various length-scale factors. Arbitrary combinations of the above variants are also possible.
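As one concrete instance of the modifications suggested above, here is a sketch of ours (not analyzed in the paper) of an ellipsoid-sampling step with per-coordinate length scales:

```python
import numpy as np

def gld_ellipsoid_step(f, x, radius, scales, rng):
    """One GLD step sampling from an axis-aligned ellipsoid instead of a
    ball: per-coordinate scales act like per-coordinate learning rates."""
    v = radius * scales * rng.standard_normal(x.shape) / np.sqrt(x.size)
    y = x + v
    return y if f(y) < f(x) else x
```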
REFERENCES

Kenneth J. Arrow and Alain C. Enthoven. Quasi-concave programming. Econometrica: Journal of the Econometric Society, pp. 779–800, 1961.

Anne Auger and Nikolaus Hansen. A restart CMA evolution strategy with increasing population size. In The 2005 IEEE Congress on Evolutionary Computation, volume 2, pp. 1769–1776. IEEE, 2005.

Krishnakumar Balasubramanian and Saeed Ghadimi. Zeroth-order (non)-convex stochastic optimization via conditional gradient and gradient updates. In Advances in Neural Information Processing Systems, pp. 3455–3464, 2018.

Samuel H. Brooks. A discussion of random methods for seeking maxima. Operations Research, 6(2):244–251, 1958.

Pin-Yu Chen, Huan Zhang, Yash Sharma, Jinfeng Yi, and Cho-Jui Hsieh. ZOO: Zeroth order optimization based black-box attacks to deep neural networks without training substitute models. In Proceedings of the 10th ACM Workshop on Artificial Intelligence and Security, pp. 15–26. ACM, 2017.

Krzysztof Choromanski, Mark Rowland, Vikas Sindhwani, Richard E. Turner, and Adrian Weller. Structured evolution with compact architectures for scalable policy optimization. arXiv preprint arXiv:1804.02395, 2018.

Josip Djolonga, Andreas Krause, and Volkan Cevher. High-dimensional Gaussian process bandits. In Advances in Neural Information Processing Systems, pp. 1025–1033, 2013.

Mahdi Dodangeh and Luís N. Vicente. Worst case complexity of direct search under convexity. Mathematical Programming, 155(1-2):307–332, 2016.

John Duchi, Elad Hazan, and Yoram Singer. Adaptive subgradient methods for online learning and stochastic optimization. Journal of Machine Learning Research, 12(Jul):2121–2159, 2011.

John C. Duchi, Michael I. Jordan, Martin J. Wainwright, and Andre Wibisono. Optimal rates for zero-order convex optimization: The power of two function evaluations. IEEE Transactions on Information Theory, 61(5):2788–2806, 2015.

Abraham D. Flaxman, Adam Tauman Kalai, and H. Brendan McMahan. Online convex optimization in the bandit setting: gradient descent without a gradient. In Proceedings of the Sixteenth Annual ACM-SIAM Symposium on Discrete Algorithms, pp. 385–394. Society for Industrial and Applied Mathematics, 2005.

Saeed Ghadimi and Guanghui Lan. Stochastic first- and zeroth-order methods for nonconvex stochastic programming. SIAM Journal on Optimization, 23(4):2341–2368, 2013.

Eduard Gorbunov, Adel Bibi, Ozan Sener, El Houcine Bergou, and Peter Richtárik. A stochastic derivative free optimization method with momentum. arXiv preprint arXiv:1905.13278, 2019.

Serge Gratton, Clément W. Royer, Luís Nunes Vicente, and Zaikun Zhang. Direct search based on probabilistic descent. SIAM Journal on Optimization, 25(3):1515–1541, 2015.

Nikolaus Hansen, Steffen Finck, Raymond Ros, and Anne Auger. Real-parameter black-box optimization benchmarking 2009: Noiseless functions definitions. Research Report RR-6829, INRIA, 2009.

Elad Hazan, Adam Klivans, and Yang Yuan. Hyperparameter optimization: A spectral approach. arXiv preprint arXiv:1706.00764, 2017.

Shengqiao Li. Concise formulas for the area and volume of a hyperspherical cap. Asian Journal of Mathematics and Statistics, 4(1):66–70, 2011.

Sijia Liu, Jie Chen, Pin-Yu Chen, and Alfred O. Hero. Zeroth-order online alternating direction method of multipliers: Convergence analysis and applications. arXiv preprint arXiv:1710.07804, 2017.

Sijia Liu, Bhavya Kailkhura, Pin-Yu Chen, Paishun Ting, Shiyu Chang, and Lisa Amini. Zeroth-order stochastic variance reduction for nonconvex optimization. In Advances in Neural Information Processing Systems, pp. 3727–3737, 2018.

Horia Mania, Aurelia Guy, and Benjamin Recht. Simple random search provides a competitive approach to reinforcement learning. arXiv preprint arXiv:1803.07055, 2018.

H. Brendan McMahan and Matthew J. Streeter. Adaptive bound optimization for online convex optimization. In COLT 2010 - The 23rd Conference on Learning Theory, Haifa, Israel, June 27-29, 2010, pp. 244–256, 2010.

Yurii Nesterov and Vladimir Spokoiny. Random gradient-free minimization of convex functions. Technical report, Université catholique de Louvain, Center for Operations Research and Econometrics (CORE), 2011.

Nicolas Papernot, Patrick McDaniel, Ian Goodfellow, Somesh Jha, Z. Berkay Celik, and Ananthram Swami. Practical black-box attacks against machine learning. In Proceedings of the 2017 ACM on Asia Conference on Computer and Communications Security, pp. 506–519. ACM, 2017.

Tim Salimans, Jonathan Ho, Xi Chen, Szymon Sidor, and Ilya Sutskever. Evolution strategies as a scalable alternative to reinforcement learning. arXiv preprint arXiv:1703.03864, 2017.

John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. Proximal policy optimization algorithms. CoRR, abs/1707.06347, 2017. URL http://arxiv.org/abs/1707.06347.

Ohad Shamir. An optimal algorithm for bandit and zero-order convex optimization with two-point feedback. Journal of Machine Learning Research, 18(52):1–11, 2017.

Jasper Snoek, Hugo Larochelle, and Ryan P. Adams. Practical Bayesian optimization of machine learning algorithms. In Advances in Neural Information Processing Systems, pp. 2951–2959, 2012.

Sebastian U. Stich, Christian L. Müller, and Bernd Gärtner. Optimization of convex functions with random pursuit. SIAM Journal on Optimization, 23(2):1284–1309, 2013.

Yining Wang, Simon Du, Sivaraman Balakrishnan, and Aarti Singh. Stochastic zeroth-order optimization in high dimensions. arXiv preprint arXiv:1710.10551, 2017.

Ziyu Wang, Masrour Zoghi, Frank Hutter, David Matheson, and Nando De Freitas. Bayesian optimization in high dimensions via random embeddings. In Twenty-Third International Joint Conference on Artificial Intelligence, 2013.
A PROOFS OF SECTION 3

Lemma 15. If $h$ has condition number $Q$, then for all $x \in \mathcal{X}$, there is a ball of radius $Q^{-1}\|x - x^*\|$ that is tangent at $x$ and inside the sublevel set $L^{\downarrow}_x(h)$.

Proof. Write $h = g \circ f$ such that $f$ is $\alpha$-strongly convex and $\beta$-smooth with $\beta = Q\alpha$, and $g$ is monotonically increasing. From the smoothness assumption, for any $s$ with $\|s\| \leq \frac{1}{\beta}\|\nabla f(x)\|$,
$$f\Big(x - \frac{1}{\beta}\nabla f(x) + s\Big) \leq f(x) + \Big\langle \nabla f(x),\, s - \frac{1}{\beta}\nabla f(x)\Big\rangle + \frac{\beta}{2}\Big\|s - \frac{1}{\beta}\nabla f(x)\Big\|^2 = f(x) + \frac{\beta}{2}\|s\|^2 - \frac{1}{2\beta}\|\nabla f(x)\|^2 \leq f(x).$$
Consider the ball $B = B\big(x - \frac{1}{\beta}\nabla f(x),\, \frac{1}{\beta}\|\nabla f(x)\|\big)$. The above inequality implies $f(y) \leq f(x)$ for any $y \in B$. Hence, applying $g$ on both sides, we still have $h(y) \leq h(x)$ for all $y \in B$, so $B \subseteq L^{\downarrow}_x(h)$. By strong convexity, $\|\nabla f(x)\| \geq \alpha\|x - x^*\|$, so the radius of $B$ is at least $\frac{\alpha}{\beta}\|x - x^*\| = Q^{-1}\|x - x^*\|$.

Proof of Lemma 8. Without loss of generality, consider the unit-distance case $\ell = 1$. Furthermore, it suffices to prove the claim for the smallest allowed radius $r_2 = 1 - \frac{1}{4n}$. Since $|r_1 - r_2| \leq \ell \leq r_1 + r_2$, the intersection $B_1 \cap B_2$ is composed of two hyperspherical caps glued end to end. We lower bound $\mathrm{vol}(B_1 \cap B_2)$ by the volume of the cap $C_1$ of $B_1$ that is contained in the intersection.

Consider the triangle with sides $r_1$, $r_2$, and $\ell$. From classic geometry, the distance from the center of $B_1$ to the hyperplane bounding $C_1$ is
$$c_1 = \frac{1}{2}\big(1 + r_1^2 - r_2^2\big) > 0. \tag{1}$$
The volume of a spherical cap is (Li, 2011)
$$\mathrm{vol}(C_1) = \frac{1}{2}\,\mathrm{vol}(B_1)\, I_{1 - c_1^2/r_1^2}\Big(\frac{n+1}{2}, \frac{1}{2}\Big),$$
where $I$ is the regularized incomplete beta function,
$$I_x(a, b) = \frac{\int_0^x t^{a-1}(1-t)^{b-1}\,dt}{\int_0^1 t^{a-1}(1-t)^{b-1}\,dt},$$
with $x \in [0, 1]$ and $a, b \in (0, \infty)$. Note that for any fixed $a$ and $b$, $I_x(a, b)$ is increasing in $x$. Hence, in order to lower bound $\mathrm{vol}(C_1)$, we want to lower bound $1 - \frac{c_1^2}{r_1^2}$, or equivalently, upper bound $\frac{c_1^2}{r_1^2}$.

Write $r_1 = \frac{\alpha}{2\sqrt{n}}$ for some $\alpha \in [1, 2]$. From Eq. (1),
$$c_1 = \frac{\alpha^2}{8n} + \frac{1}{4n} - \frac{1}{32n^2}.$$
Hence,
$$\frac{c_1}{r_1} \leq \frac{1}{16\sqrt{n}}\Big(4\alpha + \frac{8}{\alpha}\Big).$$
Since $g(\alpha) = 4\alpha + \frac{8}{\alpha}$ is convex on $[1, 2]$, we have $g(\alpha) \leq \max(g(1), g(2)) = 12$. It follows that $\frac{c_1}{r_1} \leq \frac{12}{16\sqrt{n}} = \frac{3}{4\sqrt{n}}$. So, $1 - \frac{c_1^2}{r_1^2} \geq 1 - \frac{9}{16n}$. To complete the proof, note that $V_n := I_{1 - \frac{9}{16n}}\big(\frac{n+1}{2}, \frac{1}{2}\big)$ is increasing in $n$, with $V_1 = \frac{1}{4}$, so $V_n \geq \frac{1}{4}$ for all $n$.
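The monotonicity claim at the end of this proof is easy to check numerically. A quick sketch of ours, using SciPy's convention betainc(a, b, x) $= I_x(a, b)$:

```python
import numpy as np
from scipy.special import betainc

# V_n = I_{1 - 9/(16n)}((n + 1)/2, 1/2); increasing in n, with V_1 = 1/4
# exactly, since I_x(1, 1/2) = 1 - sqrt(1 - x) and x = 7/16 at n = 1.
for n in [1, 2, 5, 10, 100, 1000]:
    print(n, betainc((n + 1) / 2, 0.5, 1 - 9 / (16 * n)))
```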
Proof of Theorem 7. Let $\nu = \frac{1}{5nQ}$ and let $q = (1 - \nu)x + \nu x^*$. Let $B_q = B(c_q, r_q)$ be a ball that has $q$ on its surface, lies inside $L^{\downarrow}_q$, and has radius $r_q = Q^{-1}\|q - x^*\|$; Lemma 15 guarantees its existence. Suppose that
$$\mathrm{vol}(B_x \cap B_q) \geq \tfrac{1}{4}\,\mathrm{vol}(B_x) \tag{2}$$
and that a random sample $y$ from $B_x$ belongs to $B_q$, which happens with probability at least $\frac{1}{4}$. Then our guarantee follows from
$$f(y) - f(x^*) \leq f(q) - f(x^*) \leq (1 - \nu)f(x) + \nu f(x^*) - f(x^*) = (1 - \nu)\big(f(x) - f(x^*)\big),$$
where the first inequality follows from Lemma 15 and the second from convexity of $f$. Therefore, it now suffices to prove Eq. (2). To do so, we apply Lemma 8 after showing that the radii of $B_x$ and $B_q$ are in the proper ranges.

Let $\ell = \|x - c_q\|$ and note that
$$\ell \leq \|x - q\| + r_q \tag{3}$$
$$= \nu\|x - x^*\| + Q^{-1}\|q - x^*\| = \big(\nu + Q^{-1}(1 - \nu)\big)\|x - x^*\| \leq \frac{6}{5Q}\|x - x^*\|. \tag{4}$$
Since $x$ is outside of $B_q$, we also have
$$\ell \geq r_q = Q^{-1}\|q - x^*\| = Q^{-1}(1 - \nu)\|x - x^*\| \geq \frac{4}{5Q}\|x - x^*\|. \tag{5}$$
It follows that $\frac{\ell}{2} \leq \frac{3}{5Q}\|x - x^*\| \leq \ell$. In $\log_2$ space, our choice of $k_1$ is equivalent to starting from $\log_2 C_1$ and sweeping through the range $[\log_2 C_1, \log_2 C_2]$ at intervals of size 1. This is guaranteed to find a point between $\frac{\ell}{2}$ and $\ell$, which is also an interval of size 1 in $\log_2$ space. Therefore, there exists a $k_1$ satisfying the theorem statement, and similarly we can prove the existence of $k_2$.

Finally, it remains to show that $r_q \geq (1 - \frac{1}{4n})\ell$. From Eq. (3), it suffices to show that $\|x - q\| \leq \frac{\ell}{4n}$, or equivalently $\nu\|x - x^*\| \leq \frac{\ell}{4n}$. From Eq. (5), $\|x - q\| = \nu\|x - x^*\| \leq \nu Q(1 - \nu)^{-1}\ell$. For any $Q, n \geq 1$,
$$\nu Q(1 - \nu)^{-1} = \frac{1}{5n}(1 - \nu)^{-1} \leq \frac{1}{4n}, \tag{6}$$
and the proof is complete.

Proof of Lemma 9. Without loss of generality, let $\ell = 1$, let $B_2$ be centered at the origin with radius $r_2$, and let $B_1$ be centered at $e_1 = (1, 0, \dots, 0)$. We then simply want to show that, with constant probability,
$$(1 + X_1)^2 + \sum_{i=2}^{n} X_i^2 \leq r_2^2.$$
By Markov's inequality, $\sum_{i=2}^n X_i^2 \geq 2r_1^2$ holds with probability at most $1/2$, and $2r_1^2 \leq 2/n$. Since $X_1$ is independent and $r_1^2 \leq 1/n$, it suffices to show that
$$\Pr\big[(1 + X_1)^2 \leq 1 - 4/n\big] \geq \Omega(1).$$
Since $X_1$ has standard deviation at least $r_1/\sqrt{n} \geq 1/(2n)$, the probability of deviating at least a few standard deviations below zero is at least a constant.

Proof of Theorem 10. We consider the projection of all points onto the column space of $A$. Since the Gaussian sampling process is preserved under this projection, the proof follows by applying Theorem 7 restricted to the $k$-dimensional subspace, using Lemma 9 in place of Lemma 8.

Proof of Theorem 11. By the boundedness of $h$, since $f(x) - f(x^*) \geq 60\delta k Q_g(A)$, we see that $g(P_A x) - g(z^*) \geq 60\delta k Q_g(A) - 2\delta > 0$. By Lemma 9, if we sample $y$ from the Gaussian distribution $\mathcal{N}(x, \frac{r^2}{k}I)$ and $z^*$ is the minimum of $g$ restricted to the column space of $A$, then
$$g(P_A y) - g(z^*) \leq \Big(1 - \frac{1}{5kQ_g(A)}\Big)\big(g(P_A x) - g(z^*)\big)$$
with constant probability. By the boundedness of $h$, we know that $h(y) \leq h(x^*) + 2\delta$; furthermore, this also implies that $g(P_A x^*) \leq g(z^*) + 2\delta$. Therefore, the decrease satisfies
$$f(y) - f(x^*) = g(P_A y) - g(P_A x^*) + h(y) - h(x^*) \leq g(P_A y) - g(z^*) + 2\delta$$
$$\leq \Big(1 - \frac{1}{5kQ_g(A)}\Big)\big(g(P_A x) - g(z^*)\big) + 2\delta \leq \Big(1 - \frac{1}{5kQ_g(A)}\Big)\big(f(x) - f(x^*) + 4\delta\big) + 2\delta$$
$$\leq \Big(1 - \frac{1}{5kQ_g(A)}\Big)\big(f(x) - f(x^*)\big) + 6\delta.$$
Since $f(x) - f(x^*) \geq 60\delta k Q_g(A)$, we conclude that
$$\Big(1 - \frac{1}{5kQ_g(A)}\Big)\big(f(x) - f(x^*)\big) + 6\delta \leq \Big(1 - \frac{1}{10kQ_g(A)}\Big)\big(f(x) - f(x^*)\big),$$
and our proof is complete.

Proof of Theorem 12. Our main proof strategy is to show that progress can only be made with a sampling radius of size $O(\sqrt{\log(nQ)}/(nQ))$; larger radii fail to find descent directions with high probability. Consider the simple ellipsoid function $f(x) = x^{\top}Dx$, where $D$ is a diagonal matrix with $D_{11} \leq D_{22} \leq \dots \leq D_{nn}$; WLOG we let $D_{11} = 1$ and $D_{ii} = Q$ for $i > 1$. The optimum is $x^* = 0$ with $f(x^*) = 0$. Consider the region $\mathcal{X} = \{x = (x_1, x_2, \dots, x_n) : 1 \geq x_1 \geq 0.9,\ |x_i| \leq 0.1/(Q\sqrt{n})\}$. Let $v \sim \mathcal{N}(0, I)$ be a standard Gaussian vector. For some radius $r$, the probability of finding a descent direction is
$$\Pr[f(x + rv) \leq f(x)] = \Pr\Big[(x_1 + rv_1)^2 + \sum_{i>1} D_{ii}(x_i + rv_i)^2 \leq x_1^2 + \sum_{i>1} D_{ii}x_i^2\Big]$$
$$= \Pr\Big[2rx_1v_1 + r^2v_1^2 + Q\sum_{i>1}\big(2rx_iv_i + r^2v_i^2\big) \leq 0\Big].$$
By standard concentration bounds for sub-exponential variables, we have
$$\Pr\Big[\Big|\frac{1}{n-1}\sum_{i>1} v_i^2 - 1\Big| \geq t\Big] \leq 2e^{-(n-1)t^2/8}.$$
Therefore, with exponentially high probability, $\sum_{i>1} v_i^2 \geq n/2$. Also, since $|x_i| \leq 0.1/(Q\sqrt{n})$, Chernoff bounds give that with probability at least $1 - 1/(nQ)^3$, $|\sum_{i>1} v_i x_i| \leq \sqrt{\log(nQ)}/Q$. If $Qrn \geq \Omega(\sqrt{\log(nQ)})$, then with high probability the quadratic term dominates:
$$Q\sum_{i>1}\big(2rx_iv_i + r^2v_i^2\big) \geq \frac{Qr^2n}{2} - 2r\sqrt{\log(nQ)} \geq \Omega\big(r\sqrt{\log(nQ)}\big).$$
We conclude that the probability of descent is upper bounded by $\Pr\big[v_1x_1 \leq -\Omega(\sqrt{\log(nQ)})\big]$.
This probability is exactly $\Phi(-l)$, where $\Phi$ is the cumulative distribution function of a standard normal and $l = \Omega(\sqrt{\log(nQ)})$. By a naive upper bound,
$$\Phi(-l) = \frac{1}{\sqrt{2\pi}}\int_l^{\infty} e^{-x^2/2}\,dx \leq \frac{1}{\sqrt{2\pi}}\int_l^{\infty} \frac{x}{l}\,e^{-x^2/2}\,dx = \frac{e^{-l^2/2}}{l\sqrt{2\pi}}.$$
Since $l = \Omega(\sqrt{\log(nQ)})$, we conclude that with probability at least $1 - 1/\mathrm{poly}(nQ)$, we have $f(y) - f(x^*) \geq f(x) - f(x^*)$. Otherwise, we are in the case $Qrn \leq O(\sqrt{\log(nQ)})$. Arguing similarly as before, with high probability our objective function and each coordinate can change by at most $O(\sqrt{\log(nQ)}/(Qn))$.

Next, we extend our proof to any symmetric distribution $\mathcal{D}$. Since $\mathcal{D}$ is rotationally symmetric, if we parametrize $v = (r, \theta)$ in polar coordinates, then the p.d.f. of any scaling of $\mathcal{D}$ must take the form $p(v) = p_r(r)u(\theta)$, where $u(\theta)$ induces the uniform distribution over the unit sphere. Therefore, if $Y$ is a random variable that follows $\mathcal{D}$, then we may write $Y = Rv/\|v\|$, where $R$ is a random scalar with p.d.f. $p_r(r)$, $v$ is a standard Gaussian vector, and $R, v$ are independent. As previously argued, $\|v\|^2 \in [0.5n, 1.5n]$ with exponentially high probability. Therefore, if $R \geq \Omega(\sqrt{\log(nQ)}/Q)$, the same arguments imply that $Y$ is a descent direction with only polynomially small probability. Thus, when $Y$ is a descent direction, it must be that $R \leq O(\sqrt{\log(nQ)}/Q)$, and as argued previously, our lower bound follows similarly.

B PROOFS OF SECTION 4

Proof of Theorem 13. By the Gaussian version of Theorem 7 (the full-rank version of Theorem 10), as long as our binary search sweeps between a minimum search radius of at most $\frac{3}{5Q\sqrt{n}}\|x_t - x^*\|$ and a maximum search radius of $R = \|\mathcal{X}\|$, the diameter of the whole space, the objective value decreases multiplicatively by $1 - \frac{1}{5nQ}$ in each iteration with constant probability. Therefore, if $\|x_t - x^*\| \geq 2Q\epsilon$ throughout and we set $r = \epsilon/\sqrt{n}$ and $R = \|\mathcal{X}\|$, then with high probability we have $f(x_T) - f(x^*) \leq \beta Q^2\epsilon^2$ after $T = O(nQ\log(\|\mathcal{X}\|/(Q\epsilon)))$ iterations, where we note that $F = f(x_0) - f(x^*) \leq \beta\|\mathcal{X}\|^2$ by smoothness. Otherwise, if there exists some $x_t$ such that $\|x_t - x^*\| \leq 2Q\epsilon$, then $f(x_T) - f(x^*) \leq f(x_t) - f(x^*) \leq 4\beta Q^2\epsilon^2$. Therefore, by strong convexity, we conclude that in either case $\|x_T - x^*\| \leq 2Q^{3/2}\epsilon$. Finally, note that each iteration uses a binary search that requires $O(\log(R/r)) = O(\log(n\|\mathcal{X}\|/\epsilon))$ function evaluations. Combining these bounds, we derive our result. The low-rank result follows by applying Theorem 10 and Theorem 11 instead.

Proof of Theorem 14. Let $H = O(nQ\log(Q))$ be the number of iterations between successive radius halvings; we initialize $R = \|\mathcal{X}\|$ and halve $R$ every $H$ iterations. We call the iterations between two halvings an epoch. We claim that $\|x_i - x^*\| \leq R$ for all iterations and proceed by induction on the epoch number. The base case is trivial. Assume that $\|x_i - x^*\| \leq R$ for all iterations in the previous epoch, and let iteration $i_s$ be the start of that epoch and iteration $i_s + H$ its end. Then, since $\|x_{i_s} - x^*\| \leq R$, we see that $f(x_{i_s}) - f(x^*) \leq \beta R^2$ by smoothness. If $\frac{R}{4\sqrt{Q}} \leq \|x_i - x^*\| \leq 4\sqrt{Q}\,R$ for all $i$ in the previous epoch, then by the Gaussian version of Theorem 7 (Theorem 10), since we binary-sweep the radii from $\frac{R}{4Q}$ to $4QR$, we can choose $\mathcal{D}$ accordingly so that our objective value decreases multiplicatively by $1 - \frac{1}{5nQ}$ with constant probability, at a cost of $O(\log(Q))$ function evaluations per iteration. This implies that with high probability, after $O(nQ\log(Q))$ iterations,
$$f(x_{i_s+H}) - f(x^*) \leq \frac{1}{4Q}\big(f(x_{i_s}) - f(x^*)\big) \leq \frac{\alpha}{4}\|x_{i_s} - x^*\|^2 \leq \frac{\alpha R^2}{4}.$$
Otherwise, there exists some $1 \leq j \leq H$ such that $\|x_{i_s+j} - x^*\| \geq 4\sqrt{Q}\,R$ or $\|x_{i_s+j} - x^*\| \leq \frac{R}{4\sqrt{Q}}$.
If it is the former, then by strong convexity, $f(x_{i_s+j}) - f(x^*) \geq \frac{\alpha}{2}\|x_{i_s+j} - x^*\|^2 \geq 2\beta R^2$, which contradicts the fact that $f(x_{i_s}) - f(x^*) \leq \beta R^2$ by smoothness and that the objective value never increases. If it is the latter, then by smoothness, we reach the same conclusion:
$$f(x_{i_s+H}) - f(x^*) \leq f(x_{i_s+j}) - f(x^*) \leq \beta\|x_{i_s+j} - x^*\|^2 \leq \frac{\alpha R^2}{16}.$$
Therefore, by strong convexity, we have $\|x_{i_s+H} - x^*\| \leq R/2$, and our induction is complete. We conclude that after $\log(\|\mathcal{X}\|/\epsilon)$ epochs, we have $\|x_T - x^*\| \leq \epsilon$. Each epoch has $H$ iterations, each with $O(\log(Q))$ function evaluations, and so our result follows. The low-rank result follows by applying Theorem 10 and Theorem 11 instead; however, note that since we do not know the latent dimension $k$, we must extend the binary search, incurring an extra $\log(n)$ factor in the binary search cost.

Figure 2: The average optimality gap by the condition number of the objective function (condition numbers 2, 4, 8, 16; GLD-Search, GLD-Fast, ARS; log optimality gap vs. evaluations in thousands).

Figure 3: The average optimality gap by the accuracy of the condition number estimate, where the approximation factor is the ratio of the estimated to the true condition number (factors 0.125, 0.25, 4, 8; GLD-Search, GLD-Fast, ARS-alpha, ARS-beta, ARS-even). The dimension is $n = 20$.

C.1 BBOB FUNCTION PLOTS

Each figure below plots the log optimality gap against the number of evaluations (in thousands) for dimensions 5, 10, 20, and 40, comparing ARS, GLD-Fast, and GLD-Search.

Figure 4: Convergence plot for the BBOB Rastrigin function.

Figure 5: Convergence plot for the BBOB Bent Cigar function.

Figure 6: Convergence plot for the BBOB Bueche-Rastrigin function.

Figure 7: Convergence plot for the BBOB Different Powers function.

Figure 8: Convergence plot for the BBOB Discus function.
Figure 9: Convergence plot for the BBOB Ellipsoidal function.

Figure 10: Convergence plot for the BBOB Katsuura function.

Figure 11: Convergence plot for the BBOB Schaffers F7 function.

Figure 12: Convergence plot for the BBOB Ill-Conditioned Schaffers F7 function.

Figure 13: Convergence plot for the BBOB Sharp Ridge function.

Figure 14: Convergence plot for the BBOB Weierstrass function.

C.2 MUJOCO CONTROL PLOTS

Figure 15: Plot of the maximum reward found so far by each algorithm. The main line is the median trajectory across 3 runs.

Figure 16: Example of the rewards found by all samples of each algorithm.