# Explicit Gradient Learning for Black-Box Optimization

Elad Sarafian*, Mor Sinay*, Yoram Louzoun, Noa Agmon, Sarit Kraus
Department of Computer Science, Bar-Ilan University, Israel. *Equal contribution. Correspondence to: Elad Sarafian, Mor Sinay.

*Proceedings of the 37th International Conference on Machine Learning, Vienna, Austria, PMLR 119, 2020.*

**Abstract.** Black-Box Optimization (BBO) methods can find optimal policies for systems that interact with complex environments with no analytical representation. As such, they are of interest in many Artificial Intelligence (AI) domains. Yet classical BBO methods fall short in high-dimensional, non-convex problems and are thus often overlooked in real-world AI tasks. Here we present a BBO method, termed Explicit Gradient Learning (EGL), that is designed to optimize high-dimensional, ill-behaved functions. We derive EGL by identifying weak spots in methods that fit the objective function with a parametric Neural Network (NN) model and obtain the gradient signal by calculating the parametric gradient. Instead of fitting the function, EGL trains a NN to estimate the objective gradient directly. We prove the convergence of EGL to a stationary point and its robustness in the optimization of integrable functions. We evaluate EGL and achieve state-of-the-art results in two challenging problems: (1) the COCO test suite, against an assortment of standard BBO methods; and (2) a high-dimensional, non-convex image generation task.

## 1. Introduction

Optimization problems are prevalent in many artificial intelligence applications, from search-and-rescue optimal deployment (Zhen et al., 2014) to triage policy in emergency rooms (Rosemarin et al., 2019) to hyperparameter tuning in machine learning (Bardenet et al., 2013). In these tasks, the objective is to find a policy that minimizes a cost or maximizes a reward. Evaluating the cost of a single policy is a complicated and often costly process that usually has no analytical representation, e.g., due to interaction with real-world physics or numerical simulation. Black-Box Optimization (BBO) algorithms (Audet & Hare, 2017; Golovin et al., 2017) are designed to solve such problems, when the analytical formulation is missing, by repeatedly querying the Black-Box function and searching for an optimal solution while minimizing the number of queries (the budget).

**Related Work.** BBO problems have been studied in multiple fields with diverse approaches. Many works investigated derivative-free methods (Rios & Sahinidis, 2013), from the classic Nelder-Mead algorithm (Nelder & Mead, 1965) and Powell's method (Powell, 1964) to more recent evolutionary algorithms such as CMA-ES (Hansen, 2006). Another line of research is derivative-based algorithms, which first approximate the gradient and then apply line-search methods such as the Conjugate Gradient (CG) method (Shewchuk, 1994) and Quasi-Newton methods, e.g. BFGS (Nocedal & Wright, 2006). Other model-based methods, such as SLSQP (Bonnans et al., 2006) and COBYLA (Powell, 2007), iteratively solve quadratic or linear approximations of the objective function. Some variants apply trust-region methods and iteratively find an optimum within a trusted subset of the domain (Conn et al., 2009; Chen et al., 2018). Another line of research focuses on stochastic discrete problems, e.g. Bayesian methods (Snoek et al., 2015) and multi-armed bandit problems (Flaxman et al., 2004).
More recent works have studied applications of NNs in BBO algorithms. Mania et al. (2018) and Vemula et al. (2019) showed that Markov Decision Processes can be solved by applying random-search optimization over the weights of a parameterized policy. Sener & Koltun (2020) suggested learning an intermediate low-dimensional manifold with a NN to reduce the complexity of the random search. Maheswaranathan et al. (2018) suggested learning the objective and using the parametric gradient with respect to the inputs (Lillicrap et al., 2015) to guide a random search. Saremi (2019) studied NN architectures for learning gradients.

**Our contribution.** In this paper, we suggest a new derivative-based algorithm, Explicit Gradient Learning (EGL), that learns a surrogate function for the gradient by averaging the numerical directional derivatives over small volumes bounded by an $\varepsilon$ radius. We control the accuracy of our model by controlling the $\varepsilon$ radius parameter. We then use trust-regions and dynamic scaling of the objective function to fine-tune the convergence process to the optimal solution. This results in a theoretically guaranteed convergence of EGL to a stationary point. We compared the performance of EGL to eight different BBO algorithms in rigorous simulations in the COCO test suite (Hansen et al., 2019) and show that EGL outperforms all others in terms of final accuracy. EGL was further evaluated in a high-dimensional, non-convex domain: searching the latent space of a Generative Adversarial Network (GAN) (Pan et al., 2019) to generate an image with specific characteristics. EGL again outperformed existing algorithms in terms of accuracy and appearance, where other BBO methods failed to converge to a solution in a reasonable amount of time.

The paper is organized as follows: BBO background and motivation for the EGL algorithm are presented in Sections 2 and 3. The detailed description of the EGL algorithm and its theoretical analysis are given in Section 4. The empirical evaluation of EGL's performance and the comparison to state-of-the-art BBO algorithms are described in Section 5, and we conclude in Section 6. The proofs of all theoretical statements are found in the Appendix.

## 2. Background

A function $f : \Omega \to \mathbb{R}$, $\Omega \subseteq \mathbb{R}^n$, is considered a Black-Box if one can evaluate $y = f(x)$ at any $x \in \Omega$ but has no prior knowledge of its analytical form. A BBO algorithm seeks to find $x^* = \arg\min_{x \in \Omega} f(x)$, typically with as few evaluations as possible (Audet & Hare, 2017). Since, in general, it is not possible to converge to the optimal value with a finite number of evaluations, we define a budget $C$ and seek as good a solution as possible with fewer than $C$ evaluations (Hansen et al., 2010). Many BBO methods operate with a two-phase iterative algorithm: (1) search or collect data with some heuristic; and (2) update a model to obtain a new candidate solution and improve the heuristic. Traditionally, BBO methods are divided into derivative-free and derivative-based methods. The former group relies on statistical models (Balandat et al., 2019), physical models (Van Laarhoven & Aarts, 1987), or Evolutionary Strategies (Bäck, 1996) to define the search pattern; such methods are not restricted to continuous domains, so they can also be applied to discrete variables. Our algorithm, EGL, falls under the latter category, which relies on a gradient estimation to determine the search direction (Bertsekas & Scientific, 2015).
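Before proceeding, the following minimal sketch makes the budgeted Black-Box interface described above concrete. It is our illustration, not the paper's code; the class name, `budget` counter, and the sphere objective are assumptions chosen for the example.

```python
import numpy as np

class BlackBox:
    """Budget-limited black-box wrapper: counts every evaluation of f.

    Illustrative sketch of the BBO query interface; `f` and `budget`
    are placeholders, not part of the paper's implementation.
    """
    def __init__(self, f, budget):
        self.f, self.budget, self.best = f, budget, np.inf

    def __call__(self, x):
        if self.budget <= 0:
            raise RuntimeError("evaluation budget exhausted")
        self.budget -= 1
        y = float(self.f(x))
        self.best = min(self.best, y)  # track the best-observed value
        return y

# Example: a 10D sphere objective with a budget of C = 1000 queries.
bb = BlackBox(lambda x: np.sum(x ** 2), budget=1000)
y0 = bb(np.random.randn(10))
```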
Formally, derivative-based methods are restricted to differentiable functions, but here we show that EGL can be applied successfully whenever the objective function is merely locally integrable.

Recently, with the Reinforcement Learning renaissance (Silver et al., 2017; Schulman et al., 2015), new algorithms have been suggested for the problem of continuous action control (e.g., robotics control). Many of these can be viewed, in the BBO context, as derivative-based methods applied with NN parametric models. One of the most prominent algorithms is DDPG (Lillicrap et al., 2015). DDPG iteratively collects data with some (often naive) exploration strategy and then fits a local NN parametric model $f_\theta$ around a candidate solution $x_k$. To update the candidate, it approximates the gradient $\nabla f$ at $x_k$ with the parametric gradient $\nabla f_\theta$ and then applies a gradient descent step (Ruder, 2016).¹ Algorithm 1 outlines the DDPG steps in the BBO formulation; a minimal code sketch follows the algorithm box. We denote it as Indirect Gradient Learning (IGL), since it does not directly learn the gradient $\nabla f$. In the next section, we develop arguments as to why one should learn the gradient explicitly instead of using the parametric gradient.

¹ $f_\theta$ is differentiable; thus it can be differentiated with respect to its input $x$, as done in adversarial training (Yuan et al., 2019).

**Algorithm 1: Indirect Gradient Learning**

    Input: x_0, α, C
    k = 0
    while budget C > 0 do
        Build local model:
            Collect data D_k = {(x_k + ε n_i, y_i)}_{i=1}^m,  n_i ~ N(0, I)
            Fit a model f_{θ_k} with θ_k = argmin_θ Σ_{i=1}^m |f_θ(x_k + ε n_i) − y_i|²
        Gradient descent:
            x_{k+1} ← x_k − α ∇f_{θ_k}(x_k)
        k ← k + 1
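The following PyTorch sketch implements one IGL iteration under our own simplifications; the small MLP surrogate, the optimizer, and the constants are assumptions for illustration, not DDPG's actual hyperparameters.

```python
import torch
import torch.nn as nn

def igl_step(f, x_k, eps=0.1, alpha=0.01, m=64, fit_iters=200):
    """One Indirect Gradient Learning iteration (Algorithm 1), sketched.

    Fits a surrogate f_theta to samples around x_k, then descends along
    the surrogate's *parametric* gradient d f_theta / dx at x_k.
    """
    n = x_k.numel()
    f_theta = nn.Sequential(nn.Linear(n, 64), nn.ReLU(), nn.Linear(64, 1))
    opt = torch.optim.Adam(f_theta.parameters(), lr=1e-3)

    # Build local model: sample m points in an eps-neighborhood of x_k.
    xs = x_k + eps * torch.randn(m, n)
    ys = torch.tensor([f(x) for x in xs]).unsqueeze(1)
    for _ in range(fit_iters):
        opt.zero_grad()
        nn.functional.mse_loss(f_theta(xs), ys).backward()
        opt.step()

    # Gradient descent on the parametric gradient w.r.t. the input.
    x = x_k.clone().requires_grad_(True)
    f_theta(x.unsqueeze(0)).sum().backward()
    return (x - alpha * x.grad).detach()
```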
## 3. Motivation

In Algorithm 1, only the gradient information $\nabla f(x_k)$ is required to update the next candidate. However, the gradient function is never learned directly; it is only inferred from the parametric model, without any clear guarantee of its veracity. Hence we seek a method that learns the gradient function $\nabla f$ explicitly. Clearly, directly learning the gradient is infeasible, since the Black-Box only outputs the values $f(x)$. Instead, our approach is to learn a surrogate function for $\nabla f$, termed the *mean-gradient*, by sampling pairs of observations $\{(x_i, y_i), (x_j, y_j)\}_{i,j}$ and averaging the numerical directional derivatives over small volumes. This section formally defines the mean-gradient and then formulates two arguments that motivate its use instead of $\nabla f_\theta$. We then illustrate these arguments with 1D and 2D examples taken from the COCO test suite.

### 3.1. The Mean-Gradient

For any differentiable function $f$ with a continuous gradient, the first-order Taylor expansion is

$$f(x + \tau) = f(x) + \nabla f(x) \cdot \tau + O(\|\tau\|^2). \tag{1}$$

Thus, locally around $x$, the directional derivative satisfies $\nabla f(x) \cdot \tau \approx f(x + \tau) - f(x)$. We define the mean-gradient as the function that minimizes the Mean-Square Error (MSE) of such approximations in a vicinity of $x$.

**Definition 1.** The mean-gradient at $x$ with averaging radius $\varepsilon > 0$ is

$$g_\varepsilon(x) = \arg\min_{g \in \mathbb{R}^n} \int |g \cdot \tau - f(x + \tau) + f(x)|^2 \, d\tau \tag{2}$$

where $V_\varepsilon(x) \subseteq \mathbb{R}^n$ is a convex subset s.t. $\|x' - x\| \le \varepsilon$ for all $x' \in V_\varepsilon(x)$, and the integration domain is over all $\tau$ s.t. $x + \tau \in V_\varepsilon(x)$.

**Proposition 1 (controllable accuracy).** For any differentiable function $f$ with a continuous gradient, there is $\kappa_g > 0$ s.t. for any $\varepsilon > 0$ the mean-gradient satisfies $\|g_\varepsilon(x) - \nabla f(x)\| \le \kappa_g \varepsilon$ for all $x \in \Omega$.

In other words, the mean-gradient has a controllable accuracy parameter $\varepsilon$, s.t. reducing $\varepsilon$ improves the gradient approximation. As explained in Sec. 4.2, this property is crucial for obtaining the convergence of EGL to stationary points. Unlike the mean-gradient, the parametric gradient has no such parameter, and its accuracy is not directly controlled. Even for zero MSE error, i.e. $f_\theta(x_i) = y_i$ for all samples in the replay buffer (Mnih et al., 2015), there is no guarantee on the parametric gradient's accuracy. On the contrary, overfitting the objective may severely hurt the gradient approximation. Moreover, the parametric gradient can be discontinuous even when the parametric model itself is Lipschitz continuous (Hansen & Jaumard, 1995). For example, a commonly used NN with ReLU activations is only piecewise differentiable. This leads to very erratic gradients, even for smooth objective functions. If the objective function is not smooth (e.g., for noise-like functions or at singular points), the gradient noise is exacerbated. On the other hand, the next proposition shows that, due to the integral over $V_\varepsilon(x)$ that smooths the gradient, the mean-gradient is continuous whenever the objective function is continuous.

**Proposition 2 (continuity).** If $f$ is continuous in $V$ and $V_\varepsilon(x) \subseteq V$, then $g_\varepsilon$ is a continuous function at $x$.

The mean-gradient is not necessarily continuous at discontinuity points of $f$. One can obtain an even smoother surrogate for $\nabla f$ by slightly modifying the mean-gradient definition:

$$g^p_\varepsilon(x) = \arg\min_{g \in \mathbb{R}^n} \iint |g \cdot (\tau - s) - f(\tau) + f(s)|^2 \, ds \, d\tau,$$

where the integration domains are $s \in B_p(x)$ and $\tau \in V_\varepsilon(x)$. Here, $B_p(x) \subset V_\varepsilon(x)$ is an $n$-ball perturbation set with radius $p < \varepsilon$ that dithers the reference point (which is fixed at $x$ in Definition 1). We term this modified version the *perturbed mean-gradient*. Remarkably, while $g^p_\varepsilon$ is still a controllably accurate model for Lipschitz continuous gradients, since the integrand is $x$-independent, $g^p_\varepsilon$ is continuous whenever $f$ is merely integrable. In practice, as $g_\varepsilon$ is learned with a Lipschitz continuous model, we find that both forms are continuous for integrable functions. Nevertheless, Sec. 5 shows that $g^p_\varepsilon$ adds a small gain to the EGL performance. Next, we demonstrate how the smoothness of the mean-gradient (as opposed to $\nabla f_\theta$) leads to more stable and efficient trajectories for both continuous and discontinuous objective functions. (A small numerical illustration of Definition 1 follows.)
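The sketch below estimates the mean-gradient of Definition 1 at a single point by Monte-Carlo least squares; the sampling scheme, the number of samples `m`, and the $f(x) = |x|$ example are our assumptions for illustration.

```python
import numpy as np

def mean_gradient(f, x, eps, m=256, rng=np.random.default_rng(0)):
    """Monte-Carlo estimate of the mean-gradient (Definition 1) at x.

    Solves min_g sum_i |g . tau_i - (f(x + tau_i) - f(x))|^2 over m
    offsets tau_i sampled inside the eps-ball V_eps(x); a sketch only.
    """
    n = x.size
    tau = rng.normal(size=(m, n))
    tau /= np.linalg.norm(tau, axis=1, keepdims=True)
    tau *= eps * rng.uniform(size=(m, 1)) ** (1.0 / n)  # uniform in the ball
    dy = np.array([f(x + t) - f(x) for t in tau])
    g, *_ = np.linalg.lstsq(tau, dy, rcond=None)
    return g

# Near the singular minimum of f(x) = |x|, a large eps yields a smooth,
# small-magnitude estimate, while a tiny eps recovers the local slope 1.
f = lambda x: np.abs(x).sum()
x = np.array([0.01])
print(mean_gradient(f, x, eps=1.0), mean_gradient(f, x, eps=0.005))
```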
### 3.2. Illustrative Examples

To demonstrate these properties of the mean-gradient, we consider 1D² and 2D problems from the COCO test suite. We start by examining four typical 1D functions: (a) parabolic; (b) piece-wise linear; (c) multiple local minima; and (d) step function. We fit $f$ with a NN and compare its parametric gradient to the mean-gradient, learned with another NN. The NN model is identical in both cases and is based on our spline embedding architecture (see Sec. 4.4). The results are presented in Fig. 1.

² Since COCO does not have built-in 1D problems, we generated 1D problems from 2D problems via $f_{1D}(x) := f_{2D}(x, x)$.

*[Figure 1. Comparing indirect gradient learning and explicit gradient learning for 4 typical functions: (a) parabolic; (b) piece-wise linear; (c) multiple local minima; (d) step function.]*

The 1st row shows that the fit $\hat{f}$ is very accurate, s.t. the error is almost indistinguishable to the naked eye. Nevertheless, the calculated parametric gradient (2nd row) is noisy, even for smooth functions, as the NN architecture is piece-wise linear. Occasionally, there are even spikes that change the gradient's sign. Traversing the surface curves with such a gradient is very unstable and inefficient. On the other hand, due to the smoothing parameter $\varepsilon = 0.1$, and since the NN is Lipschitz continuous, the mean-gradient is always smooth, even at singularity points and discontinuous gradients. In the 3rd row we evaluate $g_\varepsilon$ for different values of the $\varepsilon$ parameter. We see that by setting $\varepsilon$ sufficiently high, the gradient becomes smooth enough that there is no problem descending over step functions and functions with multiple local minima. In practice, we may exploit this property and start the descent trajectory with a high $\varepsilon$. This way, the optimization process does not commit too early to a local minimum and searches for regions with lower valleys. After refining $\varepsilon$, the process settles into a local minimum, which in practice is usually much lower than the minima found around the initial point.

The next experiment (Fig. 2) is executed on 2D problems from the COCO test suite.

*[Figure 2. Visualizing explicit gradient learning with different ε for various 2D problems from the COCO test suite: (a) sharp-ridge problem 194; (b) step-ellipsoid problem 97. Comparing EGL and IGL: (c) Schaffer F7 C1000 problem 255; (d) Schaffer F7 C10 problem 240.]*

In Fig. 2(a-b) we present the gradient-descent steps with the mean-gradient and a constant $\varepsilon$. In 2(a) the objective has a singular minimum, similar to the minimum of the $|x|$ function (Fig. 1(a)). As $\varepsilon$ gets smaller, the gradients near the minimum get larger and the learning rate $\alpha$ must be reduced in order to converge. 2(b) presents a step function. Again, we observe that for a smaller $\varepsilon$ the gradient has high spikes at the discontinuity points, leading to a noisier trajectory. For that reason, the EGL algorithm decays both $\alpha$ and $\varepsilon$ during the learning process (see Sec. 4 for details). This lends much smoother trajectories, as can be seen in Fig. 2(c-d). Here we executed both EGL and IGL with the same $\alpha$, $\varepsilon$ decay pattern. We observe that the IGL trajectories are noisy and inefficient, since the parametric gradient error is not bounded. On the other hand, EGL smoothly travels through a ravine (Fig. 2(c)) and converges to a global minimum (Fig. 2(d)).

## 4. Design & Analysis

In this section, we lay out the practical EGL algorithm and analyze its asymptotic properties.

### 4.1. Monte-Carlo Approximation

To learn the mean-gradient, one may evaluate the integral in Eq. (2) with Monte-Carlo samples and learn a model that minimizes this term. Formally, for a model $g_\theta : \Omega \to \mathbb{R}^n$ and a dataset $D_k = \{(x_i, y_i)\}_{i=1}^m$, define the loss function

$$L_{k,\varepsilon}(\theta) = \sum_{x_j \in V_\varepsilon(x_i)} |(x_j - x_i) \cdot g_\theta(x_i) - y_j + y_i|^2 \tag{3}$$

and learn $\theta^*_k = \arg\min_\theta L_{k,\varepsilon}(\theta)$, e.g. with gradient descent. This formulation can be used to estimate the mean-gradient for any $x$. Yet, practically, in each optimization iteration we only care about estimating it in close proximity to the current candidate solution $x_k$. Therefore, we assume that the dataset $D_k$ holds samples only from $V_\varepsilon(x_k)$. (A sketch of training $g_\theta$ on this loss follows.)
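The following PyTorch sketch trains a network on the Monte-Carlo loss of Eq. (3); the MLP architecture, the optimizer, and the use of all ordered sample pairs are our assumptions (the paper's spline-embedding net is described in Sec. 4.4).

```python
import torch
import torch.nn as nn

def fit_mean_gradient(xs, ys, epochs=500, lr=1e-3):
    """Train g_theta on the loss of Eq. (3), sketched.

    xs: (m, n) float tensor of samples in V_eps(x_k); ys: (m,) their values.
    """
    m, n = xs.shape
    g_theta = nn.Sequential(nn.Linear(n, 128), nn.ReLU(), nn.Linear(128, n))
    opt = torch.optim.Adam(g_theta.parameters(), lr=lr)
    i, j = torch.meshgrid(torch.arange(m), torch.arange(m), indexing="ij")
    i, j = i.reshape(-1), j.reshape(-1)  # all ordered pairs (i, j)
    for _ in range(epochs):
        opt.zero_grad()
        # (x_j - x_i) . g_theta(x_i), compared with y_j - y_i
        pred = ((xs[j] - xs[i]) * g_theta(xs[i])).sum(dim=1)
        loss = ((pred - (ys[j] - ys[i])) ** 2).mean()
        loss.backward()
        opt.step()
    return g_theta
```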
The accuracy of the learned model $g_{\theta^*_k}$ heavily depends on the number and locations of the evaluation points and on the specific parameterization of $g_\theta$. Still, for $f$ with a Lipschitz continuous gradient, i.e. $f \in C^{1+}$, we can bound the model accuracy in $V_\varepsilon(x_k)$ with respect to $\varepsilon$. For that purpose, we require a set of at least $m \ge n + 1$ evaluation points in $D_k$ which satisfy the following poised-set definition.

**Definition 2 (poised set for regression).** Let $D_k = \{(x_i, y_i)\}_{i=1}^m$, $m \ge n + 1$, s.t. $x_i \in V_\varepsilon(x_k)$ for all $i$. Define the matrices $\bar{X}_i \in M_{m \times n}$ s.t. the $j$-th row of $\bar{X}_i$ is $x_i - x_j$, and let $\bar{X} = (\bar{X}_1^T \cdots \bar{X}_m^T)^T$. The set $D_k$ is a poised set for regression in $x_k$ if the matrix $\bar{X}$ has rank $n$.

Intuitively, a set is poised if its difference vectors $x_i - x_j$ span $\mathbb{R}^n$. For a poised set and a constant parameterization, the solution of Eq. (3) is unique and equal to the Least Squares (LS) minimizer. If $f$ has a Lipschitz continuous gradient, then, with an admissible parametric model, the error between $g_\theta(x)$ and $\nabla f(x)$ is proportional to $\varepsilon$. We formalize this in the following theorem and corollary.

**Theorem 1.** Let $D_k$ be a poised set in $V_\varepsilon(x_k)$. The regression problem

$$g_{MSE} = \arg\min_{g} \sum_{i,j \in D_k} |(x_j - x_i) \cdot g - y_j + y_i|^2 \tag{4}$$

has the unique solution $g_{MSE} = (\bar{X}^T \bar{X})^{-1} \bar{X}^T \delta$, where $\delta \in \mathbb{R}^{m^2}$ s.t. $\delta_{i \cdot (m-1) + j} = y_j - y_i$. Further, if $f \in C^{1+}$ and $g_\theta \in C^0$ is a parameterization with (equal or) lower regression loss than $g_{MSE}$, the following holds for all $x \in V_\varepsilon(x_k)$:

$$\|\nabla f(x) - g_\theta(x)\| \le \kappa_g \varepsilon \tag{5}$$

**Corollary 1.** For a poised set $D_k$, any Lipschitz continuous parameterization of the form $g_\theta(x) = F(Wx) + b$ is a controllably accurate model (Audet & Hare, 2017) in $V_\varepsilon(x_k)$ for the optimal set of parameters $\theta^*_k$.

This is obvious, as we can simply set $W = 0$ and $b = g_{MSE}$. In this work, we are interested in using NNs, which are much stronger parameterizations than constant models. If the NN satisfies the Lipschitz continuity property (e.g., via spectral normalization) and has at least a biased output layer, then its optimal set of parameters $\theta^*_k$ attains a regression loss no higher than that of $g_{MSE}$, and it is therefore a controllably accurate model.³ (A numerical sanity check of the closed-form $g_{MSE}$ appears at the end of this subsection.)

³ Provided that the optimization process recovers $\theta^*_k$, which is not necessarily true in practice.

Finally, to incorporate learning of the perturbed mean-gradient, we slightly modify Eq. (3). Note that we cannot directly dither the reference point $x_i$, as we would need to collect more samples. Instead, we may dither the argument of $g_\theta$ by evaluating it at $\tilde{x}_{ir} = x_i + n_r$, where $n_r$ is uniformly sampled in an $n$-ball with radius $p$. Thus, the perturbed loss function for small $p$ s.t. $p \ll \varepsilon$ is

$$L_{k,\varepsilon,p}(\theta) = \sum_{i,j \in D_k,\, n_r} |(x_j - x_i) \cdot g_\theta(\tilde{x}_{ir}) - y_j + y_i|^2 \tag{6}$$
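The sketch below computes the LS mean-gradient of Theorem 1 in closed form and sanity-checks it on a quadratic. For concreteness we pair each row $x_j - x_i$ with the target $y_j - y_i$, matching Eq. (4); the sampling radius and the test function are our assumptions.

```python
import numpy as np

def g_mse(xs, ys):
    """Closed-form LS mean-gradient of Theorem 1 (illustrative sketch).

    Stacks the difference rows x_j - x_i and the value differences
    y_j - y_i, then solves the normal equations, i.e. (X^T X)^(-1) X^T d.
    """
    m, n = xs.shape
    rows, deltas = [], []
    for i in range(m):
        for j in range(m):
            rows.append(xs[j] - xs[i])
            deltas.append(ys[j] - ys[i])
    X, d = np.stack(rows), np.array(deltas)
    return np.linalg.lstsq(X, d, rcond=None)[0]

# Sanity check on f(x) = ||x||^2: the gradient at x0 is 2 * x0.
rng = np.random.default_rng(0)
x0 = np.ones(3)
xs = x0 + 0.01 * rng.normal(size=(8, 3))  # a poised set of m >= n + 1 points
ys = (xs ** 2).sum(axis=1)
print(g_mse(xs, ys))  # approximately [2, 2, 2]
```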
### 4.2. Asymptotic Analysis

In this part, we assume that the function $f$ has a Lipschitz continuous gradient, s.t. $\|\nabla f(x) - \nabla f(x')\| \le \kappa_f \|x - x'\|$. In this case, the classical gradient descent theorem states that the update $x_{k+1} = x_k - \alpha \nabla f(x_k)$ with learning parameter $\alpha \le \frac{1}{\kappa_f}$ converges to a stationary point $x^*$, s.t. $\nabla f(x_k) \to 0$ as $k \to \infty$ (Nesterov, 2013; Lee et al., 2016). EGL descends over a surrogate of the gradient that contains some amount of error. Far from $x^*$, where $\|\nabla f\|$ is large, the error in $g_\varepsilon$ is small enough that every new candidate improves the solution, i.e. $f(x_{k+1}) \le f(x_k)$. If the stationary point $x^*$ is locally convex, then, as $x_k$ gets closer to $x^*$, the gradient $\nabla f$ decreases, and eventually the error $\|g_\varepsilon(x_k) - \nabla f(x_k)\|$ becomes so significant that improvement is no longer guaranteed. Nevertheless, our analysis shows that a proper choice of $\varepsilon_k$ yields monotonically decreasing steps that converge to a stationary point.

**Theorem 2.** Let $f : \Omega \to \mathbb{R}$ have a Lipschitz continuous gradient with Lipschitz constant $\kappa_f$, and suppose a controllable mean-gradient model $g_\varepsilon$ with error constant $\kappa_g$. The gradient descent iteration $x_{k+1} = x_k - \alpha g_\varepsilon(x_k)$, with $\alpha$ s.t. $\frac{5\varepsilon}{\|\nabla f(x_k)\|} \le \alpha \le \frac{1}{\kappa_f}$, guarantees a monotonically decreasing step s.t.

$$f(x_{k+1}) \le f(x_k) - 2.25 \frac{\varepsilon^2}{\alpha}.$$

**Corollary 2.** If Theorem 2 is satisfied for all $k$, the gradient descent iteration converges to a stationary point $x^*$, s.t.

$$\frac{1}{K} \sum_{k=1}^{K} \|\nabla f(x_k)\|^2 \le \frac{12\,|f(x_0) - f(x^*)|}{\alpha K}.$$

Utilizing Theorem 2, we can design an algorithm that converges to a stationary point. For that purpose, we must make sure that the learning rate abides by the requirement $\alpha \le \frac{1}{\kappa_f}$. Since this factor cannot easily be estimated, a practical solution is to decay $\alpha$ during the optimization process. However, to converge, the ratio $\frac{\varepsilon}{\alpha}$ must also decay to zero. This means that $\varepsilon$ must decay to zero faster than $\alpha$. Algorithm 2 provides both of these conditions, so it is guaranteed to converge to a local minimum as $\varepsilon \to 0$. This analysis assumes a Lipschitz continuous gradient; but since EGL is also designed for non-smooth functions, our practical algorithm relaxes the requirement for strictly monotonically decreasing steps. Instead, it computes a running mean over the last candidates to determine when to decrease $\alpha$ and $\varepsilon$. The complete practical EGL algorithm is found in Appendix Sec. E. It also includes input and output mapping and scaling, as described in the next section.

**Algorithm 2: Convergent EGL**

    Input: x_0, α, ε, γ_α < 1, γ_ε < 1, ε*
    k = 0
    while ε ≥ ε* do
        Build model:
            Collect data {(x_i, y_i)}_{i=1}^m,  x_i ∈ V_ε(x_k)
            Learn a local model g_ε(x_k)
        Gradient descent:
            x_{k+1} ← x_k − α g_ε(x_k)
        if f(x_{k+1}) > f(x_k) − 2.25 ε²/α then
            α ← γ_α α
            ε ← γ_α γ_ε ε
        k ← k + 1

(A runnable sketch of this loop appears below.)
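The following sketch runs Algorithm 2 end to end. For brevity, the local model is the closed-form LS mean-gradient rather than a trained NN, and all constants (decay factors, sample count, stopping threshold) are illustrative assumptions.

```python
import numpy as np

def convergent_egl(f, x0, alpha=0.1, eps=1.0, g_alpha=0.9, g_eps=0.97,
                   eps_min=1e-3, m=64, rng=np.random.default_rng(0)):
    """Runnable sketch of Algorithm 2 (Convergent EGL)."""
    x = np.asarray(x0, dtype=float)
    fx = f(x)
    while eps >= eps_min:
        # Build model: sample m points inside the eps-ball around x.
        tau = rng.normal(size=(m, x.size))
        tau /= np.linalg.norm(tau, axis=1, keepdims=True)
        tau *= eps * rng.uniform(size=(m, 1)) ** (1.0 / x.size)
        ys = np.array([f(x + t) for t in tau])
        # Learn a local model g_eps(x): here the LS mean-gradient.
        g = np.linalg.lstsq(tau, ys - fx, rcond=None)[0]
        # Gradient descent step.
        x_next = x - alpha * g
        f_next = f(x_next)
        # Decay alpha and eps when the improvement misses Theorem 2's bound.
        if f_next > fx - 2.25 * eps ** 2 / alpha:
            alpha *= g_alpha
            eps *= g_alpha * g_eps
        x, fx = x_next, f_next
    return x, fx

# Example: minimize a shifted quadratic in 5D.
x_best, y_best = convergent_egl(lambda z: float(np.sum((z - 3.0) ** 2)),
                                np.zeros(5))
```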
### 4.3. Dynamic Mappings

Unlike supervised learning with a fixed dataset, BBO is a dynamic problem: the input and output statistics change as the optimization progresses. There are several sources of this drift. First, traversing via $g_\varepsilon$ changes the input's first moment, by moving the center of the samples $x_i$, and the output's first moment, by collecting smaller costs $y_i$. At the same time, squeezing $\varepsilon$ and $\alpha$ over time decreases the second-moment statistics by reducing the variety in $\{(x_i, y_i)\}$. NNs are sensitive to such distribution changes (Brownlee, 2018) and require tweaking hyperparameters, such as the learning rate and the initial weight distribution, to maintain high performance. Default values (e.g., a learning rate of $10^{-3}$) usually work best when the input data is normalized. To regulate the statistics over the entire optimization process, we apply a method of double dynamic mappings, for both input and output values.

From the input perspective, our dynamic mapping resembles trust-region methods (Conn et al., 2000; Nocedal & Wright, 2006). Instead of searching for $x^*$ in the entire domain $\Omega$ by reducing $\varepsilon$ and $\alpha$ over time, we fix $\varepsilon$ and $\alpha$ and search for $x^*_j$ in a sub-region $\Omega_j$. After finding the best candidate solution in this sub-region, we shrink $\Omega_j$ by a factor of $\gamma_\alpha > 0$ and $\varepsilon$ by a factor of $\gamma_\varepsilon > 0$. To keep the input statistics regulated, we maintain a bijective mapping $h_j : \Omega_j \to \mathbb{R}^n$ that scales the $x$ values to an unconstrained domain $\tilde{x}$ with approximately constant first and second moments. In this work, $\Omega$ is assumed to be a rectangular box that denotes the upper and lower bounds of each entry of $x$. Thus we consider element-wise bijective mappings of the form $h_j(x) = (h^1_j(x^1), \ldots, h^n_j(x^n))$, and each new sub-region $\Omega_{j+1} \subset \Omega_j$ is a smaller rectangular box centered at $x^*_j$. In the unconstrained domain $\tilde{x}$ we can keep the initial learning rate; yet, due to the squeezing factor, the effective learning rate decays by the factor $\gamma_\alpha$. Equivalently, the effective accuracy parameter $\varepsilon$ is reduced by a factor of $\gamma_\alpha \gamma_\varepsilon$. In other words, we use the same NN parameters to learn a zoomed-in version of the original objective function.

To regulate the output statistics, we define a scalar, monotonically increasing, invertible mapping $r_k(y)$ which maps the $y_i$ samples in the dataset $D_k$ to approximately constant statistics. Since traversing with $g_\varepsilon$ can significantly change the statistics even within the same sub-region, $r_k$ must be dynamic and cannot be held fixed over an entire sub-problem $j$.

Combining both input and output mappings, we obtain a modified set of samples $\{(\tilde{x}_i, \tilde{y}_i)\}_i = \{(h_j(x_i), r_k(y_i))\}_i$, so effectively we learn a modified mean-gradient, denoted $\tilde{g}_{\varepsilon_{jk}}$, of a compressed and scaled function $\tilde{y} = \tilde{f}_{jk}(\tilde{x}) = r_k \circ f \circ h_j^{-1}(\tilde{x})$. If both the input and output mappings are linear, then the true mean-gradient is proportional to the modified mean-gradient:

$$g_\varepsilon(x) \propto \nabla h_j(x) \odot \tilde{g}_{\varepsilon_{jk}}(\tilde{x}) \tag{7}$$

Even when the mappings are only approximately linear inside $V_{\varepsilon_{jk}}(\tilde{x}_k)$, $g_\varepsilon(x_k)$ can be estimated according to Eq. (7). Hence, to maintain a controllably accurate model, we seek mappings that preserve linearity as much as possible. On the other hand, strictly linear mappings may be insufficient, since they are susceptible to outliers. After experimenting with several functions, we found a sweet spot: a composition of a linear mapping followed by a squash function that compresses only the outliers (see details in Appendix Sec. D). With such mappings we ensure both that the model is controllably accurate and that the learning hyperparameters remain adequate over the entire optimization process.

### 4.4. Spline Embedding

Many Black-Box applications, including the experiments in Sec. 5, have no clear inductive bias that can be exploited the way, e.g., CNNs suit natural images (Cohen & Shashua, 2016). In such cases it is common to use feed-forward NNs. However, we found that the quality of fitting objectives and gradients with such NNs is unsatisfactory, especially for low-dimensional problems. One method to augment the input layer is through learnable embeddings (Zhang et al., 2016). However, embeddings are usually applied to categorical input and do not preserve order. Since our input domain is continuous, we wish to design Lipschitz continuous learnable embeddings $s_\theta(x)$, s.t. for two inputs $x_1$ and $x_2$ the calculated embedding satisfies $\|s(x_1) - s(x_2)\| \le \kappa_s \|x_1 - x_2\|$ for some $\kappa_s > 0$. We chose to do so by learning a set of one-dimensional splines (Reinsch, 1967). A spline is a piece-wise polynomial defined on a set of disjoint intervals, with smoothness conditions at the intersection points, denoted knots. Learnable splines were also suggested by Fey et al. (2018) for the problem of learning over irregular grids. We used piece-wise linear splines of the form

$$s_\theta(x) = \frac{\theta_i}{h_i}(t_{i+1} - x) + \frac{\theta_{i+1}}{h_i}(x - t_i), \quad x \in [t_i, t_{i+1}], \tag{8}$$

where $\theta$ has $k + 1$ elements ($k$ is the number of knots), $t_i$ is the position of the $i$-th knot, and $h_i = t_{i+1} - t_i$. Instead of fitting $\theta$ to the data, as usually done in classical spline applications, we learn these parameters as part of the optimization process, similarly to categorical embeddings. Each entry of the input vector is expanded by a parameterization of several one-dimensional splines. We found this architecture particularly suitable for modeling complex functions defined over unstructured input domains. Please refer to Appendix Sec. C for a full description and an empirical evaluation of fitting objective functions with spline-net. (A minimal sketch of such a spline embedding follows.)
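A minimal PyTorch sketch of a learnable piece-wise linear spline embedding per Eq. (8); the uniform knot grid, the per-coordinate width `d`, and the initialization are our own choices, not the paper's spline-net.

```python
import torch
import torch.nn as nn

class LinearSpline(nn.Module):
    """Learnable piece-wise linear spline embedding (Eq. (8)), sketched.

    Each of the n input coordinates is expanded into d spline values; each
    spline has k + 1 learnable knot heights theta on a fixed uniform grid.
    """
    def __init__(self, n, d=8, k=16, lo=-1.0, hi=1.0):
        super().__init__()
        self.register_buffer("t", torch.linspace(lo, hi, k + 1))  # knots t_i
        self.theta = nn.Parameter(0.01 * torch.randn(n, d, k + 1))

    def forward(self, x):                      # x: (batch, n)
        h = self.t[1] - self.t[0]              # uniform interval length h_i
        i = ((x - self.t[0]) / h).floor().clamp(0, len(self.t) - 2).long()
        t_i = self.t[i]                        # left knot of each interval
        th = self.theta.unsqueeze(0).expand(x.size(0), -1, -1, -1)
        idx = i.unsqueeze(-1).unsqueeze(-1).expand(-1, -1, th.size(2), 1)
        th_i = th.gather(3, idx).squeeze(-1)       # theta_i per (batch, n, d)
        th_i1 = th.gather(3, idx + 1).squeeze(-1)  # theta_{i+1}
        w = ((x - t_i) / h).unsqueeze(-1)          # interpolation weight
        out = th_i * (1 - w) + th_i1 * w           # Eq. (8) as interpolation
        return out.flatten(1)                      # (batch, n * d) embedding
```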
## 5. Empirical Evaluation

In this section we describe the rigorous empirical analysis of EGL against various BBO methods in the COCO test suite and in a search task over the latent space of a GAN model. The code is available at http://github.com/MorSinay/BBO.

### 5.1. The COCO test suite

We tested EGL on the COCO test suite, a platform for systematic comparison of real-parameter global optimizers. COCO provides Black-Box functions in several dimensions (2, 3, 5, 10, 20, 40), where each dimension comprises 360 distinct problems. To test higher dimensions, we created an extra set of 784D problems by composing a pre-trained encoder (Kingma & Welling, 2013) of Fashion-MNIST images, each with 784 pixels (Xiao et al., 2017), with a 10D COCO problem, as follows: $f_{784D}(x) := f_{10D}(\text{Encoder}(x))$.

We compared EGL with seven baselines implemented in the Scipy and cma Python packages, Nelder-Mead, SLSQP, POWELL, CG, COBYLA, BFGS and CMA-ES, and with our IGL implementation based on the DDPG approach. For the 784D problems we evaluated only CG, CMA-ES and IGL, which yielded results in a reasonable amount of time. We examined two network models. To compare against the other baselines (Figures 3, 4 and 5(a)), we used our spline embedding model (details in Appendix Sec. C). Since spline-net is more time consuming, for the ablation tests in Fig. 5(b-d) we used a lighter Fully Connected (FC) net. For additional information, including a list of hyperparameters, refer to Appendix Sec. F.

*[Figure 3. Comparing the success rate of EGL and other BBO algorithms for a budget C = 150·10³.]*

Fig. 3 presents the success rate of each algorithm with respect to the problem dimension, for a budget of $C = 150 \cdot 10^3$. A single test-run is considered successful if: (1) $y_{best} - y^* \le 1$; and (2) $\frac{y_{best} - y^*}{y_0 - y^*} \le 10^{-2}$. Here, $y_0$ is the initial score $f(x_0)$, $y_{best} = \min_k y_k$ is the best-observed value for that run, and $y^*$ is the minimal value obtained over all the baselines' test-runs. This definition guarantees that a run is successful only if its best-observed value is near $y^*$, both in terms of absolute distance and in terms of relative improvement with respect to the initial score. The results show that EGL outperforms all other baselines in every dimension. Most importantly, as the dimension increases, the performance gap between EGL and all other baselines grows larger.

In Fig. 4 we present two performance profiles (Dolan & Moré, 2002) for the 40D and 10D problem sets: (1) Time-To-Solution (TTS); and (2) Best-Observed (BO). TTS measures the percentage of solved problems with respect to the time step. We find that both EGL and IGL have a cold-start behavior. This is a result of our design choice to start the training with a warm-up phase, in which we evaluate 384 points around $x_0$ and train the networks before executing the first gradient descent step. However, as the training process progresses, EGL outperforms all other methods and peaks at $C = 150 \cdot 10^3$ with the success rate of Fig. 3. The BO profile measures the percentage of problems whose best-observed value after $150 \cdot 10^3$ time steps is better than $y \in [y_0, y^*]$. We find that for every possible threshold $y$, EGL solves more problems than any other method.

*[Figure 4. Performance profiles for 10D and 40D: Time-To-Solution (TTS) and Best-Observed (BO) after 150·10³ steps. In BO, the x-axis linearly maps [y0, y*] to [0, 1].]*

Fig. 5 visualizes the learning process of each method. To do so, we first calculate a scaled distance between the best value at time $t$ and the optimal value:

$$\bar{y}^t_{best} = \frac{\min_{k \le t} y_k - y^*}{y_0 - y^*}.$$

We then average this quantity, for each $t$, over all runs in the same dimension's problem set. This distance is scaled to $[0, 1]$ and the results are presented on a log-log scale.

*[Figure 5. The scaled distance ȳ^t_best as a function of t ∈ [1, ..., C] for: (a) EGL and baselines on 40D; (b) trust-region and output-mapping ablation test; (c) EGL with different m samples and baselines on 784D; (d) the perturbed mean-gradient and long replay buffer on 40D.]*
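Both evaluation metrics above reduce to a few lines of code; this sketch of ours computes the success test and the scaled distance for one run (the argument names are placeholders).

```python
import numpy as np

def run_metrics(ys, y_star):
    """Success test and scaled distance (Sec. 5.1), sketched.

    ys: observed values y_0..y_T of one run; y_star: the best value
    found by any baseline on that problem.
    """
    ys = np.asarray(ys, dtype=float)
    y0, y_best = ys[0], ys.min()
    success = (y_best - y_star <= 1.0) and \
              ((y_best - y_star) / (y0 - y_star) <= 1e-2)
    # Scaled distance of the best-so-far value at each step t.
    y_best_t = (np.minimum.accumulate(ys) - y_star) / (y0 - y_star)
    return success, y_best_t
```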
Fig. 5(a) presents $\bar{y}^t_{best}$ on the 40D set. The elbow pattern at step 384 marks the switch from $x_0$ to $x_1$. Unlike the baselines, which settle on a local minimum, EGL monotonically decreases during the entire optimization process. It overtakes the best baseline (CMA-ES) after $10^4$ steps. Fig. 5(b) demonstrates the advantage of output-mapping (OM) and trust-region (TR). Here, we compared an FC net trained with both OM and TR (FC-TR-OM) against the variations FC, FC-TR and FC-OM. The results show a clear advantage to using both OM and TR; yet, while OM is crucial throughout the entire optimization process, the TR advantage materializes only near the minimal value. In addition, Fig. 5(b) shows the gain of the spline-net (SPLINE) on top of FC-TR-OM.

The next experiment is executed on the high-dimensional 784D problem set. In Fig. 5(c) we compare the performance for different numbers of exploration points, $m \in \{64, 800\}$, against the CMA-ES, CG and IGL baselines. While Theorem 1 guarantees a controllably accurate model only when sampling $m \ge n + 1$ points, remarkably, EGL generalized to an outstanding performance and outperformed all other baselines even for $m = 64 \ll n$. Nevertheless, as expected, more exploration points converged to a better final value. In practice, the choice of $m$ should correspond to the allocated budget size: more exploration around each candidate comes at the cost of fewer gradient descent steps; thus, for low budgets, one should typically choose a small $m$.

Next, we tested the gain of two possible modifications to EGL: (1) perturbations with $g^p_\varepsilon$; and (2) training $g_\theta$ with a larger replay buffer (RB) holding exploration points from the last $L$ candidates. These tests were all executed with the same random seed. Fig. 5(d) presents the performance gain $(1 - \bar{y}^t_{best} / \bar{y}^t_{RB1,P0})$, i.e. relative to the unmodified run ($L = 1$, no perturbation), as a function of $t$ for several different runs. Small perturbations of $p = 0.01\varepsilon$ improved the performance (14%), possibly because they also regularize the NN training; yet, larger perturbations of $p = 0.1\varepsilon$ yielded inconsistent results. We also observed that a too-long RB of $L = 16$ hurt the performance considerably, probably since it adds high values to the RB, which leads to a more compressed output mapping. However, a moderate RB of $L = 4$ had a positive impact (10%), probably because more exploration points near the current candidate reduce the controllable accuracy factor $\kappa_g$, thus improving the accuracy of $g_\varepsilon$.

### 5.2. Searching the latent space of generative models

To examine EGL in a high-dimensional, complex, non-convex and noisy domain, we experimented with the task of searching the latent space of an image generative model (Volz et al., 2018).

*[Figure 6. Searching the latent space of generative models with EGL and IGL. Note that the target image is not revealed to the optimizer, only the face attributes and landmark points. The left-hand number is the average minimal value for each algorithm over 64 different problems.]*

Generative models learn to map a predefined latent distribution $z$ to a complex real-world distribution $x$ (e.g., images or audio). Given a trained Black-Box generator, while it is easy to sample from the distribution of $x$ by sampling from $z$, it is not straightforward to generate an image with some desired characteristics. For that purpose, one may apply a BBO to search the latent space for a hidden representation $z^*$ that generates an image with the desired traits $x^*$.

Here, we used a face generative model and optimized $z$ to generate an image with a required set of face attributes, landmark points and quality. We trained a generator and discriminator for the CelebA dataset (Liu et al., 2015), based on the BigGAN (Brock et al., 2018) architecture, and a classifier for the CelebA attributes. For the face landmark points, we used a pre-trained model (Kazemi & Sullivan, 2014). The BBO was trained to minimize the following objective (a hedged sketch of it follows below):

$$f_{a^* l^*}(z) = \lambda_a L_a(G(z)) + \lambda_l L_l(G(z)) + \lambda_g \tanh(D(G(z)))$$

where: (1) $L_a$ is the Cross-Entropy loss between the generated face attributes, as measured by the classifier, and the desired set of attributes $a^*$; (2) $L_l$ is the MSE between the generated landmark points and the desired set of landmarks $l^*$; and (3) $D(G(z))$ is the discriminator output, positive for low-quality images and negative for high-quality images.
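A torch sketch of this objective. `G`, `D`, `attr_classifier`, `landmarks`, the $\lambda$ weights, and the targets `a_star`, `l_star` are all hypothetical placeholders standing in for the paper's trained CelebA models, which are not reproduced here; the binary cross-entropy with logits is our stand-in for the attribute Cross-Entropy loss.

```python
import torch
import torch.nn.functional as F

def latent_objective(z, G, D, attr_classifier, landmarks, a_star, l_star,
                     lam_a=1.0, lam_l=1.0, lam_g=1.0):
    """Black-box cost f(z) for latent-space search (Sec. 5.2), sketched.

    All model arguments are hypothetical callables; lambda weights are
    illustrative, not the paper's values.
    """
    img = G(z)                                    # generated face image
    L_a = F.binary_cross_entropy_with_logits(attr_classifier(img), a_star)
    L_l = F.mse_loss(landmarks(img), l_star)      # landmark-point error
    quality = torch.tanh(D(img)).mean()           # positive for low quality
    return (lam_a * L_a + lam_l * L_l + lam_g * quality).item()
```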
Since each evaluation of $z$ is costly, we limited the budget $C$ to only $10^4$ evaluations, and the number of exploration points in each step was only $m = 32 \ll 512$. Such an $m$ violates the requirement in Theorem 1. However, we found that in practice, for a finite budget and high dimensions, it is better to take more gradient steps with a less accurate gradient. To increase the sample efficiency, we found that instead of sampling points in an $n$-ball around the candidate $x_k$, it is better to sample points in a cone with an apex at $x_k$ and an axis equal to the mean-gradient estimation at $x_k$, i.e. $g(x_k)$. We term this exploration strategy gradient-guided exploration; a minimal sketch of such a sampler follows. For further details, please refer to Appendix Sec. H.
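The construction below is our own stand-in for a cone sampler (the `spread` parameter and the axis-plus-jitter scheme are assumptions); the paper's exact scheme is described in its Appendix Sec. H.

```python
import numpy as np

def cone_samples(x_k, g_k, eps, m, spread=0.3, rng=np.random.default_rng(0)):
    """Gradient-guided exploration, sketched: sample inside a cone whose
    apex is x_k and whose axis is the mean-gradient estimate g_k.
    """
    n = x_k.size
    axis = g_k / (np.linalg.norm(g_k) + 1e-12)
    # Directions concentrated around the axis: axis plus isotropic jitter.
    d = axis + spread * rng.normal(size=(m, n)) / np.sqrt(n)
    d /= np.linalg.norm(d, axis=1, keepdims=True)
    r = eps * rng.uniform(size=(m, 1)) ** (1.0 / n)  # radii in the eps-ball
    return x_k + r * d
```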
Due to the high-dimensional problem, classic methods such as CG, and even CMA-ES, fail to generate satisfying faces (see Appendix Sec. G for sample images). In Fig. 6 we compare the images generated by EGL and IGL. Generally, the quality of the results depends on the target image's style. Face traits that are more frequent in the CelebA dataset lead to better face quality. Some faces, specifically with attributes such as a beard or a hat, are much harder to find, probably due to the suboptimality of the generator, which usually reduces the variety of images relative to the dataset (Bau et al., 2019). Nevertheless, the results show that EGL produces better images, both visually and according to the final cost value. We observed that for hard targets, IGL frequently fails to find any plausible solution, while EGL yields imperfect yet much more satisfactory candidates.

## 6. Conclusions

We presented EGL, a derivative-based BBO algorithm that achieves state-of-the-art results on a wide range of optimization problems. The essence of its success is a learnable function that estimates the mean-gradient with a controllable smoothness factor. Starting with a high smoothness factor lets EGL find global areas of the function with low valleys; gradually decreasing it lets EGL converge to a local minimum. The concept of EGL can be generalized to other related fields, such as sequential decision-making problems (i.e., Reinforcement Learning), by directly learning the gradient of the Q-function. We also demonstrated the use of EGL in an applicative high-dimensional Black-Box problem: searching the latent space of generative models.

## Acknowledgements

This work was supported in part by the Ministry of Science & Technology, Israel.

## References

Audet, C. and Hare, W. Derivative-Free and Blackbox Optimization. Springer, 2017.

Bäck, T. Evolutionary Algorithms in Theory and Practice: Evolution Strategies, Evolutionary Programming, Genetic Algorithms. Oxford University Press, 1996.

Balandat, M., Karrer, B., Jiang, D. R., Daulton, S., Letham, B., Wilson, A. G., and Bakshy, E. BoTorch: Programmable Bayesian Optimization in PyTorch. arXiv preprint arXiv:1910.06403, 2019.

Bardenet, R., Brendel, M., Kégl, B., and Sebag, M. Collaborative hyperparameter tuning. In International Conference on Machine Learning, pp. 199-207, 2013.

Bau, D., Zhu, J.-Y., Wulff, J., Peebles, W., Strobelt, H., Zhou, B., and Torralba, A. Seeing what a GAN cannot generate. In Proceedings of the IEEE International Conference on Computer Vision, pp. 4502-4511, 2019.

Bertsekas, D. P. and Scientific, A. Convex Optimization Algorithms. Athena Scientific, 2015.

Bonnans, J.-F., Gilbert, J. C., Lemaréchal, C., and Sagastizábal, C. A. Numerical Optimization: Theoretical and Practical Aspects. Springer Science & Business Media, 2006.

Brock, A., Donahue, J., and Simonyan, K. Large scale GAN training for high fidelity natural image synthesis. arXiv preprint arXiv:1809.11096, 2018.

Brownlee, J. Better Deep Learning: Train Faster, Reduce Overfitting, and Make Better Predictions. Machine Learning Mastery, 2018.

Chen, R., Menickelly, M., and Scheinberg, K. Stochastic optimization using a trust-region method and random models. Mathematical Programming, 169(2):447-487, 2018.

Cohen, N. and Shashua, A. Inductive bias of deep convolutional networks through pooling geometry. arXiv preprint arXiv:1605.06743, 2016.

Conn, A. R., Gould, N. I., and Toint, P. L. Trust Region Methods. SIAM, 2000.

Conn, A. R., Scheinberg, K., and Vicente, L. N. Introduction to Derivative-Free Optimization, volume 8. SIAM, 2009.

Dolan, E. D. and Moré, J. J. Benchmarking optimization software with performance profiles. Mathematical Programming, 91(2):201-213, 2002.

Fey, M., Lenssen, J. E., Weichert, F., and Müller, H. SplineCNN: Fast geometric deep learning with continuous B-spline kernels. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 869-877, 2018.

Flaxman, A. D., Kalai, A. T., and McMahan, H. B. Online convex optimization in the bandit setting: gradient descent without a gradient. arXiv preprint cs/0408007, 2004.

Golovin, D., Solnik, B., Moitra, S., Kochanski, G., Karro, J., and Sculley, D. Google Vizier: A service for black-box optimization. In Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 1487-1495, 2017.

Hansen, N. The CMA evolution strategy: a comparing review. In Towards a New Evolutionary Computation, pp. 75-102. Springer, 2006.

Hansen, N., Auger, A., Ros, R., Finck, S., and Pošík, P. Comparing results of 31 algorithms from the black-box optimization benchmarking BBOB-2009. In Proceedings of the 12th Annual Conference Companion on Genetic and Evolutionary Computation, pp. 1689-1696, 2010.

Hansen, N., Brockhoff, D., Mersmann, O., Tusar, T., Tusar, D., ElHara, O. A., Sampaio, P. R., Atamna, A., Varelas, K., Batu, U., Nguyen, D. M., Matzner, F., and Auger, A. COmparing Continuous Optimizers: numbbo/COCO on Github, March 2019. URL https://doi.org/10.5281/zenodo.2594848.

Hansen, P. and Jaumard, B. Lipschitz optimization. In Handbook of Global Optimization, pp. 407-493. Springer, 1995.

Kazemi, V. and Sullivan, J. One millisecond face alignment with an ensemble of regression trees. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1867-1874, 2014.
Kingma, D. P. and Welling, M. Auto-encoding variational Bayes. arXiv preprint arXiv:1312.6114, 2013.

Lee, J. D., Simchowitz, M., Jordan, M. I., and Recht, B. Gradient descent converges to minimizers. arXiv preprint arXiv:1602.04915, 2016.

Lillicrap, T. P., Hunt, J. J., Pritzel, A., Heess, N., Erez, T., Tassa, Y., Silver, D., and Wierstra, D. Continuous control with deep reinforcement learning. arXiv preprint arXiv:1509.02971, 2015.

Liu, Z., Luo, P., Wang, X., and Tang, X. Deep learning face attributes in the wild. In Proceedings of the International Conference on Computer Vision (ICCV), December 2015.

Maheswaranathan, N., Metz, L., Tucker, G., Choi, D., and Sohl-Dickstein, J. Guided evolutionary strategies: Augmenting random search with surrogate gradients. arXiv preprint arXiv:1806.10230, 2018.

Mania, H., Guy, A., and Recht, B. Simple random search provides a competitive approach to reinforcement learning. arXiv preprint arXiv:1803.07055, 2018.

Mnih, V., Kavukcuoglu, K., Silver, D., Rusu, A. A., Veness, J., Bellemare, M. G., Graves, A., Riedmiller, M., Fidjeland, A. K., Ostrovski, G., et al. Human-level control through deep reinforcement learning. Nature, 518(7540):529-533, 2015.

Nelder, J. A. and Mead, R. A simplex method for function minimization. The Computer Journal, 7(4):308-313, 1965.

Nesterov, Y. Introductory Lectures on Convex Optimization: A Basic Course, volume 87. Springer Science & Business Media, 2013.

Nocedal, J. and Wright, S. Numerical Optimization. Springer Science & Business Media, 2006.

Pan, Z., Yu, W., Yi, X., Khan, A., Yuan, F., and Zheng, Y. Recent progress on generative adversarial networks (GANs): A survey. IEEE Access, 7:36322-36333, 2019.

Powell, M. J. An efficient method for finding the minimum of a function of several variables without calculating derivatives. The Computer Journal, 7(2):155-162, 1964.

Powell, M. J. A view of algorithms for optimization without derivatives. Mathematics Today - Bulletin of the Institute of Mathematics and its Applications, 43(5):170-174, 2007.

Reinsch, C. H. Smoothing by spline functions. Numerische Mathematik, 10(3):177-183, 1967.

Rios, L. M. and Sahinidis, N. V. Derivative-free optimization: a review of algorithms and comparison of software implementations. Journal of Global Optimization, 56(3):1247-1293, 2013.

Rosemarin, H., Rosenfeld, A., and Kraus, S. Emergency department online patient-caregiver scheduling. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 33, pp. 695-701, 2019.

Ruder, S. An overview of gradient descent optimization algorithms. arXiv preprint arXiv:1609.04747, 2016.

Saremi, S. On approximating ∇f with neural networks. arXiv preprint arXiv:1910.12744, 2019.

Schulman, J., Levine, S., Abbeel, P., Jordan, M., and Moritz, P. Trust region policy optimization. In International Conference on Machine Learning, pp. 1889-1897, 2015.

Sener, O. and Koltun, V. Learning to guide random search. In International Conference on Learning Representations, 2020. URL https://openreview.net/forum?id=B1gHokBKwS.

Shewchuk, J. R. An introduction to the conjugate gradient method without the agonizing pain. Technical report, 1994.

Silver, D., Schrittwieser, J., Simonyan, K., Antonoglou, I., Huang, A., Guez, A., Hubert, T., Baker, L., Lai, M., Bolton, A., et al. Mastering the game of Go without human knowledge. Nature, 550(7676):354, 2017.
Snoek, J., Rippel, O., Swersky, K., Kiros, R., Satish, N., Sundaram, N., Patwary, M., Prabhat, M., and Adams, R. Scalable Bayesian optimization using deep neural networks. In International Conference on Machine Learning, pp. 2171-2180, 2015.

Van Laarhoven, P. J. and Aarts, E. H. Simulated annealing. In Simulated Annealing: Theory and Applications, pp. 7-15. Springer, 1987.

Vemula, A., Sun, W., and Bagnell, J. A. Contrasting exploration in parameter and action space: A zeroth-order optimization perspective. arXiv preprint arXiv:1901.11503, 2019.

Volz, V., Schrum, J., Liu, J., Lucas, S. M., Smith, A., and Risi, S. Evolving Mario levels in the latent space of a deep convolutional generative adversarial network. In Proceedings of the Genetic and Evolutionary Computation Conference, pp. 221-228, 2018.

Xiao, H., Rasul, K., and Vollgraf, R. Fashion-MNIST: a novel image dataset for benchmarking machine learning algorithms, 2017.

Yuan, X., He, P., Zhu, Q., and Li, X. Adversarial examples: Attacks and defenses for deep learning. IEEE Transactions on Neural Networks and Learning Systems, 30(9):2805-2824, 2019.

Zhang, W., Du, T., and Wang, J. Deep learning over multi-field categorical data. In European Conference on Information Retrieval, pp. 45-57. Springer, 2016.

Zhen, L., Wang, K., Hu, H., and Chang, D. A simulation optimization framework for ambulance deployment and relocation problems. Computers & Industrial Engineering, 72:12-23, 2014.