LEARNING TO GUIDE RANDOM SEARCH

Ozan Sener (Intel Labs)    Vladlen Koltun (Intel Labs)

Published as a conference paper at ICLR 2020

ABSTRACT

We are interested in derivative-free optimization of high-dimensional functions. The sample complexity of existing methods is high and depends on problem dimensionality, unlike the dimensionality-independent rates of first-order methods. The recent success of deep learning suggests that many datasets lie on low-dimensional manifolds that can be represented by deep nonlinear models. We therefore consider derivative-free optimization of a high-dimensional function that lies on a latent low-dimensional manifold. We develop an online learning approach that learns this manifold while performing the optimization. In other words, we jointly learn the manifold and optimize the function. Our analysis suggests that the presented method significantly reduces sample complexity. We empirically evaluate the method on continuous optimization benchmarks and high-dimensional continuous control problems. Our method achieves significantly lower sample complexity than Augmented Random Search, Bayesian optimization, covariance matrix adaptation (CMA-ES), and other derivative-free optimization algorithms.

1 INTRODUCTION

A typical approach to machine learning problems is to define an objective function and optimize it over a dataset. First-order optimization methods are widely used for this purpose since they can scale to high-dimensional problems and their convergence rates are independent of problem dimensionality in most cases. However, gradients are not available in many important settings such as control, black-box optimization, and interactive learning with humans in the loop. Derivative-free optimization (DFO) can be used to tackle such problems. The challenge is that the sample complexity of DFO scales poorly with the problem dimensionality. The design of DFO methods that solve high-dimensional problems with low sample complexity is a major open problem.

The success of deep learning methods suggests that high-dimensional data that arises in real-world settings can commonly be represented in low-dimensional spaces via learned nonlinear features. In other words, while the problems of interest are high-dimensional, the data typically lies on low-dimensional manifolds. If we could perform the optimization directly in the manifold instead of the full space, intuition suggests that we could reduce the sample complexity of DFO methods, since their convergence rates are generally a function of the problem dimensionality (Nesterov & Spokoiny, 2017; Dvurechensky et al., 2018).

In this paper, we focus on high-dimensional data distributions that are drawn from low-dimensional manifolds. Since the manifold is typically not known prior to the optimization, we pose the following question. Can we develop an adaptive derivative-free optimization algorithm that learns the manifold in an online fashion while performing the optimization?

There exist DFO methods that aim to identify a low-dimensional search space (Maheswaranathan et al., 2019; Choromanski et al., 2019). However, they are limited to linear subspaces. In contrast, we propose to use expressive nonlinear models (specifically, neural networks) to represent the manifold. Our approach not only increases expressivity but also enables utilization of domain knowledge concerning the geometry of the problem.
For example, if the function of interest is known to be translation invariant, convolutional networks can be used to represent the underlying manifold structure. On the other hand, the high expressive power and flexibility bring challenges. Our approach requires solving for the parameters of the nonlinear manifold at each iteration of the optimization. To address this, we develop an efficient online method that learns the underlying manifold while the function is being optimized.

We specifically consider random search methods and extend them to the nonlinear manifold learning setting. Random search methods choose a set of random directions and perturb the current iterate in these directions. Differences of the function values computed at the perturbed points are used to compute an estimator of the gradient of the function. We first extend this to random search over a known manifold and show that sampling directions in the tangent space of the manifold provides a similar estimate. We then propose an online learning method that estimates this manifold while jointly performing the optimization. We theoretically analyze sample complexity and show that our method reduces it. We conduct extensive experiments on continuous control problems, continuous optimization benchmarks, and gradient-free optimization of an airfoil. The results indicate that our method significantly outperforms state-of-the-art derivative-free optimization algorithms from multiple research communities.

2 PRELIMINARIES

We are interested in high-dimensional stochastic optimization problems of the form
$$\min_{x \in \mathbb{R}^d} f(x) = \mathbb{E}_\xi[F(x, \xi)], \qquad (1)$$
where $x$ is the optimization variable and $f : \mathbb{R}^d \to \mathbb{R}$ is the function of interest, defined as an expectation over a noise variable $\xi$. We assume that the stochastic function is bounded ($|F(x, \xi)| \le \Omega$), $L$-Lipschitz, and $\mu$-smooth¹ with respect to $x$ for all $\xi$, and has uniformly bounded variance ($\mathbb{E}_\xi[(F(x, \xi) - f(x))^2] \le V_F$).

¹ $L$-Lipschitz and $\mu$-smooth: $|f(x) - f(y)| \le L\|x - y\|_2$ and $\|\nabla_x f(x) - \nabla_x f(y)\|_2 \le \mu\|x - y\|_2$ for all $x, y$.

In DFO, we have no access to the gradients. Instead, we only have zeroth-order access by evaluating the function $F$ (i.e. sampling $F(x, \xi)$ for the input $x$). We are specifically interested in random search methods, in which an estimate of the gradient is computed using function evaluations at points randomly sampled around the current iterate. Before we formalize this, we introduce some definitions. Denote the $d$-dimensional unit sphere and unit ball by $S^{d-1}$ and $B^d$, respectively. We define a smoothed function following Flaxman et al. (2005). For a function $f : \mathbb{R}^d \to \mathbb{R}$, its $\delta$-smoothed version is $\hat f(x) = \mathbb{E}_{v \sim B^d}[f(x + \delta v)]$.

The main workhorse of random search is the following result by Flaxman et al. (2005). Let $s$ be a random vector sampled from the uniform distribution over $S^{d-1}$. Then $f(x + \delta s)\,s$ is an unbiased estimate of the gradient of the smoothed function:
$$\mathbb{E}_{\xi,\, s \sim S^{d-1}}[F(x + \delta s, \xi)\, s] = \frac{\delta}{d}\, \nabla_x \hat f(x). \qquad (2)$$
We use antithetic samples since this is known to decrease variance (Kroese et al., 2013) and define the final gradient estimator as $y(x, s) = (F(x + \delta s, \xi) - F(x - \delta s, \xi))\, s$. Extending (2) to the antithetic case, $\mathbb{E}_{\xi,\, s \sim S^{d-1}}[y(x, s)] = \frac{2\delta}{d}\, \nabla_x \hat f(x)$.

A simple way to optimize the function of interest is to use the gradient estimate in stochastic gradient descent (SGD), as summarized in Algorithm 1. This method has been analyzed in various forms and its convergence is well characterized for nonconvex smooth functions. We restate the convergence rate and defer the constants and proof to Appendix A.2.
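As a concrete illustration of this estimator and the SGD loop of Algorithm 1, here is a minimal NumPy sketch. It is a sketch under stated assumptions, not the reference implementation: `f` is any black-box scalar function, the averaging over `k` directions and all names are illustrative.

```python
import numpy as np

def grad_est(f, x, delta, k=1, rng=None):
    """Antithetic random-search estimate of the gradient of the
    delta-smoothed version of f at x (GRADEST in Algorithm 1)."""
    rng = rng or np.random.default_rng()
    d = x.shape[0]
    g = np.zeros(d)
    directions = []
    for _ in range(k):
        s = rng.standard_normal(d)
        s /= np.linalg.norm(s)                     # uniform direction on S^{d-1}
        y = f(x + delta * s) - f(x - delta * s)    # antithetic pair of queries
        g += (d / (2.0 * delta)) * y * s
        directions.append(s)
    return g / k, directions                       # average over the k directions

# Plugging the estimate into SGD (Algorithm 1):
# x = x - alpha * grad_est(f, x, delta=0.05)[0]
```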
Proposition 1 (Flaxman et al., 2005; Vemula et al., 2019). Let $f(x)$ be differentiable, $L$-Lipschitz, and $\mu$-smooth. Consider running random search (Algorithm 1) for $T$ steps. Let $k = 1$ for simplicity. Then
$$\frac{1}{T}\sum_{t=1}^{T} \mathbb{E}\|\nabla_x f(x_t)\|_2^2 \le O\!\left(d\, T^{-\frac{1}{2}} + d^{\frac{2}{3}}\, T^{-\frac{2}{3}}\right).$$

3 ONLINE LEARNING TO GUIDE RANDOM SEARCH

Proposition 1 implies that the sample complexity of random search scales linearly with the dimensionality. This dependency is problematic when the function of interest is high-dimensional. We argue that in many practical problems, the function of interest lies on a low-dimensional nonlinear manifold. This structural assumption will allow us to significantly reduce the sample complexity of random search, without knowing the manifold a priori.

Algorithm 1 Random Search
1: for t = 1 to T do
2:    $g_t, \_$ = GRADEST$(x_t, \delta)$
3:    $x_{t+1} = x_t - \alpha g_t$
4: end for

1: procedure GRADEST$(x, \delta)$
2:    Sample: $s_1, \ldots, s_k \sim S^{d-1}$
3:    Query: $y_i = f(x + \delta s_i) - f(x - \delta s_i)$
4:    Estimator: $g = \frac{d}{2\delta} \sum_{i=1}^{k} y_i s_i$
5:    return $g, \{s_i\}_{i \in [k]}$
6: end procedure

Algorithm 2 Manifold Random Search
1: for t = 1 to T do
2:    $g_t, \_$ = MANIFOLDGRADEST$(x_t, J(x_t; \theta^*), \delta)$
3:    $x_{t+1} = x_t - \alpha g_t$
4: end for

1: procedure MANIFOLDGRADEST$(x, J, \delta)$
2:    Normalize: $J_q$ = GramSchmidt$(J)$
3:    Sample: $s_1, \ldots, s_k \sim S^{n-1}$
4:    Query: $y_i = f(x + \delta J_q s_i) - f(x - \delta J_q s_i)$
5:    Estimator: $g = \frac{n}{2\delta} \sum_{i=1}^{k} y_i J_q s_i$
6:    return $g, \{J_q s_i\}_{i \in [k]}$
7: end procedure

Assume that the function of interest is defined on an $n$-dimensional manifold ($n \ll d$) and this manifold can be defined via a nonlinear parametric family (e.g. a neural network). Formally, we are interested in derivative-free optimization of functions with the following properties:
- Smoothness: $F(\cdot, \xi) : \mathbb{R}^d \to \mathbb{R}$ is $\mu$-smooth and $L$-Lipschitz for all $\xi$.
- Manifold: $F(\cdot, \xi)$ is defined on an $n$-dimensional manifold $\mathcal{M}$ for all $\xi$.
- Representability: The manifold $\mathcal{M}$ and the function of interest can be represented using parametrized function classes $r(\cdot; \theta)$ and $g(\cdot; \psi)$. Formally, given $\xi$, there exist $\theta^*$ and $\psi^*$ such that $F(x, \xi) = g(r(x; \theta^*); \psi^*)$ for all $x \in \mathbb{R}^d$.

We will first consider an idealized setting where the manifold is already known (i.e. we know $\theta^*$). Then we will extend the developed method to the practical setting where the manifold is not known in advance and must be estimated with no prior knowledge as the optimization progresses.

3.1 WARM-UP: RANDOM SEARCH OVER A KNOWN MANIFOLD

If the manifold is known a priori, we can perform random search directly over the manifold instead of the full space. Consider the chain rule applied to $g(r(x; \theta^*); \psi^*)$:
$$\nabla_x f(x) = J(x; \theta^*)\, \nabla_r g(r; \psi^*),$$
where $J(x; \theta^*) = \partial r(x; \theta^*)/\partial x$ and $r = r(x; \theta^*)$. The gradient of the function of interest lies in the column space of the Jacobian of the parametric family. In light of this result, we can perform random search in the column space of the Jacobian, which has lower dimensionality than the full space. For numerical stability, we first orthonormalize the Jacobian using the Gram-Schmidt procedure and perform the search in the column space of the resulting orthonormal matrix, since it spans the same space. We denote the orthonormalized version of $J(x; \theta^*)$ by $J_q(x; \theta^*)$. In order to perform random search, we sample an $n$-dimensional vector uniformly ($s \sim S^{n-1}$) and lift it to the input space via $J_q(x; \theta^*)\, s$.
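The following sketch illustrates MANIFOLDGRADEST: directions are sampled on $S^{n-1}$ and lifted through an orthonormalized Jacobian. We use a reduced QR factorization as a numerically stable stand-in for Gram-Schmidt (both yield an orthonormal basis of the same column space); the `J` layout (d x n) and all names are assumptions for illustration.

```python
import numpy as np

def manifold_grad_est(f, x, J, delta, k=1, rng=None):
    """Random search restricted to the tangent space spanned by the
    columns of the Jacobian J (d x n)."""
    rng = rng or np.random.default_rng()
    n = J.shape[1]
    Jq, _ = np.linalg.qr(J)                    # orthonormalized Jacobian, d x n
    g = np.zeros(x.shape[0])
    lifted = []
    for _ in range(k):
        s = rng.standard_normal(n)
        s /= np.linalg.norm(s)                 # uniform direction on S^{n-1}
        u = Jq @ s                             # lift the direction to R^d
        y = f(x + delta * u) - f(x - delta * u)
        g += (n / (2.0 * delta)) * y * u
        lifted.append(u)
    return g / k, lifted
```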
As a consequence of the manifold Stokes theorem, using the lifted vector as a random direction gives an unbiased estimate of the gradient of the smoothed function:
$$\mathbb{E}_{\xi,\, s \sim S^{n-1}}[y(x, J_q(x; \theta^*)\, s)] = \frac{2\delta}{n}\, \nabla_x \tilde f_{\theta^*}(x), \qquad (3)$$
where the smoothed function is defined as $\tilde f_{\theta^*}(x) = \mathbb{E}_{v \sim B^n}[f(x + \delta J_q(x; \theta^*)\, v)]$. We show this result as Lemma 1 in Appendix A.1. We use the resulting gradient estimate in SGD. The following proposition summarizes the sample complexity of this method. The constants and the proof are given in Appendix A.2.

Proposition 2. Let $f(x)$ be differentiable, $L$-Lipschitz, and $\mu$-smooth. Consider running manifold random search (Algorithm 2) for $T$ steps. Let $k = 1$ for simplicity. Then
$$\frac{1}{T}\sum_{t=1}^{T} \mathbb{E}\|\nabla_x f(x_t)\|_2^2 \le O\!\left(n\, T^{-\frac{1}{2}} + n^{\frac{2}{3}}\, T^{-\frac{2}{3}}\right).$$

3.2 JOINT OPTIMIZATION AND MANIFOLD LEARNING

When $n \ll d$, the reduction in the sample complexity of random search (summarized in Proposition 2) is significant. However, the setting of Algorithm 2 and Proposition 2 is impractical since the manifold is generally not known a priori. We thus propose to minimize the function and learn the manifold jointly. In other words, we start with an initial guess of the parameters and solve for them at each iteration using all function evaluations that have been performed so far.

Our major objective is to improve the sample efficiency of random search. Hence, minimizing the sample complexity with respect to the manifold parameters is an intuitive way to approach the problem. We analyze the sample complexity of SGD with biased gradients in Appendix A.3.1 and show the following informal result. Consider running manifold random search with a sequence of manifold parameters $\theta_1, \psi_1, \ldots, \theta_T, \psi_T$ for $T$ steps. Then the additional suboptimality caused by biased gradients, defined via SUBOPTIMALITY $= \frac{1}{T}\sum_{t=1}^{T} \mathbb{E}\|\nabla_x f(x_t)\|_2^2$, is bounded as
$$\text{SUBOPTIMALITY}(\theta_t, \psi_t) \le \text{SUBOPTIMALITY}(\theta^*, \psi^*) + \frac{\Omega}{T}\sum_{t=1}^{T} \|\nabla_x \tilde f_{\theta^*}(x_t) - \nabla_x g(r(x_t; \theta_t); \psi_t)\|_2, \qquad (4)$$
where SUBOPTIMALITY$(\theta^*, \psi^*)$ is the suboptimality of the oracle case (Algorithm 2).

Our aim is to minimize the additional suboptimality with respect to $\theta_t$ and $\psi_t$. However, we do not have access to $\nabla_x f(x)$ since we are in a derivative-free setting. Hence we cannot directly minimize (4). At each iteration, we observe $y(x_t, s_t)$. Moreover, $y(x_t, s_t) = 2\delta\, s_t^\top \nabla_x F(x_t, \xi) + O(\delta^2)$ due to the smoothness. Since we observe the projection of the gradient onto the chosen directions, we minimize the projection of (4) onto these directions. Formally, we define our one-step loss as
$$L(x_t, s_t, \theta_t, \psi_t) = \left(\frac{y(x_t, s_t)}{2\delta} - s_t^\top \nabla_x g(r(x_t; \theta_t); \psi_t)\right)^2. \qquad (5)$$
We use the Follow the Regularized Leader (FTRL) algorithm (Hazan, 2016; Shalev-Shwartz, 2012) to minimize the aforementioned loss function and learn the manifold parameters:
$$\theta_{t+1}, \psi_{t+1} = \arg\min_{\theta, \psi} \sum_{i=1}^{t} L(x_i, s_i, \theta, \psi) + \lambda R(\theta, \psi), \qquad (6)$$
where the regularizer $R(\theta, \psi) = \|\nabla_x g(r(x_t; \theta_t); \psi_t) - \nabla_x g(r(x_t; \theta); \psi)\|_2$ is a temporal smoothness term that penalizes sudden changes in the gradient estimates.

Algorithm 3 summarizes our algorithm. We add exploration by sampling a mix of directions from the manifold and the full space. In each iteration, we sample directions and produce two gradient estimates $g_m, g_e$ using the samples from the tangent space and the full space, respectively. We mix them to obtain the final estimate $g = (1 - \beta) g_m + \beta g_e$. We discuss the implementation details of the FTRL step in Section 4. In our theoretical analysis, we assume that (6) can be solved optimally.
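The following PyTorch sketch shows one way to realize the one-step loss (5) and the FTRL objective (6). It is a sketch under stated assumptions, not the released implementation: `r_net`, `g_net`, the history layout, and the regularizer callable are illustrative.

```python
import torch

def one_step_loss(r_net, g_net, x, s, y, delta):
    """One-step loss (5): match the projected model gradient
    s^T grad_x g(r(x; theta); psi) to the measurement y / (2 delta)."""
    x = x.clone().requires_grad_(True)
    out = g_net(r_net(x)).sum()               # scalar g(r(x; theta); psi)
    (grad_x,) = torch.autograd.grad(out, x, create_graph=True)
    return (y / (2.0 * delta) - s @ grad_x) ** 2

def ftrl_objective(r_net, g_net, history, delta, lam, reg_fn):
    """FTRL objective (6): cumulative loss over all past observations
    plus the temporal-smoothness regularizer R(theta, psi)."""
    # history holds (x_i, s_i, y_i) triples for every direction queried so far
    loss = sum(one_step_loss(r_net, g_net, x, s, y, delta)
               for (x, s, y) in history)
    return loss + lam * reg_fn()

# Mixed gradient estimate with exploration weight beta (Algorithm 3):
# g = (1 - beta) * g_manifold + beta * g_fullspace
```

Note that `create_graph=True` keeps the gradient of $g(r(x;\theta);\psi)$ differentiable, so the loss can be backpropagated into $\theta$ and $\psi$.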
Although solving (6) optimally is a strong assumption, experimental results suggest that neural networks can easily fit any training data (Zhang et al., 2017). Our experiments also support this observation.

Theorem 1 states our main result concerning the sample complexity of our method. As expected, the sample complexity includes both the input dimensionality $d$ and the manifold dimensionality $n$. On the other hand, the dominant terms depend on $n\sqrt{d}$ rather than $d$. Thus our method significantly decreases sample complexity when $n \ll d$.

Theorem 1. Let $f(x)$ be bounded, $L$-Lipschitz, and $\mu$-smooth. Consider running learned manifold random search (Algorithm 3) for $T$ steps. Let $k_e = 1$ and $k_m = 1$ for simplicity. Then
$$\frac{1}{T}\sum_{t=1}^{T} \mathbb{E}\|\nabla_x f(x_t)\|_2^2 \le O\!\left(d^{\frac{1}{2}}\, T^{-1} + \left(d^{\frac{1}{2}} + n + n d^{\frac{1}{2}}\right) T^{-\frac{1}{2}} + \left(n^{\frac{2}{3}} + d^{\frac{1}{2}} n^{\frac{2}{3}}\right) T^{-\frac{2}{3}}\right).$$

Proof sketch. We provide a short proof sketch here and defer the detailed proof and constants to Appendix A.3. We start by analyzing SGD with bias. The additional suboptimality of using $\theta_t, \psi_t$ instead of $\theta^*, \psi^*$ can be bounded by (4). The empirical loss we minimize is the projection of (4) onto randomly chosen directions. Next, we show that the expectation of the empirical loss recovers (4) when the directions are chosen uniformly at random from the unit sphere:
$$\mathbb{E}_{s_t \sim S^{d-1}}\!\left[\sum_{t=1}^{T} L(x_t, s_t, \theta_t, \psi_t)\right] = \frac{1}{d}\sum_{t=1}^{T} \|\nabla_x \tilde f_{\theta^*}(x_t) - \nabla_x g(r(x_t; \theta_t); \psi_t)\|^2. \qquad (7)$$
A crucial argument in our analysis is the concentration of the empirical loss around its expectation. In order to study this concentration, we use Freedman's inequality (Freedman, 1975), inspired by the analysis of generalization in online learning by Kakade & Tewari (2009). Our analysis bounds the difference $\big|\mathbb{E}_{s_t \sim S^{d-1}}\big[\sum_{t=1}^{T} L_t\big] - \sum_{t=1}^{T} L_t\big|$, where $L_t = L(x_t, s_t, \theta_t, \psi_t)$.

Next, we use the FTL-BTL Lemma (Kalai & Vempala, 2005) to analyze the empirical loss $\sum_{t=1}^{T} L_t$. We bound the empirical loss in terms of the distances between the iterates, $\sum_{t=1}^{T} \|x_{t+1} - x_t\|_2$. Such a bound would not be useful in an adversarial setting since the adversary chooses $x_t$, but we set appropriate step sizes, which yield sufficiently small steps and facilitate convergence.

Our analysis of learning requires the directions in (7) to be sampled from a unit sphere. On the other hand, our optimization method requires directions to be chosen from the tangent space of the manifold. We mix exploration (directions sampled from $S^{d-1}$) and exploitation (directions sampled from the tangent space of the manifold) to address this mismatch. We show that the mixing weight $\beta = 1/d$ yields both fast optimization and no-regret learning. Finally, we combine the analyses of empirical loss, concentration, and SGD to obtain the statement of the theorem.

4 IMPLEMENTATION DETAILS AND LIMITATIONS

We summarize important details here and elaborate further in Appendix B. A full implementation is available at https://github.com/intel-isl/LMRS.

Parametric family. We use multilayer perceptrons with ReLU nonlinearities to define $g$ and $r$. We initialize our models with standard normal distributions. Our method thus starts with random search at initialization and transitions to manifold random search as the learning progresses.

Solving FTRL. Results on training deep networks suggest that local SGD-based methods perform well. We thus use SGD with momentum as a solver for FTRL in (6). We do not solve each learning problem from scratch but initialize with the previous solution. Since this process may be vulnerable to local optima, we fully solve (6) from scratch at every 100th iteration of the method.
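A minimal sketch of this setup follows, assuming the Appendix B architecture for $r$ when $d < 1000$; the small head for $g$ is an assumption, since its architecture is not spelled out here, and all names are illustrative.

```python
import torch
import torch.nn as nn

def make_parametric_family(d, n):
    """r(.; theta) follows the Appendix B architecture for d < 1000;
    the head g(.; psi) here is an illustrative assumption."""
    r_net = nn.Sequential(nn.Linear(d, 2 * n), nn.ReLU(),
                          nn.Linear(2 * n, n), nn.ReLU(),
                          nn.Linear(n, n))
    g_net = nn.Sequential(nn.Linear(n, n), nn.ReLU(), nn.Linear(n, 1))
    for net in (r_net, g_net):
        for p in net.parameters():
            nn.init.normal_(p)                # standard normal initialization
    return r_net, g_net

def solve_ftrl(params, objective, lr=1e-3, steps=50):
    """Warm-started SGD-with-momentum solver for (6): parameters are
    reused across calls, so each solve starts from the previous solution."""
    opt = torch.optim.SGD(params, lr=lr, momentum=0.9)
    for _ in range(steps):
        opt.zero_grad()
        objective().backward()
        opt.step()
```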
Computational complexity. Our method increases the amount of computation since we need to learn a model while performing the optimization. However, in DFO, the major computational bottleneck is typically the function evaluation. When efficiently implemented on a GPU, the time spent on learning the manifold is negligible in comparison to function evaluations.

Parallelization. Random search is highly parallelizable since directions can be processed independently. Communication costs include i) sending the current iterate to workers, ii) sending directions to each corresponding worker, and iii) workers sending the function values back. When the directions are chosen independently, they can be indicated to each worker via a single integer by first creating a shared noise table in preprocessing (a sketch of this mechanism is given after Algorithm 3 below). For a $d$-dimensional problem with $k$ random directions, these costs are $d$, $k$, and $k$, respectively; the total communication cost is therefore $d + 2k$. In our method, each worker also needs a copy of the Jacobian. Hence our method increases the communication cost from $d + 2k$ to $d + 2k + kd$.

Algorithm 3 Learned Manifold Random Search (LMRS)
1: for t = 1 to T do
2:    $g_e^t, S_e^t$ = GRADEST$(x_t, \delta)$
3:    $g_m^t, S_m^t$ = MANIFOLDGRADEST$(x_t, J(x_t; \theta_t), \delta)$
4:    $g_t = \frac{\beta k_e}{k_e + k_m}\, g_e^t + \frac{(1 - \beta) k_m}{k_e + k_m}\, g_m^t$
5:    $x_{t+1} = x_t - \alpha g_t$
6:    $\theta_{t+1}, \psi_{t+1} = \arg\min_{\theta, \psi} \sum_{i=1}^{t} L(x_i, S_{e,m}^i, \theta, \psi) + \lambda R(\theta, \psi)$
7: end for
See Algorithms 1 & 2 for the definitions of GRADEST and MANIFOLDGRADEST.
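The shared-noise-table mechanism from the parallelization discussion above can be sketched as follows; the table size, seed, and helper names are illustrative assumptions.

```python
import numpy as np

TABLE_SIZE = 10_000_000                        # illustrative size
noise_table = np.random.default_rng(0).standard_normal(TABLE_SIZE)

def direction_from_index(idx, d):
    """Reconstruct a unit search direction from a single integer index,
    so the coordinator never ships d-dimensional vectors to workers."""
    s = noise_table[idx:idx + d]
    return s / np.linalg.norm(s)

# Coordinator: broadcast the iterate x (cost d) and one index per direction
# (cost k); each worker replies with its function values (cost k).
```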
[Figure 1: Average reward vs. number of episodes for MuJoCo locomotion tasks. In each condition, we perform 5 runs with different random seeds. Shaded areas represent one standard deviation. The grey horizontal line indicates the prescribed threshold at which the task is considered "solved".]

5 EXPERIMENTS

We empirically evaluate the presented method (referred to as Learned Manifold Random Search (LMRS)) on the following sets of problems. i) We use the MuJoCo simulator (Todorov et al., 2012) to evaluate our method on high-dimensional control problems. ii) We use 46 single-objective unconstrained functions from the Pagmo suite of continuous optimization benchmarks (Biscani et al., 2019). iii) We use the XFoil simulator (Drela, 1989) to benchmark gradient-free optimization of an airfoil.

We consider the following baselines. i) Augmented Random Search (ARS): random search with all the augmentations from Mania et al. (2018). ii) Guided ES (Maheswaranathan et al., 2019): a method that guides random search by adapting the covariance matrix. iii) CMA-ES (Hansen, 2016): adaptive derivative-free optimization based on evolutionary search. iv) REMBO (Wang et al., 2016): a Bayesian optimization method that uses random embeddings in order to scale to high-dimensional problems. Although CMA-ES and REMBO are not based on random search, we include them for the sake of completeness. Additional implementation details are provided in Appendix B.

5.1 LEARNING CONTINUOUS CONTROL

Following the setup of Mania et al. (2018), we use random search to learn control of highly articulated systems. The MuJoCo locomotion suite (Todorov et al., 2012) includes six problems of varying difficulty. We evaluate our method and the baselines on all of them. We use linear policies and include all the tricks (whitening the observation space and scaling the step size using the variance of the rewards) from Mania et al. (2018).

We report average reward over five random experiments versus the number of episodes (i.e. the number of function evaluations) in Figure 1. We also report the average number of episodes required to reach the prescribed reward threshold at which the task is considered "solved" in Table 1. We include proximal policy optimization (PPO) (Schulman et al., 2017; Hill et al., 2018) for reference. Note that our results differ slightly from the numbers reported by Mania et al. (2018), as we use 5 random seeds instead of 3.

The results suggest that our method improves upon ARS in all environments. Our method also outperforms all other baselines. The improvement is particularly significant for high-dimensional problems such as Humanoid. Our method is at least twice as efficient as ARS in all environments except Swimmer, which is the only low-dimensional problem in the suite. Interestingly, Guided-ES fails to solve the Humanoid task, which we attribute to biased gradient estimation. Furthermore, CMA-ES performs similarly to ARS. These results suggest that a challenging task like Humanoid is out of reach for heuristics like local adaptation of the covariance matrix, due to high stochasticity and nonconvexity.

[Figure 2: Average reward vs. wall-clock time for MuJoCo locomotion tasks (HalfCheetah-v1 on 16 cores, Hopper-v1 on 8 cores, Swimmer-v1 on 2 cores, Ant-v1 on 60 cores, Humanoid-v1 on 120 cores, Walker2d-v1 on 40 cores). In each condition, we perform 5 runs with different random seeds. Shaded areas represent one standard deviation. The grey horizontal line indicates the prescribed threshold at which the task is considered "solved". Measurements are performed on Intel Xeon E7-8890 v3 processors and Nvidia GeForce RTX 2080 Ti GPUs.]

REMBO only solves the Swimmer task and fails to solve the others. We believe this is due to the fact that these continuous control problems have no global structure and are highly nonsmooth. The number of possible sets of contacts with the environment is combinatorial in the number of joints, and each contact configuration yields a distinct reward surface. This contradicts the global structure assumption in Bayesian optimization.

Wall-clock time analysis. Our method performs additional computation as we learn the underlying manifold. In order to quantify the effect of the additional computation, we perform a wall-clock time analysis and plot average reward vs. wall-clock time in Figure 2. Our method outperforms all baselines with margins similar to Figure 1. The trends and shapes of the curves in Figures 1 and 2 are similar. This is not surprising, since the computation requirements of all the optimizers are rather negligible when compared with the simulation time of MuJoCo. The only major differences we notice are on the Hopper task. Here the margin between our method and the baselines narrows, and the relative ordering of Guided-ES and ARS changes. This is due to the fact that simulation stops when the agent falls.
In other words, the simulation time depends on the current solution. Methods that query the simulator with these unstable solutions lose less wall-clock time.

Quantifying manifold learning performance. In order to evaluate the learning performance, we project the gradient of the function onto the tangent space of the learned manifold and plot the norm of the residual. Since we do not have access to the gradients, we estimate them at 30 time instants, evenly distributed through the learning process. We perform accurate gradient estimation using a very large number of directions (2000). We compute the norm of the residual of the projection as $\frac{1}{T}\sum_{t=1}^{T}\|\nabla_x f(x_t) - P_{J_q(x_t, \theta_t)}(\nabla_x f(x_t))\|_2$, where $P_A(\cdot)$ is the projection onto the column space of $A$. The results are visualized in Figure 3. Our method successfully and quickly learns the manifold in all cases.

Table 1: Number of episodes required to reach the prescribed threshold on each MuJoCo locomotion task for our method (LMRS), baselines, and ablations. Lower is better. We average over five random seeds. We denote the number of successful trials as (successes/trials) and average over successful trials only; if the number of successful trials is not noted, the method solved the task for all random seeds.

Task        | Threshold | LMRS   | ARS    | CMA-ES  | Guided ES | PPO    | REMBO | No learning  | Offline l.
Swimmer     | 325       | 222    | 381    | 460     | 640       | 790    | 520   | 320          | 325
Hopper      | 3120      | 2408   | 6108   | 14640   | 5288      | 10397  | (0/5) | 5561 (4/5)   | 11616
HalfCheetah | 3430      | 1128   | 4284   | 2888    | 8004      | 6820   | (0/5) | 6528         | 3416
Walker      | 4390      | 15525  | 31240  | 96525   | 62544     | 102440 | (0/5) | 295493 (3/5) | 118640
Ant         | 3580      | 9240   | 20440  | 154440  | 152400    | 52553  | (0/5) | 24520 (4/5)  | 63720 (2/5)
Humanoid    | 6000      | 108133 | 220733 | 1923030 | (0/5)     | 200430 | (0/5) | 267260 (2/5) | 451260 (2/5)

[Figure 3: Manifold learning accuracy on the MuJoCo locomotion tasks. We plot $\frac{1}{T}\sum_{t=1}^{T}\|\nabla_x f(x_t) - P_{J_q(x_t, \theta_t)}(\nabla_x f(x_t))\|_2$ vs. the number of episodes. In order to estimate the gradient, we use GRADEST with a high number of directions.]
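A minimal sketch of the diagnostic plotted in Figure 3 follows, assuming `Jq` has orthonormal columns (e.g. produced by the QR/Gram-Schmidt step above); names are illustrative.

```python
import numpy as np

def projection_residual(grad, Jq):
    """||grad - P_{Jq}(grad)||_2, where P_A projects onto the column
    space of A; Jq is the orthonormalized Jacobian of the learned manifold."""
    return np.linalg.norm(grad - Jq @ (Jq.T @ grad))
```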
Ablation studies. Our method uses three major ideas: i) we learn a manifold that the function lies on; ii) we learn this manifold in an online fashion; iii) we perform random search on the learned manifold. To study the impact of each of these ideas, we perform the following experiments. i) No learning: we randomly initialize the manifold $r(\cdot; \theta)$ by sampling the entries of $\theta$ from the standard normal distribution, and then perform random search on this random manifold. ii) No online learning: we collect an offline training dataset by sampling $x_i$ values uniformly at random from a range that includes the optimal solutions, evaluate function values at the sampled points, and learn the manifold; we then perform random search on this manifold without updating the manifold model. iii) No search: we use the gradients of the estimated function ($\nabla_x g(r(x; \theta_t); \psi_t)$) as surrogate gradients and minimize the function of interest using first-order methods.

We list the results in Table 1. We do not include the no-search baseline since it fails to solve any of the tasks. The failure of the no-search baseline suggests that the estimated functions are powerful enough to guide the search, but not accurate enough for direct optimization. The no-learning baseline outperforms ARS on the simplest problem (Swimmer), but either fails completely or increases sample complexity on the other problems, suggesting that random features are not effective, especially on high-dimensional problems. Although the offline learning baseline solves more tasks than the no-learning one, it has worse sample complexity since the initial offline sampling is expensive. This study indicates that all three of the ideas that underpin our method are important.

5.2 CONTINUOUS OPTIMIZATION BENCHMARKS

We use continuous optimization problems from the Pagmo problem suite (Biscani et al., 2019). This benchmark includes minimization of 46 functions such as Rastrigin, Rosenbrock, and Schwefel. (See Appendix B for the complete list.) We use ten random starting points and report the average number of function evaluations required to reach a stationary point.

Figure 4 reports the results as performance profiles (Dolan & Moré, 2002). Performance profiles represent how frequently a method is within distance $\tau$ of optimality. Specifically, if we denote the number of function evaluations that method $m$ requires to solve problem $p$ by $T_m(p)$ and the number of function evaluations used by the best method by $T^*(p) = \min_m T_m(p)$, the performance profile is the fraction of problems for which method $m$ is within distance $\tau$ of the best: $\frac{1}{N_p}\sum_p \mathbf{1}[T_m(p) - T^*(p) \le \tau]$, where $\mathbf{1}[\cdot]$ is the indicator function and $N_p$ is the number of problems.

[Figure 4: Performance profiles (fraction of problems solved within distance τ of the best method) of our method and baselines on the optimization benchmark.]

As can be seen in Figure 4, our method outperforms all baselines. The success of our method is not surprising, since these functions are typically defined as nonconvex functions of some statistics, inducing a manifold structure by construction. REMBO (Bayesian optimization) is close to our method and outperforms the other baselines. We believe this is due to the global geometric structure of the considered functions. Both CMA-ES and Guided-ES outperform ARS.

5.3 OPTIMIZATION OF AN AIRFOIL

We apply our method to gradient-free optimization of a 2D airfoil. We use a computational fluid dynamics (CFD) simulator, XFoil (Drela, 1989), which can simulate an airfoil using its contour plot. We parametrize the airfoils using smooth polynomials of up to 36 degrees. We model the upper and lower parts of the airfoil with different polynomials; the dimensionality of the problem is thus 72. XFoil can simulate various viscosity properties, speeds, and angles of attack. The details are discussed in Appendix B.

We plot the resulting airfoil after 1500 simulator calls in Table 2, and report the lift and drag of the resulting shape. The objective we optimize is LIFT − DRAG. Table 2 suggests that all methods find airfoils that can fly (LIFT > DRAG). Our method yields the highest LIFT − DRAG. Bayesian optimization outperforms the other baselines.

Table 2: Generated airfoils with their lift and drag values after 1500 calls to XFoil (Drela, 1989).

            | Init  | LMRS   | ARS   | Guided ES | CMA-ES | REMBO
LIFT        | 0.229 | 2.0126 | 0.567 | 1.231     | 1.418  | 1.8645
DRAG        | 0.040 | 0.0389 | 0.007 | 0.0598    | 0.0249 | 0.02123
LIFT − DRAG | 0.269 | 1.9737 | 0.560 | 1.1712    | 1.3931 | 1.8432
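To make the 72-dimensional search space described above concrete, the following sketch shows one plausible reading of the parametrization; the monomial basis and chord discretization are assumptions, as the paper defers the exact setup to Appendix B.

```python
import numpy as np

def airfoil_contour(params, num_points=200):
    """Evaluate upper/lower surface polynomials (36 coefficients each,
    72 parameters total) at num_points chord positions in [0, 1]."""
    upper_coef, lower_coef = params[:36], params[36:]
    xs = np.linspace(0.0, 1.0, num_points)
    upper = np.polynomial.polynomial.polyval(xs, upper_coef)
    lower = np.polynomial.polynomial.polyval(xs, lower_coef)
    return xs, upper, lower                   # contour handed to the simulator
```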
5.4 EFFECT OF MANIFOLD AND PROBLEM DIMENSIONALITY

In this section, we perform a controlled experiment to understand the effect of the problem dimensionality ($d$) and the manifold dimensionality ($n$). We generate a collection of synthetic optimization problems. All synthesized functions follow the manifold hypothesis: $f(x) = g(r(x; \theta^*))$, where $r(x; \theta^*)$ is a multilayer perceptron with the architecture Linear(d, 2n) - ReLU - Linear(2n, n), and $g(\cdot)$ is a randomly sampled convex quadratic function. In order to sample a convex quadratic function, we sample the parameters of the quadratic from a Gaussian distribution and project the result onto the space of convex quadratic functions. A sketch of this construction is given after Figure 5 below.

We choose $d \in \{100, 1000\}$ and plot the objective value with respect to the number of function evaluations for various manifold dimensionalities $n$ in Figure 5. The results suggest that for a given ambient dimensionality $d$, the lower the dimensionality of the data manifold $n$, the more sample-efficient our method. In concordance with our theoretical analysis, the improvement is very significant when $n \ll d$, as can be seen in the cases $n = 5, d = 1000$ and $n = 2, d = 100$. Interestingly, our method is effective even when the manifold assumption is violated ($n = d$). We hypothesize that this is due to anisotropy in the geometry of the problem. Although all directions are important when $n = d$, some will result in faster search since the function changes more along them. It appears that our method can identify these directions and thus accelerate the search.

[Figure 5: Effect of problem dimensionality and manifold dimensionality for (d, n) ∈ {(100, 2), (100, 25), (100, 100), (1000, 5), (1000, 250), (1000, 1000)}. Average objective value over 10 random seeds vs. number of function evaluations. Shaded areas represent one standard deviation.]
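A minimal sketch of this construction, referenced above; using $AA^\top$ to land in the PSD cone is one standard choice and an assumption here, as the paper does not specify the projection.

```python
import numpy as np

def make_synthetic_problem(d, n, rng=None):
    """f(x) = g(r(x; theta*)) with r a random Linear-ReLU-Linear network
    and g a random convex quadratic (A A^T is PSD by construction)."""
    rng = rng or np.random.default_rng()
    W1 = rng.standard_normal((2 * n, d))
    W2 = rng.standard_normal((n, 2 * n))
    A = rng.standard_normal((n, n))
    Q = A @ A.T                               # convex quadratic curvature
    b = rng.standard_normal(n)

    def f(x):
        r = W2 @ np.maximum(W1 @ x, 0.0)      # Linear(d, 2n) -> ReLU -> Linear(2n, n)
        return float(r @ Q @ r + b @ r)
    return f
```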
6 RELATED WORK

Derivative-free optimization. We summarize the work on DFO that is relevant to our paper. For a complete review, readers are referred to Custódio et al. (2017) and Conn et al. (2009). We are specifically interested in random search methods, which have been developed as early as Matyas (1965) and Rechenberg (1973). Convergence properties of these methods have recently been analyzed by Agarwal et al. (2010), Bach & Perchet (2016), Nesterov & Spokoiny (2017), and Dvurechensky et al. (2018). A lower bound on the sample complexity for the convex case has been given by Duchi et al. (2015) and Jamieson et al. (2012). Bandit convex optimization is also highly relevant, and we utilize the work of Flaxman et al. (2005) and Shamir (2013).

Random search for learning continuous control. Learning continuous control is an active research topic that has received significant interest in the reinforcement learning community. Recently, Salimans et al. (2017) and Mania et al. (2018) have shown that random search methods are competitive with state-of-the-art policy gradient algorithms in this setting. Vemula et al. (2019) analyzed this phenomenon theoretically and characterized the sample complexity of random search and policy gradient methods for continuous control.

Adaptive random search. There are various methods in the literature that adapt the search space by using an anisotropic covariance, as in the case of CMA-ES (Hansen et al., 2003; Hansen, 2016), guided evolutionary search (Maheswaranathan et al., 2019), and active subspace methods (Choromanski et al., 2019). There are also methods that enforce structure such as orthogonality in the search directions (Choromanski et al., 2018). Other methods use information geometry tools, as in Wierstra et al. (2014) and Glasmachers et al. (2010). Lehman et al. (2018) use gradient magnitudes to guide neuro-evolutionary search. Staines & Barber (2012) use a variational lower bound to guide the search. In contrast to these methods, we explicitly posit nonlinear manifold structure and directly learn this latent manifold via online learning. Our method is the only one that can learn an arbitrary nonlinear search space given a parametric class that characterizes its geometry.

Adaptive Bayesian optimization. Bayesian optimization (BO) is another approach to zeroth-order optimization with desirable theoretical properties (Srinivas et al., 2010). Although we are only interested in methods based on random search, some of the ideas we use have been utilized in BO. Calandra et al. (2016) used the manifold assumption for Gaussian processes. In contrast to our method, they use autoencoders for learning the manifold and assume initial availability of offline data. Similarly, Djolonga et al. (2013) consider the case where the function of interest lies on some linear manifold and collect offline data to identify this manifold. In contrast, we only use online information and our models are nonlinear. Wang et al. (2016) and Kirschner et al. (2019) propose using random low-dimensional features instead of adaptation. Rolland et al. (2018) design adaptive BO methods for additive models. Major distinctions between our work and the adaptive BO literature include our use of nonlinear manifolds, no reliance on offline data collection, and formulation of the problem as online learning.

7 CONCLUSION

We presented Learned Manifold Random Search (LMRS): a derivative-free optimization algorithm that learns the underlying geometric structure of the problem online while performing the optimization. Our experiments suggest that LMRS is effective on a wide range of problems and significantly outperforms prior derivative-free optimization algorithms from multiple research communities.

REFERENCES

Alekh Agarwal, Ofer Dekel, and Lin Xiao. Optimal algorithms for online convex optimization with multi-point bandit feedback. In Conference on Learning Theory (COLT), 2010.

Francis R. Bach and Vianney Perchet. Highly-smooth zero-th order online optimization. In Conference on Learning Theory (COLT), 2016.

Francesco Biscani, Dario Izzo, Wenzel Jakob, Marcus Märtens, Alessio Mereta, Cord Kaldemeyer, Sergey Lyskov, Sylvain Corlay, Benjamin Pritchard, Kishan Manani, et al. esa/pagmo2: Pagmo 2.10, 2019.

Roberto Calandra, Jan Peters, Carl Edward Rasmussen, and Marc Peter Deisenroth. Manifold Gaussian processes for regression. In International Joint Conference on Neural Networks (IJCNN), 2016.

Krzysztof Choromanski, Mark Rowland, Vikas Sindhwani, Richard E. Turner, and Adrian Weller. Structured evolution with compact architectures for scalable policy optimization. In International Conference on Machine Learning (ICML), 2018.

Krzysztof Choromanski, Aldo Pacchiano, Jack Parker-Holder, Yunhao Tang, and Vikas Sindhwani.
From complexity to simplicity: Adaptive ES-active subspaces for blackbox optimization. In Neural Information Processing Systems, 2019.

Andrew R. Conn, Katya Scheinberg, and Luis N. Vicente. Introduction to Derivative-Free Optimization. SIAM, 2009.

Ana Luísa Custódio, Katya Scheinberg, and Luís Nunes Vicente. Methodologies and software for derivative-free optimization. In Advances and Trends in Optimization with Engineering Applications, 2017.

Josip Djolonga, Andreas Krause, and Volkan Cevher. High-dimensional Gaussian process bandits. In Neural Information Processing Systems, 2013.

Elizabeth D. Dolan and Jorge J. Moré. Benchmarking optimization software with performance profiles. Mathematical Programming, 91, 2002.

Mark Drela. XFOIL: An analysis and design system for low Reynolds number airfoils. In Low Reynolds Number Aerodynamics, 1989.

John C. Duchi, Michael I. Jordan, Martin J. Wainwright, and Andre Wibisono. Optimal rates for zero-order convex optimization: The power of two function evaluations. IEEE Transactions on Information Theory, 2015.

Pavel Dvurechensky, Alexander Gasnikov, and Eduard Gorbunov. An accelerated method for derivative-free smooth stochastic convex optimization. arXiv:1802.09022, 2018.

Abraham Flaxman, Adam Tauman Kalai, and H. Brendan McMahan. Online convex optimization in the bandit setting: Gradient descent without a gradient. In Symposium on Discrete Algorithms (SODA), 2005.

David A. Freedman. On tail probabilities for martingales. Annals of Probability, 3(1), 1975.

Jacob R. Gardner, Geoff Pleiss, David Bindel, Kilian Q. Weinberger, and Andrew Gordon Wilson. GPyTorch: Blackbox matrix-matrix Gaussian process inference with GPU acceleration. In Neural Information Processing Systems, 2018.

Saeed Ghadimi and Guanghui Lan. Stochastic first- and zeroth-order methods for nonconvex stochastic programming. SIAM Journal on Optimization, 23(4), 2013.

David Ginsbourger, Rodolphe Le Riche, and Laurent Carraro. Kriging is well-suited to parallelize optimization. In Computational Intelligence in Expensive Optimization Problems. Springer, 2010.

Tobias Glasmachers, Tom Schaul, Yi Sun, Daan Wierstra, and Jürgen Schmidhuber. Exponential natural evolution strategies. In Genetic and Evolutionary Computation Conference (GECCO), 2010.

Nikolaus Hansen. The CMA evolution strategy: A tutorial. arXiv:1604.00772, 2016.

Nikolaus Hansen, Sibylle D. Müller, and Petros Koumoutsakos. Reducing the time complexity of the derandomized evolution strategy with covariance matrix adaptation (CMA-ES). Evolutionary Computation, 11(1), 2003.

Nikolaus Hansen, Youhei Akimoto, and Petr Baudis. CMA-ES/pycma on GitHub. Zenodo, DOI:10.5281/zenodo.2559634, February 2019.

Elad Hazan. Introduction to online convex optimization. Foundations and Trends in Optimization, 2(3-4), 2016.

Ashley Hill, Antonin Raffin, Maximilian Ernestus, Adam Gleave, Anssi Kanervisto, Rene Traore, Prafulla Dhariwal, Christopher Hesse, Oleg Klimov, Alex Nichol, et al. Stable Baselines. https://github.com/hill-a/stable-baselines, 2018.

Kevin G. Jamieson, Robert D. Nowak, and Benjamin Recht. Query complexity of derivative-free optimization. In Neural Information Processing Systems, 2012.

Sham M. Kakade and Ambuj Tewari. On the generalization ability of online strongly convex programming algorithms. In Neural Information Processing Systems, 2009.

Adam Tauman Kalai and Santosh Vempala. Efficient algorithms for online decision problems. Journal of Computer and System Sciences, 71(3), 2005.
Johannes Kirschner, Mojmír Mutný, Nicole Hiller, Rasmus Ischebeck, and Andreas Krause. Adaptive and safe Bayesian optimization in high dimensions via one-dimensional subspaces. In International Conference on Machine Learning (ICML), 2019.

Dirk P. Kroese, Thomas Taimre, and Zdravko I. Botev. Handbook of Monte Carlo Methods. John Wiley & Sons, 2013.

Joel Lehman, Jay Chen, Jeff Clune, and Kenneth O. Stanley. Safe mutations for deep and recurrent neural networks through output gradients. In Genetic and Evolutionary Computation Conference (GECCO), 2018.

Niru Maheswaranathan, Luke Metz, George Tucker, and Jascha Sohl-Dickstein. Guided evolutionary strategies: Escaping the curse of dimensionality in random search. In International Conference on Machine Learning (ICML), 2019.

Horia Mania, Aurelia Guy, and Benjamin Recht. Simple random search of static linear policies is competitive for reinforcement learning. In Neural Information Processing Systems, 2018.

Sébastien Marmin, Clément Chevalier, and David Ginsbourger. Differentiating the multipoint expected improvement for optimal batch design. In Workshop on Machine Learning, Optimization, and Big Data, 2015.

J. Matyas. Random optimization. Automation and Remote Control, 26(2), 1965.

Yurii Nesterov and Vladimir G. Spokoiny. Random gradient-free minimization of convex functions. Foundations of Computational Mathematics, 17(2), 2017.

Ingo Rechenberg. Evolutionsstrategie: Optimierung technischer Systeme nach Prinzipien der biologischen Evolution. Stuttgart-Bad Cannstatt: Friedrich Frommann Verlag, 1973.

Paul Rolland, Jonathan Scarlett, Ilija Bogunovic, and Volkan Cevher. High-dimensional Bayesian optimization via additive models with overlapping groups. In International Conference on Artificial Intelligence and Statistics (AISTATS), 2018.

Nikitas Rontsis, Michael A. Osborne, and Paul J. Goulart. Distributionally ambiguous optimization techniques for batch Bayesian optimization. arXiv:1707.04191, 2017.

Tim Salimans, Jonathan Ho, Xi Chen, and Ilya Sutskever. Evolution strategies as a scalable alternative to reinforcement learning. arXiv:1703.03864, 2017.

John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. Proximal policy optimization algorithms. arXiv:1707.06347, 2017.

Shai Shalev-Shwartz. Online learning and online convex optimization. Foundations and Trends in Machine Learning, 4(2), 2012.

Ohad Shamir. On the complexity of bandit and derivative-free stochastic convex optimization. In Conference on Learning Theory (COLT), 2013.

Niranjan Srinivas, Andreas Krause, Sham M. Kakade, and Matthias W. Seeger. Gaussian process optimization in the bandit setting: No regret and experimental design. In International Conference on Machine Learning (ICML), 2010.

Joe Staines and David Barber. Variational optimization. arXiv:1212.4507, 2012.

Emanuel Todorov, Tom Erez, and Yuval Tassa. MuJoCo: A physics engine for model-based control. In International Conference on Intelligent Robots and Systems (IROS), 2012.

Anirudh Vemula, Wen Sun, and J. Andrew Bagnell. Contrasting exploration in parameter and action space: A zeroth-order optimization perspective. In International Conference on Artificial Intelligence and Statistics (AISTATS), 2019.

Ziyu Wang, Frank Hutter, Masrour Zoghi, David Matheson, and Nando de Freitas. Bayesian optimization in a billion dimensions via random embeddings. Journal of Artificial Intelligence Research, 55, 2016.
Daan Wierstra, Tom Schaul, Tobias Glasmachers, Yi Sun, Jan Peters, and Jürgen Schmidhuber. Natural evolution strategies. Journal of Machine Learning Research (JMLR), 15, 2014.

Andrew Gordon Wilson and Hannes Nickisch. Kernel interpolation for scalable structured Gaussian processes (KISS-GP). In International Conference on Machine Learning (ICML), 2015.

Chiyuan Zhang, Samy Bengio, Moritz Hardt, Benjamin Recht, and Oriol Vinyals. Understanding deep learning requires rethinking generalization. In ICLR, 2017.

A.1 GRADIENT ESTIMATOR

In this section, we show that when random directions are sampled in the column space of an orthonormal matrix $U$, perturbations give biased gradient estimates of the manifold-smoothed function. Moreover, when $U = U^*$, the resulting gradients are unbiased. We formalize this with the following lemma.

Lemma 1. Let $U^*$ be an orthonormal basis for the tangent space of the $n$-dimensional manifold $\mathcal{M}$ at a point $x \in \mathcal{M}$, let $U$ be another orthonormal matrix, and let $f$ be a function defined on this manifold. Fix $\delta > 0$. Then
$$\mathbb{E}_{s \sim S^{n-1}}[f(x + \delta U s)\, U s] = \frac{\delta}{n}\, \nabla_x \tilde f_U(x) + \mathrm{BIAS}(U),$$
where $\mathrm{BIAS}(U) = \mathbb{E}_{s \sim S^{n-1}}[f(x + \delta U s)\,[U - U^*]\,s]$. Moreover, the bias is $0$ and the resulting estimator is unbiased when $U = U^*$.

Proof. Without loss of generality, we can assume $\det U = 1$ and $\det U^* = 1$. Using this remark, the lemma follows from a straightforward application of the manifold Stokes theorem:
$$\begin{aligned}
\mathbb{E}_{s \sim S^{n-1}}[f(x + \delta U s)\, U s]
&= \mathbb{E}_{s \sim S^{n-1}}[f(x + \delta U s)\, U^* s] + \underbrace{\mathbb{E}_{s \sim S^{n-1}}[f(x + \delta U s)\,[U - U^*]\,s]}_{\mathrm{BIAS}(U)}\\
&\overset{(a)}{=} \frac{1}{\mathrm{vol}(\delta S^{n-1})}\int_{\delta S^{n-1}} f(x + \delta U s)\, U^*\frac{s}{\|s\|}\, ds + \mathrm{BIAS}(U)\\
&\overset{(b)}{=} \frac{1}{\mathrm{vol}(\delta S^{n-1})}\int_{\delta S^{n-1}} f(x + \delta U {U^*}^\top U^* s)\, U^*\frac{s}{\|s\|}\, ds + \mathrm{BIAS}(U)\\
&\overset{(c)}{=} \frac{1}{\mathrm{vol}(\delta S^{n-1})}\, \nabla_x \int_{\delta B^n} f(x + \delta U v)\, dv + \mathrm{BIAS}(U)\\
&\overset{(d),(e)}{=} \frac{\mathrm{vol}(\delta B^n)}{\mathrm{vol}(\delta S^{n-1})}\, \nabla_x\, \frac{\int_{\delta B^n} f(x + \delta U v)\, dv}{\mathrm{vol}(\delta B^n)} + \mathrm{BIAS}(U)
= \frac{\delta}{n}\, \nabla_x \tilde f_U(x) + \mathrm{BIAS}(U),
\end{aligned}$$
where $\mathrm{vol}$ denotes volume; we use the definition of the expectation in (a) and (e), the orthonormality of $U^*$ in (b) and (d), the manifold Stokes theorem in (c), and the fact that the ratio of the volume to the surface area of an $n$-dimensional ball of radius $\delta$ is $\delta/n$ in (e). The bias clearly vanishes when $U = U^*$.

A.2 SAMPLE COMPLEXITY FOR RANDOM SEARCH AND MANIFOLD RANDOM SEARCH

In this section, we bound the sample complexity of random search (Algorithm 1) and manifold random search (Algorithm 2). Our analysis starts by studying the relationship between the function $f$ and its smoothed ($\hat f$) and manifold-smoothed ($\tilde f_U$) versions in Section A.2.2. We show that $L$-Lipschitz continuity and $\mu$-smoothness of the function extend to the smoothed functions, and we bound the difference between the gradients of the original function and those of the smoothed versions. Next, we study the second moment of the gradient estimator in Section A.2.3, and we state the sample complexity of SGD on non-convex functions in Section A.2.1. Combining these results, we state the final sample complexities of random search and manifold random search in Sections A.2.4 and A.2.5.

A.2.1 CONVERGENCE OF SGD FOR NON-CONVEX FUNCTIONS

The convergence of SGD has been widely studied; here we state its convergence result for non-convex functions from Ghadimi & Lan (2013) as a lemma and give its proof for the sake of completeness.

Lemma 2 (Convergence of SGD (Ghadimi & Lan, 2013; Vemula et al., 2019)). Consider running SGD on $f(x)$ that is $\mu$-smooth and $L$-Lipschitz for $T$ steps, starting with an initial solution $x_0$.
Denote $\Omega_0 = f(x_0) - f(x^*)$, where $x^*$ is the globally optimal point, and assume that the unbiased gradient estimate has second moment bounded by $V$. Then
$$\frac{1}{T}\sum_{t=1}^{T}\mathbb{E}\|\nabla_x f(x_t)\|_2^2 \le \sqrt{\frac{2\mu\Omega_0 V}{T}}.$$

Proof. We denote the step size by $\alpha$ and the unbiased gradient estimate by $g_t$. We analyze the step at $t$ as
$$f(x_{t+1}) = f(x_t - \alpha g_t) \le f(x_t) - \alpha\, \nabla_x f(x_t)^\top g_t + \frac{\mu\alpha^2}{2}\|g_t\|_2^2, \qquad (10)$$
where we used the $\mu$-smoothness of the function. Taking the expectation of the inequality,
$$\mathbb{E}_{g_t}[f(x_{t+1})] \le f(x_t) - \alpha\|\nabla_x f(x_t)\|_2^2 + \frac{\mu\alpha^2}{2}\,\mathbb{E}[\|g_t\|_2^2]. \qquad (11)$$
Using the bounded second moment of the gradient and summing from step $1$ to $T$,
$$\sum_{t=1}^{T}\mathbb{E}_{g_t}[f(x_{t+1})] \le \sum_{t=1}^{T} f(x_t) - \alpha\sum_{t=1}^{T}\|\nabla_x f(x_t)\|_2^2 + \frac{\mu\alpha^2 T V}{2}.$$
Re-arranging the terms, we obtain
$$\sum_{t=1}^{T}\|\nabla_x f(x_t)\|_2^2 \le \frac{1}{\alpha}\,\mathbb{E}_{g_1, \ldots, g_T}[f(x_0) - f(x_{T+1})] + \frac{\mu\alpha T V}{2} \le \frac{\Omega_0}{\alpha} + \frac{\mu\alpha T V}{2}.$$
We set $\alpha = \sqrt{\frac{2\Omega_0}{\mu T V}}$ and divide the inequality by $T$ in order to obtain the required inequality
$$\frac{1}{T}\sum_{t=1}^{T}\|\nabla_x f(x_t)\|_2^2 \le \sqrt{\frac{2\mu\Omega_0 V}{T}}.$$

A.2.2 PRELIMINARY RESULTS ON SMOOTHED FUNCTIONS

First, we show that the $\mu$-smoothness and $L$-Lipschitz continuity of $f$ carry over to $\hat f$ and to $\tilde f_U(x) = \mathbb{E}_{v \sim B^n}[f(x + \delta U v)]$:
$$|\tilde f_U(x_1) - \tilde f_U(x_2)| = |\mathbb{E}_{v \sim B^n}[f(x_1 + \delta U v) - f(x_2 + \delta U v)]| \le \mathbb{E}_{v \sim B^n}[L\|x_1 - x_2\|_2] \overset{(a)}{=} L\|x_1 - x_2\|_2,$$
where we use the $L$-Lipschitz continuity of $f$ in (a), and
$$\|\nabla\tilde f_U(x_1) - \nabla\tilde f_U(x_2)\|_2 = \|\mathbb{E}_{v \sim B^n}[\nabla f(x_1 + \delta U v) - \nabla f(x_2 + \delta U v)]\|_2 \overset{(b)}{\le} \mathbb{E}_{v \sim B^n}\big[\|\nabla f(x_1 + \delta U v) - \nabla f(x_2 + \delta U v)\|_2\big] \overset{(c)}{\le} \mu\|x_1 - x_2\|_2,$$
where we use Jensen's inequality and the convexity of the norm in (b) and the $\mu$-smoothness of $f$ in (c). Hence, $\mu$-smoothness and $L$-Lipschitz continuity hold for $\tilde f_U$ for any $U$. Taking $n = d$ and $U = I$, they also hold for $\hat f$.

Next, we study the impact of using the gradients of the smoothed function instead of the original function:
$$\frac{1}{T}\sum_{t=1}^{T}\|\nabla_x f(x_t)\|_2^2 = \frac{1}{T}\sum_{t=1}^{T}\|\nabla_x f(x_t) - \nabla_x\tilde f_U(x_t) + \nabla_x\tilde f_U(x_t)\|_2^2 \le \frac{2}{T}\sum_{t=1}^{T}\|\nabla_x f(x_t) - \nabla_x\tilde f_U(x_t)\|_2^2 + \frac{2}{T}\sum_{t=1}^{T}\|\nabla_x\tilde f_U(x_t)\|_2^2,$$
where we use $\|a + b\|_2^2 \le 2\|a\|_2^2 + 2\|b\|_2^2$. We further bound the first term as
$$\|\nabla_x f(x) - \nabla_x\tilde f_U(x)\|_2^2 \overset{(a)}{=} \|\mathbb{E}_{v \sim B^n}[\nabla_x f(x) - \nabla_x f(x + \delta U v)]\|_2^2 \overset{(b)}{\le} \mathbb{E}_{v \sim B^n}[\delta\mu\|U v\|_2]^2 \le \delta^2\mu^2,$$
using the dominated convergence theorem in (a), the $\mu$-smoothness of $f$ in (b), and the orthonormality of $U$ together with the fact that the norm of any point in the unit ball is bounded by 1. Taking $n = d$ and $U = I$, this result also holds for $\hat f$. Hence,
$$\frac{1}{T}\sum_{t=1}^{T}\|\nabla_x f(x_t)\|_2^2 \le \frac{2}{T}\sum_{t=1}^{T}\|\nabla_x\hat f(x_t)\|_2^2 + 2\delta^2\mu^2 \quad\text{and}\quad \frac{1}{T}\sum_{t=1}^{T}\|\nabla_x f(x_t)\|_2^2 \le \frac{2}{T}\sum_{t=1}^{T}\|\nabla_x\tilde f_{U_t}(x_t)\|_2^2 + 2\delta^2\mu^2 \;\;\forall\, U_1, \ldots, U_T. \qquad (19)$$

A.2.3 SECOND MOMENT OF THE GRADIENT ESTIMATOR

We start by studying the second moment of our gradient estimate for the manifold case:
$$\begin{aligned}
\mathbb{E}_{s \sim S^{n-1},\,\xi}\big[\|g_t^m\|_2^2\big] &= \mathbb{E}\Big[\Big\|\frac{n}{2\delta}\big(F(x_t + \delta U s, \xi_1) - F(x_t - \delta U s, \xi_2)\big)\, U s\Big\|_2^2\Big]
\overset{(a)}{=} \frac{n^2}{4\delta^2}\,\mathbb{E}\big[\big(F(x_t + \delta U s, \xi_1) - F(x_t - \delta U s, \xi_2)\big)^2\big]\\
&\overset{(b)}{\le} \frac{n^2}{2\delta^2}\,\mathbb{E}\big[(f(x + \delta U s) - f(x - \delta U s))^2\big] + \frac{n^2}{\delta^2}\,\mathbb{E}\big[(F(x + \delta U s, \xi) - f(x + \delta U s))^2\big] + \frac{n^2}{\delta^2}\,\mathbb{E}\big[(F(x - \delta U s, \xi) - f(x - \delta U s))^2\big]\\
&\overset{(c)}{\le} 2n^2L^2 + \frac{2n^2 V_F}{\delta^2},
\end{aligned}$$
where we use the orthonormality of $U$ and the unit norm of $s$ in (a), add and subtract $f(x + \delta U s) - f(x - \delta U s)$ and use $(a + b)^2 \le 2a^2 + 2b^2$ in (b), and use the bounded variance of $F$ and the Lipschitz continuity of $f$ in (c).

The second moment of the full-space random search estimator can be computed similarly, and the resulting bound is
$$\mathbb{E}_{s \sim S^{d-1},\,\xi}\big[\|g_t^e\|_2^2\big] \le 2d^2L^2 + \frac{2d^2 V_F}{\delta^2}. \qquad (21)$$

A.2.4 PROOF OF PROPOSITION 1

Proof. The analysis of SGD from Lemma 2 shows that
$$\frac{1}{T}\sum_{t=1}^{T}\|\nabla_x\hat f(x_t)\|_2^2 \le \frac{\Omega_0}{\alpha T} + \frac{\mu\alpha V}{2}.$$
Using the bound on the second moment of the estimator derived in (21),
$$\frac{1}{T}\sum_{t=1}^{T}\|\nabla_x\hat f(x_t)\|_2^2 \le \frac{\Omega_0}{\alpha T} + \frac{\mu\alpha}{2}\Big(2d^2L^2 + \frac{2d^2 V_F}{\delta^2}\Big).$$
Using the relationship between $f$ and $\hat f$ derived in (19),
$$\frac{1}{T}\sum_{t=1}^{T}\|\nabla_x f(x_t)\|_2^2 \le 2\delta^2\mu^2 + \frac{2\Omega_0}{\alpha T} + \mu\alpha\Big(2d^2L^2 + \frac{2d^2 V_F}{\delta^2}\Big).$$
We use the property $\sqrt{a + b} \le \sqrt{a} + \sqrt{b}$ and solve for $\alpha$ and then for $\delta$.
After substituting the resulting $\delta$,
$$\frac{1}{T}\sum_{t}\|\nabla_x f(x_t)\|_2^2 \le \frac{c_0 + c_1 d}{T^{1/2}} + \frac{c_2\, d^{2/3}}{T^{2/3}},$$
where $c_0 = 4L\sqrt{2\mu\Omega_0}$, $c_1 = \sqrt{2c_0}$, and $c_2 = (4V_F\Omega_0)^{1/3}(2\mu + 4\mu^{5/6})$.

A.2.5 PROOF OF PROPOSITION 2

Proof. The sample complexity analysis of manifold random search closely follows the proof of Proposition 1; we summarize it here for the sake of completeness. Using the analysis of SGD from Lemma 2,
$$\frac{1}{T}\sum_{t=1}^{T}\|\nabla_x\tilde f_U(x_t)\|_2^2 \le \frac{\Omega_0}{\alpha T} + \frac{\mu\alpha V}{2}.$$
Using the bound on the second moment of the estimator derived in (21) and the relationship between $f$ and $\tilde f_U$ derived in (19),
$$\frac{1}{T}\sum_{t=1}^{T}\|\nabla_x f(x_t)\|_2^2 \le 2\delta^2\mu^2 + \frac{2\Omega_0}{\alpha T} + \mu\alpha\Big(2n^2L^2 + \frac{2n^2 V_F}{\delta^2}\Big).$$
We first use the property $\sqrt{a + b} \le \sqrt{a} + \sqrt{b}$, then solve for $\alpha$ and $\delta$. Finally, we substitute the resulting $\delta$ to get the final result
$$\frac{1}{T}\sum_{t}\|\nabla_x f(x_t)\|_2^2 \le \frac{c_0 + c_1 n}{T^{1/2}} + \frac{c_2\, n^{2/3}}{T^{2/3}},$$
where $c_0 = 4L\sqrt{2\mu\Omega_0}$, $c_1 = \sqrt{2c_0}$, and $c_2 = (4V_F\Omega_0)^{1/3}(2\mu + 4\mu^{5/6})$.

A.3 PROOF OF THEOREM 1

Proof. We prove our main theorem using three major arguments. First, we analyze SGD on a non-convex function with biased gradients in Section A.3.1. Second, we show in Section A.3.2 that the expected value of our loss function equals the bias term; in order to bound the difference between the empirical loss function we minimize and its expectation, we use Freedman's inequality (Freedman, 1975). Third, we bound the empirical loss in Section A.3.3 in terms of the distance travelled by the iterates of the optimization, $\|x_{t+1} - x_t\|_2$. Finally, we optimize the resulting bound in terms of the finite-difference step ($\delta_t$), the SGD step size ($\alpha_t$), and the mixing coefficient ($\beta$) to obtain the final statement of the theorem in Section A.3.4.

A.3.1 ANALYSIS OF SGD WITH BIAS

In order to analyze SGD with bias, we denote the gradient at iteration $t$ by $g_t$ and assume that its bias is $b_t$, i.e. $\mathbb{E}[g_t] = b_t + \nabla_x F(x, \xi)$. Using the $\mu$-smoothness of the function $F$, we can state that
$$F(x_{t+1}, \xi) = F(x_t - \alpha g_t, \xi) \le F(x_t, \xi) - \alpha\,\nabla_x F(x_t, \xi)^\top g_t + \frac{\mu\alpha^2}{2}\|g_t\|_2^2. \qquad (29)$$
Let us assume that the effect of the bias is bounded as $|\nabla_x F(x_t, \xi)^\top b_t| \le B_t$. Then
$$F(x_{t+1}, \xi) \le F(x_t, \xi) - \alpha\,\nabla_x F(x_t, \xi)^\top [g_t - b_t] + \frac{\mu\alpha^2}{2}\|g_t\|_2^2 + \alpha B_t. \qquad (30)$$
Taking the expectation with respect to $s$ and $\xi$, we get
$$\alpha\|\nabla_x f(x_t)\|_2^2 \le f(x_t) - \mathbb{E}_{s,\xi}[f(x_{t+1})] + \frac{\mu\alpha^2 V_g}{2} + \alpha B_t, \qquad (31)$$
where $V_g = \mathbb{E}[\|g_t\|_2^2]$. Summing up from $t = 1$ to $T$ and dividing by $\alpha T$, we obtain
$$\frac{1}{T}\sum_{t=1}^{T}\|\nabla_x f(x_t)\|_2^2 \le \frac{\Omega_0}{\alpha T} + \frac{\mu\alpha V_g}{2} + \frac{1}{T}\sum_{t=1}^{T} B_t. \qquad (32)$$
We now compute the bound on the bias term $B_t$ using Lemma 1 and $g_t = \beta g_t^e + (1 - \beta) g_t^m$:
$$\nabla_x F(x_t, \xi)^\top b_t = (1 - \beta)\,\mathbb{E}_{s \sim S^{n-1}}\big[F(x + \delta U s, \xi)\,\nabla_x F(x_t, \xi)^\top[U - U^*]\,s\big] \le (1 - \beta)\,\Omega\,\mathbb{E}_{s \sim S^{n-1}}\big[|\nabla_x F(x_t, \xi)^\top[U - U^*]\,s|\big], \qquad (33)$$
where we used the fact that the function is bounded, $F(x, \xi) \le \Omega$. Since $\nabla_x F(x, \xi)$ lies in the column space of $U^*$, we have $\nabla_x F(x, \xi) = U^* p$ for some $p$, and
$$\nabla_x F(x, \xi)^\top[U - U^*] = p^\top {U^*}^\top[U - U^*] = p^\top\big[{U^*}^\top U - {U^*}^\top U^*\big]. \qquad (34)$$
The choice of basis $U^*$ is not unique in Lemma 1: it can be any orthonormal basis spanning the tangent space of the manifold. In order to minimize $\mathbb{E}_{s \sim S^{n-1}}[|\nabla_x F(x_t, \xi)^\top[U - U^*]\,s|]$, we choose the basis that sets $U p$ to the projection of $\nabla_x F(x_t, \xi)$ onto the column space of $U$. Then
$$\|\nabla_x F(x_t, \xi)^\top[U - U^*]\|_2 \le \min_{q \in \mathbb{R}^n}\|\nabla_x F(x_t, \xi) - U q\|_2 \le \|\nabla_x F(x_t, \xi) - \nabla_x g(r(x_t; \theta_t); \psi_t)\|_2, \qquad (35)$$
where the last inequality is due to $\nabla_x g(r(x_t; \theta_t); \psi_t)$ lying in the column space of $U$. Combining this with (32), (19), and $0 \le \beta \le 1$,
$$\frac{1}{T}\sum_{t=1}^{T}\|\nabla_x f(x_t)\|_2^2 \le \frac{\Omega_0}{\alpha T} + \frac{\mu\alpha V_g}{2} + \frac{\Omega}{T}\sum_{t=1}^{T}\|\nabla_x F(x_t, \xi) - \nabla_x g(r(x_t; \theta_t); \psi_t)\|_2 + \delta^2\mu^2. \qquad (36)$$
We proceed to bound $\sum_{t=1}^{T}\|\nabla_x F(x_t, \xi) - \nabla_x g(r(x_t; \theta_t); \psi_t)\|_2$ in the next section.

A.3.2 ROLE OF THE BANDIT FEEDBACK

The true loss we are interested in is the effect of the bias on SGD: the sum of the differences between the gradients of the true function ($F(x, \xi)$) and of the estimated one ($g(r(x; \theta); \psi)$), as derived in (36).
On the other hand, the empirical information we have is the projection of this loss onto a random direction ($s$), with an additional noise term. In this section, we analyze the difference between the bias and the empirical loss without the noise; we include the discussion of the noise in Section A.3.3.

First, we show that the expectation of the empirical loss over a direction chosen uniformly from the unit sphere is the bias term. For this, we need the elementary result $\mathbb{E}_{s \sim S^{d-1}}[(s^\top v)^2] = \frac{\|v\|^2}{d}$, which we show in Section A.4.1. Using this result,
$$\mathbb{E}_{s \sim S^{d-1}}\Big[\big(s^\top[\nabla_x F(x, \xi) - \nabla_x g(r(x; \theta); \psi)]\big)^2\Big] = \frac{1}{d}\,\|\nabla_x F(x, \xi) - \nabla_x g(r(x; \theta); \psi)\|_2^2. \qquad (37)$$
We introduce the notation $\ell(s, x, \xi, \theta, \psi) = \big(s^\top[\nabla_x F(x, \xi) - \nabla_x g(r(x; \theta); \psi)]\big)^2$ for clarity and proceed to bound the difference $|\ell(s, x, \xi, \theta, \psi) - \mathbb{E}_s[\ell(s, x, \xi, \theta, \psi)]|$. Consider the sequence of differences
$$Z_t = \ell(s_t, x_t, \xi_t, \theta_t, \psi_t) - \frac{1}{d}\,\|\nabla_x F(x_t, \xi_t) - \nabla_x g(r(x_t; \theta_t); \psi_t)\|_2^2. \qquad (38)$$
Clearly $\mathbb{E}[Z_t] = 0$ for all $t$; moreover, the differences are bounded due to the Lipschitz continuity. Hence, $Z_t$ is a martingale difference sequence. We use Freedman's inequality (Freedman, 1975) in order to bound $\sum_t Z_t$, similar to the seminal work studying the generalization of online learning by Kakade & Tewari (2009). Freedman's inequality states that if $Z_t$ is a martingale difference sequence bounded as $|Z_t| \le b$ for all $t$, then with probability at least $1 - 4\gamma\ln(T)$,
$$\sum_{t=1}^{T} Z_t \le \max\Big\{2\sqrt{\textstyle\sum_{t=1}^{T}\mathrm{Var}(Z_t)},\; 3b\Big\}\,\sqrt{\ln(1/\gamma)}. \qquad (39)$$
Before we substitute $Z_t$ into Freedman's inequality, we need to compute the variance of $Z_t$. Denoting the unit vector in the direction of $\nabla_x F(x_t, \xi_t) - \nabla_x g(r(x_t; \theta_t); \psi_t)$ by $\nu$, and using the Lipschitz property together with (63), a direct computation gives
$$\mathrm{Var}(Z_t) \le \frac{2L^2 + 1}{d}\,\|\nabla_x F(x_t, \xi_t) - \nabla_x g(r(x_t; \theta_t); \psi_t)\|_2^2 = (2L^2 + 1)\,\mathbb{E}[\ell_t]. \qquad (40)$$
Finally, we substitute this result in Freedman's inequality and use the expectation of $\ell(s_t, x_t, \xi_t, \theta_t, \psi_t)$ from (37). We also introduce the shorthand notation $\ell_t = \ell(s_t, x_t, \xi_t, \theta_t, \psi_t)$. With probability at least $1 - 4\gamma\ln(T)$,
$$\sum_{t=1}^{T}\mathbb{E}[\ell_t] \le \sum_{t=1}^{T}\ell_t + \max\Big\{2\sqrt{(2L^2 + 1)\textstyle\sum_{t=1}^{T}\mathbb{E}[\ell_t]},\; 6L^2\tfrac{1 + d}{d}\Big\}\,\sqrt{\ln(1/\gamma)}. \qquad (41)$$
We can further bound $\sum_{t=1}^{T}\mathbb{E}[\ell_t]$ by using the fact that (41) is of the form $s^2 \le r + \max\{2a\sqrt{s^2}, 6bc\}\,c$, which can be solved using the quadratic formula. We solve this quadratic in Section A.4.2 and show that it implies $s^2 \le r + 2ac\sqrt{r} + \max\{4a^2, 6b\}\,c^2$. Using this solution, we get the following final result describing the effect of the bandit feedback. With probability at least $1 - 4\gamma\ln(T)$,
$$\sum_{t=1}^{T}\mathbb{E}[\ell_t] \le \sum_{t=1}^{T}\ell_t + 2\sqrt{(2L^2 + 1)\ln(1/\gamma)\textstyle\sum_{t=1}^{T}\ell_t} + \max\Big\{8L^2 + 4,\; 6L^2\tfrac{1 + d}{d}\Big\}\ln(1/\gamma). \qquad (42)$$
In summary, we bound the difference between the effect of the bias, $\sum_{t=1}^{T}\mathbb{E}[\ell_t]$, and the empirical loss we minimize, $\sum_{t=1}^{T}\ell_t$, without the noise term. In the next section, we proceed to bound the empirical loss $\sum_{t=1}^{T} L_t$ and the effect of the noise term $\sum_{t=1}^{T}|\ell_t - L_t|$.

A.3.3 ANALYSIS OF THE EMPIRICAL LOSS

In this section, we analyze the empirical loss. Our analysis is similar to the regret analysis of Follow the Regularized Leader (FTRL); however, we do not obtain an adversarial bound. Our resulting bound is a function of the distances between iterates, $\|x_t - x_{t-1}\|_2$. Such a bound would not be useful in an adversarial setting, since the adversary chooses the iterates.
We start our analysis by bounding the total empirical loss in terms of the length of the trajectory the learner takes. As a consequence of the FTL-BTL Lemma (Kalai & Vempala, 2005),

$$\sum_{t=1}^{T} L(s_t,x_t,\xi_t,\theta_t,\psi_t) \le \sum_{t=1}^{T}\big[L(s_t,x_t,\xi_t,\theta_t,\psi_t) - L(s_t,x_t,\xi_t,\theta_{t+1},\psi_{t+1})\big] + R(\theta_T,\psi_T). \quad(43)$$

We use the Lipschitz smoothness property to convert this into the distance travelled by the learner:

$$\begin{aligned}
&\sum_{t=1}^{T}\big[L(s_t,x_t,\xi_t,\theta_t,\psi_t) - L(s_t,x_t,\xi_t,\theta_{t+1},\psi_{t+1})\big]\\
&\qquad= \sum_{t=1}^{T}\Big[\Big(\tfrac{y(x_t,s_t,\xi_t)}{2\delta} - s_t^\top\nabla_x g(r(x_t;\theta_t);\psi_t)\Big)^2 - \Big(\tfrac{y(x_t,s_t,\xi_t)}{2\delta} - s_t^\top\nabla_x g(r(x_t;\theta_{t+1});\psi_{t+1})\Big)^2\Big]\\
&\qquad\le 4L\sum_{t=1}^{T}\big|s_t^\top\big(\nabla_x g(r(x_t;\theta_t);\psi_t) - \nabla_x g(r(x_t;\theta_{t+1});\psi_{t+1})\big)\big|\\
&\qquad\le 4L\underbrace{\sum_{t=1}^{T}\big\|\nabla_x g(r(x_t;\theta_t);\psi_t) - \nabla_x g(r(x_t;\theta_{t+1});\psi_{t+1})\big\|_2}_{\textsc{LearnerPathLength}}. \quad(44)
\end{aligned}$$

With properly chosen λ, our regularizer enforces the smallest possible update, in terms of learner path length, that is consistent with the sampled directions seen so far. This is due to the representability assumption, which guarantees that the manifold can be fit perfectly by the parametric family; hence, there is a solution with L = 0. Since the regularizer is the learner path length, with a proper choice of λ, FTRL chooses the shortest learner path. Among all choices of $s_t$, $s_t = \nabla_x F(x_t,\xi_t)/\|\nabla_x F(x_t,\xi_t)\|$ would result in the longest distance. Hence, we can bound the learner path length of our empirical problem by the distances of this oracle problem. Denoting the θ and ψ found by the oracle problem as $\hat\theta, \hat\psi$, this upper bound formally leads to

$$\begin{aligned}
\textsc{LearnerPathLength} &\le \sum_{t=1}^{T}\big\|\nabla_x g(r(x_t;\hat\theta_t);\hat\psi_t) - \nabla_x g(r(x_t;\hat\theta_{t+1});\hat\psi_{t+1})\big\|_2\\
&\overset{(a)}{\le} \sum_{t=1}^{T}\big\|\nabla_x g(r(x_t;\hat\theta_t);\hat\psi_t) - \nabla_x g(r(x_{t-1};\hat\theta_t);\hat\psi_t)\big\|_2 + \sum_{t=1}^{T}\big\|\nabla_x g(r(x_{t-1};\hat\theta_{t+1});\hat\psi_{t+1}) - \nabla_x g(r(x_t;\hat\theta_{t+1});\hat\psi_{t+1})\big\|_2\\
&\overset{(b)}{\le} 2\mu\sum_{t=1}^{T}\|x_t - x_{t-1}\|_2, \quad(45)
\end{aligned}$$

as a consequence of the fact that the oracle problem fits all gradients perfectly (i.e. $\nabla_x F(x_t,\xi_t) = \nabla_x g(r(x_t;\hat\theta_i);\hat\psi_i)$ for all $i > t$) in (a), and of µ-smoothness of the functions in (b). Using the fact that the gradient norms are bounded,

$$\sum_{t=1}^{T} L(s_t,x_t,\xi_t,\theta_t,\psi_t) \le 8\mu L\sum_{t=1}^{T}\|x_t - x_{t-1}\|_2 + 2L. \quad(46)$$

In order to extend this result to $\ell_t$, we use the smoothness of the function:

$$\ell_t = \big(s_t^\top[\nabla_x\hat F(x_t,\xi) - \nabla_x g(r(x_t;\theta_t);\psi_t)]\big)^2 \le \Big(\Big|s_t^\top\nabla_x\hat F(x_t,\xi) - \tfrac{y(x_t,s_t,\xi_t)}{2\delta}\Big| + \Big|\tfrac{y(x_t,s_t,\xi_t)}{2\delta} - s_t^\top\nabla_x g(r(x_t;\theta_t);\psi_t)\Big|\Big)^2 \le \mu^2\delta^2 + L(s_t,x_t,\xi_t,\theta_t,\psi_t), \quad(47)$$

where the first term is bounded by µδ via the µ-smoothness of F and the definition of the smoothed function. By combining (47) and (46), we state

$$\sum_{t=1}^{T}\ell_t \le 8\mu L\sum_{t=1}^{T}\|x_t - x_{t-1}\|_2 + \mu^2\delta^2 T + 2L. \quad(48)$$

A.3.4 PROOF OF THE THEOREM

We combine the aforementioned three arguments to state the final sample complexity of our method. Our analysis of SGD with bias from (36), combined with the definition of $\ell_t$, gives the following bound on the sample complexity:

$$\frac1T\sum_{t=1}^{T}\|\nabla_x f(x_t)\|_2^2 \le \frac{2\Omega}{\alpha T} + \frac{\mu\alpha V_g}{2} + \frac{\Omega}{T}\sum_{t=1}^{T}\sqrt{d\,\mathbb E[\ell_t]} + \delta^2\mu^2. \quad(49)$$

Using the concavity of the square root function with Jensen's inequality, we can convert this bound to

$$\frac1T\sum_{t=1}^{T}\|\nabla_x f(x_t)\|_2^2 \le \frac{2\Omega}{\alpha T} + \frac{\mu\alpha V_g}{2} + \Omega\sqrt{\frac dT\sum_{t=1}^{T}\mathbb E[\ell_t]} + \delta^2\mu^2. \quad(50)$$

Next, we bound $\sqrt{\frac1T\sum_{t=1}^{T}\mathbb E[\ell_t]}$ using (42). For simplicity, we analyze the two cases $\sum_{t=1}^{T}\ell_t \le 1$ and $\sum_{t=1}^{T}\ell_t > 1$ separately.

Case 1, $\sum_{t=1}^{T}\ell_t \le 1$: We substitute this bound directly into (42). With probability at least $1 - 4\gamma\ln(T)$,

$$\frac1T\sum_{t=1}^{T}\mathbb E[\ell_t] \le \frac1T\Big(1 + 2\sqrt{\tfrac{2L^2+d}{d}\ln(1/\gamma)} + \max\big\{\tfrac{8L^2+4d}{d},\; 6L^2\tfrac{1+d}{d}\big\}\ln(1/\gamma)\Big). \quad(51)$$

Relaxing the upper bound with the fact that the dimension is at least 1,

$$\sqrt{\frac1T\sum_{t=1}^{T}\mathbb E[\ell_t]} \le \sqrt{\frac{c_1}{T}}, \quad(52)$$

where $c_1 = 1 + \sqrt{(8L^2+4)\ln(1/\gamma)} + \max\{8L^2+4,\,12L^2\}\ln(1/\gamma)$.

Case 2, $\sum_{t=1}^{T}\ell_t > 1$: Using the facts that $\sqrt x < x$ for $x > 1$ and $\sqrt{x+y} \le \sqrt x + \sqrt y$, as well as $d \ge 1$, we can state that with probability at least $1 - 4\gamma\ln(T)$,

$$\sqrt{\frac1T\sum_{t=1}^{T}\mathbb E[\ell_t]} \le \sqrt{\frac{c_2}{T}} + c_3\sqrt{\frac1T\sum_{t=1}^{T}\ell_t}, \quad(53)$$

where $c_2 = \max\{8L^2+4,\,12L^2\}\ln(1/\gamma)$ and $c_3 = 1 + \sqrt{(2L^2+1)\ln(1/\gamma)}$. Combining this with the bound (48) and $\sqrt x \le x$ for $x > 1$,

$$\sqrt{\frac1T\sum_{t=1}^{T}\mathbb E[\ell_t]} \le \sqrt{\frac{c_2}{T}} + c_3\Big(\mu^2\delta^2 + 8\mu\alpha V_g + \frac{2L}{T}\Big). \quad(54)$$

We combine the two cases and substitute into (50). The final sample complexity is

$$\frac1T\sum_{t=1}^{T}\|\nabla_x f(x_t)\|_2^2 \le \frac{2\Omega}{\alpha T} + \frac{\mu\alpha V_g}{2}\big(1 + 16\,\Omega c_3\sqrt d\big) + c_{1,2}\sqrt{\frac dT} + \frac{2\Omega c_3 L\sqrt d}{T} + \mu^2\delta^2\big(1 + \Omega c_3\sqrt d\big), \quad(55)$$

where $c_{1,2} = \Omega\max\{c_1, c_2\}$. We minimize with respect to $\alpha_t$ and substitute:

$$\frac1T\sum_{t=1}^{T}\|\nabla_x f(x_t)\|_2^2 \le 2\sqrt{\frac{\Omega\mu V_g\big(1 + 16\,\Omega c_3\sqrt d\big)}{T}} + c_{1,2}\sqrt{\frac dT} + \frac{2\Omega c_3 L\sqrt d}{T} + \mu^2\delta^2\big(1 + \Omega c_3\sqrt d\big). \quad(56)$$

Before we solve for δ, we bound $V_g$ by choosing $\beta = 1/d$:

$$V_g \le 4L^2n^2 + \frac{4n^2 V_F}{\delta^2}. \quad(57)$$

Next, we solve for δ to obtain the statement of the theorem. With probability at least $1 - 4\gamma\ln(T)$,

$$\frac1T\sum_{t=1}^{T}\|\nabla_x f(x_t)\|_2^2 \le \frac{k_1\, d^{1/2}}{T} + \frac{k_2\, d^{1/2} + k_3\, n + k_4\, n\, d^{1/2}}{T^{1/2}} + \frac{k_5\, n^{2/3} + k_6\, d^{1/2} n^{2/3}}{T^{1/3}}, \quad(58)$$

where $k_1 = 2c_3\Omega L$, $k_2 = c_{1,2}$, $k_3 = 2L\sqrt{2\mu\Omega}$, $k_4 = 4L\Omega\sqrt{\mu c_3}$, $k_5 = 3(2\Omega V_F)^{1/3}$, and $k_6 = 3\Omega c_3\, k_5$.
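As a quick numerical sanity check of the elementary identity $\mathbb E_{s\sim S^{d-1}}[(s^\top v)^2] = \|v\|^2/d$, which was used in (37) and (40) and is proved in Section A.4.1 below, one can run the following illustrative snippet (ours, using the standard fact that normalized Gaussian vectors are uniform on the sphere):

```python
import numpy as np

rng = np.random.default_rng(0)
d, trials = 50, 200_000
v = rng.standard_normal(d)
s = rng.standard_normal((trials, d))
s /= np.linalg.norm(s, axis=1, keepdims=True)  # uniform samples on S^{d-1}
print(np.mean((s @ v) ** 2), (v @ v) / d)      # the two values should agree
```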
A.4 USEFUL ELEMENTARY RESULTS

A.4.1 EXPECTATION OF $(s^\top v)^2$ WHEN $s$ IS CHOSEN UNIFORMLY FROM $S^{d-1}$

Consider $\int F(Oe)\,\mu(O)$, where $\int[\cdot]\,\mu(O)$ denotes an integral over orthogonal matrices with respect to the Haar measure. If $e$ is a unit vector, we can show that

$$\mathbb E_{s\sim S^{d-1}}[F(s)] = \int F(Oe)\,\mu(O). \quad(59)$$

Before we use this result, we express $\|v\|^2$ as an integral over orthogonal matrices $O$. Using orthogonality and the cyclic property of the trace,

$$v^\top v = \operatorname{Tr}(vv^\top) = \int\operatorname{Tr}(vv^\top)\,\mu(O) = \int\operatorname{Tr}(OO^\top vv^\top)\,\mu(O) = \int\operatorname{Tr}(O^\top vv^\top O)\,\mu(O). \quad(60)$$

Since the identity matrix is the sum of outer products of the one-hot vectors $e_i$, $I = \sum_i e_ie_i^\top$,

$$\int\operatorname{Tr}(O^\top vv^\top O)\,\mu(O) = \int\operatorname{Tr}\Big(\sum_i e_ie_i^\top O^\top vv^\top O\Big)\mu(O) = \sum_i\int\operatorname{Tr}(e_i^\top O^\top vv^\top Oe_i)\,\mu(O). \quad(61)$$

Using (59), we can further show

$$\sum_i\int\operatorname{Tr}(e_i^\top O^\top vv^\top Oe_i)\,\mu(O) = \sum_i\mathbb E_{s\sim S^{d-1}}[\operatorname{Tr}(s^\top vv^\top s)] = d\,\mathbb E_{s\sim S^{d-1}}[(s^\top v)^2]. \quad(62)$$

Hence, combining all of the above,

$$\mathbb E_{s\sim S^{d-1}}[(s^\top v)^2] = \frac{\|v\|^2}{d}. \quad(63)$$

A.4.2 BOUNDING THE QUADRATIC FORM

We need to bound $s$ when the following quadratic inequality holds:

$$s \le r + \max\{2a\sqrt s,\, 6bc\}\,c. \quad(64)$$

We first consider the two options of the max separately. Given the original inequality, one of the following is correct:

$$s \le r + 2ac\sqrt s \quad\text{or}\quad s \le r + 6bc^2. \quad(65)$$

In the first case, $(\sqrt s)^2 - 2ac\sqrt s - r \le 0$. Using the quadratic formula, $\sqrt s$ is smaller than the largest root, $\sqrt s \le ac + \sqrt{a^2c^2 + r}$. Hence,

$$s = (\sqrt s)^2 \le (ac + \sqrt{a^2c^2+r})^2 = a^2c^2 + a^2c^2 + r + 2ac\sqrt{a^2c^2+r} \le 4a^2c^2 + 2ac\sqrt r + r. \quad(66)$$

Combining the resulting bound with the other option gives the final bound:

$$s \le r + 2ac\sqrt r + \max\{4a^2,\, 6b\}c^2. \quad(67)$$

B ADDITIONAL IMPLEMENTATION DETAILS

In this section, we provide additional details for the implementation. Our experimental setup, along with a full implementation of our method, is available open-source at https://github.com/intel-isl/LMRS.

One key element of our method is the parametric family we use to learn the manifold. We use a multilayer perceptron with up to three hidden layers: Linear(d, 2n) → ReLU → Linear(2n, n) → ReLU → Linear(n, n) for d < 1000, and Linear(d, d/2) → ReLU → Linear(d/2, 2n) → ReLU → Linear(2n, n) → ReLU → Linear(n, n) for d ≥ 1000. We set the dimensionality of the manifold to the number of directions k, which is a hyperparameter. In all experiments, we use grid search over δ and n = k and choose the best-performing values. Moreover, we reinitialize the manifold parameters whenever the magnitude of the estimated gradient falls below 1e-6. We learn the model parameters by online gradient descent, using SGD with momentum 0.9, and perform grid search for the learning rate over {1e-4, 1e-3, 1e-2}. We set λ = 10^3 for all experiments. We further discuss experiment-specific details below.
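Before the experiment-specific details, the parametric family just described can be written down directly. The following is our PyTorch rendering of the described stack (the layer widths follow the text; treating it as the manifold model and everything else, including the function name, is an assumption of ours):

```python
import torch.nn as nn

def manifold_model(d: int, n: int) -> nn.Sequential:
    """Sketch of the MLP family described above: the shallower stack for
    d < 1000, the deeper stack (extra Linear(d, d//2) layer) for d >= 1000."""
    if d < 1000:
        return nn.Sequential(
            nn.Linear(d, 2 * n), nn.ReLU(),
            nn.Linear(2 * n, n), nn.ReLU(),
            nn.Linear(n, n),
        )
    return nn.Sequential(
        nn.Linear(d, d // 2), nn.ReLU(),
        nn.Linear(d // 2, 2 * n), nn.ReLU(),
        nn.Linear(2 * n, n), nn.ReLU(),
        nn.Linear(n, n),
    )
```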
MuJoCo experiments. We use linear policies and initialize them with zeros, which corresponds to no action. We use the V2-t algorithm from Mania et al. (2018), which includes whitening of the observation space and using the top-k directions instead of all of them. We use grid search over the parameter space described by Mania et al. (2018) for n = k, α, and δ.

Low-dimensional unconstrained optimization suite. We use the following functions: sphere, noisysphere, cigar, tablet, cigtab, cigtab2, elli, rosen, rosen chained, diffpow, rosenelli, ridge, ridgecircle, happycat, branin, goldsteinprice, rastrigin, schaffer, schwefel2 22, lincon, rosen nesterov, styblinski tang, and bukin, with dimensions d = 10 and d = 100, resulting in 46 problems in total. We initialize all solutions with zero-mean unit-variance normal variables and use grid search over δ ∈ {1e-4, 1e-3, 1e-2, 1e-1}, k ∈ {2, 5, 10, 50}, and α ∈ {1e-4, 1e-3, 1e-2, 1e-1}.

Airfoil optimization. We initialize the parameters of the manifold with zero-mean unit-variance normal variables. We use grid search over δ ∈ {1e-4, 1e-3, 1e-2, 1e-1}, k ∈ {2, 5, 10, 50}, and α ∈ {1e-4, 1e-3, 1e-2, 1e-1}. To choose hyperparameters, we simulate all models at Reynolds number 12e6, speed Mach 0.4, and angle of attack 5 degrees. After the hyperparameters are set, we evaluate at Reynolds number 14e6, speed Mach 0.6, and angle of attack 2 degrees.

Implementation details for the baselines. For ARS, we use grid search over the parameters recommended by Mania et al. (2018). For CMA-ES, we use pycma (Hansen et al., 2019) with the recommended hyperparameters. For Bayesian optimization, we use the REMBO optimizer (Wang et al., 2016) with the following details. For continuous control problems, we use Structured Kernel Interpolation (SKI) (Wilson & Nickisch, 2015) and perform grid search over hyperparameters using the ranges recommended by GPyTorch (Gardner et al., 2018). For the other problems, we use full GP inference, since the number of samples is rather low, and perform grid search over kernel parameters following GPyTorch (Gardner et al., 2018). As the acquisition function, we perform grid search over Optimistic Expected Improvement (Rontsis et al., 2017), Multi-point Expected Improvement (Marmin et al., 2015), and the approach of Ginsbourger et al. (2010).
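For reference, the grid searches described throughout this section all have the following shape. This is a schematic sketch of ours: `run_trial` below is a self-contained stand-in (a few steps of plain antithetic random search on the sphere test function) rather than a full run of our method; only the searched value ranges come from the text.

```python
import numpy as np
from itertools import product

def run_trial(delta, k, alpha, d=10, iters=200, seed=0):
    """Stand-in objective: plain antithetic random search on the sphere
    function. In the experiments this would be a full training run."""
    rng = np.random.default_rng(seed)
    f = lambda x: float(x @ x)                    # sphere test function
    x = rng.standard_normal(d)
    for _ in range(iters):
        g = np.zeros(d)
        for _ in range(k):                        # k sampled directions
            s = rng.standard_normal(d)
            s /= np.linalg.norm(s)
            g += (f(x + delta * s) - f(x - delta * s)) / (2 * delta) * s
        x -= alpha * g / k
    return -f(x)                                  # higher is better

# Grid search over the ranges quoted above.
grid = product([1e-4, 1e-3, 1e-2, 1e-1], [2, 5, 10, 50], [1e-4, 1e-3, 1e-2, 1e-1])
best = max(grid, key=lambda cfg: run_trial(*cfg))
print("best (delta, k, alpha):", best)
```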