# Bayesian Functional Optimization

Ngo Anh Vien (EEECS/ECIT, Queen's University Belfast), Heiko Zimmermann (MLR, University of Stuttgart), Marc Toussaint (MLR, University of Stuttgart)

Bayesian optimization (BayesOpt) is a derivative-free approach for sequentially optimizing stochastic black-box functions. Standard BayesOpt, which has shown many successes in machine learning applications, assumes a finite-dimensional domain, typically a parametric space. This parameter space is defined by the features used in the function approximation, which are often selected manually, so the performance of BayesOpt inevitably depends on the quality of the chosen features. This paper proposes a new Bayesian optimization framework that optimizes directly over a domain of functions. The resulting framework, Bayesian Functional Optimization (BFO), not only extends the application domain of BayesOpt to functional optimization problems but also relaxes the dependence of performance on the chosen parameter space. We model the domain of functions as a reproducing kernel Hilbert space (RKHS) and use the notion of Gaussian processes on a real separable Hilbert space. As a result, we can define the traditional improvement-based (PI and EI) and optimistic (UCB) acquisition functions as functionals. We propose to optimize the acquisition functionals using analytic functional gradients, which we prove are themselves functions in an RKHS. We evaluate BFO on three typical functional optimization tasks: i) a synthetic functional optimization problem, ii) optimizing activation functions of a multi-layer perceptron, and iii) a reinforcement learning task whose policies are modeled in an RKHS.

## Introduction

Bayesian optimization is a derivative-free optimization scheme approached from the viewpoint of Bayesian theory (Jones, Schonlau, and Welch 1998; Brochu, Cora, and De Freitas 2010). It frames the optimization of an unknown function as a sequential decision task. Such unknown functions are often costly to evaluate, especially in stochastic tasks, so query points must be selected such that the total cost of evaluation is optimized: for example, cost minimization in robotic control tasks, as thoroughly discussed by (Deisenroth, Neumann, and Peters 2013), or profit maximization in advertisement placement problems (Pandey, Chakrabarti, and Agarwal 2007). Specifically, the goal is to optimize an unknown real-valued objective function $f(x)$ on some bounded subset $X$, most often a subset of $\mathbb{R}^d$. At each round $t$, the optimizer can only access a noisy evaluation $y_t$ by querying $f(x_t)$ at a sample point $x_t$. The target is to find the optimum $x^*$ of the objective function while also minimizing the overall evaluation cost. To this end, the decision on an optimal next query point must integrate the information of all previous queries. A common BayesOpt approach is to place a Gaussian process prior as a probabilistic surrogate model of the unknown function. New candidate points are then sampled based on the posterior distribution of the learned model, so the decision-making process becomes a belief-search problem.
Many criteria can be used to exploit the distribution of this learned model, for example the probability of improvement (Kushner 1964), the expected improvement (Močkus 1975), confidence bound criteria (Cox and John 1992; Srinivas et al. 2012), and information-based approaches (Hennig and Schuler 2012; Shahriari et al. 2014).

In this work, we are concerned with black-box optimization of functional objectives. By modeling candidate functions in a reproducing kernel Hilbert space (RKHS), we enable BayesOpt to optimize in rich, non-parametric solution spaces while inheriting the useful structure and properties of an RKHS. In summary, our major contributions are three-fold. First, we propose the novel Bayesian functional optimization framework (BFO), which enables optimization in function spaces that are potentially infinite-dimensional. To this end, BFO adopts a functional Gaussian process prior as a surrogate model for objective functionals; moreover, we propose three acquisition functionals to select which function should be evaluated next: Probability of Improvement, Expected Improvement, and iGP-UCB (infinite GP-UCB). Second, assuming the domain of BFO is an RKHS $\mathcal{H}_k$ with kernel $k$, we show that the functional gradients of these acquisition functionals can be derived analytically. Moreover, these functional gradients are functions in $\mathcal{H}_k$, which results in an efficient functional gradient update. Third, we provide a cumulative regret bound for iGP-UCB.

There have recently been similar efforts to propose machine learning frameworks for learning on functional data, such as functional regression (Kadri et al. 2015), modeling policies for reinforcement learning in an RKHS (Bagnell and Schneider 2003; Lever and Stafford 2015; Vien, Englert, and Toussaint 2016), representing motion trajectories in an RKHS (Marinho et al. 2016; Dong et al. 2016), and finding geodesic shortest paths for physical systems (Kasim and Norreys 2016). The last work, by (Kasim and Norreys 2016), is the closest to ours in that it also tackles global functional optimization; it extends the Simultaneous Optimistic Optimisation (SOO) approach of (Munos 2011) to optimize functionals, which however has to resort to discretization. Another related work by (Vien, Dang, and Chung 2017) extends CMA-ES to functional optimization.

## Background

We first give a brief problem statement, then review background on Gaussian processes (GPs) and BayesOpt with a GP prior. In Bayesian optimization (Močkus 1975), we are interested in finding the maximum of a black-box function $f : D \to \mathbb{R}$ on some bounded domain $D \subset \mathbb{R}^n$. BayesOpt takes a probabilistic approach to model the objective function $f$, whose uncertainty quantification can be exploited to decide at which query point $x$ the objective $f(x)$ is evaluated next. Any BayesOpt method therefore consists of two major components: first, a probabilistic surrogate model, often chosen to be a Gaussian process, representing the belief over the unknown objective $f(x)$; second, an acquisition function that computes a utility value for candidate evaluation points from the posterior distribution in order to select the next evaluation point.
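To make these two components concrete, the following is a minimal sketch of the generic BayesOpt loop, with a cheap random candidate search standing in for the inner acquisition maximization. The `objective` and `acquisition` callables are placeholders to be supplied by the user; all names here are illustrative rather than taken from the paper.

```python
# A minimal sketch of the generic BayesOpt loop: a probabilistic surrogate
# plus an acquisition function that scores candidate query points.
# `objective` and `acquisition` are hypothetical placeholders, not APIs
# from the paper.
import numpy as np

def bayes_opt(objective, acquisition, bounds, n_rounds=50, n_candidates=1000):
    X, y = [], []                              # evaluation history D_t
    rng = np.random.default_rng(0)
    for t in range(n_rounds):
        # Candidate search: a cheap random search over the bounded domain.
        cand = rng.uniform(bounds[:, 0], bounds[:, 1],
                           size=(n_candidates, len(bounds)))
        if X:
            scores = acquisition(cand, np.array(X), np.array(y))
            x_next = cand[np.argmax(scores)]   # maximize the acquisition utility
        else:
            x_next = cand[0]                   # no data yet: pick arbitrarily
        y_next = objective(x_next)             # one (noisy) evaluation of f
        X.append(x_next); y.append(y_next)
    best = int(np.argmax(y))
    return X[best], y[best]
```

The acquisition functions of the next subsection plug directly into the `acquisition` argument.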
### Gaussian Process

Using Bayesian methods, one can infer a probabilistic model of $f$ that can be queried for an estimate of $f(x)$, its confidence, and its correlation with nearby regions. A Gaussian process (GP) prior is a generalized distribution over infinitely many Gaussian random variables, any finite subset of which is jointly normally distributed (Rasmussen 2006). A GP is written $\mathcal{GP}(\mu(\cdot), k(\cdot,\cdot))$ with mean function $\mu$ and covariance kernel $k$, which is assumed bounded: $k(x, x') \le 1$ for all $x, x' \in D$. The kernel, i.e. covariance function, $k$ encodes differentiability and smoothness properties of samples $f \sim \mathcal{GP}(\mu, k)$. Assuming that the noise $\epsilon_t \sim \mathcal{N}(0, \sigma^2)$ is i.i.d. Gaussian, the posterior over $f$ after $t$ observations $\{x_i, y_i\}_{i=1}^t$ is Gaussian with mean, covariance, and variance

$$\mu_t(x) = \mathbf{k}_t(x)^\top (K_t + \sigma^2 I)^{-1} \mathbf{y}_t,$$
$$k_t(x, x') = k(x, x') - \mathbf{k}_t(x)^\top (K_t + \sigma^2 I)^{-1} \mathbf{k}_t(x'),$$
$$\sigma_t^2(x) = k_t(x, x),$$

where $\mathbf{y}_t = [y_1, y_2, \dots, y_t]^\top$ is the vector of previous observations, $\mathbf{k}_t(x) = [k(x, x_1), k(x, x_2), \dots, k(x, x_t)]^\top$ is the vector of pairwise kernels between the current and previous query points, and $K_t$ is the $t \times t$ Gram matrix with entries $k(x_i, x_j)$, $i, j \in \{1, 2, \dots, t\}$. For the later analysis, a stationary kernel $k(x, x') = \hat{k}(\|x - x'\|)$ is used; common examples of stationary kernels are the squared exponential and Matérn kernels.

### Bayesian Optimization: Acquisition Functions

We have discussed the probabilistic surrogate model used to represent the belief over the unknown function $f$. We now discuss strategies for selecting the sequence of query points $x_{1:t}$ by defining an acquisition function $u : D \to \mathbb{R}$. Here we present and use three traditional acquisition functions (Brochu, Cora, and De Freitas 2010): probability of improvement (PI) (Kushner 1964), expected improvement (EI) (Lizotte 2008), and an upper confidence bound criterion (UCB) (Cox and John 1992; Auer, Cesa-Bianchi, and Fischer 2002). We denote by $x_{\mathrm{best}}$, $y_{\mathrm{best}} = f(x_{\mathrm{best}})$ the best evaluation up to time $t$, by $\Phi(\cdot)$ the cumulative distribution function of the standard normal distribution, and by $\phi(\cdot)$ its density.

**Probability of Improvement:** This strategy selects the query point that maximizes the probability of improving over the current best value, which is computed analytically as
$$u_{PI}(x) = \Pr\big(f(x) > y_{\mathrm{best}}\big) = \Phi(\gamma(x)), \quad \text{where } \gamma(x) = \frac{\mu_t(x) - y_{\mathrm{best}}}{\sigma_t(x)}. \tag{1}$$

**Expected Improvement:** This strategy selects the query point that maximizes the expected improvement over the current best. This criterion also has the analytic form
$$u_{EI}(x) = \sigma_t(x)\big(\gamma(x)\Phi(\gamma(x)) + \phi(\gamma(x))\big). \tag{2}$$

**GP-UCB:** This strategy selects the query point that maximizes the upper confidence bound criterion
$$u_{UCB}(x) = \mu_t(x) + \beta_t^{1/2} \sigma_t(x), \tag{3}$$
such that, under rather mild conditions, the cumulative regret is bounded, as introduced by (Srinivas et al. 2012). Here $\beta_t = \nu \tau_t$ with hyperparameter $\nu$; the choice $\nu = 1$ and $\tau_t = 2\log\big(t^{n/2+2}\pi^2/(3\delta)\big)$ ($n$ being the dimensionality of the domain $D$) is shown to make GP-UCB a no-regret algorithm with probability $1 - \delta$.

A number of BayesOpt methods deal with very high-dimensional problems, for example by learning sparse additive models in high dimensions (Tyagi, Gärtner, and Krause 2014; Tyagi et al. 2016), using random embeddings in high dimensions (Wang et al. 2016), or learning a lower-dimensional subspace (Djolonga, Krause, and Cevher 2013). However, these methods assume that the underlying problem is inherently simple (low-dimensional) but hidden in a very high-dimensional space. On the other hand, none of them directly works with problems on function domains, which are potentially infinite-dimensional.
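As an illustration, the posterior equations and the acquisition functions (1)-(3) above translate almost line by line into code. The sketch below assumes a squared exponential kernel and treats the noise level and $\beta_t$ as given constants; it is a minimal sketch, not the paper's implementation.

```python
# A sketch of the GP posterior (mean, variance) and the three acquisition
# functions, written directly from the formulas above; the squared exponential
# kernel and the fixed hyperparameters are illustrative assumptions.
import numpy as np
from scipy.stats import norm

def sq_exp_kernel(A, B, bandwidth=1.0):
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2.0 * bandwidth ** 2))

def gp_posterior(x, X, y, sigma=0.1):
    """mu_t(x) and sigma_t^2(x) for query rows x, given history (X, y)."""
    K_t = sq_exp_kernel(X, X)                      # t x t Gram matrix
    k_x = sq_exp_kernel(X, x)                      # k_t(x), shape t x m
    A = np.linalg.solve(K_t + sigma ** 2 * np.eye(len(X)), k_x)
    mu = A.T @ y                                   # k_t(x)^T (K_t + s^2 I)^-1 y_t
    var = sq_exp_kernel(x, x).diagonal() - (k_x * A).sum(0)
    return mu, np.maximum(var, 1e-12)

def u_pi(x, X, y):                                 # Eq. (1): prob. of improvement
    mu, var = gp_posterior(x, X, y)
    gamma = (mu - y.max()) / np.sqrt(var)
    return norm.cdf(gamma)

def u_ei(x, X, y):                                 # Eq. (2): expected improvement
    mu, var = gp_posterior(x, X, y)
    s = np.sqrt(var)
    gamma = (mu - y.max()) / s
    return s * (gamma * norm.cdf(gamma) + norm.pdf(gamma))

def u_ucb(x, X, y, beta=4.0):                      # Eq. (3): GP-UCB
    mu, var = gp_posterior(x, X, y)
    return mu + np.sqrt(beta * var)
```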
## Functional Optimization: A Bayesian Approach

We now present BFO, a principled Bayesian optimization framework for functional optimization tasks and for optimization problems on non-parametric domains.

**Algorithm 1: RKHS-REMBO**
1. Generate a random bounded linear operator on the RKHS, $T : \mathbb{R}^d \to \mathcal{H}$
2. Define a bounded region $Z \subset \mathbb{R}^d$
3. Set $D_0 = \emptyset$
4. **while** not terminated **do**
5. &nbsp;&nbsp;&nbsp;Select $z_{t+1} = \arg\max_{z \in Z} u(z)$ (maximizing an acquisition function on $Z$)
6. &nbsp;&nbsp;&nbsp;Sample $y_{t+1} = f(T z_{t+1}) + \epsilon_{t+1}$
7. &nbsp;&nbsp;&nbsp;Update the data $D_{t+1} = D_t \cup \{(z_{t+1}, y_{t+1})\}$
8. &nbsp;&nbsp;&nbsp;Tune the kernel hyperparameters of the GP on the domain $Z$
9. **end while**

### Problem Statement

We consider the problem of globally maximizing an unknown objective functional $f : \mathcal{H}_k \to \mathbb{R}$, where $\mathcal{H}_k$ is a reproducing kernel Hilbert space with real-valued reproducing kernel $k : D \times D \to \mathbb{R}$, $D \subset \mathbb{R}^n$, consisting of $\mathrm{span}\{k(x, \cdot) \mid x \in D\}$ and its closure. At each round $t$, a function $h_t \in \mathcal{H}_k$ is selected, and a noisy evaluation $y_t = f(h_t) + \epsilon_t$ is returned, where $\epsilon_t \sim \mathcal{N}(0, \sigma^2)$.

### Naive Approach

This section suggests a simple Bayesian optimization approach to the above problem, using the random embedding idea of (Wang et al. 2016), as described in Algorithm 1; we call it RKHS-REMBO (Bayesian Optimization on RKHS with Random Embedding). Many previous parametric Bayesian optimization approaches might be extended to tackle the above problem; however, we believe that extensions using more sophisticated methods, such as learning a lower-dimensional subspace (Djolonga, Krause, and Cevher 2013), using the Karhunen-Loève theorem (Jorgensen and Song 2007), or kernel learning (Wilson 2014), might be non-trivial.

Given a kernel function $k$, a random bounded linear operator $T$ is generated by first sampling $d$ random functions $h_i$ in the RKHS, $h_i = \sum_{j=1}^{N} \alpha_j k(x_j, \cdot)$, where the $\alpha_j \in \mathbb{R}$ and $x_j \in D$ are sampled randomly. The operator is then formed as $T = [h_1, h_2, \dots, h_d]$. One can consider $T$ an $|\mathcal{H}| \times d$ matrix, where $|\mathcal{H}|$ is the (potentially infinite) dimensionality of $\mathcal{H}$. As $T$ is constructed from a set of RKHS functions, we can easily conclude that $T$ is a bounded operator. Like the original REMBO algorithm, RKHS-REMBO assumes that the unknown functional $f$ has an intrinsically $d$-dimensional structure (where $d$ must be treated as a hyperparameter), rather than dimensionality $|\mathcal{H}|$, which may be infinite. RKHS-REMBO can therefore only find a sub-optimal solution function $h$, one that depends on the fixed set of initially sampled functions $h_i$. However, though this projection may approximate the function domain only crudely, it provides a simple and fast solution.
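The construction of the random operator $T$ is straightforward to sketch. Assuming a one-dimensional input domain and an RBF base kernel (both illustrative choices, as are $N$ and the sampling ranges), each column $h_i$ is a random expansion $\sum_j \alpha_j k(x_j, \cdot)$ and $Tz = \sum_i z_i h_i$ is returned as a callable function:

```python
# A sketch of the random bounded linear operator T : R^d -> H used by
# RKHS-REMBO.  Each column h_i is a random RKHS function, and T z is their
# z-weighted combination, returned as a pointwise-evaluable callable.
import numpy as np

def rbf(x, c, bw=0.5):
    return np.exp(-np.sum((x - c) ** 2, axis=-1) / (2 * bw ** 2))

def sample_operator(d, N=20, domain=(-1.0, 1.0), seed=0):
    rng = np.random.default_rng(seed)
    centers = rng.uniform(*domain, size=(d, N, 1))   # x_j for each h_i (1-D domain)
    alphas = rng.standard_normal((d, N))             # alpha_j for each h_i

    def T(z):
        """Map z in R^d to the RKHS function h = sum_i z_i h_i."""
        def h(x):
            # h(x) = sum_i z_i * sum_j alpha_ij k(x_ij, x)
            return sum(z[i] * (alphas[i] @ rbf(x, centers[i])) for i in range(d))
        return h
    return T

# Usage, as in Algorithm 1: h = sample_operator(d=5)(z); y = f(h) + noise.
```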
### Bayesian Functional Optimization

Our proposed BFO framework is given in Algorithm 2. BFO is built on two choices: i) a GP prior to track the belief over the unknown objective functional $f(h)$, and ii) an acquisition functional $u : \mathcal{H}_k \to \mathbb{R}$.

#### Gaussian Process for Functional Domains

There has been little work on using GPs for functional data; an exception is (Shi and Choi 2011), who define a GP kernel on inputs of parametric form. We instead assume specifically that the input space is an RKHS $\mathcal{H}_k$, which allows us to define a GP kernel directly over functions in $\mathcal{H}_k$. We assume that a Gaussian process on the real separable Hilbert space $\mathcal{H}_k$, with scalar reproducing kernel $k = \langle \cdot, \cdot \rangle_{\mathcal{H}}$, models a prior distribution over the unknown functional $f(h)$, $h \in \mathcal{H}_k$. A stochastic process $f = \{f(h),\, h \in \mathcal{H}_k\}$ defined on a complete probability space $(\Omega, \mathcal{F}, \mu)$ is a Gaussian process if $f$ is a Gaussian family of random variables such that $\mathrm{cov}(f(h), f(g)) = K(h, g)$, where $K : \mathcal{H}_k \times \mathcal{H}_k \to \mathbb{R}$ is required to be a positive semi-definite kernel. The kernel $K(h, g)$, with $h, g \in \mathcal{H}_k$, can be constructed from the RKHS kernel $k$ evaluated at support points in the underlying function domain $D$. We provide two simple functional GP kernels, assuming $h$ and $g$ have the form $h = \sum_{i=1}^{N} \alpha_i k(x_i^{(h)}, \cdot)$ and $g = \sum_{j=1}^{M} \beta_j k(x_j^{(g)}, \cdot)$.

**Polynomial kernel:** $K(h, g) = \big(\langle h, g \rangle_{\mathcal{H}_k} + c_1\big)^{c_2}$, where $c_1, c_2 \ge 0$ are hyperparameters. The kernel $K(h, g)$ can then be computed as
$$K(h, g) = \Big(\sum_{i=1}^{N}\sum_{j=1}^{M} \alpha_i \beta_j\, k\big(x_i^{(h)}, x_j^{(g)}\big) + c_1\Big)^{c_2}.$$

**Stationary kernel:** This kind of kernel involves computing the distance $\|g - h\|^2 = \langle g - h, g - h \rangle_{\mathcal{H}_k}$, which can again be computed from evaluations of $k(\cdot, \cdot)$:
$$\langle g - h, g - h \rangle_{\mathcal{H}_k} = \sum_{i,j} \alpha_i \alpha_j\, k\big(x_i^{(h)}, x_j^{(h)}\big) + \sum_{i,j} \beta_i \beta_j\, k\big(x_i^{(g)}, x_j^{(g)}\big) - 2 \sum_{i,j} \alpha_i \beta_j\, k\big(x_i^{(h)}, x_j^{(g)}\big).$$
For instance, an RBF kernel can easily be constructed from this distance as $K(g, h) = \exp\big(-\|g - h\|_{\mathcal{H}_k}^2 / 2\sigma^2\big)$.

**Posterior update:** With a positive definite kernel $K$, the posterior over $f$ is updated as in the standard GP. For a data set of noisy evaluations $\mathbf{y}_t = [y_1, y_2, \dots, y_t]^\top$ at sampled functions $\{h_1, h_2, \dots, h_t\}$, the posterior is again a GP with mean functional $\mu_t(\cdot)$ and covariance kernel $K_t(\cdot, \cdot)$:
$$\mu_t(h) = \mathbf{k}_t(h)^\top (G_t + \sigma^2 I)^{-1} \mathbf{y}_t,$$
$$K_t(h, h') = K(h, h') - \mathbf{k}_t(h)^\top (G_t + \sigma^2 I)^{-1} \mathbf{k}_t(h'),$$
$$\sigma_t^2(h) = K_t(h, h),$$
where $\mathbf{k}_t(h) = [K(h, h_1), K(h, h_2), \dots, K(h, h_t)]^\top$ and $G_t$ is the $t \times t$ Gram matrix of functional kernels $K(h_i, h_j)$, $i, j \in \{1, 2, \dots, t\}$. Note that this update is an extended, Bayesian interpretation of the non-Bayesian kernel regression method on functional data studied recently by (Kadri et al. 2015).

**Algorithm 2: The BFO framework**
1. Initialize $D_0 = \emptyset$
2. Prior mean functional $\mu_0 \in \mathcal{H}_k$
3. **while** not terminated **do**
4. &nbsp;&nbsp;&nbsp;Select $h_{t+1} = \arg\max_{h \in \mathcal{H}_k} u(h)$ (maximizing the acquisition functional on $\mathcal{H}_k$)
5. &nbsp;&nbsp;&nbsp;Sparsify $h_{t+1}$ to get a compact function $\tilde{h}_{t+1}$
6. &nbsp;&nbsp;&nbsp;Sample $y_{t+1} = f(\tilde{h}_{t+1}) + \epsilon_{t+1}$
7. &nbsp;&nbsp;&nbsp;Update the data $D_{t+1} = D_t \cup \{(\tilde{h}_{t+1}, y_{t+1})\}$
8. &nbsp;&nbsp;&nbsp;Tune the hyperparameters of the kernel $K : \mathcal{H}_k \times \mathcal{H}_k \to \mathbb{R}$
9. **end while**
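Because every quantity above reduces to evaluations of the base kernel $k$, the functional GP kernel is cheap to compute for functions given as finite expansions. A minimal sketch, assuming a squared exponential base kernel and illustrative hyperparameters:

```python
# A sketch of the functional RBF kernel K(h, g) = exp(-||h-g||^2 / (2*sigma2^2))
# for functions represented as h = sum_i a_i k(x_i, .), computed purely from
# base-kernel evaluations as in the text above.
import numpy as np

def base_k(A, B, bw=0.5):                          # k(x, x'), squared exponential
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2 * bw ** 2))

def rkhs_inner(Xh, a, Xg, b):
    """<h, g>_Hk = sum_ij a_i b_j k(x_i^(h), x_j^(g))."""
    return a @ base_k(Xh, Xg) @ b

def functional_rbf(Xh, a, Xg, b, sigma2=1.0):
    """K(h, g) from ||h - g||^2 = <h,h> - 2<h,g> + <g,g>."""
    d2 = (rkhs_inner(Xh, a, Xh, a)
          - 2 * rkhs_inner(Xh, a, Xg, b)
          + rkhs_inner(Xg, b, Xg, b))
    return np.exp(-d2 / (2 * sigma2 ** 2))
```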
#### Acquisition Functionals

We propose three acquisition functionals, probability of improvement, expected improvement, and iGP-UCB, to select the function $h$ at which $f$ is evaluated next. We denote by $h_{\mathrm{best}}$, $y_{\mathrm{best}} = f(h_{\mathrm{best}})$ the best evaluation up to time $t$.

**Probability of Improvement (PI):** $u_{PI}(h) = \Phi(\gamma(h))$, where the functional $\gamma$ is defined as $\gamma(h) = \frac{\mu_t(h) - f(h_{\mathrm{best}})}{\sigma_t(h)}$. As $\Phi(\cdot)$ is monotonically increasing, maximizing PI is equivalent to maximizing $\gamma(h)$.

**Expected Improvement (EI):** $u_{EI}(h) = \sigma_t(h)\big(\gamma(h)\Phi(\gamma(h)) + \phi(\gamma(h))\big)$.

**iGP-UCB:** $u_{UCB}(h) = \mu_t(h) + \beta_t^{1/2} \sigma_t(h)$.

Optimizing these acquisition functionals might be hard. Fortunately, the functional gradient with respect to functions in an RKHS $\mathcal{H}_k$ with reproducing kernel $k$ can be derived analytically. We now compute the functional gradients of the acquisition functionals; specifically, we state them for the iGP-UCB acquisition functional and the RBF kernel $K(h, h') = \exp\big(-\|h - h'\|_{\mathcal{H}_k}^2 / 2\sigma^2\big)$.¹ We use the notion of the Fréchet derivative, a derivative on Banach spaces. Let $V$ and $W$ be Banach spaces, and let $U \subset V$ be an open subset of $V$. A function $f : U \to W$ is called Fréchet differentiable at $h \in U$ if there exists a bounded linear operator $Df|_h : V \to W$ such that
$$\lim_{\|g\|_V \to 0} \frac{\big\|f(h + g) - f(h) - Df|_h(g)\big\|_W}{\|g\|_V} = 0.$$

¹ We refer the reader to the supplementary material for the detailed analytic computation of the functional gradients of the three acquisition functionals.

**Assumption 1.** Each functional kernel $K(h_t, \cdot)$ has a Fréchet derivative $D_{h_t}|_h : \mathcal{H}_k \to \mathbb{R}$ at every $h \in \mathcal{H}_k$.

According to (Chae 1985), when $W$ is the real (or complex) field, the Fréchet derivative becomes a function in $\mathcal{H}_k$, i.e. $Df|_h \in \mathcal{H}_k$ and $Df|_h(g) = \langle Df|_h, g \rangle_{\mathcal{H}_k}$.

**Lemma 1.** The Fréchet derivative at $h \in \mathcal{H}_k$ of the RBF kernel $K(h_t, h) = \exp\big(-\|h_t - h\|_{\mathcal{H}_k}^2 / 2\sigma^2\big)$ is
$$D_{h_t}|_h : g \mapsto \frac{K(h_t, h)}{\sigma^2} \langle h_t - h, g \rangle_{\mathcal{H}_k}.$$

As we can see, the Fréchet derivative $D_{h_t}|_h$ at $h$ of the RBF kernel is a function in $\mathcal{H}_k$ whose support points are the combined support points of $h$ and $h_t$. Specifically, assuming that $h$ and $h_t$ have the representations $h = \sum_i \alpha_i k(x_i, \cdot)$ and $h_t = \sum_i \beta_i k(x_i', \cdot)$, then $D_{h_t}|_h$ is written as
$$D_{h_t}|_h(x) = \frac{K(h_t, h)}{\sigma^2} \Big(\sum_i \beta_i\, k(x_i', x) - \sum_i \alpha_i\, k(x_i, x)\Big),$$
where $x, x_i, x_i' \in \mathbb{R}^n$.

**Lemma 2.** The derivatives of the mean and variance functionals at $h \in \mathcal{H}_k$ are the linear operators $D\mu_t|_h : \mathcal{H}_k \to \mathbb{R}$ and $D\sigma_t^2|_h : \mathcal{H}_k \to \mathbb{R}$ with
$$D\mu_t|_h(g) = D\mathbf{k}_t|_h(g)^\top (G_t + \sigma^2 I)^{-1} \mathbf{y}_t,$$
$$D\sigma_t^2|_h(g) = D_h|_h(g) - 2\, D\mathbf{k}_t|_h(g)^\top (G_t + \sigma^2 I)^{-1} \mathbf{k}_t(h),$$
where $D\mathbf{k}_t|_h$ is the Fréchet derivative of $\mathbf{k}_t(h)$ and $D_h|_h(g)$ is defined in Assumption 1. In addition, $D\mu_t|_h$ and $D\sigma_t^2|_h$ are functions in $\mathcal{H}_k$.

**Proposition 1.** The Fréchet derivative of the UCB acquisition functional is
$$Du_{UCB}|_h = D\mu_t|_h + \frac{\beta_t^{1/2}}{2\,\sigma_t(h)}\, D\sigma_t^2|_h,$$
which is in $\mathcal{H}_k$.

**Optimizing the acquisition functional:** The recursive gradient-update process starts with a randomly initialized function $h^{(0)} \in \mathcal{H}_k$. The function $h$ is then computed iteratively as
$$h^{(l+1)} = h^{(l)} + \alpha_l \big(Du|_{h^{(l)}} + \lambda h^{(l)}\big), \tag{4}$$
where $\alpha_l$ is a step size and $Du|_{h^{(l)}}$ denotes the Fréchet derivative of one of the acquisition functionals $u_{UCB}$, $u_{PI}$, or $u_{EI}$. There are three key insights about this process; a sketch of the resulting update loop follows the list.

1. **The Lagrange function for acquisition-functional optimization (using box constraints):** we assume a boundary on $\mathcal{H}_k$ that keeps global optimization methods from exploring relentlessly. Specifically, $\mathcal{H}_k$ consists of functions bounded by a constant $C$: $\|h\|_{\mathcal{H}_k} \le C$. For a given acquisition functional, we therefore obtain the constrained optimization problem
$$\max_{h \in \mathcal{H}_k} u(h) \quad \text{s.t.} \quad \|h\|_{\mathcal{H}_k} \le C. \tag{5}$$
We form the Lagrange function $\max_{h \in \mathcal{H}_k} u(h) + \frac{\lambda}{2}\big(\|h\|_{\mathcal{H}_k}^2 - C^2\big)$ and propose to optimize it to find the next query function $h$, which yields the functional gradient update of Eq. 4. We treat $\lambda$ as a hyperparameter.

2. Because the Fréchet derivatives of all three acquisition functionals are in $\mathcal{H}_k$, the recursive update in Eq. 4 also yields functions $h^{(l)}$ in $\mathcal{H}_k$, whose representation depends on all support centres of $h_{1:t}$ and $h^{(0)}$. Specifically, assuming $h_j = \sum_{i=1}^{N_j} \alpha_i^{(j)} k(x_i^{(j)}, \cdot)$ for $j \in \{1, 2, \dots, t\}$ and $h^{(0)} = \sum_{i=1}^{N} \alpha_i k(x_i, \cdot)$, after $l$ functional gradient updates of Eq. 4, $h^{(l)}$ has a representation of the form
$$h^{(l)} = \sum_{i=1}^{N} w_{0,i}\, \alpha_i\, k(x_i, \cdot) + \sum_{j=1}^{t} \sum_{i=1}^{N_j} w_{j,i}\, \alpha_i^{(j)}\, k\big(x_i^{(j)}, \cdot\big),$$
where the $w$ are the weights of the function $h^{(l)}$.

3. As the above shows, the representation of the resulting optimal solution $h$ depends on the fixed support centres $x_i^{(j)}$ of the previously sampled functions $h_{1:t}$ and on the centres $x_i$ of the initial function $h^{(0)}$. We therefore propose to initialize $h^{(0)}$ randomly, with a sufficiently large number of samples $N$ and random samples $x_{1:N}$, $\alpha_{1:N}$, giving $h^{(0)} = \sum_{i=1}^{N} \alpha_i k(x_i, \cdot)$.
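Since every Fréchet derivative above lies in $\mathcal{H}_k$ and is supported on the pooled centres of $h^{(0)}$ and $h_{1:t}$, the update of Eq. 4 can be carried out purely on stacked weight vectors. The following is a minimal sketch under that representation; `grad_ucb_weights` and `ucb_value` are hypothetical callables that a surrounding BFO implementation would supply.

```python
# A sketch of the recursive functional-gradient ascent of Eq. (4) for the UCB
# functional, carried out on the weight vector over the fixed pool of support
# centers (those of h^{(0)} and of the sampled h_1..h_t).  All names here are
# illustrative assumptions, not the paper's implementation.
import numpy as np

def ucb_gradient_ascent(w0, grad_ucb_weights, n_steps=100, step=1e-2, lam=0.0):
    """w0: initial weights of h^{(0)} on the pooled centers (zeros for the
    centers contributed by h_1..h_t).  grad_ucb_weights(w) must return the
    weights of Du_UCB|_h in the same pooled representation."""
    w = w0.copy()
    for _ in range(n_steps):
        # Eq. (4): h^{(l+1)} = h^{(l)} + alpha_l (Du|_{h^{(l)}} + lambda h^{(l)})
        w = w + step * (grad_ucb_weights(w) + lam * w)
    return w

def multi_start(starts, grad_ucb_weights, ucb_value):
    """Rerun the ascent from several random h^{(0)} and keep the best optimum."""
    return max((ucb_gradient_ascent(w0, grad_ucb_weights) for w0 in starts),
               key=ucb_value)
```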
We use a multi-start strategy that reruns the above optimization process multiple times with randomly sampled functions $h^{(0)}$ to assure that $h$ is the global maximizer of the acquisition functional. As the final solution $h_{t+1} = h^{(\infty)}$ might be a complex function ($N_{t+1}$ large; Step 4 in Algorithm 2), it might slow down the update of our GP in the RKHS. We therefore suggest sparsifying $h_{t+1}$ before evaluating $f(h_{t+1})$. Our paper uses the kernel matching pursuit algorithm of (Vincent and Bengio 2002) to sparsify $h$ so that it is represented by only $d$ centres instead of $N + \sum_{j=1}^{t} N_j$. Sparsification is Step 5 in Algorithm 2. After each iteration, we tune the hyperparameters of the kernel $K$ (Step 8), currently by maximizing the marginal likelihood.

## Theoretical Results

The PI approach is known to be a heuristic rule, as discussed by (Jones 2001), and the EI approach was recently proved to converge by (Vazquez and Bect 2010) and (Bull 2011) under limited assumptions: a fixed Gaussian process prior of finite smoothness, and known smoothness of $f$, respectively. We therefore provide a theoretical result only for the BFO UCB (iGP-UCB) method, with the cumulative regret $R_T$ as performance metric. (Srinivas et al. 2012) provide cumulative regret bounds that depend on the dimensionality of the input space $\mathbb{R}^n$. We follow their proof and provide a new regret bound suitable for our problem setting. The main difficulty is dealing with the (potentially) infinite-dimensional domain: many results of (Srinivas et al. 2012) hold only under the assumption of finite dimensionality. First we define the maximum information gain $\gamma_T$ after $T$ rounds as
$$\gamma_T = \max_{H \subset \mathcal{H}_k,\, |H| = T} I(\mathbf{y}_H; f_H) = \frac{1}{2} \log \big|I + \sigma^{-2} G_T\big|, \tag{6}$$
where $G_T$ is the covariance matrix $[K(h, h')]$ for $h, h' \in H$. The following theorem bounds the cumulative regret of iGP-UCB.

**Theorem 1.** Define $C_1 = 8/\log(1 + \sigma^{-2})$. Then
$$R_T \le \sqrt{C_1\, T\, \beta_T\, \gamma_T} + \frac{\pi^2}{6}, \quad \forall T \ge 1,$$
with probability greater than $1 - \delta$, where
$$\beta_t = 2 d_0 \log\!\Big(r\, d_0\, b\, t^2\, \pi_t \sqrt{\log(2 d_0 a)/\delta}\Big) + 2 d_0 \log\!\big(1 + 2\epsilon t^2\big),$$
in which $d_0, r, a, b$ are parameters depending on the discretization of the search space and $\{\pi_t\}$ satisfies $\sum_{t \ge 1} 1/\pi_t = 1$, $\pi_t > 0$.

For a sketch of the proof, we use the Stone-Weierstrass theorem twice in order to approximate any functional value $f(h)$, $h \in \mathcal{H}_k$, by a parametric function with a finite-dimensional domain, in two stages. The first stage approximates $f$ using a finite set of basis functions $\{h_i\}_{i=1}^{d}$; this is equivalent to representing $f$ in the parametric form $f(h) \approx \sum_{i=1}^{d} \alpha_i K(h_i, h)$. The second stage approximates each function $h_i$ by a polynomial of degree at most $N$. Any functional $f(h)$ is then parameterized by the cross parameter space of the coefficients of all $d$ polynomials and the $\alpha_i$. Our proof can therefore inherit many results of (Srinivas et al. 2012). A detailed proof is presented in the supplementary material.

## Experiment

We evaluate the advantages of BFO on a range of applications that are also typical application domains for BayesOpt: i) a synthetic functional optimization problem, ii) a hyperparameter optimization problem in which BFO chooses optimal activation functions of a neural network for the MNIST dataset, and iii) policy search in reinforcement learning. We use RBF kernels $k(x, x') = \exp\big(-\|x - x'\|^2 / 2\sigma_1^2\big)$ and $K(h, h') = \exp\big(-\|h - h'\|_{\mathcal{H}_k}^2 / 2\sigma_2^2\big)$, where $\sigma_1, \sigma_2$ are hyperparameters. The GP kernel hyperparameter $\sigma_2$ is tuned by maximizing the marginal data likelihood.
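As a small illustration of Step 8 of Algorithm 2, the bandwidth $\sigma_2$ of the functional kernel can be tuned by minimizing the negative log marginal likelihood of the observed evaluations. The sketch below assumes a `gram` callable that rebuilds the functional Gram matrix $G_t$ for a candidate bandwidth; the log-parameterization and the scalar optimizer are implementation choices, not details from the paper.

```python
# A sketch of tuning the functional GP bandwidth sigma_2 by maximizing the
# marginal likelihood of the observed evaluations y.
import numpy as np
from scipy.optimize import minimize_scalar

def neg_log_marginal_likelihood(log_sigma2, gram, y, noise=0.1):
    G = gram(np.exp(log_sigma2))                 # t x t functional Gram matrix
    A = G + noise ** 2 * np.eye(len(y))
    _sign, logdet = np.linalg.slogdet(A)
    # -log p(y) = 0.5 y^T A^-1 y + 0.5 log|A| + const
    return 0.5 * y @ np.linalg.solve(A, y) + 0.5 * logdet

def tune_bandwidth(gram, y):
    res = minimize_scalar(neg_log_marginal_likelihood, args=(gram, y),
                          bounds=(-5.0, 5.0), method="bounded")
    return np.exp(res.x)                         # best sigma_2 found
```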
### Functional Optimization: Synthetic Problems

We design two tasks, $n = 1$ and $n = 2$, with an unknown target function $h^* : \mathbb{R}^n \to \mathbb{R}$; each target is a mixture of two (multivariate) Gaussians. All optimizers are tasked with minimizing
$$J(h) = \int \big\|h^*(x) - h(x)\big\|^2\, dx + \epsilon \;\approx\; \sum_i \big\|h^*(x_i) - h(x_i)\big\|^2 + \epsilon,$$
a noisy squared distance to the unknown function $h^*$, where $x \in \mathbb{R}^n$ and $\epsilon \sim \mathcal{N}(0, \sigma^2)$. We compare BFO's behavior with baseline methods: standard BayesOpt, RKHS-REMBO (Algorithm 1), and functional gradient descent (which is assumed to know the true function $h^*$ and ignores noise).

**Functional gradient:** Using the functional gradient requires access to the non-noisy ground-truth function $h^*$, from which $J(h)$ can be evaluated approximately as stated above (without the noise $\epsilon$). The functional gradient at $h$ is
$$\nabla_h J(h) = \sum_{i=1}^{N} 2\big(h(x_i) - h^*(x_i)\big)\, k(x_i, \cdot),$$
and the functional gradient update at iteration $l$ is $h^{(l+1)} \leftarrow h^{(l)} - \alpha \nabla_h J(h^{(l)})$. A sparsification technique (Vincent and Bengio 2002) can be used to achieve a compact representation of $h$, which renders the functional gradient approach an adaptive method too: the representation of $h$ is adaptively adapted to best approximate $h^*$. Hence, the discretization must be fine enough to achieve a good approximation. A sketch of this baseline is given after Table 1 below.

**Standard BayesOpt:** We assume a parametric representation of $h$ as a linear expansion of $N$ features, $h(x) = \sum_{i=1}^{N} \theta_i \varphi_i(x)$, with RBF features $\varphi_i(x) = \exp\big(-\|x - x_i\|^2 / \sigma^2\big)$ centered at $N$ center points $x_i$. Standard BayesOpt optimizes over the search space $\{\theta_i, x_i\}_{i=1}^{N}$.

**Results:** For all optimizers (except the functional gradient method), we use the same number $N = 2$ of features and evaluate them on the two corresponding tasks, $n = 1$ and $n = 2$. The bandwidth $\sigma_1$ is set equal to the bandwidth of the Gaussians in the ground-truth function $h^*$. We report the mean squared error (MSE) $J$ in Fig. 1, together with the final best MSE in Table 1, both averaged over 10 runs. The results show that BFO outperforms all other methods (except the functional gradient, which is assumed to know the ground truth). RKHS-REMBO optimizes only over a fixed parameter space, in which it cannot find a good solution. Standard BayesOpt does not have this problem but is unable to deal with the different scaling of the center and weight spaces.

*Figure 1: (left) MSE on the 1D domain; (right) MSE on the 2D domain.*

**Table 1: Synthetic domain: MSE (mean ± standard deviation) of the best evaluation over 10 runs.**

| Methods | 1D Domain | 2D Domain |
|---|---|---|
| Functional Gradient | 1.31e-6 ± 7.0e-5 | 2.9e-6 ± 7.76e-7 |
| BFO UCB | 0.0086 ± 0.0036 | 0.0085 ± 0.002 |
| BFO PI | 0.0026 ± 0.0025 | 0.0058 ± 0.002 |
| BFO EI | 0.0027 ± 0.0014 | 0.0056 ± 0.003 |
| BayesOpt UCB | 0.0195 ± 0.0086 | 0.0143 ± 0.006 |
| BayesOpt PI | 0.0168 ± 0.0088 | 0.0131 ± 0.002 |
| BayesOpt EI | 0.0197 ± 0.0070 | 0.0121 ± 0.002 |
| RKHS-REMBO UCB | 0.0780 ± 4.58e-6 | 0.0246 ± 0.0001 |
| RKHS-REMBO PI | 0.0780 ± 1.38e-5 | 0.0246 ± 7.63e-5 |
| RKHS-REMBO EI | 0.0781 ± 0.0001 | 0.0247 ± 0.0001 |
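For reference, here is a minimal sketch of one step of the functional-gradient baseline on the 1-D task. Because $\nabla_h J(h)$ is itself an RKHS function centered at the evaluation points, the update simply appends those points as new centres; `h_star`, the grid, and all hyperparameters are illustrative assumptions.

```python
# A sketch of one functional-gradient step for the synthetic task:
# grad_h J(h) = sum_i 2 (h(x_i) - h*(x_i)) k(x_i, .), then h <- h - alpha*grad.
# h is kept as a (centers, weights) expansion.
import numpy as np

def k(x, c, bw=0.3):
    return np.exp(-(x - c) ** 2 / (2 * bw ** 2))

def functional_gradient_step(centers, weights, grid, h_star, alpha=0.05):
    h_vals = np.array([weights @ k(x, centers) for x in grid])   # h(x_i)
    resid = 2.0 * (h_vals - h_star(grid))                        # 2 (h - h*)(x_i)
    # The gradient is itself an RKHS function with the grid points as its
    # centers, so the update appends them with weights -alpha * resid.
    new_centers = np.concatenate([centers, grid])
    new_weights = np.concatenate([weights, -alpha * resid])
    return new_centers, new_weights   # sparsify periodically to stay compact
```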
### Hyperparameter Optimization for Neural Networks: Choosing Activation Functions

The MNIST database consists of labeled 28x28-pixel greyscale images of handwritten digits. It contains a test set of 10,000 examples and a training set of 60,000 examples. We train a multilayer perceptron with two hidden layers containing 500 and 300 neurons. The network is trained with the cross-entropy loss and stochastic mini-batch gradient descent with batches of size 100, using TensorFlow with the ADAM optimizer (Kingma and Ba 2015). We use three centers for the parametric methods, and likewise sparsify the functions in BFO to three basis functions (centers, in the parametric view) to ensure a compact representation of the activation functions. We compare BFO to: i) a baseline fixed sigmoid activation function, ii) tunable parametric activation functions (using RBF features), and iii) standard BayesOpt using UCB (GP-UCB). As the objective functional for Bayesian functional optimization we select the cross entropy on the validation set obtained by training the MLP with the query activation function. We report the results in Fig. 2 and Table 2. The results clearly show the benefit of our non-parametric BFO approach, which outperforms the existing parametric approaches: joint training with tunable activation functions and standard BayesOpt.

*Figure 2: MNIST dataset: (left) cross-entropy on the validation set; (right) standard deviations.*

**Table 2: MNIST dataset: cross-entropy (mean ± standard deviation) of the best evaluation over 10 runs.**

| Methods | CE (val. data) | Test CE (best activation) | Test Acc. (best activation) |
|---|---|---|---|
| Fixed Sigmoid | 0.112 ± 0.006 | 0.097 | 97.06% |
| Joint Training | 0.093 ± 0.007 | 0.077 | 97.77% |
| BFO UCB | 0.055 ± 0.005 | 0.060 | 98.06% |
| BFO PI | 0.054 ± 0.004 | 0.058 | 98.14% |
| BFO EI | 0.053 ± 0.002 | 0.059 | 98.26% |
| GP-UCB | 0.070 ± 0.007 | 0.072 | 97.59% |

### Reinforcement Learning by Policy Search: Inverted Pendulum

For simplicity, we assume a Gaussian policy $\pi$ with mean function $h \in \mathcal{H}$, $a = h(s) = \pi(s)$, where $s$ is a state in the state space $D$, and variance $\sigma^2$. In parametric policy approaches, $h$ may be a linear function of predefined features, $h(s) = \theta^\top \Phi(s)$ with $\theta \in \mathbb{R}^N$. For each sample $\theta$, we evaluate $J(\theta) = \mathbb{E}_{\pi(\theta)}\big[\sum_{i=0}^{T} \gamma^i r_i\big]$ using Monte-Carlo simulation: $Z$ trajectories are collected by executing $\pi(\theta)$, and $J(\theta) \approx \frac{1}{Z} \sum_{i=1}^{Z} R(\tau_i)$, where $R(\tau_i)$ is the return of the $i$-th trajectory. A simple application of BayesOpt to policy search is to define a Euclidean kernel in parameter space (Brochu, Cora, and De Freitas 2010). We compare our direct policy search via BFO using iGP-UCB to standard BayesOpt policy search (Lizotte et al. 2007) (optimizing over the search space $\{\theta_i\}_{i=1}^{N}$ with the $s_i$ evenly placed), BayesOpt-A (optimizing over the search space $\{\theta_i, s_i\}_{i=1}^{N}$), CMA-ES (Heidrich-Meisner and Igel 2009), a parametric actor-critic, and the actor-critic in RKHS (RKHS-AC) (Lever and Stafford 2015). In all experiments, we use the RBF kernel with bandwidths set by the median trick. For the inverted pendulum domain we use the same settings as (Lever and Stafford 2015): $N = 16$ centres, i.e. features, for all algorithms, discount factor $\gamma = 0.99$, and horizon $H = 400$. The mean performance and its 95% confidence interval, computed over 10 runs, are reported in Fig. 3. We observe that the local methods, actor-critic and RKHS actor-critic, are not competitive, as their performance improves too slowly. The CMA-ES method is also not very data-efficient: with a limited number of episodes its performance remains non-competitive. On the contrary, all Bayesian optimization methods (iGP-UCB, GP-UCB, RKHS-REMBO, and adaptive BayesOpt) are very competitive; their performance improves quickly and they exploit data very efficiently.

*Figure 3: The Inverted Pendulum domain: discounted cumulative reward for Actor-Critic, CMA-ES, CMA-ES-A, BayesOpt, BayesOpt-A, and RKHS-REMBO.*
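For completeness, a minimal sketch of the Monte-Carlo return estimate $J \approx \frac{1}{Z}\sum_i R(\tau_i)$ that serves as the noisy BayesOpt objective in this experiment. An environment exposing `reset()` and `step(a)` returning `(state, reward, done)` is an assumption here, as are the rollout hyperparameters.

```python
# A sketch of the Monte-Carlo return estimate used as the (noisy) objective:
# roll out the Gaussian policy a ~ N(h(s), sigma^2) Z times and average the
# discounted returns.
import numpy as np

def estimate_return(env, h, sigma=0.1, Z=5, horizon=400, gamma=0.99, seed=0):
    rng = np.random.default_rng(seed)
    returns = []
    for _ in range(Z):
        s = env.reset()
        total, discount = 0.0, 1.0
        for _ in range(horizon):
            a = h(s) + sigma * rng.standard_normal()   # a ~ N(h(s), sigma^2)
            s, r, done = env.step(a)
            total += discount * r
            discount *= gamma
            if done:
                break
        returns.append(total)
    return np.mean(returns)                            # J ~ (1/Z) sum_i R(tau_i)
```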
## Conclusion

This paper proposed BFO, a Bayesian functional optimization framework for global functional optimization. We modeled the function space as a reproducing kernel Hilbert space, which yields both an efficient update of the functional GP and simple optimization of the acquisition functional. Combined with an efficient sparsification method, we attain compact and flexible solutions without slowing down the functional GP update too much. Our experiments show that BFO is very promising and able to represent complex solution functions compactly. Compared to other methods, BFO not only handles functional optimization directly in theory, but in practice also does not need to rely on a predefined set of features, while bypassing the problem of handling different scales in cross parameter spaces that can occur with standard BayesOpt. We believe it can be more straightforward and convenient to separately reason about a suitable RKHS for representing candidate functions and a functional GP kernel for measuring the similarity between those functions than to directly design a parametric kernel for standard BayesOpt in a parameterized task setting.

## References

- Auer, P.; Cesa-Bianchi, N.; and Fischer, P. 2002. Finite-time analysis of the multiarmed bandit problem. Machine Learning 47(2-3):235-256.
- Bagnell, J. A. D., and Schneider, J. 2003. Policy search in reproducing kernel Hilbert space. Technical Report CMU-RI-TR-03-45, Robotics Institute, Pittsburgh, PA.
- Brochu, E.; Cora, V. M.; and De Freitas, N. 2010. A tutorial on Bayesian optimization of expensive cost functions, with application to active user modeling and hierarchical reinforcement learning. arXiv preprint arXiv:1012.2599.
- Bull, A. D. 2011. Convergence rates of efficient global optimization algorithms. JMLR 12(Oct):2879-2904.
- Chae, S. B. 1985. Holomorphy and Calculus in Normed Spaces (Monographs and Textbooks in Pure and Applied Mathematics). Marcel Dekker.
- Cox, D. D., and John, S. 1992. A statistical method for global optimization. In IEEE International Conference on Systems, Man and Cybernetics, 1241-1246. IEEE.
- Deisenroth, M. P.; Neumann, G.; and Peters, J. 2013. A survey on policy search for robotics. Foundations and Trends in Robotics 2(1-2):1-142.
- Djolonga, J.; Krause, A.; and Cevher, V. 2013. High-dimensional Gaussian process bandits. In NIPS, 1025-1033.
- Dong, J.; Mukadam, M.; Dellaert, F.; and Boots, B. 2016. Motion planning as probabilistic inference using Gaussian processes and factor graphs. In RSS.
- Heidrich-Meisner, V., and Igel, C. 2009. Neuroevolution strategies for episodic reinforcement learning. J. Algorithms 64(4):152-168.
- Hennig, P., and Schuler, C. J. 2012. Entropy search for information-efficient global optimization. Journal of Machine Learning Research 13(Jun):1809-1837.
- Jones, D. R.; Schonlau, M.; and Welch, W. J. 1998. Efficient global optimization of expensive black-box functions. J. Global Optimization 13(4):455-492.
- Jones, D. R. 2001. A taxonomy of global optimization methods based on response surfaces. Journal of Global Optimization 21(4):345-383.
- Jorgensen, P. E. T., and Song, M.-S. 2007. Entropy encoding, Hilbert space, and Karhunen-Loève transforms. Journal of Mathematical Physics 48.
- Kadri, H.; Duflos, E.; Preux, P.; Canu, S.; Rakotomamonjy, A.; and Audiffren, J. 2015. Operator-valued kernels for learning from functional response data. Journal of Machine Learning Research 16:1-54.
- Kasim, M. F., and Norreys, P. A. 2016. Infinite dimensional optimistic optimisation with applications on physical systems. arXiv preprint arXiv:1611.05845.
- Kingma, D., and Ba, J. 2015. Adam: A method for stochastic optimization. In ICLR.
- Kushner, H. J. 1964. A new method of locating the maximum point of an arbitrary multipeak curve in the presence of noise. Journal of Basic Engineering 86(1):97-106.
- Lever, G., and Stafford, R. 2015. Modelling policies in MDPs in reproducing kernel Hilbert space. In AISTATS.
- Lizotte, D. J.; Wang, T.; Bowling, M. H.; and Schuurmans, D. 2007. Automatic gait optimization with Gaussian process regression. In IJCAI, 944-949.
- Lizotte, D. J. 2008. Practical Bayesian Optimization. Ph.D. Dissertation, University of Alberta, Edmonton, Alberta, Canada.
- Marinho, Z.; Boots, B.; Dragan, A. D.; Byravan, A.; Gordon, G. J.; and Srinivasa, S. 2016. Functional gradient motion planning in reproducing kernel Hilbert spaces. In RSS.
- Močkus, J. 1975. On Bayesian methods for seeking the extremum. Springer. 400-404.
- Munos, R. 2011. Optimistic optimization of a deterministic function without the knowledge of its smoothness. In NIPS, 783-791.
- Pandey, S.; Chakrabarti, D.; and Agarwal, D. 2007. Multi-armed bandit problems with dependent arms. In ICML, 721-728.
- Rasmussen, C. E. 2006. Gaussian Processes for Machine Learning. MIT Press.
- Shahriari, B.; Wang, Z.; Hoffman, M. W.; Bouchard-Côté, A.; and de Freitas, N. 2014. An entropy search portfolio for Bayesian optimization. arXiv preprint arXiv:1406.4625.
- Shi, J. Q., and Choi, T. 2011. Gaussian Process Regression Analysis for Functional Data. CRC Press, Boca Raton, FL.
- Srinivas, N.; Krause, A.; Kakade, S. M.; and Seeger, M. W. 2012. Information-theoretic regret bounds for Gaussian process optimization in the bandit setting. IEEE Trans. Information Theory 58(5):3250-3265.
- Tyagi, H.; Kyrillidis, A.; Gärtner, B.; and Krause, A. 2016. Learning sparse additive models with interactions in high dimensions. In AISTATS, 111-120.
- Tyagi, H.; Gärtner, B.; and Krause, A. 2014. Efficient sampling for learning sparse additive models in high dimensions. In NIPS, 514-522.
- Vazquez, E., and Bect, J. 2010. Convergence properties of the expected improvement algorithm with fixed mean and covariance functions. Journal of Statistical Planning and Inference 140(11):3088-3095.
- Vien, N. A.; Dang, V.-H.; and Chung, T. 2017. A covariance matrix adaptation evolution strategy for direct policy search in reproducing kernel Hilbert space. In ACML.
- Vien, N. A.; Englert, P.; and Toussaint, M. 2016. Policy search in reproducing kernel Hilbert space. In IJCAI, 2089-2096.
- Vincent, P., and Bengio, Y. 2002. Kernel matching pursuit. Machine Learning 48(1-3):165-187.
- Wang, Z.; Hutter, F.; Zoghi, M.; Matheson, D.; and de Freitas, N. 2016. Bayesian optimization in a billion dimensions via random embeddings. JAIR 55:361-387.
- Wilson, A. G. 2014. Covariance Kernels for Fast Automatic Pattern Discovery and Extrapolation with Gaussian Processes. Ph.D. Dissertation, University of Cambridge, UK.