Multi-fidelity Bayesian Optimisation with Continuous Approximations

Kirthevasan Kandasamy 1, Gautam Dasarathy 2, Jeff Schneider 1, Barnabás Póczos 1

Abstract

Bandit methods for black-box optimisation, such as Bayesian optimisation, are used in a variety of applications including hyper-parameter tuning and experiment design. Recently, multi-fidelity methods have garnered considerable attention since function evaluations have become increasingly expensive in such applications. Multi-fidelity methods use cheap approximations to the function of interest to speed up the overall optimisation process. However, most multi-fidelity methods assume only a finite number of approximations. On the other hand, in many practical applications, a continuous spectrum of approximations might be available. For instance, when tuning an expensive neural network, one might choose to approximate the cross validation performance using less data N and/or fewer training iterations T. Here, the approximations are best viewed as arising out of a continuous two dimensional space (N, T). In this work, we develop a Bayesian optimisation method, BOCA, for this setting. We characterise its theoretical properties and show that it achieves better regret than strategies which ignore the approximations. BOCA outperforms several other baselines in synthetic and real experiments.

1. Introduction

Many tasks in scientific and engineering applications can be framed as bandit optimisation problems, where we need to sequentially evaluate a noisy black-box function f : X → R with the goal of finding its optimum. Some applications include hyper-parameter tuning in machine learning (Hutter et al., 2011; Snoek et al., 2012), optimal policy search (Lizotte et al., 2007; Martinez-Cantin et al., 2007) and scientific experiments (Gonzalez et al., 2014; Parkinson et al., 2006).
*Equal contribution. 1Carnegie Mellon University, Pittsburgh PA, USA. 2Rice University, Houston TX, USA. Correspondence to: Kirthevasan Kandasamy. Proceedings of the 34th International Conference on Machine Learning, Sydney, Australia, PMLR 70, 2017. Copyright 2017 by the author(s).

Typically, in such applications, each function evaluation is expensive, and conventionally, the bandit literature has focused on developing methods for finding the optimum while keeping the number of evaluations of f at a minimum. However, with increasingly expensive function evaluations, conventional methods have become infeasible as a significant cost needs to be expended before we can learn anything about f. As a result, multi-fidelity optimisation methods have recently gained attention (Cutler et al., 2014; Kandasamy et al., 2016a; Li et al., 2016). As the name suggests, these methods assume that we have access to lower fidelity approximations to f which can be evaluated instead of f. The lower the fidelity, the cheaper the evaluation, but it provides less accurate information about f. For example, when optimising the configuration of an expensive real world robot, its performance can be approximated using cheaper computer simulations. The goal is to use the cheap approximations to guide the search for the optimum of f, and reduce the overall cost of optimisation. However, most multi-fidelity work assumes only a finite number of approximations. In this paper, we study multi-fidelity optimisation when we have access to a continuous spectrum of approximations. To motivate this set up, consider tuning a classification algorithm over a space of hyper-parameters X by maximising a validation set accuracy. The algorithm is to be trained using N• data points via an iterative algorithm for T• iterations. However, we wish to use fewer training points N < N• and/or fewer iterations T < T• to approximate the validation accuracy.
We can view validation accuracy as a function g : [1, N•] × [1, T•] × X → R, where evaluating g(N, T, x) requires training the algorithm with N points for T iterations with the hyper-parameters x. If the training complexity of the algorithm is quadratic in data size and linear in the number of iterations, then the cost of this evaluation is λ(N, T) = O(N²T). Our goal is to find the optimum when N = N• and T = T•, i.e. we wish to maximise f(x) = g(N•, T•, x). In this setting, while N, T are technically discrete choices, they are more naturally viewed as coming from a continuous 2 dimensional fidelity space, [1, N•] × [1, T•]. One might hope that cheaper queries to g(N, T, ·) with N, T less than N•, T• can be used to learn about g(N•, T•, ·) and consequently optimise it using less overall cost. Indeed, this is the case with many machine learning algorithms, where cross validation performance tends to vary smoothly with data set size and number of iterations. Therefore, one may use cheap low fidelity experiments with small (N, T) to discard bad hyper-parameters and deploy expensive high fidelity experiments with large (N, T) only in a small but promising region. The main theoretical result of this paper (Theorem 1) shows that our proposed algorithm, BOCA, exhibits precisely this behaviour. Continuous approximations also arise in simulation studies, where simulations can be carried out at varying levels of granularity; on-line advertising, where an ad can be controlled by continuous parameters such as display time or target audience; and several other experiment design tasks. In fact, in many multi-fidelity papers, the finite approximations were obtained by discretising a continuous space (Huang et al., 2006; Kandasamy et al., 2016a). Here, we study a Bayesian optimisation technique that is directly designed for continuous fidelity spaces and is potentially applicable to more general spaces. Our main contributions are, 1.
A novel setting and model for multi-fidelity optimisation with continuous approximations using Gaussian process (GP) assumptions. We develop a novel algorithm, BOCA, for this setting. 2. A theoretical analysis characterising the behaviour and regret bound for BOCA. 3. An empirical study which demonstrates that BOCA outperforms alternatives, both multi-fidelity and otherwise, on a series of synthetic problems and real examples in hyper-parameter tuning and astrophysics.

Related Work

Bayesian optimisation (BO) refers to a suite of techniques for bandit optimisation which use a prior belief distribution for f. While there are several techniques for BO (de Freitas et al., 2012; Hernández-Lobato et al., 2014; Jones et al., 1998; Mockus, 1994; Thompson, 1933), our work builds on the Gaussian process upper confidence bound (GP-UCB) algorithm of Srinivas et al. (2010). GP-UCB models f as a GP and uses upper confidence bound (UCB) (Auer, 2003) techniques to determine the next point for evaluation. BO techniques have been used to develop multi-fidelity optimisation methods in various applications such as hyper-parameter tuning and industrial design (Forrester et al., 2007; Huang et al., 2006; Klein et al., 2015; Lam et al., 2015; Poloczek et al., 2016; Swersky et al., 2013). However, these methods are either problem specific and/or only use a finite number of fidelities. Further, none of them come with theoretical underpinnings. Recent work has studied multi-fidelity methods for specific problems such as hyper-parameter tuning, active learning and reinforcement learning (Agarwal et al., 2011; Cutler et al., 2014; Li et al., 2016; Sabharwal et al., 2015; Zhang & Chaudhuri, 2015). While some of the above tasks can be framed as optimisation problems, the methods themselves are specific to the problem considered. Our method is more general as it applies to any bandit optimisation task. Perhaps the closest work to ours is that of Kandasamy et al.
(2016a;b;c), who developed MF-GP-UCB assuming a finite number of approximations to f. While this line of work was the first to provide theoretical guarantees for multi-fidelity optimisation, it has two important shortcomings. First, they make strong assumptions, particularly a uniform bound on the difference between the expensive function and an approximation. This does not allow for instances where an approximation might be good in certain regions but not in others. In contrast, our probabilistic treatment of the relation between fidelities is robust to such cases. Second, their model does not allow sharing information between fidelities; each approximation is treated independently. Not only is this wasteful, as lower fidelities can provide useful information about higher fidelities, it also means that the algorithm might perform poorly if the fidelities are not designed properly. We demonstrate this with an experiment in Section 4. On the other hand, our model allows sharing information across the fidelity space in a natural way. In addition, we can also handle continuous approximations, whereas their method is strictly for a finite number of approximations. That said, BOCA inherits a key intuition from MF-GP-UCB, which is to choose a fidelity only if we have sufficiently reduced the uncertainty at all lower fidelities. Besides this, there are considerable differences in the mechanics of the algorithm and proof techniques. As we proceed, we will draw further comparisons to Kandasamy et al. (2016a).

2. Preliminaries

2.1. Some Background Material

Gaussian processes: A GP over a space X is a random process from X to R. GPs are typically used as a prior for functions in Bayesian nonparametrics. A GP is characterised by a mean function μ : X → R and a covariance function (or kernel) κ : X² → R. If f ∼ GP(μ, κ), then f(x) is distributed normally as N(μ(x), κ(x, x)) for all x ∈ X.
Suppose that we are given n observations Dₙ = {(x_i, y_i)}_{i=1}^n from this GP, where x_i ∈ X, y_i = f(x_i) + ε_i ∈ R and ε_i ∼ N(0, η²). Then the posterior process f | Dₙ is also a GP with mean μ_n and covariance κ_n given by

μ_n(x) = kᵀ(K + η²I)⁻¹Y,    (1)
κ_n(x, x′) = κ(x, x′) − kᵀ(K + η²I)⁻¹k′.

Here Y ∈ Rⁿ is a vector with Y_i = y_i, and k, k′ ∈ Rⁿ are such that k_i = κ(x, x_i), k′_i = κ(x′, x_i). The matrix K ∈ Rⁿˣⁿ is given by K_{i,j} = κ(x_i, x_j). We refer the reader to chapter 2 of Rasmussen & Williams (2006) for more on the basics of GPs and their use in regression.

Figure 1. Samples drawn from a GP with zero mean and an SE kernel with bandwidths h = 0.01, h = 0.15 and h = 0.5. Samples tend to be smoother across the domain for large bandwidths.

Radial kernels: The prior covariance functions of GPs are typically taken to be radial kernels; some examples are the squared exponential (SE) and Matérn kernels. Using a radial kernel means that the prior covariance can be written as κ(x, x′) = κ₀ φ(‖x − x′‖) and depends only on the distance between x and x′. Here, the scale parameter κ₀ captures the magnitude by which f could deviate from μ. The function φ : R₊ → R₊ is a decreasing function with ‖φ‖∞ = φ(0) = 1. In this paper, we will use the SE kernel in a running example to convey the intuitions behind our methods. For the SE kernel, φ(r) = φ_h(r) = exp(−r²/(2h²)), where h ∈ R₊, called the bandwidth of the kernel, controls the smoothness of the GP. When h is large, the samples drawn from the GP tend to be smoother, as illustrated in Fig. 1. We will reference this observation frequently in the text.

GP-UCB: The Gaussian Process Upper Confidence Bound (GP-UCB) algorithm of Srinivas et al. (2010) is a method for bandit optimisation which, like many other BO methods, models f as a sample from a Gaussian process. At time t, the next point x_t for evaluating f is chosen via the following procedure.
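As a brief aside before detailing the procedure, the posterior formulae in (1) can be implemented in a few lines. The sketch below is illustrative only; the SE kernel, bandwidth and noise values are arbitrary choices for the example, not values prescribed by the paper:

```python
import numpy as np

def se_kernel(A, B, kappa0=1.0, h=0.15):
    # SE kernel: kappa(x, x') = kappa0 * exp(-||x - x'||^2 / (2 h^2))
    d2 = np.sum((A[:, None, :] - B[None, :, :]) ** 2, axis=-1)
    return kappa0 * np.exp(-d2 / (2 * h ** 2))

def gp_posterior(X_obs, y, x, eta2=1e-2, kern=se_kernel):
    # Posterior mean and variance at a single point x, per equation (1).
    K = kern(X_obs, X_obs)                             # K_ij = kappa(x_i, x_j)
    k = kern(X_obs, x[None, :])[:, 0]                  # k_i = kappa(x, x_i)
    w = np.linalg.solve(K + eta2 * np.eye(len(y)), k)  # (K + eta^2 I)^{-1} k
    mu = w @ y                                         # k^T (K + eta^2 I)^{-1} Y
    var = kern(x[None, :], x[None, :])[0, 0] - w @ k   # kappa(x,x) - k^T (...)^{-1} k
    return mu, var
```

At an observed point with small noise the posterior mean reverts to the observed value, and the posterior variance shrinks below its prior value, as the formulae predict.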
First, we construct an upper confidence bound ϕ_t(x) = μ_{t−1}(x) + β_t^{1/2} σ_{t−1}(x) for the GP. Here μ_{t−1} is the posterior mean of the GP conditioned on the previous t − 1 evaluations and σ_{t−1} is the posterior standard deviation. Following other UCB algorithms (Auer, 2003), the next point is chosen by maximising ϕ_t, i.e. x_t = argmax_{x∈X} ϕ_t(x). The μ_{t−1} term encourages an exploitative strategy, in that we want to query regions where we already believe f is high, and σ_{t−1} encourages an exploratory strategy, in that we want to query where we are uncertain about f so that we do not miss regions which have not been queried yet. β_t, which is typically increasing with t, controls the trade-off between exploration and exploitation. We have provided a brief review of GP-UCB in Appendix A.1.

2.2. Problem Set Up

Our goal in bandit optimisation is to maximise a function f : X → R over a domain X. When we evaluate f at x ∈ X we observe y = f(x) + ε where E[ε] = 0. Let x⋆ ∈ argmax_{x∈X} f(x) be a maximiser of f and f⋆ = f(x⋆) be the maximum value. An algorithm for bandit optimisation is a sequence of points {x_t}_{t≥0}, where, at time t, the algorithm chooses to evaluate f at x_t based on previous queries and observations {(x_i, y_i)}_{i=1}^{t−1}. After n queries to f, its goal is to achieve small simple regret S_n, as defined below.

S_n = min_{t=1,…,n} f⋆ − f(x_t).    (2)

Figure 2. g : Z × X → R is a function defined on the product space of the fidelity space Z and domain X. The purple line is f(x) = g(z•, x). We wish to find the maximiser x⋆ ∈ argmax_{x∈X} f(x). The multi-fidelity framework is attractive when g is smooth across Z as illustrated in the figure.

Continuous Approximations: In this work, we will let f be a slice of a function g that lies in a larger space. Precisely, we will assume the existence of a fidelity space Z and a function g : Z × X → R defined on the product space of the fidelity space and domain. The function f which we wish to maximise is related to g via f(·) = g(z•, ·), where z• ∈ Z.
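Returning briefly to the GP-UCB rule from Section 2.1: on a finite candidate grid the acquisition step reduces to a one-liner. The sketch below (with made-up posterior values) illustrates how β_t arbitrates between exploitation and exploration:

```python
import numpy as np

def ucb_argmax(mu, sigma, beta_t):
    # phi_t(x) = mu_{t-1}(x) + sqrt(beta_t) * sigma_{t-1}(x), maximised over a grid.
    phi = mu + np.sqrt(beta_t) * sigma
    return int(np.argmax(phi))
```

For instance, with posterior means [0.9, 0.5] and standard deviations [0.0, 1.0], a small β_t = 0.01 picks the first (exploitative) candidate while β_t = 4 picks the second (exploratory) one.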
For instance, in the hyper-parameter tuning example from Section 1, Z = [1, N•] × [1, T•] and z• = [N•, T•]. Our goal is to find a maximiser x⋆ ∈ argmax_x f(x) = argmax_x g(z•, x). We have illustrated this set up in Fig. 2. In the rest of the manuscript, the term fidelities will refer to points z in the fidelity space Z. The multi-fidelity framework is attractive when the following two conditions hold for the problem. 1. There exist fidelities z ∈ Z where evaluating g is cheaper than evaluating at z•. To this end, we will associate a known cost function λ : Z → R₊. In the hyper-parameter tuning example, λ(z) = λ(N, T) = O(N²T). It is helpful to think of z• as the most expensive fidelity, i.e. the maximiser of λ, and of λ(z) as decreasing as we move away from z•. However, this notion is not strictly necessary for our algorithm or results. 2. A cheap evaluation of g(z, ·) gives us information about g(z•, ·). This is true if g is smooth across the fidelity space, as illustrated in Fig. 2. As we will describe shortly, this smoothness can be achieved by modelling g as a GP with an appropriate kernel for the fidelity space Z. In the above set up, a multi-fidelity algorithm is a sequence of query-fidelity pairs {(z_t, x_t)}_{t≥0}, where, at time t, the algorithm chooses z_t ∈ Z and x_t ∈ X, and observes y_t = g(z_t, x_t) + ε where E[ε] = 0. The choice of (z_t, x_t) can of course depend on the previous fidelity-query-observation triples {(z_i, x_i, y_i)}_{i=1}^{t−1}.

Multi-fidelity Simple Regret: We provide bounds on the simple regret S(Λ) of a multi-fidelity optimisation method after it has spent capital Λ of a resource. Following Kandasamy et al. (2016a); Srinivas et al. (2010), we will aim to provide any-capital bounds, meaning that an algorithm would be expected to do well for all values of (sufficiently large) Λ. Say we have made N queries to g within capital Λ, i.e. N is the random quantity such that N = max{n ≥ 1 : Σ_{t=1}^n λ(z_t) ≤ Λ}.
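The capital accounting above can be sketched directly: N is simply the longest prefix of the query sequence whose costs fit within Λ. The helper below is our own illustration (returning 0 when even the first query exceeds the capital):

```python
def num_queries(costs, capital):
    # N = max{ n >= 1 : sum_{t=1}^n lambda(z_t) <= capital },
    # where costs[t] = lambda(z_t) for the fidelities queried in order.
    total, n = 0.0, 0
    for c in costs:
        if total + c > capital:
            break
        total += c
        n += 1
    return n
```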
While the cheap evaluations at z ≠ z• are useful in guiding the search for the optimum of g(z•, ·), there is no reward for optimising a cheaper g(z, ·). Accordingly, we define the simple regret after capital Λ as,

S(Λ) = min_{t ∈ {1,…,N} : z_t = z•} f⋆ − f(x_t) if we have queried at z•, and S(Λ) = +∞ otherwise.

This definition reduces to the single fidelity definition (2) when we only query g at z•. It is also similar to the definition in Kandasamy et al. (2016a), but unlike them, we do not impose additional boundedness constraints on f or g. Before we proceed, we note that it is customary in the bandit literature to analyse cumulative regret. However, the definition of cumulative regret depends on the application at hand (Kandasamy et al., 2016c), and the results in this paper can be extended to many sensible notions of cumulative regret. However, both to simplify exposition and since our focus in this paper is optimisation, we stick to simple regret.

Assumptions: As we will be primarily focusing on continuous and compact domains and fidelity spaces, going forward we will assume, without any loss of generality, that X = [0, 1]^d and Z = [0, 1]^p. We discuss non-continuous settings briefly at the end of Section 3. In keeping with similar work in the Bayesian optimisation literature, we will assume g ∼ GP(0, κ), and upon querying at (z, x) we observe y = g(z, x) + ε where ε ∼ N(0, η²). Here κ : (Z × X)² → R is the prior covariance defined on the product space. In this work, we will exclusively study κ of the following form,

κ([z, x], [z′, x′]) = κ₀ φ_Z(‖z − z′‖) φ_X(‖x − x′‖).    (3)

Here, κ₀ ∈ R₊ is the scale parameter and φ_Z, φ_X are radial kernels defined on Z, X respectively. The fidelity space kernel φ_Z is an important component in this work. It controls the smoothness of g across the fidelity space and hence determines how much information the lower fidelities provide about g(z•, ·). For example, suppose that φ_Z is a SE kernel.
A favourable setting for a multi-fidelity method would be for φ_Z to have a large bandwidth h_Z, as that would imply that g is very smooth across Z. We will see that h_Z determines the behaviour and theoretical guarantees of BOCA in a natural way when φ_Z is the SE kernel. To formalise this notion, we will define the following information gap function ξ : Z → [0, 1],

ξ(z) = √(1 − φ_Z(‖z − z•‖)²).    (4)

One interpretation of ξ(z) is that it measures the gap in information about g(z•, ·) when we query at z ≠ z•. That is, it is the price we have to pay, in information, for querying at a cheap fidelity. Observe that ξ increases as we move away from z• in the fidelity space. For the SE kernel, it can be shown¹ that ξ(z) ≲ ‖z − z•‖/h_Z. For large h_Z, g is smoother across Z and we can expect the lower fidelities to be more informative about f; as expected, the information gap ξ is small for large h_Z. If h_Z is small and g is not smooth, the gap ξ is large and lower fidelities are not as informative. Before we present our algorithm for the above set up, we will introduce notation for the posterior GPs for g and f. Let Dₙ = {(z_i, x_i, y_i)}_{i=1}^n be n fidelity, query, observation values from the GP g, where y_i was observed when evaluating g(z_i, x_i). We will denote the posterior mean and standard deviation of g conditioned on Dₙ by ν_n and τ_n respectively (ν_n, τ_n can be computed from (1) by replacing x with [z, x]). Therefore g(z, x) | Dₙ ∼ N(ν_n(z, x), τ_n²(z, x)) for all (z, x) ∈ Z × X. We will further denote

μ_n(·) = ν_n(z•, ·),  σ_n(·) = τ_n(z•, ·),    (5)

the posterior mean and standard deviation of g(z•, ·) = f(·). It follows that f | Dₙ is also a GP and satisfies f(x) | Dₙ ∼ N(μ_n(x), σ_n²(x)) for all x ∈ X.

3. BOCA: Bayesian Optimisation with Continuous Approximations

BOCA is a sequential strategy which selects a domain point x_t ∈ X and fidelity z_t ∈ Z at time t based on previous observations. At time t, we first construct an upper confidence bound ϕ_t for the function f we wish to optimise. It takes the form,

ϕ_t(x) = μ_{t−1}(x) + β_t^{1/2} σ_{t−1}(x).    (6)
Recall from (5) that μ_{t−1} and σ_{t−1} are the posterior mean and standard deviation of f using the observations from the previous t − 1 time steps at all fidelities, i.e. the entire Z × X space. We will specify β_t in Theorems 1 and 8. Following other UCB algorithms, our next point x_t in the domain X for evaluating g is a maximiser of ϕ_t, i.e. x_t ∈ argmax_{x∈X} ϕ_t(x). Next, we need to determine the fidelity z_t ∈ Z at which to query g. For this we first select a subset Z_t(x_t) of Z as follows,

Z_t(x_t) = { z ∈ Z : λ(z) < λ(z•), τ_{t−1}(z, x_t) > γ(z), ξ(z) > β_t^{−1/2} ξ̄ },    (7)
where γ(z) = √κ₀ ξ(z) (λ(z)/λ(z•))^q.

Here, ξ is the information gap function in (4), ξ̄ denotes its largest value over Z, τ_{t−1} is the posterior standard deviation of g, and p, d are the dimensionalities of Z, X. The exponent q depends on the kernel used for φ_Z; e.g. for the SE kernel, q = 1/(p + d + 2). We filter the fidelities we consider at time t using the three conditions above, which we elaborate on in more detail in Section 3.1. If Z_t is not empty, we choose the cheapest fidelity in this set, i.e. z_t ∈ argmin_{z∈Z_t} λ(z). If Z_t is empty, we choose z_t = z•. We have summarised the resulting procedure below in Algorithm 1. An important advantage of BOCA is that it only requires specifying the GP hyper-parameters for g, such as the kernel κ. In practice, this can be achieved by various effective heuristics, such as maximising the GP marginal likelihood or cross validation, which are standard in most BO methods. In contrast, MF-GP-UCB of Kandasamy et al. (2016a) requires tuning several other hyper-parameters.

Algorithm 1 BOCA
Input: kernel κ.
Set ν₀(·) ← 0, τ₀(·) ← κ(·, ·)^{1/2}, D₀ ← ∅.
for t = 1, 2, …
  1. x_t ← argmax_{x∈X} ϕ_t(x). (See (6).)
  2. z_t ← argmin_{z ∈ Z_t(x_t) ∪ {z•}} λ(z). (See (7).)
  3. y_t ← Query g at (z_t, x_t).
  4. D_t ← D_{t−1} ∪ {(z_t, x_t, y_t)}.
Update the posterior mean ν_t and standard deviation τ_t for g conditioned on D_t.

3.1. Fidelity Selection Criterion

We will now provide an intuitive justification for the three conditions in the selection criterion for z_t, i.e. equation (7). The first condition, λ(z) < λ(z•), is fairly obvious; since we wish to optimise g(z•, ·) and since we are not rewarded for queries at other fidelities, there is no reason to consider fidelities that are more expensive than z•. The second condition, τ_{t−1}(z, x_t) > γ(z), says that we will only consider fidelities where the posterior variance is larger than a threshold γ(z) = √κ₀ ξ(z) (λ(z)/λ(z•))^q, which depends critically on two quantities, the cost function λ and the information gap ξ. As a first step towards parsing this condition, observe that a reasonable multi-fidelity strategy should be inclined to query cheap fidelities and learn about g before querying expensive fidelities. As γ(z) is monotonically increasing in λ(z), it becomes easier for a cheap z to satisfy τ_{t−1}(z, x_t) > γ(z) and be included in Z_t at time t. Moreover, since we choose z_t to be the minimiser of λ in Z_t, a cheaper fidelity will always be chosen over expensive ones if included in Z_t. Second, if a particular fidelity z is far away from z•, it probably contains less information about g(z•, ·). Again, a reasonable multi-fidelity strategy should be discouraged from making such queries. This is precisely the role of the information gap ξ, which is increasing with ‖z − z•‖. As z moves away from z•, γ(z) increases and it becomes harder to satisfy τ_{t−1}(z, x_t) > γ(z). Therefore, such a z is less likely to be included in Z_t(x_t) and be considered for evaluation. Our analysis reveals that setting γ as in (7) is a reasonable trade-off between cost and information in the approximations available to us; cheaper fidelities cost less, but provide less accurate information about the function f we wish to optimise.
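The fidelity selection rule in (7) can be sketched as a filter over candidate fidelities. This is a simplified illustration of the rule rather than the paper's implementation: `tau_prev`, `lam` and `xi` are supplied as callables, `information_gap` implements (4) for an SE fidelity kernel, and `xi_max` stands in for the maximal information gap over Z:

```python
import numpy as np

def information_gap(z, z_star, hZ=1.0):
    # Equation (4) with an SE fidelity kernel:
    # xi(z) = sqrt(1 - phi_Z(||z - z_star||)^2), phi_h(r) = exp(-r^2/(2 h^2)).
    phi = np.exp(-np.linalg.norm(np.asarray(z) - np.asarray(z_star)) ** 2
                 / (2 * hZ ** 2))
    return np.sqrt(1.0 - phi ** 2)

def select_fidelity(cands, x_t, tau_prev, lam, xi, beta_t, z_star,
                    kappa0=1.0, q=0.25, xi_max=1.0):
    # Equation (7): keep candidates z that are
    #  (i)   strictly cheaper than z*:      lam(z) < lam(z_star),
    #  (ii)  still uncertain:               tau_{t-1}(z, x_t) > gamma(z), with
    #        gamma(z) = sqrt(kappa0) * xi(z) * (lam(z)/lam(z_star))**q,
    #  (iii) outside a shrinking neighbourhood of z*: xi(z) > xi_max / sqrt(beta_t);
    # then query the cheapest qualifying fidelity, falling back to z*.
    Zt = [z for z in cands
          if lam(z) < lam(z_star)
          and tau_prev(z, x_t) > np.sqrt(kappa0) * xi(z) * (lam(z) / lam(z_star)) ** q
          and xi(z) > xi_max / np.sqrt(beta_t)]
    return min(Zt, key=lam) if Zt else z_star
```

When uncertainty is high everywhere the cheapest admissible fidelity is queried; once the cheap fidelities are well resolved (small τ), the rule falls back to z•, matching the intuition in the text.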
It is worth noting that the second condition is similar in spirit to Kandasamy et al. (2016a), who proceed from a lower to a higher fidelity only when the lower fidelity variance is smaller than a threshold. However, while they treat the threshold as a hyper-parameter, we are able to explicitly specify theoretically motivated values. The third condition in (7) is ξ(z) > ξ̄/β_t^{1/2}. Since ξ increases as we move away from z•, it says we should exclude fidelities inside a (small) neighbourhood of z•. Recall that if Z_t is empty, BOCA will choose z• by default. But when it is not empty, we want to prevent situations where we get arbitrarily close to z• but do not actually query at z•. Such pathologies can occur when we are dealing with a continuum of fidelities, and this condition forces BOCA to pick z• instead of querying very close to it. Observe that since β_t is increasing with t, this neighbourhood shrinks with time, and therefore the algorithm will eventually have the opportunity to evaluate fidelities close to z•.

3.2. Theoretical Results

We now present our main theoretical contributions. In order to simplify the exposition and convey the gist of our results, we will only present a simplified version of our theorems. We will suppress constants, polylog terms, and other technical details that arise due to a covering argument in our proofs. A rigorous treatment is available in Appendix B.

Maximum Information Gain: Up until this point, we have not discussed much about the kernel φ_X of the domain X. Since we are optimising f over X, it is natural to expect that this kernel will appear in the bounds. Srinivas et al. (2010) showed that the statistical difficulty of GP bandits is determined by the Maximum Information Gain (MIG), which measures the maximum information a subset of observations has about f. We denote it by Ψ_n(A), where A is a subset of X and n is the number of queries to f.
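Concretely, for n observations with Gaussian noise, the information gain of a set A of query points is I(y_A; f) = ½ log det(I + η⁻² K_A), where K_A is the kernel matrix of A (Srinivas et al., 2010); the MIG maximises this over subsets of size n. The inner quantity is straightforward to compute:

```python
import numpy as np

def information_gain(K, eta2):
    # I(y_A; f) = 0.5 * log det(I + eta^{-2} K_A)
    # for observations y_A with kernel matrix K_A and noise variance eta^2.
    n = K.shape[0]
    sign, logdet = np.linalg.slogdet(np.eye(n) + K / eta2)
    return 0.5 * logdet
```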
We refer the reader to Appendix B for a formal definition of the MIG. For the current exposition, it suffices to know that for radial kernels, Ψ_n(A) increases with n and the volume vol(A) of A. For instance, when we use an SE kernel for φ_X, we have Ψ_n(A) ≲ vol(A) log(n)^{d+1}, and for a Matérn kernel with smoothness parameter ν, Ψ_n(A) ≲ vol(A) n^{d(d+1)/(2ν+d(d+1))} (Srinivas et al., 2010). Let n_Λ = Λ/λ(z•) denote the number of queries by a single fidelity algorithm within capital Λ. Srinivas et al. (2010) showed that the simple regret S(Λ) for GP-UCB after capital Λ can be bounded by,

Simple Regret for GP-UCB:  S(Λ) ≲ √(Ψ_{n_Λ}(X)/n_Λ).    (8)

In our analysis of BOCA we show that most queries to g at fidelity z• will be confined to a small subset of X which contains the optimum x⋆. Precisely, after capital Λ, for any α ∈ (0, 1), we show there exists ρ > 0 such that the number of queries outside the following set X_ρ is less than n_Λ^α.

X_ρ = {x ∈ X : f⋆ − f(x) ≤ 2ρ √κ₀ ξ̄}.    (9)

Here, ξ̄ is the largest value of the information gap (4) over Z. While it is true that any optimisation algorithm would eventually query extensively in a neighbourhood around the optimum, a strong result of the above form is not always possible. For instance, for GP-UCB, the best achievable bound on the number of queries in any set that does not contain x⋆ is n_Λ^{1/2}. The fact that X_ρ exists relies crucially on the multi-fidelity assumptions and on the fact that our algorithm leverages information from lower fidelities when querying at z•. As ξ is small when g is smooth across Z, the set X_ρ will be small when the approximations are highly informative about g(z•, ·). For example, when φ_Z is a SE kernel, we have X_ρ ⊆ {x ∈ X : f⋆ − f(x) ≤ 2ρ √(κ₀ p)/h_Z}. When h_Z is large and g is smooth across Z, X_ρ is small, as the right side of the inequality is smaller. As BOCA confines most of its evaluations to this small set containing x⋆, we are able to achieve much better regret than GP-UCB.
When h_Z is small and g is not smooth across Z, the set X_ρ becomes large and the advantage of multi-fidelity optimisation diminishes. One can similarly argue that for the Matérn kernel, as the parameter ν increases, g becomes smoother across Z, and X_ρ becomes smaller, yielding better bounds on the regret. Below, we provide an informal statement of our main theoretical result. ≲, ≍ will denote inequality and equality ignoring constant and polylog terms.

Theorem 1 (Informal, Regret of BOCA). Let g ∼ GP(0, κ) where κ satisfies (3). Choose β_t ≍ d log(t/δ). Then, for sufficiently large Λ and for all α ∈ (0, 1), there exists ρ depending on α such that the following bound holds with probability at least 1 − δ,

S(Λ) ≲ √(Ψ_{n_Λ}(X_ρ)/n_Λ) + n_Λ^{−(1−α/2)}.

In the above bound, the latter term vanishes fast due to the n_Λ^{−(1−α/2)} dependence. When comparing this with (8), we see that we outperform GP-UCB by a factor of √(Ψ_{n_Λ}(X_ρ)/Ψ_{n_Λ}(X)) ≍ √(vol(X_ρ)/vol(X)) asymptotically. If g is smooth across the fidelity space, X_ρ is small and the gains over GP-UCB are significant. If g becomes less smooth across Z, the bound decays gracefully, but we are never worse than GP-UCB up to constant factors. Theorem 1 also has similarities to the bounds of Kandasamy et al. (2016a), who also demonstrate better regret than GP-UCB by showing that it is dominated by queries inside a subset of X which contains the optimum. However, their bounds depend critically on certain threshold hyper-parameters which determine the volume of this subset, among other terms in their regret. The authors of that paper note that their bounds will suffer if these hyper-parameters are not chosen appropriately, but do not provide theoretically justified methods to make this choice. In contrast, many of the design choices for BOCA fall out naturally from our modelling assumptions. Beyond this analogue, our results are not comparable to Kandasamy et al. (2016a), as the assumptions are different.

Extensions: While we have focused on continuous Z, many of the ideas here can be extended to other settings.
If Z is a discrete subset of [0, 1]^p, our work extends straightforwardly. We reiterate that this will not be the same as the finite fidelity MF-GP-UCB algorithm, as the assumptions are different. In particular, Kandasamy et al. (2016a) are not able to effectively share information across fidelities as we do. We also believe that Algorithm 1 can be extended to arbitrary fidelity spaces Z provided that a kernel can be defined on Z. Our results can also be extended to discrete domains X and various other kernels for φ_X by adopting techniques from Srinivas et al. (2010).

4. Experiments

We compare BOCA to the following four baselines: (i) GP-UCB, (ii) the GP-EI criterion in BO (Jones et al., 1998), (iii) MF-GP-UCB (Kandasamy et al., 2016a) and (iv) MF-SKO, the multi-fidelity sequential kriging optimisation method from Huang et al. (2006). All methods are based on GPs and we use the SE kernel for both the fidelity space and domain. The first two are not multi-fidelity methods, while the last two are finite multi-fidelity methods². Kandasamy et al. (2016a) also study some naive multi-fidelity algorithms and demonstrate that they do not perform well; as such, we will not consider such alternatives here. In all our experiments, the fidelity space was designed to be Z = [0, 1]^p with z• = 1_p = [1, …, 1] ∈ R^p being the most expensive fidelity. For MF-GP-UCB and MF-SKO, we used 3 fidelities (2 approximations), where the approximations were obtained at z = 0.333·1_p and z = 0.667·1_p in Z. Empirically, we found that both algorithms did reasonably well with 1-3 approximations, but did not perform well with a large number of approximations (> 5); even the original papers restrict experiments to 1-3 approximations. Implementation details for all methods are given in Appendix C.1.

²To our knowledge, the only other work that applies to continuous approximations is Klein et al. (2015), which was developed specifically for hyper-parameter tuning. Further, their implementation is not made available and is not straightforward to implement.

Figure 3. Results on 6 synthetic problems (GP, p = 1, d = 1; GP-Bad-Approx, p = 1, d = 1; Currin Exp, p = 1, d = 2; Hartmann3, p = 4, d = 3; Hartmann6, p = 2, d = 6; Borehole, p = 1, d = 8) comparing GP-UCB, GP-EI, MF-GP-UCB, MF-SKO and BOCA, where we plot the simple regret S(Λ) (lower is better) against the capital Λ. The title of each panel states the function used, and the fidelity and domain dimensions. For the first two figures we used capital 30λ(z•); therefore a method which only queries at g(z•, ·) can make at most 30 evaluations. For the third figure we used 50λ(z•), for the fourth 100λ(z•) and for the last 200λ(z•), to reflect the dimensionality d of X. The curves for the multi-fidelity methods start mid-way since they have not queried at z• up until that point. All curves were produced by averaging over 20 experiments and the error bars indicate one standard error.

4.1. Synthetic Experiments

The results for the first set of synthetic experiments are given in Fig. 3. The title of each figure states the function used, and the dimensionalities p, d of the fidelity space and domain. To reflect the setting in our theory, we add Gaussian noise to the function value when observing g at any (z, x). This makes the problem more challenging than standard global optimisation problems where function evaluations are not noisy. The functions g, the cost functions λ and the noise variances η² are given in Appendix C.2. The first two panels in Fig. 3 are simple sanity checks. In both cases, Z = [0, 1], X = [0, 1] and the functions were sampled from GPs. The GP was made known to all methods, i.e. all methods used the true GP in picking the next point. In the first panel, we used an SE kernel with bandwidth 0.1 for φ_X and 1.0 for φ_Z.
g is smooth across Z in this setting, and BOCA outperforms the other baselines. Its curve starts mid-way as BOCA has yet to query at z• up until that point. The second panel uses the same set up as the first, except that we used bandwidth 0.01 for φZ. Even though g is highly unsmooth across Z, BOCA does not perform poorly. This corroborates a claim we made earlier, that BOCA can naturally adapt to the smoothness of the approximations. The other multi-fidelity methods suffer in this setting.

In the remaining experiments, we use some standard benchmarks for global optimisation. We modify them to obtain g and add noise to the observations. As the kernel and other GP hyper-parameters are unknown, we learn them by maximising the marginal likelihood every 25 iterations. We outperform all methods on all problems except in the case of the Borehole function, where MF-GP-UCB does better.

The last synthetic experiment is the Branin function, given in Fig. 4(a). We used the same set up as above, but use 10 fidelities for MF-GP-UCB and MF-SKO, where the kth fidelity is obtained at z = (k/10)·1_p in the fidelity space. Notice that the performance of the finite fidelity methods deteriorates. In particular, as MF-GP-UCB does not share information across fidelities, the approximations need to be designed carefully for the algorithm to work well. Our more natural modelling assumptions prevent such pitfalls.

We next present two real examples in astrophysics and hyper-parameter tuning. We do not add noise to the observations, but treat them as optimisation tasks where the goal is to maximise the function.

[Figure 4: three panels: (a) Branin (p = 3, d = 2), comparing GP-UCB, GP-EI, MF-GP-UCB, MF-SKO and BOCA; (b) Supernova (p = 2, d = 3), y-axis: average log likelihood; (c) News Group-SVM (p = 2, d = 2), y-axis: cross validation accuracy.]
Figure 4. (a): The synthetic benchmark with the Branin function, where we used a capital of 50λ(z•). See the caption under Fig. 3 for more details. (b), (c): Results on the supernova and news group experiments from Sections 4.2 and 4.3 respectively. We have plotted the maximum value found (higher is better) against wall clock time. (a) was averaged over 20 experiments, while (b) and (c) were averaged over 10 experiments each. The error bars indicate one standard error.

4.2. Astrophysical Maximum Likelihood Inference

We use data on Type Ia supernovae for maximum likelihood inference on 3 cosmological parameters: the Hubble constant H0 ∈ (60, 80), the dark matter fraction ΩM ∈ (0, 1) and the dark energy fraction ΩΛ ∈ (0, 1); hence d = 3. The likelihood is given by the Robertson-Walker metric, the computation of which requires a one dimensional numerical integration for each point in the dataset. Unlike typical maximum likelihood problems, here the likelihood is only accessible via point evaluations. We use the dataset from Davis et al. (2007), which has data on 192 supernovae. We construct a p = 2 dimensional multi-fidelity problem where we can choose the dataset size N ∈ [50, 192] and perform the integration on grids of size G ∈ [10^2, 10^6] via the trapezoidal rule. As the cost function for fidelity selection, we used λ(N, G) = NG, as the computation time is linear in both parameters. Our goal is to maximise the average log likelihood at z• = [192, 10^6]. For the finite fidelity methods we use three fidelities, with the approximations available at z = [97, 2.15 × 10^3] and z = [145, 4.64 × 10^4] (which correspond to 0.333·1_p and 0.667·1_p after rescaling as in Section 4.1). The results are given in Fig. 4(b), where we plot the maximum average log likelihood against wall clock time, as that is the cost in this experiment. The plot includes the time taken by each method to tune the GPs and determine the next points/fidelities for evaluation.

4.3.
Support Vector Classification with 20 News Groups

We use the 20 news groups dataset (Joachims, 1996) in a text classification task. We obtain the bag-of-words representation for each document, convert it to tf-idf features and feed them to a support vector classifier. The goal is to tune the regularisation penalty and the temperature of the RBF kernel, both in the range [10^-2, 10^3]; hence d = 2. The support vector implementation was taken from scikit-learn. We set this up as a 2 dimensional multi-fidelity problem where we can choose a dataset size N ∈ [5000, 15000] and the number of training iterations T ∈ [20, 100]. Each evaluation takes the given dataset of size N and splits it into 5 parts to perform 5-fold cross validation. As the cost function for fidelity selection, we used λ(N, T) = NT, as the training/validation complexity is linear in both parameters. Our goal is to maximise the cross validation accuracy at z• = [15000, 100]. For the finite fidelity methods we use three fidelities, with the approximations available at z = [8333, 47] and z = [11667, 73]. The results are given in Fig. 4(c), where we plot the average cross validation accuracy against wall clock time.

5. Conclusion

We studied Bayesian optimisation with continuous approximations, treating the approximations as arising out of a continuous fidelity space. While previous multi-fidelity literature has predominantly focused on a finite number of approximations, BOCA applies to continuous fidelity spaces and can potentially be extended to arbitrary fidelity spaces. We bound the simple regret for BOCA and demonstrate that it is better than methods such as GP-UCB which ignore the approximations, and that the gains are determined by the smoothness of the fidelity space. Compared to existing multi-fidelity methods, BOCA is able to share information across fidelities effectively, has more natural modelling assumptions and has fewer hyper-parameters to tune.
Empirically, we demonstrate that BOCA is competitive with other baselines on synthetic and real problems. Another nice feature of using continuous approximations is that it relieves the practitioner from having to design the approximations: she/he can simply specify the available approximations and let the algorithm decide how to use them. Going forward, we wish to extend our theoretical results to more general settings. For instance, we believe a stronger bound on the regret might be possible if φZ is a finite dimensional kernel. Since finite dimensional kernels are typically not radial (Sriperumbudur et al., 2016), our analysis techniques will not carry over straightforwardly. Another line of work that we have alluded to is to study more general fidelity spaces with an appropriately defined kernel φZ.

Acknowledgements

We would like to thank Renato Negrinho for reviewing an initial draft of this paper. This research is supported in part by DOE grant DE-SC0011114 and NSF grant IIS-1563887. KK is supported by a Facebook Ph.D. fellowship.

References

Agarwal, Alekh, Duchi, John C., Bartlett, Peter L., and Levrard, Clément. Oracle inequalities for computationally budgeted model selection. In COLT, 2011.

Auer, Peter. Using Confidence Bounds for Exploitation-exploration Trade-offs. J. Mach. Learn. Res., 2003.

Brochu, E., Cora, V. M., and de Freitas, N. A Tutorial on Bayesian Optimization of Expensive Cost Functions, with Application to Active User Modeling and Hierarchical RL. CoRR, 2010.

Cutler, Mark, Walsh, Thomas J., and How, Jonathan P. Reinforcement Learning with Multi-Fidelity Simulators. In ICRA, 2014.

Davis, T. M. et al. Scrutinizing Exotic Cosmological Models Using ESSENCE Supernova Data Combined with Other Cosmological Probes. Astrophysical Journal, 2007.

de Freitas, Nando, Smola, Alex J., and Zoghi, Masrour. Exponential Regret Bounds for Gaussian Process Bandits with Deterministic Observations. In ICML, 2012.
Forrester, Alexander I. J., Sóbester, András, and Keane, Andy J. Multi-fidelity optimization via surrogate modelling. Proceedings of the Royal Society A: Mathematical, Physical and Engineering Science, 2007.

Ghosal, Subhashis and Roy, Anindya. Posterior consistency of Gaussian process prior for nonparametric binary regression. Annals of Statistics, 2006.

Gonzalez, J., Longworth, J., James, D., and Lawrence, N. Bayesian Optimization for Synthetic Gene Design. In BayesOpt, 2014.

Hernández-Lobato, José Miguel, Hoffman, Matthew W., and Ghahramani, Zoubin. Predictive Entropy Search for Efficient Global Optimization of Black-box Functions. In NIPS, 2014.

Huang, D., Allen, T. T., Notz, W. I., and Miller, R. A. Sequential kriging optimization using multiple-fidelity evaluations. Structural and Multidisciplinary Optimization, 2006.

Hutter, Frank, Hoos, Holger H., and Leyton-Brown, Kevin. Sequential Model-based Optimization for General Algorithm Configuration. In LION, 2011.

Joachims, Thorsten. A probabilistic analysis of the Rocchio algorithm with TFIDF for text categorization. Technical report, DTIC Document, 1996.

Jones, D. R., Perttunen, C. D., and Stuckman, B. E. Lipschitzian Optimization Without the Lipschitz Constant. J. Optim. Theory Appl., 1993.

Jones, Donald R., Schonlau, Matthias, and Welch, William J. Efficient global optimization of expensive black-box functions. J. of Global Optimization, 1998.

Kandasamy, Kirthevasan, Schneider, Jeff, and Póczos, Barnabás. High Dimensional Bayesian Optimisation and Bandits via Additive Models. In International Conference on Machine Learning, 2015.

Kandasamy, Kirthevasan, Dasarathy, Gautam, Oliva, Junier, Schneider, Jeff, and Póczos, Barnabás. Gaussian Process Bandit Optimisation with Multi-fidelity Evaluations. In Advances in Neural Information Processing Systems, 2016a.

Kandasamy, Kirthevasan, Dasarathy, Gautam, Oliva, Junier B., Schneider, Jeff, and Póczos, Barnabás. Multi-fidelity Gaussian process bandit optimisation.
arXiv preprint arXiv:1603.06288, 2016b.

Kandasamy, Kirthevasan, Dasarathy, Gautam, Schneider, Jeff, and Póczos, Barnabás. The Multi-fidelity Multi-armed Bandit. In NIPS, 2016c.

Klein, A., Bartels, S., Falkner, S., Hennig, P., and Hutter, F. Towards efficient Bayesian Optimization for Big Data. In BayesOpt, 2015.

Lam, Rémi, Allaire, Douglas L., and Willcox, Karen E. Multifidelity optimization using statistical surrogate modeling for non-hierarchical information sources. In 56th AIAA/ASCE/AHS/ASC Structures, Structural Dynamics, and Materials Conference, pp. 0143, 2015.

Li, Lisha, Jamieson, Kevin, DeSalvo, Giulia, Rostamizadeh, Afshin, and Talwalkar, Ameet. Hyperband: A novel bandit-based approach to hyperparameter optimization. arXiv preprint arXiv:1603.06560, 2016.

Lizotte, Daniel, Wang, Tao, Bowling, Michael, and Schuurmans, Dale. Automatic gait optimization with Gaussian process regression. In IJCAI, 2007.

Martinez-Cantin, R., de Freitas, N., Doucet, A., and Castellanos, J. Active Policy Learning for Robot Planning and Exploration under Uncertainty. In Proceedings of Robotics: Science and Systems, 2007.

Mockus, Jonas. Application of Bayesian approach to numerical methods of global and stochastic optimization. Journal of Global Optimization, 1994.

Parkinson, D., Mukherjee, P., and Liddle, A. R. A Bayesian model selection analysis of WMAP3. Physical Review, 2006.

Poloczek, Matthias, Wang, Jialei, and Frazier, Peter I. Multi-information source optimization. arXiv preprint arXiv:1603.00389, 2016.

Rasmussen, C. E. and Williams, C. K. I. Gaussian Processes for Machine Learning. MIT Press, 2006.

Sabharwal, A., Samulowitz, H., and Tesauro, G. Selecting near-optimal learners via incremental data allocation. In AAAI, 2015.

Seeger, M. W., Kakade, S. M., and Foster, D. P. Information Consistency of Nonparametric Gaussian Process Methods. IEEE Transactions on Information Theory, 2008.

Snoek, J., Larochelle, H., and Adams, R. P.
Practical Bayesian Optimization of Machine Learning Algorithms. In NIPS, 2012.

Srinivas, Niranjan, Krause, Andreas, Kakade, Sham, and Seeger, Matthias. Gaussian Process Optimization in the Bandit Setting: No Regret and Experimental Design. In ICML, 2010.

Sriperumbudur, Bharath et al. On the optimal estimation of probability measures in weak and strong topologies. Bernoulli, 22(3):1839-1893, 2016.

Swersky, Kevin, Snoek, Jasper, and Adams, Ryan P. Multi-task Bayesian optimization. In NIPS, 2013.

Thompson, W. R. On the Likelihood that one Unknown Probability Exceeds Another in View of the Evidence of Two Samples. Biometrika, 1933.

Xiong, Shifeng, Qian, Peter Z. G., and Wu, C. F. Jeff. Sequential design and analysis of high-accuracy and low-accuracy computer codes. Technometrics, 2013.

Zhang, C. and Chaudhuri, K. Active Learning from Weak and Strong Labelers. In NIPS, 2015.