Multi-fidelity Bayesian Optimisation with Continuous Approximations

Kirthevasan Kandasamy 1, Gautam Dasarathy 2, Jeff Schneider 1, Barnabás Póczos 1

Abstract

Bandit methods for black-box optimisation, such as Bayesian optimisation, are used in a variety of applications including hyper-parameter tuning and experiment design. Recently, multi-fidelity methods have garnered considerable attention since function evaluations have become increasingly expensive in such applications. Multi-fidelity methods use cheap approximations to the function of interest to speed up the overall optimisation process. However, most multi-fidelity methods assume only a finite number of approximations. On the other hand, in many practical applications, a continuous spectrum of approximations might be available. For instance, when tuning an expensive neural network, one might choose to approximate the cross validation performance using less data N and/or fewer training iterations T. Here, the approximations are best viewed as arising out of a continuous two dimensional space (N, T). In this work, we develop a Bayesian optimisation method, BOCA, for this setting. We characterise its theoretical properties and show that it achieves better regret than strategies which ignore the approximations. BOCA outperforms several other baselines in synthetic and real experiments.

1. Introduction

Many tasks in scientific and engineering applications can be framed as bandit optimisation problems, where we need to sequentially evaluate a noisy black-box function f : X → R with the goal of finding its optimum. Some applications include hyper-parameter tuning in machine learning (Hutter et al., 2011; Snoek et al., 2012), optimal policy search (Lizotte et al., 2007; Martinez-Cantin et al., 2007) and scientific experiments (Gonzalez et al., 2014; Parkinson et al., 2006).
*Equal contribution. 1Carnegie Mellon University, Pittsburgh PA, USA. 2Rice University, Houston TX, USA. Correspondence to: Kirthevasan Kandasamy. Proceedings of the 34th International Conference on Machine Learning, Sydney, Australia, PMLR 70, 2017. Copyright 2017 by the author(s).

Typically, in such applications, each function evaluation is expensive, and conventionally, the bandit literature has focused on developing methods for finding the optimum while keeping the number of evaluations of f at a minimum. However, with increasingly expensive function evaluations, conventional methods have become infeasible as a significant cost needs to be expended before we can learn anything about f. As a result, multi-fidelity optimisation methods have recently gained attention (Cutler et al., 2014; Kandasamy et al., 2016a; Li et al., 2016). As the name suggests, these methods assume that we have access to lower fidelity approximations to f which can be evaluated instead of f. The lower the fidelity, the cheaper the evaluation, but it provides less accurate information about f. For example, when optimising the configuration of an expensive real world robot, its performance can be approximated using cheaper computer simulations. The goal is to use the cheap approximations to guide the search for the optimum of f, and reduce the overall cost of optimisation. However, most multi-fidelity work assumes only a finite number of approximations. In this paper, we study multi-fidelity optimisation when we have access to a continuous spectrum of approximations. To motivate this set up, consider tuning a classification algorithm over a space of hyper-parameters X by maximising a validation set accuracy. The algorithm is to be trained using N• data points via an iterative algorithm for T• iterations. However, we wish to use fewer training points N < N• and/or fewer iterations T < T• to approximate the validation accuracy.
We can view validation accuracy as a function g : [1, N•] × [1, T•] × X → R, where evaluating g(N, T, x) requires training the algorithm with N points for T iterations with the hyper-parameters x. If the training complexity of the algorithm is quadratic in data size and linear in the number of iterations, then the cost of this evaluation is λ(N, T) = O(N²T). Our goal is to find the optimum when N = N• and T = T•, i.e. we wish to maximise f(x) = g(N•, T•, x). In this setting, while N, T are technically discrete choices, they are more naturally viewed as coming from a continuous 2 dimensional fidelity space, [1, N•] × [1, T•]. One might hope that cheaper queries to g(N, T, ·) with N, T less than N•, T• can be used to learn about g(N•, T•, ·) and consequently optimise it using less overall cost. Indeed, this is the case with many machine learning algorithms, where cross validation performance tends to vary smoothly with data set size and number of iterations. Therefore, one may use cheap low fidelity experiments with small (N, T) to discard bad hyper-parameters and deploy expensive high fidelity experiments with large (N, T) only in a small but promising region. The main theoretical result of this paper (Theorem 1) shows that our proposed algorithm, BOCA, exhibits precisely this behaviour. Continuous approximations also arise in simulation studies, where simulations can be carried out at varying levels of granularity; on-line advertising, where an ad can be controlled by continuous parameters such as display time or target audience; and several other experiment design tasks. In fact, in many multi-fidelity papers, the finite approximations were obtained by discretising a continuous space (Huang et al., 2006; Kandasamy et al., 2016a). Here, we study a Bayesian optimisation technique that is directly designed for continuous fidelity spaces and is potentially applicable to more general spaces. Our main contributions are, 1.
A novel setting and model for multi-fidelity optimisation with continuous approximations using Gaussian process (GP) assumptions. We develop a novel algorithm, BOCA, for this setting. 2. A theoretical analysis characterising the behaviour and regret bound for BOCA. 3. An empirical study which demonstrates that BOCA outperforms alternatives, both multi-fidelity and otherwise, on a series of synthetic problems and real examples in hyper-parameter tuning and astrophysics.

Related Work

Bayesian optimisation (BO) refers to a suite of techniques for bandit optimisation which use a prior belief distribution for f. While there are several techniques for BO (de Freitas et al., 2012; Hernández-Lobato et al., 2014; Jones et al., 1998; Mockus, 1994; Thompson, 1933), our work builds on the Gaussian process upper confidence bound (GP-UCB) algorithm of Srinivas et al. (2010). GP-UCB models f as a GP and uses upper confidence bound (UCB) (Auer, 2003) techniques to determine the next point for evaluation. BO techniques have been used to develop multi-fidelity optimisation methods in various applications such as hyper-parameter tuning and industrial design (Forrester et al., 2007; Huang et al., 2006; Klein et al., 2015; Lam et al., 2015; Poloczek et al., 2016; Swersky et al., 2013). However, these methods are either problem specific and/or only use a finite number of fidelities. Further, none of them come with theoretical underpinnings. Recent work has studied multi-fidelity methods for specific problems such as hyper-parameter tuning, active learning and reinforcement learning (Agarwal et al., 2011; Cutler et al., 2014; Li et al., 2016; Sabharwal et al., 2015; Zhang & Chaudhuri, 2015). While some of the above tasks can be framed as optimisation problems, the methods themselves are specific to the problem considered. Our method is more general as it applies to any bandit optimisation task. Perhaps the closest work to ours is that of Kandasamy et al.
(2016a;b;c), who developed MF-GP-UCB assuming a finite number of approximations to f. While this line of work was the first to provide theoretical guarantees for multi-fidelity optimisation, it has two important shortcomings. First, they make strong assumptions, particularly a uniform bound on the difference between the expensive function and an approximation. This does not allow for instances where an approximation might be good in certain regions but not in others. In contrast, our probabilistic treatment of the relation between fidelities is robust to such cases. Second, their model does not allow sharing information between fidelities; each approximation is treated independently. Not only is this wasteful, as lower fidelities can provide useful information about higher fidelities, it also means that the algorithm might perform poorly if the fidelities are not designed properly. We demonstrate this with an experiment in Section 4. On the other hand, our model allows sharing information across the fidelity space in a natural way. In addition, we can also handle continuous approximations, whereas their method is strictly for a finite number of approximations. That said, BOCA inherits a key intuition from MF-GP-UCB, which is to choose a fidelity only if we have sufficiently reduced the uncertainty at all lower fidelities. Besides this, there are considerable differences in the mechanics of the algorithm and proof techniques. As we proceed, we will draw further comparisons to Kandasamy et al. (2016a).

2. Preliminaries

2.1. Some Background Material

Gaussian processes: A GP over a space X is a random process from X to R. GPs are typically used as a prior for functions in Bayesian nonparametrics. A GP is characterised by a mean function μ : X → R and a covariance function (or kernel) κ : X² → R. If f ∼ GP(μ, κ), then f(x) is distributed normally as N(μ(x), κ(x, x)) for all x ∈ X.
Suppose that we are given n observations Dₙ = {(x_i, y_i)}_{i=1}^n from this GP, where x_i ∈ X, y_i = f(x_i) + ε_i ∈ R and ε_i ∼ N(0, η²). Then the posterior process f | Dₙ is also a GP with mean μ_n and covariance κ_n given by

μ_n(x) = kᵀ(K + η²I)⁻¹Y,    (1)
κ_n(x, x′) = κ(x, x′) − kᵀ(K + η²I)⁻¹k′.

Here Y ∈ Rⁿ is a vector with Y_i = y_i, and k, k′ ∈ Rⁿ are such that k_i = κ(x, x_i), k′_i = κ(x′, x_i). The matrix K ∈ Rⁿˣⁿ is given by K_{i,j} = κ(x_i, x_j). We refer the reader to chapter 2 of Rasmussen & Williams (2006) for more on the basics of GPs and their use in regression.

Figure 1. Samples drawn from a GP with zero mean and an SE kernel with bandwidths h = 0.01, h = 0.15 and h = 0.5. Samples tend to be smoother across the domain for large bandwidths.

Radial kernels: The prior covariance functions of GPs are typically taken to be radial kernels; some examples are the squared exponential (SE) and Matérn kernels. Using a radial kernel means that the prior covariance can be written as κ(x, x′) = κ₀ φ(‖x − x′‖) and depends only on the distance between x and x′. Here, the scale parameter κ₀ captures the magnitude by which f could deviate from μ. The function φ : R₊ → R₊ is a decreasing function with ‖φ‖∞ = φ(0) = 1. In this paper, we will use the SE kernel in a running example to convey the intuitions behind our methods. For the SE kernel, φ(r) = φ_h(r) = exp(−r²/(2h²)), where h ∈ R₊, called the bandwidth of the kernel, controls the smoothness of the GP. When h is large, the samples drawn from the GP tend to be smoother, as illustrated in Fig. 1. We will reference this observation frequently in the text.

GP-UCB: The Gaussian Process Upper Confidence Bound (GP-UCB) algorithm of Srinivas et al. (2010) is a method for bandit optimisation which, like many other BO methods, models f as a sample from a Gaussian process. At time t, the next point x_t for evaluating f is chosen via the following procedure.
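As a brief aside before detailing the procedure, the posterior formulae in (1) can be implemented in a few lines. The sketch below is illustrative only; the SE kernel, bandwidth and noise values are arbitrary choices for the example, not values prescribed by the paper:

```python
import numpy as np

def se_kernel(A, B, kappa0=1.0, h=0.15):
    # SE kernel: kappa(x, x') = kappa0 * exp(-||x - x'||^2 / (2 h^2))
    d2 = np.sum((A[:, None, :] - B[None, :, :]) ** 2, axis=-1)
    return kappa0 * np.exp(-d2 / (2 * h ** 2))

def gp_posterior(X_obs, y, x, eta2=1e-2, kern=se_kernel):
    # Posterior mean and variance at a single point x, per equation (1).
    K = kern(X_obs, X_obs)                             # K_ij = kappa(x_i, x_j)
    k = kern(X_obs, x[None, :])[:, 0]                  # k_i = kappa(x, x_i)
    w = np.linalg.solve(K + eta2 * np.eye(len(y)), k)  # (K + eta^2 I)^{-1} k
    mu = w @ y                                         # k^T (K + eta^2 I)^{-1} Y
    var = kern(x[None, :], x[None, :])[0, 0] - w @ k   # kappa(x,x) - k^T (...)^{-1} k
    return mu, var
```

At an observed point with small noise the posterior mean reverts to the observed value, and the posterior variance shrinks below its prior value, as the formulae predict.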
First, we construct an upper confidence bound ϕ_t(x) = μ_{t−1}(x) + β_t^{1/2} σ_{t−1}(x) for the GP. Here μ_{t−1} is the posterior mean of the GP conditioned on the previous t − 1 evaluations and σ_{t−1} is the posterior standard deviation. Following other UCB algorithms (Auer, 2003), the next point is chosen by maximising ϕ_t, i.e. x_t = argmax_{x∈X} ϕ_t(x). The μ_{t−1} term encourages an exploitative strategy, in that we want to query regions where we already believe f is high, and σ_{t−1} encourages an exploratory strategy, in that we want to query where we are uncertain about f so that we do not miss regions which have not been queried yet. β_t, which is typically increasing with t, controls the trade-off between exploration and exploitation. We have provided a brief review of GP-UCB in Appendix A.1.

2.2. Problem Set Up

Our goal in bandit optimisation is to maximise a function f : X → R over a domain X. When we evaluate f at x ∈ X we observe y = f(x) + ε where E[ε] = 0. Let x⋆ ∈ argmax_{x∈X} f(x) be a maximiser of f and f⋆ = f(x⋆) be the maximum value. An algorithm for bandit optimisation is a sequence of points {x_t}_{t≥0}, where, at time t, the algorithm chooses to evaluate f at x_t based on previous queries and observations {(x_i, y_i)}_{i=1}^{t−1}. After n queries to f, its goal is to achieve small simple regret S_n, as defined below.

S_n = min_{t=1,…,n} f⋆ − f(x_t).    (2)

Figure 2. g : Z × X → R is a function defined on the product space of the fidelity space Z and domain X. The purple line is f(x) = g(z•, x). We wish to find the maximiser x⋆ ∈ argmax_{x∈X} f(x). The multi-fidelity framework is attractive when g is smooth across Z as illustrated in the figure.

Continuous Approximations: In this work, we will let f be a slice of a function g that lies in a larger space. Precisely, we will assume the existence of a fidelity space Z and a function g : Z × X → R defined on the product space of the fidelity space and domain. The function f which we wish to maximise is related to g via f(·) = g(z•, ·), where z• ∈ Z.
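Returning briefly to the GP-UCB rule from Section 2.1: on a finite candidate grid the acquisition step reduces to a one-liner. The sketch below (with made-up posterior values) illustrates how β_t arbitrates between exploitation and exploration:

```python
import numpy as np

def ucb_argmax(mu, sigma, beta_t):
    # phi_t(x) = mu_{t-1}(x) + sqrt(beta_t) * sigma_{t-1}(x), maximised over a grid.
    phi = mu + np.sqrt(beta_t) * sigma
    return int(np.argmax(phi))
```

For instance, with posterior means [0.9, 0.5] and standard deviations [0.0, 1.0], a small β_t = 0.01 picks the first (exploitative) candidate while β_t = 4 picks the second (exploratory) one.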
For instance, in the hyper-parameter tuning example from Section 1, Z = [1, N•] × [1, T•] and z• = [N•, T•]. Our goal is to find a maximiser x⋆ ∈ argmax_x f(x) = argmax_x g(z•, x). We have illustrated this set up in Fig. 2. In the rest of the manuscript, the term fidelities will refer to points z in the fidelity space Z. The multi-fidelity framework is attractive when the following two conditions hold for the problem. 1. There exist fidelities z ∈ Z where evaluating g is cheaper than evaluating at z•. To this end, we will associate a known cost function λ : Z → R₊. In the hyper-parameter tuning example, λ(z) = λ(N, T) = O(N²T). It is helpful to think of z• as the most expensive fidelity, i.e. the maximiser of λ, and of λ(z) as decreasing as we move away from z•. However, this notion is not strictly necessary for our algorithm or results. 2. A cheap evaluation of g(z, ·) gives us information about g(z•, ·). This is true if g is smooth across the fidelity space, as illustrated in Fig. 2. As we will describe shortly, this smoothness can be achieved by modelling g as a GP with an appropriate kernel for the fidelity space Z. In the above set up, a multi-fidelity algorithm is a sequence of query-fidelity pairs {(z_t, x_t)}_{t≥0}, where, at time t, the algorithm chooses z_t ∈ Z and x_t ∈ X, and observes y_t = g(z_t, x_t) + ε where E[ε] = 0. The choice of (z_t, x_t) can of course depend on the previous fidelity-query-observation triples {(z_i, x_i, y_i)}_{i=1}^{t−1}.

Multi-fidelity Simple Regret: We provide bounds on the simple regret S(Λ) of a multi-fidelity optimisation method after it has spent capital Λ of a resource. Following Kandasamy et al. (2016a); Srinivas et al. (2010), we will aim to provide any-capital bounds, meaning that an algorithm would be expected to do well for all values of (sufficiently large) Λ. Say we have made N queries to g within capital Λ, i.e. N is the random quantity such that N = max{n ≥ 1 : Σ_{t=1}^n λ(z_t) ≤ Λ}.
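The capital accounting above can be sketched directly: N is simply the longest prefix of the query sequence whose costs fit within Λ. The helper below is our own illustration (returning 0 when even the first query exceeds the capital):

```python
def num_queries(costs, capital):
    # N = max{ n >= 1 : sum_{t=1}^n lambda(z_t) <= capital },
    # where costs[t] = lambda(z_t) for the fidelities queried in order.
    total, n = 0.0, 0
    for c in costs:
        if total + c > capital:
            break
        total += c
        n += 1
    return n
```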
While the cheap evaluations at z ≠ z• are useful in guiding the search for the optimum of g(z•, ·), there is no reward for optimising a cheaper g(z, ·). Accordingly, we define the simple regret after capital Λ as,

S(Λ) = min_{t ∈ {1,…,N} : z_t = z•} f⋆ − f(x_t) if we have queried at z•, and S(Λ) = +∞ otherwise.

This definition reduces to the single fidelity definition (2) when we only query g at z•. It is also similar to the definition in Kandasamy et al. (2016a), but unlike them, we do not impose additional boundedness constraints on f or g. Before we proceed, we note that it is customary in the bandit literature to analyse cumulative regret. However, the definition of cumulative regret depends on the application at hand (Kandasamy et al., 2016c), and the results in this paper can be extended to many sensible notions of cumulative regret. However, both to simplify exposition and since our focus in this paper is optimisation, we stick to simple regret.

Assumptions: As we will be primarily focusing on continuous and compact domains and fidelity spaces, going forward we will assume, without any loss of generality, that X = [0, 1]^d and Z = [0, 1]^p. We discuss non-continuous settings briefly at the end of Section 3. In keeping with similar work in the Bayesian optimisation literature, we will assume g ∼ GP(0, κ), and upon querying at (z, x) we observe y = g(z, x) + ε where ε ∼ N(0, η²). Here κ : (Z × X)² → R is the prior covariance defined on the product space. In this work, we will exclusively study κ of the following form,

κ([z, x], [z′, x′]) = κ₀ φ_Z(‖z − z′‖) φ_X(‖x − x′‖).    (3)

Here, κ₀ ∈ R₊ is the scale parameter and φ_Z, φ_X are radial kernels defined on Z, X respectively. The fidelity space kernel φ_Z is an important component in this work. It controls the smoothness of g across the fidelity space and hence determines how much information the lower fidelities provide about g(z•, ·). For example, suppose that φ_Z is a SE kernel.
A favourable setting for a multi-fidelity method would be for φ_Z to have a large bandwidth h_Z, as that would imply that g is very smooth across Z. We will see that h_Z determines the behaviour and theoretical guarantees of BOCA in a natural way when φ_Z is the SE kernel. To formalise this notion, we will define the following information gap function ξ : Z → [0, 1],

ξ(z) = √(1 − φ_Z(‖z − z•‖)²).    (4)

One interpretation of ξ(z) is that it measures the gap in information about g(z•, ·) when we query at z ≠ z•. That is, it is the price we have to pay, in information, for querying at a cheap fidelity. Observe that ξ increases as we move away from z• in the fidelity space. For the SE kernel, it can be shown¹ that ξ(z) ≲ ‖z − z•‖/h_Z. For large h_Z, g is smoother across Z and we can expect the lower fidelities to be more informative about f; as expected, the information gap ξ is small for large h_Z. If h_Z is small and g is not smooth, the gap ξ is large and lower fidelities are not as informative. Before we present our algorithm for the above set up, we will introduce notation for the posterior GPs for g and f. Let Dₙ = {(z_i, x_i, y_i)}_{i=1}^n be n fidelity, query, observation values from the GP g, where y_i was observed when evaluating g(z_i, x_i). We will denote the posterior mean and standard deviation of g conditioned on Dₙ by ν_n and τ_n respectively (ν_n, τ_n can be computed from (1) by replacing x with [z, x]). Therefore g(z, x) | Dₙ ∼ N(ν_n(z, x), τ_n²(z, x)) for all (z, x) ∈ Z × X. We will further denote

μ_n(·) = ν_n(z•, ·),  σ_n(·) = τ_n(z•, ·),    (5)

the posterior mean and standard deviation of g(z•, ·) = f(·). It follows that f | Dₙ is also a GP and satisfies f(x) | Dₙ ∼ N(μ_n(x), σ_n²(x)) for all x ∈ X.

3. BOCA: Bayesian Optimisation with Continuous Approximations

BOCA is a sequential strategy which selects a domain point x_t ∈ X and fidelity z_t ∈ Z at time t based on previous observations. At time t, we first construct an upper confidence bound ϕ_t for the function f we wish to optimise. It takes the form,

ϕ_t(x) = μ_{t−1}(x) + β_t^{1/2} σ_{t−1}(x).    (6)
Recall from (5) that μ_{t−1} and σ_{t−1} are the posterior mean and standard deviation of f using the observations from the previous t − 1 time steps at all fidelities, i.e. the entire Z × X space. We will specify β_t in Theorems 1 and 8. Following other UCB algorithms, our next point x_t in the domain X for evaluating g is a maximiser of ϕ_t, i.e. x_t ∈ argmax_{x∈X} ϕ_t(x). Next, we need to determine the fidelity z_t ∈ Z at which to query g. For this we first select a subset Z_t(x_t) of Z as follows,

Z_t(x_t) = { z ∈ Z : λ(z) < λ(z•), τ_{t−1}(z, x_t) > γ(z), ξ(z) > β_t^{−1/2} ξ̄ },    (7)
where γ(z) = √κ₀ ξ(z) (λ(z)/λ(z•))^q.

Here, ξ is the information gap function in (4), ξ̄ denotes its largest value over Z, τ_{t−1} is the posterior standard deviation of g, and p, d are the dimensionalities of Z, X. The exponent q depends on the kernel used for φ_Z; e.g. for the SE kernel, q = 1/(p + d + 2). We filter the fidelities we consider at time t using the three conditions above, which we elaborate on in more detail in Section 3.1. If Z_t is not empty, we choose the cheapest fidelity in this set, i.e. z_t ∈ argmin_{z∈Z_t} λ(z). If Z_t is empty, we choose z_t = z•. We have summarised the resulting procedure below in Algorithm 1. An important advantage of BOCA is that it only requires specifying the GP hyper-parameters for g, such as the kernel κ. In practice, this can be achieved by various effective heuristics, such as maximising the GP marginal likelihood or cross validation, which are standard in most BO methods. In contrast, MF-GP-UCB of Kandasamy et al. (2016a) requires tuning several other hyper-parameters.

Algorithm 1 BOCA
Input: kernel κ.
Set ν₀(·) ← 0, τ₀(·) ← κ(·, ·)^{1/2}, D₀ ← ∅.
for t = 1, 2, …
  1. x_t ← argmax_{x∈X} ϕ_t(x). (See (6).)
  2. z_t ← argmin_{z ∈ Z_t(x_t) ∪ {z•}} λ(z). (See (7).)
  3. y_t ← Query g at (z_t, x_t).
  4. D_t ← D_{t−1} ∪ {(z_t, x_t, y_t)}.
Update the posterior mean ν_t and standard deviation τ_t for g conditioned on D_t.

3.1. Fidelity Selection Criterion

We will now provide an intuitive justification for the three conditions in the selection criterion for z_t, i.e. equation (7). The first condition, λ(z) < λ(z•), is fairly obvious; since we wish to optimise g(z•, ·) and since we are not rewarded for queries at other fidelities, there is no reason to consider fidelities that are more expensive than z•. The second condition, τ_{t−1}(z, x_t) > γ(z), says that we will only consider fidelities where the posterior variance is larger than a threshold γ(z) = √κ₀ ξ(z) (λ(z)/λ(z•))^q, which depends critically on two quantities, the cost function λ and the information gap ξ. As a first step towards parsing this condition, observe that a reasonable multi-fidelity strategy should be inclined to query cheap fidelities and learn about g before querying expensive fidelities. As γ(z) is monotonically increasing in λ(z), it becomes easier for a cheap z to satisfy τ_{t−1}(z, x_t) > γ(z) and be included in Z_t at time t. Moreover, since we choose z_t to be the minimiser of λ in Z_t, a cheaper fidelity will always be chosen over expensive ones if included in Z_t. Second, if a particular fidelity z is far away from z•, it probably contains less information about g(z•, ·). Again, a reasonable multi-fidelity strategy should be discouraged from making such queries. This is precisely the role of the information gap ξ, which is increasing with ‖z − z•‖. As z moves away from z•, γ(z) increases and it becomes harder to satisfy τ_{t−1}(z, x_t) > γ(z). Therefore, such a z is less likely to be included in Z_t(x_t) and be considered for evaluation. Our analysis reveals that setting γ as in (7) is a reasonable trade-off between cost and information in the approximations available to us; cheaper fidelities cost less, but provide less accurate information about the function f we wish to optimise.
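The fidelity selection rule in (7) can be sketched as a filter over candidate fidelities. This is a simplified illustration of the rule rather than the paper's implementation: `tau_prev`, `lam` and `xi` are supplied as callables, `information_gap` implements (4) for an SE fidelity kernel, and `xi_max` stands in for the maximal information gap over Z:

```python
import numpy as np

def information_gap(z, z_star, hZ=1.0):
    # Equation (4) with an SE fidelity kernel:
    # xi(z) = sqrt(1 - phi_Z(||z - z_star||)^2), phi_h(r) = exp(-r^2/(2 h^2)).
    phi = np.exp(-np.linalg.norm(np.asarray(z) - np.asarray(z_star)) ** 2
                 / (2 * hZ ** 2))
    return np.sqrt(1.0 - phi ** 2)

def select_fidelity(cands, x_t, tau_prev, lam, xi, beta_t, z_star,
                    kappa0=1.0, q=0.25, xi_max=1.0):
    # Equation (7): keep candidates z that are
    #  (i)   strictly cheaper than z*:      lam(z) < lam(z_star),
    #  (ii)  still uncertain:               tau_{t-1}(z, x_t) > gamma(z), with
    #        gamma(z) = sqrt(kappa0) * xi(z) * (lam(z)/lam(z_star))**q,
    #  (iii) outside a shrinking neighbourhood of z*: xi(z) > xi_max / sqrt(beta_t);
    # then query the cheapest qualifying fidelity, falling back to z*.
    Zt = [z for z in cands
          if lam(z) < lam(z_star)
          and tau_prev(z, x_t) > np.sqrt(kappa0) * xi(z) * (lam(z) / lam(z_star)) ** q
          and xi(z) > xi_max / np.sqrt(beta_t)]
    return min(Zt, key=lam) if Zt else z_star
```

When uncertainty is high everywhere the cheapest admissible fidelity is queried; once the cheap fidelities are well resolved (small τ), the rule falls back to z•, matching the intuition in the text.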
It is worth noting that the second condition is similar in spirit to Kandasamy et al. (2016a), who proceed from a lower to a higher fidelity only when the lower fidelity variance is smaller than a threshold. However, while they treat the threshold as a hyper-parameter, we are able to explicitly specify theoretically motivated values. The third condition in (7) is ξ(z) > ξ̄/β_t^{1/2}. Since ξ increases as we move away from z•, it says we should exclude fidelities inside a (small) neighbourhood of z•. Recall that if Z_t is empty, BOCA will choose z• by default. But when it is not empty, we want to prevent situations where we get arbitrarily close to z• but do not actually query at z•. Such pathologies can occur when we are dealing with a continuum of fidelities, and this condition forces BOCA to pick z• instead of querying very close to it. Observe that since β_t is increasing with t, this neighbourhood shrinks with time, and therefore the algorithm will eventually have the opportunity to evaluate fidelities close to z•.

3.2. Theoretical Results

We now present our main theoretical contributions. In order to simplify the exposition and convey the gist of our results, we will only present a simplified version of our theorems. We will suppress constants, polylog terms, and other technical details that arise due to a covering argument in our proofs. A rigorous treatment is available in Appendix B.

Maximum Information Gain: Up until this point, we have not discussed much about the kernel φ_X of the domain X. Since we are optimising f over X, it is natural to expect that this kernel will appear in the bounds. Srinivas et al. (2010) showed that the statistical difficulty of GP bandits is determined by the Maximum Information Gain (MIG), which measures the maximum information a subset of observations has about f. We denote it by Ψ_n(A), where A is a subset of X and n is the number of queries to f.
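Concretely, for n observations with Gaussian noise, the information gain of a set A of query points is I(y_A; f) = ½ log det(I + η⁻² K_A), where K_A is the kernel matrix of A (Srinivas et al., 2010); the MIG maximises this over subsets of size n. The inner quantity is straightforward to compute:

```python
import numpy as np

def information_gain(K, eta2):
    # I(y_A; f) = 0.5 * log det(I + eta^{-2} K_A)
    # for observations y_A with kernel matrix K_A and noise variance eta^2.
    n = K.shape[0]
    sign, logdet = np.linalg.slogdet(np.eye(n) + K / eta2)
    return 0.5 * logdet
```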
We refer the reader to Appendix B for a formal definition of the MIG. For the current exposition, it suffices to know that for radial kernels, Ψ_n(A) increases with n and the volume vol(A) of A. For instance, when we use an SE kernel for φ_X, we have Ψ_n(A) ≲ vol(A) log(n)^{d+1}, and for a Matérn kernel with smoothness parameter ν, Ψ_n(A) ≲ vol(A) n^{d(d+1)/(2ν+d(d+1))} (Srinivas et al., 2010). Let n_Λ = Λ/λ(z•) denote the number of queries by a single fidelity algorithm within capital Λ. Srinivas et al. (2010) showed that the simple regret S(Λ) for GP-UCB after capital Λ can be bounded by,

Simple Regret for GP-UCB:  S(Λ) ≲ √(Ψ_{n_Λ}(X)/n_Λ).    (8)

In our analysis of BOCA we show that most queries to g at fidelity z• will be confined to a small subset of X which contains the optimum x⋆. Precisely, after capital Λ, for any α ∈ (0, 1), we show there exists ρ > 0 such that the number of queries outside the following set X_ρ is less than n_Λ^α.

X_ρ = {x ∈ X : f⋆ − f(x) ≤ 2ρ √κ₀ ξ̄}.    (9)

Here, ξ̄ is the largest value of the information gap (4) over Z. While it is true that any optimisation algorithm would eventually query extensively in a neighbourhood around the optimum, a strong result of the above form is not always possible. For instance, for GP-UCB, the best achievable bound on the number of queries in any set that does not contain x⋆ is n_Λ^{1/2}. The fact that X_ρ exists relies crucially on the multi-fidelity assumptions and on the fact that our algorithm leverages information from lower fidelities when querying at z•. As ξ is small when g is smooth across Z, the set X_ρ will be small when the approximations are highly informative about g(z•, ·). For example, when φ_Z is a SE kernel, we have X_ρ ⊆ {x ∈ X : f⋆ − f(x) ≤ 2ρ √(κ₀ p)/h_Z}. When h_Z is large and g is smooth across Z, X_ρ is small, as the right side of the inequality is smaller. As BOCA confines most of its evaluations to this small set containing x⋆, we are able to achieve much better regret than GP-UCB.
When h_Z is small and g is not smooth across Z, the set X_ρ becomes large and the advantage of multi-fidelity optimisation diminishes. One can similarly argue that for the Matérn kernel, as the parameter ν increases, g becomes smoother across Z, and X_ρ becomes smaller, yielding better bounds on the regret. Below, we provide an informal statement of our main theoretical result. ≲, ≍ will denote inequality and equality ignoring constant and polylog terms.

Theorem 1 (Informal, Regret of BOCA). Let g ∼ GP(0, κ) where κ satisfies (3). Choose β_t ≍ d log(t/δ). Then, for sufficiently large Λ and for all α ∈ (0, 1), there exists ρ depending on α such that the following bound holds with probability at least 1 − δ,

S(Λ) ≲ √(Ψ_{n_Λ}(X_ρ)/n_Λ) + n_Λ^{−(1−α/2)}.

In the above bound, the latter term vanishes fast due to the n_Λ^{−(1−α/2)} dependence. When comparing this with (8), we see that we outperform GP-UCB by a factor of √(Ψ_{n_Λ}(X_ρ)/Ψ_{n_Λ}(X)) ≍ √(vol(X_ρ)/vol(X)) asymptotically. If g is smooth across the fidelity space, X_ρ is small and the gains over GP-UCB are significant. If g becomes less smooth across Z, the bound decays gracefully, but we are never worse than GP-UCB up to constant factors. Theorem 1 also has similarities to the bounds of Kandasamy et al. (2016a), who also demonstrate better regret than GP-UCB by showing that it is dominated by queries inside a subset of X which contains the optimum. However, their bounds depend critically on certain threshold hyper-parameters which determine the volume of this subset, among other terms in their regret. The authors of that paper note that their bounds will suffer if these hyper-parameters are not chosen appropriately, but do not provide theoretically justified methods to make this choice. In contrast, many of the design choices for BOCA fall out naturally from our modelling assumptions. Beyond this analogue, our results are not comparable to Kandasamy et al. (2016a), as the assumptions are different.

Extensions: While we have focused on continuous Z, many of the ideas here can be extended to other settings.
If Z is a discrete subset of [0, 1]^p, our work extends straightforwardly. We reiterate that this will not be the same as the finite fidelity MF-GP-UCB algorithm, as the assumptions are different. In particular, Kandasamy et al. (2016a) are not able to effectively share information across fidelities as we do. We also believe that Algorithm 1 can be extended to arbitrary fidelity spaces Z provided that a kernel can be defined on Z. Our results can also be extended to discrete domains X and various other kernels for φ_X by adopting techniques from Srinivas et al. (2010).

4. Experiments

We compare BOCA to the following four baselines: (i) GP-UCB, (ii) the GP-EI criterion in BO (Jones et al., 1998), (iii) MF-GP-UCB (Kandasamy et al., 2016a) and (iv) MF-SKO, the multi-fidelity sequential kriging optimisation method from Huang et al. (2006). All methods are based on GPs and we use the SE kernel for both the fidelity space and domain. The first two are not multi-fidelity methods, while the last two are finite multi-fidelity methods². Kandasamy et al. (2016a) also study some naive multi-fidelity algorithms and demonstrate that they do not perform well; as such, we will not consider such alternatives here. In all our experiments, the fidelity space was designed to be Z = [0, 1]^p with z• = 1_p = [1, …, 1] ∈ R^p being the most expensive fidelity. For MF-GP-UCB and MF-SKO, we used 3 fidelities (2 approximations), where the approximations were obtained at z = 0.333·1_p and z = 0.667·1_p in Z. Empirically, we found that both algorithms did reasonably well with 1-3 approximations, but did not perform well with a large number of approximations (> 5); even the original papers restrict experiments to 1-3 approximations. Implementation details for all methods are given in Appendix C.1.

²To our knowledge, the only other work that applies to continuous approximations is Klein et al. (2015), which was developed specifically for hyper-parameter tuning. Further, their implementation is not made available and is not straightforward to implement.

Figure 3. Results on 6 synthetic problems (GP, p = 1, d = 1; GP-Bad-Approx, p = 1, d = 1; Currin Exp, p = 1, d = 2; Hartmann3, p = 4, d = 3; Hartmann6, p = 2, d = 6; Borehole, p = 1, d = 8) comparing GP-UCB, GP-EI, MF-GP-UCB, MF-SKO and BOCA, where we plot the simple regret S(Λ) (lower is better) against the capital Λ. The title of each panel states the function used, and the fidelity and domain dimensions. For the first two figures we used capital 30λ(z•); therefore a method which only queries at g(z•, ·) can make at most 30 evaluations. For the third figure we used 50λ(z•), for the fourth 100λ(z•) and for the last 200λ(z•), to reflect the dimensionality d of X. The curves for the multi-fidelity methods start mid-way since they have not queried at z• up until that point. All curves were produced by averaging over 20 experiments and the error bars indicate one standard error.

4.1. Synthetic Experiments

The results for the first set of synthetic experiments are given in Fig. 3. The title of each figure states the function used, and the dimensionalities p, d of the fidelity space and domain. To reflect the setting in our theory, we add Gaussian noise to the function value when observing g at any (z, x). This makes the problem more challenging than standard global optimisation problems where function evaluations are not noisy. The functions g, the cost functions λ and the noise variances η² are given in Appendix C.2. The first two panels in Fig. 3 are simple sanity checks. In both cases, Z = [0, 1], X = [0, 1] and the functions were sampled from GPs. The GP was made known to all methods, i.e. all methods used the true GP in picking the next point. In the first panel, we used an SE kernel with bandwidth 0.1 for φ_X and 1.0 for φ_Z.
g is smooth across Z in this setting, and BOCA outperforms the other baselines. Its curve starts mid-way as BOCA has yet to query at z• up until that point. The second panel uses the same set up as the first, except that we used bandwidth 0.01 for φZ. Even though g is highly unsmooth across Z, BOCA does not perform poorly. This corroborates a claim we made earlier, that BOCA can naturally adapt to the smoothness of the approximations. The other multi-fidelity methods suffer in this setting.

In the remaining experiments, we use some standard benchmarks for global optimisation. We modify them to obtain g and add noise to the observations. As the kernel and other GP hyper-parameters are unknown, we learn them by maximising the marginal likelihood every 25 iterations. We outperform all methods on all problems except in the case of the Borehole function, where MF-GP-UCB does better.

The last synthetic experiment is the Branin function, given in Fig. 4(a). We used the same set up as above, but use 10 fidelities for MF-GP-UCB and MF-SKO, where the kth fidelity is obtained at z = (k/10)·1_p in the fidelity space. Notice that the performance of the finite fidelity methods deteriorates. In particular, as MF-GP-UCB does not share information across fidelities, the approximations need to be designed carefully for the algorithm to work well. Our more natural modelling assumptions prevent such pitfalls.

We next present two real examples in astrophysics and hyper-parameter tuning. We do not add noise to the observations, but treat them as optimisation tasks where the goal is to maximise the function.

[Figure 4: three panels: (a) Branin (p = 3, d = 2), comparing GP-UCB, GP-EI, MF-GP-UCB, MF-SKO and BOCA; (b) Supernova (p = 2, d = 3), y-axis: average log likelihood; (c) News Group-SVM (p = 2, d = 2), y-axis: cross validation accuracy.]
Figure 4. (a): The synthetic benchmark with the Branin function, where we used a capital of 50λ(z•). See the caption under Fig. 3 for more details. (b), (c): Results on the supernova and news group experiments from Sections 4.2 and 4.3 respectively. We have plotted the maximum value found (higher is better) against wall clock time. (a) was averaged over 20 experiments, while (b) and (c) were averaged over 10 experiments each. The error bars indicate one standard error.

4.2. Astrophysical Maximum Likelihood Inference

We use data on Type Ia supernovae for maximum likelihood inference on 3 cosmological parameters: the Hubble constant H0 ∈ (60, 80), the dark matter fraction ΩM ∈ (0, 1) and the dark energy fraction ΩΛ ∈ (0, 1); hence d = 3. The likelihood is given by the Robertson-Walker metric, the computation of which requires a one dimensional numerical integration for each point in the dataset. Unlike typical maximum likelihood problems, here the likelihood is only accessible via point evaluations. We use the dataset from Davis et al. (2007), which has data on 192 supernovae. We construct a p = 2 dimensional multi-fidelity problem where we can choose the dataset size N ∈ [50, 192] and perform the integration on grids of size G ∈ [10^2, 10^6] via the trapezoidal rule. As the cost function for fidelity selection, we used λ(N, G) = NG, as the computation time is linear in both parameters. Our goal is to maximise the average log likelihood at z• = [192, 10^6]. For the finite fidelity methods we use three fidelities, with the approximations available at z = [97, 2.15 × 10^3] and z = [145, 4.64 × 10^4] (which correspond to 0.333·1_p and 0.667·1_p after rescaling as in Section 4.1). The results are given in Fig. 4(b), where we plot the maximum average log likelihood against wall clock time, as that is the cost in this experiment. The plot includes the time taken by each method to tune the GPs and determine the next points/fidelities for evaluation.

4.3.
Support Vector Classification with 20 News Groups

We use the 20 news groups dataset (Joachims, 1996) in a text classification task. We obtain the bag-of-words representation for each document, convert it to tf-idf features and feed them to a support vector classifier. The goal is to tune the regularisation penalty and the temperature of the RBF kernel, both in the range [10^-2, 10^3]; hence d = 2. The support vector implementation was taken from scikit-learn. We set this up as a 2 dimensional multi-fidelity problem where we can choose a dataset size N ∈ [5000, 15000] and the number of training iterations T ∈ [20, 100]. Each evaluation takes the given dataset of size N and splits it into 5 parts to perform 5-fold cross validation. As the cost function for fidelity selection, we used λ(N, T) = NT, as the training/validation complexity is linear in both parameters. Our goal is to maximise the cross validation accuracy at z• = [15000, 100]. For the finite fidelity methods we use three fidelities, with the approximations available at z = [8333, 47] and z = [11667, 73]. The results are given in Fig. 4(c), where we plot the average cross validation accuracy against wall clock time.

5. Conclusion

We studied Bayesian optimisation with continuous approximations, treating the approximations as arising out of a continuous fidelity space. While previous multi-fidelity literature has predominantly focused on a finite number of approximations, BOCA applies to continuous fidelity spaces and can potentially be extended to arbitrary fidelity spaces. We bound the simple regret for BOCA and demonstrate that it is better than methods such as GP-UCB which ignore the approximations, and that the gains are determined by the smoothness of the fidelity space. Compared to existing multi-fidelity methods, BOCA is able to share information across fidelities effectively, has more natural modelling assumptions and has fewer hyper-parameters to tune.
Empirically, we demonstrate that BOCA is competitive with other baselines on synthetic and real problems. Another nice feature of using continuous approximations is that it relieves the practitioner from having to design the approximations: she/he can simply specify the available approximations and let the algorithm decide how to use them. Going forward, we wish to extend our theoretical results to more general settings. For instance, we believe a stronger bound on the regret might be possible if φZ is a finite dimensional kernel. Since finite dimensional kernels are typically not radial (Sriperumbudur et al., 2016), our analysis techniques will not carry over straightforwardly. Another line of work that we have alluded to is to study more general fidelity spaces with an appropriately defined kernel φZ.

Acknowledgements

We would like to thank Renato Negrinho for reviewing an initial draft of this paper. This research is supported in part by DOE grant DE-SC0011114 and NSF grant IIS-1563887. KK is supported by a Facebook Ph.D. fellowship.

References

Agarwal, Alekh, Duchi, John C., Bartlett, Peter L., and Levrard, Clément. Oracle inequalities for computationally budgeted model selection. In COLT, 2011.

Auer, Peter. Using Confidence Bounds for Exploitation-exploration Trade-offs. J. Mach. Learn. Res., 2003.

Brochu, E., Cora, V. M., and de Freitas, N. A Tutorial on Bayesian Optimization of Expensive Cost Functions, with Application to Active User Modeling and Hierarchical RL. CoRR, 2010.

Cutler, Mark, Walsh, Thomas J., and How, Jonathan P. Reinforcement Learning with Multi-Fidelity Simulators. In ICRA, 2014.

Davis, T. M. et al. Scrutinizing Exotic Cosmological Models Using ESSENCE Supernova Data Combined with Other Cosmological Probes. Astrophysical Journal, 2007.

de Freitas, Nando, Smola, Alex J., and Zoghi, Masrour. Exponential Regret Bounds for Gaussian Process Bandits with Deterministic Observations. In ICML, 2012.
Forrester, Alexander I. J., Sóbester, András, and Keane, Andy J. Multi-fidelity optimization via surrogate modelling. Proceedings of the Royal Society A: Mathematical, Physical and Engineering Science, 2007.

Ghosal, Subhashis and Roy, Anindya. Posterior consistency of Gaussian process prior for nonparametric binary regression. Annals of Statistics, 2006.

Gonzalez, J., Longworth, J., James, D., and Lawrence, N. Bayesian Optimization for Synthetic Gene Design. In BayesOpt, 2014.

Hernández-Lobato, José Miguel, Hoffman, Matthew W., and Ghahramani, Zoubin. Predictive Entropy Search for Efficient Global Optimization of Black-box Functions. In NIPS, 2014.

Huang, D., Allen, T. T., Notz, W. I., and Miller, R. A. Sequential kriging optimization using multiple-fidelity evaluations. Structural and Multidisciplinary Optimization, 2006.

Hutter, Frank, Hoos, Holger H., and Leyton-Brown, Kevin. Sequential Model-based Optimization for General Algorithm Configuration. In LION, 2011.

Joachims, Thorsten. A probabilistic analysis of the Rocchio algorithm with TFIDF for text categorization. Technical report, DTIC Document, 1996.

Jones, D. R., Perttunen, C. D., and Stuckman, B. E. Lipschitzian Optimization Without the Lipschitz Constant. J. Optim. Theory Appl., 1993.

Jones, Donald R., Schonlau, Matthias, and Welch, William J. Efficient global optimization of expensive black-box functions. J. of Global Optimization, 1998.

Kandasamy, Kirthevasan, Schneider, Jeff, and Póczos, Barnabás. High Dimensional Bayesian Optimisation and Bandits via Additive Models. In International Conference on Machine Learning, 2015.

Kandasamy, Kirthevasan, Dasarathy, Gautam, Oliva, Junier, Schneider, Jeff, and Póczos, Barnabás. Gaussian Process Bandit Optimisation with Multi-fidelity Evaluations. In Advances in Neural Information Processing Systems, 2016a.

Kandasamy, Kirthevasan, Dasarathy, Gautam, Oliva, Junier B., Schneider, Jeff, and Póczos, Barnabás. Multi-fidelity Gaussian process bandit optimisation.
arXiv preprint arXiv:1603.06288, 2016b.

Kandasamy, Kirthevasan, Dasarathy, Gautam, Schneider, Jeff, and Póczos, Barnabás. The Multi-fidelity Multi-armed Bandit. In NIPS, 2016c.

Klein, A., Bartels, S., Falkner, S., Hennig, P., and Hutter, F. Towards efficient Bayesian Optimization for Big Data. In BayesOpt, 2015.

Lam, Rémi, Allaire, Douglas L., and Willcox, Karen E. Multifidelity optimization using statistical surrogate modeling for non-hierarchical information sources. In 56th AIAA/ASCE/AHS/ASC Structures, Structural Dynamics, and Materials Conference, pp. 0143, 2015.

Li, Lisha, Jamieson, Kevin, DeSalvo, Giulia, Rostamizadeh, Afshin, and Talwalkar, Ameet. Hyperband: A novel bandit-based approach to hyperparameter optimization. arXiv preprint arXiv:1603.06560, 2016.

Lizotte, Daniel, Wang, Tao, Bowling, Michael, and Schuurmans, Dale. Automatic gait optimization with Gaussian process regression. In IJCAI, 2007.

Martinez-Cantin, R., de Freitas, N., Doucet, A., and Castellanos, J. Active Policy Learning for Robot Planning and Exploration under Uncertainty. In Proceedings of Robotics: Science and Systems, 2007.

Mockus, Jonas. Application of Bayesian approach to numerical methods of global and stochastic optimization. Journal of Global Optimization, 1994.

Parkinson, D., Mukherjee, P., and Liddle, A. R. A Bayesian model selection analysis of WMAP3. Physical Review, 2006.

Poloczek, Matthias, Wang, Jialei, and Frazier, Peter I. Multi-information source optimization. arXiv preprint arXiv:1603.00389, 2016.

Rasmussen, C. E. and Williams, C. K. I. Gaussian Processes for Machine Learning. MIT Press, 2006.

Sabharwal, A., Samulowitz, H., and Tesauro, G. Selecting near-optimal learners via incremental data allocation. In AAAI, 2015.

Seeger, M. W., Kakade, S. M., and Foster, D. P. Information Consistency of Nonparametric Gaussian Process Methods. IEEE Transactions on Information Theory, 2008.

Snoek, J., Larochelle, H., and Adams, R. P.
Practical Bayesian Optimization of Machine Learning Algorithms. In NIPS, 2012.

Srinivas, Niranjan, Krause, Andreas, Kakade, Sham, and Seeger, Matthias. Gaussian Process Optimization in the Bandit Setting: No Regret and Experimental Design. In ICML, 2010.

Sriperumbudur, Bharath et al. On the optimal estimation of probability measures in weak and strong topologies. Bernoulli, 22(3):1839-1893, 2016.

Swersky, Kevin, Snoek, Jasper, and Adams, Ryan P. Multi-task Bayesian optimization. In NIPS, 2013.

Thompson, W. R. On the Likelihood that one Unknown Probability Exceeds Another in View of the Evidence of Two Samples. Biometrika, 1933.

Xiong, Shifeng, Qian, Peter Z. G., and Wu, C. F. Jeff. Sequential design and analysis of high-accuracy and low-accuracy computer codes. Technometrics, 2013.

Zhang, C. and Chaudhuri, K. Active Learning from Weak and Strong Labelers. In NIPS, 2015.