# Gaussian Process Bandit Optimisation with Multi-fidelity Evaluations

Kirthevasan Kandasamy, Gautam Dasarathy, Junier Oliva, Jeff Schneider, Barnabás Póczos
Carnegie Mellon University, Rice University
{kandasamy, joliva, schneide, bapoczos}@cs.cmu.edu, gautamd@rice.edu

## Abstract

In many scientific and engineering applications, we are tasked with the optimisation of an expensive-to-evaluate black-box function $f$. Traditional methods for this problem assume just the availability of this single function. However, in many cases, cheap approximations to $f$ may be obtainable. For example, the expensive real-world behaviour of a robot can be approximated by a cheap computer simulation. We can use these approximations to eliminate low function value regions cheaply, reserve the expensive evaluations of $f$ for a small but promising region, and speedily identify the optimum. We formalise this task as a multi-fidelity bandit problem where the target function and its approximations are sampled from a Gaussian process. We develop MF-GP-UCB, a novel method based on upper confidence bound techniques. Our theoretical analysis demonstrates that it exhibits precisely the above behaviour and achieves better regret than strategies which ignore multi-fidelity information. MF-GP-UCB outperforms such naive strategies and other multi-fidelity methods on several synthetic and real experiments.

## 1 Introduction

In stochastic bandit optimisation, we wish to optimise a payoff function $f : \mathcal{X} \to \mathbb{R}$ by sequentially querying it and obtaining bandit feedback, i.e. when we query at any $x \in \mathcal{X}$, we observe a possibly noisy evaluation of $f(x)$. $f$ is typically expensive, and the goal is to identify its maximum while keeping the number of queries as low as possible. Some applications are hyper-parameter tuning in expensive machine learning algorithms, optimal policy search in complex systems, and scientific experiments [20, 23, 27]. Historically, bandit problems were studied in settings where the goal is to maximise the cumulative reward of all queries to the payoff instead of just finding the maximum; applications in this setting include clinical trials and online advertising. Conventional methods in these settings assume access to only this single expensive function of interest $f$. We will collectively refer to them as single fidelity methods.

In many practical problems, however, cheap approximations to $f$ might be available. For instance, when tuning hyper-parameters of learning algorithms, the goal is to maximise a cross validation (CV) score on a training set, which can be expensive if the training set is large. However, CV curves tend to vary smoothly with training set size; therefore, we can train and cross-validate on small subsets to approximate the CV accuracy on the entire dataset. For a concrete example, consider kernel density estimation (KDE), where we need to tune the bandwidth $h$ of a kernel. Figure 1 shows the CV likelihood against $h$ for a dataset of size $n = 3000$ and a smaller subset of size $n = 300$. The two maximisers are different, which is to be expected since optimal hyper-parameters are functions of the training set size. That said, the curve for $n = 300$ approximates the $n = 3000$ curve quite well. Since training/CV on small $n$ is cheap, we can use it to eliminate bad values of the hyper-parameters and reserve the expensive experiments with the entire dataset for the promising candidates (e.g. the boxed region in Fig. 1).
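As an aside, such a two-fidelity CV objective is easy to construct in practice. Below is a minimal scikit-learn sketch for the KDE example, assuming a synthetic Gaussian dataset; the data, sizes and bandwidth are illustrative and this is not the paper's implementation:

```python
import numpy as np
from sklearn.neighbors import KernelDensity
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
data = rng.standard_normal((3000, 1))   # full dataset, n = 3000 (expensive fidelity)
subset = data[:300]                     # small subset, n = 300 (cheap fidelity)

def cv_log_likelihood(h, X):
    """Average 5-fold CV log-likelihood of a KDE with bandwidth h."""
    return cross_val_score(KernelDensity(bandwidth=h), X, cv=5).mean()

f_low = lambda h: cv_log_likelihood(h, subset)   # cheap approximation f^(1)
f_high = lambda h: cv_log_likelihood(h, data)    # expensive payoff f^(2)
print(f_low(0.3), f_high(0.3))
```

Querying `f_low` over many bandwidths and `f_high` only in the promising region mirrors the strategy sketched above.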
In online advertising, the goal is to maximise the cumulative number of clicks over a given period. In the conventional bandit treatment, each query to $f$ is the display of an ad for a specific time, say one hour. However, we may display ads for shorter intervals, say a few minutes, to approximate an ad's hourly performance. The estimate is biased, as displaying an ad for a longer interval changes user behaviour, but it is nonetheless useful in gauging the long-run click-through rate. In optimal policy search in robotics and automated driving, vastly cheaper computer simulations are used to approximate the expensive real-world performance of the system. Scientific experiments can be approximated to varying degrees using less expensive data collection, analysis, and computational techniques.

In this paper, we cast these tasks as multi-fidelity bandit optimisation problems, assuming the availability of cheap approximate functions (fidelities) to the payoff $f$. Our contributions are:

1. We present a formalism for multi-fidelity bandit optimisation using Gaussian process (GP) assumptions on $f$ and its approximations. We develop a novel algorithm, Multi-Fidelity Gaussian Process Upper Confidence Bound (MF-GP-UCB), for this setting.
2. Our theoretical analysis proves that MF-GP-UCB explores the space at lower fidelities and uses the high fidelities in successively smaller regions to zero in on the optimum. As lower fidelity queries are cheaper, MF-GP-UCB has better regret than single fidelity strategies.
3. Empirically, we demonstrate that MF-GP-UCB outperforms single fidelity methods on a series of synthetic examples, three hyper-parameter tuning tasks and one inference problem in astrophysics. Our MATLAB implementation and experiments are available at github.com/kirthevasank/mf-gp-ucb.

Related Work: Since the seminal work by Robbins [25], the multi-armed bandit problem has been studied extensively in the $K$-armed setting. Recently, there has been a surge of interest in the optimism under uncertainty principle for $K$-armed bandits, typified by upper confidence bound (UCB) methods [2, 4]. UCB strategies have also been used in bandit tasks with linear [6] and GP [28] payoffs. There is a plethora of work on single fidelity methods for global optimisation, both with noisy and noiseless evaluations. Some examples are branch-and-bound techniques such as dividing rectangles (DiRect) [12], simulated annealing, genetic algorithms and more [17, 18, 22]. A suite of single fidelity methods in the GP framework closely related to our work is Bayesian optimisation (BO). While there are several techniques for BO [13, 21, 30], of particular interest to us is the Gaussian process upper confidence bound (GP-UCB) algorithm of Srinivas et al. [28].

Many applied domains of research, such as aerodynamics, industrial design and hyper-parameter tuning, have studied multi-fidelity methods [9, 11, 19, 29]; a plurality of them use BO techniques. However, none of these treatments formalise or analyse any notion of regret in the multi-fidelity setting. In contrast, MF-GP-UCB is an intuitive UCB idea with good theoretical properties. Some works have analysed multi-fidelity methods in specific contexts such as hyper-parameter tuning, active learning and reinforcement learning [1, 5, 26, 33]. Their settings and assumptions are substantially different from ours.
Critically, none of them are in the more difficult bandit setting where there is a price for exploration. Due to space constraints, we discuss them in detail in Appendix A.3. The multi-fidelity setting poses substantially new theoretical and algorithmic challenges. We build on GP-UCB and our recent work on multi-fidelity bandits in the $K$-armed setting [16]. Section 2 presents our formalism, including a notion of regret for multi-fidelity GP bandits. Section 3 presents our algorithm. The theoretical analysis is in Appendix C, with a synopsis for the 2-fidelity case in Section 4. Section 6 presents our experiments. Appendix A.1 tabulates the notation used in the manuscript.

## 2 Preliminaries

We wish to maximise a payoff function $f : \mathcal{X} \to \mathbb{R}$, where $\mathcal{X} = [0, r]^d$. We can interact with $f$ only by querying at some $x \in \mathcal{X}$ and obtaining a noisy observation $y = f(x) + \epsilon$. Let $x_\star \in \mathrm{argmax}_{x \in \mathcal{X}} f(x)$ and $f_\star = f(x_\star)$. Let $x_t \in \mathcal{X}$ be the point queried at time $t$. The goal of a bandit strategy is to maximise the sum of rewards $\sum_{t=1}^{n} f(x_t)$, or equivalently to minimise the cumulative regret $\sum_{t=1}^{n} \big(f_\star - f(x_t)\big)$ after $n$ queries; i.e. we compete against an oracle which queries at $x_\star$ at all $t$.

Our primary distinction from the classical setting is that we have access to $M - 1$ successively accurate approximations $f^{(1)}, f^{(2)}, \dots, f^{(M-1)}$ to the payoff $f = f^{(M)}$. We refer to these approximations as fidelities. We encode the fact that fidelity $m$ approximates fidelity $M$ via the assumption $\|f^{(M)} - f^{(m)}\|_\infty \le \zeta^{(m)}$, where $\zeta^{(1)} > \zeta^{(2)} > \dots > \zeta^{(M)} = 0$. Each query at fidelity $m$ expends a cost $\lambda^{(m)}$ of a resource, e.g. computational effort or advertising time, where $\lambda^{(1)} < \lambda^{(2)} < \dots < \lambda^{(M)}$. A strategy for multi-fidelity bandits is a sequence of query-fidelity pairs $\{(x_t, m_t)\}_{t \ge 0}$, where $(x_n, m_n)$ could depend on the previous query-observation-fidelity tuples $\{(x_t, y_t, m_t)\}_{t=1}^{n-1}$. Here $y_t = f^{(m_t)}(x_t) + \epsilon$. After $n$ steps, we may have queried each of the $M$ fidelities multiple times.

*Figure 1: Left: average CV log-likelihood on datasets of size 300 and 3000 on a synthetic KDE task; the crosses mark the maxima. Right: illustration of GP-UCB at time $t$, showing $f(x)$ (solid black line), the UCB $\varphi_t(x)$ (dashed blue line) and the queries until $t - 1$ (black crosses); we query at $x_t = \mathrm{argmax}_{x \in \mathcal{X}} \varphi_t(x)$ (red star).*

Some smoothness assumptions on the $f^{(m)}$'s are needed to make the problem tractable. A standard in the Bayesian nonparametric literature is to use a Gaussian process (GP) prior [24] with covariance kernel $\kappa$. In this work we focus on the squared exponential (SE) kernel $\kappa_{\sigma,h}$ and the Matérn kernel $\kappa_{\nu,h}$, as they are popularly used in practice and their theoretical properties are well studied. Writing $z = \|x - x'\|_2$, they are defined as
$$\kappa_{\sigma,h}(x, x') = \sigma \exp\!\Big(\frac{-z^2}{2h^2}\Big), \qquad \kappa_{\nu,h}(x, x') = \frac{2^{1-\nu}}{\Gamma(\nu)} \Big(\frac{\sqrt{2\nu}\, z}{h}\Big)^{\nu} B_\nu\!\Big(\frac{\sqrt{2\nu}\, z}{h}\Big),$$
where $\Gamma, B_\nu$ are the Gamma and modified Bessel functions. A convenience the GP framework offers is that posterior distributions are analytically tractable. If $f \sim GP(0, \kappa)$ and we have observations $\mathcal{D}_n = \{(x_i, y_i)\}_{i=1}^{n}$, where $y_i = f(x_i) + \epsilon$ and $\epsilon \sim \mathcal{N}(0, \eta^2)$ is Gaussian noise, then the posterior distribution for $f(x) \mid \mathcal{D}_n$ is also Gaussian, $\mathcal{N}(\mu_n(x), \sigma_n^2(x))$, with
$$\mu_n(x) = k^\top \Delta^{-1} Y, \qquad \sigma_n^2(x) = \kappa(x, x) - k^\top \Delta^{-1} k. \quad (1)$$
Here $Y \in \mathbb{R}^n$ with $Y_i = y_i$, $k \in \mathbb{R}^n$ with $k_i = \kappa(x, x_i)$, and $\Delta = K + \eta^2 I \in \mathbb{R}^{n \times n}$ where $K_{i,j} = \kappa(x_i, x_j)$. In keeping with the above, we make the following assumptions on our problem.

Assumption 1. A1: The functions at all fidelities are sampled from GPs, $f^{(m)} \sim GP(0, \kappa)$ for all $m = 1, \dots, M$. A2: $\|f^{(M)} - f^{(m)}\|_\infty \le \zeta^{(m)}$ for all $m = 1, \dots, M$. A3: $\|f^{(M)}\|_\infty \le B$. The purpose of A3 is primarily to define the regret.
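For readers who want to see (1) in code, here is a minimal NumPy sketch of the GP posterior update, assuming an SE kernel; it is illustrative only and not the paper's MATLAB implementation:

```python
import numpy as np

def se_kernel(X1, X2, sigma=1.0, h=0.5):
    """SE kernel: kappa(x, x') = sigma * exp(-||x - x'||^2 / (2 h^2))."""
    sq = np.sum((X1[:, None, :] - X2[None, :, :]) ** 2, axis=-1)
    return sigma * np.exp(-sq / (2 * h ** 2))

def gp_posterior(X, y, Xstar, eta=0.1):
    """Posterior mean and variance of Eq. (1) at test points Xstar, given data (X, y)."""
    Delta = se_kernel(X, X) + eta ** 2 * np.eye(len(X))   # Delta = K + eta^2 I
    k = se_kernel(X, Xstar)                               # column j holds kappa(x_i, x*_j)
    mu = k.T @ np.linalg.solve(Delta, y)                  # k^T Delta^{-1} Y
    var = se_kernel(Xstar, Xstar).diagonal() - np.sum(k * np.linalg.solve(Delta, k), axis=0)
    return mu, var

# Toy usage: 5 noisy observations of sin on [0, 3], posterior on a 50-point grid.
X = np.linspace(0, 3, 5).reshape(-1, 1)
y = np.sin(X).ravel() + 0.1 * np.random.default_rng(0).standard_normal(5)
mu, var = gp_posterior(X, y, np.linspace(0, 3, 50).reshape(-1, 1))
```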
In Remark 7, Appendix A.4, we argue that these assumptions are probabilistically valid, i.e. the latter two events occur with nontrivial probability when we sample the $f^{(m)}$'s from a GP. So a generative mechanism would keep sampling the functions and deliver them when the conditions hold true. A point $x \in \mathcal{X}$ can be queried at any of the $M$ fidelities. When we query at fidelity $m$, we observe $y = f^{(m)}(x) + \epsilon$, where $\epsilon \sim \mathcal{N}(0, \eta^2)$.

We now present our notion of cumulative regret $R(\Lambda)$ after spending capital $\Lambda$ of a resource in the multi-fidelity setting. $R(\Lambda)$ should reduce to the conventional definition of regret for any single fidelity strategy that queries only at the $M$th fidelity. As only the optimum of $f = f^{(M)}$ is of interest to us, queries at fidelities less than $M$ should yield the lowest possible reward, $-B$ according to A3. Accordingly, we set the instantaneous reward $q_t$ at time $t$ to be $-B$ if $m_t \neq M$ and $f^{(M)}(x_t)$ if $m_t = M$. If we let $r_t = f_\star - q_t$ denote the instantaneous regret, we have $r_t = f_\star + B$ if $m_t \neq M$ and $r_t = f_\star - f^{(M)}(x_t)$ if $m_t = M$. $R(\Lambda)$ should also factor in the cost of each query's fidelity. Finally, we should also receive the $-B$ reward for any unused capital. Accordingly, we define $R(\Lambda)$ as
$$R(\Lambda) = \Lambda f_\star - \bigg[\sum_{t=1}^{N} \lambda^{(m_t)} q_t + \Lambda_{\mathrm{res}} \cdot (-B)\bigg] = \sum_{t=1}^{N} \lambda^{(m_t)} r_t + \Lambda_{\mathrm{res}} (f_\star + B), \quad (2)$$
where $\Lambda_{\mathrm{res}} = \Lambda - \sum_{t=1}^{N} \lambda^{(m_t)}$ is the residual capital. Here, $N$ is the (random) number of queries at all fidelities within capital $\Lambda$, i.e. the largest $n$ such that $\sum_{t=1}^{n} \lambda^{(m_t)} \le \Lambda$. According to (2), we wish to compete against an oracle that uses all its capital $\Lambda$ to query $x_\star$ at the $M$th fidelity. $R(\Lambda)$ is at best 0, when we follow the oracle, and at most $2\Lambda B$. Our goal is a strategy that has small regret for all values of (sufficiently large) $\Lambda$, i.e. the equivalent of an anytime strategy, as opposed to a fixed time horizon strategy in the usual bandit setting. For the purpose of optimisation, we also define the simple regret $S(\Lambda) = \min_t r_t = f_\star - \max_t q_t$. $S(\Lambda)$ is the difference between $f_\star$ and the best highest-fidelity query (and $f_\star + B$ if we have never queried at fidelity $M$). Since $S(\Lambda) \le \frac{1}{\Lambda} R(\Lambda)$, any strategy with asymptotically sublinear regret, $\lim_{\Lambda \to \infty} \frac{1}{\Lambda} R(\Lambda) = 0$, also has vanishing simple regret.

Since, to our knowledge, this is the first attempt to formalise regret for multi-fidelity problems, the definition of $R(\Lambda)$ in (2) necessitates justification. Consider a two-fidelity robot gold mining problem where the second fidelity is a real-world robot trial costing $\lambda^{(2)}$ dollars, and the first fidelity is a computer simulation costing $\lambda^{(1)}$. A multi-fidelity algorithm queries the simulator to learn about the real world. But it does not collect any actual gold during a simulation; hence no reward, which according to our assumptions is $-B$. Meanwhile, the oracle is investing this capital in the best experiment and collecting $f_\star$ gold. Therefore, the regret at this time instant is $f_\star + B$. However, we weight this by the cost to account for the fact that the simulation costs only $\lambda^{(1)}$. Note that lower fidelities use up capital but yield the lowest reward. The goal, however, is to leverage information from these cheap queries to query prudently at the highest fidelity and obtain better regret. That said, other multi-fidelity settings might require different definitions of $R(\Lambda)$. In online advertising, the lower fidelities (displaying ads for shorter periods) would still yield rewards. In clinical trials, the regret at the highest fidelity due to a bad treatment would be, say, a dead patient, whereas a bad treatment in a simulation may not warrant a large penalty.
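To make (2) concrete, here is a toy computation of $R(\Lambda)$ under purely hypothetical values: a two-fidelity problem with $f_\star = 1$, $B = 2$ and costs $\lambda^{(1)} = 1$, $\lambda^{(2)} = 10$.

```python
def cumulative_regret(queries, Lam, f_star, B, costs):
    """Multi-fidelity cumulative regret, Eq. (2).
    queries: list of (fidelity m_t, reward q_t), with q_t = -B whenever m_t < M."""
    spent = sum(costs[m] for m, _ in queries)
    assert spent <= Lam, "queries must fit within the capital"
    total_reward = sum(costs[m] * q for m, q in queries) + (Lam - spent) * (-B)
    return Lam * f_star - total_reward

# Hypothetical example: two cheap simulator probes (reward -B), one good top-fidelity query.
costs = {1: 1.0, 2: 10.0}                     # lambda^(1), lambda^(2)
queries = [(1, -2.0), (1, -2.0), (2, 0.9)]
print(cumulative_regret(queries, Lam=15.0, f_star=1.0, B=2.0, costs=costs))  # -> 16.0
```

The two simulator queries contribute $\lambda^{(1)}(f_\star + B)$ each and the top-fidelity query contributes $\lambda^{(2)}(f_\star - 0.9)$, with the unused capital penalised at $f_\star + B$, matching the second form of (2).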
We use the definition in (2) because it is more aligned with our optimisation experiments: lower fidelities are useful to the extent that they guide the search on the expensive $f^{(M)}$, but there is no reward for finding the optimum of a cheap $f^{(m)}$. A crucial challenge for a multi-fidelity method is to not get stuck at the optimum of a lower fidelity, which is typically suboptimal for $f^{(M)}$. While exploiting information from the lower fidelities, it is also important to explore sufficiently at $f^{(M)}$. In our experiments we demonstrate that naive strategies which do not do so get stuck at the optimum of a lower fidelity.

A note on GP-UCB: Sequential optimisation methods adopting UCB principles maintain a high-probability upper bound $\varphi_t : \mathcal{X} \to \mathbb{R}$ for $f(x)$ for all $x \in \mathcal{X}$ [2]. For GP-UCB, $\varphi_t$ takes the form $\varphi_t(x) = \mu_{t-1}(x) + \beta_t^{1/2} \sigma_{t-1}(x)$, where $\mu_{t-1}, \sigma_{t-1}$ are the posterior mean and standard deviation of the GP conditioned on the previous $t - 1$ queries. The key intuition is that the mean $\mu_{t-1}$ encourages an exploitative strategy, in that we want to query where we know the function is high, while the confidence band $\beta_t^{1/2} \sigma_{t-1}$ encourages an explorative strategy, in that we want to query regions where we are uncertain about $f$ lest we miss out on high-valued regions. We have illustrated GP-UCB in Fig. 1 and reviewed the algorithm and its theoretical properties in Appendix A.2.

## 3 MF-GP-UCB

The proposed algorithm, MF-GP-UCB, will also maintain a UCB for $f^{(M)}$, obtained via the previous queries at all fidelities. Denote the posterior GP mean and standard deviation of $f^{(m)}$, conditioned only on the previous queries at fidelity $m$, by $\mu_t^{(m)}, \sigma_t^{(m)}$ respectively (see (1)). Then define
$$\varphi_t^{(m)}(x) = \mu_{t-1}^{(m)}(x) + \beta_t^{1/2} \sigma_{t-1}^{(m)}(x) + \zeta^{(m)} \;\; \forall m, \qquad \varphi_t(x) = \min_{m=1,\dots,M} \varphi_t^{(m)}(x). \quad (3)$$
For appropriately chosen $\beta_t$, $\mu_{t-1}^{(m)}(x) + \beta_t^{1/2} \sigma_{t-1}^{(m)}(x)$ will upper bound $f^{(m)}(x)$ with high probability. By A2, $\varphi_t^{(m)}(x)$ upper bounds $f^{(M)}(x)$ for all $m$. We have $M$ such upper bounds, and their minimum $\varphi_t(x)$ gives the best bound. Our next query is at the maximiser of this UCB, $x_t = \mathrm{argmax}_{x \in \mathcal{X}} \varphi_t(x)$.

Next we need to decide which fidelity to query at. Consider any $m < M$. The $\zeta^{(m)}$ conditions on $f^{(m)}$ constrain the value of $f^{(M)}$: the confidence band $\beta_t^{1/2} \sigma_{t-1}^{(m)}$ for $f^{(m)}$ is lengthened by $\zeta^{(m)}$ to obtain confidence on $f^{(M)}$. If $\beta_t^{1/2} \sigma_{t-1}^{(m)}(x_t)$ for $f^{(m)}$ is large, it means that we have not constrained $f^{(m)}$ sufficiently well at $x_t$ and should query at the $m$th fidelity. On the other hand, querying indefinitely in the same region to reduce $\beta_t^{1/2} \sigma_{t-1}^{(m)}$ in that region will not help us much, as the $\zeta^{(m)}$ elongation caps off how much we can learn about $f^{(M)}$ from $f^{(m)}$; i.e. even if we knew $f^{(m)}$ perfectly, we will only have constrained $f^{(M)}$ to within a $\zeta^{(m)}$ band. Our algorithm captures this simple intuition. Having selected $x_t$, we begin by checking at the first fidelity. If $\beta_t^{1/2} \sigma_{t-1}^{(1)}(x_t)$ is smaller than a threshold $\gamma^{(1)}$, we proceed to the second fidelity. If at any stage $\beta_t^{1/2} \sigma_{t-1}^{(m)}(x_t) \ge \gamma^{(m)}$, we query at fidelity $m_t = m$. If we proceed all the way to fidelity $M$, we query at $m_t = M$. We will discuss choices for $\gamma^{(m)}$ shortly. We summarise the resulting procedure in Algorithm 1; a code sketch of the selection rules is given just below.

Fig. 2 illustrates MF-GP-UCB on a 2-fidelity problem. Initially, MF-GP-UCB mostly explores $\mathcal{X}$ at the first fidelity: $\beta_t^{1/2} \sigma_{t-1}^{(1)}$ is large, and we are yet to constrain $f^{(1)}$ well enough to proceed to $f^{(2)}$. By $t = 14$, we have constrained $f^{(1)}$ around the optimum and have started querying at $f^{(2)}$ in this region.
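Before the formal pseudo-code in Algorithm 1, the following is a minimal Python sketch of one step of this procedure: the UCB maximisation in (3) over a finite candidate grid (standing in for continuous optimisation), followed by the fidelity selection rule. The `posteriors[m]` interface, which returns $(\mu_{t-1}^{(m)}, \sigma_{t-1}^{(m)})$ on a set of points, is an assumed abstraction, not the paper's code:

```python
import numpy as np

def next_query(candidates, posteriors, zeta, gamma, beta_t):
    """One step of MF-GP-UCB over a finite candidate grid.
    posteriors[m](X) -> (mu, sigma) arrays for fidelity m (assumed interface);
    zeta, gamma hold zeta^(m) and gamma^(m) (zeta for the top fidelity is 0)."""
    M = len(posteriors)
    stats = [posteriors[m](candidates) for m in range(M)]
    # phi^(m)_t = mu^(m) + sqrt(beta_t) * sigma^(m) + zeta^(m); phi_t is their minimum (Eq. 3).
    ucbs = np.stack([mu + np.sqrt(beta_t) * sig + zeta[m] for m, (mu, sig) in enumerate(stats)])
    i = int(np.argmax(ucbs.min(axis=0)))          # x_t = argmax_x phi_t(x)
    # Lowest fidelity whose confidence band at x_t is still wide; else the top fidelity.
    m_t = next((m for m in range(M - 1) if np.sqrt(beta_t) * stats[m][1][i] >= gamma[m]), M - 1)
    return candidates[i], m_t
```

Steps 3 and 4 of Algorithm 1, querying $f^{(m_t)}$ at $x_t$ and updating that fidelity's posterior, would complete the loop.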
Algorithm 1: MF-GP-UCB

Inputs: kernel $\kappa$, bounds $\{\zeta^{(m)}\}_{m=1}^{M}$, thresholds $\{\gamma^{(m)}\}_{m=1}^{M}$.
For $m = 1, \dots, M$: $\mathcal{D}_0^{(m)} \leftarrow \varnothing$, $(\mu_0^{(m)}, \sigma_0^{(m)}) \leftarrow (\mathbf{0}, \kappa^{1/2})$.
for $t = 1, 2, \dots$

1. $x_t \leftarrow \mathrm{argmax}_{x \in \mathcal{X}} \varphi_t(x)$. (See Equation (3).)
2. $m_t \leftarrow \min \{ m \mid \beta_t^{1/2} \sigma_{t-1}^{(m)}(x_t) \ge \gamma^{(m)} \text{ or } m = M \}$. (See Appendices B, C for $\beta_t$.)
3. $y_t \leftarrow$ query $f^{(m_t)}$ at $x_t$.
4. Update $\mathcal{D}_t^{(m_t)} \leftarrow \mathcal{D}_{t-1}^{(m_t)} \cup \{(x_t, y_t)\}$. Obtain $\mu_t^{(m_t)}, \sigma_t^{(m_t)}$ conditioned on $\mathcal{D}_t^{(m_t)}$. (See (1).)

*Figure 2: Illustration of MF-GP-UCB for a 2-fidelity problem initialised with 5 random points at the first fidelity. In the top figures, the solid lines in brown and blue are $f^{(1)}, f^{(2)}$ respectively, and the dashed lines are $\varphi_t^{(1)}, \varphi_t^{(2)}$. The solid green line is $\varphi_t = \min(\varphi_t^{(1)}, \varphi_t^{(2)})$. The small crosses are queries from 1 to $t - 1$ and the red star is the maximiser of $\varphi_t$, i.e. the next query $x_t$. $x_\star$, the optimum of $f^{(2)}$, is shown in magenta. In the bottom figures, the solid orange line is $\beta_t^{1/2} \sigma_{t-1}^{(1)}$ and the dashed black line is $\gamma^{(1)}$. When $\beta_t^{1/2} \sigma_{t-1}^{(1)}(x_t) \le \gamma^{(1)}$ we play at fidelity $m_t = 2$, and otherwise at $m_t = 1$. See Fig. 6 in Appendix B for an extended simulation.*

Notice how $\varphi_t^{(2)}$ dips to change $\varphi_t$ in this region. MF-GP-UCB has identified the maximum with just 3 queries to $f^{(2)}$. In Appendix B we provide an extended simulation and discuss further insights. Finally, we make an essential observation. The posterior for any $f^{(m)}(x)$ conditioned on previous queries at all fidelities is not Gaussian, due to the $\zeta^{(m)}$ constraints (A2). However, $|f^{(m)}(x) - \mu_{t-1}^{(m)}(x)| < \beta_t^{1/2} \sigma_{t-1}^{(m)}(x)$ holds with high probability, since, by conditioning only on queries at the $m$th fidelity, we have Gaussianity for $f^{(m)}(x)$. Next we summarise our main theoretical contributions.

## 4 Summary of Theoretical Results

For pedagogical reasons we present our results for the $M = 2$ case; Appendix C contains statements and proofs for general $M$. We also ignore constants and polylog terms when they are dominated by other terms. $\lesssim$ and $\asymp$ denote inequality and equality ignoring constants. We begin by defining the maximum information gain (MIG), which characterises the statistical difficulty of GP bandits.

Definition 2 (Maximum Information Gain). Let $f \sim GP(0, \kappa)$. Consider any $A \subset \mathbb{R}^d$ and let $\tilde{A} = \{x_1, \dots, x_n\} \subset A$ be a finite subset. Let $f_{\tilde{A}}, \epsilon_{\tilde{A}} \in \mathbb{R}^n$ be such that $(f_{\tilde{A}})_i = f(x_i)$, $(\epsilon_{\tilde{A}})_i \sim \mathcal{N}(0, \eta^2)$, and $y_{\tilde{A}} = f_{\tilde{A}} + \epsilon_{\tilde{A}}$. Let $I$ denote the Shannon mutual information. The maximum information gain of $A$ is $\Psi_n(A) = \max_{\tilde{A} \subset A, |\tilde{A}| = n} I(y_{\tilde{A}}; f_{\tilde{A}})$.

The MIG, which depends on the kernel $\kappa$ and the set $A$, is an important quantity in our analysis. For a given $\kappa$, it typically scales with the volume of $A$: if $A = [0, r]^d$, then $\Psi_n(A) \in O(r^d \Psi_n([0, 1]^d))$. For the SE kernel, $\Psi_n([0, 1]^d) \in O((\log n)^{d+1})$, and for the Matérn kernel, $\Psi_n([0, 1]^d) \in O\big(n^{\frac{d(d+1)}{2\nu + d(d+1)}}\big)$ [28].

Recall that $N$ is the (random) number of queries by a multi-fidelity strategy within capital $\Lambda$ at either fidelity. Let $n_\Lambda = \Lambda / \lambda^{(2)}$ be the (non-random) number of queries by a single fidelity method operating only at the second fidelity. As $\lambda^{(1)} < \lambda^{(2)}$, $N$ could be large for an arbitrary multi-fidelity method. However, our analysis reveals that for MF-GP-UCB, $N$ is on the order of $n_\Lambda$. Fundamental to the 2-fidelity problem is the set $\mathcal{X}_g = \{x \in \mathcal{X} : f_\star - f^{(1)}(x) \le \zeta^{(1)}\}$. $\mathcal{X}_g$ is a high-valued region for $f^{(2)}$: for all $x \in \mathcal{X}_g$, $f^{(2)}(x)$ is at most $2\zeta^{(1)}$ away from the optimum. More interestingly, when $\zeta^{(1)}$ is small, i.e. when $f^{(1)}$ is a good approximation to $f^{(2)}$, $\mathcal{X}_g$ will be much smaller than $\mathcal{X}$.
This is precisely the target domain for this research. For instance, in the robot gold mining example, a cheap computer simulator can be used to eliminate several bad policies, and we can reserve the real-world trials for the promising candidates. If a multi-fidelity strategy were to use its second fidelity queries only in $\mathcal{X}_g$, then the regret would only have $\Psi_n(\mathcal{X}_g)$ dependence after $n$ high fidelity queries. In contrast, a strategy that operates only at the highest fidelity (e.g. GP-UCB) will have $\Psi_n(\mathcal{X})$ dependence. In the scenario described above, $\Psi_n(\mathcal{X}_g) \ll \Psi_n(\mathcal{X})$, and the multi-fidelity strategy will have significantly better regret than a single fidelity strategy. MF-GP-UCB roughly achieves this goal. In particular, we consider a slightly inflated version of $\mathcal{X}_g$, namely $\widetilde{\mathcal{X}}_{g,\rho} = \{x \in \mathcal{X} : f_\star - f^{(1)}(x) \le \zeta^{(1)} + \rho \gamma^{(1)}\}$, where $\rho > 0$. The following result, which characterises the regret of MF-GP-UCB in terms of $\widetilde{\mathcal{X}}_{g,\rho}$, is the main theorem of this paper.

Theorem 3 (Regret of MF-GP-UCB, informal). Let $\mathcal{X} = [0, r]^d$ and let $f^{(1)}, f^{(2)} \sim GP(0, \kappa)$ satisfy Assumption 1. Pick $\delta \in (0, 1)$ and run MF-GP-UCB with $\beta_t \asymp d \log(t/\delta)$. Then, with probability $> 1 - \delta$, for sufficiently large $\Lambda$ and for all $\alpha \in (0, 1)$, there exists $\rho$ depending on $\alpha$ such that
$$R(\Lambda) \lesssim \lambda^{(2)} \sqrt{n_\Lambda \beta_{n_\Lambda} \Psi_{n_\Lambda}(\widetilde{\mathcal{X}}_{g,\rho})} + \lambda^{(1)} \sqrt{n_\Lambda \beta_{n_\Lambda} \Psi_{n_\Lambda}(\mathcal{X})} + \lambda^{(2)} \sqrt{n_\Lambda^{\alpha} \beta_{n_\Lambda} \Psi_{n_\Lambda^{\alpha}}(\mathcal{X})} + \lambda^{(1)} \xi_{n, \widetilde{\mathcal{X}}_{g,\rho}, \gamma^{(1)}},$$
where the term $\xi_{n, \widetilde{\mathcal{X}}_{g,\rho}, \gamma^{(1)}}$ is made precise in Appendix C.

As we will explain shortly, the latter two terms are of lower order. It is instructive to compare the above rates against that of GP-UCB (see Theorem 4, Appendix A.2). Dropping the common and subdominant terms, the rate for MF-GP-UCB is $\lambda^{(2)} \Psi_{n_\Lambda}^{1/2}(\widetilde{\mathcal{X}}_{g,\rho}) + \lambda^{(1)} \Psi_{n_\Lambda}^{1/2}(\mathcal{X})$, whereas for GP-UCB it is $\lambda^{(2)} \Psi_{n_\Lambda}^{1/2}(\mathcal{X})$. When $\lambda^{(1)} \ll \lambda^{(2)}$ and $\mathrm{vol}(\widetilde{\mathcal{X}}_{g,\rho}) \ll \mathrm{vol}(\mathcal{X})$, the rates for MF-GP-UCB are very appealing. When the approximation worsens ($\mathcal{X}_g, \widetilde{\mathcal{X}}_{g,\rho}$ become larger) and the costs $\lambda^{(1)}, \lambda^{(2)}$ become comparable, the bound for MF-GP-UCB decays gracefully. In the worst case, MF-GP-UCB is never worse than GP-UCB up to constant terms. Intuitively, the above result states that MF-GP-UCB explores the entire $\mathcal{X}$ using $f^{(1)}$ but uses most of its queries to $f^{(2)}$ inside $\widetilde{\mathcal{X}}_{g,\rho}$.

Now let us turn to the latter two terms in the bound. The third term is the regret due to the second fidelity queries outside $\widetilde{\mathcal{X}}_{g,\rho}$. We are able to show that the number of such queries is $O(n_\Lambda^{\alpha})$ for all $\alpha > 0$ for an appropriate $\rho$. This strong result is only possible in the multi-fidelity setting. For example, in GP-UCB, the best achievable bound on the number of plays in a suboptimal set is $O(n_\Lambda^{1/2})$ for the SE kernel, and worse for the Matérn kernel. The last term is due to the first fidelity plays inside $\widetilde{\mathcal{X}}_{g,\rho}$; it scales with $\mathrm{vol}(\widetilde{\mathcal{X}}_{g,\rho})$ and polylogarithmically with $n$, both of which are small. However, it has a $1/\mathrm{poly}(\gamma^{(1)})$ dependence, which could be bad if $\gamma^{(1)}$ is too small: intuitively, if $\gamma^{(1)}$ is too small, we will wait a long time in step 2 of Algorithm 1 for $\beta_t^{1/2} \sigma_{t-1}^{(1)}$ to decrease without proceeding to $f^{(2)}$, incurring a large regret ($f_\star + B$) in the process. Our analysis reveals that an optimal choice for the SE kernel scales as $\gamma^{(1)} \asymp (\lambda^{(1)} \zeta^{(1)} / (t \lambda^{(2)}))^{1/(d+2)}$ at time $t$. However, this is of little practical use, as the leading constant depends on several problem dependent quantities such as $\Psi_n(\mathcal{X}_g)$. In Section 5 we describe a heuristic for setting $\gamma^{(m)}$ which worked well in our experiments. Theorem 3 can be generalised to cases where the kernels $\kappa^{(m)}$ and observation noises $\eta^{(m)}$ are different at each fidelity; the changes to the proofs are minimal, and in fact our practical implementation uses different kernels.
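As an aside, the mutual information in Definition 2 has a closed form for GPs with Gaussian noise, $I(y_{\tilde{A}}; f_{\tilde{A}}) = \frac{1}{2} \log\det(I + \eta^{-2} K_{\tilde{A}})$, where $K_{\tilde{A}}$ is the Gram matrix of $\tilde{A}$ [28]. The sketch below evaluates this quantity for a fixed subset; the maximisation over all $n$-point subsets in $\Psi_n$ is combinatorial and is not attempted here:

```python
import numpy as np

def information_gain(K, eta):
    """I(y_A; f_A) = 0.5 * log det(I + eta^-2 K) for a Gram matrix K [28]."""
    _, logdet = np.linalg.slogdet(np.eye(len(K)) + K / eta ** 2)
    return 0.5 * logdet

# Example: information gain of 20 random points in [0, 1]^2 under an SE kernel.
rng = np.random.default_rng(0)
X = rng.random((20, 2))
sq = np.sum((X[:, None] - X[None, :]) ** 2, axis=-1)
K = np.exp(-sq / (2 * 0.5 ** 2))   # SE kernel with sigma = 1, h = 0.5
print(information_gain(K, eta=0.1))
```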
As with any nonparametric method, our algorithm has exponential dependence on dimension; this can be alleviated by assuming additional structure in the problem [8, 15]. Finally, we note that the above rates translate to bounds on the simple regret $S(\Lambda)$ for optimisation.

## 5 Implementation Details

Our implementation uses some standard techniques in Bayesian optimisation to learn the kernel, such as initialisation with random queries and periodic marginal likelihood maximisation. These techniques may already be known to a reader familiar with the BO literature; we elaborate on them in Appendix B and focus here on the $\gamma^{(m)}, \zeta^{(m)}$ parameters of our method.

Algorithm 1 assumes that the $\zeta^{(m)}$'s are given with the problem description, which is hardly the case in practice. In our implementation, instead of having to deal with $M - 1$ separate $\zeta^{(m)}$ values, we set $(\zeta^{(1)}, \zeta^{(2)}, \dots, \zeta^{(M-1)}) = ((M-1)\zeta, (M-2)\zeta, \dots, \zeta)$, so we only have one value $\zeta$. This, for instance, is satisfied if $\|f^{(m)} - f^{(m-1)}\|_\infty \le \zeta$, which is stronger than Assumption A2. Initially, we start with a small $\zeta$. Whenever we query at any fidelity $m > 1$, we also check the posterior mean of the $(m-1)$th fidelity. If $|f^{(m)}(x_t) - \mu_{t-1}^{(m-1)}(x_t)| > \zeta$, we query again at $x_t$, but at the $(m-1)$th fidelity. If then $|f^{(m)}(x_t) - f^{(m-1)}(x_t)| > \zeta$, we update $\zeta$ to twice the violation. To set the $\gamma^{(m)}$'s we use the following intuition: if the algorithm is stuck at fidelity $m$ for too long, then $\gamma^{(m)}$ is probably too small. We start with small values for $\gamma^{(m)}$. If the algorithm does not query above the $m$th fidelity for more than $\lambda^{(m+1)}/\lambda^{(m)}$ iterations, we double $\gamma^{(m)}$ (see the sketch after the next paragraph). We found our implementation to be fairly robust, even recovering from fairly bad approximations at the lower fidelities (see Appendix D.3).

*Figure 3: The simple regret $S(\Lambda)$ against the spent capital $\Lambda$ on synthetic functions (panels: Borehole-8D, $M = 2$, costs $[1; 10]$; Hartmann-3D, $M = 3$, costs $[1; 10; 100]$; third panel: query frequencies for Hartmann-3D, $m = 1, 2, 3$; methods: MF-GP-UCB, GP-UCB, EI, DiRect, MF-NAIVE, MF-SKO). The title of each panel states the function, its dimensionality, the number of fidelities and the costs used for each fidelity in the experiment. All curves, barring DiRect (which is deterministic), were produced by averaging over 20 experiments; the error bars indicate one standard error. See Figures 8, 9, 10 in Appendix D for more synthetic results. The last panel shows the number of queries at different function values at each fidelity for the Hartmann-3D example.*

## 6 Experiments

We compare MF-GP-UCB to the following methods. Single fidelity methods: GP-UCB; EI, the expected improvement criterion for BO [13]; DiRect, the dividing rectangles method [12]. Multi-fidelity methods: MF-NAIVE, a naive baseline where we use GP-UCB to query at the first fidelity a large number of times and then query at the last fidelity at the points queried at $f^{(1)}$, in decreasing order of $f^{(1)}$-value; MF-SKO, the multi-fidelity sequential kriging method from [11]. Previous works on multi-fidelity methods (including MF-SKO) had not made their code available and were not straightforward to implement; hence, we could not compare to all of them. We discuss this further in Appendix D, along with some other single and multi-fidelity baselines we tried but excluded from the comparison to avoid clutter in the figures. In addition, we detail the design choices and hyper-parameters for all methods in Appendix D.
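Before turning to the results, here is a minimal sketch of the $\zeta$ and $\gamma$ adaptation rules from Section 5; the function names and state handling are hypothetical, and the released MATLAB implementation may differ:

```python
def update_zeta(zeta, f_m_xt, f_m1_xt):
    """If the observed gap |f^(m)(x_t) - f^(m-1)(x_t)| violates zeta, set zeta to twice the violation."""
    violation = abs(f_m_xt - f_m1_xt)
    return 2 * violation if violation > zeta else zeta

def update_gamma(gamma, m, steps_stuck_at_m, costs):
    """Double gamma^(m) if we have not queried above fidelity m for lambda^(m+1)/lambda^(m) iterations."""
    if steps_stuck_at_m > costs[m + 1] / costs[m]:
        gamma[m] *= 2.0
    return gamma
```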
Synthetic Examples: We use the Currin exponential ($d = 2$), Park ($d = 4$) and Borehole ($d = 8$) functions in $M = 2$ fidelity experiments, and the Hartmann functions in $d = 3$ and $d = 6$ with $M = 3$ and $M = 4$ fidelities respectively. The first three are taken from previous multi-fidelity literature [32], while we tweaked the Hartmann functions to obtain the lower fidelities for the latter two cases. We show the simple regret $S(\Lambda)$ against capital $\Lambda$ for the Borehole and Hartmann-3D functions in Fig. 3, with the rest deferred to Appendix D due to space constraints. MF-GP-UCB outperforms the other methods. Appendix D also contains results for the cumulative regret $R(\Lambda)$ and the formulae for these functions. A common occurrence with MF-NAIVE was that once we started querying at fidelity $M$, the regret barely decreased. The diagnosis in all cases was the same: it was stuck around the maximum of $f^{(1)}$, which is suboptimal for $f^{(M)}$. This suggests that while we have cheap approximations, the problem is by no means trivial. As explained previously, it is also important to explore at the higher fidelities to achieve good regret. The advantage of MF-GP-UCB over single fidelity methods is that it confines this exploration to a small set containing the optimum. In our experiments we found that MF-SKO did not consistently beat the other single fidelity methods. Despite our best efforts to reproduce this (and another) multi-fidelity method, we found them to be quite brittle (Appendix D.1).

The third panel of Fig. 3 shows a histogram of the number of queries at each fidelity after 184 queries of MF-GP-UCB, for different ranges of $f^{(3)}(x)$, for the Hartmann-3D function. Many of the queries at low $f^{(3)}$ values are at fidelity 1; as we move towards higher function values, these decrease and the second fidelity queries increase. The third fidelity dominates very close to the optimum but is used sparingly elsewhere. This corroborates the prediction in our analysis that MF-GP-UCB uses low fidelities to explore and successively higher fidelities in promising regions to zero in on $x_\star$. (Also see Fig. 6, Appendix B.)

*Figure 4: Results on the hyper-parameter tuning experiments (panels: SVM-2D, $M = 2$, $n_{tr} = [500, 2000]$; SALSA-6D, $M = 3$, $n_{tr} = [2000, 4000, 8000]$; V&J-22D, $M = 2$, $n_{tr} = [300, 3000]$; x-axis: CPU time (s); y-axis: CV error). The title of each panel states the experiment, the dimensionality (number of hyper-parameters) and the training set size at each fidelity. All curves were produced by averaging over 10 experiments; the error bars indicate one standard error. The lengths of the curves differ in time because we ran each method for a pre-specified number of iterations and they concluded at different times.*

Real Experiments: We present results on three hyper-parameter tuning tasks (Fig. 4) and a maximum likelihood inference task in astrophysics (Fig. 5). We compare methods on computation time, since that is the cost in all experiments; we include the processing time for each method in the comparison (i.e. the cost of determining the next query).

Classification using SVMs (SVM): We trained an SVM on the magic gamma dataset using the SMO algorithm to an accuracy of $10^{-12}$. The goal is to tune the kernel bandwidth and the soft margin coefficient in the ranges $(10^{-3}, 10^{1})$ and $(10^{-1}, 10^{5})$ respectively, on a dataset of size 2000.
We set this up as an $M = 2$ fidelity experiment with the entire training set at the second fidelity and 500 points at the first. Each query was 5-fold cross validation on these training sets.

Regression using Additive Kernels (SALSA): We used the regression method from [14] on the 4-dimensional coal power plant dataset. We tuned the 6 hyper-parameters (the regularisation penalty, the kernel scale, and the kernel bandwidth for each dimension), each in the range $(10^{-3}, 10^{4})$, using 5-fold cross validation. This experiment used $M = 3$ and 2000, 4000, 8000 points at each fidelity.

Viola & Jones face detection (V&J): The V&J classifier [31], which uses a cascade of weak classifiers, is a popular method for face detection. To classify an image, we pass it through each classifier; if at any point the classifier score falls below a threshold, the image is classified negative, and if it passes through the entire cascade, it is classified positive. One of the more popular implementations comes with OpenCV and uses a cascade of 22 weak classifiers. The threshold values in OpenCV are pre-set based on some heuristics, and there is no reason to think they are optimal for a given face detection task. The goal is to tune these 22 thresholds by optimising them over a training set. We modified the OpenCV implementation to take in the thresholds as parameters. As our domain $\mathcal{X}$ we chose a neighbourhood around the configuration used in OpenCV. We set this up as an $M = 2$ fidelity experiment where the second fidelity used 3000 images from the V&J face database and the first used 300. Interestingly, on an independent test set, the configurations found by MF-GP-UCB consistently achieved over 90% accuracy, while the OpenCV configuration achieved only 87.4% accuracy.

Type Ia Supernovae: We use Type Ia supernova data [7] for maximum likelihood inference on 3 cosmological parameters: the Hubble constant $H_0 \in (60, 80)$, and the dark matter and dark energy fractions $\Omega_M, \Omega_\Lambda \in (0, 1)$. Unlike typical parametric maximum likelihood problems, the likelihood is only available as a black box. It is computed using the Robertson-Walker metric, which requires a one-dimensional numerical integration for each sample in the dataset. We set this up as an $M = 3$ fidelity task. The goal is to maximise the likelihood at the third fidelity, where the integration was performed using the trapezoidal rule on a grid of size $10^6$. For the first and second fidelities, we used grids of size $10^2$ and $10^4$ respectively. The results are given in Fig. 5.

*Figure 5: Results on the supernova inference problem (Supernova-3D, $M = 3$, grid sizes $[10^2, 10^4, 10^6]$; x-axis: CPU time (s)). The y-axis is the log likelihood, so higher is better. MF-NAIVE is not visible as it performed very poorly.*

Conclusion: We introduced and studied the multi-fidelity bandit problem under Gaussian process assumptions. We present, to our knowledge, the first formalism of regret and the first theoretical results in this setting. They demonstrate that MF-GP-UCB explores the space via cheap lower fidelities and leverages the higher fidelities on successively smaller regions, hence achieving better regret than single fidelity strategies. Experimental results demonstrate the efficacy of our method.

## References

[1] Alekh Agarwal, John C. Duchi, Peter L. Bartlett, and Clement Levrard. Oracle inequalities for computationally budgeted model selection. In COLT, 2011.
[2] Peter Auer. Using confidence bounds for exploitation-exploration trade-offs. Journal of Machine Learning Research, 2003.
[3] E. Brochu, V. M. Cora, and N. de Freitas. A tutorial on Bayesian optimization of expensive cost functions, with application to active user modeling and hierarchical RL. CoRR, 2010.
[4] Sébastien Bubeck and Nicolò Cesa-Bianchi. Regret analysis of stochastic and nonstochastic multi-armed bandit problems. Foundations and Trends in Machine Learning, 2012.
[5] Mark Cutler, Thomas J. Walsh, and Jonathan P. How. Reinforcement learning with multi-fidelity simulators. In ICRA, 2014.
[6] V. Dani, T. P. Hayes, and S. M. Kakade. Stochastic linear optimization under bandit feedback. In COLT, 2008.
[7] T. M. Davis et al. Scrutinizing exotic cosmological models using ESSENCE supernova data combined with other cosmological probes. Astrophysical Journal, 2007.
[8] J. Djolonga, A. Krause, and V. Cevher. High-dimensional Gaussian process bandits. In NIPS, 2013.
[9] Alexander I. J. Forrester, András Sóbester, and Andy J. Keane. Multi-fidelity optimization via surrogate modelling. Proceedings of the Royal Society A: Mathematical, Physical and Engineering Science, 2007.
[10] Subhashis Ghosal and Anindya Roy. Posterior consistency of Gaussian process prior for nonparametric binary regression. Annals of Statistics, 2006.
[11] D. Huang, T. T. Allen, W. I. Notz, and R. A. Miller. Sequential kriging optimization using multiple-fidelity evaluations. Structural and Multidisciplinary Optimization, 2006.
[12] D. R. Jones, C. D. Perttunen, and B. E. Stuckman. Lipschitzian optimization without the Lipschitz constant. Journal of Optimization Theory and Applications, 1993.
[13] Donald R. Jones, Matthias Schonlau, and William J. Welch. Efficient global optimization of expensive black-box functions. Journal of Global Optimization, 1998.
[14] Kirthevasan Kandasamy and Yaoliang Yu. Additive approximations in high dimensional nonparametric regression via the SALSA. In ICML, 2016.
[15] Kirthevasan Kandasamy, Jeff Schneider, and Barnabás Póczos. High dimensional Bayesian optimisation and bandits via additive models. In ICML, 2015.
[16] Kirthevasan Kandasamy, Gautam Dasarathy, Jeff Schneider, and Barnabás Póczos. The multi-fidelity multi-armed bandit. In NIPS, 2016.
[17] K. Kawaguchi, L. P. Kaelbling, and T. Lozano-Pérez. Bayesian optimization with exponential convergence. In NIPS, 2015.
[18] S. Kirkpatrick, C. D. Gelatt, and M. P. Vecchi. Optimization by simulated annealing. Science, 1983.
[19] A. Klein, S. Bartels, S. Falkner, P. Hennig, and F. Hutter. Towards efficient Bayesian optimization for big data. In BayesOpt, 2015.
[20] R. Martinez-Cantin, N. de Freitas, A. Doucet, and J. Castellanos. Active policy learning for robot planning and exploration under uncertainty. In Proceedings of Robotics: Science and Systems, 2007.
[21] Jonas Mockus. Application of Bayesian approach to numerical methods of global and stochastic optimization. Journal of Global Optimization, 1994.
[22] R. Munos. Optimistic optimization of deterministic functions without the knowledge of its smoothness. In NIPS, 2011.
[23] D. Parkinson, P. Mukherjee, and A. R. Liddle. A Bayesian model selection analysis of WMAP3. Physical Review, 2006.
[24] C. E. Rasmussen and C. K. I. Williams. Gaussian Processes for Machine Learning. MIT Press, 2006.
[25] Herbert Robbins. Some aspects of the sequential design of experiments. Bulletin of the American Mathematical Society, 1952.
[26] A. Sabharwal, H. Samulowitz, and G. Tesauro. Selecting near-optimal learners via incremental data allocation. In AAAI, 2015.
[27] J. Snoek, H. Larochelle, and R. P. Adams. Practical Bayesian optimization of machine learning algorithms. In NIPS, 2012.
[28] Niranjan Srinivas, Andreas Krause, Sham Kakade, and Matthias Seeger. Gaussian process optimization in the bandit setting: No regret and experimental design. In ICML, 2010.
[29] Kevin Swersky, Jasper Snoek, and Ryan P. Adams. Multi-task Bayesian optimization. In NIPS, 2013.
[30] W. R. Thompson. On the likelihood that one unknown probability exceeds another in view of the evidence of two samples. Biometrika, 1933.
[31] Paul A. Viola and Michael J. Jones. Rapid object detection using a boosted cascade of simple features. In CVPR, 2001.
[32] Shifeng Xiong, Peter Z. G. Qian, and C. F. Jeff Wu. Sequential design and analysis of high-accuracy and low-accuracy computer codes. Technometrics, 2013.
[33] C. Zhang and K. Chaudhuri. Active learning from weak and strong labelers. In NIPS, 2015.