# Gaussian Process Bandit Optimisation with Multi-fidelity Evaluations

Kirthevasan Kandasamy, Gautam Dasarathy, Junier Oliva, Jeff Schneider, Barnabás Póczos
Carnegie Mellon University, Rice University
{kandasamy, joliva, schneide, bapoczos}@cs.cmu.edu, gautamd@rice.edu

## Abstract

In many scientific and engineering applications, we are tasked with the optimisation of an expensive-to-evaluate black-box function $f$. Traditional methods for this problem assume just the availability of this single function. However, in many cases, cheap approximations to $f$ may be obtainable. For example, the expensive real-world behaviour of a robot can be approximated by a cheap computer simulation. We can use these approximations to eliminate low function value regions cheaply, reserve the expensive evaluations of $f$ for a small but promising region, and speedily identify the optimum. We formalise this task as a multi-fidelity bandit problem where the target function and its approximations are sampled from a Gaussian process. We develop MF-GP-UCB, a novel method based on upper confidence bound techniques. Our theoretical analysis demonstrates that it exhibits precisely the above behaviour and achieves better regret than strategies which ignore multi-fidelity information. MF-GP-UCB outperforms such naive strategies and other multi-fidelity methods on several synthetic and real experiments.

## 1 Introduction

In stochastic bandit optimisation, we wish to optimise a payoff function $f : \mathcal{X} \to \mathbb{R}$ by sequentially querying it and obtaining bandit feedback, i.e. when we query at any $x \in \mathcal{X}$, we observe a possibly noisy evaluation of $f(x)$. $f$ is typically expensive, and the goal is to identify its maximum while keeping the number of queries as low as possible. Some applications are hyper-parameter tuning in expensive machine learning algorithms, optimal policy search in complex systems, and scientific experiments [20, 23, 27]. Historically, bandit problems were studied in settings where the goal is to maximise the cumulative reward of all queries to the payoff instead of just finding the maximum; applications in this setting include clinical trials and online advertising. Conventional methods in these settings assume access to only this single expensive function of interest $f$. We will collectively refer to them as single fidelity methods.

In many practical problems, however, cheap approximations to $f$ might be available. For instance, when tuning hyper-parameters of learning algorithms, the goal is to maximise a cross validation (CV) score on a training set, which can be expensive if the training set is large. However, CV curves tend to vary smoothly with training set size; therefore, we can train and cross-validate on small subsets to approximate the CV accuracy on the entire dataset. For a concrete example, consider kernel density estimation (KDE), where we need to tune the bandwidth $h$ of a kernel. Figure 1 shows the CV likelihood against $h$ for a dataset of size $n = 3000$ and a smaller subset of size $n = 300$. The two maximisers are different, which is to be expected since optimal hyper-parameters are functions of the training set size. That said, the curve for $n = 300$ approximates the $n = 3000$ curve quite well. Since training/CV on small $n$ is cheap, we can use it to eliminate bad values of the hyper-parameters and reserve the expensive experiments with the entire dataset for the promising candidates (e.g. the boxed region in Fig. 1).
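As an aside, such a two-fidelity CV objective is easy to construct in practice. Below is a minimal scikit-learn sketch for the KDE example, assuming a synthetic Gaussian dataset; the data, sizes and bandwidth are illustrative and this is not the paper's implementation:

```python
import numpy as np
from sklearn.neighbors import KernelDensity
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
data = rng.standard_normal((3000, 1))   # full dataset, n = 3000 (expensive fidelity)
subset = data[:300]                     # small subset, n = 300 (cheap fidelity)

def cv_log_likelihood(h, X):
    """Average 5-fold CV log-likelihood of a KDE with bandwidth h."""
    return cross_val_score(KernelDensity(bandwidth=h), X, cv=5).mean()

f_low = lambda h: cv_log_likelihood(h, subset)   # cheap approximation f^(1)
f_high = lambda h: cv_log_likelihood(h, data)    # expensive payoff f^(2)
print(f_low(0.3), f_high(0.3))
```

Querying `f_low` over many bandwidths and `f_high` only in the promising region mirrors the strategy sketched above.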
In online advertising, the goal is to maximise the cumulative number of clicks over a given period. In the conventional bandit treatment, each query to $f$ is the display of an ad for a specific time, say one hour. However, we may display ads for shorter intervals, say a few minutes, to approximate an ad's hourly performance. The estimate is biased, as displaying an ad for a longer interval changes user behaviour, but it is nonetheless useful in gauging the long-run click-through rate. In optimal policy search in robotics and automated driving, vastly cheaper computer simulations are used to approximate the expensive real-world performance of the system. Scientific experiments can be approximated to varying degrees using less expensive data collection, analysis, and computational techniques.

In this paper, we cast these tasks as multi-fidelity bandit optimisation problems, assuming the availability of cheap approximate functions (fidelities) to the payoff $f$. Our contributions are:

1. We present a formalism for multi-fidelity bandit optimisation using Gaussian process (GP) assumptions on $f$ and its approximations. We develop a novel algorithm, Multi-Fidelity Gaussian Process Upper Confidence Bound (MF-GP-UCB), for this setting.
2. Our theoretical analysis proves that MF-GP-UCB explores the space at lower fidelities and uses the high fidelities in successively smaller regions to zero in on the optimum. As lower fidelity queries are cheaper, MF-GP-UCB has better regret than single fidelity strategies.
3. Empirically, we demonstrate that MF-GP-UCB outperforms single fidelity methods on a series of synthetic examples, three hyper-parameter tuning tasks and one inference problem in astrophysics. Our MATLAB implementation and experiments are available at github.com/kirthevasank/mf-gp-ucb.

Related Work: Since the seminal work by Robbins [25], the multi-armed bandit problem has been studied extensively in the $K$-armed setting. Recently, there has been a surge of interest in the optimism under uncertainty principle for $K$-armed bandits, typified by upper confidence bound (UCB) methods [2, 4]. UCB strategies have also been used in bandit tasks with linear [6] and GP [28] payoffs. There is a plethora of work on single fidelity methods for global optimisation, both with noisy and noiseless evaluations. Some examples are branch-and-bound techniques such as dividing rectangles (DiRect) [12], simulated annealing, genetic algorithms and more [17, 18, 22]. A suite of single fidelity methods in the GP framework closely related to our work is Bayesian optimisation (BO). While there are several techniques for BO [13, 21, 30], of particular interest to us is the Gaussian process upper confidence bound (GP-UCB) algorithm of Srinivas et al. [28].

Many applied domains of research, such as aerodynamics, industrial design and hyper-parameter tuning, have studied multi-fidelity methods [9, 11, 19, 29]; a plurality of them use BO techniques. However, none of these treatments formalise or analyse any notion of regret in the multi-fidelity setting. In contrast, MF-GP-UCB is an intuitive UCB idea with good theoretical properties. Some works have analysed multi-fidelity methods in specific contexts such as hyper-parameter tuning, active learning and reinforcement learning [1, 5, 26, 33]. Their settings and assumptions are substantially different from ours.
Critically, none of them are in the more difficult bandit setting where there is a price for exploration. Due to space constraints, we discuss them in detail in Appendix A.3. The multi-fidelity setting poses substantially new theoretical and algorithmic challenges. We build on GP-UCB and our recent work on multi-fidelity bandits in the $K$-armed setting [16]. Section 2 presents our formalism, including a notion of regret for multi-fidelity GP bandits. Section 3 presents our algorithm. The theoretical analysis is in Appendix C, with a synopsis for the 2-fidelity case in Section 4. Section 6 presents our experiments. Appendix A.1 tabulates the notation used in the manuscript.

## 2 Preliminaries

We wish to maximise a payoff function $f : \mathcal{X} \to \mathbb{R}$, where $\mathcal{X} = [0, r]^d$. We can interact with $f$ only by querying at some $x \in \mathcal{X}$ and obtaining a noisy observation $y = f(x) + \epsilon$. Let $x_\star \in \mathrm{argmax}_{x \in \mathcal{X}} f(x)$ and $f_\star = f(x_\star)$. Let $x_t \in \mathcal{X}$ be the point queried at time $t$. The goal of a bandit strategy is to maximise the sum of rewards $\sum_{t=1}^{n} f(x_t)$, or equivalently to minimise the cumulative regret $\sum_{t=1}^{n} \big(f_\star - f(x_t)\big)$ after $n$ queries; i.e. we compete against an oracle which queries at $x_\star$ at all $t$.

Our primary distinction from the classical setting is that we have access to $M - 1$ successively accurate approximations $f^{(1)}, f^{(2)}, \dots, f^{(M-1)}$ to the payoff $f = f^{(M)}$. We refer to these approximations as fidelities. We encode the fact that fidelity $m$ approximates fidelity $M$ via the assumption $\|f^{(M)} - f^{(m)}\|_\infty \le \zeta^{(m)}$, where $\zeta^{(1)} > \zeta^{(2)} > \dots > \zeta^{(M)} = 0$. Each query at fidelity $m$ expends a cost $\lambda^{(m)}$ of a resource, e.g. computational effort or advertising time, where $\lambda^{(1)} < \lambda^{(2)} < \dots < \lambda^{(M)}$. A strategy for multi-fidelity bandits is a sequence of query-fidelity pairs $\{(x_t, m_t)\}_{t \ge 0}$, where $(x_n, m_n)$ could depend on the previous query-observation-fidelity tuples $\{(x_t, y_t, m_t)\}_{t=1}^{n-1}$. Here $y_t = f^{(m_t)}(x_t) + \epsilon$. After $n$ steps, we may have queried each of the $M$ fidelities multiple times.

*Figure 1: Left: average CV log-likelihood on datasets of size 300 and 3000 on a synthetic KDE task; the crosses mark the maxima. Right: illustration of GP-UCB at time $t$, showing $f(x)$ (solid black line), the UCB $\varphi_t(x)$ (dashed blue line) and the queries until $t - 1$ (black crosses); we query at $x_t = \mathrm{argmax}_{x \in \mathcal{X}} \varphi_t(x)$ (red star).*

Some smoothness assumptions on the $f^{(m)}$'s are needed to make the problem tractable. A standard in the Bayesian nonparametric literature is to use a Gaussian process (GP) prior [24] with covariance kernel $\kappa$. In this work we focus on the squared exponential (SE) kernel $\kappa_{\sigma,h}$ and the Matérn kernel $\kappa_{\nu,h}$, as they are popularly used in practice and their theoretical properties are well studied. Writing $z = \|x - x'\|_2$, they are defined as
$$\kappa_{\sigma,h}(x, x') = \sigma \exp\!\Big(\frac{-z^2}{2h^2}\Big), \qquad \kappa_{\nu,h}(x, x') = \frac{2^{1-\nu}}{\Gamma(\nu)} \Big(\frac{\sqrt{2\nu}\, z}{h}\Big)^{\nu} B_\nu\!\Big(\frac{\sqrt{2\nu}\, z}{h}\Big),$$
where $\Gamma, B_\nu$ are the Gamma and modified Bessel functions. A convenience the GP framework offers is that posterior distributions are analytically tractable. If $f \sim GP(0, \kappa)$ and we have observations $\mathcal{D}_n = \{(x_i, y_i)\}_{i=1}^{n}$, where $y_i = f(x_i) + \epsilon$ and $\epsilon \sim \mathcal{N}(0, \eta^2)$ is Gaussian noise, then the posterior distribution for $f(x) \mid \mathcal{D}_n$ is also Gaussian, $\mathcal{N}(\mu_n(x), \sigma_n^2(x))$, with
$$\mu_n(x) = k^\top \Delta^{-1} Y, \qquad \sigma_n^2(x) = \kappa(x, x) - k^\top \Delta^{-1} k. \quad (1)$$
Here $Y \in \mathbb{R}^n$ with $Y_i = y_i$, $k \in \mathbb{R}^n$ with $k_i = \kappa(x, x_i)$, and $\Delta = K + \eta^2 I \in \mathbb{R}^{n \times n}$ where $K_{i,j} = \kappa(x_i, x_j)$. In keeping with the above, we make the following assumptions on our problem.

Assumption 1. A1: The functions at all fidelities are sampled from GPs, $f^{(m)} \sim GP(0, \kappa)$ for all $m = 1, \dots, M$. A2: $\|f^{(M)} - f^{(m)}\|_\infty \le \zeta^{(m)}$ for all $m = 1, \dots, M$. A3: $\|f^{(M)}\|_\infty \le B$. The purpose of A3 is primarily to define the regret.
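For readers who want to see (1) in code, here is a minimal NumPy sketch of the GP posterior update, assuming an SE kernel; it is illustrative only and not the paper's MATLAB implementation:

```python
import numpy as np

def se_kernel(X1, X2, sigma=1.0, h=0.5):
    """SE kernel: kappa(x, x') = sigma * exp(-||x - x'||^2 / (2 h^2))."""
    sq = np.sum((X1[:, None, :] - X2[None, :, :]) ** 2, axis=-1)
    return sigma * np.exp(-sq / (2 * h ** 2))

def gp_posterior(X, y, Xstar, eta=0.1):
    """Posterior mean and variance of Eq. (1) at test points Xstar, given data (X, y)."""
    Delta = se_kernel(X, X) + eta ** 2 * np.eye(len(X))   # Delta = K + eta^2 I
    k = se_kernel(X, Xstar)                               # column j holds kappa(x_i, x*_j)
    mu = k.T @ np.linalg.solve(Delta, y)                  # k^T Delta^{-1} Y
    var = se_kernel(Xstar, Xstar).diagonal() - np.sum(k * np.linalg.solve(Delta, k), axis=0)
    return mu, var

# Toy usage: 5 noisy observations of sin on [0, 3], posterior on a 50-point grid.
X = np.linspace(0, 3, 5).reshape(-1, 1)
y = np.sin(X).ravel() + 0.1 * np.random.default_rng(0).standard_normal(5)
mu, var = gp_posterior(X, y, np.linspace(0, 3, 50).reshape(-1, 1))
```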
In Remark 7, Appendix A.4, we argue that these assumptions are probabilistically valid, i.e. the latter two events occur with nontrivial probability when we sample the $f^{(m)}$'s from a GP. So a generative mechanism would keep sampling the functions and deliver them when the conditions hold true. A point $x \in \mathcal{X}$ can be queried at any of the $M$ fidelities. When we query at fidelity $m$, we observe $y = f^{(m)}(x) + \epsilon$, where $\epsilon \sim \mathcal{N}(0, \eta^2)$.

We now present our notion of cumulative regret $R(\Lambda)$ after spending capital $\Lambda$ of a resource in the multi-fidelity setting. $R(\Lambda)$ should reduce to the conventional definition of regret for any single fidelity strategy that queries only at the $M$th fidelity. As only the optimum of $f = f^{(M)}$ is of interest to us, queries at fidelities less than $M$ should yield the lowest possible reward, $-B$ according to A3. Accordingly, we set the instantaneous reward $q_t$ at time $t$ to be $-B$ if $m_t \neq M$ and $f^{(M)}(x_t)$ if $m_t = M$. If we let $r_t = f_\star - q_t$ denote the instantaneous regret, we have $r_t = f_\star + B$ if $m_t \neq M$ and $r_t = f_\star - f^{(M)}(x_t)$ if $m_t = M$. $R(\Lambda)$ should also factor in the cost of each query's fidelity. Finally, we should also receive the $-B$ reward for any unused capital. Accordingly, we define $R(\Lambda)$ as
$$R(\Lambda) = \Lambda f_\star - \bigg[\sum_{t=1}^{N} \lambda^{(m_t)} q_t + \Lambda_{\mathrm{res}} \cdot (-B)\bigg] = \sum_{t=1}^{N} \lambda^{(m_t)} r_t + \Lambda_{\mathrm{res}} (f_\star + B), \quad (2)$$
where $\Lambda_{\mathrm{res}} = \Lambda - \sum_{t=1}^{N} \lambda^{(m_t)}$ is the residual capital. Here, $N$ is the (random) number of queries at all fidelities within capital $\Lambda$, i.e. the largest $n$ such that $\sum_{t=1}^{n} \lambda^{(m_t)} \le \Lambda$. According to (2), we wish to compete against an oracle that uses all its capital $\Lambda$ to query $x_\star$ at the $M$th fidelity. $R(\Lambda)$ is at best 0, when we follow the oracle, and at most $2\Lambda B$. Our goal is a strategy that has small regret for all values of (sufficiently large) $\Lambda$, i.e. the equivalent of an anytime strategy, as opposed to a fixed time horizon strategy in the usual bandit setting. For the purpose of optimisation, we also define the simple regret $S(\Lambda) = \min_t r_t = f_\star - \max_t q_t$. $S(\Lambda)$ is the difference between $f_\star$ and the best highest-fidelity query (and $f_\star + B$ if we have never queried at fidelity $M$). Since $S(\Lambda) \le \frac{1}{\Lambda} R(\Lambda)$, any strategy with asymptotically sublinear regret, $\lim_{\Lambda \to \infty} \frac{1}{\Lambda} R(\Lambda) = 0$, also has vanishing simple regret.

Since, to our knowledge, this is the first attempt to formalise regret for multi-fidelity problems, the definition of $R(\Lambda)$ in (2) necessitates justification. Consider a two-fidelity robot gold mining problem where the second fidelity is a real-world robot trial costing $\lambda^{(2)}$ dollars, and the first fidelity is a computer simulation costing $\lambda^{(1)}$. A multi-fidelity algorithm queries the simulator to learn about the real world. But it does not collect any actual gold during a simulation; hence no reward, which according to our assumptions is $-B$. Meanwhile, the oracle is investing this capital in the best experiment and collecting $f_\star$ gold. Therefore, the regret at this time instant is $f_\star + B$. However, we weight this by the cost to account for the fact that the simulation costs only $\lambda^{(1)}$. Note that lower fidelities use up capital but yield the lowest reward. The goal, however, is to leverage information from these cheap queries to query prudently at the highest fidelity and obtain better regret. That said, other multi-fidelity settings might require different definitions of $R(\Lambda)$. In online advertising, the lower fidelities (displaying ads for shorter periods) would still yield rewards. In clinical trials, the regret at the highest fidelity due to a bad treatment would be, say, a dead patient, whereas a bad treatment in a simulation may not warrant a large penalty.
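To make (2) concrete, here is a toy computation of $R(\Lambda)$ under purely hypothetical values: a two-fidelity problem with $f_\star = 1$, $B = 2$ and costs $\lambda^{(1)} = 1$, $\lambda^{(2)} = 10$.

```python
def cumulative_regret(queries, Lam, f_star, B, costs):
    """Multi-fidelity cumulative regret, Eq. (2).
    queries: list of (fidelity m_t, reward q_t), with q_t = -B whenever m_t < M."""
    spent = sum(costs[m] for m, _ in queries)
    assert spent <= Lam, "queries must fit within the capital"
    total_reward = sum(costs[m] * q for m, q in queries) + (Lam - spent) * (-B)
    return Lam * f_star - total_reward

# Hypothetical example: two cheap simulator probes (reward -B), one good top-fidelity query.
costs = {1: 1.0, 2: 10.0}                     # lambda^(1), lambda^(2)
queries = [(1, -2.0), (1, -2.0), (2, 0.9)]
print(cumulative_regret(queries, Lam=15.0, f_star=1.0, B=2.0, costs=costs))  # -> 16.0
```

The two simulator queries contribute $\lambda^{(1)}(f_\star + B)$ each and the top-fidelity query contributes $\lambda^{(2)}(f_\star - 0.9)$, with the unused capital penalised at $f_\star + B$, matching the second form of (2).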
We use the definition in (2) because it is more aligned with our optimisation experiments: lower fidelities are useful to the extent that they guide the search on the expensive $f^{(M)}$, but there is no reward for finding the optimum of a cheap $f^{(m)}$. A crucial challenge for a multi-fidelity method is to not get stuck at the optimum of a lower fidelity, which is typically suboptimal for $f^{(M)}$. While exploiting information from the lower fidelities, it is also important to explore sufficiently at $f^{(M)}$. In our experiments we demonstrate that naive strategies which do not do so get stuck at the optimum of a lower fidelity.

A note on GP-UCB: Sequential optimisation methods adopting UCB principles maintain a high-probability upper bound $\varphi_t : \mathcal{X} \to \mathbb{R}$ for $f(x)$ for all $x \in \mathcal{X}$ [2]. For GP-UCB, $\varphi_t$ takes the form $\varphi_t(x) = \mu_{t-1}(x) + \beta_t^{1/2} \sigma_{t-1}(x)$, where $\mu_{t-1}, \sigma_{t-1}$ are the posterior mean and standard deviation of the GP conditioned on the previous $t - 1$ queries. The key intuition is that the mean $\mu_{t-1}$ encourages an exploitative strategy, in that we want to query where we know the function is high, while the confidence band $\beta_t^{1/2} \sigma_{t-1}$ encourages an explorative strategy, in that we want to query regions where we are uncertain about $f$ lest we miss out on high-valued regions. We have illustrated GP-UCB in Fig. 1 and reviewed the algorithm and its theoretical properties in Appendix A.2.

## 3 MF-GP-UCB

The proposed algorithm, MF-GP-UCB, will also maintain a UCB for $f^{(M)}$, obtained via the previous queries at all fidelities. Denote the posterior GP mean and standard deviation of $f^{(m)}$, conditioned only on the previous queries at fidelity $m$, by $\mu_t^{(m)}, \sigma_t^{(m)}$ respectively (see (1)). Then define
$$\varphi_t^{(m)}(x) = \mu_{t-1}^{(m)}(x) + \beta_t^{1/2} \sigma_{t-1}^{(m)}(x) + \zeta^{(m)} \;\; \forall m, \qquad \varphi_t(x) = \min_{m=1,\dots,M} \varphi_t^{(m)}(x). \quad (3)$$
For appropriately chosen $\beta_t$, $\mu_{t-1}^{(m)}(x) + \beta_t^{1/2} \sigma_{t-1}^{(m)}(x)$ will upper bound $f^{(m)}(x)$ with high probability. By A2, $\varphi_t^{(m)}(x)$ upper bounds $f^{(M)}(x)$ for all $m$. We have $M$ such upper bounds, and their minimum $\varphi_t(x)$ gives the best bound. Our next query is at the maximiser of this UCB, $x_t = \mathrm{argmax}_{x \in \mathcal{X}} \varphi_t(x)$.

Next we need to decide which fidelity to query at. Consider any $m < M$. The $\zeta^{(m)}$ conditions on $f^{(m)}$ constrain the value of $f^{(M)}$: the confidence band $\beta_t^{1/2} \sigma_{t-1}^{(m)}$ for $f^{(m)}$ is lengthened by $\zeta^{(m)}$ to obtain confidence on $f^{(M)}$. If $\beta_t^{1/2} \sigma_{t-1}^{(m)}(x_t)$ for $f^{(m)}$ is large, it means that we have not constrained $f^{(m)}$ sufficiently well at $x_t$ and should query at the $m$th fidelity. On the other hand, querying indefinitely in the same region to reduce $\beta_t^{1/2} \sigma_{t-1}^{(m)}$ in that region will not help us much, as the $\zeta^{(m)}$ elongation caps off how much we can learn about $f^{(M)}$ from $f^{(m)}$; i.e. even if we knew $f^{(m)}$ perfectly, we will only have constrained $f^{(M)}$ to within a $\zeta^{(m)}$ band. Our algorithm captures this simple intuition. Having selected $x_t$, we begin by checking at the first fidelity. If $\beta_t^{1/2} \sigma_{t-1}^{(1)}(x_t)$ is smaller than a threshold $\gamma^{(1)}$, we proceed to the second fidelity. If at any stage $\beta_t^{1/2} \sigma_{t-1}^{(m)}(x_t) \ge \gamma^{(m)}$, we query at fidelity $m_t = m$. If we proceed all the way to fidelity $M$, we query at $m_t = M$. We will discuss choices for $\gamma^{(m)}$ shortly. We summarise the resulting procedure in Algorithm 1; a code sketch of the selection rules is given just below.

Fig. 2 illustrates MF-GP-UCB on a 2-fidelity problem. Initially, MF-GP-UCB mostly explores $\mathcal{X}$ at the first fidelity: $\beta_t^{1/2} \sigma_{t-1}^{(1)}$ is large, and we are yet to constrain $f^{(1)}$ well enough to proceed to $f^{(2)}$. By $t = 14$, we have constrained $f^{(1)}$ around the optimum and have started querying at $f^{(2)}$ in this region.
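Before the formal pseudo-code in Algorithm 1, the following is a minimal Python sketch of one step of this procedure: the UCB maximisation in (3) over a finite candidate grid (standing in for continuous optimisation), followed by the fidelity selection rule. The `posteriors[m]` interface, which returns $(\mu_{t-1}^{(m)}, \sigma_{t-1}^{(m)})$ on a set of points, is an assumed abstraction, not the paper's code:

```python
import numpy as np

def next_query(candidates, posteriors, zeta, gamma, beta_t):
    """One step of MF-GP-UCB over a finite candidate grid.
    posteriors[m](X) -> (mu, sigma) arrays for fidelity m (assumed interface);
    zeta, gamma hold zeta^(m) and gamma^(m) (zeta for the top fidelity is 0)."""
    M = len(posteriors)
    stats = [posteriors[m](candidates) for m in range(M)]
    # phi^(m)_t = mu^(m) + sqrt(beta_t) * sigma^(m) + zeta^(m); phi_t is their minimum (Eq. 3).
    ucbs = np.stack([mu + np.sqrt(beta_t) * sig + zeta[m] for m, (mu, sig) in enumerate(stats)])
    i = int(np.argmax(ucbs.min(axis=0)))          # x_t = argmax_x phi_t(x)
    # Lowest fidelity whose confidence band at x_t is still wide; else the top fidelity.
    m_t = next((m for m in range(M - 1) if np.sqrt(beta_t) * stats[m][1][i] >= gamma[m]), M - 1)
    return candidates[i], m_t
```

Steps 3 and 4 of Algorithm 1, querying $f^{(m_t)}$ at $x_t$ and updating that fidelity's posterior, would complete the loop.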
Algorithm 1: MF-GP-UCB

Inputs: kernel $\kappa$, bounds $\{\zeta^{(m)}\}_{m=1}^{M}$, thresholds $\{\gamma^{(m)}\}_{m=1}^{M}$.
For $m = 1, \dots, M$: $\mathcal{D}_0^{(m)} \leftarrow \varnothing$, $(\mu_0^{(m)}, \sigma_0^{(m)}) \leftarrow (\mathbf{0}, \kappa^{1/2})$.
for $t = 1, 2, \dots$

1. $x_t \leftarrow \mathrm{argmax}_{x \in \mathcal{X}} \varphi_t(x)$. (See Equation (3).)
2. $m_t \leftarrow \min \{ m \mid \beta_t^{1/2} \sigma_{t-1}^{(m)}(x_t) \ge \gamma^{(m)} \text{ or } m = M \}$. (See Appendices B, C for $\beta_t$.)
3. $y_t \leftarrow$ query $f^{(m_t)}$ at $x_t$.
4. Update $\mathcal{D}_t^{(m_t)} \leftarrow \mathcal{D}_{t-1}^{(m_t)} \cup \{(x_t, y_t)\}$. Obtain $\mu_t^{(m_t)}, \sigma_t^{(m_t)}$ conditioned on $\mathcal{D}_t^{(m_t)}$. (See (1).)

*Figure 2: Illustration of MF-GP-UCB for a 2-fidelity problem initialised with 5 random points at the first fidelity. In the top figures, the solid lines in brown and blue are $f^{(1)}, f^{(2)}$ respectively, and the dashed lines are $\varphi_t^{(1)}, \varphi_t^{(2)}$. The solid green line is $\varphi_t = \min(\varphi_t^{(1)}, \varphi_t^{(2)})$. The small crosses are queries from 1 to $t - 1$ and the red star is the maximiser of $\varphi_t$, i.e. the next query $x_t$. $x_\star$, the optimum of $f^{(2)}$, is shown in magenta. In the bottom figures, the solid orange line is $\beta_t^{1/2} \sigma_{t-1}^{(1)}$ and the dashed black line is $\gamma^{(1)}$. When $\beta_t^{1/2} \sigma_{t-1}^{(1)}(x_t) \le \gamma^{(1)}$ we play at fidelity $m_t = 2$, and otherwise at $m_t = 1$. See Fig. 6 in Appendix B for an extended simulation.*

Notice how $\varphi_t^{(2)}$ dips to change $\varphi_t$ in this region. MF-GP-UCB has identified the maximum with just 3 queries to $f^{(2)}$. In Appendix B we provide an extended simulation and discuss further insights. Finally, we make an essential observation. The posterior for any $f^{(m)}(x)$ conditioned on previous queries at all fidelities is not Gaussian, due to the $\zeta^{(m)}$ constraints (A2). However, $|f^{(m)}(x) - \mu_{t-1}^{(m)}(x)| < \beta_t^{1/2} \sigma_{t-1}^{(m)}(x)$ holds with high probability, since, by conditioning only on queries at the $m$th fidelity, we have Gaussianity for $f^{(m)}(x)$. Next we summarise our main theoretical contributions.

## 4 Summary of Theoretical Results

For pedagogical reasons we present our results for the $M = 2$ case; Appendix C contains statements and proofs for general $M$. We also ignore constants and polylog terms when they are dominated by other terms. $\lesssim$ and $\asymp$ denote inequality and equality ignoring constants. We begin by defining the maximum information gain (MIG), which characterises the statistical difficulty of GP bandits.

Definition 2 (Maximum Information Gain). Let $f \sim GP(0, \kappa)$. Consider any $A \subset \mathbb{R}^d$ and let $\tilde{A} = \{x_1, \dots, x_n\} \subset A$ be a finite subset. Let $f_{\tilde{A}}, \epsilon_{\tilde{A}} \in \mathbb{R}^n$ be such that $(f_{\tilde{A}})_i = f(x_i)$, $(\epsilon_{\tilde{A}})_i \sim \mathcal{N}(0, \eta^2)$, and $y_{\tilde{A}} = f_{\tilde{A}} + \epsilon_{\tilde{A}}$. Let $I$ denote the Shannon mutual information. The maximum information gain of $A$ is $\Psi_n(A) = \max_{\tilde{A} \subset A, |\tilde{A}| = n} I(y_{\tilde{A}}; f_{\tilde{A}})$.

The MIG, which depends on the kernel $\kappa$ and the set $A$, is an important quantity in our analysis. For a given $\kappa$, it typically scales with the volume of $A$: if $A = [0, r]^d$, then $\Psi_n(A) \in O(r^d \Psi_n([0, 1]^d))$. For the SE kernel, $\Psi_n([0, 1]^d) \in O((\log n)^{d+1})$, and for the Matérn kernel, $\Psi_n([0, 1]^d) \in O\big(n^{\frac{d(d+1)}{2\nu + d(d+1)}}\big)$ [28].

Recall that $N$ is the (random) number of queries by a multi-fidelity strategy within capital $\Lambda$ at either fidelity. Let $n_\Lambda = \Lambda / \lambda^{(2)}$ be the (non-random) number of queries by a single fidelity method operating only at the second fidelity. As $\lambda^{(1)} < \lambda^{(2)}$, $N$ could be large for an arbitrary multi-fidelity method. However, our analysis reveals that for MF-GP-UCB, $N$ is on the order of $n_\Lambda$. Fundamental to the 2-fidelity problem is the set $\mathcal{X}_g = \{x \in \mathcal{X} : f_\star - f^{(1)}(x) \le \zeta^{(1)}\}$. $\mathcal{X}_g$ is a high-valued region for $f^{(2)}$: for all $x \in \mathcal{X}_g$, $f^{(2)}(x)$ is at most $2\zeta^{(1)}$ away from the optimum. More interestingly, when $\zeta^{(1)}$ is small, i.e. when $f^{(1)}$ is a good approximation to $f^{(2)}$, $\mathcal{X}_g$ will be much smaller than $\mathcal{X}$.
This is precisely the target domain for this research. For instance, in the robot gold mining example, a cheap computer simulator can be used to eliminate several bad policies, and we can reserve the real-world trials for the promising candidates. If a multi-fidelity strategy were to use its second fidelity queries only in $\mathcal{X}_g$, then the regret would only have $\Psi_n(\mathcal{X}_g)$ dependence after $n$ high fidelity queries. In contrast, a strategy that operates only at the highest fidelity (e.g. GP-UCB) will have $\Psi_n(\mathcal{X})$ dependence. In the scenario described above, $\Psi_n(\mathcal{X}_g) \ll \Psi_n(\mathcal{X})$, and the multi-fidelity strategy will have significantly better regret than a single fidelity strategy. MF-GP-UCB roughly achieves this goal. In particular, we consider a slightly inflated version of $\mathcal{X}_g$, namely $\widetilde{\mathcal{X}}_{g,\rho} = \{x \in \mathcal{X} : f_\star - f^{(1)}(x) \le \zeta^{(1)} + \rho \gamma^{(1)}\}$, where $\rho > 0$. The following result, which characterises the regret of MF-GP-UCB in terms of $\widetilde{\mathcal{X}}_{g,\rho}$, is the main theorem of this paper.

Theorem 3 (Regret of MF-GP-UCB, informal). Let $\mathcal{X} = [0, r]^d$ and let $f^{(1)}, f^{(2)} \sim GP(0, \kappa)$ satisfy Assumption 1. Pick $\delta \in (0, 1)$ and run MF-GP-UCB with $\beta_t \asymp d \log(t/\delta)$. Then, with probability $> 1 - \delta$, for sufficiently large $\Lambda$ and for all $\alpha \in (0, 1)$, there exists $\rho$ depending on $\alpha$ such that
$$R(\Lambda) \lesssim \lambda^{(2)} \sqrt{n_\Lambda \beta_{n_\Lambda} \Psi_{n_\Lambda}(\widetilde{\mathcal{X}}_{g,\rho})} + \lambda^{(1)} \sqrt{n_\Lambda \beta_{n_\Lambda} \Psi_{n_\Lambda}(\mathcal{X})} + \lambda^{(2)} \sqrt{n_\Lambda^{\alpha} \beta_{n_\Lambda} \Psi_{n_\Lambda^{\alpha}}(\mathcal{X})} + \lambda^{(1)} \xi_{n, \widetilde{\mathcal{X}}_{g,\rho}, \gamma^{(1)}},$$
where the term $\xi_{n, \widetilde{\mathcal{X}}_{g,\rho}, \gamma^{(1)}}$ is made precise in Appendix C.

As we will explain shortly, the latter two terms are of lower order. It is instructive to compare the above rates against that of GP-UCB (see Theorem 4, Appendix A.2). Dropping the common and subdominant terms, the rate for MF-GP-UCB is $\lambda^{(2)} \Psi_{n_\Lambda}^{1/2}(\widetilde{\mathcal{X}}_{g,\rho}) + \lambda^{(1)} \Psi_{n_\Lambda}^{1/2}(\mathcal{X})$, whereas for GP-UCB it is $\lambda^{(2)} \Psi_{n_\Lambda}^{1/2}(\mathcal{X})$. When $\lambda^{(1)} \ll \lambda^{(2)}$ and $\mathrm{vol}(\widetilde{\mathcal{X}}_{g,\rho}) \ll \mathrm{vol}(\mathcal{X})$, the rates for MF-GP-UCB are very appealing. When the approximation worsens ($\mathcal{X}_g, \widetilde{\mathcal{X}}_{g,\rho}$ become larger) and the costs $\lambda^{(1)}, \lambda^{(2)}$ become comparable, the bound for MF-GP-UCB decays gracefully. In the worst case, MF-GP-UCB is never worse than GP-UCB up to constant terms. Intuitively, the above result states that MF-GP-UCB explores the entire $\mathcal{X}$ using $f^{(1)}$ but uses most of its queries to $f^{(2)}$ inside $\widetilde{\mathcal{X}}_{g,\rho}$.

Now let us turn to the latter two terms in the bound. The third term is the regret due to the second fidelity queries outside $\widetilde{\mathcal{X}}_{g,\rho}$. We are able to show that the number of such queries is $O(n_\Lambda^{\alpha})$ for all $\alpha > 0$ for an appropriate $\rho$. This strong result is only possible in the multi-fidelity setting. For example, in GP-UCB, the best achievable bound on the number of plays in a suboptimal set is $O(n_\Lambda^{1/2})$ for the SE kernel, and worse for the Matérn kernel. The last term is due to the first fidelity plays inside $\widetilde{\mathcal{X}}_{g,\rho}$; it scales with $\mathrm{vol}(\widetilde{\mathcal{X}}_{g,\rho})$ and polylogarithmically with $n$, both of which are small. However, it has a $1/\mathrm{poly}(\gamma^{(1)})$ dependence, which could be bad if $\gamma^{(1)}$ is too small: intuitively, if $\gamma^{(1)}$ is too small, we will wait a long time in step 2 of Algorithm 1 for $\beta_t^{1/2} \sigma_{t-1}^{(1)}$ to decrease without proceeding to $f^{(2)}$, incurring a large regret ($f_\star + B$) in the process. Our analysis reveals that an optimal choice for the SE kernel scales as $\gamma^{(1)} \asymp (\lambda^{(1)} \zeta^{(1)} / (t \lambda^{(2)}))^{1/(d+2)}$ at time $t$. However, this is of little practical use, as the leading constant depends on several problem dependent quantities such as $\Psi_n(\mathcal{X}_g)$. In Section 5 we describe a heuristic for setting $\gamma^{(m)}$ which worked well in our experiments. Theorem 3 can be generalised to cases where the kernels $\kappa^{(m)}$ and observation noises $\eta^{(m)}$ are different at each fidelity; the changes to the proofs are minimal, and in fact our practical implementation uses different kernels.
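As an aside, the mutual information in Definition 2 has a closed form for GPs with Gaussian noise, $I(y_{\tilde{A}}; f_{\tilde{A}}) = \frac{1}{2} \log\det(I + \eta^{-2} K_{\tilde{A}})$, where $K_{\tilde{A}}$ is the Gram matrix of $\tilde{A}$ [28]. The sketch below evaluates this quantity for a fixed subset; the maximisation over all $n$-point subsets in $\Psi_n$ is combinatorial and is not attempted here:

```python
import numpy as np

def information_gain(K, eta):
    """I(y_A; f_A) = 0.5 * log det(I + eta^-2 K) for a Gram matrix K [28]."""
    _, logdet = np.linalg.slogdet(np.eye(len(K)) + K / eta ** 2)
    return 0.5 * logdet

# Example: information gain of 20 random points in [0, 1]^2 under an SE kernel.
rng = np.random.default_rng(0)
X = rng.random((20, 2))
sq = np.sum((X[:, None] - X[None, :]) ** 2, axis=-1)
K = np.exp(-sq / (2 * 0.5 ** 2))   # SE kernel with sigma = 1, h = 0.5
print(information_gain(K, eta=0.1))
```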
As with any nonparametric method, our algorithm has exponential dependence on dimension; this can be alleviated by assuming additional structure in the problem [8, 15]. Finally, we note that the above rates translate to bounds on the simple regret $S(\Lambda)$ for optimisation.

## 5 Implementation Details

Our implementation uses some standard techniques in Bayesian optimisation to learn the kernel, such as initialisation with random queries and periodic marginal likelihood maximisation. These techniques may already be known to a reader familiar with the BO literature; we elaborate on them in Appendix B and focus here on the $\gamma^{(m)}, \zeta^{(m)}$ parameters of our method.

Algorithm 1 assumes that the $\zeta^{(m)}$'s are given with the problem description, which is hardly the case in practice. In our implementation, instead of having to deal with $M - 1$ separate $\zeta^{(m)}$ values, we set $(\zeta^{(1)}, \zeta^{(2)}, \dots, \zeta^{(M-1)}) = ((M-1)\zeta, (M-2)\zeta, \dots, \zeta)$, so we only have one value $\zeta$. This, for instance, is satisfied if $\|f^{(m)} - f^{(m-1)}\|_\infty \le \zeta$, which is stronger than Assumption A2. Initially, we start with a small $\zeta$. Whenever we query at any fidelity $m > 1$, we also check the posterior mean of the $(m-1)$th fidelity. If $|f^{(m)}(x_t) - \mu_{t-1}^{(m-1)}(x_t)| > \zeta$, we query again at $x_t$, but at the $(m-1)$th fidelity. If then $|f^{(m)}(x_t) - f^{(m-1)}(x_t)| > \zeta$, we update $\zeta$ to twice the violation. To set the $\gamma^{(m)}$'s we use the following intuition: if the algorithm is stuck at fidelity $m$ for too long, then $\gamma^{(m)}$ is probably too small. We start with small values for $\gamma^{(m)}$. If the algorithm does not query above the $m$th fidelity for more than $\lambda^{(m+1)}/\lambda^{(m)}$ iterations, we double $\gamma^{(m)}$ (see the sketch after the next paragraph). We found our implementation to be fairly robust, even recovering from fairly bad approximations at the lower fidelities (see Appendix D.3).

*Figure 3: The simple regret $S(\Lambda)$ against the spent capital $\Lambda$ on synthetic functions (panels: Borehole-8D, $M = 2$, costs $[1; 10]$; Hartmann-3D, $M = 3$, costs $[1; 10; 100]$; third panel: query frequencies for Hartmann-3D, $m = 1, 2, 3$; methods: MF-GP-UCB, GP-UCB, EI, DiRect, MF-NAIVE, MF-SKO). The title of each panel states the function, its dimensionality, the number of fidelities and the costs used for each fidelity in the experiment. All curves, barring DiRect (which is deterministic), were produced by averaging over 20 experiments; the error bars indicate one standard error. See Figures 8, 9, 10 in Appendix D for more synthetic results. The last panel shows the number of queries at different function values at each fidelity for the Hartmann-3D example.*

## 6 Experiments

We compare MF-GP-UCB to the following methods. Single fidelity methods: GP-UCB; EI, the expected improvement criterion for BO [13]; DiRect, the dividing rectangles method [12]. Multi-fidelity methods: MF-NAIVE, a naive baseline where we use GP-UCB to query at the first fidelity a large number of times and then query at the last fidelity at the points queried at $f^{(1)}$, in decreasing order of $f^{(1)}$-value; MF-SKO, the multi-fidelity sequential kriging method from [11]. Previous works on multi-fidelity methods (including MF-SKO) had not made their code available and were not straightforward to implement; hence, we could not compare to all of them. We discuss this further in Appendix D, along with some other single and multi-fidelity baselines we tried but excluded from the comparison to avoid clutter in the figures. In addition, we detail the design choices and hyper-parameters for all methods in Appendix D.
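Before turning to the results, here is a minimal sketch of the $\zeta$ and $\gamma$ adaptation rules from Section 5; the function names and state handling are hypothetical, and the released MATLAB implementation may differ:

```python
def update_zeta(zeta, f_m_xt, f_m1_xt):
    """If the observed gap |f^(m)(x_t) - f^(m-1)(x_t)| violates zeta, set zeta to twice the violation."""
    violation = abs(f_m_xt - f_m1_xt)
    return 2 * violation if violation > zeta else zeta

def update_gamma(gamma, m, steps_stuck_at_m, costs):
    """Double gamma^(m) if we have not queried above fidelity m for lambda^(m+1)/lambda^(m) iterations."""
    if steps_stuck_at_m > costs[m + 1] / costs[m]:
        gamma[m] *= 2.0
    return gamma
```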
Synthetic Examples: We use the Currin exponential ($d = 2$), Park ($d = 4$) and Borehole ($d = 8$) functions in $M = 2$ fidelity experiments, and the Hartmann functions in $d = 3$ and $d = 6$ with $M = 3$ and $M = 4$ fidelities respectively. The first three are taken from previous multi-fidelity literature [32], while we tweaked the Hartmann functions to obtain the lower fidelities for the latter two cases. We show the simple regret $S(\Lambda)$ against capital $\Lambda$ for the Borehole and Hartmann-3D functions in Fig. 3, with the rest deferred to Appendix D due to space constraints. MF-GP-UCB outperforms the other methods. Appendix D also contains results for the cumulative regret $R(\Lambda)$ and the formulae for these functions. A common occurrence with MF-NAIVE was that once we started querying at fidelity $M$, the regret barely decreased. The diagnosis in all cases was the same: it was stuck around the maximum of $f^{(1)}$, which is suboptimal for $f^{(M)}$. This suggests that while we have cheap approximations, the problem is by no means trivial. As explained previously, it is also important to explore at the higher fidelities to achieve good regret. The advantage of MF-GP-UCB over single fidelity methods is that it confines this exploration to a small set containing the optimum. In our experiments we found that MF-SKO did not consistently beat the other single fidelity methods. Despite our best efforts to reproduce this (and another) multi-fidelity method, we found them to be quite brittle (Appendix D.1).

The third panel of Fig. 3 shows a histogram of the number of queries at each fidelity after 184 queries of MF-GP-UCB, for different ranges of $f^{(3)}(x)$, for the Hartmann-3D function. Many of the queries at low $f^{(3)}$ values are at fidelity 1; as we move towards higher function values, these decrease and the second fidelity queries increase. The third fidelity dominates very close to the optimum but is used sparingly elsewhere. This corroborates the prediction in our analysis that MF-GP-UCB uses low fidelities to explore and successively higher fidelities in promising regions to zero in on $x_\star$. (Also see Fig. 6, Appendix B.)

*Figure 4: Results on the hyper-parameter tuning experiments (panels: SVM-2D, $M = 2$, $n_{tr} = [500, 2000]$; SALSA-6D, $M = 3$, $n_{tr} = [2000, 4000, 8000]$; V&J-22D, $M = 2$, $n_{tr} = [300, 3000]$; x-axis: CPU time (s); y-axis: CV error). The title of each panel states the experiment, the dimensionality (number of hyper-parameters) and the training set size at each fidelity. All curves were produced by averaging over 10 experiments; the error bars indicate one standard error. The lengths of the curves differ in time because we ran each method for a pre-specified number of iterations and they concluded at different times.*

Real Experiments: We present results on three hyper-parameter tuning tasks (Fig. 4) and a maximum likelihood inference task in astrophysics (Fig. 5). We compare methods on computation time, since that is the cost in all experiments; we include the processing time for each method in the comparison (i.e. the cost of determining the next query).

Classification using SVMs (SVM): We trained an SVM on the magic gamma dataset using the SMO algorithm to an accuracy of $10^{-12}$. The goal is to tune the kernel bandwidth and the soft margin coefficient in the ranges $(10^{-3}, 10^{1})$ and $(10^{-1}, 10^{5})$ respectively, on a dataset of size 2000.
We set this up as an $M = 2$ fidelity experiment with the entire training set at the second fidelity and 500 points at the first. Each query was 5-fold cross validation on these training sets.

Regression using Additive Kernels (SALSA): We used the regression method from [14] on the 4-dimensional coal power plant dataset. We tuned the 6 hyper-parameters (the regularisation penalty, the kernel scale, and the kernel bandwidth for each dimension), each in the range $(10^{-3}, 10^{4})$, using 5-fold cross validation. This experiment used $M = 3$ and 2000, 4000, 8000 points at each fidelity.

Viola & Jones face detection (V&J): The V&J classifier [31], which uses a cascade of weak classifiers, is a popular method for face detection. To classify an image, we pass it through each classifier; if at any point the classifier score falls below a threshold, the image is classified negative, and if it passes through the entire cascade, it is classified positive. One of the more popular implementations comes with OpenCV and uses a cascade of 22 weak classifiers. The threshold values in OpenCV are pre-set based on some heuristics, and there is no reason to think they are optimal for a given face detection task. The goal is to tune these 22 thresholds by optimising them over a training set. We modified the OpenCV implementation to take in the thresholds as parameters. As our domain $\mathcal{X}$ we chose a neighbourhood around the configuration used in OpenCV. We set this up as an $M = 2$ fidelity experiment where the second fidelity used 3000 images from the V&J face database and the first used 300. Interestingly, on an independent test set, the configurations found by MF-GP-UCB consistently achieved over 90% accuracy, while the OpenCV configuration achieved only 87.4% accuracy.

Type Ia Supernovae: We use Type Ia supernova data [7] for maximum likelihood inference on 3 cosmological parameters: the Hubble constant $H_0 \in (60, 80)$, and the dark matter and dark energy fractions $\Omega_M, \Omega_\Lambda \in (0, 1)$. Unlike typical parametric maximum likelihood problems, the likelihood is only available as a black box. It is computed using the Robertson-Walker metric, which requires a one-dimensional numerical integration for each sample in the dataset. We set this up as an $M = 3$ fidelity task. The goal is to maximise the likelihood at the third fidelity, where the integration was performed using the trapezoidal rule on a grid of size $10^6$. For the first and second fidelities, we used grids of size $10^2$ and $10^4$ respectively. The results are given in Fig. 5.

*Figure 5: Results on the supernova inference problem (Supernova-3D, $M = 3$, grid sizes $[10^2, 10^4, 10^6]$; x-axis: CPU time (s)). The y-axis is the log likelihood, so higher is better. MF-NAIVE is not visible as it performed very poorly.*

Conclusion: We introduced and studied the multi-fidelity bandit problem under Gaussian process assumptions. We present, to our knowledge, the first formalism of regret and the first theoretical results in this setting. They demonstrate that MF-GP-UCB explores the space via cheap lower fidelities and leverages the higher fidelities on successively smaller regions, hence achieving better regret than single fidelity strategies. Experimental results demonstrate the efficacy of our method.

## References

[1] Alekh Agarwal, John C. Duchi, Peter L. Bartlett, and Clement Levrard. Oracle inequalities for computationally budgeted model selection. In COLT, 2011.
[2] Peter Auer. Using confidence bounds for exploitation-exploration trade-offs. Journal of Machine Learning Research, 2003.
[3] E. Brochu, V. M. Cora, and N. de Freitas. A tutorial on Bayesian optimization of expensive cost functions, with application to active user modeling and hierarchical RL. CoRR, 2010.
[4] Sébastien Bubeck and Nicolò Cesa-Bianchi. Regret analysis of stochastic and nonstochastic multi-armed bandit problems. Foundations and Trends in Machine Learning, 2012.
[5] Mark Cutler, Thomas J. Walsh, and Jonathan P. How. Reinforcement learning with multi-fidelity simulators. In ICRA, 2014.
[6] V. Dani, T. P. Hayes, and S. M. Kakade. Stochastic linear optimization under bandit feedback. In COLT, 2008.
[7] T. M. Davis et al. Scrutinizing exotic cosmological models using ESSENCE supernova data combined with other cosmological probes. Astrophysical Journal, 2007.
[8] J. Djolonga, A. Krause, and V. Cevher. High-dimensional Gaussian process bandits. In NIPS, 2013.
[9] Alexander I. J. Forrester, András Sóbester, and Andy J. Keane. Multi-fidelity optimization via surrogate modelling. Proceedings of the Royal Society A: Mathematical, Physical and Engineering Science, 2007.
[10] Subhashis Ghosal and Anindya Roy. Posterior consistency of Gaussian process prior for nonparametric binary regression. Annals of Statistics, 2006.
[11] D. Huang, T. T. Allen, W. I. Notz, and R. A. Miller. Sequential kriging optimization using multiple-fidelity evaluations. Structural and Multidisciplinary Optimization, 2006.
[12] D. R. Jones, C. D. Perttunen, and B. E. Stuckman. Lipschitzian optimization without the Lipschitz constant. Journal of Optimization Theory and Applications, 1993.
[13] Donald R. Jones, Matthias Schonlau, and William J. Welch. Efficient global optimization of expensive black-box functions. Journal of Global Optimization, 1998.
[14] Kirthevasan Kandasamy and Yaoliang Yu. Additive approximations in high dimensional nonparametric regression via the SALSA. In ICML, 2016.
[15] Kirthevasan Kandasamy, Jeff Schneider, and Barnabás Póczos. High dimensional Bayesian optimisation and bandits via additive models. In ICML, 2015.
[16] Kirthevasan Kandasamy, Gautam Dasarathy, Jeff Schneider, and Barnabás Póczos. The multi-fidelity multi-armed bandit. In NIPS, 2016.
[17] K. Kawaguchi, L. P. Kaelbling, and T. Lozano-Pérez. Bayesian optimization with exponential convergence. In NIPS, 2015.
[18] S. Kirkpatrick, C. D. Gelatt, and M. P. Vecchi. Optimization by simulated annealing. Science, 1983.
[19] A. Klein, S. Bartels, S. Falkner, P. Hennig, and F. Hutter. Towards efficient Bayesian optimization for big data. In BayesOpt, 2015.
[20] R. Martinez-Cantin, N. de Freitas, A. Doucet, and J. Castellanos. Active policy learning for robot planning and exploration under uncertainty. In Proceedings of Robotics: Science and Systems, 2007.
[21] Jonas Mockus. Application of Bayesian approach to numerical methods of global and stochastic optimization. Journal of Global Optimization, 1994.
[22] R. Munos. Optimistic optimization of deterministic functions without the knowledge of its smoothness. In NIPS, 2011.
[23] D. Parkinson, P. Mukherjee, and A. R. Liddle. A Bayesian model selection analysis of WMAP3. Physical Review, 2006.
[24] C. E. Rasmussen and C. K. I. Williams. Gaussian Processes for Machine Learning. MIT Press, 2006.
[25] Herbert Robbins. Some aspects of the sequential design of experiments. Bulletin of the American Mathematical Society, 1952.
[26] A. Sabharwal, H. Samulowitz, and G. Tesauro. Selecting near-optimal learners via incremental data allocation. In AAAI, 2015.
[27] J. Snoek, H. Larochelle, and R. P. Adams. Practical Bayesian optimization of machine learning algorithms. In NIPS, 2012.
[28] Niranjan Srinivas, Andreas Krause, Sham Kakade, and Matthias Seeger. Gaussian process optimization in the bandit setting: No regret and experimental design. In ICML, 2010.
[29] Kevin Swersky, Jasper Snoek, and Ryan P. Adams. Multi-task Bayesian optimization. In NIPS, 2013.
[30] W. R. Thompson. On the likelihood that one unknown probability exceeds another in view of the evidence of two samples. Biometrika, 1933.
[31] Paul A. Viola and Michael J. Jones. Rapid object detection using a boosted cascade of simple features. In CVPR, 2001.
[32] Shifeng Xiong, Peter Z. G. Qian, and C. F. Jeff Wu. Sequential design and analysis of high-accuracy and low-accuracy computer codes. Technometrics, 2013.
[33] C. Zhang and K. Chaudhuri. Active learning from weak and strong labelers. In NIPS, 2015.