# multifidelity_blackbox_optimization_with_hierarchical_partitions__e4642aa3.pdf

Multi-Fidelity Black-Box Optimization with Hierarchical Partitions

Rajat Sen 1 Kirthevasan Kandasamy 2 Sanjay Shakkottai 1

Motivated by settings such as hyper-parameter tuning and physical simulations, we consider the problem of black-box optimization of a function. Multi-ﬁdelity techniques have become popular for applications where exact function evaluations are expensive, but coarse (biased) approximations are available at much lower cost. A canonical example is that of hyper-parameter selection in a learning algorithm. The learning algorithm can be trained for fewer iterations this results in a lower cost, but its validation error is only coarsely indicative of the same if the algorithm had been trained till completion. We incorporate the multi-ﬁdelity setup into the powerful framework of black-box optimization through hierarchical partitioning. We develop tree-search based multi-ﬁdelity algorithms with theoretical guarantees on simple regret. We ﬁnally demonstrate the performance gains of our algorithms on both real and synthetic datasets.

1. Introduction

Optimizing a black-box function f over a Euclidean domain X is a classical problem studied in several disciplines including computer science, mathematics, and operations research. It ﬁnds applications in many real world scientiﬁc and engineering tasks including scientiﬁc experimentation, industrial design, and model selection in statistics and machine learning (Martinez-Cantin et al., 2007; Parkinson et al., 2006; Snoek et al., 2012). Given a budget of n evaluations, an optimization algorithm operates sequentially at time t, it chooses to evaluate f at xt based on its previous evaluations {xi, f(xi)}t 1

i=1. At the end of n evaluations, it makes a recommendation x(n) and its performance is measured by its

1Univerity of Texas as Austin 2Carnegie Mellon University. Correspondence to: Rajat Sen <rajat.sen@utexas.edu>.

Proceedings of the 35 th International Conference on Machine Learning, Stockholm, Sweden, PMLR 80, 2018. Copyright 2018 by the author(s).

simple regret Rn,

f(x) f(x(n)).

Our study focuses on applications where exact evaluation of the function f is expensive. As an example, in the case of model selection, training and validating large neural networks can take several hours to days. Similarly, the simulation of an astrophysical process typically takes multiple days even on a cluster of super computers. Traditional methods for black-box optimization are poorly suited for such applications because we need to invest a considerable number of evaluations to optimize f. This motivates studying the multi-ﬁdelity setting where we have cheaper, but potentially biased approximations to the function (Cutler et al., 2014; Huang et al., 2006a; Kandasamy et al., 2016c; 2017). As an illustration, in a hyper-parameter tuning scenario, the task is to ﬁnd the best set of hyper-parameters for training a machine learning model. In this setting, the black-box function that needs to be optimized is the validation error after training the learning algorithm to completion (a certain number of maximum iterations) i.e X represents the allowed set of hyper-parameters while the function represents the validation error after training to completion. However, as training the algorithm till completion is expensive, we may choose to train the learning algorithm for a few iterations at chosen hyper-parameters and then test it on the validation set. These inexpensive validation errors act as the cheap approximations (ﬁdelities) to the function value and can indeed provide valuable information regarding the quality of the hyper-parameters.

The multi-ﬁdelity setup for black-box function optimization has been popularly studied in the Bayesian optimization setting (Huang et al., 2006b; Kandasamy et al., 2016b; 2017). However, in this paper we focus on another powerful framework for sequential black-box optimization that works with hierarchical partitioning of the function domain X. These tree-search based methods were initially motivated by an empirically successful heuristic UCT (Kocsis & Szepesv ari, 2006), which subsequently lead to several theoretically founded algorithms for black-box optimization through hierarchical partitioning (Bubeck et al., 2011; Kleinberg et al., 2008; Munos, 2011; Valko et al., 2013).

In this work, we incorporate cheap approximations or ﬁdeli-

Multi-Fidelity Black-Box Optimization

ties with tree-search based methods for black-box optimization. We assume access to a tree-like partitioning of the domain X similar to (Bubeck et al., 2011; Grill et al., 2015; Munos, 2011). The partitioning of the domain X is denoted as P and it contains hierarchical cells {Ph,i}, where h denotes the height of the cell and i denotes the index. A cell Ph,i at height h has K children {Ph+1,ik}K

k=1, which are distinct partitions of Ph,i. At height 0, there is only one partition P0,1 = X. An example of such an hierarchical partition for X = [0, 1] and K = 2 would be: P0,1 = [0, 1], P1,1, P1,2 = [0, 0.5], (0.5, 1], P2,1 = [0, 0.25]... and so on. Most of the prior work assume some smoothness property about the function and hierarchical partitioning. We follow a similar path adopting the smoothness assumptions in (Grill et al., 2015). This assumption states that there exists and 2 (0, 1) such that,

8h 0, 8x 2 Ph,i

h, f(x) f(x ) h, (1)

where x is assumed to be the unique point in X such that f(x ) = supx2X f(x). This assumption basically says that the diameter of the function is bounded for all cells that contain the optima, and that this diameter goes down at a geometric rate with the height of the cell. We also adopt the deﬁnition of the well-known near-optimality dimension d( , ) which restricts the number of cells at height h that contain points close to the optima. This is an important quantity in the analysis of many tree-search based methods (Bubeck et al., 2011; Grill et al., 2015; Munos, 2011; Valko et al., 2013).

In addition we also model that the function can be accessed at a continuous range of ﬁdelities within Z = [0, 1], where z = 0 is the cheapest ﬁdelity and z = 1 is the most expensive one. For instance, in our hyper-parameter tuning example, z = 1 may correspond to training the learning algorithm for 1000 iterations while z = 0 represents training the algorithm to 50 iterations. When a function is evaluated at a point x 2 X with a ﬁdelity z 2 Z, a value fz(x) is revealed such that |f(x) fz(x)| < (z) where (.) is a ﬁxed bias function. The bias function is monotonically decreasing in z with (1) = 0. There is also a cost associated with these evaluations which is captured by a cost function λ : Z ! R+. The cost function is assumed to be monotonically increasing in z. For instance, in hyper-parameter tuning the cost increases linearly with the number of iterations. The objective is to locate a point x such that f(x) is as close as possible to supx2X f(x) given a ﬁnite cost budget .

The following are the main contributions of this work:

(i) We incorporate multiple ﬁdelities/cheap approximations in black-box function optimization through hierarchical partitioning. We propose and analyze two algorithms in this setting. Our ﬁrst algorithm is known as MFDOO (Algorithm 1) which requires knowledge about the smoothness

parameter ( , ). This algorithm is similar to DOO (Munos, 2011), however it is designed to explore coarser partitions at lower ﬁdelities while exploring ﬁner partitions at higher ﬁdelities, when the algorithm zooms in on a promising area of the function domain. Motivated by recent work (Grill et al., 2015), we also propose a second algorithm MFPDOO (Algorithm 2), which does not require knowledge about the smoothness. This algorithm spawns several instances of MFDOO (Algorithm 1) with carefully selected parameters, at least one of which is bound to perform nearly as well as MFDOO when the smoothness parameters are known.

(ii) We provide simple regret bounds for both of our algorithms, given a ﬁxed cost budget for performing evaluations. First we show that when the smoothness parameters are known, MFDOO has simple regret of O( 1/d( , )+1) under some conditions on the bias and cost function. Here, d( , ) is the near-optimality dimension of the function with respect to parameters ( , ). On the other hand a naive application of DOO (Munos, 2011)1 only using the highest ﬁdelity z = 1 would yield a regret bound of O(( /λ(1)) 1/d( , )). We also show that our second algorithm MFPDOO can obtain a simple regret bound of O(( / log ) 1/d( , )+1) even when the smoothness parameters are not known. The precise details about our theoretical guarantees can be found in Section 5.

(iii) Finally, we compare the performance of our algorithms with several state of the art algorithms (Grill et al., 2015; Huang et al., 2006b; Jones et al., 1998; Kandasamy et al., 2016b;b; Srinivas et al., 2009) for black-box optimization in the multi-ﬁdelity setting, on real and synthetic data-sets. We demonstrate that our algorithms outperform the state of the art in most of these experiments.

2. Related Work

We build on a line of work on bandits and black-box optimization with hierarchical partitions (Bubeck et al., 2011; Kleinberg et al., 2008; Munos, 2011; Valko et al., 2013). These methods rely on the principle of optimism i.e they build upper bounds on the value of the functions inside different partitions using the already explored points x1..., xt. Then at time t + 1, a point is chosen from the partition that has the highest value of this upper-bound. We closely follow the line of work initiated in (Munos, 2011) that was later extended to noisy function evaluations in (Grill et al., 2015; Valko et al., 2013). In (Munos, 2011) it was assumed that the function follows a local Lipschitz condition with respect to a semi-metric , and the diameter of the hierarchical partitions with respect to this semi-metric decrease geometrically with height. Grill et al. (Grill et al., 2015) later merged these

1Note that DOO also requires knowledge of the smoothness parameters of the function.

Multi-Fidelity Black-Box Optimization

two assumptions into one, by having a single condition that related the smoothness of the function with the hierarchical partition. In this work we adapt the regime of (Grill et al., 2015). However, we also model cheap approximations to the functions through a one-dimensional ﬁdelity space.

Multi-ﬁdelity optimization has had a rich history in many settings (Agarwal et al., 2011; Forrester et al., 2007; Huang et al., 2006a; Klein et al., 2016; Lam et al., 2015; Li et al., 2016; Poloczek et al., 2016; Sabharwal et al., 2015; Zhang & Chaudhuri, 2015), with those that are not applicationspeciﬁc focusing on a Bayesian framework without formal guarantees (we refer to (Kandasamy et al., 2017) for additional discussion). Kandasamy et al. (2016c) propose and analyse a UCB style multi-ﬁdelity algorithm for the K-armed bandit setting assuming a ﬁnite number of approximations to the K arms. They then extend this work to develop UCB algorithms for black-box optimization under Bayesian Gaussian process assumptions on f, both with a ﬁnite number of approximations and a continuous spectrum of approximations (Kandasamy et al., 2016a;b; 2017). In all these works, the relation between the approximations and the true function is known and appears in the form of uniform bounds on the approximation or a smoothness assumption arising out of the kernel of the Gaussian process. In our work we merge the multi-ﬁdelity setting with the hierarchical partitions framework.

3. Problem Setting

We consider the problem of optimizing a function f : X ! R with black-box access at different ﬁdelities. The aim is to locate a point x such that f(x) is as close as possible to supx2X f(x), given a ﬁnite budget for performing evaluations.

We assume that the function can be queried at a continuous range of ﬁdelities in Z , [0, 1]. When the function is queried at a point x 2 X with ﬁdelity z 2 Z, a value fz(x) is revealed. We assume that |fz(x) f(x)| (z), where : Z ! R+ is a known bias function. It is also assumed that a single query at ﬁdelity z incurs a cost λ(z), where λ : Z ! R+ is a known cost function. We assume there is a unique point x 2 X at which supx2X f(x) is achieved.

Bias and Cost Functions: The bias function is assumed to be bounded and monotonically decreasing in z. The optimal ﬁdelity z is assumed to have zero bias i.e. (1) = 0. The cost function λ is assumed to be bounded and monotonically increasing in z.

The multi-ﬁdelity setting is motivated by engineering applications where cheap approximations are available. One promising use case is that of hyper-parameter tuning, where the validation performance of a learning algorithm at different hyper-parameters can be observed. The aim is to locate

the best hyper-parameter. In such a setting, cheap approximations are available for instance instead of evaluating the learning algorithm after a maximum of T iterations, one may choose to evaluate it after t < T iterations. In this case T can be mapped to z = 1 and t can be mapped to a z < 1. The cost function in this setting is proportional to the O(t) computation required. The bias function is monotonically decreasing with z, however may not be known exactly in practice. However, prior works in multi-ﬁdelity setup (Kandasamy et al., 2016a;c; Kleinberg et al., 2008) have all assumed access to a known bias function for the theoretical guarantees. Even though we assume the bias function is known in theory, we shall see in our experiments in Section 6 that a simple parametric form of the bias function can be assumed and the parameters can be updated online during the course of our algorithm (similar to (Kandasamy et al., 2017)).

Simple Regret: The objective is to locate a point x such that f(x) is as close as possible to supx2X f(x) given a ﬁnite cost budget. Let be the total cost budget allowed. Consider an optimization policy that queries a sequence of points {x1, ..., xn( )} at ﬁdelities {z1, ..., zn( )} respectively and ﬁnally returns a recommendation x . Our main quantity of interest is the simple regret which is deﬁned as,

f(x) f (x ) , (2)

such that Pn( )

i=1 λ(zi) . Note that the simple regret is always measured at the highest ﬁdelity as we are only interested in optimizing the actual function.

3.1. Hierarchical Partitions and Assumptions

In this section we deﬁne the hierarchical partitions of the domain X that we assume access to and then provide our technical assumptions about the function and the hierarchical partitions.

Hierarchical Partitions: We assume access to a tree-like hierarchical partitioning P = {Ph,i} of the domain X, where, h denotes a depth parameter. For any depth h 0, the cells {Ph,i}1 i Ih denote a partitioning of the space X, where Ih is the number of cells at depth h. At depth 0 there is a single cell P0,1 = X. A cell Ph,i can be split into K child nodes at depth h + 1. In what follows, querying a cell Ph,i would refer to evaluating the function at a ﬁxed representative point xh,i 2 Ph,i at a chosen ﬁdelity. The ﬁxed representative point is usually chosen to be the coordinate wise mid-point for any given cell.

As an illustrative example let us consider a hierarchical black-box optimization problem over the domain X = [0, 1] [0, 1]. Let us consider a hierarchical partition of this domain where the cells are of the form {x 2 X : b1,1 x1 < b1,2, b2,1 x2 < b2,2}. Such a

Multi-Fidelity Black-Box Optimization

cell will be denoted by the notation [[b1,1, b1,2], [b2,1, b2,2]]. Then a hierarchical partition with K = 2 starts with the root node P0,1 = [[0, 1], [0, 1]]. This can be further sub-divided into children cells at h = 1 given by P1,1 = [[0, 0.5], [0, 1]] and P1,2 = [[0.5, 1], [0, 1]]. P1,2 can be further partitioned into P2,1 = [[0.5, 1], [0, 0.5]] and P2,2 = [[0.5, 1], [0.5, 1]] and so on. The ﬁxed representative point for a cell [[b1,1, b1,2], [b2,1, b2,2]] is chosen as the point [(b1,1 + b1,2)/2, (b2,1 + b2,2)/2].

Black-box optimization is akin to a needle in a haystack problem without any conditions on the function f(x). Therefore, similar to prior work (Grill et al., 2015) we make the following smoothness assumption which depends on the properties of both the function f and the hierarchical partitioning P.

Assumption 1 (Smoothness Decay). There exists and 2 (0, 1) such that,

8h 0, 8x 2 Ph,i

h, f(x) f(x ) h, (3)

h is the unique partition of height h which contains x .

We also adopt the deﬁnition of the near-optimalitydimension for parameters ( , ) from (Grill et al., 2015). This is a quantity that depends on the choice of parameters, the partitioning and the function itself.

Deﬁnition 1. The near-optimality dimension of f with respect to parameters ( , ) is given by,

d( , ) , inf

d0 2 R+ : 9C( , ), 8h 0,

Nh(2 h) C( , ) d0ho

where Nh( ) is the number of cells Ph,i such that supx2Ph,i f(x) f(x ) .

Let ( , ) be the parameters with the minimum near optimality dimension d( , ).

Discussion: Access to hierarchical partitions have been assumed in a string of previous works on black-box optimization (Bubeck et al., 2011; Grill et al., 2015; Kleinberg et al., 2008; Munos, 2011; Slivkins, 2011; Valko et al., 2013). Many of these prior works assume a semi-metric over the domain X (Bubeck et al., 2011; Munos, 2011; Valko et al., 2013). In (Bubeck et al., 2011), it is assumed that the function satisﬁes a weak-Lipschitzness condition. More recent works (Munos, 2011; Valko et al., 2013) have assumed a local-smoothness property w.r.t the metric given by 8x 2 X, f(x ) f(x) (x, x ). However, recently Grill et al. (Grill et al., 2015) have observed that Assumption 1 is sufﬁcient to combine several assumptions about the semi-metric, the function and the hierarchical partitions into one combinatorial condition and similarly have adapted

the deﬁnition of the near-optimality dimension without the semi-metric. It was depicted in (Grill et al., 2015) that prior algorithms like (Bubeck et al., 2011; Munos, 2011; Valko et al., 2013) can be shown to have good regret guarantees with this new set of assumptions, and therefore we adopt these assumptions in our work.

4. Algorithms

In this section we present two algorithms for black-box optimization using different ﬁdelities and the hierarchical partitioning provided. In Section 4.1, we provide Algorithm 1 which requires the knowledge of the optimal smoothness decay parameters ( , ). Then in Section 4.2, we provide Algorithm 2 that searches for the optimal smoothness by spawning O(log ) instances of Algorithm 1 with a carefully designed sequence of smoothness parameters ( , ) as arguments.

4.1. Algorithm with known ( , )

In this section we provide an algorithm which takes as an argument the smoothness parameters ( , ). We show in Section 5 that if the parameters provided match with the optimal parameters ( , ) then the algorithm enjoys strong theoretical guarantees under some conditions on the bias and cost functions (z) and λ(z) respectively.

Algorithm 1 MFDOO: Multi-Fidelity Deterministic Optimistic Optimization

1: Arguments: ( , ), (z), λ(z), P, 2: Deﬁne zh = 1( h) 3: Let T = {(0, 1)} be the tree initialized (root node

evaluated with ﬁdelity z0). Set of leaves at time t: Lt. 4: Time: t = 1; Cost: C = λ(z0). 5: while C do 6: Select the leaf (h, j) 2 Lt with maximum bh,j , fzh(xh,j) + h + (zh). 7: Expand this node; add to Tt the K children of (h, j). 8: Evaluate the children at the ﬁdelity level zh+1. t = t + 1. C = C + Kλ(zh+1). 9: end while 10: Let h( ) be the height of the tree. Return x = arg max(h( ),i) fzh( )(xh( ),i).

In Algorithm 1, with some abuse of notation we deﬁne for all h 0, zh = 1( h) i.e the ﬁdelity at which the bias becomes less than or equal to the smoothness decay parameter at height h. All cells at height h are evaluated at the ﬁdelity zh. The intuition is that if x belongs to a cell Ph,i at height h that has been evaluated, then by Assumption 1 we have that all points in the cell are at least h optimal. Ideally, beyond this point we would only like to expand leaf nodes that are at least O( h) optimal,

Multi-Fidelity Black-Box Optimization

which can only be achieved if the error due to the ﬁdelities is O( h). At each step, a leaf node with the highest upper bound parameter bh,i, is expanded and the children cells are evaluated.

4.2. Algorithm without the knowledge of ( , )

In this section we describe an algorithm that does not require the optimal parameters ( , ). Algorithm 2 just requires max, max which are loose upper-bounds of and respectively. The algorithm proceeds by spawning O(log ) MFDOO instances with different ( , ) s which have been carefully designed. Similar ideas were explored in a setting without ﬁdelities in (Grill et al., 2015). In Section 5, we show that Algorithm 2 does almost as well as Algorithm 1 without requiring the optimal parameters as input.

Algorithm 2 MFPDOO: Multi-Fidelity Parallel Deterministic Optimistic Optimization

1: Arguments: ( max, max), (z), λ(z), P, 2: Let N = (1/2)Dmax log( / log( )) where Dmax =

log K/ log(1/ max) 3: for i = 0 to N 1 do 4: Spawn MFDOO (Algorithm 1) with parameters ( max, N/(N i)

max ) with budget ( Nλ(1))/N 5: end for 6: Let x(i)

be the point returned by the ith MFDOO instance for i 2 {0, .., N 1}. Evaluate all {x(i)

}i at the z = 1. Return the point x = x(i )

where i = arg maxi f(x(i)

Algorithm 2 proceeds by spawning N different MFDOO instances with the parameters speciﬁed in step 4 of the algorithm. We will show in Theorem 2 that at least one of the MFDOO instances will have a performance comparable to Algorithm 1 supplied with parameters ( , ) with a budget of O( /N). Step 6 of the algorithm obtains the exact value of the points returned by all the MFDOO instances by evaluating them at the highest ﬁdelity, and then chooses the one with the maximum value. This ensures that the highest performing MFDOO instance is selected.

Remark 1. Our algorithms and the theoretical results assume that the bias function (.) is known. However, in practice we do not assume perfect knowledge about the bias function. We assume a simple parametric form of the bias function and update the parameters online, when the bias assumptions are violated. We provide more details in Section 6 and show that even without this knowledge, the algorithms perform better than other benchmarks.

It should be noted that the different MFDOO instances created by Algorithm 2 can share information among each other, when multiple instances query the same partition at

very similar ﬁdelities. This leads to huge improvements in practice in terms of effectively using the cost budget.

5. Theoretical Results

In this section we ﬁrst prove a general result about the simple regret of Algorithm 1 which assumes access to the optimal parameters ( , ). This naturally implies a simple regret bound on Algorithm 1 when it is supplied with the parameters ( , ) i.e. the ones that have the minimum near-optimality dimension. Then we reﬁne these guarantees under some natural conditions on the bias and cost functions. Finally, we show that Algorithm 2 can achieve guarantees close to Algorithm 1 with the optimal parameters, without having access to them.

We ﬁrst present the following general result about Algorithm 1.

Theorem 1. Let h0 be the biggest number h such

C( , )Kλ(zl) d( , )l .

Let h( ) = h0 + 1. Then Algorithm 1 run with parameters ( , ) (s.t , ), incurs a simple regret of at

most 2 h( ) and terminates using a total cost of at most + Kλ(1).

We defer the proof of Theorem 1 to Section A in the appendix. Note that the guarantee in Theorem 1 is the tightest when the parameters ( , ) are supplied as the near optimality dimension d( , ) is the lowest.

Now, we impose some natural conditions on the cost and bias functions. We provide more specialized versions of the guarantees in Theorem 1 under these two conditions separately, which are described below.

Assumption 2. We assume that (.) and λ(.) are such that λ(z

h) min{βh, (1)} for some positive constant β. Here, z

Motivation: The above assumption is motivated by the following hyper-parameter tuning scenario. Consider training a learning algorithm with a particular hyper-parameter that involves optimizing a strongly convex and smooth function with gradient descent. Let the ﬁdelity denote a rescaled version of the number of steps in gradient descent n. We assume that at the optimal ﬁdelity (N steps) we reach the optimal value of the function up to an error of . Let zn = n/N. At ﬁdelity zn the error decays to (zn) = O(rn) for some r 2 (0, 1). The cost incurred is linear in the number of steps say λ(zn) = sn for s > 0. In this setting it can be shown that if (zn) h

, then n = O(h) and therefore λ(zn) = O(h).

Multi-Fidelity Black-Box Optimization

The second assumption under which we provide specialized guarantees is as follows.

Assumption 3. We assume that (.) and λ(.) are such that λ(z

h) min{γ h, (1)} for some constant γ 2 ( , 1). Here, z

Motivation: Assumption 3 is motivated by a similar hyperparameter tuning scenario as above. Consider training a learning algorithm with a particular hyper-parameter that involves optimizing a smooth convex function with accelerated gradient descent. Let the ﬁdelity denote a rescaled version of the number of steps in gradient descent n as above. At ﬁdelity zn the error decays to (zn) = O(1/n2). The cost incurred is linear in the number of steps say λ(zn) = sn for s > 0. In this setting it can be shown that if (zn) h

, then n = O(γ h) for γ = O(p ).

We are now at a position to introduce a specialized corollary of Theorem 1.

Corollary 1. Algorithm 1 with parameters ( , ) (s.t , ) run with a total budget of terminates with a total cost of at most + Kλ(1) and has the following properties:

(i) Under Assumption 2: R 2

C( , )Kβ (1 d( , ))

for some small > 0, provided is large enough.

(ii) Under Assumption 3:

2C( , )K (γ 1 d( , ) 1)

1 d( , )+1 .

Comparison with DOO (Munos, 2011): The above result can be directly compared to DOO (Munos, 2011) which is in the noiseless black-box optimization regime, without access to ﬁdelities. The simple regret of DOO under the same assumptions would scale as O

( /λ(1)) 1/d( , )

when all the evaluations are performed at the highest ﬁdelity. In contrast our bounds under Assumption 2 scales as O

( /β) 1/(d( , )+ )

, where is a constant close to zero. Note that λ(z

h) λ(1), and therefore β = λ(1) trivially satisﬁes the inequality in Assumption 2. Typically, β is expected to be much less as compared to the highest ﬁdelity cost λ(1). For example in our hyper-parameter tuning example where the ﬁdelity is the number of iterations (a maximum of N iterations), β is a small constant (see the discussion on Assumption 2), while λ(1) can be O(N). This can lead to signiﬁcant gains in simple regret as we show in our empirical results in Section 6. Similarly, under Assumption 3 our simple regret scales as O

1/(d( , )+1)(

, which can be much better than that of DOO (Munos, 2011) as the total budget is not divided by λ(1).

Now, we will provide one of our main results which states that Algorithm 2 can recover simple regret bounds which

are close to that of Algorithm 1 even when the optimal smoothness parameters are not known.

Theorem 2. Algorithm 2 when run with upper-bounds max and max with a total cost budget of terminates after using up a cost of at most + O(Kλ(1) log ) and has the following regret guarantees:

(i) Under Assumption 2 the simple regret is O

Dmax +d( , )

2 KβDmax log( / log ) λ(1)

(ii) Under Assumption 3 if γ max the simple regret is

2Dmax 1+d( , )

2 KDmax log( / log ) λ(1)

We defer the proof of this theorem to Appendix C.

Comparison with POO (Grill et al., 2015): It is worthwhile to compare our result with that of POO (Grill et al., 2015) which uses only the highest ﬁdelity. It should be noted that POO is in a noisy setting, which gives rise to extra polylog factors in the bounds. However, ignoring polylog factors the simple regret bound of POO would scale as O

( /(log( /λ(1)) λ(1))) 1/(d( , )+2)(

. In contrast

our bounds scale as O

( /(β log( ))) 1/(d( , )+ )

( / log( )) 1/(d( , )+1)(

under assumptions 2 and 3 respectively. This can lead to much better performance at the same cost budget. We demonstrate this in our empirical results in Section 6.

6. Empirical Results

In this section we provide empirical results on synthetic and real datasets. We compare our algorithm with the following related works: (i) BOCA (Kandasamy et al., 2017) which is a multi-ﬁdelity Gaussian Process (GP) based algorithm that can handle continuous ﬁdelity spaces, (ii) MFGP-UCB (Kandasamy et al., 2016c) which is a GP based multi-ﬁdelity method that can handle ﬁnite ﬁdelities, (iii) GP-EI criterion in bayesian optimization (Jones et al., 1998), (iv) MF-SKO, the multi-ﬁdelity sequential kriging optimisation method (Huang et al., 2006b), (v) GP-UCB (Srinivas et al., 2009) and (vi) MFPDOO(z = 1) which is a version of our algorithm that uses only the highest ﬁdelity; this is very similar to POO (Grill et al., 2015) but in a noiseless setting. This algorithm is referred to as PDOO in the ﬁgures, which is essentially DOO (Munos, 2011) with the smoothness parameters tuned according to the scheme in POO (Grill et al., 2015).

For our theoretical guarantees the bias function is assumed to be known. However, in practice we assume a parametric

Multi-Fidelity Black-Box Optimization

0 50 100 150 200 10-4

BOCA GP-UCB GP-EI MF-GP-UCB MF-SKO MFPDOO PDOO

0 50 100 150 200 10-4

0 50 100 150 200

0 50 100 150 200

0 50 100 150 200

0 100 200 300 400 0.4

Figure 1. We compare all the algorithms in both synthetic and real settings. Note that the common legend for all the ﬁgures are provided

in Fig. (a) in the interest of space. Figs. (a) to (e) consists of simulation experiments with standard benchmark functions in global optimization. The function name and the dimensions of the ﬁdelity space and the domain are mentioned on top of the individual ﬁgures. The x-axis plots the total cost budget in multiples of the highest ﬁdelity cost λ(1). The y-axis denotes the simple regret in a logarithmic scale. Fig (f) represents a hyper-parameter tuning application on the 20 news group dataset, where the aim is to tune two hyper-parameters of an SVM classiﬁer. The metric is 5-folds cross-validation accuracy. The ﬁdelities used are the number of samples used in estimating the cv accuracy, with the highest ﬁdelity being 5000 samples. The x-axis plots the time taken for the experiments to run.

form for the bias function that is (z) = c(1 z) where c is initially set to a very small constant like 0.001 in our experiments. The nature of Algorithm 2 is such that the same cells are queried at different ﬁdelities by the different MFDOO instances spawned. If a cell is queried at two different ﬁdelities z1 and z2 and the function values obtained are f1 and f2, then we update c to 2c whenever c|z1 z2| < |f1 f2|. The above update is only performed if |z1 z2| is greater than a speciﬁed threshold (0.0001 in our experiments). The hierarchical partitioning is performed according to a scheme similar to that of the DIRECT algorithm (Finkel, 2003), where each time a cell is split into K children the dimension that has the biggest width is split into K regions. We set K = 2 in all our experiments. Now, we will present the results of our synthetic experiments.

Our implementation can be found at https://github.com/rajatsen91/MFTREE DET.

6.1. Synthetic Experiments

We evaluate all the algorithms on standard benchmark functions used in global optimization. The functions have been modiﬁed to incorporate the ﬁdelity space Z = [0, 1]. The setup followed is identical to the one in (Kandasamy et al., 2017), except that we only work in a one dimensional ﬁdelity space. Also, we perform our experiments in a noiseless setting and therefore no Gaussian noise is added to the function evaluations, unlike in (Kandasamy et al., 2017). Note that MF-GP-UCB and MF-SKO are ﬁnite ﬁdelity methods. The approximations for these methods are obtained at z = 0.333 and z = 0.667. We provide more details about the synthetic functions and the ﬁdelities in Appendix D. Our experiments are performed under the deterministic setting, where no noise is added to the approximations. However, several of the algorithms that we compare to have a randomized component. For these algorithms, the results are averaged over 10 experiments and the corresponding error bars are shown. In our algorithm we set the number of MFDOO instances spawned to be N = 0.1Dmax log( /λ(1)), given

Multi-Fidelity Black-Box Optimization

a total budget . We set max = 0.95 and max = 2.0.

The results of the synthetic experiments are shown in Figure 1(a)-(e), where the title of each ﬁgure shows the name of the function, the dimension of the domain (d) and the dimension of the ﬁdelity space (p). We have p = 1 in all our experiments. It can be observed the tree based methods outperform the other algorithms by a large margin, except in the experiments with the Curin Exp function (Fig. 1c). Tree-based methods can handle higher dimensions better, as we can see in the Hartman6 (Fig. 1b) and Borehole (Fig. 1e) function experiments. Note that MFPDOO also beats PDOO by a large margin which only uses the highest ﬁdelity. PDOO is essentially DOO (Munos, 2011) where the smoothness decay parameters are tuned according to the scheme in (Grill et al., 2015). MFPDOO can effectively explore the space at cheaper ﬁdelities and then expend the higher ﬁdelities in promising regions of the domain, unlike PDOO.

6.2. Tuning SVM for News Group Classiﬁcation

In this section we describe our experiments that involve tuning hyper-parameters for text classiﬁcation. For this purpose we use a subset of the 20 news group dataset (Joachims, 1996). All the algorithms are used for tuning two hyper-

parameters: (i) the regularization penalty and (ii) the temperature of the rbf kernel both in the range of [10 2, 103]. For our experiments, we use the scikit-learn implementation of SVM classiﬁer and also the inbuilt KFold function for crossvalidation. The bag of words in each of the text document is converted into tf-idf features before applying the classiﬁcation models. We use a one-dimensional ﬁdelity space, where the ﬁdelity denotes the number of samples used to obtain 5-fold cross-validation accuracy. z = 1 corresponds to 5000 samples which is the maximum number of samples in the subset of the data used. z = 0 corresponds to 100 samples. Note that for the ﬁnite ﬁdelity methods MF-SKO and MF-GP-UCB, approximations are obtained at z = 0.33 and z = 0.667.

For our algorithms we set max = 1.0 and max = 0.9. At the beginning of the experiment some of the budget is expended to obtain the function values at a point x with two different ﬁdelities z1 = 0.8 and z2 = 0.2. Thus the total budget spent in the initialization is (0.8) + λ(0.2). The function values obtained are then used to initialize c in the bias function (z) = c(1 z). The initial value of c is set to 2|f1 f2|/|z1 z2|. Thereafter, c is updated online according to the method detailed above. We set N = 0.5Dmax log( /λ(1)).

The cross-validation accuracy obtained as a function of time is plotted in Fig. 1f for all the candidate algorithms. It can be observed that MFPDOO outperforms the other algorithms, especially in low-budget settings.

7. Conclusion

We considered the problem of black-box function optimization using hierarchical partitions in the presence of cheap approximations or ﬁdelities. We propose two tree-search based algorithms which can navigate the domain effectively using cheaper ﬁdelities for coarser partitions and more expensive ones while zeroing in on ﬁner partitions. We analyze our algorithms under standard smoothness assumptions and provide simple regret guarantees given a cost budget . Our simple regret guarantees scale much better with

respect to as compared to other hierarchical partitioning based algorithms that do not use cheaper ﬁdelities. Our ﬁrst algorithm (MFDOO) requires the knowledge of the smoothness parameters ( , ) and has a simple regret bound of O( 1/(d( , )+1)) where d( , ) is the near-optimality dimension. Our second algorithm (MFPDOO) can obtain a simple regret bound of O(( / log ) 1/(d( , )+1)) even when the smoothness parameter are unknown. Finally, we empirically validate the performance of our algorithms on real and synthetic datasets, where they outperform the stateof-the art multi-ﬁdelity algorithms.

This work opens up several interesting research directions. The theoretical guarantees of our algorithms assume some nice properties about the bias and cost functions. We believe it is possible to design more robust algorithms that have similar guarantees even for the bias and cost functions that are not well-designed. Our setting is also restricted to a one dimensional ﬁdelity space. However, in many application the ﬁdelity space may be multi-dimensional. For instance, in the hyper-parameter tuning one can choose to use less samples or train for lesser iterations. It is an interesting research direction to incorporate a multi-dimensional ﬁdelity space with tree-search based algorithms. Finally, in this work we work in the noise-less setting where the function and the approximations are deterministic. We believe it is possible to extend our results to a setting where zero-mean noise is added to the function and its approximations.

Acknowledgment

This work is partially supported by NSF grant 1320175, ARO grant W911NF-17-1-0359, and the US Do T supported D-STOP Tier 1 University Transportation Center.

Multi-Fidelity Black-Box Optimization

Agarwal, Alekh, Duchi, John C, Bartlett, Peter L, and Lev-

rard, Clement. Oracle inequalities for computationally budgeted model selection. In COLT, 2011.

Bubeck, S ebastien, Munos, R emi, Stoltz, Gilles, and

Szepesv ari, Csaba. X-armed bandits. Journal of Machine Learning Research, 12(May):1655 1695, 2011.

Currin, Carla. A bayesian approach to the design and anal-

ysis of computer experiments. Technical report, ORNL Oak Ridge National Laboratory (US), 1988.

Cutler, Mark, Walsh, Thomas J., and How, Jonathan P. Re-

inforcement Learning with Multi-Fidelity Simulators. In ICRA, 2014.

Dixon, L. C. W. and Szego, George Philip. Towards global

optimisation 2, volume 2. North Holland, 1978.

Finkel, Daniel E. Direct optimization algorithm user guide.

Center for Research in Scientiﬁc Computation, North Carolina State University, 2, 2003.

Forrester, Alexander I. J., S obester, Andr as, and Keane,

Andy J. Multi-ﬁdelity optimization via surrogate modelling. Proceedings of the Royal Society A: Mathematical, Physical and Engineering Science, 2007.

Grill, Jean-Bastien, Valko, Michal, and Munos, R emi.

Black-box optimization of noisy functions with unknown smoothness. In Advances in Neural Information Processing Systems, pp. 667 675, 2015.

Huang, D., Allen, T.T., Notz, W.I., and Miller, R.A. Se-

quential kriging optimization using multiple-ﬁdelity evaluations. Structural and Multidisciplinary Optimization, 2006a.

Huang, Deng, Allen, Theodore T, Notz, William I, and

Miller, R Allen. Sequential kriging optimization using multiple-ﬁdelity evaluations. Structural and Multidisciplinary Optimization, 32(5):369 382, 2006b.

Joachims, Thorsten. A probabilistic analysis of the rocchio

algorithm with tﬁdf for text categorization. Technical report, Carnegie-mellon univ pittsburgh pa dept of computer science, 1996.

Jones, Donald R, Schonlau, Matthias, and Welch, William J.

Efﬁcient global optimization of expensive black-box functions. Journal of Global optimization, 13(4):455 492, 1998.

Kandasamy, Kirthevasan, Dasarathy, Gautam, Oliva, Ju-

nier B, Schneider, Jeff, and P oczos, Barnab as. Gaussian process bandit optimisation with multi-ﬁdelity evaluations. In Advances in Neural Information Processing Systems, pp. 992 1000, 2016a.

Kandasamy, Kirthevasan, Dasarathy, Gautam, Oliva, Ju-

nier B, Schneider, Jeff, and Poczos, Barnabas. Multiﬁdelity gaussian process bandit optimisation. ar Xiv preprint ar Xiv:1603.06288, 2016b.

Kandasamy, Kirthevasan, Dasarathy, Gautam, Poczos, Barn-

abas, and Schneider, Jeff. The multi-ﬁdelity multi-armed bandit. In Advances in Neural Information Processing Systems, pp. 1777 1785, 2016c.

Kandasamy, Kirthevasan, Dasarathy, Gautam, Schneider,

Jeff, and Poczos, Barnabas. Multi-ﬁdelity bayesian optimisation with continuous approximations. ar Xiv preprint ar Xiv:1703.06240, 2017.

Klein, Aaron, Falkner, Stefan, Bartels, Simon, Hennig,

Philipp, and Hutter, Frank. Fast bayesian optimization of machine learning hyperparameters on large datasets. ar Xiv preprint ar Xiv:1605.07079, 2016.

Kleinberg, Robert, Slivkins, Aleksandrs, and Upfal, Eli.

Multi-armed bandits in metric spaces. In Proceedings of the fortieth annual ACM symposium on Theory of computing, pp. 681 690. ACM, 2008.

Kocsis, Levente and Szepesv ari, Csaba. Bandit based monte-

carlo planning. In European conference on machine learning, pp. 282 293. Springer, 2006.

Lam, R emi, Allaire, Douglas L, and Willcox, Karen E. Mul-

tiﬁdelity optimization using statistical surrogate modeling for non-hierarchical information sources. In 56th AIAA/ASCE/AHS/ASC Structures, Structural Dynamics,

and Materials Conference, pp. 0143, 2015.

Li, Lisha, Jamieson, Kevin, De Salvo, Giulia, Rostamizadeh,

Afshin, and Talwalkar, Ameet. Hyperband: A novel bandit-based approach to hyperparameter optimization. ar Xiv preprint ar Xiv:1603.06560, 2016.

Martinez-Cantin, R., de Freitas, N., Doucet, A., and Castel-

lanos, J. Active Policy Learning for Robot Planning and Exploration under Uncertainty. In Proceedings of Robotics: Science and Systems, 2007.

Munos, R emi. Optimistic optimization of a deterministic

function without the knowledge of its smoothness. In Advances in neural information processing systems, pp. 783 791, 2011.

Parkinson, David, Mukherjee, Pia, and Liddle, Andrew R. A

Bayesian model selection analysis of WMAP3. Physical Review, 2006.

Poloczek, Matthias, Wang, Jialei, and Frazier, Peter I.

Multi-information source optimization. ar Xiv preprint ar Xiv:1603.00389, 2016.

Multi-Fidelity Black-Box Optimization

Sabharwal, A, Samulowitz, H, and Tesauro, G. Selecting

near-optimal learners via incremental data allocation. In AAAI, 2015.

Slivkins, Aleksandrs. Multi-armed bandits on implicit met-

ric spaces. In Advances in Neural Information Processing Systems, pp. 1602 1610, 2011.

Snoek, Jasper, Larochelle, Hugo, and Adams, Ryan P. Prac-

tical Bayesian Optimization of Machine Learning Algorithms. In Advances in Neural Information Processing Systems, 2012.

Srinivas, Niranjan, Krause, Andreas, Kakade, Sham M, and

Seeger, Matthias. Gaussian process optimization in the bandit setting: No regret and experimental design. ar Xiv preprint ar Xiv:0912.3995, 2009.

Valko, Michal, Carpentier, Alexandra, and Munos, R emi.

Stochastic simultaneous optimistic optimization. In International Conference on Machine Learning, pp. 19 27, 2013.

Zhang, C. and Chaudhuri, K. Active Learning from Weak

and Strong Labelers. In NIPS, 2015.