Journal of Artificial Intelligence Research 56 (2016) 119-152. Submitted 07/15; published 05/16.

Budgeted Optimization with Constrained Experiments

Javad Azimi (jaazimi@microsoft.com), Microsoft, Sunnyvale, CA, USA
Xiaoli Z. Fern (xfern@eecs.oregonstate.edu), School of EECS, Oregon State University
Alan Fern (afern@eecs.oregonstate.edu), School of EECS, Oregon State University

Abstract

Motivated by a real-world problem, we study a novel budgeted optimization problem where the goal is to optimize an unknown function f(·) given a budget by requesting a sequence of samples from the function. In our setting, however, evaluating the function at precisely specified points is not practically possible due to prohibitive costs. Instead, we can only request constrained experiments. A constrained experiment, denoted by Q, specifies a subset of the input space for the experimenter to sample the function from. The outcome of Q includes a sampled experiment x and its function output f(x). Importantly, as the constraints of Q become looser, the cost of fulfilling the request decreases, but the uncertainty about the location x increases. Our goal is to manage this trade-off by selecting a set of constrained experiments that best optimizes f(·) within the budget. We study this problem in two different settings: the non-sequential (or batch) setting, where a set of constrained experiments is selected at once, and the sequential setting, where experiments are selected one at a time. We evaluate our proposed methods for both settings using synthetic and real functions. The experimental results demonstrate the efficacy of the proposed methods.

1. Introduction

This work is motivated by the experimental design problem of optimizing the power output of nano-enhanced microbial fuel cells.
Microbial fuel cells (MFCs) (Bond & Lovley, 2003; Fan, Hu, & Liu, 2007; Park & Zeikus, 2003; Reguera, McCarthy, Mehta, Nicoll, Tuominen, & Lovley, 2005) use micro-organisms to break down organic matter and generate electricity. For a particular MFC design, it is critical to optimize the biological energetics and the microbial/electrode interface of the system, which research has shown to depend strongly on the surface properties of the anodes (Park & Zeikus, 2003; Reguera et al., 2005). This motivates the design of nano-enhanced anodes, where nano-structures (e.g., carbon nanowire) are grown on the anode surface to improve the MFC's power output. Unfortunately, there is little understanding of the interaction between the various possible nano-enhancements and MFC capabilities for different micro-organisms. Thus, optimizing the anode design for a particular application is largely guesswork. Our goal is to develop algorithms to aid this process.

©2016 AI Access Foundation. All rights reserved.

Traditional experimental design, Bayesian optimization, and response surface methods (Myers, Montgomery, & Anderson-Cook, 1995; Jones, 2001; Brochu, Cora, & De Freitas, 2010) commonly assume that the experimental inputs can be specified precisely and attempt to optimize a design by requesting specific experiments, for example, requesting an anode to be tested with nano-wire of a specific length and density. However, these parameters are unlike the usual experimental control parameters (such as temperature) that can easily be set to precise values. Manufacturing nano-structures is rather an art, and it is very difficult to achieve a precise parameter setting. Instead, it is more practical to request constrained experiments, which place constraints on these parameters. For example, we may specify intervals for the length and density of the nano-wire.
Given such a request, nano-materials that satisfy the given set of constraints can be produced at some cost, which will typically increase with tighter constraints. Note that a possible alternative to requesting constrained experiments would be to treat the nano-structure manufacturing parameters as the experimental inputs. Such inputs can be precisely specified, and hence standard methods could be used. However, this approach does not yield a satisfactory solution for our problem. In particular, the mapping between the manufacturing parameters and the produced nano-structures is extremely noisy, which makes it difficult to find the manufacturing parameters that optimize the expected MFC power output. Further, the scientists are primarily interested in learning what nano-structure properties optimize the MFC power output, rather than in knowing the specific manufacturing parameters, which can vary significantly from lab to lab. Thus we focus on directly optimizing in the space of nano-structure properties via constrained experiments. Based on the above motivation, in this paper we study the associated budgeted optimization problem where, given a budget, our goal is to optimize an unknown function f(·) by requesting a set of constrained experiments. Solving this problem requires careful consideration of the trade-off between the cost and the uncertainty of a constrained experiment: weakening the constraints will lower the cost of an experiment, but increase the uncertainty about the location of the next observation. Prior work on experimental design, stochastic optimization, and active learning does not directly apply to constrained experiments because it typically assumes precise experiments. This problem can be formulated in the theoretical framework of partially observable Markov decision processes (POMDPs), where the optimal solution corresponds to finding an optimal POMDP policy.
However, solving for optimal or even near-optimal policies is computationally intractable, even in the case of traditional optimization problems. This has led researchers to develop a variety of myopic policies in the context of traditional optimization, which have been observed to achieve good performance, even in comparison to more sophisticated, less myopic strategies (Moore & Schneider, 1995; Jones, 2001; Madani, Lizotte, & Greiner, 2004; Brochu et al., 2010). Our problem can be considered in two different settings, non-sequential and sequential. In the non-sequential setting, which is also referred to as the batch setting, all constrained experiments must be selected at once. This setting is appropriate for applications where there are multiple experimental facilities and the experiments are too time-consuming to be run sequentially. In contrast, the sequential setting allows us to request one constrained experiment at a time and wait for the outputs of previous experiments before making the next request. The sequential setting has the advantage that it allows us to use the maximum available information for selecting each experiment, and it is generally expected to outperform the non-sequential setting when the total running time is not a concern. In this paper, we study both settings. For the non-sequential setting, we introduce a non-decreasing submodular objective function to select a batch of constrained experiments within the given budget. For a given set of constrained experiments, the objective measures its expected maximum output. We present a computationally efficient greedy algorithm that approximately optimizes the proposed objective.
For the sequential setting, we build on a set of classic myopic policies that have been shown to achieve good performance in traditional optimization (Moore & Schneider, 1995; Jones, 2001; Madani et al., 2004; Brochu et al., 2010) and introduce non-trivial extensions to make them applicable to constrained experiments. We present experimental evaluations for both settings using synthetic functions and functions generated from real-world experimental data. The results indicate that, in both settings, our proposed methods outperform competing baselines. The remainder of the paper is organized as follows. We introduce the background and related work in Section 2. Section 3 describes the problem setup. The non-sequential setting is studied in Section 4. Section 5 introduces our proposed methods for selecting constrained experiments in the sequential setting. Section 6 presents the empirical evaluation of the proposed methods. We conclude the paper and discuss future work in Section 7.

2. Background and Related Work

Given an unknown black-box function that is costly to evaluate, we are interested in finding the extreme point (minimizer or maximizer) of the function via a small number of function evaluations. To solve this problem, Bayesian Optimization (BO) approaches have been heavily studied (Jones, 2001; Brochu et al., 2010) and have demonstrated significant promise. There are two key components in the basic framework of Bayesian Optimization. The first component is a probabilistic model of the underlying function that is built from the prior information (i.e., the existing observed experiments). Gaussian process (GP) regression has been widely used in the BO literature for this purpose (Brochu et al., 2010). For any unobserved point, a GP models its function output as a normal random variable, with its mean predicting the expected function output of the point and the variance capturing the uncertainty associated with the prediction.
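As a concrete reference point, the GP posterior used by this framework follows directly from the standard GP regression equations. Below is a minimal sketch assuming a zero-mean prior and an RBF kernel with unit signal variance; the length scale and noise level are illustrative choices, not values from the paper:

```python
import numpy as np

def gp_posterior(X_train, y_train, X_test, length_scale=1.0, noise=1e-6):
    # Posterior mean and variance of a zero-mean GP with an RBF kernel.
    def k(A, B):
        d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
        return np.exp(-0.5 * d2 / length_scale ** 2)

    K = k(X_train, X_train) + noise * np.eye(len(X_train))  # train covariance
    K_s = k(X_test, X_train)                                 # test-train covariance
    K_inv = np.linalg.inv(K)
    mean = K_s @ K_inv @ y_train
    # Prior variance k(x*, x*) = 1 for the RBF kernel, reduced by the data.
    var = 1.0 - np.einsum('ij,jk,ik->i', K_s, K_inv, K_s)
    return mean, var
```

At observed points the mean reverts to the observations and the variance collapses toward the noise level; far from the data the variance returns to the prior, which is exactly the exploration signal the selection criteria below exploit.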
The second key component of BO is the selection criterion that is used to determine which experiment to select based on the learned model. In the existing literature, various selection criteria have been proposed, and most of them combine exploring the unexplored input space of the function (i.e., areas of high variance) with exploiting the promising areas (i.e., areas with large mean). A selection criterion can be either sequential (Jones, 2001; Locatelli, 1997; Moore, Schneider, Boyan, & Lee, 1998; Srinivas, Krause, Kakade, & Seeger, 2010), in which only one experiment is requested at each iteration, or non-sequential (Schonlau, 1997; Azimi, Fern, & Fern, 2010; Ginsbourger, Le Riche, & Carraro, 2010), where a batch of experiments is requested at each iteration. Below we review some of the most successful sequential selection criteria in the BO literature. One of the first sequential policies is based on selecting the sample with the maximum probability of improving (MPI) the best current observation ymax (assuming we want to maximize f) by a given margin α (Elder, 1992; Stuckman, 1988). The goal of MPI is to select the next experiment that has the highest probability of producing an output no smaller than (1 + α)ymax. One issue with this approach is that the performance can be sensitive to the value of the margin parameter α (Jones, 2001). For small values of α, MPI will focus on the most promising area first and then move on to unexplored areas. In contrast, for large values of α, MPI will primarily explore and converge very slowly. Selecting a proper value for α can be challenging in practice. The maximum expected improvement (MEI) (Locatelli, 1997) criterion avoids this issue and selects the experiment that directly maximizes the expected improvement over the current best observation.
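Under a GP posterior, where a candidate x has predictive mean μ and standard deviation σ, both criteria have simple closed forms. A minimal sketch (the function names are ours, for illustration):

```python
from scipy.stats import norm

def mpi(mu, sigma, y_max, alpha=0.1):
    # Probability that the outcome exceeds (1 + alpha) * y_max
    # when y ~ N(mu, sigma^2).
    return norm.cdf((mu - (1.0 + alpha) * y_max) / sigma)

def mei(mu, sigma, y_max):
    # Closed-form expected improvement E[max(y - y_max, 0)] for y ~ N(mu, sigma^2).
    z = (mu - y_max) / sigma
    return (mu - y_max) * norm.cdf(z) + sigma * norm.pdf(z)
```

In both cases the next experiment is the candidate maximizing the score; MPI's behavior depends on the margin α, while MEI is parameter-free.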
Heuristics based on upper-confidence bounds have also been explored (Srinivas et al., 2010) and have been shown to be competitive with MEI given appropriate parameter selection. However, empirically selecting the best parameters for a particular application can be a challenge. An alternative approach that has received attention is Thompson Sampling (Chapelle & Li, 2011), a randomized strategy for managing the exploration-exploitation trade-off. This approach first samples the underlying uncertainty, in our case the unknown function f, and then returns the experiment that maximizes the sample. In this work, we focus on extending the above deterministic methods for BO to constrained experiments. Extending alternatives such as Thompson sampling is a potentially interesting future direction. Recently, researchers have begun to consider non-sequential or batch Bayesian optimization (Azimi et al., 2010; Ginsbourger et al., 2010; Desautels, Krause, & Burdick, 2014), which selects multiple experiments at once. Non-sequential BO is considered more appropriate for applications where there is a need and the capability to run multiple experiments simultaneously. One non-sequential approach (Azimi et al., 2010) selects k > 1 experiments at once by matching the behavior of executing a given sequential policy (e.g., MEI) for k steps. In another non-sequential approach (Ginsbourger et al., 2010), the authors aim to select the batch of experiments that leads to the highest expected improvement. However, it was shown that the expected improvement over a set of jointly normal random variables does not have a closed-form solution when k > 2, nor can it be computed efficiently using numerical methods. Instead, simple heuristics were proposed to approximate the expected improvement and select a batch accordingly.
More recently, an algorithm based on upper-confidence bounds has also been introduced (Desautels et al., 2014), which is computationally cheaper than prior work but requires careful parameter selection. Note that all of the aforementioned approaches assume that the unknown function we aim to optimize can be sampled at precisely specified points, making them unsuitable for tasks such as our motivating nano application, where sampling the function at exact locations is impractical. The sequential approaches proposed in this paper have been previously presented in less detail (Azimi et al., 2010). Here, we provide a more complete and formal description of the sequential approaches with additional empirical results. In addition, we introduce and evaluate a batch selection algorithm that chooses a batch of constrained experiments at each iteration.

3. Problem Setup

Let X ⊆ R^d be a d-dimensional input space, where each dimension i is bounded in [A_i, B_i]. We often refer to the elements of X as experiments. We assume there is an unknown real-valued function f(x) : X → R, which represents the expected value of the dependent variable after running experiment x. In our motivating application, f(x) is the expected power output produced by a particular nano-structure x. Conducting an experiment x produces a noisy outcome y = f(x) + ε, where ε is a noise term. In traditional optimization settings (Jones, 2001; Brochu et al., 2010), the goal is to find an x ∈ X that approximately optimizes f(·) by requesting a set of experiments and observing their outcomes. Since sampling the function at exactly specified points is prohibitively expensive in our application, we instead request constrained experiments, which define a subset of experiments in X.
Specifically, we define a constrained experiment as a hyper-rectangle in X, denoted by Q = (q_1, q_2, ..., q_d), where q_i = (a_i, b_i) with A_i ≤ a_i < b_i ≤ B_i defines the range of values that is considered admissible for each input dimension i. Note that for computational reasons, in this work we consider a discretized input space, where each input dimension is divided into equal-sized intervals. As such, a constrained experiment Q indicates, for each dimension i, the first (specified by a_i) and the last (specified by b_i) intervals to be included in the hyper-rectangle. For the remainder of this paper, we will use the terms constrained experiment, hyper-rectangle, and query interchangeably. Given a constrained experiment Q, the experimenter will first construct an experiment x (we assume that x can be precisely measured after being produced) that satisfies the given constraints of Q, run the experiment, and return the noisy observation of f(x). Note that x is a random variable given Q, and we assume this conditional distribution, p_x(x|Q), is known a priori as part of the problem inputs. More precisely, for any query Q, the experimenter will return a 2-tuple (x, y), where x = (x_1, x_2, ..., x_d) is an experiment that satisfies the constraints of Q, and y = f(x) + ε is the noisy observation of the function f(·) at x. In practice, the cost c of fulfilling a constrained experiment can vary depending on the size of the hyper-rectangle. In particular, higher cost is associated with tighter constraints, i.e., smaller hyper-rectangles. We assume that this cost is modeled by a deterministic function f_c(·), which is provided to us as part of the inputs. For example, in our motivating application, f_c(·) is dominated by the time required to produce the nano-material that satisfies the given constraints, which is inversely correlated with the size of the constraints. In addition, we must operate within a total budget B.
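For intuition, a query can be represented as a list of per-dimension intervals; drawing x from p_x(x|Q) (uniform within the hyper-rectangle, as assumed later in Section 4) and computing a cost that shrinks as the rectangle grows might look as follows. The specific cost model here is an illustrative assumption, not the paper's f_c:

```python
import numpy as np

def sample_experiment(Q, rng):
    # Draw x uniformly from the hyper-rectangle Q = [(a_1, b_1), ..., (a_d, b_d)].
    return np.array([rng.uniform(a, b) for a, b in Q])

def cost(Q, bounds, base=1.0):
    # Hypothetical cost model: inversely proportional to the fraction of the
    # input space X (given by `bounds`) that the query covers.
    vol_Q = np.prod([b - a for a, b in Q])
    vol_X = np.prod([B - A for A, B in bounds])
    return base * vol_X / vol_Q
```

Under this model the loosest possible query (the whole space) has the base cost, and halving the covered volume doubles the cost, capturing the qualitative trade-off described above.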
Thus, the objective is to find a set of queries within budget B that leads to the best estimate of the maximizer of the function over the input space X. To summarize, the inputs to our problem include a set of prior experiments D (which could potentially be empty), a budget B, a deterministic cost function f_c(·) of fulfilling a constrained experiment Q, and a conditional probability density function p_x(x|Q) of the specific experiment x generated for any given constrained experiment Q. Given the inputs, our task is to select a set of constrained experiments Q = {Q_1, Q_2, ..., Q_k} whose total cost is within budget B. Running the selected constrained experiments will result in a set of k tuples {(x_i, y_i)}_{i=1}^{k}, with which we must determine a final output x ∈ {x_1, ..., x_k}, which is our prediction of the maximizer of f(·) among all observed experiments. Note that we restrict ourselves to returning an experiment that was actually observed, even in cases where we might predict some other non-observed experiment to be better. This formulation matches the objective of our motivating application, which is to produce a good nano-structure x using the given budget, rather than to make a prediction of what nano-structure might be good. We study this problem in two different settings, non-sequential (or batch) and sequential. In the non-sequential setting, we must decide the entire set of queries at the same time. In contrast, the sequential setting requests constrained experiments one at a time: only after receiving the result of the previous request is another query selected and presented to the experimenter. This procedure is repeated until we reach the budget limit. In the following two sections, we introduce our proposed solutions for both settings.

4. Non-sequential Approach

In this section, we consider the non-sequential setting, in which we must select the entire batch of queries Q within the given budget B at once.
Note that in general these batches can be multi-sets that have repeated queries, which may be desirable in noisy settings. This setting is also called the non-adaptive (Goel et al., 2006; Krause et al., 2008) or batch (Azimi et al., 2010) setting. It is commonly used in applications where we must start multiple experiments at once and cannot wait for the outputs of the previous queries to decide the next queries (Tatbul et al., 2003).

4.1 The Objective Function

Let Q_B be the set of feasible solutions such that for any Q ∈ Q_B the total cost of Q is no greater than the budget B. Our goal is to find the optimal multi-set of queries Q* = {Q*_1, Q*_2, ..., Q*_k} ∈ Q_B. To define what we mean by optimal, let us consider the outcome of the queries, which is a set of tuples (x_i, y_i), i = 1, 2, ..., k. The x_i's are the experiments produced by the experimenter given the queries, and the y_i's represent their experimental outputs (i.e., the noisy observations of f(x_i)). We will then select a final output x ∈ {x_1, ..., x_k} that is believed to achieve the maximal f(·) value. As such, for any Q ∈ Q_B, we can measure how good Q is based on the maximal y value resulting from this set of queries. Specifically, this is captured by:

$$J(\mathcal{Q}) = \mathbb{E}_{(x_1,\ldots,x_{|\mathcal{Q}|})}\left[\,\mathbb{E}_{(y_1,\ldots,y_{|\mathcal{Q}|})}\left[\max\{y_1,\ldots,y_{|\mathcal{Q}|}\} \,\middle|\, D, (x_1,\ldots,x_{|\mathcal{Q}|})\right]\right], \qquad (1)$$

where the first expectation is taken over all possible values of the x_i's, which represent the individual experiments created for each query in Q, and the second expectation is taken over all possible y_i's, which represent the experimental outcomes of the x_i's. As mentioned previously, the x_i's are distributed according to p_x(x_i|Q_i), which is part of the inputs. The distribution of the y_i's given the x_i's depends on the posterior distribution of f(·) given D. In our work, we use Gaussian processes to model the distribution of f(·). Consequently, the set of y_i's are jointly normal conditioned on all the x_i's and D.
Since our input space is discretized, we can enumerate all possible constrained experiments and denote them as Q_M = {Q_1, Q_2, ..., Q_M}, where M is the total number of possible constrained experiments; let c_1, ..., c_M be their corresponding costs (i.e., c_i = f_c(Q_i)). Let S ⊆ 𝒮 = {1, ..., M} be a subset of indices and Q_S denote the collection of queries indexed by S, i.e., Q_S = {Q_i : i ∈ S}. Our goal can then be stated as selecting an S such that the corresponding Q_S maximizes the objective (Equation 1) subject to the constraint Σ_{i∈S} c_i ≤ B. Unfortunately, optimizing this objective is intractable due to the combinatorial nature of the problem and the exponentially many possible solutions to consider. Below we reformulate the objective to demonstrate that it is a non-decreasing submodular set function and introduce an algorithm with an approximation guarantee. Specifically, we will consider a slightly different but equivalent view of the querying process. So far our view is that after S is chosen, each query Q ∈ Q_S will result in an experiment x, which can be viewed as a random sample drawn from the distribution p_x(x|Q) (note that in this work p_x is uniform within the hyper-rectangle defined by the query). From the process point of view, it clearly does not matter whether this random draw happens after Q is chosen, or at the very beginning of the whole process before Q is chosen. Following this reasoning, we can assume that for every possible query in Q_M, a random experiment is drawn at the very beginning of the process and the results are stored and used later when S is selected. Let X_M = {x_1, ..., x_M} denote the random variables representing the outcomes of the random draws for Q_1, ..., Q_M respectively. The objective can then be reformulated as:

$$J(S) = \mathbb{E}_{X_S}\left[\,\mathbb{E}_{(y_1,\ldots,y_{|S|})}\left[\max\{y_1,\ldots,y_{|S|}\} \,\middle|\, D, X_S\right]\right], \qquad (2)$$

where X_S = {x_i : i ∈ S} is the subset of X_M defined by S, and the y_i's are the noisy outcomes of the x_i's in X_S.

4.2 Approximation Algorithm

Since standard batch Bayesian Optimization is a special case of optimizing J(S), the hardness of optimizing J(S) follows from the NP-hardness of Bayesian Optimization. Thus, below we show that J(S) is a non-decreasing submodular set function and present an algorithm with a bounded approximation factor.

Definition 1. Suppose 𝒮 is a finite set. A set function g : 2^𝒮 → R+ is submodular if for all S_1 ⊆ S_2 ⊆ 𝒮 and x ∈ 𝒮 \ S_2, it holds that g(S_1 ∪ {x}) − g(S_1) ≥ g(S_2 ∪ {x}) − g(S_2).

Thus, a set function is submodular if adding an element to a smaller set provides no less improvement than adding the element to a larger superset. Also, a set function is non-decreasing if for any set S and element x we have g(S) ≤ g(S ∪ {x}). To show that J(S) is submodular, we rewrite the objective function by defining J_{X_M}(S) to be the inner expectation of Equation 2 for a fixed realization of the random variable X_M:

$$J_{X_M}(S) = \mathbb{E}_{(y_1,\ldots,y_{|S|})}\left[\max\{y_1,\ldots,y_{|S|}\} \,\middle|\, D, X_S\right].$$

Lemma 1. For any given X_M, J_{X_M}(S), which returns the expected maximum over a set of jointly distributed random variables, is a monotonically non-decreasing submodular set function.

The proof is in the Appendix. The proposed objective function, J(S) = E_{X_M}[J_{X_M}(S)], takes the expectation of J_{X_M}(S) over all possible values of X_M. Because J_{X_M}(S) is non-decreasing, it is easy to verify that J(S) is also non-decreasing. Further, noting that the set of submodular functions is closed under expectation, we can conclude that the proposed objective, J(S), is a non-decreasing submodular function. We now present our proposed algorithm for optimizing J(S).
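The diminishing-returns property asserted by Lemma 1 is easy to check numerically: for a fixed joint draw of correlated outcomes (standing in for the GP posterior over the y's), the marginal gain of adding a variable to a smaller set is never below its gain on a superset. A small sketch with made-up numbers:

```python
import numpy as np

def expected_max(values, S):
    # Monte-Carlo estimate of E[max_{i in S} y_i] from pre-drawn joint
    # samples `values` of shape (n_samples, n_variables).
    if not S:
        return 0.0
    return values[:, sorted(S)].max(axis=1).mean()

rng = np.random.default_rng(0)
# Correlated normal outcomes; the covariance values are illustrative.
cov = np.array([[1.0, 0.6, 0.2], [0.6, 1.0, 0.4], [0.2, 0.4, 1.0]])
samples = rng.multivariate_normal(mean=[0.1, 0.3, 0.2], cov=cov, size=5000)

S1, S2, new = {0}, {0, 1}, 2
gain_small = expected_max(samples, S1 | {new}) - expected_max(samples, S1)
gain_large = expected_max(samples, S2 | {new}) - expected_max(samples, S2)
assert gain_small >= gain_large  # diminishing returns, as in Lemma 1
```

Because the same samples are reused for every set and the pointwise maximum is itself submodular, the inequality holds exactly for the Monte-Carlo estimates, not just in expectation.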
The inputs to our algorithm include the set of all possible constrained experiments, Q_M = {Q_1, ..., Q_M}, their associated costs c_1, ..., c_M, and a total budget B; the output is a subset S ⊆ 𝒮 = {1, ..., M} such that Σ_{i∈S} c_i ≤ B. We first introduce a simple greedy algorithm, which begins with an initially empty set S = ∅ and greedily adds one constrained experiment (its index) at a time until the total cost of S reaches B. At each step, let S be the current set and C be the total cost of S; the greedy algorithm selects an index i* such that:

$$i^* = \arg\max_{i \notin S,\; c_i \leq B - C} \frac{J(S \cup \{i\}) - J(S)}{c_i}.$$

In other words, at each step, the algorithm greedily selects a new constrained experiment that is within the budget and leads to the largest cost-normalized improvement of the objective. It is known that this simple greedy algorithm does not have a bounded approximation factor (Khuller, Moss, & Naor, 1999). Previous work (Khuller et al., 1999; Krause & Guestrin, 2005) introduced a small change to the greedy algorithm that provides a bounded approximation factor. In particular, one just needs to consider, as an alternative to the output of the greedy algorithm, the single query that is within the budget and achieves the best objective value (denoted by S_a). By comparing this alternative with the output of the greedy algorithm, we are guaranteed to achieve a bounded approximation factor. The complete algorithm is summarized in Algorithm 1. The approximation bound for this algorithm follows from the following theorem.

Theorem 1. (Khuller et al., 1999) Let J(·) be a monotonically non-decreasing submodular set function such that J(∅) = 0, and let S be the output of Algorithm 1. If S* is the optimal solution, the following bound is guaranteed:

$$J(S) \geq \frac{1}{2}\left(1 - \frac{1}{e}\right) J(S^*).$$

The dominating factor of the run time is the linear dependence on M, the number of possible queries. Note that in the discretized setting that we consider, M will be exponential in the number of dimensions.
In the scientific application domains that motivate our work, the number of dimensions is typically small (e.g., 2 or 3). However, when working with a fine-resolution discretization, the computation time can still be significant. To address this issue, in the next section we describe a simple strategy for soundly pruning candidate queries from consideration, which yields significant speedups. Problems with significantly higher dimensions, however, will require continuous rather than discretized optimization.

Algorithm 1 The Greedy Non-Sequential Algorithm
Input: D, B > 0, {Q_1, ..., Q_M}, {c_1, ..., c_M}
Output: a set of indices S ⊆ 𝒮 = {1, ..., M} such that Σ_{i∈S} c_i ≤ B
1: i* ← argmax_{1≤i≤M, c_i≤B} J({i}); S_a ← {i*}
2: S ← ∅; C ← 0
3: while C < B and there exists i ∉ S with c_i ≤ B − C do
4:   i* ← argmax_{i∉S, c_i≤B−C} [J(S ∪ {i}) − J(S)] / c_i
5:   C ← C + c_{i*}; S ← S ∪ {i*}
6: end while
7: if J(S) ≥ J(S_a) then return S else return S_a

Algorithm 2 Accelerated Greedy Algorithm
Input: D, B > 0, {Q_1, ..., Q_M}, {c_1, ..., c_M}
Output: a set of indices S ⊆ 𝒮 = {1, ..., M} such that Σ_{i∈S} c_i ≤ B
1: i* ← argmax_{1≤i≤M, c_i≤B} J({i}); S_a ← {i*}
2: S ← ∅; C ← 0; δ(i) ← J({i})/c_i for i = 1, ..., M
3: while C < B and there exists i ∉ S with c_i ≤ B − C do
4:   repeat
5:     z ← argmax_{i∉S, c_i≤B−C} δ(i)
6:     δ(z) ← [J(S ∪ {z}) − J(S)] / c_z
7:   until δ(z) ≥ max_{i∉S∪{z}, c_i≤B−C} δ(i)
8:   C ← C + c_z; S ← S ∪ {z}
9: end while
10: if J(S) ≥ J(S_a) then return S else return S_a

4.3 Accelerated Greedy Algorithm

In this section, following prior applications of submodular optimization (e.g., Krause & Guestrin, 2005), we describe an accelerated greedy algorithm, which yields significant gains in computational efficiency. At each greedy step, let S represent the set of queries that have been selected so far. To make another greedy selection, we need to compute the cost-normalized incremental difference δ(i) = [J(S ∪ {i}) − J(S)] / c_i for each candidate query i such that i ∉ S and c_i ≤ B − C.
This computation can be expensive because the number of candidate queries is often very large. Fortunately, by carefully maintaining the normalized incremental differences calculated in the first greedy step, we can avoid recomputing a large majority of them in later iterations. Specifically, the first iteration computes the δ(i) values for all i ∈ 𝒮. We then sort them in decreasing order of their δ values, select the first query, and remove it from the list. For the next iteration, we move on to the next query in the sorted list and recompute its δ value. If the value remains the largest, we immediately select this query, remove it from the list, and proceed to the next iteration without recomputing any other δ values. Otherwise, we proceed to evaluate the next query in the sorted list until finding one whose recomputed δ value is greater than the other stored values, and select that query. The submodular property of our objective guarantees that this approach makes the same choices as the full greedy algorithm, but it effectively avoids a large number of computations of the δ values in practice. The proposed accelerated algorithm is summarized in Algorithm 2.

4.4 Computation of the Expected Maximum

For any given set S, to compute J(S) we need to compute the expected maximum value of a set of jointly distributed random variables (y_1, ..., y_{|S|}). Unfortunately, the expected maximum of a set of dependent random variables generally does not have a closed-form solution (Ross, 2008). Instead, we use a Monte-Carlo simulation approach for computing the expected maximum value. Specifically, given S, to compute J(S) we first sample one experiment for each Q ∈ Q_S, resulting in {x_1, ..., x_{|S|}}. We then sample the y_i's from their posterior distribution p_y(y_1, ..., y_{|S|} | x_1, ..., x_{|S|}, D) and take the maximum of the sampled y_i's. This is repeated l independent times, and the expected maximum is then obtained by averaging across the l results.
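Putting Sections 4.2 and 4.4 together, the following sketch estimates J(S) by the Monte-Carlo procedure just described and runs the greedy selection of Algorithm 1 on top of it. The posterior sampler is a stand-in for the GP posterior, assumed here for illustration:

```python
import numpy as np

def J_hat(S, queries, post_sampler, rng, n_sims=200):
    # Monte-Carlo estimate of the expected maximum outcome of the queries in S:
    # repeatedly draw one x per query, jointly sample their y's, take the max.
    if not S:
        return 0.0
    vals = []
    for _ in range(n_sims):
        X = [np.array([rng.uniform(a, b) for a, b in queries[i]]) for i in sorted(S)]
        vals.append(post_sampler(X, rng).max())
    return float(np.mean(vals))

def greedy_batch(costs, B, J):
    # Algorithm 1: cost-normalized greedy plus the best-single-query fallback.
    affordable = [i for i, c in enumerate(costs) if c <= B]
    Sa = {max(affordable, key=lambda i: J({i}))} if affordable else set()
    S, C = set(), 0.0
    while True:
        cand = [i for i, c in enumerate(costs) if i not in S and c <= B - C]
        if not cand:
            break
        base = J(S)
        i_star = max(cand, key=lambda i: (J(S | {i}) - base) / costs[i])
        S, C = S | {i_star}, C + costs[i_star]
    return S if J(S) >= J(Sa) else Sa
```

Here J can be any set-function estimator, e.g. `lambda S: J_hat(S, queries, sampler, rng)`; the fallback set Sa is what turns the unbounded plain greedy into the approximation-guaranteed algorithm.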
Note that our computation of the expected maximum value with simulation will not be exact. Denoting the simulated objective by Ĵ, standard Chernoff bounds can be used to bound the error of Ĵ(S) with high probability. Assuming a bounded error, that is, |J(S) − Ĵ(S)| ≤ ε for some ε ≥ 0, the following theorem holds for non-decreasing submodular objective functions:

Theorem 2. (Krause & Guestrin, 2005) Let J(·) be a non-decreasing submodular function and S* = argmax_{S : c(S) ≤ B} J(S) be the cost-constrained optimizer of J. For any Ĵ(·) such that |J(S) − Ĵ(S)| ≤ ε for all S, if Algorithm 1 is run using Ĵ in place of J, then the returned set Ŝ satisfies the following approximation bound, where c_min = min_i c_i:

$$J(\hat{S}) \geq \frac{1}{2}\left(1 - \frac{1}{e}\right) J(S^*) - \left(\frac{B}{c_{\min}} + |S^*|\right)\epsilon. \qquad (5)$$

5. Sequential Approach

In this section, we consider the sequential setting (Deshpande et al., 2004; Silberstein et al., 2006), where each query is selected one at a time after the result of the previous query becomes available. This is the most commonly studied setting for Bayesian Optimization and is appropriate for many applications where there is only a single experimental facility to process the queries. The inputs to our problem remain the same: B, the total budget; f_c(·), the cost function; p_x(x|Q), the distribution of the constructed experiment x given query Q; and D, the set of observed experiments. In the sequential setting, given the inputs, we must request a sequence (one at a time) of constrained experiments whose total cost is within the budget. Leveraging the extensive body of research on traditional Bayesian Optimization, we design our sequential selection policies by extending a number of well-established myopic sequential selection policies from the literature.
Most existing policies for traditional Bayesian optimization can be viewed as defining a greedy heuristic that assigns a score to each candidate experiment x based on the current experimental state, which we denote by (Dc, Bc), where Dc represents the current set of prior experiments and Bc represents the current remaining budget. As reviewed in Section 2, many of the existing heuristics have been observed to perform well for traditional Bayesian optimization problems. Unfortunately, they cannot be directly used for our problem since they select individual specific experiments rather than constrained experiments, as we require.

5.1 Model-Free Policies

Model-free policies do not consider statistical models of the data in making the selection. In this work we consider two model-free policies, Round Robin (RR) and Biased Round Robin (BRR), which are motivated by previous work on budgeted multi-armed bandit problems (Lizotte et al., 2003; Madani et al., 2004).

5.1.1 Round Robin (RR)

In the multi-armed bandit setting, the RR policy seeks to keep the number of pulls of each arm as uniform as possible. In the context of constrained experiments, we apply the same principle to keep the experiments as uniformly distributed as possible in the experimental space X. Given the current experimental state (Dc, Bc), we define the RR policy to return the largest hyper-rectangle, i.e., the least costly query Q, that does not contain any previous experiment in Dc. If the cost of such a Q exceeds the current budget Bc, we return the constrained experiment with cost Bc that contains the fewest experiments in Dc. Ties are broken randomly. Note that in RR the outputs of previous queries have no effect on selecting the next queries. However, the exact locations of the previous experiments do have a significant effect on the next query selection. Therefore, we cannot consider RR a non-sequential approach.
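The RR rule can be sketched as follows. The representation of a query as a `(lower, upper)` pair of coordinate tuples, and the `cost` callback, are illustrative assumptions; the paper does not prescribe a data structure.

```python
import random

def contains(Q, x):
    """Whether hyper-rectangle Q = (lower, upper) contains point x."""
    lo, hi = Q
    return all(l <= xi <= h for l, xi, h in zip(lo, x, hi))

def round_robin(candidates, prior_xs, cost, budget, rng=None):
    """RR sketch: cheapest affordable query containing no prior experiment."""
    rng = rng or random.Random(0)
    feasible = [Q for Q in candidates if cost(Q) <= budget]
    empty = [Q for Q in feasible
             if not any(contains(Q, x) for x in prior_xs)]
    if empty:  # loosest (least costly) empty query; ties broken randomly
        c = min(cost(Q) for Q in empty)
        return rng.choice([Q for Q in empty if cost(Q) == c])
    # fallback: the affordable query covering the fewest prior experiments
    def n_inside(Q):
        return sum(contains(Q, x) for x in prior_xs)
    m = min(n_inside(Q) for Q in feasible)
    return rng.choice([Q for Q in feasible if n_inside(Q) == m])
```

Note that, as the text says, only the locations in Dc enter this computation, never the observed outputs.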
5.1.2 Biased Round Robin (BRR)

The BRR policy behaves identically to RR, except that it repeats the previously selected constrained experiment as long as the outcome of that constrained experiment improves performance and its cost does not exceed the budget. In particular, given the current experimental state (Dc, Bc), the query Q is repeated as long as it results in an outcome that improves over the current best outcome in the set Dc and fc(Q) ≤ Bc. Otherwise, the RR policy is followed. This policy is analogous to BRR in multi-armed bandit problems (Madani et al., 2004), where an arm is pulled as long as it has a positive outcome.

5.2 Model-Based Policies

For model-based policies, it is assumed that a conditional posterior distribution p(y|Dc, x) over the outcome y of each individual experiment x ∈ Q is learned from the set of currently observed experiments Dc. Existing model-based myopic policies for traditional experimental design typically select the experiment x that maximizes certain heuristics computed from statistics of the posterior (Jones, 2001). The heuristics provide different mechanisms for trading off between exploration (probing unexplored regions of the experimental space) and exploitation (probing areas that appear promising) given Dc. Note that in the experimental design literature, the experiment x is a fixed and known point before running the real experiment in the lab, since it is assumed that we can ask for a particular fixed point. However, in our constrained experiment application, we ask for a hyper-rectangle query Q rather than a fixed experiment point x. Therefore, the conditional posterior distribution for each constrained experiment Q is defined as p(y|Q, Dc) = E_{x|Q} [p(y|x, Dc)]. This definition corresponds to the process of drawing an experiment x from Q and then drawing an outcome for x from p(y|x, Dc).
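The definition p(y|Q, Dc) = E_{x|Q}[p(y|x, Dc)] corresponds directly to ancestral sampling, which a short sketch makes concrete. Here `sample_x` and `posterior_y` (returning a posterior mean and variance at a point, as a Gaussian-process model would) are assumed interfaces, not names from the paper.

```python
import numpy as np

def sample_outcome(Q, sample_x, posterior_y, rng):
    """One ancestral draw from p(y | Q, Dc): first x ~ px(.|Q), then y ~ p(y|x, Dc)."""
    x = sample_x(Q, rng)            # draw an experiment from the query region
    mu, var = posterior_y(x)        # posterior mean/variance at x (assumed interface)
    return rng.normal(mu, np.sqrt(var))
```

Averaging many such draws approximates any statistic of p(y|Q, Dc), which is how the heuristics below can be evaluated when no closed form is available.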
The distribution p(y|Q, Dc) effectively allows us to treat constrained experiments as if they were individual experiments in a traditional optimization problem. Thus, we can define heuristics for constrained experiments by computing the same statistics of the posterior p(y|Q, Dc) as used in traditional optimization. In this work we consider four such heuristics.

5.2.1 Maximum Mean (MM)

In the context of traditional optimization with individual experiments, the MM heuristic, also known as PMAX (Moore & Schneider, 1995; Moore et al., 1998; Schneider & Moore, 2002), simply selects the experiment that has the largest expected outcome according to the current posterior. For our constrained experiments, the MM heuristic computes the expected outcome of a given query according to the current posterior p(y|Q, Dc). The MM of any query Q is computed as follows:

MM(Q|Dc) = E[y|Q, Dc], where y ~ p(y|Q, Dc). (6)

MM is a purely exploitative heuristic and has the weakness that it can often be too greedy and get stuck at a poor local maximum before exploring the rest of the experimental space.

5.2.2 Maximum Upper Interval (MUI)

The MUI heuristic, also known as IEMAX in previous work (Moore & Schneider, 1995; Moore et al., 1998; Schneider & Moore, 2002), attempts to overcome the greedy nature of MM by exploring areas with non-trivial probability of achieving a good result, as measured by the upper bound of the 95% confidence interval of the output prediction. Thus, the MUI heuristic for any constrained experiment Q is calculated as follows (assuming that a Gaussian process is used for estimating the posterior distribution of f(·)):

MUI(Q|Dc) = E[y|Q, Dc] + 1.96 √Var[y|Q, Dc], where y ~ p(y|Q, Dc). (7)

Intuitively, MUI will aggressively explore untouched regions of the experimental space, since the outcomes in such regions will have high posterior variance.
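Because p(y|Q, Dc) has no closed form in general, a Monte-Carlo estimate of Eqs. (6) and (7) is one natural implementation; the sketch below assumes the same hypothetical `sample_x`/`posterior_y` interfaces as before and is not the paper's own code.

```python
import numpy as np

def mm_mui(Q, sample_x, posterior_y, n=2000, seed=0):
    """Monte-Carlo estimates of MM (Eq. 6) and MUI (Eq. 7) for one query Q."""
    rng = np.random.default_rng(seed)
    ys = np.empty(n)
    for j in range(n):
        x = sample_x(Q, rng)              # x ~ px(.|Q)
        mu, var = posterior_y(x)          # posterior at x (assumed interface)
        ys[j] = rng.normal(mu, np.sqrt(var))
    # MUI adds 1.96 standard deviations of y under p(y|Q, Dc)
    return ys.mean(), ys.mean() + 1.96 * ys.std()
```

The variance estimated here mixes the posterior variance at each x with the variance induced by the uncertain location x, which is exactly why MUI favors loose, unexplored queries.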
However, as experimentation continues and uncertainty decreases, MUI focuses on the most promising areas and behaves like MM. Note that MUI is a specific case of the more general GP-UCB heuristic (Cox & John, 1992, 1997), where the constant 1.96 is replaced by a varying parameter that yields certain theoretical guarantees. Empirically, GP-UCB has been observed to perform comparably to the MEI heuristic, which we introduce later in this section.

5.2.3 Maximum Probability of Improvement (MPI)

In the context of individual experiments, the MPI heuristic corresponds to selecting the experiment that has the highest probability of generating an outcome y that outperforms the best current outcome in Dc. An issue with the basic MPI strategy is that it tends to behave similarly to MM, focusing on areas that currently look promising rather than exploring unknown areas. The reason for this behavior is that basic MPI does not take into consideration the amount of improvement over the current best outcome. In particular, it is typical for the posterior to assign small variances to the outcomes in well-explored regions. This means that while there might be a good probability of observing a small improvement, the probability of a substantial improvement is small. Hence, it is common to use a margin α with MPI, which we will refer to as MPI(α). Let y*_c represent the current best outcome observed in Dc. Then MPIα(Q|Dc) is the probability that the outcome of the constrained experiment Q will be greater than (1 + α)y*_c (assuming non-negative y*_c values). The MPI heuristic is given by:

MPIα(Q|Dc) = p(y ≥ (1 + α)y*_c | Q, Dc), where y ~ p(y|Q, Dc). (8)

The MPI(α) heuristic is sensitive to the margin parameter α. Adjusting α from small to large changes the heuristic's behavior from more exploitative to more explorative.
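Eq. (8) can be estimated by the same Monte-Carlo scheme: count the fraction of ancestral draws that clear the margin. As before, `sample_x` and `posterior_y` are assumed interfaces for illustration.

```python
import numpy as np

def mpi(Q, sample_x, posterior_y, y_best, alpha=0.0, n=2000, seed=0):
    """Monte-Carlo estimate of MPI(alpha) (Eq. 8) for one query Q."""
    rng = np.random.default_rng(seed)
    hits = 0
    for _ in range(n):
        x = sample_x(Q, rng)
        mu, var = posterior_y(x)
        hits += rng.normal(mu, np.sqrt(var)) >= (1.0 + alpha) * y_best
    return hits / n
```

Setting alpha=0 recovers the basic (margin-free) MPI; larger alpha demands a larger improvement and so pushes the policy toward exploration.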
5.2.4 Maximum Expected Improvement (MEI)

The maximum expected improvement (Locatelli, 1997) heuristic seeks to improve on the basic MPI heuristic without requiring a margin parameter α. Rather than focusing on the probability of improvement, it considers the expected amount of improvement under the current posterior. In particular, let I(y, y*_c) = max(y − y*_c, 0). Then the MEI heuristic is defined as:

MEI(Q|Dc) = E_y[I(y, y*_c) | Q, Dc], where y ~ p(y|Q, Dc). (9)

5.3 Cost-Sensitive Policies

The sequential heuristics introduced above do not take the variable cost of a constrained experiment into account. If only the heuristic value is used as our selection criterion, the most costly constrained experiment might be selected. In fact, the nature of our heuristics will typically assign the highest score to constrained experiments that are maximally constrained and centered around the individual experiment that maximizes the heuristic. Unfortunately, such constrained experiments are also maximally costly. More generally, assume that the cost of a constrained experiment Q monotonically decreases with the size of its region of support, which is the most natural case to consider. It is easy to show that for all of our heuristics, the value of a constrained experiment Q is monotonically non-decreasing with respect to the cost of Q: shrinking a constrained experiment removes points whose heuristic values are lower than the value of the query itself, which can only increase the query's value. Thus, maximizing the heuristics defined above leads to the selection of the most costly experiments, which might consume more budget than is warranted. This suggests a fundamental trade-off between the heuristic values and the costs of the constrained experiments that we must address. Below, we introduce two approaches that address this trade-off by defining cost-sensitive policies from the cost-insensitive heuristics.
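Eq. (9) admits the same Monte-Carlo treatment as the other heuristics; the sketch below uses the same hypothetical `sample_x`/`posterior_y` interfaces.

```python
import numpy as np

def mei(Q, sample_x, posterior_y, y_best, n=2000, seed=0):
    """Monte-Carlo estimate of MEI (Eq. 9): E[max(y - y_best, 0)] under p(y|Q, Dc)."""
    rng = np.random.default_rng(seed)
    total = 0.0
    for _ in range(n):
        x = sample_x(Q, rng)
        mu, var = posterior_y(x)
        total += max(rng.normal(mu, np.sqrt(var)) - y_best, 0.0)
    return total / n
```

Unlike MPI(α), no margin is needed: small improvements contribute little to the expectation on their own, so MEI naturally discounts them.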
5.3.1 Cost Normalized (CN) Policies

Cost-normalized policies have been widely used in budgeted optimization settings where the costs are non-uniform across experiments (e.g., see Krause et al., 2008; Snoek, Larochelle, & Adams, 2012). A CN policy simply selects the constrained experiment that achieves the highest expected improvement per unit cost, or rate of improvement. Suppose H is our heuristic. We can define a corresponding CN policy for any heuristic on constrained experiment Q, given the current experimental state {Dc, Bc}, as follows:

CNH(Dc, Bc) = argmax_{Q : fc(Q) ≤ Bc} H(Q|Dc) / fc(Q)
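The CN selection rule, scoring each affordable query by heuristic value per unit cost, can be sketched as follows; `heuristic` here stands for any of MM, MUI, MPI(α), or MEI evaluated at the current state, and the callback names are illustrative assumptions.

```python
def cost_normalized(candidates, heuristic, cost, budget):
    """CN policy sketch: affordable query with the best heuristic value per unit cost."""
    feasible = [Q for Q in candidates if cost(Q) <= budget]
    if not feasible:
        return None
    return max(feasible, key=lambda Q: heuristic(Q) / cost(Q))
```

Normalizing by cost counteracts the monotonicity problem noted above: a maximally tight query no longer wins automatically, since its high heuristic value is divided by its high cost.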