# Bayesian Optimization under Stochastic Delayed Feedback

Arun Verma*¹, Zhongxiang Dai*¹, Bryan Kian Hsiang Low¹

**Abstract.** Bayesian optimization (BO) is a widely-used sequential method for zeroth-order optimization of complex and expensive-to-compute black-box functions. The existing BO methods assume that the function evaluation (feedback) is available to the learner immediately or after a fixed delay. Such assumptions may not be practical in many real-life problems like online recommendations, clinical trials, and hyperparameter tuning where feedback is available after a random delay. To benefit from the experimental parallelization in these problems, the learner needs to start new function evaluations without waiting for delayed feedback. In this paper, we consider the BO under stochastic delayed feedback problem. We propose algorithms with sub-linear regret guarantees that efficiently address the dilemma of selecting new function queries while waiting for randomly delayed feedback. Building on our results, we also make novel contributions to batch BO and contextual Gaussian process bandits. Experiments on synthetic and real-life datasets verify the performance of our algorithms.

*Equal contribution. ¹Department of Computer Science, National University of Singapore, Republic of Singapore. Correspondence to: Arun Verma, Zhongxiang Dai.

Proceedings of the 39th International Conference on Machine Learning, Baltimore, Maryland, USA, PMLR 162, 2022. Copyright 2022 by the author(s).

## 1. Introduction

Bayesian optimization (BO) (Shahriari et al., 2015; Frazier, 2018; Garnett, 2022) is a popular and widely-used sequential method for zeroth-order optimization of unknown black-box functions that are complex and expensive to compute. The existing BO methods assume that the function evaluation (feedback) is available to the learner immediately or after a fixed delay. However, these assumptions are impractical in many real-life problems where feedback is available after a random delay. To take advantage of the experimental parallelization in these problems, the learner needs to start new function evaluations without waiting for the randomly delayed feedback. We refer to this new BO problem as Bayesian Optimization under Stochastic Delayed Feedback (BO-SDF). In this paper, we propose algorithms that efficiently address the problem of selecting new function queries while waiting for randomly delayed feedback. Specifically, we answer the following question: how to start a new function query when the observations of the past function queries are randomly delayed?

Many real-life applications can be cast as BO-SDF problems. For example, when performing clinical trials to discover new medicines, we need to optimize the amount of different raw materials in order to find their most effective composition using a small number of clinical trials (Chow & Chang, 2006), which is a natural application for BO due to its sample efficiency. However, the evaluation of the effectiveness of a medicine needs to take into account the side effects, which are usually not observed immediately but are instead revealed after a random period due to the varying physiology of different patients. Therefore, a natural challenge here is how to use BO to choose the next composition for testing, while accounting for the stochastically delayed observations from some previous clinical trials.
A similar challenge also arises when we aim to find the most effective dose of a new medicine through clinical trials (Takahashi & Suzuki, 2021). Another motivating example arises from online product recommendation (Chapelle, 2014; Diemert et al., 2017). Most online platforms make recommendations in milliseconds, but the user's response (i.e., watching a movie or buying a product) generally happens after a random time period which may range from hours to days or even weeks. Furthermore, the issue of randomly delayed feedback also plagues other common applications of BO such as hyperparameter tuning of machine learning (ML) models (Snoek et al., 2012) (e.g., the training time of a neural network depends on the number of layers, network width, among others) and material discovery (Ueno et al., 2016).

The closest BO setting to our BO-SDF is batch BO (Desautels et al., 2014; Daxberger & Low, 2017; Chowdhury & Gopalan, 2019), which also allows multiple queries to be performed on the black-box function in parallel. Both problems require choosing an input query to evaluate the black-box function while the observations of some previously selected inputs are not available (i.e., delayed). However, there are important differences because the delays are fixed in a certain way in batch BO but can be random in BO-SDF. Interestingly, batch BO can be considered as a special case of our BO-SDF problem and hence our algorithms proposed for BO-SDF can be applied to batch BO while achieving important improvements (Section 4).

Following the practice of BO, in order to choose input queries that quickly approach the global optimum, we employ a Gaussian process (GP) (Rasmussen & Williams, 2006) to model the black-box function, which builds a posterior belief of the function using the history of function queries and their feedback. However, when constructing this posterior belief in BO-SDF problems, we are faced with the important question of how to handle the randomly delayed feedback whose observations are not available. To this end, we replace the unavailable delayed feedback by the minimum function value,[1] which we refer to as censored feedback. The use of censored feedback, interestingly, improves the exploration in BO-SDF problems (Section 3.1) and hence leads to better theoretical and empirical performances. With the posterior belief built using censored feedback, we propose algorithms using upper confidence bound (UCB) and Thompson sampling (TS), both of which enjoy sub-linear regret guarantees.

[1] In many problems, the minimum value of a function is known; e.g., a user's minimum response in online recommendation systems is no click (i.e., 0), and the minimum accuracy for hyperparameter tuning of machine learning models is 0.

Specifically, our contributions are as follows:

- We introduce and formalize the notion of censored feedback in the BO-SDF problem, and propose UCB- and TS-based algorithms with sub-linear regret guarantees (Section 3).
- Since batch BO is a special case of BO-SDF, we apply our proposed algorithms (Section 3) to batch BO and show that our algorithms enjoy tighter regret upper bounds than classic batch BO algorithms (Section 4). This gain is mainly attributed to the censored feedback, which leads to a better exploration policy.
- We extend our algorithms for the BO-SDF problem to contextual Gaussian process bandits (Krause & Ong, 2011) with stochastic delayed feedback in Section 5. This new contribution is itself of independent interest.
- Our experimental results in Section 6 validate the different performance aspects of our proposed algorithms on synthetic and real-life datasets.
## 2. Problem Setting

This paper considers the Bayesian optimization (BO) problem where the function evaluations (feedback) are available to the learner after a random delay. Let $\mathcal{Q} \subset \mathbb{R}^n$ be a finite domain of all function queries, where $n \ge 1$.[2] The learner selects a new function query as soon as enough resources are available (or when experiment parallelization is possible). We denote the $t$-th query by $x_t \in \mathcal{Q}$. After the learner selects the query $x_t$ to evaluate the unknown black-box function $f$, the environment generates a noisy function evaluation, denoted by the feedback $y_t = f(x_t) + \varepsilon_t$. We assume that $\varepsilon_t$ is an $R$-sub-Gaussian noise. The learner observes $y_t$ only after a stochastic delay $d_t$, which is generated from an unknown distribution $\mathcal{D}$. We refer to this new BO problem as Bayesian Optimization under Stochastic Delayed Feedback (BO-SDF). The unknown function $f$, query space $\mathcal{Q}$, and unknown delay distribution $\mathcal{D}$ identify an instance of the BO-SDF problem.

[2] We assume $\mathcal{Q}$ to be finite, but it is straightforward to extend our theoretical results to a compact domain using a suitable discretization scheme (Li & Scarlett, 2022, Lemma 2).

The optimal query $x^\star$ has the maximum function value (global maximizer), i.e., $x^\star \in \arg\max_{x \in \mathcal{Q}} f(x)$. After selecting a query $x_t$, the learner incurs a penalty (or instantaneous regret) $p_t = f(x^\star) - f(x_t)$. Since the optimal function query is unknown, we sequentially estimate it using the available information of the selected queries and observed feedback. Our goal is to learn a sequential policy for selecting queries that finds the optimal query (or global maximizer) as quickly as possible.

There are two common performance measures for evaluating a sequential policy. The first performance measure is simple regret. Let $x_t$ be the $t$-th function query selected by the policy. Then, after observing $T$ function evaluations, the simple regret is $r_T = f(x^\star) - \max_{t \in \{1,\dots,T\}} f(x_t)$. Any good policy must have no regret, i.e., $\lim_{T \to \infty} r_T = 0$. The second performance measure is the cumulative regret, which is the sum of the total penalties incurred by the learner. After observing $T$ function evaluations, the cumulative regret of a policy is given by $R_T = \sum_{t=1}^{T} \left( f(x^\star) - f(x_t) \right)$. Any good policy should have sub-linear regret, i.e., $\lim_{T \to \infty} R_T / T = 0$. Even though simple regret and cumulative regret are different performance measures, having a policy with no regret or sub-linear regret implies that the policy will eventually converge to the optimal query (or global optimizer).
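To make the interaction protocol concrete, the following minimal Python sketch simulates a BO-SDF instance with Poisson-distributed delays and tracks the simple and cumulative regret defined above. The objective, domain, noise scale, and random-query policy are illustrative assumptions; an actual BO-SDF algorithm would replace the random policy with an acquisition rule built from the censored GP posterior introduced in Section 3.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical finite query domain Q and black-box objective f (illustrative only).
Q = np.linspace(0.0, 1.0, 101)
f = lambda x: np.exp(-(x - 0.37) ** 2 / 0.02)
f_star = f(Q).max()

R, mean_delay, T = 0.05, 5.0, 50           # noise scale, Poisson delay mean, horizon
pending = []                               # (reveal_iteration, x, y) not yet observed
observed = []                              # (x, y) pairs available to the learner
cum_regret, best_seen = 0.0, -np.inf

for t in range(T):
    # Placeholder policy: pick a uniformly random query; a BO-SDF algorithm would
    # instead maximize an acquisition function computed from `observed`.
    x_t = rng.choice(Q)
    y_t = f(x_t) + R * rng.normal()        # noisy feedback is generated now ...
    d_t = rng.poisson(mean_delay)          # ... but revealed only after d_t iterations
    pending.append((t + d_t, x_t, y_t))

    # Reveal feedback whose delay has elapsed before the next query is selected.
    observed += [(x, y) for (r, x, y) in pending if r <= t]
    pending = [(r, x, y) for (r, x, y) in pending if r > t]

    cum_regret += f_star - f(x_t)          # instantaneous regret f(x*) - f(x_t)
    best_seen = max(best_seen, f(x_t))

print("simple regret:", f_star - best_seen, "cumulative regret:", cum_regret)
```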
## 3. BO under Stochastic Delayed Feedback

A function estimator is the main component of any BO problem for achieving good performance. We use a Gaussian process (GP) as a surrogate for the posterior belief of the unknown function (Rasmussen & Williams, 2006; Srinivas et al., 2010). To deal with the delayed feedback, we will introduce the notion of censored feedback in GPs, in which the delayed feedback is replaced by the minimum function value. Finally, we will propose UCB- and TS-based algorithms with sub-linear regret guarantees.

### 3.1. Estimating the Function using Gaussian Processes

We assume that the unknown function $f$ belongs to the reproducing kernel Hilbert space (RKHS) associated with a kernel $k$, i.e., $\|f\|_k \le B_f$ where $B_f > 0$. By following standard practice in the literature, we assume without loss of generality that $k(x, x') \le 1$ for all $x, x' \in \mathcal{Q}$. We also assume that the feedback is bounded, i.e., $|y_t| \le B_y$ for all $t \ge 1$. This assumption is not restrictive since any bounded function can be re-scaled to satisfy this boundedness requirement. For example, the validation accuracy of machine learning models in our hyperparameter tuning experiment in Section 6 is bounded in $[0, 1]$.

**Censored Feedback.** The learner has to efficiently exploit the available information about queries and feedback to achieve better performance. The estimated GP posterior belief has two parts: the mean function and the covariance function. The posterior covariance function only needs the function queries for building tighter confidence intervals, whereas the posterior mean function needs both the queries and their feedback. Incorporating the queries with delayed feedback when updating the posterior mean function leads to a better estimate, but the question is how to do it. One possible solution is to replace the delayed feedback with some value, but then the question is what that value should be.

To pick a suitable value for replacing delayed feedback, we take motivation from real-life applications like the online movie recommendation problem. When an online platform recommends a movie, it expects one of the following responses from the user: not watching the movie, watching some part of the movie, or watching the entire movie. The platform can map these responses to $[0, 1]$, where 0 is assigned for "not watching the movie", 1 for "watching the entire movie", and an appropriate value in $(0, 1)$ for "watching some part of the movie". There are two reasons for the delay in the user's response: the user does not like the recommended movie, or does not have time to watch the movie. Before the user starts watching the movie, the platform can replace the user's delayed response with "not watching the movie" (i.e., 0) and update it later when more information is available. Therefore, we can replace the delayed feedback with the minimum function value.[3] We refer to this replaced feedback as censored feedback. The idea of censored feedback is also used in other problems such as censoring delayed binary feedback (Vernade et al., 2020) and censoring losses in online resource allocation problems (Verma et al., 2019; 2021).

[3] It is possible to replace the delayed feedback with other values. When it is replaced by the current GP posterior mean, it has the same effect as the hallucination in batch BO (Desautels et al., 2014) (more details in Section 4). Another possible value for replacing the delayed feedback is the k-nearest posterior means where k > 0.

When the minimum function value replaces the delayed feedback, the posterior mean around that query becomes small, consequently ensuring that the learner will not select queries around the queries with delayed feedback before exploring other parts of the query domain. Therefore, censored feedback leads to better exploration than replacing delayed feedback with a larger value (e.g., the current posterior mean). However, the queries with delayed feedback need to be stored so that the posterior mean can be updated when their feedback is revealed.
Due to several reasons (e.g., limited storage, making resources available to new queries), the learner cannot keep all queries with delayed feedback and instead replaces the oldest stored queries with the latest ones.[4] Discarding queries in this way can degrade the algorithm's performance, as discussed in Section 3.3 and demonstrated by the experiments in Section 6.

[4] Our motivation for removing older queries comes from online recommendation problems where the chance of observing feedback diminishes with time. The learner can also use a more complicated scheme for query removal.

Let an incomplete query denote a function query whose feedback has not yet been observed, and let $m$ be the number of incomplete queries the learner can store. We assume that the random delay for a query is the number of new queries that need to be started (i.e., the number of iterations elapsed) before observing its feedback. Therefore, the delay $d_t$ only takes non-negative integer values.[5] The censored feedback of the $s$-th query before selecting the $t$-th query is denoted by $\tilde{y}_{s,t} = y_s \, \mathbb{1}\{d_s \le \min(m, t - s)\}$. That is, the learner censors the incomplete queries by assigning 0 to them.[6]

[5] We have introduced the delay based on iterations here to bring out the main ideas of censored feedback. Any suitable discrete distribution (e.g., Poisson) can model such delays. We have also introduced censored feedback with time-based delay, which is more practical and can be modeled using a suitable continuous distribution (see Appendix A.4 for more details).

[6] We can make the minimum value of any bounded function 0 by adding to it the negative of the minimum function value. This does not change the optimizer since the optimizer of a function is invariant to a constant shift of the function. By doing so, we can represent censored feedback by an indicator function, which makes it easier to handle in the theoretical analysis.

Now, the GP posterior mean and covariance functions can be expressed using the available function queries and their censored feedback as follows:

$$
\mu_{t-1}(x) = \mathbf{k}_{t-1}(x)^{\top} \mathbf{K}_{t,\lambda}^{-1} \tilde{\mathbf{y}}_{t-1}, \qquad
\sigma^2_{t-1}(x, x') = k(x, x') - \mathbf{k}_{t-1}(x)^{\top} \mathbf{K}_{t,\lambda}^{-1} \mathbf{k}_{t-1}(x') \quad (1)
$$

where $\mathbf{K}_{t,\lambda} = \mathbf{K}_{t-1} + \lambda \mathbf{I}$, $\mathbf{k}_{t-1}(x) = \big( k(x, x_{t'}) \big)_{t' \in [t-1]}$, $\tilde{\mathbf{y}}_{t-1} = \big( \tilde{y}_{s,t} \big)_{s \in [t-1]}$, $\mathbf{K}_{t-1} = \big( k(x_{t'}, x_{t''}) \big)_{t', t'' \in [t-1]}$, and $\lambda$ is a regularization parameter that ensures $\mathbf{K}_{t,\lambda}$ is a positive definite matrix. We denote $\sigma^2_{t-1}(x) = \sigma^2_{t-1}(x, x)$.
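The following sketch shows how the censored posterior of Eq. (1) can be computed. The RBF kernel, its lengthscale, the 0-based indexing of the history, and the function names are illustrative assumptions made to keep the example self-contained, not the released implementation.

```python
import numpy as np

def rbf_kernel(A, B, lengthscale=0.1):
    """Squared exponential kernel with k(x, x') <= 1 on 1-D inputs (illustrative choice)."""
    d = np.asarray(A)[:, None] - np.asarray(B)[None, :]
    return np.exp(-0.5 * (d / lengthscale) ** 2)

def censored_posterior(x_hist, y_hist, delays, t, m, X_query, lam=1.1):
    """GP posterior of Eq. (1) with censored feedback.

    x_hist, y_hist, delays: the t-1 past queries, their (eventual) noisy feedback, and
    their realized delays in iterations (0-based indexing of the history is assumed).
    Feedback of query s is kept only if its delay has elapsed and is at most m,
    i.e. y_tilde_{s,t} = y_s * 1{d_s <= min(m, t - s)}; otherwise it is censored to 0.
    """
    x_hist, y_hist, delays = map(np.asarray, (x_hist, y_hist, delays))
    s = np.arange(len(x_hist))
    y_tilde = y_hist * (delays <= np.minimum(m, (t - 1) - s))       # censored feedback

    K = rbf_kernel(x_hist, x_hist)                                  # K_{t-1}
    K_lam_inv = np.linalg.inv(K + lam * np.eye(len(x_hist)))        # K_{t,lambda}^{-1}
    k_star = rbf_kernel(X_query, x_hist)                            # k_{t-1}(x) per candidate

    mu = k_star @ K_lam_inv @ y_tilde                               # posterior mean
    var = rbf_kernel(X_query, X_query).diagonal() - np.einsum(
        "ij,jk,ik->i", k_star, K_lam_inv, k_star)                   # posterior variance
    return mu, np.sqrt(np.maximum(var, 1e-12))
```

Note that the posterior variance uses all selected queries (including incomplete ones), while censoring only affects the feedback vector entering the posterior mean, matching the discussion above.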
### 3.2. Algorithms for the BO-SDF Problem

After having a posterior belief of the unknown function, the learner can decide which query to select for the subsequent function evaluation. Since the feedback is only observed for the selected query, the learner needs to deal with the exploration-exploitation trade-off. We use UCB- and TS-based algorithms for BO-SDF problems that handle the exploration-exploitation trade-off efficiently.

**UCB-based algorithm for BO-SDF problems.** UCB is an effective and widely-used technique for dealing with the exploration-exploitation trade-off in various sequential decision-making problems (Auer et al., 2002; Garivier & Cappé, 2011). We propose a UCB-based algorithm named GP-UCB-SDF for the BO-SDF problem. It works as follows. When the learner is ready to start the $t$-th function query, it first updates the GP posterior mean and standard deviation defined in Eq. (1) using the available censored noisy feedback. Then, it selects the input $x_t$ for the next function evaluation by maximizing the following UCB value:

$$
x_t = \arg\max_{x \in \mathcal{Q}} \left( \mu_{t-1}(x) + \nu_t \sigma_{t-1}(x) \right) \quad (2)
$$

where $\nu_t = B_y \sum_{s=t-m}^{t-1} \sigma_{t-1}(x_s) + \beta_t$, $\beta_t = B_f + (R + B_y) \sqrt{2\left( \gamma_{t-1} + 1 + \log(2/\delta) \right)}$, $\delta \in (0, 1)$, and $\gamma_t$ is the maximum information gain about the function $f$ from any set of $t$ function queries (Srinivas et al., 2010).

**GP-UCB-SDF: UCB-based algorithm for BO-SDF**
1. Input: $\lambda > 0$, $m \in \mathbb{N}$
2. for $t = 1, 2, 3, \dots$ do
3. Update the GP posterior defined in Eq. (1) using the available censored noisy feedback
4. Select input $x_t$ using Eq. (2) and compute $f$ at $x_t$
5. Keep observing the noisy feedback until the start of the next function evaluation
6. end for

After selecting the input $x_t$, the learner queries the unknown function $f$ at $x_t$, and the environment generates a noisy feedback $y_t$ with an associated delay $d_t \sim \mathcal{D}$. The learner observes $y_t$ only after a delay of $d_t$ if and only if $d_t \le m$. Before starting a new function evaluation, the learner keeps observing the noisy feedback of past queries. The same process is repeated for the subsequent function evaluations.

**TS-based algorithm for BO-SDF problems.** In contrast to the deterministic UCB, TS (Thompson, 1933; Agrawal & Goyal, 2012; 2013a; Kaufmann et al., 2012) is based on Bayesian updates that select inputs according to their probability of being the best input. Many works have shown that TS is empirically superior to UCB-based algorithms (Chapelle & Li, 2011; Chowdhury & Gopalan, 2019). We name our TS-based algorithm GP-TS-SDF; it works similarly to GP-UCB-SDF, except that the $t$-th function query is selected as follows:

$$
x_t = \arg\max_{x \in \mathcal{Q}} f_t(x) \quad (3)
$$

where $f_t \sim \mathcal{GP}\left( \mu_{t-1}(\cdot), \nu_t^2 \sigma^2_{t-1}(\cdot, \cdot) \right)$.

**GP-TS-SDF: TS-based algorithm for BO-SDF**
1. Input: $\lambda > 0$, $m \in \mathbb{N}$
2. for $t = 1, 2, 3, \dots$ do
3. Update the GP posterior defined in Eq. (1) using the available censored noisy feedback
4. Select input $x_t$ using Eq. (3) and compute $f$ at $x_t$
5. Keep observing the noisy feedback until the start of the next function evaluation
6. end for
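The sketch below illustrates the two selection rules of Eq. (2) and Eq. (3) given a censored posterior over a finite candidate set. The function names and the treatment of $\gamma_{t-1}$ as a precomputed scalar are assumptions for illustration; in addition, the TS variant samples each candidate independently rather than drawing a joint GP sample, which is a simplification noted in the comments.

```python
import numpy as np

def ucb_select(mu, sigma, sigma_at_incomplete, B_y, B_f, R, gamma, delta=0.1):
    """GP-UCB-SDF rule of Eq. (2): argmax_x  mu(x) + nu_t * sigma(x).

    mu, sigma: censored GP posterior mean/std over the candidate set Q.
    sigma_at_incomplete: posterior stds at the (at most m) stored incomplete queries.
    nu_t = B_y * sum_s sigma_{t-1}(x_s) + beta_t inflates the usual GP-UCB width
    to account for feedback that is still censored.
    """
    beta_t = B_f + (R + B_y) * np.sqrt(2.0 * (gamma + 1.0 + np.log(2.0 / delta)))
    nu_t = B_y * np.sum(sigma_at_incomplete) + beta_t
    return int(np.argmax(mu + nu_t * sigma)), nu_t

def ts_select(mu, sigma, nu_t, rng):
    """GP-TS-SDF rule of Eq. (3): sample f_t from the inflated posterior and maximize it.

    Sampling each candidate independently is a simplification of drawing one joint
    sample from GP(mu, nu_t^2 * Sigma); it keeps the sketch short but ignores
    posterior correlations between candidates.
    """
    f_t = rng.normal(loc=mu, scale=nu_t * sigma)
    return int(np.argmax(f_t))
```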
### 3.3. Regret Analysis

Let $\rho_m = \mathbb{P}(d_s \le m)$ for $s \ge 1$ be the probability of observing delayed feedback within the next $m$ iterations. The following result gives the confidence bounds of the GP posterior mean function and is important to our subsequent regret analysis.

**Theorem 1 (Confidence Ellipsoid).** Let $x \in \mathcal{Q}$, $\lambda > 1$, and $t \ge 1$. Then, with probability at least $1 - \delta$,
$$ \left| \mu_{t-1}(x) - \rho_m f(x) \right| \le \nu_t \sigma_{t-1}(x). $$

The detailed proofs of Theorem 1 and the other results below are given in Appendix A. We now state the regret upper bounds of our proposed GP-UCB-SDF and GP-TS-SDF.

**Theorem 2 (GP-UCB-SDF).** Let $C_1 = \sqrt{2 / \log(1 + \lambda^{-1})}$. Then, with probability at least $1 - \delta$,
$$ R_T \le \frac{2}{\rho_m} \left( \beta_T C_1 \sqrt{T \gamma_T} + m B_y C_1^2 \gamma_T \right). $$

As expected, the regret depends inversely on $\rho_m$, i.e., the probability of observing delayed feedback within the next $m$ iterations. If we ignore the logarithmic factors and constants in Theorem 2, then the regret of GP-UCB-SDF is upper bounded by $O\!\left( \rho_m^{-1} \gamma_T \sqrt{T} \right)$.

**Theorem 3 (GP-TS-SDF).** With probability at least $1 - \delta$,
$$ R_T = O\!\left( \frac{1}{\rho_m} \left( \sqrt{T \gamma_T}\,(\sqrt{\gamma_T} + 1) + m\,(\gamma_T + \sqrt{\gamma_T}) \right) \right). $$

The regret bounds in both Theorems 2 and 3 are sub-linear for the commonly used squared exponential (SE) kernel, for which $\gamma_T = O(\log^{n+1}(T))$ (Srinivas et al., 2010). By following the common practice in the BO literature (Contal et al., 2013; Kandasamy et al., 2018), we can upper-bound the simple regret by the average cumulative regret, i.e., $r_T \le R_T / T$.

**Discussion on the trade-off regarding $m$.** A large value of $m$ ensures that fewer queries are discarded and favors the regret bound (i.e., via the term $1/\rho_m$) by making $\rho_m$ large. However, a large $m$ also means that our algorithm allows larger delays, which consequently forces us to enlarge the width of the confidence interval (Theorem 1) to account for them. This makes the regret bound worse, which is also reflected by the linear dependence on $m$ in the second term of the regret upper bounds (Theorems 2 and 3). As $m$ is directly related to the storage requirement for the input queries whose feedback is yet to be observed, a large value of $m$ is also constrained by the resources available to the learner.

**Input-dependent random delays.** In some applications, the random delay can depend on the function query: for example, a larger number of hidden layers increases the training time of a neural network. Our regret bounds still hold with input-dependent random delays by redefining $\rho_m$ for $T$ queries as $\rho_m = \min_{t \in [T]} \mathbb{P}\{d_t \le m\}$.

**Limitations of censored feedback.** Censoring, i.e., replacing the values of incomplete queries with the minimum function value, leads to an aggressive exploration strategy. However, censoring is required to ensure that our BO method does not unnecessarily explore the incomplete queries or queries near them. Moreover, we can reduce the effect of this aggressive exploration by increasing the value of $m$, since a larger $m$ leads to less aggressive censoring. Another limitation of the censoring method is that it needs to know the minimum function value. When the minimum value is unknown, we can use a suitable lower bound as a proxy for the minimum function value; however, the regret analysis of such a method can be challenging and is left for future research.

## 4. Improved Algorithms for Batch BO

Notably, our BO-SDF problem subsumes batch BO as a special case. Specifically, the function queries in batch BO can be viewed as being sequentially selected, in which some queries need to be selected while some other queries of the same batch are still incomplete (Desautels et al., 2014). Therefore, in batch BO with a batch size of $B_x$, the incomplete queries can be viewed as delayed feedback with fixed delays such that $d_s \le B_x - 1$. In this case, by choosing $m = B_x - 1$, we can ensure that $\rho_m = \mathbb{P}(d_s \le B_x - 1) = 1$. As a result, our method of censoring (Section 3.1) gives a better treatment to the incomplete queries than the classic technique from batch BO, which we will demonstrate next via both an intuitive justification and a regret comparison.

The classic technique to handle incomplete queries in batch BO is hallucination, which was proposed by the GP-BUCB algorithm of Desautels et al. (2014) and has also been adopted by a number of its extensions (Daxberger & Low, 2017; Chowdhury & Gopalan, 2019). In particular, the feedback of incomplete queries is hallucinated to be (i.e., censored with) the GP posterior mean computed using only the completed queries (excluding the incomplete queries). This is equivalent to keeping the GP posterior mean unchanged (i.e., unaffected by the incomplete queries) while updating the GP posterior variance using all selected queries (including the incomplete queries). Note that in contrast to hallucination, our censoring technique sets the incomplete observations to the minimum function value (i.e., 0) (Section 3.1). As a result, compared with hallucination, where the GP posterior mean is unaffected by the incomplete queries, our censoring can reduce the value of the GP posterior mean at the incomplete queries (because we have treated their observations as if they were 0).
As a result, our censoring discourages these incomplete queries from being unnecessarily queried again. It encourages the algorithm to explore other unexplored regions, hence leading to better exploration.

Interestingly, the better exploration resulting from our censoring technique also translates to a tighter regret upper bound. Specifically, in batch BO with a batch size of $B_x$, after plugging $m = B_x - 1$ and $\rho_m = 1$ into Theorem 2, our next result gives the regret upper bound of GP-UCB-SDF.

**Proposition 1.** With probability at least $1 - \delta$,
$$ R_T \le 2 \left( \beta_T C_1 \sqrt{T \gamma_T} + (B_x - 1) B_y C_1^2 \gamma_T \right). $$

As long as $\gamma_T = \Omega(\log T)$, which holds for most commonly used kernels such as the squared exponential (SE) and Matérn kernels, both terms in Proposition 1 grow slower than $R_T = O\!\left( \gamma_T \sqrt{T} \exp(\gamma_{B_x - 1}) \right)$, which is the regret upper bound of the GP-BUCB algorithm based on hallucination (Desautels et al., 2014). Importantly, to achieve a sub-linear regret upper bound (i.e., to ensure that $\exp(\gamma_{B_x - 1})$ can be upper-bounded by a constant), GP-BUCB (Desautels et al., 2014) and its extensions (Daxberger & Low, 2017; Kandasamy et al., 2018; Chowdhury & Gopalan, 2019) all require a special initialization scheme (i.e., uncertainty sampling). This initialization is often found unnecessary in practice (Kandasamy et al., 2018) and hence represents a gap between theory and practice. Interestingly, our tighter regret upper bound from Proposition 1 has removed the dependence on $\exp(\gamma_{B_x - 1})$ and hence eliminated the need for uncertainty sampling, thereby closing this gap between theory and practice.

Note that our discussion above regarding the advantage of censoring over hallucination also applies to TS-based algorithms. In particular, in batch BO with a batch size of $B_x$, the regret upper bound of GP-TS-SDF is given by the following result.

**Proposition 2.** With probability at least $1 - \delta$,
$$ R_T = O\!\left( \sqrt{T \gamma_T}\,(\sqrt{\gamma_T} + 1) + B_x \gamma_T + B_x \sqrt{\gamma_T} \right). $$

All three terms in Proposition 2 grow slower than $R_T = O\!\left( \exp(\gamma_{B_x - 1}) \sqrt{T \gamma_T}\,(\sqrt{\gamma_T} + 1) \right)$, which is the regret upper bound of the GP-BTS algorithm based on hallucination (Chowdhury & Gopalan, 2019), as long as $\gamma_T = \Omega(\log T)$.

To summarize, though batch BO is only a special case of our BO-SDF problem, we have made non-trivial contributions to the algorithm and theory of batch BO. Our GP-UCB-SDF and GP-TS-SDF algorithms and our theoretical analyses may serve as inspiration for future works on batch BO.
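To make the contrast between censoring and hallucination concrete, the following sketch builds the feedback vector that would enter the GP posterior mean in a batch setting where the last $B_x - 1$ queries are still incomplete. It is an illustrative comparison under stated assumptions, not the exact implementation of either GP-BUCB or our released code.

```python
import numpy as np

def pseudo_feedback(y_completed, mu_completed_at_incomplete, n_incomplete, mode):
    """Feedback vector used to form the GP posterior mean when n_incomplete = B_x - 1
    queries of the current batch are still unobserved.

    mode == "censor":      our treatment -- incomplete queries contribute 0 (the minimum
                           function value), pulling the posterior mean down around them
                           and discouraging re-selection of nearby inputs.
    mode == "hallucinate": GP-BUCB-style treatment -- incomplete queries are filled with
                           the posterior mean computed from completed queries only, so
                           the posterior mean is left effectively unchanged.
    """
    if mode == "censor":
        fill = np.zeros(n_incomplete)
    elif mode == "hallucinate":
        fill = np.asarray(mu_completed_at_incomplete, dtype=float)
    else:
        raise ValueError(f"unknown mode: {mode}")
    return np.concatenate([np.asarray(y_completed, dtype=float), fill])
```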
## 5. Contextual Gaussian Process Bandits under Stochastic Delayed Feedback

In many real-life scenarios, additional information is available to the learner before making a decision (e.g., users' profile information for an online platform, or patients' medical history before clinical trials). This additional information is referred to as context in the literature, and the value of a function depends on the context (Agrawal & Goyal, 2013b; Chu et al., 2011; Krause & Ong, 2011; Li et al., 2010). The learner can design algorithms that exploit the contextual information to make better decisions. We can extend our results for the BO-SDF problem to contextual Gaussian process bandits (Krause & Ong, 2011) with stochastic delayed feedback, where a non-linear function maps a context to the feedback.

Let $\mathcal{Q} \subset \mathbb{R}^n$ be a finite set of available actions (or queries) and $\mathcal{Z} \subset \mathbb{R}^{n'}$ be a set of all contexts, where $n, n' \ge 1$. In round $t$, the environment generates a context $z_t \in \mathcal{Z}$. Then, the learner selects an action $x_t \in \mathcal{Q}$ and observes noisy feedback $y_t = g(z_t, x_t) + \varepsilon_t$. We assume that $g : \mathcal{Z} \times \mathcal{Q} \to \mathbb{R}$ and that $\varepsilon_t$ is an $R$-sub-Gaussian noise. The feedback $y_t$ is only observed after a random delay $d_t$, which is generated from an unknown distribution $\mathcal{D}$. We refer to this new problem as Contextual Gaussian process Bandits under Stochastic Delayed Feedback (CGB-SDF). The unknown function $g$, context space $\mathcal{Z}$, action space $\mathcal{Q}$, and unknown delay distribution $\mathcal{D}$ identify an instance of the CGB-SDF problem.

The optimal action $x_t^\star$ for a given context $z_t$ is the action at which the function $g$ attains its maximum value, i.e., $x_t^\star = \arg\max_{x \in \mathcal{Q}} g(z_t, x)$. The learner incurs a penalty of $g(z_t, x_t^\star) - g(z_t, x_t)$ for a context $z_t$ and action $x_t$. We aim to learn a policy that achieves the minimum cumulative penalty, or regret, given by $R_T = \sum_{t=1}^{T} \left( g(z_t, x_t^\star) - g(z_t, x_t) \right)$. Our goal is to learn a policy that has small, sub-linear regret, i.e., $\lim_{T \to \infty} R_T / T = 0$. Sub-linear regret here implies that the policy will eventually select the optimal action for a given context. We have adapted our algorithms for the BO-SDF problem to the CGB-SDF problem and shown that they also enjoy similar sub-linear regret guarantees; see Appendix B for more details.
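A natural way to carry the BO-SDF machinery over to CGB-SDF is to place the GP over stacked (context, action) pairs and, given the current context, maximize the acquisition over actions only. The sketch below shows a UCB-style selection under this assumption; the joint-kernel construction, function name, and parameters are illustrative and mirror the censored posterior of Eq. (1).

```python
import numpy as np

def select_action_cgb(z_t, actions, contexts_hist, actions_hist, y_tilde,
                      kernel, nu_t, lam=1.1):
    """CGB-SDF action selection (UCB variant), assuming a kernel defined on stacked
    (context, action) inputs.

    z_t: current context (1-D array); actions: finite action set Q (one row per action);
    the history consists of past (context, action) pairs and their censored feedback
    y_tilde (delayed observations replaced by 0, as in Eq. (1)).
    """
    z_t, actions = np.atleast_1d(z_t), np.atleast_2d(actions)
    X_hist = np.column_stack([contexts_hist, actions_hist])           # past (z, x) pairs
    X_cand = np.column_stack([np.repeat(z_t[None, :], len(actions), 0), actions])

    K = kernel(X_hist, X_hist)
    K_inv = np.linalg.inv(K + lam * np.eye(len(X_hist)))
    k_star = kernel(X_cand, X_hist)

    mu = k_star @ K_inv @ np.asarray(y_tilde)                         # censored posterior mean
    var = kernel(X_cand, X_cand).diagonal() - np.einsum(
        "ij,jk,ik->i", k_star, K_inv, k_star)
    ucb = mu + nu_t * np.sqrt(np.maximum(var, 1e-12))
    return int(np.argmax(ucb))                                        # index into `actions`
```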
## 6. Experiments

We compare with previous baseline methods that can be applied to BO-SDF problems after modifications. Firstly, we compare with standard GP-UCB (Srinivas et al., 2010), which ignores the delayed observations when selecting the next query. Note that GP-UCB is likely to repeat previously selected queries in some iterations when the GP posterior remains unchanged (i.e., if no observation is newly collected between two iterations). Next, we also compare with the batch BO method asy-TS (Kandasamy et al., 2018), which, similarly to GP-UCB, ignores the delayed observations when using TS to choose the next query. Lastly, we compare with the batch BO methods GP-BUCB (Desautels et al., 2014) and GP-BTS (Chowdhury & Gopalan, 2019), which handle the delayed observations via hallucination (Section 4). Following the suggestion of Vernade et al. (2020), if not specified otherwise, we set $m = 2\mu$ where $\mu$ is the mean of the delay distribution. However, this is for convenience only, since we will demonstrate (Section 6.1) that our algorithms consistently achieve competitive performances as long as $m$ is large enough. Therefore, $\mu$ does not need to be known in practice. We defer some experimental details to Appendix C due to space constraints. Our code is released at https://github.com/daizhongxiang/BO-SDF.

### 6.1. Synthetic Experiments

For this experiment, we use a Poisson distribution (with a mean parameter $\mu$) as the delay distribution and sample the objective function $f$ from a GP with an SE kernel defined on a discrete 1-dimensional domain. Fig. 1a plots the performances of our GP-UCB-SDF for different delay distributions and shows that larger delays (i.e., larger $\mu$'s) result in worse performances. Fig. 1b shows that for a fixed delay distribution (i.e., $\mu = 10$), an overly small value of $m$ leads to large regrets since it causes our algorithm to ignore a large number of delayed observations (i.e., overly aggressive censoring). On the other hand, a large $m$ is desirable, since $m = 20 = 2\mu$ and $m = 40$ produce comparable performances. However, an overly large value of $m$ may exert an excessive resource requirement since we may need to cater to many pending function evaluations. Therefore, when the delay distribution (hence $\mu$) is unknown, we recommend using a large value of $m$ that is allowed by the resource constraints.

We now compare the performances of our GP-UCB-SDF with the baselines of GP-UCB and GP-BUCB for both stochastic ($\mu = 10$, Fig. 1c) and deterministic (delays fixed at 10, Fig. 1d) delay distributions. The figures show that our GP-UCB-SDF achieves smaller simple regret than the baselines in both settings.

[Figure 1: Performances of our GP-UCB-SDF for (a) different delay distributions and (b) different values of m (µ fixed at µ = 10). Performances of GP-UCB-SDF and baseline methods for (c) stochastic and (d) deterministic delay distributions.]

[Figure 2: Performances for hyperparameter tuning of SVM: UCB-based methods with (a) stochastic and (b) deterministic delay distributions, and TS-based methods with (c) stochastic and (d) deterministic delay distributions. Performances for hyperparameter tuning of CNN: UCB-based methods with (e) stochastic and (f) deterministic delay distributions, and TS-based methods with (g) stochastic and (h) deterministic delay distributions.]

### 6.2. Real-world Experiments

In this section, we perform real-world experiments on hyperparameter tuning for ML models under two realistic settings. In the first setting, we consider stochastic delay distributions where the delays are sampled from a Poisson distribution with a mean parameter $\mu = 10$. This setting is used to simulate real-world scenarios where the evaluations of different hyperparameter configurations may be stochastically delayed because they may require different amounts of computation (e.g., training a neural network with more layers is more time-consuming), or because the machines deployed in the experiment may have different computational capabilities. The second setting considers fixed delay distributions where every delay is fixed at $B_x - 1 = 10$, which is used to emulate real-world asynchronous batch BO problems with a batch size of $B_x = 11$.

We tune the hyperparameters of three different ML models to demonstrate the consistency of our algorithms. Specifically, we tune two hyperparameters of SVM (i.e., the penalty parameter and the RBF kernel parameter) used for diabetes diagnosis, three hyperparameters of convolutional neural networks (CNNs) (i.e., the batch size, learning rate, and learning rate decay) using the MNIST dataset, and the same three hyperparameters of logistic regression (LR) used for breast cancer detection.
The experimental results on hyperparameter tuning for SVM and CNN are plotted in Fig. 2, including the results for UCB- and TS-based algorithms in both settings. The results for hyperparameter tuning of LR are shown in Fig. 3. The figures show that our GP-UCB-SDF and GP-TS-SDF consistently outperform the corresponding baselines in both settings with stochastic and deterministic delay distributions. The results here verify the practical effectiveness of our proposed algorithms in real-world problems.

[Figure 3: Performances for hyperparameter tuning of LR: UCB-based methods for (a) stochastic and (b) deterministic delay distributions, and TS-based methods for (c) stochastic and (d) deterministic delay distributions.]

[Figure 4: Cumulative regret of contextual GP bandits for (a) UCB-based and (b) TS-based methods for multi-task BO, and (c) UCB-based and (d) TS-based methods for non-stationary BO.]

### 6.3. Real-world Experiments on Contextual Gaussian Process Bandits

In this section, we apply our algorithms to two real-world contextual GP bandit problems (Section 5). Our first experiment performs multi-task BO in the contextual GP bandit setting following the work of Krause & Ong (2011). We use the tabular benchmark dataset for SVM hyperparameter tuning from the work of Wistuba et al. (2015), which consists of the validation accuracy evaluated on a discrete domain of six SVM hyperparameters for 50 different classification tasks. Each of the 50 tasks is associated with a set of meta-features which are used as the contexts $z_t$. We sequentially tackle the 50 tasks and perform hyperparameter tuning for each task for 30 iterations, i.e., the context $z_t$ for every task is repeated consecutively 30 times and the contexts for all tasks arrive sequentially.

Our second experiment performs non-stationary BO in the contextual GP bandit setting. In some real-world BO problems, the objective function is non-stationary (i.e., it changes over time). For example, when we tune the hyperparameters of an ML model for diabetes diagnosis (Section 6.2), we may need to perform the hyperparameter tuning task periodically because the size of the dataset is progressively growing due to sequentially arriving patient data. We divide the entire dataset into 20 progressively growing datasets and assign an index $z_t \in \{0, \dots, 19\}$ to each dataset such that a smaller dataset has a smaller index. Then, we use the indices as the contexts because two datasets whose indices are close share many common data points and are hence expected to lead to similar performances. We let the contexts arrive in the order of their indices to simulate sequentially arriving patients and again repeat each context for 30 iterations. For both experiments, we use a stochastic delay distribution (i.e., a Poisson distribution with $\mu = 3$) and set $m = 6$.
The cumulative regrets of the different algorithms are shown in Fig. 4, in which both GP-UCB-SDF and GP-TS-SDF incur smaller regrets than the other baselines.

## 7. Related Work

We have developed algorithms for BO problems with stochastic delayed feedback. In the following, we briefly review the papers relevant to our problem.

**Batch BO.** Batch BO is the closest BO setting to our BO-SDF problem. Many algorithms have been proposed for different variants of batch BO problems in the literature (Desautels et al., 2014; González et al., 2016; Daxberger & Low, 2017; Kandasamy et al., 2018; Gong et al., 2019; Chowdhury & Gopalan, 2019; Balandat et al., 2020). However, the delay in observing feedback in all these batch BO problems is fixed. Some batch BO algorithms are based on the UCB index (Daxberger & Low, 2017; Desautels et al., 2014), while others are based on TS (Chowdhury & Gopalan, 2019; Gong et al., 2019; Kandasamy et al., 2018). Batch BO methods like local penalization (González et al., 2016) and fantasizing (Balandat et al., 2020) cannot work with stochastically delayed feedback and have no theoretical guarantees. All variants of batch BO problems can be treated as special cases of the BO-SDF problem by fixing the delays appropriately. Therefore, our UCB- and TS-based algorithms can be used for any batch BO problem.

**Contextual Gaussian Process (GP) Bandits.** Several prior works consider the availability of additional information to the learner before making a decision. This line of work is popularly known as contextual bandits (Li et al., 2010). It has many applications in advertising, web search, and e-commerce. In contextual bandits, the mean reward is a function of the context and is often parameterized (e.g., linear (Agrawal & Goyal, 2013b; Chu et al., 2011; Li et al., 2010), GLM (Li et al., 2017), or non-linear (Valko et al., 2013)). Contextual GP bandits model the posterior belief of a non-linear function using GPs (Krause & Ong, 2011). To the best of our knowledge, our work is the first to introduce stochastic delayed feedback into contextual GP bandits.

**Stochastic Delayed Feedback.** Motivated by several practical applications, some works have explored fixed or stochastic delayed feedback in stochastic multi-armed bandits (MAB) (Bubeck et al., 2012; Lattimore & Szepesvári, 2020; Slivkins, 2019), which is a simpler sequential decision-making framework than BO. In the MAB literature, the problems of known fixed delay (Joulani et al., 2013), unknown delay (Zhong et al., 2017), unknown stochastic delay (Grover et al., 2018; Vernade et al., 2017), and unknown delays in the adversarial setting (Li et al., 2019) are well studied. The setting of unknown stochastic delay has been further extended to linear bandits (Vernade et al., 2020; Zhou et al., 2019), which is closest to our BO-SDF problem. However, these works on linear bandits assume that the black-box function is linear and has binary feedback, which may not hold in practice.

## 8. Conclusion

This paper has studied the Bayesian optimization (BO) problem with stochastic delayed feedback (BO-SDF). We have used Gaussian processes as a surrogate model for the posterior belief of the unknown black-box function. To exploit the information of queries with delayed feedback (i.e., incomplete queries) when updating the posterior mean function, we have used censored feedback, which assigns the minimum function value (i.e., 0) to the delayed feedback.
This discourages the learner from selecting the incomplete queries (or queries near them) when choosing the next query and hence leads to more efficient exploration. We have proposed UCB- and TS-based algorithms for the BO-SDF problem and given their sub-linear regret guarantees. Interestingly, when delays are fixed in a certain way, batch BO becomes a special case of the BO-SDF problem; hence, we can adapt our algorithms to the batch BO problem. We have shown that our algorithms are theoretically and empirically better and do not need the special initialization scheme required by the existing batch BO algorithms. We have also extended our algorithms to contextual Gaussian process bandits with stochastic delayed feedback, which is in itself a non-trivial contribution.

As the learning environment keeps changing in real life, an interesting future direction is to consider the BO-SDF problem with a non-stationary function. We also plan to generalize our algorithms to nonmyopic BO (Kharkovskii et al., 2020b; Ling et al., 2016), high-dimensional BO (Hoang et al., 2018), private outsourced BO (Kharkovskii et al., 2020a), preferential BO (Nguyen et al., 2021d), federated/collaborative BO (Dai et al., 2020b; 2021; Sim et al., 2021), meta-BO (Dai et al., 2022), and multi-fidelity BO (Zhang et al., 2017; 2019) settings, handle risks (Nguyen et al., 2021b;a; Tay et al., 2022) and information-theoretic acquisition functions (Nguyen et al., 2021c;e), incorporate early stopping (Dai et al., 2019) and/or recursive reasoning (Dai et al., 2020a), and consider its application to neural architecture search (Shu et al., 2022a;b) and inverse reinforcement learning (Balakrishnan et al., 2020). For applications with a huge budget of function evaluations, we would like to couple our algorithms with the use of distributed/decentralized (Chen et al., 2012; 2013a;b; 2015; Hoang et al., 2016; 2019; Low et al., 2015; Ouyang & Low, 2018), online/stochastic (Hoang et al., 2015; 2017; Low et al., 2014; Xu et al., 2014; Yu et al., 2019b), or deep sparse GP models (Yu et al., 2019a; 2021) to represent the belief of the unknown objective function efficiently.

**Acknowledgements.** This research/project is supported by A*STAR under its RIE2020 Advanced Manufacturing and Engineering (AME) Industry Alignment Fund - Pre-Positioning (IAF-PP) (Award A19E4a0101) and by the Singapore Ministry of Education Academic Research Fund Tier 1.

## References

Agrawal, S. and Goyal, N. Analysis of Thompson sampling for the multi-armed bandit problem. In Proc. COLT, pp. 39.1-39.26, 2012.

Agrawal, S. and Goyal, N. Further optimal regret bounds for Thompson sampling. In Proc. AISTATS, pp. 99-107, 2013a.

Agrawal, S. and Goyal, N. Thompson sampling for contextual bandits with linear payoffs. In Proc. ICML, pp. 127-135, 2013b.

Auer, P., Cesa-Bianchi, N., and Fischer, P. Finite-time analysis of the multiarmed bandit problem. Machine Learning, pp. 235-256, 2002.

Balakrishnan, S., Nguyen, Q. P., Low, B. K. H., and Soh, H. Efficient exploration of reward functions in inverse reinforcement learning via Bayesian optimization. In Proc. NeurIPS, pp. 4187-4198, 2020.

Balandat, M., Karrer, B., Jiang, D., Daulton, S., Letham, B., Wilson, A. G., and Bakshy, E. BoTorch: a framework for efficient Monte-Carlo Bayesian optimization. In Proc. NeurIPS, pp. 21524-21538, 2020.

Besson, L. and Kaufmann, E. What doubling tricks can and can't do for multi-armed bandits. arXiv preprint arXiv:1803.06971, 2018.
Bubeck, S., Cesa-Bianchi, N., et al. Regret analysis of stochastic and nonstochastic multi-armed bandit problems. Foundations and Trends in Machine Learning, 5(1):1 122, 2012. Chapelle, O. Modeling delayed feedback in display advertising. In Proc. KDD, pp. 1097 1105, 2014. Chapelle, O. and Li, L. An empirical evaluation of Thompson sampling. In Proc. Neur IPS, pp. 2249 2257, 2011. Chen, J., Low, K. H., Tan, C. K.-Y., Oran, A., Jaillet, P., Dolan, J. M., and Sukhatme, G. S. Decentralized data fusion and active sensing with mobile sensors for modeling and predicting spatiotemporal traffic phenomena. In Proc. UAI, pp. 163 173, 2012. Chen, J., Cao, N., Low, K. H., Ouyang, R., Tan, C. K.-Y., and Jaillet, P. Parallel Gaussian process regression with low-rank covariance matrix approximations. In Proc. UAI, pp. 152 161, 2013a. Chen, J., Low, K. H., and Tan, C. K.-Y. Gaussian process-based decentralized data fusion and active sensing for mobility-on-demand system. In Proc. RSS, 2013b. Chen, J., Low, K. H., Jaillet, P., and Yao, Y. Gaussian process decentralized data fusion and active sensing for spatiotemporal traffic modeling and prediction in mobility-on-demand systems. IEEE Trans. Autom. Sci. Eng., 12:901 921, 2015. Chow, S.-C. and Chang, M. Adaptive Design Methods in Clinical Trials. CRC Press, 2006. Chowdhury, S. R. and Gopalan, A. On kernelized multi-armed bandits. In Proc. ICML, pp. 844 853, 2017. Chowdhury, S. R. and Gopalan, A. On batch Bayesian optimization. ar Xiv preprint ar Xiv:1911.01032, 2019. Chu, W., Li, L., Reyzin, L., and Schapire, R. E. Contextual bandits with linear payoff functions. In Proc. AISTATS, pp. 208 214, 2011. Contal, E., Buffoni, D., Robicquet, A., and Vayatis, N. Parallel Gaussian process optimization with upper confidence bound and pure exploration. In Proc. ECML/PKDD, pp. 225 240, 2013. Dai, Z., Yu, H., Low, B. K. H., and Jaillet, P. Bayesian optimization meets Bayesian optimal stopping. In Proc. ICML, pp. 1496 1506, 2019. Dai, Z., Chen, Y., Low, B. K. H., Jaillet, P., and Ho, T.-H. R2-B2: Recursive reasoning-based Bayesian optimization for no-regret learning in games. In Proc. ICML, pp. 2291 2301, 2020a. Dai, Z., Low, B. K. H., and Jaillet, P. Federated Bayesian optimization via Thompson sampling. In Proc. Neur IPS, pp. 9687 9699, 2020b. Dai, Z., Low, B. K. H., and Jaillet, P. Differentially private federated Bayesian optimization with distributed exploration. In Proc. Neur IPS, pp. 9125 9139, 2021. Dai, Z., Chen, Y., Yu, H., Low, B. K. H., and Jaillet, P. On provably robust meta-Bayesian optimization. In Proc. UAI, 2022. Daxberger, E. A. and Low, B. K. H. Distributed batch Gaussian process optimization. In Proc. ICML, pp. 951 960, 2017. Desautels, T., Krause, A., and Burdick, J. W. Parallelizing exploration-exploitation tradeoffs in Gaussian process bandit optimization. JMLR, 15:3873 3923, 2014. Diemert, E., Meynet, J., Galland, P., and Lefortier, D. Attribution modeling increases efficiency of bidding in display advertising. In Proc. KDD Workshop on Ad KDD and Target Ad, 2017. Bayesian Optimization under Stochastic Delayed Feedback Frazier, P. I. A tutorial on Bayesian optimization. ar Xiv preprint ar Xiv:1807.02811, 2018. Garivier, A. and Cappé, O. The KL-UCB algorithm for bounded stochastic bandits and beyond. In Proc. COLT, pp. 359 376, 2011. Garnett, R. Bayesian Optimization. Cambridge Univ. Press, 2022. Gong, C., Peng, J., and Liu, Q. Quantile Stein variational gradient descent for batch Bayesian optimization. In Proc. ICML, pp. 2347 2356, 2019. 
González, J., Dai, Z., Hennig, P., and Lawrence, N. Batch Bayesian optimization via local penalization. In Proc. AISTATS, pp. 648 657, 2016. Grover, A., Markov, T., Attia, P., Jin, N., Perkins, N., Cheong, B., Chen, M., Yang, Z., Harris, S., Chueh, W., and Ermon, S. Best arm identification in multi-armed bandits with delayed feedback. In Proc. AISTATS, pp. 833 842, 2018. Hoang, Q. M., Hoang, T. N., and Low, K. H. A generalized stochastic variational Bayesian hyperparameter learning framework for sparse spectrum Gaussian process regression. In Proc. AAAI, pp. 2007 2014, 2017. Hoang, T. N., Hoang, Q. M., and Low, K. H. A unifying framework of anytime sparse Gaussian process regression models with stochastic variational inference for big data. In Proc. ICML, pp. 569 578, 2015. Hoang, T. N., Hoang, Q. M., and Low, K. H. A distributed variational inference framework for unifying parallel sparse Gaussian process regression models. In Proc. ICML, pp. 382 391, 2016. Hoang, T. N., Hoang, Q. M., and Low, B. K. H. Decentralized high-dimensional Bayesian optimization with factor graphs. In Proc. AAAI, pp. 3231 3238, 2018. Hoang, T. N., Hoang, Q. M., Low, K. H., and How, J. P. Collective online learning of Gaussian processes in massive multi-agent systems. In Proc. AAAI, 2019. Joulani, P., Gyorgy, A., and Szepesvári, C. Online learning under delayed feedback. In Proc. ICML, pp. 1453 1461, 2013. Kandasamy, K., Krishnamurthy, A., Schneider, J., and Póczos, B. Parallelised Bayesian optimisation via Thompson sampling. In Proc. AISTATS, pp. 133 142, 2018. Kaufmann, E., Korda, N., and Munos, R. Thompson sampling: An asymptotically optimal finite-time analysis. In Proc. ALT, pp. 199 213, 2012. Kharkovskii, D., Dai, Z., and Low, B. K. H. Private outsourced Bayesian optimization. In Proc. ICML, pp. 5231 5242, 2020a. Kharkovskii, D., Ling, C. K., and Low, B. K. H. Nonmyopic Gaussian process optimization with macro-actions. In Proc. AISTATS, pp. 4593 4604, 2020b. Krause, A. and Ong, C. S. Contextual Gaussian process bandit optimization. In Proc. Neur IPS, pp. 2447 2455, 2011. Lattimore, T. and Szepesvári, C. Bandit Algorithms. Cambridge Univ. Press, 2020. Li, B., Chen, T., and Giannakis, G. B. Bandit online learning with unknown delays. In Proc. AISTATS, pp. 993 1002, 2019. Li, L., Chu, W., Langford, J., and Schapire, R. E. A contextual-bandit approach to personalized news article recommendation. In Proc. WWW, pp. 661 670, 2010. Li, L., Lu, Y., and Zhou, D. Provably optimal algorithms for generalized linear contextual bandits. In Proc. ICML, pp. 2071 2080, 2017. Li, Z. and Scarlett, J. Gaussian process bandit optimization with few batches. In Proc. AISTATS, pp. 92 107, 2022. Ling, C. K., Low, B. K. H., and Jaillet, P. Gaussian process planning with Lipschitz continuous reward functions: Towards unifying Bayesian optimization, active learning, and beyond. In Proc. AAAI, pp. 1860 1866, 2016. Low, K. H., Xu, N., Chen, J., Lim, K. K., and Özgül, E. B. Generalized online sparse Gaussian processes with application to persistent mobile robot localization. In Proc. ECML/PKDD Nectar Track, pp. 499 503, 2014. Low, K. H., Yu, J., Chen, J., and Jaillet, P. Parallel Gaussian process regression for big data: Low-rank representation meets Markov approximation. In Proc. AAAI, pp. 2821 2827, 2015. Nguyen, Q. P., Dai, Z., Low, B. K. H., and Jaillet, P. Optimizing conditional value-at-risk of black-box functions. In Proc. Neur IPS, pp. 4170 4180, 2021a. Nguyen, Q. P., Dai, Z., Low, B. K. H., and Jaillet, P. 
Value-at-risk optimization with Gaussian processes. In Proc. ICML, pp. 8063 8072, 2021b. Nguyen, Q. P., Low, B. K. H., and Jaillet, P. An information-theoretic framework for unifying active learning problems. In Proc. AAAI, pp. 9126 9134, 2021c. Nguyen, Q. P., Tay, S., Low, B. K. H., and Jaillet, P. Top-k ranking Bayesian optimization. In Proc. AAAI, pp. 9135 9143, 2021d. Bayesian Optimization under Stochastic Delayed Feedback Nguyen, Q. P., Wu, Z., Low, B. K. H., and Jaillet, P. Trusted-maximizers entropy search for efficient Bayesian optimization. In Proc. UAI, pp. 1486 1495, 2021e. Ouyang, R. and Low, K. H. Gaussian process decentralized data fusion meets transfer learning in large-scale distributed cooperative perception. In Proc. AAAI, pp. 3876 3883, 2018. Rasmussen, C. E. and Williams, C. K. I. Gaussian Processes for Machine Learning. MIT Press, 2006. Shahriari, B., Swersky, K., Wang, Z., Adams, R. P., and De Freitas, N. Taking the human out of the loop: A review of Bayesian optimization. Proceedings of the IEEE, 104(1):148 175, 2015. Shu, Y., Cai, S., Dai, Z., Ooi, B. C., and Low, B. K. H. NASI: Labeland data-agnostic neural architecture search at initialization. In Proc. ICLR, 2022a. Shu, Y., Chen, Y., Dai, Z., and Low, B. K. H. Neural ensemble search via Bayesian sampling. In Proc. UAI, 2022b. Sim, R. H. L., Zhang, Y., Low, B. K. H., and Jaillet, P. Collaborative Bayesian optimization with fair regret. In Proc. ICML, pp. 9691 9701, 2021. Slivkins, A. Introduction to multi-armed bandits. Foundations and Trends in Machine Learning, 12(1 2): 1 286, 2019. Snoek, J., Larochelle, H., and Adams, R. P. Practical Bayesian optimization of machine learning algorithms. In Proc. Neur IPS, 2012. Srinivas, N., Krause, A., Kakade, S., and Seeger, M. Gaussian process optimization in the bandit setting: No regret and experimental design. In Proc. ICML, pp. 1015 1022, 2010. Takahashi, A. and Suzuki, T. Bayesian optimization design for dose-finding based on toxicity and efficacy outcomes in phase I/II clinical trials. Pharmaceutical Statistics, 20 (3):422 439, 2021. Tay, S. S., Foo, C. S., Urano, D., Leong, R. C. X., and Low, B. K. H. Efficient distributionally robust Bayesian optimization with worst-case sensitivity. In Proc. ICML, 2022. Thompson, W. R. On the likelihood that one unknown probability exceeds another in view of the evidence of two samples. Biometrika, 25(3 4):285 294, 1933. Ueno, T., Rhone, T. D., Hou, Z., Mizoguchi, T., and Tsuda, K. Combo: An efficient Bayesian optimization library for materials science. Materials Discovery, 4:18 21, 2016. Valko, M., Korda, N., Munos, R., Flaounas, I., and Cristianini, N. Finite-time analysis of kernelised contextual bandits. In Proc. UAI, pp. 654 663, 2013. Verma, A., Hanawal, M., Rajkumar, A., and Sankaran, R. Censored semi-bandits: A framework for resource allocation with censored feedback. In Proc. Neur IPS, pp. 14499 14509, 2019. Verma, A., Hanawal, M. K., Rajkumar, A., and Sankaran, R. Censored semi-bandits for resource allocation. Technical report, 2021. Vernade, C., Cappé, O., and Perchet, V. Stochastic bandit models for delayed conversions. In Proc. UAI, 2017. Vernade, C., Carpentier, A., Lattimore, T., Zappella, G., Ermis, B., and Brueckner, M. Linear bandits with stochastic delayed feedback. In Proc. ICML, pp. 9712 9721, 2020. Wistuba, M., Schilling, N., and Schmidt-Thieme, L. Learning hyperparameter optimization initializations. In Proc. IEEE International Conference on Data Science and Advanced Analytics, 2015. Xu, N., Low, K. 
H., Chen, J., Lim, K. K., and Özgül, E. B. GP-Localize: Persistent mobile robot localization using online sparse Gaussian process observation model. In Proc. AAAI, pp. 2585 2592, 2014. Yu, H., Chen, Y., Dai, Z., Low, B. K. H., and Jaillet, P. Implicit posterior variational inference for deep Gaussian processes. In Proc. Neur IPS, pp. 14475 14486, 2019a. Yu, H., Hoang, T. N., Low, K. H., and Jaillet, P. Stochastic variational inference for Bayesian sparse Gaussian process regression. In Proc. IJCNN, 2019b. Yu, H., Liu, D., Low, K. H., and Jaillet, P. Convolutional normalizing flows for deep Gaussian processes. In Proc. IJCNN, 2021. Zhang, Y., Hoang, T. N., Low, B. K. H., and Kankanhalli, M. Information-based multi-fidelity Bayesian optimization. In Proc. NIPS Workshop on Bayesian Optimization, 2017. Zhang, Y., Dai, Z., and Low, B. K. H. Bayesian optimization with binary auxiliary information. In Proc. UAI, pp. 1222 1232, 2019. Zhong, J., Huang, Y., and Liu, J. Asynchronous parallel empirical variance guided algorithms for the thresholding bandit problem. ar Xiv preprint ar Xiv:1704.04567, 2017. Zhou, Z., Xu, R., and Blanchet, J. Learning in generalized linear contextual bandits with stochastic delays. In Proc. Neur IPS, pp. 5197 5208, 2019. Bayesian Optimization under Stochastic Delayed Feedback A. Proofs of Theoretical Results A.1. Proof of Theorem 1 We define ϕ(x) = k(x, ) as the feature map associated with the kernel k, i.e., ϕ maps any input x Q into a feature in the (potentially infinite-dimensional) feature space of the RKHS associated with k. The reproducing property tells us that f(x) = ϕ(x), f k = ϕ(x) f. Next, we define Vt(λ) Pt 1 s=1 ϕ(xs)ϕ(xs) + λI, Bt Pt 1 s=1 ϕ(xs) ys,t, and ˆθW t Vt(λ) 1Bt. Of note, the GP posterior mean (1) can be equivalently expressed as µt 1(x) = ϕ(x) Vt(λ) 1Bt, and the GP posterior variance (1) can be equivalently written as σ2 t 1(x, x) = σ2 t 1(x) = λϕ(x)Vt(λ) 1ϕ(x) = λ ϕ(x) 2 Vt(λ) 1. In this section, we prove Theorem 1, which is restated below. Theorem 1 (Confidence Ellipsoid). Let x Q, λ > 1, and t 1. Then, with probability at least 1 δ, |µt 1(x) ρmf(x)| νtσt 1(x). Proof. Firstly, we can decompose Bt as follows: s=1 ϕ(xs) ys,t = s=1 ϕ(xs)ys1{ds m} + s=t m ϕ(xs)ys (1{ds t s} 1{ds m}) . Next, we can plug in this decomposition of Bt to derive: |µt 1(x) ρmf(x)| = |ϕ(x) Vt(λ) 1Bt ρmϕ(x) f| ϕ(x) Vt(λ) 1 t 1 X s=1 ϕ(xs)ys1{ds m} ϕ(x) Vt(λ) 1 t 1 X s=t m ϕ(xs)ys (1{ds t s} 1{ds m}) in which the equality makes use of the expression for the GP posterior mean: µt 1(x) = ϕ(x) Vt(λ) 1Bt, and the reproducing property: f(x) = ϕ(x) f. Next, we will separately provide upper bounds on the two terms in Eq. (4). Firstly, We upper-bound the second term in Eq. (4) as follows: ϕ(x) Vt(λ) 1 t 1 X s=t m ϕ(xs)ys (1{ds t s} 1{ds m}) ϕ(x) Vt(λ) 1 Vt(λ) 1 t 1 X s=t m ϕ(xs)ys (1{ds t s} 1{ds m}) s=t m ϕ(xs)ys (1{ds t s} 1{ds m}) s=t m ϕ(xs)ys (1{ds t s} 1{ds m}) Vt(λ) 1 s=t m ϕ(xs) Vt(λ) 1 = σt 1(x) By s=t m σt 1(xs) s=t m σt 1(xs). Bayesian Optimization under Stochastic Delayed Feedback The second inequality and the equality have made use of the expression for the GP posterior variance: σ2 t 1(x) = λ ϕ(x) 2 Vt(λ) 1, the third and last inequalities has made use of the fact that 1 λ < 1 because λ > 1 (note that if λ > 0, then there will be an additional multiplier factor of 1/λ), the fourth inequality follows since we have assumed that all observations are bounded: |ys| By, s 1. 
Next, define µ t 1(x) as the standard GP posterior mean without censoring the observations (i.e., replace every ys,t by ys in Eq. (1)). Now the first term in Eq. (4) can be upper-bounded as: ϕ(x) Vt(λ) 1 t 1 X s=1 ϕ(xs)ys1{ds m} = ϕ(x) Vt(λ) 1 t 1 X s=1 ϕ(xs)ys(ρm + ηs) ϕ(x) Vt(λ) 1 t 1 X s=1 ϕ(xs)ysηs + ρm ϕ(x) Vt(λ) 1 t 1 X s=1 ϕ(xs)ys ϕ(x) f = ϕ(x) Vt(λ) 1 t 1 X s=1 ϕ(xs)ysηs + ρm µ t 1(x) f(x) ϕ(x) Vt(λ) 1 t 1 X s=1 ϕ(xs)ysηs + Bf + R p 2(γt 1 + 1 + log(2/δ)) σt 1(x). In the first equality, we have defined ηs 1{ds m} ρm. In the second equality, we have again used the expression for the GP posterior mean and the reproducing property for f. The last inequality follows from Theorem 2 of Chowdhury & Gopalan (2017), which holds with the probability of 1 δ/2. We have also used the fact that ρm 1 in the last inequality. To bound the first term in Eq. (6), we firstly define η s ysηs, which is by definition a By-sub-Gaussian noise. Also define η 1:t 1 = [η s]s=1,...,t 1 which is a (t 1) 1 dimensional vector and define Φt 1 [ϕ(x1), . . . , ϕ(xt 1)] which is a (t 1) -dimensional matrix. These definitions imply that Kt 1 = Φt 1Φ t 1 and Vt(λ) = Φ t 1Φt 1 + λI. Also define λ = 1 + η7 with η > 0. Based on these definitions, we can upper-bound Pt 1 s=1 ϕ(xs)ysηs Vt(λ) 1 by s=1 ϕ(xs)ysηs s=1 ϕ(xs)η s s=1 η sϕ(xs) ! Vt(λ) 1 t 1 X s=1 η sϕ(xs) η 1:t 1Φt 1(Φ t 1Φt 1 + λI) 1Φ t 1η 1:t 1 η 1:t 1Φt 1Φ t 1(Φt 1Φ t 1 + λI) 1η 1:t 1 η 1:t 1Kt 1(Kt 1 + (1 + η)I) 1η 1:t 1 η 1:t 1(Kt 1 + ηI)(Kt 1 + (1 + η)I) 1η 1:t 1 η 1:t 1((Kt 1 + ηI) 1 + I) 1η 1:t 1 = η 1:t 1 ((Kt 1+ηI) 1+I) 1 . In the equality in the third line, we have made use of the following matrix equality: (Φ t 1Φt 1 + λI) 1Φ t 1 = Φ t 1(Φt 1Φ t 1 + λI) 1, and in the second last equality, we have used the matrix equality of (Kt 1 + ηI)(Kt 1 + ηI + I) 1 = ((Kt 1 + ηI) 1 + I) 1. Next, making use of Theorem 1 of Chowdhury & Gopalan (2017) and following 7The value of λ is set as (1 + 2/T) (Chowdhury & Gopalan, 2017). However, one can use a Doubling Trick (Besson & Kaufmann, 2018) for unknown T. Bayesian Optimization under Stochastic Delayed Feedback similar steps as those in the proof there, we have that with probability of 1 δ/2, η 1:t 1 ((Kt 1+ηI) 1+I) 1 By det((1 + η)I + Kt 1) 2(γt 1 + 1 + log(2/δ)). Next, combining Eq. (7) and Eq. (8) above allows us to upper-bound the first term in Eq. (6): ϕ(x) Vt(λ) 1 t 1 X s=1 ϕ(xs)ysηs ϕ(x) Vt(λ) 1 Vt(λ) 1 t 1 X s=1 ϕ(xs)ysηs s=1 ϕ(xs)ysηs σt 1(x)By p 2(γt 1 + 1 + log(2/δ)), in which the second inequality has made use of the expression for the GP posterior variance: σ2 t 1(x) = λ ϕ(x) 2 Vt(λ) 1, and the last inequality follows from Eq. (7) and Eq. (8) (and hence holds with probability of 1 δ/2), as well as the fact that λ > 1. Now, we can plug in Eq. (9) as an upper bound on the first term in Eq. (6), and then use the resulting Eq. (6) as an upper bound on the first term in Eq. (4). Combining this with the upper bound on the second term of Eq. (4) given by Eq. (5), we have that |µt 1(x) ρmf(x)| Byσt 1(x) s=t m σt 1(xs) + Bf + R p 2(γt 1 + 1 + log(2/δ)) σt 1(x)+ σt 1(x)By p 2(γt 1 + 1 + log(2/δ)) s=t m σt 1(xs) + Bf + (R + By) p 2(γt 1 + 1 + log(2/δ)) = νtσt 1(x), in which we have plugged in the definition of νt in the last equality. This completes the proof. A.2. Proof of Theorem 2 Theorem 2 (GP-UCB-SDF). Let C1 = p 2/log(1 + λ 1). Then, with probability at least 1 δ, TγT + m By C2 1γT . Proof. 
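Before the proof proceeds, the bound being established in Theorem 2 can be written out in one display; this is our restatement, consistent with Eq. (11)–(13) derived below, where $C_1 = \sqrt{2/\log(1+\lambda^{-1})}$ and $\beta_T = B_f + (R+B_y)\sqrt{2(\gamma_{T-1}+1+\log(2/\delta))}$:
\[
R_T \;\le\; \frac{2}{\rho_m}\Big( C_1\,\beta_T\,\sqrt{T\,\gamma_T} \;+\; m\,B_y\,C_1^2\,\gamma_T \Big)
\qquad \text{with probability at least } 1-\delta .
\]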
To begin with, we can upper-bound the instantaneous regret rt = f(x ) f(xt) as rt = f(x ) f(xt) = 1 ρm [ρmf(x ) ρmf(xt)] ρm [µt 1(x ) + νtσt 1(x ) ρmf(xt)] ρm [µt 1(xt) + νtσt 1(xt) ρmf(xt)] ρm 2νtσt 1(xt), in which the first and last inequalities have made use of Theorem 1, and the second inequality follows because xt is selected using the UCB policy: xt = argmaxx Q (µt 1 + νtσt 1(x)) (2). Now the cumulative regret can be upper-bounded as t=1 νtσt 1(xt) = 2 t=1 σt 1(xt)βt + 2 t=1 σt 1(xt) s=t m σt 1(xs) Bayesian Optimization under Stochastic Delayed Feedback where we have plugged in the expression of νt in the equality. Next, we use I(y1:T ; f1:T ) to denote the information gain from the noisy observations in the first T iterations about the objective function f. The first term in Eq. (11) can be upper-bounded as: t=1 σt 1(xt)βt 2 t=1 σt 1(xt) t=1 σ2 t 1(xt) 1 log(1 + λ 1) log(1 + λ 1σ2 t 1(xt)) 1 log(1 + λ 1)βT t=1 log(1 + λ 1σ2 t 1(xt)) 1 log(1 + λ 1)βT 2I(y1:T ; f) 2 log(1 + λ 1)βT The first inequality follows since βt is monotonically increasing in t, the inequality in the second line follows because λ 1a λ 1 log(1+λ 1) log(1+λ 1a) for a (0, 1) (substitute a = σ2 t 1(xt)), and the inequality in the fourth line follows from plugging in the expression for the information gain: I(y1:T ; f) = 1 2 PT t=1 log(1 + λ 1σ2 t 1(xt)) (Lemma 5.3 of Srinivas et al. (2010)). Next, we can derive an upper bound on the second term in Eq. (11): t=1 σt 1(xt) s=t m σt 1(xs) s=t m σt 1(xt)σt 1(xs) σ2 t 1(xt) + σ2 t 1(xs) σ2 t 1(xt) + σ2 s 1(xs) t=1 σ2 t 1(xt) ρm By C2 1γT . The first inequality has made use of the inequality of ab (a2 + b2)/2, the second inequality follows since σ2 t 1(xs) σ2 s 1(xs) which is because σ2 t 1(xs) is calculated by conditioning on more input locations than σ2 s 1(xs), and the last inequality follows easily from some of the intermediate steps in the derivations of Eq. (12). Lastly, we can plug in Eq. (12) and Eq. (13) as upper bounds on the two terms in Eq. (11), to obtain: ρm By C2 1γT = 2 TγT + m By C2 1γT . It completes the proof. Bayesian Optimization under Stochastic Delayed Feedback A.3. Proof of Theorem 3 First we define βt Bf + (R + By) p 2(γt 1 + 1 + log(4/δ)), νt By Pt 1 s=t m σt 1(xs) + βt and ct νt(1 + p 2 log(|Q|t2)). Note that we have replaced the δ in the definition of βt (Theorem 1) by δ/2. We use Ft 1 to denote the filtration containing the history of selected inputs and observed outputs up to iteration t 1. To begin with, we define two events Ef(t) and Eft(t) through the following two lemmas. Lemma 1. Let δ (0, 1). Define Ef(t) as the event that |µt 1(x) ρmf(x)| νtσt 1(x) for all x Q. We have that P Ef(t) 1 δ/2 for all t 1. Note that Lemma 1 is the same as Theorem 1 except that we have replaced the error probability of δ by δ/2. Lemma 2. Define Eft(t) as the event that |ft(x) µt 1(x)| νt p 2 log(|Q|t2)σt 1(x). We have that P Eft(t)|Ft 1 1 1/t2 for any possible filtration Ft 1. Lemma 2 makes use of the concentration of a random variable sampled from a Gaussian distribution, and its proof follows from Lemma 5 of Chowdhury & Gopalan (2017). Next, we define a set of saturated points in every iteration, which intuitively represents those inputs that lead to large regrets in an iteration. Definition 1. In iteration t, define the set of saturated points as St = {x Q : ρm (x) > ctσt 1(x)}, where (x) = f(x ) f(x) and x arg maxx Q f(x). 
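Definition 1 will be used through its complement: every unsaturated point has suboptimality controlled by its posterior standard deviation, which is the property exploited in the lemmas below. Spelled out (a direct consequence of the definition):
\[
x \in Q \setminus S_t \;\;\Longrightarrow\;\; \Delta(x) \;=\; f(x^*) - f(x) \;\le\; \frac{c_t}{\rho_m}\,\sigma_{t-1}(x).
\]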
Based on this definition, we will later prove that we can get a lower bound on the probability that the selected input in iteration t is unsaturated (Lemma 4). To do that, we first need the following auxiliary lemma. Lemma 3. For any filtration Ft 1, conditioned on the event Ef(t), we have that x Q, P (ft(x) > ρmf(x)|Ft 1) p, (14) where p = 1 4e π. Proof. Adding and subtracting µt 1(x) νtσt 1(x) both sides of P (ft(x) > ρmf(x)|Ft 1), we get P (ft(x) > ρmf(x)|Ft 1) = P ft(x) µt 1(x) νtσt 1(x) > ρmf(x) µt 1(x) νtσt 1(x) |Ft 1 P ft(x) µt 1(x) νtσt 1(x) > |ρmf(x) µt 1(x)| νtσt 1(x) |Ft 1 P ft(x) µt 1(x) νtσt 1(x) > 1|Ft 1 in which the second inequality makes use of Lemma 1 (note that we have conditioned on the event Ef(t) here), and the last inequality follows from the Gaussian anti-concentration inequality: P(z > a) exp( a2)/(4 πa) where z N(0, 1). The next lemma proves a lower bound on the probability that the selected input xt is unsaturated. Lemma 4. For any filtration Ft 1, conditioned on the event Ef(t), we have that, P (xt Q \ St|Ft 1) p 1/t2. Proof. To begin with, we have that P (xt Q \ St|Ft 1) P (ft(x ) > ft(x), x St|Ft 1) . (16) Bayesian Optimization under Stochastic Delayed Feedback This inequality can be justified because the event on the right hand side implies the event on the left hand side. Specifically, according to Definition 1, x is always unsaturated because (x ) = 0 < ctσt 1(x ). Therefore, because xt is selected by xt = arg maxx Qft(x), we have that if ft(x ) > ft(x), x St, then the selected xt is guaranteed to be unsaturated. Now conditioning on both events Ef(t) and Eft(t), for all x St, we have that ft(x) ρmf(x) + ctσt 1(x) ρmf(x) + ρm (x) = ρmf(x) + ρmf(x ) ρmf(x) = ρmf(x ), (17) in which the first inequality follows from Lemma 1 and Lemma 2 and the second inequality makes use of Definition 1. Next, separately considering the cases where the event Eft(t) holds or not and making use of Eq. (16) and Eq. (17), we have that P (xt Q \ St|Ft 1) P (ft(x ) > ft(x), x St|Ft 1) P (ft(x ) > f(x )|Ft 1) P Eft(t)|Ft 1 in which the last inequality has made use of Lemma 2 and Lemma 3. Next, we use the following lemma to derive an upper bound on the expected instantaneous regret. Lemma 5. For any filtration Ft 1, conditioned on the event Ef(t), we have that, E[rt|Ft 1] 11ct ρmp E [σt 1(xt)|Ft 1] + 2Bf Proof. To begin with, define xt as the unsaturated input with the smallest GP posterior standard deviation: xt = arg minx Q\Stσt 1(x). (19) This definition gives us: E [σt 1(xt)|Ft 1] E [σt 1(xt)|Ft 1, xt Q \ St] P (xt Q \ St|Ft 1) σt 1(xt)(p 1/t2), (20) in which the second inequality makes use of Lemma 4, as well as the definition of xt. Next, conditioned on both events Ef(t) and Eft(t), we can upper-bound the instantaneous regret as rt = f(x ) f(xt) = 1 ρm [ρmf(x ) ρmf(xt) + ρmf(xt) ρmf(xt)] ρm [ρm (xt) + ft(xt) + ctσt 1(xt) ft(xt) + ctσt 1(xt)] ρm [ctσt 1(xt) + ft(xt) + ctσt 1(xt) ft(xt) + ctσt 1(xt)] ρm [2ctσt 1(xt) + ctσt 1(xt)] , in which the first inequality follows from Lemma 1 and Lemma 2 as well as the definition of ( ) (Definition 1), the second inequality follows because xt is unsaturated, and the last inequality follows because ft(xt) ft(xt) 0 since xt = arg maxx Qft(x). 
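The chain of inequalities above, conditioned on both events $E^f(t)$ and $E^{f_t}(t)$ and writing $\overline{x}_t$ for the unsaturated input with the smallest posterior standard deviation from Eq. (19), can be laid out explicitly; the following is our cleaned-up restatement of that display:
\[
r_t \;=\; \frac{1}{\rho_m}\big[\rho_m f(x^*) - \rho_m f(\overline{x}_t) + \rho_m f(\overline{x}_t) - \rho_m f(x_t)\big]
\;\le\; \frac{1}{\rho_m}\big[\rho_m \Delta(\overline{x}_t) + f_t(\overline{x}_t) + c_t\,\sigma_{t-1}(\overline{x}_t) - f_t(x_t) + c_t\,\sigma_{t-1}(x_t)\big]
\]
\[
\;\le\; \frac{1}{\rho_m}\big[2\,c_t\,\sigma_{t-1}(\overline{x}_t) + f_t(\overline{x}_t) - f_t(x_t) + c_t\,\sigma_{t-1}(x_t)\big]
\;\le\; \frac{1}{\rho_m}\big[2\,c_t\,\sigma_{t-1}(\overline{x}_t) + c_t\,\sigma_{t-1}(x_t)\big],
\]
where the first inequality uses $|f_t(x) - \rho_m f(x)| \le c_t\,\sigma_{t-1}(x)$ (Lemmas 1 and 2), the second uses $\rho_m\Delta(\overline{x}_t) \le c_t\,\sigma_{t-1}(\overline{x}_t)$ (since $\overline{x}_t$ is unsaturated), and the last uses $f_t(\overline{x}_t) \le f_t(x_t)$ (since $x_t$ maximizes $f_t$).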
Next, by separately considering the cases where the event Eft(t) holds and otherwise, we are ready to upper-bound the expected instantaneous regret: E[rt|Ft 1] 1 ρm E[2ctσt 1(xt) + ctσt 1(xt)|Ft 1] + 2Bf ρm E 2ct σt 1(xt) p 1/t2 + ctσt 1(xt)|Ft 1 2 p 1/t2 + 1 E [σt 1(xt)|Ft 1] + 2Bf ρmp E [σt 1(xt)|Ft 1] + 2Bf in which the second inequality results from Eq. (20), and the last inequality follows because 2 p 1/t2 10/p and 1 1/p. Bayesian Optimization under Stochastic Delayed Feedback Next, we define the following stochastic process (Yt : t = 0, . . . , T), which we prove is a super-martingale in the subsequent lemma by making use of Lemma 5. Definition 2. Define Y0 = 0, and for all t = 1, . . . , T, rt = rt I{Ef(t)}, Xt = rt 11ct ρmpσt 1(xt) 2Bf t2 , and Yt = Lemma 6. (Yt : t = 0, . . . , T) is a super-martingale with respect to the filtration Ft. Proof. As Xt = Yt Yt 1, we have E [Yt Yt 1|Ft 1] = E [Xt|Ft 1] ρmpσt 1(xt) 2Bf = E [rt|Ft 1] 11ct ρmp E [σt 1(xt)|Ft 1] + 2Bf When the event Ef(t) holds, the last inequality follows from Lemma 5; when Ef(t) is false, rt = 0 and hence the inequality trivially holds. Lastly, we are ready to prove the upper bound of GP-TS-SDF on the cumulative regret RT by applying the Azuma-Hoeffding Inequality to the stochastic process defined above. Theorem 3 (GP-TS-SDF). With probability at least 1 δ, TγT ( γT + 1) + m(γT + Proof. To begin with, we derive an upper bound on |Yt Yt 1|: |Yt Yt 1| = |Xt| |rt| + 11ct ρmpσt 1(xt) + 2Bf 1 ρmp (4Bf + 11ct) , where the second inequality follows because σt 1(xt) 1, and the last inequality follows since 1 ρmp 1. Now we are ready to apply the Azuma-Hoeffding Inequality to (Yt : t = 0, . . . , T) with an error probability of δ/2: 11ct ρmpσt 1(xt) + v u u t2 log(2/δ) ρmp (4Bf + 11ct) 2 2 log(|Q|T 2)) t=1 νtσt 1(xt) + Bπ2 3 + 4Bf + 11c T 2T log(2/δ) 2 log(|Q|T 2)) C1βT p TγT + m By C2 1γT + 4Bf + 11c T 2T log(2/δ) + Bπ2 in which the second inequality makes use of the fact that ct is monotonically increasing and that PT t=1 1/t2 π2/6, and the last inequality follows from the proof of Theorem 2 (Appendix A.2) which gives an upper bound on PT t=1 νtσt 1(xt). Note that Eq. (23) holds with probability 1 δ/2. Also note that rt = rt with probability of 1 δ/2 because the event Bayesian Optimization under Stochastic Delayed Feedback Ef(t) holds with probability of 1 δ/2 (Lemma 1), therefore, the upper bound from Eq. (23) is an upper bound on RT = PT t=1 rt with probability of 1 δ. Lastly, we can simplify the regret upper bound from Eq. (23) into asymptotic notation. Note that ct = νt(1+ p 2 log(|Q|t2)), νt = By Pt 1 s=t m σt 1(xs) + βt and βt = Bf + (R + By) p 2(γt 1 + 1 + log(4/δ)). To simplify notations, we analyze the asymptotic dependence while ignoring all log factors, and ignore the dependence on constants including B, By and R. This gives us βT = O( γT ) and c T = O(m + γT ). As a result, the regret upper bound can be simplified as T + mγT + 1 ρm (m + γT ) TγT ( γT + 1) + m(γT + It completes the proof. A.4. Time based Censored Feedback Let m be the amount of time the learner waits for the delayed feedback, τt be the time of initiating the selection process for the t-th query, and τt be the time of starting the t-th query. The censored feedback of s-th query before selecting t-th query is denoted by ys,t, where ys,t = ys1{ds min(m, τt τs)}. 
Now, the GP posterior mean and covariance functions can be expressed using available information of function queries and their censored feedback as follows: µt 1(x) = kt 1(x) K 1 t,λ yt 1, σ2 t 1(x, x ) = k(x, x ) kt 1(x) K 1 t,λkt 1(x ), (24) where Kt,λ = Kt 1 + λI, kt 1(x) = (k(x, xt )) τt [0, τt], yt 1 = ( ys,t) τs [0, τt], Kt 1 = (k(xt , xt ))τt ,τt [0, τt] and λ is a regularization parameter to ensures Kt,λ is an invertible matrix. Our current regret analysis will work for this setting as well after appropriately changing the used variables. B. Theoretical Analysis of Contextual Gaussian Process Bandit Our regret upper bounds in Theorem 2 and Theorem 3 are also applicable in the contextual GP bandit setting after some slight modifications to their proofs. Note that in the contextual GP bandit setting (Section 5), the only major difference with the BO setting is that now the objective function g depends on both a context vector zt Z and an input vector xt Q, in which every zt is generated by the environment and xt is selected by our GP-UCB-SDF or GP-TS-SDF algorithms. Therefore, the most important modifications to the proofs of Theorem 2 and Theorem 3 involve replacing the original input space of Q by the new input space of Z Q and accounting for the modified definition of regret in Section 5. To begin with, with the same definition of νt, it is easy to verify that Theorem 1 can be extended to the contextual GP bandit setting: Theorem 4 (Confidence Ellipsoid for Contextual GP Bandit). With probability at least 1 δ, |µt 1(z, x) ρmg(z, x)| νtσt 1(z, x), z Z, x Q. The proof of Theorem 4 is the same as that of Theorem 1 except that the input x in the proof of Theorem 1 is replaced by (z, x). B.1. GP-UCB-SDF for Contextual Gaussian Process Bandit For the GP-UCB-SDF algorithm in the contextual GP bandit setting, the instantaneous regret rt can be analyzed in a similar way to Eq. (10) in the proof of Theorem 2 (Appendix A.2): rt = g(zt, x t ) g(zt, xt) = 1 ρm [ρmg(zt, x t ) ρmg(zt, xt)] ρm [µt 1(zt, x t ) + νtσt 1(zt, x t ) ρmg(zt, xt)] ρm [µt 1(zt, xt) + νtσt 1(zt, xt) ρmg(zt, xt)] Bayesian Optimization under Stochastic Delayed Feedback ρm 2νtσt 1(zt, xt), in which the first and last inequalities both make use of Theorem 4, and the second inequality follows from the way in which xt is selected: xt = arg maxx Qµt 1(zt, x) + νtσt 1(zt, x). Following this, the subsequent steps in the proof in Appendix A.2 immediately follow, hence producing the same regret upper bound as Theorem 2. B.2. GP-TS-SDF for Contextual Gaussian Process Bandit To begin with, the events Ef(t) (Lemma 1) and Eft(t) (Lemma 2) can be similarly defined: Lemma 7. Let δ (0, 1). Define Ef(t) as the event that |µt 1(z, x) ρmg(z, x)| νtσt 1(z, x) for all z Z, x Q. We have that P Ef(t) 1 δ/2 for all t 1. Lemma 8. Define Eft(t) as the event that |gt(z, x) µt 1(z, x)| νt p 2 log(|Z||Q|t2)σt 1(z, x). We have that P Eft(t)|Ft 1 1 1/t2 for any possible filtration Ft 1. The validity of Lemma 7 follows immediately from Theorem 4 (after replacing the error probability of δ by δ/2 in the definition of νt, which we have omitted here for simplicity), and the proof of Lemma 8 is the same as that of Lemma 2. Next, we modify the definition of saturated points from Definition 1: Definition 3. In iteration t, given a context vector zt, define the set of saturated points as St = {x Q : ρm (zt, x) > ctσt 1(zt, x)}, where (zt, x) = g(zt, x t ) g(zt, xt) and x t = argmax x Q g(zt, x). Note that according to this definition, x t is always unsaturated. 
The next lemma is a counterpart of Lemma 3, and the proofs are the same. Lemma 9. For any filtration Ft 1, conditioned on the event Ef(t), we have that z Z, x Q, P (gt(z, x) > ρmg(z, x)|Ft 1) p, where p = 1 4e π. The next lemma, similar to Lemma 4, shows that in contextual GP bandit problems, the probability that the selected xt is unsaturated can also be lower-bounded. Lemma 10. For any filtration Ft 1, given a context vector zt, conditioned on the event Ef(t), we have that, P (xt Q \ St|Ft 1) p 1/t2. Proof. To begin with, we have that P (xt Q \ St|Ft 1) P (gt(zt, x t ) > gt(zt, x), x St|Ft 1) . (25) The validity of this equation can be seen following the same analysis of Eq. (16), i.e., the event on the right hand side implies the event on the left hand side. Now conditioning on both events Ef(t) and Eft(t), for all x St, we have that gt(zt, x) ρmg(zt, x) + ctσt 1(zt, x) ρmg(zt, x) + ρm (zt, x) = ρmg(zt, x) + ρmg(zt, x t ) ρmg(zt, x) = ρmg(zt, x t ), (26) in which the first inequality follows from Lemma 7 and Lemma 8 and the second inequality makes use of Definition 3. Next, separately considering the cases where the event Eft(t) holds or not and making use of Eq. (25) and Eq. (26), we have that P (xt Q \ St|Ft 1) P (gt(zt, x t ) > gt(zt, x), x St|Ft 1) P (gt(zt, x t ) > ρmg(zt, x t )|Ft 1) P Eft(t)|Ft 1 where the last inequality has made use of Lemma 8 and Lemma 9. Bayesian Optimization under Stochastic Delayed Feedback Lemma 11. For any filtration Ft 1, given a context vector zt, conditioned on the event Ef(t), we have that, E[rt|Ft 1] 11ct ρmp E [σt 1(zt, xt)|Ft 1] + 2Bf Proof. To begin with, given a context vector zt, define xt as the unsaturated input with the smallest GP posterior standard deviation: xt = arg minx Q\Stσt 1(zt, x). (27) This definition gives us: E [σt 1(zt, xt)|Ft 1] E [σt 1(zt, xt)|Ft 1, xt Q \ St] P (xt Q \ St|Ft 1) σt 1(zt, xt)(p 1/t2), (28) in which the second inequality makes use of Lemma 4, as well as the definition of xt. Next, conditioned on both events Ef(t) and Eft(t), we can upper-bound the instantaneous regret as rt = g(zt, x t ) g(zt, xt) ρm [ρmg(zt, x t ) ρmg(zt, xt) + ρmg(zt, xt) ρmg(zt, xt)] ρm [ρm (zt, xt) + gt(zt, xt) + ctσt 1(zt, xt) gt(zt, xt) + ctσt 1(zt, xt)] ρm [ctσt 1(zt, xt) + gt(zt, xt) + ctσt 1(zt, xt) gt(zt, xt) + ctσt 1(zt, xt)] ρm [2ctσt 1(zt, xt) + ctσt 1(zt, xt)] , where the first inequality follows from Lemma 7 and Lemma 8 as well as the definition of ( ) (Definition 3), the second inequality follows because xt is unsaturated, and the last inequality follows because gt(zt, xt) gt(zt, xt) 0 since xt = arg maxx Qgt(zt, x). Next, the proof in Eq. (21) can be immediately applied, hence producing the upper bound on the instantaneous regret in this lemma. Now the remaining steps in the proof of Theorem 3 in Appendix A.3 immediately follow. Specifically, we can first define a stochastic process (Yt : t = 0, . . . , T) in the same way as Definition 2, and then use Lemma 6 to show that (Yt : t = 0, . . . , T) is super-martingale. Lastly, we can apply the Azuma Hoeffding Inequality to (Yt : t = 0, . . . , T) in the same way as Theorem 3, which completes the proof. C. More Experimental Details In our experiments, since it has been repeatedly observed that the theoretical value of βt is overly conservative (Srinivas et al., 2010), we set βt to be a constant (βt = 1) for all methods. 
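To make the setup concrete, the following is a minimal, self-contained sketch of one GP-UCB-SDF run with censored feedback on the synthetic setting described in this appendix. It is not the authors' code: the Poisson delay distribution, the observation noise level, and all variable names are our illustrative assumptions; the SE kernel with lengthscale 0.02, the [0, 1] normalization, the censor-to-minimum-value convention, and the constant exploration coefficient of 1 follow the descriptions given here.

import numpy as np

# Minimal sketch (not the paper's released implementation) of GP-UCB with censored,
# stochastically delayed feedback. Pending observations are replaced by the minimum
# function value (0 after normalizing f into [0, 1]); the delay distribution below is
# an illustrative assumption.
rng = np.random.default_rng(0)
X = np.linspace(0.0, 1.0, 1000)              # 1-D domain: equally spaced grid on [0, 1]

def se_kernel(a, b, lengthscale=0.02):
    """Squared-exponential kernel matrix between 1-D input vectors a and b."""
    return np.exp(-0.5 * (a[:, None] - b[None, :]) ** 2 / lengthscale ** 2)

# Synthetic objective: a sample from the GP prior, rescaled into [0, 1].
f = rng.multivariate_normal(np.zeros(len(X)), se_kernel(X, X) + 1e-8 * np.eye(len(X)))
f = (f - f.min()) / (f.max() - f.min())

def gp_posterior(x_train, y_train, x_query, noise=1e-2):
    """GP posterior mean and standard deviation at x_query given (x_train, y_train)."""
    if len(x_train) == 0:
        return np.zeros(len(x_query)), np.ones(len(x_query))
    K_inv = np.linalg.inv(se_kernel(x_train, x_train) + noise * np.eye(len(x_train)))
    k_star = se_kernel(x_query, x_train)
    mean = k_star @ K_inv @ y_train
    var = np.clip(1.0 - np.einsum('ij,jk,ik->i', k_star, K_inv, k_star), 1e-12, None)
    return mean, np.sqrt(var)

T, m, beta = 150, 10, 1.0                    # iterations, maximum wait, fixed coefficient
queried_idx, true_y, delays = [], [], []

for t in range(T):
    # Censoring: the observation of query s is available at iteration t only if its
    # random delay has elapsed within the allowed window; otherwise it is replaced by 0.
    y_cens = np.array([y if d <= min(m, t - s) else 0.0
                       for s, (y, d) in enumerate(zip(true_y, delays))])
    mu, sigma = gp_posterior(X[queried_idx], y_cens, X)
    xt = int(np.argmax(mu + beta * sigma))   # UCB query selection
    queried_idx.append(xt)
    true_y.append(float(f[xt] + 1e-2 * rng.standard_normal()))
    delays.append(int(rng.poisson(5)))       # stochastic delay (illustrative)

# Simple regret computed from converted (non-pending) observations only, as described below.
converted = [y for s, (y, d) in enumerate(zip(true_y, delays)) if d <= T - s]
print("simple regret:", f.max() - max(converted))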
In all experiments and for all methods, we use the SE kernel for the GP and optimize the GP hyperparameters by maximizing the marginal likelihood after every 10 iterations. In all experiments where we report the simple regret (e.g., Fig. 1, Fig. 2, etc.), we calculate the simple regret in an iteration using only those function evaluations that have converted (i.e., we ignore all pending observations).

C.1. Synthetic Experiment

In the synthetic experiment, the objective function f is sampled from a GP using the SE kernel with a lengthscale of 0.02 defined on a 1-dimensional domain. The domain is an equally spaced grid of size 1000 within the range [0, 1]. The sampled function is normalized into the range [0, 1].

Figure 5: Performances of GP-TS-SDF and other TS-based baseline methods for (a) stochastic and (b) deterministic delay distributions in the synthetic experiment (Section 6.1). (Both panels plot simple regret against iterations.)

C.2. Real-world Experiments

In the SVM hyperparameter tuning experiment, the diabetes diagnosis dataset can be found at https://www.kaggle.com/uciml/pima-indians-diabetes-database. We use 70% of the dataset as the training set and the remaining 30% as the validation set. For every evaluated hyperparameter configuration, we train the SVM on the training set with that configuration and then evaluate the learned SVM on the validation set; the resulting validation accuracy is reported as the observation. We tune the penalty parameter within the range [10^-4, 100] and the RBF kernel parameter within [10^-4, 10].

In the experiment on hyperparameter tuning for CNN, we use the MNIST dataset and follow the default training-testing set partition given by the PyTorch package. The CNN consists of one convolutional layer (with 8 channels and convolutional kernels of size 3), followed by a max-pooling layer (with a pooling size of 3), and then a fully connected layer (with 8 nodes). We use the ReLU activation function and the Adam optimizer. We tune the batch size (within [128, 512]), the learning rate (within [10^-6, 1]), and the learning rate decay (within [10^-6, 1]).

In the hyperparameter tuning experiment for LR, the breast cancer dataset we have adopted can be found at https://archive.ics.uci.edu/ml/datasets/Breast+Cancer+Wisconsin+(Diagnostic). Similar to the SVM experiment, we use 70% of the dataset as the training set and the remaining 30% as the testing set. Here we tune three hyperparameters of the LR model: the batch size (within [32, 128]), the learning rate (within [10^-6, 1]), and the learning rate decay (within [10^-6, 1]).

C.3. Real-world Experiments on Contextual Gaussian Process Bandits

For the multi-task BO experiment, the tabular benchmark on hyperparameter tuning of SVM was introduced by Wistuba et al. (2015) and can be found at https://github.com/wistuba/TST. The dataset consists of 50 tasks, and each task corresponds to a different classification problem with a different dataset. The domain of hyperparameter configurations in this benchmark (which is shared by all 50 tasks) consists of discrete values for six hyperparameters: 3 binary parameters indicating (via one-hot encoding) whether the linear, polynomial, or RBF kernel is used, and the penalty parameter, the degree for the polynomial kernel, and the RBF kernel parameter.
The resulting domain is of size 288, and for each task, the validation accuracy of every one of the 288 hyperparameter configurations is recorded as the observation. As mentioned in the main text (Section 6.2), each of the 50 tasks is associated with some meta-features; we use the first six meta-features as the context, so every task is characterized by a 6-dimensional context vector zt. For the non-stationary BO task, the dataset and the other experimental settings (e.g., the tuned hyperparameters and their ranges) are the same as those of the SVM hyperparameter tuning experiment in Section 6.2; refer to Appendix C.2 for more details.
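As a closing illustration of how the contextual query selection in this benchmark can be implemented, the sketch below puts the GP on the joint (context, configuration) space and maximizes the UCB score over the 288 configurations for the current task context. Concatenating the 6-dimensional context with the configuration, using a single SE kernel on the joint vector, and the placeholder data are our assumptions for illustration; they are not details prescribed by the paper.

import numpy as np

# Sketch of contextual query selection (GP-UCB-SDF style) for the multi-task benchmark:
# the GP input is the concatenation of the task context z_t and a candidate configuration
# x, and the acquisition is maximized over the 288 tabular configurations. All data below
# are random placeholders; in the experiment they come from the benchmark and from the
# censored feedback collected so far.
rng = np.random.default_rng(0)

def se_kernel(A, B, lengthscale=1.0):
    """Squared-exponential kernel between rows of A and rows of B."""
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-0.5 * d2 / lengthscale ** 2)

configs = rng.random((288, 6))                     # stand-in for the 288 configurations
z_t = rng.random(6)                                # stand-in for the task's 6 meta-features
candidates = np.hstack([np.tile(z_t, (288, 1)), configs])    # joint inputs (z_t, x)

X_train = rng.random((20, 12))                     # past (context, configuration) queries
y_cens = rng.random(20)                            # their censored observations

K_inv = np.linalg.inv(se_kernel(X_train, X_train) + 1e-2 * np.eye(len(X_train)))
k_star = se_kernel(candidates, X_train)
mu = k_star @ K_inv @ y_cens
sigma = np.sqrt(np.clip(1.0 - np.einsum('ij,jk,ik->i', k_star, K_inv, k_star), 1e-12, None))

nu_t = 1.0                                         # exploration coefficient fixed to 1, as in the experiments
print("selected configuration index:", int(np.argmax(mu + nu_t * sigma)))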