# Delayed Feedback in Kernel Bandits

Sattar Vakili¹, Danyal Ahmed¹, Alberto Bernacchia¹, Ciara Pike-Burke²

Black-box optimisation of an unknown function from expensive and noisy evaluations is a ubiquitous problem in machine learning, academic research and industrial production. An abstraction of the problem can be formulated as a kernel-based bandit problem (also known as Bayesian optimisation), where a learner aims at optimising a kernelized function through sequential noisy observations. The existing work predominantly assumes that feedback is immediately available, an assumption which fails in many real-world situations, including recommendation systems, clinical trials and hyperparameter tuning. We consider a kernel bandit problem under stochastically delayed feedback, and propose an algorithm with $\tilde{O}(\sqrt{\Gamma_k(T)T} + \mathbb{E}[\tau])$ regret, where $T$ is the number of time steps, $\Gamma_k(T)$ is the maximum information gain of the kernel with $T$ observations, and $\tau$ is the delay random variable. This represents a significant improvement over the state-of-the-art regret bound of $\tilde{O}(\Gamma_k(T)\sqrt{T} + \mathbb{E}[\tau]\Gamma_k(T))$ reported in Verma et al. (2022). In particular, for very non-smooth kernels, the information gain grows almost linearly in time, trivializing the existing results. We also validate our theoretical results with simulations.

¹MediaTek Research, ²Imperial College London. Correspondence to: Sattar Vakili.

Proceedings of the 40th International Conference on Machine Learning, Honolulu, Hawaii, USA. PMLR 202, 2023. Copyright 2023 by the author(s).

1. Introduction

The kernel bandit problem is a flexible framework which captures the problem of learning to optimise an unknown function through successive input queries. Typically, the game proceeds in rounds where, in each round, the learner selects an input point to query and then immediately receives a noisy observation of the function at that point. This observation can be used immediately to improve the learner's decision of which point to query next. Due to its generality, the kernel bandit problem has become very popular in practice. In particular, it enables us to sequentially learn to optimise a variety of different functions without needing to know many details about the functional form.

However, in many settings where we may want to use kernel bandits, we also have to deal with delayed observations. For example, consider using kernel bandits to sequentially learn to select the optimal conditions for a chemical experiment. The chemical reactions may not be instantaneous, but instead take place at some random time in the future. If we start running a sequence of experiments, we can start new experiments before receiving the stochastically delayed feedback from the previous ones. However, in this situation we have to update the conditions for future experiments before receiving all the feedback from previous experiments. Similar situations arise in recommendation systems, clinical trials and hyperparameter tuning, so it is of practical relevance to design kernel bandit algorithms that are able to deal with delayed feedback.

Moreover, a big challenge for existing kernel bandit algorithms is their computational complexity. Generally speaking, in each round $t$, algorithms for kernel bandits require fitting a kernel model to the $t$ observed data points, which can have an $O(t^3)$ complexity. To reduce the complexity, there has been a recent interest in considering batch versions of kernel bandit algorithms.
These algorithms select $\tau$ input values using the same model, then update the model after receiving all observations in the batch. This corresponds to a delay of at most $\tau$ in receiving each observation, and can thus be thought of as an instance of delayed feedback in kernel bandits.

In this paper, we study the kernel bandit problem with stochastically delayed feedback. We propose Batch Pure Exploration with Delays (BPE-Delay), an adaptation of the Batch Pure Exploration algorithm (Li and Scarlett, 2022) to the delayed feedback setting, and show that, under mild assumptions on the unknown delay distribution, BPE-Delay achieves near-optimal regret. Indeed, we prove that BPE-Delay enjoys regret scaling as $\tilde{O}(\sqrt{\Gamma_k(T)T} + \mathbb{E}[\tau])$,¹ where $T$ is the number of time steps, $\Gamma_k(T)$ is the maximum information gain of the kernel $k$ with $T$ observations (see Section 2), and $\tau$ is the delay random variable. This result essentially shows that the impact of delayed feedback on the regret of this algorithm is independent of the horizon $T$ and the problem parameter $\Gamma_k(T)$. This desirable property means that as $T$ or $\Gamma_k(T)$ increase, the impact of the delay remains the same. We note that for linear models, $\Gamma_k(T) = O(d\log(T))$, where $d$ is the input dimension. In the case of very smooth kernels, such as the Squared Exponential (SE), $\Gamma_k(T) = \mathrm{poly}\log(T)$. However, $\Gamma_k(T)$ can become arbitrarily close to linear in $T$ for less smooth kernels. For example, in the case of a Matérn kernel with smoothness parameter $\nu$, we have $\Gamma_k(T) = \tilde{O}(T^{\frac{d}{2\nu+d}})$ (Vakili et al., 2021c). Therefore, our results represent a significant theoretical improvement over the existing work on kernel bandits with delayed feedback, where an $\tilde{O}(\Gamma_k(T)\sqrt{T} + \mathbb{E}[\tau]\Gamma_k(T))$ regret bound was shown for an algorithm based on upper confidence bound (UCB) acquisition of observation points (Verma et al., 2022). In particular, for non-smooth kernels, our result reduces the delay penalty from possibly near $O(\mathbb{E}[\tau]T)$ to just $O(\mathbb{E}[\tau])$. We demonstrate in Section 7 that these theoretical improvements are also visible experimentally. In addition, when applied to the special case of linear bandits (a kernel bandit problem with a linear kernel), our results improve the dependency of the delay-related term in the regret bound on the input dimension by a $d^{3/2}$ factor compared to the state of the art in Howson et al. (2022). A detailed comparison with related work is provided in Section 2.

¹We use the notations $O$ and $\tilde{O}$ to denote order, and order up to logarithmic factors, respectively.

2. Background and Related Work

In this section, we give an overview of the background on kernel bandits with immediate feedback, and on delayed feedback in simpler stochastic bandit formulations (namely, in $K$-armed and linear bandits). We then provide a detailed comparison with the most related work on kernel bandits with delayed feedback.

Kernel bandits with immediate feedback. In the typical kernel bandit setting with immediate feedback, classic algorithms such as selecting the query points based on upper confidence bounds (GP-UCB) (Srinivas et al., 2010) and Thompson sampling (Chowdhury and Gopalan, 2017) achieve a regret bound of $\tilde{O}(\Gamma_k(T)\sqrt{T})$. This regret bound is suboptimal (see Vakili et al., 2021e, for details), and can be improved to $\tilde{O}(\sqrt{\Gamma_k(T)T})$ using more recent work. In particular, Batch Pure Exploration (BPE) introduced in Li and Scarlett (2022) and Threshold Domain Shrinking (GP-ThreDS) proposed in Salgia et al. (2021) achieve this improved regret bound.
Here, $\Gamma_k(T)$ is a kernel-specific complexity term, which can be interpreted as the information gain or the effective dimension.

Information gain and effective dimension. While the feature space representation of common kernels is infinite-dimensional, with a finite data set only a finite number of features have a significant effect on the kernel-based regression. That leads to the definition of the effective dimension. In particular, consider a finite set $\mathcal{X}_T = \{X_1, \dots, X_T\}$ of observation points and a positive definite kernel $k: \mathcal{X} \times \mathcal{X} \to \mathbb{R}$. Let $K_{\mathcal{X}_T} = [k(X_t, X_{t'})]_{t,t'=1}^T$ be the kernel matrix resulting from the pairwise kernel values between the observation points. The effective dimension for a given kernel and observation set is often defined as (Zhang, 2005; Valko et al., 2013)

$$d_k(T) = \mathrm{tr}\left(K_{\mathcal{X}_T}\left(K_{\mathcal{X}_T} + \lambda^2 I_T\right)^{-1}\right), \qquad (1)$$

where $\lambda > 0$ is a free parameter and $I_T$ denotes the identity matrix of dimension $T$. It is also useful to define a related notion of information gain. For this, assume $f$ is a centered Gaussian process (GP) on the domain $\mathcal{X}$ with kernel $k$. The information gain then refers to the mutual information $I(\mathbf{y}_T; f)$ between the data values $\mathbf{y}_T = [y_t = f(X_t) + \epsilon_t]_{t=1}^T$ and $f$ (assuming the surrogate GP distribution and a zero-mean Gaussian noise $\epsilon_t$ with variance $\lambda^2$). From the closed-form expression for the mutual information between two multivariate Gaussian distributions (Cover, 1999), it follows that $I(\mathbf{y}_T; f) = \frac{1}{2}\log\det\left(I_T + \frac{1}{\lambda^2}K_{\mathcal{X}_T}\right)$. We then define the data-independent and kernel-specific maximum information gain as follows (Srinivas et al., 2010):

$$\Gamma_k(T) = \sup_{\mathcal{X}_T \subset \mathcal{X}} I(\mathbf{y}_T; f). \qquad (2)$$

It is known that the information gain and the effective dimension are the same up to logarithmic factors. Specifically, we have $d_k(T) = O(I(\mathbf{y}_T; f))$ and $I(\mathbf{y}_T; f) = O(d_k(T)\log(T))$ (Calandriello et al., 2019). For specific kernels, explicit bounds on $\Gamma_k(T)$ are derived in Srinivas et al. (2010); Vakili et al. (2021b;c). We report our regret bounds in terms of the information gain; they can easily be restated in terms of the effective dimension, up to logarithmic factors. (A small numerical sketch of these two quantities follows below.)
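As a concrete illustration of the complexity terms in (1) and (2), the following minimal sketch computes the effective dimension and the information gain of a random design under an SE kernel. The kernel choice, sample size and parameter values are our own illustrative assumptions, not taken from this paper.

```python
import numpy as np

def se_kernel(X, Y, lengthscale=0.8):
    # Squared Exponential kernel matrix: k(x, y) = exp(-||x - y||^2 / (2 l^2)).
    sq_dists = np.sum((X[:, None, :] - Y[None, :, :]) ** 2, axis=-1)
    return np.exp(-sq_dists / (2 * lengthscale**2))

rng = np.random.default_rng(0)
T, d, lam = 200, 2, 0.2
X = rng.uniform(size=(T, d))
K = se_kernel(X, X)

# Effective dimension, eq. (1): d_k(T) = tr(K (K + lam^2 I)^{-1}).
d_eff = np.trace(K @ np.linalg.inv(K + lam**2 * np.eye(T)))
# Information gain of this design: I = 0.5 * log det(I + K / lam^2).
info_gain = 0.5 * np.linalg.slogdet(np.eye(T) + K / lam**2)[1]
print(f"effective dimension ~ {d_eff:.1f}, information gain ~ {info_gain:.1f}")
```

In line with the equivalence noted above, the two printed quantities differ by at most a logarithmic factor.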
Regret lower bounds. For the commonly used SE and Matérn kernels, Scarlett et al. (2017) derived lower bounds on the regret (in the immediate feedback setting). In particular, they showed $\Omega(\sqrt{T(\log(T))^{d/2}})$ and $\Omega(T^{\frac{\nu+d}{2\nu+d}})$ lower bounds on the regret in the cases of the SE and Matérn kernels, respectively. These bounds are matched by the regret bounds for GP-ThreDS (Salgia et al., 2021) and BPE (Li and Scarlett, 2022), up to logarithmic factors.

Delayed feedback in stochastic bandits. Delayed feedback has been studied extensively in the stochastic bandit problem with $K$ independent arms. Joulani et al. (2013) provided a queue-based wrapper algorithm which, when applied with any bandit algorithm, leads to an additive penalty of $\mathbb{E}[\tau]$ in the regret compared to the non-delayed setting. They also showed that using a UCB algorithm (Auer et al., 2002) with just the observed data would lead to a similar regret bound of $\tilde{O}(\sqrt{KT} + \mathbb{E}[\tau])$. Mandel et al. (2015) also provided a queue-based algorithm with the same regret bound. Vernade et al. (2017) developed a UCB algorithm to deal with delayed feedback where some observations are censored if their delay exceeds a threshold. We note that we cannot directly extend these queue-based approaches to the kernel bandit problem, where we have a large number of dependent arms. However, we show that the same additive penalty can be maintained even in our more difficult setting. Additionally, while it is possible to extend the UCB-based algorithms to our setting, in Section 7 we show that our proposed algorithm performs better empirically than using a delayed version of GP-UCB.

There has also been work studying the impact of delayed feedback in generalised linear bandits. Zhou et al. (2019) and Howson et al. (2022) provided adaptations of optimistic (UCB) algorithms to account for delayed feedback with sub-exponential delays. Zhou et al. (2019) obtained a regret bound scaling with $\tilde{O}(d\sqrt{dT\mathbb{E}[\tau]})$, while Howson et al. (2022) obtained an improved regret bound of $\tilde{O}(d\sqrt{T} + d^{3/2}\mathbb{E}[\tau])$, although here the delay penalty still depends on the dimension, which is not the case for the delayed stochastic $K$-armed bandit problem. When applied to this setting, we show that our proposed algorithm removes this interaction between the dimension and the delay. Specifically, our results improve the delay-related term in the regret bound by a $d^{3/2}$ factor in the special case of linear bandits (a kernel bandit problem with a linear kernel). We also note that Vernade et al. (2020) extended their work on delayed feedback with censoring to the contextual linear bandit setting, and Dudik et al. (2011) studied constant delays in the contextual bandit setting, although these settings are not directly comparable to ours.

Delayed feedback in kernel bandits. The most relevant work to ours is Verma et al. (2022), where the kernel bandit problem with stochastically delayed feedback was also considered. Verma et al. (2022) proposed algorithms based on GP-UCB (Srinivas et al., 2010) and Thompson sampling (Chowdhury and Gopalan, 2017) in the delayed feedback setting. In these algorithms, referred to as GP-UCB-SDF and GP-TS-SDF (where SDF stands for Stochastic Delayed Feedback), the feedback that is unavailable due to delay is replaced by the minimum function value (assuming it is known). They provided a regret bound for these algorithms of $\tilde{O}(\Gamma_k(T)\sqrt{T} + \mathbb{E}[\tau]\Gamma_k(T))$. This is an improvement over a naive application of the existing algorithms (which would lead to a regret bound of $\tilde{O}(\Gamma_k(T)\sqrt{T\mathbb{E}[\tau]})$), but still suffers from a scaling of the term involving the delay by $\Gamma_k(T)$. In comparison, our algorithm does not require the additional knowledge of the minimum function value (we simply disregard the unavailable observations). Furthermore, our results significantly improve upon Verma et al. (2022) by completely decoupling the impact of the delay from the problem parameters. Our regret bounds are also order optimal, and cannot be further improved in the cases where a lower bound on the regret is known, in particular the bounds given in Scarlett et al. (2017) for the SE and Matérn kernels (under the immediate feedback setting).

Kernel bandits with batch observations can be considered a special case of our delayed feedback framework, with constant (non-stochastic) delays. Specifically, the observations for the points in a batch are available with a fixed delay that is at most the length of the batch. Notable examples are Desautels et al. (2014); Vakili et al. (2021d); Daxberger and Low (2017); Chowdhury and Gopalan (2019), which are based on the hallucination technique introduced in Desautels et al. (2014). In this technique, the unavailable observations are hallucinated to be the kernel-based prediction using only the available feedback.
This is equivalent to keeping the prediction unaffected by the unavailable observations, while updating the uncertainty estimate using all selected observation points, including the ones with delayed observations (Chowdhury and Gopalan, 2019). In contrast, Verma et al. (2022) set the unavailable observations to the minimum function value (assuming it is known). In our algorithm, we simply disregard the unavailable observations in the prediction, while using both variations of the uncertainty estimate (one updated using all observation points, and one updated using only the observation points with available feedback). Details are given in Section 5. Our regret bounds also offer a significant improvement over the batch setting, despite this being a significantly simpler setting due to the fixed and known delays. In particular, in the batch setting, the best known regret bounds, by Chowdhury and Gopalan (2019), are $\tilde{O}(\Gamma_k(T)\sqrt{T} + \tau\Gamma_k(T))$, where, with an abuse of notation to enable comparison with our results, we use $\tau$ for the batch size. Theorem 6.1 improves upon both terms in this regret bound.

3. Problem Definition

Consider a positive definite kernel $k : \mathcal{X} \times \mathcal{X} \to \mathbb{R}$ supported on a compact $d$-dimensional set $\mathcal{X} \subset \mathbb{R}^d$. A Hilbert space $\mathcal{H}_k$ of functions on $\mathcal{X}$ equipped with an inner product $\langle\cdot,\cdot\rangle_{\mathcal{H}_k}$ is called a reproducing kernel Hilbert space (RKHS) with reproducing kernel $k$ if the following is satisfied: for all $x \in \mathcal{X}$, $k(\cdot, x) \in \mathcal{H}_k$, and for all $x \in \mathcal{X}$ and $f \in \mathcal{H}_k$, $\langle f, k(\cdot, x)\rangle_{\mathcal{H}_k} = f(x)$ (the reproducing property).

We consider a kernel bandit problem. We assume there exists an unknown objective function $f : \mathcal{X} \to \mathbb{R}$ of interest in the RKHS of a known kernel $k$. This is a very general assumption, since the RKHS of common kernels can approximate almost all continuous functions on compact subsets of the Euclidean space (Srinivas et al., 2010). Our aim is to find an input $x^\star$ which maximises the unknown function $f$, i.e. $x^\star \in \arg\max_{x\in\mathcal{X}} f(x)$. In order to do this, we can sequentially query $f$ at a chosen sequence of observation points $X_t \in \mathcal{X}$, $t = 1, 2, \dots$, and receive the noisy feedback $y_t = f(X_t) + \epsilon_t$, where $\epsilon_t$ is a sequence of independently distributed sub-Gaussian noise terms with zero mean. We measure the performance of any procedure for this problem in terms of its regret, defined as

$$R(T) = \sum_{t=1}^{T}\left(f(x^\star) - f(X_t)\right), \qquad (3)$$

which represents the cumulative amount that we lose by querying $f$ at $X_1, \dots, X_T$ rather than at an unknown maximiser $x^\star \in \arg\max_{x\in\mathcal{X}} f(x)$. In order to make the problem tractable, we make the following assumptions, which are common in the literature.

Assumption 3.1. The RKHS norm of the objective function $f$ is bounded, i.e., $\|f\|_{\mathcal{H}_k} \le C_k$ for some $C_k > 0$, where the notation $\|f\|^2_{\mathcal{H}_k} = \langle f, f\rangle_{\mathcal{H}_k}$ is used for the RKHS norm.

Assumption 3.2. The observation noise terms $\epsilon_t$ are independent $\sigma$-sub-Gaussian random variables with zero mean. That is, for all $t \in \mathbb{N}$, for all $\eta \in \mathbb{R}$, and for some $\sigma > 0$, the moment generating function of $\epsilon_t$ satisfies $\mathbb{E}[\exp(\eta\epsilon_t)] \le \exp\left(\frac{\eta^2\sigma^2}{2}\right)$.

Delayed feedback. In this work, we are interested in the kernel bandit problem in a setting with stochastically delayed feedback. In this setting, at each time $t$, as well as generating the observation $y_t$, the environment also independently generates a stochastic delay $\tau_t \ge 0$. The learner receives the feedback for the decision made at time $t$ at the observation time $t + \tau_t$. We assume that $\tau_1, \dots, \tau_T$ form a sequence of independent and identically distributed sub-exponential random variables.
For simplicity, we assume that the $\tau_t$ random variables are discrete, although it is easy to extend our results to continuous delays by considering the events $\{\tau_t + t \le s\}$ for all $s \ge t$. Our assumption on the delay distribution is formally stated below, and is the same as that commonly made in the literature (Zhou et al., 2019; Howson et al., 2022).

Assumption 3.3. The delays $\tau_t$ are i.i.d. sub-exponential random variables. That is, for all $t \in \mathbb{N}$, for some $\xi, b > 0$, and for all $|\eta| \le \frac{1}{b}$, the moment generating function of $\tau_t$ satisfies $\mathbb{E}\left[\exp\left(\eta(\tau_t - \mathbb{E}[\tau_t])\right)\right] \le \exp\left(\frac{\eta^2\xi^2}{2}\right)$.

4. Confidence Bounds

Kernel-based regression provides powerful predictors and uncertainty estimates. These can be used to form confidence intervals for the unknown objective function $f$, which are a crucial building block in developing algorithms for the kernel-based bandit problem. Consider a set of observation points $\mathcal{X}_t = \{X_1, \dots, X_t\}$ and the corresponding vector of observation values $\mathbf{y}_t = [y_1, \dots, y_t]^\top$. Recall that $K_{\mathcal{X}_t} = [k(X_s, X_{s'})]_{s,s'=1}^t$ is the positive definite kernel matrix resulting from the pairwise kernel values between the observation points. We have the following predictor and uncertainty estimate for $f$, which can be interpreted as the mean and variance of a surrogate GP model for $f$ with kernel $k$, respectively (e.g., see Rasmussen and Williams, 2006; Kanagawa et al., 2018):

$$\mu_{\mathcal{X}_t,\mathbf{y}_t}(x) = k^\top_{\mathcal{X}_t}(x)\left(K_{\mathcal{X}_t} + \lambda^2 I_t\right)^{-1}\mathbf{y}_t,$$
$$\sigma^2_{\mathcal{X}_t}(x) = k(x, x) - k^\top_{\mathcal{X}_t}(x)\left(K_{\mathcal{X}_t} + \lambda^2 I_t\right)^{-1}k_{\mathcal{X}_t}(x), \qquad (4)$$

where $k_{\mathcal{X}_t}(x) = [k(X_1, x), \dots, k(X_t, x)]^\top$ is the vector of kernel values between the observation points and the point of interest, and $\lambda > 0$ is a free parameter. Equipped with the above predictor and uncertainty estimate, we have the following confidence bounds.

Lemma 4.1 (Vakili et al., 2021a). Consider the predictor and uncertainty estimate given in (4). Assume the observation set $\mathcal{X}_t$ is selected independently of the observation noise $\epsilon_s$, $s = 1, \dots, t$. Under Assumptions 3.1 and 3.2, the following inequalities each hold with probability at least $1 - \delta$ for any fixed $x \in \mathcal{X}$:

$$f(x) \le \mu_{\mathcal{X}_t,\mathbf{y}_t}(x) + \beta(\delta)\,\sigma_{\mathcal{X}_t}(x),$$
$$f(x) \ge \mu_{\mathcal{X}_t,\mathbf{y}_t}(x) - \beta(\delta)\,\sigma_{\mathcal{X}_t}(x),$$

where $\beta(\delta) = C_k + \frac{\sigma}{\lambda}\sqrt{2\log\left(\frac{1}{\delta}\right)}$, $C_k$ and $\sigma$ are the parameters given in Assumptions 3.1 and 3.2, and $\lambda$ is the parameter in (4).

If the domain $\mathcal{X}$ is finite, then uniform confidence bounds readily follow from this result via a union bound, and $\frac{\delta}{|\mathcal{X}|}$ can be substituted for $\delta$. For continuous domains, a technical discretization argument can be used to extend Lemma 4.1 to a uniform bound under the following continuity assumption.

Assumption 4.2. For each $t \in \mathbb{N}$, there exists a discretization $\mathbb{X}$ of $\mathcal{X}$ such that, for any $f \in \mathcal{H}_k$ with $\|f\|_{\mathcal{H}_k} \le C_k$, we have $f(x) - f([x]) \le \frac{1}{t}$, where $[x] = \arg\min_{x'\in\mathbb{X}}\|x' - x\|_{\ell^2}$ is the closest point in $\mathbb{X}$ to $x$, and $|\mathbb{X}| \le c\,C_k^d\,t^d$, where $c$ is a constant independent of $t$ and $C_k$.

Assumption 4.2 is a mild and technical assumption that holds for typical kernels such as SE and Matérn with $\nu > 1$ (Srinivas et al., 2010; Chowdhury and Gopalan, 2017; Vakili et al., 2021a).

Lemma 4.3 (A special case of Corollary 3.7 in Vakili et al. (2022)). Under the setting and assumptions of Lemma 4.1, and under Assumption 4.2, the following inequalities each hold uniformly in $x \in \mathcal{X}$ with probability at least $1 - \delta$:

$$f(x) \le \mu_{\mathcal{X}_t,\mathbf{y}_t}(x) + \frac{2}{t} + \bar\beta_\delta(t)\left(\sigma_{\mathcal{X}_t}(x) + \frac{2}{\sqrt{t}}\right),$$
$$f(x) \ge \mu_{\mathcal{X}_t,\mathbf{y}_t}(x) - \frac{2}{t} - \bar\beta_\delta(t)\left(\sigma_{\mathcal{X}_t}(x) + \frac{2}{\sqrt{t}}\right),$$

where $\bar\beta_\delta(t) = \beta\left(\frac{\delta}{2\Gamma_t}\right)$, $\Gamma_t = c\left(C_k\left(\frac{\delta}{2}\right)\right)^d t^d$ is the size of the discretization, $C_k(\delta) = C_k + \frac{\max\{\sigma, k_{\max}\}}{\lambda}\sqrt{2\log\left(\frac{1}{\delta}\right)}$, and $k_{\max} = \max_{x\in\mathcal{X}} k(x, x)$.

Lemma 4.3 uses a high-probability bound $C_k(\delta)$ on the RKHS norm of $\mu_{\mathcal{X}_t,\mathbf{y}_t}$ (which is a random variable due to the random noise), together with applying the continuity assumptions to $\mu_{\mathcal{X}_t,\mathbf{y}_t}$ and $\sigma_{\mathcal{X}_t}$, to construct the confidence bounds. Effectively, the multiplier of the width of the confidence intervals, $\bar\beta_\delta(t)$, scales as $O\left(C_k + \sqrt{d\log(tC_k/\delta)}\right)$.
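To make the constructions above concrete, here is a minimal sketch (our own illustration, not the authors' code) of the predictor and uncertainty estimate in (4), together with the fixed-point confidence bounds of Lemma 4.1. The SE kernel, the test function and all parameter values are assumptions for the example; recall that the lemma requires the observation points to be chosen independently of the noise, as is the case here.

```python
import numpy as np

def se_kernel(X, Y, ell=1.0):
    # Squared Exponential kernel matrix.
    sq = np.sum((X[:, None, :] - Y[None, :, :]) ** 2, axis=-1)
    return np.exp(-sq / (2 * ell**2))

def posterior(X_obs, y_obs, X_query, lam=0.2, ell=1.0):
    # Predictor mu(x) and uncertainty estimate sigma^2(x) of eq. (4).
    K = se_kernel(X_obs, X_obs, ell)
    k_q = se_kernel(X_obs, X_query, ell)            # t x m cross-kernel values
    A = np.linalg.inv(K + lam**2 * np.eye(len(y_obs)))
    mu = k_q.T @ A @ y_obs
    var = 1.0 - np.sum(k_q * (A @ k_q), axis=0)     # k(x, x) = 1 for the SE kernel
    return mu, np.maximum(var, 0.0)

def beta(delta, C_k=1.0, sigma=0.02, lam=0.2):
    # Confidence width multiplier of Lemma 4.1: C_k + (sigma/lam) sqrt(2 log(1/delta)).
    return C_k + (sigma / lam) * np.sqrt(2.0 * np.log(1.0 / delta))

# Usage: confidence bounds U(x), L(x) on a grid, for a noisy test function.
rng = np.random.default_rng(1)
X_obs = rng.uniform(size=(30, 1))                   # inputs chosen independently of the noise
y_obs = np.sin(3 * X_obs[:, 0]) + 0.02 * rng.standard_normal(30)
X_query = np.linspace(0, 1, 101)[:, None]
mu, var = posterior(X_obs, y_obs, X_query)
U = mu + beta(0.05) * np.sqrt(var)
L = mu - beta(0.05) * np.sqrt(var)
```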
The reason is that creating such dependency among observation points invalidates the tight confidence intervals given in Lemma 4.1. Nonetheless, Li and Scarlett (2022) showed in experiments that discarding the data collected in previous rounds, although it may seem wasteful, does not significantly harm the performance.

From the definition of the uncertainty estimate given in (4), we can see that it does not depend on the observation values. Thus, $\sigma^2_{k,r}(\cdot)$ is well defined, despite not all observation values being available due to delay. We use the double index $(k, r)$, $r = 1, 2, \dots$, $k = 1, 2, \dots, t_r$, for clarity of presentation. The observation points can nonetheless be indexed by $t = 1, 2, \dots$ by concatenating the observation points in all rounds (see Algorithm 1). To contrast the statistics derived from the entire set of observations in round $r$ (ignoring the delay) with the ones using the available observations (accounting for the delay), we recall the notations $\bar\mu_{k,r}$ and $\bar\sigma_{k,r}$ for the statistics using just the available observations. Specifically, $\bar\mu_{k,r}$ and $\bar\sigma_{k,r}$ are the prediction and uncertainty estimate after selecting $k$ observation points in round $r$, computed using the set $\bar{\mathcal{X}}_{k,r} = \{X_{s,r} \in \mathcal{X}_{k,r} : s + \tau_{s,r} \le k\}$ of points selected in round $r$ whose feedback is available, with observations $\bar{\mathbf{y}}_{k,r} = [y_{s,r} = f(X_{s,r}) + \epsilon_{s,r}]_{X_{s,r}\in\bar{\mathcal{X}}_{k,r}}$.

Algorithm 1: Batch Pure Exploration with Delays (BPE-Delay)

Input: action set $\mathcal{X}$, number of steps $T$.
Initialize: $\mathcal{X}_1 \leftarrow \mathcal{X}$, $q_0 = 1$, $t = 1$.
For rounds $r = 1, 2, \dots$, while $t \le T$:
  1. $q_r \leftarrow \lceil\sqrt{Tq_{r-1}}\rceil$ ($q_r$ is used to determine $t_r$).
  2. $t_r \leftarrow q_r + u_T(\delta)$ ($t_r$ is the length of round $r$).
  3. For $k = 1, \dots, t_r$: choose $X_{k,r} \in \arg\max_{x\in\mathcal{X}_r}\sigma_{k-1,r}(x)$ (the points in each round are chosen by a maximum uncertainty acquisition function); set $X_t \leftarrow X_{k,r}$ and $t \leftarrow t + 1$.
  4. Remove the input points which are unlikely to be the maximiser of $f$, using confidence intervals: $\mathcal{X}_{r+1} \leftarrow \{x \in \mathcal{X}_r : U_{r,\delta}(x) \ge L_{r,\delta}(z)\ \forall z \in \mathcal{X}_r\}$.

At the end of each round $r$, using the available observations in round $r$, we create upper and lower confidence bounds on $f(\cdot)$, $U_r(\cdot)$ and $L_r(\cdot)$ respectively. These are used to remove points which are unlikely to be a maximiser of $f$. In particular, if there exists a point $z \in \mathcal{X}_r$ for which the lower confidence bound is larger than the upper confidence bound of some $x \in \mathcal{X}_r$, then $x$ is unlikely to be the maximiser of $f$, so we can remove it. The confidence interval widths are selected in a way that all the confidence intervals used in the algorithm hold true with high probability (see Theorems 6.1 and 6.2 for details). We emphasize that while we use $\sigma_{k,r}$ for selecting the observation points within each round, $\bar\mu_{k,r}$ and $\bar\sigma_{k,r}$ are used for forming the confidence intervals at the end of each round. Using $\sigma_{k,r}$ avoids selecting repetitive points due to unavailable feedback, while valid confidence intervals can only be formed using $\bar\mu_{k,r}$ and $\bar\sigma_{k,r}$. The complete pseudo-code for the BPE-Delay algorithm is given in Algorithm 1.
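To make the round-length schedule concrete, the following minimal sketch (our own illustration; the values of $\xi$, $b$, $\delta$ and the mean delay are assumptions, and the logarithmic term in $\psi_t$ follows our reading of (6)) prints the round lengths $t_r$ for a given horizon.

```python
import numpy as np

def u_T(T, delta, mean_delay, xi=9.0, b=1.0):
    # u_T(delta) = E[tau] + psi_T(delta / 2), per eq. (6).
    L = np.log(3 * T**2 / (delta / 2))
    return mean_delay + min(np.sqrt(2 * xi**2 * L), 2 * b * L)

def round_lengths(T, delta=0.05, mean_delay=50):
    q, t, lengths = 1, 0, []
    while t < T:
        q = int(np.ceil(np.sqrt(q * T)))                   # q_r = ceil(sqrt(q_{r-1} T))
        t_r = q + int(np.ceil(u_T(T, delta, mean_delay)))  # t_r = q_r + u_T(delta)
        lengths.append(min(t_r, T - t))                    # last round truncated to fit T
        t += t_r
    return lengths

print(round_lengths(10_000))  # a handful of rounds of rapidly increasing length
```

For $T = 10{,}000$ this produces only four rounds, in line with the $R = O(\log\log(T))$ bound used in the analysis below (Lemma A.2).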
6. Analysis

In this section, we provide the analysis of the BPE-Delay algorithm. Specifically, we prove a high-probability $\tilde{O}(\sqrt{T\Gamma_k(T)} + \mathbb{E}[\tau])$ bound on its regret. This regret bound completely decouples the effect of the delay from the dominant time-dependent regret term. We note that the additional regret due to delay in our result is significantly smaller than in the existing work, and matches what is seen in the simpler $K$-armed bandit problem (Joulani et al., 2013).

We consider the two cases of finite and continuous $\mathcal{X}$ separately, due to minor differences in the confidence intervals which lead to mild differences in the regret bounds. In particular, when $\mathcal{X}$ is finite, the regret bound scales with $O(\sqrt{\log(|\mathcal{X}|)})$. In the case of a continuous and compact $\mathcal{X} \subset \mathbb{R}^d$, under Assumption 4.2, we prove a similar regret bound which scales with an $O(\sqrt{d})$ factor instead of $O(\sqrt{\log(|\mathcal{X}|)})$. While in the kernel bandit setting $d$ is typically hidden in the $O$ notation as a constant not growing with $T$, in the linear bandit setting the dependency on $d$ should be made explicit, and is important in evaluating the algorithms.

Theorem 6.1. Consider the kernel-based bandit problem with delayed feedback described in Section 3. When $\mathcal{X}$ is finite, set the confidence intervals in BPE-Delay (Algorithm 1) to

$$U_{r,\delta}(x) = \bar\mu_{t_r,r}(x) + \beta(\delta)\,\bar\sigma_{t_r,r}(x), \quad L_{r,\delta}(x) = \bar\mu_{t_r,r}(x) - \beta(\delta)\,\bar\sigma_{t_r,r}(x), \qquad (8)$$

with $\beta(\delta) = C_k + \frac{\sigma}{\lambda}\sqrt{2\log\left(\frac{4R|\mathcal{X}|}{\delta}\right)}$. Under Assumptions 3.1, 3.2 and 3.3, with probability at least $1 - \delta$, the regret of BPE-Delay is bounded by

$$R(T) = O\left(\beta(\delta)\log\log(T)\sqrt{T\Gamma_k(T)} + \mathbb{E}[\tau]\right). \qquad (9)$$

Note that with a finite $\mathcal{X}$, $\beta(\delta) = O(1)$, not depending on problem parameters such as $d$ and $T$. In the next theorem, we bound the regret when $\mathcal{X}$ is not finite, but a compact subset of $\mathbb{R}^d$.

Theorem 6.2. Consider the kernel-based bandit problem with delayed feedback described in Section 3, under Assumptions 3.1, 3.2, 3.3 and 4.2. Set the confidence intervals in BPE-Delay (Algorithm 1) to

$$U_{r,\delta}(x) = \bar\mu_{t_r,r}(x) + \frac{2}{u_r} + \bar\beta_r(\delta)\left(\bar\sigma_{t_r,r}(x) + \frac{2}{\sqrt{u_r}}\right),$$
$$L_{r,\delta}(x) = \bar\mu_{t_r,r}(x) - \frac{2}{u_r} - \bar\beta_r(\delta)\left(\bar\sigma_{t_r,r}(x) + \frac{2}{\sqrt{u_r}}\right), \qquad (10)$$

with $\bar\beta_r(\delta) = \bar\beta_{\delta/(4R)}(u_r)$, where $\bar\beta_\delta(t)$ is given in Lemma 4.3 and $u_r = |\bar{\mathcal{X}}_{t_r,r}|$ is the number of observations available at the end of round $r$. Then, with probability at least $1 - \delta$, the regret of BPE-Delay is bounded by

$$R(T) = O\left(\bar\beta_R(\delta)\log\log(T)\sqrt{T\Gamma_k(T)} + \mathbb{E}[\tau]\right). \qquad (11)$$

From Lemma 4.3, we can see that $\bar\beta_R(\delta) = O\left(C_k + \sqrt{d\log\left(\frac{TRC_k}{\delta}\right)}\right)$. Thus the regret bound scales with $O(\sqrt{d})$, which becomes particularly important when applied to linear bandits.

Proof sketch. By the choice of the size $t_r$ of each round, the number $R$ of rounds is bounded as $R = O(\log\log(T))$. We first show that the uncertainty estimate $\bar\sigma_{t_r,r}$ using the $t_r$ observations sufficiently shrinks. Using the confidence intervals, we prove that (i) the maximiser $x^\star$ is not removed during any round, i.e. $x^\star \in \mathcal{X}_r$ for all $r$; and (ii) the instantaneous regret incurred by selecting each point $X_{k,r} \in \mathcal{X}_r$ is bounded. The bound on $f(x^\star) - f(X_{k,r})$ is based on the amount of exploration in the previous round, taking into account the delay. Using the bound on the instantaneous regret and the size $t_r$ of round $r$, we bound the regret in round $r$. A detailed proof is provided in the appendix.

Our theorems can be specialized for many commonly used kernels, including the Matérn and SE kernels, as well as for the special case of linear kernels.

Corollary 6.3. Under the setting of Theorem 6.2, the following hold with probability at least $1 - \delta$. In the case of a Matérn kernel with smoothness parameter $\nu$,

$$R(T) = \tilde{O}\left(T^{\frac{\nu+d}{2\nu+d}} + \mathbb{E}[\tau]\right). \qquad (12)$$

In the case of the SE kernel,

$$R(T) = O\left(\sqrt{T(\log(T))^{d+1}}\log\log(T) + \mathbb{E}[\tau]\right). \qquad (13)$$

In the case of a linear kernel,

$$R(T) = O\left(d\sqrt{T\log(T)}\log\log(T) + \mathbb{E}[\tau]\right). \qquad (14)$$

Comparing to the lower bound in the case of the linear bandit (Lattimore and Szepesvári, 2020), and the lower bounds for kernel bandits in the cases of the SE and Matérn kernels (Scarlett et al., 2017), the regret bounds shown in Corollary 6.3 are order optimal, up to logarithmic factors and the additive delay penalty, which is independent of all other problem parameters.
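To make the Matérn rate in (12) concrete, the following quick sketch (our own addition, with illustrative values of $\nu$ and $d$) evaluates the information-gain exponent $\frac{d}{2\nu+d}$ and the resulting regret exponent $\frac{\nu+d}{2\nu+d}$.

```python
# Illustrative arithmetic for Corollary 6.3: Matérn information-gain and
# regret exponents for a few sample values of the smoothness nu and dimension d.
for nu, d in [(0.5, 1), (1.5, 2), (2.5, 4)]:
    gain_exp = d / (2 * nu + d)            # Gamma_k(T) = O~(T^gain_exp)
    regret_exp = (nu + d) / (2 * nu + d)   # R(T) = O~(T^regret_exp + E[tau])
    print(f"nu={nu}, d={d}: Gamma_k(T) ~ T^{gain_exp:.2f}, R(T) ~ T^{regret_exp:.2f}")
```

As $\nu \to 0$ or $d \to \infty$, the information-gain exponent approaches one, so a multiplicative $\mathbb{E}[\tau]\Gamma_k(T)$ delay penalty, as in Verma et al. (2022), approaches $\mathbb{E}[\tau]T$, whereas the additive $\mathbb{E}[\tau]$ penalty in our bounds is unaffected.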
We thus reduce the delay-related term in the regret bounds to only $\mathbb{E}[\tau]$, without affecting the dominant term in the regret bound. For comparison, in the (generalised) linear setting, the best known bound on the delay-related regret was $O(d^{3/2}\mathbb{E}[\tau])$ (Howson et al., 2022); this is improved by a $d^{3/2}$ factor in our results. In the kernel bandit setting, we also improve the $O(\Gamma_k(T)\mathbb{E}[\tau])$ delay-related regret of Verma et al. (2022) to only $O(\mathbb{E}[\tau])$. This represents a significant improvement, considering that $\Gamma_k(T)$ can become arbitrarily close to linear in $T$ in the case of a Matérn kernel with a small $\nu$ and large $d$. This is in addition to the improvement in the first term of the regret bound, from $O(\Gamma_k(T)\sqrt{T})$ to $O(\sqrt{\Gamma_k(T)T})$.

7. Experiments

Following our theoretical analysis, we provide numerical experiments on the performance of BPE-Delay and compare it to GP-UCB-SDF (Verma et al., 2022). We test the algorithms on the objective functions $f_1$ and $f_2$ shown in Figure 1. These functions are generated by fitting a kernel-based model to points randomly generated from a multivariate Gaussian. This is a common technique to create RKHS elements (e.g., see Chowdhury and Gopalan, 2017; Vakili et al., 2021a; Li and Scarlett, 2022). We use an SE kernel with a length scale parameter $l = 0.8$ for $f_1$ and $l = 1.0$ for $f_2$ in order to generate these objective functions.

Figure 1. Objective functions used in our experiments.

The learner can then choose from $|\mathcal{X}| = 2500$ points over a uniform $50 \times 50$ grid. The sampling noise is zero-mean Gaussian with standard deviation $\sigma = 0.02$. The stochastic delay in the feedback is generated from a Poisson distribution with parameter $\lambda$. The calculation of $\psi_t$ for BPE-Delay uses $\xi = 9$ and $b = 1$, as specified in Assumption 3.3. The cumulative regret curves are the average of 10 independent experiments. The error bars indicate half a standard deviation. The code for these experiments is provided in a GitHub repository.²

In Figure 2, we compare the regret performance of BPE-Delay with GP-UCB-SDF, introduced in the most related work in the same setting as ours (Verma et al., 2022). The figure shows a significant improvement in the regret performance using BPE-Delay in both cases. In this experiment, the delays are Poisson distributed with $\lambda = 50$.

Figure 2. Comparing the regret performance of BPE-Delay and GP-UCB-SDF.

In Figures 3 and 4, we show the effect of the delay for these two algorithms. Specifically, we vary the Poisson delay parameter as $\lambda = 0, 25$ and $50$, while maintaining the rest of the parameters as in the previous experiment. BPE-Delay shows a linear excess in the regret with the expected delay. GP-UCB-SDF, on the other hand, shows a dramatic jump in regret even with moderate delay values. We also compare the performance of BPE-Delay and BPE (Li and Scarlett, 2022), which was designed for non-delayed settings. As we can see from Figure 5, BPE-Delay naturally performs better than BPE in a delayed setting.

Figure 3. Regret performance of BPE-Delay for varying delay Poisson parameters.

Figure 4. Regret performance of GP-UCB-SDF for varying delay Poisson parameters.

Figure 5. Comparing the regret performance of BPE and BPE-Delay in the delayed feedback setting.

²https://github.com/svakili89/delayed_kernel_bandit

8. Conclusion

There has been great attention paid to kernel bandits due to their numerous applications in machine learning and in academic and industrial experimental design. In many real-world scenarios, e.g., recommender systems, the feedback from decisions is not immediately available, naturally leading to the formulation of kernel bandits with
stochastically delayed feedback. For this setting, we proposed BPE-Delay, which is able to efficiently deal with delayed feedback. We showed an order-optimal regret bound for BPE-Delay with a very small excess in regret due to the delayed feedback, significantly improving upon the existing work in the same setting. We also showed that these theoretical improvements are maintained in several simulation studies. In addition, when applied to the special case of linear kernels, our theoretical regret bounds improve the delay-related penalty by a factor of $d^{3/2}$ compared to the state of the art. In all cases, our results are the first to show that the additive delay penalty depends only on $\mathbb{E}[\tau]$. This extends what is known for the simpler $K$-armed bandit setting to the much more realistic kernel bandit setting.

References

P. Auer, N. Cesa-Bianchi, and P. Fischer. Finite-time analysis of the multiarmed bandit problem. Machine Learning, 47(2):235–256, 2002.

D. Calandriello, L. Carratino, A. Lazaric, M. Valko, and L. Rosasco. Gaussian process optimization with adaptive sketching: Scalable and no regret. In Conference on Learning Theory, 2019.

S. R. Chowdhury and A. Gopalan. On kernelized multi-armed bandits. In International Conference on Machine Learning, pages 844–853, 2017.

S. R. Chowdhury and A. Gopalan. On batch Bayesian optimization. arXiv preprint arXiv:1911.01032, 2019.

T. M. Cover. Elements of Information Theory. John Wiley & Sons, 1999.

E. A. Daxberger and B. K. H. Low. Distributed batch Gaussian process optimization. In International Conference on Machine Learning, pages 951–960. PMLR, 2017.

T. Desautels, A. Krause, and J. W. Burdick. Parallelizing exploration-exploitation tradeoffs in Gaussian process bandit optimization. Journal of Machine Learning Research, 15:3873–3923, 2014.

M. Dudik, D. Hsu, S. Kale, N. Karampatziakis, J. Langford, L. Reyzin, and T. Zhang. Efficient optimal learning for contextual bandits. In Proceedings of the Twenty-Seventh Conference on Uncertainty in Artificial Intelligence, pages 169–178, 2011.

B. Howson, C. Pike-Burke, and S. Filippi. Delayed feedback in generalised linear bandits revisited. arXiv preprint arXiv:2207.10786, 2022.

P. Joulani, A. Gyorgy, and C. Szepesvári. Online learning under delayed feedback. In International Conference on Machine Learning, pages 1453–1461. PMLR, 2013.

M. Kanagawa, P. Hennig, D. Sejdinovic, and B. K. Sriperumbudur. Gaussian processes and kernel methods: A review on connections and equivalences. arXiv preprint arXiv:1807.02582, 2018.

T. Lattimore and C. Szepesvári. Bandit Algorithms. Cambridge University Press, 2020.

Z. Li and J. Scarlett. Gaussian process bandit optimization with few batches. In International Conference on Artificial Intelligence and Statistics, pages 92–107. PMLR, 2022.

M. Mandel, Y.-E. Liu, E. Brunskill, and Z. Popović. The queue method: Handling delay, heuristics, prior data, and evaluation in bandits. In Twenty-Ninth AAAI Conference on Artificial Intelligence, 2015.

C. E. Rasmussen and C. K. Williams. Gaussian Processes for Machine Learning. MIT Press, 2006.

S. Salgia, S. Vakili, and Q. Zhao. A domain-shrinking based Bayesian optimization algorithm with order-optimal regret performance. Advances in Neural Information Processing Systems, 34:28836–28847, 2021.

J. Scarlett, I. Bogunovic, and V. Cevher. Lower bounds on regret for noisy Gaussian process bandit optimization.
In Conference on Learning Theory, pages 1723–1742, 2017.

N. Srinivas, A. Krause, S. Kakade, and M. Seeger. Gaussian process optimization in the bandit setting: No regret and experimental design. In International Conference on Machine Learning, pages 1015–1022, 2010.

S. Vakili, N. Bouziani, S. Jalali, A. Bernacchia, and D.-s. Shiu. Optimal order simple regret for Gaussian process bandits. In Conference on Neural Information Processing Systems, 2021a.

S. Vakili, M. Bromberg, J. Garcia, D.-s. Shiu, and A. Bernacchia. Uniform generalization bounds for overparameterized neural networks. arXiv preprint arXiv:2109.06099, 2021b.

S. Vakili, K. Khezeli, and V. Picheny. On information gain and regret bounds in Gaussian process bandits. In International Conference on Artificial Intelligence and Statistics, pages 82–90, 2021c.

S. Vakili, H. Moss, A. Artemev, V. Dutordoir, and V. Picheny. Scalable Thompson sampling using sparse Gaussian process models. Advances in Neural Information Processing Systems, 34:5631–5643, 2021d.

S. Vakili, J. Scarlett, and T. Javidi. Open problem: Tight online confidence intervals for RKHS elements. In Conference on Learning Theory, pages 4647–4652. PMLR, 2021e.

S. Vakili, J. Scarlett, D.-s. Shiu, and A. Bernacchia. Improved convergence rates for sparse approximation methods in kernel-based learning. In International Conference on Machine Learning. PMLR, 2022.

M. Valko, N. Korda, R. Munos, I. Flaounas, and N. Cristianini. Finite-time analysis of kernelised contextual bandits. In Conference on Uncertainty in Artificial Intelligence, pages 654–663, 2013.

A. Verma, Z. Dai, and B. K. H. Low. Bayesian optimization under stochastic delayed feedback. In International Conference on Machine Learning, pages 22145–22167. PMLR, 2022.

C. Vernade, O. Cappé, and V. Perchet. Stochastic bandit models for delayed conversions. In Conference on Uncertainty in Artificial Intelligence, 2017.

C. Vernade, A. Carpentier, T. Lattimore, G. Zappella, B. Ermis, and M. Brueckner. Linear bandits with stochastic delayed feedback. In International Conference on Machine Learning, pages 9712–9721. PMLR, 2020.

T. Zhang. Learning bounds for kernel regression using effective data dimensionality. Neural Computation, 17(9):2077–2098, 2005.

Z. Zhou, R. Xu, and J. Blanchet. Learning in generalized linear contextual bandits with stochastic delays. Advances in Neural Information Processing Systems, 32, 2019.

A. Proofs

In this section, we provide detailed proofs for Theorems 6.1 and 6.2 and Corollary 6.3. We structure the proof of the theorems as follows. We first overview the kernel-based confidence intervals, the bounds on the delay, and the number of rounds. After formalizing these bounds, we proceed with the proof in three steps. In the first step, we bound the maximum uncertainty at any point $x \in \mathcal{X}_{r+1}$ at the end of each round $r$. In the second step, we bound the instantaneous regret for selecting a point in each round. In the third step, we bound the total regret. At the end of the proof, we discuss Corollary 6.3.

Confidence intervals. Recall the confidence intervals used in the BPE-Delay algorithm to update the set $\mathcal{X}_r$ of potential maximisers in round $r$. Using Lemmas 4.1 and 4.3 and a probability union bound over rounds, all confidence intervals used in the algorithm are valid with probability at least $1 - \frac{\delta}{2}$. Specifically, let us define the event

$$\mathcal{E} = \left\{L_{r,\delta}(x) \le f(x) \le U_{r,\delta}(x),\ \forall r = 1, 2, \dots, R,\ \forall x \in \mathcal{X}_r\right\}. \qquad (15)$$

Then, we have $\Pr[\mathcal{E}] \ge 1 - \frac{\delta}{2}$.
Delay. Recall the definition of $\psi_t(\delta)$ given in (6). We have the following concentration inequality for sub-exponential random variables.

Lemma A.1 (Howson et al., 2022). Let $\tau_t$, $t = 1, 2, \dots$, be i.i.d. sub-exponential random variables with parameters $\xi$ and $b$ as specified in Assumption 3.3. Then,

$$\Pr\left[\forall t \ge 1:\ \tau_t \le \mathbb{E}[\tau] + \psi_t(\delta)\right] \ge 1 - \delta. \qquad (16)$$

Let $\tau_{k,r}$ denote the random delay for the $k$-th observation in round $r$ of BPE-Delay. Let us define the event

$$\mathcal{E}' = \left\{\forall r,\ \forall k \le q_r:\ \tau_{k,r} \le u_T(\delta)\right\}. \qquad (17)$$

Using Lemma A.1 and the definition of $u_T$, we have $\Pr[\mathcal{E}'] \ge 1 - \frac{\delta}{2}$. We thus have that both $\mathcal{E}$ and $\mathcal{E}'$ hold true with probability $\Pr[\mathcal{E} \cap \mathcal{E}'] \ge 1 - \delta$. We condition the rest of the proof on $\mathcal{E} \cap \mathcal{E}'$.

The number of rounds. The number $R$ of rounds in BPE-Delay is bounded in the following lemma.

Lemma A.2. Recall the choice of round lengths $t_r$ in BPE-Delay. We have $R = O(\log\log(T))$.

The proof follows from similar steps as in the proof of Proposition 1 in Li and Scarlett (2022). We now proceed to the main steps in the proof of the theorems.

Step 1 (Maximum uncertainty at the end of each round). In each round $r$, the sum of uncertainties using all observations (including the ones with delayed feedback) after $q_r$ observations can be bounded using $\Gamma_k(q_r)$.

Lemma A.3 (Srinivas et al., 2010). Recall the definition of $\Gamma_k(t)$ given in (2). For a set of $t$ observations $\mathcal{X}_t$, we have

$$\sum_{s=1}^{t}\sigma^2_{\mathcal{X}_{s-1}}(X_s) \le c_1\Gamma_k(t), \qquad (18)$$

where $c_1 = \frac{2}{\log(1 + \frac{1}{\lambda^2})}$.

Thus, considering the observation points in round $r$, we have

$$\sum_{k=1}^{q_r}\sigma^2_{k-1,r}(X_{k,r}) \le c_1\Gamma_k(q_r), \qquad (19)$$

where $c_1$ is the constant given in Lemma A.3. By the selection rule of the observation points based on the maximum uncertainty acquisition (7), we have, for all $x \in \mathcal{X}_r$ and all $1 \le k \le q_r$,

$$\sigma_{k-1,r}(x) \le \sigma_{k-1,r}(X_{k,r}). \qquad (20)$$

In addition, by the positive definiteness of the kernel matrix, conditioning on a bigger set of observations reduces the uncertainty (see (4)). Thus,

$$\sigma_{q_r,r}(x) \le \sigma_{k-1,r}(X_{k,r}). \qquad (21)$$

Combining the last inequality with (19), we obtain

$$\sum_{k=1}^{q_r}\sigma^2_{q_r,r}(x) \le c_1\Gamma_k(q_r), \qquad (22)$$

which implies

$$\sigma^2_{q_r,r}(x) \le \frac{c_1\Gamma_k(q_r)}{q_r}. \qquad (23)$$

Recall the event $\mathcal{E}'$ given in (17). Under $\mathcal{E}'$, $\bar\sigma^2_{t_r,r}$ is conditioned on a larger set of observations compared to $\sigma^2_{q_r,r}$ (the feedback for the first $q_r$ points of round $r$ has arrived by the end of the round). The positive definiteness of the kernel matrix implies that conditioning on a larger set of observations reduces the uncertainty. Thus, we have $\bar\sigma^2_{t_r,r}(x) \le \sigma^2_{q_r,r}(x)$ for all $x \in \mathcal{X}_r$. As a result,

$$\bar\sigma^2_{t_r,r}(x) \le \frac{c_1\Gamma_k(q_r)}{q_r}, \quad \forall x \in \mathcal{X}_r. \qquad (24)$$

Step 2 (Instantaneous regret). We can use the confidence intervals for RKHS elements given in Lemmas 4.1 and 4.3 to bound the instantaneous regret as follows. Recall the event $\mathcal{E}$ defined in (15). We first note that, for all $x \in \mathcal{X}_r$, $U_{r,\delta}(x^\star) \ge f(x^\star) \ge f(x) \ge L_{r,\delta}(x)$. Thus, $U_{r,\delta}(x^\star) \ge \max_{x\in\mathcal{X}_r} L_{r,\delta}(x)$ for all $r$, and $x^\star$ will not be eliminated during any round of BPE-Delay. Therefore $x^\star \in \mathcal{X}_r$ for all $r$. We now proceed to bounding the instantaneous regret under the two settings of finite $\mathcal{X}$ and continuous $\mathcal{X}$ separately. The two cases correspond to Theorems 6.1 and 6.2, respectively.

When $\mathcal{X}$ is finite (the setting of Theorem 6.1): For all $x \in \mathcal{X}_{r+1}$,

$$f(x^\star) - f(x) \le U_{r,\delta}(x^\star) - L_{r,\delta}(x)$$
$$= \bar\mu_{t_r,r}(x^\star) + \beta(\delta)\,\bar\sigma_{t_r,r}(x^\star) - \bar\mu_{t_r,r}(x) + \beta(\delta)\,\bar\sigma_{t_r,r}(x)$$
$$= L_{r,\delta}(x^\star) - U_{r,\delta}(x) + 2\beta(\delta)\,\bar\sigma_{t_r,r}(x^\star) + 2\beta(\delta)\,\bar\sigma_{t_r,r}(x)$$
$$\le 2\beta(\delta)\,\bar\sigma_{t_r,r}(x^\star) + 2\beta(\delta)\,\bar\sigma_{t_r,r}(x). \qquad (25)$$

The two equalities follow from the choice of the confidence intervals in Theorem 6.1. The last line follows from the choice of $\mathcal{X}_{r+1}$, which ensures that for all $x, z \in \mathcal{X}_{r+1}$, $L_{r,\delta}(z) \le U_{r,\delta}(x)$. Combining with the bound on $\bar\sigma_{t_r,r}(x)$ obtained in the previous step, we bound the instantaneous regret for all $x \in \mathcal{X}_{r+1}$ as follows.
$$f(x^\star) - f(x) \le 4\beta(\delta)\sqrt{\frac{c_1\Gamma_k(q_r)}{q_r}}. \qquad (26)$$

When $\mathcal{X}$ is continuous (the setting of Theorem 6.2): Following similar steps as in the previous case, for all $x \in \mathcal{X}_{r+1}$,

$$f(x^\star) - f(x) \le U_{r,\delta}(x^\star) - L_{r,\delta}(x)$$
$$\le L_{r,\delta}(x^\star) - U_{r,\delta}(x) + \frac{8}{q_r} + 2\bar\beta_r(\delta)\left(\bar\sigma_{t_r,r}(x^\star) + \bar\sigma_{t_r,r}(x) + \frac{4}{\sqrt{q_r}}\right)$$
$$\le \frac{8}{q_r} + 2\bar\beta_r(\delta)\left(\bar\sigma_{t_r,r}(x^\star) + \bar\sigma_{t_r,r}(x) + \frac{4}{\sqrt{q_r}}\right),$$

where the second inequality uses $u_r \ge q_r$ under $\mathcal{E}'$, and the last uses $L_{r,\delta}(x^\star) \le U_{r,\delta}(x)$ for $x, x^\star \in \mathcal{X}_{r+1}$. Combining with the bound on $\bar\sigma_{t_r,r}(x)$ obtained in the previous step, we bound the instantaneous regret for all $x \in \mathcal{X}_{r+1}$ as follows:

$$f(x^\star) - f(x) \le \frac{8}{q_r} + 4\bar\beta_r(\delta)\left(\sqrt{\frac{c_1\Gamma_k(q_r)}{q_r}} + \frac{2}{\sqrt{q_r}}\right). \qquad (27)$$

Step 3 (Bounding the cumulative regret). Let $R_r = \sum_{k=1}^{t_r}\left(f(x^\star) - f(X_{k,r})\right)$ be the total regret in round $r$. Then, we have $R(T) = \sum_{r=1}^{R} R_r$. Summing up the regret in round $r$, we obtain the following.

When $\mathcal{X}$ is finite (the setting of Theorem 6.1): Replacing the bound on the instantaneous regret from (26), for $r \ge 2$,

$$R_r \le 4t_r\beta(\delta)\sqrt{\frac{c_1\Gamma_k(q_{r-1})}{q_{r-1}}}$$
$$\le 4\left(\sqrt{Tq_{r-1}} + u_T(\delta) + 1\right)\beta(\delta)\sqrt{\frac{c_1\Gamma_k(q_{r-1})}{q_{r-1}}}$$
$$= 4\beta(\delta)\left(\sqrt{c_1T\Gamma_k(q_{r-1})} + \left(u_T(\delta) + 1\right)\sqrt{\frac{c_1\Gamma_k(q_{r-1})}{q_{r-1}}}\right).$$

The second inequality is obtained by the choice of $t_r$.

When $\mathcal{X}$ is continuous (the setting of Theorem 6.2): Replacing the bound on the instantaneous regret from (27), for $r \ge 2$,

$$R_r \le t_r\left(\frac{8}{q_{r-1}} + 4\bar\beta_{r-1}(\delta)\left(\sqrt{\frac{c_1\Gamma_k(q_{r-1})}{q_{r-1}}} + \frac{2}{\sqrt{q_{r-1}}}\right)\right)$$
$$\le \left(\sqrt{Tq_{r-1}} + u_T(\delta) + 1\right)\left(\frac{8}{q_{r-1}} + 4\bar\beta_{r-1}(\delta)\left(\sqrt{\frac{c_1\Gamma_k(q_{r-1})}{q_{r-1}}} + \frac{2}{\sqrt{q_{r-1}}}\right)\right)$$
$$\le 4\bar\beta_{r-1}(\delta)\sqrt{c_1T\Gamma_k(q_{r-1})} + 8\left(1 + \bar\beta_{r-1}(\delta)\right)\sqrt{T} + \left(u_T(\delta) + 1\right)\left(\frac{8}{q_{r-1}} + 4\bar\beta_{r-1}(\delta)\left(\sqrt{\frac{c_1\Gamma_k(q_{r-1})}{q_{r-1}}} + \frac{2}{\sqrt{q_{r-1}}}\right)\right).$$

We also note that $f(x) = \langle f, k(\cdot, x)\rangle_{\mathcal{H}_k}$ (the reproducing property). Thus $|f(x)| \le \|f\|_{\mathcal{H}_k}\|k(\cdot, x)\|_{\mathcal{H}_k}$, i.e., $|f(x)| \le C_k\sqrt{k_{\max}}$. Thus, the regret in the first round can simply be bounded using its length:

$$R_1 \le 2t_1C_k\sqrt{k_{\max}} \le 2\left(\sqrt{T} + u_T(\delta) + 1\right)C_k\sqrt{k_{\max}}.$$

Recall that $u_T(\delta) = O\left(\mathbb{E}[\tau] + \log\left(\frac{T}{\delta}\right)\right)$, and note that, for $r > 1$,

$$\frac{8}{q_{r-1}} + 4\bar\beta_{r-1}(\delta)\left(\sqrt{\frac{c_1\Gamma_k(q_{r-1})}{q_{r-1}}} + \frac{2}{\sqrt{q_{r-1}}}\right) = O(1), \qquad (28)$$

since $q_{r-1} \ge \sqrt{T}$ for $r > 1$. Adding up the regret over all rounds $r = 1, 2, \dots, R$, with $R = O(\log\log(T))$, we obtain

$$R(T) = O\left(\beta(\delta)\log\log(T)\sqrt{T\Gamma_k(T)} + C_k\mathbb{E}[\tau]\right), \qquad (29)$$

and

$$R(T) = O\left(\bar\beta_R(\delta)\log\log(T)\sqrt{T\Gamma_k(T)} + C_k\mathbb{E}[\tau]\right), \qquad (30)$$

in the finite and continuous $\mathcal{X}$ cases, proving Theorems 6.1 and 6.2, respectively.

Corollary 6.3 immediately follows by substituting the value of $\Gamma_k$; in particular, $\Gamma_k(T) = O\left(T^{\frac{d}{2\nu+d}}(\log(T))^{\frac{2\nu}{2\nu+d}}\right)$, $\Gamma_k(T) = O\left((\log(T))^{d+1}\right)$ and $\Gamma_k(T) = O(d\log(T))$ in the cases of the Matérn, SE and linear kernels, respectively (Srinivas et al., 2010; Vakili et al., 2021c). In the case of the linear kernel, we emphasize the $O(\sqrt{d})$ factor in $\bar\beta_R(\delta)$.
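As an illustrative numerical check of the sum-of-variances bound (18) used in Step 1 (not part of the proof; the kernel and all parameter values are our own choices), the following sketch compares the left-hand side with $c_1$ times the information gain of a random design.

```python
import numpy as np

# Check of Lemma A.3 / eq. (18): sum_s sigma^2_{s-1}(X_s) <= c_1 * I(y_t; f),
# with c_1 = 2 / log(1 + 1/lam^2), on a random design under an SE kernel.
rng = np.random.default_rng(2)
t, lam, ell = 100, 0.5, 0.5
X = rng.uniform(size=(t, 2))

def se(A, B):
    return np.exp(-np.sum((A[:, None] - B[None, :]) ** 2, axis=-1) / (2 * ell**2))

lhs = 0.0
for s in range(t):
    # Posterior variance at X_s conditioned on X_1, ..., X_{s-1}, as in eq. (4).
    if s == 0:
        var = 1.0  # k(x, x) = 1 for the SE kernel
    else:
        Kp = se(X[:s], X[:s])
        kv = se(X[:s], X[s:s + 1])[:, 0]
        var = 1.0 - kv @ np.linalg.solve(Kp + lam**2 * np.eye(s), kv)
    lhs += var

K = se(X, X)
info_gain = 0.5 * np.linalg.slogdet(np.eye(t) + K / lam**2)[1]
c1 = 2.0 / np.log(1.0 + 1.0 / lam**2)
print(f"sum of variances = {lhs:.2f} <= c1 * information gain = {c1 * info_gain:.2f}")
```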