# Expected Improvement for Contextual Bandits

Hung Tran-The (Applied Artificial Intelligence Institute, Deakin University), Sunil Gupta (Applied Artificial Intelligence Institute, Deakin University), Santu Rana (Applied Artificial Intelligence Institute, Deakin University), Tuan Truong (FPT Software AI Center), Long Tran-Thanh (University of Warwick, UK), Svetha Venkatesh (Applied Artificial Intelligence Institute, Deakin University)

Correspondence to Hung Tran-The. 36th Conference on Neural Information Processing Systems (NeurIPS 2022).

The expected improvement (EI) is a popular technique to handle the trade-off between exploration and exploitation under uncertainty. This technique has been widely used in Bayesian optimization, but it is not applicable to the contextual bandit problem, which is a generalization of both the standard bandit problem and Bayesian optimization. In this paper, we initiate the study of the EI technique for contextual bandits from both theoretical and practical perspectives. We propose two novel EI-based algorithms, one for the case where the reward function is assumed to be linear and the other for more general reward functions. With linear reward functions, we demonstrate that our algorithm achieves a near-optimal regret. Notably, our regret improves that of LinTS [3] by a factor $\sqrt{d}$ while avoiding having to solve an NP-hard problem at each iteration as in LinUCB [1]. For more general reward functions, which are modeled by deep neural networks, we prove that our algorithm achieves an $\tilde{O}(\tilde{d}\sqrt{T})$ regret, where $\tilde{d}$ is the effective dimension of a neural tangent kernel (NTK) matrix and $T$ is the number of iterations. Our experiments on various benchmark datasets show that both proposed algorithms work well and consistently outperform existing approaches, especially in high dimensions.

1 Introduction

The contextual bandit problem is an important field in machine learning [9, 23] for optimizing the trade-off between exploration and exploitation in sequential decision making, and has been extensively studied in real-world applications such as personalized recommendation [24], advertising [18], robotic control [25], and healthcare [19]. At each round, the agent observes a feature vector (the "context") for each of the K arms, pulls one of them, and in return receives a scalar reward. The goal is to maximize the cumulative reward, or equivalently minimize the regret (see the definition in Section 2), over a total of T rounds. To do so, the agent must find a trade-off between exploration and exploitation.

There are two standard techniques to handle this trade-off in contextual bandits. The first uses optimism in the face of uncertainty, which chooses promising actions by maximizing upper confidence bounds (UCB); the second is Thompson Sampling (TS), whose basic idea is to estimate a posterior distribution over the reward and pull the arm that maximizes a reward drawn from this distribution. A series of works has applied both UCB and TS techniques, or their variants, to contextual bandits with many forms of reward functions, from linear to nonlinear. In the line of UCB, there are works of [24, 13, 1] for linear bandits, works of [17, 35] for nonlinear contextual bandits, and very recently [39], which uses neural networks to learn the reward function. In the line of TS, [3, 31] address linear bandits, [31, 22] address generalized linear reward functions, and [30, 38] address nonlinear bandits using deep neural networks.
Different from UCB and TS, the expected improvement (EI) [26] is a greedy, improvement-based heuristic that samples the action offering the greatest expected improvement over an incumbent. EI is also one of the oldest and most popular techniques to handle the trade-off between exploration and exploitation under uncertainty. Due to its ability to handle uncertainty, EI has been widely used in Bayesian optimization [28, 37], a problem that is closely related to infinite-armed bandits. However, to the best of our knowledge, EI has not been used in contextual bandits. In addition, in many situations where exploration may be costly or infeasible, EI can be a safer strategy than UCB and TS, which are considered over-exploring strategies. For example, in a medical application of contextual bandits, choosing a treatment that is not the estimated-best choice (pure exploration) for a specific patient may be unethical [7]. Compared to UCB and TS strategies, EI yields a safer treatment because it only chooses a treatment with a high possibility of improvement over the estimated-best choice. Motivated by these advantages, it is valuable to study the EI technique in the contextual bandit setting. Whether EI can handle the trade-off between exploration and exploitation well in contextual bandits, and further in reinforcement learning, is an interesting question.

Our main contributions in this paper are as follows:

- We introduce and formalize Expected Improvement as a new strategy for contextual bandits, creating a parallel to UCB and TS. We propose two EI-based algorithms. The first algorithm (LinEI) is designed for linear bandits, whilst the second algorithm (NeuralEI) is designed for more general reward functions, which we model by a deep neural network.
- For the linear reward function, our LinEI algorithm achieves $O(d\sqrt{T})$ regret with probability $1-\delta$, which matches the information-theoretic lower bound $\Omega(d\sqrt{T})$ for this problem (up to $\ln(T)$ factors). Our regret improves that of LinTS [3] by a factor $\sqrt{d}$ while avoiding having to solve an NP-hard problem at each iteration as in LinUCB [1]. Thanks to this advantage, we show in Section 6.1.1 that the proposed LinEI scales to high dimensions (in terms of d) better than LinUCB and LinTS; in terms of computation, LinEI is also significantly cheaper than both LinUCB and LinTS. Our source code is publicly available at https://github.com/Tran-TheHung/Expected-Improvement-for-Contextual-Bandits.
- For the general reward function, we prove that, under standard assumptions (see Section 4.1), our NeuralEI algorithm achieves $\tilde{O}(\tilde{d}\sqrt{T})$ regret, where $\tilde{d}$ is the effective dimension of a neural tangent kernel matrix and T is the number of rounds. The regret bound of NeuralEI has the same order as those of NeuralUCB [39] and NeuralTS [38].
- Finally, we demonstrate the performance of our proposed EI-based algorithms against UCB- and TS-based approaches and other baselines on various benchmark datasets. Our experiments show that LinEI outperforms the other baselines in the linear setting, and that when the reward function is nonlinear, NeuralEI outperforms all baselines.

From a theoretical perspective, we remark that the EI technique has not been well studied compared to UCB and TS, even in the Bayesian optimization setting. A key challenge in analyzing EI algorithms theoretically comes from its improvement function, which involves a nonlinear, nonconvex term.
A notable exception is the work of [11], which provides a convergence analysis of EI for Bayesian optimization in the noise-free setting. Another work, [34], provides a convergence analysis of EI for Bayesian optimization in the noisy setting. There are also several papers [33, 29] using EI to study the best-arm identification problem (also known as pure exploration), which is a finite-arm variant of Bayesian optimization. However, none of these works provides the optimal convergence rate in their settings. In contrast, our work is the first to show that EI can achieve the optimal convergence rate, at least in the linear contextual bandit setting.

2 Problem Setting

We consider the problem of K-armed contextual bandits. At time $t = 1, 2, \ldots$, the agent observes K contextual vectors $x_{i,t} \in \mathbb{R}^d$, then selects an arm $a(t)$ and receives a reward $r_{a(t),t}$ of the general form
$$r_{a(t),t} = h(x_{a(t),t}) + \xi_{a(t),t},$$
where $h$ is an unknown mean reward function satisfying $0 \le h(x) \le 1$ for any $x \in \mathbb{R}^d$, and $\xi_{a(t),t}$ is zero-mean, conditionally R-sub-Gaussian noise with a constant $R \ge 0$, i.e., $\forall \lambda \in \mathbb{R}$, $\mathbb{E}[e^{\lambda \xi_{a(t),t}} \mid \{x_{i,t}\}_{i=1}^{K}] \le \exp(\frac{\lambda^2 R^2}{2})$. In our setting, we assume that these context vectors may be chosen by an adversary in an adaptive manner after observing the arms played and their rewards up to time $t-1$. For the unknown function h, we consider two cases:

- The reward function h is linear, i.e., $h(x_{i,t}) = x_{i,t}^\top \theta^*$, where $\theta^* \in \mathbb{R}^d$ is a fixed but unknown parameter vector. Without loss of generality, we assume that $\|x_{i,t}\| \le 1$ and $\|\theta^*\| \le 1$.
- The reward function h is modelled by a fully connected neural network with depth $L \ge 2$, defined recursively by $f(x; \theta) = \sqrt{m}\, W_L \sigma(W_{L-1}\sigma(\cdots \sigma(W_1 x)))$, where $\sigma(x) := \max\{x, 0\}$ is the ReLU activation and $\theta = (\mathrm{vec}(W_1); \ldots; \mathrm{vec}(W_L)) \in \mathbb{R}^p$ is the collection of parameters of the neural network, with $p = dm + m^2(L-2) + m$ the total number of parameters. Without loss of generality, we assume that the width of each hidden layer is the same (i.e., m) for convenience of analysis. We denote the gradient of the neural network function by $g(x; \theta) = \nabla_\theta f(x; \theta) \in \mathbb{R}^p$.

Performance Measure. Let $a^*(t)$ denote the optimal arm at time t. The objective is to minimize the cumulative regret $R(T) = \sum_{t=1}^{T} (x_{a^*(t),t}^\top \theta^* - x_{a(t),t}^\top \theta^*)$.

3 The LinEI Algorithm for Linear Bandits

In this section, we design a provable version of the EI algorithm for linear bandits. To do this, we make some assumptions on the prior distribution of the reward function before defining a form of the expected improvement.

Prior and Posterior Distributions. Inspired by the design of priors over the reward function in the TS algorithm [3], we assume that the reward $r_{i,t}$ of each arm i follows a Gaussian distribution $N(x_{i,t}^\top \theta^*, v^2)$, where the variance $v^2$ is a free parameter (possibly time-dependent) that can be set specific to an algorithm. Let
$$X(t) = \lambda I + \sum_{j=1}^{t-1} x_{a(j),j} x_{a(j),j}^\top, \qquad \hat{\theta}_t = X(t)^{-1}\Big(\sum_{j=1}^{t-1} x_{a(j),j} r_{a(j),j}\Big).$$
If we assume that the prior for $\theta^*$ at time t is $N(\hat{\theta}_t, v^2 X(t)^{-1})$, then the posterior distribution of $\theta^*$ at time $t+1$ is $N(\hat{\theta}_{t+1}, v^2 X(t+1)^{-1})$ (see the proof in Appendix A.1 of [3]).
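To make this update concrete, the following minimal numpy sketch maintains $X(t)$ and $\hat{\theta}_t$ exactly as defined above; the dimension, the ridge parameter $\lambda = 1$, and the synthetic context and reward are illustrative stand-ins rather than quantities from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
d, lam = 5, 1.0                     # context dimension and ridge parameter (illustrative)

# Running statistics: X(t) = lam*I + sum_j x_j x_j^T  and  b_t = sum_j x_j r_j.
X_mat = lam * np.eye(d)
b = np.zeros(d)

def posterior(X_mat, b):
    """Return the posterior mean theta_hat = X(t)^{-1} b and the matrix X(t)^{-1}."""
    X_inv = np.linalg.inv(X_mat)
    return X_inv @ b, X_inv

# One round: observe the pulled context x and its reward r, then update the statistics.
theta_star = rng.normal(size=d); theta_star /= np.linalg.norm(theta_star)
x = rng.normal(size=d); x /= np.linalg.norm(x)
r = x @ theta_star + 0.1 * rng.normal()        # linear mean reward plus noise

X_mat += np.outer(x, x)
b += r * x
theta_hat, X_inv = posterior(X_mat, b)
s = np.sqrt(x @ X_inv @ x)                     # posterior scale of this context under N(theta_hat, v^2 X^{-1})
print(theta_hat, s)
```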
Expected Improvement. We now use this posterior update to define the form of the expected improvement of each arm in contextual bandits. We denote $r_t^+ = \max_{i \in [K]} \{x_{i,t}^\top \hat{\theta}_t\}$, the largest mean reward estimate among all arms at time t. We define the expected improvement of an arm i at time t as
$$\alpha^{EI}_{i,t} = \mathbb{E}_{\mu \sim N(\hat{\theta}_t, v_t^2 X(t)^{-1})}[\max\{0, x_{i,t}^\top \mu - r_t^+\}]. \qquad (1)$$
This form is similar to those of EI for the multi-armed bandit [29] and for Bayesian optimization [34], but adapted to contextual bandits. The $\alpha^{EI}_{i,t}$ value measures the potential of arm i to improve upon an incumbent. Here we define the incumbent as the largest posterior reward mean $r_t^+$ at time t. In Bayesian optimization, the incumbent is usually selected as the best reward value observed so far [11], or the largest reward mean so far [34]. In our contextual bandit setting, we choose the latter for convenience of analysis. We also note that the EI form here cannot be considered an expected version of the TS in [3]: even though an expected version of TS could be written as $\mathbb{E}_{\mu \sim N(\hat{\theta}_t, v_t^2 X(t)^{-1})}[x_{i,t}^\top \mu]$, the expectation operator cannot be taken inside the max operator, as the max is a nonlinear function.

Algorithm 1 The Linear Expected Improvement Algorithm (LinEI)
Input: Parameters $C_0$, $\beta$
1: for t = 1 to T do
2:   Observe contexts $\{x_{i,t}\}_{i=1}^{K}$
3:   Set $\bar{a}(t) := \arg\max_{i \in [K]} \alpha^{EI}_{i,t}$ and $\hat{a}(t) := \arg\max_{i \in [K]} \{x_{i,t}^\top \hat{\theta}_t\}$
4:   if $\alpha^{EI}_{\bar{a}(t),t} \ge C_0 / t^{\beta}$ then
5:     $a(t) = \bar{a}(t)$
6:   else
7:     $a(t) = \hat{a}(t)$
8:   end if
9:   Play arm $a(t)$ and observe reward $r_{a(t),t}$
10:  Update $X(t+1) = \lambda I_d + \sum_{j=1}^{t} x_{a(j),j} x_{a(j),j}^\top$ and $\hat{\theta}_{t+1} = X(t+1)^{-1}\big(\sum_{j=1}^{t} x_{a(j),j} r_{a(j),j}\big)$
11: end for

The closed form for EI. Since $\mu \sim N(\hat{\theta}_t, v^2 X(t)^{-1})$, the random variable $x_{i,t}^\top \mu$ is Gaussian with mean $x_{i,t}^\top \hat{\theta}_t$ and standard deviation $v s_{i,t}$, where we define $s_{i,t} = \sqrt{x_{i,t}^\top X(t)^{-1} x_{i,t}}$. Setting $z_{i,t} = \frac{x_{i,t}^\top \hat{\theta}_t - r_t^+}{v s_{i,t}}$, we can express the expected improvement in closed form as
$$\alpha^{EI}_{i,t} = v s_{i,t}[z_{i,t}\Phi(z_{i,t}) + \phi(z_{i,t})], \qquad (2)$$
where $\Phi(\cdot)$ and $\phi(\cdot)$ are the cdf and pdf of the standard normal distribution, respectively.

3.1 Proposed LinEI Algorithm

We now use the above form of expected improvement to design our algorithm, which proceeds iteratively as follows. At iteration t, it selects the arm suggested by the EI strategy if the expected improvement (measured by $\alpha^{EI}_{i,t}$ over all arms) is higher than a threshold. Otherwise, it simply selects the arm suggested by the greedy strategy (the arm whose posterior reward mean equals $r_t^+$). The choice of threshold plays an important role in the convergence of our algorithm. We propose an adaptive threshold of the form $C_0 / t^{\beta}$, controlled by two parameters $C_0$ and $\beta$. A choice of parameters that guarantees convergence is discussed in the next section. Our algorithm is summarized in Algorithm 1.

We remark that the proposed algorithm can be considered a variant of the EI strategy with a simple modification. This modification comes from our observation that, at an iteration where the EI value is below some threshold, using a greedy strategy is better than the EI strategy in the sense that the instantaneous regret is smaller. This is an interesting observation about the relation between EI and the greedy strategy, and we use it in our algorithm design. Without the modification, a pure EI algorithm could make the cumulative regret unbounded in our theoretical analysis.
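As a sanity check on the closed form, the short sketch below evaluates Eq. (2) and compares it with a Monte Carlo estimate of the definition in Eq. (1); the posterior quantities and the scale `v` are synthetic placeholders rather than values from a real bandit run.

```python
import numpy as np
from scipy.stats import norm

def lin_ei_values(contexts, theta_hat, X_inv, v):
    """Closed-form EI of Eq. (2) for every arm, given the posterior N(theta_hat, v^2 X^{-1})."""
    means = contexts @ theta_hat                         # x_i^T theta_hat
    r_plus = means.max()                                 # incumbent r_t^+
    s = np.sqrt(np.einsum('ij,jk,ik->i', contexts, X_inv, contexts))   # s_{i,t}
    z = (means - r_plus) / (v * s)
    return v * s * (z * norm.cdf(z) + norm.pdf(z))

# Monte Carlo check of Eq. (2) against the definition in Eq. (1) (illustrative sizes).
rng = np.random.default_rng(1)
d, K, v = 4, 6, 0.5
contexts = rng.normal(size=(K, d)); contexts /= np.linalg.norm(contexts, axis=1, keepdims=True)
X_inv = np.linalg.inv(np.eye(d) + contexts.T @ contexts)
theta_hat = 0.3 * rng.normal(size=d)

samples = rng.multivariate_normal(theta_hat, v**2 * X_inv, size=200000)
r_plus = (contexts @ theta_hat).max()
mc_ei = np.maximum(0.0, samples @ contexts.T - r_plus).mean(axis=0)
print(np.allclose(mc_ei, lin_ei_values(contexts, theta_hat, X_inv, v), atol=5e-3))
```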
Comparison with the LinUCB algorithm [1] and the LinTS algorithm [3]. LinUCB looks for the most optimistic value of an arm in an ellipsoid defined by a level set of the posterior, rather than integrating over it, and then chooses the arm that maximizes this optimistic value. LinTS generates a sample $\mu$ from the posterior distribution $N(\hat{\theta}_t, v_t^2 X(t)^{-1})$ and then chooses the arm that maximizes $x_{i,t}^\top \mu$. In contrast, LinEI may choose an arm whose maximum optimistic value is lower (unlike LinUCB), and/or an arm with a lower $x_{i,t}^\top \mu$ (unlike LinTS), but for which there is more room for improvement.

3.2 Theoretical Analysis

In this section, we provide the regret bound for the proposed LinEI algorithm. The EI technique has so far been used in BAI and BO problems to seek the best arm, i.e., the arm with the highest mean reward. In such problems, the mean reward of each arm is fixed in advance. To leverage this property, [29] uses an allocation-matching technique to allocate enough samples to suboptimal arms so that these arms can be eliminated with high confidence. In BO, [11] uses the monotonicity of the incumbent to directly derive an upper bound on the simple regret. However, in contextual bandits, where a single best arm does not exist and the mean reward of any arm can differ across iterations, the monotonicity of the incumbent disappears and the allocation technique is no longer applicable. The proposed LinEI addresses these challenges. Moreover, it selects an arm based not only on the EI strategy but also on a greedy strategy, depending on a threshold function. This makes our regret analysis different from the standard EI analyses in [36, 27].

In our proof, we first extend several results of [11] for Bayesian optimization in a non-contextual setting to our setting with contexts and noise. For instance, the following lemma is used to upper and lower bound the expected improvement of each arm i.

Lemma 3.1. Pick $0 < \delta < 1$. Set $I_{i,t} = \max\{0, x_{i,t}^\top \theta^* - r_t^+\}$. Then with probability $1-\delta$,
$$I_{i,t} - \beta_t s_{i,t} \le \alpha^{EI}_{i,t} \le I_{i,t} + (\beta_t + v_t) s_{i,t},$$
where $\beta_t = R\sqrt{d\ln(t/\delta)} + 1$, $s_{i,t} = \sqrt{x_{i,t}^\top X(t)^{-1} x_{i,t}}$, and $v_t$ is used in place of $v$.

Then, depending on the arm selected by our LinEI algorithm, we can upper bound the instantaneous regret $r_t = x_{a^*(t),t}^\top \theta^* - x_{a(t),t}^\top \theta^*$ in two different ways. Combining them yields, with probability at least $1-\delta$, an upper bound on $r_t$ built from terms of the form $(2\beta_t + v_t)s_{a(t),t}$ and $\beta_t s_{a(t),t}$, plus a term involving $\tau(\beta_t/v_t)$, $\ln(Rt^{\beta})$, $C_0$, and $\sqrt{d\ln(t/\delta)}\, v_t s_{a(t),t}$; the full expression is given in Section C of our Supplementary Material. The function $\tau$ is defined as $\tau(z) = z\Phi(z) + \phi(z)$. Here, we note that the parameter $v_t$ plays the role of the parameter $v$ at time t discussed above. In our analysis, $v_t$ is used to eliminate the influence of $\beta_t$ so that $\beta_t/v_t$ remains bounded as t grows. Finally, with a suitable choice of the parameters $C_0$ and $\beta$, we obtain the following regret bound for our proposed algorithm.

Theorem 3.2. Given any $\delta \in (0, 1)$, if $v_t = R\sqrt{d\ln(t/\delta)}$, $\sqrt{d} \le C_0 \le d$, and $0.5 \le \beta \le 3$, then with probability $1-\delta$ the cumulative regret of the LinEI algorithm is bounded as $R(T) \le O\big(d\sqrt{T}\ln^2(T)\ln T\big)$.
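To see Algorithm 1 and the parameter ranges of Theorem 3.2 in action, the following self-contained sketch runs LinEI on a small synthetic linear bandit; the dimension, number of arms, horizon, noise level, and the constant posterior scale `v` are illustrative choices, not the values used in the paper's experiments.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
d, K, T, noise = 5, 10, 2000, 0.1                 # illustrative problem sizes
C0, beta, v = np.sqrt(d), 2.0, 0.5                # C0 in [sqrt(d), d], beta in [0.5, 3] as in Theorem 3.2

theta_star = rng.normal(size=d); theta_star /= np.linalg.norm(theta_star)
X_mat, b, regret = np.eye(d), np.zeros(d), 0.0

for t in range(1, T + 1):
    contexts = rng.normal(size=(K, d))
    contexts /= np.linalg.norm(contexts, axis=1, keepdims=True)

    X_inv = np.linalg.inv(X_mat)
    theta_hat = X_inv @ b
    means = contexts @ theta_hat
    s = np.sqrt(np.einsum('ij,jk,ik->i', contexts, X_inv, contexts))
    z = (means - means.max()) / (v * s)
    ei = v * s * (z * norm.cdf(z) + norm.pdf(z))                  # closed-form EI, Eq. (2)

    # Algorithm 1: pick the EI arm if its EI clears the threshold C0 / t^beta, the greedy arm otherwise.
    arm = int(np.argmax(ei)) if ei.max() >= C0 / t**beta else int(np.argmax(means))

    true_means = contexts @ theta_star
    reward = true_means[arm] + noise * rng.normal()
    regret += true_means.max() - true_means[arm]

    X_mat += np.outer(contexts[arm], contexts[arm])               # update of line 10 in Algorithm 1
    b += reward * contexts[arm]

print(f"cumulative regret after {T} rounds: {regret:.2f}")
```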
4 The NeuralEI Algorithm for Neural Contextual Bandits

In this section, we extend Algorithm 1 to the more general setting where the reward function is modelled by a fully connected neural network. Similar to the Neural Thompson Sampling approach [38], our algorithm maintains a Gaussian distribution for each arm's reward. At time t, the posterior distribution of the reward of arm i is updated as follows. The mean is set to the output of the neural network, denoted by $f(x_{i,t}; \theta_{t-1})$, and the variance is defined as $\sigma^2_{i,t} = \lambda\, g(x_{i,t}; \theta_{t-1})^\top U_{t-1}^{-1} g(x_{i,t}; \theta_{t-1})/m$, where the matrix $U_t$ is updated as $U_t = U_{t-1} + g(x_{a(t),t}; \theta_t) g(x_{a(t),t}; \theta_t)^\top/m$ and the parameter $\theta_t$ is the solution to the following minimization problem:
$$\min_{\theta}\ \sum_{i=1}^{t} [f(x_{a(i),i}; \theta) - r_{a(i),i}]^2/2 + m\lambda\|\theta - \theta_0\|_2^2/2, \qquad (3)$$
where $\theta_0$ is the randomly initialized network parameter. We can adapt gradient descent algorithms to solve this problem with step size $\eta$ and total number of iterations $J$, as in the gradient descent algorithm of [39].

Algorithm 2 Neural Expected Improvement Algorithm (NeuralEI)
Input: Number of rounds T, exploration variance $\nu$, network width m, regularization parameter $\lambda$, and parameters $C_0$, $\beta$
1: Set $U_0 = \lambda I$
2: for t = 1 to T do
3:   Set $\bar{a}(t) := \arg\max_{i \in [K]} \alpha^{EI}_{i,t}$ and $\hat{a}(t) := \arg\max_{i \in [K]} \{f(x_{i,t}; \theta_{t-1})\}$
4:   if $\alpha^{EI}_{\bar{a}(t),t} \ge C_0 / t^{\beta}$ then
5:     $a(t) = \bar{a}(t)$
6:   else
7:     $a(t) = \hat{a}(t)$
8:   end if
9:   Play arm $a(t)$ and observe reward $r_{a(t),t}$
10:  Set $\theta_t$ to be the output of gradient descent for solving Eq. (3)
11:  $U_t = U_{t-1} + g(x_{a(t),t}; \theta_t) g(x_{a(t),t}; \theta_t)^\top/m$
12: end for

Expected Improvement for Neural Contextual Bandits. We now define the form of the expected improvement in this setting. At each time step t, we denote $f^+_t = \max_{i \in [K]} \{f(x_{i,t}; \theta_{t-1})\}$, the highest mean estimate of $f(x; \theta_{t-1})$ among all arms at time t. We define the expected improvement value of an arm i at time t as $\alpha^{EI}_{i,t} = \mathbb{E}_{\tilde{f}_{i,t} \sim N(f(x_{i,t}; \theta_{t-1}),\ \nu^2\sigma^2_{i,t})}[\max\{0, \tilde{f}_{i,t} - f^+_t\}]$. Further, setting $z_{i,t} = \frac{f(x_{i,t}; \theta_{t-1}) - f^+_t}{\nu\sigma_{i,t}}$, the above expectation can be computed analytically as
$$\alpha^{EI}_{i,t} = \nu\sigma_{i,t}[z_{i,t}\Phi(z_{i,t}) + \phi(z_{i,t})]. \qquad (4)$$

Our NeuralEI algorithm is given in Algorithm 2. It starts by initializing $\theta_0 = (\mathrm{vec}(W_1); \ldots; \mathrm{vec}(W_L))$, where for each $1 \le l \le L-1$, $W_l = \begin{pmatrix} W & 0 \\ 0 & W \end{pmatrix}$ with each entry of W generated independently from $N(0, 4/m)$, and $W_L = (w^\top, -w^\top)$ with each entry of w generated independently from $N(0, 2/m)$. NeuralEI extends our LinEI algorithm to the setting where the reward function h is modelled by a fully connected neural network.
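A minimal PyTorch sketch of the gradient-feature quantities used by Algorithm 2 is given below: it computes the posterior variance $\sigma^2_{i,t}$ from the flattened network gradient and performs the $U_t$ update. The tiny two-layer network, its width, and all hyperparameters are illustrative assumptions, and the gradient-descent solve of Eq. (3) is omitted.

```python
import torch

torch.manual_seed(0)
d, m, lam = 8, 32, 1.0                      # illustrative context dimension, width, and regularizer

# Two-layer ReLU network: f(x; theta) = sqrt(m) * W2 relu(W1 x).
net = torch.nn.Sequential(torch.nn.Linear(d, m, bias=False),
                          torch.nn.ReLU(),
                          torch.nn.Linear(m, 1, bias=False))
p = sum(w.numel() for w in net.parameters())
U = lam * torch.eye(p)                      # U_0 = lam * I, as in Algorithm 2

def grad_features(x):
    """Flattened gradient g(x; theta) of the network output with respect to all parameters."""
    out = (m ** 0.5) * net(x).sum()
    grads = torch.autograd.grad(out, list(net.parameters()))
    return torch.cat([g.reshape(-1) for g in grads])

def sigma2(x):
    """Posterior variance lam * g^T U^{-1} g / m, as defined in Section 4."""
    g = grad_features(x)
    return lam * g @ torch.linalg.solve(U, g) / m

x = torch.randn(d)
print(float(sigma2(x)))

# After playing the arm with context x, update U_t = U_{t-1} + g g^T / m.
g = grad_features(x)
U = U + torch.outer(g, g) / m
```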
4.1 Regret Analysis

In this section, we provide a regret analysis of the NeuralEI algorithm. We first give the necessary background on neural tangent kernel (NTK) theory, which plays an important role in our analysis. Following a recent line of research [39, 38], we define the covariance between two data points $x, y \in \mathbb{R}^d$ as follows:
$$\tilde{H}^{(1)}(x, y) = \Sigma^{(1)}(x, y) = x^\top y, \qquad A^{(l)}(x, y) = \begin{pmatrix} \Sigma^{(l)}(x, x) & \Sigma^{(l)}(x, y) \\ \Sigma^{(l)}(x, y) & \Sigma^{(l)}(y, y) \end{pmatrix},$$
$$\Sigma^{(l+1)}(x, y) = 2\,\mathbb{E}_{(u,v) \sim N(0, A^{(l)}(x,y))}[\sigma(u)\sigma(v)],$$
$$\tilde{H}^{(l+1)}(x, y) = 2\,\tilde{H}^{(l)}(x, y)\,\mathbb{E}_{(u,v) \sim N(0, A^{(l)}(x,y))}[\sigma'(u)\sigma'(v)] + \Sigma^{(l+1)}(x, y).$$
Similar to [39, 38], we assume that the number of rounds T is known and define the neural tangent kernel (NTK) matrix $H \in \mathbb{R}^{TK \times TK}$ based on all contextual vectors $\{x_{t,k}\}_{t \in [T], k \in [K]}$. Renumbering $\{x_{t,k}\}_{t \in [T], k \in [K]}$ as $\{x^i\}_{i=1,\ldots,TK}$, each entry $H_{ij}$ is defined as
$$H_{ij} = \big(\tilde{H}^{(L)}(x^i, x^j) + \Sigma^{(L)}(x^i, x^j)\big)/2, \qquad (5)$$
for all $i, j \in [TK]$. Based on the above definition, we impose the following assumption on the contexts generated by the adversary and the corresponding NTK matrix H.

Assumption 4.1. Let H be defined as in Eq. (5). There exists $\lambda_0 > 0$ such that $H \succeq \lambda_0 I$. In addition, for any $t \in [T]$ and $k \in [K]$, $\|x_{t,k}\|_2 = 1$ and $[x_{t,k}]_j = [x_{t,k}]_{j+d/2}$.

Remark 1. Compared to Algorithm 1 for linear bandits, our Algorithm 2 needs the additional Assumption 4.1 to guarantee convergence. The assumption that the NTK matrix is positive definite is a mild condition that has been considered in prior work on the NTK and is also imposed in other related works [4, 15, 39, 38]. The assumption on the contexts ensures that $f(x_{i,t}; \theta_0) = 0$ for any $i \in [K]$, $t \in [T]$.

The NTK technique builds a connection between deep neural networks and kernel methods. It enables us to adapt complexity measures for kernel methods to describe the complexity of the neural network through the notion of the effective dimension, as defined in [39, 38]. The effective dimension $\tilde{d}$ of the matrix H with regularization parameter $\lambda$ is defined as
$$\tilde{d} = \frac{\log\det(I + H/\lambda)}{\log(1 + TK/\lambda)}.$$
Using these notations, we are now ready to present the second main result of the paper. Let $a^*(t) = \arg\max_{i \in [K]} \mathbb{E}[r_{i,t}]$ be the optimal action at round t, i.e., the action that maximizes the expected reward, and define the expected cumulative regret after T rounds as $R(T) = \mathbb{E}[\sum_{t=1}^{T}(r_{a^*(t),t} - r_{a(t),t})]$. Then we obtain the following regret upper bound for Algorithm 2 by combining our EI techniques for LinEI with NTK techniques. A complete proof is provided in the Supplementary Material.

Theorem 4.2. Under Assumption 4.1, set the parameters in Algorithm 2 as $\lambda = 1 + 1/T$ and $\nu = B + R\sqrt{\tilde{d}\log(1 + TK/\lambda) + 2 + 2\log(1/\delta)}$, where $B = \max\{1, \sqrt{2\mathbf{h}^\top H^{-1}\mathbf{h}}\}$ with $\mathbf{h} = (h(x^1), \ldots, h(x^{TK}))^\top$. If $\sqrt{\tilde{d}} \le C_0 \le \tilde{d}$, $\beta \ge 2$, and the network width m satisfies $m \ge \mathrm{poly}(\gamma, T, K, L, \log(1/\delta))$, then with probability at least $1-\delta$ the regret of Algorithm 2 is bounded as
$$R(T) \le O\Big(\tilde{d}\sqrt{\beta\log(1 + TK)\log(T)\,T}\Big).$$

Remark 2. The regret bound depends on the parameter $\beta$; the choice $\beta = 2$ tightens the regret. Theorem 4.2 implies that the regret of NeuralEI is of order $\tilde{O}(\tilde{d}\sqrt{T})$. This result matches the regret bounds of NeuralUCB [39] and NeuralTS [38], as well as that of [12]. The effective dimension $\tilde{d}$ measures how quickly the eigenvalues of H diminish, and it depends on T only logarithmically in several special cases [35], according to [39]. Furthermore, [38] shows that $\tilde{d}$ can be upper bounded if all contexts lie nearly on some low-dimensional subspace of the RKHS spanned by the NTK. Recently, [20] gave an explicit sublinear regret bound for a neural network based UCB algorithm; however, their solution requires the contexts to lie on the d-dimensional hypersphere. Compared to these works, our EI-based results may be of independent interest.

Remark 3. Similar to most existing results on neural bandits, our results require a large value of m. This is rooted in the current deep learning theory based on the neural tangent kernel.
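The effective dimension defined above can be computed directly from any kernel matrix; in the sketch below, an RBF kernel on random contexts is an illustrative stand-in for the NTK matrix H of Eq. (5), and the sizes are placeholders.

```python
import numpy as np

def effective_dimension(H, lam, T, K):
    """Effective dimension d_tilde = logdet(I + H/lam) / log(1 + T*K/lam), as defined in Section 4.1."""
    n = H.shape[0]
    sign, logdet = np.linalg.slogdet(np.eye(n) + H / lam)
    return logdet / np.log(1.0 + T * K / lam)

# Placeholder kernel matrix standing in for the NTK matrix H over the T*K contexts.
rng = np.random.default_rng(3)
T, K, d, lam = 50, 4, 8, 1.0
X = rng.normal(size=(T * K, d))
sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
H = np.exp(-0.5 * sq)                        # RBF kernel as a stand-in for Eq. (5)
print(effective_dimension(H, lam, T, K))     # stays small when the eigenvalues of H decay quickly
```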
4.2 Technical Challenges

While the proof for NeuralEI follows the same lines as that for LinEI, we emphasize several key differences. First, we define the filtration $\mathcal{F}_{t-1}$ as the union of the history up to time $t-1$ and the contexts at time t. Given a time t, we define the event $E^\sigma_t$ as
$$E^\sigma_t = \{\omega \in \mathcal{F}_t : \forall i \in [K],\ |\tilde{f}_{i,t} - f(x_{i,t}; \theta_{t-1})| \le c_t\nu\sigma_{i,t}\},$$
where $c_t = \sqrt{4\log t + 2\log K}$. Similarly, we define the event $E^\mu_t$ as
$$E^\mu_t = \{\omega \in \mathcal{F}_t : \forall i \in [K],\ |f(x_{i,t}; \theta_{t-1}) - h(x_{i,t})| \le \nu\sigma_{i,t} + \epsilon(m)\},$$
where $\epsilon(m)$ is defined as in [38]. Based on the definitions of the events $E^\sigma_t$ and $E^\mu_t$, we obtain two different ways to lower and upper bound the expected improvement, as follows.

Lemma 4.3. Assume that the event $E^\mu_t$ holds. Set $J_{i,t} = \max\{0, h(x_{i,t}) - f^+_t\}$. Then for every $i \in [K]$, we have $J_{i,t} - \nu\sigma_{i,t} - \epsilon(m) \le \alpha^{EI}_{i,t} \le J_{i,t} + \nu\sigma_{i,t} + \epsilon(m)$.

Lemma 4.4. Assume that the event $E^\sigma_t$ holds. Set $I_{i,t} = \max\{0, \tilde{f}_{i,t} - f^+_t\}$. Then for every $i \in [K]$, we have $I_{i,t} - c_t\nu\sigma_{i,t} \le \alpha^{EI}_{i,t} \le I_{i,t} + \nu(c_t + 1)\sigma_{i,t}$.

While Lemma 4.3 is an adaptation of Lemma 3.1 to neural contextual bandits, Lemma 4.4 is a novel result compared to the regret analysis for linear bandits in Section 3.2. The challenge in analyzing the regret of NeuralEI comes from the additional factor $\epsilon(m)$ in Lemma 4.3: compared to LinEI, it is hard to analyze the regret $h(x_{a^*(t),t}) - h(x_{a(t),t})$ directly. We divide the analysis into two cases: if $\sigma_{a^*(t),t} \ge \epsilon(m)$, we can follow the technique used for LinEI; otherwise, we bound the regret in another way using the novel result mentioned above. Please see Lemma D.11 in the appendix for details.

5 Related Works and Discussion

Given the vast literature on bandit algorithms, we restrict our review to linear bandits and neural contextual bandits.

Linear Contextual Bandits. A lower bound of $\Omega(d\sqrt{T})$ for linear bandits was given in [14] when the number of arms is allowed to be infinite. [1] analyze a UCB-style algorithm and provide a regret upper bound of $O(d\log(T)\sqrt{T} + \sqrt{dT\log(T/\delta)})$. Compared to this work, the regret of our proposed LinEI carries an additional $\ln T$ factor, but it still matches the information-theoretic lower bound in this setting. When the number of arms K is finite, [13] achieve a regret bound of $O(\sqrt{Td\ln^3(KT\ln T/\delta)})$ with probability at least $1-\delta$. [10] provide an algorithm based on exponential weights with regret of order $O(\sqrt{dT\log K})$. These algorithms may not be effective when the number of arms K is large: for example, when K is exponential in d, the regret bound of [13] becomes $O(d^2\sqrt{T})$, showing a quadratic growth in d. The Thompson Sampling algorithm [3] and the alternative given in [2] bear an additional $\sqrt{d}$ factor in the regret bound. Very recently, [21] improved this regret bound of TS in some cases by integrating a doubly robust estimator with TS. However, this approach requires significant additional computation, and its setting is limited to independent contexts. In contrast, our EI-based algorithm achieves the optimal order in d even in the general setting where the contexts may be controlled by an adaptive adversary.

Another approach for linear bandits is Information Directed Sampling (IDS), introduced in [32]. It provides an action-selection mechanism that minimizes, over all action sampling distributions, the information ratio between the squared expected regret and the mutual information between the optimal action and the next observation. IDS obtains a performance improvement over TS and UCB algorithms in some cases, but has heavy sampling requirements: their experiments show that IDS requires significantly more compute time than Thompson Sampling and UCB algorithms. Recently, [5] provided a modification of the arm-scoring rule of IDS to reduce computation. However, both [32] and [5] only provide bounds on the expected regret. In contrast, our work provides high-probability bounds on the cumulative regret, which are stronger than bounds on the expected regret.

Neural Contextual Bandits. Neural contextual bandits are becoming attractive due to recent advances in the optimization and generalization of deep neural networks [4, 15]. Neural contextual bandits have been considered with both popular techniques, UCB [39] and Thompson Sampling [38]. Recently, [6] proposed a novel neural exploration strategy whose solution achieves a regret sublinear in T. Compared to all existing works in this sub-field, our EI-based algorithm is new and may be of independent interest. Due to the space limit, we discuss additional related works in Section B of the Supplementary Material.
6 Experiments

6.1 Linear Bandits

In this subsection, we assess the performance of our LinEI algorithm on several benchmark datasets, including covertype, magic, avila, dry bean, statlog, letter, and pendigits, all from the UCI repository [16] (see Table 1 for details). We compare LinEI with methods designed for linear bandits: LinTS [3], LinUCB [1], Linear Epsilon-Greedy for the linear reward, and LinIDS [32].

Table 1: Characteristics of the benchmark datasets used in Section 6.

| Dataset | letter | pendigits | covertype | avila | magic | dry bean | statlog |
|---|---|---|---|---|---|---|---|
| Classes (K) | 26 | 10 | 7 | 12 | 2 | 7 | 7 |
| Feature dimension | 17 | 16 | 54 | 10 | 10 | 16 | 8 |
| Dataset size | 20000 | 10992 | 581012 | 20867 | 19020 | 13611 | 58000 |

To transform these classification problems into multi-armed bandits, we adopt the disjoint model to build a context feature vector for each arm: given an input feature $x \in \mathbb{R}^d$ of a k-class classification problem, we build context feature vectors of dimension kd as $x_1 = (x; 0; \ldots; 0)$, $x_2 = (0; x; \ldots; 0)$, ..., $x_k = (0; 0; \ldots; x)$. The algorithm generates a set of predicted rewards and pulls the greedy arm. For these classification problems, if the algorithm selects the correct class by pulling the corresponding arm, it receives a reward of 1, and 0 otherwise. The cumulative regret over the time horizon T is thus measured by the total number of mistakes made by the algorithm. We set the time horizon to 10000 for all datasets, and we shuffle all datasets randomly. A code sketch of this conversion is given at the end of this subsection.

For the parameters $(\lambda, \nu)$ used in LinUCB, LinTS, and our algorithm, we set $\lambda = 1$ following previous works and do a grid search over $\nu \in \{1, 0.1, 0.01\}$ to select the parameter with the best performance. All experiments are repeated 10 times, and the average with standard error is reported. For LinIDS, we use M = 100 samples. For Linear Epsilon-Greedy, we use $\epsilon = 0.1$. For our LinEI algorithm, we can choose any value $C_0 \in [\sqrt{d}, d]$ and $\beta \in [0.5, 3]$; we set $C_0 = d$ and $\beta = 2$.

Figure 1: Comparison of our proposed LinEI and baseline algorithms in linear bandits.

Figure 1 shows the total regret of all algorithms on the datasets bean, covertype, and statlog. Linear Epsilon-Greedy performs the worst. While LinUCB, LinTS, and LinIDS are competitive, all of these methods are significantly outperformed by the proposed algorithm. This difference in performance is due to the strategy used by each method. Due to the space limit, the additional results on magic, pendigits, letter, and avila are shown in Section A of the Supplementary Material. Our results on these various datasets confirm that the exploration performed by expected improvement is effective in practice, as has also been widely observed in Bayesian optimization.
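The sketch below illustrates the disjoint-model encoding and 0/1 reward described above; the random features and the random policy are hypothetical stand-ins for a UCI dataset row and a bandit algorithm.

```python
import numpy as np

def disjoint_contexts(x, k):
    """Map a feature vector x in R^d to k arm contexts x_1, ..., x_k in R^{k*d}
    by placing x in the block of the corresponding arm and zeros elsewhere."""
    d = x.shape[0]
    contexts = np.zeros((k, k * d))
    for a in range(k):
        contexts[a, a * d:(a + 1) * d] = x
    return contexts

def bandit_round(x, label, k, policy):
    """Pull one arm with `policy` and return the 0/1 reward (1 iff the pulled arm is the true class)."""
    contexts = disjoint_contexts(x, k)
    arm = policy(contexts)
    return float(arm == label)

# Toy usage with random features standing in for one row of a UCI dataset.
rng = np.random.default_rng(4)
k, d = 7, 16
x, label = rng.normal(size=d), rng.integers(k)
reward = bandit_round(x, label, k, policy=lambda c: rng.integers(k))   # random policy placeholder
print(reward)
```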
6.1.1 Comparison of LinUCB, LinTS, and LinEI on a large-scale multi-label dataset

We now compare our proposed LinEI against LinUCB and LinTS on the eurlex-4k dataset, a large multi-label dataset [8]. It contains data with 5000 features and 3993 labels, which corresponds to a contextual bandit problem with 3993 actions and context dimension 5000 + 3993 = 8993 when mapped to our setting. This can be considered a contextual bandit problem with a large action space and very high dimension. Due to the heavy computation, we run the algorithms on only 3000 data points from the test set.

Figure 2: Comparison of LinUCB, LinTS, and our proposed LinEI on a large-scale multi-label dataset.

Our experimental results show that LinEI scales better to high dimensions and to large action spaces (see Figure 2). This can be explained by the fact that the regret of LinEI is better than that of LinTS by a factor $\sqrt{d}$, while LinUCB is an over-exploring strategy in high dimensions. In terms of required computation, LinEI is significantly cheaper than LinUCB and LinTS; please see Section A.2 of the Supplementary Material for details. Due to the space limit, we provide additional comparisons in terms of the dimension d in Section A.1 of the Supplementary Material.

6.2 Neural Bandits

We compare the proposed NeuralEI with the following baselines: LinUCB [1], our LinEI for the linear bandit problem, Neural Epsilon-Greedy, NeuralUCB [39], and NeuralTS [38]. We use the same classification problems as in the linear bandit experiments. For methods using a neural network, we use one-hidden-layer neural networks with 100 neurons to model the reward function. During posterior updating, gradient descent is run for 100 iterations with learning rate 0.001. For NeuralUCB, NeuralTS, and NeuralEI, we use a grid search over $\lambda \in \{1, 10^{-1}, 10^{-2}, 10^{-3}\}$ and $\nu \in \{10^{-1}, 10^{-2}, 10^{-3}, 10^{-4}, 10^{-5}\}$, as in [39] and [38]. We consider our algorithm on both synthetic and real-world datasets; due to the space limit, results on the synthetic datasets are provided in Section A of the Supplementary Material.

Real-world Datasets. As in the linear bandit experiments, we build the context feature vectors of dimension kd as $x_1 = (x; 0; \ldots; 0)$, $x_2 = (0; x; \ldots; 0)$, ..., $x_k = (0; 0; \ldots; x)$, and we evaluate our algorithm on the datasets bean, covertype, and statlog.

Figure 3: Comparison of NeuralEI and baseline algorithms on real-world datasets.

Figure 3 shows our results for the neural bandit problem. Neural-based methods generally perform better because they can capture the nonlinearity of the underlying reward function. On the real-world datasets, while the neural-based methods outperform LinEI and LinUCB on covertype and statlog, they are not sample-efficient for learning the reward function of the bean dataset; perhaps the reward function for bean is close to linear. However, in all cases, our NeuralEI algorithm performs better than the other neural-based methods. This suggests that the expected improvement strategy is effective in both linear bandits and neural contextual bandits.

7 Conclusion

We introduced and formalized Expected Improvement as a new strategy for contextual bandits. We proposed two EI-based algorithms and analyzed them theoretically. The first algorithm assumes the reward function to be linear, whilst the second is designed for the case where the reward function is general and can be modelled by a deep neural network. Our promising empirical results on real-world datasets suggest that our EI-based algorithms work well in practice compared to other approaches, especially in high dimensions. We believe our work will be useful for further improvements and extensions; for example, extending EI to the reinforcement learning setting is an interesting open problem.

References

[1] Yasin Abbasi-Yadkori, Dávid Pál, and Csaba Szepesvári. Improved algorithms for linear stochastic bandits. In Advances in Neural Information Processing Systems, volume 24, pages 2312–2320. Curran Associates, Inc., 2011.

[2] Marc Abeille and Alessandro Lazaric. Linear Thompson Sampling revisited. In Proceedings of the 20th International Conference on Artificial Intelligence and Statistics, volume 54 of Proceedings of Machine Learning Research, pages 176–184. PMLR, 2017.
[3] Shipra Agrawal and Navin Goyal. Thompson sampling for contextual bandits with linear payoffs. In Proceedings of the 30th International Conference on Machine Learning, volume 28 of Proceedings of Machine Learning Research, pages 127–135. PMLR, 2013.

[4] Sanjeev Arora, Simon S. Du, Wei Hu, Zhiyuan Li, Russ R. Salakhutdinov, and Ruosong Wang. On exact computation with an infinitely wide neural net. In Advances in Neural Information Processing Systems, volume 32. Curran Associates, Inc., 2019.

[5] Jackie Baek and Vivek F. Farias. TS-UCB: Improving on Thompson sampling with little to no additional computation, 2021.

[6] Yikun Ban, Yuchen Yan, Arindam Banerjee, and Jingrui He. EE-Net: Exploitation-exploration neural networks in contextual bandits. In International Conference on Learning Representations, 2022.

[7] Hamsa Bastani, Mohsen Bayati, and Khashayar Khosravi. Mostly exploration-free algorithms for contextual bandits. Management Science.

[8] K. Bhatia, K. Dahiya, H. Jain, P. Kar, A. Mittal, Y. Prabhu, and M. Varma. The extreme classification repository: Multi-label datasets and code, 2016.

[9] Sébastien Bubeck and Nicolò Cesa-Bianchi. Regret analysis of stochastic and nonstochastic multi-armed bandit problems, 2012.

[10] Sébastien Bubeck, Nicolò Cesa-Bianchi, and Sham M. Kakade. Towards minimax policies for online linear optimization with bandit feedback. In Proceedings of the 25th Annual Conference on Learning Theory, volume 23 of Proceedings of Machine Learning Research, pages 41.1–41.14. PMLR, 2012.

[11] Adam D. Bull. Convergence rates of efficient global optimization algorithms. Journal of Machine Learning Research, 12:2879–2904, 2011.

[12] Sayak Ray Chowdhury and Aditya Gopalan. On kernelized multi-armed bandits. In Proceedings of the 34th International Conference on Machine Learning (ICML 2017), pages 844–853. JMLR.org, 2017.

[13] Wei Chu, Lihong Li, Lev Reyzin, and Robert Schapire. Contextual bandits with linear payoff functions. In Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, volume 15 of Proceedings of Machine Learning Research, pages 208–214. PMLR, 2011.

[14] Varsha Dani, Thomas P. Hayes, and Sham M. Kakade. Stochastic linear optimization under bandit feedback. In 21st Annual Conference on Learning Theory (COLT 2008), pages 355–366, 2008.

[15] Simon Du, Jason Lee, Haochuan Li, Liwei Wang, and Xiyu Zhai. Gradient descent finds global minima of deep neural networks. In Proceedings of the 36th International Conference on Machine Learning, volume 97 of Proceedings of Machine Learning Research, pages 1675–1685. PMLR, 2019.

[16] Dheeru Dua and Casey Graff. UCI machine learning repository, 2017.

[17] Sarah Filippi, Olivier Cappé, Aurélien Garivier, and Csaba Szepesvári. Parametric bandits: The generalized linear case. In Advances in Neural Information Processing Systems, volume 23. Curran Associates, Inc., 2010.
[18] Thore Graepel, Joaquin Quiñonero Candela, Thomas Borchert, and Ralf Herbrich. Web-scale Bayesian click-through rate prediction for sponsored search advertising in Microsoft's Bing search engine. In Proceedings of ICML 2010, pages 13–20. Omnipress, 2010.

[19] Kristjan Greenewald, Ambuj Tewari, Predrag Klasnja, and Susan Murphy. Action centered contextual bandits, 2017.

[20] Parnian Kassraie and Andreas Krause. Neural contextual bandits without regret, 2021.

[21] Wonyoung Kim, Gi-Soo Kim, and Myunghee Cho Paik. Doubly robust Thompson sampling for linear payoffs, 2021.

[22] Branislav Kveton, Manzil Zaheer, Csaba Szepesvári, Lihong Li, Mohammad Ghavamzadeh, and Craig Boutilier. Randomized exploration in generalized linear bandits. In Proceedings of the Twenty Third International Conference on Artificial Intelligence and Statistics, volume 108 of Proceedings of Machine Learning Research, pages 2066–2076. PMLR, 2020.

[23] Tor Lattimore and Csaba Szepesvári. Bandit Algorithms. Cambridge University Press, 2020.

[24] Lihong Li, Wei Chu, John Langford, and Robert E. Schapire. A contextual-bandit approach to personalized news article recommendation. In Proceedings of the 19th International Conference on World Wide Web (WWW), pages 661–670. ACM, 2010.

[25] Jeffrey Mahler, Florian T. Pokorny, Brian Hou, Melrose Roderick, Michael Laskey, Mathieu Aubry, Kai Kohlhoff, Torsten Kröger, James Kuffner, and Ken Goldberg. Dex-Net 1.0: A cloud-based network of 3D objects for robust grasp planning using a multi-armed bandit model with correlated rewards. In 2016 IEEE International Conference on Robotics and Automation (ICRA), pages 1957–1964. IEEE Press, 2016.

[26] J. Močkus. On Bayesian methods for seeking the extremum. In Optimization Techniques, IFIP Technical Conference Novosibirsk 1974, pages 400–404. Springer Berlin Heidelberg, 1975.

[27] Vu Nguyen, Sunil Gupta, Santu Rana, Cheng Li, and Svetha Venkatesh. Regret for expected improvement over the best-observed value and stopping condition. In Proceedings of the Ninth Asian Conference on Machine Learning, volume 77 of Proceedings of Machine Learning Research, pages 279–294. PMLR, 2017.

[28] Michael A. Osborne. Bayesian Gaussian processes for sequential prediction, optimisation and quadrature. 2010.

[29] Chao Qin, Diego Klabjan, and Daniel Russo. Improving the expected improvement algorithm. In Proceedings of the 31st International Conference on Neural Information Processing Systems (NIPS 2017), pages 5387–5397. Curran Associates Inc., 2017.

[30] Carlos Riquelme, George Tucker, and Jasper Snoek. Deep Bayesian bandits showdown: An empirical comparison of Bayesian deep networks for Thompson sampling. In International Conference on Learning Representations, 2018.

[31] Daniel Russo and Benjamin Van Roy. Learning to optimize via posterior sampling, 2014.

[32] Daniel Russo and Benjamin Van Roy. Learning to optimize via information-directed sampling. In Advances in Neural Information Processing Systems, volume 27. Curran Associates, Inc., 2014.

[33] Ilya O. Ryzhov. On the convergence rates of expected improvement methods. Operations Research, 64(6):1515–1528, 2016.

[34] Hung Tran-The, Sunil Gupta, Santu Rana, and Svetha Venkatesh. Regret bounds for expected improvement algorithms in Gaussian process bandit optimization. In Proceedings of The 25th International Conference on Artificial Intelligence and Statistics, volume 151 of Proceedings of Machine Learning Research, pages 8715–8737. PMLR, 2022.
[35] Michal Valko, Nathan Korda, Rémi Munos, Ilias Flaounas, and Nello Cristianini. Finite-time analysis of kernelised contextual bandits. In Proceedings of the Twenty-Ninth Conference on Uncertainty in Artificial Intelligence (UAI 2013), pages 654–663. AUAI Press, 2013.

[36] Ziyu Wang and Nando de Freitas. Theoretical analysis of Bayesian optimisation with unknown Gaussian process hyper-parameters, 2014.

[37] Dawei Zhan and Huanlai Xing. Expected improvement for expensive optimization: a review. Journal of Global Optimization, 78(3):507–544, 2020.

[38] Weitong Zhang, Dongruo Zhou, Lihong Li, and Quanquan Gu. Neural Thompson sampling. In International Conference on Learning Representations, 2021.

[39] Dongruo Zhou, Lihong Li, and Quanquan Gu. Neural contextual bandits with UCB-based exploration. In Proceedings of the 37th International Conference on Machine Learning, volume 119 of Proceedings of Machine Learning Research, pages 11492–11502. PMLR, 2020.

Checklist

1. For all authors...
   (a) Do the main claims made in the abstract and introduction accurately reflect the paper's contributions and scope? [Yes]
   (b) Did you describe the limitations of your work? [Yes]
   (c) Did you discuss any potential negative societal impacts of your work? [N/A]
   (d) Have you read the ethics review guidelines and ensured that your paper conforms to them? [Yes]
2. If you are including theoretical results...
   (a) Did you state the full set of assumptions of all theoretical results? [Yes]
   (b) Did you include complete proofs of all theoretical results? [Yes]
3. If you ran experiments...
   (a) Did you include the code, data, and instructions needed to reproduce the main experimental results (either in the supplemental material or as a URL)? [Yes]
   (b) Did you specify all the training details (e.g., data splits, hyperparameters, how they were chosen)? [Yes]
   (c) Did you report error bars (e.g., with respect to the random seed after running experiments multiple times)? [Yes]
   (d) Did you include the total amount of compute and the type of resources used (e.g., type of GPUs, internal cluster, or cloud provider)? [No]
4. If you are using existing assets (e.g., code, data, models) or curating/releasing new assets...
   (a) If your work uses existing assets, did you cite the creators? [Yes]
   (b) Did you mention the license of the assets? [N/A]
   (c) Did you include any new assets either in the supplemental material or as a URL? [N/A]
   (d) Did you discuss whether and how consent was obtained from people whose data you're using/curating? [N/A]
   (e) Did you discuss whether the data you are using/curating contains personally identifiable information or offensive content? [N/A]
5. If you used crowdsourcing or conducted research with human subjects...
   (a) Did you include the full text of instructions given to participants and screenshots, if applicable? [N/A]
   (b) Did you describe any potential participant risks, with links to Institutional Review Board (IRB) approvals, if applicable? [N/A]
   (c) Did you include the estimated hourly wage paid to participants and the total amount spent on participant compensation? [N/A]