# Provably and Practically Efficient Neural Contextual Bandits

Sudeep Salgia¹

¹Department of Electrical and Computer Engineering, Cornell University, Ithaca, NY, USA. This is joint work with Sattar Vakili at MediaTek Research, Cambridge, UK, and Qing Zhao at Cornell University. Due to an error at the time of submission, their names were not included in the authorship. The original preprint of this work with full authorship can be found at http://arxiv.org/abs/2206.00099. Correspondence to: Sudeep Salgia.

Proceedings of the 40th International Conference on Machine Learning, Honolulu, Hawaii, USA. PMLR 202, 2023. Copyright 2023 by the author(s).

**Abstract.** We consider the neural contextual bandit problem. In contrast to the existing work, which primarily focuses on ReLU neural nets, we consider a general set of smooth activation functions. Under this more general setting, (i) we derive non-asymptotic error bounds on the difference between an overparameterized neural net and its corresponding neural tangent kernel, and (ii) we propose an algorithm with a provable sublinear regret bound that is also efficient in the finite regime, as demonstrated by empirical studies. The non-asymptotic error bounds may be of broader interest as a tool to establish the relation between the smoothness of the activation functions in neural contextual bandits and the smoothness of the kernels in kernel bandits.

## 1. Introduction

The stochastic contextual bandit has been extensively studied in the literature (see Langford & Zhang, 2007; Lattimore & Szepesvári, 2020, and references therein). In this problem, at each discrete sequential step, a context is revealed by the environment. Then, a bandit agent chooses one of the $K$ available actions. Choosing each action yields a random reward, the distribution of which is determined by the context. The problem was primarily studied under a linear setting where the expected reward of each arm is a linear function of the context (Chu et al., 2011; Li et al., 2019b; Chen et al., 2021). It was later extended to kernel-based models (Valko et al., 2013), where the expected reward function associated with the context-action pairs belongs to the reproducing kernel Hilbert space (RKHS) determined by a known kernel. Recently, a variation of the problem has been proposed where the expected reward function associated with the context-action pairs is modeled using a neural net (Zhou et al., 2020; Zhang et al., 2021; Gu et al., 2021; Kassraie & Krause, 2022). The three contextual bandit settings mentioned above are increasingly more complex in the following order: linear, kernel-based, neural.

Contextual bandits find application in recommender systems, information retrieval, healthcare, and finance (see the survey on applications by Bouneffouf & Rish (2019)). The problem can also be seen as a middle step from classic stochastic bandits (Auer et al., 2002) and (non-contextual) linear bandits (Abbasi-Yadkori et al., 2011) towards a reinforcement learning (RL) problem on a Markov decision process (MDP), where the contexts resemble the states of the MDP. One of the first formulations of linear contextual bandits was referred to as associative RL with linear value functions (Auer, 2002). Nonetheless, contextual bandits differ from RL in that the contexts are determined arbitrarily (adversarially) rather than following a stochastic Markovian process.
### 1.1. Existing Results on Neural Contextual Bandits

With neural nets demonstrating great representation power (much more general than linear models), there has been increasing interest in modeling bandit and RL problems with neural nets. This setting can be implemented using typical neural net toolboxes. The neural contextual bandit has been considered in several works. Zhou et al. (2020) and Zhang et al. (2021), respectively, adapted the upper confidence bound (UCB) and Thompson sampling (TS) algorithms to the neural bandit setting; the algorithms are referred to as NeuralUCB and NeuralTS. Gu et al. (2021) considered a batch observation setting, where the reward function can be evaluated only on batches of data. Kassraie & Krause (2022) provided an analysis of a variation of the NeuralUCB algorithm with diminishing instantaneous regret.

The analysis of neural bandits has been enabled through the theory of the neural tangent (NT) kernel (Jacot et al., 2018), which approximates an overparameterized neural net with the kernel corresponding to its infinite width, together with bounds on the approximation error (e.g., see Arora et al. (2019)). Zhou et al. (2020), Zhang et al. (2021) and Gu et al. (2021) proved $\tilde{O}(\Gamma_k(T)\sqrt{T})$ bounds on cumulative regret,¹ where $T$ is the number of steps and $\Gamma_k(T)$ is a complexity term determined by the neural tangent kernel $k$ associated with the particular neural network. Specifically, $\Gamma_k(T)$ is the information gain corresponding to $k$ for datasets of size $T$. For the ReLU neural nets on $d$-dimensional domains considered in Zhou et al. (2020), Zhang et al. (2021) and Gu et al. (2021), $\Gamma_k(T) = \tilde{O}(T^{\frac{d-1}{d}})$ is the best known upper bound on the information gain (Kassraie & Krause, 2022; Vakili et al., 2021b). Inserting this bound on $\Gamma_k(T)$, the aforementioned regret bounds become trivial (superlinear) in general when $d > 1$, unless further restrictive assumptions are made on the contexts. Kassraie & Krause (2022) addressed this issue by considering the Sup variant of NeuralUCB (referred to as SupNeuralUCB). The Sup variant is an adaptation of SupLinUCB, initially introduced in Chu et al. (2011) for linear contextual bandits, to the neural setting. Kassraie & Krause (2022) proved an $\tilde{O}(\sqrt{\Gamma_k(T)\,T})$ regret bound for SupNeuralUCB in the case of ReLU neural nets, which resolved the issue of a superlinear regret bound (non-diminishing instantaneous regret).

### 1.2. Contribution

All the existing works mentioned above are limited to neural nets with ReLU activation functions. This limitation is rooted in the fact that the existing error bound between an overparameterized neural net and the corresponding NT kernel (Arora et al., 2019, Theorem 3.2) holds only for ReLU neural nets. In Theorem 1, which may be of broader interest, we extend this result by providing error bounds for neural nets with arbitrarily smooth activation functions. Using Theorem 1, together with the recently established bounds on $\Gamma_k(T)$ corresponding to arbitrarily smooth neural nets (Vakili et al., 2021b), we provide regret bounds for neural contextual bandits under a more general setting than the ReLU activation function alone. In particular, in Theorem 3, we prove an $\tilde{O}\big(T^{\frac{2d+2s-3}{2d+4s-4}}\big)$ upper bound on regret, where $s$ is the smoothness parameter of the activation functions ($s = 1$ in the case of ReLU, and $s$ grows larger as the activations become smoother).

¹The notations $O$ and $\tilde{O}$ are used for mathematical order and for order up to logarithmic factors, respectively.
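For a concrete sense of how smoothness improves the rate, the regret exponent above is easy to tabulate. The following quick numerical check (ours, purely illustrative) evaluates it for a few values of $s$ at $d = 10$:

```python
# Regret exponent from Theorem 3: R(T) = O~(T^alpha) with
# alpha = (2d + 2s - 3) / (2d + 4s - 4).
def regret_exponent(d: int, s: int) -> float:
    return (2 * d + 2 * s - 3) / (2 * d + 4 * s - 4)

d = 10
for s in [1, 2, 5, 20, 100]:
    print(f"s = {s:>3}: alpha = {regret_exponent(d, s):.3f}")
# s = 1 recovers (2d-1)/(2d) = 0.95, the ReLU rate; as s grows,
# alpha -> 1/2, i.e. the O~(sqrt(T)) rate for infinitely smooth activations.
```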
Our regret bounds recover that of Kassraie & Krause (2022) for ReLU activations, and can become as small as $\tilde{O}(\sqrt{T})$ when the activations become infinitely smooth.

Broadening the scope of neural bandits to general smooth activations is of interest in two respects. Firstly, as shown in Li et al. (2019a) and Opschoor et al. (2022), deep neural nets constructed using smoother activation functions, such as rectified power units (RePUs), provide better approximation guarantees than those with ReLUs. This is particularly true when the function to be approximated is smooth, which is generally the case in practical applications. Additionally, smoother activation functions have also been shown to be more suitable for certain applications such as implicit representations (Sitzmann et al., 2020). We explore whether such advantages of smoother activations carry over to the neural bandit setting. In particular, the regret bounds with ReLU neural nets, although sublinear, grow at a fast $\tilde{O}(T^{\frac{2d-1}{2d}})$ rate that quickly approaches the linear rate as $d$ grows. We show that using neural nets with smoother activations can significantly reduce this rate, and thereby affirmatively establish the benefits of employing smoother activations in neural nets. Secondly, our results help establish connections between varying smoothness of activations in neural bandits and varying smoothness of kernels in kernel-based bandits. Kernel-based bandits (Srinivas et al., 2010) and kernel-based contextual bandits (Valko et al., 2013) are well-studied problems with regret bounds reported for popular kernels such as the Matérn family, which is perhaps the most commonly used family of kernels in practice (Snoek et al., 2012; Shahriari et al., 2015). Our regret bound for a neural contextual bandit problem with activations of smoothness parameter $s$ is equivalent to the regret bound in the case of kernel-based bandits with a Matérn kernel of smoothness parameter $s - \frac{1}{2}$ (Valko et al., 2013; Li & Scarlett, 2022). This connection may be of general interest in terms of contrasting neural nets and kernel-based models through NT kernel theory.

In Theorem 2, we prove confidence intervals for overparameterized neural nets with smooth activation functions, which are later used in the analysis of our algorithm. The confidence intervals are a consequence of our Theorem 1 and those for kernel regression. In contrast to prior work, our confidence intervals are tighter and hold for a general set of smooth activation functions (see Sec. 3).

In addition to the above analytical results for neural bandits with general smooth activation functions, we propose a practically efficient algorithm. The Sup variants of UCB algorithms are known to perform poorly in experiments (Calandriello et al., 2019; Li & Scarlett, 2022), including SupNeuralUCB presented in Kassraie & Krause (2022), due to over-exploration. We propose NeuralGCB, a neural contextual bandit algorithm guided by both upper and lower confidence bounds, which provides a finer control of the exploration-exploitation trade-off and avoids over-exploration. In the experiments, we show that NeuralGCB outperforms all other existing algorithms, namely NeuralUCB (Zhou et al., 2020), NeuralTS (Zhang et al., 2021) and SupNeuralUCB (Kassraie & Krause, 2022).
### 1.3. Other Related Work

Linear bandits have also been considered in a non-contextual setting, where the expected rewards are linear in the actions over a fixed domain (e.g., see the seminal work of Abbasi-Yadkori et al., 2011). Kernel-based bandits (also referred to as Gaussian process bandits) have been extensively studied under a non-contextual setting (Srinivas et al., 2010; Chowdhury & Gopalan, 2017). The GP-UCB and GP-TS algorithms proposed in this setting also exhibit suboptimal regret bounds (see Vakili et al. (2021c) for details). Recently, several works have offered regret bounds of optimal order for kernel-based (non-contextual) bandits (Salgia et al., 2021; Camilleri et al., 2021; Li & Scarlett, 2022). These results, however, do not address the additional difficulties of the contextual setting.

There is a large body of work studying the performance of overparameterized neural nets (see, for example, Cao & Gu, 2019; 2020; Allen-Zhu et al., 2019), which is closely related to the NTK approximation studied in this work. While there are similarities, our proposed technical approach introduces non-trivial novel analysis over all these studies. Specifically, all of them consider neural nets with ReLU activation functions. In our work, we derive new results and lemmas that hold for a larger class of activation functions, as opposed to only the ReLU activation function. A notable exception to the ReLU restriction is the work by Du et al. (2019), where the authors also consider smooth activation functions. However, the activation functions considered in that work are smooth in the sense that the derivative of the activation function is globally Lipschitz. This is not necessarily true for the class of activation functions considered in this work, especially for $s \geq 3$. Consequently, our work extends the results in Du et al. (2019) to a new class of activation functions.

## 2. Preliminaries and Problem Formulation

In the stochastic contextual bandit problem, at each discrete sequential step $t = 1, 2, \ldots, T$, for each action $a \in [K] := \{1, 2, \ldots, K\}$, a context vector $x_{t,a} \in \mathcal{X}$ is revealed, where $\mathcal{X}$ is a compact subset of $\mathbb{R}^d$. Then, a bandit agent chooses an action $a_t \in [K]$ and receives the corresponding reward $r_t = h(x_{t,a_t}) + \xi_t$, where $h : \mathbb{R}^d \to \mathbb{R}$ is the underlying reward function and $\xi_t \in \mathbb{R}$ is a zero-mean martingale noise process. The goal is to choose actions which maximize the cumulative reward over $T$ steps. This is equivalent to minimizing the cumulative regret, defined as the total difference between the maximum possible context-dependent reward and the reward actually received:

$$R(T) = \sum_{t=1}^{T} \big( h(x_{t, a_t^*}) - h(x_{t, a_t}) \big), \tag{1}$$

where $a_t^* \in \arg\max_{a \in [K]} h(x_{t,a})$ is the context-dependent maximizer of the reward function at step $t$. The following is a formal statement of the assumptions on $\mathcal{X}$, $h$ and $\xi_t$. These are mild assumptions shared with other works on neural bandits and the NT kernel.

**Assumption 1.** (i) The input space is the $d$-dimensional hypersphere: $\mathcal{X} = \mathbb{S}^{d-1}$. (ii) The reward function $h$ belongs to the RKHS induced by the NT kernel, and $h(x) \in [0, 1]$ for all $x \in \mathcal{X}$. (iii) The $\xi_t$ are assumed to be conditionally $\nu$-sub-Gaussian, i.e., for any $\zeta \in \mathbb{R}$, $\ln\big(\mathbb{E}[e^{\zeta \xi_t} \mid \xi_1, \ldots, \xi_{t-1}]\big) \leq \zeta^2 \nu^2 / 2$.
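To make the protocol concrete, below is a minimal simulation of the interaction loop and the regret in Eq. (1). The quadratic reward and the uniformly random policy are illustrative stand-ins, not the paper's algorithm:

```python
import numpy as np

rng = np.random.default_rng(0)
d, K, T, noise_sd = 10, 4, 1000, 0.1
theta = rng.standard_normal(d)
theta /= np.linalg.norm(theta)
h = lambda x: float(np.clip((x @ theta) ** 2, 0.0, 1.0))  # toy reward in [0, 1]

regret = 0.0
for t in range(1, T + 1):
    # One context vector per action is revealed, each on the unit sphere S^{d-1}.
    X = rng.standard_normal((K, d))
    X /= np.linalg.norm(X, axis=1, keepdims=True)
    a = rng.integers(K)                        # placeholder policy: uniform choice
    r = h(X[a]) + rng.normal(0.0, noise_sd)    # noisy reward r_t = h(x_{t,a_t}) + xi_t
    regret += max(h(x) for x in X) - h(X[a])   # one summand of Eq. (1)
print(f"cumulative regret of the uniform policy: {regret:.1f}")
```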
### 2.1. The Neural Net Model and Corresponding NT Kernel

In this section, we briefly outline the neural net model and the associated NT kernel. Let $f(x; W)$ be a fully connected feedforward neural net with $L$ hidden layers of equal width $m$. We can express $f$ using the following set of recursive equations:

$$f(x; W) = \sqrt{\tfrac{c_s}{m}}\, W^{(L+1)} \sigma_s\big( f^{(L)}(x) \big), \tag{2}$$
$$f^{(1)}(x) = W^{(1)} x, \qquad f^{(l)}(x) = \sqrt{\tfrac{c_s}{m}}\, W^{(l)} \sigma_s\big( f^{(l-1)}(x) \big) \ \text{ for } 1 < l \leq L.$$

Let $W = (W^{(l)})_{1 \leq l \leq L+1}$ denote the collection of all weights. The dimensions of the weight matrices satisfy $W^{(1)} \in \mathbb{R}^{m \times d}$; $W^{(l)} \in \mathbb{R}^{m \times m}$ for $1 < l \leq L$; and $W^{(L+1)} \in \mathbb{R}^{1 \times m}$. All the weights $W^{(l)}_{i,j}$, for all $l, i, j$, are initialized i.i.d. from $N(0, 1)$, and $c_s := 2/(2s-1)!!$. With an abuse of notation, $\sigma_s : \mathbb{R}^m \to \mathbb{R}^m$ is used for the coordinate-wise application of activations of the form $\sigma_s(u) = (\max(0, u))^s$ to the outputs of each layer. Note that $\sigma_s$ is $s - 1$ times differentiable everywhere, which gives our results a general interpretability across smoothness of activations and the resulting function classes, and allows us to draw parallels between our results and well-established results for kernel-based bandits.

It has been shown that gradient-based training of the neural net described above reaches a global minimum at which the weights remain within a close vicinity of the random initialization. As a result, the model can be approximated by its linear projection on the tangent space at the random initialization $W_0$:

$$f(x; W) \approx f(x; W_0) + (W - W_0)^\top g(x; W_0),$$

where the notation $g(x; W) = \nabla_W f(x; W)$ is used for the gradient of $f$ with respect to the parameters. The approximation error can be bounded by the second-order term $O(\|W - W_0\|^2)$ and shown to be diminishing as the width $m$ grows, where the implied constant depends on the spectral norm of the Hessian matrix (Liu et al., 2020). The NT kernel corresponding to this neural net as $m$ grows to infinity is defined as the kernel in the feature space determined by the gradient at initialization:

$$k(x, x') = \lim_{m \to \infty} g^\top(x; W_0)\, g(x'; W_0).$$

The particular form of the NT kernel depends on the activation function and the number of layers. For example, for a two-layer neural net with ReLU activations, we have

$$k(x, x') = \frac{1}{\pi}\Big( 2u\,\big(\pi - \arccos(u)\big) + \sqrt{1 - u^2}\, \Big), \qquad \text{where } u = x^\top x'.$$

For the closed-form derivation of the NT kernel for other values of $s$ and $L$, see Vakili et al. (2021b).

### 2.2. Assumptions on the Neural Net and NT Kernel

The following technical assumptions are mild assumptions used in the handling of overparameterized neural nets via the NT kernel. These assumptions are common in the literature and are often fulfilled without loss of generality (Arora et al., 2019; Zhou et al., 2020; Zhang et al., 2021; Gu et al., 2021; Kassraie & Krause, 2022).

**Assumption 2.** (i) Consider $H \in \mathbb{R}^{KT \times KT}$ such that $[H]_{i,j} = k(z_i, z_j)$ for all pairs in $\mathcal{Z} \times \mathcal{Z}$, where $\mathcal{Z} = \{\{x_{t,a}\}_{a=1}^{K}\}_{t=1}^{T}$. We assume $H \succeq \lambda_0 I$, where $\lambda_0 > 0$. (ii) $f(x; W_0)$ is assumed to be $0$. (iii) The number of epochs during training, $J$, and the learning rate, $\eta$, satisfy $J = \frac{2}{c\lambda}\, T \log(T)$ and $\eta = c'/T$ for some constants $c, c' > 0$, where $\lambda > 0$ is the regularization parameter.

### 2.3. Information Gain

The regret bounds for neural bandits are typically given in terms of the maximal information gain (or the effective dimension) of the NT kernel. The maximal information gain is defined via the maximum log-determinant of the kernel matrix (Srinivas et al., 2010):

$$\Gamma_k(t) = \sup_{X_t \subset \mathcal{X}} \frac{1}{2} \log\det\Big( I + \frac{1}{\lambda}\, k_{X_t, X_t} \Big). \tag{3}$$

It is closely related to the effective dimension of the kernel, which denotes the number of features with a significant impact on the regression model, and can be finite for a finite dataset even when the feature space of the kernel is infinite-dimensional (Zhang, 2005; Valko et al., 2013). It is defined as

$$d_k(t) = \mathrm{tr}\Big( k_{X_t, X_t} \big( k_{X_t, X_t} + \lambda I \big)^{-1} \Big). \tag{4}$$
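As a quick sanity check on these two definitions, the sketch below computes both quantities for an arbitrary PSD matrix standing in for the kernel matrix; the rank-5 construction is an assumption made purely for illustration:

```python
import numpy as np

rng = np.random.default_rng(1)
t, lam = 200, 0.1
Z = rng.standard_normal((t, 5))
Kmat = Z @ Z.T                        # a rank-5 PSD stand-in for the kernel matrix
eig = np.linalg.eigvalsh(Kmat)

# Information gain, Eq. (3): (1/2) log det(I + K / lam), via eigenvalues.
info_gain = 0.5 * np.sum(np.log1p(eig / lam))
# Effective dimension, Eq. (4): tr(K (K + lam I)^{-1}).
eff_dim = np.sum(eig / (eig + lam))
print(f"Gamma_k(t) = {info_gain:.2f}, d_k(t) = {eff_dim:.2f}")
# The effective dimension saturates near the rank (5 here), and the two
# quantities agree up to logarithmic factors.
```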
It is known that the information gain and the effective dimension are the same up to logarithmic factors (Calandriello et al., 2019). We state our results in terms of the information gain; nonetheless, it can be replaced by the effective dimension.

## 3. Approximation Error and Confidence Bounds

As discussed earlier, the error between an overparameterized neural net and the associated NT kernel can be quantified and shown to be small for large $m$. To formalize this, we recall the technique of kernel ridge regression. Given a dataset $\mathcal{D}_t = \{(x_i, y_i)\}_{i=1}^{t}$, let $f_{\mathrm{NTK}}$ denote the regressor obtained by kernel ridge regression using the NT kernel. In particular, we have

$$f_{\mathrm{NTK}}(x) = k_{X_t}^\top(x)\, \big( \lambda I_t + k_{X_t, X_t} \big)^{-1} Y_t, \tag{5}$$

where $k_{X_t}(x) = [k(x_1, x), \ldots, k(x_t, x)]^\top$ is the NT kernel evaluated between $x$ and the $t$ data points, $k_{X_t, X_t} = [k(x_i, x_j)]_{i,j=1}^{t}$ is the NT kernel matrix evaluated on the data, $Y_t = [y_1, \ldots, y_t]^\top$ is the vector of output values, and $\lambda \geq 0$ is a free parameter. Note that $f_{\mathrm{NTK}}$ can be computed in closed form (without training a neural net). In addition, let $f_{\mathrm{NN}}$ be the prediction of the neural net at the end of training using $\mathcal{D}_t$. Theorem 3.2 of Arora et al. (2019) states that, when the activations are ReLU and $m$ is sufficiently large, we have, with probability at least $1 - \delta$, that $|f_{\mathrm{NN}}(x) - f_{\mathrm{NTK}}(x)| \leq \epsilon$. In this theorem, the value of $m$ should be sufficiently large in terms of $t$, $L$, $\delta$, $\epsilon$, and $\lambda_0$.

We now present a bound on the error between the neural net and the associated NT kernel for smooth activations.

**Theorem 1.** Consider the neural net $f(x; W)$ defined in (2), consisting of $L$ hidden layers, each of width $m \geq \mathrm{poly}(1/\varepsilon, s, 1/\lambda_0, |\mathcal{D}|, \log(1/\delta))$, with activations $\sigma_s$. Let $f_{\mathrm{NTK}}$ and $f_{\mathrm{NN}}$ be the kernel ridge regressor and the trained neural net using a dataset $\mathcal{D}$. Then for any $x \in \mathcal{X}$, with probability at least $1 - \delta$ over the random initialization of the neural net, we have $|f_{\mathrm{NTK}}(x) - f_{\mathrm{NN}}(x)| \leq \varepsilon$.

Theorem 1 is a generalization of Theorem 3.2 in Arora et al. (2019) to the class of smooth activation functions $\{\sigma_s, s \in \mathbb{N}\}$. To prove Theorem 1, we show the following bound on the error between the NT kernel and the kernel induced by the finite-width neural net at initialization. This result is a generalization of Theorem 3.1 in Arora et al. (2019) to smooth activation functions.

**Lemma 1.** Consider the neural net $f(x; W)$ defined in (2), consisting of $L$ hidden layers, each of width $m \geq O\big( \frac{s^L L^2 c_s^2}{\varepsilon^2} \log^2(sL/\delta\varepsilon) \big)$, with activations $\sigma_s$. Let $g(x; W) = \nabla_W f(x; W)$ and let $k$ be the associated NT kernel, as defined in Section 2.1. Fix $\varepsilon > 0$ and $\delta \in (0, 1)$. Then for any $x, x' \in \mathcal{X}$, with probability at least $1 - \delta$ over the initialization of the network, we have

$$\big| g^\top(x; W_0)\, g(x'; W_0) - k(x, x') \big| \leq (L + 1)\varepsilon.$$

The proof entails a non-trivial generalization of various lemmas used in the proof of Arora et al. (2019, Theorem 3.2) to the class of smooth activation functions. We defer the detailed proofs of the lemma and the theorem to the supplementary material.

**Confidence Intervals:** The analysis of bandit problems classically builds on confidence intervals applicable to the values of the reward function. The NT kernel allows us to use the confidence intervals for kernel ridge regression in building confidence intervals for overparameterized neural nets. In particular, given a dataset $\mathcal{D}_t$, let

$$V_t = \lambda I + \sum_{i=1}^{t} g(x_i; W_0)\, g^\top(x_i; W_0) \tag{6}$$

be the Gram matrix in the feature space determined by the gradient.
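Eq. (5) is ordinary kernel ridge regression and takes only a few lines; in the sketch below, an RBF kernel stands in for the NT kernel (which lacks an equally compact closed form for general $s$ and $L$), and the data are a hypothetical toy regression problem:

```python
import numpy as np

def krr_predict(X, y, Xq, kernel, lam=0.1):
    """Kernel ridge regression as in Eq. (5): k_X(x)^T (lam I + K)^{-1} Y."""
    K = kernel(X, X)
    alpha = np.linalg.solve(lam * np.eye(len(X)) + K, y)
    return kernel(Xq, X) @ alpha

# RBF kernel as a stand-in for the NT kernel.
rbf = lambda A, B: np.exp(-0.5 * np.linalg.norm(A[:, None] - B[None], axis=-1) ** 2)

rng = np.random.default_rng(2)
X = rng.uniform(-2, 2, (50, 1))
y = np.sin(2 * X[:, 0]) + 0.1 * rng.standard_normal(50)
Xq = np.linspace(-2, 2, 5)[:, None]
print(np.round(krr_predict(X, y, Xq, rbf), 2))  # close to sin(2x) at the queries
```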
Taking into account the linearity of the model in the feature space, we can define a (surrogate) posterior variance as follows:

$$\hat{\sigma}_t^2(x) = g^\top(x)\, V_t^{-1}\, g(x), \tag{7}$$

which can be used as a measure of uncertainty in the prediction provided by the trained neural net. Using Theorem 1 in conjunction with the confidence intervals for kernel ridge regression, we prove the following confidence intervals for a sufficiently wide neural net.

**Theorem 2.** Let $\mathcal{D}_t = \{(x_i, r_i)\}_{i=1}^{t}$ denote a dataset obtained under the observation model described in Section 2 such that the points $\{x_i\}_{i=1}^{t}$ are independent of the noise sequence $\{\xi_i\}_{i=1}^{t}$. Suppose Assumptions 1 and 2 hold. Suppose the neural net defined in (2), consisting of $L$ hidden layers, each of width $m \geq \mathrm{poly}(T, s, L, K, \lambda^{-1}, \lambda_0^{-1}, \log(1/\delta))$, is trained using this dataset. Then, with probability at least $1 - \delta$, the following relation holds for any $x \in \mathcal{X}$:

$$|h(x) - f(x; W_t)| \leq C_{s,L}\, \frac{t}{\sqrt{\lambda m}} + \beta_t\, \hat{\sigma}_t(x), \tag{8}$$

where $W_t$ denotes the parameters of the trained model, $\beta_t = S + 2\nu\sqrt{\log(1/\delta)} + C'_{s,L}\, t\, (1 - \eta\lambda)^{J/2} \sqrt{t/\lambda} + C''_{s,L}\, \sqrt{t^3/(\lambda m)}$, $S$ is the RKHS norm of $h$ (corresponding to the NT kernel), and $C_{s,L}, C'_{s,L}, C''_{s,L}$ are constants depending only on $s$ and $L$.

Both results presented in this section may be of broader interest in the neural net literature. We would like to emphasize a couple of points here. Similar to Vakili et al. (2021a) and Kassraie & Krause (2022), the independence of the query points from the noise sequence is fundamental to obtaining these tighter bounds, and hence they cannot be directly used to improve the regret bounds of adaptive algorithms like NeuralUCB. Secondly, these bounds hold for all activation functions $\{\sigma_s : s \in \mathbb{N}\}$, improving upon the existing results which hold only for the ReLU activation function. Furthermore, this bound is tighter than the one derived in Kassraie & Krause (2022), even in the case of the ReLU activation function. The confidence intervals are important building blocks in the analysis of NeuralGCB in Section 4. Please refer to the supplementary material for a detailed proof of the theorem.

## 4. Algorithm

The UCB family of algorithms achieves optimal regret in classic stochastic finite-action bandits (Auer et al., 2002). The optimal regret, however, hinges on the statistical independence of the actions. In more complex settings such as kernel-based and neural bandits, the inherent statistical dependence across adaptively chosen actions leads to skewed posterior distributions, resulting in suboptimal and potentially trivial (i.e., superlinear) regret guarantees for UCB-based algorithms (Vakili et al., 2021c). To address this issue, one approach adopted in the literature is to use samples with limited adaptivity. These algorithms are typically referred to as Sup variants of UCB algorithms, and have been developed for linear (Chu et al., 2011), kernel-based (Valko et al., 2013) and neural (Kassraie & Krause, 2022) contextual bandits. These Sup variants, however, are known to perform poorly in practice due to their tendency to over-explore suboptimal arms. We propose NeuralGCB, a neural bandit algorithm guided by both upper and lower confidence bounds to avoid over-exploration, leading to superior performance in practice while preserving the nearly optimal regret guarantee. The key idea of NeuralGCB is a finer control of the exploration-exploitation trade-off based on the predictive standard deviation of past actions. This finer control encourages more exploitation and reduces unnecessary exploration.
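Before describing the algorithm in detail, here is a minimal sketch of the confidence machinery it relies on: the surrogate variance of Eq. (7) built from the Gram matrix of Eq. (6), with random features standing in for the network gradients $g(x; W_0)$ and an arbitrary $\beta$ in place of the $\beta_t$ of Theorem 2:

```python
import numpy as np

def confidence_scores(G_hist, g_query, f_pred, beta, lam=0.1):
    """UCB/LCB from Eq. (7): sigma^2(x) = g(x)^T V^{-1} g(x),
    with V = lam I + sum_i g(x_i) g(x_i)^T over past points (Eq. (6))."""
    p = G_hist.shape[1]
    V = lam * np.eye(p) + G_hist.T @ G_hist
    sigma = np.sqrt(g_query @ np.linalg.solve(V, g_query))
    return f_pred + beta * sigma, f_pred - beta * sigma, sigma

rng = np.random.default_rng(3)
G_hist = rng.standard_normal((100, 20))   # stand-in gradient features of past actions
g_query = rng.standard_normal(20)         # g(x; W_0) for a candidate action
ucb, lcb, sigma = confidence_scores(G_hist, g_query, f_pred=0.4, beta=1.0)
print(f"UCB = {ucb:.3f}, LCB = {lcb:.3f}, sigma = {sigma:.3f}")
```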
Specifically, NeuralGCB partitions past action-reward pairs into $R = \lceil \log T \rceil$ subsets (some subsets may be empty at the beginning of the learning horizon). Let $\{\Psi^{(r)}\}_{r=1}^{R}$ denote these subsets of action-reward pairs, where the index $r$ represents the level of uncertainty about the data points in that set (with $r = 1$ representing the highest uncertainty). Upon seeing a new context at time $t$, NeuralGCB traverses down the subsets starting from $r = 1$. At each level $r$, it evaluates the largest predictive standard deviation among the current set of candidate actions, $\hat{A}_r$, and compares it with $2^{-r}$.

If this value is smaller than $2^{-r}$, NeuralGCB checks whether it can exploit based on the predictions at the $r$-th level. Specifically, if the predictive standard deviation of the action that maximizes the UCB score is $O(1/\sqrt{t})$ (indicating a high reward with reasonably high confidence), then NeuralGCB plays that action. Otherwise, it updates $\hat{A}_r$ by eliminating actions with small UCB scores and moves to the next level. This encourages exploitation of less certain actions in a controlled manner, as opposed to SupNeuralUCB, which exploits only actions with much higher certainty. On the other hand, if the value is greater than $2^{-r}$, NeuralGCB directly exploits the maximizer of the mean computed using data points in $\Psi^{(r-1)}$ until the allocated budget for exploitation at level $r$ is exhausted. It then resorts to more exploratory actions by choosing actions with large values of $\hat{\sigma}^{(r)}_{t-1}$. The exploitation budget is set to match the length of the exploration sequence at the corresponding level, ensuring an optimal balance of exploitation and exploration that preserves the optimal regret order while boosting empirical performance.

In addition to improved empirical performance, NeuralGCB takes into consideration the practical requirements for training the neural nets. Specifically, as pointed out in Gu et al. (2021), it is practically more efficient to train the neural net over batches of observations rather than sequentially at each step. Consequently, before evaluating the mean and variance at any level $r$, NeuralGCB retrains the neural net corresponding to that level only if $q_r$ samples have been added to that level since it was last trained. This index-dependent choice of batch size $q_r$ lends a natural adaptivity to the retraining frequency, reducing it as time progresses.
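The level-traversal rule described above can be rendered schematically in code; the `predict` stub below stands in for the level-$r$ network and Eq. (7), the constants are arbitrary, all training and bookkeeping is omitted, and the exploitation step is simplified to use the current level's mean. The precise procedure follows in Algorithm 1.

```python
import numpy as np

rng = np.random.default_rng(4)
K, R, beta, sigma0, eta0, alpha0, t = 4, 5, 1.0, 1.0, 1.0, 1.0, 100
ctr = np.zeros(R + 1)                          # per-level exploitation counters

def predict(r, A):
    """Stub for the level-r neural net: per-action (mean, surrogate std).
    A real implementation trains f^(r) on Psi^(r) and evaluates Eq. (7)."""
    return rng.uniform(0, 1, len(A)), rng.uniform(0, 0.5, len(A)) / r

A = list(range(K))                             # candidate actions
for r in range(1, R + 1):                      # traverse levels of uncertainty
    mu, sg = predict(r, A)
    ucb, lcb = mu + beta * sg, mu - beta * sg
    if sg.max() <= sigma0 * 2.0 ** (-r):       # all candidates certain at level r
        i = int(np.argmax(ucb))
        if sg[i] <= eta0 / np.sqrt(t):         # confident: play the UCB maximizer
            action = A[i]; break
        A = [a for a, u in zip(A, ucb) if u >= lcb.max()]  # eliminate, descend
    elif r == 1 or ctr[r] > alpha0 * 4.0 ** r: # exploitation budget exhausted
        action = A[int(np.argmax(sg))]; break  # forced exploration at this level
    else:
        ctr[r] += 1                            # budgeted exploitation
        action = A[int(np.argmax(mu))]; break
else:
    action = A[0]
print("chosen action:", action)
```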
A pseudo code of the algorithm is outlined in Algorithm 1.

**Algorithm 1** NeuralGCB

1: **Require:** Time horizon $T$, maximum initial variance $\sigma_0$, error probability $\delta$
2: **Initialize:** $R \leftarrow \lceil \log_2 T \rceil$; an ensemble of $R$ neural nets with $W^{(r)}_0 = W_0$; $\Psi^{(r)}_0 \leftarrow \emptyset$ for $r \in [R]$; $H \leftarrow \emptyset$; arrays ctr, max_mu, fb of size $R$ with all elements set to $0$; batch sizes $\{q_r\}_{r=1}^{R}$
3: **for** $t = 1, 2, 3, \ldots, T$ **do**
4: &nbsp;&nbsp;$r \leftarrow 1$, $\hat{A}_r(t) \leftarrow [K]$
5: &nbsp;&nbsp;**while** True **do**
6: &nbsp;&nbsp;&nbsp;&nbsp;Receive the context-action pairs $\{x_{t,a}\}_{a=1}^{K}$
7: &nbsp;&nbsp;&nbsp;&nbsp;$\{f(x_{t,a}; W^{(r)}_t), \hat{\sigma}^{(r)}_{t-1}(x_{t,a})\}_{a=1}^{K},\ W^{(r)}_t,\ \mathrm{fb}[r] \leftarrow$ GetPredictions$\big(H, \Psi^{(r)}_{t-1}, \{x_{t,a}\}_{a=1}^{K}, \mathrm{fb}[r], W^{(r)}_{t-1}, q_r\big)$
8: &nbsp;&nbsp;&nbsp;&nbsp;$\sigma^{(r)}_{t-1} \leftarrow \max_{a \in \hat{A}_r(t)} \hat{\sigma}^{(r)}_{t-1}(x_{t,a})$
9: &nbsp;&nbsp;&nbsp;&nbsp;max_mu$[r] \leftarrow \arg\max_{a \in \hat{A}_r(t)} f(x_{t,a}; W^{(r)}_{t-1})$
10: &nbsp;&nbsp;&nbsp;&nbsp;**if** $\sigma^{(r)}_{t-1} \leq \sigma_0 2^{-r}$ **then**
11: &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;$a_{\mathrm{UCB}} \leftarrow \arg\max_{a \in \hat{A}_r(t)} \mathrm{UCB}^{(r)}_{t-1}(x_{t,a})$
12: &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;**if** $\hat{\sigma}^{(r)}_{t-1}(x_{t, a_{\mathrm{UCB}}}) \leq \eta_0/\sqrt{t}$ **then**
13: &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;Choose $a_t \leftarrow a_{\mathrm{UCB}}$ and set $\upsilon_t \leftarrow 1$
14: &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;Receive $y_t = h(x_{t,a_t}) + \xi_t$ and update $H \leftarrow H \cup \{(x_{t,a_t}, y_t)\}$
15: &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;Set $\Psi^{(r+1)}_t \leftarrow \Psi^{(r+1)}_{t-1} \cup \{(t, \upsilon_t)\}$ and $\Psi^{(r')}_t \leftarrow \Psi^{(r')}_{t-1}$ for all $r' \in [R] \setminus \{r+1\}$
16: &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;**break**
17: &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;**else**
18: &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;$\hat{A}_{r+1}(t) \leftarrow \{a \in \hat{A}_r(t) : \mathrm{UCB}^{(r)}_{t-1}(x_{t,a}) \geq \max_{a' \in \hat{A}_r(t)} \mathrm{LCB}^{(r)}_{t-1}(x_{t,a'})\}$, $r \leftarrow r + 1$
19: &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;**end if**
20: &nbsp;&nbsp;&nbsp;&nbsp;**else**
21: &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;**if** $r = 1$ or ctr$[r] > \alpha_0 4^r$ **then**
22: &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;Choose any $a_t \in \hat{A}_r(t)$ such that $\hat{\sigma}^{(r)}_{t-1}(x_{t,a_t}) > \sigma_0 2^{-r}$ and set $\upsilon_t \leftarrow 2$
23: &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;**else**
24: &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;Choose $a_t \leftarrow$ max_mu$[r-1]$ and set $\upsilon_t \leftarrow 3$
25: &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;**end if**
26: &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;Receive $y_t = h(x_{t,a_t}) + \xi_t$ and update $H \leftarrow H \cup \{(x_{t,a_t}, y_t)\}$, ctr$[r] \leftarrow$ ctr$[r] + 1$
27: &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;Set $\Psi^{(r)}_t \leftarrow \Psi^{(r)}_{t-1} \cup \{(t, \upsilon_t)\}$ and $\Psi^{(r')}_t \leftarrow \Psi^{(r')}_{t-1}$ for all $r' \in [R] \setminus \{r\}$
28: &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;**break**
29: &nbsp;&nbsp;&nbsp;&nbsp;**end if**
30: &nbsp;&nbsp;**end while**
31: **end for**

In the pseudo code, $\mathrm{UCB}^{(r)}_t$ and $\mathrm{LCB}^{(r)}_t$ refer to the upper and lower confidence scores, respectively, at time $t$ corresponding to index $r$, and are defined as $\mathrm{UCB}^{(r)}_t(\cdot) = f(\cdot\,; W^{(r)}_t) + \beta \hat{\sigma}^{(r)}_t(\cdot)$ and $\mathrm{LCB}^{(r)}_t(\cdot) = f(\cdot\,; W^{(r)}_t) - \beta \hat{\sigma}^{(r)}_t(\cdot)$. GetPredictions is a local routine that calculates the predictive mean and variance after appropriately retraining the neural net (see the supplementary material for its pseudo code). Lastly, the arrays ctr, max_mu and fb are used to store the exploitation count, the maximizer of the neural net output, and the feedback time instants.

We now formally state the regret guarantees for NeuralGCB in the following theorem.

**Theorem 3.** Suppose Assumptions 1 and 2 hold. Consider NeuralGCB given in Algorithm 1, with $R$ neural nets as defined in (2), with $L$ hidden layers, each of width $m \geq \mathrm{poly}(T, L, K, s, \lambda^{-1}, \lambda_0^{-1}, S^{-1}, \log(1/\delta))$. Suppose NeuralGCB is run with a fixed batch size for each group; then the regret defined in (1) satisfies

$$R(T) = \tilde{O}\Big( \sqrt{T\, \Gamma_k(T)\, \log(1/\delta)} + \max_{r} q_r\, \Gamma_k(T) \Big).$$

As suggested by the above theorem, NeuralGCB preserves the regret guarantees of SupNeuralUCB, which are much tighter than those of NeuralUCB. Moreover, these guarantees hold even for smooth activation functions, as opposed to only the ReLU activation. In particular, on plugging in the upper bound on $\Gamma_k(T)$ obtained in Vakili et al. (2021b), we obtain that the regret incurred by NeuralGCB satisfies $\tilde{O}\big(T^{\frac{2d+2s-3}{2d+4s-4}}\big)$, which is a sublinear function of $T$, implying convergence to the maximizer. For $s = 1$, i.e., the ReLU activation function, it reduces to $\tilde{O}\big(T^{\frac{2d-1}{2d}}\big)$, matching the known regret bound for ReLU neural nets (Kassraie & Krause, 2022). Moreover, the regret bound for neural nets with activation functions of smoothness parameter $s$ matches the optimal order of regret for optimization of a function in an RKHS corresponding to the Matérn kernel with smoothness $s - 1/2$. This equivalence paves a path to extending known results for Matérn kernels to NTK theory. Furthermore, the above regret guarantees also show that retraining the neural nets only at the end of batches of observations increases the regret by at most an additive term in the batch size. Our results also hold with an adaptive batch size (see Appendix D).

## 5. Empirical Studies

In this section, we provide numerical experiments comparing NeuralGCB with several representative baselines, namely LinUCB (Chu et al., 2011), NeuralUCB (Zhou et al., 2020), NeuralTS (Zhang et al., 2021), SupNeuralUCB (Kassraie & Krause, 2022) and Batched NeuralUCB (Gu et al., 2021). We perform the empirical studies on three synthetic and two real-world datasets. We first compare NeuralGCB with the fully sequential algorithms (LinUCB, NeuralUCB, NeuralTS and SupNeuralUCB). We then compare the regret incurred and the time taken by NeuralGCB, NeuralUCB and Batched NeuralUCB. We perform both sets of experiments with two different activation functions. The construction of the synthetic datasets and the experimental settings are described below.
### 5.1. Datasets

For each of the synthetic datasets, we construct a contextual bandit problem with a feature dimension of $d = 10$ and $K = 4$ actions per context, running over a time horizon of $T = 2000$ rounds. The set of context vectors $\{\{x_{t,a}\}_{a=1}^{K}\}_{t=1}^{T}$ is drawn uniformly from the unit sphere. Similar to Zhou et al. (2020), we consider the following three reward functions:

$$h_1(x) = 4|a^\top x|^2; \qquad h_2(x) = 4\sin^2(a^\top x); \qquad h_3(x) = \|Ax\|_2. \tag{9}$$

For the above functions, the vector $a$ is drawn uniformly from the unit sphere and each entry of the matrix $A$ is randomly generated from $N(0, 0.25)$. We also consider two real-world classification datasets, namely Mushroom and Statlog (Shuttle), both of which are available in the UCI repository (Dua & Graff, 2017). The classification problem is converted into a contextual bandit problem using the techniques outlined in Li et al. (2010). Each datapoint $(x, y) \in \mathbb{R}^d \times \mathbb{R}$ in the dataset is transformed into $K$ vectors of the form $x^{(1)} = (x, 0, \ldots, 0), \ldots, x^{(K)} = (0, \ldots, 0, x) \in \mathbb{R}^{Kd}$ corresponding to the $K$ actions associated with context $x$. Here $K$ denotes the number of classes in the original classification problem. The reward function is set to $h(x^{(k)}) = \mathbb{1}\{y = k\}$; that is, the agent receives a reward of $1$ if it classifies the context correctly and $0$ otherwise. We present the results for $h_1(x)$, $h_2(x)$ and the Mushroom dataset in the main paper, and the additional results in the supplementary material.

### 5.2. Experimental Setting

For all the experiments, the rewards are generated by adding zero-mean Gaussian noise with a standard deviation of $0.1$ to the reward function. All the experiments are run for a time horizon of $T = 2000$. We report the regret averaged over 10 Monte Carlo runs with different random seeds. For the real datasets, we shuffle the context vectors for each Monte Carlo run. For all the algorithms, we set the parameter $\nu$ to $0.1$, and $S$, the RKHS norm of the reward, to $4$ for synthetic functions and $1$ for real datasets. The exploration parameter $\beta_t$ is set to the value prescribed by each algorithm. We consider a two-layer neural net, as described in Equation (2), for all the experiments. We carry out two sets of experiments, each with a different activation function, namely $\sigma_1$ (equivalently, ReLU) and $\sigma_2$. For the experiments with $\sigma_1$ as the activation, we set the number of hidden neurons to $m = 20$ and $m = 50$ for synthetic and real datasets, respectively. Similarly, for $\sigma_2$, $m$ is set to $30$ and $80$ for synthetic and real datasets, respectively. For all the experiments, we perform a grid search for $\lambda$ and $\eta$ over $\{0.05, 0.1, 0.5\}$ and $\{0.001, 0.01, 0.1\}$, respectively, and choose the best values for each algorithm. The number of epochs is set to $200$ for the synthetic datasets and Mushroom, and to $400$ for Statlog. For the experiments with sequential algorithms, we retrain the neural nets at every step, including for NeuralGCB. For Batched NeuralUCB, we use a fixed batch size of $10$ for the synthetic datasets and $20$ for Mushroom. For NeuralGCB, we set the batch size to $q_r = 5 \cdot 2^{r-1}$ for the synthetic datasets and $q_r = 5 \cdot 2^{r+1}$ for Mushroom. More details and additional experiments are given in the supplementary material.
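A sketch of the synthetic construction and the classification-to-bandit transformation described above (the sampling follows the stated recipe; the seed and the printed checks are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(5)
d, K, T = 10, 4, 2000
a = rng.standard_normal(d)
a /= np.linalg.norm(a)                          # a uniform on the unit sphere
A = rng.normal(0.0, 0.5, (d, d))                # entries N(0, 0.25), i.e. sd = 0.5

h1 = lambda x: 4 * np.abs(a @ x) ** 2           # reward functions of Eq. (9)
h2 = lambda x: 4 * np.sin(a @ x) ** 2
h3 = lambda x: np.linalg.norm(A @ x)

X = rng.standard_normal((T, K, d))
X /= np.linalg.norm(X, axis=-1, keepdims=True)  # contexts uniform on the sphere

def to_bandit_arm(x, k, K):
    """Classification -> bandit: place context x in the k-th of K slots.
    The reward of this arm would be 1{true label == k}."""
    z = np.zeros(K * len(x))
    z[k * len(x):(k + 1) * len(x)] = x
    return z

print(h1(X[0, 0]), to_bandit_arm(np.ones(3), 1, 4))
```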
[Figure 1 appears here: nine panels of cumulative-regret and runtime curves. Panels: (a) $h_1$ with $\sigma_1$, (b) $h_1$ with $\sigma_2$, (c) batched algorithms on $h_1$; (d)-(f) likewise for $h_2$; (g)-(i) for Mushroom.]

**Figure 1:** First, second and third rows correspond to the reward functions $h_1(x)$, $h_2(x)$ and the Mushroom dataset, respectively. The two leftmost columns show the cumulative regret incurred by the algorithms (LinUCB, NeuralUCB, NeuralTS, SupNNUCB, NeuralGCB) against the number of steps, with $\sigma_1$ activation functions for the first column and $\sigma_2$ for the second. The rightmost column compares the regret incurred and the time taken (in seconds) for the batched and sequential versions of NeuralUCB and NeuralGCB.

### 5.3. Results

The two leftmost columns in Fig. 1 show that NeuralGCB outperforms the other algorithms on both the synthetic and the real-world datasets, corroborating the theoretical claims. This holds for both sets of experiments with different activation functions, further bolstering the practical efficiency of the proposed algorithm. In addition, the regret incurred by the algorithms in the experiments with $\sigma_2$ as the activation is less than that in the experiments with $\sigma_1$, demonstrating the effect of smoother kernels on the performance of the algorithms in practice. In the third column, we compare the regret and running time of the batched and sequential versions of NeuralUCB and NeuralGCB. We plot the regret incurred against the time taken for different training schedules over 10 different runs. For all functions, the regret incurred by the batched version is comparable to that of the sequential version while requiring significantly less running time. Furthermore, NeuralGCB has a smaller regret compared to Batched NeuralUCB for comparable running times.

## 6. Conclusion

In this work, we considered the problem of neural contextual bandits with a general set of smooth activation functions. We established non-asymptotic error bounds on the difference between an overparameterized neural net and its corresponding NT kernel, along with confidence bounds for prediction using neural nets under this general setting. Furthermore, we proposed a new algorithm that incurs sublinear regret under this general setting and is also efficient in practice, as demonstrated by extensive empirical studies.

## 7. Acknowledgements

This work was supported by the Government of Israel - Israeli Ministry of Defense, Mission to the USA.

## References

Abbasi-Yadkori, Y., Pál, D., and Szepesvári, C. Improved algorithms for linear stochastic bandits. Advances in Neural Information Processing Systems, 24, 2011.

Allen-Zhu, Z., Li, Y., and Song, Z. A convergence theory for deep learning via over-parameterization.
In International Conference on Machine Learning, pp. 242–252. PMLR, 2019.

Arora, S., Du, S. S., Hu, W., Li, Z., Salakhutdinov, R. R., and Wang, R. On exact computation with an infinitely wide neural net. Advances in Neural Information Processing Systems, 32, 2019.

Auer, P. Using confidence bounds for exploitation-exploration trade-offs. Journal of Machine Learning Research, 3(Nov):397–422, 2002.

Auer, P., Cesa-Bianchi, N., and Fischer, P. Finite-time analysis of the multiarmed bandit problem. Machine Learning, 47(2):235–256, 2002.

Boucheron, S., Lugosi, G., and Massart, P. Concentration Inequalities: A Nonasymptotic Theory of Independence. Oxford University Press, 2013. ISBN 978-0-19-953525-5. doi: 10.1093/acprof:oso/9780199535255.001.0001.

Bouneffouf, D. and Rish, I. A survey on practical applications of multi-armed and contextual bandits. arXiv preprint arXiv:1904.10040, 2019.

Calandriello, D., Carratino, L., Lazaric, A., Valko, M., and Rosasco, L. Gaussian process optimization with adaptive sketching: Scalable and no regret. In Conference on Learning Theory, pp. 533–557. PMLR, 2019.

Camilleri, R., Jamieson, K., and Katz-Samuels, J. High-dimensional experimental design and kernel bandits. In International Conference on Machine Learning, pp. 1227–1237. PMLR, 2021.

Cao, Y. and Gu, Q. Generalization bounds of stochastic gradient descent for wide and deep neural networks. Advances in Neural Information Processing Systems, 32, 2019.

Cao, Y. and Gu, Q. Generalization error bounds of gradient descent for learning over-parameterized deep ReLU networks. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 34, pp. 3349–3356, 2020.

Chen, C., Luo, L., Zhang, W., Yu, Y., and Lian, Y. Efficient and robust high-dimensional linear contextual bandits. In Proceedings of the Twenty-Ninth International Joint Conference on Artificial Intelligence, pp. 4259–4265, 2021.

Chowdhury, S. R. and Gopalan, A. On kernelized multi-armed bandits. In International Conference on Machine Learning, pp. 844–853, 2017.

Chu, W., Li, L., Reyzin, L., and Schapire, R. Contextual bandits with linear payoff functions. In Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, pp. 208–214. JMLR Workshop and Conference Proceedings, 2011.

Cramér, H. Sur un nouveau théorème-limite de la théorie des probabilités. Actualités Scientifiques et Industrielles, 736:5–23, 1938.

Daniely, A., Frostig, R., and Singer, Y. Toward deeper understanding of neural networks: The power of initialization and a dual view on expressivity. In Advances in Neural Information Processing Systems, pp. 2261–2269, 2016.

Du, S., Lee, J., Li, H., Wang, L., and Zhai, X. Gradient descent finds global minima of deep neural networks. In International Conference on Machine Learning, pp. 1675–1685. PMLR, 2019.

Dua, D. and Graff, C. UCI machine learning repository, 2017. URL http://archive.ics.uci.edu/ml.

Gantert, N., Ramanan, K., and Rembart, F. Large deviations for weighted sums of stretched exponential random variables. Electronic Communications in Probability, 19, 2014. ISSN 1083-589X. doi: 10.1214/ECP.v19-3266.

Gu, Q., Karbasi, A., Khosravi, K., Mirrokni, V., and Zhou, D. Batched neural bandits. arXiv preprint arXiv:2102.13028, 2021.
Jacot, A., Gabriel, F., and Hongler, C. Neural tangent kernel: Convergence and generalization in neural networks. arXiv preprint arXiv:1806.07572, 2018.

Kassraie, P. and Krause, A. Neural contextual bandits without regret. In AISTATS, 2022.

Langford, J. and Zhang, T. The epoch-greedy algorithm for multi-armed bandits with side information. Advances in Neural Information Processing Systems, 20, 2007.

Lattimore, T. and Szepesvári, C. Bandit Algorithms. Cambridge University Press, 2020.

Li, B., Tang, S., and Yu, H. Better approximations of high dimensional smooth functions by deep neural networks with rectified power units. arXiv preprint arXiv:1903.05858, 2019a.

Li, L., Chu, W., Langford, J., and Schapire, R. E. A contextual-bandit approach to personalized news article recommendation. In Proceedings of the 19th International Conference on World Wide Web, WWW '10, pp. 661–670, 2010. ISBN 9781605587998. doi: 10.1145/1772690.1772758. URL http://arxiv.org/abs/1003.0146.

Li, Y., Wang, Y., and Zhou, Y. Nearly minimax-optimal regret for linearly parameterized bandits. In Conference on Learning Theory, pp. 2173–2174. PMLR, 2019b.

Li, Z. and Scarlett, J. Gaussian process bandit optimization with few batches. In AISTATS, 2022.

Liu, C., Zhu, L., and Belkin, M. On the linearity of large non-linear models: when and why the tangent kernel is constant. Advances in Neural Information Processing Systems, 33:15954–15964, 2020.

Opschoor, J., Schwab, C., and Zech, J. Exponential ReLU DNN expression of holomorphic maps in high dimension. Constructive Approximation, 55:1–46, 2022. doi: 10.1007/s00365-021-09542-5.

Riquelme, C., Tucker, G., and Snoek, J. Deep Bayesian bandits showdown: An empirical comparison of Bayesian deep networks for Thompson sampling. In 6th International Conference on Learning Representations, ICLR 2018 - Conference Track Proceedings, 2018. URL https://sites.google.com/site/deepbayesianbandits/.

Salgia, S., Vakili, S., and Zhao, Q. A domain-shrinking based Bayesian optimization algorithm with order-optimal regret performance. Conference on Neural Information Processing Systems, 34, 2021.

Shahriari, B., Swersky, K., Wang, Z., Adams, R. P., and De Freitas, N. Taking the human out of the loop: A review of Bayesian optimization. Proceedings of the IEEE, 104(1):148–175, 2015.

Sitzmann, V., Martel, J., Bergman, A., Lindell, D., and Wetzstein, G. Implicit neural representations with periodic activation functions. Advances in Neural Information Processing Systems, 33:7462–7473, 2020.

Snoek, J., Larochelle, H., and Adams, R. P. Practical Bayesian optimization of machine learning algorithms. Advances in Neural Information Processing Systems, 25, 2012.

Srinivas, N., Krause, A., Kakade, S., and Seeger, M. Gaussian process optimization in the bandit setting: no regret and experimental design. In International Conference on Machine Learning, pp. 1015–1022. Omnipress, 2010.

Vakili, S., Bouziani, N., Jalali, S., Bernacchia, A., and Shiu, D.-s. Optimal order simple regret for Gaussian process bandits. In Conference on Neural Information Processing Systems, 2021a.

Vakili, S., Bromberg, M., Garcia, J., Shiu, D.-s., and Bernacchia, A. Uniform generalization bounds for overparameterized neural networks. arXiv preprint arXiv:2109.06099, 2021b.

Vakili, S., Scarlett, J., and Javidi, T. Open problem: Tight online confidence intervals for RKHS elements. In Conference on Learning Theory, pp. 4647–4652.
PMLR, 2021c.

Valko, M., Korda, N., Munos, R., Flaounas, I., and Cristianini, N. Finite-time analysis of kernelised contextual bandits. arXiv preprint arXiv:1309.6869, 2013.

Wang, T., Zhou, D., and Gu, Q. Provably efficient reinforcement learning with linear function approximation under adaptivity constraints, 2021. URL http://arxiv.org/abs/2101.02195.

Zhang, T. Learning bounds for kernel regression using effective data dimensionality. Neural Computation, 17(9):2077–2098, 2005.

Zhang, W., Zhou, D., Li, L., and Gu, Q. Neural Thompson sampling. In International Conference on Learning Representations, 2021.

Zhou, D., Li, L., and Gu, Q. Neural contextual bandits with UCB-based exploration. In International Conference on Machine Learning, pp. 11492–11502. PMLR, 2020.

We present the proofs of the theorems and lemmas stated in the main paper, as well as additional empirical results, in the appendix. The appendix is structured as follows. In Appendix A, we provide the proof of Theorem 1, deferring the proofs of auxiliary lemmas to Appendix B. The proof of Theorem 2 is given in Appendix C. Further details on NeuralGCB and the proof of Theorem 3 are provided in Appendix D. Lastly, the details of the empirical studies and further results are reported in Appendix E.

## A. Proof of Theorem 1

Before we provide the proof of Theorem 1, we first set up some preliminaries and notation that will be useful throughout the proof.

### A.1. Proof Preliminaries

#### A.1.1. Notation

For any $n \in \mathbb{N}$, $[n]$ denotes the set $\{1, 2, \ldots, n\}$. For any vector $v \in \mathbb{R}^n$, $\mathrm{diag}(v)$ is the $\mathbb{R}^{n \times n}$ diagonal matrix with the elements of $v$ on its diagonal. $\|v\|$ denotes the $L^2$ norm of the vector $v$. $\|M\|_2$ and $\|M\|_F$ denote the spectral and Frobenius norms, respectively, of a matrix $M$. For events $A$ and $B$, we define the implication event $A \Rightarrow B := \bar{A} \cup B$. For a matrix $A$, we denote the projection matrix onto the column space of $A$ by $\Pi_A := A A^{\dagger}$, where $A^{\dagger}$ denotes the pseudo-inverse of $A$. Similarly, we denote the orthogonal projection matrix by $\Pi_A^{\perp} := I - A A^{\dagger}$. For two random variables $X$ and $Y$, $X \overset{d}{=}_{A} Y$ means that $X$ is equal to $Y$ in distribution conditioned on the $\sigma$-algebra generated by the event $A$. For any $\rho \in [-1, 1]$, we use $\Sigma_\rho$ to denote the matrix $\begin{pmatrix} 1 & \rho \\ \rho & 1 \end{pmatrix}$. Lastly, for $n \in \mathbb{N}$, we define $(2n-1)!! = \prod_{i=1}^{n} (2i - 1)$; for example, $3!! = 3$ and $5!! = 15$.

#### A.1.2. Fully Connected Neural Network

In this proof, we consider a general fully connected neural net consisting of $L$ hidden layers, defined recursively as follows:

$$f^{(l)}(x) = W^{(l)} h^{(l-1)}(x) \in \mathbb{R}^{d_l}, \qquad h^{(l)}(x) = \sqrt{\frac{c_\sigma}{d_l}}\, \sigma\big( f^{(l)}(x) \big) \in \mathbb{R}^{d_l}, \qquad l = 1, 2, \ldots, L, \tag{10}$$

where $h^{(0)}(x) = x$ and $x \in \mathcal{X}$ is the input to the network. In the above expression, $W^{(l)} \in \mathbb{R}^{d_l \times d_{l-1}}$ is the weight matrix of the $l$-th layer for $l \in [L]$, with $d_0 = d$; $\sigma(\cdot) : \mathbb{R} \to \mathbb{R}$ is a coordinate-wise activation function; and the constant $c_\sigma := \big( \mathbb{E}_{z \sim N(0,1)}[\sigma(z)^2] \big)^{-1}$. The output of the neural network is given by its last layer, defined as

$$f(x; W) = f^{(L+1)}(x) = W^{(L+1)} h^{(L)}(x), \tag{11}$$

where $W^{(L+1)} \in \mathbb{R}^{1 \times d_L}$ is the weight matrix of the final layer, and $W = (W^{(1)}, W^{(2)}, \ldots, W^{(L+1)})$ represents all the parameters of the network. Recall that the domain is assumed to be the hypersphere $\mathbb{S}^{d-1}$; consequently, $\|x\| = 1$ for all $x \in \mathcal{X}$. This is simply a generalization of the neural net defined in Eq. (2), with possibly different widths in each layer.

The partial derivative of the output of the neural network with respect to a particular weight matrix is given by

$$\frac{\partial f(x; W)}{\partial W^{(l)}} = b^{(l)}(x)\, \big( h^{(l-1)}(x) \big)^{\!\top}, \qquad l = 1, 2, \ldots, L+1, \tag{12}$$

where $b^{(l)}(x)$ is defined recursively as

$$b^{(l)}(x) = \begin{cases} 1 & l = L+1, \\ \sqrt{\dfrac{c_\sigma}{d_l}}\, D^{(l)}(x)\, \big( W^{(l+1)} \big)^{\!\top} b^{(l+1)}(x) & l = 1, 2, \ldots, L. \end{cases} \tag{13}$$

In the above definition,

$$D^{(l)}(x) := \mathrm{diag}\big( \sigma'\big( f^{(l)}(x) \big) \big) \in \mathbb{R}^{d_l \times d_l} \tag{14}$$

is a diagonal matrix, where $\sigma'(\cdot)$ is the derivative of the activation function $\sigma$, also applied coordinate-wise. Note that $b^{(l)}(x) \in \mathbb{R}^{d_l}$ for $l = 1, 2, \ldots, L$ and $b^{(L+1)}(x) \in \mathbb{R}$. In other words, $b^{(l)}(x)$ is the gradient of the output $f(x; W)$ of the neural network with respect to $f^{(l)}$, the pre-activation of layer $l$.
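The recursions (10)–(13) translate directly into code. The numpy sketch below (widths, depth and $s$ are arbitrary choices) assembles the gradient of Eq. (12) from the forward and backward passes and checks one entry against finite differences:

```python
import numpy as np

def sigma(u, s):  return np.maximum(0.0, u) ** s
def dsigma(u, s): return s * np.maximum(0.0, u) ** (s - 1) * (u > 0)

def forward_backward(x, Ws, s, c_s):
    """h^(l), f^(l) per Eqs. (10)-(11) and b^(l) per Eq. (13)."""
    L = len(Ws) - 1
    h, fs = [x], []
    for l in range(L):                          # hidden layers, Eq. (10)
        f = Ws[l] @ h[-1]; fs.append(f)
        h.append(np.sqrt(c_s / len(f)) * sigma(f, s))
    out = float(Ws[L] @ h[-1])                  # output layer, Eq. (11)
    b = [np.ones(1)]                            # b^(L+1) = 1
    for l in range(L - 1, -1, -1):              # Eq. (13), from the top down
        b.insert(0, np.sqrt(c_s / len(fs[l])) * dsigma(fs[l], s)
                 * (Ws[l + 1].T @ b[0]))
    grads = [np.outer(b[l], h[l]) for l in range(L + 1)]  # Eq. (12)
    return out, grads

rng = np.random.default_rng(6)
s, L, d, m = 2, 2, 5, 64
c_s = 2.0 / np.prod(np.arange(1, 2 * s, 2))     # c_s = 2 / (2s-1)!!
Ws = [rng.standard_normal((m, d))] + [rng.standard_normal((m, m)) for _ in range(L - 1)]
Ws.append(rng.standard_normal((1, m)))
x = rng.standard_normal(d); x /= np.linalg.norm(x)
out, grads = forward_backward(x, Ws, s, c_s)

# Quick check of one gradient entry against finite differences.
eps, W0 = 1e-6, Ws[0].copy()
Ws[0][0, 0] += eps
out_eps, _ = forward_backward(x, Ws, s, c_s)
Ws[0] = W0
print(grads[0][0, 0], (out_eps - out) / eps)    # the two values should match closely
```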
### A.2. Neural Tangent Kernel

In the infinite-width limit, the pre-activation functions $f^{(l)}$ at every hidden layer $l \in [L]$ have all their coordinates tending to i.i.d. centered Gaussian processes with covariance function $\Sigma^{(l-1)} : \mathbb{R}^d \times \mathbb{R}^d \to \mathbb{R}$, defined recursively for $l \in [L]$ as:

$$\Sigma^{(0)}(x, x') = x^\top x', \qquad \Lambda^{(l)}(x, x') = \begin{pmatrix} \Sigma^{(l-1)}(x, x) & \Sigma^{(l-1)}(x, x') \\ \Sigma^{(l-1)}(x', x) & \Sigma^{(l-1)}(x', x') \end{pmatrix},$$
$$\Sigma^{(l)}(x, x') = c_\sigma\, \mathbb{E}_{(u, v) \sim N(0, \Lambda^{(l)}(x, x'))}\big[ \sigma(u)\sigma(v) \big]. \tag{15}$$

Similar to $\Sigma^{(l)}(x, x')$, we also define $\dot{\Sigma}^{(l)}(x, x')$ as follows:

$$\dot{\Sigma}^{(l)}(x, x') = c_\sigma\, \mathbb{E}_{(u, v) \sim N(0, \Lambda^{(l)}(x, x'))}\big[ \sigma'(u)\sigma'(v) \big] \tag{16}$$

for $l \in [L]$, and $\dot{\Sigma}^{(L+1)}(x, x') = 1$ for $x, x' \in \mathcal{X}$. The final NTK expression for the fully connected network is given by

$$\Theta^{(L)}(x, x') = \sum_{l=1}^{L+1} \Sigma^{(l-1)}(x, x') \prod_{j=l}^{L+1} \dot{\Sigma}^{(j)}(x, x'). \tag{17}$$

Note that this is the same as $k(x, x')$ referred to in the main text.

### A.3. The Activation Function

In this work, we assume that the activation function $\sigma \in \mathcal{A}_\sigma$, where $\mathcal{A}_\sigma = \{\sigma_s(x) : s \in \mathbb{N}\}$ and $\sigma_s(x) = (\max(0, x))^s$. Note that $\sigma_1(x)$ corresponds to the popular ReLU activation function, and existing results hold only for $\sigma_1(x)$. We state some definitions which will later be used to establish certain properties of the activation functions in $\mathcal{A}_\sigma$.

**Fact 1** (Vakili et al. (2021b)). The normalizing constant $c_{\sigma_s} = \frac{2}{(2s-1)!!}$ for all $\sigma_s \in \mathcal{A}_\sigma$. For simplicity of notation, we use $c_s$ instead of $c_{\sigma_s}$ for the rest of the proof.

**Definition 1.** A function $f : \mathbb{R} \to \mathbb{R}$ is said to be $\alpha$-homogeneous if $f(\lambda x) = \lambda^\alpha f(x)$ for all $x \in \mathbb{R}$ and $\lambda > 0$.

It is straightforward to note that $\sigma_s(x)$ is $s$-homogeneous. Let $\mathcal{M}_+(d)$ denote the set of positive semi-definite matrices of dimension $d$, that is, $\mathcal{M}_+(d) = \{M \in \mathbb{R}^{d \times d} : x^\top M x \geq 0\ \ \forall x \in \mathbb{R}^d\}$. Similarly, we use $\mathcal{M}_{++}(d)$ to denote the class of positive definite matrices of dimension $d$. For $\gamma \in [0, 1]$, we denote

$$\mathcal{M}_+^{\gamma} = \left\{ \begin{pmatrix} \Sigma_{11} & \Sigma_{12} \\ \Sigma_{12} & \Sigma_{22} \end{pmatrix} \in \mathcal{M}_+(2) \;:\; 1 - \gamma \leq \Sigma_{11}, \Sigma_{22} \leq 1 + \gamma \right\}. \tag{18}$$

**Definition 2.** The dual of an activation function $\sigma(\cdot)$ is the function $\hat{\sigma} : [-1, 1] \to \mathbb{R}$ defined as

$$\hat{\sigma}(\rho) = c_\sigma\, \mathbb{E}_{(X, Y) \sim N(0, \Sigma_\rho)}\big[ \sigma(X)\sigma(Y) \big]. \tag{19}$$

Note that, by the definition of $c_\sigma$, $\hat{\sigma}(\rho) \in [-1, 1]$ and $\hat{\sigma}(1) = 1$.

**Fact 2** (Daniely et al. (2016), Lemma 11). The dual $\hat{\sigma}$ is continuous in $[-1, 1]$, smooth in $(-1, 1)$, convex in $[0, 1]$, and non-decreasing.

The dual of an activation function can be extended to $\hat{\sigma} : \mathcal{M}_+(2) \to \mathbb{R}$ as $\hat{\sigma}(\Sigma) = c_\sigma\, \mathbb{E}_{(X, Y) \sim N(0, \Sigma)}[\sigma(X)\sigma(Y)]$. Note that if $\Sigma = \begin{pmatrix} \Sigma_{11} & \Sigma_{12} \\ \Sigma_{12} & \Sigma_{22} \end{pmatrix}$ and $\sigma$ is $k$-homogeneous, we have

$$\hat{\sigma}(\Sigma) = (\Sigma_{11}\Sigma_{22})^{k/2}\, \hat{\sigma}\!\left( \frac{\Sigma_{12}}{\sqrt{\Sigma_{11}\Sigma_{22}}} \right).$$

Let $\hat{\sigma}_s$ be the dual function of $\sigma_s \in \mathcal{A}_\sigma$ for all $s \in \mathbb{N}$, and let $\hat{\mathcal{A}}_\sigma = \{\hat{\sigma}_s : s \in \mathbb{N}\}$ denote the set of dual functions. It is not difficult to note that $\hat{\sigma}_s(-1) = 0$ and $\hat{\sigma}_s(1) = 1$ for all $s \in \mathbb{N}$. Since $\hat{\sigma}_s$ is non-decreasing, $\hat{\sigma}_s(\rho) \in [0, 1]$ for all $\rho \in [-1, 1]$.

**Fact 3** (Vakili et al. (2021b), Lemma 1). The functions in $\hat{\mathcal{A}}_\sigma$ satisfy

$$\hat{\sigma}_s'(\rho) = \frac{s^2}{2s - 1}\, \hat{\sigma}_{s-1}(\rho) \tag{20}$$

for $s > 1$, where $\hat{\sigma}_s'$ denotes the derivative of $\hat{\sigma}_s$. Consequently, $|\hat{\sigma}_s'(\rho)| \leq \frac{s^2}{2s-1}\, |\hat{\sigma}_{s-1}(\rho)| \leq \frac{s^2}{2s-1}$. Thus, $\hat{\sigma}_s$ is $s^2/(2s-1)$-Lipschitz.

**Definition 3.** For any function $\sigma_s \in \mathcal{A}_\sigma$, we define $\mu_{s,\rho} := \mathbb{E}_{(X, Y) \sim N(0, \Sigma_\rho)}[\sigma_s(X)\sigma_s(Y)] = \hat{\sigma}_s(\rho)/c_s$.
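Since each $\Sigma^{(l)}$ and $\dot{\Sigma}^{(l)}$ is a two-dimensional Gaussian expectation, the recursion (15)–(17) can be evaluated by simple Monte Carlo. In the rough sketch below the sample count is arbitrary, and for $s = 1$, $L = 1$ the output can be compared against the closed form quoted in Section 2.1:

```python
import numpy as np

def ntk(u, s, L, n=200_000, seed=0):
    """Theta^(L)(x, x') for unit-norm x, x' with u = x.x', via Eqs. (15)-(17)."""
    rng = np.random.default_rng(seed)
    c_s = 2.0 / np.prod(np.arange(1, 2 * s, 2))      # c_s = 2 / (2s-1)!!
    sig = lambda v: np.maximum(0.0, v) ** s
    dsig = lambda v: s * np.maximum(0.0, v) ** (s - 1) * (v > 0)
    Sig, Sigs, dSigs = u, [u], []
    for _ in range(L):
        cov = np.array([[1.0, Sig], [Sig, 1.0]])     # Lambda^(l): diagonals stay 1
        z = rng.multivariate_normal([0.0, 0.0], cov, n)
        dSigs.append(c_s * np.mean(dsig(z[:, 0]) * dsig(z[:, 1])))   # Eq. (16)
        Sig = c_s * np.mean(sig(z[:, 0]) * sig(z[:, 1]))             # Eq. (15)
        Sigs.append(Sig)
    dSigs.append(1.0)                                # Sigma-dot^(L+1) = 1
    # Eq. (17): Theta = sum_l Sigma^(l-1) prod_{j >= l} Sigma-dot^(j)
    return sum(Sigs[l] * np.prod(dSigs[l:]) for l in range(L + 1))

# ~0.662, matching (2u(pi - arccos u) + sqrt(1 - u^2))/pi from Section 2.1.
print(round(ntk(u=0.3, s=1, L=1), 3))
```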
### A.4. Proof of Lemma 1

The central piece in the proof of Theorem 1 is Lemma 1, so we focus our attention on proving Lemma 1 first. Since the proof is involved, we begin with an outline to give the reader an overview of the approach before delving into the technical details. Informally, Lemma 1 states that, for sufficiently large network widths, the following relation holds with probability at least $1 - \delta$ over the random initialization of the network weights:

$$\left| \big\langle \nabla_W f(x; W), \nabla_W f(x'; W) \big\rangle - \Theta^{(L)}(x, x') \right| \leq (L + 1)\varepsilon.$$

Firstly, note that

$$\big\langle \nabla_W f(x; W), \nabla_W f(x'; W) \big\rangle = \sum_{l=1}^{L+1} \big\langle \nabla_{W^{(l)}} f(x; W), \nabla_{W^{(l)}} f(x'; W) \big\rangle,$$

and recall that

$$\Theta^{(L)}(x, x') = \sum_{l=1}^{L+1} \Sigma^{(l-1)}(x, x') \prod_{j=l}^{L+1} \dot{\Sigma}^{(j)}(x, x').$$

Using these relations, note that it is sufficient to show that

$$\left| \big\langle \nabla_{W^{(l)}} f(x; W), \nabla_{W^{(l)}} f(x'; W) \big\rangle - \Sigma^{(l-1)}(x, x') \prod_{j=l}^{L+1} \dot{\Sigma}^{(j)}(x, x') \right| \leq \varepsilon \tag{21}$$

holds for all $l \in [L+1]$ with probability $1 - \delta$. Furthermore, we have

$$\big\langle \nabla_{W^{(l)}} f(x; W), \nabla_{W^{(l)}} f(x'; W) \big\rangle = \Big\langle b^{(l)}(x) \big( h^{(l-1)}(x) \big)^{\!\top},\ b^{(l)}(x') \big( h^{(l-1)}(x') \big)^{\!\top} \Big\rangle = \big\langle h^{(l-1)}(x), h^{(l-1)}(x') \big\rangle\, \big\langle b^{(l)}(x), b^{(l)}(x') \big\rangle.$$

The proof revolves around establishing that $\langle h^{(l-1)}(x), h^{(l-1)}(x') \rangle$ is close to $\Sigma^{(l-1)}(x, x')$ while $\langle b^{(l)}(x), b^{(l)}(x') \rangle$ is close to $\prod_{j=l}^{L+1} \dot{\Sigma}^{(j)}(x, x')$. On combining these two relations, we obtain the result in (21) and consequently prove the theorem.

Throughout the proof, we fix some $s \in \mathbb{N}$, and hence $\sigma = \sigma_s$. Recall from equation (15) that we have

$$\Sigma^{(0)}(x, x') = x^\top x', \qquad \Lambda^{(l)}(x, x') = \begin{pmatrix} \Sigma^{(l-1)}(x, x) & \Sigma^{(l-1)}(x, x') \\ \Sigma^{(l-1)}(x', x) & \Sigma^{(l-1)}(x', x') \end{pmatrix}, \qquad \Sigma^{(l)}(x, x') = c_s\, \mathbb{E}_{(u, v) \sim N(0, \Lambda^{(l)}(x, x'))}\big[ \sigma_s(u)\sigma_s(v) \big].$$

Since $\|x\| = 1$ for all $x \in \mathcal{X}$, $\Sigma^{(0)}(x, x) = 1$ for all $x \in \mathcal{X}$. Using induction, we can establish that $\Sigma^{(l)}(x, x) = 1$ for all $x \in \mathcal{X}$ and all $l \in \{0, 1, 2, \ldots, L\}$. The base case follows immediately, as $\Sigma^{(0)}(x, x) = 1$ for all $x \in \mathcal{X}$. Assume the claim holds for $l - 1$. Consequently, we have $\Lambda^{(l)}(x, x) = \begin{pmatrix} 1 & 1 \\ 1 & 1 \end{pmatrix}$. On plugging this value into the definition of $\Sigma^{(l)}(x, x)$, we obtain $\Sigma^{(l)}(x, x) = 1$, completing the inductive step. As a result, $|\Sigma^{(l)}(x, x')| \leq 1$ for all $x, x' \in \mathcal{X}$, and hence we can write $\Sigma^{(l)}(x, x') = \hat{\sigma}_s\big( \Sigma^{(l-1)}(x, x') \big)$.

As the final step before the proof, we define a sequence of events which will be used throughout the proof. Let $\Delta^{(l)}(x, x') := D^{(l)}(x)\, D^{(l)}(x')$. We define the following events:

$$\mathcal{A}_l(x, x', \varepsilon_1) = \Big\{ \big| \langle h^{(l)}(x), h^{(l)}(x') \rangle - \Sigma^{(l)}(x, x') \big| \leq \varepsilon_1 \Big\},$$
$$\bar{\mathcal{A}}_l(x, x', \varepsilon_1) = \mathcal{A}_l(x, x', \varepsilon_1) \cap \mathcal{A}_l(x, x, \varepsilon_1) \cap \mathcal{A}_l(x', x', \varepsilon_1), \qquad \bar{\mathcal{A}}(x, x', \varepsilon_1) = \bigcap_{l=0}^{L} \bar{\mathcal{A}}_l(x, x', \varepsilon_1),$$
$$\mathcal{B}_l(x, x', \varepsilon_2) = \left\{ \left| \big\langle b^{(l)}(x), b^{(l)}(x') \big\rangle - \prod_{j=l}^{L+1} \dot{\Sigma}^{(j)}(x, x') \right| \leq \varepsilon_2 \right\},$$
$$\bar{\mathcal{B}}_l(x, x', \varepsilon_2) = \mathcal{B}_l(x, x', \varepsilon_2) \cap \mathcal{B}_l(x, x, \varepsilon_2) \cap \mathcal{B}_l(x', x', \varepsilon_2), \qquad \bar{\mathcal{B}}(x, x', \varepsilon_2) = \bigcap_{l=1}^{L+1} \bar{\mathcal{B}}_l(x, x', \varepsilon_2),$$
$$\mathcal{C}(x, x', \varepsilon_3) = \big\{ |f(x; W)| \leq \varepsilon_3,\ |f(x'; W)| \leq \varepsilon_3 \big\},$$
$$\mathcal{D}_l(x, x', \varepsilon_4) = \left\{ \left| \frac{c_s\, \mathrm{tr}\big( \Delta^{(l)}(x, x') \big)}{d_l} - \dot{\Sigma}^{(l)}(x, x') \right| \leq \varepsilon_4 \right\},$$
$$\bar{\mathcal{D}}_l(x, x', \varepsilon_4) = \mathcal{D}_l(x, x', \varepsilon_4) \cap \mathcal{D}_l(x, x, \varepsilon_4) \cap \mathcal{D}_l(x', x', \varepsilon_4), \qquad \bar{\mathcal{D}}(x, x', \varepsilon_4) = \bigcap_{l=1}^{L} \bar{\mathcal{D}}_l(x, x', \varepsilon_4),$$
$$\mathcal{E}_l(x, x', \varepsilon_5) = \Big\{ \big\| \Delta^{(l)}(x, x') \big\|_2 < \varepsilon_5 \Big\},$$
$$\bar{\mathcal{E}}_l(x, x', \varepsilon_5) = \mathcal{E}_l(x, x', \varepsilon_5) \cap \mathcal{E}_l(x, x, \varepsilon_5) \cap \mathcal{E}_l(x', x', \varepsilon_5), \qquad \bar{\mathcal{E}}(x, x', \varepsilon_5) = \bigcap_{l=1}^{L} \bar{\mathcal{E}}_l(x, x', \varepsilon_5).$$

When the arguments $(x, x')$ are clear from context, we suppress them and write, for example, $\bar{\mathcal{A}}(\varepsilon_1)$.

We also state the following two lemmas, taken from Arora et al. (2019), which are used at several points in the proof, before beginning with the first part.

**Lemma 2.** For any two events $A$ and $B$, $\Pr(A \Rightarrow B) \geq \Pr(B \mid A)$.

**Lemma 3.** Let $w \sim N(0, I_d)$, let $G \in \mathbb{R}^{d \times k}$ be a fixed matrix, and let $F = w^\top G$ be a random vector. Then, conditioned on the value of $F$, $w$ remains Gaussian in the null space of the row space of $G$. Mathematically,

$$\Pi_G^{\perp} w \overset{d}{=}_{F = w^\top G} \Pi_G^{\perp} \tilde{w},$$

where $\tilde{w}$ is an i.i.d. copy of $w$.

#### A.4.1. Pre-activations are Close to the NTK Matrix

In the first step, we show that $\langle h^{(l-1)}(x), h^{(l-1)}(x') \rangle$ is close to $\Sigma^{(l-1)}(x, x')$. We formalize this idea in the following lemma.

**Lemma 4.** Let $\sigma(z) = \sigma_s(z)$ and $[W^{(l)}]_{ij} \sim$ i.i.d. $N(0, 1)$ for all $i \in [d_{l+1}]$, $j \in [d_l]$ and $l \in \{0, 1, 2, \ldots, L\}$.
There exist constants $c_1, c_2 > 0$ such that if $\min_l d_l \geq c_1 \frac{s^L L^2 c_s^2}{\varepsilon^2} \log^2(sL/\delta\varepsilon)$ and $\varepsilon \leq c_2/s$, then for fixed $x, x' \in \mathcal{X}$,

$$\left| \big\langle h^{(l)}(y), h^{(l)}(y') \big\rangle - \Sigma^{(l)}(y, y') \right| \leq \varepsilon$$

holds with probability at least $1 - \delta$ for all $l \in \{0, 1, \ldots, L\}$ and all $(y, y') \in \{(x, x), (x, x'), (x', x')\}$. In other words, $\Pr(\bar{\mathcal{A}}(\varepsilon)) \geq 1 - \delta$ for fixed $x, x'$.

*Proof.* The proof of this lemma relies on the following lemmas.

**Lemma 5.** Let $\rho \in [-1, 1]$ and $(X, Y) \sim N(0, \Sigma_\rho)$. If $(X_1, Y_1), (X_2, Y_2), \ldots, (X_n, Y_n)$ are independent samples from $N(0, \Sigma_\rho)$ and $S_n = \frac{1}{n} \sum_{i=1}^{n} \sigma_s(X_i)\sigma_s(Y_i)$ for some $\sigma_s \in \mathcal{A}_\sigma$, then for any $t \geq 0$,

$$\Pr(S_n - \mu_{s,\rho} \geq t) \leq \begin{cases} (n+1) \exp\left( - \dfrac{n t^2}{4\big( \sqrt{3\mu_{4s,\rho}} + \mu_{2s,\rho} \big)^2} \right) & \text{if } t \leq t^*(n), \\[2mm] (n+1) \exp\left( - \dfrac{(nt)^{1/s}}{1 + \rho} \right) & \text{if } t > t^*(n), \end{cases}$$

$$\Pr(S_n - \mu_{s,\rho} \leq -t) \leq \exp\left( - \frac{n t^2}{2\mu_{2s,\rho}} \right),$$

where $t^*(n) \propto n^{-\frac{s-1}{2s-1}}$ marks the crossover between the two regimes. Consequently, if $n \geq n^*(s, \delta, \rho)$, then with probability at least $1 - \delta$,

$$|S_n - \mu_{s,\rho}| \leq 2\big( \sqrt{3\mu_{4s,\rho}} + \mu_{2s,\rho} \big) \sqrt{\frac{1}{n} \log\frac{n+1}{\delta}},$$

where $n^*(s, \delta, \rho) := \min\left\{ m \in \mathbb{N} : m \geq (8(1+\rho))^{2s} \log\frac{m+1}{\delta} \right\}$.

For simplicity of notation, we define $\varphi_s(n, \delta, \rho) := 2\big( \sqrt{3\mu_{4s,\rho}} + \mu_{2s,\rho} \big) \sqrt{\frac{1}{n} \log\frac{n+1}{\delta}}$. Thus, the result of Lemma 5 can be restated as $\Pr\big( |S_n - \mu_{s,\rho}| \geq \varphi_s(n, \delta, \rho) \big) \leq \delta$.

**Lemma 6.** For a given $s \in \mathbb{N}$, the dual activation function $\hat{\sigma}_s$ is $\beta_s$-Lipschitz in $\mathcal{M}_+^{\gamma_s}$ w.r.t. the $\infty$-norm, for $\gamma_s \leq 1/s$ and $\beta_s \leq 6s$.

**Lemma 7** (Daniely et al. (2016)). Let $\sigma(z) = \sigma_s(z)$ and $[W^{(l)}]_{ij} \sim$ i.i.d. $N(0, 1)$ for all $i \in [d_{l+1}]$, $j \in [d_l]$ and $l \in \{0, 1, 2, \ldots, L\}$. If $\min_{l \in [L]} d_l \geq n_s^*\!\left( \frac{\varepsilon}{B_{L,s}},\ \frac{\delta}{8\tilde{d}} \right)$, then for fixed $x, x' \in \mathcal{X}$,

$$\left| \big\langle h^{(l)}(y), h^{(l)}(y') \big\rangle - \Sigma^{(l)}(y, y') \right| \leq \varepsilon$$

holds with probability at least $1 - \delta$ for all $l \in \{0, 1, \ldots, L\}$ and all $(y, y') \in \{(x, x), (x, x'), (x', x')\}$. In the above expression, $B_{L,s} = \sum_{j=0}^{L-1} \beta_s^{j}$, $\tilde{d} = \sum_{l=1}^{L+1} d_l$, and $n_s^*(\varepsilon, \delta) = \min\{n : \varphi_s(n, \delta) \leq \varepsilon\}$.

Lemma 4 follows immediately from Lemma 7, which in turn follows from the previous lemmas. The proofs of Lemmas 5 and 6 are provided in Appendix B. ∎

#### A.4.2. Pre-activation Gradients are Close to NTK Derivative Matrices

In the second step, we show that $\langle b^{(l)}(x), b^{(l)}(x') \rangle$ is close to $\prod_{j=l}^{L+1} \dot{\Sigma}^{(j)}(x, x')$. We formalize this idea in the following lemma.

**Lemma 8.** Let $\sigma = \sigma_s$ and $[W^{(l)}]_{ij} \sim$ i.i.d. $N(0, 1)$ for all $i \in [d_{l+1}]$, $j \in [d_l]$ and $l \in \{0, 1, 2, \ldots, L\}$. There exist constants $c_1, c_2 > 0$ such that if $\min_l d_l \geq c_1 \frac{s^L L^2 c_s^2}{\varepsilon^2} \log^2(sL/\delta\varepsilon)$ and $\varepsilon \leq \frac{c_2}{s^{L+2} L}$, then for fixed $x, x' \in \mathcal{X}$,

$$\left| \big\langle b^{(l)}(y), b^{(l)}(y') \big\rangle - \prod_{j=l}^{L+1} \dot{\Sigma}^{(j)}(y, y') \right| \leq L(\beta_s + 1) s^{L+1} \varepsilon$$

holds with probability at least $1 - \delta$ for all $l \in \{1, \ldots, L+1\}$ and all $(y, y') \in \{(x, x), (x, x'), (x', x')\}$. In other words, if $\min_l d_l \geq c_1 \frac{s^L L^2 c_s^2}{\varepsilon_1^2} \log^2(sL/\delta\varepsilon_1)$ and $\varepsilon_1 \leq \frac{c_2}{s^{L+2} L}$, then for fixed $x, x'$,

$$\Pr\Big( \bar{\mathcal{A}}(\varepsilon_1/2) \Rightarrow \bar{\mathcal{B}}\big( L(\beta_s + 1) s^{L+1} \varepsilon_1 \big) \Big) \geq 1 - \delta.$$

*Proof.* We first state some helper lemmas that will be used in the proof, followed by the proof of the lemma itself.

**Lemma 9** (Arora et al. (2019), Lemma E.4). For all $\varepsilon \in [0, 1]$ and $\delta \in (0, 1)$,

$$\Pr\Big( \bar{\mathcal{A}}_L(\varepsilon/2) \Rightarrow \mathcal{C}\big( \sqrt{2\log(2/\delta)} \big) \Big) \geq 1 - \delta.$$

**Lemma 10.** If $\bar{\mathcal{A}}_L(\varepsilon_1/2) \cap \bar{\mathcal{B}}_{l+1}(\varepsilon_2) \cap \mathcal{C}(\varepsilon_3) \cap \bar{\mathcal{D}}_l(\varepsilon_4) \cap \bar{\mathcal{E}}_l(\varepsilon_5)$ holds for a pair of points $x, x' \in \mathcal{X}$, then for any $(y, y') \in \{(x, x), (x, x'), (x', x')\}$,

$$\left| \frac{c_s\, \mathrm{tr}\big( \Delta^{(l)}(y, y') \big)}{d_l}\, \big\langle b^{(l+1)}(y), b^{(l+1)}(y') \big\rangle - \prod_{j=l}^{L+1} \dot{\Sigma}^{(j)}(y, y') \right| \leq s^{L-l}\varepsilon_4 + s\,\varepsilon_2.$$

**Lemma 11.** If $\bar{\mathcal{A}}_L(\varepsilon_1/2) \cap \bar{\mathcal{B}}_{l+1}(\varepsilon_2) \cap \mathcal{C}(\varepsilon_3) \cap \bar{\mathcal{D}}_l(\varepsilon_4) \cap \bar{\mathcal{E}}_l(\varepsilon_5)$ holds for a pair of points $x, x' \in \mathcal{X}$, then for any $(y, y') \in \{(x, x), (x, x'), (x', x')\}$,

$$\left| \frac{c_s}{d_l}\, \big( b^{(l+1)}(y) \big)^{\!\top} W^{(l+1)} \Pi_G^{\perp}\, \Delta^{(l)}(y, y')\, \Pi_G^{\perp} \big( W^{(l+1)} \big)^{\!\top} b^{(l+1)}(y') - \frac{c_s}{d_l}\, \mathrm{tr}\big( \Delta^{(l)}(y, y') \big)\, \big\langle b^{(l+1)}(y), b^{(l+1)}(y') \big\rangle \right|$$
$$\leq \frac{c_s}{d_l} \Big( \big\langle b^{(l+1)}(y), b^{(l+1)}(y) \big\rangle + \big\langle b^{(l+1)}(y'), b^{(l+1)}(y') \big\rangle + s\, \big\langle b^{(l+1)}(y), b^{(l+1)}(y') \big\rangle \Big)\, \varepsilon_5$$

holds with probability at least $1 - \delta$. Consequently, for any $y \in \{x, x'\}$,

$$\left\| \big( b^{(l+1)}(y) \big)^{\!\top} W^{(l+1)} \Pi_G^{\perp} D^{(l)}(y) \right\| \leq \sqrt{2 c_s\, \big\langle b^{(l+1)}(y), b^{(l+1)}(y) \big\rangle\, \varepsilon_5}.$$
If AL(ε1/2) Bl+1(ε2) C(ε3) Dl(ε4) El(ε5) for any pair of points x, x X, then for any (y, y ) {(x, x), (x, x ), (x , x )} b(l+1)(y) W(l+1) (l)(y, y ) W(l+1) b(l+1)(y ) b(l+1)(y) W(l+1)Π G (l)(y, y )Π G W(l+1) b(l+1)(y ) (s(L l)/2 + ε2) (s(L l)/2 + 1) 8(s(L l)/2 + 1) holds with probability at least 1 δ. Lemma 13. For any ε1 [0, 1/s], ε2, ε4 [0, 1] and ε3, ε5 0, Pr AL(ε1/2) Bl+1(ε2) C(ε3) Dl(ε4) El(ε5) Bl(ε ) 1 δ, ε = 2cs(s L l + 1) dl (s L l + 1)ε5 + s L lε4 + sε2 (s(L l)/2 + 1) (s(L l)/2 + 1) 8(s(L l)/2 + 1) We now prove Lemma 8 by using induction on Lemma 13. Before the inductive step, we note some relations that will be useful to us during the analysis. Firstly, using Lemma 4, we can conclude that if dl n s(ε/BL,s, δ/32 d(L + 1)) and ε c2/s, then Pr AL(ε/2) 1 δ/4. (22) From Lemma 9, we can conclude that " AL(ε/2) C 1 δ/4. (23) If dl n s(ε/3, δ/24(L + 1)), then we can conclude that 3s2φs 1((minl dl), δ/24(L + 1)) + s2βsε 2s 1 (β + 1)s2ε. Consequently for ε 2/s, Pr A(ε/2) D((βs + 1)s2ε) E (ε ) 1 δ/4, (24) Provably and Practically Efficient Neural Contextual Bandits where ε = 3s2 2 log 6(L + 1)(maxl dl) s 1 . Applying the union bound on (22), (23) and (24), we obtain that D((βs + 1)s2ε) E (ε ) 1 3δ/4. (25) Let dl be chosen large enough to ensure that s L l+2ε 2cs(s L l + 1) 2 dl log 24(L + 1) + log 24(L + 1) 8 dl (s(L l)/2 + 1) (s(L l)/2 + 1) 2 log 64(L + 1) log 16(L + 1) (s(L l)/2 + 1) 2 log 64(L + 1) log 16(L + 1) Define υl := (L + 1 l)(βs + 1)s L l+2ε for l = 0, 1, . . . , L + 1. Invoking Lemma 13 with ε1 = ε, ε2 = υl+1, δ , ε4 = (βs + 1)s2ε and ε5 = ε gives us that " AL(ε/2) Bl+1(υl+1) C Dl((βs + 1)s2ε) El(ε ) Bl(υl) 1 δ 4(L + 1). Pr A(ε/2) B(υ1) D((βs + 1)s2ε) E(ε ) | {z } :=F(ε) F(ε) BL+1(υL+1) j=L+1 l Bj+1(υj+1) F(ε) BL+1(υL+1) ( AL(ε/2) C Dl((βs + 1)s2ε) El(ε ) Bl+1(υl+1) Bl(υl) D((βs + 1)s2ε) E(ε ) Pr BL+1(υL+1) ( AL(ε/2) C Dh((βs + 1)s2ε) El(ε ) Bl+1(υl+1) Bl(υl) Thus, Pr A(ε/2) B(L(βs + 1)s L+1ε) 1 δ, as required. The proof of the helper lemmas are provided in Appendix B. Provably and Practically Efficient Neural Contextual Bandits Now, it is straightforward to note that Lemma 1 immediately follows from the Lemma 4 and Lemma 8. With Lemma 1 in place, the rest of the proof of Theorem 1 follows similar to the proof of Theorem 3.2 in Arora et al. (2019). The only step in the proof of Theorem 3.2 that is dependent on kernel being Re LU, is Lemma F.6, where they bound the perturbation in gradient based on the amount of perturbation in the weight matrices. We would like to emphasize that this is only step apart from the ones which use Theorem 3.1. Such steps follow immediately by invoking Lemma 1 proven above, instead of Theorem 3.1. Using the result from Liu et al. (2020, Theorem 3.2), we know that the spectral norm of the Hessian of the neural net is O(m 1/2). Here the implied constant only depends on s and L. Thus for the a perturbation in the weight matrices of the order of O( mω), the corresponding perturbation in the gradient is O(ω), with the implied constant depending only on s and L. On substituting this relation in place of Lemma F.6, the claim of Theorem 1 follows using the same arguments as the ones used in the proof of Theorem 3.2 of Arora et al. (2019). B. Proof of Helper Lemmas We first state some additional intermediate lemmas which are used in the proofs several helper lemmas then provide a proof of all the lemmas. Definition 4. 
G(l)(x, x ) = [h(l)(x) h(l)(x )] Rdl 2 G(l)(x, x ) = [G(l)(x, x )] G(l)(x, x ) = h(l)(x), h(l)(x) h(l)(x), h(l)(x ) h(l)(x ), h(l)(x) h(l)(x ), h(l)(x ) G11 G12 G12 G22 " Al(ε) Dl+1 3s2φs 1(dl, δ/6, 1) + 2s2βsε 3s2 2 log 6dl ε [0, 1/s], δ (0, 1). Lemma 15. Let ε = 3s2φs 1((min l dl), δ/6(L + 1), 1) + 2s2βsε 2s 1 , ε = 3s2 2 log 6(L + 1)(maxl dl) Pr A(ε) D(ε ) E(ε ) 1 δ ε [0, 1/s], δ (0, 1). Lemma 16. For any 1 l L, any fixed {W(i)}l 1 i=1 and x, x X, with probability 1 δ over the randomness of W(l), we have, cs tr( (l)(x, x )) 2s 1 σs 1 G(l)(x, x ) s2(G11G22) 2 φs 1(dl, δ/2, ρG), (l)(x, x ) 2 s2 p G11G22 + G12 log 2dl where ρG = G12 G11G22 . Lemma 17 ((Boucheron et al., 2013)). Let ξ N(0, In) be an n-dimensional unit Gaussian random vector and A Rn n be a symmetric matrix, then for any t > 0, Pr ξ Aξ E h ξ Aξ i > 2 A F t + 2 A 2t exp( t), Provably and Practically Efficient Neural Contextual Bandits or equivalently, Pr ξ Aξ E h ξ Aξ i > t exp t2 4 A 2 F + 4 A 2 Lemma 18. If AL(ε1/2) Bl+1(ε2) C(ε3) Dl(ε4) El(ε5) for any pair of points x, x X, then for any y {x, x } ΠG W(l+1) b(l+1)(y) (s(L l)/2 + 1) holds with probability at least 1 δ. B.1. Proof of Lemma 5 We fix a particular s N. Let (X1, Y1), (X2, Y2), . . . , (Xn, Yn) be n independent samples from N(0, Σρ) and let Zi = σs(Xi)σs(Yi). Define Sn := 1 n Pn i=1 Zi. Note that E[Sn] = E[Z1] = µs. To prove the bound, we first bound Pr(Zi > t) for any t 0 and then using techniques borrowed from Gantert et al. (2014), we obtain an upper bound on Pr(Sn µs > t). The bound on Pr(Sn µs < t) follows from Cramer s Theorem (Cram er, 1938). We begin with bounding Pr(Zi > t) for any t 0. Since Zi = (σ1(Xi)σ1(Yi))s, we just focus on bounding Pr(σ1(X)σ1(Y ) > t) for (X, Y ) N(0, Σρ). Fix a t 0. We have, Pr(σ1(X)σ1(Y ) > t) = E [1 {σ1(X)σ1(Y ) > t}] = E [1 {XY > t, X 0, Y 0}] . Using a change of variables define U = (X + Y )/ 2 and V = (X Y )/ 2. Note that U and V are zero-mean Gaussian random variables with Cov(U, V ) = 0 implying that they are independent. Also Var(U) = 1 + ρ and Var(V ) = 1 ρ. Thus, Pr(σ1(X)σ1(Y ) > t) = EX,Y [1 {XY > t, X 0, Y 0}] = EU,V 1 U 2 V 2 > 2t, U 0, V [ U, U] EU,V 1 U 2 > V 2 + 2t, U 0 EV h Pr U > p exp V 2 + 2t exp v2 + 2t e t/(1+ρ) p exp t 1 + ρ exp t 1 + ρ In the sixth line we used the concentration bound for Gaussian random variable and in the eighth line we used the Gaussian integral. Consequently, Pr(Zi > t) = Pr(σ1(X)σ1(Y ) > t1/s) exp t1/s Provably and Practically Efficient Neural Contextual Bandits Using this relation, we now move on to bound Pr(Sn x) for x µs,ρ. For the rest of the proof we drop the argument ρ for simplicity. Firstly, note that Pr(Sn x) Pr max j [n] Zj n(x µs) | {z } An 1 + Pr Sn x, max j [n] Zj < n(x µs) | {z } An 2 We begin with the first term. Pr max j [n] Zj n(x µs) n Pr(Zi n(x µs)) n exp n1/s(x µs)1/s Let βζ(n) = ζn1/s 1+ρ > 0 for some ζ > 0 which will be specified later. We have, An 2 = Pr Sn x, max j [n] Zj < n(x µs) = Pr exp (βζ(n)Sn) exp (βζ(n)x) , max j [n] Zj < n(x µs) exp ( βζ(n)x) E exp (βζ(n)Sn) 1 max j [n] Zj < n(x µs) exp ( βζ(n)x) j=1 E exp βζ(n)Zj 1 {Zj < n(x µs)} = log (An 2) βζ(n)x + j=1 Γζ (j)(n), where Γζ (j)(n) = log βζ(n)Zj (n) and Zj (n) := Zj1 {Zj < n(x µs)}. 
Using the relations log(1 + x) x and ex 1 + x + x2 2 ex, we have, j=1 Γζ (j)(n) = βζ(n)Zj (n) " βζ(n)Zj (n) βζ(n)Zj (n) βζ(n)Zj (n) " βζ(n)Zj (n) βζ(n)Zj (n) βζ(n)Zj (n) " βζ(n)Zj (n) n E h Zj (n)i For the second term we have, βζ(n)Z1 (n) βζ(n)Z1 (n) " Z1 (n) 2(1+η) η # η 1+η E (1 + η)βζ(n)Z1 (n) Provably and Practically Efficient Neural Contextual Bandits for some η > 0 by using H older s inequality. Since E Z1 (n) 2(1+η)/η η/(1+η) = h µ 2s(1+η) i η 1+η , we focus on the other term. We have, (1 + η)βζ(n)Z1 (n) = 1 + Z n(x µs) (1 + η)βζ(n) n exp (1 + η)βζ(n)z Pr(Z1 > z) dz = 1 + n(x µs) Z 1 (1 + η)βζ(n) n exp (1 + η)βζ(n)n(x µs)y Pr(Z1 > n(x µs)y) dy 1 + (x µs) Z 1 0 (1 + η)βζ(n) exp (1 + η)βζ(n)(x µs)y (n(x µs)y)1/s 1 + (x µs)(1 + η)βζ(n) Z 1 0 exp βζ(n) (1 + η)(x µs)y (x µs)1/sy1/s For ζ (x µs)1/s 1 2(1 + η) , we have, (1 + η)βζ(n)Z1 (n) 1 + (x µs)(1 + η)βζ(n) Z 1 0 exp βζ(n) (1 + η)(x µs)y (x µs)1/sy1/s 1 + (x µs)(1 + η)βζ(n) Z 1 2βζ(n)(1 + η)(x µs)y dy On combining all the equations, we obtain that for ζ (x µs)1/s 1 2(1+η) and η > 0 log (An 2) βζ(n)(x µs) + (βζ(n))2 h µ 2s(1+η) i η 1+η 3 1 1+η We set η = 1. Thus for ζ 0.25(x µs)1/s 1, we have, log (An 2) ζn1/s(x µs) 1 + ρ + ζ2n2/s 1 2(1 + ρ)2 p Note that the RHS is minimized at ζ = n1 1/s(x µs)(1 + ρ) 3µ4s . However, ζ 0.25(x µs)1/s 1 only for (x µs) 3µ4s 2 1/s n ( s 1 | {z } :=t (n) . Thus, for (x µs) t (n), we can plug in ζ = ζ in the above expression to obtain log (An 2) n(x µs)2 For (x µs) > t (n), we plug in ζ = 0.25(x µs)1/s 1 to obtain log (An 2) n1/s(x µs)1/s Provably and Practically Efficient Neural Contextual Bandits On combining the two, we obtain that for any t 0 Pr(Sn µs t) = (n + 1) exp nt2 if t t (n), (n + 1) exp (nt)1/s if t > t (n). We now consider the other side. Consider any x µs and λ > 0. We have, Pr(Sn x) = Pr exp( λSn) e λx E [exp( λSn)] eλx E 1 λSn + λ2S2 n 2 1 λE[Sn] + λ2E[S2 n] 2 1 λµs + λ2µ2s exp λµs + λ2µ2s exp λ(x µs) + λ2µ2s Minimizing this over λ yields λ = n(x µs) µ2s > 0. On plugging this value, we obtain, Pr(Sn x) exp n(x µs)2 Consequently for any t 0, we have, Pr(Sn µs t) exp nt2 On combining the relations, we can conclude that if n n (s, δ, ρ), then 2( 3µ4s + µ2s) n log n + 1 B.2. Proof of Lemma 6 Fix a s N. We drop the subscript s for the remainder of the proof for ease of notation. We need to show that the dual activation function σ is β-Lipschitz in Mγ + w.r.t. the -norm. Let Σ = Σ11 Σ12 Σ12 Σ22 Mγ +. In order to prove β-Lipschitzness w.r.t. the -norm, it is sufficient to show that, σs 1 = σs Σ12 Recall that since σs is s-homogeneous, σs(Σ) = (Σ11Σ22)s/2 σs Provably and Practically Efficient Neural Contextual Bandits Using this relation, we have, σs Σ12 = (Σ11Σ22) s 1 2(Σ11Σ22) s 2 1Σ22 σs 2(Σ11Σ22) s 1 2(Σ11Σ22) s 2 1 σs 2(Σ11Σ22) s 1 2(Σ11Σ22) s 2 1Σ11 σs 2(Σ11Σ22) s 1 2(Σ11Σ22) s 2 1 σs 2(Σ11Σ22) s 1 Consequently, σs 1 = (Σ11Σ22) s 1 (Σ11 + Σ22)(Σ11Σ22) s 2 1 For simplicity, let ρ = Σ12 Σ11Σ22 . For s = 1, Arora et al. (2019) showed that the above expression is 1 + o(γ). We consider the case for s > 1. Using (20), we have, 2s 1(Σ11Σ22) s 1 2 | σs 1 ( ρ)| + (Σ11 + Σ22)(Σ11Σ22) s 2 1 s σs ( ρ) s2 2s 1 σs 1 ( ρ) ρ 3 (Σ11Σ22) s 1 4 (Σ11 + Σ22)(Σ11Σ22) s 2 1 6 (1 + γ)s 1 If γ 1/s, then σs 1 6s, as required. B.3. 
Proof of Lemma 10 cs tr( (l)(y, y )) D b(l+1)(y), b(l+1)(y ) E j=l Σ(j)(y, y ) cs tr( (l)(y, y )) dl Σ(l)(y, y ) D b(l+1)(y), b(l+1)(y ) E + Σ(l)(y, y ) D b(l+1)(y), b(l+1)(y ) E j=l+1 Σ(j)(y, y ) D b(l+1)(y), b(l+1)(y ) E ε4 + s2 2s 1 σs 1 Λ(l)(x, x ) ε2 Provably and Practically Efficient Neural Contextual Bandits Since Bl+1(ε2), b(l+1)(y), b(l+1)(y ) QL j=l+1 Σ(j)(y, y ) + ε2 QL j=l+1 Σ(j)(y, y ) + 1. Moreover, since Σ(l)(y, y ) = s2 2s 1 σs 1 Λ(l)(y, y ) s2 2s 1, we have, cs tr( (l)(y, y )) D b(l+1)(y), b(l+1)(y ) E j=l+1 Σ(j)(y, y ) s L lε4 + sε2. B.4. Proof of Lemma 11 Fix any (y, y ) {(x, x), (x, x ), (x , x )}. Recall that G(l)(y, y ) = [h(l)(y) h(l)(y )] Rdl 2. Similarly define, F(l+1)(y, y ) = [f (l+1)(y) f (l+1)(y )] Rdl 2. Using Lemma 3 for each row of W(l+1), we can conclude that W(l+1)Π G d=F(l+1)=W(l+1)G(l) f WΠ G, where f W is a i.i.d. copy of W(l+1). Note that h b(l+1)(y) f W b(l+1)(y ) f W i R2dl N(0, Σ) where Σ = b Idl b Idl b Idl b Idl . In the definition of Σ, b = b(l+1)(y), b(l+1)(y) , b = b(l+1)(y), b(l+1)(y ) and b = b(l+1)(y ), b(l+1)(y ) . If M is such that Σ = MM , then b(l+1)(y) f W b(l+1)(y ) f W d= Mξ, where ξ N(0, I2dl). Thus, for fixed G(l) and conditioned on the value of {b(l+1)(y), b(l+1)(y ), h(l)(y), h(l)(y )}, we have, b(l+1)(y) W(l+1)Π G (l)(y, y )Π G W(l+1) b(l+1)(y) d= b(l+1)(y) f WΠ G (l)(y, y )Π G f W b(l+1)(y) d= ([Idl 0]Mξ) Π G (l)(y, y )Π G([0 Idl]Mξ) 2ξ M 0 Π G (l)(y, y )Π G Π G (l)(y, y )Π G 0 2M 0 Π G (l)(y, y )Π G Π G (l)(y, y )Π G 0 Then we have, E[ξ Aξ] = tr(A) = 1 2tr 0 Π G (l)(y, y )Π G Π G (l)(y, y )Π G 0 = b tr(Π G (l)(y, y )Π GIdl) = b tr( (l)(y, y )Π G) = b tr( (l)(y, y )(Idl ΠG)) = b h tr( (l)(y, y )) tr( (l)(y, y )ΠG) i . Since tr(ΠG) = rank(ΠG)) 2, we have, 0 tr( (l)(y, y )ΠG) (l)(y, y ) 2tr(ΠG) 2 (l)(y, y ) 2. Provably and Practically Efficient Neural Contextual Bandits Consequently, E[ξ Aξ] b tr( (l)(y, y )) 2b (l)(y, y ) 2. For the upper bound on the spectrum, note that M 2 2 = Σ 2 b + b . Hence, 2 M 2 2 Π G (l)(y, y )Π G 2 2(b + b ) Π G 2 (l)(y, y ) 2 Π G 2 2 (l)(y, y ) 2. Thus, on using Lemma 17 with t = log(3/δ), we have with probability at least 1 δ/3 ξ Aξ E[ξ Aξ] 2( A F (b + b ) (l)(y, y ) 2 Consequently, b(l+1)(y) W(l+1)Π G (l)(y, y )Π G W(l+1) b(l+1)(y) cs dl tr( (l)(y, y )) D b(l+1)(y), b(l+1)(y ) E ξ Aξ E[ξ Aξ] + cs E[ξ Aξ] b tr( (l)(y, y )) cs(b + b ) (l)(y, y ) 2 dl (l)(y, y ) 2, holds with probability at least 1 δ/3. The lemma follows by taking a union bound over all possible choices of (y, y ). We can use the above result to obtain the following bound for any y {x, x } b(l+1)(y) W(l+1)Π GD(l)(y) b(l+1)(y) W(l+1)Π G (l)(y, y)Π G W(l+1) b(l+1)(y) 1/2 ( cs dl tr( (l)(y, y))b + 2csb (l)(y, y) 2 dl (l)(y, y ) 2 bcs (l)(y, y) 2 2bcs (l)(y, y) 2, where b = b(l+1)(y), b(l+1)(y) and the last step uses the relation the bound on dl. Provably and Practically Efficient Neural Contextual Bandits B.5. Proof of Lemma 12 Fix any (y, y ) {(x, x), (x, x ), (x , x )}. For ease of writing let V(l+1)(x) := W(l+1) b(l+1)(x). We are interested in bounding the following expression. V(l+1)(y) (l)(y, y )V(l+1)(y ) V(l+1)(y) Π G (l)(y, y )Π GV(l+1)(y ) = V(l+1)(y) (ΠG + Π G)D(l)(y)D(l)(y )(ΠG + Π G)V(l+1)(y ) V(l+1)(y) Π GD(l)(y)D(l)(y )Π GV(l+1)(y ) = V(l+1)(y) ΠGD(l)(y)D(l)(y )ΠGV(l+1)(y ) + V(l+1)(y) ΠGD(l)(y)D(l)(y )Π GV(l+1)(y )+ V(l+1)(y) Π GD(l)(y)D(l)(y )ΠGV(l+1)(y ). 
Consequently, V(l+1)(y) (l)(y, y )V(l+1)(y ) V(l+1)(y) Π G (l)(y, y )Π GV(l+1)(y ) V(l+1)(y) ΠGD(l)(y)D(l)(y )ΠGV(l+1)(y ) + V(l+1)(y) ΠGD(l)(y)D(l)(y )Π GV(l+1)(y )+ V(l+1)(y) Π GD(l)(y)D(l)(y )ΠGV(l+1)(y ) b(l+1)(y) W(l+1)ΠGD(l)(y) D(l)(y )ΠG W(l+1) b(l+1)(y ) + b(l+1)(y) W(l+1)ΠGD(l)(y) D(l)(y )Π G W(l+1) b(l+1)(y ) + b(l+1)(y) W(l+1)Π GD(l)(y) D(l)(y )ΠG W(l+1) b(l+1)(y ) (s(L l)/2 + 1) (s(L l)/2 + 1) D(l)(y) 2 D(l)(y ) 2+ (s(L l)/2 + 1) D(l)(y) 2 b(l+1)(y ) D(l)(y ) 2+ (s(L l)/2 + 1) D(l)(y) 2 b(l+1)(y) D(l)(y ) 2 (s(L l)/2 + 1) (l)(y, y ) 2+ (s(L l)/2 + 1) (s(L l)/2 + 1) (l)(y, y ) 2 (s(L l)/2 + 1) (s(L l)/2 + 1) 8(s(L l)/2 + 1) where we used Lemma 11 and 18 in fourth step and the condition of Bl+1(ε2) El(ε5) in the last step. Both lemmas hold with probability at least 1 δ/2, ensuring that the overall expression is true with probability at least 1 δ. Provably and Practically Efficient Neural Contextual Bandits B.6. Proof of Lemma 13 Fix any (y, y ) {(x, x), (x, x ), (x , x )}. Conditioned on AL(ε1/2) Bl+1(ε2) C(ε3) Dl(ε4) El(ε5), we begin with bounding b(l)(y), b(l)(y ) QL+1 j=l Σ(j)(x, y ) . Using the definition of b(l)(y) we have, D b(l)(y), b(l)(y ) E j=l Σ(j)(y, y ) b(l+1)(y) W(l+1) (l)(y, y ) W(l+1) b(l+1)(y ) j=l Σ(j)(y, y ) b(l+1)(y) W(l+1) (l)(y, y ) W(l+1) b(l+1)(y ) b(l+1)(y) W(l+1)Π G (l)(y, y )Π G W(l+1) b(l+1)(y ) b(l+1)(y) W(l+1)Π G (l)(y, y )Π G W(l+1) b(l+1)(y) cs dl tr( (l)(y, y )) D b(l+1)(y), b(l+1)(y ) E cs tr( (l)(y, y )) D b(l+1)(y), b(l+1)(y ) E j=l Σ(l)(y, y ) (s(L l)/2 + 1) (s(L l)/2 + 1) 8(s(L l)/2 + 1) + cs D b(l+1)(y), b(l+1)(y) E + D b(l+1)(y ), b(l+1)(y ) E s D b(l+1)(y), b(l+1)(y ) E ε5 + s L lε4 + sε2 (s(L l)/2 + 1) (s(L l)/2 + 1) 8(s(L l)/2 + 1) + 2cs(s L l + 1) dl (s L l + 1)ε5 + s L lε4 + sε2. In the fourth step, we used Lemma 10, 12 and 11 with the last two each holding with probability at least 1 δ/2. Consequently, the above expression holds with probability at least 1 δ. B.7. Proof of Lemma 14 We carry out the analysis conditioned on Al(ε). Since h(l)(y) h(l)(y ) Σ(l)(y, y ) ε for all (y, y ) {(x, x), (x, x ), (x , x )} we can conclude that G(l) Λ(l) 2ε. Since σs 1 is βs 1-Lipschitz in Mγs 1 + w.r.t. Provably and Practically Efficient Neural Contextual Bandits -norm, we have, | σs 1(G(l)) σs 1(Λ(l+1))| βs 1 G(l) Λ(l+1) 2βs 1ε, for ε γs 1. Fix any (y, y ) {(x, x), (x, x ), (x , x )}. We have that, cs tr( (l+1)(y, y )) dl+1 Σ(l+1)(y, y ) cs tr( (l)(y, y )) 2s 1 σs 1 Λ(l+1)(y, y ) cs tr( (l)(y, y )) 2s 1 σs 1 G(l)(y, y ) + 2s 1| σs 1( G(l)) σs 1(Λ(l+1))| s2(1 + ε)s 1φs 1(dl, δ/2, min{1, Σ(l)(y, y ) + ε}) + 2s2βsε holds with probability 1 δ. The last step follows from Lemma 16. On taking a union bound, we can conclude that cs tr( (l+1)(y, y )) dl+1 Σ(l+1)(y, y ) s2(1 + ε)s 1φs 1(dl, δ/6, 1) + 2s2βsε holds for all (y, y ) {(x, x), (x, x ), (x , x )} with probability 1 δ. Since we have already invoked Lemma 16, it also gives us that (l+1)(x, x ) 2 s2 2 (1 + ε) log 6dl holds for all (y, y ) {(x, x), (x, x ), (x , x )} with probability 1 δ. Using this result along with Lemma 2 and ε 1/s yields us " Al(ε) Dl+1 3s2φs 1(dl, δ/6, 1) + 2s2βsε 3s2 2 log 6dl as required. Provably and Practically Efficient Neural Contextual Bandits B.8. Proof of Lemma 15 Pr A(ε) D(ε ) E(ε ) = Pr Dl+1(ε ) El+1(ε ) !# Dl+1(ε ) El+1(ε ) !# l=0 Dl+1(ε ) El+1(ε ) !# Dl+1(ε ) El+1(ε ) !# Dl+1(ε ) El+1(ε ) # l=0 Pr Al(ε) Dl+1(ε ) El+1(ε ) l=0 Pr Al(ε) Dl+1(ε ) El+1(ε ) where the eighth line follows from Lemma 14 and the choice of ε and ε . B.9. 
Proof of Lemma 16 We can rewrite tr (l)(x, x ) as follows: tr( (l)(x, x )) = tr h D(l)(x)D(l)(x ) i = tr h diag σ s f (l)(x) diag σ s f (l)(x ) i = D σ s f (l)(x) , σ s f (l)(x ) E = s2 D σs 1 W(l)h(l 1)(x) , σs 1 W(l)h(l 1)(x ) E . Note that since all the matrices {W(i)}l 1 i=1 are fixed, h(l 1)( ) is a fixed function. Consequently, each term of the vector W(l)h(l 1)(x) is distributed as N 0, h(l)(x), h(l)(x) and is independent of others. If (w(l) j ) denotes the jth row of the matrix W(l), then we can write the inner product as tr( (l)(x, x )) = s2 D σs 1 W(l)h(l 1)(x) , σs 1 W(l)h(l 1)(x ) E w(l) j h(l 1)(x) σs 1 w(l) j h(l 1)(x ) Provably and Practically Efficient Neural Contextual Bandits where Zj are all independent and identically distributed as σs 1(X)σs 1(Y ) where (X, Y ) N 0, G(l 1)(x, x ) . If we define Zj := (G11G22) (s 1) 2 , then Zj are all independent and identically distributed as σs 1( X)σs 1( Y ) where ( X, Y ) N (0, ΣρG) and ρG = G12 G11G22 . From Lemma 5, we know that j=1 Zj E[ Z1] φs 1(dl, δ, ρG) with probability at least 1 δ. Consequently, j=1 Zj E[Z1] 2 φs 1(dl, δ, ρG) with probability at least 1 δ. Note that E[Z1] = E (X,Y ) N(0, G(h)(x,x )) [σs 1(X)σs 1(Y )] = 1 cs 1 σs 1 G(l)(x, x ) and cs 1 = cs(2s 1). On combining this with the previous result, we can conclude that, cs tr( (l)(x, x )) 2s 1 σs 1 G(l 1)(x, x ) s2 j=1 Zj E[Z1] 2 φs 1(dl, δ/2, ρG), holds with probability 1 δ/2. For the spectral norm of (l)(x, x ), we have, (l)(x, x ) 2 = s2 max j dl Zj = s2(G11G22) 2 max j dl Zj. Using (26), we can show that max j dl Zj > (1 + ρG) log dl Z1 > (1 + ρG) log dl dl exp 1 + ρG 1 + ρG log dl Consequently, (l)(x, x ) 2 = s2(G11G22) 2 max j dl Zj 2 (1 + ρG) log 2dl holds with probability 1 δ/2. Thus, both the hold simultaneously with probability at least 1 δ. B.10. Proof of Lemma 18 We will prove the claim for y {x, x }. Recall that G(l)(y, y ) = [h(l)(y) h(l)(y )] Rdl 2. For simplicity, we drop the arguments from h(l) and b(l+1). Define Πh = hh and ΠG/h = ΠG Πh. Note that ΠG/h is still a projection matrix of rank 0 or 1. Recall that b(l+1) is the gradient with respect to the pre-activation layer l + 1 given by f (l+1) = W(l+1)h(l). If we view the output f as a function of h(l), W(l+1), . . . , W(L+1), then we have the following relation: h(l) f(h(l), W(l+1), . . . , W(L+1)) = b(l+1) W(l+1). Provably and Practically Efficient Neural Contextual Bandits Recall that σs is s-homogeneous,that is, σs(λx) = λsσs(x) for any λ > 0. Using this recursion repeatedly, we have, f(λh(l), W(l+1), . . . , W(L+1)) = λs L lf(h(l), W(l+1), . . . , W(L+1)) = λf(λh(l), W(l+1), . . . , W(L+1)) = s L lλ(s L l 1)f(h(l), W(l+1), . . . , W(L+1)). Using this, we can write, f(h(l), W(l+1), . . . , W(L+1)) = sl L λf(λh(l), W(l+1), . . . , W(L+1)) λ=1 = sl L (λh(l))f(λh(l), W(l+1), . . . , W(L+1)) λ=1 = sl L (λh(l))f(λh(l), W(l+1), . . . , W(L+1)) λ=1 h(l) = sl L b(l+1) W(l+1)h(l). By definition of Πh, we have, Πh W(l+1) b(l+1) = 1 h(l) 2 h(l) h(l) W(l+1) b(l+1) = s L l |f(y; W)| Since AL(ε1/2) C(ε3), h(l) 1 ε1/2 1/2 and |f(y; W)| ε3. Consequently, Πh W(l+1) b(l+1) s L lε3 Similar to the proof of Lemma 11, we note that for a fixed h and conditioned on f (l+1) and b(l+1) Lemma 3 gives us that W(l+1)Π h d=f (l+1)=W(l+1)h(l) f WΠ h , where f W is a i.i.d. copy of W(l+1). From the definition of ΠG/h, we can conclude that Π G/hΠh = 0. Thus, combining this with the previous equation, we can conclude that W(l+1)ΠG/h d=f (l+1)=W(l+1)h(l) f WΠG/h. If rank(ΠG/h) = 1, then ΠG/h = uu for some unit vector u. 
Thus, ΠG/h W(l+1) b(l+1) = uu W(l+1) b(l+1) = u W(l+1) b(l+1) d= u f W b(l+1) = |t|, where t N(0, b(l+1) 2). Therefore, with probability at least 1 δ/2, (s(L l)/2 + 1) Provably and Practically Efficient Neural Contextual Bandits In the above step, we used similar arguments as used in Appendix. B.7 to bound b(l+1) . If rank(ΠG/h) = 0, ΠG/h W(l+1) b(l+1) = 0. On combining this with the previous result, we obtain that with probability 1 δ ΠG W(l+1) b(l+1) ΠG/h W(l+1) b(l+1) + Πh W(l+1) b(l+1) (s(L l)/2 + 1) Taking the union bound over the two possible choices of y gives us the final result. C. Proof of Theorem 2 The proof of Theorem 2 involves combining several smaller results, some of which follow from Theorem 1 and the others are minor modifications of existing results. We begin with stating these results along with proofs, if needed, and then combine them to prove Theorem 1. Firstly, using Theorem 1 along with Lemma 5.1 from Zhou et al. (2020), we can conclude that for m poly(T, L, K, λ 1, λ 1 0 , S 1, log(1/δ)), there exists a W such that h(x) = g(x; W0), W W0 with probability at least 1 δ for all x Z = {{xt,a}K a=1}T t=1. Furthermore, W W0 2 S. Secondly, using Lemma 4 and 8, along with (12), we can conclude that for m poly(T, L, K, λ 1, λ 1 0 , S 1, log(1/δ)), g(x; W0) O(s L), independent of m. Thirdly, we know that the spectral norm of the Hessian of the neural net is O(m 1/2) (Liu et al., 2020, Theorem 3.2) for all W such that W W0 R, where R is some finite radius. Here the implied constant depends on s, L and R. Thus, by choosing m to be sufficiently large, we can conclude that for all W W0 p T/λ, H(W) Cs,Lm 1/3. Here H denotes the Hessian and Cs,L denotes a constant that depends only on s and L. Using this relation, we can conclude that for any W W0 p T/λ, we have |f(x; W) f(x; W0) g(x; W0), W W0 | Cs,Lm 1/3T/λ = |f(x; W) g(x; W0), W W0 | Cs,Lλ 1m 1/6, where the last step follows by choosing a sufficiently large m and Assumption 1. Moreover, the above result holds for any x Z. Lastly, using the relation h(x) = g(x; W0), W W0 along with the result from Vakili et al. (2021a, Theorem 1) on the kernel defined by neural net at initialization, i.e. ˆk(x; x ) = g(x; W0), g(x ; W0) , we can conclude that |h(x) g (x; W0)V 1 t r| 2 λ log(1/δ) In the above equation, Vt and ˆσt are defined with respect to the dataset D and r = Pt i=1 yig(xi; W0). Also, we have used the independence of dataset from noise sequence to invoke the above theorem. Provably and Practically Efficient Neural Contextual Bandits Using these results we can prove our main result. For any x Z, we have, |h(x) f(x; Wt)| |h(x) g (x; W0)V 1r| + |g (x; W0)V 1r f(x; Wt)| 2 λ log(1/δ) ˆσt(x) + |g (x; W0)V 1 t r f(x; Wt)| 2 λ log(1/δ) ˆσ(x) + |g (x; W0)V 1 t r g(x; W0), Wt W0 | 2 λ log(1/δ) ˆσ(x) + Wt W0 V 1 t r Vt ˆσ(x) 2 λ log(1/δ) + (1 ηλ)J/2p t/λ + C s,L λm1/6 (λ + C s,Lt)ˆσ(x), as required. As before Cs,L, C s,L and C s,L represent constants that depend only on s and L. In the last but one step, we used the fact that for any positive definite matrix A Rd d and x, y Rd, x, y (x T A 1x) (y T Ay) (x T A 1x) A 2 y 2. And in the last step, we used the results from Zhou et al. (2020, Lemmas B.3,B.5,C.2) along with the bound on gradient established earlier. D. Additional Details on Neural GCB We first provide all the pseudo codes for Neural GCB followed by proof of Theorem 3. D.1. Pseudo Codes In Alg. 
2, UCB(r) t and LCB(r) t refer to the upper and lower confidence scores respectively at time t corresponding to index r and are defined as UCB(r) t ( ) = f( ; W(r) t )+βtˆσ(r) t ( ) and LCB(r) t ( ) = f( ; W(r) t ) βtˆσ(r) t ( ). The arrays ctr, max mu and fb are used to store the exploitation count, maximizer of the neural net output and feedback time instants respectively. The exploitation count is tracked to ensure that the exploitation budget is not exceeded. Similarly, the maximizer of the neural net output is stored to guide the algorithm in exploitation at the next level and the feedback time instants are used to keep track of last time instant when the neural net was retrained, or equivalently obtained a feedback. Lastly, υt s provide analytical and notational convenience and are not crucial to the working of the algorithm. Get Predictions is a local routine that calculates the predictive mean and variance after appropriately retraining the neural net and described in Alg. 3 and used as a sub-routine in Neural GCB. We would like to emphasize that Neural GCB computes only the mean with the trained parameters. The predictive variance is computed using the gradients at initialization. This is made possible by having an additional ad hoc neural net which is just use to compute the gradients for calculating the variance. This is similar to the setup used in (Kassraie & Krause, 2022). After computing the predictive variance for the set of current actions, the Get Predictions routine decides whether the neural net needs to be retrained based on the previous feedback instant tfb and batch parameter q. In particular, we consider two training schemes, fixed and adaptive. The fixed frequency scheme retrains the neural net if there are an additional of q points in Ψ after tfb. Consequently, the neural corresponding to the rth subset in the partition is retrained after every qr points are added to the subset. On the other hand, in the adaptive scheme, inspired by the rarely switching bandits (Abbasi-Yadkori et al., 2011; Wang et al., 2021), the neural net is retrained whenever det(V) > q det(Vfb) where V and Vfb are as defined in line 2 and 3 of Alg. 3. Since log(det(V)/ det(λI)) is a measure of information gain, the neural net is effectively retrained only once sufficient additional information has been obtained since the last time the neural net was trained. We would like to point out that in the main text, we only refer to the fixed training scheme due to space constraints. However, our Neural GCB works even with the adaptive scheme and the regret guarantees under this training schedule are formalized in Appendix D. Lastly, Train NN is another local routine that carries out the training of neural net for a given dataset using gradient descent for J epochs with a step size of η starting with W0. Provably and Practically Efficient Neural Contextual Bandits Algorithm 2 Neural GCB 1: Require: Time horizon T, maximum initial variance σ0, error probability δ 2: Initialize: R log2 T , Ensemble of R Neural Nets with W(r) 0 = W0, Ψ(r) 0 , r [R], H , arrays ctr, max mu, fb of size R with all elements set to 0, batch sizes {qr}R r=1 3: for t = 1, 2, 3, . . . 
T do 4: r 1, ˆAr(t) = [K] 5: while True do 6: Receive the context-action pairs {xt,a}K a=1 7: {f(xt,a; W(r) t ), ˆσ(r) t 1(xt,a)}K a=1, W(r) t , fb[r] Get Predictions H, Ψ(r) t 1, {xt,a}K a=1, fb[r], W(r) t 1, qr 8: σ(r) t 1 maxa ˆ Ar(t) ˆσ(r) t 1(xt,a), max mu[r] arg maxa ˆ Ar(t) f(xt,a; W(r) t 1) 9: if σ(r) t 1 σ02 r then 10: a UCB arg maxa ˆ Ar(t) UCB(r) t 1(xt,a) 11: if σ(r) t 1(xt,a UCB) η0/ t then 12: Choose at a UCB and set υt 1 13: Receive yt = h(xt,at) + ξt and update H H {(xt,at, yt)} 14: Set Ψ(r+1) t Ψ(r+1) t 1 {(t, υt)} and Ψ(r ) t Ψ(r ) t 1 for all r [R] \ {r + 1} 15: break 16: else 17: ˆAr+1(t) {a ˆAr(t) : UCB(r) t 1(xt,a) maxa ˆ Ar(t) LCB(r) t 1(xt,a )}, r r + 1 18: end if 19: else 20: if r = 1 or ctr[r] > α04r then 21: Choose any at ˆAr(t) such that ˆσ(r) t 1(xt,a) > σ02 r and set υt 2 22: else 23: Choose at max mu[r 1] and set υt 3 24: end if 25: Receive yt = h(xt,at) + ξt and update H H {(xt,at, yt)}, ctr[r] ctr[r] + 1 26: Set Ψ(r) t Ψ(r) t 1 {(t, υt)} and Ψ(r ) t Ψ(r ) t 1 for all r [R] \ {r} 27: break 28: end if 29: end while 30: end for D.2. Proof of Theorem 3 To analyse the regret of Neural GCB, we divide the time horizon into three disjoint sets depending on how the action at that time instant was chosen. In particular, for i = {1, 2, 3}, we define Ti(s) := {t ψ(r) T : υt = i} Thus, T1, T2 and T3 consist of all the points chosen by the chosen algorithm at line 12, 21 and 23 respectively. We consider the regret incurred at time instants in each of these sets separately. We begin with T1. For any t T1(r), h(xt,a t ) h(xt,at) f(xt,a t ; W(r) t 1) + β + βt 1ˆσ(r 1) t 1 (xt,a t ) (f(xt,at; W(r) t 1) β βt 1ˆσ(r 1) t 1 (xt,at)) f(xt,at; W(r) t 1) + β + βt 1ˆσ(r 1) t 1 (xt,at) f(xt,at; W(r) t 1) + β + βt 1ˆσ(r 1) t 1 (xt,at) 2 β + 2βt 1ˆσ(r 1) t 1 (xt,at) 2 β + 2η0βt 1 Provably and Practically Efficient Neural Contextual Bandits Algorithm 3 Get Predictions 1: Input: Set of past actions and their corresponding rewards H, Index set Ψ, Current set of actions {xt,a}K a=1, Previous feedback instant tfb, Wtfb, Batch choice parameter q 2: Z λI + 1 i Ψ g(xi; W0)g(xi; W0) 3: Zfb λI + 1 i Ψ,i tfb g(xi; W0)g(xi; W0) 4: Set σ2(xt,a) g(xt,a; W0) Z 1g(xt,a; W0)/m for all a [K] 5: For fixed batch size: to train = (|Ψ| tfb == q) 6: For adaptive batch size: to train = (det(Z) > q det(Zfb)) 7: if to train then 8: W Train NN(m, L, J, η, λ, W0, {(xr, yr) H : r Ψ}), tfb |Ψ| 9: else 10: W Wtfb 11: end if 12: return {f(xt,a; W), σ(xt,a)}K a=1, W, tfb. Algorithm 4 Train NN(m, L, J, η, λ, W0, {(xi, yi)}n i=1) 1: Define L(W) = Pn i=1(f(xi; W) yi)2 + mλ W W0 2 2: for j = 0, 1, . . . , J 1 do 3: Wj+1 = Wj η WL(Wj) 4: end for 5: return WJ where we use Theorem 2 in the first step, β = Cs,L λm1/6 and βt = S + ν q 2 λ log(1/δ) + (1 ηλ)J/2p t/λ + C s,L λm1/6 (λ + C s,Lt). Since this is independent of r, this relation holds for all t T1. Consequently, the regret incurred over the time instants in T1 can be bounded as t T1 h(xt,a t ) h(xt,at) X 2 β + 2η0βt 1 2 β|T1| + 2η0βT We now consider a time instant in T2 or T3. Fix any r > 1 and t T2(r) T3(r) which implies that at ˆAr(t). Consequently, xt,at satisfies the following relation: f(xt,at; W(r 1) t 1 ) + βt 1ˆσ(r 1) t 1 (xt,at) max a ˆ Ar(t) f(xt,a ; W(r 1) t 1 ) βt 1ˆσ(r 1) t 1 (xt,a ) Note that output of the neural net is based on the parameters evaluated during the last time the neural net was trained, which we denote by tfb. In other words, f(xt,at; W(r 1) t 1 ) = f(xt,at; W(r 1) tfb ). However, ˆσ(r 1) t 1 is evaluated using all the points in Ψ(r) t 1. 
We can account for the discrepancy using Lemma 12 from Abbasi-Yadkori et al. (2011) which states that for any two positive definite matrices A B Rd d and x Rd, we have x Ax x Bx p det(A)/ det(B). Using this result along with noting that ˆσ(r 1) t 1 (xt,a) = g (xt,a; W0)V 1 t 1g(xt,a; W0), ˆσ(r 1) tfb (xt,a) = g (xt,a; W0)V 1 tfb g(xt,a; W0) and Vt 1 Vfb, we can conclude that ˆσ(r 1) tfb (xt,a) ˆσ(r 1) t 1 (xt,a) p det(Vt 1)/ det(Vfb). For any batch such that det(Vt)/ det(Vfb) 4, we have, ˆσ(r 1) tfb (xt,a) 2ˆσ(r 1) t 1 (xt,a) for all a [K]. Consequently, |h(xt,a) f(xt,at; W(r 1) tfb )| β + βtfb ˆσ(r 1) tfb (xt,a) β + 2βt 1ˆσ(r 1) t 1 (xt,a). Thus, for any such batch and a time Provably and Practically Efficient Neural Contextual Bandits instant t T2(r), using Theorem 2 we have, f(xt,at; W(r 1) t 1 ) + βt 1ˆσ(r 1) t 1 (xt,at) max a ˆ Ar(t) f(xt,a ; W(r 1) t 1 ) βt 1ˆσ(r 1) t 1 (xt,a ) = f(xt,at; W(r 1) tfb ) + βt 1ˆσ(r 1) t 1 (xt,at) f(xt,a t ; W(r 1) tfb ) βt 1ˆσ(r 1) t 1 (xt,a ) = h(xt,at) + 3βt 1ˆσ(r 1) t 1 (xt,at) + β h(xt,a ) 3βt 1ˆσ(r 1) t 1 (xt,a ) β. Since ˆAr(t) ˆAr 1(t), ˆσ(r 1) t 1 (xt,a) σ(r 1) t 1 σ021 r for all a ˆAr(t). Thus, h(xt,a t ) h(xt,at) 3βt 1ˆσ(r 1) t 1 (xt,a) + 3βt 1ˆσ(r 1) t 1 (xt,a t ) + 2 β 12σ0βt 1 2 r + 2 β. Similarly, for a time instant t T3(r), we have, h(xt,a t ) h(xt,at) f(xt,a t ; W(r 1) tfb ) + β + 2βt 1ˆσ(r 1) t 1 (xt,a t ) f(xt,at; W(r 1) tfb ) + β + 2βt 1ˆσ(r 1) t 1 (xt,at) 3βt 1ˆσ(r 1) t 1 (xt,a t ) + 3βt 1ˆσ(r 1) t 1 (xt,a) + 2 β 12σ0βt 1 2 r + 2 β 12σ0βt 1 2 r + 2 β, where the first step uses Theorem 2 and the last step uses the fact that xt,at ˆAr 1(t). For any t T2(1) T3(1), we use the trivial bound h(xt,a ) h(xt,at) 2. We now bound |T2(r)| using similar techniques outlined in Valko et al. (2013); Auer (2002). Fix any r 1. Using the fact that for all t T2(r), ˆσ(r 1) t 1 (xt,at) σ02 r and the bound on the sum of posterior standard deviations from Chowdhury & Gopalan (2017, Lemma 4), we can write σ02 r|T2(r)| X t T2(r) ˆσ(r 1) t 1 (xt,at) p 8|T2(r)|(λΓk(T) + 1) = |T2(r)| 2r q 8σ 2 0 |T2(r)|(λΓk(T) + 1) We also used the fact that the information gain of the kernel induced by the finite neural net is close to that of NTK for sufficiently large m (Kassraie & Krause, 2022, Lemma D.5). This also provides a guideline for setting the length of the exploration sequence. Specifically, we can set α0 such that |T3(r)| also satisfies |T3(r)| 2r q 8σ 2 0 |T3(r)|(λΓk(T) + 1), or equivalently, α0 = O(Γk(T)). In general, we have |T3(r)| α04r = |T3(r)| 2rp We can use these conditions to bound the regret incurred for t T 2, the subset of T2 that consists only of batches that satisfy the determinant condition. We have, t T 2 h(xt,a t ) h(xt,at) t T 2 (r) h(xt,a t ) h(xt,at) h 12σ0βt 1 2 r + 2 β i h 12σ0βt 1 2 r + 2 β i |T2(r)| |T2(1)| + 2 β|T2| + r=2 (12σ0βt 1 2 r) 2r q 8σ 2 0 |T2(r)|(λΓk(T) + 1) |T2(1)| + 2 β|T2| + 12βT p 8|T2|R(λΓk(T) + 1), Provably and Practically Efficient Neural Contextual Bandits where we used Cauchy-Schwarz inequality in the last step. Using a similar analysis, we can bound the regret incurred for t T 3, which is defined similarly to T 2, to obtain, X t T 3 h(xt,a t ) h(xt,at) |T3(1)| + 2 β|T3| + 12βT p Now, for batches for which the determinant condition is not satisfied, the regret incurred in any batch corresponding to subset index r can be trivially bounded as qr. We show that the number of such batches is bounded by O(Γk(T)). Fix a r 1. 
Let there be Br such batches for which the determinant condition is not satisfied and T (r) denote the set of time instants at which such batches begin. Then, det(Vt) det(VT +1) Since log (det(VT +1)/ det(λI)) is O(Γk(T)), we can conclude that Br is also O(Γk(T)). Thus, regret incurred in all such batches can be bounded as O(maxr qrΓk(T) log T) as there are R = log T batches. We can combine the two cases to obtain the regret incurred over T2 and T3 as X t T2 h(xt,a t ) h(xt,at) |T2(1)| + 2 β|T2| + 12βT p 8|T2|R(λΓk(T) + 1) + O(max r qrΓk(T) log T) t T3 h(xt,a t ) h(xt,at) |T3(1)| + 2 β|T3| + 12βT p 8|T3|Rα0 + O(max r qrΓk(T) log T). The overall regret is now obtained by adding the regret incurred over T1, T2 and T3. We have, t=1 h(xt,a ) h(xt,at) t T1 h(xt,a ) h(xt,at) + X t T2 h(xt,a ) h(xt,at) + X t T3 h(xt,a ) h(xt,at) 2 β(|T1| + |T2| + |T3|) + |T2(1)| + |T3(1)| + 2η0βT 8|T2|R(λΓk(T) + 1) + O(max r qrΓk(T) log T) 2 βT + 2η0βT 8T(λΓk(T) + 1 + α0) + O(max r qrΓk(T) log T), where we again use the Cauchy Schwarz inequality in the last step along with the fact that |T2(1)| and |T3(1)| satisfy O(Γk(T)). Since m poly(T), the first term βT is O(1) and using the values of J and η specified in Assumption 2 along with the bound on m, we have βT β 2S + ν p 2λ 1 log(1/δ). Consequently, the regret incurred by Neural GCB satisfies O( p TΓk(T) log(1/δ) + p T log(1/δ) + maxr qrΓk(T) log T). The above analysis carries through almost as is for the case of adaptive batch sizes. Since the batches always satisfy the determinant condition with qr instead of 4, we do not need to separately consider the case where the determinant condition is not met. Using the same steps as above, we can conclude that the regret incurred by Neural GCB when run with adaptive step sizes is O( p TΓk(T) log(1/δ) maxr qr). Thus for the adaptive case, we incur an additional multiplicative factor. This can be explained by the fact that in the adaptive batch setting, the number of times the neural nets are retrained is O(Γk(T)). However, in the fixed batch setting, even if we use the optimal batch size of p T/Γk(T), we end up retraining the neural nets about O( p TΓk(T)) times, which is more than in the case of the adaptive batch setting. The more frequent of retraining of neural nets helps us achieve a tighter regret bound in the case of fixed batch setting. E. Empirical Studies In this section, we provide further details of the setup used during our experiments. For completeness, we first restate the construction and preprocessing of the datasets followed by the experimental setup. We then provide the complete array of results for the all the algorithm on different datasets. Provably and Practically Efficient Neural Contextual Bandits E.1. Datasets For each of the synthetic datasets, we construct a contextual bandit problem with a feature dimension of d = 10 and K = 4 actions per context running over a time horizon of T = 2000 rounds. The set of context vectors {{xt,a}K a=1}T t=1 are drawn uniformly from the unit sphere. Similar to Zhou et al. (2020), we consider the following three reward functions: h1(x) = 4|a x|2; h2(x) = 4 sin2(a x); h3(x) = Ax 2. (27) For the above functions, the vector a is drawn uniformly from the unit sphere and each entry of matrix A is randomly generated from N(0, 0.25). We also consider two real datasets for classification namely Mushroom and Statlog (Shuttle), both of which are available on the UCI repository (Dua & Graff, 2017). 
Mushroom: The Mushroom dataset consists of 8124 samples with 22 features where each data point is labelled as an edible or poisonous mushroom, resulting in a binary classification problem. For the purpose of the experiments, we carry out some basic preprocessing on this dataset. We drop the attributed labeled veil-type as it is the same for all the datapoints resulting in a d = 21 dimensional feature vector. Furthermore, we randomly select 1000 points from each class to create a smaller dataset of size 2000 to run the experiements. Statlog: The Statlog (Shuttle) dataset consists of 58000 entries with d = 7 attributes (we do not consider time as an attribute) each. The original dataset is divided into 7 classes. Since the data is skewed, we convert this into a binary classification problem by combining classes. In particular, we combine the five smallest classes, namely, 2, 3, 5, 6, 7 and combine them to form a single class and drop class 4 resulting in a binary classification problem, with points labelled as 1 forming the first class and ones with labels in {2, 3, 5, 6, 7} forming the second. Once again, we select a subset of 2000 points by randomly choosing 1000 points from each class to use for the experiments. Lastly, we normalize the features of this dataset. The classification problem is then converted into a contextual bandit problem using techniques outlined in (Li et al., 2010), which is also adopted in Zhou et al. (2020) and (Gu et al., 2021). Specifically, each datapoint in the dataset (x, y) Rd R is transformed into K vectors of the form x(1) = (x, 0, . . . , 0) . . . , x(K) = (0, . . . , 0, x) RKd corresponding to the K actions associated with context x. Here K denotes the number of classes in the original classification problem. The reward function is set to h(x(k)) = 1{y = k}, that is, the agent receives a reward of 1 if they classify the context correctly and 0 otherwise. As mentioned above, we have d = 21 for Mushroom dataset and d = 7 for the Shuttle dataset and K = 2 for both of them. E.2. Experimental Setting For all the experiments, the rewards are generated by adding zero mean Gaussian noise with a standard deviation of 0.1 to the reward function. All the experiments are run for a time horizon of T = 2000. We report the regret averaged over 10 Monte Carlo runs with different random seeds along with an error bar of one standard deviation on either side. For all the datasets, we generate new rewards for each Monte Carlo run by changing the noise. Furthermore, for the real datasets, we also shuffle the context vectors for each Monte Carlo run. Setting the hyperparameters: For all the algorithms, the parameter ν, the standard deviation proxy for the sub-Gaussian noise, to 0.1, the standard deviation of the Gaussian noise added. Also, the RKHS norm of the reward, S to 4 for synthetic functions, and 1 for real datasets in all the algorithms. For all the experiments, we perform a grid search for λ and η over {0.05, 0.1, 0.5} and {0.001, 0.01, 0.1}, respectively, for each algorithm and choose the best ones for the corresponding algorithm and reward function. The exploration parameter βt is set to the value prescribed by each algorithm. For Neural UCB and Neural TS, the prescribed value for the exploration parameter in the original work was given as: log det(Vt) log m L4t5/3λ 1/6 + 2 log(1/δ) + S + (λ + C3t L) h (1 ηmλ)J/2p t/λ + m 1/6p log m L7/2t5/3λ 5/3(1 + p Provably and Practically Efficient Neural Contextual Bandits where β0 = p 1 + C1m 1/6 log m L4t7/6λ 7/6. 
We ignore the smaller order terms and set βt = 2S + log det(Vt) det(λI) + 2 log(1/δ) as (λ + C3t L)(1 ηmλ)J/2p t/λ S for given choice of parameters as shown in Zhou et al. (2020). For Neural GCB and Sup NNUCB, we apply the same process of ignoring the smaller order terms and bounding the additional constant term by S to obtain βt = 2S +ν q 2 λ log(1/δ), which is used in the algorithms. Also for Neural GCB, η0 was set to 0.2 and α0 to 20 log T, as suggested in Sec. D, for all experiments. The number of epochs is set to 200 for synthetic datasets and Mushroom, and to 400 for Statlog. Neural Net architecture: We consider a 2 layered neural net as described in Equation (2) for all the experiments. We perform two different sets of experiments with two different activation functions, namely σ1 (or equivalently, Re LU) and σ2. For the experiments with σ1 as the activation, we set the number of hidden neurons to m = 20 and m = 50 for synthetic and real datasets respectively. For those with σ2, m is set to 30 and 80 for synthetic and real datasets, respectively. Training Schedule: For the experiments with sequential algorithms, we retrain the neural nets at every step, including Neural GCB. For the batched algorithms, we consider a variety of training schedules. Particularly, for each setting (i.e., dataset and activation function) we consider two fixed and two adaptive batch size training schedules, which are denoted by F1, F2 and A1, A2 respectively. We specify the training schedules below for different dataset and activation functions beginning with those for σ1 (Re LU). These were chosen to ensure that both algorithms have roughly the same running time to allow for a fair comparison. 1. For Batched Neural UCB run on a synthetic function: F1: 200 batches, or equivalently retrain after every 10 steps. F2: 300 batches, or equivalently retrain after every 6 7 steps. A1: q = 2 A2: q = 3 2. For Batched Neural UCB run on Mushroom: F1: 100 batches, or equivalently retrain after every 20 steps. F2: 200 batches, or equivalently retrain after every 10 steps. A1: q = 500 A2: q = 700 3. For Batched Neural UCB run on Shuttle: F1: 100 batches, or equivalently retrain after every 20 steps. F2: 200 batches, or equivalently retrain after every 10 steps. A1: q = 5 A2: q = 10 4. For Neural GCB run on a synthetic function: F1: q1 = 4, q2 = 20, qr = 40 for r 3 (i.e., retrain after qr steps) F2: q1 = 2, q2 = 10, qr = 20 for r 3. A1: q1 = 1.2, q2 = 1.8, qr = 2.7 for r 3. A2: q1 = 1.5, q2 = 2.25, qr = 3.375 for r 3. 5. For Neural GCB run on Mushroom: F1: q1 = 20, q2 = 40, qr = 80 for r 3 (i.e., retrain after qr steps) F2: q1 = 10, q2 = 20, qr = 40 for r 3. A1: q1 = 300, q2 = 1200, qr = 4800 for r 3. A2: q1 = 500, q2 = 2000, qr = 8000 for r 3. 6. For Neural GCB run on Shuttle: Provably and Practically Efficient Neural Contextual Bandits F1: q1 = 20, q2 = 40, qr = 80 for r 3 (i.e., retrain after qr steps) F2: q1 = 10, q2 = 20, qr = 40 for r 3. A1: q1 = 5, q2 = 12.5, qr = 31.25 for r 3. A2: q1 = 10, q2 = 25, qr = 62.5 for r 3. Similarly, for σ2, the training schedules are given as follows: 1. For Batched Neural UCB run on a synthetic function: F1: 200 batches, or equivalently retrain after every 10 steps. F2: 300 batches, or equivalently retrain after every 6 7 steps. A1: q = 3 A2: q = 4 2. For Batched Neural UCB run on Mushroom: F1: 100 batches, or equivalently retrain after every 20 steps. F2: 200 batches, or equivalently retrain after every 10 steps. A1: q = 500 A2: q = 700 3. 
For Batched Neural UCB run on Shuttle: F1: 50 batches, or equivalently retrain after every 40 steps. F2: 100 batches, or equivalently retrain after every 20 steps. A1: q = 5 A2: q = 10 4. For Neural GCB run on a synthetic function: F1: q1 = 4, q2 = 20, qr = 40 for r 3 (i.e., retrain after qr steps) F2: q1 = 2, q2 = 10, qr = 20 for r 3. A1: q1 = 1.2, q2 = 1.8, qr = 2.7 for r 3. A2: q1 = 1.5, q2 = 2.25, qr = 3.375 for r 3. 5. For Neural GCB run on Mushroom: F1: q1 = 10, q2 = 30, qr = 60 for r 3 (i.e., retrain after qr steps) F2: q1 = 10, q2 = 20, qr = 40 for r 3. A1: q1 = 300, q2 = 1200, qr = 4800 for r 3. A2: q1 = 500, q2 = 2000, qr = 8000 for r 3. 6. For Neural GCB run on Shuttle: F1: q1 = 5, q2 = 12, qr = 31 for r 3 (i.e., retrain after qr steps) F2: q1 = 5, q2 = 10, qr = 20 for r 3 (i.e., retrain after qr steps) A1: qr = 3 for all r A2: qr = 4 for all r We use this notation of F1, F2, A1 and A2 to refer to the training schedules in the plots shown in the next section. Provably and Practically Efficient Neural Contextual Bandits 0 250 500 750 1000 1250 1500 1750 2000 Time (No. of Rounds) Lin UCB Neural UCB Neural TS Sup NNUCB Neural GCB (a) h1(x) with σ1(x) 0 250 500 750 1000 1250 1500 1750 2000 Time (No. of Rounds) Neural UCB Neural TS Sup NNUCB Neural GCB (b) h1(x) with σ2(x) 50 100 150 200 250 300 350 Time (in seconds) BNUCB-F1 BNUCB-F2 NGCB-F1 NGCB-F2 NUCB-seq NGCB-seq (c) h1(x) with fixed batch and σ1(x) 50 100 150 200 250 300 350 Time (in seconds) BNUCB-A1 BNUCB-A2 NGCB-A1 NGCB-A2 NUCB-seq NGCB-seq (d) h1(x) with adaptive batch and σ1(x) 0 100 200 300 400 500 600 Time (in seconds) BNUCB-F1 BNUCB-F2 NGCB-F1 NGCB-F2 NUCB-seq NGCB-seq (e) h1(x) with fixed batch and σ2(x) 0 100 200 300 400 500 600 Time (in seconds) BNUCB-A1 BNUCB-A2 NGCB-A1 NGCB-A2 NUCB-seq NGCB-seq (f) h1(x) with adaptive batch and σ2(x) Figure 2: Plots with reward function h1(x) In this section, we plot the performance of Neural GCB against several baselines for different experimental setups and reward functions. For each reward function, we plot the following 6 figures: Fig. (a): We plot the cumulative regret against time (number of rounds) for various baselines for the case where the activation function of the underlying neural net is σ1, or equivalently, Re LU. Fig (b): We plot the cumulative regret against time (number of rounds) for various baselines for the case where the activation function of the underlying neural net is σ2. Fig (c)-(f): We compare the regret and running time (in seconds) between the batched and the sequential versions of Neural UCB and Neural GCB. We plot the regret incurred against time taken (running time, in seconds) for Batched Neural UCB (BNUCB), (Batched)-Neural GCB (BNGCB), Neural UCB (NUCB-seq), (Sequential) Neural GCB (NGCBseq) under different training schedules for 10 different runs. In particular, in (c) and (e), we consider fixed batch sizes with σ1 and σ2 as activation functions respectively and in (d) and (f) with consider adaptive batch sizes with σ1 and σ2 as activation functions respectively. The batch sizes used in the plots are described in the previous section. From the figures, it is evident that Neural GCB outperforms all other existing algorithms on variety of datasets, including both synthetic and real world datasets. Moreover, the conclusion holds equally well for different activation functions, further bolstering the practical efficiency of the proposed algorithm. 
Additionally, it can be noted that the regret incurred by the algorithms in experiments with σ2 as the activation is less than that for the experiments with σ1 as the activation, thereby demonstrating the effect of smooth kernels on the performance of the algorithms in practice. we would like to point out that we only focus on algorithms with theoretical guarantees in this study. There are various approximate TS algorithms that also have good empirical performance (Riquelme et al., 2018) but lack such provable efficiency, which is central to the motivation of our paper. In addition, the formulation of contextual bandits in Riquelme et al. (2018) is different from the usual one in the literature (Valko et al., 2013; Zhou et al., 2020; Zhang et al., 2021; Provably and Practically Efficient Neural Contextual Bandits 0 250 500 750 1000 1250 1500 1750 2000 Time (No. of Rounds) Lin UCB Neural UCB Neural TS Sup NNUCB Neural GCB (a) h2(x) with σ1(x) 0 250 500 750 1000 1250 1500 1750 2000 Time (No. of Rounds) Neural UCB Neural TS Sup NNUCB Neural GCB (b) h2(x) with σ2(x) 50 100 150 200 250 300 350 Time (in seconds) BNUCB-F1 BNUCB-F2 NGCB-F1 NGCB-F2 NUCB-seq NGCB-seq (c) h2(x) with fixed batch and σ1(x) 50 100 150 200 250 300 350 Time (in seconds) BNUCB-A1 BNUCB-A2 NGCB-A1 NGCB-A2 NUCB-seq NGCB-seq (d) h2(x) with adaptive batch and σ1(x) 0 100 200 300 400 500 600 Time (in seconds) BNUCB-F1 BNUCB-F2 NGCB-F1 NGCB-F2 NUCB-seq NGCB-seq (e) h2(x) with fixed batch and σ2(x) 0 100 200 300 400 500 600 Time (in seconds) BNUCB-A1 BNUCB-A2 NGCB-A1 NGCB-A2 NUCB-seq NGCB-seq (f) h2(x) with adaptive batch and σ2(x) Figure 3: Plots with reward function h2(x) 0 250 500 750 1000 1250 1500 1750 2000 Time (No. of Rounds) Lin UCB Neural UCB Neural TS Sup NNUCB Neural GCB (a) h3(x) with σ1(x) 0 250 500 750 1000 1250 1500 1750 2000 Time (No. of Rounds) Neural UCB Neural TS Sup NNUCB Neural GCB (b) h3(x) with σ2(x) 0 100 200 300 400 500 600 700 800 Time (in seconds) BNUCB-F1 BNUCB-F2 NGCB-F1 NGCB-F2 NUCB-seq NGCB-seq (c) h3(x) with fixed batch and σ1(x) 0 100 200 300 400 500 600 700 800 Time (in seconds) BNUCB-A1 BNUCB-A2 NGCB-A1 NGCB-A2 NUCB-seq NGCB-seq (d) h3(x) with adaptive batch and σ1(x) 25 50 75 100 125 150 175 200 Time (in seconds) BNUCB-F1 BNUCB-F2 NGCB-F1 NGCB-F2 NUCB-seq NGCB-seq (e) h3(x) with fixed batch and σ2(x) 40 60 80 100 120 140 160 180 200 Time (in seconds) BNUCB-A1 BNUCB-A2 NGCB-A1 NGCB-A2 NUCB-seq NGCB-seq (f) h3(x) with adaptive batch and σ2(x) Figure 4: Plots with reward function h3(x) Provably and Practically Efficient Neural Contextual Bandits 0 250 500 750 1000 1250 1500 1750 2000 Time (No. of Rounds) Lin UCB Neural UCB Neural TS Sup NNUCB Neural GCB (a) Mushroom with σ1(x) 0 250 500 750 1000 1250 1500 1750 2000 Time (No. of Rounds) Neural UCB Neural TS Sup NNUCB Neural GCB (b) Mushroom with σ2(x) 200 300 400 500 600 Time (in seconds) BNUCB-F1 BNUCB-F2 NGCB-F1 NGCB-F2 NUCB-seq NGCB-seq (c) Mushroom with fixed batch and σ1(x) 200 300 400 500 600 Time (in seconds) BNUCB-A1 BNUCB-A2 NGCB-A1 NGCB-A2 NUCB-seq NGCB-seq (d) Mushroom with adaptive batch and σ1(x) 300 400 500 600 700 800 900 Time (in seconds) BNUCB-F1 BNUCB-F2 NGCB-F1 NGCB-F2 NUCB-seq NGCB-seq (e) Mushroom with fixed batch and σ2(x) 300 400 500 600 700 800 900 Time (in seconds) BNUCB-A1 BNUCB-A2 NGCB-A1 NGCB-A2 NUCB-seq NGCB-seq (f) Mushroom with adaptive batch and σ2(x) Figure 5: Plots for the dataset Mushroom 0 250 500 750 1000 1250 1500 1750 2000 Time (No. 
of Rounds) Lin UCB Neural UCB Neural TS Sup NNUCB Neural GCB (a) Statlog with σ1(x) 0 250 500 750 1000 1250 1500 1750 2000 Time (No. of Rounds) Neural UCB Neural TS Sup NNUCB Neural GCB (b) Statlog with σ2(x) 100 200 300 400 500 600 Time (in seconds) BNUCB-F1 BNUCB-F2 NGCB-F1 NGCB-F2 NUCB-seq NGCB-seq (c) Statlog with fixed batch and σ1(x) 100 200 300 400 500 600 Time (in seconds) BNUCB-A1 BNUCB-A2 NGCB-A1 NGCB-A2 NUCB-seq NGCB-seq (d) Statlog with adaptive batch and σ1(x) 100 200 300 400 500 600 700 800 Time (in seconds) BNUCB-F1 BNUCB-F2 NGCB-F1 NGCB-F2 NUCB-seq NGCB-seq (e) Statlog with fixed batch and σ2(x) 100 200 300 400 500 600 700 800 Time (in seconds) BNUCB-A1 BNUCB-A2 NGCB-A1 NGCB-A2 NUCB-seq NGCB-seq (f) Statlog with adaptive batch and σ2(x) Figure 6: Plots for the dataset Statlog (Shuttle) Provably and Practically Efficient Neural Contextual Bandits Gu et al., 2021; Kassraie & Krause, 2022), considered in our work, which also makes a direct empirical comparison infeasible. Lastly, the extensive study with different training schedules shows that batched version performs almost as well as the fully sequential version in terms of regret on all the datasets while having a considerably smaller running time. It can be noted from the plots that Neural GCB has a superior performance compared to Batched-Neural UCB in terms of regret for comparable running times on different datasets and with different activation functions which further strengthens the case of Neural GCB as a practically efficient algorithm. All the experiments were carried out using Python3 on a computer (CPU) with 12 GB RAM and Intel i7 processor (3.4 GHz) with an overall compute time of around 250-300 hours.