# Learning Neural Contextual Bandits Through Perturbed Rewards

Published as a conference paper at ICLR 2022

Yiling Jia¹, Weitong Zhang², Dongruo Zhou², Quanquan Gu², Hongning Wang¹
¹Department of Computer Science, University of Virginia
²Department of Computer Science, University of California, Los Angeles
yj9xs@virginia.edu, wt.zhang@ucla.edu, drzhou@cs.ucla.edu, qgu@cs.ucla.edu, hw5x@virginia.edu

ABSTRACT

Thanks to the power of representation learning, neural contextual bandit algorithms demonstrate remarkable performance improvement against their classical counterparts. But because their exploration has to be performed in the entire neural network parameter space to obtain nearly optimal regret, the resulting computational cost is prohibitively high. We perturb the rewards when updating the neural network to eliminate the need for explicit exploration and the corresponding computational overhead. We prove that an $\tilde{O}(\tilde{d}\sqrt{T})$ regret upper bound is still achievable under standard regularity conditions, where $T$ is the number of rounds of interactions and $\tilde{d}$ is the effective dimension of a neural tangent kernel matrix. Extensive comparisons with several benchmark contextual bandit algorithms, including two recent neural contextual bandit models, demonstrate the effectiveness and computational efficiency of our proposed neural bandit algorithm.

1 INTRODUCTION

Contextual bandit is a well-formulated abstraction of many important real-world problems, including content recommendation (Li et al., 2010; Wu et al., 2016), online advertising (Schwartz et al., 2017; Nuara et al., 2018), and mobile health (Lei et al., 2017; Tewari & Murphy, 2017). In such problems, an agent iteratively interacts with an environment to maximize its accumulated rewards over time. Its essence is sequential decision-making under uncertainty. Because the reward from the environment for a chosen action (also referred to as an arm in the literature) under each context is stochastic, a no-regret learning algorithm needs to explore the problem space for improved reward estimation, i.e., learning the mapping from an arm and its context¹ to the expected reward.

Linear contextual bandit algorithms (Abbasi-Yadkori et al., 2011; Li et al., 2010), which assume the reward mapping is a linear function of the context vector, dominate the community's attention in the study of contextual bandits. Though theoretically sound and practically effective, their linear reward mapping assumption is inadequate for capturing possible complex non-linear relations between the context vector and the reward. This has motivated extended studies in parametric bandits, such as generalized linear bandits (Filippi et al., 2010; Faury et al., 2020) and kernelized bandits (Chowdhury & Gopalan, 2017; Krause & Ong, 2011). Recently, to unleash the power of representation learning, deep neural networks (DNNs) have also been introduced to learn the underlying reward mapping directly. In (Zahavy & Mannor, 2019; Riquelme et al., 2018; Xu et al., 2020), a deep neural network is applied to provide a feature mapping, and exploration is performed at the last layer. NeuralUCB (Zhou et al., 2020) and NeuralTS (Zhang et al., 2020) explore the entire neural network parameter space to obtain nearly optimal regret using the neural tangent kernel technique (Jacot et al., 2018). These neural contextual bandit algorithms significantly boosted empirical performance compared to their classical counterparts.
Nevertheless, a major practical concern of existing neural contextual bandit algorithms is their added computational cost when performing exploration. Take the recently developed NeuralUCB and NeuralTS for example. Their construction of the high-probability confidence set for model exploration depends on the dimensionality of the network parameters and the learned context vector representations, which is often very large for DNNs. For instance, a well-performing neural bandit solution often has a number of parameters on the order of hundreds of thousands. It is prohibitively expensive to compute the inverse of the induced covariance matrix over such a huge number of parameters, as required by their construction of confidence sets. As a result, approximations, e.g., only using the diagonal of the covariance matrix (Zhou et al., 2020; Zhang et al., 2020), are employed to make such algorithms operational in practice. But there is no theoretical guarantee for such diagonal approximations, which directly leads to a gap between the theoretical and empirical performance of these neural bandit algorithms.

¹When no ambiguity is invoked, we refer to the feature vector for an arm and its context as a context vector.

In this work, to alleviate the computational overhead caused by the expensive exploration, we propose to eliminate explicit model exploration by learning a neural bandit model with perturbed rewards. At each round of model update, we inject pseudo noise generated from a zero-mean Gaussian distribution into the observed reward history. With the induced randomization, sufficient exploration is achieved by simply pulling the arm with the highest estimated reward. This brings a considerable advantage over existing neural bandit algorithms: no additional computational cost is needed to obtain no regret. We rigorously prove that with high probability the algorithm obtains an $\tilde{O}(\tilde{d}\sqrt{T})$ regret, where $\tilde{d}$ is the effective dimension of a neural tangent kernel matrix and $T$ is the number of rounds of interactions. This result recovers existing regret bounds for the linear setting, where the effective dimension equals the input feature dimension. Besides, our extensive empirical evaluations demonstrate the strong advantage in efficiency and effectiveness of our solution against a rich set of state-of-the-art contextual bandit solutions on both synthetic and real-world datasets.

2 RELATED WORK

Most recently, attempts have been made to incorporate DNNs into contextual bandit algorithms. Several existing works study neural-linear bandits (Riquelme et al., 2018; Zahavy & Mannor, 2019), where exploration is performed on the last layer of the DNN. Under the neural tangent kernel (NTK) framework (Jacot et al., 2018), NeuralUCB (Zhou et al., 2020) constructs confidence sets with DNN-based random feature mappings to perform upper confidence bound based exploration. NeuralTS (Zhang et al., 2020) samples from a posterior distribution constructed with a similar technique. However, as the exploration is performed in the induced random feature space, the added computational overhead is prohibitively high, which makes such solutions impractical. The authors suggested diagonal approximations of the resulting covariance matrix, which however leave the promised theoretical guarantees of those algorithms up in the air. Reward perturbation based exploration has been studied in a number of classical bandit models (Kveton et al., 2019a; 2020; 2019b).
In a context-free $K$-armed bandit setting, Kveton et al. (2019b) proposed to estimate each arm's reward over a perturbed history and select the arm with the highest estimated reward at each round. Such an arm selection strategy is proved to be optimistic with a sufficiently high probability. Later this strategy was extended to linear and generalized linear bandits (Kveton et al., 2019a; 2020). In (Kveton et al., 2020), the authors suggested its application to neural network models; but only some simple empirical evaluations were provided, without any theoretical justification. Our work for the first time provides a rigorous regret analysis of neural contextual bandits with perturbation-based exploration: a sublinear regret bound is still achievable in terms of the number of interactions between the agent and the environment.

3 NEURAL BANDIT LEARNING WITH PERTURBED REWARDS

We study the contextual bandit problem with a finite set of $K$ arms, where each arm is associated with a $d$-dimensional context vector: $x_i \in \mathbb{R}^d$ for $i \in [K]$. At each round $t \in [T]$, the agent needs to select one of the arms, denoted as $a_t$, and receives its reward $r_{t,a_t}$, which is generated as $r_{t,a_t} = h(x_{a_t}) + \eta_t$. In particular, $h(x)$ represents the unknown underlying reward mapping function satisfying $0 \le h(x) \le 1$ for any $x$, and $\eta_t$ is an $R$-sub-Gaussian random variable satisfying $\mathbb{E}[\exp(\mu \eta_t)] \le \exp(\mu^2 R^2)$ for all $\mu \in \mathbb{R}$. The goal is to minimize the pseudo regret over $T$ rounds:

$$R_T = \mathbb{E}\Big[\sum_{t=1}^{T}\big(r_{t,a_t^*} - r_{t,a_t}\big)\Big], \qquad (3.1)$$

where $a_t^*$ is the optimal arm with the maximum expected reward.

Algorithm 1 Neural bandit with perturbed reward (NPR)
1: Input: number of rounds $T$, regularization coefficient $\lambda$, perturbation parameter $\nu$, network width $m$, network depth $L$.
2: Initialization: $\theta_0 = [\mathrm{vec}(W_1)^\top, \dots, \mathrm{vec}(W_L)^\top]^\top \in \mathbb{R}^p$ with Gaussian initialization: for $1 \le l \le L-1$, $W_l = \begin{pmatrix} W & 0 \\ 0 & W \end{pmatrix}$ with each entry of $W$ sampled independently from $N(0, 4/m)$; $W_L = (w^\top, -w^\top)$ with each entry of $w$ sampled independently from $N(0, 2/m)$.
3: for $t = 1, \dots, T$ do
4:   if $t > K$ then
5:     Pull arm $a_t$ and receive reward $r_{t,a_t}$, where $a_t = \arg\max_{i \in [K]} f(x_i; \theta_{t-1})$.
6:     Generate $\{\gamma_s^t\}_{s \in [t]} \sim N(0, \nu^2)$.
7:     Set $\theta_t$ to the output of gradient descent for solving Eq (3.2).
8:   else
9:     Pull arm $a_t = t$, so that each arm is pulled once during the first $K$ rounds.
10:  end if
11: end for

To deal with the potential non-linearity of $h(x)$ and unleash the representation learning power of DNNs, we adopt a fully connected neural network $f(x; \theta)$ to approximate $h(x)$:

$$f(x; \theta) = \sqrt{m}\, W_L\, \phi\big(W_{L-1}\, \phi(\cdots \phi(W_1 x))\big),$$

where $\phi(x) = \mathrm{ReLU}(x)$, $\theta = [\mathrm{vec}(W_1)^\top, \dots, \mathrm{vec}(W_L)^\top]^\top \in \mathbb{R}^p$ with $p = m + md + m^2(L-1)$, and depth $L \ge 2$. Each hidden layer is assumed to have the same width (i.e., $m$) for convenience in the later analysis; this does not affect the conclusions of our theoretical analysis.

Existing neural bandit solutions perform explicit exploration in the entire model space (Zhou et al., 2020; Zhang et al., 2020; Zahavy & Mannor, 2019; Riquelme et al., 2018; Xu et al., 2020), which introduces prohibitive computational cost. Oftentimes the overhead is so high that approximation has to be employed (Zhou et al., 2020; Zhang et al., 2020), which unfortunately breaks the theoretical promise of these algorithms. In our proposed model, to eliminate such explicit model exploration in neural bandits, a randomization strategy is introduced in the neural network update. We name the resulting solution Neural bandit with Perturbed Rewards, or NPR in short.
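To make the setup concrete, the following is a minimal PyTorch-style sketch of the reward network $f(x;\theta)$ defined above and the greedy arm selection in line 5 of Algorithm 1. It is an illustration rather than the authors' released implementation: the class and function names are ours, the network uses standard `nn.Linear` layers (the paper's bias-free, block-structured initialization of $\theta_0$ is omitted), and the width and depth values are placeholders.

```python
import torch
import torch.nn as nn


class RewardNetwork(nn.Module):
    """Fully connected network f(x; theta): ReLU activations, width m in every
    hidden layer, and the sqrt(m) output scaling used in the paper's formulation."""

    def __init__(self, d: int, m: int = 64, depth: int = 3):
        super().__init__()
        layers = [nn.Linear(d, m), nn.ReLU()]
        for _ in range(depth - 2):
            layers += [nn.Linear(m, m), nn.ReLU()]
        layers.append(nn.Linear(m, 1))
        self.net = nn.Sequential(*layers)
        self.m = m

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # f(x; theta) = sqrt(m) * W_L phi(W_{L-1} ... phi(W_1 x))
        return self.m ** 0.5 * self.net(x).squeeze(-1)


def select_arm(model: RewardNetwork, contexts: torch.Tensor) -> int:
    """Greedy arm selection (line 5 of Algorithm 1): no exploration bonus is
    computed; exploration comes entirely from the perturbed-reward updates."""
    with torch.no_grad():
        return int(torch.argmax(model(contexts)).item())
```

The key design point is that arm selection is a single forward pass followed by an argmax; all randomization is deferred to the model update sketched after Eq (3.2) below.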
In NPR, at round $t$, the neural model is learned over the $t$ rewards collected so far, perturbed with designed pseudo noise:

$$\min_{\theta}\; \mathcal{L}(\theta) = \sum_{s=1}^{t}\big(f(x_{a_s}; \theta) - (r_{s,a_s} + \gamma_s^t)\big)^2/2 + m\lambda\,\|\theta - \theta_0\|_2^2/2, \qquad (3.2)$$

where $\{\gamma_s^t\}_{s=1}^{t} \sim N(0, \sigma^2)$ are Gaussian random variables that are independently sampled in each round $t$, and $\sigma$ is a hyper-parameter that controls the strength of the perturbation (and thus the exploration) in NPR. We use an $l_2$-regularized square loss for model estimation, where the regularization centers at the random initialization $\theta_0$ with trade-off parameter $\lambda$.

The detailed procedure of NPR is given in Algorithm 1. The algorithm starts by pulling all candidate arms once. This guarantees that for any arm, NPR is sufficiently optimistic compared to the true reward with respect to its approximation error (Lemma 4.4). Once all $K$ arms have been pulled, the algorithm pulls the arm with the highest estimated reward, $a_t = \arg\max_i f(x_i; \theta_{t-1})$. After the feedback $r_{t,a_t}$ is received, the model perturbs the entire reward history so far with a freshly sampled noise sequence $\{\gamma_s^t\}_{s=1}^{t}$, and updates the neural network on $\{(x_{a_s}, r_{s,a_s} + \gamma_s^t)\}_{s=1}^{t}$ using gradient descent. In the regret analysis in Section 4, we prove that the variance from the added $\{\gamma_s^t\}_{s \in [t]}$ leads to the necessary optimism for exploration (Abbasi-Yadkori et al., 2011). We adopt gradient descent for analysis convenience, while stochastic gradient descent can also be used to solve the optimization problem with a similar theoretical guarantee based on recent works (Allen-Zhu et al., 2019; Zou et al., 2019).

Compared with existing neural contextual bandit algorithms (Zhou et al., 2020; Zhang et al., 2020; Zahavy & Mannor, 2019; Riquelme et al., 2018; Xu et al., 2020), NPR does not need any added computation for model exploration beyond the regular neural network update. This greatly alleviates the overhead in computational resources (both space and time) and makes NPR sufficiently general to be applied to practical problems. More importantly, our theoretical analysis directly corresponds to its actual behavior when applied, as no approximation is needed.
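Before turning to the analysis, the per-round model update (lines 6-7 of Algorithm 1, i.e., Eq (3.2)) can be sketched as follows. This is a hedged illustration and not the authors' code: it assumes the `RewardNetwork` sketch shown earlier, uses full-batch SGD steps as a stand-in for the gradient descent analyzed in the paper, and the hyper-parameter values are placeholders.

```python
import torch


def npr_update(model, contexts_hist, rewards_hist, theta0_vec,
               sigma=0.1, lam=1.0, lr=1e-3, num_steps=100):
    """One round of the NPR model update (Eq. 3.2), sketched:
    re-perturb the entire reward history with fresh N(0, sigma^2) noise,
    then run gradient descent on the l2-regularized squared loss
    centered at the flattened initialization theta0_vec."""
    m = model.m
    # Fresh perturbation of the whole history at every round.
    perturbed = rewards_hist + sigma * torch.randn_like(rewards_hist)
    params = list(model.parameters())
    optimizer = torch.optim.SGD(params, lr=lr)
    for _ in range(num_steps):
        optimizer.zero_grad()
        pred = model(contexts_hist)
        # theta0_vec must be the (detached) concatenation of the initial
        # parameters in the same order as model.parameters().
        theta = torch.cat([p.reshape(-1) for p in params])
        loss = 0.5 * ((pred - perturbed) ** 2).sum() \
               + 0.5 * m * lam * ((theta - theta0_vec) ** 2).sum()
        loss.backward()
        optimizer.step()
    return model
```

Note that the only extra work relative to an ordinary regression update is sampling one Gaussian vector per round; no covariance matrix or confidence set is ever formed.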
4 REGRET ANALYSIS

In this section, we provide a finite-time regret analysis of NPR, where the time horizon $T$ is set beforehand. The theoretical analysis is built on recent studies of the generalization of DNN models (Cao & Gu, 2020; 2019; Chen et al., 2020; Daniely, 2017; Arora et al., 2019), which show that with (stochastic) gradient descent, the learned parameters of a DNN stay in a particular regime, with the generalization error characterized by the best function in the corresponding neural tangent kernel (NTK) space (Jacot et al., 2018). We leave the background of NTK to the appendix and focus on the key steps of our proof in this section. The analysis starts with the following lemma for the set of context vectors $\{x_i\}_{i=1}^{K}$.

Lemma 4.1. There exists a positive constant $C$ such that for any $\delta \in (0,1)$, with probability at least $1-\delta$, when $m \ge C K^4 L^6 \log(K^2 L/\delta)/\lambda_0^4$, there exists a $\theta^* \in \mathbb{R}^p$ such that for all $i \in [K]$,

$$h(x_i) = \langle g(x_i; \theta_0),\, \theta^* - \theta_0\rangle, \qquad \sqrt{m}\,\|\theta^* - \theta_0\|_2 \le \sqrt{2\,\mathbf{h}^\top \mathbf{H}^{-1}\mathbf{h}}, \qquad (4.1)$$

where $\mathbf{H}$ is the NTK matrix defined on the context set and $\mathbf{h} = (h(x_1), \dots, h(x_K))^\top$. Details of $\mathbf{H}$ can be found in the appendix.

This lemma suggests that with a sufficient neural network width $m$, with high probability, the underlying reward mapping function can be approximated by a linear function over $g(x_i; \theta_0)$, parameterized by $\theta^* - \theta_0$, where $g(x_i; \theta_0) = \nabla_\theta f(x_i; \theta_0) \in \mathbb{R}^p$ is the gradient of the initial neural network. We also define the covariance matrix $A_t$ at round $t$ as

$$A_t = \sum_{s=1}^{t-1} g(x_{s,a_s}; \theta_0)\, g(x_{s,a_s}; \theta_0)^\top/m + \lambda I, \qquad (4.2)$$

where $\lambda > 0$ is the $l_2$ regularization parameter in Eq (3.2). As discussed before, there are two kinds of noise in the training samples: the observation noise and the perturbation noise. To analyze the effect of each noise in model estimation, two auxiliary least square solutions are introduced, $A_t^{-1}\bar{b}_t$ and $A_t^{-1}\tilde{b}_t$, with

$$\bar{b}_t = \sum_{s=1}^{t-1}\big(r_{s,a_s} + \mathbb{E}[\gamma_s^{t-1}]\big)\, g(x_{s,a_s}; \theta_0)/\sqrt{m}, \qquad \tilde{b}_t = \sum_{s=1}^{t-1}\big(r_{s,a_s} + \gamma_s^{t-1}\big)\, g(x_{s,a_s}; \theta_0)/\sqrt{m}.$$

These two auxiliary solutions isolate the induced deviations: the first least square solution only contains the deviation caused by the observation noise $\{\eta_s\}_{s=1}^{t-1}$, and the second has an extra deviation caused by $\{\gamma_s^{t-1}\}_{s=1}^{t-1}$. To analyze the estimation error of the neural network, we first define the following events. At round $t$, event $E_{t,1}$ is defined as

$$E_{t,1} = \Big\{\forall i \in [K]:\; \big|\langle g(x_{t,i}; \theta_0), A_t^{-1}\bar{b}_t\rangle/\sqrt{m} - h(x_{t,i})\big| \le \alpha_t\,\|g(x_{t,i}; \theta_0)/\sqrt{m}\|_{A_t^{-1}}\Big\}.$$

According to Lemma 4.1 and the definition of $\bar{b}_t$, under event $E_{t,1}$ the deviation caused by the observation noise $\{\eta_s\}_{s=1}^{t-1}$ is controlled by $\alpha_t$. We also need event $E_{t,2}$ at round $t$, defined as

$$E_{t,2} = \Big\{\forall i \in [K]:\; \big|f(x_{t,i}; \theta_{t-1}) - \langle g(x_{t,i}; \theta_0), A_t^{-1}\tilde{b}_t\rangle/\sqrt{m}\big| \le \epsilon(m) + \beta_t\,\|g(x_{t,i}; \theta_0)/\sqrt{m}\|_{A_t^{-1}}\Big\},$$

where $\epsilon(m)$ is defined as

$$\epsilon(m) = C_{\epsilon,1} m^{-1/6} T^{2/3} \lambda^{-2/3} L^3 \sqrt{\log m} + C_{\epsilon,2}(1-\eta m\lambda)^{J}\sqrt{TL/\lambda} + C_{\epsilon,3} m^{-1/6} T^{5/3} \lambda^{-5/3} L^4 \sqrt{\log m}\,\big(1 + \sqrt{T/\lambda}\big),$$

with constants $\{C_{\epsilon,i}\}_{i=1}^{3}$. Event $E_{t,2}$ captures the situation where the deviation caused by the added noise $\{\gamma_s^{t-1}\}_{s=1}^{t-1}$ is controlled by the parameter $\beta_t$ at round $t$, and the approximation error $\epsilon(m)$ is carefully tracked by properly setting the neural network width $m$. The following lemmas show that, given the history $\mathcal{F}_{t-1} = \sigma\{x_1, r_1, x_2, r_2, \dots, x_{t-1}, r_{t-1}, x_t\}$, i.e., the $\sigma$-algebra generated by the previously pulled arms and the observed rewards, with proper parameter settings, events $E_{t,1}$ and $E_{t,2}$ happen with high probability.

Lemma 4.2. For any $t \in [T]$, $\lambda > 0$ and $\delta > 0$, with $\alpha_t = \sqrt{R^2\log\big(\det(A_t)/(\delta^2\det(\lambda I))\big)} + \sqrt{\lambda}\,S$, we have $\mathbb{P}(E_{t,1}\,|\,\mathcal{F}_{t-1}) \ge 1-\delta$. In particular, $S$ denotes an upper bound of $\sqrt{2\,\mathbf{h}^\top\mathbf{H}^{-1}\mathbf{h}}$, which is a constant if the reward function $h(\cdot)$ belongs to the RKHS induced by the neural tangent kernel.

Lemma 4.3. There exist positive constants $\{C_i\}_{i=1}^{3}$ such that, if the step size and the neural network satisfy $\eta = C_1(m\lambda + m\sqrt{LT})^{-1}$, $m \ge C_2\sqrt{\lambda}\,L^{-3/2}[\log(TKL^2/\delta)]^{3/2}$, and $m[\log m]^{-3} \ge C_3\max\{TL^{12}\lambda^{-1},\; T^7\lambda^{-8}L^{18}(\lambda+LT)^6,\; L^{21}T^7\lambda^{-7}(1+\sqrt{T/\lambda})^6\}$, then given history $\mathcal{F}_{t-1}$, by choosing $\beta_t = \sigma\sqrt{4\log t + 2\log K}$, we have $\mathbb{P}(E_{t,2}\,|\,\mathcal{F}_{t-1}) \ge 1 - t^{-2}$.

We leave the detailed proofs to the appendix. According to Lemma 4.2 and Lemma 4.3, as more observations become available, the neural network based reward estimator $f(x; \theta_t)$ gradually converges to the true expected reward $h(x)$. In the meantime, the variance of the added perturbation leads to reward overestimation, which compensates for the potential underestimation caused by the observation noise and the approximation error.
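The quantities above exist only in the analysis; NPR never forms them while learning. The small NumPy sketch below, with illustrative names and under the assumption that each row of `G_hist` stores $g(x;\theta_0)/\sqrt{m}$ for a previously pulled arm, shows the covariance matrix of Eq (4.2), the two auxiliary least-squares solutions, and the confidence width $\|g(x;\theta_0)/\sqrt{m}\|_{A_t^{-1}}$ appearing in events $E_{t,1}$ and $E_{t,2}$. Forming and inverting the $p \times p$ matrix $A_t$ is exactly the per-round cost that explicit-exploration methods pay and NPR avoids.

```python
import numpy as np


def analysis_quantities(G_hist, rewards, gammas, lam=1.0):
    """Analysis-only constructions used in Lemmas 4.2-4.3 (never formed by
    NPR at run time).  G_hist: (t-1, p) rows of g(x; theta_0)/sqrt(m);
    rewards: observed rewards; gammas: one perturbation per past round."""
    p = G_hist.shape[1]
    A_t = G_hist.T @ G_hist + lam * np.eye(p)       # Eq. (4.2)
    b_bar = G_hist.T @ rewards                      # observation noise only
    b_tilde = G_hist.T @ (rewards + gammas)         # plus perturbation noise
    A_inv = np.linalg.inv(A_t)                      # O(p^3): the cost NPR never pays

    def width(g_over_sqrt_m):
        """Confidence width ||g/sqrt(m)||_{A_t^{-1}} for one candidate arm."""
        return float(np.sqrt(g_over_sqrt_m @ A_inv @ g_over_sqrt_m))

    return A_inv @ b_bar, A_inv @ b_tilde, width
```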
The following lemma describes this anti-concentration behavior of the added noise $\{\gamma_s^{t-1}\}_{s=1}^{t-1}$.

Lemma 4.4. For any $t \in [T]$ and any $i \in [K]$, with $\sigma = \alpha_t\big(1 - \lambda\lambda_K^{-1}(A_K)\big)^{-1/2}$, $\alpha_t$ defined in Lemma 4.2 and $\epsilon(m)$ defined in Lemma 4.3, given history $\mathcal{F}_{t-1}$ we have

$$\mathbb{P}(E_{t,3}) = \mathbb{P}\big(f(x_{t,i}; \theta_{t-1}) \ge h(x_{t,i}) - \epsilon(m)\,\big|\,\mathcal{F}_{t-1}, E_{t,1}\big) \ge (4e\sqrt{\pi})^{-1},$$

where $A_K$ is the covariance matrix constructed by Eq (4.2) after the initial $K$ rounds of arm pulling (lines 8 to 10 in Algorithm 1), and $\lambda_K(A_K)$ denotes the $K$-th largest eigenvalue of $A_K$.

With $\alpha_t$ defined in Lemma 4.2, and $\epsilon(m)$ and $\beta_t$ defined in Lemma 4.3, we divide the arms into two groups. More specifically, we define the set of sufficiently sampled arms in round $t$ as

$$\Omega_t = \big\{i \in [K]:\; 2\epsilon(m) + (\alpha_t + \beta_t)\,\|g(x_{t,i}; \theta_0)/\sqrt{m}\|_{A_t^{-1}} \le h(x_{t,a_t^*}) - h(x_{t,i})\big\}.$$

Accordingly, the set of undersampled arms is defined as $\bar{\Omega}_t = [K] \setminus \Omega_t$. Note that by this definition, the best arm $a_t^*$ belongs to $\bar{\Omega}_t$. In the following lemma, we show that at any time step $t > K$, the expected one-step regret is upper bounded in terms of the regret due to playing an undersampled arm $a_t \in \bar{\Omega}_t$.

Lemma 4.5. With $m$, $\eta$, $\sigma$ satisfying the conditions in Lemmas 4.2, 4.3 and 4.4, and $\epsilon(m)$ defined in Lemma 4.3, with probability at least $1-\delta$ the one-step regret of NPR is upper bounded as

$$\mathbb{E}[h(x_{t,a_t^*}) - h(x_{t,a_t})] \le \mathbb{P}(\bar{E}_{t,2}) + 4\epsilon(m) + (\beta_t + \alpha_t)\big(1 + 2/\mathbb{P}(a_t \in \bar{\Omega}_t)\big)\,\mathbb{E}\big[\|g(x_{t,a_t}; \theta_0)/\sqrt{m}\|_{A_t^{-1}}\big].$$

Furthermore, $\mathbb{P}(a_t \in \bar{\Omega}_t) \ge \mathbb{P}_t(E_{t,3}) - \mathbb{P}_t(\bar{E}_{t,2})$.

To bound the cumulative regret of NPR, we also need the following lemma.

Lemma 4.6. With $\tilde{d}$ as the effective dimension of the NTK matrix $\mathbf{H}$, we have

$$\sum_{t=1}^{T}\min\big\{\|g(x_{t,a_t}; \theta_0)/\sqrt{m}\|^2_{A_t^{-1}},\, 1\big\} \le 2\big(\tilde{d}\log(1 + TK/\lambda) + 1\big).$$

The effective dimension was first proposed in (Valko et al., 2013) to analyze kernelized contextual bandits; it roughly measures the number of directions in the kernel space in which the data mostly lies. The detailed definition of $\tilde{d}$ can be found in Definition B.3 in the appendix. Finally, with a proper neural network setup, we provide an $m$-independent regret upper bound.

Theorem 4.7. Let $\tilde{d}$ be the effective dimension of the NTK matrix $\mathbf{H}$. There exist positive constants $C_1$ and $C_r$ such that for any $\delta \in (0,1)$, when the network structure and the step size for model update satisfy $m \ge \mathrm{poly}(T, L, K, \lambda^{-1}, S^{-1}, \log(1/\delta))$, $\eta = C_1(m\sqrt{TL} + m\lambda)^{-1}$ and $\lambda \ge \max\{1, S^{-2}\}$, then with probability at least $1-\delta$, the cumulative regret of NPR satisfies

$$R_T \le \Gamma\Big(R\sqrt{\tilde{d}\log(1+TK/\lambda) + 2 - 2\log\delta} + \sqrt{\lambda}\,S\Big)\sqrt{2\tilde{d}\,T\log(1+TK/\lambda) + 2T} + 7 + K,$$

where $\Gamma = 44e\sqrt{\pi}\big(1 + \sqrt{(4\log T + 2\log K)/(1 - \lambda\lambda_K^{-1}(A_K))}\big)$.

Proof. By Lemma 4.2, $E_{t,1}$ holds for all $t \in [T]$ with probability at least $1-\delta$. Therefore, with probability at least $1-\delta$, we have

$$R_T \le \sum_{t=1}^{T}\mathbb{E}\big[(h(x_{t,a_t^*}) - h(x_{t,a_t}))\mathbb{1}\{E_{t,1}\}\big] \le \sum_{t=K+1}^{T}\mathbb{E}\big[(h(x_{t,a_t^*}) - h(x_{t,a_t}))\mathbb{1}\{E_{t,1}\}\big] + K$$
$$\le \sum_{t=1}^{T}\Big[\mathbb{P}_t(\bar{E}_{t,2}) + 4\epsilon(m) + (\beta_t + \alpha_t)\Big(1 + \frac{2}{\mathbb{P}_t(E_{t,3}) - \mathbb{P}_t(\bar{E}_{t,2})}\Big)\|g(x_{t,a_t}; \theta_0)/\sqrt{m}\|_{A_t^{-1}}\Big] + K.$$

The second inequality is due to the analysis of the one-step regret in Lemma 4.5. According to Lemma 4.3, by choosing $\beta_t = \sigma c_t$ with $c_t = \sqrt{4\log t + 2\log K}$, we have $\mathbb{P}_t(\bar{E}_{t,2}) \le t^{-2}$, and hence $\sum_{t=1}^{T}\mathbb{P}_t(\bar{E}_{t,2}) \le \pi^2/6$. Based on Lemma 4.4, with $\sigma = \alpha_t(1 - \lambda\lambda_K^{-1}(A_K))^{-1/2}$, $\mathbb{P}_t(E_{t,3}) \ge (4e\sqrt{\pi})^{-1}$. Therefore, $1 + 2/\big(\mathbb{P}_t(E_{t,3}) - \mathbb{P}_t(\bar{E}_{t,2})\big) \le 44e\sqrt{\pi}$. By chaining all the inequalities, the cumulative regret of NPR is upper bounded as

$$R_T \le \pi^2/6 + 4T\epsilon(m) + \sum_{t=1}^{T}(\beta_t + \alpha_t)\Big(1 + \frac{2}{\mathbb{P}_t(E_{t,3}) - \mathbb{P}_t(\bar{E}_{t,2})}\Big)\|g(x_{t,a_t}; \theta_0)/\sqrt{m}\|_{A_t^{-1}} + K$$
$$\le \pi^2/6 + K + \Gamma\Big(R\sqrt{\tilde{d}\log(1+TK/\lambda) + 2 - 2\log\delta} + \sqrt{\lambda}\,S\Big)\sqrt{2\tilde{d}\,T\log(1+TK/\lambda) + 2T}$$
$$\quad + C_{\epsilon,1}m^{-1/6}T^{5/3}\lambda^{-2/3}L^3\sqrt{\log m} + C_{\epsilon,2}(1-\eta m\lambda)^{J}T\sqrt{TL/\lambda} + C_{\epsilon,3}m^{-1/6}T^{8/3}\lambda^{-5/3}L^4\sqrt{\log m}\big(1+\sqrt{T/\lambda}\big),$$

where $\Gamma = 44e\sqrt{\pi}\big(1 + \sqrt{(4\log T + 2\log K)/(1 - \lambda\lambda_K^{-1}(A_K))}\big)$. The second inequality is due to Lemma 4.6 (combined with the Cauchy-Schwarz inequality) and the bounds on events $E_{t,2}$ and $E_{t,3}$. By setting $\eta = c(m\lambda + m\sqrt{LT})^{-1}$ and $J = (1 + LT/\lambda)\big(\log C_{\epsilon,2} + \log(T\sqrt{TL/\lambda})\big)/c$, we have $C_{\epsilon,2}(1-\eta m\lambda)^{J}T\sqrt{TL/\lambda} \le 1$.
Then, by choosing a sufficiently large $m$, $C_{\epsilon,1}m^{-1/6}T^{5/3}\lambda^{-2/3}L^3\sqrt{\log m} + C_{\epsilon,3}m^{-1/6}T^{8/3}\lambda^{-5/3}L^4\sqrt{\log m}(1+\sqrt{T/\lambda}) \le 2/3$, and $R_T$ can be further bounded by

$$R_T \le \Gamma\Big(R\sqrt{\tilde{d}\log(1+TK/\lambda) + 2 - 2\log\delta} + \sqrt{\lambda}\,S\Big)\sqrt{2\tilde{d}\,T\log(1+TK/\lambda) + 2T} + 7 + K.$$

This completes the proof.

5 EXPERIMENTS

In this section, we empirically evaluate the proposed neural bandit algorithm NPR against several state-of-the-art baselines, including linear and neural UCB (LinUCB (Abbasi-Yadkori et al., 2011) and NeuralUCB (Zhou et al., 2020)), linear and neural Thompson Sampling (LinTS (Agrawal & Goyal, 2013) and NeuralTS (Zhang et al., 2020)), and perturbed rewards for linear bandits (LinFPL) (Kveton et al., 2019a). To demonstrate the real impact of using the diagonal approximation in confidence set construction, we also include NeuralUCB and NeuralTS with the diagonal approximation suggested in their original papers. We evaluated all the algorithms on 1) a synthetic dataset via simulation, 2) six K-class classification datasets from the UCI machine learning repository (Beygelzimer et al., 2011), and 3) two real-world datasets extracted from the social bookmarking web service Delicious and the music streaming service LastFM (Wu et al., 2016). We implemented all the algorithms in PyTorch and performed all the experiments on a server equipped with an Intel Xeon Gold 6230 2.10GHz CPU, 128G RAM, and four NVIDIA GeForce RTX 2080Ti graphics cards.

5.1 EXPERIMENT ON SYNTHETIC DATASET

In our simulation, we first generate a size-$K$ arm pool, in which each arm $i$ is associated with a $d$-dimensional feature vector $x_i$. The context vectors are sampled uniformly at random from a unit ball. We generate the rewards via the following two nonlinear functions:

$$h_1(x) = 10^{-2}\big(x^\top\Sigma\Sigma^\top x\big), \qquad h_2(x) = \exp\big(-10\,(x^\top\theta)^2\big),$$

where each element of $\Sigma \in \mathbb{R}^{d\times d}$ is randomly sampled from $N(0,1)$, and $\theta$ is randomly sampled from a unit ball. At each round $t$, only a subset of $k$ arms out of the total $K$ arms is sampled without replacement and disclosed to all algorithms for selection. The ground-truth reward $r_a$ is corrupted by Gaussian noise $\eta \sim N(0, \xi^2)$ before being fed back to the bandit algorithms. We fixed the feature dimension $d = 50$ and the pool size $K = 100$ with $k = 20$, and set $\xi = 0.1$ for both $h_1$ and $h_2$. Cumulative regret is used to compare the performance of different algorithms; here the best arm is defined as the one with the highest expected reward among the $k$ presented candidate arms. All the algorithms were executed for up to 10000 rounds in simulation, and the averaged results over 10 runs are reported in Figures 1(a) to 1(e).

[Figure 1: Empirical results of regret and time consumption on the synthetic dataset. Panels: (a) regret under $h_1(x) = 10^{-2}(x^\top\Sigma\Sigma^\top x)$; (b) regret under $h_2(x) = \exp(-10(x^\top\theta)^2)$; (c) elapsed time for model update and arm selection; (d) effect of arm pool size; (e) effect of network complexity.]

For the neural bandit algorithms, we adopted a 3-layer neural network with $m = 64$ units in each hidden layer. We did a grid search on the first 1000 rounds for the regularization parameter $\lambda$ over $\{10^{-i}\}_{i=0}^{4}$ and the step size $\eta$ over $\{10^{-i}\}_{i=0}^{3}$. We also searched for the concentration parameter $\delta$ so that the exploration parameter $\nu$ was equivalently searched over $\{10^{-i}\}_{i=0}^{4}$. The best set of hyper-parameters for each algorithm was then applied to the full simulation over 10000 rounds, i.e., only that trial was continued for the remaining 9000 rounds.
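For reference, the synthetic environment described above can be simulated roughly as follows. This is a sketch under our own assumptions about unimportant details (random seed, the exact way uniform sampling from the unit ball is implemented); the dimensions, reward functions, and noise level follow the text.

```python
import numpy as np

rng = np.random.default_rng(0)
d, K, k, xi = 50, 100, 20, 0.1

# Arm pool: contexts sampled uniformly at random from the unit ball.
X = rng.standard_normal((K, d))
X = X / np.linalg.norm(X, axis=1, keepdims=True)
X *= rng.uniform(0.0, 1.0, size=(K, 1)) ** (1.0 / d)

# Two nonlinear reward functions from Section 5.1.
Sigma = rng.standard_normal((d, d))
theta = rng.standard_normal(d)
theta /= np.linalg.norm(theta)
h1 = lambda x: 1e-2 * (x @ Sigma @ Sigma.T @ x)
h2 = lambda x: np.exp(-10.0 * (x @ theta) ** 2)


def step(h):
    """One interaction round: disclose k of the K arms, return their contexts,
    noisy rewards, and the best expected reward among the disclosed arms."""
    idx = rng.choice(K, size=k, replace=False)
    expected = np.array([h(X[i]) for i in idx])
    noisy = expected + rng.normal(0.0, xi, size=k)
    return X[idx], noisy, expected.max()
```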
In Figures 1(a) and 1(b), we can observe that the linear bandit algorithms clearly failed to learn the nonlinear reward mappings and suffered much higher regret than the neural algorithms. NPR achieved similar performance to NeuralUCB and NeuralTS, which shows the effectiveness of our randomized exploration strategy in neural bandit learning. Figure 1(c) shows the average running time of all the neural bandit models in this experiment. With the same network structure, the time spent on model update was similar across the different neural bandit models. But we can clearly observe that NeuralUCB and NeuralTS took much more time for arm selection. When choosing an arm, NeuralUCB and NeuralTS, and their corresponding versions with the diagonal approximation, need to repeatedly compute: 1) the inverse of the covariance matrix built from the arms pulled in history; and 2) the confidence set of the reward estimate for each candidate arm. The size of the covariance matrix depends on the number of parameters in the neural network (e.g., 7460-by-7460 in this experiment). With the diagonal approximation, the computational cost is significantly reduced. But, as mentioned before, there is no theoretical guarantee about the approximation's impact on the model's performance. Intuitively, the impact depends on how much the full covariance matrix concentrates on its diagonal, e.g., whether the random feature mappings of the network on the pulled arms are correlated across their dimensions, which cannot be determined in advance. Figures 1(a) and 1(b) show that the diagonal approximation leads to a significant performance drop, which confirms our concern about such an approximation. In contrast, as shown in Figure 1(c), there is no extra computation overhead in NPR; in the meantime, it provides comparable performance to NeuralUCB and NeuralTS with a full covariance matrix.

To further investigate the efficiency advantage of NPR, we report the total running time of the neural bandit algorithms under different sizes of the arm pool and different structures of the neural network, using the same simulation setting. With the same neural network structure, models with different sizes of candidate arm pools share the same running time for model update. Hence, Figure 1(d) reports how the size of the arm pool affects the time spent on arm selection. As we can see, with more candidate arms, NeuralUCB, NeuralTS, and their corresponding diagonal approximations spent more time on arm selection, or more specifically on constructing the confidence set of the reward estimation for each arm. With a fixed-size arm pool, the complexity of the neural network strongly affects the time needed to compute the inverse of the covariance matrix. In Figure 1(e), the x-axis shows the structure of the hidden layers.
The running time of NeuralUCB and NeuralTS increased significantly with enlarged neural networks, while the running time of NPR stayed stable across varying sizes of arm pools and network complexity. This shows the strong advantage of NPR in efficiency and effectiveness, especially for large-scale problems.

5.2 EXPERIMENT ON CLASSIFICATION DATASETS

Following the setting in (Zhou et al., 2020), we evaluated the bandit algorithms on the K-class classification problem with six public benchmark datasets, including adult, mushroom, and shuttle from the UCI Machine Learning Repository (Dua & Graff, 2017). Due to the space limit, we report the results on these three datasets in this section, and leave the other three, letter, covertype, and MagicTelescope, to the appendix. We adopted the disjoint model (Li et al., 2010) to build the context vectors for the candidate arms: given an input instance with feature $x \in \mathbb{R}^d$ of a $k$-class classification problem, we constructed the context features of the $k$ candidate arms as $x_1 = (x, 0, \dots, 0), \dots, x_k = (0, \dots, 0, x) \in \mathbb{R}^{dk}$ (see the sketch at the end of this subsection). The model receives reward 1 if it selects the correct class by pulling the corresponding arm, and otherwise 0. The cumulative regret measures the total number of mistakes made by the model over $T$ rounds.

Similar to the experiments on the synthetic dataset, we adopted a 3-layer neural network with $m = 64$ units in each hidden layer and used the same grid search setting. Figure 2 shows the averaged results across 10 runs over 10000 rounds, except for mushroom, which only contains 8124 data instances.

[Figure 2: Comparison of NPR and the baselines on the UCI datasets: (a) Adult; (b) Mushroom; (c) Shuttle.]

There are several key observations in Figure 2. First, the improvement from applying a neural bandit model depends on the nonlinearity of the dataset. For example, the performance on mushroom and shuttle was highly boosted by the neural models, while the improvement on adult was limited. Second, the impact of the diagonal approximation on NeuralUCB and NeuralTS also depends on the dataset: it did not hurt their performance much on shuttle, while using the full covariance matrix led to much better performance on mushroom. Overall, NPR showed consistently better or at least comparable performance to the best neural bandit baselines on all six datasets. Such robust performance and high efficiency make it more advantageous in practice.
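The disjoint-model construction used above can be sketched as follows; the function names are illustrative and not from the authors' code.

```python
import numpy as np


def disjoint_contexts(x: np.ndarray, num_classes: int) -> np.ndarray:
    """Disjoint model (Li et al., 2010): turn one classification instance
    x in R^d into num_classes candidate arms in R^{d * num_classes}, where
    arm i carries x in its i-th block and zeros elsewhere."""
    d = x.shape[0]
    contexts = np.zeros((num_classes, d * num_classes))
    for i in range(num_classes):
        contexts[i, i * d:(i + 1) * d] = x
    return contexts


def bandit_feedback(chosen_arm: int, label: int) -> float:
    """Reward 1 for pulling the arm of the correct class, otherwise 0."""
    return 1.0 if chosen_arm == label else 0.0
```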
5.3 EXPERIMENT ON LASTFM & DELICIOUS

The LastFM dataset was extracted from the music streaming service Last.fm, and the Delicious dataset was extracted from the social bookmark sharing service Delicious (Wu et al., 2016). In particular, the LastFM dataset contains 1,892 users and 17,632 items (artists); we treat the artists a user listened to as positive feedback. The Delicious dataset contains 1,861 users and 69,226 items (URLs); we treat the URLs bookmarked by a user as positive feedback. Following the standard setting (Cesa-Bianchi et al., 2013), we pre-processed these two datasets. For the item features, we used all the tags associated with each item to create its TF-IDF feature vector, and then applied PCA to reduce the dimensionality; in both datasets, we used the first 25 principal components to construct the item feature vectors. On the user side, we created user features by running node2vec (Grover & Leskovec, 2016) on the user relation graph extracted from the social network provided by the datasets; each user is represented by a 50-dimensional feature vector. We generated the candidate pool as follows: we fixed the size of the candidate arm pool to $k = 25$ for each round; for each user, we picked one item from the non-zero payoff items according to the complete observations in the dataset, and randomly picked the other 24 from the zero-payoff items. We constructed the context vectors by concatenating the user and item feature vectors.

In this experiment, a 3-layer neural network with $m = 128$ units in each hidden layer was applied to learn the bandit model, and the same grid search was performed to find the best set of hyper-parameters. We ran the experiment for 10000 rounds. Because of the high GPU memory demand of the matrix inverse computation, we only compared NPR against NeuralUCB and NeuralTS with the diagonal approximation. We compare the cumulative reward of each algorithm in Figure 3. Besides the neural contextual bandit algorithms, we also include the best previously reported model on these two datasets, hLinUCB (Wang et al., 2016), for comparison. To improve visibility, we normalized the cumulative reward by the cumulative reward of a random strategy (Wu et al., 2016), where the arm is randomly sampled from the candidate pool.

[Figure 3: Comparisons of normalized rewards on the LastFM and Delicious datasets: (a) LastFM; (b) Delicious.]

All the neural models showed the power to learn the potentially non-linear reward mappings on these two real-world datasets. NPR showed its advantage over the baselines, especially on LastFM. As NPR has no additional requirement on computation resources (e.g., no need for the expensive matrix operations used in the baselines), we also observed a much shorter running time for NPR than for the other two neural bandit models.

6 CONCLUSION

Existing neural contextual bandit algorithms sacrifice computational efficiency for effective exploration in the entire neural network parameter space. To eliminate the computational overhead caused by such an expensive operation, we appealed to randomization-based exploration and obtained the same level of learning outcome. No extra computational cost or resources are needed besides the regular neural network update. We proved that the proposed solution obtains $\tilde{O}(\tilde{d}\sqrt{T})$ regret over $T$ rounds of interactions between the system and the user. Extensive empirical results on both synthetic and real-world datasets support our theoretical analysis and demonstrate the strong advantage of our model in efficiency and effectiveness, which suggests the potential of deploying powerful neural bandit models online in practice.

Our empirical efficiency analysis shows that the key bottleneck of learning a neural model online is the model update step. Our current regret analysis depends on (stochastic) gradient descent over the entire training set for the model update in each round, which is prohibitively expensive. We would like to investigate the possibility of more efficient model updates, e.g., online stochastic gradient descent or continual learning, and the corresponding effect on model convergence and the regret analysis. Currently, our analysis focuses on the K-armed setting, and it is important to extend our solution and analysis to infinitely many arms in future work.
Published as a conference paper at ICLR 2022 ACKNOWLEDGEMENTS We thank the anonymous reviewers for their helpful comments. YJ and HW are partially supported by the National Science Foundation under grant IIS-2128019 and IIS-1553568. WZ, DZ and QG are partially supported by the National Science Foundation CAREER Award 1906169 and IIS-1904183. The views and conclusions contained in this paper are those of the authors and should not be interpreted as representing any funding agencies. Yasin Abbasi-Yadkori, D avid P al, and Csaba Szepesv ari. Improved algorithms for linear stochastic bandits. In Advances in Neural Information Processing Systems, pp. 2312 2320, 2011. Shipra Agrawal and Navin Goyal. Thompson sampling for contextual bandits with linear payoffs. In International Conference on Machine Learning, pp. 127 135, 2013. Zeyuan Allen-Zhu, Yuanzhi Li, and Zhao Song. A convergence theory for deep learning via overparameterization. In International Conference on Machine Learning, pp. 242 252, 2019. Sanjeev Arora, Simon S Du, Wei Hu, Zhiyuan Li, Ruslan Salakhutdinov, and Ruosong Wang. On exact computation with an infinitely wide neural net. In Advances in Neural Information Processing Systems, 2019. Alina Beygelzimer, John Langford, Lihong Li, Lev Reyzin, and Robert E. Schapire. Contextual bandit algorithms with supervised learning guarantees. In Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, pp. 19 26, 2011. Yuan Cao and Quanquan Gu. Generalization bounds of stochastic gradient descent for wide and deep neural networks. In Advances in Neural Information Processing Systems, 2019. Yuan Cao and Quanquan Gu. Generalization error bounds of gradient descent for learning overparameterized deep relu networks. In the Thirty-Fourth AAAI Conference on Artificial Intelligence, 2020. Nicolo Cesa-Bianchi, Claudio Gentile, and Giovanni Zappella. A gang of bandits. ar Xiv preprint ar Xiv:1306.0811, 2013. Zixiang Chen, Yuan Cao, Difan Zou, and Quanquan Gu. How much over-parameterization is sufficient to learn deep relu networks? In International Conference on Learning Representations, 2020. Sayak Ray Chowdhury and Aditya Gopalan. On kernelized multi-armed bandits. In International Conference on Machine Learning, pp. 844 853. PMLR, 2017. Amit Daniely. SGD learns the conjugate kernel class of the network. In Advances in Neural Information Processing Systems, pp. 2422 2430, 2017. Simon Du, Jason Lee, Haochuan Li, Liwei Wang, and Xiyu Zhai. Gradient descent finds global minima of deep neural networks. In International Conference on Machine Learning, pp. 1675 1685, 2019. Dheeru Dua and Casey Graff. UCI machine learning repository, 2017. URL http://archive. ics.uci.edu/ml. Louis Faury, Marc Abeille, Cl ement Calauz enes, and Olivier Fercoq. Improved optimistic algorithms for logistic bandits. In International Conference on Machine Learning, pp. 3052 3060. PMLR, 2020. Sarah Filippi, Olivier Cappe, Aur elien Garivier, and Csaba Szepesv ari. Parametric bandits: The generalized linear case. In Advances in Neural Information Processing Systems, pp. 586 594, 2010. Published as a conference paper at ICLR 2022 Aditya Grover and Jure Leskovec. node2vec: Scalable feature learning for networks. In Proceedings of the 22nd ACM SIGKDD international conference on Knowledge discovery and data mining, pp. 855 864, 2016. Arthur Jacot, Franck Gabriel, and Cl ement Hongler. Neural tangent kernel: Convergence and generalization in neural networks. 
In Advances in neural information processing systems, pp. 8571 8580, 2018. Andreas Krause and Cheng S Ong. Contextual Gaussian process bandit optimization. In Advances in neural information processing systems, pp. 2447 2455, 2011. Branislav Kveton, Csaba Szepesv ari, Mohammad Ghavamzadeh, and Craig Boutilier. Perturbedhistory exploration in stochastic linear bandits. In Proceedings of the 35th Conference on Uncertainty in Artificial Intelligence (UAI), pp. 176, 2019a. Branislav Kveton, Csaba Szepesvari, Sharan Vaswani, Zheng Wen, Tor Lattimore, and Mohammad Ghavamzadeh. Garbage in, reward out: Bootstrapping exploration in multi-armed bandits. In International Conference on Machine Learning, pp. 3601 3610. PMLR, 2019b. Branislav Kveton, Manzil Zaheer, Csaba Szepesv ari, Lihong Li, Mohammad Ghavamzadeh, and Craig Boutilier. Randomized exploration in generalized linear bandits. In Proceedings of the 22nd International Conference on Artificial Intelligence and Statistics, 2020. Huitian Lei, Ambuj Tewari, and Susan A Murphy. An actor-critic contextual bandit algorithm for personalized mobile health interventions. ar Xiv preprint ar Xiv:1706.09090, 2017. Lihong Li, Wei Chu, John Langford, and Robert E Schapire. A contextual-bandit approach to personalized news article recommendation. In Proceedings of the 19th international conference on World wide web, pp. 661 670. ACM, 2010. Alessandro Nuara, Francesco Trovo, Nicola Gatti, and Marcello Restelli. A combinatorial-bandit algorithm for the online joint bid/budget optimization of pay-per-click advertising campaigns. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 32, 2018. Carlos Riquelme, George Tucker, and Jasper Snoek. Deep Bayesian bandits showdown. In International Conference on Learning Representations, 2018. Eric M Schwartz, Eric T Bradlow, and Peter S Fader. Customer acquisition via display advertising using multi-armed bandit experiments. Marketing Science, 36(4):500 522, 2017. Ambuj Tewari and Susan A Murphy. From ads to interventions: Contextual bandits in mobile health. In Mobile Health, pp. 495 517. Springer, 2017. Michal Valko, Nathaniel Korda, R emi Munos, Ilias Flaounas, and Nelo Cristianini. Finite-time analysis of kernelised contextual bandits. ar Xiv preprint ar Xiv:1309.6869, 2013. Huazheng Wang, Qingyun Wu, and Hongning Wang. Learning hidden features for contextual bandits. In Proceedings of the 25th ACM International on Conference on Information and Knowledge Management, pp. 1633 1642, 2016. Qingyun Wu, Huazheng Wang, Quanquan Gu, and Hongning Wang. Contextual bandits in a collaborative environment. In Proceedings of the 39th International ACM SIGIR conference on Research and Development in Information Retrieval, pp. 529 538, 2016. Pan Xu, Zheng Wen, Handong Zhao, and Quanquan Gu. Neural contextual bandits with deep representation and shallow exploration. ar Xiv preprint ar Xiv:2012.01780, 2020. Tom Zahavy and Shie Mannor. Deep neural linear bandits: Overcoming catastrophic forgetting through likelihood matching. ar Xiv preprint ar Xiv:1901.08612, 2019. Weitong ZHANG, Dongruo Zhou, Lihong Li, and Quanquan Gu. Neural thompson sampling. In International Conference on Learning Representations, 2020. Dongruo Zhou, Lihong Li, and Quanquan Gu. Neural contextual bandits with ucb-based exploration. In International Conference on Machine Learning, pp. 11492 11502. PMLR, 2020. Published as a conference paper at ICLR 2022 Difan Zou, Yuan Cao, Dongruo Zhou, and Quanquan Gu. 
Stochastic gradient descent optimizes over-parameterized deep Re LU networks. Machine Learning, 2019. A ADDITIONAL EXPERIMENTS A.1 EXPERIMENTS ON SYNTHETIC DATASETS A.1.1 REWARD PREDICTION WITH MORE COMPLICATED NEURAL NETWORKS In the main paper, to compare NPR with the existing neural bandit baselines, we chose a relatively small neural network, e.g., m = 64, so that Neural UCB/Neural TS can be executed with a full covariance matrix. To better capture the non-linearity in reward generation, we increased the width of the neural network to m = 128, and report the results in Figure 4. All other experiment settings (e.g., underlying reward generation function) are the same as reported in the main paper. And to provide a more comprehensive picture, we also reported the variance of each algorithm s performance in this figure. With the increased network width and limited by the GPU memory, Neural UCB and Neural TS can only be executed with the diagonal matrix of the covariance matrix when computing their required exploration bonus term . In Figure 4, we can observe 1) all neural bandit algorithms obtained the expected sublinear regret; and 2) NPR performed better than Neural UCB and Neural TS, because of the diagonal approximation that we have to use on these two baselines. This again strongly suggested the advantage of NPR: the approximation due to the computational complexity in Neural UCB and Neural TS limits their practical effectiveness, though they have the same theoretical regret bound as NPR s. 3) FALCON+ shows better performance in the initial rounds, while with more iterations, its regret is higher than the neural bandit baselines and our NPR model . Compared to the neural bandit models, FALCON+ has much slower convergence, e.g, the slope of its regret curve did not decrease as fast as other models over the course of interactions. After a careful investigation, we found this is caused by its scheduled update: because less and less updates can be performed in the later rounds, FALCON+ still selected the suboptimal arms quite often. Another important observation is that FALCON+ has a very high variance in its performance: in some epochs, it could keep sampling the suboptimal arms. This is again caused by its lazy update: in some epochs of some trials, its estimation could be biased by large reward noise, which will directly cost it miss the right arm in almost the entire next epoch. (a) h1(x) = 10 2(x ΣΣ x) (b) h2(x) = exp( 10(x θ)2) Figure 4: Comparison of NPR and the baselines on the synthetic dataset with m = 128 A.1.2 COMPARISONS UNDER DIFFERENT VALUES OF k For the experiment in the main paper, at each round only k = 20 arms are sampled from the K = 100 arm pool and disclosed to all the algorithms for selection. It is to test a more general setting in practice, e.g., different recommendation candidates appear across rounds. In the meantime, such setting also introduces more practical difficulties as only a subset of arms are revealed to the agent to learn from each time. In this experiment, we report the results of the neural bandit models with Published as a conference paper at ICLR 2022 (a) h1(x) = 10 2(x ΣΣ x) (b) h2(x) = exp( 10(x θ)2) Figure 5: Comparison of NPR and the baselines on the synthetic dataset with different k. different value k, e.g., the number of arms disclosed each round. We followed the setting in Section A.1.1 to set the network width m = 128. 
Figure 5 demonstrates that for both datasets, revealing all the candidate arms, e.g., k = 100, will considerably reduce the regret compared to only revealing a subset of the arms. But in both settings, all bandit algorithms, including NPR, can obtain desired performance (i.e., sublinear regret). And again, NPR still obtained more satisfactory performance as it is not subject to the added computational complexity in Neural UCB and Neural TS. The FALCON+ baseline performed much better when k = 20 than k = 100, as it has less chance to select the suboptimal arms in each epoch, which is expected based on its polynomial dependence on the number of arms K. On the contrary, NPR only has a logarithmic dependence on K, whose advantage was validated in this experiment. A.2 EXPERIMENTS ON CLASSIFICATION DATASETS 0 2000 4000 6000 8000 10000 Rounds Lin FPL Lin TS Lin UCB NPR Neural TS Neural TS (Diagonal) Neural UCB Neural UCB (Diagonal) (a) Magic Telescope 0 2000 4000 6000 8000 10000 Rounds Lin FPL Lin TS Lin UCB NPR Neural TS (Diagonal) Neural UCB (Diagonal) 0 2000 4000 6000 8000 10000 Rounds Lin FPL Lin TS Lin UCB NPR Neural TS (Diagonal) Neural UCB (Diagonal) (c) Covertype Figure 6: Comparison of NPR and the baselines on the UCI datasets. Table 1: Dataset statistics DATASET ADULT MUSHROOM SHUTTLE MAGIC LETTER COVERTYPE INPUT DIMENSION 2 15 2 23 7 9 2 12 16 26 54 7 We present the experiment results of the classification problem on UCI datasets, Magic Telescope, letter and covertype. The experiment setting and grid search setting for parameter tuning are the same as described in Section 5. For datasets letter and covertype, only the results of Neural UCB and Neural TS with diagonal approximation are reported as the required GPU memory for the calculation of matrix inverse, together with the neural model update, exceeds the supported memory of the NVIDIA GEFORCE RTX 2080 graphical card. Similar observations are obtained for these three datasets. Compared to the linear contextual models, the neural contextual bandit algorithms significantly improve the performance with much lower regret (mistakes on classification). NPR showed consistently better performance to the best neural bandit baselines. And due to its perturbation-based exploration strategy, no extra computational resources are needed, which makes it more practical in real-world problems, especially large-scale tasks. Published as a conference paper at ICLR 2022 B BACKGROUND FOR THEORETICAL ANALYSIS We denote the set of all the possible K arms by their context vectors as {xi}K i=1. Our analysis is built upon the existing work in neural tangent kernel techniques. Definition B.1 (Jacot et al. (2018); Cao & Gu (2019)). Let {xi}K i=1 be a set of contexts. Define e H(1) i,j = Σ(1) i,j = xi, xj , B(l) i,j = Σ(l) i,i Σ(l) i,j Σ(l) i,j Σ(l) j,j Σ(l+1) i,j = 2E(u,v) N(0,B(l) i,j) [φ(u)φ(v)] , e H(l+1) i,j = 2 e H(l) i,j E(u,v) N(0,B(l) i,j) [φ (u)φ (v)] + Σ(l+1) i,j . Then, H = ( e H(L) + Σ(L))/2 is called the neural tangent kernel (NTK) matrix on the context set. With Definition B.1, we also have the following assumption on the contexts: {xi}K i=1. Assumption B.2. H λ0I; and moreover, for any 1 i K, xi 2 = 1 and [xi]j = [xi]j+d/2. By assuming H λ0I, the neural tangent kernel matrix is non-singular (Du et al., 2019; Arora et al., 2019; Cao & Gu, 2019). It can be easily satisfied when no two context vectors in the context set are in parallel. 
The second assumption is just for convenience in analysis and can be easily satisfied by: for any context vector x, x 2 = 1, we can construct a new context x = [x , x ] / 2. It can be verified that if θ0 is initialized as in Algorithm 1, then f(xi; θ0) = 0 for any i [K]. The NTK technique builds a connection between the analysis of DNN and kernel methods. With the following definition of effective dimension, we are able to analyze the neural network by some complexity measures previously used for kernel methods. Definition B.3. The effective dimension ed of the neural tangent kernel matrix on contexts {xi}K i=1 is defined as ed = log det(I + TH/λ) log(1 + TK/λ) . The notion of effective dimension is first introduced in Valko et al. (2013) for analyzing kernel contextual bandits, which describes the actual underlying dimension in the set of observed contexts {xi}K i=1. And we also need the following concentration bound on Gaussian distribution. Lemma B.4. Consider a normally distributed random variable X N(µ, σ2) and β 0. The probability that X is within a radius of βσ from its mean can be written as, P(|X µ| βσ) 1 exp( β2/2). C PROOF OF THEOREMS AND LEMMAS IN SECTION 4 C.1 PROOF OF LEMMA 4.2 Proof of Lemma 4.2. By Lemma 4.1, with a satisfied m, with probability at least 1 δ, we have for any i [K], h(xi) = g(xi; θ0)/ m, m(θ θ0) , and m θ θ0 2 Hence, based on the definition of At and bt, we know that A 1 t bt is the least square solution on top of the observation noise. Therefore, by Theorem 1 in Abbasi-Yadkori et al. (2011), with probability at least 1 δ, for any 1 t T, θ satisfies that m(θ θ0) A 1 t bt At R2 log det(At) δ2 det(λI) + Published as a conference paper at ICLR 2022 where R is the sub-Gaussian variable for the observation noise η. Then we have, | g(xt,i; θ0), A 1 t bt/ m h(xt,i)| =| g(xt,i; θ0), A 1 t bt/ m g(xt,i; θ0), θ θ0 | m(θ θ0) A 1 t bt At g(xt,i; θ0)/ m A 1 t αt g(xt,i; θ0)/ m A 1 t . with αt = q R2 log det(At) δ2 det(λI) + λS. This completes the proof. C.2 PROOF OF LEMMA 4.3 The proof starts with three lemmas that bound the error terms of the function value and gradient of a neural network. Lemma C.1 (Lemma B.2, Zhou et al. (2020)). There exist constants { Ci}5 i=1 > 0 such that for any δ > 0, with J as the number of steps for gradient descent in neural network learning, if for all t [T], η and m satisfy t/(mλ) C1m 3/2L 3/2[log(KL2/δ)]3/2, t/(mλ) C2 min L 6[log m] 3/2, m(λη)2L 6t 1(log m) 1 3/8 , η C3(mλ + tm L) 1, log m L7/2t7/6λ 7/6(1 + p then with probability at least 1 δ, we have that θt 1 θ0 2 2 p θt 1 θ0 A 1 t bt/ m 2 (1 ηmλ)J/2p t/(mλ) + C5m 2/3p log m L7/2t5/3λ 5/3(1 + p Lemma C.2 (Lemma B.3 Zhou et al. (2020)). There exist constants { Cu i }3 i=1 > 0 such that for any δ > 0, if τ satisfies Cu 1 m 3/2L 3/2[log(KL2/δ)]3/2 τ Cu 2 L 6[log m] 3/2, then with probability at least 1 δ, for any eθ and bθ satisfying eθ θ0 2 τ, bθ θ0 2 τ and j [K] we have f(xj; eθ) f(xj; bθ) g(xj; bθ), eθ bθ Cu 3 τ 4/3L3p Lemma C.3 (Lemma B.4, Zhou et al. (2020)). There exist constants { Cv i }3 i=1 > 0 such that for any δ > 0, if τ satisfies that Cv 1m 3/2L 3/2[log(KL2/δ)]3/2 τ Cv 2L 6[log m] 3/2, then with probability at least 1 δ, for any θ θ0 2 τ and j [K] we have g(xj; θ) F Cv 3 With above lemmas, we prove Lemma 4.3 as follows. Proof of Lemma 4.3. According to Lemma C.1, with satisfied neural network and the learning rate η, we have θt 1 θ0 2 2 p t/mλ. Further, by the choice of m, Lemma C.2 and C.3 hold. 
Therefore, |f(xt,i; θt 1) g(xt,i; θ0), A 1 t bt/ m | can be first upper bounded by: |f(xt,i; θt 1) g(xt,i; θ0), A 1 t bt/ m | |f(xt,i; θt 1) f(xt,i; θ0) g(xt,i; θ0), θt 1 θ0 | + | g(xt,i; θ0), θt 1 θ0 A 1 t bt/ m | Cϵ,1m 1/6T 2/3λ 2/3L3p log m + | g(xt,i; θ0), θt 1 θ0 A 1 t bt/ m | (C.2) where the first inequality holds due to triangle inequality, f(xt,i; θ0) = 0 by the random initialization of θ0, and the second inequality holds due to Lemma C.2 with the fact θt 1 θ0 2 2 p Published as a conference paper at ICLR 2022 Next, we bound the second term of Eq (C.2), | g(xt,i; θ0), θt 1 θ0 A 1 t bt/ m | | g(xt,i; θ0), θt 1 θ0 A 1 t bt/ m | + | g(xt,i; θ0), A 1 t bt/ m A 1 t bt/ m | g(xt,i; θ0) 2 θt 1 θ0 A 1 t bt/ m 2 + | g(xt,i; θ0), A 1 t bt/ m A 1 t bt/ m | Cϵ,2(1 ηmλ)Jp TL/λ + Cϵ,3m 1/6T 5/3λ 5/3L4p log m(1 + p + | g(xt,i; θ0), A 1 t bt/ m A 1 t bt/ m | (C.3) where the first inequality holds due to the triangle inequality, the second inequality holds according to Cauchy Schwarz inequality, and the third inequality holds due to Lemma C.1, C.3, with the fact that θt 1 θ0 2 2 p t/mλ and t T. Now, we provide the upper bound of the last term in Eq (C.3). Based on the definition of bt and bt, we have | g(xt,i; θ0), A 1 t bt/ m A 1 t bt/ m | = | g(xt,i; θ0)/ m, A 1 t Xt 1 s=1 γt 1 s g(xs,as; θ0) | According to the definition of the added random noise γt s N(0, σ2), and Lemma B.4, with βt 0, we have the following inequality, | g(xt,i; θ0)/ m, A 1 t Xt 1 s=1 γt 1 s g(xs,as; θ0)/ m | βt g(xt,i; θ0)/ m A 1 t β2 t g(xt,i; θ0)/ m 2 A 1 t 2σ2g(xt,i; θ0) A 1 t (Pt 1 s=1 g(xs,as; θ0)g(xs,as; θ0) /m)A 1 t g(xt,i; θ0)/m 2 exp β2 t 2σ2 where the last inequality stands as, g(xt,i; θ0) A 1 t ( Xt 1 s=1 g(xs,as; θ0)g(xs,as; θ0) /m)A 1 t g(xt,i; θ0)/m g(xt,i; θ0) A 1 t ( Xt 1 s=1 g(xs,as; θ0)g(xs,as; θ0) /m + λI)A 1 t g(xt,i; θ0)/m = g(xt,i; θ0)/ m 2 A 1 t . Therefore, in round t, for the given arm i, Pt | g(xt,i; θ0), A 1 t bt/ m A 1 t bt/ m | βt g(xt,i; θ0)/ m A 1 t 1 exp( β2 t /(2σ2)). Taking the union bound over K arms, we have that for any t: Pt i [K], | g(xt,i; θ0), A 1 t bt/ m A 1 t bt/ m | βt g(xt,i; θ0)/ m A 1 t 1 K exp( β2 t /(2σ2)). By choosing βt = σ 4 log t + 2 log K, we have: Pt i [K], | g(xt,i; θ0), A 1 t bt/ m A 1 t bt/ m | βt g(xt,i; θ0)/ m A 1 t This completes the proof. C.3 PROOF OF LEMMA 4.4 The proof requires the anti-concentration bound of Gaussian distribution. Lemma C.4. For a Gaussian random variable X N(µ, σ2), for any β > 0, σ > β exp( β2) Published as a conference paper at ICLR 2022 Proof of Lemma 4.4. Under event Et,2, according to Lemma 4.2, (C.2) and (C.3), we have, for all i [K], Pt(f(xt,i; θt) > h(xt,a t ) ϵ(m)) Pt( g(xt,i; θ0), A 1 t bt/ m > g(xt,i; θ0), A 1 t bt/ m + αt g(xt,i; θ0)/ m A 1 t ) According to the definition of At, bt and bt, we have, g(xt,i; θ0), A 1 t bt/ m A 1 t bt/ m = mg(xt,i; θ0) A 1 t g(xs,as; θ0) = Ut which follows a Gaussian distribution with mean E[Ut] = 0 and variance Var[Ut] as: Var[Ut] =σ2 ( Xt mg(xt,i; θ0) A 1 t g(xs,as; θ0))2) =σ2 g(xt,i; θ0)/ m 2 A 1 t λσ2 g(xt,i; θ0)/ m 2 A 2 t σ2(1 λλ 1 min(At)) g(xt,i; θ0)/ m 2 A 1 t σ2(1 λλ 1 K (AK)) g(xt,i; θ0)/ m 2 A 1 t , where the second equality holds according to the definition of At, and the first inequality holds as for any positive semi-definite matrix M Rd d, x M2x = λ2 max(M)x (λ 2 max(M)M2)x λmax(M) x 2 M, and λmin(At) λK(AK). 
Therefore, based on Lemma C.4, by choosing σ = 1 λλ 1 K (AK), the target probability could be lower bounded by, Pt(f(xt,j; θt) > h(xt,a t ) ϵ(m)) Pt(Ut > αt g(xt,i; θ0)/ m A 1 t ) =Pt( Ut Var[Ut] > αt g(xt,i; θ0)/ m A 1 t Var[Ut] ) Pt( Ut Var[Ut] > 1) 1 4e π This completes the proof. C.4 PROOF OF LEMMA 4.5 Proof of Lemma 4.5. With the defined sufficiently sampled arms in round t, Ωt, we have the set of undersampled arms Ωt = [K] \ Ωt. We also have the least uncertain and undersampled arm et in round t defined as, et = argmin j Ωt g(xj,t; θ0)/ m A 1 t In round t, it is easy to verify that E[h(xt,a t ) h(xt,at)] E[(h(xt,a t ) h(xt,at))1{Et,2}] + Pt( Et,2). Under event Et,1 and Et,2, we have, h(xt,a t ) h(xt,at) =h(xt,a t ) h(xt,et) + h(xt,et) h(xt,at) et + f(xt,et, θt) f(xt,at, θt) + 2ϵ(m) + (βt + αt)( g(xt,et; θ0)/ m A 1 t + g(xt,at; θ0)/ m A 1 t ) 4ϵ(m) + (βt + αt)(2 g(xt,et; θ0)/ m A 1 t + g(xt,at; θ0)/ m A 1 t ) Next, we will bound E[ g(xt,et; θ0)/ m A 1 t ]. It is easy to see that, E[ g(xt,at; θ0)/ m A 1 t ] =E[ g(xt,at; θ0)/ m A 1 t 1{at Ωt}] + E[ g(xt,at; θ0)/ m A 1 t 1{at Ωt}] g(xt,et; θ0)/ m A 1 t Pt(at Ωt). Published as a conference paper at ICLR 2022 It can be arranged as g(xt,et; θ0)/ m A 1 t E[ g(xt,at; θ0)/ m A 1 t ]/Pt(at Ωt). Hence, the one step regret can be bounded by the following inequality with probability at least 1 δ, E[h(xt,a t ) h(xt,at)] P( Et,2) + 4ϵ(m) + (βt + αt) 1 + 2/P(at Ωt) g(xt,at; θ0)/ m A 1 t . For the probability Pt(at Ωt), we have: Pt(at Ωt) Pt( ai Ωt : f(xt,ai; θt) > max aj Ωt f(xt,aj; θt)) Pt(f(xt,a t ; θt) > max aj Ωt f(xt,aj; θt)) Pt(f(xt,a t ; θt) > max aj Ωt f(xt,aj; θt), Et,2 occurs) Pt(f(xt,a t ; θt) > h(xt,a t ) ϵ(m)) Pt( Et,2) Pt(Et,3) Pt( Et,2) where the fourth inequality holds as for arm aj Ωt, under event Et,1 and Et,2: f(xt,j; θt) h(xt,j) + ϵ(m) + (βt + αt) g(xt,j; θ0)/ m A 1 t h(xt,a t ) ϵ(m). This completes the proof. C.5 PROOF OF LEMMA 4.6 We first need the following lemma from Abbasi-Yadkori et al. (2011). Lemma C.5 (Lemma 11, Abbasi-Yadkori et al. (2011)). We have the following inequality: XT t=1 min g(xt,at; θ0)/ m 2 A 1 t 1, 1 2 log det AT Lemma C.6 (Lemma B.1, Zhou et al. (2020)). Let G = [g(x1; θ0), . . . , g(x K; θ0)]/ m Rp K. Let H be the NTK matrix as defined in Definition B.1. For any δ (0, 1), if m = Ω L6 log(K2L/δ) then with probability at least 1 δ, we have G G H F Kϵ. Proof of Lemma 4.6. Denote G = [g(x1; θ0)/ m, . . . , g(x K; θ0)/ m] Rp K, then we have det λI = log det I + XT t=1 g(xt,at; θ0)g(xt,at; θ0) /(mλ) log det I + XT i=1 g(xi; θ0)g(xi; θ0) /(mλ) = log det I + TGG /λ = log det I + TG G/λ , (C.4) where the inequality holds naively, the third equality holds since for any matrix A Rp K, we have det(I + AA ) = det(I + A A). We can further bound Eq (C.4) as follows: log det I + TG G/λ = log det I + TH/λ + T(G G H)/λ log det I + TH/λ + (I + TH/λ) 1, T(G G H)/λ log det I + TH/λ + (I + TH/λ) 1 F G G H F T/λ log det I + TH/λ + T log det I + TH/λ + 1 = ed log(1 + TK/λ) + 1, Published as a conference paper at ICLR 2022 where the first inequality holds due to the concavity of log det( ), the second inequality holds due to the fact that A, B A F B F , the third inequality holds due to the facts that I + H/λ I, λ 1 and A F K A 2 for any A RK K, the fourth inequality holds by Lemma C.6 with the required choice of m, the fifth inequality holds by the definition of effective dimension in Definition B.3. This completes the proof.