Published as a conference paper at ICLR 2021

NEURAL THOMPSON SAMPLING

Weitong Zhang
Department of Computer Science
University of California, Los Angeles
Los Angeles, CA, USA, 90095
wt.zhang@ucla.edu

Dongruo Zhou
Department of Computer Science
University of California, Los Angeles
Los Angeles, CA, USA, 90095
drzhou@cs.ucla.edu

Lihong Li
Google Research
USA
lihong@google.com

Quanquan Gu
Department of Computer Science
University of California, Los Angeles
Los Angeles, CA, USA, 90095
qgu@cs.ucla.edu

ABSTRACT

Thompson Sampling (TS) is one of the most effective algorithms for solving contextual multi-armed bandit problems. In this paper, we propose a new algorithm, called Neural Thompson Sampling, which adapts deep neural networks for both exploration and exploitation. At the core of our algorithm is a novel posterior distribution of the reward, whose mean is the neural network approximator and whose variance is built upon the neural tangent features of the corresponding neural network. We prove that, provided the underlying reward function is bounded, the proposed algorithm is guaranteed to achieve a cumulative regret of $\mathcal{O}(T^{1/2})$, which matches the regret of other contextual bandit algorithms in terms of the total number of rounds $T$. Experimental comparisons with other benchmark bandit algorithms on various data sets corroborate our theory.

1 INTRODUCTION

The stochastic multi-armed bandit (Bubeck & Cesa-Bianchi, 2012; Lattimore & Szepesvári, 2020) has been extensively studied as an important model for optimizing the trade-off between exploration and exploitation in sequential decision making. Among its many variants, the contextual bandit is widely used in real-world applications such as recommendation (Li et al., 2010), advertising (Graepel et al., 2010), robotic control (Mahler et al., 2016), and healthcare (Greenewald et al., 2017). In each round of a contextual bandit, the agent observes a feature vector (the context) for each of the $K$ arms, pulls one of them, and in return receives a scalar reward. The goal is to maximize the cumulative reward, or minimize the regret (to be defined later), over a total of $T$ rounds. To do so, the agent must trade off exploration against exploitation.

One of the most effective and widely used techniques is Thompson Sampling, or TS (Thompson, 1933). The basic idea is to compute the posterior probability of each arm being optimal for the present context and to sample an arm from this distribution. TS is often easy to implement and has found great success in practice (Chapelle & Li, 2011; Graepel et al., 2010; Kawale et al., 2015; Russo et al., 2017). Recently, a series of works has applied TS or its variants to exploration in contextual bandits with neural network models (Blundell et al., 2015; Kveton et al., 2020; Lu & Van Roy, 2017; Riquelme et al., 2018). Riquelme et al. (2018) proposed Neural Linear, which maintains a neural network and chooses the best arm in each round according to a Bayesian linear regression on top of the last network layer. Kveton et al. (2020) proposed Deep FPL, which trains a neural network on perturbed training data and chooses the best arm in each round based on the network output. Similar approaches have also been used in more general reinforcement learning problems (e.g., Azizzadenesheli et al., 2018; Fortunato et al., 2018; Lipton et al., 2018; Osband et al., 2016a).
Despite the reported empirical success, strong regret guarantees for TS remain limited to relatively simple models, under fairly restrictive assumptions on the reward function. Examples are linear functions (Abeille & Lazaric, 2017; Agrawal & Goyal, 2013; Kocák et al., 2014; Russo & Van Roy, 2014), generalized linear functions (Kveton et al., 2020; Russo & Van Roy, 2014), or functions with small RKHS norm induced by a properly selected kernel (Chowdhury & Gopalan, 2017).

In this paper, we provide, to the best of our knowledge, the first near-optimal regret bound for neural network-based Thompson Sampling. Our contributions are threefold. First, we propose a new algorithm, Neural Thompson Sampling (Neural TS), to combine TS exploration with neural networks. It differs from Neural Linear (Riquelme et al., 2018) by considering weight uncertainty in all layers, and from other neural network-based TS implementations (Blundell et al., 2015; Kveton et al., 2020) by sampling the estimated reward from the posterior (as opposed to sampling parameters). Second, we give a regret analysis for the algorithm and obtain an $\tilde{O}(\tilde{d}\sqrt{T})$ regret, where $\tilde{d}$ is the effective dimension and $T$ is the number of rounds. This result is comparable to previous bounds when specialized to the simpler, linear setting, where the effective dimension coincides with the feature dimension (Agrawal & Goyal, 2013; Chowdhury & Gopalan, 2017). Finally, we corroborate the analysis with an empirical evaluation of the algorithm on several benchmarks. Experiments show that Neural TS yields competitive performance in comparison with state-of-the-art baselines, suggesting its practical value in addition to its strong theoretical guarantees.

Notation: Scalars and constants are denoted by lower and upper case letters, respectively. Vectors are denoted by lower case bold face letters $\mathbf{x}$, and matrices by upper case bold face letters $\mathbf{A}$. We denote by $[k]$ the set $\{1, 2, \ldots, k\}$ for a positive integer $k$. For two non-negative sequences $\{a_n\}, \{b_n\}$, $a_n = O(b_n)$ means that there exists a positive constant $C$ such that $a_n \le C b_n$, and we use $\tilde{O}(\cdot)$ to hide logarithmic factors in $O(\cdot)$. We denote by $\|\cdot\|_2$ the Euclidean norm of a vector and the spectral norm of a matrix, and by $\|\cdot\|_F$ the Frobenius norm of a matrix.

2 PROBLEM SETTING AND PROPOSED ALGORITHM

In this work, we consider contextual $K$-armed bandits, where the total number of rounds $T$ is known. At round $t \in [T]$, the agent observes $K$ contextual vectors $\{\mathbf{x}_{t,k} \in \mathbb{R}^d \mid k \in [K]\}$. The agent then selects an arm $a_t$ and receives a reward $r_{t,a_t}$. Our goal is to minimize the following pseudo regret:

$R_T = \mathbb{E}\Big[\sum_{t=1}^{T}\big(r_{t,a_t^*} - r_{t,a_t}\big)\Big]$,   (2.1)

where $a_t^*$ is the optimal arm at round $t$, i.e., the arm with the maximum expected reward: $a_t^* = \mathrm{argmax}_{a \in [K]} \mathbb{E}[r_{t,a}]$.

To estimate the unknown reward given a contextual vector $\mathbf{x}$, we use a fully connected neural network $f(\mathbf{x}; \boldsymbol{\theta})$ of depth $L \ge 2$, defined recursively by

$\mathbf{f}_1 = \mathbf{W}_1 \mathbf{x}, \qquad \mathbf{f}_l = \mathbf{W}_l\,\mathrm{ReLU}(\mathbf{f}_{l-1}),\ 2 \le l \le L, \qquad f(\mathbf{x}; \boldsymbol{\theta}) = \sqrt{m}\, \mathbf{f}_L$,   (2.2)

where $\mathrm{ReLU}(x) := \max\{x, 0\}$, $m$ is the width of the neural network, $\mathbf{W}_1 \in \mathbb{R}^{m \times d}$, $\mathbf{W}_l \in \mathbb{R}^{m \times m}$ for $2 \le l < L$, $\mathbf{W}_L \in \mathbb{R}^{1 \times m}$, and $\boldsymbol{\theta} = (\mathrm{vec}(\mathbf{W}_1); \cdots; \mathrm{vec}(\mathbf{W}_L)) \in \mathbb{R}^p$ is the collection of parameters of the neural network, with $p = dm + m^2(L-2) + m$. We write $\mathbf{g}(\mathbf{x}; \boldsymbol{\theta}) = \nabla_{\boldsymbol{\theta}} f(\mathbf{x}; \boldsymbol{\theta})$ for the gradient of $f(\mathbf{x}; \boldsymbol{\theta})$ with respect to $\boldsymbol{\theta}$.
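To make the architecture in (2.2) concrete, the following PyTorch sketch (ours, not the authors' released code; the class name, default width, and the output-zeroing symmetric initialization are our reading of line 2 of Algorithm 1 below) builds the width-$m$ network and its forward pass:

```python
import torch
import torch.nn as nn

class NeuralTSNetwork(nn.Module):
    """Width-m ReLU network of Eq. (2.2): f(x; theta) = sqrt(m) * W_L ReLU(... ReLU(W_1 x))."""

    def __init__(self, d, m=100, L=2):
        super().__init__()
        assert L >= 2 and m % 2 == 0 and d % 2 == 0
        self.m = m
        layers = [nn.Linear(d, m, bias=False)]
        layers += [nn.Linear(m, m, bias=False) for _ in range(L - 2)]
        layers += [nn.Linear(m, 1, bias=False)]
        self.layers = nn.ModuleList(layers)
        self._symmetric_init()

    def _symmetric_init(self):
        # Algorithm 1, line 2: W_l = [[W, 0], [0, W]] with entries ~ N(0, 4/m) for l < L,
        # and W_L = (w, -w) with entries ~ N(0, 2/m), so that f(x; theta_0) = 0 for contexts
        # whose two halves coincide (Assumption 3.4 below).
        m = self.m
        for layer in self.layers[:-1]:
            in_dim = layer.weight.shape[1]
            W = torch.randn(m // 2, in_dim // 2) * (4.0 / m) ** 0.5
            block = torch.zeros(m, in_dim)
            block[: m // 2, : in_dim // 2] = W
            block[m // 2:, in_dim // 2:] = W
            layer.weight.data = block
        w = torch.randn(1, m // 2) * (2.0 / m) ** 0.5
        self.layers[-1].weight.data = torch.cat([w, -w], dim=1)

    def forward(self, x):
        for layer in self.layers[:-1]:
            x = torch.relu(layer(x))
        return (self.m ** 0.5) * self.layers[-1](x)
```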
Our Neural Thompson Sampling is given in Algorithm 1. It maintains a Gaussian posterior distribution for each arm's reward. When selecting an arm, it samples the reward of each arm from its posterior distribution and then pulls the greedy arm (lines 4-8). Once the reward is observed, it updates the posterior (lines 9 and 10). The mean of the posterior distribution is set to the output of the neural network, whose parameter is the solution to the following minimization problem:

$\min_{\boldsymbol{\theta}}\ \mathcal{L}(\boldsymbol{\theta}) = \sum_{i=1}^{t}\big[f(\mathbf{x}_{i,a_i}; \boldsymbol{\theta}) - r_{i,a_i}\big]^2/2 + m\lambda\|\boldsymbol{\theta} - \boldsymbol{\theta}_0\|_2^2/2$.   (2.3)

We can see that (2.3) is an $\ell_2$-regularized squared loss minimization problem, where the regularization term is centered at the randomly initialized network parameter $\boldsymbol{\theta}_0$. We apply gradient descent to solve (2.3), with step size $\eta$ and a total of $J$ iterations.

Algorithm 1 Neural Thompson Sampling (Neural TS)
Input: number of rounds $T$, exploration variance $\nu$, network width $m$, regularization parameter $\lambda$.
1: Set $\mathbf{U}_0 = \lambda \mathbf{I}$
2: Initialize $\boldsymbol{\theta}_0 = (\mathrm{vec}(\mathbf{W}_1); \cdots; \mathrm{vec}(\mathbf{W}_L)) \in \mathbb{R}^p$, where for each $1 \le l \le L-1$, $\mathbf{W}_l = \begin{pmatrix}\mathbf{W} & \mathbf{0}\\ \mathbf{0} & \mathbf{W}\end{pmatrix}$ with each entry of $\mathbf{W}$ generated independently from $N(0, 4/m)$, and $\mathbf{W}_L = (\mathbf{w}^\top, -\mathbf{w}^\top)$ with each entry of $\mathbf{w}$ generated independently from $N(0, 2/m)$
3: for $t = 1, \ldots, T$ do
4:   for $k = 1, \ldots, K$ do
5:     $\sigma^2_{t,k} = \lambda\, \mathbf{g}^\top(\mathbf{x}_{t,k}; \boldsymbol{\theta}_{t-1})\, \mathbf{U}_{t-1}^{-1}\, \mathbf{g}(\mathbf{x}_{t,k}; \boldsymbol{\theta}_{t-1})/m$
6:     Sample estimated reward $\tilde{r}_{t,k} \sim N\big(f(\mathbf{x}_{t,k}; \boldsymbol{\theta}_{t-1}), \nu^2 \sigma^2_{t,k}\big)$
7:   end for
8:   Pull arm $a_t$ and receive reward $r_{t,a_t}$, where $a_t = \mathrm{argmax}_a \tilde{r}_{t,a}$
9:   Set $\boldsymbol{\theta}_t$ to be the output of gradient descent for solving (2.3)
10:  $\mathbf{U}_t = \mathbf{U}_{t-1} + \mathbf{g}(\mathbf{x}_{t,a_t}; \boldsymbol{\theta}_t)\, \mathbf{g}(\mathbf{x}_{t,a_t}; \boldsymbol{\theta}_t)^\top/m$
11: end for

A few observations about our algorithm are in order. First, compared with typical ways of implementing Thompson Sampling with neural networks, Neural TS samples from the posterior distribution of the scalar reward instead of the network parameters. It is therefore simpler and more efficient, as the number of parameters can be very large in practice. Second, the algorithm maintains posterior distributions related to the parameters of all layers of the network, as opposed to the last layer only (Riquelme et al., 2018). This difference is crucial in our regret analysis: it allows us to build a connection between Algorithm 1 and recent work on deep learning theory (Allen-Zhu et al., 2018; Cao & Gu, 2019) in order to obtain theoretical guarantees, as will be shown in the next section. Third, unlike linear or kernelized TS (Agrawal & Goyal, 2013; Chowdhury & Gopalan, 2017), whose posteriors can be computed in closed form, Neural TS solves a non-convex optimization problem (2.3) by gradient descent. This difference requires additional techniques in the regret analysis. Moreover, stochastic gradient descent can be used to solve the optimization problem with a similar theoretical guarantee (Allen-Zhu et al., 2018; Du et al., 2018; Zou et al., 2019). For simplicity of exposition, we focus on the exact gradient descent approach.
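To illustrate how the pieces of Algorithm 1 fit together, here is a minimal sketch of the main loop around any scalar-output network such as the one above. It is an illustration under our own assumptions: the `get_contexts`/`get_reward` interface and the hyperparameter defaults are hypothetical, the matrix $\mathbf{U}_t$ is kept exactly (the experiments in Section 5 instead approximate $\mathbf{U}^{-1}$ by the inverse of its diagonal for speed), and the inner loop is plain gradient descent on the loss (2.3).

```python
import torch

def flat_grad(net, x):
    """g(x; theta): gradient of the scalar network output w.r.t. all parameters, flattened."""
    net.zero_grad()
    net(x.unsqueeze(0)).sum().backward()
    return torch.cat([p.grad.reshape(-1) for p in net.parameters()])

def neural_ts(net, get_contexts, get_reward, T, m, lam=1.0, nu=0.1, eta=1e-3, J=100):
    p = sum(q.numel() for q in net.parameters())
    U = lam * torch.eye(p)                               # U_0 = lambda * I
    theta0 = [q.detach().clone() for q in net.parameters()]
    history = []                                          # (pulled context, observed reward)
    for t in range(1, T + 1):
        X = get_contexts(t)                               # (K, d) tensor of arm contexts
        scores = []
        for k in range(X.shape[0]):
            g = flat_grad(net, X[k])
            sigma2 = lam * g @ torch.linalg.solve(U, g) / m      # line 5 of Algorithm 1
            mean = net(X[k].unsqueeze(0)).item()
            scores.append(mean + nu * sigma2.sqrt() * torch.randn(()))  # line 6: posterior sample
        a = int(torch.stack(scores).argmax())             # line 8: pull the greedy arm
        r = get_reward(t, a)
        history.append((X[a], r))
        xs = torch.stack([x for x, _ in history])
        rs = torch.tensor([r_ for _, r_ in history])
        opt = torch.optim.SGD(net.parameters(), lr=eta)
        for _ in range(J):                                # line 9: J steps of GD on the loss (2.3)
            opt.zero_grad()
            loss = 0.5 * ((net(xs).squeeze(-1) - rs) ** 2).sum()
            loss = loss + 0.5 * m * lam * sum(((q - q0) ** 2).sum()
                                              for q, q0 in zip(net.parameters(), theta0))
            loss.backward()
            opt.step()
        g = flat_grad(net, X[a])
        U = U + torch.outer(g, g) / m                     # line 10: rank-one update of U_t
    return net
```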
3 REGRET ANALYSIS

In this section, we provide a regret analysis of Neural TS. We assume that there exists an unknown reward function $h$ such that for any $1 \le t \le T$ and $1 \le k \le K$, $r_{t,k} = h(\mathbf{x}_{t,k}) + \xi_{t,k}$ with $|h(\mathbf{x}_{t,k})| \le 1$, where $\{\xi_{t,k}\}$ forms an $R$-sub-Gaussian martingale difference sequence with constant $R > 0$, i.e., $\mathbb{E}[\exp(\lambda\xi_{t,k}) \mid \xi_{1:t-1,k}, \mathbf{x}_{1:t,k}] \le \exp(\lambda^2 R^2)$ for all $\lambda \in \mathbb{R}$. Such an assumption on the noise sequence is widely adopted in the contextual bandit literature (Agrawal & Goyal, 2013; Bubeck & Cesa-Bianchi, 2012; Chowdhury & Gopalan, 2017; Chu et al., 2011; Lattimore & Szepesvári, 2020; Valko et al., 2013). Next, we provide necessary background on neural tangent kernel (NTK) theory (Jacot et al., 2018), which plays a crucial role in our analysis.

In the analysis, we denote by $\{\mathbf{x}^i\}_{i=1}^{TK}$ the set of observed contexts of all arms and all rounds, $\{\mathbf{x}_{t,k}\}_{1\le t\le T, 1\le k\le K}$, where $i = K(t-1) + k$.

Definition 3.1 (Jacot et al. (2018)). Define

$\tilde{\mathbf{H}}^{(1)}_{i,j} = \boldsymbol{\Sigma}^{(1)}_{i,j} = \langle \mathbf{x}^i, \mathbf{x}^j\rangle, \qquad \mathbf{A}^{(l)}_{i,j} = \begin{pmatrix}\boldsymbol{\Sigma}^{(l)}_{i,i} & \boldsymbol{\Sigma}^{(l)}_{i,j}\\ \boldsymbol{\Sigma}^{(l)}_{i,j} & \boldsymbol{\Sigma}^{(l)}_{j,j}\end{pmatrix}$,

$\boldsymbol{\Sigma}^{(l+1)}_{i,j} = 2\,\mathbb{E}_{(u,v)\sim N(\mathbf{0}, \mathbf{A}^{(l)}_{i,j})}\big[\max\{u,0\}\max\{v,0\}\big]$,

$\tilde{\mathbf{H}}^{(l+1)}_{i,j} = 2\,\tilde{\mathbf{H}}^{(l)}_{i,j}\,\mathbb{E}_{(u,v)\sim N(\mathbf{0}, \mathbf{A}^{(l)}_{i,j})}\big[\mathbb{1}(u \ge 0)\mathbb{1}(v \ge 0)\big] + \boldsymbol{\Sigma}^{(l+1)}_{i,j}$.

Then $\mathbf{H} = (\tilde{\mathbf{H}}^{(L)} + \boldsymbol{\Sigma}^{(L)})/2$ is called the neural tangent kernel matrix on the context set.

The NTK technique builds a connection between deep neural networks and kernel methods. It enables us to adapt complexity measures for kernel methods to describe the complexity of the neural network, as given by the following definition.

Definition 3.2. The effective dimension $\tilde{d}$ of the matrix $\mathbf{H}$ with regularization parameter $\lambda$ is defined as

$\tilde{d} = \dfrac{\log\det(\mathbf{I} + \mathbf{H}/\lambda)}{\log(1 + TK/\lambda)}$.

Remark 3.3. The effective dimension is a metric describing the actual underlying dimension of the set of observed contexts, and has been used by Valko et al. (2013) for the analysis of kernel UCB. Our definition here is adapted from Yang & Wang (2019), which also considers UCB-based exploration. Compared with the maximum information gain $\gamma_T$ used in Chowdhury & Gopalan (2017), one can verify from their Lemma 3 that $\gamma_T \ge \log\det(\mathbf{I} + \mathbf{H}/\lambda)/2$; therefore, $\gamma_T$ and $\tilde{d}$ are of the same order up to a ratio of $1/(2\log(1 + TK/\lambda))$. Furthermore, $\tilde{d}$ can be upper bounded when all contexts $\mathbf{x}^i$ lie nearly on a low-dimensional subspace of the RKHS induced by the NTK (Appendix D).

We make the following regularity assumption on the contexts and the corresponding NTK matrix $\mathbf{H}$.

Assumption 3.4. Let $\mathbf{H}$ be defined as in Definition 3.1. There exists $\lambda_0 > 0$ such that $\mathbf{H} \succeq \lambda_0 \mathbf{I}$. In addition, for any $t \in [T]$ and $k \in [K]$, $\|\mathbf{x}_{t,k}\|_2 = 1$ and $[\mathbf{x}_{t,k}]_j = [\mathbf{x}_{t,k}]_{j+d/2}$.

The assumption that the NTK matrix is positive definite has been considered in prior work on the NTK (Arora et al., 2019; Du et al., 2018). The assumption on the context $\mathbf{x}_{t,k}$ ensures that the initial output of the neural network, $f(\mathbf{x}; \boldsymbol{\theta}_0)$, is zero under the random initialization suggested in Algorithm 1. The condition on $\mathbf{x}$ is easy to satisfy: for any context $\mathbf{x}$, one can always construct a new context $\tilde{\mathbf{x}} = \big[\mathbf{x}^\top/(\sqrt{2}\|\mathbf{x}\|_2),\ \mathbf{x}^\top/(\sqrt{2}\|\mathbf{x}\|_2)\big]^\top$, which satisfies both conditions.

We are now ready to present the main result of the paper:

Theorem 3.5. Under Assumption 3.4, set the parameters in Algorithm 1 as $\lambda = 1 + 1/T$ and

$\nu = B + R\sqrt{\tilde{d}\log(1 + TK/\lambda) + 2 + 2\log(1/\delta)}$,

where $B = \max\big\{1/(2\sqrt{2e\pi}),\ \sqrt{2\mathbf{h}^\top\mathbf{H}^{-1}\mathbf{h}}\big\}$ with $\mathbf{h} = (h(\mathbf{x}^1), \ldots, h(\mathbf{x}^{TK}))^\top$, and $R$ is the sub-Gaussian parameter. In line 9 of Algorithm 1, set $\eta = C_1(m\lambda + mLT)^{-1}$ and $J = (1 + LT/\lambda)\big(C_2 + \log(T^3 L\lambda^{-1}\log(1/\delta))\big)/C_1$ for some positive constants $C_1, C_2$. If the network width $m$ satisfies $m \ge \mathrm{poly}\big(\lambda, T, K, L, \log(1/\delta), \lambda_0^{-1}\big)$, then, with probability at least $1 - \delta$, the regret of Algorithm 1 is bounded as

$R_T \le C_3(1 + c_T)\nu\sqrt{2\lambda L\big(\tilde{d}\log(1 + TK) + 1\big)T} + \big(4 + C_4(1 + c_T)\nu L\big)\sqrt{2\log(3/\delta)T} + 5$,

where $C_3, C_4$ are positive absolute constants and $c_T = \sqrt{4\log T + 2\log K}$.

Remark 3.6. The definition of $B$ in Theorem 3.5 is inspired by the RKHS norm of the reward function defined in Chowdhury & Gopalan (2017). It can be verified that when the reward function $h$ belongs to the function space induced by the NTK, i.e., $\|h\|_{\mathbf{H}} < \infty$, we have $\sqrt{\mathbf{h}^\top\mathbf{H}^{-1}\mathbf{h}} \le \|h\|_{\mathbf{H}}$ according to Zhou et al. (2019), which suggests that $B \le \max\{1/(2\sqrt{2e\pi}), \sqrt{2}\|h\|_{\mathbf{H}}\}$.

Remark 3.7. Theorem 3.5 implies that the regret of Neural TS is of order $\tilde{O}(\tilde{d}\sqrt{T})$. This result matches the state-of-the-art regret bounds in Chowdhury & Gopalan (2017); Agrawal & Goyal (2013); Zhou et al. (2019); Kveton et al. (2020).
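As a concrete illustration of Definitions 3.1 and 3.2, the following NumPy sketch (ours) evaluates the NTK matrix $\mathbf{H}$ for a set of unit-norm contexts via the layer-wise recursion, using the standard closed forms of the two Gaussian expectations (arc-cosine kernels), and then computes the effective dimension $\tilde{d}$:

```python
import numpy as np

def ntk_gram(X, L=2):
    """NTK matrix H of Definition 3.1 for unit-norm contexts X of shape (n, d) and depth L."""
    S = X @ X.T                          # Sigma^(1) = <x_i, x_j>
    H = S.copy()                         # tilde H^(1)
    for _ in range(L - 1):
        norms = np.sqrt(np.outer(np.diag(S), np.diag(S)))
        rho = np.clip(S / norms, -1.0, 1.0)
        # Closed forms for (u, v) ~ N(0, A^(l)):
        #   2 E[max(u,0) max(v,0)] = sqrt(S_ii S_jj)/pi * (sqrt(1 - rho^2) + rho (pi - arccos rho))
        #   2 E[1(u>=0) 1(v>=0)]   = (pi - arccos rho)/pi
        S_next = norms / np.pi * (np.sqrt(1.0 - rho ** 2) + rho * (np.pi - np.arccos(rho)))
        dot = (np.pi - np.arccos(rho)) / np.pi
        H = H * dot + S_next             # tilde H^(l+1)
        S = S_next                       # Sigma^(l+1)
    return (H + S) / 2.0                 # H = (tilde H^(L) + Sigma^(L)) / 2

def effective_dimension(H, lam, T, K):
    """Effective dimension of Definition 3.2, with H built from the TK observed contexts."""
    n = H.shape[0]
    _, logdet = np.linalg.slogdet(np.eye(n) + H / lam)
    return logdet / np.log(1.0 + T * K / lam)
```

On contexts that lie close to a low-dimensional subspace, $\tilde{d}$ computed this way stays far smaller than $TK$, which is the regime in which the bound of Theorem 3.5 is most informative.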
Remark 3.8. In Theorem 3.5, the requirement on $m$ is specified in Condition 4.1 and the proof of Theorem 3.5; it is a high-degree polynomial in the time horizon $T$, the number of layers $L$, and the number of actions $K$. However, in our experiments we can choose a reasonably small $m$ (e.g., $m = 100$) and still obtain good performance from Neural TS; see Appendix A.1 for more details. This discrepancy between theory and practice is due to the limitation of current NTK theory (Du et al., 2018; Allen-Zhu et al., 2018; Zou et al., 2019). Closing the gap is an avenue for future work.

Remark 3.9. Theorem 3.5 suggests that we need to know $T$ before running the algorithm in order to set $m$. When $T$ is unknown, we can use the standard doubling trick (see, e.g., Cesa-Bianchi & Lugosi (2006)) to set $m$ adaptively. In detail, we decompose the time interval $(0, +\infty)$ as a union of non-overlapping intervals $[2^s, 2^{s+1})$. When $2^s \le t < 2^{s+1}$, we restart Neural TS with input $T = 2^{s+1}$. It can be verified that a similar $\tilde{O}(\tilde{d}\sqrt{T})$ regret still holds.

4 PROOF OF THE MAIN THEOREM

This section sketches the proof of Theorem 3.5, with supporting lemmas and technical details provided in Appendix B. While the proof roadmap is similar to previous work on Thompson Sampling (e.g., Agrawal & Goyal, 2013; Chowdhury & Gopalan, 2017; Kocák et al., 2014; Kveton et al., 2020), our proof needs to carefully track the error incurred by the neural network in approximating the reward function. To control this approximation error, the following condition on the network width is required in several technical lemmas.

Condition 4.1. The network width $m$ satisfies

$m \ge C\max\big\{\lambda L^{-3/2}[\log(TKL^2/\delta)]^{3/2},\ T^6K^6L^6\log(TKL/\delta)\max\{\lambda_0^{-4}, 1\}\big\}$,

$m[\log m]^{-3} \ge CTL^{12}\lambda^{-1} + CT^7\lambda^{-8}L^{18}(\lambda + LT)^6 + CL^{21}T^7\lambda^{-7}\big(1 + \sqrt{T/\lambda}\big)^6$,

where $C$ is a positive absolute constant.

For any $t$, we define an event $E^\sigma_t$ as follows:

$E^\sigma_t = \big\{\omega \in \mathcal{F}_{t+1} : \forall k \in [K],\ |\tilde{r}_{t,k} - f(\mathbf{x}_{t,k}; \boldsymbol{\theta}_{t-1})| \le c_t\nu\sigma_{t,k}\big\}$,   (4.1)

where $c_t = \sqrt{4\log t + 2\log K}$. Under event $E^\sigma_t$, the difference between the sampled reward $\tilde{r}_{t,k}$ and the estimated mean reward $f(\mathbf{x}_{t,k}; \boldsymbol{\theta}_{t-1})$ is controlled by the reward's posterior standard deviation. We also define an event $E^\mu_t$ as follows:

$E^\mu_t = \big\{\omega \in \mathcal{F}_t : \forall k \in [K],\ |f(\mathbf{x}_{t,k}; \boldsymbol{\theta}_{t-1}) - h(\mathbf{x}_{t,k})| \le \nu\sigma_{t,k} + \epsilon(m)\big\}$,   (4.2)

where $\epsilon(m)$ is defined as

$\epsilon(m) = \epsilon_p(m) + C_{\epsilon,1}(1 - \eta m\lambda)^J\sqrt{TL/\lambda}$,
$\epsilon_p(m) = C_{\epsilon,2}T^{2/3}m^{-1/6}\lambda^{-2/3}L^3\sqrt{\log m} + C_{\epsilon,3}m^{-1/6}\sqrt{\log m}\,L^4T^{5/3}\lambda^{-5/3}\big(1 + \sqrt{T/\lambda}\big)$
$\qquad + C_{\epsilon,4}\Big(B + R\sqrt{\log\det(\mathbf{I} + \mathbf{H}/\lambda) + 2 + 2\log(1/\delta)}\Big)\sqrt{\log m}\,T^{7/6}m^{-1/6}\lambda^{-2/3}L^{9/2}$,   (4.3)

and $\{C_{\epsilon,i}\}_{i=1}^4$ are positive absolute constants. Under event $E^\mu_t$, the estimated mean reward $f(\mathbf{x}_{t,k}; \boldsymbol{\theta}_{t-1})$ produced by the neural network is close to the true expected reward $h(\mathbf{x}_{t,k})$. Note that the additional term $\epsilon(m)$ is the error of the neural network in approximating the true reward function. This is a key difference of our proof from previous regret analyses of Thompson Sampling (Agrawal & Goyal, 2013; Chowdhury & Gopalan, 2017), where there is no approximation error. The following two lemmas show that both events $E^\sigma_t$ and $E^\mu_t$ hold with high probability.

Lemma 4.2. For any $t \in [T]$, $\Pr\big(E^\sigma_t \mid \mathcal{F}_t\big) \ge 1 - t^{-2}$.

Lemma 4.3. Suppose the width of the neural network $m$ satisfies Condition 4.1 and set $\eta = C(m\lambda + mLT)^{-1}$. Then we have $\Pr\big(\forall t \in [T],\ E^\mu_t\big) \ge 1 - \delta$, where $C$ is a positive absolute constant.

The next lemma gives a lower bound on the probability that the sampled reward $\tilde{r}$ exceeds the true reward up to the approximation error $\epsilon(m)$.

Lemma 4.4. For any $t \in [T]$ and $k \in [K]$, we have $\Pr\big(\tilde{r}_{t,k} + \epsilon(m) > h(\mathbf{x}_{t,k}) \mid \mathcal{F}_t, E^\mu_t\big) \ge (4e\sqrt{\pi})^{-1}$.
Following Agrawal & Goyal (2013), for any round $t$ we divide the arms into two groups, saturated and unsaturated arms, based on whether the standard deviation of the estimate for an arm is smaller than the standard deviation for the optimal arm or not. Note that the optimal arm is always included in the group of unsaturated arms. More specifically, we define the set of saturated arms $S_t$ as follows:

$S_t = \big\{k \mid k \in [K],\ h(\mathbf{x}_{t,a_t^*}) - h(\mathbf{x}_{t,k}) > (1 + c_t)\nu\sigma_{t,k} + 2\epsilon(m)\big\}$.   (4.4)

Note that we have taken the approximation error $\epsilon(m)$ into account when defining saturated arms, which differs from the Thompson Sampling literature (Agrawal & Goyal, 2013; Chowdhury & Gopalan, 2017). It is now easy to show that the immediate regret of playing an unsaturated arm can be bounded by its standard deviation plus the approximation error $\epsilon(m)$. The following lemma shows that the probability of pulling a saturated arm in Algorithm 1 is small.

Lemma 4.5. Let $a_t$ be the arm pulled at round $t \in [T]$. Then

$\Pr\big(a_t \notin S_t \mid \mathcal{F}_t, E^\mu_t\big) \ge \dfrac{1}{4e\sqrt{\pi}} - \dfrac{1}{t^2}$.

The next lemma bounds the expected regret at each round conditioned on $E^\mu_t$.

Lemma 4.6. Suppose the width of the neural network $m$ satisfies Condition 4.1 and set $\eta = C_1(m\lambda + mLT)^{-1}$. Then, with probability at least $1 - \delta$, we have for all $t \in [T]$ that

$\mathbb{E}\big[h(\mathbf{x}_{t,a_t^*}) - h(\mathbf{x}_{t,a_t}) \mid \mathcal{F}_t, E^\mu_t\big] \le C_2(1 + c_t)\nu\sqrt{L}\,\mathbb{E}\big[\min\{\sigma_{t,a_t}, 1\} \mid \mathcal{F}_t, E^\mu_t\big] + 4\epsilon(m) + 2t^{-2}$,

where $C_1, C_2$ are positive absolute constants.

Based on Lemma 4.6, we define

$\Delta_t := \big(h(\mathbf{x}_{t,a_t^*}) - h(\mathbf{x}_{t,a_t})\big)\,\mathbb{1}(E^\mu_t)$,
$X_t := \Delta_t - \bar{C}(1 + c_t)\nu\sqrt{L}\min\{\sigma_{t,a_t}, 1\} - 4\epsilon(m) - 2t^{-2}, \qquad Y_t = \sum_{i=1}^{t} X_i$,   (4.5)

where $\bar{C}$ is the constant $C_2$ in Lemma 4.6. By Lemma 4.6, we can verify that, with probability at least $1 - \delta$, $\{Y_t\}$ forms a super-martingale sequence, since $\mathbb{E}[Y_t - Y_{t-1}] = \mathbb{E}[X_t] \le 0$. By the Azuma-Hoeffding inequality (Hoeffding, 1963), we can prove the following lemma.

Lemma 4.7. Suppose the width of the neural network $m$ satisfies Condition 4.1 and set $\eta = C_1(m\lambda + mLT)^{-1}$. Then, with probability at least $1 - \delta$,

$\sum_{t=1}^{T}\Delta_t \le 4T\epsilon(m) + \pi^2/3 + C_2(1 + c_T)\nu\sqrt{L}\sum_{t=1}^{T}\min\{\sigma_{t,a_t}, 1\} + \big(4 + C_3(1 + c_T)\nu L + 4\epsilon(m)\big)\sqrt{2\log(1/\delta)T}$,

where $C_1, C_2, C_3$ are positive absolute constants.

The last lemma is used to control $\sum_{t=1}^{T}\min\{\sigma_{t,a_t}, 1\}$ in Lemma 4.7.

Lemma 4.8. Suppose the width of the neural network $m$ satisfies Condition 4.1 and set $\eta = C_1(m\lambda + mLT)^{-1}$. Then, with probability at least $1 - \delta$,

$\sum_{t=1}^{T}\min\{\sigma_{t,a_t}, 1\} \le \sqrt{2\lambda T\big(\tilde{d}\log(1 + TK) + 1\big)} + C_2 T^{13/6}\sqrt{\log m}\,m^{-1/6}\lambda^{-2/3}L^{9/2}$,

where $C_1, C_2$ are positive absolute constants.

With all the above lemmas in place, we are ready to prove Theorem 3.5.

Proof of Theorem 3.5. By Lemma 4.3, $E^\mu_t$ holds for all $t \in [T]$ with probability at least $1 - \delta$. Therefore, with probability at least $1 - \delta$, we have

$R_T = \sum_{t=1}^{T}\big(h(\mathbf{x}_{t,a_t^*}) - h(\mathbf{x}_{t,a_t})\big)\,\mathbb{1}(E^\mu_t)$
$\le 4T\epsilon(m) + \pi^2/3 + C_1(1 + c_T)\nu\sqrt{L}\sum_{t=1}^{T}\min\{\sigma_{t,a_t}, 1\} + \big(4 + C_2(1 + c_T)\nu L + 4\epsilon(m)\big)\sqrt{2\log(1/\delta)T}$
$\le C_1(1 + c_T)\nu\sqrt{L}\Big(\sqrt{2\lambda T\big(\tilde{d}\log(1 + TK) + 1\big)} + C_3 T^{13/6}\sqrt{\log m}\,m^{-1/6}\lambda^{-2/3}L^{9/2}\Big) + \pi^2/3 + 4T\epsilon(m) + 4\epsilon(m)\sqrt{2\log(1/\delta)T} + \big(4 + C_2(1 + c_T)\nu L\big)\sqrt{2\log(1/\delta)T}$
$= C_1(1 + c_T)\nu\sqrt{L}\Big(\sqrt{2\lambda T\big(\tilde{d}\log(1 + TK) + 1\big)} + C_3 T^{13/6}\sqrt{\log m}\,m^{-1/6}\lambda^{-2/3}L^{9/2}\Big) + \pi^2/3 + \epsilon_p(m)\big(4T + \sqrt{2\log(1/\delta)T}\big) + \big(4 + C_2(1 + c_T)\nu L\big)\sqrt{2\log(1/\delta)T} + C_{\epsilon,1}(1 - \eta m\lambda)^J\sqrt{TL/\lambda}\big(4T + \sqrt{2\log(1/\delta)T}\big)$,

where $C_1, C_2, C_3$ are positive absolute constants; the first inequality is due to Lemma 4.7, the second inequality is due to Lemma 4.8, and the last equality follows from (4.3).
By setting $\eta = C_4(m\lambda + mLT)^{-1}$ and $J = (1 + LT/\lambda)\big(\log(24C_{\epsilon,1}) + \log(T^3 L\lambda^{-1}\log(1/\delta))\big)/C_4$, we have

$C_{\epsilon,1}(1 - \eta m\lambda)^J\sqrt{TL/\lambda}\big(4T + \sqrt{2\log(1/\delta)T}\big) \le 1$.

Then, choosing $m$ such that

$C_1C_3(1 + c_T)\nu\, T^{13/6}\sqrt{\log m}\,m^{-1/6}\lambda^{-2/3}L^5 \le 1/3, \qquad \epsilon_p(m)\big(4T + \sqrt{2\log(1/\delta)T}\big) \le 1/3$,

$R_T$ can be further bounded by

$R_T \le C_1(1 + c_T)\nu\sqrt{2\lambda L\big(\tilde{d}\log(1 + TK) + 1\big)T} + \big(4 + C_2(1 + c_T)\nu L\big)\sqrt{2\log(1/\delta)T} + 5$.

Taking a union bound over Lemmas 4.3, 4.7 and 4.8, the above inequality holds with probability at least $1 - 3\delta$. Replacing $\delta$ with $\delta/3$ and rearranging terms completes the proof.

5 EXPERIMENTS

This section gives an empirical evaluation of our algorithm on several public benchmark datasets, including adult, covertype, magic telescope, mushroom, and shuttle, all from the UCI repository (Dua & Graff, 2017), as well as MNIST (LeCun et al., 2010). The algorithm is compared to several typical baselines: linear and kernelized Thompson Sampling (Agrawal & Goyal, 2013; Chowdhury & Gopalan, 2017), linear and kernelized UCB (Chu et al., 2011; Valko et al., 2013), Bootstrap NN (Osband et al., 2016b; Riquelme et al., 2018), and ϵ-greedy for neural networks. Bootstrap NN trains multiple neural networks with subsampled data and, at each step, pulls the greedy action based on a randomly selected network. It has been proposed as a way to approximate Thompson Sampling (Osband & Van Roy, 2015; Osband et al., 2016b).

5.1 EXPERIMENT SETUP

To transform these classification problems into multi-armed bandits, we adapt the disjoint models (Li et al., 2010) to build a context feature vector for each arm: given an input feature $\mathbf{x} \in \mathbb{R}^d$ of a $k$-class classification problem, we build context feature vectors of dimension $kd$ as

$\mathbf{x}_1 = (\mathbf{x}; \mathbf{0}; \cdots; \mathbf{0}), \quad \mathbf{x}_2 = (\mathbf{0}; \mathbf{x}; \cdots; \mathbf{0}), \quad \ldots, \quad \mathbf{x}_k = (\mathbf{0}; \mathbf{0}; \cdots; \mathbf{x})$.

The algorithm then generates a set of predicted rewards following Algorithm 1 and pulls the greedy arm. For these classification problems, if the algorithm selects the correct class by pulling the corresponding arm, it receives a reward of 1, and 0 otherwise. The cumulative regret over the time horizon $T$ is therefore the total number of mistakes made by the algorithm. All experiments are repeated 8 times with reshuffled data. We set the time horizon to 10,000 for all datasets, except for mushroom, which contains only 8,124 data points. To speed up training for Neural UCB and Neural Thompson Sampling, we use the inverse of the diagonal elements of $\mathbf{U}$ as an approximation of $\mathbf{U}^{-1}$. Also, since calculating the kernel matrix is expensive, we stop training at $t = 1000$ and keep evaluating the performance for the remaining rounds, similar to previous work (Riquelme et al., 2018; Zhou et al., 2019). Due to space limits, we defer the results on adult, covertype, and magic telescope, as well as further experimental details, to Appendix A. In this section, we only show the results on mushroom, shuttle, and MNIST.

5.2 EXPERIMENT I: PERFORMANCE OF NEURAL THOMPSON SAMPLING

The results for Neural Thompson Sampling and the other benchmark algorithms are shown in Figure 1. A few observations are in order. First, Neural Thompson Sampling's performance is among the best on all 6 datasets and is significantly better than all other baselines on 2 of them. Second, the function class used by an algorithm is important: those with linear representations tend to perform worse due to the nonlinearity of the rewards in the data. Third, Thompson Sampling is competitive with, and sometimes better than, other exploration strategies using the same function class, in particular when neural networks are used.
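To make the evaluation protocol of Section 5.1 concrete, the following sketch (ours; the `policy` object with `select`/`update` methods is a hypothetical interface) shows the disjoint encoding and the resulting bandit loop, in which the cumulative regret equals the number of classification mistakes:

```python
import numpy as np

def disjoint_contexts(x, num_classes):
    """Disjoint encoding of Section 5.1: a d-dimensional feature x yields num_classes
    context vectors of dimension num_classes * d, with x placed in the k-th block."""
    d = x.shape[0]
    contexts = np.zeros((num_classes, num_classes * d))
    for k in range(num_classes):
        contexts[k, k * d:(k + 1) * d] = x
    return contexts

def run_classification_bandit(policy, features, labels, T):
    """Reward is 1 if the pulled arm matches the true class and 0 otherwise,
    so the cumulative regret is the number of mistakes."""
    num_classes = int(labels.max()) + 1
    mistakes = 0
    for t in range(T):
        x = features[t] / np.linalg.norm(features[t])     # unit-norm features (Appendix A.1)
        contexts = disjoint_contexts(x, num_classes)
        a = policy.select(contexts)
        reward = float(a == labels[t])
        mistakes += 1 - reward
        policy.update(contexts[a], reward)
    return mistakes
```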
[Figure 1 shows total regret versus the number of rounds on the MNIST, Mushroom, and Shuttle datasets for Linear UCB, Linear TS, Kernel UCB, Kernel TS, Bootstrap NN, eps-greedy, Neural UCB, and Neural TS (ours).]

Figure 1: Comparison of Neural Thompson Sampling and baselines on the UCI datasets and MNIST. The total regret measures the cumulative classification errors made by an algorithm. Results are averaged over 8 runs with standard errors shown as shaded areas.

5.3 EXPERIMENT II: ROBUSTNESS TO REWARD DELAY

This experiment is inspired by practical scenarios where reward signals are delayed due to various constraints, as described by Chapelle & Li (2011). We study how robust the two most competitive methods from Experiment I, Neural UCB and Neural Thompson Sampling, are when rewards are delayed. More specifically, the reward after taking an action is not revealed immediately but arrives in batches, at which point the algorithms update their models. The experimental setup is otherwise identical to Experiment I. Here, we vary the batch size (i.e., the amount of reward delay), and Figure 2 shows the corresponding total regret. Clearly, we recover the results of Experiment I when the delay is 0. Consistent with previous findings (Chapelle & Li, 2011), Neural TS degrades much more gracefully than Neural UCB as the reward delay increases. The benefit may be explained by the algorithm's randomized exploration, which encourages exploration between batch updates. We therefore expect wider applicability of Neural TS in practical applications.

[Figure 2 shows total regret versus the amount of reward delay on the MNIST, Mushroom, and Shuttle datasets for Neural UCB and Neural TS.]

Figure 2: Comparison of Neural Thompson Sampling and Neural UCB on the UCI datasets and MNIST under different amounts of reward delay. The total regret measures the cumulative classification errors made by an algorithm. Results are averaged over 8 runs with standard errors shown as error bars.
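A minimal sketch of the delayed-feedback protocol of this experiment (ours; the `policy` and `env` interfaces are hypothetical): actions are chosen every round, but rewards are revealed and the model is updated only at the end of each batch of `delay` rounds.

```python
def run_with_delay(policy, env, T, delay):
    """Delayed-reward protocol of Section 5.3: buffer (context, reward) pairs and
    update the policy only once per batch; delay <= 1 recovers per-round updates."""
    buffer, mistakes = [], 0
    for t in range(1, T + 1):
        contexts = env.contexts(t)            # K candidate context vectors
        a = policy.select(contexts)           # posterior sampling still randomizes within a batch
        reward = env.reward(t, a)             # 1 if the correct class was chosen, else 0
        mistakes += 1 - reward
        buffer.append((contexts[a], reward))
        if t % max(delay, 1) == 0 or t == T:  # model update at the batch boundary
            for c, r in buffer:
                policy.update(c, r)
            buffer = []
    return mistakes
```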
6 RELATED WORK

Thompson Sampling was proposed as an exploration heuristic almost nine decades ago (Thompson, 1933) and has received significant interest in the last decade. Previous works related to the present paper are discussed in the introduction and are not repeated here. The upper confidence bound, or UCB (Agrawal, 1995; Auer et al., 2002; Lai & Robbins, 1985), is a widely used alternative to Thompson Sampling for exploration. This strategy is shown to achieve near-optimal regret in a range of settings, such as linear bandits (Abbasi-Yadkori et al., 2011; Auer, 2002; Chu et al., 2011), generalized linear bandits (Filippi et al., 2010; Jun et al., 2017; Li et al., 2017), and kernelized contextual bandits (Valko et al., 2013).

Neural networks are increasingly used in contextual bandits. In addition to those mentioned earlier (Blundell et al., 2015; Kveton et al., 2020; Lu & Van Roy, 2017; Riquelme et al., 2018), Zahavy & Mannor (2019) used a deep neural network to provide a feature mapping and explored only at the last layer, and Schwenk & Bengio (2000) proposed an algorithm that boosts the estimates of multiple deep neural networks. While these methods all show promise empirically, no regret guarantees are known. Recently, Foster & Rakhlin (2020) proposed a special regression oracle and randomized exploration for contextual bandits with a general function class (including neural networks), along with a theoretical analysis. Zhou et al. (2019) proposed a neural UCB algorithm with a near-optimal regret bound based on UCB exploration, while this paper focuses on Thompson Sampling.

7 CONCLUSIONS

In this paper, we adapt Thompson Sampling to neural networks. Building on recent advances in deep learning theory, we show that the proposed algorithm, Neural TS, enjoys an $\tilde{O}(\tilde{d}\sqrt{T})$ regret bound. We also show that the algorithm works well empirically on benchmark problems in comparison with multiple strong baselines. The promising results suggest a few interesting directions for future research. First, our analysis requires Neural TS to perform multiple gradient descent steps to train the neural network in each round. It would be interesting to analyze the case where Neural TS performs only one gradient descent step per round and, in particular, the trade-off between optimization precision and regret minimization. Second, when the number of arms is finite, $\tilde{O}(\sqrt{dT})$ regret has been established for parametric bandits with linear and generalized linear reward functions. It is an open problem how to adapt Neural TS to achieve the same rate. Third, Allen-Zhu & Li (2019) suggested that neural networks may behave differently from a neural tangent kernel in some parameter regimes. It would be interesting to investigate whether similar results hold for neural contextual bandit algorithms such as Neural TS.

ACKNOWLEDGEMENT

We would like to thank the anonymous reviewers for their helpful comments. WZ, DZ and QG are partially supported by the National Science Foundation CAREER Award 1906169 and IIS-1904183. The views and conclusions contained in this paper are those of the authors and should not be interpreted as representing any funding agencies.

REFERENCES

Yasin Abbasi-Yadkori, Dávid Pál, and Csaba Szepesvári. Improved algorithms for linear stochastic bandits. In Advances in Neural Information Processing Systems, pp. 2312-2320, 2011.

Marc Abeille and Alessandro Lazaric. Linear Thompson sampling revisited. In Proceedings of the 20th International Conference on Artificial Intelligence and Statistics, pp. 176-184, 2017.

Rajeev Agrawal. Sample mean based index policies by O(log n) regret for the multi-armed bandit problem. Advances in Applied Probability, 27(4):1054-1078, 1995.

Shipra Agrawal and Navin Goyal. Thompson sampling for contextual bandits with linear payoffs. In International Conference on Machine Learning, pp. 127-135, 2013.

Zeyuan Allen-Zhu and Yuanzhi Li. What can ResNet learn efficiently, going beyond kernels? In Advances in Neural Information Processing Systems, pp. 9015-9025, 2019.

Zeyuan Allen-Zhu, Yuanzhi Li, and Zhao Song. A convergence theory for deep learning via over-parameterization. arXiv preprint arXiv:1811.03962, 2018.

Sanjeev Arora, Simon S. Du, Wei Hu, Zhiyuan Li, and Ruosong Wang. Fine-grained analysis of optimization and generalization for overparameterized two-layer neural networks. arXiv preprint arXiv:1901.08584, 2019.

Peter Auer. Using confidence bounds for exploitation-exploration trade-offs. Journal of Machine Learning Research, 3(Nov):397-422, 2002.

Peter Auer, Nicolò Cesa-Bianchi, and Paul Fischer. Finite-time analysis of the multiarmed bandit problem. Machine Learning, 47(2-3):235-256, 2002.
Published as a conference paper at ICLR 2021 Kamyar Azizzadenesheli, Emma Brunskill, and Animashree Anandkumar. Efficient exploration through Bayesian deep Q-networks. In Proceedings of the 2018 Information Theory and Applications Workshop, pp. 1 9, 2018. Alberto Bietti and Julien Mairal. On the inductive bias of neural tangent kernels. In Advances in Neural Information Processing Systems, pp. 12893 12904, 2019. Charles Blundell, Julien Cornebise, Koray Kavukcuoglu, and Daan Wierstra. Weight uncertainty in neural network. In Proceedings of the 32nd International Conference on Machine Learning, pp. 1613 1622, 2015. S ebastien Bubeck and Nicol o Cesa-Bianchi. Regret analysis of stochastic and nonstochastic multiarmed bandit problems. Foundations and Trends in Machine Learning, 5(1):1 122, 2012. Yuan Cao and Quanquan Gu. Generalization bounds of stochastic gradient descent for wide and deep neural networks. In Advances in Neural Information Processing Systems, pp. 10835 10845, 2019. Yuan Cao, Zhiying Fang, Yue Wu, Ding-Xuan Zhou, and Quanquan Gu. Towards understanding the spectral bias of deep learning. ar Xiv preprint ar Xiv:1912.01198, 2019. Nicolo Cesa-Bianchi and G abor Lugosi. Prediction, learning, and games. Cambridge university press, 2006. Olivier Chapelle and Lihong Li. An empirical evaluation of thompson sampling. In Advances in neural information processing systems, pp. 2249 2257, 2011. Sayak Ray Chowdhury and Aditya Gopalan. On kernelized multi-armed bandits. In Proceedings of the 34th International Conference on Machine Learning, pp. 844 853, 2017. Wei Chu, Lihong Li, Lev Reyzin, and Robert Schapire. Contextual bandits with linear payoff functions. In Proceedings of the 14th International Conference on Artificial Intelligence and Statistics, pp. 208 214, 2011. Simon S Du, Xiyu Zhai, Barnabas Poczos, and Aarti Singh. Gradient descent provably optimizes over-parameterized neural networks. ar Xiv preprint ar Xiv:1810.02054, 2018. Dheeru Dua and Casey Graff. UCI machine learning repository, 2017. URL http://archive. ics.uci.edu/ml. Sarah Filippi, Olivier Cappe, Aur elien Garivier, and Csaba Szepesv ari. Parametric bandits: The generalized linear case. In Advances in Neural Information Processing Systems, pp. 586 594, 2010. Meire Fortunato, Mohammad Gheshlaghi Azar, Bilal Piot, Jacob Menick, Matteo Hessel, Ian Osband, Alex Graves, Volodymyr Mnih, R emi Munos, Demis Hassabis, Olivier Pietquin, Charles Blundell, and Shane Legg. Noisy networks for exploration. In Proceedings of the 6th International Conference on Learning Representations, 2018. Dylan J Foster and Alexander Rakhlin. Beyond ucb: Optimal and efficient contextual bandits with regression oracles. ar Xiv preprint ar Xiv:2002.04926, 2020. Thore Graepel, Joaquin Quinonero Candela, Thomas Borchert, and Ralf Herbrich. Web-scale Bayesian click-through rate prediction for sponsored search advertising in Microsoft s Bing search engine. In Proceedings of the 27th International Conference on Machine Learning, pp. 13 20, 2010. Kristjan Greenewald, Ambuj Tewari, Susan Murphy, and Predag Klasnja. Action centered contextual bandits. In Advances in Neural Information Processing Systems 30, pp. 5977 5985, 2017. Wassily Hoeffding. Probability inequalities for sums of bounded random variables. Journal of the American Statistical Association, 58(301):13 30, 1963. Matthew W Hoffman, Bobak Shahriari, and Nando de Freitas. Exploiting correlation and budget constraints in bayesian multi-armed bandit optimization. 
ar Xiv preprint ar Xiv:1303.6746, 2013. Published as a conference paper at ICLR 2021 Arthur Jacot, Franck Gabriel, and Cl ement Hongler. Neural tangent kernel: Convergence and generalization in neural networks. In Advances in Neural Information Processing Systems, pp. 8571 8580, 2018. Kwang-Sung Jun, Aniruddha Bhargava, Robert D. Nowak, and Rebecca Willett. Scalable generalized linear bandits: Online computation and hashing. In Advances in Neural Information Processing Systems 30, pp. 99 109, 2017. Jaya Kawale, Hung Hai Bui, Branislav Kveton, Long Tran-Thanh, and Sanjay Chawla. Efficient Thompson sampling for online matrix-factorization recommendation. In Advances in Neural Information Processing Systems 28, pp. 1297 1305, 2015. Tom aˇs Koc ak, Michal Valko, R emi Munos, and Shipra Agrawal. Spectral Thompson sampling. In 28th AAAI Conference on Artificial Intelligence, 2014. Tom aˇs Koc ak, Michal Valko, R emi Munos, and Shipra Agrawal. Spectral Thompson sampling. In Proceedings of the 28th AAAI Conference on Artificial Intelligence, pp. 1911 1917, 2014. Branislav Kveton, Manzil Zaheer, Csaba Szepesvari, Lihong Li, Mohammad Ghavamzadeh, and Craig Boutilier. Randomized exploration in generalized linear bandits. In Proceedings of the 22nd International Conference on Artificial Intelligence and Statistics, 2020. Tze Leung Lai and Herbert Robbins. Asymptotically efficient adaptive allocation rules. Advances in Applied Mathematics, 6(1):4 22, 1985. Tor Lattimore and Csaba Szepesv ari. Bandit Algorithms. Cambridge University Press, 2020. Yann Le Cun, Corinna Cortes, and CJ Burges. Mnist handwritten digit database. ATT Labs [Online]. Available: http://yann.lecun.com/exdb/mnist, 2, 2010. Lihong Li, Wei Chu, John Langford, and Robert E Schapire. A contextual-bandit approach to personalized news article recommendation. In Proceedings of the 19th International Conference on World Wide Web, pp. 661 670, 2010. Lihong Li, Yu Lu, and Dengyong Zhou. Provably optimal algorithms for generalized linear contextual bandits. In Proceedings of the 34th International Conference on Machine Learning-Volume 70, pp. 2071 2080. JMLR. org, 2017. Zachary C. Lipton, Jianfeng Gao, Lihong Li, Xiujun Li, Faisal Ahmed, and Li Deng. BBQnetworks: Efficient exploration in deep reinforcement learning for task-oriented dialogue systems. In Proceedings of the 32nd AAAI Conference on Artificial Intelligence, pp. 5237 5244, 2018. Xiuyuan Lu and Benjamin Van Roy. Ensemble sampling. In Advances in Neural Information Processing Systems 30, pp. 3258 3266, 2017. Jeffrey Mahler, Florian T Pokorny, Brian Hou, Melrose Roderick, Michael Laskey, Mathieu Aubry, Kai Kohlhoff, Torsten Kr oger, James Kuffner, and Ken Goldberg. Dex-net 1.0: A cloud-based network of 3d objects for robust grasp planning using a multi-armed bandit model with correlated rewards. In 2016 IEEE international conference on robotics and automation (ICRA), pp. 1957 1964. IEEE, 2016. Ian Osband and Benjamin Van Roy. Bootstrapped thompson sampling and deep exploration. ar Xiv preprint ar Xiv:1507.00300, 2015. Ian Osband, Charles Blundell, Alexander Pritzel, and Benjamin Van Roy. Deep exploration via bootstrapped DQN. In Advances in Neural Information Processing Systems 29, pp. 4026 4034, 2016a. Ian Osband, Charles Blundell, Alexander Pritzel, and Benjamin Van Roy. Deep exploration via bootstrapped dqn. In Advances in neural information processing systems, pp. 4026 4034, 2016b. Carlos Riquelme, George Tucker, and Jasper Snoek. 
Deep Bayesian bandits showdown: An empirical comparison of Bayesian deep networks for Thompson sampling. arXiv preprint arXiv:1802.09127, 2018.

Daniel Russo and Benjamin Van Roy. Learning to optimize via posterior sampling. Mathematics of Operations Research, 39(4):1221-1243, 2014.

Daniel Russo, Benjamin Van Roy, Abbas Kazerouni, Ian Osband, and Zheng Wen. A tutorial on Thompson sampling. arXiv preprint arXiv:1707.02038, 2017.

Holger Schwenk and Yoshua Bengio. Boosting neural networks. Neural Computation, 12(8):1869-1887, 2000.

William R. Thompson. On the likelihood that one unknown probability exceeds another in view of the evidence of two samples. Biometrika, 25(3/4):285-294, 1933.

Michal Valko, Nathaniel Korda, Rémi Munos, Ilias Flaounas, and Nelo Cristianini. Finite-time analysis of kernelised contextual bandits. arXiv preprint arXiv:1309.6869, 2013.

Lin F. Yang and Mengdi Wang. Reinforcement learning in feature space: Matrix bandit, kernels, and regret bound. arXiv preprint arXiv:1905.10389, 2019.

Tom Zahavy and Shie Mannor. Deep neural linear bandits: Overcoming catastrophic forgetting through likelihood matching. arXiv preprint arXiv:1901.08612, 2019.

Dongruo Zhou, Lihong Li, and Quanquan Gu. Neural contextual bandits with UCB-based exploration. arXiv preprint arXiv:1911.04462, 2019.

Difan Zou, Yuan Cao, Dongruo Zhou, and Quanquan Gu. Gradient descent optimizes over-parameterized deep ReLU networks. Machine Learning, pp. 1-26, 2019.

A FURTHER DETAILS OF THE EXPERIMENTS IN SECTION 5

A.1 PARAMETER TUNING

In the experiments, we shuffle all datasets randomly and normalize the features so that their ℓ2-norm is one. One-hidden-layer neural networks with 100 neurons are used. Note that we do not choose m as suggested by the theory; this disconnection has its root in the current deep learning theory based on the neural tangent kernel and is not specific to this work. During posterior updating, gradient descent is run for 100 iterations with learning rate 0.001. For Bootstrap NN, we use 10 identical networks, and to train each network, the data point at each round is included for training with probability 0.8 (p = 10, q = 0.8 in the original paper (Schwenk & Bengio, 2000)). For ϵ-greedy, we tune ϵ with a grid search over {0.01, 0.05, 0.1}. For (λ, ν) used in linear and kernel UCB / Thompson Sampling, we set λ = 1 following previous works (Agrawal & Goyal, 2013; Chowdhury & Gopalan, 2017), and do a grid search over ν ∈ {1, 0.1, 0.01} to select the parameter with the best performance. For the Neural UCB / Thompson Sampling methods, we use a grid search over λ ∈ {1, 10^-1, 10^-2, 10^-3} and ν ∈ {10^-1, 10^-2, 10^-3, 10^-4, 10^-5}. All experiments are repeated 20 times, and the average and standard error are reported.

A.2 DETAILED RESULTS

Table 1 summarizes the total regret measured at the last round on the different datasets, with mean and standard deviation computed over 20 independent runs; boldface in the original table marks the best performance among the eight algorithms. Table 2 shows the number of times the algorithm in each row significantly outperforms, ties with, or significantly underperforms the other algorithms, according to a t-test at the 90% significance level. Figure 3 shows the performance of Neural Thompson Sampling compared with the other baseline methods. Figure 4 shows the comparison between Neural Thompson Sampling and Neural UCB in the delayed-reward setting.
Table 1: Total regret at the last round (mean ± standard deviation over 20 independent runs). Round # is the time horizon and the input dimension is given under the disjoint encoding.

| | Adult | Covertype | Magic¹ | MNIST | Mushroom | Shuttle |
|---|---|---|---|---|---|---|
| Round # | 10,000 | 10,000 | 10,000 | 10,000 | 8,124 | 10,000 |
| Input dim² | 2 × 15 | 2 × 55 | 2 × 12 | 10 × 784 | 2 × 23 | 7 × 9 |
| Random³ | 5000 | 5000 | 5000 | 9000 | 4062 | 8571 |
| Linear UCB | 2097.5 ± 50.3 | 3222.7 ± 67.2 | 2604.4 ± 34.6 | 2544.0 ± 235.4 | 562.7 ± 23.1 | 966.6 ± 39.0 |
| Linear TS | 2154.7 ± 40.5 | 4297.3 ± 328.7 | 2700.5 ± 46.7 | 2781.4 ± 338.3 | 643.3 ± 30.4 | 1020.9 ± 42.8 |
| Kernel UCB | 2080.1 ± 44.8 | 3546.2 ± 175.7 | 2406.5 ± 79.4 | 3595.8 ± 580.1 | 199.0 ± 41.0 | 166.5 ± 39.4 |
| Kernel TS | 2111.5 ± 87.4 | 3659.9 ± 113.8 | 2442.6 ± 64.5 | 3406.0 ± 411.7 | 291.2 ± 40.0 | 283.3 ± 180.5 |
| Bootstrap NN | 2097.3 ± 39.3 | 3067.0 ± 56.1 | 2269.4 ± 27.9 | 1765.6 ± 321.1 | 132.3 ± 8.6 | 211.7 ± 20.9 |
| eps-greedy | 2328.5 ± 50.4 | 3334.2 ± 72.6 | 2381.8 ± 37.3 | 1893.2 ± 93.7 | 323.2 ± 32.5 | 682.0 ± 79.8 |
| Neural UCB | 2061.8 ± 42.8 | 3012.1 ± 87.0 | 2033.0 ± 48.6 | 2071.6 ± 922.2 | 160.4 ± 95.3 | 338.6 ± 386.4 |
| Neural TS (ours) | 2092.5 ± 48.0 | 2999.1 ± 74.3 | 2037.4 ± 61.3 | 1583.4 ± 198.5 | 115.0 ± 35.8 | 232.0 ± 149.5 |

¹ Magic is short for the Magic Telescope dataset.
² Dimension under the disjoint encoding, i.e., (number of classes) × (number of features).
³ Pulling an arm uniformly at random at each round.

Table 2: Performance on total regret compared with the other methods on all datasets. Each tuple (w/t/l) gives the number of times the algorithm in that row wins, ties with, or loses against the other 7 algorithms, using a t-test at the 90% significance level.

| | Adult | Covertype | Magic | MNIST | Mushroom | Shuttle |
|---|---|---|---|---|---|---|
| Linear UCB | 2/3/2 | 4/0/3 | 1/0/6 | 2/2/3 | 1/0/6 | 1/0/6 |
| Linear TS | 1/0/6 | 0/0/7 | 0/0/7 | 2/1/4 | 0/0/7 | 0/0/7 |
| Kernel UCB | 4/3/0 | 2/0/5 | 3/1/3 | 0/0/7 | 4/0/3 | 7/0/0 |
| Kernel TS | 2/3/2 | 1/0/6 | 2/0/5 | 1/0/6 | 3/0/4 | 3/3/1 |
| Bootstrap NN | 2/4/1 | 5/0/2 | 5/0/2 | 4/3/0 | 5/2/0 | 3/3/1 |
| eps-greedy | 0/0/7 | 3/0/4 | 3/1/3 | 4/2/1 | 2/0/5 | 2/0/5 |
| Neural UCB | 6/1/0 | 6/1/0 | 6/1/0 | 3/3/1 | 5/1/1 | 3/3/1 |
| Neural TS (ours) | 2/4/1 | 6/1/0 | 6/1/0 | 6/1/0 | 6/1/0 | 3/3/1 |

[Figure 3 shows total regret versus the number of rounds on the Adult, Covertype, and Magic Telescope datasets for Linear UCB, Linear TS, Kernel UCB, Kernel TS, Bootstrap NN, eps-greedy, Neural UCB, and Neural TS (ours).]

Figure 3: Comparison of Neural Thompson Sampling and baselines on the UCI datasets. The total regret measures the cumulative classification errors made by an algorithm. Results are averaged over multiple runs with standard errors shown as shaded areas.

[Figure 4 shows total regret versus the amount of reward delay on the Adult, Covertype, and Magic Telescope datasets for Neural UCB and Neural TS.]

Figure 4: Comparison of Neural Thompson Sampling and Neural UCB on the UCI datasets under different amounts of reward delay. The total regret measures the cumulative classification errors made by an algorithm. Results are averaged over multiple runs with standard errors shown as error bars.

[Figure 5 consists of six bar-chart panels, one per dataset (including (b) Covertype, (c) Magic Telescope, (e) mushroom, and (f) shuttle), reporting the running time in seconds of Bootstrap NN, eps-greedy, Neural UCB, and Neural TS (ours).]

Figure 5: Comparison of the running time of Neural TS, Neural UCB, and ϵ-greedy for neural networks on the UCI datasets and MNIST.
A.3 RUN TIME ANALYSIS We compare the run time of the four algorithms based on neural networks: Bootstrap NN, ϵ-greedy for neural networks, Neural UCB, and Neural TS. The comparison is shown in Figure 5. We can see that Neural TS and Neural UCB are about 2 to 3 times slower than ϵ-greedy, which is due to the extra calculation of the neural network gradient for each input context. Bootstrap NN is often more than 5 times slower than ϵ-greedy because it has to train several neural networks at each round. B PROOF OF LEMMAS IN SECTION 4 Under Condition 4.1, we can show that the following inequalities hold. 1/λ Cm,1m 1L 3/2[log(TKL2/δ)]3/2, T/λ Cm,2 min m1/2L 6[log m] 3/2, m7/8 (λη)2L 6T 1(log m) 1 3/8 , m1/6 Cm,3 p log m L7/2T 7/6λ 7/6(1 + p m Cm,4T 6K6L6 log(TKL/δ) max{λ 4 0 , 1}, where {Cm,1, Cm,2, . . . , Cm,4} are some positive absolute constants. B.1 PROOF OF LEMMA 4.2 The following concentration bound on Gaussian distributions will be useful in our proof. Lemma B.1 (Hoffman et al. (2013)). Consider a normally distributed random variable X N(µ, σ2) and β 0. The probability that X is within a radius of βσ from its mean can then be written as Pr |X µ| βσ 1 exp( β2/2). Proof of Lemma 4.2. Since the estimated reward ert,k is sampled from N(f(xt,k; θt 1), ν2σ2 t,k) if given filtration Ft, Lemma B.1 implies that, conditioned on Ft and given t, k, Pr |ert,k f(xt,k; θt 1)| ctνσt,k Ft 1 exp( c2 t/2). Taking a union bound over K arms, we have that for any t Pr k, |ert,k f(xt,k; θt 1)| ctνσt,k Ft 1 K exp( c2 t/2). Published as a conference paper at ICLR 2021 Finally, choose ct = 4 log t + 2 log K as defined in (4.1), we get the bound that Pr Eσ t Ft = Pr k, |ert,k f(xt,k; θt 1)| ctνσt,k Ft 1 1 B.2 PROOF OF LEMMA 4.3 Before going into the proof, some notation is needed about linear and kernelized models. Definition B.2. Define Ut = λI+Pt i=1 g(xi,ai; θ0)g(xt,at; θ0) /m and based on Ut, we further define σ2 t,k = λg (xt,k; θ0) U 1 t 1g(xt,k; θ0)/m. Furthermore, for convenience we define Jt = (g(x1,a1; θt) g(xt,at; θt)) , Jt = (g(x1,a1; θ0) g(xt,at; θ0)) ht = (h(x1,a1) h(xt,at)) , rt = (r1 rt) , ϵt = (h(x1,a1) r1 h(xt,at) rt) , where ϵt is the reward noise. We can verify that Ut = λI + Jt J t /m, Ut = λI + Jt J t /m . We further define Kt = J t Jt/m. The first lemma shows that the target function is well-approximated by the linearized neural network if the network width m is large enough. Lemma B.3 (Lemma 5.1, Zhou et al. (2019)). There exists some constant C > 0 such that for any δ (0, 1), if m CT 4K4L6 log(T 2K2L/δ)/λ4 0, then with probability at least 1 δ over the random initialization of θ0, there exists a θ Rp such that h(xi) = g(xi; θ0), θ θ0 , m θ θ0 2 2h H 1h B, (B.1) for all i [TK], where B is defined in Theorem 3.5. From Lemma B.3, it is easy to show that under this initialization parameter θ0, we have that ht = J t (θ θ0) The next lemma bounds the difference between the σt,k from the linearized model and the σt,k actually used in the algorithm. Its proof, together with other technical lemmas , will be given in the next section. Lemma B.4. Suppose the network size m satisfies Condition 4.1. Set η = C1(mλ + m LT) 1, then with probability at least 1 δ, | σt,k σt,k| C2 p log mt7/6m 1/6λ 2/3L9/2, where C1, C2 are two positive constants. We next bound the difference between the outputs of the neural network and the linearized model. Lemma B.5. Suppose the network width m satisfies Condition 4.1. 
Then, set η = C1(mλ + m LT) 1, with probability at least 1 δ over the random initialization of θ0, we have f(xt,k; θt 1) g(xt,k; θ0), U 1 t 1 Jt 1rt 1/m C2t2/3m 1/6λ 2/3L3p + C3(1 ηmλ)Jp log m L4t5/3λ 5/3(1 + p where {Ci}4 i=1 are positive constants. Published as a conference paper at ICLR 2021 The next lemma, due to Chowdhury & Gopalan (2017), controls the quadratic value generated by an R-sub-Gaussian random vector ϵ: Lemma B.6 (Theorem 1, Chowdhury & Gopalan (2017)). Let {ϵt} t=1 be a real-valued stochastic process such that for some R 0 and for all t 1, ϵt is Ft-measurable and R-sub-Gaussian conditioned on Ft, Recall Kt defined in Definition B.2. With probability 0 < δ < 1 and for a given η > 0, with probability 1 δ, the following holds for all t, ϵ 1:t((Kt + ηI) 1 + I) 1ϵ1:t R2 log det((1 + η)I + Kt) + 2R2 log(1/δ). Finally, the following lemma shows the linearized kernel and the neural tangent kernel are closed: Lemma B.7. For all t [T], there exists a positive constants C such that the following holds: if the network width m satisfies m CT 6L6K6 log(TKL/δ), then with probability at least 1 δ, log det(I + λ 1Kt) log det(I + λ 1H) + 1. We are now ready to prove Lemma 4.3. Proof of Lemma 4.3. First of all, since m satisfies Condition 4.1, then with the choice of η ,the condition required in Lemmas B.3 B.7 are satisfied. Thus, taking a union bound, we have with probability at least 1 5δ, that the bounds provided by these lemmas hold. Then for any t [T], we will first provide the difference between the target function and the linear function g(xt,k; θ0), U 1 t 1 Jt 1rt 1/m as: h(xt,k) g(xt,k; θ0), U 1 t 1 Jt 1rt 1/m h(xt,k) g(xt,k; θ0), U 1 t 1 Jt 1ht 1/m + g(xt,k; θ0), U 1 t 1 Jt 1ϵt 1/m = g(xt,k; θ0), θ θ0 U 1 t 1 Jt 1 J t 1(θ θ0)/m + g(xt,k; θ0) U 1 t 1 Jt 1ϵt 1/m = g(xt,k; θ0), (I U 1 t 1( Ut 1 λI))(θ θ0) + g(xt,k; θ0) U 1 t 1 Jt 1ϵt 1/m = λ g(xt,k; θ0) U 1 t 1(θ θ0) + g(xt,k; θ0) U 1 t 1 Jt 1ϵt 1/m g(xt,k; θ0) U 1 t 1g(xt,k; θ0) q (θ θ0) U 1 t 1(θ θ0) g(xt,k; θ0) U 1 t 1g(xt,k; θ0) q ϵ t 1 J t 1 U 1 t 1 Jt 1ϵt 1/m m θ θ0 2 σt,k + σt,kλ 1/2 q ϵ t 1 J t 1 U 1 t 1 Jt 1ϵt 1/m (B.2) where the first inequality uses triangle inequality and the fact that rt 1 = ht 1 + ϵt 1; the first equality is from Lemma B.3 and the second equality uses the fact that Jt 1 J t 1 = m( Ut 1 λI) which can be verified using Definition B.2; the second inequality is from the fact that |α Aβ| β Aβ. Since U 1 t 1 1 λI and σt,k defined in Definition B.2, we obtain the last inequality. Furthermore, by obtaining J t 1U 1 t 1 Jt 1/m = J t 1(λI + Jt 1 J t 1/m) 1Jt 1 = J t 1(λ 1I λ 2 Jt 1(I + λ 1 J t 1 Jt 1/m) 1 J t 1/m) Jt 1/m = λ 1 J t 1 Jt 1/m λ 1J t 1 Jt 1(λI + J t 1 Jt 1/m) 1 J t 1 Jt 1/m2 = λ 1Kt 1(I (λI + Kt 1) 1Kt 1) = Kt 1(λI + Kt 1) 1, where the first equality is from the Sherman-Morrison formula, and the second equality uses Definition B.2 and the fact that (λI + Kt 1) 1Kt 1 = I λ(λI + Kt 1) 1 which could be verified by multiplying the LHS and RHS together, we have that q ϵ t 1 J t 1U 1 t 1 Jt 1ϵt 1/m q ϵ t 1Kt 1(λI + Kt 1) 1ϵt 1 ϵ t 1(Kt 1 + (λ 1)I)(λI + Kt 1) 1ϵt 1 ϵ t 1(I + (Kt 1 + (λ 1)I) 1) 1ϵt 1 (B.3) Published as a conference paper at ICLR 2021 where the second inequality is because λ = 1 + 1/T 1 set in Theorem 3.5. 
Based on (B.2) and (B.3), by utilizing the bound on θ θ 2 provided in Lemma B.3, as well as the bound given in Lemma B.6, and λ 1, we have h(xt,k) g(xt,k; θ0), U 1 t 1 Jt 1rt 1m B + R p log det(λI + Kt 1) + 2 log(1/δ) σt,k, since it is obvious that log det(λI + Kt 1) = log det(I + λ 1Kt 1) + (t 1) log λ log det(I + λ 1Kt 1) + t(λ 1) log det(I + λ 1H) + 2, where the first equality moves the λ outside the log det, the first inequality is due to log λ λ 1, and the second inequality is from Lemma B.7 and the fact that λ = 1+1/T (as set in Theorem 3.5). Thus, we have h(xt,k) g(xt,k; θ0), U 1 t 1 Jt 1rt 1m ν σt,k, where we set ν = B + R p log det(I + H/λ) + 2 + 2 log(1/δ). Then, by combining this bound with Lemma B.5, we conclude that there exist positive constants C1, C2, C3 so that |f(xt,k; θt 1) h(xt,k)| ν σt,k + C1t2/3m 1/6λ 2/3L3p log m + C2(1 ηmλ)Jp log m L4t5/3λ 5/3(1 + p νσt,k + C1t2/3m 1/6λ 2/3L3p log m + C2(1 ηmλ)Jp log m L4t5/3λ 5/3(1 + p log det(I + H/λ) + 2 + 2 log(1/δ) ( σt,k σt,k). Finally, by utilizing the bound of | σt,k σt,k| provided in Lemma B.4, we conclude that |f(xt,k; θt 1) h(xt,k)| νσt,k + ϵ(m), where ϵ(m) is defined by adding all of the additional terms and taking t = T: ϵ(m) = C1T 2/3m 1/6λ 2/3L3p log m + C2(1 ηmλ)Jp log m L4T 5/3λ 5/3(1 + p + C4 B + R p log det(I + H/λ) + 2 + 2 log(1/δ) p log m T 7/6m 1/6λ 2/3L9/2, where is exactly the same form defined in (4.3). By setting δ to δ/5 (required by the union bound discussed at the beginning of the proof), we get the result presented in Lemma 4.3. B.3 PROOF OF LEMMA 4.4 Our proof requires an anti-concentration bound for Gaussian distribution, as stated below: Lemma B.8 (Gaussian anti-concentration). For a Gaussian random variable X with mean µ and standard deviation σ, for any β > 0, σ > β exp( β2) Published as a conference paper at ICLR 2021 Proof of Lemma 4.4. Since ert,k N(f(xt,k; θt 1), ν2 t σ2 t,k) conditioned on Ft, we have Pr ert,k + ϵ(m) > h(xt,k) Ft, Eµ t = Pr ert,k f(xt,k; θt 1) + ϵ(m) νσt,k > h(xt,k) f(xt,k; θt 1) Pr ert,k f(xt,k; θt 1) + ϵ(m) νσt,k > |h(xt,k) f(xt,k; θt 1)| = Pr ert,k f(xt,k; θt 1) νσt,k > |h(xt,k) f(xt,k; θt 1)| ϵ(m) Pr ert,k f(xt,k; θt 1) νσt,k > 1 Ft, Eµ t where the first inequality is due to |x| x, and the second inequality follows from event Eµ t , i.e., k [K], |f(xt,k; θt 1) h(xt,k)| νσt,k + ϵ(m). B.4 PROOF OF LEMMA 4.5 Proof of Lemma 4.5. Consider the following two events at round t: A = { k St, ert,k < ert,a t |Ft, Eµ t }, B = {at / St|Ft, Eµ t }. Clearly, A implies B, since at = argmaxk ert,k. Therefore, Pr at / St Ft, Eµ t Pr k St, ert,k < ert,a t Ft, Eµ t . Suppose Eµ also holds, then it is easy to show that k [K], |h(xt,k) ert,k| |h(xt,k) f(xt,k; θt)| + |f(xt,k; θt) ert,k| ϵ(m) + (1 + ect)νtσt,k. (B.4) Hence, for all k St, we have that h(xt,a t ) ert,k h(xt,a t ) h(xt,k) |h(xt,k) ert,k| ϵ(m), where we used the definitions of saturated arms in Definition 4.4, and of Eµ t and Eσ t in (4.1). Consider the following event C = {h(xt,a t ) ϵ(m) < ert,a t |Ft, Eµ t }. Since Eσ t implies h(xt,a t ) ϵ(m) ert,k, we have that if C, Eσ t holds, then A holds, i.e. Eσ t C A. Taking union with Eσ t we have that C = Eσ t Eσ t C A Eσ t , which implies Pr(A) + Pr( Eσ t ) Pr(C). (B.5) Then, (B.5) implies that Pr k St, ert,k < ert,a t Ft, Eµ t Pr ert,a t + ϵ(m) > h(xt,a t ) Ft, Eµ t Pr Eσ t |Ft, Eµ t where the first inequality is from a t is a special case of k [K], the second inequality is from Lemmas 4.2 and 4.4. 
Published as a conference paper at ICLR 2021 B.5 PROOF OF LEMMA 4.6 To prove Lemma 4.6, we will need an upper bound bound on δt,k. Lemma B.9. For any time t [T], k [K], and δ (0, 1), if the network width m satisfies Condition 4.1, we have, with probability at least 1 δ, that where C is a positive constant. Proof of Lemma 4.6. Recall that given Ft and Eµ t , the only randomness comes from sampling ert,k for k [K]. Let kt be the unsaturated arm with the smallest σt, , i.e. kt = argmin k/ St σt,k, then we have that E[σt,at|Ft, Eµ t ] E[σt,at|Ft, Eµ t , at / St] Pr(at / St|Ft, Eµ t ) where the first inequality ignores the case when at St, and the second inequality is from Lemma 4.5 and the definition of kt mentioned above. If both Eσ t and Eµ t hold, then k [K], |h(xt,k) ert,k| ϵ(m) + (1 + ct)νσt,k, (B.7) as proved in equation (B.4). Thus, h(xt,a t ) h(xt,at) = h(xt,a t ) h(xt, kt) + h(xt, kt) h(xt,at) (1 + ct)νσt, kt + 2ϵ(m) + h(xt, kt) ert, kt h(xt,at) + ert,at + ert, kt ert,at (1 + ct)ν(2σt, kt + σt,at) + 4ϵ(m), (B.8) where the first inequality is from Definition 4.4 and kt / St, and the second inequality comes from equation (B.7). Since a trivial bound on h(xt,a t ) h(xt,at) could be get by h(xt,a t ) h(xt,at) |h(xt,a t )| + |h(xt,at)| 2, then we have E[h(xt,a t ) h(xt,at)|Ft, Eµ t ] = E[h(xt,a t ) h(xt,at)|Ft, Eµ t , Eσ t ] Pr(Eσ t ) + E[h(xt,a t ) h(xt,at)|Ft, Eµ t , Eσ t ] Pr( Eσ t ) (1 + ct)ν(2σt, kt + E[σt,at|Ft, Eµ t ]) + 4ϵ(m) + 2 2E[σt,at|Ft, Eµ t ] 1 4e π 1 t2 + E[σt,at|Ft, Eµ t ] + 4ϵ(m) + 2 44e π(1 + ct)νE[σt,at|Ft, Eµ t ] + 4ϵ(m) + 2t 2, where the inequality on the second line uses the bound provide in (B.8) and the trivial bound of h(xt,a t ) h(xt,at) for the second term plus Lemma 4.2, the inequality on the third line uses the bound of σt, kt provide in (B.6), inequality on the forth line is directly calculated by 1 4e π and which trivially holds since LHS is negative when t 4 and when t = 5, the LHS reach its maximum as 84.11 < 96.36 RHS. Noticing that |h(x)| 1, it is trivial to further extend the bound as E[h(xt,a t ) h(xt,at)|Ft, Eµ t ] min{44e π(1 + ct)νE[σt,at|Ft, Eµ t ], 2} + 4ϵ(m) + 2t 2, Published as a conference paper at ICLR 2021 and since we have 1 + ct 1 and ν = B + R p log det(I + H/λ) + 2 + 2 log(1/δ) B, recall 22e πB 1, it is easy to verify the following inequality also holds: E[h(xt,a t ) h(xt,at)|Ft, Eµ t ] 44e π(1 + ct)ν min{E[σt,at|Ft, Eµ t ], 1} + 4ϵ(m) + 2t 2 44e π(1 + ct)νC1 LE[min{σt,at, 1}|Ft, Eµ t ] + 4ϵ(m) + 2t 2, where we use the fact that there exists a constant C1 such that σt,at is bounded by C1 L with probability 1 δ provided by Lemma B.9. Merging the positive constant C1 with 44e π, we get the statement in Lemma 4.6. B.6 PROOF OF LEMMA 4.7 We start with introducing the Azuma-Hoeffding inequality for super-martingale: Lemma B.10 (Azuma-Hoeffding Inequality for Super Martingale). If a super-martingale Yt, corresponding to filtration Ft satisfies that |Yt Yt 1| Bt, then for any δ (0, 1), w.p. 1 δ, we have v u u t2 log(1/δ) Proof of Lemma 4.7. From Lemma B.9, we have that there exists a positive constant C1 such that Xt defined in (4.5) is bounded with probability 1 δ by |Xt| | t| + C1(1 + ct)ν L min{σt,at, 1} + 4ϵ(m) + 2t 2 2 + 2t 2 + C1C2(1 + ct)νL + 4ϵ(m) 4 + C1C2(1 + ct)νL + 4ϵ(m) where the first inequality uses the fact that |a b| |a| + |b|; the second inequality is from Lemma B.9 and the fact that h 1, where C2 is a positive constant used in Lemma B.9; the third inequality uses the fact that t 2 1. 
B.6 PROOF OF LEMMA 4.7

We start by recalling the Azuma-Hoeffding inequality for super-martingales:

Lemma B.10 (Azuma-Hoeffding inequality for super-martingales). If a super-martingale $Y_t$, corresponding to a filtration $\mathcal{F}_t$, satisfies $|Y_t - Y_{t-1}| \leq B_t$ for constants $B_t$, then for any $\delta \in (0,1)$, with probability at least $1 - \delta$,
$$ Y_T - Y_0 \leq \sqrt{ 2\log(1/\delta) \sum_{t=1}^T B_t^2 }. $$

Proof of Lemma 4.7. From Lemma B.9, there exists a positive constant $C_1$ such that $X_t$ defined in (4.5) is bounded, with probability at least $1 - \delta$, by
$$ |X_t| \leq |\Delta_t| + C_1(1 + c_t)\nu\sqrt{L}\,\min\{\sigma_{t,a_t}, 1\} + 4\epsilon(m) + 2t^{-2} \leq 2 + 2t^{-2} + C_1 C_2 (1 + c_t)\nu L + 4\epsilon(m) \leq 4 + C_1 C_2 (1 + c_t)\nu L + 4\epsilon(m), $$
where the first inequality uses $|a - b| \leq |a| + |b|$; the second inequality follows from Lemma B.9 and the fact that $|h| \leq 1$, with $C_2$ the positive constant from Lemma B.9; and the third inequality uses $t^{-2} \leq 1$. Noticing that $c_t \leq c_T$, and from Lemma 4.6, we know that with probability at least $1 - \delta$, $Y_t$ is a super-martingale. From Lemma B.10, we have
$$ Y_T - Y_0 \leq \big( 4 + C_1 C_2 (1 + c_T)\nu L + 4\epsilon(m) \big) \sqrt{2\log(1/\delta)\, T}. \quad \text{(B.9)} $$
Considering the definition of $Y_T$ in (4.5), (B.9) is equivalent to
$$ \sum_{t=1}^T \Delta_t \leq 4T\epsilon(m) + 2\sum_{t=1}^T t^{-2} + C_1 (1 + c_T)\nu\sqrt{L} \sum_{t=1}^T \min\{\sigma_{t,a_t}, 1\} + \big( 4 + C_1 C_2 (1 + c_T)\nu L + 4\epsilon(m) \big) \sqrt{2\log(1/\delta)\, T}. $$
Then, by utilizing $\sum_{t=1}^{\infty} t^{-2} = \pi^2/6$, merging the constant $C_1$ with $44e\sqrt{\pi}$, and taking a union bound over the probability bounds of Lemmas 4.6, B.10, and B.9, the inequality above holds with probability at least $1 - 3\delta$. Rescaling $\delta$ to $\delta/3$ and merging the product $C_1 C_2$ into a new positive constant leads to the desired result.

B.7 PROOF OF LEMMA 4.8

We first state a technical lemma that will be useful:

Lemma B.11 (Lemma 11, Abbasi-Yadkori et al. (2011)). Let $\{v_t\}_{t=1}^{\infty}$ be a sequence in $\mathbb{R}^d$, and define $V_t = \lambda I + \sum_{i=1}^t v_i v_i^\top$. If $\lambda \geq 1$, then
$$ \sum_{t=1}^T \min\{ v_t^\top V_{t-1}^{-1} v_t,\ 1 \} \leq 2\log\det\Big( I + \lambda^{-1} \sum_{t=1}^T v_t v_t^\top \Big). $$

Proof of Lemma 4.8. First, recall $\bar\sigma_{t,k}$ defined in Definition B.2 and the bound on $\sigma_{t,k} - \bar\sigma_{t,k}$ provided in Lemma B.4. We have that there exists a positive constant $C_1$ such that
$$ \sum_{t=1}^T \min\{\sigma_{t,a_t}, 1\} = \sum_{t=1}^T \min\{\bar\sigma_{t,a_t}, 1\} + \sum_{t=1}^T (\sigma_{t,a_t} - \bar\sigma_{t,a_t}) \leq \sqrt{ T \sum_{t=1}^T \min\{\bar\sigma^2_{t,a_t}, 1\} } + C_1 T^{13/6} \sqrt{\log m}\, m^{-1/6} \lambda^{-2/3} L^{9/2}, $$
where the first term in the inequality on the second line follows from the Cauchy-Schwarz inequality, and the second term is from Lemma B.4. From Definition B.2, we have
$$ \sum_{t=1}^T \min\{\bar\sigma^2_{t,a_t}, 1\} \leq \lambda \sum_{t=1}^T \min\{ g(x_{t,a_t}; \theta_0)^\top \bar U_{t-1}^{-1} g(x_{t,a_t}; \theta_0)/m,\ 1 \} \leq 2\lambda \log\det\Big( I + \lambda^{-1} \sum_{t=1}^T g(x_{t,a_t}; \theta_0) g(x_{t,a_t}; \theta_0)^\top / m \Big) $$
$$ = 2\lambda \log\det(I + \lambda^{-1} J_T J_T^\top / m) = 2\lambda \log\det(I + \lambda^{-1} J_T^\top J_T / m) = 2\lambda \log\det(I + \lambda^{-1} K_T), $$
where the first inequality moves the positive parameter $\lambda$ outside the min operator and uses the definition of $\bar\sigma_{t,k}$ in Definition B.2, the second inequality utilizes Lemma B.11, the first equality uses the definition of $J_t$ in Definition B.2, the second equality is from the fact that $\det(I + AA^\top) = \det(I + A^\top A)$, and the last equality uses the definition of $K_t$ in Definition B.2. From Lemma B.7, we have $\log\det(I + \lambda^{-1} K_T) \leq \log\det(I + \lambda^{-1} H) + 1$ under the conditions on $m$ and $\eta$ presented in Theorem 3.5. By taking a union bound we have, with probability at least $1 - 2\delta$, that
$$ \sum_{t=1}^T \min\{\sigma_{t,a_t}, 1\} \leq \sqrt{ 2\lambda T \big( \tilde d \log(1 + TK) + 1 \big) } + C_1 T^{13/6} \sqrt{\log m}\, m^{-1/6} \lambda^{-2/3} L^{9/2}, $$
where we use the definition of $\tilde d$ in Definition 3.2. Replacing $\delta$ with $\delta/2$ completes the proof.
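To illustrate the elliptical potential argument of Lemma B.11, which drives the bound on $\sum_t \min\{\bar\sigma^2_{t,a_t}, 1\}$ above, the sketch below draws random feature vectors, accumulates $V_t = \lambda I + \sum_i v_i v_i^\top$, and checks the stated inequality numerically. The dimension, horizon, and feature distribution are arbitrary choices made for illustration only.

```python
import numpy as np

rng = np.random.default_rng(0)
d, T, lam = 8, 200, 1.0           # lam >= 1, as required by Lemma B.11

V = lam * np.eye(d)               # V_0 = lambda * I
potential = 0.0
vs = []
for _ in range(T):
    v = rng.normal(size=d)
    # v^T V_{t-1}^{-1} v, clipped at 1, accumulated before updating V.
    potential += min(v @ np.linalg.solve(V, v), 1.0)
    V += np.outer(v, v)           # V_t = V_{t-1} + v v^T
    vs.append(v)

G = sum(np.outer(v, v) for v in vs)                        # sum_t v_t v_t^T
rhs = 2 * np.linalg.slogdet(np.eye(d) + G / lam)[1]        # 2 * log det(I + G / lambda)

print(f"sum of clipped potentials = {potential:.3f}")
print(f"2 log det(I + G/lambda)   = {rhs:.3f}")
assert potential <= rhs + 1e-8
```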
C PROOF OF AUXILIARY LEMMAS IN APPENDIX B

In this section, we prove the lemmas used in Appendix B, starting with the following NTK lemmas. The first controls the difference between the parameters learned via gradient descent and the optimal solution to the linearized network.

Lemma C.1 (Lemma B.2, Zhou et al. (2019)). There exist constants $\{C_i\}_{i=1}^5 > 0$ such that for any $\delta > 0$, if $\eta$ and $m$ satisfy, for all $t \in [T]$,
$$ \sqrt{t/\lambda} \geq C_1 m^{-1} L^{-3/2} [\log(TKL^2/\delta)]^{3/2}, \qquad \sqrt{t/\lambda} \leq C_2 \min\big\{ m^{1/2} L^{-6} [\log m]^{-3/2},\ m^{7/8} \big( (\lambda\eta)^2 L^{-6} t^{-1} (\log m)^{-1} \big)^{3/8} \big\}, $$
$$ \eta \leq C_3 (m\lambda + tmL)^{-1}, \qquad m^{1/6} \geq C_4 \sqrt{\log m}\, L^{7/2} t^{7/6} \lambda^{-7/6} (1 + \sqrt{t/\lambda}), $$
then with probability at least $1 - \delta$ over the random initialization of $\theta_0$, for any $t \in [T]$, we have
$$ \|\theta_{t-1} - \theta_0\|_2 \leq 2\sqrt{t/(m\lambda)}, $$
$$ \big\| \theta_{t-1} - \theta_0 - \bar U_{t-1}^{-1} J_{t-1} r_{t-1}/\sqrt{m} \big\|_2 \leq (1 - \eta m\lambda)^{J} \sqrt{t/(m\lambda)} + C_5 m^{-2/3} \sqrt{\log m}\, L^{7/2} t^{5/3} \lambda^{-5/3} (1 + \sqrt{t/\lambda}). $$

The next lemma controls the difference between the value of the neural network and that of its linearization.

Lemma C.2 (Lemma 4.1, Cao & Gu (2019)). There exist constants $\{C_i\}_{i=1}^3 > 0$ such that for any $\delta > 0$, if $\tau$ satisfies
$$ C_1 m^{-3/2} L^{-3/2} [\log(TKL^2/\delta)]^{3/2} \leq \tau \leq C_2 L^{-6} [\log m]^{-3/2}, $$
then with probability at least $1 - \delta$ over the random initialization of $\theta_0$, for all $\tilde\theta, \hat\theta$ satisfying $\|\tilde\theta - \theta_0\|_2 \leq \tau$, $\|\hat\theta - \theta_0\|_2 \leq \tau$, and all $j \in [TK]$, we have
$$ \big| f(x^j; \tilde\theta) - f(x^j; \hat\theta) - \langle g(x^j; \hat\theta),\ \tilde\theta - \hat\theta \rangle \big| \leq C_3 \tau^{4/3} L^3 \sqrt{m \log m}. $$

The next lemma controls the difference between the gradient at a given parameter and the gradient at the initialization.

Lemma C.3 (Theorem 5, Allen-Zhu et al. (2018)). There exist constants $\{C_i\}_{i=1}^3 > 0$ such that for any $\delta \in (0,1)$, if $\tau$ satisfies
$$ C_1 m^{-3/2} L^{-3/2} [\log(TKL^2/\delta)]^{3/2} \leq \tau \leq C_2 L^{-6} [\log m]^{-3/2}, $$
then with probability at least $1 - \delta$ over the random initialization of $\theta_0$, for all $\theta$ with $\|\theta - \theta_0\|_2 \leq \tau$ and all $j \in [TK]$, we have
$$ \| g(x^j; \theta) - g(x^j; \theta_0) \|_2 \leq C_3 \sqrt{\log m}\, \tau^{1/3} L^3 \| g(x^j; \theta_0) \|_2. $$

We also need the next lemma to control the gradient norm of the neural network with the help of the NTK.

Lemma C.4 (Lemma B.3, Cao & Gu (2019)). There exist constants $\{C_i\}_{i=1}^3 > 0$ such that for any $\delta > 0$, if $\tau$ satisfies
$$ C_1 m^{-3/2} L^{-3/2} [\log(TKL^2/\delta)]^{3/2} \leq \tau \leq C_2 L^{-6} [\log m]^{-3/2}, $$
then with probability at least $1 - \delta$ over the random initialization of $\theta_0$, for any $\theta$ with $\|\theta - \theta_0\|_2 \leq \tau$ and any $j \in [TK]$, we have
$$ \| g(x^j; \theta) \|_F \leq C_3 \sqrt{mL}. $$

Finally, we can bound the difference between the kernel induced by the linearized model and the NTK, provided the network is wide enough.

Lemma C.5 (Lemma B.1, Zhou et al. (2019)). Let $K \in \mathbb{R}^{TK \times TK}$ be the Gram matrix with entries $[K]_{ij} = \langle g(x^i; \theta_0), g(x^j; \theta_0) \rangle / m$ for $i, j \in [TK]$, and recall the definition of $H$ in Definition 3.1. Then there exists a constant $C_1$ such that if $m \geq C_1 L^6 \log(TKL/\delta)\, \epsilon^{-4}$, with probability at least $1 - \delta$,
$$ \| K - H \|_F \leq TK\epsilon. $$

Equipped with these lemmas, we can proceed with our proofs.

C.1 PROOF OF LEMMA B.4

Proof of Lemma B.4. First, set $\tau = 2\sqrt{t/(m\lambda)}$; then the conditions on the network width $m$ and the learning rate $\eta$ satisfy all of the conditions needed by Lemmas C.1 to C.5. Thus, from Lemma C.1 we have $\|\theta_{t-1} - \theta_0\|_2 \leq \tau$, and from Lemma C.4 there exists a positive constant $C_1$ such that
$$ \| g(x; \theta_{t-1}) \|_2 \leq C_1 \sqrt{mL}, \qquad \| g(x; \theta_0) \|_2 \leq C_1 \sqrt{mL}. $$
Consider the function
$$ \psi(a, a_1, \ldots, a_{t-1}) = \sqrt{ a^\top \Big( \lambda I + \sum_{i=1}^{t-1} a_i a_i^\top \Big)^{-1} a }. $$
It is then easy to verify that
$$ \sigma_{t,k} = \sqrt{\lambda}\, \psi\Big( \frac{g(x_{t,k}; \theta_{t-1})}{\sqrt{m}}, \frac{g(x_{1,a_1}; \theta_1)}{\sqrt{m}}, \ldots, \frac{g(x_{t-1,a_{t-1}}; \theta_{t-1})}{\sqrt{m}} \Big), \qquad \bar\sigma_{t,k} = \sqrt{\lambda}\, \psi\Big( \frac{g(x_{t,k}; \theta_0)}{\sqrt{m}}, \frac{g(x_{1,a_1}; \theta_0)}{\sqrt{m}}, \ldots, \frac{g(x_{t-1,a_{t-1}}; \theta_0)}{\sqrt{m}} \Big), $$
so that $\psi$ only needs to be considered on the domain $\|a\|_2 \leq C_1\sqrt{L}$, $\|a_i\|_2 \leq C_1\sqrt{L}$.
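As a concrete illustration of the two quantities compared in this proof, the snippet below evaluates $\psi$ on two sets of synthetic scaled gradient features, mimicking $\sigma_{t,k}$ and $\bar\sigma_{t,k}$ from Definition B.2. The feature dimension, the number of rounds, and the perturbation size are arbitrary stand-ins for $g(\cdot; \theta_{t-1})/\sqrt{m}$ and $g(\cdot; \theta_0)/\sqrt{m}$; the sketch only shows that a small change in the features produces a small change in $\psi$, which is what the Lipschitz argument formalizes.

```python
import numpy as np

rng = np.random.default_rng(1)
p, t, lam = 16, 30, 1.0

def psi(a, history):
    # psi(a, a_1, ..., a_{t-1}) = sqrt( a^T (lam*I + sum_i a_i a_i^T)^{-1} a )
    A = lam * np.eye(p) + sum(np.outer(ai, ai) for ai in history)
    return float(np.sqrt(a @ np.linalg.solve(A, a)))

# Synthetic stand-ins for g(.; theta_0)/sqrt(m) and the slightly moved g(.; theta_{t-1})/sqrt(m).
feats0 = [rng.normal(size=p) / np.sqrt(p) for _ in range(t)]
feats1 = [f + 1e-2 * rng.normal(size=p) / np.sqrt(p) for f in feats0]
a0, a1 = feats0[-1], feats1[-1]

sigma_bar = np.sqrt(lam) * psi(a0, feats0[:-1])   # plays the role of bar{sigma}_{t,k}
sigma     = np.sqrt(lam) * psi(a1, feats1[:-1])   # plays the role of sigma_{t,k}
print(f"sigma_bar={sigma_bar:.4f}  sigma={sigma:.4f}  |diff|={abs(sigma - sigma_bar):.2e}")
```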
Taking the derivative of $\psi^2$ with respect to $a$ and $a_i$, and writing $C = \big( \lambda I + \sum_{i=1}^{t-1} a_i a_i^\top \big)^{-1}$ for simplicity, we have
$$ 2\psi\,\nabla_a \psi = \nabla_a(\psi^2) = 2\, C a, \qquad 2\psi\,\nabla_{a_i} \psi = \nabla_{a_i}(\psi^2) = -2\, (a^\top C a_i)\, C a, $$
where the second identity follows by taking the differential of $C$ and using $\mathrm{tr}(AB) = \mathrm{tr}(BA)$. Decomposing $C = Q^\top D Q$ with $D = \mathrm{diag}(\varrho_1, \ldots, \varrho_p)$ collecting the eigenvalues of $C$, and letting $b = Qa$, we have
$$ \|\nabla_a \psi\|_2^2 = \frac{a^\top C^2 a}{a^\top C a} = \frac{\sum_{i} b_i^2 \varrho_i^2}{\sum_{i} b_i^2 \varrho_i} \leq \frac{1}{\lambda}, $$
where the last inequality is from the fact that $C \preceq \lambda^{-1} I$, which implies that every eigenvalue satisfies $\varrho_i \leq 1/\lambda$. For the same reason,
$$ \|\nabla_{a_i} \psi\|_2 = \frac{ \| Ca\, a^\top C a_i \|_2 }{ \sqrt{a^\top C a} } \leq \frac{ \|Ca\|_2^2\, \|a_i\|_2 }{ \sqrt{a^\top C a} } \leq \frac{ \|a\|_2 \|a_i\|_2 }{ \lambda }, $$
where the last inequality uses $\|Ca\|_2^2 = a^\top C^2 a \leq \lambda^{-1} a^\top C a$, $\sqrt{a^\top C a} \leq \|a\|_2/\sqrt{\lambda}$, and $\lambda \geq 1$. Thus, on the domain $\|a\|_2 \leq C_1\sqrt{L}$, $\|a_i\|_2 \leq C_1\sqrt{L}$, we have
$$ \|\nabla_a \psi\|_2 \leq \frac{1}{\sqrt{\lambda}}, \qquad \|\nabla_{a_i} \psi\|_2 \leq \frac{C_1^2 L}{\lambda}. $$
Then, Lipschitz continuity implies
$$ |\sigma_{t,k} - \bar\sigma_{t,k}| = \sqrt{\lambda}\, \Big| \psi\Big( \frac{g(x_{t,k}; \theta_{t-1})}{\sqrt{m}}, \ldots \Big) - \psi\Big( \frac{g(x_{t,k}; \theta_0)}{\sqrt{m}}, \ldots \Big) \Big| \leq \sqrt{\lambda}\, \sup\{\|\nabla_a \psi\|_2\}\, \frac{ \| g(x_{t,k}; \theta_{t-1}) - g(x_{t,k}; \theta_0) \|_2 }{\sqrt{m}} + \sqrt{\lambda} \sum_{i=1}^{t-1} \sup\{\|\nabla_{a_i} \psi\|_2\}\, \frac{ \| g(x_{i,a_i}; \theta_i) - g(x_{i,a_i}; \theta_0) \|_2 }{\sqrt{m}}. \quad \text{(C.1)} $$
By Lemma C.3 with $\tau = 2\sqrt{t/(m\lambda)}$, there exist positive constants $C_2$ and $C_3$ such that each gradient difference in (C.1) is bounded by
$$ \frac{1}{\sqrt{m}} \| g(x; \theta) - g(x; \theta_0) \|_2 \leq C_2 \sqrt{\log m}\, \tau^{1/3} L^3\, \frac{\| g(x; \theta_0) \|_2}{\sqrt{m}} \leq C_3 \sqrt{\log m}\, t^{1/6} m^{-1/6} \lambda^{-1/6} L^{7/2}. $$
Thus, we obtain that there exists a constant $\bar C_1$ such that
$$ |\sigma_{t,k} - \bar\sigma_{t,k}| \leq \bar C_1 \sqrt{\log m}\, t^{7/6} m^{-1/6} \lambda^{-2/3} L^{9/2}, $$
where we use $\bar C_1 = \max\{C_3, C_3 C_1^2\}$ and $L \geq 1$ to merge the first term into the summation. This inequality is based on Lemmas C.1, C.3, and C.4, and thus holds with probability at least $1 - 3\delta$. Replacing $\delta$ with $\delta/3$ completes the proof.

C.2 PROOF OF LEMMA B.5

Proof of Lemma B.5. Setting $\tau = 2\sqrt{t/(m\lambda)}$, the conditions on the network width $m$ and the learning rate $\eta$ satisfy all of the conditions needed by Lemmas C.1 to C.5. From Lemma C.1 we have $\|\theta_{t-1} - \theta_0\|_2 \leq \tau$. Then, by Lemma C.2, there exists a constant $C_1$ such that
$$ \big| f(x_{t,k}; \theta_{t-1}) - \langle g(x_{t,k}; \theta_0),\ \theta_{t-1} - \theta_0 \rangle \big| \leq C_1 t^{2/3} m^{-1/6} \lambda^{-2/3} L^3 \sqrt{\log m}. \quad \text{(C.2)} $$
Using the bound on $\| \theta_{t-1} - \theta_0 - \bar U_{t-1}^{-1} J_{t-1} r_{t-1}/\sqrt{m} \|_2$ provided in Lemma C.1 and the gradient norm bound given in Lemma C.4, we have that there exist positive constants $\bar C_1, \bar C_2$ such that
$$ \big| \langle g(x_{t,k}; \theta_0),\ \theta_{t-1} - \theta_0 \rangle - \langle g(x_{t,k}; \theta_0),\ \bar U_{t-1}^{-1} J_{t-1} r_{t-1}/\sqrt{m} \rangle \big| \leq \| g(x_{t,k}; \theta_0) \|_2\, \big\| \theta_{t-1} - \theta_0 - \bar U_{t-1}^{-1} J_{t-1} r_{t-1}/\sqrt{m} \big\|_2 $$
$$ \leq \bar C_1 \sqrt{mL} \Big( (1 - \eta m\lambda)^{J} \sqrt{t/(m\lambda)} + \bar C_2 m^{-2/3} \sqrt{\log m}\, L^{7/2} t^{5/3} \lambda^{-5/3} (1 + \sqrt{t/\lambda}) \Big) = C_2 (1 - \eta m\lambda)^{J} \sqrt{tL/\lambda} + C_3 m^{-1/6} \sqrt{\log m}\, L^4 t^{5/3} \lambda^{-5/3} (1 + \sqrt{t/\lambda}), \quad \text{(C.3)} $$
where $C_2 = \bar C_1$ and $C_3 = \bar C_1 \bar C_2$. Combining (C.2) and (C.3), we have
$$ \big| f(x_{t,k}; \theta_{t-1}) - \langle g(x_{t,k}; \theta_0),\ \bar U_{t-1}^{-1} J_{t-1} r_{t-1}/\sqrt{m} \rangle \big| \leq C_1 t^{2/3} m^{-1/6} \lambda^{-2/3} L^3 \sqrt{\log m} + C_2 (1 - \eta m\lambda)^{J} \sqrt{tL/\lambda} + C_3 m^{-1/6} \sqrt{\log m}\, L^4 t^{5/3} \lambda^{-5/3} (1 + \sqrt{t/\lambda}), $$
which holds with probability at least $1 - 3\delta$ by a union bound over Lemmas C.4, C.1, and C.2. Replacing $\delta$ with $\delta/3$ completes the proof.

C.3 PROOF OF LEMMA B.7

Proof of Lemma B.7. From the definition of $K_t$, we have
$$ \log\det(I + \lambda^{-1} K_t) = \log\det\Big( I + \sum_{i=1}^{t} g(x_{i,a_i}; \theta_0) g(x_{i,a_i}; \theta_0)^\top / (m\lambda) \Big) \leq \log\det\Big( I + \sum_{i=1}^{T} \sum_{k=1}^{K} g(x_{i,k}; \theta_0) g(x_{i,k}; \theta_0)^\top / (m\lambda) \Big) = \log\det(I + K/\lambda) $$
$$ = \log\det\big( I + H/\lambda + (K - H)/\lambda \big) \leq \log\det(I + H/\lambda) + \big\langle (I + H/\lambda)^{-1},\ (K - H)/\lambda \big\rangle \leq \log\det(I + H/\lambda) + \| (I + H/\lambda)^{-1} \|_F\, \| K - H \|_F / \lambda $$
$$ \leq \log\det(I + H/\lambda) + \sqrt{TK}\, \| K - H \|_F \leq \log\det(I + H/\lambda) + 1, $$
where the first inequality holds because the double summation contains more positive semi-definite terms than the single summation; the following equalities use the definition of $K$ in Lemma C.5 (together with $\det(I + AA^\top) = \det(I + A^\top A)$) and of $H$ in Definition 3.1; the next inequality follows from the concavity of the $\log\det(\cdot)$ function; the next inequality is from $\langle A, B \rangle \leq \|A\|_F \|B\|_F$; the next inequality is from the fact that $\|A\|_F \leq \sqrt{TK}\, \|A\|_2$ for $A \in \mathbb{R}^{TK \times TK}$, together with $\|(I + H/\lambda)^{-1}\|_2 \leq 1$ and $\lambda \geq 1$; and the last inequality utilizes Lemma C.5 with $\epsilon = (TK)^{-3/2}$, which is admissible since $m \geq C_1 L^6 T^6 K^6 \log(TKL/\delta)$. This concludes the proof.
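The key matrix step in the proof of Lemma B.7 is the first-order bound $\log\det(X + Y) \leq \log\det X + \langle X^{-1}, Y \rangle$, which follows from the concavity of $\log\det$ over positive definite matrices. A quick numerical check of this bound is given below; the matrices are arbitrary test instances (a random positive definite $X$ standing in for $I + H/\lambda$ and a small symmetric $Y$ standing in for $(K - H)/\lambda$), not the NTK quantities themselves.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 10

def logdet(M):
    sign, val = np.linalg.slogdet(M)
    assert sign > 0
    return val

for _ in range(100):
    A = rng.normal(size=(n, n))
    X = np.eye(n) + A @ A.T              # positive definite, mimicking I + H/lambda
    B = rng.normal(size=(n, n))
    Y = 0.02 * (B + B.T)                 # small symmetric perturbation, mimicking (K - H)/lambda
    lhs = logdet(X + Y)
    rhs = logdet(X) + np.trace(np.linalg.solve(X, Y))   # <X^{-1}, Y> = tr(X^{-1} Y)
    assert lhs <= rhs + 1e-9
print("first-order log det bound verified on 100 random instances")
```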
C.4 PROOF OF LEMMA B.9

Proof of Lemma B.9. Set $\tau$ in Lemma C.4 to $2\sqrt{t/(m\lambda)}$. Then the network width $m$ and the learning rate $\eta$ satisfy all of the conditions needed by Lemmas C.1 to C.5. Hence, there exists a constant $C_1$ such that $\| g(x; \theta) \|_2 \leq \| g(x; \theta) \|_F \leq C_1 \sqrt{mL}$ for all $x$. Since it is easy to verify that $U_{t-1}^{-1} \preceq \lambda^{-1} I$, we have for all $t \in [T]$ and $k \in [K]$ that
$$ \sigma^2_{t,k} = \lambda\, g(x_{t,k}; \theta_{t-1})^\top U_{t-1}^{-1} g(x_{t,k}; \theta_{t-1}) / m \leq \| g(x_{t,k}; \theta_{t-1}) \|_2^2 / m \leq C_1^2 L. $$
Therefore, $\sigma_{t,k} \leq C_1 \sqrt{L}$ with probability at least $1 - 2\delta$ by a union bound over Lemmas C.1 and C.4. Replacing $\delta$ with $\delta/2$ completes the proof.

D AN UPPER BOUND ON THE EFFECTIVE DIMENSION $\tilde d$

We now provide an example showing that when all contexts $x^i$ concentrate on a $d'$-dimensional nonlinear subspace of the RKHS spanned by the NTK, the effective dimension $\tilde d$ is bounded by $d'$. We consider the case $\lambda = 1$, $L = 2$. Suppose that there exists a constant $d'$ such that for any $i > d'$, $0 < \lambda_i(H) \leq 1/(TK)$. Then the effective dimension $\tilde d$ can be bounded as
$$ \tilde d = \frac{\log\det(I + H)}{\log(1 + TK)} \leq \sum_{i=1}^{TK} \log(1 + \lambda_i(H)) \leq \sum_{i=1}^{TK} \lambda_i(H) = \underbrace{\sum_{i=1}^{d'} \lambda_i(H)}_{I_1} + \underbrace{\sum_{i=d'+1}^{TK} \lambda_i(H)}_{I_2}, $$
where the first inequality uses $\log(1 + TK) \geq 1$. For $I_1$ and $I_2$ we have
$$ I_1 \leq \sum_{i=1}^{d'} \| H \|_2 = \Theta(d'), \qquad I_2 \leq TK \cdot \frac{1}{TK} = 1. $$
Therefore, the effective dimension satisfies $\tilde d \leq d' + 1$.

To show how the requirement can be satisfied, we first give a characterization of the RKHS spanned by the NTK. By Bietti & Mairal (2019); Cao et al. (2019), we know that each entry of $H$ has the following form:
$$ H_{i,s} = \sum_{k=0}^{\infty} \mu_k \sum_{j=1}^{N(d,k)} Y_{k,j}(x^i)\, Y_{k,j}(x^s), $$
where $Y_{k,j}$, $j = 1, \ldots, N(d,k)$, are linearly independent spherical harmonics of degree $k$ in $d$ variables, $d$ is the input dimension, $N(d,k) = \frac{2k + d - 2}{k} \binom{k + d - 3}{d - 2}$, and $\mu_k = \Theta\big( \max\{ k^{-d}, (d-1)^{-k+1} \} \big)$. In this case, the feature mapping $(\sqrt{\mu_k}\, Y_{k,j}(x))_{k,j}$ maps any context $x$ from $\mathbb{R}^d$ to an RKHS $\mathcal{R}$ corresponding to $H$. Let $y^i \in \mathcal{R}$ denote the image of $x^i$. Then, if there exists a $d'$-dimensional subspace $\mathcal{R}' \subseteq \mathcal{R}$ such that for all $i$, $\| y^i - z^i \| \ll 1$, where $z^i$ is the projection of $y^i$ onto $\mathcal{R}'$, the requirement on $\lambda_i(H)$ holds.
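The bound $\tilde d \leq d' + 1$ derived above is easy to see numerically. The sketch below builds a synthetic positive semi-definite matrix $H$ whose first $d'$ eigenvalues are $\Theta(1)$ and whose remaining eigenvalues are at most $1/(TK)$, then evaluates $\tilde d = \log\det(I + H)/\log(1 + TK)$. The particular eigenvalue profile and the values of $T$, $K$, and $d'$ are illustrative choices, not the NTK Gram matrix itself.

```python
import numpy as np

rng = np.random.default_rng(3)
T, K, d_prime = 50, 4, 5
n = T * K

# Eigenvalues: d_prime "large" ones, the rest below 1/(TK).
eigvals = np.concatenate([
    rng.uniform(0.5, 1.0, size=d_prime),
    rng.uniform(0.0, 1.0 / (T * K), size=n - d_prime),
])
Q, _ = np.linalg.qr(rng.normal(size=(n, n)))   # random orthogonal basis
H = (Q * eigvals) @ Q.T                        # H = Q diag(eigvals) Q^T

d_eff = np.linalg.slogdet(np.eye(n) + H)[1] / np.log(1 + n)
print(f"effective dimension d_tilde = {d_eff:.3f}, d' + 1 = {d_prime + 1}")
assert d_eff <= d_prime + 1
```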