Published as a conference paper at ICLR 2024

CONTEXTUAL BANDITS WITH ONLINE NEURAL REGRESSION

Rohan Deb, Yikun Ban, Shiliang Zuo, Jingrui He, & Arindam Banerjee
University of Illinois, Urbana-Champaign
{rd22,yikunb2,szuo3,jingrui,arindamb}@cs.illinois.edu

Recent works have shown a reduction from contextual bandits to online regression under a realizability assumption (Foster and Rakhlin, 2020; Foster and Krishnamurthy, 2021). In this work, we investigate the use of neural networks for such online regression and the associated Neural Contextual Bandits (NeuCBs). Using existing results for wide networks, one can readily show a $O(\sqrt{T})$ regret for online regression with square loss, which via the reduction implies a $O(\sqrt{K}T^{3/4})$ regret for NeuCBs. Departing from this standard approach, we first show a $O(\log T)$ regret for online regression with almost convex losses that satisfy the QG (Quadratic Growth) condition, a generalization of the PL (Polyak-Łojasiewicz) condition, and that have a unique minimum. Although not directly applicable to wide networks since they do not have unique minima, we show that adding a suitable small random perturbation to the network predictions surprisingly makes the loss satisfy QG with a unique minimum. Based on such a perturbed prediction, we show a $O(\log T)$ regret for online regression with both squared loss and KL loss, and subsequently convert these respectively to $\tilde{O}(\sqrt{KT})$ and $\tilde{O}(\sqrt{KL^*} + K)$ regret for NeuCBs, where $L^*$ is the loss of the best policy. Separately, we also show that existing regret bounds for NeuCBs are $\Omega(T)$ or assume i.i.d. contexts, unlike this work. Finally, our experimental results on various datasets demonstrate that our algorithms, especially the one based on KL loss, persistently outperform existing algorithms.

1 INTRODUCTION

Contextual Bandits (CBs) provide a powerful framework for sequential decision making problems, where a learner takes decisions over $T$ rounds based on partial feedback from the environment. At each round, the learner is presented with $K$ context vectors to choose from, and a scalar output is generated based on the chosen context. The objective is to minimize the accumulated output over $T$ rounds (we use the loss formulation instead of rewards). Many existing works assume that the expected output at each round depends linearly on the chosen context. This assumption has enabled tractable solutions, such as UCB-based approaches (Chu et al., 2011; Abbasi-Yadkori et al., 2011) and Thompson Sampling (Agrawal and Goyal, 2013). However, in many real-world applications the output function may not be linear, rendering these methods inadequate. Recent years have seen progress in the use of neural networks for contextual bandit problems (Zhou et al., 2020; Zhang et al., 2021) by leveraging the representation power of overparameterized models, especially wide networks (Allen-Zhu et al., 2019b; Cao and Gu, 2019b; Du et al., 2019; Arora et al., 2019b). These advances focus on learning the output function in the Neural Tangent Kernel (NTK) regime and draw on results from the kernel bandit literature (Valko et al., 2013). Separately, Foster and Rakhlin (2020) adapted the inverse gap weighting idea from Abe and Long (1999); Abe et al. (2003) and gave an algorithm (SquareCB) that relates the regret of CBs, $\text{Reg}_{\text{CB}}(T)$, to the regret of online regression with square loss, $R_{\text{Sq}}(T)$. The work uses a realizability assumption: the true function generating the outputs belongs to some function class $\mathcal{F}$.
In this approach, the learner learns a score for every arm (using online regression) and computes the probability of choosing an arm based on the inverse of the gap in scores, leading to a regret bound $\text{Reg}_{\text{CB}}(T) = O\big(\sqrt{KT \cdot R_{\text{Sq}}(T)}\big)$. Subsequently, Foster and Krishnamurthy (2021) revisited SquareCB and provided a modified algorithm (FastCB), with binary Kullback-Leibler (KL) loss and a re-weighted inverse gap weighting scheme, that attains a first-order regret bound. A first-order regret bound is data-dependent in the sense that it scales sub-linearly with the loss of the best policy $L^*$ instead of $T$. They showed that if the regret of online regression with KL loss is $R_{\text{KL}}(T)$, then the regret for the bandit problem can be bounded as $\text{Reg}_{\text{CB}}(T) = O\big(\sqrt{K L^* R_{\text{KL}}(T)} + K R_{\text{KL}}(T)\big)$. Further, under the realizability assumption, Simchi-Levi and Xu (2020) showed that an offline regression oracle with $O(\log T)$ calls can also achieve an optimal regret for CBs. This improves upon the $O(T)$ calls to an online regression oracle made by SquareCB and FastCB, but works only in the stochastic setting, i.e., when the contexts are drawn i.i.d. from a fixed distribution.

In this work we develop novel regret bounds for online regression with neural networks and subsequently use the reduction in Foster and Rakhlin (2020); Foster and Krishnamurthy (2021) to give regret guarantees for NeuCBs. Before we unpack the details, we discuss the research gaps in existing algorithms for NeuCBs to better motivate our contributions in the context of the available literature.

1.1 RESEARCH GAPS IN NEURAL CONTEXTUAL BANDITS

We discuss some problems and restrictive assumptions in existing NeuCB algorithms. Table 1 summarizes these comparisons. We also discuss why naively combining existing results for wide networks with the CBs-to-online-regression reduction leads to sub-optimal regret bounds. In Section F we further outline some related works in contextual bandits and overparameterized neural models.

1. NeuralUCB (Zhou et al., 2020) and NeuralTS (Zhang et al., 2021): Both these works focus on learning the loss function in the Neural Tangent Kernel (NTK) regime and draw on results from the kernel bandit literature (Valko et al., 2013). The regret bound is shown to be $\tilde{O}(\tilde{d}\sqrt{T})$, where $\tilde{d}$ is the effective dimension of the NTK matrix. When the eigenvalues of the kernel decrease polynomially, one can show that $\tilde{d}$ depends logarithmically on $T$ (see Remark 4.4 in Zhou et al. (2020) and Remark 1 in Valko et al. (2013)), and therefore the final regret is still $\tilde{O}(\sqrt{T})$. However, in Appendix A we show that under the assumptions in these papers, the regret bounds for both NeuralUCB and NeuralTS are $\Omega(T)$ in the worst case.

2. EE-Net (Ban et al., 2022b): This work uses an exploitation network for learning the output function and an exploration network to learn the potential gain of exploring at the current step. Although EE-Net avoids picking up a $\tilde{d}$ dependence in its regret bound, it has two drawbacks. 1) It assumes that the contexts are chosen i.i.d. from a given distribution, an assumption that generally does not hold for real-world CB problems. 2) It needs to store all the previous networks until the current time step and then makes a prediction by randomly picking a network from all the past networks (see lines 32-33 in Algorithm 1 of Ban et al. (2022b)), a strategy that does not scale to real-world deployment.
3. SquareCB (Foster and Rakhlin, 2020) and FastCB (Foster and Krishnamurthy, 2021): Both these works bound the regret of CBs in terms of the regret of online regression. Using online gradient descent for online regression (as in this paper) with regret $O(\sqrt{T})$, SquareCB and FastCB yield $O(\sqrt{K}T^{3/4})$ and $O(\sqrt{KL^*}T^{1/4} + K\sqrt{T})$ regret for CBs respectively (see Example 2 in Section 2.3 of Foster and Rakhlin (2020) and Example 5 in Section 4 of Foster and Krishnamurthy (2021)). Existing analyses with neural models that use almost convexity of the loss (see Chen et al. (2021)) show $O(\sqrt{T})$ regret for online regression, and naively combining this with SquareCB and FastCB leads to the same sub-optimal regret bounds for NeuCBs.

1.2 OUR CONTRIBUTIONS

1. Lower Bounds: As outlined in Section 1.1, we formally show that an (oblivious) adversary can always choose a set of context vectors and a reward function at the beginning, such that the regret bounds for NeuralUCB (Zhou et al., 2020) and NeuralTS (Zhang et al., 2021) become $\Omega(T)$. See Appendix A.1 and A.2 for the corresponding theorems and their proofs.

2. QG Regret: We provide $O(\log T) + \epsilon T$ regret for online regression when the loss function satisfies (i) $\epsilon$-almost convexity, (ii) the QG condition, and (iii) has a unique minimum (cf. Assumption 2), as long as the minimum cumulative loss in hindsight (the interpolation loss) is $O(\log T)$. This improves over the $O(\sqrt{T}) + \epsilon T$ bound in Chen et al. (2021), which only exploits $\epsilon$-almost convexity.

3. Regret for wide networks: While the QG result is not directly applicable to neural models, since they do not have unique minima, we show that adding a suitably small random perturbation to the prediction (10) makes the losses satisfy QG with a unique minimum. Using such a perturbed neural prediction, we provide regret bounds with the following loss functions:

| Algorithm | Regret | Remarks |
| NeuralUCB (Zhou et al., 2020) | $\tilde{O}(\tilde{d}\sqrt{T})$ | Bound is $\Omega(T)$ in the worst case. |
| NeuralTS (Zhang et al., 2021) | $\tilde{O}(\tilde{d}\sqrt{T})$ | Bound is $\Omega(T)$ in the worst case. |
| EE-Net (Ban et al., 2022b) | $\tilde{O}(\sqrt{T})$ | Assumes that the contexts are drawn i.i.d. and needs to store all previous networks. |
| NeuSquareCB (This work) | $\tilde{O}(\sqrt{KT})$ | No dependence on $\tilde{d}$; holds even when the contexts are chosen adversarially. |
| NeuFastCB (This work) | $\tilde{O}(\sqrt{L^* K} + K)$ | No dependence on $\tilde{d}$; holds even when the contexts are chosen adversarially. |

Table 1: Comparison with prior works. $T$ is the horizon length, $L^*$ is the cumulative loss of the best policy, $\tilde{d}$ is the effective dimension of the NTK matrix and $K$ is the number of arms.

(a) Squared loss: We provide $O(\log T)$ regret for online regression with the perturbed network (Theorem 3.2), and thereafter, using the online-regression-to-CB reduction, obtain $\tilde{O}(\sqrt{KT})$ regret for NeuCBs with our algorithm NeuSquareCB (Algorithm 1).

(b) Kullback-Leibler (KL) loss: We further provide an $O(\log T)$ regret for online regression with KL loss using the perturbed network (Theorem 3.3). To the best of our knowledge, this is the first result showing that the PL/QG condition holds for KL loss, and it may be of independent interest. Further, using the reduction, we obtain the first data-dependent regret bound of $\tilde{O}(\sqrt{L^* K} + K)$ for NeuCBs with our algorithm NeuFastCB (Algorithm 2).

4. Empirical Performance: Finally, in Section 5 we compare our algorithms against baseline algorithms for NeuCBs. Unlike previous works, we feed contexts to the algorithms in an adversarial manner (see Section 5 for details).
Our experiments on various datasets demonstrate that the proposed algorithms (especially NeuFastCB) consistently outperform existing NeuCB algorithms, which themselves have been shown to outperform their linear counterparts such as LinUCB and LinearTS (Zhou et al., 2020; Zhang et al., 2021; Ban et al., 2022b). We also emphasize that our regret bounds are independent of the effective dimension that appears in kernel bandits (Valko et al., 2013) and some recent neural bandit algorithms (Zhou et al., 2020; Zhang et al., 2021). Further, our algorithms are efficient to implement, do not require matrix inversions unlike NeuralUCB (Zhou et al., 2020) and NeuralTS (Zhang et al., 2021), and work even when the contexts are chosen adversarially, unlike approaches specific to the i.i.d. setting (Ban et al., 2022b).

2 NEURAL ONLINE REGRESSION: SETTING AND FORMULATION

Problem Formulation: At round $t \in [T]$, the learner is presented with an input $x_t \in \mathcal{X} \subseteq \mathbb{R}^d$ and is required to make a real-valued prediction $\hat{y}_t$. Then, the true outcome $y_t \in \mathcal{Y} = [0, 1]$ is revealed.

Assumption 1 (Realizability). The conditional expectation of $y_t$ given $x_t$ is given by some unknown function $h: \mathcal{X} \mapsto \mathcal{Y}$, i.e., $\mathbb{E}[y_t | x_t] = h(x_t)$. Further, the context vectors satisfy $\|x_t\| \le 1$, $t \in [T]$.

Neural Architecture: We consider a feedforward neural network with smooth activations as in Du et al. (2019); Banerjee et al. (2023), whose output is given by
$$f(\theta; x) := m^{-1/2}\, v^\top \phi\big(m^{-1/2} W^{(L)} \phi\big(\cdots \phi\big(m^{-1/2} W^{(1)} x\big)\cdots\big)\big), \quad (1)$$
where $L$ is the number of hidden layers and $m$ is the width of the network. Further, $W^{(1)} \in \mathbb{R}^{m \times d}$ and $W^{(l)} \in \mathbb{R}^{m \times m}$, $l \in \{2, \ldots, L\}$, are layer-wise weight matrices with $W^{(l)} = [w^{(l)}_{i,j}]$, $v \in \mathbb{R}^m$ is the last-layer vector, and $\phi(\cdot)$ is a Lipschitz and smooth (pointwise) activation function. We write
$$\theta := \big(\mathrm{vec}(W^{(1)})^\top, \ldots, \mathrm{vec}(W^{(L)})^\top, v^\top\big)^\top \quad (2)$$
for the vector of all parameters in the network. Note that $\theta \in \mathbb{R}^p$, where $p = md + (L-1)m^2 + m$ is the total number of parameters.

Regret: At time $t$ an algorithm picks a $\theta_t \in \mathcal{B}$, where $\mathcal{B} \subseteq \mathbb{R}^p$ is some comparator class, and outputs the prediction $f(\theta_t; x_t)$. Consider a loss function $\ell: \mathcal{Y} \times \mathcal{Y} \mapsto \mathbb{R}$ that measures the error between $f(\theta_t; x_t)$ and the true output $y_t$. Then the regret of neural online regression with loss $\ell$ is defined as
$$R(T) := \sum_{t=1}^T \ell(y_t, f(\theta_t; x_t)) - \inf_{\theta \in \mathcal{B}} \sum_{t=1}^T \ell(y_t, f(\theta; x_t)). \quad (3)$$

Remark 2.1. Given the definition of regret in (3), one might wonder if we are assuming that the function $h$ in Assumption 1 is $f(\bar{\theta}; \cdot)$ for some $\bar{\theta} \in \mathcal{B}$. Wide networks can in fact realize any function $h$ on a finite set of $T$ points (see Theorem E.1 in Appendix E). Using almost convexity of the loss function for wide networks, Chen et al. (2021) show $R(T) = O(\sqrt{T})$ regret. Instead, we work with a small random perturbation of the neural model prediction, denoted $\tilde{f}$ (see Section 3), with $\mathbb{E}[\tilde{f}] = f$, and consider the following regret:
$$R(T) := \sum_{t=1}^T \ell(y_t, \tilde{f}(\theta_t; x_t)) - \inf_{\theta \in \mathcal{B}} \sum_{t=1}^T \ell(y_t, \tilde{f}(\theta; x_t)). \quad (4)$$
As we show shortly in Section 3, the surprising aspect of working with such a mildly perturbed $\tilde{f}$ is that we get $R(T) = O(\log T)$. Further, the cumulative loss with such $\tilde{f}$ is also competitive against the best non-perturbed $f$ with $\theta \in \mathcal{B}$ in hindsight (Remark 3.4).

Notation: $n = O(t)$ (respectively $n = \Omega(t)$) means there exists a constant $c > 0$ such that $n \le ct$ (respectively $n \ge ct$). Further, $n = \Theta(t)$ means there exist constants $c_1, c_2 > 0$ such that $c_1 t \le n \le c_2 t$. The notations $\tilde{O}(t)$, $\tilde{\Omega}(t)$, $\tilde{\Theta}(t)$ additionally hide logarithmic factors.
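To make the setup concrete, the following is a minimal sketch of the architecture in (1), assuming a tanh activation purely for illustration (any Lipschitz, smooth pointwise activation fits the assumptions); all names are ours, not the paper's.

```python
import numpy as np

def init_params(d, m, L, sigma0, rng):
    # Layer-wise Gaussian initialization in the spirit of Assumption 4;
    # v0 is a random unit vector.
    Ws = [rng.normal(0.0, sigma0, size=(m, d))]
    Ws += [rng.normal(0.0, sigma0, size=(m, m)) for _ in range(L - 1)]
    v = rng.normal(size=m)
    v /= np.linalg.norm(v)
    return Ws, v

def f(Ws, v, x):
    # Feedforward network of equation (1): a 1/sqrt(m) scaling before every
    # layer and before the final linear readout.
    m = v.shape[0]
    a = x
    for W in Ws:
        a = np.tanh(W @ a / np.sqrt(m))
    return v @ a / np.sqrt(m)
```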
3 NEURAL ONLINE REGRESSION: REGRET BOUNDS

Our objective in this section is to provide regret bounds for projected Online Gradient Descent (OGD) with the projection operator $\Pi_{\mathcal{B}}(\theta') = \arg\inf_{\theta \in \mathcal{B}} \|\theta - \theta'\|_2$. The iterates are given by
$$\theta_{t+1} = \Pi_{\mathcal{B}}\big(\theta_t - \eta_t \nabla_\theta \ell(y_t, f(\theta_t; x_t))\big). \quad (5)$$

Definition 3.1 (Quadratic Growth (QG) condition). Consider a function $J: \mathbb{R}^p \to \mathbb{R}$ and let the solution set be $J^* = \{\theta : \theta \in \arg\min_\theta J(\theta)\}$. Then $J$ is said to satisfy the QG condition with QG constant $\mu$ if
$$J(\theta) - J(\theta^*) \ge \frac{\mu}{2} \|\theta - \theta^*\|^2, \quad \forall \theta \in \mathbb{R}^p \setminus J^*,$$
where $\theta^*$ is the projection of $\theta$ onto $J^*$.

Remark 3.1 (PL $\Rightarrow$ QG). The recent literature has extensively studied the PL condition and how neural losses satisfy it (Karimi et al., 2016; Liu et al., 2020; 2022). While the Quadratic Growth (QG) condition has not been as widely studied, one can show that the PL condition implies the QG condition (Karimi et al., 2016, Appendix A), i.e., for $\mu > 0$,
$$J(\theta) - J(\theta^*) \le \frac{1}{2\mu}\|\nabla J(\theta)\|_2^2 \ \text{(PL)} \quad \Longrightarrow \quad J(\theta) - J(\theta^*) \ge \frac{\mu}{2}\|\theta - \theta^*\|_2^2 \ \text{(QG)}.$$

Assumption 2. Consider a predictor $g(\theta; x)$ and suppose the loss $\ell(y_t, g(\theta; x_t))$, $t \in [T]$, satisfies:
(a) Almost convexity, i.e., there exists $\epsilon > 0$ such that for all $\theta, \theta' \in \mathcal{B}$,
$$\ell(y_t, g(\theta; x_t)) \ge \ell(y_t, g(\theta'; x_t)) + \langle \theta - \theta', \nabla_\theta \ell(y_t, g(\theta'; x_t)) \rangle - \epsilon. \quad (6)$$
(b) The QG condition, i.e., there exists $\mu > 0$ such that for all $\theta \in \mathcal{B} \setminus \Theta_t^*$, where $\Theta_t^* = \{\theta_t^* : \theta_t^* \in \arg\inf_\theta \ell(y_t, g(\theta; x_t))\}$ and $\theta_t^*$ is the projection of $\theta$ onto $\Theta_t^*$,
$$\ell(y_t, g(\theta; x_t)) - \ell(y_t, g(\theta_t^*; x_t)) \ge \frac{\mu}{2}\|\theta - \theta_t^*\|_2^2. \quad (7)$$
(c) A unique minimum.

Following Liu et al. (2020; 2022), we also assume the following about the activation and loss function.

Assumption 3. With $\ell_t := \ell(y_t, \hat{y}_t)$, $\ell_t' := \frac{d\ell_t}{d\hat{y}_t}$, and $\ell_t'' := \frac{d^2\ell_t}{d\hat{y}_t^2}$, we assume that the loss $\ell(y_t, \hat{y}_t)$ is Lipschitz, i.e., $|\ell_t'| \le \lambda$, strongly convex, i.e., $\ell_t'' \ge a$, and smooth, i.e., $\ell_t'' \le b$, for some $\lambda, a, b > 0$.

Theorem 3.1 (Regret Bound under QG condition). Under Assumptions 2 and 3, the regret of projected OGD with step size $\eta_t = \frac{4}{\mu t}$, where $\mu$ is the QG constant from Assumption 2(b), satisfies
$$R(T) \le O\Big(\frac{\lambda^2}{\mu}\log T\Big) + \epsilon T + 2\inf_{\theta \in \mathcal{B}} \sum_{t=1}^T \ell(y_t, g(\theta; x_t)). \quad (8)$$
The proof of Theorem 3.1 can be found in Appendix C.1.

Remark 3.2. Under Assumption 2(a), Chen et al. (2021) show $R(T) = O(\sqrt{T}) + \epsilon T$. In contrast, our bound improves the first term to $O(\log T)$, but has an extra term, the cumulative loss of the best $\theta$, and uses two additional assumptions (2(b) and 2(c)). Note that the third term vanishes if $g$ interpolates and the interpolation loss is zero (which holds, e.g., for over-parameterized networks with square loss). Further, for over-parameterized networks Chen et al. (2021) show that $\epsilon = 1/\mathrm{poly}(m)$, and therefore the $\epsilon T$ term is $O(1)$ for large enough $m$. A similar argument also holds for our model.

Next we discuss whether Theorem 3.1 can indeed be used for neural loss functions. For concreteness, consider the square loss $\ell_{\text{Sq}}(y_t, f(\theta_t; x_t)) := \frac{1}{2}(y_t - f(\theta_t; x_t))^2$. Following Liu et al. (2020); Banerjee et al. (2023), we make the following assumption on the initial parameters of the network.

Assumption 4. We initialize $\theta_0$ with $w^{(l)}_{0,ij} \sim \mathcal{N}(0, \sigma_0^2)$ for $l \in [L]$, where $\sigma_0 = \frac{\sigma_1}{2(1 + \sqrt{\log m})}$, and $v_0$ is a random unit vector with $\|v_0\|_2 = 1$.

Next we define $K_{\text{NTK}}(\theta) := \big[\langle \nabla f(\theta; x_t), \nabla f(\theta; x_{t'}) \rangle\big]_{t,t'} \in \mathbb{R}^{T \times T}$ to be the so-called Neural Tangent Kernel (NTK) matrix at $\theta$ (Jacot et al., 2018), and make the following assumption at initialization.

Assumption 5. $K_{\text{NTK}}(\theta_0)$ is positive definite, i.e., $K_{\text{NTK}}(\theta_0) \succeq \lambda_0 I$ for some $\lambda_0 > 0$.

Remark 3.3. The above assumption on the NTK is common in the deep learning literature (Du et al., 2019; Arora et al., 2019b; Cao and Gu, 2019a) and is ensured as long as no two context vectors $x_t$ overlap. All existing regret bounds for NeuCBs make this assumption (see Assumption 4.2 in Zhou et al. (2020), Assumption 3.4 in Zhang et al. (2021) and Assumption 5.1 in Ban et al. (2022b)).
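For intuition, here is a minimal sketch of the projected OGD update (5) with the step size from Theorem 3.1. The projection shown is onto a single Euclidean ball; the layer-wise Frobenius ball used below in (9) applies the same shrinkage block by block. Function names are illustrative, not from the paper.

```python
import numpy as np

def projected_ogd(theta0, grad_fn, project, mu, T):
    # Projected OGD iterates (5) with the QG step size eta_t = 4/(mu*t)
    # used in Theorem 3.1; grad_fn(theta, t) returns the loss gradient at round t.
    theta = theta0.copy()
    for t in range(1, T + 1):
        theta = project(theta - (4.0 / (mu * t)) * grad_fn(theta, t))
    return theta

def ball_projection(center, radius):
    # Euclidean projection onto {theta : ||theta - center||_2 <= radius}.
    def project(theta):
        diff = theta - center
        norm = np.linalg.norm(diff)
        return theta if norm <= radius else center + diff * (radius / norm)
    return project
```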
Finally, we choose the comparator class $\mathcal{B} = \mathcal{B}^{\text{Frob}}_{\rho,\rho_1}(\theta_0)$, the layer-wise Frobenius ball around the initialization $\theta_0$ with radii $\rho, \rho_1$ (to be chosen as part of the analysis), defined as
$$\mathcal{B}^{\text{Frob}}_{\rho,\rho_1}(\theta_0) := \big\{\theta \in \mathbb{R}^p \text{ as in (2)}: \|\mathrm{vec}(W^{(l)}) - \mathrm{vec}(W^{(l)}_0)\|_2 \le \rho, \ l \in [L], \ \|v - v_0\|_2 \le \rho_1\big\}. \quad (9)$$

Consider a network $f(\theta; x)$ that satisfies Assumption 5 with $\theta_0$ initialized as in Assumption 4. Then for $\mathcal{B} = \mathcal{B}^{\text{Frob}}_{\rho,\rho_1}(\theta_0)$ we can show the following: (i) $\ell_{\text{Sq}}$ satisfies Assumption 2(a) with $\epsilon = O(\mathrm{poly}(L, \rho, \rho_1)/\sqrt{m})$ (Lemma 13), and therefore with $m = \Omega(\mathrm{poly}(T, L, \rho, \rho_1))$ the second term in (8) is $O(1)$. (ii) $\ell_{\text{Sq}}$ for wide networks satisfies the PL condition (Liu et al., 2022), which implies QG, and therefore Assumption 2(b) is satisfied. (iii) Further, the network interpolates (Theorem E.1), which ensures that the third term in (8) is 0, making the right-hand side $O(\log T)$. However, neural models do not have a unique minimum, and as such we cannot take advantage of Theorem 3.1 since Assumption 2(c) is violated. To mitigate this, in the next subsection we construct a randomized predictor $\tilde{f}$ with $\mathbb{E}[\tilde{f}] = f$ and show that Assumptions 2(a), 2(b) and 2(c) hold with high probability.

3.1 REGRET BOUNDS FOR SQUARED LOSS

To ensure that the loss has a unique minimizer at every time step, we consider a random network with a small perturbation to the output. In detail, given the input $x_t$, we define a perturbed network as
$$\tilde{f}(\theta_t, x_t, \varepsilon) = f(\theta_t; x_t) + \frac{c_p \sum_{j=1}^p (\theta_t - \theta_0)^\top e_j \varepsilon_j}{m^{1/4}}, \quad (10)$$
where $f(\theta_t; x_t)$ is the output of the network as defined in (1), $c_p$ is a perturbation constant to be chosen later, $\{e_j\}_{j=1}^p$ are the standard basis vectors, and $\varepsilon = (\varepsilon_1, \ldots, \varepsilon_p)^\top$ is a random vector whose entries $\varepsilon_j$ are drawn i.i.d. from a Rademacher distribution, i.e., $\mathbb{P}(\varepsilon_j = +1) = \mathbb{P}(\varepsilon_j = -1) = 1/2$. Note that $\mathbb{E}[\tilde{f}] = f$. Further, $\tilde{f}$ ensures that the expected loss $\mathbb{E}_\varepsilon\, \ell_{\text{Sq}}\big(y_t, \tilde{f}(\theta, x_t, \varepsilon)\big)$ has a unique minimizer (see Lemma 14). However, since running projected OGD on $\mathbb{E}_\varepsilon\, \ell_{\text{Sq}}\big(y_t, \tilde{f}(\theta, x_t, \varepsilon)\big)$ is not feasible, we next consider an averaged version of it. Let $\varepsilon_s = (\varepsilon_{s,1}, \ldots, \varepsilon_{s,p})^\top$ where each $\varepsilon_{s,i}$ is i.i.d. Rademacher. Consider $S$ such i.i.d. draws $\{\varepsilon_s\}_{s=1}^S$, and define the predictor
$$\tilde{f}^{(S)}\big(\theta; x_t, \varepsilon^{(1:S)}\big) = \frac{1}{S}\sum_{s=1}^S \tilde{f}(\theta; x_t, \varepsilon_s), \quad (11)$$
where $\tilde{f}(\theta; x_t, \varepsilon_s)$ is as defined in (10). We define the corresponding regret with square loss as
$$R_{\text{Sq}}(T) = \sum_{t=1}^T \ell_{\text{Sq}}\big(y_t, \tilde{f}^{(S)}(\theta_t; x_t, \varepsilon^{(1:S)})\big) - \min_{\theta \in \mathcal{B}^{\text{Frob}}_{\rho,\rho_1}(\theta_0)} \sum_{t=1}^T \ell_{\text{Sq}}\big(y_t, \tilde{f}^{(S)}(\theta; x_t, \varepsilon^{(1:S)})\big). \quad (12)$$
However, instead of running projected OGD with the loss $\ell_{\text{Sq}}\big(y_t, \tilde{f}^{(S)}(\theta; x_t, \varepsilon^{(1:S)})\big)$, we use
$$\mathcal{L}^{(S)}_{\text{Sq}}\big(y_t, \{\tilde{f}(\theta; x_t, \varepsilon_s)\}_{s=1}^S\big) = \frac{1}{S}\sum_{s=1}^S \ell_{\text{Sq}}\big(y_t, \tilde{f}(\theta; x_t, \varepsilon_s)\big). \quad (13)$$
Note that since $\ell_{\text{Sq}}$ is convex in its second argument, by Jensen's inequality we have
$$\ell_{\text{Sq}}\big(y_t, \tilde{f}^{(S)}(\theta; x_t, \varepsilon^{(1:S)})\big) = \ell_{\text{Sq}}\Big(y_t, \frac{1}{S}\sum_{s=1}^S \tilde{f}(\theta; x_t, \varepsilon_s)\Big) \le \frac{1}{S}\sum_{s=1}^S \ell_{\text{Sq}}\big(y_t, \tilde{f}(\theta; x_t, \varepsilon_s)\big). \quad (14)$$
Subsequently we show via (14) that bounding the regret with (13) implies a bound on (12).

Theorem 3.2 (Regret Bound for square loss). Under Assumptions 1, 4 and 5, with an appropriate choice of the step-size sequence $\{\eta_t\}$, width $m$, and perturbation constant $c_p$ in (10), with probability at least $1 - \frac{C}{T^4}$ for some constant $C > 0$ over the randomness of the initialization and $\{\varepsilon_s\}_{s=1}^S$, the regret (12) of projected OGD with loss $\mathcal{L}^{(S)}_{\text{Sq}}\big(y_t, \{\tilde{f}(\theta; x_t, \varepsilon_s)\}_{s=1}^S\big)$, $S = \Theta(\log m)$, and projection ball $\mathcal{B}^{\text{Frob}}_{\rho,\rho_1}(\theta_0)$ with $\rho = \Theta(\sqrt{T}/\lambda_0)$ and $\rho_1 = \Theta(1)$ is $R_{\text{Sq}}(T) = O(\log T)$.
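Before the proof sketch, a minimal illustration of the perturbed prediction (10) and the averaged loss (13), assuming the base prediction $f(\theta; x)$ has already been computed; variable names are ours.

```python
import numpy as np

def perturbed_f(f_value, theta, theta0, eps, c_p, m):
    # Perturbed prediction (10): f(theta; x) + c_p * <theta - theta0, eps> / m^(1/4),
    # with eps in {-1, +1}^p a vector of i.i.d. Rademacher signs.
    return f_value + c_p * np.dot(theta - theta0, eps) / m ** 0.25

def averaged_sq_loss(y, f_value, theta, theta0, eps_draws, c_p, m):
    # Averaged squared loss (13) over the S fixed Rademacher draws {eps_s}.
    losses = [0.5 * (y - perturbed_f(f_value, theta, theta0, e, c_p, m)) ** 2
              for e in eps_draws]
    return float(np.mean(losses))

# S = Theta(log m) draws, sampled once up front, e.g.:
# eps_draws = [rng.choice([-1.0, 1.0], size=p) for _ in range(S)]
```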
Proof sketch. The proof follows along four key steps, described below. All of these hold with high probability over the randomness of the initialization and $\{\varepsilon_s\}_{s=1}^S$. A detailed version of the proof, along with all intermediate lemmas and their proofs, is in Appendix C.2. Note that we do not assume Assumptions 2 and 3, but rather explicitly prove that they hold.

1. Square loss is Lipschitz, strongly convex, and smooth w.r.t. the output: This step ensures that Assumption 3 is satisfied. Strong convexity and smoothness follow trivially from the definition of $\ell_{\text{Sq}}$. To show that $\ell_{\text{Sq}}$ is Lipschitz we show that the output $\tilde{f}(\theta; x, \varepsilon)$ is bounded for any $\theta \in \mathcal{B}^{\text{Frob}}_{\rho,\rho_1}(\theta_0)$. Also note from Theorem 3.1 that the Lipschitz parameter $\lambda$ of the loss appears in the $\log T$ term, and therefore to obtain an $O(\log T)$ regret we also ensure that $\lambda = O(1)$.

2. The average loss in (13) is almost convex and has a unique minimizer: We show that with $S = \Theta(\log m)$, the average loss in (13) is $\nu$-Strongly Convex (SC) with $\nu = O(1/\sqrt{m})$ w.r.t. $\theta \in \mathcal{B}^{\text{Frob}}_{\rho,\rho_1}(\theta_0)$, $t \in [T]$, which immediately implies Assumptions 2(a) and 2(c).

3. The average loss in (13) satisfies the QG condition: It is known that square loss with wide networks under Assumption 5 satisfies the PL condition (e.g., Liu et al. (2022)) with $\mu = O(1)$. We show that the average loss in (13) with $S = \Theta(\log m)$ also satisfies the PL condition with $\mu = O(1)$, which implies that it satisfies the QG condition with the same $\mu$.

4. Bounding the final regret: Steps 1 and 2 above surprisingly show that with a small output perturbation, square loss satisfies (a) almost convexity, (b) QG, and (c) unique minima as in Assumption 2. Combining with step 3, all the assumptions of Theorem 3.1 are satisfied by $\mathcal{L}^{(S)}_{\text{Sq}}$. Using a union bound over the three steps and invoking Theorem 3.1, we get with high probability
$$\sum_{t=1}^T \mathcal{L}^{(S)}_{\text{Sq}}\big(y_t, \{\tilde{f}(\theta_t; x_t, \varepsilon_s)\}_{s=1}^S\big) - \inf_{\theta \in \mathcal{B}^{\text{Frob}}_{\rho,\rho_1}(\theta_0)} \sum_{t=1}^T \mathcal{L}^{(S)}_{\text{Sq}}\big(y_t, \{\tilde{f}(\theta; x_t, \varepsilon_s)\}_{s=1}^S\big) \le 2\inf_{\theta \in \mathcal{B}^{\text{Frob}}_{\rho,\rho_1}(\theta_0)} \sum_{t=1}^T \mathcal{L}^{(S)}_{\text{Sq}}\big(y_t, \{\tilde{f}(\theta; x_t, \varepsilon_s)\}_{s=1}^S\big) + O(\log T). \quad (15)$$

Algorithm 1 Neural SquareCB (NeuSquareCB); uses square loss
1: Initialize $\theta_0$, $\gamma$, $\{\eta_t\}$
2: for $t = 1, 2, \ldots, T$ do
3:   Receive contexts $x_{t,1}, \ldots, x_{t,K}$, and compute $\hat{y}_{t,a} = \tilde{f}^{(S)}\big(\theta_t; x_{t,a}, \varepsilon^{(1:S)}\big)$, $a \in [K]$, using equation (11)
4:   Let $b = \arg\min_a \hat{y}_{t,a}$; set $p_{t,a} = \frac{1}{K + \gamma(\hat{y}_{t,a} - \hat{y}_{t,b})}$ for $a \ne b$, and $p_{t,b} = 1 - \sum_{a \ne b} p_{t,a}$
5:   Sample arm $a_t \sim p_t$ and observe output $y_{t,a_t}$
6:   Update $\theta_{t+1} = \Pi_{\mathcal{B}^{\text{Frob}}_{\rho,\rho_1}(\theta_0)}\Big(\theta_t - \eta_t \nabla \mathcal{L}^{(S)}_{\text{Sq}}\big(y_{t,a_t}, \{\tilde{f}(\theta_t; x_{t,a_t}, \varepsilon_s)\}_{s=1}^S\big)\Big)

Finally, we show $\inf_{\theta \in \mathcal{B}^{\text{Frob}}_{\rho,\rho_1}(\theta_0)} \sum_{t=1}^T \mathcal{L}^{(S)}_{\text{Sq}}\big(y_t, \{\tilde{f}(\theta; x_t, \varepsilon_s)\}_{s=1}^S\big) = O(1)$ using the fact that wide networks interpolate (Theorem E.1). This implies $\sum_{t=1}^T \mathcal{L}^{(S)}_{\text{Sq}}\big(y_t, \{\tilde{f}(\theta_t; x_t, \varepsilon_s)\}_{s=1}^S\big) = O(\log T)$, which, using (14) and recalling the definition in (13), implies $R_{\text{Sq}}(T) = O(\log T)$.

Remark 3.4. Note that $\sum_{t=1}^T \mathcal{L}^{(S)}_{\text{Sq}}\big(y_t, \{\tilde{f}(\theta_t; x_t, \varepsilon_s)\}_{s=1}^S\big) = O(\log T)$ from step 4 above also implies that $\sum_{t=1}^T \mathcal{L}^{(S)}_{\text{Sq}}\big(y_t, \{\tilde{f}(\theta_t; x_t, \varepsilon_s)\}_{s=1}^S\big) - \min_{\theta \in \mathcal{B}^{\text{Frob}}_{\rho,\rho_1}(\theta_0)} \sum_{t=1}^T \ell_{\text{Sq}}\big(y_t, f(\theta, x_t)\big) = O(\log T)$, and therefore our predictions are also competitive against $f$ as defined in (1).

Remark 3.5. Although the average loss in (13) is SC (Lemma 6), we do not use standard results from Shalev-Shwartz (2012); Hazan (2021) to obtain $O(\log T)$ regret. This is because the strong convexity constant is $\nu = O(1/\sqrt{m})$, and although OGD ensures $O(\log T)$ regret for SC functions, the constant hidden by the $O$ scales as $\frac{1}{\nu} = \sqrt{m}$. For large-width models $\sqrt{m} \gg T$, and therefore this approach does not yield an $O(\log T)$ bound. The key idea is instead to introduce the bare minimum strong convexity via (10), enough to ensure a unique minimum, without giving up the QG condition with $\mu = O(1)$.
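For concreteness, a minimal sketch of the inverse gap weighting step (Algorithm 1, line 4) in the loss formulation; `y_hat` is a NumPy array of predicted losses and `gamma` is the learning-rate parameter of the reduction (names are ours).

```python
import numpy as np

def igw_probs(y_hat, gamma):
    # Inverse gap weighting (Algorithm 1, line 4), loss formulation: the
    # probability of arm a shrinks with its predicted-loss gap to the greedy arm b.
    K = len(y_hat)
    b = int(np.argmin(y_hat))
    p = 1.0 / (K + gamma * (y_hat - y_hat[b]))
    p[b] = 0.0
    p[b] = 1.0 - p.sum()  # remaining mass goes to the greedy arm
    return p

# Example usage:
# y_hat = np.array([0.2, 0.5, 0.9])
# a_t = np.random.default_rng().choice(len(y_hat), p=igw_probs(y_hat, gamma=10.0))
```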
3.2 REGRET BOUNDS FOR KL LOSS

Next we consider the binary KL loss, defined as $\ell_{\text{KL}}(y_t, \hat{y}_t) = y_t \ln\frac{y_t}{\sigma(\hat{y}_t)} + (1 - y_t)\ln\frac{1 - y_t}{1 - \sigma(\hat{y}_t)}$, where $\sigma(y) = \frac{1}{1 + e^{-y}}$ is the sigmoid function. Following the approach outlined in Section 3.1, we consider a perturbed network as defined in (10). Note that here the output of the neural network is finally passed through a sigmoid. As in (11), we consider a combined predictor. With a slight abuse of notation, we define the prediction and the corresponding regret with $\ell_{\text{KL}}$ respectively as
$$\sigma^{(S)}\big(\theta; x_t, \varepsilon^{(1:S)}\big) = \frac{1}{S}\sum_{s=1}^S \sigma\big(\tilde{f}(\theta; x_t, \varepsilon_s)\big), \quad (16)$$
$$R_{\text{KL}}(T) = \sum_{t=1}^T \ell_{\text{KL}}\big(y_t, \sigma^{(S)}(\theta_t; x_t, \varepsilon^{(1:S)})\big) - \min_{\theta \in \mathcal{B}^{\text{Frob}}_{\rho,\rho_1}(\theta_0)} \sum_{t=1}^T \ell_{\text{KL}}\big(y_t, \sigma^{(S)}(\theta; x_t, \varepsilon^{(1:S)})\big).$$

Theorem 3.3 (Regret Bound for KL Loss). Under Assumptions 1, 4 and 5, for $y_t \in [z, 1-z]$ with $0 < z < 1$, with an appropriate choice of the step-size sequence $\{\eta_t\}$, width $m$, and perturbation constant $c_p$, with high probability over the randomness of the initialization and $\{\varepsilon_s\}_{s=1}^S$, the regret of projected OGD with loss $\mathcal{L}^{(S)}_{\text{KL}}\big(y_t, \{\sigma(\tilde{f}(\theta; x_t, \varepsilon_s))\}_{s=1}^S\big) = \frac{1}{S}\sum_{s=1}^S \ell_{\text{KL}}\big(y_t, \sigma(\tilde{f}(\theta; x_t, \varepsilon_s))\big)$, $S = \Theta(\log m)$, and projection ball $\mathcal{B}^{\text{Frob}}_{\rho,\rho_1}(\theta_0)$ with $\rho = \Theta(\sqrt{T}/\lambda_0)$ and $\rho_1 = \Theta(1)$ is $R_{\text{KL}}(T) = O(\log T)$.

The proof of the theorem follows a similar approach as the proof of Theorem 3.2 (see Appendix C.3).

4 NEURAL CONTEXTUAL BANDITS: FORMULATION AND REGRET BOUNDS

We consider a contextual bandit problem where a learner makes sequential decisions over $T$ time steps. At any round $t \in [T]$, the learner observes the contexts for $K$ arms, $x_{t,1}, \ldots, x_{t,K} \in \mathbb{R}^d$, where the contexts can be chosen adversarially. The learner chooses an arm $a_t \in [K]$, and the associated output $y_{t,a_t} \in [0, 1]$ is then observed. We make the following assumption on the output.

Assumption 6. The conditional expectation of $y_{t,a}$ given $x_{t,a}$ is given by $h: \mathbb{R}^d \mapsto [0, 1]$, i.e., $\mathbb{E}[y_{t,a} | x_{t,a}] = h(x_{t,a})$. Further, the context vectors satisfy $\|x_{t,a}\| \le 1$, $t \in [T]$, $a \in [K]$.

Algorithm 2 Neural FastCB (NeuFastCB); uses KL loss
1: Initialize $\theta_0$, $\gamma$, $\{\eta_t\}$
2: for $t = 1, 2, \ldots, T$ do
3:   Receive contexts $x_{t,1}, \ldots, x_{t,K}$ and compute $\hat{y}_{t,k}$, $k \in [K]$, using (16)
4:   Let $b_t = \arg\min_{k \in [K]} \hat{y}_{t,k}$; set $p_{t,k} = \frac{\hat{y}_{t,b_t}}{K\hat{y}_{t,b_t} + \gamma(\hat{y}_{t,k} - \hat{y}_{t,b_t})}$ for $k \ne b_t$, and $p_{t,b_t} = 1 - \sum_{k \ne b_t} p_{t,k}$
5:   Sample arm $a_t \sim p_t$ and observe output $y_{t,a_t}$
6:   Update $\theta_{t+1} = \Pi_{\mathcal{B}^{\text{Frob}}_{\rho,\rho_1}(\theta_0)}\Big(\theta_t - \eta_t \nabla \mathcal{L}^{(S)}_{\text{KL}}\big(y_{t,a_t}, \{\sigma(\tilde{f}(\theta_t; x_{t,a_t}, \varepsilon_s))\}_{s=1}^S\big)\Big)

The learner's goal is to minimize the regret of the contextual bandit problem, defined as the expected difference between the cumulative output of the algorithm and that of the optimal policy:
$$\text{Reg}_{\text{CB}}(T) = \mathbb{E}\Big[\sum_{t=1}^T y_{t,a_t} - y_{t,a_t^*}\Big], \quad (17)$$
where $a_t^* = \arg\min_{a \in [K]} h(x_{t,a})$ is the best action, minimizing the expected output in round $t$. NeuSquareCB and NeuFastCB are summarized in Algorithms 1 and 2 respectively. At time $t$, NeuSquareCB computes $\tilde{f}^{(S)}\big(\theta_t; x_{t,a}, \varepsilon^{(1:S)}\big)$, $a \in [K]$, using (11) (line 3). It then computes the probability of selecting each arm from the gaps between the learned outputs, following the inverse gap weighting scheme of Abe and Long (1999) (line 4), and samples an action $a_t$ from this distribution (line 5). It then receives the true output $y_{t,a_t}$ for the selected arm and updates the parameters of the network using projected online gradient descent (line 6).
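NeuFastCB's action distribution differs only in line 4 of Algorithm 2; a minimal sketch of that re-weighted inverse gap weighting follows, assuming predicted losses in $(0, 1)$ (as produced by the sigmoid in (16)); names are ours.

```python
import numpy as np

def fastcb_probs(y_hat, gamma):
    # Re-weighted inverse gap weighting (Algorithm 2, line 4): gaps are scaled
    # by the smallest predicted loss y_hat[b], sharpening exploitation when it
    # is small. Assumes y_hat is a NumPy array with entries in (0, 1).
    K = len(y_hat)
    b = int(np.argmin(y_hat))
    p = y_hat[b] / (K * y_hat[b] + gamma * (y_hat - y_hat[b]))
    p[b] = 0.0
    p[b] = 1.0 - p.sum()  # remaining mass goes to the greedy arm
    return p
```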
NeuFastCB employs a similar approach, except that it uses the KL loss to update the parameters of the network and a slightly different weighting scheme to compute the action distribution (Foster and Krishnamurthy, 2021).

Theorem 4.1 (Regret bound for NeuSquareCB). Under Assumptions 6 and 5, with an appropriate choice of the parameter $\gamma$, step-size sequence $\{\eta_t\}$, width $m$, and perturbation constant $c_p$, with high probability over the randomness in the initialization and $\{\varepsilon_s\}_{s=1}^S$, the regret of NeuSquareCB with $\rho = \Theta(\sqrt{T}/\lambda_0)$, $\rho_1 = \Theta(1)$ satisfies $\text{Reg}_{\text{CB}}(T) \le \tilde{O}(\sqrt{KT})$.

Theorem 4.2 (Regret bound for NeuFastCB). Under Assumptions 6 and 5, with an appropriate choice of the parameter $\gamma$, step-size sequence $\{\eta_t\}$, width $m$, and perturbation constant $c_p$, with high probability over the randomness in the initialization and $\{\varepsilon_s\}_{s=1}^S$, the regret of NeuFastCB with $\rho = \Theta(\sqrt{T}/\lambda_0)$, $\rho_1 = \Theta(1)$ satisfies $\text{Reg}_{\text{CB}}(T) \le \tilde{O}(\sqrt{L^* K} + K)$, where $L^* = \sum_{t=1}^T y_{t,a_t^*}$.

The proofs of Theorems 4.1 and 4.2 follow using the reductions from Foster and Rakhlin (2020) and Foster and Krishnamurthy (2021) respectively, crucially using our sharp regret bounds for online regression from Section 3. We provide both proofs in Appendix D for completeness.

Remark 4.1. Note that since $L^* \le T$, we have $\tilde{O}(\sqrt{L^* K} + K) \le \tilde{O}(\sqrt{KT} + K)$. Therefore NeuFastCB is expected to perform better in most settings in practice, especially when $L^*$ is small, i.e., when the best policy has low loss. Also note that, going by the upper bounds on the regret, and especially the dependence on $K$, NeuSquareCB could outperform NeuFastCB only if $L^* = \Theta(T)$ and $K \gg T$.

Remark 4.2. In the linear setting, Azoury and Warmuth (2001) give $R_{\text{Sq}}(T) \le O(p\log(T/p))$, where $p$ is the feature dimension (Section 2.3, Foster and Rakhlin (2020)). This translates to $\text{Reg}_{\text{CB}}(T) \le \tilde{O}(\sqrt{pKT})$. Further, with KL loss, using continuous exponential weights gives $R_{\text{KL}}(T) = O(p\log(T/p))$, which translates to $\text{Reg}_{\text{CB}}(T) \le O\big(\sqrt{L^* K p \log(T/p)} + Kp\log(T/p)\big)$ (Section 4, Foster and Krishnamurthy (2021)). However, with over-parameterized networks (where $p \gg T$), both bounds are $\Omega(T)$. It therefore becomes essential to obtain regret bounds that are independent of the number of parameters in the network, which our results achieve.

5 EXPERIMENTS

In this section, we evaluate the performance of NeuSquareCB and NeuFastCB, without output perturbation, against some popular NeuCB algorithms. We briefly describe the settings and the baselines considered here. For more details, a scaled-up version of Figure 1, and a discussion of the effect of output perturbation, see Appendix G.

[Figure 1 shows six panels, one per dataset, each plotting cumulative regret over 5000 rounds for EE-Net, NeuralUCB, NeuralTS, Neural-ϵ-greedy, NeuSquareCB, and NeuFastCB.]

Figure 1: Comparison of cumulative regret of NeuSquareCB and NeuFastCB with baselines on openml datasets (averaged over 20 runs).

Baselines and Datasets. For comparison, we choose four NeuCB algorithms: (i) NeuralUCB (Zhou et al., 2020), (ii) NeuralTS (Zhang et al., 2021), (iii) EE-Net (Ban et al., 2022b), and (iv) Neural-ϵ-greedy. We consider a collection of 6 multiclass classification datasets from the openml.org platform: covertype, fashion, MagicTelescope, mushroom, Plants and shuttle. We follow the standard evaluation strategy as in Zhou et al. (2020); Ban et al. (2022b) (see Appendix G for details).
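As a concrete illustration of how a classification instance becomes a bandit round, the following sketches the disjoint-arm encoding commonly used in this evaluation protocol (an assumption on our part; the exact transformation is specified in Appendix G):

```python
import numpy as np

def to_bandit_contexts(x, K):
    # Disjoint-arm encoding often used in neural bandit evaluations (cf. Zhou
    # et al., 2020): arm a's context places x in the a-th of K blocks, zeros
    # elsewhere, so the learner must identify the correct class/arm.
    d = x.shape[0]
    contexts = np.zeros((K, K * d))
    for a in range(K):
        contexts[a, a * d:(a + 1) * d] = x
    return contexts

def loss_of_arm(a, label):
    # Loss formulation: 0 for the correct class, 1 otherwise.
    return 0.0 if a == label else 1.0
```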
Adversarial Contexts. To evaluate the performance of the algorithms on adversarially chosen contexts, we feed data points to each model in the following manner: for the first 500 rounds, an instance $x$ is uniformly sampled from each of the classes, transformed into context vectors as described above, and presented to the model. We calculate the accuracy for each class by recording the rewards for that class divided by the number of instances drawn from it. In the subsequent 500 rounds, we increase the probability of sampling instances from the class that had the lowest accuracy in the previous rounds. We repeat this procedure every 500 rounds.

Results. Figure 1 plots the cumulative regret of all the algorithms across rounds. All experiments were averaged across 20 runs, and the standard deviation is plotted along with the average performance. Although all the algorithms use a neural network to model the potential nonlinearity in the reward, the baseline algorithms show erratic performance with a lot of variance. Both our algorithms, NeuSquareCB and NeuFastCB, show consistent performance across all the datasets. Moreover, NeuFastCB persistently outperforms all baselines on all the datasets.

6 CONCLUSION

In this work, we develop novel regret bounds for online regression with neural networks and subsequently give regret guarantees for NeuCBs. We provide a sharp $O(\log T)$ regret for online regression when the loss satisfies almost convexity, the QG condition, and has a unique minimum. We then propose a network with a small random perturbation, and show that this surprisingly makes the loss satisfy all three conditions. Using these results we obtain $O(\log T)$ regret bounds with both square loss and KL loss, and thereafter convert these into regret bounds for NeuCBs. Separately, we provide lower bound results for NeuralUCB (Zhou et al., 2020) and NeuralTS (Zhang et al., 2021) and show that even an oblivious adversary can choose a sequence of contexts and a reward function that make their regret bounds $\Omega(T)$. Our algorithms in contrast guarantee $\tilde{O}(\sqrt{T})$ regret, are efficient to implement, work even for contexts drawn by an adaptive adversary, and do not need to store previous networks (unlike Ban et al. (2022b)). Additionally, our experimental comparisons with the baselines on various datasets further highlight the advantages of our methods, and therefore significantly advance the state of the art in NeuCBs from both theoretical and empirical perspectives.

ACKNOWLEDGEMENT

The work was supported in part by grants from the National Science Foundation (NSF) through awards IIS 21-31335, OAC 21-30835, DBI 20-21898, IIS-2002540, as well as a C3.ai research award.

REFERENCES

Y. Abbasi-Yadkori, D. Pál, and C. Szepesvári. Improved algorithms for linear stochastic bandits. Advances in Neural Information Processing Systems, 24, 2011.

N. Abe and P. M. Long. Associative reinforcement learning using linear probabilistic concepts. In ICML, pages 3-11. Citeseer, 1999.

N. Abe, A. W. Biermann, and P. M. Long. Reinforcement learning with immediate rewards and linear hypotheses. Algorithmica, 37(4):263-293, 2003.
S. Agrawal and N. Goyal. Thompson sampling for contextual bandits with linear payoffs. In International Conference on Machine Learning, pages 127-135. PMLR, 2013.

Z. Allen-Zhu, Y. Li, and Y. Liang. Learning and generalization in overparameterized neural networks, going beyond two layers. Advances in Neural Information Processing Systems, 32, 2019a.

Z. Allen-Zhu, Y. Li, and Z. Song. A convergence theory for deep learning via over-parameterization. In International Conference on Machine Learning, pages 242-252. PMLR, 2019b.

S. Arora, S. Du, W. Hu, Z. Li, and R. Wang. Fine-grained analysis of optimization and generalization for overparameterized two-layer neural networks. In International Conference on Machine Learning, pages 322-332. PMLR, 2019a.

S. Arora, S. S. Du, W. Hu, Z. Li, R. R. Salakhutdinov, and R. Wang. On exact computation with an infinitely wide neural net. Advances in Neural Information Processing Systems, 32, 2019b.

K. S. Azoury and M. K. Warmuth. Relative loss bounds for on-line density estimation with the exponential family of distributions. Machine Learning, 43(3):211-246, 2001. doi: 10.1023/A:1010896012157.

Y. Ban and J. He. Generic outlier detection in multi-armed bandit. In Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, pages 913-923, 2020.

Y. Ban and J. He. Local clustering in contextual multi-armed bandits. In Proceedings of the Web Conference 2021, pages 2335-2346, 2021.

Y. Ban, J. He, and C. B. Cook. Multi-facet contextual bandits: A neural network perspective. In Proceedings of the 27th ACM SIGKDD Conference on Knowledge Discovery & Data Mining, pages 35-45, 2021.

Y. Ban, Y. Qi, T. Wei, and J. He. Neural collaborative filtering bandits via meta learning. arXiv preprint arXiv:2201.13395, 2022a.

Y. Ban, Y. Yan, A. Banerjee, and J. He. EE-Net: Exploitation-exploration neural networks in contextual bandits. In International Conference on Learning Representations, 2022b. URL https://openreview.net/forum?id=X_ch3VrNSRg.

A. Banerjee, P. Cisneros-Velarde, L. Zhu, and M. Belkin. Restricted strong convexity of deep learning models with smooth activations. In The Eleventh International Conference on Learning Representations, 2023. URL https://openreview.net/forum?id=PINRbk7h01.

Y. Cao and Q. Gu. Generalization error bounds of gradient descent for learning over-parameterized deep ReLU networks. In AAAI Conference on Artificial Intelligence, 2019a.

Y. Cao and Q. Gu. Generalization bounds of stochastic gradient descent for wide and deep neural networks. Advances in Neural Information Processing Systems, 32, 2019b.

Z. Charles and D. Papailiopoulos. Stability and generalization of learning algorithms that converge to global optima. In Proceedings of the 35th International Conference on Machine Learning, volume 80 of Proceedings of Machine Learning Research, pages 745-754. PMLR, 2018.

X. Chen, E. Minasyan, J. D. Lee, and E. Hazan. Provable regret bounds for deep online learning and control. arXiv preprint arXiv:2110.07807, 2021.

W. Chu, L. Li, L. Reyzin, and R. Schapire. Contextual bandits with linear payoff functions. In Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, pages 208-214. JMLR Workshop and Conference Proceedings, 2011.

S. Du, J. Lee, H. Li, L. Wang, and X. Zhai. Gradient descent finds global minima of deep neural networks. In International Conference on Machine Learning, pages 1675-1685. PMLR, 2019.
D. Foster and A. Rakhlin. Beyond UCB: Optimal and efficient contextual bandits with regression oracles. In International Conference on Machine Learning, pages 3199-3210. PMLR, 2020.

D. J. Foster and A. Krishnamurthy. Efficient first-order contextual bandits: Prediction, allocation, and triangular discrimination. Advances in Neural Information Processing Systems, 34, 2021.

S. Frei and Q. Gu. Proxy convexity: A unified framework for the analysis of neural networks trained by gradient descent. In Advances in Neural Information Processing Systems, 2021. URL https://openreview.net/forum?id=NVpGLJUuPx5.

E. Hazan. Introduction to online convex optimization, 2021.

J. Honorio and T. Jaakkola. Tight bounds for the expected risk of linear classifiers and PAC-Bayes finite-sample guarantees. In Proceedings of the Seventeenth International Conference on Artificial Intelligence and Statistics, volume 33 of Proceedings of Machine Learning Research, pages 384-392. PMLR, 2014. URL https://proceedings.mlr.press/v33/honorio14.html.

A. Jacot, F. Gabriel, and C. Hongler. Neural tangent kernel: Convergence and generalization in neural networks. Advances in Neural Information Processing Systems, 31, 2018.

H. Karimi, J. Nutini, and M. Schmidt. Linear convergence of gradient and proximal-gradient methods under the Polyak-Łojasiewicz condition. In Joint European Conference on Machine Learning and Knowledge Discovery in Databases, pages 795-811. Springer, 2016.

C. Liu, L. Zhu, and M. Belkin. On the linearity of large non-linear models: when and why the tangent kernel is constant. Advances in Neural Information Processing Systems, 33:15954-15964, 2020.

C. Liu, L. Zhu, and M. Belkin. Loss landscapes and optimization in over-parameterized non-linear systems and neural networks. Applied and Computational Harmonic Analysis, 59:85-116, 2022.

S. Łojasiewicz. A topological property of real analytic subsets (in French). In Les équations aux dérivées partielles, pages 87-89. Coll. du CNRS, 1963.

X. Lu and B. Van Roy. Ensemble sampling. In Proceedings of the 31st International Conference on Neural Information Processing Systems, NIPS'17, pages 3260-3268. Curran Associates Inc., 2017. ISBN 9781510860964.

B. Polyak. Gradient methods for the minimisation of functionals. USSR Computational Mathematics and Mathematical Physics, 3(4):864-878, 1963. doi: 10.1016/0041-5553(63)90382-3.

Y. Qi, Y. Ban, and J. He. Neural bandit with arm group graph. In Proceedings of the 28th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, pages 1379-1389, 2022.

Y. Qi, Y. Ban, and J. He. Graph neural bandits. In Proceedings of the 29th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, pages 1920-1931, 2023.

C. Riquelme, G. Tucker, and J. Snoek. Deep Bayesian bandits showdown: An empirical comparison of Bayesian deep networks for Thompson sampling. In International Conference on Learning Representations, 2018. URL https://openreview.net/forum?id=SyYe6k-CW.

I. Sason. On reverse Pinsker inequalities, 2015.

S. Shalev-Shwartz. Online learning and online convex optimization. Foundations and Trends in Machine Learning, 4(2):107-194, 2012. doi: 10.1561/2200000018.
D. Simchi-Levi and Y. Xu. Bypassing the monster: A faster and simpler optimal algorithm for contextual bandits under realizability. arXiv, abs/2003.12699, 2020.

M. Valko, N. Korda, R. Munos, I. Flaounas, and N. Cristianini. Finite-time analysis of kernelised contextual bandits. arXiv preprint arXiv:1309.6869, 2013.

R. Vershynin. Introduction to the non-asymptotic analysis of random matrices, 2011.

T. Zahavy and S. Mannor. Neural linear bandits: Overcoming catastrophic forgetting through likelihood matching, 2020. URL https://openreview.net/forum?id=r1gzdhEKvH.

W. Zhang, D. Zhou, L. Li, and Q. Gu. Neural Thompson sampling. In International Conference on Learning Representations (ICLR), 2021.

D. Zhou, L. Li, and Q. Gu. Neural contextual bandits with UCB-based exploration. In International Conference on Machine Learning, pages 11492-11502. PMLR, 2020.

A COMPARISON WITH RECENT NEURAL CONTEXTUAL BANDIT ALGORITHMS

In this section we show that the regret bounds for NeuralUCB (Zhou et al., 2020) and Neural Thompson Sampling (NeuralTS) (Zhang et al., 2021) are $\Omega(T)$ in the worst case. We start with a brief description of the notation used in these works. $H$ is the Neural Tangent Kernel (NTK) matrix computed from all the context vectors $x_{t,i}$, $t \in [T]$, $i \in [K]$, and $h(x)$ is the true reward function given a context $x$. The cumulative reward vector is defined as $h = (h(x_1), \ldots, h(x_{TK}))^\top$. With $\lambda$ as the regularization parameter in the loss, the effective dimension $\tilde{d}$ of the Neural Tangent Kernel $H$ on the contexts $\{x_t\}_{t=1}^{TK}$ is defined as
$$\tilde{d} = \frac{\log\det(I + H/\lambda)}{\log(1 + TK/\lambda)}.$$
Further, it is assumed that $H \succeq \lambda_0 I$ (see Assumption 4.2 in Zhou et al. (2020) and Assumption 3.4 in Zhang et al. (2021)).

A.1 $\Omega(T)$ REGRET FOR NEURALUCB

The bound on the regret for NeuralUCB (Zhou et al., 2020) is given by (ignoring constants)
$$\sqrt{T}\,\sqrt{\tilde{d}\log(1 + TK/\lambda)}\,\Big(\sqrt{\tilde{d}\log(1 + TK/\lambda)} + \sqrt{\lambda}\, S\Big) := B_{\text{NUCB}}(T, \lambda),$$
where $\lambda$ is the regularization constant and $S \ge \sqrt{h^\top H^{-1} h}$ with $h = (h(x_1), \ldots, h(x_{TK}))^\top$. We provide two $\Omega(T)$ regret bounds for NeuralUCB. The first result creates an instance that an oblivious adversary can choose before the algorithm begins such that the regret bound is $\Omega(T)$. The second result provides an $\Omega(T)$ bound for any reward function and set of context vectors as long as $\|h\|_2/\sqrt{\kappa}$ is $\Theta(1)$.

Theorem A.1. There exists a reward vector $h$ such that the regret bound for NeuralUCB is lower bounded as $B_{\text{NUCB}}(\lambda, T) = \Omega(T)$.

Proof. Consider the eigendecomposition $H = U\Sigma_H U^\top$, where the columns of $U \in \mathbb{R}^{TK \times TK}$ are the normalized eigenvectors $\{u_i\}_{i=1}^{TK}$ of $H$ and $\Sigma_H$ is a diagonal matrix containing the eigenvalues $\{\lambda_i(H)\}_{i=1}^{TK}$ of $H$. Now,
$$\sqrt{h^\top H^{-1}h} = \sqrt{h^\top U\Sigma_H^{-1}U^\top h} = \sqrt{\sum_{i=1}^{TK}\frac{(u_i^\top h)^2}{\lambda_i(H)}} \ge \sqrt{\xi\sum_{i=1}^{TK}\frac{1}{\lambda_i(H)}}, \quad (18)$$
where $\xi = \min_i (u_i^\top h)^2$. Further, observe that we can rewrite the effective dimension as
$$\tilde{d} = \frac{\log\det(I + H/\lambda)}{\log(1 + TK/\lambda)} = \frac{\sum_{i=1}^{TK}\log(1 + \lambda_i(H)/\lambda)}{\log(1 + TK/\lambda)}.$$
Writing $y_i = \lambda/\lambda_i(H)$, $i \in [TK]$, these give
$$\tilde{d}\log(1 + TK/\lambda) = \sum_{i=1}^{TK}\log\Big(1 + \frac{1}{y_i}\Big) \quad\text{and}\quad \lambda S^2 \ge \lambda\, h^\top H^{-1}h \ge \xi\sum_{i=1}^{TK} y_i.$$
Using these we can lower bound $B_{\text{NUCB}}(T, \lambda)$ in two ways. Dropping the $\sqrt{\lambda}S$ term,
$$B_{\text{NUCB}}(T, \lambda) \ge \sqrt{T}\sum_{i=1}^{TK}\log\Big(1 + \frac{1}{y_i}\Big),$$
and dropping the first term inside the parentheses,
$$B_{\text{NUCB}}(T, \lambda) \ge \sqrt{T}\,\sqrt{\sum_{i=1}^{TK}\log\Big(1 + \frac{1}{y_i}\Big)}\cdot\sqrt{\xi\sum_{i=1}^{TK} y_i} \ge \sqrt{T}\,\sqrt{\xi\sum_{i=1}^{TK} y_i\log\Big(1 + \frac{1}{y_i}\Big)}.$$
Further, using $\log(1 + 1/y) \ge \frac{1}{1+y}$ and $\frac{1+\xi y}{1+y} \ge \xi$ for $\xi \le 1$,
$$\sum_{i=1}^{TK}\log\Big(1 + \frac{1}{y_i}\Big) + \xi\sum_{i=1}^{TK} y_i\log\Big(1 + \frac{1}{y_i}\Big) = \sum_{i=1}^{TK}(1 + \xi y_i)\log\Big(1 + \frac{1}{y_i}\Big) \ge \sum_{i=1}^{TK}\frac{1 + \xi y_i}{1 + y_i} \ge \xi\, TK.$$
Hence at least one of the two summands is at least $\frac{\xi TK}{2}$, so either $B_{\text{NUCB}} \ge \frac{\xi}{2}\sqrt{T}\,TK$ or $B_{\text{NUCB}} \ge \sqrt{T}\sqrt{\frac{\xi TK}{2}}$; in both cases $B_{\text{NUCB}}(T,\lambda) = \Omega(T)$ for $\xi = \Theta(1)$. Recall that $\xi = \min_i (u_i^\top h)^2$.
Now, consider an $h$ that makes a $\pi/4$ angle with all the eigenvectors $u_i$, $i \in [TK]$, and therefore $\xi = \frac{1}{2}$. Note that for the positive-definiteness assumption on the NTK to hold, all the contexts need to be distinct, and therefore an oblivious adversary can always choose such an $h$. In such a case, the regret bound for NeuralUCB is $B_{\text{NUCB}}(T, \lambda) = \Omega(T)$.

Remark A.1. Note that the $\Omega(T)$ regret holds for any $h$ whose dot product with all the eigenvectors of $H$ is lower bounded by a constant.

Theorem A.2. For any cumulative reward vector $h$, with $\kappa$ as the condition number of the NTK matrix, the regret bound for NeuralUCB is lower bounded as
$$B_{\text{NUCB}}(\lambda, T) \ge \frac{T\sqrt{K}\,\|h\|_2}{\sqrt{\kappa}}\sqrt{\frac{y}{1+y}}, \quad\text{where } y = \lambda/\lambda_0.$$

Proof. Using the same notation as in the previous section, we can lower bound $S$ and $\tilde{d}$ as follows:
$$S \ge \sqrt{h^\top H^{-1}h} \ge \sqrt{\lambda_{\min}(H^{-1})}\,\|h\|_2 = \frac{\|h\|_2}{\sqrt{\lambda_{\max}(H)}},$$
$$\tilde{d} = \frac{\log\det(I + H/\lambda)}{\log(1 + TK/\lambda)} \ge \frac{\log\big(\lambda_{\min}(I + H/\lambda)^{TK}\big)}{\log(1 + TK/\lambda)} \ge \frac{TK\log(1 + \lambda_0/\lambda)}{\log(1 + TK/\lambda)}.$$
Using these we can lower bound $B_{\text{NUCB}}(T, \lambda)$ as follows:
$$B_{\text{NUCB}}(T, \lambda) \ge \sqrt{T}\,\sqrt{\tilde{d}\log(1 + TK/\lambda)}\cdot\sqrt{\lambda}\,S \ge \sqrt{T}\,\sqrt{TK\log\Big(1 + \frac{\lambda_0}{\lambda}\Big)}\cdot\frac{\sqrt{\lambda}\,\|h\|_2}{\sqrt{\lambda_{\max}(H)}} = T\sqrt{K}\,\|h\|_2\,\sqrt{\frac{\lambda_0}{\lambda_{\max}(H)}}\cdot\sqrt{y\log\Big(1 + \frac{1}{y}\Big)},$$
where $y = \lambda/\lambda_0$ and $\kappa = \lambda_{\max}(H)/\lambda_{\min}(H)$ is the condition number of the NTK. Since $y\log(1 + 1/y) \ge \frac{y}{1+y}$, and taking $\lambda_0 = \lambda_{\min}(H)$ (the tightest choice permitted by the assumption $H \succeq \lambda_0 I$) so that $\sqrt{\lambda_0/\lambda_{\max}(H)} = 1/\sqrt{\kappa}$, we have for $T, K \ge 1$
$$B_{\text{NUCB}}(T, \lambda) \ge \frac{T\sqrt{K}\,\|h\|_2}{\sqrt{\kappa}}\sqrt{\frac{y}{1+y}}. \quad (21)$$
If $\|h\|_2/\sqrt{\kappa}$ is $\Theta(1)$, then $B_{\text{NUCB}}(T, \lambda)$ is $\Omega(T)$.

A.2 $\Omega(T)$ REGRET FOR NEURAL THOMPSON SAMPLING

The bound on the regret for Neural Thompson Sampling (Zhang et al., 2021) is given by (ignoring constants)
$$B_{\text{NTS}}(T) := \sqrt{T}\,\sqrt{\tilde{d}\log(1 + TK/\lambda)}\,\Big(\big(\sqrt{\log T} + \sqrt{\log K}\big)\sqrt{\lambda}\,S + \sqrt{\tilde{d}\log(1 + TK/\lambda)}\Big).$$
We present two lower bounds on $B_{\text{NTS}}(T)$ below. As for NeuralUCB, the first result creates an instance that an oblivious adversary can choose before the algorithm begins such that the regret bound is $\Omega(T)$, and the second result provides an $\Omega(T)$ bound for any reward function and set of context vectors as long as $\lambda_0$ is $\Theta(1)$.

Theorem A.3. There exists a reward vector $h$ such that the regret bound for NeuralTS is lower bounded as $B_{\text{NTS}}(\lambda, T) = \Omega(T)$.

Proof. For Neural Thompson Sampling the regularization parameter $\lambda$ is chosen to be $\lambda = 1 + 1/T$ (see Theorem 3.5 in Zhang et al. (2021)), and therefore $1 \le \lambda \le 2$ for any $T \ge 1$. Therefore we can lower bound the effective dimension as
$$\tilde{d} = \frac{\log\det(I + H/\lambda)}{\log(1 + TK/\lambda)} = \frac{\sum_{i=1}^{TK}\log(1 + \lambda_i(H)/\lambda)}{\log(1 + TK/\lambda)} \ge \frac{\sum_{i=1}^{TK}\log(1 + \lambda_i(H)/2)}{\log(1 + TK)}.$$
Proceeding exactly as in the proof of Theorem A.1, now with $y_i = 1/\lambda_i(H)$ and $\log(1 + \lambda_i(H)/2) = \log\big(1 + \frac{1}{2y_i}\big) \ge \frac{1/(2y_i)}{1 + 1/(2y_i)} = \frac{1}{1 + 2y_i}$, we obtain
$$\sum_{i=1}^{TK}(1 + \xi y_i)\log\Big(1 + \frac{1}{2y_i}\Big) \ge \sum_{i=1}^{TK}\frac{1 + \xi y_i}{1 + 2y_i} \ge \frac{\xi}{2}\,TK,$$
since $\frac{1+\xi y}{1+2y} \ge \frac{\xi}{2}$ for $\xi \le 2$. As in the proof of Theorem A.1, for $h$ making an angle of $\pi/4$ with all eigenvectors of $H$ we have $\xi = \frac{1}{2}$, which proves the claim.

Theorem A.4. For any cumulative reward vector $h$, the regret bound for Neural Thompson Sampling is lower bounded as
$$B_{\text{NTS}}(\lambda, T) \ge \sqrt{T}\,\frac{TK\lambda_0}{2 + \lambda_0}.$$

Proof. Recall that $1 \le \lambda \le 2$ for any $T \ge 1$, and therefore we can lower bound the effective dimension as
$$\tilde{d} = \frac{\log\det(I + H/\lambda)}{\log(1 + TK/\lambda)} \ge \frac{TK\log(1 + \lambda_0/\lambda)}{\log(1 + TK/\lambda)} \ge \frac{TK\log(1 + \lambda_0/2)}{\log(1 + TK)}.$$
Therefore, using $1 \le \lambda \le 2$ and $S \ge \sqrt{h^\top H^{-1}h} \ge 0$,
$$B_{\text{NTS}}(T) \ge \sqrt{T}\,\sqrt{\tilde{d}\log(1 + TK/\lambda)}\cdot\sqrt{\tilde{d}\log(1 + TK/\lambda)} \ge \sqrt{T}\,TK\log\Big(1 + \frac{\lambda_0}{2}\Big) \ge \sqrt{T}\,TK\,\frac{\lambda_0/2}{1 + \lambda_0/2} = \sqrt{T}\,\frac{TK\lambda_0}{2 + \lambda_0},$$
using $\log(1 + x) \ge \frac{x}{1+x}$. If $\lambda_0 = \Theta(1)$, $B_{\text{NTS}}(T)$ is $\Omega(T)$.

B BACKGROUND AND PRELIMINARIES FOR TECHNICAL ANALYSIS

Before we proceed with the proofs of the claims in Section 3, we state a few recent results from Banerjee et al. (2023) that we will use throughout our proofs.
We assume the loss to be the squared loss throughout this subsection.

Lemma 1 (Hessian Spectral Norm Bound, Theorem 4.1 and Lemma 4.1 in Banerjee et al. (2023)). Under Assumptions 3 and 4, for any $x \in \mathcal{X}$, $\theta \in \mathcal{B}^{\text{Spec}}_{\rho,\rho_1}(\theta_0)$, with probability at least $1 - \frac{2(L+1)}{m}$, we have
$$\big\|\nabla^2_\theta f(\theta; x)\big\|_2 \le \frac{c_H}{\sqrt{m}} \quad \text{and} \quad \big\|\nabla_\theta f(\theta; x)\big\|_2 \le \varrho, \quad (22)$$
where
$$c_H = L\big(L^2\gamma^{2L} + L\gamma^L + 1\big)(1 + \rho_1)\,\psi_H \max_{l \in [L]}\gamma^{L-l} + L\gamma^L \max_{l \in [L]} h^{(l)}, \qquad \gamma = \sigma_1 + \frac{\rho}{\sqrt{m}}, \qquad h^{(l)} = \gamma^l + |\phi(0)|\sum_{i=1}^{l}\gamma^{i-1},$$
and $\psi_H$ is a constant determined by the Lipschitz and smoothness constants of the activation $\phi$ (see Banerjee et al. (2023) for its precise form).

Theorem 3.2 (Restated). Under Assumptions 1, 4 and 5, with an appropriate choice of the step-size sequence $\{\eta_t\}$, width $m$, and perturbation constant $c_p$ in (10), with probability at least $1 - \frac{C}{T^4}$ for some constant $C > 0$ over the randomness of the initialization and $\{\varepsilon_s\}_{s=1}^S$, the regret in (12) of projected OGD with loss $\mathcal{L}^{(S)}_{\text{Sq}}\big(y_t, \{\tilde{f}(\theta; x_t, \varepsilon_s)\}_{s=1}^S\big)$, $S = \Theta(\log m)$, and projection ball $\mathcal{B}^{\text{Frob}}_{\rho,\rho_1}(\theta_0)$ with $\rho = \Theta(\sqrt{T}/\lambda_0)$ and $\rho_1 = \Theta(1)$ is $R_{\text{Sq}}(T) = O(\log T)$.

Proof. The proof follows along the following four steps.

1. Square loss is Lipschitz, strongly convex, and smooth w.r.t. the output: This step ensures that Assumption 3 (Lipschitz, strong convexity and smoothness) is satisfied. We show that the loss function $\ell_{\text{Sq}}$ is Lipschitz, strongly convex and smooth with respect to the output $\tilde{f}(\theta; x, \varepsilon)$ inside $\mathcal{B}^{\text{Frob}}_{\rho,\rho_1}(\theta_0)$ with high probability over the randomness of the initialization and $\{\varepsilon_s\}_{s=1}^S$. We denote by $\lambda_{\text{Sq}}$, $a_{\text{Sq}}$ and $b_{\text{Sq}}$ the Lipschitz, strong convexity and smoothness parameters respectively, as defined in Assumption 3.

Lemma 5. For $\theta \in \mathcal{B}^{\text{Frob}}_{\rho,\rho_1}(\theta_0)$, with probability $1 - \frac{2(L+1)+1}{m}$ over the randomness of the initialization and $\{\varepsilon_s\}_{s=1}^S$, the loss $\ell_{\text{Sq}}\big(y_t, \tilde{f}(\theta; x_t, \varepsilon)\big)$ is Lipschitz, strongly convex and smooth with respect to its output $\tilde{f}(\theta; x_t, \varepsilon)$. Further, the corresponding parameters $\lambda_{\text{Sq}}$, $a_{\text{Sq}}$ and $b_{\text{Sq}}$ are $O(1)$.

Strong convexity and smoothness follow trivially from the definition of $\ell_{\text{Sq}}$. To show that $\ell_{\text{Sq}}$ is Lipschitz we show that the output of the neural network $\tilde{f}(\theta; x, \varepsilon)$ is bounded for any $\theta \in \mathcal{B}^{\text{Frob}}_{\rho,\rho_1}(\theta_0)$. Also note from Theorem 3.1 that $\lambda^2$ appears in the $\log T$ term, and therefore to obtain an $O(\log T)$ regret we must ensure that $\lambda_{\text{Sq}} = O(1)$, which Lemma 5 does. Note that using a union bound over $t \in [T]$ and Lemma 5, it follows that with probability $1 - \frac{2T(L+1)+1}{m}$, $\mathcal{L}^{(S)}_{\text{Sq}}\big(y_t, \{\tilde{f}(\theta; x_t, \varepsilon_s)\}_{s=1}^S\big)$ is also Lipschitz, strongly convex and smooth with respect to each of the outputs $\tilde{f}(\theta; x_t, \varepsilon_s)$, $s \in [S]$.

2. The average loss in (13) is almost convex and has a unique minimizer: We show that with $S = \Theta(\log m)$, the average loss in (13) is $\nu$-Strongly Convex (SC) with $\nu = O(1/\sqrt{m})$ w.r.t. $\theta \in \mathcal{B}^{\text{Frob}}_{\rho,\rho_1}(\theta_0)$, $t \in [T]$, which immediately implies Assumption 2(a) (almost convexity) and 2(c) (unique minima).

Lemma 6. Under Assumption 5 and $c_p = \sqrt{8\lambda_{\text{Sq}} c_H}$, with probability $1 - \frac{2T(L+2)}{m}$ over the randomness of the initialization and $\{\varepsilon_s\}_{s=1}^S$, $\mathcal{L}^{(S)}_{\text{Sq}}\big(y_t, \{\tilde{f}(\theta; x_t, \varepsilon_s)\}_{s=1}^S\big)$ is $\nu$-strongly convex with respect to $\theta \in \mathcal{B}^{\text{Frob}}_{\rho,\rho_1}(\theta_0)$, where $\nu = O(1/\sqrt{m})$.

3. The average loss in (13) satisfies the QG condition: It is known that square loss with wide networks under Assumption 5 (positive definite NTK) satisfies the PL condition (e.g., Liu et al. (2022)) with $\mu = O(1)$. We show that the average loss in (13) with $S = \Theta(\log m)$ also satisfies the PL condition with $\mu = O(1)$, which implies that it satisfies the QG condition with the same $\mu$, with high probability over the randomness of the initialization and $\{\varepsilon_s\}_{s=1}^S$.

Lemma 7. Under Assumption 5, with probability $1 - \frac{2T(L+1)C}{m}$ for some absolute constant $C > 0$, $\mathcal{L}^{(S)}_{\text{Sq}}\big(y_t, \{\tilde{f}(\theta; x_t, \varepsilon_s)\}_{s=1}^S\big)$ satisfies the QG condition over the randomness of the initialization and $\{\varepsilon_s\}_{s=1}^S$, with QG constant $\mu = O(1)$.

4. Bounding the final regret.
Steps 1 and 2 above surprisingly show that with a small output perturbation, square loss satisfies (a) almost convexity, (b) QG, and (c) unique minima as in Assumption 2. Combining with step 3, all the assumptions of Theorem 3.1 are satisfied by the average loss. Using a union bound over the three steps and invoking Theorem 3.1, we get with probability $1 - \frac{T(L+1)C}{m}$ for some absolute constant $C > 0$, with $\tilde{\theta}^* = \arg\min_{\theta \in \mathcal{B}^{\text{Frob}}_{\rho,\rho_1}(\theta_0)} \sum_{t=1}^T \mathcal{L}^{(S)}_{\text{Sq}}\big(y_t, \{\tilde{f}(\theta; x_t, \varepsilon_s)\}_{s=1}^S\big)$,
$$\sum_{t=1}^T \mathcal{L}^{(S)}_{\text{Sq}}\big(y_t, \{\tilde{f}(\theta_t; x_t, \varepsilon_s)\}_{s=1}^S\big) - \sum_{t=1}^T \mathcal{L}^{(S)}_{\text{Sq}}\big(y_t, \{\tilde{f}(\tilde{\theta}^*; x_t, \varepsilon_s)\}_{s=1}^S\big) \le 2\sum_{t=1}^T \mathcal{L}^{(S)}_{\text{Sq}}\big(y_t, \{\tilde{f}(\tilde{\theta}^*; x_t, \varepsilon_s)\}_{s=1}^S\big) + O(\log T). \quad (30)$$
Next we bound $\sum_{t=1}^T \mathcal{L}^{(S)}_{\text{Sq}}\big(y_t, \{\tilde{f}(\tilde{\theta}^*; x_t, \varepsilon_s)\}_{s=1}^S\big)$ using the following lemma.

Lemma 8. Under Assumptions 1 and 5, with probability $1 - \frac{2T(L+1)}{m}$ over the randomness of the initialization and $\{\varepsilon_s\}_{s=1}^S$, we have $\sum_{t=1}^T \mathcal{L}^{(S)}_{\text{Sq}}\big(y_t, \{\tilde{f}(\tilde{\theta}^*; x_t, \varepsilon_s)\}_{s=1}^S\big) = O(1)$.

Using Lemma 8 and (30) we have $\sum_{t=1}^T \mathcal{L}^{(S)}_{\text{Sq}}\big(y_t, \{\tilde{f}(\theta_t; x_t, \varepsilon_s)\}_{s=1}^S\big) = O(\log T)$, which, using (14) and recalling the definition in (13), implies $R_{\text{Sq}}(T) = O(\log T)$.

Next we prove all the intermediate lemmas in the above steps.

Lemma 5 (restated). For $\theta \in \mathcal{B}^{\text{Frob}}_{\rho,\rho_1}(\theta_0)$, with probability $1 - \frac{2(L+1)+1}{m}$ over the randomness of the initialization and $\{\varepsilon_s\}_{s=1}^S$, the loss $\ell_{\text{Sq}}\big(y_t, \tilde{f}(\theta; x_t, \varepsilon)\big)$ is Lipschitz, strongly convex and smooth with respect to its output $\tilde{f}(\theta; x_t, \varepsilon)$, and the corresponding parameters $\lambda_{\text{Sq}}$, $a_{\text{Sq}}$ and $b_{\text{Sq}}$ are $O(1)$.

Proof. We begin by showing that the output of both the perturbed and unperturbed networks, defined in (10) and (1) respectively, is bounded with high probability over the randomness of the initialization and in expectation over the randomness of $\varepsilon$. Consider any $\theta \in \mathcal{B}^{\text{Frob}}_{\rho,\rho_1}(\theta_0)$ and $x \in \mathcal{X}$. The output of the network in expectation w.r.t. the $\varepsilon_j$'s can be bounded as follows:
$$\mathbb{E}_\varepsilon\big|\tilde{f}(\theta; x_t, \varepsilon)\big|^2 \le 2|f(\theta; x)|^2 + \frac{2c_p^2}{\sqrt{m}}\sum_{i,j}(\theta - \theta_0)_i(\theta - \theta_0)_j\,\mathbb{E}[\varepsilon_i\varepsilon_j] \overset{(a)}{=} 2|f(\theta; x)|^2 + \frac{2c_p^2}{\sqrt{m}}\|\theta - \theta_0\|_2^2 \overset{(b)}{\le} 2|f(\theta; x)|^2 + \frac{2c_p^2(L\rho + \rho_1)^2}{\sqrt{m}} \overset{(c)}{\le} 2(1 + \rho_1)^2\Big(\gamma^L + |\phi(0)|\sum_{i=1}^L \gamma^{i-1}\Big)^2 + \frac{2c_p^2(L\rho + \rho_1)^2}{\sqrt{m}},$$
where $\gamma = \sigma_1 + \frac{\rho}{\sqrt{m}}$. Here (a) follows from the fact that $\mathbb{E}[\varepsilon_i\varepsilon_j] = 0$ when $i \ne j$ and $\mathbb{E}[\varepsilon_i\varepsilon_j] = 1$ when $i = j$, (b) follows from $\|\theta - \theta_0\|_2 \le L\rho + \rho_1$, and (c) holds with probability $1 - \frac{2(L+1)}{m}$ and follows from $\|v\| \le \|v_0\| + \|v - v_0\| \le 1 + \rho_1$ together with the arguments in the proof of Lemma 4.2 of Banerjee et al. (2023). Finally, with a union bound over $t \in [T]$ and Jensen's inequality, we get with probability $1 - \frac{2T(L+1)}{m}$,
$$\mathbb{E}\big|\tilde{f}(\theta; x_t, \varepsilon)\big| \le \sqrt{\mathbb{E}\big|\tilde{f}(\theta; x_t, \varepsilon)\big|^2} \le \sqrt{2(1 + \rho_1)^2\Big(\gamma^L + |\phi(0)|\sum_{i=1}^L \gamma^{i-1}\Big)^2 + \frac{2c_p^2(L\rho + \rho_1)^2}{\sqrt{m}}}. \quad (31)$$
Now consider $\varepsilon = (\varepsilon_1, \ldots, \varepsilon_{j-1}, \varepsilon_j, \varepsilon_{j+1}, \ldots, \varepsilon_p)$ and $\varepsilon' = (\varepsilon_1, \ldots, \varepsilon_{j-1}, \varepsilon_j', \varepsilon_{j+1}, \ldots, \varepsilon_p)$ that differ only in the $j$-th coordinate, where $\varepsilon_j'$ is an independent copy of $\varepsilon_j$. Then
$$\big|\tilde{f}(\theta; x_t, \varepsilon) - \tilde{f}(\theta; x_t, \varepsilon')\big| = \frac{c_p}{m^{1/4}}\big|(\theta - \theta_0)^\top e_j\big|\,|\varepsilon_j - \varepsilon_j'| \le \frac{2c_p}{m^{1/4}}\big|(\theta - \theta_0)_j\big|.$$
By McDiarmid's inequality, with probability $1 - \delta$ over the randomness of $\{\varepsilon_i\}_{i=1}^p$,
$$\Big|\big|\tilde{f}(\theta; x_t, \varepsilon)\big| - \mathbb{E}\big|\tilde{f}(\theta; x_t, \varepsilon)\big|\Big| \le \frac{2c_p}{m^{1/4}}\sqrt{\sum_{j=1}^p(\theta - \theta_0)_j^2\,\ln(2/\delta)} \le \frac{2c_p}{m^{1/4}}(L\rho + \rho_1)\sqrt{\ln(2/\delta)}.$$
Taking a union bound over $t \in [T]$ and choosing $\delta = \frac{1}{m}$, we get with probability $1 - \frac{1}{m}$ over the randomness of $\{\varepsilon_i\}_{i=1}^p$, for all $t \in [T]$,
$$\big|\tilde{f}(\theta; x_t, \varepsilon)\big| \le \mathbb{E}\big|\tilde{f}(\theta; x_t, \varepsilon)\big| + \frac{2c_p}{m^{1/4}}(L\rho + \rho_1)\sqrt{\ln(mT)}.$$
Combining with (31), we have with probability at least $1 - \frac{2T(L+2)+1}{m}$ over the randomness of the initialization and $\{\varepsilon_i\}_{i=1}^p$,
$$\big|\tilde{f}(\theta; x_t, \varepsilon)\big| \le \sqrt{2(1 + \rho_1)^2\Big(\gamma^L + |\phi(0)|\sum_{i=1}^L \gamma^{i-1}\Big)^2 + \frac{2c_p^2(L\rho + \rho_1)^2}{\sqrt{m}}} + \frac{2c_p}{m^{1/4}}(L\rho + \rho_1)\sqrt{\ln(mT)}. \quad (32)$$
Finally, taking a union bound over $\{\varepsilon_s\}_{s=1}^S$ for $S = O(\log m)$ and using the fact that $m = \Omega(T^5 L/\lambda_0^6)$, we get with probability at least $1 - \frac{2T(L+2)+1}{m}$ over the randomness of the initialization and $\{\varepsilon_s\}_{s=1}^S$ that, for all $t \in [T]$, $\mathcal{L}^{(S)}_{\text{Sq}}\big(y_t, \{\tilde{f}(\theta; x_t, \varepsilon_s)\}_{s=1}^S\big)$ is Lipschitz with $\lambda_{\text{Sq}} = \Theta(1)$. Further, $\ell''_{\text{Sq}} = 1$, and therefore the loss is strongly convex and smooth with $a_{\text{Sq}} = b_{\text{Sq}} = 1$ a.s. over the randomness of $\{\varepsilon_s\}_{s=1}^S$, which completes the proof.

Lemma 6 (restated). Under Assumption 5 and $c_p = \sqrt{8\lambda_{\text{Sq}} c_H}$, with probability $1 - \frac{2T(L+2)}{m}$ over the randomness of the initialization and $\{\varepsilon_s\}_{s=1}^S$, $\mathcal{L}^{(S)}_{\text{Sq}}\big(y_t, \{\tilde{f}(\theta; x_t, \varepsilon_s)\}_{s=1}^S\big)$ is $\nu$-strongly convex with respect to $\theta \in \mathcal{B}^{\text{Frob}}_{\rho,\rho_1}(\theta_0)$, where $\nu = O(1/\sqrt{m})$.

Proof. From (10) we have a.s.
$$\nabla_\theta \tilde{f}(\theta, x_t, \varepsilon) = \nabla_\theta f(\theta; x_t) + \frac{c_p}{m^{1/4}}\sum_{i=1}^p e_i\varepsilon_i, \quad (33)$$
$$\nabla^2_\theta \tilde{f}(\theta, x_t, \varepsilon) = \nabla^2_\theta f(\theta; x_t). \quad (34)$$
Next, with $\ell_t' = \big(\tilde{f}(\theta, x_t, \varepsilon) - y_t\big)$,
$$\nabla_\theta \ell_{\text{Sq}}\big(y_t, \tilde{f}(\theta, x_t, \varepsilon)\big) = \ell_t'\,\nabla_\theta \tilde{f}(\theta, x_t, \varepsilon), \qquad \nabla^2_\theta \ell_{\text{Sq}}\big(y_t, \tilde{f}(\theta, x_t, \varepsilon)\big) = \nabla_\theta \tilde{f}(\theta, x_t, \varepsilon)\,\nabla_\theta \tilde{f}(\theta, x_t, \varepsilon)^\top + \ell_t'\,\nabla^2_\theta \tilde{f}(\theta, x_t, \varepsilon),$$
where we have used the fact that $\ell_t'' = 1$ and $\nabla^2_\theta \tilde{f}(\theta, x_t, \varepsilon) = \nabla^2_\theta f(\theta; x_t)$ from (34). Consider $u \in \mathbb{S}^{p-1}$, the unit sphere in $\mathbb{R}^p$. Then, writing $\Gamma_s = \langle u, \varepsilon_s\rangle$ and $\ell'_{t,s} = \tilde{f}(\theta, x_t, \varepsilon_s) - y_t$,
$$u^\top \nabla^2_\theta \mathcal{L}^{(S)}_{\text{Sq}}\big(y_t, \{\tilde{f}(\theta; x_t, \varepsilon_s)\}_{s=1}^S\big)u = \langle\nabla_\theta f(\theta; x_t), u\rangle^2 + \frac{2c_p}{m^{1/4}}\langle\nabla_\theta f(\theta; x_t), u\rangle\,\frac{1}{S}\sum_{s=1}^S \Gamma_s + \frac{c_p^2}{\sqrt{m}}\,\frac{1}{S}\sum_{s=1}^S \Gamma_s^2 + \frac{1}{S}\sum_{s=1}^S \ell'_{t,s}\, u^\top \nabla^2_\theta f(\theta; x_t)\, u. \quad (35)$$
Now, | f(θ; xt, ε) f(θ; xt, ε )| = creg m1/4 |(θ θ0)T vjεj (θ θ0)T vjε j| m1/4 |(θ θ0)j| By Mc Diarmid s inequality we have with probability (1 δ) over the randomness of {εi}p i=1 | f(θ; xt, ε)| E| f(θ; xt, ε)| j=1 (θ θ0)2 j ln(1/δ) 2creg m1/4 θ θ0 2 p 2creg m1/4 (Lρ + ρ1) p Taking a union bound over t [T], with probability 1 δ over the randomness of {εi}p i=1 we have | f(θ; xt, ε)| E| f(θ; xt, ε)| 2creg m1/4 (Lρ + ρ1) p Choosing δ = 1 m we get with probability 1 1 m over the randomness of {εi}p i=1 | f(θ; xt, ε)| E| f(θ; xt, ε)| 2creg m1/4 (Lρ + ρ1) p Published as a conference paper at ICLR 2024 Combining with (31), we have with probability at least 1 2T (L+2)+1 m over the randomness of the initialization and {εi}p i=1 we have | f(θ; xt, ε)| v u u t2(1 + ρ1)2 γL + |ϕ(0)| i=1 γi 1 !2 + 2 mc2reg(Lρ + ρ1)2 2creg m1/4 (Lρ + ρ1) p ln(m T) (32) Finally taking a union bound over {εs}S s=1 for S = O(log m) and using the fact that m = Ω(T 5L/λ6 0) we get with probability at least 1 2T (L+2)+1 m over the randomness of the ini- tialization and {ε}S s=1, t [T], L(S) Sq yt, f(θ; xt, εs) S s=1 is lipschitz with λSq = Θ(1). Further ℓ Sq = 1 and therefore the loss is strongly convex and smooth with a Sq = b Sq = 1 a.s. over the randomness of {εs}S s=1, which completes the proof. Lemma 6. Under Assumption 5 and cp = p 8λSq CH, with probability 1 2T (L+2) randomness of the initialization and {εs}S s=1, L(S) Sq yt, f(θ; xt, εs) S s=1 is ν-strongly convex with respect to θ BFrob ρ,ρ1 (θ0), where ν = O 1 m . Proof. From (10) we have a.s. θ f(θ, xt, ε) = θf(θ; x) + creg eiεi m1/4 , (33) 2 θ f(θ, xt, ε) = 2 θf(θ; x). (34) Next, with ℓ t = ( f(θ, xt, ε) yt) θℓSq yt, f(θ, xt, ε) = ℓ t θ f(θ, xt, ε) 2 θℓSq yt, f(θ, xt, ε) = θ f(θ, xt, ε) θ f(θ, xt, ε)T + ℓ t 2 θ f(θ, xt, ε) where we have used the fact that ℓ t = 1, and 2 θ f(θ, xt, ε) = 2 θf(θ; xt) from (34). Consider u Sp 1, the unit ball in Rp. Then u T 2 θL(S) Sq yt, f(θ; xt, εs) S s=1 θf(θ; xt) θf(θ; xt)T vi θf(θ; xt)T m1/4 εs,i + creg θf(θ; xt)v T i m1/4 εs,i + c2 reg viv T j m εs,iεs,j +ℓ t,iu T 2 θf(θ; xt)u = θf(θ; xt), u 2 + 2 θf(θ; xt), u creg + c2 reg m 1 S j=1 uiujεs,iεs,j +ℓ t,iu T 2 θf(θ; xt)u (35) Published as a conference paper at ICLR 2024 Notice that s [S], Γs = u, εs is a weighted sum of Rademacher random variables and therefore is u 2 = 1 sub-gaussian. Using Hoeffding s inequality, for a given t [T], e S2ω2 1/2. Taking a union bound we get t [T] Te S2ω2 1/2. Further if Γs is σs sub-gaussian, then Γ2 s is (ν2, α) sub-exponential with ν = 4 2σ2 s, α = 4σ2 s (see Honorio and Jaakkola (2014, Appendix B)) and therefore Γ2 s is (4 2, 4) sub-exponential. Using Bernstein s inequality for a given t [T], s=1 Γ2 s EΓ2 s ω2 2 min S2ω2 2 32 , Sω2 Taking a union bound we get for any t [T] s=1 Γ2 s EΓ2 s ω2 2 min S2ω2 2 32 , Sω2 Now choosing ω1 = ω2 = 1 2, we get for any t [T] s=1 Γ2 s EΓ2 s 1 2 min S2 128 , S Te S/8 , S 16 Observing that EΓs = 0, EΓ2 s = 1, combining with (35) and recalling that with probability 1 2(L+1)T m over the randomness of initialization, θf(θ; xt) 2 c H m we get with proba- bility at least 1 T(e S2/8 + e S/8) 2(L+1)T u T 2 θL(S) Sq yt, f(θ; xt, εs) S s=1 u θf(θ; xt), u 2 θf(θ; xt), u creg m1/4 | {z } I + c2 reg 2 m λSq CH m . Term I is minimized for θf(θ; xt), u = creg 2m1/4 and the minimum value is c2 reg 4 m. Substituting this back to the above equation we get u T 2 θL(S) Sq yt, f(θ; xt, εs) S s=1 u c2 reg 4 m + c2 reg 2 m λSq CH m = c2 reg 4 m λSq CH m where the last equality follows because c2 reg = 8λSq CH. 
Lemma 6. Under Assumption 5 and with $c_p=\sqrt{8\lambda_{\mathrm{Sq}}C_H}$, with probability $1-\frac{2T(L+2)}{m}$ over the randomness of the initialization and $\{\varepsilon_s\}_{s=1}^{S}$, $\mathcal L^{(S)}_{\mathrm{Sq}}\big(y_t,\{\tilde f(\theta;x_t,\varepsilon_s)\}_{s=1}^{S}\big)$ is $\nu$-strongly convex with respect to $\theta\in B^{\mathrm{Frob}}_{\rho,\rho_1}(\theta_0)$, where $\nu=O\big(\frac{1}{\sqrt m}\big)$.

Proof. From (10) we have, a.s.,
$$\nabla_\theta\tilde f(\theta;x_t,\varepsilon)=\nabla_\theta f(\theta;x)+\frac{c_{\mathrm{reg}}}{m^{1/4}}\sum_i e_i\varepsilon_i,\qquad(33)$$
$$\nabla^2_\theta\tilde f(\theta;x_t,\varepsilon)=\nabla^2_\theta f(\theta;x).\qquad(34)$$
Next, with $\ell'_t=\big(\tilde f(\theta;x_t,\varepsilon)-y_t\big)$,
$$\nabla_\theta\ell_{\mathrm{Sq}}\big(y_t,\tilde f(\theta;x_t,\varepsilon)\big)=\ell'_t\,\nabla_\theta\tilde f(\theta;x_t,\varepsilon),\qquad \nabla^2_\theta\ell_{\mathrm{Sq}}\big(y_t,\tilde f(\theta;x_t,\varepsilon)\big)=\nabla_\theta\tilde f(\theta;x_t,\varepsilon)\,\nabla_\theta\tilde f(\theta;x_t,\varepsilon)^\top+\ell'_t\,\nabla^2_\theta\tilde f(\theta;x_t,\varepsilon),$$
where we used $\ell''_t=1$ and $\nabla^2_\theta\tilde f=\nabla^2_\theta f$ from (34). Consider $u\in\mathbb S^{p-1}$, the unit sphere in $\mathbb R^p$. Then
$$u^\top\nabla^2_\theta\mathcal L^{(S)}_{\mathrm{Sq}}\,u=\frac1S\sum_{s=1}^{S}\Big[\langle\nabla_\theta f(\theta;x_t),u\rangle^2+\frac{2c_{\mathrm{reg}}}{m^{1/4}}\langle\nabla_\theta f(\theta;x_t),u\rangle\langle u,\varepsilon_s\rangle+\frac{c_{\mathrm{reg}}^2}{\sqrt m}\langle u,\varepsilon_s\rangle^2+\ell'_{t,s}\,u^\top\nabla^2_\theta f(\theta;x_t)\,u\Big].\qquad(35)$$
Notice that for every $s\in[S]$, $\Gamma_s=\langle u,\varepsilon_s\rangle$ is a weighted sum of Rademacher random variables and is therefore $\|u\|_2=1$ sub-gaussian. By Hoeffding's inequality, for a given $t\in[T]$, $\mathbb P\big(\big|\frac1S\sum_s\Gamma_s\big|\ge\omega_1\big)\le e^{-S\omega_1^2/2}$, and a union bound extends this to all $t\in[T]$ at the cost of a factor $T$. Further, if $\Gamma_s$ is $\sigma_s$ sub-gaussian, then $\Gamma_s^2$ is $(\nu^2,\alpha)$ sub-exponential with $\nu=4\sqrt2\sigma_s^2$, $\alpha=4\sigma_s^2$ (see Honorio and Jaakkola (2014, Appendix B)), so $\Gamma_s^2$ is $(4\sqrt2,4)$ sub-exponential. By Bernstein's inequality, for a given $t\in[T]$,
$$\mathbb P\Big(\Big|\frac1S\sum_{s=1}^{S}\big(\Gamma_s^2-\mathbb E\Gamma_s^2\big)\Big|\ge\omega_2\Big)\le\exp\Big(-\min\Big\{\frac{S\omega_2^2}{64},\frac{S\omega_2}{8}\Big\}\Big),$$
and again a union bound extends this to all $t\in[T]$. Choosing $\omega_1=\omega_2=\frac12$, we have for all $t\in[T]$, with probability at least $1-T(e^{-S^2/8}+e^{-S/8})$ for $S\ge16$, that $\big|\frac1S\sum_s\Gamma_s\big|\le\frac12$ and $\big|\frac1S\sum_s(\Gamma_s^2-\mathbb E\Gamma_s^2)\big|\le\frac12$. Observing that $\mathbb E\Gamma_s=0$ and $\mathbb E\Gamma_s^2=1$, combining with (35), and recalling that with probability $1-\frac{2(L+1)T}{m}$ over the randomness of initialization $\|\nabla^2_\theta f(\theta;x_t)\|_2\le\frac{C_H}{\sqrt m}$, we get with probability at least $1-T(e^{-S^2/8}+e^{-S/8})-\frac{2(L+1)T}{m}$,
$$u^\top\nabla^2_\theta\mathcal L^{(S)}_{\mathrm{Sq}}\,u\ \ge\ \underbrace{\langle\nabla_\theta f(\theta;x_t),u\rangle^2-\frac{c_{\mathrm{reg}}}{m^{1/4}}\big|\langle\nabla_\theta f(\theta;x_t),u\rangle\big|}_{\mathrm{I}}+\frac{c_{\mathrm{reg}}^2}{2\sqrt m}-\frac{\lambda_{\mathrm{Sq}}C_H}{\sqrt m}.$$
Term I is minimized at $|\langle\nabla_\theta f(\theta;x_t),u\rangle|=\frac{c_{\mathrm{reg}}}{2m^{1/4}}$, with minimum value $-\frac{c_{\mathrm{reg}}^2}{4\sqrt m}$. Substituting this back,
$$u^\top\nabla^2_\theta\mathcal L^{(S)}_{\mathrm{Sq}}\,u\ \ge\ -\frac{c_{\mathrm{reg}}^2}{4\sqrt m}+\frac{c_{\mathrm{reg}}^2}{2\sqrt m}-\frac{\lambda_{\mathrm{Sq}}C_H}{\sqrt m}=\frac{c_{\mathrm{reg}}^2}{4\sqrt m}-\frac{\lambda_{\mathrm{Sq}}C_H}{\sqrt m}=\frac{\lambda_{\mathrm{Sq}}C_H}{\sqrt m},$$
where the last equality follows because $c_{\mathrm{reg}}^2=8\lambda_{\mathrm{Sq}}C_H$. Since $e^{-S^2/8}\le e^{-S/8}$ for $S\ge1$, for $S\ge16$ we have, with probability $1-2Te^{-S/8}-\frac{2(L+1)T}{m}$, that $u^\top\nabla^2_\theta\mathcal L^{(S)}_{\mathrm{Sq}}\,u\ge\frac{\lambda_{\mathrm{Sq}}C_H}{\sqrt m}>0$. Choosing $S=\max\{8\log m,16\}$, with probability $1-\frac{2T(L+2)}{m}$ over the randomness of initialization and $\{\varepsilon_s\}_{s=1}^{S}$,
$$u^\top\nabla^2_\theta\mathcal L^{(S)}_{\mathrm{Sq}}\,u\ \ge\ \frac{\lambda_{\mathrm{Sq}}C_H}{\sqrt m}>0,$$
which completes the proof.
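The heart of Lemma 6 is the scalar minimization of term I. The snippet below numerically confirms that, for every value $a=\langle\nabla_\theta f,u\rangle$, the quadratic $a^2-a\,c_{\mathrm{reg}}/m^{1/4}+c_{\mathrm{reg}}^2/(2\sqrt m)$ stays above $c_{\mathrm{reg}}^2/(4\sqrt m)$; the values of $m$ and $c_{\mathrm{reg}}$ are illustrative.

```python
import numpy as np

# Numeric check of the scalar step in Lemma 6: for any a = <grad f, u>,
#   a^2 - a * c / m^{1/4} + c^2 / (2*sqrt(m))  >=  c^2 / (4*sqrt(m)),
# with the worst case over a attained at a = c / (2 * m^{1/4}).

m, c = 10_000.0, 2.0
a_grid = np.linspace(-1.0, 1.0, 100_001)
vals = a_grid**2 - a_grid * c / m**0.25 + c**2 / (2 * np.sqrt(m))
assert vals.min() >= c**2 / (4 * np.sqrt(m)) - 1e-12
print(vals.min(), c**2 / (4 * np.sqrt(m)))  # the two agree up to grid resolution
```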
Lemma 7. Under Assumption 5, with probability $1-\frac{CT(L+1)}{m}$ for some absolute constant $C>0$ over the randomness of the initialization and $\{\varepsilon_s\}_{s=1}^{S}$, $\mathcal L^{(S)}_{\mathrm{Sq}}\big(y_t,\{\tilde f(\theta;x_t,\varepsilon_s)\}_{s=1}^{S}\big)$ satisfies the QG condition with QG constant $\mu=O(1)$.

Proof. We have
$$\Big\|\nabla_\theta\mathcal L^{(S)}_{\mathrm{Sq}}\Big\|_2^2=\Big\|\frac1S\sum_{s=1}^{S}\big(\tilde f(\theta;x_t,\varepsilon_s)-y_t\big)\nabla_\theta\tilde f(\theta;x_t,\varepsilon_s)\Big\|_2^2=\frac{1}{S^2}\big(\tilde F(\theta;x_t)-y_t\mathbf 1_S\big)^\top K(\theta,x_t)\big(\tilde F(\theta;x_t)-y_t\mathbf 1_S\big),\qquad(36)$$
where $\tilde F(\theta;x_t)\in\mathbb R^{S}$ with $(\tilde F(\theta;x_t))_s=\tilde f(\theta;x_t,\varepsilon_s)$, $\mathbf 1_S$ is the $S$-dimensional all-ones vector, and $K(\theta,x_t)_{s,s'}=\big\langle\nabla_\theta\tilde f(\theta;x_t,\varepsilon_s),\nabla_\theta\tilde f(\theta;x_t,\varepsilon_{s'})\big\rangle$. Using (33),
$$K(\theta,x_t)_{s,s'}=\big\langle\nabla f(\theta;x_t),\nabla f(\theta;x_t)\big\rangle+\frac{c_{\mathrm{reg}}}{m^{1/4}}\Big(\big\langle\nabla f(\theta;x_t),\varepsilon_s\big\rangle+\big\langle\nabla f(\theta;x_t),\varepsilon_{s'}\big\rangle\Big)+\frac{c_{\mathrm{reg}}^2}{\sqrt m}\langle\varepsilon_s,\varepsilon_{s'}\rangle,$$
and therefore
$$K(\theta,x_t)=\big\langle\nabla f(\theta;x_t),\nabla f(\theta;x_t)\big\rangle\,\mathbf 1_S\mathbf 1_S^\top+\frac{c_{\mathrm{reg}}^2}{\sqrt m}\,\varepsilon_M^\top\varepsilon_M+\frac{c_{\mathrm{reg}}}{m^{1/4}}\big(\mathbf 1_S\,g^\top+g\,\mathbf 1_S^\top\big),$$
where $\varepsilon_M$ is the $p\times S$ matrix whose columns are the $\varepsilon_s$, and $g\in\mathbb R^S$ with $g_s=\nabla f(\theta;x_t)^\top\varepsilon_s$. Next we give a uniform lower bound on the smallest eigenvalue of $K(\theta,x_t)$. For $u\in\mathbb S^{S-1}$, the unit sphere in $\mathbb R^S$,
$$u^\top K(\theta,x_t)u=\underbrace{\big\langle\nabla f(\theta;x_t),\nabla f(\theta;x_t)\big\rangle\,u^\top\mathbf 1_S\mathbf 1_S^\top u}_{\mathrm{I}}+\underbrace{\frac{c_{\mathrm{reg}}^2}{\sqrt m}\,u^\top\varepsilon_M^\top\varepsilon_M u}_{\mathrm{II}}+\underbrace{\frac{c_{\mathrm{reg}}}{m^{1/4}}\sum_{s,s'}u_su_{s'}\big(g_s+g_{s'}\big)}_{\mathrm{III}}.$$
Consider term I. We can lower bound it by $0$, since $\lambda_{\min}(\mathbf 1_S\mathbf 1_S^\top)\ge0$ and, for any $\theta\in B^{\mathrm{Frob}}_{\rho,\rho_1}(\theta_0)$, with probability $1-\frac{2(L+1)}{m}$ over the initialization,
$$\big\langle\nabla f(\theta;x_t),\nabla f(\theta;x_t)\big\rangle\ \ge\ \lambda_{\min}\big(K_{\mathrm{NTK}}(\theta)\big)\ \ge\ \lambda_0-\frac{2c_H\varrho T\sqrt L(\rho+\rho_1)}{\sqrt m}\ \ge\ \frac{\lambda_0}{2},\qquad(37)$$
where the last inequality holds by choosing $m\ge\frac{16c_H^2\varrho^2T^2L(\rho+\rho_1)^2}{\lambda_0^2}=\Omega(T^3/\lambda_0^4)$. To see why (37) holds, observe that by the same argument as in (55) and (52), and using $\theta\in B^{\mathrm{Frob}}_{\rho,\rho_1}(\theta_0)$,
$$\|K_{\mathrm{NTK}}(\theta)-K_{\mathrm{NTK}}(\theta_0)\|_2\ \le\ \frac{2c_H\varrho T\sqrt L(\rho+\rho_1)}{\sqrt m},$$
and therefore $\lambda_{\min}(K_{\mathrm{NTK}}(\theta))\ge\lambda_{\min}(K_{\mathrm{NTK}}(\theta_0))-\|K_{\mathrm{NTK}}(\theta)-K_{\mathrm{NTK}}(\theta_0)\|_2\ge\lambda_0-\frac{2c_H\varrho T\sqrt L(\rho+\rho_1)}{\sqrt m}$.

Next consider term II. Since every entry of the $p\times S$ matrix $\varepsilon_M$ is i.i.d. Rademacher, by Lemma 5.24 of Vershynin (2011) its rows are independent sub-gaussian with sub-gaussian norms $\|(\varepsilon_M)_i\|_{\psi_2}\le C_i$ for absolute constants $C_i$. By Theorem 5.39 of Vershynin (2011), with probability at least $1-2\exp(-c_Mt^2)$,
$$\sqrt{\lambda_{\min}\big(\varepsilon_M^\top\varepsilon_M\big)}\ \ge\ \sqrt p-C_M\sqrt S-t,$$
where $C_M$ and $c_M$ depend only on $\max_i\|(\varepsilon_M)_i\|_{\psi_2}$, an absolute constant. Choosing $t=\sqrt S$, with probability at least $1-\frac{c}{m}$ for some absolute constant $c$,
$$\sqrt{\lambda_{\min}\big(\varepsilon_M^\top\varepsilon_M\big)}\ \ge\ \sqrt p-(C_M+1)\sqrt S.$$
Noting that $S=\max\{16,8\log m\}$ and $p=md+(L-1)m^2+m\ge m$, and using $m\ge64(C_M+1)^2$ and $\frac{m}{\log m}\ge8\sqrt2(C_M+1)$, with probability at least $1-\frac{c}{m}$, uniformly for any $u\in\mathbb S^{S-1}$,
$$u^\top\big(\varepsilon_M^\top\varepsilon_M\big)u\ \ge\ \frac m2.$$

Finally consider term III. We have
$$\Big|\sum_{s,s'}u_su_{s'}\big(g_s+g_{s'}\big)\Big|=2\,\big|(\mathbf 1_S^\top u)(g^\top u)\big|\ \le\ 2\sqrt S\,\Big|\sum_{s=1}^{S}u_s\,\nabla f(\theta;x_t)^\top\varepsilon_s\Big|.$$
Since each $\varepsilon_{s,j}$ is $1$ sub-gaussian, $\nabla f(\theta;x_t)^\top\varepsilon_s$ is $\|\nabla f(\theta;x_t)\|_2\,(\le\varrho)$ sub-gaussian with zero mean. Using Hoeffding's inequality, choosing $t=\frac18$, and noting that $S=\max\{16,8\log m\}$, with probability at least $1-\frac{1}{m^2}$,
$$\frac{c_{\mathrm{reg}}}{m^{1/4}}\Big|\sum_{s,s'}u_su_{s'}\big(g_s+g_{s'}\big)\Big|\ \le\ \frac{16c_{\mathrm{reg}}}{m^{1/4}}(\log m)^{3/2}\ \le\ 1,$$
where the last inequality uses $m^{1/4}\ge16c_{\mathrm{reg}}$ and $m^{1/4}(\log m)^{-3/2}\ge4\sqrt2$. Finally, to get a bound uniform over $\mathbb S^{S-1}$, we use a standard $\epsilon$-net argument with $\epsilon=1/4$ and metric entropy $S\log9$ (see e.g. the proof of Theorem 5.39 in Vershynin (2011)). Using $\sqrt m(\log m)^{-1}\ge8\log9$, with probability at least $1-\frac1m$, term III is bounded by $1$ in magnitude uniformly over $u\in\mathbb S^{S-1}$.

Combining all three terms and using $m\ge4$, we get with probability $1-\frac{2(L+1)C}{m}$ for some absolute constant $C>0$,
$$\frac{1}{S^2}\big(\tilde F(\theta;x_t)-y_t\mathbf 1_S\big)^\top K(\theta,x_t)\big(\tilde F(\theta;x_t)-y_t\mathbf 1_S\big)\ \ge\ \frac{\sqrt m}{4S^2}\,\big\|\tilde F(\theta;x_t)-y_t\mathbf 1_S\big\|_2^2\ \ge\ \frac{\sqrt m}{64\log m}\,\mathcal L^{(S)}_{\mathrm{Sq}}\big(y_t,\{\tilde f(\theta;x_t,\varepsilon_s)\}_{s=1}^{S}\big)\ \ge\ 2\mu\,\mathcal L^{(S)}_{\mathrm{Sq}}\big(y_t,\{\tilde f(\theta;x_t,\varepsilon_s)\}_{s=1}^{S}\big)$$
with $\mu=\frac{1}{128}$, where the last inequality follows from $\sqrt m(\log m)^{-1}\ge1$. Combining with (36), using the fact that PL implies QG (see Remark 3.1), and taking a union bound over $t\in[T]$ completes the proof.
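Term II rests on a tall Rademacher matrix being well conditioned. A small simulation (sizes illustrative, and far smaller than $p=md+(L-1)m^2+m$) shows $\lambda_{\min}(\varepsilon_M^\top\varepsilon_M)$ sitting at order $p$ when $p\gg S$, consistent with Vershynin's bound.

```python
import numpy as np

# Smallest eigenvalue of eps_M^T eps_M for a p x S Rademacher matrix:
# Vershynin-type bounds predict lambda_min >= (sqrt(p) - C*sqrt(S))^2,
# so for p >> S the S x S Gram matrix is well conditioned at scale ~ p.

rng = np.random.default_rng(2)
p, S = 20_000, 32
eps_M = rng.choice([-1.0, 1.0], size=(p, S))
gram = eps_M.T @ eps_M                          # S x S
lam_min = np.linalg.eigvalsh(gram)[0]           # eigvalsh sorts ascending
print(lam_min, p, (np.sqrt(p) - 2 * np.sqrt(S))**2)  # lam_min is of order p
```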
Lemma 8. Under Assumptions 1 and 5, with probability $1-\frac{2T(L+1)}{m}$ over the randomness of the initialization and $\{\varepsilon_s\}_{s=1}^{S}$, we have $\sum_{t=1}^{T}\mathcal L^{(S)}_{\mathrm{Sq}}\big(y_t,\{\tilde f(\theta^{\star};x_t,\varepsilon_s)\}_{s=1}^{S}\big)=O(1)$.

Proof. Consider any $\theta\in B^{\mathrm{Frob}}_{\rho,\rho_1}(\theta_0)$. We have
$$\mathcal L^{(S)}_{\mathrm{Sq}}\big(y_t,\{\tilde f(\theta;x_t,\varepsilon_s)\}_{s=1}^{S}\big)=\frac1S\sum_{s=1}^{S}\frac12\Big(y_t-f(\theta;x_t)-\frac{c_{\mathrm{reg}}}{m^{1/4}}(\theta-\theta_0)^\top\varepsilon_s\Big)^2$$
$$=\frac12\big(y_t-f(\theta;x_t)\big)^2-\underbrace{\frac{c_{\mathrm{reg}}}{m^{1/4}}\big(y_t-f(\theta;x_t)\big)\,\frac1S\sum_{s=1}^{S}(\theta-\theta_0)^\top\varepsilon_s}_{\mathrm{I}}+\underbrace{\frac{c_{\mathrm{reg}}^2}{2\sqrt m}\,\frac1S\sum_{s=1}^{S}\big((\theta-\theta_0)^\top\varepsilon_s\big)^2}_{\mathrm{II}}.$$
Consider term I, with $\Gamma_s=(\theta-\theta_0)^\top\varepsilon_s$. Since each $\varepsilon_{s,j}$ is $1$ sub-gaussian, $\Gamma_s$ is $\|\theta-\theta_0\|_2$ sub-gaussian. By Hoeffding's inequality, $\mathbb P\big(\big|\frac1S\sum_s\Gamma_s\big|\ge\omega_1\big)\le e^{-S\omega_1^2/(2\|\theta-\theta_0\|_2^2)}$, extended to all $t\in[T]$ by a union bound. Next consider term II. If $\Gamma_s$ is $\sigma_s$ sub-gaussian then $\Gamma_s^2$ is $(\nu^2,\alpha)$ sub-exponential with $\nu=4\sqrt2\sigma_s^2$, $\alpha=4\sigma_s^2$ (Honorio and Jaakkola (2014, Appendix B)), so $\Gamma_s^2$ is $\big(4\sqrt2\|\theta-\theta_0\|_2^2,\,4\|\theta-\theta_0\|_2^2\big)$ sub-exponential. By Bernstein's inequality,
$$\mathbb P\Big(\Big|\frac1S\sum_{s=1}^{S}\big(\Gamma_s^2-\mathbb E\Gamma_s^2\big)\Big|\ge\omega_2\Big)\le\exp\Big(-\min\Big\{\frac{S\omega_2^2}{64\,\|\theta-\theta_0\|_2^4},\frac{S\omega_2}{8\,\|\theta-\theta_0\|_2^2}\Big\}\Big),$$
again extended to all $t\in[T]$ by a union bound. Choosing $\omega_1=\|\theta-\theta_0\|_2$ and $\omega_2=\|\theta-\theta_0\|_2^2$, with probability at least $1-T(e^{-S^2/2}+e^{-S/8})$ for $S\ge16$, we have $\big|\frac1S\sum_s\Gamma_s\big|\le\|\theta-\theta_0\|_2$ and $\big|\frac1S\sum_s(\Gamma_s^2-\mathbb E\Gamma_s^2)\big|\le\|\theta-\theta_0\|_2^2$ for all $t\in[T]$. Observing that $\mathbb E\Gamma_s=0$ and $\mathbb E\Gamma_s^2=\|\theta-\theta_0\|_2^2$, and using $\lambda_{\mathrm{Sq}}=|y_t-f(\theta;x_t)|=\Theta(1)$ with probability $1-\frac{2T(L+1)}{m}$ (see the proof of Lemma 5), we get with probability at least $1-T(e^{-S^2/2}+e^{-S/8})-\frac{2T(L+1)}{m}$,
$$\mathcal L^{(S)}_{\mathrm{Sq}}\big(y_t,\{\tilde f(\theta;x_t,\varepsilon_s)\}_{s=1}^{S}\big)\ \le\ \ell_{\mathrm{Sq}}\big(y_t,f(\theta;x_t)\big)+\frac{c_{\mathrm{reg}}^2}{\sqrt m}\,\|\theta-\theta_0\|_2^2+\frac{2c_{\mathrm{reg}}\lambda_{\mathrm{Sq}}}{m^{1/4}}\,\|\theta-\theta_0\|_2.\qquad(38)$$
Summing over $t$ and taking the minimum over $\theta\in B^{\mathrm{Frob}}_{\rho,\rho_1}(\theta_0)$ on the left-hand side, for any $\theta\in B^{\mathrm{Frob}}_{\rho,\rho_1}(\theta_0)$,
$$\min_{\theta\in B^{\mathrm{Frob}}_{\rho,\rho_1}(\theta_0)}\sum_{t=1}^{T}\mathcal L^{(S)}_{\mathrm{Sq}}\big(y_t,\{\tilde f(\theta;x_t,\varepsilon_s)\}_{s=1}^{S}\big)\ \le\ \sum_{t=1}^{T}\ell_{\mathrm{Sq}}\big(y_t,f(\theta;x_t)\big)+T\Big(\frac{c_{\mathrm{reg}}^2}{\sqrt m}\|\theta-\theta_0\|_2^2+\frac{2c_{\mathrm{reg}}\lambda_{\mathrm{Sq}}}{m^{1/4}}\|\theta-\theta_0\|_2\Big).$$
From Theorem E.1 we know there exists $\tilde\theta\in B^{\mathrm{Frob}}_{\rho,\rho_1}(\theta_0)$ such that with probability at least $\big(1-\frac{2(L+1)}{m}\big)$ over the randomness of initialization, $f(\tilde\theta;x_t)=y_t$ for any set of $y_t\in[0,1]$, $t\in[T]$, which implies $\ell_{\mathrm{Sq}}\big(y_t,f(\tilde\theta;x_t)\big)=0$ for all $t\in[T]$. Therefore
$$\sum_{t=1}^{T}\mathcal L^{(S)}_{\mathrm{Sq}}\big(y_t,\{\tilde f(\theta^{\star};x_t,\varepsilon_s)\}_{s=1}^{S}\big)\ \overset{(a)}{\le}\ T\Big(\frac{c_{\mathrm{reg}}^2}{\sqrt m}(L\rho+\rho_1)^2+\frac{2c_{\mathrm{reg}}\lambda_{\mathrm{Sq}}}{m^{1/4}}(L\rho+\rho_1)\Big)\ \overset{(b)}{=}\ O(1),\qquad(39)$$
where (a) follows because $\tilde\theta\in B^{\mathrm{Frob}}_{\rho,\rho_1}(\theta_0)$ implies $\tilde\theta\in B^{\mathrm{Euc}}_{L\rho+\rho_1}(\theta_0)$, and (b) follows by choosing $m=\Omega(T^5L/\lambda_0^6)$.

C.3 PROOF OF THEOREM 3.3

Theorem 3.3 (Regret Bound for KL Loss). Under Assumptions 1, 4 and 5, for $y_t\in[z,1-z]$ with $0<z<1$, with appropriate choices of the step-size sequence $\{\eta_t\}$, width $m$, and perturbation constant $c_p$, with high probability over the randomness of initialization and $\{\varepsilon_s\}_{s=1}^{S}$, the regret of projected OGD with loss $\mathcal L^{(S)}_{\mathrm{KL}}\big(y_t,\{\sigma(\tilde f(\theta;x_t,\varepsilon_s))\}_{s=1}^{S}\big)=\frac1S\sum_{s=1}^{S}\ell_{\mathrm{KL}}\big(y_t,\sigma(\tilde f(\theta;x_t,\varepsilon_s))\big)$, $S=\Theta(\log m)$, and projection ball $B^{\mathrm{Frob}}_{\rho,\rho_1}(\theta_0)$ with $\rho=\Theta(\sqrt T/\lambda_0)$ and $\rho_1=\Theta(1)$ is $R_{\mathrm{KL}}(T)=O(\log T)$.

Proof. The proof follows along the same lines as the previous subsection and consists of four steps.

1. The binary KL loss is Lipschitz, strongly convex and smooth w.r.t. the output. We show that $\ell_{\mathrm{KL}}(y_t,\hat y_t)$ is Lipschitz, strongly convex and smooth with respect to the output $\hat y_t=\sigma(\tilde f(\theta;x_t,\varepsilon))$ inside $B^{\mathrm{Frob}}_{\rho,\rho_1}(\theta_0)$, almost surely over the randomness of initialization and $\varepsilon$. We use $\lambda_{\mathrm{KL}}$, $a_{\mathrm{KL}}$ and $b_{\mathrm{KL}}$ to denote the Lipschitz, strong convexity and smoothness parameters of $\ell_{\mathrm{KL}}(y_t,\hat y_t)$ (c.f. Assumption 3).

Lemma 9. For $\theta\in B^{\mathrm{Frob}}_{\rho,\rho_1}(\theta_0)$, the loss $\ell_{\mathrm{KL}}\big(y_t,\sigma(\tilde f(\theta;x_t,\varepsilon))\big)$ is Lipschitz, strongly convex and smooth with respect to the output $\tilde f(\theta;x_t,\varepsilon)$ a.s. over the randomness of initialization and $\varepsilon$. Further, the parameters $\lambda_{\mathrm{KL}}$, $a_{\mathrm{KL}}$ and $b_{\mathrm{KL}}$ are $O(1)$.

Note that Lemma 9 implies that $\mathcal L^{(S)}_{\mathrm{KL}}\big(y_t,\{\sigma(\tilde f(\theta;x_t,\varepsilon_s))\}_{s=1}^{S}\big)=\frac1S\sum_{s=1}^{S}\ell_{\mathrm{KL}}\big(y_t,\sigma(\tilde f(\theta;x_t,\varepsilon_s))\big)$ is also Lipschitz, strongly convex and smooth a.s. with respect to each of the outputs $\tilde f(\theta;x_t,\varepsilon_s)$.

2. The average loss at each $t\in[T]$ is almost convex and has a unique minimizer w.r.t. $\theta$. We show that the random perturbation of the output in (10) ensures that $\mathcal L^{(S)}_{\mathrm{KL}}$ is strongly convex with respect to $\theta\in B^{\mathrm{Frob}}_{\rho,\rho_1}(\theta_0)$ at every $t\in[T]$ (with a very small, $O\big(\frac1{\sqrt m}\big)$, strong convexity parameter), which implies 2(a) and 2(c).

Lemma 10. Under Assumption 5 and with $c_p^2=\frac{8\lambda_{\mathrm{KL}}C_H}{q(1-q)}$, with probability $1-\frac{2T(L+2)}{m}$ over the randomness of the initialization and $\{\varepsilon_s\}_{s=1}^{S}$, $\mathcal L^{(S)}_{\mathrm{KL}}\big(y_t,\{\sigma(\tilde f(\theta;x_t,\varepsilon_s))\}_{s=1}^{S}\big)$ is $\nu$-strongly convex with respect to $\theta\in B^{\mathrm{Frob}}_{\rho,\rho_1}(\theta_0)$, where $\nu=O\big(\frac1{\sqrt m}\big)$.

3. The average loss at each $t\in[T]$ satisfies the QG condition. We show that $\mathcal L^{(S)}_{\mathrm{KL}}$ satisfies the QG condition (with QG constant $\mu=O(1)$) with high probability over the randomness of initialization and $\{\varepsilon_s\}_{s=1}^{S}$.

Lemma 11. Under Assumption 5, with probability $1-\frac{CT(L+1)}{m}$ for some absolute constant $C>0$, $\mathcal L^{(S)}_{\mathrm{KL}}\big(y_t,\{\sigma(\tilde f(\theta;x_t,\varepsilon_s))\}_{s=1}^{S}\big)$ satisfies the QG condition over the randomness of the initialization and $\{\varepsilon_s\}_{s=1}^{S}$, with constant $\mu=O(1)$.

4. Bounding the final regret. The above three steps ensure that all the assumptions of Theorem 3.1 are satisfied by the loss $\mathcal L^{(S)}_{\mathrm{KL}}$. Taking a union bound over the events in the three steps and invoking Theorem 3.1, with probability $1-\frac{CTL}{m}$ for some absolute constant $C>0$,
$$\sum_{t=1}^{T}\mathcal L^{(S)}_{\mathrm{KL}}\big(y_t,\{\sigma(\tilde f(\theta_t;x_t,\varepsilon_s))\}_{s=1}^{S}\big)-\sum_{t=1}^{T}\mathcal L^{(S)}_{\mathrm{KL}}\big(y_t,\{\sigma(\tilde f(\theta^{\star};x_t,\varepsilon_s))\}_{s=1}^{S}\big)\ \le\ O(\log T),\qquad(40)$$
where $\theta^{\star}=\arg\min_{\theta\in B^{\mathrm{Frob}}_{\rho,\rho_1}(\theta_0)}\sum_{t=1}^{T}\mathcal L^{(S)}_{\mathrm{KL}}\big(y_t,\{\sigma(\tilde f(\theta;x_t,\varepsilon_s))\}_{s=1}^{S}\big)$. Next we bound $\sum_{t=1}^{T}\mathcal L^{(S)}_{\mathrm{KL}}\big(y_t,\{\sigma(\tilde f(\theta^{\star};x_t,\varepsilon_s))\}_{s=1}^{S}\big)$ using the following lemma.

Lemma 12. Under Assumptions 1 and 5, with probability $1-\frac{CTL}{m}$ for some absolute constant $C>0$ over the randomness of the initialization, we have $\sum_{t=1}^{T}\mathcal L^{(S)}_{\mathrm{KL}}\big(y_t,\{\sigma(\tilde f(\theta^{\star};x_t,\varepsilon_s))\}_{s=1}^{S}\big)=O(1)$.

Using Lemma 12 and (40), $\sum_{t=1}^{T}\mathcal L^{(S)}_{\mathrm{KL}}\big(y_t,\{\sigma(\tilde f(\theta_t;x_t,\varepsilon_s))\}_{s=1}^{S}\big)\le O(\log T)$, which implies $R_{\mathrm{KL}}(T)\le O(\log T)$.
Next we prove Lemmas 9, 10, 11 and 12.

Lemma 9. For $\theta\in B^{\mathrm{Frob}}_{\rho,\rho_1}(\theta_0)$, the loss $\ell_{\mathrm{KL}}\big(y_t,\sigma(\tilde f(\theta;x_t,\varepsilon))\big)$ is Lipschitz, strongly convex and smooth with respect to the output $\tilde f(\theta;x_t,\varepsilon)$ a.s. over the randomness of initialization and $\varepsilon$. Further, the parameters $\lambda_{\mathrm{KL}}$, $a_{\mathrm{KL}}$ and $b_{\mathrm{KL}}$ are $O(1)$.

Proof. Consider $\theta\in B^{\mathrm{Frob}}_{\rho,\rho_1}(\theta_0)$. Now,
$$\ell'_{\mathrm{KL}}:=\frac{d\,\ell_{\mathrm{KL}}\big(y_t,\sigma(\tilde f(\theta;x_t,\varepsilon))\big)}{d\,\tilde f(\theta;x_t,\varepsilon)}=-\frac{y_t\,\sigma(\tilde f)\big(1-\sigma(\tilde f)\big)}{\sigma(\tilde f)}+\frac{(1-y_t)\,\sigma(\tilde f)\big(1-\sigma(\tilde f)\big)}{1-\sigma(\tilde f)}=-y_t+y_t\sigma(\tilde f)+\sigma(\tilde f)-y_t\sigma(\tilde f)=\sigma\big(\tilde f(\theta;x_t,\varepsilon)\big)-y_t,$$
writing $\tilde f$ for $\tilde f(\theta;x_t,\varepsilon)$. It follows that $\lambda_{\mathrm{KL}}\le2$ a.s., since $\sigma(\tilde f),y_t\in[0,1]$. Further, $0<\ell''_{\mathrm{KL}}=\sigma(\tilde f)\big(1-\sigma(\tilde f)\big)\le\frac14\le\frac12$ a.s., which completes the proof.
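The derivative identity $\ell'_{\mathrm{KL}}=\sigma(\tilde f)-y_t$ is easy to confirm numerically; the following finite-difference check is a small sketch with illustrative values.

```python
import numpy as np

# Finite-difference check of the identity used in Lemma 9:
#   d/df  l_KL(y, sigma(f)) = sigma(f) - y,
# with l_KL(y, q) = y*log(y/q) + (1-y)*log((1-y)/(1-q)).

def sigma(f):
    return 1.0 / (1.0 + np.exp(-f))

def l_kl(y, f):
    q = sigma(f)
    return y * np.log(y / q) + (1 - y) * np.log((1 - y) / (1 - q))

y, f, h = 0.3, 0.7, 1e-6
fd = (l_kl(y, f + h) - l_kl(y, f - h)) / (2 * h)   # central difference
print(fd, sigma(f) - y)                            # both ~ 0.368
```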
Lemma 10. Under Assumption 5 and with $c_p^2=\frac{8\lambda_{\mathrm{KL}}C_H}{q(1-q)}$, with probability $1-\frac{2T(L+2)}{m}$ over the randomness of the initialization and $\{\varepsilon_s\}_{s=1}^{S}$, $\mathcal L^{(S)}_{\mathrm{KL}}\big(y_t,\{\sigma(\tilde f(\theta;x_t,\varepsilon_s))\}_{s=1}^{S}\big)$ is $\nu$-strongly convex with respect to $\theta\in B^{\mathrm{Frob}}_{\rho,\rho_1}(\theta_0)$, where $\nu=O\big(\frac1{\sqrt m}\big)$.

Proof. In Lemma 5 we showed that $|\tilde f(\theta;x_t,\varepsilon)|$ is bounded with high probability over the randomness of initialization and $\varepsilon$. Specifically, from (32), with probability at least $1-\frac{2T(L+2)+1}{m}$ over the randomness of the initialization and $\varepsilon$,
$$|\tilde f(\theta;x_t,\varepsilon)|\ \le\ \sqrt{2(1+\rho_1)^2\Big(\gamma^{L}+|\phi(0)|\sum_{i=1}^{L}\gamma^{i-1}\Big)^2+\frac{2c_{\mathrm{reg}}^2}{\sqrt m}(L\rho+\rho_1)^2}+\frac{2c_{\mathrm{reg}}}{m^{1/4}}(L\rho+\rho_1)\sqrt{\ln(mT)}\ =:\ f_{\max}.\qquad(41)$$
Since $m=\Omega(T^5/\lambda_0^6)$, it follows that $f_{\max}=O(1)$. Now, for all $t\in[T]$, with
$$q:=\sigma(-f_{\max}),\qquad(42)$$
with probability at least $1-\frac{2T(L+2)+1}{m}$ we have $\sigma\big(\tilde f(\theta;x_t,\varepsilon)\big)\in[q,1-q]$. Next we show that $\mathcal L^{(S)}_{\mathrm{KL}}$ is strongly convex. Recall that $\mathcal L^{(S)}_{\mathrm{KL}}=\frac1S\sum_{s=1}^{S}\ell_{\mathrm{KL}}\big(y_t,\sigma(\tilde f_s)\big)$, writing $\tilde f_s:=\tilde f(\theta;x_t,\varepsilon_s)$, and therefore
$$\nabla_\theta\mathcal L^{(S)}_{\mathrm{KL}}=\frac1S\sum_{s=1}^{S}\Big[-\frac{y_t\,\sigma'(\tilde f_s)}{\sigma(\tilde f_s)}+\frac{(1-y_t)\,\sigma'(\tilde f_s)}{1-\sigma(\tilde f_s)}\Big]\nabla_\theta\tilde f_s=\frac1S\sum_{s=1}^{S}\big(\sigma(\tilde f_s)-y_t\big)\,\nabla_\theta\tilde f_s,\qquad(43)$$
where we used $\sigma'(\cdot)=\sigma(\cdot)(1-\sigma(\cdot))$. Further, the Hessian of the loss is
$$\nabla^2_\theta\mathcal L^{(S)}_{\mathrm{KL}}=\frac1S\sum_{s=1}^{S}\Big[\sigma'(\tilde f_s)\,\nabla_\theta\tilde f_s\,\nabla_\theta\tilde f_s^\top+\big(\sigma(\tilde f_s)-y_t\big)\nabla^2_\theta\tilde f_s\Big]\ \succeq\ \frac1S\sum_{s=1}^{S}q(1-q)\,\nabla_\theta\tilde f_s\,\nabla_\theta\tilde f_s^\top-\lambda_{\mathrm{KL}}\,\big\|\nabla^2_\theta f(\theta;x_t)\big\|_2\,I.$$
Consider $u\in\mathbb S^{p-1}$, the unit sphere in $\mathbb R^p$. We want to show that $u^\top\nabla^2_\theta\mathcal L^{(S)}_{\mathrm{KL}}\,u>0$. Proceeding exactly as in the proof of Lemma 6, with probability $1-\frac{2T(L+2)}{m}$,
$$u^\top\nabla^2_\theta\mathcal L^{(S)}_{\mathrm{KL}}\,u\ \ge\ q(1-q)\,\frac{c_p^2}{4\sqrt m}-\frac{\lambda_{\mathrm{KL}}C_H}{\sqrt m}\ =\ \frac{\lambda_{\mathrm{KL}}C_H}{\sqrt m},$$
where the last equality follows because $c_p^2=\frac{8\lambda_{\mathrm{KL}}C_H}{q(1-q)}=O(1)$.

Lemma 11. Under Assumption 5, with probability $1-\frac{CT(L+1)}{m}$ for some absolute constant $C>0$, $\mathcal L^{(S)}_{\mathrm{KL}}\big(y_t,\{\sigma(\tilde f(\theta;x_t,\varepsilon_s))\}_{s=1}^{S}\big)$ satisfies the QG condition over the randomness of the initialization and $\{\varepsilon_s\}_{s=1}^{S}$, with constant $\mu=O(1)$.

Proof. Recall from (43) that $\nabla_\theta\mathcal L^{(S)}_{\mathrm{KL}}=\frac1S\sum_{s=1}^{S}\big(\sigma(\tilde f_s)-y_t\big)\nabla_\theta\tilde f_s$, and therefore
$$\big\|\nabla_\theta\mathcal L^{(S)}_{\mathrm{KL}}\big\|_2^2=\frac1{S^2}\sum_{s,s'=1}^{S}\big(\sigma(\tilde f_s)-y_t\big)\big(\sigma(\tilde f_{s'})-y_t\big)\big\langle\nabla_\theta\tilde f_s,\nabla_\theta\tilde f_{s'}\big\rangle=\frac1{S^2}\big(\tilde F(\theta;x_t)-y_t\mathbf 1_S\big)^\top K(\theta,x_t)\big(\tilde F(\theta;x_t)-y_t\mathbf 1_S\big),$$
where now $(\tilde F(\theta;x_t))_s=\sigma\big(\tilde f(\theta;x_t,\varepsilon_s)\big)$, $\mathbf 1_S$ is the $S$-dimensional all-ones vector, and $K(\theta,x_t)_{s,s'}=\langle\nabla_\theta\tilde f_s,\nabla_\theta\tilde f_{s'}\rangle$. As in the proof of Lemma 7, with probability $1-\frac{CT(L+1)}{m}$ for some absolute constant $C>0$,
$$\frac1{S^2}\big(\tilde F(\theta;x_t)-y_t\mathbf 1_S\big)^\top K(\theta,x_t)\big(\tilde F(\theta;x_t)-y_t\mathbf 1_S\big)\ \ge\ 2\mu\,\mathcal L^{(S)}_{\mathrm{Sq}}\big(y_t,\{\sigma(\tilde f_s)\}_{s=1}^{S}\big),\qquad(44)$$
with $\mu=\frac1{128}$ and $\mathcal L^{(S)}_{\mathrm{Sq}}\big(y_t,\{\sigma(\tilde f_s)\}_{s=1}^{S}\big)=\frac1S\sum_{s=1}^{S}\ell_{\mathrm{Sq}}\big(y_t,\sigma(\tilde f_s)\big)$. Now, using the reverse Pinsker inequality (see e.g. Sason (2015), eq. (10)),
$$D_{\mathrm{KL}}\big(y_t\,\|\,\sigma(\tilde f_s)\big)\ \le\ \frac{1}{2q}\,\ell_{\mathrm{Sq}}\big(y_t,\sigma(\tilde f_s)\big),$$
where $D_{\mathrm{KL}}(p\|p')$ is the (binary) KL divergence between $p$ and $p'$, and $\sigma(\tilde f_s)\in[q,1-q]$ holds with probability at least $1-\frac{2T(L+2)+1}{m}$. Using this, with probability at least $1-\frac{CTL}{m}$ for some absolute constant $C>0$,
$$\mathcal L^{(S)}_{\mathrm{KL}}\big(y_t,\{\sigma(\tilde f(\theta;x_t,\varepsilon_s))\}\big)-\mathcal L^{(S)}_{\mathrm{KL}}\big(y_t,\{\sigma(\tilde f(\theta^{\star};x_t,\varepsilon_s))\}\big)\ \le\ \frac1S\sum_{s=1}^{S}D_{\mathrm{KL}}\big(y_t\,\|\,\sigma(\tilde f(\theta;x_t,\varepsilon_s))\big)\ \le\ \frac{1}{2q}\,\mathcal L^{(S)}_{\mathrm{Sq}}\big(y_t,\{\sigma(\tilde f_s)\}_{s=1}^{S}\big).$$
Combining with (44), with probability at least $1-\frac{CTL}{m}$ over the randomness of the initialization and $\{\varepsilon_s\}_{s=1}^{S}$,
$$\big\|\nabla_\theta\mathcal L^{(S)}_{\mathrm{KL}}\big\|_2^2\ \ge\ 4q\mu\Big(\mathcal L^{(S)}_{\mathrm{KL}}\big(y_t,\{\sigma(\tilde f(\theta;x_t,\varepsilon_s))\}\big)-\mathcal L^{(S)}_{\mathrm{KL}}\big(y_t,\{\sigma(\tilde f(\theta^{\star};x_t,\varepsilon_s))\}\big)\Big).$$
Therefore $\mathcal L^{(S)}_{\mathrm{KL}}$ satisfies the QG condition with $\mu=O(1)$.

Lemma 12. Under Assumptions 1 and 5, with probability $1-\frac{CTL}{m}$ for some absolute constant $C>0$ over the randomness of the initialization, we have $\sum_{t=1}^{T}\mathcal L^{(S)}_{\mathrm{KL}}\big(y_t,\{\sigma(\tilde f(\theta^{\star};x_t,\varepsilon_s))\}_{s=1}^{S}\big)=O(1)$.

Proof. Recall from the proof of Lemma 11 that, by the reverse Pinsker inequality, with probability at least $1-\frac{CTL}{m}$ for some absolute constant $C>0$,
$$\mathcal L^{(S)}_{\mathrm{KL}}\big(y_t,\{\sigma(\tilde f(\theta;x_t,\varepsilon_s))\}\big)\ \le\ \frac{1}{2q}\,\mathcal L^{(S)}_{\mathrm{Sq}}\big(y_t,\{\sigma(\tilde f(\theta;x_t,\varepsilon_s))\}\big).$$
Therefore
$$\sum_{t=1}^{T}\mathcal L^{(S)}_{\mathrm{KL}}\big(y_t,\{\sigma(\tilde f(\theta^{\star};x_t,\varepsilon_s))\}\big)\ \le\ \frac{1}{2q}\sum_{t=1}^{T}\mathcal L^{(S)}_{\mathrm{Sq}}\big(y_t,\{\sigma(\tilde f(\theta^{\star};x_t,\varepsilon_s))\}\big)\ =\ O(1),$$
where the last step follows as in (39) in the proof of Lemma 8. Finally, we use Theorem E.1 to conclude that there exists $\tilde\theta\in B^{\mathrm{Frob}}_{\rho,\rho_1}(\theta_0)$ such that with probability at least $1-\frac{2T(L+1)}{m}$ over the randomness of initialization, $\sigma\big(f(\tilde\theta;x_t)\big)=y_t$ for any set of $y_t\in[q,1-q]$, $t\in[T]$, where $q$ is as defined in (42). Without loss of generality assume $z=q$ (otherwise the predictor can be changed to $\hat y_t=\sigma\big(k_zf(\theta;x_t)\big)$ for some constant $k_z$ depending on $z$), and therefore with probability at least $1-\frac{CTL}{m}$ for some absolute constant $C>0$, $\min_{\theta\in B^{\mathrm{Frob}}_{\rho,\rho_1}(\theta_0)}\sum_{t=1}^{T}\ell_{\mathrm{KL}}\big(y_t,\sigma(f(\theta;x_t))\big)=0$.
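The reverse-Pinsker step used in Lemmas 11 and 12 can be illustrated numerically: when both arguments are bounded away from $\{0,1\}$, the binary KL divergence is within a constant factor (depending on the margin $z$) of the square loss. The snippet below is a rough empirical check on a grid, not the exact constant of Sason (2015).

```python
import numpy as np

# Empirical illustration of the reverse-Pinsker step: for y, q in [z, 1-z],
# the binary KL divergence D(y || q) is dominated by a constant multiple of
# the square loss 0.5*(y - q)^2, with the constant depending on z.

def d_kl(y, q):
    return y * np.log(y / q) + (1 - y) * np.log((1 - y) / (1 - q))

z = 0.1
ys = np.linspace(z, 1 - z, 81)
qs = np.linspace(z, 1 - z, 81)
ratio = max(d_kl(y, q) / (0.5 * (y - q)**2)
            for y in ys for q in qs if abs(y - q) > 1e-9)
print(ratio, 1 / (z * (1 - z)))   # the ratio stays bounded on this grid
```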
C.4 AUXILIARY LEMMAS

With a slight abuse of notation, we write $z_t=(x_t,y_t)$ and $L(\theta;z_t)=\ell\big(y_t,f(\theta;x_t)\big)$.

Lemma 13 (Almost Convexity of Loss). Under the conditions of Lemma 1, with high probability, for all $\theta'\in B^{\mathrm{Frob}}_{\rho,\rho_1}(\theta_0)$,
$$L(\theta';z_t)\ \ge\ L(\theta_t;z_t)+\big\langle\theta'-\theta_t,\nabla_\theta L(\theta_t;z_t)\big\rangle+\frac a2\big\langle\theta'-\theta_t,\nabla_\theta f(\theta';x_t)\big\rangle^2-\epsilon_t,\qquad(45)$$
where $\epsilon_t=\frac{4\big(4a\varrho(L\rho+\rho_1)+\lambda\big)c_H(L\rho+\rho_1)^2}{\sqrt m}$, $c_H$ and $\varrho$ are as in Theorem 1, and $\lambda$, $a$ are as in Assumption 3 (Lipschitz, strongly convex, smooth).

Proof. Consider any $\theta'\in B^{\mathrm{Frob}}_{\rho,\rho_1}(\theta_0)$. By a second-order Taylor expansion around $\theta_t$,
$$L(\theta';z_t)=L(\theta_t;z_t)+\big\langle\theta'-\theta_t,\nabla_\theta L(\theta_t;z_t)\big\rangle+\frac12(\theta'-\theta_t)^\top\frac{\partial^2L(\tilde\theta_t;z_t)}{\partial\theta^2}(\theta'-\theta_t),$$
where $\tilde\theta_t=\xi\theta'+(1-\xi)\theta_t$ for some $\xi\in[0,1]$. Since $\theta_t,\theta'\in B^{\mathrm{Frob}}_{\rho,\rho_1}(\theta_0)$, for every layer $l$,
$$\big\|\mathrm{vec}(\tilde W^{(l)}_t)-\mathrm{vec}(W^{(l)}_0)\big\|_2\le\xi\big\|\mathrm{vec}(W'^{(l)})-\mathrm{vec}(W^{(l)}_0)\big\|_2+(1-\xi)\big\|\mathrm{vec}(W^{(l)}_t)-\mathrm{vec}(W^{(l)}_0)\big\|_2\le\xi\rho+(1-\xi)\rho=\rho,$$
and similarly $\|\tilde v_t-v_0\|_2\le\xi\|v'-v_0\|_2+(1-\xi)\|v_t-v_0\|_2\le\rho_1$, so that $\tilde\theta_t\in B^{\mathrm{Frob}}_{\rho,\rho_1}(\theta_0)$, which implies $\tilde\theta_t\in B^{\mathrm{Euc}}_{L\rho+\rho_1}(\theta_0)$. Focusing on the quadratic form in the Taylor expansion,
$$(\theta'-\theta_t)^\top\frac{\partial^2L(\tilde\theta_t;z_t)}{\partial\theta^2}(\theta'-\theta_t)=\underbrace{\ell''_t\,\big\langle\theta'-\theta_t,\nabla_\theta f(\tilde\theta_t;x_t)\big\rangle^2}_{I_1}+\underbrace{\ell'_t\,(\theta'-\theta_t)^\top\frac{\partial^2f(\tilde\theta_t;x_t)}{\partial\theta^2}(\theta'-\theta_t)}_{I_2},$$
where $\ell'_t$ and $\ell''_t$ denote the first and second derivatives of $\ell(y_t,\hat y_t)$ w.r.t. $\hat y_t=f(\tilde\theta_t;x_t)$. For $I_1$, using $\ell''_t\ge a$ from Assumption 3,
$$I_1\ \ge\ a\big\langle\theta'-\theta_t,\nabla f(\theta';x_t)\big\rangle^2+2a\big\langle\theta'-\theta_t,\nabla f(\theta';x_t)\big\rangle\big\langle\theta'-\theta_t,\nabla f(\tilde\theta_t;x_t)-\nabla f(\theta';x_t)\big\rangle$$
$$\overset{(a)}{\ge}\ a\big\langle\theta'-\theta_t,\nabla f(\theta';x_t)\big\rangle^2-2a\varrho\,\frac{c_H}{\sqrt m}\,\|\tilde\theta_t-\theta'\|_2\,\|\theta'-\theta_t\|_2^2\ \overset{(b)}{\ge}\ a\big\langle\theta'-\theta_t,\nabla f(\theta';x_t)\big\rangle^2-\frac{16a\varrho c_H(L\rho+\rho_1)^3}{\sqrt m},$$
where (a) follows from Proposition 1, since $\tilde\theta_t\in B^{\mathrm{Frob}}_{\rho,\rho_1}(\theta_0)$ and $\|\nabla f(\theta';x_t)\|_2\le\varrho$, and (b) follows since $\|\tilde\theta_t-\theta'\|_2,\|\theta'-\theta_t\|_2\le2(L\rho+\rho_1)$ by the triangle inequality, because $\theta',\theta_t,\tilde\theta_t\in B^{\mathrm{Euc}}_{L\rho+\rho_1}(\theta_0)$. For $I_2$, with $Q_t=(\theta'-\theta_t)^\top\frac{\partial^2f(\tilde\theta_t;x_t)}{\partial\theta^2}(\theta'-\theta_t)$,
$$|Q_t|\ \le\ \Big\|\frac{\partial^2f(\tilde\theta_t;x_t)}{\partial\theta^2}\Big\|_2\,\|\theta'-\theta_t\|_2^2\ \le\ \frac{4c_H(L\rho+\rho_1)^2}{\sqrt m},$$
since $\|\theta'-\theta_t\|_2\le2(L\rho+\rho_1)$. Further, since $|\ell'_t|\le\lambda$ by Assumption 3, $I_2\ge-|\ell'_t||Q_t|\ge-\frac{4(L\rho+\rho_1)^2c_H\lambda}{\sqrt m}$. Putting the lower bounds on $I_1$ and $I_2$ together,
$$(\theta'-\theta_t)^\top\frac{\partial^2L(\tilde\theta_t;z_t)}{\partial\theta^2}(\theta'-\theta_t)\ \ge\ a\big\langle\theta'-\theta_t,\nabla f(\theta';x_t)\big\rangle^2-\frac{4\big(4a\varrho(L\rho+\rho_1)+\lambda\big)c_H(L\rho+\rho_1)^2}{\sqrt m}.$$
That completes the proof.

Lemma 14. Under Assumption 5 and with $c_{\mathrm{reg}}=\sqrt{8\lambda_{\mathrm{Sq}}C_H}$, with probability $1-\frac{2T(L+1)}{m}$ over the randomness of the initialization, the expected loss $\mathbb E_{\varepsilon}\,\ell_{\mathrm{Sq}}\big(y_t,\tilde f(\theta;x_t,\varepsilon)\big)$ is $\nu$-strongly convex with respect to $\theta\in B^{\mathrm{Frob}}_{\rho,\rho_1}(\theta_0)$, where $\nu=O\big(\frac1{\sqrt m}\big)$.

Proof. From (10) we have, a.s.,
$$\nabla_\theta\tilde f(\theta;x_t,\varepsilon)=\nabla_\theta f(\theta;x)+\frac{c_{\mathrm{reg}}}{m^{1/4}}\sum_ie_i\varepsilon_i,\qquad(46)$$
$$\nabla^2_\theta\tilde f(\theta;x_t,\varepsilon)=\nabla^2_\theta f(\theta;x).\qquad(47)$$
Next, with $\ell'_t=\big(\tilde f(\theta;x_t,\varepsilon)-y_t\big)$ and $\ell''_t=1$,
$$\nabla^2_\theta\,\ell_{\mathrm{Sq}}\big(y_t,\tilde f(\theta;x_t,\varepsilon)\big)=\nabla_\theta\tilde f\,\nabla_\theta\tilde f^\top+\ell'_t\,\nabla^2_\theta f(\theta;x_t),$$
using $\nabla^2_\theta\tilde f=\nabla^2_\theta f$ from (47). Taking expectation with respect to $\varepsilon$ and using (46),
$$\nabla^2_\theta\,\mathbb E_{\varepsilon}\,\ell_{\mathrm{Sq}}\big(y_t,\tilde f(\theta;x_t,\varepsilon)\big)=\mathbb E_{\varepsilon}\big[\nabla_\theta\tilde f\,\nabla_\theta\tilde f^\top\big]+\mathbb E_{\varepsilon}\big[\ell'_t\big]\nabla^2_\theta f(\theta;x_t)=\nabla_\theta f\,\nabla_\theta f^\top+\frac{c_{\mathrm{reg}}^2}{\sqrt m}\,I+\big(f(\theta;x_t)-y_t\big)\nabla^2_\theta f(\theta;x_t),$$
where the last equality follows from $\mathbb E[\varepsilon_i]=0$, $\mathbb E[\varepsilon_i\varepsilon_j]=0$ for $i\ne j$, and $\mathbb E[\varepsilon_i\varepsilon_j]=1$ for $i=j$, so that $\sum_ie_ie_i^\top=I$, the identity matrix. Now consider any $u\in\mathbb R^p$ with $\|u\|=1$. We want to show that $u^\top\nabla^2_\theta\,\mathbb E_{\varepsilon}\ell_{\mathrm{Sq}}\,u>0$:
$$u^\top\nabla^2_\theta\,\mathbb E_{\varepsilon}\ell_{\mathrm{Sq}}\,u=\big\langle u,\nabla_\theta f(\theta;x_t)\big\rangle^2+\frac{c_{\mathrm{reg}}^2}{\sqrt m}+\mathbb E_{\varepsilon}\big[\ell'_t\big]u^\top\nabla^2_\theta f(\theta;x_t)u\ \overset{(a)}{\ge}\ \frac{c_{\mathrm{reg}}^2}{\sqrt m}-\frac{\lambda_{\mathrm{Sq}}C_H}{\sqrt m}\ \overset{(b)}{\ge}\ \frac{\lambda_{\mathrm{Sq}}C_H}{\sqrt m},$$
where (a) uses $|\ell'_t|\le\lambda_{\mathrm{Sq}}$ and $\|\nabla^2_\theta f(\theta;x_t)\|_2\le\frac{C_H}{\sqrt m}$, which holds with probability $1-\frac{2(L+1)}{m}$ by Proposition 1, and (b) uses $c_{\mathrm{reg}}^2=8\lambda_{\mathrm{Sq}}C_H\ge2\lambda_{\mathrm{Sq}}C_H$. Finally, with a union bound over $t\in[T]$, and noting that $u$ was arbitrary, we conclude that with probability $1-\frac{2T(L+1)}{m}$ over the randomness of initialization, $\mathbb E_{\varepsilon}\,\ell_{\mathrm{Sq}}\big(y_t,\tilde f(\theta;x_t,\varepsilon)\big)$ is $\nu$-strongly convex with $\nu=\frac{\lambda_{\mathrm{Sq}}C_H}{\sqrt m}$.
D PROOF OF RESULTS FOR CONTEXTUAL BANDITS (SECTION 4)

Theorem 4.1 (Regret bound for Neu SquareCB). Under Assumptions 6 and 5, with appropriate choices of the parameter $\gamma$, step-size sequence $\{\eta_t\}$, width $m$, and regularization parameter $c_p$, with high probability over the randomness in the initialization and $\{\varepsilon_s\}_{s=1}^{S}$, the regret of Neu SquareCB with $\rho=\Theta(\sqrt T/\lambda_0)$, $\rho_1=\Theta(1)$ satisfies $\mathrm{Reg}_{\mathrm{CB}}(T)\le O\big(\sqrt{KT\log T}\big)$.

Proof. Choosing $\delta=\frac1T$ and $\gamma=\sqrt{KT/\big(R_{\mathrm{Sq}}(T)+\log(2T)\big)}$ in Theorem 1 of Foster and Rakhlin (2020), we get
$$\mathbb E\big[\mathrm{Reg}_{\mathrm{CB}}(T)\big]\ \le\ 4\sqrt{KT\,R_{\mathrm{Sq}}(T)}+8\sqrt{KT\ln(2T)}+1.$$
Using Theorem 3.2, i.e., $R_{\mathrm{Sq}}(T)=O(\log T)$, we have with probability at least $1-\frac{2LC}{m}$ for some absolute constant $C>0$,
$$\mathbb E\big[\mathrm{Reg}_{\mathrm{CB}}(T)\big]\ \le\ O\big(\sqrt{KT\log T}\big)+8\sqrt{KT\ln(2T)}+1,$$
which completes the proof.

Theorem 4.2 (Regret bound for Neu FastCB). Under Assumptions 6 and 5, with appropriate choices of the parameter $\gamma$, step-size sequence $\{\eta_t\}$, width $m$, and regularization parameter $c_p$, with high probability over the randomness in the initialization and $\{\varepsilon_s\}_{s=1}^{S}$, the regret of Neu FastCB with $\rho=\Theta(\sqrt T/\lambda_0)$, $\rho_1=\Theta(1)$ satisfies $\mathrm{Reg}_{\mathrm{CB}}(T)\le \widetilde O\big(\sqrt{L^{\ast}K}+K\big)$, where $L^{\ast}=\sum_{t=1}^{T}y_{t,a_t^{\ast}}$.

Proof. Choosing $\gamma=\max\big(\sqrt{KL^{\ast}/(3R_{\mathrm{KL}}(T))},\,10K\big)$ and using Theorem 1 of Foster and Krishnamurthy (2021), we get
$$\mathbb E\big[\mathrm{Reg}_{\mathrm{CB}}(T)\big]\ \le\ 40\sqrt{L^{\ast}K\,R_{\mathrm{KL}}(T)}+600K\,R_{\mathrm{KL}}(T).$$
Using Theorem 3.3, i.e., $R_{\mathrm{KL}}(T)=O(\log T)$, we have with probability at least $1-\frac{2LC}{m}$ for some absolute constant $C>0$,
$$\mathbb E\big[\mathrm{Reg}_{\mathrm{CB}}(T)\big]\ \le\ O\big(\sqrt{L^{\ast}K\log T}\big)+600K\cdot O(\log T),$$
which completes the proof.

Remark D.1. A keen reader might notice that the reduction in Foster and Rakhlin (2020) requires us to control $R'_{\mathrm{Sq}}(T)=\sum_{t=1}^{T}\ell_{\mathrm{Sq}}(y_t,\hat y_t)-\sum_{t=1}^{T}\ell_{\mathrm{Sq}}\big(y_t,h^{\ast}(x_t)\big)$, whereas our regret guarantee is for $R_{\mathrm{Sq}}(T)$ in (12). However, note that in step 4 of the proof of Theorem 3.2 we argued that $\sum_{t=1}^{T}\mathcal L^{(S)}_{\mathrm{Sq}}\big(y_t,\{\tilde f(\theta^{\star};x_t,\varepsilon_s)\}_{s=1}^{S}\big)=O(1)$, and therefore our regret bound implies $\sum_{t=1}^{T}\ell_{\mathrm{Sq}}(y_t,\hat y_t)\le O(\log T)$ with $\hat y_t=\tilde f^{(S)}\big(\theta_t;x_t,\varepsilon_{(1:S)}\big)$, which immediately implies $R'_{\mathrm{Sq}}(T)\le O(\log T)$. A similar argument applies to the reduction of Foster and Krishnamurthy (2021).
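For concreteness, here is a minimal sketch of the inverse gap weighting step on which the reductions above are built; `scores` stands in for the online regression predictions, the function name is ours, and Neu FastCB would use the re-weighted variant of Foster and Krishnamurthy (2021) rather than this plain form.

```python
import numpy as np

# Sketch of the inverse gap weighting (IGW) step that converts online
# regression scores into an action distribution, as in SquareCB
# (Foster and Rakhlin, 2020); gamma is the exploration parameter.

def igw_distribution(scores, gamma):
    """scores: predicted losses for the K arms (lower is better)."""
    K = len(scores)
    b = int(np.argmin(scores))                 # greedy arm
    p = np.zeros(K)
    for a in range(K):
        if a != b:
            p[a] = 1.0 / (K + gamma * (scores[a] - scores[b]))
    p[b] = 1.0 - p.sum()                       # remaining mass on the greedy arm
    return p

rng = np.random.default_rng(3)
scores = rng.uniform(size=5)
p = igw_distribution(scores, gamma=50.0)
print(p, p.sum())                              # a valid, greedy-leaning distribution
arm = rng.choice(len(p), p=p)                  # sampled action
```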
E INTERPOLATION WITH WIDE NETWORKS (PROOF OF THEOREM E.1)

In this section we show that, under suitable assumptions, wide networks can interpolate any given data. We take $\ell$ to be the squared loss throughout this subsection.

Theorem E.1. Under Assumptions 4 and 5, for any $h:\mathcal X\mapsto[0,1]$ and any set of inputs $x_t\in\mathcal X$, $t\in[T]$, for $f(\theta;x)$ of the form in equation 1, if the width $m=\Omega(T^4)$, then there exists $\tilde\theta\in B^{\mathrm{Frob}}_{\rho,\rho_1}(\theta_0)$ with $\rho=\Theta\big(\frac{\sqrt T}{\lambda_0}\big)$ and $\rho_1=\Theta(1)$ such that with probability at least $\big(1-\frac{2(L+1)}{m}\big)$ we have $f(\tilde\theta;x_t)=h(x_t)$ for all $t\in[T]$; further, there exists $\tilde\theta\in B^{\mathrm{Frob}}_{\rho,\rho_1}(\theta_0)$ such that with probability at least $\big(1-\frac{2(L+1)}{m}\big)$ we have $f(\tilde\theta;x_t)=y_t$ for any set of $y_t\in[0,1]$, $t\in[T]$.

We start with an outline of the overall proof, which has four technical steps; some follow from direct observations, assumptions, or existing results (especially on the NTK), and some require new proofs.

1. It suffices to prove the interpolation result $f(\tilde\theta;x_t)=y_t$ for arbitrary $y_t$, since the interpolation $f(\tilde\theta;x_t)=h(x_t)$ follows as a special case with $y_t=h(x_t)$. Further, we consider $\rho_1=0$, which immediately implies the result for $\rho_1=\Theta(1)$.

2. The interpolation analysis utilizes the fact that the NTK is positive definite at initialization. For simplicity, Assumption 5 (positive definite NTK) takes care of this aspect for Theorem E.1, with $\lambda_0>0$ being the lower bound on the minimum eigenvalue of the NTK.

3. To show the existence of $\tilde\theta$ which interpolates, $f(\tilde\theta;x_t)=y_t$, $t\in[T]$, we in fact show that gradient descent on the cumulative squared loss
$$\hat\ell(\theta)=\sum_{t=1}^{T}\ell\big(y_t,f(\theta;x_t)\big),\qquad(48)$$
with suitably small step size $\eta$ and suitably large width $m$, has geometrically decreasing loss. The decreasing loss, together with the fact that the iterates $\theta_t$ stay within the closed ball $B^{\mathrm{Frob}}_{\rho,\rho_1}(\theta_0)$ with $\rho=\Theta\big(\frac{\sqrt T}{\lambda_0}\big)$, $\rho_1=0$, implies the existence of $\tilde\theta\in B^{\mathrm{Frob}}_{\rho,\rho_1}(\theta_0)$ which interpolates the data.

4. Two key properties need to be maintained as the gradient descent iterations proceed: first, as discussed above, the iterates stay within the ball; and second, the NTK corresponding to every $\theta_t$ must stay positive definite. As we will show, these two properties are coupled, and the geometric decrease of the loss helps in the analysis of both.

Lemma 1 (NTK condition per step). Under Assumptions 5 and 4, for the gradient descent update $\theta_{t+1}=\theta_t-\eta_t\nabla\hat\ell(\theta_t)$ on the cumulative square loss $\hat\ell(\theta)=\sum_{n=1}^{T}\big(y_n-f(\theta;x_n)\big)^2$, with $\theta_t,\theta_{t+1}\in B^{\mathrm{Frob}}_{\rho,\rho_1}(\theta_0)$, $\rho=\Theta\big(\frac{\sqrt T}{\lambda_0}\big)$, $\rho_1=0$, with probability at least $1-\frac{2(L+1)}{m}$ over the initialization of the model,
$$\lambda_{\min}\big(K_{\mathrm{NTK}}(\theta_{t+1})\big)\ \ge\ \lambda_{\min}\big(K_{\mathrm{NTK}}(\theta_t)\big)-\frac{4c_H\varrho^2T\eta_t}{\sqrt m}\sqrt{\hat\ell(\theta_t)},\qquad(49)$$
where $c_H$ and $\varrho$ are as in Lemma 1.

Proof. Observe that $K_{\mathrm{NTK}}(\theta)=J(\theta)J(\theta)^\top$, where the Jacobian
$$J(\theta)=\begin{bmatrix}\frac{\partial f(\theta;x_1)}{\partial\,\mathrm{vec}(W^{(1)})^\top}&\cdots&\frac{\partial f(\theta;x_1)}{\partial\,\mathrm{vec}(W^{(L)})^\top}\\[2pt]\vdots&\ddots&\vdots\\[2pt]\frac{\partial f(\theta;x_T)}{\partial\,\mathrm{vec}(W^{(1)})^\top}&\cdots&\frac{\partial f(\theta;x_T)}{\partial\,\mathrm{vec}(W^{(L)})^\top}\end{bmatrix}\in\mathbb R^{T\times(md+Lm^2)},\qquad(50)$$
where the parameter matrices are vectorized and the last layer $W^{(L+1)}=v_0$ is ignored, since we do not run gradient descent on the last layer and it is kept fixed. The spectral norm of the change in the NTK is
$$\|K_{\mathrm{NTK}}(\theta_{t+1})-K_{\mathrm{NTK}}(\theta_t)\|_2=\big\|J(\theta_{t+1})\big(J(\theta_{t+1})-J(\theta_t)\big)^\top+\big(J(\theta_{t+1})-J(\theta_t)\big)J(\theta_t)^\top\big\|_2\le\big(\|J(\theta_{t+1})\|_2+\|J(\theta_t)\|_2\big)\,\|J(\theta_{t+1})-J(\theta_t)\|_2.\qquad(51)$$
Now, for any $\theta\in B^{\mathrm{Frob}}_{\rho,\rho_1}(\theta_0)$,
$$\|J(\theta)\|_2^2\ \le\ \|J(\theta)\|_F^2=\sum_{n=1}^{T}\|\nabla_\theta f(\theta;x_n)\|_2^2\ \overset{(a)}{\le}\ T\varrho^2,$$
where (a) follows by Lemma 1. Since $\theta_t,\theta_{t+1}\in B^{\mathrm{Frob}}_{\rho,\rho_1}(\theta_0)$, we have $\|J(\theta_t)\|_2,\|J(\theta_{t+1})\|_2\le\sqrt T\varrho$, so that (51) gives
$$\|K_{\mathrm{NTK}}(\theta_{t+1})-K_{\mathrm{NTK}}(\theta_t)\|_2\ \le\ 2\sqrt T\varrho\,\|J(\theta_{t+1})-J(\theta_t)\|_2.\qquad(52)$$
Next, note that
$$\|J(\theta_{t+1})-J(\theta_t)\|_2\ \le\ \|J(\theta_{t+1})-J(\theta_t)\|_F\ =\ \Big(\sum_{n=1}^{T}\|\nabla_\theta f(\theta_{t+1};x_n)-\nabla_\theta f(\theta_t;x_n)\|_2^2\Big)^{1/2}\qquad(53,\,54)$$
$$\overset{(a)}{\le}\Big(\sum_{n=1}^{T}\Big\|\frac{\partial^2f(\tilde\theta_t;x_n)}{\partial\theta^2}\Big\|_2^2\,\|\theta_{t+1}-\theta_t\|_2^2\Big)^{1/2}\overset{(b)}{\le}\frac{\sqrt Tc_H}{\sqrt m}\,\|\theta_{t+1}-\theta_t\|_2\overset{(c)}{=}\frac{\sqrt Tc_H}{\sqrt m}\,\eta_t\,\|\nabla\hat\ell(\theta_t)\|_2\overset{(d)}{\le}\frac{2c_H\varrho\sqrt T\eta_t}{\sqrt m}\sqrt{\hat\ell(\theta_t)},\qquad(55)$$
where (a) follows from the mean value theorem with $\tilde\theta_t=(1-\xi)\theta_t+\xi\theta_{t+1}$ for some $\xi\in[0,1]$, (b) from Lemma 1 since $\tilde\theta_t\in B^{\mathrm{Frob}}_{\rho,\rho_1}(\theta_0)$, (c) from the gradient descent update, and (d) from Lemma 3. Then, using (52),
$$\|K_{\mathrm{NTK}}(\theta_{t+1})-K_{\mathrm{NTK}}(\theta_t)\|_2\ \le\ \frac{4c_H\varrho^2T\eta_t}{\sqrt m}\sqrt{\hat\ell(\theta_t)}.\qquad(56)$$
Then, by the triangle inequality,
$$\lambda_{\min}\big(K_{\mathrm{NTK}}(\theta_{t+1})\big)\ \ge\ \lambda_{\min}\big(K_{\mathrm{NTK}}(\theta_t)\big)-\|K_{\mathrm{NTK}}(\theta_{t+1})-K_{\mathrm{NTK}}(\theta_t)\|_2\ \overset{(a)}{\ge}\ \lambda_{\min}\big(K_{\mathrm{NTK}}(\theta_t)\big)-\frac{4c_H\varrho^2T\eta_t}{\sqrt m}\sqrt{\hat\ell(\theta_t)},$$
where (a) follows from (56). That completes the proof.

Theorem E.2 (Geometric Convergence: Unknown Desired Loss). Under Assumptions 4 and 5, consider the gradient descent update $\theta_{t+1}=\theta_t-\eta_t\nabla\hat\ell(\theta_t)$ on the cumulative loss $\hat\ell(\theta)=\sum_{n=1}^{T}\big(y_n-f(\theta;x_n)\big)^2$ with step size $\eta_t=\eta<\min\big\{\frac{1}{\beta T},\frac{1}{\lambda_0}\big\}$, with $\beta$ as in Lemma 4. Then, choosing width $m=\Omega(T^3/\lambda_0^4)$ and depth $L=O(1)$, with probability at least $1-\frac{2(L+1)}{m}$, we have $\{\theta_t\}_t\subset B^{\mathrm{Frob}}_{\rho,\rho_1}(\theta_0)$ with $\rho=\Theta\big(\frac{\sqrt T}{\lambda_0}\big)$, $\rho_1=0$, and for every $t$,
$$\hat\ell(\theta_{t+1})\ \le\ (1-\eta\lambda_0)^{t}\,\hat\ell(\theta_0).\qquad(57)$$

Proof.
First note that with probability at least $1-\frac{2(L+1)}{m}$ all the bounds in Theorem 1, Lemma 2 and Corollary 3 hold, and so does Lemma 1, since its proof uses these bounds. Further, for $L=O(1)$ we obtain $c_H=O\big((1+\rho_1)(1+(4\nu_0+\frac{\rho}{\sqrt m})O(1))\big)=O(1)$ following Lemma 1 since $\rho_1=0$, where the last equality follows from $m=\Omega\big(\frac{T^4}{\lambda_0^2}\big)>\Omega\big(\frac{T}{\lambda_0^2}\big)$ and $\rho=\Theta\big(\frac{\sqrt T}{\lambda_0}\big)$. We write $c_H\le c_2$ for a suitable constant $c_2>0$. We also note that $\varrho^2=O\big((1+\frac1{\sqrt m})(1+\rho_1)^2(1+\gamma^{2(L+1)})\big)$; using $L=O(1)$, $\rho=\Theta\big(\frac{\sqrt T}{\lambda_0}\big)$, $\rho_1=0$ and $m=\Omega\big(\frac{T^4}{\lambda_0^2}\big)$, we obtain $\varrho^2=O(1)$, and write $\varrho^2\le c_3$ for a suitable constant $c_3>0$. Finally, observe that $c_{\rho_1,\gamma}=O\big((1+\rho_1^2)(1+\gamma^{L})\big)$ (see the definition in Lemma 2), and from the definition of $\beta$ (as in Lemma 4), $\beta=b\varrho^2+\frac1{\sqrt m}O\big(\mathrm{poly}(L)(1+\gamma^{3L})(1+\rho_1^2)\big)$. In the same fashion as for $c_H$ and $\varrho$, in our setting $\beta=O(1)$ since $\rho_1=0$; we write $\beta\le c_4$ for a suitable constant $c_4>0$.

We now proceed by induction. First, for $t=1$, we show that by the choice of step size $\theta_1\in B^{\mathrm{Frob}}_{\rho,\rho_1}(\theta_0)$ with $\rho_1=0$. Indeed,
$$\|\theta_1-\theta_0\|_2=\eta\,\|\nabla\hat\ell(\theta_0)\|_2\ \overset{(a)}{\le}\ 2\varrho\eta\sqrt{\hat\ell(\theta_0)}\ \overset{(b)}{\le}\ 2\varrho\eta\sqrt{c_{0,4\nu_0}T}\ \overset{(c)}{\le}\ \rho,$$
where (a) follows from Lemma 3, (b) from Lemma 2, and (c) since $\rho=\Theta\big(\frac{\sqrt T}{\lambda_0}\big)$, $\eta<\frac1{\lambda_0}$, and $\rho_1=0$, so the last layer is not updated. Hence $\theta_1\in B^{\mathrm{Frob}}_{\rho,\rho_1}(\theta_0)$. From the smoothness property in Lemma 4 we further obtain
$$\hat\ell(\theta_1)-\hat\ell(\theta_0)\ \le\ \big\langle\theta_1-\theta_0,\nabla\hat\ell(\theta_0)\big\rangle+\frac{\beta T}{2}\|\theta_1-\theta_0\|_2^2\ \overset{(a)}{=}\ -\eta\Big(1-\frac{\beta\eta T}{2}\Big)\|\nabla\hat\ell(\theta_0)\|_2^2\ \overset{(b)}{\le}\ -\frac\eta2\,\|\nabla\hat\ell(\theta_0)\|_2^2\ \overset{(c)}{=}\ -\frac\eta2\,\ell_0'^{\top}K_{\mathrm{NTK}}(\theta_0)\,\ell_0'\ \overset{(d)}{\le}\ -\frac\eta2\,\lambda_{\min}\big(K_{\mathrm{NTK}}(\theta_0)\big)\|\ell_0'\|_2^2\ \overset{(e)}{\le}\ -\eta\lambda_0\,\hat\ell(\theta_0),$$
so that $\hat\ell(\theta_1)\le(1-\eta\lambda_0)\,\hat\ell(\theta_0)$. Here (a) follows from the gradient descent update; (b) from the choice of step size $\eta\le\frac1{\beta T}$, so that $\big(1-\frac{\beta\eta T}{2}\big)\ge\frac12$; (c) from the identity, valid for any iterate $\theta_t\in\mathbb R^p$,
$$\|\nabla\hat\ell(\theta_t)\|_2^2=\Big\|\sum_{n=1}^{T}\ell'_{t,n}\nabla_\theta f(\theta_t;x_n)\Big\|_2^2=\sum_{n,n'=1}^{T}\ell'_{t,n}\ell'_{t,n'}\big\langle\nabla_\theta f(\theta_t;x_n),\nabla_\theta f(\theta_t;x_{n'})\big\rangle=\ell_t'^{\top}K_{\mathrm{NTK}}(\theta_t)\,\ell_t',$$
where $\ell'_t:=[\ell'_{t,n}]\in\mathbb R^{T}$ with $\ell'_{t,n}=-2\big(y_n-f(\theta_t;x_n)\big)$ and $\ell_{t,n}=\big(y_n-f(\theta_t;x_n)\big)^2$; (d) from the definition of the minimum eigenvalue; and (e) from the identity, valid for any iterate $\theta_t\in\mathbb R^p$,
$$\|\ell'_t\|_2^2=\sum_{n=1}^{T}\ell_{t,n}'^{\,2}=4\sum_{n=1}^{T}\big(y_n-f(\theta_t;x_n)\big)^2=4\,\hat\ell(\theta_t).\qquad(59)$$
Notice that, from our choice of step size $\eta<\frac1{\lambda_0}$, we have $1-\eta\lambda_0\in(0,1)$.

Continuing with the proof by induction, we take the induction hypothesis
$$\hat\ell(\theta_\tau)\ \le\ (1-\eta\lambda_0)^{\tau-1}\,\hat\ell(\theta_0)\qquad(60)$$
and $\theta_\tau\in B^{\mathrm{Frob}}_{\rho,\rho_1}(\theta_0)$ with $\rho_1=0$, for all $\tau\le t$. First, based on the choice of step sizes, $\theta_{t+1}\in B^{\mathrm{Frob}}_{\rho,\rho_1}(\theta_0)$ with $\rho_1=0$: using similar inequalities as in the case $t=1$,
$$\|\theta_{t+1}-\theta_0\|_2\ \le\ \sum_{\tau=0}^{t}\|\theta_{\tau+1}-\theta_\tau\|_2\ =\ \sum_{\tau=0}^{t}\eta\,\|\nabla\hat\ell(\theta_\tau)\|_2\ \overset{(a)}{\le}\ 2\varrho\eta\sqrt{\hat\ell(\theta_0)}\,\sum_{\tau=0}^{t}(1-\eta\lambda_0)^{\tau/2}\ \overset{(b)}{\le}\ \frac{4\varrho\sqrt{\hat\ell(\theta_0)}}{\lambda_0}\ \overset{(c)}{\le}\ \rho,$$
where (a) follows from the induction hypothesis, (b) follows from $\sum_{\tau\ge0}(1-x\lambda_0)^{\tau/2}\le\frac{2}{x\lambda_0}$ for $x<\frac1{\lambda_0}$, and (c) follows since $\rho=\Theta\big(\frac{\sqrt T}{\lambda_0}\big)$ and $\hat\ell(\theta_0)\le c_{0,4\nu_0}T$. Now,
$$\lambda_{\min}\big(K_{\mathrm{NTK}}(\theta_t)\big)\ \overset{(a)}{\ge}\ \lambda_{\min}\big(K_{\mathrm{NTK}}(\theta_0)\big)-\frac{4c_H\varrho^2T\eta}{\sqrt m}\sum_{\tau=0}^{t-1}\sqrt{\hat\ell(\theta_\tau)}\ \overset{(b)}{\ge}\ \lambda_0-\frac{4c_2c_3T\eta}{\sqrt m}\sqrt{\hat\ell(\theta_0)}\,\sum_{\tau=0}^{t-1}(1-\eta\lambda_0)^{\tau/2}$$
$$\ge\ \lambda_0-\frac{8c_2c_3T\sqrt{\hat\ell(\theta_0)}}{\lambda_0\sqrt m}\ \overset{(c)}{\ge}\ \lambda_0-\frac{\bar cT^{3/2}}{\lambda_0\sqrt m}\ \ge\ \frac{\lambda_0}{2},\qquad(61)$$
where (a) follows from Lemma 1 applied repeatedly, (b) from the induction hypothesis, and (c) with $\bar c=8c_2c_3\sqrt{c_{0,\sigma_1}}$; the last inequality holds with $m\ge\frac{16\bar c^2T^3}{\lambda_0^4}$.

Since $\theta_t,\theta_{t+1}\in B^{\mathrm{Frob}}_{\rho,\rho_1}(\theta_0)$ with $\rho_1=0$, we again use the smoothness property and, with the same chain of inequalities as in the case $t=1$ but with $\lambda_{\min}\big(K_{\mathrm{NTK}}(\theta_t)\big)\ge\frac{\lambda_0}{2}$ from (61),
$$\hat\ell(\theta_{t+1})-\hat\ell(\theta_t)\ \le\ -\frac\eta2\,\lambda_{\min}\big(K_{\mathrm{NTK}}(\theta_t)\big)\,\|\ell'_t\|_2^2\ \le\ -\eta\lambda_0\,\hat\ell(\theta_t),$$
so that $\hat\ell(\theta_{t+1})\le(1-\eta\lambda_0)\,\hat\ell(\theta_t)$. That establishes the induction step and completes the proof.
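The geometric decay (57) is easy to visualize on a toy instance. The sketch below runs full-batch gradient descent on a small ReLU network with the last layer frozen, as in the proof; all sizes and the step size are illustrative assumptions, and the decay rate on this toy problem need not match $(1-\eta\lambda_0)$.

```python
import numpy as np

# Toy illustration of the geometric decay in Theorem E.2: full-batch gradient
# descent on sum_n (y_n - f(theta; x_n))^2 for a 2-layer ReLU net with the
# last layer v kept fixed, as in the proof. Sizes and step size illustrative.

rng = np.random.default_rng(4)
T_pts, d, m = 20, 5, 2048
X = rng.normal(size=(T_pts, d)) / np.sqrt(d)
y = rng.uniform(size=T_pts)
W = rng.normal(size=(m, d)) / np.sqrt(d)          # hidden layer (trained)
v = rng.choice([-1.0, 1.0], size=m) / np.sqrt(m)  # last layer (frozen)

eta = 0.05
losses = []
for step in range(1000):
    Z = X @ W.T                                   # pre-activations, (T_pts, m)
    r = np.maximum(Z, 0.0) @ v - y                # residuals
    losses.append(np.sum(r**2))
    G = (Z > 0).astype(float)                     # ReLU gates
    grad_W = 2 * v[:, None] * ((G * r[:, None]).T @ X)   # d loss / d W
    W -= eta * grad_W
print(losses[0], losses[-1], losses[-1] / losses[0])     # loss decays toward 0
```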
F RELATED WORK

Contextual Bandits: The contextual bandit setting with linear losses has received extensive attention (see e.g. Abe et al., 2003; Chu et al., 2011; Abbasi-Yadkori et al., 2011; Agrawal and Goyal, 2013; Ban and He, 2020; 2021). Owing to its remarkable success, Lu and Van Roy (2017); Zahavy and Mannor (2020); Riquelme et al. (2018) adapted neural models to the contextual bandit setting. In these initial works, all but the last layer of a DNN was used as a feature map to transform contexts from the raw input space to a low-dimensional space, and a linear exploration policy was then learned on top of the last hidden layer of the DNN. Although these attempts yielded promising empirical results, no regret guarantees were provided. Subsequently, Zhou et al. (2020) introduced the first neural bandit algorithm with provable regret guarantees, using UCB-based exploration, and Zhang et al. (2021) extended it to the Thompson sampling approach. Both approaches rely on kernel bandits (Valko et al., 2013) and have a linear dependence on the effective dimension $\tilde d$ of the Neural Tangent Kernel (see Allen-Zhu et al., 2019b; Jacot et al., 2018; Cao and Gu, 2019b). Moreover, both algorithms require, at each step, the inversion of a matrix whose size equals the number of parameters in the model. Recently, Ban et al. (2022b) attained a regret bound independent of $\tilde d$, but make distributional assumptions on the contexts. Qi et al. (2022; 2023); Ban et al. (2021; 2022a) show successful applications of neural bandits to recommender systems.

Overparameterized Models: Considerable progress has been made in understanding the expressive power of deep neural networks in the overparameterized regime (Du et al., 2019; Allen-Zhu et al., 2019b;a; Cao and Gu, 2019b; Arora et al., 2019a). It has been shown that the Neural Tangent Kernel stays close to its random initialization throughout the training dynamics when the network is wide enough (Jacot et al., 2018; Arora et al., 2019b). Further, Cao and Gu (2019b) demonstrate that the loss function of a neural network is almost convex in the overparameterized regime, while Liu et al. (2020; 2022); Frei and Gu (2021); Charles and Papailiopoulos (2018) study neural models under Polyak-Łojasiewicz (PL) type conditions (Polyak, 1963; Łojasiewicz, 1963; Karimi et al., 2016). Recently, Banerjee et al. (2023) provided a bound on the spectral norm of the Hessian of the network over a larger layerwise spectral-norm-radius ball (in comparison to Liu et al. (2020)) and showed geometric convergence in deep learning optimization using restricted strong convexity. Our regret analysis makes use of these recent advances in deep learning.

G DETAILS OF EXPERIMENTS

Baselines. We choose four neural bandit algorithms: (1) Neural UCB (Zhou et al., 2020) maintains a confidence bound at every step using the gradients of the network and selects the most optimistic arm.
(2) Neural TS (Neural Thompson Sampling) (Zhang et al., 2021) estimates the reward of each arm by drawing it from a normal distribution whose mean is the output of the neural network and whose variance is a quadratic form of the gradients of the network; the arm with the maximum sampled reward is selected. (3) EE-Net (Ban et al., 2022b): in addition to employing an exploitation network for learning the output function, it uses a separate exploration network to learn the potential gain of exploring relative to the current estimated reward. (4) Neural Epsilon employs the $\epsilon$-greedy strategy: with probability $1-\epsilon$ it chooses the arm with the maximum estimated reward generated by the network, and with probability $\epsilon$ it chooses a random arm.

Datasets. We consider a collection of 6 multiclass classification datasets from the openml.org platform: covertype, fashion, MagicTelescope, mushroom, Plants, and shuttle. Following the evaluation setting of existing works (Zhou et al., 2020; Ban et al., 2022b), given an input $x_t\in\mathbb R^d$ for a $K$-class classification problem, we transform it into $K$ context vectors of dimension $dK$, one per arm: $x_{t,1}=(x_t,0,0,\dots,0)^\top$, $x_{t,2}=(0,x_t,0,\dots,0)^\top$, $\dots$, $x_{t,K}=(0,0,\dots,0,x_t)^\top$. The reward is defined as $1$ if the index of the selected arm equals $x_t$'s ground-truth class; otherwise, the reward is $0$.
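For completeness, a small sketch of this disjoint (block one-hot) arm encoding; the helper name is ours, not from any of the cited codebases.

```python
import numpy as np

# The disjoint-arm encoding used in the experiments: a d-dimensional input x
# for a K-class problem becomes K context vectors of dimension d*K, with x
# placed in the block corresponding to each arm.

def make_arm_contexts(x, K):
    d = x.shape[0]
    contexts = np.zeros((K, d * K))
    for a in range(K):
        contexts[a, a * d:(a + 1) * d] = x
    return contexts

x = np.array([0.2, -1.0, 0.5])
C = make_arm_contexts(x, K=4)
print(C.shape)        # (4, 12)
# reward is 1 if the chosen arm equals the ground-truth class, else 0
```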
Architecture: Both Neu SquareCB and Neu FastCB use a 2-layer ReLU network with 100 hidden neurons; the last layer of Neu SquareCB uses a linear activation, while Neu FastCB uses a sigmoid. Following the scheme in Zhou et al. (2020) and Zhang et al. (2021), we use a diagonal matrix approximation in both Neural UCB and Neural TS to save the computational cost of matrix inversion; both use a 2-layer ReLU network with 100 hidden neurons with a linear activation in the last layer. We perform a grid search over the regularization parameter $\lambda\in\{1,0.1,0.01\}$ and the exploration parameter $\nu\in\{0.1,0.01,0.001\}$. Neural Epsilon uses the same neural architecture, and the exploration parameter $\epsilon$ is searched over $\{0.1,0.05,0.01\}$. For EE-Net we use the architecture from https://github.com/banyikun/EE-Net-ICLR-2022. For all algorithms we also perform a grid search for the step size over $\{0.01,0.005,0.001\}$.

Figure 2: Figure 1 re-plotted for better visualization. Comparison of the cumulative regret of Neu SquareCB and Neu FastCB with the baselines (EE-Net, Neural UCB, Neural TS, Neural Epsilon) on the six real-world datasets, averaged over 20 runs; the subplot below each panel re-plots the regret curves of Neu SquareCB and Neu FastCB alone.

G.1 Neu SquareCB AND Neu FastCB

Although our regret bounds are for the regularized network as defined in (10), the results presented in Section 5 are for the un-perturbed network. In this section we compare the un-perturbed network with the regularized one for different choices of the perturbation constant $c=c_{\mathrm{reg}}/m^{1/4}$.

Figure 3: Comparison of the cumulative regret of Neu SquareCB for different choices of the regularization parameter $c=c_{\mathrm{reg}}/m^{1/4}\in\{0,0.1,0.01,0.001\}$ for the model defined in (10), on the six datasets, averaged over 10 runs.

As is evident from Figures 3 and 4, the cumulative regret attained by the un-perturbed networks ($c=0$) is broadly similar to that of the perturbed ones on these datasets. However, it is the output perturbation that ensures the provable $O(\sqrt{KT\log T})$ and $O(\sqrt{KL^{\ast}}+K)$ regret bounds for Neu SquareCB and Neu FastCB respectively. For the set of problems in our experiments, the non-perturbed versions of Neu SquareCB and Neu FastCB behave similarly to the perturbed versions, but they do not come with a provable regret bound; in particular, one may be able to construct a problem where the non-perturbed version performs poorly.

Figure 4: Comparison of the cumulative regret of Neu FastCB for different choices of the regularization parameter $c=c_{\mathrm{reg}}/m^{1/4}\in\{0,0.1,0.01,0.001\}$ for the model defined in (10), on the six datasets, averaged over 10 runs.