# Stochastic Bandits with ReLU Neural Networks

Kan Xu¹  Hamsa Bastani²  Surbhi Goel²  Osbert Bastani²

¹Arizona State University, Arizona, USA. ²University of Pennsylvania, Pennsylvania, USA. Correspondence to: Kan Xu. Proceedings of the 41st International Conference on Machine Learning, Vienna, Austria. PMLR 235, 2024. Copyright 2024 by the author(s).

Abstract

We study the stochastic bandit problem with ReLU neural network structure. We show that a $O(\sqrt{T})$ regret guarantee is achievable by considering bandits with one-layer ReLU neural networks; to the best of our knowledge, our work is the first to achieve such a guarantee. In this specific setting, we propose an OFU-ReLU algorithm that can achieve this upper bound. The algorithm first explores randomly until it reaches a linear regime, and then implements a UCB-type linear bandit algorithm to balance exploration and exploitation. Our key insight is that we can exploit the piecewise linear structure of ReLU activations and convert the problem into a linear bandit in a transformed feature space, once we learn the parameters of the ReLU relatively accurately during the exploration stage. To remove the dependence on model parameters, we design an OFU-ReLU+ algorithm based on a batching strategy, which provides the same theoretical guarantee.¹

¹Source code is available at https://github.com/kanxu526/ReLUBandit.

1. Introduction

The stochastic contextual bandit problem has been widely studied in the literature (Bubeck et al., 2012; Lattimore & Szepesvári, 2020), with broad applications in healthcare (Bastani & Bayati, 2020), personalized recommendation (Li et al., 2010), etc. The problem is important since real-world decision-makers oftentimes adaptively gather information about their environment to learn. Formally, the bandit algorithm actively selects a sequence of actions $\{x_t\}_{t \in [T]}$ with $x_t \in \mathcal{X}$ over some horizon $T \in \mathbb{N}$, and observes stochastic rewards $y_t = f_{\Theta^*}(x_t) + \xi_t$, where $f_{\Theta^*}$ is the true reward function (represented by a model with parameters $\Theta^*$), and $\xi_t$ is random noise. Thus, to achieve good performance or low regret, the decision-maker must maintain small decision error uniformly across all actions $x_t$ over time to ensure that it generalizes to new, actively selected actions.

With the success of bandits in practice, there has been a great deal of recent interest in understanding the theoretical properties of bandit algorithms. For linear models, i.e., when the expected reward $f_{\Theta^*}$ is linear in $x_t$, bandit algorithms have adapted techniques from statistics to address this challenge (Dani et al., 2008; Rusmevichientong & Tsitsiklis, 2010; Abbasi-Yadkori et al., 2011). In particular, linear bandit algorithms build on linear regression, which provides parameter estimation bounds of the form $\|\hat{\theta} - \theta^*\|_2 \le \epsilon$; then, we obtain uniform generalization bounds of the form $|f_{\hat{\theta}}(x) - f_{\theta^*}(x)| \le L\epsilon$ for all $x \in \mathcal{X}$, where $L$ is a Lipschitz constant for $f_\theta$. By adapting these techniques, linear bandit algorithms can achieve minimax regret rates of $O(\sqrt{T})$ (Dani et al., 2008; Abbasi-Yadkori et al., 2011). However, the linearity assumption oftentimes does not hold for complicated tasks (Valko et al., 2013), which has recently motivated the study of nonlinear contextual bandits, especially those building upon neural networks with ReLU activations (see, e.g., Zhou et al., 2020; Zhang et al., 2020; Xu et al., 2020; Kassraie & Krause, 2022).
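To fix ideas, the following minimal simulation mirrors this interaction protocol: the learner picks an action on the unit sphere, observes a noisy reward, and accumulates regret against the best action in a finite candidate set. The reward function, the uniform-random learner, and all constants are illustrative placeholders, not anything from this paper.

```python
# Minimal simulation of the bandit protocol: pick x_t, observe y_t = f(x_t) + noise, and
# accumulate regret against the best action in a finite candidate set. The reward function
# and the uniform-random "learner" are placeholders, not the algorithm proposed in this paper.
import numpy as np

rng = np.random.default_rng(0)
d, T, sigma = 5, 200, 0.1

def reward_fn(x):
    # Placeholder nonlinear reward; the paper's model is a one-layer ReLU network.
    return np.maximum(x[0], 0.0) + np.maximum(-x[1], 0.0)

actions = rng.normal(size=(100, d))                       # finite action set on the unit sphere
actions /= np.linalg.norm(actions, axis=1, keepdims=True)
best_value = max(reward_fn(x) for x in actions)

regret = 0.0
for t in range(T):
    x_t = actions[rng.integers(len(actions))]             # naive learner: uniformly random arm
    y_t = reward_fn(x_t) + sigma * rng.normal()           # the only feedback the learner sees
    regret += best_value - reward_fn(x_t)                 # per-round regret uses mean rewards
print(f"cumulative regret after {T} rounds: {regret:.2f}")
```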
One of the key questions is what kinds of guarantees can be provided in such settings when the underlying true reward model has a ReLU structure. The existing literature on bandits with ReLU neural networks mainly bases its analysis on the theory of the neural tangent kernel (NTK) (Jacot et al., 2018). Zhou et al. (2020) and Gu et al. (2024) leverage NTK and upper confidence bound (UCB) techniques to achieve an $O(\gamma_T \sqrt{T})$ regret bound, where $\gamma_T$ is the effective dimension (or maximum information gain) and is assumed to be $T$-independent in this literature. This assumption is strong because it intuitively assumes the function is linear on a low-dimensional subspace (more precisely, it says that the eigenvalues of the empirical covariance matrix in the kernel space vanish quickly, so the covariates approximately lie in a low-dimensional subspace); thus, it effectively converts the problem into a linear bandit problem in a high-dimensional space. Indeed, for ReLU neural networks on a $d$-dimensional domain (i.e., $x \in \mathbb{R}^d$), Kassraie & Krause (2022) show that the best known upper bound on the information gain is $\gamma_T = O(T^{\frac{d-1}{d}})$, even for a one-layer neural network. Consequently, the regret bound above becomes superlinear for $d > 1$ without further restrictive assumptions. Kassraie & Krause (2022) improve upon this regret bound with a variant of Zhou et al. (2020) and obtain a sublinear bound of $O((\gamma_T T)^{1/2}) = O(T^{\frac{2d-1}{2d}})$, which is still far from the typical $O(\sqrt{T})$ guarantee.

Contribution. In contrast, we shed light on the achievable regret guarantee for a nonlinear bandit problem with ReLU neural network structure by estimating the model $f_{\Theta^*}$ directly (without making an effective dimension assumption via NTK techniques). Due to the complexity of the problem, we consider the setting of a one-layer ReLU neural network as the true reward function. We design two bandit algorithms that exploit the piecewise linear structure of ReLU activations; to the best of our knowledge, we provide the first $O(\sqrt{T})$ regret bound for bandit learning with ReLU neural networks.

Our first bandit algorithm, OFU-ReLU, is designed based on the following insight. Let the true reward function have the following ReLU structure with $k$ neurons (see, e.g., Du & Lee, 2018; Zhang et al., 2019):
$$\sum_{i \in [k]} \theta_i^{*\top} x \cdot \mathbb{1}(\theta_i^{*\top} x \ge 0),$$
where $\Theta^* = [\theta_1^*, \dots, \theta_k^*]^\top$. Intuitively, once we learn a sufficiently good estimate $\tilde{\theta}_i$ that is close to its corresponding true neuron parameter $\theta_i^*$, we can freeze the contribution of the indicator function as $\mathbb{1}(\tilde{\theta}_i^\top x \ge 0)$. The problem then becomes linear, and we can use a UCB-type linear bandit algorithm to achieve good performance. This insight motivates our two-phase bandit algorithm, where we first explore randomly to estimate the parameters, and then run a linear bandit once we reach the linear regime to obtain a minimax-optimal rate. We note that even though this strategy may seem straightforward, such a two-stage design with a phase transition from exploration to bandit learning has been shown to be inevitable for specific nonlinear bandit problems (Rajaraman et al., 2023), and might be a more general phenomenon. In addition, applying a UCB-type algorithm (i.e., Eluder UCB (Russo & Van Roy, 2013)) directly to the reward has been shown to be suboptimal for a certain family of nonlinear bandit problems, posing an interesting theoretical challenge (Rajaraman et al., 2023).
In contrast, we show that instead of applying UCB to the Re LU structure directly, we will be able to reduce it to a linear bandit problem and this will make UCB optimal again. Our second bandit algorithm OFU-Re LU+ is designed to eliminate the assumed knowledge in OFU-Re LU of the minimum gap ν between the optimal action x and any neuron θ i (we prove such a gap always exists for our Re LU structure). As long as the estimate θi is within ν /2 of the true neuron θ i , the indicator estimate is guaranteed to be consistent with the true indicator value at the optimal action x = x , i.e., 1( θ i x 0) = 1(θ i x 0). In other words, unless our estimate θ has less than a ν -dependent error, the optimal action x would lie near a nonlinear regime of a neuron, which might fail the following linear bandit learning. Yet, in practice we do not know ν ; thus, we design a batching strategy that first makes a guess on ν and keep cutting this guess every batch so that our estimate θi will be accurate enough after a constant number of batches. Different from the previous batching strategies (see, e.g., Golrezaei et al., 2019; Luo et al., 2022), our exploration and OFUL phase both use samples from all previous batches without discarding data from previous batches. Finally, we provide a parameter estimation error bound for one-layer Re LU neural networks that can ensure theoretical guarantee for each neuron independently through a novel proof strategy, which might be of separate interest. 1.1. Other Related Work Early work on stochastic bandits focus on linear rewards (Dani et al., 2008; Chu et al., 2011; Abbasi-Yadkori et al., 2011) and typically use an UCB algorithm. Later, it has been extended to kernel based models, where the reward functions belong to the reproducing kernel Hilbert space (RKHS) (Valko et al., 2013). Along this line, the recent Re LU bandit literature are build upon the kernelized algorithm using NTK and achieve an effective-dimensiondependent bound (Zhou et al., 2020; Xu et al., 2020; Kassraie & Krause, 2022; Gu et al., 2024); (Salgia, 2023) generalize to smooth activations and provide a bound O(T 2d+2s 3 2d+4s 4 ) depending on smoothness s. Dong et al. (2021) study the optimization scheme of nonlinear bandit, and provide a O(T 3/4) and O(T 7/8) local and global regret for two-layer neural network bandit. (Rajaraman et al., 2023) focus specifically on the ridge function family (which does not include Re LU but only Re LU with single neuron), and design an explore-then-commit strategy. Finally, there are also many other works that study neural network bandits with other activations, such as quadratic activations (Xu et al., 2021a), polynomial functional form (Huang et al., 2021), etc. 2. Problem Formulation Notation. Let [n] = {1, , n}. We let Sd 1(r) Rd denote the (d 1)-sphere in d dimensions with radius r, and let Sd 1 = Sd 1(1). Define Ad 1(r) to be the area of the (d 1)-sphere with radius r (i.e., Ad 1(r) = |Sd 1(r)|). We use to represent the maximum value. Define p to be the distribution of the covariates, i.e., x p. Re LU Neural Network. Consider a function family fΘ : X Y (where X = Sd 1 Rd, Y = R, and Θ Θ for Stochastic Bandits with Re LU Neural Networks a given domain Θ Rk d) consisting of neural networks with Re LU activations (see, e.g., Du & Lee, 2018; Zhang et al., 2019) i [k] g(θ i x), (1) where θi is the ith component of the parameter Θ = [θ1, θ2, , θk] (which we call a neuron) and g(z) = z 1(z 0). Let Θ be the ground-truth parameters. 
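To make the model in (1) concrete, the following sketch implements the one-layer ReLU reward $f_\Theta(x) = \sum_{i \in [k]} g(\theta_i^\top x)$ with unit-norm neurons, together with noisy observations of it; the sampling routine and all constants are illustrative assumptions rather than values from the paper.

```python
# Sketch of the one-layer ReLU reward model (1): f_Theta(x) = sum_i g(theta_i^T x) with
# g(z) = z * 1(z >= 0) and each neuron theta_i on the unit sphere. Parameter values and
# the noise level are illustrative.
import numpy as np

rng = np.random.default_rng(0)
k, d, sigma = 3, 2, 0.1

Theta_star = rng.normal(size=(k, d))                              # ground-truth Theta* in R^{k x d}
Theta_star /= np.linalg.norm(Theta_star, axis=1, keepdims=True)   # each row normalized to unit length

def f(Theta, x):
    """One-layer ReLU network: sum over neurons of relu(theta_i . x)."""
    return np.sum(np.maximum(Theta @ x, 0.0))

def noisy_reward(x):
    """Bandit feedback y = f_{Theta*}(x) + xi with Gaussian (hence sub-Gaussian) noise."""
    return f(Theta_star, x) + sigma * rng.normal()

x = rng.normal(size=d)
x /= np.linalg.norm(x)                                            # actions live on the unit sphere S^{d-1}
print(f(Theta_star, x), noisy_reward(x))
```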
We are provided with n data points Z = {(xi, yi)}i [n] such that yi = fΘ (xi)+ξi, where ξi is i.i.d. σ-subgaussian random noise with mean 0. Assume the covariates follow certain distribution x p. Then, the population mean squared loss is defined as LS,p(Θ) = Ep (fΘ(x) fΘ (x))2 . We slightly abuse our notation and use p to denote the joint distribution of each training example (xi, yi); the corresponding empirical mean squared loss is ˆLS(Θ; Z) = 1 i [n] (fΘ(xi) yi)2. (2) For some neural network fΘ , we obtain an estimate ˆΘ for the parameter Θ by taking the minimizer of ˆLS(Θ; Z), i.e., ˆΘ = arg minΘ Θ ˆLS(Θ; Z). Assumptions. We provide a statistical guarantee for parameter estimation of Re LU neural network under the following assumptions. Assumption 2.1. θ i 2 = 1 holds for all neurons i [k]. Note that under the above assumption, the domain of Θ, i.e., Θ, is included in {Θ Rk d | Θ 2,1 = k}, where Θ 2,1 := P i [k] θi 2 is the ℓ2,1-norm. Assumption 2.2. There exists a constant α0 > 0 such that min j,j [k],j =j θ j θ j 2 α0. Collectively, our assumptions limit the structure of the neural network. As evidenced by lower bounds in (Dong et al., 2021), some restrictions on the structure are necessary to obtain regret O(T α) for α < 1. Assumption 2.1 is critical for our analysis, since neurons become hard to estimate when they are small. Similarly, Assumption 2.2 is necessary since two neurons that are close together are hard to distinguish. The assumption that the second layer consists of weights ones is not critical most of our proofs can be extended to the case where the second layer has values in { 1}. Re LU Bandit. We consider the following bandit problem. At each step t, we choose an action xt X, and observe reward yt = fΘ (xt) + ξt, (3) where ξt is i.i.d. σ-subgaussian random noise with mean 0. Here, the true parameters Θ are unknown and will be learned in an online manner. Our goal is to minimize the cumulative regret RT over a time horizon T: t=1 rt, rt = fΘ (x ) fΘ (xt) where x = arg maxx X fΘ (x) is the optimal action and rt is the per period regret. 3. Parameter Estimation for Re LU Neural Networks We provide an estimation error bound on the parameters of Re LU neural networks, which is important for the convergence of our bandit algorithm. Since the population loss function is symmetric regarding the neurons, meaning that any column-permutation of Θ achieves zero loss, our bound shows the existence of some mapping σ : [k] [k] from the ground truth neurons θ i to the estimated neurons ˆθσ(i) such that ˆθσ(i) θ i . First, we show the following proposition, which states that a small generalization error on a special subset of the unit sphere Xi(ϵ) X = Sd 1 for ϵ R>0, i.e., Xi(ϵ) = {x X | |θ i x| ϵ}, implies a small parameter estimation error of the corresponding neuron θ i up to a sign flip. Proposition 3.1. Suppose LXi(ϵ)(Θ) := Z Xi(ϵ) |fΘ(x) fΘ (x)|dx η for some η R>0. Then, there exists a bijection σ : [k] [k] such that min{ θσ(i) θ i 2, θσ(i) + θ i 2} h(η, ϵ), where h(η, ϵ) := kϵ3|Sd 3|/2 ϵ2(1 dϵ2/2)|Sd 2|/8 η 6kdϵ3|Sd 2|. We provide a detailed proof with illustrations in Appendix A. Intuitively, Xi contains those x X close to the boundary where the neuron θ i of fΘ is nonlinear (i.e., θ i x = 0). 
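As a quick numerical illustration of this region, the sketch below samples points uniformly on $S^{d-1}$, checks membership in $X_i(\epsilon)$, and evaluates the discrepancy $|f_\Theta - f_{\Theta^*}|$ restricted to that slice. Here $\Theta$, the dimensions, and $\epsilon$ are arbitrary choices, and the empirical average only approximates the surface integral in Proposition 3.1 up to the area of the slice.

```python
# Monte Carlo illustration of the boundary slice X_i(eps) = {x on S^{d-1} : |theta_i* . x| <= eps}
# from Proposition 3.1, and of the restricted discrepancy between two ReLU networks on it.
# Theta_star, Theta_hat, and all constants are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(1)
k, d, eps, n_samples = 3, 3, 0.1, 200_000

def relu_net(Theta, X):
    return np.maximum(X @ Theta.T, 0.0).sum(axis=1)          # f_Theta(x) = sum_i relu(theta_i . x)

Theta_star = rng.normal(size=(k, d))
Theta_star /= np.linalg.norm(Theta_star, axis=1, keepdims=True)
Theta_hat = Theta_star + 0.05 * rng.normal(size=(k, d))       # a nearby (imperfect) estimate

X = rng.normal(size=(n_samples, d))                           # uniform samples on the sphere
X /= np.linalg.norm(X, axis=1, keepdims=True)

i = 0
in_slice = np.abs(X @ Theta_star[i]) <= eps                   # membership in X_i(eps)
gap = np.abs(relu_net(Theta_hat, X) - relu_net(Theta_star, X))
print(f"fraction of the sphere in X_{i}(eps): {in_slice.mean():.3f}")
print(f"mean |f_hat - f_star| restricted to X_{i}(eps): {gap[in_slice].mean():.4f}")
```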
Then, this proposition claims that if the loss on certain Xi is small, then the corresponding neuron θ i in the groundtruth neural network fΘ can be identified up to a sign flip Stochastic Bandits with Re LU Neural Networks by a corresponding neuron θσ(i) in the approximate neural network fΘ, i.e., θσ(i) θ i or θσ(i) θ i . Moreover, if the loss on all Xi s with i [k] are small, then Θ is close to Θ via the mapping σ up to sign flips. Our main result combines Proposition 3.1 with a standard generalization error bound for Re LU neural networks to obtain the following parameter estimation error bounds. Theorem 3.2. Suppose the distribution p satisfies X |fΘ(x) fΘ (x)|dx Ep [|fΘ(x) fΘ (x)|] . Then, there exists a bijection σ : [k] [k] such that min{ ˆθσ(i) θ i 2, ˆθσ(i) + θ i 2} 727π 1 4 kd 1 4 (2ζ) 1 4 holds for all i [k] with at least a probability 1 δ, where ζ = Θ( p We give a proof and the expression of ζ in Appendix B. One potential limit of our proof strategy is that we may not correctly identify the sign of the ground truth neurons i.e., our guarantee has the form ˆθσ(i) θ i . However, we show in the next section that this caveat does not affect our bandit algorithm and analysis. Particularly, we show the difference between the estimated neural network f ˆΘ and the true model fΘ can be captured by a linear structure under sign misspecification. Thus, we can still run linear bandit algorithm to learn fθ in an online manner. 4. Algorithms for Re LU Bandits Now, we describe our bandit algorithm and provide corresponding regret analysis. We begin with a simple case where we know the gap between the optimal action x and the nearest neuron, and then provide a solution when this knowledge of gap is not assumed. 4.1. Algorithm Design We first provide intuition for our design choices. The challenge of running a Re LU bandit algorithm is the nonlinearity of the Re LU neural network, which is due to the indicator function in the Re LU activations i.e., in θ i x 1(θ i x 0), the first occurrence of θi is the linear contribution and the second occurrence is the contribution via the indicator function. The key of our design to tackle the nonlinearity is to first learn the indicator contribution, and then use a typical linear bandit algorithm to keep updating the model given the indicator function fixed. Particularly, once we have a sufficiently good estimate θi θ i , then our estimate of the indicator is exact: 1( θ i x 0) = 1(θ i x 0), i [k]. (4) Next, we can fix the value of the indicator function using θi and focus on learning the linear part. That is, we run a linear bandit algorithm with the value of θi = θi in the indicator functions, but keep learning the value of θi in the linear part. In more detail, if (4) holds, then we have E[y] = fΘ (x) = X i [k] (1( θ i x 0)x) θ i 1( θ 1 x 0)x 1( θ 2 x 0)x ... 1( θ k x 0)x θ 1 θ 2 ... θ k Equivalently, we can run a linear bandit to learn fΘ (x), however, with action being the term x Rdk in (5), and parameter being the term θ Rdk. The action x is a function of the original action x and the estimated parameter Θ, and the parameter to update θ is a vectorization of the parameter Θ of interest. Challenges. Though the above two-stage design looks intuitive, there are still two challenges associated with (4) and (5). On one hand, even if we have a relatively accurate estimate θi of θ i , (4) might not hold for some action x close to both of the estimate θi and the neuron θ i . 
On the other hand, our theoretical result in 3 only provides a guarantee on θi up to a sign flip, and hence still (4) might not hold; thus, we also need to design our algorithm capturing this bias. As suggested above, note that for any x such that θ i x 0, even if θi θ i , it may still be the case that (4) fails to hold. As a consequence, (5) does not hold, and fΘ (x) is not linear in x for x close to any of θ i s. In other words, the linear bandit is misspecified in some action region, so the algorithm may not converge if the optimal action lies in such regions. Thus, our regret bounds depend on the gap between the optimal action x = arg maxx X fΘ (x) and the nearest hyperplane that is perpendicular to one of the neurons in the ground-truth neural network. Definition 4.1. A Re LU neural network with parameters Θ has a ν -gap for ν R 0 if |θ i x | ν for all i [k]. The following result ensures that a nontrivial gap ν > 0 always exists for our Re LU structure. Proposition 4.2. mini [k] |θ i x | > 0 holds for the optimal action x . We give a proof in Appendix C. Note that our proof strategy relies on the Re LU structure where the weights of the second layer equal 1. In other words, any Re LU neural network structure as in (3) that satisfies our assumptions has a positive gap ν = mini [k] |θ i x | > 0. Stochastic Bandits with Re LU Neural Networks Given this gap ν , as long as our estimate θi has an estimation error of θ i smaller than ν /2, i.e., θi θ i ν /2, our bandit algorithm will be able to find the optimal action x in the action space X( Θ, ν /2), where X(Θ, ν) := {x X | |θ i x| ν, i [k]}. (6) Intuitively, the above claim holds for two reasons: (i) x X( Θ, ν /2), and (ii) the bandit model fΘ (x) is linear in x for any x X( Θ, ν /2) (recall x is a function of x). For (i), it suffices to show that X(Θ , ν ) X( Θ, ν /2), as our Re LU neural network in (1) has a positive ν -gap and hence x X(Θ , ν ). To this end, for any x X(Θ , ν ), if θ i x > 0, we have θ i x θ i x | θ i x θ i x| θ i x θi θ i 2 ν ν /2 = ν /2 > 0. (7) Similarly, if θ i x < 0, we have θ i x ν /2. Next, to show (ii), it suffices to show that for any x X( Θ, ν /2), we have 1(θ i x 0) = 1( θ i x 0). Following a similar argument as (7), we can show that θ i x 0 if θ i x 0 (and similarly for ). Thus, our plug-in indicator function is consistent with the true indicator function. The other challenge is the sign misidentification from our Theorem 3.2. Specifically, Θ is close to the true parameters Θ only up to signs. In the worst case, the values of the corresponding indicators 1( θ i x 0) may differ from the true values 1(θ i x 0) when θi is close to θ i instead of θ i . In other words, the function fΘ (x) will still be nonlinear of x and (5) may no longer hold, leading to misspecification, when θi + θ i 2 ν /2 instead of θi θ i 2 ν /2, even if we reduce our search region to X( Θ, ν /2) In order to correct this misspecification bias, we propose to add additional k delicately designed linear components to (5). We show that the misspecification when θi + θ i 2 ν /2 can be captured by a linear structure of k additional transformed features of x, enabling the linear bandit algorithm to function again. In more detail, we can write i [k] (1( θ i x 0)x) θ i 2 1( θ i x 0)x) 1(θ i x 0) 1( θ i x 0) 1 2 1( θ i x 0) θ i Note that compared to (5), we have an additional second term that captures the misspecification; this term equals 0 for any x X( Θ, ν /2) when θi θ i ν /2, as we have shown in (7). 
Similarly, when θi + θ i ν /2, we can show that 1( θ i x 0) = 1 1(θ i x 0) i.e., if θ i x 0, we have θ i x 0 for any x X( Θ, ν /2). In other words, we have 1(θ i x 0) 1( θ i x 0) 1 2 1( θ i x 0) = 0, if θi θ i 2 ν 2 2, if θi + θ i 2 ν Therefore, the true reward function fΘ (x) in (3) is equivalent to fθ (x ) := x (x, Θ) θ (Θ , Θ), (8) where x : X Rk d R2kd and θ : Rk d Rk d R2kd are two mappings with 1( θ 1 x 0)x 1( θ 2 x 0)x ... 1( θ k x 0)x ( 1 2 1( θ 1 x 0))x ( 1 2 1( θ 2 x 0))x ... ( 1 2 1( θ k x 0))x θ (Θ , Θ) = θ 1 θ 2 ... θ k 2θ 11( θ1 + θ 1 2 ν 2 ) 2θ 21( θ2 + θ 2 2 ν 2 ) ... 2θ k1( θk + θ k 2 ν In the following, we will use x and θ to denote the two vectors in (9) and (10) for simplicity whenever no ambiguity is raised. The reward function in (8) has additional k linear components compared to (5); x builds upon x and contains additional k features that captures the misspecification due to sign flip. Now, we will be able to run a linear bandit algorithm with 2kd features to learn θ in an online manner. 4.2. OFU-Re LU Algorithm For now, we assume ν is known throughout our design; we describe how to remove this assumption in 4.3. Our algorithm, which we name OFU-Re LU2, has two phases: 2 OFU standards for optimism in the face of uncertainty (Abbasi-Yadkori et al., 2011) Stochastic Bandits with Re LU Neural Networks Algorithm 1 OFU-Re LU Input: exploration length t0, regularization parameter λ Initialize Z0 for t [t0] do Sample action xt i.i.d. p Take action xt and obtain reward yt Zt Zt 1 {(xt, yt)} end for Compute Θt0 arg minΘ ˆLS(Θ; Zt0) for t (t0 + 1, T] do Compute confidence ellipsoid Ct(λ, Zt) for θ x t arg max(x,θ) X ( Θt0,ν /2) Ct(λ,Zt) x θ Play xt with x (xt, Θt0) = x t and obtain reward yt Zt Zt 1 {(xt, yt)} end for Exploration. Randomly sample exploratory actions xt p for t0 time steps until our estimate Θt0 (i.e., Θ estimated using the first t0 samples) satisfies min{ θt0,i θ i 2, θt0,i + θ i 2} ν /2 with high probability. OFUL. Run the OFUL algorithm (Abbasi-Yadkori et al., 2011) to learn the true reward function fθ (x ) in (8), which is linear in the parameter θ and features x (x, Θt0), over the region x X ( Θt0, ν /2), where X (Θ, ν) := {x (x, Θ) | x X(Θ, ν)}. (11) At each time period t > t0, we follow OFUL, choosing arm x t = arg max(x,θ) X ( Θt0,ν /2) Ct(λ,Zt) x θ and observing reward yt; the confidence ellipsoid Ct(λ, Zt) for the true parameter θ depends on a regularization hyperparameter λ from OFUL, and can be computed using Theorem 2 in (Abbasi-Yadkori et al., 2011) with S = 5k (note that θ 5k) and all the data previously observed Zt = {(xτ, yτ)}τ [t 1]. We detail our algorithm in Algorithm 1. Given our design, we can obtain a regret bound scaling as O(kd T); particularly, we control the parameter estimation error of Θt0 using Theorem 3.2 in 3, and analyze the regret of OFUL stage using Theorem 3 in (Abbasi-Yadkori et al., 2011). We provide a proof sketch below. Theorem 4.3. The cumulative regret of Algorithm 1 satisfies RT = O k14d3(1/ν8 d4) + kd Proof Sketch. Suppose the exploration stage ends at time t0. We have min n θi θ i 2, θi + θ i 2 o ν /2, i [k] with probability at least 1 δ/2 by choosing t0 large enough according to Theorem 3.2. In particular, it suffices to take δ = 1/ t0 = Ω(k13d3(1/ν8 d4)). (12) Now we analyze the regret of our OFU-Re LU algorithm in three cases. First, at each time t, the per-period regret can be bounded trivially by i [k] θ i 2)( x 2 + xt 2) 2k. 
Therefore, the regret during the exploration phase is upper bounded by O(k14d3(1/ν8 d4)). In the second stage, we run OFUL to find the optimal policy for the linear function fθ (x ) given our forced-sample estimate Θt0. Applying the regret bound in Theorem 3 of (Abbasi-Yadkori et al., 2011) gives the regret bound in our second stage to be O(kd T) with at least a probability of 1 δ/2. Finally, with a small probability δ = 1/ T, we would have linear regret scaling as 2k T; thus, the expected regret in this case is bounded by 2k Tδ = O(k T). Our claim then follows. We provide a detailed proof in Appendix D. Theorem 4.3 shows that our algorithm obtains a O(kd T) regret guarantee as long as the time horizon T is large enough. 4.3. OFU-Re LU+ Algorithm Algorithm 1 requires the knowledge of the gap ν , which is typically unknown. We can remove this assumption based on a batching strategy; we provide a schematic representation of our algorithm in Figure 1. Algorithm 2 summarizes our algorithm OFU-Re LU+ based on this insight. At a high level, we split the entire time horizon T into M increasing batches with a grid 0 = T0 T1 TM = T. Each batch i [M 1], i.e., t (Ti 1, Ti], satisfies a(Ti Ti 1) = Ti+1 Ti i.e., Ti = (ai 1)T1 for some constant a > 1 and T1 > 1. Note that M = log(T/T1 + 1) For each batch, we take a fixed guess of ν ; we reduce this guess geometrically from one batch to the next. Specifically, let ν0 be our initial guess of ν at t = 0; then, the guess νi for batch i is νi = ν0/bi for some constant b > 1. Our νi will become sufficiently small that νi ν for batch i > log(ν0/ν )/ log(b). Our regret analysis in 4.2 can then be applied from that batch onwards. Stochastic Bandits with Re LU Neural Networks Figure 1: Schematic representation of OFU-Re LU+. In detail, within each batch i, we have an exploration phase and a following OFUL phase as in 4.2. We initialize the batch with the exploration period as t (Ti 1, Ti 1 + t0,i]. Different from the conventional batching strategies (see, e.g., Golrezaei et al., 2019; Luo et al., 2022), we estimate Θ and run OFUL using samples from all previous batches without discarding data from previous batches. Particularly, we estimate Θ based on the accumulative samples from all the previous exploration phases t j [i](Tj 1, Tj 1 + t0,j] so that the estimation error satisfies min{ θi θ i 2, θi + θ i 2} νi/2, i [k]. (14) Define a function t03 of an arbitrary ν as t0(ν) = Θ k13d3(1/ν8 d4) (15) according to (12) (detailed expression in Appendix D). Then, it suffices to have t0,i = t0(νi) t0(νi 1) according to (12). Let our estimate be Θt0(νi) (i.e., Θ estimated using all t0(νi) random samples from previous batches). In the rest of the batch i, we run OFUL; at each time step t (Ti 1 + t0,i, Ti], we choose arm x t = arg max(x,θ) X ( Θt0(νi),νi/2) Ct(λ,Zt) x θ and observing reward yt. Again, we compute the Ct(λ, Zt) based on (Abbasi-Yadkori et al., 2011) with S = 5k and all the data observed Zt = {(xτ, yτ)}τ [t 1]. The features are x (xτ, Θt0(νi)) and the parameter to estimate is θ (Θ , Θt0(νi)). Note that the features and the parameter are invariant of Θt0(νi) once νi ν ; thus, our estimate becomes consistent and OFUL is valid from then on. To bound the regret, we decompose the whole time horizon into three parts and analyze them respectively: (i). All batch i satisfying νi > ν , (ii). t (Ti 1, Ti 1 + t0,i] for all batch i with νi ν , (iii). t (Ti 1 + t0,i, Ti] for all batch i with νi ν . 
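The quantities entering this decomposition are determined entirely by the inputs $a$, $b$, $T_1$, and $\nu_0$: batch endpoints $T_i = (a^i - 1)T_1$, gap guesses $\nu_i = \nu_0/b^i$, and incremental exploration lengths $t_{0,i} = t_0(\nu_i) - t_0(\nu_{i-1})$. The sketch below prints this schedule with $a = 2$ and $b = 2^{1/32}$ (the choice discussed after Theorem 4.4); the function t0 is only a placeholder for the $\Theta(k^{13}d^3(1/\nu^8 \vee d^4))$ expression in (15), with its $k$, $d$, and logarithmic factors omitted.

```python
# Sketch of the OFU-ReLU+ batch schedule: geometrically growing batches T_i = (a^i - 1) T_1,
# geometrically shrinking gap guesses nu_i = nu_0 / b^i, and incremental exploration lengths
# t_{0,i} = t_0(nu_i) - t_0(nu_{i-1}). Constants are illustrative.
import math

T, T1, a, b, nu0 = 100_000, 100, 2.0, 2 ** (1 / 32), 1.0

def t0(nu, scale=100):
    # Placeholder for t_0(nu) = Theta(k^13 d^3 (1/nu^8 v d^4)); only the 1/nu^8 dependence is kept.
    return math.ceil(scale / nu ** 8)

M = math.ceil(math.log(T / T1 + 1, a))            # number of batches
Ti_prev, t0_prev = 0, 0
for i in range(1, M + 1):
    Ti = min(int((a ** i - 1) * T1), T)           # batch endpoint T_i = (a^i - 1) T_1, capped at T
    nu_i = nu0 / b ** i                           # current guess of the gap nu*
    t0_i = t0(nu_i) - t0_prev                     # fresh exploration steps needed in batch i
    print(f"batch {i:2d}: rounds ({Ti_prev}, {Ti}], gap guess {nu_i:.3f}, new exploration steps {t0_i}")
    Ti_prev, t0_prev = Ti, t0(nu_i)
```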
The regret in (i) is independent of T and is based on our choices of a, b, T1 and ν0. Our analysis in 4.2 can be applied similarly to analyze the exploration regret in (ii) and the regret (iii) from OFUL. However, for (ii), since νi decreases over time, the regret per batch grows over time. 3Note that t0 in 4.2 is a function of ν . Here we abuse the notation t0 slightly. Algorithm 2 OFU-Re LU+ Input: regularization parameter λ, parameters ν0, T1, a, b Initialize Z0 , E0 for i [M] do νi νi 1/b, t0,i t0(νi) t0(νi 1) for t (Ti 1, Ti 1 + t0,i] do Sample action xt i.i.d. p Take action xt and obtain reward yt Et Et 1 {(xt, yt)}, Zt Zt 1 {(xt, yt)} end for Compute Θt0(νi) arg minΘ ˆLS(Θ; ETi+t0,i) for t (Ti 1 + t0,i, Ti] do Compute confidence ellipsoid Ct(λ, Zt) for θ x t arg max(x,θ) X ( Θt0(νi),νi/2) Ct(λ,Zt) x θ Play xt with x (xt, Θt0(νi)) = x t and obtain yt Zt Zt 1 {(xt, yt)} end for Ti+1 (ai+1 1)T1 end for We show in the following theorem that this regret scales as a polynomial term of T of which the order depends on the relative size between a and b, and can be chosen to be asymptotically less than O( Theorem 4.4. The cumulative regret of Algorithm 2 has RT = O k14d7 + k14d3T 8 log(b) log(a) + kd Proof Sketch. We bound the regret for the three cases above respectively. First, in case (i), we have i log(ν0/ν )/ log(b). Recall that the per-period regret can be trivially bounded by 2k. Thus, the regret in this case is upper bounded by 2k(alog(ν0/ν )/ log(b) 1)T1 2k(ν0/ν ) log(a) log(b) T1. Similarly, regret in case (ii) can also be bounded by i= log(ν0/ν ) log(b) t0,i = O k14d7 + k14d3T 8 log(b) where we use the definition of t0,i, that of t0(ν) in (15), and the value of M in (13). Next, we calculate the regret of running OFUL in case (iii), which is upper bounded again by O(kd T) following the proof strategy of Theorem 3 in (Abbasi-Yadkori et al., 2011), similar to our proof for OFU-Re LU. Finally, with a union bound, there s a small probability M/ T that we will have a linear regret since the above analysis holds only with high probability. We can show the Stochastic Bandits with Re LU Neural Networks regret of this part scales as O(k T). Combining all the above gives our final result. We provide a detailed proof in Appendix E. Compared with Theorem 4.3, here we gain an additional T 8 log(b) log(a) dependence due to the increasing difficulty of learning the unknown gap ν . Theorem 4.4 implies that, as long as our choices of the multipliers a and b satisfy 8 log(b)/ log(a) 1/2, we recover a O( T) regret dependence i.e., when the length of exploration period grows slower than the batch time horizon. for instance, taking a = 2 and b = 21/32, we obtain RT = O k14d7 + k14d3T 1/4 + kd Finally, if the time horizon T is at least a polynomial term of k and d, then as before, we recover an O(kd T) regret guarantee. 5. Experiments We compare our algorithm OFU-Re LU with several benchmarks, including OFUL (Abbasi-Yadkori et al., 2011), which assumes the true model is linear and introduces misspecification errors, and three different versions of Neural UCB (Zhou et al., 2020), i.e., Neural UCB-F, Neural UCBT and Neural UCB-TW. Particularly, Neural UCB-F follows the setup in 7.1 of (Zhou et al., 2020) with m = 20 neurons and two layers; Neural UCB-T assumes the knowledge of the neural network structure of the true reward, i.e., m = k neurons and one layer; and Neural UCB-TW inherits the structure from Neural UCB-T but expands the layer size into m = 2k. 
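For concreteness, the sketch below shows one OFUL-style arm selection of OFU-ReLU over a discretized arm set, using the transformed features $x'(x, \tilde{\Theta}) \in \mathbb{R}^{2kd}$ from (9) and the gap-restricted region $X(\tilde{\Theta}, \tilde{\nu}/2)$ from (6). It is a simplification rather than the authors' implementation: the joint maximization over the confidence ellipsoid in Algorithm 1 is replaced by the standard closed-form UCB index, and the confidence radius beta, the estimate $\tilde{\Theta}$, and all constants are illustrative assumptions.

```python
# One OFUL-style decision step of OFU-ReLU over a finite arm set, using the 2kd-dimensional
# transformed features of (9) and the gap-restricted arm set of (6). Theta_tilde (the estimate
# from the exploration phase), the confidence radius beta, and all constants are illustrative.
import numpy as np

rng = np.random.default_rng(2)
k, d, lam, beta, nu_tilde = 3, 2, 1.0, 1.0, 0.1

def features(x, Theta_tilde):
    """x'(x, Theta_tilde) in R^{2kd}: indicator-gated copies of x plus sign-flip correction features."""
    ind = (Theta_tilde @ x >= 0).astype(float)            # 1(theta_tilde_i . x >= 0), i = 1, ..., k
    top = (ind[:, None] * x[None, :]).ravel()             # [1(.) x ; ... ; 1(.) x]
    bottom = ((0.5 - ind)[:, None] * x[None, :]).ravel()  # [(1/2 - 1(.)) x ; ... ]
    return np.concatenate([top, bottom])

def ucb_arm(arms, Theta_tilde, V, theta_hat):
    """Optimistic arm over the restricted region X(Theta_tilde, nu_tilde/2)."""
    V_inv = np.linalg.inv(V)
    feasible = np.min(np.abs(arms @ Theta_tilde.T), axis=1) >= nu_tilde / 2
    best_j, best_val = None, -np.inf
    for j in np.flatnonzero(feasible):
        phi = features(arms[j], Theta_tilde)
        val = phi @ theta_hat + beta * np.sqrt(phi @ V_inv @ phi)   # plug-in UCB index
        if val > best_val:
            best_j, best_val = j, val
    return best_j

# Illustrative state: a unit-norm estimate of the neurons, a fresh ridge design matrix,
# and a zero ridge estimate of the 2kd-dimensional parameter theta'.
Theta_tilde = rng.normal(size=(k, d))
Theta_tilde /= np.linalg.norm(Theta_tilde, axis=1, keepdims=True)
arms = rng.normal(size=(1000, d))
arms /= np.linalg.norm(arms, axis=1, keepdims=True)
V = lam * np.eye(2 * k * d)          # V_t = lam I + sum_tau phi_tau phi_tau^T (no data yet)
theta_hat = np.zeros(2 * k * d)
print("chosen arm index:", ucb_arm(arms, Theta_tilde, V, theta_hat))
```

As argued in Section 4.1, once the exploration phase returns an estimate within $\nu^*/2$ of each neuron (up to sign), restricting the search to this region makes the plugged-in indicators consistent with the true ones, so the linear UCB index above is well specified.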
We consider a true model with the ReLU structure in (3), with multiple settings presented in Figure 2. The parameter $\Theta^*$ is randomly sampled from the sphere, i.e., $\|\theta_i^*\|_2 = 1$ for $i \in [k]$. The noise follows a normal distribution $\mathcal{N}(0, 0.01)$. We randomly draw 1,000 arms from the unit sphere in each round $t$ and choose the optimal arm from this arm set. Note that with a discretized arm set, our claim of a nontrivial gap $\nu^*$ always holds. For OFUL and OFU-ReLU, we use the theoretically suggested confidence ellipsoid for UCB. Since we do not know the gap $\nu^*$, we set the length of the exploration phase for OFU-ReLU to 20. We tune the hyperparameter $\lambda$ for all methods.

Figure 2: Cumulative regret over a time horizon of T = 1,000 across 50 trials, with 95% confidence intervals. (a) d = 2, k = 3; (b) d = 2, k = 10. Curves compare OFUL, NeuralUCB-F, NeuralUCB-T, NeuralUCB-TW, and OFU-ReLU over time.

Figure 2 shows the performance of our bandit algorithm versus the other benchmarks with 95% confidence intervals. We find that our algorithm significantly outperforms all the other benchmarks. OFUL assumes a linear reward model structure and thus incurs large regret due to the misspecification error. All NeuralUCB benchmarks use gradient descent to learn the model structure over time and thus take a long time to converge in general. Note that even in the fair comparison with NeuralUCB-T, where the true network structure is given, our method is still significantly better in terms of regret. It is worth noting that our method takes only 20 time steps to converge within a time horizon of T = 1,000, while NeuralUCB algorithms generally take a long time to converge (e.g., Zhou et al. (2020) consider a longer horizon of T = 10,000 in all their experiments). Our empirical results complement our theoretical analysis and suggest the efficiency of our algorithm in practice, especially over a short time horizon, despite a theoretically long exploration phase due to our parameter estimation error bound.

6. Conclusion

We analyze a bandit problem with the reward given by a one-layer ReLU neural network structure, and propose algorithms that provide a regret bound of $O(\sqrt{T})$. To the best of our knowledge, our work is the first to obtain such a regret guarantee for bandit learning with neural networks (without an effective dimension assumption). Furthermore, we demonstrate the efficiency of our algorithm in a synthetic experiment, which suggests its practical potential. We believe both our theoretical and empirical results provide the first insight into an efficient design of bandit algorithms based on ReLU neural network structures.

We conclude by providing directions for future research. Due to the complexity of the problem, we tailor our focus to one-layer ReLU activations. A natural extension is to generalize our results to piecewise linear activation functions. It is more challenging to explore whether our insight can be generalized to bandit problems with more complex activation functions or multi-layer architectures.

Impact Statement

This paper presents work whose goal is to advance the field of Machine Learning. There are many potential societal consequences of our work, none of which we feel must be specifically highlighted here.

References

Abbasi-Yadkori, Y., Pál, D., and Szepesvári, C. Improved algorithms for linear stochastic bandits. In NIPS, volume 11, pp. 2312–2320, 2011.

Bastani, H. and Bayati, M.
Online decision making with high-dimensional covariates. Operations Research, 68 (1):276 294, 2020. Bubeck, S., Cesa-Bianchi, N., et al. Regret analysis of stochastic and nonstochastic multi-armed bandit problems. Foundations and Trends in Machine Learning, 5 (1):1 122, 2012. Chu, W., Li, L., Reyzin, L., and Schapire, R. Contextual bandits with linear payoff functions. In Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, pp. 208 214. JMLR Workshop and Conference Proceedings, 2011. Dani, V., Hayes, T. P., and Kakade, S. M. Stochastic linear optimization under bandit feedback. 2008. Dong, K., Yang, J., and Ma, T. Provable model-based nonlinear bandit and reinforcement learning: Shelve optimism, embrace virtual curvature. Advances in Neural Information Processing Systems, 34:26168 26182, 2021. Du, S. and Lee, J. On the power of over-parametrization in neural networks with quadratic activation. In International conference on machine learning, pp. 1329 1338. PMLR, 2018. Golrezaei, N., Javanmard, A., and Mirrokni, V. Dynamic incentive-aware learning: Robust pricing in contextual auctions. Advances in Neural Information Processing Systems, 32, 2019. Gu, Q., Karbasi, A., Khosravi, K., Mirrokni, V., and Zhou, D. Batched neural bandits. ACM/JMS Journal of Data Science, 1(1):1 18, 2024. Huang, B., Huang, K., Kakade, S., Lee, J. D., Lei, Q., Wang, R., and Yang, J. Optimal gradient-based algorithms for non-concave bandit optimization. Advances in Neural Information Processing Systems, 34:29101 29115, 2021. Jacot, A., Gabriel, F., and Hongler, C. Neural tangent kernel: Convergence and generalization in neural networks. Advances in neural information processing systems, 31, 2018. Kassraie, P. and Krause, A. Neural contextual bandits without regret. In International Conference on Artificial Intelligence and Statistics, pp. 240 278. PMLR, 2022. Lattimore, T. and Szepesv ari, C. Bandit algorithms. Cambridge University Press, 2020. Li, L., Chu, W., Langford, J., and Schapire, R. E. A contextual-bandit approach to personalized news article recommendation. In Proceedings of the 19th international conference on World wide web, pp. 661 670, 2010. Luo, Y., Sun, W. W., and Liu, Y. Contextual dynamic pricing with unknown noise: Explore-then-ucb strategy and improved regrets. Advances in Neural Information Processing Systems, 35:37445 37457, 2022. Rajaraman, N., Han, Y., Jiao, J., and Ramchandran, K. Beyond ucb: Statistical complexity and optimal algorithms for non-linear ridge bandits. ar Xiv preprint ar Xiv:2302.06025, 2023. Rigollet, P. and H utter, J.-C. High dimensional statistics. Lecture notes for course 18S997, 813(814):46, 2015. Rusmevichientong, P. and Tsitsiklis, J. N. Linearly parameterized bandits. Mathematics of Operations Research, 35 (2):395 411, 2010. Russo, D. and Van Roy, B. Eluder dimension and the sample complexity of optimistic exploration. Advances in Neural Information Processing Systems, 26, 2013. Salgia, S. Provably and practically efficient neural contextual bandits. In International Conference on Machine Learning, pp. 29800 29844. PMLR, 2023. Valko, M., Korda, N., Munos, R., Flaounas, I., and Cristianini, N. Finite-time analysis of kernelised contextual bandits. In Uncertainty in Artificial Intelligence, 2013. Vershynin, R. High-dimensional probability: An introduction with applications in data science, volume 47. Cambridge university press, 2018. Wainwright, M. J. High-dimensional statistics: A nonasymptotic viewpoint, volume 48. 
Cambridge university press, 2019. Stochastic Bandits with Re LU Neural Networks Xu, K., Bastani, H., and Bastani, O. Robust generalization of quadratic neural networks via function identification. ar Xiv preprint ar Xiv:2109.10935, 2021a. Xu, K., Zhao, X., Bastani, H., and Bastani, O. Groupsparse matrix factorization for transfer learning of word embeddings. ar Xiv preprint ar Xiv:2104.08928, 2021b. Xu, P., Wen, Z., Zhao, H., and Gu, Q. Neural contextual bandits with deep representation and shallow exploration. ar Xiv preprint ar Xiv:2012.01780, 2020. Zhang, W., Zhou, D., Li, L., and Gu, Q. Neural thompson sampling. ar Xiv preprint ar Xiv:2010.00827, 2020. Zhang, X., Yu, Y., Wang, L., and Gu, Q. Learning onehidden-layer relu networks via gradient descent. In The 22nd international conference on artificial intelligence and statistics, pp. 1524 1534. PMLR, 2019. Zhou, D., Li, L., and Gu, Q. Neural contextual bandits with ucb-based exploration. In International Conference on Machine Learning, pp. 11492 11502. PMLR, 2020. Stochastic Bandits with Re LU Neural Networks A. Proof of Proposition 3.1 We give the proof of Proposition 3.1, followed by proofs of the lemmas used in this proof. A.1. Intuition We illustrate our proof strategy in Figure 3. We first define the following set J α i = {l [k] | θl θ i 2 α, or θl + θ i 2 α} for all i [k]. It suffices to prove that J α i is a singleton set for every i [k] in order to prove Proposition 3.1. Note that when α α0/2, J α i s are disjoint (i.e., J α i J α i = for any i, i [k], i = i ); in particular, if i J α i , then for any other i [k], we have θi θ i 2 θ i θ i 2 θi θ i 2 α0 α α, that is, i Jα i , as claimed, where we use Assumption 2.2. As a consequence, it suffices to show that J α i = for every i [k]. To this end, we prove its contrapositive i.e., if there exists j [k] such that J α j = , then LXj(Θ) η = ϵ2(1 dϵ2/2)|Sd 2| 8 kϵ3|Sd 3| 2α (6k + 3σ)dϵ3|Sd 2|. Intuitively, if θ j does not have a matching neuron θj (up to a sign flip) in J α j , then we can show that g(θ x) is linear for any θ Θ j := {θi}i [k] {θ i }i [k],i =j except θ j on majority of the strip X j (Figure 3 (b), formally defined in (17)), which is a close approximation of Xj. Therefore, fΘ(x) fΘ (x) can be additively decomposed into a linear term plus g(θ j x). Besides, we prove that any linear function cannot approximate g(θ j x) well on X j (Figure 3 (c)). As a result, given the definition of LXj(Θ), we can show that LXj(Θ) is lower-bounded. Figure 3: Illustrations for proof sketch of the estimation error for Re LU Neural Networks. (a) The region X is the cylinder with caps consisting of the two green circles and radius 2ϵ. (b) Projected version of subfigure (a). (c) The green region is X with a section of length O(ϵ/α) cut out. A.2. Proof of Proposition 3.1 We list the details of our proof strategy in this section. We introduce additional notation that we will use in the following proof. With a slight abuse of notation, for any vector x Rd Let xi denote the ith element of x and xi:j Rj i+1 the subvector of x consisting of the ith to jth elements. Stochastic Bandits with Re LU Neural Networks Step 1. Note that Xj is a slice of the sphere X = Sd 1. To simplify our following analysis, we approximate Xj using a cylinder X j Rd i.e., without loss of generality assuming that for one specific j [k] θ j = 1 0 0 , (16) X j = [ ϵ, ϵ] Z = {x Rd | |x1| ϵ, x2:d Z}, where Z = Sd 2 p or equivalently X j = {ϕ(x) | x Xj}, where ϕ(x) = h x1 q 1 ϵ2 1 x2 1 x2 q 1 ϵ2 1 x2 1 xd i . 
(17) This region is visualized in Figure 3 (a); its projection to two dimensions is shown in Figure 3 (b). We show that the loss restricted to X j is approximately equal to the loss restricted to Xj: Lemma A.1. Given X j, it holds that Xj |fΘ(x) fΘ (x)|dx Z X j |fΘ(x) fΘ (x)|dx 6kdϵ3|Sd 2|. The proof is provided in Appendix A.3. Step 2. Next, we decompose X j into strips i.e., z Z Xz, where Xz = [ ϵ, ϵ] {z}. Note that Xz is a one-dimensional manifold; one such strip is shown as the horizontal green line in Figure 3 (b). Our strategy is to lower bound the loss restricted to each of these strips, after which we can integrate over them to obtain a lower bound on the overall loss. With a slight abuse of notation, we further define i [k],i =j g(θ i x). Precisely, we provide a lower bound of the loss for those z Z where fΘ and fΘ j is linear on Xz; in particular, for such z, we have Xz V Θ j = , where V Θ j = [ {x X j | θ x = 0}. Note that V Θ j is the boundary at which one of the Re LUs, i.e., g(θ x) with θ Θ j, transitions from inactive to active. If Xz does not intersect V Θ j, then θ x = 0 on Xz for all θ Θ j and, hence, fΘ fΘ j must be linear on such Xz. In the following, we show that such z s make up a large proportion of Z; equivalently, we show that the following subset is small: Zθ, where Zθ = {z Z | x1 [ ϵ, ϵ], θ (x1 z) = 0}, (18) where x1 z := x1 z1 zd 1 . Lemma A.2. For any θ Θ j, we have The proof is given in Appendix A.4. This result is illustrated in Figure 3 (b); the set of Xz for which z Z Θ j (which has size O(ϵ/α)) has been removed from X j. Note that as α becomes larger, the size of Z Θ j becomes smaller. Stochastic Bandits with Re LU Neural Networks Step 3. Next, for z Z \ Z Θ j, we lower bound the loss on Xz. Remember the loss is |fΘ(x) fΘ (x)| = |(fΘ(x) fΘ j(x)) g(θ j x)|, where we argue in Step 2 that on such Xz the first term is linear. Therefore, we can lower bound the loss using the following lemma: Lemma A.3. For any β0, β1 R, we have Z ϵ ϵ |(β0 + β1w) g(w)|dw ϵ2 We provide the proof in Appendix A.5. Since our loss is the mean absolute error, this result follows from a geometric argument. Intuitively, as illustrated in Figure 3 (c), there is a triangular gap between fΘ fΘ j (which is linear i.e., fΘ(x) fΘ j(x) = β x for some β Rd) and g(θ j x) on Xz; this gap equals the loss, and it cannot be reduced regardless of the value of β. Step 4. Finally, the proof of Proposition 3.1 consists of integrating the lower bound from Step 3 over z Z \ Z Θ j to obtain a lower bound on LXj(Θ). Proof. First, note that Z X j |fΘ(x) fΘ (x)|dx Z ϵ |fΘ(x1 z) fΘ (x1 z)|dx1dz ϵ |fΘ(x1 z) fΘ (x1 z)|dx1dz ϵ |(fΘ(x1 z) fΘ j(x1 z)) g(θ j (x1 z))|dx1dz. (19) Remember for any given z Z \ Z Θ j, the first term fΘ(x1 z) fΘ j(x1 z) is linear in x1 z, i.e., there exists a parameter θj such that θ j (x1 z) = fΘ(x1 z) fΘ j(x1 z). Without loss of generality, we can modify the coordinate system so that θj = t1 t2 0 0 without affecting θ j in (16). Then, we have Z ϵ ϵ |(fΘ(x1 z) fΘ j(x1 z)) g(θ j (x1 z))|dx1 = Z ϵ ϵ | θ j (x1 z) g(θ j (x1 z))|dx1 ϵ |(t1x1 + t2z1) g(x1)|dx1 where the last inequality uses Lemma A.3. Given the above result, we can derive from (19) that Z X j |fΘ(x) fΘ (x)|dx |Z \ ZΘ j|ϵ2 8 = (1 ϵ2) d 2 2 |Sd 2| 4kϵ|Sd 3| This combined with Lemma A.1 gives that Z Xj |fΘ(x) fΘ (x)|dx (1 d 2ϵ2)|Sd 2| 4kϵ|Sd 3| 8 6kdϵ3|Sd 2|. The result then follows. Stochastic Bandits with Re LU Neural Networks A.3. 
Proof of Lemma A.1 First, note that Z X j |fΘ(x) fΘ (x)|dx = Z Xj |fΘ(ϕ(x)) fΘ (ϕ(x))| | det xϕ(x)|dx Xj |fΘ(ϕ(x)) fΘ (ϕ(x))| 1 ϵ2 where the second equality follows the fact that xϕ(x) is a lower triangular matrix. Then, we have Z X j |fΘ(x) fΘ (x)|dx Z Xj |fΘ(x) fΘ (x)|dx 2 |fΘ(ϕ(x)) fΘ (ϕ(x))| |fΘ(x) fΘ (x)|dx Xj |(fΘ(x) fΘ(ϕ(x))) (fΘ (x) fΘ (ϕ(x)))|dx Xj |fΘ(x) fΘ(ϕ(x))| + |fΘ (x) fΘ (ϕ(x))|dx, Xj |fΘ(x) fΘ (x)|dx Z X j |fΘ(x) fΘ (x)|dx Xj |fΘ(x) fΘ (x)| 1 ϵ2 2 |fΘ(ϕ(x)) fΘ (ϕ(x))|dx Xj |fΘ(x) fΘ (x)| (1 ϵ2) d 1 2 |fΘ(ϕ(x)) fΘ (ϕ(x))|dx Xj |fΘ(x) fΘ (x)| (1 d 2ϵ2)|fΘ(ϕ(x)) fΘ (ϕ(x))|dx Xj |fΘ(x) fΘ(ϕ(x))| + |fΘ (x) fΘ (ϕ(x))|dx Xj |fΘ(ϕ(x)) fΘ (ϕ(x))|dx, where we use |x1| ϵ and Bernoulli s inequality. Therefore, we have Xj |fΘ(x) fΘ (x)|dx Z X j |fΘ(x) fΘ (x)|dx Xj |fΘ(x) fΘ(ϕ(x))| + |fΘ (x) fΘ (ϕ(x))|dx + d Xj |fΘ(ϕ(x)) fΘ (ϕ(x))|dx. Note that both fΘ and fΘ are k-Lipschitz. In particular, for any x, x Rd, we have |fΘ(x) fΘ(x )| = i=1 (g(θ i x) g(θ i x )) i=1 |θ i (x x )| k x x 2. The same result holds for fΘ . As a result, we can derive Z Xj |fΘ(x) fΘ(ϕ(x))| + |fΘ (x) fΘ (ϕ(x))|dx 2k|Xj| max x Xj x ϕ(x) , Stochastic Bandits with Re LU Neural Networks Xj |fΘ(ϕ(x)) fΘ (ϕ(x))|dx 2k|Xj| max x Xj ϕ(x) . max x Xj x ϕ(x) 2 = max x Xj max x Xj ϕ(x) 2 1. Besides, we have dx1 2ϵ|Sd 2|, where An(r) is the area of the n-sphere with radius r. Then, the claim follows. A.4. Proof of Lemma A.2 Consider the set Zθi for some i [k] first. Without loss of generality, we can modify the coordinate system so that θi = t1 t2 0 . . . 0 without affecting θ j in (16). By assumption, we have θi 2 = p t2 1 + t2 2 = 1. In the following, we first consider t1 0. Remember α2 < θi θ j 2 2 = (1 t1)2 + t2 2 = 2(1 t1) = 2 t2 1 + t2 2 1 + t2 2/t2 1 This implies |t2| |t1| > For any z Zθi, the condition θ i (x1 z) = 0 is equivalent to t1x1 + t2z1 = 0. As a consequence, we have |z1| |t1| |x1| Therefore, we can obtain that |Zθj| Z ϵ/α where An(r) is the area of the n-sphere with radius r. The claim for θ {θi}i [k] then follows. The case t1 < 0 can be analyzed similarly using α < θi + θ j 2. In addition, the claim for θ {θ i }i [k],i =j also holds noticing 2α < α0 < θ i θ j 2 for i = j by Assumption 2.2. Stochastic Bandits with Re LU Neural Networks A.5. Proof of Lemma A.3 We first have Z ϵ ϵ |(β0 + β1w) g(w)|dw = Z 0 ϵ |β0 + β1w|dw + Z ϵ 0 |β0 + (β1 1)w|dw 0 |β0 β1w|dw + Z ϵ 0 |β0 + (β1 1)w|dw = F(β0, β1) + F(β0, 1 β1), where we define F(a, b) = R ϵ 0 |a bw|dw. Note that we must have either β1 1/2 or 1 β1 1/2. Additionally, it holds that F( |a|, b) F(|a|, b) for b > 0. Therefore, it suffices to consider the case a 0, b > 1/2 to provide a lower bound. In this case, F(a, b) takes the minimum when a bϵ. Therefore, we have F(a, b) = Z a/b 0 (a bw)dw + Z ϵ a/b (bw a)dw As a function of a, this expression is minimized when a = bϵ/2. In such case, F(a, b) F bϵ 2 , b = bϵ2 where we use b 1/2. The result then follows. B. Proof of Theorem 3.2 We give the proof of Theorem 3.2, followed by proofs of the lemmas used in this proof. First, we have the following technical lemma: Lemma B.1. We have (a) the mean squared loss LS,p is 4k-Lipschitz in Θ with respect to ℓ2,1 norm, and (b) when n log( 2 δ ), its empirical counterpart ˆLS,p is (4k + 6σ)-Lipschitz with at least a probability of 1 δ/2. We give a proof in Appendix B.1. Next, we have the following useful lemma, which says that the empirical loss converges to the true loss. Lemma B.2. 
Given the same setup as in Lemma B.3, we have sup Θ |(ˆLS(Θ; Z) σ(ξ)) LS,p(Θ)| ζ 1 δ, where σ(ξ) := 1 i [n] ξ2 i . The proof is given in Appendix B.2. This lemma follows from a standard covering number argument. Using this result, we provide a sample complexity bound for learning the true parameter given the i.i.d. training examples {(xi, yi)}i [n] from p. Now we define the mean absolute loss function and its corresponding empirical version: LA,p(Θ) = Ep [|fΘ(x) fΘ (x)|] , ˆLA(Θ; Z) = 1 i [n] |fΘ(xi) yi|. Lemma B.3. For any δ R>0, we have Pp h LA,p(ˆΘ) p Stochastic Bandits with Re LU Neural Networks where ζ is defined as 4096k2(k σ)2 dk max 1, log 1 + r n We give a proof in Appendix B.3. Lemma B.4. Consider any small η R>0 satisfying η min{ 1 11522 2πk2d3/2 , α2 0π1/2 14542k2d1/2 }. If the generalization error has LA,p(Θ) η, where p is a distribution on the (d 1)-sphere such that X |fΘ(x) fΘ (x)|dx Ep [|fΘ(x) fΘ (x)|] , then there exists a bijection σ : [k] [k] such that min{ θσ(i) θ i 2, θσ(i) + θ i 2} α := 727π 1 4 kd 1 4 η 1 2 , i [k]. We give a proof in Appendix B.4. Note that the assumptions on the distribution p can be easily satisfied, e.g., when p = Uniform(Sd 1) is a uniform distribution on the sphere. The above result says given a small generalization error for certain parameter estimate Θ, its estimation error of the ground-truth parameter Θ up to a sign flip is also small correspondingly. Finally, Theorem 3.2 follows directly from Lemmas B.3 & B.4 by taking η = 2ζ. B.1. Proof of Lemma B.1 The mean squared loss satisfies |LS,p(Θ) LS,p(Θ )| Ep[|(fΘ(x) fΘ (x))2 (fΘ (x) fΘ (x))2|] Ep[|(fΘ(x) fΘ (x) + fΘ (x) fΘ (x))(fΘ(x) fΘ (x))|] 4k Ep[|fΘ(x) fΘ (x)|] i [k] θi θ i 2, where we use x 2 = 1 and θi 2 = θ i 2 = 1. Similarly, the empirical loss satisfies |ˆLS(Θ; Z) ˆLS(Θ ; Z)| = i [n] [(fΘ(xi) yi)2 (fΘ (xi) yi)2] i [n] |(fΘ(xi) fΘ (xi))2 (fΘ (xi) fΘ (xi))2| + 2 i [n] |ξi(fΘ(xi) fΘ (xi))| i [n] |ξi|) X i [k] θi θ i 2. Since ξi is σ-subgaussian, so is |ξi|. When n log( 2 δ ), we have i [n] |ξi| 2 with at least a probability of 1 δ/2, where we use Lemma 1.4 from (Rigollet & H utter, 2015) and Lemma F.2 by taking δ )/n. The claim follows. Stochastic Bandits with Re LU Neural Networks B.2. Proof of Lemma B.2 Then, construct an ζ/(16k + 12σ)-net EΘ with respect to ℓ2,1 norm of Θ = {Θ Rk d | Θ 2,1 = k} (remember Θ is a superset of Θ s domain Θ). For any such Θ Θ, there exists Θ EΘ such that |(ˆLS(Θ; Z) LS,p(Θ)) (ˆLS(Θ ; Z) LS,p(Θ ))| (8k + 6σ) Θ Θ 2,1 ζ/2, with high probability 1 δ/2. Given the above inequality, we have sup Θ Θ |(ˆLS(Θ; Z) σ(ξ)) LS,p(Θ)| ζ Pp sup Θ Θ |(ˆLS(Θ; Z) σ(ξ)) LS,p(Θ)| ζ max Θ EΘ |(ˆLS(Θ; Z) σ(ξ)) LS,p(Θ)| ζ |(ˆLS(Θ; Z) σ(ξ)) LS,p(Θ)| ζ + δ/2. (20) (ˆLS(Θ; Z) σ(ξ)) LS,p(Θ) = L1 + 2L2, i [n] (fΘ(xi) fΘ (xi))2 Ep (fΘ(x) fΘ (x))2 , L2 = 1 i [n] (fΘ(xi) fΘ (xi))ξi. The probability within the sum in (20) is then upper bounded by |(ˆLS(Θ; Z) σ(ξ)) LS,p(Θ)| ζ In the following, we bound each of the above two terms respectively. 
For the first term, since |fΘ(xi) fΘ (xi)| 2k, (fΘ(xi) fΘ (xi))2 is 4k2-subgaussian; thus, by Lemma F.2, we have Next, for the second term, since ξi is σ-subgaussian and (fΘ(xi) fΘ (xi)) and ξi are independent, we can show that (fΘ(xi) fΘ (xi))ξi is (2kσ)-subgaussian; thus, by Lemma F.2, we have Therefore, we obtain from (20) that sup Θ Θ |(ˆLS(Θ; Z) σ(ξ)) LS,p(Θ)| ζ 4|EΘ| exp nζ2 2048k2(k2 + σ2) 2 1 + 2k(16k + 12σ) 2048k2(k2 + σ2) 2048k2(k2 + σ2) + dk log 1 + 2k(16k + 12σ) where the second inequality follows Lemma F.1 by noticing that Θ 2,1 = k for Θ Θ. Finally, we choose ζ so that (23) is smaller than δ/2 in particular, letting 4096k2(k σ)2 dk max 1, log 1 + r n then we have (23) is upper bounded by δ/2, where we use k + σ 1. The result follows. Stochastic Bandits with Re LU Neural Networks B.3. Proof of Lemma B.3 Since ˆΘ minimizes ˆLS(Θ; Z), we have 0 LS,p(ˆΘ) LS,p(Θ ) LS,p(ˆΘ) (ˆLS(ˆΘ; Z) σ(ξ)) + (ˆLS(Θ ; Z) σ(ξ)) LS,p(Θ ) 2 sup Θ |(ˆLS(Θ; Z) σ(ξ)) LS,p(Θ)| 2ζ, with probability at least 1 δ, where we use Lemma B.2. Thus, by Cauchy-Schwarz inequaltiy and the fact that LS,p(Θ ) = 0, we have as claimed. B.4. Proof of Lemma B.4 Using the condition on p, we have Xi |fΘ(x) fΘ (x)|dx Z X |fΘ(x) fΘ (x)|dx |Sd 1|LA,p(Θ) |Sd 1| η. Then, letting η = |Sd 1| η, we have that LXi(Θ) η holds for all i [k]. Thus, by Proposition 3.1 (we will check the conditions on α later), we have min{ θσ(i) θ i 2, θσ(i) + θ i 2} α for all i [k]. Plugging the value of η above into α, it yields that 2 |Sd 3| |Sd 1| ϵ2(1 dϵ2/2) 8 |Sd 2| |Sd 1| η 6kdϵ3 |Sd 2| |Sd 1| . (24) Using |Sd 1| = 2πd/2 Γ(d/2) and Lemma 10 from (Xu et al., 2021b), it holds that for any d > 1 |Sd 3| |Sd 1| d 2π , |Sd 2| |Sd 1| Combining the above, we obtain from (24) (πd)1/2(1 dϵ2/2 48kdϵ)ϵ2 16 Take ϵ = (36 d)1/2. Given our choice of η, we can show that dϵ2/2 1/4 and 48kdϵ 1/4. Thus, we have 4 kd 1 4 η 1 2 . Note that such α satisfies α α0/2 by our condition on η. The claim follows. C. Proof of Proposition 4.2 We give a proof of Proposition 4.2. Stochastic Bandits with Re LU Neural Networks Figure 4: Illustrations for proof sketch of the optimal action gap ν . C.1. Intuition We illustrate our proof strategy in Figure 4. For all the neurons θ i s, we can decompose them into three subsets i.e., the neurons that are positively activated, A = {i [k] | θ i x > 0}, those that are orthogonal to x , B = {i [k] | θ i x = 0}, and those that are inactive, C = {i [k] | θ i x < 0}. Define θ A = P i A θ i . We first show that θ A and x should be in the same direction (Figure 4 (a)). Then, given this fact, we prove by showing that there is no neuron in the set B; otherwise, we can always find an action x to improve the value of fΘ (x) (Figure 4 (b)). To see that, the maximum value of fΘ (x) is equal to fΘ (x ) = θ A x . Therefore, if θ A is not aligned with x , we can always move x a bit closer to θ A to a position x such that θ A x > θ A x without negatively affecting the other neurons in B C. Then, suppose the set B = and there s at least one j B. We can always find an increasing direction by moving x a bit closer to θ j . Intuitively, is almost orthogonal to θ A and in parallel with θ j ; thus, the increase by moving x closer to θ j exceeds the the decrease by moving it away from θ A when is small enough. C.2. Proof of Proposition 4.2 First, we show that θ A and x are in the same direction. We prove by contradiction. Suppose θ A and x are not aligned. We will show that we can find an action x such that fΘ (x ) > fΘ (x ). 
Take x to be away from x , i.e., x = x + , such that x , x and θ A are in the same hyperplane (see Figure 4 (a)). We first let be small enough such that the neurons in A remains strictly activated, i.e., θ i x > 0 for i A (it suffices to take 2 < mini A |θ i x |). Then, define the angle between two vectors z and y as (z, y) := arccos( z y z 2 y 2 ). As long as is such that (x , θ A) > (x , θ A), it s easy to show that θ A x < θ A x by noting that x 2 = x 2 = 1 and (x , θ A), (x , θ A) < π/2. Additionally, we also require to be small enough such that the neurons in C remains inactive choosing action x , i.e., θ i x < 0 for i C; note that it suffices to take 2 < mini C |θ i x |. Finally, we have P i B g(θ i x ) P i B g(θ i x ) = 0 since g(z) is non-negative. Therefore, we conclude that fΘ (x ) = θ A x + X i B g(θ i x ) > θ A x + X i B g(θ i x ) = fΘ (x ), which is a contradiction. Now, we further show that B = . Similarly, we use a proof by contradiction. Suppose there is at least some j B. We take an action x = x + , such that x , x and θ j are in the same hyperplane (see Figure 4 (b)). Note that both θ A and x are orthogonal to θ j by definition of the set B. Then, it holds that θ A x + θ j x > θ A x + θ j x , as long as is such that θ j > θ A . (25) Since x = x = 1 and x = x + , we have 2x = 2. Besides, let h denote the perpendicular distance Stochastic Bandits with Re LU Neural Networks from x to x and it satisfies 2 2 h2 = 1 h = 2 By noting that θ A/ θ A 2 = x , it suffices to have cos( (θ j , )) 2 1 2 θ A 2 2 h 2 1 2 θ A 2 2 4 1 + θ A 2 2 (26) so that (25) holds. In addition, we take to be small enough such that the neurons in A remains strictly activated and those in C remain inactive; to that end, it suffices to take 2 < mini A C |θ i x |. Thus, we reach a contradiction that fΘ (x ) > fΘ (x ). Our claim follows. D. Proof of Theorem 4.3 Suppose the exploration stage ends at time t0. By Theorem 3.2, we can ensure min{ θi θ i 2, θi + θ i 2} ν /2, i [k] with probability at least 1 δ/2 by choosing t0 large enough so that 727π 1 4 kd 1 4 (2ζ) 1 4 ν /2 (note that ζ depends on the sample size n = t0). In particular, it suffices to take δ = 1/ t0 t1(ν ), where t1(ν) := C1k10d2(k σ)2 ν8 (dk(log(d(k σ)) log log T) + log(64T)) (27) for some constant C1. Recall that Theorem 3.2 requires 2ζ min n 1 11522 2πk2d3/2 , α2 0π1/2 14542k2d1/2 o ; thus, we also require t0 t2 := C2k10d6(k σ)2 (dk(log(d(k σ)) log log T) + log(64T)) (28) for some constant C2. Therefore, we have, with a slight abuse of notation, t0 = t0(ν ) = Θ(k13d3(1/ν8 d4)), where t0(ν) = max{t1(ν), t2}. (29) Now we analyze the regret of our algorithm. At each time t, the per-period regret rt can be upper bounded by rt = fΘ (x ) fΘ (xt) ( X i [k] θ i 2)( x 2 + xt 2) 2k. Therefore, the regret during the exploration stage is upper bounded by X t [t0] rt 2kt0 = O(k14d3(1/ν8 d4)). In the second stage, we run OFUL to find the optimal policy for the linear function fθ (x ) given our estimate Θt0, where x and θ are defined in (9) and (10) respectively. 
Following the same proof strategy of Theorem 3 in (Abbasi-Yadkori et al., 2011), it gives the regret bound in the second stage to be X t [T ]\[t0] rt C3 p kd T log(λ + T/(2kd)) λ1/2 log(4T) + 2kd log(1 + T/(2λkd)) with a probability at least 1 δ/2, where the contextual dimension d here is 2kd, and S = 5k by noting that θ Finally, the above analysis shows that with a probability at least 1 δ, we both have a small estimation error in the exploration stage and a guanranteed regret upper bound of OFUL in the second stage. Thus, with a small probability δ = 1/ T, we would have linear regret scaling as 2k T; thus, the expected regret in this case is bounded by 2k Tδ = 2k T. Our claim then follows. Stochastic Bandits with Re LU Neural Networks E. Proof of Theorem 4.4 We bound the regret for the three cases respectively: (i) all batch i satisfying νi > ν , (ii) t (Ti 1, Ti 1 + t0,i] for all batch i with νi ν , and (iii) t (Ti 1 + t0,i, Ti] for all batch i with νi ν . First, in case (i), we have i log(ν0/ν )/ log(b). Recall that the per-period regret rt can be trivially bounded by 2k. Thus, the regret in this case is upper bounded by 2k(alog(ν0/ν )/ log(b) 1)T1 2k(ν0/ν ) log(a) log(b) T1. Once νi ν , the gap νi is sufficiently accurate that the optimal action x becomes feasible in the search region. Then, we can follow our proof strategy in Appendix D. Consider case (ii). Due to our exploration strategy, we randomly collect t0,i = t0(νi) t0(νi 1) samples in each batch i (t0(ν) defined in (29)); thus, the total number of random samples we collect over time horizon T is upper bounded by i= log(ν0/ν ) log(b) t0,i t0(νM 1). Thus, the regret in case (ii) is upper bounded by 2kt0(νM 1) = O k14d7 + k14d3T 8 log(b) where we use the definition of t0 in (28) and the value of M in (13). Next, we calculate the regret of running OFUL in case (iii). Same as the proof in Appendix D, we have i= log(ν0/ν ) t=Ti 1+t0,i rt = O(kd Note that the proof strategy in (Abbasi-Yadkori et al., 2011) works as long as the data are fetched sequentially, and the length of the time periods without model misspecification (i.e., batch i with νi < ν ) scales as T, which both satisfy in our algorithm. Finally, recall that at each batch i, there s a probability of δi = 1/ T that our analysis above will fail. Thus, with a probability of at most M/ T across all M batches, we will have a linear regret. The expected regret for this small-probability event is upper bounded by Combining all the above gives our final result. F. Useful Lemmas Lemma F.1. For a ball in Rd1 d2 with radius r with respect to any norm, there exists an ζ-net E such that Proof. This claim follows by a direct application of Proposition 4.2.12 in (Vershynin, 2018). Lemma F.2. Letting {xi}i [n] be a set of independent σ-subgaussian random variables with mean µi, then for all t 0, we have i [n] (xi µi)| t Proof. See Proposition 2.5 of (Wainwright, 2019).