# Neural Contextual Bandits with UCB-based Exploration

Dongruo Zhou¹, Lihong Li², Quanquan Gu¹

¹Department of Computer Science, University of California, Los Angeles, CA 90095, USA. ²Google Research, USA. Correspondence to: Quanquan Gu. Proceedings of the 37th International Conference on Machine Learning, Online, PMLR 119, 2020. Copyright 2020 by the author(s).

We study the stochastic contextual bandit problem, where the reward is generated from an unknown function with additive noise. No assumption is made about the reward function other than boundedness. We propose a new algorithm, NeuralUCB, which leverages the representation power of deep neural networks and uses a neural network-based random feature mapping to construct an upper confidence bound (UCB) of the reward for efficient exploration. We prove that, under standard assumptions, NeuralUCB achieves $\tilde{O}(\sqrt{T})$ regret, where $T$ is the number of rounds. To the best of our knowledge, it is the first neural network-based contextual bandit algorithm with a near-optimal regret guarantee. We also show that the algorithm is empirically competitive against representative baselines in a number of benchmarks.

1. Introduction

The stochastic contextual bandit problem has been extensively studied in machine learning (Langford & Zhang, 2008; Bubeck & Cesa-Bianchi, 2012; Lattimore & Szepesvári, 2019): at round $t \in \{1, 2, \ldots, T\}$, an agent is presented with a set of $K$ actions, each of which is associated with a $d$-dimensional feature vector. After choosing an action, the agent receives a stochastic reward generated from some unknown distribution conditioned on the chosen action's feature vector. The goal of the agent is to maximize the expected cumulative reward over $T$ rounds. Contextual bandit algorithms have been applied to many real-world applications, such as personalized recommendation, advertising, and Web search.

The most studied model in the literature is the linear contextual bandit (Auer, 2002; Abe et al., 2003; Dani et al., 2008; Rusmevichientong & Tsitsiklis, 2010), which assumes that the expected reward at each round is linear in the feature vector. While successful in both theory and practice (Li et al., 2010; Chu et al., 2011; Abbasi-Yadkori et al., 2011), the linear-reward assumption often fails to hold in practice, which motivates the study of nonlinear or nonparametric contextual bandits (Filippi et al., 2010; Srinivas et al., 2010; Bubeck et al., 2011; Valko et al., 2013). However, these still require fairly restrictive assumptions on the reward function. For instance, Filippi et al. (2010) make a generalized linear model assumption on the reward, Bubeck et al. (2011) require it to be Lipschitz continuous in a proper metric space, and Valko et al. (2013) assume the reward function belongs to some Reproducing Kernel Hilbert Space (RKHS).

In order to overcome the above shortcomings, deep neural networks (DNNs) (Goodfellow et al., 2016) have been introduced to learn the underlying reward function in contextual bandit problems, thanks to their strong representation power. We refer to these approaches collectively as neural contextual bandit algorithms. Since DNNs enable the agent to use nonlinear models with less domain knowledge, existing work (Riquelme et al., 2018; Zahavy & Mannor, 2019) studies neural-linear bandits.
That is, they use all but the last layer of a DNN as a feature map, which transforms contexts from the raw input space to a low-dimensional space, usually with better representation and less frequent updates, and then learn a linear exploration policy on top of the last hidden layer of the DNN with more frequent updates. These attempts have achieved great empirical success, but no regret guarantees are provided.

In this paper, we consider provably efficient neural contextual bandit algorithms. The new algorithm, NeuralUCB, uses a neural network to learn the unknown reward function and follows a UCB strategy for exploration. At the core of the algorithm is the novel use of a DNN-based random feature mapping to construct the UCB. Its regret analysis builds on recent advances in the optimization and generalization of deep neural networks (Jacot et al., 2018; Arora et al., 2019; Cao & Gu, 2019). Crucially, the analysis makes no modeling assumption about the reward function, other than that it be bounded. While the main focus of our paper is theoretical, we also show in a few benchmark problems the effectiveness of NeuralUCB, and demonstrate its benefits against several representative baselines.

Our main contributions are as follows:

- We propose a neural contextual bandit algorithm that can be regarded as an extension of existing (generalized) linear bandit algorithms (Abbasi-Yadkori et al., 2011; Filippi et al., 2010; Li et al., 2010; 2017) to the case of arbitrary bounded reward functions.
- We prove that, under standard assumptions, our algorithm achieves $\tilde{O}(\tilde{d}\sqrt{T})$ regret, where $\tilde{d}$ is the effective dimension of a neural tangent kernel matrix and $T$ is the number of rounds. The bound recovers the existing $\tilde{O}(d\sqrt{T})$ regret for linear contextual bandits as a special case (Abbasi-Yadkori et al., 2011), where $d$ is the dimension of the context.
- We demonstrate empirically the effectiveness of the algorithm in both synthetic and benchmark problems.

Notation: Scalars are denoted by lower case letters, vectors by lower case bold face letters, and matrices by upper case bold face letters. For a positive integer $k$, $[k]$ denotes $\{1, \ldots, k\}$. For a vector $\theta \in \mathbb{R}^d$, we denote its $\ell_2$ norm by $\|\theta\|_2 = \sqrt{\sum_{i=1}^d \theta_i^2}$ and its $j$-th coordinate by $[\theta]_j$. For a matrix $A \in \mathbb{R}^{d \times d}$, we denote its spectral norm, Frobenius norm, and $(i,j)$-th entry by $\|A\|_2$, $\|A\|_F$, and $[A]_{i,j}$, respectively. We denote a sequence of vectors by $\{\theta_j\}_{j=1}^t$, and similarly for matrices. For two sequences $\{a_n\}$ and $\{b_n\}$, we write $a_n = O(b_n)$ if there exists a constant $C > 0$ such that $a_n \le C b_n$; similarly, $a_n = \Omega(b_n)$ means there exists a constant $C > 0$ such that $a_n \ge C b_n$. In addition, we use $\tilde{O}(\cdot)$ to hide logarithmic factors. We say a random variable $X$ is $\nu$-sub-Gaussian if $\mathbb{E}\exp(\lambda(X - \mathbb{E}X)) \le \exp(\lambda^2\nu^2/2)$ for any $\lambda > 0$.

2. Problem Setting

We consider the stochastic $K$-armed contextual bandit problem, where the total number of rounds $T$ is known. At round $t \in [T]$, the agent observes the context consisting of $K$ feature vectors $\{x_{t,a} \in \mathbb{R}^d \mid a \in [K]\}$, selects an action $a_t$, and receives a reward $r_{t,a_t}$. For brevity, we denote by $\{x_i\}_{i=1}^{TK}$ the collection $\{x_{1,1}, x_{1,2}, \ldots, x_{T,K}\}$. Our goal is to minimize the following pseudo regret (or regret for short):

$$R_T = \mathbb{E}\bigg[\sum_{t=1}^{T} \big(r_{t,a_t^*} - r_{t,a_t}\big)\bigg], \qquad (2.1)$$

where $a_t^* = \mathrm{argmax}_{a \in [K]} \mathbb{E}[r_{t,a}]$ is the optimal action at round $t$, i.e., the action that maximizes the expected reward.
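For concreteness, the following minimal sketch simulates this protocol and accumulates the pseudo regret in (2.1). The dimensions, the logistic reward function `h`, the noise level, and the random policy are illustrative placeholders rather than anything prescribed above.

```python
import numpy as np

rng = np.random.default_rng(0)
d, K, T, nu = 20, 4, 1000, 0.1
a_star_vec = rng.normal(size=d)
h = lambda x: 1.0 / (1.0 + np.exp(-x @ a_star_vec))   # unknown bounded reward in [0, 1] (placeholder)

def run(policy):
    """Run one bandit trajectory and return the pseudo regret R_T."""
    regret = 0.0
    for t in range(T):
        contexts = rng.normal(size=(K, d))
        contexts /= np.linalg.norm(contexts, axis=1, keepdims=True)  # unit-norm contexts
        a_t = policy(contexts)                                       # agent picks an arm
        reward = h(contexts[a_t]) + nu * rng.normal()                # noisy feedback a learner would use
        regret += h(contexts).max() - h(contexts[a_t])               # gap in expected reward
    return regret

print(run(lambda ctx: rng.integers(len(ctx))))  # uniformly random policy as a trivial baseline
```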
This work makes the following assumption about reward generation: for any round $t$,

$$r_{t,a_t} = h(x_{t,a_t}) + \xi_t, \qquad (2.2)$$

where $h$ is an unknown function satisfying $0 \le h(x) \le 1$ for any $x$, and $\xi_t$ is $\nu$-sub-Gaussian noise conditioned on $x_{1,a_1}, \ldots, x_{t-1,a_{t-1}}$ satisfying $\mathbb{E}[\xi_t] = 0$. The $\nu$-sub-Gaussian assumption on $\xi_t$ is standard in the stochastic bandit literature (e.g., Abbasi-Yadkori et al., 2011; Li et al., 2017) and is satisfied by, for example, any bounded noise. The boundedness assumption on $h$ holds when $h$ belongs to linear functions, generalized linear functions, Gaussian processes, or kernel functions with bounded RKHS norm over a bounded domain, among others.

In order to learn the reward function $h$ in (2.2), we propose to use a fully connected neural network with depth $L \ge 2$:

$$f(x; \theta) = \sqrt{m}\, W_L\, \sigma\big(W_{L-1}\, \sigma\big(\cdots \sigma(W_1 x)\big)\big), \qquad (2.3)$$

where $\sigma(x) = \max\{x, 0\}$ is the rectified linear unit (ReLU) activation function, $W_1 \in \mathbb{R}^{m \times d}$, $W_i \in \mathbb{R}^{m \times m}$ for $2 \le i \le L-1$, $W_L \in \mathbb{R}^{1 \times m}$, and $\theta = [\mathrm{vec}(W_1)^\top, \ldots, \mathrm{vec}(W_L)^\top]^\top \in \mathbb{R}^p$ with $p = m + md + m^2(L-2)$. Without loss of generality, we assume that the width of each hidden layer is the same (i.e., $m$) for convenience in the analysis. We denote the gradient of the neural network function by $g(x; \theta) = \nabla_\theta f(x; \theta) \in \mathbb{R}^p$.

3. The Neural UCB Algorithm

The key idea of NeuralUCB (Algorithm 1) is to use a neural network $f(x; \theta)$ to predict the reward of a context $x$, and to use upper confidence bounds computed from the network to guide exploration (Auer, 2002).

Initialization: The network is initialized by randomly generating each entry of $\theta_0$ from an appropriate Gaussian distribution: for $1 \le l \le L-1$, $W_l$ is set to $\begin{pmatrix} W & 0 \\ 0 & W \end{pmatrix}$, where each entry of $W$ is generated independently from $N(0, 4/m)$; $W_L$ is set to $(w^\top, -w^\top)$, where each entry of $w$ is generated independently from $N(0, 2/m)$.

Learning: At round $t$, Algorithm 1 observes the contexts of all actions, $\{x_{t,a}\}_{a=1}^K$. First, it computes an upper confidence bound $U_{t,a}$ for each action $a$, based on $x_{t,a}$, $\theta_{t-1}$ (the current neural network parameter), and a positive scaling factor $\gamma_{t-1}$. It then chooses the action $a_t$ with the largest $U_{t,a}$ and receives the corresponding reward $r_{t,a_t}$. At the end of round $t$, NeuralUCB updates $\theta_t$ by applying Algorithm 2 to (approximately) minimize $\mathcal{L}(\theta)$ using gradient descent, and updates $\gamma_t$. We choose gradient descent in Algorithm 2 for simplicity of the analysis; the training method can be replaced by stochastic gradient descent at the cost of a more involved analysis (Allen-Zhu et al., 2019; Zou et al., 2019).
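As an illustration of this selection step, the sketch below instantiates (2.3) with $L = 2$ (the same two-layer architecture used in the experiments of Section 7) and computes the network output, its parameter gradient, and the resulting UCB index; the exact index appears as line 7 of Algorithm 1 below. All function and variable names are illustrative.

```python
import numpy as np

def forward(x, W1, w2, m):
    """Two-layer network f(x; theta) = sqrt(m) * w2^T sigma(W1 x), a special case of (2.3)."""
    hidden = np.maximum(W1 @ x, 0.0)                       # ReLU activation
    return np.sqrt(m) * w2 @ hidden

def gradient(x, W1, w2, m):
    """g(x; theta): gradient of f with respect to theta = (vec(W1), w2)."""
    pre = W1 @ x
    hidden = np.maximum(pre, 0.0)
    grad_W1 = np.sqrt(m) * np.outer(w2 * (pre > 0), x)     # d f / d W1
    grad_w2 = np.sqrt(m) * hidden                          # d f / d w2
    return np.concatenate([grad_W1.ravel(), grad_w2])

def ucb_index(x, W1, w2, Z_inv, gamma, m):
    """U_{t,a} = f(x; theta) + gamma * sqrt(g^T Z^{-1} g / m), as in line 7 of Algorithm 1 below."""
    g = gradient(x, W1, w2, m)
    return forward(x, W1, w2, m) + gamma * np.sqrt(g @ Z_inv @ g / m)
```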
Algorithm 1: NeuralUCB
1: Input: number of rounds $T$, regularization parameter $\lambda$, exploration parameter $\nu$, confidence parameter $\delta$, norm parameter $S$, step size $\eta$, number of gradient descent steps $J$, network width $m$, network depth $L$.
2: Initialization: randomly initialize $\theta_0$ as described in the text.
3: Set $Z_0 = \lambda I$.
4: for $t = 1, \ldots, T$ do
5: Observe $\{x_{t,a}\}_{a=1}^K$.
6: for $a = 1, \ldots, K$ do
7: Compute $U_{t,a} = f(x_{t,a}; \theta_{t-1}) + \gamma_{t-1}\sqrt{g(x_{t,a}; \theta_{t-1})^\top Z_{t-1}^{-1} g(x_{t,a}; \theta_{t-1})/m}$.
8: Let $a_t = \mathrm{argmax}_{a \in [K]} U_{t,a}$.
9: end for
10: Play $a_t$ and observe the reward $r_{t,a_t}$.
11: Compute $Z_t = Z_{t-1} + g(x_{t,a_t}; \theta_{t-1})\, g(x_{t,a_t}; \theta_{t-1})^\top / m$.
12: Let $\theta_t = \mathrm{TrainNN}(\lambda, \eta, J, m, \{x_{i,a_i}\}_{i=1}^t, \{r_{i,a_i}\}_{i=1}^t, \theta_0)$.
13: Compute
$$\gamma_t = \sqrt{1 + C_1 m^{-1/6}\sqrt{\log m}\, L^4 t^{7/6} \lambda^{-7/6}}\,\Big(\nu\sqrt{\log\tfrac{\det Z_t}{\det(\lambda I)} + C_2 m^{-1/6}\sqrt{\log m}\, L^4 t^{5/3}\lambda^{-1/6} - 2\log\delta} + \sqrt{\lambda}\,S\Big) + (\lambda + C_3 t L)\Big[(1 - \eta m \lambda)^{J/2}\sqrt{t/\lambda} + m^{-1/6}\sqrt{\log m}\, L^{7/2} t^{5/3}\lambda^{-5/3}\big(1 + \sqrt{t/\lambda}\big)\Big].$$
14: end for

Algorithm 2: TrainNN$(\lambda, \eta, J, m, \{x_{i,a_i}\}_{i=1}^t, \{r_{i,a_i}\}_{i=1}^t, \theta^{(0)})$
1: Input: regularization parameter $\lambda$, step size $\eta$, number of gradient descent steps $J$, network width $m$, contexts $\{x_{i,a_i}\}_{i=1}^t$, rewards $\{r_{i,a_i}\}_{i=1}^t$, initial parameter $\theta^{(0)}$.
2: Define $\mathcal{L}(\theta) = \sum_{i=1}^t \big(f(x_{i,a_i}; \theta) - r_{i,a_i}\big)^2/2 + m\lambda \|\theta - \theta^{(0)}\|_2^2/2$.
3: for $j = 0, \ldots, J-1$ do
4: $\theta^{(j+1)} = \theta^{(j)} - \eta \nabla\mathcal{L}(\theta^{(j)})$.
5: end for
6: Return $\theta^{(J)}$.

Comparison with Existing Algorithms: We compare NeuralUCB with other neural contextual bandit algorithms. Allesiardo et al. (2014) proposed NeuralBandit, which consists of $K$ neural networks: it uses a committee of networks to compute the score of each action and chooses an action with the $\epsilon$-greedy strategy. In contrast, NeuralUCB uses upper confidence bound-based exploration, which is more effective than $\epsilon$-greedy. In addition, our algorithm uses only one neural network instead of $K$ networks, and can thus be computationally more efficient. Lipton et al. (2018) used Thompson sampling on deep neural networks (through variational inference) in reinforcement learning; a variant proposed by Azizzadenesheli et al. (2018) works well on a set of Atari benchmarks. Riquelme et al. (2018) proposed NeuralLinear, which uses the first $L-1$ layers of an $L$-layer DNN to learn a representation and then applies Thompson sampling on the last layer to choose actions. Zahavy & Mannor (2019) proposed NeuralLinear with limited memory (NeuralLinearLM), which also uses the first $L-1$ layers of an $L$-layer DNN to learn a representation and applies Thompson sampling on the last layer; instead of computing the exact mean and variance in Thompson sampling, NeuralLinearLM only computes their approximations. Unlike NeuralLinear and NeuralLinearLM, NeuralUCB uses the entire DNN to learn the representation and constructs the upper confidence bound based on the random feature mapping defined by the neural network gradient. Finally, Kveton et al. (2020) studied the use of reward perturbation for exploration in neural network-based bandit algorithms.

A variant of NeuralUCB, called NeuralUCB0, is described in Appendix E. It can be viewed as a simplified version of NeuralUCB in which only the first-order Taylor approximation of the neural network around the initialized parameter is updated, through online ridge regression. In this sense, NeuralUCB0 can be seen as KernelUCB (Valko et al., 2013) specialized to the Neural Tangent Kernel (Jacot et al., 2018), or LinUCB (Li et al., 2010) with Neural Tangent Random Features (Cao & Gu, 2019). While this variant has a regret bound comparable to that of NeuralUCB, we expect the latter to be stronger in practice. Indeed, as shown by Allen-Zhu & Li (2019), the Neural Tangent Kernel does not seem to fully realize the representation power of neural networks in supervised learning. A similar phenomenon will be demonstrated for contextual bandit learning in Section 7.
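Before turning to the analysis, note that Algorithm 2 is a plain gradient-descent loop on a regularized squared loss. The sketch below makes the update explicit for the two-layer case, reusing the hypothetical `forward` and `gradient` helpers from the earlier sketch; it is a minimal illustration, not the implementation used in the experiments.

```python
def train_nn(contexts, rewards, W1_0, w2_0, lam, eta, J, m):
    """Sketch of Algorithm 2: gradient descent on
    L(theta) = sum_i (f(x_i; theta) - r_i)^2 / 2 + m * lam * ||theta - theta_0||_2^2 / 2."""
    W1, w2 = W1_0.copy(), w2_0.copy()
    p1 = W1.size
    theta0 = np.concatenate([W1_0.ravel(), w2_0])
    for _ in range(J):
        theta = np.concatenate([W1.ravel(), w2])
        grad = m * lam * (theta - theta0)                      # gradient of the regularizer
        for x, r in zip(contexts, rewards):
            residual = forward(x, W1, w2, m) - r
            grad += residual * gradient(x, W1, w2, m)          # gradient of the squared loss
        W1 -= eta * grad[:p1].reshape(W1.shape)                # one gradient-descent step
        w2 -= eta * grad[p1:]
    return W1, w2
```

In the full loop of Algorithm 1, the returned parameters are then used both for prediction and, through their gradient, for the $Z_t$ update on line 11.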
4. Regret Analysis

This section analyzes the regret of NeuralUCB. Recall that $\{x_i\}_{i=1}^{TK}$ is the collection of all $\{x_{t,a}\}$. Our regret analysis is built upon the recently proposed neural tangent kernel matrix (Jacot et al., 2018):

Definition 4.1 (Jacot et al. (2018); Cao & Gu (2019)). Let $\{x_i\}_{i=1}^{TK}$ be a set of contexts. Define

$$\tilde{H}^{(1)}_{i,j} = \Sigma^{(1)}_{i,j} = \langle x_i, x_j\rangle, \qquad A^{(l)}_{i,j} = \begin{pmatrix} \Sigma^{(l)}_{i,i} & \Sigma^{(l)}_{i,j} \\ \Sigma^{(l)}_{i,j} & \Sigma^{(l)}_{j,j} \end{pmatrix},$$
$$\Sigma^{(l+1)}_{i,j} = 2\,\mathbb{E}_{(u,v) \sim N(0, A^{(l)}_{i,j})}\big[\sigma(u)\sigma(v)\big], \qquad \tilde{H}^{(l+1)}_{i,j} = 2\,\tilde{H}^{(l)}_{i,j}\,\mathbb{E}_{(u,v) \sim N(0, A^{(l)}_{i,j})}\big[\sigma'(u)\sigma'(v)\big] + \Sigma^{(l+1)}_{i,j}.$$

Then $H = (\tilde{H}^{(L)} + \Sigma^{(L)})/2$ is called the neural tangent kernel (NTK) matrix on the context set.

In the above definition, the Gram matrix $H$ of the NTK on the contexts $\{x_i\}_{i=1}^{TK}$ for $L$-layer neural networks is defined recursively, from the input layer all the way to the output layer of the network. Interested readers are referred to Jacot et al. (2018) for more details about neural tangent kernels. With Definition 4.1, we can state the following assumption on the contexts $\{x_i\}_{i=1}^{TK}$.

Assumption 4.2. $H \succeq \lambda_0 I$. Moreover, for any $1 \le i \le TK$, $\|x_i\|_2 = 1$ and $[x_i]_j = [x_i]_{j+d/2}$.

The first part of the assumption says that the neural tangent kernel matrix is non-singular, a mild assumption commonly made in the related literature (Du et al., 2019a; Arora et al., 2019; Cao & Gu, 2019). It is satisfied as long as no two contexts in $\{x_i\}_{i=1}^{TK}$ are parallel. The second part is also mild and is only for convenience in the analysis: for any context $x$ with $\|x\|_2 = 1$, we can always construct a new context $x' = [x^\top, x^\top]^\top/\sqrt{2}$ to satisfy Assumption 4.2. It can be verified that if $\theta_0$ is initialized as in NeuralUCB, then $f(x_i; \theta_0) = 0$ for any $i \in [TK]$.

Next we define the effective dimension of the neural tangent kernel matrix.

Definition 4.3. The effective dimension $\tilde{d}$ of the neural tangent kernel matrix on the contexts $\{x_i\}_{i=1}^{TK}$ is defined as

$$\tilde{d} = \frac{\log\det(I + H/\lambda)}{\log(1 + TK/\lambda)}. \qquad (4.1)$$

Remark 4.4. The notion of effective dimension was first introduced by Valko et al. (2013) for analyzing kernel contextual bandits, where it was defined via the eigenvalues of the kernel matrix restricted to the given contexts. We adopt a similar but different definition from Yang & Wang (2019), which was used for the analysis of kernel-based Q-learning. Suppose the dimension of the reproducing kernel Hilbert space induced by the given kernel is $\hat{d}$ and the feature mapping $\psi: \mathbb{R}^d \to \mathbb{R}^{\hat{d}}$ induced by the kernel satisfies $\|\psi(x)\|_2 \le 1$ for any $x \in \mathbb{R}^d$. Then it can be verified that $\tilde{d} \le \hat{d}$, as shown in Appendix A.1. Intuitively, $\tilde{d}$ measures how quickly the eigenvalues of $H$ diminish, and it depends on $T$ only logarithmically in several special cases (Valko et al., 2013).

Now we are ready to present the main result, which provides the regret bound $R_T$ of Algorithm 1.

Theorem 4.5. Let $\tilde{d}$ be the effective dimension and $\mathbf{h} = [h(x_i)]_{i=1}^{TK} \in \mathbb{R}^{TK}$. There exist constants $C_1, C_2 > 0$ such that for any $\delta \in (0,1)$, if

$$m \ge \mathrm{poly}\big(T, L, K, \lambda^{-1}, \lambda_0^{-1}, S^{-1}, \log(1/\delta)\big), \qquad (4.2)$$

$\eta = C_1(mTL + m\lambda)^{-1}$, $\lambda \ge \max\{1, S^{-2}\}$, and $S \ge \sqrt{2\mathbf{h}^\top H^{-1}\mathbf{h}}$, then with probability at least $1 - \delta$, the regret of Algorithm 1 satisfies

$$R_T \le 2\sqrt{T}\,\sqrt{\tilde{d}\log(1 + TK/\lambda) + 2}\,\Big[\nu\sqrt{\tilde{d}\log(1 + TK/\lambda) + 2 - 2\log\delta} + (\lambda + C_2 TL)\big(1 - \lambda/(TL)\big)^{J/2}\sqrt{T/\lambda} + \sqrt{\lambda}\,S\Big] + 1. \qquad (4.3)$$

Remark 4.6. It is worth noting that simply applying results for linear bandits to our algorithm would lead to a linear dependence on $p$ or $\sqrt{p}$ in the regret. Such a bound is vacuous, since in our setting $p$ is very large compared with the number of rounds $T$ and the input context dimension $d$. In contrast, our regret bound depends only on $\tilde{d}$, which can be much smaller than $p$.
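To make Definitions 4.1 and 4.3 concrete, the sketch below computes $H$ and $\tilde{d}$ for a small context set. It evaluates the Gaussian expectations in Definition 4.1 with the standard closed-form arc-cosine identities for the ReLU activation, an ingredient that is assumed here rather than stated in the text; the function names and the choice of $\lambda$ are likewise illustrative.

```python
import numpy as np

def ntk_gram(X, L, lam=1.0):
    """NTK matrix H (Definition 4.1) and effective dimension d~ (4.1) for contexts X (n x d)."""
    n = X.shape[0]
    Sigma = X @ X.T                      # Sigma^{(1)}_{ij} = <x_i, x_j>
    H = Sigma.copy()                     # H~^{(1)}
    for _ in range(L - 1):
        diag = np.sqrt(np.outer(np.diag(Sigma), np.diag(Sigma)))
        rho = np.clip(Sigma / diag, -1.0, 1.0)
        # Closed-form E[sigma(u)sigma(v)] and E[sigma'(u)sigma'(v)] under N(0, A^{(l)}_{ij}) for ReLU
        Sigma_next = 2 * diag / (2 * np.pi) * (np.sqrt(1 - rho ** 2) + (np.pi - np.arccos(rho)) * rho)
        H = 2 * H * (np.pi - np.arccos(rho)) / (2 * np.pi) + Sigma_next
        Sigma = Sigma_next
    H = (H + Sigma) / 2                  # H = (H~^{(L)} + Sigma^{(L)}) / 2
    d_eff = np.linalg.slogdet(np.eye(n) + H / lam)[1] / np.log(1 + n / lam)
    return H, d_eff

X = np.random.default_rng(0).normal(size=(8, 4))
X /= np.linalg.norm(X, axis=1, keepdims=True)    # unit-norm contexts, as in Assumption 4.2
H, d_eff = ntk_gram(X, L=3)
print(d_eff)
```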
Remark 4.7. Our regret bound (4.3) has a term $(\lambda + C_2 TL)(1 - \lambda/(TL))^{J/2}\sqrt{T/\lambda}$, which characterizes the optimization error of Algorithm 2 after $J$ iterations. Setting

$$J = 2\log\frac{\sqrt{T}(\lambda + C_2 TL)}{\lambda S}\cdot\frac{TL}{\lambda} = \tilde{O}(TL/\lambda), \qquad (4.4)$$

which is independent of $m$, we have $(\lambda + C_2 TL)(1 - \lambda/(TL))^{J/2}\sqrt{T/\lambda} \le \sqrt{\lambda}\,S$, so the optimization error is dominated by $\sqrt{\lambda}\,S$. Hence, the order of the regret bound is not affected by the optimization error.

Remark 4.8. With $\nu$ and $\lambda$ treated as constants, $S = \sqrt{2\mathbf{h}^\top H^{-1}\mathbf{h}}$, and $J$ chosen as in (4.4), the regret bound (4.3) becomes $R_T = \tilde{O}\big(\sqrt{T}\max\{\tilde{d}, \mathbf{h}^\top H^{-1}\mathbf{h}\}\big)$. Specifically, if $h$ belongs to the RKHS $\mathcal{H}$ induced by the neural tangent kernel with bounded RKHS norm $\|h\|_{\mathcal{H}}$, we have $\|h\|_{\mathcal{H}} \ge \sqrt{\mathbf{h}^\top H^{-1}\mathbf{h}}$; see Appendix A.2 for more details. Thus our regret bound can be further written as

$$R_T = \tilde{O}\big(\sqrt{T}\max\{\tilde{d}, \|h\|_{\mathcal{H}}^2\}\big). \qquad (4.5)$$

The high-probability result in Theorem 4.5 can be used to obtain a bound on the expected regret.

Corollary 4.9. Under the same conditions as in Theorem 4.5, there exists a positive constant $C$ such that

$$\mathbb{E}[R_T] \le 2\sqrt{T}\,\sqrt{\tilde{d}\log(1 + TK/\lambda) + 2}\,\Big[\nu\sqrt{\tilde{d}\log(1 + TK/\lambda) + 2 + 2\log T} + (\lambda + CTL)\big(1 - \lambda/(TL)\big)^{J/2}\sqrt{T/\lambda} + \sqrt{\lambda}\,S\Big] + 2.$$

5. Proof of Main Result

This section outlines the proof of Theorem 4.5, which has to deal with the following technical challenges:

- We do not make parametric assumptions on the reward function, unlike some previous work (Filippi et al., 2010; Chu et al., 2011; Abbasi-Yadkori et al., 2011). To avoid strong parametric assumptions, we use overparameterized neural networks, which implies that $m$ (and thus $p$) is very large. Therefore, we need to make sure the regret bound is independent of $m$.
- Unlike the static feature mapping used in kernel bandit algorithms (Valko et al., 2013), NeuralUCB uses a neural network $f(x; \theta_t)$ and its gradient $g(x; \theta_t)$ as a dynamic feature mapping that depends on $\theta_t$. This difference makes the analysis of NeuralUCB more difficult.

These challenges are addressed by the following technical lemmas, whose proofs are gathered in the appendix.

Lemma 5.1. There exists a positive constant $C$ such that for any $\delta \in (0,1)$, if $m \ge C T^4 K^4 L^6 \log(T^2K^2L/\delta)/\lambda_0^4$, then with probability at least $1-\delta$ there exists a $\theta^* \in \mathbb{R}^p$ such that

$$h(x_i) = \langle g(x_i; \theta_0), \theta^* - \theta_0\rangle, \qquad \sqrt{m}\,\|\theta^* - \theta_0\|_2 \le \sqrt{2\mathbf{h}^\top H^{-1}\mathbf{h}}, \qquad (5.1)$$

for all $i \in [TK]$.

Lemma 5.1 suggests that, with high probability, the reward function restricted to $\{x_i\}_{i=1}^{TK}$ can be regarded as a linear function of $g(x_i; \theta_0)$ parameterized by $\theta^* - \theta_0$, where $\theta^*$ lies in a ball centered at $\theta_0$. Note that $\theta^*$ is not a ground-truth parameter for the reward function; it is introduced only for the sake of analysis. Equipped with Lemma 5.1, we can utilize existing results on linear bandits (Abbasi-Yadkori et al., 2011) to show that, with high probability, $\theta^*$ lies in the sequence of confidence sets.

Lemma 5.2. There exist positive constants $C_1$ and $C_2$ such that for any $\delta \in (0,1)$, if $\eta \le C_1(TmL + m\lambda)^{-1}$ and $m \ge C_2\max\{T^7\lambda^{-7}L^{21}(\log m)^3, \lambda^{-1/2}L^{-3/2}(\log(TKL^2/\delta))^{3/2}\}$, then with probability at least $1-\delta$ we have $\|\theta_t - \theta_0\|_2 \le 2\sqrt{t/(m\lambda)}$ and $\|\theta^* - \theta_t\|_{Z_t} \le \gamma_t/\sqrt{m}$ for all $t \in [T]$, where $\gamma_t$ is defined in Algorithm 1.

Lemma 5.3. Let $a_t^* = \mathrm{argmax}_{a \in [K]} h(x_{t,a})$. There exists a positive constant $C$ such that for any $\delta \in (0,1)$, if $\eta$ and $m$ satisfy the same conditions as in Lemma 5.2, then with probability at least $1-\delta$ we have

$$h(x_{t,a_t^*}) - h(x_{t,a_t}) \le 2\gamma_{t-1}\min\big\{\big\|g(x_{t,a_t}; \theta_{t-1})/\sqrt{m}\big\|_{Z_{t-1}^{-1}}, 1\big\} + C\Big(S m^{-1/6}\sqrt{\log m}\, T^{7/6}\lambda^{-1/6}L^{7/2} + m^{-1/6}\sqrt{\log m}\, T^{5/3}\lambda^{-2/3}L^3\Big).$$

Lemma 5.3 gives an upper bound on $h(x_{t,a_t^*}) - h(x_{t,a_t})$, which can be used to bound the regret $R_T$. It is worth noting that $\gamma_t$ contains a $\log\det Z_t$ term.
A trivial upper bound on $\log\det Z_t$ would result in a quadratic dependence on the network width $m$, since the dimension of $Z_t$ is $p = md + m^2(L-2) + m$. Instead, we use the next lemma to establish an $m$-independent upper bound. The dependence on $\tilde{d}$ is similar to Valko et al. (2013, Lemma 4), but the proof is different because our notion of effective dimension is different.

Lemma 5.4. There exist positive constants $\{C_i\}_{i=1}^3$ such that for any $\delta \in (0,1)$, if $m \ge C_1\max\{T^7\lambda^{-7}L^{21}(\log m)^3, T^6K^6L^6(\log(TKL^2/\delta))^{3/2}\}$ and $\eta \le C_2(TmL + m\lambda)^{-1}$, then with probability at least $1-\delta$ we have

$$\sum_{t=1}^T \gamma_{t-1}^2 \min\big\{\big\|g(x_{t,a_t}; \theta_{t-1})/\sqrt{m}\big\|_{Z_{t-1}^{-1}}^2, 1\big\} \le \big(\tilde{d}\log(1 + TK/\lambda) + \Gamma_1\big)\Big[\Gamma_2\Big(\nu\sqrt{\tilde{d}\log(1 + TK/\lambda) + \Gamma_1 - 2\log\delta} + \sqrt{\lambda}\,S\Big) + (\lambda + C_3 TL)\Big((1 - \eta m\lambda)^{J/2}\sqrt{T/\lambda} + \Gamma_3\big(1 + \sqrt{T/\lambda}\big)\Big)\Big]^2,$$

where
$$\Gamma_1 = 1 + C_3 m^{-1/6}\sqrt{\log m}\, L^4 T^{5/3}\lambda^{-1/6}, \qquad \Gamma_2 = 1 + C_3 m^{-1/6}\sqrt{\log m}\, L^4 T^{7/6}\lambda^{-7/6}, \qquad \Gamma_3 = m^{-1/6}\sqrt{\log m}\, L^{7/2} T^{5/3}\lambda^{-5/3}.$$

We are now ready to prove the main result.

Proof of Theorem 4.5. Lemma 5.3 implies that the total regret $R_T$ can be bounded as follows, with positive constants $C_1, C_2, C_3$:

$$R_T \le \sum_{t=1}^T\big(h(x_{t,a_t^*}) - h(x_{t,a_t})\big) \le 2\sum_{t=1}^T \gamma_{t-1}\min\big\{\big\|g(x_{t,a_t}; \theta_{t-1})/\sqrt{m}\big\|_{Z_{t-1}^{-1}}, 1\big\} + C_1\Big(S m^{-1/6}\sqrt{\log m}\, T^{13/6}\lambda^{-1/6}L^{7/2} + m^{-1/6}\sqrt{\log m}\, T^{8/3}\lambda^{-2/3}L^3\Big).$$

By the Cauchy–Schwarz inequality, the first term is at most $2\sqrt{T}\big(\sum_{t=1}^T \gamma_{t-1}^2\min\{\|g(x_{t,a_t}; \theta_{t-1})/\sqrt{m}\|_{Z_{t-1}^{-1}}^2, 1\}\big)^{1/2}$, which is bounded by Lemma 5.4. For $m$ sufficiently large, the $\Gamma$ terms and the remaining $m^{-1/6}$ terms become negligible, which yields

$$R_T \le 2\sqrt{T}\,\sqrt{\tilde{d}\log(1 + TK/\lambda) + 2}\,\Big[\nu\sqrt{\tilde{d}\log(1 + TK/\lambda) + 2 - 2\log\delta} + (\lambda + C_3 TL)(1 - \eta m\lambda)^{J/2}\sqrt{T/\lambda} + \sqrt{\lambda}\,S\Big] + 1,$$

where the first inequality is due to Lemma 5.3, the second to the Cauchy–Schwarz inequality and Lemma 5.4, and the last holds for sufficiently large $m$. This completes the proof.

6. Related Work

Contextual Bandits. There is an extensive line of work on linear bandits (e.g., Auer, 2002; Abe et al., 2003; Dani et al., 2008; Rusmevichientong & Tsitsiklis, 2010; Li et al., 2010; Chu et al., 2011; Abbasi-Yadkori et al., 2011). Many of these algorithms are based on the idea of upper confidence bounds and are shown to achieve near-optimal regret bounds. Our algorithm is also based on UCB exploration, and the regret bound reduces to that of Abbasi-Yadkori et al. (2011) in the linear case. To deal with nonlinearity, a few authors have considered generalized linear bandits (Filippi et al., 2010; Li et al., 2017; Jun et al., 2017), where the reward function is the composition of a linear function and a (nonlinear) link function. Such models are special cases of what we study in this work. More general nonlinear bandits that avoid strong modeling assumptions have also been considered. One line of work is the family of expert learning algorithms (Auer et al., 2002; Beygelzimer et al., 2011), which typically have a time complexity linear in the number of experts (which in many cases can be exponential in the number of parameters). A second approach reduces a bandit problem to supervised learning, such as the epoch-greedy algorithm (Langford & Zhang, 2008), which has a non-optimal $O(T^{2/3})$ regret. Later, Agarwal et al. (2014) developed an algorithm that enjoys a near-optimal regret but relies on an oracle whose implementation still requires proper modeling assumptions. A third approach uses nonparametric modeling, such as perceptrons (Kakade et al., 2008), random forests (Féraud et al., 2016), and Gaussian processes and kernels (Kleinberg et al., 2008; Srinivas et al., 2010; Krause & Ong, 2011; Bubeck et al., 2011). The most relevant work is that of Valko et al.
(2013), who assumed that the reward function lies in an RKHS with bounded RKHS norm and developed a UCB-based algorithm. They also proved an $\tilde{O}(\sqrt{\tilde{d}T})$ regret bound, where $\tilde{d}$ is a form of effective dimension similar to ours. Compared with these interesting works, our neural network-based algorithm avoids the need to carefully choose a good kernel or metric, and can be computationally more efficient in large-scale problems. Recently, Foster & Rakhlin (2020) proposed contextual bandit algorithms with regression oracles that achieve a dimension-independent $O(T^{3/4})$ regret. Compared with Foster & Rakhlin (2020), NeuralUCB achieves a dimension-dependent $\tilde{O}(\tilde{d}\sqrt{T})$ regret with a better dependence on the time horizon.

Neural Networks. Substantial progress has been made in understanding the expressive power of DNNs, in connection with the network depth (Telgarsky, 2015; 2016; Liang & Srikant, 2016; Yarotsky, 2017; 2018; Hanin, 2017) as well as the network width (Lu et al., 2017; Hanin & Sellke, 2017). The present paper on neural contextual bandit algorithms is inspired by these theoretical justifications and by empirical evidence in the literature. Our regret analysis for NeuralUCB makes use of recent advances in optimizing DNNs. A series of works show that (stochastic) gradient descent can find global minima of the training loss (Li & Liang, 2018; Du et al., 2019b; Allen-Zhu et al., 2019; Du et al., 2019a; Zou et al., 2019; Zou & Gu, 2019). For the generalization of DNNs, a number of authors (Daniely, 2017; Cao & Gu, 2019; 2020; Arora et al., 2019; Chen et al., 2019) show that, by using (stochastic) gradient descent, the parameters of a DNN are located in a particular regime and the generalization of DNNs can be characterized by the best function in the corresponding neural tangent kernel space (Jacot et al., 2018).

[Figure 1. Comparison of NeuralUCB and baseline algorithms on synthetic datasets. Panels: (a) $h_1(x) = 10(x^\top a)^2$; (b) $h_2(x) = x^\top A^\top A x$; (c) $h_3(x) = \cos(3x^\top a)$. Axes: cumulative regret vs. round (0 to 10,000).]

7. Experiments

In this section, we evaluate NeuralUCB empirically and compare it with seven representative baselines: (1) LinUCB, which is also based on UCB but adopts a linear representation; (2) GLMUCB (Filippi et al., 2010), which applies a nonlinear link function over a linear function; (3) KernelUCB (Valko et al., 2013), a kernelised UCB algorithm that makes use of a predefined kernel function; (4) BootstrappedNN (Efron, 1982; Riquelme et al., 2018), which simultaneously trains a set of neural networks using bootstrapped samples and, at every round, chooses an action based on the prediction of a randomly picked model; (5) Neural $\epsilon$-Greedy, which replaces the UCB-based exploration in Algorithm 1 with $\epsilon$-greedy exploration; (6) NeuralUCB0, as described in Section 3; and (7) Neural $\epsilon$-Greedy0, the same as NeuralUCB0 but with $\epsilon$-greedy exploration. We use the cumulative regret as the performance metric.

7.1. Synthetic Datasets

In the first set of experiments, we use contextual bandits with context dimension $d = 20$ and $K = 4$ actions. The number of rounds is $T = 10{,}000$. The contextual vectors $\{x_{1,1}, \ldots, x_{T,K}\}$ are chosen uniformly at random from the unit ball.
The reward function $h$ is one of the following:

$$h_1(x) = 10(x^\top a)^2, \qquad h_2(x) = x^\top A^\top A x, \qquad h_3(x) = \cos(3 x^\top a),$$

where each entry of $A \in \mathbb{R}^{d \times d}$ is randomly generated from $N(0,1)$, and $a$ is randomly generated from the uniform distribution over the unit ball. For each $h_i(\cdot)$, the reward is generated by $r_{t,a} = h_i(x_{t,a}) + \xi_t$, where $\xi_t \sim N(0,1)$.

Following Li et al. (2010), we implement LinUCB using a constant $\alpha$ (for the variance term in the UCB), and do a grid search for $\alpha$ over $\{0.01, 0.1, 1, 10\}$. For GLMUCB, we use the sigmoid function as the link function and adapt the online Newton step method to accelerate the computation (Zhang et al., 2016; Jun et al., 2017); we do grid searches over $\{0.1, 1, 10\}$ for the regularization parameter, $\{1, 10, 100\}$ for the step size, and $\{0.01, 0.1, 1\}$ for the exploration parameter. For KernelUCB, we use the radial basis function (RBF) kernel with parameter $\sigma$ and set the regularization parameter to 1; grid searches over $\{0.1, 1, 10\}$ for $\sigma$ and $\{0.01, 0.1, 1, 10\}$ for the exploration parameter are performed. To accelerate the computation, we stop adding contexts to KernelUCB after 1000 rounds, following the same setting used for Gaussian processes in Riquelme et al. (2018).

For all five neural algorithms, we choose a two-layer neural network $f(x;\theta) = \sqrt{m}\,W_2\,\sigma(W_1 x)$ with network width $m = 20$, where $\theta = [\mathrm{vec}(W_1)^\top, \mathrm{vec}(W_2)^\top]^\top \in \mathbb{R}^p$ and $p = md + m = 420$. (Note that the bound on the required network width $m$ is likely not tight; therefore, in the experiments we choose $m$ to be relatively large, but not as large as the theory suggests.) Moreover, we set $\gamma_t = \gamma$ in NeuralUCB and do a grid search over $\{0.01, 0.1, 1, 10\}$. For NeuralUCB0, we do grid searches for $\nu$ over $\{0.1, 1, 10\}$, for $\lambda$ over $\{0.1, 1, 10\}$, for $\delta$ over $\{0.01, 0.1, 1\}$, and for $S$ over $\{0.01, 0.1, 1, 10\}$. For Neural $\epsilon$-Greedy and Neural $\epsilon$-Greedy0, we do a grid search for $\epsilon$ over $\{0.001, 0.01, 0.1, 0.2\}$. For BootstrappedNN, we follow Riquelme et al. (2018) and set the number of models to 10 and the transition probability to 0.8. To accelerate the training process, for BootstrappedNN, NeuralUCB, and Neural $\epsilon$-Greedy, we update the parameter $\theta_t$ by TrainNN every 50 rounds. We use stochastic gradient descent with batch size 50 and $J = t$ at round $t$, and do a grid search for the step size $\eta$ over $\{0.001, 0.01, 0.1\}$. For all grid-searched parameters, we choose the best configuration for the comparison. All experiments are repeated 10 times, and the averaged results are reported for comparison.
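As an illustration, the sketch below generates the three synthetic reward functions and the noisy rewards described above. Where the text leaves details unspecified (for instance, exactly how $a$ and the contexts are drawn inside the unit ball), the choices below are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
d, K = 20, 4
A = rng.normal(size=(d, d))                                # entries of A drawn from N(0, 1)

def unit_ball(shape):
    """Draw vectors uniformly from the unit ball (assumed sampling scheme)."""
    v = rng.normal(size=shape)
    v /= np.linalg.norm(v, axis=-1, keepdims=True)          # uniform direction on the sphere
    return v * rng.uniform(size=shape[:-1] + (1,)) ** (1.0 / shape[-1])

a = unit_ball((1, d))[0]

h_funcs = {
    "h1": lambda x: 10.0 * (x @ a) ** 2,
    "h2": lambda x: x @ A.T @ A @ x,
    "h3": lambda x: np.cos(3.0 * x @ a),
}

def step(h):
    """One round: K contexts from the unit ball, rewards r_{t,a} = h(x_{t,a}) + xi_t with xi_t ~ N(0, 1)."""
    contexts = unit_ball((K, d))
    rewards = np.array([h(x) for x in contexts]) + rng.normal(size=K)
    return contexts, rewards

contexts, rewards = step(h_funcs["h3"])
```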
7.2. Real-world Datasets

We also evaluate the algorithms on real-world datasets from the UCI Machine Learning Repository (Dua & Graff, 2017), namely covertype, magic, and statlog, as well as on the mnist dataset (LeCun et al., 1998). These are all $K$-class classification datasets (Table 1) and are converted into $K$-armed contextual bandit problems (Beygelzimer & Langford, 2009). The number of rounds is set to $T = 15{,}000$.

Table 1. Dataset statistics

| Dataset | covertype | magic | statlog | mnist |
| --- | --- | --- | --- | --- |
| Feature dimension | 54 | 10 | 8 | 784 |
| Number of classes | 7 | 2 | 7 | 10 |
| Number of instances | 581,012 | 19,020 | 58,000 | 60,000 |

Following Riquelme et al. (2018), we create contextual bandit problems based on the prediction accuracy. In detail, to transform a classification problem with $k$ classes into a bandit problem, we adopt the disjoint model (Li et al., 2010), which transforms each contextual vector $x \in \mathbb{R}^d$ into $k$ vectors $x^{(1)} = (x, 0, \ldots, 0), \ldots, x^{(k)} = (0, \ldots, 0, x) \in \mathbb{R}^{dk}$. The agent receives regret 0 if it classifies the context correctly, and 1 otherwise. For all algorithms, we reshuffle the order of contexts and repeat the experiment for 10 runs; averaged results are reported for comparison.

For LinUCB, GLMUCB, and KernelUCB, we tune their parameters as in Section 7.1. For BootstrappedNN, NeuralUCB, NeuralUCB0, Neural $\epsilon$-Greedy, and Neural $\epsilon$-Greedy0, we choose a two-layer neural network with width $m = 100$. For NeuralUCB and NeuralUCB0, since it is computationally expensive to store and compute the whole matrix $Z_t$, we approximate $Z_t$ by the diagonal matrix consisting of its diagonal elements. To accelerate the training process, for BootstrappedNN, NeuralUCB, and Neural $\epsilon$-Greedy, we update the parameter $\theta_t$ by TrainNN every 100 rounds, starting from round 2000. We do grid searches for $\lambda$ over $\{10^{-i}\}$, $i = 1, 2, 3, 4$, and for $\eta$ over $\{2 \times 10^{-i}, 5 \times 10^{-i}\}$, $i = 1, 2, 3, 4$. We set $J = 1000$ and use stochastic gradient descent with batch size 500 to train the networks. The remaining parameters are tuned as in Section 7.1, and the best configuration is chosen for comparison.

[Figure 2. Comparison of NeuralUCB and baseline algorithms on real-world datasets. Panels show cumulative regret vs. round (0 to 15,000) on the covertype, magic, statlog, and mnist datasets.]

7.3. Results

Figures 1 and 2 show the cumulative regret of all algorithms. First, due to the nonlinearity of the reward functions $h$, LinUCB fails to learn them in nearly all tasks. GLMUCB is only able to learn the true reward functions for certain tasks due to its simple link function. In contrast, thanks to the neural network representation and efficient exploration, NeuralUCB achieves a substantially lower regret. The performance of Neural $\epsilon$-Greedy lies in between: while Neural $\epsilon$-Greedy can capture the nonlinearity of the underlying reward function, $\epsilon$-greedy exploration is not as effective as UCB-based exploration. This confirms the effectiveness of NeuralUCB for contextual bandit problems with nonlinear reward functions. Second, it is worth noting that NeuralUCB and Neural $\epsilon$-Greedy outperform NeuralUCB0 and Neural $\epsilon$-Greedy0. This suggests that using deep neural networks to predict the reward function is better than using a fixed feature mapping associated with the Neural Tangent Kernel, which mirrors similar findings in supervised learning (Allen-Zhu & Li, 2019). Furthermore, KernelUCB is not as good as NeuralUCB, which suggests the limitation of simple kernels like RBF compared with flexible neural networks. Finally, BootstrappedNN can be competitive, approaching the performance of NeuralUCB on some datasets. However, it requires maintaining and training multiple neural networks, and is thus computationally more expensive than our approach, especially in large-scale problems.

8. Conclusion

In this paper, we proposed NeuralUCB, a new algorithm for stochastic contextual bandits based on neural networks and upper confidence bounds.
Building on recent advances in the optimization and generalization of deep neural networks, we showed that, for an arbitrary bounded reward function, our algorithm achieves an $\tilde{O}(\tilde{d}\sqrt{T})$ regret bound. Promising empirical results on both synthetic and real-world data corroborate our theoretical findings and suggest the potential of the algorithm in practice.

We conclude the paper with a suggested direction for future research. Given the focus on UCB exploration in this work, a natural open question is provably efficient exploration based on randomized strategies when DNNs are used. These methods are effective in practice, but existing regret analyses are mostly for shallow (i.e., linear or generalized linear) models (Chapelle & Li, 2011; Agrawal & Goyal, 2013; Russo et al., 2018; Kveton et al., 2020). Extending them to DNNs will be interesting. Meanwhile, our current analysis of NeuralUCB is based on NTK theory. While the NTK facilitates the analysis, it has its own limitations, and we leave the analysis of NeuralUCB beyond the NTK as future work.

Acknowledgements

We would like to thank the anonymous reviewers for their helpful comments. This research was sponsored in part by the National Science Foundation IIS-1904183 and IIS-1906169. The views and conclusions contained in this paper are those of the authors and should not be interpreted as representing any funding agencies.

References

Abbasi-Yadkori, Y., Pál, D., and Szepesvári, C. Improved algorithms for linear stochastic bandits. In Advances in Neural Information Processing Systems, pp. 2312–2320, 2011.

Abe, N., Biermann, A. W., and Long, P. M. Reinforcement learning with immediate rewards and linear hypotheses. Algorithmica, 37(4):263–293, 2003.

Agarwal, A., Hsu, D., Kale, S., Langford, J., Li, L., and Schapire, R. E. Taming the monster: A fast and simple algorithm for contextual bandits. In Proceedings of the 31st International Conference on Machine Learning (ICML), pp. 1638–1646, 2014.

Agrawal, S. and Goyal, N. Thompson sampling for contextual bandits with linear payoffs. In International Conference on Machine Learning, pp. 127–135, 2013.

Allen-Zhu, Z. and Li, Y. What can ResNet learn efficiently, going beyond kernels? In Advances in Neural Information Processing Systems, 2019.

Allen-Zhu, Z., Li, Y., and Song, Z. A convergence theory for deep learning via over-parameterization. In International Conference on Machine Learning, pp. 242–252, 2019.

Allesiardo, R., Féraud, R., and Bouneffouf, D. A neural networks committee for the contextual bandit problem. In International Conference on Neural Information Processing, pp. 374–381. Springer, 2014.

Arora, S., Du, S. S., Hu, W., Li, Z., Salakhutdinov, R., and Wang, R. On exact computation with an infinitely wide neural net. In Advances in Neural Information Processing Systems, 2019.

Auer, P. Using confidence bounds for exploitation-exploration trade-offs. Journal of Machine Learning Research, 3(Nov):397–422, 2002.

Auer, P., Cesa-Bianchi, N., Freund, Y., and Schapire, R. E. The nonstochastic multiarmed bandit problem. SIAM Journal on Computing, 32(1):48–77, 2002.

Azizzadenesheli, K., Brunskill, E., and Anandkumar, A. Efficient exploration through Bayesian deep Q-networks. In 2018 Information Theory and Applications Workshop (ITA), pp. 1–9. IEEE, 2018.

Beygelzimer, A. and Langford, J. The offset tree for learning with partial labels. In Proceedings of the 15th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 129–138, 2009.
Beygelzimer, A., Langford, J., Li, L., Reyzin, L., and Schapire, R. E. Contextual bandit algorithms with supervised learning guarantees. In Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, pp. 19–26, 2011.

Bubeck, S. and Cesa-Bianchi, N. Regret analysis of stochastic and nonstochastic multi-armed bandit problems. Foundations and Trends in Machine Learning, 5(1):1–122, 2012.

Bubeck, S., Munos, R., Stoltz, G., and Szepesvári, C. X-armed bandits. Journal of Machine Learning Research, 12(May):1655–1695, 2011.

Cao, Y. and Gu, Q. Generalization bounds of stochastic gradient descent for wide and deep neural networks. In Advances in Neural Information Processing Systems, 2019.

Cao, Y. and Gu, Q. Generalization error bounds of gradient descent for learning over-parameterized deep ReLU networks. In The Thirty-Fourth AAAI Conference on Artificial Intelligence, 2020.

Chapelle, O. and Li, L. An empirical evaluation of Thompson sampling. In Advances in Neural Information Processing Systems, pp. 2249–2257, 2011.

Chen, Z., Cao, Y., Zou, D., and Gu, Q. How much over-parameterization is sufficient to learn deep ReLU networks? arXiv preprint arXiv:1911.12360, 2019.

Chu, W., Li, L., Reyzin, L., and Schapire, R. Contextual bandits with linear payoff functions. In Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, pp. 208–214, 2011.

Dani, V., Hayes, T. P., and Kakade, S. M. Stochastic linear optimization under bandit feedback. 2008.

Daniely, A. SGD learns the conjugate kernel class of the network. In Advances in Neural Information Processing Systems, pp. 2422–2430, 2017.

Du, S., Lee, J., Li, H., Wang, L., and Zhai, X. Gradient descent finds global minima of deep neural networks. In International Conference on Machine Learning, pp. 1675–1685, 2019a.

Du, S. S., Zhai, X., Poczos, B., and Singh, A. Gradient descent provably optimizes over-parameterized neural networks. In International Conference on Learning Representations, 2019b. URL https://openreview.net/forum?id=S1eK3i09YQ.

Dua, D. and Graff, C. UCI machine learning repository, 2017. URL http://archive.ics.uci.edu/ml.

Efron, B. The jackknife, the bootstrap, and other resampling plans, volume 38. SIAM, 1982.

Féraud, R., Allesiardo, R., Urvoy, T., and Clérot, F. Random forest for the contextual bandit problem. In Artificial Intelligence and Statistics, pp. 93–101, 2016.

Filippi, S., Cappe, O., Garivier, A., and Szepesvári, C. Parametric bandits: The generalized linear case. In Advances in Neural Information Processing Systems, pp. 586–594, 2010.

Foster, D. J. and Rakhlin, A. Beyond UCB: Optimal and efficient contextual bandits with regression oracles. arXiv preprint arXiv:2002.04926, 2020.

Goodfellow, I., Bengio, Y., and Courville, A. Deep Learning. MIT Press, 2016. http://www.deeplearningbook.org.

Hanin, B. Universal function approximation by deep neural nets with bounded width and ReLU activations. arXiv preprint arXiv:1708.02691, 2017.

Hanin, B. and Sellke, M. Approximating continuous functions by ReLU nets of minimal width. arXiv preprint arXiv:1710.11278, 2017.

Jacot, A., Gabriel, F., and Hongler, C. Neural tangent kernel: Convergence and generalization in neural networks. In Advances in Neural Information Processing Systems, pp. 8571–8580, 2018.

Jun, K.-S., Bhargava, A., Nowak, R. D., and Willett, R. Scalable generalized linear bandits: Online computation and hashing. In Advances in Neural Information Processing Systems 30 (NIPS), pp. 99–109, 2017.
Kakade, S. M., Shalev-Shwartz, S., and Tewari, A. Efficient bandit algorithms for online multiclass prediction. In Proceedings of the 25th International Conference on Machine Learning, pp. 440–447, 2008.

Kleinberg, R., Slivkins, A., and Upfal, E. Multi-armed bandits in metric spaces. In Proceedings of the Fortieth Annual ACM Symposium on Theory of Computing, pp. 681–690. ACM, 2008.

Krause, A. and Ong, C. S. Contextual Gaussian process bandit optimization. In Advances in Neural Information Processing Systems, pp. 2447–2455, 2011.

Kveton, B., Zaheer, M., Szepesvári, C., Li, L., Ghavamzadeh, M., and Boutilier, C. Randomized exploration in generalized linear bandits. In Proceedings of the 22nd International Conference on Artificial Intelligence and Statistics, 2020.

Langford, J. and Zhang, T. The epoch-greedy algorithm for contextual multi-armed bandits. In Advances in Neural Information Processing Systems 20 (NIPS), pp. 1096–1103, 2008.

Lattimore, T. and Szepesvári, C. Bandit Algorithms. Cambridge University Press, 2019. In press.

LeCun, Y., Bottou, L., Bengio, Y., and Haffner, P. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, 1998.

Li, L., Chu, W., Langford, J., and Schapire, R. E. A contextual-bandit approach to personalized news article recommendation. In Proceedings of the 19th International Conference on World Wide Web, pp. 661–670. ACM, 2010.

Li, L., Lu, Y., and Zhou, D. Provably optimal algorithms for generalized linear contextual bandits. In Proceedings of the 34th International Conference on Machine Learning, Volume 70, pp. 2071–2080. JMLR.org, 2017.

Li, Y. and Liang, Y. Learning overparameterized neural networks via stochastic gradient descent on structured data. In Advances in Neural Information Processing Systems, pp. 8157–8166, 2018.

Liang, S. and Srikant, R. Why deep neural networks for function approximation? arXiv preprint arXiv:1610.04161, 2016.

Lipton, Z., Li, X., Gao, J., Li, L., Ahmed, F., and Deng, L. BBQ-networks: Efficient exploration in deep reinforcement learning for task-oriented dialogue systems. In Thirty-Second AAAI Conference on Artificial Intelligence, 2018.

Lu, Z., Pu, H., Wang, F., Hu, Z., and Wang, L. The expressive power of neural networks: A view from the width. In Advances in Neural Information Processing Systems, pp. 6231–6239, 2017.

Riquelme, C., Tucker, G., and Snoek, J. Deep Bayesian bandits showdown. In International Conference on Learning Representations, 2018.

Rusmevichientong, P. and Tsitsiklis, J. N. Linearly parameterized bandits. Mathematics of Operations Research, 35(2):395–411, 2010.

Russo, D., Roy, B. V., Kazerouni, A., Osband, I., and Wen, Z. A tutorial on Thompson sampling. Foundations and Trends in Machine Learning, 11(1):1–96, 2018.

Srinivas, N., Krause, A., Kakade, S., and Seeger, M. Gaussian process optimization in the bandit setting: No regret and experimental design. In Proceedings of the 27th International Conference on Machine Learning, pp. 1015–1022. Omnipress, 2010.

Telgarsky, M. Representation benefits of deep feedforward networks. arXiv preprint arXiv:1509.08101, 2015.

Telgarsky, M. Benefits of depth in neural networks. arXiv preprint arXiv:1602.04485, 2016.

Valko, M., Korda, N., Munos, R., Flaounas, I., and Cristianini, N. Finite-time analysis of kernelised contextual bandits. arXiv preprint arXiv:1309.6869, 2013.
Yang, L. F. and Wang, M. Reinforcement learning in feature space: Matrix bandit, kernels, and regret bound. arXiv preprint arXiv:1905.10389, 2019.

Yarotsky, D. Error bounds for approximations with deep ReLU networks. Neural Networks, 94:103–114, 2017.

Yarotsky, D. Optimal approximation of continuous functions by very deep ReLU networks. arXiv preprint arXiv:1802.03620, 2018.

Zahavy, T. and Mannor, S. Deep neural linear bandits: Overcoming catastrophic forgetting through likelihood matching. arXiv preprint arXiv:1901.08612, 2019.

Zhang, L., Yang, T., Jin, R., Xiao, Y., and Zhou, Z.-H. Online stochastic linear optimization under one-bit feedback. In International Conference on Machine Learning, pp. 392–401, 2016.

Zou, D. and Gu, Q. An improved analysis of training over-parameterized deep neural networks. In Advances in Neural Information Processing Systems, 2019.

Zou, D., Cao, Y., Zhou, D., and Gu, Q. Stochastic gradient descent optimizes over-parameterized deep ReLU networks. Machine Learning, 2019.