# Combinatorial Neural Bandits

Taehyun Hwang\*¹, Kyuwook Chai\*¹, Min-hwan Oh¹

**Abstract.** We consider a contextual combinatorial bandit problem where in each round a learning agent selects a subset of arms and receives feedback on the selected arms according to their scores. The score of an arm is an unknown function of the arm's feature. Approximating this unknown score function with deep neural networks, we propose two algorithms: Combinatorial Neural UCB (CN-UCB) and Combinatorial Neural Thompson Sampling (CN-TS). We prove that CN-UCB achieves $\tilde{O}(\tilde{d}\sqrt{T})$ or $\tilde{O}(\sqrt{\tilde{d}TK})$ regret, where $\tilde{d}$ is the effective dimension of a neural tangent kernel matrix, $K$ is the size of a subset of arms, and $T$ is the time horizon. For CN-TS, we adapt an optimistic sampling technique to ensure the optimism of the sampled combinatorial action, achieving a worst-case (frequentist) regret of $\tilde{O}(\tilde{d}\sqrt{TK})$. To the best of our knowledge, these are the first combinatorial neural bandit algorithms with regret performance guarantees. In particular, CN-TS is the first Thompson sampling algorithm with worst-case regret guarantees for the general contextual combinatorial bandit problem. Numerical experiments demonstrate the superior performance of our proposed algorithms.

\*Equal contribution. ¹Graduate School of Data Science, Seoul National University, Seoul, Republic of Korea. Correspondence to: Min-hwan Oh. Proceedings of the 40th International Conference on Machine Learning, Honolulu, Hawaii, USA. PMLR 202, 2023. Copyright 2023 by the author(s).

## 1. Introduction

We consider a general class of contextual semi-bandits with combinatorial actions, where in each round the learning agent is given a set of arms, chooses a subset of arms, and receives feedback on each of the chosen arms along with a reward based on the combinatorial action. The goal of the agent is to maximize cumulative rewards through these repeated interactions. The feedback is given as a function of the feature vectors (contexts) of the chosen arms. However, the functional form of the feedback model is unknown to the agent. Therefore, the agent needs to carefully balance exploration and exploitation in order to simultaneously learn the feedback model and optimize cumulative rewards.

Many real-world applications are naturally combinatorial action selection problems. For example, in most online recommender systems, such as streaming services and online retail, recommended items are typically presented as a set or a list. Real-time vehicle routing can be formulated as the shortest-path problem under uncertainty, a classic combinatorial problem. Network routing is another example of a combinatorial optimization problem. Often, in these applications, the response model is not fully known a priori (e.g., user preferences in recommender systems, arrival times in vehicle routing) and can only be queried through sequential interactions. Therefore, these applications can be formulated as combinatorial bandit problems.

Despite the generality and wide applicability of the combinatorial bandit problem in practice, the combinatorial action space poses a greater challenge in balancing exploration and exploitation. To overcome this challenge, parametric models such as the (generalized) linear model are often assumed for the feedback model (Qin et al., 2014; Wen et al., 2015; Kveton et al., 2015; Zong et al., 2016; Li et al., 2016; 2019; Oh & Iyengar, 2019).
These works typically extend the techniques in the (generalized) linear contextual bandits (Abe & Long, 1999; Auer, 2002; Filippi et al., 2010; Rusmevichientong & Tsitsiklis, 2010; Abbasi-Yadkori et al., 2011; Chu et al., 2011; Li et al., 2017) to utilize contextual information and the structure of the feedback/reward model to avoid the naive exploration in combinatorial action space. However, the representation power of the (generalized) linear model can be limited in many real-world applications. When the model assumptions are violated, often the performances of the algorithms that exploit the structure of a model can severely deteriorate. Beyond the parametric assumption for the feedback model, discretization-based techniques (Chen et al., 2018; Nika et al., 2020) have been proposed to capture the non-linearity of the base arm under the Lipschitz condition on the feedback model. These techniques split the context space and compute an upper confidence bound of rewards for each context partition. The performances of the algorithms strongly Combinatorial Neural Bandits Table 1. Comparison with the related work. For the neural bandit algorithms with single arm selection (Zhou et al., 2020; Zhang et al., 2021), the reward function is not defined for a super arm (or the reward function can be viewed the same as the feedback for a single arm). All of the feedback models assume the boundedness of feedback. e O is a big-O notation up to logarithmic factors. Combinatorial Feedback Reward Regret C2UCB (Qin et al., 2014) Yes Linear Lipschitz e O(d T) Comb Lin UCB (Wen et al., 2015) Yes Linear Sum of feedback e O(K p d T min{log N, d}) Comb Lin TS (Wen et al., 2015) Yes Linear Sum of feedback e O(d K CC-MAB (Chen et al., 2018) Yes Lipschitz Sub-modular e O(2d T d+4 d+6 ) ACC-UCB (Nika et al., 2020) Yes Lipschitz Sub-modular e O(T Neural-UCB (Zhou et al., 2020) No General - e O(ed T) Neural-TS (Zhang et al., 2021) No General - e O(ed CN-UCB (this work) Yes General Lipschitz e O(ed T) or e O( p ed TK) CN-TS (this work) Yes General Lipschitz e O(ed Bayesian regret, which is a weaker notion of regret than the worst-case regret. d represents the approximate optimality dimension related to context space. depend on the policy of how to partition the context space. However, splitting the context space is computationally expensive. As the reward function becomes more complex, so does the splitting procedure. Thus, it is challenging to apply these methods to high-dimensional contextual bandits. In addition, the Lipschitz assumption on the feedback model (not on the reward function) does not hold when contexts close in the context space yield significantly different outcomes, i.e., when context space cannot be partitioned with respect to the outcome. Deep neural networks have shown remarkable empirical performances in various learning tasks (Le Cun et al., 2015; Goodfellow et al., 2016; Silver et al., 2016). Incorporating the superior representation power and recent advances in generalization theory of deep neural networks (Jacot et al., 2018; Cao & Gu, 2019) into contextual bandits, an upper confidence bound (UCB) algorithm as an extension of the linear contextual bandit has been proposed (Zhou et al., 2020). Extending the UCB approach, Zhang et al. (2021) proposed a neural network-based Thompson Sampling (TS) algorithm (Thompson, 1933). However, these algorithms are proposed only for single-action selection. How these algorithms generalize to the combinatorial action selection has remained open. 
In this paper, we study provably efficient contextual combinatorial bandit algorithms without any modeling assumptions on the feedback model (with mild assumptions on the reward function which takes the feedback as an input). The extension to the combinatorial actions and providing provable performance guarantees requires more involved analysis and novel algorithmic modifications, particularly for the TS algorithm. To briefly illustrate this challenge, even under the simple linear feedback model, a worst-case regret bound has not been known for a TS algorithm with various classes of combinatorial actions. This is due to the difficulty of ensuring the optimism of randomly sampled combinatorial actions (see Section 4.1). Addressing such challenges, we adapt an optimistic sampling technique to our proposed TS algorithm, which allows us to achieve a sublinear regret. Our main contributions are as follows: We propose algorithms for a general class of contextual combinatorial bandits: Combinatorial Neural UCB (CN-UCB) and Combinatorial Neural Thompson Sampling (CN-TS). To the best of our knowledge, these are the first neural-network based combinatorial bandit algorithms with regret guarantees. We establish that CN-UCB is statistically efficient achieving e O(ed T) or e O( p ed TK) regret, where ed is the effective dimension of a neural tangent kernel matrix, K is the size of a subset of arms, and T is the time horizon. This result matches the corresponding regret bounds of linear contextual bandits. The highlight of our contributions is that CN-TS is the first TS algorithm with the worst-case regret guarantees of e O(ed TK) for a general class of contextual combinatorial bandits. To our best knowledge, even under a simpler, linear feedback model, the existing TS algorithms with various combinatorial actions (including semi-bandit) do not have the worst-case regret guarantees. This is due to the difficulty of ensuring the optimism of sampled combinatorial actions. We overcome this challenge by adapting optimistic sampling of the estimated reward while directly sampling in the reward space. Combinatorial Neural Bandits The numerical evaluations demonstrate the superior performances of our proposed algorithms. We observe that the performances of the benchmark methods deteriorate significantly when the modeling assumptions are violated. In contrast, our proposed methods exhibit consistent competitive performances. 2. Problem setting 2.1. Notations For a vector x Rd, we denote its ℓ2-norm by x 2 and its transpose by x . The weighted ℓ2-norm associated with a positive definite matrix A is defined by x A := x Ax. The trace of a matrix A is tr(A). We define [N] for a positive integer N to be a set containing positive integers up to N, i.e., {1, 2, . . . , N}. 2.2. Contextual Combinatorial Bandit In this work, we consider a contextual combinatorial bandit, where T is the total number of rounds, and N is the number of arms. At round t [T], a learning agent observes the set of context vectors for all arms {xt,i Rd | i [N]} and chooses a set of arms St [N] with size constraint |St| = K. St is called a super arm. We introduce the notion of candidate super arm set S 2[N] defined as the set of all possible subsets of arms with size K, i.e., S := {S [N] | |S| = K}. 2.2.1. SCORE FUNCTION FOR FEEDBACK Once a super arm St S is chosen, the agent then observes the scores of the chosen arms {vt,i}i St and receives a reward R(St, vt) as a function of the scores vt := [vt,i]N i=1 (which we discuss in the next section). 
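To fix ideas, the interaction protocol just described (observe contexts, play a super arm of size $K$, observe the scores of the chosen arms only, and collect a reward computed from those scores) can be sketched as follows. This is a minimal illustration in Python, not code from the paper; the linear score function, the noise level, the greedy policy, and all function names are our own choices, and the sum-of-scores reward is only one example of an admissible reward function.

```python
import numpy as np

def combinatorial_round(contexts, h, policy, K, noise_sd, rng):
    """One round of the contextual combinatorial semi-bandit interaction.

    contexts : (N, d) array of arm features x_{t,i}
    h        : the score function, unknown to the agent in the bandit setting
    policy   : agent's rule mapping contexts to a super arm S_t with |S_t| = K
    """
    S_t = policy(contexts, K)                                         # super arm
    assert len(set(S_t)) == K
    v = h(contexts) + noise_sd * rng.standard_normal(len(contexts))   # v_{t,i} = h(x_{t,i}) + xi_{t,i}
    feedback = {i: v[i] for i in S_t}                                 # semi-bandit feedback: chosen arms only
    reward = sum(feedback.values())                                   # e.g. R(S_t, v_t) = sum_{i in S_t} v_{t,i}
    return feedback, reward

# Toy usage with a linear score function and a purely greedy policy (no exploration).
rng = np.random.default_rng(0)
d, N, K = 8, 20, 4
a = rng.standard_normal(d); a /= np.linalg.norm(a)
h = lambda X: X @ a
greedy = lambda X, K: list(np.argsort(-h(X))[:K])                     # uses h only for illustration
X_t = rng.standard_normal((N, d)); X_t /= np.linalg.norm(X_t, axis=1, keepdims=True)
feedback, reward = combinatorial_round(X_t, h, greedy, K, noise_sd=0.1, rng=rng)
```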
This type of feedback is also known as semi-bandit feedback (Audibert et al., 2014). Note that in combinatorial bandits, the feedback and the reward are not necessarily the same, unlike in non-combinatorial bandits. For each $t \in [T]$ and $i \in [N]$, the score $v_{t,i}$ is assumed to be generated as follows:
$$v_{t,i} = h(x_{t,i}) + \xi_{t,i} \tag{1}$$
where $h$ is an unknown function satisfying $0 \le h(x) \le 1$ for any $x$, and $\xi_{t,i}$ is $\rho$-sub-Gaussian noise satisfying $\mathbb{E}[\xi_{t,i} \mid \mathcal{F}_t] = 0$, where $\mathcal{F}_t$ is the history up to round $t$.

To learn the score function $h$ in Eq.(1), we use a fully connected neural network (Zhou et al., 2020; Zhang et al., 2021) with depth $L \ge 2$, defined recursively:
$$f_1 = W_1 x, \qquad f_\ell = W_\ell\, \phi(f_{\ell-1}), \quad 2 \le \ell \le L, \qquad f(x;\theta) = \sqrt{m}\, f_L,$$
where $\theta := [\mathrm{vec}(W_1)^\top, \ldots, \mathrm{vec}(W_L)^\top]^\top \in \mathbb{R}^p$ is the parameter of the neural network with $p = dm + m^2(L-2) + m$, $\phi(x) := \max\{x, 0\}$ is the ReLU activation function, and $m$ is the width of each hidden layer. We denote the gradient of the neural network by $g(x;\theta) := \nabla_\theta f(x;\theta) \in \mathbb{R}^p$.

#### 2.2.2. Reward Function & Regret

$R(S, v)$ is a deterministic reward function that measures the quality of the super arm $S$ based on the scores $v$. For example, the reward of a super arm $S_t$ can be the sum of the scores of the arms in $S_t$, i.e., $R(S_t, v_t) = \sum_{i \in S_t} v_{t,i}$. For our analysis, the reward function can be any function (linear or non-linear) which satisfies the following mild assumptions standard in the combinatorial bandit literature (Qin et al., 2014; Li et al., 2016).

**Assumption 1** (Monotonicity). $R(S, v)$ is monotone non-decreasing with respect to the score vector $v = [v_i]_{i=1}^N$; that is, for any $S$, if $v_i \le v'_i$ for all $i \in [N]$, then $R(S, v) \le R(S, v')$.

**Assumption 2** (Lipschitz continuity). $R(S, v)$ is Lipschitz continuous with respect to the score vector $v$ restricted to the arms in $S$; that is, there exists a constant $C_0 > 0$ such that for any $v$ and $v'$, we have $|R(S, v) - R(S, v')| \le C_0 \sqrt{\sum_{i \in S} (v_i - v'_i)^2}$.

**Remark 1.** Reward functions satisfying Assumptions 1 and 2 encompass a wide range of combinatorial feedback models, including semi-bandit, document-based or position-based ranking models, and cascading models, with little change to the learning algorithm. See Appendix G for more detailed discussions.

Note that we do not require the agent to have direct knowledge of the explicit form of the reward function $R(S, v)$. For the sake of clear exposition, we assume that the agent has access to an exact optimization oracle $O_{\mathcal{S}}(v)$ which takes a score vector $v$ as an input and returns the solution of the maximization problem $\operatorname{argmax}_{S \in \mathcal{S}} R(S, v)$.

**Remark 2.** One can trivially extend the exact optimization oracle to an $\alpha$-approximation oracle without altering the learning algorithm or regret analysis. For problems such as semi-bandit algorithms choosing top-$K$ arms, exact optimization can be done by simply sorting base scores. Even for more challenging assortment optimization, there are many polynomial-time (approximate) optimization methods available (Rusmevichientong et al., 2010; Davis et al., 2014). For this reason, we present the regret analysis without the $\alpha$-approximation assumption. The extension of our regret analysis to an $\alpha$-approximation oracle is given in Appendix E.

The goal of the agent is to minimize the following (worst-case) cumulative expected regret:
$$R(T) = \sum_{t=1}^{T} \mathbb{E}\left[ R(S_t^*, v_t^*) - R(S_t, v_t^*) \right] \tag{3}$$
where $v_t^* := [h(x_{t,i})]_{i=1}^N$ is the expected score, which is unknown, and $S_t^* := \operatorname{argmax}_{S \in \mathcal{S}} R(S, v_t^*)$ is the offline optimal super arm at round $t$ under the expected score.

## 3. Combinatorial Neural UCB (CN-UCB)
### 3.1. CN-UCB Algorithm

In this section, we present our first algorithm, Combinatorial Neural UCB (CN-UCB). CN-UCB is a neural network-based UCB algorithm that operates using the optimism in the face of uncertainty (OFU) principle (Lai & Robbins, 1985) for combinatorial actions. In our proposed method, the neural network used for feedback-model approximation is initialized by randomly generating each entry of $\theta_0 = [\mathrm{vec}(W_1)^\top, \ldots, \mathrm{vec}(W_L)^\top]^\top$, where for each $\ell \in [L-1]$, $W_\ell = \begin{pmatrix} W & 0 \\ 0 & W \end{pmatrix}$ with each entry of $W$ generated independently from $N(0, 4/m)$, and $W_L = (w^\top, -w^\top)$ with each entry of $w$ generated independently from $N(0, 2/m)$.

At each round $t \in [T]$, the algorithm observes the contexts for all arms $\{x_{t,i}\}_{i \in [N]}$ and computes an upper confidence bound $u_{t,i}$ of the expected score for each arm $i$, based on $x_{t,i}$, $\theta_{t-1}$, and the exploration parameter $\gamma_{t-1}$. Then, the sum of the upper confidence bound score vector $u_t := [u_{t,i}]_{i=1}^N$ and the offset term vector $e_t := [e_t, \ldots, e_t]$ (specified in Lemma 1) is passed to the optimization oracle $O_{\mathcal{S}}$ as input. The agent plays $S_t = O_{\mathcal{S}}(u_t + e_t)$ and receives the corresponding scores $\{v_{t,i}\}_{i \in S_t}$ as feedback, along with the reward associated with the super arm $S_t$. The algorithm then updates $\theta_t$ by minimizing the following loss function in Eq.(4) using gradient descent with step size $\eta$ for $J$ iterations:
$$\mathcal{L}(\theta) = \sum_{k=1}^{t} \sum_{i \in S_k} \big( f(x_{k,i}; \theta) - v_{k,i} \big)^2 + \frac{m\lambda}{2} \|\theta - \theta_0\|_2^2 \,. \tag{4}$$
Here, the loss is minimized with $\ell_2$-regularization. The hyperparameter $\lambda$ controls the level of regularization, and the regularization is centered at the randomly initialized neural network parameter $\theta_0$. The CN-UCB algorithm is summarized in Algorithm 1.

### 3.2. Regret of CN-UCB

For brevity, let $\{x_k\}_{k=1}^{TN}$ denote the collection of all contexts $\{x_{1,1}, \ldots, x_{T,N}\}$.

**Definition 1** (Jacot et al., 2018; Cao & Gu, 2019). Define
$$\tilde{H}^{(1)}_{i,j} = \Sigma^{(1)}_{i,j} = \langle x_i, x_j \rangle, \qquad A^{(\ell)}_{i,j} = \begin{pmatrix} \Sigma^{(\ell)}_{i,i} & \Sigma^{(\ell)}_{i,j} \\ \Sigma^{(\ell)}_{j,i} & \Sigma^{(\ell)}_{j,j} \end{pmatrix},$$
$$\Sigma^{(\ell+1)}_{i,j} = 2\, \mathbb{E}_{(y,z) \sim N(0, A^{(\ell)}_{i,j})} \big[ \phi(y)\phi(z) \big], \qquad \tilde{H}^{(\ell+1)}_{i,j} = 2\, \tilde{H}^{(\ell)}_{i,j}\, \mathbb{E}_{(y,z) \sim N(0, A^{(\ell)}_{i,j})} \big[ \phi'(y)\phi'(z) \big] + \Sigma^{(\ell+1)}_{i,j} .$$
Then $H = (\tilde{H}^{(L)} + \Sigma^{(L)})/2$ is called the neural tangent kernel (NTK) matrix on the context set $\{x_k\}_{k=1}^{TN}$.

The NTK matrix $H$ on the contexts $\{x_k\}_{k=1}^{TN}$ is defined recursively from the input layer to the output layer of the network (Zhou et al., 2020; Zhang et al., 2021). We then define the effective dimension of the NTK matrix $H$.

**Definition 2.** The effective dimension $\tilde{d}$ of the NTK matrix $H$ with regularization parameter $\lambda$ is defined as
$$\tilde{d} = \frac{\log \det(I + H/\lambda)}{\log(1 + TN/\lambda)} \,. \tag{5}$$

The effective dimension can be thought of as the actual dimension of the contexts in the reproducing kernel Hilbert space spanned by the NTK. For further details, we refer the reader to Jacot et al. (2018). We proceed under the following assumption regarding contexts.

**Assumption 3.** For any $k \in [TN]$, $\|x_k\|_2 = 1$ and $[x_k]_j = [x_k]_{j + \frac{d}{2}}$ for $1 \le j \le \frac{d}{2}$. Furthermore, for some $\lambda_0 > 0$, $H \succeq \lambda_0 I$.

This is a mild assumption commonly used in neural contextual bandits (Zhou et al., 2020; Zhang et al., 2021). The condition $\|x\|_2 = 1$ is imposed only for simplicity of exposition. For the condition on the entries of $x$, we can always construct a new context $x' = [x^\top, x^\top]^\top / \sqrt{2}$. A positive definite NTK matrix is a standard assumption in the NTK literature (Du et al., 2019; Arora et al., 2019), and is also used in the aforementioned neural contextual bandit literature. The following theorem provides the regret bound of Algorithm 1.

**Theorem 1.** Suppose Assumptions 1-3 hold. Let $\mathbf{h} = [h(x_k)]_{k=1}^{TN} \in \mathbb{R}^{TN}$.
If we run CN-UCB with m poly(T, L, N, λ 1, λ 1 0 , log T) , η = C1(TKm L + mλ) 1, λ C2LK, J = 2 log p λ/TK/(λ + C3TKL) TKL/( C1λ) for some positive constants C1, C2, C3 with C2 p maxt,i g(xt,i; θt 1)/ m 2 2/L and B 2h H 1h, then the cumulative expected regret of CN-UCB over horizon T is upper-bounded by R(T) = e O q ed T max{ed, K} . Discussion of Theorem 1. Theorem 1 establishes that the cumulative regret of CN-UCB is e O(ed T) or e O( p ed TK), whichever is higher. This result matches the state-of-theart regret bounds for the contextual combinatorial bandits with the linear feedback model (Li et al., 2016; Zong et al., 2016; Li & Zhang, 2018). Note that the existence of C2 in Theorem 1 follows from Lemma B.6 in Zhou et al. (2020) Combinatorial Neural Bandits Algorithm 1 Combinatorial Neural UCB (CN-UCB) Input: Number of rounds T, regularization parameter λ, norm parameter B, step size η, network width m, number of gradient descent steps J, network depth L. Initialization: Randomly initialize θ0 as described in Section 3.1 and Z0 = λI for t = 1, ..., T do Observe {xt,i}i [N] Compute ˆvt,i = f(xt,i; θt 1) and ut,i = ˆvt,i + γt 1 g(xt,i; θt 1)/ m Z 1 t 1 for i [N] Let St = OS(ut + et) Play super arm St and observe {vt,i}i St Update Zt = Zt 1 + P i St g(xt,i; θt 1)g(xt,i; θt 1) /m Update θt to minimize the loss in Eq.(4) using gradient descent with η for J times Compute γt and et+1 described in lemma 1 end for and Lemma B.3 in Cao & Gu (2019). While the regret analysis for Theorem 1 has its own merit, the technical lemmas for Theorem 1 also provide the building block for the more challenging analysis of the TS algorithm which is presented in Section 4. 3.3. Proof Sketch of Theorem 1 In this section, we provide a proof sketch of the regret upper bound in Theorem 1 and the key lemmas whose proofs are deferred to Appendix A. Recall that we do not make any parametric assumption on the score function, but a neural network is used to approximate the unknown score function. Hence, we need to carefully control the approximation error. To achieve this, we use an over-parametrized neural network, for which the following condition on the neural network width is required. Condition 1. The network width m satisfies m C max{L 3 2 λ 1 2 log(TNL2/δ) 3 T 6N 6L6 log(T 2N 2L/δ) max{λ 4 0 , 1}} , m (log m) 3 CT 4K4L21λ 4(1 + p + CTKL12λ 1 + CT 4K4L18λ 10(λ + TL)6 , where C is a positive absolute constant. Unlike the analysis of the (generalized) linear UCB algorithms (Abbasi-Yadkori et al., 2011; Li et al., 2017), we do not have guarantees on the upper confidence bound ut,i being higher than the expected score v t,i = h(xt,i) due to the approximation error. Therefore, we consider adding the offset term to the the upper confidence bound to ensure optimism. The following lemma shows that the upper confidence bounds ut,i do not deviate far from the expected score h(xt,i) and specifies the value of the offset term. Lemma 1. For any δ (0, 1), suppose the width of the neural network m satisfies Condition 1. Let γt be a positive scaling factor defined as det λI + Γ2,t 2 log δ + + (λ + C1t KL) (1 ηmλ) J 2 p t K/λ + Γ3,t , 1 + CΓ,1t 7 6 K 7 6 L4λ 7 Γ2,t = CΓ,2t 5 3 K 5 3 L4λ 1 Γ3,t = CΓ,3t 7 6 K 7 6 L 7 2 λ 7 log m(1 + p for some constants C1, CΓ,1, CΓ,2, CΓ,3 > 0. If η C2(TKm L+mλ) 1 for some C2 > 0, then for any t [T] and i [N], with probability at least 1 δ we have |ut,i h(xt,i)| 2γt 1 g(xt,i; θt 1)/ m Z 1 t 1 + et , where et is defined for some absolute constants C3, C4 > 0 as follows. 
et := C3γt 1t 1 6 K 1 6 L 7 2 λ 2 + C4t 2 3 K 2 3 λ 2 The next corollary shows that the surrogate upper confidence bound ut,i + et is higher than true mean score h(xt,i) with high probability. Corollary 1. With probability at least 1 δ ut,i + et h(xt,i) . The point of Corollary 1 is that in Zhou et al. (2020), to bound the instantaneous regret, it is enough for the agent to choose only one optimistic action (see Lemma 5.3 in Zhou et al. (2020)), while in our case, the agent has to choose the optimistic super arm in order to bound the instantaneous regret (See Eq. (8) in Proof of Theorem 1). However, in Combinatorial Neural Bandits order to ensure the optimism of the chosen super arm, it is necessary to guarantee the optimism of all individual arms in the chosen super arm, which is represented in Corollary 1. The following technical lemma bounds the sum of weighted norms which is similar to Lemma 4.2 in Qin et al. (2014) and Lemma 5.4 in Zhou et al. (2020). Lemma 2. For any δ (0, 1) suppose the width of the neural network m satisfies Condition 1. If η C1(TKm L + mλ) 1, and λ C2LK, for some positive constant C1, C2 with C2 p maxt,i g(xt,i; θt 1)/ m 2 2/L, then with probability at least 1 δ, for some C3 > 0, g(xt,i; θt 1)/ m 2 Z 1 t 1 2ed log(1+TN/λ) + 2 + C3T 5 3 K 3 2 L4λ 1 Combining these results, we can derive the regret bound in Theorem 1. First, using the Lipschitz continuity of the reward function, we bound the instantaneous regret with the sum of scores for each individual arm within the super arm. By Lemma 1, the upper confidence bound of the overparametrized neural network concentrates well around the true score function. By adding an arm-independent offset term, we can ensure the optimism of the surrogate upper confidence bound. Then, we apply Lemma 2 to derive the desired cumulative regret bound. 4. Combinatorial Neural TS (CN-TS) 4.1. Challenges in Worst-Case Regret Analysis for Combinatorial Actions The challenges in the worst-case (non-Bayesian) regret analysis for TS algorithms with combinatorial actions lie in the difficulty of ensuring optimism of a sampled combinatorial action. The key analytical element to drive a sublinear regret for any TS algorithm, either combinatorial or noncombinatorial, is to show that a sampled action is optimistic with sufficient frequency (Agrawal & Goyal, 2013; Abeille & Lazaric, 2017). With combinatorial actions, however, ensuring optimism becomes more challenging than singleaction selection. In particular, if the structure of the reward and feedback model is not known, one can only resort to hoping that all of the sampled base arms in the chosen super arm St are optimistic, i.e., the scores of all sampled base arms are higher than their expected scores. The probability of such an event can be exponentially small in the size of the super arm K. For example, let the probability that the sampled score of the i-th arm is higher than the corresponding expected score be at least ep, i.e., P(evi > h(xi)) ep. If the sampled score of every arm is optimistic, by the monotonicity property of the reward function, the reward induced by the sampled scores would be larger than the reward induced by the expected score, i.e., R(S, ev) R(S, v ). However, the probability of the event that all the K sampled scores are higher than their corresponding expected scores would be in the order of ep K. Hence, the probability of such an event can be exponentially small in the size of the super arm K. 
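The exponential decay described above is easy to see numerically. The following short Monte Carlo sketch (ours, purely illustrative; the per-arm optimism probability of 0.3 is an arbitrary choice, not a quantity from the paper) estimates the probability that all $K$ independently sampled base-arm scores are simultaneously optimistic.

```python
import numpy as np

rng = np.random.default_rng(0)
n_trials = 200_000
p_single = 0.3          # assumed per-arm optimism probability (illustrative only)

for K in (1, 2, 4, 8):
    # A super arm is "optimistic" only if every one of its K independently
    # sampled base-arm scores exceeds its expected score.
    per_arm_optimistic = rng.random((n_trials, K)) < p_single
    freq = per_arm_optimistic.all(axis=1).mean()
    print(f"K={K:2d}  empirical P(all K optimistic) = {freq:.4f}   p^K = {p_single**K:.4f}")
```

The empirical frequency matches the $\tilde{p}^K$ rate, which is exactly the quantity that the optimistic sampling introduced in Section 4.2 is designed to keep bounded away from zero.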
Note that in the UCB exploration, one can ensure highprobability optimism even with combinatorial actions in a straightforward manner since action selection is deterministic. However, in TS with combinatorial actions, suitable random exploration with provable efficiency is much more challenging to guarantee. This challenge is further exacerbated by the complex analysis based on neural networks that we consider in this work. 4.2. CN-TS Algorithm To address the challenge of TS exploration with combinatorial actions described above, we present CN-TS, a neural network-based TS algorithm. We make two modifications from conventional TS for parametric bandits. First, instead of maintaining an actual Bayesian posterior as in the canonical TS algorithms, CN-TS is a generic randomized algorithm that samples rewards from a Gaussian distribution. The algorithm directly samples estimated rewards from a Gaussian distribution, rather than sampling network parameters this modification is adapted from Zhang et al. (2021). Second, in order to ensure sufficient optimistic sampling in combinatorial action space, we draw multiple M independent score samples for each arm instead of drawing a single sample. Leveraging these multiple samples, we compute the most optimistic (the highest estimated) score for each arm. We demonstrate that implementing this modification effectively ensures the required optimism of samples, formalized in Lemma 3. The algorithm is summarized in Algorithm 2. 4.3. Regret of CN-TS Under the same assumptions introduced in the analysis of CN-UCB, we present the worst-case regret bound for CN-TS in Theorem 2. Theorem 2. Suppose Assumptions 1-3 hold and m satisfies Condition 1. If we run CN-TS with η = C1(TKm L + mλ) , λ = max{1 + 1/T, C2LK}, J = 2 log p λ/TKL/(4 C3T) TKL/( C1λ) ν = B + ρ q ed log(1 + TN/λ) + 2 + 2 log T , B = max{1/(22e π), M = 1 log K/ log(1 ep) for some positive constants C1 > 0, C3 > 0, and C2 Combinatorial Neural Bandits Algorithm 2 Combinatorial Neural Thompson Sampling (CN-TS) Input: Number of rounds T, regularization parameter λ, exploration variance ν, step size η, network width m, number of gradient descent steps J, network depth L, sample size M. Initialization: Randomly initialize θ0 as described in Section 3.1 and e Z0 = λI for t = 1, ..., T do Observe {xt,i}i [N] Compute σ2 t,i = λg(xt,i; θt 1) e Z 1 t 1g(xt,i; θt 1)/m for each i [N] Sample {ev(j) t,i }M j=1 independently from N(f(xt,i; θt 1), ν2σ2 t,i) for each i [N] Compute evt,i = maxj ev(j) t,i for each i [N] Let St = OS(evt + ϵ) Play super arm St and observe {vt,i}i St Update e Zt = e Zt 1 + P i St g(xt,i; θt 1)g(xt,i; θt 1) /m Update θt to minimize the loss (4) using gradient descent with η for J times end for maxt,i g(xt,i; θt 1)/ m 2 2/L, then the cumulative expected regret of CN-TS over horizon T is upper-bounded by R(T) = e O(ed Discussion of Theorem 2. Theorem 2 establishes that the cumulative regret of CN-TS is e O(ed TK). To the best of our knowledge, this is the first TS algorithm with the worst-case regret guarantees for general combinatorial action settings. This is crucial since various combinatorial bandit problems were prohibitive for TS methods due to the difficulty of ensuring the optimism of randomly selected super-action as discussed in Section 4.1. Our result also encompasses the linear feedback model setting, for which, to our best knowledge, a worst-case regret bound has not been proven for TS with combinatorial actions in general. Remark 3. 
Both CN-UCB and CN-TS depend on the condition of network size m. However, our experiments show superior performances of the proposed algorithms even when they are implemented with much smaller m (see Section 5). The large value of m is sufficient for regret analysis, due to the current state of the NTK theory. The same phenomenon is also present in the single action selection version of the neural bandits (Zhang et al., 2021; Zhou et al., 2020). Remark 4. For a clear exposition of main ideas, the knowledge of T is assumed for both CN-UCB and CN-TS. This knowledge was also assumed in the previous neural bandit literature (Zhang et al., 2021; Zhou et al., 2020). We can replace this requirement of knowledge on T by using a doubling technique. We provide modified algorithms that do not depend on such knowledge of T in Appendix F. Remark 5. The proposed optimistic sampling technique can be applied to the regret analysis for TS algorithms with combinatorial actions other than neural bandit settings. Regarding the cost of the optimistic sampling, this salient feature of the algorithm is controlled by the number of mul- tiple samples M. A notable feature is that while this technique provides provably sufficient optimism, the proposed optimistic sampling technique comes at a minimal cost of log M. That is, even if we over-sample by the factor of 2, the additional cost in the regret bound only increases by the additive logarithmic factor, i.e., log 2M = log M + log 2. Also, given that a theoretically suggested value of M as shown in Theorem 2 is only Ω(log K), the regret caused by the optimistic sampling is of O(log log K). 4.4. Proof Sketch of Theorem 2 For any t [T], we define events Eσ t and Eµ t similar to the prior literature on TS (Agrawal & Goyal, 2013; Zhang et al., 2021) defined as follows. Eσ t := {ω Ft+1 | i, |evt,i f(xt,i; θt 1)| βtνσt,i} Eµ t := {ω Ft+1 | i, |f(xt,i; θt 1) h(xt,i)| νσt,i + ϵ} where for some constants {Cϵ,k}4 k=1, ϵ is defined as ϵ := Cϵ,1T 2 3 K 2 3 L3λ 2 + Cϵ,2(1 ηmλ)J/2p + Cϵ,3T 7 6 K 7 6 L4λ 7 log m(1 + p + Cϵ,4T 7 6 K 7 6 λ 2 3 L 9 2 m 1 ed log(1 + TN/λ) + 2 2 log δ . Under event Eσ t , the difference between the optimistic sampled score and the estimated score can be controlled by the score s approximate posterior variance. Under the event Eµ t , the estimated score based on the neural network does not deviate far from the expected score up to the approximate error term. Note that both events Eµ t , Eσ t happen with high probability. The remaining part is a guarantee on the probability of optimism for randomly sampled actions. Lemma 3 shows Combinatorial Neural Bandits that the proposed optimistic sampling ensures a constant probability of optimism. Lemma 3. Suppose we take optimistic samples of size M = 1 log K log(1 ep) where ep := 1/(4e π). Then we have P R(St, evt + ϵ) > R(S t , v t )|Ft, Eµ t where ϵ = [ϵ, . . . , ϵ] RN. Lemma 3 implies that even in the worst case, our randomized action selection still provides optimistic rewards at least with constant frequency. Hence, the regret pertaining to random sampling can be upper-bounded based on this frequent-enough optimism. The complete proof is deferred to Appendix B. 5. Numerical Experiments In this section, we perform numerical evaluations on CN-UCB and CN-TS. For each round in CN-TS, we draw M = 10 samples for each arm. We also present the performances of CN-TS(M=1), which is a special case of CN-TS drawing only one sample per arm. 
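For concreteness, the sampling step that separates CN-TS from CN-TS(M=1) can be sketched as follows. This is our paraphrase of the sampling lines of Algorithm 2, not a full implementation: `f_hat` and `sigma` stand in for the network predictions $f(x_{t,i};\theta_{t-1})$ and the scales $\sigma_{t,i}$, which Algorithm 2 computes from the network gradients and the matrix $\tilde{Z}_{t-1}$, and the small offset $\epsilon$ added before calling the oracle is omitted here.

```python
import numpy as np

def sample_optimistic_scores(f_hat, sigma, nu, M, rng):
    """CN-TS sampling step (sketch): draw M Gaussian samples per arm and keep the largest.

    f_hat : (N,) network estimates f(x_{t,i}; theta_{t-1})
    sigma : (N,) approximate posterior scales sigma_{t,i}
    nu    : exploration variance parameter
    M     : number of samples per arm; M = 1 recovers CN-TS(M=1)
    """
    N = f_hat.shape[0]
    samples = rng.normal(loc=f_hat[None, :], scale=nu * sigma[None, :], size=(M, N))
    return samples.max(axis=0)                 # tilde v_{t,i} = max_j tilde v^{(j)}_{t,i}

def top_k_super_arm(scores, K):
    """Exact oracle for the sum-of-scores reward: pick the K highest sampled scores."""
    return np.argpartition(-scores, K)[:K]

# Stand-in usage with placeholder estimates and scales.
rng = np.random.default_rng(0)
N, K, M, nu = 20, 4, 10, 1.0
f_hat = rng.random(N)
sigma = 0.1 * np.ones(N)
v_tilde = sample_optimistic_scores(f_hat, sigma, nu, M, rng)
S_t = top_k_super_arm(v_tilde, K)
```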
We perform synthetic experiments and measure the cumulative regret of each algorithm. In Experiment 1, we compare our algorithms with contextual combinatorial bandits based on a linear assumption: Comb Lin UCB and Comb Lin TS (Wen et al., 2015). In Experiment 2, we demonstrate the empirical performances of our algorithms as the context dimension d increases. The contexts given to the agent in each round are randomly generated from a unit ball. The dimension of each context is d = 80 for Experiment 1, and d = 40, 80, 120 for Experiment 2. For each round, the agent of each algorithm chooses K = 4 arms among N = 20. Similar to the experiments in Zhou et al. (2020), we assume three unknown score functions h1(xt,i) = x t,ia , h2(xt,i) = (x t,ia)2 , h3(xt,i) = cos(πx t,ia) , where a has the same dimension of the context and is randomly generated from a unit ball and remains fixed during the horizon. We suppose a top-K problem and use the sum of scores R(St, vt) = P i St vt,i as the reward function. However, as mentioned in Remark 1, the reward function can be any function that satisfies Assumptions 1 and 2. For example, R(St, vt) can be the quality of positions of a position-based click model (Lattimore & Szepesvari, 2020) or the expected revenue given by a multinomial logit (MNL) choice model (Oh & Iyengar, 2019) although the regret bound under the MNL choice model is not provided under the current theoretical result. We use regularization parameter λ = 1 for all methods, confidence bound coefficient α = 1 for Comb Lin UCB and γ = 1 for CN-UCB, and exploration variance ν = 1 for CN-TS, CN-TS(M=1) and Comb Lin TS. To estimate the score of each arm, we design a neural network with depth L = 2 and hidden layer width m = 100. The number of parameters is p = md + m = 8100 for Experiment 1, and p = 4100, 8100, 12100 for Experiment 2. The activation function is the rectified linear unit (Re LU). We use the loss function in Eq.(4) and use stochastic gradient descent with a batch of 100 super arms. We train the neural network every 10 rounds. The training epoch is 100, and the learning rate is 0.01. Experiment 1. We evaluate the cumulative regret of the algorithms for each score function h. For score functions h1(x) and h2(x), we set the number of rounds, T, to 2000, while T is set to 4000 for h3(x). We then present the average results, derived from 20 independent runs for each score function instance. The results are depicted in Figure 1. Our proposed algorithms show significant improvements over those based on linear models. In contrast to linear baselines, the cumulative regrets for CN-UCB and CN-TS demonstrate a sub-linear trend, even when the score function is quadratic or non-linear. These findings suggest that our algorithms can be readily applied to a diverse range of complex reward functions. Experiment 2. We present the results of our proposed algorithms for context dimensions d = 40, 80, 120. To highlight the advantage of optimistic sampling, we show a comparison between CN-TS and CN-TS(M=1). For these experiments, we utilize the quadratic score function h2(x). The number of rounds, T, is set to 2000 for d = 40, 80 and 4000 for d = 120. Similar to Experiment 1, the results represent averages derived from 20 independent runs. Figure 2 demonstrates the proficient performance of our algorithms, even as the feature dimension increases. The empirical results suggest a scalability of our algorithms in d that is no greater than linear. 
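As a compact reference for the setup shared by Experiments 1 and 2 (contexts and the parameter $a$ drawn from the unit ball, the three score functions, and the top-$K$ sum-of-scores reward), a minimal sketch of the data-generating environment is given below. This is our reconstruction, not code from the paper; in particular, the uniform-in-the-ball sampler is an assumption about the exact sampling distribution, and all function names are ours.

```python
import numpy as np

def sample_unit_ball(n, d, rng):
    """Draw n vectors uniformly from the d-dimensional unit ball (assumed distribution)."""
    x = rng.standard_normal((n, d))
    x /= np.linalg.norm(x, axis=1, keepdims=True)
    r = rng.random(n) ** (1.0 / d)
    return x * r[:, None]

def make_score_fn(kind, a):
    """The three unknown score functions used in the experiments."""
    if kind == "h1":
        return lambda X: X @ a                      # h1(x) = x^T a
    if kind == "h2":
        return lambda X: (X @ a) ** 2               # h2(x) = (x^T a)^2
    if kind == "h3":
        return lambda X: np.cos(np.pi * (X @ a))    # h3(x) = cos(pi * x^T a)
    raise ValueError(kind)

rng = np.random.default_rng(0)
d, N, K = 80, 20, 4
a = sample_unit_ball(1, d, rng)[0]                  # fixed parameter of the score function
h = make_score_fn("h2", a)
X_t = sample_unit_ball(N, d, rng)                   # contexts observed at round t
scores = h(X_t)
S_star = np.argsort(-scores)[:K]                    # offline-optimal top-K super arm
reward_star = scores[S_star].sum()                  # R(S*, v*) under the sum-of-scores reward
```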
Furthermore, when d is large, CN-TS exhibits a marginally lower cumulative regret compared to CN-TS(M=1). This observation substantiates our assertion that CN-TS ensures a constant probability of optimism by drawing multiple M samples. 6. Conclusion In this paper, we study a general class of a contextual combinatorial bandit problem, where the model of the score function is unknown. Approximating the score function with deep neural networks, we propose two algorithms: CN-UCB and CN-TS. We prove that CN-UCB achieves e O(ed T) or e O( p ed TK) regret. For CN-TS, we adapt an optimistic sampling technique to ensure the optimism of the sampled combinatorial action, establish a worst-case (frequentist) regret Combinatorial Neural Bandits 0 250 500 750 1000 1250 1500 1750 2000 Cumulative Regret h1(x) = x a, d=80 CN-UCB CN-TS CN-TS(M=1) Comb Lin UCB Comb Lin TS 0 250 500 750 1000 1250 1500 1750 2000 Cumulative Regret h2(x) = (x a)2, d=80 CN-UCB CN-TS CN-TS(M=1) Comb Lin UCB Comb Lin TS 0 500 1000 1500 2000 2500 3000 3500 4000 Cumulative Regret h3(x) = cos( x a), d=80 CN-UCB CN-TS CN-TS(M=1) Comb Lin UCB Comb Lin TS Figure 1. Cumulative regret of CN-UCB and CN-TS compared with algorithms based on linear models. 0 250 500 750 1000 1250 1500 1750 2000 Cumulative Regret h2(x) = (x a)2, d=40 CN-UCB CN-TS CN-TS(M=1) 0 250 500 750 1000 1250 1500 1750 2000 Cumulative Regret h2(x) = (x a)2, d=80 CN-UCB CN-TS CN-TS(M=1) 0 500 1000 1500 2000 2500 3000 3500 4000 Cumulative Regret h2(x) = (x a)2, d=120 CN-UCB CN-TS CN-TS(M=1) Figure 2. Experiment results of CN-UCB, CN-TS, and CN-TS(M=1) as context dimension d increases. TK). To our knowledge, these are the first combinatorial neural bandit algorithms with sub-linear regret guarantees. In particular, CN-TS is the first general contextual combinatorial Thompson sampling algorithm with the worst-case regret guarantees. Compared to the benchmark methods, our proposed methods exhibit consistent competitive performances, hence achieving both provable efficiency and practicality. Acknowledgements This work was supported by the New Faculty Startup Fund from Seoul National University and the National Research Foundation of Korea(NRF) grant funded by the Korea government(MSIT) (No. 2022R1C1C1006859, No. 2022R1A4A103057912, No. 2021M3E5D2A01024795). Abbasi-Yadkori, Y., P al, D., and Szepesv ari, C. Improved algorithms for linear stochastic bandits. In Advances in Neural Information Processing Systems, pp. 2312 2320, 2011. Abe, N. and Long, P. M. Associative reinforcement learn- ing using linear probabilistic concepts. In International Conference on Machine Learning, pp. 3 11, 1999. Abeille, M. and Lazaric, A. Linear thompson sampling revisited. In Artificial Intelligence and Statistics, pp. 176 184. PMLR, 2017. Abramowitz, M. and Stegun, I. A. (eds.). Handbook of Mathematical Functions with Formulas, Graphs, and Mathematical Tables. Dover, New York, 1964. Agrawal, S. and Goyal, N. Thompson sampling for contextual bandits with linear payoffs. In International Conference on Machine Learning, pp. 127 135. PMLR, 2013. Allen-Zhu, Z., Li, Y., and Liang, Y. Learning and generalization in overparameterized neural networks, going beyond two layers. In Advances in Neural Information Processing Systems, volume 32. Curran Associates, Inc., 2019a. Allen-Zhu, Z., Li, Y., and Song, Z. A convergence theory for deep learning via over-parameterization, 2019b. Arora, S., Du, S., Hu, W., Li, Z., and Wang, R. 
Fine-grained analysis of optimization and generalization for overparameterized two-layer neural networks. In International Combinatorial Neural Bandits Conference on Machine Learning, pp. 322 332. PMLR, 2019. Audibert, J.-Y., Bubeck, S., and Lugosi, G. Regret in online combinatorial optimization. Mathematics of Operations Research, 39(1):31 45, 2014. Auer, P. Using confidence bounds for exploitationexploration trade-offs. Journal of Machine Learning Research, 3(Nov):397 422, 2002. Besson, L. and Kaufmann, E. What doubling tricks can and can t do for multi-armed bandits. ar Xiv preprint ar Xiv:1803.06971, 2018. Cao, Y. and Gu, Q. Generalization bounds of stochastic gradient descent for wide and deep neural networks. In Advances in Neural Information Processing Systems, volume 32, pp. 10836 10846, 2019. Chen, L., Xu, J., and Lu, Z. Contextual combinatorial multiarmed bandits with volatile arms and submodular reward. Advances in Neural Information Processing Systems, 31, 2018. Chu, W., Li, L., Reyzin, L., and Schapire, R. Contextual bandits with linear payoff functions. In Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, pp. 208 214, 2011. Davis, J. M., Gallego, G., and Topaloglu, H. Assortment optimization under variants of the nested logit model. Mathematics of Operations Research, 62(2):250 273, 2014. Du, S. S., Zhai, X., Poczos, B., and Singh, A. Gradient descent provably optimizes over-parameterized neural networks. In International Conference on Learning Representations, 2019. URL https://openreview.net/ forum?id=S1e K3i09YQ. Filippi, S., Cappe, O., Garivier, A., and Szepesv ari, C. Parametric bandits: The generalized linear case. In Advances in Neural Information Processing Systems, pp. 586 594, 2010. Goodfellow, I., Bengio, Y., Courville, A., and Bengio, Y. Deep learning, volume 1. MIT press Cambridge, 2016. Jacot, A., Gabriel, F., and Hongler, C. Neural tangent kernel: Convergence and generalization in neural networks. In Advances in Neural Information Processing Systems, volume 31, pp. 8571 8580, 2018. Kveton, B., Szepesvari, C., Wen, Z., and Ashkan, A. Cascading bandits: Learning to rank in the cascade model. In International Conference on Machine Learning, pp. 767 776, 2015. Lai, T. L. and Robbins, H. Asymptotically efficient adaptive allocation rules. Advances in applied mathematics, 6(1): 4 22, 1985. Lattimore, T. and Szepesvari, C. Bandit Algorithms. Cambridge: Cambridge University Press, 2020. Le Cun, Y., Bengio, Y., and Hinton, G. Deep learning. nature, 521(7553):436 444, 2015. Li, L., Lu, Y., and Zhou, D. Provably optimal algorithms for generalized linear contextual bandits. In International Conference on Machine Learning, pp. 2071 2080, 2017. Li, S. and Zhang, S. Online clustering of contextual cascading bandits. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 32, 2018. Li, S., Wang, B., Zhang, S., and Chen, W. Contextual combinatorial cascading bandits. In International conference on machine learning, pp. 1245 1253. PMLR, 2016. Li, S., Lattimore, T., and Szepesv ari, C. Online learning to rank with features. In International Conference on Machine Learning, pp. 3856 3865. PMLR, 2019. Nika, A., Elahi, S., and Tekin, C. Contextual combinatorial volatile multi-armed bandit with adaptive discretization. In International Conference on Artificial Intelligence and Statistics, pp. 1486 1496. PMLR, 2020. Oh, M.-h. and Iyengar, G. Thompson sampling for multinomial logit contextual bandits. 
In Advances in Neural Information Processing Systems, pp. 3151 3161, 2019. Qin, L., Chen, S., and Zhu, X. Contextual combinatorial bandit and its application on diversified online recommendation. In Proceedings of the 2014 SIAM International Conference on Data Mining, pp. 461 469. SIAM, 2014. Rusmevichientong, P. and Tsitsiklis, J. N. Linearly parameterized bandits. Mathematics of Operations Research, 35 (2):395 411, 2010. Rusmevichientong, P., Shen, Z.-J. M., and Shmoys, D. B. Dynamic assortment optimization with a multinomial logit choice model and capacity constraint. Operations research, 58(6):1666 1680, 2010. Silver, D., Huang, A., Maddison, C. J., Guez, A., Sifre, L., Van Den Driessche, G., Schrittwieser, J., Antonoglou, I., Panneershelvam, V., Lanctot, M., et al. Mastering the game of go with deep neural networks and tree search. nature, 529(7587):484 489, 2016. Thompson, W. R. On the likelihood that one unknown probability exceeds another in view of the evidence of two samples. Biometrika, 25(3/4):285 294, 1933. Combinatorial Neural Bandits Wen, Z., Kveton, B., and Ashkan, A. Efficient learning in large-scale combinatorial semi-bandits. In International Conference on Machine Learning, pp. 1113 1122, 2015. Zhang, W., Zhou, D., Li, L., and Gu, Q. Neural thompson sampling. In International Conference on Learning Representation (ICLR), 2021. Zhou, D., Li, L., and Gu, Q. Neural contextual bandits with ucb-based exploration. In International Conference on Machine Learning, pp. 11492 11502. PMLR, 2020. Zong, S., Ni, H., Sung, K., Ke, N. R., Wen, Z., and Kveton, B. Cascading bandits for large-scale recommendation problems. In Proceedings of the Thirty-Second Conference on Uncertainty in Artificial Intelligence, UAI 16, pp. 835 844, 2016. Combinatorial Neural Bandits Nomenclature v t,i = h(xt,i) Expected score of arm i at time t K The size of super arm ρ Sub-Gaussian parameter N The number of arms T The number of rounds ξt,i Sub-Gaussian noise R(S, v) Reward for the super arm S based on scores v S t The offline optimal super arm at time t St The super arm played at time t ed The effective dimension J The number of gradient descent steps g(xt,i; θt 1) Gradient of the neural network for arm i at time t ˆvt,i = f(xt,i; θt 1) Estimated score of arm i at time t L The number of hidden layer of the neural network m The hidden layer width of the neural network θt The parameter of the neural network at time t 2h H 1h, 1/(22e π)} H The Neural Tangent Kernel matrix λ Regularziation parameter λ0 NTK matrix parameter η Step size for gradient descent p The number of parameters of the neural network Eµ t Event i [N] :| f(xt,i; θt 1) h(xt,i) | νσt,i + ϵ Eσ t Event i [N] :| evt,i f(xt,i; θt 1) | βtνσt,i ν = B + ρ q ed log(1 + TN/λ) + 2 + 2 log T ep = 1/(4e π) σ2 t,i = λg(xt,i; θt 1) e Z 1 t 1g(xt,i; θt 1)/m βt = 4 log t + 2 log K + 2 log M evt,i = maxj ev(j) t,i ev(j) t,i j-th sampled score generated by from distribution N f(xt,i; θt 1), νσ2 t,i M = 1 log K/ log(1 ep) ϵ Approximation error of the neural network e Zt = λI + Pt k=1 P i Sk g(xk,i; θk 1)g(xk,i; θk 1) /m Combinatorial Neural Bandits γt Exploration scaling parameter for upper confidence bound 1 + CΓ,1t 7 6 KL4λ 7 Γ2,t = CΓ,2t 5 3 K 3 2 L4λ 1 Γ3,t = CΓ,3t 7 6 K 7 6 L 7 2 λ 7 6 log m(1 + p et Offset term added to the upper confidence bound at time t Zt = λI + Pt k=1 P i Sk g(xk,i; θk 1)g(xk,i; θk 1) /m ut,i Upper confidence bound of the expected score for arm i at time t Combinatorial Neural Bandits A. 
Regret Bound for CN-UCB In this section, we present all the necessary technical lemmas and their proof, followed by the proof of Theorem 1. A.1. Proof of Lemma 1 We introduce the following technical lemmas which are necessary for proof of Lemma 1. Lemma 4 (Lemma 5.1 in Zhou et al. (2020)). For any δ (0, 1), suppose that there exists a positive constant C such that m CT 4N 4L6λ 1 0 log(T 2N 2L/δ) . Then, with probability at least 1 δ, there exists a θ Rp such that for all i [TN], h(xk) = g(xk; θ0) (θ θ0) , m θ θ0 2 where H is the NTK matrix defined in Definition 1 and h = [h(xk)]T N k=1. Lemma 5 (Lemma 4.1 in Cao & Gu (2019)). Suppose that there exist C1, C2 > 0 such that for any δ (0, 1), τ satisfies 2 log(TNL2/δ) 3 2 τ C2L 6 (log m) 3 Then, with probability at least 1 δ, for all eθ, ˆθ satisfying eθ θ0 2 τ, ˆθ θ0 2 τ and k [TN], we have f(xk; eθ) f(xk; ˆθ) g(xk; ˆθ) (eθ ˆθ) C3τ 4 3 L3p where C3 0 is an absolute constant. Lemma 6 (Lemma 5 in Allen-Zhu et al. (2019b)). For any δ (0, 1), suppose that there exist C1, C2 > 0 such that if τ satisfies C1m 3 2 max n (log m) 3 2 , (log(TN/δ)) 3 2 o τ C2L 9 2 (log m) 3 . Then, with probability at least 1 δ, for all θ satisfying θ θ0 2 τ and k [TN] we have g(xk; θ) g(xk; θ0) 2 C3 p log mτ 1 3 L3 g(xk; θ0) 2 , where C3 > 0 is an absolute constant. Lemma 7 (Lemma B.3 in Cao & Gu (2019)). Suppose that there exist C1, C2 > 0 such that for any δ (0, 1), τ satisfies 2 log(TNL2/δ) 3 2 τ C2L 6 (log m) 3 Then with probability at least 1 δ, for any θ satisfying θ θ0 2 τ and k [TN] we have g(xk; θ) F C3 where C3 > 0 is an absolute constant. Proof of Lemma 1. First of all, note that because m satisfies Condition 1, the required conditions in Lemma 4-7 are satisfied. For any t [T], i [N], by definition of ut,i and v t,i, we have ut,i v t,i = f(xt,i; θt 1) + γt 1 g(xt,i; θt 1)/ m Z 1 t h(xt,i) = f(xt,i; θt 1) + γt 1 g(xt,i; θt 1)/ m Z 1 t 1 g(xt,i; θ0) (θ θ0) f(xt,i; θt 1) g(xt,i; θ0) (θ θ0) | {z } I0 + γt 1 g(xt,i; θt 1)/ m Z 1 t 1 | {z } I1 Combinatorial Neural Bandits where the second equality holds due to Lemma 4, and the inequality follows from the triangle inequality. For I0, we have I0 = f(xt,i; θt 1) g(xt,i; θ0) (θ θ0 + θt 1 θt 1) = f(xt,i; θt 1) f(xt,i; θ0) g(xt,i; θ0) (θt 1 θ0) g(xt,i; θ0) (θ θt 1) f(xt,i; θt 1) f(xt,i; θ0) g(xt,i; θ0) (θt 1 θ0) | {z } I2 + g(xt,i; θ0) (θ θt 1) | {z } I3 where the second equality holds due to the initial condition of f, i.e., f(x; θ0) = 0 for all x, and the inequality comes from the triangle inequality. To bound I2, we have I2 = f(xt,i; θt 1) f(xt,i; θ0) g(xt,i; θ0) (θt 1 θ0) C 3τ 4 3 L3p = C3t 2 3 K 2 3 λ 2 where the first inequality follows from Lemma 5 for some constant C 3 > 0, and the second equality is due to setting τ of Lemma 5 as 2 p t K/(mλ) of Lemma 11, i.e., τ = 2 p To bound I3, we have I3 = g(xt,i; θ0) (θ θt 1) g(xt,i; θ0) Z 1 t 1 θ θt 1 Zt 1 γt 1 m g(xt,i; θ0) Z 1 t 1 where the first inequality holds due to the Cauchy-Schwarz inequality, and the second inequality follows from Lemma 11. 
Combining the results, we have ut,i v t,i I2 + I3 + I1 C3t 2 3 K 2 3 λ 2 log m + γt 1 m g(xt,i; θ0) Z 1 t 1 + γt 1 m g(xt,i; θt 1) Z 1 t 1 = C3t 2 3 K 2 3 λ 2 log m + γt 1 m g(xt,i; θ0) Z 1 t 1 + g(xt,i; θt 1) Z 1 t 1 Now I4 can be bounded as I4 = g(xt,i; θ0) + g(xt,i; θt 1) g(xt,i; θt 1) Z 1 t 1 + g(xt,i; θt 1) Z 1 t 1 g(xt,i; θ0) g(xt,i; θt 1) Z 1 t 1 + 2 g(xt,i; θt 1) Z 1 t 1 λ g(xt,i; θ0) g(xt,i; θt 1) 2 + 2 g(xt,i; θt 1) Z 1 t 1 3 L3 g(xt,i; θ0) 2 + 2 g(xt,i; θt 1) Z 1 t 1 C2t 1 6 K 1 6 λ 2 3 L 7 2 m 1 3 p log m + 2 g(xt,i; θt 1) Z 1 t 1 where the first inequality follows from the triangle inequality, the second inequality holds due to the property x Z 1 t 1 λ x 2, the third inequality follows from Lemma 6 with τ = 2 p t K/(mλ) in Lemma 11, and the last inequality holds due to Lemma 7. Finally, by taking a union bound about δ, with probability at least 1 5δ, we have ut,i v t,i C3t 2 3 K 2 3 λ 2 log m + γt 1 m I4 2γt 1 g(xt,i; θt 1)/ m Z 1 t 1 + C2γt 1t 1 6 K 1 6 λ 2 3 L 7 2 m 1 + C3t 2 3 K 2 3 λ 2 Combinatorial Neural Bandits In particular, if we define et = C2γt 1t 1 6 K 1 6 λ 2 3 L 7 2 m 1 log m + C3t 2 3 K 2 3 λ 2 and we replace δ with δ/5, then we have the desired result. A.2. Proof of Corollary 1 Proof of Corollary 1. Suppose that Lemma 1 holds. Let us denote ut,i = ut,i + C2γt 1t 1 6 K 1 6 λ 2 3 L 7 2 m 1 log m + C3t 2 3 K 2 3 λ 2 log m | {z } et Then, we have ut,i v t,i = f(xt,i; θt 1) + γt 1 g(xt,i; θt 1)/ m Z 1 t 1 + et g(xt,i; θ0) (θ θ0) f(xt,i; θt 1) g(xt,i; θ0) (θ θ0) | {z } I0 +γt 1 g(xt,i; θt 1)/ m Z 1 t 1 + et C3t 2 3 K 2 3 λ 2 log m | {z } I2 γt 1 m g(xt,i; θ0) Z 1 t 1 | {z } I3 +γt 1 m g(xt,i; θt 1) Z 1 t 1 + et + γt 1 m g(xt,i; θt 1) Z 1 t 1 γt 1 m g(xt,i; θt 1) Z 1 t 1 = C3t 2 3 K 2 3 λ 2 log m + 2γt 1 m g(xt,i; θt 1) Z 1 t 1 + et g(xt,i; θ0) Z 1 t 1 + g(xt,i; θt 1) Z 1 t 1 | {z } I4 0 , where the first equation comes from Lemma 4 and the second inequality follows from Eq.(6). A.3. Proof of Lemma 2 The following lemma is necessary for our proof. Lemma 8 (Lemma B.7 in Zhang et al. (2021)). For any t [T], suppose that there exists C > 0 such that the network width m satisfies m CT 6N 6L6 log(TLN/δ). Then with probability at least 1 δ, log det(I + λ 1Kt) log det(I + λ 1H) + 1 , where Kt = J t Jt/m, Jt = [g(x1,a11; θ0), , g(xt,at K; θ0)] Rp t K, and atk means k-th action in the super arm St at time t, i.e., St := {at1, . . . , at K}. Combinatorial Neural Bandits Proof. Note that we have det(ZT ) = det i ST g(x T,i; θT 1)g(x T,i; θT 1) /m 2 T 1(g(x T,i; θT 1)/ m)(g(x T,i; θT 1)/ m) Z 1 = det(ZT 1) det 2 T 1g(x T,i; θT 1)/ m Z 1 2 T 1g(x T,i; θT 1)/ m ! = det(ZT 1) g(x T,i; θT 1)/ m 2 Z 1 T 1 g(xt,i; θt 1)/ m 2 Z 1 t 1 Then, we have log det(ZT ) g(xt,i; θt 1)/ m 2 Z 1 t 1 On the other hand, for any t [T], we have X g(xt,i; θt 1)/ m 2 Z 1 t 1 X 1 λ g(xt,i; θt 1) 2 2 /m where the first inequality comes from the property x 2 A 1 x 2 2 /λmin(A) for any positive definite matrix A, the constant C2 of the second inequality can be derived by Lemma 7, and the last inequality holds due to the assumption of λ. Then using the inequality, x 2 log(1 + x) for any x [0, 1], we have g(xt,i; θt 1)/ m 2 Z 1 t 1 2 g(xt,i; θt 1)/ m 2 Z 1 t 1 2 log det ZT 2 log det ZT + C3T 5 3 K 5 3 L4λ 1 where the last inequality holds due to Lemma 13 for some C3 > 0. 
Furthermore, since we have det λI = log det ZT (λI) 1 i St g(xt,i; θ0)g(xt,i; θ0) /(mλ) = log det I + λ 1 JT J T /m = log det I + λ 1 J T JT /m = log det I + λ 1KT log det I + λ 1H + 1 = ed log(1 + TN/λ) + 1 , (7) Combinatorial Neural Bandits where the first, second equation and the first inequality holds naively, the third equality uses the definition of Jt, the fourth equality holds since for any matrix A Mn(R) the nonzero eigenvalues of I + AA and I + A A are same, which means det(I + AA ) = det(I + A A), the first inequality follows from Lemma 8, and the last equality uses the definition of effective dimension in Definition 2. Finally, by taking a union bound about δ, with probability at least 1 2δ, we have g(xt,i; θt 1)/ m 2 Z 1 t 1 2ed log(1 + TN/λ) + 2 + C3T 5 3 K 5 3 L4λ 1 By replacing δ with δ/2, we have the desired result. A.4. Proof of Theorem 1 Proof of Theorem 1. We define the following event: E1 := n |ut,i h(xt,i)| 2γt 1 g(xt,i; θt 1)/ m Z 1 t 1 + et, i [N], 1 t T o , g(xt,i; θt 1)/ m 2 Z 1 t 1 2ed log(1 + TN/λ) + 2 + C3T 5 3 K 5 3 L4λ 1 E := E1 E2 . Then, we decompose the cumulative expected regret into two components: when E occurs and when E does not happen. t=1 (R(S t , v t ) R(St, v t )) 1I(E) t=1 (R(S t , v t ) R(St, v t )) 1I(Ec) t=1 (R(S t , v t ) R(St, v t )) 1I(E) | {z } It where the inequality holds since we have E holds with probability at least 1 T 1 by Lemma 1 and Lemma 2. To bound It, we have It R(S t , v t ) R(St, v t ) R(S t , ut + et) R(St, v t ) (8) R(St, ut + et) R(St, v t ) ut,i + et v t,i 2 2γt 1 g(xt,i; θt 1)/ m Z 1 t 1 + 2et 2 max n γt 1 g(xt,i; θt 1)/ m Z 1 t 1 , et o 2 , (9) where Cℓis a Lipschitz constant, the first inequality holds due to the monotonicity of the reward function, the second inequality comes from the choice of the oracle, i.e., St = OS(ut + et), the third inequality follows from the Lipschitz continuity of the reward function, the fourth inequality comes from Lemma 1 and the last inequality holds due to the property, a + b 2 max{a, b}. Combinatorial Neural Bandits On the other hand, if we denote Ai := γt 1 g(xt,i; θt 1)/ m Z 1 t 1, then we have max n γt 1 g(xt,i; θt 1)/ m Z 1 t 1 , et o 2 = s X Ai et A2 i + X i St A2 i + X i St A2 i + s X i St A2 i + By substituting Eq.(10) into Eq.(9), we have R(S t , v t ) R(St, v t ) 4Cℓ i St γ2 t 1 g(xt,i; θt 1)/ m 2 Z 1 t 1 + Therefore, by summing Eq.(11) over all t [T], we have i St γ2 t 1 g(xt,i; θt 1)/ m 2 Z 1 t 1 + g(xt,i; θt 1)/ m 2 Z 1 t 1 + 4Cℓ g(xt,i; θt 1)/ m 2 Z 1 t 1 + 4Cℓ T 2ed log(1 + TN/λ) + 2 + C1T 5 3 K 5 3 L4λ 1 log m + 4Cℓ KTe T , (12) where the second inequality holds since γt γT and et e T , the third inequality follows from the Cauchy-Schwarz inequality and the last inequality comes from Lemma 2 with an absolute constant C1 > 0. Meanwhile, we bound γT as follows: det λI + CΓ,2T 5 3 K 5 3 L4λ 1 log m 2 log δ + + (λ + C2TKL) (1 ηmλ) J 2 p TK/λ + Γ3,T det λI + 2CΓ,2T 5 3 K 5 3 L4λ 1 log m 2 log δ + + (λ + C2TKL) (1 ηmλ) J 2 p TK/λ + Γ3,T ed log(1 + TN/λ) + 1 + 2CΓ,2T 5 3 K 5 3 L4λ 1 log m 2 log δ + + (λ + C2TKL) (1 ηmλ) J 2 p TK/λ + Γ3,T (13) where the fist inequality holds due to Lemma 13, the second inequality holds due to Eq.(7). 
Combinatorial Neural Bandits Note that by setting η = C1(TKm L + mλ) 1 and J = 2 log λ/T K λ+ C2T KL C1λ , we have (λ + C2TKL)(1 ηmλ) J 2 p By choosing sufficiently large m such that 1 + CΓ,1T 7 6 K 7 6 L4λ 7 Γ2,T = CΓ,2T 5 3 K 5 3 L4λ 1 C1T 5 3 K 3 2 L4λ 1 (λ + C2TKL)Γ3,T = (λ + C2TKL)CΓ,3T 7 6 K 7 6 L 7 2 λ 7 log m(1 + p Te T γT + 1 2ρ q ed log(1 + TN/λ) + 3 2 log δ + 2 and combining all the results, R(T) can be bounded by T 2ed log(1 + TN/λ) + 3 2ρ q ed log(1 + TN/λ) + 3 2 log δ + 2 ed log(1 + TN/λ) + 3 2 log δ + 2 B. Regret Bound for CN-TS B.1. Proof Lemma 3 proof of Lemma 3. For given Ft, since ev(j) t,i N(f(xt,i; θt 1), ν2σ2 t,i), we have P max j ev(j) t,i + ϵ > h(xt,i) | Ft, Eµ t = 1 P ev(j) t,i + ϵ h(xt,i), j [M] | Ft, Eµ t ev(j) t,i f(xt,i; θt 1) + ϵ νσt,i h(xt,i) f(xt,i; θt 1) νσt,i , j [M] | Ft, Eµ t ev(j) t,i f(xt,i; θt 1) + ϵ νσt,i |h(xt,i) f(xt,i; θt 1)| νσt,i , j [M] | Ft, Eµ t ev(j) t,i f(xt,i; θt 1) νσt,i |h(xt,i) f(xt,i; θt 1)| ϵ νσt,i , j [M] | Ft, Eµ t = 1 P Zj |h(xt,i) f(xt,i; θt 1)| ϵ νσt,i , j [M] | Ft, Eµ t where the first inequality is due to a |a|, for the last equality we denote Zj as a standard normal random variable. Note that under the event Eµ t , we have |f(xt,i; θt 1) h(xt,i)| νσt,i + ϵ for all i [N]. Hence, under the event Eµ t , |h(xt,i) f(xt,i; θt 1)| ϵ νσt,i νσt,i + ϵ ϵ νσt,i = 1 . Then, it follows that P max j ev(j) t,i + ϵ > h(xt,i) | Ft, Eµ t 1 [P (Z 1)]M . Combinatorial Neural Bandits Using the anti-concentration inequality in Lemma 9, we have P(Z 1) 1 ep where ep := 1/(4e π). Then finally we have P R(St, evt + ϵ) R(S t , v t ) | Ft, Eµ t P R(S t , evt + ϵ) R(S t , v t ) | Ft, Eµ t P evt,i + ϵ h(xt,i), i S t | Ft, Eµ t i St P evt,i + ϵ h(xt,i) | Ft, Eµ t 1 [P(Z 1)]M K 1 (1 ep)M K 1 K (1 ep)M where the first inequality holds due to the choice of the oracle, the second inequality comes from the monotonicity of the reward function, the third inequality uses the Bernoulli s inequality, and the last inequality comes from the choice of M = 1 log K log(1 ep) , which means (1 ep)M 1 B.2. Proof of Theorem 2 Proof of Theorem 2. First of all, we decompose the expected cumulative regret as follows: t=1 E [R(S t , v t ) R(St, evt + ϵ)] | {z } R1(T ) t=1 E [R(St, evt + ϵ) R(St, v t )] | {z } R2(T ) From now on, we derive the bounds for R1(T) and R2(T) respectively. Bounding R2(T) First we decompose R2(T): R(St, evt + ϵ) R(St, ˆvt + ϵ) | {z } I2 R(St, ˆvt + ϵ) R(St, v t ) | {z } I1 Combinatorial Neural Bandits For I1, we have |R(St, ˆvt + ϵ) R(St, v t )| C(1) 0 i St (f(xt,i; θt 1) + ϵ h(xt,i))2 i St (νσt,i + 2ϵ)2 i St (2 max{νσt,i, 2ϵ})2 i St (νσt,i)2 + X i St (νσt,i)2 + s X i St σ2 t,i + 2ϵ where the first inequality holds due to the Lipschitz continuity for a constant C(1) 0 > 0, the second inequality holds due to the event Eµ t holds with high probability, the third inequality follows from the property that a + b 2 max{a, b}, and the last inequality uses the fact that b for any a, b 0. On the other hand, for I2 we have |R(St, evt + ϵ) R(St, ˆvt + ϵ)| C(2) 0 i St (evt,i f(xt,i; θt 1))2 i St β2 t ν2σ2 t,i = C(2) 0 βtν s X i St σ2 t,i , where the first inequality holds for some Lipschitz continuity constant C(2) 0 > 0, the second inequality holds due to the event Eσ t holds with high probability. 
Combinatorial Neural Bandits By combining the bounds of I1 and I2, we derive the bound for R2(T) as follows: i St σ2 t,i + 2ϵ i St σ2 t,i = C0ν(βT + 2)E i St σ2 t,i C0ν(βT + 2)E i St σ2 t,i = C0ν(βT + 2)E g(xt,i; θt 1)/ m 2 e Z 1 t C0ν(βT + 2) Tλ 2ed log(1 + TN/λ) + 2 + C1T 5 3 K 3 2 L4λ 1 where C1 > 0 is a constant, the first inequality uses βt βT and C0 = max{C(1) 0 , C(2) 0 }, the second inequality follows from the Cauchy-Schwarz inequality, and the last inequality holds due to Lemma 2. Bounding R1(T) Note that a sufficient condition for ensuring the success of CN-TS is to show that the probability of sampling being optimistic is high enough. Lemma 3 gives a lower bound of the probability that the reward induced by sampled scores is larger than the reward induced by the expected scores up to the approximation error. For our analysis, first we define e Vt the set of concentrated samples for which the reward induced by sampled scores concentrate appropriately to the reward induced by the estimated scores. Also, we define the set of optimistic samples e Vopt t which coinciding with e Vt. {ev(j) t,i | i [N]}M j=1 =: v1:M t | R(St, evt + ϵ) R(St, ˆvt + ϵ) C0 i St (βtνσt,i)2 e Vopt t := n {ev(j) t,i | i [N]}M j=1 =: v1:M t | R(St, evt + ϵ) > R(S t , v t ) o e Vt . Also, note that the event Et := Eσ t Eµ t , which means Et = {|f(xt,i; θt 1) h(xt,i)| νσt,i + ϵ, i [N]} {|evt,i f(xt,i; θt 1)| βtνσt,i, i [N]} . For our notations, we denote St as the super arm induced by the sampled score v1:M t e Vt and ϵ. Also we represent R(S, v1:M t + ϵ) the reward under the sampled score v1:M t and ϵ. Also, we define St as the super arm induced by vt e Vopt t and ϵ. Similarly we can define R(S, vt + ϵ). Recall that St = argmax R(S, evt + ϵ). Then, for any v1:M t e Vt, we have R(S t , v t ) R(St, evt + ϵ) 1I (Et) R(S t , v t ) inf v1:M t e Vt max S R(S, v1:M t + ϵ) Note that we can decompose R(S t , v t ) R(St, evt + ϵ) = R(S t , v t ) R(St, evt + ϵ) 1I(Et) + R(S t , v t ) R(St, evt + ϵ) 1I(Ec t ) . Combinatorial Neural Bandits Since the event Et holds with high probability, we can bound the summation of the second term in the right hand side as follows: t=1 E [R(S t , v t ) R(St, evt + ϵ)] = t=1 E R(S t , v t ) R(St, evt + ϵ) 1I(Et) Therefore, we need to bound the summation of I3. Note that we have E R(S t , v t ) R(St, evt + ϵ) 1I(Et) | Ft R(S t , v t ) inf v1:M t e Vt max S R(S, v1:M t + ϵ) 1I (Et) | Ft R(S t , v t ) inf v1:M t e Vt max S R(S, v1:M t + ϵ) 1I (Et) | Ft, v1:M t e Vopt t R( St, evt + ϵ) inf v1:M t e Vt max S R(S, v1:M t + ϵ) 1I (Et) | Ft, v1:M t e Vopt t R( St, evt + ϵ) inf v1:M t e Vt R( St, v1:M t + ϵ) 1I (Et) | Ft, v1:M t e Vopt t sup v1:M t e Vt R( St, evt + ϵ) R( St, v1:M t + ϵ) 1I (Et) | Ft, v1:M t e Vopt t = E h R( St, evt + ϵ) R( St, v1:M t + ϵ) 1I (Et) | Ft, v1:M t e Vopt t i = E h R( St, evt + ϵ) R( St, ˆvt + ϵ) + R( St, ˆvt + ϵ) R( St, v1:M t + ϵ) 1I (Et) | Ft, v1:M t e Vopt t i i St (βtνσt,i)2 1I (Et) | Ft, v1:M t e Vopt t i St σ2 t,i 1I (Et) | Ft, v1:M t e Vopt t i St σ2 t,i | Ft, v1:M t e Vopt t where C0 > 0 is a Lipschitz constant. 
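Since the next step invokes Lemma 3, it may help to recall the optimistic sampling mechanism that the lemma analyzes: for each arm, CN-TS draws $M$ independent Gaussian scores around $f(x_{t,i};\theta_{t-1})$ with standard deviation $\nu\sigma_{t,i}$ and keeps the largest. The sketch below only illustrates this step under our own naming conventions (the top-$K$ oracle stands in for a generic exact oracle $\mathcal{O}_S$); it is not the authors' implementation.

```python
import numpy as np

def optimistic_scores(f_mean, sigma, nu, M, eps, rng):
    """One round of CN-TS style optimistic sampling (sketch).

    f_mean : (N,) predicted scores f(x_{t,i}; theta_{t-1})
    sigma  : (N,) widths sigma_{t,i}
    nu, M  : exploration scale and number of samples per arm
    eps    : approximation-error slack added to every sampled score
    """
    # Draw M independent Gaussian samples per arm and keep the largest one;
    # taking the max over M samples is what makes the sampled score
    # optimistic with probability at least 1 - (1 - p~)^M, as in Lemma 3.
    samples = rng.normal(loc=f_mean, scale=nu * sigma, size=(M, f_mean.shape[0]))
    return samples.max(axis=0) + eps

def top_k_oracle(scores, K):
    """Placeholder exact oracle for the simplest super-arm structure (pick the K largest scores)."""
    return np.argsort(scores)[-K:]

rng = np.random.default_rng(0)
N, K, M = 20, 4, 8                              # illustrative sizes only
f_mean = rng.uniform(size=N)                    # stands in for f(x_{t,i}; theta_{t-1})
sigma = 0.1 * np.ones(N)                        # stands in for sigma_{t,i}
tilde_v = optimistic_scores(f_mean, sigma, nu=1.0, M=M, eps=0.01, rng=rng)
S_t = top_k_oracle(tilde_v, K)                  # super arm played in round t
```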
On the other hand, from Lemma 3, we have P R(St, evt + ϵ) > R(S t , v t ) | Ft, Et 1/(4e π) := ep , which means that P( v1:M t e Vopt t | Ft, Et) = P R( St, evt + ϵ) > R(S t , v t ) and v1:M t e Vt | Ft, Et P R( St, evt + ϵ) > R(S t , v t ) | Ft, Et P v1:M t / e Vt | Ft, Et Combinatorial Neural Bandits Then, we can write i S t σ2 t,i | Ft, Et i St σ2 t,i | Ft, Et, v1:M t e Vopt t P v1:M t e Vopt t | Ft, Et i St σ2 t,i | Ft, Et, v1:M t e Vopt t where S t is a super arm induced by any sampled scores. By combining the results, we have E R(S t , v t ) R(St, evt + ϵ) 1I(Et) | Ft i St σ2 t,i | Ft, Et, v1:M t e Vopt t i S t σ2 t,i | Ft, Et i S t σ2 t,i | Ft Summing over all t [T] and the failure event into consideration, we have t=1 E R(S t , v t ) R(St, evt + ϵ) 1I(Et) | Ft i S t σ2 t,i | Ft Note that the summation on the RHS contains an expectation, so we cannot directly apply Lemma 2. Instead, since we can write i S t σ2 t,i | Ft i S t σ2 t,i + i S t σ2 t,i | Ft i S t σ2 t,i where S t is any super arm induced by arbitrary sampled scores. By using Lemma 2 we have i S t σ2 t,i i S t σ2 t,i Tλ 2ed log(1 + TN/λ) + 2 + C1T 5 3 K 3 2 L4λ 1 log m , (16) where C1 > 0 is a constant. On the other hand, let Yt = Pt k=1 E hq P i S k σ2 k,i | Fk i q P i S k σ2 k,i . Since we have Yt Yt 1 = E i S t σ2 t,i | Ft i S t σ2 t,i , which implies, E [Yt Yt 1 | Ft] = E i S t σ2 t,i | Ft i S t σ2 t,i | Ft Combinatorial Neural Bandits then Yt is a martingale for all 1 t T. Note that we can bound |Yt Yt 1| as follows: |Yt Yt 1| = i S t σ2 t,i | Ft i S t σ2 t,i where the inequality holds due to Lemma 7 for some positive constant C2. Then, applying the Azuma-Hoeffding inequality (Lemma 10), which means, i S t σ2 t,i | Ft i S t σ2 t,i 8TLK log T , (17) with probability 1 T 1. Combining Eq.(16) and Eq.(17), we have i S t σ2 t,i | Ft Tλ 2ed log(1 + TN/λ) + 2 + C1T 5 3 K 3 2 L4λ 1 8TLK log T (18) By substituting Eq.(18) for Eq.(15), we have the bound for R1(T) as follows: R1(T) 4C0βtν Tλ 2ed log(1 + TN/λ) + 2 + C1T 5 3 K 3 2 L4λ 1 8TLK log T + O(1) (19) Finally, combining Eq.(19) and Eq.(14) we have R(T) 4C0βT ν Tλ 2ed log(1 + TN/λ) + 2 + C1T 5 3 K 3 2 L4λ 1 8TLK log T + 2C0T + C0ν(βT + 2) Tλ 2ed log(1 + TN/λ) + 2 + C1T 5 3 K 3 2 L4λ 1 Then choosing m such that C1T 5 3 K 3 2 L4λ 1 Cϵ,1T 5 3 K 2 3 L3λ 2 6 K 7 6 L4λ 7 log m(1 + p 6 K 7 6 λ 2 3 L 9 2 m 1 log m B + ρ p log det(I + H/λ) + 2 2 log δ 1 Also by setting J = 2 log 1 4T Cϵ,2 λ T KL T KL C1 , we have TCϵ,2(1 ηmλ)J/2p Combinatorial Neural Bandits which follows, Tϵ 1. Hence, R(T) can be bounded by R(T) 4C0βT ν Tλ 2ed log(1 + TN/λ) + 3 + C2 p 8TLK log T + 2C0 + C0ν(βT + 2) Tλ 2ed log(1 + TN/λ) + 3 . C. Auxiliary Lemmas Lemma 9. (Abramowitz & Stegun, 1964) For a Gaussian distributed random variable Z with mean µ and variance σ2, for any z 1, 1 2 πz exp( z2/2) P(|Z µ| > zσ) 1 πz exp( z2/2) . Lemma 10 (Azuma-Hoeffding inequality). If a super-martingale (Yt, t 0) corresponding to filtration Ft, satisfies |Yt Yt 1| < βt for some constant βt, for all t = 1, . . . , T, then for any a 0, P(Yt Yt 1 a) 2 exp 2 PT t=1 β2 t D. Extensions from Neural Bandits for Single Action In this section, we describe how the auxiliary lemmas used in the neural bandit works (Zhou et al., 2020; Zhang et al., 2021) for single action can be extended to the combinatorial action settings. The main distinction is that in single action settings, the amount of data to be trained at time t is t, whereas in combinatorial action settings, it is t K. 
Therefore, by properly accounting for this difference, we can obtain the following results. Definition 3. For simplicity, we restate some definitions used in this section. i Sk g(xk,i; θ0)g(xk,i; θ0) /m , Zt(or e Zt) = λI + i Sk g(xk,i; θk 1)g(xk,i; θk 1) /m , σ2 t,i = λg(xt,i; θ0) Zt 1g(xt,i; θ0)/m , σ2 t,i = λg(xt,i; θt 1) e Zt 1g(xt,i; θt 1)/m , Jt = [g(x1,a11; θ0), . . . , g(x1,a1K; θ0), . . . g(xt,at K; θ0)] Rp t K , Jt = [g(x1,a11; θt 1), . . . , g(x1,a1K; θt 1), . . . g(xt,at K; θt 1)] Rp t K , yt = [v1,a11, . . . , v1,a1K, . . . , vt,at K] Rt K , where atk is the k-th action in the super arm St at time t, i.e., St := {at1, , at K}. Lemma 11 (Lemma 5.2 in Zhou et al. (2020)). Suppose that there exist some positive constants C1, C2 > 0 such that for any δ (0, 1), η C1(TKm L + mλ) 1 and 2 λ 1 2 log(TNL2/δ) 3 m(log m) 3 C2 TKL12λ 1 + T 4K4L18λ 10(λ + TKL)6 + T 7K7L21λ 7(1 + p Then, with probability at least 1 δ, we have θt θ0 2 2 p θ θt Zt γt/ m . Combinatorial Neural Bandits Lemma 12 (Lemma B.2 in Zhou et al. (2020)). There exist some constants { Ci}4 i=1 > 0 such that for any δ (0, 1), if for any t [T], η, m satisfy that t K/(mλ) C1m 3 2 log(TNL2/δ) 3 t K/(mλ) C2 min n L 6 (log m) 3 2 , mλ2η2L 6(log m) 1 3 η C3(mλ + t Km L) 1 , m 1 6 C4t 7 6 K 7 6 L 7 2 λ 7 log m(1 + p then, with probability at least 1 δ, we have θt θ0 2 p t K/(mλ) and θt θ0 Z 1 t Jtyt/m 2 (1 ηmλ) J 2 p t K/(mλ) + C5t 7 6 K 7 6 L 7 2 λ 7 log m(1 + p where C5 > 0 is an absolute constant. Lemma 13 (Lemma B.3 in Zhou et al. (2020)). There exist some constants { Ci}5 i=1 > 0 such that for any δ (0, 1), if for any t [T], m satisfies that 2 log(TNL2/δ) 3 t K/(mλ) C2L 6(log m) 3 then, with probability at least 1 δ, for any t [T] we have Zt 2 λ + C3t KL , Zt Zt F C4t 7 6 K 7 6 L4λ 1 log m , log det Zt det λI log det Zt C5t 5 3 K 5 3 L4λ 1 where C3, C4, C5 > 0 are some absolute constants, and Zt = λI + Pt 1 k=1 P i Sk g(xk,i; θ0)g(xk,i; θ0) . Lemma 14 (Lemma C.2 in Zhou et al. (2020)). For any δ (0, 1), C1, C2 > 0, suppose that τ satisfies 2 log(TNL2/δ) 3 2 τ C2L 6(log m) 3 Then, with probability at least 1 δ, if for any j [J], θ(j) θ(0) 2 τ, we have the following results for any j, s [J], J(j) J(0) F C4τ 1 3 L 7 2 p t Km log m , f (s) f (j) (J(j)) (θ(s) θ(j)) 2 C5τ 4 3 L3p t Km log m , where C3, C4, C5 > 0 are some absolute constants. Lemma 15 (Lemma C.3 in Zhou et al. (2020)). For any δ (0, 1) and { Ci}4 i=1 > 0, suppose that τ, η satisfy 2 log(TNL2/δ) 3 2 C2L 6(log m) 3 η C3(mλ + t Km L) 1 , τ 8 3 C4mλ2η2L 6(log m) 1 . Then, with probability at least 1 δ, if for any j [J], θ(j) θ(0) 2 τ, we have that for any j [J], f (j) y 2 2 t K. Lemma 16 (Lemma C.4 in Zhou et al. (2020)). For any δ (0, 1) and { Ci}3 i=1 > 0, suppose that τ, η satisfy 2 log(TNL2/δ) 3 2 C2L 6(log m) 3 η C3(mλ + t Km L) 1 . Then, with probability at least 1 δ, we have for any j [J], eθ (j) θ(0) 2 p eθ (j) θ(0) Z 1 Jy/m 2 (1 ηmλ) j 2 p Combinatorial Neural Bandits E. Extension of Regret Analysis to α-approximation Oracle In this section, we extend our regret analysis to the case when the agent only has access to an α-approximation oracle, Oα S. First, we replace St with Sα t = Oα S(ut + et) for CN-UCB (Algorithm 1) and Sα t = Oα S(evt + ϵ) for CN-TS (Algorithm 2). The total regret R(T) is replaced with an α-regret defined as: t=1 E [αR(S t , v t ) R(Sα t , v t )] For CN-UCB, note that αR(St, ut + et) R(Sα t , ut + et). Also, αR(S t , v t ) αR(S t , ut + et) αR(St, ut + et) R(Sα t , ut + et) . 
We can derive that the α-regret bound of CN-UCB is $\tilde{O}(\tilde{d}\sqrt{T})$ or $\tilde{O}(\sqrt{\tilde{d}TK})$, whichever is larger, by substituting the following quantities in Appendix A.4:
$$R(S_t^*, v_t^*) \;\rightarrow\; \alpha R(S_t^*, v_t^*), \qquad R(S_t^*, u_t + e_t) \;\rightarrow\; \alpha R(S_t^*, u_t + e_t).$$
For CN-TS, note that $\alpha R(S_t, \tilde{v}_t + \epsilon) \le R(S_t^\alpha, \tilde{v}_t + \epsilon)$. We split the α-regret as follows:
$$R^\alpha(T) = R_1^\alpha(T) + R_2^\alpha(T) = \sum_{t=1}^{T} \mathbb{E}\big[\alpha R(S_t^*, v_t^*) - R(S_t^\alpha, \tilde{v}_t + \epsilon)\big] + \sum_{t=1}^{T} \mathbb{E}\big[R(S_t^\alpha, \tilde{v}_t + \epsilon) - R(S_t^\alpha, v_t^*)\big]\,.$$
By replacing $S_t$ with $S_t^\alpha$ in Appendix B.2, we obtain the bound on $R_2^\alpha(T)$. For $R_1^\alpha(T)$, since
$$\alpha R(S_t^*, v_t^*) - R(S_t^\alpha, \tilde{v}_t + \epsilon) \;\le\; \alpha R(S_t^*, v_t^*) - \alpha R(S_t, \tilde{v}_t + \epsilon) \;\le\; R(S_t^*, v_t^*) - R(S_t, \tilde{v}_t + \epsilon)\,,$$
we know that $R_1^\alpha(T) \le R_1(T)$. Combining these results, we conclude that the α-regret bound of CN-TS is $\tilde{O}(\tilde{d}\sqrt{TK})$.

F. When Time Horizon T Is Unknown
For Theorems 1 and 2, we assumed that $T$ is known for the sake of a clear exposition of our proposed algorithms and their regret analysis. However, the knowledge of $T$ is essential neither for the algorithms nor for their analysis. With slight modifications, our proposed algorithms can be applied to settings where $T$ is unknown. In this section, we propose variants of CN-UCB and CN-TS, called CN-UCB with doubling and CN-TS with doubling, and show that their regret upper bounds are of the same order as those of CN-UCB and CN-TS up to logarithmic factors.

F.1. Algorithms
CN-UCB with doubling and CN-TS with doubling utilize a doubling technique (Besson & Kaufmann, 2018) in which the network size stays fixed during each epoch and is updated at the end of each epoch, whose length τ doubles that of the previous epoch. This way, even when $T$ is unknown, the network size can be set adaptively over epochs. The algorithms first initialize the variables related to τ, in particular the hidden layer width $m_\tau$ and the number of network parameters $p(\tau)$. In each round, after playing the super arm $S_t$ and observing the scores $\{v_{t,i}\}_{i\in S_t}$, CN-UCB with doubling and CN-TS with doubling call the Update algorithm. Until round τ, the Update algorithm updates $\theta_t$ and $Z_t$ (or $\tilde{Z}_t$) as if τ were the time horizon. Once $t$ reaches τ, the Update algorithm doubles the value of τ. After reinitializing the variables related to the doubled τ, which includes reconstructing the neural network with a larger hidden layer width $m_\tau$, the algorithm recomputes $\theta_{t'}$ and $Z_{t'}$ (or $\tilde{Z}_{t'}$) for all $t' = 0, \ldots, t$. The Update algorithm then returns $\theta_t$ and $Z_t$ (or $\tilde{Z}_t$) to CN-UCB with doubling or CN-TS with doubling. This process continues until $t$ reaches $T$.
Note that the per-round computational complexity of CN-UCB and CN-UCB with doubling heavily depends on how quickly they can compute the inverse of the gram matrix $Z$. Since $Z \in \mathcal{M}_p(\mathbb{R})$ and $p$ depends on $m$, each round of CN-UCB is relatively slow because $m$ is a large constant. In contrast, CN-UCB with doubling can be faster, especially in the early rounds, where $m$ is kept relatively small. The same argument applies to CN-TS and CN-TS with doubling. CN-UCB with doubling is summarized in Algorithm 3, CN-TS with doubling in Algorithm 4, and the Update algorithm in Algorithm 5.

Algorithm 3 CN-UCB with doubling
Input: Epoch period τ, network depth L
Initialization: Initialize {network width $m_\tau$, regularization parameter $\lambda_\tau$, norm parameter $B_\tau$, step size $\eta_\tau$, number of gradient descent steps $J_\tau$} with respect to τ; set the number of network parameters $p(\tau) = m_\tau d + m_\tau^2(L-2) + m_\tau$ and $Z_0 = \lambda_\tau I_{p(\tau)}$; randomly initialize $\theta_0$ as described in Section 3.1
while $t \le T$ do
  Observe $\{x_{t,i}\}_{i\in[N]}$
  Compute $\hat{v}_{t,i} = f(x_{t,i}; \theta_{t-1})$ and $u_{t,i} = \hat{v}_{t,i} + \gamma_{t-1}\,\lVert g(x_{t,i}; \theta_{t-1})/\sqrt{m_\tau}\rVert_{Z_{t-1}^{-1}}$ for $i \in [N]$
  Let $S_t = \mathcal{O}_S(u_t + e_t)$
  Play super arm $S_t$ and observe $\{v_{t,i}\}_{i\in S_t}$
  Update($t$, τ)
  Compute $\gamma_t$ and $e_{t+1}$ as described in Lemma 1 (replace $\{\lambda, m, I, B, \eta, J\}$ with $\{\lambda_\tau, m_\tau, I_{p(\tau)}, B_\tau, \eta_\tau, J_\tau\}$)
end while

Algorithm 4 CN-TS with doubling
Input: Epoch period τ, network depth L, sample size M
Initialization: Initialize {network width $m_\tau$, regularization parameter $\lambda_\tau$, exploration variance $\nu_\tau$, step size $\eta_\tau$, number of gradient descent steps $J_\tau$} with respect to τ; set $p(\tau) = m_\tau d + m_\tau^2(L-2) + m_\tau$ and $\tilde{Z}_0 = \lambda_\tau I_{p(\tau)}$; randomly initialize $\theta_0$ as described in Section 3.1
while $t \le T$ do
  Observe $\{x_{t,i}\}_{i\in[N]}$
  Compute $\sigma_{t,i}^2 = \lambda_\tau\, g(x_{t,i}; \theta_{t-1})^{\top} \tilde{Z}_{t-1}^{-1} g(x_{t,i}; \theta_{t-1})/m_\tau$ for each $i \in [N]$
  Sample $\{\tilde{v}^{(j)}_{t,i}\}_{j=1}^{M}$ independently from $\mathcal{N}(f(x_{t,i}; \theta_{t-1}), \nu_\tau^2 \sigma_{t,i}^2)$ for each $i \in [N]$
  Compute $\tilde{v}_{t,i} = \max_j \tilde{v}^{(j)}_{t,i}$ for each $i \in [N]$
  Let $S_t = \mathcal{O}_S(\tilde{v}_t + \epsilon)$
  Play super arm $S_t$ and observe $\{v_{t,i}\}_{i\in S_t}$
  Update($t$, τ)
end while

Algorithm 5 Update($t$, τ)
Input: Epoch period τ, round $t$
if $t < \tau$ then
  Update $Z_t = Z_{t-1} + \sum_{i\in S_t} g(x_{t,i}; \theta_{t-1})\, g(x_{t,i}; \theta_{t-1})^{\top}/m_\tau$
  Update $\theta_t$ to minimize the loss in Eq.(4) using gradient descent with $\eta_\tau$ for $J_\tau$ steps
else
  $\tau \leftarrow 2\tau$
  Reinitialize $\{m_\tau, \lambda_\tau, \eta_\tau, J_\tau\}$ with respect to τ; set $p(\tau) = m_\tau d + m_\tau^2(L-2) + m_\tau$; randomly reinitialize $\theta_0$ as described in Section 3.1
  For CN-UCB with doubling, reinitialize $B_\tau$ with respect to τ and set $Z_0 = \lambda_\tau I_{p(\tau)}$
  For CN-TS with doubling, reinitialize $\nu_\tau$ with respect to τ and set $\tilde{Z}_0 = \lambda_\tau I_{p(\tau)}$
  for $t' = 1, \ldots, t$ do
    For CN-UCB with doubling, $Z_{t'} = Z_{t'-1} + \sum_{i\in S_{t'}} g(x_{t',i}; \theta_{t'-1})\, g(x_{t',i}; \theta_{t'-1})^{\top}/m_\tau$
    For CN-TS with doubling, $\tilde{Z}_{t'} = \tilde{Z}_{t'-1} + \sum_{i\in S_{t'}} g(x_{t',i}; \theta_{t'-1})\, g(x_{t',i}; \theta_{t'-1})^{\top}/m_\tau$
    Update $\theta_{t'}$ to minimize the loss in Eq.(4) using gradient descent with $\eta_\tau$ for $J_\tau$ steps
  end for
end if
Return: $\theta_t$, $Z_t$ or $\tilde{Z}_t$

A brief code sketch of this gram-matrix update and of the width $\sigma_{t,i}$ is given below.

F.2. Regret Analysis
The regret upper bounds of CN-UCB with doubling and CN-UCB (and likewise of CN-TS with doubling and CN-TS) have the same rate up to logarithmic factors. We provide a sketch of the proof. Modifying Definitions 1 and 2 with respect to τ, the effective dimension $\tilde{d}_\tau$ can be written as $\tilde{d}_\tau = \log\det(I + \mathbf{H}_\tau/\lambda_\tau)/\log(1 + \tau N/\lambda_\tau)$. Denote the epoch periods by $\tau_n = 2^n \tau_0$, where $n \in \mathbb{Z}_{\ge 0}$ and $\tau_0$ is the initial epoch period. If $T < \tau_0$, CN-UCB with doubling and CN-TS with doubling are equivalent to CN-UCB and CN-TS, respectively, and there is no change in the regret upper bounds. Meanwhile, if $T \ge \tau_0$, there exists $\hat{n} \in \mathbb{Z}_{+}$ such that $\tau_{\hat{n}-1} \le T < \tau_{\hat{n}}$. Denote the instantaneous regret at round $t$ by $\mathrm{Reg}_t$ and define $\sum_{t=a}^{b}\mathrm{Reg}_t := 0$ if $a > b$. Then the regret can be written as
$$R(T) = \sum_{t=1}^{\tau_0}\mathrm{Reg}_t + \sum_{t=\tau_0+1}^{\tau_1}\mathrm{Reg}_t + \cdots + \sum_{t=\tau_{\hat{n}-2}+1}^{\tau_{\hat{n}-1}}\mathrm{Reg}_t + \sum_{t=\tau_{\hat{n}-1}+1}^{T}\mathrm{Reg}_t\,.$$
Let $\tilde{d} := \max\{\tilde{d}_{\tau_0}, \ldots, \tilde{d}_{\tau_{\hat{n}}}\}$. For CN-UCB with doubling, each sum has an upper bound $\tilde{O}(\max\{\tilde{d}_{\tau_n}, \sqrt{\tilde{d}_{\tau_n} K}\}\sqrt{\tau_n})$. Thus, the regret is bounded by $\tilde{O}(\max\{\tilde{d}, \sqrt{\tilde{d}K}\}\sqrt{2T})$. Similarly, for CN-TS with doubling, each sum has upper bound $\tilde{O}(\tilde{d}_{\tau_n}\sqrt{\tau_n K})$, and the regret has upper bound $\tilde{O}(\tilde{d}\sqrt{2TK})$.

G. Specific Examples of Combinatorial Feedback Models
As mentioned in Remark 1, reward functions satisfying Assumptions 1 and 2 encompass various combinatorial feedback models, suggesting that these assumptions are not restrictive. In this section, we provide specific examples.
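Before turning to these specific examples, we record a minimal sketch of the two quantities that Algorithms 3-7 (and Definition 3) repeatedly compute from the network gradients: the gram-matrix update $Z_t = Z_{t-1} + \sum_{i\in S_t} g\,g^{\top}/m$ and the width $\sigma_{t,i}^2 = \lambda\, g^{\top} Z^{-1} g/m$. This PyTorch-style sketch uses our own helper names and assumes the network maps a single context to a scalar score; it is an illustration, not the authors' code.

```python
import torch

def flat_grad(model, x):
    """Gradient of the scalar network output with respect to all parameters, flattened: g(x; theta)."""
    out = model(x).squeeze()
    grads = torch.autograd.grad(out, list(model.parameters()))
    return torch.cat([g.reshape(-1) for g in grads])

def update_gram(Z, model, contexts, super_arm, m):
    """Z_t = Z_{t-1} + sum_{i in S_t} g(x_{t,i}) g(x_{t,i})^T / m  (the update in Algorithm 5)."""
    for i in super_arm:
        g = flat_grad(model, contexts[i])
        Z = Z + torch.outer(g, g) / m
    return Z

def sigma_sq(Z, model, x, lam, m):
    """sigma^2_{t,i} = lam * g^T Z^{-1} g / m, the exploration width used by CN-UCB and CN-TS."""
    g = flat_grad(model, x)
    return lam * g @ torch.linalg.solve(Z, g) / m
```

Note that with this definition of $\sigma_{t,i}$, the CN-UCB bonus $\gamma_{t-1}\lVert g(x_{t,i};\theta_{t-1})/\sqrt{m}\rVert_{Z_{t-1}^{-1}}$ equals $\gamma_{t-1}\,\sigma_{t,i}/\sqrt{\lambda}$, so the same routine serves both algorithms.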
G.1. Semi-bandit Model
In the semi-bandit setting, after choosing a super arm, the agent observes all of the scores (or feedback) associated with the super arm and receives a reward as a function of those scores. The main text of this paper describes how our algorithms cover semi-bandit feedback models. Recall that in the semi-bandit setting, if the feature vectors are independent, then the scores of the arms are independent. Meanwhile, in ranking models (or click models), chosen arms may occupy a position within the super arm, and the score of an arm may depend on its position as well as on its own attractiveness.

G.2. Document-based Model
The document-based model is a click model that assumes the score of an arm is identical to its attractiveness, which is determined by the context of the arm. Formally, for each arm $i \in [N]$, let $\alpha(x_{t,i}) \in [0,1]$ be the attractiveness of arm $i$ at time $t$. Then the document-based model assumes that the score function of $x_{t,i}$ in the $k$-th position is defined as
$$h(x_{t,i}, k) = \alpha(x_{t,i})\,\mathbb{1}(k \le K)\,. \qquad (21)$$
Note that $h$ in Eq.(21) is bounded in $[0,1]$. Since a neural network is a universal approximator, we can utilize a neural network to estimate the score of arm $i$ in position $k$ as $\hat{h}(x_{t,i}, k) = f(x_{t,i}, k; \theta_{t-1})$. Note that for any $k \in [K]$, the score of an arm depends only on the attractiveness of the arm. Hence, our algorithms are directly applicable to the document-based model without any modification.

G.3. Position-based Model
In the document-based model, the score of an arm is invariant to its position within the super arm. In the position-based model, by contrast, the score of a chosen arm varies depending on its position. Let $\chi : [K] \rightarrow [0,1]$ be a function that measures the quality of a position within the super arm. The position-based model assumes that the score function of a chosen arm associated with $x_{t,i}$ and located in the $k$-th position is defined as
$$h(x_{t,i}, k) = \alpha(x_{t,i})\,\chi(k)\,. \qquad (22)$$
Note that the score of an arm can change as its position moves within the super arm. We slightly modify our proposed algorithms to reflect this. First, we introduce a modified neural network $f(x_{t,i}, k; \theta_{t-1})$ that estimates the score of each arm at every available position. In this way, the action space of each round increases from $N$ to $NK$, and the regret bound only changes as much as the action space increases. Denote the gradient of $f(x_{t,i}, k; \theta_{t-1})$ by $g(x_{t,i}, k; \theta_{t-1})$. Furthermore, we replace the oracle with $\mathcal{O}_S(\{u_{t,i}(k) + e_t\}_{i\in[N],k\in[K]})$, which takes the positions of the arms into account. The oracle $\mathcal{O}_S$ chooses only one arm for each position, and an arm that has been chosen for one position cannot be chosen for another position. Since an optimization problem with these constraints can be solved with linear programming, $\mathcal{O}_S(\{u_{t,i}(k) + e_t\}_{i\in[N],k\in[K]})$ computes the exact optimum in polynomial time (a code sketch of such an assignment oracle is given at the end of this appendix). The modified algorithm for the position-based model is described in Algorithm 6.
Algorithm 6 Combinatorial neural bandits for the position-based model
Initialize as in Algorithm 1
for $t = 1, \ldots, T$ do
  Observe $\{x_{t,i}\}_{i\in[N]}$
  if Exploration == UCB then
    Compute $u_{t,i}(k) = f(x_{t,i}, k; \theta_{t-1}) + \gamma_{t-1}\,\lVert g(x_{t,i}, k; \theta_{t-1})/\sqrt{m}\rVert_{Z_{t-1}^{-1}}$ for $i\in[N]$, $k\in[K]$
    Let $S_t = \mathcal{O}_S(\{u_{t,i}(k) + e_t\}_{i\in[N], k\in[K]})$
  else if Exploration == TS then
    Compute $\sigma_{t,i}^2(k) = \lambda\, g(x_{t,i}, k; \theta_{t-1})^{\top} \tilde{Z}_{t-1}^{-1} g(x_{t,i}, k; \theta_{t-1})/m$ for $i\in[N]$, $k\in[K]$
    Sample $\{\tilde{v}^{(j)}_{t,i}(k)\}_{j=1}^{M}$ independently from $\mathcal{N}(f(x_{t,i}, k; \theta_{t-1}), \nu^2\sigma_{t,i}^2(k))$ for $i\in[N]$, $k\in[K]$
    Compute $\tilde{v}_{t,i}(k) = \max_j \tilde{v}^{(j)}_{t,i}(k)$ for $i\in[N]$, $k\in[K]$
    Let $S_t = \mathcal{O}_S(\{\tilde{v}_{t,i}(k) + \epsilon\}_{i\in[N], k\in[K]})$
  end if
  Play super arm $S_t$ and observe $\{v_{t,i}(k_i)\}_{i\in S_t}$
  (UCB) Update $Z_t = Z_{t-1} + \sum_{i\in S_t} g(x_{t,i}, k_i; \theta_{t-1})\, g(x_{t,i}, k_i; \theta_{t-1})^{\top}/m$
  (TS) Update $\tilde{Z}_t = \tilde{Z}_{t-1} + \sum_{i\in S_t} g(x_{t,i}, k_i; \theta_{t-1})\, g(x_{t,i}, k_i; \theta_{t-1})^{\top}/m$
  Update $\theta_t$ to minimize the loss in Eq.(4) using gradient descent with $\eta$ for $J$ steps
end for

G.4. Cascade Model
In the cascade model, the agent presents arms to a user one by one, in order of the positions of the arms within the super arm. The user scans the arms one by one until she selects an arm that she likes, which ends the presentation procedure. Note that the presentation procedure may end before the agent has shown all the arms in the super arm to the user. Also, the user may not select any arm after scanning all the arms in the super arm. Hence, unlike the previously mentioned models, where the agent receives the scores of all chosen arms, in the cascade model the agent only receives the scores of the arms observed by the user.
Assume that the score the agent receives when the user selects an arm in the first position is 1. If the same arm is placed in the $k$-th position, the score the agent receives when the user selects it should be less than 1. To reflect this feature, we consider position discount factors $\psi_k \in [0,1]$, $k \in [K]$, by which the attractiveness of the arm is multiplied. The observed score of an arm is thus determined by its attractiveness and the position discount factor applied to it. The mechanism for estimating the attractiveness with a neural network is the same as in the semi-bandit setting; the only difference is that the agent only receives the discounted scores of the arms observed by the user.
Suppose that the user selects the $F_t$-th arm. Then the agent observes the discounted scores of the first $F_t$ arms in $S_t$, and the update is based on the discounted scores $\psi_k v_{t,k}$, $k \le F_t$. The adjusted algorithm for the cascade model is described in Algorithm 7. In addition, when we have no information about the position discount factors, we can treat the cascade model in the same way as the position-based model.

Algorithm 7 Combinatorial neural bandits for the cascade feedback model
Initialize as in Algorithm 1; $\{\psi_k \in [0,1]\}_{k\in[K]}$: position discount factors
for $t = 1, \ldots, T$ do
  Observe $\{x_{t,i}\}_{i\in[N]}$
  if Exploration == UCB then
    Compute $u_{t,i} = f(x_{t,i}; \theta_{t-1}) + \gamma_{t-1}\,\lVert g(x_{t,i}; \theta_{t-1})/\sqrt{m}\rVert_{Z_{t-1}^{-1}}$ for $i\in[N]$
    Let $S_t = \mathcal{O}_S(\{u_{t,i} + e_t\}_{i\in[N]})$
  else if Exploration == TS then
    Compute $\sigma_{t,i}^2 = \lambda\, g(x_{t,i}; \theta_{t-1})^{\top} \tilde{Z}_{t-1}^{-1} g(x_{t,i}; \theta_{t-1})/m$ for $i\in[N]$
    Sample $\{\tilde{v}^{(j)}_{t,i}\}_{j=1}^{M}$ independently from $\mathcal{N}(f(x_{t,i}; \theta_{t-1}), \nu^2\sigma_{t,i}^2)$ for $i\in[N]$
    Compute $\tilde{v}_{t,i} = \max_j \tilde{v}^{(j)}_{t,i}$ for $i\in[N]$
    Let $S_t = \mathcal{O}_S(\{\tilde{v}_{t,i} + \epsilon\}_{i\in[N]})$
  end if
  Play super arm $S_t$ and observe $F_t$, $\{\psi_k v_{t,k}\}_{k\in[F_t]}$
  (UCB) Update $Z_t = Z_{t-1} + \sum_{k\in[F_t]} g(x_{t,k}; \theta_{t-1})\, g(x_{t,k}; \theta_{t-1})^{\top}/m$
  (TS) Update $\tilde{Z}_t = \tilde{Z}_{t-1} + \sum_{k\in[F_t]} g(x_{t,k}; \theta_{t-1})\, g(x_{t,k}; \theta_{t-1})^{\top}/m$
  Update $\theta_t$ to minimize the loss in Eq.(4) using gradient descent with $\eta$ for $J$ steps
end for
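As noted in Appendix G.3, the position-aware oracle $\mathcal{O}_S(\{u_{t,i}(k)+e_t\}_{i\in[N],k\in[K]})$ assigns one arm to each position and never reuses an arm, which is a linear assignment problem. The following sketch solves it with SciPy's Hungarian-algorithm routine; it is our illustration of the linear-programming remark in G.3 (the rectangular $N\times K$ score matrix and all names are our assumptions), not code from the paper.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def position_oracle(scores):
    """Exact oracle for the position-based model (sketch).

    scores : (N, K) array whose (i, k) entry is the optimistic score u_{t,i}(k) + e_t
             (or the sampled score tilde_v_{t,i}(k) + eps for the TS variant)
             of placing arm i at position k.
    Returns the list of (arm, position) pairs maximizing the total score,
    using each arm and each position at most once.
    """
    # linear_sum_assignment minimizes total cost, so negate the scores to maximize.
    arms, positions = linear_sum_assignment(-scores)
    return list(zip(arms.tolist(), positions.tolist()))

rng = np.random.default_rng(0)
N, K = 10, 3                      # illustrative sizes only
u = rng.uniform(size=(N, K))      # position-dependent optimistic scores
super_arm = position_oracle(u)    # e.g. [(arm, position), ...] with K entries
```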
H. Additional Related Work
As mentioned in Section 1, the proposed methods are the first neural network-based combinatorial bandit algorithms with regret guarantees. As for previous combinatorial TS algorithms, Wen et al. (2015) proposed a TS algorithm for contextual combinatorial bandits with semi-bandit feedback and a linear score function. However, the regret of that algorithm is only analyzed in the Bayesian setting (hence establishing a Bayesian regret bound), which is a weaker notion of regret and much easier to control in combinatorial action settings. To our knowledge, Oh & Iyengar (2019) was the first work to establish a worst-case regret bound for a variant of contextual combinatorial bandits, namely multinomial logit (MNL) contextual bandits, utilizing an optimistic sampling procedure similar to that of CN-TS. Yet, our proposed algorithm differs from Oh & Iyengar (2019) in that we sample directly from the score space rather than the parameter space, which avoids the computational cost of sampling high-dimensional network parameters. More importantly, Oh & Iyengar (2019) exploit the structure of the MNL choice feedback model to derive their regret bound, whereas we address more general semi-bandit feedback without any assumption on the structure of the feedback model.

I. Additional Experiments
In Experiment 1, the linear combinatorial bandit algorithms perform worse than our proposed algorithms even for the linear score function. One possible reason is that the neural network-based algorithms use a much larger number of parameters than the linear model-based algorithms, i.e., they are overparameterized for the problem setting. Overparameterized neural networks have been shown to have superior generalization performance (Allen-Zhu et al., 2019b;a). Note that regret performance is about generalization to unseen data rather than the fit to already observed data. In this respect, an overparameterized neural network can outperform the linear model. This is supported by Figure 3, where we show the empirical performance of CN-TS and Comb Lin TS as the network width m decreases. As m decreases, the results of the neural network model and the linear model become more similar, i.e., the gap between their cumulative regrets shrinks.

Figure 3. Cumulative regret of CN-TS (network width $m = 100, 10, 5, 2$) and Comb Lin TS over rounds $t$, for the linear score function $h_1(x) = x^{\top}a$ with $d = 80$.