# Generalized Linear Bandits with Limited Adaptivity

Ayush Sawarni (Stanford University), ayushsaw@stanford.edu
Nirjhar Das (Indian Institute of Science Bangalore), nirjhardas@iisc.ac.in
Siddharth Barman (Indian Institute of Science Bangalore), barman@iisc.ac.in
Gaurav Sinha (Microsoft Research India), gauravsinha@microsoft.com

(Part of this work was done while at Microsoft Research India.)

Abstract

We study the generalized linear contextual bandit problem under the constraints of limited adaptivity. In this paper, we present two algorithms, B-GLinCB and RS-GLinCB, that address, respectively, two prevalent limited adaptivity settings. Given a budget M on the number of policy updates, in the first setting the algorithm must decide upfront the M rounds at which it will update its policy, while in the second setting it can adaptively perform its M policy updates during its course. For the first setting, we design an algorithm B-GLinCB that incurs Õ(√T) regret when M = Ω(log log T) and the arm feature vectors are generated stochastically. For the second setting, we design an algorithm RS-GLinCB that updates its policy O(log² T) times and achieves a regret of Õ(√T) even when the arm feature vectors are adversarially generated. Notably, in these bounds, we manage to eliminate the dependence on a key instance-dependent parameter κ that captures the non-linearity of the underlying reward model. Our novel approach for removing this dependence for generalized linear contextual bandits might be of independent interest.

1 Introduction

Contextual Bandits (CB) is an archetypal framework that models sequential decision making in time-varying environments. In this framework, the algorithm (decision maker) is presented, in each round, with a set of arms (represented as d-dimensional feature vectors), and it needs to decide which arm to play. Once an arm is played, a reward corresponding to the played arm is accrued. The regret of a round is defined as the difference between the maximum reward possible in that round and the reward of the played arm. The goal is to design a policy for selecting arms that minimizes cumulative regret (referred to as the regret of the algorithm) over a specified number of rounds, T. In the last few decades, much progress has been made in designing algorithms for special classes of reward models, e.g., the linear model [3, 4, 1, 16], the logistic model [6, 2, 7, 28], and generalized linear models [8, 19]. However, despite this progress, a key challenge prevents deployment of CB algorithms in the real world. Practical situations often allow for very limited adaptivity, i.e., they do not allow CB algorithms to update their policy in every round. For example, in clinical trials [10], each trial involves administering medical treatments to a cohort of patients, with medical outcomes observed and collected for the entire cohort at the conclusion of the trial. This data is then used to design the treatment for the next phase of the trial. Similarly, in online advertising [25] and recommendations [18], updating the policy after every iteration during deployment is often infeasible due to infrastructural constraints.
A recent line of work [23, 22, 11, 12, 21, 9, 26] addresses this limitation by developing algorithms that minimize cumulative regret while ensuring that only a limited number of policy updates occur. Across these works, two settings of limited adaptivity (called M1 and M2 from here onwards) have been popular. Both M1 and M2 provide a budget M to the algorithm, determining the number of times it can update its policy. In M1 [23, 13], the algorithm is required to decide upfront a sub-sequence of M rounds at which policy updates will occur, while in M2 [1, 23] the algorithm is allowed to adaptively decide (during its course) when to update its policy. Limited adaptivity algorithms were recently proposed for the CB problem with linear reward models under the M1 setting [23, 11], and optimal regret guarantees were obtained when the arm feature vectors are stochastically generated. Similarly, in their seminal work on linear bandits, [1] developed algorithms for the M2 setting and proved optimal regret guarantees with no restrictions on the arm vectors. While these results provide tight regret guarantees for linear reward models, extending them to generalized linear models is quite a challenge. Straightforward extensions lead to sub-optimal regret with a significantly worse dependence on an instance-dependent parameter κ (see Section 2 for the definition) that captures the non-linearity of the problem instance. In fact, to the best of our knowledge, developing optimal algorithms for the CB problem with generalized linear reward models under the limited adaptivity settings M1 and M2 is an open research question. This is the main focus of our work. We make the following contributions.

1.1 Our Contributions

We propose B-GLinCB, an algorithm that solves the CB problem for bounded (almost surely) generalized linear reward models (Definition 2.1) under the M1 setting of limited adaptivity. We prove that, when the arm feature vectors are generated stochastically, the regret of B-GLinCB at the end of T rounds is Õ(√T) when M = Ω(log log T). When M = O(log log T), we prove an Õ(T^{2^{M−1}/(2^M−2)}) regret guarantee. While the algorithm bears a slight resemblance to the one in [23], direct utilization of their key technique (distributional optimal design) results in a regret guarantee that scales linearly with the instance-dependent non-linearity parameter κ. In contrast, the leading terms in our regret guarantee for B-GLinCB have no dependence on κ. To achieve this, we make novel modifications to the distributional optimal design technique of [23]. In addition, the rounds for policy updates are chosen more carefully (in a κ-dependent fashion), leading to a stronger regret guarantee.

We propose RS-GLinCB, an algorithm that solves the CB problem for bounded (almost surely) generalized linear reward models (Definition 2.1) under the M2 setting of limited adaptivity. RS-GLinCB builds on a similar algorithm in [1] by adding a novel context-dependent criterion for determining whether a policy update is needed. This new criterion allows us to prove an optimal regret guarantee (Õ(√T)) with only O(log² T) updates to the policy. It is crucial for generalized linear reward settings since, without it, the resultant regret guarantees have a linear dependence on κ. Our work also resolves a conjecture in [17] by proving an optimal (Õ(√T)) regret guarantee for the CB problem with the logistic reward model that does not depend polynomially on S, the known upper bound on the norm of the model parameter, i.e., ∥θ*∥ ≤ S (see Section 2).³

³This requires a non-convex projection. We discuss its convex relaxation in Appendix E.
RS-GLinCB is, to our knowledge, the first CB algorithm for generalized linear reward models that is both computationally efficient (amortized O(log T) computation per round) and incurs optimal regret. We also perform experiments in Section 5 that validate its superiority, both in terms of regret and computational efficiency, over the baseline algorithms proposed in [14] and [6].

1.2 Important Remarks on Contributions and Comparison with Prior Work

Remark 1.1 (κ-independence). For both B-GLinCB and RS-GLinCB, our regret guarantees are free of κ (in their leading term), an instance-dependent parameter that can be exponential in the norm of the unknown parameter vector θ* (see Section 2 for the definition). Our contribution in this regard is two-fold. Not only do we prove κ-independent regret guarantees under the limited adaptivity constraint, we also characterize a broad class of generalized linear reward models for which a κ-independent regret guarantee can be achieved. Specifically, our results imply that the CB problem with generalized linear reward models, originally proposed in [8] and subsequently studied in the literature [19, 14, 24], admits κ-independent regret.

Remark 1.2 (Computational efficiency). Efforts to reduce the total time complexity to be linear in T have been active in the CB literature with generalized linear reward models. For example, [14] recently devised computationally efficient algorithms, but they suffer from a regret dependence on κ. Optimal (κ-independent) guarantees were recently achieved for logistic reward models [6, 2], and the algorithms were subsequently made computationally efficient in [7, 28]. However, the techniques involved rely heavily on the structure of the logistic model and do not easily extend to more general models. To the best of our knowledge, ours is the first work that achieves optimal κ-independent regret guarantees for bounded generalized linear reward models while remaining computationally efficient.⁴

⁴While RS-GLinCB and B-GLinCB have a total running time of Õ(T), their per-round complexity can reach O(T). This stands in contrast to [7], which maintains efficiency in both total and per-round time complexity.

Remark 1.3 (Self-concordance of bounded GLMs). In order to prove κ-independent regret guarantees, we prove a key result about the self-concordance of bounded (almost surely) generalized linear models (Definition 2.1) in Lemma 2.2. This result was postulated in [8] for GLMs (with the same definition as ours), but no proof was provided. While [6, 7] partially tackled this issue for logistic reward models⁵, in our work we prove self-concordance for much more general generalized linear models.

⁵In [5], a claim about κ-independent regret for all generalized linear models with bounded ∥θ*∥ was made; however, we can construct counterexamples to this claim (see Appendix C, Remark C.5).
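To make the self-concordance property concrete for the special case handled by prior work, the short check below verifies numerically that the logistic link satisfies |μ̈(z)| ≤ R μ̇(z) with R = 1 (a standard fact for the logistic model; the snippet is purely illustrative and not part of the paper's code, and the grid and tolerance are arbitrary choices).

```python
# Numerical sanity check of self-concordance for the logistic link
# mu(z) = 1 / (1 + exp(-z)): here |mu''(z)| <= mu'(z), i.e., R = 1.
import numpy as np

def mu(z):          # logistic link
    return 1.0 / (1.0 + np.exp(-z))

def mu_dot(z):      # first derivative: mu(z) * (1 - mu(z))
    m = mu(z)
    return m * (1.0 - m)

def mu_ddot(z):     # second derivative: mu'(z) * (1 - 2 * mu(z))
    return mu_dot(z) * (1.0 - 2.0 * mu(z))

z = np.linspace(-20, 20, 100001)
assert np.all(np.abs(mu_ddot(z)) <= mu_dot(z) + 1e-12)
print("|mu''(z)| <= mu'(z) holds on the grid (R = 1 for the logistic link).")
```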
2 Notations and Preliminaries

Notations: A policy π is a function that maps any given arm set X to a probability distribution over the same set, i.e., π(X) ∈ Δ(X), where Δ(X) is the probability simplex supported on X. We denote matrices in bold upper case (e.g., M). ∥x∥ denotes the ℓ2 norm of a vector x. We write ∥x∥_M to denote √(xᵀ M x) for a positive semi-definite matrix M and vector x. For any two real numbers a and b, we denote by a ∧ b the minimum of a and b. Throughout, Õ(·) denotes big-O notation with logarithmic factors in all relevant parameters suppressed. For m, n ∈ ℕ with m < n, we denote the set {1, ..., n} by [n] and {m, ..., n} by [m, n].

Definition 2.1 (GLM). A Generalized Linear Model (GLM) with parameter vector θ* ∈ ℝ^d is a real-valued random variable r that belongs to the exponential family, with density function

P(r | x) = exp( r ⟨x, θ*⟩ − b(⟨x, θ*⟩) + c(r) ).

The function b (called the log-partition function) is assumed to be twice differentiable, and its derivative ḃ is assumed to be monotone. Further, we assume that r ∈ [0, R] almost surely for some known R ∈ ℝ.

Important properties of GLMs, such as E[r | x] = ḃ(⟨x, θ*⟩) and variance V[r | x] = b̈(⟨x, θ*⟩), are detailed in Appendix C. We define the link function μ via μ(⟨x, θ*⟩) := E[r | x]. Thus, μ is also monotone. We now present a key lemma on GLMs (see Appendix C for details) that enables us to achieve optimal regret guarantees for the algorithms designed in Sections 3 and 4.

Lemma 2.2 (Self-Concordance of GLMs). For any GLM supported on [0, R] almost surely, the link function μ(·) satisfies |μ̈(z)| ≤ R μ̇(z) for all z ∈ ℝ.

Next we describe the two CB problems with GLM rewards that we address in this paper. Let T ∈ ℕ be the total number of rounds. At round t ∈ [T], we receive an arm set Xt ⊂ ℝ^d with K = |Xt| arms and must select an arm xt ∈ Xt. Following this, we receive a reward rt sampled from the GLM distribution P(r | xt) with unknown parameter θ*.

Problem 1: In this problem we assume that at each round t, the set of arms Xt ⊂ ℝ^d is drawn from an unknown distribution D. Further, we assume the constraints of limited adaptivity setting M1, i.e., the algorithm is given a budget M ∈ ℕ and needs to decide upfront the M rounds at which it will update its policy. Let supp(D) denote the support of the distribution D. We want to design an algorithm that minimizes the expected cumulative regret

R_T = E[ Σ_{t=1}^T max_{x∈Xt} μ(⟨x, θ*⟩) − Σ_{t=1}^T μ(⟨xt, θ*⟩) ].

Here, the expectation is taken over the randomness of the algorithm, the distribution of rewards rt, and the distribution D of the arm sets.

Problem 2: In this problem we do not make any assumptions on the arm feature vectors, i.e., the arm vectors can be adversarially chosen. However, we assume the constraints of limited adaptivity setting M2, i.e., the algorithm is given a budget M ∈ ℕ and needs to adaptively decide (during its course) the M rounds at which it will update its policy. We want to design an algorithm that minimizes the cumulative regret

R_T = Σ_{t=1}^T max_{x∈Xt} μ(⟨x, θ*⟩) − Σ_{t=1}^T μ(⟨xt, θ*⟩).

Finally, for both problems, the Maximum Likelihood Estimator (MLE) of θ* can be calculated by minimizing the sum of log-losses. The log-loss for a given arm x, its (stochastic) reward r, and a vector θ ∈ ℝ^d (as the estimator of the true unknown θ*) is defined as ℓ(θ, x, r) := −r ⟨x, θ⟩ + ∫₀^{⟨x,θ⟩} μ(z) dz. After t rounds, the MLE θ̂ is computed as θ̂ = argmin_θ Σ_{s=1}^t ℓ(θ, xs, rs).
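For concreteness, the following minimal sketch (illustrative Python, not the paper's released implementation) computes this MLE for the logistic link μ(z) = 1/(1 + e^{−z}) with R = 1, in which case the integral in the log-loss evaluates to log(1 + e^{⟨x,θ⟩}) and the objective is the familiar logistic loss; scipy's L-BFGS performs the minimization. All variable and function names here are ours.

```python
# Minimal GLM log-loss MLE for the logistic link (illustrative sketch).
import numpy as np
from scipy.optimize import minimize

def glm_mle(X, r, lam=0.0):
    """X: (n, d) played arms, r: (n,) rewards in [0, 1], lam: optional ridge term."""
    n, d = X.shape

    def loss_and_grad(theta):
        z = X @ theta
        # log-loss:  sum_s [ -r_s * z_s + log(1 + exp(z_s)) ]  (+ optional ridge)
        loss = np.sum(-r * z + np.logaddexp(0.0, z)) + 0.5 * lam * theta @ theta
        grad = X.T @ (1.0 / (1.0 + np.exp(-z)) - r) + lam * theta
        return loss, grad

    res = minimize(loss_and_grad, np.zeros(d), jac=True, method="L-BFGS-B")
    return res.x

# Toy usage: recover a hidden parameter from Bernoulli rewards.
rng = np.random.default_rng(0)
theta_star = np.array([0.8, -0.5, 0.3])
X = rng.uniform(-1, 1, size=(2000, 3))
X /= np.maximum(1.0, np.linalg.norm(X, axis=1, keepdims=True))   # ensure ||x|| <= 1
r = rng.binomial(1, 1.0 / (1.0 + np.exp(-X @ theta_star))).astype(float)
print(glm_mle(X, r))
```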
2.1 Instance Dependent Non-Linearity Parameters

As in prior works [6, 7], we define instance-dependent parameters that capture the non-linearity of the underlying instance and critically impact our algorithm design. The performance of Algorithm 1 (B-GLinCB), which solves Problem 1, can be quantified using three such parameters, defined via the derivative of the link function μ̇(·). Specifically, for any arm set X, write x* = argmax_{x∈X} μ(⟨x, θ*⟩) for the optimal arm and define

κ := max_{X∈supp(D)} max_{x∈X} 1/μ̇(⟨x, θ*⟩),   1/κ* := max_{X∈supp(D)} μ̇(⟨x*, θ*⟩),   1/κ̂ := E_{X∼D}[ μ̇(⟨x*, θ*⟩) ].   (1)

Remark 2.3. These quantities feature prominently in our regret analysis of Algorithm 1. In particular, the dominant term in our regret bound scales as Õ(√(T/κ*)). We also note that κ̂ ≥ κ*; in fact, for specific distributions D, the gap between them can be significant. Hence, we also provide a regret upper bound of Õ(√(T/κ̂)). In this latter case, however, we incur a worse dependence on d. Section 3 provides a quantified form of this trade-off.

Algorithm 2 (RS-GLinCB), which solves Problem 2, requires another such non-linearity parameter κ⁶, defined as

κ := max_{x ∈ ∪_{t=1}^T Xt} 1/μ̇(⟨x, θ*⟩).   (2)

⁶We overload the notation to match that in the literature: κ in the context of Problem 1 is defined via (1), while κ in the context of Problem 2 is defined via (2).

We note that, here, κ is defined with respect to the parameter vector θ*, in contrast to prior work on logistic bandits [7], where its definition involved a maximization over all vectors θ with ∥θ∥ ≤ S (the known upper bound on ∥θ*∥). Hence, κ as defined here is potentially much smaller and can lead to lower regret compared to prior works. As is standard in the CB literature with GLM rewards, we assume that tight upper bounds on these parameters are known to the algorithms.

Assumption 2.4. We make the following additional assumptions, which are standard for the CB problem with linear or GLM reward models. For every round t ∈ [T] and each arm x ∈ Xt, ∥x∥ ≤ 1. Further, the unknown parameter θ* of the GLM reward satisfies ∥θ*∥ ≤ S for a known constant S.

2.2 Optimal Design Policies

G-optimal design: Given an arm set X, the G-OPTIMAL DESIGN policy πG is the solution of the following optimization problem: argmin_{λ∈Δ(X)} max_{x∈X} ∥x∥²_{U(λ)⁻¹}, where U(λ) = E_{x∼λ}[x xᵀ]. Now consider the following optimization problem, known as the D-optimal design problem: max_{λ∈Δ(X)} log det(U(λ)). This is a concave maximization problem, as opposed to the G-optimal design problem, which is non-convex. We have the following equivalence theorem due to Kiefer and Wolfowitz [15].

Lemma 2.5 (Kiefer-Wolfowitz). Let X ⊂ ℝ^d be any set of arms and let W_G be the expected design matrix, defined as W_G := E_{x∼πG(X)}[x xᵀ], with πG(X) the solution to the D-optimal design problem. Then πG(X) also solves the G-optimal design problem, and for all x ∈ X, ∥x∥²_{W_G⁻¹} ≤ d.
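A minimal numerical illustration of Lemma 2.5 follows: the classical Fedorov-Wynn (Frank-Wolfe style) iteration for the concave D-optimal design problem, together with a check that the resulting design satisfies the Kiefer-Wolfowitz bound max_{x∈X} ∥x∥²_{U(λ)⁻¹} ≤ d. This is a generic sketch of a standard algorithm, not the paper's implementation; the iteration count, stopping tolerance, and regularizer are illustrative choices.

```python
# Fedorov-Wynn iteration for D-optimal design, with a Kiefer-Wolfowitz check.
import numpy as np

def d_optimal_design(X, iters=2000, reg=1e-9):
    """X: (K, d) arm matrix. Returns design weights lam over the K arms."""
    K, d = X.shape
    lam = np.full(K, 1.0 / K)
    for _ in range(iters):
        U = X.T @ (lam[:, None] * X) + reg * np.eye(d)   # U(lam) = sum_i lam_i x_i x_i^T
        Uinv = np.linalg.inv(U)
        g = np.einsum("ij,jk,ik->i", X, Uinv, X)         # ||x_i||^2_{U^{-1}} for each arm
        i = int(np.argmax(g))
        if g[i] <= d * (1 + 1e-3):                        # close to the KW bound: stop
            break
        step = (g[i] / d - 1.0) / (g[i] - 1.0)            # classical Wynn step size
        lam = (1.0 - step) * lam
        lam[i] += step
    return lam, g.max()

rng = np.random.default_rng(1)
X = rng.normal(size=(50, 5))
X /= np.linalg.norm(X, axis=1, keepdims=True)
lam, gmax = d_optimal_design(X)
print(f"max_x ||x||^2_(U^-1) = {gmax:.3f}  (Kiefer-Wolfowitz bound: d = {X.shape[1]})")
```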
Distributional optimal design: Notably, the upper bound on ∥x∥_{W_G⁻¹} specified in Lemma 2.5 holds only for the arms x in X. When the arm set Xt varies from round to round, securing a guarantee analogous to Lemma 2.5 is generally challenging. Nonetheless, when the arm sets Xt are drawn from a distribution, it is possible to extend the guarantee, albeit with a worse dependence on d; see Section A.5 in Appendix A. Improving this dependence motivates the study of DISTRIBUTIONAL OPTIMAL DESIGN, and towards this we utilize the results of [23]. The distributional optimal design policy is defined using a collection of tuples M = {(p_i, M_i) : p_1, ..., p_n ≥ 0 and Σ_i p_i = 1}, wherein each M_i is a d×d positive semi-definite matrix and n ≤ 4d log d. The collection M is detailed next. Let softmax_α({s_1, ..., s_k}) denote the probability distribution in which the i-th element is sampled with probability s_i^α / Σ_{j=1}^k s_j^α. For a specific M = {(p_i, M_i)}_{i=1}^n and each i ∈ [n], write π_{M_i}(X) = softmax_α({∥x∥²_{M_i} : x ∈ X}). Finally, with πG as the G-OPTIMAL DESIGN policy (Section 2.2), we define the DISTRIBUTIONAL OPTIMAL DESIGN policy π as

π(X) = πG(X) with probability 1/2, and π(X) = π_{M_i}(X) with probability p_i/2 for each i ∈ [n].

Given a collection of arm sets {X_1, ..., X_s} (called the core set) sampled from the distribution D, we utilize Algorithm 2 of [23] to find the collection M; see also Algorithm 4 of [23]. Overall, the computed M induces a policy π that upholds the following guarantee.

Lemma 2.6 (Theorem 5, [23]). Let π be the DISTRIBUTIONAL OPTIMAL DESIGN policy learnt from s independent samples X_1, ..., X_s ∼ D, and let W denote the expected design matrix, W = E_{X∼D}[ E_{x∼π(X)}[x xᵀ | X] ]. Then, once s is at least a fixed polynomial in d, we have E_{X∼D}[ max_{x∈X} ∥x∥_{W⁻¹} ] ≤ O(√(d log d)), up to a failure term that decays exponentially in s; see Theorem 5 of [23] for the precise expression.

3 B-GLinCB

In this section, we present B-GLinCB (Algorithm 1), which solves Problem 1 described in Section 2 under the constraints of limited adaptivity setting M1. Given the limited adaptivity budget M ∈ ℕ, our algorithm first computes the length of each of the M batches (i.e., determines the rounds at which the policy remains constant). We build upon the batch-length construction in [9]; however, the first batch is chosen in a κ-dependent way, which crucially helps in removing κ from the leading term of the regret.⁷

⁷We note that in case κ is unknown, any known upper bound on κ suffices for the algorithm.

Algorithm 1 B-GLinCB: Batched Generalized Linear Bandits Algorithm
Input: Number of batches M and horizon of play T.
1: Initialize batches T1, ..., TM as defined in equation (3), and set λ := 20Rd log T.
2: for rounds t ∈ T1 do
3:   Observe arm set Xt, sample arm xt ∼ πG(Xt), and observe reward rt.
4: Compute θ̂_w = argmin_θ Σ_{s∈T1} ℓ(θ, xs, rs) and matrix V = λI + Σ_{t∈T1} xt xtᵀ.
5: Initialize policy π1 as G-OPTIMAL DESIGN.
6: for batches k = 2 to M do
7:   for each round t ∈ Tk do
8:     Observe arm set Xt.
9:     for j = 1 to k−1 do
10:      Update arm set Xt ← Xt \ {x ∈ Xt : UCB_j(x) < max_{y∈Xt} LCB_j(y)}.
11:    Scale Xt, as in (4), to obtain X̃t, then sample xt ∼ π_{k−1}(X̃t).
12:  Equally divide Tk into two sets A and B.
13:  Define Hk = λI + Σ_{t∈A} (μ̇(⟨xt, θ̂_w⟩)/β(xt)) xt xtᵀ, and θ̂_k = argmin_θ Σ_{s∈A} ℓ(θ, xs, rs).
14:  Compute the DISTRIBUTIONAL OPTIMAL DESIGN policy πk using the arm sets {Xt}_{t∈B}.

Batch Lengths: For each batch k ∈ [M], let Tk denote the set of rounds within the k-th batch. We refer to the first batch T1 as the warm-up batch. The batch lengths τk := |Tk|, k ∈ [M], are calculated as follows:

τ1 := (κ e^{3S} d² γ² α / R)^{2/3},   τ2 := α,   τk := α √τ_{k−1} for k ∈ [3, M],   (3)

where γ := 30RS √(d log T)⁸ and α = T^{1/(2(1−2^{1−M}))} if M ≤ log log T, and α = 2√T otherwise.

⁸Recall that R is an upper bound on the stochastic rewards and S is an upper bound on the norm of θ*.

During the warm-up batch (Lines 2-3), the algorithm follows the G-OPTIMAL DESIGN policy πG. At the end of the warm-up batch (Line 4), the algorithm computes the Maximum Likelihood Estimate (MLE) θ̂_w of θ*,⁹ and the design matrix V := Σ_{t∈T1} xt xtᵀ + λI, with parameter λ = 20Rd log T. Now, for each batch k ≥ 2 and every round t ∈ Tk, the algorithm updates Xt by eliminating arms from it using the confidence bounds (see Equation (7)) computed in the previous batches (Line 10). The algorithm next computes X̃t, a scaled version of Xt, as follows, with β(x) defined in equation (5):

X̃t := { √( μ̇(⟨x, θ̂_w⟩) / β(x) ) · x : x ∈ Xt }.   (4)

⁹In case the MLE lies outside the set {θ : ∥θ∥ ≤ S}, we apply the projection step detailed in Appendix E.

Finally, we use the distributional optimal design policy π_{k−1} on the scaled arm set X̃t to sample the next arm (Line 11). At the end of every batch, we equally divide the batch Tk into two sets A and B. We use the samples from A to compute the estimator θ̂_k and the scaled design matrix Hk. The rounds in B are used to compute the distributional optimal design policy used in the next batch.
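As a rough illustration of the batch-length schedule in equation (3), the sketch below computes the lengths for a given horizon T and budget M. The warm-up constant c_warmup stands in for the κ-, S-, and γ-dependent factor in (3) (whose form is given above), and the base of the logarithm in the "M ≤ log log T" test is an arbitrary choice here; the snippet is illustrative only and not the paper's code.

```python
# Sketch of the batch-length schedule (3): a kappa-dependent warm-up batch,
# then tau_2 = alpha and tau_k = alpha * sqrt(tau_{k-1}).
import math

def batch_lengths(T, M, c_warmup):
    if M <= math.log(math.log(T)):
        alpha = T ** (1.0 / (2.0 * (1.0 - 2.0 ** (1 - M))))
    else:
        alpha = 2.0 * math.sqrt(T)
    taus = [min(T, c_warmup * alpha ** (2.0 / 3.0))]      # warm-up batch, o(sqrt(T))
    taus.append(alpha)
    while len(taus) < M and sum(taus) < T:
        taus.append(alpha * math.sqrt(taus[-1]))
    # truncate so the batches cover exactly T rounds
    out, used = [], 0
    for tau in taus:
        tau = int(min(round(tau), T - used))
        if tau <= 0:
            break
        out.append(tau)
        used += tau
    if used < T:
        out[-1] += T - used
    return out

print(batch_lengths(T=100_000, M=6, c_warmup=5.0))
```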
It is important to note that while a fixed policy is used to draw arms in every round of a batch (Line 11), the policy is updated only at the end of the batch. Hence, conforming to setting M1, the algorithm updates its selection policy at M rounds that are decided upfront.

Confidence Bounds: The scaled design matrix Hk, an estimator of the Hessian, is computed at the end of each batch k ∈ {2, ..., M} (Line 13):

Hk = Σ_{t∈A} ( μ̇(⟨xt, θ̂_w⟩) / β(xt) ) xt xtᵀ + λI,  where β(x) = exp( R · min{ 2S, γ √κ ∥x∥_{V⁻¹} } ),   (5)

and A is the first half of Tk. Using this, we define the upper and lower confidence bounds (UCB_k and LCB_k) computed at the end of batch Tk:

UCB_k(x) = ⟨x, θ̂_w⟩ + γ √κ ∥x∥_{V⁻¹} for k = 1,  and  UCB_k(x) = ⟨x, θ̂_k⟩ + γ ∥x∥_{Hk⁻¹} for k > 1,   (6)

LCB_k(x) = ⟨x, θ̂_w⟩ − γ √κ ∥x∥_{V⁻¹} for k = 1,  and  LCB_k(x) = ⟨x, θ̂_k⟩ − γ ∥x∥_{Hk⁻¹} for k > 1.   (7)

Remark 3.1. The confidence bounds employed by the algorithm exhibit a significant distinction between the first batch and subsequent batches. While the first batch's bounds are influenced by the parameter κ, subsequent batches utilize κ-independent bounds. This difference arises from the use of the standard design matrix V in the first batch and the scaled design matrix Hk (equation (5)) in later batches, leveraging the self-concordance property of GLM rewards to achieve κ-independence. Notably, the first batch's confidence bounds influence the scaling factor β(x) in later batches, creating a trade-off (addressed in the regret analysis in Appendix A): an inaccurate estimate θ̂_w can exponentially increase the scaling factor and the confidence bounds.

In Theorem 3.2 and Corollary 3.3, we present our regret guarantee for B-GLinCB. Detailed proofs for both are provided in Appendix A. The computational efficiency of B-GLinCB is discussed in Appendix D.

Theorem 3.2. Algorithm 1 (B-GLinCB) incurs regret R_T ≤ (R1 + R2) log log T,¹⁰ where

R1 = Õ( dRS · T^{1/(2(1−2^{1−M}))} · √(log T / κ*) )  and  R2 = Õ( κ^{1/3} d² e^{2S} (RS log T)^{2/3} · T^{1/(3(1−2^{1−M}))} ).

¹⁰Note that R_T here is the expected regret; see Section 2.

Corollary 3.3. When the number of batches M ≥ log log T, Algorithm 1 achieves a regret bound of

R_T = Õ( dRS √(T/κ*) + d² e^{2S} (S²R²κT)^{1/3} ).

Remark 3.4. Scaling the arm set (as in (4)) for optimal design is a crucial aspect of our algorithm, allowing us to obtain tight estimates of μ̇(⟨x, θ*⟩) (see Lemma A.10). This result relies on multiple novel ideas and techniques, including self-concordance for GLMs, matrix concentration, a Bernstein-type concentration bound for the canonical exponential family (Lemma A.1), and the application of distributional optimal design to the scaled arm set.
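To make the per-round elimination and scaling steps concrete, the following sketch (illustrative Python, not the authors' code) applies the confidence bounds (6)-(7) for one earlier batch and the scaling (4)-(5) to a single round's arm set; B-GLinCB repeats the elimination for every earlier batch j < k before sampling. All names are ours, and `mu_dot` stands for the derivative of the link function.

```python
# Per-round arm elimination (Line 10) and arm scaling (Line 11 / eqs. (4)-(5)).
import numpy as np

def eliminate_and_scale(X, theta_w, V_inv, gamma, kappa, R, S, mu_dot,
                        theta_k=None, H_inv=None):
    """X: (K, d) arms. If theta_k / H_inv are None, the first-batch bounds (k = 1) are used."""
    if theta_k is None:          # kappa-dependent bounds of (6)-(7) for k = 1
        mean = X @ theta_w
        width = gamma * np.sqrt(kappa * np.einsum("ij,jk,ik->i", X, V_inv, X))
    else:                        # kappa-independent bounds for k > 1
        mean = X @ theta_k
        width = gamma * np.sqrt(np.einsum("ij,jk,ik->i", X, H_inv, X))
    X = X[mean + width >= (mean - width).max()]   # drop arms whose UCB < best LCB

    # scaling of the surviving arms, equations (4)-(5)
    v_norm = np.sqrt(np.einsum("ij,jk,ik->i", X, V_inv, X))
    beta = np.exp(R * np.minimum(2.0 * S, gamma * np.sqrt(kappa) * v_norm))
    return np.sqrt(mu_dot(X @ theta_w) / beta)[:, None] * X
```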
Remark 3.5. The κ-dependent batch construction is a crucial feature of our algorithm, enabling effective estimation of μ̇(⟨x, θ*⟩) at the end of the first batch. Since the first batch incurs regret linear in its length, achieving a κ-independent guarantee requires the first batch to be o(√T). We demonstrate that choosing τ1 = O(T^{1/3}) suffices for this purpose (see Appendix A).

4 RS-GLinCB

In this section we present RS-GLinCB (Algorithm 2), which solves Problem 2 described in Section 2 under the constraints of limited adaptivity setting M2. This algorithm incorporates a novel switching criterion (Line 4), extending the determinant-doubling approach of [1]. Additionally, we introduce an arm-elimination step (Line 12) to obtain tighter regret guarantees. Throughout this section, we set λ = d log(T/δ)/R² and γ = 25RS √(d log(T/δ)).

Algorithm 2 RS-GLinCB: Rarely-Switching GLM Bandit Algorithm
1: Initialize: V = H1 = λI, To = ∅, τ = 1, λ := d log(T/δ)/R² and γ := 25RS √(d log(T/δ)).
2: for rounds t = 1, ..., T do
3:   Observe arm set Xt.
4:   if max_{x∈Xt} ∥x∥²_{V⁻¹} ≥ 1/(γ²κR²) then   // Switching Criterion I
5:     Select xt = argmax_{x∈Xt} ∥x∥_{V⁻¹} and observe reward rt.
6:     Update To ← To ∪ {t}, V ← V + xt xtᵀ and H_{t+1} ← Ht.
7:     Compute θ̂_o = argmin_θ Σ_{s∈To} ℓ(θ, xs, rs) + (λ/2)∥θ∥²₂.
8:   else
9:     if det(Ht) > 2 det(Hτ) then   // Switching Criterion II
10:      Set τ = t, θ̃ ← argmin_θ Σ_{s∈[t−1]\To} ℓ(θ, xs, rs) + (λ/2)∥θ∥²₂, and
11:      θ̂_τ ← Project(θ̃).
12:    Update Xt ← Xt \ {x ∈ Xt : UCB_o(x) < max_{z∈Xt} LCB_o(z)}.
13:    Select xt = argmax_{x∈Xt} UCB(x, Hτ, θ̂_τ) and observe reward rt.
14:    Update H_{t+1} ← Ht + (μ̇(⟨xt, θ̂_o⟩)/e) xt xtᵀ.

At round t, on receiving an arm set Xt, RS-GLinCB first checks Switching Criterion I (Line 4). This criterion checks whether, for any arm x ∈ Xt, the quantity ∥x∥²_{V⁻¹} exceeds a carefully chosen κ-dependent threshold. Here V is the design matrix corresponding to all arms played in the rounds in To (the set of rounds preceding round t at which Switching Criterion I was triggered). Under this criterion, the arm that maximizes ∥x∥_{V⁻¹} is played (call this arm xt) and the corresponding reward is obtained. Subsequently, in Line 6, the set To is updated to include t, the design matrix V is updated as V ← V + xt xtᵀ, and the scaled design matrix H_{t+1} is set to Ht. The MLE is then computed (Line 7) using the data from the rounds in To to obtain θ̂_o.

When Switching Criterion I is not triggered, the algorithm first checks (Line 9) Switching Criterion II, namely whether the determinant of the scaled design matrix Ht has become more than double that of Hτ (where τ is the last round before t at which Switching Criterion II was triggered). If Switching Criterion II is triggered at round t, then in Line 10 the algorithm sets τ ← t and recomputes the MLE over all past rounds except those in To to obtain θ̃. Then θ̃ is projected into an ellipsoid around θ̂_o to obtain the estimate θ̂_τ via the following optimization problem¹¹:

θ̂_τ = argmin_θ ∥ Σ_{s∈[τ−1]\To} ( μ(⟨xs, θ⟩) − μ(⟨xs, θ̃⟩) ) xs ∥_{H(θ)⁻¹}   subject to   ∥θ − θ̂_o∥_V ≤ γ√κ,   (8)

where H(θ) := Σ_{s∈[τ−1]\To} μ̇(⟨xs, θ⟩) xs xsᵀ.

¹¹This optimization problem is non-convex. However, a convex relaxation of this optimization problem is detailed in Appendix E, which leads to slightly worse regret guarantees in poly(R, S).

After checking Switching Criterion II, the algorithm performs an arm elimination step (Line 12) based on the parameter estimate θ̂_o as follows: for every arm x ∈ Xt, we compute UCB_o(x) = ⟨x, θ̂_o⟩ + γ√κ ∥x∥_{V⁻¹} and LCB_o(x) = ⟨x, θ̂_o⟩ − γ√κ ∥x∥_{V⁻¹}.¹² Then, Xt is updated by eliminating from it the arms whose UCB_o(·) is less than the highest LCB_o(·). For the arms in the reduced arm set Xt, RS-GLinCB computes the index UCB(x, Hτ, θ̂_τ) := ⟨x, θ̂_τ⟩ + 150 ∥x∥_{Hτ⁻¹} √(d log(T/δ)), and plays the arm xt with the highest index (Line 13). After observing the reward rt, the algorithm updates the scaled design matrix (Line 14) as H_{t+1} ← Ht + (μ̇(⟨xt, θ̂_o⟩)/e) xt xtᵀ. With this, round t ends and the algorithm moves to the next round.

¹²We note that in case κ is unknown, any known upper bound on κ suffices for the algorithm.

Next, in Lemma 4.1 and Theorem 4.2, we present guarantees on the number of policy updates and on the regret of RS-GLinCB, respectively. Detailed proofs for both are provided in Appendix B.

Lemma 4.1. RS-GLinCB (Algorithm 2), during its entire execution, updates its policy at most O(R⁴S²κd² log²(T/δ)) times.

Theorem 4.2. Given δ ∈ (0, 1), with probability at least 1 − δ, the regret of RS-GLinCB (Algorithm 2) satisfies R_T = O( d √( Σ_{t∈[T]} μ̇(⟨x*_t, θ*⟩) ) · log(RT/δ) + κ d² R⁵ S² log²(T/δ) ).
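The control flow of one round of RS-GLinCB can be summarized by the following sketch (illustrative Python, not the paper's code): the MLE and projection subroutines are abstracted away as function arguments, the exact confidence-width constant of Line 13 is replaced by a generic parameter conf_width, and reward observation and log-loss bookkeeping are omitted.

```python
# One-round control-flow sketch of RS-GLinCB (Switching Criteria I and II).
import numpy as np

def rs_glincb_round(Xt, state, gamma, kappa, R, conf_width, mu_dot,
                    fit_mle_warmup, fit_mle_main):
    """Xt: (K, d) arms of round t. `state` holds V, H, H_tau and the current
    estimates theta_o, theta_tau (assumed initialized before the first round)."""
    V_inv = np.linalg.inv(state["V"])
    v2 = np.einsum("ij,jk,ik->i", Xt, V_inv, Xt)

    if v2.max() >= 1.0 / (gamma ** 2 * kappa * R ** 2):            # Switching Criterion I
        x = Xt[np.argmax(v2)]
        state["V"] = state["V"] + np.outer(x, x)
        state["theta_o"] = fit_mle_warmup(state)                   # refit warm-up MLE (Line 7)
        return x

    if np.linalg.det(state["H"]) > 2.0 * np.linalg.det(state["H_tau"]):  # Criterion II
        state["H_tau"] = state["H"].copy()
        state["theta_tau"] = fit_mle_main(state)                   # refit and project, eq. (8)

    mean = Xt @ state["theta_o"]                                   # elimination via UCB_o / LCB_o
    w = gamma * np.sqrt(kappa * v2)
    Xt = Xt[mean + w >= (mean - w).max()]
    H_tau_inv = np.linalg.inv(state["H_tau"])
    ucb = Xt @ state["theta_tau"] + conf_width * np.sqrt(
        np.einsum("ij,jk,ik->i", Xt, H_tau_inv, Xt))               # optimistic index (Line 13)
    x = Xt[np.argmax(ucb)]
    state["H"] = state["H"] + (mu_dot(x @ state["theta_o"]) / np.e) * np.outer(x, x)
    return x
```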
Remark 4.3. Switching Criterion I is essential to delivering tight regret guarantees in the non-linear setting. Unlike the existing literature [7], which relies on warm-up rounds based on observed rewards (hence heavily dependent on the reward model), RS-GLinCB uses a context-dependent criterion that implicitly checks whether the estimate μ̇(⟨x, θ̂_o⟩) is within a constant factor of μ̇(⟨x, θ*⟩) (see Lemmas B.3 and B.4). We show that the number of times Switching Criterion I is triggered is only O(κd² log²(T)) (see Lemma B.11), hence incurring only a small regret in these rounds.

Remark 4.4. Unlike [1], our determinant-doubling Switching Criterion II uses the scaled design matrix Ht instead of an unscaled matrix (similar to V). The matrix Ht, which estimates the Hessian of the log-loss, is crucial for achieving optimal regret. This modification is key to extending algorithms that satisfy limited adaptivity setting M2 from the CB problem with linear reward models to more general GLM reward models.

Remark 4.5. The feasible set for the optimization in (8) is an ellipsoid around θ̂_o, which contains θ* with high probability. Deviating from the existing literature on GLM bandits, which projects the estimate onto the ball {θ : ∥θ∥ ≤ S} of radius S, our projection step leads to tighter regret guarantees; notably, the leading √T term is free of the parameters S (and R). This resolves the conjecture made in [17] regarding the possibility of obtaining S-free regret in the √T term for logistic bandits.

Remark 4.6. The regret guarantees of the logistic bandit algorithms in [2, 17] have a second-order term that is the minimum of an arm-geometry dependent quantity (see Theorem 3 of [17]) and a κ-dependent term similar to the one in our regret guarantee. Although our analysis does not accommodate this arm-geometry dependent quantity, we underscore that our algorithm is computationally efficient while the above works are not. In fact, to the best of our knowledge, the other known efficient algorithms for logistic bandits [7, 28] also do not achieve the arm-geometry dependent regret term. It would be interesting to design an efficient algorithm that achieves the same guarantees for the second-order regret term as [2, 17].

Figure 1: Top: Cumulative regret vs. number of rounds for the Logistic (left) and Probit (right) reward models. Bottom: (left) Execution times of ECOLog and RS-GLinCB for different values of κ (low κ = 9.3 and high κ = 141.6) for logistic rewards. (right) Execution times of GLOC and RS-GLinCB for different values of κ (low κ = 17.6 and high κ = 202.3) for probit rewards.

5 Experiments

We tested the practicality of our algorithm RS-GLinCB against various baselines for logistic and generalized linear bandits. For these experiments, we adjusted the Switching Criterion I threshold constant in RS-GLinCB to 0.01 and used data from the rounds of both Switching Criteria (I and II) to estimate θ̃. These modifications do not affect the overall efficiency, as θ̃ is calculated only O(log(T)) times. The experiment code is available at https://github.com/nirjhar-das/GLBandit_Limited_Adaptivity.

Logistic. We compared RS-GLinCB against ECOLog [7] and GLOC [14], the only algorithms with overall time complexity O(T) for this setting.
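As a small illustration of the kind of logistic instance used in the setup described next, the sketch below (not the released experiment code) generates arm sets and a parameter vector as described, and computes the non-linearity parameter κ of equation (2) for the sampled instance; all function names, the seed, and the default values are ours.

```python
# Sample a logistic bandit instance and compute its non-linearity parameter kappa.
import numpy as np

def make_logistic_instance(d=5, K=20, S=5.0, T=20_000, seed=0):
    rng = np.random.default_rng(seed)
    theta = rng.normal(size=d)
    theta *= S / np.linalg.norm(theta)                  # theta* on the sphere of radius S
    arms = rng.normal(size=(T, K, d))                   # arms: uniform in the unit ball
    arms /= np.linalg.norm(arms, axis=-1, keepdims=True)
    arms *= rng.uniform(size=(T, K, 1)) ** (1.0 / d)
    return theta, arms

def kappa_of_instance(theta, arms):
    z = arms @ theta                                    # <x, theta*> for every arm
    sigma = 1.0 / (1.0 + np.exp(-z))
    mu_dot = sigma * (1.0 - sigma)
    return float(1.0 / mu_dot.min())                    # kappa = max_x 1 / mu_dot(<x, theta*>)

theta, arms = make_logistic_instance()
print(f"kappa for this instance: {kappa_of_instance(theta, arms):.1f}")
```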
The dimension was set to d = 5, the number of arms per round to K = 20, and θ* was sampled from the d-dimensional sphere of radius S = 5. Arms were sampled uniformly from the d-dimensional unit ball. We ran simulations for T = 20,000 rounds, repeating them 10 times. RS-GLinCB showed the smallest regret with a flattened regret curve, as seen in Fig. 1 (top-left).

Probit. For the probit reward model, we compared RS-GLinCB against GLOC and GLM-UCB [8]. The dimension was set to d = 5 and the number of arms per round to K = 20. θ* was sampled from the d-dimensional sphere of radius S = 3. Arm features were generated as in the logistic bandit simulation. We ran simulations for T = 5,000 rounds, repeating them 10 times. RS-GLinCB outperformed both baselines, as shown in Fig. 1 (top-right).

Comparing Execution Times. We compared the execution times of RS-GLinCB and ECOLog. We created two logistic bandit instances with d = 5 and K = 20, and different κ values. We ran both algorithms for T = 20,000 rounds, repeating each run 20 times. For low κ, RS-GLinCB took about one-fifth of the time of ECOLog, and for high κ, slightly more than one-third, as seen in Fig. 1 (bottom-left). This demonstrates that RS-GLinCB has a significantly lower computational overhead than ECOLog. We also compared the execution times of RS-GLinCB and GLOC under the probit reward model, creating two bandit instances with d = 5 and K = 20 but with differing κ. We ran both algorithms for T = 20,000 rounds, repeating each run 20 times. The result is shown in Fig. 1 (bottom-right). We observe that for low κ, RS-GLinCB takes less than half the time of GLOC, while for high κ it takes about two-thirds of the time of GLOC. A more detailed discussion of these experiments is provided in Appendix D.

6 Conclusion and Future Work

The Contextual Bandit problem with GLM rewards is a ubiquitous framework for studying online decision making with non-linear rewards. We study this problem with a focus on limited adaptivity. In particular, we design algorithms B-GLinCB and RS-GLinCB that obtain optimal regret guarantees for the two prevalent limited adaptivity settings M1 and M2, respectively. A key feature of our guarantees is that their leading terms are independent of an instance-dependent parameter κ that captures non-linearity. To the best of our knowledge, our paper provides the first algorithms for the CB problem with GLM rewards under limited adaptivity (and otherwise) that achieve κ-independent regret. The regret guarantee of RS-GLinCB not only aligns with the best-known guarantees for logistic bandits but enhances them by removing the dependence on S (the upper bound on ∥θ*∥) in the leading term of the regret, and therefore resolves a conjecture in [17]. The batch learning algorithm B-GLinCB, for M = Ω(log log T), achieves a regret of Õ(dRS√T). We believe that the dependence on d, along with the κ̂ term, is not tight, and improving this dependence is a relevant direction for future work.

Acknowledgments and Disclosure of Funding

Siddharth Barman gratefully acknowledges the support of the Walmart Center for Tech Excellence (CSR WMGT-23-0001) and a SERB Core research grant (CRG/2021/006165).

References

[1] Yasin Abbasi-Yadkori, Dávid Pál, and Csaba Szepesvári. Improved algorithms for linear stochastic bandits. Advances in neural information processing systems, 24, 2011.
[2] Marc Abeille, Louis Faury, and Clément Calauzènes. Instance-wise minimax-optimal algorithms for logistic bandits.
In International Conference on Artificial Intelligence and Statistics, pages 3691 3699. PMLR, 2021. [3] Peter Auer. Using confidence bounds for exploitation-exploration trade-offs. Journal of Machine Learning Research, 3(Nov):397 422, 2002. [4] Wei Chu, Lihong Li, Lev Reyzin, and Robert Schapire. Contextual bandits with linear payoff functions. In Geoffrey Gordon, David Dunson, and Miroslav Dudík, editors, Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, volume 15 of Proceedings of Machine Learning Research, pages 208 214, Fort Lauderdale, FL, USA, 11 13 Apr 2011. PMLR. [5] Louis Faury. Variance-sensitive confidence intervals for parametric and offline bandits. Theses, Institut Polytechnique de Paris, Oct 2021. [6] Louis Faury, Marc Abeille, Clément Calauzènes, and Olivier Fercoq. Improved optimistic algorithms for logistic bandits. In International Conference on Machine Learning, pages 3052 3060. PMLR, 2020. [7] Louis Faury, Marc Abeille, Kwang-Sung Jun, and Clément Calauzènes. Jointly efficient and optimal algorithms for logistic bandits. In International Conference on Artificial Intelligence and Statistics, pages 546 580. PMLR, 2022. [8] Sarah Filippi, Olivier Cappe, Aurélien Garivier, and Csaba Szepesvári. Parametric bandits: The generalized linear case. Advances in neural information processing systems, 23, 2010. [9] Zijun Gao, Yanjun Han, Zhimei Ren, and Zhengqing Zhou. Batched multi-armed bandits problem. Advances in Neural Information Processing Systems, 32, 2019. [10] International Stroke Trial Collaborative Group et al. The international stroke trial (ist): a randomised trial of aspirin, subcutaneous heparin, both, or neither among 19 435 patients with acute ischaemic stroke. The Lancet, 349(9065):1569 1581, 1997. [11] Yanjun Han, Zhengqing Zhou, Zhengyuan Zhou, Jose Blanchet, Peter W Glynn, and Yinyu Ye. Sequential batch learning in finite-action linear contextual bandits. ar Xiv preprint ar Xiv:2004.06321, 2020. [12] Osama Hanna, Lin Yang, and Christina Fragouli. Learning from distributed users in contextual linear bandits without sharing the context. In Alice H. Oh, Alekh Agarwal, Danielle Belgrave, and Kyunghyun Cho, editors, Advances in Neural Information Processing Systems, 2022. [13] Osama Hanna, Lin Yang, and Christina Fragouli. Efficient batched algorithm for contextual linear bandits with large action space via soft elimination. In Thirty-seventh Conference on Neural Information Processing Systems, 2023. [14] Kwang-Sung Jun, Aniruddha Bhargava, Robert Nowak, and Rebecca Willett. Scalable generalized linear bandits: Online computation and hashing. Advances in Neural Information Processing Systems, 30, 2017. [15] Jack Kiefer and Jacob Wolfowitz. The equivalence of two extremum problems. Canadian Journal of Mathematics, 12:363 366, 1960. [16] Tor Lattimore and Csaba Szepesvári. Bandit Algorithms. Cambridge University Press, 2020. [17] Junghyun Lee, Se-Young Yun, and Kwang-Sung Jun. Improved regret bounds of (multinomial) logistic bandits via regret-to-confidence-set conversion. In Proceedings of The 27th International Conference on Artificial Intelligence and Statistics, pages 4474 4482. PMLR, 02 04 May 2024. [18] Lihong Li, Wei Chu, John Langford, and Robert E Schapire. A contextual-bandit approach to personalized news article recommendation. In Proceedings of the 19th international conference on World wide web, pages 661 670, 2010. [19] Lihong Li, Yu Lu, and Dengyong Zhou. 
Provably optimal algorithms for generalized linear contextual bandits. In International Conference on Machine Learning, pages 2071 2080. PMLR, 2017. [20] Blake Mason, Kwang-Sung Jun, and Lalit Jain. An experimental design approach for regret minimization in logistic bandits. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 36, pages 7736 7743, 2022. [21] Vianney Perchet, Philippe Rigollet, Sylvain Chassang, and Erik Snowberg. Batched bandit problems. In Peter Grünwald, Elad Hazan, and Satyen Kale, editors, Proceedings of The 28th Conference on Learning Theory, volume 40 of Proceedings of Machine Learning Research, pages 1456 1456, Paris, France, 03 06 Jul 2015. PMLR. [22] Zhimei Ren and Zhengyuan Zhou. Dynamic batch learning in high-dimensional sparse linear contextual bandits. Management Science, 70(2):1315 1342, 2024. [23] Yufei Ruan, Jiaqi Yang, and Yuan Zhou. Linear bandits with limited adaptivity and learning distributional optimal design. In Proceedings of the 53rd Annual ACM SIGACT Symposium on Theory of Computing, pages 74 87, 2021. [24] Yoan Russac, Olivier Capp e, and Aurélien Garivier. Algorithms for non-stationary generalized linear bandits. Ar Xiv, abs/2003.10113, 2020. [25] Eric M Schwartz, Eric T Bradlow, and Peter S Fader. Customer acquisition via display advertising using multi-armed bandit experiments. Marketing Science, 36(4):500 522, 2017. [26] David Simchi-Levi and Yunzong Xu. Bypassing the monster: A faster and simpler optimal algorithm for contextual bandits under realizability. Mathematics of Operations Research, 47(3):1904 1931, 2022. [27] Joel A Tropp et al. An introduction to matrix concentration inequalities. Foundations and Trends in Machine Learning, 8(1-2):1 230, 2015. [28] Yu-Jie Zhang and Masashi Sugiyama. Online (multinomial) logistic bandit: Improved regret and constant computation cost. In Thirty-seventh Conference on Neural Information Processing Systems, 2023. Generalized Linear Bandits with Limited Adaptivity Appendix Table of Contents 1 Introduction 1 1.1 Our Contributions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2 1.2 Important Remarks on Contributions and Comparison with Prior Work . . . . . . . 2 2 Notations and Preliminaries 3 2.1 Instance Dependent Non-Linearity Parameters . . . . . . . . . . . . . . . . . . . . 4 2.2 Optimal Design Policies . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4 3 B-GLin CB 5 4 RS-GLin CB 7 5 Experiments 9 6 Conclusion and Future Work 10 A Regret Analysis of B-GLin CB 13 A.1 Additional Notation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13 A.2 Concentration Inequalities and Confidence Intervals . . . . . . . . . . . . . . . . . 13 A.3 Preliminary Lemmas . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16 A.4 Proof of Theorem 3.2 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19 A.5 Optimal Design Guarantees . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20 B Regret Analysis of RS-GLin CB 21 B.1 Confidence Sets for Switching Criterion I rounds . . . . . . . . . . . . . . . . . . 22 B.2 Confidence Sets for non-Switching Criterion I rounds . . . . . . . . . . . . . . . . 23 B.3 Bounding the instantaneous regret . . . . . . . . . . . . . . . . . . . . . . . . . . 25 B.4 Proof of Theorem 4.2 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26 B.5 Bounding number of policy updates: Proof of Lemma 4.1 . . . . . . . . . . . . . . 27 B.6 Some Useful Lemmas . . . . . . . 
. . . . . . . . . . . . . . . . . . . . . . . . . . 27 C Useful Properties of GLMs 28 D Computational Cost 31 E Projection 32 E.1 Convex Relaxation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33 A Regret Analysis of B-GLin CB A.1 Additional Notation We write c to denote absolute constant(s) that appears throughout our analysis. Our analysis also utilizes the following function γ(λ) = 24RS p log (T) + d + R (log (T) + d) Note that γ(λ) is a parameterized version of γ (which was defined in section 3). In our proof, we present the arguments using this parameterized version. A direct minimization of the above expression in terms of λ would not suffice since we need λ to be sufficiently large for certain matrix concentration lemmas to hold (see Section A.5). However, later we show that setting λ equal to c Rd log T leads to the desired bounds. We use ex to denote the scaled versions of the arms (see Line 11 of the algorithm); in particular, v u u t µ x, bθw β(x) x (10) Furthermore, to capture the non-linearity of the problem, we introduce the term ϕ(λ): ϕ(λ) := κ e3S γ(λ)2 Recall that the scaled data matrix Hk (for each batch k) was computed using bθw as follows β(xt) xtx T t + λI. Following the definition of Hk and using the true vector θ we define t Tk µ ( xt, θ ) xtx T t + λI. We will show that Hk dominates H k with high probability. Furthermore, we assume that the MLE estimator θ obtained by minimizing the log-loss objective always satisfies bθ S. In case, that s not true, one can use the non-convex projection described in Appendix E. The projected vector satisfies the same guarantees as described in the subsequent lemmas up to a multiplicative factor of 2. Hence, the assumption bθ S is non-limiting. A.2 Concentration Inequalities and Confidence Intervals Lemma A.1 (Bernstein s Inequality). Let X1, . . . , Xn be a sequence of independent random variables with |Xt E [Xt] | b. Also, let sum S := Pn t=1 (Xt E [Xt]) and v := Pm t=1 Var[Xt]. Then, for any δ [0, 1], we have Lemma A.2. Let X = {x1, x2, . . . , xs} Rd be a set of vectors with xt 1, for all t [s], and let scalar λ 0. Also, let r1, r2, . . . , rs [0, R] be independent random variables distributed by the canonical exponential family; in particular, E [rs] = µ ( xs, θ ) for θ Rd. Further, let bθ = arg minθ Pt s=1 ℓ(θ, xs, rs) be the maximum likelihood estimator of θ and let matrix j=1 µ ( xj, θ ) xjx T j + λI. Then, with probability at least than 1 1 T 2 , the following inequality holds θ bθ H 24RS p log (T) + d + R (log (T) + d) Proof. We first define the following quantities α(x, θ , bθ) := Z 1 v=1 µ x, θ + v x, bθ θ dv j=1 α(x, θ , bθ)xjx T j + λ 1 + 2RS I Using Lemma C.2 we have G 1 1 + 2RS H (12) Hence we write θ bθ H p (1 + 2RS) θ bθ G 1 + 2RS G θ bθ G 1 θ , xj bθ, xj α(x, θ , bθ)xj + λ 1 + 2RS µ ( θ , xj ) µ bθ, xj xj µ ( θ , xj ) µ bθ, xj xj (since G λ 1+2RS I and θ bθ 2 2S) µ ( θ , xj ) µ bθ, xj xj (Using (12) and assuming RS 1) Now by the optimality condition on bθ we have Ps j=1 µ xj, bθ xj = Ps j=1 rjxj (see equation (3) [8]). Hence, we write µ ( θ , xj ) µ bθ, xj xj j=1 (µ ( θ , xj ) rj) xj Let B denote the unit ball in Rd. We can write j=1 (µ ( θ , xj ) rj) xj H 1 = max y B y, H 1/2 s X j=1 (µ ( θ , xj ) rj) xj We construct an ε-net for the unit ball, denoted as Cε. For any y B, we define yε := arg minb Cε b y 2. 
We can now write j=1 (µ ( θ , xj ) rj) xj = max y B y yε, H 1/2 s X j=1 (µ ( θ , xj ) rj) xj + yε, H 1/2 s X j=1 (µ ( θ , xj ) rj) xj max y B y yε 2 j=1 (µ ( θ , xj ) rj) xj H 1 + yε, H 1/2 s X j=1 (µ ( θ , xj ) rj) xj j=1 (µ ( θ , xj ) rj) xj H 1 + yε, H 1/2 s X j=1 (µ ( θ , xj ) rj) xj Rearranging, we obtain j=1 (µ ( θ , xj ) rj) xj H 1 1 1 ε yε, H 1/2 s X j=1 (µ ( θ , xj ) rj) xj Next, we use Lemma A.3 (stated below) with δ = T 2|Cε| and union bound over all vectors in Cε. We also observe that |Cε| 2 ε d. Substituting ϵ = 1/2 and using Lemma A.3, we obtain that the following holds with probability greater than 1 1 T 2 , j=1 (µ ( θ , xj ) rj) xj log (T 2|Cε|) + 4R λ log T 2|Cε| log (T) + d + R (log (T) + d) Substituting in equations (13), we get the desired inequality in the lemma statement. Lemma A.3. Let y be a fixed vector with y 1. Then, with the notation stated in Lemma A.2, the following inequality holds with probability at least 1 δ j=1 (µ ( θ , xj ) rj) y TH 1/2xj Proof. Let us denote the jth term of the sum as Zj. Note that each random variable Zj has variance Var(Zj) = µ ( xj, θ ) y TH 1/2xj 2 . Hence, we have j=1 Var(Zj) = j=1 µ ( θ , xj ) y TH 1/2xj 2 Moreover, each Zj is at most R λ (since xj 1, H λI and r [0, R]). Now applying Lemma A.1, we have Corollary A.4. Let x1, x2, . . . , xτ be the sequence of arms pulled during the warm-up batch and let bθw be the estimator of θ computed at the end of the batch. Then, for any vector x and λ 0 the following bound holds with probability greater than 1 1 T 2 | x, θ bθw | κ x V 1 γ(λ). Proof. This result is derived directly from Lemma A.2 and the definition of γ(λ) (see 9). By applying the lemma, we obtain | x, θ bθw | x H 1 θ bθw H x H 1 γ(λ) Considering the definition of κ, we have µ ( x, θ ) 1 κ. This implies that H 1 κV 13 which in turn leads to the inequality x H 1 κ x V 1. A.3 Preliminary Lemmas Lemma A.5. For each batch k 2 and the scaled data matrix Hk computed at the end of batch, the following bound holds with probability at least 1 1 T 2 : Proof. If the event stated in Lemma A.4 holds, From Lemma C.2, we apply the multiplicative bound on µ to obtain µ x, bθw µ ( x, θ ) exp R| x, bθw θ | Via Corollary A.4 we have | x, bθw x, θ | κ x V 1 γ(λ). Additionally, given that bθ , θ S and x 1 we also have | x, bθw x, θ | 2S. Hence, we write µ x, bθw µ ( x, θ ) exp R min{ κ x V 1 γ(λ), 2S} µ ( x, θ ) β(x) Substituting these results into the definitions of Hk and H k proves the lemma statement. Claim A.6. The Algorithm 1 runs for at most log log T batches. Proof. When M log log T then the claim trivially holds. When M log log T + 1, we define the length of the second batch, τ2, as 2 T. The length of the M th batch is T) PM 1 k=1 1 2k 1 1 2log log T (M log log T + 1) T. Corollary A.7. Let bθk be the estimator of θ calculated at the end of the kth batch. Then for any vector x the following holds with probability greater than 1 log log T T 2 for every batch k 2. | x, θ bθk | x H 1 k γ(λ) 13For logistic rewards, we note that instead of using the worst case bound of κ, one can make use of an upper bound on S := θ as done in [20] That is, we lower bound µ ( x, θ ) µ S x and use H P t µ S xt xtx t . Proof. This result is a direct consequence of Lemma A.2 and the definition of γ(λ) (see 9). According to the lemma, we have | x, θ bθk | x H k 1 θ bθk H k x H k 1 γ(λ) Using Lemma A.5, we can further bound x H k 1 x Hk 1. 
Finally, a union bound over all batches and considering the fact that there are at most log log T batches (Claim A.6) we establish the corollary s claim. Claim A.8. For any x [0, M] the following holds Proof. The claim follows from the convexity of ex. Lemma A.9. Let x X be the selected in any round of batch k 2 in the algorithm, and let x be the optimal arm in the arm set X, i.e., x = arg maxx X µ ( x, θ ). With probability greater than 1 log log T T 2 , the following inequality holds- µ ( x , θ ) µ ( x, θ ) 6ϕ(λ) X y {x,x } ey {ex,ex } y V 1 ey Hk 1 + 2γ(λ) p µ ( x , θ ) ex H 1 k 1 + ex H 1 k 1 Proof. We begin by applying Taylor s theorem, which yields the following for some z between x, θ and x , θ |µ ( x , θ ) µ ( x, θ )| (14) = µ(z) | x, θ x, θ | = µ(z) x , θ x , bθk 1 + x , bθk 1 x, bθk 1 + x, bθk 1 x, θ µ(z) x , θ x , bθk 1 + x , bθk 1 x, bθk 1 + x, bθk 1 x, θ 2 µ(z) x H 1 k 1 γ(λ) + x H 1 k 1 γ(λ) (via Corollary A.7) v u u t β(x ) µ x , bθw ex H 1 k 1 + v u u t β(x) µ x, bθw ex H 1 k 1 v u u t µ(z)β(x ) µ x , bθw ex H 1 k 1 + 2 p v u u t µ(z)β(x) µ x, bθw ex H 1 k 1 (15) We now invoke Lemmas C.2 and A.4 to obtain v u u t µ(z) µ x, bθw β(x) exp min n 2S, z x, bθw o β(x) (by stated assumptions and Lemma C.2) S, | x, θ x, bθw | + | x, θ z| S, | x, θ x, bθw | + | x, θ x , θ | (since z [ x, θ , x t , θ ]) exp min S, 3 κ x V 1 γ (λ) + 2 κ x V 1 γ (λ) (using Lemma A.4 and the elimination criteria) exp min 2S, 2 κ x V 1 γ (λ) + κ x V 1 γ (λ) (substituting the definition of β(x)) Similarly, we also have p µ ( x , θ ) exp min S, κ x V 1 γ (λ) + κ x V 1 γ (λ) . Further, we can simplify each term in equation (15) as v u u t µ(z)β(x ) µ x , bθw ex H 1 k 1 µ ( x , θ ) ex H 1 k 1 exp min 3S, 3 κγ (λ) x V 1 + x V 1 µ ( x , θ ) κ e3S γ(λ) 2 x V 1 + x V 1 ex H 1 k 1 + 2γ(λ) p µ ( x , θ ) ex H 1 k 1 (via Claim A.8) µ ( x , θ ) κ e3S γ(λ) 2 x V 1 + x V 1 ex H 1 k 1 µ ( x , θ ) ex H 1 k 1 µ ( x , θ )ϕ(λ) x V 1 + x V 1 ex H 1 k 1 + 2γ(λ) p µ ( x , θ ) ex H 1 k 1 Finally, we substitute the above bound in (15) to obtain |µ ( x , θ ) µ ( x, θ ) | 6 p µ ( x , θ )ϕ(λ) x V 1 + x V 1 ex H 1 k 1 + ex H 1 k 1 µ ( x , θ ) ex H 1 k 1 + ex H 1 k 1 For Phase k, the distribution of the remaining arms after the elimination step (X in line 10 of the Algorithm 1) is represented as Dk. Lemma A.10. During any round in batch k of Algorithm 1, and for an absolute constant c, we have E [|µ ( x , θ ) µ ( x, θ ) |] c ϕ(λ)d2 τ1 τk 1 + γ(λ) τk 1 Proof. The proof here invokes Lemma A.9. We begin by noting that y {x,x } ey {ex,ex } y V 1 ey H 1 k 1 4 E X Dk max x X x V 1 max x X ex H 1 k 1 max x X x 2 V 1 max x X ex 2 H 1 k 1 (via Jensen s inequality) max x X x 2 V 1 max x X ex 2 H 1 k 1 (via Claim A.11) (using Lemma A.17) We also have µ ( x , θ ) ex H 1 k 1 + ex H 1 k 1 µ ( x , θ ) max x X x H 1 k 1 max x X x H 1 k 1 E X Dk [ µ ( x , θ )] E X Dk x 2 H 1 k 1 (using the definition of κ for the first bound and Jensen for the second) d log d κ τk 1 (using Lemma A.17) Substituting the above bounds in Lemma A.9 we obtained the stated inequality. This completes the proof. A.4 Proof of Theorem 3.2 We trivially upper bound the regret incurred during the warm-up batch as τ1R; recall that R denotes the upper bound on the rewards and τ1 denotes the length of the first (warm-up) batch; see equation (5)). 
For each batch k and an absolute constant c, Lemma A.10 gives us t Tk µ ( x t , θ ) µ ( xt, θ ) ϕ(λ)d2 τ1 τk 1 + γ(λ) τk 1 Since there are at most log log T batches, we can upper bound the regret as τ1R + ϕ(λ)d2 Setting τ1 = ϕ(λ)d2α R 2/3 we get Now with the choice of λ = 20d R log T, we have γ(λ) γ where γ = 30RS p Substituting α = T 1 2(1 2 M) and ϕ(λ) = κ e3S γ2 κd3 e3S RST 1 2(1 2 M) log T 2/3 + 1 2(1 2 M) p A.5 Optimal Design Guarantees In this section, we study the optimal design policies utilized in different batches of the algorithm. Specifically, πG denotes the G-OPTIMAL DESIGN policy applied during the warm-up batch, while πk refers to the DISTRIBUTIONAL OPTIMAL DESIGN policy calculated at the end of batch k ( and used in the (k + 1)th batch). Recall that the distribution of the remaining arms after the elimination step ( X in line 10 of the Algorithm) is represented as Dk. We define expected design matrices for each policy: WG := E X D Wk := E X Dk Recall, for all batches starting from the second batch (k 2), we employ the scaled arm set, denoted as e X, for learning and action selection under the DISTRIBUTIONAL OPTIMAL DESIGN policy. However, during the initial warm-up batch, we utilize the original, unscaled arm set. Claim A.11. The following holds for any positive semidefinite matrix A and any batch k- E X Dk max x X x A E X Dj max x X x A j [k 1]. This is due to the fact that the set of surviving arms in batch k is always a smaller set than the previous batches. Lemma A.12 (Lemma 4 [23]). The expected data matrix WG satisfies. We have max x X x 2 W 1 G Lemma A.13 (Theorem 5 [23]). Let the DISTRIBUTIONAL OPTIMAL DESIGN π which has been learnt from s independent samples X1, . . . Xs D and let W denote the expected data matrix, W = EX D Ex π(X) xx T|X . We have max x X x W 1 d log d 1 exp O d4 log2 d sd 12 2 16 . (16) Lemma A.14. Under the notation of Lemma A.13, we have max x X x 2 W 1 Proof. Recall that the DISTRIBUTIONAL OPTIMAL DESIGN policy samples according to the πG policy with half probability. Hence, we have xx T|X E X Dk 2 E x πG(X) max x X x 2 W 1 max x X x 2 W 1 G Lemma A.15 (Matrix Chernoff [27, 23]). Let x1, x3, . . . , xn D be vectors, with xt 1, then we have i=1 xix T i n 8 E x D xx T ) 1 2d exp εn Corollary A.16. In Algorithm 1 the warm-up matrix V, with λ 16 log (Td), satisfies the following with probability greater than 1 1 T 2 . 8 E X D E x πG(X) xx T Similarly Hk, with λ 6 log (Td) satisfies the following for each batch k 2 with probability greater than 1 1 T 2 8 E X Dk E ex πk(X) exex T. Proof. The results for both V and Hk are obtained directly by applying Lemma A.15 with ε = log(T ) τ1 and ε = log(T ) τk , respectively. We note that the analysis of [23] gives an optimal guarantee (in expectation) on x V 1, but not on x 2 V 1. We obtain such a bound here and use it in the analysis. Lemma A.17. The following holds with probability greater than 1 1 T 2 E X D max x X x 2 V 1 O d2 E X Dk max x X ex 2 H 1 k O d2 We also have that for sufficiently large T O d32(log 2T)2 , the following holds with probability greater than 1 log log T E X Dk max x X ex H 1 k O Proof. First, we note from Corollary A.16 that the following holds with high probability τ1 ex W 1 k We obtain the first two inequalities, (18) and (19), by a direct use of Corollary A.16. For (20) we note that for every phase we have at least O( T) samples for learning the DISTRIBUTIONAL OPTIMAL DESIGN policy (for any M). 
Since, T d32 log 2T 2 the event stated in Lemma A.13 holds with probability greater than 1 1 T 2 . B Regret Analysis of RS-GLin CB Recall To denotes the set of rounds when switching criterion I is satisfied. We write τo to denote the size of the set τo = |To|. We define the following (scaled) data matrix s To µ ( xs, θ ) xsx s + λI . We will specify the regularizer λ later. We also define Below, we state the main concentration bound used in the proof. Lemma B.1 (Theorem 1 of [6]). Let {Ft} t=1 be a filtration. Let {xt} t=1 be a stochastic process in B1(d) such that xt is Ft measurable. Let {ηt} t=1 be a martingale difference sequence such that ηt is Ft measurable. Furthermore, assume we have |ηt| 1 almost surely, and denote σ2 t = V[ηt|Ft]. Let λ > 0 and for any t 1 define: s=1 ηsxs Ht = s=1 σ2 sxsx s + λI Then, for any δ (0, 1], t 1 : St H 1 t det(Ht) 1 2 B.1 Confidence Sets for Switching Criterion I rounds Lemma B.2. At any round t, let bθo be the maximum likelihood estimate calculated using set of rewards observed in the rounds To. With probability at least 1 δ we have bθo θ H w γ. Proof. Let us define the matrix Gw = P s To α(x, θ , bθo)xsx s + λI. First, we note that by selfconcordance property of µ (Lemma C.2), Gw 1 1+2RS H w. Hence, 1 + 2RS bθo θ Gw 1 + 2RS Gw bθo θ G 1 w bθo, xs θ , xs α(xs, bθo, θ )xs + λbθo λtθ G 1 w µ bθo, xs µ ( θ , xs ) xs + λbθo λtθ G 1 w (Taylor s theorem) µ bθo, xs µ ( θ , xs ) xs + λbθo λθ H 1 w (Gw 1 1+2RS H w) Since bθo is the maximum likelihood estimate, by optimality condition, we have the following relation: P s To µ xs, bθo xs + λbθo = P s To rsxs. Substituting this above, we get bθo θ H w (1 + 2RS) s To (rs µ ( θ , xs )) xs λθ H 1 w s To (rs µ ( θ , xs )) xs + λ θ H 1 w λ. (H w λt I, θ 2 S) where ηs := (rs µ ( θ , xs )). We will now apply Lemma B.1 ηs scaled by R. First note that P s To ηsxs (H w) 1 = P s To ηs R xs (R2H w) 1 which, in turn ensures that the noise variable is upper bounded by 1. Applying B.1 we get bθo θ H w S + (1 + 2RS) R2λ log 1 + τo R2λd R2λ d log(2) Simplifying constants and setting d log(T) + log(1/δ), we have bθo θ H w d log(T/δ) + log(1/δ) c RS p d log (T/δ). B.2 Confidence Sets for non-Switching Criterion I rounds We define Ew be the event defined in Lemma B.2, that is, Ew = { bθo θ H w γ}. Lemma B.3. If in round t the switching criteria I is not satisfied and the event Ew holds, we have | x, bθo θ | 1 R for all x Xt. | x, bθo θ | x H 1 w bθo θ H w (Cauchy-Schwarz) x H 1 w γ (via Lemma B.2) γ κ x V 1 w (Vw κH w) R2κγ2 (warm-up criteria is not satisfied) Recall that Ht is defined in line 14 of Algorithm 2. Further, we define s [t 1]\To µ ( xs, θ ) xsx s + λI Corollary B.4. Under event Ew, Ht H t e2Ht Proof. For a given s [t 1] \ To, let bθs o denote the value of bθo in that round. Then, for all x Xs, by Lemma C.2, µ x, bθs o exp( R| x, bθs o θ |) µ ( x, θ ) µ x, bθs o exp(R| x, bθs o θ |) Applying lemma B.3, gives e 1 µ x, bθs o µ ( x, θ ) e1 µ x, bθs o . Thus, s [t 1]\To e 1 µ xs, bθs o xsx s + λI X s [t 1]\To µ ( xs, θ ) xsx s + λI = H t . Further, H t P s [t 1]\To e2 µ( xs,bθs o ) e xsx s + e2λI = e2Ht. Recall that τ is the round when the Switching Criterion II (Line 9) is satisfied. Now we define the following quantities: s [τ 1]\To µ ( xs, θ ) xs + λτθ s [τ 1]\To µ ( xs, θ ) xsx s + λI Θ = n θ : θ bθo V γ κ o θ = arg min θ Rd s [t 1]\To ℓ(θ, xs, rs) Moreover, recall the following definition bθτ: bθτ = arg min θ Θ gτ(θ) gτ( θ) Hτ (θ) 1 (21) Lemma B.5. Under event Ew, bθτ θ V 2γ κ Proof. 
First, we observe from Lemma B.2 that bθo θ H w γ. Using V κH w, we can write bθo θ V γ κ. This implies that θ Θ. Now, bθτ Θ by virtue of being a feasible solution to the optimization in (21). Thus, bθτ θ V = bθτ bθo + bθo θ V bθτ bθo V + bθo θ V (triangle inequality) Lemma B.6. Let δ (0, 1). Then, under event Ew, with probability 1 δ, bθτ θ H τ β. Proof. We have for all rounds s [τ 1] \ To, | xs, θ bθτ | xs V 1 θ bθτ V (Cauchy-Schwarz) xs V 1 2γ κ (by Lemma B.5) 2γ κ 1 Rγ κ (warm up criterion not satisfied) Also note that θ Θ. Hence, bθτ θ H τ 2(1 + 2) s [τ 1] rsxs H 1 (by Lemma E.1) d log(T/δ) (by Lemma B.1) Let the event in lemma B.6 be denoted by Eτ, or in other words, Eτ = { bθτ θ H τ β}. B.3 Bounding the instantaneous regret In this subsection, we will only consider rounds t [T] which does not satisfy Switching Criterion I. Let xt Xt be the played arm defined via line 13 Algorithm 2. Further, let x t Xt be the best available arm in that round. Corollary B.7. Under the event Eτ, for all x Xt, we have, | x, bθτ θ | β x H 1 τ . | x, bθτ θ | x H 1 τ bθτ θ H τ (Cauchy-Schwartz) β x H 1 τ (by Lemma B.6) β x H 1 τ (Hτ H τ) Lemma B.8. Under event Eτ, x t xt, θ 2 2β xt H 1 t . x t , θ xt, θ x t , bθτ + β x t H 1 t xt, bθτ β xt H 1 t (by Corollary B.7) xt, bθτ + β xt H 1 τ xt, bθτ β xt H 1 τ (optimistic xt, see line 13 algo. 2) = 2β xt H 1 τ 2β xt H 1 t (Lemma B.13, det(H 1 τ ) det(H 1 t ) = det(Ht) det(Hτ ) 2) Lemma B.9. The arm set X t obtained after eliminating arms from Xt (line 12 Algorithm 2), under event Ew, satisfies: (a) x t Xt, (b) x t xt, θ 4 Proof. Suppose x = arg maxx Xt LCBo(x). Now, we have, for all x Xt, | x, bθo θ | x H 1 w bθo θ H w (Cauchy-Schwarz) γ x H 1 w (by Lemma B.2) γ κ x V 1 (V κH w) Thus, UCBo(x t ) = x t , bθo + γ κ x t V 1 x t , θ x , θ x , bθo γ κ x V 1 = LCBo(x ) , where the second inequality is due to optimality of x t . Hence, x t is not eliminated, implying x t X t. This completes the proof of (a). Since xt is also in X t (by definition), UCBo(xt) = xt, bθo + γ κ xt V 1 x , bθo γ κ x V 1 x t , bθo γ κ x t V 1 (x has max LCBo( )) Again, using the fact that x t , bθo x t , θ γ κ x t V 1 and xt, bθo xt, θ + γ κ xt V 1, we obtain, xt, θ + 2γ κ xt V 1 x t , θ 2γ κ x t V 1 which gives us, x t xt, θ 2γ κ xt V 1 + 2γ κ x t V 1 Finally, since Switching Criterion I is not satisfied in this round, x V 1 < 1 Rγ κ for all x Xt. Plugging this above, x t xt, θ 4 B.4 Proof of Theorem 4.2 In this subsection, we complete the proof of the regret bound of RS-GLin CB(Algorithm 2). We first restate Theoreom 4.2 and then prove it. For every round t [T], we use xt Xt to denote the arm played by the algorithm and x t to denote the best available arm in that round. Theorem B.10 (Theorem 4.2). Given δ (0, 1), with probability 1 δ, the regret of RS-GLin CB (Algorithm 2) satisfies RT = O d q P t [T ] µ ( x t , θ ) log (RT/δ) + κd2R5S2 log2 (T/δ) . Proof. Firstly, we will assume throughout the proof that Ew Eτ holds, which happens with probability at least 1 δ. Thus, regret of Algorithm 2 is upper bounded as: t [T ] µ ( x t , θ ) µ ( xt, θ ) t [T ]\To µ ( x t , θ ) µ ( xt, θ ) (Upper bound of R for rounds in To) c R3κγ2 log(T/δ) + X t [T ]\To µ(z) x t xt, θ (some z [ xt, θ , x t , θ ]; lemma B.11) Now, let R1(T) = P t [T ]\To µ(z) x t xt, θ . Hereon, we will slightly abuse notation Hτ to denote the Hτ matrix last updated before time t for each time step t [T]. This will be clear from the context as we will only use Hτ term-wise. 
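Before bounding $R_1(T)$, the following sketch makes the per-round behaviour analysed in Corollary B.7 and Lemmas B.8–B.9 concrete: on a round where Switching Criterion I does not fire, arms are first screened with the warm-up confidence set (bonus $\gamma\sqrt{\kappa}\,\|x\|_{V^{-1}}$, line 12 of Algorithm 2) and the optimistic arm is then chosen with the lazily updated estimate $\hat\theta_\tau$ (bonus $\beta\,\|x\|_{H_\tau^{-1}}$, line 13 of Algorithm 2). This is a schematic reading of those two steps, not a verbatim transcription of the algorithm; the function and variable names are ours, and the radii $\gamma$, $\beta$ and the parameter $\kappa$ are assumed to be supplied.

```python
import numpy as np

def mahalanobis(x, M_inv):
    """||x||_{M^{-1}} for a single arm x."""
    return float(np.sqrt(x @ M_inv @ x))

def select_arm(X_t, theta_o, V_inv, gamma, kappa, theta_tau, H_tau_inv, beta):
    """Sketch of a non-Switching-Criterion-I round.

    X_t        : (K, d) arm features for this round
    theta_o    : warm-up estimate (theta_hat_o)
    V_inv      : inverse of the warm-up design matrix V
    theta_tau  : lazily updated estimate (theta_hat_tau)
    H_tau_inv  : inverse of H frozen at the last Criterion-II update
    Returns the index of the played arm.
    """
    # Step 1: eliminate arms using the warm-up confidence set (cf. Lemma B.9).
    bonus_o = np.array([gamma * np.sqrt(kappa) * mahalanobis(x, V_inv) for x in X_t])
    ucb_o = X_t @ theta_o + bonus_o
    lcb_o = X_t @ theta_o - bonus_o
    survivors = np.where(ucb_o >= lcb_o.max())[0]

    # Step 2: among survivors, play the optimistic arm w.r.t. theta_hat_tau (cf. Lemma B.8).
    scores = [X_t[i] @ theta_tau + beta * mahalanobis(X_t[i], H_tau_inv) for i in survivors]
    return int(survivors[int(np.argmax(scores))])
```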
With this, we upper bound R1(T) as follows: t [T ]\To µ(z)2β xt H 1 τ (by Lemma B.8) t [T ]\To µ(z)2 xt H 1 t (Lemma B.13, det(H 1 τ ) det(H 1 t ) = det(Ht) det(Hτ ) 2) t [T ]\To µ(z)e1 xt H 1 t (by Lemma B.4) µ ( x t , θ ) µ ( xt, θ ) exp(R x t xt, θ ) xt H 1 t (by Lemma C.2) µ ( x t , θ ) p µ ( xt, θ )e4 xt H 1 t (by Lemma B.9) µ ( x t , θ ) p µ ( xt, θ ) xt H 1 t µ ( x t , θ ) xt H 1 t ( xt = p µ ( xt, θ )xt) t [T ]\To µ ( x t , θ ) t [T ]\To xt 2 H 1 t (Cauchy-Schwarz) t [T ]\To µ ( x t , θ ) 2d log 1 + RT (Lemma B.12; xt 2 R) cd log(RT/δ) s X t [T ]\To µ ( x t , θ ). Putting things back, RT cd log(RT/δ) s X t [T ]\To µ ( x t , θ ) + c R5S2κ log(T/δ)2. B.5 Bounding number of policy updates: Proof of Lemma 4.1 We first obtain a bound on the number of rounds when Switching Criterion I is satisfied. Then we restate Lemma 4.1 and present its proof. Here, we use To to denote the collection of all rounds till T for which Switching Criterion I is satisfied. Lemma B.11. Algorithm 2, during its entire execution, satisfies the Switching Criterion I at most 2d R2κγ2 log (T/δ) times. Proof. Recall that Switching Criterion I (Line 4) is satisfied, when x 2 V 1 > 1/(R2κγ2) for some x Xt. Let Vm be the sequence of V matrices (line 6 of Algorithm 2) for m To. That is, V1 = λI, Vm = P s [m 1] To xsx s + λI. In these rounds, by Line 5 of Algorithm 2, we have that the arm played xt is such that xt = arg maxx Xt x V 1 t . Therefore, X t To xt 2 V 1 t τo R2κγ2 (22) Furthermore, by the Elliptic Potential Lemma (Lemma B.12) we have X t To xt 2 V 1 t 2d log 1 + τo Combining (23) and (22) we have τo 2d R2κγ2 log 1 + τo 2d R2κγ2 log (T) 2d R2κγ2 log (T/δ) Lemma (4.1). Algorithm 2, during its entire execution, updates its policy at most O(R4S2 κd2 log2(T/δ)) times. Proof. Note that in Algorithm 2, policy changes happen only in the rounds when either of the Switching Criteria are triggered. The number of times Switching Criterion I is triggered is bounded by Lemma B.11. On the other hand, the number of times Switching Criterion II is triggered is equal to the number of times determinant of Ht doubles, which is bounded by Lemma B.15. Thus in total, the number of policy changes in Algorithm 2 is upper bounded by 2d R2κγ2 log(T/δ) + cd log(T) . B.6 Some Useful Lemmas Lemma B.12 (Elliptic Potential Lemma (Lemma 10 [1])). Let x1, x2, . . . xt be a sequence of vectors in Rd and let xs 2 L for all s [t]. Further, let Vs = Ps 1 m=1 xmx m + λI. Suppose λ L2. Then, s=1 xs 2 V 1 s 2d log 1 + L2t Lemma B.13 (Lemma 12 of [1]). Let A B 0. Then x Ax x Bx det(A) Lemma B.14 (Lemma 10 of [1]). Let {xs}t s=1 be a set of vectors. Define the sequence {Vs}t s=1 as V1 = λI, Vs+1 = Vs + xsx s for s [t 1]. Further, let xs 2 L s [t]. Then, det(Vt) λ + t L2/d d . Lemma B.15. Let {xs}t s=1 be a set of vectors. Define the sequence {Vs}t s=1 as V1 = λI, Vs+1 = Vs + xsx s for s [t 1]. Further, let xs 2 L s [t]. Define the set {1 = τ1, τ2 . . . τm = t} such that: det(Vτi+1) 2 det(Vτi) but det(Vτi+1 1) < 2 det(Vτi) for i {2, . . . m 1}. Then, the number of time doubling happens,i.e., m, is at most O(d log(t)). Proof. By Lemma B.14, det(Vt) λ + t L2/d d. But we have that from definition of τi s det(Vt) det(Vτm 1) 2 det(Vτm 2) ... 2m 2 det(Vτ1) = 2m 2 det(V1) = 2m 2λd (V1 = λI) Thus, 2m 2λd λ + t L2/d d which implies that 2m 2 1 + t L2 Hence, m O(d log(t)) . 
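To illustrate how rarely the two switching criteria fire, the following is a small self-contained simulation that counts Criterion I triggers (an arm with $\|x\|^2_{V^{-1}}$ above a threshold, cf. Lemma B.11) and Criterion II triggers (the determinant of $H_t$ doubling, cf. Lemma B.15) on synthetic unit-norm contexts. It sketches only the bookkeeping, not the full algorithm; the threshold standing in for $1/(R^2\kappa\gamma^2)$ and the data distribution are illustrative assumptions.

```python
import numpy as np

def count_policy_updates(T=5000, d=5, threshold=0.05, seed=0):
    """Count how often the two switching criteria of a lazily updated policy fire.

    threshold plays the role of 1 / (R^2 * kappa * gamma^2) for Criterion I.
    """
    rng = np.random.default_rng(seed)
    lam = 1.0
    V = lam * np.eye(d)                  # design matrix over Criterion-I rounds
    H = lam * np.eye(d)                  # design matrix over the remaining rounds
    det_at_last_update = np.linalg.det(H)
    n_crit1 = n_crit2 = 0

    for _ in range(T):
        X = rng.normal(size=(20, d))
        X /= np.linalg.norm(X, axis=1, keepdims=True)
        lev = np.einsum("ij,jk,ik->i", X, np.linalg.inv(V), X)
        if lev.max() > threshold:
            # Criterion I: play the most informative arm, update V (and the warm-up MLE).
            x = X[np.argmax(lev)]
            V += np.outer(x, x)
            n_crit1 += 1
        else:
            x = X[rng.integers(len(X))]  # placeholder for the optimistic arm
            H += np.outer(x, x)
            if np.linalg.det(H) > 2.0 * det_at_last_update:
                # Criterion II: determinant doubled, so recompute theta_hat_tau.
                det_at_last_update = np.linalg.det(H)
                n_crit2 += 1
    return n_crit1, n_crit2

if __name__ == "__main__":
    print(count_policy_updates())
```

Consistent with Lemma 4.1, both counters grow at most polylogarithmically in $T$ (Criterion I through the elliptic potential argument of Lemma B.11, Criterion II through the determinant-doubling argument of Lemma B.15), while the number of rounds grows linearly.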
C Useful Properties of GLMs Recall that a Generalized Linear Model is characterized by a canonical exponential family, i.e., the random variable r has density function pz (r) = exp (rz b (z) + c (r)), with parameter z, log-partition function b( ), and a function c. Further, b(z) = µ(z) is also called the link function. Hereon, we will assume that the random variable has a bounded non-negative support, i.e., r [0, R] almost surely. Now, we state the following key Lemmas on GLMs Lemma C.1 (Self-Concordance for GLMs). For distributions in the exponential family the function µ( ) satisfies that for all z R, | µ(z)| R µ(z). Proof. Indeed, | ... b (z)| = |E[(r E[r])3]| (Lemma C.3) E |(r E[r])3| (Jensen s inequality) = E |r E[r]| (r E[r])2 E[R(r E[r])2] (r, E[r] [0, R]) = R E[(r E[r])2] = R b(z) (Lemma C.3) As a consequence, we have the following simple modification of the self-concordance results of [6]. Lemma C.2. For an exponential distribution with log-partition function b( ), for all z1, z2 R, letting µ(z) := b(z), following holds: α(z1, z2) := Z 1 v=0 µ (z1 + v (z2 z1)) µ (z) 1 + R|z1 z2| for z {z1, z2} (25) µ(z2) e R|z2 z1| µ (z1) e R|z2 z1| µ (z2) (26) α(z1, z2) := Z 1 v=0 (1 v) µ (z1 + v(z2 z1)) dv µ(z1) 2 + R|z1 z2| (27) Proof. Without loss of generality, assume that z2 z1. Note that by property of integration R b a f(x)dx = R a b f(b + a x)dx, α(z1, z2) = α(z2, z1). Now, by proposition C.1, and the fact that µ(z) = ... b (z), we have for any v R and z z1, R µ(v) µ(v) R µ(v) (Lemma C.1) µ(v)dv R Z z1 R(z z1) log µ(z) µ(z1) exp( R(z z1)) µ(z) µ(z1) exp(R(z z1)) Putting z = z2 establishes 26. To show 25, we further set z = z1 + u(z2 z1) for u [0, 1], (note that z z1) and integrate on u, 0 exp( Ru(z2 z1))du Z 1 0 µ(z1 + u(z2 z1))du Z 1 0 exp(Ru(z2 z1))du which gives µ(z1)1 exp( R(z2 z1)) R(z2 z1) α(z1, z2) µ(z1)exp(R(z2 z1)) 1 Next, we use the fact that for x > 0, e x (1 + x) 1 which on rearranging gives (1 e x)/x 1/(1 + x). Applying this inequality to the LHS above finishes the proof. Note that similar exercise can be repeated with z2 z1 to get the same result for z2. For 27, we have, by application of 26, µ(z1 + v(z2 z1)) µ(z1) exp(R|v(z2 z1)|). Therefore, α(z1, z2) = Z 1 v=0 (1 v) µ (z1 + v(z2 z1)) dv v=0 (1 v) µ(z1) exp( R|v(z1 z2)|)dv = µ(z1) Z 1 v=0 (1 v) exp( Rv|(z1 z2)|)dv (v [0, 1]) = µ(z1) 1 R|z1 z2| + exp( R|z1 z2|) 1 µ(z1) 1 2 + R|z1 z2| (Lemma 10 of [2]) Next we state some nice properties of the GLM family that is the key in deriving Lemma C.1. Lemma C.3 (Properties of GLMs). For any random variable r that is distributed by a canonical exponential family, we have 1. E [r] = µ (z) = b (z) 2. V[r] = E h (r E [r])2i = µ (z) = b (z) 3. E h (r E[r])3i = ... b (z) Proof. 1. Indeed, since pz(r) is a probability distribution, R r pz(r)dr = 1 which in turn implies that b(z) = log R r exp(rz + c(r))dr . Thus, taking derivative, r exp(rz + c(r))dr z exp(rz + c(r))dr = exp( b(z)) Z r r exp(rz + c(r))dr r r exp(rz b(z) + c(r))dr = E[r] 2. Let f(z) := R r r exp(rz + c(r))dr. Thus, b(z) = exp( b(z))f(z). Taking derivative on both sides, b(z) = b(z) exp( b(z))f(z) + exp( b(z)) f(z) = E[r]2 + exp( b(z)) Z r r2 exp(rz + c(r))dr = E[r]2 + Z r r2 exp(rz b(z) + c(r))dr = E[r]2 + E[r2] = V[r] 3. Again let f(z) := R r r2 exp(rz+c(r))dr. Thus, b(z) = b(z)2+exp( b(z))f(z). Taking derivative on both sides, ... b (z) = 2 b(z) b(z) b(z) exp( b(z))f(z) + exp( b(z)) f(z) = 2 b(z) b(z) b(z) E[r2] + Z r r3 exp(rz b(z) + c(r))dr = 2 E[r]V[r] E[r] E[r2] + E[r3] Now, let us expand E[(r E[r])3]. 
\[
\mathbb{E}\big[(r - \mathbb{E}[r])^3\big] = \mathbb{E}\big[r^3 - 3r^2\,\mathbb{E}[r] + 3r\,\mathbb{E}[r]^2 - \mathbb{E}[r]^3\big] = \mathbb{E}[r^3] - 3\,\mathbb{E}[r]\,\mathbb{E}[r^2] + 2\,\mathbb{E}[r]^3
\]
\[
= \mathbb{E}[r^3] - \mathbb{E}[r]\,\mathbb{E}[r^2] - 2\,\mathbb{E}[r]\big(\mathbb{E}[r^2] - \mathbb{E}[r]^2\big) = \mathbb{E}[r^3] - \mathbb{E}[r]\,\mathbb{E}[r^2] - 2\,\mathbb{E}[r]\,\mathbb{V}[r].
\]
This matches the expression obtained for $\dddot{b}(z)$ above, which completes the proof.
Corollary C.4. For every exponential family, $b(\cdot)$ is a convex function.
Proof. Indeed, note that $\ddot{b}(z) = \mathbb{V}[r]$, which is always non-negative. Thus, $\ddot{b}(z) \ge 0$, implying that $b(\cdot)$ is convex.
Remark C.5. In [5], Section 1.4.1, the author claims that if the GLM parameter $z$ lies in a bounded set, then the GLM is self-concordant, i.e., $|\ddot{\mu}(z)| \le a\,\dot{\mu}(z)$ for some appropriate constant $a$ over this bounded set. Thereafter, the author notes that the techniques developed in [5] guarantee $\kappa$-free regret rates (in the $T$ term) for such GLMs (i.e., all GLMs with bounded parameter). However, the claim regarding self-concordance of GLMs is not true in general: there are classes of GLMs whose parameter may be restricted to a bounded set and yet no such constant $a$ exists. One such example is the exponential distribution, whose link function is $\mu(z) = -1/z$. If we allow $z$ to lie in the set $(-c, 0)$ for some positive $c$, then $\mu(z)$ is strictly increasing (satisfying our assumption on the monotonicity of $\mu$, hence a valid example). However, for this GLM, $\dot{\mu}(z) = 1/z^2$ and $\ddot{\mu}(z) = -2/z^3$. Note that $\dot{\mu}(z)$ is positive on the assumed range of $z$. Suppose this GLM were self-concordant; then there would exist a positive constant $a$ such that $|\ddot{\mu}(z)| = 2/|z|^3 \le a\,\dot{\mu}(z) = a/z^2$. Simplifying, we obtain $2/|z| \le a$. However, since $z \in (-c, 0)$, we have $\lim_{z \to 0^-} 2/|z| = \infty$. Hence, no such constant $a$ is possible. This counterexample shows that a bounded parameter set is not enough to guarantee self-concordance of a GLM. In this work, we give a characterization of self-concordance for GLMs whose random variable has bounded support. It would be interesting to obtain a complete characterization of self-concordance of GLMs.
D Computational Cost
Consider a log-loss minimization oracle that returns the unconstrained MLE for a given GLM class with a computational complexity of $C_{\mathrm{opt}} \cdot n$ when the log-loss is computed over $n$ data points. Let the maximum number of arms available in any round be $K$. Further, let $NC_{\mathrm{opt}}$ be the computational cost of an oracle that solves the non-convex optimization problem (8).
Computational Cost of B-GLinCB: In the B-GLinCB algorithm, we employ the log-loss oracle at the end of each batch. The estimator $\hat{\theta}$ computed at the end of a batch of length $\tau$ incurs a computational cost of $C_{\mathrm{opt}}\,\tau$. Furthermore, this oracle is invoked once per batch, i.e., at most $\min\{M, \log\log T\}$ times. Additionally, the computation of the distributional optimal design at the end of each batch is efficient in $d$ (poly$(d)$). Moreover, in every round, the algorithm solves the D/G-optimal design problem (requiring $O(d \log d)$ computation) and runs elimination based on the prior (at most $\log\log T$) phases. Hence, the amortized cost per round of B-GLinCB is $O(K \log\log T + d \log d + C_{\mathrm{opt}})$.
Computational Cost of RS-GLinCB: In the RS-GLinCB algorithm, the estimator $\hat{\theta}_o$ is recomputed each time Switching Criterion I is triggered. Additionally, during rounds when Switching Criterion I is not triggered, the estimator $\hat{\theta}_\tau$ is recomputed at most $O(\log T)$ times. These computations use both the log-loss oracle and the non-convex projection oracle. Furthermore, in each non-Switching-Criterion-I round, the algorithm executes an elimination step. This yields an amortized time complexity of $O(C_{\mathrm{opt}} \log T + NC_{\mathrm{opt}} \log^2 T + K)$ per round.
Performance in Practice: As evident from Fig.
1, RS-GLin CB has much better computational performance in practice. We ran all the experiments on an Azure Data Science VM equipped with AMD EPYC 7V13 64-Core Processor (clock speed of 2.45 GHz) and Linux Ubuntu 20.04 LTS operating system. It was ensured that no other application processes were running while we tested the performance. We implemented and tested our code in Python, and measured the execution times using time.time() command. We allowed no operations for 10 seconds after every run to let the CPU temperature come back to normal, in case the execution heats up the CPU, thereby causing subsequent runs to slow down. Comparison with ECOLog [7] shows that execution time for RS-GLin CB is significantly smaller. We posit that this is because RS-GLin CB solves a large convex optimization problem but less frequently, resulting into smaller overhead at the implementation level, while ECOLog solves a smaller convex optimization problem, but does so every round. On an implementation level, this translates into more function calls and computation. Further, we observe that with increasing κ, the execution time of RS-GLin CB increases, which is in accordance with Lemma 4.1 that quantifies the number of policy switches as an increasing function of κ. While comparing with GLOC [14], we observe that RS-GLin CB performs better than GLOC in both high and low κ regimes. Since GLOC runs an online convex optimization (online Newton step) algorithm to generate its confidence sets, the time taken by GLOC is nearly constant with changing κ. On the other hand, in accordance with Lemma 4.1, the computational cost of RS-GLin CB increases with κ. However, after a few initial rounds, when neither of the switching criteria are triggered, RS-GLin CB does not need to solve any computationally intensive optimization problem, hence these rounds execute very fast. In practice, with typical data distribution, RS-GLin CB reaches this stage much before what the worst-case guarantees show, hence we see it perform better than GLOC. E Projection We describe the projection step used in Algorithms 1 and 2. We present arguments similar to the ones made in Appendix B.3 of [6]. We write s=1 µ ( θ, xs ) xsx T s + λI Recall, H = H(θ ). Let bθ be the MLE estimator of θ calculated after the sequence arm pulls x1, x2, . . . , xt. Let r1, r2, . . . , rt be the corresponding observed rewards. We project bθ to a set Θ by solving the following optimization problem eθ := arg min θ Θ s=1 (µ ( xs, θ ) µ xs, bθ )xs H(θ) 1 (28) Lemma E.1. Using the notations described above, if θ Θ and maxi [t] | xi, eθ θ | c/R, then we have eθ θ H(θ ) 2(1 + c) s=1 (µ ( xs, θ ) rs)xs Proof. First, we note that by self-concordance property of µ (lemma C.2), for any s [t], α(xs, eθ, θ ) µ ( xs, θ ) 1 + R| xs, eθ θ | µ ( xs, θ ) 1 + R(c/R) (maxi [s] | xs, eθ θ | c/R) = µ ( xs, θ ) Similarly, we have α(xs, eθ, θ ) µ( xs,eθ ) Let us define the matrix G = P s [t] α(x, eθ, θ )xsx s. Using the above fact, we obtain the relation: G 1 1+c H and G 1 1+c H(eθ). Also define the vector g(θ) = P s [t] µ ( θ, xs ) xs. Now, 1 + c eθ θ G (H ( 1 + c)G) 1 + c G eθ θ G 1 α(xs, eθ, θ ) eθ θ , xs xs µ xs, eθ µ ( xs, θ ) xs G 1 (Taylor s theorem) s [t] µ eθ, xs xs s [t] µ ( θ , xs ) xs Let g(θ) = Pt s=1 µ ( xs, θ ) xs for any θ. 
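As a computational aside, the projection step starts from the unconstrained MLE $\hat\theta$. The sketch below fits $\hat\theta$ for a logistic GLM (one instance of the log-loss oracle from Appendix D) by gradient descent on the regularized log-loss, and then performs a constrained refit over a Euclidean ball by projected gradient descent. The ball is used here purely as an illustrative stand-in for the set $\Theta$ (in the algorithm, $\Theta$ is the ellipsoid $\{\theta : \|\theta - \hat\theta_o\|_V \le \gamma\sqrt{\kappa}\}$, and the constrained refit corresponds to the convex program discussed in Section E.1 below). The data, step sizes, and ball radius are assumptions made for the example.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def logistic_mle(X, r, lam=1.0, lr=0.1, n_iters=2000):
    """Unconstrained regularized MLE theta_hat for a logistic GLM, via gradient
    descent on sum_s loss(theta, x_s, r_s) + (lam/2) * ||theta||^2."""
    theta = np.zeros(X.shape[1])
    for _ in range(n_iters):
        grad = X.T @ (sigmoid(X @ theta) - r) + lam * theta
        theta -= lr * grad / len(X)
    return theta

def constrained_refit(X, r, center, radius, lam=1.0, lr=0.1, n_iters=2000):
    """Projected-gradient sketch of a constrained log-loss fit over the Euclidean
    ball {theta : ||theta - center|| <= radius}, standing in for the set Theta."""
    theta = center.copy()
    for _ in range(n_iters):
        grad = X.T @ (sigmoid(X @ theta) - r) + lam * theta
        theta -= lr * grad / len(X)
        diff = theta - center
        nrm = np.linalg.norm(diff)
        if nrm > radius:                   # project back onto the ball
            theta = center + diff * (radius / nrm)
    return theta

if __name__ == "__main__":
    rng = np.random.default_rng(1)
    theta_star = np.array([0.5, -0.3, 0.2])
    X = rng.normal(size=(500, 3))
    X /= np.linalg.norm(X, axis=1, keepdims=True)
    r = rng.binomial(1, sigmoid(X @ theta_star))
    theta_hat = logistic_mle(X, r)
    theta_proj = constrained_refit(X, r, center=theta_hat, radius=0.1)
    print(theta_hat, theta_proj)
```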
Therefore, we have, eθ θ H 1 + c g(eθ) g(θ ) G 1 1 + c g(eθ) g(bθ) + g(bθ) g(θ ) G 1 1 + c g(eθ) g(bθ) G 1 + g(bθ) g(θ ) G 1 ( inequality) (1 + c) g(eθ) g(bθ) H(eθ) 1 + g(bθ) g(θ ) H 1 (H 1 ( 1 + c)G 1) 2(1 + c) g(bθ) g(θ ) H 1 (by (28)) H 1 (bθ is the unconstrained MLE, g(bθ) = P s [t] rsxs.) E.1 Convex Relaxation The optimization problem in (28) is a non-convex optimization problem and therefore it is not clear what is the computational complexity of the problem. However, it is possible to substitute this optimization problem with a convex one, whose computational complexity can be better tractable. The process is similar to the one detailed in [2, section 6]. Here we briefly outline the steps. Let Lt(θ) = Pt s=1 ℓ(θ, xs, rs) and θ be defined as follows: θ := arg min θ Θ Lt(θ) (29) Note that when the set Θ is a convex set, then the above optimization problem is convex by property of the log-likelihood function of GLMs. Hence it can be solved efficiently. With this projected θ, we have the following guarantee: Lemma E.2. Suppose g(bθ) g(θ ) H 1 γ and λ = γ/R. If θ Θ and maxi [t] | xi, θ θ | c/R, then we have θ θ H(θ ) c p (2 + c)R3Sγ Proof. First we note that by self-concordance property of µ, for any s [t], α(xs, θ , θ) µ ( xs, θ ) 2 + R| xs, θ θ | (Lemma C.2) µ ( xs, θ ) 2 + R(c/R) (maxi [s] | xs, eθ θ | c/R) = µ ( xs, θ ) Let us define G(θ , θ) := Pt s=1 α(xs, θ , θ)xsx s. Using the above fact, we obtain G(θ , θ) 1 2+c H . We now follow closely the proof outlined in Appendix B.3 of [2] with minor changes. By second-order Taylor s expansion, for any θ Rd, we can write Lt(θ) Lt(θ ) Lt(θ ), θ θ = θ θ 2 G(θ,θ ) 1 2 + c θ θ 2 H Taking absolute value on both sides, and substituting θ = θ, H (2 + c) |Lt( θ) Lt(θ )| + | Lt(θ ), θ θ | ( -inequality) (2 + c) |Lt( θ) Lt(θ )| + Lt(θ ) H 1 θ θ H (Cauchy-Schwarz) |Lt( θ) Lt(θ )| + Recall that bθ is the unconstrained MLE, therefore Lt(bθ) = 0. By a similar Taylor expansion as above and some algebraic manipulations (see Appendix B.3 of [2]), we have, for θ . Lt(θ ) Lt(bθ) g(θ ) g(bθ) 2 g(θ ) g(bθ) 2 H 1 + g(θ ) g(bθ) H 1 λ γ2 + γ (Lemma B.1) 2R3Sγ (recall R2λ = γ/RS) We also have, by definition of θ, whenever θ Θ, Lt( θ) Lt(θ ), therefore we have Lt( θ) Lt(bθ) Lt(θ ) Lt(bθ) 2R3Sγ Thus, we have, θ θ 2 H (2 + c) 4R3Sγ + γ θ θ H Using the inequality that for some x2 bx + c = x b + c, we have, θ θ H (2 + c)γ + p (2 + c)4R3Sγ (2 + c)R3Sγ Neur IPS Paper Checklist IMPORTANT, please: Delete this instruction block, but keep the section heading Neur IPS paper checklist", Keep the checklist subsection headings, questions/answers and guidelines below. Do not modify the questions and only use the provided macros for your answers. Question: Do the main claims made in the abstract and introduction accurately reflect the paper s contributions and scope? Answer: [Yes] Justification: See Section 1 Guidelines: The answer NA means that the abstract and introduction do not include the claims made in the paper. The abstract and/or introduction should clearly state the claims made, including the contributions made in the paper and important assumptions and limitations. A No or NA answer to this question will not be perceived well by the reviewers. The claims made should match theoretical and experimental results, and reflect how much the results can be expected to generalize to other settings. It is fine to include aspirational goals as motivation as long as it is clear that these goals are not attained by the paper. 2. 
Limitations
Question: Does the paper discuss the limitations of the work performed by the authors?
Answer: [Yes]
Justification: See Conclusion (Section 6).
3. Theory Assumptions and Proofs
Question: For each theoretical result, does the paper provide the full set of assumptions and a complete (and correct) proof?
Answer: [Yes]
Justification: All details are stated in Section 2 and the proofs are given in detail in Appendices A, B and C.
4. Experimental Result Reproducibility
Question: Does the paper fully disclose all the information needed to reproduce the main experimental results of the paper to the extent that it affects the main claims and/or conclusions of the paper (regardless of whether the code and data are provided or not)?
Answer: [Yes]
Justification: Detailed discussion is given in Section 5 and Appendix D. Code can be accessed at https://github.com/nirjhar-das/GLBandit_Limited_Adaptivity.
5. Open access to data and code
Question: Does the paper provide open access to the data and code, with sufficient instructions to faithfully reproduce the main experimental results, as described in supplemental material?
Answer: [Yes]
Justification: Full code with documentation is provided at https://github.com/nirjhar-das/GLBandit_Limited_Adaptivity.
6. Experimental Setting/Details
Question: Does the paper specify all the training and test details (e.g., data splits, hyperparameters, how they were chosen, type of optimizer, etc.) necessary to understand the results?
Answer: [Yes]
Justification: See Section 5.
7. Experiment Statistical Significance
Question: Does the paper report error bars suitably and correctly defined or other appropriate information about the statistical significance of the experiments?
Answer: [Yes]
Justification: Execution-time plots have error bars; regret plots do not, since the error bars cluttered the plots and the variation was not significant.
8. Experiments Compute Resources
Question: For each experiment, does the paper provide sufficient information on the computer resources (type of compute workers, memory, time of execution) needed to reproduce the experiments?
Answer: [Yes]
Justification: See Appendix D.
9. Code of Ethics
Question: Does the research conducted in the paper conform, in every respect, with the NeurIPS Code of Ethics https://neurips.cc/public/EthicsGuidelines?
Answer: [Yes]
Justification: There are no potential harms caused by the research process, as it is mainly theoretical and the experiments are all in simulation. There may be societal impact of the algorithms developed here when applied in practical decision-making settings; studying this is beyond the scope of the current work.
10. Broader Impacts
Question: Does the paper discuss both potential positive societal impacts and negative societal impacts of the work performed?
Answer: [No]
Justification: The problem studied is of theoretical interest and we do not foresee any negative societal impacts.
11. Safeguards
Question: Does the paper describe safeguards that have been put in place for responsible release of data or models that have a high risk for misuse (e.g., pretrained language models, image generators, or scraped datasets)?
Answer: [NA]
Justification: The data and models are only toy data / fully simulated and therefore pose no such risk.
12. Licenses for existing assets
Question: Are the creators or original owners of assets (e.g., code, data, models), used in the paper, properly credited and are the license and terms of use explicitly mentioned and properly respected?
Answer: [Yes]
Justification: All credits regarding code are given in the README.md file at https://github.com/nirjhar-das/GLBandit_Limited_Adaptivity.
13. New Assets
Question: Are new assets introduced in the paper well documented and is the documentation provided alongside the assets?
Answer: [Yes]
Justification: All details are provided in the README.md of our codebase https://github.com/nirjhar-das/GLBandit_Limited_Adaptivity.
14. Crowdsourcing and Research with Human Subjects
Question: For crowdsourcing experiments and research with human subjects, does the paper include the full text of instructions given to participants and screenshots, if applicable, as well as details about compensation (if any)?
Answer: [NA]
Justification: Our work is mainly theoretical, with only simple simulated experiments.
15. Institutional Review Board (IRB) Approvals or Equivalent for Research with Human Subjects
Question: Does the paper describe potential risks incurred by study participants, whether such risks were disclosed to the subjects, and whether Institutional Review Board (IRB) approvals (or an equivalent approval/review based on the requirements of your country or institution) were obtained?
Answer: [NA]
Justification: Our work does not involve human participants.