# Nash Regret Guarantees for Linear Bandits

Ayush Sawarni, Indian Institute of Science, Bangalore (sawarniayush@gmail.com) · Soumyabrata Pal, Google Research, Bangalore (soumyabrata@google.com) · Siddharth Barman, Indian Institute of Science, Bangalore (barman@iisc.ac.in)

37th Conference on Neural Information Processing Systems (NeurIPS 2023).

Abstract

We obtain essentially tight upper bounds for a strengthened notion of regret in the stochastic linear bandits framework. The strengthening, referred to as Nash regret, is defined as the difference between the (a priori unknown) optimum and the geometric mean of expected rewards accumulated by the linear bandit algorithm. Since the geometric mean corresponds to the well-studied Nash social welfare (NSW) function, this formulation quantifies the performance of a bandit algorithm as the collective welfare it generates across rounds. NSW is known to satisfy fairness axioms and, hence, an upper bound on Nash regret provides a principled fairness guarantee.

We consider the stochastic linear bandits problem over a horizon of $T$ rounds and with a set of arms $\mathcal{X}$ in ambient dimension $d$. Furthermore, we focus on settings in which the stochastic reward associated with each arm in $\mathcal{X}$ is a non-negative, $\nu$-sub-Poisson random variable. For this setting, we develop an algorithm that achieves a Nash regret of $O\left(\sqrt{\frac{d\nu}{T}\log(T|\mathcal{X}|)}\right)$. In addition, addressing linear bandit instances in which the set of arms $\mathcal{X}$ is not necessarily finite, we obtain a Nash regret upper bound of $O\left(\frac{d^{5/4}\nu^{1/2}}{\sqrt{T}}\log(T)\right)$. Since bounded random variables are sub-Poisson, these results hold for bounded, positive rewards. Our linear bandit algorithm is built upon the successive elimination method with novel technical insights, including tailored concentration bounds and the use of sampling via the John ellipsoid in conjunction with the Kiefer-Wolfowitz optimal design.

1 Introduction

Bandit optimization is a prominent framework for sequential decision making and has several applications across multiple domains, such as healthcare [26, 23, 27] and advertising [22]. In this framework, we have a set of arms (possible actions) with unknown means and a time horizon. The goal is to sequentially pull the arms such that the regret, which is a notion of loss defined over the bandit instance, is minimized.

We consider settings wherein the stochastic rewards generated by a sequential algorithm induce welfare across a population of agents. Specifically, there are $T$ agents, arriving one per round; in particular, the reward accrued at each round $t \in [T]$ corresponds to the value accrued by the $t$-th agent. Indeed, such a welfarist connection exists in various applications of the bandit framework. Consider, for instance, the classic context of drug trials [24]: Suppose there are $T$ patients and several available drugs. In each round $t \in [T]$, one of the available drugs is administered to the $t$-th patient. Subsequently, the reward accrued at the $t$-th round corresponds to the efficacy of the administered drug for the $t$-th patient. In such a setting, fairness is a fundamental consideration. That is, in addition to cumulative efficacy, individual effectiveness of the drugs is quite important.

A central notion in the bandit literature is that of average regret, defined as the difference between the (a priori unknown) optimum and the arithmetic mean of the expected rewards (accumulated by the algorithm) [4].
However, average regret fails to capture the fairness criterion that the rewards should be balanced (across the agents) and not just cumulatively high. From a welfarist viewpoint, the standard notion of (average) regret equates the algorithm's performance to the social welfare it induces. Social welfare is defined as the sum of the agents' rewards [19] and can be high among a set of agents even if a fraction of them receive indiscriminately low rewards. For instance, in the drug-trials example provided above, high average efficacy (i.e., high social welfare) does not rule out a severely ineffective outcome for a subset of agents. Given that average regret is defined using the sum of expected rewards, this notion inherits this utilitarian limitation of social welfare. In summary, in welfare-inducing contexts, a bandit algorithm with low average regret is not guaranteed to induce fair outcomes across rounds.

Addressing this issue, and with the overarching aim of achieving fairness across rounds (i.e., across agents that receive rewards from a bandit algorithm), the current work considers a strengthened notion of regret. The strengthening, referred to as Nash regret, is defined as the difference between the (a priori unknown) optimum and the geometric mean of expected rewards induced by the bandit algorithm. It is relevant to note that the geometric mean (of rewards) corresponds to the Nash social welfare (NSW) function [19]. This welfare function has been extensively studied in mathematical economics (see, e.g., [19]) and is known to satisfy fundamental fairness axioms, including the Pigou-Dalton transfer principle, scale invariance, and independence of unconcerned agents. Hence, by definition, Nash regret quantifies the performance of a bandit algorithm as the NSW it generates. Quantitatively speaking, in order for the geometric mean (i.e., the NSW) to be large, the expected reward at every round should be large enough. The AM-GM inequality also highlights that Nash regret is a more demanding objective than average regret.

We obtain novel results for Nash regret in the stochastic linear bandits framework. In this well-studied bandit setup, each arm corresponds to a $d$-dimensional vector $x$ (an arm-specific context) and the unknown arm means are modelled as a linear function of $x$. With a focus on average regret, stochastic linear bandits have been extensively studied in the past decade [1, 9, 21]. The current paper extends the line of work on linear bandits with fairness and welfare considerations.

Note that an ostensible approach for minimizing Nash regret is to take the logarithm of the observed rewards and, then, solve the average regret problem. However, this approach has the following shortcomings: (i) taking the logarithm implies that the modified rewards can have a very large range, possibly making the regret vacuous, and (ii) this approach leads to a multiplicative guarantee and not an additive one. In a recent work [5], the authors study Nash regret in the context of stochastic multi-armed bandits (with bounded rewards) and provide optimal guarantees. The current work notably generalizes this prior work to linear bandits.

1.1 Our Contributions and Techniques

We consider the stochastic linear bandits setting with a set of arms $\mathcal{X}$ over a finite horizon of $T$ rounds. Since we consider the welfarist viewpoint, we assume that the rewards across all the rounds are positive and, in particular, model the distribution of the arm rewards to be $\nu$-sub-Poisson, for a parameter $\nu \in \mathbb{R}_+$.
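Before stating our results, we note that the AM-GM comparison from the discussion above can be made precise in one line. The display below is a restatement (not one of the paper's numbered equations), using the notation $\langle X_t, \theta^* \rangle$ for the expected reward of the arm pulled in round $t$, which is formally introduced in Section 2.

```latex
% Nash regret dominates average regret: by the AM-GM inequality,
\[
\Big(\prod_{t=1}^{T} \mathbb{E}\big[\langle X_t, \theta^* \rangle\big]\Big)^{1/T}
\;\le\;
\frac{1}{T}\sum_{t=1}^{T} \mathbb{E}\big[\langle X_t, \theta^* \rangle\big]
\;\Longrightarrow\;
\underbrace{\langle x^*, \theta^* \rangle - \Big(\prod_{t=1}^{T} \mathbb{E}\big[\langle X_t, \theta^* \rangle\big]\Big)^{1/T}}_{\text{Nash regret}}
\;\ge\;
\underbrace{\langle x^*, \theta^* \rangle - \frac{1}{T}\sum_{t=1}^{T} \mathbb{E}\big[\langle X_t, \theta^* \rangle\big]}_{\text{average regret}}.
\]
```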
Our goal is to minimize the Nash regret $\mathrm{NR}_T$. We develop a novel algorithm LINNASH that obtains essentially optimal Nash regret guarantees for this setting. Specifically, for a finite set of arms $\mathcal{X} \subset \mathbb{R}^d$, our algorithm LINNASH achieves Nash regret $\mathrm{NR}_T = O\left(\sqrt{\frac{d\nu}{T}\log(T|\mathcal{X}|)}\right)$. For infinite sets of arms, a modified version of LINNASH achieves Nash regret $\mathrm{NR}_T = O\left(\frac{d^{5/4}\nu^{1/2}}{\sqrt{T}}\log T\right)$.

Recall that Nash regret is a strengthening of the average regret; the AM-GM inequality implies that, for any bandit algorithm, the Nash regret is at least as much as its average regret. Hence, in the linear bandits context, the known $\Omega(d/\sqrt{T})$ lower bound on average regret (see [17], Chapter 24) holds for Nash regret as well.¹ This observation implies that, up to a logarithmic factor, our upper bound on Nash regret is tight with respect to the number of rounds $T$. We also note that for instances in which the number of arms $|\mathcal{X}| = \omega(2^d)$, the Nash-regret dependence on $d$ has a slight gap. Tightening this gap is an interesting direction of future work.

We note that bounded, positive random variables are sub-Poisson (Lemma 1). Hence, our results hold for linear bandit instances wherein the stochastic rewards are bounded and positive. This observation also highlights the fact that the current work is a generalization of the result obtained in [5]. In addition, notice that, by definition, Poisson distributions are 1-sub-Poisson. Hence, our guarantees further hold for rewards that are not necessarily sub-Gaussian. Given the recent interest in obtaining regret guarantees beyond sub-Gaussian rewards [18, 3], our study of sub-Poisson rewards is interesting in its own right.²

Our linear bandit algorithm, LINNASH, has two parts. In the first part, we develop a novel approach of sampling arms such that, in expectation, the reward obtained is a linear function of the center of the John ellipsoid [13]. Such a strategy ensures that the expected reward in any round of the first part is sufficiently large. The second part of LINNASH runs in phases of exponentially increasing length. In each phase, we sample arms according to a distribution that is obtained as a solution of a concave optimization problem, known as the D-optimal design. We construct confidence intervals at each phase and eliminate sub-optimal arms. A key novelty in our algorithm and analysis is the use of confidence widths that are estimate dependent. We define these widths considering multiplicative forms of concentration bounds and crucially utilize the sub-Poisson property of the rewards. The tail bounds we develop might be of independent interest.

1.2 Other Related Work

There has been a recent surge of interest in achieving fairness guarantees in the context of multi-armed bandits; see, e.g., [14, 7, 20, 6, 12]. However, these works mostly consider fairness across arms and, in particular, impose fairness constraints that require each arm to be pulled a pre-specified fraction of times. By contrast, our work considers fairness across rounds.

Alternative Regret Formulations. In the current work, for the welfare computation, each agent $t$'s value is considered as the expected reward in round $t$. One can formulate stronger notions of regret by, say, considering the expectation of the geometric mean of the rewards, rather than the geometric mean of the expectations. However, as discussed in [5], it is not possible to obtain non-trivial guarantees for such reformulations in general: every arm must be pulled at least once.
Hence, if one considers the realized rewards (and not their expectations), even a single pull of a zero-reward arm will render the geometric mean zero.

2 Problem Formulation and Preliminaries

We will write $[m]$ to denote the set $\{1, 2, \ldots, m\}$. For a matrix $X$, let $\mathrm{Det}(X)$ denote the determinant of $X$. For any discrete probability distribution $\lambda$ with sample space $\Omega$, write $\mathrm{Supp}(\lambda) := \{x \in \Omega : \Pr_{X \sim \lambda}\{X = x\} > 0\}$ to denote the points for which the probability mass assigned by $\lambda$ is positive. For a vector $a \in \mathbb{R}^d$ and a positive definite matrix $V \in \mathbb{R}^{d \times d}$, we will denote $\|a\|_V := \sqrt{a^T V a}$. Finally, let $\mathcal{B} := \{x \in \mathbb{R}^d \mid \|x\|_2 \le 1\}$ be the $d$-dimensional unit ball.

We address the problem of stochastic linear bandits with a time horizon of $T \in \mathbb{Z}_+$ rounds. Here, an online algorithm (decision maker) is given a set of arms $\mathcal{X} \subseteq \mathbb{R}^d$. Each arm corresponds to a $d$-dimensional vector. Furthermore, associated with each arm $x \in \mathcal{X}$, we have a stochastic reward $r_x \in \mathbb{R}_+$. In the linear bandits framework, the expected value of the reward $r_x$ is modeled to be a linear function of $x \in \mathbb{R}^d$. In particular, there exists an unknown parameter vector $\theta^* \in \mathbb{R}^d$ such that, for each $x \in \mathcal{X}$, the associated reward's expected value $\mathbb{E}[r_x] = \langle x, \theta^* \rangle$. Given the focus on welfare contexts, we will, throughout, assume that the rewards are positive, $r_x > 0$, for all $x \in \mathcal{X}$.

The online algorithm (possibly randomized) must sequentially select an arm $X_t$ in each round $t \in [T]$ and, then, it observes the corresponding (stochastic) reward $r_{X_t} > 0$.³ For notational convenience, we will write $r_t$ to denote $r_{X_t}$. In particular, if in round $t$ the selected arm $X_t = x$, then the expected reward is $\langle x, \theta^* \rangle$, i.e., $\mathbb{E}[r_t \mid X_t = x] = \langle x, \theta^* \rangle$. We will, throughout, use $x^*$ to denote the optimal arm, $x^* = \arg\max_{x \in \mathcal{X}} \langle x, \theta^* \rangle$, and $\hat{\theta}$ to denote an estimator of $\theta^*$. In the stochastic linear bandits framework, our overarching objective is to minimize the Nash regret, defined as follows:
$$\mathrm{NR}_T := \max_{x \in \mathcal{X}} \langle x, \theta^* \rangle - \left(\prod_{t=1}^{T} \mathbb{E}\left[\langle X_t, \theta^* \rangle\right]\right)^{1/T} \tag{1}$$
Note that the definition of Nash regret is obtained by applying the Nash social welfare (geometric mean) onto the ex ante rewards, $\mathbb{E}[\langle X_t, \theta^* \rangle]$,⁴ accrued across the $T$ rounds.

¹ This lower bound on average regret is obtained for instances in which the set of arms $\mathcal{X}$ are the corners of a hypercube [17].
² An intersection between the sub-Gaussian and sub-Poisson distribution classes is identified in Lemma 2.
³ Note that, for a randomized online algorithm, the selected arm $X_t$ is a random variable.
⁴ Here, the expectation is with respect to the random variable $X_t$.

2.1 Sub-Poisson Rewards

In order to model the environment with positive rewards ($r_x > 0$), we assume that the rewards $r_x$ associated with the arms $x \in \mathcal{X}$ are $\nu$-sub-Poisson, for some parameter $\nu > 0$. Formally, their moment-generating function satisfies the following bound
$$\mathbb{E}\left[e^{\lambda r_x}\right] \le \exp\left(\nu^{-1}\,\mathbb{E}[r_x]\left(e^{\nu\lambda} - 1\right)\right) = \exp\left(\nu^{-1}\langle x, \theta^* \rangle\left(e^{\nu\lambda} - 1\right)\right) \quad \text{for all } \lambda \in \mathbb{R}. \tag{2}$$
Note that a Poisson random variable is 1-sub-Poisson. To highlight the generality of $\nu$-sub-Poisson distributions, we note that bounded, non-negative random variables are sub-Poisson (Lemma 1). Further, in Lemma 2, we establish a connection between non-negative sub-Gaussian and sub-Poisson random variables.

Lemma 1. Any non-negative random variable $X \in [0, B]$ is $B$-sub-Poisson, i.e., if the mean $\mathbb{E}[X] = \mu$, then for all $\lambda \in \mathbb{R}$, we have $\mathbb{E}[e^{\lambda X}] \le \exp\left(B^{-1}\mu\left(e^{B\lambda} - 1\right)\right)$.

Lemma 2. Let $X$ be a non-negative sub-Gaussian random variable with mean $\mu = \mathbb{E}[X]$ and sub-Gaussian norm $\sigma$. Then, $X$ is also $\frac{\sigma^2}{\mu}$-sub-Poisson.

The proofs of Lemmas 1 and 2 appear in Appendix A.
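As a quick numerical illustration of Lemma 1 (not part of the paper's analysis), the following Python sketch compares the empirical moment-generating function of a bounded, non-negative reward against the $B$-sub-Poisson bound; the Beta-distributed reward is an arbitrary illustrative choice.

```python
import numpy as np

rng = np.random.default_rng(0)

# A bounded, non-negative reward supported on [0, B]; here X = B * Beta(2, 5).
B = 1.0
samples = B * rng.beta(2.0, 5.0, size=200_000)
mu = samples.mean()

for lam in (-2.0, -0.5, 0.5, 2.0):
    mgf = np.mean(np.exp(lam * samples))                # empirical E[exp(lambda * X)]
    bound = np.exp((mu / B) * (np.exp(B * lam) - 1.0))  # Lemma 1: exp(B^{-1} mu (e^{B lambda} - 1))
    print(f"lambda = {lam:+.1f}: E[e^(lam X)] = {mgf:.4f} <= bound = {bound:.4f}")
```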
Lemma 2 has useful instantiations. In particular, the lemma implies that the half-normal random variable with parameter $\sigma$ is also $(C\sigma)$-sub-Poisson, where $C$ is a constant (independent of the distribution parameters). Similarly, for other well-studied, positive sub-Gaussian random variables (including truncated and folded normal distributions), the sub-Poisson parameter is small. Next, we discuss the necessary preliminaries for our algorithm and analysis.

2.2 Optimal Design

Write $\Delta(\mathcal{X})$ to denote the probability simplex associated with the set of arms $\mathcal{X}$. Let $\lambda \in \Delta(\mathcal{X})$ be such a probability distribution over the arms, with $\lambda_x$ denoting the probability of selecting arm $x$. The following optimization problem, defined over the set of arms $\mathcal{X}$, is well known and is referred to as the G-optimal design problem:
$$\text{Minimize } g(\lambda) := \max_{x \in \mathcal{X}} \|x\|^2_{U(\lambda)^{-1}}, \quad \text{where } \lambda \in \Delta(\mathcal{X}) \text{ and } U(\lambda) = \sum_{x \in \mathcal{X}} \lambda_x x x^T \tag{3}$$
The solution to (3) provides the optimal sequence of arm pulls (for a given budget of rounds) to minimize the confidence width of the estimated rewards for all arms $x \in \mathcal{X}$. The G-optimal design problem connects to the following optimization problem (known as the D-optimal design problem):
$$\text{Maximize } f(\lambda) := \log \mathrm{Det}(U(\lambda)), \quad \text{where } \lambda \in \Delta(\mathcal{X}) \text{ and } U(\lambda) = \sum_{x \in \mathcal{X}} \lambda_x x x^T \tag{4}$$
The lemma below provides an important result of Kiefer and Wolfowitz [15].

Lemma 3 (Kiefer-Wolfowitz). If the set $\mathcal{X}$ is compact and $\mathcal{X}$ spans $\mathbb{R}^d$, then there exists $\lambda^* \in \Delta(\mathcal{X})$ supported over at most $d(d+1)/2$ arms such that $\lambda^*$ minimizes the objective in equation (3) with $g(\lambda^*) = d$. Furthermore, $\lambda^*$ is also a maximizer of the D-optimal design objective, i.e., $\lambda^*$ maximizes the function $f(\lambda) = \log \mathrm{Det}(U(\lambda))$ subject to $\lambda \in \Delta(\mathcal{X})$.

Algorithm 1 GenerateArmSequence (subroutine to generate the arm sequence)
Input: Arm set $\mathcal{X}$ and sequence length $\widetilde{T} \in \mathbb{Z}_+$.
1: Find the probability distribution $\lambda \in \Delta(\mathcal{X})$ by maximizing the following objective
$$\log \mathrm{Det}(U(\lambda')) \quad \text{subject to } \lambda' \in \Delta(\mathcal{X}) \text{ and } |\mathrm{Supp}(\lambda')| \le d(d+1)/2 \tag{5}$$
2: Initialize multiset $S = \emptyset$ and set $A = \mathrm{Supp}(\lambda)$. Also, initialize count $c_z = 0$, for each arm $z \in A$.
3: Compute distribution $U$ as described in Section 3.1.
4: for $i = 1$ to $\widetilde{T}$ do
5: With probability $1/2$ set flag = SAMPLE-U; otherwise, set flag = D/G-OPT.
6: if flag = SAMPLE-U or $A = \emptyset$ then
7: Sample an arm $\hat{x}$ from the distribution $U$, and update multiset $S \leftarrow S \cup \{\hat{x}\}$.
8: else if flag = D/G-OPT then
9: Pick the next arm $z$ in $A$ (round robin).
10: Update multiset $S \leftarrow S \cup \{z\}$ and increment count $c_z \leftarrow c_z + 1$.
11: If $c_z \ge \lceil \lambda_z \widetilde{T}/3 \rceil$, then update $A \leftarrow A \setminus \{z\}$.
12: end if
13: end for
14: return multiset $S$

At several places in our algorithm, our goal is to find a probability distribution that minimizes the non-convex optimization problem (3). However, instead we will maximize the concave function $f(\lambda) = \log \mathrm{Det}(U(\lambda))$ over $\lambda \in \Delta(\mathcal{X})$. The Frank-Wolfe algorithm, for instance, can be used to solve the D-optimal design problem (4) and compute $\lambda$ efficiently ([17], Chapter 21); a minimal illustrative sketch is given at the end of this section. Lemma 3 ensures that this approach works, since the G-optimal and the D-optimal design problems have the same optimal solution $\lambda^* \in \Delta(\mathcal{X})$, which satisfies $|\mathrm{Supp}(\lambda^*)| \le d(d+1)/2$.⁵

2.3 John Ellipsoid

For any convex body $K \subseteq \mathbb{R}^d$, a John ellipsoid is an ellipsoid with maximal volume that can be inscribed within $K$. It is known that $K$ itself is contained within the John ellipsoid dilated by a factor of $d$. Formally,⁶

Lemma 4 ([11]). Let $K \subseteq \mathbb{R}^d$ be a convex body (i.e., a compact, convex set with a nonempty interior). Then, there exists an ellipsoid $E$ (called the John ellipsoid) that satisfies $E \subseteq K \subseteq c + d(E - c)$. Here, $c \in \mathbb{R}^d$ denotes the center of $E$ and $c + d(E - c)$ refers to the (dilated) set $\{c + d(x - c) : x \in E\}$.
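The following Python sketch illustrates the Frank-Wolfe (Fedorov-Wynn) approach to the D-optimal design problem (4) mentioned in Section 2.2. It is a minimal illustration, not the paper's implementation: the random arm set is chosen only for the example, and the step enforcing the support-size bound $d(d+1)/2$ from Lemma 3 is omitted.

```python
import numpy as np

def d_optimal_design(X, n_iters=500):
    """Frank-Wolfe (Fedorov-Wynn) iterations for the D-optimal design
    max_{lambda in simplex} log det( sum_x lambda_x x x^T ).
    X: (n, d) array whose rows are the arm vectors (assumed to span R^d)."""
    n, d = X.shape
    lam = np.full(n, 1.0 / n)               # start from the uniform design
    for _ in range(n_iters):
        U = X.T @ (lam[:, None] * X)        # U(lambda) = sum_x lambda_x x x^T
        U_inv = np.linalg.inv(U)
        # Gradient coordinates: ||x||^2_{U(lambda)^{-1}} for each arm x.
        g = np.einsum("ij,jk,ik->i", X, U_inv, X)
        j = int(np.argmax(g))               # arm with the largest leverage
        # Closed-form line-search step for the rank-1 update (standard for D-designs).
        step = (g[j] / d - 1.0) / (g[j] - 1.0) if g[j] > 1.0 else 0.0
        lam = (1.0 - step) * lam
        lam[j] += step
    return lam

# Example: a random finite arm set in dimension d = 4.
rng = np.random.default_rng(1)
arms = rng.normal(size=(50, 4))
lam = d_optimal_design(arms)
U = arms.T @ (lam[:, None] * arms)
print("max_x ||x||^2_{U^{-1}} =", np.einsum("ij,jk,ik->i", arms, np.linalg.inv(U), arms).max())
# By the Kiefer-Wolfowitz theorem (Lemma 3), this value approaches d = 4 as the design converges.
```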
3 Our Algorithm LINNASH and Main Results

In this section, we detail our algorithm LINNASH (Algorithm 2) and establish an upper bound on the Nash regret achieved by this algorithm. Subsection 3.1 details Part I of LINNASH and the related analysis. Then, Subsection 3.2 presents and analyzes Part II of the algorithm. Using the lemmas from these two subsections, the regret bound for the algorithm is established in Subsection 3.3.

Algorithm 2 LINNASH (Nash Regret Algorithm for a Finite Set of Arms)
Input: Arm set $\mathcal{X}$ and horizon of play $T$.
1: Initialize matrix $V \leftarrow [0]_{d \times d}$ and number of rounds $\widetilde{T} = 3\sqrt{T d \nu \log(T|\mathcal{X}|)}$.
Part I
2: Generate arm sequence $S$ for the first $\widetilde{T}$ rounds using Algorithm 1.
3: for $t = 1$ to $\widetilde{T}$ do
4: Pull the next arm $X_t$ from the sequence $S$, observe the corresponding reward $r_t$, and update $V \leftarrow V + X_t X_t^T$.
5: end for
6: Set estimate $\hat{\theta} := V^{-1} \sum_{t=1}^{\widetilde{T}} r_t X_t$.
7: Compute confidence bounds $\mathrm{LNCB}(x, \hat{\theta}, \widetilde{T}/3)$ and $\mathrm{UNCB}(x, \hat{\theta}, \widetilde{T}/3)$, for all $x \in \mathcal{X}$ (see equation (7)).
8: Set $\widetilde{\mathcal{X}} = \left\{x \in \mathcal{X} : \mathrm{UNCB}(x, \hat{\theta}, \widetilde{T}/3) \ge \max_{z \in \mathcal{X}} \mathrm{LNCB}(z, \hat{\theta}, \widetilde{T}/3)\right\}$ and initialize $T' = \frac{2}{3}\widetilde{T}$.
Part II
9: while the end of the time horizon $T$ is not reached do
10: Initialize $V = [0]_{d \times d}$ to be an all-zeros $d \times d$ matrix and $s = [0]_d$ to be an all-zeros vector. // Beginning of new phase.
11: Find the probability distribution $\lambda \in \Delta(\widetilde{\mathcal{X}})$ by maximizing the following objective
$$\log \mathrm{Det}(U(\lambda')) \quad \text{subject to } \lambda' \in \Delta(\widetilde{\mathcal{X}}) \text{ and } |\mathrm{Supp}(\lambda')| \le d(d+1)/2. \tag{6}$$
12: for each arm $a$ in $\mathrm{Supp}(\lambda)$ do
13: Pull arm $a$ for the next $\lceil \lambda_a T' \rceil$ rounds. Update $V \leftarrow V + \lceil \lambda_a T' \rceil\, a a^T$.
14: Observe the $\lceil \lambda_a T' \rceil$ corresponding rewards $z_1, z_2, \ldots$ and update $s \leftarrow s + (\sum_j z_j)\, a$.
15: end for
16: Set estimate $\hat{\theta} = V^{-1} s$ and compute $\mathrm{LNCB}(x, \hat{\theta}, T')$ and $\mathrm{UNCB}(x, \hat{\theta}, T')$, for all $x \in \widetilde{\mathcal{X}}$ (see equation (7)).
17: Set $\widetilde{\mathcal{X}} = \left\{x \in \widetilde{\mathcal{X}} : \mathrm{UNCB}(x, \hat{\theta}, T') \ge \max_{z \in \widetilde{\mathcal{X}}} \mathrm{LNCB}(z, \hat{\theta}, T')\right\}$. // End of phase.
18: Update $T' \leftarrow 2T'$.
19: end while

⁵ Even though the two optimization problems (3) and (4) share the optimal solution, the optimal objective function values can be different.
⁶ The ellipsoid $E$ considered in Lemma 4 is also the ellipsoid of maximal volume contained in $K$ [11].
⁷ Here, the expectation is over the randomness in the algorithm and the reward noise.

3.1 Part I: Sampling via John Ellipsoid and Kiefer-Wolfowitz Optimal Design

As mentioned previously, Nash regret is a more challenging objective than average regret: if in any round $t \in [T]$, the expected⁷ reward $\mathbb{E}[r_t]$ is zero (or very close to zero), then the geometric mean $\left(\prod_{t=1}^{T} \mathbb{E}[r_{X_t}]\right)^{1/T}$ goes to zero, even if the expected rewards in the remaining rounds are large. Hence, we need to ensure that in every round $t \in [T]$, specifically the rounds at the beginning of the algorithm, the expected rewards are bounded from below. In [5], this problem was tackled for stochastic multi-armed bandits (MAB) by directly sampling each arm uniformly at random in the initial rounds. Such a sampling ensured that, in each of those initial rounds, the expected reward is bounded from below by the average of the expected rewards. While such a uniform sampling strategy is reasonable for the MAB setting, it can be quite unsatisfactory in the current context of linear bandits. To see this, consider a linear bandit instance in which all but one of the arms in $\mathcal{X}$ are orthogonal to $\theta^*$. Here, a uniform sampling strategy will lead to an expected reward of $\langle x^*, \theta^* \rangle / |\mathcal{X}|$, which can be arbitrarily small for large cardinality $\mathcal{X}$. To resolve this issue we propose a novel approach in the initial $\widetilde{T} := 3\sqrt{T d \nu \log(T|\mathcal{X}|)}$ rounds.
In particular, we consider the convex hull of the set of arms $\mathcal{X}$, denoted as $\mathrm{cvh}(\mathcal{X})$, and find the center $c \in \mathbb{R}^d$ of the John ellipsoid $E$ for the convex hull $\mathrm{cvh}(\mathcal{X})$. Since $E \subseteq \mathrm{cvh}(\mathcal{X})$, the center $c$ of the John ellipsoid is contained within $\mathrm{cvh}(\mathcal{X})$ as well. Furthermore, via Carathéodory's theorem [10], we can conclude that the center $c$ can be expressed as a convex combination of at most $(d+1)$ points in $\mathcal{X}$. Specifically, there exists a size-$(d+1)$ subset $Y := \{y_1, \ldots, y_{d+1}\} \subseteq \mathcal{X}$ and convex coefficients $\alpha_1, \ldots, \alpha_{d+1} \in [0,1]$ such that $c = \sum_{i=1}^{d+1} \alpha_i y_i$ with $\sum_{i=1}^{d+1} \alpha_i = 1$. Therefore, the convex coefficients induce a distribution $U \in \Delta(\mathcal{X})$ of support size $d+1$ and with $\mathbb{E}_{x \sim U}[x] = c$. Lemma 5 below asserts that sampling according to the distribution $U$ leads to an expected reward that is sufficiently large. Hence, $U$ is used in the subroutine GenerateArmSequence (Algorithm 1).

In particular, the purpose of the subroutine is to carefully construct a sequence (multiset) of arms $S$, with size $|S| = \widetilde{T}$, to be pulled in the initial $\widetilde{T}$ rounds. The sequence $S$ is constructed such that (i) upon pulling arms from $S$, we have a sufficiently large expected reward in each pull, and (ii) we obtain an initial estimate of the inner product of the unknown parameter vector $\theta^*$ with all arms in $\mathcal{X}$. Here, objective (i) is achieved by considering the above-mentioned distribution $U$. Now, towards objective (ii), we compute the distribution $\lambda \in \Delta(\mathcal{X})$ by solving the optimization problem (also known as the D-optimal design problem) stated in equation (5). We initialize the sequence $S = \emptyset$ and run the subroutine GenerateArmSequence for $\widetilde{T}$ iterations. In each iteration (of the for-loop in Line 4), with probability $1/2$, we sample an arm according to the distribution $U$ (Line 7) and include it in $S$. Also, in each iteration, with the remaining probability $1/2$, we consider the computed distribution $\lambda$ and, in particular, pick arms $z$ from the support of $\lambda$ in a round-robin manner. We include such arms $z$ in $S$ while ensuring that, at the end of the subroutine, each such arm $z \in \mathrm{Supp}(\lambda)$ is included at least $\lceil \lambda_z \widetilde{T}/3 \rceil$ times. We return the curated sequence of arms $S$ at the end of the subroutine; an illustrative sketch of this interleaving is given after the proof of Lemma 5 below.

Our main algorithm LINNASH (Algorithm 2) first calls the subroutine GenerateArmSequence to generate the sequence $S$. Then, the algorithm LINNASH sequentially pulls the arms $X_t$ from $S$, for $1 \le t \le \widetilde{T}$ rounds. For these initial $\widetilde{T} = |S|$ rounds, let $r_t$ denote the noisy, observed rewards. Using these $\widetilde{T}$ observed rewards, the algorithm computes the ordinary least squares (OLS) estimate $\hat{\theta}$ (see Line 6 in Algorithm 2); in particular, $\hat{\theta} := \left(\sum_{t=1}^{\widetilde{T}} X_t X_t^T\right)^{-1}\left(\sum_{t=1}^{\widetilde{T}} r_t X_t\right)$. The algorithm uses the OLS estimate $\hat{\theta}$ to eliminate several low-rewarding arms (in Lines 7 and 8 in Algorithm 2). This concludes Part I of the algorithm LINNASH. Before detailing Part II (in Subsection 3.2), we provide a lemma to be used in the analysis of Part I of LINNASH.

Lemma 5. Let $c \in \mathbb{R}^d$ denote the center of a John ellipsoid for the convex hull $\mathrm{cvh}(\mathcal{X})$ and let $U \in \Delta(\mathcal{X})$ be a distribution that satisfies $\mathbb{E}_{x \sim U}[x] = c$. Then, it holds that $\mathbb{E}_{x \sim U}[\langle x, \theta^* \rangle] \ge \frac{\langle x^*, \theta^* \rangle}{d+1}$.

Proof. Lemma 4 ensures that there exists a positive definite matrix $H$ with the property that
$$\left\{x \in \mathbb{R}^d : \sqrt{(x-c)^T H (x-c)} \le 1\right\} \subseteq \mathrm{cvh}(\mathcal{X}) \subseteq \left\{x \in \mathbb{R}^d : \sqrt{(x-c)^T H (x-c)} \le d\right\}.$$
Now, write $y := c - \frac{x^* - c}{d}$ and note that
$$(y - c)^T H (y - c) = \frac{(x^* - c)^T H (x^* - c)}{d^2} \le 1 \quad (\text{since } x^* \in \mathrm{cvh}(\mathcal{X})).$$
Therefore, $y \in \mathrm{cvh}(\mathcal{X})$. Recall that, for all arms $x \in \mathcal{X}$, the associated reward $r_x$ is nonnegative and, hence, the reward's expected value satisfies $\langle x, \theta^* \rangle \ge 0$. This inequality and the containment $y \in \mathrm{cvh}(\mathcal{X})$ give us $\langle y, \theta^* \rangle \ge 0$. Substituting $y = c - \frac{x^* - c}{d}$ in the last inequality leads to $\langle c, \theta^* \rangle \ge \langle x^*, \theta^* \rangle / (d+1)$. Given that $\mathbb{E}_{x \sim U}[x] = c$, we obtain the desired inequality $\mathbb{E}_{x \sim U}[\langle x, \theta^* \rangle] = \langle c, \theta^* \rangle \ge \frac{\langle x^*, \theta^* \rangle}{d+1}$.
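The following Python sketch mirrors the interleaving performed by GenerateArmSequence (Algorithm 1). It is purely illustrative and assumes the two distributions are already available: u_probs stands in for the John-ellipsoid distribution $U$ of Lemma 5 and lam_probs for the D-optimal design weights $\lambda$ of equation (5); the helper names and the toy example at the end are not from the paper.

```python
import math
import numpy as np

def generate_arm_sequence(arms, u_probs, lam_probs, t_tilde, rng):
    """Sketch of Algorithm 1: build a multiset S of t_tilde arm indices by mixing
    samples from U with round-robin pulls of the D-optimal design support."""
    quota = {z: math.ceil(lam_probs[z] * t_tilde / 3) for z in arms if lam_probs[z] > 0}
    counts = {z: 0 for z in quota}
    active = list(quota)                          # support arms still owed pulls
    sequence, rr_pos = [], 0
    for _ in range(t_tilde):
        if rng.random() < 0.5 or not active:      # flag = SAMPLE-U (or support exhausted)
            sequence.append(int(rng.choice(arms, p=[u_probs[z] for z in arms])))
        else:                                     # flag = D/G-OPT: round-robin over active arms
            z = active[rr_pos % len(active)]
            sequence.append(z)
            counts[z] += 1
            if counts[z] >= quota[z]:
                active.remove(z)                  # arm z has met its lambda_z * t_tilde / 3 quota
            else:
                rr_pos += 1
    return sequence

# Toy example: 6 arms, uniform U, and a design supported on two arms.
rng = np.random.default_rng(0)
arms = list(range(6))
u_probs = {z: 1.0 / 6 for z in arms}
lam_probs = {0: 0.5, 1: 0.5, 2: 0.0, 3: 0.0, 4: 0.0, 5: 0.0}
print(generate_arm_sequence(arms, u_probs, lam_probs, t_tilde=12, rng=rng))
```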
Note that at each iteration of the subroutine GenerateArmSequence, with probability $1/2$, we insert an arm into $S$ that is sampled according to $U$. Using this observation and Lemma 5, we obtain that, for any round $t \in [\widetilde{T}]$ and for the random arm $X_t$ pulled from the sequence $S$ according to our procedure, the observed reward $r_{X_t}$ must satisfy $\mathbb{E}[r_{X_t}] \ge \frac{\langle x^*, \theta^* \rangle}{2(d+1)}$.⁸ Further, recall that in the subroutine GenerateArmSequence, we insert arms $x \in \mathrm{Supp}(\lambda)$ at least $\lceil \lambda_x \widetilde{T}/3 \rceil$ times, where $\lambda$ corresponds to the solution of the D-optimal design problem defined in equation (5). Therefore, we can characterize the confidence widths of the estimated rewards for each arm in $\mathcal{X}$, computed using the least squares estimate $\hat{\theta}$ (Line 6 in Algorithm 2). Broadly speaking, we can show that all arms with low expected reward (less than a threshold) also have an estimated reward at most twice the true reward. On the other hand, high-rewarding arms must have an estimated reward within a factor of 2 of the true reward. Thus, based on certain high-probability confidence bounds (equation (7)), we can eliminate arms in $\mathcal{X}$ with true expected reward less than some threshold, with high probability.

⁸ Here, the expectation is over both the randomness in including arm $X_t$ in $S$ and the noise in the reward.

3.2 Part II: Phased Elimination via Estimate-Dependent Confidence Widths

Note that while analyzing average regret via confidence-bound algorithms, it is quite common to use, for each arm $x$, a confidence width (interval) that does not depend on $x$'s estimated reward. This is a reasonable design choice for bounding average regret, since the regret incurred at each round is the sum of confidence intervals that grow smaller with the round index and, hence, this choice leads to a small average regret. However, for the analysis of the Nash regret, a confidence width that is independent of the estimated reward can be highly unsatisfactory: the confidence width might be larger than the optimal $\langle x^*, \theta^* \rangle$. This can, in turn, allow an arm with extremely low reward to be pulled, leading to the geometric mean going to zero. In order to alleviate this issue, it is vital that our confidence intervals are reward dependent. This, in turn, requires one to instantiate concentration bounds similar to the multiplicative version of the standard Chernoff bound. In general, multiplicative forms of concentration bounds are much stronger than the additive analogues [16]. In prior work [5] on Nash regret for the stochastic multi-armed bandits setting, such concentration bounds were readily available through the multiplicative version of the Chernoff bound. However, in our context of linear bandits, the derivation of analogous concentration bounds (and the associated confidence widths) is quite novel and requires a careful use of the sub-Poisson property.

In particular, we use the following confidence bounds (with estimate-dependent confidence widths) in our algorithm. We define the lower and upper confidence bounds considering any arm $x$, any least squares estimator $\varphi$ (of $\theta^*$), and $t$ the number of observations used to compute the estimator $\varphi$.
That is, for any triple $(x, \varphi, t) \in \mathcal{X} \times \mathbb{R}^d \times [T]$, we define the Lower Nash Confidence Bound (LNCB) and the Upper Nash Confidence Bound (UNCB) as follows:
$$\mathrm{LNCB}(x, \varphi, t) := \langle x, \varphi \rangle - 6\sqrt{\frac{\langle x, \varphi \rangle\, \nu d \log(T|\mathcal{X}|)}{t}}, \qquad \mathrm{UNCB}(x, \varphi, t) := \langle x, \varphi \rangle + 6\sqrt{\frac{\langle x, \varphi \rangle\, \nu d \log(T|\mathcal{X}|)}{t}} \tag{7}$$
As mentioned previously, the confidence widths in equation (7) are estimate dependent.

Next, we provide a high-level overview of Part II in Algorithm 2. This part is inspired by the phased elimination algorithm for the average regret ([17], Chapter 21); a key distinction here is the use of the Nash confidence bounds defined in (7). Part II in Algorithm 2 begins with the set of arms $\widetilde{\mathcal{X}} \subseteq \mathcal{X}$ obtained after an initial elimination of low-rewarding arms in Part I. Subsequently, Part II runs in phases of exponentially increasing length and eliminates sub-optimal arms in every phase. Suppose at the beginning of the $\ell$th phase, $\widetilde{\mathcal{X}}$ is the updated set of arms. We solve the D-optimal design problem (see (6)) corresponding to $\widetilde{\mathcal{X}}$ to obtain a distribution $\lambda \in \Delta(\widetilde{\mathcal{X}})$. For the next $O(d^2 + 2^\ell \widetilde{T})$ rounds, we pull arms $a$ in the support of $\lambda$ (Line 13): each arm $a \in \mathrm{Supp}(\lambda)$ is pulled $\lceil \lambda_a T' \rceil$ times, where $T' = O(2^\ell \widetilde{T})$. Using the data covariance matrix and the observed noisy rewards, we recompute: (1) an improved estimate $\hat{\theta}$ (of $\theta^*$) and (2) improved confidence bounds for every surviving arm. Then, we eliminate arms based on the confidence bounds and update the set of surviving arms (Lines 16 and 17).

The following lemma provides the key concentration bound for the least squares estimate.

Lemma 6. Let $x_1, x_2, \ldots, x_s \in \mathbb{R}^d$ be a fixed set of vectors and let $r_1, r_2, \ldots, r_s$ be independent $\nu$-sub-Poisson random variables satisfying $\mathbb{E}[r_j] = \langle x_j, \theta^* \rangle$, for some unknown $\theta^*$. Further, let matrix $V = \sum_{j=1}^{s} x_j x_j^T$ and let $\hat{\theta} = V^{-1} \sum_j r_j x_j$ be the least squares estimator of $\theta^*$. Consider any $z \in \mathbb{R}^d$ with the property that $z^T V^{-1} x_j \le \gamma$ for all $j \in [s]$. Then, for any $\delta \in [0, 1]$ we have
$$\mathbb{P}\left\{\langle z, \hat{\theta} \rangle \ge (1+\delta)\langle z, \theta^* \rangle\right\} \le \exp\left(-\frac{\delta^2 \langle z, \theta^* \rangle}{3\nu\gamma}\right) \tag{8}$$
$$\mathbb{P}\left\{\langle z, \hat{\theta} \rangle \le (1-\delta)\langle z, \theta^* \rangle\right\} \le \exp\left(-\frac{\delta^2 \langle z, \theta^* \rangle}{2\nu\gamma}\right) \tag{9}$$
Lemma 6 is established in Appendix B. Using this lemma, we can show that the optimal arm $x^*$ is never eliminated, with high probability.

Lemma 7. Consider any bandit instance in which, for the optimal arm $x^* \in \mathcal{X}$, we have $\langle x^*, \theta^* \rangle \ge 192\sqrt{\frac{d\nu}{T}\log(T|\mathcal{X}|)}$. Then, with probability at least $1 - \frac{4\log T}{T}$, the optimal arm $x^*$ always exists in the surviving set $\widetilde{\mathcal{X}}$ in Part I and in every phase in Part II of Algorithm 2.

Finally, using Lemmas 6 and 7, we show that, with high probability, in every phase of Part II all the surviving arms $x \in \widetilde{\mathcal{X}}$ have sufficiently high reward means.

Lemma 8. Consider any phase $\ell$ in Part II of Algorithm 2 and let $\widetilde{\mathcal{X}}$ be the surviving set of arms at the beginning of that phase. Then, with $\widetilde{T} = \sqrt{d\nu T \log(T|\mathcal{X}|)}$, we have
$$\langle x, \theta^* \rangle \ge \langle x^*, \theta^* \rangle - 25\sqrt{\frac{3 d\nu\, \langle x^*, \theta^* \rangle \log(T|\mathcal{X}|)}{2^\ell\, \widetilde{T}}} \quad \text{for all } x \in \widetilde{\mathcal{X}}. \tag{10}$$
Here, $\nu$ is the sub-Poisson parameter of the stochastic rewards. The proofs of Lemmas 7 and 8 are deferred to Appendix C.

3.3 Main Result

This section states and provides a proof sketch of the main theorem.⁹

Theorem 1. For any given stochastic linear bandits problem with a (finite) set of arms $\mathcal{X} \subset \mathbb{R}^d$, time horizon $T \in \mathbb{Z}_+$, and $\nu$-sub-Poisson rewards, Algorithm 2 achieves Nash regret
$$\mathrm{NR}_T = O\left(\beta\sqrt{\frac{d\nu}{T}\log(T|\mathcal{X}|)}\right).$$
Here, $\beta = \max\{1, \langle x^*, \theta^* \rangle \log d\}$, with $x^* \in \mathcal{X}$ denoting the optimal arm and $\theta^*$ the (unknown) parameter vector.

Note that, in Theorem 1, the lower the value of the optimal expected reward $\langle x^*, \theta^* \rangle$, the stronger the Nash regret guarantee. In particular, with a standard normalization assumption that $\langle x^*, \theta^* \rangle \le 1$ and for 1-sub-Poisson rewards, we obtain a Nash regret of $O\left(\sqrt{\frac{d}{T}\log(T|\mathcal{X}|)}\right)$. Also, observe that the regret guarantee provided in Theorem 1 depends logarithmically on the size of $\mathcal{X}$. Hence, the Nash regret is small even when $|\mathcal{X}|$ is polynomially large in $d$.
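For concreteness, here is an illustrative Python sketch of a single phase of Part II, combining the pulls prescribed by a design $\lambda$, the least squares estimate, and the elimination rule based on the Nash confidence bounds of equation (7). It is a simplified stand-in for Lines 10-17 of Algorithm 2, not the paper's implementation: Poisson rewards (which are 1-sub-Poisson) are simulated for the example, and the design $\lambda$ is assumed to be supplied, e.g., by a D-optimal design solver such as the Frank-Wolfe sketch in Section 2.2.

```python
import math
import numpy as np

def run_phase(surviving, X, lam, t_prime, theta_star, nu, T, n_arms, rng):
    """One phase of Part II (sketch): pull each design arm ceil(lam_a * t_prime) times,
    form the least squares estimate, and keep arms whose UNCB clears the best LNCB.
    Assumes non-negative arm means and that the pulled arms span R^d."""
    d = X.shape[1]
    V = np.zeros((d, d))
    s = np.zeros(d)
    for a in surviving:
        n_pulls = math.ceil(lam.get(a, 0.0) * t_prime)
        if n_pulls == 0:
            continue
        mean_reward = float(X[a] @ theta_star)
        rewards = rng.poisson(mean_reward, size=n_pulls)   # Poisson rewards are 1-sub-Poisson
        V += n_pulls * np.outer(X[a], X[a])
        s += rewards.sum() * X[a]
    theta_hat = np.linalg.solve(V, s)                      # least squares estimate of theta*

    def width(est):                                        # confidence width from equation (7)
        return 6.0 * math.sqrt(max(est, 0.0) * nu * d * math.log(T * n_arms) / t_prime)

    est = {a: float(X[a] @ theta_hat) for a in surviving}
    lncb = {a: est[a] - width(est[a]) for a in surviving}
    uncb = {a: est[a] + width(est[a]) for a in surviving}
    best_lncb = max(lncb.values())
    return [a for a in surviving if uncb[a] >= best_lncb], theta_hat
```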
Proof Sketch of Theorem 1: We condition on the event $E$ defined as the intersection of the (high-probability) events defined in Lemmas 7 and 8, respectively. A union bound implies that the event $E^c$ holds with probability at most $O(T^{-1})$. For Part I, in the first $\widetilde{T} = 3\sqrt{T d\nu \log(T|\mathcal{X}|)}$ rounds, we bound from below the product of the expected rewards, using Lemma 5, as follows:
$$\prod_{t=1}^{\widetilde{T}} \mathbb{E}[\langle X_t, \theta^* \rangle \mid E]^{1/T} \ge \langle x^*, \theta^* \rangle^{\widetilde{T}/T}\left(1 - \frac{\widetilde{T}\log(2(d+1))}{T}\right).$$
For the remaining rounds, we invoke Lemma 8 (specifically, equation (10)) to obtain
$$\prod_{t=\widetilde{T}+1}^{T} \mathbb{E}[\langle X_t, \theta^* \rangle \mid E]^{1/T} \ge \langle x^*, \theta^* \rangle^{\frac{T - \widetilde{T}}{T}} \prod_{\ell} \left(1 - \frac{50\left(2^{\ell}\widetilde{T}/3 + 2d^{2}\right)}{T}\sqrt{\frac{3 d\nu \log(T|\mathcal{X}|)}{\langle x^*, \theta^* \rangle\, 2^{\ell}\, \widetilde{T}}}\right).$$
The above equations reduce to the following bound on the geometric mean of the expected rewards (across the $T$ rounds):
$$\left(\prod_{t=1}^{T} \mathbb{E}[\langle X_t, \theta^* \rangle]\right)^{1/T} \ge \left(\prod_{t=1}^{T} \mathbb{E}[\langle X_t, \theta^* \rangle \mid E]\right)^{1/T} \Pr\{E\} \ge \langle x^*, \theta^* \rangle - O\left(\sqrt{\frac{d\nu}{T}\,\langle x^*, \theta^* \rangle \log(T|\mathcal{X}|)}\right).$$
Therefore, the Nash regret of LINNASH satisfies
$$\mathrm{NR}_T = \langle x^*, \theta^* \rangle - \left(\prod_{t=1}^{T} \mathbb{E}[\langle X_t, \theta^* \rangle]\right)^{1/T} = O\left(\beta\sqrt{\frac{d\nu}{T}\log(T|\mathcal{X}|)}\right).$$
This completes the proof sketch.

⁹ Due to space restrictions, the complete analysis is deferred to Appendix C.

Computational Efficiency of LINNASH. We note that Algorithm 2 (LINNASH) executes in polynomial time. In particular, the algorithm calls the subroutine GenerateArmSequence in Part I for computing the John ellipsoid. Given a set of arm vectors as input, this ellipsoid computation can be performed efficiently (see Chapter 3 in [25]). In fact, for our purposes an approximate version of the John ellipsoid suffices, and such an approximation can be found much faster [8]; specifically, in time $O(|\mathcal{X}|^2 d)$. Furthermore, the algorithm solves the D-optimal design problem once in Part I and at most $O(\log T)$ times in Part II. The D-optimal design is a concave maximization problem, which can be efficiently solved using, say, the Frank-Wolfe algorithm with rank-1 updates. Each iteration takes $O(|\mathcal{X}|^2)$ time, and the total number of iterations is at most $O(d)$ (see, e.g., Chapter 21 of [17] and Chapter 3 in [25]). Overall, we get that LINNASH is a polynomial-time algorithm.

4 Extension of Algorithm LINNASH for Infinite Arms

The regret guarantee in Theorem 1 depends logarithmically on $|\mathcal{X}|$. Such a dependence makes the guarantee vacuous when the set of arms $\mathcal{X}$ is infinitely large (or even when $|\mathcal{X}| = \Omega(2^{T d^{-1}})$). To resolve this limitation, we extend LINNASH with a modified confidence width that depends only on the largest estimated reward $\hat{\gamma} := \max_{x \in \mathcal{X}} \langle x, \hat{\theta} \rangle$. Specifically, we consider the confidence width $\sqrt{\frac{\hat{\gamma}\, d^{5/2}\, \nu \log(T)}{T'}}$ for all the arms (with $T'$ denoting the current phase length), and select the set of surviving arms in each phase (of Part II of the algorithm for infinite arms) as follows:
$$\left\{x \in \mathcal{X} : \langle x, \hat{\theta} \rangle \ge \hat{\gamma} - 16\sqrt{\frac{\hat{\gamma}\, d^{5/2}\, \nu \log(T)}{T'}}\right\}.$$
See Algorithm 3 in Appendix D for details. The theorem below is the main result of this section.

Theorem 2. For any given stochastic linear bandits problem with set of arms $\mathcal{X} \subset \mathbb{R}^d$, time horizon $T \in \mathbb{Z}_+$, and $\nu$-sub-Poisson rewards, Algorithm 3 achieves Nash regret
$$\mathrm{NR}_T = O\left(\beta\, \frac{d^{5/4}\sqrt{\nu}}{\sqrt{T}}\log T\right).$$
Here, $\beta = \max\{1, \langle x^*, \theta^* \rangle \log d\}$, with $x^* \in \mathcal{X}$ denoting the optimal arm and $\theta^*$ the (unknown) parameter vector. The proof of Theorem 2 and a detailed regret analysis of Algorithm 3 can be found in Appendix D.

5 Conclusion and Future Work

Fairness and welfare considerations have emerged as central design objectives in online decision-making contexts. Motivated, broadly, by such considerations, the current work addresses the notion of Nash regret in the linear bandits framework.
We develop essentially tight Nash regret bounds for linear bandit instances with a finite number of arms. In addition, we extend this guarantee to settings wherein the number of arms is infinite. Here, our regret bound scales as $d^{5/4}$, where $d$ is the ambient dimension. Note that, for linear bandits with infinite arms, [1] obtains a bound of $d/\sqrt{T}$ for average regret. We conjecture that a similar dependence should be possible for Nash regret as well and pose this strengthening as a relevant direction of future work. Another important direction would be to study Nash regret for other bandit frameworks (such as contextual bandits and combinatorial bandits) and for Markov Decision Processes (MDPs).

Acknowledgments and Disclosure of Funding

Siddharth Barman's research is supported by a SERB Core research grant (CRG/2021/006165).

References

[1] Yasin Abbasi-Yadkori, Dávid Pál, and Csaba Szepesvári. Improved algorithms for linear stochastic bandits. Advances in Neural Information Processing Systems, 24, 2011.
[2] Shipra Agrawal and Navin Goyal. Thompson sampling for contextual bandits with linear payoffs. In International Conference on Machine Learning, pages 127-135. PMLR, 2013.
[3] Shubhada Agrawal, Sandeep K Juneja, and Wouter M Koolen. Regret minimization in heavy-tailed bandits. In Conference on Learning Theory, pages 26-62. PMLR, 2021.
[4] Peter Auer, Nicolo Cesa-Bianchi, Yoav Freund, and Robert E Schapire. The nonstochastic multiarmed bandit problem. SIAM Journal on Computing, 32(1):48-77, 2002.
[5] Siddharth Barman, Arindam Khan, Arnab Maiti, and Ayush Sawarni. Fairness and welfare quantification for regret in multi-armed bandits. arXiv preprint arXiv:2205.13930, 2022.
[6] Ilai Bistritz, Tavor Baharav, Amir Leshem, and Nicholas Bambos. My fair bandit: Distributed learning of max-min fairness with multi-player bandits. In International Conference on Machine Learning, pages 930-940. PMLR, 2020.
[7] Ioannis Caragiannis, David Kurokawa, Hervé Moulin, Ariel D Procaccia, Nisarg Shah, and Junxing Wang. The unreasonable fairness of maximum Nash welfare. ACM Transactions on Economics and Computation (TEAC), 7(3):1-32, 2019.
[8] Michael B Cohen, Ben Cousins, Yin Tat Lee, and Xin Yang. A near-optimal algorithm for approximating the John ellipsoid. In Conference on Learning Theory, pages 849-873. PMLR, 2019.
[9] Varsha Dani, Thomas P Hayes, and Sham M Kakade. Stochastic linear optimization under bandit feedback. 2008.
[10] Jürgen Eckhoff. Helly, Radon, and Carathéodory type theorems. In Handbook of Convex Geometry, pages 389-448. Elsevier, 1993.
[11] Martin Grötschel, László Lovász, and Alexander Schrijver. Geometric Algorithms and Combinatorial Optimization, volume 2. Springer Science & Business Media, 2012.
[12] Safwan Hossain, Evi Micha, and Nisarg Shah. Fair algorithms for multi-agent multi-armed bandits. Advances in Neural Information Processing Systems, 34:24005-24017, 2021.
[13] Ralph Howard. The John ellipsoid theorem. University of South Carolina, 1997.
[14] Matthew Joseph, Michael Kearns, Jamie H Morgenstern, and Aaron Roth. Fairness in learning: Classic and contextual bandits. Advances in Neural Information Processing Systems, 29, 2016.
[15] Jack Kiefer and Jacob Wolfowitz. The equivalence of two extremum problems. Canadian Journal of Mathematics, 12:363-366, 1960.
[16] William Kuszmaul and Qi Qi. The multiplicative version of Azuma's inequality, with an application to contention analysis. arXiv preprint arXiv:2102.05077, 2021.
[17] Tor Lattimore and Csaba Szepesvári. Bandit Algorithms.
Cambridge University Press, 2020. [18] Andres Munoz Medina and Scott Yang. No-regret algorithms for heavy-tailed linear bandits. In International Conference on Machine Learning, pages 1642 1650. PMLR, 2016. [19] Hervé Moulin. Fair division and collective welfare. MIT press, 2004. [20] Vishakha Patil, Ganesh Ghalme, Vineet Nair, and Yadati Narahari. Achieving fairness in the stochastic multi-armed bandit problem. The Journal of Machine Learning Research, 22(1):7885 7915, 2021. [21] Paat Rusmevichientong and John N Tsitsiklis. Linearly parameterized bandits. Mathematics of Operations Research, 35(2):395 411, 2010. [22] Eric M Schwartz, Eric T Bradlow, and Peter S Fader. Customer acquisition via display advertising using multi-armed bandit experiments. Marketing Science, 36(4):500 522, 2017. [23] Ambuj Tewari and Susan A Murphy. From ads to interventions: Contextual bandits in mobile health. Mobile Health: Sensors, Analytic Methods, and Applications, pages 495 517, 2017. [24] William R Thompson. On the likelihood that one unknown probability exceeds another in view of the evidence of two samples. Biometrika, 25(3-4):285 294, 1933. [25] Michael J Todd. Minimum-volume ellipsoids: Theory and algorithms. SIAM, 2016. [26] Yogatheesan Varatharajah and Brent Berry. A contextual-bandit-based approach for informed decision-making in clinical trials. Life, 12(8):1277, 2022. [27] Sofia Villar, Jack Bowden, and James Wason. Multi-armed bandit models for the optimal design of clinical trials: Benefits and challenges. Statistical Science, 30:199 215, 05 2015. Limitations: The main contributions of our works are theoretical. From a theoretical point of view, the limitations of our paper are discussed in Section 5. In particular, we believe that tightening the gap between the upper and lower bounds in Nash regret for an infinite set of arms will require novel and non-trivial algorithmic ideas - we leave this as an important direction of future work. Broader Impact: Due to the theoretical nature of this work, we do not foresee any adverse societal impact of this work. A Proof of Lemmas 1 and 2 Lemma 1. Any non-negative random variable X [0, B] is B-sub-Poisson, i.e., if mean E[X] = µ, then for all λ R, we have E[eλX] exp B 1µ e Bλ 1 . Proof. For random variable X we have E [exp (λX)] = E exp λB X B e(λB) + 1 X e0 (due to convexity of ex) = 1 + E [X] Lemma 2. Let X be a non-negative sub-Gaussian random variable X with mean µ = E[X] and sub-Gaussian norm σ. Then, X is also σ2 µ -sub-Poisson. Proof. Since X is a σ-sub-Gaussian random variable, for any non-negative scalar s 0, we have E[es X] exp sµ + (sσ)2 The fact that X is a positive random variable implies that the mean µ > 0. Also, the considered scalar s 0 and, hence, the term sσ2 µ > 0. Also, recall that ex 1 + x + x2 2 , for any non-negative x. Using these observations and equation (12), we obtain E[es X] exp µ2 For random variable X, inequality (13) ensures that the required mgf bound (equation (2)) holds for all non-negative s and with sub-Poisson parameter equal to σ2 We next complete the proof by showing that the mgf bound holds for negative s as well. Towards this, write B := σ2 µ and define random variable Y := 1{X B} X + 1{X>B} B. Note that Y is a positive, bounded random variable. Furthermore, for any negative s, we have exp (s Y ) exp (s X). Therefore, for a negative s, it holds that E [exp(s X)] E [exp (s Y )]. 
Since positive random variable Y [0, B], the mgf bound obtained in Lemma 1 gives us E[es X] E es Y exp µ Since B := σ2 µ , the mgf bound (equation (2)) on X holds for negative s as well. This, overall, shows that X is a σ2 µ -sub-Poission random variable. The lemma stands proved. B Proof of Concentration Bounds Lemma 6. Let x1, x2, . . . , xs Rd be a fixed set of vectors and let r1, r2, . . . , rs be independent ν-sub-Poisson random variables satisfying Ers = xs, θ for some unknown θ . Further, let matrix V = Ps j=1 xjx T j and bθ = V 1 P j rjxj be the least squares estimator of θ . Consider any z Rd with the property that z T V 1xj γ for all j [s]. Then, for any δ [0, 1] we have P n z, bθ (1 + δ) z, θ o exp δ2 z, θ P n z, bθ (1 δ) z, θ o exp δ2 z, θ Proof. We use the Chernoff method to get an upper bound on the desired probabilities, as shown below P n z, bθ (1 + δ) z, θ o = P exp(c z, bθ ) exp(c(1 + δ) z, θ ) (for some constant c) E[exp c z T V 1 (P t rtxt) ] exp(c (1 + δ) z, θ ) = Qs t=1 E[exp c rt V 1xt ] exp(c (1 + δ) z, θ ) (rt s are independent) Qs t=1 exp E[rt] ν ecνz T V 1xt 1 exp(c (1 + δ) z, θ ) (rt is sub Poisson) c z, θ (1 + δ) + ec νz T V 1xt 1 ! Substituting c = log(1+δ) νγ , we get P n z, bθ (1 + δ) z, θ o exp νγ (1 + δ) log (1 + δ) + (1 + δ) 1 γ z T V 1xt 1 ! γ z T V 1xt 1 we have (1 + δ) 1 γ z T V 1xt 1 + δ 1 γ z T V 1xt. Substituting in (14) we get P n z, bθ (1 + δ) z, θ o νγ z, θ (1 + δ) log (1 + δ) + t=1 xt, θ δ νγ z T V 1xt νγ z, θ (1 + δ) log (1 + δ) + δ t=1 θ T xtx T t V 1z (rearranging terms) νγ z, θ (1 + δ) log (1 + δ) + δ νγ z, θ . (Ps t=1 xtx T t = V) Using the logarithmic inequality log(1 + δ) 2δ 2+δ, we further simplify as P n z, bθ (1 + δ) z, θ o exp z, θ νγ ((1 + δ) log (1 + δ) δ) exp δ2 z, θ exp δ2 z, θ . (since δ [0, 1]) Following similar steps and substituting c = log(1 δ) νγ , we obtain a bound on the lower tail (inequality 9): P n z, bθ (1 δ) z, θ o exp 1 νγ z, θ (1 δ) log (1 δ) δ Now, using the logarithmic inequality (1 δ) log(1 δ) δ + δ2 P n z, bθ (1 δ) z, θ o exp δ2 z, θ Combining (9) and (8) we get the following Corollary. Corollary 9. Using the notations as in Lemma 6, we have P n | z, bθ z, θ | δ z, θ o 2 exp δ2 z, θ The next two lemmas are variants of Lemma 6 where we bound the error in terms of an upper bound on z, θ . Lemma 10. Let x1, x2, . . . , xs Rd be a fixed set of vectors and let r1, r2, . . . , rs be independent ν sub Poisson random variables satisfying Ers = xs, θ for some unknown θ . In that case, let matrix V = Ps j=1 xjx T j and bθ = V 1 P j rjxj be the least squares estimator of θ . Consider any z Rd that satisfies z T V 1xj γ for all j [s] and z, θ α. Then for any δ [0, 1] we have P n z, bθ (1 + δ)α o e δ2α Proof. Following the same approach as in the proof of Lemma 6, we have P n z, bθ (1 + δ)α o E[exp(c z T V 1 (P t rtxt))] exp(c (1 + δ)α) cα(1 + δ) + ecνz T V 1xt 1 ! (rt are sub-poisson and independent) Now, substituting c = 1 νγ log (1 + δ)) and using (1 + δ) 1 γ z T V 1xt 1 + δ 1 γ z T V 1xt we have P n z, bθ (1 + δ)α o exp γν α(1 + δ) log (1 + δ) + ν θ (1 + δ) 1 γ z T V 1xt 1 ! νγ α(1 + δ) log (1 + δ) + δ t=1 θ T xtx T t V 1Z νγ α(1 + δ) log (1 + δ) + δ νγ α(1 + δ) log (1 + δ) + δ νγ α (α z, θ ) exp δ2α (2 + δ) νγ (using log(1 + δ) 2δ 2+δ) Since δ [0, 1], we have the desired result. Lemma 11. Using the same notations as in Lemma 10, for any δ [0, 1], the following holds P n z, bθ z, θ δα o exp δ2α Proof. 
Using steps similar to the previous lemmas, we obtain P n z, bθ z, θ δα o E[exp(c z T V 1 (P t rtxt))] exp(c ( z, θ δα)) cαδ + c z, θ + ecνz T V 1xt 1 ! (rt are sub-poisson and independent) Substituting c = log(1 δ) νγ and simplifing we get P n z, bθ z, θ δα o exp z, θ νγ (log (1 δ) + δ) + α νγ δ log (1 δ) Note that since log(1 δ) + δ is negative, we can upper bound the above expression by replacing z, θ with α. P n z, bθ z, θ δα o exp α νγ (log (1 δ) + δ δ log (1 δ)) . (since (1 δ) log(1 δ) δ + δ2 Hence, the lemma stands proved. C Regret Analysis of Algorithm 2 We will first define events E1 and E2 for each phase of the algorithm and show that they hold with high probability. We will use the events in the regret analysis. Event E1: At the end of Part I, let bθ be the unbiased estimator of θ and e T be as defined in Algorithm 2. All arms x X with x, θ < 10 q dν log (T|X|) dν log (T|X|) In addition, all arms x X with x, θ 10 q dν log (T|X|) | x, θ x, bθ | 3 dν x, θ log (T|X|) e T and (19) 1 2 x, θ x, bθ 4 3 x, θ . (20) Event E2: Let e X denote the surviving set of arms at the start of a phase in Part II, and T be as defined in Algorithm 2. For all phases and for all x e X such that x, θ dν log (T|X|) T , the estimator bθ (calculated at the end of a phase) satisfies | x, θ x, bθ | 3 dν x, θ log (T|X|) 1 2 x, θ x, bθ 4 3 x, θ . (22) C.1 Supporting Lemmas Lemma 12 (Chernoff Bound). Let Z1, . . . , Zn be independent Bernoulli random variables. Consider the sum S = Pn r=1 Zr and let µ = E[S] be its expected value. Then, for any ε [0, 1], we have P {S (1 ε)µ} exp µε2 Lemma 13. During Part I, arms from D-optimal design are added to S at least e T/3 times with probability greater than 1 1 Proof. We use Lemma 12 with Zi as indicator random variables that take value one when an arm from A (the support of λ in the optimal design) is chosen. By setting ε = 1 3 and µ = e T 2 , we obtain the required probability bound. Lemma 14. Using the notations in Algorithm 1, if the event in Lemma 13 holds, then for each x X and each round t in Part I of the algorithm, we have x T V 1Xt 3d where Xt is the arm pulled in round t. Proof. Let U(λ) and λ denote the optimal design matrix (as defined in (4)) and the solution to the D-optimal design problem in Algorithm 1, respectively. That is, λ is the solution of the optimization problem stated in equation (5) and U(λ) = P x X λxxx T . Lemma 3 implies that ||x||U(λ) 1 d for all x X. Next, note that the construction of the sequence S in Part I (Subroutine Generate Arm Sequence) and the event specified in Lemma 13 give us V e T 3 U(λ). Hence, x T V 1Xt x V 1 V 1Xt V (via Hölder s inequality) = x V 1 Xt V 1 3 U(λ) 1 Xt e T 3 U(λ) 1 (since V e T 3 U(λ)) e T x U(λ) 1 e T Xt U(λ) 1 e T (by Lemma 3) The next lemma lower bounds the probability of event E1 (see equations (18), (19), and (20)). Lemma 15. Event E1 holds with probability at least 1 6 Proof. First, consider all arms x X for which x, θ < 10 q dν log (T|X|) T . Here, we invoke Lemma 10, with γ = 3d e T (as derived in Lemma 14), α = 10 q dν log (T|X|) T , and δ = 1, to obtain dν log (T|X|) dν log (T|X|) Tdν log(T|X|) 1 T|X| (23) Next, we consider arms x X such that x, θ 10 q dν log (T|X|) T and for such arms establish equations (19) and (20). Towards this, we invoke Lemma 6, with parameters γ = 3d e T and δ = dν log (T|X|) x,θ e T . It is relevant to note that here δ [0, 1] this containment follows from the condition x, θ 10 q dν log (T|X|) T and e T = 3 p Tdν log(T|X|). 
Therefore, | x, θ x, bθ | 3 dν x, θ log (T|X|) = P n | x, θ x, bθ | δ x, θ o (since δ = 3 q dν log (T|X|) 9dν log (T|X|) x,θ e T x, θ = 2 T|X| (24) For establishing equation (20), we invoke Lemma 6 again, now with γ = 3d e T and δ = 1 Tνd log(T|X|) x, θ Tνd log(T|X|) 10 q dν log (T|X|) 1 T|X| (25) Similarly, with δ = 1 2, Lemma 6 gives us 2 x, θ 1 T|X| (26) Finally, we combine (23), (24), (25) and (26), and apply a union bound over all arms in X. Then, conditioning on the event in Lemma 13 leads to the stated probability bound. The lemma stands proved. The next lemma shows that event E2 (see equations (21) and (22)) holds with high probability Lemma 16. Event E2 holds with probability at least 1 3 log T Proof. Consider any phase in Part II and let U(λ) be the optimal design matrix obtained after solving the D-optimal design problem at the start of the phase. By Lemma 3, for all x, z e X we have z T V 1x z V 1 V 1x V (via Hölder s inequality) z V 1 x V 1 First, we address equation (21). In particular, we instantiate Lemma 6 with δ = 3 q dν log (T|X|) γ = d T . Note that given the lower bound on x, θ and the inequality T 2 p Tdν log(T|X|) ensure that δ lies in [0, 1]. Hence, substituting these values of δ and γ in Lemma 6, we obtain | x, θ x, bθ | 3 dν x, θ log (T|X|) 9dν log (T|X|) Next, following a similar approach as in the proof of Lemma 15, we use Lemma 6 with δ = 1 3 and δ = 1 2 to establish the upper and lower bounds of equation (22), respectively. Applying a union bound across arms in e X and over all at most log T phases, we obtain the desired probability bound of 1 3 log T Corollary 17. P {E1 E2} 1 4 log T Proof. From Lemma 15 we have P {E1} 1 6 T. Furthermore, from Lemma 16 we have P {E2} 1 3 log T T . Applying a union bound on the complements of these two events establishes the corollary. Lemma 18. Consider any bandit instance with x , θ 192 q dν log (T|X|) T . If event E1 holds, then any arm with mean x, θ 10 q dν log(T|X|) T is eliminated after Part I of Algorithm 2. Proof. We will show that in the given bandit instance and under the event E1, for each arm x X with mean x, θ 10 q dν log(T|X|) T the upper Nash confidence bound (see equation (7)) is less than the lower confidence bound of the optimal arm x . Hence, all such arms x are eliminated from consideration in Line 8 of Algorithm 2. This will establish the lemma. The upper Nash confidence bound of arm x at the end of Part I is defined as UNCB x, bθ, e T/3 = x, bθ + 6 3 x, bθ d ν log (T|X|) d ν log(T|X|) 3 x, bθ d ν log (T|X|) e T (via event E1) dν log(T|X|) v u u t3 20 q dν log(T|X|) T dν log (T|X|) Tνd log(T|X|) (substituting e T) dν log(T|X|) In the given bandit instance and under event E1, for the optimal arm x , we have x , bθ x , θ + 3 dν x , θ log (T|X|) dν log (T|X|) Td ν log(T|X|) (substituting e T) v u u t dν log (T|X|) dν log(T|X|) Tνd log(T|X|) (using x , θ 192 q dν log (T|X|) 16 x , θ . (28) Therefore, the lower Nash confidence bound of x satisfies LNCB x , bθ, e T/3 = x , bθ 6 3 x , bθ d ν log (T|X|) d ν x , θ log (T|X|) 3 x , bθ d ν log (T|X|) e T (via (19) in event E1) d ν x , θ log (T|X|) e T (since x , bθ 17 16 x , θ via (28)) dν log (T|X|) v u u t dν log (T|X|) dν log(T|X|) Tdν log(T|X|) dν log(T|X|) Equations (29) and (27) imply UNCB x, bθ, e T/3 < LNCB x , bθ, e T/3 (30) As mentioned previously, Line 8 in Algorithm 2 eliminates all arms x that satisfy inequality (30). Hence, the lemma stands proved C.2 Proofs of Lemmas 7 and 8 Lemma 7. 
Consider any bandit instance in which for the optimal arm x X we have x , θ T log(T|X|). Then, with probability at least 1 4 log T T , the optimal arm x always exists in the surviving set e X in Part I and in every phase in Part II of Algorithm 2. Proof. We will show that, under events E1 and E2, throughout the execution of Algorithm 2 the UNCB of the optimal arm x is never less than the LNCB of any arm x. Hence, then the optimal arm x never satisfies the elimination criterion in Algorithm 2 and, hence, x always exists in the surviving set of arms. First, we consider arms x with the property that x, θ < 10 q dν log (T|X|) T . For any such arm x, at the end of Part I of the algorithm we have LNCB x, bθ, e T/3 UNCB x, bθ, e T/3 < via (30) LNCB x , bθ, e T/3 UNCB x , bθ, e T/3 . Hence, at the end of Part I, arm x is not eliminated via the LNCB of any x which satisfies x, θ < 10 q dν log (T|X|) T . Further, note that, under event E1, such arms are eliminated at the end of Part I (Lemma 18). Hence, the LNCB of such arms are not even considered in the phases of Part II. To complete the proof, we next show that the UNCB of the optimal arm x is at least the LNCB of all arms x which bear x, θ 10 q dν log (T|X|) T . Below, we will consider the Nash confidence bounds for a general T . Replacing T by e T gives us the desired confidence-bounds comparison for the end of Part I this repetition is omitted. Under events E1 and E2, for any arm x with x, θ 10 q dν log (T|X|) T , it holds that LNCB(x, bθ, T ) = x, bθ 6 x, bθ d ν log (T|X|) d ν x, θ log (T|X|) x, bθ dν log (T|X|) T (via (21)) dν x, θ log (T|X|) T ( x, bθ 1 2 x, θ via (22)) x, θ . (31) Complementarily, for optimal arm x we have UNCB(x , bθ, T ) = x , bθ + 6 x , bθ d ν log (T|X|) dν x , θ log (T|X|) x , bθ d ν log (T|X|) dν x , θ log (T|X|) T (since x , bθ x ,θ x , θ (32) Since x , θ x, θ for all arms x, inequalities (31) and (32) lead to the confidence-bounds comparison: UNCB(x , bθ, T ) LNCB(x, bθ, T ). Hence, if events E1 and E2 hold, then the optimal arm x is never eliminated from Algorithm 2. Further, Corollary 17 ensures that the events E1 and E2 hold with probability at least 1 4 log T T . Hence, the lemma stands proved. Lemma 8. Consider any phase ℓin Part II of Algorithm 2 and let e X be the surviving set of arms at the beginning of that phase. Then, with e T = p dνT log(T |X|), we have x, θ x , θ 25 3dν x , θ log (T|X|) 2ℓ e T for all x e X Here, ν is the sub-Poisson parameter of the stochastic rewards. Proof. For the analysis, assume that events E1 and E2 hold. Lemma 7 ensures that the optimal arm is contained in the surviving set of arms e X. Furthermore, if an arm x e X at the beginning of the ℓth phase, then it must be the case that arm x was not eliminated in the previous phase (which executed for T /2 rounds); in particular, we have UNCB(x, bθ, T /2) LNCB(x , bθ, T /2). This inequality reduces to x, bθ d ν log (T|X|) x , bθ d ν log (T|X|) Rearranging the terms, we obtain x, bθ x , bθ 6 x , bθ d ν log (T|X|) x, bθ d ν log (T|X|) 4 x , θ d ν log (T|X|) 4 x, θ d ν log (T|X|) 3 x, θ via (22)) x , θ d ν log (T|X|) Further, invoking equation (21) for x leads to x, θ x , θ 20 x , θ d ν log (T|X|) x , θ d ν log (T|X|) x , θ d ν log (T|X|) Substituting T = 2ℓe T/3, the above inequality reduces to the desired bound in (10). From Corollary 17, we have that the events E1 and E2 hold with probability at least 1 4 log T T . Hence, the lemma stands proved. C.3 Proof of Theorem 1 Theorem 1. 
For any given stochastic linear bandits problem with (finite) set of arms X Rd, time horizon T Z+, and ν-sub-Poisson rewards, Algorithm 2 achieves Nash regret T log(T|X|) Here, β = max {1, x , θ log d}, with x X denoting the optimal arm and θ the (unknown) parameter vector. Proof. We will assume, without loss of generality, that x , θ 192 q T log(T|X|), otherwise the stated Nash Regret bound directly holds (see equation (1)). Write E to denote the good event identified in Lemma 8; the lemma ensures that P{E} 1 4 log T During Part I of Algorithm 2, the product of expected rewards, conditioned on E, satisfies t=1 E[ Xt, θ | E] 1 T x , θ T (via Lemma 5) log(2(d+1))e T 1 log(2(d + 1))e T For analyzing Part II, we will utilize Lemma 8. Write Bℓto denote all the rounds t that belong to ℓth phase (in Part II). Also, let T ℓdenote the associated phase-length parameter, i.e., T ℓ= 2ℓe T/3. Note that in each phase ℓ(i.e., in the for-loop at Line 12 of Algorithm 2), every arm a in Supp(λ) (the support of D-optimal design) is pulled λa T ℓ times. Given that |Supp(λ)| d(d + 1)/2, we have |Bℓ| T ℓ+ d(d+1) 2 . By construction T ℓ d(d+1) 2 and, hence, |Bℓ| 2T ℓ. Since the phase length parameter, T ℓ, doubles after each phase, the algorithm would have at most log T phases. Hence, the product of expected rewards in Part II satisfies t=e T+1 E[ Xt, θ | E] 1 T = Y t Bℓ E[ Xt, θ | E] 1 T d ν x , θ log (T|X|) T (Lemma 8) x , θ T e T d ν log (T|X|) x , θ T e T d ν log (T|X|) The last inequality follows from the fact that (1 x)r (1 2rx), for any r [0, 1] and x [0, 1/2]. Note that the term q dν log (T|X|) x ,θ T ℓ 1/2, since x , θ 192 q T log(T|X|) along with Tdν log T|X| and T e4. We further simplify the expression as follows d ν log (T|X|) d ν log (T|X|) (since |Bℓ| 2T ℓ) d ν log (T|X|) (since (1 a)(1 b) 1 a b for a, b 0) d ν log (T|X|) (via Cauchy-Schwarz inequality) dν T x , θ log (T|X|). Combining the lower bound for the expected rewards in the two parts we get t=1 E[ Xt, θ ] 1 T E[ Xt, θ | E] P{E} 1 1 log(2(d + 1))e T dν T x , θ log (T|X|) 1 log(2(d + 1))e T dν T x , θ log (T|X|) 1 log(2(d + 1))e T dν T x , θ log (T|X|) ! 1 4 log T 1 log(2(d + 1))3 p Tdν log(T|X|) T 100 dν T x , θ log (T|X|) 4 log T T log (T|X|) 6 x , θ d ν log(T|X|) T log(2(d + 1)). Therefore, the Nash Regret can be bounded as NRT = x , θ t=1 E[ Xt, θ ] T log (T|X|) + 6 d ν log(T|X|) T log(2(d + 1)) x , θ (33) x , θ + 6 log(2(d + 1)) x , θ r T log (T|X|) (34) Hence, with β = max n 1, p x , θ , x , θ log d o = max {1, x , θ log d}, from equation (34) we obtain the desired bound on Nash regret NRT = O β q T log(T|X|) . The theorem stands proved. D Algorithm 3 and Regret Analysis Algorithm 3 LINNASH (Nash Confidence Bound Algorithm for Infinite Set of Arms) Input: Arm set X and horizon of play T. 1: Initialize matrix V [0]d,d and number of rounds e T = 3 p Td2.5ν log(T). Part I 2: Generate arm sequence S for the first e T rounds using Algorithm 1. 3: for t = 1 to e T do 4: Pull the next arm Xt from the sequence S. 5: Observe reward rt and update V V + Xt XT t 6: end for 7: Set estimate bθ := V 1 Pe T t=1 rt Xt 8: Find γ = maxz X z, bθ 9: Update e X {x X : x, bθ γ 16 3 γ d 5 2 ν log (T) 3 e T Part II 11: while end of time horizon T is reached do 12: Initialize V = [0]d,d to be an all zeros d d matrix and s = [0]d to be an all-zeros vector. // Beginning of new phase. 13: Find the probability distribution λ ( e X) by maximizing the following objective log Det(V(λ)) subject to λ ( e X) and Supp(λ) d(d + 1)/2. 
D Algorithm 3 and Regret Analysis

Algorithm 3 LINNASH (Nash Confidence Bound Algorithm for Infinite Set of Arms)
Input: Arm set $\mathcal{X}$ and horizon of play $T$.
1: Initialize matrix $V \leftarrow [0]_{d \times d}$ and number of rounds $\tilde{T} = 3\sqrt{T d^{2.5}\nu\log(T)}$.
Part I
2: Generate arm sequence $S$ for the first $\tilde{T}$ rounds using Algorithm 1.
3: for $t = 1$ to $\tilde{T}$ do
4: Pull the next arm $X_t$ from the sequence $S$.
5: Observe reward $r_t$ and update $V \leftarrow V + X_t X_t^{\mathsf{T}}$.
6: end for
7: Set estimate $\hat{\theta} := V^{-1}\sum_{t=1}^{\tilde{T}} r_t X_t$.
8: Find $\gamma = \max_{z \in \mathcal{X}} \langle z, \hat{\theta} \rangle$.
9: Update $\widetilde{\mathcal{X}} \leftarrow \left\{x \in \mathcal{X} : \langle x, \hat{\theta} \rangle \geq \gamma - 16\sqrt{\frac{3\,\gamma\, d^{2.5}\nu\log(T)}{\tilde{T}}}\right\}$.
10: Initialize the phase-length parameter $T' \leftarrow 2\tilde{T}/3$.
Part II
11: while the end of the time horizon $T$ has not been reached do
12: Initialize $V = [0]_{d \times d}$ to be an all-zeros $d \times d$ matrix and $s = [0]_d$ to be an all-zeros vector. // Beginning of new phase.
13: Find the probability distribution $\lambda \in \Delta(\widetilde{\mathcal{X}})$ by maximizing the following objective:
$$\max\; \log\mathrm{Det}(V(\lambda)) \quad \text{subject to} \quad \lambda \in \Delta(\widetilde{\mathcal{X}}) \ \text{ and } \ |\mathrm{Supp}(\lambda)| \leq d(d+1)/2. \qquad (35)$$
14: for each arm $a \in \mathrm{Supp}(\lambda)$ do
15: Pull arm $a$ for the next $\lceil \lambda_a T' \rceil$ rounds.
16: Update $V \leftarrow V + \lceil \lambda_a T' \rceil\, a a^{\mathsf{T}}$.
17: Observe the $\lceil \lambda_a T' \rceil$ corresponding rewards $z_1, z_2, \ldots$ and update $s \leftarrow s + \big(\sum_j z_j\big) a$.
18: end for
19: Estimate $\hat{\theta} = V^{-1} s$.
20: Find $\gamma = \max_{z \in \mathcal{X}} \langle z, \hat{\theta} \rangle$.
21: $\widetilde{\mathcal{X}} \leftarrow \left\{x \in \mathcal{X} : \langle x, \hat{\theta} \rangle \geq \gamma - 16\sqrt{\frac{\gamma\, d^{2.5}\nu\log(T)}{T'}}\right\}$.
22: $T' \leftarrow 2T'$. // End of phase.
23: end while
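The optimization in Line 13 (equation (35)) is a D-optimal (Kiefer-Wolfowitz) design problem over the surviving arms. Below is a minimal sketch of the classical Frank-Wolfe (Fedorov-Wynn) iteration for approximating such a design on a finite arm set; the function name, the iteration budget, the small regularizer, and the uniform initialization are our own illustrative choices, and the support-size bound $|\mathrm{Supp}(\lambda)| \leq d(d+1)/2$ in (35) is not explicitly enforced here.

```python
import numpy as np

def d_optimal_design(arms, n_iters=1000, reg=1e-6):
    """Approximate D-optimal design weights over a finite arm set (shape (K, d)).

    Each iteration moves mass toward the arm with the largest leverage score
    x^T V(lam)^{-1} x, the Frank-Wolfe ascent direction for log det V(lam),
    using the exact Fedorov-Wynn line-search step.
    """
    K, d = arms.shape
    lam = np.full(K, 1.0 / K)                      # uniform initialization (our choice)
    for _ in range(n_iters):
        V = arms.T @ (lam[:, None] * arms) + reg * np.eye(d)
        V_inv = np.linalg.inv(V)
        scores = np.einsum("kd,de,ke->k", arms, V_inv, arms)   # leverage scores
        k = int(np.argmax(scores))
        if scores[k] <= d + 1e-9:                  # Kiefer-Wolfowitz optimality condition
            break
        step = (scores[k] / d - 1.0) / (scores[k] - 1.0)       # exact line search
        lam = (1.0 - step) * lam
        lam[k] += step
    return lam
```

In a phase of Algorithm 3, each arm $a$ with $\lambda_a > 0$ would then be pulled for $\lceil \lambda_a T' \rceil$ rounds (Lines 14-18).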
Instead of ensuring probability bounds on individual arms, we construct a confidence ellipsoid around $\theta^*$. In the context of Algorithm 3, we define the following events for the regret analysis:

$G_1$: In Part I, the arms from the D-optimal design are chosen at least $\tilde{T}/3$ times. Furthermore, if $\langle x^*, \theta^* \rangle \geq 196\sqrt{\frac{d^{2.5}\nu}{T}\log T}$, then the estimate $\hat{\theta}$ calculated at the end of Part I satisfies $\|\hat{\theta} - \theta^*\|_V \leq 7\sqrt{\langle x^*, \theta^* \rangle\, d^{3/2}\nu\log T}$.

$G_2$: In Part II, for every phase, if $\langle x^*, \theta^* \rangle \geq 196\sqrt{\frac{d^{2.5}\nu}{T}\log T}$, the estimators $\hat{\theta}$ satisfy $\|\hat{\theta} - \theta^*\|_V \leq 7\sqrt{\langle x^*, \theta^* \rangle\, d^{3/2}\nu\log T}$.

Without loss of generality, we assume throughout that $\langle x^*, \theta^* \rangle \geq 196\sqrt{\frac{d^{2.5}\nu}{T}\log T} = 196\, d^{1.25}\sqrt{\frac{\nu}{T}\log T}$; otherwise, the regret bound in Theorem 2 trivially holds.

Let $\mathcal{B}$ denote the unit ball in $\mathbb{R}^d$. We have
$$\|\hat{\theta} - \theta^*\|_V = \|V^{1/2}(\hat{\theta} - \theta^*)\|_2 = \max_{y \in \mathcal{B}} \langle y, V^{1/2}(\hat{\theta} - \theta^*) \rangle.$$
We construct an $\varepsilon$-net for the unit ball, denoted as $C_\varepsilon$. For any $y \in \mathcal{B}$, we define $y_\varepsilon := \arg\min_{b \in C_\varepsilon} \|b - y\|_2$. We can now write
$$\|\hat{\theta} - \theta^*\|_V = \max_{y \in \mathcal{B}} \Big( \langle y - y_\varepsilon, V^{1/2}(\hat{\theta} - \theta^*) \rangle + \langle y_\varepsilon, V^{1/2}(\hat{\theta} - \theta^*) \rangle \Big) \leq \max_{y \in \mathcal{B}} \Big( \|y - y_\varepsilon\|_2\, \|V^{1/2}(\hat{\theta} - \theta^*)\|_2 + |\langle y_\varepsilon, V^{1/2}(\hat{\theta} - \theta^*) \rangle| \Big) \leq \varepsilon\, \|\hat{\theta} - \theta^*\|_V + |\langle y_\varepsilon, V^{1/2}(\hat{\theta} - \theta^*) \rangle|.$$
Rearranging, we obtain
$$\|\hat{\theta} - \theta^*\|_V \leq \frac{1}{1-\varepsilon}\, |\langle y_\varepsilon V^{1/2}, \hat{\theta} - \theta^* \rangle|. \qquad (36)$$
In the following lemmas, we show that $|\langle y_\varepsilon V^{1/2}, \hat{\theta} - \theta^* \rangle|$ is small for all values of $y_\varepsilon$.

Lemma 19. Let $x_1, x_2, \ldots, x_n$ be a sequence of fixed arm pulls (from a set $\mathcal{X}$) such that each arm $x$ in the support of the D-optimal design $\lambda$ (for $\mathcal{X}$) is pulled at least $\lambda_x \tau$ times. Consider the matrix $V = \sum_{j=1}^n x_j x_j^{\mathsf{T}}$ and let $z$ be a vector such that $\|z\|_2 \leq 1$ and $\langle z V^{1/2}, \theta^* \rangle \geq 6\nu\sqrt{\frac{d}{\tau}\log(T|C_\varepsilon|)}$. Then, with probability greater than $1 - \frac{2}{T|C_\varepsilon|}$, we have
$$|\langle z V^{1/2}, \theta^* - \hat{\theta} \rangle| \;\leq\; 3\sqrt{\frac{\sqrt{n}\, d\,\nu}{\tau}\log(T|C_\varepsilon|)\,\langle x^*, \theta^* \rangle}.$$

Proof. We begin by utilizing Lemma 6. First, we determine the $\gamma$ parameter in the lemma as follows: for any $t \in [n]$, we have
$$|(zV^{1/2})^{\mathsf{T}} V^{-1} x_t| = |\langle z, V^{-1/2} x_t \rangle| \leq \|z\|_2\, \|x_t\|_{V^{-1}} \leq \|x_t\|_{V^{-1}} \quad (\text{since } \|z\|_2 \leq 1).$$
Let $A_\lambda$ be the optimal design matrix. Since $V \succeq \tau A_\lambda$, we have $\|x_t\|_{V^{-1}} \leq \sqrt{\frac{d}{\tau}}$ (by Lemma 3). Now, we use Corollary 9 with $\gamma = \sqrt{\frac{d}{\tau}}$ and $\delta = 3\sqrt{\frac{d\nu\log(T|C_\varepsilon|)}{\tau\,\langle zV^{1/2}, \theta^* \rangle}}$. Note that $\delta \in [0,1]$, since $\langle zV^{1/2}, \theta^* \rangle \geq 6\nu\sqrt{\frac{d}{\tau}\log(T|C_\varepsilon|)}$. We obtain the following probability bound:
$$\mathbb{P}\left\{ |\langle zV^{1/2}, \theta^* - \hat{\theta} \rangle| \geq 3\sqrt{\frac{d\nu}{\tau}\log(T|C_\varepsilon|)\,\langle zV^{1/2}, \theta^* \rangle} \right\} \leq \frac{2}{T|C_\varepsilon|}. \qquad (37)$$
Finally, we establish an upper bound on the term $\langle zV^{1/2}, \theta^* \rangle$ as follows:
$$\langle zV^{1/2}, \theta^* \rangle \leq \|z\|_2\, \|V^{1/2}\theta^*\|_2 = \sqrt{\theta^{*\mathsf{T}} V \theta^*} = \sqrt{\sum_{i \in [n]} \theta^{*\mathsf{T}} x_i x_i^{\mathsf{T}} \theta^*} \leq \sqrt{n}\, \langle x^*, \theta^* \rangle,$$
where the first inequality holds since $\|z\|_2 \leq 1$ and the last one uses $\langle x_i, \theta^* \rangle \leq \langle x^*, \theta^* \rangle$. Substituting this bound in (37), we get the lemma statement. This completes the proof of the lemma.

Lemma 20. Consider the same notation as in Lemma 19. If $\langle zV^{1/2}, \theta^* \rangle \in \left[0,\; 6\nu\sqrt{\frac{d}{\tau}\log(T|C_\varepsilon|)}\right]$, then with probability greater than $1 - \frac{2}{T|C_\varepsilon|}$ we have
$$|\langle zV^{1/2}, \theta^* - \hat{\theta} \rangle| \leq 12\nu\sqrt{\frac{d}{\tau}\log(T|C_\varepsilon|)}.$$

Proof. Utilizing Lemma 10 with $\delta = 1$, $\alpha = 6\nu\sqrt{\frac{d}{\tau}\log(T|C_\varepsilon|)}$, and $\gamma = \sqrt{\frac{d}{\tau}}$, we have, with probability greater than $1 - \frac{1}{T|C_\varepsilon|}$,
$$\langle zV^{1/2}, \hat{\theta} \rangle \leq 12\nu\sqrt{\frac{d}{\tau}\log(T|C_\varepsilon|)}.$$
Since $\langle zV^{1/2}, \theta^* \rangle \geq 0$, it follows that $\langle zV^{1/2}, \hat{\theta} - \theta^* \rangle \leq 12\nu\sqrt{\frac{d}{\tau}\log(T|C_\varepsilon|)}$. Next, applying Lemma 11 with $\delta = 1$ and $\alpha = 6\nu\sqrt{\frac{d}{\tau}\log(T|C_\varepsilon|)}$, we have, with probability greater than $1 - \frac{1}{T|C_\varepsilon|}$,
$$\langle zV^{1/2}, \theta^* - \hat{\theta} \rangle \leq 6\nu\sqrt{\frac{d}{\tau}\log(T|C_\varepsilon|)} \leq 12\nu\sqrt{\frac{d}{\tau}\log(T|C_\varepsilon|)}.$$
Hence, the lemma stands proved.

Lemma 21. If $\langle x^*, \theta^* \rangle \geq 196\sqrt{\frac{d^{2.5}\nu}{T}\log T}$, then $\mathbb{P}\{G_1\} \geq 1 - \frac{\log T}{T}$.

Proof. First, we note (from Lemma 13) that the arms from the solution of the D-optimal design problem are selected (with probability greater than $1 - \frac{1}{T}$) at least $\tilde{T}/3$ times. Hence, we can use Lemmas 19 and 20 with $\tau = \tilde{T}/3$.

Let us consider the case where $\langle y_\varepsilon V^{1/2}, \theta^* \rangle \geq 6\nu\sqrt{\frac{3d}{\tilde{T}}\log(T|C_\varepsilon|)}$. We have that the following holds with probability greater than $1 - \frac{2}{T|C_\varepsilon|}$:
$$\|\hat{\theta} - \theta^*\|_V \leq \frac{1}{1-\varepsilon}\,|\langle y_\varepsilon V^{1/2}, \hat{\theta} - \theta^* \rangle| \quad(\text{from } (36))$$
$$\leq \frac{3}{1-\varepsilon}\sqrt{\frac{\sqrt{\tilde{T}}\, d\,\nu}{\tilde{T}/3}\log(T|C_\varepsilon|)\,\langle x^*, \theta^* \rangle} \quad(\text{using Lemma 19 with } \tau = \tilde{T}/3 \text{ and } n = \tilde{T})$$
$$= \frac{3}{1-\varepsilon}\sqrt{\frac{3 d\nu\log(T|C_\varepsilon|)\,\langle x^*, \theta^* \rangle}{\sqrt{\tilde{T}}}}.$$
Next, we note that $|C_\varepsilon| \leq \left(\frac{3}{\varepsilon}\right)^d$ [17], and by choosing $\varepsilon = 1/2$ we get
$$\|\hat{\theta} - \theta^*\|_V \leq 7\sqrt{\nu\, d^{3/2}\log(T)\,\langle x^*, \theta^* \rangle}.$$
Taking a union bound over all elements in $C_\varepsilon$ gives a probability bound of $1 - \frac{2}{T}$.

Now, for the case where $\langle y_\varepsilon V^{1/2}, \theta^* \rangle \in \left[0,\; 6\nu\sqrt{\frac{3d}{\tilde{T}}\log(T|C_\varepsilon|)}\right]$, substituting $\tau = \tilde{T}/3$ in Lemma 20 we have, with probability greater than $1 - \frac{2}{T|C_\varepsilon|}$,
$$\|\hat{\theta} - \theta^*\|_V \leq \frac{1}{1-\varepsilon}\,|\langle y_\varepsilon V^{1/2}, \hat{\theta} - \theta^* \rangle| \leq \frac{12\nu}{1-\varepsilon}\sqrt{\frac{3d}{\tilde{T}}\log(T|C_\varepsilon|)} \quad(\text{using Lemma 20})$$
$$\leq 24\nu\sqrt{\frac{3d}{\tilde{T}}\log(T|C_\varepsilon|)} \quad(\text{substituting } \varepsilon = 0.5)$$
$$\leq 7\sqrt{d^{3/2}\nu\log(T)\,\langle x^*, \theta^* \rangle}.$$
The last inequality is due to the fact that $\langle x^*, \theta^* \rangle \geq 196\sqrt{\frac{d^{2.5}\nu}{T}\log T}$ and $\tilde{T} = 3\sqrt{T\nu d^{2.5}\log T}$. We again take a union bound over all elements in $C_\varepsilon$ to get a probability bound of $1 - \frac{2}{T}$.

Finally, a union bound over the two cases and the event in Lemma 13 proves the lemma.

Lemma 22. If $\langle x^*, \theta^* \rangle \geq 196\sqrt{\frac{d^{2.5}\nu}{T}\log T}$, then $\mathbb{P}\{G_2\} \geq 1 - \frac{\log T}{T}$.

Proof. To prove Lemma 22, we follow the same steps as in the proof of Lemma 21. Utilizing Lemma 19 and Lemma 20 with $\tau = T'$, we establish that, for any fixed phase, the following inequality holds with probability greater than $1 - \frac{1}{T}$:
$$\|\hat{\theta} - \theta^*\|_V \leq 7\sqrt{d^{3/2}\nu\log T\,\langle x^*, \theta^* \rangle}.$$
Taking a union bound over the at most $\log T$ phases in Part II of Algorithm 3 gives us the desired lower bound on $\mathbb{P}\{G_2\}$.

Corollary 23. If $G_1$ holds, then for all $x \in \mathcal{X}$, the estimate $\hat{\theta}$ calculated at the end of Part I satisfies
$$|\langle x, \hat{\theta} \rangle - \langle x, \theta^* \rangle| \leq 7\sqrt{\frac{3\,\langle x^*, \theta^* \rangle\, d^{2.5}\nu\log T}{\tilde{T}}}.$$
Consider any phase $\ell$ in Part II. If $G_2$ holds, then for every arm in the surviving arm set $\widetilde{\mathcal{X}}$, the estimate $\hat{\theta}$ calculated at the end of the phase satisfies
$$|\langle x, \hat{\theta} \rangle - \langle x, \theta^* \rangle| \leq 7\sqrt{\frac{3\,\langle x^*, \theta^* \rangle\, d^{2.5}\nu\log T}{2^{\ell}\,\tilde{T}}}.$$

Proof. First, we use Hölder's inequality:
$$|\langle x, \theta^* - \hat{\theta} \rangle| \leq \|x\|_{V^{-1}}\, \|\theta^* - \hat{\theta}\|_V. \qquad (40)$$
Since $G_1$ holds, the arms from the optimal design are selected at least $\tilde{T}/3$ times; hence, by Lemma 3, we have $\|x\|_{V^{-1}} \leq \sqrt{\frac{3d}{\tilde{T}}}$. Similarly, for every phase in Part II with $T' = 2^{\ell}\tilde{T}/3$, we have $\|x\|_{V^{-1}} \leq \sqrt{\frac{d}{T'}} = \sqrt{\frac{3d}{2^{\ell}\tilde{T}}}$. Finally, using the bounds on $\|\theta^* - \hat{\theta}\|_V$ from events $G_1$ and $G_2$, and substituting in (40), we get the desired bound.

Corollary 24. If $\langle x^*, \theta^* \rangle \geq 196\sqrt{\frac{d^{2.5}\nu}{T}\log T}$, then
$$\frac{7}{10}\,\langle x^*, \theta^* \rangle \;\leq\; \max_{x \in \mathcal{X}} \langle x, \hat{\theta} \rangle \;\leq\; \frac{13}{10}\,\langle x^*, \theta^* \rangle.$$

Proof. Since $T' \geq 2\tilde{T}/3$, via Corollary 23, any $\hat{\theta}$ calculated in Part I or during any phase of Part II satisfies
$$|\langle x, \hat{\theta} \rangle - \langle x, \theta^* \rangle| \leq 7\sqrt{\frac{3\,\langle x^*, \theta^* \rangle\, d^{2.5}\nu\log T}{\tilde{T}}}.$$
Consequently,
$$\max_{x \in \mathcal{X}} \langle x, \hat{\theta} \rangle \geq \langle x^*, \hat{\theta} \rangle \geq \langle x^*, \theta^* \rangle - 7\sqrt{\frac{3\,\langle x^*, \theta^* \rangle\, d^{2.5}\nu\log T}{\tilde{T}}} \geq \frac{7}{10}\,\langle x^*, \theta^* \rangle$$
(since $\langle x^*, \theta^* \rangle \geq 196\sqrt{\frac{d^{2.5}\nu}{T}\log T}$ and $\tilde{T} = 3\sqrt{T d^{2.5}\nu\log(T)}$). Now, for any $x \in \mathcal{X}$,
$$\langle x, \hat{\theta} \rangle \leq \langle x, \theta^* \rangle + 7\sqrt{\frac{3\,\langle x^*, \theta^* \rangle\, d^{2.5}\nu\log T}{\tilde{T}}} \leq \langle x^*, \theta^* \rangle + 7\sqrt{\frac{3\,\langle x^*, \theta^* \rangle\, d^{2.5}\nu\log T}{\tilde{T}}} \leq \frac{13}{10}\,\langle x^*, \theta^* \rangle.$$
Hence, the corollary stands proved.

Lemma 25. If events $G_1$ and $G_2$ hold, then the optimal arm $x^*$ always exists in the surviving set $\widetilde{\mathcal{X}}$ in every phase in Part II of Algorithm 3.

Proof. Let $\tau = \tilde{T}/3$ for Part I and $\tau = T'$ for every phase of Part II, and write $\gamma = \max_{x} \langle x, \hat{\theta} \rangle$ for the threshold computed by the algorithm. From Corollary 23 we have
$$\langle x^*, \hat{\theta} \rangle \geq \langle x^*, \theta^* \rangle - 7\sqrt{\frac{\langle x^*, \theta^* \rangle\, d^{2.5}\nu\log T}{\tau}} \geq \langle x, \theta^* \rangle - 7\sqrt{\frac{\langle x^*, \theta^* \rangle\, d^{2.5}\nu\log T}{\tau}} \quad(\text{since } \langle x^*, \theta^* \rangle \geq \langle x, \theta^* \rangle)$$
$$\geq \langle x, \hat{\theta} \rangle - 14\sqrt{\frac{\langle x^*, \theta^* \rangle\, d^{2.5}\nu\log T}{\tau}} \quad(\text{using Corollary 23})$$
for every arm $x$; in particular, maximizing over $x$,
$$\langle x^*, \hat{\theta} \rangle \geq \gamma - 14\sqrt{\frac{\langle x^*, \theta^* \rangle\, d^{2.5}\nu\log T}{\tau}} \geq \gamma - 16\sqrt{\frac{\gamma\, d^{2.5}\nu\log T}{\tau}} \quad(\text{using Corollary 24}).$$
Hence, the best arm will never satisfy the elimination criterion in Algorithm 3.
Lemma 26. Given that events $G_1$ and $G_2$ hold, consider any phase index $\ell$ in Part II of Algorithm 3. For the surviving set of arms $\widetilde{\mathcal{X}}$ at the beginning of that phase, and with $\tilde{T} = 3\sqrt{d^{2.5}\nu T \log(T)}$, the following inequality holds for all $x \in \widetilde{\mathcal{X}}$:
$$\langle x, \theta^* \rangle \;\geq\; \langle x^*, \theta^* \rangle - 26\sqrt{\frac{3\, d^{2.5}\nu\, \langle x^*, \theta^* \rangle \log T}{2^{\ell}\,\tilde{T}}}. \qquad (41)$$

Proof. Lemma 25 ensures that the optimal arm is contained in the surviving set of arms $\widetilde{\mathcal{X}}$. Furthermore, if an arm $x \in \widetilde{\mathcal{X}}$ is pulled in the $\ell$th phase, then it must be the case that arm $x$ was not eliminated in the previous phase (which had phase-length parameter $T'/2$); in particular, arm $x$ does not satisfy the elimination criterion in Line 21 of Algorithm 3. This gives
$$\langle x, \hat{\theta} \rangle \;\geq\; \langle x^*, \hat{\theta} \rangle - 16\sqrt{\frac{\max_{x}\langle x, \hat{\theta} \rangle\, d^{2.5}\nu\log T}{T'/2}}.$$
Applying Corollary 24 to bound $\max_{x}\langle x, \hat{\theta} \rangle \leq \frac{13}{10}\langle x^*, \theta^* \rangle$, and Corollary 23 to relate $\langle x, \hat{\theta} \rangle$ and $\langle x^*, \hat{\theta} \rangle$ to $\langle x, \theta^* \rangle$ and $\langle x^*, \theta^* \rangle$, and then substituting $T' = 2^{\ell}\tilde{T}/3$, the above inequality reduces to the desired bound (41).

Theorem 2. For any given stochastic linear bandits problem with a set of arms $\mathcal{X} \subseteq \mathbb{R}^d$, time horizon $T \in \mathbb{Z}_+$, and $\nu$-sub-Poisson rewards, Algorithm 3 achieves Nash regret
$$\mathrm{NR}_T = O\!\left(\beta\sqrt{\frac{d^{2.5}\nu}{T}\log T}\right) = O\!\left(\beta\, d^{1.25}\sqrt{\frac{\nu}{T}\log T}\right).$$
Here, $\beta = \max\{1, \langle x^*, \theta^* \rangle \log d\}$, with $x^* \in \mathcal{X}$ denoting the optimal arm and $\theta^*$ the (unknown) parameter vector.

Proof. Without loss of generality, we assume that $\langle x^*, \theta^* \rangle \geq 196\sqrt{\frac{d^{2.5}\nu}{T}\log T}$; otherwise, the Nash regret bound is trivially true.

For Part I, the product of expected rewards satisfies
$$\prod_{t=1}^{\tilde{T}} \mathbb{E}[\langle X_t, \theta^* \rangle \mid G_1 \cap G_2]^{\frac{1}{T}} \;\geq\; \left(\frac{\langle x^*, \theta^* \rangle}{2(d+1)}\right)^{\frac{\tilde{T}}{T}} \quad(\text{from Lemma 5})$$
$$\geq\; \langle x^*, \theta^* \rangle^{\frac{\tilde{T}}{T}}\left(1 - \frac{\log(2(d+1))\,\tilde{T}}{T}\right).$$

For Part II, we use Lemma 26. Let $E_j$ denote the time interval of the $j$th phase, and let $T'_j$ be the phase-length parameter in that phase. Recall that $|E_j| \leq T'_j + \frac{d(d+1)}{2}$. Also, the algorithm runs for at most $\log T$ phases. Hence, we have
$$\prod_{t=\tilde{T}+1}^{T} \mathbb{E}[\langle X_t, \theta^* \rangle \mid G_1 \cap G_2]^{\frac{1}{T}} = \prod_{j}\,\prod_{t \in E_j} \mathbb{E}[\langle X_t, \theta^* \rangle \mid G_1 \cap G_2]^{\frac{1}{T}}$$
$$\geq \prod_{j}\left(\langle x^*, \theta^* \rangle - 26\sqrt{\frac{d^{2.5}\nu\,\langle x^*, \theta^* \rangle \log T}{T'_j}}\right)^{\frac{|E_j|}{T}} \quad(\text{Lemma 26})$$
$$= \langle x^*, \theta^* \rangle^{\frac{T - \tilde{T}}{T}}\prod_{j}\left(1 - 26\sqrt{\frac{d^{2.5}\nu\log T}{\langle x^*, \theta^* \rangle\, T'_j}}\right)^{\frac{|E_j|}{T}}$$
$$\geq \langle x^*, \theta^* \rangle^{\frac{T - \tilde{T}}{T}}\prod_{j}\left(1 - \frac{52\,|E_j|}{T}\sqrt{\frac{d^{2.5}\nu\log T}{\langle x^*, \theta^* \rangle\, T'_j}}\right).$$
The last inequality is due to the fact that $(1-x)^r \geq 1 - 2rx$, for $r \in [0,1]$ and $x \in [0,1/2]$; note that the term $\sqrt{\frac{d^{2.5}\nu\log T}{\langle x^*, \theta^* \rangle\, T'_j}} \leq \frac{1}{2}$, since $\langle x^*, \theta^* \rangle \geq 196\sqrt{\frac{d^{2.5}\nu}{T}\log T}$, $T \geq 2\sqrt{T d^{2.5}\nu\log T}$, and $T \geq e^6$. We can further simplify the expression as follows:
$$\prod_{j}\left(1 - \frac{52}{T}\left(T'_j + \frac{d(d+1)}{2}\right)\sqrt{\frac{d^{2.5}\nu\log T}{\langle x^*, \theta^* \rangle\, T'_j}}\right) \geq \prod_{j}\left(1 - \frac{78\,T'_j}{T}\sqrt{\frac{d^{2.5}\nu\log T}{\langle x^*, \theta^* \rangle\, T'_j}}\right) \quad(\text{assuming } T'_j \geq d(d+1))$$
$$\geq 1 - \frac{78}{T}\sum_{j}\sqrt{\frac{T'_j\, d^{2.5}\nu\log T}{\langle x^*, \theta^* \rangle}} \quad(\text{since } (1-a)(1-b) \geq 1 - a - b \text{ for } a, b \geq 0)$$
$$\geq 1 - 78\sqrt{\frac{d^{2.5}\nu\log T}{T\,\langle x^*, \theta^* \rangle}} \quad(\text{using the Cauchy--Schwarz inequality and } \textstyle\sum_j T'_j \leq T).$$

Combining the lower bounds for the rewards in Part I and Part II of the algorithm, we obtain
$$\prod_{t=1}^{T} \mathbb{E}[\langle X_t, \theta^* \rangle]^{\frac{1}{T}} \;\geq\; \left(\prod_{t=1}^{T} \mathbb{E}[\langle X_t, \theta^* \rangle \mid G_1 \cap G_2]^{\frac{1}{T}}\right)\mathbb{P}\{G_1 \cap G_2\}$$
$$\geq \langle x^*, \theta^* \rangle\left(1 - \frac{\log(2(d+1))\,\tilde{T}}{T}\right)\left(1 - 78\sqrt{\frac{d^{2.5}\nu\log T}{T\,\langle x^*, \theta^* \rangle}}\right)\left(1 - \frac{2\log T}{T}\right)$$
$$\geq \langle x^*, \theta^* \rangle - 78\sqrt{\frac{\langle x^*, \theta^* \rangle\, d^{2.5}\nu\log T}{T}} - \frac{2\log T}{T}\,\langle x^*, \theta^* \rangle - 3\log(2(d+1))\,\langle x^*, \theta^* \rangle\sqrt{\frac{d^{2.5}\nu\log T}{T}} \quad(\text{using } \tilde{T} = 3\sqrt{T d^{2.5}\nu\log T})$$
$$\geq \langle x^*, \theta^* \rangle - 78\sqrt{\frac{\langle x^*, \theta^* \rangle\, d^{2.5}\nu\log T}{T}} - 6\,\langle x^*, \theta^* \rangle\log(2(d+1))\sqrt{\frac{d^{2.5}\nu\log T}{T}},$$
where the last step absorbs the $\frac{2\log T}{T}$ term (using $T \geq e^6$). Hence, the Nash regret can be bounded as
$$\mathrm{NR}_T = \langle x^*, \theta^* \rangle - \prod_{t=1}^{T} \mathbb{E}[\langle X_t, \theta^* \rangle]^{\frac{1}{T}} \;\leq\; 78\sqrt{\frac{\langle x^*, \theta^* \rangle\, d^{2.5}\nu\log T}{T}} + 6\,\langle x^*, \theta^* \rangle\log(2(d+1))\sqrt{\frac{d^{2.5}\nu\log T}{T}}$$
$$\leq \Big(78\sqrt{\langle x^*, \theta^* \rangle} + 6\log(2(d+1))\,\langle x^*, \theta^* \rangle\Big)\sqrt{\frac{d^{2.5}\nu}{T}\log T}.$$
Therefore, with $\beta = \max\{1, \langle x^*, \theta^* \rangle \log d\}$, we obtain $\mathrm{NR}_T = O\!\left(\beta\sqrt{\frac{d^{2.5}\nu}{T}\log T}\right)$. The theorem stands proved.
E Experiments

We conduct experiments to compare the performance of our algorithm LINNASH with Thompson Sampling on synthetic data. For the comparison, we select Thompson Sampling (Algorithm 1 in [2]), instead of UCB/OFUL, since randomization is essential for achieving meaningful Nash regret guarantees. We fine-tune the parameters of both algorithms and evaluate their performance in the following experimental setup: we fix the ambient dimension $d = 80$, the number of arms $|\mathcal{X}| = 10000$, and the number of rounds $T = 50000$. Both the unknown parameter vector $\theta^*$ and the arm embeddings are sampled from a multivariate Gaussian distribution. Subsequently, the arm embeddings are shifted and normalized to ensure that all mean rewards are non-negative, with the maximum mean reward set to $0.5$. Upon pulling an arm, we observe a Bernoulli random variable whose success probability equals the arm's mean reward.

[Figure 1: Nash regret comparison of LINNASH and Thompson Sampling (x-axis: rounds, up to 50000; y-axis: Nash regret).]
[Figure 2: Round-wise reward for LINNASH (x-axis: rounds; y-axis: round-wise reward).]
[Figure 3: Round-wise reward for Thompson Sampling (x-axis: rounds; y-axis: round-wise reward).]

In this experimental setting, we observe a significant performance advantage of LINNASH over Thompson Sampling. We plot our results in Figure 1, which shows that the Nash regret of LINNASH decreases notably faster than that of Thompson Sampling. Another notable advantage of LINNASH, evident from the experiments, stems from successive elimination: the variance in the quality of the arms pulled decreases as the rounds progress (see Figures 2 and 3). This is due to the bulk elimination of suboptimal arms at regular intervals. In contrast, Thompson Sampling incurs a large variance in the quality of the arms pulled even after many rounds, since no arms are ever eliminated.
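The synthetic setup above can be reproduced, up to unspecified details, with a short generator. In the sketch below, the function names, the random seed, and the particular way the embeddings are shifted along $\theta^*$ and rescaled to cap the maximum mean reward at $0.5$ are our own illustrative choices; the text does not specify these implementation details.

```python
import numpy as np

def make_instance(d=80, n_arms=10_000, max_mean=0.5, seed=0):
    """Synthetic linear-bandit instance with non-negative Bernoulli mean rewards.

    theta* and the raw arm embeddings are Gaussian; the arms are then shifted
    along theta* and rescaled so that mean rewards lie in [0, max_mean].
    """
    rng = np.random.default_rng(seed)
    theta = rng.normal(size=d)
    arms = rng.normal(size=(n_arms, d))
    means = arms @ theta
    # Shift every arm along theta so that the smallest mean becomes 0 ...
    shift = (-means.min() / (theta @ theta)) * theta
    arms = arms + shift
    # ... then rescale so that the largest mean equals max_mean.
    arms *= max_mean / (arms @ theta).max()
    return arms, theta

def pull(arm, theta, rng):
    """Bernoulli reward with success probability <arm, theta*>."""
    return float(rng.random() < arm @ theta)

# Example usage (hypothetical, matching the dimensions reported in the text):
# arms, theta = make_instance()
# rng = np.random.default_rng(1)
# reward = pull(arms[0], theta, rng)
```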