Journal of Machine Learning Research 24 (2023) 1-33. Submitted 2/21; Revised 6/22; Published 1/23.

Adaptation to the Range in K-Armed Bandits

Hédi Hadiji, hedi.hadiji@math.u-psud.fr
Gilles Stoltz, gilles.stoltz@math.u-psud.fr
Université Paris-Saclay, CNRS, Laboratoire de mathématiques d'Orsay, 91405 Orsay, France

Editor: Ambuj Tewari

Abstract. We consider stochastic bandit problems with K arms, each associated with a distribution supported on a given finite range [m, M]. We do not assume that the range [m, M] is known and show that there is a cost for learning this range. Indeed, a new trade-off between distribution-dependent and distribution-free regret bounds arises, which prevents from simultaneously achieving the typical ln T and √T bounds. For instance, a √T distribution-free regret bound may only be achieved if the distribution-dependent regret bounds are at least of order √T. We exhibit a strategy achieving the rates for regret imposed by the new trade-off.

Keywords: multi-armed bandits, adversarial learning, cumulative regret, information-theoretic proof techniques

1. Introduction

Stochastic multi-armed bandits form a standard setting to deal with sequential decision-making problems like the design of clinical trials (one of the first applications mentioned) or online advertisement and online revenue management. Except for notable exceptions discussed below, virtually all articles on stochastic K-armed bandits either assume that the distributions of the arms belong to some parametric family (often, one-dimensional exponential families) or are sub-Gaussian with a known parameter σ². Among the latter category, the case of the non-parametric family of distributions supported on a known range [m, M] is of particular interest to us. We show that the knowledge of the range [m, M] is a crucial piece of information and that facing bounded bandit problems while ignoring the bounds m and M is much harder.
We do so by studying what may be achieved, and what cannot be achieved anymore, when the range [m, M] is unknown and the strategies need to learn it. We call this problem adaptation to the range, or scale-free regret minimization. Why this problem is important and why we considered it is explained in Section 1.2. More precisely, we prove that adaptation to the range is actually possible but that it has a cost: our most striking result (in Section 2.3) is a severe trade-off between the scale-free distribution-dependent and distribution-free regret bounds that may be achieved. For instance, no strategy adaptive to the range can simultaneously achieve distribution-dependent regret bounds of order ln T and distribution-free regret bounds of order √T up to polynomial factors; this is in contrast with the case of a known range, where simple strategies like UCB strategies (by Auer et al., 2002a) do so.

©2023 Hédi Hadiji and Gilles Stoltz. License: CC-BY 4.0, see https://creativecommons.org/licenses/by/4.0/. Attribution requirements are provided at http://jmlr.org/papers/v24/21-0148.html.

Our general trade-off shows, for instance, that if one wants to keep the same √T order of magnitude for the scale-free distribution-free regret bounds, then the best scale-free distribution-dependent rate that may be achieved is √T. We also provide (in Section 4) a strategy, based on exponential weights, that adapts to the range and obtains optimal distribution-dependent and distribution-free regret bounds in the eyes of the exhibited trade-off: these are of respective orders T^{1−α} and T^α, where α ∈ [1/2, 1) is a parameter of the strategy.

1.1 Literature Review

Optimal scale-free regret minimization under full monitoring for adversarial sequences is offered by the AdaHedge strategy of de Rooij et al. (2014), which we will use as a building block in Section 4.
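For readers unfamiliar with this building block, the core AdaHedge update can be sketched as follows, here in a minimal full-information form with gains; the function names are our own, and this is only a sketch of the mechanism (see de Rooij et al., 2014, for the actual algorithm and its guarantees): the learning rate is set to ln(K) divided by the cumulative "mixability gap".

```python
import math

def adahedge_weights(cum_gains, cum_gap):
    """One AdaHedge weight computation under full monitoring: the learning
    rate is eta = ln(K) / Delta, where Delta (cum_gap) is the cumulative
    mixability gap; Delta = 0 corresponds to eta = +infinity (follow the
    leader)."""
    K = len(cum_gains)
    best = max(cum_gains)
    if cum_gap == 0:  # eta = +infinity: uniform over the leaders
        leaders = [g == best for g in cum_gains]
        n = sum(leaders)
        return [1.0 / n if is_leader else 0.0 for is_leader in leaders]
    eta = math.log(K) / cum_gap
    w = [math.exp(eta * (g - best)) for g in cum_gains]  # stabilized weights
    total = sum(w)
    return [x / total for x in w]

def mixability_gap(q, gains, cum_gap):
    """delta_t: the 'mix gain' minus the expected gain; always >= 0."""
    K = len(q)
    mean = sum(qi * g for qi, g in zip(q, gains))
    best = max(gains)
    if cum_gap == 0:  # eta = +infinity
        return best - mean
    eta = math.log(K) / cum_gap
    mix = best + math.log(sum(qi * math.exp(eta * (g - best))
                              for qi, g in zip(q, gains))) / eta
    return mix - mean
```

After each round, the new mixability gap is added to the cumulative one, which monotonically decreases the learning rate; this self-tuning is what makes the strategy scale-free under full monitoring.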
For stochastic bandits, the main difficulty in adaptation to the range is the adaptation to the upper end M (see Remark 4); this is why Honda and Takemura (2015) could provide optimal ln T distribution-dependent regret bounds for payoffs lying in ranges of the form (−∞, M], with a known M. Lattimore (2017) considers models of distributions with a known bound on their kurtosis, which is a scale-free measure of the heaviness of the tails of the distributions; he provides a scale-free algorithm based on median-of-means estimators, with ln T distribution-dependent regret bounds. However, bounded bandits can have an arbitrarily high kurtosis, so our settings are not directly comparable. Cowan and Katehakis (2015) study adaptation to the range, but in the restricted case of uniform distributions over unknown intervals. They provide optimal ln T distribution-dependent regret bounds for that specific model: in their model, the cost for adaptation is mild and lies only in the multiplicative constant before the ln T. In the setting of bounded bandits, we show that distribution-dependent regret bounds must be larger than ln T, but the argument of Lattimore (2017, Remark 8) entails that any regret rate larger than ln T, e.g., (ln T) ln ln T, may be achieved. Similar results were obtained by Cowan et al. (2018) for Gaussian distributions with unknown means and variances. Finally, on the front of adversarial bandits, no prior work discussed adaptation to the range, to the best of our knowledge.

Additional important references performing adaptation in some other sense, for stochastic and adversarial K-armed bandits, are discussed now, including some follow-up work to this article.

Adaptation to the effective range or to unbounded ranges in adversarial bandits. Gerchinovitz and Lattimore (2016) show that it is impossible to adapt to the so-called effective range in adversarial bandits.
A sequence of rewards has effective range smaller than b if at each round t, the rewards y_{t,a} all lie in an interval of the form [m_t, M_t] with M_t − m_t ≤ b. The lower bound they exhibit relies on a sequence of changing intervals of fixed size. This problem is thus different from our setting. See also positive results (regret upper bounds under additional assumptions) by Cesa-Bianchi and Shamir (2018) and Thune and Seldin (2018) for adaptation to the effective range. Allenberg et al. (2006) deal with unbounded ranges [m_t, M_t] in adversarial bandits and other partial monitoring settings, where, e.g., M_t = −m_t = t^β for some β > 0. They provide regret upper bounds scaling with t^{β/2} when β is known, but do not detail the price to pay for not knowing β, though they suggest to resort to a doubling trick in that case.

Adaptation to the variance. Audibert et al. (2009) consider a variant of UCB called UCB-V, which adapts to the unknown variance. Its analysis assumes that rewards lie in a known range [0, M]. The results crucially use Bernstein's inequality, which we state as Reminder 3 in Appendix C. As Bernstein's inequality holds for random variables with supports in (−∞, M], the analysis of UCB-V might perhaps be extended to this case as well. Deviation bounds in Bernstein's inequality contain two terms: a main term scaling with the standard deviation, and a remainder term scaling with M. This remainder term, which seems harmless, is actually a true issue when M is not known, as shown by the results of the present article.

Adaptation to other criteria. Wei and Luo (2018), Zimmert and Seldin (2019), Bubeck et al. (2018), and many more, provide strategies for adversarial bandits with rewards in a known range, say [0, 1], that adapt to additional regularity in the data, like small variations or stochasticity of the data, but never to the range itself.

Follow-up works.
Following an earlier version of this work, further research on range-adaptive bandit algorithms has been conducted. Baudry et al. (2021) bypass our lower bound by imposing minimal extra conditions on the reward distributions, avoiding the heavy-tail construction from Theorem 3; in this context, they provide a fully range-adaptive algorithm. For adversarial multi-armed bandits, Putta and Agrawal (2022) recover some small-loss bounds while being agnostic to the range, at the cost of degraded worst-case guarantees, and Huang et al. (2021) obtain similar results under delayed feedback.

1.2 Why Studying Adaptation to the Range for Finite-Armed Bandits

We encountered the problem of learning the range [m, M] of bandit problems when designing bandit algorithms for continuum-armed problems, see Hadiji (2019). Therein, arms are indexed by some bounded interval I, and the mean-payoff function f : I → R is assumed to be smooth enough, e.g., Hölder-smooth, with unknown regularity parameters L and β. The mean-payoff function f has a bounded range, as it is continuous over a bounded interval. To optimally learn these smoothness parameters, histogram reductions of continuum-armed bandit problems to finite-armed bandit problems, with a proper bandwidth, are performed (à la Kleinberg, 2004), by zooming out. Any reasonable K-armed bandit algorithm may be used in this algorithmic scheme. However, given this reduction, it had to be assumed that the range [m, M] of f is known, as all K-armed bandit algorithms we were aware of assumed that the range of the distributions over the arms was known. To get more complete adaptivity results in the continuum-armed case and be able to ignore the range of the mean-payoff function f, it was necessary and sufficient to deal with the similar issue of range adaptivity in the case of finitely many arms, which this article provides.
We also believe that exhibiting impossibility results, like the existence of a severe trade-off between distribution-dependent and distribution-free regret bounds in the model of bounded distributions with an unknown range, has consequences beyond that model. This impossibility result holds in particular for all larger models, like non-parametric models containing all distributions over the entire real line R satisfying certain assumptions on their tails to make sure that the tails are not too large. We therefore provide some intrinsic limitation to learning in K-armed stochastic bandits. The techniques introduced extend to more complex settings, like linear bandits; see an extended version of this article [arXiv:2006.03378], Appendix D.

2. Setting and Main Results

We consider finitely-armed stochastic bandits with bounded and possibly signed rewards. More precisely, K ≥ 2 arms are available; we denote by [K] the set {1, . . . , K} of these arms. With each arm a is associated a Borel probability distribution ν_a lying in some known model D; a model is a set of Borel probability distributions over R with a first moment. The models of interest in this article are discussed below; in the sequel, we only consider Borel distributions, even though we will omit this specification. A bandit problem in D is a K-vector of probability distributions in D, denoted by ν = (ν_a)_{a∈[K]}. The player knows D but not ν. As is standard in this setting, we denote by µ_a = E(ν_a) the mean payoff provided by an arm a. An optimal arm and the optimal mean payoff are respectively given by a⋆ ∈ argmax_{a∈[K]} µ_a and µ⋆ = max_{a∈[K]} µ_a. Finally, ∆_a = µ⋆ − µ_a denotes the gap of an arm a.
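In code, the quantities just introduced can be computed as follows (a trivial sketch, with hypothetical names; the gap decomposition of the regret used here is made formal in (1) below):

```python
def gaps(means):
    """Optimal mean mu_star, an optimal arm a_star, and the gaps
    Delta_a = mu_star - mu_a, from the vector of mean payoffs."""
    mu_star = max(means)
    a_star = means.index(mu_star)
    return mu_star, a_star, [mu_star - mu for mu in means]

def expected_regret(means, counts):
    """Expected regret via the gap decomposition: sum_a Delta_a * E[N_a(T)],
    where counts[a] plays the role of E[N_a(T)]."""
    mu_star = max(means)
    return sum((mu_star - mu) * n for mu, n in zip(means, counts))
```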
The online learning game goes as follows: at each round t ≥ 1, the player picks an arm A_t ∈ [K], possibly at random according to a probability distribution p_t = (p_{t,a})_{a∈[K]} based on an auxiliary randomization U_{t−1}, e.g., uniformly distributed over [0, 1], and then receives and observes a reward Z_t drawn independently at random according to the distribution ν_{A_t}, given A_t. More formally, a strategy of the player is a sequence of measurable mappings from the observations to the action set, (U_0, Z_1, U_1, . . . , Z_{t−1}, U_{t−1}) ↦ A_t. At each given time T ≥ 1, we measure the performance of a strategy through its expected regret:

R_T(ν) = Tµ⋆ − E[ Σ_{t=1}^T Z_t ] = Σ_{a=1}^K ∆_a E[N_a(T)] ,   (1)

where we used the tower rule for the second equality and defined N_a(T) as the number of times arm a was pulled between time rounds 1 and T. Doob's optional skipping (see Doob, 1953, Chapter III, Theorem 5.2, page 145 for the original reference; see also Chow and Teicher, 1988, Section 5.3 for a more recent reference) shows that we may assume that i.i.d. sequences of rewards (Y_{t,a})_{t≥1} are drawn beforehand, independently at random, for each arm a, and that the payoff obtained at round t ≥ 1 given the choice A_t equals Z_t = Y_{t,A_t}. We will use this second formulation in the rest of the paper, as it is the closest to the one of oblivious individual sequences described later in Section 4.1. We may then assume that the auxiliary randomizations U_0, U_1, . . . are i.i.d. random variables, independent from the (Y_{t,a})_{t≥1} and distributed according to a uniform distribution over [0, 1].

Model: bounded signed rewards with unknown range. For a given range [m, M], where m < M are two real numbers, not necessarily nonnegative, we denote by D_{m,M} the set of probability distributions supported on [m, M]. Then, the model corresponding to distributions with a bounded but unknown range is the union of all such D_{m,M}, over all pairs (m, M) ∈ R² with m < M.

µ′_a = µ_a + 2∆_a > µ⋆. Thus, a is the only optimal arm in ν′.
Finally, for ε < 2∆_a/(M − µ_a), the point µ_a + 2∆_a/ε is larger than M and thus lies outside of the bounded support of ν_a. In that case, the density of ν_a with respect to ν′_a is given by 1/(1−ε) on the support of ν_a and by 0 elsewhere, so that KL(ν_a, ν′_a) = ln(1/(1−ε)).

Step 2: Application of a fundamental inequality. We denote by kl(p, q) the Kullback-Leibler divergence between Bernoulli distributions with parameters p and q. We also index expectations (in the rest of this proof only) by the bandit problem they are relative to: for instance, E_ν denotes the expectation of a random variable when the ambient randomness is given by the bandit problem ν. The fundamental inequality for lower bounds on the regret of stochastic bandits (Garivier et al., 2019b, Section 2, Equation 6), which is based on the chain rule for Kullback-Leibler divergences and on a data-processing inequality for expectations of [0, 1]-valued random variables, reads:

kl( E_ν[N_a(T)]/T , E_{ν′}[N_a(T)]/T ) ≤ E_ν[N_a(T)] KL(ν_a, ν′_a) = E_ν[N_a(T)] ln(1/(1−ε)) .

Now, since u ∈ (0, 1) ↦ (1/u) ln(1/(1−u)) is increasing, we have ln(1/(1−ε)) ≤ (2 ln 2) ε for ε ≤ 1/2. For all (p, q) ∈ [0, 1]² and with the usual measure-theoretic conventions,

kl(p, q) = p ln p + (1−p) ln(1−p) + p ln(1/q) + (1−p) ln(1/(1−q)) ≥ (1−p) ln(1/(1−q)) − ln 2 ,

where we used that the sum of the first two terms is at least −ln 2 and that p ln(1/q) ≥ 0. Putting all inequalities together, we have proved

( 1 − E_ν[N_a(T)]/T ) ln( 1 / (1 − E_{ν′}[N_a(T)]/T) ) − ln 2 ≤ (2 ln 2) ε E_ν[N_a(T)] .   (4)

In this step, we only imposed the constraint ε ∈ [0, 1/2]. We recall that in the previous step, we imposed ε < 2∆_a/(M − µ_a). Both conditions are implied by ε ≤ ∆_a/(2(M − µ_a)), which we will assume in the sequel.

Step 3: Inequalities stemming from the definition of scale-free distribution-free regret bounds. We denote by [m, M] a range containing the supports of all distributions of ν. By definition of Φ_free, given that a is a suboptimal arm (i.e., ∆_a > 0):

∆_a E_ν[N_a(T)] ≤ R_T(ν) ≤ (M − m) Φ_free(T) .

We now prove a similar inequality for ν′, for which we recall that a is the unique optimal arm.
We denote by ∆′_k = µ′_a − µ_k the gap of an arm k ≠ a in ν′. By the definition of ν′_a, the distributions of ν′ have supports within the range [m, M_ε], where we defined the upper end M_ε = max{M, µ_a + 2∆_a/ε} = µ_a + 2∆_a/ε, given the condition imposed on ε. Therefore, by definition of Φ_free, and given that all gaps ∆′_k are larger than the gap ∆′_{a⋆} = µ′_a − µ⋆ = ∆_a between the unique optimal arm a of ν′ and the second best arm(s) of ν′ (which were the optimal arms of ν), we have

∆_a ( T − E_{ν′}[N_a(T)] ) = ∆_a Σ_{j≠a} E_{ν′}[N_j(T)] ≤ Σ_{j≠a} ∆′_j E_{ν′}[N_j(T)] = R_T(ν′) ≤ (M_ε − m) Φ_free(T) .

By rearranging the two inequalities above, we get

1 − (M − m) Φ_free(T)/(T ∆_a) ≤ 1 − E_ν[N_a(T)]/T   and   1 − E_{ν′}[N_a(T)]/T ≤ (M_ε − m) Φ_free(T)/(T ∆_a) ;

thus, after substitution into (4),

( 1 − (M − m) Φ_free(T)/(T ∆_a) ) ln( T ∆_a / ((M_ε − m) Φ_free(T)) ) − ln 2 ≤ (2 ln 2) ε E_ν[N_a(T)] .   (5)

Step 4: Final calculations. We take ε = ε_T = α^{−1} Φ_free(T)/T for some constant α > 0; we will pick α = 1/8. By the assumption Φ_free(T) = o(T), we have ε_T ≤ ∆_a/(2(M − µ_a)), as needed, for T large enough, as well as M_{ε_T} = µ_a + 2∆_a/ε_T = µ_a + 2α ∆_a T/Φ_free(T). Substituting these values into (5), a finite-time lower bound on the quantity of interest is finally given by

E_ν[N_a(T)] ≥ (T/Φ_free(T)) (α/(2 ln 2)) ( − ln 2 + ( 1 − (M − m) Φ_free(T)/(T ∆_a) ) ln( T ∆_a / (2α ∆_a T + (µ_a − m) Φ_free(T)) ) ) ,

where, as T → +∞, the subtracted ratio (M − m) Φ_free(T)/(T ∆_a) vanishes and the logarithm converges to ln(1/(2α)). It entails the asymptotic lower bound

lim inf_{T→+∞} E_ν[N_a(T)] / (T/Φ_free(T)) ≥ (α/(2 ln 2)) ( ln(1/(2α)) − ln 2 ) = (α/(2 ln 2)) ( ln(1/α) − 2 ln 2 ) = 1/16

for the choice α = 1/8. The claimed result follows by adding these lower bounds for each suboptimal arm a, with a factor ∆_a, following the rewriting (1) of the regret.

Remark 4. The proof above only exploits the fact that the upper end M of the range is unknown: the alternative problems lie in D_{m,M′} for some M′ that can be arbitrarily large. Yet, by definition of adaptation to the range, the strategy needs to guarantee (M′ − m) Φ_free(T) distribution-free regret bounds in that case.
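The elementary inequalities invoked in Steps 2 and 4 can be double-checked numerically; the short script below is our own sanity check, verifying them on a grid and evaluating the limiting constant of Step 4.

```python
import math

# (i) ln(1/(1 - eps)) <= (2 ln 2) * eps for eps in [0, 1/2] (Step 2).
for k in range(0, 51):
    eps = k / 100
    assert math.log(1 / (1 - eps)) <= 2 * math.log(2) * eps + 1e-12

def kl(p, q):
    """Kullback-Leibler divergence between Bernoulli(p) and Bernoulli(q)."""
    def term(x, y):
        return 0.0 if x == 0 else x * math.log(x / y)
    return term(p, q) + term(1 - p, 1 - q)

# (ii) kl(p, q) >= (1 - p) * ln(1/(1 - q)) - ln 2 (Step 2), on a grid.
for i in range(1, 100):
    for j in range(1, 100):
        p, q = i / 100, j / 100
        assert kl(p, q) >= (1 - p) * math.log(1 / (1 - q)) - math.log(2) - 1e-12

def limit_constant(alpha):
    """The asymptotic constant of Step 4: (alpha/(2 ln 2)) (ln(1/alpha) - 2 ln 2)."""
    return alpha / (2 * math.log(2)) * (math.log(1 / alpha) - 2 * math.log(2))

# (iii) At alpha = 1/8, the constant equals 1/16.
assert abs(limit_constant(1 / 8) - 1 / 16) < 1e-12
```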
We may note that, therefore, Theorem 3 also holds for the model of bounded distributions with a known lower end m ∈ R for the range: Definitions 1 and 2 handle the case of a fully unknown range (the union of all the D_{m,M}) but can be adapted in an obvious way to D_{m,+∞}, by fixing m, by having the strategy know m, and by requiring the bounds to hold for all M ∈ (m, +∞) and all bandit problems in D_{m,M}, thus leading to the concept of adaptation to the upper end of the range. This observation is in line with the folklore knowledge that there is a difference in nature between dealing with nonnegative payoffs, i.e., gains, and dealing with nonpositive payoffs, i.e., losses, for regret minimization under bandit monitoring; see Cesa-Bianchi and Lugosi (2006, Remark 6.5, page 164) for an early reference and Kwon and Perchet (2016) for a more complete literature review. Actually, 0 plays no special role; the issue is rather whether one end of the payoff range is known.

4. Adaptation to the Range Based on AdaHedge: The AHB Strategy

When the range of payoffs is known, Auer et al. (2002b) achieve a distribution-free regret bound of order √(KT ln K) with exponential weights (the Hedge strategy) on estimated payoffs and with extra-exploration, i.e., by mixing the exponential weights with the uniform distribution over arms. Actually, it is folklore knowledge that the extra-exploration used in this case is unnecessary (see, among others, Stoltz, 2005). To deal with the case of an unknown payoff range, we consider a self-tuned version of Hedge called AdaHedge (de Rooij et al., 2014; see also an earlier work by Cesa-Bianchi et al., 2007) and do add extra-exploration. Just as Auer et al. (2002b), we will actually obtain regret guarantees for oblivious adversarial bandits, not only distribution-free regret bounds for stochastic bandits. We therefore introduce now the setting of oblivious adversarial bandits and define adaptation to the range in that case.
4.1 Oblivious Adversarial Bandits

In the setting of fully oblivious adversarial bandits (see Cesa-Bianchi and Lugosi, 2006; Audibert and Bubeck, 2009), a range [m, M] is set by the environment, where m < M are real numbers, not necessarily nonnegative. The player is unaware of [m, M] and will remain so. The environment also picks beforehand a sequence y_1, y_2, . . . of reward vectors in [m, M]^K. We denote by y_t = (y_{t,a})_{a∈[K]} the components of these vectors. The player will observe one component of each of these reward vectors in a sequential fashion, as follows. Auxiliary randomizations U_0, U_1, . . ., i.i.d. according to a uniform distribution over [0, 1], are available. At each round t ≥ 1, the player picks an arm A_t ∈ [K], possibly at random (thanks to U_{t−1}) according to a probability distribution p_t = (p_{t,a})_{a∈[K]}, and then receives and observes y_{t,A_t}. More formally, a strategy of the player is a sequence of mappings from the observations to the action set, (U_0, y_{1,A_1}, U_1, . . . , y_{t−1,A_{t−1}}, U_{t−1}) ↦ A_t. The strategy relies neither on m nor on M. At each given time T ≥ 1, denoting by y_{1:T} = (y_1, . . . , y_T) the reward vectors, we measure the performance of a strategy through its expected regret:

R_T(y_{1:T}) = max_{a∈[K]} Σ_{t=1}^T y_{t,a} − E[ Σ_{t=1}^T y_{t,A_t} ] ,

where, as rewards are fixed beforehand, all randomness lies in the choice of the arms A_t only, i.e., the expectation is only over the choice of the arms A_t. The counterpart of Definition 1 in this setting is stated next.

Definition 5 (Scale-free adversarial regret bounds). A strategy for oblivious adversarial bandits is adaptive to the unknown range of payoffs, with a scale-free adversarial regret bound Φ_adv : N → [0, +∞), if for all real numbers m < M, the strategy ensures, without the knowledge of m and M:

for all y_1, y_2, . . . in [m, M]^K,   for all T ≥ 1,   R_T(y_{1:T}) ≤ (M − m) Φ_adv(T) .

Conversion of upper/lower bounds from one setting to the other.
We recall that, when applying Doob's optional skipping in Section 2, for each arm a, we denoted by (Y_{t,a})_{t≥1} an i.i.d. sequence of rewards drawn beforehand, independently at random, according to the distribution ν_a associated with that arm. By the tower rule for the right-most equality below, we note that for all m < M and for all ν in D_{m,M},

R_T(ν) = max_{a∈[K]} E[ Σ_{t=1}^T Y_{t,a} ] − E[ Σ_{t=1}^T Y_{t,A_t} ] ≤ E[ R_T(Y_{1:T}) ] ≤ sup_{y_{1:T} in [m,M]^K} R_T(y_{1:T}) .

In particular, lower bounds on the regret for stochastic bandits are also lower bounds on the regret for oblivious adversarial bandits, and strategies designed for oblivious adversarial bandits obtain the same distribution-free regret bounds for stochastic bandits when the individual payoffs y_{t,A_t} in their definition are replaced with the stochastic payoffs Y_{t,A_t}.

Algorithm 1 — AHB: AdaHedge for K-armed Bandits, with extra-exploration
1: Inputs: a payoff-estimation scheme, e.g., (8); a sequence (γ_t)_{t≥1} in [0, 1] of extra-exploration rates
2: for rounds t = 1, . . . , K do
3:   Draw arm A_t = t
4:   Get and observe the payoff y_{t,t}
5: end for
6: AdaHedge initialization: η_{K+1} = +∞ and q_{K+1} = (1/K, . . . , 1/K), denoted by 1/K
7: for rounds t = K + 1, . . . do
8:   Define p_t by mixing q_t with the uniform distribution: p_t = (1 − γ_t) q_t + γ_t 1/K
9:   Draw an arm A_t ∼ p_t, i.e., independently at random according to the distribution p_t
10:  Get and observe the payoff y_{t,A_t}
11:  Compute estimates ŷ_{t,a} of all payoffs with the payoff-estimation scheme, e.g., (8)
12:  Compute the mixability gap δ_t ≥ 0 based on the distribution q_t and on these estimates:

δ_t = − Σ_{a=1}^K q_{t,a} ŷ_{t,a} + (1/η_t) ln( Σ_{a=1}^K q_{t,a} e^{η_t ŷ_{t,a}} )   (when η_t < +∞)
or, when η_t = +∞, δ_t = − Σ_{a=1}^K q_{t,a} ŷ_{t,a} + max_{a∈[K]} ŷ_{t,a}

13:  Compute the learning rate η_{t+1} = ln(K) / Σ_{s=K+1}^t δ_s
14:  Define q_{t+1} component-wise as q_{t+1,a} = exp( η_{t+1} Σ_{s=K+1}^t ŷ_{s,a} ) / Σ_{k=1}^K exp( η_{t+1} Σ_{s=K+1}^t ŷ_{s,k} )
15: end for

4.2 The AHB Strategy

We state our main strategy, AHB (which stands for "AdaHedge for K-armed Bandits, with extra-exploration"), in the setting of oblivious adversarial bandits; see Algorithm 1. In a setting of stochastic bandits, it suffices to replace therein y_{t,A_t} with Y_{t,A_t}. The AHB strategy relies on a payoff-estimation scheme, which we discuss now. In Algorithm 1, some initial exploration lasting K rounds is used to get a rough idea of the location of the payoffs and to center the estimates at an appropriate location. Following Auer et al. (2002b), we consider, for all rounds t ≥ K + 1 and arms a ∈ [K],

ŷ_{t,a} = (y_{t,A_t} − C)/p_{t,a} 1{A_t = a} + C ,   where C = (1/K) Σ_{s=1}^K y_{s,s} .   (8)

Note that all p_{t,a} > 0 for Algorithm 1, due to the use of exponential weights. As proved by Auer et al. (2002b), the estimates ŷ_{t,a} are conditionally unbiased. Indeed, the distributions q_t and p_t, as well as the constant C, are measurable functions of the information H_{t−1} = (U_0, y_{1,A_1}, U_1, . . . , U_{t−2}, y_{t−1,A_{t−1}}) available at the beginning of round t ≥ K + 1, and the arm A_t is drawn independently at random according to p_t based on an auxiliary randomization denoted by U_{t−1}. Therefore, given that the payoffs are oblivious, the conditional expectation of ŷ_{t,a} with respect to H_{t−1} amounts to integrating over the randomness given by the random draw A_t ∼ p_t: for t ≥ K + 1,

E[ ŷ_{t,a} | H_{t−1} ] = (y_{t,a} − C)/p_{t,a} P(A_t = a | H_{t−1}) + C = (y_{t,a} − C)/p_{t,a} p_{t,a} + C = y_{t,a} .   (9)

These estimators are bounded: assuming that all the y_{t,a}, thus also C, belong to the range [m, M], and given that the distributions p_t were obtained by mixing with the uniform distribution, with weight γ_t, we have p_{t,a} ≥ γ_t/K, and therefore,

for all t ≥ K + 1 and a ∈ [K],   | ŷ_{t,a} − C | ≤ |y_{t,A_t} − C| / (γ_t/K) ≤ (M − m) K/γ_t .
(10)

Remark 6. Algorithm 1 is invariant under affine changes of the payoffs, i.e., translations by real numbers and/or multiplications by positive factors, given that AdaHedge (see de Rooij et al., 2014, Theorem 16) and the payoff-estimation scheme (8) are so. This is key for adaptation to the range. This invariance is achieved, when ignoring the range [m, M], thanks to any value C ∈ [m, M]. Here, we chose to have K rounds of exploration in Algorithm 1 and let C equal the average of the payoffs obtained. However, it would of course have been sufficient to pick one arm at random, observe a single reward y_{1,A_1}, and let C = y_{1,A_1}.

4.3 Regret Analysis, Part 1: Scale-Free Adversarial Regret Bound

Theorem 7. AdaHedge for K-armed bandits (Algorithm 1), run with a non-increasing extra-exploration sequence (γ_t)_{t≥1} smaller than 1/2 and the estimation scheme given by (8), ensures that for all bounded ranges [m, M], for all oblivious individual sequences y_1, y_2, . . . in [m, M]^K, for all T ≥ 1,

R_T(y_{1:T}) ≤ 3(M − m) √(KT ln K) + 5(M − m) K ln K + (M − m) Σ_{t=1}^T γ_t .

In particular, given a parameter α ∈ (0, 1), the extra-exploration

γ_t = min{ 1/2 , √(5(1−α) K ln K) / t^α }

leads to the scale-free adversarial regret bound

Φ_adv(T) = ( 3 + √(5/(1−α)) ) √(K ln K) T^{max{α, 1−α}} + 10 K ln K .   (11)

For α = 1/2, the bound reads R_T(y_{1:T}) ≤ 7(M − m) √(TK ln K) + 10(M − m) K ln K. This value α = 1/2 is the best one to consider if one is only interested in a distribution-free bound, i.e., if one is not interested in the distribution-dependent rates for the regret. The proof of Theorem 7 is detailed in Appendix B, but we sketch it below.

Remark 8. We strongly suspect that the ln K factor in the bound of Theorem 7 is superfluous. In the case of a known range, the MOSS algorithm is known to be minimax optimal, with a regret bound of order √(KT). One idea could thus be to use a MOSS-type index, together with a Bernstein-type upper confidence bound to account for the unknown variance and range.
A final ingredient would be to add initial extra-exploration, pulling every arm √(T/K) times before running the standard phase of the algorithm; on a technical level, this automatically makes the sub-Poissonian term in Bernstein's inequality tractable. We have not yet managed to fill in the technical details needed to prove this, although we believe a variant of these ideas would get rid of the logarithmic factor. In contrast, the algorithm discussed here, based on AdaHedge, enjoys a simple distribution-free analysis (as sketched below), as well as a distribution-dependent analysis (see Section 4.4), unlike an algorithm based on MOSS-type indices.

Another promising approach would be to use the Tsallis-INF algorithm, introduced by Audibert and Bubeck (2009) and further studied by Zimmert and Seldin (2019), which achieves an (M − m) √(KT) adversarial regret bound when M and m are known. Unfortunately, current analyses of the algorithm rely crucially on the non-positivity of the reward estimates or, equivalently, on the knowledge of an upper bound on the rewards. Zimmert and Lattimore (2019) relax this requirement, but not enough for the relaxed version to be applied to our case. However, when M is known and m is unknown, i.e., when only adaptation to m is needed, the reward estimates can be made non-positive by taking C = M in the estimation scheme (8), and our techniques may be extended to show that Tsallis-INF indeed enjoys an adversarial regret bound of order (M − m) √(KT) in this case. Details may be found in an extended version of this article [arXiv:2006.03378], Theorem 23 in Appendix F.
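For concreteness, Algorithm 1 may be rendered in Python as follows. This is our own simplified reading of the pseudocode, with hypothetical names: a sketch under the stated conventions (0-indexed arms, a `reward(t, a)` oracle for the oblivious payoffs), not a reference implementation.

```python
import math
import random

def ahb(reward, K, T, gamma, seed=0):
    """AdaHedge for K-armed bandits with extra-exploration (a sketch of
    Algorithm 1), run on the importance-weighted, centered estimates of
    scheme (8). `reward(t, a)` returns y_{t,a}; `gamma(t)` <= 1/2 is the
    extra-exploration rate. Returns the number of pulls of each arm."""
    rng = random.Random(seed)
    counts = [0] * K
    # Rounds 1..K: pull each arm once; C is the average of these payoffs.
    first = []
    for t in range(K):
        first.append(reward(t, t))
        counts[t] += 1
    C = sum(first) / K
    S = [0.0] * K   # cumulative payoff estimates, from round K+1 on
    Delta = 0.0     # cumulative mixability gap, drives eta = ln K / Delta
    for t in range(K, T):
        if Delta == 0:            # eta = +infinity: uniform start
            eta = math.inf
            q = [1.0 / K] * K
        else:
            eta = math.log(K) / Delta
            best = max(S)
            w = [math.exp(eta * (s - best)) for s in S]  # stabilized weights
            total = sum(w)
            q = [x / total for x in w]
        g = gamma(t)
        p = [(1 - g) * qa + g / K for qa in q]   # extra-exploration mix
        a = rng.choices(range(K), weights=p)[0]
        counts[a] += 1
        y = reward(t, a)
        # Estimation scheme (8): unbiased, centered at C.
        yhat = [C + ((y - C) / p[b] if b == a else 0.0) for b in range(K)]
        mean = sum(qa * yh for qa, yh in zip(q, yhat))
        top = max(yhat)
        if math.isinf(eta):
            delta = top - mean
        else:
            mix = top + math.log(sum(qa * math.exp(eta * (yh - top))
                                     for qa, yh in zip(q, yhat))) / eta
            delta = mix - mean
        Delta += max(delta, 0.0)
        S = [s + yh for s, yh in zip(S, yhat)]
    return counts
```

On a toy two-armed problem where arm 0 always pays 1 and arm 1 always pays 0, the sketch quickly concentrates its pulls on arm 0 while the extra-exploration keeps a floor of order γ_t/K on the other arm.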
Proof sketch. A direct application of the AdaHedge regret bound (Lemma 3 and Theorem 6 of de Rooij et al., 2014), bounding the variance terms of the form E[(X − E[X])²] by E[(X − C)²], ensures that

max_{k∈[K]} Σ_{t=K+1}^T ŷ_{t,k} − Σ_{t=K+1}^T Σ_{a∈[K]} q_{t,a} ŷ_{t,a} ≤ 2 √( ( Σ_{t=K+1}^T Σ_{a∈[K]} q_{t,a} (ŷ_{t,a} − C)² ) ln K ) + (M − m) × (lower-order terms) .

We take expectations, use the definition of the p_t in terms of the q_t in the left-hand side, and apply Jensen's inequality in the right-hand side; noting that Σ_a p_{t,a} ŷ_{t,a} = y_{t,A_t} and that the conditional expectation of Σ_a (1/K − q_{t,a}) ŷ_{t,a} lies in [−(M − m), M − m], we get

E[ max_{k∈[K]} Σ_{t=K+1}^T ŷ_{t,k} ] − Σ_{t=K+1}^T E[ y_{t,A_t} ] ≤ 2 √( Σ_{t=K+1}^T Σ_{a=1}^K E[ q_{t,a} (ŷ_{t,a} − C)² ] ln K ) + (M − m) × (lower-order terms) .

Since p_{t,a} ≥ (1 − γ_t) q_{t,a} with γ_t ≤ 1/2 by assumption on the extra-exploration rate, we have the bound q_{t,a} ≤ 2 p_{t,a}. Together with standard calculations similar to (9), we get

E[ q_{t,a} (ŷ_{t,a} − C)² | H_{t−1} ] ≤ 2 E[ p_{t,a} (ŷ_{t,a} − C)² | H_{t−1} ] = 2 E[ (y_{t,A_t} − C)²/p_{t,a} 1{A_t = a} | H_{t−1} ] = 2 (y_{t,a} − C)² ≤ 2 (M − m)² .

The proof of the first regret bound of the theorem is concluded by collecting all the bounds and by taking care of the first K rounds. The second regret bound then follows from straightforward calculations.

4.4 Regret Analysis, Part 2: Distribution-Dependent Rates for Adaptation

Given the conversion explained in Section 4.1, Algorithm 1 tuned as in Theorem 7 with α ∈ [1/2, 1) also enjoys the scale-free distribution-free regret bound Φ^AHB_free(T) = Φ^AHB_adv(T), of order T^α. The theorem below entails that AHB is adaptive to the unknown range with a distribution-dependent regret rate T/Φ^AHB_free(T) of order T^{1−α}, which is optimal given the lower bound stated by Theorem 3.

Theorem 9. Consider AHB (Algorithm 1) tuned with some α ∈ [1/2, 1) as in the second part of Theorem 7. For all distributions ν_1, . . . , ν_K with bounded supports,

R_T(ν) ≤ ( T / Φ^AHB_free(T) ) 12 ln K Σ_{a=1}^K ∆_a .   (12)

The proof is provided in Appendix C. It follows quite closely that of Theorem 3 in Seldin and Lugosi (2017), where the authors study a variant of the Exp3 algorithm of Auer et al. (2002b) for stochastic rewards.
It consists, in our setting, in showing that the number of times the algorithm chooses suboptimal arms is almost entirely determined by the extra-exploration. Our proof is simpler, as we aim for cruder bounds. The main technical difference, and the main issue to solve, lies in controlling the learning rates η_t, which heavily depend on the data in our case.

5. Numerical Illustrations

We provide some numerical experiments on synthetic data to illustrate the qualitative behavior of some popular algorithms, like UCB strategies, when they are incorrectly tuned, as opposed to strategies that are less sensitive to ignoring the range, or to the AHB strategy, which adapts to it. These experiments are only of an illustrative nature.

Bandit problems considered and UCB strategies. We consider stochastic bandit problems ν^(α) = (ν^(α)_a)_{a∈[K]} indexed by a scale parameter α ∈ {0.01, 1, 100}. We take K = 10 arms, each arm a being associated with a rectified Gaussian distribution. Precisely, the distribution ν^(α)_a is the distribution of the variable

α max{0, min{Y, 1.2}} with Y ∼ N(0.6, V) if a = 1,   and   α max{0, min{Y, 1}} with Y ∼ N(0.5, V) if a ≠ 1,

so that all distributions are commonly supported on [m, M] = [0, 1.2α], with arm 1 being the unique optimal arm. We will consider two values for V, namely V = 0.01 (low-variance case) and V = 0.25 (high-variance case). See Figure 1 for a plot of the corresponding probability density functions.

Figure 1: Probability density functions of the reward distributions with respect to the sum of the Lebesgue measure and of Dirac masses at 0, 1, and 1.2. Left pictures: high-variance case; right pictures: low-variance case. Top pictures: first arm (optimal arm); bottom pictures: other arms. Arrows represent atoms and their lengths are only illustrative.

We denote by µ^(α)_1 = 0.6α and µ^(α)_a = 0.5α (for a ≠ 1) the means associated with the distributions ν^(α)_1 and ν^(α)_a, respectively.
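Sampling from these rectified Gaussian distributions can be sketched as follows (a minimal helper with a hypothetical name, following the definition above; V is the variance of the underlying Gaussian, so its square root is passed as the standard deviation):

```python
import random

def sample_reward(alpha, a, V, rng):
    """Draw a reward from nu^(alpha)_a: a Gaussian rectified (clipped) to
    [0, 1.2] for the optimal arm a = 1 (mean 0.6) or to [0, 1] for the other
    arms (mean 0.5), then multiplied by the scale alpha."""
    if a == 1:
        return alpha * max(0.0, min(rng.gauss(0.6, V ** 0.5), 1.2))
    return alpha * max(0.0, min(rng.gauss(0.5, V ** 0.5), 1.0))
```

Note that the clipping intervals are symmetric around the Gaussian means, which is why the rectification leaves the means 0.6α and 0.5α unchanged.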
The gaps therefore equal ∆^(α)_a = 0.1α for a ≥ 2. The main algorithm of interest is, of course, the AHB strategy with extra-exploration (Algorithm 1), which we tune as stated in Theorem 7 with parameter 1/2. We now present the competitors.

UCB strategies at different scales. We consider instances of UCB (Auer et al., 2002a) using indices of the form

µ̂_a(t) + √( 2σ² ln t / N_a(t) ) ,

where N_a(t) is the number of times arm a was pulled up to round t, and where µ̂_a(t) denotes the empirical average of the payoffs obtained for arm a. We hesitated between setting σ² based on the range M − m = 1.2α, namely, σ² = (M − m)²/4 = (1.2α)²/4, or based on a sub-Gaussian parameter, which would be smaller. As the distributions ν^(α)_a are rectified Gaussians, it is not immediately clear whether they are sub-Gaussian with parameter V, but we nevertheless considered the choice σ² = V. It turns out that this second choice outperformed the first one, which is why, in the rest of the study, we consider the following three instances of UCB:

µ̂_a(t) + s √( 2V ln t / N_a(t) ) ,   where s ∈ {0.01, 1, 100} .

When the scale parameter α is known, one would take s = α.

Range-estimating UCB. We also study a version of UCB estimating the range, namely, using indices

µ̂_a(t) + r̂_t √( 2 ln T / N_a(t) ) ,   where r̂_t = max_{s≤t} Y_{s,A_s} − min_{s≤t} Y_{s,A_s}

estimates the range M − m. We were unable to provide theoretical guarantees matching our lower bounds, and this algorithm does not perform particularly well in practice, as we will discuss below.

ε-greedy. Finally, we also consider the ε-greedy strategy, which, at each round t ≥ K + 1, picks with probability 1 − ε_t the arm with the best empirical mean and otherwise selects an arm uniformly at random. Following Auer et al. (2002a), we used the tuning

ε_t = min{ 1, 5K/(d² t) }   with d = 1/12 .

Indeed, Auer et al. (2002a) exhibit theoretical guarantees for distributions over [0, 1] in the case where d is smaller than or equal to the smallest gap.
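The indices and the exploration schedule above can be sketched as follows (our own reading of the reconstructed formulas, with hypothetical names; the `T` argument of the range-estimating index reflects the horizon-dependent logarithm in its display):

```python
import math

def ucb_index(mu_hat, n, t, s, V):
    """UCB index at scale s: mu_hat + s * sqrt(2 V ln t / N_a(t))."""
    return mu_hat + s * math.sqrt(2 * V * math.log(t) / n)

def range_ucb_index(mu_hat, n, T, rewards_seen):
    """Range-estimating UCB: the empirical range of all rewards observed so
    far replaces the known scale in the exploration bonus."""
    r_hat = max(rewards_seen) - min(rewards_seen)
    return mu_hat + r_hat * math.sqrt(2 * math.log(T) / n)

def eps_greedy_rate(t, K, d=1 / 12):
    """Exploration schedule eps_t = min(1, 5K / (d^2 t)) of Auer et al. (2002a)."""
    return min(1.0, 5 * K / (d ** 2 * t))
```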
When rescaled to [0, 1], the smallest gap equals 0.1α/(M − m) = 0.1α/(1.2α) = 1/12 in our setting; this explains our choice d = 1/12, but note that the ε-greedy strategy defined above relies on some extra knowledge, encompassed in the choice d = 1/12, compared to the completely agnostic AHB strategy. Interestingly, for any fixed-in-advance sequence of ε_t, the ε-greedy strategy is scale-free. Of course, its strong downside is that a proper tuning of the ε_t requires the knowledge of a scaled lower bound on the gaps.

Experimental setting. Each algorithm is run N = 100 times, on a time horizon T = 100,000. We plot estimates of the rescaled regret R_T(ν^(α))/α to have a meaningful comparison between the bandit problems. These estimates are constructed as follows. We index the arms picked in the n-th run by an additional subscript n, so that A_{t,n} refers to the arm picked by a given strategy at time t in the n-th run. The expected regret of a given strategy can be rewritten as

    R_T(ν^(α)) = T max_{a∈[K]} µ_a^(α) − E[ Σ_{t=1}^T µ_{A_t}^(α) ] = T (0.6α) − E[ Σ_{t=1}^T µ_{A_t}^(α) ]

and is estimated by

    R̂_T(α) = (1/N) Σ_{n=1}^N R̂_T(α, n),  where R̂_T(α, n) = T (0.6α) − Σ_{t=1}^T µ_{A_{t,n}}^(α).

On Figures 2 and 3 we plot the estimates R̂_T(α)/α of the rescaled regret as solid lines. The shaded areas correspond to ±2 standard errors of the sequences R̂_T(α, n)/α.

Discussion of the results. A first observation is that, as expected, the performance of AHB, the range-estimating UCB, and ε-greedy is unaffected by the scale of the problems (see the second lines of Figures 2 and 3). It turns out that, out of these three algorithms, AHB performs best. A second observation is that the performance of UCB depends dramatically on the value of the parameter s.
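The dependence on s can be read directly off the exploration bonus; a minimal sketch of the index computation (assuming indices of the form µ̂_a(t) + s √(2V ln t / N_a(t)); the function name is ours):

```python
import math

def scaled_ucb_index(mean_hat, n_pulls, t, s, V=0.25):
    """Scaled UCB index: empirical mean plus an exploration bonus
    multiplied by the scale parameter s of the study."""
    return mean_hat + s * math.sqrt(2.0 * V * math.log(t) / n_pulls)

# After a single pull at round t = 1000, the bonus is about 1.86 * s:
bonus_small = scaled_ucb_index(0.0, 1, 1000, s=0.01)   # negligible bonus
bonus_large = scaled_ucb_index(0.0, 1, 1000, s=100.0)  # overwhelming bonus
```

With s far too small, the bonus is negligible and the empirical means dominate; with s far too large, the bonus swamps the empirical means and play is close to uniform.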
UCB performs like follow-the-leader when s is too small, and like random play when s is too large; both of these strategies suffer a linear regret, and an incorrectly scaled UCB does so as well (see the first lines of Figures 2 and 3). It remains to compare AHB to UCB tuned with the correct scale: the ranking between the two depends on the value of V, with AHB outperforming UCB tuned with the correct scale in the high-variance case, and vice versa in the low-variance case. Our last observation is that, in the low-variance case, the range-estimating version of UCB is far off from UCB tuned with the correct scale. This is because of the large difference between the sub-Gaussian parameter and its upper bound given by the squared half-range, which the range-estimating version of UCB is targeting.

Figure 2: Comparison of the (estimated) regrets of various strategies over the bandit problems ν^(α) in the high-variance case, where α ranges in {0.01, 1, 100} and V = 0.25. Each algorithm was run N = 100 times on every problem, for T = 100,000 time steps. Solid lines report the values of the estimated regrets, while shaded areas correspond to ±2 standard errors of the estimates.

Figure 3: Same legend as for Figure 2, but in the low-variance case.

Acknowledgments

Hédi Hadiji and Gilles Stoltz have no direct funding to acknowledge other than the salaries paid by their employers (Université Paris-Saclay, CNRS, and HEC Paris). They have no competing interests to declare.

Appendix A.
More on Scale-Free Distribution-Dependent Regret Bounds Considered in Isolation

This section details the claims of Section 2.2: no strategy may be adaptive to the range and achieve Φ_dep = ln T (Section A.2), but we may construct a strategy adaptive to the range achieving a rate Φ_dep arbitrarily close to ln T (Section A.3). Before we do so, we provide a reminder on a general, and optimal, distribution-dependent regret lower bound for K-armed stochastic bandits (Section A.1).

A.1 Reminder of a General Regret Lower Bound for K-Armed Bandits

This section considers some general model D. It also rules out poor strategies by restricting its attention to so-called consistent strategies, according to the terminology introduced by Lai and Robbins (1985), while Burnetas and Katehakis (1996) rather speak of uniformly fast convergent strategies.

Definition 10 A strategy is consistent on a model D if, for all bandit problems ν in D, it achieves a subpolynomial regret bound, that is, R_T(ν)/T^α → 0 for all α ∈ (0, 1].

A lower bound on the distribution-dependent rates that such a strategy may achieve is provided by a general, and optimal, result of Lai and Robbins (1985) and Burnetas and Katehakis (1996); see also its rederivation by Garivier et al. (2019b). It involves a quantity defined as an infimum of Kullback-Leibler divergences: we recall that for two probability distributions ν, ν′ defined on the same probability space (Ω, F),

    KL(ν, ν′) = ∫ ln( dν/dν′ ) dν  if ν ≪ ν′,  and  +∞  otherwise,

where ν ≪ ν′ means that ν is absolutely continuous with respect to ν′, and dν/dν′ then denotes the Radon-Nikodym derivative. Now, for any probability distribution ν, any real number x, and any model D, we define

    K_inf(ν, x, D) = inf{ KL(ν, ν′) : ν′ ∈ D and E(ν′) > x },

where, by convention, the infimum of an empty set equals +∞, and where we denoted by E(ν′) the expectation of ν′. The quantity K_inf(ν, x, D) can be null. With the usual measure-theoretic conventions, in particular 0/0 = 0, we then have the following lower bound.
Reminder 1 For all models D, for all consistent strategies on D, for all bandit problems ν in D,

    lim inf_{T→+∞} R_T(ν)/ln T ≥ Σ_{a : Δ_a > 0} Δ_a / K_inf(ν_a, µ*, D).

The case of a known payoff range [m, M]. When the payoff range [m, M] is known, i.e., when the model is D_{m,M}, there exist strategies achieving the lower bound of Reminder 1, like the DMED strategy of Honda and Takemura (2011, 2015) or the KL-UCB strategy of Cappé et al. (2013) and Garivier et al. (2019a).

The case of a known payoff upper bound M. The DMED strategy of Honda and Takemura (2015) actually achieves the lower bound of Reminder 1 not only for the model D_{m,M}, but also for the model D_{−∞,M} of all distributions upper bounded by M but not necessarily lower bounded. This suggests that adaptation to M is much more difficult than adaptation to m as far as distribution-dependent regret bounds are considered, and is in line with Remark 4. That M is more important than m for distribution-dependent bounds is also reflected in the lower bound of Reminder 1: this lower bound does not depend on whether D equals some D_{m,M} or the larger model D_{−∞,M}. We may indeed easily show (see the extended version of this article [arXiv:2006.03378], Appendix E) that, given M ∈ R, for all m ≤ M, for all ν ∈ D_{m,M} and all µ > E(ν),

    K_inf(ν, µ, D_{m,M}) = K_inf(ν, µ, D_{−∞,M}).

A.2 Adaptation to the Range Impossible at a Logarithmic Distribution-Dependent Rate

A strategy that would be adaptive to the range with a distribution-dependent rate Φ_dep = ln would, by definition and in particular, be consistent on D_{−∞,+∞}. The following theorem therefore shows, by contradiction, that no strategy may be adaptive to the range with a distribution-dependent rate Φ_dep = ln. A similar phenomenon was discussed by Lattimore (2017) in the case of stochastic bandits with Gaussian distributions.

Theorem 11 For all distributions ν_a ∈ D_{−∞,+∞} with expectation µ_a, and all µ* > µ_a,

    K_inf(ν_a, µ*, D_{−∞,+∞}) = 0.
As a consequence, all consistent strategies on D_{−∞,+∞} are such that, for all bandit problems ν in D_{−∞,+∞} with at least one suboptimal arm a,

    lim inf_{T→+∞} R_T(ν)/ln T = +∞.

Interestingly, Cowan and Katehakis (2015) observe that, for the model of uniform distributions over bounded intervals, the K_inf quantity is positive, and thus the lower bound of Reminder 1 does not prevent logarithmic regret bounds. In fact, they also provide an algorithm enjoying optimal distribution-dependent bounds, thus being, in a sense, adaptive to the range in that very restricted model.

Proof We denote by [m, M] an interval containing the support of ν_a. We remind the reader of the model D_{m,+∞} defined in (6), composed of all bounded distributions with unknown upper end of the range but known lower end m of the range. As D_{m,+∞} ⊆ D_{−∞,+∞} and by definition of K_inf,

    K_inf(ν_a, µ*, D_{−∞,+∞}) ≤ K_inf(ν_a, µ*, D_{m,+∞}),

so that it suffices to show that K_inf(ν_a, µ*, D_{m,+∞}) = 0. We have in particular µ_a ≥ m. We use the same construction as in the proof of Theorem 3. Let ν′_ε = (1 − ε) ν_a + ε δ_{µ_a + 2Δ_a/ε} for ε ∈ (0, 1): it is a bounded probability distribution, with lower end of support at least m, that is, ν′_ε ∈ D_{m,+∞}. For ε small enough, µ_a + 2Δ_a/ε lies outside of the bounded support of ν_a. In that case, the density of ν_a with respect to ν′_ε is given by 1/(1 − ε) on the support of ν_a and 0 elsewhere, so that

    KL(ν_a, ν′_ε) = ln( 1/(1 − ε) ).

Moreover,

    E(ν′_ε) = (1 − ε) µ_a + ε (µ_a + 2Δ_a/ε) = µ_a + 2Δ_a = µ* + Δ_a > µ*.

Therefore, by definition of K_inf as an infimum,

    K_inf(ν_a, µ*, D_{m,+∞}) ≤ KL(ν_a, ν′_ε) = ln( 1/(1 − ε) ).

This upper bound holds for all ε > 0 small enough and thus shows K_inf(ν_a, µ*, D_{m,+∞}) = 0. The second part of the theorem follows from Reminder 1, from the existence of an arm a with Δ_a = µ* − µ_a > 0, and from the fact that K_inf(ν_a, µ*, D_{−∞,+∞}) = 0, as we established above.

Remark 12 Recall that Remark 4 defined a notion of adaptation to the upper end M of the payoff range.
The proof above reveals that Theorem 11 holds with all occurrences of D_{−∞,+∞} replaced by D_{m,+∞}, for some m ∈ R. We may therefore similarly exclude a ln T distribution-dependent rate for adaptation to the upper end M of the payoff range. This observation is yet another example that the knowledge of the lower end m of the payoff range does not critically change the picture, and that the difficulty in ignoring a payoff range lies in ignoring the upper end thereof.

A.3 UCB with an Increased Exploration Rate Adapts to the Range

The impossibility result implied by Theorem 11 does not prevent distribution-dependent rates for adaptation that are larger than a logarithm. Let ϕ be a non-decreasing function such that ϕ(t) ≫ ln t, like ϕ(t) = (ln t)² or even ϕ(t) = (ln t) ln ln t. Lattimore (2017, Remark 8) introduced and studied, in the case of Gaussian bandits with unknown variances, the following variant of UCB, which we refer to in this section as UCB with an increased exploration rate ϕ: it relies on the indices

    µ̂_a(t) + √( ϕ(t)/N_a(t) ),  where ϕ(t)/ln t → +∞ as t → +∞,

and where µ̂_a(t) denotes the empirical average of payoffs obtained till round t when playing arm a. The (asymptotic only) analysis of Lattimore (2017, Remark 8) relies on the fact that ϕ(t) ≥ 2(M − m)² ln t for t larger than some unknown threshold T_0, and that after T_0, the indices are thus larger than the ones of the original version of UCB based on the knowledge of m and M. This argument readily extends to the case of sub-Gaussian distributions, where we recall that a distribution ν with expectation µ is v sub-Gaussian, with v > 0, if

    ∀t ∈ R,  ∫ e^{t(x − µ)} dν(x) ≤ e^{v t²/2}.

Hoeffding's lemma proves that distributions over a bounded range [m, M] are (M − m)²/4 sub-Gaussian. Based on a slightly different proof than the one of Lattimore (2017, Remark 8), one can prove the following finite-time result, where we did not aim for tight numerical constants.
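Before stating the finite-time result, the strategy just described can be sketched as follows (an illustrative sketch with the choice ϕ(t) = (ln t)²; the function names are ours):

```python
import math

def phi(t):
    """Increased exploration rate: phi(t) / ln(t) -> +infinity."""
    return math.log(t) ** 2

def index(mean_hat, n_pulls, t):
    """Index mu_hat_a(t) + sqrt(phi(t) / N_a(t)): it uses neither the
    range [m, M] nor a sub-Gaussian parameter v."""
    return mean_hat + math.sqrt(phi(t) / n_pulls)

# phi(t) eventually dominates c * ln(t) for any fixed c, so past some
# (unknown) threshold the bonus exceeds that of any correctly tuned UCB.
# For c = 10 the threshold is about e^10, i.e., roughly 22026:
t0 = next(t for t in range(2, 100_000) if phi(t) >= 10 * math.log(t))
```

This illustrates the trade-off discussed below: the threshold grows exponentially with the constant to be dominated, which is where the exponential dependence on v (or on the range) comes from.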
Theorem 13 UCB with an increased exploration rate given by a non-decreasing function ϕ ensures that, for all v > 0, for all distributions ν_1, . . . , ν_K that are v sub-Gaussian, for all T ≥ K + 1,

    R_T(ν) ≤ Σ_{a : Δ_a > 0} 4ϕ(T)/Δ_a   [main term]
             + Σ_{a∈[K]} Δ_a ( 1 + 2 max{ 32v/Δ_a², 1 } ( 1 + Σ_{t=K}^{T−1} e^{−ϕ(t)/(2v)} ) )   [smaller-order term: typically, a O(1)].

Whenever ϕ ≫ ln, this strategy is therefore adaptive to the unknown range of payoffs, with a distribution-dependent rate Φ_dep = ϕ. The second part of the statement follows from the claimed bound, given that ϕ ≫ ln entails ϕ(t) ≥ 4v ln t for t large enough, and therefore e^{−ϕ(t)/(2v)} ≤ 1/t². As a consequence, the sum tagged as a smaller-order term in the bound is finite. Possible such choices are ϕ : t ↦ (ln t)², or even ϕ : t ↦ (ln t)(ln ln t). However, as already mentioned in Lattimore (2017), as the distribution-dependent rate approaches ln t, the smaller-order term blows up. For example, if ϕ(t) = (ln t)², the summands e^{−(ln t)²/(2v)} in the smaller-order term are larger than e^{−1} for all t ≤ e^{√(2v)}: the smaller-order term is at least of the order of e^{√(2v)}, and the regret thus carries an exponential dependence on v. In the case of a bounded range, this means an exponential dependence on the range M − m. This is probably not an artifact of the proof: in the case of a bounded range, as long as ϕ(t) ≪ (M − m)² ln t, the lack of exploration bonus entails that the strategy behaves similarly to a follow-the-leader strategy, which is known to suffer catastrophic, i.e., linear, regret.

Proof As indicated above, we did not aim for tight numerical constants here, and we somewhat simplified the standard analysis of UCB by not considering thresholds of the form µ* − ε but rather µ* − Δ_a/4. Hence the non-standard (much increased) numerical factor in front of Σ_a ln T/Δ_a when we specify ϕ : t ↦ 2(M − m)² ln t in the bound. In this proof, we repeatedly use that i.i.d. random variables X_1, . . .
, X_n with a v sub-Gaussian distribution with expectation µ satisfy, by the Cramér-Chernoff inequality, that for all ε > 0,

    P( (1/n) Σ_{i=1}^n X_i ≥ µ + ε ) ≤ inf_{λ>0} e^{−nλε} E[ e^{λ(X_1 − µ)} ]^n ≤ inf_{λ>0} e^{−nλε} ( e^{vλ²/2} )^n = e^{−nε²/(2v)};

and we obtain a similar inequality for deviations of the form { ≤ µ − ε }. Let a* and a be an optimal and a suboptimal arm, respectively. Each arm is pulled once in the first K rounds. We bound E[N_a(T)] by using that, for t ≥ K, the arm A_{t+1} = a is pulled only if its index has the highest value; we then introduce the threshold µ* − Δ_a/4 to separate the indices U_a(t) and U_{a*}(t):

    E[N_a(T)] ≤ 1 + Σ_{t=K}^{T−1} P( U_a(t) ≥ U_{a*}(t) and A_{t+1} = a )
    ≤ 1 + Σ_{t=K}^{T−1} P( U_a(t) ≥ µ* − Δ_a/4 and A_{t+1} = a ) + Σ_{t=K}^{T−1} P( U_{a*}(t) ≤ µ* − Δ_a/4 )
    = 1 + Σ_{t=K}^{T−1} P( µ̂_a(t) + √(ϕ(t)/N_a(t)) ≥ µ* − Δ_a/4 and A_{t+1} = a ) + Σ_{t=K}^{T−1} P( µ̂_{a*}(t) ≤ µ* − Δ_a/4 − √(ϕ(t)/N_{a*}(t)) )
    ≤ 1 + Σ_{t=K}^{T−1} P( µ̂_a(t) + √(ϕ(T)/N_a(t)) ≥ µ* − Δ_a/4 and A_{t+1} = a ) + Σ_{t=K}^{T−1} P( µ̂_{a*}(t) ≤ µ* − Δ_a/4 − √(ϕ(t)/N_{a*}(t)) )
    ≤ 1 + Σ_{n=1}^{T} P( µ̂_{a,n} + √(ϕ(T)/n) ≥ µ* − Δ_a/4 )   [=: Sum(a)]
        + Σ_{t=K}^{T−1} Σ_{n=1}^{t} P( µ̂_{a*,n} ≤ µ* − Δ_a/4 − √(ϕ(t)/n) )   [=: Sum(a*)].

Note that we used the fact that ϕ is non-decreasing to get the last-but-one inequality, and we used optional skipping for the last one; we denote by µ̂_{a,n} and µ̂_{a*,n} the averages of n i.i.d. rewards distributed according to ν_a and ν_{a*}, respectively.

We first deal with Sum(a). Let N_0 = ⌈4ϕ(T)/Δ_a²⌉. For n ≥ N_0, we have √(ϕ(T)/n) ≤ Δ_a/2, so that

    P( µ̂_{a,n} + √(ϕ(T)/n) ≥ µ* − Δ_a/4 ) ≤ P( µ̂_{a,n} ≥ µ_a + Δ_a/4 ) ≤ e^{−nΔ_a²/(32v)}.

Therefore,

    Sum(a) ≤ N_0 − 1 + Σ_{n≥N_0} e^{−nΔ_a²/(32v)} ≤ 4ϕ(T)/Δ_a² + 1/(1 − e^{−Δ_a²/(32v)}) ≤ 4ϕ(T)/Δ_a² + 2 max{ 32v/Δ_a², 1 },

where we used,¹ in the last step, 1/(1 − e^{−x}) ≤ 2/x for x ∈ (0, 1] and 1/(1 − e^{−x}) ≤ 2 for x ≥ 1. For Sum(a*), we apply the Cramér-Chernoff inequality, then use (x + y)² ≥ x² + y² for x, y ≥ 0, and finally apply the same inequalities on 1/(1 − e^{−x}) as for the other sum:

    Sum(a*) ≤ Σ_{t=K}^{T−1} Σ_{n=1}^{t} e^{−n( Δ_a/4 + √(ϕ(t)/n) )²/(2v)} ≤ Σ_{t=K}^{T−1} e^{−ϕ(t)/(2v)} Σ_{n≥1} e^{−nΔ_a²/(32v)} ≤ 2 max{ 32v/Δ_a², 1 } Σ_{t=K}^{T−1} e^{−ϕ(t)/(2v)}.

The proof is concluded by substituting these bounds into R_T(ν) = Σ_{a∈[K]} Δ_a E[N_a(T)].

Appendix B. Proof of Theorem 7

How the second regret bound follows from the first one. We substitute the stated values of the γ_t.
We have, first,

    Σ_{t=K+1}^T γ_t ≤ √( 5(1−α) K ln K ) Σ_{t=K+1}^T t^{−α} ≤ √( 5(1−α) K ln K ) ∫_0^T x^{−α} dx = √( 5 K ln K/(1−α) ) T^{1−α},   (13)

and, second, using the definition of γ_T as a minimum,

    K ln K / γ_T ≤ K ln K max{ 2, T^α/√( 5(1−α) K ln K ) } ≤ 2 K ln K + √( K ln K/(5(1−α)) ) T^α.

Since √T ≤ T^{max{α,1−α}} and T^{1−α} + T^α ≤ 2 T^{max{α,1−α}}, the first regret bound of Theorem 7 is further bounded by

    ( 3 + 2√( 5/(1−α) ) ) (M−m) √(K ln K) T^{max{α,1−α}} + 10 (M−m) K ln K.

1. For a bounded distribution, the case x > 1 does not occur, as x = Δ_a²/(32v) = Δ_a²/(8(M−m)²) ≤ 1/8; but it may occur for other sub-Gaussian distributions.

The claimed expression for Φ_adv(T) follows by bounding the constants.

First regret bound. In Algorithm 1, for time steps t ≥ K + 1, the weights q_t are obtained by using the AdaHedge algorithm of De Rooij et al. (2014) on the payoff estimates ŷ_{t,a}. AdaHedge is designed for the case of a full monitoring (not of a bandit monitoring), but the use of these estimates emulates a full monitoring. Section 2.2 of De Rooij et al. (2014) (see also an earlier analysis by Cesa-Bianchi et al., 2007) ensures the bound stated next in Reminder 2. We call pre-regret the quantity at hand in Reminder 2: it corresponds to some regret defined in terms of the payoff estimates.

Reminder 2 (Application of Lemma 3 and Theorem 6 of De Rooij et al., 2014) For all sequences of payoff estimates ŷ_{t,a} lying in some bounded real-valued interval, denoted by [b, B], for all T ≥ K + 1, the pre-regret of AdaHedge satisfies

    max_{k∈[K]} Σ_{t=K+1}^T ( ŷ_{t,k} − Σ_{a=1}^K q_{t,a} ŷ_{t,a} ) ≤ 2 √( ln K Σ_{t=K+1}^T Σ_{a=1}^K q_{t,a} ( ŷ_{t,a} − c )² ) + (B − b)(1 + 2 ln K)  for any c ∈ R,

and AdaHedge does not require the knowledge of [b, B] to achieve this bound. The bound of Reminder 2 will prove itself particularly handy for three reasons: first, it is valid for real-valued payoffs; second, it is adaptive to the range of payoffs; third, the right-hand side looks, at first sight, not intrinsic enough a bound, as it also depends on the weights q_t, but we will see later that this dependency is particularly useful in our specific case.
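The mechanism behind this adaptivity can be sketched as follows (a simplified full-information sketch of AdaHedge-style exponential weights with a data-dependent learning rate, stated for losses; it is not the AHB algorithm itself, which additionally feeds importance-weighted payoff estimates to AdaHedge and mixes in extra exploration):

```python
import math

def adahedge_weights(losses):
    """Run an AdaHedge-style update on a T x K list of loss vectors.

    The learning rate is eta_t = ln(K) / Delta_{t-1}, where Delta_{t-1}
    is the cumulative mixability gap observed so far; no loss range is
    given to the algorithm, which is what makes its bound scale-free.
    """
    K = len(losses[0])
    L = [0.0] * K      # cumulative losses
    gap_sum = 0.0      # Delta_{t-1}: cumulative mixability gap
    played = []
    for loss in losses:
        if gap_sum == 0.0:
            # eta = +infinity: follow the leader
            best = min(range(K), key=lambda a: L[a])
            w = [1.0 if a == best else 0.0 for a in range(K)]
            hedge = loss[best]
            mix = min(L[a] + loss[a] for a in range(K)) - min(L)
        else:
            eta = math.log(K) / gap_sum
            m = min(L)
            u = [math.exp(-eta * (L[a] - m)) for a in range(K)]
            s = sum(u)
            w = [x / s for x in u]
            hedge = sum(w[a] * loss[a] for a in range(K))
            # mix loss: -(1/eta) * ln( sum_a w_a * exp(-eta * loss_a) )
            mix = -math.log(sum(w[a] * math.exp(-eta * loss[a])
                                for a in range(K))) / eta
        gap_sum += hedge - mix  # mixability gap, always >= 0
        L = [L[a] + loss[a] for a in range(K)]
        played.append(w)
    return played

weights = adahedge_weights([[1.0, 0.0], [0.0, 1.0]] * 50)
```

The key design point, used below, is that the learning rate only depends on observed mixability gaps, never on an a priori range [b, B].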
To the best of our knowledge, this is the first direct application of the AdaHedge bound depending on the weights q_t (previous applications were rather solving inequations on the regret, e.g., to get improvements for small losses; see Cesa-Bianchi et al., 2007 and De Rooij et al., 2014). We recall that we start the summation in Reminder 2 at t = K + 1 because the AdaHedge algorithm is only started at this time, after the initial exploration. The bound holding for any c ∈ R is obtained by a classical bound on the variance.

Proof of the first bound of Theorem 7 We deal with the contribution of the initial exploration by using the inequality max(u + v) ≤ max u + max v, together with the fact that y_{t,a} − y_{t,A_t} ≤ M − m for any a ∈ [K]:

    R_T(y_{1:T}) ≤ max_{a∈[K]} E[ Σ_{t=1}^K ( y_{t,a} − y_{t,A_t} ) ]   [≤ K(M−m)]   + max_{a∈[K]} Σ_{t=K+1}^T y_{t,a} − E[ Σ_{t=K+1}^T y_{t,A_t} ].

We now transform the pre-regret bound of Reminder 2, which is stated with the distributions q_t, into a pre-regret bound with the distributions p_t; we do so while substituting the bounds B = C + KM/γ_T and b = C + Km/γ_T implied by (10) and the fact that (γ_t) is non-increasing, and by using the identity q_{t,a} = p_{t,a} − γ_t (1/K − q_{t,a}) for all a ∈ [K]:

    max_{k∈[K]} Σ_{t=K+1}^T ŷ_{t,k} − Σ_{t=K+1}^T Σ_{a=1}^K p_{t,a} ŷ_{t,a} + Σ_{t=K+1}^T γ_t Σ_{a=1}^K (1/K − q_{t,a}) ŷ_{t,a}
        ≤ 2 √( ln K Σ_{t=K+1}^T Σ_{a=1}^K q_{t,a} ( ŷ_{t,a} − C )² ) + (B − b)(1 + 2 ln K)
        ≤ 2 √( ln K Σ_{t=K+1}^T Σ_{a=1}^K q_{t,a} ( ŷ_{t,a} − C )² ) + (M − m)(K/γ_T)(1 + 2 ln K).   (15)

As noted by Auer et al. (2002b), by the very definition (8) of the estimates,

    Σ_{a=1}^K p_{t,a} ŷ_{t,a} = y_{t,A_t}.
By (9), the tower rule, and the fact that q_t is H_{t−1}-measurable, on the one hand, and the fact that the expectation of a maximum is larger than the maximum of the expectations, on the other hand, the left-hand side of the first inequality in (15) satisfies

    E[ max_{k∈[K]} Σ_{t=K+1}^T ŷ_{t,k} − Σ_{t=K+1}^T Σ_{a=1}^K p_{t,a} ŷ_{t,a} + Σ_{t=K+1}^T γ_t Σ_{a=1}^K (1/K − q_{t,a}) ŷ_{t,a} ]
    ≥ max_{k∈[K]} Σ_{t=K+1}^T y_{t,k} − E[ Σ_{t=K+1}^T y_{t,A_t} ] + Σ_{t=K+1}^T γ_t ( (1/K) Σ_{a=1}^K y_{t,a} − Σ_{a=1}^K E[q_{t,a}] y_{t,a} )   [both averages lie in [m, M]]
    ≥ max_{k∈[K]} Σ_{t=K+1}^T y_{t,k} − E[ Σ_{t=K+1}^T y_{t,A_t} ] − (M − m) Σ_{t=K+1}^T γ_t.

As for the right-hand side of the second inequality in (15), we first note that, by definition (see line 4 in Algorithm 1), p_{t,a} ≥ (1 − γ_t) q_{t,a} with γ_t ≤ 1/2 by assumption on the extra-exploration rate, so that q_{t,a} ≤ 2 p_{t,a}; therefore, by substituting first this inequality and then using Jensen's inequality,

    E[ 2 √( ln K Σ_{t=K+1}^T Σ_{a=1}^K q_{t,a} ( ŷ_{t,a} − C )² ) ] ≤ E[ 2 √( 2 ln K Σ_{t=K+1}^T Σ_{a=1}^K p_{t,a} ( ŷ_{t,a} − C )² ) ] ≤ 2 √( 2 ln K Σ_{t=K+1}^T Σ_{a=1}^K E[ p_{t,a} ( ŷ_{t,a} − C )² ] ).

Standard calculations (see Auer et al., 2002b, again) show, similarly to (9), that for all a ∈ [K],

    E[ p_{t,a} ( ŷ_{t,a} − C )² | H_{t−1} ] = E[ ( y_{t,A_t} − C )² (1/p_{t,a}) 1{A_t = a} | H_{t−1} ] = ( y_{t,a} − C )² ≤ (M − m)²,

where the last inequality comes from (10). By the tower rule, the same upper bound holds for the (unconditional) expectation. Therefore, taking the expectation of both sides of (15) and collecting all bounds together, we proved so far

    R_T(y_{1:T}) ≤ K(M−m) + (M−m) Σ_{t=K+1}^T γ_t + 2√2 (M−m) √( K T ln K ) + (M−m)(K/γ_T)(1 + 2 ln K) ≤ (M−m) ( Σ_{t=K+1}^T γ_t + 3 √( K T ln K ) + 5 K ln K/γ_T ),

where we used 2√2 ≤ 3, γ_T ≤ 1/2, and ln K ≥ ln 2, as K ≥ 2.

Appendix C. Proof of Theorem 9

Given the decomposition (1) of the regret, it is necessary and sufficient to upper bound the expected number of times E[N_a(T)] any suboptimal arm a is drawn, where, by the definition of Algorithm 1,

    E[N_a(T)] = 1 + E[ Σ_{t=K+1}^T ( (1 − γ_t) q_{t,a} + γ_t/K ) ] ≤ 1 + Σ_{t=K+1}^T E[q_{t,a}] + (1/K) Σ_{t=K+1}^T γ_t.

We show below (and this is the main part of the proof) that

    Σ_{t=K+1}^T E[q_{t,a}] = O(ln T).   (17)

The straightforward calculations (13) already showed that

    (1/K) Σ_{t=K+1}^T γ_t ≤ √( 5 ln K / ((1−α) K) ) T^{1−α}.
Substituting the value (11) of Φ^AHB_free(T) = Φ_adv(T) and using the decomposition (1) of R_T(ν) into Σ_a Δ_a E[N_a(T)] then yield

    R_T(ν) ≤ ( T / Φ^AHB_free(T) ) √( 5 ln K / ((1−α) K) ) √( K ln K ) Σ_{a∈[K]} Δ_a (1 + o(1)) + O(ln T),

from which the stated bound follows, via a crude bounding of the numerical constants.

Structure of the proof of (17). Let a* denote an optimal arm. By the definition of q_{t,a} and by lower bounding a sum of exponential terms by any one of its summands, we get

    q_{t,a} = exp( η_t Σ_{s=K+1}^{t−1} ŷ_{s,a} ) / Σ_{k=1}^K exp( η_t Σ_{s=K+1}^{t−1} ŷ_{s,k} ) ≤ exp( −η_t Σ_{s=K+1}^{t−1} ( ŷ_{s,a*} − ŷ_{s,a} ) ).

Then, by separating cases, depending on whether Σ_{s=K+1}^{t−1} ( ŷ_{s,a*} − ŷ_{s,a} ) is smaller or larger than the threshold (t−1−K) Δ_a/2, and by remembering that the probability q_{t,a} is always smaller than 1, we get

    Σ_{t=K+1}^T E[q_{t,a}] ≤ Σ_{t=K+1}^T E[ exp( −η_t (t−1−K) Δ_a/2 ) ] + Σ_{t=K+1}^T P( Σ_{s=K+1}^{t−1} ( ŷ_{s,a*} − ŷ_{s,a} ) ≤ (t−1−K) Δ_a/2 ).   (18)

We show that the sums in the right-hand side of (18) are respectively O(1) and O(ln T).

First sum in the right-hand side of (18). Given the definition of the learning rates (see the statement of Algorithm 1), namely,

    η_t = ln K / Σ_{s=K+1}^{t−1} δ_s,   (19)

we are interested in upper bounds on the sum of the δ_s. Such upper bounds were already derived in the proof of Theorem 7; the second inequality in (15), together with the bound q_{t,a} ≤ 2 p_{t,a} stated in the middle of that proof, immediately yields

    Σ_{s=K+1}^{t−1} δ_s ≤ 2 √( ln K Σ_{s=K+1}^{t−1} Σ_{a=1}^K q_{s,a} ( ŷ_{s,a} − C )² ) + (M−m)(K/γ_t)(1 + 2 ln K) ≤ 2 √( 2 ln K Σ_{s=K+1}^{t−1} Σ_{a=1}^K p_{s,a} ( ŷ_{s,a} − C )² ) + (M−m)(K/γ_t)(1 + 2 ln K).

Unlike what we did to complete the proof of Theorem 7, we do not take expectations and rather proceed with deterministic bounds. By the definition (8) of the estimated payoffs for the equality below, by (10) for the first inequality below, and by the fact that the exploration rates are non-increasing for the second inequality below, we have, for all s ≥ K + 1,

    Σ_{a=1}^K p_{s,a} ( ŷ_{s,a} − C )² = ( y_{s,A_s} − C )² / p_{s,A_s} ≤ (M−m)² / (γ_s/K) ≤ (M−m)² / (γ_t/K).   (20)

Therefore,

    Σ_{s=K+1}^{t−1} δ_s ≤ 2 (M−m) √( 2 (ln K) K (t−1−K)/γ_t ) + (M−m)(K/γ_t)(1 + 2 ln K) =: D_t = Θ( √(t/γ_t) + 1/γ_t ).

For the sake of concision, we denoted by D_t the obtained bound.
Via the definition (19) of η_t, the first sum of interest is in turn bounded as

    Σ_{t=K+1}^T exp( −η_t (t−1−K) Δ_a/2 ) ≤ Σ_{t=K+1}^T exp( −(ln K) (t−1−K) Δ_a/(2 D_t) ) = O(1),

where the equality to O(1), i.e., the fact that the considered series is bounded, follows from the fact that

    (t−1−K)/D_t = Θ( min{ √(t γ_t), t γ_t } ) = Θ( t^{(1−α)/2} ),

given the choice γ_t = Θ(t^{−α}).

Second sum in the right-hand side of (18). We will use Bernstein's inequality for martingales, and more specifically, the formulation of the inequality by Freedman (1975, Theorem 1.6) (see also Massart, 2007, Section 2.2), as stated next.

Reminder 3 Let (X_n)_{n≥1} be a martingale difference sequence with respect to a filtration (F_n)_{n≥0}, and let N ≥ 1 be a summation horizon. Assume that there exist real numbers b and v_N such that, almost surely,

    ∀n ≤ N, X_n ≤ b,   and   Σ_{n=1}^N E[ X_n² | F_{n−1} ] ≤ v_N.

Then, for all δ ∈ (0, 1), with probability at least 1 − δ,

    Σ_{n=1}^N X_n ≤ √( 2 v_N ln(1/δ) ) + (b/3) ln(1/δ).

For s ≥ K + 1, we consider the increments X_s = Δ_a − ŷ_{s,a*} + ŷ_{s,a}, which are adapted to the filtration F_s = σ(A_1, Z_1, . . . , A_s, Z_s), where we recall that Z_1, . . . , Z_s denote the payoffs obtained in rounds 1, . . . , s. Also, as p_s is measurable with respect to the past information F_{s−1}, and since payoffs are drawn independently from everything else (see Section 2), we have, by the definition (8) of the estimated payoffs (where we rather denote by Y_{s,a} the payoffs drawn at random according to ν_a, to be in line with the notation of Section 2 for stochastic bandits): for all a ∈ [K],

    E[ ŷ_{s,a} | F_{s−1} ] = E[ ( (Y_{s,a} − C)/p_{s,a} ) 1{A_s = a} | F_{s−1} ] + C = (µ_a − C) + C = µ_a.

As a consequence, E[ X_s | F_{s−1} ] = E[ Δ_a − ŷ_{s,a*} + ŷ_{s,a} | F_{s−1} ] = Δ_a − µ_{a*} + µ_a = 0. Put differently, (X_s)_{s≥K+1} is indeed a martingale difference sequence with respect to the filtration (F_s)_{s≥K}. We now check that the additional assumptions of Reminder 3 are satisfied. Manipulations and arguments similar to the ones used in (10) and (20) show that, for all s ≥ K + 1,

    X_s = Δ_a − ŷ_{s,a*} + ŷ_{s,a} ≤ Δ_a + |Y_{s,a*} − C|/p_{s,a*} 1{A_s = a*} + |Y_{s,a} − C|/p_{s,a} 1{A_s = a} ≤ (M−m)(1 + K/γ_s) ≤ (M−m)(1 + K/γ_t) =: b.
For the variance bound, we first note that, for all s ≤ t − 1, we have (ŷ_{s,a} − C)(ŷ_{s,a*} − C) = 0 because of the indicator functions, and therefore,

    E[ ( Δ_a − ŷ_{s,a*} + ŷ_{s,a} )² | F_{s−1} ] ≤ E[ ( ŷ_{s,a} − ŷ_{s,a*} )² | F_{s−1} ] ≤ E[ ( ŷ_{s,a} − C )² | F_{s−1} ] + E[ ( ŷ_{s,a*} − C )² | F_{s−1} ];

in addition, for all a′ ∈ [K], including a*,

    E[ ( ŷ_{s,a′} − C )² | F_{s−1} ] = E[ ( Y_{s,A_s} − C )² (1/p²_{s,a′}) 1{A_s = a′} | F_{s−1} ] ≤ (M−m)²/p_{s,a′} ≤ (M−m)² K/γ_s,

so that

    Σ_{s=K+1}^{t−1} E[ ( Δ_a − ŷ_{s,a*} + ŷ_{s,a} )² | F_{s−1} ] ≤ 2 K (M−m)² (t−1−K)/γ_t ≤ 2 (M−m)² t K/γ_t =: v_t.

Bernstein's inequality (Reminder 3) may thus be applied; the choice δ = 1/t therein shows that, with probability at least 1 − 1/t,

    Σ_{s=K+1}^{t−1} ( Δ_a − ŷ_{s,a*} + ŷ_{s,a} ) ≤ 2 (M−m) √( t K (ln t)/γ_t ) + (M−m)(1 + K/γ_t)(ln t)/3 =: D′_t.

As √(t/γ_t) = O(t^{(1+α)/2}) and 1/γ_t = O(t^α) as t → +∞, where α < 1, and as Δ_a > 0 given that we are considering a suboptimal arm a, there exists t_0 ∈ N such that, for all t ≥ t_0,

    D′_t ≤ (t−1−K) Δ_a/2,

so that, for t ≥ t_0,

    P( Σ_{s=K+1}^{t−1} ( ŷ_{s,a*} − ŷ_{s,a} ) ≤ (t−1−K) Δ_a/2 ) = P( Σ_{s=K+1}^{t−1} ( Δ_a − ŷ_{s,a*} + ŷ_{s,a} ) ≥ (t−1−K) Δ_a/2 ) ≤ 1/t.

Therefore,

    Σ_{t=K+1}^T P( Σ_{s=K+1}^{t−1} ( ŷ_{s,a*} − ŷ_{s,a} ) ≤ (t−1−K) Δ_a/2 ) ≤ t_0 + Σ_{t=1}^T 1/t = O(ln T),

as claimed. This concludes the proof.

References

C. Allenberg, P. Auer, L. Györfi, and G. Ottucsák. Hannan consistency in on-line learning in case of unbounded losses under partial monitoring. In Proceedings of the 17th International Conference on Algorithmic Learning Theory (ALT'06), pages 229–243. Springer, 2006.
J.-Y. Audibert and S. Bubeck. Minimax policies for adversarial and stochastic bandits. In Proceedings of the 22nd Annual Conference on Learning Theory (COLT'09), pages 217–226. Omnipress, 2009.
J.-Y. Audibert, R. Munos, and C. Szepesvári. Exploration-exploitation tradeoff using variance estimates in multi-armed bandits. Theoretical Computer Science, 410(19):1876–1902, 2009.
P. Auer, N. Cesa-Bianchi, and P. Fischer. Finite-time analysis of the multiarmed bandit problem. Machine Learning, 47(2-3):235–256, 2002a.
P. Auer, N. Cesa-Bianchi, Y. Freund, and R.E. Schapire. The nonstochastic multiarmed bandit problem. SIAM Journal on Computing, 32(1):48–77, 2002b.
D. Baudry, P. Saux, and O.-A. Maillard.
From optimality to robustness: Adaptive resampling strategies in stochastic bandits. In Advances in Neural Information Processing Systems, volume 34, pages 14029–14041, 2021.
S. Bubeck, M.B. Cohen, and Y. Li. Sparsity, variance and curvature in multi-armed bandits. In Proceedings of the 29th International Conference on Algorithmic Learning Theory (ALT'18), volume 83 of PMLR, pages 111–127, 2018.
A.N. Burnetas and M.N. Katehakis. Optimal adaptive policies for sequential allocation problems. Advances in Applied Mathematics, 17(2):122–142, 1996.
O. Cappé, A. Garivier, O.-A. Maillard, R. Munos, and G. Stoltz. Kullback-Leibler upper confidence bounds for optimal sequential allocation. The Annals of Statistics, 41(3):1516–1541, 2013.
N. Cesa-Bianchi and G. Lugosi. Prediction, Learning, and Games. Cambridge University Press, 2006.
N. Cesa-Bianchi and O. Shamir. Bandit regret scaling with the effective loss range. In Proceedings of the 29th International Conference on Algorithmic Learning Theory (ALT'18), volume 83 of PMLR, pages 128–151, 2018.
N. Cesa-Bianchi, Y. Mansour, and G. Stoltz. Improved second-order bounds for prediction with expert advice. Machine Learning, 66(2-3):321–352, 2007.
Y. Chow and H. Teicher. Probability Theory. Springer, 1988.
W. Cowan and M.N. Katehakis. An asymptotically optimal policy for uniform bandits of unknown support, 2015. Preprint, arXiv:1505.01918.
W. Cowan, J. Honda, and M.N. Katehakis. Normal bandits of unknown means and variances. Journal of Machine Learning Research, 18(154):1–28, 2018.
S. De Rooij, T. van Erven, P.D. Grünwald, and W.M. Koolen. Follow the leader if you can, hedge if you must. Journal of Machine Learning Research, 15(37):1281–1316, 2014.
J.L. Doob. Stochastic Processes. Wiley Publications in Statistics. John Wiley & Sons, 1953.
D.A. Freedman. On tail probabilities for martingales. The Annals of Probability, 3(1):100–118, 1975.
A. Garivier, H. Hadiji, P. Ménard, and G. Stoltz.
KL-UCB-Switch: Optimal regret bounds for stochastic bandits from both a distribution-dependent and a distribution-free viewpoint, 2019a. Preprint, arXiv:1805.05071.
A. Garivier, P. Ménard, and G. Stoltz. Explore first, exploit next: The true shape of regret in bandit problems. Mathematics of Operations Research, 44(2):377–399, 2019b.
S. Gerchinovitz and T. Lattimore. Refined lower bounds for adversarial bandits. In Advances in Neural Information Processing Systems, volume 29, pages 1198–1206, 2016.
H. Hadiji. Polynomial cost of adaptation for X-armed bandits. In Advances in Neural Information Processing Systems, volume 32, pages 1029–1038, 2019.
J. Honda and A. Takemura. An asymptotically optimal policy for finite support models in the multiarmed bandit problem. Machine Learning, 85:361–391, 2011.
J. Honda and A. Takemura. Non-asymptotic analysis of a new bandit algorithm for semi-bounded rewards. Journal of Machine Learning Research, 16(113):3721–3756, 2015.
J. Huang, Y. Dai, and L. Huang. Scale-free adversarial multi-armed bandit with arbitrary feedback delays, 2021. Preprint, arXiv:2110.13400.
R. Kleinberg. Nearly tight bounds for the continuum-armed bandit problem. In Advances in Neural Information Processing Systems, volume 17, pages 697–704, 2004.
J. Kwon and V. Perchet. Gains and losses are fundamentally different in regret minimization: The sparse case. Journal of Machine Learning Research, 17(227):1–32, 2016.
T.L. Lai and H. Robbins. Asymptotically efficient adaptive allocation rules. Advances in Applied Mathematics, 6(1):4–22, 1985.
T. Lattimore. A scale-free algorithm for stochastic bandits with bounded kurtosis, 2017. Preprint, arXiv:1703.08937; later published, with the omission of some remarks, in Advances in Neural Information Processing Systems, volume 30, pages 1584–1593, 2017.
P. Massart. Concentration Inequalities and Model Selection, volume XXXIII of École d'Été de Probabilités de Saint-Flour. Springer, 2007.
Lectures given in 2003, published in 2007.
S.R. Putta and S. Agrawal. Scale-free adversarial multi-armed bandits. In Proceedings of the 33rd International Conference on Algorithmic Learning Theory (ALT'22), volume 167 of PMLR, pages 910–930, 2022.
Y. Seldin and G. Lugosi. An improved parametrization and analysis of the EXP3++ algorithm for stochastic and adversarial bandits. In Proceedings of the 30th Annual Conference on Learning Theory (COLT'17), volume 65 of PMLR, pages 1743–1759, 2017.
G. Stoltz. Incomplete Information and Internal Regret in Prediction of Individual Sequences. PhD thesis, Université Paris-Sud, 2005. URL https://tel.archives-ouvertes.fr/tel-00009759/document.
T.S. Thune and Y. Seldin. Adaptation to easy data in prediction with limited advice. In Advances in Neural Information Processing Systems, volume 31, pages 2909–2918, 2018.
C.-Y. Wei and H. Luo. More adaptive algorithms for adversarial bandits. In Proceedings of the 31st Conference On Learning Theory (COLT'18), volume 75 of PMLR, pages 1263–1291, 2018.
J. Zimmert and T. Lattimore. Connections between mirror descent, Thompson sampling and the information ratio. In Advances in Neural Information Processing Systems, volume 32, pages 11973–11982, 2019.
J. Zimmert and Y. Seldin. An optimal algorithm for stochastic and adversarial bandits. In Proceedings of the 22nd International Conference on Artificial Intelligence and Statistics (AISTATS 2019), volume 89 of PMLR, pages 467–475, 2019.