# adaptation_to_the_range_in_karmed_bandits__954972aa.pdf

Journal of Machine Learning Research 24 (2023) 1-33 Submitted 2/21; Revised 6/22; Published 1/23

Adaptation to the Range in K Armed Bandits

H edi Hadiji hedi.hadiji@math.u-psud.fr Gilles Stoltz gilles.stoltz@math.u-psud.fr Universit e Paris-Saclay, CNRS, Laboratoire de math ematiques d Orsay, 91405, Orsay, France

Editor: Ambuj Tewari

We consider stochastic bandit problems with K arms, each associated with a distribution supported on a given ﬁnite range [m, M]. We do not assume that the range [m, M] is known and show that there is a cost for learning this range. Indeed, a new trade-oﬀbetween distribution-dependent and distribution-free regret bounds arises, which prevents from simultaneously achieving the typical ln T and

T bounds. For instance, a

T distributionfree regret bound may only be achieved if the distribution-dependent regret bounds are at least of order

T. We exhibit a strategy achieving the rates for regret imposed by the new trade-oﬀ. Keywords: multiarmed bandits, adversarial learning, cumulative regret, informationtheoretic proof techniques

1. Introduction

Stochastic multi-armed bandits form a standard setting to deal with sequential decisionmaking problems like the design of clinical trials one of the ﬁrst applications mentioned or online advertisement and online revenue management. Except for notable exceptions discussed below, virtually all articles on stochastic K armed bandits either assume that distributions of the arms belong to some parametric family often, one-dimensional exponential families or are sub-Gaussian with a known parameter σ2. Among the latter category, the case of the non-parametric family of distributions supported on a known range [m, M] is of particular interest to us. We show that the knowledge of the range [m, M] is a crucial information and that facing bounded bandit problems but ignoring the bounds m and M is much harder. We do so by studying what may be achieved and what cannot be achieved anymore when the range [m, M] is unknown and the strategies need to learn it. We call this problem adaptation to the range, or scale-free regret minimization. Why this problem is important and why we considered it is explained in Section 1.2.

More precisely, we prove that adaptation to the range is actually possible but that it has a cost: our most striking result (in Section 2.3) is a severe trade-oﬀbetween the scale-free distribution-dependent and distribution-free regret bounds that may be achieved. For instance, no strategy adaptive to the range can simultaneously achieve distributiondependent regret bounds of order ln T and distribution-free regret bounds of order

T up to polynomial factors; this is in contrast with the case of a known range where simple strategies like UCB strategies (by Auer et al., 2002a) do so. Our general trade-oﬀshows,

2023 H edi Hadiji and Gilles Stoltz.

License: CC-BY 4.0, see https://creativecommons.org/licenses/by/4.0/. Attribution requirements are provided at http://jmlr.org/papers/v24/21-0148.html.

Hadiji and Stoltz

for instance, that if one wants to keep the same

T order of magnitude for the scale-free distribution-free regret bounds, then the best scale-free distribution-dependent rate that may be achieved is

T. We also provide (in Section 4) a strategy, based on exponential weights, that adapts to the range and obtains optimal distribution-dependent and distribution-free regret bounds in the eyes of the exhibited trade-oﬀ: these are of respective orders T 1 α and T α, where α [1/2, 1) is a parameter of the strategy.

1.1 Literature Review

Optimal scale-free regret minimization under full monitoring for adversarial sequences is oﬀered by the Ada Hedge strategy by De Rooij et al. (2014), which we will use as a building block in Section 4. For stochastic bandits, the main diﬃculty in adaptation to the range is the adaptation to the upper end M (see Remark 4); this is why Honda and Takemura (2015) could provide optimal ln T distribution-dependent regret bounds for payoﬀs lying in ranges of the form ( , M], with a known M. Lattimore (2017) considers models of distributions with a known bound on their kurtosis, which is a scale-free measure of the skewness of the distributions; he provides a scale-free algorithm based on the median-of-means estimators, with ln T distribution-dependent regret bounds. However, bounded bandits can have an arbitrarily high kurtosis, so our settings are not directly comparable. Cowan and Katehakis (2015) study adaptation to the range but in the restricted case of uniform distributions over unknown intervals. They provide optimal ln T distribution-dependent regret bounds for that speciﬁc model: in their model, the cost for adaptation is mild and lies only in the multiplicative constant before the ln T. In the setting of bounded bandits, we show that distribution-dependent regret bounds must be larger than ln T, but the argument of Lattimore (2017, Remark 8) entails that any regret rate larger than ln T, e.g., (ln T) ln ln T, may be achieved. Similar results by Cowan et al. (2018) for Gaussian distributions with unknown means and variances were also obtained. Finally, on the front of adversarial bandits, no prior work discussed adaptation to the range, to the best of our knowledge. Additional important references performing adaptation in some other sense for stochastic and adversarial K armed bandits are discussed now, including some follow-up work to this article.

Adaptation to the eﬀective range or to unbounded ranges in adversarial bandits. Gerchinovitz and Lattimore (2016) show that it is impossible to adapt to the so-called eﬀective range in adversarial bandits. A sequence of rewards has eﬀective range smaller than b if for all rounds t, rewards yt,a at this round all lie in an interval of the form [mt, Mt] with Mt mt b. The lower bound they exhibit relies on a sequence of changing intervals of ﬁxed size. This problem is thus diﬀerent from our setting. See also positive results regret upper bounds under additional assumptions by Cesa-Bianchi and Shamir (2018) and Thune and Seldin (2018) for adaptation to the eﬀective range. Allenberg et al. (2006) deal with unbounded ranges [mt, Mt] in adversarial bandits and other partial monitoring settings, where, e.g., Mt = mt = tβ for some β > 0. They provide

Adaptation to the Range in K Armed Bandits

regret upper bounds scaling with tβ/2 when β is known, but do not detail the price to pay for not knowing β though they suggest to resort to a doubling trick in that case.

Adaptation to the variance. Audibert et al. (2009) consider a variant of UCB called UCBV, which adapts to the unknown variance. Its analysis assumes that rewards lie in a known range [0, M]. The results crucially use Bernstein s inequality, which we state as Reminder 3 in Appendix C. As Bernstein s inequality holds for random variables with supports in ( , M], the analysis of UCB-V might perhaps be extended to this case as well. Deviation bounds in Bernstein s inequality contain two terms, a main term scaling with the standard deviation, and a remainder term, scaling with M. This remainder term, which seems harmless, is actually a true issue when M is not known, as shown by the results of the present article.

Adaptation to other criteria. Wei and Luo (2018), Zimmert and Seldin (2019), Bubeck et al. (2018), and many more, provide strategies for adversarial bandits with rewards in a known range, say [0, 1], and adapting to additional regularity in the data, like small variations or stochasticity of the data but never to the range itself.

Follow-up works. Following an earlier version of this work, further research on rangeadaptive bandit algorithms has been conducted. Baudry et al. (2021) bypass our lower bound by imposing minimal extra conditions on the reward distributions, avoiding the heavy-tail construction from Theorem 3; in this context, they provide a fully range-adaptive algorithm. For adversarial multi-armed bandits, Putta and Agrawal (2022) recover some small-loss bounds while being agnostic to the range, at the cost of degraded worst-case guarantees, and Huang et al. (2021) obtain similar results under delayed feedback.

1.2 Why Studying Adaptation to the Range for Finite-Armed Bandits

We encountered the problem of learning the range [m, M] of bandits problems when designing bandit algorithms for continuum-armed problems, see Hadiji (2019). Therein, arms are indexed by some bounded interval I, and the mean-payoﬀfunction f : I R is assumed to be smooth enough, e.g., H olderian-smooth, with unknown regularity parameters L and β. The mean-payoﬀfunction f has a bounded range, as it is continuous over a bounded interval. To optimally learn these smoothness parameters, histogram-reductions of continuum-armed bandit problems to ﬁnite-armed bandits problems, of proper bandwidth, are performed ( a la Kleinberg, 2004), by zooming out. Any reasonable K armed bandit algorithm may be used in this algorithmic scheme. However, given this reduction, it had to be assumed that the range [m, M] of f is known, as all K armed bandit algorithms we were aware of assumed that the range of the distributions over the arms was known. To get more complete adaptivity results in the continuum-armed case and be able to ignore the range of the mean-payoﬀfunction f, it was necessary and suﬃcient to deal with the similar issue of range adaptivity in the case of ﬁnitely many arms which this article provides. We also believe that exhibiting impossibility results, like the existence of a severe tradeoﬀbetween distribution-dependent and distribution-free regret bounds in the case of the model of bounded distributions with an unknown range, has consequences beyond that model. This impossibility result holds in particular for all larger models, like non-parametric models containing all distributions over the entire real line R satisfying certain assumptions

Hadiji and Stoltz

on their tails to make sure that they are not too large. We therefore provide some intrinsic limitation to learning in K armed stochastic bandits. The techniques introduced extend to more complex settings, like linear bandits; see an extended version of this article [ar Xiv:2006.03378], Appendix D.

2. Setting and Main Results

We consider ﬁnitely-armed stochastic bandits with bounded and possibly signed rewards. More precisely, K 2 arms are available; we denote by [K] the set {1, . . . , K} of these arms. With each arm a is associated a Borel probability distribution νa lying in some known model D; a model is a set of Borel probability distributions over R with a ﬁrst moment. The models of interest in this article are discussed below; in the sequel, we only consider Borel distributions even though we will omit this speciﬁcation. A bandit problem in D is a K vector of probability distributions in D, denoted by ν = (νa)a [K]. The player knows D but not ν. As is standard in this setting, we denote by µa = E(νa) the mean payoﬀprovided by an arm a. An optimal arm and the optimal mean payoﬀare respectively given by a argmaxa [K] µa and µ = maxa [K] µa. Finally, a = µ µa denotes the gap of an arm a. The online learning game goes as follows: at round t 1, the player picks an arm At [K], possibly at random according to a probability distribution pt = (pt,a)a [K] based on an auxiliary randomization Ut 1, e.g., uniformly distributed over [0, 1], and then receives and observes a reward Zt drawn independently at random according to the distribution νAt, given At. More formally, a strategy of the player is a sequence of measurable mappings from the observations to the action set, (U0, Z1, U1, . . . , Zt 1, Ut 1) 7 At. At each given time T 1, we measure the performance of a strategy through its expected regret:

RT (ν) = Tµ E

a=1 a E Na(T) , (1)

where we used the tower rule for the ﬁrst equality and deﬁned Na(T) as the number of times arm a was pulled between time rounds 1 and T. Doob s optional skipping (see Doob, 1953, Chapter III, Theorem 5.2, page 145 for the original reference, see also Chow and Teicher, 1988, Section 5.3 for a more recent reference) shows that we may assume that i.i.d. sequences of rewards (Yt,a)t 1 are drawn beforehand, independently at random, for each arm a and that the obtained payoﬀat round t 1 given the choice At equals Zt = Yt,At. We will use this second formulation in the rest of the paper as it is the closest to the one of oblivious individual sequences described later in Section 4.1. We may then assume that the auxiliary randomizations U0, U1, . . . are i.i.d. random variables independent from the (Yt,a)t 1 and distributed according to a uniform distribution over [0, 1].

Model: bounded signed rewards with unknown range. For a given range [m, M], where m < M are two real numbers, not necessarily nonnegative, we denote by Dm,M the set of probability distributions supported on [m, M]. Then, the model corresponding to distribu-

Adaptation to the Range in K Armed Bandits

tions with a bounded but unknown range is the union of all such Dm,M:

m,M R:m<M Dm,M .

2.1 Adaptation to the Range: Concept of Scale-Free Regret Bounds

Regret scales with the range length M m, thus regret bounds involve a multiplicative factor M m. We therefore consider such bounds divided by the scale factor M m and call them scale-free regret bounds. We denote by N the set of natural integers; rates on regret bounds will be given by functions Φ : N [0, + ). We deﬁne adaptation to the unknown range in Deﬁnitions 1 and 2 below.

Deﬁnition 1 (Scale-free distribution-free regret bounds) A strategy for stochastic bandits is adaptive to the unknown range of payoﬀs with a scale-free distribution-free regret bound Φfree : N [0, + ) if for all real numbers m < M, the strategy ensures, without the knowledge of m and M:

ν in Dm,M, T 1, RT (ν) (M m) Φfree(T) .

We show in Section 4 that adaptation to the unknown range may indeed be performed in the sense of Deﬁnition 1, with a scale-free distribution-free regret bound of order

KT ln K. The latter is optimal up to maybe a factor of

ln K as Auer et al. (2002b) provided a lower bound (1/20) min

KT, T on the regret of any strategy against individual sequences in [0, 1]K, thus for bandit problems in D0,1, thus for scale-free distribution-free regret bounds.

Deﬁnition 2 (Distribution-dependent rates for adaptation) A strategy for stochastic bandits is adaptive to the unknown range of payoﬀs with a distribution-dependent rate Φdep : N [0, + ) if for all real numbers m < M, the strategy ensures, without the knowledge of m and M:

ν in Dm,M, lim sup T +

RT (ν) Φdep(T) < + .

Put diﬀerently, the strategy ensures that lim sup RT (ν)/Φdep(T) < + for all ν D ,+.

Deﬁnition 2 does not add much to the classical notion of distribution-dependent rates on regret bounds, as the scale factor M m does not appear in the deﬁnition; it merely ensures that the strategy is not informed of the range. Also, we are only interested in rates of convergence here, not in the value of the ﬁnite limit of RT (ν)/Φdep(T). This limit however heavily depends on ν, which justiﬁes the terminology of distribution-dependent rates for adaptation Φdep. In contrast, the bounds targeted in the distribution-free case have a ﬁnite-time, closedform expression, which is why we did not speak of rates in that case and rather referred to scale-free distribution-free regret bounds Φfree.

Hadiji and Stoltz

2.2 Scale-Free Distribution-Dependent Regret Bounds Considered in Isolation

We ﬁrst explain the impact of ignoring the range on distribution-dependent regret bounds. What follows is discussed in greater detail in Appendix A as these results were already known or, at least, much expected. When m and M are known, there exist several strategies ensuring

ν in Dm,M, lim sup T +

even with an optimal value of the limit; see the end of Section A.1. Given Deﬁnition 2, one may therefore wonder whether Φdep = ln is achievable as a distribution-dependent rate for adaptation to the range. Theorem 11 in Section A.2 and the comment before its statement provide a negative answer to this question. However, a UCB-strategy with an increased exploration rate given by a non-decreasing function ϕ ln was suggested by Lattimore (2017, Remark 8) in the context of Gaussian bandits. It also works well in the setting of bounded bandits: Theorem 13 in Section A.3 states that it is adaptive to the unknown range of payoﬀs with a distribution-dependent rate Φdep = ϕ. That is, any rate that is larger than a logarithm may be achieved, including, for instance, ϕ(t) = (ln t) ln ln t.

2.3 Simultaneous Scale-Free Regret Bounds

When the range [m, M] of the payoﬀs is known, it is possible to simultaneously achieve optimal distribution-free bounds, of order

KT, and optimal distribution-dependent bounds, of order ln T with the optimal constant recalled in Reminder 1 of Appendix A.1; see the KL-UCB-switch strategy by Garivier et al. (2019a). Put diﬀerently, when the range of payoﬀs is known, one can achieve optimal asymptotic distribution-dependent regret bounds while not sacriﬁcing ﬁnite-time guarantees. Simpler strategies like UCB strategies (see Auer et al., 2002a) also simultaneously achieve regret bounds of similar

T ln T and ln T orders of magnitude but with suboptimal constants. Zimmert and Seldin (2019) also provide a strategy, Tsallis-INF with α = 1/2, that provides simultaneously distribution-dependent regret guarantees of order ln T, with suboptimal constants though, and adversarial guarantees of order

KT, which are stronger than just distribution-free guarantees. First main result: existence of a trade-oﬀ. Our ﬁrst main result states that getting simultaneously these ln T and

T rates is not possible anymore when the range of payoﬀs is unknown.

Theorem 3 A strategy with a scale-free distribution-free regret bound Φfree(T) = o(T) may only achieve distribution-dependent rates Φdep for adaptation satisfying

Φdep(T) T Φfree(T) .

More precisely, the regret of such a strategy is lower bounded as follows: for all ν in D ,+,

lim inf T RT (ν) T/Φfree(T) 1

a=1 a . (2)

Adaptation to the Range in K Armed Bandits

The orders of magnitude of the scale-free distribution-free regret bounds Φfree(T) range between the optimal

T and the trivial T rates. The distribution-dependent rates Φdep for adaptation to the range are therefore at best

T for strategies enjoying scale-free distribution-free regret bounds; ln T rates are excluded. More generally, Theorem 3 shows that there is a trade-oﬀ: to force faster distribution-dependent rates for adaptation, one must suﬀer worsened scale-free distribution-free regret bounds. The proof of Theorem 3 is provided in Section 3. It actually provides a ﬁnite-time, but messy, lower bound on RT (ν) T/Φfree(T) .

Second main result: achieving the trade-oﬀ. Our second main result consists of showing that the trade-oﬀimposed by Theorem 3 may indeed be achieved. Section 4 will introduce a strategy, relying on a parameter α [1/2, 1) and called AHB which stands for Ada Hedge for K armed Bandits with extra-exploration; see Algorithm 1. Theorems 7 and 9 show in particular that AHB adapts to the unknown range, satisﬁes a scale-free distribution-free regret bound

ΦAHB free (T) = 3 + 5 1 α

K ln K T α + 10K ln K = O(T α) ,

and achieves ΦAHB dep (T) = T/ΦAHB free (T) = O(T 1 α) as a distribution-dependent rate for adaptation. Like Zimmert and Seldin (2019), we are actually able to prove an adversarial regret bound, not only the mentioned distribution-free regret bound. Even better, Theorem 9 states that for all ν in D ,+,

RT (ν) T/ΦAHB free(T) 12 ln K

a=1 a . (3)

Discussion. The distribution-dependent constants in the right-hand sides of (2) and (3) are proportional to the sums of the gaps,

and diﬀer from this sum only by distribution-free factors of 1/16 and (12 ln K)/(1 α). The quantity G(ν) appears as a new measure of the underlying geometry of information. We have no deep interpretation thereof, but may despite all underline a fundamental diﬀerence in our setting compared to the setting of a known range. When the payoﬀrange is unknown, the optimal distribution-dependent number of pulls of a suboptimal arm may be bounded independently of ν. The proof of Theorem 3 in Section 3 indeed shows that for all suboptimal arms a [K],

lim inf T + Eν

T/Φfree(T) 1

This is in contrast with the case of a known range, for which the bound of Reminder 1 of Appendix A.1 is optimal and strongly depends on νa and µ .

Hadiji and Stoltz

The reason for this is that when ignoring the range, the player needs to be a lot more conservative in the exploitation and explore more often. Indeed, to maintain the distributionfree regret bound, the player must avoid the catastrophic case in which an apparently suboptimal arm turns out to be good because of large rewards occurring with small probability, i.e., because of heavy-tail-like issues. For this reason, the player must pull suboptimal arms more frequently than in the case of a known range. This intuition is supported by the construction in the lower bound presented in Section 3: the alternative bounded problem ν against a problem ν has an arm satisfying PY ν a[Y µa + 2 a/ε] = ε. This behavior is indeed reminiscent of issues arising with heavy-tailed distributions.

3. Proof of Theorem 3: Existence of a Trade-Oﬀ

We follow a proof technique introduced by Lai and Robbins (1985) and Burnetas and Katehakis (1996) and recently revisited by Garivier et al. (2019b). We ﬁx some bandit problem ν in D ,+ and construct an alternative bandit problem ν in D ,+ by modifying the distribution of a single suboptimal arm a to make it optimal. This is always possible, as there is no bound on the upper end on the ranges of the payoﬀs in the model. We apply a fundamental inequality that links the expectations of the numbers of times Na(T) that a is pulled under ν and ν . We then substitute inequalities stemming from the deﬁnition of distribution-free scale-free regret bounds Φfree, and the result follows by rearranging all inequalities.

Step 1: Alternative bandit problem. The lower bound is trivial it equals 0 when all arms of ν are optimal. We therefore assume that at least one arm is suboptimal and ﬁx such an arm a. For some ε [0, 1] to be deﬁned later by the analysis, we introduce the alternative problem ν = (ν k)k [K] with ν k = νk for j = a and ν a = (1 ε)νa + εδµa+2 a/ε. This distribution ν a has a bounded range, so that ν lies indeed in D ,+. The expectation of ν a equals µ a = µa + 2 a = µ + a > µ . Thus, a is the only optimal arm in ν . Finally, for ε < 2 a/(M µa), the point µa + 2 a/ε is larger than M and thus lies outside of the bounded support of νa. In that case, the density of νa with respect to ν a is given by 1/(1 ε) on the support of νa and 0 elsewhere, so that KL(νa, ν a) = ln 1/(1 ε) .

Step 2: Application of a fundamental inequality. We denote by kl(p, q) the Kullback-Leibler divergence between Bernoulli distributions with parameters p and q. We also index expectations in the rest of this proof only by the bandit problem they are relative to: for instance, Eν denotes the expectation of a random variable when the ambient randomness is given by the bandit problem ν. The fundamental inequality for lower bounds on the regret of stochastic bandits (Garivier et al., 2019b, Section 2, Equation 6), which is based on the chain rule for Kullback-Leibler divergence and on a data-processing inequality for expectations of [0, 1] valued random variables, reads:

T , Eν Na(T)

Na(T) KL(νa, ν a) = Eν

Na(T) ln 1/(1 ε) .

Adaptation to the Range in K Armed Bandits

Now, since u ( , 1) 7 u 1 ln(1 u) is increasing, we have ln 1/(1 ε) (2 ln 2)ε for ε 1/2. For all (p, q) [0, 1]2 and with the usual measure-theoretic conventions,

kl(p, q) = p ln p + (1 p) ln(1 p) | {z } ln 2

+(1 p) ln 1 1 q (1 p) ln 1 1 q ln 2 ,

so that, putting all inequalities together, we have proved

1 1 Eν Na(T) /T

ln 2 (2 ln 2) ε Eν

Na(T) . (4)

In this step, we only imposed the constraint ε [0, 1/2]. We recall that in the previous step, we imposed ε < 2 a/(M µa). Both conditions are implied by ε a/ 2(M µa) , which we will assume in the sequel.

Step 3: Inequalities stemming from the deﬁnition of scale-free distribution-free regret bounds. We denote by [m, M] a range containing the supports of all distributions of ν. By deﬁnition of Φfree, given that a is a suboptimal arm (i.e., a > 0):

a Eν[Na(T)] RT (ν) (M m) Φfree(T) .

We now prove a similar inequality for ν , for which we recall that a is the unique optimal arm. We denote by k = µ a µk the gap of arm k in ν . By the deﬁnition of ν a, the distributions of ν have supports within the range [m, Mε], where we deﬁned the upper end Mε = max{M, µa + 2 a/ε} = µa + 2 a/ε, given the condition imposed on ε. Therefore, by deﬁnition of Φfree, and given that all gaps k are larger than the gap a = µ a µ = a between the unique optimal arm a of ν and the second best arm(s) of ν (which were the optimal arms of ν), we have

a T Eν [Na(T)] = a T Eν [Na(T)] X

j =a j Eν [Nj(T)] = RT (ν )

(Mε m) Φfree(T) .

By rearranging the two inequalities above, we get

T 1 (M m) Φfree(T)

T a and 1 Eν Na(T)

T (Mε m) Φfree(T)

thus, after substitution into (4), 1 (M m) Φfree(T)

ln T a (Mε m) Φfree(T)

ln 2 (2 ln 2) ε Eν

Na(T) . (5)

Step 4: Final calculations. We take ε = εT = α 1 Φfree(T)/T for some constant α > 0; we will pick α = 1/8. By the assumption Φfree(T) = o(T), we have εT a/ 2(M µa) , as needed, for T large enough, as well as MεT = µa + 2 a/εT = µa + 2α a T/Φfree(T).

Hadiji and Stoltz

Substituting these values into (5), a ﬁnite-time lower bound on the quantity of interest is ﬁnally given by

T/Φfree(T) α 2 ln 2

ln 2 + 1 (M m) Φfree(T)

T a | {z } 0

ln T a 2α a T + (µa m)Φfree(T) | {z } 1/(2α)

It entails the asymptotic lower bound

lim inf T + Eν

T/Φfree(T) α 2 ln 2 ln(1/α) 2 ln 2 = 1

for the choice α = 1/8. The claimed result follows by adding these lower bounds for each suboptimal arm a, with a factor a, following the rewriting (1) of the regret.

Remark 4 The proof above only exploits the fact that the upper end M of the range is unknown: the alternative problems lie in Dm,M for some M that can be arbitrarily large. Yet, by deﬁnition of adaptation to the range, the strategy needs to guarantee (M m) Φfree(T) distribution-free regret bounds in that case. We may note that therefore, Theorem 3 also holds for the model of bounded distributions with a known lower end m R for the range:

Deﬁnitions 1 and 2 handle the case of D ,+ but can be adapted in an obvious way to Dm,+ by ﬁxing m, by having the strategy know m, and requiring the bounds to hold for all M [m, + ) and all bandit problems in Dm,M, thus leading to the concept of adaptation to the upper end of the range. This observation is in line with the folklore knowledge that there is a diﬀerence in nature between dealing with nonnegative payoﬀs, i.e., gains, or dealing with nonpositive payoﬀs, i.e., losses, for regret minimization under bandit monitoring; see Cesa-Bianchi and Lugosi (2006, Remark 6.5, page 164) for an early reference and Kwon and Perchet (2016) for a more complete literature review. Actually, 0 plays no special role, the issue is rather whether one end of the payoﬀrange is known.

4. Adaptation to Range Based on Ada Hedge: The AHB Strategy

When the range of payoﬀs is known, Auer et al. (2002b) achieve a distribution-free regret bound of order

KT ln K with exponential weights the Hedge strategy on estimated payoﬀs and with extra-exploration, i.e., by mixing exponential weights with the uniform distribution over arms. Actually, it is folklore knowledge that the extra-exploration used in this case is unnecessary (see, among others, Stoltz, 2005). To deal with the case of an unknown payoﬀrange, we consider a self-tuned version of Hedge called Ada Hedge (De Rooij et al., 2014, see also an earlier work by Cesa-Bianchi et al., 2007) and do add extra-exploration. Just as Auer et al. (2002b), we will actually obtain regret guarantees for oblivious adversarial bandits, not only distribution-free regret bounds for stochastic bandits. We therefore introduce now the setting of oblivious adversarial bandits and deﬁne adaptation to the range in that case.

Adaptation to the Range in K Armed Bandits

4.1 Oblivious Adversarial Bandits

In the setting of fully oblivious adversarial bandits (see Cesa-Bianchi and Lugosi, 2006; Audibert and Bubeck, 2009), a range [m, M] is set by the environment, where m, M are real numbers, not necessarily nonnegative. The player is unaware of [m, M] and will remain so. The environment also picks beforehand a sequence y1, y2, . . . of reward vectors in [m, M]K. We denote by yt = (yt,a)a [K] the components of these vectors. The player will observe a component of each of these reward vectors in a sequential fashion, as follows. Auxiliary randomizations U0, U1, . . . i.i.d. according to a uniform distribution over [0, 1] are available. At each round t 1, the player picks an arm At [K], possibly at random (thanks to Ut 1) according to a probability distribution pt = (pt,a)a [K], and then receives and observes yt,At. More formally, a strategy of the player is a sequence of mappings from the observations to the action set, (U0, y1,A1, U1, . . . , yt 1,At 1, Ut 1) 7 At. The strategy does not rely on m nor M. At each given time T 1, denoting by y1:T = (y1, . . . , y T ) the reward vectors, we measure the performance of a strategy through its expected regret:

RT (y1:T ) = max a [K]

where, as rewards are ﬁxed beforehand, all randomness lies in the choice of the arms At only, i.e., where the expectation is only over the choice of the arms At. The counterpart of Deﬁnition 1 in this setting is stated next.

Deﬁnition 5 (Scale-free adversarial regret bounds) A strategy for oblivious adversarial bandits is adaptive to the unknown range of payoﬀs with a scale-free adversarial regret bound Φadv : N [0, + ) if for all real numbers m < M, the strategy ensures, without the knowledge of m and M:

y1, y2, . . . in [m, M]K, T 1, RT (y1:T ) (M m) Φadv(T) .

Conversion of upper/lower bounds from one setting to the other. We recall that when applying Doob s optional skipping in Section 2, for each arm a, we denoted by (Yt,a)t 1 an i.i.d. sequence of rewards drawn beforehand, independently at random, according to the distribution νa associated with that arm. By the tower rule for the right-most equality below, we note that for all m < M and for all ν in Dm,M,

RT (ν) = max a [K] E

= E RT (Y1:T )

sup y1:T in [m,M]K RT (y1:T ) .

In particular, lower bounds on the regret for stochastic bandits are also lower bounds on the regret for oblivious adversarial bandits, and strategies designed for oblivious adversarial bandits obtain the same distribution-free regret bounds for stochastic bandits when the individual payoﬀs yt,At in their deﬁnition are replaced with the stochastic payoﬀs Yt,At.

Hadiji and Stoltz

Algorithm 1 AHB: Ada Hedge for K armed Bandits, with extra-exploration

1: Inputs: a payoﬀestimation scheme, e.g., (8); a sequence (γt)t 1 in [0, 1] of extraexploration rates

2: for rounds t = 1, . . . , K do

3: Draw arm At = t

4: Get and observe the payoﬀyt,t 5: end for

6: Ada Hedge initialization: ηK+1 = + and q K+1 = (1/K, . . . , 1/K) def = 1/K

7: for rounds t = K + 1, . . . do

8: Deﬁne pt by mixing qt with the uniform distribution as pt = (1 γt)qt + γt1/K

9: Draw an arm At pt, i.e., independently at random according to the distribution pt 10: Get and observe the payoﬀyt,At 11: Compute estimates byt,a of all payoﬀs with the payoﬀestimation scheme, e.g., (8)

12: Compute the mixability gap δt 0 based on the distribution qt and on these estimates:

a=1 qt,a byt,a + 1

a=1 qt,aeηtbyt,a !

| {z } when ηt + or ηt = +

, i.e., δt =

a=1 qt,a byt,a + max a [K] byt,a

| {z } when ηt = +

13: Compute the learning rate ηt+1 =

14: Deﬁne qt+1 component-wise as

qt+1,a = exp

s=K+1 bya,s

s=K+1 byk,s

15: end for

4.2 The AHB Strategy

We state our main strategy, AHB which stands for Ada Hedge for K armed Bandits, with extra-exploration , in the setting of oblivious adversarial bandits, see Algorithm 1. In a setting of stochastic bandits, it suﬃces to replace therein yt,At with Yt,At. The AHB strategy relies on a payoﬀestimation scheme, which we discuss now. In Algorithm 1, some initial exploration lasting K rounds is used to get a rough idea of the location of the payoﬀs and to center the estimates used at an appropriate location. Following Auer et al. (2002b), we consider, for all rounds t K + 1 and arms a [K],

byt,a = yt,At C

pt,a 1{At=a} + C where C def = 1

s=1 ys,s . (8)

Note that all pt,a > 0 for Algorithm 1 due to the use of exponential weights. As proved by Auer et al. (2002b), the estimates byt,a are conditionally unbiased. Indeed, the distributions qt and pt, as well as the constant C, are measurable functions of the information

Adaptation to the Range in K Armed Bandits

Ht 1 = (U0, y1,A1, U1, . . . , Ut 2, yt 1,At 1) available at the beginning of round t K + 1, and the arm At is drawn independently at random according to pt based on an auxiliary randomization denoted by Ut 1. Therefore, given that the payoﬀs are oblivious, the conditional expectation of byt,a with respect to Ht 1 amounts to integrating over the randomness given by the random draw At pt: for t K + 1,

E byt,a Ht 1 = yt,a C

pt,a P At = a Ht 1 + C = yt,a C

pt,a pt,a + C = yt,a . (9)

These estimators are bounded: assuming that all yt,a, thus also C, belong to the range [m, M], and given that the distributions pt were obtained by a mixing with the uniform distribution, with weight γt, we have pt,a γt/K, and therefore,

t K + 1, a [K], byt,a C |yt,a C|

γt/K . (10)

Remark 6 Algorithm 1 is invariant by aﬃne changes, i.e., translations by real numbers and/or multiplications by positive factors, of the payoﬀs, given that Ada Hedge (see De Rooij et al., 2014, Theorem 16) and the payoﬀestimation scheme (8) are so. This is key for adaptation to the range. This invariance is achieved, when ignoring the range [m, M], thanks to any value C [m, M]. Here, we chose to have K rounds of exploration in Algorithm 1 and let C equal the average of the payoﬀs achieved. However, it would of course have been suﬃcient to pick one arm at random, observe a single reward y1,A1 and let C = y1,A1.

4.3 Regret Analysis, Part 1: Scale-Free Adversarial Regret Bound

Theorem 7 Ada Hedge for K armed bandits (Algorithm 1) with a non-increasing extraexploration sequence (γt)t 1 smaller than 1/2 and the estimation scheme given by (8) ensures that for all bounded ranges [m, M], for all oblivious individual sequences y1, y2, . . . in [m, M]K, for all T 1,

RT (y1:T ) 3(M m)

KT ln K + 5(M m)K ln K

In particular, given a parameter α (0, 1), the extra-exploration

γt = min n 1/2, p

5(1 α)K ln K tαo

leads to the scale-free adversarial regret bound

Φadv(T) = 3 + 5 1 α

K ln K T max{α,1 α} + 10(M m)K ln K . (11)

For α = 1/2, the bound reads Φadv(T) = 7(M m)

TK ln K + 10(M m)K ln K.

This value α = 1/2 is the best one to consider if one is only interested in a distributionfree bound i.e., if one is not interested in the distribution-dependent rates for the regret. The proof of Theorem 7 is detailed in Appendix B but we sketch its proof below.

Hadiji and Stoltz

Remark 8 We strongly suspect that the

ln K factor in the bound of Theorem 7 is superﬂuous. In the case of a known range, the MOSS algorithm is known to be minimax optimal with a regret bound of order

KT. One idea could thus be to use a MOSS-type index, together with a Bernstein-type upper conﬁdence bound to account for the unknown variance and range. A ﬁnal ingredient would be to add initial extra-exploration, pulling every arm p

T/K times before running the standard phase of the algorithm; on a technical level, this automatically makes the sub-Poissonian term in Bernstein s inequality tractable. We have not managed yet to ﬁll in the technical details in order to prove this, although we believe a variant of these ideas would get rid of the logarithmic factor. In contrast, the algorithm discussed here, based on Ada Hedge, enjoys a simple distribution-free analysis as sketched below , as well as a distribution-dependent analysis (see Section 4.4), unlike an algorithm based on MOSS-type indices.

Another promising approach would be to use the Tsallis-INF algorithm introduced by Audibert and Bubeck (2009) and further studied by Zimmert and Seldin (2019), which achieves a (M m)

KT adversarial regret bound when M and m are known. Unfortunately, current analyses of the algorithm rely crucially on the non-positivity of the reward estimates, or, equivalently on the knowledge of an upper bound on the rewards. Zimmert and Lattimore (2019) relax this requirement, but not enough for the relaxed version to be applied to our case. However, when M is known and m is unknown, i.e., only adaptation to m is needed, the reward estimates can be made non-positive by taking C = M in the estimation scheme (8), and our techniques may be extended to show that Tsallis-INF indeed enjoys an adversarial regret bound of order (M m)

KT in this case. Details may be found in an extended version of this article [ar Xiv:2006.03378], Theorem 23 in Appendix F.

Proof sketch A direct application of the Ada Hedge regret bound (Lemma 3 and Theorem 6 of De Rooij et al., 2014), bounding the variance terms of the form E (X E[X])2 by E (X C)2 , ensures that

t=K+1 byt,k

t K+1 a [K]

qt,a byt,a 2 v u u t

t K+1 a [K]

qt,a byt,a C 2 ln K + M m

We take expectations, use the deﬁnition of the pt in terms of the qt in the left-hand side, and apply Jensen s inequality in the right-hand side to get

t=K+1 byt,k

=yt,At z }| { K X

a=1 pt,a byt,a +

E[...] [m M,M m] z }| { K X

a=1 (1/K qt,a) byt,a

a=1 E h qt,a byt,a C 2i ln K + M m

Adaptation to the Range in K Armed Bandits

Since pt,a (1 γt)qt,a with γt 1/2 by assumption on the extra-exploration rate, we have the bound qt,a 2pt,a. Together with standard calculations similar to (9), we have

E h qt,a byt,a C 2i 2 E h pt,a(byt,a C)2 Ht 1 i = 2 E (yt,At C)2

pt,a 1{At=a}

= 2 (yt,a C)2 | {z } (M m)2 .

The proof of the ﬁrst regret bound of the theorem is concluded by collecting all bounds and by taking care of the ﬁrst K rounds. The second regret bound then follows from straightforward calculations.

4.4 Regret Analysis, Part 2: Distribution-Dependent Rates for Adaptation

Given the conversion explained in Section 4.1, Algorithm 1 tuned as in Corollary 7 for α [1/2, 1) also enjoys the scale-free distribution-free regret bound ΦAHB free (T) = ΦAHB adv (T) of order T α. The theorem below entails that AHB is adaptive to the unknown range with a distribution-dependent regret rate T/ΦAHB free(T) of order T 1 α that is optimal given the lower bound stated by Theorem 3.

Theorem 9 Consider AHB (Algorithm 1) tuned with some α [1/2, 1) as in the second part of Theorem 7. For all distributions ν1, . . . , νK in D ,+,

RT (ν) T/ΦAHB free(T) 12 ln K

a=1 a . (12)

The proof is provided in Appendix C. It follows quite closely that of Theorem 3 in Seldin and Lugosi (2017), where the authors study a variant of the Exp3 algorithm of Auer et al. (2002b) for stochastic rewards. It consists, in our setting, in showing that the number of times the algorithm chooses suboptimal arms is almost only determined by the extraexploration. Our proof is simpler as we aim for cruder bounds. The main technical diﬀerence and issue to solve lies in controlling the learning rates ηt, which heavily depend on data in our case.

5. Numerical Illustrations

We provide some numerical experiments on synthetic data to illustrate the qualitative behavior of some popular algorithms like UCB strategies when they are incorrectly tuned, as opposed to strategies that are less sensitive to ignoring the range or to the AHB strategy which adapts to it. These experiments are only of an illustrative nature. Bandit problems considered and UCB strategies. We consider stochastic bandit problems ν(α) = (ν(α) a )a [K] indexed by a scale parameter α {0.01, 1, 100}. We take K = 10 arms, each arm a being associated with a rectiﬁed Gaussian distribution. Precisely, the distribution ν(α) a is the distribution of the variable

( α max 0, min{Y, 1.2} with Y N(0.6 , V ) if a = 1,

α max 0, min{Y, 1} with Y N(0.5 , V ) if a = 1,

Hadiji and Stoltz

0.0 0.2 0.4 0.6 0.8 1.0 1.2 0

0.0 0.2 0.4 0.6 0.8 1.0 1.2

Figure 1: Probability density functions of the reward distributions with respect to the sum of the Lebesgue measure and Dirac masses at 0, 1, and 1.2. Left pictures: highvariance case; right pictures: low-variance case. Top pictures: ﬁrst arm (optimal arm); bottom pictures: other arms. Arrows represent atoms and their lengths are only illustrative.

so that all distributions are commonly supported on [m, M] = [0, 1.2 α], with arm 1 being the unique optimal arm. We will consider two values for V , namely V = 0.01 (low-variance case) and V = 0.25 (high-variance case). See Figure 1 for a plot of the corresponding probability density functions. We denote by µ(α) 1 = 0.6 α and µ(α) a = 0.5 α if a = 1 the means associated with the distributions ν(α) 1 and ν(α) a , respectively. The gaps therefore equal (α) a = 0.1 α for a 2. The main algorithm of interest is, of course, the AHB strategy with extra-exploration (Algorithm 1), which we tune as stated in Theorem 7 with parameter 1/2. We now present the competitors.

UCB strategies at diﬀerent scales. We consider instances of UCB (Auer et al., 2002a) using indices of the form

where Na(t) is the number of times arm a was pulled up to round t, and where bµa(t) denotes the empirical average of payoﬀs obtained for arm a. We hesitated between setting σ2 based on the range M m = 1.2α, namely, σ2 = (M m)2/4 = (1.2α)2, or based on a sub Gaussian parameter, which would be smaller. As distributions ν(α) a are rectiﬁed Gaussians, it is not immediately clear whether they are sub-Gaussian, but we considered despite all the choice σ2 = V . It turns out that this second choice outperformed the ﬁrst one, which is why, in the rest of the study, we consider the following three instances of UCB:

Na(t) , where s {0.01, 1, 100} .

When the scale parameter α is known, we would take s = α.

Adaptation to the Range in K Armed Bandits

Range-estimating UCB. We also study a version of UCB estimating the range, namely, using indices

bµa(t) + ˆrt

2 ln T Na(t) , where ˆrt = max s t YAs,s min s t YAs,s

estimates the range M m. We were unable to provide theoretical guarantees that match our lower bounds, and this algorithm does not perform particularly well in practice as we will discuss below. ε greedy. Finally, we also consider the ε greedy strategy, which, at round t K + 1, picks with probability 1 εt the arm with the best empirical mean, and otherwise, selects an arm uniformly at random. Following Auer et al. (2002a), we used the tuning

εt = min 1, 5K

with d = 1/12 .

Indeed, Auer et al. (2002a) exhibit theoretical guarantees for distributions over [0, 1] in the case where d is smaller than or equal to the smallest gap. When rescaled on [0, 1], the smallest gap equals 0.1α(M m)/(1.2α) = 1/12 in our setting; this explains our choice d = 1/12, but note that the ε greedy strategy deﬁned above relies on some extra knowledge encompassed in the choice d = 1/12, compared to the completely agnostic AHB strategy. Interestingly, for any ﬁxed-in-advance sequence of εt, the ε greedy strategy is scale-free. Of course, its strong downside is that a proper tuning of the εt requires knowledge of a scaled lower bound on the gaps. Experimental setting. Each algorithm is run N = 100 times, on a time horizon T = 100,000. We plot estimates of the rescaled regret RT (ν(α))/α to have a meaningful comparison between the bandit problems. These estimates are constructed as follows. We index the arms picked in the n th run by an additional subscript n, so that AT,n refers to the arm picked by some strategy at time t in the n th run. The expected regret of a given strategy can be rewritten as

RT (να) = T max a [K] µ(α) a E

t=1 µ(α) At

= T (0.6 α) E

t=1 µ(α) At

and is estimated by

b RT (α) = 1

n=1 b RT (α, n) where b RT (α, n) = T (0.6 α)

t=1 µ(α) At,n .

On Figures 2 and 3 we plot the estimates b RT (α)/α of the rescaled regret as solid lines. The shaded areas correspond to 2 standard errors of the sequences b RT (α, n)/α

Discussion of the results. An initial observation is that, as expected, the performance of AHB, the range-estimating UCB, and ε-greedy is unaﬀected by the scale of the problems (see the second lines of Figures 2 and 3). It turns out that out of these three algorithm, AHB performs best. A second observation is that the performance of UCB depends dramatically on the value of the parameter s. UCB performs like follow-the-leader when s is too small, and

Hadiji and Stoltz

UCB, s = 0.01 UCB, s = 1 UCB, s = 100

Range estimating UCB AHB with extra exploration

104 105 104 105

Figure 2: Comparison of the (estimated) regrets of various strategies over bandit problems ν(α) in the high variance case, where α ranges in {0.01, 1, 100} and V = 0.25. Each algorithm was run N = 100 times on every problem for T = 100,000 time steps. Solid lines report the values of the estimated regrets, while shaded areas correspond to 2 standard errors of the estimates.

UCB, s = 0.01 UCB, s = 1 UCB, s = 100

Range estimating UCB AHB with extra exploration

104 105 104 105

Figure 3: Same legend as for Figure 2, but in the low variance case.

Adaptation to the Range in K Armed Bandits

like random play when s is too large; both of these strategies suﬀer linear regret and UCB incorrectly scaled also does so (see the ﬁrst lines of Figures 2 and 3). It remains to compare AHB to UCB tuned with the correct scale: the ranking between the two depends on the value of V , with AHB outperforming UCB tuned with the correct scale in the high-variance case and vice versa in the low variance case. Our last observation is that in the low-variance case, the range-estimating version of UCB is far oﬀfrom UCB tuned with the correct scale. This is because of the large diﬀerence between the sub-Gaussian parameter and its upper bound given by the squared half-range, which the range-estimating version of UCB is targeting.

Acknowledgments

H edi Hadiji and Gilles Stoltz have no direct funding to acknowledge other than the salaries paid by their employers (Universit e Paris-Saclay, CNRS, and HEC Paris). They have no competing interests to declare.

Appendix A. More on Scale-Free Distribution-Dependent Regret Bounds Considered in Isolation

This section details the claims of Section 2.2: no strategy may be adaptive to the range and achieve Φdep = ln T (Section A.2) but we may construct a strategy adaptive to the range and achieving Φdep ln T (Section A.3). Before we do so, we provide a reminder on a general, and optimal, distribution-dependent regret lower bound for K armed stochastic bandits (Section A.1).

A.1 Reminder of a General Regret Lower Bound for K Armed Bandits

This section considers some general model D. It also rules out poor strategies by restricting its attention to so-called consistent strategies according to the terminology introduced by Lai and Robbins (1985), while Burnetas and Katehakis, 1996 rather speak of uniformly fast convergent strategies.

Deﬁnition 10 A strategy is consistent on a model D if for all bandit problems ν in D, it achieves a subpolynomial regret bound, that is, RT (ν)/T α 0 for all α (0, 1].

A lower bound on the distribution-dependent rates that such a strategy may achieve is provided by a general, and optimal, result of Lai and Robbins (1985) and Burnetas and Katehakis (1996); see also its rederivation by Garivier et al. (2019b). It involves a quantity deﬁned as an inﬁmum of Kullback-Leibler divergences: we recall that for two probability distributions ν, ν deﬁned on the same probability space (Ω, F),

KL(ν, ν ) =

dν if ν ν ,

+ otherwise,

Hadiji and Stoltz

where ν ν means that ν is absolutely continuous with respect to ν and dν/dν then denotes the Radon-Nikodym derivative. Now, for any probability distribution ν, any real number x, and any model D, we deﬁne

Kinf(ν, x, D) = inf KL(ν, ν ) : ν D and E(ν ) > x ,

where by convention, the inﬁmum of an empty set equals + and where we denoted by E(ν ) the expectation of ν . The quantity Kinf(ν, x, D) can be null. With the usual measuretheoretic conventions, in particular, 0/0 = 0, we then have the following lower bound.

Reminder 1 For all models D, for all consistent strategies on D, for all bandit problems ν in D,

lim inf T + RT (ν)

a Kinf(νa, µ , D) .

The case of a known payoﬀrange [m, M]. When the payoﬀrange [m, M] is known, i.e., when the model is Dm,M, there exist strategies achieving the lower bound of Reminder 1, like the DMED strategy of Honda and Takemura (2011, 2015) or the KL UCB strategy of Capp e et al. (2013) and Garivier et al. (2019a). The case of a known payoﬀupper bound M. The DMED strategy of Honda and Takemura (2015) actually achieves the lower bound of Reminder 1 even for the model

and for the model D ,M of all distributions upper bounded by M but not necessarily lower bounded. This suggests that adaptation to M is much more diﬃcult than adaptation to m as far as distribution-dependent regret bounds are considered, and is in line with Remark 4. That M is more important than m for distribution-dependent bounds is also reﬂected in the lower bound of Reminder 1: this lower bound does not depend on whether D equals some Dm,M, or D ,M, or even D ,M. We may indeed easily show (see an extended version of this article [ar Xiv:2006.03378], Appendix E) that given M R, for all m M, for all ν Dm,M and all µ > E(ν),

Kinf ν, µ, Dm,M = Kinf ν, µ, D ,M .

A.2 Adaptation to the Range Impossible at Logarithmic Distribution-Dependent Rate

A strategy that would be adaptive to the range with a distribution-dependent rate Φdep = ln would, by deﬁnition and in particular, be consistent on D ,+. The following theorem therefore shows, by contradiction, that no strategy may be adaptive to the range with a distribution-dependent rate Φdep = ln. A similar phenomenon was discussed by Lattimore (2017) in the case of stochastic bandits with Gaussian distributions.

Theorem 11 For all distributions νa D ,+ with expectation µa, and all µ > µa,

Kinf(νa, µ , D ,+) = 0 .

Adaptation to the Range in K Armed Bandits

As a consequence, all consistent strategies on D ,+ are such that, for all bandit problems ν in D ,+ with at least one suboptimal arm a,

lim inf T + RT (ν)

Interestingly, Cowan and Katehakis (2015) observe that for the model of uniform distributions over bounded intervals, the Kinf is positive, and thus the lower bound of Reminder 1 does not prevent logarithmic regret bounds. In fact, they also provide an algorithm enjoying optimal distribution-dependent bounds thus being, in a sense, adaptive to the range in that very restricted model.

Proof We denote by [m, M] an interval containing the support of νa. We remind the reader of the model Dm,+ deﬁned in (6), composed of all bounded distributions with unknown upper end on the range but known lower end m on the range. As Dm,+ D ,+ and by deﬁnition of Kinf, Kinf(νa, µ , D ,+) Kinf(νa, µ , Dm,+) ,

so that it suﬃces to show that Kinf(νa, µ , Dm,+) = 0. We have in particular µa m. We use the same construction as in the proof of Theorem 3. Let ν ε = (1 ε)νa+εδµa+2 a/ε for ε (0, 1): it is a bounded probability distribution, with lower end of support larger than m, that is, ν ε Dm,+. For ε small enough, µa+2 a/ε lies outside of the bounded support of νa. In that case, the density of νa with respect to ν ε is given by 1/(1 ε) on the support of νa and 0 elsewhere, so that

KL νa, ν ε = ln 1 1 ε

Moreover, E ν ε = (1 ε)µa + ε µa + 2 a/ε = µa + 2 a = µ + a > µ . Therefore, by deﬁnition of Kinf as an inﬁmum,

Kinf(νa, µ , Dm,+) KL νa, ν ε = ln 1 1 ε

This upper bound holds for all ε > 0 small enough and thus shows Kinf(νa, µ , Dm,+) = 0. The second part of the theorem follows from Reminder 1, from the existence of an arm a with a = µ µa > 0, and from the fact that Kinf(νa, µ , D ,+) = 0, as we established above.

Remark 12 Recall that Remark 4 deﬁned a notion of adaptation to the upper end M of the payoﬀrange. The proof above reveals that Theorem 11 holds with all occurrences of D ,+ replaced by Dm,+, for some m R. We may therefore similarly exclude a ln T distributiondependent rate for adaptation to the upper end M of the payoﬀrange. This observation is yet another example that the knowledge of the lower end m of the payoﬀrange does not critically change the picture, and the diﬃculty in ignoring a payoﬀ range lies in ignoring the upper end thereof.

Hadiji and Stoltz

A.3 UCB with an Increased Exploration Rate Adapts to the Range

The impossibility result implied by Theorem 11 does not prevent distribution-dependent rates for adaptation that are larger than a logarithm. Let ϕ be a non-decreasing function such that ϕ(t) ln t, like ϕ(t) = (ln t)2 or even ϕ(t) = (ln t) ln ln t. Lattimore (2017, Remark 8) introduced and studied, in the case of Gaussian bandits with unknown variances, the following variant of UCB, which we refer to in this section as UCB with an increased exploration rate ϕ:

ϕ(t) Na(t) where ϕ(t)

ln t + and ϕ(t)

and where bµa(t) denotes the empirical average of payoﬀs obtained till round t when playing arm a. The (asymptotic only) analysis of Lattimore (2017, Remark 8) relies on the fact that ϕ(t) 2(M m) ln t for t larger than some unknown threshold T0, and that after T0, the indexes are thus larger than the ones of the original version of UCB based on the knowledge of m and M. This argument readily extends to the case of sub-Gaussian distributions, where we recall that a distribution ν with expectation µ is v sub-Gaussian, with v > 0, if

t R, Z et(x µ)dν(x) evt2/2 .

Hoeﬀding s lemma proves that distributions over a bounded range [m, M] are (M m)2/4 sub-Gaussian. Based on a slightly diﬀerent proof than the one of Lattimore (2017, Remark 8), one can prove the following ﬁnite-time result where we did not aim for tight numerical constants.

Theorem 13 UCB with an increased exploration rate given by a non-decreasing function ϕ ensures that for all v > 0, for all distributions ν1, . . . , νK that are v sub-Gaussian, for all T K + 1,

| {z } main term

a [K] 2 a max 32v

t=K e ϕ(t)/(2v) !

| {z } smaller-order term: typically, a O(1)

Whenever ϕ ln, this strategy is therefore adaptive to the unknown range of payoﬀs with a distribution-dependent rate Φdep = ϕ.

The second part of the statement follows from the claimed bound given that ϕ ln entails ϕ(t) 4v ln t for t large enough, and therefore, e ϕ(t)/(2v) 1/t2. As a consequence, the sum tagged as smaller-order term in the bound is ﬁnite. Possible such choices are ϕ : t 7 (ln t)2, or even ϕ : t 7 (ln t)(ln ln t). However, as already mentioned in Lattimore (2017), as the distribution-dependent rate approaches ln t, the smaller-order term blows up. For example, if ϕ(t) = (ln t)2, the summands e (ln t)2/(2v) in the smaller-order term are larger than e 1 for all t e

2v: the smaller-order term is at least of the order of e

2v, and the regret thus carries an exponential dependence on v. In the case of a bounded range, this means an exponential

Adaptation to the Range in K Armed Bandits

dependence on the range M m. This is probably not an artifact of the proof: in the case of a bounded range, as long as ϕ(t) (M m) ln t, the lack of exploration bonus entails that the strategy behaves similarly to a follow-the-leader strategy, which is known to suﬀer catastrophic, i.e., linear, regret.

Proof As indicated above, we did not aim for tight numerical constants here and we somehow simpliﬁed the standard analysis of UCB by not considering thresholds of the form µ ε but rather µ a/4. Hence the non-standard (much increased) numerical factor in front of P a ln T/ a when we specify ϕ : t 7 2(M m)2 ln t into the bound. In this proof, we repeatedly use that i.i.d. random variables X1, . . . , Xn with a v sub Gaussian distribution with expectation µ satisfy, by the Cram er-Chernoﬀinequality, that for all ε > 0,

i=1 Xi µ + ε

inf λ>0 e nλε E

inf λ>0 e nλε evλ2/2 n = e nε2/(2v) ;

and we obtain a similar inequality for deviations of the form µ ε . Let a and a be an optimal and a suboptimal arm, respectively. Each arm is pulled once in the ﬁrst K round. We bound E Na(T)] by using that for t K, an arm At+1 is pulled only if its index has the highest value, and then introduce the threshold µ a/4 to separate the Ua(t) and the Ua (t):

t=K P Ua(t) Ua (t) and At+1 = a

t=K P Ua(t) µ a

4 and At+1 = a +

t=K P Ua (t) µ a

t=K P ˆµa(t) +

ϕ(t) Na(t) µ a

4 and At+1 = a

t=K P ˆµa (t) µ a

ϕ(t) Na (t)

t=K P ˆµa(t) +

ϕ(T) Na(t) µ a

4 and At+1 = a

t=K P ˆµa (t) µ a

ϕ(t) Na (t)

n=1 P ˆµa,n µ a

| {z } Sum(a)

n=1 P ˆµa ,n µ a

| {z } Sum(a )

Hadiji and Stoltz

Note that we used the fact that ϕ is non-decreasing to get to the last but one inequality, and we used optional skipping for the last one; we denote by ˆµa,n and ˆµa ,n the average of n i.i.d. rewards distributed according to νa and νa , respectively. We ﬁrst deal with Sum(a). Let N0 = 4 ϕ(T)/ 2 a . For n N0,

P ˆµa,n µ a

P ˆµa,n µa + a

e n 2 a/(32v) .

Sum(a) N0 1+ X

n N0 e n 2 a/(32v) 4ϕ(T)

2a + 1 1 e 2a/(32v) 4ϕ(T)

2a +2 max 32v

where we used1 in the last step 1/(1 e x) 2/x for x (0, 1] and 1/(1 e x) 2 for x 1. For Sum(a ), we apply the Cram er-Chernoﬀinequality, then use (x + y)2 x2 + y2 for x, y 0, and ﬁnally apply the same inequalities on 1/(1 e x) as for the other sum:

n=1 P ˆµa ,n µ a

n=1 e n a/4+

ϕ(t)/n 2 /(2v)

n=1 e n 2 a/(32v) e ϕ(t)/(2v)

t=K e ϕ(t)/(2v) 1 1 e 2a/(32v)

2a , 1 T 1 X

t=K e ϕ(t)/(2v) .

The proof is concluded by substituting the bounds in RT (ν) = X

a [K] a E Na(T)].

Appendix B. Proof of Theorem 7

How the second regret bound follows from the ﬁrst one. We substitute the stated values of the γt. We have, ﬁrst,

5(1 α)K ln K

t=K+1 t α p

5(1 α)K ln K Z T

1 α T 1 α ,

(13) second, using the deﬁnition of γT as a minimum,

1/2 + T αK ln K p

5(1 α)K ln K = 2K ln K +

K ln K 5(1 α) T α ,

T T max{α,1 α}, so that the ﬁrst regret bound of Theorem 7 is further bounded by

T max{α,1 α} + 10(M m)K ln K .

1. For a bounded distribution, the case x > 1 does not occur as x = 2 a/(32v) = 2 a/ 8(M m)2 1/8; but it may occur for other sub-Gaussian distributions.

Adaptation to the Range in K Armed Bandits

The claimed expression for Φadv(T) is obtained by bounding 2

First regret bound. In Algorithm 1, for time steps t K + 1, the weights qt are obtained by using the Ada Hedge algorithm of De Rooij et al. (2014) on the payoﬀestimates byt,a. Ada Hedge is designed for the case of a full monitoring not of a bandit monitoring , but the use of these estimates emulates a full monitoring. Section 2.2 of De Rooij et al. (2014) see also an earlier analysis by Cesa-Bianchi et al. (2007) ensures the bound stated next in Reminder 2. We call pre-regret the quantity at hand in Reminder 2: it corresponds to some regret deﬁned in terms of the payoﬀestimates.

Reminder 2 (Application of Lemma 3 and Theorem 6 of De Rooij et al., 2014) For all sequences of payoﬀestimates byt,a lying in some bounded real-valued interval, denoted by [b, B], for all T K + 1, the pre-regret of Ada Hedge satisﬁes

t=K+1 byt,k

a=1 qt,a byt,a 2

k [K] qt,k byt,k

a=1 qt,a(byt,a c)2 ln K for any c R

+(B b) 1 + 2

and Ada Hedge does not require the knowledge of [b, B] to achieve this bound.

The bound of Reminder 2 will prove itself particularly handy for three reasons: ﬁrst, it is valid for real-valued payoﬀs; second, it is adaptive to the range of payoﬀs; third, the right-hand side looks at ﬁrst sight not intrinsic enough a bound, as it also depends on the weights qt, but we will see later that this dependency is particularly useful in our speciﬁc case. To the best of our knowledge, this is the ﬁrst direct application of the Ada Hedge bound depending on the weights qt (previous applications were rather solving inequations on the regret, e.g., to get improvements for small losses; see Cesa-Bianchi et al., 2007 and De Rooij et al., 2014). We recall that we start the summation in Reminder 2 at t = K+1 because the Ada Hedge algorithm is only started at this time, after the initial exploration. The bound holding for any c R is obtained by a classical bound on the variance.

Proof of the ﬁrst bound of Theorem 7 We deal with the contribution of the initial exploration by using the inequality max(u + v) max u + max v, together with the fact that yt,a yt,AT M m for any a [K]:

RT (y1:T ) max a [K]

| {z } K(M m)

+ max a [K]

t=K+1 yt,a E

t=K+1 yt,At

Hadiji and Stoltz

We now transform the pre-regret bound of Reminder 2, which is stated with the distributions qt, into a pre-regret bound with the distributions pt; we do so while substituting the bounds B = C + KM/γT and b = C + Km/γT implied by (10) and the fact that (γt) is non-increasing, and by using the deﬁnition qt,a = pt,a γt(1/K qt,a) for all a [K]:

t=K+1 byt,k

a=1 pt,a byt,a +

a=1 (1/K qt,a) byt,a 2

a=1 qt,a(byt,a C)2 ln K + (M m)K

As noted by Auer et al. (2002b), by the very deﬁnition (8) of the estimates,

a=1 pt,a byt,a = yt,At .

By (9), the tower rule and the fact that qt is Ht 1 measurable, on the one hand, and the fact that the expectation of a maximum is larger than the maximum of expectations, on the other hand, the left-hand side of the ﬁrst inequality in (15) thus satisﬁes

t=K+1 byt,k

a=1 pt,a byt,a +

a=1 (1/K qt,a) byt,a

t=K+1 yt,k E

t=K+1 yt,At

| {z } [m,M]

a=1 E qt,a yt,a

| {z } [m,M]

t=K+1 yt,k E

t=K+1 yt,At

As for the right-hand side of the second inequality in (15), we ﬁrst note that by deﬁnition (see line 4 in Algorithm 1), pt,a (1 γt)qt,a with γt 1/2 by assumption on the extraexploration rate, so that qt,a 2pt,a; therefore, by substituting ﬁrst this inequality and then by using Jensen s inequality,

a=1 qt,a(byt,a C)2 ln K

a=1 pt,a(byt,a C)2 ln K

a=1 E h pt,a(byt,a C)2 i ln K .

Standard calculations (see Auer et al., 2002b again) show, similarly to (9), that for all a [K],

E h pt,a(byt,a C)2 Ht 1 i = E (yt,At C)2

pt,a 1{At=a}

= (yt,a C)2 (M m)2 ,

Adaptation to the Range in K Armed Bandits

where the last inequality comes from (10). By the tower rule, the same upper bound holds for the (unconditional) expectation. Therefore, taking the expectation of both sides of (15) and collecting all bounds together, we proved so far

RT (y1:T ) 2

2 |{z} 3 (M m)

KT ln K + (M m)K ln K

where we used γT 1/2 and ln K ln 2 as K 2.

Appendix C. Proof of Theorem 9

Given the decomposition (1) of the regret, it is necessary and suﬃcient to upper bound the expected number of times E[Na(t)] any suboptimal arm a is drawn, where by deﬁnition of Algorithm 1,

E[Na(t)] = 1 + E

(1 γt)qt,a + γt

t=K+1 E[qt,a] + 1

We show below (and this is the main part of the proof) that

t=K+1 E[qt,a] = O(ln T) . (17)

The straightforward calculations (13) already showed that

5 ln K (1 α)K T 1 α .

Substituting the value (11) of ΦAHB free (T) = Φadv(T) and using the decomposition (1) of RT (ν) into P a E[Na(t)] then yield

RT (ν) T/ΦAHB free(T) X

5 ln K (1 α)K

K ln K 1 + o(1) + O ln T

from which the stated bound follows, via the crude inequality 3

5 1 α + 5 12. Structure of the proof of (17). Let a denote an optimal arm. By deﬁnition of qt,a and by lower bounding a sum of exponential terms by any of the summands, we get

s=K+1 byt,a

s=K+1 byt,k

t=K+1 (byt,a byt,a )

Hadiji and Stoltz

Then, by separating cases, depending on whether Pt 1 t=K+1(byt,a byt,a ) is smaller or larger than the threshold (t 1 K) a/2, and by remembering that the probability qt,a is always smaller than 1, we get

t=K+1 E[qt,a]

t=K+1 E exp ηt (t 1 K) a

s=K+1 (bys,a bys,a ) (t 1 K) a

We show that the sums in the right-hand side of (18) are respectively O(1) and O(ln T). First sum in the right-hand side of (18). Given the deﬁnition of the learning rates (see the statement of Algorithm 1), namely,

s=K+1 δs , (19)

we are interested in upper bounds on the sum of the δs. Such upper bounds were already derived in the proof of Theorem 7; the second inequality in (15) together with the bound qt,a 2pt,a stated in the middle of the proof immediately yield

a=1 qs,a bys,a C 2 ln K + (M m)K

a=1 ps,a bys,a C 2 ln K + (M m)K

Unlike what we did to complete the proof of Theorem 7, we do not take expectations and rather proceed with deterministic bounds. By the deﬁnition (8) of the estimated payoﬀs for the equality below, by (10) for the ﬁrst inequality below, and by the fact that the exploration rates are non-increasing for the second inequality below, we have, for all s K + 1,

a=1 ps,a bys,a C 2 =

ps,As (M m)2

γs/K (M m)2

γt/K . (20)

γt + (M m)K

def = Dt = Θ p

t/γt + 1/γt .

For the sake of concision, we denoted by Dt the obtained bound. Via the deﬁnition (19) of ηt, the sum of interest is in turn bounded by

t=K+1 exp ηt t 1 K a

t=K+1 exp a ln K

Adaptation to the Range in K Armed Bandits

where the equality to O(1), i.e., the fact that the considered series is bounded, follows from the fact that

(t 1 K)/Dt = Θ tγt + tγt = Θ t(1 α)/2 + t1 α .

Second sum in the right-hand side of (18). We will use Bernstein s inequality for martingales, and more speciﬁcally, the formulation of the inequality by Freedman (1975, Theorem 1.6) see also Massart (2007, Section 2.2) , as stated next.

Reminder 3 Let (Xn)n 1 be a martingale diﬀerence sequence with respect to a ﬁltration (Fn)n 0, and let N 1 be a summation horizon. Assume that there exist real numbers b and v N such that, almost surely,

n N, Xn b and

n=1 E X2 n Fn 1 v N .

Then for all δ (0, 1),

For s K + 1, we consider the increments Xs = a bys,a + bys,a, which are adapted to the ﬁltration Fs = σ(A1, Z1, . . . , As, Zs), where we recall that Z1, . . . , Zs denote the payoﬀs obtained in rounds 1, . . . , s. Also, as ps is measurable with respect to past information Fs 1 and since payoﬀs are drawn independently from everything else (see Section 2), we have, by the deﬁnition (8) of the estimated payoﬀs (where we rather denote by Ys,a the payoﬀs drawn at random according to νa, to be in line with the notation of Section 2 for stochastic bandits): for all a [K],

E bys,a Fs 1 = E[Ys,a | Fs 1] C

ps,a 1{As=a} + C = µa C

ps,a 1{As=a} + C = µa .

As a consequence, E[Xs | Fs 1] = E a bys,a + bys,a | Fs 1 = 0. Put diﬀerently, (Xs)s K+1 is indeed a martingale diﬀerence sequence with respect to the ﬁltration (Fs)s K. We now check that the additional assumptions of Reminder 3 are satisﬁed. Manipulations and arguments similar to the ones used in (10) and (20) show that for all s K + 1,

a bys,a + bys,a a Ys,a C

ps,a 1{As=a } + Ys,a C

ps,a 1{As=a}

(M m)(1 + K/γs) b def = (M m)(1 + K/γt) .

For the variance bound, we ﬁrst note that for all s t 1, we have (bys,a C)(bys,a C) = 0 because of the indicator functions, and therefore,

E h a bys,a + bys,a 2 Fs 1 i E h bys,a + bys,a 2 Fs 1 i

E h bys,a C 2 Fs 1 i + E h bys,a C 2 Fs 1 i ;

Hadiji and Stoltz

in addition, for all a [K], including a ,

E h bys,a C 2 Fs 1 i = E (Ys,As C)2

p2s,a 1{As=a}

ps,a (M m)2K

s=K+1 E h a bys,a + bys,a 2 Fs 1 i 2K(M m)2(t 1 K)

γt vt def = 2(M m)2t K

Bernstein s inequality (Reminder 3) may thus be applied; the choice δ = 1/t therein leads to

a (bys,a bys,a) 2(M m)

γt ln t + M m

t/γt = O(t(1+α)/2) and 1/γt = O(tα) as t , where α < 1, and as a > 0 given that we are considering a suboptimal arm a, there exists t0 N such that for all t t0,

D t (t 1 K) a

s=K+1 (bys,a bys,a ) (t 1 K) a

a (bys,a bys,a) (t 1 K) a

a (bys,a bys,a) D t

Therefore, as T

t=K+1 (byt,a byt,a ) (t 1 K) a

= O(ln T) ,

as claimed. This concludes the proof.

C. Allenberg, P. Auer, L. Gy orﬁ, and G. Ottucs ak. Hannan consistency in on-line learning in case of unbounded losses under partial monitoring. In Proccedings of the 17th International Conference on Algorithmic Learning Theory (ALT 06), pages 229 243. Springer, 2006.

Adaptation to the Range in K Armed Bandits

J.-Y. Audibert and S. Bubeck. Minimax policies for adversarial and stochastic bandits. In Proceedings of the 22nd Annual Conference on Learning Theory (COLT 09), pages 217 226. Omnipress, 2009.

J.-Y. Audibert, R. Munos, and C. Szepesv ari. Exploration exploitation tradeoﬀusing variance estimates in multi-armed bandits. Theoretical Computer Science, 410(19):1876 1902, 2009.

P. Auer, N. Cesa-Bianchi, and P. Fischer. Finite-time analysis of the multiarmed bandit problem. Machine Learning, 47(2-3):235 256, 2002a.

P. Auer, N. Cesa-Bianchi, Y. Freund, and R.E. Schapire. The nonstochastic multiarmed bandit problem. SIAM Journal on Computing, 32(1):48 77, 2002b.

D. Baudry, P. Saux, and O.-A. Maillard. From optimality to robustness: Adaptive resampling strategies in stochastic bandits. In Advances in Neural Information Processing Systems, volume 34, pages 14029 14041, 2021.

S. Bubeck, M.B. Cohen, and Y. Li. Sparsity, variance and curvature in multi-armed bandits. In Proceedings of the 29th International Conference on Algorithmic Learning Theory (ALT 18), volume 83 of PMLR, pages 111 127, 2018.

A.N. Burnetas and M.N. Katehakis. Optimal adaptive policies for sequential allocation problems. Advances in Applied Mathematics, 17(2):122 142, 1996.

O. Capp e, A. Garivier, O.-A. Maillard, R. Munos, and G. Stoltz. Kullback Leibler upper conﬁdence bounds for optimal sequential allocation. The Annals of Statistics, 41(3): 1516 1541, 2013.

N. Cesa-Bianchi and G. Lugosi. Prediction, Learning, and Games. Cambridge University Press, 2006.

N. Cesa-Bianchi and O. Shamir. Bandit regret scaling with the eﬀective loss range. In Proceedings of the 29th International Conference on Algorithmic Learning Theory (ALT 18), volume 83 of PMLR, pages 128 151, 2018.

N. Cesa-Bianchi, Y. Mansour, and G. Stoltz. Improved second-order bounds for prediction with expert advice. Machine Learning, 66(2-3):321 352, 2007.

Y. Chow and H. Teicher. Probability Theory. Springer, 1988.

W. Cowan and M.N. Katehakis. An asymptotically optimal policy for uniform bandits of unknown support, 2015. Preprint, ar Xiv:1505.01918.

W. Cowan, J. Honda, and M.N. Katehakis. Normal bandits of unknown means and variances. Journal of Machine Learning Research, 18(154):1 28, 2018.

S. De Rooij, T. van Erven, P.D. Gr unwald, and W.M. Koolen. Follow the leader if you can, hedge if you must. Journal of Machine Learning Research, 15(37):1281 1316, 2014.

J.L. Doob. Stochastic Processes. Wiley Publications in Statistics. John Wiley & Sons, 1953.

Hadiji and Stoltz

D.A Freedman. On tail probabilities for martingales. The Annals of Probability, 3(1): 100 118, 1975.

A. Garivier, H. Hadiji, P. M enard, and G. Stoltz. KL-UCB-Switch: Optimal regret bounds for stochastic bandits from both a distribution-dependent and a distribution-free viewpoints, 2019a. Preprint, ar Xiv:1805.05071.

A. Garivier, P. M enard, and G. Stoltz. Explore ﬁrst, exploit next: The true shape of regret in bandit problems. Mathematics of Operations Research, 44(2):377 399, 2019b.

S. Gerchinovitz and T. Lattimore. Reﬁned lower bounds for adversarial bandits. In Advances in Neural Information Processing Systems, volume 29, pages 1198 1206, 2016.

H. Hadiji. Polynomial cost of adaptation for X-armed bandits. In Advances in Neural Information Processing Systems, volume 32, pages 1029 1038, 2019.

J. Honda and A. Takemura. An asymptotically optimal policy for ﬁnite support models in the multiarmed bandit problem. Machine Learning, 85:361 391, 2011.

J. Honda and A. Takemura. Non-asymptotic analysis of a new bandit algorithm for semibounded rewards. Journal of Machine Learning Research, 16(113):3721 3756, 2015.

J. Huang, Y. Dai, and L. Huang. Scale-free adversarial multi-armed bandit with arbitrary feedback delays, 2021. Preprint, ar Xiv:2110.13400.

R. Kleinberg. Nearly tight bounds for the continuum-armed bandit problem. In Advances in Neural Information Processing Systems, volume 17, pages 697 704, 2004.

J. Kwon and V. Perchet. Gains and losses are fundamentally diﬀerent in regret minimization: The sparse case. Journal of Machine Learning Research, 17(227):1 32, 2016.

T.L. Lai and H. Robbins. Asymptotically eﬃcient adaptive allocation rules. Advances in Applied Mathematics, 6(1):4 22, 1985.

T. Lattimore. A scale free algorithm for stochastic bandits with bounded kurtosis, 2017. Preprint ar Xiv:1703.08937, later published, with the omission of some remarks, in Advances in Neural Information Processing Systems, volume 30, pages 1584 1593, 2017.

P. Massart. Concentration Inequalities and Model Selection, volume XXXIII of Ecole d Et e de Probabilit es de Saint-Flour. Springer, 2007. Lectures given in 2003, published in 2007.

S.R. Putta and S. Agrawal. Scale-free adversarial multi armed bandits. In Proceedings of the 33rd International Conference on Algorithmic Learning Theory (ALT 22), volume 167 of PMLR, pages 910 930, 2022.

Y. Seldin and G. Lugosi. An improved parametrization and analysis of the EXP3++ algorithm for stochastic and adversarial bandits. In Proceedings of the 30th Annual Conference on Learning Theory (COLT 17), volume 65 of PMLR, pages 1743 1759, 2017.

Adaptation to the Range in K Armed Bandits

G. Stoltz. Incomplete Information and Internal Regret in Prediction of Individual Sequences. Ph D thesis, Universit e Paris-Sud, 2005. URL https://tel.archives-ouvertes.fr/ tel-00009759/document.

T.S. Thune and Y. Seldin. Adaptation to easy data in prediction with limited advice. In Advances in Neural Information Processing Systems, volume 31, pages 2909 2918, 2018.

C.-Y. Wei and H. Luo. More adaptive algorithms for adversarial bandits. In Proceedings of the 31st Conference On Learning Theory (COLT 18), volume 75 of PMLR, pages 1263 1291, 2018.

J. Zimmert and T. Lattimore. Connections between mirror descent, Thompson sampling and the information ratio. In Advances in Neural Information Processing Systems, volume 32, pages 11973 11982, 2019.

J. Zimmert and Y. Seldin. An optimal algorithm for stochastic and adversarial bandits. In Proceedings of the 22nd International Conference on Artiﬁcial Intelligence and Statistics (AIStats 20), volume 89 of PMLR, pages 467 475, 2019.