# Refined Lower Bounds for Adversarial Bandits

Sébastien Gerchinovitz
Institut de Mathématiques de Toulouse, Université Toulouse 3 Paul Sabatier, Toulouse, 31062, France
sebastien.gerchinovitz@math.univ-toulouse.fr

Tor Lattimore
Department of Computing Science, University of Alberta, Edmonton, Canada
tor.lattimore@gmail.com

30th Conference on Neural Information Processing Systems (NIPS 2016), Barcelona, Spain.

We provide new lower bounds on the regret that must be suffered by adversarial bandit algorithms. The new results show that recent upper bounds that either (a) hold with high probability, (b) depend on the total loss of the best arm, or (c) depend on the quadratic variation of the losses, are close to tight. Besides this we prove two impossibility results. First, the existence of a single arm that is optimal in every round cannot improve the regret in the worst case. Second, the regret cannot scale with the effective range of the losses. In contrast, both results are possible in the full-information setting.

## 1 Introduction

We consider the standard $K$-armed adversarial bandit problem, which is a game played over $T$ rounds between a learner and an adversary. In every round $t \in \{1, \dots, T\}$ the learner chooses a probability distribution $p_t = (p_{i,t})_{1 \le i \le K}$ over $\{1, \dots, K\}$. The adversary then chooses a loss vector $\ell_t = (\ell_{i,t})_{1 \le i \le K} \in [0,1]^K$, which may depend on $p_t$. Finally the learner samples an action $I_t \in \{1, \dots, K\}$ from $p_t$ and observes her own loss $\ell_{I_t,t}$. The learner would like to minimise her regret, which is the difference between the cumulative loss she suffered and the loss suffered by the optimal action in hindsight:

$$R_T(\ell_{1:T}) = \sum_{t=1}^T \ell_{I_t,t} - \min_{1 \le i \le K} \sum_{t=1}^T \ell_{i,t}\,,$$

where $\ell_{1:T} \in [0,1]^{TK}$ is the sequence of losses chosen by the adversary. A famous strategy is called Exp3, which satisfies $\mathbb{E}[R_T(\ell_{1:T})] = O(\sqrt{KT \log K})$, where the expectation is taken over the randomness in the algorithm and the choices of the adversary [Auer et al., 2002]. There is also a lower bound showing that for every learner there is an adversary for which the expected regret is $\mathbb{E}[R_T(\ell_{1:T})] = \Omega(\sqrt{KT})$ [Auer et al., 1995]. If the losses are chosen ahead of time, then the adversary is called oblivious, and in this case there exists a learner for which $\mathbb{E}[R_T(\ell_{1:T})] = O(\sqrt{KT})$ [Audibert and Bubeck, 2009].

One might think that this is the end of the story, but it is not so. While the worst-case expected regret is one quantity of interest, there are many situations where a refined regret guarantee is more informative. Recent research on adversarial bandits has primarily focussed on these issues, especially the questions of obtaining regret guarantees that hold with high probability as well as stronger guarantees when the losses are "nice" in some sense. While there is now a wide range of strategies with upper bounds that depend on various quantities, the literature is missing lower bounds for many cases, some of which we now provide. We focus on three classes of lower bound, which are described in detail below. The first addresses the optimal regret achievable with high probability, where we show there is little room for improvement over existing strategies. Our other results concern lower bounds that depend on some kind of regularity in the losses ("nice" data). Specifically, we prove lower bounds that replace $T$ in the regret bound with the loss of the best action (so-called first-order bounds) and with the quadratic variation of the losses (second-order bounds).
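The protocol and the Exp3 strategy mentioned above are easy to simulate. The following is a minimal, illustrative Python sketch of the bandit loop with the standard importance-weighted Exp3 update; the learning rate and all variable names are our own choices and are not taken from the paper.

```python
import numpy as np

def exp3(losses, eta, seed=0):
    """Run Exp3 on a fixed (oblivious) loss matrix of shape (T, K) with entries
    in [0, 1] and return the realised regret against the best arm in hindsight.
    Illustrative sketch only, not code from the paper."""
    rng = np.random.default_rng(seed)
    T, K = losses.shape
    weights = np.ones(K)
    cum_loss = 0.0
    for t in range(T):
        p = weights / weights.sum()            # p_t
        action = rng.choice(K, p=p)            # I_t ~ p_t
        loss = losses[t, action]               # only ell_{I_t, t} is observed
        cum_loss += loss
        estimate = np.zeros(K)
        estimate[action] = loss / p[action]    # importance-weighted loss estimate
        weights *= np.exp(-eta * estimate)     # exponential-weights update
    return cum_loss - losses.sum(axis=0).min()

T, K = 10_000, 5
losses = np.random.default_rng(1).uniform(size=(T, K))
print(exp3(losses, eta=np.sqrt(2 * np.log(K) / (T * K))))
```

The importance-weighted estimate keeps each coordinate of the estimated loss vector unbiased, which is the property the standard Exp3 analysis relies on.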
**High-probability bounds** Existing strategies Exp3.P [Auer et al., 2002] and Exp3-IX [Neu, 2015a] are tuned with a confidence parameter $\delta \in (0,1)$ and satisfy, for all $\ell_{1:T} \in [0,1]^{KT}$,

$$\mathbb{P}\left( R_T(\ell_{1:T}) \ge c\sqrt{KT\log(K/\delta)} \right) \le \delta \qquad (1)$$

for some universal constant $c > 0$. An alternative tuning of Exp3-IX or Exp3.P [Bubeck and Cesa-Bianchi, 2012] leads to a single algorithm for which, for all $\ell_{1:T} \in [0,1]^{KT}$ and all $\delta \in (0,1)$,

$$\mathbb{P}\left( R_T(\ell_{1:T}) \ge c\,\big(\log(K) + \log(1/\delta)\big)\sqrt{KT} \right) \le \delta\,. \qquad (2)$$

The difference is that in (1) the algorithm depends on $\delta$ while in (2) it does not. The cost of not knowing $\delta$ is that the $\log(1/\delta)$ moves outside the square root. In Section 2 we prove two lower bounds showing that there is little room for improvement in either (1) or (2).

**First-order bounds** An improvement over the worst-case regret bound of $O(\sqrt{TK})$ is the so-called improvement for small losses. Specifically, there exist strategies (e.g., FPL-TRIX by Neu [2015b], with earlier results by Stoltz [2005], Allenberg et al. [2006], Rakhlin and Sridharan [2013]) such that for all $\ell_{1:T} \in [0,1]^{KT}$,

$$\mathbb{E}[R_T(\ell_{1:T})] \le O\left( \sqrt{L_T^\star K \log(K)} + K\log(KT) \right), \quad \text{with } L_T^\star = \min_{1\le i\le K}\sum_{t=1}^T \ell_{i,t}\,, \qquad (3)$$

where the expectation is with respect to the internal randomisation of the algorithm (the losses are fixed). This result improves on the $O(\sqrt{KT})$ bounds since $L_T^\star \le T$ is always guaranteed and sometimes $L_T^\star$ is much smaller than $T$. In order to evaluate the optimality of this bound, we first rewrite it in terms of the small-loss balls $B_{\alpha,T}$ defined for all $\alpha \in [0,1]$ and $T \ge 1$ by

$$B_{\alpha,T} = \left\{ \ell_{1:T} \in [0,1]^{KT} : L_T^\star \le T\alpha \right\}. \qquad (4)$$

**Corollary 1.** The first-order regret bound (3) of Neu [2015b] is equivalent to: for all $\alpha \in [0,1]$,

$$\sup_{\ell_{1:T} \in B_{\alpha,T}} \mathbb{E}[R_T(\ell_{1:T})] \le O\left( \sqrt{\alpha T K \log(K)} + K\log(KT) \right).$$

The proof is straightforward. Our main contribution in Section 3 is a lower bound of the order of $\sqrt{\alpha TK}$ for all $\alpha = \Omega(\log(T)/T)$. This minimax lower bound shows that we cannot hope for a better bound than (3) (up to log factors) if we only know the value of $L_T^\star$.

**Second-order bounds** Another type of improved regret bound was derived by Hazan and Kale [2011b] and involves a second-order quantity called the quadratic variation:

$$Q_T = \sum_{t=1}^T \left\| \ell_t - \mu_T \right\|_2^2 \le \frac{TK}{4}\,, \quad \text{where } \mu_T = \frac{1}{T}\sum_{t=1}^T \ell_t \text{ is the mean of all loss vectors.} \qquad (5)$$

(In other words, $Q_T/T$ is the sum of the empirical variances of all the $K$ arms.) Hazan and Kale [2011b] addressed the general online linear optimisation setting. In the particular case of adversarial $K$-armed bandits with an oblivious adversary (as is the case here), they showed that there exists an efficient algorithm such that for some absolute constant $c > 0$, for all $T \ge 2$ and $\ell_{1:T} \in [0,1]^{KT}$,

$$\mathbb{E}[R_T(\ell_{1:T})] \le c\left( K^2\sqrt{Q_T\log T} + K^{1.5}\log^2 T + K^{2.5}\log T \right). \qquad (6)$$

As before we can rewrite the regret bound (6) in terms of the small-variation balls $V_{\alpha,T}$ defined for all $\alpha \in [0,1/4]$ and $T \ge 1$ by

$$V_{\alpha,T} = \left\{ \ell_{1:T} \in [0,1]^{KT} : Q_T \le \alpha T K \right\}. \qquad (7)$$

**Corollary 2.** The second-order regret bound (6) of Hazan and Kale [2011b] is equivalent to: for all $\alpha \in [0,1/4]$,

$$\sup_{\ell_{1:T} \in V_{\alpha,T}} \mathbb{E}[R_T(\ell_{1:T})] \le c\left( K^2\sqrt{\alpha T K\log T} + K^{3/2}\log^2 T + K^{5/2}\log T \right).$$

The proof is straightforward because the losses are deterministic and fixed in advance by an oblivious adversary. In Section 4 we provide a lower bound of order $\sqrt{\alpha TK}$ that holds whenever $\alpha = \Omega(\log(T)/T)$. This minimax lower bound shows that we cannot hope for a bound better than that of Corollary 2 by more than a factor of $K^2\sqrt{\log T}$ if we only know the value of $Q_T$. Closing the gap is left as an open question.
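The quantities $L_T^\star$ and $Q_T$ that parametrise the balls $B_{\alpha,T}$ and $V_{\alpha,T}$ are simple functions of the loss matrix. The short Python sketch below (our own illustration; the helper name is hypothetical) computes both, together with the smallest $\alpha$ for which a given sequence belongs to each ball.

```python
import numpy as np

def first_and_second_order_stats(losses):
    """Return (L_star, Q) for a loss matrix of shape (T, K) with entries in [0, 1]:
    L_star = min_i sum_t ell_{i,t}   and   Q = sum_t ||ell_t - mu_T||_2^2."""
    L_star = losses.sum(axis=0).min()
    mu = losses.mean(axis=0)                   # mu_T
    Q = ((losses - mu) ** 2).sum()
    return L_star, Q

T, K = 1_000, 4
losses = np.random.default_rng(0).uniform(size=(T, K))
L_star, Q = first_and_second_order_stats(losses)
print(L_star / T)      # smallest alpha with ell_{1:T} in B_{alpha,T}
print(Q / (T * K))     # smallest alpha with ell_{1:T} in V_{alpha,T}
```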
**Two impossibility results in the bandit setting** We also show in Section 4 that, in contrast to the full-information setting, regret bounds involving the cumulative variance of the algorithm as in [Cesa-Bianchi et al., 2007] cannot be obtained in the bandit setting. More precisely, we prove that two consequences that hold true in the full-information case, namely (i) a regret bound proportional to the effective range of the losses and (ii) a bounded regret if one arm performs best at all rounds, must fail in the worst case for every bandit algorithm.

**Additional notation and key tools** Before the theorems we develop some additional notation and describe the generic ideas in the proofs. For $1 \le i \le K$ let $N_i(t)$ be the number of times action $i$ has been chosen after round $t$. All our lower bounds are derived by analysing the regret incurred by strategies when facing randomised adversaries that choose the losses for all actions from the same joint distribution in every round (sometimes independently for each action and sometimes not). $\mathrm{Ber}(\alpha)$ denotes the Bernoulli distribution with parameter $\alpha \in [0,1]$. If $P$ and $Q$ are measures on the same probability space, then $\mathrm{KL}(P, Q)$ is the KL divergence between them. For $a < b$ we define $\mathrm{clip}_{[a,b]}(x) = \min\{b, \max\{a, x\}\}$ and for $x, y \in \mathbb{R}$ we let $x \vee y = \max\{x, y\}$. Our main tools throughout the analysis are the following information-theoretic lemmas. The first bounds the KL divergence between the laws of the observed losses/actions for two distributions on the losses.

**Lemma 1.** Fix a randomised bandit algorithm and two probability distributions $Q_1$ and $Q_2$ on $[0,1]^K$. Assume the loss vectors $\ell_1, \dots, \ell_T \in [0,1]^K$ are drawn i.i.d. from either $Q_1$ or $Q_2$, and denote by $\mathbb{Q}_j$ the joint probability distribution on all sources of randomness when $Q_j$ is used (formally, $\mathbb{Q}_j = P_{\mathrm{int}} \otimes Q_j^{\otimes T}$, where $P_{\mathrm{int}}$ is the probability distribution used by the algorithm for its internal randomisation). Let $t \ge 1$. Denote by $h_t = (I_s, \ell_{I_s,s})_{1\le s\le t-1}$ the history available at the beginning of round $t$, by $\mathbb{Q}_j^{(h_t,I_t)}$ the law of $(h_t, I_t)$ under $\mathbb{Q}_j$, and by $Q_{j,i}$ the $i$-th marginal distribution of $Q_j$. Then,

$$\mathrm{KL}\left( \mathbb{Q}_1^{(h_t,I_t)}, \mathbb{Q}_2^{(h_t,I_t)} \right) = \sum_{i=1}^K \mathbb{E}_{\mathbb{Q}_1}\big[ N_i(t-1) \big]\, \mathrm{KL}\big( Q_{1,i}, Q_{2,i} \big)\,.$$

Results of roughly this form are well known and the proof follows immediately from the chain rule for the relative entropy and the independence of the loss vectors across time (see [Auer et al., 2002] or the supplementary material). One difference is that the losses need not be independent across the arms, which we heavily exploit in our proofs by using correlated losses. The second key lemma is an alternative to Pinsker's inequality that proves useful when the Kullback-Leibler divergence is larger than 2. It has previously been used for bandit lower bounds (in the stochastic setting) by Bubeck et al. [2013].

**Lemma 2** (Lemma 2.6 in Tsybakov 2008). Let $P$ and $Q$ be two probability distributions on the same measurable space. Then, for every measurable subset $A$ (whose complement we denote by $A^c$),

$$P(A) + Q(A^c) \ge \frac{1}{2}\exp\big( -\mathrm{KL}(P, Q) \big)\,.$$
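As a quick numerical sanity check of Lemma 2 (not part of the paper), the sketch below evaluates both sides for two Bernoulli distributions and the event $A = \{X = 1\}$; the helper names are our own.

```python
import math

def kl_bernoulli(p, q):
    """KL(Ber(p), Ber(q)) in nats, assuming 0 < p, q < 1."""
    return p * math.log(p / q) + (1 - p) * math.log((1 - p) / (1 - q))

def check_lemma_2(p, q):
    """Check P(A) + Q(A^c) >= exp(-KL(P, Q)) / 2 with P = Ber(p), Q = Ber(q), A = {X = 1}."""
    lhs = p + (1 - q)
    rhs = 0.5 * math.exp(-kl_bernoulli(p, q))
    return lhs, rhs, lhs >= rhs

print(check_lemma_2(0.50, 0.45))   # small KL: Pinsker would also be informative
print(check_lemma_2(0.90, 0.05))   # KL > 2: Pinsker is vacuous, Lemma 2 is not
```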
## 2 Zero-Order High Probability Lower Bounds

We prove two new high-probability lower bounds on the regret of any bandit algorithm. The first shows that no strategy can enjoy smaller regret than $\Omega(\sqrt{KT\log(1/\delta)})$ with probability at least $1-\delta$. Upper bounds of this form have been shown for various algorithms including Exp3.P [Auer et al., 2002] and Exp3-IX [Neu, 2015a]. Although this result is not very surprising, we are not aware of any existing work on this problem and the proof is less straightforward than one might expect.

An added benefit of our result is that the loss sequences producing large regret have two special properties. First, the optimal arm is the same in every round, and second, the range of the losses in each round is $O(\sqrt{K\log(1/\delta)/T})$. These properties will be useful in subsequent analysis. In the second lower bound we show that any algorithm for which $\mathbb{E}[R_T(\ell_{1:T})] = O(\sqrt{KT})$ must necessarily suffer a high-probability regret of at least $\Omega(\sqrt{KT}\log(1/\delta))$ for some sequence $\ell_{1:T}$. The important difference relative to the previous result is that strategies with $\log(1/\delta)$ appearing inside the square root depend on a specific value of $\delta$, which must be known in advance.

**Theorem 1.** Suppose $K \ge 2$, $\delta \in (0, 1/4)$ and $T \ge 32(K-1)\log(2/\delta)$. Then there exists a sequence of losses $\ell_{1:T} \in [0,1]^{KT}$ such that

$$\mathbb{P}\left( R_T(\ell_{1:T}) \ge \frac{1}{27}\sqrt{(K-1)T\log\frac{1}{4\delta}} \right) \ge \frac{\delta}{2}\,,$$

where the probability is taken with respect to the randomness in the algorithm. Furthermore, $\ell_{1:T}$ can be chosen in such a way that there exists an $i$ such that for all $t$ it holds that $\ell_{i,t} = \min_j \ell_{j,t}$ and $\max_{j,k}\{\ell_{j,t} - \ell_{k,t}\} \le \frac{1}{4}\sqrt{(K-1)\log(1/(4\delta))/(T\log 2)}$.

**Theorem 2.** Suppose $K \ge 2$, $T \ge 1$, and there exists a strategy and constant $C > 0$ such that for any $\ell_{1:T} \in [0,1]^{KT}$ it holds that $\mathbb{E}[R_T(\ell_{1:T})] \le C\sqrt{(K-1)T}$. Let $\delta \in (0, 1/4)$ satisfy $\sqrt{(K-1)/T}\,\log(1/(4\delta)) \le C$ and $T \ge 32\log(2/\delta)$. Then there exists $\ell_{1:T} \in [0,1]^{KT}$ for which

$$\mathbb{P}\left( R_T(\ell_{1:T}) \ge \frac{7\sigma^2}{64\,C}\sqrt{(K-1)T}\,\log\frac{1}{4\delta} \right) \ge \frac{\delta}{2}\,, \quad \text{where } \sigma^2 = \frac{1}{32\log 2}\,,$$

and where the probability is taken with respect to the randomness in the algorithm.

**Corollary 3.** If $p \in (0,1)$ and $C > 0$, then there does not exist a strategy such that for all $T$, $K$, $\ell_{1:T} \in [0,1]^{KT}$ and $\delta \in (0,1)$ the regret is bounded by $\mathbb{P}\big( R_T(\ell_{1:T}) \ge C\sqrt{(K-1)T}\log^p(1/\delta) \big) \le \delta$.

The corollary follows easily by integrating the assumed high-probability bound and applying Theorem 2 for sufficiently large $T$ and small $\delta$. The proof may be found in the supplementary material.

**Proof of Theorems 1 and 2** Both proofs rely on a carefully selected choice of correlated stochastic losses described below. Let $Z_1, Z_2, \dots, Z_T$ be a sequence of i.i.d. Gaussian random variables with mean $1/2$ and variance $\sigma^2 = 1/(32\log 2)$. Let $\Delta \in [0, 1/30]$ be a constant that will be chosen differently in each proof and define $K$ random loss sequences $\ell^1_{1:T}, \dots, \ell^K_{1:T}$ by

$$\ell^j_{i,t} = \begin{cases} \mathrm{clip}_{[0,1]}(Z_t - \Delta) & \text{if } i = 1\,, \\ \mathrm{clip}_{[0,1]}(Z_t - 2\Delta) & \text{if } i = j \ne 1\,, \\ \mathrm{clip}_{[0,1]}(Z_t) & \text{otherwise}\,. \end{cases}$$

For $1 \le j \le K$ let $\mathbb{Q}_j$ be the measure on $\ell_{1:T} \in [0,1]^{KT}$ and $I_1, \dots, I_T$ when $\ell_{i,t} = \ell^j_{i,t}$ for all $1 \le i \le K$ and $1 \le t \le T$. Informally, $\mathbb{Q}_j$ is the measure on the sequence of loss vectors and actions when the learner interacts with the losses sampled from the $j$-th environment defined above.
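The construction is easy to sample, which can help build intuition for why arm $j$ is hard to identify. Below is a small Python sketch that draws the loss matrix $\ell^j$ of environment $j$ exactly as defined above; the parameter values ($T$, $K$, $\Delta$) are our own illustrative choices and the code is not from the paper.

```python
import numpy as np

def sample_environment(j, T, K, gap, seed=0):
    """Sample the (T, K) loss matrix ell^j of the correlated construction:
    a shared Z_t ~ N(1/2, sigma^2) for every arm, shifted by `gap` for arm 0
    (arm 1 in the paper), by 2 * gap for arm j != 0, then clipped to [0, 1]."""
    sigma2 = 1.0 / (32.0 * np.log(2.0))
    z = np.random.default_rng(seed).normal(0.5, np.sqrt(sigma2), size=T)
    losses = np.tile(z[:, None], (1, K))
    losses[:, 0] = z - gap
    if j != 0:
        losses[:, j] = z - 2.0 * gap
    return np.clip(losses, 0.0, 1.0)

ell = sample_environment(j=3, T=10_000, K=10, gap=0.01)       # Delta = 0.01 <= 1/30
print(ell.mean(axis=0)[:5])                    # arm 3 has the smallest mean loss
print((ell[:, 3] <= ell.min(axis=1)).all())    # arm j is optimal in every single round
```

Because all arms share the same $Z_t$, arm $j$ is the best arm in every round almost surely, and the per-round range of the losses is at most $2\Delta$; these are exactly the properties used in Theorem 1 and in Corollaries 4 and 5.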
**Lemma 3.** Let $\delta \in (0,1)$ and suppose $\Delta \le 1/30$ and $T \ge 32\log(2/\delta)$. Then, for all $i \in \{1,\dots,K\}$,

$$\mathbb{Q}_i\left( R_T(\ell^i_{1:T}) \ge \frac{T\Delta}{4} \right) \ge \mathbb{Q}_i\left( N_i(T) \le \frac{T}{2} \right) - \frac{\delta}{2} \qquad\text{and}\qquad \mathbb{E}_{\mathbb{Q}_i}\big[ R_T(\ell^i_{1:T}) \big] \ge \frac{7\Delta}{8}\,\mathbb{E}_{\mathbb{Q}_i}\big[ T - N_i(T) \big]\,.$$

The proof follows by substituting the definition of the losses and applying Azuma's inequality to show that clipping does not occur too often. See the supplementary material for details.

**Proof of Theorem 1.** First we choose the value of $\Delta$ that determines the gaps in the losses by

$$\Delta = \sqrt{\frac{\sigma^2 (K-1)\log(1/(4\delta))}{2T}} \le \frac{1}{30}\,,$$

where the inequality follows from the assumption on $T$. By the pigeonhole principle there exists an $i > 1$ for which $\mathbb{E}_{\mathbb{Q}_1}[N_i(T)] \le T/(K-1)$. Therefore, by Lemmas 2 and 1, and the fact that the KL divergence between clipped Gaussian distributions is always smaller than without clipping,

$$\mathbb{Q}_1\left( N_1(T) \le \frac{T}{2} \right) + \mathbb{Q}_i\left( N_1(T) > \frac{T}{2} \right) \ge \frac{1}{2}\exp\left( -\mathrm{KL}\Big( \mathbb{Q}_1^{(h_T,I_T)}, \mathbb{Q}_i^{(h_T,I_T)} \Big) \right) \ge \frac{1}{2}\exp\left( -\frac{\mathbb{E}_{\mathbb{Q}_1}[N_i(T)]\,(2\Delta)^2}{2\sigma^2} \right) \ge 2\delta\,.$$

But by Lemma 3,

$$\max_{k\in\{1,i\}} \mathbb{Q}_k\left( R_T(\ell^k_{1:T}) \ge \frac{T\Delta}{4} \right) \ge \max\left\{ \mathbb{Q}_1\Big( N_1(T) \le \frac{T}{2} \Big),\, \mathbb{Q}_i\Big( N_i(T) \le \frac{T}{2} \Big) \right\} - \frac{\delta}{2} \ge \frac{1}{2}\left( \mathbb{Q}_1\Big( N_1(T) \le \frac{T}{2} \Big) + \mathbb{Q}_i\Big( N_1(T) > \frac{T}{2} \Big) \right) - \frac{\delta}{2} \ge \delta - \frac{\delta}{2} = \frac{\delta}{2}\,.$$

Therefore there exists an $i \in \{1, \dots, K\}$ such that $\mathbb{Q}_i\big( R_T(\ell^i_{1:T}) \ge T\Delta/4 \big) \ge \delta/2$, and hence (averaging over the loss realisations) a fixed sequence $\ell^i_{1:T} \in [0,1]^{KT}$ for which the same bound holds with respect to the internal randomisation of the algorithm alone. The result is completed by substituting the value of $\sigma^2 = 1/(32\log 2)$ and by noting that $\max_{j,k}\{\ell_{j,t} - \ell_{k,t}\} \le 2\Delta \le \frac{1}{4}\sqrt{(K-1)\log(1/(4\delta))/(T\log 2)}$ holds $\mathbb{Q}_i$-almost surely.

**Proof of Theorem 2.** By the assumption on $\delta$ we have

$$\Delta = \frac{7\sigma^2}{16\,C}\sqrt{\frac{K-1}{T}}\,\log\frac{1}{4\delta} \le \frac{7\sigma^2}{16} \le \frac{1}{30}\,.$$

Suppose for all $i > 1$ that

$$\mathbb{E}_{\mathbb{Q}_1}\big[ N_i(T) \big] > \frac{\sigma^2\log(1/(4\delta))}{2\Delta^2}\,. \qquad (8)$$

Then by the assumption in the theorem statement and the second part of Lemma 3 we have

$$C\sqrt{(K-1)T} \ge \mathbb{E}_{\mathbb{Q}_1}\big[ R_T(\ell^1_{1:T}) \big] \ge \frac{7\Delta}{8}\,\mathbb{E}_{\mathbb{Q}_1}\big[ T - N_1(T) \big] > \frac{7\Delta}{8}\cdot\frac{(K-1)\,\sigma^2\log(1/(4\delta))}{2\Delta^2} = C\sqrt{(K-1)T}\,,$$

which is a contradiction. Therefore there exists an $i > 1$ for which Eq. (8) does not hold. Then by the same argument as in the previous proof it follows that

$$\max_{k\in\{1,i\}} \mathbb{Q}_k\left( R_T(\ell^k_{1:T}) \ge \frac{7\sigma^2}{64\,C}\sqrt{(K-1)T}\,\log\frac{1}{4\delta} \right) = \max_{k\in\{1,i\}} \mathbb{Q}_k\left( R_T(\ell^k_{1:T}) \ge \frac{T\Delta}{4} \right) \ge \frac{\delta}{2}\,.$$

The result is completed by substituting the value of $\sigma^2 = 1/(32\log 2)$.

**Remark 1.** It is possible to derive similar high-probability regret bounds with non-correlated losses. However, the correlation makes the results cleaner (we do not need an additional concentration argument to locate the optimal arm) and it is key to deriving Corollaries 4 and 5 in Section 4.

## 3 First-Order Lower Bound

First-order upper bounds provide an improvement over minimax bounds when the loss of the optimal action is small. Recall from Corollary 1 that first-order bounds can be rewritten in terms of the small-loss balls $B_{\alpha,T}$ defined in (4). Theorem 3 below provides a new lower bound of order $\sqrt{L_T^\star K}$, which matches the best existing upper bounds up to logarithmic factors. As is standard for minimax results this does not imply a lower bound on every loss sequence $\ell_{1:T}$. Instead it shows that we cannot hope for a better bound if we only know the value of $L_T^\star$.

**Theorem 3.** Let $K \ge 2$, $T \ge K \vee 118$, and $\alpha \in \big[ (c\log(32T) \vee (K/2))/T,\, 1/2 \big]$, where $c = 64/9$. Then for any randomised bandit algorithm, $\sup_{\ell_{1:T}\in B_{\alpha,T}} \mathbb{E}[R_T(\ell_{1:T})] \ge \sqrt{\alpha TK}/27$, where the expectation is taken with respect to the internal randomisation of the algorithm.

Our proof is inspired by that of Auer et al. [2002, Theorem 5.1]. The key difference is that we take Bernoulli distributions with parameter close to $\alpha$ instead of $1/2$. This way the best cumulative loss $L_T^\star$ is ensured to be concentrated around $\alpha T$, and the regret lower bound $\sqrt{\alpha(1-\alpha)TK}$ can be seen to involve the variance $\alpha(1-\alpha)T$ of the binomial distribution with parameters $\alpha$ and $T$. First we state the stochastic construction of the losses and prove a general lemma that allows us to prove Theorem 3 and will also be useful in Section 4 to derive a lower bound in terms of the quadratic variation.

Let $\varepsilon \in [0, 1-\alpha]$ be fixed and define $K$ probability distributions $(Q_j)_{j=1}^K$ on $[0,1]^{KT}$ such that under $Q_j$ the following hold: all random losses $\ell_{i,t}$ for $1 \le i \le K$ and $1 \le t \le T$ are independent; $\ell_{i,t}$ is sampled from a Bernoulli distribution with parameter $\alpha + \varepsilon$ if $i \ne j$, and with parameter $\alpha$ if $i = j$.

**Lemma 4.** Let $\alpha \in (0,1)$, $K \ge 2$, and $T \ge K/(4(1-\alpha))$. Consider the probability distributions $Q_j$ on $[0,1]^{KT}$ defined above with $\varepsilon = \frac{1}{2}\sqrt{\alpha(1-\alpha)K/T}$, and set $\bar{Q} = \frac{1}{K}\sum_{j=1}^K Q_j$. Then for any randomised bandit algorithm

$$\mathbb{E}\big[ R_T(\ell_{1:T}) \big] \ge \frac{1}{8}\sqrt{\alpha(1-\alpha)TK}\,,$$

where the expectation is with respect to both the internal randomisation of the algorithm and the random loss sequence $\ell_{1:T}$, which is drawn from $\bar{Q}$. The assumption $T \ge K/(4(1-\alpha))$ above ensures that $\varepsilon \le 1-\alpha$, so that the $Q_j$ are well defined.
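Here is a minimal sketch of the product-Bernoulli environments $Q_j$ of Lemma 4 (our own illustrative parameter values and function name; not code from the paper). It samples a loss matrix with parameter $\alpha$ for arm $j$ and $\alpha + \varepsilon$ for every other arm.

```python
import numpy as np

def sample_Qj(j, T, K, alpha, seed=0):
    """Sample a (T, K) loss matrix from Q_j: independent Bernoulli losses with
    parameter alpha for arm j and alpha + eps for all other arms, where
    eps = 0.5 * sqrt(alpha * (1 - alpha) * K / T) as in Lemma 4."""
    eps = 0.5 * np.sqrt(alpha * (1.0 - alpha) * K / T)
    params = np.full(K, alpha + eps)
    params[j] = alpha                       # arm j is the (slightly) better arm
    rng = np.random.default_rng(seed)
    return (rng.uniform(size=(T, K)) < params).astype(float), eps

losses, eps = sample_Qj(j=0, T=20_000, K=10, alpha=0.1)
print(eps)                                  # gap of order sqrt(alpha * K / T)
print(losses.mean(axis=0)[:3])              # empirical means near alpha, alpha + eps, ...
print(losses.sum(axis=0).min() / 20_000)    # L*_T / T concentrates around alpha
```

Intuitively, because the gap $\varepsilon$ is of order $\sqrt{\alpha(1-\alpha)K/T}$, roughly $T/K$ observations per arm are not enough to identify arm $j$ reliably; the KL computation in the proof below makes this precise.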
**Proof of Lemma 4.** We lower bound the regret by the pseudo-regret for each distribution $Q_j$:

$$\mathbb{E}_{Q_j}\left[ \sum_{t=1}^T \ell_{I_t,t} - \min_{1\le i\le K}\sum_{t=1}^T \ell_{i,t} \right] \ge \mathbb{E}_{Q_j}\left[ \sum_{t=1}^T \ell_{I_t,t} \right] - \min_{1\le i\le K} \mathbb{E}_{Q_j}\left[ \sum_{t=1}^T \ell_{i,t} \right] = \sum_{t=1}^T \mathbb{E}_{Q_j}\big[ \alpha + \varepsilon - \varepsilon\mathbf{1}_{\{I_t=j\}} \big] - T\alpha = T\varepsilon - \varepsilon\sum_{t=1}^T Q_j(I_t = j)\,, \qquad (9)$$

where the first equality follows because $\mathbb{E}_{Q_j}[\ell_{I_t,t}] = \mathbb{E}_{Q_j}\big[\mathbb{E}_{Q_j}[\ell_{I_t,t} \mid \ell_{1:t-1}, I_t]\big] = \mathbb{E}_{Q_j}\big[\alpha + \varepsilon - \varepsilon\mathbf{1}_{\{I_t=j\}}\big]$, since under $Q_j$ the conditional distribution of $\ell_t$ given $(\ell_{1:t-1}, I_t)$ is simply $\bigotimes_{i=1}^K \mathrm{Ber}\big(\alpha + \varepsilon - \varepsilon\mathbf{1}_{\{i=j\}}\big)$. To bound (9) from below, note that by Pinsker's inequality we have, for all $t \in \{1,\dots,T\}$ and $j \in \{1,\dots,K\}$,

$$Q_j(I_t = j) \le Q_0(I_t = j) + \sqrt{\mathrm{KL}\big( Q_0^{I_t}, Q_j^{I_t} \big)/2}\,,$$

where $Q_0 = \mathrm{Ber}(\alpha+\varepsilon)^{\otimes KT}$ is the joint probability distribution that makes all the $\ell_{i,t}$ i.i.d. $\mathrm{Ber}(\alpha+\varepsilon)$, and $Q_0^{I_t}$ and $Q_j^{I_t}$ denote the laws of $I_t$ under $Q_0$ and $Q_j$ respectively. Plugging the last inequality above into (9), averaging over $j = 1,\dots,K$ and using the concavity of the square root yields

$$\mathbb{E}_{\bar{Q}}\left[ \sum_{t=1}^T \ell_{I_t,t} - \min_{1\le i\le K}\sum_{t=1}^T \ell_{i,t} \right] \ge T\varepsilon - \varepsilon\frac{T}{K} - \varepsilon T\sqrt{\frac{1}{2TK}\sum_{t=1}^T\sum_{j=1}^K \mathrm{KL}\big( Q_0^{I_t}, Q_j^{I_t} \big)}\,, \qquad (10)$$

where we recall that $\bar{Q} = \frac{1}{K}\sum_{j=1}^K Q_j$. The rest of the proof is devoted to upper bounding $\mathrm{KL}(Q_0^{I_t}, Q_j^{I_t})$. Denote by $h_t = (I_s, \ell_{I_s,s})_{1\le s\le t-1}$ the history available at the beginning of round $t$. From Lemma 1,

$$\mathrm{KL}\big( Q_0^{I_t}, Q_j^{I_t} \big) \le \mathrm{KL}\big( Q_0^{(h_t,I_t)}, Q_j^{(h_t,I_t)} \big) = \mathbb{E}_{Q_0}\big[ N_j(t-1) \big]\,\mathrm{KL}\big( \mathrm{Ber}(\alpha+\varepsilon), \mathrm{Ber}(\alpha) \big) \le \mathbb{E}_{Q_0}\big[ N_j(t-1) \big]\,\frac{\varepsilon^2}{\alpha(1-\alpha)}\,, \qquad (11)$$

where the last inequality follows by upper bounding the KL divergence by the $\chi^2$ divergence (see the supplementary material). Averaging (11) over $j \in \{1,\dots,K\}$ and $t \in \{1,\dots,T\}$ and noting that $\sum_{t=1}^T (t-1) \le T^2/2$, we get

$$\frac{1}{TK}\sum_{t=1}^T\sum_{j=1}^K \mathrm{KL}\big( Q_0^{I_t}, Q_j^{I_t} \big) \le \frac{T\varepsilon^2}{2K\alpha(1-\alpha)}\,.$$

Plugging the above inequality into (10) and using the definition of $\varepsilon = \frac{1}{2}\sqrt{\alpha(1-\alpha)K/T}$ yields

$$\mathbb{E}_{\bar{Q}}\left[ \sum_{t=1}^T \ell_{I_t,t} - \min_{1\le i\le K}\sum_{t=1}^T \ell_{i,t} \right] \ge T\varepsilon\left( 1 - \frac{1}{K} - \frac{\varepsilon}{2}\sqrt{\frac{T}{K\alpha(1-\alpha)}} \right) \ge T\varepsilon\left( 1 - \frac{1}{2} - \frac{1}{4} \right) = \frac{T\varepsilon}{4} = \frac{1}{8}\sqrt{\alpha(1-\alpha)TK}\,.$$

**Proof of Theorem 3.** We show that there exists a loss sequence $\ell_{1:T} \in [0,1]^{KT}$ such that $L_T^\star \le \alpha T$ and $\mathbb{E}[R_T(\ell_{1:T})] \ge \frac{1}{27}\sqrt{\alpha TK}$. Lemma 4 above provides such a lower bound, but without the guarantee on $L_T^\star$. For this purpose we will use Lemma 4 with a smaller value of $\alpha$ (namely, $\alpha/2$) and combine it with Bernstein's inequality to prove that $L_T^\star \le T\alpha$ with high probability.

*Part 1:* Applying Lemma 4 with $\alpha/2$ (note that $T \ge K \ge K/(4(1-\alpha/2))$ by assumption on $T$) and noting that $\max_j \mathbb{E}_{Q_j}[R_T(\ell_{1:T})] \ge \mathbb{E}_{\bar{Q}}[R_T(\ell_{1:T})]$, we get that for some $j \in \{1,\dots,K\}$ the probability distribution $Q_j$ defined with $\varepsilon = \frac{1}{2}\sqrt{(\alpha/2)(1-\alpha/2)K/T}$ satisfies

$$\mathbb{E}_{Q_j}\big[ R_T(\ell_{1:T}) \big] \ge \frac{1}{8}\sqrt{\frac{\alpha}{2}\Big(1-\frac{\alpha}{2}\Big)TK} \ge \frac{1}{32}\sqrt{6\alpha TK}\,, \qquad (12)$$

since $\alpha \le 1/2$ by assumption.

*Part 2:* Next we prove that

$$Q_j\big( L_T^\star > T\alpha \big) \le \frac{1}{32T}\,. \qquad (13)$$

To this end, first note that $L_T^\star \le \sum_{t=1}^T \ell_{j,t}$. Second, note that under $Q_j$ the $\ell_{j,t}$, $t \ge 1$, are i.i.d. $\mathrm{Ber}(\alpha/2)$. We can thus use Bernstein's inequality: applying Theorem 2.10 (and a remark on p. 38) of Boucheron et al. [2013] with $X_t = \ell_{j,t} - \alpha/2 \le 1 = b$, with $v = T(\alpha/2)(1-\alpha/2)$, and with $c = b/3 = 1/3$, we get that, for all $\delta \in (0,1)$, with $Q_j$-probability at least $1-\delta$,

$$\sum_{t=1}^T \ell_{j,t} \le \frac{T\alpha}{2} + \sqrt{2T\frac{\alpha}{2}\Big(1-\frac{\alpha}{2}\Big)\log\frac{1}{\delta}} + \frac{1}{3}\log\frac{1}{\delta} \le \frac{T\alpha}{2} + \sqrt{T\alpha\log\frac{1}{\delta}} + \frac{1}{3}\log\frac{1}{\delta} \le \frac{T\alpha}{2} + \frac{4}{3}\sqrt{T\alpha\log\frac{1}{\delta}} \le \frac{T\alpha}{2} + \frac{T\alpha}{2} = T\alpha\,, \qquad (14)$$

where the second-last inequality is true whenever $T\alpha \ge \log(1/\delta)$ and the last is true whenever $T\alpha \ge (8/3)^2\log(1/\delta) = c\log(1/\delta)$. By the assumption on $\alpha$, these two conditions are satisfied for $\delta = 1/(32T)$, which concludes the proof of (13).

*Conclusion:* We show by contradiction that there exists a loss sequence $\ell_{1:T} \in [0,1]^{KT}$ such that $L_T^\star \le \alpha T$ and

$$\mathbb{E}\big[ R_T(\ell_{1:T}) \big] \ge \frac{1}{64}\sqrt{6\alpha TK}\,, \qquad (15)$$

where the expectation is with respect to the internal randomisation of the algorithm. Imagine for a second that (15) were false for every loss sequence $\ell_{1:T} \in [0,1]^{KT}$ satisfying $L_T^\star \le \alpha T$. Then we would have $\mathbf{1}_{\{L_T^\star \le \alpha T\}}\,\mathbb{E}_{Q_j}\big[ R_T(\ell_{1:T}) \mid \ell_{1:T} \big] \le \frac{1}{64}\sqrt{6\alpha TK}$ almost surely (since the internal source of randomness of the bandit algorithm is independent of $\ell_{1:T}$). Therefore, by the tower rule for the first expectation on the right-hand side below, we would get
$$\mathbb{E}_{Q_j}\big[ R_T(\ell_{1:T}) \big] = \mathbb{E}_{Q_j}\big[ R_T(\ell_{1:T})\,\mathbf{1}_{\{L_T^\star \le \alpha T\}} \big] + \mathbb{E}_{Q_j}\big[ R_T(\ell_{1:T})\,\mathbf{1}_{\{L_T^\star > \alpha T\}} \big] \le \frac{1}{64}\sqrt{6\alpha TK} + T\,Q_j\big( L_T^\star > T\alpha \big) \le \frac{1}{64}\sqrt{6\alpha TK} + \frac{1}{32} < \frac{1}{32}\sqrt{6\alpha TK}\,, \qquad (16)$$

where (16) follows from (13) and by noting that $1/32 < \frac{1}{64}\sqrt{6\alpha TK}$ since $\alpha \ge K/(2T) > 4/(6T) \ge 4/(6TK)$. Comparing (16) and (12) we get a contradiction, which proves that there exists a loss sequence $\ell_{1:T} \in [0,1]^{KT}$ satisfying both $L_T^\star \le \alpha T$ and (15). We conclude the proof by noting that $\sqrt{6}/64 \ge 1/27$. Finally, the condition $T \ge K \vee 118$ is sufficient to make the interval $\big[ (c\log(32T) \vee (K/2))/T,\, 1/2 \big]$ non-empty.

## 4 Second-Order Lower Bounds

We start by giving a lower bound on the regret in terms of the quadratic variation that is close to existing upper bounds except in the dependence on the number of arms. Afterwards we prove that bandit strategies cannot adapt to losses that lie in a small range or to the existence of an action that is always optimal.

**Lower bound in terms of quadratic variation** We prove a lower bound of $\Omega(\sqrt{\alpha TK})$ over any small-variation ball $V_{\alpha,T}$ (as defined by (7)) for all $\alpha = \Omega(\log(T)/T)$. This minimax lower bound matches the upper bound of Corollary 2 up to a multiplicative factor of $K^2\sqrt{\log T}$. Closing this gap is left as an open question, but we conjecture that the upper bound is loose (see also the COLT open problem by Hazan and Kale [2011a]).

**Theorem 4.** Let $K \ge 2$, $T \ge (32K) \vee 601$, and $\alpha \in \big[ (2c_1\log(T) \vee 8K)/T,\, 1/4 \big]$, where $c_1 = (4/9)^2(3\sqrt{5}+1)^2 \le 12$. Then for any randomised bandit algorithm, $\sup_{\ell_{1:T}\in V_{\alpha,T}} \mathbb{E}[R_T(\ell_{1:T})] \ge \sqrt{\alpha TK}/25$, where the expectation is taken with respect to the internal randomisation of the algorithm.

The proof is very similar to that of Theorem 3; it also follows from Lemma 4 and Bernstein's inequality. It is postponed to the supplementary material.

**Impossibility results** In the full-information setting (where the entire loss vector is observed after each round), Cesa-Bianchi et al. [2007, Theorem 6] designed a carefully tuned exponential weighting algorithm for which the regret depends on the cumulative variance of the algorithm and the effective range of the losses: for all $\ell_{1:T} \in \mathbb{R}^{KT}$,

$$\mathbb{E}\big[ R_T(\ell_{1:T}) \big] \le 4\sqrt{V_T\log K} + 4E_T\log K + 6E_T\,, \qquad (17)$$

where the expectation is taken with respect to the internal randomisation of the algorithm, $E_T = \max_{1\le t\le T}\max_{1\le i,j\le K}|\ell_{i,t} - \ell_{j,t}|$ denotes the effective range of the losses, and $V_T = \sum_{t=1}^T \mathrm{Var}_{I_t\sim p_t}(\ell_{I_t,t})$ denotes the cumulative variance of the algorithm (in each round $t$ the expert's action $I_t$ is drawn at random from the weight vector $p_t$). The bound in (17) is not closed-form because $V_T$ depends on the algorithm, but it has several interesting consequences; a short computational sketch of $E_T$ and $V_T$ is given after the list below.

1. If for all $t$ the losses $\ell_{i,t}$ lie in an unknown interval $[a_t, a_t + \rho]$ with a small width $\rho > 0$, then $\mathrm{Var}_{I_t\sim p_t}(\ell_{I_t,t}) \le \rho^2/4$, so that $V_T \le T\rho^2/4$. Hence $\mathbb{E}[R_T(\ell_{1:T})] \le 2\rho\sqrt{T\log K} + 4\rho\log K + 6\rho$. Therefore, though the algorithm of Cesa-Bianchi et al. [2007, Section 4.2] does not use the prior knowledge of $a_t$ or $\rho$, it is able to incur a regret that scales linearly with the effective range $\rho$.

2. If all the losses $\ell_{i,t}$ are nonnegative, then by Corollary 3 of Cesa-Bianchi et al. [2007] the second-order bound (17) implies the first-order bound

   $$\mathbb{E}\big[ R_T(\ell_{1:T}) \big] \le 4\sqrt{L_T^\star M_T\log K} + 39M_T\max\{1, \log K\}\,, \qquad (18)$$

   where $M_T = \max_{1\le t\le T}\max_{1\le i\le K}\ell_{i,t}$.

3. If there exists an arm $i^\star$ that is optimal at every round $t$ (i.e., $\ell_{i^\star,t} = \min_i \ell_{i,t}$ for all $t \ge 1$), then any translation-invariant algorithm with regret guarantees as in (18) above suffers only a bounded regret. This is the case for the fully automatic algorithm of Cesa-Bianchi et al. [2007, Theorem 6] mentioned above. Indeed, by the translation invariance of the algorithm all losses $\ell_{i,t}$ appearing in the regret bound can be replaced with the translated losses $\ell_{i,t} - \ell_{i^\star,t} \ge 0$, for which the best arm has zero cumulative loss, so that a bound of the same form as (18) implies a regret bound of $O(\log K)$.

4. Assume that the loss vectors $\ell_t$ are i.i.d. with a unique optimal arm in expectation (i.e., there exists $i^\star$ such that $\mathbb{E}[\ell_{i^\star,1}] < \mathbb{E}[\ell_{i,1}]$ for all $i \ne i^\star$). Then using the Hoeffding-Azuma inequality we can show that the algorithm of Cesa-Bianchi et al. [2007, Section 4.2] has with high probability a bounded cumulative variance $V_T$, and therefore (by (17)) incurs a bounded regret, in the same spirit as in de Rooij et al. [2014], Gaillard et al. [2014].
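For concreteness, the two quantities appearing in (17) can be computed as follows from a loss matrix and the algorithm's probability vectors $p_t$. This is an illustrative sketch with our own helper names (the paper provides no code); the uniform forecaster below is only a stand-in for an arbitrary algorithm.

```python
import numpy as np

def effective_range(losses):
    """E_T = max_t max_{i,j} |ell_{i,t} - ell_{j,t}| for a (T, K) loss matrix."""
    return (losses.max(axis=1) - losses.min(axis=1)).max()

def cumulative_variance(losses, probs):
    """V_T = sum_t Var_{I_t ~ p_t}(ell_{I_t,t}), where probs is the (T, K) matrix of p_t."""
    means = (probs * losses).sum(axis=1)
    second_moments = (probs * losses ** 2).sum(axis=1)
    return (second_moments - means ** 2).sum()

T, K, rho = 1_000, 5, 0.05
losses = 0.4 + rho * np.random.default_rng(0).uniform(size=(T, K))   # range at most rho
probs = np.full((T, K), 1.0 / K)                                     # placeholder forecaster
print(effective_range(losses) <= rho)                                # True
print(cumulative_variance(losses, probs) <= T * rho ** 2 / 4)        # True, as in point 1
```

Consistent with point 1, the cumulative variance never exceeds $T\rho^2/4$ when every loss vector has range at most $\rho$.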
We already know that point 2 has a counterpart in the bandit setting. If one is prepared to ignore logarithmic terms, then point 4 also has an analogue in the bandit setting due to the existence of logarithmic regret guarantees for stochastic bandits [Lai and Robbins, 1985]. The following corollaries show that in the bandit setting it is not possible to design algorithms that exploit a small range of the losses or the existence of an arm that is always optimal. We use Theorem 1 as a general tool, but the bounds can be improved to $\sqrt{TK}/30$ by analysing the expected regret directly (similarly to Lemma 4).

**Corollary 4.** Let $K \ge 2$, $T \ge 32(K-1)\log(14)$ and $\rho \ge 0.22\sqrt{(K-1)/T}$. Then for any randomised bandit algorithm,

$$\sup_{\ell_1,\dots,\ell_T \in C_\rho} \mathbb{E}\big[ R_T(\ell_{1:T}) \big] \ge \sqrt{T(K-1)}/504\,,$$

where the expectation is with respect to the randomness in the algorithm, and $C_\rho = \big\{ x \in [0,1]^K : \max_{i,j}|x_i - x_j| \le \rho \big\}$.

**Corollary 5.** Let $K \ge 2$ and $T \ge 32(K-1)\log(14)$. Then, for any randomised bandit algorithm, there is a loss sequence $\ell_{1:T} \in [0,1]^{KT}$ such that there exists an arm $i^\star$ that is optimal at every round $t$ (i.e., $\ell_{i^\star,t} = \min_i \ell_{i,t}$ for all $t \ge 1$), but $\mathbb{E}[R_T(\ell_{1:T})] \ge \sqrt{T(K-1)}/504$, where the expectation is with respect to the randomness in the algorithm.

**Proof of Corollaries 4 and 5.** Both results follow from Theorem 1 by choosing $\delta = 0.15$. Therefore there exists an $\ell_{1:T}$ such that $\mathbb{P}\big( R_T(\ell_{1:T}) \ge \sqrt{(K-1)T\log(1/(4\cdot 0.15))}/27 \big) \ge 0.15/2$, which implies (since $R_T(\ell_{1:T}) \ge 0$ here) that $\mathbb{E}[R_T(\ell_{1:T})] \ge \sqrt{(K-1)T}/504$. Finally, note that $\ell_t \in C_\rho$ for all $t$ since $\rho \ge \frac{1}{4}\sqrt{(K-1)\log(1/(4\delta))/(T\log 2)} \ge \max_{j,k}\{\ell_{j,t} - \ell_{k,t}\}$, and there exists an $i$ such that $\ell_{i,t} \le \ell_{j,t}$ for all $j$ and $t$.

**Acknowledgments** The authors would like to thank Aurélien Garivier and Émilie Kaufmann for insightful discussions. This work was partially supported by the CIMI (Centre International de Mathématiques et d'Informatique) Excellence program. The authors acknowledge the support of the French Agence Nationale de la Recherche (ANR), under grants ANR-13-BS01-0005 (project SPADRO) and ANR-13-CORD-0020 (project ALICIA).

## References

C. Allenberg, P. Auer, L. Györfi, and G. Ottucsák. Hannan consistency in on-line learning in case of unbounded losses under partial monitoring. In Proceedings of ALT 2006, pages 229–243. Springer, 2006.

J. Audibert and S. Bubeck. Minimax policies for adversarial and stochastic bandits. In Proceedings of the Conference on Learning Theory (COLT), pages 217–226, 2009.

P. Auer, N. Cesa-Bianchi, Y. Freund, and R. Schapire. Gambling in a rigged casino: The adversarial multi-armed bandit problem. In Proceedings of the 36th Annual Symposium on Foundations of Computer Science, pages 322–331. IEEE, 1995.

P. Auer, N. Cesa-Bianchi, Y. Freund, and R. E. Schapire. The nonstochastic multi-armed bandit problem. SIAM J. Comput., 32(1):48–77, 2002.

S. Boucheron, G. Lugosi, and P. Massart. Concentration Inequalities: A Nonasymptotic Theory of Independence. Oxford University Press, 2013.
S. Bubeck and N. Cesa-Bianchi. Regret analysis of stochastic and nonstochastic multi-armed bandit problems. Foundations and Trends in Machine Learning, 5(1):1–122, 2012.

S. Bubeck, V. Perchet, and P. Rigollet. Bounded regret in stochastic multi-armed bandits. In Proceedings of the 26th Conference on Learning Theory, pages 122–134, 2013.

N. Cesa-Bianchi, Y. Mansour, and G. Stoltz. Improved second-order bounds for prediction with expert advice. Mach. Learn., 66(2/3):321–352, 2007.

S. de Rooij, T. van Erven, P. D. Grünwald, and W. M. Koolen. Follow the leader if you can, hedge if you must. J. Mach. Learn. Res., 15(Apr):1281–1316, 2014.

P. Gaillard, G. Stoltz, and T. van Erven. A second-order bound with excess losses. In Proceedings of the 27th Conference on Learning Theory (COLT), 2014.

E. Hazan and S. Kale. A simple multi-armed bandit algorithm with optimal variation-bounded regret. In Proceedings of the 24th Conference on Learning Theory, pages 817–820, 2011a.

E. Hazan and S. Kale. Better algorithms for benign bandits. J. Mach. Learn. Res., 12(Apr):1287–1311, 2011b.

T. L. Lai and H. Robbins. Asymptotically efficient adaptive allocation rules. Adv. in Appl. Math., 6:4–22, 1985.

G. Neu. Explore no more: Improved high-probability regret bounds for non-stochastic bandits. In Advances in Neural Information Processing Systems 28 (NIPS 2015), 2015a.

G. Neu. First-order regret bounds for combinatorial semi-bandits. In Proceedings of the 28th Conference on Learning Theory, pages 1360–1375, 2015b.

A. Rakhlin and K. Sridharan. Online learning with predictable sequences. In Proceedings of the 26th Conference on Learning Theory, pages 993–1019, 2013.

G. Stoltz. Incomplete information and internal regret in prediction of individual sequences. PhD thesis, Université Paris-Sud XI, 2005.

A. Tsybakov. Introduction to Nonparametric Estimation. Springer Science & Business Media, 2008.