# The Pareto Regret Frontier for Bandits

Tor Lattimore
Department of Computing Science, University of Alberta, Canada
tor.lattimore@gmail.com

**Abstract.** Given a multi-armed bandit problem it may be desirable to achieve a smaller-than-usual worst-case regret for some special actions. I show that the price for such unbalanced worst-case regret guarantees is rather high. Specifically, if an algorithm enjoys a worst-case regret of $B$ with respect to some action, then there must exist another action for which the worst-case regret is at least $\Omega(nK/B)$, where $n$ is the horizon and $K$ the number of actions. I also give upper bounds in both the stochastic and adversarial settings showing that this result cannot be improved. For the stochastic case the Pareto regret frontier is characterised exactly up to constant factors.

## 1 Introduction

The multi-armed bandit is the simplest class of problems that exhibit the exploration/exploitation dilemma. In each time step the learner chooses one of $K$ actions and receives a noisy reward signal for the chosen action. A learner's performance is measured in terms of the regret, which is the (expected) difference between the rewards it actually received and those it would have received (in expectation) by choosing the optimal action.

Prior work on the regret criterion for finite-armed bandits has treated all actions uniformly and has aimed for bounds on the regret that do not depend on which action turned out to be optimal. I take a different approach and ask what can be achieved if some actions are given special treatment. Focussing on worst-case bounds, I ask whether or not it is possible to achieve improved worst-case regret for some actions, and what the cost is in terms of the regret for the remaining actions. Such results may be useful in a variety of cases. For example, a company that is exploring some new strategies might expect an especially small regret if its existing strategy turns out to be (nearly) optimal.

This problem has previously been considered in the experts setting, where the learner is allowed to observe the reward for all actions in every round, not only for the action actually chosen. The earliest work seems to be by Hutter and Poland [2005], where it is shown that the learner can assign a prior weight to each action and pays a worst-case regret of $O(\sqrt{n \log(1/\rho_i)})$ for expert $i$, where $\rho_i$ is the prior belief in expert $i$ and $n$ is the horizon. The uniform regret is obtained by choosing $\rho_i = 1/K$, which leads to the well-known $O(\sqrt{n \log K})$ bound achieved by the exponential weighting algorithm [Cesa-Bianchi and Lugosi, 2006]. The consequence of this is that an algorithm can enjoy a constant regret with respect to a single action while suffering minimally on the remainder. The problem was studied in more detail by Koolen [2013], where (remarkably) the author was able to exactly describe the Pareto regret frontier when $K = 2$. Other related work (also in the experts setting) addresses the objective of obtaining an improved regret against a mixture of the available experts/actions [Even-Dar et al., 2008, Kapralov and Panigrahy, 2011]. In a similar vein, Sani et al. [2014] showed that algorithms for prediction with expert advice can be combined with minimal cost to obtain the best of both worlds. In the bandit setting I am only aware of the work by Liu and Li [2015], who study the effect of the prior on the regret of Thompson sampling in a special case. In contrast, the lower bound given here applies to all algorithms in a relatively standard setting.
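The mildness of the experts-setting tradeoff is easy to see numerically. The following minimal Python sketch is my own illustration, not code from the paper: it evaluates the $\sqrt{n \log(1/\rho_i)}$ regret profile for a prior that strongly favours one expert.

```python
import math

def experts_regret_profile(n, rho):
    """Worst-case regret profile sqrt(n * log(1/rho_i)) for prior weights rho.

    Evaluates the Hutter-Poland style bound discussed above (up to constants);
    an illustration only, not code from the paper.
    """
    assert abs(sum(rho) - 1.0) < 1e-9, "rho must be a probability vector"
    return [math.sqrt(n * math.log(1.0 / r)) for r in rho]

n, K = 10**4, 10
# Favour expert 1 heavily, split the rest of the prior uniformly.
rho = [0.9] + [0.1 / (K - 1)] * (K - 1)
print(experts_regret_profile(n, rho))
# Expert 1 pays ~sqrt(0.105 * n) while the others pay ~sqrt(4.5 * n):
# only a constant-factor gap, nothing like the n/B blow-up in bandits.
```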
The main contribution of this work is a characterisation of the Pareto regret frontier (the set of achievable worst-case regret bounds) for stochastic bandits. Let $\mu_i \in \mathbb{R}$ be the unknown mean of the $i$th arm and assume that $\sup_{i,j} \mu_i - \mu_j \le 1$. In each time step the learner chooses an action $I_t \in \{1, \dots, K\}$ and receives reward $g_{I_t,t} = \mu_{I_t} + \eta_t$, where $\eta_t$ is a noise term that I assume to be sampled independently from a 1-subgaussian distribution that may depend on $I_t$. This model subsumes both Gaussian and Bernoulli (or bounded) rewards. Let $\pi$ be a bandit strategy, which is a function from histories of observations to an action $I_t$. Then the $n$-step expected pseudo-regret with respect to the $i$th arm is

$$R^\pi_{\mu,i} = n\mu_i - \mathbb{E}\left[\sum_{t=1}^{n} g_{I_t,t}\right],$$

where the expectation is taken with respect to the randomness in the noise and the actions of the policy. Throughout this work $n$ is fixed, so it is omitted from the notation. The worst-case expected pseudo-regret with respect to arm $i$ is

$$R^\pi_i = \sup_\mu R^\pi_{\mu,i}. \tag{1}$$

This means that $R^\pi \in \mathbb{R}^K$ is a vector of worst-case pseudo-regrets with respect to each of the arms. Let $\mathcal{B} \subseteq \mathbb{R}^K$ be the set defined by

$$\mathcal{B} = \left\{ B \in [0, n]^K : B_i \ge \min\left\{ n,\ \sum_{j \ne i} \frac{n}{B_j} \right\} \text{ for all } i \right\}. \tag{2}$$

The boundary of $\mathcal{B}$ is denoted by $\delta\mathcal{B}$. The following theorem shows that $\delta\mathcal{B}$ describes the Pareto regret frontier up to constant factors.

**Theorem (main).** There exist universal constants $c_1 = 8$ and $c_2 = 252$ such that:

- **Lower bound:** for $\eta_t \sim \mathcal{N}(0,1)$ and all strategies $\pi$ we have $c_1(R^\pi + K) \in \mathcal{B}$.
- **Upper bound:** for all $B \in \mathcal{B}$ there exists a strategy $\pi$ such that $R^\pi_i \le c_2 B_i$ for all $i$.

Observe that the lower bound relies on the assumption that the noise term is Gaussian, while the upper bound holds for subgaussian noise. The lower bound may be generalised to other noise models such as Bernoulli, but does not hold for all subgaussian noise models. For example, it does not hold if there is no noise ($\eta_t = 0$ almost surely). The lower bound also applies to the adversarial framework where the rewards may be chosen arbitrarily. Although I was not able to derive a matching upper bound in this case, a simple modification of the Exp-γ algorithm [Bubeck and Cesa-Bianchi, 2012] leads to an algorithm with $R^\pi_1 \lesssim B_1$ and $R^\pi_k \lesssim nK/B_1$ for all $k \ge 2$, where the regret is the adversarial version of the expected regret. Details are in the supplementary material.

The new results seem elegant, but disappointing. In the experts setting we have seen that the learner can distribute a prior amongst the actions and obtain a bound on the regret depending in a natural way on the prior weight of the optimal action. In contrast, in the bandit setting the learner pays an enormously higher price to obtain a small regret with respect to even a single arm. In fact, the learner must essentially choose a single arm to favour, after which the regret for the remaining arms has very limited flexibility. Unlike in the experts setting, if even a single arm enjoys constant worst-case regret, then the worst-case regret with respect to all other arms is necessarily linear.
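To make the definition of $\mathcal{B}$ in Eq. (2) concrete, here is a small Python sketch (my own illustration, not code from the paper) that tests whether a proposed vector of worst-case bounds satisfies the frontier constraints.

```python
def in_frontier(B, n, tol=1e-9):
    """Test B ∈ B, i.e. B_i >= min(n, sum_{j != i} n / B_j) for every i.

    A direct transcription of Eq. (2); assumes 0 < B_i <= n. The small
    tolerance guards against floating-point rounding on boundary points.
    """
    total = sum(n / b for b in B)
    return all(b + tol >= min(n, total - n / b) for b in B)

n, K = 10**4, 5
uniform = [(n * (K - 1)) ** 0.5] * K   # the balanced boundary point
print(in_frontier(uniform, n))          # True: sum_{j != i} n/B_j equals B_i
print(in_frontier([1.0] * K, n))        # False: every arm cannot be special
```

The second call fails because demanding constant regret for one arm already forces the constraint $B_k \ge n/B_1$ on every other arm.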
## 2 Preliminaries

I use the same notation as Bubeck and Cesa-Bianchi [2012]. Define $T_i(t)$ to be the number of times action $i$ has been chosen after time step $t$ and $\hat\mu_{i,s}$ to be the empirical estimate of $\mu_i$ from the first $s$ times action $i$ was sampled. This means that $\hat\mu_{i,T_i(t-1)}$ is the empirical estimate of $\mu_i$ at the start of the $t$th round. I use the convention that $\hat\mu_{i,0} = 0$. Since the noise model is 1-subgaussian, we have for all $\varepsilon > 0$

$$\mathbb{P}\left\{ \exists s \le t : \hat\mu_{i,s} - \mu_i \ge \varepsilon/s \right\} \le \exp\left(-\frac{\varepsilon^2}{2t}\right).$$

This result is presumably well known, but a proof is included in the supplementary material for convenience. The optimal arm is $i^* = \arg\max_i \mu_i$, with ties broken in some arbitrary way. The optimal reward is $\mu^* = \max_i \mu_i$. The gap between the mean rewards of the $j$th arm and the optimal arm is $\Delta_j = \mu^* - \mu_j$, and $\Delta_{ji} = \mu_i - \mu_j$. The vector of worst-case regrets is $R^\pi \in \mathbb{R}^K$ and has been defined already in Eq. (1). I write $R^\pi \le B \in \mathbb{R}^K$ if $R^\pi_i \le B_i$ for all $i \in \{1, \dots, K\}$. For a vector $R^\pi$ and $x \in \mathbb{R}$ define $(R^\pi + x)_i = R^\pi_i + x$.

## 3 Understanding the Frontier

Before proving the main theorem I briefly describe the features of the regret frontier. First notice that if $B_i = \sqrt{n(K-1)}$ for all $i$, then

$$\sum_{j \ne i} \frac{n}{B_j} = (K-1)\sqrt{\frac{n}{K-1}} = \sqrt{n(K-1)} = B_i. \tag{3}$$

Thus $B \in \delta\mathcal{B}$ as expected. This particular $B$ is witnessed up to constant factors by MOSS [Audibert and Bubeck, 2009] and OC-UCB [Lattimore, 2015], but not UCB [Auer et al., 2002], which suffers $R^{\mathrm{ucb}}_i \in \Omega(\sqrt{nK\log n})$.

Of course the uniform choice of $B$ is not the only option. Suppose the first arm is special, so $B_1$ should be chosen especially small. Assume without loss of generality that $B_1 \le B_2 \le \dots \le B_K \le n$. Then by the main theorem we have

$$B_1 \ge \sum_{j=2}^{K} \frac{n}{B_j} \ge \sum_{j=2}^{k} \frac{n}{B_j} \ge \frac{(k-1)n}{B_k}, \quad\text{and hence}\quad B_k \ge \frac{(k-1)n}{B_1} \text{ for all } k. \tag{4}$$

This also proves the claim in the abstract, since it implies that $B_K \ge (K-1)n/B_1$. If $B_1$ is fixed, then choosing $B_k = (k-1)n/B_1$ does not lie on the frontier, because then

$$B_1 \ge \sum_{k=2}^{K} \frac{n}{B_k} = \sum_{k=2}^{K} \frac{B_1}{k-1} \in \Omega(B_1 \log K),$$

a contradiction. However, if $H = \sum_{k=2}^{K} 1/(k-1) \in \Theta(\log K)$, then choosing $B_k = (k-1)nH/B_1$ does lie on the frontier and is a factor of $\log K$ away from the lower bound given in Eq. (4). Therefore, up to a $\log K$ factor, points on the regret frontier are characterised entirely by a permutation determining the order of the worst-case regrets, together with the smallest worst-case regret. Perhaps the most natural choice of $B$ (assuming again that $B_1 \le \dots \le B_K$) is

$$B_1 = n^p \qquad\text{and}\qquad B_k = (k-1)\,n^{1-p} H \text{ for } k > 1.$$

For $p = 1/2$ this leads to a bound that is at most a factor of $\sqrt{K}\log K$ worse than that obtained by MOSS and OC-UCB, while being a factor of $\sqrt{K}$ better for a select few.

### Assumptions

The assumption that $\Delta_i \in [0,1]$ is used to avoid annoying boundary problems caused by the fact that time is discrete. Without it, even a single sample from an arm with an extremely large gap can cause a big regret. This assumption is already quite common; for example, a worst-case regret guarantee of $O(\sqrt{Kn})$ clearly does not hold if the gaps are permitted to be unbounded. Unfortunately there is no perfect resolution to this annoyance. Most elegant would be to allow time to be continuous, with actions taken up to stopping times. Otherwise one has to deal with the discretisation/boundary problem via special cases, or make assumptions as I have done here.
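The parametrisation just described is easy to instantiate numerically. The sketch below is again my own illustration (reusing the hypothetical `in_frontier` helper from the earlier sketch): it builds the frontier point determined by the smallest worst-case regret $B_1$ and verifies the constraints, assuming $B_1$ is not so small that some $B_k$ exceeds $n$.

```python
def frontier_point(n, K, B1):
    """B_k = (k-1) * n * H / B1 for k >= 2, with H = sum_{k=2}^K 1/(k-1).

    Instantiates the construction above: arm 1 is favoured with worst-case
    regret B1 and the remaining bounds grow linearly with the rank k.
    """
    H = sum(1.0 / (k - 1) for k in range(2, K + 1))  # H ∈ Θ(log K)
    return [float(B1)] + [(k - 1) * n * H / B1 for k in range(2, K + 1)]

n, K = 10**4, 10
B = frontier_point(n, K, B1=50.0)
print([round(b, 1) for b in B])   # [50.0, 565.8, 1131.6, ..., 5092.1]
print(in_frontier(B, n))          # True: arm 1's constraint holds with equality
```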
## 4 Lower Bounds

**Theorem 1.** Assume $\eta_t \sim \mathcal{N}(0,1)$ is sampled from a standard Gaussian. Let $\pi$ be an arbitrary strategy. Then $8(R^\pi + K) \in \mathcal{B}$.

*Proof.* Assume without loss of generality that $R^\pi_1 = \min_i R^\pi_i$ (if this is not the case, then simply re-order the actions). If $R^\pi_1 > n/8$, then the result is trivial. From now on assume $R^\pi_1 \le n/8$. Let $c = 4$ and define

$$\varepsilon_k = \min\left\{\frac{1}{2},\ \frac{c R^\pi_k}{n}\right\}.$$

Define $K$ vectors $\mu^1, \dots, \mu^K \in \mathbb{R}^K$ by

$$\mu^k_j = \begin{cases} 0 & \text{if } j = 1 \\ \varepsilon_k & \text{if } j = k \ne 1 \\ -\varepsilon_j & \text{otherwise.} \end{cases}$$

Therefore the optimal action for the bandit with means $\mu^k$ is $k$. Write $\mathbb{E}_{\mu^k}$ for the expectation under the bandit with means $\mu^k$. Let $A = \{k : R^\pi_k \le n/8\}$ and $\bar A = \{k : k \notin A\}$, and assume $k \in A$. Then

$$R^\pi_k \overset{(a)}{\ge} R^\pi_{\mu^k,k} \overset{(b)}{\ge} \varepsilon_k\, \mathbb{E}_{\mu^k}\left[n - T_k(n)\right] \overset{(c)}{=} \varepsilon_k\left(n - \mathbb{E}_{\mu^k} T_k(n)\right) \overset{(d)}{=} \frac{c\, R^\pi_k \left(n - \mathbb{E}_{\mu^k} T_k(n)\right)}{n},$$

where (a) follows since $R^\pi_k$ is the worst-case regret with respect to arm $k$, (b) since the gap between the means of the $k$th arm and any other arm is at least $\varepsilon_k$ (note that this is also true for $k = 1$, since $\varepsilon_1 = \min_k \varepsilon_k$), (c) follows from the fact that $\sum_i T_i(n) = n$, and (d) from the definition of $\varepsilon_k$, which for $k \in A$ satisfies $\varepsilon_k = cR^\pi_k/n$. Therefore

$$\mathbb{E}_{\mu^k} T_k(n) \ge n\left(1 - \frac{1}{c}\right) = \frac{3n}{4}. \tag{5}$$

Therefore for $k \ne 1$ with $k \in A$ we have

$$\mathbb{E}_{\mu^k} T_k(n) \overset{(a)}{\le} \mathbb{E}_{\mu^1} T_k(n) + n\varepsilon_k \sqrt{\mathbb{E}_{\mu^1} T_k(n)} \overset{(b)}{\le} n - \mathbb{E}_{\mu^1} T_1(n) + n\varepsilon_k\sqrt{\mathbb{E}_{\mu^1} T_k(n)} \overset{(c)}{\le} \frac{n}{4} + n\varepsilon_k\sqrt{\mathbb{E}_{\mu^1} T_k(n)},$$

where (a) follows from standard entropy inequalities and a similar argument as used by Auer et al. [1995] (details in the supplementary material), (b) since $\mathbb{E}_{\mu^1} T_1(n) + \mathbb{E}_{\mu^1} T_k(n) \le n$, and (c) by Eq. (5) applied to $k = 1$. Combining this with Eq. (5) applied to $k$ gives

$$\mathbb{E}_{\mu^1} T_k(n) \ge \frac{1}{4\varepsilon_k^2},$$

which implies that

$$R^\pi_1 \ge R^\pi_{\mu^1,1} = \sum_{k=2}^{K} \varepsilon_k\, \mathbb{E}_{\mu^1} T_k(n) \ge \sum_{k \in A \setminus \{1\}} \frac{1}{4\varepsilon_k} = \sum_{k \in A \setminus \{1\}} \frac{n}{16 R^\pi_k}.$$

In particular, $R^\pi_1 R^\pi_k \ge n/16$ for every $k \in A \setminus \{1\}$. Therefore for all $i \in A$ we have

$$\sum_{k \ne i} \frac{n}{8 R^\pi_k + 8K} \le K + \sum_{k \in A \setminus \{i\}} \frac{n}{8 R^\pi_k} \le K + 2R^\pi_1 + 2R^\pi_i \le 8\left(R^\pi_i + K\right),$$

where the first inequality follows because $n/(8R^\pi_k + 8K) \le 1$ for $k \in \bar A$, and the second from the previous display together with the pairwise bound. For $i \in \bar A$ the constraint is trivial, since $8(R^\pi_i + K) \ge n$. This implies that $8(R^\pi + K) \in \mathcal{B}$, as required. ∎

## 5 Upper Bounds

I now show that the lower bound derived in the previous section is tight up to constant factors. The algorithm is a generalisation of MOSS [Audibert and Bubeck, 2009] with two modifications. First, the widths of the confidence bounds are biased in a non-uniform way, and second, the upper confidence bounds are shifted. The new algorithm is functionally identical to MOSS in the special case that $B_i$ is uniform. Define $\log_+(x) = \max\{0, \log(x)\}$.

```
Algorithm 1: Unbalanced MOSS
1: Input: n and B_1, ..., B_K
2: n_i = n^2 / B_i^2 for all i
3: for t in 1, ..., n do
4:     I_t = argmax_i  μ̂_{i,T_i(t-1)} + sqrt((4 / T_i(t-1)) * log+(n_i / T_i(t-1))) - sqrt(1 / n_i)
5: end for
```

**Theorem 2.** Let $B \in \mathcal{B}$. Then the strategy $\pi$ given in Algorithm 1 satisfies $R^\pi \le 252 B$.

**Corollary 3.** For all $\mu$ the following hold:

1. $R^\pi_{\mu,i} \le 252 B_i$ for all $i$.
2. $R^\pi_{\mu,i^*} \le \min_i \left(n\Delta_i + 252 B_i\right)$.

The second part of the corollary is useful when $B_{i^*}$ is large, but there exists an arm for which $n\Delta_i$ and $B_i$ are both small.
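A direct implementation of Algorithm 1 is short. The sketch below is my own (the function name, the `pull` callback, and the example parameters are illustrative assumptions, not the paper's code); it follows the index above, including the $\sqrt{1/n_i}$ downward shift.

```python
import math
import random

def unbalanced_moss(n, B, pull):
    """Sketch of Algorithm 1 (Unbalanced MOSS) as stated above.

    pull(i) returns a noisy reward for arm i. The index is the empirical
    mean plus the width sqrt((4/s) * log+(n_i / s)), shifted down by
    sqrt(1/n_i), where n_i = n^2 / B_i^2.
    """
    K = len(B)
    ni = [n * n / (b * b) for b in B]
    counts, means = [0] * K, [0.0] * K

    def index(i):
        s = counts[i]
        if s == 0:
            return float("inf")  # unsampled arms are pulled first
        width = math.sqrt((4.0 / s) * max(0.0, math.log(ni[i] / s)))
        return means[i] + width - math.sqrt(1.0 / ni[i])

    for _ in range(n):
        i = max(range(K), key=index)
        r = pull(i)
        counts[i] += 1
        means[i] += (r - means[i]) / counts[i]  # incremental mean update
    return counts

# Example: arm 1 is favoured, using the two-armed frontier point
# B = (n^(1/3), n^(2/3)) that also appears in the experiments below.
n = 5000
B0 = n ** (1 / 3)
mu = [0.1, 0.0]
print(unbalanced_moss(n, [B0, n / B0], lambda i: random.gauss(mu[i], 1.0)))
```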
The proof of Theorem 2 requires a few lemmas. The first is a somewhat standard concentration inequality that follows from a combination of the peeling argument and Doob's maximal inequality. The proof may be found in the supplementary material.

**Lemma 4.** Let $Z_i = \max_{1 \le s \le n} \left( \mu_i - \hat\mu_{i,s} - \sqrt{\frac{4}{s}\log_+\!\left(\frac{n_i}{s}\right)} \right)$. Then $\mathbb{P}\{Z_i \ge \Delta\} \le \frac{20}{n_i \Delta^2}$ for all $\Delta > 0$.

In the analysis of traditional bandit algorithms the gap $\Delta_{ji}$ measures how quickly the algorithm can detect the difference between arms $i$ and $j$. By design, however, Algorithm 1 negatively biases its estimate of the empirical mean of arm $i$ by $\sqrt{1/n_i}$. This has the effect of shifting the gaps, which I denote by $\tilde\Delta_{ji}$ and define to be

$$\tilde\Delta_{ji} = \Delta_{ji} + \sqrt{\frac{1}{n_j}} - \sqrt{\frac{1}{n_i}} = \mu_i - \mu_j + \sqrt{\frac{1}{n_j}} - \sqrt{\frac{1}{n_i}}.$$

**Lemma 5.** Define the stopping time $\tau_{ji}$ by

$$\tau_{ji} = \min\left\{ s : \hat\mu_{j,s} + \sqrt{\frac{4}{s}\log_+\!\left(\frac{n_j}{s}\right)} \le \mu_j + \frac{\tilde\Delta_{ji}}{2} \right\}.$$

If $Z_i < \tilde\Delta_{ji}/2$, then $T_j(n) \le \tau_{ji}$.

*Proof.* Let $t$ be the first time step such that $T_j(t-1) = \tau_{ji}$. Then

$$\hat\mu_{j,T_j(t-1)} + \sqrt{\frac{4}{T_j(t-1)}\log_+\!\left(\frac{n_j}{T_j(t-1)}\right)} - \sqrt{\frac{1}{n_j}} \le \mu_j + \frac{\tilde\Delta_{ji}}{2} - \sqrt{\frac{1}{n_j}} = \mu_i - \frac{\tilde\Delta_{ji}}{2} - \sqrt{\frac{1}{n_i}} < \hat\mu_{i,T_i(t-1)} + \sqrt{\frac{4}{T_i(t-1)}\log_+\!\left(\frac{n_i}{T_i(t-1)}\right)} - \sqrt{\frac{1}{n_i}},$$

where the equality follows from the definition of $\tilde\Delta_{ji}$ and the final inequality from $Z_i < \tilde\Delta_{ji}/2$. This implies that arm $j$ will not be chosen at time step $t$, and so also not at any subsequent time step by the same argument and induction. Therefore $T_j(n) \le \tau_{ji}$. ∎

**Lemma 6.** If $\tilde\Delta_{ji} > 0$, then $\mathbb{E}\tau_{ji} \le \frac{40}{\tilde\Delta_{ji}^2} + \frac{64}{\tilde\Delta_{ji}^2}\,\mathrm{ProductLog}\!\left(\frac{n_j \tilde\Delta_{ji}^2}{64}\right)$, where $\mathrm{ProductLog}$ denotes the Lambert W function.

*Proof.* Let $s_0$ be defined by

$$s_0 = \left\lceil \frac{64}{\tilde\Delta_{ji}^2}\,\mathrm{ProductLog}\!\left(\frac{n_j \tilde\Delta_{ji}^2}{64}\right) \right\rceil,$$

so that $\sqrt{(4/s)\log_+(n_j/s)} \le \tilde\Delta_{ji}/4$ for all $s > s_0$. Then

$$\mathbb{E}\tau_{ji} = \sum_{s=1}^{\infty} \mathbb{P}\{\tau_{ji} \ge s\} \le 1 + s_0 + \sum_{s=s_0+1}^{\infty} \mathbb{P}\left\{ \hat\mu_{j,s} - \mu_j \ge \frac{\tilde\Delta_{ji}}{2} - \sqrt{\frac{4}{s}\log_+\!\left(\frac{n_j}{s}\right)} \right\} \le 1 + s_0 + \sum_{s=s_0+1}^{\infty} \mathbb{P}\left\{ \hat\mu_{j,s} - \mu_j \ge \frac{\tilde\Delta_{ji}}{4} \right\} \le 1 + s_0 + \frac{32}{\tilde\Delta_{ji}^2},$$

and the claim follows since $\tilde\Delta_{ji} \le 2$. ∎

*Proof of Theorem 2.* Let $\delta = 2/\sqrt{n_i}$, $\lambda = \sqrt{1/n_i}$, and $A = \{j : \tilde\Delta_{ji} > \delta\}$. Then for $j \in A$ we have $\Delta_{ji} \le 2\tilde\Delta_{ji}$ and $\tilde\Delta_{ji} \ge \sqrt{1/n_j}$. Therefore

$$R^\pi_{\mu,i} = \sum_{j=1}^{K} \Delta_{ji}\, \mathbb{E}\, T_j(n) \overset{(a)}{\le} 2B_i + \mathbb{E}\left[\sum_{j\in A} \Delta_{ji}\, \tau_{ji}\right] + n\,\mathbb{E}\left[\max_{j \in A} \Delta_{ji}\, \mathbf{1}\!\left\{Z_i \ge \tilde\Delta_{ji}/2\right\}\right]$$
$$\overset{(b)}{\le} 2B_i + \sum_{j \in A}\left( \frac{80}{\tilde\Delta_{ji}} + \frac{128}{\tilde\Delta_{ji}}\,\mathrm{ProductLog}\!\left(\frac{n_j\tilde\Delta_{ji}^2}{64}\right) \right) + 4n\,\mathbb{E}\left[Z_i \mathbf{1}\{Z_i \ge \lambda\}\right] \overset{(c)}{\le} 2B_i + \sum_{j\in A} 90\sqrt{n_j} + 4n\,\mathbb{E}\left[Z_i\mathbf{1}\{Z_i \ge \lambda\}\right],$$

where (a) follows by using Lemma 5 to bound $T_j(n) \le \tau_{ji}$ when $Z_i < \tilde\Delta_{ji}/2$, while on the other hand the total number of pulls of arms $j$ for which $Z_i \ge \tilde\Delta_{ji}/2$ is at most $n$; (b) follows by bounding $\tau_{ji}$ in expectation using Lemma 6; (c) follows from basic calculus and because $\tilde\Delta_{ji} \ge \sqrt{1/n_j}$ for $j \in A$. All that remains is to bound the expectation:

$$4n\,\mathbb{E}\left[Z_i\mathbf{1}\{Z_i \ge \lambda\}\right] \le 4n\lambda\,\mathbb{P}\{Z_i \ge \lambda\} + 4n\int_{\lambda}^{\infty} \mathbb{P}\{Z_i \ge z\}\,dz \le \frac{160n}{n_i\lambda} = \frac{160n}{\sqrt{n_i}} = 160 B_i,$$

where I have used Lemma 4 and simple identities. Putting it together we obtain

$$R^\pi_{\mu,i} \le 2B_i + \sum_{j \in A} 90\sqrt{n_j} + 160 B_i \le 252 B_i,$$

where I applied the assumption $B \in \mathcal{B}$, so that $\sum_{j \ne i} \sqrt{n_j} = \sum_{j \ne i} n/B_j \le B_i$. ∎

The above proof may be simplified in the special case that $B$ is uniform, where we recover the minimax regret of MOSS, but with perhaps a simpler proof than was given originally by Audibert and Bubeck [2009].

### On Logarithmic Regret

In a recent technical report I demonstrated empirically that MOSS suffers sub-optimal problem-dependent regret in terms of the minimum gap [Lattimore, 2015]. Specifically, it can happen that

$$R^{\mathrm{moss}}_{\mu} \in \Omega\left(\frac{K}{\Delta_{\min}}\log n\right), \tag{6}$$

where $\Delta_{\min} = \min_{i : \Delta_i > 0} \Delta_i$. On the other hand, the order-optimal asymptotic regret can be significantly smaller. Specifically, UCB by Auer et al. [2002] satisfies

$$R^{\mathrm{ucb}}_{\mu} \in O\left(\sum_{i : \Delta_i > 0} \frac{\log n}{\Delta_i}\right),$$

which for unequal gaps can be much smaller than Eq. (6) and is asymptotically order-optimal [Lai and Robbins, 1985]. The problem is that MOSS explores only enough to obtain minimax regret, but sometimes obtains minimax regret even when a more conservative algorithm would do better. It is worth remarking that this effect is harder to observe than one might think. The example given in the aforementioned technical report is carefully tuned to exploit this failing, but still requires $n = 10^9$ and $K = 10^3$ before significant problems arise. In all other experiments MOSS performed admirably in comparison to UCB.

All these problems can be avoided by modifying UCB rather than MOSS. The cost is a factor of $O(\sqrt{\log n})$. The algorithm is similar to Algorithm 1, but chooses the action that maximises the following index:

$$I_t = \arg\max_i\ \hat\mu_{i,T_i(t-1)} + \sqrt{\frac{(2+\varepsilon)\log t}{T_i(t-1)}} - \sqrt{\frac{1}{n_i}},$$

where $\varepsilon > 0$ is a fixed arbitrary constant.

**Theorem 7.** If $\pi$ is the strategy of unbalanced UCB with $n_i = n^2/B_i^2$ and $B \in \mathcal{B}$, then the regret of unbalanced UCB satisfies:

1. (problem-independent regret) $R^\pi_{\mu,i} \in O\left(B_i\sqrt{\log n}\right)$.
2. (problem-dependent regret) Let $A = \left\{i : \Delta_i \ge 2\sqrt{(1/n_i)\log n}\right\}$. Then $R^\pi_{\mu,i^*} \in O\left(\max_i B_i \sqrt{\log n}\cdot\mathbf{1}\{A \ne \{i : \Delta_i > 0\}\} + \sum_{i \in A} \frac{\log n}{\Delta_i}\right)$.

The proof is deferred to the supplementary material. The indicator function in the problem-dependent bound vanishes for sufficiently large $n$ provided $n_i \in \omega(\log n)$, which is equivalent to $B_i \in o(n/\sqrt{\log n})$. Thus for reasonable choices of $B_1, \dots, B_K$ the algorithm is going to enjoy the same asymptotic performance as UCB. Theorem 7 may be proven for any index-based algorithm for which it can be shown that $\mathbb{E}\, T_i(n) \in O(\Delta_i^{-2}\log n)$, which includes (for example) KL-UCB [Cappé et al., 2013] and Thompson sampling (see the analysis by Agrawal and Goyal [2012a,b] and the original paper by Thompson [1933]), but not OC-UCB [Lattimore, 2015] or MOSS [Audibert and Bubeck, 2009].
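For concreteness, the unbalanced UCB index can be written as a one-line helper. This is a minimal sketch of the index displayed above; the function name and defaults are my own, and the $\sqrt{1/n_i}$ shift mirrors Algorithm 1.

```python
import math

def unbalanced_ucb_index(mean, s, t, n_i, eps=0.1):
    """Index from the display above: mean + sqrt((2+eps) log t / s) - sqrt(1/n_i).

    mean: empirical mean of the arm after s pulls; t: current round;
    n_i = n^2 / B_i^2. Unsampled arms get an infinite index so that each
    arm is tried once; eps > 0 is the fixed arbitrary constant of the text.
    """
    if s == 0:
        return float("inf")
    return mean + math.sqrt((2 + eps) * math.log(t) / s) - math.sqrt(1.0 / n_i)
```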
### Experimental Results

I compare MOSS and Unbalanced MOSS in two simple simulated examples, both with horizon $n = 5000$. Each data point is an empirical average of $10^4$ i.i.d. samples, so error bars are too small to see. Code/data is available in the supplementary material.

The first experiment has $K = 2$ arms, $B_1 = n^{1/3}$ and $B_2 = n^{2/3}$. I plotted the results for $\mu = (0, \Delta)$ for varying $\Delta$. As predicted, the new algorithm performs significantly better than MOSS for positive $\Delta$, and significantly worse otherwise (Fig. 1).

The second experiment has $K = 10$ arms. This time $B_1 = \sqrt{n}$ and $B_k = (k-1)H\sqrt{n}$ with $H = \sum_{k=1}^{9} 1/k$. Results are shown for $\mu_k = \Delta\,\mathbf{1}\{k = i^*\}$ with $\Delta \in [0, 1/2]$ and $i^* \in \{1, \dots, 10\}$. Again, the results agree with the theory. The unbalanced algorithm is superior to MOSS for $i^* \in \{1, 2\}$ and inferior otherwise (Fig. 2).

[Figure 1: expected regret of MOSS and Unbalanced MOSS in the two-armed experiment, plotted against $\Delta \in [-0.4, 0.4]$.]

[Figure 2: expected regret of MOSS and Unbalanced MOSS in the ten-armed experiment, plotted against $\theta = \Delta + (i^* - 1)/2$.]

Sadly the experiments serve only to highlight the plight of the biased learner, which suffers significantly worse results than its unbiased counterpart for most actions.

## 6 Discussion

I have shown that the cost of favouritism for multi-armed bandit algorithms is rather serious. If an algorithm exhibits a small worst-case regret for a specific action, then the worst-case regret of the remaining actions is necessarily significantly larger than the well-known uniform worst-case bound of $O(\sqrt{Kn})$. This unfortunate result is in stark contrast to the experts setting, for which there exist algorithms that suffer constant regret with respect to a single expert at almost no cost for the remainder. Surprisingly, the best achievable (non-uniform) worst-case bounds are determined, up to a permutation, almost entirely by the value of the smallest worst-case regret.

There are some interesting open questions. Most notably, in the adversarial setting I am not sure if the upper or lower bound is tight (or neither). It would also be nice to know if the constant factors can be determined exactly asymptotically, but so far this has not been done even in the uniform case. For the stochastic setting it is natural to ask if the OC-UCB algorithm can also be modified. Intuitively one would expect this to be possible, but it would require re-working the very long proof.

## Acknowledgements

I am indebted to the very careful reviewers who made many suggestions for improving this paper. Thank you!

## References

Shipra Agrawal and Navin Goyal. Further optimal regret bounds for Thompson sampling. In Proceedings of the International Conference on Artificial Intelligence and Statistics (AISTATS), 2012a.

Shipra Agrawal and Navin Goyal. Analysis of Thompson sampling for the multi-armed bandit problem. In Proceedings of the Conference on Learning Theory (COLT), 2012b.

Jean-Yves Audibert and Sébastien Bubeck. Minimax policies for adversarial and stochastic bandits. In COLT, pages 217–226, 2009.

Peter Auer, Nicolò Cesa-Bianchi, Yoav Freund, and Robert E. Schapire. Gambling in a rigged casino: The adversarial multi-armed bandit problem. In Proceedings of the 36th Annual Symposium on Foundations of Computer Science, pages 322–331. IEEE, 1995.

Peter Auer, Nicolò Cesa-Bianchi, and Paul Fischer. Finite-time analysis of the multiarmed bandit problem. Machine Learning, 47:235–256, 2002.

Sébastien Bubeck and Nicolò Cesa-Bianchi. Regret Analysis of Stochastic and Nonstochastic Multi-armed Bandit Problems. Foundations and Trends in Machine Learning. Now Publishers Incorporated, 2012. ISBN 9781601986269.

Olivier Cappé, Aurélien Garivier, Odalric-Ambrym Maillard, Rémi Munos, and Gilles Stoltz. Kullback–Leibler upper confidence bounds for optimal sequential allocation. The Annals of Statistics, 41(3):1516–1541, 2013.

Nicolò Cesa-Bianchi and Gábor Lugosi. Prediction, Learning, and Games. Cambridge University Press, 2006.
Eyal Even-Dar, Michael Kearns, Yishay Mansour, and Jennifer Wortman. Regret to the best vs. regret to the average. Machine Learning, 72(1–2):21–37, 2008.

Marcus Hutter and Jan Poland. Adaptive online prediction by following the perturbed leader. Journal of Machine Learning Research, 6:639–660, 2005.

Michael Kapralov and Rina Panigrahy. Prediction strategies without loss. In Advances in Neural Information Processing Systems, pages 828–836, 2011.

Wouter M. Koolen. The Pareto regret frontier. In Advances in Neural Information Processing Systems, pages 863–871, 2013.

Tze Leung Lai and Herbert Robbins. Asymptotically efficient adaptive allocation rules. Advances in Applied Mathematics, 6(1):4–22, 1985.

Tor Lattimore. Optimally confident UCB: Improved regret for finite-armed bandits. Technical report, 2015. URL http://arxiv.org/abs/1507.07880.

Che-Yu Liu and Lihong Li. On the prior sensitivity of Thompson sampling. arXiv preprint arXiv:1506.03378, 2015.

Amir Sani, Gergely Neu, and Alessandro Lazaric. Exploiting easy data in online optimization. In Advances in Neural Information Processing Systems, pages 810–818, 2014.

William Thompson. On the likelihood that one unknown probability exceeds another in view of the evidence of two samples. Biometrika, 25(3/4):285–294, 1933.