# bandit_problems_with_fidelity_rewards__9303b940.pdf

Journal of Machine Learning Research 24 (2023) 1-44 Submitted 11/21; Revised 6/23; Published 11/23

Bandit problems with ﬁdelity rewards

G abor Lugosi gabor.lugosi@upf.edu ICREA Barcelona, Spain, and Department of Economics and Business Pompeu Fabra University Barcelona, Spain, and Barcelona Graduate School of Economics Barcelona, Spain

Ciara Pike-Burke c.pikeburke@gmail.com Department of Mathematics Imperial College London London, UK

Pierre-Andr e Savalle psavalle@cisco.com Cisco Systems, Inc. Paris, France

Editor: Aurelien Garivier

The ﬁdelity bandits problem is a variant of the K-armed bandit problem in which the reward of each arm is augmented by a ﬁdelity reward that provides the player with an additional payoﬀdepending on how loyal the player has been to that arm in the past. We propose two models for ﬁdelity. In the loyalty-points model the amount of extra reward depends on the number of times the arm has previously been played. In the subscription model the additional reward depends on the current number of consecutive draws of the arm. We consider both stochastic and adversarial problems. Since single-arm strategies are not always optimal in stochastic problems, the notion of regret in the adversarial setting needs careful adjustment. We introduce three possible notions of regret and investigate which can be bounded sublinearly. We study in detail the special cases of increasing, decreasing and coupon (where the player gets an additional reward after every m plays of an arm) ﬁdelity rewards. For the models which do not necessarily enjoy sublinear regret, we provide a worst case lower bound. For those models which exhibit sublinear regret, we provide algorithms and bound their regret. Keywords: multi-armed bandit problem, ﬁdelity reward, regret minimization

1. Introduction

Consider the problem of a worker searching for a good restaurant for their daily lunches. They are willing to explore the neighborhood in order to ﬁnd the best restaurant, but also want to have good lunches. This is well modeled as a bandit problem, where the worker has to balance exploration (trying out new or unfamiliar restaurants), and exploitation (having a nice lunch at a favored place). However, the overall experience and satisfaction of the worker may not depend solely on the chosen restaurant, but also on whether the worker is a

2023 G abor Lugosi, Ciara Pike-Burke, Pierre-Andr e Savalle.

License: CC-BY 4.0, see https://creativecommons.org/licenses/by/4.0/. Attribution requirements are provided at http://jmlr.org/papers/v24/21-1408.html.

Lugosi, Pike-Burke, Savalle

regular at this restaurant. Indeed, being loyal often leads to better experiences (e.g., waiters are more friendly, or the customer gets a free meal once in a while). We call the bonus from this loyalty the ﬁdelity reward. In other situations, ﬁdelity can have the opposite eﬀect on the reward. For example, restaurants may oﬀer free drinks to new customers, or customers may get bored of visiting the same restaurant. In this case the ﬁdelity reward decreases. We consider multi-armed bandit problems where the reward of each arm is augmented by a ﬁdelity reward that provides an additional payoﬀdepending on how loyal the player has been to a given arm. We consider two models for the ﬁdelity rewards, namely the loyalty-points model and the subscription model. Under the loyalty-points model, the ﬁdelity reward is a (possibly arm-speciﬁc) function of the total number of past plays of the arm. An important feature of these loyalty points is that once collected, they stay in the bank. This corresponds to the common practice of loyalty programs in marketing and retail, where the customer is rewarded for loyal behavior (e.g., the loyalty schemes oﬀered by airline companies or grocery stores). However, in other cases, loyalty may cause the reward to decrease. For example in recommendation systems, users grow bored of repeatedly seeing the same content. In the stochastic bandit setting, the latter problem is known as rotting bandits and has been studied by Levine et al. (2017), Seznec et al. (2019, 2020) and others. In this paper, we extend this to adversarial rewards and also consider ﬁdelity functions that are not necessarily decreasing functions of the number of past plays. In the subscription model, the ﬁdelity reward is a (possibly arm-speciﬁc) function of the current number of consecutive draws of the arm. In particular, this means that if a player stops playing an arm, the next time they play it they will have to start the subscription again from scratch. This model is more realistic in situations where a user pays for a subscription, such as for a mobile phone plan or a gym membership. Here, the beneﬁt of using a particular service increases with the length of continuous use. Again, there may be scenarios where continuously selecting the same option has a negative impact, for example when customers grow bored of going to the same restaurant every day, or where the beneﬁt is only received after selecting an option a given number of consecutive times, such as when using a weekly metro pass. The aim of this paper is to understand how the presence of ﬁdelity rewards eﬀects the diﬃculty of the multi-armed bandit problem. Our main interest is in the adversarial setting, where we assume that the base rewards are generated by some oblivious adversary and the player receives additional ﬁdelity rewards depending on their past actions. We wish to develop policies with low regret, deﬁned as the cumulative diﬀerence of rewards between the policy and some baseline algorithms. Clearly, the best baseline selects the best possible sequence of T actions. However, even in the standard adversarial bandits problem without any additional ﬁdelity rewards, it is impossible to compete with such a strong baseline. Instead, it is common to take inspiration from the stochastic setting, where it is known that an optimal policy is a single-arm policy that constantly plays the best arm, and compare the performance of an adversarial bandits algorithm to the best single-arm strategy. In the ﬁdelity bandits problem, we also deﬁne the regret relative to a class of policies that is optimal in the stochastic problem. However, here the addition of ﬁdelity rewards means that the optimal policy class in the stochastic setting may be more complex. Moreover, there may be multiple optimal policies for the stochastic problem, thus complicating the

Bandit problems with fidelity rewards

deﬁnition of regret in the adversarial setting. The ﬁrst contribution of this paper is to deﬁne several natural deﬁnitions of regret for the adversarial ﬁdelity bandits problem and to understand the relationship between them.

With these deﬁnitions of regret in hand, we aim to understand in which K-armed adversarial ﬁdelity bandits problems is it possible to achieve regret that is sublinear in T. In the loyalty points problem, we show that when the ﬁdelity rewards are increasing functions of the number of previous plays, sublinear regret is not possible for any deﬁnition of the regret that we consider. However, when the ﬁdelity rewards are decreasing functions of the number of past plays, sublinear regret is possible for some deﬁnitions of regret in the loyalty points model. To show this, we introduce a bandit algorithm that exploits the concavity of the cumulative ﬁdelity reward in this setting and achieves regret scaling with K

T. Lastly, in the so called coupons model where the ﬁdelity reward is only received every ρj plays of an arm, we show that it is possible to achieve sub-linear regret and provide an algorithm with regret scaling with

KT in the loyalty points setting. For the subscription model, where the ﬁdelity rewards are functions of the number of consecutive plays of the arm, our results for the increasing case are the same as for the loyalty points model. Indeed, we show that there exist some settings with increasing ﬁdelity functions where the regret must be linear. In the decreasing subscription model, the results are more positive. Indeed, here we are able to provide an algorithm whose regret is sublinear, and in particular is of order (KT)2/3. We also show that sublinear regret is possible in the coupons model with subscription ﬁdelity reward, although here the regret scales multiplicatively with the periodicity of the coupon function. Our results are summarized in Tables 1 and 2. For completeness, we also provide a near-optimal algorithm for the stochastic ﬁdelity bandits problem in the cases where no such algorithm exists in the literature.

1.1 Related work

In the loyalty points bandits problem, the reward changes depending on the past plays of each arm. Thus, the loyalty points bandits problem can be seen as an instance of the rested bandits problem introduced by Whittle (1988) which has been studied in the stochastic reward setting. We are not aware of any work on the rested bandits problem with adversarial rewards. In the stochastic rested bandits problem, it is classically assumed that the reward of each arm is a stochastic process which only transitions to the next value when the arm is played. Tekin and Liu (2010, 2012) considered Markov processes. A more general process was considered by Cortes et al. (2017) but they considered weaker deﬁnitions of regret, and compared the reward of their algorithm to the best single arm strategy, or the best arm given the history of actions taken by their algorithm.

The rotting bandits problem studied by Levine et al. (2017); Seznec et al. (2019, 2020); Bouneﬀouf and F eraud (2016) is another instance of the stochastic rested bandits problem. This problem is equivalent to the loyalty points bandit problem with stochastic base rewards and decreasing ﬁdelity reward. Indeed, here the expected reward decreases as a function of the number of plays of an arm. The best result that we are aware of for this problem is from Seznec et al. (2019, 2020) who show that the regret can be bounded by e O(

KT). For this, Seznec et al. (2020) present a natural UCB style algorithm (c.f. Auer et al. (2002a)) which does not use any knowledge of the ﬁdelity functions. The stochastic loyalty points

Lugosi, Pike-Burke, Savalle

bandits problem with increasing ﬁdelity reward was also independently studied by Metelli et al. (2022) in concurrent work. In the subscription bandits problem, the reward at each time step depends on the previous plays of multiple arms. In particular, when we stop playing an arm after having played it consecutively for some number of steps, its reward changes even though it is not played, in addition to the reward of the played arm changing. This problem therefore does not exactly ﬁt into either the classical rested or restless bandit framework, especially since while the restless bandit allows for the reward of an arm to change independently of whether it was played, this change is typically assumed to depend on the environment and not on the actions of the player, see Whittle (1988); Slivkins and Upfal (2008); Garivier and Moulines (2011); Besbes et al. (2014). We are not aware of any prior work studying this particular problem neither in the stochastic nor in the adversarial setting. The stochastic bandit problem where the reward depends on the time between consecutive plays of an arm has been studied by Immorlica and Kleinberg (2018); Pike-Burke and Grunewalder (2019); Cella and Cesa-Bianchi (2020); Basu et al. (2019). This is potentially another form of measuring loyalty to an arm but is distinct to the models considered in this paper. Lastly, we observe that standard adversarial bandit algorithms are not suﬃcient to deal with the reward structure in the ﬁdelity bandits problem. These algorithms are typically designed to be competitive with the best single arm strategy which may be far from optimal in the ﬁdelity bandit problem. Moreover, the addition of the ﬁdelity rewards means that it is possible for the adversary to pick rewards to make either getting high reward or high ﬁdelity reward impossible. Similarly, algorithms for learning in Markov Decision Processes (MDPs) with adversarial rewards (e.g., Even-Dar et al. (2009); Zimin and Neu (2013)) are not applicable. These require the MDP to be episodic or irreducible.

2. Problem setup

In the classical K-armed bandit problem, a player sequentially interacts with the environment. Let [K] = {1, . . . , K} denote the set of arms. Then in each round, t, the player selects an arm Jt [K], possibly using the previous observations and incorporating an element of randomness into their decision. The environment then returns a reward Xt,Jt depending on which arm was selected. The player s aim is to select arms that maximize their total reward over T rounds of the bandit game. In the stochastic multi-armed bandit problem, the rewards Xt,j for arm j are generated i.i.d. from an underlying reward distribution. In particular, E[Xt,j] = µj for all 1 t T. We assume throughout the paper that the rewards are bounded in [0, 1]. In the adversarial multi-armed bandit problem, we work under the model of an oblivious adversary. In this model, all rewards Xt,j [0, 1] for j [K] and t [T] are chosen by an adversary prior to starting the game. At round t [T] of the game, the player selects an arm Jt [K] and the reward Xt,Jt is revealed to the player, while the rewards Xt,j with j = Jt remain hidden forever. In this paper, we consider a modiﬁcation of the standard multi-armed bandit problem to account for extra ﬁdelity rewards. We focus on two models; the loyalty-points model and the subscription model. In both these models, after playing arm Jt = j at time t, the player

Bandit problems with fidelity rewards

receives reward of the form, Yt,j = Xt,j + ψj(Ht 1) ,

where ψj(Ht 1) is some known, arm-speciﬁc function of the player s history Ht 1 = {J1, Y1,J1, X1,J1 . . . , Jt 1, Yt 1,Jt 1, Xt 1,Jt 1}. For ease of expression, we often refer to the Xt,j s as the base rewards and the ψj(Ht 1) s as the ﬁdelity rewards. Therefore the reward Yt,j received at time t is composed of the base reward from playing arm j at time t and the additional ﬁdelity reward. In the loyalty points model, the reward depends on the number of past plays of each arm. Let Nt,j = Pt s=1 1{Js=j} denote the number of plays of arm j up to time t 1 and let N0,j = 0. (Here, and in the rest of the paper, 1{ } is used to denote the indicator function.) Then, for all arms j, ψj(Ht) = fj(Nt,j) where each fj is a function deﬁned on the set of natural numbers, taking values in [0, 1]. Hence, the reward the player receives from playing arm j at time t is Yt,j = Xt,j + fj(Nt 1,j) .

In the subscription model, the ﬁdelity reward depends on the current number of consecutive plays of the arm. Formally, deﬁne

Qt,j = 1{Jt=j}(t max{s t : Js = j, Js 1 = j}) .

Note that since only one arm can be played per time step, Qt,j = 0 for all arms j = Jt 1. In this case, ψj(Ht 1) = fj(Qt,j) for some known, arm-speciﬁc function of the number of consecutive plays up to time t 1 and fj : N [0, 1] for all arms j [K]. The reward received from playing arm j at time t is then,

Yt,j = Xt,j + fj(Qt,j) .

In this paper we assume that the ﬁdelity functions fj are known to the player. In addition, we assume that the time horizon T is also known in advance. We also remark that in both models, when ψj 0 (so fj 0) for all j [K], the problem reduces to the classical multi-armed bandit problem.

We now turn to the problem of measuring the performance of an algorithm in the ﬁdelity bandits problem. Typically, the performance of bandit algorithms is measured in terms of their regret, where the regret is the total diﬀerence in reward accumulated by playing according to some optimal policy and that from playing the algorithm of interest. In order to properly deﬁne meaningful notions of regret, we discuss the stochastic and adversarial versions of the problems separately.

2.1.1 Regret in the stochastic bandit problem

Consider ﬁrst the stochastic variant of the problem. Recall that here, the base reward (without the ﬁdelity component), associated to arm j at time t is a random variable Xt,j, where the Xt,j are independent random variables taking values in [0, 1], and E[Xt,j] = µj for all t [T] and j [K].

Lugosi, Pike-Burke, Savalle

A policy π of the player is a sequence (πt)T t=1 where πt is a mapping from the history of arms and rewards, Ht 1 = {π1, Y1,π1, X1,π1 . . . , πt 1, Yt 1,πt 1, Xt 1,πt 1}, to the set [K] of arms. We write πt to denote the choice of arm made by the policy π at time t. The history Ht depends on the chosen arms π1, . . . , πt. The ﬁdelity reward depends on the history of actions chosen according to policy π so we write Ht(π) for the generated history under policy π up to time t. The regret of a policy is deﬁned as the diﬀerence between the cumulative reward of the policy and that of the best sequence of arm choices. To deﬁne the regret formally, let [K]T

be the set of sequences of arms of the form j = (j1, . . . , j T ) with j1, . . . , j T [K] for horizon T N. We write ST (j) = PT t=1 Yt,jt for the cumulative reward of the sequence j. We then deﬁne the regret of a candidate policy π as

max j [K]T ST (j) ST (π) . (1)

To avoid complications caused by random ﬂuctuations, instead of the regret deﬁned above, it is customary to consider the so-called pseudo-regret (see Bubeck and Cesa-Bianchi (2012); Lattimore and Szepesv ari (2020)). In this notion, the base reward of each arm Xt,j is replaced by its expectation µj. Hence, the pseudo cumulative reward of policy π is deﬁned as e ST (π) = PT t=1(µπt + ψπt(Ht 1(π))). Similarly, for a competing reference sequence j, we write e ST (j) = PT t=1(µjt + ψjt(Ht 1(j))). We then deﬁne the pseudo-regret as

RT (π) = max j [K]T e ST (j) e ST (π) . (2)

We often write RT = RT (π) when the policy π is clear from the context. Note that e ST (π) may still be random since the arm choices of the policy π inﬂuence the ﬁdelity rewards. On the other hand, for each sequence j [K]T , the pseudo cumulative reward e ST (j) is deterministic. Note that in the classical multi-armed bandit problem, the regret is usually deﬁned with respect to a much smaller reference class, containing only single-arm strategies. These strategies are constant sequences of the form j = (j, j, . . . , j) for j [K]. The two deﬁnitions are, in fact, equivalent in this problem. To see this, simply notice that in the absence of ﬁdelity rewards, there is always a single-arm strategy that maximizes the expected cumulative reward (i.e., j argminj [K] µj). In the presence of ﬁdelity rewards, single-arm strategies are not necessarily optimal. Depending on the nature of the ﬁdelity reward functions fj, one may still be able to determine a subclass J [K]T that guarantees that, regardless of the distribution of the base rewards (that is, regardless of the values of µ1, . . . , µK), there always exists j J such that e ST (j ) = maxj [K]T e ST (j). In other words,

RT (π) = sup j J e ST (j) e ST (π) .

We call such a subset suﬃcient. In solving the regret minimization problem, it is important to understand the suﬃcient subsets J that allow one such a reduction. In particular, in each case it is helpful to identify minimal suﬃcient sets of sequences. For a particular ﬁdelity

Bandit problems with fidelity rewards

reward f, a set J is minimal if for all proper subsets J J there exists a distribution of the arms such that maxj J e ST (j) < maxj J e ST (j). We deﬁne

Cf = {J [K]T : J is minimal and suﬃcient}

to be the class of minimal suﬃcient sets for given ﬁdelity functions f. For example, for the classical multi-armed bandit problem when fj 0, the class Cf of minimal and suﬃcient sets contains just one set, namely the set J0 = {(j, j, . . . , j) : j [K]} of single-arm strategies.

2.1.2 Regret in the adversarial bandit problem

Next we discuss the possible notions of regret in the adversarial setting. We work with an oblivious adversary. This means that the entire sequence of base rewards (Xt,j)t [T],j [K] is ﬁxed before play starts. In each round t, only the reward corresponding to the arm chosen by the player is revealed. In the adversarial setting the player is allowed to randomize the arm selection. (Without randomization, no nontrivial guarantees exist even for the classical multi-armed bandit problem.) Formally, a player s policy π is a sequence (πt)T t=1 where πt is a measurable mapping from the history of arms and rewards Ht 1 = {J1, Y1,J1, X1,J1 . . . , Jt 1, Yt 1,Jt 1, Xt 1,Jt 1} to the set K, the standard simplex in RK. At time t, an arm Jt is chosen at random, according to the distribution πt(Ht 1). While in the stochastic multi-armed bandit problem the deﬁnition of regret is quite natural, the adversarial case is signiﬁcantly more complex. In the stochastic setting, we deﬁne the regret as the diﬀerence between the (pseudo) cumulative reward of the player and that of the best sequence of actions. Adopting this notion to the adversarial framework often leads to trivialities. To see this, just consider the case of the classical adversarial multi-armed bandit problem where fj 0 for all arms j. In this case, the optimal policy is to play the arm with highest reward at each time step (jt argmaxj Xt,j). Hence, the player would need to compete with all possible KT sequences of arm choices. Obviously, achieving a sublinear regret with respect to this class of strategies is hopeless. One usually gets around this problem by reducing the class of comparison policies (Auer et al. (1995)). The usual deﬁnition of regret in the absence of ﬁdelity rewards is with respect to the class of single-arm strategies. This deﬁnition is natural since single-arm strategies are always optimal in the classical stochastic bandit problem. With this deﬁnition of regret, the adversarial multi-armed problem may be considered as a robust version of its stochastic counterpart. As explained in the previous sections, in the stochastic ﬁdelity bandits problem, competing against the best single-arm strategy is often too restrictive. It is clear that in many cases, it is necessary to switch arms in order to get good reward. This means that the best single arm strategy is no longer a reasonable baseline. Instead, we adopt the same philosophy as described above, and use knowledge of what makes a good sequence in the stochastic setting to deﬁne the regret in the adversarial case. In particular, to deﬁne the regret for the adversarial case for given ﬁdelity functions fj , we ﬁrst determine a minimal set of sequences of arm pulls that can be optimal for the stochastic problem for some distribution. These are the so-called suﬃcient and minimal sets deﬁned in Section 2.1.1. Then we deﬁne the regret as the diﬀerence of the cumulative reward of the player and that of the largest reward one could achieve by following one of the policies in a suﬃcient

Lugosi, Pike-Burke, Savalle

and minimal comparison class. When there is only one suﬃcient and minimal set (i.e., the class Cf contains a single set J ), then the (pseudo) regret is deﬁned simply as

RT (π) = max j J ST (j) e ST (π) , (3)

where ST (j) = PT t=1(Xt,jt + ψjt(Ht 1(j))) is the cumulative reward of a sequence j = (j1, . . . , j T ) and e ST (π) = EST (J) is the expected cumulative reward of policy π. Here J = (J1, . . . , JT ) denotes the sequence of randomized arm choices of policy π and the expectation is with respect to the randomizations of the policy. In some cases of ﬁdelity rewards, there are multiple sets of minimal suﬃcient sets of sequences. In such cases there are various natural ways of deﬁning regret. We single out two possibilities that we call the weak and strong regrets. The weak regret is deﬁned as

R T (π) = min J Cf max j J ST (j) e ST (π) ,

while the strong regret is

R T (π) = max J Cf max j J ST (j) e ST (π) = max j S J Cf J ST (j) e ST (π) .

Clearly, minimizing the strong regret is a more ambitious goal, since the player competes with a larger set of sequences of arms. On the other hand, having a small weak regret means that the policy is able to compete against at least one minimal suﬃcient set of sequences (whose identity may depend on the base rewards (Xt,j)). Of course, if Cf contains only one set, then the weak and strong notions coincide and reduce to the deﬁnition (3). In particular, note that when fj 0, both the weak and strong regret correspond to the notion of regret classically considered in standard adversarial bandits problem.

2.1.3 Mean regret for loyalty points bandits

For loyalty points bandits, there is a third notion of regret that is perhaps more natural than the weak and strong regret. This is deﬁned by observing that in the stochastic loyalty points model, the total reward only depends on the number of plays of each arm and the average base reward, and not on the order of plays. We may replicate this in the adversarial setting using empirical averages. We call this the mean regret. Unfortunately, such a deﬁnition of regret is not suitable for the subscription model where the total reward inherently depends on the ordering, since the subscription model only oﬀers ﬁdelity reward for consecutive plays of an arm. If a sequence j = (j1, . . . , j T ) [K]T plays arm j NT,j(j) = PT t=1 1{jt=j} times up to time T then in the stochastic analogue of loyalty points bandits problem, its (pseudo) reward is

µj NT,j(j) +

j=1 µj (NT,j(j) + Fj(NT,j(j)))

where Fj(n) = Pn t=1 fj(t) are the cumulative ﬁdelity rewards and µj is the expected value of the base reward of arm j. Hence, for given ﬁdelity functions f1, . . . , f K, in the stochastic

Bandit problems with fidelity rewards

problem, e ST (j) only depends on the number of times each arm is played and the expected base rewards. To shorten notation, introduce the vector N = (N1, . . . , NK) of nonnegative integers satisfying PK j=1 Nj = T. N is called the type of the sequence j if PT t=1 1{jt=j} = Nj for all j [K]. Let µ = (µ1, . . . , µK) [0, 1]K denote the vector of means of the base rewards of the K arms and deﬁne,

σ(µ, N) def. =

j=1 (Njµj + Fj(Nj)) .

Then, the cumulative reward of the optimal sequence of actions is maxj [K]T e ST (j) = max N σ(µ, N) and the regret for the stochastic problem is RT (π) = max N σ(µ, N) e ST (π). We introduce some deﬁnitions to help relate these types to the minimal suﬃcient sets. Fix some ﬁdelity reward functions f1, . . . , f K. For a given µ [0, 1]K, let N(µ) = {N : σ(µ, N) = max N σ(µ, N )} be the set of optimal types. We call a set B of types a covering set if for all µ [0, 1]K, N(µ) B = . A covering set B is minimal if no proper subset of B is a covering set. Each minimal suﬃcient set of sequences J Cf is such that J contains exactly one sequence of each type of a minimal covering set B. We say that a type N is admissible if N S µ [0,1]K N(µ). In the adversarial bandit problem, it is natural to replace the values µj by the empirical averages

since these relate to the long run rewards. We then set max N σ(bµT , N) as a goal for the learner, where bµT = (bµT,1, . . . , bµT,K). With this in mind, we deﬁne the mean regret as

R T (π) = max N σ(bµT , N) e ST (π) = max N NK:P j [K] Nj=T

j=1 (NjbµT,j + Fj(Nj)) e ST (π) .

Hence, the type N maximizing σ(bµT , N) may be interpreted as the best response to the (unknown) empirical means of the outcomes (in hindsight). This point of view is often adopted in the theory of repeated games, see Hart and Mas-Colell (2000, 2001); Foster and Vohra (1999). This gives the mean regret a natural interpretation that is a generalization of most usual notions of regret. The next proposition establishes the relationship between the weak, strong, and mean regrets.

Proposition 1 In the loyalty points model, regardless of the ﬁdelity rewards and the base rewards, for each policy π, R T (π) R T (π) R T (π) .

In particular, if there is only one minimal and suﬃcient set J , then all three regrets are equal to (3).

Proof The stated inequalities may be equivalently written as

min J Cf max j J ST (j) max N σ(bµT , N) max J Cf max j J ST (j) .

Lugosi, Pike-Burke, Savalle

Increasing Decreasing Coupons

Stochastic E[RT ] = e O(K1/3T 2/3) E[RT ] = e O(

KT) E[RT ] = e O(

KT) [ Section 3.2 ] [ Seznec et al. (2020) ] [ Section 5.2 ]

Adversarial

E[R T ] = Ω(T)

E[R T ] = Ω(T)

E[R T ] = Ω(T)

E[R T ] = Ω(T)

E[R T ] = e O(K

E[R T ] = e O(K

E[R T ] = e O(

E[R T ] = e O(

E[R T ] = e O(

KT) [ Section 3.3 ] [ Section 4.3 ] [ Section 5.3 ]

Table 1: Summary of our main results in the loyalty points bandits problems. e O indicates order up to log factors, and ρ is the least common multiple of the periodicity ρ1, . . . , ρK in the coupons reward model.

Fix the ﬁdelity reward functions f1, . . . , f K. Denote by N the type of the sequence that solves the min-max optimization in the deﬁnition of the weak regret, that is,

min J Cf max j J ST (j) = min j [T]K:j has type N ST (j) .

Observe that there is an admissible type with this property. On the other hand, for any admissible type N = (N1, . . . , NK), the average of the cumulative rewards over all sequences of type N equals

1 T N1,...,NK X

j [T]K:j has type N ST (j) = 1 T N1,...,NK X

j [T]K:j has type N

t=1 Xt,jt +

= σ(bµT , N) .

Hence, min J Cf max j J ST (j) σ(bµT , N) max N σ(bµT , N) ,

proving the ﬁrst announced inequality. The second inequality is proved similarly.

We note that while the mean regret is not applicable in the subscription model, R T (π) R T (π) obviously holds in the subscription model as well. In fact, in all instances of the subscription model that we consider in this paper, it will turn out that there is only one minimal suﬃcient set, meaning that R T (π) = R T (π), and so there is no need to consider an intermediate deﬁnition of regret.

2.2 Fidelity reward functions

Throughout this work, we assume that the ﬁdelity functions are known and fj takes values in [0, 1] for all arms j [K]. We consider both increasing and decreasing ﬁdelity functions,

Bandit problems with fidelity rewards

Increasing Decreasing Coupons

Stochastic E[RT ] = e O(T 2/3K1/3) E[RT ] = e O(T 2/3K1/3) E[RT ] = e O(

KT) [ Section 6.2 ] [ Section 7.2 ] [ Section 8.2 ]

Adversarial E[R T ] = Ω(T)

E[R T ] = Ω(T)

E[R T ] = e O((KT)2/3)

E[R T ] = e O((KT)2/3)

E[R T ] = O( p

E[R T ] = O( p

ρKT) [ Section 6.31] [ Section 7.3 ] [ Section 8.3 ]

Table 2: Summary of our main results in the subscription bandits problems. e O indicates order up to log factors, and ρ is the least common multiple of the periodicity ρ1, . . . , ρK in the coupons reward model.

in addition to a certain class of periodic ﬁdelity rewards that we call the coupons model. In the increasing rewards model, we assume that for all j [K], fj(n) is a nondecreasing function of n. For the decreasing rewards model, we assume that for all j [K], fj(n) is nonincreasing in n. In the coupons model, a ﬁdelity reward is obtained only once every few (consecutive) plays of each arm. In particular, let the period length of arm j be denoted by ρj N, and the bonus ﬁdelity reward rj [0, 1]. Then, the player receives a ﬁdelity bonus rj after every ρj (consecutive) plays of arm j. More formally, for rj > 0,

( rj if t = 0 (mod ρj) 0 otherwise.

Throughout the paper, it is often helpful to consider the cumulative ﬁdelity reward of each arm j which is deﬁned for any n [T] as,

t=1 fj(n) .

We now consider the diﬀerent cases in turn. For each, we ﬁrst analyze the stochastic setting and deﬁne the class of minimal suﬃcient sequences which also deﬁnes the regret in the adversarial setting. We then analyze the adversarial setting. Our results are summarized in Tables 1 and 2.

3. Loyalty points model: increasing ﬁdelity rewards

The ﬁrst setting we consider is the loyalty points model where the ﬁdelity functions are increasing functions of the number of past plays of each arm.

3.1 Minimal suﬃcient sets

We ﬁrst need to deﬁne the minimal suﬃcient sets J of sequences of actions deﬁned in Section 2.1. When the ﬁdelity rewards are nondecreasing, the unique minimal suﬃcient

1. This is the general result. For the special case where the ﬁdelity functions are step functions with a step from 0 to 1 at m we also show that the weak and strong regret can be bounded by e O(T 2/3(Km)1/3).

Lugosi, Pike-Burke, Savalle

set of sequences is the set J0 of single-arm strategies. This is shown in the below lemma, whose proof is in Appendix A.1. When the ﬁdelity functions of the diﬀerent arms are equal, optimality of such a single-arm strategy is obvious. In the case of arm-speciﬁc ﬁdelity functions, recall that Fj(T) = PT t=1 fj(t), then the optimal arm is given by

j = argmax 1 j K

(In case of multiple maximizers, we may choose any of them). The optimal policy then plays arm j for all rounds t = 1, . . . , T.

Lemma 2 In the stochastic loyalty points model with increasing ﬁdelity rewards, regardless of the distribution of the base rewards and the ﬁdelity rewards, there exists a single-arm strategy that minimizes the pseudo cumulative reward e ST (j) over all j [K]T .

3.2 Stochastic rewards

As Lemma 2 shows, to minimize regret, it suﬃces to construct a strategy that is able to compete with single-arm strategies. In the stochastic setting, a simple variant of the UCB algorithm of Auer et al. (2002a) achieves this. Instead of using the observations Yt,j to construct the sample averages and upper conﬁdence bounds, one may work with the modiﬁed rewards e Yt,j = Xt,j + 1

T Fj(T) . Note that since the ﬁdelity functions are known, computing these modiﬁed rewards is possible. These modiﬁed rewards are natural since the best single-arm policy equates to playing the arm that maximizes µj + Fj(T)/T in all rounds. We then deﬁne the upper conﬁdence bounds around the sample mean of these observations by

UCBt(j) = Xt,j + 1

where Xt,j = 1 Nt,j Pt s=1 Xs,j1{Js=j}. The algorithm proceeds as a standard UCB algorithm: ﬁrst it plays each arm once, then for t = K + 1, . . . , T, it plays arm

Jt = argmax 1 j K UCBt 1(j) .

The regret of this algorithm is bounded in the next theorem, whose proof follows from a modiﬁcation of the standard UCB analysis (see (Auer et al., 2002a)) and is given in Appendix A.2.

Theorem 3 The expected (pseudo) regret of the UCB algorithm deﬁned above in the stochastic loyalty points bandits problem with increasing ﬁdelity reward is bounded by

(µj µj + fj (T) fj(0)) + 1

where e j = µj µj + (Fj (T) Fj(T))/T. Consequently, the worst case regret of this algorithm is bounded by O(T 2/3(K log(T))1/3).

Bandit problems with fidelity rewards

When the ﬁdelity functions are all constant, fj (T) = Fj (T)/T, fj(0) = Fj(T)/T, the problem reduces to the standard stochastic multi-armed bandit problem and we recover the optimal O(PK j=1 log(T)/e j) order of the problem dependent regret (see Auer et al. (2002a)). When the ﬁdelity functions are not constant, the leading term in the regret is multiplied by µj µj+fj (T) fj(0)

e j 1. This term can be as large as 2/e j, in which case, the

regret is of the order of log(KT)

e 2 j . Thus, while the logarithmic dependence on the horizon T

is the same as in the standard multi-armed bandit problem, the dependence on the gap e j may be worse, depending on the ﬁdelity functions. Concurrent work (Metelli et al., 2022) also obtained similar upper bounds on the regret for this problem. We argue that this worse dependence on e j is inevitable. Consider the case where the ﬁdelity functions are the same for all arms, that is, fj = f for all j [K]. Here, e j = µj µj and j = argmaxj µj. For simplicity, assume that the distribution of arm j is Bernoulli with parameter µj [0, 1]. If Fj(T) = Fj (T), then to distinguish µj + Fj(T)/T from µ + Fj (T)/T, if suﬃces to distinguish µj from µj . A classical result of Lai and Robbins (1985) shows that, in the standard multi-armed bandit problem, any algorithm that is guaranteed to achieve regret o(T α) for every α > 0 must play any arm j at least Nj = Ω( log(T)

2 j ) times to distinguish µj and µj . Thus, any algorithm that obtains

sublinear regret must explore sub-optimal arms which necessarily results in some loss of ﬁdelity reward. Deﬁne τ = P j:j =j Nj. Then the contribution to the regret from the lost ﬁdelity reward is

t=1 f(Nt 1,Jt) =

t=T τ f(t) X

t=1 (f(T τ) f(Nj))

This gives a lower bound on the regret of Ω(P j:j =j log(T)

e 2 j (µj µj + f(T τ) f(Nj))).

While this does not exactly match Theorem 3, it shows that a term involving the ﬁdelity functions is unavoidable and that the problem is, in general, harder than the plain stochastic bandit problem.

3.3 Adversarial rewards

As seen in the previous section, the only set of minimal suﬃcient sequences is the set J0 of single-arm strategies. Hence, as in (3), the notions of weak, strong, and mean regrets introduced in Section 2.1 coincide and equal

RT (π) = max j J0 ST (j) e ST (π) .

This is the same notion of regret as in the standard adversarial bandits problem. In spite of the simplicity of the target and unlike in the stochastic version of the problem, here the addition of the ﬁdelity reward makes it considerably more diﬃcult to achieve sublinear regret. In fact, Theorem 4 below shows that, unless the ﬁdelity rewards are essentially constant, any policy must incur linear regret for some reward sequence.

Lugosi, Pike-Burke, Savalle

To prove the lower bound, it suﬃces to consider the case where the ﬁdelity function is the same for all arms. We obtain the following lower bound for the regret, the proof of which is in Appendix A.3.

Theorem 4 Consider the adversarial loyalty points bandits problem with increasing ﬁdelity rewards, where the ﬁdelity function is the same for every arm. Deﬁne δ = f(7T/8) f(T/8). Then for every policy π there exists a sequence of rewards such that the regret satisﬁes

Thus, the regret is Ω(T) unless f is essentially constant over most of its domain. The particular constants appearing in the bound (and in the deﬁnition of δ) have no special signiﬁcance, they have been chosen for convenience. In particular, the deﬁnition of δ may be replaced by f(T(1 ϵ)) f(Tϵ) for any ϵ (0, 1/2) and the constant 1/40 modiﬁed accordingly. We omit the straightforward details. Theorem 4 implies that in the adversarial case one cannot hope for a nontrivial regret bound unless the (nondecreasing) ﬁdelity rewards essentially do not change with time. In that case one may use a simple and natural modiﬁcation of any low-regret algorithm for adversarial bandits such as EXP3 (Auer et al. (2002b)). We omit the straightforward details.

4. Loyalty points model: decreasing ﬁdelity rewards

4.1 Stochastic rewards

The stochastic loyalty points bandit problem with decreasing ﬁdelity rewards is equivalent to the rotting bandits problem studied by Levine et al. (2017); Seznec et al. (2019, 2020), and Bouneﬀouf and F eraud (2016). Seznec et al. (2019, 2020) provide an algorithm with regret RT = e O( p

KT log(T)). This matches the lower bound for the standard bandit problem (which can be viewed as a special case of ﬁdelity bandits where the ﬁdelity function is constantly 0) up to logarithmic factors. In the adversarial setting, the loyalty points model with decreasing ﬁdelity rewards is signiﬁcantly more complex. In order to appropriately deﬁne the notion of regret, ﬁrst we need to determine the class Cf of minimal suﬃcient sets of sequences of arm pulls. We discuss this in the next section. As it turns out, the class Cf is nontrivial in this case and therefore the notions of strong, weak, and mean regret do not coincide. In fact, we show that while there is no algorithm that achieves sublinear strong regret for a large class of ﬁdelity functions, it is possible to construct a policy with guaranteed sublinear mean regret and thus also sublinear weak regret.

4.2 Minimal suﬃcient sets

In order to study the adversarial version of the problem, ﬁrst we need to understand the minimal suﬃcient sets of sequences of actions. Recall from Section 2.1 that each minimal suﬃcient set J Cf is such that J contains exactly one sequence of each type of a minimal covering set B. Hence, in order to understand the class Cf of minimal suﬃcient sets, one

Bandit problems with fidelity rewards

needs to understand the set of admissible types, that is, types that are maximizers of the function σ(µ, N) for some µ [0, 1]K. An important special case is when the ﬁdelity rewards fj(t) are strictly decreasing functions of t for each j [K]. Then the average cumulative reward associated to a a type N = (N1, . . . , NK),

j=1 Fj(Nj),

is strictly concave. In fact, one may extend h to the standard simplex K such that the function h : K R is strictly concave. Then the cumulative total expected reward corresponding to a mean vector µ [0, 1]K may be written as

σ(µ, N) = T N

T , µ + h N

It is easy to see that for such ﬁdelity reward functions, for each type N, there exists µ [0, 1]K such that N(µ) = {N}, that is, N is the unique type maximizing σ(µ, ). In such cases, the minimal suﬃcient sets are all those sets J that, for each type N, J contains exactly one sequence j of type N. But then

max J Cf max j J ST (j) = max j [K]T ST (j)

and therefore it is hopeless to minimize the strong regret. Based on this, it is not diﬃcult to prove the following proposition.

Proposition 5 Consider the adversarial loyalty points bandit problem with strictly decreasing ﬁdelity reward functions f1, . . . , f K. Then there exists a positive constant c depending on the ﬁdelity reward functions only such that for any (randomized) policy π of the forecaster, there exist base rewards (Xt,j)t [T],j [K] such that the strong regret satisﬁes

R T (π) c T .

Remark 6 In some cases, when the ﬁdelity rewards are not strictly decreasing (so the conditions of Proposition 5 do not hold), it is possible to achieve sublinear strong regret. As an example, consider the case when K = 2 and the ﬁdelity function of both arms is fj(t) = α1t T0 for j = 1, 2, where α (0, 1) and T0 [T] are some ﬁxed parameters. Then it is easy to see that the set of optimal types N(µ) corresponds to a single-arm strategy whenever |µ1 µ2| = α. When µ1 µ2 = α, N(µ) consists of all types with N1 T T0 and when µ1 µ2 = α, N(µ) consists of all types with N1 T0. Hence, in all cases, N(µ) contains one of the two single-arm strategies and therefore the only minimal suﬃcient set is J0. This implies that the three notions of regret coincide in this case. As is shown below, sublinear mean regret is always achievable and hence, in this particular example, the strong regret can also be made sublinear.

Lugosi, Pike-Burke, Savalle

Input: T (horizon), tuning parameters η (0, 1], ϵ (0, 1). Initialize: Deﬁne q(1), . . . , q(M) K to be an ϵ cover of K so M ϵ (K 1). Initialize: w1,i = 1 i = 1, . . . , M.

1: for all t = 1, . . . , T do

2: Set pt,i = wt,i PM i=1 wt,i and deﬁne qt = P i [M] pt,iq(i)

3: Select arm Jt to play by sampling the arms with probabilities (qt,1, . . . , qt,K) and receive reward YJt,t = Xt,Jt + f Jt(NJt,t 1) [0, 2].

4: Calculate the pseudo-losses b Xt,j = 1 1 Xt,j

qt,j 1{Jt=j}

5: Calculate the pseudo-losses of each expert, bℓt(q(i)) = P j [K] q(i) j (1 b Xt,j) h(q(i))

6: Update wt+1,i = wt,i exp{ ηbℓt(q(i))}.

Figure 1: EXP4 with pseudo rewards for the adversarial loyalty points bandits model with decreasing ﬁdelity reward.

4.3 Minimizing the mean regret

Interestingly, as opposed to the strong regret, minimizing the mean (and therefore weak) regret is an achievable goal. In this section we deﬁne a policy that guarantees that if the ﬁdelity functions are nonincreasing, then the mean regret is o(T) regardless of the base rewards. The algorithm we propose is an exponentially weighted average forecaster where the set of experts q(1), . . . , q(M) K is deﬁned by an ϵ-cover of the simplex K. Similarly to in the standard exponentially weighted average forecaster, we need to estimate the loss of each expert. In this loyalty points bandits problem this is somewhat more involved since the loss estimates need to be deﬁned to take into account the ﬁdelity rewards. We use the loss estimates bℓt(q(i)) = P j [K] q(i) j (1 b Xt,j) h(q(i)) for each expert i where b Xt,j = 1 1 Xt,j

qt,j 1{Jt=j} for all arms j. The algorithm is deﬁned in Figure 1. In order to work with these adjusted losses, we need to show that they enjoy the desirable properties of an importance sampling estimate and that they lead to the correct dependence on the ﬁdelity reward of our algorithm. This is done in the following theorem which provides a bound on the regret of the algorithm.

Theorem 7 Consider the adversarial loyalty points bandit problem when the ﬁdelity rewards are nonincreasing for all j [K]. Then, regardless of the base rewards, the randomized policy π deﬁned in Figure 1 with ϵ = (K + 1)/

2T and η = p

log(1/ϵ)/(2T) has mean regret

ER T (π) (K + 1)

log(K + 1) + 1 +

Bandit problems with fidelity rewards

Proof We consider a randomized forecaster that, at each time instance, selects a point qt = (qt,1, . . . , qt,K) in the standard simplex K. Then the forecaster plays a random arm, chosen according to the distribution qt, that is,

Jt = j with probability qt,j, j [K] ,

Writing e Nj = PT t=1 1Jt=j and f N = ( e N1, . . . , e NK) for the corresponding type, the cumulative reward of such a forecaster equals

t=1 Xt,Jt +

t=1 Fj( e Nj) = T

t=1 Xt,Jt + h

Consider the piecewise linear extension of h to K. It follows from the fact that the ﬁdelity rewards are nonincreasing that h is concave and 1-Lipschitz with respect to the ℓ1 norm. Then we may apply the Hoeﬀding-Azuma inequality to both PT t=1 Xt,Jt and e Nj to obtain that, with probability at least 1 δ,

t=1 Xt,Jt +

t=1 Fj( e Nj)

j [K] qt,j Xt,j + Th

2 log K + 1

j [K] qt,j Xt,j +

t=1 h (qt) (K + 1)

2 log K + 1

where we used concavity of h. In particular, by standard tail integration,

t=1 Xt,Jt +

t=1 Fj( e Nj)

j [K] qt,j Xt,j +

2 log(K + 1) . (5)

Hence, it suﬃces to construct qt such that

j [K] qt,j Xt,j +

with γT = (K + 1)

. In fact, we prove the slightly stronger bound

j [K] qt,j Xt,j +

max q K T ( q, bµT + h (q)) γT .

In order to construct such a forecaster qt, we discretize K. Let ϵ > 0 and let q(1), . . . , q(M)

be a minimal ϵ-cover of K with respect to the ℓ1 distance. Using the fact that a minimal

Lugosi, Pike-Burke, Savalle

ϵ-cover is an ϵ/2 packing and a standard volumetric estimate, we get that M ϵ (K 1). Also, by the Lipschitz property of h,

max i [M] T D q(i), bµT E + h q(i) max q K T ( q, bµT + h (q)) 2Tϵ . (6)

We treat q(1), . . . , q(M) as experts and use the exponentially weighted average forecaster. In order to do this, we use the estimates

b Xt,j = 1 1 Xt,j

qt,j 1{Jt=j} , j [K] ,

where qt,j is the probability our forecaster plays arm j in round t. Observe that b Xt,j may be computed for all j since Xt,Jt is observed after selecting arm Jt at time t. Also note that EJt qt b Xt,j = Xt,j and 1 b Xt,j 0. To each expert q(i) we assign the loss estimate

bℓt(q(i)) = X

j [K] q(i) j (1 b Xt,j) h(q(i)) = 1 X

j [K] q(i) j b Xt,j h(q(i)) .

Now we are prepared to deﬁne the proposed forecaster. At each time instance t = 1, . . . , T, qt is chosen as a weighted average of the experts q(1), . . . , q(M), weighted by the distribution pt,1, . . . , pt,M so that qt = P i [M] pt,iq(i), where for t = 1, pt,i = 1/M for all i [M] and for t > 1, pt,i = wt 1,i

wt 1,i = exp

s=1 bℓs(q(i))

and Wt 1 = X

i [M] wt 1,i .

Here η > 0 is a tuning paramenter. We proceed by following analysis of the exponentially weighted average forecaster (see, e.g., Cesa-Bianchi and Lugosi (2006)). On the one hand,

t=1 bℓt(q(i))

t=1 bℓt(q(i)) log M

for all i [M]. On the other hand,

i [M] pt,ie ηbℓt(q(i))

Bandit problems with fidelity rewards

i [M] pt,ibℓt(q(i)) + η2 X

i [M] pt,ibℓt(q(i))2

(using e x 1 x + x2 for x 1 and bℓt(q(i)) 1 by construction)

i [M] pt,ibℓt(q(i)) + η2 T X

i [M] pt,ibℓt(q(i))2 .

Comparing the upper and lower bounds for log(WT /W0), we obtain that, for every i [M]

i [M] pt,ibℓt(q(i))

t=1 bℓt(q(i)) + log M

i [M] pt,ibℓt(q(i))2 . (7)

Then, observe that EJt qtbℓt(q(i)) = 1 P j [K] q(i) j Xt,j h(q(i)), EJt qt P i [M] pt,ibℓt(q(i)) = 1 P j [K] qt,j Xt,j P i [M] pt,ih(q(i)) and for each t [T],

i [M] pt,ibℓt(q(i))2 = X

i [M] pt,i EJt qt

j [K] q(i) j (1 b Xt,j) h(q(i))

i [M] pt,i EJt qt

j [K] q(i) j (1 b Xt,j)

i [M] pt,i EJt qt

j [K] q(i) j 1 Xt,j

qt,j 1{Jt=j}

i [M] pt,i X

q(i) j 1 Xt,j

i [M] pt,i X

q(i) j qt,j

= 2(K + 1) (using qt = X

i [M] pt,iq(i)) .

So we can rearrange the expression in (7) to get

j [K] qt,j Xt,j +

i [M] pt,ih(q(i))

j [K] q(i) j Xt,j + Th(q(i))

η 2(K + 1)ηT

max q K T ( q, bµT + h (q)) 2Tϵ (K 1) log 1

ϵ η 2(K + 1)ηT

Lugosi, Pike-Burke, Savalle

(by (6)) and since M ϵ (K 1))

= max q K T ( q, bµT + h (q)) 2Tϵ (K + 1)

(choosing η = p

log(1/ϵ)/(2T))

= max q K T ( q, bµT + h (q)) (K + 1)

with the choice ϵ = (K + 1)/

2T. The proof is ﬁnished by noting that, by concavity of h,

i [M] pt,ih(q(i)) h

i [M] pt,iq(i)

and therefore

j [K] qt,j Xt,j +

t=1 h(qt) max q K T ( q, bµT + h (q)) (K +1)

Combining this with (5) implies the announced regret bound.

Remark 8 Since the set of experts needs to cover the simplex, the proposed algorithm may not be computationally eﬃcient. A possible way to construct computationally eﬃcient low-regret forecasters is to exploit the concavity of h and use ideas from bandit convex optimization (e.g., Bubeck et al. (2017)). However, it is unclear whether such an algorithm achieves the same dependence on K. This is left for future research. A related open question is to determine the optimal dependence on K. While the lower bound of Ω(

KT) for the standard adversarial bandits problem (where the ﬁdelity functions are all 0) suggests that the dependence on T in Theorem 7 cannot be improved, it is unclear what the optimal dependence on K is in the adversarial loyalty points bandit problem with decreasing ﬁdelity rewards.

5. Loyalty points model: coupon rewards

In this section we study the loyalty points model with coupons ﬁdelity rewards. Recall that in this model, associated to each arm j [K], are a positive integer ρj (the period length) and a real number rj [0, 1] and the player receives the extra ﬁdelity reward rj after each (not necessarily consecutive) ρj plays of arm j.

5.1 Minimal suﬃcient sets

Once again, in order to understand regret, the ﬁrst step is to determine the class of minimal suﬃcient sets. To this end, we need to understand the set of types N that are maximizers

Bandit problems with fidelity rewards

of the function σ(µ, N) for some µ [0, 1]K. This is not a simple task because of the nonlinear nature of the coupon rewards. Note that

The presence of the integer part in the terms Nj/ρj means that the set of possible optimizers may be quite rich and complex. However, in some cases, the class of minimal suﬃcient sets is simple. For example, consider the simplest case when the period lengths ρj associated to all arms j [K] are the same, say equal to ρ. If T happens to be divisible by ρ, then it is easy to see that playing an arm j that maximizes µj + rj/ρ forever maximizes the reward. (In other words, Nj = T for a maximizing arm j and Ni = 0 for all other arms is an optimal type.) Hence, in this case, the unique minimal suﬃcient set of sequences is the set J0 of single-arm strategies. However, even if T is not an integer multiple of the period length ρ, if ρj = ρ for all j [K], then for some µ [0, 1]K the optimal type is such that an arm maximizing µj + rj/ρ is played ρ T/ρ times and a (possibly diﬀerent) arm maximizing µj is played the remaining T ρ T/ρ times. Thus, in this case, the class of minimal suﬃcient sets consists of all sequences that are either single-arm strategies or such that only two arms are played, one of them ρ T/ρ times and the other T ρ T/ρ times. When the period lengths are diﬀerent, the optimizing sequences may have a more complicated structure.2

Nevertheless, the above argument easily generalizes to possibly diﬀerent period lengths as stated in the next lemma.

Lemma 9 Let ρ denote the least common multiple of the period lengths ρ1, . . . , ρK. If T is an integer multiple of ρ, then the unique minimal suﬃcient set of sequences is the set J0 of single-arm strategies. More generally, every minimal suﬃcient set J Cf is such that every sequence in J has type N = (N1, . . . , NK) such that for some j [K], Nj ρ T/ρ .

In other words, even though single-arm strategies are not necessarily optimal, every j S J Cf J is almost a single-arm strategy in the sense that it diﬀers from one in at most T ρ T/ρ ρ positions. This fact will help us design strategies with small strong regret in the adversarial version of the problem.

5.2 Stochastic rewards

The observation made in Lemma 9 suggests a simple way of achieving small regret in the stochastic loyalty-points bandit problem with coupon rewards. One may simply aim at accumulating a reward that is not much smaller than maxj [K] µj + rj

. This may be

achieved by applying a regret-minimization strategy using the augmented rewards b Yt,j = Xt,j + rj

ρj . Suppose that one plays according to a standard K-armed bandit strategy based

on the rewards b Yt,j, and that the expected cumulative regret of this strategy compared to

2. Note that the optimization problem is closely related to (though not completely equivalent to) a knapsack problem where each item j has size ρj and reward ρjµj + rj and the total capacity of the knapsack is T.

Lugosi, Pike-Burke, Savalle

T maxj [K] µj + rj

is bounded by a quantity ϵT . This means that if b N1, . . . , b NK denote the (random) number of times each of the K arms is pulled up to time T, then

T max j [K]

This implies that the expected ﬁdelity-augmented reward of the strategy satisﬁes

µj b Nj + rj

µj b Nj + rj

T max j [K]

max N σ(µ, N) ϵT K .

In other words, the regret of the strategy is at most ϵT + K. For example, by using a standard UCB algorithm (Lattimore and Szepesv ari, 2020), one may take

j = max i [K]

are the gap parameters, see Lattimore and Szepesv ari (2020). Summarizing, we have the following:

Theorem 10 Consider the stochastic loyalty-points bandit problem with coupon rewards. There exists a strategy of play whose expected regret satisﬁes

Consequently, the worst case regret of this strategy is O( p

KT log(T)).

Note that these regret bounds are near optimal for the standard stochastic K-armed bandit problem, so we expect that they cannot be improved for the stochastic loyalty points bandits problem with coupons ﬁdelity rewards.

Bandit problems with fidelity rewards

5.3 Adversarial rewards

Equipped with our understanding of the class of minimal suﬃcient sets and having derived a small-regret strategy for the stochastic version of the problem, it is now easy to design strategies that work well in the adversarial case. The main take-home message from the previous sections is that (a) single-arm strategies are almost optimal in the presence of coupon rewards; (b) playing a standard bandit strategy based on the augmented rewards Xt,j+rj/ρj is nearly optimal. We show here that the same principles work in the adversarial case as well, and lead to similar near optimal regret bounds.

Theorem 11 Consider the adversarial loyalty-points bandit problem with coupon rewards. There exists a randomized policy π whose expected weak, mean and strong regret satisfy

ER T (π) ER T (π) 4 p

TK log K + K ,

and ER T (π) 4 p

TK log K + ρ + K .

Proof The argument for bounding the mean regret is similar to that of Theorem 10. Suppose that one plays a standard (randomized) adversarial bandit strategy based on the augmented rewards b Yt,j = Xt,j + rj

ρj . Denote by Jt the arm selected by the strategy at time

t and by b Nj = PT t=1 1{Jt=j} the number of times the strategy plays arm j up to time T. Recall also that bµT,j = 1

T PT t=1 Xj,t denotes the empirical average of the base rewards of arm j. Assume that the strategy comes with the regret guarantee

t=1 b Yt,Jt max j K

t=1 b Yt,Jt ϵT .

For example, by using the exp3 algorithm of Auer et al. (1995), one may take ϵT = 4 TK log K (see, e.g., (Lattimore and Szepesv ari, 2020, Theorem 11.1)). Then the expected cumulative reward of this strategy equals

t=1 Xt,Jt +

t=1 Xt,Jt +

j=1 rj b Nj

t=1 b Yt,Jt K

= T max j K

= max N σ(bµT , N) K ϵT ,

Lugosi, Pike-Burke, Savalle

proving the bound for the mean regret. To prove the upper bound for the strong regret, notice that by Lemma 9,

max J Cf max j J St(j) max j J0 St(j)+ρ = max j [K]

+ρ max j [K]

But we have already proved above that

t=1 Xt,Jt +

concluding the proof.

6. Subscription model: increasing ﬁdelity rewards

In the rest of the paper we discuss what we call the subscription model. Recall from the introduction that in this model the ﬁdelity rewards depend on the current number of consecutive plays of the selected arm as opposed to the loyalty points models where the ﬁdelity rewards are a function of the total number of times the selected arm has been played in the past. Just like in the case of the loyalty points model, here we also consider three types of ﬁdelity rewards: increasing, decreasing, and coupon rewards. We begin the discussion by considering increasing ﬁdelity rewards. More precisely, we assume that the ﬁdelity reward is a nondecreasing function of the number of consecutive plays of an arm prior to the current time step. Recall the deﬁnition Qt,j = 1{Jt=j}(t max{s t : Js = j, Js 1 = j}) as the number of consecutive plays of arm j up to time t and assume that the reward is of the form Yt,j = Xt,j + fj(Qt,j), where Xt,j is the base reward and fj is the (known) ﬁdelity function associated with arm j that is assumed to be nondecreasing. We may assume, without loss of generality, that fj(0) = 0 for all j [K].

6.1 Minimal suﬃcient sets

It is clear that for the stochastic subscription bandit problem with increasing ﬁdelity rewards, there is always an optimal policy that is a single-arm policy. Indeed, a single-arm policy that plays an arm j argmax1 j K{µj T + Fj(T)} for all T time steps maximizes total reward. (Recall that Fj(t) = Pt s=1 fj(t) denotes the cumulative ﬁdelity reward of playing arm j during t consecutive steps.) Hence, the class of minimal and suﬃcient sets contains the single set J0 of single-arm strategies.

6.2 Stochastic rewards

We know that in the stochastic bandit problem with increasing ﬁdelity rewards in the subscription model, the reward of the optimal strategy is max1 j K{µj T +Fj(T)}, achieved by playing a single arm j for all T rounds. Since the optimal policy is the same as in the standard stochastic bandits problem, one may imagine that similar learning algorithms can be applied here. However, note that

Bandit problems with fidelity rewards

the usual regret-minimizing strategies (such as ucb) keep exploring all through their run. This involves switching to sub-optimal arms in order to make sure that unlikely events of good arms collecting small rewards get detected. Such strategies are not suitable for the subscription model under increasing ﬁdelity rewards, since a single switch of arms resets the count Qt,j to zero, resulting in a signiﬁcant loss of ﬁdelity rewards. In particular, any strategy that switches an arm after time ϵT (for a constant ϵ (0, 1)) must suﬀer a regret of at least Fj (ϵT) which is of order Ω(T) except in trivial cases. However, an expected reward that is not much smaller than that of the optimal strategy may be achieved by any strategy that is able to identify an arm achieving the optimum early on. Indeed, strategies designed for best-arm identiﬁcation may be used directly to achieve sublinear expected regret in the stochastic subscription bandits problem with decreasing ﬁdelity reward. In best-arm identiﬁcation problems, the goal of the learner is to learn the identity of the arm with the highest expected reward. Instead of reviewing the growing literature of best-arm identiﬁcation, we show how such a simple strategy may be adopted, in a straightforward manner, to obtain strategies with sublinear regret. We refer the reader to the book by Lattimore and Szepesv ari (2020) for an exhaustive survey. The notion of simple regret proves to be useful in our approach. To deﬁne the simple regret, we ﬁrst deﬁne the sub-optimality gap of arm j as e j = µj + Fj (T)/T (µj + Fj(T)/T) where j = argmax1 j K(µj + Fj(T)/T). (Note that in the case where the ﬁdelity functions are the same across all arms, e j = µj µj = j.) The simple regret of the best-arm identiﬁcation strategy outputting arm J is deﬁned as

Suppose that one plays a best-arm identiﬁcation strategy during the ﬁrst t0 < T rounds, besed on the ﬁdelity-augmented rewards Xt,j + Fj(T)/T, at the end of which the algorithm identiﬁes an arm J [K]. For the rest of the rounds t {t0 + 1, . . . , T}, the strategy plays arm J, independently of the rewards. During these T t0 steps, the strategy accumulates a total expected reward of E[(T t0)µJ + FJ(T t0)]. The expected regret of the above-deﬁned strategy may be bounded in terms of the simple regret, since

E[RT ] Tµj + Fj (T) E [(T t0)µJ + FJ(T t0)]

Tµj + Fj (T) E [TµJ + FJ(T)] + 2t0 = 2TEe J + 2t0 .

The simplest best-arm identiﬁcation strategy samples all K arms t0/K times and outputs an arm that maximizes the empirical reward accumulated over this time. This simple strategy achieves an expected simple regret bounded by C p

(K log K)/t0 for a universal constant C (see Lattimore and Szepesv ari (2020)). Substituting this into the upper bound above and choosing t0 T 2/3(K log K)1/3, we obtain the following.

Theorem 12 Consider the stochastic bandit problem in the subscription model with increasing ﬁdelity rewards. There exists a strategy of play whose expected regret satisﬁes

E[RT ] CT 2/3(K log K)1/3 ,

where C is a numerical constant.

Lugosi, Pike-Burke, Savalle

Of course, the naive algorithm used here is not optimal and more sophisticated best-arm strategies may be used to improve the regret bound. In particular, one may get a bound that is logarithmic in T but depends on the sub-optimality gaps e j. However, we conjecture that even an improved strategy will not be able to achieve regret of order

KT since the ﬁdelity functions can be chosen such that any algorithm which explores suﬃciently must pay in terms of the ﬁdelity rewards, much like in the increasing loyalty points model in Section 3.2. Moreover, the use of such improved strategies involves the same key idea and we prefer to present the idea here in its simplest form.

6.3 Adversarial rewards

Next we consider the adversarial version of the problem. As noted in Section 6.1, the only class of minimal and suﬃcient sets contains the single set J0 of single-arm strategies. This implies that the notions of weak and strong regrets coincide as the learner simply competes with the best single-arm strategy. It is easy to see that one cannot expect to achieve sublinear regret in general. In fact, for any sequence j = (j1, . . . , j T ) of actions, the corresponding cumulative reward ST (j) is at most the analogous cumulative reward in the loyalty-points model. To see this, note that every arm switch in j reduces the cumulative reward in the subscription model more than in the loyalty points model. Hence, minimizing regret with respect to the class of single-arm strategies is at least as hard in the subscription model as in the loyalty-points model. In particular, Theorem 4 immediately implies the following:

Theorem 13 Consider the adversarial bandits problem in the subscription model with increasing ﬁdelity rewards, where the ﬁdelity function is the same for every arm. Deﬁne δ = f(7T/8) f(T/8). Then for every policy π there exists a sequence of rewards such that the regret satisﬁes

The theorem shows that if the ﬁdelity function grows substantially between (say) T/8 and 7T/8, one cannot expect sublinear regret. If this is not the case, sublinear regret may be possible, depending on the ﬁdelity function. In the rest of the section we take a closer look at the special but important case when the ﬁdelity reward follows a step function. More precisely, consider the case where the ﬁdelity functions take the form

fj(t) = 1 if t m 0 if t < m

for all j [K] and for some positive integer m T. We refer to fj(t) as an m-step function. Theorem 13 shows that when m T/8 (or more generally, when m = Ω(T)), sublinear regret cannot be guaranteed. However, for small values of m, small regret is achievable. To see why, consider ﬁrst the extremal case when m = 1. In this case, the only time when the player does not receive a ﬁdelity reward is when the player switches actions. Hence, the problem is equivalent to a bandit problem with switching costs that has been studied in the adversarial setting by Dekel, Ding, Koren, and Peres (Dekel et al., 2014) and Cesa-Bianchi, Dekel, and Shamir (Cesa-Bianchi et al., 2013). In particular, Theorem 1 of

Bandit problems with fidelity rewards

Dekel et al. (2014), shows that for every randomized strategy in the bandits with switching costs problem, there exists a sequence of rewards such that the regret is Ω(T 2/3K1/3) for K T. This lower bound is, in fact, achievable. In fact, an easy batch modiﬁcation of the EXP3 algorithm shows that regret of the order of O(T 2/3K1/3m1/3) is achievable when the ﬁdelity reward is an m-step function. This is stated in the next theorem.

Theorem 14 Consider the adversarial ﬁdelity bandits problem in the subscription model with the m-step ﬁdelity function. There exists a randomized strategy whose expected (weak and strong) regret satisﬁes

E[RT ] = O((K log K)1/3T 2/3m1/3) .

In particular, sublinear regret is achievable whenever m = o(T), hence characterizing the range of values of m for which sublinear regret is achievable.

7. Subscription model: decreasing ﬁdelity rewards

We continue the study of the subscription model in the case when the ﬁdelity rewards fj are nonincreasing functions of Qt,j = 1{Jt=j}(t max{s t : Js = j, Js 1 = j}), the number of consecutive plays of arm j prior to the current time step. This case is substantially diﬀerent from the case of increasing ﬁdelity rewards studied in the previous section. As always, we start by determining the class of minimal and suﬃcient sets that allow us to design regret minimization strategies in the stochastic model and to deﬁne regret in the adversarial case.

7.1 Minimal suﬃcient sets

In order to deﬁne the minimal suﬃcient sets for given nonincreasing ﬁdelity functions f1, . . . , f K, we need to characterize the sequences j [K]T that can have a maximal expected reward in the stochastic bandit problem for some values of the expected rewards µ1, . . . , µK of the K arms. To this end, note that for ﬁxed µ1, . . . , µK, it is optimal to play the arm j with the largest expected initial reward, that is, j = argmax1 j K{µj +fj(0)}, repeatedly until time m [1, T] when µj + fj (m + 1) < maxj =j {µj + fj(0)}, that is until its expected reward drops under the initial expected reward of the second best arm. An arm j (2) argmaxj =j µj is then played once to allow for the Q value of the optimal arm to reset and then arm j is played again for m steps. This is then repeated in every period of m + 1 steps. Each such sequence j is indexed by a triplet (i, k, m) [K] [K] [T 1], corresponding to periodically playing arm i m N times before arm k is played once and then repeating it. Denote this sequence by

j(i, k, m) = (i, . . . , i | {z } m times

, k, i, . . . , i | {z } m times

Moreover, if the ﬁdelity functions fj are strictly decreasing, then every such sequence is the only optimal sequence for some values of µ1, . . . , µK. Hence, in such cases, there is only one

Lugosi, Pike-Burke, Savalle

minimal suﬃcient set, containing all sequences of the form j(i, k, m), that is,

J = {j(i, k, m) : (i, k, m) [K] [K] [T 1]} .

Moreover, in all cases, the minimal suﬃcient set is unique and it is a subset of J deﬁned above.

7.2 Stochastic rewards

In this section we show that one may achieve sublinear regret in the stochastic version of the problem. As it is shown in the previous section, the goal of the learner is to achieve an expected reward that is not much smaller than the expected reward corresponding to playing the best sequence of the type j(i, k, m), that is,

max (i,k,m) [K] [K] [T 1] WT (µi, µk, m) (9)

WT (µi, µk, m) def. = T T m + 1

µi + T m + 1

(Fi(m) + Fk(1)) + Fi

T (m + 1) T m + 1

is the expected reward of playing arm i m times, then arm j once and repeating this until horizon T in a subscription bandits problem where the expected base rewards are given by µ1, . . . , µK. We denote by (i , k , m ) a triple achieving the maximum in (9). The proposed strategy is similar to the one of Section 6.2 in that the ﬁrst t0 rounds of the game are used purely for exploration. After time t0, the strategy commits to a triplet (i, k, m) and plays accordingly. More precisely, up to time t0, the strategy samples each arm t0/K times and computes the empirical average of the observed base reward of each arm. Denote these estimates by bµ1, . . . , bµK. By a simple use of Hoeﬀding s inequality, we have

E max j [K] |bµj µj|

Let ˆi and ˆk denote the arms corresponding to the two highest values of bµi + fi(0) such that bµˆi +fˆi(0) bµˆk +fˆk(0). (In case of ties, we may choose arbitrarily.) Moreover, let bm denote the corresponding period length, that is,

bm = min m : bµˆi + fˆi(m + 1) < bµˆk + fˆk(0) .

Equivalently, (ˆi, ˆk, ˆm) = argmax i,j,m WT (bµi, bµk, m) . (12)

Starting from time t0 + 1, until time T, the strategy plays according to the sequence j(ˆi, ˆk, bm).

Bandit problems with fidelity rewards

Theorem 15 Consider the stochastic bandit problem in the subscription model with decreasing ﬁdelity rewards. The expected regret of the explore-then-commit strategy deﬁned above with t0 = (2T)2/3(K log K)1/3 is bounded by

E[RT ] 3T 2/3(K log K)1/3 .

Proof Conditionally on the rewards observed during the ﬁrst t0 periods of play, the expected reward of the proposed strategy for the remaining T t0 rounds equals

WT t0(µˆi, µˆk, bm)

WT (µˆi, µˆk, bm) 2t0 (since all rewards are bounded by 2)

WT (bµˆi, bµˆk, bm) 2t0 T max j [K] |bµj µj|

WT (bµi , bµk , m ) 2t0 T max j [K] |bµj µj| (by (12))

WT (µi , µk , m ) 2t0 2T max j [K] |bµj µj| .

Taking expected values and using (11), we get that

E[RT ] WT (µi , µk , m ) EWT t0(µˆi, µˆk, bm) 2t0 + 2T

Choosing t0 = T 2/3(K log K)1/3/2, we obtain the stated bound.

We remark that, just like in the case of increasing ﬁdelity rewards, one may also design strategies that yield regret bounds that have a logarithmic dependence on the time horizon T. For example, one may use successive elimination strategies in the exploration phase to identify the two arms i and k played by the optimal strategy. The price to pay is that the regret bound depends on the means µj of the arms. In particular, they depend on the gap between the expected reward of the best and second best arm, as well as on the gap between the expected reward of the second and the third best arms. In order to keep the discussion simple, we omit these regret bounds from this paper. We also note that there is currently no known lower bound for the stochastic subscription model with decreasing ﬁdelity rewards, so determining the optimal dependence on T and K in this problem remains an open question.

7.3 Adversarial rewards

In this section we show that in the adversarial bandit problem in the subscription model with decreasing ﬁdelity rewards, it is possible to achieve sublinear regret. Recall from Section 7.1 that in this model, there is only one minimal suﬃcient set, and in all cases it is a subset of the class J containing all sequences of the type j(i, k, m) for (i, k, m) [K] [K] [T 1]. Hence, weak and strong regret of a policy π coincide and, recalling (3), are bounded as

RT (π) max (i,k,m) [K] [K] [T 1] ST (j(i, k, m)) e ST (π) .

Lugosi, Pike-Burke, Savalle

In other words, our goal is to design a strategy whose expected cumulative reward is close to max(i,k,m) ST (j(i, k, m)). Our approach is to treat each triple (i, k, m) as an expert that plays according to the sequence j(i, k, m) and design a strategy that competes with the best expert. To this end, we use a variant of the exponentially weighted average forecaster that assigns a weight to each expert, based on an exponential function of its estimated cumulative reward, and selects a random expert with probability proportional to its weight. A diﬃculty in the subscription bandits problem is that the reward of the strategy depends not only on the reward of the chosen expert but also on the number of consecutive plays of the selected arm. Therefore the reward an expert would gain from playing an arm in a given round may diﬀer from the actual reward the learner receives. In order to keep such deviations under control, we draw inspiration from the lazy label eﬃcient exponentially weighted average forecaster proposed by Cesa-Bianchi, Lugosi, and Stoltz (Cesa-Bianchi et al., 2004). The forecaster, at each time instance, ﬂips a biased coin that comes up heads with probability ϵ, where ϵ is a small, appropriately chosen value. When the coin ﬂip results in heads, the forecaster selects a random arm, plays it, and updates the estimated cumulative reward of each expert. Then the algorithm selects a random expert according to the exponentially weighted average distribution over the experts, and plays according to the sequence corresponding to the selected expert until the next time when the coin ﬂip comes up heads. This lazy play guarantees that the strategy switches between experts at most about ϵT times, making the possibly adverse eﬀect of switching between experts negligible. More precisely, at every time t {2, . . . , T}, an independent Bernoulli random variable Zt with P{Zt = 1} = 1 P{Zt = 0} = ϵ is drawn. If Zt = 1, then the strategy simply explores by selecting an arm Jt chosen uniformly at random. (We deﬁne Z1 = 1 to ensure that the strategy explores in the ﬁrst step.) These time instances are used to estimate the cumulative reward of each expert. In particular, for an expert (i, k, m) [K] [K] [T 1], let jt(i, k, m) [K] denote the arm played by the expert at time t, that is, the t-th element of the sequence j(i, k, m) as deﬁned in (8). The cumulative reward of the expert, at time t, is

St(j(i, k, m)) =

Xs,js(i,k,m) + Φ(i, k, m) ,

where we deﬁne

Φ(i, k, m) = 1

(Fi(m) + Fk(1)) + Fi

T (m + 1) T m + 1

as the normalized total ﬁdelity reward received by the expert. St(j(i, k, m)) is estimated by

b St(i, k, m) =

s=1 b Ys,js(i,k,m)

where b Ys,js(i,k,m) def. = 2 K

ϵ 1Zs=11Js=js(i,k,m)(2 Xs,js(i,k,m) Φ(i, k, m))

Bandit problems with fidelity rewards

is the estimated reward of expert (i, k, m) at time s, where Js indicates the arm played by the learner in round s. Note that the factor K/ϵ guarantees that Eb Ys,(i,k,m) = Xs,js(i,k,m) + Φ(i, k, m) and therefore b St(i, k, m) is an unbiased estimator of the cumulative total reward of the expert. It is convenient to introduce the estimated losses

bℓs,(i,k,m) def. = 2 b Ys,(i,k,m) = K

ϵ 1Zs=11Js=js(i,k,m)(2 Xs,js(i,k,m) Φ(i, k, m)) .

Note that bℓs,(i,k,m) is always nonnegative, a property that is crucial for the proof below. In the time instance immediately following an exploration step unless the Benoulli random variable again equals 1 , the learner selects an expert (It, Kt, Mt) [K] [K] [T 1] at random, based on the exponentially weighted average distribution. More precisely,

Pt 1 {(It, Kt, Mt) = (i, k, m)} = pt,(i,k,m) def. = wt 1,(i,k,m)

where Pt 1 denotes conditional probability given the past and

wt,(i,k,m) = exp(η b St(j(i, k, m)))

(i,k,m) [K] [K] [T 1] exp(η b St(j(i, k, m))) .

Then the learner follows the expert (It, Kt, Mt) until the next time the Bernoulli variable takes value 1. More precisely, if for some t, Zt = 1 and t = min{s > t : Zs = 1}, then for all time instances s {t, . . . , t 1}, the learner plays the arms according to positions t, . . . , t 1 of the sequence j(It, Kt, Mt), as deﬁned in (8). The algorithm is described in detail in Figure 2. The next theorem establishes sublinear regret for the performance of the proposed strategy. In particular, it shows that the expected regret grows at a rate at most e O(T 2/3). As in the stochastic setting, it remains an open problem to determine if this rate is optimal.

Theorem 16 Consider the adversarial ﬁdelity bandits problem in the subscription model with nonincreasing ﬁdelity rewards. Then, choosing the parameters of the algorithm as η = log(K2(T 1))/(2TK) 2/3 and ϵ = K η, the expected (weak and strong) regret of the strategy deﬁne above satisﬁes

ERT 3(2TK)2/3 log1/3 K2(T 1) .

Proof Let J = (J1, . . . , JT ) [K]T denote the (random) sequence of arms played by the strategy. The strategy is deﬁned such that, except for the times when Zt = 1, the strategy follows the expert (It, Kt, Mt), that is, Jt = jt(It, Kt, Mt). Deﬁne the pseudo reward of the strategy by

t=1 (Xt,Jt + Φ(It, Kt, Mt)) .

Lugosi, Pike-Burke, Savalle

Input: T (horizon), ϵ (0, 1) (sampling probability), η > 0 (learning rate). Initialize: w0,(i,k,m) = 1.

1: Sample T independent Bernoulli(ϵ) random variables Z1, . . . , ZT .

2: Set Z1 = 1

3: for all t = 1, . . . , T do

4: if Zt = 1, then

5: Select a uniformly random arm Jt and observe Xt,Jt;

6: Compute wt,(i,k,m) = exp(η b St(j(i, k, m))) (i, k, m) [K]2 [T 1];

8: if Zt = 0, then

9: if Zt 1 = 1 then

10: Draw a random expert (It, Kt, Mt) according to the distribution pt,(i,k,m) = wt 1,(i,k,m)/Wt 1 and play according to the t-th position of the sequence j(It, Kt, Mt);

12: Play according to the same expert as at time t 1.

15: end for

Figure 2: Lazy strategy for adversarial bandits with decreasing ﬁdelity rewards, subscription model

The key observation is that, since the ﬁdelity rewards are decreasing, the realized total reward of the strategy is at least

t=1 (Xt,Jt + f Jt(Qt 1,Jt)) ST 2

t=1 1{Zt=1} ,

since PT t=1 1{Zt=1} is the total number of times the strategy switches experts and the rewards are bounded in [0, 2]. Hence, it suﬃces to show that the total expected pseudo reward EST is not much smaller than

max (i,k,m) ST (j(i, k, m)) = max (i,k,m)

Xt,jt(i,k,m) + Φ(i, k, m) .

As it is customary in the analysis of exponentially weighted average forecasters (see Cesa Bianchi and Lugosi (2006)), we compare upper and lower bounds for the weight Wt. On the one hand,

(i,k,m) exp(η b ST (j(i, k, m)))

log(K2(T 1))

η max (i,k,m) b ST (j(i, k, m)) log(K2(T 1)) .

Bandit problems with fidelity rewards

On the other hand, log(WT /W0) = PT t=1 log(Wt/Wt 1) and for each t [T], we have

(i,k,m) pt,(i,k,m) exp(η b Yt(i, k, m))

(i,k,m) pt,(i,k,m) exp( ηbℓt,(i,k,m))

(i,k,m) pt,(i,k,m)

1 ηbℓt,(i,k,m) + η2

2 bℓ2 t,(i,k,m)

(using e x 1 x + x2/2 for x 0)

(i,k,m) pt,(i,k,m)bℓt,(i,k,m) + 2η2K

(i,k,m) pt,(i,k,m)bℓt,(i,k,m)

(using log(1 + x) x for all x 1 and bℓt,(i,k,m) 2K/ϵ)

(i,k,m) pt,(i,k,m) b Yt,(i,k,m) + 2η2K

(i,k,m) pt,(i,k,m)bℓt,(i,k,m)

(by deﬁnition of bℓs,(i,k,m) = 2 b Ys,(i,k,m))

Comparing the upper and lower bounds obtained for log(WT /W0), we have that for all (i , k , m ) [K] [K] [T 1],

(i,k,m) pt,(i,k,m) b Yt,(i,k,m) b ST (j(i , k , m ))

log(K2(T 1))

(i,k,m) pt,(i,k,m)bℓt,(i,k,m) .

Taking expected values on both sides and noting that

(i,k,m) pt,(i,k,m) b Yt,(i,k,m) = EST ,

EST max (i,k,m) ST (j(i, k, m)) log(K2(T 1))

and therefore

ERT log(K2(T 1))

Lugosi, Pike-Burke, Savalle

Optimizing the upper bound suggests the choices

ϵ = K η and η = log(K2(T 1))

yielding the announced upper bound.

8. Subscription model: coupon rewards

We ﬁnally consider the subscription model with so-called coupons ﬁdelity rewards. Here the player earns an additional reward rj [0, 1] after each consecutive ρj N plays of an arm j. As we will see, this setting is similar to the loyalty points model with coupon rewards which was studied in Section 5. However, a key diﬀerence is that here, in order to get any additional ﬁdelity reward, an arm needs to be played ρj times consecutively.

8.1 Minimal suﬃcient sets

We begin by considering the class of minimal suﬃcient sets for the subscription model with coupon rewards. For this, we note that due to the need to be playing an arm continuously in order to receive any ﬁdelity reward, any optimal sequence j [K]T will mainly consist of consecutive instances of any arm, where the length of these segments is determined by ρj, the period of arm j. Moreover, it can easily be seen that if ρj = ρ for all arms j and T is divisible by ρ, then the optimal strategy is to play the arm maximizing µj + rj/ρ for all rounds t = 1, . . . , T, and thus the minimal suﬃcient sets are the single arm strategies J0. In the more general case, the minimal suﬃcient sets may not simply be single arm strategies. However, by an argument similar to the one in the loyalty points setting, we can show that the total reward accumulated by the best single arm strategy is not too far from that of the best minimal suﬃcient set. In particular,

Lemma 17 Let ρ denote the least common multiple of the period lengths ρ1, . . . , ρK. Then,

max j J Cf J ST (j) max j J0 St(j ) 2ρ .

Therefore, as in the coupons loyalty points model, we compare our algorithms to single arm strategies knowing that the maximal reward of such strategies does not diﬀer too much from the true maximal reward.

8.2 Stochastic rewards

From the above discussion, it is clear that to gain the additional ﬁdelity reward, each time an arm j is played, it should be played for ρj rounds consecutively. Since the ﬁdelity functions are known, so are the ρj. Therefore, we consider algorithms that run in batches, that is, they select an arm j using all the available data up to that point, then play that arm ρj times before updating the estimates and selecting another arm. While in principle, any standard algorithm for the multi-armed bandit problem could be used to select the arms at the beginning of each batch, for simplicity, here we consider a batch version of

Bandit problems with fidelity rewards

the UCB algorithm (Auer et al., 2002a). We aim to accumulate rewards comparable to T maxj [K](µj + rj/ρj) and so we use conﬁdence bounds of the form.

UCBt(j) = Xt,j + rj

Nt,j , (13)

where Xt,j is the average base reward from Nt,j plays of arm j up to time t. Note that each arm j is played ρj times when it is selected. This leads to the following regret bound, the complete proof of which is given in Appendix B.1.

Theorem 18 The expected regret of the Batch-UCB algorithm up to horizon T in the stochastic coupons subscription model can be bounded by

j =j ρj e j + (1 + 2

j =j e j + 2ρ

where e j = maxi [K](µi + ri/ρi) (µj + rj/ρj). So the worst case regret bound is E[RT ] = O( p

KT log(T) + PK j=1 ρj + ρ).

The above regret bounds exhibit the expected optimal leading order terms. However, it is still an open question to determine whether the dependence on the period lengths ρj can be improved.

8.3 Adversarial rewards

Since we know that the best single arm strategy is almost optimal in the subscription coupons model, we aim to develop an algorithm that is competitive with that. Unlike in the loyalty points model with coupon rewards where we could simply use a standard adversarial bandits algorithm with augmented rewards, the subscription structure of the ﬁdelity rewards here means that we need to be a bit more careful to guarantee that the augmented rewards are actually similar to the accumulated rewards. Speciﬁcally, this means that we need to play in batches and play each arm j some multiple of ρj times whenever it is selected. For example, we may select arms using the EXP3 (Auer et al., 2002b) algorithm and then play them ρ times, where ρ is again the least common multiple of the period lengths ρ1, . . . , ρK. This gives the following regret bound,

Theorem 19 Selecting arms according to the EXP3 algorithm with parameter λ = min{1, q

(e 1)T } and playing each arm j ρ times when selected in the adversarial coupons subscription ﬁdelity bandits model leads to regret of

E[R T ] = e O( p

ρKT log(K)) ,

where ρ is again the least common multiple of the period lengths ρ1, . . . , ρK.

Proof The proof follows by considering an adversarial bandits problem over S = T

ρ rounds. In each round s = 1, . . . , S, the algorithm selects an arm Js using the EXP3 algorithm and receives reward

P t Ts Xt,j+rj

2ρ [0, 1] where Ts is such that |Ts| = ρ and represents

Lugosi, Pike-Burke, Savalle

the set of time points t = ρs + 1, . . . , ρ(s + 1) in the ﬁdelity bandits problem where we are playing arm j ρ times consecutively. Let RS denote the regret in this modiﬁed problem

compared to playing arm j = argmax1 j K

PT t=1 Xt,j+Srj

ρ , which we note is the same arm as the best single arm strategy in the original problem. By Lemma 17, RT 2ρRS + ρ, so it suﬃces to provide a bound on RS. This can be done by applying the analysis of EXP3 in

Corollary 3.2 of (Auer et al., 2002b). In particular, for parameter λ = min{1, q

E[RS] = O( q

ρ ), thus giving the result.

Remark 20 Note that when ρ1 = = ρK = ρ, then the regret is simply e O( ρKT). However, in other cases, depending on the periodicities, ρ can be signiﬁcantly larger than any single ρj. By a simple modiﬁcation of the argument, the regret bound can be improved to e O( maxj ρj

KT) by considering a variant of EXP3 that plays each arm j exactly ρj

times when it is selected, using the pseudo-rewards

P t Ts Xt,j+rj

2 maxj ρj [0, 1] and S T minj ρj rounds. However, to avoid some technicalities arising from the fact that the number of rounds becomes random, we chose to present the simplest version. It is also an open question to determine the optimal dependence on the periods ρ1, . . . , ρK.

Remark 21 The above result assumes that ρj is constant for all j [K]. Clearly if ρj are such that ρ = Ω(T), the bound in Theorem 19 becomes vacuous. In such a case, one may simply ignore the ﬁdelity reward of any arm with ρj = Ω(T) as it is not received often enough to make a signiﬁcant diﬀerence to the cumulative reward. In fact, for ρj = Ω(T 1/4) it is better to ignore the ﬁdelity rewards. We omit the straightforward details.

Remark 22 From Theorems 18 and 19, we observe that the cost of playing in batches is additive in the stochastic case, and multiplicative in the adversarial setting. This is to be expected since playing in batches can almost be viewed as a form of delayed feedback. It is known that the penalty for receiving delayed feedback is multiplicative in the adversarial setting (see, e.g., Joulani et al. (2013)), so our results are consistent with this.

9. Conclusion

In this paper we have studied several instances of the ﬁdelity bandits problem where the reward of each arm is augmented by a ﬁdelity reward which measures how loyal the player has been to that arm in the past. Our focus has been on the settings with adversarially generated base rewards, although we also analyze the stochastic version of the problem for completeness. For this problem, the deﬁnition of the regret when the rewards are adversarial is nontrivial. By considering the stochastic analogue of the problem we suggest several natural regret deﬁnitions. In particular, we deﬁne the regret with respect to a class of policies which may be optimal for some conﬁguration of the stochastic problem, a technique which we believe may be applicable in many settings beyond the ﬁdelity bandits problem studied here. Our main interest in this paper was to determine when it is possible for these regrets to be sub-linear. We considered two forms of ﬁdelity reward, namely the

Bandit problems with fidelity rewards

loyalty points model where the ﬁdelity reward depends on the number of times of an arm was previously played, and the subscription model where the ﬁdelity reward is a function of the number of consecutive plays of an arm up to that point. Our ﬁndings show that learning with adversarial rewards and increasing ﬁdelity functions is hard, with Ω(T) bounds presented for even the weakest regret deﬁnition in both the loyalty points and subscription model. However, these results are worst case, so it remains a possibility that for speciﬁc ﬁdelity functions sublinear regret could be achieved (see, e.g., the results for m-step functions in Section 6.3). For decreasing ﬁdelity rewards, the picture is more positive. Although we provide Ω(T) lower bounds for the strongest notions of regret in the loyalty points model, we show that for weaker notions of regret, it is possible to achieve sub-linear regret in this case. This means that we are able to perform comparably to some near-optimal sequence of actions. For the subscription model, we show that sublinear regret is possible for even the strong regret when the ﬁdelity rewards are decreasing. The reason for this distinction is that in the subscription model, due to the fact that the ﬁdelity reward depends on the number of consecutive plays, there is only one minimal suﬃcient set, simplifying the class of comparator policies for the strong regret. Lastly, we note that in the coupons rewards setting, where a bonus reward is obtained every ρj (consecutive) plays of arm j, it is possible to get sub-linear strong regret in both the loyalty points and subscription models with adversarial base rewards. Interestingly, in this case, the eﬀects of the ﬁdelity rewards are diﬀerent in the subscription and loyalty points models, with the regret scaling mutliplicatively with the periodicity of the coupon function in the subscription case, while it only increases additively in the loyalty points model. Although we have considered several forms of ﬁdelity reward in this paper, there are still other realistic assumptions one could make on the ﬁdelity reward. For example, it may be interesting to consider more general periodic ﬁdelity functions, or other measures of ﬁdelity such as the number of times the arm has been played in the last m plays. This is left for future work. We also point out that throughout the paper we have assumed that the entire ﬁdelity functions are known to the forecaster, although typically the algorithms only require knowledge of the cumulative ﬁdelity reward Fj(T). It remains to be seen if this can be reduced to requiring no knowledge of the ﬁdelity reward. A related problem is whether one can remove the need to know the horizon T and develop anytime algorithms for ﬁdelity bandits problems. A key challenge here is that the optimal sequence of arms typically depends on the horizon so standard techniques such as the doubling trick (see e.g. Besson and Kaufmann (2018)) may not be applicable. For example, in the subscription model with increasing ﬁdelity rewards, since we lose any ﬁdelity reward accrued when we switch arms, it is possible to construct ﬁdelity functions such that even if we played the best arm for an entire phase we may not get enough ﬁdelity reward, so the doubling trick could lead to linear regret. By assuming that the average ﬁdelity reward is stationary over time or independent of T, one may be able to adapt our results to the anytime setting, but we leave this to future work. Lastly, we note that our aim in this paper was to understand when it is possible to achieve sub-linear expected regret in the ﬁdelity bandits problem. As such, some constants or rates (particularly in the stochastic reward case) may not be tight. Therefore, tightening them and/or providing matching lower bounds remains an interesting open problem. We have also focused on providing bounds on the expected regret. We imagine that our analysis

Lugosi, Pike-Burke, Savalle

for upper bounds in the stochastic setting can be easily extended to obtain high probability bounds. On the other hand, obtaining high-probability upper bounds on the regret in the adversarial setting is less straightforward and we leave this topic for future research.

Acknowledgments

G abor Lugosi acknowledges the support of Ayudas Fundaci on BBVA a Proyectos de Investigaci on Cient ıﬁca 2021 and of the Spanish grant PID2022-138268NB-I00, ﬁnanced by MCIN/AEI/10.13039/501100011033, FSE+MTM2015-67304-P, and FEDER, EU. This work was started while Ciara Pike-Burke was at Universitat Pompeu Fabra and Barcelona Graduate School of Economics.

Peter Auer, Nicol o Cesa-Bianchi, Yoav Freund, and Robert E Schapire. Gambling in a rigged casino: The adversarial multi-armed bandit problem. In Proceedings of 36th Annual Symposium on Foundations of Computer Science, pages 322 331. IEEE, 1995.

Peter Auer, Nicol o Cesa-Bianchi, and Paul Fischer. Finite-time analysis of the multiarmed bandit problem. Machine learning, 47(2-3):235 256, 2002a.

Peter Auer, Nicol o Cesa-Bianchi, Yoav Freund, and Robert E Schapire. The nonstochastic multiarmed bandit problem. SIAM journal on computing, 32(1):48 77, 2002b.

Soumya Basu, Rajat Sen, Sujay Sanghavi, and Sanjay Shakkottai. Blocking bandits. In Advances in Neural Information Processing Systems, pages 4785 4794, 2019.

Omar Besbes, Yonatan Gur, and Assaf Zeevi. Stochastic multi-armed-bandit problem with non-stationary rewards. In Advances in neural information processing systems, pages 199 207, 2014.

Lilian Besson and Emilie Kaufmann. What doubling tricks can and can t do for multi-armed bandits. ar Xiv preprint ar Xiv:1803.06971, 2018.

Djallel Bouneﬀouf and Raphael F eraud. Multi-armed bandit problem with known trend. Neurocomputing, 205:16 21, 2016.

S ebastien Bubeck and Nicol o Cesa-Bianchi. Regret analysis of stochastic and nonstochastic multi-armed bandit problems. Foundations and Trends in Machine Learning, 5(1):1 122, 2012.

S ebastien Bubeck, Yin Tat Lee, and Ronen Eldan. Kernel-based methods for bandit convex optimization. In Proceedings of the 49th Annual ACM SIGACT Symposium on Theory of Computing, pages 72 85, 2017.

Leonardo Cella and Nicol o Cesa-Bianchi. Stochastic bandits with delay-dependent payoﬀs. In International Conference on Artiﬁcial Intelligence and Statistics, pages 1168 1177. PMLR, 2020.

Bandit problems with fidelity rewards

Nicol o Cesa-Bianchi and G abor Lugosi. Prediction, Learning, and Games. Cambridge University Press, New York, 2006.

Nicol o Cesa-Bianchi, G abor Lugosi, and Gilles Stoltz. Minimizing regret with label eﬃcient prediction. IEEE Transactions on Information Theory, 51:2152 2162, 2004.

Nicol o Cesa-Bianchi, Ofer Dekel, and Ohad Shamir. Online learning with switching costs and other adaptive adversaries. In Advances in Neural Information Processing Systems, pages 1160 1168, 2013.

Corinna Cortes, Giulia De Salvo, Vitaly Kuznetsov, Mehryar Mohri, and Scott Yang. Discrepancy-based algorithms for non-stationary rested bandits. ar Xiv preprint ar Xiv:1710.10657, 2017.

Ofer Dekel, Jian Ding, Tomer Koren, and Yuval Peres. Bandits with switching costs: T 2/3 regret. In Proceedings of the forty-sixth annual ACM symposium on Theory of computing, pages 459 467. ACM, 2014.

Eyal Even-Dar, Sham M Kakade, and Yishay Mansour. Online markov decision processes. Mathematics of Operations Research, 34(3):726 736, 2009.

Dean P. Foster and Rakesh Vohra. Regret in the on-line decision problem. Games and Economic Behavior, 29:7 36, 1999.

Aur elien Garivier and Eric Moulines. On upper-conﬁdence bound policies for switching bandit problems. In International Conference on Algorithmic Learning Theory, pages 174 188. Springer, 2011.

Sergiu Hart and Andreu Mas-Colell. A simple adaptive procedure leading to correlated equilibrium. Econometrica, 68:1127 1150, 2000.

Sergiu Hart and Andreu Mas-Colell. A general class of adaptive strategies. Journal of Economic Theory, 98:26 54, 2001.

Nicole Immorlica and Robert Kleinberg. Recharging bandits. In 2018 IEEE 59th Annual Symposium on Foundations of Computer Science (FOCS), pages 309 319, 2018.

Pooria Joulani, Andras Gy orgy, and Csaba Szepesv ari. Online learning under delayed feedback. In International Conference on Machine Learning, pages 1453 1461, 2013.

Tze Leung Lai and Herbert Robbins. Asymptotically eﬃcient adaptive allocation rules. Advances in applied mathematics, 6(1):4 22, 1985.

Tor Lattimore and Csaba Szepesv ari. Bandit algorithms. Cambridge University Press, 2020.

Nir Levine, Koby Crammer, and Shie Mannor. Rotting bandits. In Advances in Neural Information Processing Systems, pages 3077 3086, 2017.

Alberto Maria Metelli, Francesco Trovo, Matteo Pirola, and Marcello Restelli. Stochastic rising bandits. In International Conference on Machine Learning, pages 15421 15457. PMLR, 2022.

Lugosi, Pike-Burke, Savalle

Ciara Pike-Burke and Steﬀen Grunewalder. Recovering bandits. In Advances in Neural Information Processing Systems, pages 14122 14131, 2019.

Julien Seznec, Andrea Locatelli, Alexandra Carpentier, Alessandro Lazaric, and Michal Valko. Rotting bandits are no harder than stochastic ones. In International Conference on Artiﬁcial Intelligence and Statistics, pages 2564 2572, 2019.

Julien Seznec, Pierre Menard, Alessandro Lazaric, and Michal Valko. A single algorithm for both restless and rested rotting bandits. In International Conference on Artiﬁcial Intelligence and Statistics, pages 3784 3794. PMLR, 2020.

Aleksandrs Slivkins and Eli Upfal. Adapting to a changing environment: the brownian restless bandits. In Conference on Learning Theory, pages 343 354, 2008.

Cem Tekin and Mingyan Liu. Online algorithms for the multi-armed bandit problem with markovian rewards. In 2010 48th Annual Allerton Conference on Communication, Control, and Computing (Allerton), pages 1675 1682. IEEE, 2010.

Cem Tekin and Mingyan Liu. Online learning of rested and restless bandits. IEEE Transactions on Information Theory, 58(8):5588 5611, 2012.

Peter Whittle. Restless bandits: Activity allocation in a changing world. Journal of applied probability, pages 287 298, 1988.

Alexander Zimin and Gergely Neu. Online learning in episodic markovian decision processes by relative entropy policy search. In Advances in Neural Information Processing Systems 26, pages 1583 1591, 2013.

Appendix A. Proofs for the loyalty points model with increasing rewards

A.1 Proof of Lemma 2 (Optimality of single arm strategies for increasing loyalty points bandits)

Proof Suppose that a sequence j plays arm j Nj times up to time T (and therefore PK j=1 Nj = T). Then the total (pseudo) reward of such a sequence is

(since the fj are increasing functions)

T max 1 j K

T Fj(T) = e ST (j , j , . . . , j ) .

Hence there is a single-arm strategy whose total reward is at least as large as that of j.

Bandit problems with fidelity rewards

A.2 Proof of Theorem 3 (Regret of UCB in stochastic loyalty points bandits with increasing ﬁdelity rewards)

Proof This proof is similar to the standard proofs of the regret bounds of UCB (see Auer et al. (2002a); Lattimore and Szepesv ari (2020)) with some adjustments to deal with the ﬁdelity functions. Let B = K j=1 T t=1 {µj + Fj(T)/T UCBt(j)} be the event that the upper conﬁdence bounds on each arm hold at all time steps. We show that on event B, the number of plays of sub-optimal arms is small. First, we write

t=1 (µj µJt + fj (t) f Jt(Nt 1,Jt)))

j=1 NT,j(µj µj) +

j=1 NT,j(µj µj) +

n=NT,j fj (n) X

j=1 NT,j(µj µj) + (T (T X

j:j =j NT,j))fj (T) X

j:j =j NT,jfj(0)

(since the ﬁdelity functions are increasing)

j:j =j NT,j(µj µj + fj (T) fj(0)) ,

since NT,j = T P j =j NT,j. Hence, it suﬃces to bound the number of plays of each sub-optimal arm. Let mj = 8 log(KT)

and observe that on the event B, if Nt,j > mj arm j will not be played again.

Indeed, in this case, for any number of plays m of the optimal arm,

UCBt(j) = Xt,j + 1

T Fj(T) + 2

T Fj(T) + e j = µj + 1

T Fj (T) UCBt(j )

and therefore arm j is not played again. Then, using the fact that the conﬁdence bounds hold with probability 1/(KT)2 (by Hoeﬀding s inequality), the expected regret may be bounded by

ERT = E[RT 1B] + E[RT 1BC]

j:j =j E[NT,j1B](µj µj + fj (T) fj(0)) + TP(BC)

& 8 log(KT)

(µj µj + fj (T) fj(0)) + T

j=1 P(µj + Fj(T)/T > UCBt(j))

Lugosi, Pike-Burke, Savalle

(µj µj + fj (T) fj(0)) + 1

proving the result. To obtain the worst case regret bound, we use standard techniques (see,

e.g. Lattimore and Szepesv ari (2020)) and set e j q

A.3 Proof of Theorem 4 (Lower bound on the regret for the adversarial loyalty points bandit problem with increasing ﬁdelity reward)

Proof It suﬃces to consider the two-armed setting (i.e., K = 2). The argument may easily be extended to multiple arms, for example, by replicating the rewards of the two arms. We may also assume that T is even and the ﬁdelity functions of each arm are the same. We write F(t) = Pt s=1 f(s) for the cumulative ﬁdelity reward. In order to prove the theorem, we consider just two possible sequences of rewards. In both cases arm 1 has reward X1,t = δ/5 for all t [T]. Also, in both cases arm 2 has reward X2,t = 0 for all t [T/2]. The reward X2,t of arm 2 for all t > T/2 equals 0 in case (I), while it equals δ in case (II). We know that in this problem single arm strategies are optimal. In case (I), the optimal arm is arm 1 with total reward Tδ/5 + F(T), while in case (II), the optimal arm is arm 2 with total reward Tδ/2 + F(T). Consider any policy and denote by τ1 the number of times the policy chooses arm 1 up to time T/2. Similarly, τ2 denotes the number of times the policy chooses arm 1 between times T/2 + 1 and T. Note that τ1 is the same in cases (I) and (II) as up to time T/2 the two cases are identical. If τ1 T/4, then in case (I) the policy suﬀers regret at least Tδ/20, so in the rest of the proof we may assume that τ1 > T/4 and we consider case (II). In that case the total reward of the policy equals

2 τ2δ + F(τ1 + τ2) + F(T τ1 τ2) .

Thus, since arm 2 is optimal, the regret becomes

5 + F(T) (F(τ1 + τ2) + F(T τ1 τ2)) .

If τ2 > τ1 T/8, then the regret is at least

5 (since the ﬁdelity reward is nondecreasing)

20 (since τ1 > T/4).

Bandit problems with fidelity rewards

Finally, if τ2 τ1 T/8, then

and therefore, by the nondecreasing property of the ﬁdelity rewards, for any t [T], F(t) + F(T t) is nonincreasing in t T/2 and nondecreasing in t T/2. Hence,

F(T) (F(τ1 + τ2) + F(T τ1 τ2)) F(T) F(7T/8) F(T/8)

t=1 (f(T t) f(t))

8 (f(7T/8) f(T/8)) = Tδ

Hence, the regret satisﬁes

5 + F(T) (F(τ1 + τ2) + F(T τ1 τ2)) Tδ

Appendix B. Results for Stochastic Coupons Subscription Model

B.1 Proof of Theorem 18 (Regret bound for the Batch-UCB algorithm in the stochastic coupons subscription model)

We ﬁrst bound the number of plays of any sub-optimal arm.

Lemma 23 For any sub-optimal arm j,

0 E[Nj(T)] 16 log(TK)

e 2 j + ρj + 1 + 2

Proof Let B be the event that the conﬁdence bounds hold for all arms at all time steps, that is, B = K j=1 T t=1 {µj UCBt(j)}. Then deﬁne n j = 16 log(TK)

e 2 j = m jρj + lj for

some integer m j N and remainder 0 lj ρj. We now show that on event B, if we have played arm j en j = (m j + 1)ρj n j times, then it is not played again. Indeed, if t is such that Nj,t en j n j, then,

UCBt(j) = Y j,t + 2

NT,j eµj + 4

n j < eµj + e j = eµj UCBt(j ).

Here we have used the deﬁnition of the conﬁdence bounds and the notation eµj = µj + rj

ρj and j = argmax1 j K µj + rj

ρj . Hence, UCBt(j) < UCBt(j ) and so we would play arm j and not play arm j again. Consequently arm j is not played more than n j times. Using this, we bound the expected number of times we play arm j as,

E[NT,j] = E[NT,j1{B}] + E[NT,j1{BC}] n j + TP(BC)

Lugosi, Pike-Burke, Savalle

(m j + 1)ρj + T

t=1 P(µj > UCBt(j)) 16 log(TK)

where we have used the fact that the conﬁdence intervals hold with probability at least 1 2 T 2K2 . Using the trivial bound x x + 1 and noting that Nj(T) 0 gives the result of the lemma.

Using this, we now prove the regret bound in Theorem 18. Proof We play each arm j in batches of ρj plays. Denote the number of such batches of arm j as BT,j and note that NT,j = ρj BT,j. Using Lemma 17, we write the regret as,

E[RT ] = T(µj + rj

j=1 E[BT,j](ρjµj + rj)

= T(µj + rj

j=1 E[NT,j](µj + rj

ρj ) + 2ρ = X

j =j E[NT,j]e j + 2ρ

where e j = max1 j K{µj + rj

ρj } (µj + rj

ρj ). Substituting the result from Lemma 23 into the above result gives,

j =j ρj e j + (1 + 2

j =j e j + 2ρ.