# bandit_problems_with_fidelity_rewards__9303b940.pdf Journal of Machine Learning Research 24 (2023) 1-44 Submitted 11/21; Revised 6/23; Published 11/23 Bandit problems with fidelity rewards G abor Lugosi gabor.lugosi@upf.edu ICREA Barcelona, Spain, and Department of Economics and Business Pompeu Fabra University Barcelona, Spain, and Barcelona Graduate School of Economics Barcelona, Spain Ciara Pike-Burke c.pikeburke@gmail.com Department of Mathematics Imperial College London London, UK Pierre-Andr e Savalle psavalle@cisco.com Cisco Systems, Inc. Paris, France Editor: Aurelien Garivier The fidelity bandits problem is a variant of the K-armed bandit problem in which the reward of each arm is augmented by a fidelity reward that provides the player with an additional payoffdepending on how loyal the player has been to that arm in the past. We propose two models for fidelity. In the loyalty-points model the amount of extra reward depends on the number of times the arm has previously been played. In the subscription model the additional reward depends on the current number of consecutive draws of the arm. We consider both stochastic and adversarial problems. Since single-arm strategies are not always optimal in stochastic problems, the notion of regret in the adversarial setting needs careful adjustment. We introduce three possible notions of regret and investigate which can be bounded sublinearly. We study in detail the special cases of increasing, decreasing and coupon (where the player gets an additional reward after every m plays of an arm) fidelity rewards. For the models which do not necessarily enjoy sublinear regret, we provide a worst case lower bound. For those models which exhibit sublinear regret, we provide algorithms and bound their regret. Keywords: multi-armed bandit problem, fidelity reward, regret minimization 1. Introduction Consider the problem of a worker searching for a good restaurant for their daily lunches. They are willing to explore the neighborhood in order to find the best restaurant, but also want to have good lunches. This is well modeled as a bandit problem, where the worker has to balance exploration (trying out new or unfamiliar restaurants), and exploitation (having a nice lunch at a favored place). However, the overall experience and satisfaction of the worker may not depend solely on the chosen restaurant, but also on whether the worker is a 2023 G abor Lugosi, Ciara Pike-Burke, Pierre-Andr e Savalle. License: CC-BY 4.0, see https://creativecommons.org/licenses/by/4.0/. Attribution requirements are provided at http://jmlr.org/papers/v24/21-1408.html. Lugosi, Pike-Burke, Savalle regular at this restaurant. Indeed, being loyal often leads to better experiences (e.g., waiters are more friendly, or the customer gets a free meal once in a while). We call the bonus from this loyalty the fidelity reward. In other situations, fidelity can have the opposite effect on the reward. For example, restaurants may offer free drinks to new customers, or customers may get bored of visiting the same restaurant. In this case the fidelity reward decreases. We consider multi-armed bandit problems where the reward of each arm is augmented by a fidelity reward that provides an additional payoffdepending on how loyal the player has been to a given arm. We consider two models for the fidelity rewards, namely the loyalty-points model and the subscription model. Under the loyalty-points model, the fidelity reward is a (possibly arm-specific) function of the total number of past plays of the arm. An important feature of these loyalty points is that once collected, they stay in the bank. This corresponds to the common practice of loyalty programs in marketing and retail, where the customer is rewarded for loyal behavior (e.g., the loyalty schemes offered by airline companies or grocery stores). However, in other cases, loyalty may cause the reward to decrease. For example in recommendation systems, users grow bored of repeatedly seeing the same content. In the stochastic bandit setting, the latter problem is known as rotting bandits and has been studied by Levine et al. (2017), Seznec et al. (2019, 2020) and others. In this paper, we extend this to adversarial rewards and also consider fidelity functions that are not necessarily decreasing functions of the number of past plays. In the subscription model, the fidelity reward is a (possibly arm-specific) function of the current number of consecutive draws of the arm. In particular, this means that if a player stops playing an arm, the next time they play it they will have to start the subscription again from scratch. This model is more realistic in situations where a user pays for a subscription, such as for a mobile phone plan or a gym membership. Here, the benefit of using a particular service increases with the length of continuous use. Again, there may be scenarios where continuously selecting the same option has a negative impact, for example when customers grow bored of going to the same restaurant every day, or where the benefit is only received after selecting an option a given number of consecutive times, such as when using a weekly metro pass. The aim of this paper is to understand how the presence of fidelity rewards effects the difficulty of the multi-armed bandit problem. Our main interest is in the adversarial setting, where we assume that the base rewards are generated by some oblivious adversary and the player receives additional fidelity rewards depending on their past actions. We wish to develop policies with low regret, defined as the cumulative difference of rewards between the policy and some baseline algorithms. Clearly, the best baseline selects the best possible sequence of T actions. However, even in the standard adversarial bandits problem without any additional fidelity rewards, it is impossible to compete with such a strong baseline. Instead, it is common to take inspiration from the stochastic setting, where it is known that an optimal policy is a single-arm policy that constantly plays the best arm, and compare the performance of an adversarial bandits algorithm to the best single-arm strategy. In the fidelity bandits problem, we also define the regret relative to a class of policies that is optimal in the stochastic problem. However, here the addition of fidelity rewards means that the optimal policy class in the stochastic setting may be more complex. Moreover, there may be multiple optimal policies for the stochastic problem, thus complicating the Bandit problems with fidelity rewards definition of regret in the adversarial setting. The first contribution of this paper is to define several natural definitions of regret for the adversarial fidelity bandits problem and to understand the relationship between them. With these definitions of regret in hand, we aim to understand in which K-armed adversarial fidelity bandits problems is it possible to achieve regret that is sublinear in T. In the loyalty points problem, we show that when the fidelity rewards are increasing functions of the number of previous plays, sublinear regret is not possible for any definition of the regret that we consider. However, when the fidelity rewards are decreasing functions of the number of past plays, sublinear regret is possible for some definitions of regret in the loyalty points model. To show this, we introduce a bandit algorithm that exploits the concavity of the cumulative fidelity reward in this setting and achieves regret scaling with K T. Lastly, in the so called coupons model where the fidelity reward is only received every ρj plays of an arm, we show that it is possible to achieve sub-linear regret and provide an algorithm with regret scaling with KT in the loyalty points setting. For the subscription model, where the fidelity rewards are functions of the number of consecutive plays of the arm, our results for the increasing case are the same as for the loyalty points model. Indeed, we show that there exist some settings with increasing fidelity functions where the regret must be linear. In the decreasing subscription model, the results are more positive. Indeed, here we are able to provide an algorithm whose regret is sublinear, and in particular is of order (KT)2/3. We also show that sublinear regret is possible in the coupons model with subscription fidelity reward, although here the regret scales multiplicatively with the periodicity of the coupon function. Our results are summarized in Tables 1 and 2. For completeness, we also provide a near-optimal algorithm for the stochastic fidelity bandits problem in the cases where no such algorithm exists in the literature. 1.1 Related work In the loyalty points bandits problem, the reward changes depending on the past plays of each arm. Thus, the loyalty points bandits problem can be seen as an instance of the rested bandits problem introduced by Whittle (1988) which has been studied in the stochastic reward setting. We are not aware of any work on the rested bandits problem with adversarial rewards. In the stochastic rested bandits problem, it is classically assumed that the reward of each arm is a stochastic process which only transitions to the next value when the arm is played. Tekin and Liu (2010, 2012) considered Markov processes. A more general process was considered by Cortes et al. (2017) but they considered weaker definitions of regret, and compared the reward of their algorithm to the best single arm strategy, or the best arm given the history of actions taken by their algorithm. The rotting bandits problem studied by Levine et al. (2017); Seznec et al. (2019, 2020); Bouneffouf and F eraud (2016) is another instance of the stochastic rested bandits problem. This problem is equivalent to the loyalty points bandit problem with stochastic base rewards and decreasing fidelity reward. Indeed, here the expected reward decreases as a function of the number of plays of an arm. The best result that we are aware of for this problem is from Seznec et al. (2019, 2020) who show that the regret can be bounded by e O( KT). For this, Seznec et al. (2020) present a natural UCB style algorithm (c.f. Auer et al. (2002a)) which does not use any knowledge of the fidelity functions. The stochastic loyalty points Lugosi, Pike-Burke, Savalle bandits problem with increasing fidelity reward was also independently studied by Metelli et al. (2022) in concurrent work. In the subscription bandits problem, the reward at each time step depends on the previous plays of multiple arms. In particular, when we stop playing an arm after having played it consecutively for some number of steps, its reward changes even though it is not played, in addition to the reward of the played arm changing. This problem therefore does not exactly fit into either the classical rested or restless bandit framework, especially since while the restless bandit allows for the reward of an arm to change independently of whether it was played, this change is typically assumed to depend on the environment and not on the actions of the player, see Whittle (1988); Slivkins and Upfal (2008); Garivier and Moulines (2011); Besbes et al. (2014). We are not aware of any prior work studying this particular problem neither in the stochastic nor in the adversarial setting. The stochastic bandit problem where the reward depends on the time between consecutive plays of an arm has been studied by Immorlica and Kleinberg (2018); Pike-Burke and Grunewalder (2019); Cella and Cesa-Bianchi (2020); Basu et al. (2019). This is potentially another form of measuring loyalty to an arm but is distinct to the models considered in this paper. Lastly, we observe that standard adversarial bandit algorithms are not sufficient to deal with the reward structure in the fidelity bandits problem. These algorithms are typically designed to be competitive with the best single arm strategy which may be far from optimal in the fidelity bandit problem. Moreover, the addition of the fidelity rewards means that it is possible for the adversary to pick rewards to make either getting high reward or high fidelity reward impossible. Similarly, algorithms for learning in Markov Decision Processes (MDPs) with adversarial rewards (e.g., Even-Dar et al. (2009); Zimin and Neu (2013)) are not applicable. These require the MDP to be episodic or irreducible. 2. Problem setup In the classical K-armed bandit problem, a player sequentially interacts with the environment. Let [K] = {1, . . . , K} denote the set of arms. Then in each round, t, the player selects an arm Jt [K], possibly using the previous observations and incorporating an element of randomness into their decision. The environment then returns a reward Xt,Jt depending on which arm was selected. The player s aim is to select arms that maximize their total reward over T rounds of the bandit game. In the stochastic multi-armed bandit problem, the rewards Xt,j for arm j are generated i.i.d. from an underlying reward distribution. In particular, E[Xt,j] = µj for all 1 t T. We assume throughout the paper that the rewards are bounded in [0, 1]. In the adversarial multi-armed bandit problem, we work under the model of an oblivious adversary. In this model, all rewards Xt,j [0, 1] for j [K] and t [T] are chosen by an adversary prior to starting the game. At round t [T] of the game, the player selects an arm Jt [K] and the reward Xt,Jt is revealed to the player, while the rewards Xt,j with j = Jt remain hidden forever. In this paper, we consider a modification of the standard multi-armed bandit problem to account for extra fidelity rewards. We focus on two models; the loyalty-points model and the subscription model. In both these models, after playing arm Jt = j at time t, the player Bandit problems with fidelity rewards receives reward of the form, Yt,j = Xt,j + ψj(Ht 1) , where ψj(Ht 1) is some known, arm-specific function of the player s history Ht 1 = {J1, Y1,J1, X1,J1 . . . , Jt 1, Yt 1,Jt 1, Xt 1,Jt 1}. For ease of expression, we often refer to the Xt,j s as the base rewards and the ψj(Ht 1) s as the fidelity rewards. Therefore the reward Yt,j received at time t is composed of the base reward from playing arm j at time t and the additional fidelity reward. In the loyalty points model, the reward depends on the number of past plays of each arm. Let Nt,j = Pt s=1 1{Js=j} denote the number of plays of arm j up to time t 1 and let N0,j = 0. (Here, and in the rest of the paper, 1{ } is used to denote the indicator function.) Then, for all arms j, ψj(Ht) = fj(Nt,j) where each fj is a function defined on the set of natural numbers, taking values in [0, 1]. Hence, the reward the player receives from playing arm j at time t is Yt,j = Xt,j + fj(Nt 1,j) . In the subscription model, the fidelity reward depends on the current number of consecutive plays of the arm. Formally, define Qt,j = 1{Jt=j}(t max{s t : Js = j, Js 1 = j}) . Note that since only one arm can be played per time step, Qt,j = 0 for all arms j = Jt 1. In this case, ψj(Ht 1) = fj(Qt,j) for some known, arm-specific function of the number of consecutive plays up to time t 1 and fj : N [0, 1] for all arms j [K]. The reward received from playing arm j at time t is then, Yt,j = Xt,j + fj(Qt,j) . In this paper we assume that the fidelity functions fj are known to the player. In addition, we assume that the time horizon T is also known in advance. We also remark that in both models, when ψj 0 (so fj 0) for all j [K], the problem reduces to the classical multi-armed bandit problem. We now turn to the problem of measuring the performance of an algorithm in the fidelity bandits problem. Typically, the performance of bandit algorithms is measured in terms of their regret, where the regret is the total difference in reward accumulated by playing according to some optimal policy and that from playing the algorithm of interest. In order to properly define meaningful notions of regret, we discuss the stochastic and adversarial versions of the problems separately. 2.1.1 Regret in the stochastic bandit problem Consider first the stochastic variant of the problem. Recall that here, the base reward (without the fidelity component), associated to arm j at time t is a random variable Xt,j, where the Xt,j are independent random variables taking values in [0, 1], and E[Xt,j] = µj for all t [T] and j [K]. Lugosi, Pike-Burke, Savalle A policy π of the player is a sequence (πt)T t=1 where πt is a mapping from the history of arms and rewards, Ht 1 = {π1, Y1,π1, X1,π1 . . . , πt 1, Yt 1,πt 1, Xt 1,πt 1}, to the set [K] of arms. We write πt to denote the choice of arm made by the policy π at time t. The history Ht depends on the chosen arms π1, . . . , πt. The fidelity reward depends on the history of actions chosen according to policy π so we write Ht(π) for the generated history under policy π up to time t. The regret of a policy is defined as the difference between the cumulative reward of the policy and that of the best sequence of arm choices. To define the regret formally, let [K]T be the set of sequences of arms of the form j = (j1, . . . , j T ) with j1, . . . , j T [K] for horizon T N. We write ST (j) = PT t=1 Yt,jt for the cumulative reward of the sequence j. We then define the regret of a candidate policy π as max j [K]T ST (j) ST (π) . (1) To avoid complications caused by random fluctuations, instead of the regret defined above, it is customary to consider the so-called pseudo-regret (see Bubeck and Cesa-Bianchi (2012); Lattimore and Szepesv ari (2020)). In this notion, the base reward of each arm Xt,j is replaced by its expectation µj. Hence, the pseudo cumulative reward of policy π is defined as e ST (π) = PT t=1(µπt + ψπt(Ht 1(π))). Similarly, for a competing reference sequence j, we write e ST (j) = PT t=1(µjt + ψjt(Ht 1(j))). We then define the pseudo-regret as RT (π) = max j [K]T e ST (j) e ST (π) . (2) We often write RT = RT (π) when the policy π is clear from the context. Note that e ST (π) may still be random since the arm choices of the policy π influence the fidelity rewards. On the other hand, for each sequence j [K]T , the pseudo cumulative reward e ST (j) is deterministic. Note that in the classical multi-armed bandit problem, the regret is usually defined with respect to a much smaller reference class, containing only single-arm strategies. These strategies are constant sequences of the form j = (j, j, . . . , j) for j [K]. The two definitions are, in fact, equivalent in this problem. To see this, simply notice that in the absence of fidelity rewards, there is always a single-arm strategy that maximizes the expected cumulative reward (i.e., j argminj [K] µj). In the presence of fidelity rewards, single-arm strategies are not necessarily optimal. Depending on the nature of the fidelity reward functions fj, one may still be able to determine a subclass J [K]T that guarantees that, regardless of the distribution of the base rewards (that is, regardless of the values of µ1, . . . , µK), there always exists j J such that e ST (j ) = maxj [K]T e ST (j). In other words, RT (π) = sup j J e ST (j) e ST (π) . We call such a subset sufficient. In solving the regret minimization problem, it is important to understand the sufficient subsets J that allow one such a reduction. In particular, in each case it is helpful to identify minimal sufficient sets of sequences. For a particular fidelity Bandit problems with fidelity rewards reward f, a set J is minimal if for all proper subsets J J there exists a distribution of the arms such that maxj J e ST (j) < maxj J e ST (j). We define Cf = {J [K]T : J is minimal and sufficient} to be the class of minimal sufficient sets for given fidelity functions f. For example, for the classical multi-armed bandit problem when fj 0, the class Cf of minimal and sufficient sets contains just one set, namely the set J0 = {(j, j, . . . , j) : j [K]} of single-arm strategies. 2.1.2 Regret in the adversarial bandit problem Next we discuss the possible notions of regret in the adversarial setting. We work with an oblivious adversary. This means that the entire sequence of base rewards (Xt,j)t [T],j [K] is fixed before play starts. In each round t, only the reward corresponding to the arm chosen by the player is revealed. In the adversarial setting the player is allowed to randomize the arm selection. (Without randomization, no nontrivial guarantees exist even for the classical multi-armed bandit problem.) Formally, a player s policy π is a sequence (πt)T t=1 where πt is a measurable mapping from the history of arms and rewards Ht 1 = {J1, Y1,J1, X1,J1 . . . , Jt 1, Yt 1,Jt 1, Xt 1,Jt 1} to the set K, the standard simplex in RK. At time t, an arm Jt is chosen at random, according to the distribution πt(Ht 1). While in the stochastic multi-armed bandit problem the definition of regret is quite natural, the adversarial case is significantly more complex. In the stochastic setting, we define the regret as the difference between the (pseudo) cumulative reward of the player and that of the best sequence of actions. Adopting this notion to the adversarial framework often leads to trivialities. To see this, just consider the case of the classical adversarial multi-armed bandit problem where fj 0 for all arms j. In this case, the optimal policy is to play the arm with highest reward at each time step (jt argmaxj Xt,j). Hence, the player would need to compete with all possible KT sequences of arm choices. Obviously, achieving a sublinear regret with respect to this class of strategies is hopeless. One usually gets around this problem by reducing the class of comparison policies (Auer et al. (1995)). The usual definition of regret in the absence of fidelity rewards is with respect to the class of single-arm strategies. This definition is natural since single-arm strategies are always optimal in the classical stochastic bandit problem. With this definition of regret, the adversarial multi-armed problem may be considered as a robust version of its stochastic counterpart. As explained in the previous sections, in the stochastic fidelity bandits problem, competing against the best single-arm strategy is often too restrictive. It is clear that in many cases, it is necessary to switch arms in order to get good reward. This means that the best single arm strategy is no longer a reasonable baseline. Instead, we adopt the same philosophy as described above, and use knowledge of what makes a good sequence in the stochastic setting to define the regret in the adversarial case. In particular, to define the regret for the adversarial case for given fidelity functions fj , we first determine a minimal set of sequences of arm pulls that can be optimal for the stochastic problem for some distribution. These are the so-called sufficient and minimal sets defined in Section 2.1.1. Then we define the regret as the difference of the cumulative reward of the player and that of the largest reward one could achieve by following one of the policies in a sufficient Lugosi, Pike-Burke, Savalle and minimal comparison class. When there is only one sufficient and minimal set (i.e., the class Cf contains a single set J ), then the (pseudo) regret is defined simply as RT (π) = max j J ST (j) e ST (π) , (3) where ST (j) = PT t=1(Xt,jt + ψjt(Ht 1(j))) is the cumulative reward of a sequence j = (j1, . . . , j T ) and e ST (π) = EST (J) is the expected cumulative reward of policy π. Here J = (J1, . . . , JT ) denotes the sequence of randomized arm choices of policy π and the expectation is with respect to the randomizations of the policy. In some cases of fidelity rewards, there are multiple sets of minimal sufficient sets of sequences. In such cases there are various natural ways of defining regret. We single out two possibilities that we call the weak and strong regrets. The weak regret is defined as R T (π) = min J Cf max j J ST (j) e ST (π) , while the strong regret is R T (π) = max J Cf max j J ST (j) e ST (π) = max j S J Cf J ST (j) e ST (π) . Clearly, minimizing the strong regret is a more ambitious goal, since the player competes with a larger set of sequences of arms. On the other hand, having a small weak regret means that the policy is able to compete against at least one minimal sufficient set of sequences (whose identity may depend on the base rewards (Xt,j)). Of course, if Cf contains only one set, then the weak and strong notions coincide and reduce to the definition (3). In particular, note that when fj 0, both the weak and strong regret correspond to the notion of regret classically considered in standard adversarial bandits problem. 2.1.3 Mean regret for loyalty points bandits For loyalty points bandits, there is a third notion of regret that is perhaps more natural than the weak and strong regret. This is defined by observing that in the stochastic loyalty points model, the total reward only depends on the number of plays of each arm and the average base reward, and not on the order of plays. We may replicate this in the adversarial setting using empirical averages. We call this the mean regret. Unfortunately, such a definition of regret is not suitable for the subscription model where the total reward inherently depends on the ordering, since the subscription model only offers fidelity reward for consecutive plays of an arm. If a sequence j = (j1, . . . , j T ) [K]T plays arm j NT,j(j) = PT t=1 1{jt=j} times up to time T then in the stochastic analogue of loyalty points bandits problem, its (pseudo) reward is µj NT,j(j) + j=1 µj (NT,j(j) + Fj(NT,j(j))) where Fj(n) = Pn t=1 fj(t) are the cumulative fidelity rewards and µj is the expected value of the base reward of arm j. Hence, for given fidelity functions f1, . . . , f K, in the stochastic Bandit problems with fidelity rewards problem, e ST (j) only depends on the number of times each arm is played and the expected base rewards. To shorten notation, introduce the vector N = (N1, . . . , NK) of nonnegative integers satisfying PK j=1 Nj = T. N is called the type of the sequence j if PT t=1 1{jt=j} = Nj for all j [K]. Let µ = (µ1, . . . , µK) [0, 1]K denote the vector of means of the base rewards of the K arms and define, σ(µ, N) def. = j=1 (Njµj + Fj(Nj)) . Then, the cumulative reward of the optimal sequence of actions is maxj [K]T e ST (j) = max N σ(µ, N) and the regret for the stochastic problem is RT (π) = max N σ(µ, N) e ST (π). We introduce some definitions to help relate these types to the minimal sufficient sets. Fix some fidelity reward functions f1, . . . , f K. For a given µ [0, 1]K, let N(µ) = {N : σ(µ, N) = max N σ(µ, N )} be the set of optimal types. We call a set B of types a covering set if for all µ [0, 1]K, N(µ) B = . A covering set B is minimal if no proper subset of B is a covering set. Each minimal sufficient set of sequences J Cf is such that J contains exactly one sequence of each type of a minimal covering set B. We say that a type N is admissible if N S µ [0,1]K N(µ). In the adversarial bandit problem, it is natural to replace the values µj by the empirical averages since these relate to the long run rewards. We then set max N σ(bµT , N) as a goal for the learner, where bµT = (bµT,1, . . . , bµT,K). With this in mind, we define the mean regret as R T (π) = max N σ(bµT , N) e ST (π) = max N NK:P j [K] Nj=T j=1 (NjbµT,j + Fj(Nj)) e ST (π) . Hence, the type N maximizing σ(bµT , N) may be interpreted as the best response to the (unknown) empirical means of the outcomes (in hindsight). This point of view is often adopted in the theory of repeated games, see Hart and Mas-Colell (2000, 2001); Foster and Vohra (1999). This gives the mean regret a natural interpretation that is a generalization of most usual notions of regret. The next proposition establishes the relationship between the weak, strong, and mean regrets. Proposition 1 In the loyalty points model, regardless of the fidelity rewards and the base rewards, for each policy π, R T (π) R T (π) R T (π) . In particular, if there is only one minimal and sufficient set J , then all three regrets are equal to (3). Proof The stated inequalities may be equivalently written as min J Cf max j J ST (j) max N σ(bµT , N) max J Cf max j J ST (j) . Lugosi, Pike-Burke, Savalle Increasing Decreasing Coupons Stochastic E[RT ] = e O(K1/3T 2/3) E[RT ] = e O( KT) E[RT ] = e O( KT) [ Section 3.2 ] [ Seznec et al. (2020) ] [ Section 5.2 ] Adversarial E[R T ] = Ω(T) E[R T ] = Ω(T) E[R T ] = Ω(T) E[R T ] = Ω(T) E[R T ] = e O(K E[R T ] = e O(K E[R T ] = e O( E[R T ] = e O( E[R T ] = e O( KT) [ Section 3.3 ] [ Section 4.3 ] [ Section 5.3 ] Table 1: Summary of our main results in the loyalty points bandits problems. e O indicates order up to log factors, and ρ is the least common multiple of the periodicity ρ1, . . . , ρK in the coupons reward model. Fix the fidelity reward functions f1, . . . , f K. Denote by N the type of the sequence that solves the min-max optimization in the definition of the weak regret, that is, min J Cf max j J ST (j) = min j [T]K:j has type N ST (j) . Observe that there is an admissible type with this property. On the other hand, for any admissible type N = (N1, . . . , NK), the average of the cumulative rewards over all sequences of type N equals 1 T N1,...,NK X j [T]K:j has type N ST (j) = 1 T N1,...,NK X j [T]K:j has type N t=1 Xt,jt + = σ(bµT , N) . Hence, min J Cf max j J ST (j) σ(bµT , N) max N σ(bµT , N) , proving the first announced inequality. The second inequality is proved similarly. We note that while the mean regret is not applicable in the subscription model, R T (π) R T (π) obviously holds in the subscription model as well. In fact, in all instances of the subscription model that we consider in this paper, it will turn out that there is only one minimal sufficient set, meaning that R T (π) = R T (π), and so there is no need to consider an intermediate definition of regret. 2.2 Fidelity reward functions Throughout this work, we assume that the fidelity functions are known and fj takes values in [0, 1] for all arms j [K]. We consider both increasing and decreasing fidelity functions, Bandit problems with fidelity rewards Increasing Decreasing Coupons Stochastic E[RT ] = e O(T 2/3K1/3) E[RT ] = e O(T 2/3K1/3) E[RT ] = e O( KT) [ Section 6.2 ] [ Section 7.2 ] [ Section 8.2 ] Adversarial E[R T ] = Ω(T) E[R T ] = Ω(T) E[R T ] = e O((KT)2/3) E[R T ] = e O((KT)2/3) E[R T ] = O( p E[R T ] = O( p ρKT) [ Section 6.31] [ Section 7.3 ] [ Section 8.3 ] Table 2: Summary of our main results in the subscription bandits problems. e O indicates order up to log factors, and ρ is the least common multiple of the periodicity ρ1, . . . , ρK in the coupons reward model. in addition to a certain class of periodic fidelity rewards that we call the coupons model. In the increasing rewards model, we assume that for all j [K], fj(n) is a nondecreasing function of n. For the decreasing rewards model, we assume that for all j [K], fj(n) is nonincreasing in n. In the coupons model, a fidelity reward is obtained only once every few (consecutive) plays of each arm. In particular, let the period length of arm j be denoted by ρj N, and the bonus fidelity reward rj [0, 1]. Then, the player receives a fidelity bonus rj after every ρj (consecutive) plays of arm j. More formally, for rj > 0, ( rj if t = 0 (mod ρj) 0 otherwise. Throughout the paper, it is often helpful to consider the cumulative fidelity reward of each arm j which is defined for any n [T] as, t=1 fj(n) . We now consider the different cases in turn. For each, we first analyze the stochastic setting and define the class of minimal sufficient sequences which also defines the regret in the adversarial setting. We then analyze the adversarial setting. Our results are summarized in Tables 1 and 2. 3. Loyalty points model: increasing fidelity rewards The first setting we consider is the loyalty points model where the fidelity functions are increasing functions of the number of past plays of each arm. 3.1 Minimal sufficient sets We first need to define the minimal sufficient sets J of sequences of actions defined in Section 2.1. When the fidelity rewards are nondecreasing, the unique minimal sufficient 1. This is the general result. For the special case where the fidelity functions are step functions with a step from 0 to 1 at m we also show that the weak and strong regret can be bounded by e O(T 2/3(Km)1/3). Lugosi, Pike-Burke, Savalle set of sequences is the set J0 of single-arm strategies. This is shown in the below lemma, whose proof is in Appendix A.1. When the fidelity functions of the different arms are equal, optimality of such a single-arm strategy is obvious. In the case of arm-specific fidelity functions, recall that Fj(T) = PT t=1 fj(t), then the optimal arm is given by j = argmax 1 j K (In case of multiple maximizers, we may choose any of them). The optimal policy then plays arm j for all rounds t = 1, . . . , T. Lemma 2 In the stochastic loyalty points model with increasing fidelity rewards, regardless of the distribution of the base rewards and the fidelity rewards, there exists a single-arm strategy that minimizes the pseudo cumulative reward e ST (j) over all j [K]T . 3.2 Stochastic rewards As Lemma 2 shows, to minimize regret, it suffices to construct a strategy that is able to compete with single-arm strategies. In the stochastic setting, a simple variant of the UCB algorithm of Auer et al. (2002a) achieves this. Instead of using the observations Yt,j to construct the sample averages and upper confidence bounds, one may work with the modified rewards e Yt,j = Xt,j + 1 T Fj(T) . Note that since the fidelity functions are known, computing these modified rewards is possible. These modified rewards are natural since the best single-arm policy equates to playing the arm that maximizes µj + Fj(T)/T in all rounds. We then define the upper confidence bounds around the sample mean of these observations by UCBt(j) = Xt,j + 1 where Xt,j = 1 Nt,j Pt s=1 Xs,j1{Js=j}. The algorithm proceeds as a standard UCB algorithm: first it plays each arm once, then for t = K + 1, . . . , T, it plays arm Jt = argmax 1 j K UCBt 1(j) . The regret of this algorithm is bounded in the next theorem, whose proof follows from a modification of the standard UCB analysis (see (Auer et al., 2002a)) and is given in Appendix A.2. Theorem 3 The expected (pseudo) regret of the UCB algorithm defined above in the stochastic loyalty points bandits problem with increasing fidelity reward is bounded by (µj µj + fj (T) fj(0)) + 1 where e j = µj µj + (Fj (T) Fj(T))/T. Consequently, the worst case regret of this algorithm is bounded by O(T 2/3(K log(T))1/3). Bandit problems with fidelity rewards When the fidelity functions are all constant, fj (T) = Fj (T)/T, fj(0) = Fj(T)/T, the problem reduces to the standard stochastic multi-armed bandit problem and we recover the optimal O(PK j=1 log(T)/e j) order of the problem dependent regret (see Auer et al. (2002a)). When the fidelity functions are not constant, the leading term in the regret is multiplied by µj µj+fj (T) fj(0) e j 1. This term can be as large as 2/e j, in which case, the regret is of the order of log(KT) e 2 j . Thus, while the logarithmic dependence on the horizon T is the same as in the standard multi-armed bandit problem, the dependence on the gap e j may be worse, depending on the fidelity functions. Concurrent work (Metelli et al., 2022) also obtained similar upper bounds on the regret for this problem. We argue that this worse dependence on e j is inevitable. Consider the case where the fidelity functions are the same for all arms, that is, fj = f for all j [K]. Here, e j = µj µj and j = argmaxj µj. For simplicity, assume that the distribution of arm j is Bernoulli with parameter µj [0, 1]. If Fj(T) = Fj (T), then to distinguish µj + Fj(T)/T from µ + Fj (T)/T, if suffices to distinguish µj from µj . A classical result of Lai and Robbins (1985) shows that, in the standard multi-armed bandit problem, any algorithm that is guaranteed to achieve regret o(T α) for every α > 0 must play any arm j at least Nj = Ω( log(T) 2 j ) times to distinguish µj and µj . Thus, any algorithm that obtains sublinear regret must explore sub-optimal arms which necessarily results in some loss of fidelity reward. Define τ = P j:j =j Nj. Then the contribution to the regret from the lost fidelity reward is t=1 f(Nt 1,Jt) = t=T τ f(t) X t=1 (f(T τ) f(Nj)) This gives a lower bound on the regret of Ω(P j:j =j log(T) e 2 j (µj µj + f(T τ) f(Nj))). While this does not exactly match Theorem 3, it shows that a term involving the fidelity functions is unavoidable and that the problem is, in general, harder than the plain stochastic bandit problem. 3.3 Adversarial rewards As seen in the previous section, the only set of minimal sufficient sequences is the set J0 of single-arm strategies. Hence, as in (3), the notions of weak, strong, and mean regrets introduced in Section 2.1 coincide and equal RT (π) = max j J0 ST (j) e ST (π) . This is the same notion of regret as in the standard adversarial bandits problem. In spite of the simplicity of the target and unlike in the stochastic version of the problem, here the addition of the fidelity reward makes it considerably more difficult to achieve sublinear regret. In fact, Theorem 4 below shows that, unless the fidelity rewards are essentially constant, any policy must incur linear regret for some reward sequence. Lugosi, Pike-Burke, Savalle To prove the lower bound, it suffices to consider the case where the fidelity function is the same for all arms. We obtain the following lower bound for the regret, the proof of which is in Appendix A.3. Theorem 4 Consider the adversarial loyalty points bandits problem with increasing fidelity rewards, where the fidelity function is the same for every arm. Define δ = f(7T/8) f(T/8). Then for every policy π there exists a sequence of rewards such that the regret satisfies Thus, the regret is Ω(T) unless f is essentially constant over most of its domain. The particular constants appearing in the bound (and in the definition of δ) have no special significance, they have been chosen for convenience. In particular, the definition of δ may be replaced by f(T(1 ϵ)) f(Tϵ) for any ϵ (0, 1/2) and the constant 1/40 modified accordingly. We omit the straightforward details. Theorem 4 implies that in the adversarial case one cannot hope for a nontrivial regret bound unless the (nondecreasing) fidelity rewards essentially do not change with time. In that case one may use a simple and natural modification of any low-regret algorithm for adversarial bandits such as EXP3 (Auer et al. (2002b)). We omit the straightforward details. 4. Loyalty points model: decreasing fidelity rewards 4.1 Stochastic rewards The stochastic loyalty points bandit problem with decreasing fidelity rewards is equivalent to the rotting bandits problem studied by Levine et al. (2017); Seznec et al. (2019, 2020), and Bouneffouf and F eraud (2016). Seznec et al. (2019, 2020) provide an algorithm with regret RT = e O( p KT log(T)). This matches the lower bound for the standard bandit problem (which can be viewed as a special case of fidelity bandits where the fidelity function is constantly 0) up to logarithmic factors. In the adversarial setting, the loyalty points model with decreasing fidelity rewards is significantly more complex. In order to appropriately define the notion of regret, first we need to determine the class Cf of minimal sufficient sets of sequences of arm pulls. We discuss this in the next section. As it turns out, the class Cf is nontrivial in this case and therefore the notions of strong, weak, and mean regret do not coincide. In fact, we show that while there is no algorithm that achieves sublinear strong regret for a large class of fidelity functions, it is possible to construct a policy with guaranteed sublinear mean regret and thus also sublinear weak regret. 4.2 Minimal sufficient sets In order to study the adversarial version of the problem, first we need to understand the minimal sufficient sets of sequences of actions. Recall from Section 2.1 that each minimal sufficient set J Cf is such that J contains exactly one sequence of each type of a minimal covering set B. Hence, in order to understand the class Cf of minimal sufficient sets, one Bandit problems with fidelity rewards needs to understand the set of admissible types, that is, types that are maximizers of the function σ(µ, N) for some µ [0, 1]K. An important special case is when the fidelity rewards fj(t) are strictly decreasing functions of t for each j [K]. Then the average cumulative reward associated to a a type N = (N1, . . . , NK), j=1 Fj(Nj), is strictly concave. In fact, one may extend h to the standard simplex K such that the function h : K R is strictly concave. Then the cumulative total expected reward corresponding to a mean vector µ [0, 1]K may be written as σ(µ, N) = T N T , µ + h N It is easy to see that for such fidelity reward functions, for each type N, there exists µ [0, 1]K such that N(µ) = {N}, that is, N is the unique type maximizing σ(µ, ). In such cases, the minimal sufficient sets are all those sets J that, for each type N, J contains exactly one sequence j of type N. But then max J Cf max j J ST (j) = max j [K]T ST (j) and therefore it is hopeless to minimize the strong regret. Based on this, it is not difficult to prove the following proposition. Proposition 5 Consider the adversarial loyalty points bandit problem with strictly decreasing fidelity reward functions f1, . . . , f K. Then there exists a positive constant c depending on the fidelity reward functions only such that for any (randomized) policy π of the forecaster, there exist base rewards (Xt,j)t [T],j [K] such that the strong regret satisfies R T (π) c T . Remark 6 In some cases, when the fidelity rewards are not strictly decreasing (so the conditions of Proposition 5 do not hold), it is possible to achieve sublinear strong regret. As an example, consider the case when K = 2 and the fidelity function of both arms is fj(t) = α1t T0 for j = 1, 2, where α (0, 1) and T0 [T] are some fixed parameters. Then it is easy to see that the set of optimal types N(µ) corresponds to a single-arm strategy whenever |µ1 µ2| = α. When µ1 µ2 = α, N(µ) consists of all types with N1 T T0 and when µ1 µ2 = α, N(µ) consists of all types with N1 T0. Hence, in all cases, N(µ) contains one of the two single-arm strategies and therefore the only minimal sufficient set is J0. This implies that the three notions of regret coincide in this case. As is shown below, sublinear mean regret is always achievable and hence, in this particular example, the strong regret can also be made sublinear. Lugosi, Pike-Burke, Savalle Input: T (horizon), tuning parameters η (0, 1], ϵ (0, 1). Initialize: Define q(1), . . . , q(M) K to be an ϵ cover of K so M ϵ (K 1). Initialize: w1,i = 1 i = 1, . . . , M. 1: for all t = 1, . . . , T do 2: Set pt,i = wt,i PM i=1 wt,i and define qt = P i [M] pt,iq(i) 3: Select arm Jt to play by sampling the arms with probabilities (qt,1, . . . , qt,K) and receive reward YJt,t = Xt,Jt + f Jt(NJt,t 1) [0, 2]. 4: Calculate the pseudo-losses b Xt,j = 1 1 Xt,j qt,j 1{Jt=j} 5: Calculate the pseudo-losses of each expert, bℓt(q(i)) = P j [K] q(i) j (1 b Xt,j) h(q(i)) 6: Update wt+1,i = wt,i exp{ ηbℓt(q(i))}. Figure 1: EXP4 with pseudo rewards for the adversarial loyalty points bandits model with decreasing fidelity reward. 4.3 Minimizing the mean regret Interestingly, as opposed to the strong regret, minimizing the mean (and therefore weak) regret is an achievable goal. In this section we define a policy that guarantees that if the fidelity functions are nonincreasing, then the mean regret is o(T) regardless of the base rewards. The algorithm we propose is an exponentially weighted average forecaster where the set of experts q(1), . . . , q(M) K is defined by an ϵ-cover of the simplex K. Similarly to in the standard exponentially weighted average forecaster, we need to estimate the loss of each expert. In this loyalty points bandits problem this is somewhat more involved since the loss estimates need to be defined to take into account the fidelity rewards. We use the loss estimates bℓt(q(i)) = P j [K] q(i) j (1 b Xt,j) h(q(i)) for each expert i where b Xt,j = 1 1 Xt,j qt,j 1{Jt=j} for all arms j. The algorithm is defined in Figure 1. In order to work with these adjusted losses, we need to show that they enjoy the desirable properties of an importance sampling estimate and that they lead to the correct dependence on the fidelity reward of our algorithm. This is done in the following theorem which provides a bound on the regret of the algorithm. Theorem 7 Consider the adversarial loyalty points bandit problem when the fidelity rewards are nonincreasing for all j [K]. Then, regardless of the base rewards, the randomized policy π defined in Figure 1 with ϵ = (K + 1)/ 2T and η = p log(1/ϵ)/(2T) has mean regret ER T (π) (K + 1) log(K + 1) + 1 + Bandit problems with fidelity rewards Proof We consider a randomized forecaster that, at each time instance, selects a point qt = (qt,1, . . . , qt,K) in the standard simplex K. Then the forecaster plays a random arm, chosen according to the distribution qt, that is, Jt = j with probability qt,j, j [K] , Writing e Nj = PT t=1 1Jt=j and f N = ( e N1, . . . , e NK) for the corresponding type, the cumulative reward of such a forecaster equals t=1 Xt,Jt + t=1 Fj( e Nj) = T t=1 Xt,Jt + h Consider the piecewise linear extension of h to K. It follows from the fact that the fidelity rewards are nonincreasing that h is concave and 1-Lipschitz with respect to the ℓ1 norm. Then we may apply the Hoeffding-Azuma inequality to both PT t=1 Xt,Jt and e Nj to obtain that, with probability at least 1 δ, t=1 Xt,Jt + t=1 Fj( e Nj) j [K] qt,j Xt,j + Th 2 log K + 1 j [K] qt,j Xt,j + t=1 h (qt) (K + 1) 2 log K + 1 where we used concavity of h. In particular, by standard tail integration, t=1 Xt,Jt + t=1 Fj( e Nj) j [K] qt,j Xt,j + 2 log(K + 1) . (5) Hence, it suffices to construct qt such that j [K] qt,j Xt,j + with γT = (K + 1) . In fact, we prove the slightly stronger bound j [K] qt,j Xt,j + max q K T ( q, bµT + h (q)) γT . In order to construct such a forecaster qt, we discretize K. Let ϵ > 0 and let q(1), . . . , q(M) be a minimal ϵ-cover of K with respect to the ℓ1 distance. Using the fact that a minimal Lugosi, Pike-Burke, Savalle ϵ-cover is an ϵ/2 packing and a standard volumetric estimate, we get that M ϵ (K 1). Also, by the Lipschitz property of h, max i [M] T D q(i), bµT E + h q(i) max q K T ( q, bµT + h (q)) 2Tϵ . (6) We treat q(1), . . . , q(M) as experts and use the exponentially weighted average forecaster. In order to do this, we use the estimates b Xt,j = 1 1 Xt,j qt,j 1{Jt=j} , j [K] , where qt,j is the probability our forecaster plays arm j in round t. Observe that b Xt,j may be computed for all j since Xt,Jt is observed after selecting arm Jt at time t. Also note that EJt qt b Xt,j = Xt,j and 1 b Xt,j 0. To each expert q(i) we assign the loss estimate bℓt(q(i)) = X j [K] q(i) j (1 b Xt,j) h(q(i)) = 1 X j [K] q(i) j b Xt,j h(q(i)) . Now we are prepared to define the proposed forecaster. At each time instance t = 1, . . . , T, qt is chosen as a weighted average of the experts q(1), . . . , q(M), weighted by the distribution pt,1, . . . , pt,M so that qt = P i [M] pt,iq(i), where for t = 1, pt,i = 1/M for all i [M] and for t > 1, pt,i = wt 1,i wt 1,i = exp s=1 bℓs(q(i)) and Wt 1 = X i [M] wt 1,i . Here η > 0 is a tuning paramenter. We proceed by following analysis of the exponentially weighted average forecaster (see, e.g., Cesa-Bianchi and Lugosi (2006)). On the one hand, t=1 bℓt(q(i)) t=1 bℓt(q(i)) log M for all i [M]. On the other hand, i [M] pt,ie ηbℓt(q(i)) Bandit problems with fidelity rewards i [M] pt,ibℓt(q(i)) + η2 X i [M] pt,ibℓt(q(i))2 (using e x 1 x + x2 for x 1 and bℓt(q(i)) 1 by construction) i [M] pt,ibℓt(q(i)) + η2 T X i [M] pt,ibℓt(q(i))2 . Comparing the upper and lower bounds for log(WT /W0), we obtain that, for every i [M] i [M] pt,ibℓt(q(i)) t=1 bℓt(q(i)) + log M i [M] pt,ibℓt(q(i))2 . (7) Then, observe that EJt qtbℓt(q(i)) = 1 P j [K] q(i) j Xt,j h(q(i)), EJt qt P i [M] pt,ibℓt(q(i)) = 1 P j [K] qt,j Xt,j P i [M] pt,ih(q(i)) and for each t [T], i [M] pt,ibℓt(q(i))2 = X i [M] pt,i EJt qt j [K] q(i) j (1 b Xt,j) h(q(i)) i [M] pt,i EJt qt j [K] q(i) j (1 b Xt,j) i [M] pt,i EJt qt j [K] q(i) j 1 Xt,j qt,j 1{Jt=j} i [M] pt,i X q(i) j 1 Xt,j i [M] pt,i X q(i) j qt,j = 2(K + 1) (using qt = X i [M] pt,iq(i)) . So we can rearrange the expression in (7) to get j [K] qt,j Xt,j + i [M] pt,ih(q(i)) j [K] q(i) j Xt,j + Th(q(i)) η 2(K + 1)ηT max q K T ( q, bµT + h (q)) 2Tϵ (K 1) log 1 ϵ η 2(K + 1)ηT Lugosi, Pike-Burke, Savalle (by (6)) and since M ϵ (K 1)) = max q K T ( q, bµT + h (q)) 2Tϵ (K + 1) (choosing η = p log(1/ϵ)/(2T)) = max q K T ( q, bµT + h (q)) (K + 1) with the choice ϵ = (K + 1)/ 2T. The proof is finished by noting that, by concavity of h, i [M] pt,ih(q(i)) h i [M] pt,iq(i) and therefore j [K] qt,j Xt,j + t=1 h(qt) max q K T ( q, bµT + h (q)) (K +1) Combining this with (5) implies the announced regret bound. Remark 8 Since the set of experts needs to cover the simplex, the proposed algorithm may not be computationally efficient. A possible way to construct computationally efficient low-regret forecasters is to exploit the concavity of h and use ideas from bandit convex optimization (e.g., Bubeck et al. (2017)). However, it is unclear whether such an algorithm achieves the same dependence on K. This is left for future research. A related open question is to determine the optimal dependence on K. While the lower bound of Ω( KT) for the standard adversarial bandits problem (where the fidelity functions are all 0) suggests that the dependence on T in Theorem 7 cannot be improved, it is unclear what the optimal dependence on K is in the adversarial loyalty points bandit problem with decreasing fidelity rewards. 5. Loyalty points model: coupon rewards In this section we study the loyalty points model with coupons fidelity rewards. Recall that in this model, associated to each arm j [K], are a positive integer ρj (the period length) and a real number rj [0, 1] and the player receives the extra fidelity reward rj after each (not necessarily consecutive) ρj plays of arm j. 5.1 Minimal sufficient sets Once again, in order to understand regret, the first step is to determine the class of minimal sufficient sets. To this end, we need to understand the set of types N that are maximizers Bandit problems with fidelity rewards of the function σ(µ, N) for some µ [0, 1]K. This is not a simple task because of the nonlinear nature of the coupon rewards. Note that The presence of the integer part in the terms Nj/ρj means that the set of possible optimizers may be quite rich and complex. However, in some cases, the class of minimal sufficient sets is simple. For example, consider the simplest case when the period lengths ρj associated to all arms j [K] are the same, say equal to ρ. If T happens to be divisible by ρ, then it is easy to see that playing an arm j that maximizes µj + rj/ρ forever maximizes the reward. (In other words, Nj = T for a maximizing arm j and Ni = 0 for all other arms is an optimal type.) Hence, in this case, the unique minimal sufficient set of sequences is the set J0 of single-arm strategies. However, even if T is not an integer multiple of the period length ρ, if ρj = ρ for all j [K], then for some µ [0, 1]K the optimal type is such that an arm maximizing µj + rj/ρ is played ρ T/ρ times and a (possibly different) arm maximizing µj is played the remaining T ρ T/ρ times. Thus, in this case, the class of minimal sufficient sets consists of all sequences that are either single-arm strategies or such that only two arms are played, one of them ρ T/ρ times and the other T ρ T/ρ times. When the period lengths are different, the optimizing sequences may have a more complicated structure.2 Nevertheless, the above argument easily generalizes to possibly different period lengths as stated in the next lemma. Lemma 9 Let ρ denote the least common multiple of the period lengths ρ1, . . . , ρK. If T is an integer multiple of ρ, then the unique minimal sufficient set of sequences is the set J0 of single-arm strategies. More generally, every minimal sufficient set J Cf is such that every sequence in J has type N = (N1, . . . , NK) such that for some j [K], Nj ρ T/ρ . In other words, even though single-arm strategies are not necessarily optimal, every j S J Cf J is almost a single-arm strategy in the sense that it differs from one in at most T ρ T/ρ ρ positions. This fact will help us design strategies with small strong regret in the adversarial version of the problem. 5.2 Stochastic rewards The observation made in Lemma 9 suggests a simple way of achieving small regret in the stochastic loyalty-points bandit problem with coupon rewards. One may simply aim at accumulating a reward that is not much smaller than maxj [K] µj + rj . This may be achieved by applying a regret-minimization strategy using the augmented rewards b Yt,j = Xt,j + rj ρj . Suppose that one plays according to a standard K-armed bandit strategy based on the rewards b Yt,j, and that the expected cumulative regret of this strategy compared to 2. Note that the optimization problem is closely related to (though not completely equivalent to) a knapsack problem where each item j has size ρj and reward ρjµj + rj and the total capacity of the knapsack is T. Lugosi, Pike-Burke, Savalle T maxj [K] µj + rj is bounded by a quantity ϵT . This means that if b N1, . . . , b NK denote the (random) number of times each of the K arms is pulled up to time T, then T max j [K] This implies that the expected fidelity-augmented reward of the strategy satisfies µj b Nj + rj µj b Nj + rj T max j [K] max N σ(µ, N) ϵT K . In other words, the regret of the strategy is at most ϵT + K. For example, by using a standard UCB algorithm (Lattimore and Szepesv ari, 2020), one may take j = max i [K] are the gap parameters, see Lattimore and Szepesv ari (2020). Summarizing, we have the following: Theorem 10 Consider the stochastic loyalty-points bandit problem with coupon rewards. There exists a strategy of play whose expected regret satisfies Consequently, the worst case regret of this strategy is O( p KT log(T)). Note that these regret bounds are near optimal for the standard stochastic K-armed bandit problem, so we expect that they cannot be improved for the stochastic loyalty points bandits problem with coupons fidelity rewards. Bandit problems with fidelity rewards 5.3 Adversarial rewards Equipped with our understanding of the class of minimal sufficient sets and having derived a small-regret strategy for the stochastic version of the problem, it is now easy to design strategies that work well in the adversarial case. The main take-home message from the previous sections is that (a) single-arm strategies are almost optimal in the presence of coupon rewards; (b) playing a standard bandit strategy based on the augmented rewards Xt,j+rj/ρj is nearly optimal. We show here that the same principles work in the adversarial case as well, and lead to similar near optimal regret bounds. Theorem 11 Consider the adversarial loyalty-points bandit problem with coupon rewards. There exists a randomized policy π whose expected weak, mean and strong regret satisfy ER T (π) ER T (π) 4 p TK log K + K , and ER T (π) 4 p TK log K + ρ + K . Proof The argument for bounding the mean regret is similar to that of Theorem 10. Suppose that one plays a standard (randomized) adversarial bandit strategy based on the augmented rewards b Yt,j = Xt,j + rj ρj . Denote by Jt the arm selected by the strategy at time t and by b Nj = PT t=1 1{Jt=j} the number of times the strategy plays arm j up to time T. Recall also that bµT,j = 1 T PT t=1 Xj,t denotes the empirical average of the base rewards of arm j. Assume that the strategy comes with the regret guarantee t=1 b Yt,Jt max j K t=1 b Yt,Jt ϵT . For example, by using the exp3 algorithm of Auer et al. (1995), one may take ϵT = 4 TK log K (see, e.g., (Lattimore and Szepesv ari, 2020, Theorem 11.1)). Then the expected cumulative reward of this strategy equals t=1 Xt,Jt + t=1 Xt,Jt + j=1 rj b Nj t=1 b Yt,Jt K = T max j K = max N σ(bµT , N) K ϵT , Lugosi, Pike-Burke, Savalle proving the bound for the mean regret. To prove the upper bound for the strong regret, notice that by Lemma 9, max J Cf max j J St(j) max j J0 St(j)+ρ = max j [K] +ρ max j [K] But we have already proved above that t=1 Xt,Jt + concluding the proof. 6. Subscription model: increasing fidelity rewards In the rest of the paper we discuss what we call the subscription model. Recall from the introduction that in this model the fidelity rewards depend on the current number of consecutive plays of the selected arm as opposed to the loyalty points models where the fidelity rewards are a function of the total number of times the selected arm has been played in the past. Just like in the case of the loyalty points model, here we also consider three types of fidelity rewards: increasing, decreasing, and coupon rewards. We begin the discussion by considering increasing fidelity rewards. More precisely, we assume that the fidelity reward is a nondecreasing function of the number of consecutive plays of an arm prior to the current time step. Recall the definition Qt,j = 1{Jt=j}(t max{s t : Js = j, Js 1 = j}) as the number of consecutive plays of arm j up to time t and assume that the reward is of the form Yt,j = Xt,j + fj(Qt,j), where Xt,j is the base reward and fj is the (known) fidelity function associated with arm j that is assumed to be nondecreasing. We may assume, without loss of generality, that fj(0) = 0 for all j [K]. 6.1 Minimal sufficient sets It is clear that for the stochastic subscription bandit problem with increasing fidelity rewards, there is always an optimal policy that is a single-arm policy. Indeed, a single-arm policy that plays an arm j argmax1 j K{µj T + Fj(T)} for all T time steps maximizes total reward. (Recall that Fj(t) = Pt s=1 fj(t) denotes the cumulative fidelity reward of playing arm j during t consecutive steps.) Hence, the class of minimal and sufficient sets contains the single set J0 of single-arm strategies. 6.2 Stochastic rewards We know that in the stochastic bandit problem with increasing fidelity rewards in the subscription model, the reward of the optimal strategy is max1 j K{µj T +Fj(T)}, achieved by playing a single arm j for all T rounds. Since the optimal policy is the same as in the standard stochastic bandits problem, one may imagine that similar learning algorithms can be applied here. However, note that Bandit problems with fidelity rewards the usual regret-minimizing strategies (such as ucb) keep exploring all through their run. This involves switching to sub-optimal arms in order to make sure that unlikely events of good arms collecting small rewards get detected. Such strategies are not suitable for the subscription model under increasing fidelity rewards, since a single switch of arms resets the count Qt,j to zero, resulting in a significant loss of fidelity rewards. In particular, any strategy that switches an arm after time ϵT (for a constant ϵ (0, 1)) must suffer a regret of at least Fj (ϵT) which is of order Ω(T) except in trivial cases. However, an expected reward that is not much smaller than that of the optimal strategy may be achieved by any strategy that is able to identify an arm achieving the optimum early on. Indeed, strategies designed for best-arm identification may be used directly to achieve sublinear expected regret in the stochastic subscription bandits problem with decreasing fidelity reward. In best-arm identification problems, the goal of the learner is to learn the identity of the arm with the highest expected reward. Instead of reviewing the growing literature of best-arm identification, we show how such a simple strategy may be adopted, in a straightforward manner, to obtain strategies with sublinear regret. We refer the reader to the book by Lattimore and Szepesv ari (2020) for an exhaustive survey. The notion of simple regret proves to be useful in our approach. To define the simple regret, we first define the sub-optimality gap of arm j as e j = µj + Fj (T)/T (µj + Fj(T)/T) where j = argmax1 j K(µj + Fj(T)/T). (Note that in the case where the fidelity functions are the same across all arms, e j = µj µj = j.) The simple regret of the best-arm identification strategy outputting arm J is defined as Suppose that one plays a best-arm identification strategy during the first t0 < T rounds, besed on the fidelity-augmented rewards Xt,j + Fj(T)/T, at the end of which the algorithm identifies an arm J [K]. For the rest of the rounds t {t0 + 1, . . . , T}, the strategy plays arm J, independently of the rewards. During these T t0 steps, the strategy accumulates a total expected reward of E[(T t0)µJ + FJ(T t0)]. The expected regret of the above-defined strategy may be bounded in terms of the simple regret, since E[RT ] Tµj + Fj (T) E [(T t0)µJ + FJ(T t0)] Tµj + Fj (T) E [TµJ + FJ(T)] + 2t0 = 2TEe J + 2t0 . The simplest best-arm identification strategy samples all K arms t0/K times and outputs an arm that maximizes the empirical reward accumulated over this time. This simple strategy achieves an expected simple regret bounded by C p (K log K)/t0 for a universal constant C (see Lattimore and Szepesv ari (2020)). Substituting this into the upper bound above and choosing t0 T 2/3(K log K)1/3, we obtain the following. Theorem 12 Consider the stochastic bandit problem in the subscription model with increasing fidelity rewards. There exists a strategy of play whose expected regret satisfies E[RT ] CT 2/3(K log K)1/3 , where C is a numerical constant. Lugosi, Pike-Burke, Savalle Of course, the naive algorithm used here is not optimal and more sophisticated best-arm strategies may be used to improve the regret bound. In particular, one may get a bound that is logarithmic in T but depends on the sub-optimality gaps e j. However, we conjecture that even an improved strategy will not be able to achieve regret of order KT since the fidelity functions can be chosen such that any algorithm which explores sufficiently must pay in terms of the fidelity rewards, much like in the increasing loyalty points model in Section 3.2. Moreover, the use of such improved strategies involves the same key idea and we prefer to present the idea here in its simplest form. 6.3 Adversarial rewards Next we consider the adversarial version of the problem. As noted in Section 6.1, the only class of minimal and sufficient sets contains the single set J0 of single-arm strategies. This implies that the notions of weak and strong regrets coincide as the learner simply competes with the best single-arm strategy. It is easy to see that one cannot expect to achieve sublinear regret in general. In fact, for any sequence j = (j1, . . . , j T ) of actions, the corresponding cumulative reward ST (j) is at most the analogous cumulative reward in the loyalty-points model. To see this, note that every arm switch in j reduces the cumulative reward in the subscription model more than in the loyalty points model. Hence, minimizing regret with respect to the class of single-arm strategies is at least as hard in the subscription model as in the loyalty-points model. In particular, Theorem 4 immediately implies the following: Theorem 13 Consider the adversarial bandits problem in the subscription model with increasing fidelity rewards, where the fidelity function is the same for every arm. Define δ = f(7T/8) f(T/8). Then for every policy π there exists a sequence of rewards such that the regret satisfies The theorem shows that if the fidelity function grows substantially between (say) T/8 and 7T/8, one cannot expect sublinear regret. If this is not the case, sublinear regret may be possible, depending on the fidelity function. In the rest of the section we take a closer look at the special but important case when the fidelity reward follows a step function. More precisely, consider the case where the fidelity functions take the form fj(t) = 1 if t m 0 if t < m for all j [K] and for some positive integer m T. We refer to fj(t) as an m-step function. Theorem 13 shows that when m T/8 (or more generally, when m = Ω(T)), sublinear regret cannot be guaranteed. However, for small values of m, small regret is achievable. To see why, consider first the extremal case when m = 1. In this case, the only time when the player does not receive a fidelity reward is when the player switches actions. Hence, the problem is equivalent to a bandit problem with switching costs that has been studied in the adversarial setting by Dekel, Ding, Koren, and Peres (Dekel et al., 2014) and Cesa-Bianchi, Dekel, and Shamir (Cesa-Bianchi et al., 2013). In particular, Theorem 1 of Bandit problems with fidelity rewards Dekel et al. (2014), shows that for every randomized strategy in the bandits with switching costs problem, there exists a sequence of rewards such that the regret is Ω(T 2/3K1/3) for K T. This lower bound is, in fact, achievable. In fact, an easy batch modification of the EXP3 algorithm shows that regret of the order of O(T 2/3K1/3m1/3) is achievable when the fidelity reward is an m-step function. This is stated in the next theorem. Theorem 14 Consider the adversarial fidelity bandits problem in the subscription model with the m-step fidelity function. There exists a randomized strategy whose expected (weak and strong) regret satisfies E[RT ] = O((K log K)1/3T 2/3m1/3) . In particular, sublinear regret is achievable whenever m = o(T), hence characterizing the range of values of m for which sublinear regret is achievable. 7. Subscription model: decreasing fidelity rewards We continue the study of the subscription model in the case when the fidelity rewards fj are nonincreasing functions of Qt,j = 1{Jt=j}(t max{s t : Js = j, Js 1 = j}), the number of consecutive plays of arm j prior to the current time step. This case is substantially different from the case of increasing fidelity rewards studied in the previous section. As always, we start by determining the class of minimal and sufficient sets that allow us to design regret minimization strategies in the stochastic model and to define regret in the adversarial case. 7.1 Minimal sufficient sets In order to define the minimal sufficient sets for given nonincreasing fidelity functions f1, . . . , f K, we need to characterize the sequences j [K]T that can have a maximal expected reward in the stochastic bandit problem for some values of the expected rewards µ1, . . . , µK of the K arms. To this end, note that for fixed µ1, . . . , µK, it is optimal to play the arm j with the largest expected initial reward, that is, j = argmax1 j K{µj +fj(0)}, repeatedly until time m [1, T] when µj + fj (m + 1) < maxj =j {µj + fj(0)}, that is until its expected reward drops under the initial expected reward of the second best arm. An arm j (2) argmaxj =j µj is then played once to allow for the Q value of the optimal arm to reset and then arm j is played again for m steps. This is then repeated in every period of m + 1 steps. Each such sequence j is indexed by a triplet (i, k, m) [K] [K] [T 1], corresponding to periodically playing arm i m N times before arm k is played once and then repeating it. Denote this sequence by j(i, k, m) = (i, . . . , i | {z } m times , k, i, . . . , i | {z } m times Moreover, if the fidelity functions fj are strictly decreasing, then every such sequence is the only optimal sequence for some values of µ1, . . . , µK. Hence, in such cases, there is only one Lugosi, Pike-Burke, Savalle minimal sufficient set, containing all sequences of the form j(i, k, m), that is, J = {j(i, k, m) : (i, k, m) [K] [K] [T 1]} . Moreover, in all cases, the minimal sufficient set is unique and it is a subset of J defined above. 7.2 Stochastic rewards In this section we show that one may achieve sublinear regret in the stochastic version of the problem. As it is shown in the previous section, the goal of the learner is to achieve an expected reward that is not much smaller than the expected reward corresponding to playing the best sequence of the type j(i, k, m), that is, max (i,k,m) [K] [K] [T 1] WT (µi, µk, m) (9) WT (µi, µk, m) def. = T T m + 1 µi + T m + 1 (Fi(m) + Fk(1)) + Fi T (m + 1) T m + 1 is the expected reward of playing arm i m times, then arm j once and repeating this until horizon T in a subscription bandits problem where the expected base rewards are given by µ1, . . . , µK. We denote by (i , k , m ) a triple achieving the maximum in (9). The proposed strategy is similar to the one of Section 6.2 in that the first t0 rounds of the game are used purely for exploration. After time t0, the strategy commits to a triplet (i, k, m) and plays accordingly. More precisely, up to time t0, the strategy samples each arm t0/K times and computes the empirical average of the observed base reward of each arm. Denote these estimates by bµ1, . . . , bµK. By a simple use of Hoeffding s inequality, we have E max j [K] |bµj µj| Let ˆi and ˆk denote the arms corresponding to the two highest values of bµi + fi(0) such that bµˆi +fˆi(0) bµˆk +fˆk(0). (In case of ties, we may choose arbitrarily.) Moreover, let bm denote the corresponding period length, that is, bm = min m : bµˆi + fˆi(m + 1) < bµˆk + fˆk(0) . Equivalently, (ˆi, ˆk, ˆm) = argmax i,j,m WT (bµi, bµk, m) . (12) Starting from time t0 + 1, until time T, the strategy plays according to the sequence j(ˆi, ˆk, bm). Bandit problems with fidelity rewards Theorem 15 Consider the stochastic bandit problem in the subscription model with decreasing fidelity rewards. The expected regret of the explore-then-commit strategy defined above with t0 = (2T)2/3(K log K)1/3 is bounded by E[RT ] 3T 2/3(K log K)1/3 . Proof Conditionally on the rewards observed during the first t0 periods of play, the expected reward of the proposed strategy for the remaining T t0 rounds equals WT t0(µˆi, µˆk, bm) WT (µˆi, µˆk, bm) 2t0 (since all rewards are bounded by 2) WT (bµˆi, bµˆk, bm) 2t0 T max j [K] |bµj µj| WT (bµi , bµk , m ) 2t0 T max j [K] |bµj µj| (by (12)) WT (µi , µk , m ) 2t0 2T max j [K] |bµj µj| . Taking expected values and using (11), we get that E[RT ] WT (µi , µk , m ) EWT t0(µˆi, µˆk, bm) 2t0 + 2T Choosing t0 = T 2/3(K log K)1/3/2, we obtain the stated bound. We remark that, just like in the case of increasing fidelity rewards, one may also design strategies that yield regret bounds that have a logarithmic dependence on the time horizon T. For example, one may use successive elimination strategies in the exploration phase to identify the two arms i and k played by the optimal strategy. The price to pay is that the regret bound depends on the means µj of the arms. In particular, they depend on the gap between the expected reward of the best and second best arm, as well as on the gap between the expected reward of the second and the third best arms. In order to keep the discussion simple, we omit these regret bounds from this paper. We also note that there is currently no known lower bound for the stochastic subscription model with decreasing fidelity rewards, so determining the optimal dependence on T and K in this problem remains an open question. 7.3 Adversarial rewards In this section we show that in the adversarial bandit problem in the subscription model with decreasing fidelity rewards, it is possible to achieve sublinear regret. Recall from Section 7.1 that in this model, there is only one minimal sufficient set, and in all cases it is a subset of the class J containing all sequences of the type j(i, k, m) for (i, k, m) [K] [K] [T 1]. Hence, weak and strong regret of a policy π coincide and, recalling (3), are bounded as RT (π) max (i,k,m) [K] [K] [T 1] ST (j(i, k, m)) e ST (π) . Lugosi, Pike-Burke, Savalle In other words, our goal is to design a strategy whose expected cumulative reward is close to max(i,k,m) ST (j(i, k, m)). Our approach is to treat each triple (i, k, m) as an expert that plays according to the sequence j(i, k, m) and design a strategy that competes with the best expert. To this end, we use a variant of the exponentially weighted average forecaster that assigns a weight to each expert, based on an exponential function of its estimated cumulative reward, and selects a random expert with probability proportional to its weight. A difficulty in the subscription bandits problem is that the reward of the strategy depends not only on the reward of the chosen expert but also on the number of consecutive plays of the selected arm. Therefore the reward an expert would gain from playing an arm in a given round may differ from the actual reward the learner receives. In order to keep such deviations under control, we draw inspiration from the lazy label efficient exponentially weighted average forecaster proposed by Cesa-Bianchi, Lugosi, and Stoltz (Cesa-Bianchi et al., 2004). The forecaster, at each time instance, flips a biased coin that comes up heads with probability ϵ, where ϵ is a small, appropriately chosen value. When the coin flip results in heads, the forecaster selects a random arm, plays it, and updates the estimated cumulative reward of each expert. Then the algorithm selects a random expert according to the exponentially weighted average distribution over the experts, and plays according to the sequence corresponding to the selected expert until the next time when the coin flip comes up heads. This lazy play guarantees that the strategy switches between experts at most about ϵT times, making the possibly adverse effect of switching between experts negligible. More precisely, at every time t {2, . . . , T}, an independent Bernoulli random variable Zt with P{Zt = 1} = 1 P{Zt = 0} = ϵ is drawn. If Zt = 1, then the strategy simply explores by selecting an arm Jt chosen uniformly at random. (We define Z1 = 1 to ensure that the strategy explores in the first step.) These time instances are used to estimate the cumulative reward of each expert. In particular, for an expert (i, k, m) [K] [K] [T 1], let jt(i, k, m) [K] denote the arm played by the expert at time t, that is, the t-th element of the sequence j(i, k, m) as defined in (8). The cumulative reward of the expert, at time t, is St(j(i, k, m)) = Xs,js(i,k,m) + Φ(i, k, m) , where we define Φ(i, k, m) = 1 (Fi(m) + Fk(1)) + Fi T (m + 1) T m + 1 as the normalized total fidelity reward received by the expert. St(j(i, k, m)) is estimated by b St(i, k, m) = s=1 b Ys,js(i,k,m) where b Ys,js(i,k,m) def. = 2 K ϵ 1Zs=11Js=js(i,k,m)(2 Xs,js(i,k,m) Φ(i, k, m)) Bandit problems with fidelity rewards is the estimated reward of expert (i, k, m) at time s, where Js indicates the arm played by the learner in round s. Note that the factor K/ϵ guarantees that Eb Ys,(i,k,m) = Xs,js(i,k,m) + Φ(i, k, m) and therefore b St(i, k, m) is an unbiased estimator of the cumulative total reward of the expert. It is convenient to introduce the estimated losses bℓs,(i,k,m) def. = 2 b Ys,(i,k,m) = K ϵ 1Zs=11Js=js(i,k,m)(2 Xs,js(i,k,m) Φ(i, k, m)) . Note that bℓs,(i,k,m) is always nonnegative, a property that is crucial for the proof below. In the time instance immediately following an exploration step unless the Benoulli random variable again equals 1 , the learner selects an expert (It, Kt, Mt) [K] [K] [T 1] at random, based on the exponentially weighted average distribution. More precisely, Pt 1 {(It, Kt, Mt) = (i, k, m)} = pt,(i,k,m) def. = wt 1,(i,k,m) where Pt 1 denotes conditional probability given the past and wt,(i,k,m) = exp(η b St(j(i, k, m))) (i,k,m) [K] [K] [T 1] exp(η b St(j(i, k, m))) . Then the learner follows the expert (It, Kt, Mt) until the next time the Bernoulli variable takes value 1. More precisely, if for some t, Zt = 1 and t = min{s > t : Zs = 1}, then for all time instances s {t, . . . , t 1}, the learner plays the arms according to positions t, . . . , t 1 of the sequence j(It, Kt, Mt), as defined in (8). The algorithm is described in detail in Figure 2. The next theorem establishes sublinear regret for the performance of the proposed strategy. In particular, it shows that the expected regret grows at a rate at most e O(T 2/3). As in the stochastic setting, it remains an open problem to determine if this rate is optimal. Theorem 16 Consider the adversarial fidelity bandits problem in the subscription model with nonincreasing fidelity rewards. Then, choosing the parameters of the algorithm as η = log(K2(T 1))/(2TK) 2/3 and ϵ = K η, the expected (weak and strong) regret of the strategy define above satisfies ERT 3(2TK)2/3 log1/3 K2(T 1) . Proof Let J = (J1, . . . , JT ) [K]T denote the (random) sequence of arms played by the strategy. The strategy is defined such that, except for the times when Zt = 1, the strategy follows the expert (It, Kt, Mt), that is, Jt = jt(It, Kt, Mt). Define the pseudo reward of the strategy by t=1 (Xt,Jt + Φ(It, Kt, Mt)) . Lugosi, Pike-Burke, Savalle Input: T (horizon), ϵ (0, 1) (sampling probability), η > 0 (learning rate). Initialize: w0,(i,k,m) = 1. 1: Sample T independent Bernoulli(ϵ) random variables Z1, . . . , ZT . 2: Set Z1 = 1 3: for all t = 1, . . . , T do 4: if Zt = 1, then 5: Select a uniformly random arm Jt and observe Xt,Jt; 6: Compute wt,(i,k,m) = exp(η b St(j(i, k, m))) (i, k, m) [K]2 [T 1]; 8: if Zt = 0, then 9: if Zt 1 = 1 then 10: Draw a random expert (It, Kt, Mt) according to the distribution pt,(i,k,m) = wt 1,(i,k,m)/Wt 1 and play according to the t-th position of the sequence j(It, Kt, Mt); 12: Play according to the same expert as at time t 1. 15: end for Figure 2: Lazy strategy for adversarial bandits with decreasing fidelity rewards, subscription model The key observation is that, since the fidelity rewards are decreasing, the realized total reward of the strategy is at least t=1 (Xt,Jt + f Jt(Qt 1,Jt)) ST 2 t=1 1{Zt=1} , since PT t=1 1{Zt=1} is the total number of times the strategy switches experts and the rewards are bounded in [0, 2]. Hence, it suffices to show that the total expected pseudo reward EST is not much smaller than max (i,k,m) ST (j(i, k, m)) = max (i,k,m) Xt,jt(i,k,m) + Φ(i, k, m) . As it is customary in the analysis of exponentially weighted average forecasters (see Cesa Bianchi and Lugosi (2006)), we compare upper and lower bounds for the weight Wt. On the one hand, (i,k,m) exp(η b ST (j(i, k, m))) log(K2(T 1)) η max (i,k,m) b ST (j(i, k, m)) log(K2(T 1)) . Bandit problems with fidelity rewards On the other hand, log(WT /W0) = PT t=1 log(Wt/Wt 1) and for each t [T], we have (i,k,m) pt,(i,k,m) exp(η b Yt(i, k, m)) (i,k,m) pt,(i,k,m) exp( ηbℓt,(i,k,m)) (i,k,m) pt,(i,k,m) 1 ηbℓt,(i,k,m) + η2 2 bℓ2 t,(i,k,m) (using e x 1 x + x2/2 for x 0) (i,k,m) pt,(i,k,m)bℓt,(i,k,m) + 2η2K (i,k,m) pt,(i,k,m)bℓt,(i,k,m) (using log(1 + x) x for all x 1 and bℓt,(i,k,m) 2K/ϵ) (i,k,m) pt,(i,k,m) b Yt,(i,k,m) + 2η2K (i,k,m) pt,(i,k,m)bℓt,(i,k,m) (by definition of bℓs,(i,k,m) = 2 b Ys,(i,k,m)) Comparing the upper and lower bounds obtained for log(WT /W0), we have that for all (i , k , m ) [K] [K] [T 1], (i,k,m) pt,(i,k,m) b Yt,(i,k,m) b ST (j(i , k , m )) log(K2(T 1)) (i,k,m) pt,(i,k,m)bℓt,(i,k,m) . Taking expected values on both sides and noting that (i,k,m) pt,(i,k,m) b Yt,(i,k,m) = EST , EST max (i,k,m) ST (j(i, k, m)) log(K2(T 1)) and therefore ERT log(K2(T 1)) Lugosi, Pike-Burke, Savalle Optimizing the upper bound suggests the choices ϵ = K η and η = log(K2(T 1)) yielding the announced upper bound. 8. Subscription model: coupon rewards We finally consider the subscription model with so-called coupons fidelity rewards. Here the player earns an additional reward rj [0, 1] after each consecutive ρj N plays of an arm j. As we will see, this setting is similar to the loyalty points model with coupon rewards which was studied in Section 5. However, a key difference is that here, in order to get any additional fidelity reward, an arm needs to be played ρj times consecutively. 8.1 Minimal sufficient sets We begin by considering the class of minimal sufficient sets for the subscription model with coupon rewards. For this, we note that due to the need to be playing an arm continuously in order to receive any fidelity reward, any optimal sequence j [K]T will mainly consist of consecutive instances of any arm, where the length of these segments is determined by ρj, the period of arm j. Moreover, it can easily be seen that if ρj = ρ for all arms j and T is divisible by ρ, then the optimal strategy is to play the arm maximizing µj + rj/ρ for all rounds t = 1, . . . , T, and thus the minimal sufficient sets are the single arm strategies J0. In the more general case, the minimal sufficient sets may not simply be single arm strategies. However, by an argument similar to the one in the loyalty points setting, we can show that the total reward accumulated by the best single arm strategy is not too far from that of the best minimal sufficient set. In particular, Lemma 17 Let ρ denote the least common multiple of the period lengths ρ1, . . . , ρK. Then, max j J Cf J ST (j) max j J0 St(j ) 2ρ . Therefore, as in the coupons loyalty points model, we compare our algorithms to single arm strategies knowing that the maximal reward of such strategies does not differ too much from the true maximal reward. 8.2 Stochastic rewards From the above discussion, it is clear that to gain the additional fidelity reward, each time an arm j is played, it should be played for ρj rounds consecutively. Since the fidelity functions are known, so are the ρj. Therefore, we consider algorithms that run in batches, that is, they select an arm j using all the available data up to that point, then play that arm ρj times before updating the estimates and selecting another arm. While in principle, any standard algorithm for the multi-armed bandit problem could be used to select the arms at the beginning of each batch, for simplicity, here we consider a batch version of Bandit problems with fidelity rewards the UCB algorithm (Auer et al., 2002a). We aim to accumulate rewards comparable to T maxj [K](µj + rj/ρj) and so we use confidence bounds of the form. UCBt(j) = Xt,j + rj Nt,j , (13) where Xt,j is the average base reward from Nt,j plays of arm j up to time t. Note that each arm j is played ρj times when it is selected. This leads to the following regret bound, the complete proof of which is given in Appendix B.1. Theorem 18 The expected regret of the Batch-UCB algorithm up to horizon T in the stochastic coupons subscription model can be bounded by j =j ρj e j + (1 + 2 j =j e j + 2ρ where e j = maxi [K](µi + ri/ρi) (µj + rj/ρj). So the worst case regret bound is E[RT ] = O( p KT log(T) + PK j=1 ρj + ρ). The above regret bounds exhibit the expected optimal leading order terms. However, it is still an open question to determine whether the dependence on the period lengths ρj can be improved. 8.3 Adversarial rewards Since we know that the best single arm strategy is almost optimal in the subscription coupons model, we aim to develop an algorithm that is competitive with that. Unlike in the loyalty points model with coupon rewards where we could simply use a standard adversarial bandits algorithm with augmented rewards, the subscription structure of the fidelity rewards here means that we need to be a bit more careful to guarantee that the augmented rewards are actually similar to the accumulated rewards. Specifically, this means that we need to play in batches and play each arm j some multiple of ρj times whenever it is selected. For example, we may select arms using the EXP3 (Auer et al., 2002b) algorithm and then play them ρ times, where ρ is again the least common multiple of the period lengths ρ1, . . . , ρK. This gives the following regret bound, Theorem 19 Selecting arms according to the EXP3 algorithm with parameter λ = min{1, q (e 1)T } and playing each arm j ρ times when selected in the adversarial coupons subscription fidelity bandits model leads to regret of E[R T ] = e O( p ρKT log(K)) , where ρ is again the least common multiple of the period lengths ρ1, . . . , ρK. Proof The proof follows by considering an adversarial bandits problem over S = T ρ rounds. In each round s = 1, . . . , S, the algorithm selects an arm Js using the EXP3 algorithm and receives reward P t Ts Xt,j+rj 2ρ [0, 1] where Ts is such that |Ts| = ρ and represents Lugosi, Pike-Burke, Savalle the set of time points t = ρs + 1, . . . , ρ(s + 1) in the fidelity bandits problem where we are playing arm j ρ times consecutively. Let RS denote the regret in this modified problem compared to playing arm j = argmax1 j K PT t=1 Xt,j+Srj ρ , which we note is the same arm as the best single arm strategy in the original problem. By Lemma 17, RT 2ρRS + ρ, so it suffices to provide a bound on RS. This can be done by applying the analysis of EXP3 in Corollary 3.2 of (Auer et al., 2002b). In particular, for parameter λ = min{1, q E[RS] = O( q ρ ), thus giving the result. Remark 20 Note that when ρ1 = = ρK = ρ, then the regret is simply e O( ρKT). However, in other cases, depending on the periodicities, ρ can be significantly larger than any single ρj. By a simple modification of the argument, the regret bound can be improved to e O( maxj ρj KT) by considering a variant of EXP3 that plays each arm j exactly ρj times when it is selected, using the pseudo-rewards P t Ts Xt,j+rj 2 maxj ρj [0, 1] and S T minj ρj rounds. However, to avoid some technicalities arising from the fact that the number of rounds becomes random, we chose to present the simplest version. It is also an open question to determine the optimal dependence on the periods ρ1, . . . , ρK. Remark 21 The above result assumes that ρj is constant for all j [K]. Clearly if ρj are such that ρ = Ω(T), the bound in Theorem 19 becomes vacuous. In such a case, one may simply ignore the fidelity reward of any arm with ρj = Ω(T) as it is not received often enough to make a significant difference to the cumulative reward. In fact, for ρj = Ω(T 1/4) it is better to ignore the fidelity rewards. We omit the straightforward details. Remark 22 From Theorems 18 and 19, we observe that the cost of playing in batches is additive in the stochastic case, and multiplicative in the adversarial setting. This is to be expected since playing in batches can almost be viewed as a form of delayed feedback. It is known that the penalty for receiving delayed feedback is multiplicative in the adversarial setting (see, e.g., Joulani et al. (2013)), so our results are consistent with this. 9. Conclusion In this paper we have studied several instances of the fidelity bandits problem where the reward of each arm is augmented by a fidelity reward which measures how loyal the player has been to that arm in the past. Our focus has been on the settings with adversarially generated base rewards, although we also analyze the stochastic version of the problem for completeness. For this problem, the definition of the regret when the rewards are adversarial is nontrivial. By considering the stochastic analogue of the problem we suggest several natural regret definitions. In particular, we define the regret with respect to a class of policies which may be optimal for some configuration of the stochastic problem, a technique which we believe may be applicable in many settings beyond the fidelity bandits problem studied here. Our main interest in this paper was to determine when it is possible for these regrets to be sub-linear. We considered two forms of fidelity reward, namely the Bandit problems with fidelity rewards loyalty points model where the fidelity reward depends on the number of times of an arm was previously played, and the subscription model where the fidelity reward is a function of the number of consecutive plays of an arm up to that point. Our findings show that learning with adversarial rewards and increasing fidelity functions is hard, with Ω(T) bounds presented for even the weakest regret definition in both the loyalty points and subscription model. However, these results are worst case, so it remains a possibility that for specific fidelity functions sublinear regret could be achieved (see, e.g., the results for m-step functions in Section 6.3). For decreasing fidelity rewards, the picture is more positive. Although we provide Ω(T) lower bounds for the strongest notions of regret in the loyalty points model, we show that for weaker notions of regret, it is possible to achieve sub-linear regret in this case. This means that we are able to perform comparably to some near-optimal sequence of actions. For the subscription model, we show that sublinear regret is possible for even the strong regret when the fidelity rewards are decreasing. The reason for this distinction is that in the subscription model, due to the fact that the fidelity reward depends on the number of consecutive plays, there is only one minimal sufficient set, simplifying the class of comparator policies for the strong regret. Lastly, we note that in the coupons rewards setting, where a bonus reward is obtained every ρj (consecutive) plays of arm j, it is possible to get sub-linear strong regret in both the loyalty points and subscription models with adversarial base rewards. Interestingly, in this case, the effects of the fidelity rewards are different in the subscription and loyalty points models, with the regret scaling mutliplicatively with the periodicity of the coupon function in the subscription case, while it only increases additively in the loyalty points model. Although we have considered several forms of fidelity reward in this paper, there are still other realistic assumptions one could make on the fidelity reward. For example, it may be interesting to consider more general periodic fidelity functions, or other measures of fidelity such as the number of times the arm has been played in the last m plays. This is left for future work. We also point out that throughout the paper we have assumed that the entire fidelity functions are known to the forecaster, although typically the algorithms only require knowledge of the cumulative fidelity reward Fj(T). It remains to be seen if this can be reduced to requiring no knowledge of the fidelity reward. A related problem is whether one can remove the need to know the horizon T and develop anytime algorithms for fidelity bandits problems. A key challenge here is that the optimal sequence of arms typically depends on the horizon so standard techniques such as the doubling trick (see e.g. Besson and Kaufmann (2018)) may not be applicable. For example, in the subscription model with increasing fidelity rewards, since we lose any fidelity reward accrued when we switch arms, it is possible to construct fidelity functions such that even if we played the best arm for an entire phase we may not get enough fidelity reward, so the doubling trick could lead to linear regret. By assuming that the average fidelity reward is stationary over time or independent of T, one may be able to adapt our results to the anytime setting, but we leave this to future work. Lastly, we note that our aim in this paper was to understand when it is possible to achieve sub-linear expected regret in the fidelity bandits problem. As such, some constants or rates (particularly in the stochastic reward case) may not be tight. Therefore, tightening them and/or providing matching lower bounds remains an interesting open problem. We have also focused on providing bounds on the expected regret. We imagine that our analysis Lugosi, Pike-Burke, Savalle for upper bounds in the stochastic setting can be easily extended to obtain high probability bounds. On the other hand, obtaining high-probability upper bounds on the regret in the adversarial setting is less straightforward and we leave this topic for future research. Acknowledgments G abor Lugosi acknowledges the support of Ayudas Fundaci on BBVA a Proyectos de Investigaci on Cient ıfica 2021 and of the Spanish grant PID2022-138268NB-I00, financed by MCIN/AEI/10.13039/501100011033, FSE+MTM2015-67304-P, and FEDER, EU. This work was started while Ciara Pike-Burke was at Universitat Pompeu Fabra and Barcelona Graduate School of Economics. Peter Auer, Nicol o Cesa-Bianchi, Yoav Freund, and Robert E Schapire. Gambling in a rigged casino: The adversarial multi-armed bandit problem. In Proceedings of 36th Annual Symposium on Foundations of Computer Science, pages 322 331. IEEE, 1995. Peter Auer, Nicol o Cesa-Bianchi, and Paul Fischer. Finite-time analysis of the multiarmed bandit problem. Machine learning, 47(2-3):235 256, 2002a. Peter Auer, Nicol o Cesa-Bianchi, Yoav Freund, and Robert E Schapire. The nonstochastic multiarmed bandit problem. SIAM journal on computing, 32(1):48 77, 2002b. Soumya Basu, Rajat Sen, Sujay Sanghavi, and Sanjay Shakkottai. Blocking bandits. In Advances in Neural Information Processing Systems, pages 4785 4794, 2019. Omar Besbes, Yonatan Gur, and Assaf Zeevi. Stochastic multi-armed-bandit problem with non-stationary rewards. In Advances in neural information processing systems, pages 199 207, 2014. Lilian Besson and Emilie Kaufmann. What doubling tricks can and can t do for multi-armed bandits. ar Xiv preprint ar Xiv:1803.06971, 2018. Djallel Bouneffouf and Raphael F eraud. Multi-armed bandit problem with known trend. Neurocomputing, 205:16 21, 2016. S ebastien Bubeck and Nicol o Cesa-Bianchi. Regret analysis of stochastic and nonstochastic multi-armed bandit problems. Foundations and Trends in Machine Learning, 5(1):1 122, 2012. S ebastien Bubeck, Yin Tat Lee, and Ronen Eldan. Kernel-based methods for bandit convex optimization. In Proceedings of the 49th Annual ACM SIGACT Symposium on Theory of Computing, pages 72 85, 2017. Leonardo Cella and Nicol o Cesa-Bianchi. Stochastic bandits with delay-dependent payoffs. In International Conference on Artificial Intelligence and Statistics, pages 1168 1177. PMLR, 2020. Bandit problems with fidelity rewards Nicol o Cesa-Bianchi and G abor Lugosi. Prediction, Learning, and Games. Cambridge University Press, New York, 2006. Nicol o Cesa-Bianchi, G abor Lugosi, and Gilles Stoltz. Minimizing regret with label efficient prediction. IEEE Transactions on Information Theory, 51:2152 2162, 2004. Nicol o Cesa-Bianchi, Ofer Dekel, and Ohad Shamir. Online learning with switching costs and other adaptive adversaries. In Advances in Neural Information Processing Systems, pages 1160 1168, 2013. Corinna Cortes, Giulia De Salvo, Vitaly Kuznetsov, Mehryar Mohri, and Scott Yang. Discrepancy-based algorithms for non-stationary rested bandits. ar Xiv preprint ar Xiv:1710.10657, 2017. Ofer Dekel, Jian Ding, Tomer Koren, and Yuval Peres. Bandits with switching costs: T 2/3 regret. In Proceedings of the forty-sixth annual ACM symposium on Theory of computing, pages 459 467. ACM, 2014. Eyal Even-Dar, Sham M Kakade, and Yishay Mansour. Online markov decision processes. Mathematics of Operations Research, 34(3):726 736, 2009. Dean P. Foster and Rakesh Vohra. Regret in the on-line decision problem. Games and Economic Behavior, 29:7 36, 1999. Aur elien Garivier and Eric Moulines. On upper-confidence bound policies for switching bandit problems. In International Conference on Algorithmic Learning Theory, pages 174 188. Springer, 2011. Sergiu Hart and Andreu Mas-Colell. A simple adaptive procedure leading to correlated equilibrium. Econometrica, 68:1127 1150, 2000. Sergiu Hart and Andreu Mas-Colell. A general class of adaptive strategies. Journal of Economic Theory, 98:26 54, 2001. Nicole Immorlica and Robert Kleinberg. Recharging bandits. In 2018 IEEE 59th Annual Symposium on Foundations of Computer Science (FOCS), pages 309 319, 2018. Pooria Joulani, Andras Gy orgy, and Csaba Szepesv ari. Online learning under delayed feedback. In International Conference on Machine Learning, pages 1453 1461, 2013. Tze Leung Lai and Herbert Robbins. Asymptotically efficient adaptive allocation rules. Advances in applied mathematics, 6(1):4 22, 1985. Tor Lattimore and Csaba Szepesv ari. Bandit algorithms. Cambridge University Press, 2020. Nir Levine, Koby Crammer, and Shie Mannor. Rotting bandits. In Advances in Neural Information Processing Systems, pages 3077 3086, 2017. Alberto Maria Metelli, Francesco Trovo, Matteo Pirola, and Marcello Restelli. Stochastic rising bandits. In International Conference on Machine Learning, pages 15421 15457. PMLR, 2022. Lugosi, Pike-Burke, Savalle Ciara Pike-Burke and Steffen Grunewalder. Recovering bandits. In Advances in Neural Information Processing Systems, pages 14122 14131, 2019. Julien Seznec, Andrea Locatelli, Alexandra Carpentier, Alessandro Lazaric, and Michal Valko. Rotting bandits are no harder than stochastic ones. In International Conference on Artificial Intelligence and Statistics, pages 2564 2572, 2019. Julien Seznec, Pierre Menard, Alessandro Lazaric, and Michal Valko. A single algorithm for both restless and rested rotting bandits. In International Conference on Artificial Intelligence and Statistics, pages 3784 3794. PMLR, 2020. Aleksandrs Slivkins and Eli Upfal. Adapting to a changing environment: the brownian restless bandits. In Conference on Learning Theory, pages 343 354, 2008. Cem Tekin and Mingyan Liu. Online algorithms for the multi-armed bandit problem with markovian rewards. In 2010 48th Annual Allerton Conference on Communication, Control, and Computing (Allerton), pages 1675 1682. IEEE, 2010. Cem Tekin and Mingyan Liu. Online learning of rested and restless bandits. IEEE Transactions on Information Theory, 58(8):5588 5611, 2012. Peter Whittle. Restless bandits: Activity allocation in a changing world. Journal of applied probability, pages 287 298, 1988. Alexander Zimin and Gergely Neu. Online learning in episodic markovian decision processes by relative entropy policy search. In Advances in Neural Information Processing Systems 26, pages 1583 1591, 2013. Appendix A. Proofs for the loyalty points model with increasing rewards A.1 Proof of Lemma 2 (Optimality of single arm strategies for increasing loyalty points bandits) Proof Suppose that a sequence j plays arm j Nj times up to time T (and therefore PK j=1 Nj = T). Then the total (pseudo) reward of such a sequence is (since the fj are increasing functions) T max 1 j K T Fj(T) = e ST (j , j , . . . , j ) . Hence there is a single-arm strategy whose total reward is at least as large as that of j. Bandit problems with fidelity rewards A.2 Proof of Theorem 3 (Regret of UCB in stochastic loyalty points bandits with increasing fidelity rewards) Proof This proof is similar to the standard proofs of the regret bounds of UCB (see Auer et al. (2002a); Lattimore and Szepesv ari (2020)) with some adjustments to deal with the fidelity functions. Let B = K j=1 T t=1 {µj + Fj(T)/T UCBt(j)} be the event that the upper confidence bounds on each arm hold at all time steps. We show that on event B, the number of plays of sub-optimal arms is small. First, we write t=1 (µj µJt + fj (t) f Jt(Nt 1,Jt))) j=1 NT,j(µj µj) + j=1 NT,j(µj µj) + n=NT,j fj (n) X j=1 NT,j(µj µj) + (T (T X j:j =j NT,j))fj (T) X j:j =j NT,jfj(0) (since the fidelity functions are increasing) j:j =j NT,j(µj µj + fj (T) fj(0)) , since NT,j = T P j =j NT,j. Hence, it suffices to bound the number of plays of each sub-optimal arm. Let mj = 8 log(KT) and observe that on the event B, if Nt,j > mj arm j will not be played again. Indeed, in this case, for any number of plays m of the optimal arm, UCBt(j) = Xt,j + 1 T Fj(T) + 2 T Fj(T) + e j = µj + 1 T Fj (T) UCBt(j ) and therefore arm j is not played again. Then, using the fact that the confidence bounds hold with probability 1/(KT)2 (by Hoeffding s inequality), the expected regret may be bounded by ERT = E[RT 1B] + E[RT 1BC] j:j =j E[NT,j1B](µj µj + fj (T) fj(0)) + TP(BC) & 8 log(KT) (µj µj + fj (T) fj(0)) + T j=1 P(µj + Fj(T)/T > UCBt(j)) Lugosi, Pike-Burke, Savalle (µj µj + fj (T) fj(0)) + 1 proving the result. To obtain the worst case regret bound, we use standard techniques (see, e.g. Lattimore and Szepesv ari (2020)) and set e j q A.3 Proof of Theorem 4 (Lower bound on the regret for the adversarial loyalty points bandit problem with increasing fidelity reward) Proof It suffices to consider the two-armed setting (i.e., K = 2). The argument may easily be extended to multiple arms, for example, by replicating the rewards of the two arms. We may also assume that T is even and the fidelity functions of each arm are the same. We write F(t) = Pt s=1 f(s) for the cumulative fidelity reward. In order to prove the theorem, we consider just two possible sequences of rewards. In both cases arm 1 has reward X1,t = δ/5 for all t [T]. Also, in both cases arm 2 has reward X2,t = 0 for all t [T/2]. The reward X2,t of arm 2 for all t > T/2 equals 0 in case (I), while it equals δ in case (II). We know that in this problem single arm strategies are optimal. In case (I), the optimal arm is arm 1 with total reward Tδ/5 + F(T), while in case (II), the optimal arm is arm 2 with total reward Tδ/2 + F(T). Consider any policy and denote by τ1 the number of times the policy chooses arm 1 up to time T/2. Similarly, τ2 denotes the number of times the policy chooses arm 1 between times T/2 + 1 and T. Note that τ1 is the same in cases (I) and (II) as up to time T/2 the two cases are identical. If τ1 T/4, then in case (I) the policy suffers regret at least Tδ/20, so in the rest of the proof we may assume that τ1 > T/4 and we consider case (II). In that case the total reward of the policy equals 2 τ2δ + F(τ1 + τ2) + F(T τ1 τ2) . Thus, since arm 2 is optimal, the regret becomes 5 + F(T) (F(τ1 + τ2) + F(T τ1 τ2)) . If τ2 > τ1 T/8, then the regret is at least 5 (since the fidelity reward is nondecreasing) 20 (since τ1 > T/4). Bandit problems with fidelity rewards Finally, if τ2 τ1 T/8, then and therefore, by the nondecreasing property of the fidelity rewards, for any t [T], F(t) + F(T t) is nonincreasing in t T/2 and nondecreasing in t T/2. Hence, F(T) (F(τ1 + τ2) + F(T τ1 τ2)) F(T) F(7T/8) F(T/8) t=1 (f(T t) f(t)) 8 (f(7T/8) f(T/8)) = Tδ Hence, the regret satisfies 5 + F(T) (F(τ1 + τ2) + F(T τ1 τ2)) Tδ Appendix B. Results for Stochastic Coupons Subscription Model B.1 Proof of Theorem 18 (Regret bound for the Batch-UCB algorithm in the stochastic coupons subscription model) We first bound the number of plays of any sub-optimal arm. Lemma 23 For any sub-optimal arm j, 0 E[Nj(T)] 16 log(TK) e 2 j + ρj + 1 + 2 Proof Let B be the event that the confidence bounds hold for all arms at all time steps, that is, B = K j=1 T t=1 {µj UCBt(j)}. Then define n j = 16 log(TK) e 2 j = m jρj + lj for some integer m j N and remainder 0 lj ρj. We now show that on event B, if we have played arm j en j = (m j + 1)ρj n j times, then it is not played again. Indeed, if t is such that Nj,t en j n j, then, UCBt(j) = Y j,t + 2 NT,j eµj + 4 n j < eµj + e j = eµj UCBt(j ). Here we have used the definition of the confidence bounds and the notation eµj = µj + rj ρj and j = argmax1 j K µj + rj ρj . Hence, UCBt(j) < UCBt(j ) and so we would play arm j and not play arm j again. Consequently arm j is not played more than n j times. Using this, we bound the expected number of times we play arm j as, E[NT,j] = E[NT,j1{B}] + E[NT,j1{BC}] n j + TP(BC) Lugosi, Pike-Burke, Savalle (m j + 1)ρj + T t=1 P(µj > UCBt(j)) 16 log(TK) where we have used the fact that the confidence intervals hold with probability at least 1 2 T 2K2 . Using the trivial bound x x + 1 and noting that Nj(T) 0 gives the result of the lemma. Using this, we now prove the regret bound in Theorem 18. Proof We play each arm j in batches of ρj plays. Denote the number of such batches of arm j as BT,j and note that NT,j = ρj BT,j. Using Lemma 17, we write the regret as, E[RT ] = T(µj + rj j=1 E[BT,j](ρjµj + rj) = T(µj + rj j=1 E[NT,j](µj + rj ρj ) + 2ρ = X j =j E[NT,j]e j + 2ρ where e j = max1 j K{µj + rj ρj } (µj + rj ρj ). Substituting the result from Lemma 23 into the above result gives, j =j ρj e j + (1 + 2 j =j e j + 2ρ.