# the_survival_bandit_problem__969ad246.pdf

Published in Transactions on Machine Learning Research (08/2024)

The Survival Bandit Problem

Charles Riou charles@ms.k.u-tokyo.ac.jp The University of Tokyo & RIKEN Center for AIP Tokyo, Japan

Junya Honda honda@i.kyoto-u.ac.jp Kyoto University & RIKEN Center for AIP Kyoto, Japan

Masashi Sugiyama sugi@k.u-tokyo.ac.jp RIKEN Center for AIP & The University of Tokyo Tokyo, Japan

Reviewed on Open Review: https: // openreview. net/ forum? id= 1q Zy JQx Oof

In this paper, we introduce and study a new variant of the multi-armed bandit problem (MAB), called the survival bandit problem (S-MAB). While in both problems, the objective is to maximize the so-called cumulative reward, in this new variant, the procedure is interrupted if the cumulative reward falls below a preset threshold. This simple yet unexplored extension of the MAB follows from many practical applications. For example, when testing two medicines against each other on voluntary patients, people s lives and health are at stake, and it is necessary to be able to interrupt experiments if serious side effects occur or if the disease syndromes are not dissipated by the treatment. From a theoretical perspective, the S-MAB is the first variant of the MAB where the procedure may or may not be interrupted. We start by formalizing the S-MAB and we define its objective as the minimization of the so-called survival regret, which naturally generalizes the regret of the MAB. Then, we show that the objective of the S-MAB is considerably more difficult than the MAB, in the sense that contrary to the MAB, no policy can achieve a reasonably small (i.e., sublinear) survival regret. Instead, we minimize the survival regret in the sense of Pareto, i.e., we seek a policy whose cumulative reward cannot be improved for some problem instance without being sacrificed for another one. For that purpose, we identify two key components in the survival regret: the regret given no ruin (which corresponds to the regret in the MAB), and the probability that the procedure is interrupted, called the probability of ruin. We derive a lower bound on the probability of ruin, as well as policies whose probability of ruin matches the lower bound. Finally, based on a doubling trick on those policies, we derive a policy which minimizes the survival regret in the sense of Pareto, providing an answer to the open problem by Perotto et al. (2019).

1 Introduction

Many real life scenarios involve decision-making with partial information feedback, and it thus comes as no surprise that it has been a major field of study over the recent years. The historical motivating example for decision-making under partial information feedback pertains to medicine testing (see, e.g., Villar et al., 2015), and is described as follows. Consider two medicines A and B designed to cure a specific disease. As pointed out by the US Food and Drug Administration1, before being made available for use by the public, those medicines undergo a very strict procedure composed of four pre-market safety monitoring stages, including

1https://www.fda.gov/patients/learn-about-drug-and-device-approvals/drug-development-process

Published in Transactions on Machine Learning Research (08/2024)

a clinical research stage, where the medicines are being tested on people. In this stage, a large number T of patients suffering from the disease are administered either of those medicines sequentially, and the objective is to cure as many patients as possible. During this process, it is crucial to balance two factors in apparent opposition: administering both medicines a sufficient number of times so as to gather information on the efficacy of both medicines, while at the same time, administering in priority whichever of the two medicines seems more effective in order to cure as many patients as possible. The former factor is called exploration, the latter is called exploitation, and the right balance between the two is known as the exploration-exploitation dilemma in the literature. To describe the above, the multi-armed bandit problem (MAB) has arisen in the literature as the most popular model, because of its simplicity and its rich theoretical interest.

In the basic setting of the MAB (see, e.g., Bubeck & Cesa-Bianchi, 2012 or Lattimore & Szepesvari, 2020 for an introduction), there are K unknown distributions F1, . . . , FK bounded in [ 1, 1], called arms, and a horizon T. In our medicine testing example, each arm k [K] {1, . . . , K} corresponds to a medicine and the distribution Fk corresponds to its (randomized) effect on a patient. While the horizon T corresponds to the total number of patients and each round t T corresponds to a patient in our example, those rounds are usually interpreted as discrete time steps, or rounds, in the MAB. At each round t [T], an agent selects an arm πt [K] and observes a reward, denoted by Xπt t and drawn from the distribution Fπt. In our previous example, the reward Xπt t corresponds to the effect of medicine πt on patient t. It can obviously be positive (when the medicine πt cures the patient), but it can also be negative when the patient is not cured by the medicine πt and/or some side effects negatively impact the risk-benefit balance. The objective of the

problem is to maximize the expected cumulative reward, defined as E h PT t=1 Xπt t i , where the expectation

is taken w.r.t. the arm distributions and the (potential) randomness in the agent s policy (πt)t 1. This is equivalent to minimizing the expected cumulative regret compared to an agent who selects the arm k [K]

with the highest expectation at every round t T, defined as maxk [K] E h PT t=1 Xk t i E h PT t=1 Xπt t i .

The MAB has been extensively studied over the past decades, and both a lower bound on the regret (see Lai & Robbins, 1985 and Burnetas & Katehakis, 1996) as well as policies matching this lower bound (see, e.g., Cappé et al., 2013, Korda et al., 2013 or more recently Riou & Honda, 2020) have been derived. The MAB is not only a theoretically rich topic, it also has a broad range of applications, from the aforementioned medicine testing (see, e.g., Villar et al., 2015 or Aziz et al., 2021) to advertising and recommender systems (see, e.g., Chapelle & Li, 2011), to name a few.

The most critical aspect of those applications is that you may want to interrupt the process when the cumulative reward PT t=1 Xπt t becomes too low. In the medicine testing example, it is necessary to be able to stop the trials early in order to reduce the number of patients exposed to the unnecessary risk of an ineffective investigational treatment and allow subjects the opportunity to explore more promising therapeutic alternatives , as pointed out in the US Food and Drugs Administration Act (2019, p. 4). In less specialized terms, a bad treatment not only exposes patients to health-threatening side effects, it also prevents them from receiving an efficient treatment to cure them. This poses an additional health threat, because early treatment improves outcomes in many cases, like rheumatoid arthritis, appendicitis, and bacterial pneumonia (see, e.g., Shmerling, 2021). Formally, we set a threshold B R + and decide to interrupt the procedure whenever the cumulative reward becomes lower than or equal to B, i.e., we interrupt the procedure at the first round τ such that Pτ t=1 Xπt t B. In practice, B is chosen prior to the experiment by, e.g., an ethics committee. We use the terminology survival bandits problem (S-MAB for short) to designate the MAB with this additional constraint.

While medicine testing was the original practical motivation for this work, portfolio selection in finance is another real-life application of S-MAB, where recent works have demonstrated the strong performance of classic MAB strategies (see, e.g., Hoffman et al., 2011, Shen et al., 2015 or Shen & Wang, 2016), and in particular of risk-aware MAB strategies (see, e.g., Huo & Fu, 2017). In the setting we consider, an investor has an initial budget B > 0 to invest sequentially on one of K securities (the arms). At every round t T, the investor selects a security πt, receives a payoff Xπt t (which can be reinvested), and updates its budget by Xπt t . In this setting, a straightforward constraint is imposed on the unsuccessful investor who has to stop investing when it is ruined and has no more money to invest, i.e., B + Pt s=1 Xπs s 0. Following this setting

Published in Transactions on Machine Learning Research (08/2024)

terminology, we will call the value B budget and the first round τ such that Pτ t=1 Xπt t B if it exists time of ruin .

Please note that both examples above embrace the regime B T, since in medicine testing, we want the number of patients who suffer to be much smaller than the total number of voluntary patients, and in portfolio selection, the agent wants to make much more money than its initial budget B. In both cases, while the SMAB procedure may stop after O(B) rounds, it is desirable to achieve a cumulative reward of the order Θ(T) to fully benefit from the procedure. Consequently, in the S-MAB, the choice of arms πt during the earlier rounds t is absolutely essential, because a bad choice of arms may break the constraint and stop the whole procedure. This is in stark contrast with the standard MAB, where the earlier rounds are used as exploration rounds in order to gather information on the arm distributions, and then to perform efficient exploitation in the later rounds. Precisely, this is the main technicality induced by the new constraint: any successful policy must exploit seemingly good arms from early rounds in order to avoid breaking the constraint and continue the procedure as long as possible. In that sense, rather than the exploration-exploitation dilemma of the MAB, the S-MAB illustrates an exploitation-exploration-exploitation dilemma which, to our knowledge, has never been explored in the literature. It is actually an open problem from the Conference on Learning Theory 2019 to define the problem, establish a (tight) bound on the best achievable performance, and provide policies which achieve that bound (see Perotto et al., 2019). To our knowledge, this paper is the first one to provide answers to this open problem.

2 Review of the Literature

At the time of submission of this paper, there are only two works which, to our knowledge, pertain to the S-MAB, and they both focus on the case of rewards in { 1, 1}. The first one is Perotto et al. (2022). That work introduced a modification of UCB (Auer et al. (2002)) which, empirically, seems to have a low risk of ruin. The second one is Manome et al. (2023). That work derived a generalization of UCB (Auer et al., 2002) and studied its experimental performance in the S-MAB setting, in the case of Bernoulli rewards in { 1, 1}. Both Perotto et al. (2022) and Manome et al. (2023) are purely experimental studies of the S-MAB. To the best of our knowledge, this work is the first one to bring a theoretical framework to the S-MAB.

Nevertheless, the S-MAB involves a budget B, as well as a conservative exploitation at the early rounds, which can also be interpreted as some risk aversion from the agent s viewpoint. It is therefore reasonable to explore the MAB literature related to those topics to see if some existing results can be applied to our setting or not. This may also give some ideas to the interested reader who may want to explore paths beyond the scope of this paper. The list of works cited in this section is by no means an exhaustive list of the existing literature of the aforementioned topics.

The first line of work we introduce here is the budgeted bandits. In this variant of the MAB, the agent is initially given a budget B and at each pull of an arm πt, the agent receives a reward Xπt t [0, 1] and incurs a positive cost cπt t > 0, such that the cumulative reward at round t is Pt s=1 Xπt t and the remaining budget is B Pt s=1 cπt t . The procedure stops when the agent s budget becomes negative or zero. Please note that in this variant of the MAB, there is no more horizon T and instead, the objective of the agent is to maximize its expected cumulative reward as a function of the initial budget B. Tran-Thanh et al. (2010) first introduced the problem where the costs are deterministic, called the budget-limited bandits, and they introduced a policy of the type ETC (Explore Then Commit, see Garivier et al., 2016). Later, Tran-Thanh et al. (2012) provided a lower bound on the regret as Ω(log B), as well as knapsack-based algorithms matching this bound. Ding et al. (2013) generalized that setting to the case of variable costs, called the MAB-BV, and introduced two algorithms which achieve the regret lower bound Θ(log B) in the case of multinomial rewards in 0, 1

m, . . . , 1 . Those results were generalized to the case of continuous costs in [0, 1] by Xia et al. (2015a). To solve that problem, an algorithm based on Thompson Sampling (see, e.g., Agrawal & Goyal, 2012) was proposed by Xia et al. (2015b), and algorithms based on UCB (see, e.g., Auer et al., 2002) and ϵ-greedy were proposed by Xia et al. (2017). Cayci et al. (2020) further generalized Xia et al. (2017) to non-negatively correlated costs and rewards, whose (2 + γ)-moment is finite for some γ > 0, but such that the expectation of the cost is positive (this ensures that the procedure stops). To wrap up the review of the budgeted bandits literature, we note that several works have considered the generalization of the MAB-BV

Published in Transactions on Machine Learning Research (08/2024)

to the multiple-play setting, where at each round, the agent pulls L 2 of the K arms, including Xia et al. (2016), Zhou & Tomlin (2018) and Rangi et al. (2019).

While both the budgeted bandits and the S-MAB have a budget constraint, in the former, the costs are positive (or have a positive expectation) and the procedure stops when the budget is totally consumed. This is in stark contrast to the S-MAB, whose procedure stops either when the budget is totally consumed, or at the horizon T. We can still tackle that issue by increasing the dimension of the budget constraint to 2. Precisely, if pulling arm πt at round t T induces reward Xπt t and decreases the budget by cost cπt t , we can define the initial bidimensional budget as B (B, B) R2 such that at each round t 1, pulling arm πt induces reward Xπt t and decreases the bidimensional budget by cost cπt t cπt t , B

T R2. Following that formulation, the procedure stops when one of the two budget components becomes negative or zero. This formulation is actually a particular instance of the bandits with knapsacks (Bw K). Formally, the agent is given a multidimensional budget B = (B, . . . , B) Rd whose components are called resources, and at each round t, it incurs a reward Xπt t [0, 1] and a cost cπt t [0, 1]d which decreases all the resources. The objective of the agent is to maximize its cumulative reward before one of the resources run out.

The Bw K is the second line of work we present here. It was first introduced in the setting of stochastic costs and rewards by Badanidiyuru et al. (2013). Based on applications in the field of advertising, Combes et al. (2015) studied a particular instance of the Bw K where B = Ω(T), d = K, and pulling arm k only affects resource k. Later, Sankararaman & Slivkins (2021) derived a problem-dependent regret bound to the knapsack problem, under the assumptions that there are only two resources including time, and that the best distribution over arms is a single arm (best-arm-optimality assumption). Several works in the literature of the Bw K have extended the basic stochastic setting to the combinatorial bandits (see Cesa-Bianchi & Lugosi, 2012), including Sankararaman & Slivkins (2018) and Liu et al. (2022a). A large part of the literature on the Bw K is related to the contextual bandit setting (see Wang et al., 2005). We mention here the works of Wu et al. (2015) which studied the case of fixed and deterministic costs, Li & Stoltz (2022) which studied a conversion model and its applications to sales discounting, or more recently Han et al. (2023), which provided a black-box algorithm of two online regression algorithms (see Cesa-Bianchi & Orabona, 2021) in the case of a large budget B = Ω(T). In the particular case where the rewards and costs depend linearly on the contexts, called linear Bw K, Agrawal & Devanur (2016) derived an algorithm based on OFUL (see Abbasi Yadkori et al., 2011) and OMD (Online Mirror Descent, see Cesa-Bianchi & Orabona, 2021). Sivakumar et al. (2022) further studied the linear Bw K both in the stochastic and the adversarial settings, in the context of possible sparsity. The Bw K in the setting of adversarial costs and rewards (see Auer et al., 1995) was first introduced in the seminal work of Immorlica et al. (2019). Some of their results were later improved by Kesselheim & Singla (2020), which studied an ℓp-relaxation of the adversarial Bw K. Finally, Liu et al. (2022b) studied the Bw K in a non-stationary environment, where the rewards and costs are generated i.i.d. from a distribution Pt evolving over time, and which can be considered as a middle ground between the stochastic and adversarial settings.

The Bw K is a very rich topic as introduced above, and thus an extremely large pan of the literature is dedicated to that topic. Yet, the costs are always assumed to be positive or to have a positive expectation. In that sense, any of the d+1 constraint is enough to ensure the termination of the procedure. This hypothesis is of course not necessary, because the procedure stops automatically at the horizon T. Another fundamental difference between the settings considered there and the S-MAB is that the budget B is assumed to scale with T, and this ensures that the cumulative reward also scales with T. This is in stark contrast to this work, where we consider any value of the budget B.

We should still note that, among the many references on the Bw K, Kumar & Kleinberg (2022) addressed the case where the costs may be non-positive. In their paper, they derive algorithm Explore Then Control Budget which achieves a O(log T) regret bound, yet there is a caveat. They assumed the existence of a zero arm (equivalent of the positive arm defined in Definition 6 in this paper) which has zero reward and simply increases the budget of each of the resources. This arm removes the risk of exhausting a resource, and in practice, cannot be applied to most of the scenarios motivating this paper and mentioned in the introduction.

Another line of work which is related to the S-MAB is the conservative bandits (Wu et al., 2016). In that setting, the objective is to find a policy which minimizes the regret under the constraint that at any round t,

Published in Transactions on Machine Learning Research (08/2024)

the expected cumulative reward should be larger than (1α)R(b) t , where R(b) t denotes the expected cumulative reward of a given baseline and α [0, 1] is a parameter. That setting has also been extended to the linear and contextual bandits (Kazerouni et al., 2017, Garcelon et al., 2020). However, this line of work is inapplicable to our problem setting for two reasons. The first one is that the constraint is on the expectation of the rewards and not on the rewards themselves, and therefore, it allows a policy to pull an arm with a seemingly high expected reward even if it has high variance. The second reason is that the constraint in the conservative bandits is relative to a given baseline, and therefore, the existence of a policy satisfying the constraint is trivial: the baseline satisfies it. In contrast, in the S-MAB, no policy can guarantee that there will be no ruin.

The S-MAB is also partly related to the thresholding bandits, introduced in Locatelli et al. (2016), which formalize the problem to identify the arm distributions whose expectations are above a certain threshold θ. While the original objective was to maximize the probability that all such arms are identified, Tao et al. (2019) considered the objective to minimize the sum on all the arms k, of the probabilities that arm k is correctly identified. The thresholding bandits is a pure exploration problem, where the objective is not to maximize the expected sum of the rewards but to identify a certain subset of the arms. In contrast, the S-MAB aims at maximizing the expected sum of the rewards under a constraint.

Finally, we conclude this literature review section by mentioning the large body of work related to riskaverse bandits, which do not seek to pull the arm with the largest expectation, but instead to pull the arm maximizing some risk-averse measure. Those include the mean-variance (Sani et al., 2012, Vakili & Zhao, 2016, Zhu & Tan, 2020), functions of the expectation and the variance (Zimin et al., 2014), the momentgenerating function at some parameter λ (Maillard, 2013), the value at risk (Galichet et al., 2013), or other general measures encompassing the value at risk (Cassel et al., 2018). In the same vein, Sinha et al. (2021) introduced a model with costs and rewards, and they sought to pull the arm with the lowest cost among those with a reasonably large expectation. We also mention Chen et al. (2022), where each arm pull yields a reward and a risk level, and they aimed to pull the arm which has the largest expectation among those whose risk level is lower than a preset level. Those works are not directly related to the S-MAB, however, risk-averse strategies can be good candidates of strategies (or ideas of strategies) to solve the S-MAB in future work.

3 Problem Setup and Definitions

Let T be the maximum round, called the horizon, and let K 1 be the number of arms, whose indices belong to the set [K] := {1, . . . , K}. We denote by F1, . . . , FK the arm distributions, and by µ1, . . . , µK their respective expectations. We assume that F1, . . . , FK are bounded in [ 1, 1]. We denote by PF the probability under F = (F1, . . . , FK), and by P in the absence of ambiguity. Let B > 0 be the initial budget. Please note that in the whole paper, we will study the problem in the asymptotics of T, and we will give a discussion that is non-asymptotic in B and K. The S-MAB procedure is defined as follows.

An agent starts with a budget B0 = B. Then, at every round t {1, . . . , T} while the agent s budget satisfies Bt 1 > 0,

1. the agent selects an arm πt [K];

2. the agent observes a reward Xπt t drawn from Fπt;

3. the agent updates its budget as Bt := Bt 1 + Xπt t .

In the above, (πt) is a policy that determines the arm to pull based on the past observations. Please note that, for any t {0, . . . , T} such that B1, . . . , Bt 1 > 0,

s=1 Xπs s .

Published in Transactions on Machine Learning Research (08/2024)

As a result, if B > T, then the boundedness of the distributions F1, . . . , FK in [ 1, 1] implies that for any t 0, Xπt t [ 1, 1] and therefore, for any t {0, . . . , T}, Bt > 0. In that case, we can remove the constraint Bt > 0 on the budget and this boils down to a standard MAB. In that sense, the S-MAB is an extension of the standard MAB.

Conversely, if B is small, the problem becomes much harder. For example, consider the case of Bernoulli arm distributions F1, . . . , FK (of support { 1, 1}) of respective parameters p1, . . . , p K (0, 1). Then no matter the arms (πt)t 1 chosen by the agent, it will incur the rewards 1, . . . , 1 (B times) for the first B rounds with probability at least min1 k K(1 pk)B > 0. Consequently, the procedure will stop after B rounds with positive probability.

In a nutshell, the initial budget B > 0 is a parameter of the difficulty of the problem. Given some arm distributions fixed, the problem difficulty increases as B decreases. In this paper, we focus on the most difficult case, i.e., when B > 0 is small and of constant order in the asymptotic regime T + .

3.1 Definition of the Ruin

The key difference between the S-MAB procedure defined above and the standard MAB is the budget constraint, which states that if the agent s budget Bt becomes negative or zero at some round t T, the agent has to stop playing immediately. W.l.o.g., in this paper, we define policies (πt) for any t 1 (also beyond T) to ensure that the time of ruin of (πt) below is well-defined.

Definition 1. For any policy π and any initial budget B > 0, the time of ruin is defined as

τ(B, π) := inf

s=1 Xπs s 0

where the infimum above is equal to + if the above set is empty. Furthermore, for any k [K], we denote by τ(B, k) the time of ruin of the constant policy πt = k for any t 1.

Using the vocabulary of the time of ruin, the budget constraint simply states that the agent plays until round min (T, τ(B, π)). We can then translate the S-MAB procedure as a standard MAB with horizon min (T, τ(B, π)).

It might thus be tempting to simply use an existing bandit strategy and try to derive a regret bound with the new horizon min (T, τ(B, π)). However, the probability of ruin of a stochastic process until a finite horizon (as opposed to + ) is in general very complicated. This difficulty is exacerbated when this horizon is random, and even depends on the procedure π itself, as is the case with min (T, τ(B, π)). In our setting where the asymptotics T + is considered, P(τ(B, π) < ) is a reasonable approximation of the probability P(τ(B, π) T) that the ruin occurs before the horizon T.

Definition 2. Given a policy π = (πt)t 1,

the probability of ruin is defined as P(τ(B, π) < );

the probability of survival is defined as P(τ(B, π) = ) = 1 P(τ(B, π) < ).

Please note that, if all the arms k [K] have the probability of survival P(τ(B, k) = ) = 0, then any policy π also has the probability of survival P(τ(B, π) = ) = 0, and its cumulative reward will be smaller than B. Therefore, in the rest of the paper, we make the following assumption.

Assumption 1. There exists an arm k [K] with a positive probability of survival: P(τ(B, k) = ) > 0.

3.2 Definition of the Objective

In this subsection, we define the objective of the agent performing the S-MAB procedure. Recall from the previous subsection that the S-MAB can be seen as an extension of the standard MAB with random

Published in Transactions on Machine Learning Research (08/2024)

horizon min(T, τ(B, π)). The objective of the standard MAB is to maximize the expected cumulative reward

E h PT t=1 Xπt t i , and we naturally extend this definition to the S-MAB as follows:

Rew T (π) := E [ST ] where ST

min(T,τ(B,π)) X

t=1 Xπt t =

t=1 Xπt t 1τ(B,π) t 1.

In the S-MAB setting, the following remark is fundamental.

Remark 1. Let an agent perform a policy π. Then, assume that the agent plays until some round t0 1 and incurs the rewards Xπ1 1 , . . . , X πt0 t0 without being ruined. Further assume that at this precise round t0, the agent realizes that Bt0 > T t0. Then, since the rewards are bounded in [ 1, 1], it is clear that, for any t {t0 + 1, . . . , T},

s=t0+1 Xπs s Bt0 (t t0) Bt0 (T t0) > 0,

in other words, the agent cannot be ruined anymore. Then, the remaining procedure from round t0 + 1 is a standard MAB (without the risk of ruin), and our agent can perform any standard MAB policy for the remaining rounds {t0 + 1, . . . , T} and enjoy the cumulative reward of such a policy. On the other hand, if our agent was not aware of the horizon T, then it would not be able to verify whether or not the condition Bt0 > T t0 is satisfied. As a result, such an agent would have to care about the risk of ruin until the horizon T, and play more conservatively until horizon T. For that reason, a policy aware of the horizon T has a significant advantage over one which is not, and hence can achieve a higher cumulative reward.

In this paper, our focus is on maximizing the expected cumulative reward among all policies which may be aware of the horizon T. As a result, instead of focusing on policies π = (πt)t 1 which are not aware of the horizon T, we need to study and formalize the general framework of policies πT = (πT t )t 1 which may depend on the horizon T.

Definition 3. π = (πT )T 1 is a sequence of policies if, for any T 1, πT = (πT t )t {1,...,T } satisfies

t {1, . . . , T}, πT t σ(U T 1 , πT 1 , XπT 1 1 , . . . , U T t 1, πT t 1, X πT t 1 t 1 , U T t ),

where σ( ) is used to denote the sigma algebra and (U T t ) is some potential randomization for the policy (πT t ) independent from the preceding history.

In this paper (as in a large part of the MAB literature), we are interested in the asymptotics in T. For this reason, we will study the asymptotics of Rew T (πT ) in T + , where the policy πT can depend on T.

A sequence of policies π = (πT )T 1 will be called anytime if there exists a policy ( πt)t 1 such that, for any t 1 and any T t, πT t = πt. We will often identify such a sequence of policies with policy π, and when we say that a policy π is anytime, it formally means that the sequence of policies is written in the above way.

Remark 2. Please note that, in the world of sequences of policies, anytime sequences of policies are the analog of anytime policies. Indeed, given a policy π = (πt)t 1, let πT t πt for any T t. The sequence of policies (πT )T 1 where πT = (πT t )1 t T is anytime and is the equivalent of the anytime policy π in the world of sequential policies.

Now, similarly to the standard MAB, given a sequence of policies π = (πT )T 1, we want to compare its expected cumulative reward to other sequences of policies π = ( πT )T 1 in order to understand how well it performs. For that purpose, we introduce the regret in the following definition.

Definition 4. Given any sequences of policies π = (πT )T 1 and π = ( πT )T 1, the relative regret rate of π with respect to π is defined as

Reg F (π π) := lim sup T +

Rew T ( πT ) Rew T (πT )

Published in Transactions on Machine Learning Research (08/2024)

First, please note that this regret is asymptotic in T, because we want to capture the main term in T in the reward. Secondly, in Definition 4, we compare the reward of the sequence of policies π with the reward of some sequence of policies π. Ideally, our objective would be to find a sequence of policies π which has a higher reward than any other sequence of policies π, in other words, which would satisfy Reg F (π π) 0 for any arm distributions F = (F1, . . . , FK) and any sequence of policy π.

In the standard MAB (without the risk of ruin), we know that such policies π exist and even second-order terms (in T) in the reward decomposition have been derived such that Rew T (π) = maxk [K] µk T + O(log T) (see, e.g., Auer et al., 2002). In contrast, in the S-MAB setting (with the risk of ruin), the policy which achieves the best bound is not known. Actually, we do not even know (yet) in what sense there exists a best bound.

As a matter of fact, contrary to the MAB, in the S-MAB, no sequence of policies π dominates all the other sequences of policies π, as stated in the next proposition, whose proof is given in Appendix B.

Proposition 1. For any sequence of policies π, sup F supπ Reg F (π π ) > 0.

The key message is that whatever the sequence of policies π that you choose, there exist some arm distributions F = (F1, . . . , FK) such that another sequence of policies π has a significantly higher reward (by a term of order Ω(T)) than π. It is now hopeless to look for a policy which is absolutely better than any other one and instead, we look for a Pareto-optimal policy, as defined below.

Definition 5. A sequence of policies π is said to be (regret-wise) Pareto-optimal if, for any sequence of policies π , sup F Reg F (π π ) > 0 = inf F Reg F (π π ) < 0.

The notion of Pareto-optimal sequences of policies is related to the concept of strictly dominated strategies in game theory. Assume that an agent performs a sequence of policies π and that there exists another sequence of policies π such that:

- for any arm distributions F , Reg F (π π ) 0, and

- for some arm distributions F, it even holds that Reg F (π π ) > 0.

Then, the agent should simply not use the sequence of policies π and instead use π , because it is:

- always at least as good as π, and

- in some cases, even strictly better than π.

In the language of game theory, π is called a strictly dominated strategy, and our goal is to find a strategy which is not strictly dominated, which we call here a (regret-wise) Pareto-optimal sequence of policies. This challenging problem is an open question from COLT 2019 (Perotto et al., 2019), and our solution is based on proof techniques from various fields including information theory, stochastic processes theory or even predictions. As such, the intermediate results of this paper can be of interest for the reader interested in any of those fields and their applications.

4 Strategy and Structure of the Paper

In this section and until Section 7, we study the case of multinomial arm distributions of support { 1, 0, 1}, referred to as multinomial distributions in this paper. The generalization to arm distributions bounded in [ 1, 1] is done in Section 9.

We first explain the choice of the reward model (Section 4.1), and why we made Assumption 1 and what the problem objective becomes when this objective does not hold (Section 7.2.4). We next give an intuition on why the study of the probability of ruin is fundamental to maximize the expected cumulative reward (Section 4.2), and we conclude with the structure of the rest of the paper (Section 4.3).

Published in Transactions on Machine Learning Research (08/2024)

4.1 Reward Models

In this paper, we will mostly focus on multinomial arm distributions of support { 1, 0, 1} (simply referred to as multinomial arm distributions in the rest of this paper). This subsection discusses why this setting is of particular interest.

Why it is reasonable to study Integer Rewards. W.l.o.g., let us assume that the budget B N, and consider the specific instance of two arms, F1 and F2, of respective support { 1, 0, 1} and { 1/2, 0, 1/2}. Assume that at some time t0, the cumulative reward of a policy π reaches

t=1 Xπt t = B + 1.

Then, at round t0 + 1, selecting arm 1 increases the risk of ruin, while selecting arm 2 does not. At that stage, it is optimal to select arm 2, even if µ1 > 0 and µ2 < 0. Such situations may occur in the more general setting of arm distributions bounded in [ 1, 1], when the cumulative reward falls in the interval ( B, B + K). This creates a gap between our algorithms upper bounds and our lower bounds. However, this gap is small, as we justify in Section 9.

Why Bernoulli rewards are not sufficient. As explained in Section 4.2, minimizing the probability of ruin is crucial to maximize the expected cumulative reward of the S-MAB. In the case of multinomial arm distributions, the best constant strategy in hindsight which minimizes the probability of ruin selects arm

k P arg max k [K]

PFk(X = +1) PFk(X = 1),

(see Lemma 4), while the one which maximizes the expected cumulative reward selects arm

k E arg max k [K] {PFk(X = +1) PFk(X = 1)} .

In general, k P = k E, and this phenomenon is also captured by the multinomial setting we consider. On the other hand, Bernoulli arm distributions only have one real parameter, and do not capture this phenomenon, essential for the understanding of the survival bandit problem.

4.2 Strategy of the Paper

By definition of the time of ruin τ(B, π), it holds that

t=1 Xπt t B. (1)

Furthermore, since the rewards are bounded in [ 1, 1], if τ(B, π) > T, then it holds that

t=1 Xπt t =

t=T +1 Xπt t B + (τ(B, π) T). (2)

Finally, since the rewards are bounded in [ 1, 1], it is clear that

t=1 Xπt t T. (3)

Published in Transactions on Machine Learning Research (08/2024)

Figure 1: The probability of ruin constraints the cumulative reward.

Eq. (1), (2) and (3) draw a set where the cumulative reward is constrained, shaded in gray in Figure 1. That figure shows that a low time of ruin prevents the cumulative reward to become large. In particular, to have an expected cumulative reward which is larger than λT (for some positive λ), Figure 1 shows that τ(B, π) needs to be bigger that λT + B with large probability, i.e., the following needs to be large:

P (τ(B, π) (1 + λ)T + B) .

While we do not know λ yet, (1+λ)T +B is large when T is large (standard assumption in bandit problems), and therefore, the above quantity is largely related to the probability of ruin P(τ(B, π) < ).

4.3 Structure of the Paper

We start by studying the case of multinomial arm distributions of support { 1, 0, 1} (referred to as multinomial distributions in the sequel) in Sections 5 to 7, and then we will extend some of our results to the general case of arm distributions bounded in [ 1, 1] in Section 9. The outline of the paper is as follows:

in Section 5, we study the probability of ruin and derive the first main result of this paper: a tight non-asymptotic lower bound (in the sense of Pareto) on the probability of ruin in Theorem 1.

in Section 6, we introduce EXPLOIT, a framework (or set) of policies whose probability of ruin matches the lower bound of Section 5. We further show that those policies cannot be regret-wise Pareto-optimal, and hence the need to adapt them to achieve our desired objective.

in Section 7, we introduce the policy EXPLOIT-UCB-DOUBLE, which performs a doubling trick over an EXPLOIT policy. We show the second main result of this paper in Theorem 3: EXPLOITUCB-DOUBLE is regret-wise Pareto-optimal. We corroborate this theoretical result with experiments showing the experimental performance of EXPLOIT-UCB-DOUBLE. This section provides an answer to an open problem from COLT 2019 (Perotto et al., 2019).

in Section 9, we generalize the results of Sections 5 7 to the general case of arm distributions bounded in [ 1, 1]. In particular, we generalize the result of Theorem 1 on the probability of ruin, which is the third main result of this paper, and further relate it to the probability of ruin of i.i.d. stochastic processes, which is a result of independent interest. Finally, we discuss the challenges of extending Theorem 3 to that setting.

Published in Transactions on Machine Learning Research (08/2024)

5 Study of the Probability of Ruin

In this section, we study the probability of ruin of anytime policies in the case of multinomial arm distributions (of support { 1, 0, 1}). We first derive one of the main results of this paper: a tight lower bound on the probability of ruin (Theorem 1), which will be generalized to distributions bounded in [ 1, 1] in Proposition 7 of Section 9. We further relate this bound to the probability of ruin of i.i.d. random walks on { 1, 0, 1} in Lemma 1, which is a result of independent interest.

5.1 Trivial Case and Assumption

Consider the following example.

Example 1. Assume that there exists an arm k [K] whose distribution Fk is such that PX Fk(X 0) = 1. Then, if the initial budget B satisfies B > K 1, the policy π which chooses the arm with the highest empirical average of rewards, i.e., defined by

t 1, πt arg max k [K]

Pt 1 s=1 Xπt s 1πs=k Pt 1 s=1 1πs=k ,

has no risk of ruin: P(τ(B, π) < ) = 0.

The above case is simple, because the distribution Fk only yields non-negative rewards. In this case, the trivial bound on the probability of ruin

P(τ(B, π) < ) 0

is tight, and we eliminate this case by assuming that there is no positive or zero arm distribution, as defined below.

Definition 6. A distribution F is called a zero distribution (resp. a positive distribution) if PX F (X = 0) = 1 (resp. if PX F (X 0) = 1 and PX F (X > 0) > 0). We say that an arm k is a zero arm (resp. a positive arm) if its distribution Fk is zero (resp. positive).

In this section, we make the following assumption.

Assumption 2. There is no positive or zero arm.

We will also consider the following weaker assumption.

Assumption 3. There is no zero arm.

Please note that the policies introduced in Sections 6 to 7 will achieve P(τ(B, π) < ) = 0 in the case where there is a positive or zero arm.

5.2 Main Results on the Probability of Ruin

Let F{ 1,0,1} be the set of multinomial distributions of support { 1, 0, 1} which are not positive or zero (see Definition 6). Similarly to the cumulative reward, for any F = (F1, . . . , FK) FK { 1,0,1} and any policies π, π , we define the relative risk of ruin of π with respect to π as

Pruin(π π ) := PF (τ(B, π) < ) PF (τ(B, π ) < ),

where the dependency on F is omitted in the notation. Please note that, since we study this problem for fixed B small with regards to T, there is no limit in the definition of Pruin(π π ). Yet, similarly to Proposition 1, we can prove that no policy π achieves supπ sup F Pruin(π π ) 0, and for that reason, we focus on a policy which is Pareto-optimal in the sense of the probability of ruin, which is formalized as follows.

Definition 7. A policy π is said to be ruin-wise Pareto-optimal if, for any policy π ,

sup F Pruin(π π ) > 0 = inf F Pruin(π π ) < 0.

Published in Transactions on Machine Learning Research (08/2024)

Before stating the main result of this section, we need to define, for any arm distributions F = (F1, . . . , FK) FK { 1,0,1},

γ(Fk) := inf Q:EX Q[X]<0 KL(Q Fk) EX Q[ X] 0. (4)

The main result of this section is a Pareto-type lower bound on the probability of ruin.

Theorem 1. Let (αk)k [K] be such that for any k, αk > 0 and PK k=1 αk = 1. For any policy π,

inf F FK { 1,0,1}

PF (τ(B, π) < ) exp

k=1 αkγ(Fk)

= sup F FK { 1,0,1}

PF (τ(B, π) < ) exp

k=1 αkγ(Fk)

The main ingredient of the proof of Theorem 1 is given in Section 5.4, and the rest of the proof is given in Appendix D.

Please note that this theorem gives a lower bound on the probability of ruin. A weaker version of Theorem 1 is that there exist some arm distributions F FK { 1,0,1} such that

PF (τ(B, π) < ) exp

k=1 αkγ(Fk)

But Theorem 1 is a little stronger than that. Actually, it states that there are two cases:

either the lower bound (6) holds for all arm distributions F FK { 1,0,1};

or it holds with strict inequality for some F FK { 1,0,1}.

We conclude this subsection by a lemma providing an insightful interpretation of the lower bound (6).

Lemma 1. For any distributions F = (F1, . . . , FK) FK { 1,0,1} and any arm k [K],

1 B log PFk(τ(B, k) < ) = γ(Fk).

Remark 3. Whereas the statement of Lemma 1 uses the KL divergence through the definition of γ(Fk) in (4), the probability of ruin of a stochastic process is usually analyzed using the moment-generating function, which is also found in the proof of this lemma given in Appendix E. The relation between them is discussed in Lemma 5 in Appendix C, and we interchangeably use both representations.

In the next subsection, we give an interpretation of the bound of Theorem 1. In particular, we explain the Pareto-type bound and its connection to game theory in Section 5.3.1, we give an interpretation of the γ(Fk) from the bound in Section 5.3.2, and in Section 5.3.3, we give an interpretation of Theorem 1 based on Lemma 1 which will give rise to the policies matching this bound in Section 6.

5.3 Interpretation of the Lower Bound

In this subsection, we give an interpretation of the lower bound of Theorem 1 and the coefficient γ(Fk) that appear in the bound.

Published in Transactions on Machine Learning Research (08/2024)

5.3.1 A Pareto-type Bound

The lower bound is of Pareto-type. It states that for any policy π, if there exist some arm distributions F = (F1, . . . , FK) such that the probability of ruin of π is lower than

k=1 αkγ(Fk)

then there exist some other arm distributions G = (G1, . . . , GK) such that the probability of ruin of π is larger than

k=1 αkγ(Gk)

Such a lower bound is said to be of Pareto-type, and corresponds to the notion of strict dominance in game theory (von Neumann & Morgenstern, 1944): consider a game between one player and the environment, which is articulated as follows:

1. simultaneously and without consulting with one another, the player chooses a policy π and the environment chooses a K-tuple of arm distributions F = (F1, . . . , FK);

2. the player receives the score

k=1 αkγ(Fk)

PF (τ(B, π) < ).

The objective of the player is to maximize its score. Then, Theorem 1 states that no strategy π can ensure a score which is always non-negative and sometimes positive. In other words, if the score is money (in USD), then there is no strategy which guarantees that the player will almost surely make money.

In terms of policy, this translates as non strict dominance. Let π and π be two policies, and assume that

for any F chosen by the environment, the score of π is as good as π ;

and there exists some F such that the score of π is (strictly) larger than the score of π .

In this case, we say that π is strictly dominated by π, in the sense that π is always at least as good as π , and sometimes strictly better than π . In such a situation, from a score maximization perspective, it is always better to choose policy π over policy π , and we can discard policy π as an unreasonable choice of a policy. As a consequence, before choosing a policy π, it is fundamental to determine if it is strictly dominated.

In Section 6, we prove that there exists a policy whose probability of ruin is equal to

k=1 αkγ(Fk)

for any arm distributions F = (F1, . . . , FK), and therefore, such a policy is not strictly dominated and reasonable from the perspective of the risk of ruin.

5.3.2 The Coefficients Gamma

In this subsection, we give an interpretation of the quantities γ(Fk) from the bound of Theorem 1. If Fk F{ 1,0,1}, a direct application of Lemmas 1 and 4 yields

B log PFk(τ(B, k) < ) = max 0, log PX Fk(X = +1)

PX Fk(X = 1)

Published in Transactions on Machine Learning Research (08/2024)

Figure 2: γ(Fk) as a function of (p k , p+ k ).

For any Fk F{ 1,0,1}, γ(Fk) [0, + ] is a measure of the chance of survival of the policy which consistently selects arm k. The following equivalence holds:

PFk(τ(B, k) < ) = 1 PX Fk(X = +1) PX Fk(X = 1) γ(Fk) = 0,

in other words, the constant policy which only selects arm k reaches the ruin systematically if and only if γ(Fk) = 0. At the other end of the spectrum,

PFk(τ(B, k) < ) = 0 PX Fk(X = 1) = 0 γ(Fk) = + ,

and the constant policy which only selects arm k never observes the ruin if and only if γ(Fk) = + .

Between those extreme cases, γ(Fk) is a function which is non-decreasing in p+ k P(X = +1) and nonincreasing in p k P(X = 1). The whole plot of γ(Fk) as a function of (p k , p+ k ) is represented in Figure 2.

5.3.3 Divide the Budget and Conquer

In this subsection, we provide an interpretation of the bound of Theorem 1 which gives rise to a class of policies matching this bound. Let (αk)k [K] be such that for any k [K], αk B N. An application of Lemma 1 to the budget αk B for any k yields

k=1 PFk(τ(αk B, k) < ) = exp

k=1 αkγ(Fk)

This gives another expression of the lower bound in (6), which is easier to interpret. This lower bound is the product on the arms k [K] of the probability of ruin of the constant policy πt = k with budget αk B. This suggests that it might be interesting to divide the budget B into shares (αk B)k [K], and to allocate to each arm a budget share αk B. This fundamental interpretation will be the basis to define the EXPLOIT policies in Section 6.

Finally, please note that each of the factors in (7) can further be computed explicitly, using Lemma 4:

k=1 αkγ(Fk)

1, PX Fk(X = 1)

PX Fk(X = 1)

Nevertheless, there is no explicit formula for γ(Fk) in the general case of non-multinomial rewards. The non-multinomial case is discussed in Section 9.

Published in Transactions on Machine Learning Research (08/2024)

5.4 Sketch and Main Ingredient of the Proof of Theorem 1

The proof of the non-asymptotic bound of Theorem 1 is given in Appendix D. This proof consists of (i) derivation of an asymptotic bound given in Lemma 2 below, and (ii) turning it into a non-asymptotic bound by using a sub-additivity argument on the probability of ruin. Please note that the proof of this lemma is conducted in the generality of distributions bounded in [ 1, 1] (not necessarily multinomial).

Lemma 2. Fix an arbitrary policy π and distributions (Q1, . . . , QK) such that EX Qk[X] < 0 for all k [K]. Then, there exists a probability vector β(Q) = (β1(Q), . . . , βK(Q)) such that for any distributions (F1, . . . , FK),

lim inf B + 1 B log P(F1,...,FK)

τ(B, π) < 3B

k=1 βk(Q) KL(Qk Fk)

where Q = mini [K] EX Qi[ X] > 0.

Proof. Let Q = (Q1, . . . , QK) be a vector of distributions such that EX Qk[X] < 0 for all k [K], and let Q := mini [K] EX Qi[ X] > 0. We denote by Nk(τ) the number of pulls of arm k until τ(B, π), and by nk its realization. Denoting by Y n k the reward of the n-th pull of arm k, let

Hτ = Y 1 1 , Y 2 1 , . . . , Y N1(τ) 1 , Y 1 2 , Y 2 2 , . . . , Y N2(τ) 2 , . . . , Y 1 K, Y 2 K, . . . , Y NK(τ) K ,

and ht be its realization. Please note that for any realization ht, B + P

m [nk] ym k 1. We further

denote by T(Q) the set of typical realizations ht satisfying

PK k=1 nk KL(Qk Fk) Pnk m=1 log d Qk

d Fk (ym k ) t

B 1 4 , PK k=1 (nk EX Qk[X] Pnk m=1 ym k ) t Q

B 1 4 , PK k=1 Pnk m=1 ym k t Q

Such realizations are typical under Q in the sense that lim B + Q(Hτ T(Q)) = 1 (shown by, e.g., Hoeffding s inequality, see Appendix D.1 for details). We can see from (8) that any typical ht satisfies

B EX Qk[ X] 1

4 B 1 4 . (9)

In particular, denoting r(ht) := (n1,...,n K)

B , (9) implies that r(ht) can take at most O(poly(B)) values, and hence there exists r such that

lim B + 1 B log Q (r(Hτ) = r|Hτ T(Q)) = 0. (10)

By performing a change of distribution and using (8), we can bound

P(F1,...,FK)

τ(B, π) < 3B

P(F1,...,FK) (Hτ T(Q), r(Hτ) = r)

ht T (Q): r(ht)= r

m=1 log d Qk

d Fk (ym k )

ht T (Q): r(ht)= r

k=1 nk KL(Qk Fk) t B 1 4

Published in Transactions on Machine Learning Research (08/2024)

ht T (Q): r(ht)= r

k=1 rk KL(Qk Fk) + 3 QB 1 4

k=1 rk EX Qk[ X] KL(Qk Fk)

EX Qk[ X] + 3 QB 1 4

Q (Hτ T(Q), r(Hτ) = r) .

For any k [K], we introduce the normalized version of rk EX Qk[ X], which we denote by

βk(Q) := rk EX Qk[ X] PK j=1 rj EX Qj[ X] .

Eq. (9) implies that PK j=1 rj EX Qj[ X] 1 4

B 1 4 , and hence

1 B log P(F1,...,FK)

τ(B, π) < 3B

1 + 4 B 1 4

k=1 βk(Q) KL(Qk Fk)

EX Qk[ X] 3 QB 1 4 + 1

B log Q (Hτ T(Q), r(Hτ) = r)

k=1 βk(Q) KL(Qk Fk)

by (10), concluding the proof of Lemma 2.

6 The EXPLOIT Framework

In this section, we introduce a framework of anytime policies, called EXPLOIT, which achieve the lower bound on the probability of ruin given in Theorem 1. We further study the regret of all such policies.

In Sections 6 and 7, our study is conducted in the case of multinomial arm distributions, and therefore, it holds that P(τ(B, π) T) = P(τ( B , π) T)

for any policy π. As a result, in these sections, we assume w.l.o.g. that B N.

6.1 Definition of the EXPLOIT Framework

Let B1, . . . , BK be arbitrary positive integers such that B1 + + BK = B. Applying (7) to αk Bk

B for any k [K], the lower bound of Theorem 1 implies that

PF (τ(B, π) < )

k=1 PFk(τ(Bk, k) < )

for some arm distributions F FK { 1,0,1}. This right hand-side of this inequality corresponds to the probability of ruin of any policy π which allocates budget B1 to arm 1, B2 to arm 2 and so on, and plays πt = k only if arm k has not exceeded its budget Bk. We say that such policies belong to the EXPLOIT framework, which is formalized below.

Definition 8. Given some positive integers B1, . . . , BK such that B1 + + BK = B, we say that a policy π = (πt)t 1 belongs to the framework EXPLOIT(B1, . . . , BK) if, at any round t 1,

k [K] : Bk +

s=1 Xπs s 1πs=k > 0

Published in Transactions on Machine Learning Research (08/2024)

Remark 4. The initial choice of the budget shares B1, . . . , BK may depend on some possible prior information we may have on the arms. Assume that there is a prior distribution Dk for γ(Fk) through, e.g., already observed samples. We further assume that the priors are independent. Then the expected probability of ruin is given by

E γ(F1) D1 ... γ(FK) DK

k=1 exp ( Bkγ(Fk))

k=1 Eγ(Fk) Dk [exp ( Bkγ(Fk))] .

Then, it seems reasonable to take Bk minimizing this quantity, which is trivially Bk = B/K when Dk is the same for all the arms k. In general, one can easily see that the optimal Bk is the one such that

E [γ(Fk) exp ( Bkγ(Fk))]

E [exp ( Bkγ(Fk))]

is the same for all the arms. Still, this corresponds to the Bayesian formulation and we do not go further in this paper.

Without any prior information on the arms, we have no reason to choose B1 B2 e.g., and we will choose the budget shares as close as possible. If B = n KK + b, with 0 b < K is the Euclidian division of B by K, a possibility is to choose B1 = = Bb = n K + 1 and Bb+1 = . . . BK = n K.

All the policies in EXPLOIT(B1, . . . , BK) have the same probability of ruin, given in the next proposition.

Proposition 2. Given some positive integers B1, . . . , BK such that B1 + + BK = B, all the policies in EXPLOIT(B1, . . . , BK) achieve the same probability of ruin, given by

p EX(B1, . . . , BK)

k=1 PFk(τ(Bk, k) < ) = exp

k=1 Bkγ(Fk)

Importantly, the probability of ruin of the policies in EXPLOIT match the lower bound of Theorem 1, and therefore, all the policies in EXPLOIT are ruin-wise Pareto-optimal. This is a major strength of the EXPLOIT framework: given B1, . . . , BK, you can choose any policy in EXPLOIT(B1, . . . , BK) to try to maximize the expected cumulative reward without sacrificing the probability of ruin.

Remark 5. When B is a multiple of K, the policy π which selects arm πt arg maxk [K] n Pt 1 s=1 Xπs s 1πs=k o

with the highest cumulative reward belongs to EXPLOIT B

K , . . . , B

K . While it is intuitive that such a policy is very conservative, it was a priori not obvious that many policies (all the policies in EXPLOIT B

K , . . . , B

K ) achieve the same probability of ruin. Actually, this result becomes false in the general case of rewards in [ 1, 1] (not necessarily integers) studied in Section 9.

Convention 1. From now on, given B = n KK +b with 0 b < K, we consider the more symmetric case B1 = = Bb = n K + 1 and Bb+1 = = BK = n K. We use the shortcut notation EXPLOIT instead of EXPLOIT(B1, . . . , BK) in that specific instance and denote p EX p EX(B1, . . . , BK).

6.2 Expected Cumulative Reward of Policies in EXPLOIT

By nature, EXPLOIT policies are very conservative. Precisely, they allow a budget of Bk for the exploration of each arm k [K]. In the previous subsection, we showed that, thanks to this limited exploration, they are ruin-wise Pareto-optimal, i.e., they achieve a small probability of ruin (in the sense of Pareto-optimality). In this subsection, we show that the downside of this limited exploration is that the expected cumulative reward of EXPLOIT policies is fairly low, upper-bounded as shown in the following proposition, whose proof is in Appendix F.

Proposition 3. Assume that the budget B is a multiple of K. W.l.o.g., assume that µ1 µK. Then, for any policy π in EXPLOIT,

t=1 Xπt t 1τ(B,π) t 1

k=1 wkµk T + o(T) ( ) 1 p EX max k [K] µk T + o(T), (11)

Published in Transactions on Machine Learning Research (08/2024)

Algorithm 1 EXPLOIT-UCB(B)

for t = 1, . . . , T do

Set At := n k [K] : Pt 1 s=1 Xπs s 1πs=k B

if At = then

Pull arm arg maxk At ˆXk t 1 + q

Pull arm arg maxk [K] ˆXk t 1.

where for any k [K], wk = P(τ( B

K ,k)= )Qk 1

1 p EX . Besides, when two arms have positive and different expectations, ( ) is a strict inequality.

As we will see in Section 7, there exists a non-anytime policy whose expected cumulative reward is equal to the right hand-side of (11). This implies that no EXPLOIT policy is regret-wise Pareto-optimal (although they are all ruin-wise Pareto-optimal).

As explained before, a policy in EXPLOIT will allow a budget share Bk for the exploration of each arm k [K]. Therefore, it may stop pulling arm k after only Bk pulls even without encountering ruin before the horizon T. This lack of exploration of arm k penalizes the cumulative reward of arm k and is reflected in the coefficient (1 p EX)wk in the middle term of the bound of Proposition 3.

6.3 Bandit Algorithms in the EXPLOIT Framework

We know that all the policies in EXPLOIT are ruin-wise Pareto-optimal and achieve the same probability of ruin (this is the result of Proposition 2). In this subsection, we exhibit an EXPLOIT policy which achieves the highest possible cumulative reward within EXPLOIT, given as the middle bound in (11). Though such a policy would not be regret-wise Pareto-optimal as suggested from Proposition 3, it will serve as a basis for the construction of such an optimal policy. For any arm k [K] and any round t, we introduce Nk(t) := Pt s=1 1πs=k as the number of pulls of arm k until round t, and ˆXk t := 1 Nk(t) Pt s=1 1πs=k Xπs s as the empirical mean of arm k at round t.

We start from the classic bandit algorithm UCB (Upper Confidence Bound, see Auer et al., 2002) and make it fit in the EXPLOIT framework. This defines EXPLOIT-UCB(B) in Algorithm 1 as the policy which, at each round t 1, performs UCB among the arms whose cumulative reward is larger than B

K . As EXPLOIT-UCB(B) is in EXPLOIT, it naturally achieves the optimal bound on the probability of ruin. In addition, it also achieves the middle bound in (11) on the reward, which is asymptotically optimal among EXPLOIT policies, as shown in the next proposition, whose proof is in Appendix G.

Proposition 4. Under the hypotheses of Proposition 3, the expected cumulative reward of EXPLOITUCB(B) satisfies

t=1 Xπt t 1τ(B,π) t 1

= 1 p EX K X

k=1 wkµk T + o(T),

where for any k [K], wk = P(τ( B

K ,k)= )Qk 1

More than the bound itself, the above result states that a standard MAB algorithm made to fit in EXPLOIT achieves the best possible reward within EXPLOIT. Here is the intuition behind it. All EXPLOIT policies have the same risk of ruin, and when there is ruin, all policies receive the total reward B. Therefore, a good EXPLOIT policy can only make a difference in the case when there is no ruin. In that case, it should achieve a high cumulative reward, i.e., behave closely to a good standard MAB policy like UCB.

Remark 6. The choice of the constant 6 in the square root is different from the original UCB and is only made for simplicity of the proof.

Published in Transactions on Machine Learning Research (08/2024)

Figure 3: Description of EXPLOIT-UCB-DOUBLE

EXPLOIT-UCB-DOUBLE first behaves like EXPLOIT-UCB(B), until the cumulative reward reaches n B2. Then, it performs like EXPLOIT-UCB(n B2), until the cumulative reward reaches 2n B2. Then, it performs like EXPLOITUCB(2n B2), until the cumulative reward reaches 3n B2, and so on.

7 A Regret-wise Pareto-optimal Policy

In this section, we make a slight modification on the policy EXPLOIT-UCB(B) (see Algorithm 1) so that it achieves a large cumulative reward, while keeping its probability of ruin small. The resulting policy is EXPLOIT-UCB-DOUBLE (given in Algorithm 2) and is proven to be regret-wise Pareto-optimal, hence giving an answer to the open problem in Perotto et al. (2019).

7.1 A (Regret-wise) Pareto-optimal Policy: EXPLOIT-UCB-DOUBLE

We start from EXPLOIT-UCB(B) (Algorithm 1). As this policy belongs to EXPLOIT, it is ruin-wise Paretooptimal. However, as shown in Proposition 3, its cumulative reward is rather low, because its exploration is limited by the budget it allocates to each arm. We tackle this issue by performing a kind of doubling trick (see, e.g., Cesa-Bianchi & Lugosi, 2006) on the budget, which relaunches the exploration when the cumulative reward is large enough.

Let n N be a hyperparameter chosen in advance and for any integer j 0, let

t {0, . . . , min(τ(B, π), T)} : B +

s=1 Xπs s > jn B2 )

with the convention that tj = min(τ(B, π), T) + 1 if the above set is empty. At each round t, EXPLOITUCB-DOUBLE (see Figure 3 and Algorithm 2) performs

EXPLOIT-UCB(B) pretending that the initial budget is B if t < t1;

EXPLOIT-UCB(jn B2) pretending that the initial budget is jn B2 if tj t < tj+1.

A more visual description of EXPLOIT-UCB-DOUBLE is given in Figure 3 and its pseudo-code is given in Algorithm 2. We denote by πn the policy associated with EXPLOIT-UCB-DOUBLE with input parameter n.

The underlying principle of EXPLOIT-UCB-DOUBLE is that, as long as the cumulative reward is low, it performs safely like an EXPLOIT policy in order to minimize the probability of ruin. Then, progressively as the cumulative reward becomes larger, EXPLOIT-UCB-DOUBLE allocates more budget for exploration and behaves more similarly to UCB, so that it starts the cumulative reward maximization.

The next proposition, whose proof is given in Appendix H.1, shows that the probability of ruin of EXPLOITUCB-DOUBLE is close to the one of an EXPLOIT policy.

Proposition 5. Given n 1, the probability of ruin of EXPLOIT-UCB-DOUBLE is bounded as follows:

P (τ(B, πn) < ) p EX + (p EX)n B

1 (p EX)n B

( ) = p EX + o T + (1),

where ( ) holds if n T + + .

Published in Transactions on Machine Learning Research (08/2024)

Algorithm 2 EXPLOIT-UCB-DOUBLE input: parameter n N; budget B > 0. j := 0; t0 := 0. for t = 1, . . . , T do

if B + Pt 1 s=1 Xπn s s > (j + 1)n B2 then Set j := j + 1 and then, set tj := t 1.

Set At := k [K] : Pt 1 s=tj+1 Xπn s s 1πn s =k B+Ptj s=1 X πn s s K + 1 ;

for k = 1, . . . , K do

Set Nk(t 1) := Pt 1 s=1 1πn s =k and ˆXk t 1 := 1 Nk(t 1) Pt 1 s=1 Xπn s s 1πn s =k.

if At = then

Pull arm arg maxk At ˆXk t 1 + q

Pull arm arg maxk [K] ˆXk t 1.

Proposition 5 shows that the progressively increasing exploration of EXPLOIT-UCB-DOUBLE is reasonable from the perspective of the probability of ruin, because it maintains the probability of ruin almost as small as the one of an EXPLOIT policy, which is ruin-wise Pareto-optimal.

The next proposition shows that this progressively increasing exploration is also reasonable from the perspective of the reward maximization. The next proposition, whose proof can be found in Appendix H.2, shows that, in the case when there is no ruin, the expected cumulative reward of EXPLOIT-UCB-DOUBLE is asymptotically equal to the best possible expected reward.

Proposition 6. Let n be such that n T + + and that n = o(T 1/4). Then, under Assumption 3, the reward given no ruin of EXPLOIT-UCB-DOUBLE with input parameter n is bounded from below by

t=1 Xπn t t 1τ(B,πn) t 1

max k [K] µk T + o(T). (12)

The proof of Proposition 6 also holds in the case B = T, that is, when there is no risk of ruin. Therefore, if you apply EXPLOIT-UCB-DOUBLE to a standard MAB (without a risk of ruin), then its expected cumulative reward will be equal to maxk [K] µk T + o(T). In the standard MAB terminology, this means that its expected cumulative regret is sublinear: maxk [K] µk T Rew T (πn) = o(T).

Now, if we gather the results of Proposition 5 and Proposition 6, the former states that the probability of ruin of EXPLOIT-UCB-DOUBLE is very small (it is almost ruin-wise Pareto-optimal), and when it does not ruin, the latter states that its cumulative reward is asymptotically maximal. Therefore, its cumulative reward is almost asymptotically optimal, and for any value of the input parameter n, EXPLOIT-UCB-DOUBLE is almost regret-wise Pareto-optimal. This is formalized in the next theorem.

Theorem 2. For any sequence of policies π , under the assumptions of Proposition 6, the anytime policy EXPLOIT-UCB-DOUBLE (given in Algorithm 2) with input parameter n satisfies

sup F FK { 1,0,1} Reg F (πn π ) > 0 = inf F Reg F (πn π ) < (p EX)n B

1 (p EX)n B max k [K] µk.

The reason why EXPLOIT-UCB-DOUBLE with arbitrary n is only almost regret-wise Pareto-optimal is that, for a very large horizon T, the additional exploration of EXPLOIT-UCB-DOUBLE induces an additional risk of ruin of order (p EX)n B

1 (p EX)n B , which is of constant order if n is not allowed to depend on T. Yet, if n is allowed to depend on T, then we can drop the almost in the above theorem.

Published in Transactions on Machine Learning Research (08/2024)

Following Propositions 5 and 6, any n = o(T 1/4) such that n T + + is a valid choice. Yet, within that range, a larger n will increase t1 and hence, EXPLOIT-UCB-DOUBLE will behave longer like an EXPLOIT policy, undermining its cumulative regret in the long term. As a result, we recommend the subpolynomial (in T) n = log T, which also showed a better practical performance than other values of n (see the experiments in Section 8), and we will formulate our final theorems for the specific choice n = log T.

Theorem 3. Under the assumptions of Theorem 2, the (non-anytime) sequence of policies EXPLOIT-UCBDOUBLE with input parameter n = log T is regret-wise Pareto-optimal.

The proof of Theorems 2 and 3 is given in Appendix I.

Remark 7. In the above theorem, n depends on T and therefore, the sequence of policies is no longer anytime, as explained in Section 3. This is the price to pay to achieve the regret-wise Pareto-optimality.

Remark 8. When there exists a zero arm (see Definition 6), there exists a possibility that EXPLOITUCB-DOUBLE continues to pull that arm without increasing the budget or being ruined. We employed Assumption 3 to exclude this case but this assumption can be removed if we use a modified version of EXPLOIT-UCB-DOUBLE such that the exploration is relaunched at round t = log T. However, this modification requires the knowledge of T, losing the anytime property and this is beyond the scope of this work.

Finally, please note that Theorems 2 and 3 provide answers to the open problem by Perotto et al., 2019. Some discussion on the extent of the results described in this subsection is provided in the next subsection.

7.2 Discussion

In this subsection, we summarize the theoretical strengths and limitations of EXPLOIT-UCB-DOUBLE.

7.2.1 Cumulative Reward given No Ruin

EXPLOIT-UCB is an EXPLOIT policy, and as such, it has the advantage to be ruin-wise Pareto-optimal, but this comes at the cost of a budget-limited exploration. Because of this limited exploration, its cumulative reward is in general smaller than (1 p EX) maxk [K] µk T + o(T) (Proposition 3), which implies

max k [K] µk T E

Now, assume that you have to deploy a policy π for a critical application of the standard MAB (with no risk of ruin). Since the application is critical, you decide to apply a conservative strategy, and for that, you set some arbitrary B > 0 and you apply EXPLOIT-UCB (for any t τ(B, π), the choice of arms is arbitrary). Then, the above result shows that the policy π will likely have a linear regret, which is not satisfactory.

However, EXPLOIT-UCB-DOUBLE solves that issue by progressively increasing the exploration as its cumulative reward grows (Proposition 6). This is very important because it means that EXPLOIT-UCB-DOUBLE can be reasonably applied to a standard MAB setting and achieve a sublinear regret (in the sense of the standard MAB). This result is valid for any choice of input parameter n for EXPLOIT-UCB-DOUBLE.

7.2.2 Probability of Ruin and Regret-wise Pareto Optimality

As a matter of fact, the input parameter n controls the risk of ruin of EXPLOIT-UCB-DOUBLE. On the one hand, if n is chosen constant and independent of T, then EXPLOIT-UCB-DOUBLE is an anytime policy, but it is not regret-wise Pareto-optimal. This comes as no surprise, because it cannot compete against all the non-anytime sequences of policies. Yet, it is almost regret-wise Pareto-optimal, up to the term (p EX)n B

1 (p EX)n B maxk [K] µk, which is exponentially small in n and B (Theorem 2). On the other hand, if n = log T depends on T, then EXPLOIT-UCB-DOUBLE is not an anytime sequence of policies anymore, but it becomes regret-wise Pareto-optimal (Theorem 3).

From a practical perspective, how to choose the parameter n? Naturally, if the horizon T is known before the experiment, then Theorem 3 recommends the choice n = log T and provides good theoretical guarantees

Published in Transactions on Machine Learning Research (08/2024)

for it. Yet, if T 1010, then 1 n 23 and there are not so many candidates for a good choice of n. In practice, n = 1 is a decent choice and brings a decent reward (which is almost as good as n = log T). Experiments are given in the next subsection.

7.2.3 Anytime vs. Regret-wise Pareto Optimality

Is there an anytime policy which is regret-wise Pareto-optimal? As explained in Section 4.1, in the case of multinomial arms, unlike in the case of Bernoulli arms, the arm k P with the largest probability of survival, and the arm k E with the largest expectation, may be different. In that case, even the best policy which knows the arm distributions is not trivial. Intuitively, such a policy should pull arm k P until a certain round t to minimize the risk of ruin, and then pull arm k E until T, but t should depend on T. It is therefore intuitive that the best such policy is not anytime, and we raise the following open question: Conjecture 1. No anytime policy is regret-wise Pareto-optimal for all arm distributions in F{ 1,0,1}.

Please note that, as explained in Section 4.1, such a matter does not arise in the case of Bernoulli arm distributions, because in that case, k E = k P . The case of Bernoulli distributions does not capture this subtlety, and this is why we focused on the more difficult case of multinomial distributions in this paper.

7.2.4 Assumption 1 and Beyond

In this subsection, we explain why Assumption 1 is made, and we discuss on a potential variant of the S-MAB to the case where it does not hold.

Originally, the standard objective of the MAB is to maximize the expected cumulative reward Rew T . This also fits into many applications, including those in medicine testing and in finance which have been mentioned in the introduction. In the standard MAB without the risk of ruin, it is known (see, e.g., Auer et al., 2002) that successful strategies π achieve

Rew T (π) = max k [K] µk T + o(T).

In order to improve upon this bound, the expected cumulative regret Reg T was defined as the difference

Reg T (π) = max k [K] µk T Rew T (π).

The study of Reg T (π) yields the remaining part of the asymptotic development (in T) of Rew T , and it is known that some successful strategies achieve Reg T (π) = O(log T) (see, e.g., Auer et al., 2002).

In contrast, in the S-MAB, Assumption 1 ensures that the ruin may not occur. If we do not assume so, then the ruin occurs almost surely. At the time of ruin, the cumulative reward is equal to B by definition:

t=1 Xπt t = B

and therefore,

= P (τ(B, π) T) ( B) + (1 P (τ(B, π) > T)) E

τ(B, π) > T

When Assumption 1 does not hold, then any strategy π satisfies

P (τ(B, π) T) = 1 + o(1),

and hence, we can define a notion of regret as

B Rew T = (1 P (τ(B, π) T))

τ(B, π) > T

Published in Transactions on Machine Learning Research (08/2024)

Experiment No Budget Arms Experiment No Budget Arms (1) B = 9 {F (11), F (7), F (12)} (4) B = 9 {F (6), F (7), F (8)} (2) B = 9 {F (1), F (2), F (3)} (5) B = 30 {F (1), F (2), F (3)} (3) B = 9 {F (4), F (5), F (3)} (6) B = 30 {F (8), F (9), F (10)}

Table 1: Settings Considered

Naturally, minimizing the regret as defined above yields a bound which is of order smaller than constant. From an application perspective, this is hard to justify for many practitioners: for example, in the financial portfolio example mentioned in the introduction, this corresponds to trying to increase the gain by an amount of money negligible compared to the initial budget B and the horizon T. Yet, from a theoretical perspective, the above problem is very interesting, and will be the object of future work.

8 Experiments

In this section, we evaluate the empirical performance of the algorithms EXPLOIT-UCB and EXPLOITUCB-DOUBLE (for various parameters n) that we introduced in this paper.

8.1 Setting

Algorithms implemented: we have compared the performance of our algorithms EXPLOIT-UCB and EXPLOIT-UCB-DOUBLE (with parameters n {1, log T , 100}, where log T = 10) to the classic bandit algorithms UCB (Auer et al., 2002) and Multinomial Thompson Sampling (MTS), which is a generalization of Thompson sampling to multinomial rewards (Riou & Honda, 2020), and to the other existing S-MAB algorithms Gambler-UCB1 (Perotto et al., 2022) and GWA-UCB1 (Manome et al., 2023).

Important note: let us mention here that both Gambler-UCB1 and GWA-UCB1 are purely empirical, and have no theoretical guarantee. Moreover, they have only been tested for rewards in { 1, 0, 1}. Besides, we mention that there is a mistake in the definition of GWA-UCB1, making it not applicable to settings with (possibly) negative rewards, so we slightly modified the formula to shift the rewards from { 1, 1} to {0, 2}.

Setting: for all the experiments performed, we consider a bandit setting with horizon T = 20000 with K = 3 multinomial arms of common support { 1, 0, 1} and distributions F (i1), F (i2) and F (i3), where i1, i2, i3 {1, . . . , 12}. The distributions F (i) for i {1, . . . , 12} are described below:

F (1) = Mult(0.4, 0.12, 0.48); F (2) = Mult(0.04, 0.88, 0.08); F (3) = Mult(0.5, 0.1, 0.4); F (4) = Mult(0.48, 0, 0.52); F (5) = Mult(0.04, 0.91, 0.05); F (6) = Mult(0.45, 0, 0.55); F (7) = Mult(0.05, 0.85, 0.1); F (8) = Mult(0.5, 0, 0.5); F (9) = Mult(0.495, 0, 0.505); F (10) = Mult(0.049, 0.9, 0.051); F (11) = Mult(0.4, 0.1, 0.5); F (12) = Mult(0.6, 0, 0.4).

We further describe all the settings in Table 1. Please note that they have been chosen to emphasize on the technicalities that are specific to the S-MAB, and which do not appear in the standard MAB. For that purpose, we have chosen a low budget B = 9 in settings (1) (4), because the case of large budget resembles to the standard MAB. In many experiments, e.g. in experiment (1), the parameters of the arms have been chosen such that the arm k with the largest expectation µk and the arm k with the lowest probability of ruin, i.e. the largest γ(Fk ), are different. Furthermore, in many settings, the arm expectations are very close such that a good policy has to balance well the exploration in order to identify the arm with the largest expectation without causing the ruin.

Published in Transactions on Machine Learning Research (08/2024)

Evaluation metric: in order to evaluate the practical performance of the algorithms introduced in this paper, we define the survival regret as

S-Reg T (π) =

max k [K] µk T Rew T (π),

which is a hypothetical regret with respect to a policy that achieves the Pareto-optimal ruin probability and always pulls the arm with the highest expected reward given no ruin.

8.2 Results

The survival regret of the algorithms considered is given in Figure 4. All the curves are averages over 200 simulations. The corresponding average survival time min(T, τ(B, π)) of the algorithms is given in Table 2. The average proportion of ruins of the algorithms is given in Table 3.

EXPLOIT-UCB-DOUBLE performs the best. Over the experiments we conducted, it is quite clear that EXPLOIT-UCB-DOUBLE largely outperforms all the other tested policies in most settings. In the few cases where they do not (precisely when the budget is large, and hence, the S-MAB resembles a standard bandit problem), its regret is low and comparable to UCB.

MTS has a large regret. In contrast to the standard bandit problems, where MTS is optimal both theoretically and in practice (Riou & Honda, 2020), for the S-MAB, the regret of MTS is rather large, even larger than the one of UCB, as it was also noted in Perotto et al. (2022) and Manome et al. (2023). This lack of performance of MTS is due to frequent ruins: in experiments (1) (4) where the budget is low (B = 9), the proportion of ruins of MTS is 2 to 3 times as large as the one from any other algorithm considered. The explanation behind this phenomenon is that MTS is a Bayesian algorithm which has a randomized exploration component. While this randomization contributes to its success in the standard bandit problem, in the S-MAB, it is the source of frequent ruins and hence, drags its performance down considerably.

Gambler-UCB1 is consistently the worst. In almost all the experiments conducted, Gambler-UCB1 has the largest regret by far. Interestingly, its proportion of ruins is generally not so high. We recall that Gambler-UCB1 replaces log t by log Bt in the UCB selection policy. This clearly decreases the exploration at every time step, instead of increasing it to achieve the standard exploration-exploitation dilemma, undermining its performance. This is in line with the results of Perotto et al. (2022), which found out empirically Gambler-UCB1 performs less well than UCB.

UCB performs quite well. In many of the experiments above, UCB performs remarkably well, with a regret which comes close to the one of EXPLOIT-UCB-DOUBLE. This is not so surprising, because EXPLOIT-UCB-DOUBLE has a hyperparameter n, which controls its level of exploration: when n is large, it exploits more and behaves almost like an EXPLOIT policy; when n is low, it explores more, and behaves almost like UCB. In particular, when the budget is large like in experiments (5) and (6), the exploration does not need to be constrained so much, and UCB performs similarly to EXPLOIT-UCB-DOUBLE.

GWA-UCB1 is inconsistent. The regret of GWA-UCB1 is often very large (experiments (1), (2), (4) and (5)), and occasionally very low (experiments (3) and (6)). We recall that GWA-UCB1 has been defined as a policy in a parametric class of policies containing UCB, and that the parameters have been tuned empirically in Manome et al. (2023). The inconsistency of GWA-UCB1 is also not a surprising fact, because the regret bound in the S-MAB is in the sense of Pareto, suggesting that an algorithm performing too well in some setting would perform poorly in some other. Such a phenomenon is especially expected when hyperparameters are tuned empirically by hand, potentially creating the observed inconsistency.

Published in Transactions on Machine Learning Research (08/2024)

0 5000 10000 15000 20000 Rounds

Survival Regret

UCB MTS EXPLOIT-UCB E.-U.-D. w/ n=log T E.-U.-D. w/ n=1 E.-U.-D. w/ n=100 GAMBLER-UCB1 GWA-UCB1

(a) Setting (1)

0 5000 10000 15000 20000 Rounds

Survival Regret

UCB MTS EXPLOIT-UCB E.-U.-D. w/ n=log T E.-U.-D. w/ n=1 E.-U.-D. w/ n=100 GAMBLER-UCB1 GWA-UCB1

(b) Setting (2)

0 5000 10000 15000 20000 Rounds

Survival Regret

UCB MTS EXPLOIT-UCB E.-U.-D. w/ n=log T E.-U.-D. w/ n=1 E.-U.-D. w/ n=100 GAMBLER-UCB1 GWA-UCB1

(c) Setting (3)

0 5000 10000 15000 20000 Rounds

Survival Regret

UCB MTS EXPLOIT-UCB E.-U.-D. w/ n=log T E.-U.-D. w/ n=1 E.-U.-D. w/ n=100 GAMBLER-UCB1 GWA-UCB1

(d) Setting (4)

0 5000 10000 15000 20000 Rounds

Survival Regret

UCB MTS EXPLOIT-UCB E.-U.-D. w/ n=log T E.-U.-D. w/ n=1 E.-U.-D. w/ n=100 GAMBLER-UCB1 GWA-UCB1

(e) Setting (5)

0 5000 10000 15000 20000 Rounds

Survival Regret

UCB MTS EXPLOIT-UCB E.-U.-D. w/ n=log T E.-U.-D. w/ n=1 E.-U.-D. w/ n=100 GAMBLER-UCB1 GWA-UCB1

(f) Setting (6)

Figure 4: Survival Regret of EXPLOIT-UCB-DOUBLE, EXPLOIT-UCB, UCB, MTS, Gambler-UCB1 and GWA-UCB1, in the settings (1) (6).

Published in Transactions on Machine Learning Research (08/2024)

Setting UCB MTS EX-UCB EX-D(log T) EX-D(1) EX-D(100) Gambling GWA-UCB1 (1) 17917 14541 18806 18309 18508 18506 19303 19503 (2) 18215 13352 18904 17911 18909 19603 18211 19010 (3) 8244 5790 12862 8858 9156 10663 15059 12989 (4) 18512 16326 18709 19107 18805 18015 18755 18711 (5) 19903 20000 20000 20000 20000 20000 19215 20000 (6) 10290 9644 12735 11751 10621 12721 15053 14980

Table 2: Average Survival Time.

Setting UCB MTS EX-UCB EX-D(log T) EX-D(1) EX-D(100) Gambling GWA-UCB1 (1) 0.10 0.27 0.06 0.08 0.07 0.07 0.04 0.03 (2) 0.09 0.33 0.05 0.10 0.05 0.02 0.09 0.05 (3) 0.60 0.72 0.36 0.57 0.56 0.48 0.26 0.36 (4) 0.07 0.18 0.06 0.04 0.06 0.10 0.07 0.07 (5) 0.01 0 0 0 0 0 0.04 0 (6) 0.64 0.64 0.46 0.55 0.60 0.49 0.35 0.36

Table 3: Proportion of Ruin of the Algorithms.

The hyperparameter in EXPLOIT-UCB-DOUBLE does not change much in practice. While fixed values of n do not offer the theoretical guarantees of Theorem 3, in practice, the value of n does not have much impact on the practical performance of EXPLOIT-UCB-DOUBLE. Actually, a good value for n being of the order of log T, choosing a small 1 n 10 is likely to give a good performance for relatively small horizon T 1010. In contrast, for very large horizons (which may happen in some of the financial applications of the S-MAB mentioned in the introduction), e.g., T 10100, we expect the parameter n to have a more determining impact on the regret.

The case of large budget. In the case of a large initial budget (experiments (5) and (6)), all the policies with strong theoretical guarantees (UCB, MTS, EXPLOIT-UCB-DOUBLE and EXPLOIT-UCB-DOUBLE) perform rather well. In experiment (5), GWA-UCB1 and Gambler-UCB1 seem to have a linear regret, while all the other policies have a much lower regret. Among them, MTS clearly has the lowest regret. This was predictable: when the budget is large, the problem resembles to a standard bandit problem, for which MTS is known to be theoretically and practically optimal. Besides, the regret, as well as the proportion of ruins and the average time of ruin of UCB, EXPLOIT-UCB ad EXPLOIT-UCB-DOUBLE are very similar. This comes as no surprise, because when the budget is large, then the constraints of the EXPLOIT framework is loose, and all of those algorithms behave like UCB most of the time. In experiment (6), the setting is designed such that the arms are extremely hard to distinguish: the arm expectations are very similar and close to 0. In that case, even the large initial budget does not preclude ruin, and all the algorithms suffer from a large regret.

9 Extension to the Case of Non-integer Rewards

This section is structured in two subsections:

1. first, we generalize our results on the probability of ruin to the case of rewards in [ 1, 1];

2. secondly, we generalize our policies to the case of rewards in [ 1, 1] and explain the challenges in deriving regret-wise and ruin-wise Pareto-optimality-type theoretical guarantees.

Published in Transactions on Machine Learning Research (08/2024)

9.1 Generalized Results on the Probability of Ruin

Let F[ 1,1] be the set of distributions bounded in [ 1, 1] which are not positive or zero (see Definition 6). Theorem 1 can be generalized as follows.

Proposition 7. Let (αk)k [K] be as in Theorem 1. For any policy π,

inf F FK [ 1,1]

P(F1,...,FK)(τ(B, π) < ) exp

k=1 αkγ(Fk)

= sup F FK [ 1,1]

P(F1,...,FK)(τ(B, π) < ) exp

k=1 αkγ(Fk)

The proof of Proposition 7 follows the proof of Theorem 1, except for the subadditivity argument, and it is given in Appendix D, next to the proof of Theorem 1.

The above bound is most likely not tight, because the subadditivity argument that we used is not tight. Actually, this is the main technicality which separates the general case from the multinomial case. In the multinomial case, it is known that at the time of ruin, the cumulative reward is exactly B . In the general case, however, the cumulative reward at the time of ruin is stochastic and depends on the arm distributions as well as the policy π.

Similarly to the case of rewards in { 1, 0, 1}, we state a lemma providing an insightful interpretation of the bound (13). While in the case of rewards in { 1, 0, 1}, Lemma 1 gave an easy expression of γ(Fk), in the general case, it is expressed as a limit.

Lemma 3. For any k [K],

Fk F[ 1,1], 1 B log PFk(τ(B, k) < ) lim inf B 1 B log PFk(τ(B , k) < ) = γ(Fk).

This lemma means that γ(Fk) can be related to the probability of ruin PFk(τ(B, k) < ) of the stochastic process with increments i.i.d. from Fk.

9.2 Generalized Results on the Regret

We first generalize the EXPLOIT framework to rewards in [ 1, 1], and then give the results on EXPLOITUCB-DOUBLE based on the new EXPLOIT framework.

9.2.1 Generalization of the EXPLOIT Framework

For the sake of clarity, we generalize the framework EXPLOIT(B1, . . . , BK) only in the case B1 = = BK = B

K (but we could similarly generalize it to any (B1, . . . , BK)), and we will refer to this generalization as EXPLOIT. Let (Y s k )s 1 be the rewards from arm k [K]. The main problem in the general case is that the sum of rewards is not necessarily an integer, and hence if you pull arm k until the first round t such that Pt s=1 Y s k B

K , then we have

s=1 Y s k = B

K κ, with κ [0, 1).

If κ is large, then this will reduce the budget of the other arms. To remedy that issue, we conservatively decide to stop the exploration when Pt s=1 Y s k < B

K +1, so that each arm has a budget share between B

K . This is formalized as follows.

Definition 9. For any k [K], denoting by (Y s k )s 1 the rewards from arm k, let τ < k := inf t 1 : Pt s=1 Y s k < B

K + 1 . We say that a policy π belongs to the EXPLOIT framework if:

Published in Transactions on Machine Learning Research (08/2024)

- for any t < PK k=1 τ < k , πt n k [K] : Pt s=1 1πs=k < τ < k o , and

- for any t PK k=1 τ < k , πt = arg max n Pt s=1 Xπs s 1πs=k o (in case of tie, πt is the smallest arm index).

We decompose B = n KK + b for an integer n K and 0 b < K, and let αk(B) n K+1k+1<b

B for any k [K], which we denote by αk in the absence of ambiguity on the initial budget B. Similarly to the multinomial case, all the distributions in EXPLOIT have the same probability of ruin p EX B

K , . . . , B

K . However, in the general case, this value is not deterministic. Yet, it can be bounded easily following the definition of EXPLOIT. Proposition 8. It holds that

k=1 αk(B)γ(Fk)

K , . . . , B

The RHS of Proposition 8 does not match the lower bound of Proposition 7. However, if we apply it to the initial budget B = B + K + 1, it yields

K , . . . , B

k=1 αk(B + 1)γ(Fk)

which coincides with the lower bound in Proposition 7. Therefore, the looseness of the ruin probability in the general case (w.r.t. the bound of Proposition 7) corresponds to a budget loss of at most K + 1.

9.2.2 Generalization of EXPLOIT-UCB-DOUBLE

We define EXPLOIT-UCB and EXPLOIT-UCB-DOUBLE based on the newly defined EXPLOIT framework for rewards in [ 1, 1]. The pseudo-code for EXPLOIT-UCB-DOUBLE remains the same as in the multinomial case and is given in Algorithm 2. Interestingly, most of the results can be extended from the multinomial case to the general case. Proposition 9. Let πn be the policy associated to the policy EXPLOIT-UCB-DOUBLE with input parameter n. Its probability of ruin is upper bounded by

P (τ(B, πn) < ) p EX + (p EX)n B

1 (p EX)n B .

The cumulative reward given no ruin of EXPLOIT-UCB-DOUBLE is bounded from below by

t=1 Xπn t t 1τ(B,πn) t 1

max k [K] µk T + o(T).

As a consequence of the above results, it holds that, for any sequence of policies π ,

sup F FK { 1,0,1} Reg F (πn π ) > 0 = inf F Reg F (πn π ) < (p EX)n B

1 (p EX)n B max k [K] µk.

As explained before, the main difficulty is that in general, for any arm k, Pτ < k s=1 Y s k is not deterministic (using the notations of Definition 8). As a result, even for EXPLOIT policies, the probability of ruin cannot be decomposed as a product of independent ruin probabilities of the arms. The same reason leads to the looseness in the subadditivity bound (see Lemma 7), leading to the (B + 1) factor in the bound of Proposition 7 instead of B. For that reason, we believe that the bound of Proposition 7 is not tight, and the question of the regret-wise Pareto-optimality of EXPLOIT-UCB-DOUBLE in the general case is open.

As a final point of consideration, please note that in the general case as well, the reward given no ruin of EXPLOIT-UCB-DOUBLE (see Proposition 6) is asymptotically optimal and equal to maxk [K] µk T, which makes it worth applying to more standard bandit settings where an algorithm with a stronger exploitation component is desired.

Published in Transactions on Machine Learning Research (08/2024)

10 Conclusion

In this paper, we introduced the S-MAB, an extension of the MAB with a risk of ruin, which naturally follows from many practical applications but is considerably more difficult to study. For example, contrary to the MAB, no policy can achieve a sublinear regret in the standard sense, because every single pull increases considerably the probability of ruin. Our contributions are threefold:

- we formally defined the problem, defined the objective to achieve with the regret-wise Paretooptimality and introduced the key notion to our problem with the time of ruin and the probability of ruin. Furthermore, we explained how an optimal policy needs to minimize the probability of ruin while at the same time maximize the cumulative reward given no ruin, which are two concepts in apparent opposition;

- we studied the probability of ruin, on which we provided both a lower bound (Theorem 1) and policies achieving this lower bound (EXPLOIT policies);

- using a doubling trick over an EXPLOIT policy, we derived a policy which is almost regret-wise Pareto-optimal, and can be made exactly Pareto-optimal if the policy knows the horizon before starting the procedure. This provides an answer to an open problem from COLT 2019 (see Perotto et al., 2019).

Along the way, we raised several open questions which we keep for future work: in the case of integer rewards, is there a policy which is regret-wise Pareto-optimal and does not depend on the horizon? In the general case of bounded rewards, most of our results extend, except that the lower bound on the probability of ruin is seemingly not tight and we did not prove that EXPLOIT-UCB-DOUBLE is regret-wise Pareto-optimal with n = log T. Can we improve upon those?

This is a fairly new and yet unexplored problem, but we believe that it is very rich and paves the way to a myriad of new questions.

Acknowledgments

JH was supported by JSPS, KAKENHI Grant Number JP21K11747, Japan. MS was supported by Institute for AI and Beyond, UTokyo. The authors would like to thank anonymous reviewers for their insightful reviews.

Published in Transactions on Machine Learning Research (08/2024)

Y. Abbasi-Yadkori, D. Pal, and C. Szepesvari. Improved algorithms for linear stochastic bandits. In Neur IPS, pp. 2312 2320, 2011.

S. Agrawal and N.R. Devanur. Linear contextual bandits with knapsacks. In Neur IPS, pp. 3458 3467, 2016.

S. Agrawal and N. Goyal. Analysis of thompson sampling for the multi-armed bandit problem. In COLT, pp. 1 39, 2012.

P. Auer, N. Cesa-Bianchi, Y. Freund, and R. Schapire. The nonstochastic multiarmed bandit problem. SIAM, 32(1):48 77, 1995.

P. Auer, N. Cesa-Bianchi, and P. Fischer. Finite-time analysis of the multiarmed bandit problem. Machine Learning, 47(2):235 256, 2002.

M. Aziz, E. Kaufmann, and M.-K. Riviere. On multi-armed bandit designs for dose-finding clinical trials. Journal of Machine Learning Research, 22:1 38, 2021.

A. Badanidiyuru, R. Kleinberg, and A. Slivkins. Bandits with knapsacks. In FOCS, pp. 1 55, 2013.

S. Bubeck and N. Cesa-Bianchi. Regret analysis of stochastic and nonstochastic multi-armed bandit problems. Foundations and Trends in Machine Learning, 5, 2012.

A. N. Burnetas and M. N. Katehakis. Optimal adaptive policies for sequential allocation problems. Advances in Applied Mathematics, 17(2):122 142, 1996.

O. Cappé, A. Garivier, O. Maillard, R. Munos, and G. Stoltz. Kullback-Leibler upper confidence bounds for optimal sequential allocation. The Annals of Statistics, 41(3):1516 1541, 2013.

A. Cassel, S. Mannor, and A. Zeevi. A general approach to multi-armed bandits under risk criteria. In COLT, pp. 1295 1306, 2018.

S. Cayci, A. Eryilmaz, and R. Srikant. Budget-constrained bandits over general cost and reward distributions. In AISTATS, pp. 4388 4398, 2020.

N. Cesa-Bianchi and G. Lugosi. Prediction, Learning, and Games. Cambridge University Press, 2006.

N. Cesa-Bianchi and G. Lugosi. Combinatorial bandits. Journal of Computer and Science Systems, pp. 1404 1422, 2012.

N. Cesa-Bianchi and F. Orabona. Online Learning Algorithms. Annual Review of Statistics and Its Application, 2021.

O. Chapelle and L. Li. An empirical evaluation of thompson sampling. In Neur IPS, pp. 2249 2257, 2011.

T. Chen, A. Gangrade, and V. Saligrama. Strategies for safe multi-armed bandits with logarithmic regret and risk. In ICML, pp. 3123 3148, 2022.

R. Combes, C. Jiang, and S. Rayadurgam. Bandits with budgets: Regret lower bounds and optimal algorithms. In ACM SIGMETRICS Performance Evaluation Review, pp. 245 257, 2015.

A. Dembo and O. Zeitouni. Large Deviation Techniques and Applications. Springer Science, 2009.

W. Ding, T. Qin, X-D. Zhang, and T-Y. Liu. Multi-armed bandit with budget constraint and variable costs. In AAAI, pp. 232 238, 2013.

US Food and Drugs Administration. Adaptive designs for clinical trials of drugs and biologics. Guidance for Industry, 2019.

N. Galichet, M. Sebag, and O. Teytaud. Exploration vs exploitation vs safety: Risk-aware multi-armed bandits. In ACML, pp. 245 260, 2013.

Published in Transactions on Machine Learning Research (08/2024)

E. Garcelon, M. Ghavamzadeh, A. Lazaric, and M. Pirotta. Improved algorithms for conservative exploration in bandits. In AAAI, 2020.

A. Garivier, E. Kaufmann, and T. Lattimore. On explore-then-commit strategies. In Neur IPS, pp. 784 792, 2016.

Y. Han, J. Zeng, Y. Wang, Y. Xiang, and J. Zhang. Optimal contextual bandits with knapsacks under realizability via regression oracles. In AISTATS, pp. 5011 5035, 2023.

M. Hoffman, E. Brochu, and N. de Freitas. Portfolio allocation for bayesian optimization. In UAI, pp. 327 336, 2011.

X. Huo and F. Fu. Risk-aware multi-armed bandit problem with application to portfolio selection. Royal Society open science, 2017.

N. Immorlica, K. Sankararaman, R. Schapire, and A. Slivkins. Adversarial bandits with knapsacks. In FOCS, pp. 202 219, 2019.

A. Kazerouni, M. Ghavamzadeh, Y. Abbasi Yadkori, and B. Van Roy. Conservative contextual linear bandits. In Neur IPS, pp. 3910 3919, 2017.

Thomas Kesselheim and Sahil Singla. Online learning with vector costs and bandits with knapsacks. In Jacob Abernethy and Shivani Agarwal (eds.), Proceedings of Thirty Third Conference on Learning Theory, volume 125 of Proceedings of Machine Learning Research, pp. 2286 2305. PMLR, 2020.

N. Korda, E. Kaufmann, and R. Munos. Thompson sampling for 1-dimensional exponential family bandits. In Neur IPS, pp. 1448 1456, 2013.

R. Kumar and R. Kleinberg. Non-monotonic resource utilization in the bandits with knapsacks problem. In S. Koyejo, S. Mohamed, A. Agarwal, D. Belgrave, K. Cho, and A. Oh (eds.), Advances in Neural Information Processing Systems, volume 35, pp. 19248 19259. Curran Associates, Inc., 2022.

T. L. Lai and H. Robbins. Asymptotically efficient adaptive allocation rules. Advances in Applied Mathematics, 6:4 22, 1985.

T. Lattimore and C. Szepesvari. Bandit Algorithms. Cambridge University Press, 2020.

Z. Li and G. Stoltz. Contextual bandits with knapsacks for a conversion model. In Advances in Neural Information Processing Systems, volume 35, pp. 35590 35602, 2022.

Q. Liu, W. Xu, S. Wang, and Z. Fang. Combinatorial bandits with linear constraints: Beyond knapsacks and fairness. In S. Koyejo, S. Mohamed, A. Agarwal, D. Belgrave, K. Cho, and A. Oh (eds.), Advances in Neural Information Processing Systems, volume 35, pp. 2997 3010. Curran Associates, Inc., 2022a.

S. Liu, J. Jiang, and X. Li. Non-stationary bandits with knapsacks. In Advances in Neural Information Processing Systems, volume 35, pp. 16522 16532, 2022b.

A. Locatelli, M. Gutzeit, and A. Carpentier. An optimal algorithm for the thresholding bandit problem. In ICML, pp. 1690 1698, 2016.

O-A. Maillard. Robust risk-averse stochastic multi-armed bandits. In Algorithmic Learning Theory, pp. 218 233, 2013.

N. Manome, S. Shinohara, and U.I. Chung. Simple modification of the upper confidence bound algorithm by generalized weighted averages. In ar Xiv, 2023.

F.S. Perotto, M. Bourgais, B.C. Silva, and L. Vercouter. Open problem: Risk of ruin in multiarmed bandits. In COLT, pp. 3194 3197, 2019.

Published in Transactions on Machine Learning Research (08/2024)

F.S. Perotto, X. Pucel, and J.-L. Farges. Time is budget: A heuristic for reducing the risk of ruin in multi-armed gambler bandits. In International Conference on Innovative Techniques and Applications of Artificial Intelligence, pp. 346 352, 2022.

A. Rangi, M. Franceschetti, and L. Tran-Thanh. Unifying the stochastic and the adversarial bandits with knapsack. In IJCAI, pp. 3311 3317, 2019.

C. Riou and J. Honda. Bandit algorithms based on thompson sampling for bounded reward distributions. In Algorithmic Learning Theory, pp. 777 826, 2020.

A. Sani, A. Lazaric, and R. Munos. Risk aversion in multi armed bandits. In Neur IPS, pp. 3284 3292, 2012.

K. A. Sankararaman and A. Slivkins. Combinatorial semi-bandits with knapsacks. In AISTATS, pp. 1760 1770, 2018.

K. A. Sankararaman and A. Slivkins. Bandits with knapsacks beyond the worst case. In Neur IPS, pp. 23191 23204, 2021.

W. Shen and J. Wang. Portfolio blending via thompson sampling. In IJCAI, pp. 1983 1989, 2016.

W. Shen, J. Wang, Y.-G. Jiang, and H. Zha. Portfolio choices with orthogonal bandit learning. In IJCAI, pp. 974 980, 2015.

Robert H. Shmerling. Are early detection and treatment always best? In Harvard Health Publishing, 2021.

D. Sinha, K.A. Sankararaman, A. Kazerouni, and V. Avadhanula. Multi-armed bandits with cost subsidy. In AISTATS, pp. 3016 3024, 2021.

V. Sivakumar, S. Zuo, and A. Banerjee. Smoothed adversarial linear contextual bandits with knapsacks. In ICML, pp. 20253 20277, 2022.

C. Tao, S. Blanco, J. Peng, and Y. Zhou. Thresholding bandit with optimal aggregate regret. In Neur IPS, pp. 1602 1610, 2019.

L. Tran-Thanh, A. Chapman, E. Munoz De Cote, A. Rogers, and N. Jennings. ϵ first policies for budget limited multi-armed bandits. In AAAI, pp. 1211 1216, 2010.

L. Tran-Thanh, A. Chapman, A. Rogers, and N. Jennings. Knapsack based optimal policies for budgetlimited multi-armed bandits. In AAAI, pp. 1134 1140, 2012.

S. Vakili and Q. Zhao. Risk-averse multi-armed bandit problems under mean-variance measure. Signal Processing, 10:1093 1111, 2016.

S. Villar, J. Bowden, and J. Wason. Multi-armed bandit models for the optimal design of clinical trials: Benefits and challenges. Statistical science, 30:199 215, 2015.

J. von Neumann and O. Morgenstern. Theory of Games and Economic Behavior. Princeton Classic Editions, 1944.

C.C. Wang, S. R. Kulkarni, and H. V. Poor. Bandit problems with side observations. IEEE, pp. 338 355, 2005.

Huasen Wu, R. Srikant, Xin Liu, and Chong Jiang. Algorithms with logarithmic or sublinear regret for constrained contextual bandits. In C. Cortes, N. Lawrence, D. Lee, M. Sugiyama, and R. Garnett (eds.), Advances in Neural Information Processing Systems, volume 28, pp. 433 -441. Curran Associates, Inc., 2015.

Y. Wu, R. Shariff, T. Lattimore, and C. Szepesvari. Conservative bandits. In ICML, pp. 1254 1262, 2016.

Y. Xia, W. Ding, X-D. Zhang, N. Yu, and T. Qin. Budgeted bandit problems with continuous random costs. In ACML, pp. 317 332, 2015a.

Published in Transactions on Machine Learning Research (08/2024)

Y. Xia, H. Li, T. Qin, N. Yu, and T-Y. Liu. Thompson sampling for budgeted multi-armed bandits. In IJCAI, pp. 3960 3966, 2015b.

Y. Xia, T. Qin, W. Ma, N. Yu, and T-Y. Liu. Budgeted multi-armed bandits with multiple plays. In IJCAI, pp. 773 818, 2016.

Y. Xia, T. Qin, W. Ding, H. Li, X-D. Zhang, N. Yu, and T-Y. Liu. Finite budget analysis of multi-armed bandit problems. Neurocomputing, 258:13 29, 2017.

D. Zhou and C. Tomlin. Budget-constrained multi-armed bandits with multiple plays. In AAAI, pp. 4572 4579, 2018.

Q. Zhu and V. Tan. Thompson sampling algorithms for mean-variance bandits. In ICML, pp. 11599 11608, 2020.

A. Zimin, R. Ibsen-Jensen, and K. Chatterjee. Generalized risk-aversion in stochastic multi-armed bandits. In ar Xiv, 2014.

Published in Transactions on Machine Learning Research (08/2024)

A General Classic Results

A.1 A Classic Lemma in the Theory of Stochastic Processes

The following result is classic in the theory of stochastic processes and we state it without a proof. It will be used throughout the paper.

Lemma 4. Consider the setting of K 1 arms of multinomial distributions F1, . . . , FK (of common support { 1, 0, 1}). Let B > 0 be a positive integer. Then,

P(τ(B, k) < ) = min

1, PX Fk(X = 1)

PX Fk(X = 1)

B Proof of the Linearity of the Classic Regret (Proposition 1)

In this appendix, we prove a slightly stronger version of Proposition 1, namely that Proposition 1 even holds in the case where the supremum on F is taken on Bernoulli arm distributions of support { 1, 1}. Let FK { 1,1} be the set of K-tuples of Bernoulli arm distributions of support { 1, 1}.

Proposition 10. Assume that the initial budget B > 2. For any policy π, it holds that

sup F FK { 1,1} sup π Reg F (π π) > 0.

Proof. Let π = (πT )T 1 be a sequence of policies, and let k0 [K] such that there are infinitely many T 1 such that P(πT 1 = k0) 1

We denote by S the set of all such T 1:

S := T 1 : P(πT 1 = k0) 1

and we note that |S| = . We then let F1, . . . , FK be the Bernoulli arm distributions, with respective parameters p1, . . . , p K such that k0 is the only optimal arm:

pk0 > max k =k0 pk.

W.l.o.g., we can assume that k0 = 1. Recall that, by definition, for any k [K],

pk = PX Fk(X = +1) = 1 PX Fk(X = 1).

Let := 1 p1

p1 . Denoting by Rew T (1) the reward of the (optimal) policy πt = 1 for any t 1, it holds that

Rew T (1) Rew T (πT ) = E

t=1 X1 t 1τ(B,1) t 1

t=1 Xπt t 1τ(B,π) t 1

Let us then compute the cumulative expected reward of the policy pulling only arm 1:

t=1 X1 t 1τ(B,1) t 1

t=1 P (τ(B, 1) t 1)

= µ1P (τ(B, 1) T) T + o(T)

= µ1P (τ(B, 1) = ) T + o(T).

Published in Transactions on Machine Learning Research (08/2024)

Then, using Lemma 4, we deduce

t=1 X1 t 1τ(B,1) t 1

= µ1 1 B T + o(T)

= (2p1 1) 1 B T + o(T). (15)

The cumulative reward of πT is upper-bounded by

t=1 XπT t t 1τ(B,πT ) t 1

t=2 XπT t t 1τ(B,πT ) t 1

t=2 X1 t 1τ(B,πT ) t 1

= E h XπT 1 1 i + E

t=2 X1 t 1 s t 1,B+Ps

r=1 X πT r r >0

= E h XπT 1 1 i + P XπT 1 1 = 1 E

t=2 X1 t 1 s t 1,B+Ps

r=1 X πT r r >0

XπT 1 1 = 1

+ 1 P XπT 1 1 = 1 E

t=2 X1 t 1 s t 1,B+Ps

r=1 X πT r r >0

XπT 1 1 = 1

= E h XπT 1 1 i + P XπT 1 1 = 1 E

t=2 X1 t 1 s t 1,B+1+Ps

+ 1 P XπT 1 1 = 1 E

t=2 X1 t 1 s t 1,B 1+Ps

But then, using (15), we know that

t=2 X1 t 1 s t 1,B+1+Ps

= (2p1 1) 1 B +1 T + o(T)

t=2 X1 t 1 s t 1,B 1+Ps

= (2p1 1) 1 B 1 T + o(T).

Therefore, we can deduce that the agent s total cumulative reward is upper-bounded by

t=1 Xπt t 1 s t 1,B+Ps

r=1 X πT r r >0

P XπT 1 1 = 1 1 B +1 + 1 P XπT 1 1 = 1 1 B 1 + o(T).

We then decompose

P XπT 1 1 = 1 =

k=1 P πT 1 = k P XπT 1 1 = 1

k=1 pk P πT 1 = k ,

Published in Transactions on Machine Learning Research (08/2024)

and by (14), if T S, then

P XπT 1 1 = 1 1

max k =1 pk < p1.

Now, please note that the function

p 7 p 1 B +1 + (1 p) 1 B 1

is strictly increasing on [0, p1] and equal to 1 B at p = p1. As a result, there exists δ > 0 such that, for any T S,

t=1 XπT t t 1 s t 1,B+Ps

r=1 X πT r r >0

(2p1 1)T 1 B δ + o(T)

t=1 X1 t 1τ(B,1) t 1

(2p1 1)δT + o(T).

By definition of the regret, this implies

sup π Reg F (π π) (2p1 1)δ,

concluding the proof of the proposition.

Remark 9. In the proof of this proposition we considered an instance such that p1 < 1 so that > 0. In the case p1 = 1 or more generally when there exists a positive arm, it is possible to achieve zero regret if the initial budget is B > K 1 with, for instance, the policy π which performs the classic bandit algorithm UCB (see, e.g., Bubeck & Cesa-Bianchi, 2012) when Bt 1 > K 1, and pulls one of the arms which has not yielded a negative reward otherwise.

C Properties of KL Divergence

In this appendix we prepare lemmas on the KL divergence used for the analysis of the ruin probability. Lemma 5. Let P F[ 1,1] be a distribution over [ 1, 1] with a positive expectation and let Λ(λ) := log E eλX be its logarithmic moment-generating function. Then, there exists λ < 0 such that Λ(λ ) = 0 and it satisfies

inf Q:EX Q[X]<0 KL(Q P) EX Q[ X] = λ .

Proof. First we have lim λ Λ(λ) =

since P[X < 0] > 0 because the distribution P is not positive or zero by definition of FK [ 1,1]. Therefore, by the continuity of Λ(λ) there exists λ < 0 such that Λ(λ ) = 0 since Λ(0) = 0 and Λ (0) = EX P [X] > 0.

Now, let Λ (x) := sup λ {λx Λ(λ)}

be the Fenchel-Legendre transform of Λ(λ) and define x = Λ (λ ). Then we have Λ (x ) = λ x Λ(λ ) = λ x . Therefore,

inf x<0 Λ (x)

Published in Transactions on Machine Learning Research (08/2024)

On the other hand,

inf x<0 Λ (x)

x = inf x<0 sup λ

inf x<0 λ x Λ(λ )

= inf x<0 λ x x = λ .

Therefore we see that

inf x<0 Λ (x)

It is well-known as the relation between Cramer s theorem and Sanov s theorem (see, e.g., Dembo & Zeitouni, 2009) that for x < EX P [X], Λ (x) = inf Q:EX Q[X] x KL(Q P),

which concludes the proof of the lemma.

Lemma 6. Let Q be an arbitrary distribution such that EX Q[X] < 0 and fix ϵ > 0. Then, there exists P such that EX P [X] > 0 and

KL(Q P) EX Q[ X] inf Q :EX Q [X]<0 KL(Q P) EX Q [ X](1 + ϵ). (16)

Proof. Let p 0, min n EX Q[ X] 1+EX Q[ X], 1 exp (EX Q[X])2

o , and let Qp = (1 p)Q + pδ{1} be the

mixture of Q and the point mass at X = 1. Let Λp(λ) = log EX Qp[eλX] be the logarithmic momentgenerating function of Qp and λ > 0 be such that Λp(λ ) = 0. Such λ exists and satisfies Λ p(λ ) > 0 since

Λp(0) = 0 and Λ p(0) = (1 p)EX Q[X] + p < 0 by p EX Q[ X] 1+EX Q[ X].

Let Pp be the distribution such that d Pp/d Qp(x) = eλ x Λp(λ ) = eλ x. Then we have EX Pp[X] = Λ p(λ ) > 0. Here note that EX Pp[e λ X] = EX Qp[e λ Xeλ X Λp(λ )] = 1. Therefore, by Lemma 5 we have

inf Q :EX Q [X]<0 KL(Q Pp) EX Q [ X] = λ . (17)

On the other hand,

KL(Q Pp) = EX Q

1X<1 log d Q

d Pp (X) + 1X=1 log d Q

1X<1 log 1 1 p d Qp

d Pp (X) + Q(X = 1) log Q(X = 1)

log 1 1 p + EX Q

1X<1 log d Qp

d Pp (X) + Q(X = 1) log Q(X = 1) Qp(X = 1)eλ 1

= log 1 1 p + EX Q [1X<1( λ X)] Q(X = 1)λ 1 + Q(X = 1) log Q(X = 1)

= log 1 1 p + λ EX Q [ X] + Q(X = 1) log Q(X = 1)

log 1 1 p + λ EX Q [ X] ,

Published in Transactions on Machine Learning Research (08/2024)

which, combined with (17), implies that

KL(Q Pp) EX Q[ X] inf Q :EX Q [X]<0 KL(Q P) EX Q [ X] + log 1 1 p EX Q[ X]. (18)

Comparing (17) and (18) with (16), we see that it is sufficient to show that

log 1 1 p EX Q[ X] ϵλ (19)

to show (16). Note that we obtain from Pinsker s inequality that

λ KL(Q Pp) log 1 1 p EX Q[ X]

2 2 log 1 1 p EX Q[ X]

2 log 1 1 p EX Q[ X] ,

recalling that Pp and Q are supported over [ 1, 1], and have positive and negative expectations, respectively. Therefore we obtain (19) since

1 λ log 1 1 p EX Q[ X] log 1 1 p (EX Q[X])2

2 log 1 1 p ϵ,

where the last inequality follows from since p 1 exp (EX Q[X])2

D Detailed Proof of the Lower Bound on the Probability of Ruin (Theorem 1 and Proposition 7)

In this section, we give a detailed proof of Theorem 1 and Proposition 7. The proof of the lower bound both in the case of multinomial arms of support { 1, 0, 1} (Theorem 1) and in the general case of rewards bounded in [ 1, 1] (Proposition 7) stems from the asymptotic lower bound, which is common to both aforementioned cases and is given in Theorem 4. The passage from the asymptotic to the non-asymptotic bound relies on sub-additivity properties, which is given in Lemma 7 and for which formulas differ depending on the case considered.

If for all the arms k [K],

inf Qk:EX Qk [X]<0 KL(Qk Fk) EX Qk[ X] = 0,

then the result becomes trivial. This is why we are going to make the following assumption in the proof:

Assumption 4. There exists an arm k [K] such that P(τ(B, k) = ) > 0.

D.1 Details of the Proof of Lemma 2

In this subsection, we provide the justification for

lim B + Q(Hτ T(Q)) = 1,

which was omitted in the main text.

Published in Transactions on Machine Learning Research (08/2024)

For any t 1 and any n = (n1, . . . , n K) such that PK k=1 nk = t, we introduce the following random events:

U(n, t) := n PK k=1 nk KL(Qk Fk) Pnk m=1 log d Qk

d Fk (ym k ) t

V (n, t) := n PK k=1 (nk EX Qk[X] Pnk m=1 ym k ) t Q

W(n, t) := n PK k=1 Pnk m=1 ym k t Q

Let ht be a realization of Hτ. Then, please note that the probability of the event {ht T(Q)} is uniformly bounded independently of the policy π by the probability of the following event:

n = (n1, . . . , n K) s.t.

k=1 nk = t : U(n, t), V (n, t), W(n, t).

For any k [K], let

dk := max y1 [ 1,1] log d Qk

d Fk (y1) min y2 [ 1,1] log d Qk

d Fk (y2) and D := max k [K] dk.

Then, a direct application of Hoeffding s inequality gives the bounds

Q(U(n, t)c) 2 exp 2t D

Q(V (n, t)c) 2 exp

Q(W(n, t)c) exp

Let C := max n D

o , this implies

max {Q(U(n, t)c), Q(V (n, t)c), 2Q(W(n, t)c)} 2 exp t C

Using this result, as well as a union bound, we can then bound the probability

Q(ht / T(Q)) Q ( n = (n1, . . . , n K) : U(n, t)c or V (n, t)c or W(n, t)c)

n=(n1,...,n K) n1+ +n K=t

Q (U(n, t)c or V (n, t)c or W(n, t)c)

n=(n1,...,n K) n1+ +n K=t

{Q (U(n, t)c) + G (V (n, t)c) + Q (W(n, t)c)}

5(t + 1)K exp t C

We can now bound the desired probability by first using the decomposition

Q(Hτ / T(Q)) = Q τ > 3B

Q , Hτ / T(Q) + Q τ 3B

Q , Hτ / T(Q) .

Published in Transactions on Machine Learning Research (08/2024)

Then, the first term can be easily bounded using Hoeffding s inequality again:

Q , Hτ / T(Q) Q τ > 3B

= Q t 1, . . . , 3B

, n = (n1, . . . , n K) : W(n, t)c

Q ( n = (n1, . . . , n K) : W(n, B)c)

(n1,...,n K) n1+ +n K=B

Q (W(n, B)c)

(B + 1)K exp t C

The second term in the decomposition is decomposed using a union bound and (20):

Q , Hτ / T(Q) Q t B, . . . , 3B

: ht / T(Q)

t=B Q(ht / T(Q))

t=B 5(t + 1)K exp t C

Then, for any t B (2KC)2, it holds that 5(t + 1)K exp t C

and therefore

Q , Hτ / T(Q) 5

t=B exp t 2C

We then deduce that, for B (2KC)2,

Q(Hτ / T(Q)) (B + 1)Ke

and therefore, that lim B + Q(Hτ T(Q)) = 1.

D.2 Asymptotic Lower Bound

The main result of this subsection is the asymptotic lower bound on the probability of ruin. This result will serve as a basis in the proof of the non-asymptotic lower bound of both Theorem 1 and Proposition 7, and for that reason, it is conducted in the general case of arm distributions in FK [ 1,1].

Theorem 4. Let (αk)k [K] such that for any k [K], αk > 0 and PK k=1 αk = 1. There exists no policy π such that, for any set of arms (F1, . . . , FK),

lim inf B + 1 B log P(F1,...,FK) (τ(B, π) < )

k=1 αk inf Qk:EX Qk [X]<0 KL(Qk Fk) EX Qk[ X],

Published in Transactions on Machine Learning Research (08/2024)

with a strict inequality for some (F1, . . . , FK).

We define, for any tuple of distributions F = (F1, . . . , FK),

AF (α1, . . . , αK) :=

k=1 αk inf Qk:EX Qk [X]<0 KL(Qk Fk) EX Qk[ X],

which we write AF in the absence of ambiguity on the (αk)k [K]. The inequality in Theorem 4 means that

lim inf B + sup F

log P(F1,...,FK) (τ(B, π) < )

BAF 1. (21)

Proof. Recall from Lemma 2 that for any Q = (Q1, . . . , QK) such that EQi[X] < 0 for any i [K],

lim inf B + 1 B log P(F1,...,FK) (τ(B, π) < )

k=1 βk(Q) KL(Qk Fk)

EX Qk[ X], (22)

where β(Q) = (β1(Q), . . . , βK(Q)) satisfies

k [K], βk(Q) 0 and

k=1 βk(Q) = 1. (23)

Let us fix (αk)k [K] such that for any k [K], αk > 0 and PK k=1 αk = 1. We are going to show that no policy π can achieve both

(F1, . . . , FK), lim inf B + 1 B log P(F1,...,FK) (τ(B, π) < )

k=1 αk inf Qk:EX Qk [X]<0 KL(Qk Fk) EX Qk[ X] (24)

(F1, . . . , FK), lim inf B + 1 B log P(F1,...,FK) (τ(B, π) < ) <

k=1 αk inf Qk:EX Qk [X]<0 KL(Qk Fk) EX Qk[ X]. (25)

Let us then fix a policy π such that there exists a distribution P = ( P1, . . . , PK) and ϵ > 0 such that

lim inf B + 1 B log P( P1,..., PK) (τ(B, π) < )

k=1 αk γk ϵ, (26)

where we denoted, for any k [K],

γk := inf Q:EX Q[X]<0 KL(Q Pk) EX Q[ X] and γmax := max k [K] γk > 0.

Please note that the positivity of γmax relies on Assumption 4. We are going to show that there exists P = ( P 1 , . . . , P K) such that, denoting

γ k := inf Q:EX Q[X]<0 KL(Q P k ) EX Q[ X], γ min := min{ γ k : k [K], γ k > 0} and ϵ := ϵαmin γ min 4(K 1) γmax ,

the following holds:

lim inf B + 1 B log P( P 1 ,..., P K) (τ(B, π) < )

k=1 αk γ k + ϵ .

Published in Transactions on Machine Learning Research (08/2024)

We define Q = ( Q1, . . . , QK) such that, for any k [K], EX Qk[X] < 0 and

KL( Qk Pk) EX Qk[ X] γk + ϵ

Denoting αmin := mink [K] αk > 0, we then introduce the set

K := k [K] : βk( Q) αk αminϵ 2(K 1) γmax

Let us prove that K is not empty. Indeed, (26) can be re-written as

lim inf B + 1 B log P( P1,..., PK) (τ(B, π) < )

k=1 αk γk ϵ

k=1 (αk γk + αkϵ)

Then, applying (22) to Q = Q and F = P and using (27), we deduce that

lim inf B + 1 B log P( P1,..., PK) (τ(B, π) < )

k=1 βk( Q) KL( Qk Pk)

k=1 βk( Q) γk + ϵ

βk( Q) + αkϵ

Then, we deduce from (28) and (29) that

βk( Q) + αkϵ

or in other words, that K X

βk( Q) αk αkϵ

γk |{z} 0 0.

This is equivalent to X

βk( Q) αk αkϵ

γk |{z} >0 0.

We then deduce that there exists k0 [K] such that βk0( Q) αk + αkϵ

2 γk0 . With (23), it implies that

j =k0 βj( Q) = 1 βk0( Q) 1 αk0 αk0ϵ

j =k0 αj αminϵ

We deduce that there exists j [K] such that βj( Q) αj αminϵ 2(K 1) γmax , proving that K is not empty. Then,

we define the distribution P = ( P 1 , . . . , P K) as follows:

Published in Transactions on Machine Learning Research (08/2024)

- for any k / K, let P k := Qk, and please note that EX P k [X] < 0;

- for any k K, let P k a distribution such that EX P k [X] > 0 and

KL( Qk P k ) EX Qk[ X] γ k

1 + αminϵ 4(K 1) γmax

= γ k + αminϵ γ k 4(K 1) γmax , (30)

where γ k = inf Q:EX Q[X]<0 KL(Q P k ) EX Q[ X] and γ min = min{ γ k : k K, γ k > 0}. Note that this

distribution P k indeed exists by Lemma 6.

Since K = and by definition of P , we have

k=1 βk( Q) KL( Qk P k ) EX Qk[ X] = X

k/ K βk( Q) KL( Qk Qk)

EX Qk[ X] | {z } 0

k K βk( Q) KL( Qk P k ) EX Qk[ X]

k K βk( Q) KL( Qk P k ) EX Qk[ X].

Then, (30) implies that

k K βk( Q) KL( Qk P k ) EX Qk[ X] X

k K βk( Q) γ k + αminϵ 4(K 1) γmax

k K βk( Q) γ k

k K βk( Q) γ k + αminϵ 4(K 1)

By definition of K, for any k K, βk( Q) αk αminϵ 2(K 1) γmax and we deduce that

k K βk( Q) KL( Qk P k ) EX Qk[ X] X

k K αk γ k αminϵ 2(K 1)

γ k γmax + αminϵ 4(K 1)

k K αk γ k αminϵ γ min 4(K 1) γmax , (31)

where the inequality (31) comes from the fact that K is not empty. Injecting (31) in (22) (with P = P and Q = Q), we have:

lim inf B + 1 B log P( P 1 ,..., P K) (τ(B, π) < )

k=1 αk γ k + αminϵ γ min 4(K 1) γmax .

Recall that, by definition,

ϵ = ϵαmin γ min 4(K 1) γmax .

We deduce the following

lim inf B + 1 B log P( P 1 ,..., P K) (τ(B, π) < )

k=1 αk γ k + ϵ ,

which concludes the proof of the theorem.

Published in Transactions on Machine Learning Research (08/2024)

D.3 Sub-additivity of the Optimal Log Probability of Ruin

The passage from the asymptotic to the non-asymptotic lower bound on the probability of ruin relies on sub-additivity properties described in the next lemma, which is the main result of this subsection.

Lemma 7. Let t R + {+ }. For any B1, B2 > 0, we have

inf π sup F

log P τ(B1 + B2 + 1, π) < t

AF inf π sup F

log P τ(B1, π) < t

AF + inf π sup F

log P τ(B2, π) < t

In the case of multinomial arm distributions of support { 1, 0, 1}, if B1 and B2 are positive integers, the previous bound can be refined as

inf π sup F

log P τ(B1 + B2, π) < t

AF inf π sup F

log P τ(B1, π) < t

AF + inf π sup F

log P τ(B2, π) < t

π 1 arg min π sup F

log P τ(B1, π) < t

π 2 arg min π sup F

log P τ(B2, π) < t

Besides, we denote by π the policy such that

( (π 1)t if t < min τ(B1, π), t , (π 2)t τ(B1, π) (ignoring the previously observed rewards) otherwise.

Let B 2 {B2, B2 + 1}. Then, it is clear that

P τ(B1 + B 2, π) < t

1 t1+2 t : B1 + B 2 +

s=1 X πs s < 0

1 t1, t1+2 t : B1 +

s=1 X πs s < 0, B1 + B 2 +

s=1 X πs s < 0

1 t1 t : B1 +

s=1 X πs s < 0

τ(B1, π) t1+2 t : B1 + B 2 +

s=1 X πs s < 0

τ(B1, π) < t

= P(τ(B1, π 1) < t) P

τ(B1, π) t1+2 t : B1 + B 2 +

s=1 X πs s < 0

τ(B1, π 1) < t

= P(τ(B1, π 1) < t)

τ(B1, π) t1+2 t : B1 +

τ(B1,π 1) X

s=1 X(π 1)s s + B 2 +

s=τ(B1,π 1)+1 X(π 2)s s < 0

τ(B1, π 1) < t

From there, we are going to study separately the general case and the case of multinomial distributions of support { 1, 0, 1}.

Published in Transactions on Machine Learning Research (08/2024)

First case: in the general case, we choose B 2 = B2 + 1, and since the rewards are bounded in [ 1, 1], then

τ(B1, π 1) < t = B1 +

τ(B1,π 1) X

s=1 X(π 1)s s + 1 0.

P τ(B1 + B2 + 1, π) < t

P(τ(B1, π 1) < t)

τ(B1, π 1) t1+2 t : B2 +

s=τ(B1,π 1)+1 X(π 2)s s < 0

τ(B1, π 1) < t

P(τ(B1, π 1) < t)

τ(B1, π 1) t1+2 t + τ(B1, π 1) : B2 +

s=τ(B1,π 1)+1 X(π 2)s s < 0

τ(B1, π 1) < t

= P(τ(B1, π 1) < t) P(τ(B2, π 2) < t).

This yields

inf π sup F

log P τ(B1 + B2 + 1, π) < t

log P τ(B1 + B2 + 1, π) < t

log P τ(B1, π 1) < t

log P τ(B2, π 2) < t

= inf π sup F

log P τ(B1, π) < t

AF + inf π sup F

log P τ(B2, π 2) < t

which concludes the general case.

Second case: in the case of multinomial arm distributions of support { 1, 0, 1},

τ(B1, π 1) < t = B1 +

τ(B1,π 1) X

s=1 X(π 1)s s = 0.

Therefore, choosing B 2 = B2, we have

P τ(B1 + B2, π) < t

P(τ(B1, π 1) < t)

τ(B1, π 1) t1+2 t : B2 +

s=τ(B1,π 1)+1 X(π 2)s s < 0

τ(B1, π 1) < t

P(τ(B1, π 1) < t)

τ(B1, π 1) t1+2 t + τ(B1, π 1) : B2 +

s=τ(B1,π 1)+1 X(π 2)s s < 0

τ(B1, π 1) < t

= P(τ(B1, π 1) < t) P(τ(B2, π 2) < t).

Published in Transactions on Machine Learning Research (08/2024)

This yields

inf π sup F

log P τ(B1 + B2, π) < t

log P τ(B1 + B2, π) < t

log P τ(B1, π 1) < t

log P τ(B2, π 2) < t

= inf π sup F

log P τ(B1, π) < t

AF + inf π sup F

log P τ(B2, π 2) < t

which concludes the multinomial case and the proof of the lemma.

D.4 Proof of Theorem 1 and Proposition 7

Let (αk)k [K] such that for any k [K], αk > 0 and PK k=1 αk = 1. Recall that, by definition,

k=1 αk inf Qk:EX Qk [X]<0 KL(Qk Fk) EX Qk[ X] > 0.

Let π be any policy, and B0 > 0 an initial budget. For any n 1, let us denote by πB the policy defined recursively on {B B0}, such that πB0 = π and for any B B0, πB t = πt for t τ(B0, π) and then πB t = πB t for t τ(B0, π) + 1, where B = B + Pτ(B,π) s=1 . Concretely, πB restarts π every time it exhausts the budget B0.

From now, we are going to study separately the general case of rewards bounded in [ 1, 1] and the case of multinomial arms of support { 1, 0, 1}.

First case: in the case of multinomial arm distributions in FK { 1,0,1}, we assume that, for any arm distributions F = (F1, . . . , FK), log PF (τ(B0, π) < )

and that there exist some arm distributions F = ( F1, . . . , FK) and C F > 0 such that

log P F (τ(B0, π) < )

A F (B0 + C F ),

and we will show that there is contradiction.

By Lemma 7, for any arm distributions F and for any n 1,

log PF τ(n B0, πn B0) <

AF n log P τ(B0, πB0) <

Consequently, for any arm distributions F,

lim sup n + 1 n B0 log P (τ(n B0, π) < ) AF . (32)

Furthermore, the same computation applied to F gives

log P F τ(n B0, πn B0) <

A F n log P τ(B0, πB0) <

A F n(B0 + C F ),

Published in Transactions on Machine Learning Research (08/2024)

which, in turn, implies

lim sup n + 1 n B0 log P (τ(n B0, π) < ) B0 + C F B0 A F < A F . (33)

Eqs. (32) and (33) contradict Theorem 4. Therefore, we deduce that if there exist some arm distributions F such that log P F (τ(B0, π) < )

then there also exist some arm distributions F such that

log PF (τ(B0, π) < )

concluding the multinomial case and the proof of Theorem 1.

Second case: in the general case of arm distributions in FK [ 1,1], we assume that, for any arm distributions F = (F1, . . . , FK), log PF (τ(B0, π) < )

AF (B0 + 1),

and that there exist some arm distributions F = ( F1, . . . , FK) and C F > 0 such that

log P F (τ(B0, π) < )

A F (B0 + 1 + C F ),

and show that there is contradiction.

By Lemma 7, for any arm distributions F and for any n 1,

log PF τ(n B0 + (n 1), πn B0) <

AF n log P τ(B0, πB0) <

AF n(B0 + 1).

Consequently, for any arm distributions F,

lim sup n + 1 n B0 + (n 1) log P (τ(n B0 + (n 1), π) < ) AF . (34)

Furthermore, the same computation applied to F gives

log P F τ(n B0 + (n 1), πn B0) <

A F n log P τ(B0, πB0) <

A F n(B0 + 1 + C F ),

which in turn, implies,

lim sup n + 1 n B0 + (n 1) log P (τ(n B0, π) < ) B0 + 1 + C F B0 + 1 < 1. (35)

Eqs. (34) and (35) contradict Theorem 4. Therefore, we deduce that if there exist some arm distributions F such that log P F (τ(B0, π) < )

A F < (B0 + 1),

then there also exist some arm distributions F such that

log PF (τ(B0, π) < )

AF > (B0 + 1),

concluding the general case and the proof of Proposition 7.

Published in Transactions on Machine Learning Research (08/2024)

E Proof of Lemma 1

Please note that applying (22) to the case of one single arm (K = 1) gives that, for any ϵ0 0, 1

3 and any distribution Q which has a negative expectation,

lim inf B + 1 B log P(τ(B, 1) < ) KL(Q F1)

By taking ϵ0 0, we deduce that

lim inf B + 1 B log P(τ(B, 1) < ) inf F :EX F [X]<0 KL(F F1) EX F [ X].

It thus remains to prove that for any B > 0,

1 B log P(τ(B, 1) < ) inf F :EX F [X]<0 KL(F F1) EX F [ X].

The result being trivial if EX F1[X] 0, we assume that EX F1[X] > 0. Let us define the logarithmic moment-generating function of X by Λ(λ) := log E eλX . By Lemma 5, there exists λ < 0 such that Λ(λ ) = 0 and it satisfies

inf F :EX F [X]<0 KL(F F1) EX F [ X] = λ .

Now, let X F1 and let X1, X2, F1 be i.i.d copies of X. We write Sn = Pn i=1 Xi. We define τ := inf{n 1 : Sn B} and τT := min(τ, T) for any T N. Since τT is a bounded stopping time, by the optional stopping theorem, it holds for any T that

E h eλ SτT i = 1.

On the other hand,

1 = E h eλ SτT i

= E h 1τT <T eλ SτT i + E h 1τT =T eλ SτT i

E h 1τT <T eλ SτT i

e λ BP(τT < T),

which implies that P(τ < ) = lim T P(τT < T) eλ B.

Therefore we obtain 1 B log P(τ < ) λ = inf F :EX F [X]<0 KL(F F1) EX F [ X],

which completes the proof.

F Proof of the Upper Bound on the Reward of EXPLOIT Policies (Proposition 3)

We introduce the following notation. Let K+ be the number of arms such that P τ B

K , k = > 0 for k [K]. Recall that we ordered the arms in order of decreasing expectation, and therefore µ1 µK+ > 0. Then, by definition of K+,

k [K+], P τ B

K , k = > 0 ; j {K+ + 1, . . . , K} , P τ B

K , j = = 0.

Published in Transactions on Machine Learning Research (08/2024)

F.1 Preliminary Lemma

In this subsection, we express the expected cumulative reward of any policy in EXPLOIT as the product between p EX and a convex combination of µ1, . . . , µK+, and we explicitly give the coefficients of the convex combination. This decomposition will be useful both as a first step in the proof of Proposition 3 and in the proof of the reward bound of EXPLOIT-UCB, as a particular instance of policy in EXPLOIT.

For any S [K+], we define the event

ΠS := j S, τ B

K , j T and j [K+] \ S, τ B

Given a policy π, an arm k [K+] and a set S [K+], we define the coefficient nπ,k(S) as

nπ,k(S) := 1

t=1 1πt=k,τ(B,π) t 1

When there is no ambiguity on the policy π, we simply write nk(S). Please note that for any fixed policy π and any set S [K+],

k=1 nk(S) = 1

t=1 1τ(B,π) t 1

The following result holds. Lemma 8. For any policy π within the framework EXPLOIT, the expected cumulative reward satisfies

t=1 Xπt t 1τ(B,π) t 1

S [K+]:k S P(ΠS)nk(S)

µk T + o(T). (36)

Proof. First, we write the reward as a sum over the arms of positive probability of survival:

t=1 Xπt t 1τ(B,π) t 1

t=1 1πt=k1τ(B,π) t 1

t=1 1πt=k1τ(B,π) t 1

+ o(T). (37)

Then, we examine the term E h PT t=1 1πt=k1τ(B,π) t 1 i . In order to analyse it, we will introduce the following

events, for any S, S [K+] such that S S = :

ΠS,S := j S, τ B

K , j T; j S ,

K , j < T; j [K+] \ (S S ), τ B

Please note that ΠS = ΠS, . We can then decompose, for any k {1, . . . , K+},

t=1 1πt=k1τ(B,π) t 1

S,S [K+]:S S = P (ΠS,S ) E

t=1 1πt=k1τ(B,π) t 1

Consider the case S = and let k S . We can bound

Published in Transactions on Machine Learning Research (08/2024)

Indeed, the sequence P τ B

n 1 is increasing and upper-bounded by 1, and thus it has a limit and it implies that

K , k T = o(1).

We deduce that, for any S, S [K+] such that S S = ,

S = = P (ΠS,S ) = o(1).

This implies

t=1 1πt=k1τ(B,π) t 1

S [K+] P (ΠS) E

t=1 1πt=k1τ(B,π) t 1

Re-injecting in (37), we have

t=1 Xπt t 1τ(B,π) t 1

S [K+] P (ΠS) E

t=1 1πt=k1τ(B,π) t 1

S [K+] P (ΠS)

t=1 1πt=k1τ(B,π) t 1

Besides, it is clear, by definition of ΠS, that any policy π in EXPLOIT satisfies

t=1 1πt=k1τ(B,π) t 1

We deduce that

t=1 Xπt t 1τ(B,π) t 1

S [K+] P (ΠS) X

t=1 1πt=k1τ(B,π) t 1

S [K+] P (ΠS) X

k S µknk(S)T + o(T)

S [K+]:k S P (ΠS) nk(S)

µk T + o(T),

which concludes the proof of the lemma.

F.2 Proof of Proposition 3

It remains to provide an upper bound to the right-hand side of the equality (36) in Lemma 8 in order to complete the proof of Proposition 3. In order to maximize the right term in (36), we should solve the following maximization problem:

S [K+]:k S P (ΠS) nk(S)

k S nk(S) 1 and k / S, nk(S) = 0.

We can re-order the terms in the above sum as

S [K+]:k S P (ΠS) nk(S)

S [K+] P (ΠS) X

k S µknk(S),

Published in Transactions on Machine Learning Research (08/2024)

and since, by hypothesis, µ1 µ2 µK+ > 0, we deduce the bound

S [K+]:k S P (ΠS) nk(S)

S [K+] P (ΠS) max k S µk,

where the above inequality is an equality for nk(S) = n k(S), defined by

n k(S) = 1 if k = min S 0 otherwise.

This gives, for any given k {1, . . . , K+}, X

S [K+]:k S P (ΠS) n k(S)

S {k+1,...,K+} P Π{k} S

S {k+1,...,K+} P j {k} S, τ B

K , j T; j [K+] \ ({k} S), τ B

= P j [k 1], τ B

K , k T + o(1)

= 1 p EX 1 1 p EX

K , j < P τ B

We deduce that

t=1 Xπt t 1τ(B,π) t 1

1 p EX K+ X

k=1 wkµk T + o(T),

which concludes the proof of the proposition.

G Proof of the Reward Bound of EXPLOIT-UCB (Proposition 4)

In this appendix and in the next one, we will use the following notation. For any k [K], let Y k 1 , . . . , Y k T Fk be i.i.d. rewards drawn from arm k and for any t 1, and we denote by Nk(t) := Pt s=1 1πs=k the number of times arm k has been pulled until round t. Please note that Xπt t = PK k=1 Y k Nk(t)1πt=k. Given that arm k

has been pulled nk times, we also introduce ˆY k nk := 1 nk Pnk n=1 Y k n as the empirical average of arm k at round t. Please note that for any k [K] and any round t,

ˆY k Nk(t) = ˆXk t .

For any t 1, let

Ck(t) := ˆY k Nk(t) +

G.1 Preliminary Lemma

EXPLOIT-UCB is based on the classic bandit algorithm UCB1 (which has a sublinear regret in the classic stochastic MAB), and therefore has the following characteristic which is going to be useful in the proof of the reward bound of both EXPLOIT-UCB and EXPLOIT-UCB-DOUBLE. For the sake of clarity, we assume the arms are ordered in decreasing expectation: µ1 µ2 µK.

Published in Transactions on Machine Learning Research (08/2024)

Lemma 9. Let (πt)t 1 be the policy associated to EXPLOIT-UCB. Assume that µ1 > µk. Then

t=1 1πt=k,Ck(t) C1(t)

Proof. This proof completely follows the proof of Theorem 1 in Auer et al. (2002). Let r T, the quantity to bound can be written as

t=1 1πt=k,Ck(t) C1(t)

t=1 1 πt=k, ˆY k Nk(t)(t)+ q

6 log t Nk(t) ˆY 1 N1(t)(t)+p

6 log t N1(t)

t=T 1/2 1 πt=k, ˆY k Nk(t)(t)+ q

6 log t Nk(t) ˆY 1 N1(t)(t)+p

6 log t N1(t)

t T,r 1πt=k, maxr nk<t ˆY k nk +p

nk min1 n1<t ˆY 1 n1+p

t T,r 1maxr nk<t ˆY k nk +p

nk min1 n1<t ˆY 1 n1+p

nk=r 1 ˆY k nk +p

nk ˆY 1 n1+p

where T,r = n t {1, . . . , T} : Pt s=1 1πs=k r o . Similarly as in Auer et al. (2002), we use the fact that

the probability event n ˆY k nk + q

nk ˆY 1 n1 + q

o implies at least one of the following:

ˆY k nk µk +

µ1 < µk + 2

Therefore, we can write

t=1 1πt=k,Ck(t) C1(t)

nk=r 1 ˆY k nk +p

nk ˆY 1 n1+p

1 ˆY 1 n1 µ1 p

n1 + 1 ˆY k nk µk+p

nk + 1µ1<µk+2p

The choice r = l 24 log T (µ1 µk)2 m ensures that, for any nk r,

Published in Transactions on Machine Learning Research (08/2024)

which implies

t=1 1πt=k,Ck(t) C1(t)

1 ˆY 1 n1 µ1 p

n1 + 1 ˆY k nk µk+p

nk=r E 1 ˆY 1 n1 µ1 p

n1 + 1 ˆY k nk µk+p

ˆY k nk µk +

+ o(T). (38)

Using Hoeffding s inequality, for any n1 {1, . . . , T}, we have

Similarly, for any nk {r, . . . , T}, we have

ˆY k nk µk +

We can then replace in (38):

t=1 1πt=k,Ck(t) C1(t)

2 T 3 + o(T)

which concludes the proof of the lemma.

G.2 Proof of Proposition 4

Let S [K+]. Recall that for any arm k [K+], nk(S) is defined as

t=1 1πt=k,τ(B,π) t 1

and Lemma 8 states that

t=1 Xπt t 1τ(B,π) t 1

S [K+]:k S P(ΠS)nk(S)

µk T + o(T).

Denoting by S0 the set of positive or zero arms,

S0 := j [K+] : P τ B

K , j = = 1 ,

it is clear by definition of ΠS that S0 S = P(Πs) = 0,

Published in Transactions on Machine Learning Research (08/2024)

t=1 Xπt t 1τ(B,π) t 1

S [K+]: S0 S, k S

µk T + o(T).

Given that EXPLOIT-UCB is in EXPLOIT, by Lemma 8, it suffices to show that for any S0 S [K+] and any k S,

nk(S) = 1 + o(1) if k = min S o(1) otherwise.

Then, please note that PK k=1 nk(S) = 1. Therefore, it suffices to prove that for any k S \{min S}, nk(S) = o(1). Thus, let k S \{min S}. Then, on the one hand, we are going to provide a lower bound P(ΠS) which is independent of T. Indeed, since S0 S, there is no positive or zero arm k [K+] \ S, and therefore, we know that there exists ϵ > 0 such that

k [K+], PX Fk(X ϵ) > 0.

We fix such an ϵ and we deduce that Y

k [K+]\S P τ B

We can therefore provide the following lower bound, independent of T and positive by definition of K+. For any T B

k [K+]\S P τ B

k [K+]\S P τ B

On the other hand, an upper bound to E h PT t=1 1πt=k,ΠS i is obtained by

t=1 1πt=k,ΠS

t=1 1πt=k, j S,τ( B

t=1 1πt=k,Ck(t 1) Cmin S(t 1)

by Lemma 9. We deduce the following bound on nk(S):

E h PT t=1 1πt=k,ΠS i

E h PT t=1 1πt=k,Ck(t 1) Cmin S(t 1) i

k [K+]\S P τ B

K , k < B ϵK

which concludes the proof of the proposition.

Published in Transactions on Machine Learning Research (08/2024)

H The Performance of EXPLOIT-UCB-DOUBLE

In the proofs of the performance of EXPLOIT-UCB-DOUBLE, for the sake of clarity, we drop the exponent n in the notation of EXPLOIT-UCB-DOUBLE, and πn becomes π. Recall the following notation from the previous section, for any t 1,

Ck(t) := ˆY k Nk(t) +

H.1 Proof of Proposition 5

For any policy π and any budget B , we will denote π EXPLOIT(B ) if at round t, π only pulls arms k [K] such that Pt s=1 Xπs s 1πs=k B

The probability of ruin of EXPLOIT-UCB-DOUBLE can be decomposed as

P(τ(B, π) < T) =

j=0 P(τ(B, π) < T tj τ(B, π) < tj+1). (39)

Let us first examine the term in j = 0. Then,

P(τ(B, π) < T τ(B, π) < t1) P

τ(B, π) < T and t T, B +

s=1 Xπs s 1τ(B,π) s 1 < n B2 !

P (τ(B, π) < T and π EXPLOIT(B))

Then, let us examine the other terms in the sum. Let j 1. For any t 1, we will denote πt := πtj+t. Let us re-write each of the terms in the sum as

P (τ(B, π) < T and tj τ(B, π) < tj+1) = P

tj τ(B, π) < tj+1; B +

t=1 Xπt t 1τ(B,π) t 1 < 0

This is re-written as

P (τ(B, π) < T and tj τ(B, π) < tj+1)

tj τ(B, π) < tj+1; B +

t=1 Xπt t 1 s t 1,B+Ps

r=1 Xπr r >0 < 0

But then, by definition of tj, under the condition that tj < T, we have that

t=1 Xπt t jn B2,

which implies that, for any t tj + 1,

s=1 Xπs s jn B2 +

s=tj+1 Xπs s .

We can then replace in the previous equation:

P (τ(B, π) < T and tj τ(B, π) < tj+1)

tj τ(B, π) < tj+1; jn B2 +

t=tj+1 Xπt t 1 s t 1,jn B2+Ps

r=tj +1 Xπr r >0 < 0

Published in Transactions on Machine Learning Research (08/2024)

This is re-written as

P (τ(B, π) < T and tj τ(B, π) < tj+1)

tj τ(B, π) < tj+1; jn B2 +

t=1 X πt t+tj1 s t 1,jn B2+Ps

r=1 X πtj r+tj >0 < 0

P (τ(B, π) < T and tj τ(B, π) < tj+1) P (tj τ(B, π) < tj+1; τ(B, π) < T tj)

P τ(B, π) < , π EXPLOIT (jn B2)

p EX jn B .

We can then replace in (39):

P(τ(B, π) < T) =

j=0 P(τ(B, π) < T tj τ(B, π) < tj+1)

1 (p EX)n B ,

which gives the desired result. Let ϵ > 0, then

n log ϵ 1+ϵ B log p EX =

1 (p EX)n B ϵ ϵ

hence P(τ(B, π) < ) p EX + ϵ

which concludes the proof of the proposition.

H.2 Proof of Proposition 6

We will assume that the arm with the biggest expectation is arm 1 and that it is unique for the sake of clarity. Let j := T 1/4

n B2 1 , recall that

t {0, . . . , min(τ(B, π), T)} : B +

s=1 Xπs s > jn B2 )

We still denote πt := πtj+t for any t 1. We can then decompose the reward as follows:

t=1 Xπt t 1τ(B,π) t 1

t=1 Xπt t 1πt=k1τ(B,π) t 1

t=1 1πt=11τ(B,π) t 1

t=1 1πt=k1τ(B,π) t 1

t=1 1πt=11τ(B,π) t 1

t=tj+1 1πt=k1τ(B,π) t 1

| {z } (Bk)

+O (E[tj]) .

Published in Transactions on Machine Learning Research (08/2024)

If we prove that

(A) = P(τ(B, π) = )T + o(T), k {2, . . . , K}, (Bk) = o(T), E[tj] = o(T),

then, we can write that

t=1 Xπt t 1τ(B,π) t 1

= µ1P(τ(B, π) = )T + o(T),

and using Proposition 5 gives the result:

t=1 Xπt t 1τ(B,π) t 1

1 (p EX)n B

Let ϵ > 0. Then by Proposition 5,

n log ϵ 1+ϵ B log p EX = P(τ(B, π) = ) 1 p EX ϵ.

In particular, the choice of ϵ = T B log p EX

1 T B log p EX = o T (1) gives log ϵ 1+ϵ B log p EX = log T, and hence

n log T = E

t=1 Xπt t 1τ(B,π) t 1

µ1(1 p EX)T + o(T),

which concludes the proof. It only remains to study each of the terms (A), (Bk) for k 2 and E[tj].

Study of (Bk)

Let k {2, . . . , K}. We can decompose the term (Bk) as follows:

t=tj+1 1πt=k1τ(B,π) t 1

t=tj+1 1πt=k1τ(B,π) tj1 s t 1,B+Ps

r=1 Xπr r >0

t=tj+1 1πt=k1τ(B,π) tj1 s {tj+1,...,t 1},B+Ptj r=1 Xπr r +Ps

r=tj +1 Xπr r >0

But then, by definition of tj, if tj min(τ(B, π), T),

r=1 Xπr r < jn B2 + 1 = T 1/4 + 1.

This implies that, if tj min(τ(B, π), T), denoting πs := πs+tj for any s 1,

1 s {tj+1,...,t 1},B+Ptj r=1 Xπr r +Ps

r=tj +1 Xπr r >0 1 s {tj+1,...,t 1},T 1/4+1+Ps

r=tj +1 Xπr r >0

= 1 s {1,...,t tj 1},T 1/4+1+Ps tj r=1 X πr r+tj >0

= 1τ(T 1/4+1, π) t tj 1,

Published in Transactions on Machine Learning Research (08/2024)

and thus, if tj min(τ(B, π), T),

t=tj+1 1πt=k1τ(B,π) tj1 s {tj+1,...,t 1},B+Ptj r=1 Xπr r +Ps

r=tj +1 Xπr r >0

t=tj+1 1πt=k1τ(B,π) tj1τ(T 1/4+1, π) t tj 1

t=1 1πt+tj =k1τ(B,π) tj1τ(T 1/4+1, π) t 1

This inequality being a trivial equality if tj = T + 1 (because the sum is a sum on an empty set) or if tj = τ(B, π) + 1 (0 0), we deduce that in any case,

t=1 1πt+tj =k1τ(B,π) tj1τ(T 1/4+1, π) t 1

Then, denoting by sj the realized value of tj and decomposing classically, we have

t=1 1πt+tj =k1τ(B,π) tj1τ(T 1/4+1, π) t 1

t=1 1πt+sj =k1τ(B,π) sj1τ(T 1/4+1, π) t 11tj=sj

t=1 E h 1πt+sj =k1τ(B,π) sj1τ(T 1/4+1, π) t 11tj=sj i

t=1 P τ(B, π) sj, τ(T 1/4 + 1, π) t 1, tj = sj, πt+sj = k .

Let us then study the probability P τ(B, π) sj, τ(T 1/4 + 1, π) t 1, tj = sj, πt+sj = k and decompose it as

P τ(B, π) sj, τ(T 1/4 + 1, π) t 1, tj = sj, πt+sj = k =

τ(B, π) sj, τ(T 1/4 + 1, π) t 1, tj = sj, πt+sj = k, t sj,

s=1 X1 s1πs=1 > T 1/4 + 1

τ(B, π) sj, τ(T 1/4 + 1, π) t 1, tj = sj, πt+sj = k, t sj,

s=1 X1 s1πs=1 T 1/4 + 1

The second term on the right, though, can easily be bounded as follows:

τ(B, π) sj, τ(T 1/4 + 1, π) t 1, tj = sj, πt+sj = k, t sj,

s=1 X1 s1πs=1 T 1/4 + 1

s=1 X1 s1πs=1 T 1/4 + 1

Published in Transactions on Machine Learning Research (08/2024)

Then, let us denote by δπ 1 < δπ 2 < ... the rounds t T 1/4 at which πt = 1, in other words, denoting δπ 0 := T 1/4, j 1, δπ j := inf{t δπ j 1 : πt = 1}.

Then, we can bound the probability as

s=1 X1 s1πs=1 T 1/4

s=1 X1 s1πs=1 T 1/4 + 1

s=1 X1 s1πs=1 T 1/4 + 1

s=1 1πs=1 = n1

s=1 X1 s1πs=1 T 1/4 + 1

s=1 1πs=1 = n1

because the rewards are bounded in [ 1, 1]. Then, using Hoeffding s inequality, for any n1 T 1/4

s=1 X1 s1πs=1 T 1/4 + 1

s=1 1πs=1 = n1

Pδπ j s=1 X1 s1πs=1 Pδπ j s=1 1πs=1 µ1 µ1,

s=1 1πs=1 = n1

exp (n1 + j)µ2 1 2

Summing over n1 and j gives

s=1 X1 s1πs=1 T 1/4 + 1

s=1 1πs=1 = n1

j=0 exp (n1 + j)µ2 1 2

= exp T 1/4µ2 1 2K

1 e µ2 1 2 2 .

We then deduce that

τ(B, π) sj, τ(T 1/4 + 1, π) t 1, tj = sj, πt+sj = k, t sj,

s=1 X1 s1πs=1 T 1/4 + 1

= O exp T 1/4µ2 1 2K

and thus, plugging it in (40),

P τ(B, π) sj, τ(T 1/4 + 1, π) t 1, tj = sj, πt+sj = k =

τ(B, π) sj, τ(T 1/4 + 1, π) t 1, tj = sj, πt+sj = k, t sj,

s=1 X1 s1πs=1 > T 1/4 + 1

+ O exp T 1/4µ2 1 2K

Published in Transactions on Machine Learning Research (08/2024)

Let us also bound the first term in the decomposition in (40):

τ(B, π) sj, τ(T 1/4 + 1, π) t 1, tj = sj, πt+sj = k, t sj,

s=1 X1 s1πs=1 > T 1/4 + 1

P πt+sj = k, tj = sj, Ck(t + sj) C1(t + sj) .

We can then re-write the previous term in the sum, as

t=1 P πt+sj = k, tj = sj, Ck(t + sj) C1(t + sj)

t=1 E h 1πt+sj =k,tj=sj,Ck(t+sj) C1(t+sj) i

t=1 1πt+tj =k,Ck(t+tj) C1(t+tj)

t=1 E 1πt=k,Ck(t) C1(t)

by Lemma 9. Overall, we can replace in (41) and deduce that

t=1 P τ(B, π) sj, τ(T 1/4 + 1, π) t 1, tj = sj, πt+sj = k = o(T),

which straightforwardly implies that (Bk) = o(T).

Study of E[tj]

We can decompose

E[tj] = E tj1tj=min(τ(B,π),T )+1 + E tj1tj min(τ(B,π),T )

= (T + 1) P(tj = T + 1) | {z } (C)

+ E (τ(B, π) + 1)1tj=τ(B,π)+1

+ E tj1tj min(τ(B,π),T )

We can first bound the term (D), as

T + 1 P τ(B, π)

T + (T + 1)P

T < τ(B, π) T .

But since (P(τ(B, π) n))n 1 is a sequence which converges to P(τ(B, π) < ), we deduce that

T < τ(B, π) T = P(τ(B, π) T) P τ(B, π)

and therefore, (D) = o(T).

We then deduce that E[tj] = (C) + (E) + o(T).

Published in Transactions on Machine Learning Research (08/2024)

Then, let us study the term (C) and bound it by:

τ(B, π) T, t T, B +

s=1 Xπs s T 1 4 , tj = T + 1

τ(B, π) T, B +

s=1 Xπs s T 1 4 , tj = T + 1

τ(B, π) T 3 4 , B +

s=1 Xπs s T 1 4 , tj T 3 4

Then, let us study the term (E). Actually,

P(tj T 3/4, tj min(τ(B, π), T)) = P

τ(B, π) T 3 4 , t T 3/4, B +

s=1 Xπs s T 1/4, tj T 3 4

τ(B, π) T 3 4 , B +

s=1 Xπs s T 1/4, tj T 3 4

We then deduce that

(E) TP tj T 3/4, tj min(τ(B, π), T) + T 3/4P tj < T 3/4, tj min(τ(B, π), T)

τ(B, π) T 3 4 , B +

s=1 Xπs s T 1/4, tj T 3 4

+ o(T). (43)

Using (42) and (43), we deduce that

τ(B, π) T 3 4 , B +

s=1 Xπs s T 1/4, tj T 3 4

Then, consider the set of conditions n τ(B, π) T 3 4 , B + PT 3/4

s=1 Xπs s T 1 4 , tj T 3 4 o and assume there

exists an arm k0 [K] such that PT 3/4

s=1 Xπs s 1πs=k0 > T 1 3 . Since T 3/4 tj, we know that for any k [K],

t=1 Xπt t 1πt=k T 1 4 K 1,

t=1 Xπt t = B +

t=1 Xπt t 1πt=k0 + X

t=1 Xπt t 1πt=k

K T 1 4 (K 1)

= Ω(T 1 3 ),

Published in Transactions on Machine Learning Research (08/2024)

which contradicts the hypothesis B + PT 3/4

t=1 Xπt t T 1 4 . We deduce that

τ(B, π) T 3 4 , B +

s=1 Xπs s T 1/4, tj T 3 4

τ(B, π) T 3 4 , k [K],

t=1 Xπt t 1πt=k T 1 3

t=1 1πt=k T 3/4

t=1 Xπt t 1πt=k T 1 3

k [K] : T 1/4

s=1 Xπs s 1πs=k < T 1/3,

s=1 1πs=k T 3/4

s=1 Xπs s 1πs=k

s=1 1πs=k T 3/4

s=1 Xπs s 1πs=k

s=1 1πs=k T 3/4

Then, for any arm k [K], there are two cases: either E[Xk 1 ] = 0, or E[Xk 1 ] = 0. In the former case, we can use Hoeffding s inequality to bound the above probability:

s=1 Xπs s 1πs=k

s=1 1πs=k T 3/4

s=1 Xπs s 1πs=k E[Xk 1 ]

< K T 5/12 E[Xk 1 ],

s=1 1πs=k T 3/4

T 3/4 K T 5/12 E[Xk 1 ] 2

In the latter case, we can bound this probability as follows:

s=1 Xπs s 1πs=k

s=1 1πs=k T 3/4

s=1 Xπs s 1πs=k q PT 3/4 s=1 1πs=k

< K T 1/24 ,

s=1 1πs=k T 3/4

Then, under the assumption that there is no zero arm, Var(Xk 1 ) > 0 and

s=1 Xπs s 1πs=k q PT 3/4 s=1 1πs=k

d N(0, Var(Xk 1 )),

and since 1 T 1/24 T + 0, we deduce that

P tj T 3/4 = o(1).

Published in Transactions on Machine Learning Research (08/2024)

In any case, we have that

s=1 Xπs s 1πs=k

s=1 1πs=k T 3/4

τ(B, π) T 3 4 , B +

s=1 Xπs s T 1/4, tj T 3 4

We then deduce that E[tj] = o(T).

Study of (A)

This term is the main one in the previous decomposition.

t=1 1τ(B,π) t 1

t=1 1πt=11τ(B,π) t 1

t=1 1πt=k1τ(B,π) t 1

t=1 1πt=k1τ(B,π) t 1

t=tj+1 1πt=k1τ(B,π) t 1

k=2 (Bk) + O(E[tj]).

But then, using the previous bounds on (Bk) and E[tj], we deduce that

t=1 1τ(B,π) t 1

Then, we can simply replace the factor with the expectation by the probability of survival, as

t=1 1τ(B,π) t 1

t=1 P(τ(B, π) t 1)

= TP(τ(B, π) = ) + o(T).

Hence, (A) = µ1P(τ(B, π) = )T + o(T),

which concludes the proof of the proposition.

I Proof of the Pareto Optimality of EXPLOIT-UCB-DOUBLE (Theorem 3)

In the proofs of the performance of EXPLOIT-UCB-DOUBLE, for the sake of clarity, we drop the exponent n in the notation of EXPLOIT-UCB-DOUBLE, and πn becomes π. The main objective of this section is to prove that EXPLOIT-UCB-DOUBLE is regret-wise Pareto-optimal in the case of rewards in { 1, 0, 1} and with parameter n = log T.

The first subsection provides a preliminary lemma, useful for the proof of the Pareto-optimality exposed in the second subsection. The last subsection makes use of the preliminary lemma to derive an upper bound on the relative regret of EXPLOIT-UCB-DOUBLE in the general case.

Published in Transactions on Machine Learning Research (08/2024)

I.1 Preliminary Lemma

Lemma 10. Let π be any policy. Then, it holds that

Rew T (π) P(τ(B, π)

T) max k [K] µk T + o(T). (44)

Furthermore, if π is an anytime policy, it holds that

Rew T (π) = P(τ(B, π) = )E

t=1 Xπt t τ(B, π) T

Proof. In order to prove the first statement of the lemma, we decompose the expected cumulative reward as follows:

Rew T (π) = E

t=1 Xπt t 1τ(B,π)<

t=1 Xπt t 1τ(B,π) t 11τ(B,π)

B + P(τ(B, π)

t=1 Xπt t 1τ(B,π) t 1

T + P(τ(B, π)

T +1 Xπt t 1τ(B,π) t 1

T + P(τ(B, π)

T) max k [K] µk(T

T) max k [K] µk T + 2

which gives the desired result. For the second statement, we start by writing the reward as

Rew T (π) = E

t=1 Xπt t 1τ(B,π) t 1

t=1 Xπt t 1τ(B,π) t 11τ(B,π)<T

t=1 Xπt t 1τ(B,π) T

t=1 Xπt t 1τ(B,π)<T

t=1 Xπt t 1τ(B,π) T

By definition of τ(B, π),

t=1 Xπt t 1τ(B,π)<T

t=1 Xπt t 1τ(B,π)<T

and since the rewards are bounded in [ 1, 1], we deduce that

(B + 1) < E

t=1 Xπt t 1τ(B,π)<T

which implies

Rew T (π) = E

t=1 Xπt t 1τ(B,π) T

Published in Transactions on Machine Learning Research (08/2024)

It is trivial that, for any anytime policy π,

P(τ(B, π) T) = P(τ(B, π) = ) + o T + (1). (45)

This implies

Rew T (π) = E

t=1 Xπt t 1τ(B,π) T

= P(τ(B, π) T)E

= P(τ(B, π) = )E

which concludes the proof of the lemma.

I.2 Proof of the Pareto Optimality

We denote by πn the anytime policy EXPLOIT-UCB-DOUBLE with parameter n 1. Then, Propositions 5 and 6, along with Lemma 10 give the cumulative reward of EXPLOIT-UCB-DOUBLE:

1 (p EX)n B

max k [K] µk T + o(T).

In particular, we deduce the reward of the (non-anytime) policy EXPLOIT-UCB-DOUBLE with parameter n = log T: Rew T (πlog T ) = 1 p EX max k [K] µk T + o(T).

Recall from Lemma 10 that, for any policy π,

Rew T ( πT ) P τ(B, πT )

T max k [K] µk T + o(T),

and as a result,

Rew T ( πT ) Rew T (πlog T )

T P τ(B, πT )

T (1 p EX) max k [K] µk + o(1).

2 F , it holds that

Rew T ( πT ) Rew T (πlog T )

T P τ(B, πT ) 3B

(1 p EX) max k [K] µk + o(1),

and taking the limit gives

Reg F (π π) p EX P τ(B, πT ) < 3B

max k [K] µk,

where π denotes (πlog T )T 1 the optimally-tuned EXPLOIT-UCB-DOUBLE. Then, assume that

sup F Reg F (π π) > 0,

which implies that there exists some arm distributions F such that

τ(B, πT ) < 3B

Published in Transactions on Machine Learning Research (08/2024)

Then, by Theorem 1, there exist some other arm distributions F such that

τ(B, πT ) < 3B

and since maxk [K] µk > 0, this implies inf F Reg F (π π) < 0,

proving that (πlog T )T 1 is regret-wise Pareto-optimal.

I.3 Relative Regret of EXPLOIT-UCB-DOUBLE in the General Case

In the general case, Propositions 5 and 6 and Lemma 10 imply that the cumulative reward of EXPLOITUCB-DOUBLE πn with parameter n 1 is

1 (p EX)n B

max k [K] µk T + o(T). (46)

Let π be any policy. The previous result implies that there exist some arm distributions F such that

lim T + Rew T (π ) 1 p EX maxk [K] µk T T < 0.

With (46), this implies that EXPLOIT-UCB-DOUBLE πn achieves, for any policy π ,

inf F Reg F (πn π ) = inf F lim T + Rew T (π ) Rew T (πn)

1 (p EX)n B max k [K] µk.