# Rotting Infinitely Many-Armed Bandits

Jung-hun Kim¹, Milan Vojnović², Se-Young Yun¹

¹Korea Advanced Institute of Science and Technology (KAIST), Daejeon, South Korea. ²London School of Economics, London, UK. Correspondence to: Se-Young Yun, Milan Vojnović.

Proceedings of the 39th International Conference on Machine Learning, Baltimore, Maryland, USA, PMLR 162, 2022. Copyright 2022 by the author(s).

Abstract

We consider the infinitely many-armed bandit problem with rotting rewards, where the mean reward of an arm decreases at each pull of the arm according to an arbitrary trend with maximum rotting rate ϱ = o(1). We show that this learning problem has an Ω(max{ϱ^{1/3}T, √T}) worst-case regret lower bound, where T is the time horizon. We show that a matching upper bound O(max{ϱ^{1/3}T, √T}), up to a poly-logarithmic factor, can be achieved by an algorithm that uses a UCB index for each arm and a threshold value to decide whether to continue pulling an arm or remove the arm from further consideration, when the algorithm knows the value of the maximum rotting rate ϱ. We also show that an O(max{ϱ^{1/3}T, T^{3/4}}) regret upper bound can be achieved by an algorithm that does not know the value of ϱ, by using an adaptive UCB index along with an adaptive threshold value.

1. Introduction

We consider a fundamental sequential learning problem in which an agent must play one option at a time from an infinite set of options with non-stationary reward distributions, where the mean reward of an option decreases at each play of this option. This is naturally studied as the infinitely many-armed bandit problem with rotting rewards. The assumption of infinitely many arms models practical situations where the number of arms is finite but large relative to the number of available experiments. There is an abundance of applications in which one must choose from a large set of options with rotting rewards, e.g. online advertising, where arms correspond to ads and rewards decrease over exposures of an ad to a user; content recommendation systems, where arms correspond to media items and rotting arises because of user boredom when watching the same content; and clinical trials, where the efficacy of a medicine may decrease because of drug tolerance when a patient takes the same medicine several times.

While there has been a lot of work on multi-armed bandits with a finite number of arms with stationary or non-stationary rewards, and an infinite number of arms with stationary rewards, not much seems to be known for the case of infinitely many arms with non-stationary rewards. In this paper we take first steps towards understanding the fundamental limits of sequential learning for an infinite number of arms whose mean rewards decrease with the number of pulls, the case commonly referred to as rested rotting bandits. Our focus is on rotting trends where the mean reward of an arm decreases arbitrarily by at most a fixed amount ϱ at each pull of this arm. The initial mean rewards of arms are assumed to be independent and identically distributed according to the uniform distribution on [0, 1]. The objective is to find a policy that minimizes the expected cumulative regret over a time horizon of T time steps with respect to playing the best arm. We show that the worst-case regret for this problem is lower bounded by Ω(max{ϱ^{1/3}T, √T}), where ϱ is the maximum rotting rate, and show that this lower bound is tight up to a poly-logarithmic factor.
This reveals that the rotting trend starts to have an effect on regret precisely at the threshold ϱ = Θ(1/T^{3/2}). Our result implies that the rested rotting bandit problem with infinitely many arms is harder than the stationary rewards case, as in the latter case the regret lower bound is Ω(√T) (Wang et al., 2009). This stands in stark contrast to the case of finitely many arms, K, for which it is known that O(√(KT)) can be achieved in the rotting case (Seznec et al., 2019), matching the optimal bound of the stationary case (Auer et al., 2002b) up to a poly-logarithmic factor.

In the case of infinitely many arms with stationary rewards, it is not possible to explore all arms to find an optimal arm; hence, it is required to find a near-optimal arm. Contrast this with the case of finitely many arms, where all arms must be explored to identify an optimal arm. Further, when we consider rotting rewards, the learner must keep exploring new arms because a near-optimal arm may become suboptimal as it is being pulled. Based on this fact, we design algorithms for the rotting infinitely many-armed bandit problem that achieve tight regret bounds. We summarize our contributions in more detail in what follows.

1.1. Summary of our contributions

We show an Ω(max{ϱ^{1/3}T, √T}) worst-case regret lower bound for the rested rotting bandit case with maximum rotting rate ϱ = o(1). This regret lower bound matches the regret lower bound Ω(√T) that is known to hold for the case of stationary rewards when rotting is sufficiently small, precisely when ϱ = O(1/T^{3/2}). Otherwise, when ϱ = ω(1/T^{3/2}), the regret lower bound becomes worse than for the stationary case.

We show that an O(max{ϱ^{1/3}T, √T}) regret can be achieved by an algorithm when the maximum rotting rate ϱ is known to the algorithm. This algorithm uses a UCB index to decide whether to continue pulling an arm or remove the arm from further consideration and switch to exploring a new arm, by comparing the index with a threshold. This threshold is set to account for rotting of rewards.

We further show that an O(max{ϱ^{1/3}T, T^{3/4}}) regret can be achieved by an algorithm that does not know the value of the maximum rotting rate ϱ. This algorithm uses an adaptive UCB index and an adaptive threshold value, comparing the UCB index of an arm with the threshold to decide whether to continue pulling this arm or remove the arm from further consideration. This upper bound matches the lower bound up to poly-logarithmic factors when the rotting rate ϱ is sufficiently large, i.e. when ϱ = Ω(1/T^{3/4}).

We present results of numerical experiments for randomly generated problem instances of rotting infinitely many-armed bandits. These results validate the insights derived from our theoretical results.

1.2. Related work

The work on multi-armed bandits can be distinguished with respect to two criteria: first, whether the number of arms is finite or infinite, and second, whether rewards of arms are stationary or non-stationary. For the case of non-stationary rewards, we can further distinguish rested from restless multi-armed bandit problems: in the former case, an arm's reward distribution may change only when the arm is pulled, while in the latter case, it may change at each time step. Our work falls in the category of multi-armed bandit problems with infinitely many non-stationary rested arms.

The case of finitely many arms with stationary rewards has been studied by many, following on Lai & Robbins (1985); Auer et al. (2002a).
There exist algorithms having O( KT) worst-case regret, where K is the number of arms, and this matches the lower bound Ω( KT) up to a poly-logarithmic factor (Auer et al., 2002b; Slivkins, 2019). We next discuss the case of finitely many arms with nonstationary rewards. The non-stationarity in rewards can be quantified by the number of abrupt changes or a variation budget, which is referred to as abrupt-changing and slowvarying environments, respectively. The non-stationary environments were studied by Auer et al. (2002b); Garivier & Moulines (2011); Besbes et al. (2014); Auer et al. (2019) in which proposed algorithms are based on a strategy of adapting current state rapidly and fading old history memory (e.g. sliding window, discount factor, and restarting). In addition to this, non-stationary environments were studied under various assumptions, e.g. contextual bandits and MDPs (Cheung et al., 2019; Chen et al., 2019; Zhao et al., 2020; Russac et al., 2019; Cheung et al., 2020), mortal bandits where arms have a stochastic lifetime (Chakrabarti et al., 2008; Trac a et al., 2020), and bandits where arm rewards evolve according to a continuous-time stochastic process (Slivkins & Upfal, 2008). The multi-armed bandit problem with a finite number of arms, where each arm s mean reward decays with the number of pulls of this arm, was first studied by Komiyama & Qin (2014); Heidari et al. (2016); Bouneffouf & F eraud (2016); Levine et al. (2017). Following Levine et al. (2017), this problem is referred to as rotting bandits problem. Levine et al. (2017) showed that a sliding-window algorithm has a O(K1/3T 2/3) regret in a non-parametric rested rotting setting where the only assumption is that mean rewards are positive and non-increasing in the number of pulls. The nonparametric rotting bandit problem, allowing mean rewards to be negative with bounded decay, was subsequently studied by Seznec et al. (2019), showing an algorithm that has an O( KT) problem instance independent bound. Seznec et al. (2020) showed that a single algorithm, an adaptivewindow UCB index policy, achieves near-optimal regret for both rested and restless rotting bandits. In this paper, we follow the non-parametric rested rotting setting where mean rewards can only decrease with bounded decrements. We next discuss the case of infinitely many arms with stationary rewards. Berry et al. (1997); Bonald & Prouti ere (2013) proposed algorithms with asymptotically optimal regret O( T) for the case of arms with Bernoulli rewards and independent mean values according to uniform distribution on [0, 1]. Wang et al. (2009) studied the case where the mean reward distribution has support on [0, µ ] with µ 1, and for each arm a the distribution of mean reward µ(a) is such that P(µ(a) µ z) = Θ(zβ), for some β > 0. Carpentier & Valko (2015) studied the same problem but focused on simple regret, defined as the instantaneous regret at time step T. Bayati et al. (2020) showed that a subsampled UCB algorithm (SSUCB) that samples Θ( T) arms and executes UCB only on this subset of arms has O( T) regret under 1-sub-Gaussian rewards with mean rewards according to uniform distribution on [0, 1]. In this Rotting Infinitely Many-armed Bandits setting, for mean reward distributions such that there is a large enough number of near-optimal arms, an algorithm may find a near-optimal arm by exploring a restricted number of arms. 
There also exist several works dealing with infinitely many arms under structured reward functions, such as contextual linear bandits (Abbasi-Yadkori et al., 2011) and Lipschitz bandits (Bubeck et al., 2011). In this paper, however, we focus on infinitely many arms under a mean reward distribution, where the structured-reward assumptions may not hold because of rotting.

Our work is different from the work discussed in this section in that we consider the case of infinitely many non-stationary, rotting arms. In the case of rotting bandits with a finite number of arms, as we mentioned, Seznec et al. (2019; 2020) achieve a worst-case regret bound of O(√(KT)), which matches the near-optimal regret in the stationary stochastic setting. This result indicates that rotting in the finitely many arms setting is not a harder problem than the stationary rewards setting. However, in the setting of infinitely many arms, rotting of rewards makes the problem harder than the stationary rewards case. This is because the value of the optimal mean reward does not decrease as arms are being pulled, since there are infinitely many near-optimal arms, which requires additional exploration to recurrently search for a new optimal arm outside of the set of already pulled arms. Our algorithms differ from previously proposed algorithms for the case of infinitely many arms with stationary rewards in that they keep exploring new arms to find a near-optimal arm over time because of rotting rewards. In more detail, our algorithms use UCB policies to decide whether to continue pulling an arm or remove the arm from further consideration and explore a new arm, by comparing its UCB index with a threshold which is adjusted using the rotting rate or an estimated value of the rotting rate.

2. Problem formulation

We consider a non-stationary bandit problem with infinitely many arms where the reward distributions of arms vary over time. We consider the case when the mean reward of an arm may decrease only when this arm is pulled by an agent that uses a policy π, which is referred to as the rested rotting bandit setting. Let A be an infinite set of arms, µ_t(a) be the mean reward of arm a at time t before pulling an arm at time t, and n_t(a) be the number of times arm a ∈ A is pulled by π before time t. Also, denote by r_t the stochastic reward gained by pulling arm a_t^π at time t. Let r_t = µ_t(a_t^π) + η_t, where η_t is a noise term with a 1-sub-Gaussian distribution. We assume that the initial mean rewards {µ_1(a)}_{a∈A} are i.i.d. random variables with uniform distribution on [0, 1].

The rotting of arms is defined as follows. Given a rotting rate 0 ≤ ϱ_t ≤ ϱ at time t ≥ 1, with maximum rotting rate ϱ = o(1), the mean reward of the selected arm at time t changes as µ_{t+1}(a_t) = µ_t(a_t) − ϱ_t, whereas the mean rewards of other arms remain unchanged. The mean reward of every arm a ∈ A at time t > 1 can therefore be represented as follows: with 0 ≤ ϱ_s ≤ ϱ for all time steps 0 < s < t,

µ_t(a) = µ_1(a) − Σ_{s=1}^{t−1} ϱ_s 1(a_s = a).

The objective is to find a policy that minimizes the expected cumulative regret over a time horizon of T time steps, which for a given policy π is defined as

E[R^π(T)] = E[ Σ_{t=1}^{T} (1 − µ_t(a_t^π)) ].

In the regret definition, we use that the mean reward of the optimal arm at any time t is equal to 1. This is because there is an infinite number of arms in A with i.i.d. mean rewards according to the uniform distribution on [0, 1], so that there always exist sufficiently many arms whose mean rewards are close enough to 1.
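To make the formulation concrete, the following Python sketch simulates the rested rotting environment described above: arms are drawn lazily from the infinite pool with Uniform[0, 1] initial means, rewards carry standard Gaussian (hence 1-sub-Gaussian) noise, and the pulled arm's mean decreases by a rotting amount. The class and method names are ours and are used only for illustration; the sketch uses the identical rotting ϱ_t = ϱ, though any trend with 0 ≤ ϱ_t ≤ ϱ could be substituted.

```python
import numpy as np

class RottingInfiniteBandit:
    """Illustrative simulator of the rested rotting setting in Section 2.

    Arms are instantiated lazily from the infinite pool: each newly selected
    arm draws its initial mean reward from Uniform[0, 1]. Pulling an arm
    returns its current mean plus standard Gaussian (1-sub-Gaussian) noise,
    then decreases that arm's mean by a rotting amount.
    """

    def __init__(self, varrho, rng=None):
        self.varrho = varrho                    # maximum rotting rate
        self.rng = rng or np.random.default_rng()
        self.means = []                         # current mean reward of each created arm
        self.regret = 0.0                       # cumulative regret against the optimal mean 1

    def new_arm(self):
        """Select a fresh arm from the infinite pool and return its index."""
        self.means.append(self.rng.uniform(0.0, 1.0))
        return len(self.means) - 1

    def pull(self, arm):
        """Pull `arm`: collect a noisy reward, apply rotting, track regret."""
        mean = self.means[arm]
        reward = mean + self.rng.normal(0.0, 1.0)
        self.regret += 1.0 - mean               # the optimal mean reward is 1
        # identical rotting rho_t = varrho; any trend with 0 <= rho_t <= varrho
        # could be substituted here
        self.means[arm] -= self.varrho
        return reward
```

For example, with T = 10^6, setting varrho = T**(-1.5) corresponds to the nearly stationary regime, while varrho = T**(-0.5) falls in the large rotting regime; these match the settings used in the experiments of Section 5.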
In what follows, selecting an arm means that a policy chooses an arm in A before playing it, and pulling an arm means that the policy plays a selected arm and receives a reward.

3. Regret lower bound

We first discuss two different regimes for the regret depending on the value of the maximum rotting rate ϱ. When ϱ ≤ 1/T^{3/2}, the mean reward of any arm over the time horizon of T time steps changes by at most ϱT ≤ 1/√T. Therefore, for any arm, there can be at most a gap of 1/√T between the initial mean reward and the mean reward after T time steps, which causes an additional regret of at most √T over the horizon of T time steps compared with the case of stationary arms (i.e. when ϱ = 0). In Theorem 3 in Wang et al. (2009), the optimal regret for the stationary case, with uniform distribution of mean rewards of arms, is shown to be of order √T. Therefore, the extra regret of √T from the rotting with ϱ ≤ 1/T^{3/2} does not affect the order of the regret. When ϱ > 1/T^{3/2}, we expect that the regret lower bound may be different than for the stationary case. By analyzing the regret lower bound for the specific case with ϱ_t = ϱ for all time steps t > 0, we provide a lower bound for the worst-case regret with respect to arbitrary rotting, as given in the following theorem.

Theorem 3.1. For the rotting infinitely many-armed bandit problem, there exist rotting rates 0 ≤ ϱ_t ≤ ϱ = o(1) for all time steps t > 0 such that any policy π has regret over a time horizon of T time steps satisfying E[R^π(T)] = Ω(max{ϱ^{1/3}T, √T}).

From the result of the theorem, when the rotting is small enough, i.e. precisely when ϱ ≤ 1/T^{3/2}, the lower bound corresponds to Ω(√T), which is known to hold when rewards are stationary. Otherwise, when the rotting is sufficiently large, the lower bound is Ω(ϱ^{1/3}T). For example, when ϱ = 1/T^γ for some γ > 0, we have the lower bound Ω(√T) if γ ≥ 3/2 (small rotting case), and otherwise (large rotting case) we have Ω(T^{1−γ/3}). We can observe that ϱ = Θ(1/T^{3/2}) is a transition point at which the lower bound switches from Ω(√T) to Ω(ϱ^{1/3}T).

Proof sketch. Here we present a proof sketch of the theorem, with the full version of the proof provided in Appendix A.2. We assume that ϱ_t = ϱ = o(1) for all time steps t > 0. When ϱ = O(1/T^{3/2}), the lower bound for the stationary case of order √T (Wang et al., 2009) is tight enough for the non-stationary case. This is because we only need to pay an extra regret of at most order √T for small ϱ. Therefore, when ϱ = O(1/T^{3/2}), we have

E[R^π(T)] = Ω(√T). (1)

We note that even though the mean rewards are rotting in our setting, we can easily obtain (1) by following the same proof steps as Theorem 3 in Wang et al. (2009). For the sake of completeness, we provide a proof in Appendix A.2.

When ϱ = ω(1/T^{3/2}), however, the lower bound for the stationary case is not tight enough. Here we provide the proof of the lower bound Ω(ϱ^{1/3}T) for the case when ϱ = ω(1/T^{3/2}). For showing the lower bound, we classify each arm as either bad, good, or neither, according to the definition given shortly. To distinguish bad and good arms, we use two thresholds, 1 − c and 1 − δ, respectively, where c and δ are such that 0 < 1 − c < 1 − δ < 1, δ = ϱ^{1/3}, and c is a constant. An arm a is said to be a bad arm if µ_1(a) ≤ 1 − c, and is said to be a good arm if µ_1(a) > 1 − δ. Let N_T be the number of distinct selected good arms until time step T.
We separately consider the two cases N_T < m and N_T ≥ m, where m = ⌈(1/2)Tϱ^{2/3}⌉, and show that each case has Ω(ϱ^{1/3}T) as the regret lower bound. The main ideas for each case are outlined as follows. When the number of selected good arms is relatively small (N_T < m), any policy π must pull arms with mean rewards less than 1 − δ for at least T/2 time steps until T, amounting to at least δ regret for each such pull (the gap between 1 and the mean reward of the pulled arm). Therefore, the regret is lower bounded by Ω(δ(T/2)) = Ω(ϱ^{1/3}T). When the number of selected good arms is relatively large (N_T ≥ m), we can show that any policy π is likely to select at least of order ϱ^{1/3}T distinct bad arms until T. From the fact that the selected bad arms are pulled at least once and each such pull adds a constant regret of value at least c, the regret is lower bounded by Ω(cϱ^{1/3}T) = Ω(ϱ^{1/3}T). Therefore, when ϱ = ω(1/T^{3/2}), we obtain

E[R^π(T)] = E[R^π(T)1(N_T < m) + R^π(T)1(N_T ≥ m)] = Ω(ϱ^{1/3}T). (2)

Finally, from (1) and (2), we have E[R^π(T)] = Ω(max{ϱ^{1/3}T, √T}).

4. Algorithms and regret upper bounds

In this section, we first present an algorithm for the rested rotting bandit problem with infinitely many arms for the case when the algorithm knows the value of the maximum rotting rate. We show a regret upper bound for the algorithm that matches the regret lower bound in Theorem 3.1 up to a poly-logarithmic factor. Second, we present an algorithm that does not know the maximum rotting rate and show a regret upper bound that matches the regret lower bound up to a poly-logarithmic factor when the maximum rotting rate is large enough.

4.1. An algorithm knowing maximum rotting rate

We present an algorithm which requires knowledge of the maximum rotting rate in Algorithm 1.

Algorithm 1 UCB-Threshold Policy (UCB-TP)
  Given: T, δ, A; Initialize: A′ ← A
  Select an arm a ∈ A′
  Pull arm a and get reward r_1
  for t = 2, . . . , T do
    Update the initial mean reward estimator µ^o_t(a)
    if UCB_t(a) ≥ 1 − δ then
      Pull arm a and get reward r_t
    else
      A′ ← A′ \ {a}
      Select an arm a ∈ A′
      Pull arm a and get reward r_t
    end if
  end for

The algorithm selects an arm and pulls this arm as long as the arm is tested to be a good arm, using a test that compares an upper confidence bound of this arm with a threshold value. Specifically, if a is the selected arm at time step t, the algorithm computes an estimator µ^o_t(a) of the initial mean reward of arm a and uses this estimator to compute an estimator of the mean reward of arm a at time step t, accounting for the worst-case rotting rate ϱ in the estimators. Comparing the upper confidence bound for the mean reward with the threshold 1 − δ, the algorithm tests whether the arm is a good arm. If the arm is tested to be a good arm, then the algorithm continues to pull this arm. Otherwise, it discards the arm and selects a new one, and repeats the procedure described above until time horizon T.

We consider Algorithm 1 with the initial mean reward estimator defined as

µ^o_t(a) := (Σ_{s=1}^{t−1} (r_s + ϱ n_s(a)) 1(a_s = a)) / n_t(a)

and the upper confidence bound term defined as

UCB_t(a) := µ^o_t(a) − ϱ n_t(a) + √(8 log(T)/n_t(a)).

Note that when ϱ_s = ϱ for all 0 < s < t, µ^o_t(a) is an unbiased estimator of the initial mean reward µ_1(a) of arm a, and µ^o_t(a) − ϱ n_t(a) is an unbiased estimator of the mean reward µ_t(a) of arm a at time step t. The upper confidence bound UCB_t(a) follows the standard definition of an upper confidence bound.
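The following Python sketch illustrates how Algorithm 1 can be implemented when ϱ is known, reusing the simulator sketched in Section 2. It is a minimal rendering of UCB-TP under our own naming, not the authors' reference code: it maintains the running statistics of the currently selected arm only, computes µ^o_t(a) and UCB_t(a) as defined above, and discards the arm once its UCB index falls below the threshold 1 − δ with δ = max{ϱ^{1/3}, 1/√T}.

```python
import numpy as np

def ucb_threshold_policy(env, T, varrho):
    """Sketch of Algorithm 1 (UCB-TP); assumes the maximum rotting rate varrho is known."""
    delta = max(varrho ** (1.0 / 3.0), 1.0 / np.sqrt(T))   # threshold parameter of Theorem 4.1

    def ucb_index(sum_shifted, n):
        # mu^o_t(a) - varrho * n_t(a) + sqrt(8 log(T) / n_t(a))
        return sum_shifted / n - varrho * n + np.sqrt(8.0 * np.log(T) / n)

    arm, n, sum_shifted = env.new_arm(), 0, 0.0
    for _ in range(T):
        if n > 0 and ucb_index(sum_shifted, n) < 1.0 - delta:
            # the current arm is tested bad: remove it and select a fresh arm
            arm, n, sum_shifted = env.new_arm(), 0, 0.0
        reward = env.pull(arm)
        sum_shifted += reward + varrho * n   # add back the rotting accumulated before this pull
        n += 1
    return env.regret
```

Since a discarded arm is never revisited, only the pull count and running sum of the currently selected arm need to be stored.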
By designing the estimators to account for the maximum rotting rate ϱ, for any arbitrary rotting rates ϱ_t ≤ ϱ over all time steps t > 0, we show that the algorithm has a near-optimal worst-case regret upper bound, given in the following theorem.

Theorem 4.1. For the policy π defined by Algorithm 1 with δ = max{ϱ^{1/3}, 1/√T}, and ϱ = o(1), the regret satisfies E[R^π(T)] = O(max{ϱ^{1/3}T, √T}).

Proof sketch. Here we present a proof sketch of the theorem, with a full version of the proof provided in Appendix A.3. Observe that the initial mean rewards of selected arms are i.i.d. random variables with uniform distribution on [0, 1]. We first define arms to be good or bad depending on their initial mean rewards. We assume ϱ = o(1) and set δ = max{ϱ^{1/3}, 1/√T}. Then we define an arm a to be a good arm if ∆(a) ≤ δ/2, where ∆(a) = 1 − µ_1(a); otherwise, a is a bad arm. Because of rotting, an initially good arm may become bad after pulling the arm several times. Therefore, the policy π may select several good arms over the entire time horizon, and we analyze the regret over time episodes defined by the selections of good arms. Given the policy, we refer to the period starting from selecting the (i − 1)-st good arm until selecting the i-th good arm as the i-th episode. Because of the uniform distribution of mean rewards and the small threshold δ/2, each episode is likely to contain several consecutive selections of bad arms.

We do an episodic analysis. We first analyze the expected regret per episode and multiply it by the expected number of episodes in T. However, this proof strategy has an issue: the regret of each episode and the number of episodes in T are not independent. To resolve the issue, we fix the number of episodes to m^G and analyze the regret not for T but for m^G episodes. Note that m^G is a fixed value, and the time after m^G episodes can exceed T. For obtaining a regret bound, we set m^G so that the total number of time steps for m^G episodes is larger than T with high probability and thus R^π(T) ≤ R^π_{m^G}, where R^π_{m^G} is the regret accumulated over m^G episodes. For the regret analysis, we denote by m^B_i the number of selections of distinct bad arms in the i-th episode. See Figure 1 for an illustration of the episodes in a time line.

Figure 1: Episodes in a time line.

In what follows, we provide an overview of the regret analysis by considering two separate cases depending on the value of the maximum rotting rate ϱ: one for a large rotting case, and the other for a small rotting case, which we may interpret as a near-stationary case. For the analysis, we use R^G_i to denote the regret accumulated by pulling the good arm in the i-th episode, and R^B_{i,j} to denote the regret accumulated by pulling the j-th bad arm in the i-th episode.

Case of large rotting ϱ = ω(1/T^{3/2}): We first show that by setting m^G = ⌈2Tϱ^{2/3}⌉, we have R^π(T) ≤ R^π_{m^G}. If the policy selects a good arm a, where µ_1(a) ≥ 1 − δ/2, then it must pull the arm at least δ/(2ϱ) times, with high probability, to decrease the mean reward below the threshold 1 − δ. Then the total number of time steps for m^G episodes is at least (δ/(2ϱ)) m^G = (1/(2ϱ^{2/3})) ⌈2Tϱ^{2/3}⌉ ≥ T. This implies that R^π(T) ≤ R^π_{m^G}. We next provide a bound for E[R^π(T)] using E[R^π_{m^G}]. For bounding E[R^π_{m^G}], we can show that for any i ∈ [m^G] and j ∈ [m^B_i], we have

E[R^G_i] = O(1/ϱ^{1/3}) and E[R^B_{i,j}] = O(1). (3)

Observe that m^B_1, . . . , m^B_{m^G} are i.i.d. random variables with geometric distribution with parameter δ/2. Therefore, for any non-negative integer k, we have P(m^B_i = k) = (1 − δ/2)^k δ/2 and E[m^B_i] = (2/δ) − 1 for all i ∈ [m^G].
We have set δ = max{ϱ^{1/3}, 1/√T}. Then when ϱ = ω(1/T^{3/2}), with m^G = ⌈2Tϱ^{2/3}⌉ and from (3), E[m^B_i] = 2/δ − 1, and δ = ϱ^{1/3}, we have

E[R^π(T)] = O(E[R^π_{m^G}]) = O((E[R^G_i] + E[Σ_{j∈[m^B_i]} R^B_{i,j}]) m^G) = O((1/ϱ^{1/3} + E[m^B_i]) m^G) = O(ϱ^{1/3}T). (4)

Case of small rotting ϱ = O(1/T^{3/2}): By setting m^G = C for some large constant C > 0, we can show that R^π(T) ≤ R^π_{m^G}. This is because if the policy selects a good arm a, then it must pull the arm at least of order T times with high probability: the small rotting case is a near-stationary setting, so the policy can pull a good arm for a large number of time steps. For bounding E[R^π_{m^G}], we can show that for any i ∈ [m^G] and j ∈ [m^B_i],

E[R^G_i] = O(√T) and E[R^B_{i,j}] = O(1). (5)

Then when ϱ = O(1/T^{3/2}), with m^G = C and from (5), E[m^B_i] = 2/δ − 1, and δ = Θ(1/√T), we have

E[R^π(T)] = O(E[R^π_{m^G}]) = O((√T + E[m^B_i]) m^G) = O(√T). (6)

We note that in the small rotting case, the policy achieves O(√T), which matches a near-optimal bound for the stationary setting (Wang et al., 2009).

Putting the pieces together: From (4) and (6), by taking ϱ = o(1) and δ = max{ϱ^{1/3}, 1/√T}, it follows that E[R^π(T)] = O(max{ϱ^{1/3}T, √T}).

4.2. An algorithm not knowing maximum rotting rate

In this section we present an algorithm, defined in Algorithm 2, which does not require information about the maximum rotting rate ϱ, and provide a regret upper bound for this algorithm. The algorithm adopts the strategy of hierarchical bandit algorithms, similar to BOB (bandit-over-bandit) (Cheung et al., 2019). It consists of a master algorithm and several base algorithms, where EXP3 (Auer et al., 2002b) is used as the master algorithm, whose goal is to find a near-optimal base algorithm; each base algorithm is a UCB index policy similar to that in Algorithm 1, run with a candidate rotting rate and an adaptive threshold to decide whether to continue pulling the currently selected arm.

In Algorithm 2, the time horizon of T time steps is partitioned into blocks of H time steps. Before starting each block, the master algorithm selects a rotting estimator β̂ of the unknown maximum rotting rate ϱ from a set of candidate values denoted by B. Then it runs a base algorithm over H time steps which decides whether to continue pulling the selected arm based on a UCB index and a threshold tuned using the selected β̂. Utilizing the rewards obtained over the block as feedback, it updates the master to find a near-optimal base and repeats the procedure described above until time horizon T.

Algorithm 2 Adaptive UCB-Threshold Policy (AUCB-TP)
  Given: T, H, B, A, α; Initialize: A′ ← A, w(β) ← 1 for β ∈ B
  for i = 1, 2, . . . , ⌈T/H⌉ do
    Select an arm a ∈ A′
    Pull arm a and get reward r_{(i−1)H+1}
    p(β) ← (1 − α) w(β)/Σ_{k∈B} w(k) + α/|B| for β ∈ B
    Select β̂ ∈ B with probability p(β̂)
    δ ← β̂^{1/3}
    for t = (i − 1)H + 2, . . . , iH ∧ T do
      if UCB_{i,t}(a, β̂) ≥ 1 − δ then
        Pull arm a and get reward r_t
      else
        A′ ← A′ \ {a}
        Select an arm a ∈ A′
        Pull arm a and get reward r_t
      end if
    end for
    w(β̂) ← w(β̂) exp( (α/(|B| p(β̂))) (1/2 + Σ_{t=(i−1)H+1}^{iH∧T} r_t / (186H log T + 4√(H log T))) )
  end for

We note that the term 1/2 + Σ_{t=(i−1)H+1}^{iH∧T} r_t / (186H log T + 4√(H log T)) in the update of w(β̂) in Algorithm 2 rescales and translates the rewards so that they lie in [0, 1] with high probability. Also, by optimizing the block size H, we can control the regrets induced by the master and by a base: increasing H increases the regret induced by the master and decreases the regret induced by a base. These facts are shown later in the proof of Theorem 4.2.
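The following Python sketch illustrates the structure of Algorithm 2, again reusing the simulator sketched in Section 2; it is our own minimal rendering under assumed helper names, not the authors' code. An EXP3 master maintains weights over the candidate set B, picks a candidate rotting rate β̂ per block of H steps, runs the UCB-threshold base with threshold 1 − β̂^{1/3} inside the block, and updates the weight of β̂ with the rescaled block reward. The candidate grid, the mixing parameter α, and the index UCB_{i,t} are instantiated as defined in the next paragraph.

```python
import numpy as np

def adaptive_ucb_threshold_policy(env, T, rng=None):
    """Sketch of Algorithm 2 (AUCB-TP): an EXP3 master over candidate rotting
    rates, each driving a UCB-threshold base for one block of H time steps."""
    rng = rng or np.random.default_rng()
    H = int(np.sqrt(T))                                  # block length, H = T^{1/2} up to rounding
    k_max = int(np.ceil(1.5 * np.log2(H)))               # B = {2^-3, 2^-4, ..., 2^{-(3/2) log2 H}}
    B = [2.0 ** (-k) for k in range(3, k_max + 1)]
    nB = len(B)
    alpha = min(1.0, np.sqrt(nB * np.log(nB) / ((np.e - 1) * np.ceil(T / H))))
    w = np.ones(nB)                                      # EXP3 weights, one per candidate
    scale = 186.0 * H * np.log(T) + 4.0 * np.sqrt(H * np.log(T))

    t = 0
    while t < T:
        p = (1 - alpha) * w / w.sum() + alpha / nB       # EXP3 selection probabilities
        j = rng.choice(nB, p=p)
        beta = B[j]                                      # candidate rotting rate for this block
        delta = beta ** (1.0 / 3.0)                      # adaptive threshold 1 - delta
        block_reward = 0.0
        arm, n, sum_shifted = env.new_arm(), 0, 0.0
        for _ in range(min(H, T - t)):
            if n > 0:
                ucb = sum_shifted / n - beta * n + np.sqrt(8.0 * np.log(H) / n)
                if ucb < 1.0 - delta:                    # current arm tested bad
                    arm, n, sum_shifted = env.new_arm(), 0, 0.0
            r = env.pull(arm)
            block_reward += r
            sum_shifted += r + beta * n
            n += 1
            t += 1
        # EXP3 update with the rescaled and translated block reward as feedback
        w[j] *= np.exp(alpha / (nB * p[j]) * (0.5 + block_reward / scale))
    return env.regret
```

The base inside each block is the same routine as in the sketch of Algorithm 1, with T replaced by H and the known ϱ replaced by the selected candidate β̂.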
In what follows, we define the inputs B and α, and the upper confidence bound index UCBi,t(a, β) for β B. B contains candidate values of β to optimize UCBi,t(a, β) and the threshold parameter δ. We find that the optimal base parameter β B is when β = max{1/H3/2, ϱ} including a clipped domain for the optimal threshold value δ as in Theorem 4.1. This implies that the optimized β 1/H3/2 = 2 3/2 log2 H. Also from ϱ = o(1), β 1/8. Therefore, we set B = {2 3, 2 4, . . . , 2 (3/2) log2 H }, in which the cardinality of the set is restricted by O(log H) which does not hurt the regret from EXP3 up to a logarithmic factor. Let B = |B|. Then we set α = min{1, p B log B/((e 1) T/H )}, which is used to guarantee a least selection probability for each base. Let ni,t(a) be the number of times that arm a A is pulled by the algorithm from time step (i 1)H + 1 before time step Rotting Infinitely Many-armed Bandits t. Let µo i,t(a, β) be defined as µo i,t(a, β) := Pt 1 s=(i 1)H+1(rs + βni,s(a))1(as = a) and the UCBi,t(a, β) index be defined as UCBi,t(a, β) := µo i,t(a, β) βni,t(a)+ q 8 log(H)/ni,t(a). We provide a worst-case regret upper bound for Algorithm 2 in the following theorem. Theorem 4.2. With ϱ = o(1), for the policy π defined by Algorithm 2 with H = T 1/2 , the regret satisfies E[Rπ(T)] = O(max{ϱ1/3T, T 3/4}). Proof sketch. Here we present a proof sketch of the theorem with a full version of the proof given in Appendix A.4. The policy π consists of two strategies: EXP3 for the master and UCB-Threshold Policy (Algorithm 1) for bases. We can decompose the regret into two parts: regret incurred by playing a base with β B over each block of H time steps and regret incurred due to the master trying to find a nearoptimal base parameter in B. In what follows, we define the regret decomposition formally. Let πi(β) for β B denote the base policy with β for time steps between (i 1)H + 1 and i H T. Denote by aπi(β) t the pulled arm at time step t by policy πi(β). Then, for β B, which is set later for a near-optimal base, we have E[Rπ(T)] = E t=(i 1)H+1 µt(aπ t ) = E[Rπ 1 (T)] + E[Rπ 2 (T)], (7) t=(i 1)H+1 µt(aπi(β ) t ), µt(aπi(β ) t ) µt(aπ t ) . Note that Rπ 1 (T) accounts for the regret caused by the nearoptimal base algorithm πi(β ) against the optimal mean reward and Rπ 2 (T) accounts for the regret caused by the master algorithm by selecting a base with β B at every block against the base with β as illustrated in Figure 2. We set β to be the smallest value in B which is larger than max{ϱ, 1/H3/2}. Then the base has the threshold parameter β 1/3 of order max{ϱ1/3, 1/ H} which coincides with Figure 2: Regret decomposition for Algorithm 2. the optimal threshold parameter of Algorithm 1 by replacing T with H. We note that the policy π does not require knowing β and it is defined only for the proof. In what follows, we provide upper bounds for each regret component. We first provide an upper bound for E[Rπ 1 (T)] by following the proof steps in Theorem 4.1. We can easily find that regret of the base with β for each block of size H that has the same regret bound as in Theorem 4.1 by replacing T with H amounting to O(max{ϱ1/3H, H}). Then by adding the regret for T/H number of blocks, we have E[Rπ 1 (T)] = O((T/H) max{ϱ1/3H, = O(max{ϱ1/3T, T/ Then we provide an upper bound for E[Rπ 2 (T)] using a regret bound for EXP3 in (Auer et al., 2002b). The EXP3 in policy π selects a base in B before starting a block and gets feedback at the end of the block and repeats this over T/H blocks. 
Therefore, EXP3 in π can be thought to be run for T/H decision rounds and the number of decision options for each round is B. Let Q be an upper bound for the absolute sum of rewards for any block with length H with high probability. Then from Corollary 3.2 in (Auer et al., 2002b), we can show that E[Rπ 2 (T)] = O(Q p B(T/H)). (9) By considering that mean rewards may become negative because of rotting and using a Chernoff s bound, we show that with high probability t=(i 1)H+1 rt 93H log T + 2 p H log T. (10) Then with B = O(log T), from (9) and (10), we have E[Rπ 2 (T)] = O H log(T) p Finally, from (7), (8), and (11), with H = T 1/2 , we have E[Rπ(T)] = O(max{ϱ1/3T, T/ = O(max{ϱ1/3T, T 3/4}). This concludes the proof. Rotting Infinitely Many-armed Bandits The regret bound for Algorithm 2 in Theorem 4.2 is larger than or equal to that for Algorithm 1 in Theorem 4.1. This is because the master in Algorithm 2 needs to learn the unknown maximum rotting rate to find a near-optimal base algorithm, which produces extra regret. In the following remarks, we discuss the region of the maximum rotting rate ϱ for which Algorithm 2 achieves the near-optimal regret and discuss the computation and memory efficiency of our algorithms. Remark 4.3. When ϱ = Ω(1/T 3/4), Algorithm 2 achieves the optimal regret bound O(ϱ1/3T) up to a poly-logarithmic factor. This is because when ϱ = Ω(1/T 3/4), the additional regret from learning the maximum rotting rate is negligible compared with the regret from the rotting of rewards. It is an open problem to achieve the optimal regret bound for any value of ϱ, without knowing the value of this parameter. Remark 4.4. Note that we can achieve the same regret bounds as in Theorems 4.1 and 4.2 for the worst-case rotting by replacing our UCB index with adaptive-window UCB index in (Seznec et al., 2020) which is known to achieve near-optimal regret for the case of rotting with a finite number of arms. However, computing the adaptive-window UCB over horizon T has high computation cost, O(T 2) and O(TH), and memory space O(T) and O(H), in Algorithms 1 and 2, respectively, for optimizing the window size. On the other hand, our proposed UCB index has a lower computation cost of O(T) and requires only O(1) memory space using a simple trick for updating mean estimators in online computation settings (Welford, 1962). 5. Numerical experiments In this section we present results of our numerical experiments using synthetic data in the rotting setting with infinitely many arms1. To the best of our knowledge, there are no previously-proposed algorithms for the setting of rotting with infinitely many arms. We compare the performance of Algorithms 1 and 2 with SSUCB (Bayati et al., 2020), which was proposed for infinitely many arms with stationary rewards, and is known to have a near-optimal regret for stationary sub-Gaussian reward distributions. From the theoretical results in Section 3 and Section 4, we expect that Algorithm 1 and SSUCB would have similar performance when the maximum rotting rate is sufficiently small, which may be regarded as a nearly stationary case, and that both Algorithm 1 and Algorithm 2 would outperform SSUCB when the rotting rate is sufficiently large. We expect that Algorithm 2 would not be competitive in the nearly stationary case because of the extra regret from the rotting rate estimation. 
It requires the rotting rate to be sufficiently large 1Our code is available at https://github.com/ junghunkim7786/rotting_infinite_armed_ bandits Figure 3: Performance of Algorithms 1 and 2, and SSUCB: (a) regret versus ϱ1/3 for fixed time horizon T, and (b,c,d) regret versus time horizon T with rotting rates ϱ = 1/T 1/2 (b), 1/T 3/5 (c), and 1/T 3/2 (d). to have the near-optimal regret bound. We will confirm these insights by our numerical results in what follows. In our experiments, we consider the case of identical rotting with ϱt = ϱ for all t [T]. We generate initial mean rewards of arms by sampling from uniform distribution on [0, 1]. In each time step, stochastic reward from pulling an arm has a Gaussian noise with mean zero and variance 1. We repeat each experiment 10 times and compute confidence intervals for confidence probability 0.95. We first investigate the performance of algorithms for varied rotting rate ϱ and fixed time horizon T. We ran experiments for rotting rate ϱ set to 1/T, 1/T 0.9, 1/T 0.8, . . . , 1/T 0.3, and measured the expected regret for the time horizon T = 106. In Figure 3 (a), we can confirm that our algorithms show more robust performance than SSUCB for various rotting rates with linearly increasing regret with respect to ϱ1/3 for large rotting, which matches Theorems 4.1 and 4.2. We observe that for large rotting cases, our algorithms have similar performance and both outperform SSUCB. For sufficiently small rotting, we can observe that all the three algorithms have comparable performance while Algorithm 2 performs slightly worse. These results conform to the insights derived from our theoretical analysis. We next investigate the performance of algorithms versus time horizon T and rotting rate ϱ depending on T. We ran experiments for time horizon T taking values 1, 1 105, 2 105, . . . , 106, and measured expected regret of each case. We set the rotting rate ϱ to 1/T 1/2, 1/T 3/5 and 1/T 3/2. Note that the case 1/T 3/2 may be considered as a Rotting Infinitely Many-armed Bandits nearly-stationary case, because in this case the regret lower bound is Ω( T) from Theorem 3.1. In Figure 3 (b) and (c), corresponding to large rotting rates, we observe that Algorithms 1 and 2 have similar performance and outperform SSUCB. The gaps between regrets of our algorithms and SSUCB become smaller by decreasing the rotting rate from 1/T 1/2 to 1/T 3/5. This is because SSUCB is designed for the case of stationary rewards and has a near-optimal regret in the stationary case. In the near-stationary case, in Figure 3 (d), we observe that Algorithm 1 has best performance and, as expected, Algorithm 2 has worst performance. 6. Conclusion In this paper we studied the infinitely many-armed bandit problem with rested rotting rewards. We provided a regret lower bound and proposed an algorithm which achieves a near-optimal regret, when the the maximum rotting rate is known to the algorithm. We also proposed an algorithm which does not require knowledge of the rotting rate and we showed that it achieves a near-optimal regret for any large enough rotting rate. In future work, it may be of interest to relax the assumption of uniform distribution for initial mean rewards. Abbasi-Yadkori, Y., P al, D., and Szepesv ari, C. Improved algorithms for linear stochastic bandits. Advances in neural information processing systems, 24, 2011. Auer, P., Cesa-Bianchi, N., and Fischer, P. Finite-time analysis of the multiarmed bandit problem. Machine learning, 47(2):235 256, 2002a. 
Auer, P., Cesa-Bianchi, N., Freund, Y., and Schapire, R. E. The nonstochastic multiarmed bandit problem. SIAM Journal on Computing, 32(1):48 77, 2002b. Auer, P., Gajane, P., and Ortner, R. Adaptively tracking the best bandit arm with an unknown number of distribution changes. In Conference on Learning Theory, pp. 138 158. PMLR, 2019. Bayati, M., Hamidi, N., Johari, R., and Khosravi, K. Unreasonable effectiveness of greedy algorithms in multi-armed bandit with many arms. In Larochelle, H., Ranzato, M., Hadsell, R., Balcan, M. F., and Lin, H. (eds.), Advances in Neural Information Processing Systems, volume 33, pp. 1713 1723. Curran Associates, Inc., 2020. Berry, D. A., Chen, R. W., Zame, A., Heath, D. C., and Shepp, L. A. Bandit problems with infinitely many arms. The Annals of Statistics, 25(5):2103 2116, 1997. Besbes, O., Gur, Y., and Zeevi, A. Stochastic multi-armedbandit problem with non-stationary rewards. In Proceed- ings of the 27th International Conference on Neural Information Processing Systems - Volume 1, NIPS 14, pp. 199 207, Cambridge, MA, USA, 2014. MIT Press. Bonald, T. and Prouti ere, A. Two-target algorithms for infinite-armed bandits with bernoulli rewards. In Proceedings of the 26th International Conference on Neural Information Processing Systems - Volume 2, NIPS 13, pp. 2184 2192, Red Hook, NY, USA, 2013. Curran Associates Inc. Bouneffouf, D. and F eraud, R. Multi-armed bandit problem with known trend. Neurocomputing, 205:16 21, 2016. Brown, D. G. How I wasted too long finding a concentration inequality for sums of geometric variables. Found at https://cs. uwaterloo. ca/ browndg/negbin. pdf, 8(4), 2011. Bubeck, S., Stoltz, G., and Yu, J. Y. Lipschitz bandits without the lipschitz constant. In International Conference on Algorithmic Learning Theory, pp. 144 158. Springer, 2011. Carpentier, A. and Valko, M. Simple regret for infinitely many armed bandits. In Proceedings of the 32nd International Conference on International Conference on Machine Learning - Volume 37, ICML 15, pp. 1133 1141. JMLR.org, 2015. Chakrabarti, D., Kumar, R., Radlinski, F., and Upfal, E. Mortal multi-armed bandits. In Proceedings of the 21st International Conference on Neural Information Processing Systems, NIPS 08, pp. 273 280, Red Hook, NY, USA, 2008. Curran Associates Inc. Chen, Y., Lee, C.-W., Luo, H., and Wei, C.-Y. A new algorithm for non-stationary contextual bandits: Efficient, optimal and parameter-free. In Conference on Learning Theory, pp. 696 726. PMLR, 2019. Cheung, W. C., Simchi-Levi, D., and Zhu, R. Learning to optimize under non-stationarity. In The 22nd International Conference on Artificial Intelligence and Statistics, pp. 1079 1087. PMLR, 2019. Cheung, W. C., Simchi-Levi, D., and Zhu, R. Reinforcement learning for non-stationary markov decision processes: The blessing of (more) optimism. In International Conference on Machine Learning, pp. 1843 1854. PMLR, 2020. Garivier, A. and Moulines, E. On upper-confidence bound policies for switching bandit problems. In Kivinen, J., Szepesv ari, C., Ukkonen, E., and Zeugmann, T. (eds.), Algorithmic Learning Theory, pp. 174 188, Berlin, Heidelberg, 2011. Springer Berlin Heidelberg. ISBN 978-3642-24412-4. Rotting Infinitely Many-armed Bandits Heidari, H., Kearns, M., and Roth, A. Tight policy regret bounds for improving and decaying bandits. In Proceedings of the Twenty-Fifth International Joint Conference on Artificial Intelligence, IJCAI 16, pp. 1562 1570, 2016. Komiyama, J. and Qin, T. Time-decaying bandits for nonstationary systems. 
In Liu, T.-Y., Qi, Q., and Ye, Y. (eds.), Web and Internet Economics, pp. 460 466, Cham, 2014. Springer International Publishing. Lai, T. and Robbins, H. Asymptotically efficient adaptive allocation rules. Adv. Appl. Math., 6(1):4 22, mar 1985. Levine, N., Crammer, K., and Mannor, S. Rotting bandits. In Proceedings of the 31st International Conference on Neural Information Processing Systems, NIPS 17, pp. 3077 3086, Red Hook, NY, USA, 2017. Curran Associates Inc. ISBN 9781510860964. Rigollet, P. and H utter, J.-C. High dimensional statistics. Lecture notes for course 18S997, 813:814, 2015. Russac, Y., Vernade, C., and Capp e, O. Weighted linear bandits for non-stationary environments. ar Xiv preprint ar Xiv:1909.09146, 2019. Seznec, J., Locatelli, A., Carpentier, A., Lazaric, A., and Valko, M. Rotting bandits are no harder than stochastic ones. In Chaudhuri, K. and Sugiyama, M. (eds.), Proceedings of the Twenty-Second International Conference on Artificial Intelligence and Statistics, volume 89 of Proceedings of Machine Learning Research, pp. 2564 2572. PMLR, 16 18 Apr 2019. Seznec, J., Menard, P., Lazaric, A., and Valko, M. A single algorithm for both restless and rested rotting bandits. In Chiappa, S. and Calandra, R. (eds.), Proceedings of the Twenty Third International Conference on Artificial Intelligence and Statistics, volume 108 of Proceedings of Machine Learning Research, pp. 3784 3794. PMLR, 26 28 Aug 2020. Slivkins, A. Introduction to multi-armed bandits. Foundations and Trends in Machine Learning, 12(1-2):1 286, 2019. Slivkins, A. and Upfal, E. Adapting to a changing environment: the brownian restless bandits. In 21st Annual Conference on Learning Theory - COLT 2008, Helsinki, Finland, July 9-12, 2008, pp. 343 354, 2008. Trac a, S., Rudin, C., and Yan, W. Reducing exploration of dying arms in mortal bandits. In Adams, R. P. and Gogate, V. (eds.), Proceedings of The 35th Uncertainty in Artificial Intelligence Conference, volume 115 of Proceedings of Machine Learning Research, pp. 156 163. PMLR, 22 25 Jul 2020. Tsun, A. Probability & statistics with applications to computing, 2020. URL https://www.alextsun.com/ files/Prob_Stat_for_CS_Book.pdf. Wang, Y., Audibert, J.-y., and Munos, R. Algorithms for infinitely many-armed bandits. In Koller, D., Schuurmans, D., Bengio, Y., and Bottou, L. (eds.), Advances in Neural Information Processing Systems, volume 21. Curran Associates, Inc., 2009. Welford, B. Note on a method for calculating corrected sums of squares and products. Technometrics, 4(3):419 420, 1962. Zhao, P., Zhang, L., Jiang, Y., and Zhou, Z.-H. A simple approach for non-stationary linear bandits. In International Conference on Artificial Intelligence and Statistics, pp. 746 755. PMLR, 2020. Rotting Infinitely Many-armed Bandits A. Appendix A.1. Preliminaries Here we state some known concentration inequalities that we use in our proofs. Lemma A.1 (Theorem 6.2.35 in Tsun (2020)). Let X1, . . . , Xn be identical independent Bernoulli random variables. Then, for 0 < ν < 1, we have i=1 Xi (1 + ν)E exp ν2E[Pn i=1 Xi] 3 Lemma A.2 (Corollary 1.7 in Rigollet & H utter (2015)). Let X1, . . . , Xn be independent random variables with σ-sub Gaussian distributions. Then, for any a = (a1, . . . , an) Rn and t 0, we have i=1 ai Xi > t i=1 ai Xi < t A.2. Proof of Theorem 3.1 For showing the lower bound, we will classify arms to be either bad or good depending on the value of their initial mean rewards, in a way that will be defined shortly. 
We will analyze the number of bad arm pulls which is the main contributor to the regret. For the analysis, we count the number of bad arm pulls until the first or m-th good arm is pulled over a fixed time horizon in order to preserve i.i.d. property of mean rewards in the proof. To distinguish good from bad arms, we utilize two thresholds in the proof as follows. Let 0 < δ < c < 1 be such that 1 > 1 δ > 1 c > 0 for δ, which will be set small, and some constant c. Then, 1 δ and 1 c represent threshold values for distinguishing good arms and bad arms, respectively. In A, let a1, a2, . . . , be a sequence of arms. Given a policy π, without loss of generality let π select arms following the sequence of a1, a2, . . . ,. Case of small rotting: When ϱ = O(1/T 3/2), the lower bound of order T for the stationary case, from Theorem 3 in Wang et al. (2009), is tight enough for the non-stationary case. This is because we only need to pay the extra regret of at most of order T for small ϱ. From Theorem 3 in Wang et al. (2009), we have E[Rπ(T)] = Ω( We note that even though the mean rewards are rotting in our setting, Theorem 3 in Wang et al. (2009) can be applied to our setting without any proof changes providing a tight regret bound for the near-stationary case. For the sake of completeness, we provide the proof of the theorem. Let K1 denote the number of bad arms a that satisfy µ1(a) 1 c before selecting the first good arm, which satisfies µ1(a) > 1 δ, in the sequence of arms a1, a2, . . . . Let µ be the initial mean reward of the best arm among the selected arms by π over time horizon T. Then for some κ > 0, we have Rπ(T) = Rπ(T)1(µ 1 δ) + Rπ(T)1(µ > 1 δ) Tδ1(µ 1 δ) + K1c1(µ > 1 δ) Tδ1(µ 1 δ) + κc1(µ > 1 δ, K1 κ). (13) By taking expectations on the both sides in (13) and setting κ = Tδ/c, we have E[Rπ(T)] TδP(µ 1 δ) + κc(P(µ > 1 δ) P(K1 < κ)) = cκP(K1 κ). Let δ = δ/(1 c + δ). Then we observe that K1 follows a geometric distribution with success probability P(µ1(a) > 1 δ)/p(µ1(a) / (1 c, 1 δ)) = δ , in which the success probability is the probability of selecting a good arm given that the arm is either a good or bad arm. Then by setting δ = 1/ T/c, for some constant C > 0 we have E[Rπ(T)] cκ(1 δ )κ = Ω Rotting Infinitely Many-armed Bandits where the last equality is obtained from log x 1 1/x for all x > 0. Case of large rotting: When ϱ = ω(1/T 3/2), however, the lower bound of the stationary case is not tight enough. Here we provide the proof of the lower bound Tϱ1/3 for the case of ϱ = ω(1/T 3/2). Let Km denote the number of bad arms a that satisfy µ1(a) 1 c before selecting m-th good arm, which satisfies µ1(a) > 1 δ, in the sequence of arms a1, a2, . . . . Let NT be the number of selected good arms a such that µ1(a) > 1 δ until T. We can decompose Rπ(T) into two parts as follows: Rπ(T) = Rπ(T)1(NT < m) + Rπ(T)1(NT m). (14) We set m = (1/2)Tϱ2/3 and δ = ϱ1/3with ϱ = o(1). For getting a lower bound for the first term in (14), Rπ(T)1(NT < m), we use the fact that the minimal regret is obtained from the situation where there are m 1 arms whose mean rewards are 1. Then we can think of the optimal policy that selects the best m 1 arms until their mean rewards become below the threshold 1 δ (step 1) and then selects the best arm at each time for the remaining time steps (step 2). The required number of pulling each arm for the best m 1 arms until the mean reward becomes below 1 δ is upper bounded by δ/ϱ + 1. 
Therefore, the regret from step 2 is R = Ω((T mδ/ϱ)δ) = Ω(Tϱ1/3) in which the optimal policy pulls arms which mean rewards are below 1 δ for the remaining time after step 1. Therefore, we have Rπ(T)1(NT < m) = Ω(R1(NT < m)) = Ω(Tϱ1/31(NT < m)). (15) For getting a lower bound of the second term in (14), Rπ(T)1(NT m), we use the minimum number of selected arms a that satisfy µ1(a) 1 c. When NT m and Km κ, the policy selects at least κ number of distinct arms a satisfying µ1(a) 1 c until T. Therefore, we have Rπ(T)1(NT m) cκ1(NT m, Km κ). (16) Let δ = δ/(1 c + δ). By setting κ = m/δ m m/δ , with ϱ = o(1), we have κ = Θ(Tϱ1/3). (17) Then from (15), (16), and (17), we have E[Rπ(T)] = Ω(Tϱ1/3P(NT < m) + Tϱ1/3P(NT m, Km κ)) Ω(Tϱ1/3P(Km κ)). (18) Next we provide a lower bound for P(Km κ). Observe that Km follows a negative binomial distribution with m successes and the success probability P(µ1(a) > 1 δ)/P(µ1(a) / (1 c, 1 δ)) = δ/(1 c + δ) = δ , in which the success probability is the probability of selecting a good arm given that the arm is either a good or bad arm. In the following lemma, we provide a concentration inequality for Km. Lemma A.3. For any 1/2 + δ /m < α < 1, P(Km αm(1/δ ) m) 1 exp( (1/3)(1 1/α)2(αm δ )). (19) Proof. Let Xi for i > 0 be i.i.d. Bernoulli random variables with success probability δ . From Section 2 in Brown (2011), we have Rotting Infinitely Many-armed Bandits From (20) and Lemma A.1, for any 1/2 + δ /m < α < 1 we have δ m = P Km αm 1 exp (1 1/α)2 exp (1 1/α)2 3 (αm δ ) , in which the first inequality comes from Lemma A.1, which concludes the proof. From Lemma A.3 with α = 1 1/ m and large enough T, we have P(Km κ) 1 exp 3(m m δ ) 1 m 1 6(m m) 1 m 1 1 exp( 1/6). (21) Therefore, from (18) and (21), we have E[Rπ(T)] = Ω(Tϱ1/3). (22) Finally, from (12) and (22) we conclude that for any policy π, we have E[Rπ(T)] = Ω(max{Tϱ1/3, A.3. Proof of Theorem 4.1 Observe that initial mean rewards of selected arms are i.i.d. with a uniform distribution which should simplify analysis of the expected regret. However, by fixing the number of selected arms by a policy over the time horizon T, the mean rewards of arms become dependent. To deal with this dependence, we analyze the regret by controlling the number of distinct selected arms instead of fixing the time horizon. We explain this in more details in the following proofs. We set δ = max{ϱ1/3, 1/ T}. Then we define an arm a to be a good arm if (a) δ/2, and, otherwise, a is a bad arm. In A, let a1, a2, . . . , be a sequence of arms, which have i.i.d. mean rewards with uniform distribution on [0, 1]. Given a policy selecting arms in the sequence order, let m G be the number of selections of distinct good arms and m B i be the number of consecutive selections of distinct bad arms between the i 1-st and i-th selection of a good arm among m G good arms. We refer to the period starting from selecting the i 1-st good arm before selecting the i-th good arm as the i-th episode. Observe that m B 1 , . . . , m B m G are i.i.d. random variables with geometric distribution with parameter δ/2, given a fixed value of m G. Therefore, for non-negative integer k we have P(m B i = k) = (1 δ/2)kδ/2, for i = 1, . . . , m G. Define m to be the number of episodes from the policy π over the horizon T, m G to be the total number of selections of a good arm by the policy π over the horizon T such that m G = m or m G = m 1, and m B i to be the number of selections of a bad arm in the i-th episode by the policy π over the horizon T. 
Without loss of generality, we assume that the policy selects arms in the sequence of a1, a2, . . . , . Let AT be the set of selected arms over the horizon of T time steps, which satisfies |AT | T. Let ˆµt(a) = Pt 1 s=1 rs1(as = a) nt(a) and µt(a) = Pt 1 s=1 µs(a)1(as = a) Rotting Infinitely Many-armed Bandits We define the event E1 = {|ˆµt(a) µt(a)| p 2 log(T 4)/nt(a) for all t [T], a AT } to guarantee that the estimators of initial mean reward are well estimated. From Lemma A.2, by following similar steps of the proof of Lemma 5 in Auer et al. (2019), we have P(Ec 1) 2/T 2. Recall that Rπ(T) = PT t=1 1 µt(aπ t ). It is true that Rπ(T) = o(T 2) because the maximum mean reward gap from rotting is bounded by 1 + Tϱ = o(T). Given that E1 does not hold, the regret is E[Rπ(T)|Ec 1]P(Ec 1) = o(1), which is negligible comparing with the regret when E1 holds true which we show later. Therefore, in the rest of the proof we assume that E1 holds true. Under a policy π, let RG i be the regret (summation of mean reward gaps) contributed by pulling the good arm in the i-th episode and RB i,j be the regret contributed by pulling the j-th bad arm in the i-th episode. Then let Rπ m G = Pm G i=1(RG i + P j [m B i ] RB i,j),2 which is the regret over the period of m G episodes. For obtaining a regret bound, we first focus on finding a required number of episodes, m G, such that Rπ(T) Rπ m G. Then we provide regret bounds for each bad arm and good arm in an episode. Lastly, we obtain a regret bound for E[Rπ(T)] using the episodic regret bound. For i [ m G], j [ m B i ], let n G i be the number of pulls of the good arm in the i-th episode and n B i,j be the number of pulls of the j-th bad arm in the i-th episode by the policy π over the horizon T. Let a be the last selected arm over time horizon T by π. We denote by ˆm G and ˆm B i for i [ m] the number of arms excluding a in the selected m G number of good arms and m B i number of bad arms for i [ m] as follows: ( m G 1 if a is a good arm m G otherwise , ˆm B i = m B i for i [ m 1], and ˆm B m = ( m B m if a is a good arm m B m 1 otherwise. Those notations are defined only if they exist.3 Excluding the last arm a which the policy π may stop to pull suddenly by reaching the horizon T, we provide lower bounds of the number of pulling a good arm, n G i for i [ ˆm G] in the following lemma if they exist. Lemma A.4. Under E1, given ˆm G, for any i [ ˆm G] we have n G i δ/(2ϱ). ˆµo t(a) = Pt 1 s=1(rs + ϱsns(a))1(as = a) which satisfies µo t(a) ˆµo t(a) from ϱs ϱ for all s. Under E1, we can easily show that for all t [T] and a AT , |ˆµo t(a) µ1(a)| p 8 log(T)/nt(a). Let a(i) be a selected good arm in the i-th episode. Suppose that nt(a(i)) = n G i = δ/(2ϱ) for some t > 0, then we have µo t(a(i)) ϱ n G i + q 8 log(T)/ n G i ˆµo t(a(i)) ϱ n G i + q 8 log(T)/ n G i µ1(a(i)) δ/2 1 δ, where the second inequality is obtained from E1 and n G i δ/(2ϱ), and the third inequality is from µ1(a(i)) 1 δ/2. Therefore, policy π must pull arm a more times than δ/(2ϱ) , which implies n G i δ/(2ϱ). 2Note that Rπ m B does not contain undefined RB i,j such that RB i,j when m B i = 0. 3 n G i , n B i,j, and ˆm B i are not defined for i [0] or j [0]. Rotting Infinitely Many-armed Bandits We first consider the case when ϱ = ω(1/T 3/2). We have δ = ϱ1/3. For getting Rπ m G, here we define the policy π after time T such that it pulls δ/(2ϱ) amount for a good arm and 0 for a bad arm. We note that defining how π works after T is only for the proof to get a regret bound over time horizon T. 
For the last arm a over the horizon T, it pulls the arm up to δ/(2ϱ) amounts if a is a good arm and n G m G < δ/(2ϱ). For i [m G], j [m B i ] let n G i and n B i,j be the number of pulling the good arm in i-th episode and j-th bad arm in i-th episode from the policy, respectively. Here we define n G i s and n B i,j s as follows: If a is a good arm and n G m G < δ/(2ϱ), ( n G i for i [ m G 1] δ/(2ϱ) for i [m G]/[ m G 1] , n B i,j = ( n B i,j for i [ m G], j [ m B i ] 0 for i [m G]/[ m G], j [m B i ]/[ m B i ]. ( n G i for i [ m G] δ/(2ϱ) for i [m G]/[ m G] , n B i,j = ( n B i,j for i [ m G], j [ m B i ] 0 for i [m G]/[ m G 1], j [m B i ]/[ m B i ]. We note that n G i s and n B i,j s are defined only if they exist.4 Then we provide m G such that Rπ(T) Rπ m G in the following lemma. Lemma A.5. Under E1, when m G = 2Tϱ2/3 we have Rπ(T) Rπ m G. Proof. From Lemma A.4, with δ = ϱ1/3 we have j [m B i ] n B i,j which implies that Rπ(T) Rπ m G. From the result of Lemma A.5, we set m G = 2Tϱ2/3 . For getting a bound for E[Rπ m G], we provide bounds for E[RG i ] and E[RB i,j] in the following lemma. Lemma A.6. Under E1 and policy π, for any i [m G], j [m B i ], we have E[RG i ] = O δ E[RB i,j] = O 1 + δ ϱ1/3 + δ2 Proof. First we provide a bound for RG i using an upper bound of n G i . Recall that a(i) is the selected arm for the i-th good arm. We have µo t(a(i)) ϱnt(a(i)) = ˆµt(a(i)) + ϱ Pt 1 s=1 ns(a(i))1(as = a(i)) nt(a(i)) ϱnt(a(i)) ˆµt(a(i)) + ϱnt(a(i)) + 1 2 ϱnt(a(i)) = ˆµt(a(i)) ϱnt(a(i)) 1 4n G i and n B i,j are not defined for i [0] or j [0]. Rotting Infinitely Many-armed Bandits Then since, under E1, ˆµt(a(i)) (ϱ/2)(nt(a(i)) 1) µt(a(i)) + p 8 log(T)/nt(a(i)) (ϱ/2)(nt(a(i)) 1) 8 log(T)/nt(a(i)) (ϱ/2)(nt(a(i)) 1), for i [ m G], from the policy π, we need to get n such that 2(n 1) + 2 p 8 log(T)/n < 1 δ, (23) in which n + 1 is an upper bound for n G i . Let n1 = 2(δ + ϱ1/3)/ϱ + 1 and n2 = C log(T)/ϱ2/3 with some large enough constant C > 0. Then n = n1 + n2 satisfies (23) because 1 ϱn1 + 2 p 8 log(T)/n2 < 1 δ. Therefore, for all i [ m G] we have n G i = O((δ + ϱ1/3)/ϱ). Then with the fact that n G i = δ/(2ϱ) for i [m G]/[ m G] if they exist, for any i [m G] we have n G i = O((δ + ϱ1/3)/ϱ). Then for any i [m G] we have E[RG i ] E (a(i))n G i + n G i (n G i 1) 2 ϱ ϱ x + δ + ϱ1/3 where the first equality is obtained from the fact that (a(i)) are i.i.d. random variables with uniform distribution on [0, δ/2] and n G i = O((δ + ϱ1/3)/ϱ). Now we provide an upper bound of n B i,j to get a bound of RB i,j for i [m G], j [m B i ]. Let a(i, j) be a selected arm for j-th bad arm in the i-th episode. When δ/2 < (a(i, j)) δ + ϱ1/3, as in the case of the good arm, n = 2(δ + ϱ1/3)/ϱ + 1 + C1 log(T)/ϱ2/3 for some large enough constant C1 > 0, satisfies (23) so that n B i,j = O((δ + ϱ1/3)/ϱ) for all i [ m], j [ m B i ]. Since under E1 ˆµt(a(i, j)) (ϱ/2)(nt(a(i, j)) 1) µt(a(i, j)) + p 8 log(T)/nt(a(i, j)) (ϱ/2)(nt(a(i, j)) 1) µ1(a(i, j)) + p 8 log(T)/nt(a(i, j)) (ϱ/2)(nt(a(i, j)) 1), when δ + ϱ1/3 < (a(i, j)) 1, from the policy π under E1, we need to get n 1 such that µ1(a(i, j)) ϱ 2(n 1) + 2 p 8 log(T)/n < 1 δ, (24) in which n + 1 is an upper bound of n B i,j for i [ m], j [ m B i ]. From a sufficient condition for (24) to hold such that µ1(a(i, j)) + 2 p 8 log(T)/n < 1 δ, we can find that n = C2 log(T)/( (a(i, j)) δ)2 for some large constant C2 > 0 satisfies (24). Therefore, when δ + ϱ1/3 < (a(i, j)) 1, for all i [ m], j [ m B i ] we have n B i,j = O(1/( (a(i, j)) δ)2). 
Then, with the fact that $n^B_{i,j} = 0$ for $i\in[m^G]\setminus[\tilde m^G]$, $j\in[m^B_i]\setminus[\tilde m^B_i]$ if they exist, for any $i\in[m^G]$ and $j\in[m^B_i]$ we have
$$n^B_{i,j} = \begin{cases} O\big((\delta+\varrho^{1/3})/\varrho\big) & \text{if } \delta/2 < \Delta(a(i,j)) \le \delta + \varrho^{1/3}\\ O\big(1/(\Delta(a(i,j)) - \delta)^2\big) & \text{if } \delta + \varrho^{1/3} < \Delta(a(i,j)) \le 1.\end{cases}$$
Then for any $i\in[m^G]$, $j\in[m^B_i]$, we have
$$E[R^B_{i,j}] \le E\left[\Delta(a(i,j))\,n^B_{i,j} + \frac{n^B_{i,j}(n^B_{i,j}-1)}{2}\,\varrho\right] = O\left(\int_{\delta/2}^{\delta+\varrho^{1/3}}\frac{\delta+\varrho^{1/3}}{\varrho}\left(x + \frac{\delta+\varrho^{1/3}}{2}\right)dx + \int_{\delta+\varrho^{1/3}}^{1}\left(\frac{x}{(x-\delta)^2} + \frac{\varrho}{(x-\delta)^4}\right)dx\right) = O\left(1 + \frac{\delta}{\varrho^{1/3}} + \frac{\delta^2}{\varrho^{2/3}}\right).$$

Recall that $R^\pi_{m^G} = \sum_{i=1}^{m^G}\big(R^G_i + \sum_{j\in[m^B_i]} R^B_{i,j}\big)$. With $\delta = \varrho^{1/3}$ and $m^G = \lceil 2T\varrho^{2/3}\rceil$, from Lemmas A.5 and A.6, and the fact that the $m^B_i$'s are i.i.d. random variables with geometric distribution with $E[m^B_i] = 2/\delta - 1$, we have
$$E[R^\pi(T)] = O\big(E[R^\pi_{m^G}]\big) = O\left(\sum_{i=1}^{m^G}\Big(E[R^G_i] + E[m^B_i]\,E[R^B_{i,j}]\Big)\right) = O\left(T\varrho^{2/3}\left(\frac{\delta}{\varrho^{2/3}} + \frac{1}{\delta}\right)\right) = O\big(T\varrho^{1/3}\big). \qquad (25)$$

Now we consider the case when $\varrho = O(1/T^{3/2})$. We have $\delta = \Theta(1/\sqrt{T})$. With a slight abuse of notation, we use $\pi$ for a modified strategy after $T$. For obtaining $R^\pi_{m^G}$, we define the policy $\pi$ after time $T$ such that it pulls a good arm $T$ times and a bad arm once. For the last arm $a'$ over the horizon $T$, it pulls the arm up to $T$ times if $a'$ is a good arm and $\tilde n^G_{\tilde m^G} < T$. With a slight abuse of notation, for $i\in[m^G]$, $j\in[m^B_i]$ let $n^G_i$ and $n^B_{i,j}$ be the numbers of pulls of the good arm in the $i$-th episode and of the $j$-th bad arm in the $i$-th episode by this policy, respectively. We define the $n^G_i$'s and $n^B_{i,j}$'s as follows. If $a'$ is a good arm and $\tilde n^G_{\tilde m^G} < T$, then
$$n^G_i = \begin{cases}\tilde n^G_i & \text{for } i\in[\tilde m^G - 1]\\ T & \text{for } i\in[m^G]\setminus[\tilde m^G - 1],\end{cases}\qquad n^B_{i,j} = \begin{cases}\tilde n^B_{i,j} & \text{for } i\in[\tilde m^G],\ j\in[\tilde m^B_i]\\ 0 & \text{for } i\in[m^G]\setminus[\tilde m^G],\ j\in[m^B_i]\setminus[\tilde m^B_i];\end{cases}$$
otherwise,
$$n^G_i = \begin{cases}\tilde n^G_i & \text{for } i\in[\tilde m^G]\\ T & \text{for } i\in[m^G]\setminus[\tilde m^G],\end{cases}\qquad n^B_{i,j} = \begin{cases}\tilde n^B_{i,j} & \text{for } i\in[\tilde m^G],\ j\in[\tilde m^B_i]\\ 0 & \text{for } i\in[m^G]\setminus[\tilde m^G - 1],\ j\in[m^B_i]\setminus[\tilde m^B_i].\end{cases}$$
From Lemma A.4, under $E_1$ we can find that $n^G_i \ge \min\{\lceil\delta/(2\varrho)\rceil, T\}$ for $i\in[m^G]$. Then, if $m^G = C_3$ with some large enough constant $C_3 > 0$, with $\delta = \Theta(1/\sqrt{T})$ and $\varrho = O(1/T^{3/2})$ we have
$$\sum_{i\in[m^G]} n^G_i \ \ge\ C_3\,\min\{\delta/(2\varrho),\, T\} \ >\ T,$$
which implies $R^\pi(T) \le R^\pi_{m^G}$. Therefore, we set $m^G = C_3$. To obtain a bound for $E[R^\pi_{m^G}]$, we provide bounds for $E[R^G_i]$ and $E[R^B_{i,j}]$ in the following lemma.

Lemma A.7. Under $E_1$ and policy $\pi$, for any $i\in[m^G]$ and $j\in[m^B_i]$, we have
$$E[R^G_i] = O\big(\delta T + T^2\varrho\big), \qquad E[R^B_{i,j}] = O\big(T\delta^2 + T^2\varrho\,\delta + 1\big).$$

Proof. First we provide a bound for $R^G_i$ using an upper bound on $n^G_i$. With the definition $n^G_i = T$ for $i\in[m^G]\setminus[\tilde m^G]$, for any $i\in[m^G]$ we have $n^G_i \le T$. Recall that $a(i)$ is the selected arm for the $i$-th good arm. Then for any $i\in[m^G]$ we have
$$E[R^G_i] \le E\left[\Delta(a(i))\,n^G_i + \frac{n^G_i(n^G_i-1)}{2}\,\varrho\right] = O\big(\delta T + T^2\varrho\big),$$
where the first equality is obtained from the fact that the $\Delta(a(i))$'s are i.i.d. random variables with uniform distribution on $[0,\delta/2]$ and $n^G_i \le T$.

Now we provide an upper bound on $n^B_{i,j}$ to get a bound on $R^B_{i,j}$ for $i\in[m^G]$, $j\in[m^B_i]$. Let $a(i,j)$ be the selected arm for the $j$-th bad arm in the $i$-th episode. When $\delta/2 < \Delta(a(i,j)) \le \delta + 1/\sqrt{T}$, as in the case of the good arm, $n^B_{i,j} \le T$ for all $i\in[\tilde m]$, $j\in[\tilde m^B_i]$. When $\delta + 1/\sqrt{T} < \Delta(a(i,j)) \le 1$, since under $E_1$
$$\mu^o_t(a(i,j)) - \varrho n_t(a(i,j)) = \hat\mu_t(a(i,j)) + \varrho\,\frac{\sum_{s=1}^{t-1} n_s(a(i,j))\mathbb{1}(a_s = a(i,j))}{n_t(a(i,j))} - \varrho n_t(a(i,j)) \le \hat\mu_t(a(i,j)) + \varrho\,\frac{n_t(a(i,j))+1}{2} - \varrho n_t(a(i,j)) = \hat\mu_t(a(i,j)) - \varrho\,\frac{n_t(a(i,j))-1}{2}$$
$$\le \bar\mu_t(a(i,j)) + \sqrt{8\log(T)/n_t(a(i,j))} - \frac{\varrho}{2}\big(n_t(a(i,j))-1\big) \le \mu_1(a(i,j)) + \sqrt{8\log(T)/n_t(a(i,j))} - \frac{\varrho}{2}\big(n_t(a(i,j))-1\big),$$
from the policy $\pi$ we need to find an $n$ such that
$$\mu_1(a(i,j)) - \frac{\varrho}{2}(n-1) + 2\sqrt{8\log(T)/n} < 1 - \delta, \qquad (26)$$
in which $n+1$ is an upper bound of $\tilde n^B_{i,j}$ for $i\in[\tilde m]$, $j\in[\tilde m^B_i]$.
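The sufficient condition used for (24) and (26) gives an explicit pull budget for a bad arm: rearranging $\mu_1(a) + 2\sqrt{8\log(T)/n} < 1-\delta$ with $\Delta(a) = 1 - \mu_1(a)$ yields $n > 32\log(T)/(\Delta(a)-\delta)^2$, so $32$ is one admissible value of the constant $C_2$ (resp. $C_4$). A small sketch of this computation (my own, for illustration only):

```python
import math

def bad_arm_pull_budget(T, delta, gap):
    """Smallest integer n with  (1 - gap) + 2*sqrt(8*log(T)/n) < 1 - delta,
    i.e. the sufficient condition used for (24)/(26); requires gap > delta.
    Rearranging gives n > 32*log(T) / (gap - delta)^2."""
    assert gap > delta, "the bound only applies to arms with Delta(a) > delta"
    return math.floor(32.0 * math.log(T) / (gap - delta) ** 2) + 1

T = 10 ** 5
delta = 1.0 / math.sqrt(T)          # the regime rho = O(1/T^(3/2))
for gap in (0.05, 0.1, 0.3, 0.9):   # gap = Delta(a) = 1 - mu_1(a) of a bad arm
    print(f"gap {gap:.2f}: pull budget n roughly {bad_arm_pull_budget(T, delta, gap)}")
```

As expected, arms whose gap barely exceeds the threshold margin are pulled many times before their index drops below $1-\delta$, while clearly bad arms are discarded after a few pulls.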
From a sufficient condition for (26) to hold, namely $\mu_1(a(i,j)) + 2\sqrt{8\log(T)/n} < 1 - \delta$, we can find that $n = \lceil C_4\log(T)/(\Delta(a(i,j)) - \delta)^2\rceil$ for some large enough constant $C_4 > 0$ satisfies (26). Therefore, when $\delta + 1/\sqrt{T} < \Delta(a(i,j)) \le 1$, for all $i\in[\tilde m]$, $j\in[\tilde m^B_i]$ we have $\tilde n^B_{i,j} = O(1/(\Delta(a(i,j)) - \delta)^2)$. Then, with the fact that $n^B_{i,j} = 0$ for $i\in[m^G]\setminus[\tilde m^G]$, $j\in[m^B_i]\setminus[\tilde m^B_i]$, for any $i\in[m^G]$, $j\in[m^B_i]$: if $\delta/2 < \Delta(a(i,j)) \le \delta + 1/\sqrt{T}$ we have $n^B_{i,j} \le T$, and if $\delta + 1/\sqrt{T} < \Delta(a(i,j)) \le 1$ we have $n^B_{i,j} = O(1/(\Delta(a(i,j)) - \delta)^2)$. Then for any $i\in[m^G]$, $j\in[m^B_i]$, we have
$$E[R^B_{i,j}] \le E\left[\Delta(a(i,j))\,n^B_{i,j} + \frac{n^B_{i,j}(n^B_{i,j}-1)}{2}\,\varrho\right] = O\left(\int_{\delta/2}^{\delta+1/\sqrt{T}}\big(Tx + T^2\varrho\big)dx + \int_{\delta+1/\sqrt{T}}^{1}\left(\frac{x}{(x-\delta)^2} + \frac{\varrho}{(x-\delta)^4}\right)dx\right) = O\big(T\delta^2 + T^2\varrho\,\delta + 1\big).$$
Then, with $\delta = \Theta(1/\sqrt{T})$ and $m^G = C_3$, we have
$$E[R^\pi(T)] = O\big(E[R^\pi_{m^G}]\big) = O\left(E\left[\sum_{i\in[m^G]}\Big(R^G_i + \sum_{j\in[m^B_i]} R^B_{i,j}\Big)\right]\right) = O\left(\delta T + T^2\varrho + \frac{1}{\delta}\right) = O\big(\sqrt{T}\big), \qquad (27)$$
where the third equality is obtained from Lemma A.7 and $E[m^B_i] = 2/\delta - 1$. Finally, we can conclude the proof: from (25) and (27), for $\varrho = o(1)$, with $\delta = \max\{\varrho^{1/3}, 1/\sqrt{T}\}$ we have $E[R^\pi(T)] = O(\max\{T\varrho^{1/3}, \sqrt{T}\})$.

A.4. Proof of Theorem 4.2

Let $\pi_i(\beta)$ for $\beta\in\mathcal{B}$ denote the base policy for time steps between $(i-1)H+1$ and $iH\wedge T$ in Algorithm 2, using $\mathrm{UCB}_{i,t}(a,\beta)$ as a UCB index and $1 - \beta^{1/3}$ as a threshold. Denote by $a^{\pi_i(\beta)}_t$ the arm pulled at time step $t$ by policy $\pi_i(\beta)$. Then, for $\beta^*\in\mathcal{B}$, which is set later for a near-optimal base policy, we have
$$E[R^\pi(T)] = E\left[\sum_{i=1}^{\lceil T/H\rceil}\sum_{t=(i-1)H+1}^{iH\wedge T}\big(1 - \mu_t(a^\pi_t)\big)\right] = E[R^\pi_1(T)] + E[R^\pi_2(T)], \qquad (28)$$
where
$$R^\pi_1(T) = \sum_{i=1}^{\lceil T/H\rceil}\sum_{t=(i-1)H+1}^{iH\wedge T}\big(1 - \mu_t(a^{\pi_i(\beta^*)}_t)\big) \quad\text{and}\quad R^\pi_2(T) = \sum_{i=1}^{\lceil T/H\rceil}\sum_{t=(i-1)H+1}^{iH\wedge T}\big(\mu_t(a^{\pi_i(\beta^*)}_t) - \mu_t(a^\pi_t)\big).$$
Note that $R^\pi_1(T)$ accounts for the regret caused by the near-optimal base policies $\pi_i(\beta^*)$ against the optimal mean reward, and $R^\pi_2(T)$ accounts for the regret caused by the master algorithm for selecting a base with $\beta\in\mathcal{B}$ at every block against the base with $\beta^*$. In what follows, we provide upper bounds for each regret component. We first provide an upper bound for $E[R^\pi_1(T)]$ by following the proof steps of Theorem 4.1. Then we provide an upper bound for $E[R^\pi_2(T)]$. We set $H = \lceil\sqrt{T}\rceil$ and $\beta^*$ to be the smallest value in $\mathcal{B}$ which is larger than $\max\{\varrho, 1/H^{3/2}\}$.

Upper bounding $E[R^\pi_1(T)]$. We refer to the period starting from time step $(i-1)H+1$ to time step $iH\wedge T$ as the $i$-th block. For any $i \le \lceil T/H\rceil - 1$, policy $\pi_i(\beta^*)$ runs over $H$ time steps independently of the other blocks, so that each such block has the same expected regret, and the last block has an expected regret smaller than or equal to that of the other blocks. Therefore, we focus on finding a bound on the regret of the first block, equal to $\sum_{t=1}^{H}\big(1 - \mu_t(a^{\pi_1(\beta^*)}_t)\big)$. Denote by $\mathcal{A}(i)$ the set of selected arms in the $i$-th block, which satisfies $|\mathcal{A}(i)| \le H$. For notational simplicity, we use $n_t(a)$ instead of $n_{1,t}(a)$ and $\mu^o_t(a)$ instead of $\mu^o_{1,t}(a)$. Let
$$\hat\mu_t(a) = \frac{\sum_{s=1}^{t-1} r_s \mathbb{1}(a_s = a)}{n_t(a)} \quad\text{and}\quad \bar\mu_t(a) = \frac{\sum_{s=1}^{t-1} \mu_s(a)\mathbb{1}(a_s = a)}{n_t(a)}.$$
Define the event $E_1 = \{|\hat\mu_t(a) - \bar\mu_t(a)| \le \sqrt{8\log(H)/n_t(a)} \text{ for all } t\in[H],\ a\in\mathcal{A}(1)\}$. From Lemma A.2, by following similar steps as in the proof of Lemma 5 in Auer et al. (2019), we have $P(E_1) \ge 1 - 2/H^2$. We assume that $E_1$ holds true in what follows; otherwise, the regret of the first block is negligible since $R^{\pi_1(\beta^*)}(H) = o(H^2)$.

The proof follows similar steps as the proof of Theorem 4.1. From $\pi_1(\beta^*)$ we have $\delta = \beta^{*1/3}$. We define an arm $a$ to be a good arm if $\Delta(a) \le \delta/2$ and, otherwise, a bad arm. In $\mathcal{A}$, let $a_1, a_2, \ldots$ be a sequence of arms, which have i.i.d. initial mean rewards with uniform distribution on $[0, 1]$.
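The decomposition (28) mirrors the two-level structure of Algorithm 2 described here: an EXP3 master chooses a rotting-rate guess $\beta\in\mathcal B$ for each block of length $H$, and the chosen base policy runs inside the block. Below is a schematic sketch of that outer loop only, with the base policy stubbed out; the grid $\mathcal B$, the EXP3 learning rate, and the reward rescaling constant are assumptions for illustration and need not match Algorithm 2's exact choices.

```python
import math
import random

def exp3_master_over_blocks(T, run_base_policy, B=None, seed=0):
    """Schematic outer loop: split the horizon into ceil(T/H) blocks with H = ceil(sqrt(T)),
    keep EXP3 weights over candidate rotting rates beta in B, sample one beta per block,
    run the base policy for that block, and update the chosen beta with the rescaled reward.
    run_base_policy(beta, H) must return the total reward collected in the block."""
    random.seed(seed)
    H = math.ceil(math.sqrt(T))
    n_blocks = math.ceil(T / H)
    if B is None:
        # assumed geometric grid of rate guesses; Algorithm 2's exact grid may differ
        B = [2.0 ** (-k) for k in range(3, 3 + math.ceil(math.log2(T)))]
    K = len(B)
    eta = math.sqrt(2.0 * math.log(K) / (n_blocks * K))   # illustrative learning rate
    scale = 2.0 * H                                        # assumed rescaling of block rewards
    weights = [1.0] * K
    for _ in range(n_blocks):
        total = sum(weights)
        probs = [w / total for w in weights]
        k = random.choices(range(K), weights=probs)[0]     # master samples a base policy
        block_reward = run_base_policy(B[k], H)
        g = min(1.0, max(0.0, 0.5 + block_reward / scale)) # translate/rescale into [0, 1]
        weights[k] *= math.exp(eta * g / probs[k])         # importance-weighted update
        m = max(weights)                                   # normalize to avoid overflow
        weights = [w / m for w in weights]
    return weights, B

def toy_base_policy(beta, H):
    """Purely synthetic stand-in for the base policy: its mean block reward is
    highest around beta = 2^-6, so the master has something to learn."""
    quality = 0.7 - 0.1 * abs(math.log2(beta) + 6)
    return H * quality + random.gauss(0.0, math.sqrt(H))

weights, B = exp3_master_over_blocks(T=10_000, run_base_policy=toy_base_policy)
best = max(range(len(B)), key=lambda i: weights[i])
print("beta with the largest EXP3 weight:", B[best])
```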
Given a policy selecting arms in the order of the sequence, let $m^G$ be the number of selections of distinct good arms and $m^B_i$ be the number of consecutive selections of distinct bad arms between the $(i-1)$-st and the $i$-th selection of a good arm among the $m^G$ good arms. We refer to the period that starts right after the $(i-1)$-st selection of a good arm and ends with the $i$-th selection of a good arm as the $i$-th episode. Observe that $m^B_1, \ldots, m^B_{m^G}$ are i.i.d. random variables with geometric distribution with parameter $\delta/2$, conditional on the value of $m^G$. Therefore, $P(m^B_i = k) = (1-\delta/2)^k\,\delta/2$ for $i = 1, \ldots, m^G$. Define $\tilde m$ to be the number of episodes obtained by following policy $\pi_1(\beta^*)$ over the horizon of $H$ time steps, $\tilde m^G$ to be the total number of selections of a good arm, which satisfies $\tilde m^G = \tilde m$ or $\tilde m^G = \tilde m - 1$, and $\tilde m^B_i$ to be the number of selections of a bad arm in the $i$-th episode by the policy $\pi_1(\beta^*)$ over the horizon $H$. Without loss of generality, we assume that the policy selects arms in the order of the sequence $a_1, a_2, \ldots$.

Under policy $\pi_1(\beta^*)$, let $R^G_i$ be the regret (summation of mean reward gaps) contributed by pulling the good arm in the $i$-th episode and $R^B_{i,j}$ be the regret contributed by pulling the $j$-th bad arm in the $i$-th episode. Then let $R^{\pi_1(\beta^*)}_{m^G} = \sum_{i=1}^{m^G}\big(R^G_i + \sum_{j\in[m^B_i]} R^B_{i,j}\big)$, which is the regret over the period of $m^G$ episodes. (Note that $R^{\pi_1(\beta^*)}_{m^G}$ does not contain the undefined terms $R^B_{i,j}$ when $m^B_i = 0$.) For $i\in[\tilde m^G]$, $j\in[\tilde m^B_i]$, let $\tilde n^G_i$ be the number of pulls of the good arm in the $i$-th episode and $\tilde n^B_{i,j}$ be the number of pulls of the $j$-th bad arm in the $i$-th episode by the policy $\pi_1(\beta^*)$ over the horizon $H$. Let $a'$ be the last selected arm over the time horizon $H$ by $\pi_1(\beta^*)$. We denote by $\hat m^G$ and $\hat m^B_i$ for $i\in[\tilde m]$ the numbers of good and bad arms excluding $a'$ among the $\tilde m^G$ selected good arms and the $\tilde m^B_i$ selected bad arms, defined as follows:
$$\hat m^G = \begin{cases}\tilde m^G - 1 & \text{if } a' \text{ is a good arm}\\ \tilde m^G & \text{otherwise,}\end{cases}\qquad \hat m^B_i = \tilde m^B_i \ \text{for } i\in[\tilde m - 1], \quad\text{and}\quad \hat m^B_{\tilde m} = \begin{cases}\tilde m^B_{\tilde m} & \text{if } a' \text{ is a good arm}\\ \tilde m^B_{\tilde m} - 1 & \text{otherwise.}\end{cases}$$
These notations are defined only if they exist ($\tilde n^G_i$, $\tilde n^B_{i,j}$, and $\hat m^B_i$ are not defined for $i\in[0]$ or $j\in[0]$). Excluding the last arm $a'$, which the policy $\pi_1(\beta^*)$ may stop pulling abruptly upon reaching the horizon $H$, we provide a lower bound on the number of pulls of each good arm, $\tilde n^G_i$ for $i\in[\hat m^G]$, in the following lemma, whenever these quantities exist.

Lemma A.8. Under $E_1$, given $\hat m^G$, for any $i\in[\hat m^G]$ we have $\tilde n^G_i \ge \delta/(2\beta^*)$.

Proof. Define
$$\hat\mu^o_t(a) = \frac{\sum_{s=1}^{t-1}\big(r_s + \varrho_s n_s(a)\big)\mathbb{1}(a_s = a)}{n_t(a)},$$
which satisfies $\mu^o_t(a,\beta^*) \ge \hat\mu^o_t(a)$ since $\beta^* \ge \varrho$. Under $E_1$, it is true that for all $t\in[H]$ and $a\in\mathcal{A}(1)$, $|\hat\mu^o_t(a) - \mu_1(a)| \le \sqrt{8\log(H)/n_t(a)}$. Let $a(i)$ be the selected good arm in the $i$-th episode. Suppose that $n_t(a(i)) = \tilde n^G_i = \lceil\delta/(2\beta^*)\rceil$ for some $t > 0$; then we have
$$\mu^o_t(a(i),\beta^*) - \beta^*\tilde n^G_i + \sqrt{8\log(H)/\tilde n^G_i} \ \ge\ \hat\mu^o_t(a(i)) - \beta^*\tilde n^G_i + \sqrt{8\log(H)/\tilde n^G_i} \ \ge\ \mu_1(a(i)) - \delta/2 \ \ge\ 1 - \delta,$$
where the second inequality is obtained from $E_1$ and $\tilde n^G_i \le \lceil\delta/(2\beta^*)\rceil$, and the third inequality is from $\mu_1(a(i)) \ge 1 - \delta/2$. Therefore, policy $\pi_1(\beta^*)$ must pull arm $a(i)$ more than $\lceil\delta/(2\beta^*)\rceil$ times, which implies $\tilde n^G_i \ge \delta/(2\beta^*)$.

We first consider the case when $\varrho = \omega(1/H^{3/2})$. Then $\beta^*$ is the smallest value in $\mathcal{B}$ which exceeds $\varrho$, so that $\varrho \le \beta^* \le 2\varrho$. For obtaining $R^{\pi_1(\beta^*)}_{m^G}$, we define how the policy $\pi_1(\beta^*)$ works after time $H$ such that it pulls a good arm $\lceil\delta/(2\beta^*)\rceil$ times and a bad arm $0$ times. We note that defining how $\pi_1(\beta^*)$ works after $H$ is only for the proof to obtain a regret bound over the time horizon $H$. For the last arm $a'$ over the horizon $H$, it pulls the arm up to $\lceil\delta/(2\beta^*)\rceil$ times if $a'$ is a good arm and $\tilde n^G_{\tilde m^G} < \lceil\delta/(2\beta^*)\rceil$.
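Since a freshly sampled arm is good with probability $\delta/2$ (its gap $\Delta(a)$ is uniform on $[0,1]$), the number of bad arms drawn before the next good arm is geometric with parameter $\delta/2$, so $E[m^B_i] = 2/\delta - 1$, as used repeatedly below. A minimal Monte Carlo check of this identity (illustrative only, not the authors' code):

```python
import random

def mean_bad_arms_per_episode(delta, n_episodes=200_000, seed=1):
    """Count consecutive bad draws (gap > delta/2) before the first good draw (gap <= delta/2),
    with gaps i.i.d. uniform on [0, 1]; the count is Geometric(delta/2) with mean 2/delta - 1."""
    random.seed(seed)
    total_bad = 0
    for _ in range(n_episodes):
        while random.random() > delta / 2.0:   # keep drawing gaps until a good arm appears
            total_bad += 1
    return total_bad / n_episodes

delta = 0.1
print("empirical E[m_i^B]:", round(mean_bad_arms_per_episode(delta), 2),
      "| theoretical 2/delta - 1 =", 2.0 / delta - 1.0)
```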
For $i\in[m^G]$ and $j\in[m^B_i]$, let $n^G_i$ and $n^B_{i,j}$ be the numbers of pulls of the good arm in the $i$-th episode and of the $j$-th bad arm in the $i$-th episode by this policy, respectively. We define the $n^G_i$'s and $n^B_{i,j}$'s as follows. If $a'$ is a good arm and $\tilde n^G_{\tilde m^G} < \lceil\delta/(2\beta^*)\rceil$, then
$$n^G_i = \begin{cases}\tilde n^G_i & \text{for } i\in[\tilde m^G - 1]\\ \lceil\delta/(2\beta^*)\rceil & \text{for } i\in[m^G]\setminus[\tilde m^G - 1],\end{cases}\qquad n^B_{i,j} = \begin{cases}\tilde n^B_{i,j} & \text{for } i\in[\tilde m^G],\ j\in[\tilde m^B_i]\\ 0 & \text{for } i\in[m^G]\setminus[\tilde m^G],\ j\in[m^B_i]\setminus[\tilde m^B_i];\end{cases}$$
otherwise,
$$n^G_i = \begin{cases}\tilde n^G_i & \text{for } i\in[\tilde m^G]\\ \lceil\delta/(2\beta^*)\rceil & \text{for } i\in[m^G]\setminus[\tilde m^G],\end{cases}\qquad n^B_{i,j} = \begin{cases}\tilde n^B_{i,j} & \text{for } i\in[\tilde m^G],\ j\in[\tilde m^B_i]\\ 0 & \text{for } i\in[m^G]\setminus[\tilde m^G - 1],\ j\in[m^B_i]\setminus[\tilde m^B_i].\end{cases}$$
We note that the $n^G_i$'s and $n^B_{i,j}$'s are defined only if they exist (they are not defined for $i\in[0]$ or $j\in[0]$). We provide an $m^G$ such that $R^{\pi_1(\beta^*)}(H) \le R^{\pi_1(\beta^*)}_{m^G}$ in the following lemma.

Lemma A.9. Under $E_1$, when $m^G = \lceil 2H\beta^{*2/3}\rceil$, we have $R^{\pi_1(\beta^*)}(H) \le R^{\pi_1(\beta^*)}_{m^G}$.

Proof. From Lemma A.8, with $\delta = \beta^{*1/3}$ we have
$$\sum_{i\in[m^G]}\Big(n^G_i + \sum_{j\in[m^B_i]} n^B_{i,j}\Big) \ \ge\ \sum_{i\in[m^G]} n^G_i \ \ge\ m^G\,\frac{\delta}{2\beta^*} \ =\ \lceil 2H\beta^{*2/3}\rceil\,\frac{\beta^{*1/3}}{2\beta^*} \ \ge\ H,$$
which implies that $R^{\pi_1(\beta^*)}(H) \le R^{\pi_1(\beta^*)}_{m^G}$.

From the result of Lemma A.9, we set $m^G = \lceil 2H\beta^{*2/3}\rceil$. To obtain a bound for $E[R^{\pi_1(\beta^*)}_{m^G}]$, we provide bounds for $E[R^G_i]$ and $E[R^B_{i,j}]$ in the following lemma.

Lemma A.10. Under $E_1$ and policy $\pi_1(\beta^*)$, for any $i\in[m^G]$ and $j\in[m^B_i]$, we have
$$E[R^G_i] = O\left(\frac{(\delta+\varrho^{1/3})^2}{\varrho}\right), \qquad E[R^B_{i,j}] = O\left(1 + \frac{\delta}{\varrho^{1/3}} + \frac{\delta^2}{\varrho^{2/3}}\right).$$

Proof. First we provide a bound for $R^G_i$ using an upper bound on $n^G_i$. Recall that $a(i)$ is the selected arm for the $i$-th good arm. We have
$$\mu^o_t(a(i),\beta^*) - \beta^* n_t(a(i)) = \hat\mu_t(a(i)) + \beta^*\,\frac{\sum_{s=1}^{t-1} n_s(a(i))\mathbb{1}(a_s = a(i))}{n_t(a(i))} - \beta^* n_t(a(i)) \le \hat\mu_t(a(i)) + \beta^*\,\frac{n_t(a(i))+1}{2} - \beta^* n_t(a(i)) = \hat\mu_t(a(i)) - \beta^*\,\frac{n_t(a(i))-1}{2}.$$
Since, under $E_1$,
$$\hat\mu_t(a(i)) - \frac{\beta^*}{2}\big(n_t(a(i))-1\big) \le \bar\mu_t(a(i)) + \sqrt{8\log(H)/n_t(a(i))} - \frac{\beta^*}{2}\big(n_t(a(i))-1\big) \le 1 + \sqrt{8\log(H)/n_t(a(i))} - \frac{\beta^*}{2}\big(n_t(a(i))-1\big),$$
for $i\in[\tilde m^G]$, from the policy $\pi_1(\beta^*)$ we need to find an $n$ such that
$$1 - \frac{\beta^*}{2}(n-1) + 2\sqrt{8\log(H)/n} < 1 - \delta, \qquad (29)$$
in which $n+1$ is an upper bound for $\tilde n^G_i$. Let $n_1 = \lceil 2(\delta + \beta^{*1/3})/\beta^*\rceil + 1$ and $n_2 = \lceil C\log(H)/\beta^{*2/3}\rceil$ with some large enough constant $C > 0$. Then $n = n_1 + n_2$ satisfies (29) because $1 - \frac{\beta^*}{2} n_1 + 2\sqrt{8\log(H)/n_2} < 1 - \delta$. Therefore, for all $i\in[\tilde m^G]$ we have $\tilde n^G_i = O((\delta + \beta^{*1/3})/\beta^*)$. Then, with the fact that $n^G_i = \lceil\delta/(2\beta^*)\rceil$ for $i\in[m^G]\setminus[\tilde m^G]$ if they exist and $\beta^* = \Theta(\varrho)$, for any $i\in[m^G]$ we have $n^G_i = O((\delta + \beta^{*1/3})/\beta^*) = O((\delta + \varrho^{1/3})/\varrho)$. Then for any $i\in[m^G]$ we have
$$E[R^G_i] \le E\left[\Delta(a(i))\,n^G_i + \frac{n^G_i(n^G_i-1)}{2}\,\varrho\right] = O\left(\int_0^{\delta/2}\frac{2}{\delta}\cdot\frac{\delta+\varrho^{1/3}}{\varrho}\left(x + \frac{\delta+\varrho^{1/3}}{2}\right)dx\right) = O\left(\frac{(\delta+\varrho^{1/3})^2}{\varrho}\right),$$
where the first equality is obtained from the fact that the $\Delta(a(i))$'s are i.i.d. random variables with uniform distribution on $[0,\delta/2]$ and $n^G_i = O((\delta+\varrho^{1/3})/\varrho)$.

Now we provide an upper bound on $n^B_{i,j}$ to get a bound on $R^B_{i,j}$ for $i\in[m^G]$, $j\in[m^B_i]$. Let $a(i,j)$ be the selected arm for the $j$-th bad arm in the $i$-th episode. When $\delta/2 < \Delta(a(i,j)) \le \delta + \varrho^{1/3}$, as in the case of the good arm, $n = \lceil 2(\delta + \beta^{*1/3})/\beta^*\rceil + 1 + \lceil C_1\log(T)/\beta^{*2/3}\rceil$ for some large enough constant $C_1 > 0$ satisfies (29), so that $\tilde n^B_{i,j} = O((\delta + \varrho^{1/3})/\varrho)$ for all $i\in[\tilde m]$, $j\in[\tilde m^B_i]$. When $\delta + \varrho^{1/3} < \Delta(a(i,j)) \le 1$, since under $E_1$
$$\mu^o_t(a(i,j),\beta^*) - \beta^* n_t(a(i,j)) = \hat\mu_t(a(i,j)) + \beta^*\,\frac{\sum_{s=1}^{t-1} n_s(a(i,j))\mathbb{1}(a_s = a(i,j))}{n_t(a(i,j))} - \beta^* n_t(a(i,j)) \le \hat\mu_t(a(i,j)) + \beta^*\,\frac{n_t(a(i,j))+1}{2} - \beta^* n_t(a(i,j)) = \hat\mu_t(a(i,j)) - \beta^*\,\frac{n_t(a(i,j))-1}{2}$$
$$\le \bar\mu_t(a(i,j)) + \sqrt{8\log(T)/n_t(a(i,j))} - \frac{\beta^*}{2}\big(n_t(a(i,j))-1\big) \le \mu_1(a(i,j)) + \sqrt{8\log(T)/n_t(a(i,j))} - \frac{\beta^*}{2}\big(n_t(a(i,j))-1\big),$$
from the policy $\pi_1(\beta^*)$ we need to find an $n \ge 1$ such that
$$\mu_1(a(i,j)) - \frac{\beta^*}{2}(n-1) + 2\sqrt{8\log(T)/n} < 1 - \delta, \qquad (30)$$
in which $n+1$ is an upper bound of $\tilde n^B_{i,j}$ for $i\in[\tilde m]$, $j\in[\tilde m^B_i]$.
From a sufficient condition for (30) to hold, namely $\mu_1(a(i,j)) + 2\sqrt{8\log(T)/n} < 1 - \delta$, we can find that $n = \lceil C_2\log(T)/(\Delta(a(i,j)) - \delta)^2\rceil$ for some large enough constant $C_2 > 0$ satisfies (30). Therefore, when $\delta + \varrho^{1/3} < \Delta(a(i,j)) \le 1$, for all $i\in[\tilde m]$, $j\in[\tilde m^B_i]$ we have $\tilde n^B_{i,j} = O(1/(\Delta(a(i,j)) - \delta)^2)$. Then, with the fact that $n^B_{i,j} = 0$ for $i\in[m^G]\setminus[\tilde m^G]$, $j\in[m^B_i]\setminus[\tilde m^B_i]$ if they exist, for any $i\in[m^G]$ and $j\in[m^B_i]$ we have
$$n^B_{i,j} = \begin{cases} O\big((\delta+\varrho^{1/3})/\varrho\big) & \text{if } \delta/2 < \Delta(a(i,j)) \le \delta + \varrho^{1/3}\\ O\big(1/(\Delta(a(i,j)) - \delta)^2\big) & \text{if } \delta + \varrho^{1/3} < \Delta(a(i,j)) \le 1.\end{cases}$$
Then for any $i\in[m^G]$, $j\in[m^B_i]$, we have
$$E[R^B_{i,j}] \le E\left[\Delta(a(i,j))\,n^B_{i,j} + \frac{n^B_{i,j}(n^B_{i,j}-1)}{2}\,\varrho\right] = O\left(\int_{\delta/2}^{\delta+\varrho^{1/3}}\frac{\delta+\varrho^{1/3}}{\varrho}\left(x + \frac{\delta+\varrho^{1/3}}{2}\right)dx + \int_{\delta+\varrho^{1/3}}^{1}\left(\frac{x}{(x-\delta)^2} + \frac{\varrho}{(x-\delta)^4}\right)dx\right) = O\left(1 + \frac{\delta}{\varrho^{1/3}} + \frac{\delta^2}{\varrho^{2/3}}\right).$$

Recall that $R^{\pi_1(\beta^*)}_{m^G} = \sum_{i=1}^{m^G}\big(R^G_i + \sum_{j\in[m^B_i]} R^B_{i,j}\big)$. With $\beta^* = \Theta(\varrho)$, $\delta = \Theta(\varrho^{1/3})$, and $m^G = \lceil 2H\beta^{*2/3}\rceil$, from Lemmas A.9 and A.10, and the fact that the $m^B_i$'s are i.i.d. random variables with geometric distribution with $E[m^B_i] = 2/\delta - 1$, we have
$$E[R^{\pi_1(\beta^*)}(H)] = O\big(E[R^{\pi_1(\beta^*)}_{m^G}]\big) = O\left(\sum_{i=1}^{m^G}\Big(E[R^G_i] + E[m^B_i]\,E[R^B_{i,j}]\Big)\right) = O\left(H\varrho^{2/3}\left(\frac{\delta}{\varrho^{2/3}} + \frac{1}{\delta}\right)\right) = O\big(H\varrho^{1/3}\big). \qquad (31)$$

Now we consider the case when $\varrho = O(1/H^{3/2})$. We have set $\beta^*$ as the smallest element in $\mathcal{B}$ that exceeds $\max\{\varrho, 1/H^{3/2}\}$; hence we have $\beta^* = \Theta(1/H^{3/2})$. For obtaining $R^{\pi_1(\beta^*)}_{m^G}$, we define how the policy $\pi_1(\beta^*)$ works after $H$ time steps such that it pulls a good arm $H$ times and a bad arm once. For the last arm $a'$ over the horizon $H$, it pulls the arm up to $H$ times if $a'$ is a good arm and $\tilde n^G_{\tilde m^G} < H$. With a slight abuse of notation, for $i\in[m^G]$ and $j\in[m^B_i]$, let $n^G_i$ and $n^B_{i,j}$ be the numbers of pulls of the good arm in the $i$-th episode and of the $j$-th bad arm in the $i$-th episode by this policy, respectively. We define the $n^G_i$'s and $n^B_{i,j}$'s as follows. If $a'$ is a good arm and $\tilde n^G_{\tilde m^G} < H$, then
$$n^G_i = \begin{cases}\tilde n^G_i & \text{for } i\in[\tilde m^G - 1]\\ H & \text{for } i\in[m^G]\setminus[\tilde m^G - 1],\end{cases}\qquad n^B_{i,j} = \begin{cases}\tilde n^B_{i,j} & \text{for } i\in[\tilde m^G],\ j\in[\tilde m^B_i]\\ 0 & \text{for } i\in[m^G]\setminus[\tilde m^G],\ j\in[m^B_i]\setminus[\tilde m^B_i];\end{cases}$$
otherwise,
$$n^G_i = \begin{cases}\tilde n^G_i & \text{for } i\in[\tilde m^G]\\ H & \text{for } i\in[m^G]\setminus[\tilde m^G],\end{cases}\qquad n^B_{i,j} = \begin{cases}\tilde n^B_{i,j} & \text{for } i\in[\tilde m^G],\ j\in[\tilde m^B_i]\\ 0 & \text{for } i\in[m^G]\setminus[\tilde m^G - 1],\ j\in[m^B_i]\setminus[\tilde m^B_i].\end{cases}$$
From Lemma A.8, we observe that $n^G_i \ge \min\{\lceil\delta/(2\beta^*)\rceil, H\}$ for $i\in[m^G]$. Then, under $E_1$, if $m^G = C_3$ for some large enough constant $C_3 > 0$, with $\delta = \beta^{*1/3}$ and $\beta^* = \Theta(1/H^{3/2})$ we have
$$\sum_{i\in[m^G]} n^G_i \ \ge\ H,$$
which implies $R^{\pi_1(\beta^*)}(H) \le R^{\pi_1(\beta^*)}_{m^G}$. Therefore, we set $m^G = C_3$. To obtain a bound for $E[R^{\pi_1(\beta^*)}_{m^G}]$, we provide bounds for $E[R^G_i]$ and $E[R^B_{i,j}]$ in the following lemma.

Lemma A.11. Under $E_1$ and policy $\pi_1(\beta^*)$, for any $i\in[m^G]$ and $j\in[m^B_i]$, we have
$$E[R^G_i] = O\big(\delta H + H^2\varrho\big), \qquad E[R^B_{i,j}] = O\big(H\delta^2 + H^2\varrho\,\delta + 1\big).$$

Proof. First we provide a bound for $R^G_i$ using an upper bound on $n^G_i$. With the definition $n^G_i = H$ for $i\in[m^G]\setminus[\tilde m^G]$, for any $i\in[m^G]$ we have $n^G_i \le H$. Recall that $a(i)$ is the selected arm for the $i$-th good arm. Then, for any $i\in[m^G]$, we have
$$E[R^G_i] \le E\left[\Delta(a(i))\,n^G_i + \frac{n^G_i(n^G_i-1)}{2}\,\varrho\right] = O\big(\delta H + H^2\varrho\big),$$
where the first equality is obtained from the fact that the $\Delta(a(i))$'s are i.i.d. random variables with uniform distribution on $[0,\delta/2]$ and $n^G_i \le H$.

Now we provide an upper bound on $n^B_{i,j}$ to get a bound on $R^B_{i,j}$ for $i\in[m^G]$, $j\in[m^B_i]$. Let $a(i,j)$ be the selected arm for the $j$-th bad arm in the $i$-th episode. When $\delta/2 < \Delta(a(i,j)) \le \delta + 1/\sqrt{H}$, as in the case of the good arm, $n^B_{i,j} \le H$ for all $i\in[\tilde m]$, $j\in[\tilde m^B_i]$. When $\delta + 1/\sqrt{H} < \Delta(a(i,j)) \le 1$, from the policy $\pi_1(\beta^*)$ under $E_1$, we need to find an $n \ge 1$ such that
$$\mu_1(a(i,j)) - \frac{\beta^*}{2}(n-1) + 2\sqrt{8\log(T)/n} < 1 - \delta, \qquad (32)$$
in which $n+1$ is an upper bound of $\tilde n^B_{i,j}$ for $i\in[\tilde m]$, $j\in[\tilde m^B_i]$.
From a sufficient condition for (32) to hold, namely $\mu_1(a(i,j)) + 2\sqrt{8\log(H)/n} < 1 - \delta$, we can find that $n = \lceil C_4\log(H)/(\Delta(a(i,j)) - \delta)^2\rceil$ for some large enough constant $C_4 > 0$ satisfies (32). Therefore, when $\delta + 1/\sqrt{H} < \Delta(a(i,j)) \le 1$, for all $i\in[\tilde m]$, $j\in[\tilde m^B_i]$ we have $\tilde n^B_{i,j} = O(1/(\Delta(a(i,j)) - \delta)^2)$. Then, with the fact that $n^B_{i,j} = 0$ for $i\in[m^G]\setminus[\tilde m^G]$, $j\in[m^B_i]\setminus[\tilde m^B_i]$, for any $i\in[m^G]$, $j\in[m^B_i]$: if $\delta/2 < \Delta(a(i,j)) \le \delta + 1/\sqrt{H}$ we have $n^B_{i,j} \le H$, and if $\delta + 1/\sqrt{H} < \Delta(a(i,j)) \le 1$ we have $n^B_{i,j} = O(1/(\Delta(a(i,j)) - \delta)^2)$. For any $i\in[m^G]$, $j\in[m^B_i]$, we obtain
$$E[R^B_{i,j}] \le E\left[\Delta(a(i,j))\,n^B_{i,j} + \frac{n^B_{i,j}(n^B_{i,j}-1)}{2}\,\varrho\right] = O\left(\int_{\delta/2}^{\delta+1/\sqrt{H}}\big(Hx + H^2\varrho\big)dx + \int_{\delta+1/\sqrt{H}}^{1}\left(\frac{x}{(x-\delta)^2} + \frac{\varrho}{(x-\delta)^4}\right)dx\right) = O\big(H\delta^2 + H^2\varrho\,\delta + 1\big).$$
It follows that, with $\delta = \Theta(1/\sqrt{H})$ and $m^G = C_3$, we have
$$E[R^{\pi_1(\beta^*)}(H)] = O\big(E[R^{\pi_1(\beta^*)}_{m^G}]\big) = O\left(E\left[\sum_{i\in[m^G]}\Big(R^G_i + \sum_{j\in[m^B_i]} R^B_{i,j}\Big)\right]\right) = O\left(\delta H + H^2\varrho + \frac{1}{\delta}\right) = O\big(\sqrt{H}\big), \qquad (33)$$
where the third equality is obtained from Lemma A.11 and $E[m^B_i] = 2/\delta - 1$. Finally, from (31) and (33), for $\varrho = o(1)$, we have $E[R^{\pi_1(\beta^*)}(H)] = O(\max\{H\varrho^{1/3}, \sqrt{H}\})$. Therefore, by summing the regrets of the $\lceil T/H\rceil$ blocks, we have shown that
$$E[R^\pi_1(T)] = O\big((T/H)\max\{H\varrho^{1/3}, \sqrt{H}\}\big) = O\big(\max\{T\varrho^{1/3}, T/\sqrt{H}\}\big). \qquad (34)$$

Upper bounding $E[R^\pi_2(T)]$. We observe that EXP3 is run for $\lceil T/H\rceil$ decision rounds and the number of policies (i.e. $\pi_i(\beta)$ for $\beta\in\mathcal{B}$) is $|\mathcal{B}|$. Denote the maximum absolute sum of rewards over any block of length $H$ by a random variable $Q$. We first provide a bound for $Q$ using concentration inequalities. For any block $i$, we have
$$\left|\sum_{t=(i-1)H+1}^{iH\wedge T} r_t\right| = \left|\sum_{t=(i-1)H+1}^{iH\wedge T}\big(\mu_t(a^\pi_t) + \eta_t\big)\right| \le \left|\sum_{t=(i-1)H+1}^{iH\wedge T}\mu_t(a^\pi_t)\right| + \left|\sum_{t=(i-1)H+1}^{iH\wedge T}\eta_t\right|. \qquad (35)$$
Denote by $\mathcal{T}_i$ the set of time steps in the $i$-th block. We define the event $E_2(i) = \{|\hat\mu_t(a) - \bar\mu_t(a)| \le \sqrt{7\log(T)/n_t(a)} \text{ for all } t\in\mathcal{T}_i,\ a\in\mathcal{A}(i)\}$ and $E_2 = \bigcap_{i\in[\lceil T/H\rceil]} E_2(i)$. From Lemma A.2, by following similar steps as in the proof of Lemma 5 in Auer et al. (2019), with $H = \lceil\sqrt{T}\rceil$ we have $P(E_2^c) \le 2/T^2$. Assuming that $E_2$ holds true, we can get a lower bound for $\mu_t(a^\pi_t)$, which may be negative because of rotting, in order to get an upper bound for $|\sum_{t=(i-1)H+1}^{iH\wedge T}\mu_t(a^\pi_t)|$. Let $\beta_{\max}$ denote the maximum value in $\mathcal{B}$. Consider the policy $\pi$ over the horizon $T$ and an arm $a$ with $\mu_1(a) = 0$. Since
$$\mu^o_t(a) - \varrho n_t(a) + \sqrt{\frac{8\log(H)}{n_t(a)}} \le \hat\mu_t(a) - \frac{\varrho}{2}\big(n_t(a)-1\big) + \sqrt{\frac{4\log(T)}{n_t(a)}} \le \bar\mu_t(a) - \frac{\varrho}{2}\big(n_t(a)-1\big) + \sqrt{\frac{4\log(T)}{n_t(a)}} + \sqrt{\frac{7\log(T)}{n_t(a)}}$$
$$\le \mu_1(a) - \frac{\varrho}{2}\big(n_t(a)-1\big) + \sqrt{\frac{4\log(T)}{n_t(a)}} + \sqrt{\frac{7\log(T)}{n_t(a)}} \le \sqrt{\frac{4\log(T)}{n_t(a)}} + \sqrt{\frac{7\log(T)}{n_t(a)}},$$
we need to find a positive integer $n$ such that $\sqrt{4\log(T)/n} + \sqrt{7\log(T)/n} \le 1 - \beta_{\max}^{1/3}$, in which $n$ is an upper bound for the number of pulls of arm $a$. Since $\beta_{\max}^{1/3} = 1/2$, we can observe that the condition holds with $n = \lceil 92\log(T)\rceil$. From this fact, we can find that for any selected arm $a$ under $\pi$, we have $\mu_t(a) \ge -(\lceil 92\log(T)\rceil + 1)\varrho$. Then, under $E_2$, with $\mu_t(a) \le 1$, for any $i\in[\lceil T/H\rceil]$ we have $|\sum_{t=(i-1)H+1}^{iH\wedge T}\mu_t(a^\pi_t)| \le \max\{(\lceil 92\log(T)\rceil + 1)\varrho H,\, H\}$.

Next we provide a bound for $|\sum_{t=(i-1)H+1}^{iH\wedge T}\eta_t|$. We define the event $E_3(i) = \{|\sum_{t=(i-1)H+1}^{iH\wedge T}\eta_t| \le 2\sqrt{H\log(T)}\}$ and $E_3 = \bigcap_{i\in[\lceil T/H\rceil]} E_3(i)$. From Lemma A.2, for any $i\in[\lceil T/H\rceil]$, we have $P(E_3(i)^c) \le 2/T^2$. Then, under $E_2\cap E_3$, with (35), we have
$$Q \le \max\{93H\log(T)\varrho,\, H\} + 2\sqrt{H\log(T)} \le 93H\log(T) + 2\sqrt{H\log(T)},$$
which implies $1/2 + \sum_{t=(i-1)H+1}^{iH\wedge T} r_t\big/\big(186H\log T + 4\sqrt{H\log T}\big) \in [0,1]$. With the rescaling and translation of rewards in Algorithm 2, from Corollary 3.2 in Auer et al. (2002b), we have
$$E[R^\pi_2(T)\mid E_2\cap E_3] = O\left(\big(93H\log T + 2\sqrt{H\log T}\big)\sqrt{(T/H)\,|\mathcal{B}|\log|\mathcal{B}|}\right). \qquad (36)$$
Note that the expected regret from EXP3 is trivially bounded by $o(H^2(T/H)) = o(TH)$ and $|\mathcal{B}| = O(\log(T))$.
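The translation and rescaling applied to block rewards before they are fed to EXP3 can be written as a one-line map; under $E_2\cap E_3$ the bound on $Q$ above guarantees that the result lies in $[0,1]$. A small sketch using the constants appearing in the proof (illustrative, not the paper's code):

```python
import math

def rescale_block_reward(block_reward_sum, H, T):
    """Map a block's total reward into [0, 1] as in the proof:
    x -> 1/2 + x / (186*H*log(T) + 4*sqrt(H*log(T))).
    Under E2 and E3 the denominator dominates |block_reward_sum|, so the image is in [0, 1]."""
    denom = 186.0 * H * math.log(T) + 4.0 * math.sqrt(H * math.log(T))
    return 0.5 + block_reward_sum / denom

T = 10 ** 6
H = math.ceil(math.sqrt(T))
print(rescale_block_reward(+H, H, T))   # a block earning total reward ~H maps just above 1/2
print(rescale_block_reward(-H, H, T))   # a strongly negative block maps just below 1/2
```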
Then, with (36), we have
$$E[R^\pi_2(T)] = E[R^\pi_2(T)\mid E_2\cap E_3]\,P(E_2\cap E_3) + E[R^\pi_2(T)\mid E_2^c\cup E_3^c]\,P(E_2^c\cup E_3^c) = O\left(93\sqrt{HT}\log(T)\sqrt{|\mathcal{B}|\log|\mathcal{B}|}\right) + o(TH)\cdot(4/T^2) = \tilde O\big(\sqrt{HT}\big). \qquad (37)$$
Finally, from (28), (34), and (37), with $H = \lceil\sqrt{T}\rceil$ we have
$$E[R^\pi(T)] = O\big(\max\{T^{3/4},\, \varrho^{1/3}T\}\big),$$
which concludes the proof.
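As a closing illustration (my own summary of the stated bounds, with poly-logarithmic factors dropped), the sketch below evaluates the known-$\varrho$ rate $\max\{\varrho^{1/3}T, \sqrt{T}\}$ of Theorem 4.1 and the adaptive rate $\max\{\varrho^{1/3}T, T^{3/4}\}$ of Theorem 4.2, and flags the regime $\varrho \ge T^{-3/4}$ in which the two coincide.

```python
def regret_rates(T, rho):
    """Order-level regret rates from the two theorems (polylog factors dropped):
    known rho:   max{rho^(1/3) * T, sqrt(T)}
    unknown rho: max{rho^(1/3) * T, T^(3/4)}."""
    known = max(rho ** (1.0 / 3.0) * T, T ** 0.5)
    adaptive = max(rho ** (1.0 / 3.0) * T, T ** 0.75)
    return known, adaptive

T = 10 ** 6
for rho in (1e-9, 1e-6, 1e-3):
    known, adaptive = regret_rates(T, rho)
    regime = "rates match" if rho >= T ** (-0.75) else "adaptivity costs extra"
    print(f"rho={rho:.0e}: known-rho ~{known:.2e}, adaptive ~{adaptive:.2e} ({regime})")
```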