# Rotting Infinitely Many-Armed Bandits

Jung-hun Kim¹, Milan Vojnović², Se-Young Yun¹

¹Korea Advanced Institute of Science and Technology (KAIST), Daejeon, South Korea. ²London School of Economics, London, UK. Correspondence to: Se-Young Yun, Milan Vojnović.

Proceedings of the 39th International Conference on Machine Learning, Baltimore, Maryland, USA, PMLR 162, 2022. Copyright 2022 by the author(s).

Abstract

We consider the infinitely many-armed bandit problem with rotting rewards, where the mean reward of an arm decreases at each pull of the arm according to an arbitrary trend with maximum rotting rate ϱ = o(1). We show that this learning problem has an Ω(max{ϱ^{1/3}T, √T}) worst-case regret lower bound, where T is the time horizon. We show that a matching upper bound O(max{ϱ^{1/3}T, √T}), up to a poly-logarithmic factor, can be achieved by an algorithm that uses a UCB index for each arm and a threshold value to decide whether to continue pulling an arm or remove the arm from further consideration, when the algorithm knows the value of the maximum rotting rate ϱ. We also show that an O(max{ϱ^{1/3}T, T^{3/4}}) regret upper bound can be achieved by an algorithm that does not know the value of ϱ, by using an adaptive UCB index along with an adaptive threshold value.

1. Introduction

We consider a fundamental sequential learning problem in which an agent must play one option at a time from an infinite set of options with non-stationary reward distributions, where the mean reward of an option decreases at each play of this option. This is naturally studied as the infinitely many-armed bandit problem with rotting rewards. The assumption of infinitely many arms models practical situations where the number of arms is finite but large relative to the number of available experiments. There is an abundance of applications in which one must choose from a large set of options with rotting rewards, e.g. online advertising, where arms correspond to ads and rewards decrease over exposures of an ad to a user; content recommendation systems, where arms correspond to media items and rotting arises because of user boredom when watching the same content; and clinical trials, where the efficacy of a medicine may decrease because of drug tolerance when a patient takes the same medicine several times.

While there has been a lot of work on multi-armed bandits with a finite number of arms with stationary or non-stationary rewards, and an infinite number of arms with stationary rewards, not much seems to be known for the case of infinitely many arms with non-stationary rewards. In this paper we take first steps towards understanding the fundamental limits of sequential learning for an infinite number of arms whose mean rewards decrease with the number of pulls, the case commonly referred to as rested rotting bandits. Our focus is on rotting trends where the mean reward of an arm decreases arbitrarily by at most a fixed amount ϱ at each pull of this arm. The initial mean rewards of arms are assumed to be independent and identically distributed according to the uniform distribution on [0, 1]. The objective is to find a policy that minimizes the expected cumulative regret over a time horizon of T time steps with respect to playing the best arm. We show that the worst-case regret for this problem is lower bounded by Ω(max{ϱ^{1/3}T, √T}), where ϱ is the maximum rotting rate, and show that this lower bound is tight up to a poly-logarithmic factor.
This reveals that the rotting trend starts to have an effect on regret precisely at the threshold ϱ = Θ(1/T^{3/2}). Our result implies that the rested rotting bandit problem with infinitely many arms is harder than the stationary rewards case, as in the latter case the regret lower bound is Ω(√T) (Wang et al., 2009). This stands in stark contrast to the case of finitely many arms, K, for which it is known that O(√(KT)) can be achieved in the rotting case (Seznec et al., 2019), matching the optimal bound of the stationary case (Auer et al., 2002b) up to a poly-logarithmic factor.

In the case of infinitely many arms with stationary rewards, it is not possible to explore all arms to find an optimal arm; hence, it is required to find a near-optimal arm. Contrast this with the case of finitely many arms, where all arms must be explored to identify an optimal arm. Further, when we consider rotting rewards, the learner must keep exploring new arms because a near-optimal arm may become suboptimal as it is being pulled. Based on this fact, we design algorithms for the rotting infinitely many-armed bandit problem that achieve tight regret bounds. We summarize our contributions in more detail in what follows.

1.1. Summary of our contributions

We show an Ω(max{ϱ^{1/3}T, √T}) worst-case regret lower bound for the rested rotting bandit case with maximum rotting rate ϱ = o(1). This regret lower bound matches the regret lower bound Ω(√T) that is known to hold for the case of stationary rewards when rotting is sufficiently small, precisely when ϱ = O(1/T^{3/2}). Otherwise, when ϱ = ω(1/T^{3/2}), the regret lower bound becomes worse than for the stationary case.

We show that an O(max{ϱ^{1/3}T, √T}) regret can be achieved by an algorithm when the maximum rotting rate ϱ is known to the algorithm. This algorithm uses a UCB index to decide whether to continue pulling an arm or remove the arm from further consideration and switch to exploring a new arm, by comparing the index with a threshold. This threshold is set to account for rotting of rewards.

We further show that an O(max{ϱ^{1/3}T, T^{3/4}}) regret can be achieved by an algorithm that does not know the value of the maximum rotting rate ϱ. This algorithm uses an adaptive UCB index and an adaptive threshold value, comparing the UCB index of an arm with the threshold to decide whether to continue pulling this arm or remove the arm from further consideration. This upper bound matches the lower bound up to poly-logarithmic factors when the rotting rate ϱ is sufficiently large, i.e. when ϱ = Ω(1/T^{3/4}).

We present results of numerical experiments for randomly generated problem instances of rotting infinitely many-armed bandits. These results validate the insights derived from our theoretical results.

1.2. Related work

The work on multi-armed bandits can be distinguished with respect to two criteria: first, whether the number of arms is finite or infinite, and second, whether rewards of arms are stationary or non-stationary. For the case of non-stationary rewards, we can further distinguish rested from restless multi-armed bandit problems: in the former case, an arm's reward distribution may change only when the arm is pulled, while in the latter case, it may change at each time step. Our work falls in the category of multi-armed bandit problems with infinitely many non-stationary rested arms.

The case of finitely many arms with stationary rewards has been studied by many, following on Lai & Robbins (1985); Auer et al. (2002a).
There exist algorithms having O( KT) worst-case regret, where K is the number of arms, and this matches the lower bound Ω( KT) up to a poly-logarithmic factor (Auer et al., 2002b; Slivkins, 2019). We next discuss the case of finitely many arms with nonstationary rewards. The non-stationarity in rewards can be quantified by the number of abrupt changes or a variation budget, which is referred to as abrupt-changing and slowvarying environments, respectively. The non-stationary environments were studied by Auer et al. (2002b); Garivier & Moulines (2011); Besbes et al. (2014); Auer et al. (2019) in which proposed algorithms are based on a strategy of adapting current state rapidly and fading old history memory (e.g. sliding window, discount factor, and restarting). In addition to this, non-stationary environments were studied under various assumptions, e.g. contextual bandits and MDPs (Cheung et al., 2019; Chen et al., 2019; Zhao et al., 2020; Russac et al., 2019; Cheung et al., 2020), mortal bandits where arms have a stochastic lifetime (Chakrabarti et al., 2008; Trac a et al., 2020), and bandits where arm rewards evolve according to a continuous-time stochastic process (Slivkins & Upfal, 2008). The multi-armed bandit problem with a finite number of arms, where each arm s mean reward decays with the number of pulls of this arm, was first studied by Komiyama & Qin (2014); Heidari et al. (2016); Bouneffouf & F eraud (2016); Levine et al. (2017). Following Levine et al. (2017), this problem is referred to as rotting bandits problem. Levine et al. (2017) showed that a sliding-window algorithm has a O(K1/3T 2/3) regret in a non-parametric rested rotting setting where the only assumption is that mean rewards are positive and non-increasing in the number of pulls. The nonparametric rotting bandit problem, allowing mean rewards to be negative with bounded decay, was subsequently studied by Seznec et al. (2019), showing an algorithm that has an O( KT) problem instance independent bound. Seznec et al. (2020) showed that a single algorithm, an adaptivewindow UCB index policy, achieves near-optimal regret for both rested and restless rotting bandits. In this paper, we follow the non-parametric rested rotting setting where mean rewards can only decrease with bounded decrements. We next discuss the case of infinitely many arms with stationary rewards. Berry et al. (1997); Bonald & Prouti ere (2013) proposed algorithms with asymptotically optimal regret O( T) for the case of arms with Bernoulli rewards and independent mean values according to uniform distribution on [0, 1]. Wang et al. (2009) studied the case where the mean reward distribution has support on [0, µ ] with µ 1, and for each arm a the distribution of mean reward µ(a) is such that P(µ(a) µ z) = Θ(zβ), for some β > 0. Carpentier & Valko (2015) studied the same problem but focused on simple regret, defined as the instantaneous regret at time step T. Bayati et al. (2020) showed that a subsampled UCB algorithm (SSUCB) that samples Θ( T) arms and executes UCB only on this subset of arms has O( T) regret under 1-sub-Gaussian rewards with mean rewards according to uniform distribution on [0, 1]. In this Rotting Infinitely Many-armed Bandits setting, for mean reward distributions such that there is a large enough number of near-optimal arms, an algorithm may find a near-optimal arm by exploring a restricted number of arms. 
There also exist several works dealing with infinitely many arms under structured reward functions, such as contextual linear bandits (Abbasi-Yadkori et al., 2011) and Lipschitz bandits (Bubeck et al., 2011). In this paper, however, we focus on infinitely many arms under a mean reward distribution, where the structured-reward assumptions may not hold because of rotting.

Our work is different from the work discussed in this section in that we consider the case of infinitely many non-stationary, rotting arms. In the case of rotting bandits with a finite number of arms, as we mentioned, Seznec et al. (2019; 2020) achieve a worst-case regret bound of O(√(KT)), which matches the near-optimal regret in the stationary stochastic setting. This result indicates that rotting in the finitely many arms setting is not a harder problem than the stationary rewards setting. However, in the setting of infinitely many arms, rotting of rewards makes the problem harder than the stationary rewards case. This is because the value of the optimal mean reward does not decrease as arms are being pulled, since there are infinitely many near-optimal arms, which requires additional exploration to recurrently search for a new optimal arm outside of the set of already pulled arms. Our algorithms differ from previously proposed algorithms for the case of infinitely many arms with stationary rewards in that they keep exploring new arms to find a near-optimal arm over time because of rotting rewards. In more detail, our algorithms use UCB policies to decide whether to continue pulling an arm or remove the arm from further consideration and explore a new arm, by comparing its UCB index with a threshold which is adjusted using the rotting rate or an estimated value of the rotting rate.

2. Problem formulation

We consider a non-stationary bandit problem with infinitely many arms where the reward distributions of arms vary over time. We consider the case when the mean reward of an arm may decrease only when this arm is pulled by an agent that uses a policy π, which is referred to as the rested rotting bandit setting. Let A be an infinite set of arms, µ_t(a) be the mean reward of arm a at time t before pulling an arm at time t, and n_t(a) be the number of times arm a ∈ A is pulled by π before time t. Also, denote by r_t the stochastic reward gained by pulling arm a_t^π at time t. Let r_t = µ_t(a_t^π) + η_t, where η_t is a noise term with a 1-sub-Gaussian distribution. We assume that the initial mean rewards {µ_1(a)}_{a∈A} are i.i.d. random variables with uniform distribution on [0, 1].

The rotting of arms is defined as follows. Given a rotting rate 0 ≤ ϱ_t ≤ ϱ at time t ≥ 1, with maximum rotting rate ϱ = o(1), the mean reward of the selected arm at time t changes as µ_{t+1}(a_t) = µ_t(a_t) − ϱ_t, whereas the mean rewards of other arms remain unchanged. The mean reward of every arm a ∈ A at time t > 1 can therefore be represented as follows: with 0 ≤ ϱ_s ≤ ϱ for all time steps 0 < s < t,

µ_t(a) = µ_1(a) − Σ_{s=1}^{t−1} ϱ_s 1(a_s = a).

The objective is to find a policy that minimizes the expected cumulative regret over a time horizon of T time steps, which for a given policy π is defined as

E[R^π(T)] = E[ Σ_{t=1}^{T} (1 − µ_t(a_t^π)) ].

In the regret definition, we use that the mean reward of the optimal arm at any time t is equal to 1. This is because there is an infinite number of arms in A with i.i.d. mean rewards according to the uniform distribution on [0, 1], so that there always exist sufficiently many arms whose mean rewards are close enough to 1.
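To make the formulation concrete, the following Python sketch simulates the rested rotting environment described above: arms are drawn lazily from the infinite pool with Uniform[0, 1] initial means, rewards carry standard Gaussian (hence 1-sub-Gaussian) noise, and the pulled arm's mean decreases by a rotting amount. The class and method names are ours and are used only for illustration; the sketch uses the identical rotting ϱ_t = ϱ, though any trend with 0 ≤ ϱ_t ≤ ϱ could be substituted.

```python
import numpy as np

class RottingInfiniteBandit:
    """Illustrative simulator of the rested rotting setting in Section 2.

    Arms are instantiated lazily from the infinite pool: each newly selected
    arm draws its initial mean reward from Uniform[0, 1]. Pulling an arm
    returns its current mean plus standard Gaussian (1-sub-Gaussian) noise,
    then decreases that arm's mean by a rotting amount.
    """

    def __init__(self, varrho, rng=None):
        self.varrho = varrho                    # maximum rotting rate
        self.rng = rng or np.random.default_rng()
        self.means = []                         # current mean reward of each created arm
        self.regret = 0.0                       # cumulative regret against the optimal mean 1

    def new_arm(self):
        """Select a fresh arm from the infinite pool and return its index."""
        self.means.append(self.rng.uniform(0.0, 1.0))
        return len(self.means) - 1

    def pull(self, arm):
        """Pull `arm`: collect a noisy reward, apply rotting, track regret."""
        mean = self.means[arm]
        reward = mean + self.rng.normal(0.0, 1.0)
        self.regret += 1.0 - mean               # the optimal mean reward is 1
        # identical rotting rho_t = varrho; any trend with 0 <= rho_t <= varrho
        # could be substituted here
        self.means[arm] -= self.varrho
        return reward
```

For example, with T = 10^6, setting varrho = T**(-1.5) corresponds to the nearly stationary regime, while varrho = T**(-0.5) falls in the large rotting regime; these match the settings used in the experiments of Section 5.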
In what follows, selecting an arm means that a policy chooses an arm in A before playing it, and pulling an arm means that the policy plays a selected arm and receives a reward.

3. Regret lower bound

We first discuss two different regimes for the regret depending on the value of the maximum rotting rate ϱ. When ϱ ≤ 1/T^{3/2}, the mean reward of any arm over the time horizon of T time steps changes by at most ϱT ≤ 1/√T. Therefore, for any arm, there can be at most a gap of 1/√T between the initial mean reward and the mean reward after T time steps, which causes an additional regret of at most √T over the horizon of T time steps compared with the case of stationary arms (i.e. when ϱ = 0). In Theorem 3 in Wang et al. (2009), the optimal regret for the stationary case, with uniform distribution of mean rewards of arms, is shown to be of order √T. Therefore, the extra regret of √T from the rotting with ϱ ≤ 1/T^{3/2} does not affect the order of the regret. When ϱ > 1/T^{3/2}, we expect that the regret lower bound may be different than for the stationary case. By analyzing the regret lower bound for the specific case with ϱ_t = ϱ for all time steps t > 0, we provide a lower bound for the worst-case regret with respect to arbitrary rotting, as given in the following theorem.

Theorem 3.1. For the rotting infinitely many-armed bandit problem, there exist rotting rates 0 ≤ ϱ_t ≤ ϱ = o(1) for all time steps t > 0 such that any policy π has regret over a time horizon of T time steps satisfying E[R^π(T)] = Ω(max{ϱ^{1/3}T, √T}).

From the result of the theorem, when the rotting is small enough, i.e. precisely when ϱ ≤ 1/T^{3/2}, the lower bound corresponds to Ω(√T), which is known to hold when rewards are stationary. Otherwise, when the rotting is sufficiently large, the lower bound is Ω(ϱ^{1/3}T). For example, when ϱ = 1/T^γ for some γ > 0, we have the lower bound Ω(√T) if γ ≥ 3/2 (small rotting case), and otherwise (large rotting case) we have Ω(T^{1−γ/3}). We can observe that ϱ = Θ(1/T^{3/2}) is a transition point at which the lower bound switches from Ω(√T) to Ω(ϱ^{1/3}T).

Proof sketch. Here we present a proof sketch of the theorem, with the full version of the proof provided in Appendix A.2. We assume that ϱ_t = ϱ = o(1) for all time steps t > 0. When ϱ = O(1/T^{3/2}), the lower bound for the stationary case of order √T (Wang et al., 2009) is tight enough for the non-stationary case. This is because we only need to pay an extra regret of at most order √T for small ϱ. Therefore, when ϱ = O(1/T^{3/2}), we have

E[R^π(T)] = Ω(√T). (1)

We note that even though the mean rewards are rotting in our setting, we can easily obtain (1) by following the same proof steps as Theorem 3 in Wang et al. (2009). For the sake of completeness, we provide a proof in Appendix A.2.

When ϱ = ω(1/T^{3/2}), however, the lower bound for the stationary case is not tight enough. Here we provide the proof of the lower bound Ω(ϱ^{1/3}T) for the case when ϱ = ω(1/T^{3/2}). For showing the lower bound, we classify each arm as either bad, good, or neither, according to the definition given shortly. To distinguish bad and good arms, we use two thresholds, 1 − c and 1 − δ, respectively, where c and δ are such that 0 < 1 − c < 1 − δ < 1, δ = ϱ^{1/3}, and c is a constant. An arm a is said to be a bad arm if µ_1(a) ≤ 1 − c, and is said to be a good arm if µ_1(a) > 1 − δ. Let N_T be the number of distinct selected good arms until time step T.
We separately consider the two cases N_T < m and N_T ≥ m, where m = ⌈(1/2)Tϱ^{2/3}⌉, and show that each case has Ω(ϱ^{1/3}T) as the regret lower bound. The main ideas for each case are outlined as follows. When the number of selected good arms is relatively small (N_T < m), any policy π must pull arms with mean rewards less than 1 − δ for at least T/2 time steps until T, amounting to at least δ regret for each such pull (the gap between 1 and the mean reward of the pulled arm). Therefore, the regret is lower bounded by Ω(δ(T/2)) = Ω(ϱ^{1/3}T). When the number of selected good arms is relatively large (N_T ≥ m), we can show that any policy π is likely to select at least of order ϱ^{1/3}T distinct bad arms until T. From the fact that the selected bad arms are pulled at least once and each such pull adds a constant regret of value at least c, the regret is lower bounded by Ω(cϱ^{1/3}T) = Ω(ϱ^{1/3}T). Therefore, when ϱ = ω(1/T^{3/2}), we obtain

E[R^π(T)] = E[R^π(T)1(N_T < m) + R^π(T)1(N_T ≥ m)] = Ω(ϱ^{1/3}T). (2)

Finally, from (1) and (2), we have E[R^π(T)] = Ω(max{ϱ^{1/3}T, √T}).

4. Algorithms and regret upper bounds

In this section, we first present an algorithm for the rested rotting bandit problem with infinitely many arms for the case when the algorithm knows the value of the maximum rotting rate. We show a regret upper bound for the algorithm that matches the regret lower bound in Theorem 3.1 up to a poly-logarithmic factor. Second, we present an algorithm that does not know the maximum rotting rate and show a regret upper bound that matches the regret lower bound up to a poly-logarithmic factor when the maximum rotting rate is large enough.

4.1. An algorithm knowing maximum rotting rate

We present an algorithm which requires knowledge of the maximum rotting rate in Algorithm 1.

Algorithm 1 UCB-Threshold Policy (UCB-TP)
  Given: T, δ, A; Initialize: A′ ← A
  Select an arm a ∈ A′
  Pull arm a and get reward r_1
  for t = 2, . . . , T do
    Update the initial mean reward estimator µ^o_t(a)
    if UCB_t(a) ≥ 1 − δ then
      Pull arm a and get reward r_t
    else
      A′ ← A′ \ {a}
      Select an arm a ∈ A′
      Pull arm a and get reward r_t
    end if
  end for

The algorithm selects an arm and pulls this arm as long as the arm is tested to be a good arm, using a test that compares an upper confidence bound of this arm with a threshold value. Specifically, if a is the selected arm at time step t, the algorithm computes an estimator µ^o_t(a) of the initial mean reward of arm a and uses this estimator to compute an estimator of the mean reward of arm a at time step t, accounting for the worst-case rotting rate ϱ in the estimators. Comparing the upper confidence bound for the mean reward with the threshold 1 − δ, the algorithm tests whether the arm is a good arm. If the arm is tested to be a good arm, then the algorithm continues to pull this arm. Otherwise, it discards the arm and selects a new one, and repeats the procedure described above until time horizon T.

We consider Algorithm 1 with the initial mean reward estimator defined as

µ^o_t(a) := (Σ_{s=1}^{t−1} (r_s + ϱ n_s(a)) 1(a_s = a)) / n_t(a)

and the upper confidence bound term defined as

UCB_t(a) := µ^o_t(a) − ϱ n_t(a) + √(8 log(T)/n_t(a)).

Note that when ϱ_s = ϱ for all 0 < s < t, µ^o_t(a) is an unbiased estimator of the initial mean reward µ_1(a) of arm a, and µ^o_t(a) − ϱ n_t(a) is an unbiased estimator of the mean reward µ_t(a) of arm a at time step t. The upper confidence bound UCB_t(a) follows the standard definition of an upper confidence bound.
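The following Python sketch illustrates how Algorithm 1 can be implemented when ϱ is known, reusing the simulator sketched in Section 2. It is a minimal rendering of UCB-TP under our own naming, not the authors' reference code: it maintains the running statistics of the currently selected arm only, computes µ^o_t(a) and UCB_t(a) as defined above, and discards the arm once its UCB index falls below the threshold 1 − δ with δ = max{ϱ^{1/3}, 1/√T}.

```python
import numpy as np

def ucb_threshold_policy(env, T, varrho):
    """Sketch of Algorithm 1 (UCB-TP); assumes the maximum rotting rate varrho is known."""
    delta = max(varrho ** (1.0 / 3.0), 1.0 / np.sqrt(T))   # threshold parameter of Theorem 4.1

    def ucb_index(sum_shifted, n):
        # mu^o_t(a) - varrho * n_t(a) + sqrt(8 log(T) / n_t(a))
        return sum_shifted / n - varrho * n + np.sqrt(8.0 * np.log(T) / n)

    arm, n, sum_shifted = env.new_arm(), 0, 0.0
    for _ in range(T):
        if n > 0 and ucb_index(sum_shifted, n) < 1.0 - delta:
            # the current arm is tested bad: remove it and select a fresh arm
            arm, n, sum_shifted = env.new_arm(), 0, 0.0
        reward = env.pull(arm)
        sum_shifted += reward + varrho * n   # add back the rotting accumulated before this pull
        n += 1
    return env.regret
```

Since a discarded arm is never revisited, only the pull count and running sum of the currently selected arm need to be stored.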
By designing the estimators to account for the maximum rotting rate ϱ, for any arbitrary rotting rates ϱ_t ≤ ϱ over all time steps t > 0, we show that the algorithm has a near-optimal worst-case regret upper bound, given in the following theorem.

Theorem 4.1. For the policy π defined by Algorithm 1 with δ = max{ϱ^{1/3}, 1/√T}, and ϱ = o(1), the regret satisfies E[R^π(T)] = O(max{ϱ^{1/3}T, √T}).

Proof sketch. Here we present a proof sketch of the theorem, with a full version of the proof provided in Appendix A.3. Observe that the initial mean rewards of selected arms are i.i.d. random variables with uniform distribution on [0, 1]. We first define arms to be good or bad depending on their initial mean rewards. We assume ϱ = o(1) and set δ = max{ϱ^{1/3}, 1/√T}. Then we define an arm a to be a good arm if ∆(a) ≤ δ/2, where ∆(a) = 1 − µ_1(a); otherwise, a is a bad arm. Because of rotting, an initially good arm may become bad after pulling the arm several times. Therefore, the policy π may select several good arms over the entire time horizon, and we analyze the regret over time episodes defined by the selections of good arms. Given the policy, we refer to the period starting from selecting the (i − 1)-st good arm until selecting the i-th good arm as the i-th episode. Because of the uniform distribution of mean rewards and the small threshold δ/2, each episode is likely to contain several consecutive selections of bad arms.

We do an episodic analysis. We first analyze the expected regret per episode and multiply it by the expected number of episodes in T. However, this proof strategy has an issue: the regret of each episode and the number of episodes in T are not independent. To resolve the issue, we fix the number of episodes to m^G and analyze the regret not for T but for m^G episodes. Note that m^G is a fixed value, and the time after m^G episodes can exceed T. For obtaining a regret bound, we set m^G so that the total number of time steps for m^G episodes is larger than T with high probability and thus R^π(T) ≤ R^π_{m^G}, where R^π_{m^G} is the regret accumulated over m^G episodes. For the regret analysis, we denote by m^B_i the number of selections of distinct bad arms in the i-th episode. See Figure 1 for an illustration of the episodes in a time line.

Figure 1: Episodes in a time line.

In what follows, we provide an overview of the regret analysis by considering two separate cases depending on the value of the maximum rotting rate ϱ: one for a large rotting case, and the other for a small rotting case, which we may interpret as a near-stationary case. For the analysis, we use R^G_i to denote the regret accumulated by pulling the good arm in the i-th episode, and R^B_{i,j} to denote the regret accumulated by pulling the j-th bad arm in the i-th episode.

Case of large rotting ϱ = ω(1/T^{3/2}): We first show that by setting m^G = ⌈2Tϱ^{2/3}⌉, we have R^π(T) ≤ R^π_{m^G}. If the policy selects a good arm a, where µ_1(a) ≥ 1 − δ/2, then it must pull the arm at least δ/(2ϱ) times, with high probability, to decrease the mean reward below the threshold 1 − δ. Then the total number of time steps for m^G episodes is at least (δ/(2ϱ)) m^G = (1/(2ϱ^{2/3})) ⌈2Tϱ^{2/3}⌉ ≥ T. This implies that R^π(T) ≤ R^π_{m^G}. We next provide a bound for E[R^π(T)] using E[R^π_{m^G}]. For bounding E[R^π_{m^G}], we can show that for any i ∈ [m^G] and j ∈ [m^B_i], we have

E[R^G_i] = O(1/ϱ^{1/3}) and E[R^B_{i,j}] = O(1). (3)

Observe that m^B_1, . . . , m^B_{m^G} are i.i.d. random variables with geometric distribution with parameter δ/2. Therefore, for any non-negative integer k, we have P(m^B_i = k) = (1 − δ/2)^k δ/2 and E[m^B_i] = (2/δ) − 1 for all i ∈ [m^G].
We have set δ = max{ϱ^{1/3}, 1/√T}. Then when ϱ = ω(1/T^{3/2}), with m^G = ⌈2Tϱ^{2/3}⌉ and from (3), E[m^B_i] = 2/δ − 1, and δ = ϱ^{1/3}, we have

E[R^π(T)] = O(E[R^π_{m^G}]) = O((E[R^G_i] + E[Σ_{j∈[m^B_i]} R^B_{i,j}]) m^G) = O((1/ϱ^{1/3} + E[m^B_i]) m^G) = O(ϱ^{1/3}T). (4)

Case of small rotting ϱ = O(1/T^{3/2}): By setting m^G = C for some large constant C > 0, we can show that R^π(T) ≤ R^π_{m^G}. This is because if the policy selects a good arm a, then it must pull the arm at least of order T times with high probability: the small rotting case is a near-stationary setting, so the policy can pull a good arm for a large number of time steps. For bounding E[R^π_{m^G}], we can show that for any i ∈ [m^G] and j ∈ [m^B_i],

E[R^G_i] = O(√T) and E[R^B_{i,j}] = O(1). (5)

Then when ϱ = O(1/T^{3/2}), with m^G = C and from (5), E[m^B_i] = 2/δ − 1, and δ = Θ(1/√T), we have

E[R^π(T)] = O(E[R^π_{m^G}]) = O((√T + E[m^B_i]) m^G) = O(√T). (6)

We note that in the small rotting case, the policy achieves O(√T), which matches a near-optimal bound for the stationary setting (Wang et al., 2009).

Putting the pieces together: From (4) and (6), by taking ϱ = o(1) and δ = max{ϱ^{1/3}, 1/√T}, it follows that E[R^π(T)] = O(max{ϱ^{1/3}T, √T}).

4.2. An algorithm not knowing maximum rotting rate

In this section we present an algorithm, defined in Algorithm 2, which does not require information about the maximum rotting rate ϱ, and provide a regret upper bound for this algorithm. The algorithm adopts the strategy of hierarchical bandit algorithms, similar to BOB (bandit-over-bandit) (Cheung et al., 2019). It consists of a master algorithm and several base algorithms, where EXP3 (Auer et al., 2002b) is used as the master algorithm, whose goal is to find a near-optimal base algorithm; each base algorithm is a UCB index policy similar to that in Algorithm 1, run with a candidate rotting rate and an adaptive threshold to decide whether to continue pulling the currently selected arm.

In Algorithm 2, the time horizon of T time steps is partitioned into blocks of H time steps. Before starting each block, the master algorithm selects a rotting estimator β̂ of the unknown maximum rotting rate ϱ from a set of candidate values denoted by B. Then it runs a base algorithm over H time steps which decides whether to continue pulling the selected arm based on a UCB index and a threshold tuned using the selected β̂. Utilizing the rewards obtained over the block as feedback, it updates the master to find a near-optimal base and repeats the procedure described above until time horizon T.

Algorithm 2 Adaptive UCB-Threshold Policy (AUCB-TP)
  Given: T, H, B, A, α; Initialize: A′ ← A, w(β) ← 1 for β ∈ B
  for i = 1, 2, . . . , ⌈T/H⌉ do
    Select an arm a ∈ A′
    Pull arm a and get reward r_{(i−1)H+1}
    p(β) ← (1 − α) w(β)/Σ_{k∈B} w(k) + α/|B| for β ∈ B
    Select β̂ ∈ B with probability p(β̂)
    δ ← β̂^{1/3}
    for t = (i − 1)H + 2, . . . , iH ∧ T do
      if UCB_{i,t}(a, β̂) ≥ 1 − δ then
        Pull arm a and get reward r_t
      else
        A′ ← A′ \ {a}
        Select an arm a ∈ A′
        Pull arm a and get reward r_t
      end if
    end for
    w(β̂) ← w(β̂) exp( (α/(|B| p(β̂))) (1/2 + Σ_{t=(i−1)H+1}^{iH∧T} r_t / (186H log T + 4√(H log T))) )
  end for

We note that the term 1/2 + Σ_{t=(i−1)H+1}^{iH∧T} r_t / (186H log T + 4√(H log T)) in the update of w(β̂) in Algorithm 2 rescales and translates the rewards so that they lie in [0, 1] with high probability. Also, by optimizing the block size H, we can control the regrets induced by the master and by a base: increasing H increases the regret induced by the master and decreases the regret induced by a base. These facts are shown later in the proof of Theorem 4.2.
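The following Python sketch illustrates the structure of Algorithm 2, again reusing the simulator sketched in Section 2; it is our own minimal rendering under assumed helper names, not the authors' code. An EXP3 master maintains weights over the candidate set B, picks a candidate rotting rate β̂ per block of H steps, runs the UCB-threshold base with threshold 1 − β̂^{1/3} inside the block, and updates the weight of β̂ with the rescaled block reward. The candidate grid, the mixing parameter α, and the index UCB_{i,t} are instantiated as defined in the next paragraph.

```python
import numpy as np

def adaptive_ucb_threshold_policy(env, T, rng=None):
    """Sketch of Algorithm 2 (AUCB-TP): an EXP3 master over candidate rotting
    rates, each driving a UCB-threshold base for one block of H time steps."""
    rng = rng or np.random.default_rng()
    H = int(np.sqrt(T))                                  # block length, H = T^{1/2} up to rounding
    k_max = int(np.ceil(1.5 * np.log2(H)))               # B = {2^-3, 2^-4, ..., 2^{-(3/2) log2 H}}
    B = [2.0 ** (-k) for k in range(3, k_max + 1)]
    nB = len(B)
    alpha = min(1.0, np.sqrt(nB * np.log(nB) / ((np.e - 1) * np.ceil(T / H))))
    w = np.ones(nB)                                      # EXP3 weights, one per candidate
    scale = 186.0 * H * np.log(T) + 4.0 * np.sqrt(H * np.log(T))

    t = 0
    while t < T:
        p = (1 - alpha) * w / w.sum() + alpha / nB       # EXP3 selection probabilities
        j = rng.choice(nB, p=p)
        beta = B[j]                                      # candidate rotting rate for this block
        delta = beta ** (1.0 / 3.0)                      # adaptive threshold 1 - delta
        block_reward = 0.0
        arm, n, sum_shifted = env.new_arm(), 0, 0.0
        for _ in range(min(H, T - t)):
            if n > 0:
                ucb = sum_shifted / n - beta * n + np.sqrt(8.0 * np.log(H) / n)
                if ucb < 1.0 - delta:                    # current arm tested bad
                    arm, n, sum_shifted = env.new_arm(), 0, 0.0
            r = env.pull(arm)
            block_reward += r
            sum_shifted += r + beta * n
            n += 1
            t += 1
        # EXP3 update with the rescaled and translated block reward as feedback
        w[j] *= np.exp(alpha / (nB * p[j]) * (0.5 + block_reward / scale))
    return env.regret
```

The base inside each block is the same routine as in the sketch of Algorithm 1, with T replaced by H and the known ϱ replaced by the selected candidate β̂.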
In what follows, we define the inputs B and α, and the upper confidence bound index UCBi,t(a, β) for β B. B contains candidate values of β to optimize UCBi,t(a, β) and the threshold parameter δ. We find that the optimal base parameter β B is when β = max{1/H3/2, ϱ} including a clipped domain for the optimal threshold value δ as in Theorem 4.1. This implies that the optimized β 1/H3/2 = 2 3/2 log2 H. Also from ϱ = o(1), β 1/8. Therefore, we set B = {2 3, 2 4, . . . , 2 (3/2) log2 H }, in which the cardinality of the set is restricted by O(log H) which does not hurt the regret from EXP3 up to a logarithmic factor. Let B = |B|. Then we set α = min{1, p B log B/((e 1) T/H )}, which is used to guarantee a least selection probability for each base. Let ni,t(a) be the number of times that arm a A is pulled by the algorithm from time step (i 1)H + 1 before time step Rotting Infinitely Many-armed Bandits t. Let µo i,t(a, β) be defined as µo i,t(a, β) := Pt 1 s=(i 1)H+1(rs + βni,s(a))1(as = a) and the UCBi,t(a, β) index be defined as UCBi,t(a, β) := µo i,t(a, β) βni,t(a)+ q 8 log(H)/ni,t(a). We provide a worst-case regret upper bound for Algorithm 2 in the following theorem. Theorem 4.2. With ϱ = o(1), for the policy π defined by Algorithm 2 with H = T 1/2 , the regret satisfies E[Rπ(T)] = O(max{ϱ1/3T, T 3/4}). Proof sketch. Here we present a proof sketch of the theorem with a full version of the proof given in Appendix A.4. The policy π consists of two strategies: EXP3 for the master and UCB-Threshold Policy (Algorithm 1) for bases. We can decompose the regret into two parts: regret incurred by playing a base with β B over each block of H time steps and regret incurred due to the master trying to find a nearoptimal base parameter in B. In what follows, we define the regret decomposition formally. Let πi(β) for β B denote the base policy with β for time steps between (i 1)H + 1 and i H T. Denote by aπi(β) t the pulled arm at time step t by policy πi(β). Then, for β B, which is set later for a near-optimal base, we have E[Rπ(T)] = E t=(i 1)H+1 µt(aπ t ) = E[Rπ 1 (T)] + E[Rπ 2 (T)], (7) t=(i 1)H+1 µt(aπi(β ) t ), µt(aπi(β ) t ) µt(aπ t ) . Note that Rπ 1 (T) accounts for the regret caused by the nearoptimal base algorithm πi(β ) against the optimal mean reward and Rπ 2 (T) accounts for the regret caused by the master algorithm by selecting a base with β B at every block against the base with β as illustrated in Figure 2. We set β to be the smallest value in B which is larger than max{ϱ, 1/H3/2}. Then the base has the threshold parameter β 1/3 of order max{ϱ1/3, 1/ H} which coincides with Figure 2: Regret decomposition for Algorithm 2. the optimal threshold parameter of Algorithm 1 by replacing T with H. We note that the policy π does not require knowing β and it is defined only for the proof. In what follows, we provide upper bounds for each regret component. We first provide an upper bound for E[Rπ 1 (T)] by following the proof steps in Theorem 4.1. We can easily find that regret of the base with β for each block of size H that has the same regret bound as in Theorem 4.1 by replacing T with H amounting to O(max{ϱ1/3H, H}). Then by adding the regret for T/H number of blocks, we have E[Rπ 1 (T)] = O((T/H) max{ϱ1/3H, = O(max{ϱ1/3T, T/ Then we provide an upper bound for E[Rπ 2 (T)] using a regret bound for EXP3 in (Auer et al., 2002b). The EXP3 in policy π selects a base in B before starting a block and gets feedback at the end of the block and repeats this over T/H blocks. 
Therefore, EXP3 in π can be thought to be run for T/H decision rounds and the number of decision options for each round is B. Let Q be an upper bound for the absolute sum of rewards for any block with length H with high probability. Then from Corollary 3.2 in (Auer et al., 2002b), we can show that E[Rπ 2 (T)] = O(Q p B(T/H)). (9) By considering that mean rewards may become negative because of rotting and using a Chernoff s bound, we show that with high probability t=(i 1)H+1 rt 93H log T + 2 p H log T. (10) Then with B = O(log T), from (9) and (10), we have E[Rπ 2 (T)] = O H log(T) p Finally, from (7), (8), and (11), with H = T 1/2 , we have E[Rπ(T)] = O(max{ϱ1/3T, T/ = O(max{ϱ1/3T, T 3/4}). This concludes the proof. Rotting Infinitely Many-armed Bandits The regret bound for Algorithm 2 in Theorem 4.2 is larger than or equal to that for Algorithm 1 in Theorem 4.1. This is because the master in Algorithm 2 needs to learn the unknown maximum rotting rate to find a near-optimal base algorithm, which produces extra regret. In the following remarks, we discuss the region of the maximum rotting rate ϱ for which Algorithm 2 achieves the near-optimal regret and discuss the computation and memory efficiency of our algorithms. Remark 4.3. When ϱ = Ω(1/T 3/4), Algorithm 2 achieves the optimal regret bound O(ϱ1/3T) up to a poly-logarithmic factor. This is because when ϱ = Ω(1/T 3/4), the additional regret from learning the maximum rotting rate is negligible compared with the regret from the rotting of rewards. It is an open problem to achieve the optimal regret bound for any value of ϱ, without knowing the value of this parameter. Remark 4.4. Note that we can achieve the same regret bounds as in Theorems 4.1 and 4.2 for the worst-case rotting by replacing our UCB index with adaptive-window UCB index in (Seznec et al., 2020) which is known to achieve near-optimal regret for the case of rotting with a finite number of arms. However, computing the adaptive-window UCB over horizon T has high computation cost, O(T 2) and O(TH), and memory space O(T) and O(H), in Algorithms 1 and 2, respectively, for optimizing the window size. On the other hand, our proposed UCB index has a lower computation cost of O(T) and requires only O(1) memory space using a simple trick for updating mean estimators in online computation settings (Welford, 1962). 5. Numerical experiments In this section we present results of our numerical experiments using synthetic data in the rotting setting with infinitely many arms1. To the best of our knowledge, there are no previously-proposed algorithms for the setting of rotting with infinitely many arms. We compare the performance of Algorithms 1 and 2 with SSUCB (Bayati et al., 2020), which was proposed for infinitely many arms with stationary rewards, and is known to have a near-optimal regret for stationary sub-Gaussian reward distributions. From the theoretical results in Section 3 and Section 4, we expect that Algorithm 1 and SSUCB would have similar performance when the maximum rotting rate is sufficiently small, which may be regarded as a nearly stationary case, and that both Algorithm 1 and Algorithm 2 would outperform SSUCB when the rotting rate is sufficiently large. We expect that Algorithm 2 would not be competitive in the nearly stationary case because of the extra regret from the rotting rate estimation. 
It requires the rotting rate to be sufficiently large 1Our code is available at https://github.com/ junghunkim7786/rotting_infinite_armed_ bandits Figure 3: Performance of Algorithms 1 and 2, and SSUCB: (a) regret versus ϱ1/3 for fixed time horizon T, and (b,c,d) regret versus time horizon T with rotting rates ϱ = 1/T 1/2 (b), 1/T 3/5 (c), and 1/T 3/2 (d). to have the near-optimal regret bound. We will confirm these insights by our numerical results in what follows. In our experiments, we consider the case of identical rotting with ϱt = ϱ for all t [T]. We generate initial mean rewards of arms by sampling from uniform distribution on [0, 1]. In each time step, stochastic reward from pulling an arm has a Gaussian noise with mean zero and variance 1. We repeat each experiment 10 times and compute confidence intervals for confidence probability 0.95. We first investigate the performance of algorithms for varied rotting rate ϱ and fixed time horizon T. We ran experiments for rotting rate ϱ set to 1/T, 1/T 0.9, 1/T 0.8, . . . , 1/T 0.3, and measured the expected regret for the time horizon T = 106. In Figure 3 (a), we can confirm that our algorithms show more robust performance than SSUCB for various rotting rates with linearly increasing regret with respect to ϱ1/3 for large rotting, which matches Theorems 4.1 and 4.2. We observe that for large rotting cases, our algorithms have similar performance and both outperform SSUCB. For sufficiently small rotting, we can observe that all the three algorithms have comparable performance while Algorithm 2 performs slightly worse. These results conform to the insights derived from our theoretical analysis. We next investigate the performance of algorithms versus time horizon T and rotting rate ϱ depending on T. We ran experiments for time horizon T taking values 1, 1 105, 2 105, . . . , 106, and measured expected regret of each case. We set the rotting rate ϱ to 1/T 1/2, 1/T 3/5 and 1/T 3/2. Note that the case 1/T 3/2 may be considered as a Rotting Infinitely Many-armed Bandits nearly-stationary case, because in this case the regret lower bound is Ω( T) from Theorem 3.1. In Figure 3 (b) and (c), corresponding to large rotting rates, we observe that Algorithms 1 and 2 have similar performance and outperform SSUCB. The gaps between regrets of our algorithms and SSUCB become smaller by decreasing the rotting rate from 1/T 1/2 to 1/T 3/5. This is because SSUCB is designed for the case of stationary rewards and has a near-optimal regret in the stationary case. In the near-stationary case, in Figure 3 (d), we observe that Algorithm 1 has best performance and, as expected, Algorithm 2 has worst performance. 6. Conclusion In this paper we studied the infinitely many-armed bandit problem with rested rotting rewards. We provided a regret lower bound and proposed an algorithm which achieves a near-optimal regret, when the the maximum rotting rate is known to the algorithm. We also proposed an algorithm which does not require knowledge of the rotting rate and we showed that it achieves a near-optimal regret for any large enough rotting rate. In future work, it may be of interest to relax the assumption of uniform distribution for initial mean rewards. Abbasi-Yadkori, Y., P al, D., and Szepesv ari, C. Improved algorithms for linear stochastic bandits. Advances in neural information processing systems, 24, 2011. Auer, P., Cesa-Bianchi, N., and Fischer, P. Finite-time analysis of the multiarmed bandit problem. Machine learning, 47(2):235 256, 2002a. 
Auer, P., Cesa-Bianchi, N., Freund, Y., and Schapire, R. E. The nonstochastic multiarmed bandit problem. SIAM Journal on Computing, 32(1):48 77, 2002b. Auer, P., Gajane, P., and Ortner, R. Adaptively tracking the best bandit arm with an unknown number of distribution changes. In Conference on Learning Theory, pp. 138 158. PMLR, 2019. Bayati, M., Hamidi, N., Johari, R., and Khosravi, K. Unreasonable effectiveness of greedy algorithms in multi-armed bandit with many arms. In Larochelle, H., Ranzato, M., Hadsell, R., Balcan, M. F., and Lin, H. (eds.), Advances in Neural Information Processing Systems, volume 33, pp. 1713 1723. Curran Associates, Inc., 2020. Berry, D. A., Chen, R. W., Zame, A., Heath, D. C., and Shepp, L. A. Bandit problems with infinitely many arms. The Annals of Statistics, 25(5):2103 2116, 1997. Besbes, O., Gur, Y., and Zeevi, A. Stochastic multi-armedbandit problem with non-stationary rewards. In Proceed- ings of the 27th International Conference on Neural Information Processing Systems - Volume 1, NIPS 14, pp. 199 207, Cambridge, MA, USA, 2014. MIT Press. Bonald, T. and Prouti ere, A. Two-target algorithms for infinite-armed bandits with bernoulli rewards. In Proceedings of the 26th International Conference on Neural Information Processing Systems - Volume 2, NIPS 13, pp. 2184 2192, Red Hook, NY, USA, 2013. Curran Associates Inc. Bouneffouf, D. and F eraud, R. Multi-armed bandit problem with known trend. Neurocomputing, 205:16 21, 2016. Brown, D. G. How I wasted too long finding a concentration inequality for sums of geometric variables. Found at https://cs. uwaterloo. ca/ browndg/negbin. pdf, 8(4), 2011. Bubeck, S., Stoltz, G., and Yu, J. Y. Lipschitz bandits without the lipschitz constant. In International Conference on Algorithmic Learning Theory, pp. 144 158. Springer, 2011. Carpentier, A. and Valko, M. Simple regret for infinitely many armed bandits. In Proceedings of the 32nd International Conference on International Conference on Machine Learning - Volume 37, ICML 15, pp. 1133 1141. JMLR.org, 2015. Chakrabarti, D., Kumar, R., Radlinski, F., and Upfal, E. Mortal multi-armed bandits. In Proceedings of the 21st International Conference on Neural Information Processing Systems, NIPS 08, pp. 273 280, Red Hook, NY, USA, 2008. Curran Associates Inc. Chen, Y., Lee, C.-W., Luo, H., and Wei, C.-Y. A new algorithm for non-stationary contextual bandits: Efficient, optimal and parameter-free. In Conference on Learning Theory, pp. 696 726. PMLR, 2019. Cheung, W. C., Simchi-Levi, D., and Zhu, R. Learning to optimize under non-stationarity. In The 22nd International Conference on Artificial Intelligence and Statistics, pp. 1079 1087. PMLR, 2019. Cheung, W. C., Simchi-Levi, D., and Zhu, R. Reinforcement learning for non-stationary markov decision processes: The blessing of (more) optimism. In International Conference on Machine Learning, pp. 1843 1854. PMLR, 2020. Garivier, A. and Moulines, E. On upper-confidence bound policies for switching bandit problems. In Kivinen, J., Szepesv ari, C., Ukkonen, E., and Zeugmann, T. (eds.), Algorithmic Learning Theory, pp. 174 188, Berlin, Heidelberg, 2011. Springer Berlin Heidelberg. ISBN 978-3642-24412-4. Rotting Infinitely Many-armed Bandits Heidari, H., Kearns, M., and Roth, A. Tight policy regret bounds for improving and decaying bandits. In Proceedings of the Twenty-Fifth International Joint Conference on Artificial Intelligence, IJCAI 16, pp. 1562 1570, 2016. Komiyama, J. and Qin, T. Time-decaying bandits for nonstationary systems. 
In Liu, T.-Y., Qi, Q., and Ye, Y. (eds.), Web and Internet Economics, pp. 460 466, Cham, 2014. Springer International Publishing. Lai, T. and Robbins, H. Asymptotically efficient adaptive allocation rules. Adv. Appl. Math., 6(1):4 22, mar 1985. Levine, N., Crammer, K., and Mannor, S. Rotting bandits. In Proceedings of the 31st International Conference on Neural Information Processing Systems, NIPS 17, pp. 3077 3086, Red Hook, NY, USA, 2017. Curran Associates Inc. ISBN 9781510860964. Rigollet, P. and H utter, J.-C. High dimensional statistics. Lecture notes for course 18S997, 813:814, 2015. Russac, Y., Vernade, C., and Capp e, O. Weighted linear bandits for non-stationary environments. ar Xiv preprint ar Xiv:1909.09146, 2019. Seznec, J., Locatelli, A., Carpentier, A., Lazaric, A., and Valko, M. Rotting bandits are no harder than stochastic ones. In Chaudhuri, K. and Sugiyama, M. (eds.), Proceedings of the Twenty-Second International Conference on Artificial Intelligence and Statistics, volume 89 of Proceedings of Machine Learning Research, pp. 2564 2572. PMLR, 16 18 Apr 2019. Seznec, J., Menard, P., Lazaric, A., and Valko, M. A single algorithm for both restless and rested rotting bandits. In Chiappa, S. and Calandra, R. (eds.), Proceedings of the Twenty Third International Conference on Artificial Intelligence and Statistics, volume 108 of Proceedings of Machine Learning Research, pp. 3784 3794. PMLR, 26 28 Aug 2020. Slivkins, A. Introduction to multi-armed bandits. Foundations and Trends in Machine Learning, 12(1-2):1 286, 2019. Slivkins, A. and Upfal, E. Adapting to a changing environment: the brownian restless bandits. In 21st Annual Conference on Learning Theory - COLT 2008, Helsinki, Finland, July 9-12, 2008, pp. 343 354, 2008. Trac a, S., Rudin, C., and Yan, W. Reducing exploration of dying arms in mortal bandits. In Adams, R. P. and Gogate, V. (eds.), Proceedings of The 35th Uncertainty in Artificial Intelligence Conference, volume 115 of Proceedings of Machine Learning Research, pp. 156 163. PMLR, 22 25 Jul 2020. Tsun, A. Probability & statistics with applications to computing, 2020. URL https://www.alextsun.com/ files/Prob_Stat_for_CS_Book.pdf. Wang, Y., Audibert, J.-y., and Munos, R. Algorithms for infinitely many-armed bandits. In Koller, D., Schuurmans, D., Bengio, Y., and Bottou, L. (eds.), Advances in Neural Information Processing Systems, volume 21. Curran Associates, Inc., 2009. Welford, B. Note on a method for calculating corrected sums of squares and products. Technometrics, 4(3):419 420, 1962. Zhao, P., Zhang, L., Jiang, Y., and Zhou, Z.-H. A simple approach for non-stationary linear bandits. In International Conference on Artificial Intelligence and Statistics, pp. 746 755. PMLR, 2020. Rotting Infinitely Many-armed Bandits A. Appendix A.1. Preliminaries Here we state some known concentration inequalities that we use in our proofs. Lemma A.1 (Theorem 6.2.35 in Tsun (2020)). Let X1, . . . , Xn be identical independent Bernoulli random variables. Then, for 0 < ν < 1, we have i=1 Xi (1 + ν)E exp ν2E[Pn i=1 Xi] 3 Lemma A.2 (Corollary 1.7 in Rigollet & H utter (2015)). Let X1, . . . , Xn be independent random variables with σ-sub Gaussian distributions. Then, for any a = (a1, . . . , an) Rn and t 0, we have i=1 ai Xi > t i=1 ai Xi < t A.2. Proof of Theorem 3.1 For showing the lower bound, we will classify arms to be either bad or good depending on the value of their initial mean rewards, in a way that will be defined shortly. 
We will analyze the number of bad arm pulls which is the main contributor to the regret. For the analysis, we count the number of bad arm pulls until the first or m-th good arm is pulled over a fixed time horizon in order to preserve i.i.d. property of mean rewards in the proof. To distinguish good from bad arms, we utilize two thresholds in the proof as follows. Let 0 < δ < c < 1 be such that 1 > 1 δ > 1 c > 0 for δ, which will be set small, and some constant c. Then, 1 δ and 1 c represent threshold values for distinguishing good arms and bad arms, respectively. In A, let a1, a2, . . . , be a sequence of arms. Given a policy π, without loss of generality let π select arms following the sequence of a1, a2, . . . ,. Case of small rotting: When ϱ = O(1/T 3/2), the lower bound of order T for the stationary case, from Theorem 3 in Wang et al. (2009), is tight enough for the non-stationary case. This is because we only need to pay the extra regret of at most of order T for small ϱ. From Theorem 3 in Wang et al. (2009), we have E[Rπ(T)] = Ω( We note that even though the mean rewards are rotting in our setting, Theorem 3 in Wang et al. (2009) can be applied to our setting without any proof changes providing a tight regret bound for the near-stationary case. For the sake of completeness, we provide the proof of the theorem. Let K1 denote the number of bad arms a that satisfy µ1(a) 1 c before selecting the first good arm, which satisfies µ1(a) > 1 δ, in the sequence of arms a1, a2, . . . . Let µ be the initial mean reward of the best arm among the selected arms by π over time horizon T. Then for some κ > 0, we have Rπ(T) = Rπ(T)1(µ 1 δ) + Rπ(T)1(µ > 1 δ) Tδ1(µ 1 δ) + K1c1(µ > 1 δ) Tδ1(µ 1 δ) + κc1(µ > 1 δ, K1 κ). (13) By taking expectations on the both sides in (13) and setting κ = Tδ/c, we have E[Rπ(T)] TδP(µ 1 δ) + κc(P(µ > 1 δ) P(K1 < κ)) = cκP(K1 κ). Let δ = δ/(1 c + δ). Then we observe that K1 follows a geometric distribution with success probability P(µ1(a) > 1 δ)/p(µ1(a) / (1 c, 1 δ)) = δ , in which the success probability is the probability of selecting a good arm given that the arm is either a good or bad arm. Then by setting δ = 1/ T/c, for some constant C > 0 we have E[Rπ(T)] cκ(1 δ )κ = Ω Rotting Infinitely Many-armed Bandits where the last equality is obtained from log x 1 1/x for all x > 0. Case of large rotting: When ϱ = ω(1/T 3/2), however, the lower bound of the stationary case is not tight enough. Here we provide the proof of the lower bound Tϱ1/3 for the case of ϱ = ω(1/T 3/2). Let Km denote the number of bad arms a that satisfy µ1(a) 1 c before selecting m-th good arm, which satisfies µ1(a) > 1 δ, in the sequence of arms a1, a2, . . . . Let NT be the number of selected good arms a such that µ1(a) > 1 δ until T. We can decompose Rπ(T) into two parts as follows: Rπ(T) = Rπ(T)1(NT < m) + Rπ(T)1(NT m). (14) We set m = (1/2)Tϱ2/3 and δ = ϱ1/3with ϱ = o(1). For getting a lower bound for the first term in (14), Rπ(T)1(NT < m), we use the fact that the minimal regret is obtained from the situation where there are m 1 arms whose mean rewards are 1. Then we can think of the optimal policy that selects the best m 1 arms until their mean rewards become below the threshold 1 δ (step 1) and then selects the best arm at each time for the remaining time steps (step 2). The required number of pulling each arm for the best m 1 arms until the mean reward becomes below 1 δ is upper bounded by δ/ϱ + 1. 
Therefore, the regret from step 2 is R = Ω((T mδ/ϱ)δ) = Ω(Tϱ1/3) in which the optimal policy pulls arms which mean rewards are below 1 δ for the remaining time after step 1. Therefore, we have Rπ(T)1(NT < m) = Ω(R1(NT < m)) = Ω(Tϱ1/31(NT < m)). (15) For getting a lower bound of the second term in (14), Rπ(T)1(NT m), we use the minimum number of selected arms a that satisfy µ1(a) 1 c. When NT m and Km κ, the policy selects at least κ number of distinct arms a satisfying µ1(a) 1 c until T. Therefore, we have Rπ(T)1(NT m) cκ1(NT m, Km κ). (16) Let δ = δ/(1 c + δ). By setting κ = m/δ m m/δ , with ϱ = o(1), we have κ = Θ(Tϱ1/3). (17) Then from (15), (16), and (17), we have E[Rπ(T)] = Ω(Tϱ1/3P(NT < m) + Tϱ1/3P(NT m, Km κ)) Ω(Tϱ1/3P(Km κ)). (18) Next we provide a lower bound for P(Km κ). Observe that Km follows a negative binomial distribution with m successes and the success probability P(µ1(a) > 1 δ)/P(µ1(a) / (1 c, 1 δ)) = δ/(1 c + δ) = δ , in which the success probability is the probability of selecting a good arm given that the arm is either a good or bad arm. In the following lemma, we provide a concentration inequality for Km. Lemma A.3. For any 1/2 + δ /m < α < 1, P(Km αm(1/δ ) m) 1 exp( (1/3)(1 1/α)2(αm δ )). (19) Proof. Let Xi for i > 0 be i.i.d. Bernoulli random variables with success probability δ . From Section 2 in Brown (2011), we have Rotting Infinitely Many-armed Bandits From (20) and Lemma A.1, for any 1/2 + δ /m < α < 1 we have δ m = P Km αm 1 exp (1 1/α)2 exp (1 1/α)2 3 (αm δ ) , in which the first inequality comes from Lemma A.1, which concludes the proof. From Lemma A.3 with α = 1 1/ m and large enough T, we have P(Km κ) 1 exp 3(m m δ ) 1 m 1 6(m m) 1 m 1 1 exp( 1/6). (21) Therefore, from (18) and (21), we have E[Rπ(T)] = Ω(Tϱ1/3). (22) Finally, from (12) and (22) we conclude that for any policy π, we have E[Rπ(T)] = Ω(max{Tϱ1/3, A.3. Proof of Theorem 4.1 Observe that initial mean rewards of selected arms are i.i.d. with a uniform distribution which should simplify analysis of the expected regret. However, by fixing the number of selected arms by a policy over the time horizon T, the mean rewards of arms become dependent. To deal with this dependence, we analyze the regret by controlling the number of distinct selected arms instead of fixing the time horizon. We explain this in more details in the following proofs. We set δ = max{ϱ1/3, 1/ T}. Then we define an arm a to be a good arm if (a) δ/2, and, otherwise, a is a bad arm. In A, let a1, a2, . . . , be a sequence of arms, which have i.i.d. mean rewards with uniform distribution on [0, 1]. Given a policy selecting arms in the sequence order, let m G be the number of selections of distinct good arms and m B i be the number of consecutive selections of distinct bad arms between the i 1-st and i-th selection of a good arm among m G good arms. We refer to the period starting from selecting the i 1-st good arm before selecting the i-th good arm as the i-th episode. Observe that m B 1 , . . . , m B m G are i.i.d. random variables with geometric distribution with parameter δ/2, given a fixed value of m G. Therefore, for non-negative integer k we have P(m B i = k) = (1 δ/2)kδ/2, for i = 1, . . . , m G. Define m to be the number of episodes from the policy π over the horizon T, m G to be the total number of selections of a good arm by the policy π over the horizon T such that m G = m or m G = m 1, and m B i to be the number of selections of a bad arm in the i-th episode by the policy π over the horizon T. 
Without loss of generality, we assume that the policy selects arms in the sequence of a1, a2, . . . , . Let AT be the set of selected arms over the horizon of T time steps, which satisfies |AT | T. Let ˆµt(a) = Pt 1 s=1 rs1(as = a) nt(a) and µt(a) = Pt 1 s=1 µs(a)1(as = a) Rotting Infinitely Many-armed Bandits We define the event E1 = {|ˆµt(a) µt(a)| p 2 log(T 4)/nt(a) for all t [T], a AT } to guarantee that the estimators of initial mean reward are well estimated. From Lemma A.2, by following similar steps of the proof of Lemma 5 in Auer et al. (2019), we have P(Ec 1) 2/T 2. Recall that Rπ(T) = PT t=1 1 µt(aπ t ). It is true that Rπ(T) = o(T 2) because the maximum mean reward gap from rotting is bounded by 1 + Tϱ = o(T). Given that E1 does not hold, the regret is E[Rπ(T)|Ec 1]P(Ec 1) = o(1), which is negligible comparing with the regret when E1 holds true which we show later. Therefore, in the rest of the proof we assume that E1 holds true. Under a policy π, let RG i be the regret (summation of mean reward gaps) contributed by pulling the good arm in the i-th episode and RB i,j be the regret contributed by pulling the j-th bad arm in the i-th episode. Then let Rπ m G = Pm G i=1(RG i + P j [m B i ] RB i,j),2 which is the regret over the period of m G episodes. For obtaining a regret bound, we first focus on finding a required number of episodes, m G, such that Rπ(T) Rπ m G. Then we provide regret bounds for each bad arm and good arm in an episode. Lastly, we obtain a regret bound for E[Rπ(T)] using the episodic regret bound. For i [ m G], j [ m B i ], let n G i be the number of pulls of the good arm in the i-th episode and n B i,j be the number of pulls of the j-th bad arm in the i-th episode by the policy π over the horizon T. Let a be the last selected arm over time horizon T by π. We denote by ˆm G and ˆm B i for i [ m] the number of arms excluding a in the selected m G number of good arms and m B i number of bad arms for i [ m] as follows: ( m G 1 if a is a good arm m G otherwise , ˆm B i = m B i for i [ m 1], and ˆm B m = ( m B m if a is a good arm m B m 1 otherwise. Those notations are defined only if they exist.3 Excluding the last arm a which the policy π may stop to pull suddenly by reaching the horizon T, we provide lower bounds of the number of pulling a good arm, n G i for i [ ˆm G] in the following lemma if they exist. Lemma A.4. Under E1, given ˆm G, for any i [ ˆm G] we have n G i δ/(2ϱ). ˆµo t(a) = Pt 1 s=1(rs + ϱsns(a))1(as = a) which satisfies µo t(a) ˆµo t(a) from ϱs ϱ for all s. Under E1, we can easily show that for all t [T] and a AT , |ˆµo t(a) µ1(a)| p 8 log(T)/nt(a). Let a(i) be a selected good arm in the i-th episode. Suppose that nt(a(i)) = n G i = δ/(2ϱ) for some t > 0, then we have µo t(a(i)) ϱ n G i + q 8 log(T)/ n G i ˆµo t(a(i)) ϱ n G i + q 8 log(T)/ n G i µ1(a(i)) δ/2 1 δ, where the second inequality is obtained from E1 and n G i δ/(2ϱ), and the third inequality is from µ1(a(i)) 1 δ/2. Therefore, policy π must pull arm a more times than δ/(2ϱ) , which implies n G i δ/(2ϱ). 2Note that Rπ m B does not contain undefined RB i,j such that RB i,j when m B i = 0. 3 n G i , n B i,j, and ˆm B i are not defined for i [0] or j [0]. Rotting Infinitely Many-armed Bandits We first consider the case when ϱ = ω(1/T 3/2). We have δ = ϱ1/3. For getting Rπ m G, here we define the policy π after time T such that it pulls δ/(2ϱ) amount for a good arm and 0 for a bad arm. We note that defining how π works after T is only for the proof to get a regret bound over time horizon T. 
For the last arm a over the horizon T, it pulls the arm up to δ/(2ϱ) amounts if a is a good arm and n G m G < δ/(2ϱ). For i [m G], j [m B i ] let n G i and n B i,j be the number of pulling the good arm in i-th episode and j-th bad arm in i-th episode from the policy, respectively. Here we define n G i s and n B i,j s as follows: If a is a good arm and n G m G < δ/(2ϱ), ( n G i for i [ m G 1] δ/(2ϱ) for i [m G]/[ m G 1] , n B i,j = ( n B i,j for i [ m G], j [ m B i ] 0 for i [m G]/[ m G], j [m B i ]/[ m B i ]. ( n G i for i [ m G] δ/(2ϱ) for i [m G]/[ m G] , n B i,j = ( n B i,j for i [ m G], j [ m B i ] 0 for i [m G]/[ m G 1], j [m B i ]/[ m B i ]. We note that n G i s and n B i,j s are defined only if they exist.4 Then we provide m G such that Rπ(T) Rπ m G in the following lemma. Lemma A.5. Under E1, when m G = 2Tϱ2/3 we have Rπ(T) Rπ m G. Proof. From Lemma A.4, with δ = ϱ1/3 we have j [m B i ] n B i,j which implies that Rπ(T) Rπ m G. From the result of Lemma A.5, we set m G = 2Tϱ2/3 . For getting a bound for E[Rπ m G], we provide bounds for E[RG i ] and E[RB i,j] in the following lemma. Lemma A.6. Under E1 and policy π, for any i [m G], j [m B i ], we have E[RG i ] = O δ E[RB i,j] = O 1 + δ ϱ1/3 + δ2 Proof. First we provide a bound for RG i using an upper bound of n G i . Recall that a(i) is the selected arm for the i-th good arm. We have µo t(a(i)) ϱnt(a(i)) = ˆµt(a(i)) + ϱ Pt 1 s=1 ns(a(i))1(as = a(i)) nt(a(i)) ϱnt(a(i)) ˆµt(a(i)) + ϱnt(a(i)) + 1 2 ϱnt(a(i)) = ˆµt(a(i)) ϱnt(a(i)) 1 4n G i and n B i,j are not defined for i [0] or j [0]. Rotting Infinitely Many-armed Bandits Then since, under E1, ˆµt(a(i)) (ϱ/2)(nt(a(i)) 1) µt(a(i)) + p 8 log(T)/nt(a(i)) (ϱ/2)(nt(a(i)) 1) 8 log(T)/nt(a(i)) (ϱ/2)(nt(a(i)) 1), for i [ m G], from the policy π, we need to get n such that 2(n 1) + 2 p 8 log(T)/n < 1 δ, (23) in which n + 1 is an upper bound for n G i . Let n1 = 2(δ + ϱ1/3)/ϱ + 1 and n2 = C log(T)/ϱ2/3 with some large enough constant C > 0. Then n = n1 + n2 satisfies (23) because 1 ϱn1 + 2 p 8 log(T)/n2 < 1 δ. Therefore, for all i [ m G] we have n G i = O((δ + ϱ1/3)/ϱ). Then with the fact that n G i = δ/(2ϱ) for i [m G]/[ m G] if they exist, for any i [m G] we have n G i = O((δ + ϱ1/3)/ϱ). Then for any i [m G] we have E[RG i ] E (a(i))n G i + n G i (n G i 1) 2 ϱ ϱ x + δ + ϱ1/3 where the first equality is obtained from the fact that (a(i)) are i.i.d. random variables with uniform distribution on [0, δ/2] and n G i = O((δ + ϱ1/3)/ϱ). Now we provide an upper bound of n B i,j to get a bound of RB i,j for i [m G], j [m B i ]. Let a(i, j) be a selected arm for j-th bad arm in the i-th episode. When δ/2 < (a(i, j)) δ + ϱ1/3, as in the case of the good arm, n = 2(δ + ϱ1/3)/ϱ + 1 + C1 log(T)/ϱ2/3 for some large enough constant C1 > 0, satisfies (23) so that n B i,j = O((δ + ϱ1/3)/ϱ) for all i [ m], j [ m B i ]. Since under E1 ˆµt(a(i, j)) (ϱ/2)(nt(a(i, j)) 1) µt(a(i, j)) + p 8 log(T)/nt(a(i, j)) (ϱ/2)(nt(a(i, j)) 1) µ1(a(i, j)) + p 8 log(T)/nt(a(i, j)) (ϱ/2)(nt(a(i, j)) 1), when δ + ϱ1/3 < (a(i, j)) 1, from the policy π under E1, we need to get n 1 such that µ1(a(i, j)) ϱ 2(n 1) + 2 p 8 log(T)/n < 1 δ, (24) in which n + 1 is an upper bound of n B i,j for i [ m], j [ m B i ]. From a sufficient condition for (24) to hold such that µ1(a(i, j)) + 2 p 8 log(T)/n < 1 δ, we can find that n = C2 log(T)/( (a(i, j)) δ)2 for some large constant C2 > 0 satisfies (24). Therefore, when δ + ϱ1/3 < (a(i, j)) 1, for all i [ m], j [ m B i ] we have n B i,j = O(1/( (a(i, j)) δ)2). 
Then, with the fact that $n^B_{i,j} = 0$ for $i\in[m^G]\setminus[\tilde m^G]$, $j\in[m^B_i]\setminus[\tilde m^B_i]$ if they exist, for any $i\in[m^G]$ and $j\in[m^B_i]$ we have
$$n^B_{i,j} = \begin{cases} O\big((\delta+\varrho^{1/3})/\varrho\big) & \text{if } \delta/2 < \Delta(a(i,j)) \le \delta + \varrho^{1/3}\\ O\big(1/(\Delta(a(i,j)) - \delta)^2\big) & \text{if } \delta + \varrho^{1/3} < \Delta(a(i,j)) \le 1.\end{cases}$$
Then for any $i\in[m^G]$, $j\in[m^B_i]$, we have
$$E[R^B_{i,j}] \le E\left[\Delta(a(i,j))\,n^B_{i,j} + \frac{n^B_{i,j}(n^B_{i,j}-1)}{2}\,\varrho\right] = O\left(\int_{\delta/2}^{\delta+\varrho^{1/3}}\frac{\delta+\varrho^{1/3}}{\varrho}\left(x + \frac{\delta+\varrho^{1/3}}{2}\right)dx + \int_{\delta+\varrho^{1/3}}^{1}\left(\frac{x}{(x-\delta)^2} + \frac{\varrho}{(x-\delta)^4}\right)dx\right) = O\left(1 + \frac{\delta}{\varrho^{1/3}} + \frac{\delta^2}{\varrho^{2/3}}\right).$$

Recall that $R^\pi_{m^G} = \sum_{i=1}^{m^G}\big(R^G_i + \sum_{j\in[m^B_i]} R^B_{i,j}\big)$. With $\delta = \varrho^{1/3}$ and $m^G = \lceil 2T\varrho^{2/3}\rceil$, from Lemmas A.5 and A.6, and the fact that the $m^B_i$'s are i.i.d. random variables with geometric distribution with $E[m^B_i] = 2/\delta - 1$, we have
$$E[R^\pi(T)] = O\big(E[R^\pi_{m^G}]\big) = O\left(\sum_{i=1}^{m^G}\Big(E[R^G_i] + E[m^B_i]\,E[R^B_{i,j}]\Big)\right) = O\left(T\varrho^{2/3}\left(\frac{\delta}{\varrho^{2/3}} + \frac{1}{\delta}\right)\right) = O\big(T\varrho^{1/3}\big). \qquad (25)$$

Now we consider the case when $\varrho = O(1/T^{3/2})$. We have $\delta = \Theta(1/\sqrt{T})$. With a slight abuse of notation, we use $\pi$ for a modified strategy after $T$. For obtaining $R^\pi_{m^G}$, we define the policy $\pi$ after time $T$ such that it pulls a good arm $T$ times and a bad arm once. For the last arm $a'$ over the horizon $T$, it pulls the arm up to $T$ times if $a'$ is a good arm and $\tilde n^G_{\tilde m^G} < T$. With a slight abuse of notation, for $i\in[m^G]$, $j\in[m^B_i]$ let $n^G_i$ and $n^B_{i,j}$ be the numbers of pulls of the good arm in the $i$-th episode and of the $j$-th bad arm in the $i$-th episode by this policy, respectively. We define the $n^G_i$'s and $n^B_{i,j}$'s as follows. If $a'$ is a good arm and $\tilde n^G_{\tilde m^G} < T$, then
$$n^G_i = \begin{cases}\tilde n^G_i & \text{for } i\in[\tilde m^G - 1]\\ T & \text{for } i\in[m^G]\setminus[\tilde m^G - 1],\end{cases}\qquad n^B_{i,j} = \begin{cases}\tilde n^B_{i,j} & \text{for } i\in[\tilde m^G],\ j\in[\tilde m^B_i]\\ 0 & \text{for } i\in[m^G]\setminus[\tilde m^G],\ j\in[m^B_i]\setminus[\tilde m^B_i];\end{cases}$$
otherwise,
$$n^G_i = \begin{cases}\tilde n^G_i & \text{for } i\in[\tilde m^G]\\ T & \text{for } i\in[m^G]\setminus[\tilde m^G],\end{cases}\qquad n^B_{i,j} = \begin{cases}\tilde n^B_{i,j} & \text{for } i\in[\tilde m^G],\ j\in[\tilde m^B_i]\\ 0 & \text{for } i\in[m^G]\setminus[\tilde m^G - 1],\ j\in[m^B_i]\setminus[\tilde m^B_i].\end{cases}$$
From Lemma A.4, under $E_1$ we can find that $n^G_i \ge \min\{\lceil\delta/(2\varrho)\rceil, T\}$ for $i\in[m^G]$. Then, if $m^G = C_3$ with some large enough constant $C_3 > 0$, with $\delta = \Theta(1/\sqrt{T})$ and $\varrho = O(1/T^{3/2})$ we have
$$\sum_{i\in[m^G]} n^G_i \ \ge\ C_3\,\min\{\delta/(2\varrho),\, T\} \ >\ T,$$
which implies $R^\pi(T) \le R^\pi_{m^G}$. Therefore, we set $m^G = C_3$. To obtain a bound for $E[R^\pi_{m^G}]$, we provide bounds for $E[R^G_i]$ and $E[R^B_{i,j}]$ in the following lemma.

Lemma A.7. Under $E_1$ and policy $\pi$, for any $i\in[m^G]$ and $j\in[m^B_i]$, we have
$$E[R^G_i] = O\big(\delta T + T^2\varrho\big), \qquad E[R^B_{i,j}] = O\big(T\delta^2 + T^2\varrho\,\delta + 1\big).$$

Proof. First we provide a bound for $R^G_i$ using an upper bound on $n^G_i$. With the definition $n^G_i = T$ for $i\in[m^G]\setminus[\tilde m^G]$, for any $i\in[m^G]$ we have $n^G_i \le T$. Recall that $a(i)$ is the selected arm for the $i$-th good arm. Then for any $i\in[m^G]$ we have
$$E[R^G_i] \le E\left[\Delta(a(i))\,n^G_i + \frac{n^G_i(n^G_i-1)}{2}\,\varrho\right] = O\big(\delta T + T^2\varrho\big),$$
where the first equality is obtained from the fact that the $\Delta(a(i))$'s are i.i.d. random variables with uniform distribution on $[0,\delta/2]$ and $n^G_i \le T$.

Now we provide an upper bound on $n^B_{i,j}$ to get a bound on $R^B_{i,j}$ for $i\in[m^G]$, $j\in[m^B_i]$. Let $a(i,j)$ be the selected arm for the $j$-th bad arm in the $i$-th episode. When $\delta/2 < \Delta(a(i,j)) \le \delta + 1/\sqrt{T}$, as in the case of the good arm, $n^B_{i,j} \le T$ for all $i\in[\tilde m]$, $j\in[\tilde m^B_i]$. When $\delta + 1/\sqrt{T} < \Delta(a(i,j)) \le 1$, since under $E_1$
$$\mu^o_t(a(i,j)) - \varrho n_t(a(i,j)) = \hat\mu_t(a(i,j)) + \varrho\,\frac{\sum_{s=1}^{t-1} n_s(a(i,j))\mathbb{1}(a_s = a(i,j))}{n_t(a(i,j))} - \varrho n_t(a(i,j)) \le \hat\mu_t(a(i,j)) + \varrho\,\frac{n_t(a(i,j))+1}{2} - \varrho n_t(a(i,j)) = \hat\mu_t(a(i,j)) - \varrho\,\frac{n_t(a(i,j))-1}{2}$$
$$\le \bar\mu_t(a(i,j)) + \sqrt{8\log(T)/n_t(a(i,j))} - \frac{\varrho}{2}\big(n_t(a(i,j))-1\big) \le \mu_1(a(i,j)) + \sqrt{8\log(T)/n_t(a(i,j))} - \frac{\varrho}{2}\big(n_t(a(i,j))-1\big),$$
from the policy $\pi$ we need to find an $n$ such that
$$\mu_1(a(i,j)) - \frac{\varrho}{2}(n-1) + 2\sqrt{8\log(T)/n} < 1 - \delta, \qquad (26)$$
in which $n+1$ is an upper bound of $\tilde n^B_{i,j}$ for $i\in[\tilde m]$, $j\in[\tilde m^B_i]$.
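The sufficient condition used for (24) and (26) gives an explicit pull budget for a bad arm: rearranging $\mu_1(a) + 2\sqrt{8\log(T)/n} < 1-\delta$ with $\Delta(a) = 1 - \mu_1(a)$ yields $n > 32\log(T)/(\Delta(a)-\delta)^2$, so $32$ is one admissible value of the constant $C_2$ (resp. $C_4$). A small sketch of this computation (my own, for illustration only):

```python
import math

def bad_arm_pull_budget(T, delta, gap):
    """Smallest integer n with  (1 - gap) + 2*sqrt(8*log(T)/n) < 1 - delta,
    i.e. the sufficient condition used for (24)/(26); requires gap > delta.
    Rearranging gives n > 32*log(T) / (gap - delta)^2."""
    assert gap > delta, "the bound only applies to arms with Delta(a) > delta"
    return math.floor(32.0 * math.log(T) / (gap - delta) ** 2) + 1

T = 10 ** 5
delta = 1.0 / math.sqrt(T)          # the regime rho = O(1/T^(3/2))
for gap in (0.05, 0.1, 0.3, 0.9):   # gap = Delta(a) = 1 - mu_1(a) of a bad arm
    print(f"gap {gap:.2f}: pull budget n roughly {bad_arm_pull_budget(T, delta, gap)}")
```

As expected, arms whose gap barely exceeds the threshold margin are pulled many times before their index drops below $1-\delta$, while clearly bad arms are discarded after a few pulls.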
From a sufficient condition for (26) to hold, namely $\mu_1(a(i,j)) + 2\sqrt{8\log(T)/n} < 1 - \delta$, we can find that $n = \lceil C_4\log(T)/(\Delta(a(i,j)) - \delta)^2\rceil$ for some large enough constant $C_4 > 0$ satisfies (26). Therefore, when $\delta + 1/\sqrt{T} < \Delta(a(i,j)) \le 1$, for all $i\in[\tilde m]$, $j\in[\tilde m^B_i]$ we have $\tilde n^B_{i,j} = O(1/(\Delta(a(i,j)) - \delta)^2)$. Then, with the fact that $n^B_{i,j} = 0$ for $i\in[m^G]\setminus[\tilde m^G]$, $j\in[m^B_i]\setminus[\tilde m^B_i]$, for any $i\in[m^G]$, $j\in[m^B_i]$: if $\delta/2 < \Delta(a(i,j)) \le \delta + 1/\sqrt{T}$ we have $n^B_{i,j} \le T$, and if $\delta + 1/\sqrt{T} < \Delta(a(i,j)) \le 1$ we have $n^B_{i,j} = O(1/(\Delta(a(i,j)) - \delta)^2)$. Then for any $i\in[m^G]$, $j\in[m^B_i]$, we have
$$E[R^B_{i,j}] \le E\left[\Delta(a(i,j))\,n^B_{i,j} + \frac{n^B_{i,j}(n^B_{i,j}-1)}{2}\,\varrho\right] = O\left(\int_{\delta/2}^{\delta+1/\sqrt{T}}\big(Tx + T^2\varrho\big)dx + \int_{\delta+1/\sqrt{T}}^{1}\left(\frac{x}{(x-\delta)^2} + \frac{\varrho}{(x-\delta)^4}\right)dx\right) = O\big(T\delta^2 + T^2\varrho\,\delta + 1\big).$$
Then, with $\delta = \Theta(1/\sqrt{T})$ and $m^G = C_3$, we have
$$E[R^\pi(T)] = O\big(E[R^\pi_{m^G}]\big) = O\left(E\left[\sum_{i\in[m^G]}\Big(R^G_i + \sum_{j\in[m^B_i]} R^B_{i,j}\Big)\right]\right) = O\left(\delta T + T^2\varrho + \frac{1}{\delta}\right) = O\big(\sqrt{T}\big), \qquad (27)$$
where the third equality is obtained from Lemma A.7 and $E[m^B_i] = 2/\delta - 1$. Finally, we can conclude the proof: from (25) and (27), for $\varrho = o(1)$, with $\delta = \max\{\varrho^{1/3}, 1/\sqrt{T}\}$ we have $E[R^\pi(T)] = O(\max\{T\varrho^{1/3}, \sqrt{T}\})$.

A.4. Proof of Theorem 4.2

Let $\pi_i(\beta)$ for $\beta\in\mathcal{B}$ denote the base policy for time steps between $(i-1)H+1$ and $iH\wedge T$ in Algorithm 2, using $\mathrm{UCB}_{i,t}(a,\beta)$ as a UCB index and $1 - \beta^{1/3}$ as a threshold. Denote by $a^{\pi_i(\beta)}_t$ the arm pulled at time step $t$ by policy $\pi_i(\beta)$. Then, for $\beta^*\in\mathcal{B}$, which is set later for a near-optimal base policy, we have
$$E[R^\pi(T)] = E\left[\sum_{i=1}^{\lceil T/H\rceil}\sum_{t=(i-1)H+1}^{iH\wedge T}\big(1 - \mu_t(a^\pi_t)\big)\right] = E[R^\pi_1(T)] + E[R^\pi_2(T)], \qquad (28)$$
where
$$R^\pi_1(T) = \sum_{i=1}^{\lceil T/H\rceil}\sum_{t=(i-1)H+1}^{iH\wedge T}\big(1 - \mu_t(a^{\pi_i(\beta^*)}_t)\big) \quad\text{and}\quad R^\pi_2(T) = \sum_{i=1}^{\lceil T/H\rceil}\sum_{t=(i-1)H+1}^{iH\wedge T}\big(\mu_t(a^{\pi_i(\beta^*)}_t) - \mu_t(a^\pi_t)\big).$$
Note that $R^\pi_1(T)$ accounts for the regret caused by the near-optimal base policies $\pi_i(\beta^*)$ against the optimal mean reward, and $R^\pi_2(T)$ accounts for the regret caused by the master algorithm for selecting a base with $\beta\in\mathcal{B}$ at every block against the base with $\beta^*$. In what follows, we provide upper bounds for each regret component. We first provide an upper bound for $E[R^\pi_1(T)]$ by following the proof steps of Theorem 4.1. Then we provide an upper bound for $E[R^\pi_2(T)]$. We set $H = \lceil\sqrt{T}\rceil$ and $\beta^*$ to be the smallest value in $\mathcal{B}$ which is larger than $\max\{\varrho, 1/H^{3/2}\}$.

Upper bounding $E[R^\pi_1(T)]$. We refer to the period starting from time step $(i-1)H+1$ to time step $iH\wedge T$ as the $i$-th block. For any $i \le \lceil T/H\rceil - 1$, policy $\pi_i(\beta^*)$ runs over $H$ time steps independently of the other blocks, so that each such block has the same expected regret, and the last block has an expected regret smaller than or equal to that of the other blocks. Therefore, we focus on finding a bound on the regret of the first block, equal to $\sum_{t=1}^{H}\big(1 - \mu_t(a^{\pi_1(\beta^*)}_t)\big)$. Denote by $\mathcal{A}(i)$ the set of selected arms in the $i$-th block, which satisfies $|\mathcal{A}(i)| \le H$. For notational simplicity, we use $n_t(a)$ instead of $n_{1,t}(a)$ and $\mu^o_t(a)$ instead of $\mu^o_{1,t}(a)$. Let
$$\hat\mu_t(a) = \frac{\sum_{s=1}^{t-1} r_s \mathbb{1}(a_s = a)}{n_t(a)} \quad\text{and}\quad \bar\mu_t(a) = \frac{\sum_{s=1}^{t-1} \mu_s(a)\mathbb{1}(a_s = a)}{n_t(a)}.$$
Define the event $E_1 = \{|\hat\mu_t(a) - \bar\mu_t(a)| \le \sqrt{8\log(H)/n_t(a)} \text{ for all } t\in[H],\ a\in\mathcal{A}(1)\}$. From Lemma A.2, by following similar steps as in the proof of Lemma 5 in Auer et al. (2019), we have $P(E_1) \ge 1 - 2/H^2$. We assume that $E_1$ holds true in what follows; otherwise, the regret of the first block is negligible since $R^{\pi_1(\beta^*)}(H) = o(H^2)$.

The proof follows similar steps as the proof of Theorem 4.1. From $\pi_1(\beta^*)$ we have $\delta = \beta^{*1/3}$. We define an arm $a$ to be a good arm if $\Delta(a) \le \delta/2$ and, otherwise, a bad arm. In $\mathcal{A}$, let $a_1, a_2, \ldots$ be a sequence of arms, which have i.i.d. initial mean rewards with uniform distribution on $[0, 1]$.
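The decomposition (28) mirrors the two-level structure of Algorithm 2 described here: an EXP3 master chooses a rotting-rate guess $\beta\in\mathcal B$ for each block of length $H$, and the chosen base policy runs inside the block. Below is a schematic sketch of that outer loop only, with the base policy stubbed out; the grid $\mathcal B$, the EXP3 learning rate, and the reward rescaling constant are assumptions for illustration and need not match Algorithm 2's exact choices.

```python
import math
import random

def exp3_master_over_blocks(T, run_base_policy, B=None, seed=0):
    """Schematic outer loop: split the horizon into ceil(T/H) blocks with H = ceil(sqrt(T)),
    keep EXP3 weights over candidate rotting rates beta in B, sample one beta per block,
    run the base policy for that block, and update the chosen beta with the rescaled reward.
    run_base_policy(beta, H) must return the total reward collected in the block."""
    random.seed(seed)
    H = math.ceil(math.sqrt(T))
    n_blocks = math.ceil(T / H)
    if B is None:
        # assumed geometric grid of rate guesses; Algorithm 2's exact grid may differ
        B = [2.0 ** (-k) for k in range(3, 3 + math.ceil(math.log2(T)))]
    K = len(B)
    eta = math.sqrt(2.0 * math.log(K) / (n_blocks * K))   # illustrative learning rate
    scale = 2.0 * H                                        # assumed rescaling of block rewards
    weights = [1.0] * K
    for _ in range(n_blocks):
        total = sum(weights)
        probs = [w / total for w in weights]
        k = random.choices(range(K), weights=probs)[0]     # master samples a base policy
        block_reward = run_base_policy(B[k], H)
        g = min(1.0, max(0.0, 0.5 + block_reward / scale)) # translate/rescale into [0, 1]
        weights[k] *= math.exp(eta * g / probs[k])         # importance-weighted update
        m = max(weights)                                   # normalize to avoid overflow
        weights = [w / m for w in weights]
    return weights, B

def toy_base_policy(beta, H):
    """Purely synthetic stand-in for the base policy: its mean block reward is
    highest around beta = 2^-6, so the master has something to learn."""
    quality = 0.7 - 0.1 * abs(math.log2(beta) + 6)
    return H * quality + random.gauss(0.0, math.sqrt(H))

weights, B = exp3_master_over_blocks(T=10_000, run_base_policy=toy_base_policy)
best = max(range(len(B)), key=lambda i: weights[i])
print("beta with the largest EXP3 weight:", B[best])
```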
Given a policy selecting arms in the order of the sequence, let $m^G$ be the number of selections of distinct good arms and $m^B_i$ be the number of consecutive selections of distinct bad arms between the $(i-1)$-st and the $i$-th selection of a good arm among the $m^G$ good arms. We refer to the period that starts right after the $(i-1)$-st selection of a good arm and ends with the $i$-th selection of a good arm as the $i$-th episode. Observe that $m^B_1, \ldots, m^B_{m^G}$ are i.i.d. random variables with geometric distribution with parameter $\delta/2$, conditional on the value of $m^G$. Therefore, $P(m^B_i = k) = (1-\delta/2)^k\,\delta/2$ for $i = 1, \ldots, m^G$. Define $\tilde m$ to be the number of episodes obtained by following policy $\pi_1(\beta^*)$ over the horizon of $H$ time steps, $\tilde m^G$ to be the total number of selections of a good arm, which satisfies $\tilde m^G = \tilde m$ or $\tilde m^G = \tilde m - 1$, and $\tilde m^B_i$ to be the number of selections of a bad arm in the $i$-th episode by the policy $\pi_1(\beta^*)$ over the horizon $H$. Without loss of generality, we assume that the policy selects arms in the order of the sequence $a_1, a_2, \ldots$.

Under policy $\pi_1(\beta^*)$, let $R^G_i$ be the regret (summation of mean reward gaps) contributed by pulling the good arm in the $i$-th episode and $R^B_{i,j}$ be the regret contributed by pulling the $j$-th bad arm in the $i$-th episode. Then let $R^{\pi_1(\beta^*)}_{m^G} = \sum_{i=1}^{m^G}\big(R^G_i + \sum_{j\in[m^B_i]} R^B_{i,j}\big)$, which is the regret over the period of $m^G$ episodes. (Note that $R^{\pi_1(\beta^*)}_{m^G}$ does not contain the undefined terms $R^B_{i,j}$ when $m^B_i = 0$.) For $i\in[\tilde m^G]$, $j\in[\tilde m^B_i]$, let $\tilde n^G_i$ be the number of pulls of the good arm in the $i$-th episode and $\tilde n^B_{i,j}$ be the number of pulls of the $j$-th bad arm in the $i$-th episode by the policy $\pi_1(\beta^*)$ over the horizon $H$. Let $a'$ be the last selected arm over the time horizon $H$ by $\pi_1(\beta^*)$. We denote by $\hat m^G$ and $\hat m^B_i$ for $i\in[\tilde m]$ the numbers of good and bad arms excluding $a'$ among the $\tilde m^G$ selected good arms and the $\tilde m^B_i$ selected bad arms, defined as follows:
$$\hat m^G = \begin{cases}\tilde m^G - 1 & \text{if } a' \text{ is a good arm}\\ \tilde m^G & \text{otherwise,}\end{cases}\qquad \hat m^B_i = \tilde m^B_i \ \text{for } i\in[\tilde m - 1], \quad\text{and}\quad \hat m^B_{\tilde m} = \begin{cases}\tilde m^B_{\tilde m} & \text{if } a' \text{ is a good arm}\\ \tilde m^B_{\tilde m} - 1 & \text{otherwise.}\end{cases}$$
These notations are defined only if they exist ($\tilde n^G_i$, $\tilde n^B_{i,j}$, and $\hat m^B_i$ are not defined for $i\in[0]$ or $j\in[0]$). Excluding the last arm $a'$, which the policy $\pi_1(\beta^*)$ may stop pulling abruptly upon reaching the horizon $H$, we provide a lower bound on the number of pulls of each good arm, $\tilde n^G_i$ for $i\in[\hat m^G]$, in the following lemma, whenever these quantities exist.

Lemma A.8. Under $E_1$, given $\hat m^G$, for any $i\in[\hat m^G]$ we have $\tilde n^G_i \ge \delta/(2\beta^*)$.

Proof. Define
$$\hat\mu^o_t(a) = \frac{\sum_{s=1}^{t-1}\big(r_s + \varrho_s n_s(a)\big)\mathbb{1}(a_s = a)}{n_t(a)},$$
which satisfies $\mu^o_t(a,\beta^*) \ge \hat\mu^o_t(a)$ since $\beta^* \ge \varrho$. Under $E_1$, it is true that for all $t\in[H]$ and $a\in\mathcal{A}(1)$, $|\hat\mu^o_t(a) - \mu_1(a)| \le \sqrt{8\log(H)/n_t(a)}$. Let $a(i)$ be the selected good arm in the $i$-th episode. Suppose that $n_t(a(i)) = \tilde n^G_i = \lceil\delta/(2\beta^*)\rceil$ for some $t > 0$; then we have
$$\mu^o_t(a(i),\beta^*) - \beta^*\tilde n^G_i + \sqrt{8\log(H)/\tilde n^G_i} \ \ge\ \hat\mu^o_t(a(i)) - \beta^*\tilde n^G_i + \sqrt{8\log(H)/\tilde n^G_i} \ \ge\ \mu_1(a(i)) - \delta/2 \ \ge\ 1 - \delta,$$
where the second inequality is obtained from $E_1$ and $\tilde n^G_i \le \lceil\delta/(2\beta^*)\rceil$, and the third inequality is from $\mu_1(a(i)) \ge 1 - \delta/2$. Therefore, policy $\pi_1(\beta^*)$ must pull arm $a(i)$ more than $\lceil\delta/(2\beta^*)\rceil$ times, which implies $\tilde n^G_i \ge \delta/(2\beta^*)$.

We first consider the case when $\varrho = \omega(1/H^{3/2})$. Then $\beta^*$ is the smallest value in $\mathcal{B}$ which exceeds $\varrho$, so that $\varrho \le \beta^* \le 2\varrho$. For obtaining $R^{\pi_1(\beta^*)}_{m^G}$, we define how the policy $\pi_1(\beta^*)$ works after time $H$ such that it pulls a good arm $\lceil\delta/(2\beta^*)\rceil$ times and a bad arm $0$ times. We note that defining how $\pi_1(\beta^*)$ works after $H$ is only for the proof to obtain a regret bound over the time horizon $H$. For the last arm $a'$ over the horizon $H$, it pulls the arm up to $\lceil\delta/(2\beta^*)\rceil$ times if $a'$ is a good arm and $\tilde n^G_{\tilde m^G} < \lceil\delta/(2\beta^*)\rceil$.
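Since a freshly sampled arm is good with probability $\delta/2$ (its gap $\Delta(a)$ is uniform on $[0,1]$), the number of bad arms drawn before the next good arm is geometric with parameter $\delta/2$, so $E[m^B_i] = 2/\delta - 1$, as used repeatedly below. A minimal Monte Carlo check of this identity (illustrative only, not the authors' code):

```python
import random

def mean_bad_arms_per_episode(delta, n_episodes=200_000, seed=1):
    """Count consecutive bad draws (gap > delta/2) before the first good draw (gap <= delta/2),
    with gaps i.i.d. uniform on [0, 1]; the count is Geometric(delta/2) with mean 2/delta - 1."""
    random.seed(seed)
    total_bad = 0
    for _ in range(n_episodes):
        while random.random() > delta / 2.0:   # keep drawing gaps until a good arm appears
            total_bad += 1
    return total_bad / n_episodes

delta = 0.1
print("empirical E[m_i^B]:", round(mean_bad_arms_per_episode(delta), 2),
      "| theoretical 2/delta - 1 =", 2.0 / delta - 1.0)
```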
For $i\in[m^G]$ and $j\in[m^B_i]$, let $n^G_i$ and $n^B_{i,j}$ be the numbers of pulls of the good arm in the $i$-th episode and of the $j$-th bad arm in the $i$-th episode by this policy, respectively. We define the $n^G_i$'s and $n^B_{i,j}$'s as follows. If $a'$ is a good arm and $\tilde n^G_{\tilde m^G} < \lceil\delta/(2\beta^*)\rceil$, then
$$n^G_i = \begin{cases}\tilde n^G_i & \text{for } i\in[\tilde m^G - 1]\\ \lceil\delta/(2\beta^*)\rceil & \text{for } i\in[m^G]\setminus[\tilde m^G - 1],\end{cases}\qquad n^B_{i,j} = \begin{cases}\tilde n^B_{i,j} & \text{for } i\in[\tilde m^G],\ j\in[\tilde m^B_i]\\ 0 & \text{for } i\in[m^G]\setminus[\tilde m^G],\ j\in[m^B_i]\setminus[\tilde m^B_i];\end{cases}$$
otherwise,
$$n^G_i = \begin{cases}\tilde n^G_i & \text{for } i\in[\tilde m^G]\\ \lceil\delta/(2\beta^*)\rceil & \text{for } i\in[m^G]\setminus[\tilde m^G],\end{cases}\qquad n^B_{i,j} = \begin{cases}\tilde n^B_{i,j} & \text{for } i\in[\tilde m^G],\ j\in[\tilde m^B_i]\\ 0 & \text{for } i\in[m^G]\setminus[\tilde m^G - 1],\ j\in[m^B_i]\setminus[\tilde m^B_i].\end{cases}$$
We note that the $n^G_i$'s and $n^B_{i,j}$'s are defined only if they exist (they are not defined for $i\in[0]$ or $j\in[0]$). We provide an $m^G$ such that $R^{\pi_1(\beta^*)}(H) \le R^{\pi_1(\beta^*)}_{m^G}$ in the following lemma.

Lemma A.9. Under $E_1$, when $m^G = \lceil 2H\beta^{*2/3}\rceil$, we have $R^{\pi_1(\beta^*)}(H) \le R^{\pi_1(\beta^*)}_{m^G}$.

Proof. From Lemma A.8, with $\delta = \beta^{*1/3}$ we have
$$\sum_{i\in[m^G]}\Big(n^G_i + \sum_{j\in[m^B_i]} n^B_{i,j}\Big) \ \ge\ \sum_{i\in[m^G]} n^G_i \ \ge\ m^G\,\frac{\delta}{2\beta^*} \ =\ \lceil 2H\beta^{*2/3}\rceil\,\frac{\beta^{*1/3}}{2\beta^*} \ \ge\ H,$$
which implies that $R^{\pi_1(\beta^*)}(H) \le R^{\pi_1(\beta^*)}_{m^G}$.

From the result of Lemma A.9, we set $m^G = \lceil 2H\beta^{*2/3}\rceil$. To obtain a bound for $E[R^{\pi_1(\beta^*)}_{m^G}]$, we provide bounds for $E[R^G_i]$ and $E[R^B_{i,j}]$ in the following lemma.

Lemma A.10. Under $E_1$ and policy $\pi_1(\beta^*)$, for any $i\in[m^G]$ and $j\in[m^B_i]$, we have
$$E[R^G_i] = O\left(\frac{(\delta+\varrho^{1/3})^2}{\varrho}\right), \qquad E[R^B_{i,j}] = O\left(1 + \frac{\delta}{\varrho^{1/3}} + \frac{\delta^2}{\varrho^{2/3}}\right).$$

Proof. First we provide a bound for $R^G_i$ using an upper bound on $n^G_i$. Recall that $a(i)$ is the selected arm for the $i$-th good arm. We have
$$\mu^o_t(a(i),\beta^*) - \beta^* n_t(a(i)) = \hat\mu_t(a(i)) + \beta^*\,\frac{\sum_{s=1}^{t-1} n_s(a(i))\mathbb{1}(a_s = a(i))}{n_t(a(i))} - \beta^* n_t(a(i)) \le \hat\mu_t(a(i)) + \beta^*\,\frac{n_t(a(i))+1}{2} - \beta^* n_t(a(i)) = \hat\mu_t(a(i)) - \beta^*\,\frac{n_t(a(i))-1}{2}.$$
Since, under $E_1$,
$$\hat\mu_t(a(i)) - \frac{\beta^*}{2}\big(n_t(a(i))-1\big) \le \bar\mu_t(a(i)) + \sqrt{8\log(H)/n_t(a(i))} - \frac{\beta^*}{2}\big(n_t(a(i))-1\big) \le 1 + \sqrt{8\log(H)/n_t(a(i))} - \frac{\beta^*}{2}\big(n_t(a(i))-1\big),$$
for $i\in[\tilde m^G]$, from the policy $\pi_1(\beta^*)$ we need to find an $n$ such that
$$1 - \frac{\beta^*}{2}(n-1) + 2\sqrt{8\log(H)/n} < 1 - \delta, \qquad (29)$$
in which $n+1$ is an upper bound for $\tilde n^G_i$. Let $n_1 = \lceil 2(\delta + \beta^{*1/3})/\beta^*\rceil + 1$ and $n_2 = \lceil C\log(H)/\beta^{*2/3}\rceil$ with some large enough constant $C > 0$. Then $n = n_1 + n_2$ satisfies (29) because $1 - \frac{\beta^*}{2} n_1 + 2\sqrt{8\log(H)/n_2} < 1 - \delta$. Therefore, for all $i\in[\tilde m^G]$ we have $\tilde n^G_i = O((\delta + \beta^{*1/3})/\beta^*)$. Then, with the fact that $n^G_i = \lceil\delta/(2\beta^*)\rceil$ for $i\in[m^G]\setminus[\tilde m^G]$ if they exist and $\beta^* = \Theta(\varrho)$, for any $i\in[m^G]$ we have $n^G_i = O((\delta + \beta^{*1/3})/\beta^*) = O((\delta + \varrho^{1/3})/\varrho)$. Then for any $i\in[m^G]$ we have
$$E[R^G_i] \le E\left[\Delta(a(i))\,n^G_i + \frac{n^G_i(n^G_i-1)}{2}\,\varrho\right] = O\left(\int_0^{\delta/2}\frac{2}{\delta}\cdot\frac{\delta+\varrho^{1/3}}{\varrho}\left(x + \frac{\delta+\varrho^{1/3}}{2}\right)dx\right) = O\left(\frac{(\delta+\varrho^{1/3})^2}{\varrho}\right),$$
where the first equality is obtained from the fact that the $\Delta(a(i))$'s are i.i.d. random variables with uniform distribution on $[0,\delta/2]$ and $n^G_i = O((\delta+\varrho^{1/3})/\varrho)$.

Now we provide an upper bound on $n^B_{i,j}$ to get a bound on $R^B_{i,j}$ for $i\in[m^G]$, $j\in[m^B_i]$. Let $a(i,j)$ be the selected arm for the $j$-th bad arm in the $i$-th episode. When $\delta/2 < \Delta(a(i,j)) \le \delta + \varrho^{1/3}$, as in the case of the good arm, $n = \lceil 2(\delta + \beta^{*1/3})/\beta^*\rceil + 1 + \lceil C_1\log(T)/\beta^{*2/3}\rceil$ for some large enough constant $C_1 > 0$ satisfies (29), so that $\tilde n^B_{i,j} = O((\delta + \varrho^{1/3})/\varrho)$ for all $i\in[\tilde m]$, $j\in[\tilde m^B_i]$. When $\delta + \varrho^{1/3} < \Delta(a(i,j)) \le 1$, since under $E_1$
$$\mu^o_t(a(i,j),\beta^*) - \beta^* n_t(a(i,j)) = \hat\mu_t(a(i,j)) + \beta^*\,\frac{\sum_{s=1}^{t-1} n_s(a(i,j))\mathbb{1}(a_s = a(i,j))}{n_t(a(i,j))} - \beta^* n_t(a(i,j)) \le \hat\mu_t(a(i,j)) + \beta^*\,\frac{n_t(a(i,j))+1}{2} - \beta^* n_t(a(i,j)) = \hat\mu_t(a(i,j)) - \beta^*\,\frac{n_t(a(i,j))-1}{2}$$
$$\le \bar\mu_t(a(i,j)) + \sqrt{8\log(T)/n_t(a(i,j))} - \frac{\beta^*}{2}\big(n_t(a(i,j))-1\big) \le \mu_1(a(i,j)) + \sqrt{8\log(T)/n_t(a(i,j))} - \frac{\beta^*}{2}\big(n_t(a(i,j))-1\big),$$
from the policy $\pi_1(\beta^*)$ we need to find an $n \ge 1$ such that
$$\mu_1(a(i,j)) - \frac{\beta^*}{2}(n-1) + 2\sqrt{8\log(T)/n} < 1 - \delta, \qquad (30)$$
in which $n+1$ is an upper bound of $\tilde n^B_{i,j}$ for $i\in[\tilde m]$, $j\in[\tilde m^B_i]$.
From a sufficient condition for (30) to hold, namely $\mu_1(a(i,j)) + 2\sqrt{8\log(T)/n} < 1 - \delta$, we can find that $n = \lceil C_2\log(T)/(\Delta(a(i,j)) - \delta)^2\rceil$ for some large enough constant $C_2 > 0$ satisfies (30). Therefore, when $\delta + \varrho^{1/3} < \Delta(a(i,j)) \le 1$, for all $i\in[\tilde m]$, $j\in[\tilde m^B_i]$ we have $\tilde n^B_{i,j} = O(1/(\Delta(a(i,j)) - \delta)^2)$. Then, with the fact that $n^B_{i,j} = 0$ for $i\in[m^G]\setminus[\tilde m^G]$, $j\in[m^B_i]\setminus[\tilde m^B_i]$ if they exist, for any $i\in[m^G]$ and $j\in[m^B_i]$ we have
$$n^B_{i,j} = \begin{cases} O\big((\delta+\varrho^{1/3})/\varrho\big) & \text{if } \delta/2 < \Delta(a(i,j)) \le \delta + \varrho^{1/3}\\ O\big(1/(\Delta(a(i,j)) - \delta)^2\big) & \text{if } \delta + \varrho^{1/3} < \Delta(a(i,j)) \le 1.\end{cases}$$
Then for any $i\in[m^G]$, $j\in[m^B_i]$, we have
$$E[R^B_{i,j}] \le E\left[\Delta(a(i,j))\,n^B_{i,j} + \frac{n^B_{i,j}(n^B_{i,j}-1)}{2}\,\varrho\right] = O\left(\int_{\delta/2}^{\delta+\varrho^{1/3}}\frac{\delta+\varrho^{1/3}}{\varrho}\left(x + \frac{\delta+\varrho^{1/3}}{2}\right)dx + \int_{\delta+\varrho^{1/3}}^{1}\left(\frac{x}{(x-\delta)^2} + \frac{\varrho}{(x-\delta)^4}\right)dx\right) = O\left(1 + \frac{\delta}{\varrho^{1/3}} + \frac{\delta^2}{\varrho^{2/3}}\right).$$

Recall that $R^{\pi_1(\beta^*)}_{m^G} = \sum_{i=1}^{m^G}\big(R^G_i + \sum_{j\in[m^B_i]} R^B_{i,j}\big)$. With $\beta^* = \Theta(\varrho)$, $\delta = \Theta(\varrho^{1/3})$, and $m^G = \lceil 2H\beta^{*2/3}\rceil$, from Lemmas A.9 and A.10, and the fact that the $m^B_i$'s are i.i.d. random variables with geometric distribution with $E[m^B_i] = 2/\delta - 1$, we have
$$E[R^{\pi_1(\beta^*)}(H)] = O\big(E[R^{\pi_1(\beta^*)}_{m^G}]\big) = O\left(\sum_{i=1}^{m^G}\Big(E[R^G_i] + E[m^B_i]\,E[R^B_{i,j}]\Big)\right) = O\left(H\varrho^{2/3}\left(\frac{\delta}{\varrho^{2/3}} + \frac{1}{\delta}\right)\right) = O\big(H\varrho^{1/3}\big). \qquad (31)$$

Now we consider the case when $\varrho = O(1/H^{3/2})$. We have set $\beta^*$ as the smallest element in $\mathcal{B}$ that exceeds $\max\{\varrho, 1/H^{3/2}\}$; hence we have $\beta^* = \Theta(1/H^{3/2})$. For obtaining $R^{\pi_1(\beta^*)}_{m^G}$, we define how the policy $\pi_1(\beta^*)$ works after $H$ time steps such that it pulls a good arm $H$ times and a bad arm once. For the last arm $a'$ over the horizon $H$, it pulls the arm up to $H$ times if $a'$ is a good arm and $\tilde n^G_{\tilde m^G} < H$. With a slight abuse of notation, for $i\in[m^G]$ and $j\in[m^B_i]$, let $n^G_i$ and $n^B_{i,j}$ be the numbers of pulls of the good arm in the $i$-th episode and of the $j$-th bad arm in the $i$-th episode by this policy, respectively. We define the $n^G_i$'s and $n^B_{i,j}$'s as follows. If $a'$ is a good arm and $\tilde n^G_{\tilde m^G} < H$, then
$$n^G_i = \begin{cases}\tilde n^G_i & \text{for } i\in[\tilde m^G - 1]\\ H & \text{for } i\in[m^G]\setminus[\tilde m^G - 1],\end{cases}\qquad n^B_{i,j} = \begin{cases}\tilde n^B_{i,j} & \text{for } i\in[\tilde m^G],\ j\in[\tilde m^B_i]\\ 0 & \text{for } i\in[m^G]\setminus[\tilde m^G],\ j\in[m^B_i]\setminus[\tilde m^B_i];\end{cases}$$
otherwise,
$$n^G_i = \begin{cases}\tilde n^G_i & \text{for } i\in[\tilde m^G]\\ H & \text{for } i\in[m^G]\setminus[\tilde m^G],\end{cases}\qquad n^B_{i,j} = \begin{cases}\tilde n^B_{i,j} & \text{for } i\in[\tilde m^G],\ j\in[\tilde m^B_i]\\ 0 & \text{for } i\in[m^G]\setminus[\tilde m^G - 1],\ j\in[m^B_i]\setminus[\tilde m^B_i].\end{cases}$$
From Lemma A.8, we observe that $n^G_i \ge \min\{\lceil\delta/(2\beta^*)\rceil, H\}$ for $i\in[m^G]$. Then, under $E_1$, if $m^G = C_3$ for some large enough constant $C_3 > 0$, with $\delta = \beta^{*1/3}$ and $\beta^* = \Theta(1/H^{3/2})$ we have
$$\sum_{i\in[m^G]} n^G_i \ \ge\ H,$$
which implies $R^{\pi_1(\beta^*)}(H) \le R^{\pi_1(\beta^*)}_{m^G}$. Therefore, we set $m^G = C_3$. To obtain a bound for $E[R^{\pi_1(\beta^*)}_{m^G}]$, we provide bounds for $E[R^G_i]$ and $E[R^B_{i,j}]$ in the following lemma.

Lemma A.11. Under $E_1$ and policy $\pi_1(\beta^*)$, for any $i\in[m^G]$ and $j\in[m^B_i]$, we have
$$E[R^G_i] = O\big(\delta H + H^2\varrho\big), \qquad E[R^B_{i,j}] = O\big(H\delta^2 + H^2\varrho\,\delta + 1\big).$$

Proof. First we provide a bound for $R^G_i$ using an upper bound on $n^G_i$. With the definition $n^G_i = H$ for $i\in[m^G]\setminus[\tilde m^G]$, for any $i\in[m^G]$ we have $n^G_i \le H$. Recall that $a(i)$ is the selected arm for the $i$-th good arm. Then, for any $i\in[m^G]$, we have
$$E[R^G_i] \le E\left[\Delta(a(i))\,n^G_i + \frac{n^G_i(n^G_i-1)}{2}\,\varrho\right] = O\big(\delta H + H^2\varrho\big),$$
where the first equality is obtained from the fact that the $\Delta(a(i))$'s are i.i.d. random variables with uniform distribution on $[0,\delta/2]$ and $n^G_i \le H$.

Now we provide an upper bound on $n^B_{i,j}$ to get a bound on $R^B_{i,j}$ for $i\in[m^G]$, $j\in[m^B_i]$. Let $a(i,j)$ be the selected arm for the $j$-th bad arm in the $i$-th episode. When $\delta/2 < \Delta(a(i,j)) \le \delta + 1/\sqrt{H}$, as in the case of the good arm, $n^B_{i,j} \le H$ for all $i\in[\tilde m]$, $j\in[\tilde m^B_i]$. When $\delta + 1/\sqrt{H} < \Delta(a(i,j)) \le 1$, from the policy $\pi_1(\beta^*)$ under $E_1$, we need to find an $n \ge 1$ such that
$$\mu_1(a(i,j)) - \frac{\beta^*}{2}(n-1) + 2\sqrt{8\log(T)/n} < 1 - \delta, \qquad (32)$$
in which $n+1$ is an upper bound of $\tilde n^B_{i,j}$ for $i\in[\tilde m]$, $j\in[\tilde m^B_i]$.
From a sufficient condition for (32) to hold, namely $\mu_1(a(i,j)) + 2\sqrt{8\log(H)/n} < 1 - \delta$, we can find that $n = \lceil C_4\log(H)/(\Delta(a(i,j)) - \delta)^2\rceil$ for some large enough constant $C_4 > 0$ satisfies (32). Therefore, when $\delta + 1/\sqrt{H} < \Delta(a(i,j)) \le 1$, for all $i\in[\tilde m]$, $j\in[\tilde m^B_i]$ we have $\tilde n^B_{i,j} = O(1/(\Delta(a(i,j)) - \delta)^2)$. Then, with the fact that $n^B_{i,j} = 0$ for $i\in[m^G]\setminus[\tilde m^G]$, $j\in[m^B_i]\setminus[\tilde m^B_i]$, for any $i\in[m^G]$, $j\in[m^B_i]$: if $\delta/2 < \Delta(a(i,j)) \le \delta + 1/\sqrt{H}$ we have $n^B_{i,j} \le H$, and if $\delta + 1/\sqrt{H} < \Delta(a(i,j)) \le 1$ we have $n^B_{i,j} = O(1/(\Delta(a(i,j)) - \delta)^2)$. For any $i\in[m^G]$, $j\in[m^B_i]$, we obtain
$$E[R^B_{i,j}] \le E\left[\Delta(a(i,j))\,n^B_{i,j} + \frac{n^B_{i,j}(n^B_{i,j}-1)}{2}\,\varrho\right] = O\left(\int_{\delta/2}^{\delta+1/\sqrt{H}}\big(Hx + H^2\varrho\big)dx + \int_{\delta+1/\sqrt{H}}^{1}\left(\frac{x}{(x-\delta)^2} + \frac{\varrho}{(x-\delta)^4}\right)dx\right) = O\big(H\delta^2 + H^2\varrho\,\delta + 1\big).$$
It follows that, with $\delta = \Theta(1/\sqrt{H})$ and $m^G = C_3$, we have
$$E[R^{\pi_1(\beta^*)}(H)] = O\big(E[R^{\pi_1(\beta^*)}_{m^G}]\big) = O\left(E\left[\sum_{i\in[m^G]}\Big(R^G_i + \sum_{j\in[m^B_i]} R^B_{i,j}\Big)\right]\right) = O\left(\delta H + H^2\varrho + \frac{1}{\delta}\right) = O\big(\sqrt{H}\big), \qquad (33)$$
where the third equality is obtained from Lemma A.11 and $E[m^B_i] = 2/\delta - 1$. Finally, from (31) and (33), for $\varrho = o(1)$, we have $E[R^{\pi_1(\beta^*)}(H)] = O(\max\{H\varrho^{1/3}, \sqrt{H}\})$. Therefore, by summing the regrets of the $\lceil T/H\rceil$ blocks, we have shown that
$$E[R^\pi_1(T)] = O\big((T/H)\max\{H\varrho^{1/3}, \sqrt{H}\}\big) = O\big(\max\{T\varrho^{1/3}, T/\sqrt{H}\}\big). \qquad (34)$$

Upper bounding $E[R^\pi_2(T)]$. We observe that EXP3 is run for $\lceil T/H\rceil$ decision rounds and the number of policies (i.e. $\pi_i(\beta)$ for $\beta\in\mathcal{B}$) is $|\mathcal{B}|$. Denote the maximum absolute sum of rewards over any block of length $H$ by a random variable $Q$. We first provide a bound for $Q$ using concentration inequalities. For any block $i$, we have
$$\left|\sum_{t=(i-1)H+1}^{iH\wedge T} r_t\right| = \left|\sum_{t=(i-1)H+1}^{iH\wedge T}\big(\mu_t(a^\pi_t) + \eta_t\big)\right| \le \left|\sum_{t=(i-1)H+1}^{iH\wedge T}\mu_t(a^\pi_t)\right| + \left|\sum_{t=(i-1)H+1}^{iH\wedge T}\eta_t\right|. \qquad (35)$$
Denote by $\mathcal{T}_i$ the set of time steps in the $i$-th block. We define the event $E_2(i) = \{|\hat\mu_t(a) - \bar\mu_t(a)| \le \sqrt{7\log(T)/n_t(a)} \text{ for all } t\in\mathcal{T}_i,\ a\in\mathcal{A}(i)\}$ and $E_2 = \bigcap_{i\in[\lceil T/H\rceil]} E_2(i)$. From Lemma A.2, by following similar steps as in the proof of Lemma 5 in Auer et al. (2019), with $H = \lceil\sqrt{T}\rceil$ we have $P(E_2^c) \le 2/T^2$. Assuming that $E_2$ holds true, we can get a lower bound for $\mu_t(a^\pi_t)$, which may be negative because of rotting, in order to get an upper bound for $|\sum_{t=(i-1)H+1}^{iH\wedge T}\mu_t(a^\pi_t)|$. Let $\beta_{\max}$ denote the maximum value in $\mathcal{B}$. Consider the policy $\pi$ over the horizon $T$ and an arm $a$ with $\mu_1(a) = 0$. Since
$$\mu^o_t(a) - \varrho n_t(a) + \sqrt{\frac{8\log(H)}{n_t(a)}} \le \hat\mu_t(a) - \frac{\varrho}{2}\big(n_t(a)-1\big) + \sqrt{\frac{4\log(T)}{n_t(a)}} \le \bar\mu_t(a) - \frac{\varrho}{2}\big(n_t(a)-1\big) + \sqrt{\frac{4\log(T)}{n_t(a)}} + \sqrt{\frac{7\log(T)}{n_t(a)}}$$
$$\le \mu_1(a) - \frac{\varrho}{2}\big(n_t(a)-1\big) + \sqrt{\frac{4\log(T)}{n_t(a)}} + \sqrt{\frac{7\log(T)}{n_t(a)}} \le \sqrt{\frac{4\log(T)}{n_t(a)}} + \sqrt{\frac{7\log(T)}{n_t(a)}},$$
we need to find a positive integer $n$ such that $\sqrt{4\log(T)/n} + \sqrt{7\log(T)/n} \le 1 - \beta_{\max}^{1/3}$, in which $n$ is an upper bound for the number of pulls of arm $a$. Since $\beta_{\max}^{1/3} = 1/2$, we can observe that the condition holds with $n = \lceil 92\log(T)\rceil$. From this fact, we can find that for any selected arm $a$ under $\pi$, we have $\mu_t(a) \ge -(\lceil 92\log(T)\rceil + 1)\varrho$. Then, under $E_2$, with $\mu_t(a) \le 1$, for any $i\in[\lceil T/H\rceil]$ we have $|\sum_{t=(i-1)H+1}^{iH\wedge T}\mu_t(a^\pi_t)| \le \max\{(\lceil 92\log(T)\rceil + 1)\varrho H,\, H\}$.

Next we provide a bound for $|\sum_{t=(i-1)H+1}^{iH\wedge T}\eta_t|$. We define the event $E_3(i) = \{|\sum_{t=(i-1)H+1}^{iH\wedge T}\eta_t| \le 2\sqrt{H\log(T)}\}$ and $E_3 = \bigcap_{i\in[\lceil T/H\rceil]} E_3(i)$. From Lemma A.2, for any $i\in[\lceil T/H\rceil]$, we have $P(E_3(i)^c) \le 2/T^2$. Then, under $E_2\cap E_3$, with (35), we have
$$Q \le \max\{93H\log(T)\varrho,\, H\} + 2\sqrt{H\log(T)} \le 93H\log(T) + 2\sqrt{H\log(T)},$$
which implies $1/2 + \sum_{t=(i-1)H+1}^{iH\wedge T} r_t\big/\big(186H\log T + 4\sqrt{H\log T}\big) \in [0,1]$. With the rescaling and translation of rewards in Algorithm 2, from Corollary 3.2 in Auer et al. (2002b), we have
$$E[R^\pi_2(T)\mid E_2\cap E_3] = O\left(\big(93H\log T + 2\sqrt{H\log T}\big)\sqrt{(T/H)\,|\mathcal{B}|\log|\mathcal{B}|}\right). \qquad (36)$$
Note that the expected regret from EXP3 is trivially bounded by $o(H^2(T/H)) = o(TH)$ and $|\mathcal{B}| = O(\log(T))$.
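The translation and rescaling applied to block rewards before they are fed to EXP3 can be written as a one-line map; under $E_2\cap E_3$ the bound on $Q$ above guarantees that the result lies in $[0,1]$. A small sketch using the constants appearing in the proof (illustrative, not the paper's code):

```python
import math

def rescale_block_reward(block_reward_sum, H, T):
    """Map a block's total reward into [0, 1] as in the proof:
    x -> 1/2 + x / (186*H*log(T) + 4*sqrt(H*log(T))).
    Under E2 and E3 the denominator dominates |block_reward_sum|, so the image is in [0, 1]."""
    denom = 186.0 * H * math.log(T) + 4.0 * math.sqrt(H * math.log(T))
    return 0.5 + block_reward_sum / denom

T = 10 ** 6
H = math.ceil(math.sqrt(T))
print(rescale_block_reward(+H, H, T))   # a block earning total reward ~H maps just above 1/2
print(rescale_block_reward(-H, H, T))   # a strongly negative block maps just below 1/2
```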
Then, with (36), we have
$$E[R^\pi_2(T)] = E[R^\pi_2(T)\mid E_2\cap E_3]\,P(E_2\cap E_3) + E[R^\pi_2(T)\mid E_2^c\cup E_3^c]\,P(E_2^c\cup E_3^c) = O\left(93\sqrt{HT}\log(T)\sqrt{|\mathcal{B}|\log|\mathcal{B}|}\right) + o(TH)\cdot(4/T^2) = \tilde O\big(\sqrt{HT}\big). \qquad (37)$$
Finally, from (28), (34), and (37), with $H = \lceil\sqrt{T}\rceil$ we have
$$E[R^\pi(T)] = O\big(\max\{T^{3/4},\, \varrho^{1/3}T\}\big),$$
which concludes the proof.
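As a closing illustration (my own summary of the stated bounds, with poly-logarithmic factors dropped), the sketch below evaluates the known-$\varrho$ rate $\max\{\varrho^{1/3}T, \sqrt{T}\}$ of Theorem 4.1 and the adaptive rate $\max\{\varrho^{1/3}T, T^{3/4}\}$ of Theorem 4.2, and flags the regime $\varrho \ge T^{-3/4}$ in which the two coincide.

```python
def regret_rates(T, rho):
    """Order-level regret rates from the two theorems (polylog factors dropped):
    known rho:   max{rho^(1/3) * T, sqrt(T)}
    unknown rho: max{rho^(1/3) * T, T^(3/4)}."""
    known = max(rho ** (1.0 / 3.0) * T, T ** 0.5)
    adaptive = max(rho ** (1.0 / 3.0) * T, T ** 0.75)
    return known, adaptive

T = 10 ** 6
for rho in (1e-9, 1e-6, 1e-3):
    known, adaptive = regret_rates(T, rho)
    regime = "rates match" if rho >= T ** (-0.75) else "adaptivity costs extra"
    print(f"rho={rho:.0e}: known-rho ~{known:.2e}, adaptive ~{adaptive:.2e} ({regime})")
```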