On Multi-Armed Bandit with Impatient Arms

Yuming Shao 1 2    Zhixuan Fang 1 2

1 Institute for Interdisciplinary Information Sciences, Tsinghua University, Beijing, China. 2 Shanghai Qi Zhi Institute, Shanghai, China. Correspondence to: Zhixuan Fang.

Proceedings of the 41st International Conference on Machine Learning, Vienna, Austria. PMLR 235, 2024. Copyright 2024 by the author(s).

Abstract

In this paper, we investigate a Multi-Armed Bandit (MAB) setting where an arm exits the game if the algorithm continuously neglects it. This setup is motivated by real-world scenarios, such as online advertising and crowdsourcing, where arms only gain benefits after being pulled by the algorithm. We identify the intrinsic hardness of this problem and the limitations of existing approaches. We propose the FC-SE algorithm, together with expected regret upper bounds, as our solution to this problem. As an extension, we further allow new arms to enter after the game starts and design the FC-Entry algorithm with performance guarantees for this setup. Finally, we conduct experiments to validate our theoretical results.

1. Introduction

Multi-Armed Bandit (MAB), first introduced in (Robbins, 1952), is a machine learning model that describes decision-making problems in which unknown information must be learned. In this model, each arm represents a potential choice with an unknown expected reward. The algorithm's objective is to achieve the highest possible overall reward. The central challenge in a MAB problem is to strike a balance between exploiting the arms that have yielded high rewards and exploring unknown arms to uncover their potential. MAB algorithms find real-world applications in various fields, such as online advertising (Schwartz et al., 2017; Yang & Lu, 2016; Aramayo et al., 2023; Avadhanula et al., 2021; Han & Gabor, 2020), clinical trials (Aziz et al., 2021; Chakravorty & Mahajan, 2014), recommendation systems (Santana et al., 2020; Xie et al., 2021; Mahadik et al., 2020), crowdsourcing (Rangi & Franceschetti, 2018; Song & Jin, 2021; Qin et al., 2023; Liu & Liu, 2017; Zhang et al., 2015; Tran-Thanh et al., 2014), and resource allocation in computer networks (Zuo & Joe-Wong, 2021; Pase et al., 2022; Feki & Capdevielle, 2011).

In these applications, bandit algorithms attempt to find choices that maximize their own benefits, while arms passively comply with and accept the arrangements made by the algorithm. However, in many situations, bandit arms also have their own interests, preferences, and even the right to make choices. For instance, a series of recent studies (Liu et al., 2020; 2021a; Kong et al., 2022; Basu et al., 2021; Sankararaman et al., 2021; Zhang et al., 2022) investigated multi-agent multi-armed bandits in two-sided matching markets. In these contexts, each arm selects its most preferred agent among those who have chosen it. Beyond this, the arms' preferences and choices can manifest in various ways. One crucial aspect is their decision to participate in the game. If participation promises satisfactory benefits, they may opt to remain involved; conversely, if participation does not yield sufficient benefits, they may be motivated to exit. To illustrate this, we present the following real-world scenarios:

Example 1 (Online Advertising). A website owner allocates advertisement slots to advertisers across various time periods. The owner generally prefers advertisements with high click-through rates to generate higher revenue.
However, for advertisers with low click-through rates, their ads may remain undisplayed for an extended period. This situation may lead to financial losses for advertisers and evoke their dissatisfaction. Consequently, advertisers may find it advantageous to withdraw from the competition for ad slots on this site. Example 2 (Crowdsourcing). The crowdsourcing workers participating in a dataset labeling task receive compensation for each label they provide. A crowdsourcing system assuming control over the task assignments to workers (Zhang et al., 2015) often continuously assigns tasks to workers with proven proficiency, while potentially neglecting others. Given that workers are only compensated upon task completion, certain workers might experience prolonged periods without receiving any tasks, leading to a lack of income from the platform. They may lose patience and turn to alternative avenues for earning income. What these two cases have in common is that if the algorithm frequently neglects an arm, it can harm the arm s interests, leading it to lose patience and ultimately incentivizing it to leave. Patience, to some extent, represents On Multi-Armed Bandit with Impatient Arms the threshold of loss that an arm is willing to tolerate. The conventional MAB model does not account for the arms patience and the possibility that they may leave the game if they run out of patience, making it inadequate for effectively capturing the real-world examples outlined above. To address this limitation, we introduce the patience of arms into the conventional model and study the Multi-Armed Bandit with Impatient Arms setting. In this setup, there are K arms and each arm k is associated with a positive integer mk indicating arm k s patience. If arm k is consistently ignored by the algorithm for mk consecutive times, it will leave the game. Consequently, the algorithm cannot pull it within the remaining time horizon. While existing bandit algorithms such as the Upper Confidence Bound (UCB) (Lai, 1987; Auer et al., 2002), Successive Elimination (SE) (Even-Dar et al., 2006) and Thompson Sampling (TS) (Thompson, 1933) algorithms can operate within the proposed setting, they do not explicitly accommodate the limited patience of the arms. When the optimal arm is impatient, it could deplete its patience and leave the game early. As a result, the algorithm would be compelled to choose the remaining sub-optimal arms, leading to a linear regret. Therefore, in this paper, we ask two research questions: 1) How does the patience of arms impact the performance of existing algorithms? 2) How can we address the challenges posed by the impatience of arms? 1.1. Our Contributions We introduce a new version of MAB problem, where an arm leaves the game if the algorithm continuously neglects it. As an extension, we also consider a more general and realistic setup that allows not only the participating arms to exit but also new arms to enter. We derive a minimax lower bound to highlight the fundamental hardness of the proposed problem especially when the arms exhibit significant impatience. We also comprehensively study the existing algorithms such as UCB and SE in the proposed setup. We identify a broad range of problem instances where they incur (nearly) linear expected regret. When existing algorithms have no guaranteed performance, we propose the Feasible Cycle-based Successive Elimination (FC-SE) algorithm. 
In FC-SE, we repeat a special sequence of arms (referred to as a feasible cycle) to prevent unexpected arm departures. When it is possible to include all arms in the feasible cycle, FC-SE achieves a regret of O n + n T , where n is the feasible cycle length and T is the time horizon. However, when it s only possible to include a subset of arms in the initial feasible cycle, FC-SE randomly drops arms from the current feasible cycle, if necessary, to accommodate remaining arms, achiev- ing a regret of O K 4 3 T 2 3 n 1 3 if the remaining arms have sufficient patience. We design the FC-Entry algorithm for the extension scenario with new entering arms. FC-Entry accounts for unknown entry times for new arms, assuming that entries are sparse. FC-Entry achieves a dynamic regret of O K2T 2 3 n 1 3 if the arms initially available but not in the initial feasible cycle have sufficient patience. We conduct numerical experiments to validate our theoretical results. 1.2. Related Work There are several prior works allowing time-variant arm sets. The literature on sleeping bandits (Kanade et al., 2009; Kanade & Steinke, 2014; Cortes et al., 2019; Saha et al., 2020; Gaillard et al., 2023) assumes that the set At of available arms at any given time t may change. Additionally, Chakrabarti et al. (2008) and Trac a et al. (2020) are related to our work since they assume that if an arm leaves the game, it will not return. While our setting is related to these works, there are significant differences. In these papers, arm availability can be either stochastic (Saha et al., 2020) or adversarial (Gaillard et al., 2023). On the one hand, our setting assumes that the available set of arms is determined by the algorithm s previous actions, making it incompatible with the stochastic arm availability assumption. On the other hand, in previous works assuming adversarial arm availability, the algorithm s performance is typically evaluated by comparing its choice at time t against an arm from the adversarially selected At. In our scenario, instead, the algorithm s choice should always be compared against the offline optimal arm k , even though an algorithm can unfortunately lose it (k / At). One step in our proposed algorithms is to schedule the arms in a way that prevents any of them from running out of patience. Similar scheduling problems are considered in the Age of Information (Ao I) literature (Kaul et al., 2011; 2012; Abbas et al., 2023). In the field of communication, Ao I serves as a metric for measuring information freshness. Although a main line of research has focused on Ao I minimization (Liu et al., 2019; Arafa et al., 2020; Chen et al., 2023), our paper is more related to scheduling under hard constraints. For instance, the peak Ao I deadline resembles the concept of arm patience in this paper. Li et al. (2021); Liu et al. (2021b); Li et al. (2023a) study the scheduling problem under various Ao I constraints. Due to space limits, we defer other details of related work to Appendix A. 2. Preliminaries The problem studied in this paper is built upon the standard stochastic MAB. We consider a time horizon of length On Multi-Armed Bandit with Impatient Arms T and a finite set of K arms. At each time t [T], an algorithm pulls one arm at At, where At is the set of arms available at time t. When arm k is pulled the n-th time, the algorithm observes and collects a reward Xk,n. 
{Xk,n}n 1 are i.i.d random variables and we assume that for any k, n, Xk,n µk is 1-sub-Gaussian (The definition can be found for example in Chapter 5 of Lattimore & Szepesv ari (2020)). Let Xt denote the reward at time t. For any arm k [K], µk is its mean reward. µ = (µ1, ..., µK) is the vector of the mean rewards, which is unknown to the algorithm. The largest mean reward is µ = maxk [K] µk and the optimal arm is k arg maxk [K] µk. For simplicity of presentation, we assume that the optimal arm is unique. Minor modification can be applied to adapt our results to the non-unique optimal arm case. We define the reward gap such that k = µ µk , k = k , where a known constant 1 is an upper bound for all the reward gaps. We also define the relative reward gap k,k = µk µk , k, k s.t. µk µk . Define the empirical mean for arm k given n observations: ˆµk,n = n 1 Pn k=1 Xk,n. Let Tk(t) := Pt s=1 I{as = k} denote the number of times arm k is pulled from time 1 to time t. As we consider the Multi-Armed Bandit with Impatient Arms setup, we introduce a threshold mk N+ for each arm k. We refer to mk as arm k s patience. Similar to µ, m = (m1, ..., m K) is the vector of mk, k [K]. When the algorithm continuously ignores arm k for a duration of mk time steps, arm k loses patience and exits the game. For example, if at0 = k and as = k for all s = t0 + 1, ..., t0 + mk, and if t0 + mk + 1 T, then arm k exits at the end of time t0 + mk. Arm k can no longer be selected starting from time t = t0 + mk + 1 as it has already left the game. We assume that all arms are initially full of patience when they enter the game. In other words, if arm k has not been pulled since time t = 1, we regard the time when it was last pulled as t = 0. We assume that m is given in advance and discuss this in Appendix H. Define a bijective index mapping ind: [K] [K] such that ind(k) is the index of arm k when m is sorted from small to large. Let ind 1(i) denote the arm whose patience is the i-th small. Following Li et al. (2021), we define the load factor as a measure of arm impatience. We adopt cumulative expected regret RT as the performance metric for algorithms, where t=1 µat = E T X t=1 at . (2) It is worth noting that, at any time t, the algorithm s choice at is compared against k , even if k may not be available at time t (i.e., k / At). This definition of regret is reasonable because the best an algorithm can do is to pull k since the beginning and keep it in At throughout the time horizon. This reveals one of the crucial differences between our work and the sleeping bandit literature (Gaillard et al., 2023; Kale et al., 2016), whose regrets typically compare at against some arm in At. 3. Hardness of the Problem and Limitations of the Existing Approaches 3.1. Negative Results in the Low Patience Case Consider the configuration (K, m) = (3, (2, 2, 2)). It can be verified that at least one arm exits at the end of time slot t = 2, regardless of which arms the algorithm pulls at times t = 1 and 2. Intuitively, this serves as an extreme example where some arms have very low level of patience and no algorithm can guarantee a sub-linear regret under such a (K, m) configuration and different µ values. This is because it is impractical to determine the best arm among the three with only two reward observations. In this section, we formally demonstrate how a low level of patience renders learning infeasible. We first define a quantity characterizing the time when the first exit happens given (K, m). Definition 3.1. 
Fix K, m, define the maximum number of reward observations when the first arm departure happens Q(K,m) to be max n L N {at}L t=1, s [L 1], k K : s < mk or s mk, l = s mk + 1, ..., s with al = k o , where {at}L t=1 is a sequence of arm a1, a2, ..., a L. If there exists an infinite-length arm sequence such that no arm leaves, let Q(K,m) = + . For example, Q(3,(2,2,2)) = 2. By Definition 3.1, at least one arm exits the game before or at the end of time t = Q(K,m), regardless of the algorithm s arm choices. We note that Q(K,m) can be uniquely determined by the number of arms K and the vector of patience m, although the computation of Q(K,m) may be of significant time complexity. Intuitively speaking, the less patient the arms are, the more likely they are to exit early. Thus low level of patience results in a small value of Q(K,m). We present a minimax expected regret lower bound depending on Q(K,m), which is especially powerful when Q(K,m) is small. Theorem 3.2. Given a configuration (K, m) such that Q(K,m) T. Suppose the rewards are independent Gaussian random variables with variance 1. Then the minimax expected regret lower bound is given by min A A sup µ Ξ RT (A, (K, m, µ)) 1 Q(K,m) ln K K 1 On Multi-Armed Bandit with Impatient Arms h 1 f 1 K (1 2 ln K K 1) i (T Q(K,m)), where RT (A, (K, m, µ)) is the expected regret when algorithm A is operated on the problem instance (K, m, µ) for a time horizon T. The possible set of algorithms A = {πt}T t=1 , where {πt}T t=1 is a sequence of policies π1, ..., πT that πt maps from (a1, X1, ..., at 1, Xt 1) and m to the probability simplex over [K]. Ξ = {µ | k , k = k }. f K(p) := h(p) + p ln K K 1 and h(p) = p ln p + (1 p) ln(1 p) is the entropy of Bernoulli distribution. f 1 K ( 1 2 ln K K 1) denotes the unique p such that f K(p) = 1 2 ln K K 1. Though it can be difficult to compute Q(K,m) for arbitrary K, m, we find that if the load factor l(K, m) > 1, it is possible to obtain upper bounds of Q(K,m). In fact, there is a rich set of configurations whose load factor exceeds 1. For instance, l(3, (2, 2, 2)) = 3 2 > 1. We formally describe the upper bound in the following lemma and detail its proof in Appendix C. Lemma 3.3. Given a configuration (K, m), if λ N+ such that l(K, m) > 1+ 1 λ, then we have Q(K,m) λK+1. Combining Theorem 3.2 and Lemma 3.3, we directly have the following result. Corollary 3.4. Fix a configuration (K, m) with l(K, m) > 1, we have min A A sup µ Ξ RT (A, (K, m, µ)) = Ω W(K, m)T , where W(K, m) is a quantity purely depending on (K, m). A, Ξ and RT (A, (K, m, µ)) are defined in Theorem 3.2. Corollary 3.4 demonstrates the hardness of algorithm design when l(K, m) > 1. If we regard l(K, m) as a measure of arm impatience, then l(K, m) > 1 is associated with a low level of patience case, when the arms are so likely to leave early that learning their reward distributions is not feasible. 3.2. Analysis of UCB Algorithm In this part, we study the well-known Upper Confidence Bound (UCB) algorithm in our Multi-Armed Bandit with Impatient Arms setup. We adopt the definition in Lattimore & Szepesv ari (2020). Define the upper confidence bound of arm k given n observations as UCBk(n) = ˆµk,n + 2n 1 ln δ 1 if n > 0 and UCBk(0) = + . δ is the confidence parameter, typically set to be δ = 1/T 2. The UCB algorithm operates by pulling at = arg maxk [K] UCBk(Tk(t)). We find that when the arms are sufficiently patient, UCB still has performance guarantees. Such results are presented in Appendix D. 
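For concreteness, the sketch below runs this UCB rule inside the impatient-arms environment of Section 2: an arm that goes unpulled for m_k consecutive rounds leaves for good, while regret is still measured against µ*. This is our own illustrative code, not the authors' implementation; the function and variable names are ours, and rewards are drawn as unit-variance Gaussians as in the analysis of this section.

```python
import numpy as np

def ucb_with_impatient_arms(mu, m, T, seed=0):
    """Run UCB (index hat_mu_{k,n} + sqrt(2 ln(1/delta) / n), delta = 1/T^2)
    in the impatient-arms environment: arm k leaves the game for good after
    m[k] consecutive rounds without being pulled. Illustrative sketch only."""
    rng = np.random.default_rng(seed)
    mu = np.asarray(mu, dtype=float)
    m = np.asarray(m, dtype=int)
    K, delta = len(mu), 1.0 / T ** 2
    pulls = np.zeros(K, dtype=int)    # T_k(t): number of pulls of arm k so far
    emp = np.zeros(K)                 # empirical means
    idle = np.zeros(K, dtype=int)     # consecutive rounds since arm k was last pulled
    alive = np.ones(K, dtype=bool)    # the available arm set A_t
    regret = 0.0
    for _ in range(T):
        bonus = np.sqrt(2.0 * np.log(1.0 / delta) / np.maximum(pulls, 1))
        ucb = np.where(pulls > 0, emp + bonus, np.inf)   # UCB_k(0) = +infinity
        ucb[~alive] = -np.inf                            # departed arms cannot be pulled
        a = int(np.argmax(ucb))
        x = rng.normal(mu[a], 1.0)                       # unit-variance Gaussian reward
        emp[a] = (emp[a] * pulls[a] + x) / (pulls[a] + 1)
        pulls[a] += 1
        regret += mu.max() - mu[a]    # regret is always measured against mu*, available or not
        idle += 1
        idle[a] = 0
        alive[idle >= m] = False      # out of patience: the arm exits and never returns
    return regret

# Example with an impatient optimal arm:
# ucb_with_impatient_arms(mu=[0.4, 0.5, 0.9], m=[50, 50, 2], T=10_000)
```

With an impatient optimal arm, as in the commented call, the optimal arm can run out of patience early in the game; this is the failure mode formalized below.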
However, when the optimal arm is impatient, it is highly possible that it exits early under UCB, leaving the algorithm with only sub-optimal arms to choose. We will show that, if there are impatient arms, we can find many problem instances such that UCB performs badly, thus demonstrate its limitation in the proposed setup. Theorem 3.5. Run UCB with confidence parameter δ = 1/T 2 in the Multi-Armed Bandit with Impatient Arms setup. Suppose the rewards are independent Gaussian random variables with variance 1. If θ (0, 1), mk = O((ln T)θ) and β = k such that mβ T, then for any γ (0, 1), we have RT = Ω min T 1 γ C(T) ln T 1 , where min := mink =k k and C(T) is a polynomial function of ln T. Proof Sketch of Theorem 3.5. We define events Ek = minn [κT 1]UCBk (n) > x1, UCBk (κT ) < x2 and Eβ = x2 < UCBβ(aβ) < x1, minn [bβ] UCBβ(n) > x2 for arm k and β, respectively. x1, x2, κT , aβ, bβ are constants and UCBk(n) denotes the UCB index of arm k given the first n reward observations. We set x1 > x2 and aβ < bβ. Under these two events, we show that arm k is pulled no more than κT = o(ln T) times. Since UCB algorithm always pulls the arm with the highest UCB index, arm β is pulled at most aβ times before the κT -th pull of arm k , given minn [κT 1] UCBk (n) > x1 > UCBβ(aβ). If arm k is pulled the κT -th time, at least the (aβ + 1)-th, ..., (bβ + 1)-th pulls of arm β happen before the (κT + 1)-th pull of arm k . If the number of arm β pulls between the κT -th and (κT + 1)-th pull of arm k exceeds arm k s patience, then arm k leaves the game before it is pulled the (κT + 1)-th time, not to mention the pulls of other arms. By carefully designing the values of the constants, we show that Ek Eβ occurs with a non-negligible possibility using some techniques of Brownian motion. The complete proof of Theorem 3.5 is also in Appendix D. It shows that as the optimal arm is relatively impatient, the expected regret of UCB algorithm is asymptotically (almost) linear. Fix any θ (0, 1), the patience vector m such that there are negative problem instances for UCB algorithm can have a load factor l(K, m) as small as (ln T) θ + (K 1)/T 0 as T + , and as large as we wish. If we use load factor as a measure of arm impatience, we see that the capability of UCB algorithm in the proposed setting is very limited, since we can find problem instances such that UCB yields almost linear regret asymptotically even as the load factor approaches 0. 3.3. Analysis of SE Algorithm In this section, we study the Successive Elimination (SE) algorithm in our Multi-Armed Bandit with Impatient Arms On Multi-Armed Bandit with Impatient Arms setup. We consider a version similar to Algorithm 3 in Even Dar et al. (2006). All arms are active in the beginning. The algorithm pulls the active arms in a Round-Robin manner, in an increasing order of the arm indices. If at time t all active arms are pulled the same number of times, then at the end of time t the algorithm executes arm elimination when there are more than one active arms. For arm k, if there exists some arm k such that ˆµk ,Tk (t) ˆµk,Tk(t) > 2 q 4T 1 k (t) ln T, arm k is regarded as sub-optimal and deactivated. Since the arms are scheduled in a Round-Robin fashion, for any arm k, the total number of pulls of other arms between any two consecutive pulls of arm k is at most K 1. Then if the patience vector m satisfies that mk K, k [K], no active arm will exit early. Only those deactivated arms may leave the game after being eliminated by the algorithm. 
The behavior and analysis of SE remain exactly the same in this case as in standard MAB setting, thus, SE is a competitive option when mk K, k [K], i.e., when arms are patient enough in general. However, Round-Robin could induce linear regret when there exists some k such that mk < K. We present this fact in Proposition 3.6 and its proof in Appendix E. Proposition 3.6. Run SE in the Multi-Armed Bandit with Impatient Arms setup. Then there exists a problem instance (K, m, µ) such that l(K, m) < 1 but RT = T. 4. Our Algorithms From the discussion in Section 3, we know that it is infeasible to design new methods to achieve sub-linear regret when l(K, m) is strictly greater than 1. Meanwhile, existing algorithms (e.g., SE) can guarantee sub-linear regret when mk K, k [K]. In this section, we aim at developing algorithms for the regime where we do not have clear guarantees, i.e., for (K, m) such that k [K], mk < K. We notice that in this case, it may be possible to prevent arms early departure with some carefully designed pulling schedules. We say that a schedule is feasible if no arm leaves early under this schedule. For example, when m = (2, 4, 4), a feasible schedule can be a1, ..., a6... = 1, 2, 1, 3, 1, 2, ... . Formally, define τ i k as the time slot when the i-th sample from arm k is scheduled, Ii k as the time interval (in number of slots) between the i-th and the (i + 1)-th pull of arm k, Ii k = τ i+1 k τ i k. Arm k never leaves the game if Ii k mk, i 1. Furthermore, we say a schedule a1, a2... is cyclic if C N+ such that at = at+C, t 1. If there exists an infinite-length feasible schedule for m, Lemma 4.1 of Li (2023) shows that there must be a feasible cyclic schedule, which consists of repeating finite cycles. We call a finite cycle a1, ..., a C feasible cycle if it forms feasible schedules by repeating itself. If a feasible cycle exists for the considered configuration, it is possible to pull arms according to it to prevent Figure 1. An FC-SE example with N < K. In this example, N = 4, K = 7. The rectangles denote feasible cycle snapshots. The colored nodes denote different arms. early departure and execute arm elimination at the end of each cycle to remove the sub-optimal arms. 4.1. FC-SE Algorithm We introduce the intuition and design details of our Feasible Cycle-based Successive Elimination (FC-SE) algorithm. If a feasible cycle can be found for the considered configuration, we can directly replace the Round-Robin cycle in the original SE algorithm with that feasible cycle. In Section 4.2, we introduce how to construct a feasible cycle. Unfortunately, sometimes a feasible cycle exists only for a subset of the set of all arms. Thus we specify an integer N K such that the subset of arms {k : ind(k) N} can form a feasible cycle. We can first repeat this initial feasible cycle to identify sub-optimal arms and then insert remaining arms into it when some arm is eliminated. However, a problem arises when the cycle is still full as the patience of a remaining arm is about to expire. Since we cannot discard any arm without giving it a chance (it might be the optimal arm), we must remove an arm from the current feasible cycle to make room for an arm that has not been pulled yet. In fact, when a full cycle is repeated many times without any arm being eliminated, it is highly likely that the arms in this cycle have such similar mean rewards that distinguishing a sub-optimal arm becomes difficult. 
Even if the optimal arm in the current feasible cycle is mistakenly dropped, another arm in the cycle can serve as a suitable substitution. To balance the exploration among arms, we select an appropriate constant c and divide the entire time horizon into segments of equal length c. We tighten the patience of a remaining arm k to m k mk, such that m k is an integer multiple of c, and pull arm k at around t = m k to prevent early departure. We drop arms in the feasible cycle if necessary and add remaining arms to the cycle only at these checkpoints (referring to the endpoints of these segments as checkpoints). Figure 1 is a simple example of FC-SE behavior when N < K. We formally present FC-SE in Algorithm 1. Given N, the algorithm constructs a feasible cycle of length n = P k:ind(k) N nk, where nk is the number of times that arm k appears in the feasible cycle. If N < K, we set nk = 1 On Multi-Armed Bandit with Impatient Arms Algorithm 1 FC-SE 1: Input: Number of arms in the initial feasible cycle N, patience vector m, time horizon T, segment length c 2: Construct a feasible cycle a1, ... an for the set of arms {k : ind(k) N} 3: t 1, S {k : ind(k) N}, p 1, a 1, ..., a n a1, ..., an 4: if N < K then 5: for k {k : ind(k ) > N} do 6: m k l ind(k) N 7: end for 8: end if 9: while t T do 10: for i = 1, ..., n do 11: if ai = 0 then 12: Pull at = ai and receive reward Xat,Tat(t) 13: t t + 1 14: end if 15: end for 16: S n k S : j S, ˆµj,Tj(t 1) 2 q ln T 1 Tj(t 1) ˆµk,Tk(t 1) + 2 q ln T 1 Tk(t 1) o 17: ai 0 for each i = 1, ..., n such that ai / S 18: if p K N N 1 and t + P k S nk > cp n then 19: Snew {k [K] : m k = cp} 20: while |S| + |Snew| > N do 21: a Unif(S) 22: S S {a} 23: end while 24: ai 0 for each i = 1, ..., n such that ai / S 25: while |Snew| > 0 do 26: a arg mink Snew ind(k) 27: S S {a}, Snew Snew {a} 28: ai a where i = min i n : a i = min{k [K] | ind(k) N and j n s.t. a j = k : aj = 0} 29: end while 30: p p + 1 31: end if 32: end while for any k such that ind(k) > N. The algorithm maintains a set of active arms S, initialized as the set of arms in the initial feasible cycle. p denotes the index of the next checkpoint. The algorithm pulls the active arms according to the feasible cycle. If some arm is eliminated, all positions associated with it in a1, ... an are cleared (i.e. set to 0). At the end of each feasible cycle, the algorithm checks whether the next checkpoint is met. It is important to note that there can be at most N 1 remaining arms waiting at a checkpoint, because if more were assigned, we would need to drop all the arms in the current cycle, and there would be no guarantee that a good substitution for the optimal arm remains in the feasible cycle. When checking whether the p-th checkpoint is reached, t is the first time slot of a feasible cycle, which ends at t + P k S nk 1. At the first time t + P k S nk > cp n, we have that t cp n, since t grows at most n each time. When the p-th checkpoint is met, the set of arms that need to be added to the feasible cycle is denoted as Snew. Since the actual length of the new feasible cycle is also upper bounded by n, each arm k in Snew is pulled no later than t + n 1 cp 1 < cp = m k mk. Therefore, arm k does not leave before being pulled by the algorithm for the first time. The algorithm randomly drops arms from the current feasible cycle until there is enough space for Snew. First, we analyse the regret performance of FC-SE algorithm in the easier case when we can find a feasible cycle containing all the available arms (i.e. N = K). 
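In both the N = K and N < K cases, the arm elimination performed at the end of each feasible cycle is the confidence-interval test of line 16 in Algorithm 1. The sketch below is our reading of that test, with confidence radius 2*sqrt(ln T / n); the helper name and data layout are ours.

```python
import numpy as np

def fc_se_elimination(active, emp_mean, n_pulls, T):
    """One elimination round of FC-SE: keep arm k in the active set only if no
    active arm j has a lower confidence bound exceeding arm k's upper
    confidence bound, using radius 2 * sqrt(ln T / n)."""
    rad = 2.0 * np.sqrt(np.log(T) / np.maximum(n_pulls, 1))
    return {k for k in active
            if all(emp_mean[j] - rad[j] <= emp_mean[k] + rad[k] for j in active)}
```

Arms removed here have their positions in the cycle cleared (line 17 of Algorithm 1), which is what eventually frees slots for the arms waiting at the next checkpoint.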
Note that in this case, the value of c does not matter. Theorem 4.1. Run FC-SE in Algorithm 1 with N = K. Assume that a feasible cycle can be constructed for the whole arm set [K], with length n = P k 1 nk. nk is the number of times that arm k appears in the feasible cycle. Then the expected regret of FC-SE is upper bounded by knk + 32 ln T where max := maxk =k k. We provide the detailed proof of Theorem 4.1 in Appendix F. Under the assumption of Theorem 4.1, it can be shown that RT = O n + n T ln T . Next, we analyse FC-SE in the case when we specify some N < K such that a feasible cycle exists for the subset of arms {k : ind(k) N}. The proof is also in Appendix F. Theorem 4.2. Run FC-SE in Algorithm 1 with N < K. Assume that a feasible cycle can be constructed for the set of arms {k : ind(k) N}, with length n = P k:ind(k) N nk and nk = 1 for any k such that ind(k) > N. We set c to be min k:ind(k)>N mk ind(k) N n ln T (K 1) If we have j mink:ind(k)>N mk ind(k) N N 1 k > 3n, then the expected regret of FC-SE is upper bounded as RT (K 1) c 1 + l K N + 8T 1 + l K N n ln T c 3n + 2K max, On Multi-Armed Bandit with Impatient Arms Algorithm 2 The Shortest Length AUS Cycle Construction 1: Input: Patience vector m 2: Compute the index mapping ind( ) such that mind 1(1) mind 1(2) ... mind 1(K) 3: Solve the optimization problem and obtain solution r : s.t. ˆrk ˆrk+1 N+, k [K 1] (6) 1 mind 1(k) ˆrk 1, k [K] (7) k=1 ˆrk 1 (8) 4: if r exists then 5: Construct an AUS cycle with r where max := maxk =k k. Specifically, if m sat- isfies that j mink:ind(k)>N mk ind(k) N N 1 k > 3n + l 4T n ln T (K 1) 2 3 m , we have RT = O K 4 3 T 2 3 (n ln T) 1 3 . 4.2. A Form of Feasible Cycles: AUS In Section 4.1, we presented our first algorithm, FC-SE. This algorithm requires a feasible cycle containing at least a subset of all the available arms. Besides, the expected regret increases with the feasible cycle length n in our analysis. In this section, we propose a specific method to construct a feasible cycle with the shortest possible length. We need the following definition from the Age of Information (Ao I) literature. Definition 4.3 (Li et al. (2023a)). A cyclic schedule is an Almost Uniform Schedule (AUS) if for each arm k, there exists an integer xk such that Ii k is either xk or xk+1 for any i 1. Ii k is the time interval (in number of slots) between the i-th and the (i+1)-th pull of arm k in the cyclic schedule. Our process to construct a feasible cycle is listed in Algorithm 2. Given any feasible solution to the optimization problem in Algorithm 2, a routine to construct a feasible AUS cycle is described in Li et al. (2021). Besides, our minimization objective (5) is the length of the constructed AUS cycle. In conclusion, Algorithm 2 tries to find the feasible AUS cycle with the shortest possible length. In Appendix B.1, we present a dynamic programming-based solution to the optimization problem in Algorithm 2. Here we formally describe the output of Algorithm 2 and provide the proof in Appendix B.2. Theorem 4.4. Run Algorithm 2 on a patience vector m. If the optimization problem in Algorithm 2 has a feasible solution, then Algorithm 2 finds a feasible AUS cycle of length at most ||m|| . Remark 4.5. Although it is obvious that when l(K, m) > 1, no solution to the optimization problem exists, Li et al. (2021) has shown that l(K, m) ln 2 can ensure the existence of such a feasible solution. 
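For small instances, both the load factor and the feasibility of a candidate cycle are easy to verify directly. The sketch below is our own helper code: it takes the load factor to be l(K, m) = sum_k 1/m_k, which is consistent with the paper's examples l(3, (2, 2, 2)) = 3/2 and l(4, (20, 20, 3, 2)) = 14/15, and it checks the inter-pull gap condition I^i_k <= m_k from Section 4.

```python
from math import fsum

def load_factor(m):
    """l(K, m) = sum_k 1/m_k (our reading, matching the paper's examples)."""
    return fsum(1.0 / mk for mk in m)

def is_feasible_cycle(cycle, m):
    """Return True if repeating `cycle` forever keeps every arm's inter-pull
    gap (including the wrap-around gap between consecutive copies of the
    cycle) at most m[k].  Arms are 1-indexed, as in the paper."""
    C = len(cycle)
    for k, mk in enumerate(m, start=1):
        pos = [i for i, a in enumerate(cycle) if a == k]
        if not pos:                                   # a never-scheduled arm must leave
            return False
        gaps = [pos[j + 1] - pos[j] for j in range(len(pos) - 1)]
        gaps.append(C - pos[-1] + pos[0])             # gap across the cycle boundary
        if max(gaps) > mk:
            return False
    return True

# Sanity checks against the examples in the paper:
assert abs(load_factor([2, 2, 2]) - 1.5) < 1e-12      # the hard instance of Section 3.1
assert is_feasible_cycle([1, 2, 1, 3], m=[2, 4, 4])   # the feasible cycle of Section 4
assert not is_feasible_cycle([1, 2, 3], m=[2, 4, 4])  # arm 1's wrap-around gap of 3 exceeds m_1 = 2
```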
For some configurations that mink [K] mk < K but l(K, m) ln 2, though it cannot be handled by the SE algorithm, it is possible to schedule the arms such that no arm leaves early. For instance, in many configurations where ||m|| = O(ln T), it is also highly likely that l(K, m) falls below the constant ln 2 as T becomes sufficiently large, making feasible AUS cycles exist. As Theorem 4.4 indicates, Algorithm 2 finds AUS cycles of length n = O(ln T) for such configurations. As a consequence, we can find a wide range of configurations results in AUS cycles of sub-linear length, leading to the sub-linear regret of our novel algorithms. 4.3. Extension: Allowing New Entering Arms In the proposed setup, we only assume that arms can exit. In reality, a more general and realistic scenario involves both incoming and departing arms. In online advertising, as time progresses, more advertisers may attempt to display their advertisements. Similarly, in the crowdsourcing example, new workers might join the platform and search for jobs. In this section, we take a step forward and allow for the entrance of new arms in the proposed setting. There are also a total of K arms, but only the set of arms [K0] is available at the beginning, where K0 < K. The remaining arms K0 + 1, ..., K arrive later within the time horizon T. ρk denotes the time slot at the beginning of which arm k [K] enters the game. We have ρk = 0, k [K0]. Without loss of generality, we assume that arms k = K0 + 1, ..., K are ordered by their entry times: 0 < ρK0+1 ... ρK. For any k > K0, we assume that the entry time ρk and the patience mk is not known in advance. The algorithm does not know mk and when arm k becomes available until the beginning of time ρk. In this section, we define the mapping ind( ) only for the set of initial arms [K0]: mind 1(1) ... mind 1(K0). When designing performance metrics for algorithms in this new setup, we notice that it is possible for the optimal arm k to be among the new entering arms (i.e. k > K0). In this case, from t = 1 to t = ρk 1, no matter which arm we pull, there is a positive gap in reward mean when comparing at against arm k . Even the Oracle that always pulls the best possible arm yields a positive regret that is linear with respect to ρk . The expected regret (2) is no longer suitable. Instead, we introduce the expected On Multi-Armed Bandit with Impatient Arms dynamic regret t=1 µat , (9) where µ t := maxk [K]:ρk t µk is the highest reward mean of the arms that have entered the game before or at time t. To handle newly entering arms with unknown patience, we reserve special slots in the feasible cycle. In the beginning, we use only N 2 of the initially available arms with relatively low level of patience and introduce two additional virtual arms to construct the feasible cycle, where N is also the maximum number of arms in the cycle. Although mk is unknown before arm k s entrance, we assume a known lower bound m for the patience of newly entering arms. The two virtual arms are denoted as + and , and we set their patience as m+ = m = m. The reserved slots for the virtual arms in the feasible cycle are initially empty. New entering arms occupy the slots of either + or after their entrance. The condition m mk ensures that the slots of either virtual arm are sufficient to keep any entering arm. The details of FC-Entry is in Appendix G. Theorem 4.6. Run FC-Entry algorithm in Algorithm 4 with N. 
Assume N satisfies that 3 < N < K0 + 2, the set of arms {k [K0] : ind(k) N 2} and two virtual arms +, with patience m can form a feasible cycle of length n = n+ + n + P k [K0]:ind(k) N 2 nk. n+, n are the numbers of pulls of the virtual arms +, in the constructed feasible cycle, respectively. nk = 1 for k [K0] : ind(k) > N 2. For arm k > K0, nk = n+ if it takes up the slots of virtual arm + in the feasible cycle and otherwise nk = n . We set c to be (j min k [K0]:ind(k)>N 2 mk ind(k) N+2 N 3 k , (10) & 8(K N + 3)T n ln T (3 + K0 N+2 If c > 3n and 3c min K0 k, the sum is from k to s. For s k, BWD(r, s) denotes the ˆrs+1 that achieves DP(r, s), while for s > k, BWD(r, s) denotes the ˆrs 1 that achieves DP(r, s). If k > 1, the algorithm minimizes Pk k =1 ˆrk . Note that when ˆrk is fixed, the minimization of Pk 1 k =1 ˆrk is independent of ˆrk+1, ..., ˆr K. If k < K, the algorithm computes the minimum PK k =k ˆrk for each possible value of ˆr K. Then the algorithm computes the minimal objective if a feasible solution exists given ˆrk = 1 mind 1(k) . At the end of the iteration with k, the algorithm updates the minimal objective seen so far. B.2. Proof of The Correctness of Algorithm 2 Proof of Theorem 4.4. Say r is a solution of the optimization problem in Algorithm 2. By Lemma 2 in Li et al. (2021), the schedule that Algorithm 2 outputs is an AUS cycle of length PK k=1 r k r K . Since r satisfies (7) and (8), the length can be upper bounded as PK k=1 r k r K 1 r K mind 1(K) = maxk [K] mk = ||m|| . Consider a cyclic schedule composed of such AUS cycles. To show that this cyclic schedule is feasible for m, it suffices to show that k [K], i 1 : Ii k mind 1(k). We prove this by contradiction. For any k, suppose i : Ii k > mind 1(k), we have that i 1 : Ii k mind 1(k) by the definition of AUS. We consider the first AUS cycle in the schedule. By Algorithm 1 in Li et al. (2021), in the AUS cycle, nk = r k r K for any k and n = PK k=1 r k r K . The next cycle in the schedule is just a repetition of the current cycle. So i 1 : Ii k mind 1(k) implies i [nk], a[τ i k mod n]+1, ..., a[(τ i k+mind 1(k) 2) mod n]+1 = k. The i-th pull of arm k is followed by at least mind 1(k) 1 pulls of other arms for i < nk, while the nk-th pull of arm k implies there are at least mind 1(k) 1 pulls of other arms in the set of time slots [τ 1 k] {τ nk k + 1, ..., n}. Thus we have mind 1(k)nk < n since i : Ii k > mind 1(k). Let rk denote the average rate of arm k s appearance in the schedule, then rk = nk n < 1 mind 1(k) . Thus the rate rk satisfies 1 mind 1(k) > rk = r k/r K PK k=1 r k/r K r k 1 mind 1(k) . A contradiction occurs. So we have k [K], i 1 : Ii k mind 1(k) and the AUS cycle is feasible given the patience vector m. C. Proofs for The Hardness Result Proof of Lemma 3.3. Let t = λK. Suppose that Q(K,m) > t + 1. We have that from t = 1 to t = t + 1 no arm leaves. For each arm i, say tk i is the time that it is pulled the k-th time. Consider the sequence t = 0, t1 i , ..., t Ti( t) i , t + 1. 
Say j, j are On Multi-Armed Bandit with Impatient Arms Algorithm 3 A Dynamic Programming Solution to the Optimization Problem in Algorithm 2 1: Input: Patience vector m, the index mapping ind( ) 2: r (1, ..., 1), min len + 3: for k = 1, 2, ..., K do 4: ˆr (1, ..., 1), min len + 5: ˆrk 1 mind 1(k) , Dk {ˆrk}, DP(ˆrk, k) 1 mind 1(k) 6: if k > 1 then 7: Dk {lˆrk| 1 mind 1(k ) lˆrk 1, l N+} for each k = 1, ..., k 1 8: for k = k 1, ..., 1 do 9: for r Dk do 10: DP(r, k ) r + minr :r Dk +1, r r N+ DP(r , k + 1) 11: BWD(r, k ) arg minr :r Dk +1, r r N+ DP(r , k + 1) 12: end for 13: end for 14: if minr D1 DP(r, 1) > 1 then 15: Continue 16: end if 17: ˆr1 arg minr D1 DP(r, 1) 18: for k = 2, ..., k 1 do 19: ˆrk BWD(ˆrk 1, k 1) 20: end for 21: end if 22: if k < K then 23: Dk { ˆrk l | 1 mind 1(k ) ˆrk l , l N+} for each k = k + 1, ..., K 24: for k = k + 1, ..., K do 25: for r Dk do 26: DP(r, k ) r + minr :r Dk 1, r r N+ DP(r , k 1) 27: BWD(r, k ) arg minr :r Dk 1, r r N+ DP(r , k 1) 28: end for 29: end for 30: end if 31: if minr DK Pk 1 k =1 ˆrk + DP(r, K) 1 then 32: min len minr DK:Pk 1 k =1 ˆrk+DP(r,K) 1 r 1[Pk 1 k =1 ˆrk + DP(r, K)] 33: end if 34: if min len < min len then 35: min len min len 36: r k ˆrk for each k = 1, ..., k 37: r K arg minr DK:Pk 1 k =1 ˆrk+DP(r,K) 1 r 1[Pk 1 k =1 ˆrk + DP(r, K)] 38: for k = K 1, ..., k + 1 do 39: r k BWD(r k +1, k + 1) 40: end for 41: end if 42: end for 43: Return r On Multi-Armed Bandit with Impatient Arms consecutive items in the sequence, then we have j j mi. This implies that for i [K], By definition t = PK i=1 Ti( t). Summing over i, we have λK = K + PK i=1 Ti( t) t > 1 mi = l(K, m) > 1 + 1 A contradiction occurs. Thus we can conclude that Q(K,m) t + 1 = λK + 1. Proof of Theorem 3.2. Given an unschedulable configuration (K, m), i.e. Q(K,m) T, after any bandit algorithm A A pulls Q(K,m) times, there must be at least 1 arm that has left. We mainly focus on these first Q(K,m) pulls. Define a measurable space (ΩQ(K,m), FQ(K,m)) where ΩQ(K,m) = ([K] R)Q(K,m) R2Q(K,m) and FQ(K,m) is the σ-algebra of the subsets of ΩQ(K,m). Fix a positive constant , we construct µi Ξ, i [K] such that µi = (0, 0, ..., |{z} the ith entry , 0, ..., 0). Say we run any A on some µi. Define Ψ as the set of all measurable mappings ψ that maps from ΩQ(K,m) to [K]. We want to find a lower bound on inf ψ Ψ max j [K] Pr A,(K,m,µj)(ψ = j), where Pr A,(K,m,µi), i [K] are probability measures over (ΩQ(K,m), FQ(K,m)). Our derivation is similar with that of Fano s inequality in Rigollet & H utter (2019). We write Pr A,(K,m,µj) as Pj for notational simplicity. Fix ψ Ψ, define pj = Pj(ψ = j), p = 1 j=1 pj [0, 1], qj = 1 k=1 Pk(ψ = j), j=1 qj = 1 K2 j,k=1 Pk(ψ = j) = 1 K2 j,k=1 1 Pk(ψ = j) Let kl(p, p ) denote KL divergence between Bernoulli distributions with mean p, p , respectively. We have kl(p, p ) = p ln p p + (1 p) ln 1 p Denote the entropy of Bernoulli distribution h(p) = p ln p + (1 p) ln(1 p). We have on the one hand, kl( p, q) =h( p) p ln q (1 p) ln(1 q) h( p) p ln q = h( p) + p ln K K 1 =: f K( p). On Multi-Armed Bandit with Impatient Arms On the other hand, we derive an upper bound for kl( p, q), by the convexity of KL divergence, kl( p, q) 1 j=1 kl(pj, qj) 1 K2 j,k=1 kl(Pj(ψ = j), Pk(ψ = j)). Fix any j, k, we derive upper bound for kl(Pj(ψ = j), Pk(ψ = j)). We observe that probability measures Pj, Pk over (ΩQ(K,m), FQ(K,m)) induce probability measures Ber(Pj({ω ΩQ(K,m) : ψ(ω) = j})), Ber(Pk({ω ΩQ(K,m) : ψ(ω) = j})) over measurable space ({0, 1}, 2{0,1}). 
Then by data processing inequality in Lou & Goldfeld (2020), especially their Example 7.1.2(i), we have KL(Pj, Pk) KL(Ber(Pj(ψ = j)), Ber(Pk(ψ = j))) = kl(Pj(ψ = j), Pk(ψ = j)). Some sequences of a1, a2, ..., a Q(K,m) are impossible due to the leaving behaviour of arms, depending only on (K, m). However, an arm sequence is possible with respect to (K, m, µj) if and only if it is possible with respect to (K, m, µk). As a result, Pj is absolutely continuous with respect to Pk. With this fact in mind, we can use Lemma 15.1 in Lattimore & Szepesv ari (2020) to show that KL(Pj, Pk) = i=1 Ej[Ti(Q(K,m))]KL(N(µj,i, 1), N(µk,i, 1)) 2 (Ej[Tj(Q(K,m))] + Ej[Tk(Q(K,m))]) Merging the derivations above, we obtain that f K( p) 1 K2 2 Q(K,m) = 2 Now we study the function f K(p) to obtain an upper bound for p. f K(p) = p ln K K 1 + p ln p + (1 p) ln(1 p), f K(p) = ln K K 1 + ln p + 1 ln(1 p) 1 = ln K K 1 + ln p 1 p. We see that f K(0) = 0, f K(1) = ln K K 1, f K is monotonically decreasing in [0, K 1 2K 1] and monotonically increasing in 2K 1, 1]. Thus c (0, ln K K 1], there is a unique p ( K 1 2K 1, 1] such that f K(p) = c. Setting = q 1 Q(K,m) ln K K 1, let f 1 K ( 1 2 ln K K 1) denote the unique p such that f K(p) = Q(K,m) 2 1 Q(K,m) ln K K 1 = 1 2 ln K K 1. We obtain that j=1 Pj(ψ = j) = p f 1 K (1 2 ln K K 1). inf ψ Ψ max j [K] Pj(ψ = j) inf ψ Ψ 1 K j=1 Pj(ψ = j) = inf ψ Ψ 1 K j=1 1 Pj(ψ = j) 2 ln K K 1). On Multi-Armed Bandit with Impatient Arms Define ψL as the measurable map that returns the first leaving arm during the first Q(K,m) pulls, breaking tie by returning the first leaving arm with the smallest arm index. We claim that, for any bandit policy A and configuration (K, m), there is j [K] such that Pj(ψL = j) 1 f 1 K ( 1 2 ln K K 1). Suppose not, then there exist A, (K, m) that for all j [K], Pj(ψL = j) < 1 f 1 K ( 1 2 ln K K 1), then setting = q 1 Q(K,m) ln K K 1, 2 ln K K 1) > max j [K] Pj(ψL = j) inf ψ Ψ max j [K] Pj(ψ = j) 2 ln K K 1). A contradiction occurs, so we proved the claim. For any algorithm A, say arm i A satisfies Pi A(ψL = i A) 1 f 1 K ( 1 2 ln K K 1). Now we derive the lower bound for expected regret when the ground truth reward mean vector is µi A RT (A, (K, m, µi A)) = Ei A h T X t=1 at i = Ei A h K X k =i A Tk(T) i 1 Q(K,m) ln K K 1 k =i A Ei A h Tk(T) i 1 Q(K,m) ln K K 1Ei A h T Ti A(T) i 1 Q(K,m) ln K K 1Ei A h T Ti A(T)|ψL = i A i Pi A(ψL = i A) 1 Q(K,m) ln K K 1Ei A h T Ti A(T)|ψL = i A i Pi A(ψL = i A) 1 Q(K,m) ln K K 1 h 1 f 1 K (1 2 ln K K 1) i (T Q(K,m)), where Ej is the expectation when A, (K, m, µj) is given. Finally, we note that the bound above holds for any A, and RT (A, (K, m, µi A)) sup µ Ξ RT (A, (K, m, µ)). So we can make our conclusion, for any unschedulable profile (K, m), the minimax lower bound for the expected regret is min A A sup µ Ξ RT (A, (K, m, µ)) 1 Q(K,m) ln K K 1 h 1 f 1 K (1 2 ln K K 1) i (T Q(K,m)). D. Results for UCB Algorithm In the standard regret analysis of UCB in the stochastic MAB setting (Lattimore & Szepesv ari, 2020), a critical observation is that under a high probability event (usually called a Good event) the number of pulls of sub-optimal arms is bounded. In our setup characterized by arm patience, arm k exits if there have been too many pulls of other arms after its last pull. However, if the optimal arm k s patience mk is even greater than the highest possible total number of sub-optimal pulls under the good event, then it is unlikely that arm k will exit under the good event. 
Thus, if arm k has good patience, the performance of UCB algorithm can be intuitively guaranteed. Inspired by this observation, we present the following expected regret upper bound for UCB algorithm in our setup. On Multi-Armed Bandit with Impatient Arms Theorem D.1. Run UCB algorithm with confidence parameter δ = 1/T 2 in the Multi-Armed Bandit with Impatient Arms setup. If mk > PK k =k 16 ln T 2 k the expected regret of UCB algorithm is upper bounded by RT K max + 3 where max := maxk =k k. Proof of Theorem D.1. Define D as the event that arm k is ignored for at least mk times, formally D = n t {1, 2, ..., T + 1 mk } : at, at+1, ..., at+mk 1 = k o . The expected regret is decomposed with respect to event D, t=1 at I{ D} + E T X t=1 at I{D} max T Pr(D) + RBase T , where RBase T denotes E PT t=1 at I{ D} , whose analysis resembles that of UCB algorithm in the vanilla MAB setting. Following the standard analysis of UCB algorithm, we can define good events when the sample means of each arm lie in their corresponding confidence intervals. Now we define good events. Ek := n µ < min n [T ] ˆµk ,n + Ek := n ˆµk,uk + uk < µ o , k = k . If mk is large enough, formally mk > PK k =k uk, by total probability rule, Pr(D) = Pr D k=1 Ek Pr K \ k=1 Ek + Pr D k=1 Ek Pr K [ k=1 Ek + Pr K [ k=1 Pr Ek . Conditioned on event TK k=1 Ek, if t s.t. at, at+1, ..., at+mk 1 = k , there must be at least one arm denoted k, that it is selected more than uk times even during the time interval from t to t + mk 1. However under event Ek , Ek, and the fact that arm k may deviate in advance, arm k cannot be selected for more than uk times. There is a contradiction, so Pr(D| TK k=1 Ek) = 0. We see that Pr(D) PK k=1 Pr( Ek) as a result. On Multi-Armed Bandit with Impatient Arms On the other hand, RBase T =E T X t=1 at I{ D} t=1 k I{at = k}I{ D} k =k k E Tk(T)I{ D} k =k k E Tk(T)I{Ek , Ek}I{ D} + E Tk(T)I{ Ek Ek}I{ D} k =k k uk + E Tk(T)I{ Ek Ek}I{ D} k =k k uk + TE I{ Ek Ek}I{ D} k =k k uk + T Pr( Ek Ek) . The first inequality holds since under event D, arm k never leaves the game and under event Ek Ek, no matter whether arm k derivates, it can be selected for no more than uk times. Even if it is pulled for uk times, it will never be pulled again since arm k s UCB index is always larger. By Chernoff s inequality, say ck, k = k satisfy k q Pr( Ek ) = Pr [ n [T ] {µ ˆµk ,n + n=1 Pr µ ˆµk ,n + Pr( Ek) Pr(ˆµk,uk µk ck k) exp ukc2 k 2 k 2 Then for the expected regret when mk > PK k =k uk, k =k k uk + T Pr( Ek Ek) + max T k=1 Pr( Ek) k =k k T Tδ + exp ukc2 k 2 k 2 + max TTδ + max T k =k exp ukc2 k 2 k 2 k =k k + max + k =k ( k + max)T exp ukc2 k 2 k 2 We set uk = 2 ln 1/δ (1 ck)2 2 k and obtain k =k k l 2 ln 1/δ (1 ck)2 2 k k =k k + max + k =k ( k + max)T 1 2c2 k (1 ck)2 . On Multi-Armed Bandit with Impatient Arms By setting ck = 1 2, k = k , we obtain k =k k l8 ln 1/δ k =k k + max + k =k ( k + max) In Theorem D.1, we derive a problem-dependent upper bound for a fixed µ. Based on Theorem D.1, we further present a problem-independent expected regret upper bound for a general set of mean reward vectors in Corollary D.2. Corollary D.2. Run UCB algorithm with confidence parameter δ = 1/T 2 in the Multi-Armed Bandit with Impatient Arms setup. For a constant ϵ > 0, consider a set of mean reward vectors Ξϵ = µ ϵ k , k = k }. For any instance (K, m, µ) satisfying: 1. µ Ξϵ, 2. mk > (K 1) 16ϵ 2 ln T , k [K], the expected regret of UCB algorithm is uniformly upper bounded by RT K max + 3 k =k k + 8 p (K 1)T ln T, where max := maxk =k k. 
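As a quick numerical illustration of how demanding these patience conditions are, the snippet below (our own helper code; names are ours) evaluates the threshold of Theorem D.1 and the uniform threshold of Corollary D.2.

```python
import math

def patience_needed_thm_d1(gaps, T):
    """Theorem D.1: the optimal arm needs m_{k*} > sum_{k != k*} 16 ln T / Delta_k^2."""
    return sum(16.0 * math.log(T) / g ** 2 for g in gaps)

def patience_needed_cor_d2(K, eps, T):
    """Corollary D.2: every arm needs m_k > (K - 1) * 16 * ln T / eps^2, which
    dominates the Theorem D.1 sum whenever every gap is at least eps."""
    return (K - 1) * 16.0 * math.log(T) / eps ** 2

# e.g. K = 5 arms, all gaps at least eps = 0.1, horizon T = 10**6:
# patience_needed_cor_d2(5, 0.1, 10**6) is roughly 8.8e4 ignored rounds of patience.
```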
In Corollary D.2, we show that UCB algorithm maintains almost the same performance as in stochastic MAB setting, under the assumption that all arm have relatively high level of patience. Proof of Corollary D.2. mk > (K 1) 16ϵ 2 ln T , k [K] implies that µ Ξϵ, we have mk > (K 1) 16 ln T ϵ2 PK k =k 16 ln T 2 k . We consider two cases for the value of ϵ: 16(K 1)T 1 ln T. In this case, µ satisfies that k p 16(K 1)T 1 ln T, k = k . We directly adopt Theorem D.1 and obtain RT K max + 3 k =k k + 4 p (K 1)T ln T (K 1)T ln T + 3 k =k k + K max. (2) ϵ < := p 16(K 1)T 1 ln T. Following the derivation in the proof of Theorem D.1, the operation on the first term On Multi-Armed Bandit with Impatient Arms of the regret decomposition remains the same, since mk > PK k =k 16 ln T 2 k . So we focus on the second term, k =k k E Tk(T)I{ D} = X k =k : k k E Tk(T)I{ D} + X k =k : k< k E Tk(T)I{ D} k =k : k k E Tk(T)I{Ek , Ek}I{ D} + E Tk(T)I{ Ek Ek}I{ D} k =k : k k uk + E Tk(T)I{ Ek Ek}I{ D} k =k : k k uk + TE I{ Ek Ek}I{ D} k =k : k k uk + T Pr( Ek Ek) . Merging the two terms, we also have k =k : k k uk + T Pr( Ek Ek) + max T k=1 Pr( Ek) k =k : k kuk + k =k k + max + k =k ( k + max)T exp( ukc2 k 2 k 2 ) k =k : k k 2 ln 1/δ (1 ck)2 2 k + 1 + 2 k =k k + K max k =k k + K max T + 16(K 1) ln T k =k k + K max (K 1)T ln T + 3 k =k k + K max, with ck = 1 2 for any k = k . Proof of Theorem 3.5. Fix constants aβ, bβ, κT , g T < T, and x1 > x2 > µ . The precise values of these constants will be specified later. We define some events for arm k , Ek := Ek (x1, x2) = min n [κT 1] UCBk (n) > x1, UCBk (κT ) < x2 , and for arm β, Eβ := Eβ(x1, x2) = x2 < UCBβ(aβ) < x1, min n [bβ] UCBβ(n) > x2 . We will show later that under Ek Eβ with our value assignments, the optimal arm k is only pulled at most κT times before it leaves. The number of sub-optimal pulls is at least T κT in total. Set κT 1 > x2 = µ + On Multi-Armed Bandit with Impatient Arms κT , g T are positive and non-decreasing w.r.t T. Now we compute lower bounds for both Pr(Ek (x1, x2)) and Pr(Eβ(x1, x2)). Pr(Ek (x1, x2)) = Pr min n [κT 1] UCBk (n) > x1, UCBk (κT ) < x2 = Pr UCBk (κT ) < x2 min n [κT 1] UCBk (n) > x1 Pr min n [κT 1] UCBk (n) > x1 j=1 Xk ,j µ < g T min n [κT 1] UCBk (n) > x1 Pr min n [κT 1] UCBk (n) > x1 = Pr Xk ,κT µ + j=1 Xk ,j µ < g T min n [κT 1] UCBk (n) > x1 Pr min n [κT 1] UCBk (n) > x1 = Pr min n [κT 1] UCBk (n) > x1 Z + 0 Pr Xk ,κT µ < s g T S(k ) κT 1 = s, min n [κT 1] UCBk (n) > x1 f S(k ) κT 1 = s min n [κT 1] UCBk (n) > x1 ds, where we denote that S(i) t = Pt j=1(Xi,j µi). Say Φ( ) is the cdf of the standard Gaussian distribution. Since Xk ,κT µ is standard Gaussian and it is independent of any other random variables, Pr Xk ,κT µ < s g T S(k ) κT 1 = s, min n [κT 1] UCBk (n) > x1 = Φ( s g T ) = 1 Φ(s + g T ). Now we consider the conditional probably density f S(k ) κT 1 = s minn [κT 1] UCBk (n) > x1 . By Bayes Theorem, f S(k ) κT 1 = s min n [κT 1] UCBk (n) > x1 = Pr minn [κT 1] UCBk (n) > x1 S(k ) κT 1 = s Pr minn [κT 1] UCBk (n) > x1 f(S(k ) κT 1 = s). Since S(k ) κT 1 N(0, κT 1), we have f(S(k ) κT 1 = s) = 1 2π(κT 1) exp( s2 2(κT 1)). It suffices to lower bounding Pr minn [κT 1] UCBk (n) > x1 S(k ) κT 1 = s . 
Pr min n [κT 1] UCBk (n) > x1 S(k ) κT 1 = s = Pr n = 1, ..., κT 1 : S(k ) n > ( n κT 1 n) p 2 ln 1/δ S(k ) κT 1 = s Pr n = 1, ..., κT 1 : S(k ) n > (1 1 κT 1) 2 ln 1/δ κT 2 (n κT + 1) S(k ) κT 1 = s =1 Pr n = 1, ..., κT 1 : S(k ) n (1 1 κT 1) 2 ln 1/δ κT 2 (n κT + 1) S(k ) κT 1 = s 1 Pr t [0, κT 1] : B(t) (1 1 κT 1) 2 ln 1/δ κT 2 (t κT + 1) B(κT 1) = s , where {B(t), t 0} is standard Brownian motion. The first inequality is by convexity. The last inequality is by the fact that the joint distribution of S(k ) 1 , ..., S(k ) κT 1 is identical to that of B(1), ..., B(κT 1). To proceed, we introduce a boundary-crossing property of Brownian bridge mentioned in Scheike (1992). Lemma D.3 (Proposition 3, Scheike (1992)). If B(t) is standard Brownian motion, and a > 0, T < , and c < a + b T, then Pr [ t [0,T ] (B(t) a + bt) B(T) = c = exp 2a(a + b T c)/T . If a 0, the probability is 1. On Multi-Armed Bandit with Impatient Arms a = (κT 1)(1 1 κT 1) 2 ln 1/δ κT 2 , b = (1 1 κT 1) 2 ln 1/δ κT 2 , c = s and adopting Lemma D.3, we obtain Pr min n [κT 1] UCBk (n) > x1 S(k ) κT 1 = s 1 exp 2(1 1 κT 1) 2 ln 1/δ κT 2 s . Now we have Pr(Ek (x1, x2)) Z + 0 [1 Φ(s + g T )] 1 exp 2(1 1 κT 1) 2 ln 1/δ κT 2 s # f(S(k ) κT 1 = s)ds 1 exp 2(1 1 κT 1) 2 ln 1/δ κT 2 g T # Z + 4 + (s + g T )2 + s + g T 2π exp (s + g T )2 2π(κT 1) exp s2 in the last inequality we use Komatu s lower bound mentioned in Duembgen (2010). To further derive a lower bound for Pr(Ek (x1, x2)), we consider Z + 4 + (s + g T )2 + s + g T 2π exp (s + g T )2 4 + 4s2 + 2s 1 2π exp (s + g T )2 1 + g2 T + g T 2π exp (s + g T )2 1 + g2 T + g T exp r κT κT 1s + κT (g T + 1 1 + g2 T + g T exp s + (g T + 1 g T ) κT 1 1 + g2 T + g T exp r κT κT 1 + (g T + 1 Also by Komatu s lower bound, r κT κT 1 + (g T + 1 r κT κT 1 + (g T + 1 κT κT 1 + (g T + 1 g T ) q κT 2 + g T q κT κT 1 + (g T + 1 g T ) q On Multi-Armed Bandit with Impatient Arms r κT κT 1 + (g T + 1 r κT κT 1 + (g T + 1 1 + g2 T + g T . Then we get Pr(Ek (x1, x2)) C(T) 1 " 1 exp 2(1 1 κT 1) 2 ln 1/δ κT 2 g T # r κT κT 1 + (g T + 1 1 exp 2(1 1 κT 1) 2 ln 1/δ κT 2 g T # exp 1 (2 + 1 2(κT 1))g2 T . Now we turn to lower bounding Pr(Eβ(x1, x2)). Pr(Eβ(x1, x2)) = Pr x2 < UCBβ(aβ) < x1, min n [bβ] UCBβ(n) > x2 = Z aβ(x1 µβ) 2aβ ln 1/δ Pr min n [aβ] UCBβ(n) > x2, min n=aβ+1,...,bβ UCBβ(n) > x2 S(β) aβ = s f(S(β) aβ = s)ds. Since minn [aβ] UCBβ(n) > x2 and minn=aβ+1,...,bβ UCBβ(n) > x2 are independent conditioned on S(β) aβ , we have Pr min n [aβ] UCBβ(n) > x2, min n=aβ+1,...,bβ UCBβ(n) > x2 S(β) aβ = s = Pr min n [aβ] UCBβ(n) > x2 S(β) aβ = s Pr min n=aβ+1,...,bβ UCBβ(n) > x2 S(β) aβ = s . We first derive a lower bound for Pr minn [aβ] UCBβ(n) > x2 S(β) aβ = s as we did for Pr minn [κT 1] UCBk (n) > x1 S(k ) κT 1 = s . Pr min n [aβ] UCBβ(n) > x2 S(β) aβ = s = Pr n = 1, ..., aβ : S(β) n > n(x2 µβ) p 2n ln 1/δ|S(β) aβ = s Pr n = 1, ..., aβ : S(β) n > [(x2 µβ) 1 aβ + 1 2 ln 1/δ]n (1 1 aβ + 1) p 2 ln 1/δ S(β) aβ = s 1 exp 2(1 1 aβ + 1) p 2 ln 1/δ aβ(x2 µβ) + q 2aβ ln 1/δ + s a 1 β , On Multi-Armed Bandit with Impatient Arms where the last inequality is by Lemma D.3. 
Then we consider Pr minn=aβ+1,...,bβ UCBβ(n) > x2 S(β) aβ = s , Pr min n=aβ+1,...,bβ UCBβ(n) > x2 S(β) aβ = s = Pr n = aβ + 1, ..., bβ : S(β) aβ + j=1+aβ (Xβ,j µβ) > n(x2 µβ) p 2n ln 1/δ S(β) aβ = s = Pr n = aβ + 1, ..., bβ : s + j=1+aβ (Xβ,j µβ) > n(x2 µβ) p = Pr n = 1, ..., bβ aβ : s + S(β) n > (n + aβ)(x2 µβ) q 2(n + aβ) ln 1/δ Pr n = 1, ..., bβ aβ : S(β) n > (bβ aβ)(x2 µβ) p 2 ln 1/δ( p bβ aβ) bβ aβ n + aβ(x2 µβ) q 2aβ ln 1/δ s = Pr n = 1, ..., bβ aβ : S(β) n > x2 µβ n + aβ(x2 µβ) q 2aβ ln 1/δ s = h 1 Pr n = 1, ..., bβ aβ : S(β) n x2 µβ n + aβ(x2 µβ) q 2aβ ln 1/δ s i , where the inequality is due to convexity. We also note that, the joint distribution of S(β) 1 , ..., S(β) bβ aβ is identical to that of B(1), ..., B(bβ aβ). Thus we can further lower bound Pr(Eβ(x1, x2)) as Pr(Eβ(x1, x2)) Z aβ(x1 µβ) 2aβ ln 1/δ Pr min n [aβ] UCBβ(n) > x2 S(β) aβ = s h 1 Pr t [0, bβ aβ] : t + aβ(x2 µβ) q 2aβ ln 1/δ s i f(S(β) aβ = s)ds h 1 exp (1 1 aβ + 1) p 2 ln 1/δ g T i Z aβ(x1 µβ) Pr t [0, bβ aβ] : B(t) x2 µβ t + aβ(x2 µβ) q 2aβ ln 1/δ s i f(S(β) aβ = s)ds =: h 1 exp (1 1 aβ + 1) p 2 ln 1/δ g T Now it suffices to lower bound Pβ. Again, here we need a proposition in Scheike (1992) that characterizes the probability of Brownian motion crossing a linear boundary within a finite time horizon. Lemma D.4 (Proposition 2, Scheike (1992)). If B(t) is standard Brownian motion, and a > 0, T < , then t [0,T ] (B(t) a + bt) = 1 Φ( a T) + exp( 2ab)Φ( a If a 0, the probability is 1. a = s aβ(x2 µβ) + q 2aβ ln 1/δ, b = x2 µβ On Multi-Armed Bandit with Impatient Arms we have that Pβ Z aβ(x1 µβ) h Φ s aβ(x2 µβ) + p 2aβ ln 1/δ p bβ aβ x2 µβ exp 2 x2 µβ s aβ(x2 µβ) + q Φ s aβ(x2 µβ) + p 2aβ ln 1/δ p bβ aβ x2 µβ bβ aβ i f(S(β) aβ = s)ds. For notational convenience, we denote bβ aβ, B = s aβ(x2 µβ) + p 2aβ ln 1/δ p For λ, η, a1, a2 (0, 1), we set (1 η) 2 ln 1/δ (x1 µβ)2 (1 + λ) 2 ln 1/δ (x2 µβ)2 g T = (ln T)a1, κT = 3 + (ln T)a2 . We require that a1, a2 satisfy the following conditions: In the below, we take T + and evaluate the asymptotic rates of aβ, bβ, A, B. lim T + aβ (ln T)a2 = (1 η) lim T + 1 (ln T)a2 4 ln T β + q 4 ln T κT 1 2 = (1 η) lim T + κT 1 (ln T)a2 = 1 η, lim T + bβ (ln T)a2 = (1 + λ) lim T + 1 (ln T)a2 4 ln T β + q 2 > a1 by condition 1., we further have lim T + bβ (ln T)a2 = (1 + λ) lim T + 1 (ln T)a2 4 ln T β + q lim T + bβ aβ (ln T)a2 = (1 + λ) (1 η) = λ + η, (ln T) 1 2 = lim T + 1 = lim T + 1 4 ln T( 1 κT 1 p bβ + aβ ) g T On Multi-Armed Bandit with Impatient Arms we observe that lim T + 1 κT 1 4 ln T/(ln T) 1 2 a2 2 = 2 1 1 1+λ+ 1 η lim T + g T κT /(ln T)a1 a2 = 1 > 0, and again 1 2 > a1 a2, thus (ln T) 1 2 = lim T + 1 4 ln T( 1 κT 1 p bβ + aβ ) p 1 + λ + 1 η Finally we consider B. Note that B depends on the value of s, and in the integral s h aβ( x1+x2 2aβ ln 1/δ, aβ(x1 µβ) p 2aβ ln 1/δ i . Thus B is bounded by 0 < aβ(x1 x2) bβ aβ B aβ(x1 x2) p Now it suffices to evaluate (ln T)a1 a2 2 aβ(x1 x2) p bβ aβ = lim T + 1 (ln T)a1 a2 4 ln T κT 1 We observe that lim T + q 4 ln T κT 1 q /(ln T) 1 2 3 2 a2 = 1 > 0 and that by condition 2, for a1, a2, (ln T)a1 a2 2 aβ(x1 x2) p bβ aβ = lim T + 1 (ln T)a1 a2 g T κT = 1 η λ + η . Comparing A and B, since 1 2 , we have that T0 N+: T > T0, A > B. And we also have that s h aβ( x1+x2 2aβ ln 1/δ, aβ(x1 µβ) p 2aβ ln 1/δ i 4 + (A B)2 + (A B) 2 p 2 + (A + B)2 + (A + B) 4 + A aβ(x1 x2) 2 + A aβ(x1 x2) 2 + A + aβ(x1 x2) 2 + A + aβ(x1 x2) Denoting B = aβ(x1 x2) bβ aβ , we have that T1 N, T > T1, AB > 1/2. 
By straightforward calculation, we can verify that for any T > T1, 4 + (A B)2 + (A B) 2 p 2 + (A + B)2 + (A + B) > 0. 4 + (A B)2 + (A B) 2 p 2 + (A + B)2 + (A + B) 2 + (A + B)2 + A + B p 4 + (A B)2 A + B 2B =1 + lim T + 2 + (A + B)2 p 2B = 1 + lim T + 4AB 2 2 + (A + B)2 + p 4 + (A B)2) =1 + lim T + 4A 2 + (A + B)2 + p 4 + (A B)2) = 2. On Multi-Armed Bandit with Impatient Arms Adopting both Komatu s lower bound and upper bound simultaneously, we again focus on Pβ Pβ Z aβ(x1 µβ) h Φ(B A) exp(2AB) Φ( B A) i f(S(β) aβ = s)ds Z aβ(x1 µβ) 4 + (A B)2 + (A B) exp( 1 exp(2AB) 2 p 2 + (A + B)2 + (A + B) exp( 1 2(A + B)2) i f(S(β) aβ = s)ds = Z aβ(x1 µβ) 4 + (A B)2 + (A B) 2 + (A + B)2 + (A + B) 2(A B)2)f(S(β) aβ = s)ds. When T > max{T0, T1}, noting that S(β) aβ N(0, aβ), 4 + (A B)2 + (A B) 2 p 2 + (A + B)2 + (A + B) Z aβ(x1 µβ) 2(A B)2)f(S(β) aβ = s)ds 4 + (A B)2 + (A B) 2 p 2 + (A + B)2 + (A + B) Z aβ(x1 µβ) 2A2) 1 p2πaβ exp( s2 Since the upper limit of the integral is negative, i.e. aβ(x1 µβ) q 2aβ ln 1/δ (1 η p 1 η) 2 ln 1/δ x1 µβ < 0, η (0, 1), we have that 2A2 (aβ( x1+x2 2aβ ln 1/δ)2 2A2 (aβ(x2 µβ) p 2aβ ln 1/δ)2 On Multi-Armed Bandit with Impatient Arms Now we can get rid of the integral and rewrite the lower bound for Pβ as 4 + (A B)2 + (A B) 2 p 2 + (A + B)2 + (A + B) aβ(x1 µβ) q 2aβ ln 1/δ aβ(x1 + x2 4 + (A B)2 + (A B) 2 p 2 + (A + B)2 + (A + B) 2aβ (x1 x2) aβ if T > max{T0, T1}. Since we have shown that events Ek and Eβ happen with non-negligible probability, now we focus on the expected regret lower bound of UCB algorithm, t=1 at = E T X t=1 at(I(Ek Eβ) + I( Ek Eβ)) t=1 at I(Ek Eβ) = E I(Ek Eβ) t=1 I(at = i) min E I(Ek Eβ) t=1 I(at = i) = min E I(Ek Eβ) t=1 I(at = k ) . Recall that by assumption, mk = O((ln T)θ), θ (0, 1). Given this θ, ϵ > 0: max{ 1 2, θ} + 2ϵ < 1. We set a1 = 1 2, θ} + ϵ, a2 = max{ 1 2, θ} + ϵ. Obviously a1, a2 (0, 1). Here we validate the conditions that a1, a2 are required to satisfy: 2ϵ < a1 = 1 2, θ} + ϵ < 1 Fix any γ (0, 1), we set 16, η = min n γ 32, 1 1 (1 + γ/4 Also obviously, λ, η (0, 1). We now show that, under the event Ek Eβ and when T is large enough, the optimal arm is pulled at most κT times before leaving. Under both events, when the number of pulls of arm k is smaller than κT , (1) arm k s UCB index is greater than x1 and (2) the number of pulls of arm β is no greater than aβ. When the number of pulls of arm k is κT 1 and at some time arm k s UCB index is the largest among all active arms, it will be pulled the κT -th time. After that, arm k s UCB index is below x2. By Eβ, we know that before arm k s UCB index becomes the largest again, at least the (aβ + 1)-th, (aβ + 2)-th, ..., (bβ + 1)-th pulls of arm β need to be done. The total number of sub-optimal pulls (contributed by arm β) after arm k is pulled the κT -th time is at least bβ aβ + 1, not to mention the arms other than k and β. By our assignment, a2 > θ. Besides, we have shown that lim T + bβ aβ (ln T )a2 = λ + η > 0. As a result, lim T + mk bβ aβ + 1 = lim T + mk (ln T)a2 (ln T)a2 bβ aβ + 1 = 0. On Multi-Armed Bandit with Impatient Arms Thus T2 N+, T > T2, mk < bβ aβ + 1. Under the event Ek Eβ, when T > T2, arm k is pulled at most κT times. We obtain that when T > T2, RT min E I(Ek Eβ) t=1 I(at = k ) min E I(Ek Eβ) (T κT ) = min Pr(Ek Eβ)(T κT ) = min Pr(Ek ) Pr(Eβ)(T κT ), noticing that Ek Eβ | x1, x2. For Ek , we have 0 lim T + T 1 Pr(Ek ) lim T + T 1 C(T) 1 exp 1 + (2 + 1 2(κT 1))g2 T , where we have used the fact that lim T + 1 exp 2(1 1 κT 1) 2 ln 1/δ κT 2 g T = 1, 2 + a1 a2 > 0. 
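For later reference, the limits established above can be collected in one display. This is only a restatement of what the derivation shows, with a_1, a_2, λ, η the constants fixed in this proof:
\[
g_T = (\ln T)^{a_1}, \qquad \kappa_T = 3 + (\ln T)^{a_2}, \qquad
\lim_{T\to\infty}\frac{a_\beta}{(\ln T)^{a_2}} = 1-\eta, \qquad
\lim_{T\to\infty}\frac{b_\beta}{(\ln T)^{a_2}} = 1+\lambda, \qquad
\lim_{T\to\infty}\frac{b_\beta-a_\beta}{(\ln T)^{a_2}} = \lambda+\eta .
\]
Since m_{k^*} = O((\ln T)^{\theta}) with \theta < a_2, these scalings give
\[
\lim_{T\to\infty}\frac{m_{k^*}}{b_\beta-a_\beta+1} = 0 ,
\]
which is the fact used above to conclude that, under E_{k*} ∩ E_β and for T large enough, the optimal arm is pulled at most κ_T times before it leaves.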
Furthermore, 0 lim T + T 1 Pr(Ek ) lim T + C(T) 1 C(T) 1 exp 1 + (2 + 1 2(κT 1))g2 T γ 2 ln T = 0, because 2a1 < 1. Thus lim T + T 1 Pr(Ek ) = 0, Pr(Ek ) = Ω(T 1 2 γC(T) 1). For Eβ, since Pβ LB, T > max{T0, T1}, define 2(bβ aβ) + 1 we note that lim T + h(T) = lim T + 1 2 bβ aβ (ln T)a2 2 + 2(ln T) bβ (ln T )a2 + q aβ (ln T )a2 2 aβ (ln T)a2 2 + 2(ln T) 2 2 q aβ (ln T )a2 = lim T + 1 2 bβ aβ (ln T)a2 bβ (ln T )a2 + q aβ (ln T )a2 2 aβ (ln T)a2 κT 2 q aβ (ln T )a2 since a2 < 1 and a1 < 1 lim T + h(T) 2(λ + η) h 2 2 1 + λ + 1 η 2(1 η) h 2 2 1 η =2(λ + η) h 1 1 1 + λ + 1 η i2 + 2(1 η) h 1 1 1 η 2(λ + η) + 2 h 1 1 1 η On Multi-Armed Bandit with Impatient Arms lim T + exp(h(T) γ 2 ln T) = 0. For Eβ, we have 0 lim T + T 1 2 γ(ln T) 1 Pr(Eβ) lim T + T 1 2 γ(ln T) 1 lim T + T 1 2 γ(ln T) 1 4π (x1 x2) aβ = lim T + 8πA2(ln T) 1 (x1 x2)2(aβ) 3 2 bβ aβ exp(h(T) γ 2 ln T) = 0, 2 + a1 a2 > 0. Note that we have used lim T + 1 exp (1 1 aβ + 1) p 2 ln 1/δ g T Thus lim T + T 1 2 γ(ln T ) 1 Pr(Eβ) = 0, Pr(Eβ) = Ω T 1 2 γ(ln T) 1 . Finally, we reach a conclusion that RT = Ω min T 1 γ C(T) ln T 1 for γ (0, 1). E. Proofs for SE Algorithm Proof of Proposition 3.6. Construct a problem instance: K = 4, m = (20, 20, 3, 2), and µ = (0, 0, 0, ). In this case, l(K, m) = 14 15 < 1. Since SE algorithm pulls a1 = 1, a2 = 2, arm 4 exits the game at the end of time 2. The optimal arm is never pulled during the whole game, thus the expected regret RT = T. F. Proofs for FC-SE Algorithm Proof of Theorem 4.1. Inspired by the regret analysis of the SE algorithm in Lancewicki et al. (2021), we define τk as the last time the algorithm attempts to do arm elimination but arm k is NOT eliminated. Clearly, by this definition, at the end of the next feasible cycle, arm k is eliminated. Arm k is active for only one more cycle, so we have Tk(T) Tk(τk) + nk. At the end of time τk, since arm k is not eliminated, ˆµk ,Tk (τk) 2 ln T Tk (τk) ˆµk,Tk(τk) + 2 ln T Tk(τk). (12) We define the good event G = n t [T], k [K] : |ˆµk,Tk(t) µk| 2 On Multi-Armed Bandit with Impatient Arms meaning that the empirical means are close to the true reward means for any arm at any time. Then we show that G happens with high probability by presenting an upper bound on Pr( G), Pr( G) = Pr t [T], k [K] : ˆµk,Tk(t) µi > 2 ln T Tk(t) ˆµk,Tk(t) µk < 2 Pr n [T], k [K] : ˆµk,n µk > 2 n ˆµk,n µk < 2 h Pr ˆµk,n µk > 2 + Pr ˆµk,n µk < 2 where in the last inequality, we used a sub-Gaussian tail bound in Lattimore & Szepesv ari (2020): Lemma F.1 (Corollary 5.5, Lattimore & Szepesv ari (2020)). Assume that Xi µ are independent, σ-sub-Gaussian random variables. Then for any ϵ 0, Pr(ˆµ µ + ϵ) exp( nϵ2 2σ2 ), Pr(ˆµ µ ϵ) exp( nϵ2 where ˆµ = n 1 Pn i=1 Xi. Under the good event G, the optimal arm k is never eliminated since ˆµk ,Tk (t) + 2 q ln T Tk (t) µ > µk ˆµk,Tk(t) ln T Tk(t), for all t, i. Furthermore, at time τk, by (12) we have ln T Tk (τk) µk + 4 ln T Tk(τk) ln T Tk (τk) + 4 ln T Tk(τk). The FC-SE algorithm behaves in a strictly cyclic manner. An arm is pulled the same number of times in each feasible cycle. Since τk is at the end of a feasible cycle, we observe that nk = Tk (τk) Using this fact, we obtain nk Tk(τk) + 4 ln T Tk(τk) = 4 ln T Tk(τk) which directly implies Tk(τk) 16 ln T Tk(T) nk + 32 ln T On Multi-Armed Bandit with Impatient Arms The expected regret of FC-SE is then upper bounded as t=1 at = E K X k =k k Tk(T) k =k k Tk(T)[I(G) + I( G)] k =k k E Tk(T)I(G) + max T Pr( G) knk + 32 ln T Proof of Theorem 4.2. 
Let rk be the index of cycle at the end of which arm k enters the feasible cycle. We say that the arms in the initial feasible cycle enter in the 0-th cycle, k : ind(k) N, rk = 0. Since in the N < K case, there are some arms entering the feasible cycle later in the game, the fact (14) in the proof of Theorem 4.1 no longer holds. Say t is the time that a feasible cycle ends. Arm k, k have entered the feasible cycle before or at time t and not been eliminated before time t. Then we have nk + rk = Tk (t) nk + rk . (15) Following the analysis in the proof of Theorem 4.1, under the good event G (we use the definition in (13)), if arm j is not eliminated by the optimal active arm i at the end of some cycle t, we have ln T Tj(t) + 4 ln T Ti (t). Using (15), we obtain ln T Tj(t) + 4 nj Tj(t) + ni (rj ri ). We consider the following three different cases and get a uniform upper bound for Tj(t) (1) rj ri . Arm j enters either after or simultaneously with the optimal active arm. We have ln T Tj(t) + 4 (2) rj < ri and Tj(t) > nj(ri rj). We have Tj(t) > Tj(t) nj(ri rj). Thus ln T Tj(t) nj(ri rj) + 4 nj Tj(t) + ni (rj ri ) ln T Tj(t) nj(ri rj) On Multi-Armed Bandit with Impatient Arms nj ri rj + 16 ln T (3) rj < ri and Tj(t) nj(ri rj). This directly implies In conclusion, we have nj max(0, ri rj) + 16 ln T if arm j is not eliminated by i at t under the good event G. Due to the patience tightening process in FC-SE, arm ind 1(N + a) is assigned to checkpoint a N 1 c, for a [K N]. Let C denote the set of all checkpoints, then a = 1, ..., K N = {c1, c2, ..., c|C|}, where c1, c2, ... is the ordered sequence of checkpoints: c1 < c2 < ... < c|C|. Now we go through the whole process from the beginning. Let A0 := {k [K] : ind(k) N} be the arms in the initial cycle. Define k (0) = arg maxk A0 µk. Consider the first time t1 that triggers the checkpoint c1 = c. It satisfies t1 + P k S > c1 n, where S is the set of arms in the feasible cycle after the arm elimination at time t1 1. Let R1 denote the number of the operated feasible cycles from time slot 1 to time slot t1 1. t1 + P k S > c1 n if and only if the length of R1 feasible cycles plus the length of the feasible cycle beginning at time t1 is at least c1 n. The length of all feasible cycles is upper bounded by n, thus we have n(R1 + 1) c1 n, R1 c1 n n 1. This inspires us to define i = k (0), 16 ln T 1 ni + 1 2 + 1 c1 n For any k E0, under the good event G defined in (13), must have been eliminated either before or at the arm elimination phase of the 16 ln T 2 k (0),k 1 nk + 1 2 % -th feasible cycle by (16). Since $ 16 ln T 2 k (0),k 1 nk + 1 2 % + 1 16 ln T 1 nk + 1 2 + 1 c1 n arm k is eliminated before the condition t + P k S > c1 n is triggered. Thus for any arm k S when the condition is triggered, we have that k / E0. If k = k (0), k (0),k = 0. Otherwise, 16 ln T 2 k (0),k 1 nk + 1 2 + 1 > c1 n n 1, k (0),k < 4 1 nk + 1 = 4 Since it is possible that k (0) is unfortunately dropped at checkpoint c1, we can only guarantee that under G there is an arm in S with a mean reward at least µk (0) 8 q n 3 after the operation at checkpoint c1. The operation includes randomly dropping arms, adding new arms and rebuilding the feasible cycle. Define k (1) as the temporarily optimal arm in the feasible cycle after the operation at checkpoint c1. Then we have under the good event G, n 3, max a [K]:m a=c1 µa with µk (0) = µk (0). Define A1 = (A0 E0) {a [K] : m a = c1} as the set of arms possibly in the feasible cycle after the operation at checkpoint c1. 
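For concreteness, the checkpoint bookkeeping used in this argument can be written out in a few lines. This is an illustrative sketch (not the authors' code); it reads the assignment rule above as attaching arm ind^{-1}(N + a) to checkpoint ⌈a/(N−1)⌉·c, which is our interpretation of the displayed expression.

```python
# Sketch of the FC-SE patience-tightening checkpoints: arm ind^{-1}(N + a) is attached
# to checkpoint ceil(a / (N - 1)) * c for a = 1, ..., K - N, and C is the ordered set of
# distinct checkpoint values c_1 < c_2 < ... < c_|C|.  The ceiling and the divisor N - 1
# are our reading of the rule stated above.
import math

def fc_se_checkpoints(K: int, N: int, c: int):
    """Return (checkpoint of each late arm, ordered checkpoint list) for FC-SE."""
    per_arm = {a: math.ceil(a / (N - 1)) * c for a in range(1, K - N + 1)}
    ordered = sorted(set(per_arm.values()))
    return per_arm, ordered

if __name__ == "__main__":
    per_arm, C = fc_se_checkpoints(K=10, N=4, c=100)
    print(per_arm)   # arms a = 1..3 -> checkpoint 100, arms a = 4..6 -> checkpoint 200
    print(C)         # [100, 200], so |C| = ceil((K - N) / (N - 1)) = 2
```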
On Multi-Armed Bandit with Impatient Arms Now we consider the situation after the operation at checkpoint cj 1 where 1 < j 1 + |C|. Note that c1+|C| is actually not a checkpoint in C, but we define it as the time slot c(1 + |C|) for the convenience of our analysis. There is no any special operation at this checkpoint but the condition t + P k S nk > c1+|C| n is defined. Define Rj 1 as the number of the operated feasible cycles from time slot 1 to time slot tj 1, where tj 1 is the first time that triggers the checkpoint cj 1. tj, Rj are defined accordingly. Following the discussion concerning checkpoint c1, we have that the length of the first Rj 1 feasible cycles plus the length of the next feasible cycle is at least cj 1 n, but the length of the first Rj 1 feasible cycles is strictly less than cj 1 n. Thus we observe that at least the time slots from cj 1 n to cj n (cj cj 1 + 1 time slots in total) is covered by the Rj Rj 1 + 1 more feasible cycles after the operation at checkpoint cj 1. We have n(Rj Rj 1 + 1) cj cj 1 + 1, Rj Rj 1 cj cj 1 n 1. This also inspires us to define µk (j 1) > µi, 16 ln T (µk (j 1) µi)2 1 ni + 1 2 + 1 cj cj 1 where Aj 1 = (Aj 2 Ej 2) {a [K] : m a = cj 1} is the set of arms possibly in the feasible cycle after the operation at checkpoint cj 1 and µk (j 1) := max µk (j 2) 8 q ln T c n 3, maxa [K]:m a=cj 1 µa . For any k Ej 1, we will show that it must be eliminated before the operation at checkpoint cj. Assume that arm k has not yet been eliminated after the operation at checkpoint cj 1. By (16), under the good event G, arm k must have been eliminated either before or at the feasible cycle indexed by rk + max(0, rk (j 1) rk) + $ 16 ln T 2 k (j 1),k 1 nk + 1 2 % Before the condition t + P k S > cj n at checkpoint cj is triggered, the last feasible cycle where there can be new entrance is the Rj 1-th cycle. Thus rk, rk (j 1) Rj 1. By induction hypothesis, µk (j 1) µk (j 1). We have rk + max(0, rk (j 1) rk) + $ 16 ln T 2 k (j 1),k 1 nk + 1 2 % max(rk (j 1), rk) + 16 ln T 2 k (j 1),k 1 nk + 1 2 + 1 Rj 1 + 16 ln T (µk (j 1) µk)2 1 nk + 1 2 + 1 Rj 1 + cj cj 1 n 1 Rj 1 + Rj Rj 1 = Rj. So arm k must be eliminated before the operation at checkpoint cj. Thus for any arm k still in the feasible cycle when the condition at checkpoint cj is triggered, we have that k / Ej 1. We have either µk (j 1) µk or µk (j 1) µk < ln T c n 3. Under the good event G, there must be an arm in the feasible cycle after the operation at checkpoint cj whose mean reward is at least µk (j 1) 8 q ln T c n 3. Thus we have ln T c n 3, max a [K]:m a=cj µa under the good event G. Before formally deriving the expected regret upper bound for the FC-SE algorithm, we need to find a lower bound for µk (|C|) which indicates that randomly dropping arms does not severely decrease the highest mean reward in the active arm set. We consider two cases: On Multi-Armed Bandit with Impatient Arms (1) µ = maxk A0 µk. The optimal arm k is in the initial feasible cycle. µ = µk (0) = µk (0). By the definition of µk (j), j = 0, 1, ..., |C|, we have µk (|C|) µk (|C| 1) 8 ln T c n 3 µk (|C| 2) 2 8 ln T c n 3 ... µk (0) 8|C| ln T c n 3 = µ 8|C| ln T c n 3. (2) j {1, 2, ..., |C|}, µ = maxa [K]:m a=cj µa. Say the optimal arm enters the feasible cycle at checkpoint cj , thus µk (j ) = max ln T c n 3, max a [K]:m a=cj µa ln T c n 3, µ ) As in the last case, we also obtain µk (|C|) µk (|C| 1) 8 ln T c n 3 ... µk (j ) 8(|C| j ) ln T c n 3 µ 8|C| ln T c n 3. 
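Both cases rest on the same telescoping step, which can be written compactly as
\[
\mu_{k^*(|\mathcal{C}|)} \;\ge\; \mu_{k^*(|\mathcal{C}|-1)} - \varepsilon \;\ge\; \cdots \;\ge\; \mu_{k^*(j_0)} - \bigl(|\mathcal{C}|-j_0\bigr)\varepsilon \;\ge\; \mu^* - |\mathcal{C}|\,\varepsilon ,
\]
where ε is shorthand for the per-checkpoint loss term 8\sqrt{\ln T/(\cdot)} appearing in the displays above (with the same argument as there), j_0 = 0 in case (1), and j_0 is the index of the checkpoint at which the optimal arm enters in case (2); in both cases µ_{k*(j_0)} ≥ µ*.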
Now we derive the expected regret upper bound for the FC-SE algorithm t=1 at = E K X k =k k Tk(T) k =k k Tk(T)[I(G) + I( G)] k =k k E Tk(T)I(G) + max T Pr( G) k =k :k E|C| k E Tk(T)I(G) + X k =k :k/ E|C| k E Tk(T)I(G) + max T Pr( G). Since all arms have once entered the feasible cycle either before or at checkpoint c|C|, any arm k E|C| must be eliminated before (virtual) checkpoint c|C|+1 under the good event G. Thus arm k is pulled at most c(1 + |C|) times. For arm k / E|C|, µk is not too small (i.e. µk µk (|C|) 8 q ln T c n 3). k =k :k E|C| kc(1 + |C|) + X k =k :k/ E|C| (µ µk (|C|) + µk (|C|) µk)E Tk(T)I(G) + 2K max k =k :k E|C| kc(1 + |C|) + X k =k :k/ E|C| 8(1 + |C|) ln T c n 3E Tk(T) + 2K max k =k kc(1 + |C|) + 8(1 + |C|) ln T c n 3T + 2K max (K 1) c(1 + |C|) + 8T(1 + |C|) n ln T c 3n + 2K max. The last formula is minimized when 8T(1 + |C|) n ln T 2(K 1) (1 + |C|) n ln T (K 1) On Multi-Armed Bandit with Impatient Arms By the definition of c in (4), if j mink:ind(k)>N mk ind(k) N N 1 k > 3n + l 4T n ln T (K 1) 2 3 m , we have n ln T (K 1) 3 '! 1 + l K N + 8T 1 + l K N n ln T (K 1) 2 n ln T (K 1) 3 ! 1 + l K N + 8 1 + l K N 3 T 2 3 (n ln T) 1 3 (K 1) 1 3 1 3 + 2K max. Thus RT = O K 4 3 T 2 3 (n ln T) 1 3 . G. Details of FC-Entry Algorithm The details of Feasible Cycle-based successive elimination with new Entering arms (FC-Entry) is given in Algorithm 4. For notational simplicity, m is written to be one of the inputs for Algorithm 4, but it is important to note that the algorithm only needs the patience mk of the initially available arms (i.e. k [K0]). There is a similar patience tightening operation for the arms initially available but not included in the initial feasible cycle, as in Algorithm 1. We still define checkpoints as integer multiples of a constant c, while the c for FC-Entry algorithm is instead given in (11). S is the set contain all arms actually in the feasible cycle, while Sres is the set of arms occupying the reserved slots. When arm k enters the game at time ρk, the algorithm immediately adds it to the feasible cycle (it is possible that the algorithm randomly drops some arm in the reserved places to make room for arm k). Thus the algorithm adds arm k to S and Sres. However, the algorithm adds arm k to the feasible cycle only to prevent its departure. For the ease of theoretical analysis, we require that the rewards generated by a newly entered arm k before it is formally involved in the feasible cycle are not used to compute its estimated reward mean. A new entering arm is formally involved in the feasible cycle only in the operation at the nearest checkpoint after its entrance. It does not either participate in the arm elimination phase before that checkpoint. When a new entering arm is detected, the algorithm computes the next checkpoint if necessary. Before formally involved in the feasible cycle, the newly entered arm temporarily stays in Spre. For any arm k [K], ξk, T k record the sum of observed rewards and the number of pulls after arm k is formally involved in the feasible cycle, respectively. At the end of each feasible cycle, the algorithm checks whether the p-th checkpoint is reached. Checkpoints c1, ..., c K0 N+2 N 3 are known in advance since at these checkpoints the algorithm involves the arms in {k [K0] : ind(k) > N 2} into the feasible cycle. 
However, if after the operation at checkpoint c K0 N+2 N 3 there is still new entering arms that have not formally enter the feasible cycle, the algorithm needs to compute the previously unknown cp for p > K0 N+2 N 3 when detecting new entering arms. Since t denotes the first time slot of the next feasible cycle, t := t + P k S Sres nk + n+ + n is an upper bound for the first time slot of the cycle after the next feasible cycle. Let t denote the first time slot of the last pulled feasible cycle. The condition is not triggered then: t t + P k S S res nk + n+ + n cp n, where S , S res are the versions of S, Sres when checking the condition at time t 1. Snew is not empty, we have p K0 N+2 N 3 . Each arm k in Snew is pulled no later than t + n 1 cp 1 < cp m k mk. Thus arm k does not leave early. In Theorem 4.6 we present an expected dynamic regret upper bound for FC-Entry algorithm under the sparse entrance assumption: 3c min K0 K0} = {c1, c2, ..., c|C|}, where jk is defined such that arm k is involved in the feasible cycle at the jk-th checkpoint. Specifically, jk = 0 for k s.t. k [K0] and ind(k) N 2, while jk = ind(k) N+2 N 3 for k s.t. k [K0] and ind(k) > N 2. jk for k > K0 is On Multi-Armed Bandit with Impatient Arms Algorithm 4 FC-Entry 1: Input: Number of arms in the initial feasible cycle N, patience vector m, time horizon T, a lower bound for the new entering arms patience m, segment length c 2: Construct a feasible cycle a1, ..., an for the set of arms {k : ind(k) N 2} {+, } 3: S {k [K0] : ind(k) N 2}, Spre , Sres , p 1, e K0 + 1, a 1, ..., a n a1, ..., an 4: ai 0, i : a i {+, }. ξk 0, T k 0, k [K]. t 1, t t + P k S Sres nk + n+ + n 5: for k {k [K0] : ind(k ) > N 2} do 6: m k l ind(k) N+2 7: end for 8: while t T do 9: for i = 1, ..., n do 10: if e K and t = ρe then 11: If |Sres| = 2 randomly choose {+, }, S S {a}, Sres Sres {a} such that ai = a, i : a i = . ai 0 for i = 1, ..., n s.t. ai / S 12: Randomly choose {+, } s.t. ai = 0, i : a i = , S S {e}, Sres Sres {e}, Spre Spre {e}, ai e for i = 1, ..., n s.t. a i = 13: if p > K0 N+2 N 3 then 14: cp min{j c | j N+, t j c n} 15: end if 16: e e + 1 17: end if 18: if ai = 0 then 19: Pull at = ai and receive reward Xat,Tat(t). ξat ξat + Xat,Tat(t) and T at T at + 1 if at / Spre 20: t t + 1 21: end if 22: end for 23: S Spre n k S Spre : j S Spre, ξj ln T 1 T j ξk 24: Sres Sres S 25: ai 0 for i = 1, ..., n s.t. ai / S 26: t t + P k S Sres nk + n+ + n 27: if cp is known and t > cp n then 28: Spre , Snew {k [K0] : m k = cp} 29: while |S Sres| + |Snew| > N 2 do 30: a Unif(S Sres) 31: S S {a} 32: end while 33: ai 0 for i = 1, ..., n s.t. ai / S 34: while |Snew| > 0 do 35: a arg mink Snew ind(k) 36: S S {a}, Snew Snew {a} 37: ai a where i = min i n : a i = min{k [K0] | ind(k) N 2 and j n s.t. a j = k : aj = 0} 38: end while 39: p p + 1 40: end if 41: end while On Multi-Armed Bandit with Impatient Arms a random variable though ρk is fixed because the feasible cycles can have various lengths. Besides, since we assume that ρk for k > K0 is unknown in advance, only the first part of C (i.e. nl a N 3 m c a = 1, ..., K0 N + 2 o ) is known in the beginning of the game. Let A0 = {k [K0] : ind(k) N 2} be the set of arms in the initial feasible cycle. Define k (0) = arg maxk A0 µk. We define the same E0 as in (17). We note that if there is a new entering arm before the condition at checkpoint c1 is triggered, no arm in Sres is kicked off because in the beginning, the reserved positions in the feasible cycle are empty. 
Thus following the proof of Theorem 4.2, we also have that any arm k E0 is eliminated before checkpoint c1 under the good event G defined in (13), while any arm k S Spre when the condition t+P k S Sres nk +n+ +n > cp n is triggered satisfies that k (0),k < 4 1 nk + 1 = 4 Define k (1) as the temporarily optimal arm in the feasible cycle after the operation at checkpoint c1. Then we have under G, n 3, max a [K0]:m a=c1 µa, max a>K0:ja=1 µa with µk (0) = µk (0). Define A1 := (A0 E0) {a [K0] : m a = c1} {a > K0 : ja = 1} as the set of arms possibly in the feasible cycle after the operation at checkpoint c1. Under the good event G, µk (1) is a lower bound for the mean reward of the temporarily optimal arm in the feasible cycle after the operation at checkpoint c1. Now we consider the situation after the operation at checkpoint cj 1 where 1 < j K0 N+2 N 3 . The induction hypothesis is that under G, µk (j 1) is a lower bound for the mean reward of the temporarily optimal arm in the feasible cycle after the operation at checkpoint cj 1. Define Rj 1 as the number of the operated feasible cycles from time slot 1 to time slot tj 1, where tj 1 is the first time that triggers the checkpoint cj 1. tj, Rj are defined accordingly. When the condition at checkpoint cj 1 is triggered, we also have that t j 1 + P k S Sres nk + n+ + n cj 1 n, where t j 1 is the last time when arm elimination is performed before tj 1. Since P k S Sres nk + n+ + n is an upper bound for the length of the feasible cycle beginning at time t j 1, we have tj 1 t j 1 + P k S Sres nk + n+ + n cj 1 n. At least the time slots from cj 1 n to cj n is covered by the Rj Rj 1 + 1 more feasible cycles after the operation at checkpoint cj 1. We have n(Rj Rj 1 + 1) cj cj 1 + 1, Rj Rj 1 cj cj 1 To compute µk (j), we consider the following cases: 1. No entering arm k whose jk = j. Under G, the temporarily optimal arm k (j 1) has a reward mean at least µk (j 1). Following the analysis in the proof of Theorem 4.2, any k S when the condition at checkpoint cj is triggered satisfies that µk (j 1) µk 8 q ln T c n 3. We have µk (j) µk (j 1) 8 q ln T c n 3. 2. A new entering arm k is involved in the feasible cycle at checkpoint cj, but |Sres| < 2 when it enters the game. The new entering arm does not kick off any arm, thus the temporarily optimal arm with reward mean at least µk (j 1) must be still active. We also have µk (j) µk (j 1) 8 q ln T c n 3. 3. A new entering arm k is involved in the feasible cycle at checkpoint cj and |Sres| = 2 when it enters the game, but the temporarily optimal arm is not kicked off. As long as the temporarily optimal arm is still active, we have µk (j) µk (j 1) 8 q ln T c n 3. 4. A new entering arm k is involved in the feasible cycle at checkpoint cj and it kicks off the temporarily optimal arm when it enters the game. Let +, represent the arms in Sres when arm k enters the game. In this case, by the sparse entrance assumption, max(j+, j ) j 2. |µ+ µ | 8 q ln T c n 3 because otherwise one of {+, } will be eliminated by the other either before or at checkpoint cj 1. Thus although arm k kicks off the temporarily optimal arm, another arm in Sres has a mean reward at least µk (j 1) 8 q ln T c n 3. If this arm becomes temporarily optimal, it must be still in On Multi-Armed Bandit with Impatient Arms the feasible cycle after the operation at checkpoint cj, thus µk (j) µk (j 1) 8 q ln T c n 3. But if not, the temporarily optimal arm is in S Sres and has a reward mean at least µk (j 1) 8 q ln T c n 3. 
Though this arm may be unfortunately dropped at checkpoint cj, it guarantees that µk (j) µk (j 1) 8 q ln T c n 3 8 q ln T c n 3 = µk (j 1) 16 q ln T c n 3. Thus we define Ej 1 as ln T c n 3 > µi, 16 ln T µk (j 1) 8 q ln T c n 3 µi 2 1 ni + 1 2 + 1 cj cj 1 where Aj 1 := (Aj 2 Ej 2) {a [K0] : m a = cj 1} {a > K0 : ja = j 1}. For any arm k Ej 1, if it is in S after the operation at checkpoint cj 1, we show that it will be eliminated before checkpoint cj. By our discussion above, there is an arm v with reward mean at least µk (j 1) 8 q ln T c n 3 which survives until the operation at checkpoint cj. We have rk + max(0, rv rk) + 1 nk + 1 2 % = max(rv, rk) + 1 nk + 1 2 % max(rv, rk) + 16 ln T 1 nk + 1 2 + 1 Rj 1 + 16 ln T µk (j 1) 8 q ln T c n 3 µk 2 1 nk + 1 2 + 1 Rj 1 + cj cj 1 Rj 1 + Rj Rj 1 =Rj. Thus for any k still in the feasible cycle when the condition at checkpoint cj is triggered, we have k / Ej 1 and µk µk (j 1) 16 ln T c n 3. In conclusion, under the good event G, µk (j 1) 16 ln T c n 3, max a [K0]:m a=cj µa, max a>K0:ja=j µa Aj 1, Ej 1 can be defined the same way for j = K0 N+2 N 3 + 1, ..., |C| + 1 by setting c1+|C| = c|C| + c. The difference is that cj cj 1 can be larger than c and there is no arm dropped out at cj. The temporarily optimal arm can still be kicked off by a new entering arm. With a similar discussion as in the j K0 N+2 N 3 case, we obtain µk (j 1) 16 ln T c n 3, max a>K0:ja=j µa for j = K0 N+2 N 3 + 1, ..., |C|. Whether the optimal arm k is in the initial feasible cycle is crucial for our regret analysis. We consider the following cases: On Multi-Armed Bandit with Impatient Arms 1. k > K0. The optimal arm k enters later in the game. Let κ 0 = arg maxk [K0] µk be the optimal arm among those enter at the beginning of the game. Let κ j = min{k > K0 : µk > µκ j 1} be the first entering arm whose mean reward is greater than µκ j 1. Integer τ is defined such that κ τ = k , we have τ > 0 since k > K0. By these definitions, we have µκ 0 < µκ 1 < ... < µκ τ = µk = µ . And we can rewrite the expected dynamic regret RT as t=1 µat = E ρκ 1 1 X t=1 (µκ 0 µat) + (µκ 1 µat) + ... + t=ρk (µ µat) =E ρκ 1 1 X t=1 (µκ 0 µat)(I{G} + I{ G}) + (µκ 1 µat)(I{G} + I{ G}) + ... t=ρk (µ µat)(I{G} + I{ G}) t=1 (µκ 0 µat)I{G} + (µκ 1 µat)I{G} + ... + t=ρk (µ µat)I{G} + t=1 at I{ G} T max Pr( G) + k =k E ρκ 1 1 X t=1 (µκ 0 µk)I{at = k}I{G} (µκ 1 µk)I{at = k}I{G} + ... + t=ρk (µ µk)I{at = k}I{G} =:T max Pr( G) + k =k E[Dk]. To upper bound Dk, we define min n l = 0, ..., τ µκ l 16(K N + 3) q ln T c n 3 > µk o , if k > 16(K N + 3) q ln T c n 3. τ + 1, if k 16(K N + 3) q ln T c n 3. and consider three possibilities: (a) l(k) = τ + 1. In this case, µk is very close to µ , thus we trivially bound Dk k T 16(K N + 3)T q ln T c n 3. (b) l(k) τ, jk jκ l(k). At the jκ l(k)-th checkpoint, arm κ l(k) enters the feasible cycle. Among those have once entered the feasible cycle, it has the highest mean reward. Thus by our previous definition, µk (jκ l(k)) = µκ l(k). The number of checkpoints is |C| K0 N+2 N 3 + K K0 K0 N + 2 + K K0 = K N + 2. Since arm k enters the feasible cycle either after or simultaneously with arm κ l(k) (i.e. jk jκ l(k)), there must be an arm with reward mean at least µκ l(k) 16|C| q ln T c n 3 8 q ln T c n 3 after the operation at checkpoint cjk which survives until at least the next checkpoint. Given that µκ l(k) 16(K N + 2) q ln T c n 3 8 q ln T c n 3 µk > 8 q ln T c n 3, arm k is eliminated before time slot cjk + c n under the good event G. 
Thus the total number of pulls of arm k is at most 3c and Dk 3c k. (c) l(k) τ, jk < jκ l(k). Arm κ l(k) enters the game at time ρκ l(k). For notational simplicity, let cκ denote the checkpoint at which arm κ l(k) is involved in the feasible cycle. We show that cκ c 2n ρκ l(k). Suppose this inequality does not hold. Let t be the first time slot of the feasible cycle that contains the time slot ρκ l(k), t ρκ l(k). Thus On Multi-Armed Bandit with Impatient Arms t + n ρκ l(k) + n < cκ c n. Arm κ l(k) should be involved in the feasible cycle at checkpoint cκ c instead of cκ, which is a contradiction. If arm k is still active when the condition at checkpoint cκ is triggered, it must be eliminated before time slot cκ + c n under the good event G since µκ l(k) µk > 16(K N + 3) q ln T c n 3. Then we have Pρκ l(k)+1 1 t=ρκ l(k) I{at = k}I{G} cκ + c n ρκ l(k) cκ + c n cκ + c + 2n 3c where we set ρκ τ+1 = T + 1. We derive an upper bound for Dk, t=1 (µκ 0 µk)I{at = k}I{G} + (µκ 1 µk)I{at = k}I{G} + ... + t=ρk (µ µk)I{at = k}I{G} t=1 (µκ 0 µk) + (µκ 1 µk) + ... + k ρκ l(k)+1 1 X I{at = k}I{G} (ρκ l(k) 1)16(K N + 3) ln T c n 3 + 3c k 16(K N + 3)T ln T c n 3 + 3c k. Then RT can be further bounded as RT 2K max + k =k E[Dk] 2K max + 16(K N + 3)T ln T c n 3 + 3c k 2. k K0, ind(k ) > N 2. The optimal arm k is in the game from the beginning. t=1 µat T max Pr( G) + k =k E[Dk], where in this case Dk = k PT t=1 I{at = k}I{G} = k Tk(T)I{G}. Arm k is involved in the feasible cycle at checkpoint c ind(k ) N+2 N 3 . Consider the case when k > 16(K N + 3) q ln T c n 3. By our previous discussion, it guarantees that Tk(T) 3c when jk jk under the good event G. When jk < jk , arm k must be eliminated before time slot ck + c n. Thus k > 16(K N + 3) q ln T c n 3 implies that I{G}Tk(T) 3c + ck 3c + c K0 N+2 N 3 . Since it is also possible that k 16(K N + 3) q ln T c n 3, we have Dk 3c k + kc K0 N+2 N 3 + 16(K N + 3)Tk(T) q ln T c n 3. Thus RT 2K max + 3c k + kc l K0 N + 2 m + 16(K N + 3)Tk(T) 3. k K0, ind(k ) N 2. The optimal arm k is in the initial feasible cycle. The only difference from the k K0, ind(k ) > N 2 case is that jk jk = 0, k = k . We can similarly obtain RT 2K max + 3c k + 16(K N + 3)Tk(T) Merging the above three cases, the expected dynamic regret upper bound is RT 2K max + (K 1) c 3 + l K0 N + 2 + 16(K 1)(K N + 3)T n ln T c 3n, which is minimized when 8(K N + 3)T n ln T (3 + K0 N+2 If c is exactly 3n + 8(K N+3)T n ln T (3+ K0 N+2 3 , it can be verified that RT = O K2T 2 3 (n ln T) 1 3 . On Multi-Armed Bandit with Impatient Arms H. Discussion on the Knowledge of the Patience Vector m For the following reasons, we assume that the value of m is known in advance in this paper: 1. In practice, m can come as a result of negotiation between the algorithm and arms in advance. We take the crowdsourcing scenario as our example. The system (algorithm) can make a promise of mk for worker (arm) k such that in any time period of length mk, the worker will surely have at least one job assignments. After a suitable value of m is confirmed, the proposed algorithms in this paper (FC-SE and FC-Entry) are able to ensure that this promise is never violated. 2. It is sufficient for the known m to only be a valid element-wise lower bound for the true patience vector. For the clarity of our discussion, say m is the underlying true arm patience vector and m is an element-wise lower bound for m , i.e., mk m k for any k [K]. 
The knowledge of any valid m, rather than the exact m*, suffices for constructing a feasible cycle for the set of arms. The reason is that if we repeat a feasible cycle constructed from m, no arm k is ever continuously ignored for mk time steps; since mk ≤ m*k, arm k is then also never continuously ignored for m*k time steps, and hence it never leaves the game.

3. If no a-priori knowledge of the arm patience m is accessible, the bandit learning problem becomes intractable. Unlike learning the mean rewards µ, learning the arm patience m is impractical, since obtaining even partial information about m carries the risk of losing the optimal arm. Moreover, in this scenario all arms are initially indistinguishable to any algorithm. Intuitively, for any algorithm there exist instances (µ, m) in which the optimal arm leaves very early. In fact, we can formally prove that, without knowledge of m, any algorithm can incur regret that is linear in the time horizon T:

Proposition H.1. Let A be any algorithm that, at the beginning of each time slot t, takes only the historical observations (a1, r1, ..., a(t-1), r(t-1)) as input and outputs an action at, observing reward rt at time t. Then there exists a family of problem instances on which the expected regret satisfies RT = Ω(T).

Proof of Proposition H.1. Note that algorithm A has no access to the arm patience m. Assume K > 2. There exists i* ∈ [K] such that Pr(a1 = i*) ≥ 1/K, since otherwise 1 = Σ_{i∈[K]} Pr(a1 = i) < Σ_{i∈[K]} 1/K = 1. Similarly, there exists j* ∈ [K] such that Pr(a2 = j* | a1 = i*, r1 = 0) ≥ 1/K. Note that i* may equal j*. We construct a problem instance as follows. Select k* ∈ [K] with k* ≠ i* and k* ≠ j*; such a k* exists because K > 2. Set µk* = 1 and mk* = 2. For every k ≠ k*, set µk = 0 and mk = T. Set the reward noise to be 0 almost surely, that is, rt = µ(at) almost surely. Clearly, arm k* is an impatient optimal arm. If the algorithm pulls a1 ≠ k* and a2 ≠ k*, the optimal arm leaves at the end of time slot t = 2; on this event, every pull in the horizon collects reward 0 while µ* = 1, so the realized regret equals T. Hence

RT ≥ Pr(a1 ≠ k*, a2 ≠ k*) T ≥ Pr(a1 = i*, a2 = j*) T,

where the second inequality uses i* ≠ k* and j* ≠ k*. We observe that

Pr(a2 = j* | a1 = i*) = Σ_r Pr(a2 = j* | a1 = i*, r1 = r) Pr(r1 = r | a1 = i*) = Pr(a2 = j* | a1 = i*, r1 = 0),

since µi* = 0 and the noise is zero, so r1 = 0 almost surely when a1 = i*. Therefore

Pr(a1 = i*, a2 = j*) = Pr(a1 = i*) Pr(a2 = j* | a1 = i*) = Pr(a1 = i*) Pr(a2 = j* | a1 = i*, r1 = 0) ≥ 1/K².

As a result, RT ≥ T/K² = Ω(T).
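To see the construction in action, the toy simulation below plays a Proposition H.1 style instance against standard UCB1, used here only as an example of a patience-agnostic policy; the proposition itself covers arbitrary algorithms. Since UCB1 with the usual play-each-arm-once start pulls arms 1 and 2 deterministically in the first two slots, the adversary can place the unique good arm on a different index with patience 2, so it is ignored twice in a row, leaves, and the regret grows linearly in T. This sketch is added for illustration and is not code from the paper.

```python
# Toy illustration of the hard-instance idea behind Proposition H.1: a patience-agnostic
# policy (here UCB1) loses the unique good arm (mean 1, patience 2) because it does not
# pull it in the first two rounds, after which regret accumulates at rate 1 per slot.
import numpy as np

def run_ucb_with_impatient_arms(mu, patience, T):
    K = len(mu)
    pulls, sums = np.zeros(K), np.zeros(K)
    ignored = np.zeros(K, dtype=int)          # consecutive rounds without a pull
    active = np.ones(K, dtype=bool)
    regret = 0.0
    for t in range(1, T + 1):
        idx = np.where(active)[0]
        # UCB1 index (infinite for unpulled arms), restricted to arms still in the game.
        ucb = np.where(pulls[idx] > 0,
                       sums[idx] / np.maximum(pulls[idx], 1)
                       + np.sqrt(2 * np.log(t) / np.maximum(pulls[idx], 1)),
                       np.inf)
        a = idx[np.argmax(ucb)]
        sums[a] += mu[a]                      # zero reward noise, as in the construction
        pulls[a] += 1
        regret += mu.max() - mu[a]            # regret against the best arm of the instance
        ignored += 1
        ignored[a] = 0
        active &= ignored < patience          # an arm leaves once ignored `patience` times in a row
    return regret

if __name__ == "__main__":
    K, T = 4, 10_000
    mu = np.array([0.0, 0.0, 1.0, 0.0])       # the adversary hides the good arm at index 2 (0-based)
    patience = np.array([T, T, 2, T])         # ...and makes it impatient
    print("regret:", run_ucb_with_impatient_arms(mu, patience, T))   # ~ T, i.e. linear in T
```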