On Multi-Armed Bandit with Impatient Arms

Yuming Shao 1 2    Zhixuan Fang 1 2

1 Institute for Interdisciplinary Information Sciences, Tsinghua University, Beijing, China. 2 Shanghai Qi Zhi Institute, Shanghai, China. Correspondence to: Zhixuan Fang.

Proceedings of the 41st International Conference on Machine Learning, Vienna, Austria. PMLR 235, 2024. Copyright 2024 by the author(s).

Abstract

In this paper, we investigate a Multi-Armed Bandit (MAB) setting where an arm exits the game if the algorithm continuously neglects it. This setup is motivated by real-world scenarios, such as online advertising and crowdsourcing, where arms only gain benefits after being pulled by the algorithm. We identify the intrinsic hardness of this problem and the limitations of existing approaches. We propose the FC-SE algorithm, together with expected regret upper bounds, as our solution to this problem. As an extension, we further allow new arms to enter after the game starts and design the FC-Entry algorithm with performance guarantees for this setup. Finally, we conduct experiments to validate our theoretical results.

1. Introduction

Multi-Armed Bandit (MAB), first introduced in (Robbins, 1952), is a machine learning model that describes decision-making problems in which unknown information must be learned. In this model, each arm represents a potential choice with an unknown expected reward. The algorithm's objective is to achieve the highest possible overall reward. The central challenge in a MAB problem is to strike a balance between exploiting the arms that have yielded high rewards and exploring unknown arms to uncover their potential. MAB algorithms find real-world applications in various fields, such as online advertising (Schwartz et al., 2017; Yang & Lu, 2016; Aramayo et al., 2023; Avadhanula et al., 2021; Han & Gabor, 2020), clinical trials (Aziz et al., 2021; Chakravorty & Mahajan, 2014), recommendation systems (Santana et al., 2020; Xie et al., 2021; Mahadik et al., 2020), crowdsourcing (Rangi & Franceschetti, 2018; Song & Jin, 2021; Qin et al., 2023; Liu & Liu, 2017; Zhang et al., 2015; Tran-Thanh et al., 2014), and resource allocation in computer networks (Zuo & Joe-Wong, 2021; Pase et al., 2022; Feki & Capdevielle, 2011).

In these applications, bandit algorithms attempt to find choices that maximize their own benefits, while arms passively comply with and accept the arrangements made by the algorithm. However, in many situations, bandit arms also have their own interests, preferences, and even the right to make choices. For instance, a series of recent studies (Liu et al., 2020; 2021a; Kong et al., 2022; Basu et al., 2021; Sankararaman et al., 2021; Zhang et al., 2022) investigated multi-agent multi-armed bandits in two-sided matching markets. In these contexts, each arm selects its most preferred agent among those who have chosen it. Beyond this, the arms' preferences and choices can manifest in various ways. One crucial aspect is their decision to participate in the game. If participation promises satisfactory benefits, they may opt to remain involved; conversely, if participation does not yield sufficient benefits, they may be motivated to exit. To illustrate this, we present the following real-world scenarios:

Example 1 (Online Advertising). A website owner allocates advertisement slots to advertisers across various time periods. The owner generally prefers advertisements with high click-through rates to generate higher revenue.
However, for advertisers with low click-through rates, their ads may remain undisplayed for an extended period. This situation may lead to financial losses for advertisers and evoke their dissatisfaction. Consequently, advertisers may find it advantageous to withdraw from the competition for ad slots on this site. Example 2 (Crowdsourcing). The crowdsourcing workers participating in a dataset labeling task receive compensation for each label they provide. A crowdsourcing system assuming control over the task assignments to workers (Zhang et al., 2015) often continuously assigns tasks to workers with proven proficiency, while potentially neglecting others. Given that workers are only compensated upon task completion, certain workers might experience prolonged periods without receiving any tasks, leading to a lack of income from the platform. They may lose patience and turn to alternative avenues for earning income. What these two cases have in common is that if the algorithm frequently neglects an arm, it can harm the arm s interests, leading it to lose patience and ultimately incentivizing it to leave. Patience, to some extent, represents On Multi-Armed Bandit with Impatient Arms the threshold of loss that an arm is willing to tolerate. The conventional MAB model does not account for the arms patience and the possibility that they may leave the game if they run out of patience, making it inadequate for effectively capturing the real-world examples outlined above. To address this limitation, we introduce the patience of arms into the conventional model and study the Multi-Armed Bandit with Impatient Arms setting. In this setup, there are K arms and each arm k is associated with a positive integer mk indicating arm k s patience. If arm k is consistently ignored by the algorithm for mk consecutive times, it will leave the game. Consequently, the algorithm cannot pull it within the remaining time horizon. While existing bandit algorithms such as the Upper Confidence Bound (UCB) (Lai, 1987; Auer et al., 2002), Successive Elimination (SE) (Even-Dar et al., 2006) and Thompson Sampling (TS) (Thompson, 1933) algorithms can operate within the proposed setting, they do not explicitly accommodate the limited patience of the arms. When the optimal arm is impatient, it could deplete its patience and leave the game early. As a result, the algorithm would be compelled to choose the remaining sub-optimal arms, leading to a linear regret. Therefore, in this paper, we ask two research questions: 1) How does the patience of arms impact the performance of existing algorithms? 2) How can we address the challenges posed by the impatience of arms? 1.1. Our Contributions We introduce a new version of MAB problem, where an arm leaves the game if the algorithm continuously neglects it. As an extension, we also consider a more general and realistic setup that allows not only the participating arms to exit but also new arms to enter. We derive a minimax lower bound to highlight the fundamental hardness of the proposed problem especially when the arms exhibit significant impatience. We also comprehensively study the existing algorithms such as UCB and SE in the proposed setup. We identify a broad range of problem instances where they incur (nearly) linear expected regret. When existing algorithms have no guaranteed performance, we propose the Feasible Cycle-based Successive Elimination (FC-SE) algorithm. 
In FC-SE, we repeat a special sequence of arms (referred to as a feasible cycle) to prevent unexpected arm departures. When it is possible to include all arms in the feasible cycle, FC-SE achieves a regret of O n + n T , where n is the feasible cycle length and T is the time horizon. However, when it s only possible to include a subset of arms in the initial feasible cycle, FC-SE randomly drops arms from the current feasible cycle, if necessary, to accommodate remaining arms, achiev- ing a regret of O K 4 3 T 2 3 n 1 3 if the remaining arms have sufficient patience. We design the FC-Entry algorithm for the extension scenario with new entering arms. FC-Entry accounts for unknown entry times for new arms, assuming that entries are sparse. FC-Entry achieves a dynamic regret of O K2T 2 3 n 1 3 if the arms initially available but not in the initial feasible cycle have sufficient patience. We conduct numerical experiments to validate our theoretical results. 1.2. Related Work There are several prior works allowing time-variant arm sets. The literature on sleeping bandits (Kanade et al., 2009; Kanade & Steinke, 2014; Cortes et al., 2019; Saha et al., 2020; Gaillard et al., 2023) assumes that the set At of available arms at any given time t may change. Additionally, Chakrabarti et al. (2008) and Trac a et al. (2020) are related to our work since they assume that if an arm leaves the game, it will not return. While our setting is related to these works, there are significant differences. In these papers, arm availability can be either stochastic (Saha et al., 2020) or adversarial (Gaillard et al., 2023). On the one hand, our setting assumes that the available set of arms is determined by the algorithm s previous actions, making it incompatible with the stochastic arm availability assumption. On the other hand, in previous works assuming adversarial arm availability, the algorithm s performance is typically evaluated by comparing its choice at time t against an arm from the adversarially selected At. In our scenario, instead, the algorithm s choice should always be compared against the offline optimal arm k , even though an algorithm can unfortunately lose it (k / At). One step in our proposed algorithms is to schedule the arms in a way that prevents any of them from running out of patience. Similar scheduling problems are considered in the Age of Information (Ao I) literature (Kaul et al., 2011; 2012; Abbas et al., 2023). In the field of communication, Ao I serves as a metric for measuring information freshness. Although a main line of research has focused on Ao I minimization (Liu et al., 2019; Arafa et al., 2020; Chen et al., 2023), our paper is more related to scheduling under hard constraints. For instance, the peak Ao I deadline resembles the concept of arm patience in this paper. Li et al. (2021); Liu et al. (2021b); Li et al. (2023a) study the scheduling problem under various Ao I constraints. Due to space limits, we defer other details of related work to Appendix A. 2. Preliminaries The problem studied in this paper is built upon the standard stochastic MAB. We consider a time horizon of length On Multi-Armed Bandit with Impatient Arms T and a finite set of K arms. At each time t [T], an algorithm pulls one arm at At, where At is the set of arms available at time t. When arm k is pulled the n-th time, the algorithm observes and collects a reward Xk,n. 
{Xk,n}n 1 are i.i.d random variables and we assume that for any k, n, Xk,n µk is 1-sub-Gaussian (The definition can be found for example in Chapter 5 of Lattimore & Szepesv ari (2020)). Let Xt denote the reward at time t. For any arm k [K], µk is its mean reward. µ = (µ1, ..., µK) is the vector of the mean rewards, which is unknown to the algorithm. The largest mean reward is µ = maxk [K] µk and the optimal arm is k arg maxk [K] µk. For simplicity of presentation, we assume that the optimal arm is unique. Minor modification can be applied to adapt our results to the non-unique optimal arm case. We define the reward gap such that k = µ µk , k = k , where a known constant 1 is an upper bound for all the reward gaps. We also define the relative reward gap k,k = µk µk , k, k s.t. µk µk . Define the empirical mean for arm k given n observations: ˆµk,n = n 1 Pn k=1 Xk,n. Let Tk(t) := Pt s=1 I{as = k} denote the number of times arm k is pulled from time 1 to time t. As we consider the Multi-Armed Bandit with Impatient Arms setup, we introduce a threshold mk N+ for each arm k. We refer to mk as arm k s patience. Similar to µ, m = (m1, ..., m K) is the vector of mk, k [K]. When the algorithm continuously ignores arm k for a duration of mk time steps, arm k loses patience and exits the game. For example, if at0 = k and as = k for all s = t0 + 1, ..., t0 + mk, and if t0 + mk + 1 T, then arm k exits at the end of time t0 + mk. Arm k can no longer be selected starting from time t = t0 + mk + 1 as it has already left the game. We assume that all arms are initially full of patience when they enter the game. In other words, if arm k has not been pulled since time t = 1, we regard the time when it was last pulled as t = 0. We assume that m is given in advance and discuss this in Appendix H. Define a bijective index mapping ind: [K] [K] such that ind(k) is the index of arm k when m is sorted from small to large. Let ind 1(i) denote the arm whose patience is the i-th small. Following Li et al. (2021), we define the load factor as a measure of arm impatience. We adopt cumulative expected regret RT as the performance metric for algorithms, where t=1 µat = E T X t=1 at . (2) It is worth noting that, at any time t, the algorithm s choice at is compared against k , even if k may not be available at time t (i.e., k / At). This definition of regret is reasonable because the best an algorithm can do is to pull k since the beginning and keep it in At throughout the time horizon. This reveals one of the crucial differences between our work and the sleeping bandit literature (Gaillard et al., 2023; Kale et al., 2016), whose regrets typically compare at against some arm in At. 3. Hardness of the Problem and Limitations of the Existing Approaches 3.1. Negative Results in the Low Patience Case Consider the configuration (K, m) = (3, (2, 2, 2)). It can be verified that at least one arm exits at the end of time slot t = 2, regardless of which arms the algorithm pulls at times t = 1 and 2. Intuitively, this serves as an extreme example where some arms have very low level of patience and no algorithm can guarantee a sub-linear regret under such a (K, m) configuration and different µ values. This is because it is impractical to determine the best arm among the three with only two reward observations. In this section, we formally demonstrate how a low level of patience renders learning infeasible. We first define a quantity characterizing the time when the first exit happens given (K, m). Definition 3.1. 
Fix K, m, define the maximum number of reward observations when the first arm departure happens Q(K,m) to be max n L N {at}L t=1, s [L 1], k K : s < mk or s mk, l = s mk + 1, ..., s with al = k o , where {at}L t=1 is a sequence of arm a1, a2, ..., a L. If there exists an infinite-length arm sequence such that no arm leaves, let Q(K,m) = + . For example, Q(3,(2,2,2)) = 2. By Definition 3.1, at least one arm exits the game before or at the end of time t = Q(K,m), regardless of the algorithm s arm choices. We note that Q(K,m) can be uniquely determined by the number of arms K and the vector of patience m, although the computation of Q(K,m) may be of significant time complexity. Intuitively speaking, the less patient the arms are, the more likely they are to exit early. Thus low level of patience results in a small value of Q(K,m). We present a minimax expected regret lower bound depending on Q(K,m), which is especially powerful when Q(K,m) is small. Theorem 3.2. Given a configuration (K, m) such that Q(K,m) T. Suppose the rewards are independent Gaussian random variables with variance 1. Then the minimax expected regret lower bound is given by min A A sup µ Ξ RT (A, (K, m, µ)) 1 Q(K,m) ln K K 1 On Multi-Armed Bandit with Impatient Arms h 1 f 1 K (1 2 ln K K 1) i (T Q(K,m)), where RT (A, (K, m, µ)) is the expected regret when algorithm A is operated on the problem instance (K, m, µ) for a time horizon T. The possible set of algorithms A = {πt}T t=1 , where {πt}T t=1 is a sequence of policies π1, ..., πT that πt maps from (a1, X1, ..., at 1, Xt 1) and m to the probability simplex over [K]. Ξ = {µ | k , k = k }. f K(p) := h(p) + p ln K K 1 and h(p) = p ln p + (1 p) ln(1 p) is the entropy of Bernoulli distribution. f 1 K ( 1 2 ln K K 1) denotes the unique p such that f K(p) = 1 2 ln K K 1. Though it can be difficult to compute Q(K,m) for arbitrary K, m, we find that if the load factor l(K, m) > 1, it is possible to obtain upper bounds of Q(K,m). In fact, there is a rich set of configurations whose load factor exceeds 1. For instance, l(3, (2, 2, 2)) = 3 2 > 1. We formally describe the upper bound in the following lemma and detail its proof in Appendix C. Lemma 3.3. Given a configuration (K, m), if λ N+ such that l(K, m) > 1+ 1 λ, then we have Q(K,m) λK+1. Combining Theorem 3.2 and Lemma 3.3, we directly have the following result. Corollary 3.4. Fix a configuration (K, m) with l(K, m) > 1, we have min A A sup µ Ξ RT (A, (K, m, µ)) = Ω W(K, m)T , where W(K, m) is a quantity purely depending on (K, m). A, Ξ and RT (A, (K, m, µ)) are defined in Theorem 3.2. Corollary 3.4 demonstrates the hardness of algorithm design when l(K, m) > 1. If we regard l(K, m) as a measure of arm impatience, then l(K, m) > 1 is associated with a low level of patience case, when the arms are so likely to leave early that learning their reward distributions is not feasible. 3.2. Analysis of UCB Algorithm In this part, we study the well-known Upper Confidence Bound (UCB) algorithm in our Multi-Armed Bandit with Impatient Arms setup. We adopt the definition in Lattimore & Szepesv ari (2020). Define the upper confidence bound of arm k given n observations as UCBk(n) = ˆµk,n + 2n 1 ln δ 1 if n > 0 and UCBk(0) = + . δ is the confidence parameter, typically set to be δ = 1/T 2. The UCB algorithm operates by pulling at = arg maxk [K] UCBk(Tk(t)). We find that when the arms are sufficiently patient, UCB still has performance guarantees. Such results are presented in Appendix D. 
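For concreteness, the sketch below runs this UCB rule inside the impatient-arms environment of Section 2: an arm that goes unpulled for m_k consecutive rounds leaves for good, while regret is still measured against µ*. This is our own illustrative code, not the authors' implementation; the function and variable names are ours, and rewards are drawn as unit-variance Gaussians as in the analysis of this section.

```python
import numpy as np

def ucb_with_impatient_arms(mu, m, T, seed=0):
    """Run UCB (index hat_mu_{k,n} + sqrt(2 ln(1/delta) / n), delta = 1/T^2)
    in the impatient-arms environment: arm k leaves the game for good after
    m[k] consecutive rounds without being pulled. Illustrative sketch only."""
    rng = np.random.default_rng(seed)
    mu = np.asarray(mu, dtype=float)
    m = np.asarray(m, dtype=int)
    K, delta = len(mu), 1.0 / T ** 2
    pulls = np.zeros(K, dtype=int)    # T_k(t): number of pulls of arm k so far
    emp = np.zeros(K)                 # empirical means
    idle = np.zeros(K, dtype=int)     # consecutive rounds since arm k was last pulled
    alive = np.ones(K, dtype=bool)    # the available arm set A_t
    regret = 0.0
    for _ in range(T):
        bonus = np.sqrt(2.0 * np.log(1.0 / delta) / np.maximum(pulls, 1))
        ucb = np.where(pulls > 0, emp + bonus, np.inf)   # UCB_k(0) = +infinity
        ucb[~alive] = -np.inf                            # departed arms cannot be pulled
        a = int(np.argmax(ucb))
        x = rng.normal(mu[a], 1.0)                       # unit-variance Gaussian reward
        emp[a] = (emp[a] * pulls[a] + x) / (pulls[a] + 1)
        pulls[a] += 1
        regret += mu.max() - mu[a]    # regret is always measured against mu*, available or not
        idle += 1
        idle[a] = 0
        alive[idle >= m] = False      # out of patience: the arm exits and never returns
    return regret

# Example with an impatient optimal arm:
# ucb_with_impatient_arms(mu=[0.4, 0.5, 0.9], m=[50, 50, 2], T=10_000)
```

With an impatient optimal arm, as in the commented call, the optimal arm can run out of patience early in the game; this is the failure mode formalized below.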
However, when the optimal arm is impatient, it is highly possible that it exits early under UCB, leaving the algorithm with only sub-optimal arms to choose. We will show that, if there are impatient arms, we can find many problem instances such that UCB performs badly, thus demonstrate its limitation in the proposed setup. Theorem 3.5. Run UCB with confidence parameter δ = 1/T 2 in the Multi-Armed Bandit with Impatient Arms setup. Suppose the rewards are independent Gaussian random variables with variance 1. If θ (0, 1), mk = O((ln T)θ) and β = k such that mβ T, then for any γ (0, 1), we have RT = Ω min T 1 γ C(T) ln T 1 , where min := mink =k k and C(T) is a polynomial function of ln T. Proof Sketch of Theorem 3.5. We define events Ek = minn [κT 1]UCBk (n) > x1, UCBk (κT ) < x2 and Eβ = x2 < UCBβ(aβ) < x1, minn [bβ] UCBβ(n) > x2 for arm k and β, respectively. x1, x2, κT , aβ, bβ are constants and UCBk(n) denotes the UCB index of arm k given the first n reward observations. We set x1 > x2 and aβ < bβ. Under these two events, we show that arm k is pulled no more than κT = o(ln T) times. Since UCB algorithm always pulls the arm with the highest UCB index, arm β is pulled at most aβ times before the κT -th pull of arm k , given minn [κT 1] UCBk (n) > x1 > UCBβ(aβ). If arm k is pulled the κT -th time, at least the (aβ + 1)-th, ..., (bβ + 1)-th pulls of arm β happen before the (κT + 1)-th pull of arm k . If the number of arm β pulls between the κT -th and (κT + 1)-th pull of arm k exceeds arm k s patience, then arm k leaves the game before it is pulled the (κT + 1)-th time, not to mention the pulls of other arms. By carefully designing the values of the constants, we show that Ek Eβ occurs with a non-negligible possibility using some techniques of Brownian motion. The complete proof of Theorem 3.5 is also in Appendix D. It shows that as the optimal arm is relatively impatient, the expected regret of UCB algorithm is asymptotically (almost) linear. Fix any θ (0, 1), the patience vector m such that there are negative problem instances for UCB algorithm can have a load factor l(K, m) as small as (ln T) θ + (K 1)/T 0 as T + , and as large as we wish. If we use load factor as a measure of arm impatience, we see that the capability of UCB algorithm in the proposed setting is very limited, since we can find problem instances such that UCB yields almost linear regret asymptotically even as the load factor approaches 0. 3.3. Analysis of SE Algorithm In this section, we study the Successive Elimination (SE) algorithm in our Multi-Armed Bandit with Impatient Arms On Multi-Armed Bandit with Impatient Arms setup. We consider a version similar to Algorithm 3 in Even Dar et al. (2006). All arms are active in the beginning. The algorithm pulls the active arms in a Round-Robin manner, in an increasing order of the arm indices. If at time t all active arms are pulled the same number of times, then at the end of time t the algorithm executes arm elimination when there are more than one active arms. For arm k, if there exists some arm k such that ˆµk ,Tk (t) ˆµk,Tk(t) > 2 q 4T 1 k (t) ln T, arm k is regarded as sub-optimal and deactivated. Since the arms are scheduled in a Round-Robin fashion, for any arm k, the total number of pulls of other arms between any two consecutive pulls of arm k is at most K 1. Then if the patience vector m satisfies that mk K, k [K], no active arm will exit early. Only those deactivated arms may leave the game after being eliminated by the algorithm. 
The behavior and analysis of SE remain exactly the same in this case as in standard MAB setting, thus, SE is a competitive option when mk K, k [K], i.e., when arms are patient enough in general. However, Round-Robin could induce linear regret when there exists some k such that mk < K. We present this fact in Proposition 3.6 and its proof in Appendix E. Proposition 3.6. Run SE in the Multi-Armed Bandit with Impatient Arms setup. Then there exists a problem instance (K, m, µ) such that l(K, m) < 1 but RT = T. 4. Our Algorithms From the discussion in Section 3, we know that it is infeasible to design new methods to achieve sub-linear regret when l(K, m) is strictly greater than 1. Meanwhile, existing algorithms (e.g., SE) can guarantee sub-linear regret when mk K, k [K]. In this section, we aim at developing algorithms for the regime where we do not have clear guarantees, i.e., for (K, m) such that k [K], mk < K. We notice that in this case, it may be possible to prevent arms early departure with some carefully designed pulling schedules. We say that a schedule is feasible if no arm leaves early under this schedule. For example, when m = (2, 4, 4), a feasible schedule can be a1, ..., a6... = 1, 2, 1, 3, 1, 2, ... . Formally, define τ i k as the time slot when the i-th sample from arm k is scheduled, Ii k as the time interval (in number of slots) between the i-th and the (i + 1)-th pull of arm k, Ii k = τ i+1 k τ i k. Arm k never leaves the game if Ii k mk, i 1. Furthermore, we say a schedule a1, a2... is cyclic if C N+ such that at = at+C, t 1. If there exists an infinite-length feasible schedule for m, Lemma 4.1 of Li (2023) shows that there must be a feasible cyclic schedule, which consists of repeating finite cycles. We call a finite cycle a1, ..., a C feasible cycle if it forms feasible schedules by repeating itself. If a feasible cycle exists for the considered configuration, it is possible to pull arms according to it to prevent Figure 1. An FC-SE example with N < K. In this example, N = 4, K = 7. The rectangles denote feasible cycle snapshots. The colored nodes denote different arms. early departure and execute arm elimination at the end of each cycle to remove the sub-optimal arms. 4.1. FC-SE Algorithm We introduce the intuition and design details of our Feasible Cycle-based Successive Elimination (FC-SE) algorithm. If a feasible cycle can be found for the considered configuration, we can directly replace the Round-Robin cycle in the original SE algorithm with that feasible cycle. In Section 4.2, we introduce how to construct a feasible cycle. Unfortunately, sometimes a feasible cycle exists only for a subset of the set of all arms. Thus we specify an integer N K such that the subset of arms {k : ind(k) N} can form a feasible cycle. We can first repeat this initial feasible cycle to identify sub-optimal arms and then insert remaining arms into it when some arm is eliminated. However, a problem arises when the cycle is still full as the patience of a remaining arm is about to expire. Since we cannot discard any arm without giving it a chance (it might be the optimal arm), we must remove an arm from the current feasible cycle to make room for an arm that has not been pulled yet. In fact, when a full cycle is repeated many times without any arm being eliminated, it is highly likely that the arms in this cycle have such similar mean rewards that distinguishing a sub-optimal arm becomes difficult. 
Even if the optimal arm in the current feasible cycle is mistakenly dropped, another arm in the cycle can serve as a suitable substitution. To balance the exploration among arms, we select an appropriate constant c and divide the entire time horizon into segments of equal length c. We tighten the patience of a remaining arm k to m k mk, such that m k is an integer multiple of c, and pull arm k at around t = m k to prevent early departure. We drop arms in the feasible cycle if necessary and add remaining arms to the cycle only at these checkpoints (referring to the endpoints of these segments as checkpoints). Figure 1 is a simple example of FC-SE behavior when N < K. We formally present FC-SE in Algorithm 1. Given N, the algorithm constructs a feasible cycle of length n = P k:ind(k) N nk, where nk is the number of times that arm k appears in the feasible cycle. If N < K, we set nk = 1 On Multi-Armed Bandit with Impatient Arms Algorithm 1 FC-SE 1: Input: Number of arms in the initial feasible cycle N, patience vector m, time horizon T, segment length c 2: Construct a feasible cycle a1, ... an for the set of arms {k : ind(k) N} 3: t 1, S {k : ind(k) N}, p 1, a 1, ..., a n a1, ..., an 4: if N < K then 5: for k {k : ind(k ) > N} do 6: m k l ind(k) N 7: end for 8: end if 9: while t T do 10: for i = 1, ..., n do 11: if ai = 0 then 12: Pull at = ai and receive reward Xat,Tat(t) 13: t t + 1 14: end if 15: end for 16: S n k S : j S, ˆµj,Tj(t 1) 2 q ln T 1 Tj(t 1) ˆµk,Tk(t 1) + 2 q ln T 1 Tk(t 1) o 17: ai 0 for each i = 1, ..., n such that ai / S 18: if p K N N 1 and t + P k S nk > cp n then 19: Snew {k [K] : m k = cp} 20: while |S| + |Snew| > N do 21: a Unif(S) 22: S S {a} 23: end while 24: ai 0 for each i = 1, ..., n such that ai / S 25: while |Snew| > 0 do 26: a arg mink Snew ind(k) 27: S S {a}, Snew Snew {a} 28: ai a where i = min i n : a i = min{k [K] | ind(k) N and j n s.t. a j = k : aj = 0} 29: end while 30: p p + 1 31: end if 32: end while for any k such that ind(k) > N. The algorithm maintains a set of active arms S, initialized as the set of arms in the initial feasible cycle. p denotes the index of the next checkpoint. The algorithm pulls the active arms according to the feasible cycle. If some arm is eliminated, all positions associated with it in a1, ... an are cleared (i.e. set to 0). At the end of each feasible cycle, the algorithm checks whether the next checkpoint is met. It is important to note that there can be at most N 1 remaining arms waiting at a checkpoint, because if more were assigned, we would need to drop all the arms in the current cycle, and there would be no guarantee that a good substitution for the optimal arm remains in the feasible cycle. When checking whether the p-th checkpoint is reached, t is the first time slot of a feasible cycle, which ends at t + P k S nk 1. At the first time t + P k S nk > cp n, we have that t cp n, since t grows at most n each time. When the p-th checkpoint is met, the set of arms that need to be added to the feasible cycle is denoted as Snew. Since the actual length of the new feasible cycle is also upper bounded by n, each arm k in Snew is pulled no later than t + n 1 cp 1 < cp = m k mk. Therefore, arm k does not leave before being pulled by the algorithm for the first time. The algorithm randomly drops arms from the current feasible cycle until there is enough space for Snew. First, we analyse the regret performance of FC-SE algorithm in the easier case when we can find a feasible cycle containing all the available arms (i.e. N = K). 
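In both the N = K and N < K cases, the arm elimination performed at the end of each feasible cycle is the confidence-interval test of line 16 in Algorithm 1. The sketch below is our reading of that test, with confidence radius 2*sqrt(ln T / n); the helper name and data layout are ours.

```python
import numpy as np

def fc_se_elimination(active, emp_mean, n_pulls, T):
    """One elimination round of FC-SE: keep arm k in the active set only if no
    active arm j has a lower confidence bound exceeding arm k's upper
    confidence bound, using radius 2 * sqrt(ln T / n)."""
    rad = 2.0 * np.sqrt(np.log(T) / np.maximum(n_pulls, 1))
    return {k for k in active
            if all(emp_mean[j] - rad[j] <= emp_mean[k] + rad[k] for j in active)}
```

Arms removed here have their positions in the cycle cleared (line 17 of Algorithm 1), which is what eventually frees slots for the arms waiting at the next checkpoint.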
Note that in this case, the value of c does not matter. Theorem 4.1. Run FC-SE in Algorithm 1 with N = K. Assume that a feasible cycle can be constructed for the whole arm set [K], with length n = P k 1 nk. nk is the number of times that arm k appears in the feasible cycle. Then the expected regret of FC-SE is upper bounded by knk + 32 ln T where max := maxk =k k. We provide the detailed proof of Theorem 4.1 in Appendix F. Under the assumption of Theorem 4.1, it can be shown that RT = O n + n T ln T . Next, we analyse FC-SE in the case when we specify some N < K such that a feasible cycle exists for the subset of arms {k : ind(k) N}. The proof is also in Appendix F. Theorem 4.2. Run FC-SE in Algorithm 1 with N < K. Assume that a feasible cycle can be constructed for the set of arms {k : ind(k) N}, with length n = P k:ind(k) N nk and nk = 1 for any k such that ind(k) > N. We set c to be min k:ind(k)>N mk ind(k) N n ln T (K 1) If we have j mink:ind(k)>N mk ind(k) N N 1 k > 3n, then the expected regret of FC-SE is upper bounded as RT (K 1) c 1 + l K N + 8T 1 + l K N n ln T c 3n + 2K max, On Multi-Armed Bandit with Impatient Arms Algorithm 2 The Shortest Length AUS Cycle Construction 1: Input: Patience vector m 2: Compute the index mapping ind( ) such that mind 1(1) mind 1(2) ... mind 1(K) 3: Solve the optimization problem and obtain solution r : s.t. ˆrk ˆrk+1 N+, k [K 1] (6) 1 mind 1(k) ˆrk 1, k [K] (7) k=1 ˆrk 1 (8) 4: if r exists then 5: Construct an AUS cycle with r where max := maxk =k k. Specifically, if m sat- isfies that j mink:ind(k)>N mk ind(k) N N 1 k > 3n + l 4T n ln T (K 1) 2 3 m , we have RT = O K 4 3 T 2 3 (n ln T) 1 3 . 4.2. A Form of Feasible Cycles: AUS In Section 4.1, we presented our first algorithm, FC-SE. This algorithm requires a feasible cycle containing at least a subset of all the available arms. Besides, the expected regret increases with the feasible cycle length n in our analysis. In this section, we propose a specific method to construct a feasible cycle with the shortest possible length. We need the following definition from the Age of Information (Ao I) literature. Definition 4.3 (Li et al. (2023a)). A cyclic schedule is an Almost Uniform Schedule (AUS) if for each arm k, there exists an integer xk such that Ii k is either xk or xk+1 for any i 1. Ii k is the time interval (in number of slots) between the i-th and the (i+1)-th pull of arm k in the cyclic schedule. Our process to construct a feasible cycle is listed in Algorithm 2. Given any feasible solution to the optimization problem in Algorithm 2, a routine to construct a feasible AUS cycle is described in Li et al. (2021). Besides, our minimization objective (5) is the length of the constructed AUS cycle. In conclusion, Algorithm 2 tries to find the feasible AUS cycle with the shortest possible length. In Appendix B.1, we present a dynamic programming-based solution to the optimization problem in Algorithm 2. Here we formally describe the output of Algorithm 2 and provide the proof in Appendix B.2. Theorem 4.4. Run Algorithm 2 on a patience vector m. If the optimization problem in Algorithm 2 has a feasible solution, then Algorithm 2 finds a feasible AUS cycle of length at most ||m|| . Remark 4.5. Although it is obvious that when l(K, m) > 1, no solution to the optimization problem exists, Li et al. (2021) has shown that l(K, m) ln 2 can ensure the existence of such a feasible solution. 
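For small instances, both the load factor and the feasibility of a candidate cycle are easy to verify directly. The sketch below is our own helper code: it takes the load factor to be l(K, m) = sum_k 1/m_k, which is consistent with the paper's examples l(3, (2, 2, 2)) = 3/2 and l(4, (20, 20, 3, 2)) = 14/15, and it checks the inter-pull gap condition I^i_k <= m_k from Section 4.

```python
from math import fsum

def load_factor(m):
    """l(K, m) = sum_k 1/m_k (our reading, matching the paper's examples)."""
    return fsum(1.0 / mk for mk in m)

def is_feasible_cycle(cycle, m):
    """Return True if repeating `cycle` forever keeps every arm's inter-pull
    gap (including the wrap-around gap between consecutive copies of the
    cycle) at most m[k].  Arms are 1-indexed, as in the paper."""
    C = len(cycle)
    for k, mk in enumerate(m, start=1):
        pos = [i for i, a in enumerate(cycle) if a == k]
        if not pos:                                   # a never-scheduled arm must leave
            return False
        gaps = [pos[j + 1] - pos[j] for j in range(len(pos) - 1)]
        gaps.append(C - pos[-1] + pos[0])             # gap across the cycle boundary
        if max(gaps) > mk:
            return False
    return True

# Sanity checks against the examples in the paper:
assert abs(load_factor([2, 2, 2]) - 1.5) < 1e-12      # the hard instance of Section 3.1
assert is_feasible_cycle([1, 2, 1, 3], m=[2, 4, 4])   # the feasible cycle of Section 4
assert not is_feasible_cycle([1, 2, 3], m=[2, 4, 4])  # arm 1's wrap-around gap of 3 exceeds m_1 = 2
```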
For some configurations that mink [K] mk < K but l(K, m) ln 2, though it cannot be handled by the SE algorithm, it is possible to schedule the arms such that no arm leaves early. For instance, in many configurations where ||m|| = O(ln T), it is also highly likely that l(K, m) falls below the constant ln 2 as T becomes sufficiently large, making feasible AUS cycles exist. As Theorem 4.4 indicates, Algorithm 2 finds AUS cycles of length n = O(ln T) for such configurations. As a consequence, we can find a wide range of configurations results in AUS cycles of sub-linear length, leading to the sub-linear regret of our novel algorithms. 4.3. Extension: Allowing New Entering Arms In the proposed setup, we only assume that arms can exit. In reality, a more general and realistic scenario involves both incoming and departing arms. In online advertising, as time progresses, more advertisers may attempt to display their advertisements. Similarly, in the crowdsourcing example, new workers might join the platform and search for jobs. In this section, we take a step forward and allow for the entrance of new arms in the proposed setting. There are also a total of K arms, but only the set of arms [K0] is available at the beginning, where K0 < K. The remaining arms K0 + 1, ..., K arrive later within the time horizon T. ρk denotes the time slot at the beginning of which arm k [K] enters the game. We have ρk = 0, k [K0]. Without loss of generality, we assume that arms k = K0 + 1, ..., K are ordered by their entry times: 0 < ρK0+1 ... ρK. For any k > K0, we assume that the entry time ρk and the patience mk is not known in advance. The algorithm does not know mk and when arm k becomes available until the beginning of time ρk. In this section, we define the mapping ind( ) only for the set of initial arms [K0]: mind 1(1) ... mind 1(K0). When designing performance metrics for algorithms in this new setup, we notice that it is possible for the optimal arm k to be among the new entering arms (i.e. k > K0). In this case, from t = 1 to t = ρk 1, no matter which arm we pull, there is a positive gap in reward mean when comparing at against arm k . Even the Oracle that always pulls the best possible arm yields a positive regret that is linear with respect to ρk . The expected regret (2) is no longer suitable. Instead, we introduce the expected On Multi-Armed Bandit with Impatient Arms dynamic regret t=1 µat , (9) where µ t := maxk [K]:ρk t µk is the highest reward mean of the arms that have entered the game before or at time t. To handle newly entering arms with unknown patience, we reserve special slots in the feasible cycle. In the beginning, we use only N 2 of the initially available arms with relatively low level of patience and introduce two additional virtual arms to construct the feasible cycle, where N is also the maximum number of arms in the cycle. Although mk is unknown before arm k s entrance, we assume a known lower bound m for the patience of newly entering arms. The two virtual arms are denoted as + and , and we set their patience as m+ = m = m. The reserved slots for the virtual arms in the feasible cycle are initially empty. New entering arms occupy the slots of either + or after their entrance. The condition m mk ensures that the slots of either virtual arm are sufficient to keep any entering arm. The details of FC-Entry is in Appendix G. Theorem 4.6. Run FC-Entry algorithm in Algorithm 4 with N. 
Assume N satisfies that 3 < N < K0 + 2, the set of arms {k [K0] : ind(k) N 2} and two virtual arms +, with patience m can form a feasible cycle of length n = n+ + n + P k [K0]:ind(k) N 2 nk. n+, n are the numbers of pulls of the virtual arms +, in the constructed feasible cycle, respectively. nk = 1 for k [K0] : ind(k) > N 2. For arm k > K0, nk = n+ if it takes up the slots of virtual arm + in the feasible cycle and otherwise nk = n . We set c to be (j min k [K0]:ind(k)>N 2 mk ind(k) N+2 N 3 k , (10) & 8(K N + 3)T n ln T (3 + K0 N+2 If c > 3n and 3c min K0 k, the sum is from k to s. For s k, BWD(r, s) denotes the ˆrs+1 that achieves DP(r, s), while for s > k, BWD(r, s) denotes the ˆrs 1 that achieves DP(r, s). If k > 1, the algorithm minimizes Pk k =1 ˆrk . Note that when ˆrk is fixed, the minimization of Pk 1 k =1 ˆrk is independent of ˆrk+1, ..., ˆr K. If k < K, the algorithm computes the minimum PK k =k ˆrk for each possible value of ˆr K. Then the algorithm computes the minimal objective if a feasible solution exists given ˆrk = 1 mind 1(k) . At the end of the iteration with k, the algorithm updates the minimal objective seen so far. B.2. Proof of The Correctness of Algorithm 2 Proof of Theorem 4.4. Say r is a solution of the optimization problem in Algorithm 2. By Lemma 2 in Li et al. (2021), the schedule that Algorithm 2 outputs is an AUS cycle of length PK k=1 r k r K . Since r satisfies (7) and (8), the length can be upper bounded as PK k=1 r k r K 1 r K mind 1(K) = maxk [K] mk = ||m|| . Consider a cyclic schedule composed of such AUS cycles. To show that this cyclic schedule is feasible for m, it suffices to show that k [K], i 1 : Ii k mind 1(k). We prove this by contradiction. For any k, suppose i : Ii k > mind 1(k), we have that i 1 : Ii k mind 1(k) by the definition of AUS. We consider the first AUS cycle in the schedule. By Algorithm 1 in Li et al. (2021), in the AUS cycle, nk = r k r K for any k and n = PK k=1 r k r K . The next cycle in the schedule is just a repetition of the current cycle. So i 1 : Ii k mind 1(k) implies i [nk], a[τ i k mod n]+1, ..., a[(τ i k+mind 1(k) 2) mod n]+1 = k. The i-th pull of arm k is followed by at least mind 1(k) 1 pulls of other arms for i < nk, while the nk-th pull of arm k implies there are at least mind 1(k) 1 pulls of other arms in the set of time slots [τ 1 k] {τ nk k + 1, ..., n}. Thus we have mind 1(k)nk < n since i : Ii k > mind 1(k). Let rk denote the average rate of arm k s appearance in the schedule, then rk = nk n < 1 mind 1(k) . Thus the rate rk satisfies 1 mind 1(k) > rk = r k/r K PK k=1 r k/r K r k 1 mind 1(k) . A contradiction occurs. So we have k [K], i 1 : Ii k mind 1(k) and the AUS cycle is feasible given the patience vector m. C. Proofs for The Hardness Result Proof of Lemma 3.3. Let t = λK. Suppose that Q(K,m) > t + 1. We have that from t = 1 to t = t + 1 no arm leaves. For each arm i, say tk i is the time that it is pulled the k-th time. Consider the sequence t = 0, t1 i , ..., t Ti( t) i , t + 1. 
Say j, j are On Multi-Armed Bandit with Impatient Arms Algorithm 3 A Dynamic Programming Solution to the Optimization Problem in Algorithm 2 1: Input: Patience vector m, the index mapping ind( ) 2: r (1, ..., 1), min len + 3: for k = 1, 2, ..., K do 4: ˆr (1, ..., 1), min len + 5: ˆrk 1 mind 1(k) , Dk {ˆrk}, DP(ˆrk, k) 1 mind 1(k) 6: if k > 1 then 7: Dk {lˆrk| 1 mind 1(k ) lˆrk 1, l N+} for each k = 1, ..., k 1 8: for k = k 1, ..., 1 do 9: for r Dk do 10: DP(r, k ) r + minr :r Dk +1, r r N+ DP(r , k + 1) 11: BWD(r, k ) arg minr :r Dk +1, r r N+ DP(r , k + 1) 12: end for 13: end for 14: if minr D1 DP(r, 1) > 1 then 15: Continue 16: end if 17: ˆr1 arg minr D1 DP(r, 1) 18: for k = 2, ..., k 1 do 19: ˆrk BWD(ˆrk 1, k 1) 20: end for 21: end if 22: if k < K then 23: Dk { ˆrk l | 1 mind 1(k ) ˆrk l , l N+} for each k = k + 1, ..., K 24: for k = k + 1, ..., K do 25: for r Dk do 26: DP(r, k ) r + minr :r Dk 1, r r N+ DP(r , k 1) 27: BWD(r, k ) arg minr :r Dk 1, r r N+ DP(r , k 1) 28: end for 29: end for 30: end if 31: if minr DK Pk 1 k =1 ˆrk + DP(r, K) 1 then 32: min len minr DK:Pk 1 k =1 ˆrk+DP(r,K) 1 r 1[Pk 1 k =1 ˆrk + DP(r, K)] 33: end if 34: if min len < min len then 35: min len min len 36: r k ˆrk for each k = 1, ..., k 37: r K arg minr DK:Pk 1 k =1 ˆrk+DP(r,K) 1 r 1[Pk 1 k =1 ˆrk + DP(r, K)] 38: for k = K 1, ..., k + 1 do 39: r k BWD(r k +1, k + 1) 40: end for 41: end if 42: end for 43: Return r On Multi-Armed Bandit with Impatient Arms consecutive items in the sequence, then we have j j mi. This implies that for i [K], By definition t = PK i=1 Ti( t). Summing over i, we have λK = K + PK i=1 Ti( t) t > 1 mi = l(K, m) > 1 + 1 A contradiction occurs. Thus we can conclude that Q(K,m) t + 1 = λK + 1. Proof of Theorem 3.2. Given an unschedulable configuration (K, m), i.e. Q(K,m) T, after any bandit algorithm A A pulls Q(K,m) times, there must be at least 1 arm that has left. We mainly focus on these first Q(K,m) pulls. Define a measurable space (ΩQ(K,m), FQ(K,m)) where ΩQ(K,m) = ([K] R)Q(K,m) R2Q(K,m) and FQ(K,m) is the σ-algebra of the subsets of ΩQ(K,m). Fix a positive constant , we construct µi Ξ, i [K] such that µi = (0, 0, ..., |{z} the ith entry , 0, ..., 0). Say we run any A on some µi. Define Ψ as the set of all measurable mappings ψ that maps from ΩQ(K,m) to [K]. We want to find a lower bound on inf ψ Ψ max j [K] Pr A,(K,m,µj)(ψ = j), where Pr A,(K,m,µi), i [K] are probability measures over (ΩQ(K,m), FQ(K,m)). Our derivation is similar with that of Fano s inequality in Rigollet & H utter (2019). We write Pr A,(K,m,µj) as Pj for notational simplicity. Fix ψ Ψ, define pj = Pj(ψ = j), p = 1 j=1 pj [0, 1], qj = 1 k=1 Pk(ψ = j), j=1 qj = 1 K2 j,k=1 Pk(ψ = j) = 1 K2 j,k=1 1 Pk(ψ = j) Let kl(p, p ) denote KL divergence between Bernoulli distributions with mean p, p , respectively. We have kl(p, p ) = p ln p p + (1 p) ln 1 p Denote the entropy of Bernoulli distribution h(p) = p ln p + (1 p) ln(1 p). We have on the one hand, kl( p, q) =h( p) p ln q (1 p) ln(1 q) h( p) p ln q = h( p) + p ln K K 1 =: f K( p). On Multi-Armed Bandit with Impatient Arms On the other hand, we derive an upper bound for kl( p, q), by the convexity of KL divergence, kl( p, q) 1 j=1 kl(pj, qj) 1 K2 j,k=1 kl(Pj(ψ = j), Pk(ψ = j)). Fix any j, k, we derive upper bound for kl(Pj(ψ = j), Pk(ψ = j)). We observe that probability measures Pj, Pk over (ΩQ(K,m), FQ(K,m)) induce probability measures Ber(Pj({ω ΩQ(K,m) : ψ(ω) = j})), Ber(Pk({ω ΩQ(K,m) : ψ(ω) = j})) over measurable space ({0, 1}, 2{0,1}). 
Then by data processing inequality in Lou & Goldfeld (2020), especially their Example 7.1.2(i), we have KL(Pj, Pk) KL(Ber(Pj(ψ = j)), Ber(Pk(ψ = j))) = kl(Pj(ψ = j), Pk(ψ = j)). Some sequences of a1, a2, ..., a Q(K,m) are impossible due to the leaving behaviour of arms, depending only on (K, m). However, an arm sequence is possible with respect to (K, m, µj) if and only if it is possible with respect to (K, m, µk). As a result, Pj is absolutely continuous with respect to Pk. With this fact in mind, we can use Lemma 15.1 in Lattimore & Szepesv ari (2020) to show that KL(Pj, Pk) = i=1 Ej[Ti(Q(K,m))]KL(N(µj,i, 1), N(µk,i, 1)) 2 (Ej[Tj(Q(K,m))] + Ej[Tk(Q(K,m))]) Merging the derivations above, we obtain that f K( p) 1 K2 2 Q(K,m) = 2 Now we study the function f K(p) to obtain an upper bound for p. f K(p) = p ln K K 1 + p ln p + (1 p) ln(1 p), f K(p) = ln K K 1 + ln p + 1 ln(1 p) 1 = ln K K 1 + ln p 1 p. We see that f K(0) = 0, f K(1) = ln K K 1, f K is monotonically decreasing in [0, K 1 2K 1] and monotonically increasing in 2K 1, 1]. Thus c (0, ln K K 1], there is a unique p ( K 1 2K 1, 1] such that f K(p) = c. Setting = q 1 Q(K,m) ln K K 1, let f 1 K ( 1 2 ln K K 1) denote the unique p such that f K(p) = Q(K,m) 2 1 Q(K,m) ln K K 1 = 1 2 ln K K 1. We obtain that j=1 Pj(ψ = j) = p f 1 K (1 2 ln K K 1). inf ψ Ψ max j [K] Pj(ψ = j) inf ψ Ψ 1 K j=1 Pj(ψ = j) = inf ψ Ψ 1 K j=1 1 Pj(ψ = j) 2 ln K K 1). On Multi-Armed Bandit with Impatient Arms Define ψL as the measurable map that returns the first leaving arm during the first Q(K,m) pulls, breaking tie by returning the first leaving arm with the smallest arm index. We claim that, for any bandit policy A and configuration (K, m), there is j [K] such that Pj(ψL = j) 1 f 1 K ( 1 2 ln K K 1). Suppose not, then there exist A, (K, m) that for all j [K], Pj(ψL = j) < 1 f 1 K ( 1 2 ln K K 1), then setting = q 1 Q(K,m) ln K K 1, 2 ln K K 1) > max j [K] Pj(ψL = j) inf ψ Ψ max j [K] Pj(ψ = j) 2 ln K K 1). A contradiction occurs, so we proved the claim. For any algorithm A, say arm i A satisfies Pi A(ψL = i A) 1 f 1 K ( 1 2 ln K K 1). Now we derive the lower bound for expected regret when the ground truth reward mean vector is µi A RT (A, (K, m, µi A)) = Ei A h T X t=1 at i = Ei A h K X k =i A Tk(T) i 1 Q(K,m) ln K K 1 k =i A Ei A h Tk(T) i 1 Q(K,m) ln K K 1Ei A h T Ti A(T) i 1 Q(K,m) ln K K 1Ei A h T Ti A(T)|ψL = i A i Pi A(ψL = i A) 1 Q(K,m) ln K K 1Ei A h T Ti A(T)|ψL = i A i Pi A(ψL = i A) 1 Q(K,m) ln K K 1 h 1 f 1 K (1 2 ln K K 1) i (T Q(K,m)), where Ej is the expectation when A, (K, m, µj) is given. Finally, we note that the bound above holds for any A, and RT (A, (K, m, µi A)) sup µ Ξ RT (A, (K, m, µ)). So we can make our conclusion, for any unschedulable profile (K, m), the minimax lower bound for the expected regret is min A A sup µ Ξ RT (A, (K, m, µ)) 1 Q(K,m) ln K K 1 h 1 f 1 K (1 2 ln K K 1) i (T Q(K,m)). D. Results for UCB Algorithm In the standard regret analysis of UCB in the stochastic MAB setting (Lattimore & Szepesv ari, 2020), a critical observation is that under a high probability event (usually called a Good event) the number of pulls of sub-optimal arms is bounded. In our setup characterized by arm patience, arm k exits if there have been too many pulls of other arms after its last pull. However, if the optimal arm k s patience mk is even greater than the highest possible total number of sub-optimal pulls under the good event, then it is unlikely that arm k will exit under the good event. 
Thus, if arm k has good patience, the performance of UCB algorithm can be intuitively guaranteed. Inspired by this observation, we present the following expected regret upper bound for UCB algorithm in our setup. On Multi-Armed Bandit with Impatient Arms Theorem D.1. Run UCB algorithm with confidence parameter δ = 1/T 2 in the Multi-Armed Bandit with Impatient Arms setup. If mk > PK k =k 16 ln T 2 k the expected regret of UCB algorithm is upper bounded by RT K max + 3 where max := maxk =k k. Proof of Theorem D.1. Define D as the event that arm k is ignored for at least mk times, formally D = n t {1, 2, ..., T + 1 mk } : at, at+1, ..., at+mk 1 = k o . The expected regret is decomposed with respect to event D, t=1 at I{ D} + E T X t=1 at I{D} max T Pr(D) + RBase T , where RBase T denotes E PT t=1 at I{ D} , whose analysis resembles that of UCB algorithm in the vanilla MAB setting. Following the standard analysis of UCB algorithm, we can define good events when the sample means of each arm lie in their corresponding confidence intervals. Now we define good events. Ek := n µ < min n [T ] ˆµk ,n + Ek := n ˆµk,uk + uk < µ o , k = k . If mk is large enough, formally mk > PK k =k uk, by total probability rule, Pr(D) = Pr D k=1 Ek Pr K \ k=1 Ek + Pr D k=1 Ek Pr K [ k=1 Ek + Pr K [ k=1 Pr Ek . Conditioned on event TK k=1 Ek, if t s.t. at, at+1, ..., at+mk 1 = k , there must be at least one arm denoted k, that it is selected more than uk times even during the time interval from t to t + mk 1. However under event Ek , Ek, and the fact that arm k may deviate in advance, arm k cannot be selected for more than uk times. There is a contradiction, so Pr(D| TK k=1 Ek) = 0. We see that Pr(D) PK k=1 Pr( Ek) as a result. On Multi-Armed Bandit with Impatient Arms On the other hand, RBase T =E T X t=1 at I{ D} t=1 k I{at = k}I{ D} k =k k E Tk(T)I{ D} k =k k E Tk(T)I{Ek , Ek}I{ D} + E Tk(T)I{ Ek Ek}I{ D} k =k k uk + E Tk(T)I{ Ek Ek}I{ D} k =k k uk + TE I{ Ek Ek}I{ D} k =k k uk + T Pr( Ek Ek) . The first inequality holds since under event D, arm k never leaves the game and under event Ek Ek, no matter whether arm k derivates, it can be selected for no more than uk times. Even if it is pulled for uk times, it will never be pulled again since arm k s UCB index is always larger. By Chernoff s inequality, say ck, k = k satisfy k q Pr( Ek ) = Pr [ n [T ] {µ ˆµk ,n + n=1 Pr µ ˆµk ,n + Pr( Ek) Pr(ˆµk,uk µk ck k) exp ukc2 k 2 k 2 Then for the expected regret when mk > PK k =k uk, k =k k uk + T Pr( Ek Ek) + max T k=1 Pr( Ek) k =k k T Tδ + exp ukc2 k 2 k 2 + max TTδ + max T k =k exp ukc2 k 2 k 2 k =k k + max + k =k ( k + max)T exp ukc2 k 2 k 2 We set uk = 2 ln 1/δ (1 ck)2 2 k and obtain k =k k l 2 ln 1/δ (1 ck)2 2 k k =k k + max + k =k ( k + max)T 1 2c2 k (1 ck)2 . On Multi-Armed Bandit with Impatient Arms By setting ck = 1 2, k = k , we obtain k =k k l8 ln 1/δ k =k k + max + k =k ( k + max) In Theorem D.1, we derive a problem-dependent upper bound for a fixed µ. Based on Theorem D.1, we further present a problem-independent expected regret upper bound for a general set of mean reward vectors in Corollary D.2. Corollary D.2. Run UCB algorithm with confidence parameter δ = 1/T 2 in the Multi-Armed Bandit with Impatient Arms setup. For a constant ϵ > 0, consider a set of mean reward vectors Ξϵ = µ ϵ k , k = k }. For any instance (K, m, µ) satisfying: 1. µ Ξϵ, 2. mk > (K 1) 16ϵ 2 ln T , k [K], the expected regret of UCB algorithm is uniformly upper bounded by RT K max + 3 k =k k + 8 p (K 1)T ln T, where max := maxk =k k. 
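As a quick numerical illustration of how demanding these patience conditions are, the snippet below (our own helper code; names are ours) evaluates the threshold of Theorem D.1 and the uniform threshold of Corollary D.2.

```python
import math

def patience_needed_thm_d1(gaps, T):
    """Theorem D.1: the optimal arm needs m_{k*} > sum_{k != k*} 16 ln T / Delta_k^2."""
    return sum(16.0 * math.log(T) / g ** 2 for g in gaps)

def patience_needed_cor_d2(K, eps, T):
    """Corollary D.2: every arm needs m_k > (K - 1) * 16 * ln T / eps^2, which
    dominates the Theorem D.1 sum whenever every gap is at least eps."""
    return (K - 1) * 16.0 * math.log(T) / eps ** 2

# e.g. K = 5 arms, all gaps at least eps = 0.1, horizon T = 10**6:
# patience_needed_cor_d2(5, 0.1, 10**6) is roughly 8.8e4 ignored rounds of patience.
```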
In Corollary D.2, we show that UCB algorithm maintains almost the same performance as in stochastic MAB setting, under the assumption that all arm have relatively high level of patience. Proof of Corollary D.2. mk > (K 1) 16ϵ 2 ln T , k [K] implies that µ Ξϵ, we have mk > (K 1) 16 ln T ϵ2 PK k =k 16 ln T 2 k . We consider two cases for the value of ϵ: 16(K 1)T 1 ln T. In this case, µ satisfies that k p 16(K 1)T 1 ln T, k = k . We directly adopt Theorem D.1 and obtain RT K max + 3 k =k k + 4 p (K 1)T ln T (K 1)T ln T + 3 k =k k + K max. (2) ϵ < := p 16(K 1)T 1 ln T. Following the derivation in the proof of Theorem D.1, the operation on the first term On Multi-Armed Bandit with Impatient Arms of the regret decomposition remains the same, since mk > PK k =k 16 ln T 2 k . So we focus on the second term, k =k k E Tk(T)I{ D} = X k =k : k k E Tk(T)I{ D} + X k =k : k< k E Tk(T)I{ D} k =k : k k E Tk(T)I{Ek , Ek}I{ D} + E Tk(T)I{ Ek Ek}I{ D} k =k : k k uk + E Tk(T)I{ Ek Ek}I{ D} k =k : k k uk + TE I{ Ek Ek}I{ D} k =k : k k uk + T Pr( Ek Ek) . Merging the two terms, we also have k =k : k k uk + T Pr( Ek Ek) + max T k=1 Pr( Ek) k =k : k kuk + k =k k + max + k =k ( k + max)T exp( ukc2 k 2 k 2 ) k =k : k k 2 ln 1/δ (1 ck)2 2 k + 1 + 2 k =k k + K max k =k k + K max T + 16(K 1) ln T k =k k + K max (K 1)T ln T + 3 k =k k + K max, with ck = 1 2 for any k = k . Proof of Theorem 3.5. Fix constants aβ, bβ, κT , g T < T, and x1 > x2 > µ . The precise values of these constants will be specified later. We define some events for arm k , Ek := Ek (x1, x2) = min n [κT 1] UCBk (n) > x1, UCBk (κT ) < x2 , and for arm β, Eβ := Eβ(x1, x2) = x2 < UCBβ(aβ) < x1, min n [bβ] UCBβ(n) > x2 . We will show later that under Ek Eβ with our value assignments, the optimal arm k is only pulled at most κT times before it leaves. The number of sub-optimal pulls is at least T κT in total. Set κT 1 > x2 = µ + On Multi-Armed Bandit with Impatient Arms κT , g T are positive and non-decreasing w.r.t T. Now we compute lower bounds for both Pr(Ek (x1, x2)) and Pr(Eβ(x1, x2)). Pr(Ek (x1, x2)) = Pr min n [κT 1] UCBk (n) > x1, UCBk (κT ) < x2 = Pr UCBk (κT ) < x2 min n [κT 1] UCBk (n) > x1 Pr min n [κT 1] UCBk (n) > x1 j=1 Xk ,j µ < g T min n [κT 1] UCBk (n) > x1 Pr min n [κT 1] UCBk (n) > x1 = Pr Xk ,κT µ + j=1 Xk ,j µ < g T min n [κT 1] UCBk (n) > x1 Pr min n [κT 1] UCBk (n) > x1 = Pr min n [κT 1] UCBk (n) > x1 Z + 0 Pr Xk ,κT µ < s g T S(k ) κT 1 = s, min n [κT 1] UCBk (n) > x1 f S(k ) κT 1 = s min n [κT 1] UCBk (n) > x1 ds, where we denote that S(i) t = Pt j=1(Xi,j µi). Say Φ( ) is the cdf of the standard Gaussian distribution. Since Xk ,κT µ is standard Gaussian and it is independent of any other random variables, Pr Xk ,κT µ < s g T S(k ) κT 1 = s, min n [κT 1] UCBk (n) > x1 = Φ( s g T ) = 1 Φ(s + g T ). Now we consider the conditional probably density f S(k ) κT 1 = s minn [κT 1] UCBk (n) > x1 . By Bayes Theorem, f S(k ) κT 1 = s min n [κT 1] UCBk (n) > x1 = Pr minn [κT 1] UCBk (n) > x1 S(k ) κT 1 = s Pr minn [κT 1] UCBk (n) > x1 f(S(k ) κT 1 = s). Since S(k ) κT 1 N(0, κT 1), we have f(S(k ) κT 1 = s) = 1 2π(κT 1) exp( s2 2(κT 1)). It suffices to lower bounding Pr minn [κT 1] UCBk (n) > x1 S(k ) κT 1 = s . 
Pr min n [κT 1] UCBk (n) > x1 S(k ) κT 1 = s = Pr n = 1, ..., κT 1 : S(k ) n > ( n κT 1 n) p 2 ln 1/δ S(k ) κT 1 = s Pr n = 1, ..., κT 1 : S(k ) n > (1 1 κT 1) 2 ln 1/δ κT 2 (n κT + 1) S(k ) κT 1 = s =1 Pr n = 1, ..., κT 1 : S(k ) n (1 1 κT 1) 2 ln 1/δ κT 2 (n κT + 1) S(k ) κT 1 = s 1 Pr t [0, κT 1] : B(t) (1 1 κT 1) 2 ln 1/δ κT 2 (t κT + 1) B(κT 1) = s , where {B(t), t 0} is standard Brownian motion. The first inequality is by convexity. The last inequality is by the fact that the joint distribution of S(k ) 1 , ..., S(k ) κT 1 is identical to that of B(1), ..., B(κT 1). To proceed, we introduce a boundary-crossing property of Brownian bridge mentioned in Scheike (1992). Lemma D.3 (Proposition 3, Scheike (1992)). If B(t) is standard Brownian motion, and a > 0, T < , and c < a + b T, then Pr [ t [0,T ] (B(t) a + bt) B(T) = c = exp 2a(a + b T c)/T . If a 0, the probability is 1. On Multi-Armed Bandit with Impatient Arms a = (κT 1)(1 1 κT 1) 2 ln 1/δ κT 2 , b = (1 1 κT 1) 2 ln 1/δ κT 2 , c = s and adopting Lemma D.3, we obtain Pr min n [κT 1] UCBk (n) > x1 S(k ) κT 1 = s 1 exp 2(1 1 κT 1) 2 ln 1/δ κT 2 s . Now we have Pr(Ek (x1, x2)) Z + 0 [1 Φ(s + g T )] 1 exp 2(1 1 κT 1) 2 ln 1/δ κT 2 s # f(S(k ) κT 1 = s)ds 1 exp 2(1 1 κT 1) 2 ln 1/δ κT 2 g T # Z + 4 + (s + g T )2 + s + g T 2π exp (s + g T )2 2π(κT 1) exp s2 in the last inequality we use Komatu s lower bound mentioned in Duembgen (2010). To further derive a lower bound for Pr(Ek (x1, x2)), we consider Z + 4 + (s + g T )2 + s + g T 2π exp (s + g T )2 4 + 4s2 + 2s 1 2π exp (s + g T )2 1 + g2 T + g T 2π exp (s + g T )2 1 + g2 T + g T exp r κT κT 1s + κT (g T + 1 1 + g2 T + g T exp s + (g T + 1 g T ) κT 1 1 + g2 T + g T exp r κT κT 1 + (g T + 1 Also by Komatu s lower bound, r κT κT 1 + (g T + 1 r κT κT 1 + (g T + 1 κT κT 1 + (g T + 1 g T ) q κT 2 + g T q κT κT 1 + (g T + 1 g T ) q On Multi-Armed Bandit with Impatient Arms r κT κT 1 + (g T + 1 r κT κT 1 + (g T + 1 1 + g2 T + g T . Then we get Pr(Ek (x1, x2)) C(T) 1 " 1 exp 2(1 1 κT 1) 2 ln 1/δ κT 2 g T # r κT κT 1 + (g T + 1 1 exp 2(1 1 κT 1) 2 ln 1/δ κT 2 g T # exp 1 (2 + 1 2(κT 1))g2 T . Now we turn to lower bounding Pr(Eβ(x1, x2)). Pr(Eβ(x1, x2)) = Pr x2 < UCBβ(aβ) < x1, min n [bβ] UCBβ(n) > x2 = Z aβ(x1 µβ) 2aβ ln 1/δ Pr min n [aβ] UCBβ(n) > x2, min n=aβ+1,...,bβ UCBβ(n) > x2 S(β) aβ = s f(S(β) aβ = s)ds. Since minn [aβ] UCBβ(n) > x2 and minn=aβ+1,...,bβ UCBβ(n) > x2 are independent conditioned on S(β) aβ , we have Pr min n [aβ] UCBβ(n) > x2, min n=aβ+1,...,bβ UCBβ(n) > x2 S(β) aβ = s = Pr min n [aβ] UCBβ(n) > x2 S(β) aβ = s Pr min n=aβ+1,...,bβ UCBβ(n) > x2 S(β) aβ = s . We first derive a lower bound for Pr minn [aβ] UCBβ(n) > x2 S(β) aβ = s as we did for Pr minn [κT 1] UCBk (n) > x1 S(k ) κT 1 = s . Pr min n [aβ] UCBβ(n) > x2 S(β) aβ = s = Pr n = 1, ..., aβ : S(β) n > n(x2 µβ) p 2n ln 1/δ|S(β) aβ = s Pr n = 1, ..., aβ : S(β) n > [(x2 µβ) 1 aβ + 1 2 ln 1/δ]n (1 1 aβ + 1) p 2 ln 1/δ S(β) aβ = s 1 exp 2(1 1 aβ + 1) p 2 ln 1/δ aβ(x2 µβ) + q 2aβ ln 1/δ + s a 1 β , On Multi-Armed Bandit with Impatient Arms where the last inequality is by Lemma D.3. 
Then we consider Pr minn=aβ+1,...,bβ UCBβ(n) > x2 S(β) aβ = s , Pr min n=aβ+1,...,bβ UCBβ(n) > x2 S(β) aβ = s = Pr n = aβ + 1, ..., bβ : S(β) aβ + j=1+aβ (Xβ,j µβ) > n(x2 µβ) p 2n ln 1/δ S(β) aβ = s = Pr n = aβ + 1, ..., bβ : s + j=1+aβ (Xβ,j µβ) > n(x2 µβ) p = Pr n = 1, ..., bβ aβ : s + S(β) n > (n + aβ)(x2 µβ) q 2(n + aβ) ln 1/δ Pr n = 1, ..., bβ aβ : S(β) n > (bβ aβ)(x2 µβ) p 2 ln 1/δ( p bβ aβ) bβ aβ n + aβ(x2 µβ) q 2aβ ln 1/δ s = Pr n = 1, ..., bβ aβ : S(β) n > x2 µβ n + aβ(x2 µβ) q 2aβ ln 1/δ s = h 1 Pr n = 1, ..., bβ aβ : S(β) n x2 µβ n + aβ(x2 µβ) q 2aβ ln 1/δ s i , where the inequality is due to convexity. We also note that, the joint distribution of S(β) 1 , ..., S(β) bβ aβ is identical to that of B(1), ..., B(bβ aβ). Thus we can further lower bound Pr(Eβ(x1, x2)) as Pr(Eβ(x1, x2)) Z aβ(x1 µβ) 2aβ ln 1/δ Pr min n [aβ] UCBβ(n) > x2 S(β) aβ = s h 1 Pr t [0, bβ aβ] : t + aβ(x2 µβ) q 2aβ ln 1/δ s i f(S(β) aβ = s)ds h 1 exp (1 1 aβ + 1) p 2 ln 1/δ g T i Z aβ(x1 µβ) Pr t [0, bβ aβ] : B(t) x2 µβ t + aβ(x2 µβ) q 2aβ ln 1/δ s i f(S(β) aβ = s)ds =: h 1 exp (1 1 aβ + 1) p 2 ln 1/δ g T Now it suffices to lower bound Pβ. Again, here we need a proposition in Scheike (1992) that characterizes the probability of Brownian motion crossing a linear boundary within a finite time horizon. Lemma D.4 (Proposition 2, Scheike (1992)). If B(t) is standard Brownian motion, and a > 0, T < , then t [0,T ] (B(t) a + bt) = 1 Φ( a T) + exp( 2ab)Φ( a If a 0, the probability is 1. a = s aβ(x2 µβ) + q 2aβ ln 1/δ, b = x2 µβ On Multi-Armed Bandit with Impatient Arms we have that Pβ Z aβ(x1 µβ) h Φ s aβ(x2 µβ) + p 2aβ ln 1/δ p bβ aβ x2 µβ exp 2 x2 µβ s aβ(x2 µβ) + q Φ s aβ(x2 µβ) + p 2aβ ln 1/δ p bβ aβ x2 µβ bβ aβ i f(S(β) aβ = s)ds. For notational convenience, we denote bβ aβ, B = s aβ(x2 µβ) + p 2aβ ln 1/δ p For λ, η, a1, a2 (0, 1), we set (1 η) 2 ln 1/δ (x1 µβ)2 (1 + λ) 2 ln 1/δ (x2 µβ)2 g T = (ln T)a1, κT = 3 + (ln T)a2 . We require that a1, a2 satisfy the following conditions: In the below, we take T + and evaluate the asymptotic rates of aβ, bβ, A, B. lim T + aβ (ln T)a2 = (1 η) lim T + 1 (ln T)a2 4 ln T β + q 4 ln T κT 1 2 = (1 η) lim T + κT 1 (ln T)a2 = 1 η, lim T + bβ (ln T)a2 = (1 + λ) lim T + 1 (ln T)a2 4 ln T β + q 2 > a1 by condition 1., we further have lim T + bβ (ln T)a2 = (1 + λ) lim T + 1 (ln T)a2 4 ln T β + q lim T + bβ aβ (ln T)a2 = (1 + λ) (1 η) = λ + η, (ln T) 1 2 = lim T + 1 = lim T + 1 4 ln T( 1 κT 1 p bβ + aβ ) g T On Multi-Armed Bandit with Impatient Arms we observe that lim T + 1 κT 1 4 ln T/(ln T) 1 2 a2 2 = 2 1 1 1+λ+ 1 η lim T + g T κT /(ln T)a1 a2 = 1 > 0, and again 1 2 > a1 a2, thus (ln T) 1 2 = lim T + 1 4 ln T( 1 κT 1 p bβ + aβ ) p 1 + λ + 1 η Finally we consider B. Note that B depends on the value of s, and in the integral s h aβ( x1+x2 2aβ ln 1/δ, aβ(x1 µβ) p 2aβ ln 1/δ i . Thus B is bounded by 0 < aβ(x1 x2) bβ aβ B aβ(x1 x2) p Now it suffices to evaluate (ln T)a1 a2 2 aβ(x1 x2) p bβ aβ = lim T + 1 (ln T)a1 a2 4 ln T κT 1 We observe that lim T + q 4 ln T κT 1 q /(ln T) 1 2 3 2 a2 = 1 > 0 and that by condition 2, for a1, a2, (ln T)a1 a2 2 aβ(x1 x2) p bβ aβ = lim T + 1 (ln T)a1 a2 g T κT = 1 η λ + η . Comparing A and B, since 1 2 , we have that T0 N+: T > T0, A > B. And we also have that s h aβ( x1+x2 2aβ ln 1/δ, aβ(x1 µβ) p 2aβ ln 1/δ i 4 + (A B)2 + (A B) 2 p 2 + (A + B)2 + (A + B) 4 + A aβ(x1 x2) 2 + A aβ(x1 x2) 2 + A + aβ(x1 x2) 2 + A + aβ(x1 x2) Denoting B = aβ(x1 x2) bβ aβ , we have that T1 N, T > T1, AB > 1/2. 
By straightforward calculation, we can verify that for any T > T1, 4 + (A B)2 + (A B) 2 p 2 + (A + B)2 + (A + B) > 0. 4 + (A B)2 + (A B) 2 p 2 + (A + B)2 + (A + B) 2 + (A + B)2 + A + B p 4 + (A B)2 A + B 2B =1 + lim T + 2 + (A + B)2 p 2B = 1 + lim T + 4AB 2 2 + (A + B)2 + p 4 + (A B)2) =1 + lim T + 4A 2 + (A + B)2 + p 4 + (A B)2) = 2. On Multi-Armed Bandit with Impatient Arms Adopting both Komatu s lower bound and upper bound simultaneously, we again focus on Pβ Pβ Z aβ(x1 µβ) h Φ(B A) exp(2AB) Φ( B A) i f(S(β) aβ = s)ds Z aβ(x1 µβ) 4 + (A B)2 + (A B) exp( 1 exp(2AB) 2 p 2 + (A + B)2 + (A + B) exp( 1 2(A + B)2) i f(S(β) aβ = s)ds = Z aβ(x1 µβ) 4 + (A B)2 + (A B) 2 + (A + B)2 + (A + B) 2(A B)2)f(S(β) aβ = s)ds. When T > max{T0, T1}, noting that S(β) aβ N(0, aβ), 4 + (A B)2 + (A B) 2 p 2 + (A + B)2 + (A + B) Z aβ(x1 µβ) 2(A B)2)f(S(β) aβ = s)ds 4 + (A B)2 + (A B) 2 p 2 + (A + B)2 + (A + B) Z aβ(x1 µβ) 2A2) 1 p2πaβ exp( s2 Since the upper limit of the integral is negative, i.e. aβ(x1 µβ) q 2aβ ln 1/δ (1 η p 1 η) 2 ln 1/δ x1 µβ < 0, η (0, 1), we have that 2A2 (aβ( x1+x2 2aβ ln 1/δ)2 2A2 (aβ(x2 µβ) p 2aβ ln 1/δ)2 On Multi-Armed Bandit with Impatient Arms Now we can get rid of the integral and rewrite the lower bound for Pβ as 4 + (A B)2 + (A B) 2 p 2 + (A + B)2 + (A + B) aβ(x1 µβ) q 2aβ ln 1/δ aβ(x1 + x2 4 + (A B)2 + (A B) 2 p 2 + (A + B)2 + (A + B) 2aβ (x1 x2) aβ if T > max{T0, T1}. Since we have shown that events Ek and Eβ happen with non-negligible probability, now we focus on the expected regret lower bound of UCB algorithm, t=1 at = E T X t=1 at(I(Ek Eβ) + I( Ek Eβ)) t=1 at I(Ek Eβ) = E I(Ek Eβ) t=1 I(at = i) min E I(Ek Eβ) t=1 I(at = i) = min E I(Ek Eβ) t=1 I(at = k ) . Recall that by assumption, mk = O((ln T)θ), θ (0, 1). Given this θ, ϵ > 0: max{ 1 2, θ} + 2ϵ < 1. We set a1 = 1 2, θ} + ϵ, a2 = max{ 1 2, θ} + ϵ. Obviously a1, a2 (0, 1). Here we validate the conditions that a1, a2 are required to satisfy: 2ϵ < a1 = 1 2, θ} + ϵ < 1 Fix any γ (0, 1), we set 16, η = min n γ 32, 1 1 (1 + γ/4 Also obviously, λ, η (0, 1). We now show that, under the event Ek Eβ and when T is large enough, the optimal arm is pulled at most κT times before leaving. Under both events, when the number of pulls of arm k is smaller than κT , (1) arm k s UCB index is greater than x1 and (2) the number of pulls of arm β is no greater than aβ. When the number of pulls of arm k is κT 1 and at some time arm k s UCB index is the largest among all active arms, it will be pulled the κT -th time. After that, arm k s UCB index is below x2. By Eβ, we know that before arm k s UCB index becomes the largest again, at least the (aβ + 1)-th, (aβ + 2)-th, ..., (bβ + 1)-th pulls of arm β need to be done. The total number of sub-optimal pulls (contributed by arm β) after arm k is pulled the κT -th time is at least bβ aβ + 1, not to mention the arms other than k and β. By our assignment, a2 > θ. Besides, we have shown that lim T + bβ aβ (ln T )a2 = λ + η > 0. As a result, lim T + mk bβ aβ + 1 = lim T + mk (ln T)a2 (ln T)a2 bβ aβ + 1 = 0. On Multi-Armed Bandit with Impatient Arms Thus T2 N+, T > T2, mk < bβ aβ + 1. Under the event Ek Eβ, when T > T2, arm k is pulled at most κT times. We obtain that when T > T2, RT min E I(Ek Eβ) t=1 I(at = k ) min E I(Ek Eβ) (T κT ) = min Pr(Ek Eβ)(T κT ) = min Pr(Ek ) Pr(Eβ)(T κT ), noticing that Ek Eβ | x1, x2. For Ek , we have 0 lim T + T 1 Pr(Ek ) lim T + T 1 C(T) 1 exp 1 + (2 + 1 2(κT 1))g2 T , where we have used the fact that lim T + 1 exp 2(1 1 κT 1) 2 ln 1/δ κT 2 g T = 1, 2 + a1 a2 > 0. 
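For later reference, the limits established above can be collected in one display. This is only a restatement of what the derivation shows, with a_1, a_2, λ, η the constants fixed in this proof:
\[
g_T = (\ln T)^{a_1}, \qquad \kappa_T = 3 + (\ln T)^{a_2}, \qquad
\lim_{T\to\infty}\frac{a_\beta}{(\ln T)^{a_2}} = 1-\eta, \qquad
\lim_{T\to\infty}\frac{b_\beta}{(\ln T)^{a_2}} = 1+\lambda, \qquad
\lim_{T\to\infty}\frac{b_\beta-a_\beta}{(\ln T)^{a_2}} = \lambda+\eta .
\]
Since m_{k^*} = O((\ln T)^{\theta}) with \theta < a_2, these scalings give
\[
\lim_{T\to\infty}\frac{m_{k^*}}{b_\beta-a_\beta+1} = 0 ,
\]
which is the fact used above to conclude that, under E_{k*} ∩ E_β and for T large enough, the optimal arm is pulled at most κ_T times before it leaves.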
Furthermore, 0 lim T + T 1 Pr(Ek ) lim T + C(T) 1 C(T) 1 exp 1 + (2 + 1 2(κT 1))g2 T γ 2 ln T = 0, because 2a1 < 1. Thus lim T + T 1 Pr(Ek ) = 0, Pr(Ek ) = Ω(T 1 2 γC(T) 1). For Eβ, since Pβ LB, T > max{T0, T1}, define 2(bβ aβ) + 1 we note that lim T + h(T) = lim T + 1 2 bβ aβ (ln T)a2 2 + 2(ln T) bβ (ln T )a2 + q aβ (ln T )a2 2 aβ (ln T)a2 2 + 2(ln T) 2 2 q aβ (ln T )a2 = lim T + 1 2 bβ aβ (ln T)a2 bβ (ln T )a2 + q aβ (ln T )a2 2 aβ (ln T)a2 κT 2 q aβ (ln T )a2 since a2 < 1 and a1 < 1 lim T + h(T) 2(λ + η) h 2 2 1 + λ + 1 η 2(1 η) h 2 2 1 η =2(λ + η) h 1 1 1 + λ + 1 η i2 + 2(1 η) h 1 1 1 η 2(λ + η) + 2 h 1 1 1 η On Multi-Armed Bandit with Impatient Arms lim T + exp(h(T) γ 2 ln T) = 0. For Eβ, we have 0 lim T + T 1 2 γ(ln T) 1 Pr(Eβ) lim T + T 1 2 γ(ln T) 1 lim T + T 1 2 γ(ln T) 1 4π (x1 x2) aβ = lim T + 8πA2(ln T) 1 (x1 x2)2(aβ) 3 2 bβ aβ exp(h(T) γ 2 ln T) = 0, 2 + a1 a2 > 0. Note that we have used lim T + 1 exp (1 1 aβ + 1) p 2 ln 1/δ g T Thus lim T + T 1 2 γ(ln T ) 1 Pr(Eβ) = 0, Pr(Eβ) = Ω T 1 2 γ(ln T) 1 . Finally, we reach a conclusion that RT = Ω min T 1 γ C(T) ln T 1 for γ (0, 1). E. Proofs for SE Algorithm Proof of Proposition 3.6. Construct a problem instance: K = 4, m = (20, 20, 3, 2), and µ = (0, 0, 0, ). In this case, l(K, m) = 14 15 < 1. Since SE algorithm pulls a1 = 1, a2 = 2, arm 4 exits the game at the end of time 2. The optimal arm is never pulled during the whole game, thus the expected regret RT = T. F. Proofs for FC-SE Algorithm Proof of Theorem 4.1. Inspired by the regret analysis of the SE algorithm in Lancewicki et al. (2021), we define τk as the last time the algorithm attempts to do arm elimination but arm k is NOT eliminated. Clearly, by this definition, at the end of the next feasible cycle, arm k is eliminated. Arm k is active for only one more cycle, so we have Tk(T) Tk(τk) + nk. At the end of time τk, since arm k is not eliminated, ˆµk ,Tk (τk) 2 ln T Tk (τk) ˆµk,Tk(τk) + 2 ln T Tk(τk). (12) We define the good event G = n t [T], k [K] : |ˆµk,Tk(t) µk| 2 On Multi-Armed Bandit with Impatient Arms meaning that the empirical means are close to the true reward means for any arm at any time. Then we show that G happens with high probability by presenting an upper bound on Pr( G), Pr( G) = Pr t [T], k [K] : ˆµk,Tk(t) µi > 2 ln T Tk(t) ˆµk,Tk(t) µk < 2 Pr n [T], k [K] : ˆµk,n µk > 2 n ˆµk,n µk < 2 h Pr ˆµk,n µk > 2 + Pr ˆµk,n µk < 2 where in the last inequality, we used a sub-Gaussian tail bound in Lattimore & Szepesv ari (2020): Lemma F.1 (Corollary 5.5, Lattimore & Szepesv ari (2020)). Assume that Xi µ are independent, σ-sub-Gaussian random variables. Then for any ϵ 0, Pr(ˆµ µ + ϵ) exp( nϵ2 2σ2 ), Pr(ˆµ µ ϵ) exp( nϵ2 where ˆµ = n 1 Pn i=1 Xi. Under the good event G, the optimal arm k is never eliminated since ˆµk ,Tk (t) + 2 q ln T Tk (t) µ > µk ˆµk,Tk(t) ln T Tk(t), for all t, i. Furthermore, at time τk, by (12) we have ln T Tk (τk) µk + 4 ln T Tk(τk) ln T Tk (τk) + 4 ln T Tk(τk). The FC-SE algorithm behaves in a strictly cyclic manner. An arm is pulled the same number of times in each feasible cycle. Since τk is at the end of a feasible cycle, we observe that nk = Tk (τk) Using this fact, we obtain nk Tk(τk) + 4 ln T Tk(τk) = 4 ln T Tk(τk) which directly implies Tk(τk) 16 ln T Tk(T) nk + 32 ln T On Multi-Armed Bandit with Impatient Arms The expected regret of FC-SE is then upper bounded as t=1 at = E K X k =k k Tk(T) k =k k Tk(T)[I(G) + I( G)] k =k k E Tk(T)I(G) + max T Pr( G) knk + 32 ln T Proof of Theorem 4.2. 
Let rk be the index of cycle at the end of which arm k enters the feasible cycle. We say that the arms in the initial feasible cycle enter in the 0-th cycle, k : ind(k) N, rk = 0. Since in the N < K case, there are some arms entering the feasible cycle later in the game, the fact (14) in the proof of Theorem 4.1 no longer holds. Say t is the time that a feasible cycle ends. Arm k, k have entered the feasible cycle before or at time t and not been eliminated before time t. Then we have nk + rk = Tk (t) nk + rk . (15) Following the analysis in the proof of Theorem 4.1, under the good event G (we use the definition in (13)), if arm j is not eliminated by the optimal active arm i at the end of some cycle t, we have ln T Tj(t) + 4 ln T Ti (t). Using (15), we obtain ln T Tj(t) + 4 nj Tj(t) + ni (rj ri ). We consider the following three different cases and get a uniform upper bound for Tj(t) (1) rj ri . Arm j enters either after or simultaneously with the optimal active arm. We have ln T Tj(t) + 4 (2) rj < ri and Tj(t) > nj(ri rj). We have Tj(t) > Tj(t) nj(ri rj). Thus ln T Tj(t) nj(ri rj) + 4 nj Tj(t) + ni (rj ri ) ln T Tj(t) nj(ri rj) On Multi-Armed Bandit with Impatient Arms nj ri rj + 16 ln T (3) rj < ri and Tj(t) nj(ri rj). This directly implies In conclusion, we have nj max(0, ri rj) + 16 ln T if arm j is not eliminated by i at t under the good event G. Due to the patience tightening process in FC-SE, arm ind 1(N + a) is assigned to checkpoint a N 1 c, for a [K N]. Let C denote the set of all checkpoints, then a = 1, ..., K N = {c1, c2, ..., c|C|}, where c1, c2, ... is the ordered sequence of checkpoints: c1 < c2 < ... < c|C|. Now we go through the whole process from the beginning. Let A0 := {k [K] : ind(k) N} be the arms in the initial cycle. Define k (0) = arg maxk A0 µk. Consider the first time t1 that triggers the checkpoint c1 = c. It satisfies t1 + P k S > c1 n, where S is the set of arms in the feasible cycle after the arm elimination at time t1 1. Let R1 denote the number of the operated feasible cycles from time slot 1 to time slot t1 1. t1 + P k S > c1 n if and only if the length of R1 feasible cycles plus the length of the feasible cycle beginning at time t1 is at least c1 n. The length of all feasible cycles is upper bounded by n, thus we have n(R1 + 1) c1 n, R1 c1 n n 1. This inspires us to define i = k (0), 16 ln T 1 ni + 1 2 + 1 c1 n For any k E0, under the good event G defined in (13), must have been eliminated either before or at the arm elimination phase of the 16 ln T 2 k (0),k 1 nk + 1 2 % -th feasible cycle by (16). Since $ 16 ln T 2 k (0),k 1 nk + 1 2 % + 1 16 ln T 1 nk + 1 2 + 1 c1 n arm k is eliminated before the condition t + P k S > c1 n is triggered. Thus for any arm k S when the condition is triggered, we have that k / E0. If k = k (0), k (0),k = 0. Otherwise, 16 ln T 2 k (0),k 1 nk + 1 2 + 1 > c1 n n 1, k (0),k < 4 1 nk + 1 = 4 Since it is possible that k (0) is unfortunately dropped at checkpoint c1, we can only guarantee that under G there is an arm in S with a mean reward at least µk (0) 8 q n 3 after the operation at checkpoint c1. The operation includes randomly dropping arms, adding new arms and rebuilding the feasible cycle. Define k (1) as the temporarily optimal arm in the feasible cycle after the operation at checkpoint c1. Then we have under the good event G, n 3, max a [K]:m a=c1 µa with µk (0) = µk (0). Define A1 = (A0 E0) {a [K] : m a = c1} as the set of arms possibly in the feasible cycle after the operation at checkpoint c1. 
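For concreteness, the checkpoint bookkeeping used in this argument can be written out in a few lines. This is an illustrative sketch (not the authors' code); it reads the assignment rule above as attaching arm ind^{-1}(N + a) to checkpoint ⌈a/(N−1)⌉·c, which is our interpretation of the displayed expression.

```python
# Sketch of the FC-SE patience-tightening checkpoints: arm ind^{-1}(N + a) is attached
# to checkpoint ceil(a / (N - 1)) * c for a = 1, ..., K - N, and C is the ordered set of
# distinct checkpoint values c_1 < c_2 < ... < c_|C|.  The ceiling and the divisor N - 1
# are our reading of the rule stated above.
import math

def fc_se_checkpoints(K: int, N: int, c: int):
    """Return (checkpoint of each late arm, ordered checkpoint list) for FC-SE."""
    per_arm = {a: math.ceil(a / (N - 1)) * c for a in range(1, K - N + 1)}
    ordered = sorted(set(per_arm.values()))
    return per_arm, ordered

if __name__ == "__main__":
    per_arm, C = fc_se_checkpoints(K=10, N=4, c=100)
    print(per_arm)   # arms a = 1..3 -> checkpoint 100, arms a = 4..6 -> checkpoint 200
    print(C)         # [100, 200], so |C| = ceil((K - N) / (N - 1)) = 2
```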
On Multi-Armed Bandit with Impatient Arms Now we consider the situation after the operation at checkpoint cj 1 where 1 < j 1 + |C|. Note that c1+|C| is actually not a checkpoint in C, but we define it as the time slot c(1 + |C|) for the convenience of our analysis. There is no any special operation at this checkpoint but the condition t + P k S nk > c1+|C| n is defined. Define Rj 1 as the number of the operated feasible cycles from time slot 1 to time slot tj 1, where tj 1 is the first time that triggers the checkpoint cj 1. tj, Rj are defined accordingly. Following the discussion concerning checkpoint c1, we have that the length of the first Rj 1 feasible cycles plus the length of the next feasible cycle is at least cj 1 n, but the length of the first Rj 1 feasible cycles is strictly less than cj 1 n. Thus we observe that at least the time slots from cj 1 n to cj n (cj cj 1 + 1 time slots in total) is covered by the Rj Rj 1 + 1 more feasible cycles after the operation at checkpoint cj 1. We have n(Rj Rj 1 + 1) cj cj 1 + 1, Rj Rj 1 cj cj 1 n 1. This also inspires us to define µk (j 1) > µi, 16 ln T (µk (j 1) µi)2 1 ni + 1 2 + 1 cj cj 1 where Aj 1 = (Aj 2 Ej 2) {a [K] : m a = cj 1} is the set of arms possibly in the feasible cycle after the operation at checkpoint cj 1 and µk (j 1) := max µk (j 2) 8 q ln T c n 3, maxa [K]:m a=cj 1 µa . For any k Ej 1, we will show that it must be eliminated before the operation at checkpoint cj. Assume that arm k has not yet been eliminated after the operation at checkpoint cj 1. By (16), under the good event G, arm k must have been eliminated either before or at the feasible cycle indexed by rk + max(0, rk (j 1) rk) + $ 16 ln T 2 k (j 1),k 1 nk + 1 2 % Before the condition t + P k S > cj n at checkpoint cj is triggered, the last feasible cycle where there can be new entrance is the Rj 1-th cycle. Thus rk, rk (j 1) Rj 1. By induction hypothesis, µk (j 1) µk (j 1). We have rk + max(0, rk (j 1) rk) + $ 16 ln T 2 k (j 1),k 1 nk + 1 2 % max(rk (j 1), rk) + 16 ln T 2 k (j 1),k 1 nk + 1 2 + 1 Rj 1 + 16 ln T (µk (j 1) µk)2 1 nk + 1 2 + 1 Rj 1 + cj cj 1 n 1 Rj 1 + Rj Rj 1 = Rj. So arm k must be eliminated before the operation at checkpoint cj. Thus for any arm k still in the feasible cycle when the condition at checkpoint cj is triggered, we have that k / Ej 1. We have either µk (j 1) µk or µk (j 1) µk < ln T c n 3. Under the good event G, there must be an arm in the feasible cycle after the operation at checkpoint cj whose mean reward is at least µk (j 1) 8 q ln T c n 3. Thus we have ln T c n 3, max a [K]:m a=cj µa under the good event G. Before formally deriving the expected regret upper bound for the FC-SE algorithm, we need to find a lower bound for µk (|C|) which indicates that randomly dropping arms does not severely decrease the highest mean reward in the active arm set. We consider two cases: On Multi-Armed Bandit with Impatient Arms (1) µ = maxk A0 µk. The optimal arm k is in the initial feasible cycle. µ = µk (0) = µk (0). By the definition of µk (j), j = 0, 1, ..., |C|, we have µk (|C|) µk (|C| 1) 8 ln T c n 3 µk (|C| 2) 2 8 ln T c n 3 ... µk (0) 8|C| ln T c n 3 = µ 8|C| ln T c n 3. (2) j {1, 2, ..., |C|}, µ = maxa [K]:m a=cj µa. Say the optimal arm enters the feasible cycle at checkpoint cj , thus µk (j ) = max ln T c n 3, max a [K]:m a=cj µa ln T c n 3, µ ) As in the last case, we also obtain µk (|C|) µk (|C| 1) 8 ln T c n 3 ... µk (j ) 8(|C| j ) ln T c n 3 µ 8|C| ln T c n 3. 
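Both cases rest on the same telescoping step, which can be written compactly as
\[
\mu_{k^*(|\mathcal{C}|)} \;\ge\; \mu_{k^*(|\mathcal{C}|-1)} - \varepsilon \;\ge\; \cdots \;\ge\; \mu_{k^*(j_0)} - \bigl(|\mathcal{C}|-j_0\bigr)\varepsilon \;\ge\; \mu^* - |\mathcal{C}|\,\varepsilon ,
\]
where ε is shorthand for the per-checkpoint loss term 8\sqrt{\ln T/(\cdot)} appearing in the displays above (with the same argument as there), j_0 = 0 in case (1), and j_0 is the index of the checkpoint at which the optimal arm enters in case (2); in both cases µ_{k*(j_0)} ≥ µ*.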
Now we derive the expected regret upper bound for the FC-SE algorithm t=1 at = E K X k =k k Tk(T) k =k k Tk(T)[I(G) + I( G)] k =k k E Tk(T)I(G) + max T Pr( G) k =k :k E|C| k E Tk(T)I(G) + X k =k :k/ E|C| k E Tk(T)I(G) + max T Pr( G). Since all arms have once entered the feasible cycle either before or at checkpoint c|C|, any arm k E|C| must be eliminated before (virtual) checkpoint c|C|+1 under the good event G. Thus arm k is pulled at most c(1 + |C|) times. For arm k / E|C|, µk is not too small (i.e. µk µk (|C|) 8 q ln T c n 3). k =k :k E|C| kc(1 + |C|) + X k =k :k/ E|C| (µ µk (|C|) + µk (|C|) µk)E Tk(T)I(G) + 2K max k =k :k E|C| kc(1 + |C|) + X k =k :k/ E|C| 8(1 + |C|) ln T c n 3E Tk(T) + 2K max k =k kc(1 + |C|) + 8(1 + |C|) ln T c n 3T + 2K max (K 1) c(1 + |C|) + 8T(1 + |C|) n ln T c 3n + 2K max. The last formula is minimized when 8T(1 + |C|) n ln T 2(K 1) (1 + |C|) n ln T (K 1) On Multi-Armed Bandit with Impatient Arms By the definition of c in (4), if j mink:ind(k)>N mk ind(k) N N 1 k > 3n + l 4T n ln T (K 1) 2 3 m , we have n ln T (K 1) 3 '! 1 + l K N + 8T 1 + l K N n ln T (K 1) 2 n ln T (K 1) 3 ! 1 + l K N + 8 1 + l K N 3 T 2 3 (n ln T) 1 3 (K 1) 1 3 1 3 + 2K max. Thus RT = O K 4 3 T 2 3 (n ln T) 1 3 . G. Details of FC-Entry Algorithm The details of Feasible Cycle-based successive elimination with new Entering arms (FC-Entry) is given in Algorithm 4. For notational simplicity, m is written to be one of the inputs for Algorithm 4, but it is important to note that the algorithm only needs the patience mk of the initially available arms (i.e. k [K0]). There is a similar patience tightening operation for the arms initially available but not included in the initial feasible cycle, as in Algorithm 1. We still define checkpoints as integer multiples of a constant c, while the c for FC-Entry algorithm is instead given in (11). S is the set contain all arms actually in the feasible cycle, while Sres is the set of arms occupying the reserved slots. When arm k enters the game at time ρk, the algorithm immediately adds it to the feasible cycle (it is possible that the algorithm randomly drops some arm in the reserved places to make room for arm k). Thus the algorithm adds arm k to S and Sres. However, the algorithm adds arm k to the feasible cycle only to prevent its departure. For the ease of theoretical analysis, we require that the rewards generated by a newly entered arm k before it is formally involved in the feasible cycle are not used to compute its estimated reward mean. A new entering arm is formally involved in the feasible cycle only in the operation at the nearest checkpoint after its entrance. It does not either participate in the arm elimination phase before that checkpoint. When a new entering arm is detected, the algorithm computes the next checkpoint if necessary. Before formally involved in the feasible cycle, the newly entered arm temporarily stays in Spre. For any arm k [K], ξk, T k record the sum of observed rewards and the number of pulls after arm k is formally involved in the feasible cycle, respectively. At the end of each feasible cycle, the algorithm checks whether the p-th checkpoint is reached. Checkpoints c1, ..., c K0 N+2 N 3 are known in advance since at these checkpoints the algorithm involves the arms in {k [K0] : ind(k) > N 2} into the feasible cycle. 
However, if after the operation at checkpoint c K0 N+2 N 3 there is still new entering arms that have not formally enter the feasible cycle, the algorithm needs to compute the previously unknown cp for p > K0 N+2 N 3 when detecting new entering arms. Since t denotes the first time slot of the next feasible cycle, t := t + P k S Sres nk + n+ + n is an upper bound for the first time slot of the cycle after the next feasible cycle. Let t denote the first time slot of the last pulled feasible cycle. The condition is not triggered then: t t + P k S S res nk + n+ + n cp n, where S , S res are the versions of S, Sres when checking the condition at time t 1. Snew is not empty, we have p K0 N+2 N 3 . Each arm k in Snew is pulled no later than t + n 1 cp 1 < cp m k mk. Thus arm k does not leave early. In Theorem 4.6 we present an expected dynamic regret upper bound for FC-Entry algorithm under the sparse entrance assumption: 3c min K0 K0} = {c1, c2, ..., c|C|}, where jk is defined such that arm k is involved in the feasible cycle at the jk-th checkpoint. Specifically, jk = 0 for k s.t. k [K0] and ind(k) N 2, while jk = ind(k) N+2 N 3 for k s.t. k [K0] and ind(k) > N 2. jk for k > K0 is On Multi-Armed Bandit with Impatient Arms Algorithm 4 FC-Entry 1: Input: Number of arms in the initial feasible cycle N, patience vector m, time horizon T, a lower bound for the new entering arms patience m, segment length c 2: Construct a feasible cycle a1, ..., an for the set of arms {k : ind(k) N 2} {+, } 3: S {k [K0] : ind(k) N 2}, Spre , Sres , p 1, e K0 + 1, a 1, ..., a n a1, ..., an 4: ai 0, i : a i {+, }. ξk 0, T k 0, k [K]. t 1, t t + P k S Sres nk + n+ + n 5: for k {k [K0] : ind(k ) > N 2} do 6: m k l ind(k) N+2 7: end for 8: while t T do 9: for i = 1, ..., n do 10: if e K and t = ρe then 11: If |Sres| = 2 randomly choose {+, }, S S {a}, Sres Sres {a} such that ai = a, i : a i = . ai 0 for i = 1, ..., n s.t. ai / S 12: Randomly choose {+, } s.t. ai = 0, i : a i = , S S {e}, Sres Sres {e}, Spre Spre {e}, ai e for i = 1, ..., n s.t. a i = 13: if p > K0 N+2 N 3 then 14: cp min{j c | j N+, t j c n} 15: end if 16: e e + 1 17: end if 18: if ai = 0 then 19: Pull at = ai and receive reward Xat,Tat(t). ξat ξat + Xat,Tat(t) and T at T at + 1 if at / Spre 20: t t + 1 21: end if 22: end for 23: S Spre n k S Spre : j S Spre, ξj ln T 1 T j ξk 24: Sres Sres S 25: ai 0 for i = 1, ..., n s.t. ai / S 26: t t + P k S Sres nk + n+ + n 27: if cp is known and t > cp n then 28: Spre , Snew {k [K0] : m k = cp} 29: while |S Sres| + |Snew| > N 2 do 30: a Unif(S Sres) 31: S S {a} 32: end while 33: ai 0 for i = 1, ..., n s.t. ai / S 34: while |Snew| > 0 do 35: a arg mink Snew ind(k) 36: S S {a}, Snew Snew {a} 37: ai a where i = min i n : a i = min{k [K0] | ind(k) N 2 and j n s.t. a j = k : aj = 0} 38: end while 39: p p + 1 40: end if 41: end while On Multi-Armed Bandit with Impatient Arms a random variable though ρk is fixed because the feasible cycles can have various lengths. Besides, since we assume that ρk for k > K0 is unknown in advance, only the first part of C (i.e. nl a N 3 m c a = 1, ..., K0 N + 2 o ) is known in the beginning of the game. Let A0 = {k [K0] : ind(k) N 2} be the set of arms in the initial feasible cycle. Define k (0) = arg maxk A0 µk. We define the same E0 as in (17). We note that if there is a new entering arm before the condition at checkpoint c1 is triggered, no arm in Sres is kicked off because in the beginning, the reserved positions in the feasible cycle are empty. 
Thus following the proof of Theorem 4.2, we also have that any arm k E0 is eliminated before checkpoint c1 under the good event G defined in (13), while any arm k S Spre when the condition t+P k S Sres nk +n+ +n > cp n is triggered satisfies that k (0),k < 4 1 nk + 1 = 4 Define k (1) as the temporarily optimal arm in the feasible cycle after the operation at checkpoint c1. Then we have under G, n 3, max a [K0]:m a=c1 µa, max a>K0:ja=1 µa with µk (0) = µk (0). Define A1 := (A0 E0) {a [K0] : m a = c1} {a > K0 : ja = 1} as the set of arms possibly in the feasible cycle after the operation at checkpoint c1. Under the good event G, µk (1) is a lower bound for the mean reward of the temporarily optimal arm in the feasible cycle after the operation at checkpoint c1. Now we consider the situation after the operation at checkpoint cj 1 where 1 < j K0 N+2 N 3 . The induction hypothesis is that under G, µk (j 1) is a lower bound for the mean reward of the temporarily optimal arm in the feasible cycle after the operation at checkpoint cj 1. Define Rj 1 as the number of the operated feasible cycles from time slot 1 to time slot tj 1, where tj 1 is the first time that triggers the checkpoint cj 1. tj, Rj are defined accordingly. When the condition at checkpoint cj 1 is triggered, we also have that t j 1 + P k S Sres nk + n+ + n cj 1 n, where t j 1 is the last time when arm elimination is performed before tj 1. Since P k S Sres nk + n+ + n is an upper bound for the length of the feasible cycle beginning at time t j 1, we have tj 1 t j 1 + P k S Sres nk + n+ + n cj 1 n. At least the time slots from cj 1 n to cj n is covered by the Rj Rj 1 + 1 more feasible cycles after the operation at checkpoint cj 1. We have n(Rj Rj 1 + 1) cj cj 1 + 1, Rj Rj 1 cj cj 1 To compute µk (j), we consider the following cases: 1. No entering arm k whose jk = j. Under G, the temporarily optimal arm k (j 1) has a reward mean at least µk (j 1). Following the analysis in the proof of Theorem 4.2, any k S when the condition at checkpoint cj is triggered satisfies that µk (j 1) µk 8 q ln T c n 3. We have µk (j) µk (j 1) 8 q ln T c n 3. 2. A new entering arm k is involved in the feasible cycle at checkpoint cj, but |Sres| < 2 when it enters the game. The new entering arm does not kick off any arm, thus the temporarily optimal arm with reward mean at least µk (j 1) must be still active. We also have µk (j) µk (j 1) 8 q ln T c n 3. 3. A new entering arm k is involved in the feasible cycle at checkpoint cj and |Sres| = 2 when it enters the game, but the temporarily optimal arm is not kicked off. As long as the temporarily optimal arm is still active, we have µk (j) µk (j 1) 8 q ln T c n 3. 4. A new entering arm k is involved in the feasible cycle at checkpoint cj and it kicks off the temporarily optimal arm when it enters the game. Let +, represent the arms in Sres when arm k enters the game. In this case, by the sparse entrance assumption, max(j+, j ) j 2. |µ+ µ | 8 q ln T c n 3 because otherwise one of {+, } will be eliminated by the other either before or at checkpoint cj 1. Thus although arm k kicks off the temporarily optimal arm, another arm in Sres has a mean reward at least µk (j 1) 8 q ln T c n 3. If this arm becomes temporarily optimal, it must be still in On Multi-Armed Bandit with Impatient Arms the feasible cycle after the operation at checkpoint cj, thus µk (j) µk (j 1) 8 q ln T c n 3. But if not, the temporarily optimal arm is in S Sres and has a reward mean at least µk (j 1) 8 q ln T c n 3. 
Though this arm may be unfortunately dropped at checkpoint cj, it guarantees that µk (j) µk (j 1) 8 q ln T c n 3 8 q ln T c n 3 = µk (j 1) 16 q ln T c n 3. Thus we define Ej 1 as ln T c n 3 > µi, 16 ln T µk (j 1) 8 q ln T c n 3 µi 2 1 ni + 1 2 + 1 cj cj 1 where Aj 1 := (Aj 2 Ej 2) {a [K0] : m a = cj 1} {a > K0 : ja = j 1}. For any arm k Ej 1, if it is in S after the operation at checkpoint cj 1, we show that it will be eliminated before checkpoint cj. By our discussion above, there is an arm v with reward mean at least µk (j 1) 8 q ln T c n 3 which survives until the operation at checkpoint cj. We have rk + max(0, rv rk) + 1 nk + 1 2 % = max(rv, rk) + 1 nk + 1 2 % max(rv, rk) + 16 ln T 1 nk + 1 2 + 1 Rj 1 + 16 ln T µk (j 1) 8 q ln T c n 3 µk 2 1 nk + 1 2 + 1 Rj 1 + cj cj 1 Rj 1 + Rj Rj 1 =Rj. Thus for any k still in the feasible cycle when the condition at checkpoint cj is triggered, we have k / Ej 1 and µk µk (j 1) 16 ln T c n 3. In conclusion, under the good event G, µk (j 1) 16 ln T c n 3, max a [K0]:m a=cj µa, max a>K0:ja=j µa Aj 1, Ej 1 can be defined the same way for j = K0 N+2 N 3 + 1, ..., |C| + 1 by setting c1+|C| = c|C| + c. The difference is that cj cj 1 can be larger than c and there is no arm dropped out at cj. The temporarily optimal arm can still be kicked off by a new entering arm. With a similar discussion as in the j K0 N+2 N 3 case, we obtain µk (j 1) 16 ln T c n 3, max a>K0:ja=j µa for j = K0 N+2 N 3 + 1, ..., |C|. Whether the optimal arm k is in the initial feasible cycle is crucial for our regret analysis. We consider the following cases: On Multi-Armed Bandit with Impatient Arms 1. k > K0. The optimal arm k enters later in the game. Let κ 0 = arg maxk [K0] µk be the optimal arm among those enter at the beginning of the game. Let κ j = min{k > K0 : µk > µκ j 1} be the first entering arm whose mean reward is greater than µκ j 1. Integer τ is defined such that κ τ = k , we have τ > 0 since k > K0. By these definitions, we have µκ 0 < µκ 1 < ... < µκ τ = µk = µ . And we can rewrite the expected dynamic regret RT as t=1 µat = E ρκ 1 1 X t=1 (µκ 0 µat) + (µκ 1 µat) + ... + t=ρk (µ µat) =E ρκ 1 1 X t=1 (µκ 0 µat)(I{G} + I{ G}) + (µκ 1 µat)(I{G} + I{ G}) + ... t=ρk (µ µat)(I{G} + I{ G}) t=1 (µκ 0 µat)I{G} + (µκ 1 µat)I{G} + ... + t=ρk (µ µat)I{G} + t=1 at I{ G} T max Pr( G) + k =k E ρκ 1 1 X t=1 (µκ 0 µk)I{at = k}I{G} (µκ 1 µk)I{at = k}I{G} + ... + t=ρk (µ µk)I{at = k}I{G} =:T max Pr( G) + k =k E[Dk]. To upper bound Dk, we define min n l = 0, ..., τ µκ l 16(K N + 3) q ln T c n 3 > µk o , if k > 16(K N + 3) q ln T c n 3. τ + 1, if k 16(K N + 3) q ln T c n 3. and consider three possibilities: (a) l(k) = τ + 1. In this case, µk is very close to µ , thus we trivially bound Dk k T 16(K N + 3)T q ln T c n 3. (b) l(k) τ, jk jκ l(k). At the jκ l(k)-th checkpoint, arm κ l(k) enters the feasible cycle. Among those have once entered the feasible cycle, it has the highest mean reward. Thus by our previous definition, µk (jκ l(k)) = µκ l(k). The number of checkpoints is |C| K0 N+2 N 3 + K K0 K0 N + 2 + K K0 = K N + 2. Since arm k enters the feasible cycle either after or simultaneously with arm κ l(k) (i.e. jk jκ l(k)), there must be an arm with reward mean at least µκ l(k) 16|C| q ln T c n 3 8 q ln T c n 3 after the operation at checkpoint cjk which survives until at least the next checkpoint. Given that µκ l(k) 16(K N + 2) q ln T c n 3 8 q ln T c n 3 µk > 8 q ln T c n 3, arm k is eliminated before time slot cjk + c n under the good event G. 
Thus the total number of pulls of arm k is at most 3c and Dk 3c k. (c) l(k) τ, jk < jκ l(k). Arm κ l(k) enters the game at time ρκ l(k). For notational simplicity, let cκ denote the checkpoint at which arm κ l(k) is involved in the feasible cycle. We show that cκ c 2n ρκ l(k). Suppose this inequality does not hold. Let t be the first time slot of the feasible cycle that contains the time slot ρκ l(k), t ρκ l(k). Thus On Multi-Armed Bandit with Impatient Arms t + n ρκ l(k) + n < cκ c n. Arm κ l(k) should be involved in the feasible cycle at checkpoint cκ c instead of cκ, which is a contradiction. If arm k is still active when the condition at checkpoint cκ is triggered, it must be eliminated before time slot cκ + c n under the good event G since µκ l(k) µk > 16(K N + 3) q ln T c n 3. Then we have Pρκ l(k)+1 1 t=ρκ l(k) I{at = k}I{G} cκ + c n ρκ l(k) cκ + c n cκ + c + 2n 3c where we set ρκ τ+1 = T + 1. We derive an upper bound for Dk, t=1 (µκ 0 µk)I{at = k}I{G} + (µκ 1 µk)I{at = k}I{G} + ... + t=ρk (µ µk)I{at = k}I{G} t=1 (µκ 0 µk) + (µκ 1 µk) + ... + k ρκ l(k)+1 1 X I{at = k}I{G} (ρκ l(k) 1)16(K N + 3) ln T c n 3 + 3c k 16(K N + 3)T ln T c n 3 + 3c k. Then RT can be further bounded as RT 2K max + k =k E[Dk] 2K max + 16(K N + 3)T ln T c n 3 + 3c k 2. k K0, ind(k ) > N 2. The optimal arm k is in the game from the beginning. t=1 µat T max Pr( G) + k =k E[Dk], where in this case Dk = k PT t=1 I{at = k}I{G} = k Tk(T)I{G}. Arm k is involved in the feasible cycle at checkpoint c ind(k ) N+2 N 3 . Consider the case when k > 16(K N + 3) q ln T c n 3. By our previous discussion, it guarantees that Tk(T) 3c when jk jk under the good event G. When jk < jk , arm k must be eliminated before time slot ck + c n. Thus k > 16(K N + 3) q ln T c n 3 implies that I{G}Tk(T) 3c + ck 3c + c K0 N+2 N 3 . Since it is also possible that k 16(K N + 3) q ln T c n 3, we have Dk 3c k + kc K0 N+2 N 3 + 16(K N + 3)Tk(T) q ln T c n 3. Thus RT 2K max + 3c k + kc l K0 N + 2 m + 16(K N + 3)Tk(T) 3. k K0, ind(k ) N 2. The optimal arm k is in the initial feasible cycle. The only difference from the k K0, ind(k ) > N 2 case is that jk jk = 0, k = k . We can similarly obtain RT 2K max + 3c k + 16(K N + 3)Tk(T) Merging the above three cases, the expected dynamic regret upper bound is RT 2K max + (K 1) c 3 + l K0 N + 2 + 16(K 1)(K N + 3)T n ln T c 3n, which is minimized when 8(K N + 3)T n ln T (3 + K0 N+2 If c is exactly 3n + 8(K N+3)T n ln T (3+ K0 N+2 3 , it can be verified that RT = O K2T 2 3 (n ln T) 1 3 . On Multi-Armed Bandit with Impatient Arms H. Discussion on the Knowledge of the Patience Vector m For the following reasons, we assume that the value of m is known in advance in this paper: 1. In practice, m can come as a result of negotiation between the algorithm and arms in advance. We take the crowdsourcing scenario as our example. The system (algorithm) can make a promise of mk for worker (arm) k such that in any time period of length mk, the worker will surely have at least one job assignments. After a suitable value of m is confirmed, the proposed algorithms in this paper (FC-SE and FC-Entry) are able to ensure that this promise is never violated. 2. It is sufficient for the known m to only be a valid element-wise lower bound for the true patience vector. For the clarity of our discussion, say m is the underlying true arm patience vector and m is an element-wise lower bound for m , i.e., mk m k for any k [K]. 
The knowledge of any valid m, rather than the exact m*, suffices for constructing a feasible cycle for the set of arms. The reason is that if we repeat a feasible cycle constructed from m, no arm k is ever continuously ignored for mk time steps; since mk ≤ m*k, arm k is then also never continuously ignored for m*k time steps, and hence it never leaves the game.

3. If no a-priori knowledge of the arm patience m is accessible, the bandit learning problem becomes intractable. Unlike learning the mean rewards µ, learning the arm patience m is impractical, since obtaining even partial information about m carries the risk of losing the optimal arm. Moreover, in this scenario all arms are initially indistinguishable to any algorithm. Intuitively, for any algorithm there exist instances (µ, m) in which the optimal arm leaves very early. In fact, we can formally prove that, without knowledge of m, any algorithm can incur regret that is linear in the time horizon T:

Proposition H.1. Let A be any algorithm that, at the beginning of each time slot t, takes only the historical observations (a1, r1, ..., a(t-1), r(t-1)) as input and outputs an action at, observing reward rt at time t. Then there exists a family of problem instances on which the expected regret satisfies RT = Ω(T).

Proof of Proposition H.1. Note that algorithm A has no access to the arm patience m. Assume K > 2. There exists i* ∈ [K] such that Pr(a1 = i*) ≥ 1/K, since otherwise 1 = Σ_{i∈[K]} Pr(a1 = i) < Σ_{i∈[K]} 1/K = 1. Similarly, there exists j* ∈ [K] such that Pr(a2 = j* | a1 = i*, r1 = 0) ≥ 1/K. Note that i* may equal j*. We construct a problem instance as follows. Select k* ∈ [K] with k* ≠ i* and k* ≠ j*; such a k* exists because K > 2. Set µk* = 1 and mk* = 2. For every k ≠ k*, set µk = 0 and mk = T. Set the reward noise to be 0 almost surely, that is, rt = µ(at) almost surely. Clearly, arm k* is an impatient optimal arm. If the algorithm pulls a1 ≠ k* and a2 ≠ k*, the optimal arm leaves at the end of time slot t = 2; on this event, every pull in the horizon collects reward 0 while µ* = 1, so the realized regret equals T. Hence

RT ≥ Pr(a1 ≠ k*, a2 ≠ k*) T ≥ Pr(a1 = i*, a2 = j*) T,

where the second inequality uses i* ≠ k* and j* ≠ k*. We observe that

Pr(a2 = j* | a1 = i*) = Σ_r Pr(a2 = j* | a1 = i*, r1 = r) Pr(r1 = r | a1 = i*) = Pr(a2 = j* | a1 = i*, r1 = 0),

since µi* = 0 and the noise is zero, so r1 = 0 almost surely when a1 = i*. Therefore

Pr(a1 = i*, a2 = j*) = Pr(a1 = i*) Pr(a2 = j* | a1 = i*) = Pr(a1 = i*) Pr(a2 = j* | a1 = i*, r1 = 0) ≥ 1/K².

As a result, RT ≥ T/K² = Ω(T).
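To see the construction in action, the toy simulation below plays a Proposition H.1 style instance against standard UCB1, used here only as an example of a patience-agnostic policy; the proposition itself covers arbitrary algorithms. Since UCB1 with the usual play-each-arm-once start pulls arms 1 and 2 deterministically in the first two slots, the adversary can place the unique good arm on a different index with patience 2, so it is ignored twice in a row, leaves, and the regret grows linearly in T. This sketch is added for illustration and is not code from the paper.

```python
# Toy illustration of the hard-instance idea behind Proposition H.1: a patience-agnostic
# policy (here UCB1) loses the unique good arm (mean 1, patience 2) because it does not
# pull it in the first two rounds, after which regret accumulates at rate 1 per slot.
import numpy as np

def run_ucb_with_impatient_arms(mu, patience, T):
    K = len(mu)
    pulls, sums = np.zeros(K), np.zeros(K)
    ignored = np.zeros(K, dtype=int)          # consecutive rounds without a pull
    active = np.ones(K, dtype=bool)
    regret = 0.0
    for t in range(1, T + 1):
        idx = np.where(active)[0]
        # UCB1 index (infinite for unpulled arms), restricted to arms still in the game.
        ucb = np.where(pulls[idx] > 0,
                       sums[idx] / np.maximum(pulls[idx], 1)
                       + np.sqrt(2 * np.log(t) / np.maximum(pulls[idx], 1)),
                       np.inf)
        a = idx[np.argmax(ucb)]
        sums[a] += mu[a]                      # zero reward noise, as in the construction
        pulls[a] += 1
        regret += mu.max() - mu[a]            # regret against the best arm of the instance
        ignored += 1
        ignored[a] = 0
        active &= ignored < patience          # an arm leaves once ignored `patience` times in a row
    return regret

if __name__ == "__main__":
    K, T = 4, 10_000
    mu = np.array([0.0, 0.0, 1.0, 0.0])       # the adversary hides the good arm at index 2 (0-based)
    patience = np.array([T, T, 2, T])         # ...and makes it impatient
    print("regret:", run_ucb_with_impatient_arms(mu, patience, T))   # ~ T, i.e. linear in T
```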