Adversarial Blocking Bandits

Nicholas Bishop, University of Southampton, UK, nb8g13@soton.ac.uk
Hau Chan, University of Nebraska-Lincoln, USA, hchan3@unl.edu
Debmalya Mandal, Columbia University, USA, dm3557@columbia.edu
Long Tran-Thanh, University of Warwick, UK, long.tran-thanh@warwick.ac.uk

34th Conference on Neural Information Processing Systems (NeurIPS 2020), Vancouver, Canada.

Abstract

We consider a general adversarial multi-armed blocking bandit setting in which each played arm can be blocked (unavailable) for some number of time periods, and the reward of each arm at each time period is chosen adversarially, without obeying any distribution. The setting models scenarios of allocating scarce, limited supplies (e.g., arms), where the supplies replenish and can be reused only after certain time periods. We first show that, in the optimisation setting, when the blocking durations and rewards are known in advance, finding an optimal policy (i.e., determining which arm to play in each round) that maximises the cumulative reward is strongly NP-hard, which eliminates the possibility of a fully polynomial-time approximation scheme (FPTAS) for the problem unless P = NP. To complement this result, we show that a greedy algorithm that plays the best available arm at each round provides an approximation guarantee that depends on the blocking durations and the path variation of the rewards. In the bandit setting, when the blocking durations and rewards are not known, we design two algorithms, RGA and RGA-META, for the case of bounded blocking duration and path variation. In particular, when the variation budget $B_T$ is known in advance, RGA achieves $O\big(\sqrt{T(2\tilde{D} + K)B_T}\big)$ dynamic approximate regret. On the other hand, when $B_T$ is not known, we show that the dynamic approximate regret of RGA-META is at most $O\big((K + \tilde{D})^{1/4} B^{1/2} T^{3/4}\big)$, where $B$ is the maximal path variation budget within each batch of RGA-META (which is provably of order $o(\sqrt{T})$). We also prove that if either the variation budget or the maximal blocking duration is unbounded, the approximate regret is at least $\Theta(T)$. Finally, we show that the regret upper bound of RGA is tight if the blocking durations are bounded above by an order of $O(1)$.

1 Introduction

This paper investigates the blocking bandit model, in which pulling an arm results in that arm being blocked for a deterministic number of rounds. For example, consider the classical problem of online task allocation, in which new task requests arrive at each time step, waiting to be assigned to one of many servers [Karthik et al., 2017]. Once a server is allocated to a task, it starts working on it and becomes unavailable for future tasks until that task is done. If no server is available, or none is allocated to the task upon its arrival, the request will not be served and leaves the system forever. A more recent example comes from the domain of expert crowdsourcing (e.g., Upwork, Outsourcely, etc.). In this setting, a job requester can sequentially choose from a pool of workers and allocate a short-term job/project to the chosen worker [Ho and Vaughan, 2012, Tran-Thanh et al., 2014]. The stochastic version of this problem, where the rewards are drawn from a distribution in an i.i.d. manner, with the constraint that the blocking durations are fixed per arm over time, has been studied in [Basu et al., 2019] and [Basu et al., 2020]. However, in many applications, the stochastic setting is too restrictive and not realistic.
For example, in the online task allocation problem, the tasks can be heterogeneous, and both the value and the serving time of the tasks can vary over time in an arbitrary manner. Furthermore, in the expert crowdsourcing setting, the time workers need to deliver the job and the quality of their work are unknown in advance, can vary over time, and do not necessarily follow an i.i.d. stochastic process. These examples demonstrate that, for many real-world situations, the stochastic blocking bandit model is not an appropriate choice. To overcome this issue, in this paper we propose the adversarial blocking bandit setting, where both the sequence of rewards and the sequence of blocking durations per arm can be arbitrary. While the literature on adversarial bandits is enormous, to the best of our knowledge this is the first attempt to address the effect of blocking in adversarial models. In particular, we are interested in a setting where the rewards are neither sampled i.i.d. nor maliciously chosen in an arbitrary way. Instead, in many real-world systems the change in the value of the rewards is rather slow or smooth over time (e.g., in the online task allocation problem, similar tasks usually arrive in batches, and in a crowdsourcing system, workers may have periods when they perform consistently, so their performance varies slowly over time). To capture this, we assume that there is a path variation budget which controls the change of the rewards over time.¹

1.1 Main Contributions

In this paper, apart from the adversarial blocking bandit setting, we also investigate two additional versions of the model: (i) the offline MAXREWARD problem, where all the rewards and blocking durations are known in advance; and (ii) the online version of MAXREWARD, in which we see the rewards and blocking durations of all the arms at each time step before we choose an arm to pull. Our main findings can be summarised as follows:

1. We prove that the offline MAXREWARD problem is strongly NP-hard (Theorem 1). Note that this result is stronger than the computational hardness result in Basu et al. [2019], which depends on the correctness of the randomised exponential time hypothesis.

2. We derive a provable approximation ratio for a simple online greedy algorithm, Greedy-BAA, for the online MAXREWARD problem (Theorem 2). Our approximation ratio, when applied to the stochastic blocking bandit model with fixed blocking durations, is slightly weaker than that of Basu et al. [2019]. However, it is more generic, as it can be applied to any arbitrary sequence of rewards and blocking durations.

3. For the bandit setting, we consider the case when both the maximal blocking duration and the path variation are bounded, and propose two bandit algorithms. We design RGA for the case of a known path variation budget $B_T$. In particular, we show that RGA provably achieves $O\big(\sqrt{T(2\tilde{D} + K)B_T}\big)$ regret, where $T$ is the time horizon, $K$ is the number of arms, $\tilde{D}$ is the maximum blocking duration, and the regret is computed against the performance of Greedy-BAA (Theorem 3). For the case of an unknown path variation budget $B_T$, we propose RGA-META, which uses Exp3 as a meta-bandit algorithm to learn an appropriate path variation budget and runs RGA with it. We prove that RGA-META achieves an $O\big((K + \tilde{D})^{1/4} B^{1/2} T^{3/4}\big)$ regret bound, where $B$ is the maximal path variation within a single batch of the algorithm and is of order $O(\sqrt{T})$ in the worst case (Theorem 4).

4. Finally, we also discuss a number of regret lower bound results.
In particular, we show that if either $B_T$ or $\tilde{D}$ is in $\Theta(T)$ (or unbounded), then the regret is at least $\Theta(T)$ (Claims 1 and 2). We also show that if $\tilde{D} \in O(1)$, then there is a matching lower bound for the regret of RGA (Section 5).

¹ We will show in Section 5 that bounded variation budgets are necessary to achieve sub-linear regret.

1.2 Related Work

Stochastic Blocking Bandits. The most relevant work to our setting is the stochastic blocking bandit model. As mentioned before, Basu et al. [2019] introduce and study this model, where the reward at each time period is generated from a stochastic distribution with mean reward $\mu_k$ for each arm $k$, and the blocking duration of each arm $k$ is fixed across all time periods (i.e., $D^k_t = D_k$ for all $t$ and $k$). In the optimisation setting, where the mean rewards and blocking durations are known, they consider a simpler version of the MAXREWARD problem and show that it is as hard as PINWHEEL scheduling on dense instances [Jacobs and Longo, 2014]; they also show that a simple greedy algorithm (see Algorithm 1) achieves an approximation ratio of $(1 - 1/e - O(1/T))$, where $T$ is the total number of time periods. In the bandit setting, they provide lower and upper regret bounds that depend on the number of arms, the mean rewards, and $\log(T)$. A very recent work [Basu et al., 2020] extends the stochastic blocking bandit to a contextual setting, where a context is sampled according to a distribution at each time period and the reward per arm is drawn from a distribution whose mean depends on the pulled arm and the given context. Similar to the work of Basu et al. [2019], Basu et al. [2020] derive an online algorithm with an approximation ratio that depends on the maximum blocking durations, and provide upper and lower α-regret bounds of $O(\log T)$ and $\Omega(\log T)$, respectively. However, the results from these models cannot be directly applied to the adversarial setting, due to the differences between the stochastic and adversarial reward generation schemes.

Budgeted and Knapsack Bandits. Since the underlying offline optimisation problem of our setting, MAXREWARD, can also be cast as an instance of the multiple-choice multidimensional knapsack problem, it is also worth mentioning the line of work in the bandit literature that solves online knapsack problems with bandit feedback. In these models, the pull of an arm requires the consumption of resources in $d \geq 1$ dimensions. The resource consumption per arm is given either stochastically or adversarially in each time period, and a (non-replenishable) total budget $B = (B_1, \dots, B_d)$ is available at the initial time period. The one-dimensional stochastic version of this setting was first studied by Tran-Thanh et al. [2010, 2012] and Ding et al. [2013] under the name budgeted bandits, and was later extended to multiple dimensions (a.k.a. bandits with knapsacks) by Badanidiyuru et al. [2013], Agrawal and Devanur [2014], and Badanidiyuru et al. [2014]. More recently, Rangi et al. [2019] and Immorlica et al. [2019] initiated the study of adversarial knapsack bandits. Rangi et al. [2019] consider the $d = 1$ setting with a regret benchmark that is measured against the best fixed arm's reward-to-cost ratio. Under such a benchmark, they show that sub-linear regret (with respect to $B$ and $k$) is possible in both the stochastic and adversarial settings. Immorlica et al. [2019] consider the $d \geq 1$ setting with a benchmark that is defined to be the ratio of the expected reward of the best fixed distribution over arms to the policy's expected reward.
They show that this ratio is at least $\Omega(\log T)$. However, none of the techniques developed in these works can be applied to our setting, for the following reason: the results in the knapsack bandit models typically assume that the pulling costs are bounded above by a constant, and that the budget is significantly larger than this constant to allow sufficient exploration. In contrast, when MAXREWARD is converted into a knapsack model, many of its dimensions have a budget of 1, and the corresponding pulling cost in that dimension is also 1 (due to the blocking condition).

Other Settings with Arm Availability Constraints. Other bandit models with arm availability constraints include the mortal bandits [Chakrabarti et al., 2009], sleeping bandits [Kleinberg et al., 2010, Kale et al., 2016], bandits with stochastic action sets [Neu and Valko, 2014], and combinatorial semi-bandits [Neu and Bartók, 2016]. We refer readers to [Basu et al., 2019] for a discussion of these models, including the relevance of the blocking bandit setting to online Markov decision processes.

Connection to the scheduling literature. Notice that there is a strong connection between MAXREWARD and interval scheduling problems. In particular, the MAXREWARD problem belongs to the class of fixed interval scheduling problems with arbitrary weight values, no preemption, and machine-dependent processing times (see, e.g., Kolen et al. [2007] for a comprehensive survey). This is one of the most general, and thus hardest, versions of the fixed interval scheduling literature (see, e.g., Kovalyov et al. [2007] for more details). In particular, MAXREWARD is a special case of this setting in which, for each task, the starting point of the feasible processing interval is equal to the arrival time. Note that, to date, provable performance guarantees for fixed interval scheduling problems with arbitrary weight values exist only in offline settings, online but preemptive settings, or settings with some special uniformity assumptions (e.g., [Erlebach and Spieksma, 2000, Miyazawa and Erlebach, 2004, Bender et al., 2017, Yu and Jacobson, 2020]). Therefore, to the best of our knowledge, Theorem 2 in our paper is the first result which provides a provable approximation ratio for a deterministic algorithm in an online non-preemptive setting. Note that, with some modifications, our proof can also be extended to the general online non-preemptive setting, i.e., online interval scheduling with arbitrary weight values, no preemption, and machine-dependent processing times.

2 Preliminaries

Adversarial blocking bandit. In this paper we consider the following bandit setting. Let $\mathcal{K} = \{1, \dots, K\}$ be the set of $K$ arms, and let $\mathcal{T} = \{1, \dots, T\}$ denote a sequence of $T$ time steps, or decision points, faced by a decision maker. At every time step $t \in \mathcal{T}$, the decision maker may pull one of the $K$ arms. When pulling an arm $k \in \mathcal{K}$ at time step $t \in \mathcal{T}$, the reward $X^k_t \in [0, 1]$ is obtained. In addition, the pulled arm $k$ is deterministically blocked and cannot be pulled for the next $(D^k_t - 1)$ time steps, for some integer blocking duration $D^k_t \in \mathbb{Z}_+$. We also use the notation $\emptyset$ to denote the action of not pulling an arm, in which case $X^{\emptyset}_t = 0$ and $D^{\emptyset}_t = 1$ for each time step $t$. We denote by $X^k = \{X^k_t\}_{t=1}^{T}$ the sequence of rewards over the $T$ time steps associated with an arm $k \in \mathcal{K}$, and by $X = \{X^k\}_{k=1}^{K}$ the collection of all $K$ reward sequences.
Similarly, we denote by $D^k = \{D^k_t\}_{t=1}^{T}$ the sequence of blocking durations over the $T$ time steps associated with an arm $k$, and by $D = \{D^k\}_{k=1}^{K}$ the collection of all $K$ blocking duration sequences. In our model, the rewards and blocking durations of each arm can change an arbitrary number of times. We let $\tilde{D}$ (resp. $\underline{D}$) be the maximal (resp. minimal) blocking duration, i.e., an upper (resp. lower) bound on the largest (resp. smallest) possible blocking duration. We denote by $\mathcal{D} = \{1, \dots, \tilde{D}\}^{K \times T}$ the set of all blocking duration sequences that are upper bounded by $\tilde{D}$. Note that $\mathcal{D}$ is defined with respect to the minimal blocking duration $\underline{D} = 1$; it is sometimes useful to define $\mathcal{D}$ with respect to an arbitrary lower bound $\underline{D}$.

Bounded path variation. Motivated by and adapted from a recent line of work in the bandit literature (e.g., [Besbes et al., 2014]), we assume that there is a path variation budget on the sequence of rewards. In particular, the path variation of the reward sequence is defined to be $\sum_{k \in \mathcal{K}} \sum_{t=1}^{T-1} |X^k_{t+1} - X^k_t|$. We refer to $B_T$ as the path variation budget over $\mathcal{T}$, and define the corresponding temporal uncertainty set as the set of reward sequences which satisfy the variation budget over the set of time steps $\{1, \dots, T\}$:
$$\mathcal{B} = \Big\{ X \in [0, 1]^{K \times T} : \sum_{k \in \mathcal{K}} \sum_{t=1}^{T-1} |X^k_t - X^k_{t+1}| \leq B_T \Big\}.$$
Note that by setting $B_T = KT$ we can recover the standard unbounded version of our bandit model (as all the rewards are in $[0, 1]$). Note also that our analysis works for other variation budgets, such as the maximum variation [Besbes et al., 2014] or the number-of-changes budget [Auer et al., 2019]; see Section 5 for a more detailed discussion.

Arm pulling policy. Let $U$ be a random variable defined over a probability space $(\mathbb{U}, \mathcal{U}, \mathbb{P}_u)$. Let $\pi_1 : \mathbb{U} \to \mathcal{K}$ and $\pi_t : [0, 1]^{t-1} \times \{1, \dots, \tilde{D}\}^{t-1} \times \mathbb{U} \to \mathcal{K}$ for $t = 2, 3, \dots$ be measurable functions. With some abuse of notation, we denote by $\pi_t \in \mathcal{K}$ the arm chosen at time $t$, which is given by
$$\pi_t = \begin{cases} \pi_1(U) & t = 1, \\ \pi_t\big(X^{\pi}_{t-1}, \dots, X^{\pi}_{1}, D^{\pi}_{t-1}, \dots, D^{\pi}_{1}, U\big) & t = 2, 3, \dots \end{cases}$$
Here $X^{\pi}_t$ (resp. $D^{\pi}_t$) denotes the reward (resp. blocking duration) observed by the policy $\pi$ at time $t$. The mappings $\{\pi_t : t = 1, \dots, T\}$, together with the distribution $\mathbb{P}_u$, define the class of policies. We define the class $\mathcal{P}$ of admissible policies to be those which, at every time step, choose an action that is not blocked. That is,
$$\mathcal{P} = \Big\{ (\pi_1, \dots, \pi_T) : \pi_t \notin \{\pi_j : j + D^{\pi_j}_j - 1 \geq t,\ j \leq t - 1\}\ \forall t \in \{1, \dots, T\},\ X \in \mathcal{B},\ D \in \mathcal{D} \Big\}.$$
In addition, let $A_t(\pi_1, \dots, \pi_{t-1}) = \mathcal{K} \setminus \{\pi_j : j + D^{\pi_j}_j - 1 \geq t,\ j \leq t - 1\}$ denote the set of available arms at time step $t$ (we will also use $A_t$ for the sake of brevity).

Algorithm 1: Greedy-BAA
Input: $T$, $K$, $\{X^k_t\}_{k \in \mathcal{K}, t \in \mathcal{T}}$, $\{D^k_t\}_{k \in \mathcal{K}, t \in \mathcal{T}}$ (an instance of the MAXREWARD problem)
Output: $\pi^+ = (\pi^+_1, \pi^+_2, \dots, \pi^+_T) \in \mathcal{P}$ (a greedy solution to the MAXREWARD problem)
1: $\pi^+ \leftarrow (\emptyset, \dots, \emptyset)$
2: for $j \leftarrow 1$ to $T$ do
3:   Select $\pi^+_j \in \arg\max_{k_j \in A_j(\pi^+_1, \dots, \pi^+_{j-1})} X^{k_j}_j$   (see the preliminaries for definitions)
4: return $\pi^+$
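As a concrete illustration of Algorithm 1, the following minimal Python sketch (our own; it is not part of the paper, and the function name greedy_baa is chosen for illustration) implements the greedy rule under MAXREWARD feedback, where the rewards and blocking durations of the current time step are visible. Arm availability is tracked via the release times induced by the blocking durations.

```python
from typing import List, Optional, Tuple

NO_PULL = None  # the "do nothing" action: reward 0, blocking duration 1

def greedy_baa(X: List[List[float]], D: List[List[int]]) -> Tuple[List[Optional[int]], float]:
    """Greedy Best Available Arm (Algorithm 1) on a MAXREWARD instance.

    X[k][t] is the reward of arm k at time t, and D[k][t] is its blocking
    duration; both are assumed visible at decision time.
    """
    K, T = len(X), len(X[0])
    release = [0] * K          # arm k becomes available again at time step release[k]
    policy, total = [], 0.0
    for t in range(T):
        available = [k for k in range(K) if release[k] <= t]
        if not available:
            policy.append(NO_PULL)
            continue
        k = max(available, key=lambda a: X[a][t])   # best available arm at time t
        policy.append(k)
        total += X[k][t]
        release[k] = t + D[k][t]   # arm k is blocked for the next D[k][t] - 1 steps
    return policy, total
```

For instance, on the toy instance X = [[0.9, 0.1], [0.5, 0.6]], D = [[2, 1], [1, 1]], the sketch plays arm 0 at t = 0 and, since arm 0 is then blocked for one further step, arm 1 at t = 1.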
Objective. The cumulative reward of a policy $\pi \in \mathcal{P}$ is defined to be $r(\pi) = \sum_{t=1}^{T} X^{\pi}_t$, where $X^{\pi}_t$ is the reward obtained by policy $\pi$ at time step $t$. Our objective is to find $\pi^* \in \mathcal{P}$ such that $\pi^* \in \arg\max_{\pi \in \mathcal{P}} \mathbb{E}_{\pi}[r(\pi)]$, where the expectation is over all possible randomisation coming from policy $\pi$.

Feedback. The difficulty of the optimisation problem depends on the information (or feedback) we have about the rewards and blocking durations of the arms. In this paper, we consider three feedback models in increasing order of difficulty. In the simplest setting, we know the values of all $X^k_t$ and $D^k_t$ in advance; we refer to this setting as the (offline) MAXREWARD optimisation problem. In the online version of MAXREWARD, we assume that $X^k_t$ and $D^k_t$ are not known in advance, but at each time step $t$, the values of $X^k_t$ and $D^k_t$ for all $k$ at that particular time step are revealed before we choose an arm to pull. Finally, in the (classical) bandit setting, we assume that only the reward and blocking duration of the chosen arm are revealed after that arm is pulled.² We will refer to the third model as the adversarial blocking bandit problem.

² In this paper, due to space limits, we do not deal with the full information feedback model, in which the reward and blocking duration values of all the arms are revealed at each time step after the pull.

3 The Offline and Online MAXREWARD Problems

We start with the analysis of the offline and online MAXREWARD problems. As a slight preview of the next subsections, computing an optimal solution of the offline MAXREWARD problem is strongly NP-hard, even with a bounded variation budget. This result eliminates the possibility of a fully polynomial-time approximation scheme (FPTAS) for the problem unless P = NP. In addition, for the online MAXREWARD problem, we design an online greedy algorithm with a provable approximation guarantee.

3.1 The Computational Complexity of the Offline MAXREWARD Problem

To show that the MAXREWARD problem is strongly NP-hard, we reduce from the Boolean satisfiability problem with three literals per clause (3-SAT), which is known to be strongly NP-complete [Garey and Johnson, 1979]. In a 3-SAT instance, we are given m variables and n clauses. Each clause consists of three literals, and each literal is either a variable or the negation of a variable. The problem is to determine whether there is a Boolean true/false assignment to each variable such that the given 3-SAT instance is true (i.e., each clause contains at least one true literal).

Theorem 1. Computing an optimal solution for the MAXREWARD problem is strongly NP-hard. The hardness result holds even when the path variation is bounded.

3.2 Online MAXREWARD Problem with Bounded Variation Budget

In this section, we consider the online version of MAXREWARD. We devise a simple online greedy algorithm, Greedy Best Available Arm (Greedy-BAA), in which, at each time step, the algorithm plays an available arm with the highest reward. Algorithm 1 provides a detailed description of Greedy-BAA. Below, we show that Greedy-BAA provides an approximation guarantee to the offline MAXREWARD problem that depends on the blocking durations and the variation budget.

Theorem 2. Let $k^* = \arg\max_{k} D^k_{\max} / D^k_{\min}$ denote the arm with the highest max-min blocking duration ratio, where $D^k_{\max}$ and $D^k_{\min}$ denote the largest and smallest blocking durations of arm $k$ over $\mathcal{T}$. Let $\pi^+$ denote the solution returned by Greedy-BAA, and $\pi^*$ denote an optimal solution of the offline MAXREWARD problem, respectively. We state that:
$$\Big(1 + \frac{D^{k^*}_{\max}}{D^{k^*}_{\min}}\Big)\, r(\pi^+) + \frac{D^{k^*}_{\max}}{D^{k^*}_{\min}}\, B_T \;\geq\; r(\pi^*).$$
That is, Greedy-BAA has an approximation ratio of
$$\Big(1 + \frac{D^{k^*}_{\max}}{D^{k^*}_{\min}}\Big)^{-1} \Big(1 - \frac{D^{k^*}_{\max}\, B_T}{D^{k^*}_{\min}\, r(\pi^*)}\Big).$$

Note that, as $D^{k^*}_{\min} \geq \underline{D}$ and $D^{k^*}_{\max} \leq \tilde{D}$, the factor $1 + D^{k^*}_{\max}/D^{k^*}_{\min}$ above can be further bounded above by $1 + \tilde{D}/\underline{D}$.
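Since Theorem 1 rules out an efficient exact algorithm (unless P = NP), exact solutions are only practical for very small instances. The following Python sketch (our own illustration, not part of the paper) solves such a tiny instance by exhaustive search over admissible pull sequences; it can be used to sanity-check the Greedy-BAA guarantee of Theorem 2 on small examples.

```python
from itertools import product

def max_reward_bruteforce(X, D):
    """Exhaustively search all admissible pull sequences of a tiny MAXREWARD
    instance. Runs in time (K+1)^T, so it is only meant for sanity checks."""
    K, T = len(X), len(X[0])
    best_val, best_seq = 0.0, [None] * T   # None encodes the "no pull" action
    for seq in product([None] + list(range(K)), repeat=T):
        release = [0] * K
        value, feasible = 0.0, True
        for t, k in enumerate(seq):
            if k is None:
                continue
            if release[k] > t:             # arm k still blocked: sequence inadmissible
                feasible = False
                break
            value += X[k][t]
            release[k] = t + D[k][t]
        if feasible and value > best_val:
            best_val, best_seq = value, list(seq)
    return best_seq, best_val
```

On any instance small enough for the search to finish, the ratio between the value returned by greedy_baa above and this exact optimum can be compared against the bound of Theorem 2.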
Comparison to the result of Basu et al. [2019]. We note that Basu et al. [2019] studied the MAXREWARD problem with path variation budget $B_T = 0$ (i.e., the reward values are fixed over time) and homogeneous blocking durations per arm (i.e., the blocking duration of each arm does not change over time). In that case, our proof provides an approximation ratio of $1/2$, whereas Basu et al. [2019] provide an approximation ratio of $(1 - 1/e - O(1/T))$. Their technique uses a considerably more complicated LP-bounding argument that does not directly generalise to the case of $B_T > 0$ with varying blocking durations. On the other hand, our approximation ratio holds for the general case. For example, if $B_T$ grows more slowly than $r(\pi^+)$ with $T$, our algorithm guarantees an approximation ratio of $(1 + 2\tilde{D})^{-1}$.

4 The Adversarial Blocking Bandit Problem

Given the investigation of the (offline and online) MAXREWARD problems in the previous section, we now turn to the main focus of our paper, namely the online MAXREWARD problem with bandit feedback, a.k.a. the adversarial blocking bandit problem. While regret analyses are typically carried out by benchmarking against the best fixed policy in hindsight, we can easily show that in our setting this benchmark would perform arbitrarily poorly compared to the offline optimal solution. Therefore, instead of following the standard regret analysis, we are interested in comparing the performance of the designed algorithms to that of the offline optimal solution, and we will use the following regret definition.

Dynamic approximate regret. We compare the performance of a policy with respect to the dynamic oracle algorithm that returns the offline optimal solution of MAXREWARD. We define the α-regret of a policy $\pi \in \mathcal{P}$ as the worst-case difference between $\alpha$ times the reward of an (offline) optimal sequence of actions and the expected performance under policy $\pi$. More precisely, let $\pi^*$ denote the arm pulling policy of the dynamic oracle algorithm. The α-regret of a policy $\pi \in \mathcal{P}$ against $\pi^*$ is defined to be
$$R^{\alpha}_{\pi}(B_T, \mathcal{D}, T) = \alpha\, r(\pi^*) - \mathbb{E}[r(\pi)],$$
where the expectation is over all possible randomisation of $\pi$. Note that this regret notion is stronger than the regret against the best fixed policy in hindsight, as it is easy to show that the best fixed policy can perform arbitrarily badly compared to $\pi^*$.

4.1 Blocking Bandit with Known Path Variation Budget

We now turn to describe our new bandit algorithm, RGA, designed for the adversarial blocking bandit problem. The algorithm can be described as follows:

1. We split the time horizon $T$ into batches $\mathcal{T}_1, \dots, \mathcal{T}_m$ of size $\Delta_T$ each (except possibly the last batch):
$$\mathcal{T}_j = \big\{ t \in \{1, \dots, T\} : (j - 1)\Delta_T + 1 \leq t \leq \min\{j \Delta_T, T\} \big\}, \quad \text{for all } j = 1, \dots, m,$$
where $m = \lceil T / \Delta_T \rceil$ is the number of batches.

2. Within each batch we spend the first $K$ rounds pulling each arm once. Without loss of generality, we shall assume that arm $k$ is pulled in round $k$. After this, we spend the next $\tilde{D}$ rounds pulling no arms. This ensures that all arms will be available when we next pull an arm.

3. Then, up until the final $\tilde{D}$ rounds, we play Greedy-BAA using the rewards observed in the first $K$ rounds as the fixed rewards of each arm.

4. In the final $\tilde{D}$ rounds of each batch, we again pull no arms. This ensures that all of the arms are available at the beginning of the next batch.

Algorithm 2: Repeating Greedy Algorithm (RGA)
Input: $\Delta_T$.
1: for $j = 1$ to $\lceil T / \Delta_T \rceil$ do
2:   for $\tau = 1$ to $\Delta_T$ do
3:     if $1 \leq \tau \leq K$ then
4:       Pull arm $k = \tau \bmod K + 1$
5:       Receive reward and blocking duration $(X^k_{\tau}, D^k_{\tau})$
6:       Set $\hat{X}^k_t = X^k_{\tau}$ for all $t \in [1, \Delta_T]$
7:     if $K + 1 \leq \tau \leq \tilde{D} + K$ then
8:       Pull no arms
9:     if $\tilde{D} + K + 1 \leq \tau \leq \Delta_T - \tilde{D}$ then
10:      Pick arms according to Greedy-BAA$(\Delta_T - 2\tilde{D} - K, K, \hat{X}^1, \dots, \hat{X}^K, D^1, \dots, D^K)$
11:    if $\Delta_T - \tilde{D} + 1 \leq \tau \leq \Delta_T$ then
12:      Pull no arms
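To make the batch structure of RGA concrete, the following Python sketch (our own illustration, not the authors' implementation; pull is a hypothetical callback that plays an arm, or None for no pull, and returns the observed reward and blocking duration) mirrors the four phases of Algorithm 2 within a single batch. The outer loop over the $\lceil T/\Delta_T \rceil$ batches and the tuning of $\Delta_T$ from Theorem 3 are omitted, and the sketch assumes $\Delta_T \geq 2\tilde{D} + K$.

```python
def rga_batch(delta_T, K, D_tilde, pull):
    """One batch of RGA (Algorithm 2). `pull(k)` plays arm k (or None for no
    pull) and returns (reward, blocking_duration). Rewards observed in the
    exploration phase are frozen as estimates for the greedy phase."""
    X_hat = [0.0] * K            # frozen reward estimates, one per arm
    release = [0] * K            # availability tracking for the greedy phase
    for tau in range(1, delta_T + 1):
        if tau <= K:                              # (i) explore: pull each arm once
            k = tau % K                            # 0-indexed; each arm is explored exactly once
            X_hat[k], _ = pull(k)
        elif tau <= K + D_tilde:                  # (ii) idle until every arm is free again
            pull(None)
        elif tau <= delta_T - D_tilde:            # (iii) greedy on the frozen estimates
            avail = [k for k in range(K) if release[k] <= tau]
            if avail:
                k = max(avail, key=lambda a: X_hat[a])
                _, dur = pull(k)
                release[k] = tau + dur
            else:
                pull(None)
        else:                                     # (iv) idle so the next batch starts fresh
            pull(None)
```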
Theorem 3. Suppose that the variation budget $B_T$ is known in advance and that the maximal blocking duration $\tilde{D} \geq 1$ is such that $\tilde{D} B_T \in o(T)$. The α-regret of RGA, where $\alpha = \frac{\underline{D}}{\underline{D} + \tilde{D}}$, is at most $O\big(\sqrt{T(2\tilde{D} + K)B_T}\big)$ when the parameter $\Delta_T$ is set to $\Big\lceil \sqrt{\frac{(T + 1)(2\tilde{D} + K)}{2 B_T}} \Big\rceil$.

Note that this bound is sub-linear in $T$ if $\tilde{D} B_T = o(T)$ (e.g., if $\tilde{D}$ is bounded above by a constant and $B_T \in o(T)$). It is also worth noting that, while $\alpha = \frac{1}{1 + \tilde{D}}$ might imply that RGA can perform better than the worst-case performance of Greedy-BAA, with $B_T \in o(T)$ this is not the case (see the appendix for more details).

4.2 Blocking Bandit with Unknown Path Variation Budget

Note that RGA requires knowledge of $B_T$ in order to properly set $\Delta_T$. To resolve this issue we propose RGA-META, a meta-bandit algorithm in which each arm corresponds to an instance of the RGA algorithm whose $\Delta_T$ parameter is tuned for a different variation budget. The time horizon $T$ is broken into meta-blocks of length $H$. At the start of each meta-block, an arm (i.e., an instance of RGA with its corresponding budget) is selected according to the well-known Exp3 algorithm [Auer et al., 2002]. The selected RGA is then played for the next $H$ time steps with optimally tuned restarts (see Theorem 3 for more details). At the end of the meta-block, Exp3 observes a reward corresponding to the total reward accumulated by the chosen RGA in this meta-block. The intuition behind this idea is that the meta-bandit will learn which budget is the best upper bound for RGA. In what follows, we shall denote the set of arms available to the Exp3 algorithm by $\mathcal{J}$, and the corresponding set of variation budgets by $\mathcal{J}_B$. RGA-META uses $\lceil \log_2(KT) \rceil + 1$ meta-arms with budgets $\mathcal{J}_B = \{2^0, 2^1, \dots, 2^{\lceil \log_2(KT) \rceil}\}$. That is, the budget values are powers of 2 up to the smallest power of 2 that is still larger than $KT$, which is the ultimate upper bound on the path variation budget (as $B_T \leq KT$). In addition, let $B_i$ denote the total path variation within batch $i$, and $B = \max_i B_i$.

Algorithm 3: Meta Repeating Greedy Algorithm (RGA-META)
Input: $T$, $K$, $\gamma \in (0, 1]$, batch length $H$.
1: Initialise: $|\mathcal{J}| = \lceil \log_2(KT) \rceil + 1$, $\mathcal{J}_B = \{2^0, 2^1, \dots, 2^{\lceil \log_2(KT) \rceil}\}$, $w_i(1) = 1$ for $i = 1, \dots, |\mathcal{J}|$.
2: for $\tau = 1, \dots, \lceil T / H \rceil$ do
3:   $p_i(\tau) = (1 - \gamma)\, \frac{w_i(\tau)}{\sum_{j=1}^{|\mathcal{J}|} w_j(\tau)} + \frac{\gamma}{|\mathcal{J}|}$, for $i = 1, \dots, |\mathcal{J}|$
4:   Draw $i_{\tau}$ randomly according to the probabilities $p_1(\tau), \dots, p_{|\mathcal{J}|}(\tau)$
5:   Run RGA in batch $\tau$ with budget $\mathcal{J}_B[i_{\tau}] = 2^{i_{\tau} - 1}$ and optimally tuned restarts
6:   Receive reward $x_{i_{\tau}}(\tau) \in [0, H]$ at the end of the batch
7:   for $j = 1, \dots, |\mathcal{J}|$ do
8:     $\hat{x}_j(\tau) = x_j(\tau) / p_j(\tau)$ if $j = i_{\tau}$, and $0$ otherwise
9:     $w_j(\tau + 1) = w_j(\tau)\, \exp\big(\gamma\, \hat{x}_j(\tau) / (H |\mathcal{J}|)\big)$

We state the following:

Theorem 4. Suppose that the variation budget $B_T$ is not known to us in advance. In addition, suppose that the maximal blocking duration $\tilde{D} \geq 1$ is such that $\tilde{D} B_T \in o(T)$. The α-regret of RGA-META, where $\alpha = \frac{1}{1 + \tilde{D}}$, is at most
$$O\Big( B^{1/2}\, T^{3/4}\, (2\tilde{D} + K)^{1/4}\, \ln(KT)^{1/4}\, \ln(\ln(KT))^{1/4} \Big)$$
when the parameters of RGA-META are set as
$$H = \sqrt{\frac{T(2\tilde{D} + K)}{\ln(KT)\, \ln(\ln(KT))}}, \qquad \gamma = \min\Big\{ 1,\ \sqrt{\frac{\ln(KT)\, \ln(\ln(KT))}{(e - 1)\, \lceil T / H \rceil}} \Big\}.$$

Note that since $B \leq HK$ by definition (the maximum path variation within a batch is at most $HK$), by setting $H = \sqrt{\frac{T(2\tilde{D} + K)}{\ln(KT)\, \ln(\ln(KT))}}$ we always get sub-linear regret in $T$ if $\tilde{D} \in O(1)$ (i.e., $\tilde{D}$ is bounded above by a constant). Otherwise we need to have $B^2 \tilde{D} \in o(T)$. Furthermore, when $B$ is small, our regret bound tends to $O(T^{3/4})$. Thus, it is still an open question whether we can get a tighter upper bound (e.g., $O(\sqrt{T})$) for this case (i.e., when the variation budget is unknown).
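The meta-level of RGA-META is a standard Exp3 update over the candidate budgets. The following Python sketch (our own; run_rga_batch is a hypothetical callback that runs RGA with the given budget for one meta-block of length H and returns the total reward collected, in [0, H]) mirrors lines 2-9 of Algorithm 3.

```python
import math
import random

def rga_meta(T, K, H, gamma, run_rga_batch):
    """Exp3 over candidate variation budgets, as in Algorithm 3 (RGA-META)."""
    budgets = [2 ** i for i in range(math.ceil(math.log2(K * T)) + 1)]
    n = len(budgets)
    w = [1.0] * n                                   # Exp3 weights, one per budget
    for block in range(math.ceil(T / H)):
        total_w = sum(w)
        p = [(1 - gamma) * w_i / total_w + gamma / n for w_i in w]
        i = random.choices(range(n), weights=p)[0]  # draw a budget index ~ p
        x = run_rga_batch(budgets[i], block)        # reward of the chosen RGA instance
        x_hat = x / p[i]                            # importance-weighted reward estimate
        w[i] *= math.exp(gamma * x_hat / (H * n))   # multiplicative weight update
    return w
```

Only the chosen budget's weight is updated in each meta-block; for all other budgets the estimated reward is zero, so their weights stay unchanged, exactly as in lines 7-9 of Algorithm 3.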
5 Discussions

In this section we provide some intuition for why we required $B_T$ and $\tilde{D}$ to be small in the previous sections. In particular, we show that if either the variation budget or the maximum blocking duration is large, the lower bound of the α-regret is $\Theta(T)$. We also discuss a potential lower bound for the α-regret of the adversarial blocking bandit problem in the case of $B_T \in o(KT)$ and $\tilde{D} \in O(1)$. Finally, we also discuss how our results change if we use other types of variation budgets.

Large variation budget. Consider the case when $B_T \in \Theta(T)$. Theorem 3 indicates that the upper bound of the α-regret is $\Theta(T)$, where $\alpha$ is as defined in Theorem 3. Indeed, we show that this is the best we can achieve:

Claim 1. For any $T > 0$ and $B_T \in \Theta(KT)$, there exists a sequence of rewards and blocking durations $X$ and $D$ such that $R^{\alpha}_{\pi}(B_T, \mathcal{D}, T) = \Theta(T)$ for that particular $(X, D)$.

Large blocking durations. Suppose $\tilde{D} \in \Theta(T)$ and let $\alpha$ be the approximation ratio of Greedy-BAA:

Claim 2. For any $T > 0$ and $\tilde{D} \in \Theta(T)$, there exists a sequence of rewards and blocking durations $X$ and $D$ such that $R^{\alpha}_{\pi}(B_T, \mathcal{D}, T) = \Theta(T)$ for that particular $(X, D)$.

Note that our regret bounds only make sense if $\tilde{D} B_T \in o(T)$. Thus, it is still an open question whether we can achieve α-regret bounds that are sub-linear in $T$ if both $B_T, \tilde{D} \in o(T)$ but $\tilde{D} B_T \in \Omega(T)$.

Almost matching regret lower bound for RGA. Consider the case when $\tilde{D} = O(1)$. This implies that the α-regret bound of RGA reduces to $O(\sqrt{KTB_T})$. This in fact matches the known lower bounds on the 1-regret for the case $\tilde{D} = 1$ (i.e., no blocking) from the literature [Auer et al., 2019]. In particular, with $\tilde{D} = 1$, the Greedy-BAA algorithm becomes optimal (see, e.g., Section 4.3 of Basu et al. [2019] for a discussion of this), and thus the α-regret notion becomes the 1-regret. Therefore, if there existed an algorithm which could achieve an α-regret better than $O(\sqrt{KTB_T})$ in our setting, then it would be able to achieve a 1-regret better than $O(\sqrt{KTB_T})$ for the standard (i.e., non-blocking) adversarial bandit as well, contradicting the known lower bound. It is also worth noting that when $\tilde{D}$ is not bounded above by a constant, or the variation budget $B_T$ is not known in advance, it is still not known what the regret lower bound would be.

Other variation budget definitions. There are a number of different variation budget definitions in the literature [Besbes et al., 2014, Wei and Luo, 2018, Auer et al., 2019]. It is worth noting that our analysis works in a similar way for the maximum variation budget $B^{\max}_T$ and the number-of-changes budget $L_T$, which can be defined as follows:
$$B^{\max}_T = \sum_{t=1}^{T-1} \max_{k \in \mathcal{K}} |X^k_{t+1} - X^k_t|, \qquad L_T = \#\big\{ t : 1 \leq t \leq T - 1,\ \exists k \in \mathcal{K} : X^k_t \neq X^k_{t+1} \big\}.$$
If we use these variation budgets instead, the regret bound in Theorem 3 becomes $O\big(\sqrt{(2\tilde{D} + K)T B^{\max}_T}\big)$ and $O\big(\sqrt{(2\tilde{D} + K)T L_T}\big)$, respectively. Furthermore, the approximation ratio of Greedy-BAA will also change. In particular, it becomes $\big[1 + \tilde{D} + \tilde{D} K B^{\max}_T / r(\pi^+)\big]^{-1}$ and $\big[1 + \tilde{D} + \tilde{D} K L_T / r(\pi^+)\big]^{-1}$, respectively. We refer the reader to the appendix for a more detailed discussion. It remains as future work to derive regret bounds for the other variation budgets.
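To make the different budget notions concrete, the short Python sketch below (our own illustration) computes the path variation $B_T$, the maximum variation $B^{\max}_T$, and the number of changes $L_T$ for a reward matrix X[k][t], following the definitions above.

```python
def variation_budgets(X):
    """Compute the three variation measures of a reward matrix X[k][t]."""
    K, T = len(X), len(X[0])
    path_var = sum(abs(X[k][t + 1] - X[k][t]) for k in range(K) for t in range(T - 1))
    max_var = sum(max(abs(X[k][t + 1] - X[k][t]) for k in range(K)) for t in range(T - 1))
    num_changes = sum(1 for t in range(T - 1) if any(X[k][t] != X[k][t + 1] for k in range(K)))
    return path_var, max_var, num_changes
```

Note that $B^{\max}_T \leq B_T$ always holds, since the maximum over arms at each step is never larger than the corresponding sum over arms.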
Broader Impact

The paper examines a novel multi-armed bandit problem in which the decision-making agent aims to receive as many (cumulative) rewards as possible over a finite period, subject to constraints. Our focus and results are largely theoretical. In particular, our contributions advance our understanding of multi-armed bandit models and their theoretical limitations, and benefit the general (theoretical) machine learning community, specifically the multi-armed bandit and online learning communities. In addition, we do not expect that our theoretical findings can be directly used in more applied domains.

Acknowledgments and Disclosure of Funding

Nicholas Bishop was supported by the UK Engineering and Physical Sciences Research Council (EPSRC) Doctoral Training Partnership grant. Debmalya Mandal was supported by a Columbia Data Science Institute Post-Doctoral Fellowship.

References

Shipra Agrawal and Nikhil R. Devanur. Bandits with concave rewards and convex knapsacks. In Proceedings of the Fifteenth ACM Conference on Economics and Computation, pages 989-1006, 2014.

Peter Auer, Nicolò Cesa-Bianchi, Yoav Freund, and Robert E. Schapire. The nonstochastic multiarmed bandit problem. SIAM Journal on Computing, 32(1):48-77, 2002.

Peter Auer, Pratik Gajane, and Ronald Ortner. Adaptively tracking the best bandit arm with an unknown number of distribution changes. In Conference on Learning Theory, pages 138-158, 2019.

Ashwinkumar Badanidiyuru, Robert Kleinberg, and Aleksandrs Slivkins. Bandits with knapsacks. In 2013 IEEE 54th Annual Symposium on Foundations of Computer Science, pages 207-216. IEEE, 2013.

Ashwinkumar Badanidiyuru, John Langford, and Aleksandrs Slivkins. Resourceful contextual bandits. In Conference on Learning Theory, pages 1109-1134, 2014.

Soumya Basu, Rajat Sen, Sujay Sanghavi, and Sanjay Shakkottai. Blocking bandits. In Advances in Neural Information Processing Systems 32, pages 4784-4793, 2019.

Soumya Basu, Orestis Papadigenopoulos, Constantine Caramanis, and Sanjay Shakkottai. Contextual blocking bandits. arXiv, abs/2003.03426, 2020.

Marco Bender, Clemens Thielen, and Stephan Westphal. Online interval scheduling with a bounded number of failures. Journal of Scheduling, 20(5):443-457, 2017.

Omar Besbes, Yonatan Gur, and Assaf Zeevi. Stochastic multi-armed-bandit problem with nonstationary rewards. In Proceedings of the 27th International Conference on Neural Information Processing Systems, pages 199-207, 2014.

Deepayan Chakrabarti, Ravi Kumar, Filip Radlinski, and Eli Upfal. Mortal multi-armed bandits. In Advances in Neural Information Processing Systems, pages 273-280, 2009.

Wenkui Ding, Tao Qin, Xu-Dong Zhang, and Tie-Yan Liu. Multi-armed bandit with budget constraint and variable costs. In Twenty-Seventh AAAI Conference on Artificial Intelligence, 2013.

Thomas Erlebach and Frits C. R. Spieksma. Simple algorithms for a weighted interval selection problem. In International Symposium on Algorithms and Computation, pages 228-240. Springer, 2000.

Michael R. Garey and David S. Johnson. Computers and Intractability: A Guide to the Theory of NP-Completeness. W. H. Freeman & Co., 1979.

Chien-Ju Ho and Jennifer Wortman Vaughan. Online task assignment in crowdsourcing markets. In Twenty-Sixth AAAI Conference on Artificial Intelligence, 2012.

Nicole Immorlica, Karthik A. Sankararaman, Robert Schapire, and Aleksandrs Slivkins. Adversarial bandits with knapsacks. In 2019 IEEE 60th Annual Symposium on Foundations of Computer Science (FOCS), pages 202-219, 2019.

Tobias Jacobs and Salvatore Longo. A new perspective on the windows scheduling problem. arXiv, abs/1410.7237, 2014.

Satyen Kale, Chansoo Lee, and David Pál. Hardness of online sleeping combinatorial optimization problems. In Advances in Neural Information Processing Systems 29, pages 2181-2189. Curran Associates, Inc., 2016.

A. Karthik, Arpan Mukhopadhyay, and Ravi R. Mazumdar. Choosing among heterogeneous server clouds. Queueing Systems, 85(1-2):1-29, 2017.
Robert Kleinberg, Alexandru Niculescu-Mizil, and Yogeshwer Sharma. Regret bounds for sleeping experts and bandits. Machine Learning, 80(2-3):245-272, 2010.

Antoon W. J. Kolen, Jan Karel Lenstra, Christos H. Papadimitriou, and Frits C. R. Spieksma. Interval scheduling: A survey. Naval Research Logistics (NRL), 54(5):530-543, 2007.

Mikhail Y. Kovalyov, C. T. Ng, and T. C. Edwin Cheng. Fixed interval scheduling: Models, applications, computational complexity and algorithms. European Journal of Operational Research, 178(2):331-342, 2007.

Hiroyuki Miyazawa and Thomas Erlebach. An improved randomized on-line algorithm for a weighted interval selection problem. Journal of Scheduling, 7(4):293-311, 2004.

Gergely Neu and Gábor Bartók. Importance weighting without importance weights: An efficient algorithm for combinatorial semi-bandits. The Journal of Machine Learning Research, 17(1):5355-5375, 2016.

Gergely Neu and Michal Valko. Online combinatorial optimization with stochastic decision sets and adversarial losses. In Advances in Neural Information Processing Systems, pages 2780-2788, 2014.

Anshuka Rangi, Massimo Franceschetti, and Long Tran-Thanh. Unifying the stochastic and the adversarial bandits with knapsack. In Proceedings of the Twenty-Eighth International Joint Conference on Artificial Intelligence, IJCAI-19, pages 3311-3317, 2019.

Long Tran-Thanh, Archie Chapman, Enrique Munoz de Cote, Alex Rogers, and Nicholas R. Jennings. Epsilon-first policies for budget-limited multi-armed bandits. In Twenty-Fourth AAAI Conference on Artificial Intelligence, 2010.

Long Tran-Thanh, Archie Chapman, Alex Rogers, and Nicholas R. Jennings. Knapsack based optimal policies for budget-limited multi-armed bandits. In Twenty-Sixth AAAI Conference on Artificial Intelligence, 2012.

Long Tran-Thanh, Sebastian Stein, Alex Rogers, and Nicholas R. Jennings. Efficient crowdsourcing of unknown experts using bounded multi-armed bandits. Artificial Intelligence, 214:89-111, 2014.

Chen-Yu Wei and Haipeng Luo. More adaptive algorithms for adversarial bandits. In Conference On Learning Theory, pages 1263-1291, 2018.

Ge Yu and Sheldon H. Jacobson. Approximation algorithms for scheduling c-benevolent jobs on weighted machines. IISE Transactions, 52(4):432-443, 2020.