# Blocking Bandits

Soumya Basu, Sujay Sanghavi, Sanjay Shakkottai (UT Austin, Amazon)

**Abstract.** We consider a novel stochastic multi-armed bandit setting, where playing an arm makes it unavailable for a fixed number of time slots thereafter. This models situations where reusing an arm too often is undesirable (e.g. making the same product recommendation repeatedly) or infeasible (e.g. compute job scheduling on machines). We show that even with prior knowledge of the rewards and delays of all the arms, the problem of optimizing cumulative reward does not admit any pseudo-polynomial time algorithm (in the number of arms) unless the randomized exponential time hypothesis is false, by mapping to the PINWHEEL scheduling problem. Subsequently, we show that a simple greedy algorithm that plays the available arm with the highest reward is asymptotically $(1 - 1/e)$ optimal. When the rewards are unknown, we design a UCB-based algorithm which is shown to have $c \log T + o(\log T)$ cumulative regret against the greedy algorithm, leveraging the free exploration of arms due to the unavailability. Finally, when all the delays are equal the problem reduces to Combinatorial Semi-bandits, providing us with a lower bound of $c' \log T + \omega(\log T)$.

## 1 Introduction

We propose Blocking Bandits, a novel stochastic multi-armed bandit (MAB) problem where there are multiple arms with i.i.d. stochastic rewards and, additionally, each arm is blocked for a deterministic number of rounds once it is played. In online systems, such blocking constraints arise naturally when repeating an action within a time frame may be detrimental, or even infeasible. In data processing systems, a resource (e.g. a compute node, a GPU) may become unavailable for a certain amount of time when a job is allocated to it. The detrimental effect is evident in recommendation systems, where it is highly unlikely that an individual becomes attracted to a product (e.g. a book, movie or song) through incessant recommendations of it; a resting time between recommendations of identical products can be effective, as it maintains diversity.

Surprisingly, this simple yet powerful extension of the stochastic MAB problem remains unexplored despite the plethora of research surrounding the bandits literature [7, 1, 4, 8, 10] from its onset in [25]. Given the extensive research in this field, it is no surprise that there are multiple existing ways to model this phenomenon. However, as we discuss below, none of these approaches is direct, resulting in large regret bounds, huge time complexity, or both.

We briefly present the problem. There are $K$ arms, where $\mu_i$ is the mean reward and $D_i$ is the delay of arm $i$, for each $i = 1$ to $K$. When arm $i$ is played it is blocked for $(D_i - 1)$ time slots and becomes available on the $D_i$-th time slot after its most recent play. The objective is to collect the maximum reward in a given time horizon $T$.

Illustrative Example: Consider three arms: arm 1 with delay 2 and mean reward 1/2, arm 2 with delay 4 and mean reward 1, and arm 3 with delay 4 and mean reward 1. The reward maximization objective is met when the arms are played cyclically as $3121\,3121\ldots$. There are two observations. First, due to blocking constraints we are forced to play multiple arms over time. Second, the order in which arms are played is crucial. To illustrate, an alternate schedule $321\bot\,321\bot\ldots$ (where $\bot$ represents that no arm is played) results in strictly less reward than the previous one, as in every fourth time slot no arm is available; the sketch below simulates this.
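To make the blocking mechanics concrete, the following minimal sketch simulates the environment and compares the two schedules from the example above. It is our own illustration, not code from the paper; the names `BlockingBandit` and `run_schedule` are hypothetical.

```python
import numpy as np

class BlockingBandit:
    """Arms with i.i.d. rewards; playing arm i blocks it for D_i - 1 slots."""

    def __init__(self, means, delays, rng=None):
        self.means = np.asarray(means, dtype=float)
        self.delays = np.asarray(delays, dtype=int)
        self.next_free = np.zeros(len(means), dtype=int)  # first slot each arm is available again
        self.rng = rng or np.random.default_rng(0)

    def available(self, t):
        return [i for i in range(len(self.means)) if self.next_free[i] <= t]

    def play(self, i, t):
        assert self.next_free[i] <= t, f"arm {i} is blocked at slot {t}"
        self.next_free[i] = t + self.delays[i]  # blocked for D_i - 1 slots after t
        return float(self.rng.random() < self.means[i])  # Bernoulli reward draw

def run_schedule(bandit, schedule):
    """Total expected reward of a schedule; None denotes an empty slot."""
    return sum(bandit.means[i] for i in schedule if i is not None)

# Example from the text (0-indexed): arm 0 has delay 2 and mean 1/2; arms 1, 2 have delay 4 and mean 1.
b = BlockingBandit(means=[0.5, 1.0, 1.0], delays=[2, 4, 4])
cyclic = [2, 0, 1, 0] * 4    # "3121...": mean reward 3 per 4-slot cycle
naive = [2, 1, 0, None] * 4  # "321(bottom)...": every fourth slot is wasted
print(run_schedule(b, cyclic), run_schedule(b, naive))  # 12.0 vs 10.0
```

Both schedules respect the delays; the gap comes purely from the ordering of pulls.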
## 1.1 Main Contributions

We now present the main contributions of this paper.

1. Formulation: We formulate the Blocking Bandits problem, where each time an arm is played it is blocked for a deterministic amount of time, thus providing an abstraction for applications such as recommendations or job scheduling.

2. Computational Hardness: We prove that when the rewards and the delays are known, the problem of choosing a sequence of available arms to optimize the reward over a time horizon $T$ is computationally hard (see Theorem 3.1). Specifically, we prove that the offline optimization is as hard as PINWHEEL Scheduling on dense instances [18, 12, 20, 3], which does not admit any pseudo-polynomial time algorithm (in the number of arms) unless the randomized exponential time hypothesis [5] is false.

3. Approximation Algorithm: On the positive side, we prove that the Oracle Greedy algorithm, which knows the mean rewards of the arms and simply plays the available arm with the highest mean reward, is $(1 - 1/e - O(1/T))$-optimal (see Theorem 3.3). The approximation guarantee does not follow from standard techniques (e.g. sub-modular optimization bounds); instead it is proved by relating a novel lower bound on the Oracle Greedy algorithm to an LP-relaxation-based upper bound on MAXREWARD.

4. Regret Upper Bound for UCB Greedy: We propose the natural UCB Greedy algorithm, which plays the available arm with the highest upper confidence bound. We provide regret upper bounds for UCB Greedy as compared to Oracle Greedy in Theorem 4.1. Our proof technique is novel in two ways. (i) In each time slot, Oracle Greedy and UCB Greedy have different sets of available arms (sample-path wise), as the set of available arms is correlated with past decisions. We construct a coupling between Oracle Greedy and UCB Greedy which enables us to capture the effect of learning error in UCB Greedy locally in time for each arm, despite the correlation with past decisions. (ii) We prove that due to the blocking constraint there is free exploration in the UCB Greedy algorithm: as UCB Greedy plays the current best arm, that arm gets blocked, enforcing the play of the next suboptimal arm, a phenomenon we call free exploration. Free exploration ensures that up to a time horizon $t$, a certain number of arms, namely $K^\star$ (defined below), are each played at least $ct$ times, for some $c > 0$, w.h.p. Suppose the $\mu_i$ are non-increasing with $i$. Let $K^\star = \min\{i : \sum_{j=1}^{i} 1/D_j \ge 1\}$, and $\Delta(u, l) = \min\{\mu_i - \mu_{i+1} : l \le i < u\}$. Then the regret is upper bounded by $O\big(\tfrac{K(K - K^\star)}{\Delta(K, K^\star)} \log T\big)$. Ignoring free exploration, the regret bound is $O\big(\tfrac{K^2}{\Delta(K, 1)} \log T\big)$, where $\Delta(K, 1) \le \Delta(K, K^\star)$. (A short computation of $K^\star$ and $\Delta(K, K^\star)$ is sketched after this list.)

5. Regret Lower Bound: We provide regret lower bounds for instances where the Oracle Greedy algorithm is optimal, so that regret is accumulated only due to learning errors. We consider instances where all the delays are equal to $K^\star < K$. We show that under this setting Oracle Greedy is optimal, and the feedback structure of any online algorithm coincides with combinatorial semi-bandit feedback [17, 13]. We show that for specific instances the regret admits a lower bound of $\Omega\big(\tfrac{(K - K^\star)}{\Delta(K, K^\star)} \log T\big)$ in Theorem 4.3.
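As a concrete illustration of the free-exploration quantities in contribution 4, the hedged snippet below computes $K^\star$ and $\Delta(K, K^\star)$ for a hypothetical instance; the helper names are ours.

```python
def k_star(delays):
    """Smallest 1-based index i with sum_{j<=i} 1/D_j >= 1; K if no such index exists."""
    total = 0.0
    for i, d in enumerate(delays, start=1):
        total += 1.0 / d
        if total >= 1.0:
            return i
    return len(delays)

def gap(means, u, l):
    """Delta(u, l) = min over l <= i < u of mu_i - mu_{i+1} (means sorted non-increasing)."""
    return min(means[i - 1] - means[i] for i in range(l, u))

delays = [2, 4, 4, 8]          # hypothetical instance
means = [0.9, 0.8, 0.5, 0.1]   # sorted in non-increasing order
ks = k_star(delays)            # 1/2 + 1/4 + 1/4 = 1 at i = 3, so K* = 3
print(ks, gap(means, len(means), ks))  # 3 and Delta(K, K*) = 0.4
```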
## 1.2 Connections to Existing Bandit Frameworks

We now briefly review related work in bandits, highlighting the shortcomings of each framework in solving the stochastic blocking bandits problem.

1. Combinatorial Semi-bandits: The blocking bandit problem is combinatorial in nature, as the decision to play one arm changes the set of available arms in the future. Instead of viewing the problem on a per-time-slot basis, we can group a large block of time slots together, determine a schedule of arm pulls, and repeat this schedule, giving an asymptotically optimal policy. We can then use ideas from stochastic combinatorial semi-bandits [13, 24] to learn the rewards by observing all the rewards attained in each block. This approach, however, has two shortcomings. First, we might need to consider extremely large blocks of time, specifically of size $O(\exp(\mathrm{lcm}(D_i : i \in [K]) \log K))$ (lcm stands for the least common multiple), as an optimal policy may have periodic cycles of that length. This requires large computational time, since in the online algorithm the schedule changes depending on the reward estimates. Second, as the set of actions over large blocks is huge, the regret guarantees of such an approach may scale as $O(\exp(\mathrm{lcm}(D_i : i \in [K]) \log K) \log T)$.

2. Budgeted Combinatorial Bandits: There are extensions of the above combinatorial semi-bandit setting where additional global budget constraints are imposed, such as knapsack constraints [26], where an arm can only be played a pre-specified number of times, and budget constraints [28], where each play of an arm has an associated cost and the total expenditure has a budget. However, these settings cannot handle blocking constraints that are local (per arm) in nature. Further, in [19] the authors consider adaptive adversaries, which can model our problem, but their approach leads to an approximation guarantee of $O(1/\log T)$ over $T$ time slots. An interesting recent work, Recharging Bandits [22], studies a system where the reward of each arm is a concave and weakly increasing function of the time since the arm was last played (i.e. a recharging time). However, the results therein do not apply, as we focus on hard blocking constraints. Another work on bandits with delay-dependent payoffs [6] is not applicable, as the results therein give no approximation guarantee for our setting.

3. Sleeping Bandits: Yet another bandit setting where the set of available actions changes across time slots is Sleeping Bandits [23]. In that setting, the available action set is the same for all competing policies, including the optimal one, in each time slot. In our scenario, however, the set of available actions in a particular time slot depends on the actions taken in past time slots; different policies may therefore face different available actions in each time slot. This precludes the application of ideas presented in Sleeping Bandits, and in sleeping combinatorial bandits [21], to our problem.

4. Online Markov Decision Processes: Finally, we can view this as a general Markov decision process on the state space $S = [D_1] \times [D_2] \times \cdots \times [D_K]$ and the action space of arms $A = [K]$, with mean reward $\mu_i$ for action $i$. The state space is exponential in $K$, leading to a huge computational bottleneck ($O(\exp(K))$) and regret ($O(\mathrm{poly}(|S|) \log T)$) for standard approaches in online Markov decision processes [2, 27, 15], as the quick computation after this list illustrates.
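A back-of-the-envelope computation (our own illustration, with hypothetical equal delays) of the state-space size $|S| = \prod_i D_i$ in the generic MDP view:

```python
import math

def mdp_state_space(delays):
    """Size of the state space [D_1] x ... x [D_K] in the generic MDP formulation."""
    return math.prod(delays)

for k in (5, 10, 20):
    delays = [4] * k  # hypothetical instance: all delays equal to 4
    print(k, mdp_state_space(delays))  # grows as 4^K: 1024, ~1e6, ~1e12
```

Even for modest $K$ and delays, the state space is far too large for generic MDP regret bounds to be meaningful.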
## 2 Problem Definition

We consider a multi-armed bandit problem with blocking of arms. We have $K$ arms. For each $i \in [K]$, the $i$-th arm provides a reward $X_i(t)$ in time slot $t \ge 1$, where the $X_i(t)$ are i.i.d. random variables with mean $\mu_i$ and support $[0, 1]$. W.l.o.g. we order the arms from highest to lowest mean reward, so that $\mu_1 \ge \mu_2 \ge \cdots \ge \mu_K$. Let $\Delta_{ij} = \mu_i - \mu_j$ for all $1 \le i < j \le K$.

Blocking: For each $i \in [K]$, arm $i$ is deterministically blocked for $(D_i - 1) \ge 0$ time slots once it is played. The actions of a player thus decide the set of available arms. In the $t$-th time slot, denote the set of available arms by $A_t$ and the arm pulled by the player by $i_t \in A_t$. For each $i \in [K]$ and $t \ge 1$, the number of time slots after and including $t$ for which arm $i$ remains blocked is $\tau_{i,t} = \big(D_i + \max\{t' < t : i_{t'} = i\} - t\big)$. The set of available arms at each time $t$ is $A_t := A_t(i_1, \ldots, i_{t-1}) = \{i \in [K] : \tau_{i,t} \le 0\}$. For a fixed time horizon $T \ge 1$, the set of all valid action sequences is $\mathcal{I}_T = \{(i_t)_{t \in [T]} : i_t \in A_t(i_1, \ldots, i_{t-1})\}$.

Optimization: Our objective is to attain the maximum expected cumulative reward. The expected cumulative reward of a policy $I_T \in \mathcal{I}_T$ is $r(I_T) = \mathbb{E}\big[\sum_{t=1}^{T} X_{i_t}(t)\big] = \sum_{i_t \in I_T} \mu_{i_t}$. The offline optimization problem, with knowledge of the delays and mean rewards, is stated below.

MAXREWARD: Solve $\mathrm{OPT} = \max_{I' \in \mathcal{I}_T} r(I')$.

$\alpha$-Regret: We now define the $\alpha$-regret of a policy, which is identical to the $(\alpha, 1)$-regret defined in the combinatorial bandits literature [9]. For any $\alpha \in [0, 1]$, the $\alpha$-regret of a policy is the difference between the expected cumulative reward of an $\alpha$-optimal policy and the expected cumulative reward of that policy, $R^{\alpha}_T = \alpha\,\mathrm{OPT} - r(I_T)$.

## 3 Scheduling with Known Rewards

### 3.1 Hardness of MAXREWARD

The offline problem is a periodic scheduling problem with the objective of reward maximization. In this section, we first prove (Corollary 3.2) that the offline problem does not admit any pseudo-polynomial time algorithm in the number of arms, unless the randomized exponential time hypothesis is false. We show hardness of the MAXREWARD problem by mapping it to the PINWHEEL SCHEDULING problem [18], defined below.

PINWHEEL SCHEDULING: Given $K$ arms with delays $\{a_i : i \in [K]\}$, the PINWHEEL SCHEDULING problem is to decide whether there exists a schedule (i.e. a mapping $\sigma : [T] \to [K]$ for all $T \ge 1$) such that, for each $i \in [K]$, arm $i$ appears at least once in every $a_i$ consecutive time slots. We call such a schedule, if it exists, a valid schedule. A PINWHEEL SCHEDULING instance with a valid schedule is a YES instance; otherwise it is a NO instance. A PINWHEEL SCHEDULING instance is called dense if $\sum_{i=1}^{K} 1/a_i = 1$. This problem is also known as the Single Machine Windows Scheduling Problem with Inexact Periods [20].

Theorem 3.1. MAXREWARD is at least as hard as PINWHEEL SCHEDULING on dense instances.

In the proof, presented in the supplementary material, we show that given a dense instance of PINWHEEL SCHEDULING there is an instance of MAXREWARD whose optimal value is strictly larger if the dense instance is a YES instance than if it is a NO instance. The following corollary establishes hardness of MAXREWARD.

Corollary 3.2. The problem MAXREWARD does not admit any pseudo-polynomial algorithm unless the randomized exponential time hypothesis is false.

Proof. The proof follows from Theorem 3.1 and Theorem 24 in [20]. In [20], the authors show that PINWHEEL SCHEDULING on dense instances does not admit any pseudo-polynomial algorithm unless the randomized exponential time hypothesis [5] is false.

### 3.2 (1 - 1/e)-Approximation of MAXREWARD

We study the Oracle Greedy algorithm, which in each time slot picks the best arm (i.e. the arm with the highest mean reward $\mu_i$) in the set of available arms.
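A minimal sketch of the Oracle Greedy policy just described, under our own assumptions (the function name `oracle_greedy` is ours; since the policy is deterministic given the means, we accumulate expected rather than sampled rewards):

```python
import numpy as np

def oracle_greedy(means, delays, horizon):
    """Play the available arm with the highest mean reward in every slot."""
    means = np.asarray(means, dtype=float)
    next_free = np.zeros(len(means), dtype=int)
    total, schedule = 0.0, []
    for t in range(horizon):
        avail = np.flatnonzero(next_free <= t)
        if avail.size == 0:
            schedule.append(None)  # no arm available this slot
            continue
        i = avail[np.argmax(means[avail])]  # best available arm
        next_free[i] = t + delays[i]        # arm i blocked for D_i - 1 slots
        total += means[i]
        schedule.append(int(i))
    return total, schedule

# On the illustrative instance from Section 1, Oracle Greedy latches onto a suboptimal cycle:
total, sched = oracle_greedy([0.5, 1.0, 1.0], [2, 4, 4], 8)
print(sched, total)  # [1, 2, 0, None, 1, 2, 0, None] up to tie-breaking; reward 5.0 vs optimal 6.0
```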
We show in Theorem 3.3 that the greedy algorithm is $(1 - 1/e - O(1/T))$ optimal¹ for the problem, for any time horizon $T$ and any number of arms $K$.

¹An algorithm is $\alpha$ optimal for the offline problem if its expected cumulative reward is at least $\alpha$ times the optimal expected cumulative reward.

Theorem 3.3. The greedy algorithm is asymptotically $(1 - 1/e)$ optimal for MAXREWARD.

Proof Sketch: The proof is presented in the supplementary material. It relies on three steps. First, we show that using a linear program (LP) relaxation it is possible to obtain an upper bound on OPT in closed form, as a function $f_{upper}(T, \mu_i, D_i, \forall i)$ of the $\mu_i, D_i$ for all $i \in [K]$. Next, we show that the reward of the greedy algorithm can be lower bounded by another function $f_{lower}(T, \mu_i, D_i, \forall i)$. The final step is to lower bound the ratio $\min_{\mu_i \in [0,1], D_i \ge 1, \forall i} \frac{f_{lower}(T, \mu_i, D_i, \forall i)}{f_{upper}(T, \mu_i, D_i, \forall i)}$. Our approach for the final step is to break this non-convex optimization into two stages: first, optimization over the $\mu_i$, which takes the form of a linear fractional program with a closed-form lower bound as a function of the $D_i$; second, we show that this value can be further lower bounded, universally across all $D_i \ge 1$, by $(1 - 1/e - O(1/T))$.

### 3.3 Optimality Gap

We now show that greedy is suboptimal by constructing instances where greedy attains a cumulative reward $(3/4 + \delta)$ times the optimal reward, for arbitrarily small $\delta > 0$. Finally, the greedy algorithm that plays the available arm with maximum $\mu_i / D_i$, which we call greedy-per-round, is shown to attain only roughly a $1/K$ fraction of the optimal reward in certain instances.

Proposition 3.4. For any $\epsilon > 0$, there exists an instance with 4 arms where the greedy algorithm achieves a $\frac{3 - \epsilon}{4 - 2\epsilon}$ fraction of the optimal reward.

Proof. Consider the instance where arms 1 and 2 have reward 1 and delay 4, arm 3 has reward $1 - \epsilon$ and delay 2, and arm 4 has reward 0 and delay 1. Also, each arm has only one copy. For any time horizon $T$ which is a multiple of 4, the greedy algorithm produces the repeated schedule $1, 2, 3, 4, 1, 2, 3, 4, \ldots$. Therefore, the reward of greedy is $(3 - \epsilon)T/4$. In contrast, the optimal reward of $(4 - 2\epsilon)T/4$ is attained by the schedule $1, 3, 2, 3, 1, 3, 2, 3, \ldots$. Therefore, greedy achieves $\frac{3 - \epsilon}{4 - 2\epsilon}$ times the optimal reward.

Proposition 3.5. For any $\epsilon > 0$, there exists an instance with $K$ arms where the greedy-per-round algorithm achieves a $\frac{1 + \epsilon}{K - 2}$ fraction of the optimal reward.

Proof. Consider the instance where arms 1 to $(K-1)$ each have reward 1 and delay $(K-2)$, and the $K$-th arm has reward $(1+\epsilon)/(K-2)$ for $\epsilon > 0$ and delay 1. Greedy-per-round always plays the $K$-th arm, as $(1+\epsilon)/(K-2) > 1/(K-2)$, attaining a reward of $(1+\epsilon)T/(K-2)$ in $T$ time slots. In contrast, the optimal algorithm plays the arms $1, 2, \ldots, (K-1)$ in a round-robin manner, attaining a reward of $T$ in $T$ time slots. Therefore, greedy-per-round attains only a $(1+\epsilon)/(K-2)$ fraction of the optimal reward.

## 4 Greedy Scheduling with Unknown Rewards

### 4.1 UCB Greedy Algorithm

In this section, we present the Upper Confidence Bound Greedy algorithm, which operates without knowledge of the mean rewards and the delays. The algorithm maintains an upper confidence bound on the mean reward of each arm, and in each time slot plays the available arm with the highest upper confidence bound $\hat{\mu}_i + \sqrt{\frac{8 \log t}{n_i}}$, where, for arm $i$, $\hat{\mu}_i$ is the estimate of the mean reward and $n_i$ is the total number of times arm $i$ has been played.

Algorithm 1 UCB Greedy
1: Initialize: mean estimate $\hat{\mu}_i = 0$ and count $n_i = 0$, for all $i \in [K]$
2: for $t = 1$ to $T$ do
3: Play arm $i_t = t$ if $t \le K$; otherwise $i_t = \arg\max_{i \in A_t} \big(\hat{\mu}_i + \sqrt{8 \log t / n_i}\big)$²
4: if $i_t \ne \emptyset$ then
5: $n_{i_t} \leftarrow n_{i_t} + 1$
6: $\hat{\mu}_{i_t} \leftarrow \big(1 - \tfrac{1}{n_{i_t}}\big) \hat{\mu}_{i_t} + \tfrac{1}{n_{i_t}} X_{i_t}(t)$

²We believe that, with some increased complexity in the proof, the constant 8 in the UCB can be improved to 2.
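For concreteness, a runnable sketch of Algorithm 1 under our assumptions (Bernoulli rewards; the function name `ucb_greedy` is ours):

```python
import numpy as np

def ucb_greedy(means, delays, horizon, seed=0):
    """UCB Greedy: play the available arm with the highest UCB index."""
    rng = np.random.default_rng(seed)
    K = len(means)
    mu_hat = np.zeros(K)
    n = np.zeros(K, dtype=int)
    next_free = np.zeros(K, dtype=int)
    reward = 0.0
    for t in range(1, horizon + 1):
        avail = [i for i in range(K) if next_free[i] <= t]
        if not avail:
            continue  # no arm available; the slot is wasted
        if t <= K:
            i = t - 1  # initialization: play each arm once
        else:
            ucb = [mu_hat[i] + np.sqrt(8 * np.log(t) / n[i]) for i in avail]
            i = avail[int(np.argmax(ucb))]
        x = float(rng.random() < means[i])       # Bernoulli reward draw
        n[i] += 1
        mu_hat[i] += (x - mu_hat[i]) / n[i]      # incremental mean update
        next_free[i] = t + delays[i]             # arm blocked for D_i - 1 slots
        reward += x
    return reward

print(ucb_greedy([0.5, 1.0, 1.0], [2, 4, 4], horizon=1000))
```

Note how blocking itself forces plays of suboptimal arms whenever the top arms are unavailable; this is the free exploration analyzed next.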
### 4.2 Analysis of UCB Greedy

We now provide an upper bound on the regret of the UCB Greedy algorithm as compared to the Oracle Greedy algorithm, which uses knowledge of the rewards. Recall that the rewards are sorted ($\mu_i$ is non-increasing in $i$).

Quantities used in the Regret Bound. $K_g$ is the index of the worst arm with mean reward strictly greater than 0 played by the Oracle Greedy algorithm. Let $H(m) = \sum_{n=1}^{\infty} 1/n^m$, $m > 1$ (the Riemann zeta function). We define $K^\star_\gamma = \min\big(\{K\} \cup \{k : \sum_{i=1}^{k-1} 1/D_i \ge 1 - \gamma\}\big)$ for any $\gamma \ge 0$, and $K^\star := K^\star_0$. For each $1 \le k < k' \le K$, let $\Delta(k, k') := \min\{\mu_i - \mu_j : i \le k,\ j \ge k' + 1,\ i < j\}$. Further, for all $i = 1$ to $K_g$ and $j = (i+1)$ to $K$, we define constants $c_{ij} > 0$ that depend only on $\gamma$, $K$, $K_g$, and the delays through $D_{\min}$ and $D_{\max}$.

Theorem 4.1. The $(1 - 1/e)$-regret of UCB Greedy over a time horizon $T$ is upper bounded, for any $\gamma > 0$, by
$$\sum_{i=1}^{K_g} \ \sum_{j = 1 + \max(i,\, K^\star_\gamma)}^{K} \frac{32 \log T}{\Delta_{ij}} \;+\; \sum_{i=1}^{K_g} \Big( \sum_{j=i+1}^{K} c_{ij} + 2 H(4) \Big) (\mu_i - \mu_K),$$
where the second summation contributes only constants independent of $T$.

Simplified Regret Bound. For any $\gamma > 0$ the regret admits the simplified upper bound
$$O\Big(\frac{K\,(K - K^\star_\gamma)}{\min_{i \in [K^\star_\gamma, \ldots, K_g]} \Delta_{i,i+1}} \log T\Big).$$

Role of Free Exploration in the Regret Bound. Ignoring the free exploration in the system, we can only upper bound the regret as $\sum_{i=1}^{K_g} \sum_{j=i+1}^{K} \frac{32 \log T}{\Delta_{ij}} + 2H(4) \sum_{i=1}^{K_g} (\mu_i - \mu_K)$, which is $O\big(\frac{K^2}{\Delta(K,1)} \log T\big)$. Therefore, by capturing the free exploration, we significantly improve the regret bound of UCB Greedy whenever the smallest gaps occur among the top arms, i.e. whenever $\min_{i < K^\star} \Delta_{i,i+1} \ll \min_{i \ge K^\star} \Delta_{i,i+1}$.

Proof Sketch of Theorem 4.1. We present parts of the proof here; the complete proof is deferred to the supplementary material. While computing the regret, we consider each arm $i = 1$ to $K_g$ separately. For each such arm $i$, let $\mathcal{T}_i$ be the set of time instances where the greedy algorithm with full information, henceforth the Oracle Greedy (OG), plays arm $i$, and let $n_g(i) = |\mathcal{T}_i|$ be the number of times OG plays arm $i$. Let $X_g(t)$ be the mean reward obtained by OG in time slot $t$, which is a deterministic quantity. Recall that the reward obtained by UCB Greedy in time slot $t$ is denoted $X_{i_t}(t)$, a random variable.

In the blocking bandit model, we obtain free exploration because each arm becomes unavailable for a certain amount of time once it is played; this presents the opportunity to learn more about the subsequent arms. However, when the delays $D_i$ are arbitrary, the OG algorithm itself follows a complicated repeating pattern, which is periodic but with period $\mathrm{lcm}(D_i, i = 1 \text{ to } K_g)$. We therefore do not analyze the regret over a period directly, but consider the regret from each arm separately.

To understand our approach to the regret bound, fix an arm $i \le K_g$. We divide the time slots into blocks of length $D_i$, where each block begins at an instance where OG plays arm $i$. In each block, arm $i$ becomes available at least once for any algorithm, including UCB Greedy (UCBG), though not necessarily at the beginning of the block as for OG. In each such block, if we play an arm with index less than or equal to $i$ when arm $i$ first becomes available, we accumulate no regret when the reward from arm $i$ is considered in isolation. If instead we play an arm $j \ge (i + 1)$ when arm $i$ first becomes available, we may upper bound the regret in that block by $\Delta_{ij}$. Let $P_{ij}(t)$ denote the probability that arm $j \ge (i + 1)$ is played in the block starting at time $t \in \mathcal{T}_i$ when arm $i$ first becomes available.
Using the previous logic separately for each arm, and using linearity of expectation, we arrive at the following regret bound:
$$R^{1-1/e}_T \le \sum_{i=1}^{K_g} \sum_{t \in \mathcal{T}_i} \sum_{j=i+1}^{K} P_{ij}(t)\,\Delta_{ij}. \quad (1)$$
While bounding the regret in equation (1), in order to account for the combinatorial constraints due to the unavailability of arms, we phrase it as the following optimization problem:
$$\max \ \sum_{i=1}^{K_g} \sum_{t \in \mathcal{T}_i} \sum_{j=i+1}^{K} P_{ij}(t)\,\Delta_{ij} \quad (2)$$
$$\text{s.t. } P_{ij}(t) \le \mathbb{P}\Big(n_j(t) \le \frac{32 \log t}{\Delta_{ij}^2},\; a_t = j\Big), \quad \forall i, j \in [K],\ \forall t \in \mathcal{T}_i, \quad (3)$$
$$n_j(t) \ge c_j t - c'_j \log t \ \text{ w.p. at least } (1 - K/t^3), \text{ for constants } c_j, c'_j > 0, \ \forall j \le K^\star.$$
The first constraint is standard, whereas the second constraint represents the free exploration in the system. If any arm $i$ is played $n_i(t)$ times up to time $t$, then it is available for at least $(t - n_i(t) D_i)$ time slots. Among the time slots where arm $i$ is available, UCBG can play (1) the arms $1 \le j \le (i-1)$ at most $\sum_{j < i} t/D_j$ times in total, w.p. 1, due to the blocking constraints; and (2) the arms $(i+1) \le j \le K$ at most $\sum_{j > i} \frac{32 \log t}{\Delta_{ij}^2}$ times in total, w.p. at least $(1 - K/t^3)$, due to the UCB property and a union bound over all arms and time slots up to $t$. Therefore, for all $i \le K^\star$, w.p. at least $(1 - K/t^3)$, the stated lower bound on $n_i(t)$ holds; more importantly, w.h.p. for all $i \le K^\star$ we see that $n_i(t)$ grows linearly with time $t$. This provides the required upper bound after using the lower bounds on $n_j(t)$ for $j = 1$ to $K^\star$ appropriately.

### 4.3 Easy Instances and Regret Lower Bound

In this section, we show that there is a class of instances where Oracle Greedy is optimal, and we provide regret lower bounds for such settings.

Definition 4.2. An instance of the blocking bandit problem is an easy instance if Oracle Greedy is an offline optimal algorithm for that instance.

Examples: (1) One class of easy instances is blocking bandits where all arms have equal delay $D < K$. (2) When the sequences $\mathrm{seq}_i := \{i + k D_i : k \in \mathbb{N}\}$ for $i = 1$ to $K_g$ do not collide in any location ($\mathrm{seq}_i \cap \mathrm{seq}_j = \emptyset$ for all $i \ne j$) and cover the integers ($[T] \subseteq \cup_{i=1}^{K_g} \mathrm{seq}_i$ for all $T \ge 1$), a.k.a. exact covering systems [16], then Oracle Greedy is asymptotically optimal.

Lower Bound: We now provide a lower bound on the regret for easy instances. An algorithm is consistent iff for every instance of the stochastic blocking bandit problem its regret up to time $T$ satisfies $R^1_T = o(T^\delta)$ for all $\delta > 0$. We prove the regret lower bound over the class of consistent algorithms for easy instances of stochastic blocking bandits. We consider an instance where all delays equal $K^\star < K$, which is an easy instance. In this instance, the reward of each arm $i = 1$ to $K^\star$ has a Bernoulli distribution with mean $1/2$, whereas the arms $i = (K^\star + 1)$ to $K$ have mean $(1/2 - \epsilon)$. We call this instance $K^\star$-Set and prove the following.

Theorem 4.3. For any $K$, any $K^\star < K$, and any $\epsilon \in (0, 1/2)$, the regret of any consistent algorithm on the $K^\star$-Set instance is lower bounded as
$$\liminf_{T \to \infty} \frac{R^1_T}{\log T} \ \ge\ \frac{(K - K^\star)\,\epsilon}{\mathrm{kl}(1/2 - \epsilon,\ 1/2)},$$
where $\mathrm{kl}(p, q)$ denotes the KL divergence between Bernoulli distributions with means $p$ and $q$.

The proof of the above theorem makes use of the following lemma, which shows that the blocking bandit instance is equivalent to a combinatorial semi-bandit [11] problem on $m$-sets, for which regret lower bounds were established in [1].

Lemma 4.4. For any blocking bandit instance where $D_i = D \le K$ for all arms $i \in [K]$, any time horizon $T$, and any online algorithm $A_O$, there exists an online algorithm $A_B$ which chooses arms in blocks of $D$ time slots and obtains the same distribution of cumulative reward as $A_O$.

The proof of the lemma is deferred to the supplementary material.
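To make the lower bound concrete, the sketch below evaluates the constant $(K - K^\star)\,\epsilon / \mathrm{kl}(1/2 - \epsilon, 1/2)$ from the statement of Theorem 4.3 as reconstructed above, for a hypothetical instance; `kl_bernoulli` and `lower_bound_constant` are our names, and the parameters are illustrative.

```python
import math

def kl_bernoulli(p, q):
    """KL divergence between Bernoulli(p) and Bernoulli(q)."""
    return p * math.log(p / q) + (1 - p) * math.log((1 - p) / (1 - q))

def lower_bound_constant(K, K_star, eps):
    """Per-log-T regret lower bound constant for the K*-Set instance (our reading of Thm 4.3)."""
    return (K - K_star) * eps / kl_bernoulli(0.5 - eps, 0.5)

for eps in (0.05, 0.1, 0.2):
    print(eps, lower_bound_constant(K=20, K_star=7, eps=eps))
```

Since $\mathrm{kl}(1/2 - \epsilon, 1/2) = \Theta(\epsilon^2)$ for small $\epsilon$, the constant scales as $(K - K^\star)/\epsilon$, matching the $\Omega\big(\tfrac{K - K^\star}{\Delta(K, K^\star)} \log T\big)$ form stated in the contributions.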
## 5 Experimental Evaluation

Synthetic Experiments: We first validate our results on synthetic experiments with $K = 20$ arms. The gaps in the mean rewards of the arms, $\Delta_{i,i+1}$, are chosen uniformly at random (u.a.r.) from $[0.01, 0.05]$ for all $i = 1$ to $19$, and we fix $\mu_K = 0$. The rewards are distributed as Bernoulli random variables with means $\mu_i$. The delays are fixed either (1) by sampling all delays u.a.r. from $[1, 10]$ (small delay instances), (2) by sampling all delays u.a.r. from $[11, 20]$ (large delay instances), or (3) by fixing all delays to a single value.

Figure 1: Cumulative regrets scale logarithmically, as a constant, and as a negative linear function of time with randomly initialized delays, in Fig. 1a ($K^\star = 6$, $K_g = 9$), Fig. 1b ($K^\star = 20$, $K_g = 20$), and Fig. 1c ($K^\star = 6$, $K_g = 8$), respectively. Fig. 1d: Regret vs $K^\star$ with identical delays.

Once the rewards and the delays are fixed, we run both Oracle Greedy and UCB Greedy 250 times to obtain an expected regret trajectory (i.e. reward of Oracle Greedy minus reward of UCB Greedy), each over 10k time slots. For each setting, we repeat this process 50 times to obtain 50 such trajectories. We then plot the median, 75th and 25th percentile points in each time slot across these 50 trajectories in Figure 1.

Scaling with Time: We observe three different behaviors. In most cases, the regret scales logarithmically with $T$ (see Fig. 1a). In the second situation, when $K^\star = K_g$, the typical behavior is depicted in Fig. 1b, where we observe constant regret (for $K^\star = K$ the logarithmic part vanishes in our regret bounds). Finally, there are instances, as shown in Fig. 1c, where the regret is negative and scales linearly with time. Note that, as Oracle Greedy is suboptimal, UCB Greedy can potentially outperform it and have negative regret. As an example, consider the illustrative example in Section 1: if, due to learning error, UCB Greedy plays the sequence $121$, then it gets latched onto the sequence $1213\,1213\ldots$, which is optimal. Such events can happen with constant probability, resulting in a reward linearly larger than that of Oracle Greedy, which plays $321\bot\,321\bot\ldots$. This explains the instances with negative linear regret.

Scaling with $K^\star$: In Fig. 1d (where only the median is plotted) we consider instances with identical delays equal to $K^\star = 7, 11, 16, 20$. We observe that the regret decreases with increasing $K^\star$, in line with the proved lower bound.

Figure 2: Regret vs $K^\star$ in jokes recommendation with blocking, over 15k time slots (curves for no blocking and for $K^\star = 10, K_g = 13$; $K^\star = 18, K_g = 24$; $K^\star = 50, K_g = 58$).

Jokes Recommendation Experiment: We perform a jokes recommendation experiment using the Jester jokes dataset [14]. In particular, we consider 70 jokes from the dataset, each with at least 15k valid ratings in the range $[-10, 10]$. We rescale the ratings to $[0, 1]$ using $x \to (x + 10)/20$. In our experiments, when a specific joke is recommended, a rating is selected uniformly at random with repetition from its more than 15k ratings, and this rating acts as the instantaneous reward. The task is to recommend jokes so as to maximize the cumulative rating over a time horizon, with blocking constraints for each joke. The delays are chosen randomly, as in the synthetic experiments. For each experiment, we plot the expected regret trajectory over 15k time slots, taking the expectation over 500 simulated sample paths (Figure 2). We observe the expected scaling behavior: the regret scales logarithmically in time, and for larger $K^\star$ we observe smaller regret.
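A compact driver in the spirit of the synthetic experiments, sketched under our assumptions (it reuses the hypothetical `oracle_greedy` and `ucb_greedy` functions from the earlier sketches; instance sizes and seeds are illustrative, not the paper's exact configuration):

```python
import numpy as np

# Assumes oracle_greedy(...) and ucb_greedy(...) from the sketches above are in scope.
rng = np.random.default_rng(1)

K, T, runs = 20, 10_000, 50
gaps = rng.uniform(0.01, 0.05, size=K - 1)               # Delta_{i,i+1} drawn u.a.r. from [0.01, 0.05]
means = np.concatenate(([0.0], np.cumsum(gaps)))[::-1]   # mu_K = 0, means sorted non-increasing
delays = rng.integers(1, 11, size=K)                     # "small delay" instances: u.a.r. from [1, 10]

oracle_total, _ = oracle_greedy(means, delays, T)        # deterministic expected reward
regret = [oracle_total - ucb_greedy(means, delays, T, seed=s) for s in range(runs)]
print(np.median(regret), np.percentile(regret, 25), np.percentile(regret, 75))
```

The printed median and quartiles correspond to the statistics plotted in Figure 1.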
## 6 Conclusion

We propose blocking bandits, a novel stochastic multi-armed bandit problem where each arm is blocked for a specific number of time slots once it is played. We provide hardness results and approximation guarantees for the offline version of the problem, showing that a greedy algorithm provides a $(1 - 1/e)$ approximation. We propose UCB Greedy and analyze its regret upper bound through novel techniques, such as free exploration. For instances on which Oracle Greedy is optimal, we provide lower bounds on the regret. Improving the regret bounds using knowledge of the delays of the arms is an interesting future direction which we intend to explore. In another direction, better lower bounds through novel constructions (e.g. exact covering systems) can be investigated.

Acknowledgements: This research was partially supported by NSF Grant 1826320, ARO grant W911NF-17-1-0359, the Wireless Networking and Communications Group Industrial Affiliates Program, and the US DOT supported D-STOP Tier 1 University Transportation Center.

## References

[1] Venkatachalam Anantharam, Pravin Varaiya, and Jean Walrand. Asymptotically efficient allocation rules for the multiarmed bandit problem with multiple plays, Part I: I.I.D. rewards. IEEE Transactions on Automatic Control, 32(11):968–976, 1987.

[2] Peter Auer and Ronald Ortner. Logarithmic online regret bounds for undiscounted reinforcement learning. In Advances in Neural Information Processing Systems, pages 49–56, 2007.

[3] Thomas Bosman, Martijn Van Ee, Yang Jiao, Alberto Marchetti-Spaccamela, R. Ravi, and Leen Stougie. Approximation algorithms for replenishment problems with fixed turnover times. In Latin American Symposium on Theoretical Informatics, pages 217–230. Springer, 2018.

[4] Sébastien Bubeck and Nicolò Cesa-Bianchi. Regret analysis of stochastic and nonstochastic multi-armed bandit problems. Foundations and Trends in Machine Learning, 5(1):1–122, 2012.

[5] Chris Calabro, Russell Impagliazzo, Valentine Kabanets, and Ramamohan Paturi. The complexity of unique k-SAT: An isolation lemma for k-CNFs. Journal of Computer and System Sciences, 74(3):386–393, 2008.

[6] Leonardo Cella and Nicolò Cesa-Bianchi. Stochastic bandits with delay-dependent payoffs.

[7] Nicolò Cesa-Bianchi and Paul Fischer. Finite-time regret bounds for the multiarmed bandit problem. In ICML, pages 100–108. Citeseer, 1998.

[8] Nicolò Cesa-Bianchi and Gábor Lugosi. Combinatorial bandits. Journal of Computer and System Sciences, 78(5):1404–1422, 2012.

[9] Wei Chen, Yajun Wang, and Yang Yuan. Combinatorial multi-armed bandit: General framework and applications. In International Conference on Machine Learning, pages 151–159, 2013.

[10] Richard Combes, Chong Jiang, and Rayadurgam Srikant. Bandits with budgets: Regret lower bounds and optimal algorithms. In Proceedings of the 2015 ACM SIGMETRICS International Conference on Measurement and Modeling of Computer Systems, pages 245–257. ACM, 2015.

[11] Richard Combes, Mohammad Sadegh Talebi Mazraeh Shahi, Alexandre Proutiere, et al. Combinatorial bandits revisited. In Advances in Neural Information Processing Systems, pages 2116–2124, 2015.

[12] Peter C. Fishburn and Jeffrey C. Lagarias. Pinwheel scheduling: Achievable densities. Algorithmica, 34(1):14–38, 2002.

[13] Yi Gai, Bhaskar Krishnamachari, and Rahul Jain. Combinatorial network optimization with unknown variables: Multi-armed bandits with linear rewards and individual observations. IEEE/ACM Transactions on Networking (TON), 20(5):1466–1478, 2012.

[14] Ken Goldberg, Theresa Roeder, Dhruv Gupta, and Chris Perkins. Eigentaste: A constant time collaborative filtering algorithm. Information Retrieval, 4(2):133–151, 2001.

[15] Aditya Gopalan and Shie Mannor.
Thompson sampling for learning parameterized Markov decision processes. In Conference on Learning Theory, pages 861–898, 2015.

[16] I. P. Goulden, Andrew Granville, L. Bruce Richmond, and Jeffrey Shallit. Natural exact covering systems and the reversion of the Möbius series. The Ramanujan Journal, pages 1–25, 2018.

[17] András György, Tamás Linder, Gábor Lugosi, and György Ottucsák. The on-line shortest path problem under partial monitoring. Journal of Machine Learning Research, 8(Oct):2369–2403, 2007.

[18] Robert Holte, Al Mok, Louis Rosier, Igor Tulchinsky, and Donald Varvel. The pinwheel: A real-time scheduling problem. In Proceedings of the Twenty-Second Annual Hawaii International Conference on System Sciences, Volume II: Software Track, volume 2, pages 693–702. IEEE, 1989.

[19] Nicole Immorlica, Karthik Abinav Sankararaman, Robert Schapire, and Aleksandrs Slivkins. Adversarial bandits with knapsacks. In IEEE 60th Annual Symposium on Foundations of Computer Science (FOCS). IEEE, 2019.

[20] Tobias Jacobs and Salvatore Longo. A new perspective on the windows scheduling problem. arXiv preprint arXiv:1410.7237, 2014.

[21] Satyen Kale, Chansoo Lee, and Dávid Pál. Hardness of online sleeping combinatorial optimization problems. In Advances in Neural Information Processing Systems, pages 2181–2189, 2016.

[22] Robert Kleinberg and Nicole Immorlica. Recharging bandits. In 2018 IEEE 59th Annual Symposium on Foundations of Computer Science (FOCS), pages 309–319. IEEE, 2018.

[23] Robert Kleinberg, Alexandru Niculescu-Mizil, and Yogeshwer Sharma. Regret bounds for sleeping experts and bandits. Machine Learning, 80(2-3):245–272, 2010.

[24] Branislav Kveton, Zheng Wen, Azin Ashkan, and Csaba Szepesvári. Tight regret bounds for stochastic combinatorial semi-bandits. arXiv preprint arXiv:1410.0949, 2014.

[25] Tze Leung Lai and Herbert Robbins. Asymptotically efficient adaptive allocation rules. Advances in Applied Mathematics, 6(1):4–22, 1985.

[26] Karthik Abinav Sankararaman and Aleksandrs Slivkins. Combinatorial semi-bandits with knapsacks. In International Conference on Artificial Intelligence and Statistics, pages 1760–1770, 2018.

[27] Ambuj Tewari and Peter L. Bartlett. Optimistic linear programming gives logarithmic regret for irreducible MDPs. In Advances in Neural Information Processing Systems, pages 1505–1512, 2008.

[28] Datong P. Zhou and Claire J. Tomlin. Budget-constrained multi-armed bandits with multiple plays. In Thirty-Second AAAI Conference on Artificial Intelligence, 2018.