Recurrent Submodular Welfare and Matroid Blocking Semi-Bandits

Orestis Papadigenopoulos
Department of Computer Science, The University of Texas at Austin
papadig@cs.utexas.edu

Constantine Caramanis
Electrical and Computer Engineering, The University of Texas at Austin
constantine@utexas.edu

Abstract

A recent line of research focuses on the study of stochastic multi-armed bandits (MAB) in the case where temporal correlations of a specific structure are imposed between the player's actions and the reward distributions of the arms. These correlations lead to (sub-)optimal solutions that exhibit interesting dynamical patterns, a phenomenon that yields new challenges both from an algorithmic as well as a learning perspective. In this work, we extend the above direction to a combinatorial semi-bandit setting and study a variant of stochastic MAB, where arms are subject to matroid constraints and each arm becomes unavailable (blocked) for a fixed number of rounds after each play. A natural common generalization of the state-of-the-art for blocking bandits, and that for matroid bandits, only guarantees a $1/2$-approximation for general matroids. In this paper we develop the novel technique of correlated (interleaved) scheduling, which allows us to obtain a polynomial-time $(1 - 1/e)$-approximation algorithm (asymptotically and in expectation) for any matroid. Along the way, we discover an interesting connection to a variant of Submodular Welfare Maximization, for which we provide (asymptotically) matching upper and lower approximability bounds. In the case where the mean arm rewards are unknown, our technique naturally decouples the scheduling from the learning problem, and thus allows us to control the $(1 - 1/e)$-approximate regret of a UCB-based adaptation of our online algorithm.

1 Introduction

Despite the large number of variants of the stochastic multi-armed bandits (MAB) model [46, 33] that have been introduced [8, 34], the majority of the results comply with the common assumption that playing an action does not alter the environment, namely, the reward distributions of the subsequent rounds (with notable exceptions discussed below). Only recently have researchers focused their attention on settings where temporal dependencies of a specific structure are imposed between the player's actions and the reward distributions [27, 10, 7, 39, 6]. In [27], Kleinberg and Immorlica consider the setting of recharging bandits, where the expected reward of each arm is a concave and weakly increasing function of the time passed since its last play, modeling in that way scenarios of local performance loss. In a similar spirit, Basu et al. [7] consider the problem of blocking bandits, in which case once an arm is played at some round, it cannot be played again (i.e., it becomes blocked) for a fixed number of consecutive rounds. Notice that all the aforementioned examples are variations of the stochastic MAB setting, where the decision maker plays (at most) one arm per time step. When combinatorial constraints and time dynamics come together, the result is a much richer and more challenging setting, precisely because their interplay creates a complex dynamical structure. Indeed, in the standard combinatorial bandits setting [11], the optimal solution in hindsight is to consistently play the feasible subset of arms of maximum expected reward. However, in the presence
of local temporal constraints on the arms, an optimal (or even suboptimal) solution cannot be trivially characterized, a fact that significantly complicates the analysis, both from the algorithmic as well as from the learning perspective. In this work, we study the following bandit setting, a common generalization of matroid bandits, introduced by Kveton et al. [30], and blocking bandits [7]:

Problem 1.1 (Matroid Blocking Semi-Bandits (MBB)). We consider a set $A$ of $k$ arms, a matroid $\mathcal{M} = (A, \mathcal{I})$, and an unknown time horizon of $T$ rounds. Each arm $i \in A$ is associated with an unknown bounded reward distribution of mean $\mu_i$, and with a known deterministic delay $d_i$, such that whenever an arm $i$ is played at some round, it cannot be played again for the next $d_i - 1$ rounds. At each time step, the player pulls a subset of the available (i.e., not blocked) arms, restricted to be an independent set of $\mathcal{M}$. Subsequently, she observes the reward realization of each arm played (semi-bandit feedback) and collects their sum as the reward for this round. The goal of the player is to maximize her expected cumulative reward over $T$ rounds.

The above model captures a number of applications, varying from team formation to ad placement, where arms represent actions that cannot be played repeatedly without restriction. As a concrete example, consider a recommendation system that repeatedly suggests a variety of products (e.g., songs, movies, books) to a user. The need for diversity in the collection of suggested products (arms), to capture different aspects of the user's preferences, can be modeled as a linear matroid. Further, the blocking constraints preclude the incessant recommendation of the same product (which can be detrimental, as the product might be perceived as spam), while the maximum rate of recommendation (controlled by the delay) might depend on factors such as popularity, promotion and more. Finally, the expected reward of each product is the probability of purchasing (or clicking).

From a technical viewpoint, the MBB problem is already NP-hard for the simple case of a uniform rank-1 matroid (see Theorem 2.1 in [43]), even in the full-information setting, where the reward distributions are known to the player a priori. The natural common generalization of the algorithms in [7, 30] computes and plays, at each time step, an independent set of maximum mean reward consisting of the available elements. While the above strategy is a $(1 - 1/e)$-approximation asymptotically (that is, for $T \to \infty$) for partition matroids, unfortunately, it only guarantees a $1/2$-approximation for general matroids [1], and this guarantee is tight (see Appendix E for an example). A natural question that arises is whether a $(1 - 1/e)$-approximation is possible for any matroid. The main result of this paper shows that this is indeed possible. Along the way, we identify that the key insight (and also the weak point of the naive $1/2$-approximation) is the underlying diminishing-returns property hidden in the matroid structure. In particular, we discover an interesting connection of MBB to the following problem of interest in its own right:

Problem 1.2 (Recurrent Submodular Welfare (RSW)). We consider a monotone (non-decreasing) submodular function $f : 2^A \to \mathbb{R}_{\ge 0}$ over a universe $A$ and a time horizon $T$. At each round $t \in [T]$ we choose a subset $A_t \subseteq A$ and collect a reward $f(A_t)$.
However, using an element $i \in A$ at some round $t \in [T]$ makes it unavailable (i.e., blocked) for a fixed and known number of $d_i - 1$ subsequent rounds, namely, during the interval $[t, t + d_i - 1]$. The objective is to maximize $\sum_{t \in [T]} f(A_t)$, subject to the blocking constraints, within a (potentially unknown) time horizon $T$.

For the above model, which can be thought of as a variant of Submodular Welfare Maximization [47], we provide an efficient randomized $(1 - 1/e)$-approximation (asymptotically), accompanied by a matching hardness result. Note that the RSW problem is a very natural model, capturing applications of submodular maximization in repeating scenarios, where the elements cannot be constantly used without restriction. As an example, consider the process of renting goods to a stream of customers with identical submodular utility functions modeling their satisfaction.

As we show, our approach for the RSW problem immediately implies an algorithm with the same approximation guarantee for the full-information case of MBB and, additionally, it has important implications for the bandit setting, where the reward distributions are initially unknown. The standard goal in this case is to provide a (sublinear in the time horizon) upper bound on the regret, namely, the difference between the expected reward of a bandit algorithm and that of a (near-)optimal algorithm, due to the initial lack of knowledge of the former.¹

¹ In fact, we upper bound the $(1 - 1/e)$-(approximate) regret, defined as the difference between $(1 - 1/e) \cdot OPT(T)$ and the expected reward collected by a bandit algorithm. The notion of $\alpha$-regret is widely used in the combinatorial bandits literature [15, 48] for combinatorial problems where an efficient (offline) optimal algorithm does not exist and, thus, any efficient bandit algorithm would inevitably suffer linear regret in the standard definition (where $\alpha = 1$).

1.1 Related Work

A recent line of research focuses on non-stationary models in the case where each reward distribution is a special function of the player's actions [10, 39, 6]. In [7], Basu et al. provide a greedy $(1 - 1/e)$-approximation for the full-information case of the blocking bandits problem (a special case of the MBB model for a uniform rank-1 matroid). As we have already mentioned, generalizing their strategy to the MBB problem fails to provide the same guarantee for general matroids. In the bandit setting, where the reward distributions are initially unknown, the authors have to overcome the burden of characterizing a (sub)optimal solution, where the rate of mean collected reward exhibits significant fluctuations over time. The key insight is to observe that every time the full-information algorithm plays an arm, its bandit variant, which relies on estimations of the mean rewards, has at least one chance of playing the same arm. However, this key coupling argument, which enables sublinear regret bounds, becomes significantly more involved in the presence of matroid constraints.

In [27], Kleinberg and Immorlica study the case of recharging bandits. Their approach first computes the optimal playing frequency $1/x_i$ of each arm $i$ via a mathematical formulation. In order to play each arm with this frequency, they develop the technique of interleaved rounding, where they associate each arm $i$ with a sequence of real numbers $\{(\theta_i + k)/x_i\}_{k \in \mathbb{N}}$, with $\theta_i \sim U[0, 1]$. Then, the arms are played sequentially in the same order in which their points appear on the real line. This novel rounding technique exhibits reduced variance and, thus, an improved approximation guarantee compared to other natural approaches such as independent randomized rounding.
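To make the interleaved-rounding idea concrete, the following Python sketch is one plausible reading of the construction described above (the function name and the interpretation of $x_i$ as the target fraction of rounds in which arm $i$ is played are our own assumptions, not taken from [27]): each arm is assigned the evenly spaced, randomly shifted points $(\theta_i + k)/x_i$, the points of all arms are merged, and arms are played one per round in the order of their points on the real line.

```python
import random

def interleaved_rounding(rates, horizon, seed=0):
    """Sketch of interleaved rounding: arm i gets the points (theta_i + k) / x_i,
    theta_i ~ U[0, 1]; arms are then played one per round in the order in which
    their points appear on the real line (ties broken arbitrarily)."""
    rng = random.Random(seed)
    points = []
    for arm, x in enumerate(rates):            # x plays the role of x_i
        theta = rng.random()
        k = 0
        while (theta + k) / x <= horizon:      # keep only points that fit in the horizon
            points.append(((theta + k) / x, arm))
            k += 1
    points.sort()
    return [arm for _, arm in points][:horizon]   # arm to play at rounds 1, 2, ...

print(interleaved_rounding([0.5, 0.3, 0.2], horizon=20))
```

The random common shift per arm is what keeps each arm's plays evenly spaced (low variance), in contrast to independent per-round randomization.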
The MBB model is also related to the literature on periodic scheduling [5, 4]. In [43], Sgall et al. consider the problem of periodically scheduling jobs on a set of machines. Each job is associated with a processing time, during which it occupies the machine it is executed on, a vacation time, namely, a minimum time required after its completion in order to be rescheduled, and a reward. It is not hard to see that the case of unit processing times is a special case of MBB with a uniform matroid of rank equal to the number of machines, under the objective of maximizing the total reward. Further, it is known [7] that the rank-1 case of MBB generalizes the Pinwheel Scheduling problem [24]: Given $k$ colors associated with a set of integers $\{d_i\}_{i \in [k]}$, such that $\sum_{i \in [k]} 1/d_i = 1$, decide whether there is a coloring of the natural numbers $\chi : \mathbb{N} \to [k]$ such that every color $i \in [k]$ appears at least once every $d_i$ numbers. As proved in [25], the above problem does not admit a pseudopolynomial-time algorithm unless SAT can be solved by a randomized algorithm in expected quasi-polynomial time.

In a concurrent work [1], the authors consider the blocking bandit model in a generic combinatorial setting and under stochastic delays. As they show, the greedy algorithm that plays at each time the maximum feasible subset of available arms is an $O(1)$-approximation for downward-closed set systems. However, when specialized to matroids, they cannot do better than a $1/2$-approximation. We need new ideas to reach a $(1 - 1/e)$-approximation algorithm and associated regret guarantees for the rich class of matroid bandits. Our work is related to the line of research regarding bandit optimization of submodular functions (see [13, 20] and references therein). We refer the reader to Appendix A for additional related work on non-stationary bandits, combinatorial bandits, and submodular welfare maximization.

1.2 Our Contributions

Reducing full-information MBB to RSW. We first focus on the full-information variant of MBB, where the mean rewards of the arms are known to the player a priori. We assume that the player has access to the matroid $\mathcal{M}$ via an independence oracle and knows the arms' fixed delays, yet she is oblivious to the time horizon $T$. In this sense, she plays online. An interesting aspect of dynamics, as illustrated in [27, 7, 6], is that one needs to guarantee, via scheduling, that each arm is played at a frequency close to its optimal rate. This is particularly important in the presence of hard blocking constraints, where no reward can be obtained from a blocked arm. In order to address the above scheduling problem, we propose a particular decoupled two-phase strategy. We refer to the two phases as (cooperative) Player A and Player B. Initially, Player A decides on a schedule that determines arm availability, namely, a subset of rounds where each arm is allowed to be played. Subsequently, Player B chooses a subset of available arms that maximizes the total expected reward, subject to the matroid constraints. In order to completely decouple the two phases, the availability schedule produced by Player A is never affected by which arms are eventually chosen by Player B (that is, it is impossible for Player B to violate the blocking constraints).
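The decoupled interface can be sketched in a few lines of Python (the names and the particular rule used by Player A below are illustrative assumptions; the paper's Player A uses the interleaved scheduling of Section 3): Player A fixes an availability schedule using only the delays, so that offers to arm $i$ are always at least $d_i$ rounds apart and no choice of Player B can violate a blocking constraint; Player B then greedily picks a maximum-weight independent set among the arms offered in each round.

```python
def player_a_schedule(delays, horizon):
    """Player A: decide, from the delays alone (never the rewards), in which rounds
    each arm is offered.  Offers to arm i are spaced >= d_i rounds apart, so whatever
    Player B later picks, the blocking constraints hold."""
    next_offer = [1] * len(delays)
    schedule = []
    for t in range(1, horizon + 1):
        offered = {i for i, nxt in enumerate(next_offer) if nxt <= t}
        for i in offered:
            next_offer[i] = t + delays[i]
        schedule.append(offered)
    return schedule

def player_b_reward(schedule, means, is_independent):
    """Player B: in every round, greedily pick a maximum-weight independent set
    among the offered arms (the greedy algorithm for matroids, cf. Section 2)."""
    total = 0.0
    for offered in schedule:
        basis = set()
        for i in sorted(offered, key=lambda i: means[i], reverse=True):
            if is_independent(basis | {i}):
                basis.add(i)
                total += means[i]
    return total

sched = player_a_schedule(delays=[2, 3, 3, 4], horizon=12)
print(player_b_reward(sched, means=[0.9, 0.8, 0.6, 0.4],
                      is_independent=lambda S: len(S) <= 2))  # uniform matroid of rank 2
```

The essential point is that `player_a_schedule` never reads the means: the same schedule works whether Player B optimizes with known means or with learned estimates.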
In the case where Player B knows the expected rewards of the arms, and due to the above decoupling property, his optimal strategy (given any availability schedule) can be easily characterized: since the arms of each round are subject to matroid constraints, Player B achieves his goal by playing a maximum expected reward independent set among the available arms of each round, which can be computed efficiently using the greedy algorithm for matroids. Thus, the role of Player A becomes to choose an availability schedule that maximizes the total reward, knowing that Player B will behave exactly as described above. The key observation is that the solution computed by Player B at each round corresponds to the weighted rank function of the matroid evaluated on the set of available arms of the round. More importantly, it can be proved that this function is monotone submodular and, hence, Player A's task is a special case of the RSW problem.

Optimal approximation for RSW. Focusing our attention on the RSW problem, any good solution should guarantee that each element $i \in A$ is selected a fraction of the time close to $1/d_i$ (the maximum possible), where $d_i$ is the delay. However, a naive randomized approach that selects (if available) each element $i$ with probability $1/d_i$ independently at each round can be as bad as a $(1 - e^{-1/2}) \approx 0.393$-approximation (see Appendix E for an example). Instead, motivated by the rounding technique of Kleinberg and Immorlica [27], we develop a (time-)correlated sampling strategy, which we call interleaved scheduling. While our technique is based on the same principle of transforming (randomly interleaved) sequences of real numbers into a feasible schedule, our implementation is very different. Indeed, as opposed to [27], we additionally face the issue of scheduling more than one arm per round, subject to matroid constraints, and the fact that our hard blocking constraints are particularly sensitive to the variance of the produced schedule. Using our technique, we construct a polynomial-time randomized algorithm, named INTERLEAVED-SUBMODULAR (IS), that achieves the following guarantee for RSW:

Theorem 1.3. The expected reward collected by INTERLEAVED-SUBMODULAR over $T$ rounds, $R^{IS}(T)$, is at least $(1 - 1/e) \cdot OPT(T) - O(d_{\max} f(A))$, where $OPT(T)$ is the optimal reward of RSW for $T$ rounds and $d_{\max} = \max_{i \in A} d_i$ is the maximum delay of the instance.

The proof of the above guarantee relies on the construction of a convex program (CP), based on the concave closure of $f$ (see below), that yields an (approximate up to an additive term) upper bound on the optimal reward. Although our algorithm never computes an optimal solution to this convex program, it allows us to compare its expected collected reward with the optimal solution of (CP), leveraging known results on the correlation gap of submodular functions. As we show via a reduction from the SWM problem with identical utilities, the $(1 - 1/e)$ term in the above guarantee is asymptotically the best possible, unless P = NP; further, the additive term results from the fact that our algorithm is oblivious to the time horizon $T$.

Bandit algorithm and regret guarantees. We now turn our attention to the bandit setting of MBB, where the mean rewards are initially unknown. Our interleaved scheduling method exhibits an additional property: it does not rely on the monotone submodular function itself, a fact that is particularly important for the bandit setting.
Indeed, in the full-information setting Player B computes a maximum expected reward independent set at each round, for any availability schedule provided by Player A. In the bandit setting, however, the reward distributions are not a priori known and, thus, must be learned. Nevertheless, we do not need to wait to learn these distributions to find a good availability schedule. This allows us to make a natural coupling between the strategy of Player B in the bandit and in the full-information case and, thus, to compare the expected reward collected "pointwise", assuming a fixed common availability schedule. We remark that the above coupling is very different from the one in [7], as ours is independent of the trajectory of the observed rewards.

The above analysis allows us to develop a bandit algorithm for MBB based on the UCB method, called INTERLEAVED-UCB (IB). Specifically, given any availability schedule provided by Player A (independently of the rewards) and in increasing order of rounds, Player B greedily computes a maximal independent set consisting of the available arms of each round, based on estimates (known as UCB indices) of the mean rewards. In order to analyze the regret, we use the independence of the availability schedule in combination with the strong basis exchange property of matroids. This allows us to decompose the overall regret of our algorithm into contributions from each individual arm. Once we have established this regret decomposition, we can bound the individual regret attributed to each arm using more standard UCB-type arguments [30], leading to the following guarantee:

Theorem 1.4. The expected reward collected by INTERLEAVED-UCB in $T$ rounds, $R^{IB}(T)$, for $k$ arms, a matroid of rank $r = \mathrm{rk}(\mathcal{M})$ and maximum delay $d_{\max}$, is at least
$$\Big(1 - \frac{1}{e}\Big) OPT(T) - O\Big(\sqrt{k\, r\, T \ln(T)} + k^2 + d_{\max}\, r\Big).$$

In the above bound, the additive loss corresponds to the $(1 - 1/e)$-regret with respect to $OPT(T)$. Interestingly, our regret bound is very close (even in constant factors) to the information-theoretically optimal bound provided in [30] for the non-blocking setting. In fact, except for the small additive $O(d_{\max} r)$ term, the regret bound in [30] is the same as ours, if we replace the number of arms $k$ with $k - r$. Intuitively, this is due to the fact that our algorithm must learn the complete order of mean rewards, as opposed to the non-blocking setting, where learning the maximum expected reward independent set in hindsight is sufficient for eliminating the regret. All omitted proofs are deferred to the Appendix.

2 Preliminaries on Matroids and Submodular Functions

Continuous extensions and correlation gap of submodular functions. Consider any set function $f : 2^A \to \mathbb{R}_{\ge 0}$ over a ground set $A$ of $k$ elements. Recall that $f$ is submodular if, for all $S, T \subseteq A$, we have $f(S \cup T) + f(S \cap T) \le f(S) + f(T)$. For any point $x \in [0, 1]^k$, we denote by $S \sim x$ the random set $S \subseteq A$ such that each element $i$ is included independently with $\mathbb{P}(i \in S) = x_i$. We consider two canonical continuous extensions of a set function:

Definition 2.1 (Continuous extensions). For any set function $f$, the multilinear extension is
$$F(x) = \mathbb{E}_{S \sim x}[f(S)] = \sum_{S \subseteq A} f(S) \prod_{i \in S} x_i \prod_{i \notin S} (1 - x_i).$$
Moreover, the concave closure is defined as
$$f^+(x) = \max_{\alpha} \Big\{ \sum_{S \subseteq A} \alpha_S f(S) \;:\; \sum_{S \subseteq A} \alpha_S \mathbf{1}_S = x,\ \sum_{S \subseteq A} \alpha_S = 1,\ \alpha_S \ge 0 \Big\},$$
where $\mathbf{1}_S \in \{0, 1\}^k$ is an indicator vector such that $(\mathbf{1}_S)_i = 1$, if $i \in S$, and $(\mathbf{1}_S)_i = 0$, otherwise.

Lemma 2.2 (Correlation gap [9]). Let $f : 2^A \to \mathbb{R}_{\ge 0}$ be a monotone (non-decreasing) submodular function. Then, for any point $x \in [0, 1]^k$, we have $F(x) \le f^+(x) \le (1 - 1/e)^{-1} F(x)$.
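As a small numerical illustration (our own, not taken from the paper) of why the $(1 - 1/e)^{-1}$ factor in Lemma 2.2 is the right one, take $f(S) = \min(|S|, 1)$, which is monotone submodular, and the point $x = (1/k, \dots, 1/k)$: the multilinear extension is $F(x) = 1 - (1 - 1/k)^k$, while the concave closure is $f^+(x) = \min(1, \sum_i x_i) = 1$ (spread the probability mass over disjoint singletons), so the ratio $F(x)/f^+(x)$ approaches $1 - 1/e$.

```python
import math, random

rng = random.Random(0)
f = lambda S: min(len(S), 1)   # monotone submodular: "is at least one element picked?"

for k in (2, 5, 20, 200):
    x = [1.0 / k] * k
    # Definition 2.1: F(x) = E_{S ~ x}[f(S)], estimated by sampling S with independent marginals x_i
    n = 20_000
    F_mc = sum(f({i for i in range(k) if rng.random() < x[i]}) for _ in range(n)) / n
    F_exact = 1 - (1 - 1.0 / k) ** k     # closed form of F(x) for this particular f
    f_plus = min(1.0, sum(x))            # concave closure at x (disjoint singletons are feasible)
    print(k, round(F_mc, 3), round(F_exact, 3), "F/f+ =", round(F_exact / f_plus, 3))

print("1 - 1/e =", round(1 - 1 / math.e, 3))
```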
Matroids and weighted rank functions. Consider a matroid $\mathcal{M} = (A, \mathcal{I})$, where $A$ is the ground set and $\mathcal{I}$ is the family of independent sets. Recall that in any matroid, the family $\mathcal{I}$ satisfies the following two properties: (i) Every subset of an independent set (including the empty set) is an independent set, namely, if $S' \subseteq S \subseteq A$ and $S \in \mathcal{I}$, then $S' \in \mathcal{I}$ (hereditary property). (ii) If $S, S' \subseteq A$ are two independent sets with $|S| < |S'|$, then there exists some $e \in S' \setminus S$ such that $S \cup \{e\} \in \mathcal{I}$ (augmentation property). See [42, 37] for more details on matroids. We assume that access to $\mathcal{M}$ is given through an independence oracle [22, 40], namely, a black-box routine that, given a set $S \subseteq A$, answers whether $S$ is an independent set of $\mathcal{M}$. For any set $R \subseteq A$ we define the restriction of $\mathcal{M}$ to $R$, denoted by $\mathcal{M}|_R$, to be the matroid $\mathcal{M}|_R = (R, \{I \in \mathcal{I} \mid I \subseteq R\})$. Given any non-negative linear weight vector $w \in \mathbb{R}^k_{\ge 0}$, the problem of computing a maximum-weight independent set can be solved optimally by the standard greedy algorithm: starting from the empty set $S = \emptyset$, add each ground element $e \in A$ to the set $S$ in non-increasing order of weights, as long as the set $S \cup \{e\}$ does not contain a circuit. Given a matroid $\mathcal{M} = (A, \mathcal{I})$ and a weight vector $w$, the function $f_{\mathcal{M},w}(S) = \max_{I \in \mathcal{I},\, I \subseteq S} \{w(I)\}$ is called the weighted rank function of $\mathcal{M}$ and returns the weight of the maximum independent set of the restriction $\mathcal{M}|_S$.

Lemma 2.3 (Weighted rank function [9]). For any matroid $\mathcal{M}$ and non-negative weight vector $w$, the function $f_{\mathcal{M},w}(S) = \max_{I \in \mathcal{I},\, I \subseteq S} \{w(I)\}$ is monotone (non-decreasing) submodular.

Technical notation. For any event $E$, we denote by $\mathcal{X}(E) \in \{0, 1\}$ the indicator variable such that $\mathcal{X}(E) = 1$, if $E$ occurs, and $\mathcal{X}(E) = 0$, otherwise. For any non-negative integer $n \in \mathbb{N}$, we define $[n] = \{1, 2, \dots, n\}$. For any vector $\mu \in \mathbb{R}^k$ and set $S \subseteq [k]$, we define $\mu(S) = \sum_{i \in S} \mu_i$. Moreover, for a time index $t$ we use the notation $t \in [a, b]$ (for $a \le b$) in place of $t \in [T] \cap \{a, \dots, b\}$. Unless otherwise noted, we use the indices $i$, $j$ or $i'$ to refer to arms and $t$, $t'$ or $\tau$ to refer to time. Let $A^{\Pi}_t \in \mathcal{I}$ be the set of arms played by some algorithm $\Pi \in \{\text{IS}, \text{IG}, \text{IB}\}$ (defined in Sections 3 and 4) at time $t$. Unless otherwise noted, all expectations are taken over the randomness of the offsets $\{r_i\}_{i \in [k]}$ (see Section 3) and the reward realizations.

3 Recurrent Submodular Welfare

Let $f : 2^A \to \mathbb{R}_{\ge 0}$ be a monotone submodular function over a universe $A$ of $k$ elements, such that $f(\emptyset) = 0$. In the blocking setting, each element $i \in A$ is associated with a known deterministic delay $d_i \in \mathbb{N}_{>0}$, such that once the element is played at some round $t$, it becomes unavailable for the next $d_i - 1$ rounds, namely, in the interval $\{t, \dots, t + d_i - 1\}$. At each round $t \in [T]$, the player chooses a subset $A_t$ of available (i.e., non-blocked) elements and collects a reward $f(A_t)$. The goal is to maximize the total reward collected, i.e., $\sum_{t \in [T]} f(A_t)$, within an unknown time horizon $T$.

We provide an efficient randomized $(1 - 1/e)$-approximation algorithm for RSW. Informally, the algorithm starts by considering, for each element $i \in A$, a sequence of rational numbers of the form $\{\frac{t-1}{d_i}\}_{t \in [T]}$. Then, these sequences are interleaved by randomly adding an offset $r_i$, drawn uniformly at random from $[0, 1]$, for each $i \in A$ to the corresponding sequence. At every round $t \in [T]$, the algorithm chooses a set $A_t$, consisting only of elements for which the (perturbed) interval $L_{i,t} = [\frac{t-1}{d_i} + r_i, \frac{t}{d_i} + r_i)$ contains an integer.

Algorithm 3.1 (INTERLEAVED-SUBMODULAR (IS)). For each element $i \in A$, let $r_i \sim U[0, 1]$ be a random offset drawn uniformly from $[0, 1]$. At every round $t = 1, 2, \dots$, let $A_t \subseteq A$ be the subset of elements such that for any $i \in A_t$, the interval $L_{i,t} = [\frac{t-1}{d_i} + r_i, \frac{t}{d_i} + r_i)$ contains an integer. Choose the elements $A_t$ and collect the reward $f(A_t)$.
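A minimal Python sketch of Algorithm 3.1 follows (the value oracle `f` and the toy coverage function are illustrative assumptions): each element keeps its own offset $r_i$, and at round $t$ it is selected exactly when the interval $L_{i,t}$ contains an integer.

```python
import math, random

def interleaved_schedule(delays, horizon, rng):
    """For each round t = 1..T, yield the set A_t of elements i whose perturbed
    interval L_{i,t} = [(t-1)/d_i + r_i, t/d_i + r_i) contains an integer.
    The offsets r_i ~ U[0, 1] are drawn once, before the first round."""
    offsets = [rng.random() for _ in delays]
    for t in range(1, horizon + 1):
        yield {i for i, (d, r) in enumerate(zip(delays, offsets))
               if math.ceil((t - 1) / d + r) < t / d + r}

def interleaved_submodular(delays, f, horizon, seed=0):
    """Total reward sum_t f(A_t) collected by one run of the sketch."""
    rng = random.Random(seed)
    return sum(f(A_t) for A_t in interleaved_schedule(delays, horizon, rng))

# toy monotone submodular reward: coverage of the ground set {1, ..., 5}
covers = [{1, 2}, {2, 3}, {3, 4}, {4, 5}]
f = lambda S: len(set().union(*[covers[i] for i in S])) if S else 0
print(interleaved_submodular(delays=[2, 2, 3, 4], f=f, horizon=1000))
```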
3.1 Correctness and approximation guarantee

We first show that the algorithm is correct, namely, that the elements chosen at each round respect the blocking constraints. The correctness is established by the following simple observation:

Fact 3.2. At any $t \in [T]$, all the elements in $A_t$ are available (i.e., not blocked).

In order to prove the competitive guarantee of our algorithm, we first construct a convex-programming (CP)-based (approximate) upper bound on the optimal reward. Although our algorithm never computes an optimal solution to this CP, this step allows us to prove our guarantee, leveraging results on the correlation gap of submodular functions. For $d^{-1} \in \mathbb{R}^k$ such that $(d^{-1})_i = 1/d_i$ for all $i \in [k]$, consider the following formulation based on the concave closure $f^+$ of $f$:
$$\max_{z \in \mathbb{R}^k} \; T \cdot f^+(z) \quad \text{s.t.} \quad 0 \le z \le d^{-1}. \qquad \text{(CP)}$$
In (CP), each variable $z_i$ can be thought of as the fraction of rounds where element $i \in A$ is chosen. Intuitively, the constraints capture the fact that, due to the blocking, each element $i \in A$ can be played at most once every $d_i$ steps. In order to derive (CP), we start from a non-convex integer program (IP) with 0-1 variables $\{x_{i,t}\}_{i \in A,\, t \in [T]}$, each indicating whether element $i \in A$ is used at round $t \in [T]$. The objective is to maximize $\sum_{t \in [T]} \sum_{S \subseteq A} f(S) \prod_{i \in S} x_{i,t} \prod_{i \notin S} (1 - x_{i,t})$, subject to natural blocking constraints. For integral solutions, the above objective is equivalent to $\sum_{t \in [T]} f^+(x_t)$ (where $(x_t)_i = x_{i,t}$) and, thus, the above relaxation is simply the result of averaging over time the variables and constraints of this IP. By using the concavity of $f^+$, we are able to show that (CP) yields an (approximate) upper bound on the optimal solution of RSW, while the approximation becomes exact as $T$ increases.

Lemma 3.3. Let $R^{CP}(T)$ be the optimal solution to (CP) and $OPT(T)$ be the optimal solution over $T$ rounds. We have $R^{CP}(T) \ge OPT(T) - O(d_{\max} f(A))$, where $d_{\max} = \max_{i \in A} \{d_i\}$.

Before we complete the proof of our first main result, we first compute the probability that $i \in A_t$, i.e., that an element $i \in A$ is sampled at round $t \in [T]$:

Fact 3.4. For any $i \in A$ and $t \in [T]$, we have $\mathbb{P}(i \in A_t) = \mathbb{P}(L_{i,t} \cap \mathbb{N} \ne \emptyset) = 1/d_i$.

Proof of Theorem 1.3. Let us denote by $S \sim p$, with $p \in [0, 1]^k$, the random set $S \subseteq A$ where each element $i$ participates in $S$ independently with probability equal to $p_i$. By Fact 3.4 and due to the randomness of the offsets $\{r_i\}_{i \in A}$, we have that $A_t \sim d^{-1}$ for each $t \in [T]$. Let $z^*$ be an optimal solution to (CP). By monotonicity of $f$ and the fact that $z^* \le d^{-1}$, for the expected value of $f(A_t)$ at any round $t \in [T]$, we know that $\mathbb{E}_{A_t \sim d^{-1}}[f(A_t)] \ge \mathbb{E}_{A_t \sim z^*}[f(A_t)]$. Moreover, by definition of the multilinear extension, we have that $\mathbb{E}_{A_t \sim z^*}[f(A_t)] = F(z^*)$, while by Lemma 2.2 (the correlation gap of submodular functions), we have that $F(z) \ge (1 - 1/e) f^+(z)$ for any vector $z \in [0, 1]^k$. By combining the above facts, we can see that
$$R^{IS}(T) = \sum_{t \in [T]} \mathbb{E}_{A_t \sim d^{-1}}[f(A_t)] \;\ge\; (1 - 1/e) \cdot T \cdot f^+(z^*) \;=\; (1 - 1/e) \cdot R^{CP}(T).$$
Therefore, by Lemma 3.3, we can conclude that $R^{IS}(T) \ge (1 - 1/e) \cdot OPT(T) - O(d_{\max} f(A))$.

In Appendix C.2, we provide a $(1 - 1/e)$-hardness result for RSW, thus proving that the guarantee of Theorem 1.3 is asymptotically tight. This result, which holds even for the special case where $d_{\max} = o(T)$ (that is, when the delays are significantly smaller than the time horizon), is proved via a reduction from the SWM problem with identical utilities, in a way that the constructed RSW instance accepts w.l.o.g. solutions of a simple periodic structure.

Theorem 3.5. For any $\epsilon > 0$, there exists no polynomial-time $(1 - 1/e + \epsilon)$-approximation algorithm for the RSW problem, unless P = NP, even in the special case where $d_{\max} = o(T)$.
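As a quick sanity check of Facts 3.2 and 3.4 (again our own illustration, not part of the paper), the snippet below runs the interleaved schedule on one sample path and verifies that consecutive plays of element $i$ are at least $d_i$ rounds apart and that its long-run frequency is $1/d_i$; indeed, with the offsets fixed, each element is selected exactly once in every window of $d_i$ consecutive rounds.

```python
import math, random

delays = [2, 3, 5]
T = 100_000
rng = random.Random(1)
offsets = [rng.random() for _ in delays]

last_play = [None] * len(delays)
plays = [0] * len(delays)
for t in range(1, T + 1):
    A_t = [i for i, (d, r) in enumerate(zip(delays, offsets))
           if math.ceil((t - 1) / d + r) < t / d + r]
    for i in A_t:
        # Fact 3.2: the blocking constraint is never violated
        assert last_play[i] is None or t - last_play[i] >= delays[i]
        last_play[i] = t
        plays[i] += 1

# Fact 3.4: element i is selected with probability 1/d_i per round; on a fixed
# sample path this shows up as an empirical frequency of (almost exactly) 1/d_i
print([round(p / T, 4) for p in plays], [round(1 / d, 4) for d in delays])
```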
4 Matroid Blocking Semi-Bandits

Let $A$ be a set of $k$ arms and $T$ an unknown time horizon. At any round $t \in [T]$ and for each $i \in A$, a reward $X_{i,t}$ is drawn independently from an unknown distribution with mean $\mu_i$ and bounded support in $[0, 1]$. Let $d_i \in \mathbb{N}_{>0}$ be the known deterministic delay of each arm $i \in A$, and $d_{\max} = \max_{i \in A} \{d_i\}$. At any round $t \in [T]$, the player pulls any subset $A_t$ of the available (i.e., non-blocked) arms, as long as it forms an independent set of a given matroid $\mathcal{M} = (A, \mathcal{I})$. The player only observes the realized reward of each arm she plays and collects their sum. The goal is to maximize the expected cumulative reward collected within $T$ rounds, denoted for an algorithm $\Pi$ by $R^{\Pi}(T) = \mathbb{E}\big[\sum_{t \in [T]} \sum_{i \in A} X_{i,t}\, \mathcal{X}(i \in A^{\Pi}_t)\big]$.

The full-information setting. The following algorithm is the implementation of IS in the special case of the full-information MBB setting, where the mean rewards $\{\mu_i\}_{i \in A}$ are known a priori:

Algorithm 4.1 (INTERLEAVED-GREEDY (IG)). For each arm $i \in A$, let $r_i \sim U[0, 1]$ be a random offset drawn uniformly from $[0, 1]$. At every round $t = 1, 2, \dots$, let $G_t \subseteq A$ be the subset of arms $i \in A$ such that the interval $L_{i,t} = [\frac{t-1}{d_i} + r_i, \frac{t}{d_i} + r_i)$ contains an integer. Greedily compute a maximum independent set $A_t$ of $\mathcal{M}|_{G_t}$ with respect to $\{\mu_i\}_{i \in G_t}$ and play these arms.

The following result is an immediate corollary of Theorem 1.3, given that the value of the greedily computed maximum independent set in $\mathcal{M}|_{G_t}$ corresponds to the weighted rank function $f_{\mathcal{M},\mu}(G_t)$ which, by Lemma 2.3, is monotone submodular:

Theorem 4.2. The expected reward collected by INTERLEAVED-GREEDY for $T$ rounds, $R^{IG}(T)$, is at least $(1 - 1/e) \cdot OPT(T) - O(d_{\max}\, \mathrm{rk}(\mathcal{M}))$, where $OPT(T)$ is the optimal expected reward.

Remark 4.3. The analysis of IG is tight for rank-1 matroids. Indeed, consider $k$ arms, each of delay $k$ and deterministic reward equal to $1$. For $T \to \infty$, the optimal average reward is equal to $1$, simply by playing the arms in a round-robin manner. However, the probability that at least one arm is sampled by IG at some round $t$ is $1 - (1 - 1/k)^k \to 1 - 1/e$ as $k \to \infty$.

The bandit setting and regret analysis. In the setting where the mean rewards are initially unknown, we develop a UCB-based bandit algorithm, INTERLEAVED-UCB (IB). The algorithm is identical to IG, except for the greedy computation of the maximum independent set over the sampled arms, which is now performed using estimates. Specifically, the algorithm maintains, for every $i \in A$ and $t \in [T]$, the following upper estimate of $\mu_i$:
$$\bar{\mu}_{i,t} = \hat{\mu}_{i,T_i(t)} + c_{i,t},$$
where $c_{i,t}$ is a confidence radius of order $\sqrt{\ln(t)/T_i(t)}$, $T_i(t)$ denotes the number of times arm $i$ has been played at the beginning of round $t$, and $\hat{\mu}_{i,T_i(t)}$ denotes the empirical average of the $T_i(t)$ i.i.d. samples from its reward distribution. The term $c_{i,t}$ is the confidence length around $\hat{\mu}_{i,T_i(t)}$ that guarantees that $\bar{\mu}_{i,t}$ lies in $[\mu_i, \mu_i + 2 c_{i,t}]$ with high probability. Note that all the above quantities are random variables depending on the random offsets and the observed reward realizations.
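The following compact Python sketch puts the pieces together for INTERLEAVED-UCB (the reward interface `pull`, the toy instance, and the particular confidence radius, here the standard UCB1 choice $\sqrt{2 \ln t / T_i(t)}$, are assumptions for illustration and not necessarily the exact constants used in the paper's analysis). INTERLEAVED-GREEDY is the same procedure with the true means used in place of the UCB indices.

```python
import math, random

def greedy_basis(candidates, weights, is_independent):
    """Maximum-weight independent set of the restriction to `candidates`,
    built greedily through an independence oracle (cf. Section 2)."""
    basis = set()
    for i in sorted(candidates, key=lambda i: weights[i], reverse=True):
        if is_independent(basis | {i}):
            basis.add(i)
    return basis

def interleaved_ucb(delays, is_independent, pull, horizon, seed=0):
    """Player A: interleaved availability schedule, drawn once and independent of all
    rewards.  Player B: greedy basis of the sampled arms w.r.t. UCB indices."""
    k = len(delays)
    rng = random.Random(seed)
    offsets = [rng.random() for _ in range(k)]
    counts, sums, total = [0] * k, [0.0] * k, 0.0
    for t in range(1, horizon + 1):
        G_t = {i for i in range(k)
               if math.ceil((t - 1) / delays[i] + offsets[i]) < t / delays[i] + offsets[i]}
        # optimistic index: empirical mean plus a confidence radius (unplayed arms come first)
        ucb = [sums[i] / counts[i] + math.sqrt(2 * math.log(t) / counts[i])
               if counts[i] > 0 else float("inf") for i in range(k)]
        for i in greedy_basis(G_t, ucb, is_independent):
            x = pull(i)
            counts[i] += 1
            sums[i] += x
            total += x
    return total

# toy instance: 4 Bernoulli arms, uniform matroid of rank 2
means = [0.9, 0.7, 0.5, 0.3]
reward_rng = random.Random(7)
print(interleaved_ucb(delays=[2, 2, 3, 4],
                      is_independent=lambda S: len(S) <= 2,
                      pull=lambda i: float(reward_rng.random() < means[i]),
                      horizon=5000))
```

Note that the schedule $\{G_t\}$ is fully determined by the offsets and delays, which is exactly the coupling property exploited in the regret analysis below.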
We are interested in upper bounding the $\alpha$-regret, for $\alpha = 1 - \frac{1}{e}$, namely, the difference between $(1 - 1/e) \cdot OPT(T)$ and the expected reward collected by IB. Due to the complex time dynamics, characterizing the optimal expected reward as a function of the instance is hard. However, using Theorem 4.2, we can upper bound $(1 - 1/e) \cdot OPT(T)$ by the expected reward collected by IG (up to an additive term), thus giving:
$$\Big(1 - \frac{1}{e}\Big) OPT(T) - R^{IB}(T) \;\le\; R^{IG}(T) - R^{IB}(T) + O(d_{\max}\, \mathrm{rk}(\mathcal{M})). \qquad (1)$$

By the above inequality, it becomes clear that in order to upper bound the regret, it suffices to bound the difference between the expected reward collected by IG and by IB. This difference not only depends on the reward realizations (through the UCB estimates), but also on the trajectory of sampled arms in each algorithm, which is itself a function of the random offsets. However, by construction of our interleaved scheduling scheme, these offsets are sampled at the initialization phase of each algorithm and are identically distributed. Thus, the trajectories of sampled arms in the two algorithms exhibit a coupled evolution. This allows us to analyze the regret "pointwise", under the assumption that the sequences of sampled arms are identical throughout the time horizon. To make this idea precise, let $r^{\Pi} \in [0, 1]^k$ be the random offsets used and let $\{G^{\Pi}_t(r^{\Pi})\}_{t \in [T]}$ be the sequence of sampled arms of algorithm $\Pi \in \{\text{IG}, \text{IB}\}$. Using (henceforth) $Q$ to denote the randomness due to the reward realizations of the arms, the next lemma gives our pointwise regret bound.

Lemma 4.4. Let $\bar{\mu}_t(S) = \sum_{i \in S} \bar{\mu}_{i,t}$ and $\mu(S) = \sum_{i \in S} \mu_i$. We have
$$R^{IG}(T) - R^{IB}(T) \;=\; \mathbb{E}_{r \sim U[0,1]^k,\, Q}\Bigg[ \sum_{t \in [T]} \Bigg( \max_{S \subseteq G_t(r),\, S \in \mathcal{I}} \{\mu(S)\} \;-\; \mu\Big( \operatorname*{arg\,max}_{S \subseteq G_t(r),\, S \in \mathcal{I}} \{\bar{\mu}_t(S)\} \Big) \Bigg) \Bigg].$$

Thus, w.l.o.g., we focus on the case where the sequences of sampled arms are identical. Let $\mathcal{E}_r$ denote the event that both algorithms, IG and IB, sample the same offset vector $r$, namely, $r^{IG} = r^{IB} = r$. Assuming that $\mathcal{E}_r$ holds for some $r \in [0, 1]^k$, let $\{G_t\}_{t \in [T]} = \{G_t(r)\}_{t \in [T]}$ be the sequence of sampled arms, common to both algorithms. Clearly, IB accumulates regret only when it plays independent sets of arms that are suboptimal w.r.t. the true means, i.e., when $\mu(A^{IB}_t) < \mu(A^{IG}_t)$ for some $t \in [T]$. We assume w.l.o.g. that the arms are indexed in decreasing order of mean rewards and that these mean rewards are distinct. We now formally define the gaps related to our analysis:

Definition 4.5 (Gaps). For any subset $S \subseteq A$ and reward vector $\nu \in \mathbb{R}^k$, we define
$$\Delta_S(\nu) \;=\; \max_{I \in \mathcal{I},\, I \subseteq S} \{\mu(I)\} \;-\; \mu\Big( \operatorname*{arg\,max}_{B \in \mathcal{I},\, B \subseteq S} \{\nu(B)\} \Big).$$
Moreover, let $\Delta_{i,j} = \mu_i - \mu_j$ be the standard suboptimality gap between two arms $i, j \in A$.

By Lemma 4.4 and assuming that the event $\mathcal{E}_r$ holds for some $r$, we are interested in bounding the expectation of $\sum_{t \in [T]} \Delta_{G_t(r)}(\bar{\mu}_t)$ w.r.t. the reward realizations. The next step is to decompose the suboptimality of IB by noticing that both algorithms play, at each round $t \in [T]$, a basis of $\mathcal{M}|_{G_t}$ and thus $|A^{IG}_t| = |A^{IB}_t|$. We use the following fundamental property of matroids:

Theorem 4.6 (Strong Basis Exchange, Corollary 39.12a in [42]). Let $\mathcal{M} = (A, \mathcal{I})$ be a matroid and $I_1, I_2 \in \mathcal{I}$ be two independent sets such that $|I_1| = |I_2|$. Then, there exists a bijection $\sigma : I_1 \to I_2$, such that for any $i \in I_1$ the set $I_1 - i + \sigma(i)$ is an independent set of $\mathcal{M}$.

Let $\sigma_t : A^{IB}_t \to A^{IG}_t$, for each $t \in [T]$, be the bijection described in Theorem 4.6 with respect to the sets $A^{IB}_t$ and $A^{IG}_t$, and let $\sigma^{-1}_t$ be its inverse mapping. Note that for any arm $i \in A^{IB}_t \cap A^{IG}_t$ we can assume w.l.o.g. that $\sigma_t(i) = i$. Notice, further, that under the event $\mathcal{E}_r$, the bijections $\{\sigma_t\}_{t \in [T]}$ are still random variables that depend on the observed realizations.

Lemma 4.7. Under the event $\mathcal{E}_r$ and at any time $t \in [T]$, we have $\Delta_{G_t}(\bar{\mu}_t) = \sum_{j \in A^{IB}_t} \Delta_{\sigma_t(j),\, j}$.

Conditioned on the fact that both algorithms operate on the same sequence $\{G_t\}_{t \in [T]}$ of sampled arms, Lemma 4.7 allows us to decompose the suboptimality gap $\Delta_{G_t}(\bar{\mu}_t)$ of each round $t \in [T]$ into simpler gaps of the form $\Delta_{i,j}$ between any arms $i \in A^{IG}_t$ and $j \in A^{IB}_t$ that are perfectly matched according to the bijection $\sigma_t$, namely, $\sigma_t(j) = i$.
Assuming that the event $\{\sigma_t(j) = i\}$ directly implies that $i \in A^{IG}_t$ and $j \in A^{IB}_t$, we can further upper bound the regret as
$$\sum_{t \in [T]} \Delta_{G_t}(\bar{\mu}_t) \;\le\; \sum_{t \in [T]} \sum_{i < j} \Delta_{i,j}\, \mathcal{X}(\sigma_t(j) = i).$$
The above inequality allows us to study the regret attributed to each arm independently, using more standard arguments for UCB-based algorithms in combination with Theorem 4.6. Specifically, for every pair of arms $i, j \in A$ with $i < j$ (thus, $\Delta_{i,j} > 0$), we define a threshold $\ell_{i,j}$ with the following key property: after IB exchanges arm $j$ for arm $i = \sigma_t(j)$ more than $\ell_{i,j}$ times, due to insufficient exploration, it has collected enough samples to infer that $\mu_j < \mu_i$ with high probability.

Lemma 4.8. Let $\ell_{i,j}$ be a threshold of order $\ln(T)/\Delta_{i,j}^2$ for any $i < j$. Under event $\mathcal{E}_r$ and for any arm $j > 1$, we have
$$\mathbb{E}_Q\Big[ \sum_{t \in [T]} \sum_{i < j} \Delta_{i,j}\, \mathcal{X}\big(\sigma_t(j) = i,\ T_j(t) \le \ell_{i,j}\big) \Big] \;\le\; \frac{16}{\Delta_{j-1,j}} \ln(T) \qquad \text{(under-sampled regret)} \qquad (2)$$
$$\mathbb{E}_Q\Big[ \sum_{t \in [T]} \sum_{i < j} \Delta_{i,j}\, \mathcal{X}\big(\sigma_t(j) = i,\ T_j(t) > \ell_{i,j}\big) \Big] \;=\; O\Big( \sum_{i < j} \Delta_{i,j} \Big) \qquad \text{(sufficiently sampled regret)} \qquad (3)$$

Proof sketch of Theorem 1.4. By inequality (1) and Lemma 4.4, in order to bound the regret of IB, it suffices to upper bound the difference between $R^{IG}(T)$ and $R^{IB}(T)$, conditioned on the fact that both algorithms use exactly the same offset vector $r$ and, thus, operate on the exact same sequence of sampled arms, denoted by $\{G_t\}_{t \in [T]}$. By construction, IG plays at any round $t \in [T]$ a basis of $\mathcal{M}|_{G_t}$ of maximum expected reward, while IB plays a basis of $\mathcal{M}|_{G_t}$ that is maximum with respect to the estimates $\{\bar{\mu}_{i,t}\}_{i \in A}$. By Theorem 4.6, we can consider a perfect matching between exchangeable arms of $A^{IG}_t$ and $A^{IB}_t$ and, thus, decompose the regret into suboptimality gaps between individual arms. Then, using Lemma 4.8, we can upper bound the expected regret due to the fact that IB erroneously plays arm $j$ instead of arm $i$, when $\Delta_{i,j} > 0$. The above analysis culminates in a gap-dependent regret upper bound of the form
$$\Big(1 - \frac{1}{e}\Big) OPT(T) - R^{IB}(T) \;\le\; O\Big( \sum_{j > 1} \frac{\ln(T)}{\Delta_{j-1,j}} \;+\; \sum_{i < j} \Delta_{i,j} \Big) \;+\; O(d_{\max}\, \mathrm{rk}(\mathcal{M})) \qquad \text{(gap-dependent regret)}.$$
In order to derive a gap-independent regret bound, we partition the gaps into small and large and notice that any pair of arms $i, j \in A$ whose gap $\Delta_{i,j}$ falls below a threshold of order $\sqrt{\ln(T)/T}$ cannot contribute more than $O(\sqrt{T \ln(T)})$ loss in the regret.

Conclusion and Further Directions

We explore the effect of action-reward dependencies in the combinatorial MAB setting by introducing and studying the MBB problem. After relating the problem to RSW, we provide a $(1 - 1/e)$-approximation for its full-information case, based on the technique of interleaved scheduling. Importantly, our technique is oblivious to the reward distributions of the arms, a fact that allows us to provide regret bounds of optimal dependence in $T$ when these distributions are initially unknown. We believe that this idea could be further applied to different classes of (combinatorial) non-stationary bandits, other than blocking bandits. Our work leaves behind numerous interesting questions. By exhaustive search over $O(1)$-periodic schedules, one can construct a PTAS for the (asymptotic) MBB problem, assuming constant $\mathrm{rk}(\mathcal{M})$ and $\{d_i\}_{i \in [k]}$. It remains an open question, however, whether the $(1 - 1/e)$-approximation is the best possible in general. We remark that the hardness of MBB cannot solely rely on an argument similar to Theorem 3.5, since the welfare maximization problem for the class of gross substitutes, which includes weighted matroid rank functions, is easy [35]. Finally, it is easy to show that our algorithm gives an $O(1)$-approximation for the case of stochastic delays. Whether we can recover a $(1 - 1/e)$-approximation in this case is an interesting open question.
Acknowledgements

The authors would like to thank an anonymous reviewer of a previous version of this work for an unusually thoughtful and helpful review, which aided us in improving the document, in particular for pointing out the idea of correlated rounding. Further, the authors would like to thank Jannik Matuschke for noticing that the weighted matroid rank function falls into the class of gross substitutes.

Funding Transparency Statement

This research was partially supported by NSF Grant 2019844. In addition, this research was sponsored by the Army Research Office and was accomplished under Cooperative Agreement Number W911NF19-2-0333. The views and conclusions contained in this document are those of the authors and should not be interpreted as representing the official policies, either expressed or implied, of the Army Research Office or the U.S. Government. The U.S. Government is authorized to reproduce and distribute reprints for Government purposes notwithstanding any copyright notation herein.

References

[1] A. Atsidakou, O. Papadigenopoulos, S. Basu, C. Caramanis, and S. Shakkottai. Combinatorial blocking bandits with stochastic delays. In Proceedings of the 38th International Conference on Machine Learning, volume 139 of Proceedings of Machine Learning Research, pages 404 413. PMLR, 18 24 Jul 2021.
[2] P. Auer, N. Cesa-Bianchi, and P. Fischer. Finite-time analysis of the multiarmed bandit problem. Mach. Learn., 47(2 3):235 256, May 2002.
[3] A. Badanidiyuru, R. Kleinberg, and A. Slivkins. Bandits with knapsacks. J. ACM, 65(3), March
[4] A. Bar-Noy, R. Bhatia, J. Naor, and B. Schieber. Minimizing service and operation costs of periodic scheduling. Mathematics of Operations Research, 27(3):518 544, 2002.
[5] A. Bar-Noy, R. E. Ladner, and T. Tamir. Windows scheduling as a restricted version of bin packing. ACM Trans. Algorithms, 3(3):28 es, August 2007.
[6] S. Basu, O. Papadigenopoulos, C. Caramanis, and S. Shakkottai. Contextual blocking bandits. In International Conference on Artificial Intelligence and Statistics, pages 271 279. PMLR, 2021.
[7] S. Basu, R. Sen, S. Sanghavi, and S. Shakkottai. Blocking bandits. In Advances in Neural Information Processing Systems (NeurIPS) 32, pages 4785 4794. Curran Associates, Inc., 2019.
[8] S. Bubeck and N. Cesa-Bianchi. Regret analysis of stochastic and nonstochastic multi-armed bandit problems. Machine Learning, 5(1):1 122, 2012.
[9] G. Calinescu, C. Chekuri, M. Pál, and J. Vondrák. Maximizing a submodular set function subject to a matroid constraint (extended abstract). In Matteo Fischetti and David P. Williamson, editors, Integer Programming and Combinatorial Optimization, pages 182 196, Berlin, Heidelberg, 2007. Springer Berlin Heidelberg.
[10] L. Cella and N. Cesa-Bianchi. Stochastic bandits with delay-dependent payoffs. In Silvia Chiappa and Roberto Calandra, editors, Proceedings of the Twenty Third International Conference on Artificial Intelligence and Statistics, volume 108 of Proceedings of Machine Learning Research, pages 1168 1177, Online, 26 28 Aug 2020. PMLR.
[11] N. Cesa-Bianchi and G. Lugosi. Combinatorial bandits. Journal of Computer and System Sciences, 78(5):1404 1422, 2012. JCSS Special Issue: Cloud Computing 2011.
[12] L. Chen, A. Gupta, and J. Li. Pure exploration of multi-armed bandit under matroid constraints. Proceedings of the 29th Annual Conference on Learning Theory (COLT 2016), 2016.
[13] L. Chen, A. Krause, and A. Karbasi. Interactive submodular bandit. In Advances in Neural Information Processing Systems, volume 30.
Curran Associates, Inc., 2017. [14] W. Chen, W. Hu, F. Li, J. Li, Y. Liu, and P. Lu. Combinatorial multi-armed bandit with general reward functions. In Proceedings of the 30th International Conference on Neural Information Processing Systems, NIPS 16, page 1659 1667, Red Hook, NY, USA, 2016. Curran Associates Inc. [15] W. Chen, Y. Wang, Y. Yuan, and Q. Wang. Combinatorial multi-armed bandit and its extension to probabilistically triggered arms. J. Mach. Learn. Res., 17(1):1746 1778, January 2016. [16] R. Combes, C. Jiang, and R. Srikant. Bandits with budgets: Regret lower bounds and optimal algorithms. In Proceedings of the 2015 ACM SIGMETRICS International Conference on Measurement and Modeling of Computer Systems, SIGMETRICS 15, page 245 257, New York, NY, USA, 2015. Association for Computing Machinery. [17] R. Combes, M. S. Talebi, A. Proutiere, and M. Lelarge. Combinatorial bandits revisited. In Proceedings of the 28th International Conference on Neural Information Processing Systems - Volume 2, NIPS 15, page 2116 2124, Cambridge, MA, USA, 2015. MIT Press. [18] U. Feige and J. Vondrák. The submodular welfare problem with demand queries. Theory Comput., 6:247 290, 2010. [19] J. C. Gittins. Bandit processes and dynamic allocation indices. Journal of the Royal Statistical Society, Series B, pages 148 177, 1979. [20] D. Golovin and A. Krause. Adaptive submodularity: Theory and applications in active learning and stochastic optimization. J. Artificial Intelligence Research, 2011. [21] S. Guha, K. Munagala, and P. Shi. Approximation algorithms for restless bandit problems. J. ACM, 58(1), December 2010. [22] D. Hausmann and B. Korte. Algorithmic versus axiomatic definitions of matroids, pages 98 111. Springer Berlin Heidelberg, Berlin, Heidelberg, 1981. [23] W. Hoeffding. Probability inequalities for sums of bounded random variables. Journal of the American Statistical Association, 58(301):13 30, 1963. [24] R. Holte, A. Mok, L. Rosier, I. Tulchinsky, and D Varvel. Pinwheel: a real-time scheduling problem. volume 2, pages 693 702 vol.2, 02 1989. [25] T. Jacobs and S. Longo. A new perspective on the windows scheduling problem. Co RR, abs/1410.7237, 2014. [26] S. Khot, R. J. Lipton, E. Markakis, and A. Mehta. Inapproximability results for combinatorial auctions with submodular utility functions. In Proceedings of the First International Conference on Internet and Network Economics, WINE 05, page 92 101, Berlin, Heidelberg, 2005. Springer-Verlag. [27] R. Kleinberg and N. Immorlica. Recharging bandits. In 59th IEEE Annual Symposium on Foundations of Computer Science, FOCS 2018, Paris, France, October 7-9, 2018, pages 309 319, 2018. [28] R. Kleinberg, A. Niculescu-Mizil, and Y. Sharma. Regret bounds for sleeping experts and bandits. Mach. Learn., 80(2 3):245 272, September 2010. [29] N. Korula, V. Mirrokni, and M. Zadimoghaddam. Online submodular welfare maximization: Greedy beats 1/2 in random order. In Proceedings of the Forty-Seventh Annual ACM Symposium on Theory of Computing, STOC 15, page 889 898, New York, NY, USA, 2015. Association for Computing Machinery. [30] B. Kveton, Z. Wen, A. Ashkan, H. Eydgahi, and B. Eriksson. Matroid bandits: Fast combinato- rial optimization with learning. In Proceedings of the Thirtieth Conference on Uncertainty in Artificial Intelligence, UAI 14, page 420 429, Arlington, Virginia, USA, 2014. AUAI Press. [31] B. Kveton, Z. Wen, A. Ashkan, and C. Szepesvári. Combinatorial cascading bandits. 
In Proceedings of the 28th International Conference on Neural Information Processing Systems - Volume 1, NIPS 15, page 1450 1458, Cambridge, MA, USA, 2015. MIT Press. [32] B. Kveton, Z. Wen, A. Ashkan, and C. Szepesvari. Tight Regret Bounds for Stochastic Combinatorial Semi-Bandits. In Guy Lebanon and S. V. N. Vishwanathan, editors, Proceedings of the Eighteenth International Conference on Artificial Intelligence and Statistics, volume 38 of Proceedings of Machine Learning Research, pages 535 543, San Diego, California, USA, 09 12 May 2015. PMLR. [33] T.L Lai and H. Robbins. Asymptotically efficient adaptive allocation rules. Adv. Appl. Math., 6(1):4 22, March 1985. [34] T. Lattimore and C. Szepesvári. Bandit algorithms. preprint, page 28, 2018. [35] R. P. Leme. Gross substitutability: An algorithmic survey. Games Econ. Behav., 106:294 316, [36] V. Mirrokni, M. Schapira, and J. Vondrak. Tight information-theoretic lower bounds for welfare maximization in combinatorial auctions. In Proceedings of the 9th ACM Conference on Electronic Commerce, EC 08, page 70 77, New York, NY, USA, 2008. Association for Computing Machinery. [37] J. G. Oxley. Matroid Theory (Oxford Graduate Texts in Mathematics). Oxford University Press, Inc., New York, NY, USA, 2006. [38] C. H. Papadimitriou and J. N. Tsitsiklis. The complexity of optimal queuing network control. Mathematics of Operations Research, 24(2):293 305, 1999. [39] C. Pike-Burke and S. Grunewalder. Recovering bandits. In Advances in Neural Information Processing Systems 32, pages 14122 14131. 2019. [40] G. C. Robinson and D. J. A. Welsh. The computational complexity of matroid properties. Mathematical Proceedings of the Cambridge Philosophical Society, 87(1):29 45, 1980. [41] K. A. Sankararaman and A. Slivkins. Combinatorial semi-bandits with knapsacks. In AISTATS, [42] A. Schrijver. Combinatorial Optimization - Polyhedra and Efficiency. Springer, 2003. [43] J. Sgall, H. Shachnai, and T. Tamir. Periodic scheduling with obligatory vacations. Theoretical Computer Science, 410(47):5112 5121, 2009. [44] A. Slivkins. Dynamic ad allocation: Bandits with budgets. Ar Xiv, abs/1306.0155, 2013. [45] C. Tekin and M. Liu. Online learning of rested and restless bandits. IEEE Transactions on Information Theory, 58(8):5588 5611, 2012. [46] W. R. Thompson. On the likelihood that one unknown probability exceeds another in view of the evidence of two samples. Biometrika, 25(3/4):285 294, 1933. [47] J. Vondrak. Optimal approximation for the submodular welfare problem in the value oracle model. In Proceedings of the Fortieth Annual ACM Symposium on Theory of Computing, STOC 08, page 67 74, New York, NY, USA, 2008. Association for Computing Machinery. [48] Q. Wang and W. Chen. Improving regret bounds for combinatorial semi-bandits with proba- bilistically triggered arms and its applications. In Advances in Neural Information Processing Systems, pages 1161 1171, 2017. [49] P. Whittle. Restless bandits: activity allocation in a changing world. Journal of Applied Probability, 25(A):287 298, 1988.