# Bandit Task Assignment with Unknown Processing Time

Shinji Ito (NEC Corporation, RIKEN AIP, i-shinji@nec.com); Daisuke Hatano (RIKEN AIP, daisuke.hatano@riken.jp); Hanna Sumita (Tokyo Institute of Technology, sumita@c.titech.ac.jp); Kei Takemura (NEC Corporation, kei_takemura@nec.com); Takuro Fukunaga (Chuo University, fukunaga.07s@g.chuo-u.ac.jp); Naonori Kakimura (Keio University, kakimura@math.keio.ac.jp); Ken-ichi Kawarabayashi (National Institute of Informatics, The University of Tokyo, k_keniti@nii.ac.jp)

37th Conference on Neural Information Processing Systems (NeurIPS 2023).

This study considers a novel problem setting, referred to as bandit task assignment, that incorporates the processing time of each task into the bandit setting. In this problem setting, a player sequentially chooses a set of tasks to start so that the set of processing tasks satisfies a given combinatorial constraint. The reward and processing time of each task follow unknown distributions, the values of which are revealed only after the task has been completed. The problem generalizes the stochastic combinatorial semi-bandit problem and the budget-constrained bandit problem. For this problem setting, we propose an algorithm based on upper confidence bounds (UCB) combined with a phased-update approach. The proposed algorithm admits a gap-dependent regret upper bound of O(MN(1/Δ) log T) and a gap-free regret upper bound of O(√(MNT)), where N is the number of tasks, M is the maximum number of tasks run at the same time, T is the time horizon, and Δ is the gap between the expected per-round rewards of the optimal and best suboptimal sets of tasks. These regret bounds nearly match lower bounds.

1 Introduction

This paper introduces a new model of sequential decision-making that we refer to as the bandit task assignment problem. The goal of this model is to determine which tasks to perform sequentially so that the total reward is maximized, while estimating the distributions of rewards and processing times for each task. For each round t = 1, 2, . . . , T, we choose a set At of tasks to start. Each task i in At will be completed in the (t + cti)-th round, and we then obtain the reward rti. The processing time cti and the reward rti follow unknown distributions independently for all t, and they are observed only when the task has been completed, i.e., only bandit feedback is available. In all rounds, the set of processing tasks is required to satisfy a given combinatorial constraint. That is, in each round, the player can start a set of new tasks that, together with the tasks still in progress, satisfies the given constraint. The performance of the algorithm is evaluated by means of the regret RT, which is defined as the difference between the cumulative rewards obtained with an optimal policy and with the algorithm.

Figure 1: An example of a feasible task assignment. (figure omitted)

Our model allows an arbitrary constraint on the set of tasks, expressed by A ⊆ {0, 1}^N, where N is the number of tasks. Each vector a = (a1, a2, . . . , aN) ∈ A is interpreted as a feasible set of tasks A = {i ∈ [N] | ai = 1}, where [N] := {1, 2, . . . , N}. Figure 1 illustrates an example of a feasible task assignment for the case of A = {a ∈ {0, 1}^3 | ‖a‖1 ≤ 2}, in which we choose which tasks to start from three candidates so that the number of processing tasks does not exceed two. Each arrow in this figure represents the interval of time needed to process a task.
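To make the feasibility constraint concrete, the following is a minimal sketch written by us (not code from the paper): it enumerates the action set A_t = {a ∈ A | a + b_t ∈ A} of task sets that may be started when b_t records the tasks still in progress, for the running example A = {a ∈ {0,1}^3 | ‖a‖1 ≤ 2}.

```python
# Minimal sketch (ours, not from the paper): given the constraint family A (here the
# running example A = {a in {0,1}^3 : ||a||_1 <= 2}) and the in-progress vector b_t,
# enumerate the sets of tasks that may be started at round t, i.e.,
# A_t = {a in A : a + b_t in A}.
from itertools import product

def feasible(a, max_concurrent=2):
    return sum(a) <= max_concurrent

def available_actions(b_t, n=3):
    actions = []
    for a in product((0, 1), repeat=n):
        combined = tuple(ai + bi for ai, bi in zip(a, b_t))
        # a task already in progress cannot be restarted, and a + b_t must stay in A
        if feasible(a) and all(x <= 1 for x in combined) and feasible(combined):
            actions.append(a)
    return actions

# As in Figure 1: when Tasks 1 and 2 are still processing, no new task can be started.
print(available_actions((1, 1, 0)))   # [(0, 0, 0)]
# Once Task 1 has completed, one new task may be started.
print(available_actions((0, 1, 0)))   # e.g., (0, 0, 0), (0, 0, 1), (1, 0, 0)
```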
In this example, at round 2, two tasks (Tasks 1 and 2) are still processing, and hence, we cannot start any new task. At the beginning of round 3, we observe that Task 1 is completed, and it is then possible to start one new task. We note that the processing time cti and the reward rti of task i started at time t are revealed only at the completion time (t + cti).

The model in this paper is a common generalization of stochastic combinatorial semi-bandit problems [25, 24, 34] and budget-constrained bandit problems [20, 15, 12, 37]. In fact, if the processing time cti is always equal to 1 for every task i, the player can choose any set in A in every round and can observe the feedback immediately afterward, which coincides with the stochastic combinatorial semi-bandit problem. On the other hand, if we suppose that A = {a ∈ {0, 1}^N | ‖a‖1 ≤ 1}, i.e., if only a single task can be performed in every round, the problem corresponds to stochastic N-armed bandits with a budget constraint [20, 15]. Indeed, by regarding cti as the cost of choosing arm i, we can interpret the model as a problem with the budget constraint that the cumulative cost is at most T.

We note that our problem setting is similar to blocking bandits [8], in which each chosen arm is blocked, i.e., is unavailable for a certain period of time after it is chosen. However, the two settings differ in the constraints on the actions in each round. For example, consider A = {a ∈ {0, 1}^N | ‖a‖1 ≤ K}. Then, in the bandit task assignment problem, we cannot start a new task when K tasks are already running. On the other hand, in the blocking bandit problem, the player can choose any K arms that are not blocked. Blocking bandits have been extended to contextual bandit models [9], semi-bandit models [3, 28], and adversarial models [10] as well.

Applications of our model include the following examples:

Example 1 (Matching constraint). We consider the problem of assigning L jobs to K workers. Let V1 and V2 be the sets of workers and jobs, respectively. Let E ⊆ V1 × V2 be a set of worker-job pairs such that (u, v) ∈ E if and only if worker u is qualified to do job v. In each round, we choose a subset M of E so that no two elements in M share the same worker or job, that is, M must satisfy the matching constraint for the bipartite graph G = (V1 ∪ V2, E). Each job takes processing time to complete, and, when a job is assigned to a worker, the job-worker pair is occupied until the job is completed. Thus the occupied job-worker pairs have to be included in M. We assume that the time required to complete a job and the quality of the results (rewards) follow unknown distributions that depend on the job-worker pair. This problem can be regarded as a special case of the bandit task assignment problem with N = |E|. Choosing at ∈ A corresponds to assigning job v to worker u for each e = (u, v) with ate = 1, and the worker u and the job v are then occupied and unavailable until the task is completed. It is worth mentioning that the matching constraint is often the subject of consideration in the context of combinatorial semi-bandits [18, 30].

Example 2 (Knapsack constraint). Consider the situation where we want to perform tasks (e.g., of data processing) under resource constraints (e.g., computational resources such as CPUs, GPUs, and memory). We assume that the processing time of each task varies stochastically and is not revealed until it is completed. Formally, we are given N tasks and K resources.
Assume that each task i ∈ [N] consumes b_{ki} ∈ R≥0 units of resource k ∈ [K] per unit time, and that the total amount of resource k available per unit time is fk. Then the player can choose a set of tasks from A = {a ∈ {0, 1}^N | bk^⊤ a ≤ fk (k ∈ [K])}, where bk = (b_{k1}, b_{k2}, . . . , b_{kN})^⊤ ∈ R^N≥0 for each resource k. Note that the set of all running tasks must satisfy the constraint given by A, which means that we cannot use resources occupied by tasks in progress and we are restricted to choosing a set of tasks that can be executed with the remaining resources. We note that the class of problems with a single resource (K = 1) includes the budget-constrained multi-armed bandit [20, 15] as a special case.

Example 3 (Matroid constraint [35, 17]). If A consists of the indicator vectors of independent sets of a matroid with ground set [N], we call this a matroid constraint. Matroid constraints capture a variety of problem settings of practical importance, including the situation in which the number of tasks that can be executed simultaneously is limited to at most K, i.e., A = {a ∈ {0, 1}^N | ‖a‖1 ≤ K}. This special case is called a uniform matroid.

1.1 Contribution

Our main contribution is to present an algorithm that achieves a regret bound of O((Cu/Cl^2) MN (1/Δ + Cu) ln T), where Cu and Cl are upper and lower bounds on the processing times cti, M is the maximum of ‖a‖1 over a ∈ A, and Δ > 0 is the gap between the expected per-round rewards of the optimal and the best suboptimal actions. The proposed algorithm also enjoys a gap-free regret bound of O((1/Cl) √(Cu MNT ln T)). Our contributions include a nearly tight lower bound as well, i.e., we show that any algorithm suffers a regret of at least Ω((1/Cl) √(Cu MNT)). This indicates that the proposed algorithm achieves nearly optimal regret bounds.

To design our algorithm, we first show that a nearly optimal policy can be characterized by the expected per-round rewards of the tasks (Proposition 2.2 in Section 2). This implies that, given estimates of the expected per-round rewards and a linear optimization oracle for A, we can compute asymptotically optimal actions. This is in contrast to blocking bandits, in which it is computationally hard to maximize the total reward even if all the distributions are known, as presented, e.g., in Corollary 3.2 by Basu et al. [8]. That is why, for blocking bandits, existing algorithms can achieve only approximate regret bounds, i.e., the performance is only guaranteed to be (close to) a constant-factor approximation of the optimal strategy.

Proposition 2.2 suggests balancing exploration and exploitation with the aid of upper confidence bounds (UCB) on the expected per-round reward of each task, which can be computed efficiently. Such a UCB-based approach is similar to CombUCB1 for combinatorial semi-bandits [25]. CombUCB1, however, does not directly apply to our problem setting due to the stochastically variable processing times. In the bandit task assignment problem, the set of tasks that the player can start in each round changes depending on the set of tasks still in progress, which restricts the player's decisions. To address this issue, we introduce a phased-update approach, in which we divide the entirety of the rounds [T] = {1, 2, . . . , T} into segments of appropriate length. At the beginning of each segment s, we compute a set of tasks A_s based on the UCB for the expected per-round rewards; a minimal sketch of this phase structure is given right below. We note that this computation can be done independently of the state of the running tasks.
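As a concrete illustration of the phase structure just described, the following is a minimal sketch written by us (not the paper's code) for the uniform-matroid constraint A = {a | ‖a‖1 ≤ M}. The index function ucb_index is a placeholder (random scores); the proposed algorithm instead uses the UCB index q̂i(t) defined in Eq. (3), and the segment-length rule ls = Cl · min_{i∈A_s} Ti(ts) + 2Cu from Section 3 (shown here with Cl = 1).

```python
# Illustrative sketch (ours, not from the paper) of the phased-update scheme for the
# uniform-matroid case A = {a : ||a||_1 <= M}: at the start of each phase, pick the
# M tasks with the largest index values; within the phase, restart exactly those
# tasks as they complete, choosing nothing while leftover tasks are still running.
import random

random.seed(1)
N, M, T, C_u = 4, 2, 60, 6
remaining = [0] * N        # remaining processing time of each task (0 = idle)
completions = [1] * N      # T_i: completion counts (pretend initialization is done)

def ucb_index(i, t):       # placeholder index; the paper's algorithm uses Eq. (3)
    return random.random()

t = 1
while t <= T:
    scores = [ucb_index(i, t) for i in range(N)]
    chosen = sorted(range(N), key=lambda i: scores[i], reverse=True)[:M]  # oracle call
    length = 1 * min(completions[i] for i in chosen) + 2 * C_u            # l_s with C_l = 1
    for _ in range(length):
        if t > T:
            break
        in_progress = {i for i in range(N) if remaining[i] > 0}
        if in_progress <= set(chosen):              # b_t <= a_s: leftovers are gone
            for i in chosen:
                if remaining[i] == 0:               # restart completed tasks of this phase
                    remaining[i] = random.randint(1, C_u)
        for i in range(N):
            if remaining[i] > 0:
                remaining[i] -= 1
                if remaining[i] == 0:
                    completions[i] += 1             # task i finishes at the start of round t+1
        t += 1
```

Within a phase, no new task is started until the leftover tasks of the previous phase have finished (the b_t ≤ a_s check), which is exactly the source of the idle time discussed next.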
Then, in the s-th segment, we basically continue to perform the same set of tasks A_s. One main technical difficulty is how to determine the length ls of segment s. At the start of a new segment s, the tasks in A_s cannot be executed until all tasks performed in the previous segment have been completed, i.e., every switching point between segments causes idle time with non-zero probability. Therefore, the larger the number of segments (i.e., the shorter the length of each segment), the greater the loss due to such waiting time. In particular, if the segment length is O(C) for a constant C, we cannot avoid a linear regret of Ω(T/C) due to idle time. On the other hand, setting too long a segment also has a negative impact on performance: the longer each segment, the less frequently the set A_s is updated, which may degrade the efficiency of exploration and exploitation. To address this dilemma, this study designs a rule for determining the segment length ls by which these potential losses are well balanced and the overall regret is bounded as desired.

Another nontrivial technique in the proposed algorithm is the use of Bernstein-type confidence bounds [4, 27] for the processing times. Thanks to this, we can construct a tight UCB estimator for the expected per-round reward of each task i, with a width depending on the mean c̄i and variance σi^2 of the processing time. This is essential to achieve the nearly optimal O((1/Cl) √(Cu MNT))-regret bound. In fact, as we mention in Remark 4.2, an algorithm with standard confidence bounds leads to regret upper bounds with additional Cu/Cl factors, which do not match the lower bound.

Summary of contributions. The contributions of this paper are summarized in the following four points: (i) We propose a novel problem setting, bandit task assignment, which simultaneously incorporates combinatorial constraints among tasks, the (unknown) processing time of each task, and the exploration-exploitation trade-off. The model includes various practical applications as described in Examples 1, 2 and 3. (ii) We show that a nearly optimal policy can be characterized via the expected per-round rewards, and such a policy can be computed efficiently given only a linear optimization oracle. This contrasts with the computational difficulty in blocking bandits. (iii) We provide an algorithm with nearly optimal regret bounds. We handle the difficulties arising from the combinatorial constraints together with processing times by the phased-update approach. (iv) We present a regret lower bound that matches the regret bound of the proposed algorithm, ignoring logarithmic factors. This means that the proposed algorithm achieves nearly optimal performance in terms of worst-case analysis.

2 Problem Setup

Let N be the number of tasks. The player is given a family of feasible sets of tasks, expressed by A ⊆ {0, 1}^N. Here, each binary vector a in {0, 1}^N corresponds to a set of tasks, and a ∈ A means that the set of tasks A = {i ∈ [N] | ai = 1} can be executed simultaneously. We assume that A is closed under inclusion, i.e., a ∈ A and b ≤ a together imply b ∈ A. Note here that, for any vectors a = (a1, . . . , aN)^⊤, b = (b1, . . . , bN)^⊤ ∈ R^N, the notation b ≤ a means that bi ≤ ai holds for all i ∈ {1, . . . , N}. We denote M = max_{a∈A} ‖a‖1. In each round t, the player chooses a set of tasks at from A. The chosen set has to exclude all the processing tasks not yet completed.
More precisely, at is constrained to satisfy at + bt ∈ A, where bt represents the set of tasks still in progress at the t-th round (see (1) below). Each task i with ati = 1 is then started and will be completed at the beginning of the (t + cti)-th round, which yields the reward rti. The reward rti and the processing time cti are revealed only after the task is completed. We assume that rti and cti are bounded so that rti ∈ [0, 1] and cti ∈ {Cl, Cl + 1, . . . , Cu}, where Cl and Cu are integers satisfying 1 ≤ Cl ≤ Cu. We also assume that ((rti, cti))_{i=1}^N follows an unknown distribution D over ([0, 1] × {Cl, . . . , Cu})^N, independently for t = 1, 2, . . . , T. Note that rti and cti may be dependent. From the above problem definition, the family At of task sets available at the t-th round is expressed as follows:

bti = Σ_{s=1}^{t−1} asi 1[s + csi > t]  (i ∈ [N]),   At = {a ∈ A | a + bt ∈ A},   (1)

where the vector bt = (bti)_{i=1}^N corresponds to the set of uncompleted tasks started before the t-th round. The goal of the player is to maximize the sum of rewards earned by the T-th round, which can be expressed as Σ_{t=1}^T rt^⊤ at.

Remark 2.1. Stochastic combinatorial semi-bandits [25] and the multi-armed bandit problem with a budget constraint [15] are special cases of the bandit task assignment problem. In fact, if cti = 1 for all t and i, the problem corresponds to a combinatorial semi-bandit problem with action set A. In addition, if A = {a ∈ {0, 1}^N | ‖a‖1 ≤ 1}, the problem corresponds to an N-armed bandit with budget B = T and costs {cti}.

In this problem, we assume that we are able to solve linear optimization over A efficiently. More precisely, we suppose that we are given access to an offline linear optimization oracle that returns a solution â ∈ arg max_{a∈A} q^⊤ a for any input q ∈ R^N≥0. Note here that, for the problems with matching constraints or with matroid constraints (Examples 1 and 3, respectively), such an oracle can be implemented in polynomial time [23, 17]; a minimal sketch for the uniform-matroid case is given below. For Example 2, linear optimization over A is NP-hard, but it can be solved practically fast in many cases with the aid of dynamic programming or integer programming algorithms.

We define the regret RT as the expectation of the difference between the total reward obtained with the optimal proper policy and that obtained with the algorithm by the T-th round, where we call a policy for choosing at proper if at satisfies the constraint defined by (1) and each at depends only on information observed by the beginning of the t-th round, but not on ((rs, cs))_{s=t}^T. Any proper policy can be expressed as a sequence of mappings π = (πt)_{t=1,2,...}, where each πt maps the history ht, which consists of the chosen actions (as)_{s=1}^{t−1} and the feedback obtained by the beginning of the t-th round, to an action at in A. Let Π denote the set of all proper policies. The regret can then be expressed as

RT = max_{π* ∈ Π} E[ Σ_{t=1}^T rt^⊤ π*_t(h*_t) ] − E[ Σ_{t=1}^T rt^⊤ at ],

where (h*_t)_{t=1,2,...} denotes the histories generated by the policy π*. Intuitively, the total reward obtained with the optimal proper policy corresponds to the maximum performance achieved by an agent with unlimited computational resources who knows the distribution D of processing times and rewards. Note that the regret RT is defined on the basis of tasks started by the T-th round. Although it may seem more natural to define the regret by the rewards of tasks completed by the T-th round, the difference between these two alternatives is only O(M/Cl), which is almost negligible as it is independent of T.
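For the uniform-matroid constraint of Example 3, the offline linear optimization oracle assumed above reduces to picking the tasks with the largest scores. The sketch below is our own illustration; the matching and knapsack constraints of Examples 1 and 2 would instead call a maximum-weight-matching solver or an (integer or dynamic) programming routine at the same place.

```python
# Minimal sketch (ours) of the offline linear optimization oracle for the
# uniform-matroid constraint A = {a in {0,1}^N : ||a||_1 <= K} of Example 3:
# argmax_{a in A} q^T a is obtained by taking up to K tasks with the largest
# nonnegative scores.
import numpy as np

def uniform_matroid_oracle(q, K):
    q = np.asarray(q, dtype=float)
    order = np.argsort(-q)              # indices sorted by decreasing score
    a = np.zeros(len(q), dtype=int)
    for i in order[:K]:
        if q[i] > 0:                    # with q >= 0 this only skips zero-value tasks
            a[i] = 1
    return a

print(uniform_matroid_oracle([0.3, 0.1, 0.5, 0.2], K=2))   # -> [1 0 1 0]
```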
We can see that the performance of the optimal proper policy (and therefore also the regret) can be characterized by using the expected per-round rewards. We define r̄i and c̄i to be the expectations of rti and cti, respectively. We also denote qi = r̄i/c̄i. Define the vectors r̄ ∈ [0, 1]^N, c̄ ∈ [Cl, Cu]^N and q ∈ [0, 1/Cl]^N by r̄ = (r̄1, r̄2, . . . , r̄N)^⊤, c̄ = (c̄1, c̄2, . . . , c̄N)^⊤ and q = (q1, q2, . . . , qN)^⊤ = (r̄1/c̄1, r̄2/c̄2, . . . , r̄N/c̄N)^⊤. Note that qi corresponds to the expected per-round reward for running task i. Then, the expected total reward of an optimal policy can be bounded as follows:

Proposition 2.2. For any proper policy π = (πt)_{t=1,2,...} ∈ Π, we have E[ Σ_{t=1}^T rt^⊤ πt(ht) ] ≤ (T + Cu) max_{a∈A} q^⊤ a, where (ht)_{t=1}^T denote the histories generated by π.

Proposition 2.2 can be considered a generalization of Lemma 1 by Ding et al. [15]. From this proposition, the regret can be bounded as follows:

RT ≤ (T + Cu) max_{a∈A} q^⊤ a − E[ Σ_{t=1}^T rt^⊤ at ] ≤ E[ Σ_{t=1}^T (q^⊤ a* − rt^⊤ at) ] + Cu M / Cl,

where we denote a* ∈ arg max_{a∈A} q^⊤ a and the last inequality follows from the facts that qi ≤ 1/Cl holds for every i and that ‖a*‖1 ≤ M.

3 Algorithm

The proposed algorithm employs a phased-update approach, in which we update the policy for each phase, and each phase continues to select the same set of tasks. The procedure starts with the initialization phase (or the zeroth phase), in which we execute each task B = Θ((Cu/Cl) ln T) times. This is required to ensure that the UCB estimators q̂i(t) given in (3) are close enough to the true expected per-round reward qi. Let t1 − 1 denote the round at which this initialization terminates; it is at most O(Cu N B). As the regret per round is at most M/Cl, the total regret in the initialization phase is at most O(Cu M N B / Cl) = O((Cu^2/Cl^2) M N ln T), which turns out to be negligible in the overall regret upper bound.

The s-th phase (s ∈ {1, 2, . . .}) consists of a segment of rounds of length ls, which implies that the first round ts of the s-th phase is expressed as ts = Σ_{u=1}^{s−1} lu + t1. At the beginning of each phase s, we compute a_s ∈ A and the length ls of the s-th phase, the details of which are described later. Then, in the subsequent ls rounds, we continue to choose the set A_s = {i ∈ [N] | a_{si} = 1}. More precisely, in the t-th round, we choose at = a_s − bt if bt ≤ a_s and choose at = 0 otherwise, where we recall that bt is the set of tasks in progress at the t-th round (see (1)). Since a_s ∈ A, at clearly belongs to At. In this procedure, intuitively, we repeatedly restart each task i in A_s just after i is completed, throughout the s-th phase. Note here that all tasks executed in the (s−1)-st phase are completed by the (ts + Cu)-th round, as the processing times are at most Cu, which implies that bt ≤ a_s holds for any t ∈ [ts + Cu, t_{s+1} − 1]. Hence, in any round t ∈ [ts + Cu, t_{s+1} − 1], it holds that bt + at = a_s, which implies that all tasks in A_s are running in such rounds. Furthermore, for each task i ∈ A_s, the number of times task i is completed in the s-th phase is at least (ls − Cu)/Cu (≥ ls/Cu − 2) and at most ls/Cl + 1.

We next describe how to compute a_s. We use upper confidence bounds (UCB) for qi = r̄i/c̄i, the expected per-round reward. In our definition of the UCB, we use the empirical means r̂i(t) and ĉi(t) of the rewards and processing times, the empirical variance V^c_i(t) of the processing times, and the number Ti(t) of times that task i has been completed before the t-th round. We define the UCB q̂i(t) on qi by

q̂i(t) = min{1, r̂i(t) + d^r_i(t)} / max{Cl, ĉi(t) − d^c_i(t)}, where d^r_i(t) = √(1.5 ln t / Ti(t)) and d^c_i(t) = √(3 V^c_i(t) ln t / Ti(t)) + 9(Cu − Cl) ln t / Ti(t).
(3) We would like to note this definition includes an empirical Bernstein bound dc i(t) [4] for the processing time cti, which will turn out to be essential for tight regret upper bounds, as mentioned in Remark 4.2. We then have Pr [|ˆri(t) ri| dr i (t)] 2 t2 , Pr [|ˆci(t) ci| dc i(t)] 4 These can be shown from Azuma Hoeffding inequality and empirical Bernstein s inequality [4], of which details can be found in Section A.4 of the appendix. Hence, with probability 1 6/t2, we have qi ˆqi(t). At the beginning of each phase s, i.e., in the ts-th round, we calculate these UCBs ˆq(ts) = (ˆq1(ts), ˆq2(ts), . . . , ˆq N(ts)) , and then call the linear optimization oracle to find a s arg maxa A ˆq(ts) a. We set the length ls of the s-th phase on the basis of mini A s Ti(ts), where we denote A s = {i [N] | a si = 1}. We set ls = Cl mini A s Ti(ts) + 2Cu. In the analysis, we assume that B is given by B = 90 Cu Cl ln T . We then have Ti(ts) 90Cu ln T/Cl, ls 90Cu ln T, 90Cu N ln T t1 = O C2 u N ln T/Cl , (5) which will be used at several points in our analysis. These conditions and the construction of ls lead to the bound on Ti(ts) as in the lemma below, which will be used in the regret analysis for the proposed algorithm. Lemma 3.1. For any s 1 and i A s, we have ls 2Cu(Ti(ts+1) Ti(ts)) and Ti(ts+1) 4Ti(ts). All omitted proofs are given in the appendix. We also have the following bound on ts: Lemma 3.2. For any s 1, there exists i A s such that Ti(ts+1) (1 + Cl/Cu)Ti(ts). Consequently, we have ts Cl(1 + Cl/Cu)s/N 2 for any s 1. From Lemma 3.2, we have ln ts ln Cl + s N 2 ln 1 + Cl N 2 Cl 2Cu , where the last inequality follows from the fact that ln(1 + x) x/2 for any x [0, 1]. This implies that we have s N 2Cu Cl ln ts + 2 . Hence, the number of phases, denoted by S, is bounded by Cl ln T + 2 + 1 = O Cu Cl N ln T , since the last phase contains T. The overall procedure of the proposed algorithm is summarized in Algorithm 1, which consists of arithmetic operations and offline linear optimization over A. The number of arithmetic operations in each round is bounded by O(N) since the update of each parameter in Step 7 10 of Algorithm 1 can be performed in O(1)-arithmetic operations. Furthermore the number of calls to the offline linear optimization oracle is at most S = O Cu Cl N ln T . In fact, the offline linear optimization oracle is called only at the beginning of each phase. This means that Algorithm 1 is more efficient than standard UCB algorithms for combinatorial semi-bandits algorithms [24, 25] that require O(T) calls to the oracle. The space complexity, other than that required for the offline optimization oracle, is O(N) since the algorithm works by maintaining O(N) parameters. 4 Regret Analysis We denote a arg maxa A q a. Let A = {i [N] | a i = 1} and A = [N] \ A . We define suboptimality gaps a, i and min, respectively, by a = q a q a (a A), i = min a A:ai=1, a>0 a (i [N]), min = min i A i. (6) Algorithm 1 Phased-update UCB algorithm for bandit task assignment Require: A {0, 1}N: a family of feasible task sets. Cl, Cu: lower and upper bounds for the processing time. B: length of initialization phase. Offline linear optimization oracle for A. 1: For each i [N], execute the i s task B times, and let t1 be the round it completed. Set Ti(t1) = B. 2: for s = 1, 2, . . . , do 3: Set ˆqi(ts) by (3). 4: Call the offline linear optimization oracle to find a s arg maxa A ˆq(ts) a, and set A s = {i [N] | a si = 1}. 5: Set ls = Cl mini A s Ti(ts) + 2Cu and ts+1 = ts + ls. 6: for t = ts, ts + 1, . . . 
, ts+1 1 do 7: If bt a s, output at = a s bt, where bt is given by (1). Otherwise, output at = 0. 8: for i = 1, 2, . . . , N do 9: If task i is completed (i.e., bti = 0) and ct ,i and rt ,i are observed (t is the last round the task i is started), update ci(t), ri(t) and V c i (t) by using ct ,i and rt ,i. Set Ti(t + 1) = Ti(t) + 1. 10: end for 11: end for 12: end for We also denote the variance of the processing time cti by σ2 i = E[(cti ci)2]. We note that σ2 i E[(cti Cl)2] (Cu Cl) E[cti Cl] = (Cu Cl)( ci Cl). The goal of this section is to show the following: Theorem 4.1. The regret for Algorithm 1 is bounded as i + C2 u C2 l NM ln T O 1 min + Cu where Ci O 1 ci σ2 i + (Cu Cl)2Cl . Furthermore, for any distribution D, the regret is bounded as RT = O 1 Cl Cu NMT ln T + C2 u C2 l NM ln T . The specific definition of Ci is given in (19) in the appendix. Remark 4.2. Bernstein-type confidence bounds used in (3) are essential to achieve the nearly optimal regret bound of RT = O 1 Cl Cu NMT ln T . In fact, if we employ a standard confidence bound for ci instead of the Bernstein-type one given in (3), the parameter σ2 i in the definition of Ci will be replaced with (Cu Cl)2, which leads a suboptimal regret bound of RT = O q C2 u C3 l T ln T . From Proposition 2.2, the regret can be bounded as follows: t=1 (q a r t at) Cl E [t1] + R(1) T + R(2) T + Cu where we define t=ts (q a q a s) , R(2) T = E t=ts (q a s r t at) In these definitions of R(1) T and R(2) T , the index S represents the phase that includes the T-th round, and we define t S+1 = T + 1 exceptionally, for notational simplicity. Let us next show the upper bounds for R(1) T and R(2) T separately. We will show that i + C2 u C2 l NM , R(2) T = O C2 u C2 l NM ln T . (10) These, together with (8) and (5), prove the gap-dependent regret bound (7) in Theorem 4.1. We here discuss an upper bound on R(1) T while the analysis of R(2) T is given in the supplementary material. We denote As = A s\A . Define di(t) by di(t) = q ci ln t Ti(t), where Ci is defined by (19) in the appendix so that Ci = Θ 1 ci σ2 i + (Cu Cl)2Cl . We then have ˆqi(t) qi(t) di(t) with high probability. In fact, we can show the following lemma using concentration inequalities, of which proof can be found in the appendix. Lemma 4.3. For any s, with a probability at least 1 6N/t2 s, we have a s P i As di(ts). Consequently, we have R(1) T E[ ˆRT ] + O MN Clt1 , where we define ˆRT = PS s=1 ls a s1[Fs] with i As di(ts), a s > 0 . We can show an upper bound on ˆRT in a way similar to what Kveton et al. [25] did, as follows: Lemma 4.4. ˆRT defined in Lemma 4.3 is bounded as E[ ˆRT ] = O M P i A Ci ln T This bound on ˆRT together with Lemma 4.3 and (5) leads to the first part of (10). 4.1 Gap-Free Regret Bound We can obtain the gap-free regret bound of RT = O( q Cu C2 l NMT) in Theorem 4.1 by modifying the analysis of ˆRT . For any ϵ > 0, ˆRT can be bounded as ˆRT PS s=1 ls a s1[Fs, a s > ϵ] + ls a s1[Fs, a s ϵ] = O Cu C2 l MN ln T ϵ + ϵT , where the last inequality follows from the same argument for showing Lemma 4.4. By setting ϵ = q C2 l T , we obtain ˆRT = O 1 Cl Cu MNT ln T . From this, (8), the second part of (10), and Lemma 4.3, we obtain the gap-free regret bound presented in Theorem 4.1. This completes the proof of Theorem 4.1. 4.2 Regret Lower Bound As shown in Theorem 4.1, the proposed algorithm achieves RT = O( 1 Cl Cu NMT). This is tight up to a logarithmic factor in T. In fact, we have the following lower bound: Theorem 4.5. 
For any N, M, and T such that N ≥ M and T = Ω(Cu), there exists a problem instance for which any algorithm suffers regret of RT = Ω((1/Cl) min{√(Cu NMT), MT}).

In the proof of this theorem, we first focus on the case of M = 1. If we consider the case of cti = Cl, our model is equivalent to the N-armed bandit with time horizon T' = Θ(T/Cl), which leads to a lower bound of Ω(√(NT')) = Ω(√(NT/Cl)) with the aid of the lower bound for multi-armed bandit problems given in Theorem 5.1 by Auer et al. [6]. When considering the case of rti = 1 and cti ∈ {Cl, Cu}, we can show a regret lower bound of Ω(((Cu − Cl)/(Cl Cu)) √(Cu NT)) by using the proof technique of Badanidiyuru et al. [7]. Combining these two bounds, we obtain an Ω((1/Cl) √(Cu NT)) lower bound for the case of M = 1. From this result and the technique used in Proposition 2 by Kveton et al. [25], we obtain an Ω((1/Cl) √(Cu MNT)) lower bound for M > 1.

5 Numerical Experiment

We conducted small experiments to evaluate the practical performance of the proposed algorithm using a synthetic dataset. We set the parameters for the dataset as follows: A = {A ⊆ [N] | |A| ≤ M} with (N, M) = (4, 2), T = 10000, Cu = 6, Cl = 1, and r̄ = (0.5, 0.5, 0.5, 0.5). For the expected processing times, we set c̄ = (1.5, 1.5, 2.0, 2.0) as a problem instance with small Δ or c̄ = (1.5, 1.5, 5.0, 5.0) as a problem instance with large Δ. Each cti and rti is generated so that cti − 1 follows a binomial distribution over {0, 1, . . . , Cu − 1} and rti follows a Bernoulli distribution over {0, 1}. For comparison, we have added the results of applying UCB-BV1 by Ding et al. [15] and CombUCB1 by Kveton et al. [24]. More detailed information on the experimental setup is provided in the supplementary material.

Figure 2: Instance with small Δ. Figure 3: Instance with large Δ. (regret curves for UCB-BV1, CombUCB1, and the proposed algorithm; plots omitted)

Figures 2 and 3 show the empirical means of the regret computed over 100 repetitions of independent experiments. We depict 2 standard deviations of the empirical regret by the shaded areas. These results imply that the proposed algorithm performs better experimentally than the other algorithms. We note that UCB-BV1 and CombUCB1 cannot avoid a linear regret, as they choose a set in A and wait for all of its tasks to be completed, which causes idle time with non-zero probability every time. It is also worth mentioning that the empirical performance of the proposed algorithm is much better than Theorem 4.1 predicts. In fact, under the parameter settings of these experiments, the values in Theorem 4.1 can be approximated as (1/Cl) √(Cu NMT ln T) ≈ 2100 and (Cu^2/Cl^2) NM ln T ≈ 2650.

6 Related Work

One of the most relevant lines of work to this study is that on combinatorial semi-bandits, in which a family F ⊆ 2^[N] of subsets of arms [N] is given and the player sequentially chooses At ∈ F and then observes the obtained rewards rti for each i ∈ At. Studies on combinatorial semi-bandits are classified into those on stochastic models [25, 24, 34], in which rt are i.i.d. over t, and those on adversarial models [33, 5], in which {rt} are arbitrary sequences that may be chosen in an adversarial manner. As noted in the introduction, stochastic combinatorial semi-bandits are special cases of bandit task assignment. Stochastic combinatorial semi-bandits are known to admit sublinear regret. Typical approaches to achieve sublinear regret include upper confidence bound (UCB)-type algorithms [25, 24] and Thompson sampling algorithms [34].
Our problem is also relevant to bandit problems with budget constraints. Ding et al. [15] consider a stochastic multi-armed bandit (MAB) problem with a budget constraint, in which the player observes the reward rtit and the cost ctit for the chosen arm it in each round t. The goal of the player is to maximize the cumulative rewards earned before cumulative costs reach a given total budget amount. This can be seen as a special case of our problem. As shown in Ding et al. [15], in the budget-constrained MAB problem, the reward of the optimal policy can be characterized by the expected reward-to-cost. This fact is generalized to our setting in Proposition 2.2, which will be used in our regret analysis. It should be mentioned that the budget-constrained MAB problem has been generalized to bandits with multiple plays [37], which differs from our model, as problems with processing times cannot be reduced to budget-constrained models in general. In addition to their work, there are various generalizations and variants, such as contextual bandit models [36], models with general distributions [11, 12], and models with multiple budgets [2, 31, 7]. While delayed feedback models [13, 38, 3] are similar to our model in that rewards earned will be revealed in later rounds, there are some essential differences. For example, in our model, the selected arm is occupied during the processing time and cannot be selected, while in the delayed feedback model, the arm can be selected while waiting for feedback. The former model can handle situations where the arm corresponds to a labor or computing resource, as in the examples shown in Examples 1 and 2. Phased-update or lazy-update approaches were incorporated in other bandit problems and online learning problems such as a phased version of UCB for MAB (e.g., Exercise 7.5 of the book [26]) and the rarely switching OFUL algorithm for linear stochastic bandits [1]. The motivation for using phased-update approaches is primarily to reduce the computational cost [1, 22], to minimize switching cost [2, 21], or to address batched settings [29, 19]. To our knowledge, our proposed algorithm is the first algorithm that applies such a phased-update approach to combinatorial semi-bandits. It allows us to reduce the number of oracle calls, and more importantly, to achieve the nearly tight regret bounds for the bandit task assignment. Our study reveals new usefulness of a phased-update approach. Another line of research related to our problem is the online matching or online allocation problems with reusable resources [14, 16, 32]. The problem is formalized with a bipartite graph, and we choose edges for sequentially arriving vertices. The model is similar to ours in that each chosen edge will be unavailable for a certain period of time that follows a probability distribution. However, it is different in that the bipartite graph is unknown at first and all the distributions are given. Moreover, those papers adopt competitive ratio analyses rather than regret analyses, i.e., they evaluate the performance compared to a stronger agent who knows the sequence of arriving vertices. Thus, the existing papers are incomparable with our work. As mentioned in the introduction, the problem setting of blocking bandit [8] appears to be similar to ours while the difficulties are significantly different. Considering unconstrained situations can highlight the differences in the problem. 
In the task assignment problem with no constraints, the optimal strategy will be trivial, in which we make each task always processing, i.e., in each round, we start every task that is completed at that round. On the other hand, as the reviewer commented, blocking bandits only allow us to play a single action in each round, which makes the optimal strategy nontrivial. In fact, finding the optimal policy can be an NP-hard problem even if the true distributions of r and c are given. This is an illustrative example of why bandit task assignment can be easier than blocking bandits. 7 Conclusion and limitation In this paper, we introduced a new problem setting that we referred to as bandit task assignment, for which we proposed an algorithm with regret bounds. A limitation of this work is that we need to know the upper and lower bounds Cu, Cl on the processing times, which may not be possible to assume in some real-world applications. To remove this assumption, it would be necessary to modify the framework of the problem. One possible modification would be to allow the player to suspend processing tasks. In this modified problem setup, even in cases where Cu is incorrectly underestimated, it will be possible to immediately abort the task and correct Cu when the underestimation is detected. We expect that this strategy and the appropriate parameter update rules for Cu and Cl allow us to remove the assumption of prior knowledge about processing times. Another future direction is to extend the problem to a decentralized setting, in which multiple agents collaborate to make task assignment decisions. We believe that such an extension of the problem provides insight into real-world applications where multiple entities perform task assignments, such as crowdsourcing and federated learning. Acknowledgments We deeply appreciate many comprehensive comments and feedback by anonymous reviewers. HS was supported by JSPS KAKENHI Grant Numbers JP17K12646, JP21K17708, and JP21H03397, Japan. TF was supported by JSPS KAKENHI Grant Numbers JP20H05965, JP21K11759, and JP21H03397, Japan. NK was supported by JSPS KAKENHI Grant Numbers JP22H05001, JP20H05795, and JP21H03397, Japan. KK was supported by JSPS KAKENHI Grant Numbers JP22H05001 and JP20A402, Japan. [1] Y. Abbasi-Yadkori, D. Pál, and C. Szepesvári. Improved algorithms for linear stochastic bandits. In Advances in Neural Information Processing Systems, pages 2312 2320, 2011. [2] R. Agrawal, M. Hegde, D. Teneketzis, et al. Multi-armed bandit problems with multiple plays and switching cost. Stochastics and Stochastic reports, 29(4):437 459, 1990. [3] A. Atsidakou, O. Papadigenopoulos, S. Basu, C. Caramanis, and S. Shakkottai. Combinatorial blocking bandits with stochastic delays. In International Conference on Machine Learning, pages 404 413. PMLR, 2021. [4] J.-Y. Audibert, R. Munos, and C. Szepesvári. Tuning bandit algorithms in stochastic environments. In International conference on algorithmic learning theory, pages 150 165. Springer, 2007. [5] J.-Y. Audibert, S. Bubeck, and G. Lugosi. Regret in online combinatorial optimization. Mathematics of Operations Research, 39(1):31 45, 2013. [6] P. Auer, N. Cesa-Bianchi, Y. Freund, and R. E. Schapire. The nonstochastic multiarmed bandit problem. SIAM Journal on Computing, 32(1):48 77, 2002. [7] A. Badanidiyuru, R. Kleinberg, and A. Slivkins. Bandits with knapsacks. Journal of the ACM (JACM), 65(3):1 55, 2018. [8] S. Basu, R. Sen, S. Sanghavi, and S. Shakkottai. Blocking bandits. 
Advances in Neural Information Processing Systems, 32:4785 4794, 2019. [9] S. Basu, O. Papadigenopoulos, C. Caramanis, and S. Shakkottai. Contextual blocking bandits. In International Conference on Artificial Intelligence and Statistics, pages 271 279. PMLR, 2021. [10] N. Bishop, H. Chan, D. Mandal, and L. Tran-Thanh. Adversarial blocking bandits. Advances in Neural Information Processing Systems, 33:8139 8149, 2020. [11] S. Cayci, A. Eryilmaz, and R. Srikant. Learning to control renewal processes with bandit feedback. Proceedings of the ACM on Measurement and Analysis of Computing Systems, 3(2): 1 32, 2019. [12] S. Cayci, A. Eryilmaz, and R. Srikant. Budget-constrained bandits over general cost and reward distributions. In International Conference on Artificial Intelligence and Statistics, pages 4388 4398. PMLR, 2020. [13] N. Cesa-Bianchi, C. Gentile, and Y. Mansour. Delay and cooperation in nonstochastic bandits. Journal of Machine Learning Research, 20(17):1 38, 2019. [14] J. P. Dickerson, K. A. Sankararaman, A. Srinivasan, and P. Xu. Allocation problems in ridesharing platforms: Online matching with offline reusable resources. ACM Transactions on Economics and Computation, 9(3), June 2021. [15] W. Ding, T. Qin, X.-D. Zhang, and T.-Y. Liu. Multi-armed bandit with budget constraint and variable costs. In Twenty-Seventh AAAI Conference on Artificial Intelligence, 2013. [16] Z. Dong, S. Das, P. Fowler, and C.-J. Ho. Efficient nonmyopic online allocation of scarce reusable resources. In International Conference on Autonomous Agents and Multiagent Systems, pages 447 455, 2021. [17] J. Edmonds. Matroids and the greedy algorithm. Mathematical programming, 1(1):127 136, 1971. [18] Y. Gai, B. Krishnamachari, and R. Jain. Learning multiuser channel allocations in cognitive radio networks: A combinatorial multi-armed bandit formulation. In 2010 IEEE Symposium on New Frontiers in Dynamic Spectrum (Dy SPAN), pages 1 9. IEEE, 2010. [19] Z. Gao, Y. Han, Z. Ren, and Z. Zhou. Batched multi-armed bandits problem. Advances in Neural Information Processing Systems, 32:501 511, 2019. [20] A. György, L. Kocsis, I. Szabó, and C. Szepesvári. Continuous time associative bandit problems. In Proceedings of the 20th international joint conference on Artifical intelligence, pages 830 835, 2007. [21] T. Jun. A survey on the bandit problem with switching costs. de Economist, 152(4):513 541, 2004. [22] A. Kalai and S. Vempala. Efficient algorithms for online decision problems. Journal of Computer and System Sciences, 71(3):291 307, 2005. [23] H. W. Kuhn. The hungarian method for the assignment problem. Naval research logistics quarterly, 2(1-2):83 97, 1955. [24] B. Kveton, Z. Wen, A. Ashkan, H. Eydgahi, and B. Eriksson. Matroid bandits: fast combinatorial optimization with learning. In Proceedings of the Thirtieth Conference on Uncertainty in Artificial Intelligence, pages 420 429, 2014. [25] B. Kveton, Z. Wen, A. Ashkan, and C. Szepesvari. Tight regret bounds for stochastic combinatorial semi-bandits. In International Conference on Artificial Intelligence and Statistics, pages 535 543. PMLR, 2015. [26] T. Lattimore and C. Szepesvári. Bandit algorithms. Cambridge University Press, 2020. [27] A. Maurer and M. Pontil. Empirical bernstein bounds and sample variance penalization. ar Xiv preprint ar Xiv:0907.3740, 2009. [28] O. Papadigenopoulos and C. Caramanis. Recurrent submodular welfare and matroid blocking semi-bandits. Advances in Neural Information Processing Systems, 34:23334 23346, 2021. [29] V. Perchet, P. 
Rigollet, S. Chassang, and E. Snowberg. Batched bandit problems. The Annals of Statistics, 44(2):660 681, 2016. [30] P. Perrault, E. Boursier, M. Valko, and V. Perchet. Statistical efficiency of thompson sampling for combinatorial semi-bandits. Advances in Neural Information Processing Systems, 33: 5429 5440, 2020. [31] K. A. Sankararaman and A. Slivkins. Combinatorial semi-bandits with knapsacks. In International Conference on Artificial Intelligence and Statistics, pages 1760 1770. PMLR, 2018. [32] H. Sumita, S. Ito, K. Takemura, D. Hatano, T. Fukunaga, N. Kakimura, and K. Kawarabayashi. Online task assignment problems with reusable resources. In The Thirty-Sixth AAAI Conference on Artificial Intelligence, 2022. [33] T. Uchiya, A. Nakamura, and M. Kudo. Algorithms for adversarial bandit problems with multiple plays. In International Conference on Algorithmic Learning Theory, pages 375 389, 2010. [34] S. Wang and W. Chen. Thompson sampling for combinatorial semi-bandits. In International Conference on Machine Learning, pages 5114 5122. PMLR, 2018. [35] H. Whitney. On the abstract properties of linear dependence. American Journal of Mathematics, 57(3):509 533, 1935. [36] H. Wu, R. Srikant, X. Liu, and C. Jiang. Algorithms with logarithmic or sublinear regret for constrained contextual bandits. Advances in Neural Information Processing Systems, 28: 433 441, 2015. [37] D. P. Zhou and C. J. Tomlin. Budget-constrained multi-armed bandits with multiple plays. In Thirty-Second AAAI Conference on Artificial Intelligence, pages 4572 4579, 2018. [38] J. Zimmert and Y. Seldin. An optimal algorithm for adversarial bandits with arbitrary delays. In International Conference on Artificial Intelligence and Statistics, pages 3285 3294. PMLR, 2020. A Omitted Proofs Lemma A.1 (Empirical bernstein inequalities, [4]). Let X1, X2, . . . , Xn [Cl, Cu] be a sequence of i.i.d. random variables with mean µ and variance σ2. Let ˆµn = Pn i=1 Xi/n and Vn = Pn i=1(Xi ˆµn)2/n. For any δ (0, 1), with probability 1 4δ, we have δ + 3(Cu Cl) δ + 5(Cu Cl) Proof. Theorem 1 by Audibert et al. [4] implies that the first inequality holds with probability 1 3δ. The second inequality follows from Bernstein s inequality for Vt: 2(Cu Cl)2σ2 δ + 2(Cu Cl)2 From this, with probability 1 δ, we have r 2(Cu Cl)2σ2 δ + 2(Cu Cl)2 δ + 2(Cu Cl)2 δ + 2(Cu Cl) where the second inequality follows from 1 + x + y 1 + x + y 1 + x/2 + y that holds for any x, y 0. This implies that the second inequality of (11) holds with probability 1 δ. A.1 Proof of Proposition 2.2 Proof. For t = 1, 2, . . . , T, denote at = πt(ht) and define bt by (1). For t > T, let at = 0 for the notational simplicity. Then, as rt is independent of at, we have From (1), we have E h PT +Cu t=1 (ati + bti) i = E h PT +Cu t=1 ati + Pt 1 s=1 asi1[s + csi t] i = E h PT t=1 aticti i = ci E h PT t=1 ati i for any i, where the last equality follows from the fact that ct is independent of at. Hence, we have E h PT +Cu t=1 (at + bt) i = D E h PT t=1 at i , where D RN N >0 denote the diagonal matrix with entries c1, c2, . . . , c N. As we have at + bt A, this implies that 1 T +Cu D E h PT t=1 at i = E h 1 T +Cu PT +Cu t=1 (at + bt) i is in the convex hull of A. We hence have r E h PT t=1 at i = (T + Cu)(D 1 r) 1 T +Cu D E h PT t=1 at i (T + Cu) maxa A q a , where the last inequality follows from D 1 r = q and the fact that 1 T +Cu D E h PT t=1 at i is in the convex hull of A. Combining this with (13), we obtain the bound in Proposition 2.2. A.2 Proof of Lemma 3.1 Proof. 
From (5), we have mini A s Ti(ts) Cu/Cl, which implies ls = Cl mini A s Ti(ts)+2Cu 2Cu. From this, for all i A s. we have Ti(ts+1) Ti(ts) (ls 2Cu)/Cu ls/(2Cu). This means the first inequality in Lemma 3.1 holds. Further, we have Ti(ts+1) Ti(ts) (ls + 1)/Cl (Cl min i A s Ti(ts) + Cu + 1)/Cl = min i A s Ti(ts) + Cu/Cl + 1/Cl 3 min i A s Ti(ts), which implies that the second inequality in Lemma 3.1 holds. A.3 Proof of Lemma 3.2 Proof. Let i A s be a task such that Ti (s) = mini A s Ti(ts). We then have ls = Cl Ti (s)+2Cu. As the task i is completed at least (ls 2Cu)/Cu = (Cl/Cu)Ti (s) times within the s-th phase, we have Ti (s + 1) Ti (s) (Cl/Cu)Ti (s), which implies Ti (s + 1) (1 + Cl/Cu)Ti (s). Further, if i A s for some s < s, we have T i(s) Cu/Cl as shown in the proof of Lemma 3.1. Hence, if s NK + 1 for some positive integer K, there exists i [N] such that Tti(s) (Cu/Cl)(1 + Cl/Cu)K 1. From this, as ts Cl Ti(ts) holds for any i [N], we have ts Cl(1 + Cl/Cu)s/N 2. A.4 Proof of Lemma 4.3 Let us first show that a s P i As di(ts) holds with a probability of at least 1 6N/t2 s. Fix t arbitrarily. For any i and let τ(1) < τ(2) < represent all the indices of rounds at which the i-th task is started. Define the sequence E1, E2, . . . by Ej = Pj s=1(ri(τ(s)) ri). Then, since Ej is a sum of j i.i.d. random variables, the Azuma Hoeffding inequality implies Pr[|Ej| 1.5j ln t] 2 t for any t and j. We hence have Pr [|ˆri(t) ri| dr i (t)] = Pr h |ETi(t)| p 1.5Ti(t) ln t i Pr h j [t], |Ej| p 1.5j ln t i j=1 Pr h |Ej| p 1.5j ln t i which means that the first part of (4) holds. Similarly, by considering Ej = Pj s=1(ci(τ(s)) ci). it can be shown from Lemma A.1 that |ˆci(t) ci| dc i(t) 3σ2 i ln t Ti(t) + 15(Cu Cl) ln t We further have s 3σ2 i ln t Ti(t) + 15(Cu Cl) ln t 3σ2 i + 15(Cu Cl) Cl 90Cu + 15(Cu Cl) Cl (Cu Cl)( ci Cl) Cl 30Cu + (Cu Cl) Cl where the first and second inequalities follow from the lower bound on Ti(ts) in (5), and the third and last inequalities follow from Cl ci Cu. Hence, with probability 1 6N/t2, ˆqi(t) = min{1, ˆri(t) + dr i (t)} max{Cl, ˆci(t) dc i(t)} ri ci = qi. (17) As (16) implies ci 2dc i(t) ci/2, we have ˆqi(t) = min{1, ˆri(t) + dr i (t)} max{Cl, ˆci(t) dc i(t)} min{1, ri + 2dr i (t)} max{Cl, ci 2dc i(t)} =: ri ci + ri( ci ci) ci ci qi + 2dr i (t) ci + 2dc i(t) ci ci qi + 2dr i (t) ci + 4dc i(t) c2 i 3σ2 i + 15(Cu Cl) ln ts Ti(ts) ln ts Ti(ts) = qi + di(ts), (18) where we denote ri = min{1, ri + 2dr i (t)} and ci = max{Cl, ci 2dc i(t)} ci/2. We here defined Ci by 3σ2 i + 15(Cu Cl) We then have i A \A s qi X i A s\A qi X i A \A s ˆqi(ts) X i A s\A ˆqi(ts) X i A s\A qi X i A s\A di(ts) = X i As di(ts), (20) where the first and last inequalities follow from (17) and (18), respectively, and the second inequality follows from the definition of a s. Indeed, a s arg maxa A ˆq(ts) a implies the following inequality: P i A \A s ˆqi(ts) P i A s\A ˆqi(ts) = ˆq(ts) a ˆq(ts) a s 0. Let us next show the bound on R(1) T . We decompose R(1) T as follows: s=1 ls a s1[Fs] + s=1 ls a s1[ Fs] s=1 ls a s1[ Fs] where Fs denotes the complement of the event Fs. The second part of RHS can be bounded as s=1 ls a s1[ Fs] s=1 ls a s1 i As di(ts) s=1 ls M Cl where the first inequality follows from the fact that (20) holds with a probability of at least 1 6N/t2 s and the inequality of a s M/Cl. 
We have ls t2s = ts+1 ts t2s = ts+1 ts where the last inequality follows from ts = 1 + ls Cl max i [N] T i(s) + 2Cu ts max i [N] T i(s) 1 + 3Cl Hence, we have Combining (21), (22) and (23), we obtain the bound on R(1) T in Lemma 4.3. A.5 Proof of Lemma A.2 The following lemma can be shown in a way similar to that of Kveton et al. [25, Lemma 3]. Lemma A.2. Let {αk} and {βk} be positive real sequences for which 1 = β0 > β1 > > βk > > 0, α1 > α2 > > αk > > 0, limk αk = limk βk = 0, and P k=1 βk 1 βk αk 1. Suppose that Fs occurs. There then exists k such that Ss,k = {i As | a s M αkdi(ts)} satisfies |Ss,k| βk M . Proof. Let us show the claim via proof by contradiction. Suppose that |Ss,k| < βk M holds for all k. We then have X i As di(ts) < i Ss,k 1\Ss,k a s CM αk = a s |Ss,k 1| |Ss,k| αk |Ss,0| α1 + k=1 |Sk| 1 αk+1 1 αk 1 αk+1 1 αk βk 1 βk αk a s where the first inequality follows from the definition of Ss,k, i.e., CM αkdi(ts) < a s holds for all i [N] \ Ss,k, and the second inequality follows from the initial assumption made in this proof. This means that Fs occurs. Hence, if Fs occurs, thi initial assumption is false, which means that there exists k such that |Ss,k| βk M. We will apply Lemma A.2 with {αk} and {βk} such that P k=1 αk βk = O(1), which are provided in Appendix A.4 in [25]. From Lemma A.2, if Fs occurs, P i A 1[i As, a s 2 CM αkdi(ts)] βk M holds for some k, which implies i A 1 [Gs,k,i] , where Gs,k,i = n i As, a s M αkdi(ts) o . (24) Proof. Using (24), we can bound ˆRT as s=1 ls a s1 [Gs,k,i] X s=1 lsdi(ts)1 [Gs,k,i] ln ts Ti(ts)1 [Gs,k,i] ls1 [Gs,k,i] p Let us evaluate PS s=1 ls1[Gs,k,i] Ti(ts) for some fixed k and i. If Gs,k,i occurs, we have i a s M αkdi(ts) M q αk Ci ln T ci Ti(ts) which implies that p ci 1 i =: γ. We hence have S X ls1 [Gs,k,i] p ls1[i As]1[ p Ti(ts) γ] p Ti(ts) . (26) The expectation of the right-hand side can be bounded as follows. From Lemma A.4 in the Appendix, we have " ls1[i As] p " Ti(ts+1) Ti(ts) p By using Lemma 3.1, we obtain Ti(ts+1) Ti(ts) Ti(ts) = 3 Ti(ts+1) Ti(ts) Ti(ts) 3 Ti(ts+1) Ti(ts) Ti(ts)). Combining this with (26) and (27), we obtain E PS s=1 ls1[Gs,k,i] 6 ci PS s=1 p Ti(ts) 1 hp Ti(ts) γ i 12 ciγ = 12M αk ci Ci ln T i , where we the last inequality follows from Lemma 3.1. Combining this with (25), we obtain E h ˆRT i = i A Ci P k=1 αk βk 1 i i A Ci ln T , which completes the proof. A.6 Proof of the second part of (10) (bounding ˆR(2) T ) The second part of (10) can be shown via the following: Lemma A.3. Given a s, ts and ts+1, we have E h Pts+1 1 t=ts (q a s r t at) i 3M Cu Proof. For any s 1 and task i [N], the number of times to start task i duaring the s-th phase is at least (Ti(ts+1) Ti(ts) 1). We hence have t=ts r t at i A s ri(Ti(ts+1) Ti(ts) 1) i A s ri ls 3Cu i A s qi(ls 3Cu) Cl M = E ls q a s 3Cu where the second inequality follows from Lemma A.4 and the third inequality follows from qi = ri/ ci 1/Cl and |A s| M. This completes the proof. We used the following lemma in the proof of Lemma A.3. Lemma A.4. For any s 1 and i A s, we have E[Ti(ts+1) Ti(ts)|ls] ls 2Cu Consequently, we have ls 2 ci E s [Ti(ts+1) Ti(ts)] , (29) where Es denote the conditional expectation given ls and Ti(ts). Proof. We will show (28) by induction in the value of ls 0. It is clear that (28) hold if ls 2Cu. As the inductive hypothesis, we assume that (28) holds if ls m for some m 2Cu. Then, if ls = m + 1 > 2Cu, at least once, task i A s is started and completed during the s-th phase. 
Let ci denote the processing time when task i is first started and completed during the s-th phase. We then have E [Ti(ts+1) Ti(ts)|ls = m + 1] = c=Cl Pr[ci = c] E [1 + Ti(ts+1) Ti(ts)|ls = m + 1 c] c=Cl Pr[ci = c] E [Ti(ts+1) Ti(ts)|ls = m + 1 c] c=Cl Pr[ct,i = c]m + 1 c 2Cu = 1 + m + 1 ci 2Cu ci = m + 1 2Cu where the inequality follows from the inductive hypothesis. Hence, (28) holds for l = m + 1 as well. By induction, it has been shown that (28) holds. Further, as we have ls Cl B + 2Cu 4Cu, we have ls 2Cu ls/2, which implies (29) holds. From Lemma A.3, we have ˆR(2) T 2 Cu Cl SM. By combining this with S = O Cu Cl N ln T , which follows from Lemma 3.2, we obtain the second part of (10). A.7 Proof of Theorem 4.5 We start from the special case in which M = 1, i.e., we assume A = {{i} | i [N]} { }. We will show the following two lower bounds separately: RT = Ω min r RT = Ω Cu Cl Cl Cu min np Cu NT, T o Cu These two bounds together lead to the regret lower bound of Cu NT, T o . (32) In fact, if Cu 2Cl, (30) implies that (32) holds as we have Cu C2 l = Θ( 1 Cl ). On the other hand, if Cu 2Cl, (31) implies that (32) holds since we have Cu Cl Cl Cu = Ω( 1 From the lower bound of (32), we can obtain Cu MNT, MT o (33) by the technique used in the proof of Proposition 2 by Kveton et al. [25], which provides a regret lower bound for combinatorial semi-bandits. To see this, suppose that N = MK for simplicity. The action set of K-path combinatorial semi-bandit (in Proposition 2 by Kveton et al. [25]) is expressed as A = {Ak | k [K]} { }, where Ak = {M(k 1) + j | j [M]}. Note that this does not satisfy the assumption that the action set is closed under inclusion, which is posed in our problem setting. In order to address this assumption, we set A = {A [N] | A A , A A }, which is closed under inclusion. Let us consider bandit task assignment problems with action set A, instead of A . We suppose that, for each k [K], rti and cti take the same values for all i Ai, i.e., there exists r tk and c tk such that rti = r tk and cti = c tk for all i Ai. If r _tk and c _tk are i.i.d. for different t, this is a proper problem instance in our model. (Our model does not assume that rti and cti are independent for different tasks i as stated in Section 2). For any action At A \ { }, let A t A be such that At A t. Then, if (At)T t=1 is feasible, (A t)T t=1 is also feasible, and the reward for the latter is greater than or equal to the former. This means that any algorithm can be converted to one that takes only actions in A without sacrificing the performance. Hence, it is sufficient to consider only algorithms that choose actions from A in the proof of lower bounds. Therefore, regret lower bounds with action set {{k} | k [K]} { }, multiplied with M, apply to our problem setting. Hence, the lower bound of (32) implies (33) as well. The rest of this section is dedicated to the proofs of (30) and (31). Proof of (30) If M = 1 and cti = Cl for all t and i with probability one, the problem is equivalent to a standard N-armed bandit problem with time horizon T = Θ(T/Cl). As shown in Theorem 5.1 by Auer et al. [6], any algorithm for the N-armed bandit problem suffers Ω(min{ NT , T })-regret in the worst case with time horizon T . By substituting T = Θ(T/Cl) into this lower bound, we obtain (30). Proof of (31) Set p = Cl Cu+Cl . Fix an arbitrary algorithm. 
For some fixed i [N] and ϵ [0, p/2], define a distribution Di ,ϵ of c and r as follows: ci = Cu with probability p ϵ 1[i = i ] Cl with probability 1 p + ϵ 1[i = i ] , ri = 1 (i [N]). We also denote this distribution with ϵ = 0 by D0, which does not depend on i . Denote c = p Cu + (1 p)Cl = 2Cl Cu Cl+Cu = 2p Cu. Then the expected processing time can be expressed as ci = E[ci] = c ϵ (Cu Cl)1[i = i ]. (34) Denote the number of completing the task i during T rounds by Ti. Then the expected regret can be expressed as (T Cu) ( c ϵ (Cu Cl)) 1 c ϵ (Cu Cl) E i ,ϵ where Ei ,ϵ means the expectation for the case in which c and r follow Di ,ϵ. As the expected value of the number of rounds to process the task i is expressed as Ei ,ϵ[ ci Ti], we have From (34), the left-hand side of this can be expressed as i=1 Ti ϵ (Cu Cl)Ti Combining the above, we obtain c (T + ϵ (Cu Cl) E i ,ϵ[Ti ]). (36) Similarly, we can show that where E0 means the expectation for the case in which c and r follow D0. This means that there exists i such that E 0 [Ti ] T c N We assume that i satisfies this in the following. From (35) and (36), we have c E i ,ϵ[Ti ] 2Cu By a similar argument to Lemma 6.6 of by Badanidiyuru et al. [7], we can show that E i ,ϵ[T i ] 3 4 T c ϵ (Cu Cl) if ϵ 1 Hence, if ϵ = min p 4, 1 3 4 T c ϵ (Cu Cl) 3 4 T c p Cu/4 = 3 4 T c c/8 = 6 Combining this with (37), we obtain T 2p Cu min Cl Cu min n T, p Cu NT o 2Cu which completes the proof of (31). 0 2000 4000 6000 8000 10000 round UCB-BV1 Comb UCB1 Proposed Figure 4: Instance specified by parameters c = (2.19, 4.6, 5.35, 1.42) and r = (0.38, 0.43, 0.35, 0.47), which are generated from uniform distributions over [Cl, Cu] and [0, 1], respectively. 0 2000 4000 6000 8000 10000 round UCB-BV1 Comb UCB1 Proposed Figure 5: The average for instances in which the entries of c and r are generated from uniform distributions over [Cl, Cu] and [0, 1], respectively. The shaded areas represent the standard deviations of the empirical regret. B Notes on experimental setup In the implementation on UCB-BV1 proposed by Ding et al. [15], by considering each element in A = {A [N] | |A| = M} as a single arm, we transform an instance of the bandit task assignment into a budget constrained multi-armed bandit problem with N M -arms. To satisfy the constraints, we wait until all tasks in the previously chosen set are completed before deciding the next arm, i.e., the budget consumed by choosing At is given as maxi At{cti}. Similarly, in the implementation of Comb UCB1 proposed by Kveton et al. [25], each time a set At in A is selected and it waits until all the tasks in At are completed before selecting the next set. However, the way of choosing the set is different from that of UCB-BV1: we maintain the UCB scores for each task i [N], not for each set A A, and then choose a set so that the sum of the UCB scores for tasks in it are maximized, as suggested in the paper by Kveton et al. [25]. Data and source code to reproduce the experimental results in this paper are included in the supplementary materials. C Additional experiments This section provides the results of numerical experiments performed to evaluate (i) the performance for randomly generated instances and (ii) how the algorithm behaves under different settings, such as varying values of N and Cu. Settings of parameters not specified in the captions are the same as in Section 5 and in Appendix B. Figures 4 and 5 show the results of synthetic experiments for randomly chosen r and c. 
This experiment confirms that the proposed algorithm works well empirically in randomly generated environments, including cases where the values of r̄ are not necessarily constant. Some results of an empirical evaluation of the algorithm's sensitivity to the parameters N and Cu are given in Figures 6, 7, and 8. Figure 6 suggests that the regret scales linearly with respect to N, which is consistent with the regret upper bound of (7). Figure 7 shows the computation time for varying problem sizes, which implies that the proposed algorithm is superior in terms of computational complexity. Figure 8 shows how false beliefs about Cu affect the performance of the algorithms. The experimental results suggest that the performance of the proposed algorithm is sensitive to underestimation of Cu, while it is robust to overestimation of Cu.

Figure 6: Dependency of the regret on the problem size (x-axis: N, the number of tasks). Figure 7: Dependency of the computational time on the problem size (x-axis: N, the number of tasks; y-axis: computational time [s]). Figure 8: Impact of the misspecification of Cu on the regret (x-axis: the input parameter corresponding to Cu); note that the true value of Cu is 6. (plots for UCB-BV1, CombUCB1, and the proposed algorithm omitted)
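For completeness, the synthetic instances of Section 5 and this appendix can be generated along the following lines. This is our own sketch, not the released source code; in particular, matching the binomial parameter to the reported mean processing times is our assumption.

```python
# Sketch (ours, not the released code) of the synthetic instance generation described
# in Section 5: r_{ti} ~ Bernoulli(r_bar_i) and c_{ti} - 1 ~ Binomial(C_u - 1, p_i),
# where p_i is chosen so that E[c_{ti}] matches the reported mean (our assumption).
import numpy as np

rng = np.random.default_rng(0)
C_u, C_l = 6, 1
r_bar = np.array([0.5, 0.5, 0.5, 0.5])
c_bar = np.array([1.5, 1.5, 2.0, 2.0])      # "small gap" instance; large gap: (1.5, 1.5, 5.0, 5.0)
p = (c_bar - 1.0) / (C_u - 1)               # binomial success probability per task

def sample_round():
    """Draw one round of rewards and processing times for all tasks."""
    r = rng.binomial(1, r_bar)              # Bernoulli rewards in {0, 1}
    c = 1 + rng.binomial(C_u - 1, p)        # processing times in {1, ..., C_u}
    return r, c

r, c = sample_round()
print("rewards:", r, "processing times:", c)
```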