# on_preemption_and_learning_in_stochastic_scheduling__41420271.pdf On Preemption and Learning in Stochastic Scheduling Nadav Merlis * 1 Hugo Richard * 1 2 Flore Sentenac * 1 Corentin Odic 1 Mathieu Molina 1 3 Vianney Perchet 1 2 We study single-machine scheduling of jobs, each belonging to a job type that determines its duration distribution. We start by analyzing the scenario where the type characteristics are known and then move to two learning scenarios where the types are unknown: non-preemptive problems, where each started job must be completed before moving to another job; and preemptive problems, where job execution can be paused in the favor of moving to a different job. In both cases, we design algorithms that achieve sublinear excess cost, compared to the performance with known types, and prove lower bounds for the non-preemptive case. Notably, we demonstrate, both theoretically and through simulations, how preemptive algorithms can greatly outperform non-preemptive ones when the durations of different job types are far from one another, a phenomenon that does not occur when the type durations are known. 1. Introduction Single Machine Scheduling is a longstanding problem with many variants and applications (Pinedo, 2012). In this problem, a set of N jobs must be processed on one machine, each of a different size processing time required for its completion. An algorithm is a policy assigning jobs to the machine, and performance is usually measured by flow time the sum of the times when jobs have finished. If one has access to the size of each job, then scheduling the jobs by increasing size is optimal (Schrage, 1968). Unfortunately, for most applications, this knowledge is unavailable; yet, oftentimes, some structure or knowledge on the jobs can still be leveraged. In this paper, we focus on scheduling problems where jobs are grouped by types that determine their duration distribu- *Equal contribution 1ENSAE, Paris 2Criteo, Paris 3Inria, France. Correspondence to: Nadav Merlis . Proceedings of the 40 th International Conference on Machine Learning, Honolulu, Hawaii, USA. PMLR 202, 2023. Copyright 2023 by the author(s). tion. This model approximates many real-world scenarios. For example, when scheduling patients for surgery, patients may be grouped by expected procedure time (Magerlein & Martin, 1978). The model is also relevant in computing problems, where jobs with similar features are expected to have a similar processing time (Li et al., 2006). Lastly, in calendar learning, where an agent advises the user on how to organize its day based on the tasks to be done, similar tasks can be assumed to have a similar duration (White & Hassan Awadallah, 2019). In practice, when encountering a new scheduling task, we usually know the type of each job, but have little-to-no information on the expected duration under each type. Then, the scheduling algorithm must learn the characteristics of each type to be able to utilize this information. This must be done concurrently with the scheduling of tasks, which poses an extra challenge to be useful, learning must be done as early as possible; however, wrong scheduling allocation at the beginning delays all jobs and causes large penalties. In this work, we show how learning can be efficiently done in scheduling problems with job types, characterized by exponential distributions, in two different settings the nonpreemptive setting, where once a job started running, it must be completed, and the preemptive setting, where jobs can be put on hold. We present two algorithms in each setting and show that the preemptive setting has a clear advantage when the type durations have to be learned. This comes in contrast to the case of known types, where under reasonable assumptions, the optimal algorithm is non-preemptive. While our algorithms resemble classic bandit methods, the scheduling objective requires different analysis approaches. In particular, in the context of scheduling, the quality of an algorithm is measured by the ordering of jobs. In stark contrast, regret-minimization objectives measure the number of plays from each arm (job type). Indeed, in scheduling problems, the number of pulls from each job type is always the same by the end of the interaction, we would finish all the jobs of all types. Thus, both our algorithmic design and analysis will be comparative focus on the number of jobs evaluations from a bad type before the completion of jobs of a good type. Our contributions are as follows. (1) We present the scheduling setting with unknown job types. (2) We analyze the On Preemption and Learning in Stochastic Scheduling optimal algorithm for the case of known job types, called Follow-the-Perfect-Prediction (FTPP), and bound its competitive ratio (CR). (3) We present explore-then-commit (ETC) and upper confidence bound (UCB) algorithms for the preemptive and non-preemptive settings and bound their performance, compared to FTPP. In particular, our bounds show that the non-preemptive algorithms have worse dependence on the durations of the longest job types. (4) We complement this by proving lower bounds to the non-preemptive case. (5) We end by simulating our suggested algorithms and show that their empirical behavior is consistent with our theoretical findings. 2. Related Work Scheduling problems. The scheduling literature and problem zoology are large. We focus on static scheduling on a single machine with the objective of minimizing flow time. Static scheduling (Motwani et al., 1994) means that all scheduled tasks are given in advance before the scheduling starts. Possible generalizations include dynamic scheduling where scheduled tasks arrive online (Becchetti & Leonardi, 2004), weighted flow time (Bansal & Dhamdhere, 2007) where different jobs have different weights, multiple machines (Lawler & Labetoulle, 1978)); and many more (Dürr et al., 2020; Tsung-Chyan et al., 1997). While we only tackle some versions of this problem, we believe that our approaches can be adapted or extended to other settings. Clairvoyant and non-clairvoyant scheduling. In clairvoyant scheduling, job sizes are assumed to be known, and scheduling the shortest jobs first gives the lowest flow time (Schrage, 1968). In non-clairvoyant scheduling, job sizes are arbitrary and unknown. The Round Robin (RR) algorithm, which gives the same amount of computing time to all jobs, is the best deterministic algorithm with a competitive ratio of 2 2 N+1 = 2 + o(N) (Motwani et al., 1994). The best randomized algorithm has a competitive ratio of 2 4 N+3 = 2 + o(N) (Motwani et al., 1994). Stochastic scheduling. Stochastic scheduling covers a middle ground where job sizes are known random variables. The field of optimal stochastic scheduling aims to design optimal algorithms for stochastic scheduling (see Cai et al. 2014 for a review). When distributions have a non-decreasing hazard rate, scheduling the shortest mean first is optimal (see Cai et al. 2014, Corollary 2.1). In this work, we consider exponential job sizes (which have a non-decreasing hazard rate), as frequently assumed in the scheduling literature (Kämpke, 1989; Hamada & Glazebrook, 1993; Cunningham & Dutta, 1973; Cai & Zhou, 2000; 2005; Pinedo & Weiss, 1985; Glazebrook, 1979) and similarly for the presence of different types of jobs (Mitzenmacher, 2020; Hamada & Glazebrook, 1993; Marbán et al., 2011). Yet, in contrast to most of the literature on stochastic scheduling, the means of the exponential sizes are unknown to the scheduler and are learned as the algorithm runs. Nonetheless, we later present algorithms whose CR asymptotically converges to the optimal value, obtained in stochastic scheduling with known job means. The problem of learning in scheduling has received some attention lately. Specifically, Levi et al. (2019) consider a setting where it is possible to test jobs to learn about their attributes, which comes at a cost. In (Krishnasamy et al., 2018), the authors propose an algorithm to learn the cµ rule (a rule to balance different holding costs per job) in the context of dynamic queues. Perhaps closest to our setting, in (Lee & Vojnovic, 2021), job types are also considered, but the length of the jobs is assumed to be known, and the goal is to deal with the uncertainty on the holding costs, which are noisily observed at each iteration. In the last two papers, no explicit exploration is needed, which stands in contrast with our setting. The problem we tackle was previously studied in a Bayesian setting (Marbán et al., 2011), under the assumption of two job types, and a Bayesian algorithm, called LSEPT, was presented. When run with an uninformative prior (the same for all job types), LSEPT is reduced to a greedy algorithm; whenever a job finishes, it runs until completion a job whose type has the lowest expected belief on its mean size (computed across jobs that have been processed so far). The author proved it has better performance in expectation than fully non-adaptive methods, but provided no other guarantee. In Appendix D, we empirically evaluate this algorithm and show it has a behavior typical of greedy algorithms: it has a very large variance, and its CR does not converge to the optimal CR, in contrast to our suggested methods. 3. Setting and Notations We consider scheduling problems of N jobs on a single machine, each belonging to one of K job types. We assume that N = n K, i.e, there are n jobs of each type. The different sizes (also called processing times) of the jobs of type k are denoted (P k i )i [n],1 where P k i E(λk) are independent samples from an exponential variable of parameter λk. By extension, E[P k i ] = λk, and we call λk the mean size of type k. We assume without loss of generality that the mean sizes of the K types are in an increasing order λ1 λ2 . . . λK and denote λ = (λ1, . . . , λK). With slight abuse of notations, we sometimes ignore the job types and denote the job durations by Pi for i [N]. 1For clarity of exposition, we assume that there are exactly n jobs per type. When types have different numbers of jobs n1, n2, . . . , n K, all algorithms can run with n = maxℓnℓ, and all bounds hold with this same parameter n. On Preemption and Learning in Stochastic Scheduling Next, denote bk i and ek i the beginning and end dates of the computation of the ith job of type k. We define the cost of an algorithm ALG, also called flow time, as the sum of all completion times: CALG = PK k=1 P i n ek i . Given knowledge of the job size realizations, the cost is minimized by an algorithm that computes them in increasing order, which we term as OPT. Preemption is the operation of pausing the execution of one job in the favor of running another one. Thus, preemptive algorithms are ones that support preemption, while nonpreemptive algorithms do not allow it and must run each started job until completion. 4. Benchmark: Follow The Perfect Prediction We compare our algorithms to a baseline that completes each job by increasing expected sizes, called Follow-The-Perfect Prediction (FTPP). For exponential job sizes, this strategy is optimal between all algorithms without access to job size realizations (see Cai et al. 2014, Corollary 2.1). Thus, with learning, we aim at designing algorithms approaching the performance of FTPP, whilst mitigating the cost of learning (i.e., mitigating the cost of exploration). In the rest of this section, we analyze the performance of FTPP. First, we evaluate the performance of non-clairvoyant algorithms that do not exploit the job type structure. We then compare the performance of the best of those algorithms against that of FTPP and show the clear advantage of using the structure of job types. 4.1. Non-clairvoyant Algorithms An algorithm A is said non-clairvoyant if it does not have any information on the job sizes, including the job type structure. Recall that RR is the algorithm that computes all unfinished jobs in parallel and is the optimal deterministic algorithm in the adversarial setting. The following proposition states that in our setting, it is the optimal algorithm among all non-clairvoyant ones. Proposition 4.1. For any λ and any (deterministic or randomized) non-clairvoyant algorithm A, there exists a job ordering such that E[CA] E[CRR]. Proof sketch (full proof in Appendix A.3). Consider T A ij , the amount of time that job i and job j delay each other. As the algorithm is unaware of the expected job size order, its run is independent of whether the expected size of job i is smaller or greater than that of j. This holds because a non-clairvoyant algorithm has no information on job expected sizes nor on the existence of job types. As an adversary, we can therefore choose the job order so that the algorithm incurs the largest flow time. A careful analysis then provides E[T A ij ] 2E[T OP T ij ] where OPT is the optimal realization-aware algorithm. A similar reasoning is made in the case of randomized algorithms. We conclude by observing that RR achieves such delay. Unfortunately, even though our setting is not adversarial, the CR of RR is bounded from below (Lemma A.3): for any λ, E [CRR] E [COPT] 2 4 n + 3. (1) 4.2. Performance of FTPP The first statement establishes that FTPP outperforms RR on any instance. Lemma 4.2. For any n and λ, E[CFTPP] E[CRR]. The proof is a straightforward computation, done in Appendix A.2.3. This indicates that when information on the job types is available, it is always advantageous to use it. In the rest of the section, we quantify the improvement this extra information brings. More precisely, we show that on a wide variety of instances, the CR of FTPP is much smaller than that of RR. We first present such a bound when K = 2. Proposition 4.3. The CR of FTPP with K = 2 types of jobs with n jobs per type with λ1 = 1 and λ2 = λ > 1 satisfies: E[COPT] 2 4 λ 1 (1 + λ)2 + 4λ. Proof sketch (full proof in Proposition A.4). The total expected flow time of any algorithm is given by the sum of the expected time spent computing all jobs and the expected time lost waiting as jobs delay each other. In the case of OPT, the expected flow time can be computed in closed form using that job i and j delay each other by E[min(Xi, Xj)] = E[Xi]E[Xj] E[Xi]+E[Xj]. The CR E[CFTPP] E[COPT] can then be calculated and is upper bounded to yield the result. In the case K = 2, there exists values of λ for which the CR of FTPP is lower than 1.71. In the general case, Proposition A.5 in Appendix A.2.2 shows that there exist values of K and λ for which the CR is as low as 1.274.2 5. Non-Preemptive Algorithms After establishing FTPP as the baseline for learning algorithms, we move to tackle learning in the non-preemptive 2An exact expression for the CR of FTPP is given at Equation (7) in the appendix and is omitted for clarity reasons. On Preemption and Learning in Stochastic Scheduling Algorithm 1 Non-Preemptive Algorithms routine 1: Init: type set U = [K], active jobs ik = 1, k [K] 2: while U is not empty do 3: Use a type selection subroutine to select a type k U 4: Run job ik until completion 5: Set ik ik + 1 6: if All jobs of type k are completed then 7: Remove type k from U 8: end if 9: end while setting, where once started, job execution cannot be stopped (see Algorithm 1). This is relevant, for example, to settings where switching tasks is very costly (e.g., in running time or memory) or even impossible (e.g., in medical applications, where treatment of a patient cannot be stopped). We show how algorithms from the bandit literature can be adapted to the scheduling setting and bound their excessive cost, compared to FTPP. Specifically, by treating each job type as an arm , we adapt explore-then-commit and optimism-based strategies to the scheduling setting. 5.1. Description of ETC-U and UCB-U In the following, we describe the type selection mechanism for ETC-U and UCB-U. The full pseudo-code of both algorithms is available in Appendix B.1. Let U be the set of all job types with at least one remaining job. ETC-U type selection. While ETC-U runs, it maintains a set of types A that are candidates for having the lowest mean size among the incomplete types U. At each iteration, ETC-U chooses a job of type k of the minimal number of completed jobs in A and executes it to completion. Then, U and A are updated and the procedure repeats until no more jobs are available in U. We now describe the mechanism of maintaining the candidate type set A. At a given iteration, denote by mk and mℓ, the number of jobs of type k and ℓthat have been computed up to that iteration. Letting ˆrmin(mk,mℓ) k,ℓ = Pmin(mk,mℓ) i=1 1 P k i < P ℓ i min(mk, mℓ) and δmin(mk,mℓ) k,ℓ = log(2n2K3) 2 min(mk, mℓ), a type ℓis excluded from A if there exists a type k such that ˆrmin(mk,mℓ) k,l δmin(mk,mℓ) k,ℓ > 0.5. (2) In the proof, we show that this condition implies w.h.p. that λk < λℓ. Thus, when it holds, job type ℓis no longer a candidate for the remaining job type with the smallest expectation, and we say that type k eliminates type ℓ. Once a job type is eliminated, it remains so until A is empty, at which point all job types in U are reinstated to A. Finally, whenever A contains only one type k, all jobs of this type are run to completion, and after all jobs from type k are finished, it is removed from U and therefore from A. This means that types that were eliminated by k can be candidates again. UCB-U type selection. At every iteration, the algorithm computes an index for each job type and plays a type with the minimal index from the incomplete types U. Specifically, if mk jobs were completed from type k, the index of the type is defined as λmk k = 2 Pmk i=1 Xk i χ2 2mk(1 1 2n2K2 ), where χ2 m(δ) is the δ-percentile of a χ2 distribution with m degrees of freedom. In the proof, we show that these indices are a lower bound of the job means w.h.p., so choosing the minimal index corresponds to choosing the type with the optimistic shortest duration. 5.2. Cost Analysis Proposition 5.1. The following bounds hold: E[CETC-U] E[CFTPP] + 1 2(k 1)(2K k) + (K k)2 λkn p 8n log(2n2K3) E[CUCB-U] E[CFTPP] + 2 3n ln (2n2K2) Proposition 5.1 shows that both ETC-U and UCB-U have sublinear excess cost. Indeed the optimal cost and thus also the cost of FTPP is lower bounded by Ω(n2 P k λk) which makes the terms in O(n n) strictly sublinear in n compared to the optimal cost. When all parameters lie within a finite range bounded away from 0, the optimal cost scales as K2n2 making the excess cost linear in K (up to log terms) compared to the optimal cost. Proof sketch (full proof in Appendix B). The above proposition is a concatenation of Propositions B.2 and B.4. The proof of both bounds starts with the decomposition of the cost with the following Lemma (proven in Appendix B.2). On Preemption and Learning in Stochastic Scheduling Lemma 5.2 (Cost of non-preemptive algorithms). Any nonpreemptive algorithm A has a cost E[CA] =E[CFTPP] (ℓ,k) [K2],k>ℓ (i,j) [n]2 (λk λℓ)E 1 ek i bℓ j . This lemma is obtained by computing explicitly the expected cost of algorithm FTPP and using the fact that the realized length of the jobs conditioned on their type is independent of their start date. Then the two proofs diverge. For ETC-U, the first step is to prove that condition (2) implies w.h.p that λk λℓ, which implies that the type in U with the smallest mean is never eliminated. Then, for the sake of the analysis, the run of the algorithm is divided into phases. In phase number ℓ, type ℓis the job type remaining with the smallest mean. We then bound the total number of samples before an arm with a large mean is eliminated at phase ℓ. The proof for UCB also starts by showing that w.h.p., arm indices lower bound the true means. Then, under the condition that the bounds hold, we upper bound the number of times an arm of type k ℓcan be pulled while type ℓis still active. The bounds in Proposition 5.1 hold for any value of the parameters. When the parameter values are far from each other, tighter bounds hold. We give here these tighter bounds for K = 2 when λ2 3λ1. A more general version of this bound is given in the appendix, Propositions B.2 and B.4. Lemma 5.3. If K = 2 and λ2 3λ1, the following bounds hold: E[CETC-U] E[CFTPP] + 12λ2n log(2n2K3) + 2 E[CUCB-U] E[CFTPP] + 9 2λ2n ln 2n2K2 + 4 The bounds of this section seem quite discouraging they imply that the existence of even one type of extremely large duration has grave implications on the cost of any algorithm. Unfortunately, for any non-preemptive algorithm, an extra cost w.r.t. FTPP scaling as nλK is unavoidable. Indeed, in the beginning, no information on the mean types is available, and any started job will be fully computed, delaying all remaining n K 1 jobs (see Appendix B.5.2 for a formal proof). 5.3. Lower bound We end this section by analyzing lower bounds for any nonpreemptive scheduling algorithm. In particular, we focus on the dependency of the excessive cost, compared to FTPP, as a function of n. We focus on lower bounds for the case of K = 2, providing a lower bound when λ1 and λ2 are close to each other and showing that in this case, the excess cost increases as n n. Proposition 5.4 (Dependency in n). For any λ1 < λ2, the flow time of any (possibly randomized) non-preemptive algorithm A satisfies: E[CA] E[CFTPP] + (λ2 λ1)n2 exp( n (λ2 λ1)2 In particular, for any λ2 λ1 1 + 1 n , E[CA] E[CFTPP] + (λ1 + λ2)n ne 1/4 Proof sketch (full proof in Appendix B.5.1). We start with the decomposition of Lemma 5.2 and look at the event (i,j) [n]2 1 e2 j < b1 i n2/2 Notice that a since the algorithm is non-preemptive, a job that terminates delays any incomplete job by its full duration. Thus, the event indicates that the number of times a job of type 2 delayed any job of type 1 is at least n2/2. Since λ2 > λ1, this scenario would lead to a large excess cost. Then, using standard information-theoretic tools, we lower bound the probability that either E occurs in the original scheduling problem or E occurs in a problem where the type order has been switched. In both problems, the relevant event causes an excess cost of Ω(n2), and substituting the exact probability of this failure case concludes the proof. 6. Preemptive Algorithms In this section, we show how to leverage preemption to get better theoretical and practical performances. In practice, we allow preemption by discretizing the computation time into small time slots of length . Then, at every iteration, one or multiple job types are selected depending on some algorithm-specific criteria. The current running job(s) of the selected type is allocated computation time instead of being run to completion. As before, we employ both an explore-then-commit strategy and an optimism-based strategy. In both cases, the only dependence of the resulting algorithm on the discretization size is due On Preemption and Learning in Stochastic Scheduling Algorithm 2 Preemptive Algorithms routine 1: Init: type set U = [K], active jobs ik = 1, k [K] 2: while U is not empty do 3: Use a type selection subroutine to select a type k U 4: Run job ik for time units 5: if ik was completed then 6: Set ik ik + 1 7: end if 8: if All jobs of type k are completed then 9: Remove type k from U 10: end if 11: end while to the discretization error (the time between the end of a job and the end of a window), which decreases with the discretization step. We omit that discretization error of at most N is the bounds. Note that in practice, any implementation of RR proceeds in a similar manner. For instance, in (Motwani et al., 1994), the discretization step is assumed much smaller than the length of the jobs. In particular, when we say we run jobs in parallel, in practice, they cyclically run in a RR manner with a small discretization step. 6.1. ETC-RR and UCB-RR ETC-RR type selection. As ETC-U, ETC-RR maintains a set of types A that are candidates for lowest mean size among the set U of types with at least one remaining jobs. The main difference is that the job type selected is the one in A with the lowest total run-time (not the one with the lowest number of computed jobs). The statistics needed to construct A are different from the ones used in ETC-U. At a given time, βk,ℓis the number of times a job of type k has finished while ℓand k were both active. Moreover, we define ˆrk,ℓ= βk,ℓ βk,ℓ+ βℓ,k and δk,ℓ= log(2n2K3) 2(βk,ℓ+ βℓ,k). The elimination rule is the same as the one of ETC-U, using these modified statistics. Reducing the number of algorithm updates: In practice, both the statistics and the chosen types are not updated at every iteration; active jobs run in parallel (meaning in a roundrobin style), and the statistics are updated every time a job terminates. This formulation of the algorithm is the one we implement(see pseudo-code in Appendix C.1). UCB-RR type selection. For each job type k [K], we introduce Tk(t), the number of times job type k has been chosen up to iteration t, and the random variables (xs k)s s.t.: t 1{a(t) = k, Tk(t) = s and the job finishes} . It is the indicator that a job of type k is completed when this type is picked for the sth time by the algorithm. We define the empirical means as: ˆµk(T) := 1 and define the index for each arm k as uk(t) = max µ [0, 1] : d (ˆµk(Tk(t)), µ) log n2 with d(x, y) the Kullback-Leibler divergence between x and y. A job type with the largest index is selected. Reducing the number of algorithm updates: As for ETC-RR, the running jobs and statistics are not updated at every iteration. Suppose type k is chosen at iteration t. If k is the last remaining type, it is run until the end. Otherwise, let ℓbe the type with the second largest index. We define µγ k(T) := 1 T + 2γ = max µ [0, 1] : d ( µγ k(Tk(t)), µ) log n2 This would be the new index of arm k, were it to run for additional 2γ iterations with no job terminating during this additional iterations. Then, we set γ = argmax γ uγ k (t) uℓ(t), and type k is allocated 2γ iterations with no statistics update. 6.2. Cost Analysis Proposition 6.1. The following bounds hold: E[CETC-RR] E[CFTPP] + 12K n E[COP T ] n log(2n2K3) k=1 (K k)2λk. and for any λ1 4 and n max(20, 10 ln(K)), E[CUCB-RR] E[CFTPP] + 12K n E[COP T ] 2n log(2n2K2) + 2 k=1 (K k)λk. On Preemption and Learning in Stochastic Scheduling Figure 1. CR of all algorithms with varying number of jobs, λ1 = 1, λ2 = 0.25, averaged over 400 seeds. Proof sketch (full proof in Appendix C). The above proposition is a combination of Propositions C.3 and C.4. Both algorithms belong to the following family of type-wise non-preemptive algorithms. Definition 6.2. Recall that bk i and ek i are the beginning and end dates of the computation of the ith job of type k. A type-wise non-preemptive algorithm is an algorithm that computes jobs of the same type one after another, i.e., i [n], k [K], ek i bk i+1. The following Lemma, proven in Appendix C.2 bounds the expected cost of any type-wise non-preemptive algorithms. Lemma 6.3 (Cost of type-wise non-preemptive algorithms). Any type-wise non-preemptive algorithm A has the following upper bound on its cost: E[CA] E[CFTPP] (ℓ,k) [K2],k>ℓ (i,j) [n]2 (λk λℓ)E 1 ek j < bℓ i The proof of this Lemma again involves computing explicitly the cost of FTPP and using that the realization of a job length is independent of its start date. A first upper bound is obtained by noting that a job started before another delays the former in expectation by at most its expected length. The second element to the proof is the fact that at every job termination, at most a job of each other type is currently active. This observation leads to the upper bound on the additional cost of preemption and the last term of the expression of the Lemma. Note that this last term implies that Figure 2. Normalized excess cost of all algorithms w.r.t. FTPP with varying value of λ1, for λ2 = 1 and n = 50, averaged over 5, 000 seeds. our upper bound will include a term scaling as nλK, which would indicate that preemptive algorithms have an extra learning cost scaling as the highest mean type. However, we strongly believe this to be an artefact of the analysis. Given the decomposition, the two proofs diverge. The analysis of ETC-RR is split into phases, as the analysis of ETC-U. However, the bound on the number of bad jobs computed in each phase requires more care because of independence arguments. Specifically, the upper bound is derived from concentration bounds on the computed statistics, and an additional bound on the number of successful jobs of each type when two types are run in parallel. The details on how to deal with those two non-independent events can be found in Appendix C.3. For UCB-RR, the first step is to distinguish two types of failures of the index. In the first failure case, the index deviates below the true mean. We show that this happens with probability O(1/n2) (Lemma C.7), independently of . The second type of failure is when the index of a suboptimal arm is much larger than its true mean. Here, we show that the upper bound on the number of iterations where this happens does diverge as goes to zero. However, the algorithm only incurs a cost on the bad pull of a type when the selected job terminates. The probability of job termination decreases as decreases, which compensates for the rise in the upper bound (Equation (32)). 7. Experiments In this section, we design synthetic experiments to compare ETC-U, ETC-RR, UCB-RR,UCB-U, RR and FTPP. All code is written in Python. We use matplotlib (Hunter, On Preemption and Learning in Stochastic Scheduling 2007) for plotting, and numpy (Harris et al., 2020) for array manipulations. The above libraries use open-source licenses. Computations are run on a laptop.3 The first experiment plots the CR of each algorithm for two types of jobs and fixed values of λ1, λ2 as n varies (see Figure 1). Even though all our suggested algorithms have the same asymptotic performance, their non-asymptotic behavior drastically varies. As predicted by theory, the preemptive versions of the algorithm consistently outperform the nonpreemptive ones. In the second experiment, n = 50 and λ2 = 1 are fixed, while λ1 varies in (0, 1) (see Figure 2). To be able to discern performance gaps when λ1 is small, we plot the difference between the CR of different algorithms and FTPP at a logarithmic scale. Here, for small values of λ, both preemptive methods outperform the non-preemptive ones. This corresponds with the improvement in the dominant error term of the preemptive cost upper bounds, as a function of λ2. 7.1. Discussion Preemptive vs. Non-Preemptive The competitive ratio of all algorithms is asymptotically the one of FTPP. Indeed, it always holds that E[COPT] (λ1+λ2) n2 4 (Equation (6)), so by Propositions 6.1 and 5.1, for any algorithm A among ETC-U, ETC-RR, UCB-U and UCB-RR: CRA = CRFTPP + O On the one hand, the leading term in the cost is the same for all algorithms. On the other hand, the error term can be much smaller in the case of preemptive algorithms. To illustrate this claim, let us consider the case where there are two types of jobs of expected sizes λ1 and λ2, respectively. Instantiating the bounds of Propositions 5.1 and 6.1 to this setting, we get: E[CETC-U] E[CFTPP] + n(λ1 + λ2) p 8n log(2n2K3) n E[COPT], (3) E[CETC-RR] E[CFTPP] + 2nλ1( p 4n log(2n2K3) + 1) n E[COPT]. (4) If λ2 λ1 the bound in Equation (3) is much larger than the bound in Equation (4), which is consistent with what we observe in Figure 2. In particular, one can observe 3The code to reproduce experiemnts is available at https: //github.com/hugorichard/ml4a-scheduling. that for small λ1, non-preemptive algorithms converge to a strictly positive error (due to the unavoidable dependence in λ2 = 1), while the error of the preemptive algorithms diminishes. This empirically supports our claim that the nλK-dependence, as appears in the preemptive cost decomposition of Lemma 6.3, is only due to a proof artefact. Optimism-based vs. explore-then-commit. In the simulations, we see that optimism-based algorithms perform much better than their ETC counterparts. In traditional bandit settings, it is well known that the regret of ETC strategies is a constant-times larger than that of optimismbased strategies. Here, we believe that in addition to that, a second phenomenon, not reflected in the analysis, renders the optimism-based strategies better than the other ones. Because of the structure of the cost, a pull of a bad job at the beginning is much more expensive than the same pull done later in the interaction (as it delays more jobs). Optimismbased strategies explore continuously as they run, whereas ETC strategies have all the exploration at the beginning, when it is more expensive. Again, this phenomenon stands in contrast with traditional bandits, where only the number of bad pulls matter, and not their position. 8. Conclusion and Future Work We designed and analyzed a family of algorithms for static scheduling on a single machine in the presence of job types. The special cost structure of this problem differs from that of traditional bandit problems, and early mistakes carry much more weight than late ones, as they delay more jobs. This modified cost directly impacts the performance of algorithms; although all suggested algorithms asymptotically have the same CR as the optimal algorithm that knows job type sizes (FTPP), their non-asymptotic performances differ. When preemption is allowed, algorithms that explore job types with a strategy inspired by the worst-case optimal deterministic algorithm RR have a clear advantage over non-preemptive learning algorithms. Thus, because of the cost structure, the performance is impacted not only by the number of exploratory steps but also by the nature of the exploratory steps. Due to the ubiquitousness of scheduling problems, we believe that our results could be extended to many other variants of this setting. In particular, it would be interesting to take our algorithmic principles and test them on real-world scheduling problems. Whether our current assumption on the exponential distribution of job sizes can be removed is an exciting direction for future work. We believe that many of the proofs for the non-preemptive case can be extended to other well-behaved distributions. However, in the preemptive case, our proofs do heavily rely on the properties On Preemption and Learning in Stochastic Scheduling of the exponential distribution, mainly the memoryless increments property. Without it, we can no longer average over different sections of a job evaluation to estimate its expected duration. Nonetheless, we believe that the results are still generalizable although proving the bounds would be technically much harder. Moreover, we believe that elements from our works can be taken to other online learning settings outside the scope of scheduling. Specifically, we believe that the notion of types serves as a reasonable approximation that allows the integration of learning to many online problems. We also think that the study of cost functions that are sensitive the early exploration is of great interest. Acknowledgements This project has received funding from the European Union s Horizon 2020 research and innovation programme under the Marie Skłodowska Curie grant agreement No 101034255. Nadav Merlis is partially supported by the Viterbi Fellowship, Technion. Mathieu Molina is supported by the French National Research Agency (ANR) through grant ANR-20-CE23-0007. Vianney Perchet acknowledges support from the French National Research Agency (ANR) under grant number (ANR19-CE23-0026 as well as the support grant, as well as from the grant Investissements d Avenir (Lab Ex Ecodec/ANR11-LABX-0047). Bansal, N. and Dhamdhere, K. Minimizing weighted flow time. ACM Transactions on Algorithms (TALG), 3(4): 39 es, 2007. Becchetti, L. and Leonardi, S. Nonclairvoyant scheduling to minimize the total flow time on single and parallel machines. Journal of the ACM, 51(4):517 539, July 2004. ISSN 0004-5411. doi: 10.1145/1008731.1008732. Cai, X. and Zhou, X. Asymmetric earliness and tardiness scheduling with exponential processing times on an unreliable machine. Annals of Operations Research, 98(1): 313 331, 2000. Cai, X. and Zhou, X. Single-machine scheduling with exponential processing times and general stochastic cost functions. Journal of Global Optimization, 31(2):317 332, 2005. Cai, X., Wu, X., and Zhou, X. Optimal Stochastic Scheduling, volume 4. Springer, 2014. Calin, O. and Udri ste, C. Geometric modeling in probability and statistics, volume 121. Springer, 2014. Cunningham, A. A. and Dutta, S. K. Scheduling jobs, with exponentially distributed processing times, on two machines of a flow shop. Naval Research Logistics Quarterly, 20(1):69 81, 1973. Dürr, C., Erlebach, T., Megow, N., and Meißner, J. An adversarial model for scheduling with testing. Algorithmica, 82(12):3630 3675, 2020. Garivier, A., Ménard, P., and Stoltz, G. Explore first, exploit next: The true shape of regret in bandit problems. Mathematics of Operations Research, 44(2):377 399, 2019. doi: 10.1287/moor.2017.0928. URL https: //doi.org/10.1287/moor.2017.0928. Glazebrook, K. D. Scheduling tasks with exponential service times on parallel processors. Journal of Applied Probability, 16(3):685 689, 1979. Hamada, T. and Glazebrook, K. D. A bayesian sequential single machine scheduling problem to minimize the expected weighted sum of flowtimes of jobs with exponential processing times. Operations Research, 41(5): 924 934, 1993. Harris, C. R., Millman, K. J., van der Walt, S. J., Gommers, R., Virtanen, P., Cournapeau, D., Wieser, E., Taylor, J., Berg, S., Smith, N. J., Kern, R., Picus, M., Hoyer, S., van Kerkwijk, M. H., Brett, M., Haldane, A., del R ıo, J. F., Wiebe, M., Peterson, P., G erard-Marchant, P., Sheppard, K., Reddy, T., Weckesser, W., Abbasi, H., Gohlke, C., and Oliphant, T. E. Array programming with Num Py. Nature, 585(7825):357 362, September 2020. doi: 10. 1038/s41586-020-2649-2. URL https://doi.org/ 10.1038/s41586-020-2649-2. Hunter, J. D. Matplotlib: A 2d graphics environment. Computing in science & engineering, 9(3):90 95, 2007. Kämpke, T. Optimal scheduling of jobs with exponential service times on identical parallel processors. Operations Research, 37(1):126 133, 1989. Krishnasamy, S., Arapostathis, A., Johari, R., and Shakkottai, S. On learning the cµ rule in single and parallel server networks. In 2018 56th Annual Allerton Conference on Communication, Control, and Computing (Allerton), pp. 153 154. IEEE, 2018. Lattimore, T. and Szepesvári, C. Bandit Algorithms. Cambridge University Press, 2020. doi: 10.1017/ 9781108571401. Laurent, B. and Massart, P. Adaptive estimation of a quadratic functional by model selection. Annals of Statistics, pp. 1302 1338, 2000. On Preemption and Learning in Stochastic Scheduling Lawler, E. L. and Labetoulle, J. On preemptive scheduling of unrelated parallel processors by linear programming. Journal of the ACM (JACM), 25(4):612 619, 1978. Lee, D. and Vojnovic, M. Scheduling jobs with stochastic holding costs. Advances in Neural Information Processing Systems, 34:19375 19384, 2021. Levi, R., Magnanti, T., and Shaposhnik, Y. Scheduling with testing. Management Science, 65(2):776 793, 2019. Li, H., Chen, J., Tao, Y., Gro, D., and Wolters, L. Improving a local learning technique for queuewait time predictions. In Sixth IEEE International Symposium on Cluster Computing and the Grid (CCGRID 06), volume 1, pp. 335 342. IEEE, 2006. Magerlein, J. M. and Martin, J. B. Surgical demand scheduling: a review. Health services research, 13(4):418, 1978. Marbán, S., Rutten, C., and Vredeveld, T. Learning in stochastic machine scheduling. In International Workshop on Approximation and Online Algorithms, pp. 21 34. Springer, 2011. Mitzenmacher, M. Scheduling with predictions and the price of misprediction. In 11th Innovations in Theoretical Computer Science Conference (ITCS 2020). Schloss Dagstuhl-Leibniz-Zentrum für Informatik, 2020. Motwani, R., Phillips, S., and Torng, E. Nonclairvoyant scheduling. Theoretical computer science, 130(1):17 47, 1994. Pinedo, M. and Weiss, G. Scheduling jobs with exponentially distributed processing times and intree precedence constraints on two parallel machines. Operations Research, 33(6):1381 1388, 1985. Pinedo, M. L. Scheduling, volume 29. Springer, 2012. Schrage, L. A proof of the optimality of the shortest remaining processing time discipline. Operations Research, 16 (3):687 690, 1968. Tsung-Chyan, L., Sotskov, Y. N., Sotskova, N. Y., and Werner, F. Optimal makespan scheduling with given bounds of processing times. Mathematical and Computer Modelling, 26(3):67 86, 1997. White, R. W. and Hassan Awadallah, A. Task duration estimation. In Proceedings of the Twelfth ACM International Conference on Web Search and Data Mining, pp. 636 644, 2019. On Preemption and Learning in Stochastic Scheduling A. Benchmark FTPP In this section, for all jobs i [N], we call Pi the job size of job i. Jobs are ordered in increasing order of their expected size (Notation Pi and P i/n i mod n denote the same job). For any algorithm A, we note T A ij for each (i, j) [N]2 the amount of time job i and job j delay each other under algorithm A. A.1. Cost of OPT and FTPP, CR of RR Let us express the expected cost of any algorithm in terms of T A i,j for k [K]: j=i+1 T A i,j Lemma A.1 (Cost of OPT). The cost of OPT is given by E[COPT] = n2 K X λkλℓ λk + λℓ Note that this lemma implies the following inequality, which will be used in other proofs: Proof. We apply Equation (5) with A = OPT. In that case, for any to jobs (i, j) [N], i = j, as the shortest job is scheduled first, we have E[T A ij ] = E[min(Pi, Pj)]. So E[T A ij ] = λkλℓ λk+λℓif job i is of type k and job j is of type ℓ. E[COPT] = E j=i+1 T A i,j λkλℓ λk + λℓ nλℓ+ n(n 1) λkλℓ λk + λℓ λkλℓ λk + λℓ Lemma A.2 (Cost of FTPP). The cost of FTPP is given by: E[CFTPP] = n2 1 2 ℓ=1 (K ℓ)λℓ On Preemption and Learning in Stochastic Scheduling Proof. We apply Equation (5) with A = OPT, so that E[T A ij ] = min(λk, λℓ) if job i is of type k and job j is of type ℓ E[CFTPP] = E j=i+1 T A i,j nλℓ+ n(n 1) 2 λℓ+ n2 K X ℓ=1 (K ℓ)λℓ Lemma A.3 (CR of RR). For any For any λ, the following lower bound holds: E[CRR] E[COPT] 2 4 n + 3. Proof. For RR, any two jobs are run in parallel until one terminates, thus: E[T RR ij ] = 2E[min(Pi, Pj)]. Thus, by equation 5: i=1 E[Pi] + j=1 2E[min(Pi, Pj)]. On the other hand: i=1 E[Pi] + j=1 E[min(Pi, Pj)]. E[CRR] E[COPT] = 2 PN i=1 Pi E[COPT] = 2 n PK ℓ=1 λℓ n2 PK ℓ=1 1 4λℓ+ PK ℓ=1 PK k=ℓ+1 λkλℓ λk+λℓ 4 PK ℓ=1 λℓ 2 n PK ℓ=1 λℓ n2 PK ℓ=1 1 4λℓ + 3n 4 PK ℓ=1 λℓ = 2 4 n + 3. With the second line obtained by Lemma A.1. A.2. CR of FTPP A.2.1. CR WITH K TYPES Proposition A.4 (Upper bound on the CR in function of λ). The CR of FTPP with K types of jobs with n jobs per type satisfies: E[CFTPP] E[COPT] 2 f K(λ) On Preemption and Learning in Stochastic Scheduling where f K(λ) = 2 PK ℓ=1 PK k=ℓ+1 λkλℓ λk+λℓ PK ℓ=1(K ℓ)λℓ PK ℓ=1 1 4 λℓ+PK ℓ=1 PK k=ℓ+1 λkλℓ λk+λℓ Note that instantiating this bound with K = 2 types of jobs, n jobs per type, λ1 = 1 and λ2 = λ > 1, we get Proposition 4.3: E[COPT] 2 4 λ 1 (1 + λ)2 + 4λ. Proof of Proposition A.4. Compute E[COPT] using Lemma A.1, E[CFTPP] using Lemma A.2. The competitive ratio of FTPP is given by: CRFTPP = E[CFTPP] E[COPT] = n2 1 2 PK ℓ=1 λℓ+ PK ℓ=1(K ℓ)λℓ + n( 1 2 PK ℓ=1 λℓ) n2 PK ℓ=1 1 4λℓ+ PK ℓ=1 PK k=ℓ+1 λkλℓ λk+λℓ 4 PK ℓ=1 λℓ) (7) For any values a, b, c, d R4 +, if a > c > 0 and d > b > 0, then a + b Now, we have 1 λkλℓ λk + λℓ 1 ℓ=1 (K ℓ)λℓ. This implies: 1 2 PK ℓ=1 λℓ+ PK ℓ=1(K ℓ)λℓ PK ℓ=1 1 4λℓ+ PK ℓ=1 PK k=ℓ+1 λkλℓ λk+λℓ = 2 2 PK ℓ=1 PK k=ℓ+1 λkλℓ λk+λℓ PK ℓ=1(K ℓ)λℓ PK ℓ=1 1 4λℓ+ PK k=1 PK k=ℓ+1 λkλℓ λk+λℓ | {z } f K(λ) A.2.2. UPPER BOUND ON THE CR FOR PARTICULAR VALUES OF λ Proposition A.5. K > 1, λ, 0 < λ1 λK = 1, CRFTPP(λ) HK 1 2BK 1 4BK + AK with HK = PK k=1 1 k, BK = PK k=1 1 k2 and AK = PK k=1 Pk 1 ℓ=1 1 k2+ℓ2 . Furthermore lim K HK 1 2 BK 1 4 BK+AK = 4 π 1.273 which implies that there exists some value of K for which CRFTPP(λ) 1.274. Proof. A way to prove such a result would be to find the minimum of CRFTPP with respect to λ. But this is difficult. We propose another point λ. λk = 1 (K k + 1)2 . (9) On Preemption and Learning in Stochastic Scheduling We express the competitive ratio using λ: CRFTPP( λ) = PK k=1( 1 2 + K k) λk PK k=1 1 4 λk + PK ℓ=k+1 λk λℓ λk+ λℓ 2 + K k) 1 (K k+1)2 PK k=1 1 4 1 (K k+1)2 + PK ℓ=k+1 1 (K k+1)2+(K ℓ+1)2 = PK k=1( 1 k2 PK k=1 1 4 1 k2 + Pk 1 ℓ=1 1 k2+ℓ2 2BK 1 4BK + AK 1 k2 , and AK = 1 k2 + ℓ2 . This shows the first part of the lemma. The fact that HK 1 2 BK 1 4 BK+AK 4 π follows from Lemma A.6. 2BK 1 4BK + AK = 4 with HK = PK k=1 1 k, BK = PK k=1 1 k2 and AK = PK k=1 Pk 1 ℓ=1 1 k2+ℓ2 Proof. We now focus on the behavior of HK 1 2 BK 1 4 BK+AK as K goes to . We know that for the harmonic number HK = Θ(log(K)), and that for the partial sum of the Basel problem 0 BK P k=1 k 2 = π2/6 = O(1). Let us bound Ak. Using the fact that for y > 0 and x > 0 the function f : (x, y) 7 (x2+y2) 1 is decreasing in x, for (k, ℓ) [K]2 we have Z ℓ+1 1 k2 + t2 dt 1 k2 + ℓ2 Z ℓ 1 (t/k)2 + 1dt 1 k2 + ℓ2 1 1 (t/k)2 + 1dt 1 k (arctan(ℓ+ 1 k ) arctan( ℓ k )) 1 k2 + ℓ2 1 k (arctan( ℓ k ) arctan(ℓ 1 Hence by summing for 1 ℓ< k K: 1 k (arctan(1) arctan(1 1 k (arctan(k 1 k ) arctan(0)), 1 k arctan(k 1 For the right-hand-side we use that arctan is increasing, thus 1 k arctan(k 1 Using that arctan(x) x for x 0, we have On Preemption and Learning in Stochastic Scheduling Combining everything we obtain the following inequality: 2BK π 4 HK + 1 4BK CRFTPP( λ) HK 1 2BK π 4 HK 3 lim K CRFTPP( λ) = 4 A.2.3. THE COST OF FTPP IS LOWER THAN THE COST OF RR Let us order all jobs i [N] in order of their increasing expected size, and denote Pi, the size of job i. An alternative notation to Pi is P i/n i mod n, where the first is used in this proof for convenience. We consider here the most general setting where K = N. E[CFTPP] E[CRR] Proof. The cost of FTPP with K = N and n = 1 is given by i=1 E[Pi] + j=i+1 E T FTPP ij i=1 E[Pi] + j=i+1 min(λi, λj) where T FTPP ij is the amount of time job i and job j delay each other in FTPP which verifies E T FTPP ij = min(λi, λj) Similarly, using T RR ij = 2 min(Pi, Pj) which implies E[T RR ij ] = 2 λiλj λi+λj , we get i=1 E[Pi] + j=i+1 E T RR ij i=1 E[Pi] + j=i+1 2 λiλj Then we write λi + λj = 2 1 λi + 1 λj 2 1 min(λi,λj) + 1 min(λi,λj) min(λi, λj) We conclude that CFTPP CRR. On Preemption and Learning in Stochastic Scheduling A.3. Lower bound: Proof of Proposition 4.1 Let us order all jobs i [N] in order of their increasing expected size, and denote Pi, the size of job i. An alternative notation to Pi is P i/n i mod n, where the first is used in this proof for convenience. We consider here the most general setting where K = N. Any algorithm has a cost: i=1 E[Pi] + j=i+1 E[T A ij ] where T A ij = DA ij + DA ji where DA ij is the amount of time job i delay job j. Lemma A.8. Consider K = N jobs where job i [N] has mean size λi and λ1 λN. Consider any algorithm A and let T A ij the total amount of time spent by A on i or j while both jobs are alive. E[T A ij ] 2E[T OP T ij ] where OPT is the optimal offline algorithm Proof of Lemma A.8. Let us first prove our proposition for any deterministic algorithm A. We denote i(t) amount of time that A allocates to job i after a time t < T A ij is allocated to job i or j. t=0 P(T A ij t)dt t=0 P (Pi i(t)) P (Pj t i(t)) dt t=0 exp i(t) t=0 exp i(t) + t/2 t/2 exp t (i(t) + t/2 t/2) Calling f(t) = exp 1 λi + 1 λj t 2 and g(t) = | 1 λi 1 λj | i(t) t 2 it holds that either t=0 f(t) exp( g(t))dt Z t=0 f(t) exp(g(t))dt Z t=0 f(t)dt. Otherwise, we would have Z 2(exp( g(t)) + exp(g(t)))dt < Z which cannot be true since t, 1 2(exp( t) + exp(t)) 1. Therefore an adversary knowing i(t) can always chose the order of λi and λj such that E[T A ij ] Z + t=0 exp( ( 1 2)dt = 2 λiλj On Preemption and Learning in Stochastic Scheduling The optimal delay is E[T OP T ij ] = E[min(Pi, Pj)] = λiλj λi + λj so our Lemma is proven for any deterministic algorithm A. Consider a randomized algorithm R which can be seen as a probabilistic distribution over the set of deterministic algorithms. Therefore A, i(t) and g(t) are now seen as random variables. By the tower rule, the amount of time job i and j delay each other in R is such that: E[T R ij ] = E[E[T A ij |A]] t=0 f(t) exp(sign(λi λj)g(t))dt] By the same argument as in the deterministic case, it holds that either t=0 f(t) exp( g(t))dt] Z t=0 f(t) exp(g(t))dt] Z Otherwise, we would have 2(exp( g(t)) + exp(g(t)))dt] < Z which implies that there exists a deterministic function g such that Z 2(exp( g(t)) + exp(g(t)))dt < Z which cannot be true as shown in the deterministic case. The rest of the argument is the same as in the deterministic case and therefore omitted. Now we are ready to prove Proposition 4.1. Proof of Proposition 4.1. Take any algorithm A i=1 E[Pi] + j=i+1 E[T A ij ] (10) j=i+1 E[T OP T ij ] (11) where (11) comes from Lemma A.8. Observe that applying RR on the same data would yield an expected completion time: i=1 E[Pi] + 2 j=i+1 E[min(Pi, Pj)] i=1 E[Pi] + 2 j=i+1 E[T OP T ij ] which concludes the proof. On Preemption and Learning in Stochastic Scheduling B. Analysis of Non-Preemptive Learning algorithms B.1. Full Algorithmic Details In this appendix, we present a full description of ETC-U and UCB-U. Algorithm 3 Explore-Then-Commit Uniform (ETC-U)] 1: Input : n 1 (number of jobs of each type), K 2 (number of types) 2: For all pairs of different types k, ℓinitialize δk,ℓ= 0, ˆrk,ℓ= 0 and hk,ℓ= 0 3: For all types k, set mk = 0 4: repeat 5: U is the set of types with at least one remaining job 6: if A is empty then 7: A = {ℓ U, k U, k = ℓ, ˆrk,ℓ δk,ℓ 0.5} 8: end if 9: Select the type ℓwith the lowest number of finished jobs ℓ= argmink A mk and run one job of type ℓyielding a size P ℓ mℓ+1. 10: mℓ= mℓ+ 1 11: for k, ℓin A, k = ℓdo 12: hk,ℓ= Pmin(mk,mℓ) i=1 1{P k i < P ℓ i } 13: δk,ℓ= q log(2n2K4) 2 min(mk,mℓ) 14: ˆrk,ℓ= hk,ℓ min(mk,mℓ) 15: if ˆrk,ℓ δk,ℓ 0.5 or mℓ= n then 16: Remove ℓfrom A 17: end if 18: end for 19: until U is not empty Algorithm 4 Upper-Confidence-Bound-Uniform (UCB-U) 1: Input : n 1 (number of jobs of each type), K 2 (number of types) 2: For all types k [K], set mk = 0 3: Set U = [K] 4: For all types k [K], compute the lower bound λmk k using Equation (16) 5: repeat 6: Select k = argmink U λmk k 7: Set mk = mk + 1 8: Compute a job of type k until completion and record its size P mk k 9: Update the lower bound λmk k using again Equation (16) 10: If mk = n, remove k from U 11: until U is empty B.2. Cost Decomposition In this appendix, we analyze the non-preemptive learning algorithms presented in our paper - ETC-U and UCB-U. We start by presenting a general cost decomposition result that relates the cost of any non-preemptive algorithm to the one of FTPP. We will use this result to derive the bounds of both our suggested algorithms. Lemma B.1 (Cost of non-preemptive algorithms). Any non-preemptive algorithm A has a cost E[CA] = E[CFTPP] + (i,j) [n]2 (λk λℓ)E 1 P k i computed before P ℓ j On Preemption and Learning in Stochastic Scheduling Proof. Denote Pi, the size of the job i, and T A ij = DA ij + DA ji, where DA ij is the amount of time a job i delays job j. For any algorithm, we have: b=a+1 T A ab For non-preemptive algorithms, T A ab = Pa if job a is scheduled before b and Pb otherwise so that we can write b=a+1 (Pa1{Pa computed before Pb} + Pb1{Pb computed before Pa}) Now assume w.l.o.g. that (Pa)a [N] are in the order chosen by FTPP, i.e., Pa is the ath executed task by FTPP and if a b then E[Pa] E[Pb]. Under this convention, we get: and recalling that 1{Pa computed before Pb} = 1 1{Pb computed before Pa} CA = CFTPP + b=a+1 (Pb Pa)1{Pb computed before Pa} Reindexing the job without changing the order, where P k i is now the i-th job of type k, we have: CA = CFTPP + i=1 (P k i P ℓ j )1 P k i computed before P ℓ j Taking the expectation finishes the proof. B.3. Upper bound for ETC-U Proposition B.2. The following upper bounds hold: E[CETC-U] E[CFTPP] + 1 n E[COPT] + X 2(k 1)(2K k) + (K k)2 λkn p 8n log(2n2K3). E[CETC-U] E[CFTPP] + 1 n E[COPT] + X ℓ=1 (K ℓ)(λk + λℓ)2 (λk λℓ) n8 log(2n2K3). We start with the following technical lemma, isolated to be reused in other proofs. Pick some α N. Let (X1 i )i [αn] and (X2 i )i [αn] be independent exponential variables of parameters λ1 and λ2 respectively. Define for any m [αn]: i=1 1X1 i δ(m,n) 1 A union bound over the K(K 1) 2 possible pairs gives the following bound: E 1 2n K . (12) With the help of Lemma B.1, the cost of ETC-U can be decomposed using the event E as follows: E[CETC-U] = E[CFTPP] + (i,j) [n]2 (λk λℓ)E 1 P k i computed before P ℓ j (Lemma B.1) (i,j) [n]2 (λk λℓ)E 1 P k i computed before P ℓ j |E (i,j) [n]2 (λk λℓ)E 1 P k i computed before P ℓ j |E P | {z } (ii) On Preemption and Learning in Stochastic Scheduling Bounding (ii). Recall that by assumption, if k ℓ, then λk λℓ. Therefore, we have that (i,j) [n]2 (λk λℓ) | {z } 0 1 P k i computed before P ℓ j (i,j) [n]2 (λk λℓ) k=ℓ+1 (λk λℓ) k=1 (k 1)λk ℓ=1 (K ℓ)λℓ E (Equation (6)) n E[COPT], (14) where the last inequality is by Equation (12). Bounding (i). Consider any couple (k, ℓ) [K]2 s.t. ℓ k. Let m ℓ,k be the number of comparisons performed between jobs of type ℓand k before the algorithm detects that λℓ λk. A first obvious upper bound is m ℓ,k n. A second upper is obtained by noting that m ℓ,k is smaller than any m s.t. δ(m ,n) < 1 λk λk + λℓ 0.5 . For this value of δ(m ,n), the event E ensures that if λk λℓ, then ˆrm ℓ,k δ(m ,n) Under E E[rm ℓ,k] 2δ(m ,n) > λk λk + λℓ λk λk + λℓ 1 and type k would be eliminated. This implies the following upper bound on m ℓ,k: n, 8 λk + λℓ 2 log 2n2K3 ! On the other hand, notice that under the good event E, a type ℓwill never be eliminated due to a type k of greater expected duration λk λℓ, since ˆrm k,ℓ δ(m ,n) Under E E[rm ℓ,k] + δ(m ,n) δ(m ,n) = λℓ λk + λℓ 1 We decompose the run of the algorithm into (up to K) phases. For each ℓ [K], we call phase ℓthe iterations at which jobs of type ℓare the jobs with the smallest mean still not terminated. Note that during phase ℓ, job type ℓis always active, as the contrary would mean event E does not hold. This implies that the number of jobs of any type k > ℓcomputed during phase ℓis lower than m ℓ,k. We have the following bound: On Preemption and Learning in Stochastic Scheduling (i,j) [n]2 (λk λℓ)E 1 P k i computed before P ℓ j |E (i,j) [n]2 (λk λℓ)E 1 P k i computed before phase ℓ+ 1 |E i [n] n(λk λℓ)E o=1 1 P k i computed during phase o |E o=1 E[m o,k|E]n(λk λℓ) o=1 E[m o,k|E]n(λk λo) l=o E[m o,k|E]n(λk λo) o=1 E[m o,k|E]n(k o)(λk λo) ℓ=1 E[m ℓ,k|E]n(K ℓ)(λk λℓ) n, 8 λk + λℓ 2 log 2n2K3 ! n(K ℓ)(λk λℓ). (1) is since by the beginning of phase ℓ+ 1, all jobs of type ℓwere completed. (2) is since during phase o, the oth type was not eliminated, so there cannot be more than m o,k jobs of type k in this phase. In (3), we changed the summation order and in (4), we replaced o ℓ. Finally, (5) is due to the bound of Equation (15), which holds under E. Next, for any λk λℓ, we have: (λk λℓ) min n, 8 λk + λℓ 2 log 2n2K3 ! (λk + λℓ) p 8n log(2n2K3), since min {a, b} ab for any a, b 0. This implies that ℓ=1 n(K ℓ)(λk + λℓ) p 8n log(2n2K3) 2(k 1)(2K k) + (K k)2 λkn p 8n log(2n2K3). Substituting this and the bound of Equation (14) into the decomposition of Equation (13) gives the first bound of the proposition. The second bound is obtained by upper bounding: (λk λℓ) min n, 8 λk + λℓ 2 log 2n2K3 ! 8(λk + λℓ)2 λk λℓ log 2n2K3 , On Preemption and Learning in Stochastic Scheduling B.4. Upper bound for UCB-U Proposition B.4. The expected cost of UCB-U is upper bounded by: E[CUCB-U] E[CFTPP] + n(K 1) p 3n ln (2n2K2) E[CUCB-U] E[CFTPP] + λk λℓ 3n ln 2n2K2 + 2 Concentration of exponential distribution If X E(λ), then 2 λX E(2) = χ2 2 (χ2 with 2 degrees of freedom). It follows that if i [m], Xi E(λ), then 2 λ Pm i=1 Xi χ2 2m. Denote χ2 2m(α) the α-th percentile, we have with probability 1 δ that 2 P i Xi χ2 2m(1 δ/2) λ 2 P i Xi χ2 2m(δ/2) Setting δ = 1 n2K2 , we get the following formula for a lower bound: λm k = 2 Pm i=1 Xk i χ2 2m(1 1 2n2K2 ) (16) and another formula for the upper bound λ m k = 2 Pm i=1 Xk i χ2 2m( 1 2n2K2 ) If a job k is wrongly scheduled before a job of type ℓ, then the decision rule is misleading meaning that: λmk k = 2 Pnk i=1 Xk i χ2 2nk(1 1 2n2K2 ) < 2 Pnℓ i=1 Xℓ i χ2 2nℓ(1 1 2n2K2 ) = λmℓ ℓ even though λℓ< λk. Bounding the cost From Lemma B.1, the cost of any non preemptive algorithm A writes E[CA] = E[CFTPP] + (i,j) [n]2 (λk λℓ)E 1 P k i computed before P ℓ j (17) Let us then introduce the GOOD event which is: E = { i [n], k [K], λi k λk λ i k} With a union bound, it is easy to show that E holds with probability 1 1 n K and that the contradictory event E happens with probability 1 n K . Using the same method as in the proof of Proposition B.2 (the decomposition using E and E as done in Equation (13) and the derivation of Equation (14)), we can upper bound the cost of UCB-U as: E[CUCB U] E[CFTPP] + (i,j) [n]2 (λk λℓ)E 1 P k i computed before P ℓ j |E + E[COP T ] 4KP(E) | {z } 4/n On Preemption and Learning in Stochastic Scheduling Furthermore, P k i computed before P ℓ j implies that λi k < λj ℓand therefore (i,j) [n]2 (λk λℓ)P(λi k < λj ℓ|E) Under E, we have λj ℓ λℓ. Moreover, it holds that λ i k λk, and by the definition of λi k, and λ i k, λi k = χ2 2i( 1 2n2K2 ) χ2 2i(1 1 2n2K2 )λ i k χ2 2i( 1 2n2K2 ) χ2 2i(1 1 2n2K2 )λk. Combined, under E we can bound n λi k < λj ℓ o λk χ2 2i( 1 2n2K2 ) χ2 2i(1 1 2n2K2 ) < λℓ and therefore write (i,j) [n]2 1 λk χ2 2i( 1 2n2K2 ) χ2 2i(1 1 2n2K2 ) < λℓ k=ℓ+1 n max i [n], λk χ2 2i( 1 2n2K2 ) χ2 2i(1 1 2n2K2 ) < λℓ So finally we have E[CUCB-U] E[CFTPP] + k=ℓ+1 n max i [n], λk χ2 2i( 1 2n2K2 ) χ2 2i(1 1 2n2K2 ) < λℓ (λk λℓ) + 2 n E[COPT] (19) Bounding the ratio. We now focus on bounding the maximum term in Equation (19). By Lemma 1 of (Laurent & Massart, 2000), if U χ2 D, then Dx + 2x exp( x), and P U D 2 Dx exp( x), (20) and in particular, χ2 D(δ) D 2 δ , and χ2 D(1 δ) D + 2 Thus, for any i [n], a necessary condition to the inequality λk χ2 2i( 1 2n2K2 ) χ2 2i(1 1 2n2K2 ) < λℓis 2i ln (2n2K2) 2i ln (2n2K2) + 2 ln (2n2K2) < λℓ 2 ln (2n2K2) 1 + λℓ λk ln 2n2K2 < 0 (λk λℓ) i p 2 ln (2n2K2) (λk + λℓ) i λℓln 2n2K2 < 0 2 ln (2n2K2) (λk + λℓ) + q 2 ln (2n2K2) (λk + λℓ)2 + 4 ln (2n2K2) λℓ(λk λℓ) 2 ln (2n2K2) (λk + λℓ) + q (λk + λℓ)2 + 2λℓ(λk λℓ) Now, using the fact that 2λℓ λℓ+ λk, we get the simplified bound 2 ln (2n2K2)(λk + λℓ) + p 2λk (λk + λℓ) 2 (λk λℓ) p 2 ln (2n2K2) 1 + 2 λk + λℓ 2 (λk λℓ) , (22) On Preemption and Learning in Stochastic Scheduling or i 3 ln 2n2K2 λk+λℓ λk λℓ 2 . Since we also know that i [n], we can write max i [n], λk χ2 2i( 1 2n2K2 ) χ2 2i(1 1 2n2K2 ) < λℓ 3 ln 2n2K2 λk + λℓ 3n ln (2n2K2)λk + λℓ where the second inequality is since min {a, b} ab for a, b > 0. Substituting back into Equation (19), we get the first bound in the proposition: E[CUCB-U] E[CFTPP] + 3n ln (2n2K2)λk + λℓ λk λℓ (λk λℓ) + 2 = E[CFTPP] + n p 3n ln (2n2K2) k=ℓ+1 (λk + λℓ) + 2 = E[CFTPP] + n p 3n ln (2n2K2) k=1 (k 1)λk + ℓ=1 (K ℓ)λℓ = E[CFTPP] + n(K 1) p 3n ln (2n2K2) The second bound is obtained through the upper bound: (λk λℓ) min 3 ln 2n2K2 λk + λℓ 3 ln 2n2K2 (λk + λℓ)2 B.5. Lower bounds for Non-Preemptive Algorithms B.5.1. SMALL DIFFERENCES (PROPOSITION 5.4) Proof of Proposition 5.4. Assume K = 2 and take any non-preemptive algorithm A. Call P 1 i the i-th job of type 1 and P 2 j the j-th job of type 2. According to Lemma B.1, A has a cost E[CA] = E[CFTPP] + (λ2 λ1)E (i,j) [n]2 1 P 2 j computed before P 1 i if λ2 > λ1 (the role of λ2 and λ1 are reversed if λ2 < λ1). We then follow the same approach as in chapter 15 in (Lattimore & Szepesvári, 2020). Consider situation 1 where λ1 = a, λ2 = b and situation 2 where λ1 = b and λ2 = a with a < b and assumes that the adversary chooses the situation based on the algorithm A. Intuitively, if A tends to complete more of jobs of type 1 before jobs of type 2, the adversary will decide that λ1 > λ2 (situation 2) otherwise, it will choose λ2 > λ1 (situation 1). Call Pν1 the joint probability over the scheduling decisions and job sizes in situation 1 following the policy prescribed by algorithm A and Pν2 the same probability in situation 2. Call Pat(xt) the probability that the job of type at chosen at time t is of size xt. Calling KL the KL divergence, we have following the Lemma 15.1 in (Lattimore & Szepesvári, 2020): KL(Pν1, Pν2) = n(KL(Xa, Xb) + KL(Xb, Xa)) where Xa is an exponential random variable of expectation a and Xb is an exponential random variable of expectation b. Note right away that KL(Xa, Xb) = a b ) (e.g., Calin & Udri ste, 2014, Example 4.2.1), therefore KL(Xa, Xb)+ KL(Xb, Xa) = a KL(Pν1, Pν2) n(b a)2 The cost of algorithm A in situation 1 is lower bounded as: Eν1[CA] Eν1[CFTPP] + (b a)Eν1 (i,j) [n]2 1 P 2 j computed before P 1 i n2/2 On Preemption and Learning in Stochastic Scheduling The cost of algorithm A in situation 2 is lower bounded as: Eν1[CA] Eν2[CFTPP] + (b a)Eν2 (i,j) [n]2 1 P 1 i computed before P 2 j > n2/2 Introduce the event E = 1 n P (i,j) [n]2 1 P 2 j computed before P 1 i n2/2 o , we have that E[CA] = max ν {ν1,ν2} Eν[CA] Eν1[CA] + Eν2[CA] Eν1[CFTPP] + Eν2[CFTPP] 2 + (b a)n2/2Pν1(E) + Pν2(E) First, let us notice that E[CFTPP] = Eν1[CFTPP] + Eν2[CFTPP] Then, using Bretagnolle Huber inequality (Th 14.2 in (Lattimore & Szepesvári, 2020)), we get Pν1(E) + Pν2(E) 1 2 exp( KL(Pν1, Pν2)) and since KL(Pν1, Pν2) n (λ2 λ1)2 λ1λ2 , we have E[CA] E[CFTPP] + (b a)n2/2exp( n (b a)2 At this stage, we can rewrite the equation assuming λ2 λ1 and so that we get E[CA] E[CFTPP] + (λ2 λ1)n2 exp( n (λ2 λ1)2 which proves the first result of the proposition. In particular, taking λ2 λ1 1 + 1 n gives its second result E[CA] E[CFTPP] λ1n n exp( n 1/n (1+1/ n)2 ) (λ1 + λ2)n ne 1/4 B.5.2. LARGE DIFFERENCES Proposition B.5. For any non-preemptive algorithm, there exists a problem instance with expected type durations of λ1 λ2 λK such that E[CA] E[CFTPP] + n k=1 (2k K 1)λk. In particular, for K = 2 and λ2 3λ1, it holds that E[CA] E[CFTPP] + n 4 (λ1 + λ2), On Preemption and Learning in Stochastic Scheduling Let pk be the probability that a non-preemptive algorithm completes a job of type k at its first round. Notice that this distribution cannot depend on the expected duration of any of the types, since no data was gathered. Thus, types can be arbitrarily ordered without affecting this distribution. In particular, we assume w.l.o.g. that p1 p2 . . . p K and choose a problem instance where job types are ordered in an increasing duration λ1 λ2 . . . λK. Then, the expected cost of the algorithm can be bounded according to Lemma B.1, by E[CA] = E[CFTPP] + (i,j) [n]2 (λk λℓ)E 1 P k i computed before P ℓ j j [n] (λk λℓ)E 1 P k 1 computed before P ℓ j E[CFTPP] + n k=ℓ+1 (λk λℓ)E 1 P k 1 was the first job = E[CFTPP] + n k=ℓ+1 (λk λℓ)pk = E[CFTPP] + n ℓ=1 (λk λℓ). Now, one can be easily convinced that between all probability vectors with non-decreasing components, this bound is minimized by the uniform distribution pk = 1/K. To see this, observe that the sum Pk 1 ℓ=1 (λk λℓ) increases with k. Therefore, if p is non-uniform, then p K > 1/K, and there would exist a coordinate k < K to which we could move weight from p K, which would decrease the bound. Substituting pk = 1/K we then get E[CA] E[CFTPP] + n ℓ=1 (λk λℓ) = E[CFTPP] + n k=1 (2k K 1)λk. In particular, if K = 2, we get E[CA] E[CFTPP] + n and, for λ2 3λ1, we have λ2 λ1 λ1+λ2 E[CA] E[CFTPP] + n 4 (λ1 + λ2). On Preemption and Learning in Stochastic Scheduling C. Analysis of Preemptive Learning algorithms C.1. Full Algorithmic Details In this appendix, we present a full description of ETC-RR and UCB-RR. Algorithm 5 Explore-Then-Commit-Round-Robin (ETC-RR) 1: Input : n 1 (number of jobs of each type), K 2 (number of types) 2: For all pairs of different types k, ℓinitialize δk,ℓ= 0, ˆrk,ℓ= 0 and hk,ℓ= 0 3: For all types k, set ck = 0 4: repeat 5: U is the set of types with at least one remaining job 6: if A is empty then 7: A = {ℓ U, k U, k = ℓ, ˆrk,ℓ δk,ℓ 0.5} 8: end if 9: Run jobs (P k ck+1)k A in parallel until a job finishes and denote ℓthe type of this job 10: cℓ= cℓ+ 1 11: for k A, k = ℓdo 12: βℓ,k = βℓ,k + 1 13: δℓ,k = δk,ℓ= q log(2n2K4) 2(βℓ,k+βk,ℓ) 14: ˆrℓ,k = βℓ,k βk,ℓ+βℓ,k 15: ˆrk,ℓ= βk,ℓ βk,ℓ+βℓ,k 16: if ˆrℓ,k δℓ,k 0.5 then 17: Remove k from A 18: end if 19: if ˆrk,ℓ δk,ℓ 0.5 or cℓ= n then 20: Remove ℓfrom A 21: end if 22: end for 23: until U is empty Algorithm 6 Upper-Confidence-Bound-Round-Robin (UCB-RR) 1: Input : n 1 (number of jobs of each type), K 2 (number of types), discretization step 2: repeat 3: U is the set of types with at least one remaining job 4: Calculate type indices uk for all jobs k U according to Equation (29) 5: Choose type ℓ argmaxℓ U uℓ 6: Run a job of type ℓfor time units 7: until U is empty C.2. Cost Decomposition We start with a cost decomposition, which relates the performance of preemptive algorithms to the one of FTPP. We limit ourselves to the natural family of preemptive algorithms that do not simultaneously run two tasks of the same type and is formally defined as follows. Definition C.1. Denote bℓ i and eℓ i the beginning and end dates of the computation of the ith job of type ℓ. A type-wise non-preemptive algorithm is an algorithm that computes jobs of the same type one after another, i.e., i [n], k [K], eℓ i bℓ i+1. This property is very natural, as for exponential durations without knowledge of the real execution times, there is no advantage in simultaneously running two tasks of the same type. Specifically, all of our suggested algorithms fall under this definition. On Preemption and Learning in Stochastic Scheduling For such algorithms, the cumulative cost can be compared to FTPP using the following lemma. Lemma C.2 (Cost of type-wise non-preemptive algorithms). Any type-wise non-preemptive algorithm A has the following upper bound on its cost: E[CA] E[CFTPP] + (i,j) [n]2 (λk λℓ)E 1 ek j < bℓ i + (K 1)n Proof. Recall that if DA ij is the amount of time a job i delays job j when running algorithm A, then we can write the cost of algorithm A as DA ij + DA ji . Moreover, if bi,ei are the start (end) time of job i, it always holds that DA ij Pi1{bi < ej}. Using this inequality and dividing the summation into types, we get E P ℓ i 1 bℓ i < ek j + E P k j 1 bk j < eℓ i i=1 E[P ℓ i ] + j=i+1 E[P ℓ i 1 bℓ i < eℓ j ] + E[P ℓ j 1 bℓ j < eℓ i ] Since jobs are independent, the expected duration of a job of type ℓis λℓ, independently of its start time. Also, as the algorithm is type-wise non-preemptive, for all ℓ [K], j > i, we have 1 bℓ j < eℓ i] = 0 and 1 bℓ i eℓ j] = 1. Thus, λℓE 1 bℓ i < ek j + λk E 1 bk j < eℓ i ℓ=1 λℓ n(n + 1) λℓE 1 bℓ i < ek j + λk E 1 bk j < eℓ i We can now decompose the event that job i of type ℓstarted before job j of type k finished: 1 bℓ i < ek j = 1 bℓ i < ek j eℓ i + 1 eℓ i < ek j = 1 bℓ i < ek j eℓ i + 1 eℓ i < bk j + 1 bk j eℓ i < ek j eℓ i < bk j is the event that job i of type ℓwas fully computed before job j of type k started. bℓ i < ek j eℓ i is the event that job i of type ℓwas running when job j of type k finished, and reciprocally for bk j eℓ i < ek j . This gives the following On Preemption and Learning in Stochastic Scheduling decomposition: λℓE 1 eℓ i < bk j + λk E 1 ek j < bℓ i | {z } (ii) (i,j) [n]2 λℓE 1 bℓ i < ek j eℓ i + λk1 bk j eℓ i < ek j | {z } (iii) (i,j) [n]2 λk E 1 bk j < eℓ i ek j + λk1 bℓ i ek j < eℓ i | {z } (iv) Since the algorithm is type-wise non-preemptive, a single job of each type may run at a given time. This implies that any job of type ℓcannot be in a middle of two different jobs of type k, namely (ℓ, k) [K]2, ℓ = k, (i, j) [n]2, j=1 1 bk j eℓ i < ek j 1. The same conclusion similarly holds for all other sums in terms (iii) and (iv), and therefore implies the following bound: (iii) + (iv) n k=ℓ+1 (λℓ+ λk) = (K 1)n We also have 1 ek j < bℓ i 1 1 bℓ i < ek j , thus λℓ+ (λk λℓ)E 1 ek j < bℓ i k=ℓ+1 (K ℓ)λℓ+ (i,j) [n]2 (λk λℓ)E 1 ek j < bℓ i . Combining everything into Equation (23), we get: ℓ=1 λℓ n(n + 1) k=ℓ+1 (K ℓ)λℓ (i,j) [n]2 (λk λℓ)E 1 ek j < bℓ i + (K 1)n = E[CFTPP] + (i,j) [n]2 (λk λℓ)E 1 ek j < bℓ i + (K 1)n where the last equality is by Lemma A.2. C.3. Upper bound for ETC-RR Proposition C.3. The following bound holds: E[CETC-RR] E[CFTPP] + 12K n E[COP T ] + 4n p n log(2n2K3) ℓ=1 (K ℓ)2λℓ. (24) On Preemption and Learning in Stochastic Scheduling Good Event. For any couple (k, ℓ), if at some iteration βk,ℓ+ βℓ,k = m, we define the more precise notation for ˆrℓ,k at that iteration as ˆrm ℓ,k. Notice that m represents the number of times that jobs of either type were completed while both were active. Therefore, m can have 2n different values, where the extreme case m = 2n 1 is, for example, when n 1 jobs of type k are first completed and only then all n jobs of type ℓare completed. Let us start by showing that the estimators ˆrℓ,k well-concentrate around their expectations. Exponential random variables are memory-less, i.e., if Xi Exp(λi), the law of Xi conditionally on it being larger than a constant is unchanged. In particular, resetting (replacing by an independent copy) an exponential random variable at any time that precedes its activation does not affect its distribution. We employ reset when one of the types is completed, or when either of the types is removed from A (and then we discard the comparison between types k, ℓ), and say that way a comparison is triggered, it is taken from an i.i.d. sequence of comparisons. Specifically, given a deterministic number of comparisons m between types k, ℓ, we write ˆrm ℓ,k L= 1 i=1 1{Xℓ i < Xk i }, with (Xℓ i )i [m] and (Xk i )i [m] independent exponential variables of parameters λℓand λk respectively. Finally, E be the event that all comparisons between k, ℓare well-concentrated, namely, E = (k, ℓ) [K]2, m [2n], ˆrm ℓ,k λk λℓ+ λk with δ(m,n) = q 2m , and recall that for any m [2n], E[ˆrm ℓ,k] = λk λℓ+λk . By Lemma B.3 applied with α = 2, and a union bound over the K(K 1) 2 possible pairs, we have: P(E) 1 n K . (25) Notice that under E a type ℓwill never be eliminated by a type k with λk λℓ, since ˆrk,ℓ δℓ,k < λℓ λk + λℓ + δℓ,k δℓ,k = λℓ λk + λℓ 1 so if k is the minimal type in A, then under the good event, it will never be eliminated. Moreover, type k can only be compared to a type ℓwith λℓ λk at most mmax ℓ,k min 2n, 8 λk + λℓ 2 log(2n2K3) := m ℓ,k (26) since clearly m 2n and if m 8 λk+λℓ λk λℓ 2 log(2n2K3), then δℓ,k 1 4 λk λℓ λk+λℓand ˆrℓ,k δℓ,k > λk λk + λℓ δℓ,k δℓ,k λk λk + λℓ 2 1 4 λk λℓ λk + λℓ = 1 Cost Analysis. Assume that the active type set A can only change at discrete times t {0, , 2 , . . . } T for some > 0. We will later take the limit 0, which coincides with the following Algorithm 5. In the following, we denote by A(t) the active type set at time interval [t, t + ), and U(t) incomplete type set at [t, t + ). Also, let bℓ i T be the start time of the ith job of the ℓth type and eℓ i T be its end-time (w.l.o.g., if a task ended at a time t / T , we delay its ending to t Starting from the cost decomposition of Lemma C.2, we have E[CA] E[CFTPP] + (K 1)n (i,j) [n]2 (λk λℓ)E 1 ek j < bℓ i E[CFTPP] + (K 1)n k=ℓ+1 n(λk λℓ) j=1 E 1 ek j < bℓ n On Preemption and Learning in Stochastic Scheduling We focus our attention on bounding term ( ). For any t T and every o [K], define the events Fo(t) = {o A(t), p < o : p / A(t)} , Fo(t) = { p o : p / A(t)} These events capture the notion of phases, namely, when Fo is active, the o is the type of the smallest mean that has not been finished. Then, we can write j=1 E 1 ek j < bℓ n t T E 1 ek j = t + , ek j < bℓ n t T E 1 ek j = t + , ℓ U(t) t T E 1 ek j = t + , Fo(t) + t T E 1 ek j = t + , ℓ U(t), Fℓ(t) t T E 1 ek j = t + , Fo(t) + t T E 1 ek j = t + , E t T E 1 ek j = t + , Fo(t) (1) is true since the event ek j = t + , ek j < bℓ n implies that bℓ n > t, so type ℓwas not completed at time t. (2) holds since under E, the type of the minimal duration expected is never eliminated, so if ℓ U(t), it is impossible that p / A(t) for all p ℓ(either there is a type p < ℓ, p U(t) that should not have been eliminated, or type ℓshould not have been eliminated since it is still incomplete). Finally, (3) is since every job can only end at one time point. We now further continue to bound term (i). To do so, observe that if ek j = t + , then k A(t) and the task j of this type was completed at the interval [t, t + ). Moreover, since the job durations are exponential, the completion of any job in A(t) at interval [t, t + ) is independent of the events that occurred until time t. Taking into consideration that only one job of any type can run in every interval, the two following equalities hold: E 1 ek j = t + , Fo(t) = (1 exp( /λk)) E [1{Fo(t), k A(t)}] E 1 ek j = t + or eo j = t + , Fo(t), k A(t) = (1 exp( /λk /λo)) E [1{Fo(t), k A(t)}] . Thus, (i) can be written as (1 exp( /λk)) (1 exp( /λk /λo)) t T E 1 ek j = t + or eo j = t + , Fo(t), k A(t) (1 exp( /λk)) (1 exp( /λk /λo))E t T 1 ek j = t + or eo j = t + , Fo(t), k A(t), E | {z } (ii) t T E 1 ek j = t + or eo j = t + , Fo(t), E | {z } (iii) On Preemption and Learning in Stochastic Scheduling The bad-event term can be bounded by t T 1 ek j = t + or eo j = t + , Fo(t) (ℓ+ 1)n P(E). For the second term, recall that the ℓth phase, in which type ℓis the smallest one that was not completed, is represented by the event Fℓ(t). Furthermore, given the good event, the smallest type is never eliminated. so once Fℓ(t) becomes active, it would end only when all jobs of type ℓare completed. In other words, the time indices in which Fℓ(t) hold form a (possibly empty) interval I (ℓ), which represents the ℓth phase. Thus, term (ii) counts the expected number of times that jobs of either type k or o could finish in this interval while type k is still active. Now, we take the limit 0 (and denoting the limit interval I (ℓ) I(ℓ)). Notice that the limit and expectation are interchangeable by the bounded convergence theorem, as the number of times a job of either type k or o can be completed is bounded by 2n. λo λk + λo E j=1 1 ek j I(o) or eo j I(o), k A(t), E λo λk + λo m o,k. The inequality holds since under the good event, at any interval where both types k and o with λo λk are active, there can be at most m o,k comparisons. Substituting (ii) and (iii) back into (i), we get λo λk + λo m o,k + (ℓ+ 1)n P(E), and yet again, substituting this, through ( ), back into Equation (27), yields E[CA] E[CFTPP] k=ℓ+1 n(λk λℓ) λo λk + λo m o,k + (ℓ+ 2)n P(E) ℓ=1 λℓ+ 2Kn2P(E) k=ℓ+1 (λk λℓ) k=ℓ+1 n(λk λℓ) λo λk + λo min 2n, 8 λk + λo 2 log(2n2K3) (Equation (26)) ℓ=1 λℓ+ 2K2n2P(E) o=1 n(λk λℓ) λo λk + λo 4 p n log(2n2K3)λk + λo λk λo (min {a, b} ab, a, b 0) E[COP T ] (Equation (25) and (6)) n log(2n2K3) (λo λℓ) n E[COP T ] + 4n p n log(2n2K3) o=1 (K ℓ)λo n E[COP T ] + 4n p n log(2n2K3) ℓ=1 (K ℓ)2λℓ On Preemption and Learning in Stochastic Scheduling C.4. Analysis of UCB-RR Proposition C.4. The following bound holds for any λ1 4 and n max(20, 10 ln(K)) : E[CUCB-RR] E[CFTPP] + 12K n E[COP T ] + 6n p 2n log(2n2K2) ℓ=1 (K ℓ)λℓ. (28) Assume discretization of the time to units of , as was done in the analysis of ETC-RR. Specifically, assume that the active job only changes at times t {0, , 2 , . . . } T for some > 0. We then denote the index of the discretization step by h = t + 1 {1, 2, . . . }. For each job type ℓ [K], we introduce Tℓ(h), the number of times job type ℓhas been chosen up to iteration h. Due to the fact that job durations are exponential, their increments are independent, and increments of length of jobs of type ℓhave a termination probability of µℓ= 1 e λk . Leveraging this, let (xs ℓ)s 1 be sequences of i.i.d Bernoulli random variables of mean µℓ,. We then fix our probability space for the analysis s.t. when choosing a job of type ℓfor the sth time, it is terminated if xs ℓ= 1. Notice that while we allow the sequence (xs ℓ)s 1 to have more than n job terminations, it is of no consequence of the analysis, as the algorithm will never choose a job type after its nth job was terminated. Next, define the empirical means after running m discretized intervals of type-ℓjobs as ˆµℓ(m) := 1 m Pm s=1 xs ℓ, and the index at iteration h as uℓ(h) = max µ [0, 1] : d (ˆµℓ(Tℓ(t 1)), µ) log n2K2 Starting from the cost decomposition of Lemma C.2, we have E[CA] E[CFTPP] + (K 1)n (i,j) [n]2 (λk λℓ)E 1 ek j < bℓ i E[CFTPP] + (K 1)n k=ℓ+1 n(λk λℓ) j=1 E 1 ek j < eℓ n Denote a(h) [K], the type of job chosen at iteration h, and let εℓ,k > 0 be some constant that will be determined later in the proof. Notice that if ek j < eℓ n, then there must be an iteration where type ℓwas not completed and the jth job of type k were played and completed: k=ℓ+1 n(λk λℓ)E h 1 n a(h) = k, ℓ U(h), x Tk(h) k = 1 oi k=ℓ+1 n(λk λℓ)E h 1 n a(h) = k, ℓ U(h), uℓ(h) uk(h), x Tk(h) k = 1 oi k=ℓ+1 n(λk λℓ)E h 1 n a(h) = k, ℓ U(h), uk(h) uℓ(h) µℓ εℓ,k, x Tk(h) k = 1 oi k=ℓ+1 n(λk λℓ)E h 1 n a(h) = k, ℓ U(h), uℓ(h) µℓ εℓ,k, x Tk(h) k = 1 oi k=ℓ+1 n(λk λℓ)(1 e λk )E [1{a(h) = k, uk(h) µℓ εℓ}] k=ℓ+1 n2(λk λℓ)P ( h s.t. uℓ(h) µℓ εℓ) | {z } (ii) On Preemption and Learning in Stochastic Scheduling In (1), we get the first line by the memoryless property of exponential random variables, noting that all the events inside the indicator are determined before the beginning of the hth iteration. The second line of this relation uses the fact that all tasks will eventually be completed, so P h=1 E h 1 n a(h) = k, x Tk(h) k = 1 oi = n. Bounding term (i). We now bound the first term of the decomposition in Equation (31). k=ℓ+1 n(λk λℓ)(1 e λk )E [1{a(h) = k, uk(h) µℓ εℓ}] k=ℓ+1 n(λk λℓ)(1 e h=1 E [1{a(h) = k, uk(h) µℓ εℓ}] . Denoting d(p, q) = d(p, q)1{p q}, we have {µℓ εℓ,k uk(h)} = {ˆµk(Tk(h)) µℓ εℓ,k and d(ˆµk(Tk(h)), µℓ εℓ,k) log n2K2 Tk(h) } or {ˆµk(Tk(h)) µℓ εℓ,k}, which is equivalent to d(ˆµk(Tk(h)), µℓ εℓ,k) log(n2K2) . Thus, we can bound k=ℓ+1 n(λk λℓ)(1 e a(h) = k, d(ˆµk(Tk(h)), µℓ εℓ,k) log n2K2 k=ℓ+1 n(λk λℓ)(1 e d(ˆµk(s), µℓ εℓ,k) log n2K2 where the second inequality is since Tk(h) increases by 1 every time that a(h) = k. This can be naturally bounded using the following lemma. Lemma C.5. Let X1, X2, . . . be a sequence of Bernoulli independent random variables with mean µ, and let ˆµs = 1 s Ps t=1 Xt be the sample mean. Further, let a > 0, µ > µ and define κ = P s=1 1 d(ˆµs, µ ) a E[κ] inf ε (0,µ µ) a d(µ + ε, µ ) + 1 d (µ + ε, µ) Proof. The proof closely follows the one of (Lattimore & Szepesvári, 2020, Lemma 10.8). For completeness, we now state the well-known Chernoff bound. Lemma C.6 (Chernoff s bound, e.g., Lattimore & Szepesvári (2020), Lemma 10.3). Let X1, X2, . . . , XT be a sequence of Bernoulli independent random variables with mean µ, and let ˆµ = 1 T PT t=1 Xt be the sample mean. Then, for a 0: P(d(ˆµ, µ) a, ˆµ µ) exp( Ta). On Preemption and Learning in Stochastic Scheduling Let ϵ (0, µ µ) and u = a d(µ+ε,µ ). Then, it holds that s=1 P n d(ˆµs, µ ) a s=1 P n ˆµs µ or d(ˆµs, µ ) a s=1 P n ˆµs µ + ε or d(ˆµs, µ ) a o (µ > µ + ϵ) s=1 P n ˆµs µ + ε or d(µ + ε, µ ) a o (d( , µ ) is decreasing in [0, µ ]) s=1 P {ˆµs µ + ε} s=1 exp ( sd(µ + ϵ, µ)) (Chernoff s bound) a d(µ + ε, µ ) + 1 d (µ + ε, µ), and the proof is concluded by taking the infimum over all ε (0, µ µ). Now, assume w.l.o.g. that λk > λℓ(or, equivalently, µℓ< µk) for all k > ℓ; otherwise, terms where λk = λℓin (i) will be equal to 0. Then, letting κk,ℓ= P s=1 1 d(ˆµk(s), µℓ εℓ,k) log(n2K2) , the last lemma implies that k=ℓ+1 n(λk λℓ)(1 e λk )E[κk,ℓ] k=ℓ+1 n(λk λℓ)(1 e d(µk + εk,ℓ, µℓ εℓ) + 1 d (µk + εk,ℓ, µk) for some εk,ℓ (0, µℓ εℓ,k µk) that will be determined later. Bounding term (ii). We next focus on bounding the probabilities at the second summation of Equation (31). To do so, we prove the following lemma, which bounds each of the summands of term (ii). Lemma C.7. The following bound holds: P ( h s.t. uℓ(h) µℓ εℓ,k) µℓ n2K2d(µℓ εℓ,k,µℓ). Proof. Define Smax ℓ := P h=1 1{a(h) = ℓ}, the number of iterations ℓis picked by the algorithm. We have: P ( h s.t. uℓ(h) µℓ εℓ,k) =P h s.t. ˆµℓ(Tℓ(h)) µℓ εℓ,k and d(ˆµℓ(Tℓ(h)), µℓ εℓ,k) log n2K2 s Smax ℓ s.t. ˆµℓ(s) µℓ εℓ,k and d(ˆµℓ(s), µℓ εℓ,k) log n2K2 1 s < s.t. ˆµℓ(s) µℓ εℓ,k and d(ˆµℓ(s), µℓ εℓ,k) log n2K2 Now, observe that the empirical means ˆµℓdecrease in intervals without successes. Namely, if a < b are time indices such that xa ℓ= 1, xb ℓ= 1 and for all s [a + 1, b 1], xs ℓ= 0, then for any s [a, b 1], it holds that ˆµℓ(s) ˆµℓ(b 1). We On Preemption and Learning in Stochastic Scheduling 1 s < : d (ˆµℓ(s), µℓ εℓ,k) > log n2K2 s , ˆµℓ(s) µℓ εℓ,k 1 s < : d (ˆµℓ(s), µℓ εℓ,k) > log n2K2 s , ˆµℓ(s) µℓ εℓ,k, xs+1 ℓ = 1 Using the union bound, this implies P ( h s.t. uℓ(h) µℓ εℓ,k) d (ˆµℓ(s), µℓ εℓ,k) > log n2K2 s , ˆµℓ(s) µℓ εℓ,k, and xs+1 ℓ = 1 s=1 P xs+1 ℓ = 1 P d (ˆµℓ(s), µℓ εℓ,k) > log n2K2 s , ˆµℓ(s) µℓ εℓ,k|xs+1 ℓ = 1 d (ˆµℓ(s), µℓ εℓ,k) > log n2K2 s , ˆµℓ(s) < µℓ εℓ,k d (ˆµℓ(s), µ) > log n2K2 s + d(µℓ εℓ,k, µℓ), ˆµℓ(s) < µℓ where we used the fact that the sequence xs ℓis independent and the last inequality is by (Lattimore & Szepesvári, 2020, Lemma 10.2, (c)). Next, using Chernoff s bound (Lemma C.6), we get P ( h s.t. uℓ(h) µℓ εℓ,k) µℓ d(µℓ εℓ,k, µℓ) + log n2K2 s=1 exp ( sd(µℓ εℓ,k, µℓ)) µℓ n2K2d(µℓ εℓ,k, µℓ), which concludes the proof of Lemma C.7. Finally, substituting back into (ii) leads to the bound k=ℓ+1 n2(λk λℓ) µℓ n2K2d(µℓ εℓ,k, µℓ) µℓ(λk λℓ) d(µℓ εℓ,k, µℓ). (33) Combining both bounds. k=ℓ+1 n(λk λℓ) µk log n2K2 d(µk + εk,ℓ, µℓ εℓ,k) + µk d (µk + εk,ℓ, µk) k=ℓ+1 (λk λℓ) µℓ K2d(µℓ εℓ,k, µℓ). We now use a local refinement of Pinsker s inequality (Garivier et al., 2019): d(p, q) 1 2 max(p, q)(p q)2. On Preemption and Learning in Stochastic Scheduling This implies: k=ℓ+1 n(λk λℓ) 2µk (µℓ εℓ,k) log n2K2 (µℓ εℓ,k µk εk,ℓ)2 + 2µk(µk + εk,ℓ) k=ℓ+1 (λk λℓ) 2µ2 ℓ K2ε2 ℓ,k . Case 1: Assume µℓ 5µk. Setting εk,ℓ= µk, we obtain, k=ℓ+1 n(λk λℓ) 2µk(µℓ εℓ,k) log n2K3 (µℓ εℓ,k 2µk)2 + 4 k=ℓ+1 (λk λℓ) 2µ2 ℓ K2ε2 ℓ,k k=ℓ+1 n(λk λℓ) 2µk(µℓ εℓ,k) log n2K3 5µℓ εℓ,k)2 + 4 k=ℓ+1 (λk λℓ) 2µ2 ℓ K2ε2 ℓ,k Setting εℓ,k = 1 5µℓ, we get: k=ℓ+1 n(λk λℓ) 10µk log n2K2 k=ℓ+1 (λk λℓ) 50 We have µk = 1 e λk , and if 1 4λℓ, 1 µℓ 1.13 λℓ , this implies: k=ℓ+1 nλℓ11.3 log n2K2 + Since K 2, this implies: k=ℓ+1 nλℓ11.3 log n2K2 + ℓ=1 λℓ(12.5 + 4n) K. Case 2: Assume µℓ 5µk. Setting εk,ℓ= (µℓ µk)/4 and εℓ,k = (µℓ µk)/4, we obtain, k=ℓ+1 n(λk λℓ) µ2 k (µℓ µk)2 32 log n2K2 + 64 + k=ℓ+1 (λk λℓ) 32µ2 ℓ K2(µℓ µk)2 . It also holds that ( ) PK ℓ=1 PK k=ℓ+1 n2(λk λℓ). Thus: k=ℓ+1 n(λk λℓ) 4µk (µℓ µk) 2n (log (n2K3) + 4) + k=ℓ+1 (λk λℓ) 4 2µℓn K(µℓ µk). 4λℓ, we have: 1 µℓ µk 1.46 λkλℓ (λk λℓ). We thus obtain: k=ℓ+1 6nλℓ p 2n (log (n2K2) + 4) + For any n max(10, 10 log(K)), we have: which implies 6nλℓ p 2n (log (n2K2) + 4) 11.3 log n2K2 . Thus for any n max(10, 10 log(K)) and 1 k=ℓ+1 6nλℓ p 2n (log (n2K2) + 4) + 10K n E[COP T ]. On Preemption and Learning in Stochastic Scheduling D. Additional experiments We implemented the Bayesian approach of (Marbán et al., 2011) that we call LSEPT. We used an uninformative prior α = 1, w = 0 (the same for all job types). LSEPT is then in essence a greedy algorithm. Whenever a job finishes, it runs until completion a job whose type has the lowest empirical mean size (computed across jobs that have been processed so far). We ran all algorithms with K = 2, where jobs of type 1 have a mean size λ1 = 0.8 and jobs of type 2 have a mean λ2 = 1. As can be seen in Figure 3, LSEPT has better mean performance than RR, a non-adaptive method. However, it has a large variance and its performance does not improve with n. This is typical of the performance of greedy algorithms: since the algorithm commits very early, it can either get very good or very bad performances. We plot the mean over 200 seeds. Figure 3. CR on jobs with 2 different types. K = 2, λ2 = 1 and λ1 = 0.8, n takes a grid of values.