# Improving Online Algorithms via ML Predictions

Ravi Kumar, Google, Mountain View, CA (ravi.k53@gmail.com)
Manish Purohit, Google, Mountain View, CA (mpurohit@google.com)
Zoya Svitkina, Google, Mountain View, CA (zoya@cs.cornell.edu)

32nd Conference on Neural Information Processing Systems (NeurIPS 2018), Montréal, Canada.

In this work we study the problem of using machine-learned predictions to improve the performance of online algorithms. We consider two classical problems, ski rental and non-clairvoyant job scheduling, and obtain new online algorithms that use predictions to make their decisions. These algorithms are oblivious to the performance of the predictor, improve with better predictions, but do not degrade much if the predictions are poor.

## 1 Introduction

Dealing with uncertainty is one of the most challenging issues that real-world computational tasks, just like humans, face. From "will it snow next week?" to "should I rent an apartment or buy a house?", there are questions that cannot be answered reliably without some knowledge of the future. Similarly, the question "which job should I run next?" is hard for a CPU scheduler that does not know how long this job will run and what other jobs might arrive in the future. There are two interesting and well-studied computational paradigms aimed at tackling uncertainty. The first is in the field of machine learning, where uncertainty is addressed by making predictions about the future. This is typically achieved by examining the past and building robust models based on the data. These models are then used to make predictions about the future. Humans and real-world applications can use these predictions to adapt their behavior: knowing that it is likely to snow next week can be used to plan a ski trip. The second is in the field of algorithm design. Here, the effort has been to develop a notion of competitive ratio¹ for the goodness of an algorithm in the presence of an unknown future, and to develop online algorithms that make decisions heedless of the future but are provably good in the worst case, i.e., even in the most pessimistic future scenario. Such online algorithms are popular and successful in real-world systems and have been used to model problems including paging, caching, job scheduling, and more (see the book by Borodin and El-Yaniv [5]).

Recently, there has been some interest in using machine-learned predictions to improve the quality of online algorithms [20, 18]. The main motivation for this line of research is two-fold. The first is to design new online algorithms that can avoid assuming a worst-case scenario and hence have better performance guarantees both in theory and practice. The second is to leverage the vast amount of modeling work in machine learning, which precisely deals with how to make predictions. Furthermore, as machine-learning models are often retrained on new data, these algorithms can naturally adapt to evolving data characteristics. When using the predictions, it is important that the online algorithm is unaware of the performance of the predictor and makes no assumptions on the types of prediction errors. Additionally, we desire two key properties of the algorithm: (i) if the predictor is good, then the online algorithm should perform close to the best offline algorithm (consistency) and (ii) if the predictor is bad, then the online algorithm should gracefully degrade, i.e., its performance should be close to that of the online algorithm without predictions (robustness).
¹ Informally, the competitive ratio compares the worst-case performance of an online algorithm to the best offline algorithm that knows the future.

**Our problems.** We consider two basic problems in online algorithms and show how to use machine-learned predictions to improve their performance in a provable manner.

The first is ski rental, in which a skier is going to ski for an unknown number of days and on each day can either rent skis at unit price or buy them for a higher price $b$ and ski for free from then on. The uncertainty is in the number of skiing days, which a predictor can estimate. Such a prediction can be made reasonably well, for example, by building models based on weather forecasts and past behavior of other skiers. The ski rental problem is the canonical example of a large class of online rent-or-buy problems, which arise whenever one needs to decide between a cheap short-term solution (renting) and an expensive long-term one (buying). Several extensions and generalizations of the ski rental problem have been studied, leading to numerous applications such as dynamic TCP acknowledgement [11], buying parking permits [21], renting cloud servers [14], snoopy caching [13], and others. The best known deterministic algorithm for ski rental is the break-even algorithm: rent for the first $b - 1$ days and buy on day $b$. It is easy to observe that the break-even algorithm has a competitive ratio of 2 and that no deterministic algorithm can do better. On the other hand, Karlin et al. [12] designed a randomized algorithm that yields a competitive ratio of $\frac{e}{e-1} \approx 1.58$, which is also optimal.

The second problem we consider is non-clairvoyant job scheduling. In this problem a set of jobs, all of which are available immediately, have to be scheduled on one machine; any job can be preempted and resumed later. The objective is to minimize the sum of completion times of the jobs. The uncertainty in this problem is that the scheduler does not know the running time of a job until it actually finishes. Note that a predictor in this case can predict the running time of a job, once again, by building a model based on the characteristics of the job, its resource requirements, and its past behavior. Non-clairvoyant job scheduling, introduced by Motwani et al. [23], is a basic problem in online algorithms with a rich history and, in addition to its obvious applications to real-world systems, many variants and extensions of it have been studied extensively in the literature [9, 3, 1, 10]. Motwani et al. [23] showed that the round-robin algorithm has a competitive ratio of 2, which is optimal.
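To make the two prediction-free baselines concrete, here is a minimal Python sketch (ours, not from the paper). The function names are our own; costs follow the conventions above (renting costs 1 per day, buying costs $b$), and round-robin's total completion time is computed in closed form by processing jobs in sorted order.

```python
def break_even_cost(b, skiing_days):
    """Deterministic break-even rule: rent for the first b - 1 days, buy on day b."""
    if skiing_days < b:
        return skiing_days            # rented on every skiing day
    return (b - 1) + b                # rented b - 1 days, then bought

def round_robin_total_completion_time(processing_times):
    """Total completion time when all unfinished jobs share the machine equally
    (each of the k unfinished jobs runs at rate 1/k)."""
    xs = sorted(processing_times)
    n = len(xs)
    total, elapsed, prev = 0.0, 0.0, 0.0
    for i, x in enumerate(xs):
        # while the i-th shortest job finishes its last (x - prev) units of work,
        # it shares the machine with the other n - i - 1 unfinished jobs
        elapsed += (x - prev) * (n - i)
        total += elapsed
        prev = x
    return total
```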
**Main results.** Before we present our main results we need a few formal notions. In online algorithms, the competitive ratio of an algorithm is defined as the worst-case ratio of the algorithm's cost to the offline optimum. In our setting, this is a function $c(\eta)$ of the error $\eta$ of the predictor². We say that an algorithm is $\gamma$-robust if $c(\eta) \le \gamma$ for all $\eta$, and that it is $\beta$-consistent if $c(0) = \beta$. So consistency is a measure of how well the algorithm does in the best case of perfect predictions, and robustness is a measure of how well it does in the worst case of terrible predictions. Let $\lambda \in (0, 1)$ be a hyperparameter. For the ski rental problem with a predictor, we first obtain a deterministic online algorithm that is $(1 + 1/\lambda)$-robust and $(1 + \lambda)$-consistent (Section 2.2). We next improve these bounds by obtaining a randomized algorithm that is $\big(\frac{1}{1 - e^{-(\lambda - 1/b)}}\big)$-robust and $\big(\frac{\lambda}{1 - e^{-\lambda}}\big)$-consistent, where $b$ is the cost of buying (Section 2.3). For the non-clairvoyant scheduling problem, we obtain a randomized algorithm that is $(2/(1-\lambda))$-robust and $(1/\lambda)$-consistent. Note that the consistency bounds for all these algorithms circumvent the corresponding lower bounds, which is possible only because of the predictions. It turns out that for these problems, one has to be careful how the predictions are used. We illustrate through an example that if the predictions are used naively, one cannot ensure robustness (Section 2.1). Our algorithms proceed by opening up the classical online algorithms for these problems and using the predictions in a judicious manner. We also conduct experiments to show that the algorithms we develop are practical and achieve good performance compared to ones that do not use any prediction.

² The definition of the prediction error $\eta$ is problem-specific. In both problems considered in this paper, $\eta$ is defined to be the L1 norm of the error.

**Related work.** The work closest to ours is that of Medina and Vassilvitskii [20] and Lykouris and Vassilvitskii [18]. The former used a prediction oracle to improve reserve price optimization, relating the gap between the expected bid and revenue to the average predictor loss. In a sense, this paper initiated the study of online algorithms equipped with machine-learned predictions. The latter developed this framework further, introduced the concepts of robustness and consistency, and considered the online caching problem with predictions. It modified the well-known Marker algorithm to use the predictions, ensuring both robustness and consistency. While we operate in the same framework, none of their techniques are applicable to our setting. Another recent work is that of Kraska et al. [17], which empirically shows that better indexes can be built using machine-learned models; it does not provide any provable guarantees for its methods.

There are other computational models that try to tackle uncertainty. The field of robust optimization [16] considers uncertain inputs and aims to design algorithms that yield good performance guarantees for any potential realization of the inputs. There has been some work on analyzing algorithms when the inputs are stochastic or come from a known distribution [19, 22, 6]. In the optimization community, the whole field of online stochastic optimization concerns online decision making under uncertainty by assuming a distribution on future inputs; see the book by Bent and Van Hentenryck [4]. Our work differs from these in that we do not assume anything about the input; in fact, we do not assume anything about the predictor either!

## 2 Ski rental with prediction

In the ski rental problem, let rentals cost one unit per day, $b$ be the cost to buy, $x$ be the actual number of skiing days, which is unknown to the algorithm, and $y$ be the predicted number of days. Then $\eta = |y - x|$ is the prediction error. Note that we do not make any assumptions about its distribution. The optimum cost is $\mathrm{OPT} = \min\{b, x\}$.

### 2.1 Warmup: A simple consistent, non-robust algorithm

We first show that an algorithm that naively uses the predicted number of days to decide whether or not to buy is 1-consistent, i.e., its competitive ratio is 1 when $\eta = 0$. However, this algorithm is not robust, as the competitive ratio can be arbitrarily large in case of incorrect predictions.

**Algorithm 1:** A simple 1-consistent algorithm
1. If $y \ge b$, buy on the first day.
2. Else, keep renting for all skiing days.

**Lemma 2.1.** Let ALG denote the cost of the solution obtained by Algorithm 1 and let OPT denote the optimal solution cost on the same instance. Then $\mathrm{ALG} \le \mathrm{OPT} + \eta$.

*Proof.* We consider different cases based on the relative values of the prediction $y$ and the actual number of days $x$ of the instance. Recall that Algorithm 1 incurs a cost of $b$ whenever the prediction is at least $b$, and incurs a cost of $x$ otherwise.
- $y \ge b$, $x \ge b$: $\mathrm{ALG} = b = \mathrm{OPT}$.
- $y < b$, $x < b$: $\mathrm{ALG} = x = \mathrm{OPT}$.
- $y \ge b$, $x < b$: $\mathrm{ALG} = b \le y = x + (y - x) = x + \eta = \mathrm{OPT} + \eta$.
- $y < b$, $x \ge b$: $\mathrm{ALG} = x < b + (x - y) = b + \eta = \mathrm{OPT} + \eta$.

A major drawback of Algorithm 1 is its lack of robustness. In particular, its competitive ratio can be unbounded if the prediction $y$ is small but $x \ge b$. Our goal next is to obtain an algorithm that is both consistent and robust.
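The following small Python sketch (ours, not from the paper) evaluates the cost of Algorithm 1 and reproduces the bad instance just described: a tiny prediction with $x \gg b$ makes the ratio grow linearly in $x$. The instance values are illustrative only.

```python
def algorithm1_cost(b, y, x):
    """Cost of the simple 1-consistent rule: buy on day 1 iff y >= b, else rent throughout."""
    return b if y >= b else x              # renting costs 1 per day for all x skiing days

b = 100
y, x = 1, 10_000                           # the predictor is badly wrong (eta = 9999)
opt = min(b, x)                            # offline optimum = 100
print(algorithm1_cost(b, y, x) / opt)      # -> 100.0; the ratio is unbounded as x grows
```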
### 2.2 A deterministic robust and consistent algorithm

In this section, we show that a small modification to Algorithm 1 yields an algorithm that is both consistent and robust. Let $\lambda \in (0, 1)$ be a hyperparameter. As we see later, varying $\lambda$ gives us a smooth trade-off between the robustness and consistency of the algorithm.

**Algorithm 2:** A deterministic robust and consistent algorithm
1. If $y \ge b$, buy at the start of day $\lceil \lambda b \rceil$.
2. Else, buy at the start of day $\lceil b/\lambda \rceil$.

**Theorem 2.2.** With a parameter $\lambda \in (0, 1)$, Algorithm 2 has a competitive ratio of at most $\min\big\{\frac{1+\lambda}{\lambda},\; (1 + \lambda) + \frac{\eta}{(1-\lambda)\,\mathrm{OPT}}\big\}$. In particular, Algorithm 2 is $(1 + 1/\lambda)$-robust and $(1 + \lambda)$-consistent.

*Proof.* We begin with the first bound. Suppose $y \ge b$, so the algorithm buys the skis at the start of day $\lceil \lambda b \rceil$. Since the algorithm incurs a cost of $b + \lceil \lambda b \rceil - 1$ whenever $x \ge \lceil \lambda b \rceil$, the worst competitive ratio is obtained when $x = \lceil \lambda b \rceil$, for which $\mathrm{OPT} = \lceil \lambda b \rceil$. In this case, we have $\mathrm{ALG} = b + \lceil \lambda b \rceil - 1 \le b + \lambda b \le \frac{1+\lambda}{\lambda}\,\mathrm{OPT}$. On the other hand, when $y < b$, the algorithm buys skis at the start of day $\lceil b/\lambda \rceil$ and rents until then. In this case, the worst competitive ratio is attained when $x = \lceil b/\lambda \rceil$, as we have $\mathrm{OPT} = b$ and $\mathrm{ALG} = b + \lceil b/\lambda \rceil - 1 \le b + b/\lambda = \frac{1+\lambda}{\lambda}\,\mathrm{OPT}$.

To prove the second bound, we consider the following two cases. Suppose $y \ge b$. Then, for all $x < \lceil \lambda b \rceil$, we have $\mathrm{ALG} = \mathrm{OPT} = x$. On the other hand, for $x \ge \lceil \lambda b \rceil$, we have $\mathrm{ALG} = b + \lceil \lambda b \rceil - 1 \le (1 + \lambda) b \le (1 + \lambda)(\mathrm{OPT} + \eta)$. The second inequality follows since either $\mathrm{OPT} = b$ (if $x \ge b$) or $b \le y \le \mathrm{OPT} + \eta$ (if $x < b$). Now suppose $y < b$. Then, for all $x \le b$, we have $\mathrm{ALG} = \mathrm{OPT} = x$. Similarly, for all $x \in (b, \lceil b/\lambda \rceil)$, we have $\mathrm{ALG} = x \le y + \eta < b + \eta = \mathrm{OPT} + \eta$. Finally, for all $x \ge \lceil b/\lambda \rceil$, noting that $\eta = x - y > \lceil b/\lambda \rceil - b \ge (1-\lambda)b/\lambda$, we have $\mathrm{ALG} = b + \lceil b/\lambda \rceil - 1 \le b + b/\lambda < b + \frac{\eta}{1-\lambda} = \mathrm{OPT} + \frac{\eta}{1-\lambda}$. Thus we obtain $\mathrm{ALG} \le (1+\lambda)\,\mathrm{OPT} + \frac{\eta}{1-\lambda}$, completing the proof.

Thus, Algorithm 2 gives an option to trade off consistency and robustness. In particular, greater trust in the predictor suggests setting $\lambda$ close to zero, as this leads to a better competitive ratio when $\eta$ is small. On the other hand, setting $\lambda$ close to one is conservative and yields a more robust algorithm.
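A minimal Python sketch (ours, not from the paper) of Algorithm 2 follows. The instance triples below are illustrative only; the printed ratios respect the bounds of Theorem 2.2 (at most $1 + \lambda$ when $\eta = 0$ and at most $1 + 1/\lambda$ always).

```python
import math

def algorithm2_cost(b, y, x, lam):
    """Total cost of Algorithm 2 with trade-off parameter lam in (0, 1)."""
    # buy early if the prediction says to buy, late if it says to rent
    buy_day = math.ceil(lam * b) if y >= b else math.ceil(b / lam)
    if x < buy_day:
        return x                       # rented every day, never bought
    return (buy_day - 1) + b           # rented buy_day - 1 days, then bought

b, lam = 100, 0.5
for y, x in [(200, 200), (200, 50), (10, 200)]:
    ratio = algorithm2_cost(b, y, x, lam) / min(b, x)
    print(y, x, round(ratio, 2))       # 1.49, 2.98, 2.99: within 1 + lam and 1 + 1/lam
```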
### 2.3 A randomized robust and consistent algorithm

In this section we consider a family of randomized algorithms and compare their performance against an oblivious adversary. In particular, we design robust and consistent algorithms that yield a better trade-off than the above deterministic algorithms. Let $\lambda \in (1/b, 1)$ be a hyperparameter. For a given $\lambda$, Algorithm 3 samples the day on which skis are bought from one of two probability distributions, depending on the prediction received, and rents until that day.

**Algorithm 3:** A randomized robust and consistent algorithm
1. If $y \ge b$: let $k \leftarrow \lfloor \lambda b \rfloor$; define $q_i \leftarrow \big(\frac{b-1}{b}\big)^{k-i} \cdot \frac{1}{b\,(1 - (1 - 1/b)^k)}$ for all $1 \le i \le k$; choose $j \in \{1, \ldots, k\}$ randomly from the distribution defined by the $q_i$; buy at the start of day $j$.
2. Else: let $\ell \leftarrow \lceil b/\lambda \rceil$; define $r_i \leftarrow \big(\frac{b-1}{b}\big)^{\ell-i} \cdot \frac{1}{b\,(1 - (1 - 1/b)^\ell)}$ for all $1 \le i \le \ell$; choose $j \in \{1, \ldots, \ell\}$ randomly from the distribution defined by the $r_i$; buy at the start of day $j$.

**Theorem 2.3.** Algorithm 3 yields a competitive ratio of at most $\min\big\{\frac{1}{1 - e^{-(\lambda - 1/b)}},\; \frac{\lambda}{1 - e^{-\lambda}}\big(1 + \frac{\eta}{\mathrm{OPT}}\big)\big\}$. In particular, Algorithm 3 is $\big(\frac{1}{1 - e^{-(\lambda - 1/b)}}\big)$-robust and $\big(\frac{\lambda}{1 - e^{-\lambda}}\big)$-consistent.

*Proof.* We consider different cases depending on the relative values of $y$ and $x$.

(i) $y \ge b$, $x \ge k$. Here, we have $\mathrm{OPT} = \min\{b, x\}$, and $b \le \mathrm{OPT} + \eta$ (either $x \ge b$, or $\eta \ge y - x \ge b - x$). Since the algorithm incurs a cost of $b + i - 1$ when it buys at the beginning of day $i$, we have
$$\mathbb{E}[\mathrm{ALG}] = \sum_{i=1}^{k} (b + i - 1)\, q_i = \frac{k}{1 - (1 - 1/b)^k} \le \frac{k}{1 - e^{-k/b}} = \frac{k/b}{1 - e^{-k/b}} \cdot b \le \frac{k/b}{1 - e^{-k/b}} \,(\mathrm{OPT} + \eta) \le \frac{\lambda}{1 - e^{-\lambda}} \,(\mathrm{OPT} + \eta),$$
where the last inequality uses $k/b \le \lambda$.

(ii) $y \ge b$, $x < k$. Here, we have $\mathrm{OPT} = x$. On the other hand, the algorithm incurs a cost of $b + i - 1$ only if it buys at the beginning of day $i \le x$. In particular, we have
$$\mathbb{E}[\mathrm{ALG}] = \sum_{i=1}^{x} (b + i - 1)\, q_i + \sum_{i=x+1}^{k} x\, q_i = \frac{x}{1 - (1 - 1/b)^k} \le \frac{1}{1 - e^{-k/b}}\,\mathrm{OPT} \le \frac{1}{1 - e^{-(\lambda - 1/b)}}\,\mathrm{OPT},$$
which establishes robustness (using $k = \lfloor \lambda b \rfloor \ge \lambda b - 1$). In order to prove consistency, we can rewrite the right-hand side as follows:
$$\mathbb{E}[\mathrm{ALG}] \le \frac{1}{1 - e^{-k/b}}\,\mathrm{OPT} = \frac{k/b}{1 - e^{-k/b}}\,\mathrm{OPT} + \frac{(b - k)/b}{1 - e^{-k/b}}\,\mathrm{OPT} \le \frac{k/b}{1 - e^{-k/b}}\,\mathrm{OPT} + \frac{k/b}{1 - e^{-k/b}}\,\eta \le \frac{\lambda}{1 - e^{-\lambda}} \,(\mathrm{OPT} + \eta),$$
since $x < k$ and $b - k \le \eta$.

(iii) $y < b$, $x < \ell$. Here, we have $\mathrm{OPT} = \min\{b, x\}$. The expected cost of the algorithm can be computed similarly to (ii):
$$\mathbb{E}[\mathrm{ALG}] = \sum_{i=1}^{x} (b + i - 1)\, r_i + \sum_{i=x+1}^{\ell} x\, r_i \le \frac{1}{1 - e^{-\ell/b}} \,(\mathrm{OPT} + \eta) \le \frac{\lambda}{1 - e^{-\lambda}} \,(\mathrm{OPT} + \eta).$$

(iv) $y < b$, $x \ge \ell$. Here, we have $\mathrm{OPT} = b$. The expected cost incurred by the algorithm is computed as in (i):
$$\mathbb{E}[\mathrm{ALG}] = \sum_{i=1}^{\ell} (b + i - 1)\, r_i = \frac{\ell}{1 - (1 - 1/b)^\ell} \le \frac{b/\lambda + 1}{1 - e^{-\ell/b}} \le \frac{1/\lambda + 1/b}{1 - e^{-1/\lambda}}\,\mathrm{OPT} \le \frac{1}{1 - e^{-(\lambda - 1/b)}}\,\mathrm{OPT},$$
which proves robustness. To prove consistency, we rewrite the right-hand side as follows:
$$\mathbb{E}[\mathrm{ALG}] \le \frac{\ell}{1 - e^{-\ell/b}} \le \frac{\ell}{1 - e^{-1/\lambda}} = \frac{1}{1 - e^{-1/\lambda}} \big(b + (\ell - b)\big) \le \frac{1}{1 - e^{-1/\lambda}} \,(\mathrm{OPT} + \eta) \le \frac{\lambda}{1 - e^{-\lambda}} \,(\mathrm{OPT} + \eta),$$
where $\ell - b \le \eta$ since $x \ge \ell$ and $y < b$.

Algorithms 2 and 3 both yield a smooth trade-off between the robustness and consistency guarantees for the ski rental problem. As shown in Figure 1, the randomized algorithm offers a much better trade-off by always guaranteeing a smaller consistency for a given robustness guarantee. We remark that setting $\lambda = 1$ in Algorithms 2 and 3 allows us to recover the best deterministic and randomized algorithms for the classical ski rental problem without using predictions.
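A minimal Python sketch (ours, not from the paper) of Algorithm 3 follows. The instance values are illustrative, and the empirical average at the end is only a sanity check against the consistency bound $\lambda/(1 - e^{-\lambda})$ of Theorem 2.3.

```python
import math
import random

def algorithm3_buy_day(b, y, lam, rng=random):
    """Sample the day on which Algorithm 3 buys, from distribution q (if y >= b) or r."""
    horizon = math.floor(lam * b) if y >= b else math.ceil(b / lam)    # k or l
    # q_i and r_i are proportional to ((b - 1) / b) ** (horizon - i); the common
    # normalizer b * (1 - (1 - 1/b) ** horizon) cancels out when sampling
    weights = [((b - 1) / b) ** (horizon - i) for i in range(1, horizon + 1)]
    return rng.choices(range(1, horizon + 1), weights=weights, k=1)[0]

def algorithm3_cost(b, y, x, lam, rng=random):
    j = algorithm3_buy_day(b, y, lam, rng)
    return x if x < j else (j - 1) + b       # rent until day j, then buy

b, lam, y, x = 100, 0.5, 200, 200             # perfect prediction (eta = 0)
avg = sum(algorithm3_cost(b, y, x, lam) for _ in range(100_000)) / 100_000
print(avg / min(b, x))                        # close to lam / (1 - exp(-lam)) ~ 1.27
```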
### 2.4 Extensions

Consider a generalization of the ski rental problem where we have a varying demand $x_i$ for computing resources on each day $i$. Such a situation models the problem faced while designing small enterprise data centers: system designers have the choice of buying machines at a high setup cost or renting machines from a cloud service provider to handle the computing needs of the enterprise. One can satisfy the demand in two ways: either pay 1 to rent one machine and satisfy one unit of demand for one day, or pay $b$ to buy a machine and use it to satisfy one unit of demand for all future days. It is easy to cast the classical ski rental problem in this framework by setting $x_i = 1$ for the first $x$ days and $x_i = 0$ afterwards. Kodialam [15] considers this generalization and gives a deterministic algorithm with a competitive ratio of 2 as well as a randomized algorithm with a competitive ratio of $\frac{e}{e-1}$.

Figure 1: Ski rental: Robustness vs. consistency.

Now suppose we have predictions $y_i$ for the demand on day $i$. We define $\eta = \sum_i |x_i - y_i|$ to be the total L1 error of the predictions. Both Algorithms 2 and 3 extend naturally to this setting and yield the same robustness and consistency guarantees as in Theorems 2.2 and 2.3. Our results follow from viewing an instance of the ski rental problem with varying demand as $k$ disjoint instances of the classical ski rental problem, where $k$ is an upper bound on the maximum demand on any day. The proofs are similar to those in Sections 2.2 and 2.3; we omit them for brevity.

## 3 Non-clairvoyant job scheduling with prediction

We consider the simplest variant of non-clairvoyant job scheduling, i.e., scheduling $n$ jobs on a single machine with no release dates. The processing requirement $x_j$ of a job $j$ is unknown to the algorithm and only becomes known once the job has finished processing. Any job can be preempted at any time and resumed at a later time without any cost. The objective is to minimize the sum of completion times of the jobs. Note that no algorithm can yield any non-trivial guarantees if preemptions are not allowed. Let $x_1, \ldots, x_n$ denote the actual processing times of the $n$ jobs, which are unknown to the non-clairvoyant algorithm. In the clairvoyant case, when processing times are known up front, the optimal algorithm is to simply schedule the jobs in non-decreasing order of job lengths, i.e., shortest job first. A deterministic non-clairvoyant algorithm called round-robin (RR) yields a competitive ratio of 2 [23], which is known to be best possible.

Now, suppose that instead of being truly non-clairvoyant, the algorithm has an oracle that predicts the processing time of each job. Let $y_1, \ldots, y_n$ be the predicted processing times of the $n$ jobs. Then $\eta_j = |x_j - y_j|$ is the prediction error for job $j$, and $\eta = \sum_{j=1}^{n} \eta_j$ is the total error. We assume that there are no zero-length jobs and that units are normalized such that the actual processing time of the shortest job is at least one. Our goal in this section is to design algorithms that are both robust and consistent, i.e., that can use good predictions to beat the lower bound of 2, while at the same time guaranteeing a worst-case constant competitive ratio.

### 3.1 A preferential round-robin algorithm

In scheduling problems with preemption, we can simplify the exposition by talking about several jobs running concurrently on the machine, with rates that sum to at most 1. For example, in the round-robin algorithm, at any point in time, all $k$ unfinished jobs run on the machine at equal rates of $1/k$. This is just a shorthand terminology for saying that in any infinitesimal time interval, a $1/k$ fraction of that interval is dedicated to running each of the jobs.

We call a non-clairvoyant scheduling algorithm monotonic if it has the following property: given two instances with identical inputs and actual job processing times $(x_1, \ldots, x_n)$ and $(x'_1, \ldots, x'_n)$ such that $x_j \le x'_j$ for all $j$, the objective function value found by the algorithm for the first instance is no higher than that for the second. It is easy to see that the round-robin algorithm is monotonic.

We consider the Shortest Predicted Job First (SPJF) algorithm, which sorts the jobs in increasing order of their predicted processing times $y_j$ and executes them to completion in that order. Note that SPJF is monotonic, because if the processing times $x_j$ became smaller (with the predictions $y_j$ staying the same), all jobs would finish only sooner, thus decreasing the total completion time objective. SPJF produces the optimal schedule in the case that the predictions are perfect, but for bad predictions, its worst-case performance is not bounded by a constant. To get the best of both worlds, i.e., good performance for good predictions as well as a constant-factor approximation in the worst case, we combine SPJF with RR using the following lemma, calling the resulting algorithm Preferential Round-Robin (PRR).

**Lemma 3.1.** Given two monotonic algorithms with competitive ratios $\alpha$ and $\beta$ for the minimum total completion time problem with preemptions, and a parameter $\lambda \in (0, 1)$, one can obtain an algorithm with competitive ratio $\min\{\alpha/\lambda,\; \beta/(1-\lambda)\}$.

*Proof.* The combined algorithm runs the two given algorithms in parallel. The $\alpha$-approximation (call it A) is run at a rate of $\lambda$, and the $\beta$-approximation (B) at a rate of $1 - \lambda$. Compared to running at rate 1, if algorithm A runs at a slower rate of $\lambda$, all completion times increase by a factor of $1/\lambda$, so it becomes an $(\alpha/\lambda)$-approximation. Now, the fact that some of the jobs are concurrently being executed by algorithm B only decreases their processing times from the point of view of A, so by monotonicity, this does not make the objective of A any worse. Similarly, when algorithm B runs at a lower rate of $1 - \lambda$, it becomes a $(\beta/(1-\lambda))$-approximation, and by monotonicity it can only benefit from the work done concurrently by A. Thus, both bounds hold simultaneously, and the overall guarantee is their minimum.
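The following event-driven Python simulation (ours, not from the paper) sketches the combined algorithm from Lemma 3.1 with A = SPJF and B = round-robin: at every moment the unfinished job with the smallest predicted time receives rate $\lambda$ plus its round-robin share, and every other unfinished job receives rate $(1-\lambda)/k$. The function name and example instances are our own.

```python
def preferential_round_robin(actual, predicted, lam):
    """Total completion time of the preferential round-robin (PRR) sketch:
    SPJF run at rate lam, round-robin at rate 1 - lam."""
    n = len(actual)
    remaining = list(actual)
    unfinished = set(range(n))
    time, total = 0.0, 0.0
    while unfinished:
        k = len(unfinished)
        front = min(unfinished, key=lambda j: predicted[j])      # job SPJF would run
        rates = {j: (1 - lam) / k + (lam if j == front else 0.0) for j in unfinished}
        step = min(remaining[j] / rates[j] for j in unfinished)  # time to next completion
        time += step
        for j in list(unfinished):
            remaining[j] -= step * rates[j]
            if remaining[j] <= 1e-9:                             # job j completes now
                unfinished.remove(j)
                total += time
    return total

actual = [1.0, 2.0, 3.0, 4.0]
opt = sum((len(actual) - i) * x for i, x in enumerate(sorted(actual)))  # clairvoyant SJF
print(preferential_round_robin(actual, actual, lam=0.9) / opt)          # close to 1 for eta = 0
print(preferential_round_robin(actual, [4, 3, 2, 1], lam=0.9) / opt)    # reversed predictions
```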
We next analyze the performance of SPJF.

**Lemma 3.2.** The SPJF algorithm has competitive ratio at most $1 + \frac{2\eta}{n}$.

*Proof.* Assume w.l.o.g. that the jobs are numbered in non-decreasing order of their actual processing times, i.e., $x_1 \le \ldots \le x_n$. For any pair of jobs $(i, j)$, define $d(i, j)$ as the amount of job $i$ that has been executed before the completion time of job $j$. In other words, $d(i, j)$ is the amount of time by which $i$ delays $j$. Let ALG denote the objective value of SPJF. Then
$$\mathrm{ALG} = \sum_{j=1}^{n} x_j + \sum_{(i,j):\, i < j} \big(d(i, j) + d(j, i)\big).$$
Consider a pair $i < j$ (so that $x_i \le x_j$). If the predictions order the pair correctly, i.e., $y_i \le y_j$, then SPJF runs job $i$ to completion before starting job $j$, so $d(j, i) = 0$ and $d(i, j) = x_i$. If the pair is ordered incorrectly, i.e., $y_i > y_j$, then job $j$ is run first, so $d(i, j) = 0$ and $d(j, i) = x_j = x_i + (x_j - x_i) \le x_i + (x_j - y_j) + (y_i - x_i) \le x_i + \eta_i + \eta_j$, where the first inequality uses $y_j < y_i$. In either case, $d(i, j) + d(j, i) \le x_i + \eta_i + \eta_j$, and hence
$$\mathrm{ALG} \le \sum_{j=1}^{n} x_j + \sum_{(i,j):\, i < j} x_i + \sum_{(i,j):\, i < j} (\eta_i + \eta_j) \le \mathrm{OPT} + (n-1)\eta.$$
Since every actual processing time is at least one, $\mathrm{OPT} = \sum_{j=1}^{n} (n - j + 1) x_j \ge \frac{n(n+1)}{2}$, so $(n-1)\eta \le \frac{2\eta}{n}\,\mathrm{OPT}$, which gives $\mathrm{ALG} \le \big(1 + \frac{2\eta}{n}\big)\mathrm{OPT}$.

Combining Lemmas 3.1 and 3.2, with A = SPJF and B = RR (which is monotonic and 2-competitive), yields the following guarantee for the preferential round-robin algorithm.

**Theorem 3.3.** The preferential round-robin algorithm with parameter $\lambda \in (0, 1)$ has competitive ratio at most $\min\big\{\frac{1}{\lambda}\big(1 + \frac{2\eta}{n}\big),\; \frac{2}{1-\lambda}\big\}$. In particular, it is $(2/(1-\lambda))$-robust and $(1/\lambda)$-consistent.

In particular, any $\lambda > 0.5$ gives an algorithm that beats the round-robin ratio of 2 in the case of sufficiently good predictions. For the special case of zero prediction errors (or, more generally, if the order of jobs sorted by $y_j$ is the same as that sorted by $x_j$), we can obtain an improved competitive ratio of $\frac{1+\lambda}{2\lambda}$ via a more sophisticated analysis.

**Theorem 3.4.** The preferential round-robin algorithm with parameter $\lambda \in (0, 1)$ has competitive ratio at most $\frac{1+\lambda}{2\lambda}$ when $\eta = 0$.

*Proof.* Suppose w.l.o.g. that the jobs are sorted in non-decreasing order of job lengths (both actual and predicted), i.e., $x_1 \le \ldots \le x_n$ and $y_1 \le \ldots \le y_n$. Since the optimal solution schedules the jobs sequentially in this order, we have
$$\mathrm{OPT} = \sum_{j=1}^{n} (n - j + 1)\, x_j = \sum_{j=1}^{n} x_j + \sum_{(i,j):\, i < j} x_i.$$