# Learning Online Algorithms with Distributional Advice

Ilias Diakonikolas*¹, Vasilis Kontonis*¹, Christos Tzamos*¹, Ali Vakilian*², Nikos Zarifis*¹

*Equal contribution. ¹Department of Computer Sciences, University of Wisconsin–Madison, Madison, Wisconsin, USA. ²Toyota Technological Institute at Chicago (TTIC), Chicago, Illinois, USA; this work was done while the author was at the University of Wisconsin–Madison. Correspondence to: Ali Vakilian.

Proceedings of the 38th International Conference on Machine Learning, PMLR 139, 2021. Copyright 2021 by the author(s).

Abstract

We study the problem of designing online algorithms given advice about the input. While prior work has focused on deterministic advice, we assume only distributional access to the instances of interest, and the goal is to learn a competitive algorithm given access to i.i.d. samples. We aim to be competitive against an adversary with prior knowledge of the distribution, while also performing well against worst-case inputs. We focus on the classical online problems of ski-rental and prophet inequalities, and provide sample complexity bounds for the underlying learning tasks. First, we point out that for general distributions it is information-theoretically impossible to beat the worst-case competitive ratio with any finite sample size. As our main contribution, we establish strong positive results for well-behaved distributions. Specifically, for the broad class of log-concave distributions, we show that poly(1/ε) samples suffice to obtain a (1 + ε)-competitive ratio. Finally, we show that this sample upper bound is close to best possible, even for very simple classes of distributions.

1. Introduction

Uncertainty in the input is a central challenge in the area of algorithm design. A number of both classical and recent paradigms have been introduced to capture this issue. Online computation is a prototypical domain where dealing with uncertainty about future data naturally arises. To get around the inherent uncertainty of the online setting, many different models have been introduced: online algorithms with lookahead (Grove, 1995), with a statistical adversary (Raghavan, 1992), or with a diffused adversary (Koutsoupias & Papadimitriou, 2000). A more recent line of work studies online algorithms with predictions or advice (also known as learning-based/data-driven online algorithms) (Renault & Rosén, 2015; Angelopoulos et al., 2015; Lykouris & Vassilvitskii, 2018; Purohit et al., 2018; Gollapudi & Panigrahi, 2019; Angelopoulos et al., 2020; Dütting et al., 2020; Lattanzi et al., 2020; Anand et al., 2020; Bamas et al., 2020).

In this work, we introduce a new model of learning-based algorithms. Our goal is to design algorithms that can leverage the learnable structure of the input. Specifically, we assume sample access to the underlying distribution over inputs. Given i.i.d. samples from this distribution, we want to design methods that are competitive against the optimal performance of any algorithm with exact knowledge of the distribution over inputs, and that at the same time are robust to model misspecification or adversarial predictions. In other words, when the predictions are almost perfect, we expect the algorithm to solve the problem almost optimally; and if the predictions are misleading, the algorithm is required to attain guarantees comparable to the best possible worst-case bounds.
This work falls in the active domain of beyond worst-case analysis (see the recent book (Roughgarden, 2020) and the related chapter on algorithms with predictions (Mitzenmacher & Vassilvitskii, 2020)). While most existing results on learning-based online algorithms consider only deterministic predictions, in this work we aim to design such algorithms when the predictions are random and drawn from an underlying distribution. This setting accurately models the type of predictions that arise in real applications: in many scenarios, we have access to past data, i.e., distributional advice, that can be used to design an algorithm that performs well on unseen data.

In this paper, we focus on two basic online problems with distributional information. In particular, we investigate the trade-off between the efficiency and the sample complexity of the ski-rental and the prophet inequality problems in an online model where the input distribution is assumed to belong to a known class of distributions. Unlike the classical online setting, we consider the diffused adversary model proposed by Koutsoupias and Papadimitriou (2000), which assumes that the given input distribution belongs to a known class C of distributions. We note that Koutsoupias and Papadimitriou study the generalized competitive ratio of online problems such as paging and k-server in the diffused adversary model assuming only knowledge of the class of distributions. In contrast, we also assume sample access to the input distribution.

1.1. Problem Statement

Given an online minimization problem f, a distribution D on inputs for f, and an algorithm A, we denote by cost(A; t) the cost of A for f on input t, and define cost(A; D) := E_{t∼D}[cost(A; t)], i.e., the average cost of A when the input is drawn from D. We remark that when A is a randomized algorithm, the expected cost is computed with respect to the randomness of both A and D. We also define the optimal cost under the distribution D to be the minimum achievable cost for any algorithm, i.e., OPT_D = min_A cost(A; D). Observe that, apart from the computational hardness of optimizing over all possible algorithms, computing OPT_D requires exact knowledge of the input distribution D. In our setting, we only assume sample access to D. For maximization problems, we define OPT_D similarly, where instead of minimizing cost(·) the goal is to maximize gain(·), i.e., OPT_D = max_A gain(A; D).

Robustness and Consistency. Two measures of interest in the domain of algorithms with predictions are consistency and robustness. At a high level, the consistency requirement says that, if the samples we observe are accurate, then the generalized competitive ratio should approach one. The robustness requirement says that, no matter how inaccurate the predictions are, the (generalized) competitive ratio should not be much worse than that of the best pure online algorithm. Fix a family of distributions C and some distribution D, not necessarily in C. Let X be a finite set of i.i.d. samples from D, and let A_X be the algorithm that we learn given the set of samples X. We say that the algorithm A_X is α-consistent and β-robust under the class C if the following hold: if D ∈ C, then cost(A_X; D)/OPT_D ≤ α; otherwise, cost(A_X; D)/OPT_ONL ≤ β, where OPT_ONL is the cost of an optimal classical online algorithm on D. Similarly, we can define robust and consistent algorithms for maximization problems by considering the inverse ratios, e.g., OPT_D/gain(A_X; D) ≤ α.
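As a concrete reading of these definitions, the following minimal sketch (ours, not from the paper; the instance, the strategy, and the OPT_D value are placeholders) estimates cost(A; D) by Monte Carlo and forms the consistency ratio:

```python
import random

def expected_cost(strategy_cost, sample_input, trials=100_000):
    """Monte Carlo estimate of cost(A; D) = E_{t ~ D}[cost(A; t)].
    strategy_cost(t) returns cost(A; t); sample_input() draws t ~ D."""
    return sum(strategy_cost(sample_input()) for _ in range(trials)) / trials

# Placeholder instance (ours, for illustration): inputs t ~ Exp(1) and a
# strategy that pays min(t, 1.5); here E[min(t, 1.5)] = 1 - e^{-1.5} ~ 0.777.
cost_A = expected_cost(lambda t: min(t, 1.5), lambda: random.expovariate(1.0))
OPT_D = 0.7  # hypothetical value of OPT_D, for illustration only
print("consistency ratio cost(A; D)/OPT_D =", cost_A / OPT_D)
```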
In this paper, our goal is to answer the following question: given finite samples from a distribution D, can we efficiently design consistent and robust algorithms?

1.2. Our Results

Ski Rental. In the ski-rental problem, each day the player has to decide whether to rent skis for that day, at a cost of one unit, or to buy skis for the rest of the season at a cost of b units. The goal is to minimize the total cost paid by the player. Ski-rental is a well-studied problem and admits an algorithm with competitive ratio e/(e−1), which is known to be tight (Karlin et al., 1994).

Before we present our result for the ski-rental setting, we note that it is not possible to improve over the existing competitive ratio bounds without distributional assumptions.

Observation 1.1. Fix any α < e/(e−1). There is no algorithm that, for every input distribution D, draws finitely many samples from D and returns an α-consistent strategy for the ski-rental problem.

We show that we can bypass this negative result when the input distributions are assumed to have some nice structure. In the previous literature on the ski-rental problem with predictions (Purohit et al., 2018; Gollapudi & Panigrahi, 2019; Wang et al., 2020), the advice is a guess for the exact number of days that we are going to ski. A distributional generalization would be to assume that the number of ski-days is distributed uniformly in some interval of the real line. More generally, a natural model for the number of ski-days is a log-concave distribution. Log-concave distributions are a general non-parametric class of distributions that have been used extensively as a model in statistics and machine learning (Bagnoli & Bergstrom, 2005). In particular, they include the uniform distribution on any interval, but also distributions with infinite support, including Gaussians and exponentials.

Definition 1.2 (Log-Concave Distributions). We say that a distribution is log-concave if its density can be written as p(x) = e^{f(x)} for some concave function f : R → R.

As our first main contribution, we show that under any log-concave distribution we can design consistent and robust algorithms for the ski-rental problem.

Result 1. For any λ > 1, there exists an algorithm that draws Õ(1/ε²) samples, runs in sample near-linear time, and outputs a (λ(1 + ε))-consistent and (λ/(λ−1))-robust strategy for ski-rental under log-concave distributions (see Theorem 2.9).

To prove the above result, we first show that we can obtain ε-additive strategies, i.e., strategies with cost at most OPT_D + ε, for general distributions; see Section 2.1. Next, we show that the optimal solution under log-concave distributions is structured: it either buys the skis initially or rents them indefinitely. Then, by an application of our result on ε-additive strategies, we derive a (1 + ε)-multiplicative strategy for log-concave distributions with tight sample complexity; see Section 2.2. Finally, by using an algorithm of Mahdian et al. (2012) in a black-box fashion, we obtain our main result for ski-rental. We also prove that the sample complexity of Result 1 is essentially optimal: in Section 2.4 we show that Ω(1/ε²) samples are necessary to obtain a (1 + ε)-consistent algorithm, even when the underlying distributions are exponentials.
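To make the structural claim about log-concave distributions concrete, here is a small numerical check (our illustration, not part of the paper): it evaluates the expected ski-rental cost E_{x∼D}[1{x < T}·x + 1{x ≥ T}·(b + T)], formalized in Section 2.1, of every buy time T on a grid, under an exponential (hence log-concave) model of the number of ski-days; the minimizer lands at T = 0 (buy immediately) or at the right end of the grid (a proxy for renting indefinitely).

```python
import numpy as np

# Numerical check (ours): under a log-concave distribution of ski-days,
# the best deterministic buy time should be T = 0 or T = "infinity".
rng = np.random.default_rng(0)
b = 10.0                                          # price of buying skis
days = rng.exponential(scale=15.0, size=100_000)  # log-concave ski-day model

def expected_cost(T: float) -> float:
    """Monte Carlo estimate of E[x * 1{x < T} + (b + T) * 1{x >= T}]."""
    rent = np.where(days < T, days, 0.0)   # season ended before buying
    buy = np.where(days >= T, b + T, 0.0)  # rented T days, then bought
    return float(np.mean(rent + buy))

grid = np.linspace(0.0, 200.0, 401)        # T = 200 ~ "rent indefinitely"
costs = [expected_cost(T) for T in grid]
best = grid[int(np.argmin(costs))]
print(f"best buy time on grid: {best:.1f}, cost: {min(costs):.3f}")
```

With these parameters the exponential's hazard rate is 1/15 < 1/b = 1/10, so by the case analysis in the proof of Theorem 2.5 the minimizer is T = 0 and the optimal cost is b = 10.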
Prophet Inequality. In the prophet inequality problem, there are n distributions D_1, . . . , D_n. A gambler knows both the distributions D_i and their order. On day i, the gambler observes a value X_i ∼ D_i and has to decide whether to accept and receive X_i as reward, or to reject it, losing X_i irrevocably, and continue with X_{i+1}. The gambler's goal is to maximize the gain. In the standard setting, the gambler competes against a prophet who knows the exact realizations X_1, . . . , X_n and therefore achieves gain E[max(X_1, . . . , X_n)]. In this setting, it is known that the best possible competitive ratio is 1/2 (Krengel & Sucheston, 1978), and it is achievable by simple stopping rules such as the median of max_i X_i, i.e., a value T such that Pr[max_i X_i > T] = 1/2 (Samuel-Cahn, 1984), or the threshold E[max_i X_i]/2 (Kleinberg & Weinberg, 2012).

In our distributional setting, the gambler does not know the distributions D_1, . . . , D_n exactly, but only has sample access to them. Similar to the ski-rental setting, the gambler competes against an adversary who knows the distributions D_1, . . . , D_n (but not their exact realizations X_1, . . . , X_n). As in Observation 1.1 for ski-rental, we cannot improve over the existing worst-case competitive ratio of 1/2 for the prophet inequality without distributional assumptions (see the full version of the paper for the formal statement). For log-concave distributions, we show:

Result 2. For any λ > 1, there exists an algorithm that draws Õ(n³/ε²) samples from the distributions D_1, . . . , D_n and, in sample near-linear time, outputs a ((1 + ε)λ)-consistent and (2λ/(λ−1))-robust strategy for the prophet inequality under log-concave distributions (see Theorem 3.3).

We remark that Ω(n) samples are necessary to guarantee any non-trivial error guarantee in the prophet inequality setting: if we do not observe at least one sample from each D_i, we may ignore a distribution with very high expected value. We did not attempt to optimize the sample complexity of our algorithm with respect to the number of distributions n; this is left as an interesting question for future work.

Stronger Sampling Models. We show that, assuming access to the stronger conditional sampling oracle introduced in (Canonne et al., 2015; Chakraborty et al., 2016), we can achieve a (1 + ε)-multiplicative guarantee for any distribution, without structural assumptions. We work in the ski-rental setting. Given an unknown distribution D supported on [0, +∞), for any subset S ⊆ [0, +∞), the conditional sampling oracle returns a sample distributed according to the conditional distribution of D restricted to S. In this paper, we use a weaker version of this oracle where the target sets S are of the form [α, +∞). Due to space limitations, we include the proof of the following result in the final version of the paper.

Result 3. For any input distribution D, O(1/ε⁴) conditional samples from D suffice to design a strategy A that achieves cost(A; D) ≤ (1 + ε)OPT_D.

We remark that, under some natural anti-concentration properties of the distribution, the conditional sampling access we require can be simulated via traditional samples and rejection sampling with a small overhead, as sketched below.
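The rejection-sampling simulation just mentioned is easy to sketch (our illustration; function names are ours): to draw from D conditioned on S = [α, +∞), repeatedly draw ordinary samples and keep the first one landing in S. The expected number of draws per conditional sample is 1/Pr_{X∼D}[X ≥ α], which is exactly where the anti-concentration assumption enters.

```python
import random

def conditional_sample(sample_d, alpha, max_tries=100_000):
    """Simulate one draw from D conditioned on [alpha, +inf) via rejection
    sampling. `sample_d` is any zero-argument function returning an
    ordinary sample from D. Expected cost: 1 / Pr[X >= alpha] draws."""
    for _ in range(max_tries):
        x = sample_d()
        if x >= alpha:
            return x
    raise RuntimeError("tail probability too small for rejection sampling")

# Example: D = Exp(1); conditioning on [2, +inf) succeeds with prob e^{-2}.
draws = [conditional_sample(lambda: random.expovariate(1.0), 2.0)
         for _ in range(5)]
print(draws)  # five samples from Exp(1) restricted to [2, +inf)
```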
1.3. Related Work

Online Algorithms with Advice. Incorporating machine learning advice in the design of online algorithms has received a lot of attention in recent years (Mahdian et al., 2012; Angelopoulos et al., 2015; Esfandiari et al., 2018; Lykouris & Vassilvitskii, 2018; Purohit et al., 2018; Gollapudi & Panigrahi, 2019; Kodialam, 2019; Indyk et al., 2020; Anand et al., 2020; Angelopoulos et al., 2020; Rohatgi, 2020; Lattanzi et al., 2020; Dütting et al., 2020; Lavastida et al., 2020; Antoniadis et al., 2020; Bamas et al., 2020). In particular, the ski-rental problem has served as one of the prominent standard test beds for most of the techniques and models developed in this area, e.g., (Purohit et al., 2018; Gollapudi & Panigrahi, 2019; Kodialam, 2019; Anand et al., 2020; Bamas et al., 2020). Algorithms with predictions have also been studied extensively in other domains, such as online learning (Alabi et al., 2019; Bhaskara et al., 2020), data structures (Kraska et al., 2018; Mitzenmacher, 2018; Dai & Shrivastava, 2019; Vaidya et al., 2021), streaming and sketching (Hsu et al., 2019; Indyk et al., 2019; Jiang et al., 2020; Cohen et al., 2020), and combinatorial problems (Gupta & Roughgarden, 2017; Dai et al., 2017; Balcan et al., 2017; 2018).

Comparison with (Anand et al., 2020). We provide a comparison of our framework to the recent result of Anand, Ge, and Panigrahi, who design a "learning to rent" framework. In their setting, the algorithm receives a feature vector x, which can be viewed as a prediction, and they assume an unknown underlying joint distribution K over the feature vector x and the actual number of ski-days y, i.e., (x, y) ∼ K. Their goal is then to learn a rule θ(x) that decides the number of days to rent; that is, they output a mapping from the prediction to a stopping time, and their objective is to minimize the expected value of the competitive ratios over the predictions x. In contrast, in our work we design a single strategy for the given unknown distribution, and our measure of efficiency is the competitive ratio of the expected costs. While the objective of both papers is to design learning-based algorithms under distributional information, the settings and approaches are quite different.

Table 1. Sample complexity of our algorithms in the different settings of ski-rental and prophet inequality. The first two columns concern (1 + ε)-multiplicative consistency and robustness. In ski-rental, b denotes the price of the skis. In the prophet inequality, for the ε-additive approximation guarantee we assume D_1, . . . , D_n are supported on [0, b].

| | General | Log-Concave | ε-Additive |
| --- | --- | --- | --- |
| Ski-Rental | Inapprox. | Õ(ε⁻²) (Theorem 2.9) | Õ(b²ε⁻²) |
| Prophet Inequality | Inapprox. | Õ(n³ε⁻²) (Theorem 3.3) | Õ(b²n²ε⁻²) |

2. Ski-Rental

In this section, our main goal is to prove Result 1. In Subsection 2.1 we show that we can obtain additive approximations to OPT_D under any distribution with Õ(1/ε²) samples. In Subsection 2.2 we show that, using the additive approximation result, we can obtain multiplicative approximations assuming that the underlying distribution is log-concave. In Subsection 2.4 we prove that Ω(1/ε²) samples are necessary to obtain (1 + ε)-multiplicative approximations for log-concave distributions.

2.1. Sample Complexity of Additive Approximation

In this subsection, we study the sample complexity of ε-additive approximation algorithms for ski-rental.
We first remark that any deterministic algorithm/strategy A for this problem is equivalent to a threshold T at which we decide to buy the skis. For example, if our strategy is to buy the skis initially, we have T = 0; at the other extreme, renting them indefinitely corresponds to T = ∞. Any randomized algorithm for this problem corresponds to a distribution over thresholds. The cost of an algorithm A corresponding to a distribution q over thresholds, with respect to an input distribution D, is then defined as¹

cost(A; D) = cost(q; D) := E_{T∼q} E_{x∼D}[1{x < T}·x + 1{x ≥ T}·(b + T)],

where the first term corresponds to the case where the actual number of ski-days x ∼ D is smaller than T: we just pay x for renting the skis on those days. The second term corresponds to x ≥ T, in which case we buy the skis on day T and pay b + T overall. We denote by OPT_D the cost achieved by the optimal threshold for a given distribution D, i.e.,

OPT_D = min_{T∈[0,+∞]} E_{x∼D}[1{x < T}·x + 1{x ≥ T}·(b + T)].

Observe that the optimal cost is attained by a deterministic threshold T.

¹We denote by 1{x < t} the indicator function that equals 1 for all x < t and 0 otherwise.

We first define the notion of ε-additive approximation algorithms.

Definition 2.1 (ε-Additive Approximation). Given an instance of ski-rental in which the number of ski-days is drawn from an unknown distribution D and the ski price is b, we say that an algorithm A achieves ε-additive approximation if cost(A; D) ≤ OPT_D + ε, where OPT_D denotes the optimal cost when D is known.

The main result of this section is the following theorem, which shows that for any ε > 0, Õ(b²/ε²) samples from D suffice to design an ε-additive approximation algorithm for ski-rental on D, without any assumption on the distribution D.

Theorem 2.2. Let D denote the distribution of the number of ski-days. There exists an algorithm (see Algorithm 1) that, for any ε, δ ∈ (0, 1], draws Õ(b²ε⁻² log(1/δ)) samples from D, runs in time linear in the number of samples, and, with probability at least 1 − δ, outputs a strategy A that satisfies cost(A; D) ≤ OPT_D + ε.

Proof. The proof follows from an application of the Dvoretzky–Kiefer–Wolfowitz (DKW) inequality (Dvoretzky et al., 1956). Let X_1, . . . , X_n be i.i.d. samples from D and let D_n denote the empirical distribution constructed from the sampled points. Formally, the cdf of D_n is defined as

F_n(x) = (1/n)·Σ_{i=1}^n 1{X_i ≤ x} for all x ∈ R.

By the DKW inequality, for n = Ω(b²ε⁻² log(1/δ)), with probability at least 1 − δ, it holds that d_K(D_n, D) ≤ ε/b. Recall that the Kolmogorov distance of distributions D and D′ with cdfs F and F′, respectively, is defined as d_K(D, D′) := sup_{x∈R} |F(x) − F′(x)|.

For any threshold θ ∈ R, let A_θ denote the algorithm that buys at time θ. First, we prove the following claim, which relates the cost of an algorithm of the form A_T for ski-rental on D to its cost on a distribution D′ with d_K(D, D′) ≤ ε/b. The proof of the claim can be found in the Supplementary Material.

Claim 2.3. Suppose that d_K(D, D′) ≤ ε/b. If there exist T and t such that cost(A_T; D′) ≤ OPT_{D′} + ε and cost(A_t; D) ≤ OPT_D + ε, then cost(A_T; D) ≤ OPT_D + O(ε·(T + t + b)/b).

Next, we bound the value T in Claim 2.3. In other words, we show that for any distribution there exists an ε-additive approximation strategy of the form A_t with t = O(b log(b/ε)).

Claim 2.4. For any D, we can find a t ≤ 2b log(b/ε) such that cost(A_t; D) ≤ OPT_D + O(ε).

Now we are ready to describe our algorithm (Algorithm 1).

Algorithm 1 Algorithm for ε-Additive Approximation
1: Input: i.i.d. samples X_1, . . . , X_n ∼ D
2: Output: A buying time T
3: D_n ← empirical distribution computed from X_1, . . . , X_n
4: T ← optimal buying time for ski-rental on D_n
5: return min{T, 2b log(b/ε)}
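A runnable sketch of Algorithm 1 (our illustration; the paper specifies only the pseudocode above): the empirical cost of buying at time T increases between sample points and can drop only just past a sample, so evaluating T = 0 and T just past each sample point recovers the empirical optimum, which is then capped at 2b·log(b/ε) as in Claim 2.4.

```python
import numpy as np

def additive_buy_time(samples: np.ndarray, b: float, eps: float) -> float:
    """Sketch of Algorithm 1 (ours). The empirical cost of buying at T is
    increasing between sample points, so the optimum over the empirical
    distribution is attained at T = 0 or just past some sample; evaluate
    those candidates and cap the result at 2*b*log(b/eps) (Claim 2.4)."""
    xs = np.sort(samples)
    n = len(xs)
    best_T, best_cost = 0.0, b     # T = 0: every realization pays b
    prefix = np.cumsum(xs)         # prefix[k-1] = sum of k smallest samples
    for k in range(1, n + 1):      # T just past the k-th smallest sample
        T = xs[k - 1]
        cost = (prefix[k - 1] + (n - k) * (b + T)) / n
        if cost < best_cost:
            best_T, best_cost = T, cost
    return min(best_T, 2 * b * np.log(b / eps))

rng = np.random.default_rng(1)
print(additive_buy_time(rng.exponential(15.0, 50_000), b=10.0, eps=0.1))
# prints a buy time near 0: buying immediately is near-optimal here
```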
By Claim 2.4, applied to D_n (recall that n = Θ(b² log(1/δ)/ε²)), we can find an ε-additive approximation strategy A_{θ_ε} for D_n with θ_ε ≤ 2b log(b/ε). Moreover, by another application of Claim 2.4, there exists t ≤ 2b log(b/ε) such that cost(A_t; D) ≤ OPT_D + ε. Hence, since d_K(D, D_n) ≤ ε/b with probability at least 1 − δ, an application of Claim 2.3 with D′ = D_n yields

cost(A_{θ_ε}; D) ≤ OPT_D + O(ε·(θ_ε + t + b)/b) ≤ OPT_D + O(ε log(b/ε)).

Hence, Õ(b²ε⁻² log(1/δ)) samples from the input suffice to design an ε-additive approximation strategy for ski-rental on D.

2.2. Multiplicative Guarantee for Log-Concave Distributions

In this subsection, we focus on the class of log-concave distributions and design an efficient algorithm that achieves a (1 + ε)-multiplicative approximation when the number of ski-days is drawn from a log-concave distribution. The main result of this section is the following theorem.

Theorem 2.5. Assume that the number of ski-days follows a log-concave distribution D. There exists an algorithm that, for any ε, δ ∈ (0, 1], draws Õ(log(1/δ)/ε²) samples from D, runs in time linear in the number of samples, and, with probability at least 1 − δ, outputs a strategy A such that cost(A; D) ≤ (1 + ε)OPT_D.

Proof. First, we show that the ε-additive approximation algorithm, Algorithm 1, can be used to design (1 + ε)-multiplicative approximation algorithms for distributions whose optimal strategy is either to buy initially or to rent indefinitely. The proof of the following claim can be found in the Supplementary Material.

Claim 2.6. Let D be a distribution whose optimal strategy is either to buy initially or to rent indefinitely. Fix ε, δ ∈ (0, 1]. There exists an algorithm that draws Õ(log(1/δ)/ε²) samples from D and, with probability at least 1 − δ, outputs a strategy A such that cost(A; D) ≤ (1 + ε)OPT_D.

Next, we show that the optimal strategy for log-concave distributions is indeed either to buy initially or to rent indefinitely. Let p_D and P_D denote the pdf and the cdf of D, respectively. Let q(·) be the pdf corresponding to a randomized strategy, where for each x > 0, q(x) denotes the probability density of stopping (buying) at time x. The cost of the strategy corresponding to q is

cost(A; D) = ∫_0^∞ q(x) [ ∫_0^x y·p_D(y) dy + (b + x)(1 − P_D(x)) ] dx .  (1)

Therefore, to minimize the cost, we should set q(·) to be a point mass on an x that minimizes

g(x) := ∫_0^x y·p_D(y) dy + (b + x)(1 − P_D(x)) .

To find the minimizer of g, we compute its derivative:

g′(x) = (1 − P_D(x)) − b·p_D(x) = (1 − P_D(x))·(1 − b·hr_D(x)),

where hr_D(x) = p_D(x)/(1 − P_D(x)) is the hazard rate of the distribution D. Since D is a log-concave distribution, its hazard rate hr_D(x) is an increasing function of x. We distinguish two cases.

1. If hr_D(x) < 1/b for all x ∈ R, then g(·) is an increasing function and the optimal strategy is to buy the skis initially. Similarly, if hr_D(x) > 1/b for all x ∈ R, then g is minimized at +∞, i.e., the optimal strategy is to rent the skis indefinitely.

2. If hr_D(x_0) = 1/b for some x_0 ∈ R, then g(·) is increasing on [0, x_0] and decreasing on [x_0, +∞). This means that g is minimized either at x = 0 or as x → +∞.

Therefore, when D is a log-concave distribution, the optimal strategy is always either to buy the skis initially or to rent them indefinitely. The result now follows from Claim 2.6.
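Given the structure just established, the learner behind Theorem 2.5 only needs to compare the two candidate optima: buying initially costs exactly b, while renting indefinitely costs E[X]. A minimal sketch (ours, under the assumption that the empirical mean is (1 ± ε)-accurate, which holds with O(1/ε²) samples for log-concave X by the variance bound Var[X] ≤ 16·E[X]² used later in Lemma 3.6):

```python
import numpy as np

def logconcave_ski_strategy(samples: np.ndarray, b: float) -> float:
    """Sketch (ours) of the learner behind Theorem 2.5: for log-concave
    ski-day distributions the optimum is T = 0 (cost b) or T = infinity
    (cost E[X]), so compare b against an estimate of E[X] and return the
    cheaper buy time."""
    mean_est = float(np.mean(samples))  # (1 +- eps)-accurate w.h.p. with
                                        # O(1/eps^2) samples, X log-concave
    return 0.0 if b <= mean_est else np.inf

rng = np.random.default_rng(2)
print(logconcave_ski_strategy(rng.exponential(15.0, 10_000), b=10.0))  # 0.0: buy
print(logconcave_ski_strategy(rng.exponential(5.0, 10_000), b=10.0))   # inf: rent
```

If the mean estimate is (1 ± ε)-accurate, picking the cheaper of the two options costs at most (1 + ε)·min(b, E[X]) = (1 + ε)·OPT_D.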
2.3. Consistency-Robustness Trade-Off of Ski-Rental in the Distributional Setting

Here we show that, for any λ > 1, we can achieve a (λ, λ/(λ−1)) consistency-robustness trade-off for ski-rental by applying, in a black-box fashion, the result of Mahdian et al. (2012) for the more general problem of online resource allocation. In online resource allocation, we are given a sequence of jobs J and a set of servers S. Jobs arrive one by one, and at each time t the task is to assign job J_t to a set of servers. Each server S_i has an activation cost c_i, and assigning a job to server S_i incurs a serving cost d_i. The goal is to find an assignment that minimizes the cost of activating servers plus the serving costs. The ski-rental problem is the special case with two servers, "buy" (S_1) with (c_1 = b, d_1 = 0) and "rent" (S_2) with (c_2 = 0, d_2 = 1), where each ski-day is a job.

Lemma 2.7 (Theorem 3.1 of Mahdian et al. (2012)). For any γ > 1, there exists an algorithm H(γ) that, given two algorithms A, B for an online resource allocation problem P, satisfies, for every instance I, cost(H(γ); I) ≤ min{γ·cost(A; I), (γ/(γ−1))·cost(B; I)}.

Corollary 2.8. Given two strategies A, B for ski-rental, for any instance I and any λ > 1, the cost of the strategy SKI-RENTAL(A, B, λ) (Algorithm 2) on instance I is at most min{λ·cost(A; I), (λ/(λ−1))·cost(B; I)}.

Algorithm 2 SKI-RENTAL: consistent and robust algorithm for ski-rental, adapted from (Mahdian et al., 2012). Here cost_t(·) denotes the cost a strategy has accumulated up to day t.
1: Input: Two strategies A, B for ski-rental and a parameter λ > 1
2: Output: A buying time T
3: t ← 0 {number of days}
4: while ski-day do
5:  t ← t + 1
6:  if cost_t(A) ≤ λ·cost_t(B) then
7:   if A buys at t, buy; otherwise, rent
8:  else
9:   if B buys at t, buy; otherwise, rent
10:  end if
11: end while

Now we are ready to prove the main result of this section, which is a formal restatement of Result 1.

Theorem 2.9. For any λ > 1 and ε, δ ∈ (0, 1], there exists an algorithm that draws Õ(log(1/δ)/ε²) samples from the input distribution D, runs in time linear in the number of samples, and, with probability at least 1 − δ, returns a (λ(1 + ε))-consistent and (λ/(λ−1))-robust strategy A for ski-rental under log-concave distributions; i.e., if D is a log-concave distribution, then cost(A; D) ≤ λ(1 + ε)·OPT_D, and otherwise cost(A; D) ≤ (λ/(λ−1))·OPT_ONL.

Proof. Let A_X be the strategy guaranteed by Theorem 2.5, computed from a set X of Õ(log(1/δ)/ε²) samples from D. Moreover, let A_ONL be a worst-case optimal algorithm for ski-rental (i.e., one with competitive ratio e/(e−1)). By Theorem 2.5, for log-concave distributions, with probability at least 1 − δ, cost(A_X; D) ≤ (1 + ε)·OPT_D. Then, by an application of Corollary 2.8, SKI-RENTAL(A_X, A_ONL, λ) satisfies the desired consistency and robustness guarantees.
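For concreteness, here is a sketch (ours) of the classical randomized worst-case strategy of Karlin et al. (1994) that can play the role of A_ONL above: for an integer buy price b, it buys at the start of day j ∈ {1, . . . , b} with probability proportional to (1 − 1/b)^{b−j}, which yields competitive ratio 1/(1 − (1 − 1/b)^b) → e/(e−1).

```python
import random

def karlin_buy_day(b: int) -> int:
    """Classical randomized ski-rental strategy (Karlin et al., 1994):
    buy at the start of day j in {1, ..., b}, chosen with probability
    proportional to (1 - 1/b)**(b - j). Its worst-case competitive ratio
    is 1/(1 - (1 - 1/b)**b), which tends to e/(e-1)."""
    weights = [(1 - 1 / b) ** (b - j) for j in range(1, b + 1)]
    return random.choices(range(1, b + 1), weights=weights)[0]

# Sanity check: on a season longer than b days the offline optimum is b,
# while this strategy pays (j - 1) rental days plus the buy price b.
b = 10
avg = sum(karlin_buy_day(b) - 1 + b for _ in range(200_000)) / 200_000
print(avg / b)  # ~1.54 for b = 10; approaches e/(e-1) ~ 1.582 as b grows
```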
2.4. Sample Complexity Lower Bound for Exponential Distributions

In this subsection, we bound from below the sample complexity of any algorithm that learns a strategy with cost at most (1 + ε)OPT_D under log-concave distributions. Interestingly, the lower bound holds even for exponential distributions, which form a (very small) subset of the log-concave distributions.

Theorem 2.10. Suppose that there exists an algorithm that, for any log-concave distribution D and any ε ∈ (0, 1], draws k samples from D and, with probability at least 2/3, outputs a strategy A such that cost(A; D) ≤ (1 + ε)OPT_D. Then k = Ω(1/ε²).

Proof. We reduce the above optimization problem to a distribution testing problem. In particular, we show that, given an algorithm A that satisfies the guarantees of Theorem 2.10, we can construct an algorithm that identifies the distribution that generated the samples. Set the buying cost of the ski-rental problem to b = 1. For our proof, we only need to consider two distributions: let D_1 be the exponential distribution with rate λ_1 = 1/(1 + 2ε) and D_2 the exponential distribution with rate λ_2 = 1/(1 − 2ε). Since exponential distributions are log-concave, their optimal strategy is either to buy initially or to rent indefinitely; see the proof of Theorem 2.5. In particular, the optimal strategy for D_1 is to buy initially, with OPT_{D_1} = 1, while the optimal strategy for D_2 is to rent indefinitely, with OPT_{D_2} = 1 − 2ε.

Let X_1, . . . , X_k be k samples from D_i and denote by q the distribution/strategy that algorithm A(X_1, . . . , X_k) outputs. Notice that this also captures randomized algorithms that decide to buy at time t with density q(t). We consider the following testing algorithm T: if cost(q; D_2) ≥ (1 − 2ε)·cost(q; D_1), output 1, i.e., declare that the distribution that generated the samples is D_1; otherwise, output 2.

Our proof crucially relies on the following claim, which shows that any strategy q that performs well under distribution D_1 cannot perform well under distribution D_2. We provide its proof in the Supplementary Material.

Claim 2.11. Let q be any distribution on [0, ∞] such that cost(q; D_1) ≤ (1 + ε)OPT_{D_1}. Then, for ε sufficiently small, it holds that cost(q; D_2) > (1 + ε)OPT_{D_2}.

Assume, for the sake of contradiction, that the testing algorithm T outputs 2 while the underlying distribution is D_1, i.e., that cost(q; D_2) < (1 − 2ε)·cost(q; D_1). By the guarantee for the strategy A, we have cost(q; D_1) ≤ (1 + ε)OPT_{D_1}. This implies

cost(q; D_2) < (1 − 2ε)·cost(q; D_1) ≤ (1 − 2ε)(1 + ε)·OPT_{D_1} = (1 − 2ε)(1 + ε) .

From Claim 2.11, we have cost(q; D_2) > (1 + ε)OPT_{D_2} = (1 + ε)(1 − 2ε). Therefore,

(1 − 2ε)(1 + ε) > cost(q; D_2) > (1 + ε)(1 − 2ε) ,

which is a contradiction. The case where the algorithm outputs 1 while the underlying distribution is D_2 is similar. Therefore, our testing algorithm can distinguish two exponential distributions whose means differ by O(ε). It is not hard to see that Ω(1/ε²) samples are required to distinguish these two exponential distributions; for the details, see the full proof in the Supplementary Material.

3. Prophet Inequality

In this section, we prove our result for prophet inequalities, Result 2. Any strategy in this setting can be described by a set of thresholds T_1, . . . , T_n; see Algorithm 3. Since we should always accept the last value X_n, the last threshold T_n is redundant and should always equal 0. To keep notation simple, we assume that this is the case and still work with n thresholds.

Algorithm 3 Threshold-Algorithm
1: Input: Thresholds {T_i}_{i∈[n]}, values {X_i ∼ D_i}_{i∈[n]}
2: Output: One of the values X_i
3: i ← 1
4: while X_i < T_i do i ← i + 1 end while
5: return X_i

We first define the gain of a strategy T_1, . . . , T_n.

Definition 3.1 (Gain of a Strategy). Fix distributions D_1, . . . , D_n and let T_1, . . . , T_n be a set of thresholds. We define gain(T_1, . . . , T_n; D_1, . . . , D_n) to be the expected output of Algorithm 3 with thresholds T_1, . . . , T_n. When it is clear from the context, we may drop the distributions and simply write gain(T_1, . . . , T_n).
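A direct transcription of Algorithm 3, together with a Monte Carlo estimate of the gain of Definition 3.1 (our illustration; the samplers are placeholder stand-ins for sample access to D_1, . . . , D_n):

```python
import random

def threshold_run(thresholds, values):
    """Algorithm 3: accept the first value that meets its threshold.
    With thresholds[-1] == 0 the last value is always accepted."""
    for t, x in zip(thresholds, values):
        if x >= t:
            return x
    return values[-1]  # unreachable when thresholds[-1] == 0

def estimate_gain(thresholds, samplers, trials=100_000):
    """Monte Carlo estimate of gain(T_1, ..., T_n; D_1, ..., D_n)."""
    total = 0.0
    for _ in range(trials):
        values = [s() for s in samplers]  # one realization X_i ~ D_i per day
        total += threshold_run(thresholds, values)
    return total / trials

# Two exponential prizes; T_2 = 0 and T_1 = E[X_2] = 2 as in Fact 3.2 below.
samplers = [lambda: random.expovariate(1.0), lambda: random.expovariate(0.5)]
print(estimate_gain([2.0, 0.0], samplers))  # ~2.14 (= 2 + e^{-2}) here
```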
The following fact gives the optimal way to choose the thresholds T_1, . . . , T_n given exact knowledge of D_1, . . . , D_n. For the proof, see the Supplementary Material.

Fact 3.2. Algorithm 3 is optimal with respect to D_1, . . . , D_n when the thresholds are chosen as T_n = 0, T_{n−1} = E_{X_n∼D_n}[X_n], and T_i = E_{X_{i+1}∼D_{i+1}}[max(X_{i+1}, T_{i+1})] for i < n − 1.

We use OPT to denote the gain of the optimal algorithm that knows the distributions; see Fact 3.2. For example, with two distributions D_1 and D_2, the optimal gain according to the fact above is OPT = E_{X_1∼D_1}[max(X_1, E_{X_2∼D_2}[X_2])].
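The learned thresholds of Lemma 3.4 below are, in essence, the empirical version of this recursion; a minimal sketch (ours; names are placeholders) replaces each expectation in Fact 3.2 with a sample average, computed backwards from day n:

```python
import numpy as np

def empirical_thresholds(samples):
    """Backward recursion of Fact 3.2 on samples: samples[i] holds i.i.d.
    draws from D_{i+1}. Returns T_1, ..., T_n with T_n = 0 and
    T_i = (empirical) E[max(X_{i+1}, T_{i+1})]."""
    n = len(samples)
    T = [0.0] * n
    for i in range(n - 2, -1, -1):  # days n-1, ..., 1 (0-indexed)
        T[i] = float(np.mean(np.maximum(samples[i + 1], T[i + 1])))
    return T

rng = np.random.default_rng(3)
samples = [rng.exponential(scale=s, size=10_000) for s in (1.0, 2.0, 1.5)]
print(empirical_thresholds(samples))  # e.g. T_3 = 0, T_2 ~ E[X_3] = 1.5
```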
We present below the main theorem of this section, i.e., the formal statement of Result 2.

Theorem 3.3. For any λ > 1 and ε, δ ∈ (0, 1], there exists a randomized algorithm that draws Õ(n³/ε²) samples from the distributions D_1, . . . , D_n and, in sample near-linear time, with probability at least 1 − δ, outputs a ((1 + ε)λ)-consistent and (2λ/(λ−1))-robust strategy for the prophet inequality under log-concave distributions.

The proof of Theorem 3.3 relies on two main components. The first is the following lemma, which shows that when the underlying distributions are log-concave, we can obtain multiplicative approximations to the optimal gain.

Lemma 3.4. Let X_1, . . . , X_n be non-negative independent random variables with log-concave densities. There is an algorithm that, for any ε, δ ∈ (0, 1], draws O((n²/ε²)·log(n/δ)) samples from each distribution D_i and computes, in near-linear sample time, a set of thresholds T_1, . . . , T_n such that, with probability at least 1 − δ, it holds that OPT/gain(T_1, . . . , T_n) ≤ 1 + ε.

Theorem 3.3 follows by combining our multiplicative approximation result of Lemma 3.4, which works under the assumption that the distributions are log-concave, with a recent result of (Rubinstein et al., 2019) for the prophet inequality problem showing that, with just one sample from each of D_1, . . . , D_n, we can obtain value (1/2)OPT_ONL.

Lemma 3.5 (Theorem 1 of (Rubinstein et al., 2019)). There is an algorithm that draws one sample from each of D_1, . . . , D_n (n samples in total), runs in sample linear time, and outputs a strategy A such that gain(A) ≥ (1/2)OPT_ONL.

We now proceed with the proof of Lemma 3.4.

Proof sketch of Lemma 3.4. Our proof crucially relies on the following lemma, which shows that, when the underlying distribution of a random variable X ≥ 0 is log-concave, we can obtain multiplicative estimates of thresholds, i.e., of quantities of the form E[max(X, c)], with a sample complexity independent of the support size of X.

Lemma 3.6. Let X be a non-negative random variable with a log-concave density and let c ≥ 0. Fix ε, δ ∈ (0, 1]. By drawing N = O((1/ε²)·log(1/δ)) samples, we can compute, in sample linear time, an estimate T̂ such that 1/(1 + ε) ≤ T̂/E[max(X, c)] ≤ 1 + ε.

Proof. Denote µ = E[max(X, c)]. We draw N i.i.d. samples X_1, . . . , X_N from the log-concave distribution and use the empirical average T̂ = (1/N)·Σ_{i=1}^N max(c, X_i). By Chebyshev's inequality, we obtain

Pr[|T̂ − µ| ≥ εµ] ≤ Var[T̂]/(ε²µ²) = Var[max(c, X)]/(N·ε²·µ²).  (2)

To bound Var[max(c, X)], let X′ be an independent copy of X. We have

Var[max(c, X)] = (1/2)·E[(max(c, X) − max(c, X′))²] ≤ (1/2)·E[(X − X′)²] = Var[X].

To bound the variance of X, we use the following reverse Hölder inequality for log-concave distributions; see, for example, (Lovász & Vempala, 2007).

Lemma 3.7 (Reverse Hölder for Log-Concave Measures). Let X be a random variable distributed according to some log-concave density on R. Then E[|X|^k] ≤ (2k)^k·(E[|X|])^k.

Since X is distributed according to a log-concave density, Lemma 3.7 with k = 2 gives Var[X] ≤ E[X²] ≤ 16·E[|X|]² = 16·E[X]², where the last equality follows from the fact that X is non-negative. Finally, we have E[X] ≤ E[max(c, X)] = µ and therefore, from Equation (2), we obtain

Pr[|T̂ − µ| ≥ εµ] ≤ 16/(N·ε²).

Therefore, with O(1/ε²) samples, with probability at least 2/3, it holds that |T̂/µ − 1| ≤ ε/2, which implies 1/(1 + ε) ≤ T̂/µ ≤ 1 + ε, using the fact that 1 − ε/2 ≥ 1/(1 + ε) for all ε ∈ [0, 1]. To amplify the success probability to 1 − δ for any δ > 0, we use the "median trick": repeat the above process M = O(log(1/δ)) times and keep the median of the estimates T̂_1, . . . , T̂_M. Since each estimate satisfies the error guarantee with probability at least 2/3, and the median can violate the guarantee only if at least half of T̂_1, . . . , T̂_M fall outside the interval [µ − εµ, µ + εµ], a Chernoff bound shows that the median fails with probability at most 2^{−Ω(M)} ≤ δ. Overall, with O((1/ε²)·log(1/δ)) samples we can compute an estimate T̂ such that 1/(1 + ε) ≤ T̂/µ ≤ 1 + ε.

We now return to the proof of Lemma 3.4. For simplicity, assume that we have two distributions D_1 and D_2. Let T̂ be a (1 + ε)-multiplicative approximation of the optimal threshold T = E[X_2], computed using O((1/ε²)·log(1/δ)) samples via Lemma 3.6 (with c = 0). Then

gain(T̂) = E[X_1·1{X_1 ≥ T̂} + X_2·1{X_1 < T̂}]
 = E[X_1·1{X_1 ≥ T̂} + E[X_2]·1{X_1 < T̂}]
 ≥ E[X_1·1{X_1 ≥ T̂} + (T̂/(1 + ε))·1{X_1 < T̂}]
 ≥ (1/(1 + ε))·E[max(X_1, T̂)]
 ≥ (1/(1 + ε)²)·OPT,

where we used E[X_2] = T ≥ T̂/(1 + ε) and, since T̂ ≥ T/(1 + ε), also max(X_1, T̂) ≥ max(X_1, T)/(1 + ε). A similar analysis for n distributions yields (1 + ε)^{2n}·gain(T̂_1, . . . , T̂_n) ≥ OPT. By setting ε′ = Θ(ε/n) and running the estimators with accuracy ε′, we get (1 + ε)·gain(T̂_1, . . . , T̂_n) ≥ OPT. For this value of ε′, Õ((n²/ε²)·log(1/δ)) samples from each D_i suffice to estimate the thresholds T̂_1, . . . , T̂_n. The full proof is deferred to the Supplementary Material.

Proof of Theorem 3.3. Let A be the randomized strategy that, with probability 1/λ, uses the (1 + ε)-multiplicative strategy for log-concave distributions (Lemma 3.4), and otherwise uses the worst-case algorithm for the online setting given in Lemma 3.5, which is (1/2)-competitive. If the underlying distributions are log-concave, the expected gain of A is at least OPT/((1 + ε)λ): with probability 1/λ the gain is at least OPT/(1 + ε), and it is non-negative otherwise. Thus, A is ((1 + ε)λ)-consistent. If the underlying distributions are arbitrary, the expected gain of A is at least (1 − 1/λ)·(1/2)·OPT_ONL. Thus, A is (2λ/(λ−1))-robust.

Similarly to the additive result for ski-rental, we remark that we can also obtain additive approximation guarantees in the prophet inequality setting. Here we assume nothing about the distributions D_1, . . . , D_n apart from their support being bounded in some interval [0, b].

Theorem 3.8. Let X_1, . . . , X_n be non-negative independent random variables with maximum value b > 0. There is an algorithm that, for any ε, δ ∈ (0, 1], draws Õ((nb²/ε²)·log(1/δ)) samples from each one and, in sample polynomial time, computes a set of thresholds T_1, . . . , T_n such that OPT − gain(T_1, . . . , T_n) ≤ ε.

Acknowledgements

This project was supported by NSF award #2006206. The authors would like to thank the anonymous reviewers for their insightful comments and suggestions.

References
Alabi, D., Kalai, A. T., Liggett, K., Musco, C., Tzamos, C., and Vitercik, E. Learning to prune: Speeding up repeated computations. In Conference on Learning Theory, pp. 30–33. PMLR, 2019.

Anand, K., Ge, R., and Panigrahi, D. Customizing ML predictions for online algorithms. In Proceedings of the 37th International Conference on Machine Learning, volume 119, pp. 303–313. PMLR, 2020.

Angelopoulos, S., Dürr, C., Kamali, S., Renault, M., and Rosén, A. Online bin packing with advice of small size. In Workshop on Algorithms and Data Structures, pp. 40–53. Springer, 2015.

Angelopoulos, S., Dürr, C., Kamali, S., Jin, S., and Renault, M. Online computation with untrusted advice. In 11th Innovations in Theoretical Computer Science Conference (ITCS 2020), volume 151, pp. 52:1. Schloss Dagstuhl–Leibniz-Zentrum für Informatik, 2020.

Antoniadis, A., Gouleakis, T., Kleer, P., and Kolev, P. Secretary and online matching problems with machine learned advice. Advances in Neural Information Processing Systems, 2020.

Bagnoli, M. and Bergstrom, T. Log-concave probability and its applications. Economic Theory, 26(2):445–469, 2005.

Balcan, M.-F., Nagarajan, V., Vitercik, E., and White, C. Learning-theoretic foundations of algorithm configuration for combinatorial partitioning problems. In Conference on Learning Theory, pp. 213–274. PMLR, 2017.

Balcan, M.-F., Dick, T., Sandholm, T., and Vitercik, E. Learning to branch. In International Conference on Machine Learning, pp. 344–353. PMLR, 2018.

Bamas, E., Maggiori, A., and Svensson, O. The primal-dual method for learning augmented algorithms. Advances in Neural Information Processing Systems, 2020.

Bhaskara, A., Cutkosky, A., Kumar, R., and Purohit, M. Online learning with imperfect hints. In International Conference on Machine Learning, pp. 822–831. PMLR, 2020.

Canonne, C. L., Ron, D., and Servedio, R. A. Testing probability distributions using conditional samples. SIAM Journal on Computing, 44(3):540–616, 2015.

Chakraborty, S., Fischer, E., Goldhirsh, Y., and Matsliah, A. On the power of conditional samples in distribution testing. SIAM Journal on Computing, 45(4):1261–1296, 2016.

Cohen, E., Geri, O., and Pagh, R. Composable sketches for functions of frequencies: Beyond the worst case. In International Conference on Machine Learning, pp. 2057–2067. PMLR, 2020.

Dai, H., Khalil, E. B., Zhang, Y., Dilkina, B., and Song, L. Learning combinatorial optimization algorithms over graphs. In Proceedings of the 31st International Conference on Neural Information Processing Systems, pp. 6351–6361, 2017.

Dai, Z. and Shrivastava, A. Adaptive learned Bloom filter (Ada-BF): Efficient utilization of the classifier. arXiv preprint arXiv:1910.09131, 2019.

Dütting, P., Lattanzi, S., Leme, R. P., and Vassilvitskii, S. Secretaries with advice. arXiv preprint arXiv:2011.06726, 2020.

Dvoretzky, A., Kiefer, J., and Wolfowitz, J. Asymptotic minimax character of the sample distribution function and of the classical multinomial estimator. The Annals of Mathematical Statistics, pp. 642–669, 1956.

Esfandiari, H., Korula, N., and Mirrokni, V. Allocation with traffic spikes: Mixing adversarial and stochastic models. ACM Transactions on Economics and Computation (TEAC), 6(3-4):1–23, 2018.

Gollapudi, S. and Panigrahi, D. Online algorithms for rent-or-buy with expert advice. In International Conference on Machine Learning, pp. 2319–2327, 2019.

Grove, E. F. Online bin packing with lookahead. In SODA, volume 95, pp. 430–436, 1995.

Gupta, R. and Roughgarden, T. A PAC approach to application-specific algorithm selection. SIAM Journal on Computing, 46(3):992–1017, 2017.
Hsu, C.-Y., Indyk, P., Katabi, D., and Vakilian, A. Learning-based frequency estimation algorithms. In International Conference on Learning Representations, 2019.

Indyk, P., Vakilian, A., and Yuan, Y. Learning-based low-rank approximations. arXiv preprint arXiv:1910.13984, 2019.

Indyk, P., Mallmann-Trenn, F., Mitrović, S., and Rubinfeld, R. Online page migration with ML advice. arXiv preprint arXiv:2006.05028, 2020.

Jiang, T., Li, Y., Lin, H., Ruan, Y., and Woodruff, D. P. Learning-augmented data stream algorithms. ICLR, 2020.

Karlin, A. R., Manasse, M. S., McGeoch, L. A., and Owicki, S. Competitive randomized algorithms for nonuniform problems. Algorithmica, 11(6):542–571, 1994.

Kleinberg, R. and Weinberg, S. M. Matroid prophet inequalities. In Proceedings of the Forty-Fourth Annual ACM Symposium on Theory of Computing, pp. 123–136, 2012.

Kodialam, R. Optimal algorithms for ski rental with soft machine-learned predictions. arXiv preprint arXiv:1903.00092, 2019.

Koutsoupias, E. and Papadimitriou, C. H. Beyond competitive analysis. SIAM Journal on Computing, 30(1):300–317, 2000.

Kraska, T., Beutel, A., Chi, E. H., Dean, J., and Polyzotis, N. The case for learned index structures. In Proceedings of the 2018 International Conference on Management of Data, pp. 489–504, 2018.

Krengel, U. and Sucheston, L. On semiamarts, amarts, and processes with finite value. Advances in Probability and Related Topics, 4:197–266, 1978.

Lattanzi, S., Lavastida, T., Moseley, B., and Vassilvitskii, S. Online scheduling via learned weights. In Proceedings of the Thirty-First Annual ACM-SIAM Symposium on Discrete Algorithms, pp. 1859–1877. SIAM, 2020.

Lavastida, T., Moseley, B., Ravi, R., and Xu, C. Learnable and instance-robust predictions for online matching, flows and load balancing. arXiv preprint arXiv:2011.11743, 2020.

Le Cam, L. Asymptotic Methods in Statistical Decision Theory. Springer Science & Business Media, 1987.

Lovász, L. and Vempala, S. The geometry of logconcave functions and sampling algorithms. Random Structures & Algorithms, 30(3):307–358, 2007.

Lykouris, T. and Vassilvitskii, S. Competitive caching with machine learned advice. In International Conference on Machine Learning, pp. 3296–3305, 2018.

Mahdian, M., Nazerzadeh, H., and Saberi, A. Online optimization with uncertain information. ACM Transactions on Algorithms (TALG), 8(1):1–29, 2012.

Mitzenmacher, M. A model for learned Bloom filters, and optimizing by sandwiching. In Proceedings of the 32nd International Conference on Neural Information Processing Systems, pp. 462–471, 2018.

Mitzenmacher, M. and Vassilvitskii, S. Algorithms with predictions. arXiv preprint arXiv:2006.09123, 2020.

Purohit, M., Svitkina, Z., and Kumar, R. Improving online algorithms via ML predictions. In Advances in Neural Information Processing Systems, pp. 9661–9670, 2018.

Raghavan, P. A statistical adversary for on-line algorithms. DIMACS Series in Discrete Mathematics and Theoretical Computer Science, 7:79–83, 1992.

Renault, M. P. and Rosén, A. On online algorithms with advice for the k-server problem. Theory of Computing Systems, 56(1):3–21, 2015.

Rohatgi, D. Near-optimal bounds for online caching with machine learned advice. In Proceedings of the Thirty-First Annual ACM-SIAM Symposium on Discrete Algorithms, pp. 1834–1845. SIAM, 2020.

Roughgarden, T. Beyond the worst-case analysis of algorithms (introduction). arXiv preprint arXiv:2007.13241, 2020.
Rubinstein, A., Wang, J. Z., and Weinberg, S. M. Optimal single-choice prophet inequalities from samples. arXiv preprint arXiv:1911.07945, 2019.

Samuel-Cahn, E. Comparison of threshold stop rules and maximum for independent nonnegative random variables. The Annals of Probability, 12(4):1213–1216, 1984.

Vaidya, K., Knorr, E., Kraska, T., and Mitzenmacher, M. Partitioned learned Bloom filter. International Conference on Learning Representations, 2021.

Wang, S., Li, J., and Wang, S. Online algorithms for multi-shop ski rental with machine learned advice. Advances in Neural Information Processing Systems, 33, 2020.