# Online Ad Allocation with Predictions

Fabian Spaeh, Department of Computer Science, Boston University, fspaeh@bu.edu
Alina Ene, Department of Computer Science, Boston University, aene@bu.edu

Display Ads and the generalized assignment problem are two well-studied online packing problems with important applications in ad allocation and other areas. In both problems, ad impressions arrive online and have to be allocated immediately to budget-constrained advertisers. Worst-case algorithms that achieve the ideal competitive ratio are known, but might act overly conservatively given the predictable and usually tame nature of real-world input. Given this discrepancy, we develop an algorithm for both problems that incorporates predictions and can thus improve the performance beyond the worst case. Our algorithm is based on the work of Feldman et al. (2009a) and similar in nature to Mahdian et al. (2007), who were the first to develop an algorithm with predictions for the related, but more structured AdWords problem. We use a novel analysis to show that our algorithm is able to capitalize on a good prediction while being robust against poor predictions. We experimentally evaluate our algorithm on synthetic and real-world data over a wide range of predictions. Our algorithm consistently outperforms the worst-case algorithm without predictions.

1 Introduction

Advertising on the internet is a multi-billion dollar industry with ever-growing revenue, especially as online retail gains more and more popularity. Typically, a user arrives on a website, which fills an empty advertising spot (called an impression) by allocating it instantly to one of many advertisers. Advertisers value users differently based on search queries or demographic data and reveal their valuations in auctions or through contracts with the website. Formulations of increasing complexity have been studied to capture this problem, creating a hierarchy of difficulty (Mehta, 2013). The most basic is online bipartite matching, where each vertex on one side arrives online with all its adjacent edges and has to be matched immediately to one of the vertices on the other side, which were supplied offline. The problem and all its generalizations admit a hard lower bound of $1 - 1/e$ on the competitive ratio due to the uncertainty about future vertices. Motivated by online ad exchanges, where advertisers place bids on impressions, Mehta et al. (2007) introduced the AdWords problem, a generalization of online bipartite matching where we charge each advertiser for the amount they bid. Going beyond the AdWords setting, Feldman et al. (2009a) considered the more expressive problems Display Ads and the generalized assignment problem (GAP), and proposed algorithms for these settings with worst-case guarantees. Classic algorithms that defend against the worst case of $1 - 1/e$ are often overly conservative given that the real world does not behave like a contrived worst-case instance. Recently, researchers have thus been trying to leverage a prediction about some problem parameter to go beyond the worst case (Mitzenmacher and Vassilvitskii, 2022). In the context of ad allocation, a prediction can be the keyword distribution of users on a certain day, or simply the advertiser allocation itself. Such a prediction is readily obtainable in practice, for example through learning on historic data.
Two opposing properties are important: The algorithm has to be consistent, meaning that its performance should improve with the prediction quality. Simultaneously, we want the algorithm to be robust against a poor prediction, i.e. not to decay completely but to retain some form of worst-case guarantee. This is particularly important in the advertising business, as much revenue is extracted from fat tails containing special events that are difficult or impossible to predict, but extremely valuable for the advertising business (e.g. advertising fan merchandise after a team's victory). To this end, Mahdian et al. (2007) developed an algorithm with predictions for the AdWords problem, and Medina and Vassilvitskii (2017) show how to use bid predictions to set reserve prices for ad auctions. Inspired by their work, we develop an algorithm with predictions for Display Ads and GAP.

Our Contributions: We design the first algorithms that incorporate predictions for the well-studied problems Display Ads and GAP. The two problems are general online packing problems that capture a wide range of applications. Our algorithm follows a primal-dual approach, which yields a combinatorial algorithm that is very efficient and easy to implement. It is able to leverage predictions from any source, such as predictions constructed from historical data. Using a novel analysis, we show that the algorithm is robust against bad predictions and able to improve its performance with good predictions. In particular, we are able to bypass the strong lower bound on the worst-case competitive ratio for these problems. We experimentally verify the practical applicability of our algorithm under various kinds of predictions on synthetic and real-world data sets. Here, we observe that our algorithm is able to outperform the baseline worst-case algorithm due to Feldman et al. (2009a) that does not use predictions, by leveraging predictions obtained from historic data, as well as predictions that are corrupted versions of the optimum allocation.

1.1 Preliminaries

Problem Definition: In this work, we study the Display Ads problem and its generalization, the generalized assignment problem (GAP) (Feldman et al., 2009a). In Display Ads, there are advertisers $a \in \{1, \dots, k\}$ that are known ahead of time, each of which is willing to pay for at most $B_a$ ad impressions. A sequence of ad impressions arrives online, one at a time, possibly in adversarial order. When impression $t$ arrives, the values $w_{at} \ge 0$ for each advertiser $a$ become known. These values might be a prediction of click-through probability or any valuation from the advertiser, but we treat them as abstractly given to the algorithm. We have to allocate $t$ immediately to an advertiser, or decide not to allocate it at all. The goal is to maximize the total value $\sum_{a,t} x_{at} w_{at}$ subject to $\sum_t x_{at} \le B_a$ for all $a$, where $x_{at} = 1$ if $t$ is allocated to $a$ and $x_{at} = 0$ otherwise. GAP is a generalization of Display Ads where the size that each impression takes up in the budget constraint of an advertiser is non-uniform. That is, each impression $t$ has a size $u_{at} \ge 0$ for each advertiser $a$, and advertiser $a$ is only willing to pay for a set of impressions whose total size is at most $B_a$. More precisely, we require that $\sum_t x_{at} u_{at} \le B_a$ for all $a$.
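To make the notation concrete, the following is a minimal Python sketch of a tiny Display Ads instance and the objective of a feasible allocation. The instance data and variable names are illustrative only and do not come from the paper.

```python
# A tiny Display Ads instance (illustrative numbers).
# Advertisers a in {0, 1}, impressions t in {0, 1, 2}; w[a][t] is advertiser a's value for t.
w = [[1.0, 0.5, 0.0],   # advertiser 0
     [0.2, 0.9, 0.7]]   # advertiser 1
B = [1, 2]              # advertiser a pays for at most B[a] impressions

# A feasible allocation x[a][t] in {0, 1}: impression 0 -> advertiser 0, impressions 1, 2 -> advertiser 1.
x = [[1, 0, 0],
     [0, 1, 1]]

# Feasibility: each impression is assigned at most once and budgets are respected.
assert all(sum(x[a][t] for a in range(2)) <= 1 for t in range(3))
assert all(sum(x[a]) <= B[a] for a in range(2))

# Objective: total allocated value (here 1.0 + 0.9 + 0.7 = 2.6).
print(sum(w[a][t] * x[a][t] for a in range(2) for t in range(3)))
```

In GAP, the budget check would instead compare the total size $\sum_t x_{at} u_{at}$ against $B_a$.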
Free Disposal Model: In general, it is not possible to achieve any competitive ratio for the online problems described above. Motivated by online advertising, Feldman et al. (2009a) introduced the free disposal model, which makes the problem tractable: when a new impression arrives, the algorithm allocates it to an advertiser $a$; if $a$ is out of budget, we can decide to dispose of an impression previously allocated to $a$. The motivation for this model is that advertisers are happy to receive more ads, as long as they are only charged for the $B_a$ most valuable impressions. We refer the reader to the paper of Feldman et al. (2009a) for additional motivation of this model. In this work, we consider both Display Ads and GAP in the free disposal model.

Related Problems: Display Ads and GAP are significant generalizations of well-studied problems such as online bipartite matching and AdWords. In online bipartite matching, all values are 1. In AdWords, values and sizes are identical. The latter setting allows for more specialized algorithms that exploit the special properties of this problem, as we discuss in more detail later.

Algorithms with Predictions: The algorithms we study fall under the umbrella of algorithms that leverage predictions to obtain improved performance. These have been studied in an extensive line of work; see e.g. the survey of Mitzenmacher and Vassilvitskii (2022). Following this established research, we use two important measures for the performance of the algorithm: The robustness ALG/OPT indicates how well the algorithm's objective value ALG performs against the optimum solution OPT; the consistency ALG/PRD measures how close the algorithm gets to the prediction's objective value PRD. Most algorithms with predictions, including the one presented in this work, allow us to control the trade-off between robustness and consistency with a parameter $\alpha$.
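Returning to the free disposal model above: in code, free disposal simply means that each advertiser keeps, and is charged for, only its $B_a$ most valuable impressions. Below is a minimal sketch of that bookkeeping with a min-heap; the helper name and values are hypothetical, not from the paper.

```python
import heapq

def assign_with_free_disposal(heap, budget, value):
    """Keep only the `budget` most valuable impressions assigned to one advertiser.

    `heap` is a min-heap of currently kept values; returns the disposed value
    (0.0 if nothing had to be removed). Illustrative helper, not the paper's code.
    """
    heapq.heappush(heap, value)
    if len(heap) <= budget:
        return 0.0
    return heapq.heappop(heap)  # dispose of the least valuable impression

kept = []
for v in [0.3, 0.9, 0.1, 0.7]:            # impression values arriving online
    assign_with_free_disposal(kept, budget=2, value=v)
print(sorted(kept))                        # [0.7, 0.9]: the advertiser pays only for the top 2
```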
1.2 Related Work

Online Ad Allocation with Predictions: To the best of our knowledge, we are the first to study Display Ads and GAP with predictions. Related problems were considered in the work by Mahdian et al. (2007) and Medina and Vassilvitskii (2017) for AdWords, and by Lattanzi et al. (2020) for online capacitated bipartite matching. Medina and Vassilvitskii (2017) use bid predictions to set reserve prices for ad auctions. Lavastida et al. (2021) incorporate predictions of the dual variables into the proportional-weights algorithm (Agrawal et al., 2018; Karp et al., 1990; Lattanzi et al., 2020) for online capacitated bipartite matching. Chen and Indyk (2021) analyze an algorithm that uses degree predictions by matching arriving vertices to vertices with minimum predicted degree. Jin and Ma (2022) provide an algorithm for batched allocation in bipartite matching under predictions, based on the model of Feng et al. (2021). As noted above, AdWords and bipartite matching have additional structure, which is exploited in these prior works. In particular, the algorithms proposed in these works are not applicable to the more general problems Display Ads and GAP. Our algorithm builds on the approach of Mahdian et al. (2007) for the AdWords problem, but substantial new ideas are needed in the algorithm design and analysis, as discussed in more detail in Section 2. There has been further extensive work in the design of worst-case algorithms and under random input models without predictions, which we now summarize.

Worst-Case Algorithms: The design of worst-case algorithms has been the focus of a long line of work which can, for instance, be found in the survey of Mehta (2013). A large focus has been on AdWords. Several combinatorial algorithms have been proposed, based on the work of Karp et al. (1990). The combinatorial approach is tailored to the structure of these special cases. The primal-dual approach is more general and can handle more complex problems such as Display Ads and GAP (Buchbinder et al., 2007; Feldman et al., 2009a). In this work, we build on the primal-dual algorithm of Feldman et al. (2009a) and show how to incorporate predictions into their framework. The worst-case guarantee for online bipartite matching, and therefore for all generalizations, is $1 - 1/e$ (Karp et al., 1990).

Stochastic Algorithms: The lower bound of $1 - 1/e$ can be circumvented under distributional assumptions. This has been extensively studied for online bipartite matching (Karande et al., 2011; Feldman et al., 2009b; Jin and Williamson, 2022). Further work has been done for the AdWords problem (Devanur and Hayes, 2009; Devanur et al., 2012), with generalizations due to Feldman et al. (2010) for a more general stochastic packing problem.

2 Our Algorithm

In order to illustrate the algorithmic ideas and analysis, we consider the simpler setting of Display Ads in this section. Our algorithm for GAP is a generalization of this algorithm and we include it in Appendix B. For simplicity, we assume that our prediction is a solution to the problem, as in prior work (Mahdian et al., 2007). However, due to our general analysis framework, we also consider our algorithm a starting point towards incorporating weaker predictors, such as partial solutions or predictions of the supply, which we leave for future work.

Prediction: We assume that we are given access to a prediction, which is a fixed solution to the problem, given as an allocation of impressions to advertisers. With each impression $t$, we also receive the advertiser $\mathrm{PRD}(t)$ to which the prediction allocates $t$. In particular, this means that the prediction does not have to be created up front, but can be adjusted on the fly based on the observed impressions.

Given a solution to the problem, we could also consider the following random-mixture algorithm: For some parameter $q \in [0, 1]$, with probability $q$ run the worst-case algorithm, and with probability $1 - q$ follow the prediction exactly. This algorithm achieves a robustness of $q(1 - 1/e)$ and a consistency of $q(1 - 1/e) + 1 - q$. However, this holds only in expectation and against a weak adversary that is oblivious to the algorithm's random choices. In contrast, Algorithm 1 is designed to obtain its guarantees deterministically against the strongest possible adversary that can adapt to the algorithm's choices, which is identical to the setup in Mahdian et al. (2007). We observe that Algorithm 1 clearly outperforms this random-mixture algorithm in our experiments (cf. Section 4), which shows that our stronger setting is indeed valuable in practice. Furthermore, the random-mixture algorithm cannot be adapted to different predictors that are not solutions, such as the ones mentioned above.
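As a point of reference, here is a minimal sketch of the random-mixture baseline just described. The interface is hypothetical: `worst_case_step` stands in for the worst-case algorithm of Feldman et al. (2009a) and `prediction` for the predicted allocation.

```python
import random

def random_mixture(impressions, q, worst_case_step, prediction):
    """Random-mixture baseline: with probability q run the worst-case algorithm,
    with probability 1 - q follow the prediction exactly (sketch, not the paper's code)."""
    follow_worst_case = random.random() < q    # a single coin flip for the whole sequence
    allocation = {}
    for t in impressions:
        if follow_worst_case:
            allocation[t] = worst_case_step(t)   # advertiser chosen by the worst-case algorithm
        else:
            allocation[t] = prediction(t)        # advertiser suggested by the prediction
    return allocation
```

Its guarantees of $q(1 - 1/e)$ robustness and $q(1 - 1/e) + 1 - q$ consistency hold only in expectation over this single coin flip, which is exactly the weakness discussed above.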
Algorithm Overview: Our algorithm, shown in Algorithm 1, incorporates predictions into the primal-dual algorithm of Feldman et al. (2009a). The algorithm is based on the primal and dual LP formulations in Figure 1.

Display Ads Primal:
$$\max \sum_{a,t} w_{at} x_{at} \quad \text{s.t.} \quad \sum_t x_{at} \le B_a \;\;\forall a, \qquad \sum_a x_{at} \le 1 \;\;\forall t, \qquad x_{at} \ge 0 \;\;\forall a, t$$

Display Ads Dual:
$$\min \sum_a B_a \beta_a + \sum_t z_t \quad \text{s.t.} \quad z_t \ge w_{at} - \beta_a \;\;\forall a, t, \qquad \beta_a \ge 0 \;\;\forall a, \qquad z_t \ge 0 \;\;\forall t$$

Figure 1: Primal and dual of the Display Ads LP.

Algorithm 1: Exponential Averaging with Predictions
1: Input: Robustness-consistency trade-off parameter $\alpha \in [1, \infty)$, advertiser budgets $B_a \in \mathbb{N}$
2: Define the constants $B := \min_a B_a$, $e_B := \left(1 + \frac{1}{B}\right)^B$, and $\alpha_B := B\left(e_B^{\alpha/B} - 1\right)$
3: For each advertiser $a$, initialize $\beta_a \leftarrow 0$ and allocate $B_a$ zero-value impressions
4: for all arriving impressions $t$ do
5:   $a^{(\mathrm{PRD})} \leftarrow \mathrm{PRD}(t)$
6:   $a^{(\mathrm{EXP})} \leftarrow \arg\max_a \{w_{at} - \beta_a\}$
7:   if $\alpha_B\left(w_{a^{(\mathrm{PRD})}t} - \beta_{a^{(\mathrm{PRD})}}\right) \ge w_{a^{(\mathrm{EXP})}t} - \beta_{a^{(\mathrm{EXP})}}$ then
8:     $a \leftarrow a^{(\mathrm{PRD})}$
9:   else
10:    $a \leftarrow a^{(\mathrm{EXP})}$
11:  end if
12:  Allocate $t$ to $a$ and remove the least valuable impression currently assigned to $a$
13:  Let $w_1 \le w_2 \le \dots \le w_{B_a}$ be the values of impressions currently assigned to $a$ in non-decreasing order
14:  Update $\beta_a \leftarrow \frac{e_{B_a}^{\alpha/B_a} - 1}{e_{B_a}^{\alpha} - 1} \sum_{i=1}^{B_a} w_i\, e_{B_a}^{\alpha(B_a - i)/B_a}$
15: end for

The algorithm constructs both a primal integral solution, which is an allocation of impressions to advertisers, as well as a dual solution, given explicitly by the dual variables $\beta_a$. Analogously to other algorithms with predictions, the algorithm takes as parameter the value $\alpha$; a larger value of $\alpha$ means that we trust the prediction more. Similarly to the worst-case algorithm of Feldman et al. (2009a), the dual variables $\beta_a$ play an important role in the allocation of impressions. When an impression $t$ arrives, we evaluate the discounted gain $w_{at} - \beta_a$ for each advertiser. The worst-case algorithm allocates the impression to the advertiser $a^{(\mathrm{EXP})}$ maximizing the discounted gain, and only allocates if the discounted gain is positive, i.e. the value exceeds the threshold $\beta_a$. Our algorithm modifies this base algorithm to incorporate predictions as shown in Line 7: it follows the prediction $a^{(\mathrm{PRD})}$ if its discounted gain is a sufficiently high fraction of the discounted gain of $a^{(\mathrm{EXP})}$. We refer the reader to the discussion below for more intuition on the choice of update. After selecting the advertiser $a$ to which to allocate the impression $t$, we remove the least valuable impression currently assigned to $a$ to make space for $t$, and then allocate $t$ to $a$. Another crucial part of the algorithm is the update rule for $\beta_a$ in Line 14, which is updated in a novel way based on the parameter $\alpha$. More precisely, we set $\beta_a$ as an exponential average of the values of impressions currently allocated to $a$. Compared to the worst-case algorithm, we assign higher weight to impressions with less value, which is essential for leveraging predictions. To simplify the algorithm description and analysis, we initially allocate $B_a$ impressions of zero value to each advertiser $a$. Furthermore, we assume that there exists a dummy advertiser with large budget that only receives zero-value impressions. Instead of not allocating an impression explicitly (either in the algorithm or the prediction), we allocate to the dummy advertiser instead.
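A compact Python sketch of Algorithm 1 may help to fix ideas. The data structures and interface (`values`, `predict`) are ours, and the dummy advertiser is omitted for brevity; this is a sketch of the pseudocode above, not the authors' implementation.

```python
import heapq

def exponential_averaging_with_predictions(impressions, values, budgets, predict, alpha):
    """Sketch of Algorithm 1 (Display Ads). values[t][a] is w_at, predict(t) returns PRD(t)."""
    B = min(budgets.values())
    e_B = (1.0 + 1.0 / B) ** B
    alpha_B = B * (e_B ** (alpha / B) - 1.0)

    beta = {a: 0.0 for a in budgets}
    kept = {a: [0.0] * budgets[a] for a in budgets}   # B_a zero-value impressions per advertiser
    for a in kept:
        heapq.heapify(kept[a])                        # min-heaps of currently kept values

    for t in impressions:
        a_prd = predict(t)
        a_exp = max(budgets, key=lambda a: values[t][a] - beta[a])
        if alpha_B * (values[t][a_prd] - beta[a_prd]) >= values[t][a_exp] - beta[a_exp]:
            a = a_prd                                  # follow the prediction (Line 7)
        else:
            a = a_exp
        heapq.heappushpop(kept[a], values[t][a])       # allocate t, dispose of least valuable
        # Exponential averaging update of the dual variable beta_a (Line 14).
        Ba = budgets[a]
        e_Ba = (1.0 + 1.0 / Ba) ** Ba
        w = sorted(kept[a])                            # w_1 <= ... <= w_Ba
        beta[a] = (e_Ba ** (alpha / Ba) - 1.0) / (e_Ba ** alpha - 1.0) * sum(
            w[i - 1] * e_Ba ** (alpha * (Ba - i) / Ba) for i in range(1, Ba + 1))
    return kept, beta
```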
Intuition for our Algorithm: As noted above, we make two crucial modifications to the worst-case algorithm to incorporate predictions: the advertiser selection (Line 7) and the update of $\beta_a$ (Line 14). We now provide intuition for both choices.

First, let us illustrate the difficulties in incorporating predictions into the advertiser selection. Based on the worst-case algorithm, a natural way to incorporate predictions is to allocate to the prediction if the distorted gain $w_{a^{(\mathrm{PRD})}t} - \frac{1}{\alpha}\beta_{a^{(\mathrm{PRD})}}$ exceeds the maximum discounted gain. However, this approach does not work, as shown by the following example: Consider a scenario where the prediction suggests a constant advertiser $a^{(\mathrm{PRD})}$. Impressions are split into two phases: Phase 1 contains $B_{a^{(\mathrm{PRD})}}$ impressions $t$ where only $a^{(\mathrm{PRD})}$ can derive a value, with $w_{a^{(\mathrm{PRD})}t} = 1$ and $w_{at} = 0$ for $a \ne a^{(\mathrm{PRD})}$. The algorithm allocates all these impressions to $a^{(\mathrm{PRD})}$ and at the end of phase 1 has $\beta_{a^{(\mathrm{PRD})}} = 1$. In phase 2, a large number of impressions with $w_{a^{(\mathrm{PRD})}t} = 1$ and $w_{at} = \frac{\alpha - 1}{2\alpha}$ for $a \ne a^{(\mathrm{PRD})}$ arrive. Since the distorted gain $w_{a^{(\mathrm{PRD})}t} - \frac{1}{\alpha}\beta_{a^{(\mathrm{PRD})}} = \frac{\alpha - 1}{\alpha}$ exceeds the discounted gain $w_{at} - \beta_a = \frac{\alpha - 1}{2\alpha}$ for $a \ne a^{(\mathrm{PRD})}$, the algorithm allocates to $a^{(\mathrm{PRD})}$, which yields 0 gain. However, we forfeit an unbounded amount of potential value derived from allocating to advertisers $a \ne a^{(\mathrm{PRD})}$. An important takeaway of this example is the crucial observation that the algorithm should never allocate to the predicted advertiser if its discounted gain is 0. The selection rule in our algorithm is designed to meet this important consideration.

Second, we need to change the update rule for $\beta_a$. We update $\beta_a$ using a carefully selected exponential average of the values of impressions currently assigned to $a$ that incorporates our confidence in the prediction, parameterized by $\alpha$. In contrast to the worst-case algorithm, we weigh less valuable impressions more. This lowers the threshold for the addition of new impressions, which allows us to exploit more potential gain from the predicted advertiser.

Theorem 1. Let $B := \min_a B_a$ and $e_B := \left(1 + \frac{1}{B}\right)^B$. Let OPT and PRD be the values of the optimal and predicted solutions, respectively. For any $\alpha \ge 1$, Algorithm 1 obtains a value of at least
$$\mathrm{ALG} \ge \max\left\{R(\alpha)\,\mathrm{OPT},\; C(\alpha)\,\mathrm{PRD}\right\},$$
where the robustness is
$$R(\alpha) := \frac{e_B^{\alpha} - 1}{B\left(e_B^{\alpha/B} - 1\right)e_B^{\alpha}}$$
and the consistency is
$$C(\alpha) := \left(1 + \frac{1}{e_B^{\alpha} - 1}\max\left\{\left(1 - \frac{1}{\alpha_B}\right)\frac{e_B^{\alpha}}{e_B^{\alpha} - 1},\; \ln\left(e_B^{\alpha}\right)\right\}\right)^{-1}.$$
As $B \to \infty$, we have $e_B \to e$ and $\alpha_B \to \alpha$, so in particular $R(\alpha) \to \frac{e^{\alpha} - 1}{\alpha e^{\alpha}}$.

Figure 2: We illustrate the consistency-robustness trade-off of Algorithm 1 for various values of $\alpha$ and budgets $B \in \{2, 10, 20, \infty\}$ (consistency and robustness each plotted against $\alpha$).

Figure 2 shows the above trade-off between consistency and robustness. We can observe that the guarantee rapidly improves as the minimum advertiser budget $B$ increases, which is very beneficial both in theory and in practice. The trade-off is comparable to the one obtained by Mahdian et al. (2007) for the more structured AdWords problem.
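For a quick numerical feel for the robustness guarantee in Theorem 1, the few lines below evaluate $R(\alpha)$ for several budgets. This is a small sanity check of ours, not the code used to produce Figure 2.

```python
def robustness(alpha, B):
    """R(alpha) from Theorem 1: (e_B^alpha - 1) / (B (e_B^(alpha/B) - 1) e_B^alpha)."""
    e_B = (1.0 + 1.0 / B) ** B
    return (e_B ** alpha - 1.0) / (B * (e_B ** (alpha / B) - 1.0) * e_B ** alpha)

for B in (2, 10, 20, 10**6):              # 10**6 approximates B -> infinity
    print(B, round(robustness(1.0, B), 3), round(robustness(5.0, B), 3))
# At alpha = 1 the bound equals 1 - 1/e_B (about 0.632 for large B), the worst-case guarantee;
# increasing alpha lowers R(alpha) and in exchange improves the consistency C(alpha).
```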
Comparison to Prior Work: The work most closely related to ours is that of Mahdian et al. (2007), which incorporates predictions into the algorithm of Mehta et al. (2007). These works are for the related but different AdWords problem, and do not apply to Display Ads and GAP. In the AdWords problem, the value of an impression is equal to the price that the advertiser pays for it, i.e., the size of the impression in the budget constraint. In contrast, in Display Ads and GAP, the values and the sizes are independent of each other (e.g., in Display Ads an impression takes up 1 unit of space in the advertiser's budget but it can accrue an arbitrary value). Thus there is no longer any relationship between the total value of the impressions and the amount of the advertiser's budget that has been exhausted. The AdWords algorithms of Mehta et al. (2007) and Mahdian et al. (2007) crucially rely on this relationship both in the algorithm and in the analysis. Due to the special structure of the problem, these algorithms do not dispose of impressions and consider only the fraction of the advertiser's budget that has been filled up in order to decide the allocation and to incorporate the prediction. Moreover, the algorithm and the analysis do not need to account for the loss incurred by disposing of impressions. These crucial differences require a new algorithmic approach and analysis, which was given by Feldman et al. (2009a) using a primal-dual approach. Since we build on their framework as opposed to the work of Mehta et al. (2007), we also need a new approach for incorporating predictions, as described above in the intuition for our algorithm. Further, the primal-dual framework only helps in establishing the robustness but not the consistency of our algorithm, and we develop a novel combinatorial analysis for proving the consistency.

In the following, we outline the analysis of Algorithm 1 to prove Theorem 1. Specifically, we show separately that $\mathrm{ALG}/\mathrm{OPT} \ge R(\alpha)$ (robustness) and $\mathrm{ALG}/\mathrm{PRD} \ge C(\alpha)$ (consistency).

Notation: We denote with superscript $(t)$ the value of variables after allocating impression $t$. E.g. $a^{(t)}$ is the algorithm's choice of advertiser for impression $t$ and $\beta_a^{(t)}$ is the dual variable after allocating $t$. Let $X_a := \{t : a^{(t)} = a\}$ and $P_a := \{t : a^{(t)}_{(\mathrm{PRD})} = a\}$ be the impressions that were assigned to $a$ (and potentially disposed of), and the impressions that the prediction recommended for assignment to $a$, respectively. We set $I_a := |X_a|$ and $\ell_a := |P_a \cap X_a|$ as the size of the overlap. Let also $T$ be the last impression and $S_a$ be the final allocation, i.e., the $B_a$ impressions allocated to $a$ at the end of the algorithm. Finally, let ALG and PRD be the total values of the solutions created by the algorithm and the prediction, respectively.

Robustness: Our proof of robustness closely follows the analysis in Feldman et al. (2009a) by using the primal-dual formulation of the problem, with some additional care needed not to violate dual feasibility whenever we follow the prediction. We defer the full proof to Appendix A.1.

Consistency: We now show the consistency, i.e. that PRD is bounded by a multiple of ALG. The complete analysis can be found in Appendix A.2; we only give a high-level overview here. A common approach in the analysis of online primal-dual algorithms is to employ a local analysis where, in each iteration, we relate the change in the value of the primal solution to the change in the dual solution (Buchbinder and Naor, 2009). However, it is not clear how to employ such a strategy in our setting due to the complications arising from our algorithm following a mixture of the worst-case and predicted solutions. We overcome this challenge using a novel global analysis that relates the final primal value to the prediction's value.

We now provide a high-level overview of our global analysis. We start by noting that the objective values of our algorithm and of the prediction are the sums of the impression values allocated to each advertiser, i.e.
$$\mathrm{ALG} = \sum_a \sum_{t \in S_a} w_{at} \quad \text{and} \quad \mathrm{PRD} = \sum_a \sum_{t \in P_a} w_{at}.$$
However, note that PRD contains values of impressions that do not appear in ALG, since we ignore the prediction in some iterations, or already disposed of the impression. It is further unclear which advertiser to charge for an impression that does not agree with the prediction. Consider an impression for which we followed the worst-case choice $a^{(\mathrm{EXP})}$ that maximizes the discounted gain instead of the prediction.
Due to our selection rule, the reason for this departure is that the discounted gains satisfy the following key inequality:
$$w_{a^{(\mathrm{PRD})}t} < \frac{1}{\alpha_B}\left(w_{a^{(\mathrm{EXP})}t} - \beta_{a^{(\mathrm{EXP})}}\right) + \beta_{a^{(\mathrm{PRD})}}. \tag{1}$$
By using this important relationship, we upper bound the value PRD of the prediction using a linear combination of the values of impressions allocated by the algorithm (but possibly disposed of) and the thresholds. By grouping the impression values and dual variables by advertiser in the resulting upper bound, we are able to correctly charge each impression for which we deviated from the prediction to a suitable advertiser, thus overcoming one of the challenges mentioned above. To summarize, using (1) we obtain a bound $\mathrm{PRD} \le \sum_a \mathrm{PRD}_a$ where each $\mathrm{PRD}_a$ is a linear combination of impression values and dual variables for advertiser $a$, and we want to compare this quantity to $\mathrm{ALG}_a := \sum_{t \in S_a} w_{at}$.

Let us now consider a fixed advertiser $a$, and relate $\mathrm{PRD}_a$ to $\mathrm{ALG}_a$. At this point, a key difficulty is that the amount $\mathrm{PRD}_a$ that we charged to advertiser $a$ involves the threshold $\beta_a$. By definition, the threshold is a convex combination of the values of the impressions in the algorithm's allocation. This gives us that $\mathrm{PRD}_a$ is a linear combination of only the weights, but this cannot be readily compared to $\mathrm{ALG}_a$ due to the complicated structure of the coefficients in the former. To bridge this gap, we show a useful structural property (Lemma 4) that gives us the following upper bound on $\mathrm{PRD}_a$: If we define $t_i$ as the $i$-th impression allocated to $a$, we have
$$\mathrm{PRD}_a \le \sum_{i=I_a - B_a + 1}^{I_a - \ell_a} \phi_i\, w_{at_i} + \sum_{i=I_a - \ell_a + 1}^{I_a} \psi_i\, w_{at_i} + w_{at_{I_a - B_a}}\,\Omega_a \tag{2}$$
for appropriate coefficients $\phi_i$, $\psi_i$, and $\Omega_a$. The right-hand side of this inequality accounts for the value as follows: the first sum is for the impressions that disagree with the prediction; the second sum is for the impressions that agree with the prediction; the final term accounts for the values of all impressions that were disposed of. As this is (almost) a linear combination over impression values that all appear in ALG (except $w_{at_{I_a - B_a}}$), we could bound the ratio $\mathrm{PRD}_a/\mathrm{ALG}_a$ by the maximum coefficient in (2). However, this does not lead to a constant competitive ratio. We therefore need a delicate analysis (Lemma 10) to balance the coefficients as uniformly as possible among all values, where we use properties of the coefficients $\phi_i$, $\psi_i$, and $\Omega_a$ and the structural property we derived in Lemma 4.

In Appendix B, we discuss how to generalize Algorithm 1 and its analysis to GAP, where impressions can take up arbitrary size $u_{at}$ in the budget constraints. This generalization uses a gain $w_{at} - u_{at}\beta_a$ that considers the size of each impression, and the comparison of the worst-case and predicted choices is updated correspondingly. We also change the update of $\beta_a$ to consider the impression density $w_{at}/u_{at}$. We also note that our technique can be used to achieve a slight improvement in the guarantees, but we omit this in the interest of simplicity and conciseness.
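As stated above, the threshold $\beta_a$ is a convex combination of the values currently allocated to $a$: the exponential-averaging coefficients in Line 14 of Algorithm 1 sum to 1 for any $\alpha$ and $B_a$. The following quick numeric check (ours, not from the paper) verifies this and the fact that less valuable impressions receive larger weight.

```python
def averaging_coefficients(alpha, Ba):
    """Coefficients multiplying w_1 <= ... <= w_Ba in the update of beta_a (Line 14 of Algorithm 1)."""
    e_Ba = (1.0 + 1.0 / Ba) ** Ba
    pref = (e_Ba ** (alpha / Ba) - 1.0) / (e_Ba ** alpha - 1.0)
    return [pref * e_Ba ** (alpha * (Ba - i) / Ba) for i in range(1, Ba + 1)]

for alpha, Ba in [(1.0, 5), (3.0, 10), (10.0, 2)]:
    c = averaging_coefficients(alpha, Ba)
    assert abs(sum(c) - 1.0) < 1e-9                          # a convex combination of the kept values
    assert all(c[i] >= c[i + 1] for i in range(len(c) - 1))  # less valuable impressions weigh more
```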
4 Experimental Evaluation

We now evaluate the practical applicability of Algorithm 1. We use multiple baseline algorithms on different kinds of predictions, which we describe below. We showcase results on real-world and synthetic data, with further experimental results in Appendix C.

Algorithms: We compare Algorithm 1 to the random-mixture algorithm described in Section 2 (Random Mixture) and to the worst-case algorithm without predictions due to Feldman et al. (2009a) (Worst-Case Baseline). Additionally, we consider two modifications of the worst-case algorithm: one allocates to the advertiser of maximum value (Greedy Baseline) and the other to the advertiser with maximum gain after disposal (Discounted Greedy Baseline).

Predictors: We consider variations of the following predictors. Recall that each predictor is a fixed allocation of impressions to advertisers that is revealed online.

(1) Optimum Solution (OPT): The optimum solution is obtained by solving the problem optimally offline using an LP solver. To evaluate our algorithm's robustness, we also consider a version of the optimum solution where a random $p$-fraction of the allocations has been corrupted. Under a random corruption, we corrupt by reallocating to randomly chosen advertisers. For a biased corruption, we sample a random permutation offline and corrupt by reallocating according to this permutation, generating a more adversarial corruption.

Figure 3: Experimental results on the iPinYou (top) and Yahoo (bottom) datasets using different predictions for varying $\alpha$. The solid lines show our algorithm and the dashed lines the random-mixture algorithm. We run the algorithms 5 times and report averages for both algorithms and the standard deviation only for our algorithm, to avoid clutter. For the robustness, the black line shows the performance of the worst-case algorithm without predictions due to Feldman et al. (2009a). For each predictor, we also include in parentheses the average competitive ratio over 5 runs (e.g. Previous Day (0.66) indicates that the average competitive ratio for the solution of the Previous Day prediction was 0.66). We run the random-mixture algorithm for each prediction with $q := 1/\alpha$. (Predictors shown for iPinYou: Dual Base (0.64), OPT biased corruption $p = 0.5$ (0.51), OPT biased corruption $p = 0.9$ (0.12), Previous Day (0.66); for Yahoo: Dual Base (0.69), OPT random corruption $p = 0.9$ (0.16), OPT biased corruption $p = 0.9$ (0.14), Previous Day (0.54); baselines: Random Mixture, Greedy Baseline, Discounted Greedy Baseline, Worst-Case Baseline.)

Figure 4: Performance of our algorithm on the iPinYou (top) and Yahoo (bottom) datasets for $\alpha = 5$ with predictions of varying quality obtained as follows: We vary the sample fraction $\epsilon \in [0, 1]$ for the Dual Base algorithm and $p \in [0, 1]$ for random and biased corruptions. The prediction competitiveness (the horizontal axis) is defined as PRD/OPT.

(2) Dual Base: We generate a solution using the algorithm of Devanur and Hayes (2009). Here, we sample the initial $\epsilon$-fraction of all impressions and optimally solve a scaled version of the dual LP to obtain the dual variables $\{\beta_a\}_a$. We get a primal allocation for all future impressions $t$ by allocating to the advertiser $a$ that maximizes the discounted gain $w_{at} - \beta_a$, but do not update $\beta_a$.

(3) Previous Day: We look at all impressions from the previous day and optimally solve the dual LP offline to obtain dual variables $\{\beta_a\}_a$. To get a prediction for today's impressions, we use the same algorithm as above and allocate to the advertiser maximizing the discounted gain.
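Both the Dual Base and Previous Day predictors allocate by the same rule once dual variables are available. A minimal sketch follows; the dual-LP solving step is abstracted away, and the names and numbers are illustrative.

```python
def discounted_gain_prediction(beta, values_t):
    """Given offline dual variables beta[a] (e.g. from the sampled or previous-day dual LP),
    predict the advertiser maximizing the discounted gain w_at - beta_a for impression t."""
    return max(values_t, key=lambda a: values_t[a] - beta[a])

beta = {"a1": 0.4, "a2": 0.1}                                      # illustrative dual variables
print(discounted_gain_prediction(beta, {"a1": 0.5, "a2": 0.3}))    # -> "a2" (gain 0.2 beats 0.1)
```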
Real-World Instances: We generate two instances for Display Ads based on the real-world datasets iPinYou (Zhang et al., 2014) and Yahoo (Yahoo, 2011).

(1) iPinYou: The iPinYou dataset contains real-time bidding data from the iPinYou online advertisement exchange. This dataset contains 40372 impressions over 7 days and 301 advertisers. Each advertiser places multiple bids for an impression. We use this bid data to construct the dataset. Specifically, we set the maximum of those bids as the advertiser's valuation. We assume a constant budget of 10 impressions for each advertiser, as it makes for an interesting instance.

(2) Yahoo: We replicate the experimental setup of Lavastida et al. (2021), who generated an instance of online capacitated bipartite matching based on a Yahoo dataset (Yahoo, 2011). Capacitated online bipartite matching is a special case of Display Ads where all impression values are 1. Based on this dataset, we create an instance of capacitated online bipartite matching with around 2 million impressions and 16268 advertisers for 123 days. We defer the details to Section C.1.

Synthetic Instances: We obtain random synthetic data for a fixed set of $T$ impressions and $k$ advertisers as follows. We first generate a set of impression types, where each type is meant to model a group of homogeneous users (e.g. of similar demographics or using similar keywords), and advertisers value users from the same group identically. We sample an advertiser's valuation for each impression type from an exponential distribution. To represent a full day of impressions, we assume that display times of impressions from a certain type are distributed according to a Gaussian with some uniformly random mean in $[0, 1]$ and a fixed standard deviation $\sigma$. We then sample the same number of impressions from each type along with display times, and order them by increasing display time. Finally, we equip each advertiser with some fixed budget that makes for a difficult instance.

Figure 5: Experimental results for varying values of $\alpha$ on synthetic data with 12 advertisers and 2000 impressions of 10 types, where we report the same quantities as in Figure 3. We use different predictors with $\sigma = 1.5$. We conducted experiments with significantly more or fewer impressions, where we obtain similar results, and we therefore omit these. (Predictors shown: Dual Base $\epsilon = 0.1$ (0.80), OPT random corruption $p = 0.5$ (0.74), OPT biased corruption $p = 0.5$ (0.72); baselines: Random Mixture, Greedy Baseline, Discounted Greedy Baseline, Worst-Case Baseline.)
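A sketch of the synthetic-instance generator just described is given below. The specific distribution parameters and defaults are ours and only illustrate the construction, matching the setup of Figure 5 (10 types, 12 advertisers, 2000 impressions, $\sigma = 1.5$).

```python
import random

def synthetic_instance(num_types=10, per_type=200, num_advertisers=12, sigma=1.5, seed=0):
    """Impression types with exponential advertiser valuations and Gaussian display times
    (sketch of the construction described in the text; defaults are illustrative)."""
    rng = random.Random(seed)
    type_values = [[rng.expovariate(1.0) for _ in range(num_advertisers)] for _ in range(num_types)]
    type_means = [rng.random() for _ in range(num_types)]     # uniform mean display time in [0, 1]
    impressions = []
    for typ in range(num_types):
        for _ in range(per_type):                              # same number of impressions per type
            display_time = rng.gauss(type_means[typ], sigma)
            impressions.append((display_time, type_values[typ]))
    impressions.sort(key=lambda x: x[0])                       # arrive in order of display time
    return [vals for _, vals in impressions]
```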
Results: Figure 3 and Figure 5 show results for real-world and synthetic instances, respectively. For each predictor, we show the consistency (left) and robustness (right) for varying $\alpha$. Figure 4 shows results for $\alpha = 5$ with predictions of different quality, as described in the figure caption.

Discussion: We make several observations. On real-world instances, there is only a single prediction for which the performance of our algorithm drops below the worst-case algorithm, even for heavily corrupted predictions. E.g., on the iPinYou dataset, our algorithm is still able to leverage a prediction with a corruption rate as high as $p = 50\%$ and improve upon the worst-case algorithm (see the green and black lines in the top right plot of Figure 3). Moreover, for higher-performing predictors, the improvement over the worst-case algorithm is significantly higher in both datasets. See, for example, the Previous Day predictions on the iPinYou dataset (the purple and black lines in the top right plot of Figure 3) or the Dual Base predictions on the Yahoo dataset (the orange and black lines in the bottom right plot of Figure 3). The increase in robustness is surprising as it does not follow the behavior of our theoretical bounds (cf. Figure 2), which hold against an adversarial prediction. This suggests that on real-world data, our algorithm is able to exploit good suggestions from the prediction while ignoring bad suggestions. Second, as we can see in Figure 3, the consistency of our algorithm is always above 1 for all predictors except the optimum, and significantly so for artificially corrupted predictions. The robustness of our algorithm remains high in almost all cases, even for the most heavily corrupted predictions (cf. the right side of Figure 4). On synthetic instances, we observe that our algorithm is robust against both random and biased corruption, as its robustness does not drop to the prediction's low competitiveness of about 0.7. Furthermore, our algorithm performs well in combination with the Dual Base prediction for $\epsilon = 0.1$ (the orange line in Figure 5), even though the first 200 impressions are not representative of all impressions. On all instances, we clearly outperform the random-mixture algorithm, which merely interpolates between the objective values of the worst-case algorithm and the prediction.

We introduce a novel algorithm with predictions for Display Ads and GAP with free disposal. Our algorithm is based on the primal-dual approach and can be efficiently implemented in practice. We show its robustness using the primal-dual method similar to Feldman et al. (2009a) and use a novel combinatorial proof to show its consistency. Finally, our experiments show the applicability of our algorithm, which is able to improve beyond the worst-case performance using readily available predictions. An interesting direction left open by our work is to understand the optimal trade-off between robustness and consistency for Display Ads and GAP. This challenging direction has been explored recently in the works of Jin and Ma (2022) and Wei and Zhang (2020) for a related problem.

Limitations: Our algorithm requires a strong prediction that is a solution to the problem. We leave weaker predictions, such as partial solutions or predictions of the supply, for future work.

Acknowledgments

FS and AE were supported in part by NSF CAREER grant CCF-1750333, NSF grant III-1908510, and an Alfred P. Sloan Research Fellowship.

References

Shipra Agrawal, Morteza Zadimoghaddam, and Vahab Mirrokni. Proportional allocation: Simple, distributed, and diverse matching with high entropy. In Jennifer Dy and Andreas Krause, editors, Proceedings of the 35th International Conference on Machine Learning, volume 80 of Proceedings of Machine Learning Research, pages 99-108. PMLR, 2018. URL https://proceedings.mlr.press/v80/agrawal18b.html.

Niv Buchbinder and Joseph (Seffi) Naor. The design of competitive online algorithms via a primal-dual approach. Foundations and Trends in Theoretical Computer Science, 3(2-3):93-263, 2009. ISSN 1551-305X. doi: 10.1561/0400000024. URL https://doi.org/10.1561/0400000024.

Niv Buchbinder, Kamal Jain, and Joseph (Seffi) Naor. Online primal-dual algorithms for maximizing ad-auctions revenue. In Lars Arge, Michael Hoffmann, and Emo Welzl, editors, Algorithms - ESA 2007, pages 253-264, Berlin, Heidelberg, 2007. Springer. ISBN 978-3-540-75520-3.

Justin Y. Chen and Piotr Indyk. Online bipartite matching with predicted degrees. CoRR, abs/2110.11439, 2021. URL https://arxiv.org/abs/2110.11439.
Nikhil Devanur and Thomas P. Hayes. The adwords problem: Online keyword matching with budgeted bidders under random permutations. In ACM Conference on Electronic Commerce. Association for Computing Machinery, January 2009.

Nikhil R. Devanur, Balasubramanian Sivan, and Yossi Azar. Asymptotically optimal algorithm for stochastic adwords. In Proceedings of the 13th ACM Conference on Electronic Commerce, EC '12, pages 388-404, New York, NY, USA, 2012. Association for Computing Machinery. ISBN 9781450314152. doi: 10.1145/2229012.2229043. URL https://doi.org/10.1145/2229012.2229043.

Jon Feldman, Nitish Korula, Vahab S. Mirrokni, S. Muthukrishnan, and Martin Pál. Online ad assignment with free disposal. In Workshop on Internet and Network Economics (WINE), pages 374-385, 2009a.

Jon Feldman, Aranyak Mehta, Vahab Mirrokni, and S. Muthukrishnan. Online stochastic matching: Beating 1-1/e. In 2009 50th Annual IEEE Symposium on Foundations of Computer Science, pages 117-126, 2009b. doi: 10.1109/FOCS.2009.72.

Jon Feldman, Monika Henzinger, Nitish Korula, Vahab S. Mirrokni, and Cliff Stein. Online stochastic packing applied to display ad allocation. In Mark de Berg and Ulrich Meyer, editors, Algorithms - ESA 2010, pages 182-194, Berlin, Heidelberg, 2010. Springer. ISBN 978-3-642-15775-2.

Yiding Feng, Rad Niazadeh, and Amin Saberi. Two-stage stochastic matching with application to ride hailing. In SODA, pages 2862-2877. SIAM, 2021.

Billy Jin and Will Ma. Online bipartite matching with advice: Tight robustness-consistency tradeoffs for the two-stage model. In NeurIPS, 2022.

Billy Jin and David P. Williamson. Improved analysis of ranking for online vertex-weighted bipartite matching in the random order model. In Michal Feldman, Hu Fu, and Inbal Talgam-Cohen, editors, Web and Internet Economics, pages 207-225, Cham, 2022. Springer International Publishing. ISBN 978-3-030-94676-0.

Chinmay Karande, Aranyak Mehta, and Pushkar Tripathi. Online bipartite matching with unknown distributions. In Proceedings of the Forty-Third Annual ACM Symposium on Theory of Computing, STOC '11, pages 587-596, New York, NY, USA, 2011. Association for Computing Machinery. ISBN 9781450306911. doi: 10.1145/1993636.1993715. URL https://doi.org/10.1145/1993636.1993715.

R. M. Karp, U. V. Vazirani, and V. V. Vazirani. An optimal algorithm for on-line bipartite matching. In Proceedings of the Twenty-Second Annual ACM Symposium on Theory of Computing, STOC '90, pages 352-358, New York, NY, USA, 1990. Association for Computing Machinery. ISBN 0897913612. doi: 10.1145/100216.100262. URL https://doi.org/10.1145/100216.100262.

Silvio Lattanzi, Thomas Lavastida, Benjamin Moseley, and Sergei Vassilvitskii. Online scheduling via learned weights. pages 1859-1877, 2020. doi: 10.1137/1.9781611975994.114. URL https://epubs.siam.org/doi/abs/10.1137/1.9781611975994.114.

Thomas Lavastida, Benjamin Moseley, R. Ravi, and Chenyang Xu. Using predicted weights for ad delivery. pages 21-31, 2021. doi: 10.1137/1.9781611976830.3. URL https://epubs.siam.org/doi/abs/10.1137/1.9781611976830.3.

Mohammad Mahdian, Hamid Nazerzadeh, and Amin Saberi. Allocating online advertisement space with unreliable estimates. In Proceedings of the 8th ACM Conference on Electronic Commerce, EC '07, pages 288-294, New York, NY, USA, 2007. Association for Computing Machinery. ISBN 9781595936530. doi: 10.1145/1250910.1250952. URL https://doi.org/10.1145/1250910.1250952.
Andrés Muñoz Medina and Sergei Vassilvitskii. Revenue optimization with approximate bid predictions. In Proceedings of the 31st International Conference on Neural Information Processing Systems, NIPS '17, pages 1856-1864, Red Hook, NY, USA, 2017. Curran Associates Inc. ISBN 9781510860964.

Aranyak Mehta. Online matching and ad allocation. Foundations and Trends in Theoretical Computer Science, 8(4):265-368, 2013. URL http://dx.doi.org/10.1561/0400000057.

Aranyak Mehta, Amin Saberi, Umesh Vazirani, and Vijay Vazirani. Adwords and generalized online matching. Journal of the ACM, 54(5), 2007. ISSN 0004-5411. doi: 10.1145/1284320.1284321. URL https://doi.org/10.1145/1284320.1284321.

Michael Mitzenmacher and Sergei Vassilvitskii. Algorithms with predictions. Communications of the ACM, 65(7):33-35, 2022. ISSN 0001-0782. doi: 10.1145/3528087. URL https://doi.org/10.1145/3528087.

Alexander Wei and Fred Zhang. Optimal robustness-consistency trade-offs for learning-augmented online algorithms. In NeurIPS, 2020.

Yahoo. Yahoo! webscope, 2011. URL https://webscope.sandbox.yahoo.com/. Accessed September 7, 2022.

Weinan Zhang, Shuai Yuan, Jun Wang, and Xuehua Shen. Real-time bidding benchmarking with ipinyou dataset, 2014. URL https://arxiv.org/abs/1407.7073.

A Omitted Proofs

We will need the following helper lemma in the proofs of consistency and robustness.

Lemma 2. Recall that $e_B := \left(1 + \frac{1}{B}\right)^B$ and $\alpha_B := B\left(e_B^{\alpha/B} - 1\right)$. We have

1. $e_B \le e_{B_a}$, and
2. $\alpha_B \ge \alpha_{B_a}$.

Proof. It is well known that $e_B$ converges to $e$ from below as $B \to \infty$, which gives the first claim since $B \le B_a$. Furthermore, we can show that $\alpha_B$ is decreasing in $B$ for all $\alpha \ge 1$ by taking the derivative with respect to $B$ and applying Bernoulli's inequality, which states that $1 + rx \le (1 + x)^r$ for $x \ge -1$ and $r \in \mathbb{R} \setminus (0, 1)$.

A.1 Proof of Theorem 1 (Robustness)

We write $P$ and $D$ to denote the objective values of the primal and dual solutions, i.e. $P = \sum_a \sum_{t \in S_a} w_{at}$ and $D = \sum_a B_a \beta_a + \sum_t z_t$, where $z_t$ is specified in the following proof to ensure feasibility. We can show that after the allocation of each impression $t$,
$$\Delta P \ge \frac{e_B^{\alpha} - 1}{B\left(e_B^{\alpha/B} - 1\right)e_B^{\alpha}}\,\Delta D,$$
where $\Delta P$ and $\Delta D$ are the increases in the primal and dual solution values, respectively. Since we create feasible primal and dual solutions, this is sufficient to bound the robustness due to weak duality. There is one main difference to Feldman et al. (2009a): In their algorithm, setting the dual variable $z_t$ to $w_{a^{(\mathrm{EXP})}t} - \beta_{a^{(\mathrm{EXP})}}$ ensures dual feasibility, as $a^{(\mathrm{EXP})}$ is the advertiser with maximum discounted gain. However, in order not to violate dual feasibility when following the prediction, we need to increase the dual variable $z_t$ by a factor of $\alpha_B$. Note that for $\alpha = 1$, this recovers the competitiveness obtained by Feldman et al. (2009a).

Proof. Consider an iteration where we assign an impression $t$ to advertiser $a$, and let $w_1 \le w_2 \le \dots \le w_{B_a}$ be the values of impressions currently allocated to $a$ in non-decreasing order. Let $w_0$ be the least valuable of the impressions allocated to $a$ at the end of iteration $t - 1$, i.e. the impression that is removed to make space for $t$. Assume that after allocating impression $t$ to $a$, it becomes the $k$-th least valuable impression allocated to $a$, with value $w_{at} = w_k$. Thus, using that $w_i \ge w_{i-1}$, we can bound
$$\beta_a^{(t-1)} = \frac{e_{B_a}^{\alpha/B_a} - 1}{e_{B_a}^{\alpha} - 1}\left(\sum_{i=0}^{k-1} w_i\, e_{B_a}^{\alpha(B_a - i - 1)/B_a} + \sum_{i=k+1}^{B_a} w_i\, e_{B_a}^{\alpha(B_a - i)/B_a}\right)$$
$$= \frac{e_{B_a}^{\alpha/B_a} - 1}{e_{B_a}^{\alpha} - 1}\left(\sum_{i=0}^{B_a - 1} w_i\, e_{B_a}^{\alpha(B_a - i - 1)/B_a} + \sum_{i=k+1}^{B_a} (w_i - w_{i-1})\, e_{B_a}^{\alpha(B_a - i)/B_a}\right)$$
$$\ge \frac{e_{B_a}^{\alpha/B_a} - 1}{e_{B_a}^{\alpha} - 1}\sum_{i=0}^{B_a - 1} w_i\, e_{B_a}^{\alpha(B_a - i - 1)/B_a} =: \hat\beta_a^{(t-1)},$$
which is tight when impression $t$ becomes the most valuable impression assigned to $a$.
We can now write $\beta_a^{(t)}$ as a function of the bound $\hat\beta_a^{(t-1)}$:
$$\beta_a^{(t)} = \frac{e_{B_a}^{\alpha/B_a} - 1}{e_{B_a}^{\alpha} - 1}\sum_{i=1}^{B_a} w_i\, e_{B_a}^{\alpha(B_a - i)/B_a} = \frac{e_{B_a}^{\alpha/B_a} - 1}{e_{B_a}^{\alpha} - 1}\left(\sum_{i=0}^{B_a - 1} w_i\, e_{B_a}^{\alpha(B_a - i)/B_a} + w_{B_a} - w_0\, e_{B_a}^{\alpha}\right) = e_{B_a}^{\alpha/B_a}\,\hat\beta_a^{(t-1)} + \frac{e_{B_a}^{\alpha/B_a} - 1}{e_{B_a}^{\alpha} - 1}\left(w_{B_a} - w_0\, e_{B_a}^{\alpha}\right).$$
We set $z_t := \alpha_B\left(w_{B_a} - \beta_a^{(t-1)}\right)$, which is feasible: the discounted gain $w_{at} - \beta_a^{(t-1)}$ of the chosen advertiser $a$ may only be $\alpha_B$ times smaller than the maximum discounted gain $w_{a^{(\mathrm{EXP})}t} - \beta_{a^{(\mathrm{EXP})}}^{(t-1)}$, due to the advantage granted to the predicted advertiser. This yields a dual increase of
$$\Delta D = B_a\left(\beta_a^{(t)} - \beta_a^{(t-1)}\right) + z_t = B_a\left(\beta_a^{(t)} - \beta_a^{(t-1)}\right) + \alpha_B\left(w_{B_a} - \beta_a^{(t-1)}\right) \le B_a\left(\beta_a^{(t)} - \hat\beta_a^{(t-1)}\right) + \alpha_B\left(w_{B_a} - \hat\beta_a^{(t-1)}\right)$$
$$= B_a\left(\left(e_{B_a}^{\alpha/B_a} - 1\right)\hat\beta_a^{(t-1)} + \frac{e_{B_a}^{\alpha/B_a} - 1}{e_{B_a}^{\alpha} - 1}\left(w_{B_a} - w_0\, e_{B_a}^{\alpha}\right)\right) + \alpha_B\left(w_{B_a} - \hat\beta_a^{(t-1)}\right)$$
$$= \alpha_{B_a}\,\hat\beta_a^{(t-1)} + \frac{\alpha_{B_a}}{e_{B_a}^{\alpha} - 1}\left(w_{B_a} - w_0\, e_{B_a}^{\alpha}\right) + \alpha_B\left(w_{B_a} - \hat\beta_a^{(t-1)}\right)$$
$$= \alpha_{B_a}\frac{e_{B_a}^{\alpha}}{e_{B_a}^{\alpha} - 1}\left(\hat\beta_a^{(t-1)} - w_0\right) + \frac{\alpha_{B_a}}{e_{B_a}^{\alpha} - 1}\left(w_{B_a} - \hat\beta_a^{(t-1)}\right) + \alpha_B\left(w_{B_a} - \hat\beta_a^{(t-1)}\right)$$
$$\le \alpha_B\frac{e_B^{\alpha}}{e_B^{\alpha} - 1}\left(\hat\beta_a^{(t-1)} - w_0\right) + \frac{\alpha_B}{e_B^{\alpha} - 1}\left(w_{B_a} - \hat\beta_a^{(t-1)}\right) + \alpha_B\left(w_{B_a} - \hat\beta_a^{(t-1)}\right)$$
$$= \alpha_B\frac{e_B^{\alpha}}{e_B^{\alpha} - 1}\left(\hat\beta_a^{(t-1)} - w_0\right) + \alpha_B\frac{e_B^{\alpha}}{e_B^{\alpha} - 1}\left(w_{B_a} - \hat\beta_a^{(t-1)}\right) = \alpha_B\frac{e_B^{\alpha}}{e_B^{\alpha} - 1}\left(w_{B_a} - w_0\right) = B\left(e_B^{\alpha/B} - 1\right)\frac{e_B^{\alpha}}{e_B^{\alpha} - 1}\left(w_{B_a} - w_0\right),$$
where the second inequality is due to $\alpha_B \ge \alpha_{B_a}$ and $e_B \le e_{B_a}$, as shown in Lemma 2.

A.2 Proof of Theorem 1 (Consistency)

In the following, we upper bound PRD using the comparison in Line 7 of Algorithm 1.

Lemma 3. We have
$$\mathrm{PRD} \le \sum_a\left((B_a - \ell_a)\,\beta_a^{(T)} + \frac{1}{\alpha_B}\sum_{t \in X_a \setminus P_a}\left(w_{at} - \beta_a^{(t-1)}\right) + \sum_{t \in P_a \cap X_a} w_{at}\right).$$

Proof. We first split the impressions $t$ into two categories: Either the algorithm followed the prediction and assigned $t$ to $a^{(t)} = a^{(t)}_{(\mathrm{PRD})}$, or the algorithm ignored the prediction and assigned $t$ to $a^{(t)} = a^{(t)}_{(\mathrm{EXP})} \ne a^{(t)}_{(\mathrm{PRD})}$. In the latter case, due to the selection rule in Line 7 of Algorithm 1,
$$\alpha_B\left(w_{a^{(t)}_{(\mathrm{PRD})}t} - \beta_{a^{(t)}_{(\mathrm{PRD})}}^{(t-1)}\right) \le w_{a^{(t)}_{(\mathrm{EXP})}t} - \beta_{a^{(t)}_{(\mathrm{EXP})}}^{(t-1)}.$$
In symbols,
$$\mathrm{PRD} = \sum_a\left(\sum_{t \in P_a \setminus X_a} w_{at} + \sum_{t \in P_a \cap X_a} w_{at}\right) \le \sum_a\left(\sum_{t \in P_a \setminus X_a}\left(\beta_a^{(t-1)} + \frac{1}{\alpha_B}\left(w_{a^{(t)}_{(\mathrm{EXP})}t} - \beta_{a^{(t)}_{(\mathrm{EXP})}}^{(t-1)}\right)\right) + \sum_{t \in P_a \cap X_a} w_{at}\right)$$
$$= \sum_a\left(\sum_{t \in P_a \setminus X_a}\beta_a^{(t-1)} + \frac{1}{\alpha_B}\sum_{t \in X_a \setminus P_a}\left(w_{at} - \beta_a^{(t-1)}\right) + \sum_{t \in P_a \cap X_a} w_{at}\right),$$
where the last equality holds because $\{P_a\}_a$ and $\{X_a\}_a$ both partition the set of all impressions, due to the introduction of the dummy advertiser. For the first sum, we use that $\beta_a$ can only increase in each round and bound $\sum_{t \in P_a \setminus X_a}\beta_a^{(t-1)} \le (B_a - \ell_a)\,\beta_a^{(T)}$.

For the remainder of this section, we consider a fixed advertiser $a$. Let us denote by $t_i$ the $i$-th impression allocated to $a$. Let
$$(\star) := \frac{1}{\alpha_B}\sum_{t \in X_a \setminus P_a}\left(w_{at} - \beta_a^{(t-1)}\right) + \sum_{t \in P_a \cap X_a} w_{at}$$
be the corresponding part of the bound on PRD in Lemma 3. In order to understand this bound, we make some useful observations in the following lemma to simplify the analysis. The key idea is that we may assume that impressions in $X_a$ are ordered to be non-decreasing. In particular, we need to argue that the sum $\sum_{t \in X_a \setminus P_a}\beta_a^{(t-1)}$ can only decrease (as this term is negated in $(\star)$) when impressions in $X_a$ are ordered to be non-decreasing: Intuitively, each $\beta_a^{(t-1)}$ depends only on the $B_a$ most valuable impressions assigned before impression $t$, no matter the order in which $X_a^{(t-1)}$ arrived. We can thus minimize each $\beta_a^{(t-1)}$ if the impressions allocated prior to $t$ are the impressions of smallest value. To simultaneously minimize each $\beta_a^{(t)}$ in the sum, we order the impressions in $X_a$ to have non-decreasing value. We prove this simplification formally in the following lemma.

Lemma 4. Without loss of generality, we may assume that $P_a \cap X_a$ are the most valuable impressions in $X_a$ and that impressions in $X_a$ arrive such that their values are non-decreasing.

Proof. We may assume that the impressions in $P_a \cap X_a$ are the most valuable impressions in $X_a$: this can only increase the value of $P_a$ but leaves $S_a$ unaffected, as $S_a$ are by design the $B_a$ most valuable impressions in $X_a$.
All impressions in ( ) are from Xa, so reordering impressions only affects ( ). Specifically, we can show that the sum P t Xa\Pa β(t 1) a in ( ) is minimized if the values in Xa are ordered to be non-decreasing. Assume to the contrary that the i-th impression added to a is the last that is in order. That is wat1 wat2 wati and there exists a j i such that watj 1 wati+1 < watj. Moving ti+1 ahead to its ranked position within the first i impressions allocated to a changes the ordering as follows (the first and second row show the impression values before and after changing the position of ti+1, respectively): wat1 watj 1 watj watj+1 wati wat1 watj 1 wati+1 < watj wati 1 Note that each position decreases in value, even strictly at the j-th position. As such, the exponential average β(t 1) a decreases as well for t < ti+1; it remains constant for t ti+1 as it only depends on the Ba most valuable impressions assigned up to t which remain the same. We can thus simultaneously minimize β(t) a for each t by putting Xa in non-decreasing order. This reordering does not affect β(T ) a or the other terms in ( ), so we may indeed assume that values are non-decreasing. In light of Lemma 4, we can write ( ) as follows. Lemma 5. We have wati β(ti 1) a + i=Ia ℓa+1 wati. Proof. Impression values are non-decreasing due to Lemma 4, so wati is the i-th least valuable impression in Xa. The impressions {Ia ℓa + 1, . . . , Ia} = Xa Pa are thus the most valuable. We can now write ( ) as wat β(t 1) a + X t Pa Xa wat = 1 αB wati β(ti 1) a + i=Ia ℓa+1 wati where β(ti 1) a = β(ti 1) a as there was no change to the dual variable of advertiser a since no impression in {ti 1 + 1, . . . , ti 1} was allocated to a. Combining Lemmas 3 and 5, we obtain: Lemma 6. PRD P a PRDa where PRDa := 1 αB wati β(ti 1) a + i=Ia ℓa+1 wati + (Ia ℓa) β(T ) a . In the following, we use the non-decreasing ordering of impressions in Xa to compute β(ti 1) a and bound PRDa with a linear combination of values wati. Consider the j-th impression tj allocated to a. Since we assume that impression values are non-decreasing, we know that tj becomes the most valuable impression right after it is allocated. After the allocation of the (j + 1)-th impression to a, it becomes the second most valuable impression, and so forth, until it is disposed after the allocation of the (j + Ba)-th impression. The value watj therefore appears alongside each coefficient in the convex combination that defines β(ti 1) a for i {j + 1, . . . , j + Ba}. Expanding each β(ti 1) a in the sum PIa ℓa i=1 β(ti 1) a in PRDa, we thus observe that the coefficients of values watj for j Ia ℓa Ba sum up to 1. We use this fact to cancel out most of the values in PIa ℓa i=1 wati. What remains are only the values wati for i {Ia ℓa Ba + 1, . . . , Ia ℓa}. For i {Ia ℓa Ba + 1, . . . , Ia Ba}, we bound wati by wat Ia Ba which is really the best we can hope for. Formally, we show: Lemma 7. We have i=Ia Ba+1 ϕiwati + i=Ia ℓa+1 ψiwati + wat Ia Ba Ωa with coefficients ϕi := (Ba ℓa) eα/Ba Ba 1 eα Ba 1 eα(Ia i)/Ba Ba + 1 eα Ba eα(Ia ℓa i)/Ba Ba eα Ba 1 ψi := 1 + (Ba ℓa) eα/Ba Ba 1 eα Ba 1 eα(Ia i)/Ba Ba ℓaeα Ba eα Ba eα(Ba ℓa)/Ba Ba eα/Ba Ba 1 Proof. We start by rewriting the terms in PRDa individually. Since we assume that the values are non-decreasing, we can express β(ti 1) a as the exponential average of values wati Ba , wati Ba+1, . . . , wati 1 of the last Ba impressions (for simplicity, we set watj = 0 for j 0). 
Summing over multiple iterations, we thus obtain for the sum over the dual variables that i=1 β(ti 1) a = eα/Ba Ba 1 eα Ba 1 j=i Ba watjeα(i j 1)/Ba Ba = eα/Ba Ba 1 eα Ba 1 min{j+Ba,Ia ℓa} X i=j+1 eα(i j 1)/Ba Ba = eα/Ba Ba 1 eα Ba 1 min{Ba,Ia ℓa j} X i=1 eα(i 1)/Ba Ba = eα/Ba Ba 1 eα Ba 1 i=1 eα(i 1)/Ba Ba + eα/Ba Ba 1 eα Ba 1 j=Ia Ba ℓa+1 wa i=1 eα(i 1)/Ba Ba i=1 wati + 1 eα Ba 1 i=Ia Ba ℓa+1 wati eα(Ia ℓa i)/Ba Ba 1 . where for the last equality, we use that the two inner sums are geometric. We can use this expression to cancel out most of the terms of the first sum in PRDa: wati β(ti 1) a i=1 wati 1 αB eα Ba 1 i=Ia Ba ℓa+1 wati eα(Ia ℓa i)/Ba Ba 1 i=Ia Ba ℓa+1 wati 1 eα(Ia ℓa i)/Ba Ba 1 eα Ba 1 i=Ia Ba ℓa+1 wati eα Ba eα(Ia ℓa i)/Ba Ba eα Ba 1 i=Ia Ba+1 wati eα Ba eα(Ia ℓa i)/Ba Ba eα Ba 1 + i=Ia Ba ℓa+1 wati eα Ba eα(Ia ℓa i)/Ba Ba eα Ba 1 . (3) We use that wati wat Ia Ba for all i Ia Ba to upper bound the second sum, divided by αB, in (3) to i=Ia Ba ℓa+1 wati eα Ba eα(Ia ℓa i)/Ba Ba eα Ba 1 wat Ia Ba 1 αB i=Ia Ba ℓa+1 eα Ba eα(Ia ℓa i)/Ba Ba eα Ba 1 (4) = wat Ia Ba 1 αB i=Ba ℓa eαi/Ba Ba = wat Ia Ba 1 αB ℓaeα Ba eα Ba eα(Ba ℓa)/Ba Ba eα/Ba Ba 1 Furthermore, by definition of β(T ) a = β(t Ia) a , (Ba ℓa) β(T ) a = (Ba ℓa) eα/Ba Ba 1 eα Ba 1 i=Ia Ba+1 watieα(Ia i)/Ba Ba . (6) We combine (3), (5), and (6) and group terms to obtain the desired bound i=Ia ℓa+1 wati + 1 i=Ia Ba+1 wati eα Ba eα(Ia ℓa i)/Ba Ba eα Ba 1 + wat Ia Ba Ωa + (Ba ℓa) eα/Ba Ba 1 eα Ba 1 i=Ia Ba+1 watieα(Ia i)/Ba Ba i=Ia Ba+1 wati (Ba ℓa) eα/Ba Ba 1 eα Ba 1 eα(Ia i)/Ba Ba + 1 eα Ba eα(Ia ℓa i)/Ba Ba eα Ba 1 i=Ia ℓa+1 wati 1 + (Ba ℓa) eα/Ba Ba 1 eα Ba 1 eα(Ia i)/Ba Ba We can express ALG analogously: Lemma 8. We have ALG = P a ALGa where i=Ia Ba+1 wati. Proof. We have ALG = P t Sa wat. As we always dispose of the least valuable impression in Algorithm 1, Sa are the Ba most valuable impressions in Xa. Due to Lemma 4, these are Sa = {Ia Ba + 1, . . . , Ia} and hence P t Sa wat = PIa i=Ia Ba+1 wati = ALGa. We upper bound the ratio PRD/ALG by maxa PRDa/ALGa. To this end, we fix an advertiser a and upper bound the ratio PRDa/ALGa. Recall from Lemmas 7 and 8 that we can express PRDa and ALGa as linear combination over impression values. We could obtain a natural upper bound by comparing impression value coefficients. However, in the following lemma, we show how to use the non-decreasing ordering due to Lemma 4 to obtain a tighter bound. i=Ia Ba+1 ϕi and Ψa := i=Ia ℓa+1 ψi as the total factor mass on values wati for ϕi and ψi, respectively. Let τa := (Φa + Ψa + Ωa) /Ba be the average factor. Recall that i=Ia Ba+1 ϕiwati + i=Ia ℓa+1 ψiwati + wat Ia Ba Ωa (7) i=Ia Ba+1 wati. In the following lemma, we use that wati watj for i j due to Lemma 4, to further upper bound the RHS of 7 by a linear combination of the values, where we move mass from coefficients on wati to coefficients on watj. Additionally, we move mass from Ωa to coefficients ϕi for i {Ia Ba + 1, . . . , Ia ℓa} and from ϕi to ψj for j {Ia ℓa + 1, . . . , Ia}. In the best case, we are able to redistribute mass equally across all values, in which case the consistency is given as the average factor τa. Otherwise, the factors on the largest values dominate, giving us a consistency of Ψa/ℓa. Lemma 9. We have PRDa ALGa ( max n τa, Ψa o if ℓa > 0 τa otherwise τa = 1 + 1 eα Ba 1 1 αB eα Ba eα Ba 1 ℓa = 1 + Ba ℓa 1 eαℓa/Ba Ba 1 eα Ba 1 . Proof. We calculate τa and Ψa/ℓa separately in Lemma 10 below. 
Our main goal is to distribute mass from the factors ϕi, ψi, and from Ωa equally to the values w Ia Ba+1, . . . , w Ia. We begin by taking a closer look at the factors ϕi and ψi. First, note that ψi is always decreasing in i as ψi = 1 + (Ba ℓa) | {z } 0 eα/Ba Ba 1 eα Ba 1 | {z } 0 eα(Ia i)/Ba Ba . We can therefore bound the linear combination over values in {Ia ℓa + 1, . . . , Ia} using the average value wΨ := 1 ℓa PIa i=Ia ℓa+1 wati as i=Ia ℓa+1 watiψi i=Ia ℓa+1 wΨψi = wΨΨa. (8) However, ϕi is not always decreasing which can be seen by rearranging ϕi = (Ba ℓa) eα/Ba Ba 1 eα Ba 1 eα(Ia i)/Ba + 1 eα Ba eα(Ia ℓa i)/Ba Ba eα Ba 1 = 1 eα Ba 1 (Ba ℓa) eα/Ba Ba 1 1 αB e αℓa/Ba Ba eα(Ia i)/Ba Ba + 1 eα Ba eα Ba 1. We observe that ϕi is decreasing if (Ba ℓa) eα/Ba Ba 1 is at least 1 αB e αℓa/Ba Ba , and we analyze two cases based on the relationship of both terms. Let us first assume that (Ba ℓa) eα/Ba Ba 1 1 αB e αℓa/Ba Ba such that ϕi is decreasing in i which helps us to bound the linear combinations in Lemma 7 over {Ia Ba + 1, . . . , Ia ℓa} and {Ia ℓa + 1, . . . , T} by the average values wΦ := 1 Ba ℓa PIa ℓa i=Ia Ba+1 wati and wΨ, respectively. We further use that wat Ia Ba wΦ to charge mass from Ωa to Φa and obtain due to (8) that i=Ia Ba+1 watiϕi + i=Ia ℓa+1 watiψi + wat Ia Ba Ωa wΦΦa + wΨΨa + wat Ia Ba Ωa (9) wΦ (Φa + Ωa) + wΨΨa = wΦ (Ba ℓa) τa + wΦ (Φa + Ωa (Ba ℓa) τa) + wΨΨa i=Ia Ba+1 τawati + wΦ (Φa + Ωa (Ba ℓa) τa) + wΨΨa (10) On the other hand, if (Ba ℓa) eα/Ba Ba 1 1 αB e αℓa/Ba Ba we can no longer bound the values over {Ia Ba + 1, . . . , Ia ℓa} by the average value wΦ. However, each factor ϕi is less than τa which can be seen by rearranging ϕi = 1 eα Ba 1 (Ba ℓa) eα/Ba Ba 1 1 αB e αℓa/Ba Ba eα(Ia i)/Ba Ba + 1 eα Ba eα Ba 1 1 + 1 eα Ba 1 1 αB eα Ba eα Ba 1 to the equivalent expression (Ba ℓa) eα/Ba Ba 1 1 αB e αℓa/Ba Ba eα(Ia i)/Ba Ba | {z } 0 αBa = 1 1 αB a Ba which is true since the LHS is 0 and the RHS 0. We can thus charge τa ϕi of mass from Ωa to the coefficients ϕi for each i {Ia Ba + 1, . . . , Ia ℓa} which yields i=Ia Ba+1 watiϕi + i=Ia ℓa+1 watiψi + wat Ia Ba Ωa i=Ia Ba+1 watiϕi + wΨΨa + wat Ia Ba Ωa i=Ia Ba+1 τawati i=Ia Ba+1 wati (τa ϕi) | {z } 0 + wΨΨa + wat Ia Ba Ωa i=Ia Ba+1 τawati i=Ia Ba+1 wat Ia Ba (τa ϕi) + wΨΨa + wat Ia Ba Ωa i=Ia Ba+1 τawati + wat Ia Ba (Φa + Ωa (Ba ℓa) τa) + wΨΨa (11) In both cases (10) and (11), we have shown that i=Ia Ba+1 watiϕi + i=Ia ℓa+1 watiψi + wat Ia Ba Ωa i=Ia Ba+1 τawati + v (Φa + Ωa (Ba ℓa) τa) + wΨΨa for a v wΨ. If ℓa > 0, we can use v wΨ to charge the remaining mass to Ψa if the factors over {Ia ℓa + 1, . . . , T} leave enough space. In symbols, this means i=T Ba+1 τawati + v (Φa + Ωa (Ba ℓa) τa) + wΨΨa i=T Ba+1 τawati + wΨ max {Φa + Ωa (Ba ℓa) τa, 0} + wΨΨa i=T Ba+1 τawati + wΨ max {Φa + Ψa + Ωa (Ba ℓa) τa, Ψa} i=T Ba+1 τawati + wΨ max {ℓaτa, Ψa} i=T Ba+1 wati + max τa, Ψa i=T ℓa+1 wati If ℓa = 0, we have Ψa = 0 and immediately obtain by definition of τa that i=Ia Ba+1 τawati + v (Φa + Ωa (Ba ℓa) τa) + wΨΨa = i=Ia Ba+1 τawati. Lemma 10. We have Ψa ℓa = 1 + Ba ℓa 1 eαℓa/Ba Ba 1 eα Ba 1 τa = 1 + 1 eα Ba 1 1 αB eα Ba eα Ba 1 Proof. We compute i=T Ba+1 ϕi (Ba ℓa) eα/Ba Ba 1 1 αB e αℓa/Ba Ba eα(T i)/Ba Ba + 1 eα Ba eα Ba 1 = 1 eα Ba 1 (Ba ℓa) eα/Ba Ba 1 1 αB e αℓa/Ba Ba eα Ba eαℓa/Ba Ba eα/Ba Ba 1 + (Ba ℓa) 1 eα Ba eα Ba 1 = 1 eα Ba 1 (Ba ℓa) eα Ba eαℓa/Ba Ba + 1 1 eα Ba 1 1 αB eα αℓa/Ba Ba 1 i=T ℓa+1 ψi 1 + (Ba ℓa) eα/Ba Ba 1 eα Ba 1 eα(T i)/Ba Ba = ℓa + (Ba ℓa) eα/Ba Ba 1 eα Ba 1 eαℓa/Ba Ba 1 = ℓa + (Ba ℓa) eαℓa/Ba Ba 1 eα Ba 1 . 
Summing up, Φa + Ψa + Ωa = 1 eα 1 (Ba ℓa) eα Ba eαℓa/Ba Ba + 1 1 eα Ba 1 1 αB eα αℓa/Ba Ba 1 + ℓa + (Ba ℓa) eαℓa/Ba Ba 1 eα Ba 1 + 1 eα Ba 1 1 αB ℓaeα Ba eα Ba eα(Ba ℓa)/Ba Ba eα/Ba Ba 1 = 1 eα Ba 1 (Ba ℓa) eα Ba 1 + 1 1 eα Ba 1 1 αB + ℓa + 1 eα Ba 1 1 αB ℓaeα Ba = Ba + 1 eα Ba 1 1 αB (Ba ℓa) eα Ba 1 eα Ba 1 1 αB eα/Ba Ba 1 + 1 eα Ba 1 1 αB ℓaeα Ba = Ba + 1 eα Ba 1 1 αB Baeα Ba 1 eα Ba 1 1 αB = Ba + 1 eα Ba 1 1 αB Baeα Ba eα Ba 1 which does no longer depend on ℓa. Dividing Ψa by ℓa and Φa+Ψa+Ωa by Ba yields the result. Putting everything together, we have PRD/ALG maxa max τa, maxℓa {1,...,Ba} Ψa/ℓa as τa does not depend on ℓa. The reader can refer back to Figure 2 for an illustration of this upper bound. In the following lemma, we further analyze analytically maxℓa {1,...,Ba} Ψa/ℓa and compare it with τa to obtain the upper bound: Lemma 11. The consistency of Algorithm 1 is given by PRD/ALG 1 + 1 eα B 1 max 1 eα B eα B 1 , ln (eα B) . Proof. Due to Lemma 9, it is sufficient to show 1 + 1 eα B 1 max 1 eα B eα B 1 , ln (eα B) max {τa, Ψa/ℓa} if ℓa > 0 τa otherwise. By Lemma 10, we know for the first term in the maximum that τa = 1 + 1 eα Ba 1 1 αB eα Ba eα Ba 1 This term is maximized for Ba = B since 1 + 1 eα Ba 1 1 αB eα Ba eα Ba 1 Ba eα/Ba Ba 1 eα Ba eα Ba 1 1 αBa eα B eα B 1 1 = 1 + 1 eα B 1 1 αB eα B eα B 1 | {z } =:p(α) due to Lemma 2. The lemma statement therefore follows immediately if ℓa = 0. We may thus assume that ℓa > 0 and use Lemma 10 to determine the second term in the maximum as ℓa = 1 + 1 eα Ba 1 x 1 eαx Ba 1 where x =: ℓa/Ba. The second term behaves similarly to the first as ℓa = 1 + 1 eα Ba 1 x 1 eαx Ba 1 1 + 1 eα B 1 x 1 (eαx B 1) since eαx Ba 1 / eα Ba 1 (eαx B 1) / (eα B 1). We define g(α, x) := 1 x 1 (eαx B 1) such that we can write max Φa + Ψa + Ωa 1 + 1 eα B 1 max {p(α), g(α, x)} . We want to remove the dependency on x in g by maximizing g over x [0, 1] for a fixed α. As g(α, x) is continuous, it suffices to evaluate g in both endpoints and find the stationary points. We have g(α, 0) = lim x 0 x 1 (eαx B 1) = lim x 0 (1 x) (eαx B 1) x = lim x 0 (eαx B 1) + (1 x) ln(eα B)eαx B = ln(eα B) by L Hoptial. Further, g(α, 1) = 0. Next, we find the stationary points x [0, 1] as solutions to the equation xg(α, x ) = ln(eα B) 1 x 1 eαx B eαx B 1 (x )2 = 0 which is equivalent to eαx B 1 = ln(eα B)(x )2 1 x 1 eαx B . There is no closed form solution for x , but we can replace eαx B 1 in g with the RHS of the above. This yields a new function h(α, y) = 1 y 1 ln(eα B)y2 1 y 1 eαy B = ln(eα B) (1 y)2 eαy B with h(α, x ) = g(α, x ). We can thus maximize h over y [0, 1] to obtain an upper bound on g(x ). Note that h(α, 0) = ln(eα B) = g(α, 0) and h(α, 1) = 0 = g(α, 1). To this end, let y be such that y h(α, y ) = ln(eα B)2 (1 y )2 eαy B 2 ln(eα B) (1 y ) eαy which is equivalent to ln(eα B) (1 y ) 2 = 0 or y = 1 2 ln(eα B). We evaluate h in y and obtain h(α, y ) = α ln(e B) 2 α ln(e B) 2 e α 2 ln(e B) B = 4 α ln(e B)e2 eα B. Note that y 0 α 2/ ln(e B). Furthermore, h (α) always exceeds the endpoint g(α, 0): We calculate h (α) := h(α, y ) = 4 ln(eα B)e2 eα B 4 ln(eα B)e2 e2 ln(eα/2 B )2 where the inequality is due to ez ez for z = ln eα/2 B 0. Therefore, for all x [0, 1], ( ln(eα B) if α 2 ln(e B) h (α) otherwise. We consider both intervals separately. Let us first consider the the case when α h 0, 2 ln(e B) i . If B < , there could be multiple intersection points between p(α) and α ln(e B). 
However, the situation is easier if $B \to \infty$: the intersection points of $p(\alpha)$ and $\alpha\ln(e_B)$, given by $\alpha e^{\alpha} - e^{\alpha} + 1 = \alpha^{3}$, lie at $\alpha = 1$ and $\alpha^{*} \approx 1.79$, and $\alpha\ln(e_B)$ dominates $p(\alpha)$ between $1$ and $\alpha^{*}$. It remains to consider the case $\alpha \geq 2/\ln(e_{B})$. Again, there can be many intersection points of $p(\alpha)$ with $\alpha\ln(e_{B})$ and $h^{*}(\alpha)$. However, if $B \to \infty$, then $p(\alpha)$ already dominates $h^{*}(\alpha)$ for $\alpha > 2$, which we can see as follows. First, $h^{*}(\alpha) \leq p(\alpha)$ reduces to $4e^{2-2\alpha} \leq 1 - \frac{1-e^{-\alpha}}{\alpha}$, whose left-hand side is decreasing in $\alpha$. The right-hand side is non-decreasing, since $\frac{1-e^{-\alpha}}{\alpha}$ is decreasing in $\alpha$: its derivative is $\frac{e^{-\alpha}(\alpha - e^{\alpha} + 1)}{\alpha^{2}} \leq 0$, which holds as $1 + \alpha \leq e^{\alpha}$. Finally, we check the inequality at the left endpoint: $h^{*}(2) = 2 \leq 2.10 \approx p(2)$.

B Generalized Assignment Problem

The generalized assignment problem (GAP) is a generalization of Display Ads where impressions $t$ can take up an arbitrary size $u_{at}$ in the budget constraint of advertiser $a$. This formulation encompasses both Display Ads and Ad Words, and we empirically compare it to the Ad Words algorithm with predictions due to Mahdian et al. (2007) in Section C.3. For simplicity of presentation, we assume that all budgets are $1$ and that, instead, the sizes are infinitesimal, $u_{at} \to 0$. As before, however, it is possible to adapt the algorithm to work with large sizes $u_{at}$. The dual constraints of the LP read

$$\forall a, t: \quad z_t \geq w_{at} - u_{at}\beta_a.$$

Algorithm 2 is a generalization of Algorithm 1 to GAP. An immediate difference is that the discounted gain $w_{at} - u_{at}\beta_a$ respects the impression size $u_{at}$, in accordance with the changed dual constraint. We still follow the predicted advertiser if its discounted gain is a sufficiently high fraction of the maximum discounted gain. However, we might now have to remove multiple impressions with the least value-size ratio to accommodate the new impression. The update for $\beta_a$ also differs and is based on the value-size ratios of impressions allocated to $a$: For a fixed advertiser $a$, let $U_a = \sum_{t \in X_a} u_{at}$ be the total size of all impressions ever allocated to $a$. For any $x \in (0, U_a]$, define $\frac{w_x}{u_x}$ as the minimal value-size ratio such that

$$\sum_{t \in X_a \,:\, w_{at}/u_{at} \leq w_x/u_x} u_{at} > x. \qquad (12)$$

Then, we can naturally define $\beta_a$ as the exponential average over the ratios $\frac{w_x}{u_x}$. As before, we also assume that there exists a dummy advertiser that only receives impressions of zero value-size ratio and that all advertisers are initially filled up with impressions of zero value.

B.1 Robustness

Theorem 12. Algorithm 2 has a robustness of $\frac{\mathrm{ALG}}{\mathrm{OPT}} \geq \frac{e^{\alpha}-1}{\alpha e^{\alpha}}$.

Proof. Assume we assign impression $t$ to advertiser $a$ while disposing of some impressions to make space. We will bound the dual increase as a multiple of the primal increase. We assume for now that after allocating $t$ to $a$, it becomes the impression with the highest value-size ratio (a general proof follows analogously to the proof of robustness for Display Ads in Section A.1).
The primal increase is simply

$$P = w_{at} - \int_{U_a - 1 - u_{at}}^{U_a - 1} \frac{w_x}{u_x}\,\mathrm{d}x,$$

i.e. the value of $t$ minus the value of the impressions that drop out of the retained unit of space.

Algorithm 2 Exponential Averaging with Predictions for GAP
1: Input: Robustness-consistency trade-off parameter $\alpha \in [1, \infty)$
2: For each advertiser $a$, initialize $\beta_a \leftarrow 0$ and fill up $a$ with zero-value impressions
3: for all arriving impressions $t$ do
4:   $a(\mathrm{PRD}) \leftarrow \mathrm{PRD}(t)$
5:   $a(\mathrm{EXP}) \leftarrow \arg\max_a\{w_{at} - u_{at}\beta_a\}$
6:   if $\alpha\,(w_{a(\mathrm{PRD}),t} - u_{a(\mathrm{PRD}),t}\,\beta_{a(\mathrm{PRD})}) \geq w_{a(\mathrm{EXP}),t} - u_{a(\mathrm{EXP}),t}\,\beta_{a(\mathrm{EXP})}$ then
7:     $a \leftarrow a(\mathrm{PRD})$
8:   else
9:     $a \leftarrow a(\mathrm{EXP})$
10:  end if
11:  Dispose of impressions with least value-size ratio currently allocated to $a$ until there is $u_{at}$ of free space and allocate $t$ to $a$
12:  Compute $\frac{w_x}{u_x}$ as in (12) and update $\beta_a \leftarrow \frac{\alpha}{e^{\alpha}-1}\int_{U_a-1}^{U_a}\frac{w_x}{u_x}\,e^{\alpha(U_a-x)}\,\mathrm{d}x$
13: end for

(A minimal Python sketch of this procedure is given at the end of this appendix.)

At the same time, the dual variable after the allocation satisfies

$$\beta^{(t)}_a = \frac{\alpha}{e^{\alpha}-1}\int_{U_a-1}^{U_a}\frac{w_x}{u_x}\,e^{\alpha(U_a-x)}\,\mathrm{d}x \leq e^{\alpha u_{at}}\,\beta^{(t-1)}_a + \frac{\alpha}{e^{\alpha}-1}\int_{U_a-u_{at}}^{U_a}\frac{w_{at}}{u_{at}}\,e^{\alpha(U_a-x)}\,\mathrm{d}x,$$

where the first term accounts for the previously allocated impressions and the second term for $t$ itself. We set $z_t = \alpha\,(w_{at} - u_{at}\beta^{(t-1)}_a)$ and obtain, since $e^{\alpha u_{at}} - 1 = \alpha u_{at}$ due to $u_{at} \to 0$,

$$D = \beta^{(t)}_a - \beta^{(t-1)}_a + z_t \leq (e^{\alpha u_{at}} - 1)\,\beta^{(t-1)}_a + \frac{\alpha}{e^{\alpha}-1}\int_{U_a-u_{at}}^{U_a}\frac{w_{at}}{u_{at}}\,e^{\alpha(U_a-x)}\,\mathrm{d}x + \alpha\,(w_{at} - u_{at}\beta^{(t-1)}_a) = \frac{\alpha}{e^{\alpha}-1}\,w_{at} + \alpha\,w_{at} \leq \frac{\alpha e^{\alpha}}{e^{\alpha}-1}\,w_{at}.$$

B.2 Consistency

Theorem 13. Algorithm 2 achieves the same consistency bound as Algorithm 1 in Lemma 11, i.e. $\frac{\mathrm{PRD}}{\mathrm{ALG}} \leq 1 + \frac{1}{e^{\alpha}-1}\max\{\dots\}$.

Proof. As before, we split the impressions $t$ based on whether the algorithm followed the prediction or not. If the algorithm ignores the prediction, we can use that $\alpha\,(w_{a(\mathrm{PRD}),t} - u_{a(\mathrm{PRD}),t}\,\beta_{a(\mathrm{PRD})}) \leq w_{a(\mathrm{EXP}),t} - u_{a(\mathrm{EXP}),t}\,\beta_{a(\mathrm{EXP})}$ due to Line 6 in Algorithm 2. With a similar calculation, we obtain

$$\mathrm{PRD} \leq \sum_{a}\Bigg(\sum_{t \in P_a \cap X_a} w_{at} + \frac{1}{\alpha}\sum_{t \in X_a \setminus P_a} w_{at} - \frac{1}{\alpha}\sum_{t \in X_a \setminus P_a} u_{at}\,\beta^{(t-1)}_a + \sum_{t \in P_a \setminus X_a} u_{at}\,\beta^{(t-1)}_a\Bigg).$$

Once again, we fix an advertiser $a$. Let $\rho_a := \sum_{t \in P_a \cap X_a} u_{at}$ so that we can bound

$$\sum_{t \in P_a \setminus X_a} u_{at}\,\beta^{(t-1)}_a \leq (1 - \rho_a)\,\beta^{(T)}_a.$$

We still have to argue that the worst case is when all impressions are ordered such that their value-size ratios are non-decreasing and the impressions in $P_a \cap X_a$ are the ones with maximum value-size ratio among $X_a$. The latter is obvious as it can only increase the value of PRD, so it remains to show that the non-decreasing value-size ordering minimizes the third sum in $\mathrm{PRD}_a$ (the first two sums are invariant under reordering). To this end, note that the value of $\beta_a$ for GAP is the limit of $\beta_a$ for Display Ads in the following sense: For small $\epsilon > 0$, we can split each GAP impression $t \in X_a$ into $u_t/\epsilon$ identical Display Ads impressions of value $\frac{w_t}{u_t}$, while assuming a budget of $1/\epsilon$. Then, the GAP $\beta_a$ and the Display Ads $\beta_a$ are identical. As we know from Display Ads, the worst case is achieved when the Display Ads impressions with value $\frac{w_t}{u_t}$ are in non-decreasing order. In this ordering, consecutive Display Ads impressions with identical value $\frac{w_t}{u_t}$ still correspond to the same GAP impression $t$, so this ordering is also the worst case for GAP. We may therefore assume that the impressions are ordered such that their value-size ratios are non-decreasing. As such, we obtain

$$\beta^{(t)}_a = \frac{\alpha}{e^{\alpha}-1}\int_{U^{(t)}_a-1}^{U^{(t)}_a}\frac{w_x}{u_x}\,e^{\alpha(U^{(t)}_a - x)}\,\mathrm{d}x = \frac{\alpha}{e^{\alpha}-1}\int_{y-1}^{y}\frac{w_x}{u_x}\,e^{\alpha(y-x)}\,\mathrm{d}x =: \beta^{(y)}_a \quad\text{where } y = U^{(t)}_a.$$

Combined with the fact that $P_a \cap X_a$ are the last impressions in $X_a$, we can now write

$$\mathrm{PRD}_a \leq \sum_{t \in P_a \cap X_a} w_{at} + \frac{1}{\alpha}\sum_{t \in X_a \setminus P_a} w_{at} - \frac{1}{\alpha}\int_{0}^{U_a - \rho_a}\beta^{(x)}_a\,\mathrm{d}x + (1 - \rho_a)\,\beta^{(T)}_a.$$

This helps us to compute $\beta^{(x)}_a$ in $\mathrm{PRD}_a$ and to rewrite the whole term as a linear combination of value-size ratios.

Lemma 14. We have

$$\mathrm{PRD}_a \leq \int_{U_a-1}^{U_a-\rho_a}\frac{w_x}{u_x}\,\phi_x\,\mathrm{d}x + \int_{U_a-\rho_a}^{U_a}\frac{w_x}{u_x}\,\psi_x\,\mathrm{d}x + \frac{w_{U_a-1}}{u_{U_a-1}}\,\Omega_a,$$

where

$$\phi_x := (1-\rho_a)\,\frac{\alpha}{e^{\alpha}-1}\,e^{\alpha(U_a-x)} + \frac{1}{\alpha}\cdot\frac{e^{\alpha} - e^{\alpha(U_a-\rho_a-x)}}{e^{\alpha}-1}, \qquad \psi_x := 1 + (1-\rho_a)\,\frac{\alpha}{e^{\alpha}-1}\,e^{\alpha(U_a-x)},$$

and $\Omega_a$ is the total coefficient mass on the boundary ratio $\frac{w_{U_a-1}}{u_{U_a-1}}$, analogous to the Display Ads case (it is evaluated in (14) below).

Proof.
We rewrite the third sum in $\mathrm{PRD}_a$ by plugging in the definition of $\beta^{(x)}_a$ and swapping the order of integration:

$$\int_0^{U_a-\rho_a}\beta^{(x)}_a\,\mathrm{d}x = \frac{\alpha}{e^{\alpha}-1}\int_0^{U_a-\rho_a}\frac{w_y}{u_y}\int_0^{\min\{1,\,U_a-\rho_a-y\}} e^{\alpha s}\,\mathrm{d}s\,\mathrm{d}y = \int_0^{U_a-1-\rho_a}\frac{w_y}{u_y}\,\mathrm{d}y + \frac{1}{e^{\alpha}-1}\int_{U_a-1-\rho_a}^{U_a-\rho_a}\frac{w_y}{u_y}\left(e^{\alpha(U_a-\rho_a-y)}-1\right)\mathrm{d}y,$$

where for the last equality, we simply evaluated the inner integral. Using this expression for the third sum in $\mathrm{PRD}_a$ cancels out most of the terms of the second sum:

$$\frac{1}{\alpha}\int_0^{U_a-\rho_a}\frac{w_x}{u_x}\,\mathrm{d}x - \frac{1}{\alpha}\int_0^{U_a-\rho_a}\beta^{(x)}_a\,\mathrm{d}x = \frac{1}{\alpha}\int_{U_a-1-\rho_a}^{U_a-1}\frac{w_y}{u_y}\cdot\frac{e^{\alpha}-e^{\alpha(U_a-\rho_a-y)}}{e^{\alpha}-1}\,\mathrm{d}y + \frac{1}{\alpha}\int_{U_a-1}^{U_a-\rho_a}\frac{w_y}{u_y}\cdot\frac{e^{\alpha}-e^{\alpha(U_a-\rho_a-y)}}{e^{\alpha}-1}\,\mathrm{d}y. \qquad (13)$$

We upper bound the first integral in (13), using that the ratios are non-decreasing and hence $\frac{w_y}{u_y} \leq \frac{w_{U_a-1}}{u_{U_a-1}}$ for $y \leq U_a-1$:

$$\frac{1}{\alpha}\int_{U_a-1-\rho_a}^{U_a-1}\frac{w_y}{u_y}\cdot\frac{e^{\alpha}-e^{\alpha(U_a-\rho_a-y)}}{e^{\alpha}-1}\,\mathrm{d}y \leq \frac{w_{U_a-1}}{u_{U_a-1}}\cdot\frac{1}{\alpha(e^{\alpha}-1)}\left(\rho_a e^{\alpha} - \frac{e^{\alpha}-e^{\alpha(1-\rho_a)}}{\alpha}\right) =: \frac{w_{U_a-1}}{u_{U_a-1}}\,\Omega_a. \qquad (14)$$

Furthermore, by definition of $\beta^{(U_a)}_a$,

$$(1-\rho_a)\,\beta^{(U_a)}_a = (1-\rho_a)\,\frac{\alpha}{e^{\alpha}-1}\int_{U_a-1}^{U_a}\frac{w_x}{u_x}\,e^{\alpha(U_a-x)}\,\mathrm{d}x. \qquad (15)$$

Combining (13), (14), and (15) and grouping terms yields the bound stated in the lemma, with the coefficients $\phi_x$ and $\psi_x$ collecting the factors on the ratios over $[U_a-1, U_a-\rho_a]$ and $[U_a-\rho_a, U_a]$, respectively.

Analogously to Display Ads, we define $\Phi_a := \int_{U_a-1}^{U_a-\rho_a}\phi_x\,\mathrm{d}x$ and $\Psi_a := \int_{U_a-\rho_a}^{U_a}\psi_x\,\mathrm{d}x$, and the total coefficient $\tau_a := \Phi_a + \Psi_a + \Omega_a$, which by a calculation similar to Lemma 10 can be shown to no longer depend on $\rho_a$.

Lemma 15. We have $\mathrm{PRD}_a \leq \max\left\{\tau_a, \frac{\Psi_a}{\rho_a}\right\}\mathrm{ALG}_a$ if $\rho_a > 0$ and, otherwise, $\mathrm{PRD}_a \leq \tau_a\,\mathrm{ALG}_a$.

Proof. Again, let $w_\Phi := \frac{1}{1-\rho_a}\int_{U_a-1}^{U_a-\rho_a}\frac{w_x}{u_x}\,\mathrm{d}x$ and $w_\Psi := \frac{1}{\rho_a}\int_{U_a-\rho_a}^{U_a}\frac{w_x}{u_x}\,\mathrm{d}x$ be the average ratios on the intervals $[U_a-1, U_a-\rho_a]$ and $[U_a-\rho_a, U_a]$, respectively. The coefficients $\psi_x$ are still decreasing in $x$, since $(1-\rho_a)\frac{\alpha}{e^{\alpha}-1} \geq 0$, so we can bound $\int_{U_a-\rho_a}^{U_a}\frac{w_x}{u_x}\psi_x\,\mathrm{d}x \leq w_\Psi\Psi_a$. However, $\phi_x$ is not always decreasing: rearranging $\phi_x$ shows that it is decreasing if $(1-\rho_a)\,\alpha$ is at least $\frac{1}{\alpha}e^{-\alpha\rho_a}$, and we analyze two cases based on the relationship of the two terms.

If $(1-\rho_a)\,\alpha \geq \frac{1}{\alpha}e^{-\alpha\rho_a}$, then $\phi_x$ is decreasing and $\int_{U_a-1}^{U_a-\rho_a}\frac{w_x}{u_x}\phi_x\,\mathrm{d}x \leq w_\Phi\Phi_a$; charging the mass of $\Omega_a$ to $\Phi_a$ via $\frac{w_{U_a-1}}{u_{U_a-1}} \leq w_\Phi$ gives

$$\int_{U_a-1}^{U_a-\rho_a}\frac{w_x}{u_x}\phi_x\,\mathrm{d}x + \int_{U_a-\rho_a}^{U_a}\frac{w_x}{u_x}\psi_x\,\mathrm{d}x + \frac{w_{U_a-1}}{u_{U_a-1}}\Omega_a \leq \int_{U_a-1}^{U_a-\rho_a}\frac{w_x}{u_x}\,\tau_a\,\mathrm{d}x + w_\Phi\big(\Phi_a + \Omega_a - (1-\rho_a)\tau_a\big) + w_\Psi\Psi_a. \qquad (16)$$

If $(1-\rho_a)\,\alpha \leq \frac{1}{\alpha}e^{-\alpha\rho_a}$, we can no longer bound the ratios over $[U_a-1, U_a-\rho_a]$ by the average $w_\Phi$. However, each coefficient satisfies $\phi_x \leq \tau_a$, so we can charge $\tau_a - \phi_x$ of mass from $\Omega_a$ to each coefficient $\phi_x$ and obtain

$$\int_{U_a-1}^{U_a-\rho_a}\frac{w_x}{u_x}\phi_x\,\mathrm{d}x + \int_{U_a-\rho_a}^{U_a}\frac{w_x}{u_x}\psi_x\,\mathrm{d}x + \frac{w_{U_a-1}}{u_{U_a-1}}\Omega_a \leq \int_{U_a-1}^{U_a-\rho_a}\frac{w_x}{u_x}\,\tau_a\,\mathrm{d}x + \frac{w_{U_a-1}}{u_{U_a-1}}\big(\Phi_a + \Omega_a - (1-\rho_a)\tau_a\big) + w_\Psi\Psi_a. \qquad (17)$$

In both cases (16) and (17), we have shown that the left-hand side is at most $\int_{U_a-1}^{U_a-\rho_a}\frac{w_x}{u_x}\,\tau_a\,\mathrm{d}x + v\,\big(\Phi_a + \Omega_a - (1-\rho_a)\tau_a\big) + w_\Psi\Psi_a$ for some $v \leq w_\Phi \leq w_\Psi$. If $\rho_a > 0$, we can use $v \leq w_\Psi$ to charge the remaining mass to $\Psi_a$ and obtain, exactly as in the Display Ads case,

$$\mathrm{PRD}_a \leq \int_{U_a-1}^{U_a-\rho_a}\frac{w_x}{u_x}\,\tau_a\,\mathrm{d}x + w_\Psi\max\{\rho_a\tau_a, \Psi_a\} \leq \max\left\{\tau_a, \frac{\Psi_a}{\rho_a}\right\}\int_{U_a-1}^{U_a}\frac{w_x}{u_x}\,\mathrm{d}x,$$

and $\mathrm{PRD}_a \leq \tau_a\int_{U_a-1}^{U_a}\frac{w_x}{u_x}\,\mathrm{d}x$ if $\rho_a = 0$.

Note that for the bound of Lemma 11, we did not require that $\ell_a$ is integral. We can thus apply Lemma 11 to bound $\max\left\{\tau_a, \frac{\Psi_a}{\rho_a}\right\}$ and obtain the same result, which proves Theorem 13.

Figure 6: Experimental results for varying values of α on synthetic data with 12 advertisers and 2000 impressions of 10 types, where we report the same quantities as in Figure 3. We use Dual Base predictions for different σ and ϵ. Note that there are two black lines indicating the performance of the worst-case algorithm without predictions, corresponding to the datasets with differing σ. (Legend: σ = 0.2, ϵ = 0.02 (0.56); σ = 0.2, ϵ = 0.2 (0.67); σ = 1.0, ϵ = 0.02 (0.79); σ = 1.0, ϵ = 0.2 (0.83); Random Mixture; Worst-Case Baseline.)
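Before turning to the additional experiments, we illustrate Algorithm 2 referenced above with a small, self-contained Python sketch. This is our own minimal reading of the pseudocode under stated assumptions (strictly positive sizes, budget 1 per advertiser, and whole impressions disposed of instead of infinitesimal fractions); the function names and the discretized form of the integral in Line 12 are illustrative and not the authors' experimental code.

```python
import math
from typing import Callable, Dict, List, Tuple

def exp_avg_beta(kept: List[Tuple[float, float]], alpha: float) -> float:
    """Exponential average of value-size ratios over the retained impressions.

    `kept` holds (value, size) pairs sorted by increasing value-size ratio with
    total size at most 1 (the budget).  We approximate the update in Line 12,
    (alpha / (e^alpha - 1)) * integral of (w_x/u_x) e^{alpha (U - x)} dx,
    by summing the weight of each retained impression piecewise; missing mass
    below total size 1 is treated as zero-value filler impressions.
    """
    U = sum(u for _, u in kept)               # total retained size, at most 1
    beta, x = 0.0, U                          # integrate from the top (x = U) down
    for w, u in reversed(kept):               # highest ratios first
        lo, hi = x - u, x                     # this impression occupies (lo, hi]
        weight = (math.exp(alpha * (U - lo)) - math.exp(alpha * (U - hi))) / (math.exp(alpha) - 1.0)
        beta += (w / u) * weight              # assumes u > 0
        x = lo
    return beta

def allocate_gap(impressions: List[Dict[int, Tuple[float, float]]],
                 predictor: Callable[[int], int], k: int, alpha: float):
    """One pass of Algorithm 2 (sketch). `impressions[t]` maps each advertiser
    a to its (value, size) pair (w_at, u_at); `predictor(t)` is PRD(t)."""
    beta = [0.0] * k
    kept: Dict[int, List[Tuple[float, float]]] = {a: [] for a in range(k)}
    total_value = 0.0
    for t, offers in enumerate(impressions):
        gains = {a: w - u * beta[a] for a, (w, u) in offers.items()}  # discounted gains
        a_exp = max(gains, key=gains.get)
        a_prd = predictor(t)
        # Line 6: follow the prediction if alpha times its discounted gain competes.
        a = a_prd if (a_prd in gains and alpha * gains[a_prd] >= gains[a_exp]) else a_exp
        w, u = offers[a]
        # Free disposal (Line 11), simplified to whole impressions: drop lowest
        # value-size ratios until everything fits into the unit budget again.
        kept[a].append((w, u))
        kept[a].sort(key=lambda p: p[0] / p[1])
        while sum(s for _, s in kept[a]) > 1.0:
            kept[a].pop(0)
        beta[a] = exp_avg_beta(kept[a], alpha)  # Line 12
        total_value = sum(v for lst in kept.values() for v, _ in lst)
    return total_value, kept
```

Disposing of whole impressions (rather than fractions of size) is a simplification that only matters when sizes are not small relative to the budget, which is outside the regime assumed in this appendix.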
Figure 7: Performance for varying prediction quality with the data from Figure 6 (top), for α = 2 (top) and α = 5 (bottom). (Both rows plot consistency and robustness against the prediction's competitiveness; legend: Dual Base, OPT random corruption, OPT biased corruption.)

C Further Experimental Results

C.1 Real-World Data

Description of the Yahoo Dataset: The original dataset contains impression allocations to 16268 advertisers throughout 123 days, each tagged with the advertiser that bought the impression and a set of keyphrases that categorize the user for whom the impression is displayed. Lavastida et al. (2021) then consider the 20 most common keyphrases and create an impression type for each non-empty subset thereof. Whenever an advertiser buys an impression with a certain set of keyphrases, we assume that all impression types corresponding to a superset of these keyphrases are relevant for this advertiser, and that the advertiser derives some constant value (say, 1) from such an allocation. At the same time, the number of impressions we create from each impression type (i.e. the supply) is the number of impression allocations in the original dataset that show that the impression type is relevant for an advertiser. As such, we obtain around 2 million impressions. Lavastida et al. (2021) try multiple impression orders and budgets for the advertisers, but due to space constraints we restrict ourselves to displaying all impressions of a type at once, in supply-ascending order. We determine advertisers' budgets by allocating each impression uniformly at random to one of the advertisers with non-zero valuation and taking the number of impressions allocated in this way as the advertiser's budget.

Figure 8: Performance on a worst-case instance with different predictors. (Legend: Dual Base ϵ = 0.1 (0.66), OPT biased corruption p = 0.5 (0.73), OPT biased corruption p = 0.75 (0.61), Worst-Case Baseline.)

C.2 Synthetic Data

Results: Figure 5 shows consistency and robustness of our algorithm on synthetic data with T = 2000 impressions of 10 types and k = 12 advertisers, for a variety of predictions. The plot shows the performance for predictions derived from the optimum solution (with varying corruption) and for the dual base prediction. Our algorithm converges to almost perfect consistency and robustness for α = 10, given the optimum solution. At the same time, we observe that the algorithm is robust against both random and biased corruption, as the robustness does not drop to the prediction's low competitiveness of around 0.7. Furthermore, the algorithm performs well in combination with the dual base prediction for ϵ = 0.1, even though the first 200 impressions are clearly not representative of all synthetically generated impressions. To investigate our algorithm in conjunction with an easily available prediction, we also analyze the behavior of the dual base algorithm for different values of σ and ϵ in Figure 6. The performance of our algorithm under dual base predictions clearly improves for increasing values of σ, as impressions become more evenly distributed across the day. Generally, sampling more impressions helps, but dual base predictions may also lead to a drop in robustness, and more samples can even lead to a more adversarial prediction, as we explore further below.
Yet, the robustness still stays above the prediction's competitiveness in these cases. Figure 7 shows consistency and robustness for different predictions with varying competitiveness for α ∈ {2, 5}. We achieve this by varying the fraction ϵ ∈ [0, 1] of samples for the dual base algorithm and the corruption rate p ∈ [0, 1] for random and biased corruptions. For α = 2, the consistency exceeds 1 if the prediction is not very good (competitiveness below 0.9). The algorithm is not heavily influenced by a bad prediction since α = 2 is low, so the total obtained value remains relatively constant. For α = 5, however, the algorithm may follow the bad choices of the prediction, so the competitiveness varies more. As expected, the average robustness decreases for increasing α, but the dual base prediction starts out with a much lower robustness than the corrupted predictions. The reason is that both the dual base algorithm and exponential averaging make their decisions based on the discounted gain. Our algorithm may therefore easily disregard a corrupted prediction, as its discounted gain is low (or even negative), whereas the dual base prediction looks like a sensible choice. The dual base algorithm therefore manages to fool the algorithm for low α, while biased corruption is the most harmful corruption for larger values of α.

Hard Instances: We consider the worst-case instance for the Display Ads problem described in Mehta et al. (2007); a small generator for this instance is sketched at the end of this appendix. For k advertisers, we create impressions of types r ∈ {1, . . . , k}. An impression t of type r has zero value for the first r − 1 advertisers, $w_{1,t} = \dots = w_{r-1,t} = 0$, and value 1 for the following advertisers, $w_{r,t} = \dots = w_{k,t} = 1$. We first show all impressions of type 1, then all impressions of type 2, and so forth. The instance is difficult because the algorithm, not knowing about future impressions, has to allocate impressions of a type equally among all advertisers that can derive value from this impression type. As shown by Mehta et al. (2007), the competitiveness of the exponential averaging algorithm approaches $1 - \frac{1}{e}$ as $k \to \infty$ on this instance. We evaluate the performance of our algorithm on this worst-case instance in Figure 8. Providing the optimum solution as prediction allows the algorithm to quickly ascend to a perfect robustness of 1. We also consider two (biased) corrupted versions of this prediction with p ∈ {50%, 75%}. In both cases, the algorithm still achieves a robustness above the competitiveness of the prediction. The dual base algorithm cannot deliver meaningful predictions here, as it only sees impressions of the first type, which by construction are clearly not representative of the following impressions.

Figure 9: Performance on synthetic Ad Words instances, compared to the algorithm of Mahdian et al. (2007). The black lines show the robustness of two worst-case algorithms without predictions: the algorithm due to Feldman et al. (2009a), which is the basis for our algorithm, and the algorithm of Mehta et al. (2007), which serves as a basis for the algorithm of Mahdian et al. (2007). (Legend: GAP Exp Avg (0.79), Mahdian (0.79), Worst-Case Baseline.)

C.3 Evaluation of GAP on an Ad Words Instance

With an algorithm for GAP, we can also solve Ad Words instances. This allows us to compare our generalized algorithm to the algorithm of Mahdian et al. (2007) under the same predictions. In Figure 9, we run both algorithms on synthetic instances from Section C.2 with an optimum prediction and random corruption (p = 0.5).
Both algorithms seem to have similar consistency, but our algorithm achieves better robustness due to a different choice of constants in the underlying worst-case algorithms.
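For reference, the hard instance described under "Hard Instances" above can be generated with a few lines of Python. This is a minimal sketch under stated assumptions, labeled here because they are not spelled out in the text: each advertiser's budget is set to the number of impressions per type (so sending all type-r impressions to advertiser r is an optimum allocation), and "random corruption" is read as independently replacing each predicted advertiser with probability p. It is not the authors' experimental code.

```python
import random
from typing import List, Tuple

def hard_instance(k: int, per_type: int) -> Tuple[List[List[float]], List[int], List[int]]:
    """Worst-case Display Ads instance of Mehta et al. (2007), as described above.

    Impressions arrive type by type; an impression of type r (0-indexed here) has
    value 0 for advertisers 0..r-1 and value 1 for advertisers r..k-1.  Assuming
    every advertiser has budget `per_type`, assigning all type-r impressions to
    advertiser r is optimal.  Returns (values, budgets, optimum_prediction).
    """
    values, opt = [], []
    for r in range(k):
        for _ in range(per_type):
            values.append([0.0] * r + [1.0] * (k - r))
            opt.append(r)                     # optimum: type r goes to advertiser r
    budgets = [per_type] * k
    return values, budgets, opt

def corrupt(prediction: List[int], k: int, p: float, seed: int = 0) -> List[int]:
    """Random corruption (one plausible reading): with probability p, replace the
    predicted advertiser by a uniformly random one."""
    rng = random.Random(seed)
    return [rng.randrange(k) if rng.random() < p else a for a in prediction]

if __name__ == "__main__":
    values, budgets, opt = hard_instance(k=5, per_type=100)
    noisy = corrupt(opt, k=5, p=0.5)
    print(len(values), "impressions,", sum(a != b for a, b in zip(opt, noisy)), "predictions corrupted")
```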