# Robust Budget Pacing with a Single Sample

Santiago Balseiro^{1,2}, Rachitesh Kumar^{3}, Vahab Mirrokni^{2}, Balasubramanian Sivan^{2}, Di Wang^{2}

Major Internet advertising platforms offer budget pacing tools as a standard service for advertisers to manage their ad campaigns. Given the inherent non-stationarity in an advertiser's value and in competing advertisers' values over time, a commonly used approach is to learn a target expenditure plan that specifies a target spend as a function of time, and then run a controller that tracks this plan. This raises the question: how many historical samples are required to learn a good expenditure plan? We study this question by considering an advertiser repeatedly participating in $T$ second-price auctions, where the tuple of her value and the highest competing bid is drawn from an unknown time-varying distribution. The advertiser seeks to maximize her total utility subject to her budget constraint. Prior work has shown the sufficiency of $T \log T$ samples per distribution to achieve the optimal $O(\sqrt{T})$-regret. We dramatically improve this state of the art and show that just one sample per distribution is enough to achieve the near-optimal $\tilde{O}(\sqrt{T})$-regret, while still being robust to noise in the sampling distributions.

1. Introduction

Online advertising is the economic engine of the internet. It allows platforms like Google, Facebook, and TikTok to fund services that are free at the point of delivery while providing businesses the ability to target their ads to relevant users. When a user logs onto these platforms, an auction is run among the interested advertisers to determine which ad is displayed to her. Given the scale of the internet, advertisers are typically interested in thousands (if not millions) of auctions every day. If they participate in all such auctions, they will likely spend far beyond their budget. This necessitates budget management, which is our focus.

(Authors listed in alphabetical order. 1: DRO, Columbia Business School, New York, NY, USA. 2: Google Research, New York, NY, USA. 3: IEOR, Columbia University, New York, NY, USA; work done as a Student Researcher at Google Research. Correspondence to: Rachitesh Kumar. Proceedings of the 40th International Conference on Machine Learning, Honolulu, Hawaii, USA. PMLR 202, 2023. Copyright 2023 by the author(s).)

Target spend plans to address non-stationarities. Given the inherent non-stationarities that exist over time (in the volume of traffic, in the demographics of visiting users, in conversion rates, and so on), an advertiser's own value and the values of the competing advertisers are also non-stationary. To deal with this non-stationarity, budget management systems compute a target expenditure plan (Facebook-Guide; Kumar et al., 2022). The latter is a function of time that specifies the recommended amount of spend at each point in time. That is, such a plan distributes the cumulative daily/weekly budget into smaller chunks of time, appropriately capturing the non-stationarity. A pacing algorithm, such as a controller, is then used to track the plan. This raises the question: how much data is required to come up with a good target spend plan?

Our model. We study this question by considering a budget-constrained advertiser participating in $T$ second-price auctions. The advertiser seeks to maximize her utility while respecting her budget (formal model in Section 2).
We model the tuple of an advertiser's value and her highest competing bid at time $t$ as being generated from unknown independent time-varying distributions $\{P_t\}_{t=1}^T$. This is a major departure from the stochastic budget-management/resource-allocation literature, which for the most part assumes stationary distributions, an assumption that rarely holds in practice as daily traffic patterns are non-stationary (Zhou et al., 2019). If nothing is known about the distributions $\{P_t\}_t$ and they can be chosen adversarially, Balseiro & Gur (2019) showed that every algorithm must incur linear regret against the hindsight-optimal benchmark. Their impossibility result also shows that budget management does not fall under the purview of online convex optimization (OCO) and is vastly different, given that one can achieve $O(\sqrt{T})$ regret for adversarially-chosen input in OCO.

Thankfully, additional information is usually available in the form of historical samples. For example, both intraday and interday internet activity (and consequently the user traffic in which an advertiser is interested) tends to be similar week-on-week. However, this periodicity is never exact, i.e., it is too stringent to assume these historical samples are drawn from exactly the same distributions as $\{P_t\}_t$. To reflect this reality, in this work, we assume that we have samples from distributions $\{\tilde{P}_t\}_t$, which are potentially different from $\{P_t\}_t$, and develop algorithms that are robust to differences between them. We call an algorithm robust if its regret degrades smoothly as a function of the total Wasserstein distance between the two sequences of distributions. Crucially, classical training-based algorithms that learn a dual solution from samples and use it for bidding (see Algorithm 1, inspired by Devanur & Hayes 2009; Agrawal et al. 2014) are not robust (see Section 3).

Our result. Our main contribution is to show that Dual Follow-The-Regularized-Leader (FTRL) is robust and achieves the near-optimal $\tilde{O}(\sqrt{T})$ regret even when it has access to just one sample from each distribution $\tilde{P}_t$, dramatically improving over the $T \log T$ samples per distribution required by prior work (Jiang et al., 2020) to achieve $O(\sqrt{T})$ regret.

Key insights. Our algorithm uses the samples to estimate the ideal amount of expenditure to target in each auction and then uses Dual FTRL to follow these targets by shading the advertiser's values appropriately. The key insight driving the reduction in sample complexity from $T \log T$ to $1$ per distribution is the following. Prior work by Jiang et al. (2020) first learns the sampling distributions $\{\tilde{P}_t\}_t$ from the samples, then computes the optimal duals on the learned distributions, and finally uses those duals to compute the target expenditures. Our insight is that it is not necessary to learn the entire sampling distribution. Instead, it is far more efficient to directly learn the duals from the samples and construct target expenditures based on those duals (i.e., setting the target expenditure at time $t$ to be what the dual-based solution consumes at $t$). Beyond being very efficient with samples, learning the duals from the samples also guarantees robustness to shifts between $\{\tilde{P}_t\}_t$ and $\{P_t\}_t$.

Practical implications. Consider $T$ representing a week's worth of auctions. While a total of $T \log T$ samples may be possible to obtain by looking at the past few weeks (given that each week has about $T$ samples to offer), requiring $T \log T$ samples per distribution calls for looking many months into the past.
This is because getting a sample for a distribution, where a distribution could correspond to, say, a particular hour (e.g., Monday 10 AM), entails looking at that same hour from the past week. Asking for $T \log T$ samples for any given hour, when $T$ samples is what we get for the entire week, clearly requires looking numerous months into the past. Apart from posing huge storage and operational challenges, and given that traffic patterns shift over time, even gradual shifts would significantly degrade quality as one moves too far into the past. Our ask of one sample per distribution requires looking at just the past week and getting one sample for each hour.

Near optimality: $\tilde{O}(\sqrt{T})$. Furthermore, our regret guarantee is near-optimal in light of the $\Omega(\sqrt{T})$ lower bound established in Arlotto & Gurvich (2019), which holds even in the much easier case when the distributions $\{P_t\}_t$ are identical (i.e., the i.i.d. setting) and known ahead of time. Why is our $\tilde{O}(\sqrt{T})$ regret separated from the $O(\sqrt{T})$ regret in the lower bound (and the matching upper bound in Jiang et al. (2020))? Indeed, we can also obtain $O(\sqrt{T})$ regret when given $\log T$ samples per distribution, and in the other direction, Jiang et al. (2020) also get only $\tilde{O}(\sqrt{T})$ regret when given only $T$ samples per distribution rather than $T \log T$ samples. In other words, we get a factor-$T$ reduction in samples: either from $T$ to $1$ sample while maintaining the same regret of $\tilde{O}(\sqrt{T})$, or from $T \log T$ to $\log T$ samples while maintaining the same regret of $O(\sqrt{T})$.

Primal vs. dual approach. The recent works of Banerjee et al. (2020) and Banerjee & Freund (2020) show how to obtain constant regret with a single historical trace by repeatedly solving the primal. We note that they crucially rely on the total number of types being small. When the number of types is very large, as in Internet advertising where the type models user features, intent, time of day, location, etc., a primal-based algorithm is not even well defined because one would not have seen all the types in the samples. Thus, a primal approach is out of the question for us. More on this in Appendix A.

Technical contributions. We achieve our result by developing a novel dual-iterate coupling lemma (see Lemma 4.6) and leveraging it to analyze a leave-one-out thought experiment designed to break challenging correlations which arise from working with one sample per distribution (see Subsection 4.3 for details). Additionally, we prove a novel regret decomposition for Dual FTRL (Theorem 4.1), which may be of independent interest. Finally, our algorithm does not require solving large linear programs and can be implemented efficiently (see Subsection 4.4), which is critical for online advertising since each auction runs in a few milliseconds. Due to space constraints, we directly move on to a formal description of our model next, and refer the reader to Appendix A for a discussion of other related work.

Notation. We use $\mathbb{R}_+$ and $\mathbb{R}_{++}$ to denote the set of non-negative real numbers and the set of positive real numbers respectively. For $n \in \mathbb{N}$, we use $[n] = \{1, \dots, n\}$ to denote the set of positive integers less than or equal to $n$. We use $W(\cdot, \cdot)$ to denote the Wasserstein distance between two distributions under the metric with which the sample space is endowed.

2. Online Allocation with a Single Resource and Budget Management

For ease of exposition, we will prove our results for the more general single-resource online allocation problem with linear rewards/consumptions.
It is well known that bidding in repeated second-price auctions with budgets can be modelled as an instance of this online allocation problem (e.g., see Balseiro et al. 2022b, or our Appendix B). It also captures the stochastic multi-secretary problem (Arlotto & Gurvich, 2019) as a special case.

Consider a decision maker with an initial budget $B \in \mathbb{R}_{++}$ of a resource, whose goal is to optimally spend it on $T$ sequentially arriving requests. Each request $\gamma = (f, b)$ is comprised of a linear reward function $f : X \to \mathbb{R}_+$ such that $f(x) = \mathrm{coeff}(f) \cdot x$, and a linear resource consumption function $b : X \to \mathbb{R}_+$ such that $b(x) = \mathrm{coeff}(b) \cdot x$, where $X \subset \mathbb{R}_+$ is a compact set which denotes the space of possible actions of the decision maker. We will use $S$ to denote the set of all possible requests and $\Delta(S)$ to denote the set of distributions over $S$. Moreover, we endow $S$ with the following metric $d(\cdot, \cdot)$: for any two requests $\gamma = (f, b)$ and $\gamma' = (f', b')$,
$$ d(\gamma, \gamma') = \sup_{x \in X} |f(x) - f'(x)| + \sup_{x \in X} |b(x) - b'(x)| . $$
We will assume that $0 \in X$. This allows the decision maker to avoid spending the resource if she so chooses and ensures feasibility. Moreover, let $\bar{x} = \max_{x \in X} x$. We will make standard regularity assumptions (Jiang et al., 2020; Balseiro et al., 2022a;b): there exist $\bar{f}, \bar{b} \in \mathbb{R}_+$ such that $f(x) \leq \bar{f}$ and $b(x) \leq \bar{b}$ for all requests $(f, b) \in S$ and $x \in X$. Like Jiang et al. (2020), we will also assume that there exists $\kappa \in \mathbb{R}_+$ such that $f(x) \leq \kappa \cdot b(x)$ for all $x \in X$, i.e., the maximum rate of return from spending the resource is bounded above by $\kappa$.

At time $t \in [T]$, the following sequence of events takes place: (i) a request $\gamma_t = (f_t, b_t)$ arrives; (ii) the decision maker observes $\gamma_t$ and chooses an action $x_t \in X$ based on the information observed so far; (iii) the request consumes $b_t(x_t)$ amount of the resource and generates a reward of $f_t(x_t)$. The decision maker aims to maximize her rewards subject to her budget constraint. A policy $\{x_t(\cdot)\}_t$ for the decision maker maps requests to actions $x_t : S \to X$ based on the available information at each time step, i.e., the action $x_t(\gamma_t)$ at time $t \in [T]$ can depend on the historical requests $\{\gamma_s\}_{s=1}^{t-1}$ and the current request $\gamma_t$, but not the future requests $\{\gamma_s\}_{s=t+1}^{T}$. Moreover, a policy is said to be budget-feasible if it respects the budget constraint by ensuring $\sum_{t=1}^{T} b_t(x_t(\gamma_t)) \leq B$ for every sequence $\{\gamma_t\}_t$.

The request $\gamma_t$ at time $t$ is drawn from a distribution $P_t \in \Delta(S)$ unknown to the decision maker, independently of the requests at other time steps. We only require the requests $\{\gamma_t\}_t$ to be independent and allow the distributions $P_t$ to vary arbitrarily across time. We will measure the performance of a policy against the fluid-optimal benchmark, which is defined as:
$$ \mathrm{FLUID}(\{P_t\}_t) := \max_{x_t : S \to X \ \forall t \in [T]} \ \sum_{t=1}^{T} \mathbb{E}[f_t(x_t(\gamma_t))] \quad \text{s.t.} \quad \sum_{t=1}^{T} \mathbb{E}[b_t(x_t(\gamma_t))] \leq B . $$
Another benchmark common in the literature on online resource allocation is the expected hindsight-optimal solution, which is defined as $\mathbb{E}[\mathrm{OPT}(\{\gamma_t\}_t)]$ for
$$ \mathrm{OPT}(\{\gamma_t\}_t) := \max_{x \in X^T} \ \sum_{t=1}^{T} f_t(x_t) \quad \text{s.t.} \quad \sum_{t=1}^{T} b_t(x_t) \leq B . $$
It is well known that $\mathrm{FLUID}(\{P_t\}_t) \geq \mathbb{E}[\mathrm{OPT}(\{\gamma_t\}_t)]$, which makes our benchmark the stronger one (we provide a proof in Appendix C for completeness). Hence, our performance guarantees relative to the fluid-optimal benchmark also imply the same guarantees for the expected hindsight-optimal benchmark. More concretely, we use $R(A \mid \{\gamma_t\}_t)$ to denote the total reward of a policy $A$ on the request sequence $\{\gamma_t\}_t$, and the performance of an algorithm is measured using its expected regret against the fluid-optimal reward:
$$ \mathrm{Regret}(A) := \mathrm{FLUID}(\{P_t\}_t) - \mathbb{E}\left[ R(A \mid \{\gamma_t\}_t) \right] . $$
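To make the interaction protocol concrete, the following is a minimal Python sketch of the setting just described: linear requests, a compact action set $X = [0, \bar{x}]$ containing $0$, and a budget-feasible policy loop. All names (`Request`, `Policy`, `run_episode`) and the specific choice $X = [0, \bar{x}]$ are illustrative assumptions, not notation from the paper.

```python
# Illustrative sketch of the online allocation protocol with linear requests.
from dataclasses import dataclass
from typing import Callable, List

@dataclass(frozen=True)
class Request:
    coef_f: float   # linear reward:      f(x) = coef_f * x
    coef_b: float   # linear consumption: b(x) = coef_b * x

# A policy maps (current request, history of past requests) to an action in X.
Policy = Callable[[Request, List[Request]], float]

def run_episode(requests: List[Request], policy: Policy, budget: float, x_max: float = 1.0) -> float:
    remaining, reward, history = budget, 0.0, []
    for gamma in requests:
        x = min(max(policy(gamma, history), 0.0), x_max)   # keep the action inside X = [0, x_max]
        if gamma.coef_b * x > remaining:                   # 0 is always feasible since 0 is in X
            x = 0.0
        reward += gamma.coef_f * x
        remaining -= gamma.coef_b * x
        history.append(gamma)
    return reward
```

For instance, the policy that always returns $0$ is trivially budget-feasible; the algorithms in later sections instantiate the policy using dual variables.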
Now, if the distributions $\{P_t\}_t$ are unknown and arbitrary, and no other information about $\{P_t\}_t$ is available, then the requests $\{\gamma_t\}_t$ can be adversarial. This case has been addressed in Balseiro et al. (2022b), where the authors showed that no policy can achieve sub-linear regret. In this work, we address the setting in which the decision maker has additional information in the form of historical samples. In particular, we focus on the setting where the decision maker has access to one independent sample $\tilde{\gamma}_t \sim \tilde{P}_t$ for each $t \in [T]$. We will assume that the samples $\{\tilde{\gamma}_t\}_t$ are independent of the request sequence $\{\gamma_t\}_t$, and that $\{\tilde{P}_t\}_t$ are not known to the decision maker. We will show that when the sampling distributions $\{\tilde{P}_t\}_t$ are not too far from the actual distributions $\{P_t\}_t$, which is a minimal relaxation of the adversarial setting, it is possible to achieve sub-linear regret. We refer to the collection of samples $\{\tilde{\gamma}_t\}_t$ as a trace and allow the actions of the decision maker to depend on it. Throughout this paper, we will use $\{\tilde{\gamma}_t\}_t$ to denote the trace and $\{\gamma_t\}_t$ to denote the (random) sequence of requests on which the decision maker wishes to maximize reward.

3. Warmup: Learning the Dual and Earning with It

First, let us focus on the simpler case when $\tilde{P}_t = P_t$ for all $t \in [T]$, i.e., the sampling distributions are the same as the request distributions. At first glance, it may appear that only having access to one sample from each of the request distributions $P_t$ yields too little information to achieve near-optimal rewards. If one were to attempt to directly learn the optimal solution of $\mathrm{FLUID}(\{P_t\}_t)$, this initial impression would be accurate because of the high-dimensional nature of the space of all possible solutions $\{x_t(\cdot)\}_t$. Fortunately, we do not need to learn this high-dimensional information and can instead leverage the structure of the problem: the dual space is just one-dimensional and thereby amenable to learning. More precisely, the dual function $D(\mu \mid \{P_t\}_t)$ of $\mathrm{FLUID}(\{P_t\}_t)$ at dual variable $\mu \geq 0$ is given by
$$ D(\mu \mid \{P_t\}_t) = \max_{\{x_t(\cdot)\}_t} \left\{ \sum_{t=1}^{T} \mathbb{E}[f_t(x_t(\gamma_t))] + \mu \left( B - \sum_{t=1}^{T} \mathbb{E}[b_t(x_t(\gamma_t))] \right) \right\} = \mu B + \sum_{t=1}^{T} \max_{x_t : S \to X} \mathbb{E}\left[ f_t(x_t(\gamma_t)) - \mu \, b_t(x_t(\gamma_t)) \right] = \mu B + \sum_{t=1}^{T} \mathbb{E}\left[ \max_{x \in X} \{ f_t(x) - \mu \, b_t(x) \} \right] . $$
Throughout, we assume $\mathrm{argmax}_{x \in X} \{ f(x) - \mu \, b(x) \}$ is non-empty for all requests $\gamma \in S$ and dual solutions $\mu \geq 0$. If we treat the dual variable $\mu$ as the per-unit price of the resource, $\max_{x_t \in X} \{ f_t(x_t) - \mu \, b_t(x_t) \}$ captures the profit-maximization problem. The following terminology will be helpful in working with the dual.

Definition 3.1. For a request $\gamma = (f, b)$ and dual variable $\mu \geq 0$, let $x^*(\gamma, \mu)$ be the optimal solution of $\max_{x \in X} \{ f(x) - \mu \, b(x) \}$ with the largest value of $f(x)$. If there are multiple such solutions, pick one which minimizes $b(x)$. Moreover, let $f^*(\mu) := f(x^*(\gamma, \mu))$ and $b^*(\mu) := b(x^*(\gamma, \mu))$ be the corresponding reward and resource consumption respectively; for the request $\gamma_t$ we write $f_t^*(\mu)$ and $b_t^*(\mu)$.

With this notation, $D(\mu \mid \{P_t\}_t) = \mu B + \sum_{t=1}^{T} \mathbb{E}[ f_t^*(\mu) - \mu \, b_t^*(\mu) ]$. Throughout this paper, we will repeatedly leverage weak duality, which is a central property of duals. We state the property here and refer the reader to any standard text on convex optimization (e.g., Bertsekas 2009) for a proof.

Proposition 3.2 (Weak Duality). For all request distributions $\{P_t\}_t$ and dual variables $\mu \geq 0$, we have $D(\mu \mid \{P_t\}_t) \geq \mathrm{FLUID}(\{P_t\}_t)$, i.e.,
$$ \sum_{t=1}^{T} \mathbb{E}[f_t^*(\mu)] \geq \mathrm{FLUID}(\{P_t\}_t) - \mu \left( B - \sum_{t=1}^{T} \mathbb{E}[b_t^*(\mu)] \right) . $$

Observe that $\sum_{t=1}^{T} \mathbb{E}[f_t^*(\mu)]$ is exactly the expected reward the decision maker would receive if she had an infinite budget and took actions which maximized profit with $\mu$ being the per-unit price of the resource. Moreover, $B - \sum_{t=1}^{T} \mathbb{E}[b_t^*(\mu)]$ is the amount by which the decision maker would underspend her budget in expectation if she were to take actions using $\mu$ as the price.
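For linear requests over $X = [0, \bar{x}]$, the profit maximizer of Definition 3.1 has a simple closed form: take the full action when the margin $\mathrm{coeff}(f) - \mu \, \mathrm{coeff}(b)$ is positive and the null action when it is negative. Below is a small sketch of $x^*(\gamma, \mu)$, $f^*(\mu)$ and $b^*(\mu)$ under that assumption, reusing the illustrative `Request` type from the previous sketch; the function names are illustrative as well.

```python
# Illustrative profit maximizer x*(gamma, mu) of Definition 3.1 for linear requests on [0, x_max].
def x_star(gamma: "Request", mu: float, x_max: float = 1.0) -> float:
    """Maximize f(x) - mu * b(x) over X = [0, x_max]; ties broken as in Definition 3.1."""
    margin = gamma.coef_f - mu * gamma.coef_b
    if margin > 0:
        return x_max          # strictly profitable: consume as much as possible
    if margin < 0:
        return 0.0            # unprofitable: take the null action (0 is in X)
    # margin == 0: every x is optimal; pick the largest f(x), then the smallest b(x)
    return x_max if gamma.coef_f > 0 else 0.0

def f_star(gamma: "Request", mu: float, x_max: float = 1.0) -> float:
    return gamma.coef_f * x_star(gamma, mu, x_max)

def b_star(gamma: "Request", mu: float, x_max: float = 1.0) -> float:
    return gamma.coef_b * x_star(gamma, mu, x_max)
```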
Suppose we can find a dual variable $\mu \geq 0$ that satisfies approximate complementary slackness, i.e., $\mu$ satisfies at least one of the following statements: (1) $\mu = 0$ and maximizing profit with $\mu$ as the per-unit price results in total expenditure less than the budget $B$ with high probability; (2) $\mu > 0$ and maximizing profit with $\mu$ as the per-unit price results in total expenditure close to the budget $B$ with high probability. Then, if the decision maker were to use $\mu$ as the price and make decisions to maximize profit, she would not run out of budget too early, and the complementary-slackness term $\mu ( B - \sum_{t=1}^{T} \mathbb{E}[b_t^*(\mu)] )$ would also be small. Therefore, such a $\mu$ would yield rewards that are close to $\mathrm{FLUID}(\{P_t\}_t)$, i.e., yield small regret, as required. We next describe how such a $\mu$ can be learned from the sample trace $\{\tilde{\gamma}_t\}_t$ when $\tilde{P}_t = P_t$ for all $t \in [T]$.

We will assume that the distributions satisfy the following mild and standard assumption (Devanur & Hayes, 2009; Agrawal et al., 2014) to exclude the degenerate case.

Assumption 3.3 (General Position). The request sequence $\{\gamma_t\}_t \sim \prod_t P_t$ is in general position almost surely: for any $\mu \geq 0$, there is at most one request with multiple profit maximizers, i.e.,
$$ \left| \left\{ t \in [T] : \left| \mathrm{argmax}_{x \in X} \{ f_t(x) - \mu \, b_t(x) \} \right| > 1 \right\} \right| \leq 1 . $$
Moreover, the sample trace $\{\tilde{\gamma}_t\}_t \sim \prod_t \tilde{P}_t$ is also in general position almost surely.

Assumption 3.3 is made without any loss of generality because, as pointed out in Devanur & Hayes (2009) and Agrawal et al. (2014), adding an infinitesimally-small perturbation to the reward functions always results in perturbed distributions that satisfy Assumption 3.3 with only an infinitesimal change in the value of $\mathrm{FLUID}(\{P_t\}_t)$ (see Appendix D for a formal description). Assumption 3.3 ensures that there exists a dual solution $\mu \geq 0$ which spends close to the budget $B$ on the trace $\{\tilde{\gamma}_t\}_t$ if it is possible to do so. In fact, as the following lemma shows, the optimal empirical dual solution satisfies this property.

Lemma 3.4. Suppose the trace $\{\tilde{\gamma}_t\}_t \sim \prod_t \tilde{P}_t$ is in general position, and consider
$$ \tilde{\mu} \in \mathrm{argmin}_{\mu \geq 0} \left\{ \mu B + \sum_{t=1}^{T} \max_{x \in X} \left\{ \tilde{f}_t(x) - \mu \, \tilde{b}_t(x) \right\} \right\} . $$
Then, at least one of the following statements holds:
1. $\tilde{\mu} = 0$ and $\sum_{t=1}^{T} \tilde{b}_t^*(\tilde{\mu}) \leq B + \bar{b}$.
2. $\left| B - \sum_{t=1}^{T} \tilde{b}_t^*(\tilde{\mu}) \right| \leq \bar{b}$.

Algorithm 1: Learning the Dual and Earning with It
Input: Trace $\{\tilde{\gamma}_t\}_t \sim \prod_t \tilde{P}_t$, initial budget $B_1 = B$.
Compute an optimal empirical dual solution:
$$ \tilde{\mu} \in \mathrm{argmin}_{\mu \geq 0} \left\{ \mu B + \sum_{t=1}^{T} \max_{x \in X} \left\{ \tilde{f}_t(x) - \mu \, \tilde{b}_t(x) \right\} \right\} . $$
for $t = 1, \dots, T$ do
  Receive request $\gamma_t = (f_t, b_t) \sim P_t$.
  Make the primal decision $x_t$ and update the remaining resources $B_t$:
  $$ \tilde{x}_t \in \mathrm{argmax}_{x \in X} \{ f_t(x) - \tilde{\mu} \, b_t(x) \} , \qquad x_t = \begin{cases} \tilde{x}_t & \text{if } b_t(\tilde{x}_t) \leq B_t \\ 0 & \text{otherwise} \end{cases} , \qquad B_{t+1} = B_t - b_t(x_t) . $$

Recall that weak duality (Proposition 3.2) suggests that finding a dual solution which satisfies approximate complementary slackness with high probability would yield reward close to $\mathrm{FLUID}(\{P_t\}_t)$. Lemma 3.4 states that we can compute a dual variable $\tilde{\mu}$ which satisfies approximate complementary slackness on the trace. To finish the argument, we require a uniform convergence bound which shows that the expenditure on the trace (or on the sequence of requests) is concentrated close to the expected expenditure, uniformly over all dual variables $\mu \geq 0$.
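Before analyzing it, here is a minimal sketch of Algorithm 1 under the same linear-request assumptions, building on the `Request` and `x_star` helpers above. Because the empirical dual objective $q(\mu) = \mu B + \sum_t \max_{x \in X}\{\tilde{f}_t(x) - \mu \tilde{b}_t(x)\}$ is convex and piecewise linear in $\mu$, its smallest minimizer can be found by scanning the breakpoints $\mathrm{coeff}(\tilde{f}_t)/\mathrm{coeff}(\tilde{b}_t)$; the function names are illustrative.

```python
# Illustrative sketch of Algorithm 1 for linear requests on X = [0, x_max].
def empirical_dual(trace, budget, x_max=1.0):
    """Smallest minimizer of q(mu) = mu*B + sum_t max_x {f_t(x) - mu*b_t(x)} over mu >= 0."""
    def q(mu):
        return mu * budget + sum(max(0.0, g.coef_f - mu * g.coef_b) * x_max for g in trace)
    # q is convex and piecewise linear, so its smallest minimizer is 0 or a breakpoint.
    candidates = {0.0} | {g.coef_f / g.coef_b for g in trace if g.coef_b > 0}
    return min(sorted(candidates), key=q)   # ties resolved toward the smallest mu

def run_algorithm_1(trace, requests, budget, x_max=1.0):
    mu_tilde = empirical_dual(trace, budget, x_max)
    remaining, reward = budget, 0.0
    for gamma in requests:
        x = x_star(gamma, mu_tilde, x_max)       # profit-maximizing action at price mu_tilde
        if gamma.coef_b * x > remaining:         # take the null action if the budget is exhausted
            x = 0.0
        reward += gamma.coef_f * x
        remaining -= gamma.coef_b * x
    return reward
```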
Theorem 3.5. For $r(T) := 8 \bar{b} \sqrt{T \log T}$ and request distributions $\{\hat{P}_t\}_t$, the following uniform convergence bound holds with probability at least $1 - 1/T^2$ over $\{\hat{\gamma}_t\}_t \sim \prod_t \hat{P}_t$:
$$ \sup_{\mu \geq 0} \left| \sum_{t=1}^{T} \hat{b}_t^*(\mu) - \sum_{t=1}^{T} \mathbb{E}_{\hat{\gamma}_t \sim \hat{P}_t}\left[ \hat{b}_t^*(\mu) \right] \right| \leq r(T) . $$

With Theorem 3.5 in hand, we are now ready to state and prove the regret guarantee for Algorithm 1. It first learns an empirical optimal dual variable $\tilde{\mu}$ from the trace $\{\tilde{\gamma}_t\}_t$, and then uses it as the per-unit price of the resource to take profit-maximizing actions on the request sequence $\{\gamma_t\}_t$.

Theorem 3.6. If $\tilde{P}_t = P_t$ for all $t \in [T]$, then Algorithm 1 (denoted by $A$) satisfies $\mathrm{Regret}(A) \leq 12 \kappa \bar{b} + 2 \kappa \, r(T)$.

Arlotto & Gurvich (2019) showed that every algorithm must incur a regret of $\Omega(\sqrt{T})$, even when the request distributions are identical (i.e., $P_t = P$ for all $t \in [T]$) and known to the decision maker ahead of time. Thus, Theorem 3.6 shows that the regret of Algorithm 1 achieves a near-optimal dependence on $T$ with just a single sample per distribution, despite the request distributions being unknown and time-varying. However, as the following example demonstrates, this regret bound critically relies on the assumption that $\tilde{P}_t = P_t$ for all $t \in [T]$, and is fragile to even slight deviations from it. This fact was also demonstrated in Jiang et al. (2020) in a related context, which inspired the following example.

Example. Fix a small $\epsilon > 0$, an even horizon $T$ and budget $B = T/2$. Assume actions are accept/reject decisions, i.e., $X = \{0, 1\}$, and the reward/resource consumption functions are linear with $\mathrm{coeff}(b) = 1$ for all $\gamma = (f, b) \in S$. In this setting, a request is completely determined by the coefficient $\mathrm{coeff}(f)$ of its reward function. We will overload notation and use $\gamma$ to denote this coefficient. Set $\tilde{P}_t = \mathrm{Unif}([1 + \epsilon, 1 + 2\epsilon])$ for all $t \leq T/2 + 1$ and $\tilde{P}_t = \mathrm{Unif}([1 - \epsilon, 1])$ for all $t \geq T/2 + 2$. Moreover, set $P_t = \mathrm{Unif}([1 - \epsilon, 1])$ for all $t \in [T]$. Then, it is easy to see that $W(P_t, \tilde{P}_t) \leq 3\epsilon$ for all $t \in [T]$. Also, observe that any trace $\{\tilde{\gamma}_t\}_t \sim \prod_t \tilde{P}_t$ would satisfy $\tilde{\gamma}_t \geq 1 + \epsilon$ for all $t \leq T/2 + 1$ and $\tilde{\gamma}_t \leq 1$ for all $t \geq T/2 + 2$. Hence, we always have $\tilde{\mu} \geq 1 + \epsilon$. On the other hand, we also always have $\gamma_t \leq 1$ for all $t \in [T]$. Therefore, Algorithm 1 sets $\tilde{x}_t = 0$ for all $t \in [T]$, yielding a reward of $0$, whereas $\mathrm{FLUID}(\{P_t\}_t) \geq (1 - \epsilon) \cdot (T/2)$, thereby making the regret linear in $T$.

Since $\epsilon > 0$ was arbitrary in the above example, it shows that even infinitesimally-small differences between the sampling and request distributions can lead to linear regret for Algorithm 1. This is antithetical to our goal of developing robust online algorithms for pacing. Formally, we would like to develop online algorithms that achieve regret which is small and degrades smoothly as $\sum_{t=1}^{T} W(P_t, \tilde{P}_t)$ grows large. Nonetheless, although Algorithm 1 falls short of this goal, it highlights the power of dual-based algorithms. Building on the intuition developed in this section, we next describe and analyze a Dual FTRL algorithm that achieves near-optimal regret while being robust to discrepancies between the sampling distributions $\{\tilde{P}_t\}_t$ and the request distributions $\{P_t\}_t$.
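The fragility in the example above is easy to reproduce numerically. The standalone sketch below uses accept/reject actions ($X = \{0, 1\}$, unit consumption) and the distributions from the example; the learned dual prices out every actual request, so Algorithm 1 earns nothing while the hindsight solution earns roughly $T/2$. All names and the specific constants are illustrative.

```python
# Standalone numeric illustration of the fragility example (assumed setup: X = {0,1},
# unit consumption, budget B = T/2).
import random

def smallest_empirical_dual(trace_values, budget):
    # q(mu) = mu*B + sum_t max(0, v_t - mu) is convex and piecewise linear,
    # so its smallest minimizer lies in {0} union the trace values.
    def q(mu):
        return mu * budget + sum(max(0.0, v - mu) for v in trace_values)
    return min(sorted({0.0, *trace_values}), key=q)

random.seed(0)
T, eps = 10_000, 0.01
budget = T // 2
# Sampling distributions: the first T/2 + 1 samples are inflated by eps, the rest are not.
trace = [random.uniform(1 + eps, 1 + 2 * eps) for _ in range(T // 2 + 1)]
trace += [random.uniform(1 - eps, 1.0) for _ in range(T - len(trace))]
# Request distributions: every actual value lies in [1 - eps, 1].
requests = [random.uniform(1 - eps, 1.0) for _ in range(T)]

mu_tilde = smallest_empirical_dual(trace, budget)    # ends up >= 1 + eps
reward, taken = 0.0, 0
for v in requests:
    if v - mu_tilde >= 0 and taken < budget:         # Algorithm 1 accepts only if profitable
        reward += v
        taken += 1
print(f"mu_tilde = {mu_tilde:.4f}, Algorithm 1 reward = {reward:.1f}")
print(f"take-the-best-B-requests reward = {sum(sorted(requests, reverse=True)[:budget]):.1f}")
```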
4. Dual FTRL with Target Rate Estimation

In this section, we will develop an algorithm based on Dual Follow-The-Regularized-Leader (FTRL) that achieves near-optimal regret with a single trace, and is robust to discrepancies between the sampling distributions and the request distributions. Now, if one had complete knowledge of the sampling distributions $\{\tilde{P}_t\}_t$, then one could solve $\mathrm{FLUID}(\{\tilde{P}_t\}_t)$ to find an optimal solution and run Dual Gradient Descent with the goal of spending the same as the optimal solution at each time step. It is known from Jiang et al. (2020) that this approach achieves $O(\max\{\sqrt{T}, \sum_{t=1}^{T} W(\tilde{P}_t, P_t)\})$ regret, thereby making it rate-optimal and robust to discrepancies. However, with just a single sample from each of the distributions $\tilde{P}_t$, we are far from having complete knowledge of $\{\tilde{P}_t\}_t$. Despite this apparent lack of data, a careful analysis of Dual FTRL will allow us to show that it achieves a near-optimal regret rate in a robust manner.

Algorithm 2: Dual Follow-The-Regularized-Leader
Input: Initial resource endowment $B_1 = B$, target consumption sequence $\{\lambda_t\}_{t=1}^{T}$, regularizer $h : \mathbb{R} \to \mathbb{R}$ and step-size $\eta$.
Set the initial dual solution $\mu_1 = \mathrm{argmin}_{\mu \in [0, \kappa]} h(\mu)$.
for $t = 1, \dots, T$ do
  Receive request $\gamma_t = (f_t, b_t) \sim P_t$.
  Make the primal decision $x_t$ and update the remaining resources $B_t$:
  $$ \tilde{x}_t \in \mathrm{argmax}_{x \in X} \{ f_t(x) - \mu_t \, b_t(x) \} , \quad (2) \qquad x_t = \begin{cases} \tilde{x}_t & \text{if } b_t(\tilde{x}_t) \leq B_t \\ 0 & \text{otherwise} \end{cases} , \qquad B_{t+1} = B_t - b_t(x_t) . $$
  Obtain a sample sub-gradient of the dual function $D(\mu \mid P_t, \lambda_t)$: $g_t = \lambda_t - b_t(\tilde{x}_t)$.
  Update the dual iterate with FTRL:
  $$ \mu_{t+1} = \mathrm{argmin}_{\mu \in [0, \kappa]} \left\{ \eta \sum_{r=1}^{t} g_r \cdot \mu + h(\mu) \right\} . \quad (3) $$

4.1. Dual Follow-The-Regularized-Leader

The non-stationarity of the request distributions necessitates a version of Dual FTRL that can incorporate target resource consumptions (Algorithm 2). It takes as input a target sequence $\{\lambda_t\}_{t=1}^{T}$ which specifies $\lambda_t \geq 0$ to be the amount of resource Dual FTRL should attempt to consume at time $t$. Moreover, like standard FTRL (Shalev-Shwartz et al., 2012; Hazan et al., 2016), it also takes as input a regularizer $h(\cdot)$, an initial dual variable $\mu_1$ and a step-size $\eta$. We will make the standard assumption that the regularizer $h(\cdot)$ is differentiable and is $\sigma$-strongly convex in the $\ell_1$ norm.

Before stating the performance bound of Algorithm 2, we introduce some preliminaries. Given a budget of $\beta_t$ for period $t \in [T]$, the optimal expected reward which can be collected in period $t$ is captured by the following fluid optimization problem:
$$ \mathrm{FLUID}(P_t, \beta_t) := \max_{x_t : S \to X} \ \mathbb{E}[f_t(x_t(\gamma_t))] \quad \text{s.t.} \quad \mathbb{E}[b_t(x_t(\gamma_t))] \leq \beta_t . $$
The dual function of $\mathrm{FLUID}(P_t, \beta_t)$ is given by
$$ D(\mu \mid P_t, \beta_t) := \mu \, \beta_t + \mathbb{E}\left[ \max_{x \in X} \{ f_t(x) - \mu \, b_t(x) \} \right] $$
for any $\mu \geq 0$. Then, by weak duality, we have $\mathrm{FLUID}(P_t, \beta_t) \leq D(\mu \mid P_t, \beta_t)$ for all $\mu \geq 0$. Moreover, since dual functions are always convex (they are suprema of linear functions), the dual function $D(\cdot \mid P_t, \beta_t)$ is convex.

Theorem 4.1 states a general regret bound for Algorithm 2 with an arbitrary target sequence $\{\lambda_t\}_t$ and against a general benchmark $\sum_{t=1}^{T} D(\mu_t \mid P_t, \beta_t)$. Since $\mathrm{FLUID}(P_t, \beta_t) \leq D(\mu \mid P_t, \beta_t)$ by weak duality, Theorem 4.1 also characterizes the performance against the weaker benchmark $\sum_{t=1}^{T} \mathrm{FLUID}(P_t, \beta_t)$, which is simply the optimal expected reward the decision maker would collect if she spent $\beta_t$ at time $t$.

Theorem 4.1. Consider Algorithm 2 with target consumption sequence $\{\lambda_t\}_t$, regularizer $h(\cdot)$ and step-size $\eta$. Then, for any benchmark sequence $\{\beta_t\}_t$, we have
$$ \mathbb{E}\left[ \sum_{t=1}^{T} D(\mu_t \mid P_t, \beta_t) \right] - \mathbb{E}\left[ R(A \mid \{\gamma_t\}_t) \right] \leq R_1 + R_2 + R_3 , $$
where, with $\bar{\lambda} = \max_t \lambda_t$ and $d_R = \max\{ h(0) - h(\mu_1), \, h(\kappa) - h(\mu_1) \}$,
$$ R_1 = \kappa \bar{b} + \frac{2 (\bar{b} + \bar{\lambda})^2}{\sigma} \eta T + \frac{d_R}{\eta} , \qquad R_2 = \kappa \left( \sum_{t=1}^{T} \lambda_t - B \right)^+ , \qquad R_3 = \mathbb{E}\left[ \sum_{t=1}^{T} \mu_t (\beta_t - \lambda_t) \right] . $$

Theorem 4.1 decomposes the regret of Algorithm 2 into three terms, where (i) $R_1$ is simply the regret associated with the FTRL algorithm in the OCO setting (Hazan et al., 2016); (ii) $R_2$ captures the overspending error, which is large whenever the total target consumption $\sum_{t=1}^{T} \lambda_t$ is in excess of the budget $B$; (iii) $R_3$ captures the underestimation error, which is a weighted sum of the amounts by which the target sequence $\{\lambda_t\}_t$ underestimates the benchmark sequence $\{\beta_t\}_t$, with weights equal to the dual iterates $\mu_t$.
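To make the update concrete, here is a minimal sketch of the Dual FTRL loop of Algorithm 2 for linear requests over $X = [0, 1]$, with the quadratic regularizer $h(\mu) = \mu^2/2$ (so $\sigma = 1$ and $\mu_1 = 0$); with this choice, the FTRL step (3) reduces to clipping $-\eta \sum_{r \leq t} g_r$ to $[0, \kappa]$. The function and variable names are illustrative.

```python
# Illustrative sketch of Algorithm 2 (Dual FTRL) for linear requests on X = [0, 1],
# with regularizer h(mu) = mu^2 / 2 (sigma = 1, mu_1 = 0).
def dual_ftrl(requests, targets, budget, kappa, eta):
    """requests: list of (coef_f, coef_b) pairs; targets: target consumptions lambda_t."""
    mu = 0.0            # mu_1 = argmin of h over [0, kappa]
    grad_sum = 0.0      # running sum of the sample subgradients g_t = lambda_t - b_t(x_t~)
    remaining, reward = budget, 0.0
    for (coef_f, coef_b), lam in zip(requests, targets):
        x_opt = 1.0 if coef_f - mu * coef_b >= 0 else 0.0   # profit maximizer x_t~ at price mu
        x = x_opt if coef_b * x_opt <= remaining else 0.0   # (2): act only if the budget permits
        reward += coef_f * x
        remaining -= coef_b * x
        grad_sum += lam - coef_b * x_opt                    # g_t uses x_t~, not the truncated action
        mu = min(max(-eta * grad_sum, 0.0), kappa)          # (3) with h(mu) = mu^2/2: a clipped step
    return reward
```

With $h(\mu) = \mu^2/2$ we have $d_R = \kappa^2/2$, so the step-size of Theorem 4.8 would be $\eta = \kappa/\sqrt{2T}$ under these illustrative assumptions.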
Observe that there is an inherent tension between the overspending error $R_2$ and the underestimation error $R_3$: $R_2$ can be made smaller by making the target consumptions $\{\lambda_t\}_t$ smaller, but this in turn makes $R_3$ bigger, and vice versa. To obtain the desired performance guarantees for Algorithm 2 (see Theorem 4.8), we need to carefully choose the benchmark sequence $\{\beta_t\}_t$ and the target sequence $\{\lambda_t\}_t$, which is what we do next (see (4)). We go on to show that: (i) our choice of target sequence does not overspend too much, and in particular it satisfies $R_2 \leq \kappa \bar{b}$ (see (5)); (ii) our choice of benchmark sequence $\{\beta_t\}_t$ ensures that the benchmark $\sum_{t=1}^{T} D(\mu_t \mid P_t, \beta_t)$ in Theorem 4.1 falls below our desired benchmark $\mathrm{FLUID}(\{P_t\}_t)$ by at most $O(\max\{\sqrt{T}, \sum_{t=1}^{T} W(P_t, \tilde{P}_t)\})$ (see Lemma 4.2 and the discussion that follows); and (iii) our choice of the sequences, in combination with an intricate argument consisting of a coupling lemma and a leave-one-out thought experiment, allows us to prove $R_3 = O(\sqrt{T})$ (see Subsection 4.3). Finally, in Subsection 4.4, we combine everything to prove the desired regret bound for Algorithm 2.

4.2. Choosing the Target and Benchmark Sequences

We define the target and benchmark sequences using the empirical optimal dual solution computed from the trace $\{\tilde{\gamma}_t\}_t$:
$$ \tilde{\mu} \in \mathrm{argmin}_{\mu \geq 0} \left\{ \mu B + \sum_{t=1}^{T} \max_{x \in X} \left\{ \tilde{f}_t(x) - \mu \, \tilde{b}_t(x) \right\} \right\} . $$
If there are multiple minimizers, set $\tilde{\mu}$ to be the smallest one. Given the empirical optimal dual solution $\tilde{\mu}$, the target and benchmark sequences are defined as
$$ \beta_t = \mathbb{E}_{\hat{\gamma}_t \sim \tilde{P}_t}\left[ \hat{b}_t^*(\tilde{\mu}) \right] \quad \text{and} \quad \lambda_t = \tilde{b}_t^*(\tilde{\mu}) , \quad (4) $$
where $\hat{\gamma}_t = (\hat{f}_t, \hat{b}_t)$. In other words, the benchmark sequence is the expected consumption and the target sequence is the empirical consumption on the trace if we were to make profit-maximizing decisions using the empirical optimal dual solution $\tilde{\mu}$ as the price of the resource. Instead of learning the empirical optimal dual $\tilde{\mu}$ and directly making decisions with it like we did in Algorithm 1, we use $\tilde{\mu}$ to learn the empirical consumptions $\{\lambda_t\}_t$ and use Algorithm 2 to track this target. Unlike the former, we will show that the latter approach is robust to discrepancies between the sampling and request distributions, while maintaining the same $\tilde{O}(\sqrt{T})$-regret guarantee. Importantly, note that the benchmark sequence $\{\beta_t\}_t$ cannot be computed in practice because it requires full knowledge of the sampling distributions. Algorithm 2 respects this limitation and does not require knowledge of the benchmark sequence $\{\beta_t\}_t$; we only use it for our analysis.

Our choice of $\{\lambda_t\}_t$ and Lemma 3.4 immediately imply
$$ \sum_{t=1}^{T} \lambda_t \leq B + \bar{b} , \quad (5) $$
and hence $R_2 = \kappa \left( \sum_{t=1}^{T} \lambda_t - B \right)^+ \leq \kappa \bar{b}$. Next, we show that, for our choice of the benchmark sequence $\{\beta_t\}_t$, the benchmark $\sum_{t=1}^{T} D(\mu_t \mid P_t, \beta_t)$ of Theorem 4.1 is not too far from the desired benchmark $\mathrm{FLUID}(\{P_t\}_t)$.

Lemma 4.2. For any dual variable $\tilde{\mu} \geq 0$, dual iterates $\{\mu_t\}_t \in [0, \kappa]^T$ and benchmark sequence $\{\beta_t\}_t$ with $\beta_t = \mathbb{E}_{\hat{\gamma} \sim \tilde{P}_t}[\hat{b}^*(\tilde{\mu})]$, we have
$$ \sum_{t=1}^{T} D(\mu_t \mid P_t, \beta_t) \geq \mathrm{FLUID}(\{P_t\}_t) - \tilde{\mu} \left( B - \sum_{t=1}^{T} \beta_t \right) - 2(1 + \kappa) \sum_{t=1}^{T} W(P_t, \tilde{P}_t) . $$

Observe that Theorem 3.5 implies that, with probability at least $1 - 1/T^2$, we have $\sum_{t=1}^{T} \beta_t \geq \sum_{t=1}^{T} \lambda_t - r(T)$. Combining this with Lemma 3.4 yields
$$ \tilde{\mu} \left( B - \sum_{t=1}^{T} \beta_t \right) \leq \tilde{\mu} \, r(T) + \tilde{\mu} \, \bar{b} \leq \kappa \, r(T) + \kappa \bar{b} , \quad (6) $$
thereby showing that the benchmark $\sum_{t=1}^{T} D(\mu_t \mid P_t, \beta_t)$ of Theorem 4.1 is not too far from the desired benchmark $\mathrm{FLUID}(\{P_t\}_t)$. In order to establish the desired regret and robustness guarantees for Algorithm 2, all that remains is to show that $R_3 = O(\sqrt{T})$. However, as we demonstrate in the next subsection, this step is rife with challenges.
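Before turning to the analysis of $R_3$, here is how the construction in (4) looks in code: fit the empirical dual $\tilde{\mu}$ on the trace and record the trace's consumption at that price as the target sequence. The helper names (`empirical_dual`, `b_star`) refer to the illustrative sketches given earlier and are not from the paper.

```python
# Illustrative sketch of the target construction in (4): lambda_t is the consumption of
# the trace sample at the empirical dual price mu_tilde.
def targets_from_trace(trace, budget, x_max=1.0):
    mu_tilde = empirical_dual(trace, budget, x_max)               # smallest empirical minimizer
    return [b_star(gamma, mu_tilde, x_max) for gamma in trace]    # lambda_t = b~_t*(mu_tilde)
```

These targets, together with the Dual FTRL sketch above, give a single-trace pipeline; the benchmark sequence $\{\beta_t\}_t$ is never computed, exactly as noted in the text.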
4.3. Bounding R3

We begin with a brief discussion of the challenges involved in bounding $R_3$. It is illuminating to consider the slightly more permissive setting in which the decision maker has access to two sample traces: suppose that, in addition to the trace $\{\tilde{\gamma}_t\}_t \sim \prod_t \tilde{P}_t$, we had access to an additional trace $\{\bar{\gamma}_t\}_t \sim \prod_t \tilde{P}_t$. Then, we could compute $\tilde{\mu}$ using $\{\bar{\gamma}_t\}_t$ as follows:
$$ \tilde{\mu} \in \mathrm{argmin}_{\mu \geq 0} \left\{ \mu B + \sum_{t=1}^{T} \max_{x \in X} \left\{ \bar{f}_t(x) - \mu \, \bar{b}_t(x) \right\} \right\} , $$
making it completely independent of $\{\tilde{\gamma}_t\}_t$. With this modified $\tilde{\mu}$, we continue to define $\{\beta_t\}_t$ and $\{\lambda_t\}_t$ as before (see (4)). As a consequence, we get that $\mu_s$ is completely determined by $\{\gamma_t\}_{t=1}^{s-1}$ and $\{\lambda_t\}_{t=1}^{s-1}$, with the latter being completely determined by $\tilde{\mu}$ and $\{\tilde{\gamma}_t\}_{t=1}^{s-1}$. This makes $\mu_s$ independent of $\lambda_s$ conditional on $\tilde{\mu}$, and consequently yields
$$ \mathbb{E}\left[ \mu_s (\beta_s - \lambda_s) \mid \tilde{\mu}, \{\gamma_t, \tilde{\gamma}_t\}_{t=1}^{s-1} \right] = \mu_s \left( \beta_s - \mathbb{E}\left[ \tilde{b}_s^*(\tilde{\mu}) \mid \tilde{\mu} \right] \right) = 0 , $$
where the last equality holds because $\beta_s = \mathbb{E}_{\hat{\gamma}_s \sim \tilde{P}_s}[\hat{b}_s^*(\tilde{\mu})]$. Thus, we can apply the Tower Rule of conditional expectations to get
$$ \sum_{s=1}^{T} \mathbb{E}\left[ \mu_s (\beta_s - \lambda_s) \right] = \sum_{s=1}^{T} \mathbb{E}\left[ \mathbb{E}\left[ \mu_s (\beta_s - \lambda_s) \mid \tilde{\mu}, \{\gamma_t, \tilde{\gamma}_t\}_{t=1}^{s-1} \right] \right] = 0 . $$
It is straightforward to see that the bounds on $R_1$ and $R_2$ established in the previous subsection continue to hold in this two-trace setting. Therefore, two traces allow us to achieve the near-optimal $\tilde{O}(\sqrt{T})$-regret while being robust to discrepancies between $P_t$ and $\tilde{P}_t$.

Although moving from two traces to one trace might appear to be a minor change, it introduces correlations that make the proof much more difficult. Observe that Algorithm 2 determines $\mu_s$ using $\{\lambda_t\}_{t=1}^{s-1}$, all of which depend on $\tilde{\mu}$, which in turn is computed using the sample $\tilde{\gamma}_s$. Furthermore, $\lambda_s$ directly depends on $\tilde{\gamma}_s$. Thus, $\mu_s$ and $\lambda_s$ are intricately correlated with each other, which breaks the aforementioned argument for the two-trace setting. Nonetheless, $R_3$ can still be shown to be small, as we note in the following lemma and prove in the remainder of this subsection.

Lemma 4.3. We have
$$ \sum_{s=1}^{T} \mathbb{E}\left[ \mu_s (\beta_s - \lambda_s) \right] \leq \frac{4 \eta \bar{b}^2 T}{\sigma} . $$

We prove Lemma 4.3 in the remainder of this subsection. The following lemma will find repeated use in the proof. In keeping with economic intuition, it shows that increasing the price (dual variable) leads to smaller consumption under the profit-maximizing decision.

Lemma 4.4 (Monotonicity). For $\mu > \mu'$, request $\gamma = (f, b) \in S$, $x \in \mathrm{argmax}_{z \in X} \{f(z) - \mu \, b(z)\}$ and $x' \in \mathrm{argmax}_{z \in X} \{f(z) - \mu' \, b(z)\}$, we have $b(x) \leq b(x')$.

Fix an $s \in [T]$. We will get around the correlation between $\mu_s$ and $\tilde{\gamma}_s$ by conducting the following leave-one-out thought experiment: suppose we remove the $s$-th sample $\tilde{\gamma}_s$, compute $\tilde{\mu}$ on the remaining trace $\{\tilde{\gamma}_t\}_{t \neq s}$, and run Algorithm 2 with the resulting target sequence. More precisely, in this thought experiment, we set $\tilde{P}_s$ to be the distribution which always serves the request $\gamma = (f, b)$ with $f(x) = b(x) = 0$ for all $x \in X$. Thus, $\tilde{f}_s(x) = \tilde{b}_s(x) = \tilde{f}_s^*(\mu) = \tilde{b}_s^*(\mu) = 0$ for all $x \in X$ and $\mu \geq 0$. We will use the superscript $(-s)$ to denote the various variables in this thought experiment: $\tilde{\mu}^{(-s)} \in \mathrm{argmin}_{\mu \geq 0} \{ \mu B + \sum_{t \neq s} \max_{x \in X} \{ \tilde{f}_t(x) - \mu \, \tilde{b}_t(x) \} \}$, with $\tilde{\mu}^{(-s)}$ set to the smallest minimizer if there are multiple; $\lambda_t^{(-s)} = \tilde{b}_t^*(\tilde{\mu}^{(-s)})$ for all $t \in [T]$; and $\mu_t^{(-s)}$ is the $t$-th iterate of Algorithm 2 with the target consumption sequence $\{\lambda_t^{(-s)}\}_t$. We begin by characterizing the impact of this change on the target consumption sequence.

Lemma 4.5. For every sample trace $\{\tilde{\gamma}_t\}_t$, we have $\tilde{\mu} \geq \tilde{\mu}^{(-s)}$ and $\lambda_t \leq \lambda_t^{(-s)}$ for all $t \neq s$. Moreover, $\sum_{t=1}^{s-1} \left( \lambda_t^{(-s)} - \lambda_t \right) \leq 3 \bar{b}$.

Lemma 4.5 shows that the target sequences $\{\lambda_t\}_t$ and $\{\lambda_t^{(-s)}\}_t$ are close to each other. Next, we couple the dual iterates $\mu_t$ and $\mu_t^{(-s)}$ generated by Algorithm 2 to show that they never stray too far from each other whenever the target sequences are close.
Lemma 4.6 (Dual Iterate Coupling). Let $\{\mu_t\}_t$ and $\{\mu'_t\}_t$ denote the iterates generated by Algorithm 2 on the request sequence $\{\gamma_t\}_t$ for the target sequences $\{\lambda_t\}_t$ and $\{\lambda'_t\}_t$ respectively. Assume that the initial iterates are the same, i.e., $\mu_1 = \mu'_1$. Then, for all $s \in [T]$, we have
$$ |\mu_s - \mu'_s| \leq \frac{\eta}{\sigma} \sum_{t=1}^{s-1} |\lambda_t - \lambda'_t| . $$

Applying Lemma 4.6 with $\lambda'_t = \lambda_t^{(-s)}$ and using Lemma 4.5 yields $|\mu_s - \mu_s^{(-s)}| \leq 3 \eta \bar{b} / \sigma$. Combining this with the fact that $|\beta_s - \lambda_s| \leq \bar{b}$, we get
$$ \mathbb{E}\left[ \mu_s (\beta_s - \lambda_s) \right] = \mathbb{E}\left[ \left( \mu_s - \mu_s^{(-s)} \right) (\beta_s - \lambda_s) \right] + \mathbb{E}\left[ \mu_s^{(-s)} (\beta_s - \lambda_s) \right] \leq \frac{4 \eta \bar{b}^2}{\sigma} + \mathbb{E}\left[ \mu_s^{(-s)} (\beta_s - \lambda_s) \right] . \quad (7) $$
The next lemma shows that the second term is non-positive. Its proof critically leverages the fact that the iterate $\mu_s^{(-s)}$ is independent of the $s$-th sample $\tilde{\gamma}_s$ in the trace (which is used to determine $\lambda_s$). This is in stark contrast to $\mu_s$, which depends on $\tilde{\gamma}_s$, and demonstrates the merit of our leave-one-out thought experiment.

Lemma 4.7. $\mathbb{E}\left[ \mu_s^{(-s)} (\beta_s - \lambda_s) \right] \leq 0$ for all $s \in [T]$.

Lemma 4.7 in combination with (7) yields $\mathbb{E}[\mu_s (\beta_s - \lambda_s)] \leq 4 \eta \bar{b}^2 / \sigma$. Summing over all $s \in [T]$ finishes the proof of Lemma 4.3.

4.4. Putting It All Together

In the previous subsections, we bounded $R_1$, $R_2$ and $R_3$, and related the benchmark $\sum_{t=1}^{T} D(\mu_t \mid P_t, \beta_t)$ from Theorem 4.1 to our desired benchmark $\mathrm{FLUID}(\{P_t\}_t)$. Combining everything yields the following performance guarantee for Algorithm 2.

Theorem 4.8. Let $A$ be Algorithm 2 with target sequence $\{\lambda_t\}_t$, where $\lambda_t = \tilde{b}_t^*(\tilde{\mu})$ (as defined in (4)), regularizer $h(\cdot)$ and step-size $\eta = \sqrt{d_R / T}$, where $d_R = \max\{ h(0) - h(\mu_1), h(\kappa) - h(\mu_1) \}$. Then,
$$ \mathrm{Regret}(A) \leq C_1 \sqrt{T \log T} + C_2 \sum_{t=1}^{T} W(P_t, \tilde{P}_t) , $$
where $C_1 = \frac{12 \bar{b}^2 \sqrt{d_R}}{\sigma} + \sqrt{d_R} + 12 \kappa \bar{b}$ and $C_2 = 2(1 + \kappa)$.

Observe that the regret of Dual FTRL satisfies $\mathrm{Regret}(A) = \tilde{O}(\sqrt{T})$ whenever $\sum_{t=1}^{T} W(P_t, \tilde{P}_t) = O(\sqrt{T})$. In other words, Dual FTRL achieves near-optimal regret with a single trace as long as the total discrepancy $\sum_{t=1}^{T} W(P_t, \tilde{P}_t)$ is not too large. Finally, we would also like to note that our algorithm is extremely efficient computationally. In particular, due to the equivalence of FTRL and Lazy Online Mirror Descent (OMD) (see Hazan et al. 2016), each dual update in (3) can be computed in constant time by running Lazy OMD. Moreover, given a trace $\{\tilde{\gamma}_t\}_t$ which is sorted in increasing order of bang-per-buck $\mathrm{coeff}(\tilde{f}_t) / \mathrm{coeff}(\tilde{b}_t)$, the target sequence $\{\lambda_t\}_t$ can be computed in $O(T)$ steps (see Appendix E for details).

References

Agrawal, S. and Devanur, N. R. Fast algorithms for online stochastic convex programming. In Proceedings of the Twenty-Sixth Annual ACM-SIAM Symposium on Discrete Algorithms, pp. 1405–1424. SIAM, 2014.

Agrawal, S., Wang, Z., and Ye, Y. A dynamic near-optimal algorithm for online linear programming. Operations Research, 62(4):876–890, 2014.

Arlotto, A. and Gurvich, I. Uniformly bounded regret in the multisecretary problem. Stochastic Systems, 9(3):231–260, 2019.

Azar, P. D., Kleinberg, R., and Weinberg, S. M. Prophet inequalities with limited information. In Proceedings of the Twenty-Fifth Annual ACM-SIAM Symposium on Discrete Algorithms, pp. 1358–1377. SIAM, 2014.

Balseiro, S., Kroer, C., and Kumar, R. Online resource allocation under horizon uncertainty. arXiv preprint arXiv:2206.13606, 2022a.

Balseiro, S. R. and Gur, Y. Learning in repeated auctions with budgets: Regret minimization and equilibrium. Management Science, 65(9):3952–3968, 2019.

Balseiro, S. R., Lu, H., and Mirrokni, V. The best of many worlds: Dual mirror descent for online allocation problems. Operations Research, 2022b.

Banerjee, S. and Freund, D. Uniform loss algorithms for online stochastic decision-making with applications to bin packing.
In Abstracts of the 2020 SIGMETRICS/Performance Joint International Conference on Measurement and Modeling of Computer Systems, pp. 1–2, 2020.

Banerjee, S., Gurvich, I., and Vera, A. Constant regret in online allocation: On the sufficiency of a single historical trace, 2020. URL https://people.orie.cornell.edu/ig264/Online_Optimization_with_Samples.pdf.

Bertsekas, D. Convex Optimization Theory, volume 1. Athena Scientific, 2009.

Caramanis, C., Dütting, P., Faw, M., Fusco, F., Lazos, P., Leonardi, S., Papadigenopoulos, O., Pountourakis, E., and Reiffenhäuser, R. Single-sample prophet inequalities via greedy-ordered selection. In Proceedings of the 2022 Annual ACM-SIAM Symposium on Discrete Algorithms (SODA), pp. 1298–1325. SIAM, 2022.

Chen, X., Kroer, C., and Kumar, R. The complexity of pacing for second-price auctions. In EC, 2021. URL https://arxiv.org/abs/2103.13969.

Correa, J., Dütting, P., Fischer, F., and Schewior, K. Prophet inequalities for i.i.d. random variables from an unknown distribution. In Proceedings of the 2019 ACM Conference on Economics and Computation, pp. 3–17, 2019.

Devanur, N. R. and Hayes, T. P. The adwords problem: Online keyword matching with budgeted bidders under random permutations. In Proceedings of the 10th ACM Conference on Electronic Commerce, pp. 71–78, 2009.

Devanur, N. R., Jain, K., Sivan, B., and Wilkens, C. A. Near optimal online algorithms and fast approximation algorithms for resource allocation problems. In Proceedings of the 12th ACM Conference on Electronic Commerce, pp. 29–38, 2011.

Facebook-Guide. Your guide to Facebook bid strategy. https://www.facebook.com/gms_hub/share/biddingstrategyguide_final.pdf. Accessed: 2023-01-19.

Feldman, J., Henzinger, M., Korula, N., Mirrokni, V. S., and Stein, C. Online stochastic packing applied to display ad allocation. In European Symposium on Algorithms, pp. 182–194. Springer, 2010.

Gaitonde, J., Li, Y., Light, B., Lucier, B., and Slivkins, A. Budget pacing in repeated auctions: Regret and efficiency without convergence. arXiv preprint arXiv:2205.08674, 2022.

Gupta, A. and Molinaro, M. How the experts algorithm can help solve LPs online. Mathematics of Operations Research, 41(4):1404–1431, 2016.

Hazan, E. et al. Introduction to online convex optimization. Foundations and Trends in Optimization, 2(3-4):157–325, 2016.

Jiang, J., Li, X., and Zhang, J. Online stochastic optimization with Wasserstein-based non-stationarity. arXiv preprint arXiv:2012.06961, 2020.

Kesselheim, T., Tönnis, A., Radke, K., and Vöcking, B. Primal beats dual on online packing LPs in the random-order model. In Proceedings of the Forty-Sixth Annual ACM Symposium on Theory of Computing, pp. 303–312, 2014.

Kumar, B., Morgenstern, J., and Schrijvers, O. Optimal spend rate estimation and pacing for ad campaigns with budgets. arXiv preprint arXiv:2202.05881, 2022.

Rubinstein, A., Wang, J. Z., and Weinberg, S. M. Optimal single-choice prophet inequalities from samples. In ITCS, 2020.

Shalev-Shwartz, S. and Ben-David, S. Understanding Machine Learning: From Theory to Algorithms. Cambridge University Press, 2014.

Shalev-Shwartz, S. et al. Online learning and online convex optimization. Foundations and Trends in Machine Learning, 4(2):107–194, 2012.

Zhou, Y., Chakrabarty, D., and Lukose, R. Budget constrained bidding in keyword auctions and online knapsack problems. In International Workshop on Internet and Network Economics, pp. 566–576. Springer, 2008.
Zhou, Y.-H., Liang, C., Li, N., Yang, C., Zhu, S., and Jin, R. Robust online matching with user arrival distribution drift. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 33, pp. 459–466, 2019.

A. Related Work

Balseiro & Gur (2019) study budget pacing in repeated second-price auctions when the values and competing bids are either i.i.d. according to some unknown distribution or adversarially selected. They show that Dual Gradient Descent with the constant target $\lambda_t = B/T$ for all $t \in [T]$ attains the optimal regret of $O(\sqrt{T})$ in the i.i.d. stochastic setting, and the optimal parameter-dependent asymptotic competitive ratio (equal to the ratio of the per-period budget to the maximum value) in the adversarial setting. Zhou et al. (2008) also study the adversarial setting and provide a pacing-based algorithm that achieves a differently-parameterized competitive ratio which scales as the logarithm of the ratio of the highest-to-lowest return-on-investment, and show that it is optimal. Kumar et al. (2022) study an episodic setting and provide a density-estimation-based algorithm for learning the target expenditures for each episode. Gaitonde et al. (2022) study the performance of the algorithm of Balseiro & Gur (2019) for the different objective of value maximization, and against the different benchmark comprised of pacing multipliers which spend the same amount $B/T$ at each time period. Under the no-overbidding assumption, they show that the algorithm of Balseiro & Gur (2019) achieves $O(T^{3/4})$ regret. Recent years have also seen significant attention being given to pacing in multi-buyer settings, but we focus on the single-agent setting here and refer the reader to the recent works of Chen et al. (2021) and Gaitonde et al. (2022) for an overview.

More generally, budget pacing in second-price auctions is a special case of online linear packing, which in turn is a special case of the online resource allocation problem. Both these problems allow for multiple resources and have been studied extensively; we only provide a broad overview here. For the most part, these problems have also been studied in the i.i.d. stochastic model, or the slightly more general random arrival model (requests are selected by an adversary but arrive in a uniformly random order). Devanur & Hayes (2009) and Feldman et al. (2010) study online linear packing under the random arrival model, and show that learning the dual from the initial requests and then using it to make decisions yields $O(T^{2/3})$ regret. Agrawal et al. (2014) extended these results to show that repeatedly solving for the dual at geometrically increasing intervals yields the optimal $O(\sqrt{T})$ regret. Devanur et al. (2011) and Kesselheim et al. (2014) also achieve $O(\sqrt{T})$ regret but with a better dependence on the constants and the number of resources. Gupta & Molinaro (2016) give a dual-descent-based algorithm that also achieves $O(\sqrt{T})$ regret. Agrawal & Devanur (2014) study online resource allocation with concave rewards and convex constraints, and give a dual-descent-based algorithm that achieves $O(\sqrt{T})$ regret. Balseiro et al. (2022b) give a Dual Mirror Descent algorithm which attempts to spend $\lambda_t = B/T$ at each time step and show that it achieves $O(\sqrt{T})$ regret for the general online allocation problem. Their results also hold for stochastic models that are close to i.i.d., like periodic, ergodic, etc.
Balseiro et al. (2022a) present and analyze a generalization of Dual Mirror Descent that can incorporate arbitrary target expenditures $\{\lambda_t\}_t$. When the horizon $T$ is not known exactly and only assumed to lie in some known uncertainty window, they provide a procedure for computing target expenditures which yield a near-optimal asymptotic competitive ratio. Our Dual FTRL algorithm (Algorithm 2) is equivalent to the lazy-update version of the algorithm of Balseiro et al. (2022a). Importantly, with the exception of Devanur et al. (2011), none of the aforementioned works provide algorithms which can achieve vanishing regret for the setting with time-varying distributions. Devanur et al. (2011) achieve $O(\sqrt{T})$ regret when the optimal expected reward for each distribution is known in advance. However, this quantity cannot be computed with a single sample for non-trivial distributions, and they do not provide guarantees for our sample-access setting.

Another line of work develops algorithms that beat $O(\sqrt{T})$ regret when the problem instance is well-structured. With the exception of Banerjee et al. (2020) and Banerjee & Freund (2020), all of these works assume complete knowledge of the distributions and/or assume that the distributions are identical. When the number of requests of each type satisfies a concentration property between the trace and the actual requests, Banerjee et al. (2020) and Banerjee & Freund (2020) show that constant regret can be achieved for online resource allocation using one sample per distribution. For the budget pacing problem, a type corresponds to a value and competing-bid pair. Since complex machine-learning models are typically used to estimate advertiser values to a high precision, this translates to an extremely large number of possible types. Far from concentrating, this large number of types implies that one is unlikely to even observe a type more than once, making their primal-based method ineffective for budget pacing. Moreover, neither Banerjee et al. (2020) nor Banerjee & Freund (2020) provide any robustness guarantees for possible discrepancies between the sampling and true distributions, and their algorithm requires knowledge of the competing bid.

Finally, our results are meaningful when the budget is much larger than the maximum amount one can spend on an auction/request, as is the case for budget pacing. In contrast, the literature on prophet inequalities considers a unit-cost variable-reward online allocation problem where the budget is only large enough to accept one request. See Azar et al. (2014); Correa et al. (2019); Rubinstein et al. (2020); Caramanis et al. (2022) for a sample-driven treatment of prophet inequalities.

B. Application to Budget Pacing

Here we discuss how the budget pacing problem fits as a special case of the online resource allocation problem that we study in this paper. Consider the setting in which a budget-constrained advertiser repeatedly participates in $T$ second-price auctions. For simplicity, assume that all ties are broken in favor of this advertiser. Let $v_t$ and $d_t$ denote her value and the highest competing bid in the $t$-th auction respectively. Moreover, let $B$ denote her budget, which represents the maximum amount she is willing to spend over all $T$ auctions. We will assume that the tuple $(v_t, d_t)$ is drawn from some distribution $P_t$, independently of all other auctions.
Now, observe that every bid of the advertiser results in one of two possible outcomes: (i) she bids greater than or equal to $d_t$, wins the auction, gains utility $v_t - d_t$ and pays $d_t$; (ii) she bids strictly less than $d_t$, loses the auction, gains zero value and pays zero. Thus, corresponding to the tuple $(v_t, d_t)$, we can define a corresponding request $\gamma_t$ with linear reward function $f_t(x) = (v_t - d_t) \cdot x$ and linear consumption function $b_t(x) = d_t \cdot x$, for the action space $x \in \{0, 1\} = \{\text{lose}, \text{win}\}$. Similarly, corresponding to the sample trace of tuples $\{(\tilde{v}_t, \tilde{d}_t)\}_t$, we can define a trace $\{\tilde{\gamma}_t\}_t$ for the online allocation problem. This defines a corresponding instance of the online allocation problem. Since every bid results in either a win or a loss, the maximum expected utility (value minus payment) that the advertiser can earn subject to her budget constraint is bounded above by $\mathrm{FLUID}(\{P_t\}_t)$ for this instance.

Finally, consider step $t$ of Algorithm 2 on this instance. The decision is calculated as $\tilde{x}_t \in \mathrm{argmax}_{x \in X} \{f_t(x) - \mu_t b_t(x)\}$. Therefore, $\tilde{x}_t = 1$ if $v_t - d_t \geq \mu_t d_t$, or equivalently $v_t / (1 + \mu_t) \geq d_t$, and $\tilde{x}_t = 0$ otherwise. Observe that, in a second-price auction, if the advertiser bids $v_t / (1 + \mu_t)$, she will win ($x_t = 1$) if $v_t / (1 + \mu_t) \geq d_t$ and lose ($x_t = 0$) otherwise. Thus, by bidding $v_t / (1 + \mu_t)$, she can simulate the actions of Algorithm 2 for the online allocation instance. Moreover, she does not require knowledge of the competing bid $d_t$ to compute her bid, which is crucial because $d_t$ is not known in practice. Once the auction is over, the expenditure $b_t(x_t) = d_t \cdot x_t$ is revealed to the advertiser. She can then use it to update the dual iterate according to (3).

C. Fluid Benchmark is Stronger

Proposition C.1. For any collection of request distributions $\{P_t\}_t$, we have $\mathbb{E}_{\{\gamma_t\}_t}[\mathrm{OPT}(\{\gamma_t\}_t)] \leq \mathrm{FLUID}(\{P_t\}_t)$.

Proof. Fix any request sequence $\{\gamma_t\}_t$ and let $\{x_t^*\}_t$ be an optimal solution to the corresponding hindsight optimization problem $\mathrm{OPT}(\{\gamma_t\}_t)$. Then, $x_t(\gamma) = x_t^*$ for all $\gamma \in S$ is a feasible solution of $\mathrm{FLUID}(\{P_t\}_t)$ and consequently, we have $\mathrm{OPT}(\{\gamma_t\}_t) \leq \mathrm{FLUID}(\{P_t\}_t)$. Since the request sequence $\{\gamma_t\}_t$ was arbitrary, we have $\mathbb{E}_{\{\gamma_t\}_t}[\mathrm{OPT}(\{\gamma_t\}_t)] \leq \mathrm{FLUID}(\{P_t\}_t)$, as required.

D. General Position

Given any collection of request distributions $\{P_t\}_t$ (which may or may not satisfy Assumption 3.3), we can define perturbed distributions $\{\hat{P}_t\}_t$ to capture the distribution of perturbed requests $\hat{\gamma}_t = (\hat{f}_t, \hat{b}_t)$ generated using the following two-step procedure: (i) draw a request $\gamma_t = (f_t, b_t)$ according to the unperturbed distribution $P_t$; (ii) add a perturbation by setting $\hat{f}_t(x) = f_t(x) + \epsilon_t \cdot x$ for all $x \in X$, where $\epsilon_t \sim \mathrm{Unif}([0, a])$, and leave the consumption function unchanged, $\hat{b}_t(\cdot) = b_t(\cdot)$. Then, $\{\hat{P}_t\}_t$ satisfy Assumption 3.3 and $|\mathrm{FLUID}(\{P_t\}_t) - \mathrm{FLUID}(\{\hat{P}_t\}_t)| \leq a \, \bar{x} \, T$, where $a > 0$ can be made arbitrarily small.

E. Efficiently Computing the Target Sequence

In this section, we describe an efficient procedure for computing the empirical optimal dual solution $\tilde{\mu}$ and the target sequence $\{\lambda_t\}_t$. Consider a trace $\{\tilde{\gamma}_t\}_t$ and set $\tilde{\gamma}_0 = (\tilde{f}_0, \tilde{b}_0)$ such that $\tilde{f}_0(x) = \tilde{b}_0(x) = 0$ for all $x \in X$. Without loss of generality, we will assume that $\{\tilde{\gamma}_t\}_t$ is sorted in increasing order of bang-per-buck $\mathrm{coeff}(\tilde{f}_t) / \mathrm{coeff}(\tilde{b}_t)$ (with the convention $0/0 = 0$), i.e.,
$$ \frac{\mathrm{coeff}(\tilde{f}_s)}{\mathrm{coeff}(\tilde{b}_s)} \leq \frac{\mathrm{coeff}(\tilde{f}_t)}{\mathrm{coeff}(\tilde{b}_t)} \quad \forall \, s \leq t . $$
This can easily be achieved by maintaining a sorted array with $O(\log T)$ insertion time, or by sorting the array with $O(T \log T)$ processing time. Moreover, since the trace $\{\tilde{\gamma}_t\}_t$ is in general position by Assumption 3.3, the ratios $\mathrm{coeff}(\tilde{f}_t) / \mathrm{coeff}(\tilde{b}_t)$ are distinct for all $t \in [T]$.
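The greedy pass formalized as Algorithm 3 below walks the sorted trace from the highest bang-per-buck request downwards, spending greedily until the budget would be exceeded; the ratio at which it stops serves as the empirical dual and the recorded consumptions serve as the targets. Here is a minimal Python sketch under the linear-request assumption $X = [0, \bar{x}]$; it is a sketch of one reading of the procedure, and the names are illustrative.

```python
# Illustrative sketch of the greedy pass behind Algorithm 3 for linear requests.
# sorted_trace: list of (coef_f, coef_b) pairs sorted in increasing bang-per-buck order.
def dual_and_targets(sorted_trace, budget, x_max=1.0):
    """Return (mu_tilde, targets), where targets[t] approximates lambda_t = b~_t*(mu_tilde)."""
    targets = [0.0] * len(sorted_trace)
    spent = 0.0
    mu_tilde = 0.0
    # Walk from the highest bang-per-buck request downwards, spending greedily.
    for t in range(len(sorted_trace) - 1, -1, -1):
        coef_f, coef_b = sorted_trace[t]
        cost = coef_b * x_max
        if spent + cost > budget:
            targets[t] = cost                        # the marginal request is still taken at mu_tilde
            mu_tilde = coef_f / coef_b if coef_b > 0 else 0.0
            break
        spent += cost
        targets[t] = cost
    return mu_tilde, targets
```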
Robust Budget Pacing with a Single Sample Algorithm 3 Learning the Dual from the Trace Input: Trace { γt}t in general position and sorted in increasing order of coeff( ft)/ coeff( bt). Initialize: Total payment P = 0 and target sequence λt = 0 for all t [T]. for t = T, . . . , 0 do if P + bt( x) > B then Set λt bt( x), and set µ = coeff( ft)/ coeff( bt). Break. end else Update total payment P P + bt( x) and set λt bt( x). end end return Dual variable µ, target sequence {λt}t Theorem E.1. µ returned by Algorithm 3 is the smallest element in argminµ 0 µ B +PT t=1 maxx X n ft(x) µ bt(x) o . Moreover, λt = b t ( µ) for all t [T]. Proof. Set q(µ) = µ B + PT t=1 maxx X n ft(x) µ bt(x) o . First, we show that the dual variable µ is smallest element in argminµ 0 q(µ). To do so, we consider the following two cases: µ = 0. In this case, the If condition implies that there exists s [T] such that PT t=s bt( x) B and coeff( ft) = 0 for all t < s. Moreover, we have 0 argmaxx X n ft(x) µ bt(x) o for all t < s and x argmaxx X n ft(x) µ bt(x) o for all t s. Now, note that the set of sub-gradients of the maximum of a collection of linear functions is equal to convex hull of gradients of all the linear functions which are binding (for example, see Chapter 5 of Bertsekas 2009). Therefore, B PT t=s b( x) q(0). Since B PT t=s b( x) 0, the definition of a subgradient implies that q(0) q(µ) for all µ 0. Hence, we have shown that µ is the smallest minimizer of q( ), as required. µ > 0. In this case, the If condition implies that there exists s [T] such that PT t=s+1 bt( x) < B, PT t=s bt( x) > B and fs(x) µ bs(x) = 0 for all x X. Moreover, we have 0 argmaxx X n ft(x) µ bt(x) o for all t < s, x argmaxx X n ft(x) µ bt(x) o for all t > s and {0, x} argmaxx X n fs(x) µ bs(x) o . Now, select a λ [0, 1] such that B λ bs(0) + (1 λ) bs( x) + t=s+1 bt( x) = 0 . Now, note that the set of sub-gradients of the maximum of a collection of linear functions is equal to convex hull of gradients of all the linear functions which are binding (for example, see Chapter 5 of Bertsekas 2009). Therefore, 0 = B λ bs(0) + (1 λ) bs( x) + PT t=s+1 bt( x) q( µ). Consequently, the definition of a subgradient implies that q(0) q(µ) for all µ 0. Finally, consider any µ < µ. Then, fs(x) µ bs(x) > 0, which further implies { x} = argmaxx X n ft(x) µ bt(x) o for all t s. Therefore, for any {xt}t such that xt argmaxx X n ft(x) µ bt(x) o , we have B PT t=1 bt(xt) > 0. Hence, v > 0 for all v q(µ) and consequently µ is a minimizer of q( ). Hence, we have shown that µ is the smallest minimizer of q( ), as required. Finally, we show that λt = b t ( µ) for all t [T]. Let s be the value of t at which the For loop terminates. From the definition of µ, we have { x} = argmaxx X n ft(x) µ bt(x) o for all t > s, X = argmaxx X n fs(x) µ bs(x) o and {0} = argmaxx X n ft(x) µ bt(x) o for all t < s. Therefore, b t ( µ) = bt( x) = λt for all t s and b t ( µ) = bt(0) = λt for all t < s. Robust Budget Pacing with a Single Sample F. Missing Proofs from Section 3 F.1. Proof of Lemma 3.4 Proof of Lemma 3.4. Define q(µ) = µ B + PT t=1 maxx X n ft(x) µ bt(x) o . Then, q( ) is a convex function of µ because the maximum of a collection of linear function is convex and the sum of convex function is also convex (Bertsekas, 2009). Since µ argminµ 0 q(µ), first-order condition of optimality (Proposition 5.4.7 of Bertsekas 2009) implies that one of the following statements holds: (i) µ = 0 and there exists v q(0) such that v 0. 
(ii) Zero is a sub-differential of q at µ, i.e., 0 q( µ). Recall that the trace { γt}t is assumed to be in general position with probability one. Therefore, there is at most one time step for which argmaxx X ft(x) µ bt(x) is not unique. Let s be that time step. Now, note that the set of sub-gradients of the maximum of a collection of linear functions is equal to convex hull of gradients of all the linear functions which are binding (for example, see Chapter 5 of Bertsekas 2009). Hence, v q( µ) implies the existence of Ds (X) such that Support(Ds) argmax x X fs(x) µ bs(x) and v = B Ex Ds[ bs(x)] t =s b t ( µ) . where x t (γt, µ) is the optimal solution to maxx X ft(x) µ bt(x) as described in Definition 3.1. Since 0 bs(x) b for all x X, we get B v t=1 b t ( µ) where we have used 0 bt(x) b for all x X and t [T]. Therefore, statements (i) and (ii) imply that either µ = 0 and b + B PT t=1 b t ( µ) v 0, or B PT t=1 b t ( µ) b, as required. F.2. Proof of Theorem 3.5 Proof of Theorem 3.5. Define the hypothesis class F := {(f, b) 7 b (µ) | µ 0} . Let Rad( ) denote Radmacher complexity. Then, we know from PAC learning theory (for example see Chapter 26 of Shalev-Shwartz & Ben-David 2014) that t=1 b t (µ) t=1 Eˆγt Pt h ˆb t (µ) i r(T) r(T) 2T E{ˆγt}t Q t Pt [Rad(F {ˆγt}t)] + 2 b p T log(2T) . Let H({ˆγt}t) = n {ˆb t (µ)}t | µ 0 o denote the set of all possible resource expenditure vectors that can be generated from a trace {ˆγt}t, then t Pt [Rad(F {ˆγt}t)] = E{ˆγt}t Q t Pt [Rad(H({ˆγt}t))] = 1 T E{ˆγt}t Q t=1 σt ˆb t (µ) where {σt}t are independent Radmacher random variables. Robust Budget Pacing with a Single Sample For a linear function f : R R, let coeff(f) denote its coefficient. Moreover, let x = maxx X x. Then, observe that for a request γ = (f, b) and dual variable µ 0, we have ( x if coeff(f) µ coeff(b) 0 and coeff(f) = 0 0 otherwise . Therefore, for any request γ = (f, b) with coeff(f) = 0 and coeff(b) = 0, there exists a critical µ = coeff(f)/ coeff(b) such that ( b( x) if µ µ 0 if µ > µ . For γ = (f, b) with coeff(f) = 0 or coeff(b) = 0, we have b (µ) = 0 for all µ 0. Consider the trace {ˆγt}t, and let µ t be the critical dual solution for request ˆγt as defined above. Then, the assumption that the trace is in general position (Assumption 3.3) implies that the critical points {µ t }t are distinct. Consequently, we get that {ˆb t (µ)}t remains constant whenever µ lies between any two critical points. Since the total number of critical points is T, we get that |H({ˆγt}t)| T. Therefore, Massart Lemma applies and we get Rad(H({ˆγt}t)) b Combining this with (8) and (9) yields the theorem. F.3. Proof of Theorem 3.6 Proof of Theorem 3.6. By Assumption 3.3, the request sequence {γt} is in general position almost surely. Therefore, there is at most 1 time step such that ft(x t) = f t ( µ), call it s. Let ζA be the first time step t in which Bt+1 b, i.e., PζA t=1 b t ( µ) B b. Then, E [R(A|{γt}t)] = E t=1 ft(x t) t=1 f t ( µ) |f s ( µ) fs(x s)| t=1 f t ( µ) t=ζA+1 f t ( µ) t=1 f t ( µ) t=ζA+1 b t ( µ) D( µ|{Pt}t) E t=1 b t ( µ) t=ζA+1 b t ( µ) FLUID({Pt}t) E t=1 b t ( µ) t=ζA+1 b t ( µ) Regret(A) E t=ζA+1 b t ( µ) t=1 b t ( µ) In the remainder of the proof, we bound the first two terms on the RHS. Robust Budget Pacing with a Single Sample For the first term, observe that t=ζA+1 b t ( µ) t=1 b t ( µ) t=1 b t ( µ) t=1 b t ( µ) t=1 b t ( µ) t=1 b t ( µ) t=1 b t ( µ) + 2 b , (10) where the first inequality follows from the definition of ζA and the last inequality follows from Lemma 3.4. 
F.3. Proof of Theorem 3.6

Proof of Theorem 3.6. By Assumption 3.3, the request sequence $\{\gamma_t\}_t$ is in general position almost surely. Therefore, there is at most one time step such that $f_t(x^*_t) \ne f^*_t(\bar\mu)$; call it $s$. Let $\zeta_A$ be the first time step $t$ at which $B_{t+1} \le \bar b$, i.e., $\sum_{t=1}^{\zeta_A} b^*_t(\bar\mu) \ge B - \bar b$. Then,
$$\mathbb{E}[R(A\mid\{\gamma_t\}_t)] \ge \mathbb{E}\Big[\sum_{t=1}^{\zeta_A} f_t(x^*_t)\Big] \ge \mathbb{E}\Big[\sum_{t=1}^{T} f^*_t(\bar\mu)\Big] - \kappa\,\mathbb{E}\Big[\sum_{t=\zeta_A+1}^{T} b^*_t(\bar\mu)\Big] - \kappa\bar b,$$
where the last inequality uses $f_t(x^*_t) = f^*_t(\bar\mu)$ for all $t\ne s$, $|f^*_s(\bar\mu) - f_s(x^*_s)| \le \kappa\bar b$, and $f^*_t(\bar\mu) \le \kappa\,b^*_t(\bar\mu)$. Moreover, by the definition of $D(\bar\mu\mid\{P_t\}_t)$ and weak duality,
$$\mathbb{E}\Big[\sum_{t=1}^T f^*_t(\bar\mu)\Big] = D(\bar\mu\mid\{P_t\}_t) - \bar\mu\,\mathbb{E}\Big[B - \sum_{t=1}^T b^*_t(\bar\mu)\Big] \ge \mathrm{FLUID}(\{P_t\}_t) - \bar\mu\,\mathbb{E}\Big[B - \sum_{t=1}^T b^*_t(\bar\mu)\Big].$$
Combining the two displays yields
$$\mathrm{Regret}(A) \le \kappa\,\mathbb{E}\Big[\sum_{t=\zeta_A+1}^{T} b^*_t(\bar\mu)\Big] + \bar\mu\,\mathbb{E}\Big[B - \sum_{t=1}^T b^*_t(\bar\mu)\Big] + \kappa\bar b.$$
In the remainder of the proof, we bound the first two terms on the RHS.

For the first term, observe that
$$\mathbb{E}\Big[\sum_{t=\zeta_A+1}^T b^*_t(\bar\mu)\Big] \le \mathbb{E}\Big[\sum_{t=1}^T b^*_t(\bar\mu)\Big] - (B - \bar b) \le \mathbb{E}\Big[\sum_{t=1}^T b^*_t(\bar\mu)\Big] - \sum_{t=1}^T \tilde b^*_t(\bar\mu) + 2\bar b, \quad(10)$$
where the first inequality follows from the definition of $\zeta_A$ and the last inequality follows from Lemma 3.4, which gives $\sum_{t=1}^T \tilde b^*_t(\bar\mu) \le B + \bar b$.

For the second term, observe that Lemma 3.4 implies
$$\bar\mu\,\mathbb{E}\Big[B - \sum_{t=1}^T b^*_t(\bar\mu)\Big] = \bar\mu\Big(B - \sum_{t=1}^T\tilde b^*_t(\bar\mu)\Big) + \bar\mu\,\mathbb{E}\Big[\sum_{t=1}^T\tilde b^*_t(\bar\mu) - \sum_{t=1}^T b^*_t(\bar\mu)\Big] \le \bar\mu\,\bar b + \bar\mu\,\mathbb{E}\Big[\sum_{t=1}^T\tilde b^*_t(\bar\mu) - \sum_{t=1}^T b^*_t(\bar\mu)\Big],$$
where the inequality uses Lemma 3.4 (if $\bar\mu > 0$ then $B - \sum_t\tilde b^*_t(\bar\mu)\le\bar b$; if $\bar\mu=0$ both sides are zero). Note that $\bar\mu \le \kappa$. This is because the definition of $\kappa$ implies that $\max_{x\in X}\{f(x) - \mu\,b(x)\} = 0$ for all $\gamma = (f,b)\in\mathcal S$ and $\mu\ge\kappa$. Hence,
$$\mu B + \sum_{t=1}^T\max_{x\in X}\{\tilde f_t(x) - \mu\,\tilde b_t(x)\} = \mu B > \kappa B = \kappa B + \sum_{t=1}^T\max_{x\in X}\{\tilde f_t(x) - \kappa\,\tilde b_t(x)\}$$
for all $\mu>\kappa$, so no $\mu > \kappa$ can be the smallest minimizer. Therefore, we get
$$\bar\mu\,\mathbb{E}\Big[B - \sum_{t=1}^T b^*_t(\bar\mu)\Big] \le \kappa\bar b + \kappa\,\mathbb{E}\Big[\Big|\sum_{t=1}^T \tilde b^*_t(\bar\mu) - \sum_{t=1}^T b^*_t(\bar\mu)\Big|\Big].$$
Define the good event $G$ to be one in which the total expenditures under the trace and the request sequence are close:
$$G := \Big\{\Big|\sum_{t=1}^T b^*_t(\mu) - \sum_{t=1}^T \tilde b^*_t(\mu)\Big| \le r(T)\ \text{ for all }\mu\ge0\Big\}.$$
Then, Theorem 3.5 and the union bound imply that $\Pr(G^c) \le 2/T^2$ and $\Pr(G) \ge 1 - 2/T^2$. Finally, we can put it all together to get the required bound:
$$\mathrm{Regret}(A) \le 2\kappa\,\mathbb{E}\Big[\Big|\sum_{t=1}^T b^*_t(\bar\mu) - \sum_{t=1}^T \tilde b^*_t(\bar\mu)\Big|\Big] + 4\kappa\bar b = 2\kappa\,\mathbb{E}\Big[\Big|\sum_{t=1}^T b^*_t(\bar\mu) - \sum_{t=1}^T\tilde b^*_t(\bar\mu)\Big|\ \Big|\ G\Big]\Pr(G) + 2\kappa\,\mathbb{E}\Big[\Big|\sum_{t=1}^T b^*_t(\bar\mu) - \sum_{t=1}^T\tilde b^*_t(\bar\mu)\Big|\ \Big|\ G^c\Big]\Pr(G^c) + 4\kappa\bar b$$
$$\le 2\kappa\,r(T) + 2\kappa\cdot 2T\bar b\cdot\frac{2}{T^2} + 4\kappa\bar b \le 12\kappa\bar b + 2\kappa\,r(T).$$

G. Missing Proofs from Section 4

G.1. Proof of Theorem 4.1

Proof of Theorem 4.1. Let $\zeta_A$ be the first time step less than $T$ for which $\sum_{t=1}^{\zeta_A} b_t(x_t) + \bar b \ge B$. Set $\zeta_A = T$ if this inequality is never satisfied. Then, $x_t = x^*_t$ for all $t\le\zeta_A$, and $\sum_{t=1}^{\zeta_A} b_t(x^*_t) \ge B - \bar b$ whenever $\zeta_A < T$. First, observe that
$$\sum_{t=1}^{\zeta_A} f_t(x^*_t) = \sum_{t=1}^{T} f_t(x^*_t) - \sum_{t=\zeta_A+1}^{T} f_t(x^*_t) \ge \sum_{t=1}^{T} f_t(x^*_t) - \kappa\sum_{t=\zeta_A+1}^{T} b_t(x^*_t). \quad(12)$$
Next observe that, for all $t\in[T]$, $\mu_t$ is independent of $\gamma_t$ because $\mu_t$ is completely determined by $\{\gamma_1,\dots,\gamma_{t-1}\}$. Hence,
$$\mathbb{E}_{\gamma_t}[f_t(x^*_t)\mid\mu_t] = \mathbb{E}_{\gamma_t}[f_t(x^*_t) + \mu_t(\beta_t - b_t(x^*_t))\mid\mu_t] - \mathbb{E}_{\gamma_t}[\mu_t(\lambda_t - b_t(x^*_t))\mid\mu_t] - \mathbb{E}_{\gamma_t}[\mu_t(\beta_t - \lambda_t)\mid\mu_t]$$
$$= \mathbb{E}_{\gamma_t}[D(\mu_t\mid P_t,\beta_t)\mid\mu_t] - \mathbb{E}_{\gamma_t}[\mu_t(\lambda_t - b_t(x^*_t))\mid\mu_t] - \mathbb{E}_{\gamma_t}[\mu_t(\beta_t - \lambda_t)\mid\mu_t].$$
Taking unconditional expectations on both sides and applying the tower rule yields
$$\mathbb{E}[f_t(x^*_t)] = \mathbb{E}[D(\mu_t\mid P_t,\beta_t)] - \mathbb{E}[\mu_t(\lambda_t - b_t(x^*_t))] - \mathbb{E}[\mu_t(\beta_t - \lambda_t)].$$
Summing over $t\in[T]$, we get
$$\sum_{t=1}^T\mathbb{E}[f_t(x^*_t)] = \sum_{t=1}^T\mathbb{E}[D(\mu_t\mid P_t,\beta_t)] - \sum_{t=1}^T\mathbb{E}[\mu_t(\lambda_t - b_t(x^*_t))] - \sum_{t=1}^T\mathbb{E}[\mu_t(\beta_t - \lambda_t)]. \quad(13)$$
Therefore, (12) and (13) together imply
$$\mathbb{E}\Big[\sum_{t=1}^{\zeta_A} f_t(x^*_t)\Big] \ge \sum_{t=1}^T\mathbb{E}[D(\mu_t\mid P_t,\beta_t)] - \mathbb{E}\Big[\sum_{t=1}^T \mu_t(\lambda_t - b_t(x^*_t)) + \kappa\sum_{t=\zeta_A+1}^T b_t(x^*_t)\Big] - \sum_{t=1}^T\mathbb{E}[\mu_t(\beta_t - \lambda_t)]. \quad(14)$$

FTRL Regret Bound. Define $w_t(\mu) := \mu(\lambda_t - b_t(x^*_t))$. Then, Algorithm 2 can be seen as running FTRL with linear losses $\{w_t(\cdot)\}_t$. The gradients of these loss functions are given by $\nabla w_t(\mu) = \lambda_t - b_t(x^*_t)$, which satisfy $|\nabla w_t(\mu)| \le b_t(x^*_t) + \lambda_t \le \bar b + \bar\lambda$. Therefore, the regret bound for FTRL implies that for all $\mu\in[0,\kappa]$:
$$\sum_{t=1}^T w_t(\mu_t) - w_t(\mu) \le E(T,\mu), \quad(15)$$
where $E(T,\mu) = \frac{2(\bar b+\bar\lambda)^2}{\sigma}\,\eta T + \frac{h(\mu)-h(\mu_1)}{\eta}$ is the regret bound of FTRL after $T$ iterations (Hazan et al., 2016). Equivalently, we can write
$$\sum_{t=1}^T \mu_t(\lambda_t - b_t(x^*_t)) \le E(T,\mu) + \sum_{t=1}^T \mu(\lambda_t - b_t(x^*_t)) \qquad \forall\,\mu\in[0,\kappa].$$
Now, consider the following two cases:

Case 1: $\zeta_A = T$. Here, setting $\mu = 0$ yields
$$\sum_{t=1}^T\mu_t(\lambda_t - b_t(x^*_t)) + \kappa\sum_{t=\zeta_A+1}^T b_t(x^*_t) \le E(T,0).$$

Case 2: $\zeta_A < T$. Then, $\sum_{t=1}^{\zeta_A} b_t(x^*_t) \ge B - \bar b$. Hence, setting $\mu = \kappa$ yields
$$\sum_{t=1}^T\mu_t(\lambda_t - b_t(x^*_t)) + \kappa\sum_{t=\zeta_A+1}^T b_t(x^*_t) \le E(T,\kappa) + \sum_{t=1}^T\kappa(\lambda_t - b_t(x^*_t)) + \kappa\sum_{t=\zeta_A+1}^T b_t(x^*_t)$$
$$= E(T,\kappa) + \kappa\sum_{t=1}^T\lambda_t - \kappa\sum_{t=1}^{\zeta_A} b_t(x^*_t) \le E(T,\kappa) + \kappa(B+\bar b) - \kappa(B-\bar b) = E(T,\kappa) + 2\kappa\bar b,$$
where we used $\sum_{t=1}^T \lambda_t \le B + \bar b$, which follows from Lemma 3.4 applied to the trace.

Combining the two cases implies that for all values of $\zeta_A$ we have
$$\sum_{t=1}^T\mu_t(\lambda_t - b_t(x^*_t)) + \kappa\sum_{t=\zeta_A+1}^T b_t(x^*_t) \le \max\{E(T,0),E(T,\kappa)\} + 2\kappa\bar b. \quad(16)$$
Finally, combining (14) and (16) yields
$$\mathbb{E}\Big[\sum_{t=1}^{\zeta_A} f_t(x^*_t)\Big] \ge \sum_{t=1}^T\mathbb{E}[D(\mu_t\mid P_t,\beta_t)] - \max\{E(T,0),E(T,\kappa)\} - 2\kappa\bar b - \sum_{t=1}^T\mathbb{E}[\mu_t(\beta_t - \lambda_t)].$$
Plugging in the definition of $E(T,\mu)$ finishes the proof.
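As a concrete illustration of the controller analyzed above, here is a minimal Python sketch of a dual-FTRL pacing loop. It assumes linear requests and the quadratic regularizer $h(\mu) = \mu^2/2$ (so the FTRL step reduces to a clipped gradient step); the names and the request model are illustrative assumptions rather than the paper's Algorithm 2.

```python
import numpy as np

def dual_ftrl_pacing(requests, targets, budget, kappa, eta, x_bar=1.0):
    """Minimal sketch of a dual-FTRL pacing controller (illustrative assumptions:
    linear requests f_t(x) = f_coef * x, b_t(x) = b_coef * x over X = [0, x_bar],
    and the quadratic regularizer h(mu) = mu^2 / 2).

    At each step the controller bids by shading with the current dual mu_t, then
    takes an FTRL step on the linear loss w_t(mu) = mu * (lambda_t - b_t(x_t))."""
    mu, grad_sum, spend, utility = 0.0, 0.0, 0.0, 0.0
    for (f_coef, b_coef), lam in zip(requests, targets):
        if spend + b_coef * x_bar > budget:   # stop once the next request may not fit
            break
        # Best response to the shaded objective f_t(x) - mu * b_t(x) over [0, x_bar].
        x = x_bar if (f_coef - mu * b_coef >= 0 and f_coef > 0) else 0.0
        utility += f_coef * x
        spend += b_coef * x
        # FTRL with linear losses: mu <- argmin_{[0, kappa]} eta * sum(grads) * mu + mu^2 / 2.
        grad_sum += lam - b_coef * x
        mu = float(np.clip(-eta * grad_sum, 0.0, kappa))
    return utility, spend

# Example usage: track a flat target plan of budget / T per auction.
T, budget = 1000, 200.0
rng = np.random.default_rng(1)
reqs = list(zip(rng.uniform(0, 1, T), rng.uniform(0.1, 1, T)))
util, spent = dual_ftrl_pacing(reqs, [budget / T] * T, budget, kappa=2.0, eta=1.0 / np.sqrt(T))
```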
G.2. Proof of Lemma 4.2

Proof of Lemma 4.2. Consider any two request distributions $P, \tilde P \in \Delta(\mathcal S)$. Then, by the definition of the Wasserstein metric, there exists a joint probability distribution $Q$, with marginals $P$ and $\tilde P$ on the first and second factors respectively, such that
$$W(P,\tilde P) = \mathbb{E}_{(\gamma,\tilde\gamma)\sim Q}\Big[\sup_x|f(x)-\tilde f(x)| + \sup_x|b(x)-\tilde b(x)|\Big].$$
Let $x^*(\gamma,\mu)$ be the optimal solution of $\max_{x\in X}\{f(x)-\mu\,b(x)\}$ for request $\gamma=(f,b)$ as described in Definition 3.1. Then, for any $\mu\in[0,\kappa]$ and any map $x:\mathcal S\to X$, we have
$$\mathbb{E}_{(\gamma,\tilde\gamma)\sim Q}\Big[\tilde f(x(\tilde\gamma)) - \mu\,\tilde b(x(\tilde\gamma)) - \big\{f(x(\tilde\gamma)) - \mu\,b(x(\tilde\gamma))\big\}\Big] \le \mathbb{E}_{(\gamma,\tilde\gamma)\sim Q}\Big[\big|\tilde f(x(\tilde\gamma)) - f(x(\tilde\gamma))\big| + \mu\,\big|\tilde b(x(\tilde\gamma)) - b(x(\tilde\gamma))\big|\Big] \le W(P,\tilde P) + \kappa\,W(P,\tilde P) = (1+\kappa)\,W(P,\tilde P). \quad(17)$$
Now, for $t\in[T]$, we can use (17) to write
$$D(\mu_t\mid\tilde P,\beta_t) - D(\mu_t\mid P,\beta_t) = \mathbb{E}_{\tilde\gamma\sim\tilde P}\big[\tilde f(x^*(\tilde\gamma,\mu_t)) - \mu_t\,\tilde b(x^*(\tilde\gamma,\mu_t)) + \mu_t\beta_t\big] - \mathbb{E}_{\gamma\sim P}\big[f(x^*(\gamma,\mu_t)) - \mu_t\,b(x^*(\gamma,\mu_t)) + \mu_t\beta_t\big]$$
$$\le \mathbb{E}_{(\gamma,\tilde\gamma)\sim Q}\Big[\tilde f(x^*(\tilde\gamma,\mu_t)) - \mu_t\,\tilde b(x^*(\tilde\gamma,\mu_t)) - \big\{f(x^*(\tilde\gamma,\mu_t)) - \mu_t\,b(x^*(\tilde\gamma,\mu_t))\big\}\Big] \le (1+\kappa)\,W(P,\tilde P),$$
where the first inequality follows from the definition of $x^*(\gamma,\mu_t)$ and the second inequality follows from the fact that $(P,\tilde P)$ are the marginals of $Q$, together with (17). As a consequence, we get
$$\sum_{t=1}^T D(\mu_t\mid P_t,\beta_t) = \sum_{t=1}^T D(\mu_t\mid\tilde P_t,\beta_t) - \sum_{t=1}^T \big\{D(\mu_t\mid\tilde P_t,\beta_t) - D(\mu_t\mid P_t,\beta_t)\big\} \ge \sum_{t=1}^T D(\mu_t\mid\tilde P_t,\beta_t) - (1+\kappa)\sum_{t=1}^T W(P_t,\tilde P_t) \ge \sum_{t=1}^T \mathrm{FLUID}(\tilde P_t,\beta_t) - (1+\kappa)\sum_{t=1}^T W(P_t,\tilde P_t), \quad(18)$$
where the second inequality follows from weak duality. Next, observe that the definition of $\beta_t = \mathbb{E}_{\hat\gamma\sim\tilde P_t}[\hat b^*(\bar\mu)]$ implies that $x^*(\tilde\gamma_t,\bar\mu)$ is a feasible solution to the optimization problem which defines $\mathrm{FLUID}(\tilde P_t,\beta_t)$. Hence,
$$\sum_{t=1}^T \mathrm{FLUID}(\tilde P_t,\beta_t) \ge \sum_{t=1}^T\mathbb{E}_{\tilde\gamma_t\sim\tilde P_t}\big[\tilde f_t(x^*(\tilde\gamma_t,\bar\mu))\big] = \sum_{t=1}^T\mathbb{E}_{\tilde\gamma_t\sim\tilde P_t}\big[\tilde f_t(x^*(\tilde\gamma_t,\bar\mu)) - \bar\mu\,\tilde b_t(x^*(\tilde\gamma_t,\bar\mu))\big] + \bar\mu\sum_{t=1}^T\mathbb{E}_{\tilde\gamma_t\sim\tilde P_t}\big[\tilde b_t(x^*(\tilde\gamma_t,\bar\mu))\big].$$
Let $\{x_t(\cdot)\}_t$ be an optimal solution for $\mathrm{FLUID}(\{P_t\})$. Then, for all $t\in[T]$, we have
$$\mathbb{E}_{\tilde\gamma_t\sim\tilde P_t}\big[\tilde f_t(x^*(\tilde\gamma_t,\bar\mu)) - \bar\mu\,\tilde b_t(x^*(\tilde\gamma_t,\bar\mu))\big] = \mathbb{E}_{(\gamma_t,\tilde\gamma_t)\sim Q}\big[\tilde f_t(x^*(\tilde\gamma_t,\bar\mu)) - \bar\mu\,\tilde b_t(x^*(\tilde\gamma_t,\bar\mu))\big] \ge \mathbb{E}_{(\gamma_t,\tilde\gamma_t)\sim Q}\big[\tilde f_t(x_t(\gamma_t)) - \bar\mu\,\tilde b_t(x_t(\gamma_t))\big]$$
$$\ge \mathbb{E}_{(\gamma_t,\tilde\gamma_t)\sim Q}\big[f_t(x_t(\gamma_t)) - \bar\mu\,b_t(x_t(\gamma_t))\big] - (1+\kappa)\,W(P_t,\tilde P_t) = \mathbb{E}_{\gamma_t\sim P_t}\big[f_t(x_t(\gamma_t)) - \bar\mu\,b_t(x_t(\gamma_t))\big] - (1+\kappa)\,W(P_t,\tilde P_t),$$
where the first inequality follows from the definition of $x^*(\tilde\gamma_t,\bar\mu)$ and the second inequality follows from (17). Therefore,
$$\sum_{t=1}^T \mathrm{FLUID}(\tilde P_t,\beta_t) \ge \sum_{t=1}^T\mathbb{E}_{\gamma_t\sim P_t}[f_t(x_t(\gamma_t))] - \bar\mu\sum_{t=1}^T\mathbb{E}_{\gamma_t}[b_t(x_t(\gamma_t))] + \bar\mu\sum_{t=1}^T\mathbb{E}_{\tilde\gamma_t}\big[\tilde b^*_t(\bar\mu)\big] - (1+\kappa)\sum_{t=1}^T W(P_t,\tilde P_t) \ge \mathrm{FLUID}(\{P_t\}_t) - \bar\mu\Big(B - \sum_{t=1}^T\beta_t\Big) - (1+\kappa)\sum_{t=1}^T W(P_t,\tilde P_t),$$
where the second inequality follows from the feasibility of the optimal solution $\{x_t(\cdot)\}_t$, i.e., $\sum_{t=1}^T\mathbb{E}_{\gamma_t}[b_t(x_t(\gamma_t))] \le B$. Combining this with (18) yields
$$\sum_{t=1}^T D(\mu_t\mid P_t,\beta_t) \ge \mathrm{FLUID}(\{P_t\}_t) - \bar\mu\Big(B - \sum_{t=1}^T\beta_t\Big) - 2(1+\kappa)\sum_{t=1}^T W(P_t,\tilde P_t),$$
as required.

G.3. Proof of Lemma 4.4

Proof of Lemma 4.4. The definitions of $x$ and $x'$ imply
$$f(x) - \mu\,b(x) \ge f(x') - \mu\,b(x') \quad\text{and}\quad f(x') - \mu'\,b(x') \ge f(x) - \mu'\,b(x).$$
Combining the two inequalities, we get
$$f(x) - \mu\,b(x) - \{f(x') - \mu\,b(x')\} \ge 0 \ge f(x) - \mu'\,b(x) - \{f(x') - \mu'\,b(x')\} \implies (\mu' - \mu)\,(b(x) - b(x')) \ge 0.$$
The lemma follows from the last inequality because $\mu' - \mu > 0$.

G.4. Proof of Lemma 4.5

Proof of Lemma 4.5. Define
$$q(\mu) := \mu B + \sum_{t=1}^T\max_{x\in X}\{f_t(x) - \mu\,b_t(x)\} \quad\text{and}\quad q^{(-s)}(\mu) := \mu B + \sum_{t\ne s}\max_{x\in X}\{f_t(x) - \mu\,b_t(x)\}.$$
We start by proving $\bar\mu \ge \bar\mu^{(-s)}$. For contradiction, suppose $\bar\mu < \bar\mu^{(-s)}$. Consider the following two cases:

Case I: $0\in\operatorname{argmax}_{x\in X}\{f_s(x) - \bar\mu\,b_s(x)\}$. Then, we must have $0\in\operatorname{argmax}_{x\in X}\{f_s(x) - \mu\,b_s(x)\}$ for all $\mu\ge\bar\mu$. This is because, for $\mu\ge\bar\mu$, Lemma 4.4 implies that $b_s(x') \le b_s(0) = 0$ for all $x'\in\operatorname{argmax}_{x\in X}\{f_s(x) - \mu\,b_s(x)\}$, and $f_s(x)\le\kappa\,b_s(x)$ for all $x\in X$. Therefore, $q(\mu) = q^{(-s)}(\mu)$ for all $\mu\ge\bar\mu$. Since $\bar\mu$ is a minimizer of $q(\cdot)$ and $\bar\mu^{(-s)} > \bar\mu$, we get that $q^{(-s)}(\bar\mu) = q(\bar\mu) \le q(\bar\mu^{(-s)}) = q^{(-s)}(\bar\mu^{(-s)})$. On the other hand, $\bar\mu^{(-s)}$ is a minimizer of $q^{(-s)}(\cdot)$, which implies $q^{(-s)}(\bar\mu^{(-s)}) \le q^{(-s)}(\bar\mu)$. Therefore, $q^{(-s)}(\bar\mu^{(-s)}) = q^{(-s)}(\bar\mu)$, which contradicts the fact that $\bar\mu^{(-s)}$ is the smallest minimizer of $q^{(-s)}(\cdot)$.

Case II: $0\notin\operatorname{argmax}_{x\in X}\{f_s(x) - \bar\mu\,b_s(x)\}$.
Since $f_s(x)\le\kappa\,b_s(x)$ for all $x\in X$, we get that $b_s(x') > 0$ for all $x'\in\operatorname{argmax}_{x\in X}\{f_s(x) - \bar\mu\,b_s(x)\}$. Consider any optimal action sequences $\{x_t\}_t$ and $\{x^{(-s)}_t\}_t$ such that, for all $t\in[T]$, we have
$$x_t\in\operatorname{argmax}_{x\in X}\{f_t(x) - \bar\mu\,b_t(x)\} \quad\text{and}\quad x^{(-s)}_t\in\operatorname{argmax}_{x\in X}\{f_t(x) - \bar\mu^{(-s)}\,b_t(x)\}.$$
Then, Lemma 4.4 implies that $b_t(x_t) \ge b_t(x^{(-s)}_t)$ for all $t\ne s$. Therefore,
$$B - \sum_{t=1}^T b_t(x_t) = B - \sum_{t\ne s} b_t(x_t) - b_s(x_s) < B - \sum_{t\ne s} b_t(x_t) \le B - \sum_{t\ne s} b_t(x^{(-s)}_t). \quad(19)$$
Now observe that, since $q(\cdot)$ (and $q^{(-s)}(\cdot)$) is the maximum of a collection of linear functions, its sub-gradient is given by the convex hull of the gradients of all the binding linear functions (for example, see Chapter 5 of Bertsekas 2009). Therefore, $\partial q(\bar\mu)$ (and $\partial q^{(-s)}(\bar\mu^{(-s)})$) is the convex hull of terms of the form $B - \sum_{t=1}^T b_t(x_t)$ for optimal action sequences $\{x_t\}_t$ (respectively, $B - \sum_{t\ne s} b_t(x^{(-s)}_t)$ for $\{x^{(-s)}_t\}_t$). Since $\bar\mu^{(-s)} > \bar\mu \ge 0$, first-order optimality conditions imply that $0\in\partial q^{(-s)}(\bar\mu^{(-s)})$. Therefore, (19) implies that $v<0$ for all $v\in\partial q(\bar\mu)$. This contradicts the optimality of $\bar\mu$ for $q(\cdot)$.

As we have obtained a contradiction in both cases, we get that $\bar\mu \ge \bar\mu^{(-s)}$, as required. Moreover, $\lambda_t \le \lambda^{(-s)}_t$ for all $t\ne s$ follows immediately from Lemma 4.4. Hence, to finish the proof, it suffices to show the final inequality in the following chain:
$$\sum_{t\ne s}\big|\lambda^{(-s)}_t - \lambda_t\big| = \sum_{t\ne s}\lambda^{(-s)}_t - \sum_{t\ne s}\lambda_t \le 3\bar b. \quad(20)$$
Note that Lemma 3.4 implies that at least one of the following conditions holds:
1. $\bar\mu = 0$ and $\sum_{t=1}^T\lambda_t \le B + \bar b$.
2. $\big|B - \sum_{t=1}^T\lambda_t\big| \le \bar b$.
If $\bar\mu = 0$, then $\bar\mu^{(-s)} = 0$ because $\bar\mu^{(-s)} \le \bar\mu$. Therefore, in that case $\lambda^{(-s)}_t = \lambda_t = b^*_t(0)$ for all $t\ne s$ and (20) follows. Suppose $\big|B - \sum_{t=1}^T\lambda_t\big| \le \bar b$. Observe that Lemma 3.4 applied to the trace $\{\hat\gamma_t\}_t$, where $\hat\gamma_t = \gamma_t$ for all $t\ne s$ and $\hat\gamma_s = (0,0)$, implies that at least one of the following conditions holds:
1. $\bar\mu^{(-s)} = 0$ and $\sum_{t\ne s}\lambda^{(-s)}_t \le B + \bar b$.
2. $\big|B - \sum_{t\ne s}\lambda^{(-s)}_t\big| \le \bar b$.
Therefore, $\sum_{t\ne s}\lambda^{(-s)}_t - \sum_{t\ne s}\lambda_t \le B + \bar b - \sum_{t\ne s}\lambda_t \le \bar b + \bar b + \lambda_s \le 3\bar b$, as required to establish (20).

G.5. Proof of Lemma 4.6

Proof of Lemma 4.6. It is known that FTRL is equivalent to Lazy Online Mirror Descent (for example, see Hazan et al. 2016). In particular, if we let $V_h(x,y) = h(x) - h(y) - \nabla h(y)\,(x-y)$ denote the Bregman divergence w.r.t. $h(\cdot)$, then the FTRL update (3) of Algorithm 2 can be equivalently written as
$$\theta_{s+1} = \theta_s - \eta\,g_s = \theta_s - \eta\,(\lambda_s - b_s(x^*_s)), \qquad \mu_{s+1} = \operatorname{argmin}_{\mu\in[0,\kappa]} V_h\big(\mu, (\nabla h)^{-1}(\theta_{s+1})\big),$$
where $x^*_s\in\operatorname{argmax}_{x\in X}\{f_s(x) - \mu_s\,b_s(x)\}$. We will use $\{\mu'_s\}_s$ and $\{\theta'_s\}_s$ to represent the dual and mirror iterates of Algorithm 2 with target sequence $\{\lambda'_s\}_s$:
$$\theta'_1 = \nabla h(\mu'_1), \qquad \theta'_{s+1} = \theta'_s - \eta\,g'_s = \theta'_s - \eta\,(\lambda'_s - b_s(y^*_s)), \qquad \mu'_{s+1} = \operatorname{argmin}_{\mu\in[0,\kappa]} V_h\big(\mu, (\nabla h)^{-1}(\theta'_{s+1})\big),$$
where $y^*_s\in\operatorname{argmax}_{x\in X}\{f_s(x) - \mu'_s\,b_s(x)\}$. We will first use induction on $s$ to prove the following statement:
$$|\theta_s - \theta'_s| \le \eta\sum_{t=1}^{s-1}|\lambda_t - \lambda'_t| + \eta\,\bar b. \quad(21)$$
The base case $s=1$ follows directly from our assumption that the initial iterates $\theta_1 = \nabla h(\mu_1) = \nabla h(\mu'_1) = \theta'_1$ are the same. Suppose (21) holds for $s\in[T-1]$ (Induction Hypothesis). Define $\theta_{s+1/2} = \theta_s + \eta\,b_s(x^*_s)$ and $\theta'_{s+1/2} = \theta'_s + \eta\,b_s(y^*_s)$. W.l.o.g. assume that $\theta_s \ge \theta'_s$. Due to the invertibility of $\nabla h$, we get that $\mu_s \ge \mu'_s$, and consequently Lemma 4.4 implies $b_s(x^*_s) \le b_s(y^*_s)$. Consider the following cases:

Case I: $\theta'_{s+1/2} \le \theta_{s+1/2}$. Then, $\theta_{s+1/2} - \theta'_{s+1/2} = \theta_s - \theta'_s + \eta\,(b_s(x^*_s) - b_s(y^*_s)) \le \theta_s - \theta'_s$ because $b_s(x^*_s) \le b_s(y^*_s)$.

Case II: $\theta'_{s+1/2} \ge \theta_{s+1/2}$. Then, $\theta'_{s+1/2} - \theta_{s+1/2} = \theta'_s - \theta_s + \eta\,(b_s(y^*_s) - b_s(x^*_s)) \le \eta\,\bar b$ because $\theta'_s \le \theta_s$ and $b_s(y^*_s) - b_s(x^*_s) \le \bar b$.

Therefore, in both cases we have
$$|\theta_{s+1/2} - \theta'_{s+1/2}| \le \max\{\theta_s - \theta'_s,\ \eta\,\bar b\} \le \eta\sum_{t=1}^{s-1}|\lambda_t - \lambda'_t| + \eta\,\bar b,$$
where we used the induction hypothesis in the second inequality.
Consequently, we can write
$$|\theta_{s+1} - \theta'_{s+1}| = \big|\theta_{s+1/2} - \eta\,\lambda_s - (\theta'_{s+1/2} - \eta\,\lambda'_s)\big| \le |\theta_{s+1/2} - \theta'_{s+1/2}| + \eta\,|\lambda_s - \lambda'_s| \le \eta\sum_{t=1}^{s}|\lambda_t - \lambda'_t| + \eta\,\bar b.$$
Hence, we have established (21) for all $s\in[T]$. Now, since $h$ is $\sigma$-strongly convex and differentiable, we have $(\nabla h(x) - \nabla h(y))\,(x-y) \ge \sigma\,(x-y)^2$. Therefore, we have
$$\big|(\nabla h)^{-1}(\theta_s) - (\nabla h)^{-1}(\theta'_s)\big| \le \frac{1}{\sigma}\,|\theta_s - \theta'_s|. \quad(22)$$
To finish the proof, we will use the fact that Bregman projections are contractions in one dimension, which we prove next. Consider any $x<0$; then for any $\mu\in[0,\kappa]$, we have
$$V_h(\mu,x) - V_h(0,x) = h(\mu) - h(0) - \nabla h(x)\,(\mu - 0) \ge h(\mu) - h(0) - \nabla h(0)\,(\mu - 0) \ge 0,$$
where the inequality follows from $\nabla h(0) \ge \nabla h(x)$ (convexity of $h(\cdot)$). Therefore, $\operatorname{argmin}_{\mu\in[0,\kappa]} V_h(\mu,x) = 0 = \operatorname{argmin}_{\mu\in[0,\kappa]}|\mu - x|$. Similarly, for $x>\kappa$ and $\mu\in[0,\kappa]$, we have
$$V_h(\mu,x) - V_h(\kappa,x) = h(\mu) - h(\kappa) - \nabla h(x)\,(\mu - \kappa) \ge h(\mu) - h(\kappa) - \nabla h(\kappa)\,(\mu - \kappa) \ge 0,$$
where the inequality follows from $\nabla h(x) \ge \nabla h(\kappa)$ (convexity of $h(\cdot)$). Therefore, $\operatorname{argmin}_{\mu\in[0,\kappa]} V_h(\mu,x) = \kappa = \operatorname{argmin}_{\mu\in[0,\kappa]}|\mu - x|$. Consequently, we have shown that $\operatorname{argmin}_{\mu\in[0,\kappa]} V_h(\mu,x) = \operatorname{argmin}_{\mu\in[0,\kappa]}|\mu - x|$, i.e., the Bregman projection is identical to the Euclidean projection in one dimension. Since the Euclidean projection is a contraction, we get
$$|\mu_{s+1} - \mu'_{s+1}| = \Big|\operatorname{argmin}_{\mu\in[0,\kappa]} V_h\big(\mu,(\nabla h)^{-1}(\theta_{s+1})\big) - \operatorname{argmin}_{\mu\in[0,\kappa]} V_h\big(\mu,(\nabla h)^{-1}(\theta'_{s+1})\big)\Big| \le \big|(\nabla h)^{-1}(\theta_{s+1}) - (\nabla h)^{-1}(\theta'_{s+1})\big|.$$
Finally, combining this with (21) and (22), we get
$$|\mu_{s+1} - \mu'_{s+1}| \le \frac{\eta}{\sigma}\Big(\sum_{t=1}^{s}|\lambda_t - \lambda'_t| + \bar b\Big),$$
as required.

G.6. Proof of Lemma 4.7

Proof of Lemma 4.7. Using the definitions of $\lambda_s$ and $\beta_s$, we can write
$$\mathbb{E}\big[\lambda_s \mid \mu^{(-s)}_s\big] = \mathbb{E}\big[\tilde b^*_s(\bar\mu) \mid \mu^{(-s)}_s\big] \quad\text{and}\quad \mathbb{E}\big[\beta_s \mid \mu^{(-s)}_s\big] = \mathbb{E}\Big[\mathbb{E}_{\hat\gamma\sim\tilde P_s}\big[\hat b^*_s(\bar\mu)\big] \,\Big|\, \mu^{(-s)}_s\Big].$$
Fix a trace $\{\tilde\gamma_t\}_t$. Observe that
$$x^*(\tilde\gamma_s,\bar\mu) = \begin{cases}\bar x & \text{if } \mathrm{coeff}(\tilde f_s) - \bar\mu\,\mathrm{coeff}(\tilde b_s) \ge 0 \text{ and } \mathrm{coeff}(\tilde f_s)\ne0\\ 0 & \text{otherwise,}\end{cases} \qquad x^*(\tilde\gamma_s,\bar\mu^{(-s)}) = \begin{cases}\bar x & \text{if } \mathrm{coeff}(\tilde f_s) - \bar\mu^{(-s)}\,\mathrm{coeff}(\tilde b_s) \ge 0 \text{ and } \mathrm{coeff}(\tilde f_s)\ne0\\ 0 & \text{otherwise.}\end{cases}$$
From Lemma 4.5, we know that $\bar\mu^{(-s)} \le \bar\mu$. Now, if $\mathrm{coeff}(\tilde f_s) = 0$, then $x^*(\tilde\gamma_s,\bar\mu) = x^*(\tilde\gamma_s,\bar\mu^{(-s)}) = 0$. Assume that $\mathrm{coeff}(\tilde f_s) > 0$ (and thus $\mathrm{coeff}(\tilde b_s) > 0$ because $\tilde f_s(x) \le \kappa\,\tilde b_s(x)$), and let $A := \{\mu\ge0 \mid \mathrm{coeff}(\tilde f_s) - \mu\,\mathrm{coeff}(\tilde b_s) < 0\}$ be the set of all dual variables that lead to $x^*(\tilde\gamma_s,\mu) = 0$. Define the dual functions
$$q(\mu) := \mu B + \sum_{t=1}^T\max_{x\in X}\{\tilde f_t(x) - \mu\,\tilde b_t(x)\} \quad\text{and}\quad q^{(-s)}(\mu) := \mu B + \sum_{t\ne s}\max_{x\in X}\{\tilde f_t(x) - \mu\,\tilde b_t(x)\}.$$
For contradiction, suppose $\bar\mu\in A$ and $\bar\mu^{(-s)}\notin A$. Since $A$ is open, there exists a point $\mu\in A$ such that $\mu = \alpha\,\bar\mu + (1-\alpha)\,\bar\mu^{(-s)}$ for some $\alpha\in(0,1)$. Moreover, observe that the minimality of $\bar\mu^{(-s)}$ implies $q^{(-s)}(\bar\mu^{(-s)}) \le q^{(-s)}(\mu)$ and $q^{(-s)}(\bar\mu^{(-s)}) \le q^{(-s)}(\bar\mu)$. Therefore, as $q^{(-s)}$ is convex, we get
$$q^{(-s)}(\mu) \le \alpha\,q^{(-s)}(\bar\mu) + (1-\alpha)\,q^{(-s)}(\bar\mu^{(-s)}) \le \alpha\,q^{(-s)}(\bar\mu) + (1-\alpha)\,q^{(-s)}(\bar\mu) = q^{(-s)}(\bar\mu).$$
Now, observe that $q(\mu') = q^{(-s)}(\mu')$ for all $\mu'\in A$. Therefore, $q(\mu) \le q(\bar\mu)$, which contradicts the fact that $\bar\mu$ is the smallest minimizer of $q(\cdot)$. Hence, either $\bar\mu,\bar\mu^{(-s)}\in A$ or $\bar\mu,\bar\mu^{(-s)}\in A^c$, and as a consequence we get $\tilde b^*_s(\bar\mu) = \tilde b^*_s(\bar\mu^{(-s)})$. Furthermore, combining $\bar\mu \ge \bar\mu^{(-s)}$ (from Lemma 4.5) and Lemma 4.4, we also get $\hat b^*_s(\bar\mu^{(-s)}) \ge \hat b^*_s(\bar\mu)$ for every $\hat\gamma_s\in\mathcal S$. Therefore,
$$\mathbb{E}\big[\lambda_s \mid \mu^{(-s)}_s\big] = \mathbb{E}\big[\tilde b^*_s(\bar\mu) \mid \mu^{(-s)}_s\big] = \mathbb{E}\big[\tilde b^*_s(\bar\mu^{(-s)}) \mid \mu^{(-s)}_s\big] = \mathbb{E}\Big[\mathbb{E}_{\hat\gamma_s\sim\tilde P_s}\big[\hat b^*_s(\bar\mu^{(-s)})\big] \,\Big|\, \mu^{(-s)}_s\Big] \ge \mathbb{E}\Big[\mathbb{E}_{\hat\gamma_s\sim\tilde P_s}\big[\hat b^*_s(\bar\mu)\big] \,\Big|\, \mu^{(-s)}_s\Big] = \mathbb{E}\big[\beta_s \mid \mu^{(-s)}_s\big],$$
where the third equality follows from the fact that $\tilde\gamma_s$ and $\mu^{(-s)}_s$ are independent of each other, which allows us to rename the variable from $\tilde\gamma_s\sim\tilde P_s$ to $\hat\gamma_s\sim\tilde P_s$. We combine this with the Tower Property of conditional expectations to finish the proof:
$$\mathbb{E}\big[\mu^{(-s)}_s(\beta_s - \lambda_s)\big] = \mathbb{E}\Big[\mu^{(-s)}_s\,\mathbb{E}\big[\beta_s - \lambda_s \,\big|\, \mu^{(-s)}_s\big]\Big] = \mathbb{E}\Big[\mu^{(-s)}_s\Big(\mathbb{E}\big[\beta_s \mid \mu^{(-s)}_s\big] - \mathbb{E}\big[\lambda_s \mid \mu^{(-s)}_s\big]\Big)\Big] \le 0,$$
since $\mu^{(-s)}_s \ge 0$.
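The one-dimensional fact used in the proof of Lemma 4.6, that the Bregman projection onto $[0,\kappa]$ coincides with Euclidean clipping, is easy to check numerically. The following sketch does so by brute force for an assumed entropy-like regularizer; it is purely illustrative and not part of the paper's algorithm.

```python
import numpy as np

def bregman_project(y, kappa, h, grad_h, grid_size=10001):
    """Bregman projection onto [0, kappa] w.r.t. a convex differentiable h,
    computed by brute force on a grid (illustrative sketch)."""
    mu = np.linspace(0.0, kappa, grid_size)
    div = h(mu) - h(y) - grad_h(y) * (mu - y)   # Bregman divergence V_h(mu, y)
    return mu[np.argmin(div)]

# For a convex h, the 1-D Bregman projection agrees with Euclidean clipping,
# which is the contraction fact used in the proof of Lemma 4.6.
h = lambda m: m * np.log(np.maximum(m, 1e-12)) - m    # assumed entropy-like regularizer
grad_h = lambda m: np.log(np.maximum(m, 1e-12))
kappa = 2.0
for y in [0.3, 1.7, 2.5, 3.0]:
    assert abs(bregman_project(y, kappa, h, grad_h) - np.clip(y, 0.0, kappa)) < 1e-3
```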
G.7. Proof of Theorem 4.8

Proof of Theorem 4.8. Theorem 4.1 and (6) together imply that, with probability at least $1 - 1/T^2$, we have
$$\mathrm{Regret}(A) = \mathrm{FLUID}(\{P_t\}_t) - \mathbb{E}_{\{\gamma_t\}_t\sim\prod_t P_t}\big[R(A\mid\{\gamma_t\}_t)\big] \le R_1 + R_2 + R_3 + \kappa\,r(T) + \kappa\bar b + 2(1+\kappa)\sum_{t=1}^T W(P_t,\tilde P_t).$$
From our choice of step size $\eta = \sqrt{d_R/T}$ and the observation that $\bar\lambda = \max_t\lambda_t \le \bar b$, we get that
$$R_1 = 2\kappa\bar b + \frac{2(\bar b+\bar\lambda)^2}{\sigma}\,\eta T \le 2\kappa\bar b + \frac{8\bar b^2}{\sigma}\sqrt{d_R\,T}.$$
From (10), we know that $R_2 \le d_R/\eta = \sqrt{d_R\,T}$. Moreover, from Lemma 4.7 and $\eta = \sqrt{d_R/T}$, we know that
$$R_3 = \sum_{s=1}^T\mathbb{E}\big[\mu_s(\beta_s - \lambda_s)\big] \le \frac{4\eta\,\bar b^2\,T}{\sigma} = \frac{4\bar b^2}{\sigma}\sqrt{d_R\,T}.$$
Combining the above inequalities and plugging in $r(T) = 8\bar b\sqrt{T\log(T)}$, we get that
$$\mathrm{Regret}(A) \le 3\kappa\bar b + \frac{12\bar b^2}{\sigma}\sqrt{d_R\,T} + \sqrt{d_R\,T} + 8\kappa\bar b\sqrt{T\log(T)} + 2(1+\kappa)\sum_{t=1}^T W(P_t,\tilde P_t)$$
with probability at least $1 - 1/T^2$. On the other hand, we always have $\mathrm{Regret}(A) \le \bar f\,T \le \kappa\bar b\,T$. Hence, we get
$$\mathrm{Regret}(A) \le \Big(1 - \frac{1}{T^2}\Big)\Big[3\kappa\bar b + \frac{12\bar b^2}{\sigma}\sqrt{d_R\,T} + \sqrt{d_R\,T} + 8\kappa\bar b\sqrt{T\log(T)} + 2(1+\kappa)\sum_{t=1}^T W(P_t,\tilde P_t)\Big] + \frac{\kappa\bar b\,T}{T^2}$$
$$\le 4\kappa\bar b + \frac{12\bar b^2}{\sigma}\sqrt{d_R\,T} + \sqrt{d_R\,T} + 8\kappa\bar b\sqrt{T\log(T)} + 2(1+\kappa)\sum_{t=1}^T W(P_t,\tilde P_t) \le C_1\sqrt{T\log(T)} + C_2\sum_{t=1}^T W(P_t,\tilde P_t),$$
where $C_1 = \frac{12\bar b^2\sqrt{d_R}}{\sigma} + \sqrt{d_R} + 12\kappa\bar b$ and $C_2 = 2(1+\kappa)$.
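For intuition on the choice $\eta = \sqrt{d_R/T}$, the following short calculation (assuming, as the constants above suggest, that $h(\mu) - h(\mu_1) \le d_R$ for all $\mu\in[0,\kappa]$) shows how it balances the two terms of the FTRL bound $E(T,\mu)$:
$$E(T,\mu) = \frac{2(\bar b+\bar\lambda)^2}{\sigma}\,\eta T + \frac{h(\mu)-h(\mu_1)}{\eta} \;\le\; \frac{8\bar b^2}{\sigma}\,\eta T + \frac{d_R}{\eta} \;=\; \Big(\frac{8\bar b^2}{\sigma} + 1\Big)\sqrt{d_R\,T} \quad\text{for } \eta = \sqrt{d_R/T},$$
so the learning-rate-dependent part of the regret is $O(\sqrt{T})$, matching the $\sqrt{d_R\,T}$ terms absorbed into $C_1$ above.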