# bandits_with_knapsacks_advice_on_timevarying_demands__8f6a031a.pdf Bandits with Knapsacks: Advice on Time-Varying Demands Lixing Lyu 1 Wang Chi Cheung 2 We consider a non-stationary Bandits with Knapsack problem. The outcome distribution at each time is scaled by a non-stationary quantity that signifies changing demand volumes. Instead of studying settings with limited non-stationarity, we investigate how online predictions on the total demand volume Q allows us to improve our performance guarantees. We show that, without any prediction, any online algorithm incurs a linearin-T regret. In contrast, with online predictions on Q, we propose an online algorithm that judiciously incorporates the predictions, and achieve regret bounds that depends on the accuracy of the predictions. These bounds are shown to be tight in settings when prediction accuracy improves across time. Our theoretical results are corroborated by our numerical findings. 1. Introduction The multi-armed bandit problem (MAB) is a classical model on sequential decision making. The MAB problem features the trade-off between exploration and exploitation, i.e., between exploring for new information about the underlying system and exploiting the potentially optimal solution based on current information. MAB problems have been studied extensively over many decades, with diverse applications such as recommendation systems, ad allocation, resource allocation, revenue management and network routing/scheduling. Many of these mentioned applications involve resource constraints. For example, a seller experimenting with product prices may have limited product inventory. This motivates the formulation of the bandits with knapsack (Bw K) prob- 1Institute of Operations Research and Analytics, National University of Singapore, Singapore 2Department of Industrial Systems Engineering and Management, National University of Singapore, Singapore. Correspondence to: Lixing Lyu , Wang Chi Cheung . Proceedings of the 40 th International Conference on Machine Learning, Honolulu, Hawaii, USA. PMLR 202, 2023. Copyright 2023 by the author(s). lem, an online knapsack problem involving model uncertainty introduced by (Badanidiyuru et al., 2013). The agent has d 1 types of resources. At each time step t, the agent pulls an arm, which generates an array of outcomes consisting of the random reward the amounts of the d resources consumed. If some resource(s) are exhausted, then the agent stops pulling any arm. The objective is to maximize the expected total reward over a known horizon T, subject to the budget constraints on the d resources. The Bw K problem was first studied in a stationary stochastic setting, dubbed stochastic Bw K, where the outcome of an arm follows a stationary but latent probability distribution. (Badanidiyuru et al., 2013; Agrawal & Devanur, 2014) provide online algorithms for stochastic Bw K with regret sub-linear in T. The regret is the difference between the optimum and the expected cumulative reward earned by the algorithm, and a sub-linear-in-T regret implies the convergence to opimality as T grows. An alternative setting, adversarial Bw K, is introduced by (Immorlica et al., 2019). Each arm s outcome distribution can change arbitrarily over time. Contrary to stochastic Bw K, it is impossible to achieve a regret sub-linear in T, even when the outcome distribution is changed only once during the horizon (Liu et al., 2022). 
This begs the question: could a non-stationary BwK problem be more tractable under a less adversarial model than (Immorlica et al., 2019; Liu et al., 2022)? Practically, while stationarity could be a strong assumption, it could also be too pessimistic to assume that the underlying model is adversarially changing and completely latent. Consider the example of ad allocation. On one hand, internet traffic is constantly changing, leading to a non-stationary model. On the other hand, forecast information is often available. Some advertisements are intrinsically more attractive than others, regardless of the internet traffic, meaning that the click probability of an ad can be regarded as stationary. How could the platform harness this problem structure? Can the allocator utilize forecast information on the internet traffic to improve his/her decisions?

Motivated by the discussions above, we consider the Non-Stationary BwK with Online Advice problem (NS-BwK-OA). An outcome involves an adversarial and a stochastic component. The mean reward and mean type-i resource consumption of pulling arm a at time t are equal to $q_t\, r(a)$ and $q_t\, c(a,i)$ respectively. The quantity $q_t \in \mathbb{R}_{>0}$ is a seasonal term that represents the demand volume at time t, while the stationary quantities $r(a), c(a,i)$ are the mean reward earned and resource consumed per demand unit under arm a. In a dynamic pricing setting, $q_t$ counts the customer arrivals at time t. The quantities $r(a), c(a,i)$ respectively represent the revenue earned and type-i resource consumed per customer under arm a, which could represent a pricing scheme. Our proposed outcome model generalizes the unconstrained settings in (Tracà et al., 2021; Lykouris et al., 2020). We make the following novel contributions:

Model. We incorporate a prediction oracle in NS-BwK-OA, a non-stationary BwK problem. Such an incorporation is novel compared to existing BwK works, which always assume full model uncertainty. In NS-BwK-OA, we identify the total demand volume $Q = \sum_{t=1}^T q_t$ as a crucial (but latent) parameter. The oracle provides a prediction $\hat Q_t$ of $Q$ at every time step t. The oracle corresponds to how a firm constantly acquires updated information about Q, and a variety of existing time series prediction tools can be used to construct such an oracle.

Regret Lower Bounds. We derive two regret lower bounds on NS-BwK-OA. First, without access to a prediction oracle, we show that any online algorithm suffers a linear-in-T regret, even when r, c are known and $q_t$ is known before the arm at time t is to be chosen. Second, with access to a prediction oracle, we establish regret lower bounds that depend on the accuracy of the estimations. When the estimates are equal to the ground truth, the regret lower bounds reduce to those of the stochastic BwK problem.

Algorithms and Regret Upper Bounds. We design an online algorithm OA-UCB to utilize the predictions judiciously. OA-UCB is novel in its incorporation of the prediction $\hat Q_t$ and demand volume $q_t$ into the estimated opportunity costs of the resources, in relation to the predicted demand volumes. We derive a regret upper bound on OA-UCB that depends on the accuracy of the predictions, even though the accuracy of each prediction is not known to the algorithm. OA-UCB is shown to achieve near-optimal regret bounds when the accuracy of the predictions improves across time.

Numerical Validations. We perform numerical experiments when $\{q_t\}_{t=1}^T$ is governed by a time series model.
The experiment highlights the benefit of predicitons. We show that an online algorithm, such as OA-UCB, that harnesses predictions judiciously can perform empirically better than existing baselines, which only has access to the bandit feedback from the latent environment. 1.1. Literature Review The Bandits with Knapsacks (Bw K) problem has been extensively studied. (Badanidiyuru et al., 2013) first introduced the stochastic Bw K problem, which bears applications in dynamic pricing (Besbes & Zeevi, 2009; 2012) and ad allocation (Mehta et al., 2007). The Bw K problem is generalized by (Agrawal & Devanur, 2014) to incorporate convex constraints and concave reweards. Several variants are studied, such as the settings of contextual bandits (Agrawal et al., 2016; Badanidiyuru et al., 2014), combinatorial semibandits (Sankararaman & Slivkins, 2018). Non-stationary Bw K problems, where the outcome distribution of each arm is changing over time, are studied recently. (Immorlica et al., 2019) achieves a O(log T) competitive ratio against the best fixed distribution benchmark in an adversarial setting. (Rangi et al., 2018) consider both stochastic and adversarial Bw K problems in the single resource case. (Liu et al., 2022) design a sliding window learning algorithm with sub-linear-in-T regret, assuming the amount of nonstationarity is upper bounded and known. A sub-linear-in-T regret non-stationary Bw K is only possible in restrictive settings. For example, as shown in (Immorlica et al., 2019; Liu et al., 2022) and our forthcoming Lemma 3.1, for any online algorithm, there exists a non-stationary Bw K instance where the outcome distribution only changes oncee during the horizon, for which the algorihtm incurs a linear-in-T regret. Non-stationary stochastic bandits with no resource constraints are studied in (Besbes et al., 2014; Cheung et al., 2019; Zhu & Zheng, 2020), who provide sub-linear-in-T regret bounds in less restrictive non-stationary settings than (Liu et al., 2022), while the amount of non-stationariety, quantified as the variational budget or the number of change points, has to be sub-linear in T. Our work goes in an orthogonal direction. Instead of studying settings with limited non-stationariety, we seek an improved regret bound when the decision maker is endowed with information additional (in the form of prediction oracle) to the online observations. Our work is also related to a recent stream of work on resource allocation with horizon uncertainty. (Bai et al., 2022; Aouad & Ma, 2022) consider a stochastic resource allcation setting under full model certainty. In their model, the total demand volume is a random variable, whose probability distribution is known to the decision maker, but the realization of the total demand volume is not known. (Balseiro et al., 2022) consider an online resource allocation setting with model uncertainty on the horizon, which is closely related to our model uncertainty setting on the total demand volume Q. A cruciall difference between the uncertainty settings in (Balseiro et al., 2022) and ours is that, the former focuses on the case when the DM is provided with static advices, while our work complementarily consider the case of dynamic advices. More precisely, (Balseiro et al., 2022) consider a model where qt {0, 1}. 
They first consider a model where the DM knows $Q \in [Q_{\text{lower}}, Q_{\text{upper}}]$ but not the actual value of Q at the beginning, and then consider a case where the DM is additionally endowed with a static prediction $\hat Q$ at the beginning. In both cases, the performance guarantees are quantified as competitive ratios that depend on the static advice. By contrast, our study quantifies the benefit of receiving dynamically updated advice $\hat Q_t$, and pinpoints conditions on $\{\hat Q_t\}_{t=1}^T$ that lead to a sub-linear-in-T regret.

In terms of the prediction model, our work is related to an active stream of research on online algorithm design with machine-learned advice (Piotr et al., 2019; Mitzenmacher & Vassilvitskii, 2022). While traditional online algorithm research focuses on worst-case performance guarantees under full model uncertainty, this stream of works focuses on enhancing the performance guarantee when the decision maker (DM) is provided with machine-learned advice at the start of the online dynamics. A variety of results are derived in different settings, such as online caching (Lykouris & Vassilvitskii, 2021), rent-or-buy (Purohit et al., 2018), scheduling (Mitzenmacher, 2019; Lattanzi et al., 2020), online set cover problems (Bamas et al., 2020; Almanza et al., 2021), and online matching (Antoniadis et al., 2020). Our research seeks to take a further step, by investigating the case when the DM receives progressively updated predictions across the horizon, instead of being given a fixed prediction at the beginning.

Lastly, our prediction model is also related to a line of work on online optimization with predictions, which concerns improving the performance guarantee with the help of predictions. These predictions are provided to the DM sequentially, at the beginning of each round. A variety of full-feedback settings are studied in (Rakhlin & Sridharan, 2013a;b; Steinhardt & Liang, 2014; Jadbabaie et al., 2015), and the contextual bandit setting is studied in (Wei et al., 2020). We remark that the above-mentioned works do not involve resource constraints, and they are fundamentally different from ours, as shown in the forthcoming Lemma 3.1.

Notation. For a positive integer d, we denote $[d] = \{1, \ldots, d\}$. We adopt the $O(\cdot)$, $o(\cdot)$, $\Omega(\cdot)$ notation in (Cormen et al., 2009).

2. Problem Setup

We consider the Non-Stationary Bandit with Knapsack problem with Online Advice (NS-BwK-OA). The nature specifies an NS-BwK-OA problem instance, represented by the tuple $(A, B, T, \{q_t\}_{t=1}^T, \{P_a\}_{a\in A})$. We denote A as the set of K arms. There are d types of resources, and the decision maker (DM) is endowed with $B_i > 0$ units of resource i for each $i\in[d]$. The planning horizon consists of T discrete time steps. Following the convention in (Badanidiyuru et al., 2013), we assume for all $i\in[d]$ that $B_i = B = bT$, where b is the normalized budget. At time t, there are $q_t$ units of demand arriving at the DM's platform. For example, $q_t$ can be the number of customers visiting an online shop at time step t, and a time step can be a fifteen-minute interval. We assume $q_t \in [\underline q, \bar q]$, where $0 < \underline q < \bar q$. The sequence $\{q_t\}_{t=1}^T$ is an arbitrary element of $[\underline q, \bar q]^T$ fixed by the nature before the online dynamics. The arbitrariness represents the exogenous nature of the demands. When the DM pulls arm $a\in A$, the nature samples a vector $(R(a), C(a,1), \ldots, C(a,d)) \sim P_a$ of random outcomes. The quantity $R(a)$ is the reward earned per demand unit, and $C(a,i)$ is the amount of type-i resource consumed per demand unit. The random variables $R(a), C(a,1), \ldots, C(a,d)$ are supported in $[0,1]$, and they can be arbitrarily correlated. We denote $r(a) = \mathbb E[R(a)]$, $c(a,i) = \mathbb E[C(a,i)]$, and $r = (r(a))_{a\in A}$, $c = (c(a,i))_{a\in A, i\in[d]}$. To ensure feasibility, we assume there is a null arm $a_0 \in A$ such that $R(a_0) = C(a_0, i) = 0$ with certainty for all $i\in[d]$.
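For intuition, below is a minimal sketch (our own illustration under the setup above, not code from the paper) of how one round of feedback is generated: the per-unit outcome is drawn from a hypothetical $P_a$ with $[0,1]$-supported coordinates, and is then scaled by the exogenous demand volume $q_t$.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_outcome(a, d, means):
    """Draw (R(a), C(a,1..d)) from a hypothetical P_a: independent Gaussians
    around the latent means, clipped to [0, 1]. Any [0,1]-supported and
    possibly correlated distribution would equally fit the model."""
    raw = rng.normal(loc=means[a], scale=0.1, size=d + 1)
    return np.clip(raw, 0.0, 1.0)

def environment_step(t, a, q, d, means):
    """One round of feedback: the demand volume q_t scales both the reward
    and the resource consumptions, as in the outcome model of Section 2."""
    outcome = sample_outcome(a, d, means)
    reward, consumption = outcome[0], outcome[1:]
    return q[t], q[t] * reward, q[t] * consumption  # (q_t, q_t R_t, {q_t C_{t,i}})
```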
At each time t, the DM is provided with a prediction oracle $F_t$. The oracle is a function $F_t : [\underline q, \bar q]^{t-1} \to [\underline q T, \bar q T]$ that provides an estimate $\hat Q_t = F_t(q_1, \ldots, q_{t-1})$ of $Q = \sum_{t=1}^T q_t$ based on the past observations $\{q_s\}_{s=1}^{t-1}$. The DM knows $A, B, T$, and has access to $F_t$ in a sequential manner. In contrast, the DM knows neither $\{P_a\}_{a\in A}$ nor $\{q_t\}_{t=1}^T$, while the upper bound $\bar q$ is known to the DM.

Dynamics. At each time t, three events happen. Firstly, the DM receives a prediction $\hat Q_t = F_t(q_1, \ldots, q_{t-1})$ of Q. Secondly, based on $\hat Q_t$ and the history of observations, the DM selects an arm $A_t \in A$. Thirdly, the DM observes the feedback consisting of (i) the demand volume $q_t$, (ii) the reward earned $q_t R_t$, and (iii) the resources consumed $\{q_t C_{t,i}\}_{i\in[d]}$. Recall that $(R_t, C_{t,1}, \ldots, C_{t,d}) \sim P_{A_t}$. Then, the DM proceeds to time t+1. If some resource is depleted, i.e. there exists $j\in[d]$ such that $\sum_{s=1}^t q_s C_{s,j} > B_j$, then the null arm $a_0$ is pulled in the remaining time steps $t+1, \ldots, T$. We denote the corresponding stopping time as $\tau$. The DM aims to maximize the total reward $\mathbb E\big[\sum_{t=1}^{\tau-1} q_t R_t\big]$, subject to the resource constraints and model uncertainty.

On $q_t$. Our feedback model on $q_t$ is more informative than that of (Lykouris et al., 2020), where none of $q_1, \ldots, q_T$ is observed during the horizon. In contrast, ours is less informative than that of (Tracà et al., 2021), where $q_1, \ldots, q_T$ are all observed at time 1. Our assumption of observing $q_t$ at the end of time t is mild in online retail settings. For example, the number of visitors to a website within a time interval can be extracted from the electronic records when the interval ends. While the nature sets $\{q_t\}_{t=1}^T$ to be fixed but arbitrary, the sequence is set without knowing the DM's online algorithm and prediction oracle $F = \{F_t\}_{t=1}^T$. Our model is milder than the oblivious adversary model, where the nature sets a latent quantity (in this case $\{q_t\}_{t=1}^T$) with the knowledge of the DM's algorithm before the online dynamics. Our milder model allows the possibility of $\hat Q_t = F_t(q_1, \ldots, q_{t-1})$ being a sufficiently accurate (to be quantified in our main results) estimate of Q for each t, for example when $\{q_t\}_{t=1}^T$ is governed by a latent time series model. On the contrary, an oblivious adversary can set Q to be far away from the predictions $\hat Q_1, \ldots, \hat Q_T$ in response to the information on F.

On F. Our prediction oracle is a general black-box model. We do not impose any structural or parametric assumption on F or $\{q_t\}_{t=1}^T$. It is instructive to understand F as side information provided to the DM by an external source. In the dynamic pricing example, $\hat Q_t$ could be an estimate of the customer base population provided by an external marketing research firm. A prime candidate for F is the cornucopia of time series prediction models proposed in decades of research on time series (Shumway & Stoffer, 2017; Hyndman & Athanasopoulos, 2021; Lim & Zohren, 2021). These prediction models allow one-step prediction, where for any t, the predictor P takes $\{q_s\}_{s=1}^{t-1}$ as input and outputs an estimate $\hat q_t$ of $q_t$. The prediction $\hat Q_t$ can then be constructed by (1) iteratively applying P on $\{q_s\}_{s=1}^{t-1} \cup \{\hat q_s\}_{s=t}^{t+\rho-1}$ to output $\hat q_{t+\rho}$, for $\rho \in \{0, \ldots, T-t\}$, and (2) summing over $q_1, \ldots, q_{t-1}, \hat q_t, \ldots, \hat q_T$ to return $\hat Q_t$. We provide an example in the forthcoming Section 5.
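Ahead of that example, the following is a minimal sketch (our own illustration, not the paper's implementation) of such an oracle built from a one-step predictor; the AR(1) fit by least squares is an assumption for illustration, and any one-step forecaster could be plugged in as P.

```python
import numpy as np

def one_step_predictor(history):
    """A hypothetical one-step predictor P: fit q_s = alpha + beta * q_{s-1}
    by least squares on the observed history and forecast the next value."""
    h = np.asarray(history, dtype=float)
    if len(h) < 2:
        return h[-1] if len(h) else 1.0  # fallback when history is too short (arbitrary)
    X = np.column_stack([np.ones(len(h) - 1), h[:-1]])
    alpha, beta = np.linalg.lstsq(X, h[1:], rcond=None)[0]
    return alpha + beta * h[-1]

def oracle_F_t(history, T):
    """Construct Q_hat_t = sum of observed q_1..q_{t-1} plus the iterated
    one-step forecasts q_hat_t..q_hat_T, as described in the text."""
    augmented = list(history)
    forecasts = []
    for _ in range(T - len(history)):     # rho = 0, ..., T - t
        q_hat = one_step_predictor(augmented)
        forecasts.append(q_hat)
        augmented.append(q_hat)           # feed the forecast back as a pseudo-observation
    return float(np.sum(history) + np.sum(forecasts))
```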
Regret. To measure the performance of an algorithm, we define its regret as
$$\mathrm{Regret}_T = \mathrm{OPT} - \mathbb E\Big[\sum_{t=1}^{\tau-1} q_t R_t\Big], \qquad (1)$$
where OPT denotes the expected cumulative reward of an offline optimal dynamic policy given all latent information and all adversarial terms. For analytical tractability in our regret upper bound, we consider an alternative benchmark
$$\mathrm{OPT}_{LP} = \max_{u \in \Delta_{|A|}} \Big\{ Q\, r^\top u \;:\; Q\, c^\top u \le B\,\mathbf 1_d \Big\}, \qquad (2)$$
where $\Delta_{|A|} = \{w \in [0,1]^{|A|} : \sum_{a\in A} w_a = 1\}$ is the probability simplex over the arms. The benchmark (2) is justified by the following Lemma:

Lemma 2.1. $\mathrm{OPT}_{LP} \ge \mathrm{OPT}$.

The proof of Lemma 2.1 is in Appendix A.1.

3. Regret Lower Bounds

In this section, we provide impossibility results on NS-BwK-OA in the form of regret lower bounds. Firstly, we show that a linear-in-T regret is inevitable in the absence of the prediction oracle F.

Lemma 3.1. Consider a fixed but arbitrary online algorithm that knows $\{P_a\}_{a\in A}$, $\{(q_s, q_s R_s, q_s C_{s,1}, \ldots, q_s C_{s,d})\}_{s=1}^{t-1}$ and $q_t$, but does not have any access to a prediction oracle, when the action $A_t$ is to be chosen at each time t. There exists an instance on which the online algorithm suffers $\mathrm{Regret}_T = \Omega(T)$.

Lemma 3.1 is proved in Appendix B.1. Lemma 3.1 shows that even when all model information on time steps $1, \ldots, t$ is revealed when $A_t$ is to be chosen, the DM still suffers $\mathrm{Regret}_T = \Omega(T)$. Thus, NS-BwK-OA is fundamentally different from non-stationary bandits without resource constraints, such as (Besbes et al., 2015), and from online optimization with predictions, such as (Rakhlin & Sridharan, 2013a). In those settings, we can achieve $\mathrm{Regret}_T = 0$ if all model information on time steps $1, \ldots, t$ is available at the point of choosing $A_t$ or the action at time t. Indeed, given all model information at time t, the DM achieves the optimum by choosing an arm or an action that maximizes the reward function of time t, for every $t\in[T]$.

In view of Lemma 3.1, we seek to understand whether the DM can avoid $\mathrm{Regret}_T = \Omega(T)$ when s/he is endowed with an accurate prediction on Q. Certainly, if the DM only receives an uninformative prediction, such as the worst-case prediction $\bar q T$, at each time step, then $\mathrm{Regret}_T = \Omega(T)$ still cannot be avoided. In contrast, if the DM receives an accurate prediction at a time step, we demonstrate our first step towards a better regret bound, in the form of a more benign regret lower bound compared to Lemma 3.1. We formalize the notion of accuracy through the following two concepts. For $T_0 \in [T-1]$ and $\epsilon_{T_0+1} \ge 0$, an instance $\{q_t\}_{t=1}^T$ is said to be $(T_0+1, \epsilon_{T_0+1})$-well estimated by oracle F if the prediction $\hat Q_{T_0+1} = F_{T_0+1}(q_1, \ldots, q_{T_0})$ returned by the oracle at time $T_0+1$ satisfies $|Q - \hat Q_{T_0+1}| \in [\epsilon_{T_0+1}, 2\epsilon_{T_0+1}]$. This notion measures the power of the prediction oracle F. We say that $\epsilon_{T_0+1}$ is $(T_0+1, \{q_t\}_{t=1}^{T_0})$-well response by oracle F if $\epsilon_{T_0+1}$ satisfies
$$\epsilon_{T_0+1} \le \min\Big\{ \hat Q_{T_0+1} - \sum_{s=1}^{T_0} q_s - \underline q(T-T_0),\;\; \bar q(T-T_0) - \Big(\hat Q_{T_0+1} - \sum_{t=1}^{T_0} q_t\Big),\;\; \hat Q_{T_0+1}/2 \Big\},$$
where $\hat Q_{T_0+1} = F_{T_0+1}(q_1, \ldots, q_{T_0})$. This concept imposes requirements on the power of the prediction, by introducing a non-trivial upper bound on $\epsilon_{T_0+1}$ for the well-estimated notion. It helps us eliminate trivial and uninformative predictions such as $\hat Q_t = 0$ or $\bar q T$.
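As a toy illustration of these two definitions (our own numerical example, not from the paper): take $T = 100$, $T_0 = 50$, $\underline q = 1$, $\bar q = 2$, $q_t = 1$ for $t \le T_0$, and suppose the oracle returns $\hat Q_{51} = 120$. Then the well-response requirement reads
$$\epsilon_{51} \;\le\; \min\Big\{\underbrace{120 - 50 - 1\cdot 50}_{=20},\;\; \underbrace{2\cdot 50 - (120 - 50)}_{=30},\;\; \underbrace{120/2}_{=60}\Big\} \;=\; 20,$$
and with, say, $\epsilon_{51} = 20$, the instance is $(51, 20)$-well estimated whenever the realized total demand satisfies $|Q - 120| \in [20, 40]$. Degenerate guesses such as $\hat Q_t = 0$ or $\bar q T = 200$ admit no such $\epsilon_{51} > 0$.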
Theorem 3.2. Consider the NS-BwK-OA setting, and consider a fixed but arbitrary online algorithm and prediction oracle $F = \{F_t\}_{t=1}^T$. For any $T_0 \in [T-1]$ and any $\epsilon_{T_0+1} > 0$ that is $(T_0+1, \{q_t\}_{t=1}^{T_0})$-well response, there exists a $(T_0+1, \epsilon_{T_0+1})$-well estimated instance $I = \{q_t\}_{t=1}^{T_0} \cup \{q_t\}_{t=T_0+1}^{T}$ such that
$$\mathrm{Regret}_T = \Omega\Big(\max\Big\{\frac{1}{Q}\sum_{t=1}^{T_0} q_t\,\epsilon_{T_0+1},\; \Lambda\Big\}\Big), \qquad (3)$$
where $Q = \sum_{t=1}^T q_t$ and $\Lambda$ denotes the lower bound term inherited from the stochastic BwK problem.

Theorem 3.2 is proved in Appendix B.2. In (3), the regret lower bound $\Lambda$ is due to the uncertainty on $\{P_a\}_{a\in A}$, and $\Lambda$ is derived directly from (Badanidiyuru et al., 2013). The regret lower bound $\frac{1}{Q}\sum_{t=1}^{T_0} q_t\epsilon_{T_0+1}$ is due to the oracle's estimation error on $\hat Q_{T_0+1}$. Theorem 3.2 demonstrates a more benign regret lower bound than $\Omega(T)$, under the condition that the prediction on Q is sufficiently accurate (as formalized by the notion of being $(T_0+1, \epsilon_{T_0+1})$-well estimated). More specifically, let us consider the following accurate prediction condition at time $T_0$ by oracle F:
$$\epsilon_{T_0+1} \text{ is } (T_0+1, \{q_t\}_{t=1}^{T_0})\text{-well response by oracle } F, \text{ and } \epsilon_{T_0+1}/Q = O(T_0^{-\alpha}) \text{ for some } \alpha > 0. \qquad (4)$$
The condition implies that, for the prediction $\hat Q_{T_0+1}$ made using the $T_0$ data points $q_1, \ldots, q_{T_0}$, it holds that $|1 - (\hat Q_{T_0+1}/Q)| = O(T_0^{-\alpha})$. For example, when $\{q_t\}_{t=1}^T$ are i.i.d. generated, the accurate prediction condition holds with $\alpha = 1/2$.

Corollary 3.3. Consider the setting of Theorem 3.2. Suppose the accurate prediction condition (4) holds at $T_0$; then the refined regret lower bound $\mathrm{Regret}_T = \Omega(\max\{\underline q\, T_0^{1-\alpha}, \Lambda\})$ holds.

Altogether, under the accurate prediction condition, the corollary presents a strictly smaller regret lower bound than that in Lemma 3.1, where no prediction oracle is available. In complement, in the next section we design and analyze an online algorithm that reaps the benefits of predictions, and in particular nearly matches the regret lower bound in Corollary 3.3 under the accurate prediction condition. Thus, an $o(T)$ regret is possible in a non-stationary environment given accurate predictions as prescribed above, even though the amount of non-stationarity in the underlying model is not bounded in general.

4. Algorithm and Analysis

We propose the Online-Advice-UCB (OA-UCB) algorithm, displayed in Algorithm 1, for solving NS-BwK-OA. The algorithm design involves constructing confidence bounds to address the model uncertainty on r, c, as discussed in Section 4.1. In Section 4.2, we elaborate on OA-UCB, which uses Online Convex Optimization (OCO) tools to balance the trade-off among rewards and resources. Crucially, at each time t, we incorporate the prediction $\hat Q_t$ to scale the opportunity costs of the resources. In addition, both $q_t$ and $\hat Q_t$ are judiciously integrated into the OCO tools so as to factor the demand volumes into the above-mentioned trade-off. In Section 4.3, we provide a regret upper bound for OA-UCB, and demonstrate its near-optimality when the accurate prediction condition (4) holds and the capacity is large. In Section 4.4 we provide a sketch proof of the regret upper bound; the complete proof is in Appendix C.

4.1. Confidence Bounds

We consider a Bernstein-type confidence radius function
$$\mathrm{rad}(v, N, \delta) = \Theta\Big(\sqrt{\tfrac{v\,\log(1/\delta)}{N}} + \tfrac{\log(1/\delta)}{N}\Big), \qquad (5)$$
whose exact coefficients are extracted in Appendix D.1. The function (5) satisfies the following property:

Lemma 4.1 ((Babaioff et al., 2015; Agrawal & Devanur, 2014)). Let random variables $\{V_i\}_{i=1}^N$ be independently distributed with support in $[0,1]$. Denote $\hat V = \frac{1}{N}\sum_{i=1}^N V_i$. Then with probability at least $1 - 3\delta$, we have $|\hat V - \mathbb E[\hat V]| \le \mathrm{rad}(\hat V, N, \delta) \le 3\,\mathrm{rad}(\mathbb E[\hat V], N, \delta)$.
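For concreteness, below is a minimal sketch of how confidence bounds of this type can be computed; the constant `C = 4` in the radius is our own placeholder (the paper's exact coefficients are derived in Appendix D.1), and the clipping to $[0,1]$ follows the UCB/LCB constructions (6) and (7) defined next.

```python
import numpy as np

def rad(v, N, delta, C=4.0):
    # Bernstein-type radius ~ sqrt(v * log(1/delta) / N) + log(1/delta) / N.
    # The constant C is a placeholder, not the paper's exact coefficient.
    return np.sqrt(C * v * np.log(1.0 / delta) / N) + C * np.log(1.0 / delta) / N

def ucb_reward(r_hat, N_plus, delta):
    # Upper confidence bound on the mean reward, clipped to [0, 1] as in (6).
    return min(r_hat + rad(r_hat, N_plus, delta), 1.0)

def lcb_consumption(c_hat, N_plus, delta):
    # Lower confidence bound on the mean consumption, clipped to [0, 1] as in (7).
    return max(c_hat - rad(c_hat, N_plus, delta), 0.0)
```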
We prove Lemma 4.1 in Appendix D.1 by following the line of argument in (Babaioff et al., 2015), for the purpose of extracting the values of the coefficients in (5), which are implicit in (Babaioff et al., 2015; Agrawal & Devanur, 2014). Based on the observations $\{R_s, \{C_{s,i}\}_{i\in[d]}\}_{s\in[t-1]}$, we compute the sample means
$$\hat R_t(a) = \frac{1}{N^+_{t-1}(a)}\sum_{s=1}^{t-1} R_s \mathbf 1\{A_s = a\}, \;\; a\in A, \qquad \hat C_t(a,i) = \frac{1}{N^+_{t-1}(a)}\sum_{s=1}^{t-1} C_{s,i}\, \mathbf 1\{A_s = a\}, \;\; a\in A,\ i\in[d],$$
where $N^+_{t-1}(a) = \max\{\sum_{s=1}^{t-1}\mathbf 1\{A_s=a\},\, 1\}$. In line with the principle of Optimism in the Face of Uncertainty, we construct upper confidence bounds (UCBs) for the rewards and lower confidence bounds (LCBs) for the resource consumption amounts. For each $a\in A$, we set
$$\mathrm{UCB}_{r,t}(a) = \min\big\{\hat R_t(a) + \mathrm{rad}(\hat R_t(a), N^+_{t-1}(a), \delta),\; 1\big\}. \qquad (6)$$
For each $a\in A$, $i\in[d]$, we set
$$\mathrm{LCB}_{c,t}(a,i) = \max\big\{\hat C_t(a,i) - \mathrm{rad}(\hat C_t(a,i), N^+_{t-1}(a), \delta),\; 0\big\}. \qquad (7)$$
The design of the UCBs and LCBs is justified by Lemma 4.1 and the model assumption that $r(a), c(a,i) \in [0,1]$ for all $a\in A$, $i\in[d]$:

Lemma 4.2. With probability at least $1 - 3KTd\delta$, we have $\mathrm{UCB}_{r,t}(a) \ge r(a)$ and $\mathrm{LCB}_{c,t}(a,i) \le c(a,i)$ for all $a\in A$, $i\in[d]$.

Lemma 4.2 is proved in Appendix D.2.

4.2. Details on OA-UCB

OA-UCB is presented in Algorithm 1. At each time step t, the algorithm first computes a composite reward term
$$\mathrm{UCB}_{r,t}(a) - \frac{\hat Q_t}{B}\,\mu_t^\top \mathrm{LCB}_{c,t}(a), \qquad (8)$$
where $\mathrm{UCB}_{r,t}(a)$, $\hat Q_t$ and $\mathrm{LCB}_{c,t}(a)$ are surrogates for the latent $r(a)$, $Q$, $c(a)$ respectively. The term $\frac{\hat Q_t}{B}\mu_t^\top \mathrm{LCB}_{c,t}(a)$ can be interpreted as the opportunity cost of the resources. The scalarization
$$\mu_t \in S = \{\mu : \|\mu\|_1 \le 1,\ \mu \ge 0_d\} \qquad (9)$$
weighs the relative importance of the resources. The factor $\hat Q_t/B$ reflects that the opportunity cost increases with $\hat Q_t$: with a higher total demand volume, the DM is more likely to exhaust some of the resources during the horizon, and similar reasoning holds for B. Altogether, (8) balances the trade-off between the reward of an arm and the opportunity cost of that arm's resource consumption. We select an arm that maximizes (8) at time t. After receiving the feedback, we update the scalarization $\mu_t$ via Online Gradient Descent (OGD) (Agrawal & Devanur, 2014; Hazan et al., 2016) on the sequence of functions $\{f_t\}_{t=1}^T$, where
$$f_t(x) = q_t\,\frac{\hat Q_t}{B}\Big(\frac{B}{\hat Q_t}\mathbf 1_d - \mathrm{LCB}_{c,t}(A_t)\Big)^{\!\top} x. \qquad (10)$$
While $f_t$ incorporates the prediction $\hat Q_t$ to account for the estimated opportunity cost, similarly to (8), $f_t$ also incorporates the actual demand $q_t$ to account for the actual amounts of resources consumed. In the OGD update in (11), for a resource type i, the coefficient $\mu_{t+1}(i)$ increases with $q_t\,\mathrm{LCB}_{c,t}(A_t, i)$, meaning that a higher amount of resource-i consumption at time t leads to a higher weight on resource i's opportunity cost at time t+1.

Algorithm 1 Online-Advice-UCB (OA-UCB)
1: Initialize $\mu_1 \in S$.
2: for t = 1, 2, ..., T do
3:   Receive $\hat Q_t = F_t(q_1, \ldots, q_{t-1})$.
4:   Compute $\mathrm{UCB}_{r,t}(a)$, $\mathrm{LCB}_{c,t}(a)$ for all $a\in A$ by (6), (7), where $\mathrm{LCB}_{c,t}(a) = (\mathrm{LCB}_{c,t}(a,i))_{i\in[d]}$.
5:   Select $A_t \in \arg\max_{a\in A}\big\{\mathrm{UCB}_{r,t}(a) - \frac{\hat Q_t}{B}\mu_t^\top \mathrm{LCB}_{c,t}(a)\big\}$.
6:   if $\exists\, j\in[d]$ such that $\sum_{s=1}^t q_s C_{s,j} > B$ then
7:     Break, and pull the null arm $a_0$ for the remainder of the horizon.
8:   end if
9:   Observe $q_t$, receive reward $q_t R_t$, and consume $q_t C_{t,i}$ units of each resource $i\in[d]$.
10:  Update $\mu_{t+1}$ with OGD:
$$\mu_{t+1} = \Pi_S\Big(\mu_t - \eta_t\, q_t\,\frac{\hat Q_t}{B}\Big(\frac{B}{\hat Q_t}\mathbf 1_d - \mathrm{LCB}_{c,t}(A_t)\Big)\Big), \qquad (11)$$
     where $\Pi_S$ is the projection onto the set S defined in (9).
11: end for
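To illustrate how the pieces of Algorithm 1 fit together, here is a minimal simulation-style sketch (our own illustration, not the authors' implementation). The callables `oracle`, `environment`, `ucb`, `lcb` are hypothetical stand-ins for the prediction oracle of Section 2, the feedback model, and the confidence bounds of Section 4.1; the projection and the step size `eta / sqrt(t)` are simple placeholder choices rather than the paper's exact ones.

```python
import numpy as np

def project_onto_S(mu):
    # Simple surrogate projection onto S = {mu >= 0, ||mu||_1 <= 1}:
    # clip negatives, then rescale if the l1 norm exceeds 1 (illustration only).
    mu = np.maximum(mu, 0.0)
    s = mu.sum()
    return mu / s if s > 1.0 else mu

def oa_ucb(T, B, d, arms, oracle, environment, ucb, lcb, eta=0.1):
    """Sketch of Algorithm 1. `oracle(t, history)` returns Q_hat_t;
    `environment(t, a)` returns (q_t, R_t, C_t) with C_t a length-d array;
    `ucb(a, t)` / `lcb(a, t)` return the confidence bounds of Section 4.1."""
    mu = np.zeros(d)                      # scalarization mu_1, an element of S
    consumed = np.zeros(d)
    history, total_reward = [], 0.0
    for t in range(1, T + 1):
        Q_hat = oracle(t, history)
        # Select the arm maximizing the composite reward (8).
        scores = [ucb(a, t) - (Q_hat / B) * mu @ lcb(a, t) for a in arms]
        a_t = arms[int(np.argmax(scores))]
        q_t, R_t, C_t = environment(t, a_t)
        consumed += q_t * C_t
        if np.any(consumed > B):          # some resource depleted: stop (null arm onward)
            break
        total_reward += q_t * R_t
        history.append(q_t)
        # OGD update (11), followed by projection onto S.
        grad = q_t * (Q_hat / B) * ((B / Q_hat) * np.ones(d) - lcb(a_t, t))
        mu = project_onto_S(mu - (eta / np.sqrt(t)) * grad)
    return total_reward
```

In this sketch, a larger predicted total demand `Q_hat` inflates the opportunity-cost term in the arm scores, which is exactly how the advice steers OA-UCB towards more resource-frugal arms.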
With probability 1 3KTdδ, t=1 qtϵt + M Theorem 4.3 is proved in the Appendix, and we provide a sketch proof in Section 4.4. The Theorem holds in the special case when we set ϵt = | ˆQt Q|, and ϵt represents an upper bound on the estimation error of ˆQt on Q, for example by certain theoretical guarantees. The term (12) represents the regret due to the learning on r(a), c(a, i). The first term in (13) represents the regret due to the prediction error of the prediction oracle, and the second term in (13) represents the regret due to the application of OGD. Comparison between regret lower and upper bounds. The regret term (12) matches the lower bound term Λ in Theorem 3.2 within a logarithmic factor. Next, we compare the regret upper bound term ( 1 B ) Pτ 1 t=1 qtϵt and the lower bound term 1 Q PT0 t=1 qtϵT0+1 in Theorem 3.2. We first assure that the lower and upper bound results are consistent, in the sense that our regret upper bound is indeed in Ω( 1 Q PT0 t=1 qtϵT0+1) on the lower bounding instances con- Bandits with Knapsacks: Advice on Time-Varying Demands structed for the proof of Theorem 3.2. In those instances, T0 is set in a way that the resource is not fully exhausted at time T0 under any policy, thus the stopping time τ of OA-UCB satisfies τ > T0 with certainty. More details are provided in Appendix B.3. Next, we highlight that the regret upper and lower bounds are nearly matching (modulo multiplicative factors of log(1/δ) and q/q, as well as the additive O(M T) term), under the high capacity condition B = Θ(Q) and the accurate prediction condition (4) for all T0 1. The first condition is similar to the large capacity assumption in the literature (Besbes & Zeevi, 2009; 2012; Liu et al., 2022), while the second condition is a natural condition that signfies a non-trivial estimation by the prediciton oracle, as discussed in Section 3. On one hand, By setting T0 = Θ(T) for the highest possible lower bound in Corollary 3.3, we yield the regret lower bound Ω(max{q T 1 α 0 , Λ}) = Ω(max{q T 1 α, Λ}). On the other hand, the second term in (13) is upper bounded as t=1 qtϵt = O t=1 qt ϵt Q t=1 qtt α ! = O(q T 1 α). Altogether, our claim on the nearly matching bounds is established. 4.4. Proof Sketch of Theorem 4.3 We provide an overview on the proof of Theorem 4.3, which is fully proved in Appendix C. We first provide bounds on the regret induced by the estimation errors of the UCBs and LCBs. Now, with probability 1 3KTdδ, the inequalities t=1 qt UCBr,t(At) t=1 qt Rt + q K log T t=1 qt LCBc,t(At, i) t=1 qt Ct,i q KB + q K log T hold. Inequalities (14, 15) are proved in Appendix D.3. Next, by the optimality of At in Line 1 in Algorithm 1, UCBr,t(At) ˆQt B µ t LCBc,t(At) UCB r,tu ˆQt B µ t LCBc,tu , which is equivalent to UCB r,tu UCBr,t(At) + ˆQt B µ t ˆQt 1d LCB c,tu ˆQt 1d LCBc,t(At) . Multiply qt on both side, and sum over t from 1 to τ 1. By applying the OGD performance guarantee in (Hazan et al., 2016) with {ft}T t=1, S defined in (10, 9) respectively, we argue that, for all µ S, t=1 qt UCB r,tu t=1 qt UCBr,t(At) t=1 qt ˆQt B µ t ˆQt 1d LCB c,tu t=1 qt ˆQt B B ˆQt 1d LCBc,t(At) µ + O M If τ T, then there exists j0 [d] such that Pτ t=1 qt Ct,j0 > B. Take µ = OPTLP Q ej0 S (This is because OPTLP = Qr u Q). Analysis yields t=1 qt UCBr,t(At) O t=1 qtϵt + M If τ > T, it is the case that τ 1 = T, and no resource is exhausted at the end of the horizon. Take µ = 0. 
A similar analysis to the previous case shows that $\mathrm{OPT}_{LP} - \sum_{t=1}^{\tau-1} q_t\,\mathrm{UCB}_{r,t}(A_t)$ obeys an analogous bound (17). Combining (14), (16), (17) with the fact that $\mathrm{Regret}_T \le \mathrm{OPT}_{LP} - \mathbb E\big[\sum_{t=1}^{\tau-1} q_t R_t\big]$ (by Lemma 2.1), the theorem holds.

5. Numerical Experiments

We present numerical results, and compare our algorithm with several existing algorithms for BwK.

Demand sequence $\{q_t\}_{t=1}^T$: We apply an AR(1) model to generate $\{q_t\}$: $q_t = \alpha + \beta q_{t-1} + \varepsilon_t$, where $\varepsilon_1, \ldots, \varepsilon_T \sim N(0, \sigma^2)$ are independent.

Estimations $\{\hat Q_t\}_{t=1}^T$: At round t, given the history of observations $\{q_s\}_{s=1}^{t-1}$, there are many time series prediction tools in Python, MATLAB or R that perform predictions to yield $\{\hat q_s\}_{s=t}^T$, where $\hat q_s$ is an estimate of $q_s$. We define the estimation $\hat Q_t$ as $\sum_{s=1}^{t-1} q_s + \sum_{s=t}^T \hat q_s$. To achieve time efficiency, we consider a power-of-two policy for updating the estimate $\hat Q_t$ of Q, as shown in Algorithm 2. That is, we only recompute $\hat Q_t$ when $t = 2^k$ for some $k\in\mathbb N_+$. The estimation error of Algorithm 2, in terms of the additive gap, is plotted in Figure 1.

Algorithm 2 Estimation Generation Policy
Input: Time step t, history of observations $\{q_s\}_{s=1}^{t-1}$, previous estimation $\hat Q_{t-1}$.
1: if $t = 2^k$ for some $k\in\mathbb N_+$ then
2:   Compute predictions $\{\hat q_s\}_{s=t}^T$, and update $\hat Q_t = \sum_{s=1}^{t-1} q_s + \sum_{s=t}^T \hat q_s$.
3: else
4:   Set $\hat Q_t = \hat Q_{t-1}$.
5: end if
6: return $\hat Q_t$.

Figure 1. Estimation error

Benchmarks: We compare OA-UCB with three existing online algorithms for BwK. The first algorithm is the PrimalDualBwK algorithm in (Badanidiyuru et al., 2013), which we call PDB in the following. The second algorithm is the UCB algorithm presented in (Agrawal & Devanur, 2014), which we call AD-UCB. The third algorithm is the Sliding-Window UCB in (Liu et al., 2022), which we call SW-UCB. In implementing SW-UCB, we set the sliding window size according to the suggestion in (Liu et al., 2022), and we input the required non-stationarity measures by computing them from the ground truth $\{q_t\}_{t=1}^{15000}$.

In the experiment, we simulate our algorithm and the benchmarks on a family of instances, with $K = 10$, $d = 3$, $b = 3$, $\alpha = 2$, $\beta = 0.5$, $\sigma = 0.5$, and T varying from 5000 to 15000. Each arm's per-unit-demand outcome $(R(a), \{C_i(a)\}_{i=1}^d)$ follows the standard Gaussian distribution truncated to $[0,1]^{d+1}$, with mean denoted as $(r(a), c(a))$. We perform two groups of experiments. In each group, we first generate a sample of $(r, c, \{q_t\}_{t=1}^{15000})$. Then, for each fixed T, we simulate each algorithm ten times with the demand volume sequence $\{q_t\}_{t=1}^T$, and compute the regret based on the sample average. Figures 2 and 3 plot the regret of each algorithm for different horizon lengths, in each of the two groups. The superior numerical performance of OA-UCB does not mean that our algorithm is strictly superior to the baselines. Indeed, our algorithm OA-UCB receives online advice via Algorithm 2, while the benchmarks do not. The numerical results instead indicate the benefit of predicting the underlying non-stationary demand sequence, and showcase how a suitably designed algorithm such as OA-UCB can reap the benefit of predictions.

Figure 2. Regret on Experiment Group 1

Figure 3. Regret on Experiment Group 2

Figure 4 depicts the trend of the accumulated rewards as time progresses, with horizon T = 10000. The black dotted lines indicate the stopping times of the respective algorithms. Compared with our algorithm OA-UCB, PDB and SW-UCB could appear conservative, meaning that they focus too much on not exhausting resources.
In contrast, AD-UCB seems a little aggressive, meaning that it prefers to choose the arm with high reward and resource consumption. Figure 4. Cumulative Reward growing. 6. Conclusion We study a non-stationary bandit with knapsack problem, in the presense of a prediction oracle on the latent total demand volume Q. Our oracle is novel compared to existing models in online optimization with machine-learned advice (such as (Lykouris & Vassilvitskii, 2021)), in that ours returns a (possibly refined) prediction on Q every time step. There are many interesting future directions, such as investigating the models (Bamas et al., 2020; Lykouris & Vassilvitskii, 2021; Purohit et al., 2018; Mitzenmacher, 2019) in the presense of sequential prediction oracles similar to ours. It is also interesting to invenstigate other forms of predictions, such as predictions with distributional information (Bertsimas et al., 2019; Diakonikolas et al., 2021). Customizing prediction oracles for NS-Bw K-OA is also an interesting direction (Anand et al., 2020). Acknowledgements The authors would like to thank the reviewing team for the constructive suggestions that improve the positioning of the paper and the clarity in the exposition. This research work is funded by Institute of Operations Research and Analytics, National University of Singapore, a Singapore Ministry of Education Ac RF Tier 2 Grant (Grant number: T2EP201210035) and a Singapore Ministry of Education Ac RF Tier 3 Grant (Grant number: MOE-2019-T3-1-010). Agrawal, S. and Devanur, N. R. Bandits with concave rewards and convex knapsacks. In Proceedings of the fifteenth ACM conference on Economics and computation, pp. 989 1006, 2014. Agrawal, S., Devanur, N. R., and Li, L. An efficient algorithm for contextual bandits with knapsacks, and an extension to concave objectives. In Conference on Learning Theory, pp. 4 18. PMLR, 2016. Almanza, M., Chierichetti, F., Lattanzi, S., Panconesi, A., and Re, G. Online facility location with multiple advice. Advances in Neural Information Processing Systems, 34: 4661 4673, 2021. Anand, K., Ge, R., and Panigrahi, D. Customizing ML predictions for online algorithms. In Proceedings of the 37th International Conference on Machine Learning, volume 119 of Proceedings of Machine Learning Research, pp. 303 313. PMLR, 13 18 Jul 2020. Antoniadis, A., Gouleakis, T., Kleer, P., and Kolev, P. Secretary and online matching problems with machine learned advice. Advances in Neural Information Processing Systems, 33:7933 7944, 2020. Aouad, A. and Ma, W. A nonparametric framework for online stochastic matching with correlated arrivals. ar Xiv preprint ar Xiv:2208.02229, 2022. Audibert, J.-Y., Munos, R., and Szepesv ari, C. Exploration exploitation tradeoff using variance estimates in multiarmed bandits. Theoretical Computer Science, 410(19): 1876 1902, 2009. Babaioff, M., Dughmi, S., Kleinberg, R., and Slivkins, A. Dynamic pricing with limited supply, 2015. Badanidiyuru, A., Kleinberg, R., and Slivkins, A. Bandits with knapsacks. In 2013 IEEE 54th Annual Symposium on Foundations of Computer Science, pp. 207 216. IEEE, 2013. Badanidiyuru, A., Langford, J., and Slivkins, A. Resourceful contextual bandits. In Conference on Learning Theory, pp. 1109 1134. PMLR, 2014. Bai, Y., El Housni, O., Jin, B., Rusmevichientong, P., Topaloglu, H., and Williamson, D. Fluid approximations for revenue management under high-variance demand. SSRN, 2022. URL https://ssrn.com/ abstract=4136445. Balseiro, S., Kroer, C., and Kumar, R. 
Online resource allocation under horizon uncertainty. ar Xiv preprint ar Xiv:2206.13606, 2022. Bandits with Knapsacks: Advice on Time-Varying Demands Bamas, E., Maggiori, A., and Svensson, O. The primal-dual method for learning augmented algorithms. Advances in Neural Information Processing Systems, 33:20083 20094, 2020. Bertsimas, D., Sim, M., and Zhang, M. Adaptive distributionally robust optimization. Management Science, 65(2): 604 618, 2019. Besbes, O. and Zeevi, A. Dynamic pricing without knowing the demand function: Risk bounds and near-optimal algorithms. Operations Research, 57(6):1407 1420, 2009. Besbes, O. and Zeevi, A. Blind network revenue management. Operations research, 60(6):1537 1550, 2012. Besbes, O., Gur, Y., and Zeevi, A. Stochastic multi-armedbandit problem with non-stationary rewards. Advances in neural information processing systems, 27, 2014. Besbes, O., Gur, Y., and Zeevi, A. Non-stationary stochastic optimization. Operations research, 63(5):1227 1244, 2015. Cheung, W. C., Simchi-Levi, D., and Zhu, R. Learning to optimize under non-stationarity. In The 22nd International Conference on Artificial Intelligence and Statistics, pp. 1079 1087. PMLR, 2019. Chung, F. and Lu, L. Concentration inequalities and martingale inequalities: a survey. Internet mathematics, 3(1): 79 127, 2006. Cormen, T. H., Leiserson, C. E., Rivest, R. L., and Stein, C. Introduction to Algorithms, 3rd Edition. MIT Press, 2009. ISBN 978-0-262-03384-8. Diakonikolas, I., Kontonis, V., Tzamos, C., Vakilian, A., and Zarifis, N. Learning online algorithms with distributional advice. In Proceedings of the 38th International Conference on Machine Learning, volume 139 of Proceedings of Machine Learning Research, pp. 2687 2696. PMLR, 18 24 Jul 2021. Dubhashi, D. P. and Panconesi, A. Concentration of measure for the analysis of randomized algorithms. Cambridge University Press, 2009. Freedman, D. A. On tail probabilities for martingales. the Annals of Probability, pp. 100 118, 1975. Hazan, E. et al. Introduction to online convex optimization. Foundations and Trends in Optimization, 2(3-4):157 325, 2016. Hyndman, R. and Athanasopoulos, G. Forecasting: principles and practice. OTexts, 2021. URL https: //otexts.com/fpp3/. Immorlica, N., Sankararaman, K. A., Schapire, R., and Slivkins, A. Adversarial bandits with knapsacks. In 2019 IEEE 60th Annual Symposium on Foundations of Computer Science (FOCS), pp. 202 219. IEEE, 2019. Jadbabaie, A., Rakhlin, A., Shahrampour, S., and Sridharan, K. Online optimization: Competing with dynamic comparators. In Artificial Intelligence and Statistics, pp. 398 406. PMLR, 2015. Lattanzi, S., Lavastida, T., Moseley, B., and Vassilvitskii, S. Online scheduling via learned weights. In Proceedings of the Fourteenth Annual ACM-SIAM Symposium on Discrete Algorithms, pp. 1859 1877. SIAM, 2020. Lim, B. and Zohren, S. Time-series forecasting with deep learning: a survey. Philosophical Transactions of the Royal Society A: Mathematical, Physical and Engineering Sciences, 379(2194), 2021. Liu, S., Jiang, J., and Li, X. Non-stationary bandits with knapsacks. ar Xiv preprint ar Xiv:2205.12427, 2022. Lykouris, T. and Vassilvitskii, S. Competitive caching with machine learned advice. Journal of the ACM (JACM), 68 (4):1 25, 2021. Lykouris, T., Mirrokni, V., and Leme, R. P. Bandits with adversarial scaling. In III, H. D. and Singh, A. (eds.), Proceedings of the 37th International Conference on Machine Learning, volume 119 of Proceedings of Machine Learning Research, pp. 6511 6521, 2020. 
Mehta, A., Saberi, A., Vazirani, U., and Vazirani, V. Adwords and generalized online matching. Journal of the ACM (JACM), 54(5):22 es, 2007. Mitzenmacher, M. Scheduling with predictions and the price of misprediction. ar Xiv preprint ar Xiv:1902.00732, 2019. Mitzenmacher, M. and Vassilvitskii, S. Algorithms with predictions. Commun. ACM, 65(7):33 35, jun 2022. Piotr, I., Yaron, S., Ali, V., and Sergei, V. Summer workshop on learning-based algorithms. 2019. URL http://www.mit.edu/ vakilian/ ttic-workshop.html. Purohit, M., Svitkina, Z., and Kumar, R. Improving online algorithms via ml predictions. In Advances in Neural Information Processing Systems, volume 31, 2018. Rakhlin, A. and Sridharan, K. Online learning with predictable sequences. In Conference on Learning Theory, pp. 993 1019. PMLR, 2013a. Bandits with Knapsacks: Advice on Time-Varying Demands Rakhlin, S. and Sridharan, K. Optimization, learning, and games with predictable sequences. Advances in Neural Information Processing Systems, 26, 2013b. Rangi, A., Franceschetti, M., and Tran-Thanh, L. Unifying the stochastic and the adversarial bandits with knapsack. ar Xiv preprint ar Xiv:1811.12253, 2018. Sankararaman, K. A. and Slivkins, A. Combinatorial semibandits with knapsacks. In International Conference on Artificial Intelligence and Statistics, pp. 1760 1770. PMLR, 2018. Shumway, R. and Stoffer, D. Time Series Analysis and Its Applications: With R Examples. Springer texts in statistics. Springer, 2017. URL https://github.com/nickpoison/tsa4/ blob/master/text Rcode.md. Steinhardt, J. and Liang, P. Adaptivity and optimism: An improved exponentiated gradient algorithm. In International Conference on Machine Learning, pp. 1593 1601. PMLR, 2014. Trac a, S., Rudin, C., and Yan, W. Regulating greed over time in multi-armed bandits. J. Mach. Learn. Res., 22: 3 1, 2021. Wei, C.-Y., Luo, H., and Agarwal, A. Taking a hint: How to leverage loss predictors in contextual bandits? In Conference on Learning Theory, pp. 3583 3634. PMLR, 2020. Zhu, F. and Zheng, Z. When demands evolve larger and noisier: Learning and earning in a growing environment. In International Conference on Machine Learning, pp. 11629 11638. PMLR, 2020. Bandits with Knapsacks: Advice on Time-Varying Demands A. Proof for Section 2 A.1. Proof for Lemma 2.1 Let s first consider OPT LP = max xt |A|, t [T ] t=1 qtr xt s.t. t=1 qtc xt B1d, (18) where |A| = {w : P a A wa = 1} is the probability simplex across all arms. It is evident that OPT LP OPT, since for a fixed policy π that achieves OPT, the solution x = { xt,a}t [T ],a A defined as xt,a = E[1(action a is chosen at t under π)] is feasible to OPT , and the objective value of x in OPT LP is equal to the expected revenue earned in the online process. Next, we claim that OPTLP = OPT LP. Indeed, for each feasible solution (xt)t [T ] to OPT LP, the solution u = PT t=1 qtxt PT t=1 qt , is feasible to LPOPT and has the same objective value as (xt)t [T ]. Altogether, the Lemma is proved. B. Proofs for Section 3, and Consistency Remarks In this section, we provide proofs to the lower bound results. In both proofs, we consider an arbitrary but fixed deterministic online algorithm, that is, conditioned on the realization of the history in 1, . . . , t 1 and qt, ˆQt, the chosen arm At is deterministic. This is without loss of generality, since the case of random online algorithm can be similarly handled by replace the chosen arm At with a probability distribution over the arms, but we focus on deterministic case to ease the exposition. 
Lastly, in Section B.3 we demonstrate that our regret upper and lower bounds are consistent on the lower bounding instances we constructed in Section B.2. B.1. Proof for Lemma 3.1 Our lower bound example involve two instances I(1), I(2) with determinstic rewards and deterministic consumption amounts. Both instances involve two non-dummy arms 1, 2 in addition to the null arm a0, and there is d = 1 resource type. Instances I(1), I(2) differ in their respective seqeunces of demand volumes {q(1) t }T t=1, {q(2) t }T t=1, but for other parameters are the same in the two instances. In both I(1), I(2), arm 1 is associated with (deterministc) reward r(1) = 1 and (deterministic) consumption amount c(1, 1) = 1, while arm 2 is associated with (deterministc) reward r(2) = 3/4 and (deterministic) consumption amount c(2, 1) = 1/2. Both instances share the same horizon T, a positive even integer, and the same capacity B = T/2. The sequences of demand volumes {q(1) t }T t=1, {q(2) t }T t=1 of instances I(1), I(2) are respectively defined as ( 1 if t {1, . . . , T/2}, 1/16 if t {T/2 + 1, . . . , T}, q(2) t = 1, for all t {1, . . . , T}. Then the optimal reward for I(1) is at least T 2 (always select the arm 1 until the resource is fully consumed), and the optimal reward for I(2) is 3T 4 (always select arm 2 until the resource is fully consumed). Consider the first T/2 rounds, and consider an arbitrary online algorithm that knows {Pa}a A, {(qs, qs Rs, qs Cs,1, . . . qs Cs,d)}t 1 s=1 and qt when the action At is to be chosen at each time t. Under this setting, the DM recieves the same set of observations in the first T/2 time steps in each of instances I(1), I(2). Consequently, the sequence of arm pulls in the first T/2 time steps are the same. Now, we denote Na = PT/2 t=1 1(At = a) for a {1, 2}. By the previous remark, Na is the number of times arm a is pulled during time steps 1, . . . , T/2 in each of the two instances. Bandits with Knapsacks: Advice on Time-Varying Demands Observe that N1 + N2 T 2 , which implies N1 T 4 . We denote Reward T (I(i)), Regret T (I(i)) as the expected reward and the expected regret of the policy in instance I(i). In what follows, we demonstrate that max i {1,2} Regret T (I(i)) T which proves the Lemma. Case 1: N1 T 4 . We consider the algorithm on I(1), which earns Reward T (I(1)) T 2 1 16 = 15 Regret T (I(1)) T 2 Reward T (I(1)) 1 Case 2: N2 T 4 . We consider the algorithm on I(2), which earns Reward T (I(2)) = T 3 4 1 2 = T Regret T (I(2)) 3 4T Reward T (I(2)) 3 Altogether, the inequality (19) is shown. B.2. Proof for Theorem 3.2 By the Theorem s assumption that ϵT0+1 > 0 is (T0 + 1, {qt}T0 t=1)-well response by F = {Ft}, we know that 0 < ϵT0+1 min t=1 qt q(T T0), q(T T0) ˆQT0+1 t=1 qt, ˆQT0+1 where ˆQT0+1 = FT0+1(q1, . . . , q T0). To proceed, we take a demand volume seqeunce {qt}T0 t=1 [q, q]T0 that satisfies the assumption. In what follows, we first construct two deterministic instances I(1), I(2) which only differ in their respective seqeunces of demand volumes {q(1) t }T t=T0+1, {q(2) t }T t=T0+1, but the two instances are the same on other parameters, and that q(1) t = q(2) t = qt for t {1, . . . , T0}. Both I(1), I(2) only involve one resource constraint. We estbalish the Theorem by showing three claims: 1. Both I(1), I(2) are (T0 + 1, ϵT0+1)-well-estimated by F, and the underlying online algorithm and prediction oracle (which are assumed to be fixed but arbitrary in the Theorem statement) suffer Regret T (I(i)) PT0 t=1 qtϵT0+1 6Q(i) for some i {1, 2}. 
(21) In (21), we define Regret T (I(i)) as the regret of the algorithm on instance I(i), and Q(i) = PT t=1 q(i) t . 2. Among the set of instances {J(i) c }i [K] (see Instances {J(i) c }i [K]), the online algorithm suffers Regret T (J(i) c ) 1 128 min opt(J(i) c ) for some i [K], (22) where opt(I) denote the optimum of instance I, even when the DM has complete knowledge on q1, . . . , q T , and ˆQt is equal to the ground truth Q in each of the instances in {J(i) c }i [K]. Bandits with Knapsacks: Advice on Time-Varying Demands 3. Among the set of instances {J(i) r }i [K] (see Instances {J(i) r }i [K]), the online algorithm suffers Regret T (J(i) r ) 1 q Kopt(J(i) r ) for some i [K], (23) even when the DM has complete knowledge on q1, . . . , q T , and ˆQt is equal to the ground truth Q in each of the instances in {J(i) r }i [K]. Once we establish inequalities (21, 22, 23), the Theorem is shown. We remark that (22, 23) are direct consequences of (Badanidiyuru et al., 2013). We first extract the instances {J(i) c }i [K], {J(i) r }i [K] that are constructed in (Badanidiyuru et al., 2013), then we construct the instances I(1), I(2). After that, we prove (21), which establish the Theorem. Instances {J(i) c }i [K]. These instances are single resource instances, with determinsitic rewards but stochastic consumption. According to (Badanidiyuru et al., 2013), we first set parameters , T = 16B η(1/2 η), and set qt = q for all t [T]. The instances J(1) c , . . . , J(K) c share the same B, T, {qt}T t=1, and the instances share the same (deterministic) reward function: R(a) = r(a) = ( 1 if a [K] \ {a0} 0 if a = a0 . In contrast, instances J(1) c , . . . , J(K) c differ in the resource consumption model. We denote C(i)(a) as the random consumption of arm a in instance J(i) c . The probability distribution of C(i)(a) for each a, i [K] is defined as follow: Bern(1/2) if a [K] \ {a0, i} Bern(1/2 η) if a = i Bern(0) if a = a0 where Bern(p) denotes the Bernoulli distribution with mean d. The regret lower bound (22) is a direct consequence of Lemma 6.10 in (Badanidiyuru et al., 2013), by incorporating the scaling factor q into the rewards earned by the DM and the optimal reward. Instances {J(i) r }i [K]. These instances are single resource instances, with random rewards but deterministic consumption. These instances share the same B, T > K (set arbitrarily), the same demand volume seqeunce, which is qt = q for all t [T], and the same resource consumption model, in that c(a) = 0 for all a A. These instances only differ in the reward distributions. We denote R(i)(a) as the random reward of arm a in instance J(i) r . The probability distribution of R(i)(a) for each a, i [K] is defined as follow: T if a [K] \ {a0, i} Bern(1/2) if a = i Bern(0) if a = a0 The regret lower bound (23) is a direct consequence of Claim 6.2a in (Badanidiyuru et al., 2013), by incorporating the scaling factor q into the rewards earned by the DM and the optimal reward. Construct I(1), I(2). We first describe {q(1) t }T t=1, {q(2) t }T t=1. As previously mentioned, for t {1, . . . , T0}, we have q(1) t = q(2) t = qt. To define q(1) t , q(2) t for t {T0 + 1, . . . , T}, first recall that | ˆQT0+1 Q| ϵT0+1, where ϵT0+1 satisfies (20). By (20), we know that q(T T0) ˆQT0+1 t=1 qt ϵT0+1 < ˆQT0+1 t=1 qt + ϵT0+1 q(T T0) Bandits with Knapsacks: Advice on Time-Varying Demands We set q(1) T0+1 = . . . = q(1) T [q, q] and q(2) T0+1 = . . . 
= q(2) T [q, q] such that based on current instance {qt}T0 t=1 we have ever received, we construct the following two subsequent instances I(1) = {q(1) t }T t=T0+1, I(2) = {q(2) t }T t=T0+1, such that t=1 q(1) t = ˆQT0+1 ϵT0+1, Q(2) = t=1 q(2) t = ˆQT0+1 + ϵT0+1, which is valid by the stated inequalities. Next, we define the parameters {r(a)}a A, {c(a, 1)}a A, B, and recall that d = 1. Similar to the proof for Lemma 3.1, we only consider deterministic instances, so it is sufficient to define the mean rewards and consumption amounts. To facilitate our discussion, we specify A = [K] = {1, 2, . . . , K}, with K 3 and arm K being the null arm. The parameters {r(a)}a A, {c(a, 1)}a, B shared between instances I(1), I(2) are defined as follows: 1 if a = 1, (1 + c)/2 if a = 2, 0 if a {3, . . . , K}, 1 if a = 1, c if a = 2, 0 if a {3, . . . , K}, c = ˆQT0+1 ϵT0+1 ˆQT0+1 + ϵT0+1 . Finally, we set B = ˆQT0+1 ϵT0+1. Inequality (20) ensures that c, B > 0. Proving (21). To evaluate the regrets in the two instances, we start with the optimal reawrds. The optimal reward in I(1) is ˆQT0+1 ϵT0+1, which is achieved by pulling arm 1 until the resource is exhasuted. The optimal reward for I(2) is ˆQT0+1, which is achieved by pulling arm 2 until the resource is exhasuted. Consider the execution of the fixed but arbitrary online algorithm during time steps 1, . . . , T0 in each of the instances. The prediction oracle returns the same prediction ˆQt for t {1, . . . , T0} in both instances, since both instances share the same r, c, B, T and q(1) t = q(2) t for t {1, . . . , T0}. Consequently, the fixed but arbitrary online algorithm has the same sequence of arm pulls A1, . . . , AT0 during time steps 1, . . . , T0 in both instances I(1), I(2). Now, for each arm i {1, 2}, we define Ni = {t {1, . . . , T0} : At = i}, which has the same realization in instances I(1), I(2). Since N1 N2 [T0], at least one of the cases P t N1 qt 1 2 PT0 s=1 qs or P 2 PT0 s=1 qs holds. We denote Reward T (I(i)) as the expected reward of the online algorithm in instance I(i). We proceed with the following case consideration: 2 PT0 s=1 qs. We consider the online algorithm s execution on I(1), which yields Reward T (I(1)) PT0 s=1 qs 2 1 + PT0 s=1 qs ˆQT0+1 ϵT0+1 4c + ˆQT0+1 ϵT0+1. Regret T (I(1)) 4(1 c) = PT0 s=1 qsϵT0+1 2( ˆQT0+1 + ϵT0+1) PT0 s=1 qsϵT0+1 6( ˆQT0+1 ϵT0+1) = PT0 s=1 qsϵT0+1 Bandits with Knapsacks: Advice on Time-Varying Demands where the last inequality is by the well response condition that guarantees that 2ϵT0+1 ˆQT0+1. For the last equality, recall ˆQT0+1 ϵT0+1 = PT t=1 q(1) t . 2 PT0 s=1 qs. We consider the online algorithm s execution on I(2), which yields Reward T (I(2)) ! 1 + 1 + c = PT0 s=1 qsϵT0+1 ˆQT0+1 ϵT0+1 + ϵT0+1 ˆQT0+1 ϵT0+1 + ˆQT0 PT0 s=1 qsϵT0+1 2( ˆQT0+1 ϵT0+1) + ˆQT0+1. Regret T (I(2)) PT0 s=1 qsϵT0+1 2( ˆQT0+1 ϵT0+1) PT0 s=1 qsϵT0+1 2( ˆQT0+1 + ϵT0+1) = PT0 s=1 qsϵT0+1 Altogether, the Theorem is proved. B.3. Consistency Between Regret Upper and Lower Bounds Recall that in the proof of Theorem 3.2, we constructed two instances I(1), I(2) such that (see (21): Regret T (I(i)) PT0 t=1 qtϵT0+1 6Q(i) for some i {1, 2}, (24) where Regret T (I(i)) is the regret of an arbitrary but fixed online algorithm on I(i), with its prediction oracle satisfying that |Q(i) ˆQt| ϵT0+1 for each i {1, 2}. (25) In the lower bound analysis on I(1), I(2), we establish the regret lower bound (24) solely hinging on the model uncertainty on Q(1), Q(2), and the bound (24) still holds when the DM knows {Pa}a A. 
In particular, we can set the online algorihtm to be OA-UCB, with an oracle that satisfies the property (25) above. Now, also recall in our construction that q(1) t = q(2) t = qt for all t [T0], thus the predictions ˆQt for t [T0] are the same in the two instances, whereas Q(1) = ˆQT0+1 ϵT0+1 but Q(2) = ˆQT0+1 + ϵT0+1, while we still have Q(2) 3Q(1), so that Q(1) = Θ(Q(2)). Therefore, (24) is equivalent to max i {1,2}{Regret T (I(i))} Ω PT0 t=1 qtϵT0+1 To demonstrate the consistency, it suffices to show max i {1,2} t=1 qtϵ(i) t PT0 t=1 qtϵT0+1 where ϵ(i) t = | ˆQt Q(i)| is the prediction error on instance I(i) at time t. Indeed, to be consistent, we should have Theorem Bandits with Knapsacks: Advice on Time-Varying Demands 4.3 holds true for both instances, while (26) still holds true. We establish (27) as follows: max i {1,2} t=1 qtϵ(i) t t=1 qt ϵ(1) t + ϵ(2) t 2 t=1 qt | ˆQt ˆQT0+1 + ϵT0+1| + | ˆQt ˆQT0+1 ϵT0+1| t=1 qt 2ϵT0+1 t=1 qtϵT0+1. (29) Step (28) is by the triangle inequality, and step (29) is by the fact that for any algorithm that fully exhausts the resource, its stopping time τ > T0 (In the case when OA-UCB does not fully consume all the resource at the end of time T, by definition we have τ = T + 1 > T0). By construction, the common budget B in both instances is strictly larger than PT0 t=1 qt, thus the resource is always not exhasuted at T0, since at time t [T0] the DM consumes at most qt units of resource. Altogether, (27) is shown and consistency is verified. C. Proof of Theorem 4.3 Before we embark on the proof, we first state a well known result on online gradient descent: Lemma C.1 (Theorem 3.1 in (Hazan et al., 2016)). Suppose {ft} are convex functions, then Online Gradient Descent presented in Algorithm 3 applied on {ft} with step sizes {ηt = D G t} guarantees the following for all T 1: t=1 ft(xt) min x S t=1 ft(x ) 3 where D = diam(S) and G = maxt ft . Algorithm 3 Online Gradient Descent 1: Initialize convex set S, x1 K, step sizes {ηt}T t=1. 2: for t = 1, 2, ..., T do 3: Play xt and observe cost ft(xt). 4: Update xt+1 = ΠS (xt ηt ft(xt)) . Now we begin the proof of Theorem 4.3. Denote UCBr,t = (UCBr,t(a))a A, LCBc,t = (LCBc,t(a, i))a A,i [d]. We first claim that, at a time step t τ, e At arg max u |A| UCB c,tu ˆQt B µ t LCB c,tu. (30) In fact, the following linear optimization problem max UCB r,tu ˆQt B µ t LCB c,tu has an extreme point solution such that u = ea for some a A. According to the definition of At, we know that u = e At. Then the claim holds. Suppose u is an optimal solution of (2), then we have OPTLP = Qr u , Qc u B1 and u |A|. By the optimality of (30), we have UCBr,t(At) ˆQt B µ t LCBc,t(At) = UCB r,te At ˆQt B µ t LCB c,te At UCB r,tu ˆQt B µ t LCBc,tu , Bandits with Knapsacks: Advice on Time-Varying Demands which is equivalent to UCB r,tu UCBr,t(At) + ˆQt B µ t ˆQt 1d LCB c,tu ˆQt B µ t ˆQt 1d LCBc,t(At) . 
Now we begin the proof of Theorem 4.3. Denote $\mathrm{UCB}_{r,t} = (\mathrm{UCB}_{r,t}(a))_{a\in A}$ and $\mathrm{LCB}_{c,t} = (\mathrm{LCB}_{c,t}(a,i))_{a\in A, i\in[d]}$. We first claim that, at a time step $t < \tau$,
$$e_{A_t} \in \arg\max_{u\in\Delta_{|A|}}\ \mathrm{UCB}_{r,t}^\top u - \frac{\hat Q_t}{B}\,\mu_t^\top \mathrm{LCB}_{c,t}\,u. \qquad (30)$$
In fact, the linear optimization problem $\max_{u\in\Delta_{|A|}} \mathrm{UCB}_{r,t}^\top u - \frac{\hat Q_t}{B}\mu_t^\top\mathrm{LCB}_{c,t}u$ has an extreme-point optimal solution, i.e., $u = e_a$ for some $a\in A$. According to the definition of $A_t$, we know that $u = e_{A_t}$, and the claim holds.

Suppose $u^*$ is an optimal solution of (2); then $\mathrm{OPT}_{LP} = Q\,r^\top u^*$, $Q\,c\,u^* \le B\mathbf 1_d$, and $u^*\in\Delta_{|A|}$. By the optimality in (30), we have
$$\mathrm{UCB}_{r,t}(A_t) - \frac{\hat Q_t}{B}\mu_t^\top\mathrm{LCB}_{c,t}(A_t) = \mathrm{UCB}_{r,t}^\top e_{A_t} - \frac{\hat Q_t}{B}\mu_t^\top\mathrm{LCB}_{c,t}e_{A_t} \ge \mathrm{UCB}_{r,t}^\top u^* - \frac{\hat Q_t}{B}\mu_t^\top\mathrm{LCB}_{c,t}u^*,$$
which is equivalent to
$$\mathrm{UCB}_{r,t}^\top u^* \le \mathrm{UCB}_{r,t}(A_t) - \frac{\hat Q_t}{B}\mu_t^\top\Big(\frac{B}{\hat Q_t}\mathbf 1_d - \mathrm{LCB}_{c,t}\Big)u^* + \frac{\hat Q_t}{B}\mu_t^\top\Big(\frac{B}{\hat Q_t}\mathbf 1_d - \mathrm{LCB}_{c,t}(A_t)\Big).$$
Multiplying both sides by $q_t$, summing over $t$ from $1$ to $\tau-1$, and applying Lemma C.1 with $f_t(x) = q_t\frac{\hat Q_t}{B}\big(\frac{B}{\hat Q_t}\mathbf 1_d - \mathrm{LCB}_{c,t}(A_t)\big)^\top x$, $S = \{\mu : \|\mu\|_1\le 1,\ \mu\ge \mathbf 0_d\}$, $D = \mathrm{diam}(S)\le 2$, and $G = \max_t\|\nabla f_t\| = M$, we obtain, for every $\mu\in S$,
$$\sum_{t=1}^{\tau-1} q_t\,\mathrm{UCB}_{r,t}^\top u^* \le \sum_{t=1}^{\tau-1} q_t\,\mathrm{UCB}_{r,t}(A_t) - \sum_{t=1}^{\tau-1} q_t\frac{\hat Q_t}{B}\mu_t^\top\Big(\frac{B}{\hat Q_t}\mathbf 1_d - \mathrm{LCB}_{c,t}\Big)u^* + \sum_{t=1}^{\tau-1} q_t\frac{\hat Q_t}{B}\Big(\frac{B}{\hat Q_t}\mathbf 1_d - \mathrm{LCB}_{c,t}(A_t)\Big)^\top\mu + O\big(M\sqrt T\big). \qquad (31)$$
Recall from Lemma 4.2 that, with probability $1-3KTd\delta$, we have $r(a)\le \mathrm{UCB}_{r,t}(a)$ and $\mathrm{LCB}_{c,t}(a,i)\le c(a,i)$ for all $a\in A$, $i\in[d]$, $t\in[T]$. Hence, with probability $1-3KTd\delta$,
$$-\sum_{t=1}^{\tau-1} q_t\frac{\hat Q_t}{B}\mu_t^\top\Big(\frac{B}{\hat Q_t}\mathbf 1_d - \mathrm{LCB}_{c,t}\Big)u^* \overset{\text{(32a)}}{\le} -\sum_{t=1}^{\tau-1} q_t\frac{\hat Q_t}{B}\mu_t^\top\Big(\frac{B}{\hat Q_t}\mathbf 1_d - c\Big)u^* \overset{\text{(32b)}}{=} -\sum_{t=1}^{\tau-1} q_t\|\mu_t\|_1 + \frac1B\sum_{t=1}^{\tau-1} q_t\hat Q_t\,\mu_t^\top c\,u^* \overset{\text{(32c)}}{\le} -\sum_{t=1}^{\tau-1} q_t\|\mu_t\|_1 + \frac1Q\sum_{t=1}^{\tau-1} q_t\hat Q_t\|\mu_t\|_1 \overset{\text{(32d)}}{\le} \frac1Q\sum_{t=1}^{\tau-1} q_t\epsilon_t\|\mu_t\|_1 \le \frac1Q\sum_{t=1}^{\tau-1} q_t\epsilon_t, \qquad (32)$$
where (32a) comes from Lemma 4.2, (32b) comes from rearranging the sum (using $\mathbf 1_d^\top\mu_t = \|\mu_t\|_1$ and $\mathbf 1^\top u^* = 1$), (32c) comes from the definition of $u^*$ (which satisfies $Q\,c\,u^*\le B\mathbf 1_d$), and (32d) uses $|\hat Q_t - Q|\le\epsilon_t$ and $\|\mu_t\|_1\le 1$.

We first consider the case $\tau\le T$, which implies that there exists $j_0\in[d]$ such that
$$\sum_{t=1}^{\tau} q_t\,C_{t,j_0} > B, \quad\text{and hence}\quad \sum_{t=1}^{\tau-1} q_t\,C_{t,j_0} > B - \bar q. \qquad (33)$$
Take $\mu = \lambda e_{j_0}$, where $\lambda\in[0,1]$ is a constant that we tune later. In this case, with probability $1-3KT\delta$,
$$\sum_{t=1}^{\tau-1} q_t\,\mathrm{UCB}_{r,t}^\top u^* \ge \sum_{t=1}^{\tau-1} q_t\, r^\top u^* = \frac{Q_{\tau-1}}{Q}\,\mathrm{OPT}_{LP}, \qquad (34)$$
where we write $Q_{\tau-1} := \sum_{t=1}^{\tau-1} q_t$. Moreover,
$$\sum_{t=1}^{\tau-1} q_t\frac{\hat Q_t}{B}\Big(\frac{B}{\hat Q_t}\mathbf 1_d - \mathrm{LCB}_{c,t}(A_t)\Big)^\top(\lambda e_{j_0}) = \lambda\Big(\sum_{t=1}^{\tau-1} q_t - \frac1B\sum_{t=1}^{\tau-1} q_t\hat Q_t\,\mathrm{LCB}_{c,t}(A_t,j_0)\Big) = \lambda\Big(\sum_{t=1}^{\tau-1} q_t - \frac1B\sum_{t=1}^{\tau-1} q_t\hat Q_t C_{t,j_0} + \frac1B\sum_{t=1}^{\tau-1} q_t\hat Q_t\big(C_{t,j_0}-\mathrm{LCB}_{c,t}(A_t,j_0)\big)\Big). \qquad (35)$$
Then we deal with each term respectively:
$$\sum_{t=1}^{\tau-1} q_t - \frac1B\sum_{t=1}^{\tau-1} q_t\hat Q_t C_{t,j_0} \overset{\text{(36a)}}{=} \sum_{t=1}^{\tau-1} q_t - \frac{Q}{B}\sum_{t=1}^{\tau-1} q_t C_{t,j_0} - \frac1B\sum_{t=1}^{\tau-1} q_t(\hat Q_t - Q)C_{t,j_0} \overset{\text{(36b)}}{\le} Q_{\tau-1} - \frac{Q}{B}\sum_{t=1}^{\tau-1} q_t C_{t,j_0} + \frac1B\sum_{t=1}^{\tau-1} q_t\epsilon_t C_{t,j_0} \overset{\text{(36c)}}{<} Q_{\tau-1} - Q + \frac{Q\bar q}{B} + \frac1B\sum_{t=1}^{\tau-1} q_t\epsilon_t C_{t,j_0} \overset{\text{(36d)}}{\le} Q_{\tau-1} - Q + \frac{Q\bar q}{B} + \frac1B\sum_{t=1}^{\tau-1} q_t\epsilon_t, \qquad (36)$$
where (36a) comes from rearranging the sum, (36b) comes from $|\hat Q_t - Q|\le\epsilon_t$, (36c) comes from (33), and (36d) comes from the assumption that $C_{t,j_0}$ is supported in $[0,1]$. Similarly,
$$\frac1B\sum_{t=1}^{\tau-1} q_t\hat Q_t\big(C_{t,j_0}-\mathrm{LCB}_{c,t}(A_t,j_0)\big) = \frac{Q}{B}\sum_{t=1}^{\tau-1} q_t\big(C_{t,j_0}-\mathrm{LCB}_{c,t}(A_t,j_0)\big) + \frac1B\sum_{t=1}^{\tau-1} q_t(\hat Q_t-Q)\big(C_{t,j_0}-\mathrm{LCB}_{c,t}(A_t,j_0)\big) \le \frac{Q}{B}\sum_{t=1}^{\tau-1} q_t\big(C_{t,j_0}-\mathrm{LCB}_{c,t}(A_t,j_0)\big) + \frac1B\sum_{t=1}^{\tau-1} q_t\epsilon_t, \qquad (37)$$
where the equality comes from rearranging the sum, and the inequality comes from the assumption that $|\hat Q_t - Q|\le\epsilon_t$ and $0\le \mathrm{LCB}_{c,t}(A_t,j_0), C_{t,j_0}\le 1$. Combining (35), (36) and (37), we obtain
$$\sum_{t=1}^{\tau-1} q_t\frac{\hat Q_t}{B}\Big(\frac{B}{\hat Q_t}\mathbf 1_d - \mathrm{LCB}_{c,t}(A_t)\Big)^\top(\lambda e_{j_0}) \le \lambda\Big(Q_{\tau-1}-Q+\frac{Q\bar q}{B}\Big) + \frac2B\sum_{t=1}^{\tau-1} q_t\epsilon_t + \frac{Q}{B}\sum_{t=1}^{\tau-1} q_t\big(C_{t,j_0}-\mathrm{LCB}_{c,t}(A_t,j_0)\big), \qquad (38)$$
where the second inequality comes from the assumption that $\lambda\in[0,1]$.

Finally, combining (31), (32), (34), (38), we obtain, with probability $1-3KTd\delta$,
$$\frac{Q_{\tau-1}}{Q}\mathrm{OPT}_{LP} \le \sum_{t=1}^{\tau-1} q_t\,\mathrm{UCB}_{r,t}(A_t) + \frac1Q\sum_{t=1}^{\tau-1} q_t\epsilon_t + \lambda\Big(Q_{\tau-1}-Q+\frac{Q\bar q}{B}\Big) + \frac2B\sum_{t=1}^{\tau-1} q_t\epsilon_t + \frac{Q}{B}\sum_{t=1}^{\tau-1} q_t\big(C_{t,j_0}-\mathrm{LCB}_{c,t}(A_t,j_0)\big) + O\big(M\sqrt T\big),$$
which is equivalent to
$$\sum_{t=1}^{\tau-1} q_t\,\mathrm{UCB}_{r,t}(A_t) \ge \frac{Q_{\tau-1}}{Q}\mathrm{OPT}_{LP} - \lambda(Q_{\tau-1}-Q) - \frac{\lambda Q\bar q}{B} - \frac{Q}{B}\sum_{t=1}^{\tau-1} q_t\big(C_{t,j_0}-\mathrm{LCB}_{c,t}(A_t,j_0)\big) - \Big(\frac1Q+\frac2B\Big)\sum_{t=1}^{\tau-1} q_t\epsilon_t - O\big(M\sqrt T\big).$$
Let $\lambda = \mathrm{OPT}_{LP}/Q \le 1$ (this is because $\mathrm{OPT}_{LP} = Q\,r^\top u^*\le Q$), so that $\frac{Q_{\tau-1}}{Q}\mathrm{OPT}_{LP} - \lambda(Q_{\tau-1}-Q) = \mathrm{OPT}_{LP}$. Then, with probability $1-3KTd\delta$,
$$\sum_{t=1}^{\tau-1} q_t\,\mathrm{UCB}_{r,t}(A_t) \ge \mathrm{OPT}_{LP} - \frac{\mathrm{OPT}_{LP}\,\bar q}{B} - \frac{Q}{B}\sum_{t=1}^{\tau-1} q_t\big(C_{t,j_0}-\mathrm{LCB}_{c,t}(A_t,j_0)\big) - \Big(\frac1Q+\frac2B\Big)\sum_{t=1}^{\tau-1} q_t\epsilon_t - O\big(M\sqrt T\big) \ge \mathrm{OPT}_{LP} - O\Big(\frac{\mathrm{OPT}_{LP}\,\bar q}{B} + \frac{Q}{B}\Big(\sqrt{\bar qKB\log\tfrac1\delta} + \bar qK\log T + \bar qK\log\tfrac1\delta\Big) + \Big(\frac1Q+\frac1B\Big)\sum_{t=1}^{\tau-1} q_t\epsilon_t + M\sqrt T\Big), \qquad (39)$$
where the second inequality comes from Lemma D.10, together with the bound $\frac1B\big|\sum_{t=1}^{\tau-1} q_t\hat Q_t - \sum_{t=1}^{\tau-1} q_t Q\big| \le \frac1B\sum_{t=1}^{\tau-1} q_t\epsilon_t$. The above concludes our arguments for the case $\tau\le T$.

In complement, we then consider the case $\tau > T$, which means that $\tau = T+1$ and no resource is fully exhausted during the horizon. With probability $1-3KT\delta$, we have
$$\sum_{t=1}^{T} q_t\,\mathrm{UCB}_{r,t}^\top u^* \ge \sum_{t=1}^{T} q_t\, r^\top u^* = \mathrm{OPT}_{LP}. \qquad (40)$$
Taking $\mu = \mathbf 0$ and combining (31), (32), (40), with probability $1-3KT\delta$ we have
$$\sum_{t=1}^{T} q_t\,\mathrm{UCB}_{r,t}(A_t) \ge \mathrm{OPT}_{LP} - \frac1Q\sum_{t=1}^{T} q_t\epsilon_t - O\big(M\sqrt T\big). \qquad (41)$$
Combining (39) and (41), for any stopping time $\tau$, with probability $1-3KTd\delta$ we have
$$\sum_{t=1}^{\tau-1} q_t\,\mathrm{UCB}_{r,t}(A_t) \ge \mathrm{OPT}_{LP} - O\Big(\frac{\mathrm{OPT}_{LP}\,\bar q}{B} + \frac{Q}{B}\Big(\sqrt{\bar qKB\log\tfrac1\delta} + \bar qK\log T + \bar qK\log\tfrac1\delta\Big) + \Big(\frac1Q+\frac1B\Big)\sum_{t=1}^{\tau-1} q_t\epsilon_t + M\sqrt T\Big).$$
By Lemma D.9, we can further turn this into the high-probability bound: with probability $1-3KTd\delta$,
$$\sum_{t=1}^{\tau-1} q_t\,R_t \ge \mathrm{OPT}_{LP} - O\Big(\sqrt{\bar qK\,\mathrm{OPT}_{LP}\log\tfrac1\delta} + \bar qK\log T + \bar qK\log\tfrac1\delta + \frac{\mathrm{OPT}_{LP}\,\bar q}{B} + \frac{Q}{B}\Big(\sqrt{\bar qKB\log\tfrac1\delta} + \bar qK\log\tfrac T\delta\Big) + \Big(\frac1Q+\frac1B\Big)\sum_{t=1}^{\tau-1} q_t\epsilon_t + M\sqrt T\Big),$$
where the second inequality comes from the fact that $\mathrm{OPT}_{LP}\ge\sum_{t=1}^{\tau-1} q_t R_t$. This completes the proof of Theorem 4.3.
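The only two per-round computations that the analysis above relies on are the arm-selection rule (30) and the projected gradient step on the dual variable $\mu_t$. The following Python sketch (ours) illustrates one such round; the index and the linear loss follow the forms used above, while the projection surrogate, the toy numbers, and all names are illustrative rather than the exact implementation of OA-UCB.

```python
# Sketch (ours): one primal-dual round as analyzed in the proof of Theorem 4.3.
import numpy as np

def project_surrogate(x):
    # Surrogate for the Euclidean projection onto {mu >= 0, ||mu||_1 <= 1}:
    # clip negatives, then rescale if the sum exceeds 1 (the exact projection
    # appears in the online-gradient-descent sketch after Algorithm 3).
    y = np.maximum(x, 0.0)
    s = y.sum()
    return y if s <= 1.0 else y / s

def select_arm(ucb_r, lcb_c, mu, hatQ_t, B):
    # Index of (30): UCB_r(a) - (hatQ_t / B) * mu^T LCB_c(a, :).
    return int(np.argmax(ucb_r - (hatQ_t / B) * (lcb_c @ mu)))

def dual_step(mu, lcb_c_row, q_t, hatQ_t, B, eta):
    # Gradient of the linear loss f_t(mu) = q_t * (1_d - (hatQ_t / B) * LCB_c(A_t, :)) . mu.
    grad = q_t * (np.ones_like(mu) - (hatQ_t / B) * lcb_c_row)
    return project_surrogate(mu - eta * grad)

ucb_r = np.array([0.9, 0.7, 0.0])                        # K = 3 arms
lcb_c = np.array([[0.8, 0.5], [0.3, 0.2], [0.0, 0.0]])   # d = 2 resources
mu = np.array([0.2, 0.1])
a_t = select_arm(ucb_r, lcb_c, mu, hatQ_t=100.0, B=50.0)
mu = dual_step(mu, lcb_c[a_t], q_t=1.0, hatQ_t=100.0, B=50.0, eta=0.05)
print(a_t, mu)
```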
D. Proofs for Confidence Radii

This section contains proofs for the confidence-radius results, which largely follow the literature; we provide complete proofs since we are in a non-stationary setting. Section D.1 provides the proof of Lemma 4.1, which allows us to extract the implicit constants in the existing proofs of (Babaioff et al., 2015; Agrawal & Devanur, 2014). Section D.2 provides the proof of Lemma 4.2. Finally, in Section D.3 we prove inequalities (14), (15).

D.1. Proof of Lemma 4.1, due to (Babaioff et al., 2015; Agrawal & Devanur, 2014)

In this subsection, we prove Lemma 4.1 by following the line of arguments in (Babaioff et al., 2015). We emphasize that a version of the Lemma has been proved in (Babaioff et al., 2015); we derive the Lemma here for the purpose of extracting the values of the constant coefficients. We first extract some relevant concentration inequalities in the following two Lemmas.

Lemma D.1 (Theorem 8 in (Chung & Lu, 2006)). Suppose $\{U_i\}_{i=1}^n$ are independent random variables satisfying $U_i \le M$ for $1\le i\le n$ almost surely. Let $U = \sum_{i=1}^n U_i$ and $\|U\|^2 = \sum_{i=1}^n \mathbb E[U_i^2]$. With probability at least $1-e^{-x}$, we have
$$U \le \mathbb E[U] + \sqrt{2\|U\|^2 x} + \frac{2x}{3}\max\{M, 0\}.$$

Lemma D.2 (Theorem 6 in (Chung & Lu, 2006)). Suppose $U_i$ are independent random variables satisfying $U_i - \mathbb E[U_i] \le M$ with $M>0$ for $1\le i\le n$. Let $U = \sum_{i=1}^n U_i$ and $\mathrm{Var}(U) = \sum_{i=1}^n\mathrm{Var}(U_i)$. Then with probability at least $1-e^{-x}$, we have
$$U \le \mathbb E[U] + \sqrt{2\,\mathrm{Var}(U)\,x} + \frac{2Mx}{3}.$$

Using Lemma D.2, we first derive the following Lemma, which bounds the deviation of the empirical mean.

Lemma D.3. Let $\{X_i\}_{i=1}^n$ be independent random variables supported in $[0,1]$. Let $X = \sum_{i=1}^n X_i$ and $\mathrm{Var}(X) = \sum_{i=1}^n\mathrm{Var}(X_i)$. For any fixed $x>0$, with probability at least $1-2e^{-x}$, we have
$$|X - \mathbb E[X]| \le \sqrt{2\,\mathrm{Var}(X)\,x} + \frac{2x}{3}.$$
Proof of Lemma D.3. Apply Lemma D.2 with $U_i = X_i$ and $U_i = -X_i$, respectively, and $M = 1$; then with probability at least $1-2e^{-x}$, we have $|X - \mathbb E[X]| \le \sqrt{2\,\mathrm{Var}(X)\,x} + \frac{2x}{3}$.

Next, we bound the difference between the ground-truth variance and its empirical counterpart using Lemma D.1.

Lemma D.4. Suppose $X_i$ are independent random variables supported in $[0,1]$. Let $X = \sum_{i=1}^n X_i$, $\mathrm{Var}(X) = \sum_{i=1}^n\mathrm{Var}(X_i)$, and $V_n = \sum_{i=1}^n (X_i - \mathbb E[X_i])^2$. Then with probability at least $1-3e^{-x}$, we have
$$\sqrt{\mathrm{Var}(X)} \le \sqrt{2V_n} + 4\sqrt{x}.$$
Proof of Lemma D.4. The proof follows the line of argument in (Audibert et al., 2009). First, we apply Lemma D.1 with $U_i = -(X_i - \mathbb E[X_i])^2$ and $M = 0$. With probability at least $1-e^{-x}$, we have
$$\mathrm{Var}(X) \le V_n + \sqrt{2x\sum_{i=1}^n \mathbb E\big[(X_i-\mathbb E[X_i])^4\big]} \le V_n + \sqrt{2x\sum_{i=1}^n \mathbb E\big[(X_i-\mathbb E[X_i])^2\big]} = V_n + \sqrt{2\,\mathrm{Var}(X)\,x}. \qquad (42)$$
Since $X_i\in[0,1]$ almost surely for all $i\in[n]$, we have $\mathrm{Var}(X_i) = \mathbb E[X_i^2] - \mathbb E[X_i]^2 \le \mathbb E[X_i] - \mathbb E[X_i]^2 \le \frac14$. Now, observe that when $x \ge n/16$ the Lemma evidently holds, since $\sqrt{\mathrm{Var}(X)}\le\sqrt n/2\le 4\sqrt x$; otherwise, we may assume $x \le n/16$. Combining Lemma D.3 and (42), with probability at least $1-3e^{-x}$ both events hold, and solving the resulting quadratic inequality in $\sqrt{\mathrm{Var}(X)}$ proves the Lemma.

Lemma D.5. Suppose $X_i$ are independent random variables supported in $[0,1]$. Let $X = \sum_{i=1}^n X_i$. Then with probability at least $1-3e^{-x}$, we have
$$|X - \mathbb E[X]| \le \sqrt{2Xx} + 4x.$$
Proof of Lemma D.5. Apply Lemma D.3 (together with the event of Lemma D.4); then with probability at least $1-3e^{-x}$,
$$|X - \mathbb E[X]| \le \sqrt{2\,\mathrm{Var}(X)\,x} + \frac{2x}{3} \le \sqrt{2\,\mathbb E[X]\,x} + \frac{2x}{3},$$
where the last inequality comes from the fact that, for a random variable supported in $[0,1]$, its variance is always smaller than its mean. Solving this inequality for $\mathbb E[X]$ in terms of $X$ and substituting back yields the stated bound.

Lemma D.6 (Theorem 1.1 in (Dubhashi & Panconesi, 2009)). Suppose $X_i$ are independent random variables supported in $[0,1]$. Let $X = \sum_{i=1}^n X_i$. Then for any $R > 2e\,\mathbb E[X]$, we have $\mathbb P(X > R) \le 2^{-R}$.
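As a quick illustration of the Bernstein-type bound in Lemma D.3, the following Monte Carlo sketch (ours, not from the paper) estimates the failure probability of the inequality for sums of Bernoulli variables and compares it with the guaranteed level $2e^{-x}$; all parameter choices are arbitrary.

```python
# Sketch (ours): Monte Carlo check of Lemma D.3 for X a sum of n Bernoulli(p) variables.
import numpy as np

rng = np.random.default_rng(1)
n, p, x, trials = 1000, 0.3, 3.0, 200_000
mean_X, var_X = n * p, n * p * (1 - p)
threshold = np.sqrt(2.0 * var_X * x) + 2.0 * x / 3.0

samples = rng.binomial(n, p, size=trials)            # each draw is one realization of X
failure_rate = np.mean(np.abs(samples - mean_X) > threshold)
print(failure_rate, "<=", 2.0 * np.exp(-x))          # empirical vs. guaranteed bound
```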
Now we turn back to the proof of Lemma 4.1. Denote $\delta = e^{-x}$. Applying Lemma D.5 to the $N$ observations that define $\hat V$, with probability $1-3\delta$ we have
$$N\big|\hat V - \mathbb E[\hat V]\big| \le \sqrt{2N\hat V\log\tfrac1\delta} + 4\log\tfrac1\delta,$$
which is equivalent to
$$\big|\hat V - \mathbb E[\hat V]\big| \le \mathrm{rad}\big(\hat V, N, \delta\big). \qquad (43)$$
Moreover,
$$\mathbb P\Big(\mathrm{rad}\big(\hat V, N, \delta\big) > 3\,\mathrm{rad}\big(\mathbb E[\hat V], N, \delta\big)\Big) \le \mathbb P\Big(\hat V > 9\,\mathbb E[\hat V] + \frac{32\log\frac1\delta}{N}\Big) \le 2^{-9N\mathbb E[\hat V] - 32\log\frac1\delta} \le \delta, \qquad (44)$$
where the second inequality applies Lemma D.6 to $N\hat V$. Therefore, combining (43) and (44), the lemma holds.

D.2. Proof of Lemma 4.2

By Lemma 4.1, with probability $1-3KT\delta$, we have $|r(a) - \hat R_t(a)| \le \mathrm{rad}\big(\hat R_t(a), N^+_{t-1}(a), \delta\big)$ for all $a\in A$ and $t\in[T]$. Hence, with probability $1-3KT\delta$,
$$r(a) \le \hat R_t(a) + \mathrm{rad}\big(\hat R_t(a), N^+_{t-1}(a), \delta\big) \quad\text{and}\quad r(a)\le 1,$$
so that $r(a) \le \min\big\{\hat R_t(a) + \mathrm{rad}(\hat R_t(a), N^+_{t-1}(a), \delta),\,1\big\} = \mathrm{UCB}_{r,t}(a)$. Similarly, with probability $1-3KTd\delta$, $\mathrm{LCB}_{c,t}(a,i) \le c(a,i)$ for all $a\in A$, $i\in[d]$, $t\in[T]$.

D.3. Proof of Inequalities (14), (15)

We first provide two lemmas.

Lemma D.7 (Theorem 1.6 in (Freedman, 1975)). Suppose $\{U_i\}_{i=1}^n$ is a martingale difference sequence with respect to the filtration $\{\mathcal F_i\}_{i=1}^n$, with $|U_i|\le 1$ almost surely. Let $U = \sum_{i=1}^n U_i$ and $V = \sum_{i=1}^n \mathrm{Var}(U_i\mid\mathcal F_{i-1})$. Then for any $a>0$, $b>0$, we have
$$\mathbb P\big(|U|\ge a,\ V\le b\big) \le 2e^{-\frac{a^2}{2(a+b)}}.$$

Lemma D.8. Suppose $\{X_i\}_{i=1}^n$ are random variables supported in $[0,1]$, where $X_i$ is $\mathcal F_i$-measurable and $\{\mathcal F_i\}_{i=1}^n$ is a filtration. Let $M_i = \mathbb E[X_i\mid\mathcal F_{i-1}]$ for each $i\in\{1,\ldots,n\}$, and $M = \sum_{i=1}^n M_i$. Then with probability at least $1-2n\delta$, we have
$$\Big|\sum_{i=1}^n (X_i - M_i)\Big| \le \sqrt{2(M+1)\log\tfrac1\delta} + 2\log\tfrac1\delta.$$
Proof of Lemma D.8. The proof follows the line of Theorem 4.10 in (Babaioff et al., 2015). Let $U_i = X_i - M_i$ for each $i\in\{1,\ldots,n\}$. Clearly, $\{U_i\}_{i=1}^n$ is a martingale difference sequence with respect to the filtration $\{\mathcal F_i\}_{i=1}^n$. Since
$$\mathrm{Var}(U_i\mid\mathcal F_{i-1}) = \mathrm{Var}(X_i\mid\mathcal F_{i-1}) = \mathbb E[X_i^2\mid\mathcal F_{i-1}] - \mathbb E[X_i\mid\mathcal F_{i-1}]^2 \le \mathbb E[X_i\mid\mathcal F_{i-1}] = M_i$$
almost surely, we have $V = \sum_{i=1}^n \mathrm{Var}(U_i\mid\mathcal F_{i-1}) \le \sum_{i=1}^n M_i = M$ almost surely. Apply Lemma D.7 with $a = \sqrt{2b\log\frac1\delta} + 2\log\frac1\delta$ for any integer $b\ge 1$; it follows that the event $\{|U|\ge a,\ V\le b\}$ has probability at most $2\delta$. Take the union bound over all integers $b$ from $1$ to $n$, noting that $V\le M$ and $b-1\le M\le b$ for some $b\in\{1,\ldots,n\}$ almost surely; with probability $1-2n\delta$ we have
$$\Big|\sum_{i=1}^n (X_i - M_i)\Big| \le \sqrt{2(M+1)\log\tfrac1\delta} + 2\log\tfrac1\delta.$$
Altogether, the lemma holds.

Now, we paraphrase inequalities (14), (15) as Lemmas D.9 and D.10, and provide their proofs.

Lemma D.9. With probability $1-3KT\delta$, we have
$$\sum_{t=1}^{\tau-1} q_t\,\mathrm{UCB}_{r,t}(A_t) \le \sum_{t=1}^{\tau-1} q_t R_t + O\Big(\sqrt{\bar qK\log\tfrac1\delta\sum_{t=1}^{\tau-1} q_t R_t} + \bar qK\log T + \bar qK\log\tfrac1\delta\Big).$$

Lemma D.10. With probability $1-3KTd\delta$, we have
$$\sum_{t=1}^{\tau-1} q_t\,\mathrm{LCB}_{c,t}(A_t,i) \ge \sum_{t=1}^{\tau-1} q_t C_{t,i} - O\Big(\sqrt{\bar qKB\log\tfrac1\delta} + \bar qK\log T + \bar qK\log\tfrac1\delta\Big).$$

Proof of Lemma D.9. First, with probability $1-2T\delta$, we have
$$\sum_{t=1}^{\tau-1} q_t r(A_t) - \sum_{t=1}^{\tau-1} q_t R_t \overset{\text{(45a)}}{=} \sum_{t=1}^{\tau-1} q_t\big(r(A_t) - R_t\big) \overset{\text{(45b)}}{\le} \sqrt{2\bar q\log\tfrac1\delta\Big(\sum_{t=1}^{\tau-1} q_t r(A_t)+\bar q\Big)} + 2\bar q\log\tfrac1\delta \overset{\text{(45c)}}{\le} \sqrt{2\bar q\log\tfrac1\delta\Big(\sum_{t=1}^{\tau-1} q_t\,\mathrm{UCB}_{r,t}(A_t)+\bar q\Big)} + 2\bar q\log\tfrac1\delta, \qquad (45)$$
where (45c) comes from Lemma 4.2, and (45b) comes from Lemma D.8, applied with $X_t = q_t R_t$ (rescaled by $\bar q$) and $\mathcal F_{t-1} = \sigma\big(\{A_s, q_s, R_s, \{C_{s,i}\}_{i=1}^d, \hat Q_s\}_{s=1}^{t-1}\cup\{q_t\}\big)$.
Then, with probability $1-3KT\delta$, we also have
$$\sum_{t=1}^{\tau-1} q_t\,\mathrm{UCB}_{r,t}(A_t) - \sum_{t=1}^{\tau-1} q_t r(A_t) \overset{\text{(46a)}}{\le} 6\sum_{t=1}^{\tau-1} q_t\,\mathrm{rad}\big(r(A_t), N^+_{t-1}(A_t), \delta\big) \overset{\text{(46b)}}{=} 6\sum_{a\in A: N_{\tau-1}(a)>0}\ \sum_{n=1}^{N_{\tau-1}(a)} q_n(a)\,\mathrm{rad}\big(r(a), n, \delta\big) \overset{\text{(46c)}}{\le} 6\sum_{a\in A: N_{\tau-1}(a)>0}\ \sum_{n=1}^{N_{\tau-1}(a)} q_n(a)\Big(\sqrt{\tfrac{2r(a)\log\frac1\delta}{n}} + \tfrac{4\log\frac1\delta}{n}\Big) \overset{\text{(46d)}}{\le} 6\sum_{a\in A: N_{\tau-1}(a)>0}\Big(2\sqrt{2\bar q\,r(a)\,Q_{\tau-1}(a)\log\tfrac1\delta} + 4\bar q\big(1+\log N_{\tau-1}(a)\big)\log\tfrac1\delta\Big) \overset{\text{(46e)}}{\le} O\Big(\sqrt{\bar qK\log\tfrac1\delta\sum_{a\in A} r(a)Q_{\tau-1}(a)} + \bar qK\log T + \bar qK\log\tfrac1\delta\Big) = O\Big(\sqrt{\bar qK\log\tfrac1\delta\sum_{t=1}^{\tau-1} q_t r(A_t)} + \bar qK\log T + \bar qK\log\tfrac1\delta\Big) \le O\Big(\sqrt{\bar qK\log\tfrac1\delta\sum_{t=1}^{\tau-1} q_t\,\mathrm{UCB}_{r,t}(A_t)} + \bar qK\log T + \bar qK\log\tfrac1\delta\Big). \qquad (46)$$
(46a) comes from the following: with probability $1-3KT\delta$,
$$\big|\mathrm{UCB}_{r,t}(A_t) - r(A_t)\big| \le \big|\hat R_{t-1}(A_t) - r(A_t)\big| + \mathrm{rad}\big(\hat R_{t-1}(A_t), N^+_{t-1}(A_t), \delta\big) \le 2\,\mathrm{rad}\big(\hat R_{t-1}(A_t), N^+_{t-1}(A_t), \delta\big) \le 6\,\mathrm{rad}\big(r(A_t), N^+_{t-1}(A_t), \delta\big).$$
(46b) comes from rearranging the sum; here $q_n(a)$ denotes the demand volume at the $n$-th time step in which the algorithm selects arm $a$. (46c) comes from the definition of $\mathrm{rad}(\cdot,\cdot,\cdot)$. (46d) comes from the following elementary bounds, valid for $w_i\in(0,1]$:
$$\sum_{i=1}^{n}\frac{w_i}{\sqrt{\sum_{j=1}^{i}w_j}} \le \sum_{i=1}^{n}\frac{2w_i}{\sqrt{\sum_{j=1}^{i}w_j}+\sqrt{\sum_{j=1}^{i-1}w_j}} = 2\sqrt{\sum_{j=1}^{n}w_j}, \qquad \sum_{i=1}^{n}\frac1i \le 1+\log n.$$
In (46d) and (46e), $Q_t(a) = \sum_{s\in[t]: A_s=a} q_s$. (46e) comes from Jensen's inequality.
Combining (45) and (46), we have, with probability $1-3KT\delta$,
$$\sum_{t=1}^{\tau-1} q_t\,\mathrm{UCB}_{r,t}(A_t) \le \sum_{t=1}^{\tau-1} q_t R_t + O\Big(\sqrt{\bar qK\log\tfrac1\delta\sum_{t=1}^{\tau-1} q_t\,\mathrm{UCB}_{r,t}(A_t)} + \bar qK\log T + \bar qK\log\tfrac1\delta\Big),$$
which is equivalent to
$$\sum_{t=1}^{\tau-1} q_t\,\mathrm{UCB}_{r,t}(A_t) \le O\Big(\sum_{t=1}^{\tau-1} q_t R_t\Big) + O\Big(\bar qK\log T + \bar qK\log\tfrac1\delta\Big), \qquad (47)$$
and hence
$$\sqrt{\sum_{t=1}^{\tau-1} q_t\,\mathrm{UCB}_{r,t}(A_t)} \le O\Big(\sqrt{\sum_{t=1}^{\tau-1} q_t R_t}\Big) + O\Big(\sqrt{\bar qK\log T + \bar qK\log\tfrac1\delta}\Big).$$
Combining (45), (46), and (47), we finish the proof.

Proof of Lemma D.10. The proof is quite similar to that of Lemma D.9, so we omit the descriptive details. Similarly, with probability $1-2Td\delta$, we have
$$\sum_{t=1}^{\tau-1} q_t c(A_t,i) - \sum_{t=1}^{\tau-1} q_t C_{t,i} = \sum_{t=1}^{\tau-1} q_t\big(c(A_t,i) - C_{t,i}\big) \le \sqrt{2\bar q\log\tfrac1\delta\Big(\sum_{t=1}^{\tau-1} q_t c(A_t,i)+\bar q\Big)} + 2\bar q\log\tfrac1\delta \le \sqrt{2\bar q\log\tfrac1\delta\Big(\sum_{t=1}^{\tau-1} q_t\,\mathrm{UCB}_{c,t}(A_t,i)+\bar q\Big)} + 2\bar q\log\tfrac1\delta. \qquad (48)$$
Then, with probability $1-3KTd\delta$, following the same steps as in (46) with $c(a,i)$ in place of $r(a)$, we also have
$$\sum_{t=1}^{\tau-1} q_t c(A_t,i) - \sum_{t=1}^{\tau-1} q_t\,\mathrm{LCB}_{c,t}(A_t,i) \le 6\sum_{t=1}^{\tau-1} q_t\,\mathrm{rad}\big(c(A_t,i), N^+_{t-1}(A_t), \delta\big) \le O\Big(\sqrt{\bar qK\log\tfrac1\delta\sum_{a\in A} c(a,i)Q_{\tau-1}(a)} + \bar qK\log T + \bar qK\log\tfrac1\delta\Big) \le O\Big(\sqrt{\bar qK\log\tfrac1\delta\sum_{t=1}^{\tau-1} q_t\,\mathrm{UCB}_{c,t}(A_t,i)} + \bar qK\log T + \bar qK\log\tfrac1\delta\Big), \qquad (49)$$
and
$$\sum_{t=1}^{\tau-1} q_t\,\mathrm{UCB}_{c,t}(A_t,i) - \sum_{t=1}^{\tau-1} q_t c(A_t,i) \le 6\sum_{t=1}^{\tau-1} q_t\,\mathrm{rad}\big(c(A_t,i), N^+_{t-1}(A_t), \delta\big) \le O\Big(\sqrt{\bar qK\log\tfrac1\delta\sum_{t=1}^{\tau-1} q_t\,\mathrm{UCB}_{c,t}(A_t,i)} + \bar qK\log T + \bar qK\log\tfrac1\delta\Big). \qquad (50)$$
Combining (48) and (50), we have
$$\sum_{t=1}^{\tau-1} q_t\,\mathrm{UCB}_{c,t}(A_t,i) \le \sum_{t=1}^{\tau-1} q_t C_{t,i} + O\Big(\sqrt{\bar qK\log\tfrac1\delta\sum_{t=1}^{\tau-1} q_t\,\mathrm{UCB}_{c,t}(A_t,i)} + \bar qK\log T + \bar qK\log\tfrac1\delta\Big),$$
which is equivalent to
$$\sum_{t=1}^{\tau-1} q_t\,\mathrm{UCB}_{c,t}(A_t,i) \le O\Big(\sum_{t=1}^{\tau-1} q_t C_{t,i}\Big) + O\Big(\bar qK\log T + \bar qK\log\tfrac1\delta\Big). \qquad (51)$$
Hence,
$$\sqrt{\sum_{t=1}^{\tau-1} q_t\,\mathrm{UCB}_{c,t}(A_t,i)} \le O\Big(\sqrt{\sum_{t=1}^{\tau-1} q_t C_{t,i}}\Big) + O\Big(\sqrt{\bar qK\log T + \bar qK\log\tfrac1\delta}\Big) \le O\big(\sqrt B\big) + O\Big(\sqrt{\bar qK\log T + \bar qK\log\tfrac1\delta}\Big),$$
where the last inequality comes from the definition of the stopping time $\tau$, which ensures $\sum_{t=1}^{\tau-1} q_t C_{t,i} \le B$. Combining (48), (49), and (51), we finish the proof.
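For completeness, here is a minimal Python sketch (ours) of how a confidence radius of empirical-Bernstein form feeds into the indices $\mathrm{UCB}_{r,t}$ and $\mathrm{LCB}_{c,t}$. The specific form used below, $\mathrm{rad}(v, N, \delta) = \sqrt{2v\log(1/\delta)/N} + 4\log(1/\delta)/N$, is our assumption, chosen to be consistent with the constants extracted in Section D.1; the definition in the main text is authoritative, and all function names are illustrative.

```python
# Sketch (ours): an empirical-Bernstein-style radius and the resulting indices.
import numpy as np

def rad(v_hat, N, delta):
    # Assumed form: sqrt(2 v log(1/delta) / N) + 4 log(1/delta) / N.
    return np.sqrt(2.0 * v_hat * np.log(1.0 / delta) / N) + 4.0 * np.log(1.0 / delta) / N

def ucb_reward(r_hat, N, delta):
    # UCB_{r,t}(a) = min{ empirical mean + radius, 1 }, as in Section D.2.
    return min(r_hat + rad(r_hat, N, delta), 1.0)

def lcb_consumption(c_hat, N, delta):
    # LCB_{c,t}(a, i): empirical mean shrunk by the radius, floored at 0.
    return max(c_hat - rad(c_hat, N, delta), 0.0)

print(ucb_reward(r_hat=0.6, N=50, delta=1e-3),
      lcb_consumption(c_hat=0.4, N=50, delta=1e-3))
```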