# Non-stationary Bandits with Knapsacks

Shang Liu, Jiashuo Jiang, Xiaocheng Li
Imperial College Business School, Imperial College London; NYU Stern School of Business
s.liu21@imperial.ac.uk, jj2398@stern.nyu.edu, xiaocheng.li@imperial.ac.uk

Abstract

In this paper, we study the problem of bandits with knapsacks (BwK) in a non-stationary environment. The BwK problem generalizes the multi-armed bandit (MAB) problem to model the resource consumption associated with playing each arm. At each time, the decision maker/player chooses to play an arm, and s/he will receive a reward and consume a certain amount of each of multiple resource types. The objective is to maximize the cumulative reward over a finite horizon subject to some knapsack constraints on the resources. Existing works study the BwK problem under either a stochastic or an adversarial environment. Our paper considers a non-stationary environment which continuously interpolates between these two extremes. We first show that the traditional notion of variation budget is insufficient to characterize the non-stationarity of the BwK problem for a sublinear regret due to the presence of the constraints, and then we propose a new notion of global non-stationarity measure. We employ both non-stationarity measures to derive upper and lower bounds for the problem. Our results are based on a primal-dual analysis of the underlying linear programs and highlight the interplay between the constraints and the non-stationarity. Finally, we also extend the non-stationarity measure to the problem of online convex optimization with constraints and obtain new regret bounds accordingly.

1 Introduction

The multi-armed bandit (MAB) problem characterizes a problem in which a limited amount of resource must be allocated between competing (alternative) choices in a way that maximizes the expected gain. The bandits with knapsacks (BwK) problem generalizes the multi-armed bandit problem to allow a more general resource constraint structure on the decisions made over time, in addition to the customary limitation on the time horizon. Specifically, for the BwK problem, the decision maker/player chooses to play an arm at each time period; s/he will receive a reward and consume a certain amount of each of multiple resource types. Accordingly, the objective is to maximize the cumulative reward over a finite time horizon subject to an initial budget of multiple resource types. The BwK problem was first introduced by Badanidiyuru et al. [2013] as a general framework to model a wide range of applications, including dynamic pricing and revenue management [Besbes and Zeevi, 2012], the Adwords problem [Mehta et al., 2005], and more. The standard setting of the BwK problem is stochastic, where the joint distribution of reward and resource consumption for each arm remains stationary (identical) over time. Under such a setting, a linear program (LP), which takes the expected reward and resource consumption of each arm as input, both serves as the benchmark for regret analysis and drives the algorithm design [Badanidiyuru et al., 2013, Agrawal and Devanur, 2014]. Notably, a static best distribution prescribed by the LP's optimal solution is used for defining the regret benchmark. An alternative setting is the adversarial BwK problem, where the reward and the consumption may no longer follow a distribution and can be chosen arbitrarily over time.
Under the adversarial setting, a sublinear regret is not achievable in the worst case; Immorlica et al. [2019] derive an $O(\log T)$ competitive ratio against the static best distribution benchmark, which is aligned with the static optimal benchmark in the adversarial bandits problem [Auer et al., 1995]. Another key feature of the BwK problem is the number of resource types $d$. When $d = 1$, one optimal decision is to play the arm with the largest ratio of (expected) reward to (expected) resource consumption, and the algorithm design and analysis can be largely reduced to the MAB problem. When $d > 1$, the optimal decision in general requires playing a combination of arms (corresponding to the optimal basis of the underlying LP). Rangi et al. [2018] focus on the case of $d = 1$ and propose an EXP3-based algorithm that attains a regret of $O(\sqrt{mB\log m})$ against the best fixed distribution benchmark. Their result thus bridges the gap between the stochastic BwK problem and the adversarial BwK problem for the case of $d = 1$. The difference between the cases of $d = 1$ and $d > 1$ is also exhibited in the derivation of problem-dependent regret bounds for the stochastic BwK problem [Flajolet and Jaillet, 2015, Li et al., 2021, Sankararaman and Slivkins, 2021].

In this paper, we study the non-stationary BwK problem, where the reward and the resource consumption at each time are sampled from a distribution as in the stochastic BwK problem, but the distribution may change over time. The setting relaxes the temporally i.i.d. assumption of the stochastic setting, and it can be viewed as a soft measure of adversity. We aim to relate the non-stationarity (or adversity) of the distribution change to the best-achievable algorithm performance, and thus our result bridges the two extremes of the BwK problem: stochastic BwK and adversarial BwK. We consider a dynamic benchmark to define the regret; while such a benchmark is aligned with the dynamic benchmark in other non-stationary learning problems [Besbes et al., 2014, 2015, Cheung et al., 2019, Faury et al., 2021], it is stronger than the static distribution benchmark in adversarial BwK [Rangi et al., 2018, Immorlica et al., 2019]. Importantly, we use simple examples and lower bound results to show that traditional non-stationarity measures such as change points and the variation budget are not suitable for the BwK problem due to the presence of the constraints. We introduce a new non-stationarity measure called the global variation budget and employ both this new measure and the original variation budget to capture the underlying non-stationarity of the BwK problem. We analyze the performance of a sliding-window UCB-based BwK algorithm and derive a near-optimal regret bound. Furthermore, we show that the new non-stationarity measure can also be applied to the problem of online convex optimization with constraints (OCOwC) and extend the analyses therein.

1.1 Related literature

The study of the non-stationary bandits problem begins with the change-point or piecewise-stationary setting, where the distribution of the rewards remains constant over epochs and changes at unknown time instants [Garivier and Moulines, 2008, Yu and Mannor, 2009]. Prototypes of non-stationary algorithms such as discounted UCB and sliding-window UCB are proposed and analyzed in [Garivier and Moulines, 2008] to robustify the standard UCB algorithm against environment changes.
The prevalent variation budget measure $V = \sum_{t=1}^{T-1} \|P_t - P_{t+1}\|$ (where $P_t$ and the norm bear different meanings in different contexts) is proposed later and widely studied in different contexts, such as non-stationary stochastic optimization [Besbes et al., 2015], non-stationary MAB [Besbes et al., 2014], non-stationary linear bandits [Cheung et al., 2019], and non-stationary generalized linear bandits [Faury et al., 2021]. In general, these works derive a lower bound of $\Omega(V^{1/3}T^{2/3})$ and propose algorithms that achieve a near-optimal regret of $O(V^{1/3}T^{2/3})$. Cheung et al. [2019] and Faury et al. [2021] require various assumptions on the decision set to attain such an upper bound; under more general conditions, a regret bound of $O(V^{1/5}T^{4/5})$ can be obtained [Faury et al., 2021]. With a soft measure of non-stationarity, the existing results manage to obtain regret bounds sublinear in $T$ against dynamic optimal benchmarks. In contrast, a regret linear in $T$ is generally inevitable against the dynamic benchmark when the underlying environment is adversarial. We remark that while all these existing works consider the unconstrained setting, our work complements this line of literature with a proper measure of non-stationarity in the constrained setting.

Another related stream of literature is the problem of online convex optimization with constraints (OCOwC), which extends the OCO problem to a constrained setting. Two types of constraints are considered: the long-term constraint [Jenatton et al., 2016, Neely and Yu, 2017] and the cumulative constraint [Yuan and Lamperski, 2018, Yi et al., 2021]. The former defines the constraint violation by $\big(\sum_{t=1}^{T} g_t(x_t)\big)_+$, whilst the latter defines it by $\sum_{t=1}^{T}\big(g_t(x_t)\big)_+$, where $(\cdot)_+$ is the positive-part function. The existing works mainly study the setting where $g_t = g$ for all $t$ and $g$ is known a priori. Neely and Yu [2017] consider a setting where $g_t$ is generated i.i.d. from some distribution. In this paper, we show that our non-stationarity measure naturally extends to this problem and derive bounds for OCOwC when the $g_t$'s are generated in a non-stationary manner.

2 Problem Setup

We first introduce the formulation of the BwK problem. The decision maker/learner is given a fixed finite set of arms $A$ (with $|A| = m$) called the action set. There are $d$ knapsack constraints with a known initial budget of $B_j$ for $j \in [d]$. Without loss of generality, we assume $B_j = B$ for all $j$. There is a finite time horizon $T$, which is also known in advance. At each time $t = 1, \dots, T$, the learner must choose either to play an arm $i_t$ or to do nothing and wait. If the learner plays arm $i$ at time $t$, s/he will receive a reward $r_{t,i} \in [0,1]$ and consume $c_{t,j,i} \in [0,1]$ amount of each resource $j$ from the initial budget $B$. As is conventional, we introduce a null arm to model doing nothing, which generates a reward of zero and consumes no resource at all. We assume $(r_t, c_t)$ is sampled from some distribution $P_t$ independently over time, where $r_t = \{r_{t,i}\}_{i\in[m]}$ and $c_t = \{c_{t,j,i}\}_{i\in[m], j\in[d]}$. In the stochastic BwK problem, the distribution $P_t$ remains unchanged over time, while in the adversarial BwK problem, $P_t$ is chosen adversarially. In our paper, we allow $P_t$ to be chosen adversarially, but we use a non-stationarity measure to control the extent of adversity in choosing the $P_t$'s. At each time $t$, the learner needs to pick $i_t$ using the past observations up to time $t-1$, without observing the outcomes of time step $t$.
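The interaction protocol described above can be summarized by the following minimal sketch (a hypothetical interface, not code from the paper); the budget and stopping rule introduced next are left to the caller.

```python
import numpy as np

def run_bwk(T, m, d, sample_Pt, choose_arm, rng):
    """Generic non-stationary BwK interaction loop (budget and stopping are handled by the caller).

    sample_Pt(t, rng) -> (r, c): reward vector r of shape (m,), consumption c of shape (d, m).
    choose_arm(history) -> an arm index in {0, ..., m-1}; one arm may serve as the null arm.
    """
    history = []                               # bandit feedback observed so far
    for t in range(1, T + 1):
        r, c = sample_Pt(t, rng)               # full outcome (r_t, c_t) ~ P_t, hidden from the learner
        i = choose_arm(history)                # the decision may only use past feedback
        history.append((t, i, r[i], c[:, i]))  # only the played arm's outcome is revealed
    return history

# Toy usage: one Bernoulli arm plus a null arm, a single resource, and a naive learner.
sample = lambda t, rng: (np.array([rng.binomial(1, 0.6), 0.0]),
                         np.array([[rng.binomial(1, 0.5), 0.0]]))
hist = run_bwk(T=10, m=2, d=1, sample_Pt=sample, choose_arm=lambda h: 0,
               rng=np.random.default_rng(0))
print(len(hist), "rounds simulated")
```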
The resource constraints are assumed to be hard constraints, i.e., the learner must stop at the earliest time $\tau$ when at least one constraint is violated, i.e., $\sum_{t=1}^{\tau} c_{t,j,i_t} > B$ for some $j$, or when the time horizon $T$ is exceeded. The objective is to maximize the expected cumulative reward until time $\tau$, i.e., $\mathbb{E}\big[\sum_{t=1}^{\tau-1} r_{t,i_t}\big]$. To measure the performance of a learner, we define the regret of the algorithm/policy $\pi$ adopted by the learner as
$$\mathrm{Reg}(\pi, T) := \mathrm{OPT}(T) - \mathbb{E}\Big[\sum_{t=1}^{\tau-1} r_{t,i_t}\Big].$$
Here $\mathrm{OPT}(T)$ denotes the expected cumulative reward of the optimal dynamic policy given all the knowledge of the $P_t$'s in advance. Its definition is based on the dynamic optimal benchmark, which allows the arm play decisions/distributions to change over time.

2.1 A Motivating Example

The conventional variation budget is defined by $V := \sum_{t=1}^{T-1} \mathrm{dist}(P_t, P_{t+1})$. By adjusting the definition of the metric $\mathrm{dist}(\cdot,\cdot)$, it captures many of the existing non-stationarity measures for unconstrained learning problems. Now we use a simple example to illustrate why $V$ no longer fits the constrained setting. Similar examples have been used to motivate algorithm design and lower bound analysis in [Golrezaei et al., 2014, Cheung et al., 2019, Jiang et al., 2020], but have not yet been exploited in a partial-information setting such as bandits problems.

Consider a BwK problem instance that has two arms (one actual arm and one null arm) and a single resource constraint with initial capacity $T/2$. Without loss of generality, we assume $T$ is even. The null arm has zero reward and zero resource consumption throughout the horizon, and the actual arm always consumes 1 unit of resource (deterministically) for each play and outputs 1 unit of reward (deterministically) for the first half of the horizon, i.e., when $t = 1, \dots, T/2$. For the second half of the horizon, $t = T/2+1, \dots, T$, the reward of the actual arm changes to either $1+\epsilon$ or $1-\epsilon$, and the change happens adversarially. For this problem instance, the distribution $P_t$ only changes once, i.e., $V = \epsilon$ (up to a constant depending on the metric definition). But for this problem instance, a regret of $\epsilon T/4$ is inevitable. To see this, if the player plays the actual arm no less than $T/4$ times in the first half of the horizon, then the distribution of the second half can adversarially change so that the reward becomes $1+\epsilon$, and this results in a regret of at least $\epsilon T/4$. The same holds for the case of playing the actual arm no more than $T/4$ times, and we defer the formal analysis to the proof of the lower bounds in Theorem 2.

This problem instance implies that a sublinear dependency on $T$ cannot be achieved with merely the variation budget $V$ to characterize the non-stationarity, because with the presence of the constraint(s), the arm play decisions over time are all coupled together, not only through the learning procedure but also through the global resource constraint(s). For unconstrained problems, the non-stationarity affects the effectiveness of the learning of the system; for constrained problems, the non-stationarity further challenges the decision-making process through the lens of the constraints.
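The arithmetic behind this example can be verified directly. In the sketch below (illustrative only; the horizon and the change magnitude $\epsilon$ are chosen arbitrarily), for every possible number $n$ of first-half plays of the actual arm, the adversary picks the worse of the two second-half rewards; minimizing over $n$ recovers a worst-case regret of order $\epsilon T/4$ even though the variation budget of the instance is only $\epsilon$.

```python
import numpy as np

T, eps = 1000, 0.2      # illustrative horizon and change magnitude
B = T // 2              # initial budget; the actual arm consumes one unit per play

def regret_given_first_half_plays(n):
    """Worst-case regret when the learner plays the actual arm n times in the first half.

    The remaining budget B - n is spent in the second half (the best the learner can do).
    The adversary picks the second-half reward, 1 + eps or 1 - eps, after seeing n.
    """
    # Adversary sets the second-half reward to 1 + eps: OPT saves the whole budget.
    opt_up = (1 + eps) * B
    learner_up = n * 1.0 + (B - n) * (1 + eps)
    # Adversary sets the second-half reward to 1 - eps: OPT spends the whole budget early.
    opt_down = 1.0 * B
    learner_down = n * 1.0 + (B - n) * (1 - eps)
    return max(opt_up - learner_up, opt_down - learner_down)

worst = min(regret_given_first_half_plays(n) for n in range(B + 1))
print("best achievable worst-case regret:", worst, " vs  eps*T/4 =", eps * T / 4)
```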
2.2 Non-stationarity Measure and Linear Programs

We denote the expected reward vector as $\mu_t = \{\mu_{t,i}\}_{i\in[m]}$ and the expected consumption matrix as $C_t = \{C_{t,j,i}\}_{j\in[d], i\in[m]}$, i.e., $\mu_{t,i} := \mathbb{E}[r_{t,i}]$ and $C_{t,j,i} := \mathbb{E}[c_{t,j,i}]$. We first follow the conventional variation budget and define the local non-stationarity budget¹:
$$V_1 := \sum_{t=1}^{T-1} \|\mu_t - \mu_{t+1}\|_\infty, \qquad V_{2,j} := \sum_{t=1}^{T-1} \|C_{t,j} - C_{t+1,j}\|_\infty, \qquad V_2 := \max_{1\le j\le d} V_{2,j}.$$
We refer to these measures as local ones in that they capture the local change of the distributions between time $t$ and time $t+1$. Next, we define the global non-stationarity budget:
$$W_1 := \sum_{t=1}^{T} \|\mu_t - \bar\mu\|_\infty, \qquad W_2 := \sum_{t=1}^{T} \|C_t - \bar C\|_1,$$
where $\bar\mu = \frac{1}{T}\sum_{t=1}^{T}\mu_t$ and $\bar C = \frac{1}{T}\sum_{t=1}^{T} C_t$. These measures capture the total deviations of all the $\mu_t$'s and $C_t$'s from their global averages. By definition, $W_1$ and $W_2$ upper bound $V_1$ and $V_2$ (up to a constant), so they can be viewed as stricter measures of non-stationarity than the local budgets. In the definition of $W_2$, the $L_1$ norm is not essential; it only serves to sharpen the regret bounds.

All the existing analyses of the BwK problem utilize the underlying linear program (LP) and establish the LP's optimal objective value as an upper bound of the regret benchmark $\mathrm{OPT}(T)$. In a non-stationary environment, the underlying LP is given by
$$\mathrm{LP}(\{\mu_t\}, \{C_t\}, T) := \max_{x_1,\dots,x_T}\ \sum_{t=1}^{T} \mu_t^\top x_t \quad \text{s.t.}\ \sum_{t=1}^{T} C_t x_t \le \mathbf{B},\quad x_t \in \Delta_m,\ t = 1,\dots,T,$$
where $\mathbf{B} = (B,\dots,B)^\top$ and $\Delta_m$ denotes the $m$-dimensional standard simplex. We know that $\mathrm{LP}(\{\mu_t\},\{C_t\},T) \ge \mathrm{OPT}(T)$. In the rest of our paper, we will use $\mathrm{LP}(\{\mu_t\},\{C_t\},T)$ for the analysis of the regret upper bound. We remark that, in terms of this LP upper bound, the dynamic benchmark allows the $x_t$'s to take different values, while the static benchmark would impose an additional constraint requiring all the $x_t$'s to be the same. For notational simplicity, we introduce the following linear growth assumption. All the results in this paper still hold without this condition.

¹ Throughout the paper, for a vector $v\in\mathbb{R}^n$, we denote its $L_1$ norm and $L_\infty$ norm by $\|v\|_1 := \sum_{i=1}^n |v_i|$ and $\|v\|_\infty := \max_{1\le i\le n}|v_i|$. For a matrix $M \in \mathbb{R}^{m\times n}$, we denote its $L_1$ norm and $L_\infty$ norm by $\|M\|_1 := \sup_{x\neq 0}\frac{\|Mx\|_1}{\|x\|_1} = \max_{1\le j\le n}\sum_{i=1}^m |M_{ij}|$ and $\|M\|_\infty := \sup_{x\neq 0}\frac{\|Mx\|_\infty}{\|x\|_\infty} = \max_{1\le i\le m}\sum_{j=1}^n |M_{ij}|$.

Assumption 1 (Linear Growth). We have the resource budget $B = bT$ for some $b > 0$.

Define the single-step LP by
$$\mathrm{LP}(\mu, C) := \max_{x}\ \mu^\top x \quad \text{s.t.}\ Cx \le \mathbf{b},\ x\in\Delta_m,$$
where $\mathbf{b} = (b,\dots,b)^\top$. The single-step LP's optimal objective value can be interpreted as the single-step optimal reward under a normalized resource budget $b$. Throughout the paper, we will use the dual program and the dual variables to relate the resource consumption to the reward, especially in the non-stationary environment. The dual of the benchmark $\mathrm{LP}(\{\mu_t\},\{C_t\},T)$ is
$$\mathrm{DLP}(\{\mu_t\},\{C_t\}) := \min_{q,\alpha}\ T\mathbf{b}^\top q + \sum_{t=1}^{T}\alpha_t \quad \text{s.t.}\ q \ge 0,\ \ \mu_t - C_t^\top q - \alpha_t \mathbf{1}_m \le 0,\ t=1,\dots,T,$$
where $\mathbf{1}_m$ denotes an $m$-dimensional all-one vector. Here we denote one optimal solution by $(q^*, \alpha^*)$. The dual of the single-step $\mathrm{LP}(\mu_t, C_t)$ is
$$\mathrm{DLP}(\mu_t, C_t) := \min_{q,\alpha}\ \mathbf{b}^\top q + \alpha \quad \text{s.t.}\ q\ge 0,\ \ \mu_t - C_t^\top q - \alpha\mathbf{1}_m \le 0.$$
Here we denote one optimal solution by $(q_t^*, \alpha_t^*)$. The dual optimal solutions $q^*$ and $q_t^*$ are also known as the dual prices, and they quantify the cost efficiency of each arm play. Define
$$\bar q := \max\big\{\|q^*\|_\infty,\ \|q_t^*\|_\infty,\ t=1,\dots,T\big\}.$$
The quantity $\bar q$ captures the maximum amount of achievable reward per unit of resource consumption. We will return with more discussion on this quantity $\bar q$ after we present the regret bound.

Lemma 1. We have the following upper bound on $\bar q$: $\bar q \le \frac{1}{b}$.

Proposition 1. We have
$$\sum_{t=1}^{T}\mathrm{LP}(\mu_t, C_t) \;\le\; \mathrm{LP}(\{\mu_t\},\{C_t\},T) \;\le\; T\cdot\mathrm{LP}(\bar\mu,\bar C) + W_1 + \bar q\, W_2 \;\le\; \sum_{t=1}^{T}\mathrm{LP}(\mu_t, C_t) + 2(W_1 + \bar q\, W_2).$$

Proposition 1 relates the optimal value of the benchmark $\mathrm{LP}(\{\mu_t\},\{C_t\},T)$ to the optimal values of the single-step LPs. To interpret the bound, $\mathrm{LP}(\{\mu_t\},\{C_t\},T)$ works as an upper bound of $\mathrm{OPT}(T)$ in defining the regret, and the summation of $\mathrm{LP}(\mu_t, C_t)$ corresponds to the total reward obtained by evenly allocating the resource over all time periods. In a stationary environment, these two coincide, as the optimal decision naturally corresponds to an even allocation of the resources. In a non-stationary environment, the optimal allocation of the resource for $\mathrm{LP}(\{\mu_t\},\{C_t\},T)$ can be an uneven one. For the problem instance in Section 2.1, the optimal allocation may be either to exhaust all the resource in the first half of the horizon or to preserve all the resource for the second half; in such a case, forcing an even allocation reduces the total reward obtained. The proposition tells us that the reduction can be bounded by $2W_1 + 2\bar q W_2$, where the non-stationarity in resource consumption $W_2$ is weighted by the dual price upper bound $\bar q$.
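As a sanity check on these definitions and on Proposition 1, the following sketch (illustrative only; the small two-arm, single-resource instance and the use of `scipy.optimize.linprog` are ours, not the paper's) computes $W_1$ and $W_2$ and evaluates the three quantities compared in Proposition 1.

```python
import numpy as np
from scipy.optimize import linprog

T, m, d, b = 8, 2, 1, 0.5                              # small illustrative instance
mus = [np.array([0.9, 0.2]) if t < T // 2 else np.array([0.3, 0.8]) for t in range(T)]
Cs  = [np.array([[1.0, 0.2]]) if t < T // 2 else np.array([[0.4, 1.0]]) for t in range(T)]

def single_step_lp(mu, C, rhs):
    """max mu^T x  s.t.  C x <= rhs, x in the m-dimensional simplex."""
    res = linprog(-mu, A_ub=C, b_ub=rhs, A_eq=np.ones((1, m)), b_eq=[1.0],
                  bounds=[(0, None)] * m, method="highs")
    return -res.fun

def benchmark_lp(mus, Cs, B):
    """max sum_t mu_t^T x_t  s.t.  sum_t C_t x_t <= B, each x_t in the simplex."""
    obj = -np.concatenate(mus)                         # linprog minimizes, so negate the rewards
    A_ub = np.hstack(Cs)                               # sums the consumption over time
    A_eq = np.kron(np.eye(T), np.ones((1, m)))         # one simplex constraint per period
    res = linprog(obj, A_ub=A_ub, b_ub=np.full(d, B), A_eq=A_eq, b_eq=np.ones(T),
                  bounds=[(0, None)] * (T * m), method="highs")
    return -res.fun

mu_bar, C_bar = np.mean(mus, axis=0), np.mean(Cs, axis=0)
W1 = sum(np.abs(mu - mu_bar).max() for mu in mus)          # sum_t ||mu_t - mu_bar||_inf
W2 = sum(np.abs(C - C_bar).sum(axis=0).max() for C in Cs)  # sum_t ||C_t - C_bar||_1

lhs = sum(single_step_lp(mu, C, np.full(d, b)) for mu, C in zip(mus, Cs))
mid = benchmark_lp(mus, Cs, b * T)
avg = T * single_step_lp(mu_bar, C_bar, np.full(d, b))
print(f"sum_t LP(mu_t, C_t)  = {lhs:.3f}")
print(f"LP({{mu_t}},{{C_t}},T)   = {mid:.3f}")
print(f"T*LP(mu_bar, C_bar)  = {avg:.3f},  W1 = {W1:.2f},  W2 = {W2:.2f},  1/b = {1/b:.1f}")
```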
3 Sliding-Window UCB for Non-stationary BwK

In this section, we adapt the standard sliding-window UCB algorithm to the BwK problem (Algorithm 1) and derive a near-optimal regret bound. The algorithm terminates when any type of resource is exhausted. At each time $t$, it constructs standard sliding-window confidence bounds for the reward and the resource consumption. Specifically, we define the sliding-window estimators by
$$\hat\mu^{(w)}_{t,i} := \frac{\sum_{s=1\vee(t-w)}^{t-1} r_{s,i}\,\mathbb{1}\{i_s = i\}}{n^{(w)}_{t,i}+1}, \qquad \hat C^{(w)}_{t,j,i} := \frac{\sum_{s=1\vee(t-w)}^{t-1} c_{s,j,i}\,\mathbb{1}\{i_s = i\}}{n^{(w)}_{t,i}+1},$$
where $n^{(w)}_{t,i} = \sum_{s=1\vee(t-w)}^{t-1}\mathbb{1}\{i_s = i\}$ denotes the number of times that the $i$-th arm has been played in the last $w$ time periods. To be optimistic with respect to the objective value, UCBs are computed for the rewards and LCBs are computed for the resource consumption, respectively. With these confidence bounds, the algorithm solves a single-step LP to prescribe a randomized rule for the arm play decision. Our algorithm can be viewed as a combination of the standard sliding-window UCB algorithm [Garivier and Moulines, 2008, Besbes et al., 2015] with the UCB algorithm for BwK [Agrawal and Devanur, 2014]. It makes a minor change compared to [Agrawal and Devanur, 2014], which solves a single-step LP with a shrinkage factor $(1-\epsilon)$ on the right-hand side. The shrinkage factor therein ensures that the resources will not be exhausted until the end of the horizon, but it is not essential to solving the problem. For simplicity, we choose the more natural version of the algorithm, which directly solves the single-step LP. We remark that the knowledge of the initial resource budget $B$ and the time horizon $T$ is only used to define the right-hand side of the constraints of $\mathrm{LP}(\mathrm{UCB}_t(\mu_t), \mathrm{LCB}_t(C_t))$.

Algorithm 1 Sliding-Window UCB Algorithm for BwK
Input: Initial resource budget $B$, time horizon $T$, window sizes $w_1$ (for reward) and $w_2$ (for resource consumption).
Output: Arm play indices $\{i_t\}$.
1: while $t \le T$ do
2:   if $\sum_{s=1}^{t-1} c_{s,j,i_s} > B$ for some $j$ then
3:     Break
4:     %% Terminate the procedure if any resource is exhausted.
5:   end if
6:   Construct confidence bounds $\mathrm{UCB}_t(\mu_t)$, $\mathrm{LCB}_t(C_t)$ with window sizes $w_1$, $w_2$:
$$\mathrm{UCB}_{t,i}(\mu_t) := \hat\mu^{(w_1)}_{t,i} + \sqrt{\frac{\log(12mT^3)}{n^{(w_1)}_{t,i}+1}}, \qquad \mathrm{LCB}_{t,j,i}(C_t) := \hat C^{(w_2)}_{t,j,i} - \sqrt{\frac{\log(12mdT^3)}{n^{(w_2)}_{t,i}+1}}.$$
7:   Solve the single-step problem $\mathrm{LP}(\mathrm{UCB}_t(\mu_t), \mathrm{LCB}_t(C_t))$
8:   Denote its optimal solution by $x_t^* = (x^*_{t,1}, \dots, x^*_{t,m})$
9:   Pick arm $i_t$ randomly according to $x_t^*$, i.e., $\mathbb{P}(i_t = i) = x^*_{t,i}$
10:  Observe the realized reward $r_{t,i_t}$ and resource consumption $c_{t,j,i_t}$ for $j\in[d]$
11: end while
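The following is a minimal, illustrative Python sketch of Algorithm 1 (not the authors' code): it uses the sliding-window estimators and confidence radii from the pseudocode above, relies on `scipy.optimize.linprog` as a stand-in LP solver, and handles an infeasible single-step LP crudely by falling back to a uniform play distribution. The toy environment at the end is made up.

```python
import numpy as np
from scipy.optimize import linprog

def sliding_window_ucb_bwk(env, T, B, m, d, w1, w2, rng):
    """Sketch of Algorithm 1. env(t, i, rng) returns (reward, consumption vector of length d)."""
    b = B / T                                           # per-period budget (Assumption 1)
    hist = []                                           # tuples (time, arm, reward, consumption)
    spent, reward = np.zeros(d), 0.0
    for t in range(1, T + 1):
        if np.any(spent > B):                           # stop once any resource is exhausted
            break
        ucb, lcb = np.zeros(m), np.zeros((d, m))
        for i in range(m):
            rec1 = [(r, c) for (s, a, r, c) in hist if a == i and s >= t - w1]
            rec2 = [c for (s, a, r, c) in hist if a == i and s >= t - w2]
            n1, n2 = len(rec1), len(rec2)
            ucb[i] = (sum(r for r, _ in rec1) / (n1 + 1)
                      + np.sqrt(np.log(12 * m * T**3) / (n1 + 1)))
            lcb[:, i] = (sum(rec2, np.zeros(d)) / (n2 + 1)
                         - np.sqrt(np.log(12 * m * d * T**3) / (n2 + 1)))
        # Single-step LP: max UCB^T x  s.t.  LCB x <= b (per resource), x in the simplex.
        res = linprog(-ucb, A_ub=lcb, b_ub=np.full(d, b),
                      A_eq=np.ones((1, m)), b_eq=[1.0],
                      bounds=[(0, None)] * m, method="highs")
        x = res.x if res.success else np.full(m, 1.0 / m)   # crude fallback if the LP fails
        x = np.clip(x, 0.0, None); x /= x.sum()
        i_t = rng.choice(m, p=x)                        # randomized arm play
        r, c = env(t, i_t, rng)
        c = np.asarray(c, dtype=float)
        hist.append((t, i_t, r, c))
        reward += r
        spent += c
    return reward, spent

# Toy usage: arm 0 is a Bernoulli arm, arm 1 is the null arm; one resource with B = 50.
env = lambda t, i, rng: ((rng.binomial(1, 0.7), [rng.binomial(1, 0.5)]) if i == 0 else (0.0, [0.0]))
print(sliding_window_ucb_bwk(env, T=200, B=50, m=2, d=1, w1=100, w2=100,
                             rng=np.random.default_rng(0)))
```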
Now we begin to analyze the algorithm's performance. For starters, the following lemma states a standard concentration result for the sliding-window confidence bounds.

Lemma 2. The following inequalities hold for all $t = 1, \dots, T$ with probability at least $1-\frac{1}{3T}$:
$$\mathrm{UCB}_{t,i}(\mu_t) + \sum_{s=1\vee(t-w_1)}^{t-1}\|\mu_s - \mu_{s+1}\|_\infty \ \ge\ \mu_{t,i}\ \ \forall i, \qquad \mathrm{LCB}_{t,j,i}(C_t) - \sum_{s=1\vee(t-w_2)}^{t-1}\|C_{s,j} - C_{s+1,j}\|_\infty \ \le\ C_{t,j,i}\ \ \forall j, i,$$
where the UCB and LCB estimators are defined in Algorithm 1.

With Lemma 2, we can employ a concentration argument to relate the realized reward (or resource consumption) to the reward (or resource consumption) of the LP under its optimal solution. In Lemma 3, recall that $\tau$ is the termination time of the algorithm at which some type of resource is exhausted, and $x_s^*$ is defined in Algorithm 1 as the optimal solution of the LP solved at time $s$.

Lemma 3. For Algorithm 1, the following inequalities hold for all $t \le \min\{\tau, T\}$ with probability at least $1-\frac{1}{2T}$:
$$\sum_{s=1}^{t}\big(r_{s,i_s} - \mathrm{UCB}_s(\mu_s)^\top x_s^*\big) \le \sqrt{T\log(12mT^3)} + 8T\sqrt{\frac{m\log(12mT^3)}{w_1}} + w_1 V_1,$$
$$\sum_{s=1}^{t}\big(c_{s,j,i_s} - \mathrm{LCB}_{s,j}(C_s)^\top x_s^*\big) \le \sqrt{T\log(12mdT^3)} + 8T\sqrt{\frac{m\log(12mdT^3)}{w_2}} + w_2 V_2 \quad \text{for all } j.$$

We note that the single-step LP's optimal solution is always subject to the resource constraints. So the second group of inequalities in Lemma 3 implies the following bound on the termination time $\tau$. Recall that $b$ is the resource budget per time period; for a larger $b$, the resource consumption process becomes more stable and the budget is accordingly less likely to be exhausted too early.

Corollary 1. If we choose $w_2 = \min\Big\{\frac{m^{1/3} T^{2/3}\log^{1/3}(12mdT^3)}{V_2^{2/3}},\ T\Big\}$ in Algorithm 1, the following inequality holds with probability at least $1-\frac{1}{2T}$:
$$T - \tau \le \frac{1}{b}\Big(10\, m^{1/3} V_2^{1/3} T^{2/3}\log^{1/3}(12mdT^3) + 8\log(12mdT^3) + 4\sqrt{T\log(12mdT^3)} + V_2^{1/3} T^{2/3}\Big).$$

To summarize, Lemma 3 compares the realized reward with the cumulative reward of the single-step LPs, and Corollary 1 bounds the termination time of the algorithm. Recall that Proposition 1 relates the cumulative reward of the single-step LPs to the underlying LP, i.e., the regret benchmark. Putting these results together, we can optimize $w_1$ and $w_2$ by choosing
$$w_1 = \min\Big\{\frac{m^{1/3} T^{2/3}\log^{1/3}(12mT^3)}{V_1^{2/3}},\ T\Big\}, \qquad w_2 = \min\Big\{\frac{m^{1/3} T^{2/3}\log^{1/3}(12mdT^3)}{V_2^{2/3}},\ T\Big\},$$
and then obtain the final regret upper bound as follows.

Theorem 1. Under Assumption 1, the regret of Algorithm 1 is upper bounded as
$$\mathrm{Reg}(\pi_1, T) \le \frac{1}{b}\Big(\sqrt{T\log(12mdT^3)} + (10+2d)\, m^{1/3} V_2^{1/3} T^{2/3}\log^{1/3}(12mdT^3) + 8\log(12mdT^3)\Big)$$
$$\qquad\qquad + \sqrt{T\log(12mT^3)} + 12\, m^{1/3} V_1^{1/3} T^{2/3}\log^{1/3}(12mT^3) + 2(W_1 + \bar q\, W_2)$$
$$= O\Big(\sqrt{mT} + m^{1/3} V_1^{1/3} T^{2/3} + \frac{1}{b}\, m^{1/3} d\, V_2^{1/3} T^{2/3} + W_1 + \bar q\, W_2\Big),$$
where $\pi_1$ denotes the policy specified by Algorithm 1 and $O(\cdot)$ hides universal constants and logarithmic factors.

Theorem 1 provides a regret upper bound for Algorithm 1 that consists of several parts. The first part of the regret bound is on the order of $\sqrt{mT}$, and it captures the regret when the underlying environment is stationary. The remaining parts of the regret bound characterize the relation between the intensity of non-stationarity and the algorithm's performance. The non-stationarity from both the reward and the resource consumption contributes to the regret bound, and the contribution from the resource consumption is weighted by a factor of $\frac{1}{b}$ or $\bar q$ (see Lemma 1 for the relation between these two). For the local non-stationarity $V_1$ and $V_2$, the algorithm requires prior knowledge of them to decide the window lengths, aligned with the existing works on non-stationarity in unconstrained settings. For the global non-stationarity $W_1$ and $W_2$, the algorithm does not require any prior knowledge, and they contribute additively to the regret bound.
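To see where the particular window lengths come from, one can balance the two $w_1$-dependent terms in Lemma 3 (a heuristic sketch, ignoring constants; the computation for $w_2$ is analogous with $V_2$ and the extra $d$ inside the logarithm):
$$T\sqrt{\frac{m\log(12mT^3)}{w_1}} \;\asymp\; w_1 V_1 \quad\Longrightarrow\quad w_1 \;\asymp\; \frac{m^{1/3} T^{2/3} \log^{1/3}(12mT^3)}{V_1^{2/3}}.$$
Plugging this choice back into either term gives the $m^{1/3} V_1^{1/3} T^{2/3}\log^{1/3}(12mT^3)$ contribution in Theorem 1; when $V_1$ is so small that the formula exceeds $T$, the window is capped at $T$ and the corresponding term reduces to the stationary $\sqrt{mT}$-type contribution.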
Together with the lower bound results in Theorem 2, we argue that the regret bound cannot be further improved even with the knowledge of $W_1$ and $W_2$. When the underlying environment degenerates from a non-stationary one to a stationary one, all the terms related to $V_1$, $V_2$, $W_1$ and $W_2$ disappear, and the upper bound in Theorem 1 matches the regret upper bound for the stochastic BwK setting. In Theorem 1, we choose to represent the upper bound in terms of $b$ and $T$ so as to reveal its dependency on $T$ and draw a better comparison with the literature on unconstrained bandits problems. We provide a second version of Theorem 1 in Appendix E that matches the existing high-probability bounds stated in terms of $\mathrm{OPT}(T)$ [Badanidiyuru et al., 2013, Agrawal and Devanur, 2014]. In contrast to the $\Theta(\log T)$-competitiveness result for adversarial BwK [Immorlica et al., 2019], our result implies that, with a proper measure of the non-stationarity/adversity, the sliding-window design provides an effective approach to robustify the algorithm's performance when the underlying environment changes from stationary to non-stationary, and the algorithm's performance does not drastically deteriorate when the intensity of the non-stationarity is small.

When the resource constraints become non-binding for the underlying LPs, the underlying environment degenerates from a constrained setting to an unconstrained setting. We separate the discussion into two cases: (i) the benchmark LP and all the single-step LPs have only non-binding constraints; (ii) the benchmark LP has only non-binding constraints but some single-step LPs have binding constraints. For case (i), the regret bound in Theorem 1 matches the non-stationary MAB bound [Besbes et al., 2014]. For case (ii), the match does not happen, and this is inevitable. We elaborate on the discussion in Appendix D.

Theorem 2 (Regret lower bounds). The following lower bounds hold for any policy $\pi$:
(i) $\mathrm{Reg}(\pi, T) = \Omega\big(m^{1/3} V_1^{1/3} T^{2/3}\big)$;
(ii) $\mathrm{Reg}(\pi, T) = \Omega\big(\frac{1}{b}\, V_2^{1/3} T^{2/3}\big)$;
(iii) $\mathrm{Reg}(\pi, T) = \Omega\big(W_1 + \bar q\, W_2\big)$.

Theorem 2 presents a few lower bounds for the problem. The first and the second lower bounds are adapted from the lower bound example for non-stationary MAB [Besbes et al., 2014], and the third lower bound is adapted from the motivating example in Section 2.1. There are simple examples where each one of these three lower bounds dominates the other two. In this sense, all the non-stationarity-related terms in the upper bound of Theorem 1 are necessary, including the parameters $1/b$ and $\bar q$. There remains one gap between the lower bound and the upper bound with regard to the number of constraints $d$ in the term related to $V_2$; we leave it as future work to reduce this factor to $\log d$ with some finer analysis. Furthermore, we provide a sharper definition of the global non-stationarity measures, $W_1^{\min}$ and $W_2^{\min}$, in replacement of $W_1$ and $W_2$ in Appendix C.2. It makes no essential change to our analysis, and the two sets of measures coincide on the lower bound problem instance. We choose to use $W_1$ and $W_2$ for simplicity of presentation, while $W_1^{\min}$ and $W_2^{\min}$ can capture a more detailed temporal structure of the non-stationarity. The discussion leaves an open question of whether the knowledge of some additional structure of the environment can further reduce the global non-stationarity.
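To see how the different non-stationarity terms can dominate one another, the following sketch (purely illustrative; the reward sequences are made up, with $m = 2$ arms in mind so that the sequence below is that of the actual arm) computes $V_1$, $W_1$ and the $V_1$-related term $m^{1/3}V_1^{1/3}T^{2/3}$ for two simple cases: a single mid-horizon shift (tiny $V_1$, large $W_1$) and a small fast oscillation (the $V_1$-term dominates).

```python
import numpy as np

T, m = 100_000, 2
t = np.arange(T)

def v1_w1(mu_seq):
    """Local and global non-stationarity of a scalar reward sequence (single actual arm)."""
    V1 = np.abs(np.diff(mu_seq)).sum()          # sum_t |mu_t - mu_{t+1}|
    W1 = np.abs(mu_seq - mu_seq.mean()).sum()   # sum_t |mu_t - mu_bar|
    return V1, W1

# Instance A: one adversarial shift of size eps at mid-horizon (as in Section 2.1).
eps = 0.2
mu_shift = np.where(t < T // 2, 1.0, 1.0 + eps)
# Instance B: fast oscillation of small amplitude delta around a fixed mean.
delta = T ** -0.5
mu_osc = 0.5 + delta * (-1.0) ** t

for name, mu in [("single shift", mu_shift), ("fast oscillation", mu_osc)]:
    V1, W1 = v1_w1(mu)
    print(f"{name:17s}  V1 = {V1:9.2f}  W1 = {W1:9.2f}  "
          f"m^(1/3) V1^(1/3) T^(2/3) = {m**(1/3) * V1**(1/3) * T**(2/3):9.2f}")
```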
4 Extension to Online Convex Optimization with Constraints

In this section, we show how our notion of non-stationarity measure can be extended to the problem of online convex optimization with constraints (OCOwC). Similar to BwK, OCOwC also models a sequential decision-making problem under the presence of constraints. Specifically, at each time $t$, the player chooses an action $x_t$ from some convex set $X$. After the choice, a convex cost function $f_t: X\to\mathbb{R}$ and a convex resource consumption (constraint) function $g_t = (g_{t,1},\dots,g_{t,d}): X\to\mathbb{R}^d$ are revealed. As in the standard setting of OCO, the functions $f_t$ are adversarially chosen, and thus a static benchmark is considered, defined by
$$\mathrm{OPT}(T) := \min_{x\in X}\ \sum_{t=1}^{T} f_t(x) \quad \text{s.t.}\ \sum_{t=1}^{T} g_{t,i}(x) \le 0 \ \text{ for } i\in[d].$$
Denote its optimal solution by $x^*$ and its dual optimal solution by $q^*$. While the existing works consider the case where the $g_t$'s are static or sampled i.i.d. from some distribution $P$, we consider a non-stationary setting where $g_t$ may change adversarially over time. We define a global non-stationarity measure by
$$W := \sum_{t=1}^{T}\sum_{j=1}^{d} \|g_{t,j} - \bar g_j\|_\infty,$$
where $\bar g_j = \frac{1}{T}\sum_{t=1}^{T} g_{t,j}$ and $\|f\|_\infty := \sup_{x\in X}|f(x)|$. The OCOwC problem considers the following bi-objective performance measure:
$$\mathrm{Reg}_1(\pi, T) = \sum_{t=1}^{T} f_t(x_t) - \mathrm{OPT}(T), \qquad \mathrm{Reg}_2(\pi, T) = \sum_{i=1}^{d}\Big(\sum_{t=1}^{T} g_{t,i}(x_t)\Big)_+,$$
where $(\cdot)_+$ denotes the positive-part function and $\pi$ denotes the policy/algorithm. Analogous to the single-step LPs, we consider an optimization problem with more restricted constraints:
$$\mathrm{OPT}'(T) := \min_{x\in X}\ \sum_{t=1}^{T} f_t(x) \quad \text{s.t.}\ g_{t,i}(x) \le 0 \ \text{ for } t\in[T],\ i\in[d].$$
Denote its optimal solution by $x'^*$ and its dual optimal solution by $q'^*$.

Assumption 2. We assume that Slater's condition holds for both the standard OCOwC program $\mathrm{OPT}(T)$ and the restricted OCOwC program $\mathrm{OPT}'(T)$. We assume that $f_t$, $\nabla f_t$, $g_{t,i}$, and $\nabla g_{t,i}$ are uniformly bounded on $X$ and that $X$ itself is bounded. Moreover, we assume that the dual optimal solutions are uniformly bounded by $\bar q$, i.e., $\bar q = \max\{\|q^*\|_\infty, \|q'^*\|_\infty\}$.

The following proposition relates the two optimal objective values.

Proposition 2. For the OCOwC problem, under Assumption 2, we have
$$0 \le \mathrm{OPT}'(T) - \mathrm{OPT}(T) \le \bar q\, W.$$

Utilizing the proposition, we can show that the gradient-based algorithm of [Neely and Yu, 2017] achieves the following regret for the setting of OCO with non-stationary constraints. Moreover, we further extend the results in Appendix F and discuss an oblivious adversarial setting where $g_t$ is sampled from some distribution $P_t$ and the distribution $P_t$ may change over time.

Theorem 3. Under Assumption 2, the Virtual Queue Algorithm of [Neely and Yu, 2017], applied to any OCOwC problem (and denoted by $\pi_2$), produces a decision sequence $\{x_t\}$ such that
$$\mathrm{Reg}_1(\pi_2, T) \le O\big(\sqrt{T} + \bar q\, W\big), \qquad \mathrm{Reg}_2(\pi_2, T) \le O\big(d\sqrt{T}\big).$$

The theorem tells us that the non-stationarity, when measured properly, does not drastically deteriorate the performance of the algorithm for the OCOwC problem either. Moreover, the non-stationarity does not affect the constraint violation at all. Together with the results for the BwK problem, we argue that the new global non-stationarity measure serves as a proper one for constrained online learning problems. Note that the upper and lower bounds match up to a logarithmic factor (in a worst-case sense) subject to the non-stationarity measures. A future direction is to refine the bounds in a more instance-dependent way and to identify useful prior knowledge on the non-stationarity for better algorithm design and analysis.
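To illustrate the measure $W$ and the gap controlled by Proposition 2, the following sketch (an illustrative one-dimensional instance with linear $f_t$ and $g_t$, not taken from the paper) compares $\mathrm{OPT}(T)$ with the restricted value $\mathrm{OPT}'(T)$ and evaluates $W$ in closed form.

```python
import numpy as np

# Illustrative 1-D instance (not from the paper): X = [0, 1], f_t(x) = 1 - x, and
# g_t(x) = x - c_t, i.e., the t-th constraint asks for x <= c_t.
T = 100
c = np.where(np.arange(T) % 2 == 0, 0.6, 0.4)     # the constraint level alternates over time

# Static benchmark OPT(T): sum_t g_t(x) <= 0  <=>  x <= mean(c).
x_opt = min(c.mean(), 1.0)
OPT = T * (1.0 - x_opt)
# Restricted program OPT'(T): g_t(x) <= 0 for every t  <=>  x <= min_t c_t.
x_res = min(c.min(), 1.0)
OPT_res = T * (1.0 - x_res)

# Global non-stationarity of the constraints: W = sum_t sup_x |g_t(x) - g_bar(x)|.
# Here g_t(x) - g_bar(x) = mean(c) - c_t is constant in x, so the sup is |c_t - mean(c)|.
W = np.abs(c - c.mean()).sum()

print(f"OPT(T) = {OPT:.1f}, OPT'(T) = {OPT_res:.1f}, gap = {OPT_res - OPT:.1f}, W = {W:.1f}")
# Proposition 2 bounds the gap OPT'(T) - OPT(T) by qbar * W.
```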
5 Discussions

In this paper, we study the non-stationary setting of the BwK problem. We remark that our formulation is not merely a variant of stochastic BwK and adversarial BwK; rather, it should be viewed as a generalization of both. When the underlying distribution $P_t$ stays unchanged over time, our formulation degenerates into the stochastic BwK problem. Our formulation also allows $P_t$ to be a point-mass distribution and to be chosen adversarially, so it recovers the setting of adversarial BwK. Different from the existing worst-case results on adversarial BwK, our result characterizes a problem-dependent performance that relates the regret to the temporal change of the $P_t$'s. Two important applications (among others, see Badanidiyuru et al., 2013) are the AdWords problem (under pay-by-click and pay-by-conversion) and the pricing problem, where the knapsack constraints capture the bidder's budget or the available inventory. Under such application contexts, the distributions of the arms' rewards and/or resource consumption may change over time; for example, the bidders' bidding policies may change according to their remaining budgets, and the underlying market environment may change due to seasonality, day-of-week effects, promotions, etc. Both our work and adversarial BwK aim to capture such violations of the i.i.d. assumption in stochastic BwK. For these applications, the stochastic setting is too idealized, while the adversarial setting is too worst-case/conservative; non-stationarity provides a smooth connection between these two ends. This spirit inherits from the study of non-stationary environments for unconstrained online learning problems [Besbes et al., 2014, 2015, Cheung et al., 2019, Faury et al., 2021].

Technically, we first make a comparison between the existing results on MAB and on BwK. For the stochastic setting, as discussed in Appendix D, when the constraints are non-binding, the stochastic BwK regret bound can recover the regret bound of a corresponding MAB problem. However, for the adversarial setting, the EXP3 algorithm achieves an $O(\sqrt{T})$ regret bound for the adversarial MAB problem against a static benchmark, while the state-of-the-art adversarial BwK algorithm only achieves an $O(\log T)$ competitive ratio against the static benchmark, i.e., even worse than a linear regret. By this comparison, we believe the BwK problem in a non-i.i.d. (non-stochastic) environment is pessimistically difficult and far from being resolved. In this light, our work provides a positive result for the problem. We provide in Appendix C.1 a detailed discussion and comparison of the benchmarks used in the existing BwK works. Numerical experiments comparing the performance of our algorithm with existing BwK algorithms are presented in Appendix A.

The key for our paper to achieve this result is the proposal of the new non-stationarity measure, while the algorithm and analysis are largely standard as in the UCB literature. The standard analysis also appears in existing works on non-stationary online learning/optimization [Besbes et al., 2014, 2015, Cheung et al., 2019, Faury et al., 2021], where the techniques more or less follow the paradigm of combining a sliding-window concentration argument with the analysis in a corresponding context of stochastic optimization, MAB, linear bandits, or RL. Our measure is new in that all existing works on non-stationarity study unconstrained settings.
Our measure is also critical for the constrained setting; we believe its applications go beyond the BwK and OCO problems, and it also provides a useful measure for other constrained problems such as constrained MDPs and safe RL. Insight-wise, our discussion in Section 2.1 not only validates the criticality of such a measure but also highlights that, even if one does not need to perform any learning of the system, the non-stationarity in a constrained problem can still hurt. This intuition is orthogonal to the implications of the existing works on non-stationarity, which mainly focus on remedying the negative effect that non-stationarity induces on the learning of the system.

References

Shipra Agrawal and Nikhil R Devanur. Bandits with concave rewards and convex knapsacks. In Proceedings of the Fifteenth ACM Conference on Economics and Computation, pages 989–1006, 2014.

Peter Auer, Nicolo Cesa-Bianchi, Yoav Freund, and Robert E Schapire. Gambling in a rigged casino: The adversarial multi-armed bandit problem. In Proceedings of IEEE 36th Annual Foundations of Computer Science, pages 322–331. IEEE, 1995.

Ashwinkumar Badanidiyuru, Robert Kleinberg, and Aleksandrs Slivkins. Bandits with knapsacks. In 2013 IEEE 54th Annual Symposium on Foundations of Computer Science, pages 207–216. IEEE, 2013.

Omar Besbes and Assaf Zeevi. Blind network revenue management. Operations Research, 60(6):1537–1550, 2012.

Omar Besbes, Yonatan Gur, and Assaf Zeevi. Stochastic multi-armed-bandit problem with non-stationary rewards. Advances in Neural Information Processing Systems, 27, 2014.

Omar Besbes, Yonatan Gur, and Assaf Zeevi. Non-stationary stochastic optimization. Operations Research, 63(5):1227–1244, 2015.

Wang Chi Cheung, David Simchi-Levi, and Ruihao Zhu. Learning to optimize under non-stationarity. In The 22nd International Conference on Artificial Intelligence and Statistics, pages 1079–1087. PMLR, 2019.

Louis Faury, Yoan Russac, Marc Abeille, and Clément Calauzènes. Regret bounds for generalized linear bandits under parameter drift. arXiv preprint arXiv:2103.05750, 2021.

Arthur Flajolet and Patrick Jaillet. Logarithmic regret bounds for bandits with knapsacks. arXiv preprint arXiv:1510.01800, 2015.

Aurélien Garivier and Eric Moulines. On upper-confidence bound policies for non-stationary bandit problems. arXiv preprint arXiv:0805.3415, 2008.

Negin Golrezaei, Hamid Nazerzadeh, and Paat Rusmevichientong. Real-time optimization of personalized assortments. Management Science, 60(6):1532–1551, 2014.

Nicole Immorlica, Karthik Abinav Sankararaman, Robert Schapire, and Aleksandrs Slivkins. Adversarial bandits with knapsacks. In 2019 IEEE 60th Annual Symposium on Foundations of Computer Science (FOCS), pages 202–219. IEEE, 2019.

Rodolphe Jenatton, Jim Huang, and Cédric Archambeau. Adaptive algorithms for online convex optimization with long-term constraints. In International Conference on Machine Learning, pages 402–411. PMLR, 2016.

Jiashuo Jiang, Xiaocheng Li, and Jiawei Zhang. Online stochastic optimization with Wasserstein based non-stationarity. arXiv preprint arXiv:2012.06961, 2020.

Xiaocheng Li, Chunlin Sun, and Yinyu Ye. The symmetry between arms and knapsacks: A primal-dual approach for bandits with knapsacks. In International Conference on Machine Learning, pages 6483–6492. PMLR, 2021.

Aranyak Mehta, Amin Saberi, Umesh Vazirani, and Vijay Vazirani. Adwords and generalized on-line matching. In 46th Annual IEEE Symposium on Foundations of Computer Science (FOCS'05), pages 264–273. IEEE, 2005.
Michael J Neely and Hao Yu. Online convex optimization with time-varying constraints. arXiv preprint arXiv:1702.04783, 2017.

Anshuka Rangi, Massimo Franceschetti, and Long Tran-Thanh. Unifying the stochastic and the adversarial bandits with knapsack. arXiv preprint arXiv:1811.12253, 2018.

Karthik Abinav Sankararaman and Aleksandrs Slivkins. Bandits with knapsacks beyond the worst case. Advances in Neural Information Processing Systems, 34, 2021.

Xinlei Yi, Xiuxian Li, Tao Yang, Lihua Xie, Tianyou Chai, and Karl Johansson. Regret and cumulative constraint violation analysis for online convex optimization with long term constraints. In International Conference on Machine Learning, pages 11998–12008. PMLR, 2021.

Jia Yuan Yu and Shie Mannor. Piecewise-stationary bandit problems with side observations. In Proceedings of the 26th Annual International Conference on Machine Learning, pages 1177–1184, 2009.

Jianjun Yuan and Andrew Lamperski. Online convex optimization for cumulative constraints. Advances in Neural Information Processing Systems, 31, 2018.