# Piecewise-Stationary Bandits with Knapsacks

Xilin Zhang, Department of ISEM, National University of Singapore, Singapore 117578, zhangxilin@u.nus.edu

Cheung Wang Chi, Department of ISEM, National University of Singapore, Singapore 117578, isecwc@nus.edu.sg

**Abstract.** We propose a novel inventory-reserving algorithm that draws new insights into Bandits with Knapsacks (BwK) problems in piecewise-stationary environments. Suppose parameters $\eta_{\min}, \eta_{\max} \in (0, 1]$ respectively lower and upper bound the ratio between the reward earned and the resources consumed in a round. Our algorithm achieves a provably near-optimal competitive ratio of $O(\log(\eta_{\max}/\eta_{\min}))$, and we provide a matching lower bound. Our performance guarantee is stated against a dynamic benchmark that upper bounds the optimum, unlike existing works on adversarial BwK (Immorlica et al. (2019); Kesselheim and Singla (2020)), which compare against a stationary benchmark. Unlike the existing non-stationary BwK work of Liu et al. (2022), we do not require a bounded global variation.

## 1 Introduction

In a bandits with knapsacks (BwK) problem, each action $a$ in the action set $\mathcal{K}$ is associated with a latent and random amount of reward earned, $R_t(a)$, and resource consumed, $C_t(a)$, in each round $t = 1, \ldots, T$. A decision maker (DM) selects an action $a_t \in \mathcal{K}$ in round $t$, and observes bandit feedback $(R_t(a_t), C_t(a_t))$. The DM aims to maximize the total reward $\sum_{t=1}^T R_t(a_t)$ while satisfying the hard capacity constraint $\sum_{t=1}^T C_t(a_t) \le B$. BwK has many real-life applications such as dynamic pricing (Babaioff et al. (2015)), resource allocation (Zhalechian et al. (2022)), online auctions (Balseiro and Gur (2019)) and assortment planning (Agrawal et al. (2019)).

Stochastic BwK was first introduced by Badanidiyuru et al. (2018), followed by generalizations to concave rewards with convex constraints (Agrawal and Devanur (2014)), combinatorial bandits (Sankararaman and Slivkins (2018)) and contextual bandits (Badanidiyuru et al. (2014); Agrawal and Devanur (2016)). In stochastic BwK problems, the expected feedback $\mathbb{E}[(R_t(a), C_t(a))] = (r(a), c(a))$ is stationary for all $a \in \mathcal{K}$, $t \in \{1, \ldots, T\}$, and a sublinear-in-$T$ regret is achievable. Nevertheless, the stationary model can be too idealized in many applications.

Adversarial BwK was first considered in Immorlica et al. (2019), where $(r_t, c_t) = \{(r_t(a), c_t(a))\}_{a \in \mathcal{K}}$ can change arbitrarily over the horizon. They achieve a competitive ratio (CR) of $O(d \log T)$ with respect to a static benchmark when there are $d$ budget constraints. A static benchmark picks a fixed optimal action (or a fixed optimal distribution over arms), and applies the same action (or distribution) in all $T$ rounds. Kesselheim and Singla (2020) further improve the CR to $O(\log(d) \log T)$. Other papers consider different regimes, such as unlimited rounds (Rangi et al. (2018)), large budget $B = \Omega(T)$ (Castiglioni et al. (2022a)), strict feasibility (Castiglioni et al. (2022b)) and approximate stationarity (Fikioris and Tardos (2023)). All these works compare with static benchmarks (see Appendix A.1). Moreover, adversarial BwK can be too conservative in certain real-life scenarios. For instance, sales patterns can be stationary for a duration of time and change only during hot seasons, promotions or new trends, which fits our piecewise-stationary BwK regime.
An abundance of existing works explores adversarial online knapsack problems with full feedback, where $(r_t, c_t)$ can change arbitrarily but the realized values $\{(R_t(a), C_t(a))\}_{a \in \mathcal{K}}$ are observed before choosing $a_t$ (Karp et al. (1990); Mehta et al. (2007); Zhou et al. (2008)). Many of these works compare their accrued rewards with dynamic benchmarks (stronger than the static benchmarks used in adversarial BwK), where the DM picks different optimal actions $a_t^*$ in different rounds. However, the dynamic benchmarks considered in the above papers are best-single-arm benchmarks, which only allow pulling a single arm in each round, while our benchmark is a best-distribution-over-arms benchmark (see Appendix A.2 for a more detailed elaboration). Further, the ability to observe $\{(R_t(a), C_t(a))\}_{a \in \mathcal{K}}$ before selecting $a_t$ is crucial to the algorithm designs in Karp et al. (1990); Mehta et al. (2007); Zhou et al. (2008). Their algorithm designs cannot be readily generalized to the bandit setting, where the DM only observes $(R_t(a_t), C_t(a_t))$ after selecting $a_t$.

Another line of recent research investigates non-stationary online knapsack problems with either full feedback (Jiang et al. (2020); Balseiro et al. (2022)) or bandit feedback (Liu et al. (2022)). These works quantify the scale of non-stationarity in terms of both the local variation $\Delta_{\mathrm{loc}} = \sum_{t=1}^{T-1} \mathrm{dist}((r_{t+1}, c_{t+1}), (r_t, c_t))$ and the global variation $\Delta_{\mathrm{glo}} = \sum_{t=1}^{T} \mathrm{dist}\big(\sum_{t=1}^{T}(r_t, c_t)/T, (r_t, c_t)\big)$, where dist is a certain metric. Assuming $\max\{\Delta_{\mathrm{loc}}, \Delta_{\mathrm{glo}}\}$ grows sublinearly in $T$, they achieve sublinear-in-$T$ regret bounds compared to the dynamic benchmark. However, $\Delta_{\mathrm{glo}}$ growing sublinearly in $T$ is a rather strong assumption: $\Delta_{\mathrm{glo}}$ can be linear in $T$ even with one change point (see Remark A.3 for more detail). In this work, we consider piecewise-stationary models that allow a higher degree of non-stationarity, which is yet to be studied in the aforementioned works.

**Our contributions.** Firstly, on modeling (see Section 2), the DM does not know the number of change points or when changes happen. We formulate our model as a single-resource problem, and extend to $d$-resource problems in Appendix B.5 with an extra multiplicative factor of $d$ on the competitive ratio. Secondly, on algorithm design (see Sections 3.1 and 4.1), we propose novel algorithms that are natural, intuitive and easy to implement. Our idea of reserving inventory based on the reward-consumption ratio provides new insights into the problem. Thirdly, on performance guarantees (see Sections 3.2 and 4.2), we achieve a provably near-optimal competitive ratio with respect to a best-distribution-over-arms benchmark without requiring a bounded $\Delta_{\mathrm{glo}}$, which distinguishes our work from the existing literature. Specifically, suppose there exist parameters $\eta_{\min}, \eta_{\max} \in (0, 1]$ such that $\eta_{\min} \le r_t(a), c_t(a) \le \eta_{\max}$ for all $t, a$. Our algorithms achieve a competitive ratio of $O(\log(\eta_{\max}/\eta_{\min}))$, which requires a novel analysis. We prove the tightness of our competitive ratio by providing a matching lower bound (see Section 4.4). We also run illustrative numerical experiments (see Section 5) comparing our algorithm with those of Immorlica et al. (2019) and Zhou et al. (2008) under piecewise-stationary settings.

### 2.1 Problem formulation

**Problem dynamics.** The online model involves $T$ rounds, indexed as $t \in \mathcal{T} = \{1, 2, \ldots, T\}$. We index an arm as $a \in \mathcal{K}$. Additionally, we define the null arm $a_{\mathrm{null}}$, where no allocation is made when $a_{\mathrm{null}}$ is chosen.
In round $t \in \mathcal{T}$, the DM chooses an arm $a_t \in \mathcal{K} \cup \{a_{\mathrm{null}}\}$, and observes a noisy outcome vector $(R_t(a_t), C_t(a_t)) \in [0, 1]^2$ as the bandit feedback, where $R_t(a_t)$ and $C_t(a_t)$ are the reward and the resource consumption in round $t$ respectively. We set $R_t(a_{\mathrm{null}}) = C_t(a_{\mathrm{null}}) = 0$ with certainty for all $t \in \mathcal{T}$. The DM is endowed with $B \le T$ units of the resource. The DM's goal is to maximize the total reward, subject to the constraint that the total resource consumption is at most $B$ with certainty. Denote $r_t = \{r_t(a)\}_{a \in \mathcal{K}} = \{\mathbb{E}[R_t(a)]\}_{a \in \mathcal{K}}$ and $c_t = \{c_t(a)\}_{a \in \mathcal{K}} = \{\mathbb{E}[C_t(a)]\}_{a \in \mathcal{K}}$.

We consider a piecewise-stationary setting, where the planning horizon $\mathcal{T}$ is partitioned into $L$ stationary pieces $\{t_0 + 1, \ldots, t_1\}, \{t_1 + 1, \ldots, t_2\}, \ldots, \{t_{L-1} + 1, \ldots, t_L = T\}$. On each stationary piece $l$, we have $(r_t, c_t) = (r^{(l)}, c^{(l)})$ for all $t \in \{t_{l-1} + 1, \ldots, t_l\}$. The DM knows neither the number of rounds $T$, nor the number of stationary pieces $L$, nor the rounds $t_1, \ldots, t_{L-1}$ where changes happen, nor the values of $\{(r^{(l)}, c^{(l)})\}_{l \in \{1, \ldots, L\}}$ and their realized outcomes.

**Goal and benchmark.** Our goal is to develop an online algorithm that maximizes the expected total reward $\mathbb{E}[\sum_{t=1}^T R_t(a_t)]$ while satisfying the inventory constraint $\sum_{t=1}^T C_t(a_t) \le B$, which can be formulated as the following dynamic program DP. In DP, $X_t = \{X_t(a)\}_{a \in \mathcal{K}}$ is a binary decision variable indicating whether to pick arm $a$ in round $t$ (i.e., $X_t(a) = 1$) or not (i.e., $X_t(a) = 0$). An online algorithm is non-anticipatory in the sense that $X_t$ depends only on $B$ and $\{(R_s(a_s), C_s(a_s), X_s)\}_{s=1}^{t-1}$.

$$\mathrm{DP}: \quad \max_{X_t}\ \mathbb{E}\Big[\sum_{t=1}^{T}\sum_{a \in \mathcal{K}} R_t(a) X_t(a)\Big] \quad \text{s.t.}\ \ \sum_{t=1}^{T}\sum_{a \in \mathcal{K}} C_t(a) X_t(a) \le B, \quad \sum_{a \in \mathcal{K}} X_t(a) \le 1 \ \ \forall t \in \mathcal{T}, \quad X_t(a) \in \{0, 1\} \ \ \forall a \in \mathcal{K},\, t \in \mathcal{T}.$$

In non-stationary bandits without resource constraints, the performance bound of an online algorithm takes the form $\sum_{t=1}^T R_t(a_t) \ge \sum_{t=1}^T r_t(a_t^*) - \mathrm{Reg}$, where $\sum_{t=1}^T r_t(a_t^*)$ is the optimal expected reward obtained by choosing the best arm in each round, and Reg is a sublinear-in-$T$ regret characterizing the reward loss. In our BwK setting, due to the inventory constraint, achieving a sublinear-in-$T$ regret is impossible without assuming a bounded global variation (see Appendix A.3). We denote $\mathrm{opt}(P)$ as the optimum of an optimization problem $P$. We aim for a performance guarantee of the form $\sum_{t=1}^T R_t(a_t) \ge \frac{1}{\mathrm{CR}}\,\mathrm{opt(DP)} - \mathrm{Reg}$, where opt(DP) is the optimum of DP, CR is a competitive ratio, and Reg is a sublinear-in-$T$ regret. Unfortunately, DP is hard to solve. Therefore, we define a fluid approximation FA of DP, where $R_t$ and $C_t$ are replaced by their respective expectations, and the decision variables are fractional:

$$\mathrm{FA}: \quad \max\ \sum_{l=1}^{L}(t_l - t_{l-1})\sum_{a \in \mathcal{K}} r^{(l)}(a)\, x_l(a) \quad \text{s.t.}\ \ \sum_{l=1}^{L}(t_l - t_{l-1})\sum_{a \in \mathcal{K}} c^{(l)}(a)\, x_l(a) \le B, \quad \sum_{a \in \mathcal{K}} x_l(a) \le 1 \ \ \forall l = 1, \ldots, L, \quad x_l(a) \ge 0 \ \ \forall a \in \mathcal{K},\, l = 1, \ldots, L.$$

In the following Lemma 2.1 (proved in Appendix C.1), we justify that opt(FA) can serve as a benchmark for our algorithms' performance, since $\sum_{t=1}^T R_t(a_t) \ge \frac{1}{\mathrm{CR}}\,\mathrm{opt(FA)} - \mathrm{Reg} \ge \frac{1}{\mathrm{CR}}\,\mathrm{opt(DP)} - \mathrm{Reg}$, and we aim to derive performance guarantees of the form $\sum_{t=1}^T R_t(a_t) \ge \frac{1}{\mathrm{CR}}\,\mathrm{opt(FA)} - \mathrm{Reg}$.

**Lemma 2.1.** $\mathrm{opt(FA)} \ge \mathrm{opt(DP)}$.
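To make the fluid benchmark concrete, the following minimal sketch (our illustration, not from the paper) computes opt(FA) with `scipy.optimize.linprog`; the piece lengths, `r`, `c` and `B` below are placeholder values.

```python
# Minimal sketch of the fluid benchmark FA, solved as a linear program.
# All instance data below are illustrative placeholders, not values from the paper.
import numpy as np
from scipy.optimize import linprog

lengths = np.array([60, 40])            # t_l - t_{l-1} for each stationary piece
r = np.array([[0.5, 0.9], [0.8, 0.6]])  # r[l][a]: mean reward of arm a on piece l
c = np.array([[0.6, 0.9], [0.7, 0.8]])  # c[l][a]: mean consumption of arm a on piece l
B = 50.0                                # resource budget
L, K = r.shape

# Decision vector stacks x_l(a) for l = 0..L-1, a = 0..K-1.
obj = -(lengths[:, None] * r).ravel()   # maximize => minimize the negation
A_budget = (lengths[:, None] * c).ravel()[None, :]   # single knapsack row
A_simplex = np.kron(np.eye(L), np.ones((1, K)))      # sum_a x_l(a) <= 1 per piece
A_ub = np.vstack([A_budget, A_simplex])
b_ub = np.concatenate([[B], np.ones(L)])

res = linprog(obj, A_ub=A_ub, b_ub=b_ub, bounds=[(0, 1)] * (L * K))
print("opt(FA) =", -res.fun, "x* =", res.x.reshape(L, K).round(3))
```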
### 2.2 Assumptions, limitations and discussions

Compared with the adversarial BwK literature, our piecewise-stationary setting has two limitations. The first concerns $L$, the number of change points. When $L$ is not known, our result is meaningful only when $L = o(\sqrt{T\eta_{\min}})$. When $L$ is known, our result is meaningful when $L = o(T\eta_{\min})$ (see Theorem 4.2). In contrast, existing works on adversarial BwK generally allow $L = T$. The second concerns the value range of non-null actions:

**Assumption 2.2.** For all $a \in \mathcal{K}$, $l \in \{1, \ldots, L\}$, there exist known constants $\eta_{\min}, \eta_{\max} \in (0, 1]$ such that $\eta_{\min} \le r^{(l)}(a), c^{(l)}(a) \le \eta_{\max}$.

While $\eta_{\min}$ can generally be arbitrarily close to $0$, we argue that this assumption is mild, as it holds in many real-life scenarios. For instance, in portfolio management, an investor allocates a limited budget among different investment options (arms) to maximize the overall return. The investor has assessments of the lower and upper ranges of the expected returns for each investment option. The lower range is usually strictly positive, since the investor would not consider investment options with zero or negative expected return. In applications such as dynamic pricing, assortment planning, network resource allocation and energy management, expected profits and consumer demands are usually within a known positive value range. We further justify Assumption 2.2 theoretically in the following Lemma 2.3 (proved in Appendix C.5).

**Lemma 2.3.** For any online algorithm, there exists an instance for which $0 \le r_t(a), c_t(a) \le 1$ for all $a \in \mathcal{K}$, $t \in \mathcal{T}$, and on which $\mathrm{CR} > \Omega(\log(\eta_{\max}/\eta_{\min}))$.

Additionally, in Appendix C.5 we demonstrate that knowing the values of $\eta_{\min}, \eta_{\max}$ is necessary for achieving $\mathrm{CR} = O(\log(\eta_{\max}/\eta_{\min}))$.

### 2.3 High-level idea of our algorithm

**Decomposing opt(FA) in terms of reward-consumption ratios.** Throughout the paper, we fix an optimal solution $\{x_l^*\}_{l=1}^L$ to FA. We define the set $\mathcal{L} = \{l \in \{1, \ldots, L\} : \sum_{a \in \mathcal{K}} x_l^*(a) > 0\}$, which indexes the stationary pieces where non-null allocations are made under the optimal solution $\{x_l^*\}_{l=1}^L$ to FA. For each $l \in \mathcal{L}$, we define

$$\mathrm{Ratio}^{(l)*} = \frac{\sum_{a \in \mathcal{K}} r^{(l)}(a) x_l^*(a)}{\sum_{a \in \mathcal{K}} c^{(l)}(a) x_l^*(a)} \in \Big[\frac{\eta_{\min}}{\eta_{\max}},\ \frac{\eta_{\max}}{\eta_{\min}}\Big], \qquad B_l^* = (t_l - t_{l-1})\sum_{a \in \mathcal{K}} c^{(l)}(a) x_l^*(a). \tag{1}$$

$\mathrm{Ratio}^{(l)*}$, whose value range follows from Assumption 2.2, is the optimal expected reward earned per unit of consumed resource under $\{x_l^*\}_{l=1}^L$. We call $\mathrm{Ratio}^{(l)*}$ the optimal expected reward-consumption ratio of stationary piece $l$. $B_l^*$ represents the optimal expected amount of resource assigned to stationary piece $l$. To aid our algorithm design, we define the following linear program:

$$\mathrm{LP}(r, c, B): \quad \max\ \sum_{a \in \mathcal{K}} r(a) x(a) \quad \text{s.t.}\ \ \sum_{a \in \mathcal{K}} c(a) x(a) \le B, \quad \sum_{a \in \mathcal{K}} x(a) \le 1, \quad x(a) \ge 0 \ \ \forall a \in \mathcal{K}.$$

Then we can express opt(FA) in terms of $\mathrm{LP}(r, c, B)$ and $\mathrm{Ratio}^{(l)*}, B_l^*$ as follows:

$$\mathrm{opt(FA)} = \sum_{l \in \mathcal{L}} (t_l - t_{l-1})\, \mathrm{opt}\Big(\mathrm{LP}\big(r^{(l)}, c^{(l)}, B_l^*/(t_l - t_{l-1})\big)\Big) = \sum_{l \in \mathcal{L}} \mathrm{Ratio}^{(l)*} B_l^*. \tag{2}$$

The first equation in (2) can be verified by noting that $x_l^*$ is feasible to $\mathrm{LP}(r^{(l)}, c^{(l)}, B_l^*/(t_l - t_{l-1}))$ for each $l \in \{1, \ldots, L\}$, and that the concatenation of the optimal solutions of $\{\mathrm{LP}(r^{(l)}, c^{(l)}, B_l^*/(t_l - t_{l-1}))\}_{l \in \mathcal{L}}$ forms a feasible solution to FA. The second equation in (2) holds by the definitions of $\mathrm{Ratio}^{(l)*}, B_l^*$ and the fact that $\mathrm{opt}(\mathrm{LP}(r^{(l)}, c^{(l)}, B_l^*/(t_l - t_{l-1}))) = \sum_{a \in \mathcal{K}} r^{(l)}(a) x_l^*(a)$.

**Algorithm design.** Fix an arbitrary constant $\alpha > 1$ (we set $\alpha = e$ by default, but our results hold for any constant $\alpha > 1$). We define $M = \lceil \log_\alpha(\eta_{\max}/\eta_{\min}) \rceil$ and partition $[\eta_{\min}/\eta_{\max},\ \eta_{\max}/\eta_{\min}]$ into $2M$ intervals $[\alpha^{-M}, \alpha^{-M+1}] \cup \{(\alpha^m, \alpha^{m+1}]\}_{m=-M+1}^{M-1}$. For each stationary piece $l \in \mathcal{L}$, we denote by $m_l^* \in \{-M, \ldots, M-1\}$ the interval such that $\mathrm{Ratio}^{(l)*} \in (\alpha^{m_l^*}, \alpha^{m_l^*+1}]$. In the forthcoming discussion, with some abuse of notation, we sometimes write the interval $[\alpha^{-M}, \alpha^{-M+1}]$ as $(\alpha^{-M}, \alpha^{-M+1}]$, and we refer to the interval $(\alpha^m, \alpha^{m+1}]$ as reward-consumption ratio interval $m$, or interval $m$ in short. Then we can decompose opt(FA) with respect to reward-consumption ratio intervals:

$$\mathrm{opt(FA)} \stackrel{(2)}{=} \sum_{m=-M}^{M-1}\sum_{l \in \mathcal{L}} \mathrm{Ratio}^{(l)*}\, \mathbb{1}\big(\mathrm{Ratio}^{(l)*} \in (\alpha^m, \alpha^{m+1}]\big)\, B_l^* = \sum_{m=-M}^{M-1}\sum_{l \in \mathcal{L}} \mathrm{Ratio}^{(l)*}\, \mathbb{1}(m_l^* = m)\, B_l^*. \tag{3}$$
The key intuition of our algorithm is to achieve, for each interval $m$, a reward guarantee relating the reward-consumption ratio $\mathbb{1}(m_l^* = m)\,\mathrm{Ratio}^{(l)*}$ and the resource consumption $\sum_{l \in \mathcal{L}} \mathbb{1}(m_l^* = m) B_l^*$, which is done by performing two tasks: (a) for each $l \in \mathcal{L}$, we guess the value of $m$ such that $\mathrm{Ratio}^{(l)*} \in (\alpha^m, \alpha^{m+1}]$, and we guarantee that for at least a $1/(M+1)$ fraction of the requests on each $l$, our guessed ratio interval is close to the correct interval $m_l^*$; (b) for each interval $m$, we reserve $B/(2M)$ resource units. That is, we reserve an inventory of $B/(2M)$ resource units to satisfy requests with a guessed reward-consumption ratio interval $m$. When the inventory reserved for interval $m$ is depleted, the DM rejects (by choosing $a_{\mathrm{null}}$) all future requests with a guessed interval $m$.

By accomplishing task (a), we ensure that for each $l \in \mathcal{L}$, at least $\mathbb{1}(m_l^* = m) B_l^*/(M+1)$ requested resource units are served by resources reserved for interval $m$, generating reward at a ratio of at least $\alpha^m$. Then, by accomplishing task (b), if the inventory reserved for interval $m$ is not depleted by round $T$, our algorithm earns a reward of at least $\alpha^m \mathbb{1}(m_l^* = m) B_l^*/(M+1)$ during stationary piece $l$. Else, if the reserved $B/(2M)$ resource units for interval $m$ are depleted by round $T$, then the DM earns a reward of at least $\alpha^m B/(2M) \ge \alpha^m \sum_{l \in \mathcal{L}} \mathbb{1}(m_l^* = m) B_l^*/(2M)$ from resources reserved for interval $m$, since $\sum_{l \in \mathcal{L}} \mathbb{1}(m_l^* = m) B_l^* \le B$. By judiciously analyzing the relationship between stationary pieces and reward-consumption ratio intervals, for each interval $m$ we ensure that a reward of

$$\frac{1}{O(M)}\sum_{l \in \mathcal{L}} \alpha^m\, \mathbb{1}(m_l^* = m)\, B_l^*$$

is accrued, which is a $1/O(M)$ fraction of the benchmark reward on all stationary pieces $l \in \mathcal{L}$ whose $\mathrm{Ratio}^{(l)*} \in (\alpha^m, \alpha^{m+1}]$. Summing over $m \in \{-M, \ldots, M-1\}$, we achieve a $1/O(M)$ fraction of the reward in (3).

## 3 Warm-up: Full-feedback deterministic outcome setting

In this section, we introduce the main idea of our algorithm by relaxing the model uncertainty assumptions of the bandit model and specializing to the full-feedback deterministic setting: $(R_t, C_t) = (r^{(l)}, c^{(l)})$ with certainty for each $t \in \{t_{l-1}+1, \ldots, t_l\}$. Thus, $(r^{(l)}, c^{(l)})$ is observed at the start of stationary piece $l$. The DM does not know $L$ and $\{t_1, \ldots, t_{L-1}\}$ before the online process begins. The DM's decisions can be fractional: on each stationary piece $l$, a decision can take the form $x_l \in \{x \in [0,1]^{|\mathcal{K}|} : \sum_{a \in \mathcal{K}} x(a) \le 1,\ x(a) \ge 0\ \forall a \in \mathcal{K}\}$, resulting in a reward of $\sum_{a \in \mathcal{K}} r^{(l)}(a) x_l(a)$ and a resource consumption of $\sum_{a \in \mathcal{K}} c^{(l)}(a) x_l(a)$ in a round.

### 3.1 Inventory REServing (IRES) algorithm

Upon observing $(R_t, C_t) = (r^{(l)}, c^{(l)})$, guessing $\mathrm{Ratio}^{(l)*}$ (see Section 2.3) is equivalent to guessing $x_l^*$, which in turn is equivalent to guessing $B_l^*/(t_l - t_{l-1})$, since $\{x_l^*\}_{l=1}^L$ is an optimal solution to $\mathrm{LP}(r^{(l)}, c^{(l)}, B_l^*/(t_l - t_{l-1}))$. In this section, when we say "Line xx", we refer to a line of Algorithm 1, which displays IRES. At the start, IRES reserves $B/(2M)$ units of resource for each interval $m \in \{-M, \ldots, M-1\}$, and each resource unit is reserved by exactly one interval. In Line 5, for each stationary piece $l$, we first solve $\mathrm{LP}(r^{(l)}, c^{(l)}, \eta_{\min}\alpha^q)$ for each $q \in \{0, \ldots, M\}$, and obtain an optimal solution $x_l^{(q)} = \{x_l^{(q)}(a)\}_{a \in \mathcal{K}}$. Each $\eta_{\min}\alpha^q$ is a guess of $B_l^*/(t_l - t_{l-1})$. For each $q \in \{0, \ldots, M\}$, we define

$$\mathrm{Ratio}_l^{(q)} := \frac{\sum_{a \in \mathcal{K}} r^{(l)}(a)\, x_l^{(q)}(a)}{\sum_{a \in \mathcal{K}} c^{(l)}(a)\, x_l^{(q)}(a)}$$

as a guess of $\mathrm{Ratio}^{(l)*}$.
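The guessing step is straightforward to implement. The sketch below is our illustration (the helper names `solve_lp` and `interval_index` are ours): for each budget guess $\eta_{\min}\alpha^q$ it solves the single-round LP and buckets the resulting reward-consumption ratio into its interval index $m$.

```python
# Sketch of IRES's guessing step: for each budget guess eta_min * alpha^q,
# solve LP(r, c, budget) and bucket the resulting reward-consumption ratio.
import math
import numpy as np
from scipy.optimize import linprog

def solve_lp(r, c, budget):
    """Optimal x for LP(r, c, budget): max r.x s.t. c.x <= budget, sum(x) <= 1, x >= 0."""
    K = len(r)
    A_ub = np.vstack([c, np.ones(K)])
    res = linprog(-np.asarray(r), A_ub=A_ub, b_ub=[budget, 1.0], bounds=[(0, 1)] * K)
    return res.x

def interval_index(ratio, alpha, M):
    """Index m in {-M, ..., M-1} with ratio in (alpha^m, alpha^{m+1}], clipped at the ends."""
    m = math.ceil(math.log(ratio, alpha)) - 1
    return max(-M, min(M - 1, m))

alpha, eta_min, eta_max = math.e, 0.1, 1.0
M = math.ceil(math.log(eta_max / eta_min, alpha))
r_l, c_l = np.array([0.5, 0.9]), np.array([0.6, 0.8])   # placeholder piece-l means

for q in range(M + 1):
    x_q = solve_lp(r_l, c_l, eta_min * alpha**q)        # guess of B_l^*/(t_l - t_{l-1})
    ratio_q = r_l @ x_q / (c_l @ x_q)                   # guess of Ratio^{(l)*}
    print(q, interval_index(ratio_q, alpha, M))
```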
Claim 3 in Appendix B.3 shows that, by guessing a $q$ such that $\eta_{\min}\alpha^q$ is within a factor of $\alpha$ of $B_l^*/(t_l - t_{l-1})$, $\mathrm{Ratio}_l^{(q)}$ is also within a factor of $\alpha$ of $\mathrm{Ratio}^{(l)*}$. As time progresses on piece $l$, we cycle round-robin over the choices of $q_t \in \{0, \ldots, M\}$ for each round $t$ (Lines 7, 15). In round $t \in \{t_{l-1}+1, \ldots, t_l\}$, we identify $m_t \in \{-M, \ldots, M-1\}$ such that $\mathrm{Ratio}_l^{(q_t)} \in (\alpha^{m_t}, \alpha^{m_t+1}]$ (Line 8). If enough reserved resource units remain for interval $m_t$ (Line 10), the DM fulfils the $t$-th request by selecting the fractional action $x_t = x_l^{(q_t)}$ (Line 11), which consumes resources reserved for $m_t$ (Lines 9, 11). Otherwise, the DM selects $a_{\mathrm{null}}$ and rejects the request (Line 13). By Line 9, we have $\mathcal{T}_t^{(m)} = \{s \in \{1, \ldots, t\} : m_s = m\}$, which consists of the rounds in $\{1, \ldots, t\}$ when the DM attempts to fulfil a request with resources reserved for interval $m$.

### 3.2 Performance guarantee of IRES

We provide a performance guarantee for IRES in Theorem 3.1.

**Theorem 3.1.** For any given $\alpha > 1$, IRES achieves a reward of at least

$$\frac{1 - (2M+1)/B}{6\alpha^2 M}\,\mathrm{opt(FA)}, \qquad M = \lceil\log_\alpha(\eta_{\max}/\eta_{\min})\rceil,$$

under the mild requirement that $t_l - t_{l-1} \ge M + 1$ for all $l \in \mathcal{L}$. In particular, IRES achieves a competitive ratio of $O(\log_\alpha(\eta_{\max}/\eta_{\min}))$ if $B \ge \Omega(\log_\alpha(\eta_{\max}/\eta_{\min}))$.

```
Algorithm 1 Inventory REServing with deterministic input (IRES)
 1: Input: resource capacity B, eta_min, eta_max.
 2: Initialize l = 0, t = 1, q_1 = 0, T_t^(m) = {} for all m, t.
 3: while t <= T do
 4:   Set l = l + 1.
 5:   Solve LP(r^(l), c^(l), eta_min * alpha^q) for all q in {0, 1, ..., M} for optimal x_l^(q) = {x_l^(q)(a)}_{a in K}.
 6:   while stationary piece l has not ended do
 7:     Let (r_t, c_t) = (r^(l), c^(l)), x_t = x_l^(q_t).
 8:     Find m_t in {-M, ..., M-1} such that Ratio_l^(q_t) in (alpha^{m_t}, alpha^{m_t+1}].
 9:     Set T_t^(m_t) = T_{t-1}^(m_t) union {t}.
10:     if sum_{s in T_{t-1}^(m_t)} sum_{a in K} c_s(a) x_s(a) <= B/(2M) - 1 then
11:       Pick fractional arms x_t.
12:     else
13:       Pick arm a_t = a_null.
14:     end if
15:     if q_t <= M - 1 then set q_{t+1} = q_t + 1, else set q_{t+1} = 0.
16:     Set t = t + 1.
17:   end while
18: end while
```

**Remark 3.2** (Comparison with online knapsack problems). Our deterministic setting resembles online knapsack problems with adversarial $(r_t, c_t)$ revealed in each round (Zhou et al. (2008)), but our $(r_t, c_t)$ remains the same for an unknown number of rounds. Assuming $B \ge \Omega(\eta_{\max})$, Zhou et al. (2008) achieve a competitive ratio of $2\log(\eta_{\max}/\eta_{\min}) + 1$ and provide a nearly matching lower bound. We recover their competitive ratio with an extra $3\alpha^2$ multiplicative factor in a piecewise-stationary setting, under a stricter requirement on $B$.
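For intuition, the following sketch (our illustration, building on the hypothetical `solve_lp` and `interval_index` helpers above) simulates the core bookkeeping of Algorithm 1: per-interval reservations of $B/(2M)$, a round-robin over $q$, and rejection once an interval's reservation runs low. No change detection is needed here because the deterministic setting reveals $(r^{(l)}, c^{(l)})$ at the start of each piece.

```python
# Sketch of the IRES control loop (deterministic full-feedback setting).
# Assumes solve_lp and interval_index from the previous sketch; inputs are placeholders.
import math

def ires(pieces, B, alpha, eta_min, eta_max):
    """pieces: list of (length, r, c) tuples, one per stationary piece."""
    M = math.ceil(math.log(eta_max / eta_min, alpha))
    reserved = {m: B / (2 * M) for m in range(-M, M)}   # per-interval inventory
    reward, q = 0.0, 0
    for length, r, c in pieces:
        guesses = [solve_lp(r, c, eta_min * alpha**qq) for qq in range(M + 1)]
        for _ in range(length):
            x = guesses[q]
            m = interval_index(r @ x / (c @ x), alpha, M)
            if reserved[m] >= 1.0:          # proxy for the "<= B/(2M) - 1" test of Line 10
                reserved[m] -= c @ x        # consume from interval m's reservation
                reward += r @ x
            # else: pick the null arm and reject the request
            q = (q + 1) % (M + 1)           # round-robin over the budget guesses
    return reward
```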
### 3.3 Analysis

Denote $\mathcal{T}_T^{(m)} = \{\tau^{(m)}(1), \tau^{(m)}(2), \ldots\}$ with $\tau^{(m)}(1) < \tau^{(m)}(2) < \cdots$, and let $\bar{\mathcal{T}}^{(m)}$ be the prefix of $\mathcal{T}_T^{(m)}$ given by

$$\bar{\mathcal{T}}^{(m)} = \Big\{\tau^{(m)}(n) \in \mathcal{T}_T^{(m)} : \sum_{s=1}^{n}\sum_{a \in \mathcal{K}} c_{\tau^{(m)}(s)}(a)\, x_{\tau^{(m)}(s)}(a) \le \frac{B}{2M} - 1\Big\}.$$

That is, $\bar{\mathcal{T}}^{(m)}$ consists of the rounds up to the last round assigned to interval $m$ at which the reserved inventory is not fully consumed. Evidently, if $\sum_{s \in \mathcal{T}_T^{(m)}} C_s(a_s) \le B/(2M) - 1$, then $\bar{\mathcal{T}}^{(m)} = \mathcal{T}_T^{(m)}$. Define $J_l = \{t_{l-1}+1, \ldots, t_l\}$, the time interval of the $l$-th piece. The reward achieved by IRES is $\mathrm{REW} = \sum_t \sum_{a \in \mathcal{K}} r_t(a) x_t(a)$, which can be decomposed as $\mathrm{REW} = \sum_{m=-M}^{M-1}\mathrm{REW}^{(m)}$, where

$$\mathrm{REW}^{(m)} = \sum_{l \in \mathcal{L}} \mathbb{1}(m_l^* = m)\sum_{t \in (\bigcup_{n=-M}^{M-1}\bar{\mathcal{T}}^{(n)}) \cap J_l}\ \sum_{a \in \mathcal{K}} r_t(a) x_t(a).$$

The set $(\bigcup_{n=-M}^{M-1}\bar{\mathcal{T}}^{(n)}) \cap J_l$ consists of the rounds in stationary piece $l$ whose requests are not rejected due to a shortage of reserved resource units. The summation $\sum_{l \in \mathcal{L}} \mathbb{1}(m_l^* = m)$ yields the reward accrued on the pieces where $m_l^* = m$. Summing $\mathrm{REW}^{(m)}$ over $m \in \{-M, \ldots, M-1\}$, we obtain the total reward accrued with resources reserved for the $2M$ intervals.

Similarly, we decompose the benchmark $\mathrm{opt(FA)} = \sum_{m=-M}^{M-1}\mathrm{opt(FA)}^{(m)}$, where

$$\mathrm{opt(FA)}^{(m)} = \sum_{l \in \mathcal{L}} \mathbb{1}(m_l^* = m)\,\mathrm{Ratio}^{(l)*} B_l^*.$$

To prove Theorem 3.1, it suffices to show $\mathrm{REW}^{(m)} \ge \frac{1 - (2M+1)/B}{6\alpha^2 M}\,\mathrm{opt(FA)}^{(m)}$ for each interval $m$, as in the following Claims 1 and 2. Theorem 3.1 is then established by summing over $m \in \{-M, \ldots, M-1\}$.

**Claim 1.** For any interval $m \in \{-M, \ldots, M-1\}$, if $\bar{\mathcal{T}}^{(n)} = \mathcal{T}_T^{(n)}$ for all $n \in \{\max\{m-1, -M\}, m\}$, then $\mathrm{REW}^{(m)} \ge \frac{1}{2\alpha(M+1)}\,\mathrm{opt(FA)}^{(m)}$.

**Claim 2.** For any interval $m \in \{-M, \ldots, M-1\}$, if $\bar{\mathcal{T}}^{(n)} \subsetneq \mathcal{T}_T^{(n)}$ for at least one $n \in \{\max\{m-1, -M\}, m\}$, then $\mathrm{REW}^{(m)} \ge \frac{1 - (2M+1)/B}{6\alpha^2 M}\,\mathrm{opt(FA)}^{(m)}$.

**Sketch proofs of Claims 1 and 2.** Claims 1 and 2 are proved in Appendices C.2 and C.3 respectively. We first show in Claim 3 in Appendix B.3 that on a stationary piece $l \in \mathcal{L}$, there exists a correct $q_l^* \in \{0, \ldots, M\}$ such that, when selecting the decision $x_t = x_l^{(q_l^*)}$ (the optimal solution to $\mathrm{LP}(r^{(l)}, c^{(l)}, \eta_{\min}\alpha^{q_l^*})$), our guess $\mathrm{Ratio}_l^{(q_l^*)}$ of the ground truth $\mathrm{Ratio}^{(l)*} \in (\alpha^{m_l^*}, \alpha^{m_l^*+1}]$ satisfies

$$\mathrm{Ratio}_l^{(q_l^*)} \in (\alpha^{m_l^*-1}, \alpha^{m_l^*+1}] = (\alpha^{m_l^*-1}, \alpha^{m_l^*}] \cup (\alpha^{m_l^*}, \alpha^{m_l^*+1}]. \tag{4}$$

When taking the fractional action $x_l^{(q_l^*)}$, we consume resources reserved for reward-consumption ratio interval $m_l^* - 1$ or $m_l^*$. Therefore, by our round-robin design, on each stationary piece $l$ with $m_l^* = m$, at least $(t_l - t_{l-1})/(M+1)$ requests are assigned to intervals $m-1$ or $m$ under decision $x_l^{(q_l^*)}$. It remains to analyze how many requests are fulfilled by resources reserved for the correct interval $m_l^* = m$ at the correct reward-consumption ratio $\mathrm{Ratio}_l^{(q_l^*)}$, as discussed in Section 2.3.

For an interval $m$ where $\bar{\mathcal{T}}^{(m)} = \mathcal{T}_T^{(m)}$ and $\bar{\mathcal{T}}^{(m-1)} = \mathcal{T}_T^{(m-1)}$ (the Claim 1 case), there remain resources reserved for intervals $m-1$ and $m$ at the end of the horizon. Hence, for each stationary piece $l$ with $m_l^* = m$, at least $(t_l - t_{l-1})/(M+1)$ requests (consuming $B_l^*/(M+1)$ resource units) are indeed fulfilled by resources reserved for interval $m-1$ or $m$, accruing reward at a reward-consumption ratio of at least $\mathrm{Ratio}^{(l)*}/\alpha$ by (4). Summing over all $l$ with $m_l^* = m$, we have $\mathrm{REW}^{(m)} \ge \sum_{l \in \mathcal{L}} (\mathrm{Ratio}^{(l)*}/\alpha)\,\mathbb{1}(m_l^* = m)\, B_l^*/(M+1)$, and Claim 1 is validated.

For an interval $m$ where $\bar{\mathcal{T}}^{(n)} \subsetneq \mathcal{T}_T^{(n)}$ for some $n \in \{\max\{m-1, -M\}, m\}$ (the Claim 2 case), the $B/(2M)$ resource units reserved for interval $n$ are depleted before the end of the horizon. In this case, some requests on stationary pieces $l$ with $m_l^* = m$ may be rejected, but the $B/(2M)$ resource units reserved for interval $n$ have been consumed, generating reward at a reward-consumption ratio of at least $\alpha^n \ge \alpha^{m-1}$. Since the total resources that should be consumed with respect to interval $m$ under the optimal FA solution satisfy $\sum_{l \in \mathcal{L}} \mathbb{1}(m_l^* = m) B_l^* \le B$, we have $\mathrm{REW}^{(m)} \ge \alpha^{m-1} B/(2M) \ge \sum_{l \in \mathcal{L}} (\mathrm{Ratio}^{(l)*}/\alpha^2)\,\mathbb{1}(m_l^* = m)\, B_l^*/(2M)$, and Claim 2 is validated.
## 4 Bandit-feedback stochastic outcome setting

In this section, we consider the original piecewise-stationary BwK model, where the DM receives bandit feedback on the outcomes $(R_t, C_t)$, and decisions are randomized.

### 4.1 Inventory REServing with Change Monitoring (IRES-CM) algorithm

In this section, when we say "Line xx", we refer to a line of Algorithm 2, which displays IRES-CM. In the bandit-feedback setting, guessing $\mathrm{Ratio}^{(l)*}$ requires estimating $(r^{(l)}, c^{(l)})$. To do so, we adaptively partition $\mathcal{T}$ into exploration rounds and exploitation rounds. In each round $t$, we conduct exploration with probability $\gamma_t = M\sqrt{|\mathcal{K}|\log(1/\delta)(\log(|\mathcal{K}|)+1)/(Nt)}$ (reflected in a Bernoulli random variable $U(t)$ in Line 7), where $\delta \in (0,1)$ is a confidence parameter and $N$ is defined in (5). In an exploration round $t$ (Lines 9-13), we uniformly sample an arm $a \in \mathcal{K}$ and pull it for $N$ consecutive rounds. We update an estimate $(\hat{r}_t(a), \hat{c}_t(a))$ of $(r_t(a), c_t(a)) = (r^{(l)}(a), c^{(l)}(a))$ using the information $\{(R_s(a), C_s(a))\}_{s \in \mathcal{T}_t^S(a)}$, where $\mathcal{T}_t^S(a)$ denotes the set of the most recent $N$ exploration rounds before round $t$ in which arm $a$ is pulled. That is, we set $\mathcal{T}_t^S(a) = \{\tau \in \{t - s_t, \ldots, t-1\} : a_\tau = a\}$, where $s_t = \arg\max_s\{\sum_{\tau=t-s}^{t-1}\mathbb{1}(a_\tau = a) = N\}$. We define

$$N = \frac{27\log(2/\delta)}{(1 - 1/\sqrt{\alpha})^2\,\eta_{\min}}, \qquad \hat{r}_t(a) = \frac{\sum_{s \in \mathcal{T}_t^S(a)} R_s(a)}{N}, \qquad \hat{c}_t(a) = \frac{\sum_{s \in \mathcal{T}_t^S(a)} C_s(a)}{N}. \tag{5}$$

The estimates $\hat{r}_t, \hat{c}_t$ have two sources of error: error due to random noise, which decreases with $N$, and error due to non-stationarity, which increases with $N$. We set $N$ according to (5) to balance these two errors. We let $\mathcal{T}_t^R$ denote the set of exploration rounds. In an exploitation round $t$ (Lines 17-25), we take turns pulling arms according to the decision $\hat{x}_t^{(q_t)}$, very much like Algorithm 1 with $(\hat{r}_t, \hat{c}_t)$ in place of $(r_t, c_t)$. We define

$$\widehat{\mathrm{Ratio}}_t^{(q)} = \frac{\sum_{a \in \mathcal{K}} \hat{r}_t(a)\,\hat{x}_t^{(q)}(a)}{\sum_{a \in \mathcal{K}} \hat{c}_t(a)\,\hat{x}_t^{(q)}(a)}, \qquad \hat{\rho}_t = \max\Big\{\min\Big\{\widehat{\mathrm{Ratio}}_t^{(q_t)},\ \alpha^M\Big\},\ \alpha^{-M}\Big\},$$

both of which can be interpreted as a guess of $\mathrm{Ratio}^{(l)*}$ at any round $t$ during stationary piece $l$. We reserve $B/(2M)$ units of resource for each interval $m \in \{-M, \ldots, M-1\}$. In round $t$, we serve request $t$ using resources reserved for the interval $\hat{m}_t$ such that $\hat{\rho}_t \in (\alpha^{\hat{m}_t}, \alpha^{\hat{m}_t+1}]$. If interval $\hat{m}_t$ has remaining reserved inventory, then we pull arm $a_t = a$ with probability $\hat{x}_t^{(q_t)}(a)$, or $a_t \sim \hat{x}_t^{(q_t)}$ in short. We let $\mathcal{T}_t^{I(m)}$ denote the set of exploitation rounds using resources reserved for interval $m$. We finally highlight that the major performance difference between IRES and IRES-CM is due to estimating $(r_t, c_t)$ by $(\hat{r}_t, \hat{c}_t)$, which is detailed in Section 4.3.

```
Algorithm 2 Inventory REServing with Change Monitoring (IRES-CM)
 1: Input: resource capacity B, rate gamma, bounding parameters eta_min, eta_max.
 2: Set T_t^R = {} for all t and T_t^I(m) = {} for all m, t.
 3: Pull each arm a in K for N times; get hat_r_t(a), hat_c_t(a) as in (5).
 4: Set t = N|K| + 1.
 5: while t <= T do
 6:   Solve LP(hat_r_t, hat_c_t, eta_min * alpha^q) for all q in {0, 1, ..., M} for optimal hat_x_t^(q) = {hat_x_t^(q)(a)}_{a in K}.
 7:   Sample U(t) ~ Bern(gamma_t).
 8:   if U(t) = 1 then
 9:     Pick arm a ~ Uni(K); set a_s = a for s in {t, ..., t+N-1}.
10:     Set U(s) = 1 for s in {t, ..., t+N-1}.
11:     Set T_s^R = T_{t-1}^R union {t, ..., s} for s in {t, ..., t+N-1}.
12:     Set t = t + N, (hat_r_t, hat_c_t) = (hat_r_{t-N}, hat_c_{t-N}).
13:     Update hat_c_t(a), hat_r_t(a) as in (5).
14:   else
15:     for q = 0, ..., M do
16:       Set q_t = q.
17:       Determine hat_m_t in {-M, ..., M-1} such that hat_rho_t in (alpha^{hat_m_t}, alpha^{hat_m_t+1}].
18:       Set T_t^I(hat_m_t) = T_{t-1}^I(hat_m_t) union {t}.
19:       if sum_{s in T_{t-1}^I(hat_m_t)} C_s(a_s) <= B/(2M) - 1 then
20:         Pick arm a_t ~ hat_x_t^(q_t).
21:       else
22:         Pick arm a_t = a_null.
23:       end if
24:       Set t = t + 1, (hat_r_t, hat_c_t) = (hat_r_{t-1}, hat_c_{t-1}).
25:     end for
26:   end if
27: end while
```
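To illustrate the exploration schedule, the sketch below is ours (the `env.pull(a)` interface returning a noisy $(R, C)$ pair is an assumption): flip a $\mathrm{Bern}(\gamma_t)$ coin, and on success pull a uniformly random arm for $N$ consecutive rounds, refreshing that arm's window estimates as in (5).

```python
# Sketch of IRES-CM's exploration step and sliding-window estimates (eq. (5)).
# Illustrative only; env.pull(a) returning a noisy (R, C) pair is an assumed interface.
import math
import random
from collections import deque

def make_N(alpha, eta_min, delta):
    return math.ceil(27 * math.log(2 / delta) / ((1 - 1 / math.sqrt(alpha))**2 * eta_min))

def gamma(t, M, K, N, delta):
    return min(1.0, M * math.sqrt(K * math.log(1 / delta) * (math.log(K) + 1) / (N * t)))

def explore_step(env, t, arms, windows, N, M, delta):
    """Returns (new_t, update). windows[a] is a deque(maxlen=N) of arm a's last N pulls."""
    if random.random() < gamma(t, M, len(arms), N, delta):
        a = random.choice(arms)                    # uniform exploration arm
        for _ in range(N):                         # pull it for N consecutive rounds
            R, C = env.pull(a)
            windows[a].append((R, C))              # keep only the most recent N pulls
        r_hat = sum(R for R, _ in windows[a]) / N  # estimates as in (5)
        c_hat = sum(C for _, C in windows[a]) / N
        return t + N, (a, r_hat, c_hat)
    return t, None                                 # exploitation round: no estimate update

# windows = {a: deque(maxlen=N) for a in arms}     # initialization, one window per arm
```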
### 4.2 Performance guarantee of IRES-CM

We impose the following assumption on the ranges of $B$ and opt(FA).

**Assumption 4.1.** $\min\{B, \mathrm{opt(FA)}\} \ge \tilde{\Omega}(L\sqrt{|\mathcal{K}|NT})$, where $\tilde{\Omega}(\cdot)$ hides multiplicative factors in $\log_\alpha(\eta_{\max}/\eta_{\min})$, $\log(1/\delta)$ and $\log(|\mathcal{K}|)+1$.

The performance of IRES-CM is as follows.

**Theorem 4.2.** For any given $\alpha > 1$, with probability at least $1 - 2|\mathcal{K}|\big(\log_\alpha(\eta_{\max}/\eta_{\min})L + T\big)\delta$, IRES-CM achieves a reward of at least

$$\frac{1 - o(1)}{10\alpha^4\log_\alpha(\eta_{\max}/\eta_{\min})}\,\mathrm{opt(FA)} - \tilde{O}(L\sqrt{|\mathcal{K}|NT})$$

under Assumption 4.1, where $o(\cdot)$ hides multiplicative factors in $\sqrt{M/B}$ and $\tilde{O}(\cdot)$ hides multiplicative factors in $\log_\alpha(\eta_{\max}/\eta_{\min})$, $\log(1/\delta)$ and $\log(|\mathcal{K}|)+1$. In particular, IRES-CM achieves a competitive ratio of $O(\log_\alpha(\eta_{\max}/\eta_{\min}))$ as long as $L = o(\sqrt{T\eta_{\min}})$.

The proof of Theorem 4.2 can be found in Appendix C.4. We provide a thorough comparison of our performance guarantee with existing works on adversarial and non-stationary BwK in Appendix A.

**Remark 4.3** (Improved performance with known $L$). If the DM knows $L$, Assumption 4.1 can be relaxed to $\min\{B, \mathrm{opt(FA)}\} \ge \tilde{\Omega}(\sqrt{L|\mathcal{K}|NT})$. Furthermore, in the performance guarantee of Theorem 4.2, the deducted term $\tilde{O}(L\sqrt{|\mathcal{K}|NT})$ can be improved to $\tilde{O}(\sqrt{L|\mathcal{K}|NT})$ by setting the exploration parameter $\gamma_t = M\sqrt{L|\mathcal{K}|\log(1/\delta)(\log(|\mathcal{K}|)+1)/(Nt)}$ in IRES-CM. Without prior knowledge of $L$, the deducted term $\tilde{O}(L\sqrt{|\mathcal{K}|NT}) = o(T)$ if $L = o(\sqrt{T\eta_{\min}})$; with prior knowledge of $L$, the deducted term $\tilde{O}(\sqrt{L|\mathcal{K}|NT}) = o(T)$ if $L = o(T\eta_{\min})$.

**Remark 4.4** (Deterministic setting with bandit feedback). In our full-feedback deterministic setting, since $(r^{(l)}, c^{(l)})$ is given at the beginning of each stationary piece, our performance guarantee is independent of $|\mathcal{K}|, L$. In a bandit-feedback deterministic setting, IRES-CM can be applied by setting $N = 1$. In this case, under Assumption 4.1, IRES-CM achieves a reward of at least $\frac{1 - o(1)}{6\alpha^2\log_\alpha(\eta_{\max}/\eta_{\min})}\,\mathrm{opt(FA)} - \tilde{O}(L\sqrt{|\mathcal{K}|T})$.

### 4.3 Analysis

We denote $\sigma_t(a) = \min\{s : s \in \mathcal{T}_t^S(a)\}$, the first element of $\mathcal{T}_t^S(a)$. We partition the exploitation round set $\mathcal{T}_T^{I(m)}$ into two sets $\check{\mathcal{T}}^{I(m)}$ and $\hat{\mathcal{T}}^{I(m)}$, i.e., $\mathcal{T}_T^{I(m)} = \check{\mathcal{T}}^{I(m)} \cup \hat{\mathcal{T}}^{I(m)}$ and $\check{\mathcal{T}}^{I(m)} \cap \hat{\mathcal{T}}^{I(m)} = \emptyset$. A time index $t \in \mathcal{T}_T^{I(m)}$ belongs to the set $\check{\mathcal{T}}^{I(m)}$ (referred to as "successful exploitation rounds regarding interval $m$") if and only if the following condition is satisfied for all $a \in \mathcal{K}$:

$$\{(r_s(a), c_s(a))\}_{a \in \mathcal{K}} = \{(r_{\sigma_t(a)}(a), c_{\sigma_t(a)}(a))\}_{a \in \mathcal{K}}, \qquad \forall s \in \{\sigma_t(a), \ldots, t\}. \tag{6}$$

For $t \in \hat{\mathcal{T}}^{I(m)}$ (referred to as "failed exploitation rounds regarding interval $m$"), condition (6) is violated for at least one $a \in \mathcal{K}$. We enumerate $\check{\mathcal{T}}^{I(m)} = \{\tau^{I(m)}(1), \tau^{I(m)}(2), \ldots\}$ with $\tau^{I(m)}(1) < \tau^{I(m)}(2) < \cdots$. We let $\bar{\mathcal{T}}^{I(m)}$ be the prefix of $\check{\mathcal{T}}^{I(m)}$ given by

$$\bar{\mathcal{T}}^{I(m)} = \Big\{\tau^{I(m)}(n) \in \check{\mathcal{T}}^{I(m)} : \sum_{s=1}^{n} C_{\tau^{I(m)}(s)}\big(a_{\tau^{I(m)}(s)}\big) \le \frac{B}{2M} - 1\Big\},$$

which consists of the rounds up to the last exploitation round satisfying (6) for interval $m$ such that the reserved resource is adequate. If $\sum_{s \in \mathcal{T}_T^{I(m)}} C_s(a_s) \le B/(2M) - 1$, then $\bar{\mathcal{T}}^{I(m)} = \check{\mathcal{T}}^{I(m)}$.

**Sketch proof of Theorem 4.2.** Recall that the performance guarantee of our algorithms takes the form $\sum_{t=1}^T R_t(a_t) \ge \frac{1}{\mathrm{CR}}\,\mathrm{opt(FA)} - \mathrm{Reg}$. The proof consists of two main steps: (a) we derive $\mathrm{CR} = O(M)$ by bounding two different cases of interval $m$ in a manner similar to Claims 1 and 2 (see Appendix C.4), with $\check{\mathcal{T}}^{I(m)}$ (successful exploitation rounds in IRES-CM) in place of $\mathcal{T}_T^{(m)}$ (all rounds in IRES); (b) we derive $\mathrm{Reg} = \tilde{O}(L\sqrt{|\mathcal{K}|NT})$ by bounding the number of exploration rounds $|\mathcal{T}_T^R|$ and failed exploitation rounds $|\bigcup_{m=-M}^{M-1}\hat{\mathcal{T}}^{I(m)}|$ (see Appendix B.7).

**Comparing the performance of IRES and IRES-CM.** We highlight that the major performance difference between IRES and IRES-CM is the loss caused by estimating $(r_t, c_t)$, reflected in the following aspects: (i) reward loss caused by exploration (upper bounding $|\mathcal{T}_T^R|$); (ii) $\mathcal{T}_t^S(a)$ contains change points, causing failed estimation of $(r_t, c_t)$ (upper bounding $|\bigcup_{m=-M}^{M-1}\hat{\mathcal{T}}^{I(m)}|$); (iii) $\mathcal{T}_t^S(a)$ does not contain change points, but the discrepancy between $(r_t, c_t)$ and $(\hat{r}_t, \hat{c}_t)$ results in assigning $\mathrm{Ratio}^{(l)*}$ (estimated by $\widehat{\mathrm{Ratio}}_t^{(q)}$) to the wrong interval. We remark that the losses due to (i) and (ii) are accounted for in Reg, while (iii) is accounted for in the CR.
### 4.4 A lower bound on the competitive ratio

We complement our analysis by showing the tightness of our CR (see Appendix C.6 for the proof).

**Theorem 4.5.** Consider a fixed but arbitrary $\alpha > 1$, and set $\eta_{\min} = \alpha^{-3\nu}$, $\eta_{\max} = 1$ for an arbitrary $\nu \in \mathbb{Z}_{>0}$. For any online algorithm, there exists an instance for which $\eta_{\min} \le r_t(a), c_t(a) \le \eta_{\max}$ for all $a \in \mathcal{K}$, $t \in \mathcal{T}$, and $\sum_{t=1}^T R_t(a_t)/\mathrm{opt(FA)} \le \Theta(1/\log_\alpha(\eta_{\max}/\eta_{\min}))$.

## 5 Numerical Experiments

We run numerical experiments on a single-resource problem where $L = 2$, $T = 20000$ (each stationary piece has 10000 rounds), $\mathcal{K} = \{1, 2\}$, $B = 9360$, and we set $\alpha = e$ for our algorithms. The rewards and resource consumptions in all rounds are uniformly distributed within a $[-0.2, 0.2]$ range of their mean values. We compare the performance of IRES-CM with the algorithms of Immorlica et al. (2019) and Zhou et al. (2008). Recall that Immorlica et al. (2019) focus on an adversarial BwK problem and achieve a CR w.r.t. a static benchmark, while Zhou et al. (2008) study a full-feedback adversarial setting and achieve a CR w.r.t. a best-single-arm benchmark. In Figure 1, each curve represents the average cumulative reward over 10 simulations, and the shaded area around each curve marks the variance over the simulations. We provide Zhou et al. (2008)'s algorithm with the extra information of $(r_t, c_t)$ before decisions are made in each round, and compare the performance of the algorithms against the linear program benchmark FA (dotted curves in Figure 1).

*Figure 1: Performance comparison of algorithms for piecewise-stationary BwK. (a) FA optimal solution: single arm. (b) FA optimal solution: mixed arms.*

In Figure 1(a), we set $r^{(1)}(1) = r^{(1)}(2) = 0.5$, $c^{(1)}(1) = c^{(1)}(2) = 1$ for stationary piece 1, and $r^{(2)}(1) = 1$, $r^{(2)}(2) = 0.5$, $c^{(2)}(1) = 0.5$, $c^{(2)}(2) = 1$ for stationary piece 2. In Figure 1(b), we switch the values of $r^{(2)}(1)$ and $r^{(2)}(2)$, i.e., we set $r^{(2)}(1) = 0.5$, $r^{(2)}(2) = 1$. Observe that IRES-CM outperforms Immorlica et al. (2019)'s algorithm in both cases. This is mainly because Immorlica et al. (2019)'s algorithm is designed for the more general adversarial BwK setting; in contrast, we utilize the extra information that $\eta_{\min} = 0.5$. As a result, Immorlica et al. (2019)'s algorithm is significantly more conservative than IRES-CM in reserving inventory for future customers. Zhou et al. (2008)'s algorithm outperforms IRES-CM in Figure 1(a), but performs worse than IRES-CM in Figure 1(b). This is because in Figure 1(a), the optimal solution of the benchmark FA chooses a single arm on each stationary piece, which aligns with Zhou et al. (2008)'s best-single-arm benchmark, and their algorithm performs well with the extra information of $(r_t, c_t)$ before each decision. In Figure 1(b), the optimal solution of the benchmark FA chooses mixed arms on the second stationary piece, where $x_2^*(1) = 0.128$ and $x_2^*(2) = 0.872$. The numerical results are consistent with the theoretical results that Zhou et al. (2008) achieve sub-optimal rewards compared with a best-distribution-over-arms benchmark, while our IRES-CM performs well. Finally, our experiments are run on a Surface Pro 7 with an i5-1035G4 processor; all results can be produced within 30 minutes.
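For readers who wish to re-create the setup, the sketch below (ours, not the authors' code) draws feedback from the Section 5 instance: two stationary pieces of 10000 rounds each, with per-round outcomes uniform within $\pm 0.2$ of the stated means; clipping to $[0, 1]$ is our assumption to keep outcomes in the model's range.

```python
# Sketch of the Section 5 simulation instance (Figure 1(a) means shown).
import numpy as np

rng = np.random.default_rng(0)
T, B = 20000, 9360
means = {                                    # (r, c) means per arm, per stationary piece
    1: {"r": [0.5, 0.5], "c": [1.0, 1.0]},   # rounds 1..10000
    2: {"r": [1.0, 0.5], "c": [0.5, 1.0]},   # rounds 10001..20000; swap r for Fig. 1(b)
}

def feedback(t, arm):
    """Noisy (R_t, C_t): uniform within +/-0.2 of the mean, clipped to [0, 1] (our choice)."""
    piece = 1 if t <= T // 2 else 2
    r = means[piece]["r"][arm] + rng.uniform(-0.2, 0.2)
    c = means[piece]["c"][arm] + rng.uniform(-0.2, 0.2)
    return float(np.clip(r, 0, 1)), float(np.clip(c, 0, 1))
```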
## Acknowledgement

We would like to acknowledge the support from the Singapore Ministry of Education AcRF Tier 2 Grant (Grant number: T2EP20121-0035).

## References

Agrawal, S., Avadhanula, V., Goyal, V., and Zeevi, A. (2019). MNL-bandit: A dynamic learning approach to assortment selection. Operations Research, 67(5):1453-1485.

Agrawal, S. and Devanur, N. (2016). Linear contextual bandits with knapsacks. Advances in Neural Information Processing Systems, 29.

Agrawal, S. and Devanur, N. R. (2014). Bandits with concave rewards and convex knapsacks. In Proceedings of the Fifteenth ACM Conference on Economics and Computation, pages 989-1006.

Babaioff, M., Dughmi, S., Kleinberg, R., and Slivkins, A. (2015). Dynamic pricing with limited supply.

Badanidiyuru, A., Kleinberg, R., and Slivkins, A. (2018). Bandits with knapsacks. Journal of the ACM (JACM), 65(3):1-55.

Badanidiyuru, A., Langford, J., and Slivkins, A. (2014). Resourceful contextual bandits. In Conference on Learning Theory, pages 1109-1134. PMLR.

Balseiro, S. R. and Gur, Y. (2019). Learning in repeated auctions with budgets: Regret minimization and equilibrium. Management Science, 65(9):3952-3968.

Balseiro, S. R., Lu, H., and Mirrokni, V. (2022). The best of many worlds: Dual mirror descent for online allocation problems. Operations Research.

Castiglioni, M., Celli, A., and Kroer, C. (2022a). Online learning with knapsacks: the best of both worlds. In International Conference on Machine Learning, pages 2767-2783. PMLR.

Castiglioni, M., Celli, A., Marchesi, A., Romano, G., and Gatti, N. (2022b). A unifying framework for online optimization with long-term constraints. Advances in Neural Information Processing Systems, 35:33589-33602.

Fikioris, G. and Tardos, É. (2023). Approximately stationary bandits with knapsacks. arXiv preprint arXiv:2302.14686.

Immorlica, N., Sankararaman, K. A., Schapire, R., and Slivkins, A. (2019). Adversarial bandits with knapsacks. In 2019 IEEE 60th Annual Symposium on Foundations of Computer Science (FOCS), pages 202-219. IEEE.

Jiang, J., Li, X., and Zhang, J. (2020). Online stochastic optimization with Wasserstein-based non-stationarity. arXiv preprint arXiv:2012.06961.

Karp, R. M., Vazirani, U. V., and Vazirani, V. V. (1990). An optimal algorithm for on-line bipartite matching. In Proceedings of the Twenty-Second Annual ACM Symposium on Theory of Computing, pages 352-358.

Kesselheim, T. and Singla, S. (2020). Online learning with vector costs and bandits with knapsacks. In Conference on Learning Theory, pages 2286-2305. PMLR.

Kim, W., Iyengar, G., and Zeevi, A. (2023). Improved algorithms for multi-period multi-class packing problems with bandit feedback. In International Conference on Machine Learning, pages 16458-16501. PMLR.

Kuszmaul, W. and Qi, Q. (2021). The multiplicative version of Azuma's inequality, with an application to contention analysis. arXiv preprint arXiv:2102.05077.

Liu, S., Jiang, J., and Li, X. (2022). Non-stationary bandits with knapsacks. Advances in Neural Information Processing Systems, 35:16522-16532.

Mehta, A., Saberi, A., Vazirani, U., and Vazirani, V. (2007). AdWords and generalized online matching. Journal of the ACM (JACM), 54(5):22-es.

Rangi, A., Franceschetti, M., and Tran-Thanh, L. (2018). Unifying the stochastic and the adversarial bandits with knapsack. arXiv preprint arXiv:1811.12253.

Sankararaman, K. A. and Slivkins, A. (2018). Combinatorial semi-bandits with knapsacks. In International Conference on Artificial Intelligence and Statistics, pages 1760-1770. PMLR.

Sivakumar, V., Zuo, S., and Banerjee, A. (2022). Smoothed adversarial linear contextual bandits with knapsacks. In International Conference on Machine Learning, pages 20253-20277. PMLR.

Yao, A. C.-C. (1977). Probabilistic computations: Toward a unified measure of complexity. In 18th Annual Symposium on Foundations of Computer Science (SFCS 1977), pages 222-227. IEEE Computer Society.
Zhalechian, M., Keyvanshokooh, E., Shi, C., and Van Oyen, M. P. (2022). Online resource allocation with personalized learning. Operations Research, 70(4):2138-2161.

Zhou, Y., Chakrabarty, D., and Lukose, R. (2008). Budget constrained bidding in keyword auctions and online knapsack problems. In Proceedings of the 17th International Conference on World Wide Web, pages 1243-1244.

## A Comparing our performance guarantee with existing literature

### A.1 Comparing with adversarial BwK

Immorlica et al. (2019) (achieving an $O(d\log T)$ competitive ratio) and Kesselheim and Singla (2020) (achieving an $O(\log(d)\log T)$ competitive ratio) study adversarial BwK, where the adversarial bandit feedback $(r_t(a_t), c_t(a_t))$ is revealed in each time step after pulling arm $a_t$. Their setting is more general than ours, since they require no boundedness assumption or piecewise-stationarity assumption on the outcomes. Note that if our $\eta_{\min}$ can be expressed as a function of $T$, i.e., $\eta_{\min} = T^{-\beta}$ for a $\beta \in (0,1)$, our result extends to multiple resources and achieves a competitive ratio of $O(d\log T)$ (see Appendix B.5). However, we highlight that their results do not imply ours, for the following two reasons. Firstly, they consider a static benchmark, dubbed FD, which is FA with the additional constraint that $x_l = x$ for all $l \in \mathcal{L}$, while we compare our result with the dynamic benchmark FA. Specifically, Immorlica et al. (2019) prove that no algorithm can achieve a competitive ratio smaller than $T/B^2$ w.r.t. the dynamic benchmark. We further improve this lower bound to $\Theta(T/B)$ in Lemma 2.3(a) by setting $L = T/B$, and provide a matching lower bound of $\Omega(\log T)$ against opt(FA) when $\eta_{\min} > 0$. Secondly, they have more restrictive assumptions on the value ranges of $B$ and opt(FD). Specifically, Immorlica et al. (2019) assume $\mathrm{opt(FD)}\, B/|\mathcal{K}| > \Omega(T^{7/4})$, which is strictly stronger than our Assumption 4.1.

Some research works consider more specific regimes, including Rangi et al. (2018); Castiglioni et al. (2022a); Fikioris and Tardos (2023). We highlight that all these works still compare with static benchmarks. Castiglioni et al. (2022a) focus on the regime where $B = \Omega(T)$ and achieve a competitive ratio of $T/B$. Fikioris and Tardos (2023) provide a competitive ratio depending on $(\max_t\{r_t\}/\min_t\{r_t\},\ \max_t\{c_t\}/\min_t\{c_t\})$. Rangi et al. (2018) achieve a sublinear-in-$T$ regret in a different setting with no round limit (sales stop when the inventory is depleted); their result is therefore incomparable with ours. Some recent papers focus on linear contextual BwK with adversarial contextual vectors (Sivakumar et al. (2022), achieving an $O(d\log T)$ competitive ratio) or with multiple but stationary customer classes (Kim et al. (2023), achieving sublinear-in-$T$ regret). We highlight that the contextual vectors for each arm are observable before making a decision in each round, while our model only observes bandit feedback after an arm is chosen. Therefore, their results do not generalize to our setting.

### A.2 Comparing with adversarial online knapsack problems with full feedback

In Section 3, we discuss a warm-up setting where $\{R_t(a), C_t(a)\}_{a \in \mathcal{K}}$ is observable upon the arrival of the round-$t$ customer, which is similar to adversarial online knapsack with full feedback (Karp et al. (1990); Mehta et al. (2007); Zhou et al. (2008)). Zhou et al. (2008) is the most closely related to our work; they start from an online matching problem and extend the results to the online knapsack problem.
They define $\mathrm{LB} = \min_{a,t}\{r_t(a)/c_t(a)\}$ and $\mathrm{UB} = \max_{a,t}\{r_t(a)/c_t(a)\}$, and achieve a competitive ratio of $\log(\mathrm{UB}/\mathrm{LB})$ w.r.t. a dynamic best-single-arm benchmark, which differs from our FA by restricting $x_l(a) \in \{0, 1\}$ for each $a, l$. Our FA, on the other hand, is a best-distribution-over-arms benchmark. In fact, for stationary BwK, Badanidiyuru et al. (2018) show (see their Appendix A) that the best-single-arm benchmark is strictly weaker than the best distribution over arms, which can noticeably affect the achievable CR. Evidently, this is also the case in the piecewise-stationary setting. Additionally, this line of work crucially requires knowing $\{(R_t(a), C_t(a))\}_{a \in \mathcal{K}}$ before making decisions. By contrast, the different approach we take in the algorithm design of Section 3 allows a natural generalization from full feedback to bandit feedback, shown in Section 4.

### A.3 Comparing with non-stationary bandit/full-feedback online optimization with knapsacks

Balseiro et al. (2022); Jiang et al. (2020); Liu et al. (2022) measure the non-stationarity of a time-varying knapsack model with a quantity called the global variation

$$\Delta_{\mathrm{glo}} = \sum_{t=1}^{T}\mathrm{dist}\Big(\sum_{t=1}^{T}(r_t, c_t)/T,\ (r_t, c_t)\Big).$$

While dist can be any metric, to give a more concrete idea, we highlight an example from Liu et al. (2022):

$$\mathrm{dist}((r, c), (r', c')) = \max_{a \in \mathcal{K}}\{|r'(a) - r(a)|\} + \max_{a \in \mathcal{K}}\{|c'(a) - c(a)|\}.$$

They all make boundedness assumptions on $\Delta_{\mathrm{glo}}$. In our setting ($\Delta_{\mathrm{glo}}$ unbounded), their algorithms incur a linear-in-$T$ regret even with a single change point, and no non-trivial competitive ratio is established in their work. To see a case with a single change point but $\Delta_{\mathrm{glo}} = \Theta(T)$, consider $\mathcal{K} = \{1\}$ with $(r_1(1), c_1(1)) = \cdots = (r_{T/2}(1), c_{T/2}(1)) = (1, 0.5)$ but $(r_{T/2+1}(1), c_{T/2+1}(1)) = \cdots = (r_T(1), c_T(1)) = (0.5, 1)$. In this case, we can verify that

$$\mathrm{dist}\Big(\sum_{t=1}^{T}(r_t, c_t)/T,\ (r_1, c_1)\Big) = \mathrm{dist}\Big(\sum_{t=1}^{T}(r_t, c_t)/T,\ (r_{T/2+1}, c_{T/2+1})\Big) = 0.5,$$

so we have $\Delta_{\mathrm{glo}} = 0.5T$ despite there being only one change point.

## B Auxiliary results

### B.1 Notation

For functions $f(x): \mathbb{R}_{\ge 0} \to \mathbb{R}_{\ge 0}$ and $g(x): \mathbb{R}_{\ge 0} \to \mathbb{R}_{\ge 0}$, we say $f(x)$ is $O(g(x))$ (resp. $\Omega(g(x))$) if there exist positive constants $C$ and $n$ such that for all $x \ge n$, $f(x) \le C g(x)$ (resp. $f(x) \ge C g(x)$). We use $\tilde{O}(g(x))$ (resp. $\tilde{\Omega}(g(x))$) to hide logarithmic terms in $x$ other than $g(x)$.

### B.2 Concentration inequalities

**Lemma B.1** (Multiplicative Azuma-Hoeffding inequality (Kuszmaul and Qi (2021))). Let $X_1, \ldots, X_n \in [0, c]$ be real-valued random variables, and let $\{\mathcal{F}_i\}_{i=0}^n$ be a filtration. Let $\mu = \sum_{i=1}^n a_i$, where the $a_i$ are real-valued constants. (i) Suppose $\mathbb{E}[X_i \mid \mathcal{F}_{i-1}] \le a_i$ holds for all $i \in \{1, \ldots, n\}$ almost surely; then for any $\delta \in (0,1)$, the corresponding multiplicative upper-tail bound of Kuszmaul and Qi (2021) holds with probability at least $1 - \delta$. (ii) Suppose $\mathbb{E}[X_i \mid \mathcal{F}_{i-1}] \ge a_i$ holds for all $i \in \{1, \ldots, n\}$ almost surely; then for any $\delta \in (0,1)$, the corresponding multiplicative lower-tail bound holds with probability at least $1 - \delta$.

**Lemma B.2** (Lemma 2.1, Badanidiyuru et al. (2018)). Let $X_1, \ldots, X_N \in [0, 1]$ be random variables. Let $X = \sum_{i=1}^N X_i$, and let $\mu = \sum_{i=1}^N \mathbb{E}[X_i \mid X_1, \ldots, X_{i-1}]$. Then, for any $\delta \in (0,1)$,

$$\Pr\Big(|X - \mu| \le \sqrt{2X\log(1/\delta)} + 4\log(1/\delta)\Big) \ge 1 - 3\delta.$$

### B.3 Main claim on the reward-consumption ratio

Recall that $x_l^{(q)}$ is an optimal solution to $\mathrm{LP}(r^{(l)}, c^{(l)}, \eta_{\min}\alpha^q)$, and $x_l^*$ is an optimal solution of $\mathrm{LP}(r^{(l)}, c^{(l)}, B_l^*/(t_l - t_{l-1}))$, which attains our benchmark. In the following Claim 3, we show that the round-robin technique in IRES ensures that, on each stationary piece $l$, $x_l^{(q)}$ for at least one $q \in \{0, \ldots, M\}$ is close to $x_l^*$ in terms of both the resource consumption and the reward-consumption ratio. This leads to the important result that for all $t \in \{t_{l-1}+1, \ldots, t_l\}$, $m_t \in \{m_l^* - 1, m_l^*\}$.
**Claim 3.** On a stationary piece $l \in \mathcal{L}$, there exists $q_l^* \in \{0, 1, \ldots, M\}$ that satisfies both

$$\begin{cases} \eta_{\min}\alpha^{q_l^*-1} < \sum_{a \in \mathcal{K}} c^{(l)}(a)x_l^*(a) \le \sum_{a \in \mathcal{K}} c^{(l)}(a)x_l^{(q_l^*)}(a) \le \eta_{\min}\alpha^{q_l^*} & \text{if } q_l^* > 0, \\ \sum_{a \in \mathcal{K}} c^{(l)}(a)x_l^*(a) \le \sum_{a \in \mathcal{K}} c^{(l)}(a)x_l^{(q_l^*)}(a) \le \eta_{\min}\alpha^{q_l^*} & \text{if } q_l^* = 0, \end{cases} \tag{7}$$

$$\frac{\sum_{a \in \mathcal{K}} r^{(l)}(a)x_l^{(q_l^*)}(a)}{\sum_{a \in \mathcal{K}} c^{(l)}(a)x_l^{(q_l^*)}(a)} \le \frac{\sum_{a \in \mathcal{K}} r^{(l)}(a)x_l^*(a)}{\sum_{a \in \mathcal{K}} c^{(l)}(a)x_l^*(a)} \le \alpha\,\frac{\sum_{a \in \mathcal{K}} r^{(l)}(a)x_l^{(q_l^*)}(a)}{\sum_{a \in \mathcal{K}} c^{(l)}(a)x_l^{(q_l^*)}(a)}. \tag{8}$$

**Proof of Claim 3.** The existence of $q_l^*$ satisfying (7) is evident, since $\eta_{\min}\alpha^0 = \eta_{\min}$ and $\eta_{\min}\alpha^M \ge \eta_{\max}$. To show that $q_l^*$ also satisfies (8), we define

$$d^{(l)} = \frac{\sum_{a \in \mathcal{K}} c^{(l)}(a)x_l^{(q_l^*)}(a)}{\sum_{a \in \mathcal{K}} c^{(l)}(a)x_l^*(a)}, \tag{9}$$

and claim that

$$\sum_{a \in \mathcal{K}} r^{(l)}(a)x_l^*(a) \le \sum_{a \in \mathcal{K}} r^{(l)}(a)x_l^{(q_l^*)}(a) \le d^{(l)}\sum_{a \in \mathcal{K}} r^{(l)}(a)x_l^*(a). \tag{10}$$

The first inequality in (10) holds since $B_l^*/(t_l - t_{l-1}) = \sum_{a \in \mathcal{K}} c^{(l)}(a)x_l^*(a) \le \eta_{\min}\alpha^{q_l^*}$; therefore, the resource constraint in $\mathrm{LP}(r^{(l)}, c^{(l)}, B_l^*/(t_l - t_{l-1}))$ is tighter than that in $\mathrm{LP}(r^{(l)}, c^{(l)}, \eta_{\min}\alpha^{q_l^*})$. We prove the second inequality in (10) by contradiction. Suppose $\sum_a r^{(l)}(a)x_l^{(q_l^*)}(a) > d^{(l)}\sum_a r^{(l)}(a)x_l^*(a)$. Then we set $\tilde{x}_l = x_l^{(q_l^*)}/d^{(l)}$ and have

$$\sum_{a \in \mathcal{K}} c^{(l)}(a)\tilde{x}_l(a) = \frac{1}{d^{(l)}}\sum_{a \in \mathcal{K}} c^{(l)}(a)x_l^{(q_l^*)}(a) = \sum_{a \in \mathcal{K}} c^{(l)}(a)x_l^*(a) = B_l^*/(t_l - t_{l-1}).$$

In this case, $\tilde{x}_l$ is a feasible solution to $\mathrm{LP}(r^{(l)}, c^{(l)}, B_l^*/(t_l - t_{l-1}))$ with

$$\sum_{a \in \mathcal{K}} r^{(l)}(a)\tilde{x}_l(a) = \frac{1}{d^{(l)}}\sum_{a \in \mathcal{K}} r^{(l)}(a)x_l^{(q_l^*)}(a) > \sum_{a \in \mathcal{K}} r^{(l)}(a)x_l^*(a),$$

contradicting the fact that $x_l^*$ is an optimal solution of $\mathrm{LP}(r^{(l)}, c^{(l)}, B_l^*/(t_l - t_{l-1}))$. Therefore, combining (9) and (10), we establish

$$\frac{\sum_{a} r^{(l)}(a)x_l^{(q_l^*)}(a)}{\sum_{a} c^{(l)}(a)x_l^{(q_l^*)}(a)} \le \frac{\sum_{a} r^{(l)}(a)x_l^*(a)}{\sum_{a} c^{(l)}(a)x_l^*(a)} \le d^{(l)}\,\frac{\sum_{a} r^{(l)}(a)x_l^{(q_l^*)}(a)}{\sum_{a} c^{(l)}(a)x_l^{(q_l^*)}(a)}.$$

To establish (8), it suffices to show $d^{(l)} \le \alpha$. By (7) and (9), it is evident that $d^{(l)} \le \alpha$ when $q_l^* > 0$. If $q_l^* = 0$, then the constraints $\sum_a x_l^*(a) \le 1$ and $\sum_a x_l^{(q_l^*)}(a) \le 1$ are not tight in $\mathrm{LP}(r^{(l)}, c^{(l)}, B_l^*/(t_l - t_{l-1}))$ and $\mathrm{LP}(r^{(l)}, c^{(l)}, \eta_{\min})$ respectively. In this case, both LPs are knapsack problems with closed-form solutions

$$x_l^*(a) = \begin{cases}\dfrac{B_l^*}{(t_l - t_{l-1})\,c^{(l)}(a)} & a = \arg\max_{a \in \mathcal{K}}\{r^{(l)}(a)/c^{(l)}(a)\}, \\ 0 & \text{otherwise,}\end{cases} \qquad x_l^{(q_l^*)}(a) = \begin{cases}\dfrac{\eta_{\min}}{c^{(l)}(a)} & a = \arg\max_{a \in \mathcal{K}}\{r^{(l)}(a)/c^{(l)}(a)\}, \\ 0 & \text{otherwise.}\end{cases}$$

Therefore, we have

$$\frac{\sum_{a} r^{(l)}(a)x_l^{(q_l^*)}(a)}{\sum_{a} c^{(l)}(a)x_l^{(q_l^*)}(a)} = \frac{\sum_{a} r^{(l)}(a)x_l^*(a)}{\sum_{a} c^{(l)}(a)x_l^*(a)},$$

which shows that (8) holds (with factor $1 < \alpha$) when $q_l^* = 0$.

### B.4 Decomposing the ratio of REW to opt(FA) in the deterministic setting

Recall that $\bar{\mathcal{T}}^{(m)} = \big\{\tau^{(m)}(n) \in \mathcal{T}_T^{(m)} : \sum_{s=1}^{n}\sum_{a \in \mathcal{K}} c_{\tau^{(m)}(s)}(a)x_{\tau^{(m)}(s)}(a) \le B/(2M) - 1\big\}$ and $J_l = \{t_{l-1}+1, \ldots, t_l\}$. Then, for the reward achieved by IRES, we have

$$\mathrm{REW} \ge \sum_{m=-M}^{M-1}\sum_{l \in \mathcal{L}}\mathbb{1}(m_l^* = m)\sum_{t \in (\bigcup_{n=-M}^{M-1}\bar{\mathcal{T}}^{(n)}) \cap J_l}\sum_{a \in \mathcal{K}} r_t(a)x_t(a) \ge \frac{1}{3}\sum_{m=-M}^{M-1}\ \sum_{w=\max\{m-1,-M\}}^{\min\{m+1,M-1\}}\ \sum_{l \in \mathcal{L}}\mathbb{1}(m_l^* = w)\sum_{t \in (\bigcup_{n=-M}^{M-1}\bar{\mathcal{T}}^{(n)}) \cap J_l}\sum_{a \in \mathcal{K}} r_t(a)x_t(a). \tag{11}$$

Inequality (11) holds since, by summing over $w \in \{\max\{m-1, -M\},\ m,\ \min\{m+1, M-1\}\}$ for each $m \in \{-M, \ldots, M-1\}$, we repeat each sum at most 3 times. For the reward achieved by FA, we have

$$\mathrm{opt(FA)} = \sum_{l \in \mathcal{L}}\frac{\sum_{a} r^{(l)}(a)x_l^*(a)}{\sum_{a} c^{(l)}(a)x_l^*(a)}\,B_l^* = \sum_{m=-M}^{M-1}\sum_{l \in \mathcal{L}}\mathbb{1}(m_l^* = m)\,\frac{\sum_{a} r^{(l)}(a)x_l^*(a)}{\sum_{a} c^{(l)}(a)x_l^*(a)}\,B_l^* = \sum_{m=-M}^{M-1}\mathrm{opt(FA)}^{(m)}, \tag{12}$$

where the first equation follows from (2). Therefore, the ratio of the reward achieved by IRES to opt(FA) can be decomposed as

$$\frac{\mathrm{REW}}{\mathrm{opt(FA)}} = \frac{\sum_{m=-M}^{M-1}\mathrm{REW}^{(m)}}{\sum_{m=-M}^{M-1}\mathrm{opt(FA)}^{(m)}} \ge \frac{1}{3}\cdot\frac{\sum_{m=-M}^{M-1}\sum_{w=\max\{m-1,-M\}}^{\min\{m+1,M-1\}}\sum_{l \in \mathcal{L}}\mathbb{1}(m_l^* = w)\sum_{t \in (\bigcup_{n}\bar{\mathcal{T}}^{(n)}) \cap J_l}\sum_{a} r_t(a)x_t(a)}{\sum_{m=-M}^{M-1}\sum_{l \in \mathcal{L}}\mathbb{1}(m_l^* = m)\frac{\sum_{a} r^{(l)}(a)x_l^*(a)}{\sum_{a} c^{(l)}(a)x_l^*(a)}B_l^*}. \tag{13}$$

Then, to prove Theorem 3.1, it suffices to show

$$\frac{\sum_{m=-M}^{M-1}\mathrm{REW}^{(m)}}{\sum_{m=-M}^{M-1}\mathrm{opt(FA)}^{(m)}} \ge (13) \ge \frac{1 - (2M+1)/B}{6\alpha^2 M}. \tag{14}$$

To prove (14), we focus on showing in Claims 1 and 2 that

$$\frac{\frac{1}{3}\sum_{w=\max\{m-1,-M\}}^{\min\{m+1,M-1\}}\sum_{l \in \mathcal{L}}\mathbb{1}(m_l^* = w)\sum_{t \in (\bigcup_{n}\bar{\mathcal{T}}^{(n)}) \cap J_l}\sum_{a} r_t(a)x_t(a)}{\sum_{l \in \mathcal{L}}\mathbb{1}(m_l^* = m)\frac{\sum_{a} r^{(l)}(a)x_l^*(a)}{\sum_{a} c^{(l)}(a)x_l^*(a)}B_l^*} \ge \frac{1 - (2M+1)/B}{6\alpha^2 M}$$

for each $m \in \{-M, \ldots, M-1\}$.
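As a numeric sanity check of Claim 3, one can scan $q$ and verify the ratio sandwich (8) on random instances. The sketch below is ours (it reuses the hypothetical `solve_lp` helper from Section 3.1 and random placeholder data).

```python
# Numeric sanity check of Claim 3: some q makes the guessed ratio alpha-close
# to that of the benchmark solution x_l^*. Assumes solve_lp from the earlier sketch.
import math
import numpy as np

alpha, eta_min, eta_max = math.e, 0.1, 1.0
M = math.ceil(math.log(eta_max / eta_min, alpha))
rng = np.random.default_rng(1)

r = rng.uniform(eta_min, eta_max, size=3)
c = rng.uniform(eta_min, eta_max, size=3)
b_star = rng.uniform(eta_min, eta_max)      # stands in for B_l^* / (t_l - t_{l-1})

x_star = solve_lp(r, c, b_star)
ratio_star = r @ x_star / (c @ x_star)
ok = False
for q in range(M + 1):
    x_q = solve_lp(r, c, eta_min * alpha**q)
    ratio_q = r @ x_q / (c @ x_q)
    # (8): ratio_q <= ratio_star <= alpha * ratio_q (within numerical tolerance)
    if ratio_q - 1e-9 <= ratio_star <= alpha * ratio_q + 1e-9:
        ok = True
print("Claim 3 sandwich found for some q:", ok)
```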
### B.5 Extending results to multiple resources

Our results readily extend to the multiple-resource case, with $|\mathcal{I}| = d$ resources indexed by $i \in \mathcal{I}$. An upper bound for a multi-resource allocation problem (corresponding to FA in the single-resource setting) can be formulated as

$$\mathrm{FA_{MUL}}: \quad \max\ \sum_{l=1}^{L}(t_l - t_{l-1})\sum_{a \in \mathcal{K}} r^{(l)}(a)x_l(a) \quad \text{s.t.}\ \ \sum_{l=1}^{L}(t_l - t_{l-1})\sum_{a \in \mathcal{K}} c_i^{(l)}(a)x_l(a) \le B \ \ \forall i \in \mathcal{I}, \quad \sum_{a \in \mathcal{K}} x_l(a) \le 1 \ \ \forall l, \quad x_l(a) \ge 0 \ \ \forall a, l,$$

which is evidently upper bounded by the following LP:

$$\mathrm{FA_{MUL\text{-}U}}: \quad \max\ \sum_{l=1}^{L}(t_l - t_{l-1})\sum_{a \in \mathcal{K}} r^{(l)}(a)x_l(a) \quad \text{s.t.}\ \ \sum_{l=1}^{L}(t_l - t_{l-1})\sum_{a \in \mathcal{K}}\Big(\sum_{i \in \mathcal{I}} c_i^{(l)}(a)\Big)x_l(a) \le |\mathcal{I}|B, \quad \sum_{a \in \mathcal{K}} x_l(a) \le 1 \ \ \forall l, \quad x_l(a) \ge 0 \ \ \forall a, l.$$

It is evident that the LP below achieves a reward of at least a $1/d$ fraction of $\mathrm{opt(FA_{MUL\text{-}U})}$:

$$\mathrm{FA_{MUL\text{-}F}}: \quad \max\ \sum_{l=1}^{L}(t_l - t_{l-1})\sum_{a \in \mathcal{K}} r^{(l)}(a)x_l(a) \quad \text{s.t.}\ \ \sum_{l=1}^{L}(t_l - t_{l-1})\sum_{a \in \mathcal{K}}\Big(\sum_{i \in \mathcal{I}} c_i^{(l)}(a)\Big)x_l(a) \le B, \quad \sum_{a \in \mathcal{K}} x_l(a) \le 1 \ \ \forall l, \quad x_l(a) \ge 0 \ \ \forall a, l.$$

Therefore, $\mathrm{FA_{MUL}}$ is transformed into a single-resource allocation problem $\mathrm{FA_{MUL\text{-}F}}$, at the cost of an extra multiplicative factor of $d$ in the competitive ratio.

### B.6 Core lemma on the reward-consumption ratio in the general setting

Recall that the set $\mathcal{T}_t^S(a) = \{\tau \in \{t - s_t, \ldots, t-1\} : a_\tau = a\}$, where $s_t = \arg\max_s\{\sum_{\tau=t-s}^{t-1}\mathbb{1}(a_\tau = a) = N\}$, consists of the most recent $N$ rounds in which arm $a$ is sampled by exploration, and $\sigma_t(a) = \min\{s : s \in \mathcal{T}_t^S(a)\}$ is the first element of $\mathcal{T}_t^S(a)$. Recall that

$$\hat{c}_t(a) = \frac{\sum_{s \in \mathcal{T}_t^S(a)} C_s(a)}{N}, \qquad \hat{r}_t(a) = \frac{\sum_{s \in \mathcal{T}_t^S(a)} R_s(a)}{N}$$

are the average resource consumption and reward earned over the most recent $N$ pulls of arm $a$. Additionally, recall that we define $\hat{x}_t^{(q)} = \{\hat{x}_t^{(q)}(a)\}_{a \in \mathcal{K}}$ as the optimal solution to $\mathrm{LP}(\hat{r}_t, \hat{c}_t, \eta_{\min}\alpha^q)$ for $q = 0, 1, \ldots, M$. We further define

$$\bar{c}_t(a) = \frac{\sum_{s \in \mathcal{T}_t^S(a)} c_s(a)}{N}, \qquad \bar{r}_t(a) = \frac{\sum_{s \in \mathcal{T}_t^S(a)} r_s(a)}{N}$$

as the average mean resource consumption and reward over the most recent $N$ pulls of arm $a$. Note that when condition (6), $\{(r_s(a), c_s(a))\}_{a \in \mathcal{K}} = \{(r_{\sigma_t(a)}(a), c_{\sigma_t(a)}(a))\}_{a \in \mathcal{K}}\ \forall s \in \{\sigma_t(a), \ldots, t\}$, is satisfied for all $a \in \mathcal{K}$, the rounds $s \in \{\sigma_t(a), \ldots, t\}$ lie on the same stationary piece and we have $\bar{c}_t(a) = c_t(a)$, $\bar{r}_t(a) = r_t(a)$. We let $\bar{x}_t^{(q)} = \{\bar{x}_t^{(q)}(a)\}_{a \in \mathcal{K}}$ be the optimal solution to $\mathrm{LP}(\bar{r}_t, \bar{c}_t, \eta_{\min}\alpha^q)$ for $q = 0, 1, \ldots, M$. By Claim 3, we know that there exists $q_l^* \in \{0, 1, \ldots, M\}$ such that for all $t \in J_l$,

$$\begin{cases} \eta_{\min}\alpha^{q_l^*-1} < \sum_{a} c^{(l)}(a)x_l^*(a) \le \sum_{a} \bar{c}_t(a)\bar{x}_t^{(q_l^*)}(a) \le \eta_{\min}\alpha^{q_l^*} & \text{if } q_l^* > 0, \\ \sum_{a} c^{(l)}(a)x_l^*(a) \le \sum_{a} \bar{c}_t(a)\bar{x}_t^{(q_l^*)}(a) \le \eta_{\min}\alpha^{q_l^*} & \text{if } q_l^* = 0. \end{cases} \tag{15}$$

We show in the following Lemma B.3 that when condition (6) is satisfied for all $a \in \mathcal{K}$, our decisions have several nice properties that facilitate our proofs.

**Lemma B.3.** Fix an arbitrary $\delta \in (0, 1]$. For any $t \ge \sigma_t(a)$, if $\{(r_s(a), c_s(a))\}_{a \in \mathcal{K}} = \{(r_{\sigma_t(a)}(a), c_{\sigma_t(a)}(a))\}_{a \in \mathcal{K}}$ for all $s \in \{\sigma_t(a), \ldots, t\}$, then with probability at least $1 - 2|\mathcal{K}|\delta$, all of the following inequalities are satisfied for all $t \in J_l$:

$$\frac{1}{\sqrt{\alpha}}\cdot\frac{\sum_{a}\hat{r}_t(a)\hat{x}_t^{(q_l^*)}(a)}{\sum_{a}\hat{c}_t(a)\hat{x}_t^{(q_l^*)}(a)} \le \frac{\sum_{a} r_t(a)\hat{x}_t^{(q_l^*)}(a)}{\sum_{a} c_t(a)\hat{x}_t^{(q_l^*)}(a)} \le \sqrt{\alpha}\cdot\frac{\sum_{a}\hat{r}_t(a)\hat{x}_t^{(q_l^*)}(a)}{\sum_{a}\hat{c}_t(a)\hat{x}_t^{(q_l^*)}(a)}, \tag{16}$$

$$\frac{1}{\alpha\sqrt{\alpha}}\sum_{a} r_t(a)\hat{x}_t^{(q_l^*)}(a) \le \sum_{a} r^{(l)}(a)x_l^*(a) \le \alpha\sum_{a} r_t(a)\hat{x}_t^{(q_l^*)}(a), \tag{17}$$

$$\frac{1}{\alpha}\cdot\frac{\sum_{a}\hat{r}_t(a)\hat{x}_t^{(q_l^*)}(a)}{\sum_{a}\hat{c}_t(a)\hat{x}_t^{(q_l^*)}(a)} \le \frac{\sum_{a} r^{(l)}(a)x_l^*(a)}{\sum_{a} c^{(l)}(a)x_l^*(a)} \le \alpha\sqrt{\alpha}\cdot\frac{\sum_{a}\hat{r}_t(a)\hat{x}_t^{(q_l^*)}(a)}{\sum_{a}\hat{c}_t(a)\hat{x}_t^{(q_l^*)}(a)} \quad \text{if } q_l^* = 0. \tag{18}$$
**Proof of Lemma B.3.** We define

$$\epsilon = \max\Big\{\sqrt{\tfrac{3\log(2/\delta)}{N\eta_{\min}}},\ \sqrt{\tfrac{3\log(2/\delta)}{N\eta_{\max}}},\ \sqrt{\tfrac{2\log(2/\delta)}{N\eta_{\min}}}\Big\} = \sqrt{\frac{3\log(2/\delta)}{N\eta_{\min}}}.$$

Then, by the multiplicative Azuma-Hoeffding inequality (Lemma B.1), for any $a \in \mathcal{K}$ we have

$$\Pr\big[(1-\epsilon)\bar{c}_t(a) \le \hat{c}_t(a) \le (1+\epsilon)\bar{c}_t(a)\big] \ge 1 - \delta, \tag{19}$$
$$\Pr\big[(1-\epsilon)\bar{r}_t(a) \le \hat{r}_t(a) \le (1+\epsilon)\bar{r}_t(a)\big] \ge 1 - \delta. \tag{20}$$

The probability bounds (19) and (20) hold since $N\eta_{\min} \le \sum_{s \in \mathcal{T}_t^S(a)} c_s(a),\ \sum_{s \in \mathcal{T}_t^S(a)} r_s(a) \le N\eta_{\max}$. We set $(1 - 3\epsilon)^2 = 1/\alpha$, and therefore $N = \frac{27\log(2/\delta)}{(1 - 1/\sqrt{\alpha})^2\,\eta_{\min}}$. Our following discussion is conditioned on the good event that both (19) and (20) hold for all $a \in \mathcal{K}$; this good event holds with probability at least $1 - 2|\mathcal{K}|\delta$.

**Validating (16).** Given that (19) and (20) hold, we have

$$(1-2\epsilon)\,\frac{\sum_{a}\hat{r}_t(a)\hat{x}_t^{(q_l^*)}(a)}{\sum_{a}\hat{c}_t(a)\hat{x}_t^{(q_l^*)}(a)} \le \frac{\sum_{a} r_t(a)\hat{x}_t^{(q_l^*)}(a)}{\sum_{a} c_t(a)\hat{x}_t^{(q_l^*)}(a)} \le \frac{1}{1-2\epsilon}\,\frac{\sum_{a}\hat{r}_t(a)\hat{x}_t^{(q_l^*)}(a)}{\sum_{a}\hat{c}_t(a)\hat{x}_t^{(q_l^*)}(a)}. \tag{21}$$

Since $1 - 2\epsilon \ge 1 - 3\epsilon = 1/\sqrt{\alpha}$, (21) implies (16).

**Validating (17).** It is evident that by letting $\hat{x}'(a) = \bar{x}_t^{(q_l^*)}(a)/(1+\epsilon)$, we have

$$\sum_{a}\hat{c}_t(a)\hat{x}'(a) = \frac{\sum_{a}\hat{c}_t(a)\bar{x}_t^{(q_l^*)}(a)}{1+\epsilon} \le \frac{\sum_{a}(1+\epsilon)\bar{c}_t(a)\bar{x}_t^{(q_l^*)}(a)}{1+\epsilon} = \sum_{a}\bar{c}_t(a)\bar{x}_t^{(q_l^*)}(a) \le \eta_{\min}\alpha^{q_l^*}, \tag{22}$$

where (22) follows from inequality (19). Since $\hat{x}'$ is a feasible solution to $\mathrm{LP}(\hat{r}_t, \hat{c}_t, \eta_{\min}\alpha^{q_l^*})$ and $\hat{x}_t^{(q_l^*)}$ is optimal for it, we have

$$\sum_{a} r_t(a)\hat{x}_t^{(q_l^*)}(a) \ge \frac{1}{1+\epsilon}\sum_{a}\hat{r}_t(a)\hat{x}_t^{(q_l^*)}(a) \ge \frac{1}{1+\epsilon}\sum_{a}\hat{r}_t(a)\hat{x}'(a) = \frac{\sum_{a}\hat{r}_t(a)\bar{x}_t^{(q_l^*)}(a)}{(1+\epsilon)^2} \ge \frac{(1-\epsilon)\sum_{a}\bar{r}_t(a)\bar{x}_t^{(q_l^*)}(a)}{(1+\epsilon)^2} \ge \frac{1}{\sqrt{\alpha}}\sum_{a}\bar{r}_t(a)\bar{x}_t^{(q_l^*)}(a). \tag{23}$$

Similarly, letting $\bar{x}'(a) = (1-\epsilon)\hat{x}_t^{(q_l^*)}(a)$, we have

$$\sum_{a}\bar{c}_t(a)\bar{x}'(a) = (1-\epsilon)\sum_{a}\bar{c}_t(a)\hat{x}_t^{(q_l^*)}(a) \le (1-\epsilon)\sum_{a}\frac{\hat{c}_t(a)}{1-\epsilon}\,\hat{x}_t^{(q_l^*)}(a) = \sum_{a}\hat{c}_t(a)\hat{x}_t^{(q_l^*)}(a) \le \eta_{\min}\alpha^{q_l^*}.$$

Since $\bar{x}'$ is a feasible solution to $\mathrm{LP}(\bar{r}_t, \bar{c}_t, \eta_{\min}\alpha^{q_l^*})$ and $\bar{x}_t^{(q_l^*)}$ is optimal for it, we have

$$\sum_{a}\hat{r}_t(a)\hat{x}_t^{(q_l^*)}(a) \le (1+\epsilon)\sum_{a}\bar{r}_t(a)\hat{x}_t^{(q_l^*)}(a) = \frac{1+\epsilon}{1-\epsilon}\sum_{a}\bar{r}_t(a)\bar{x}'(a) \le \frac{1}{1-2\epsilon}\sum_{a}\bar{r}_t(a)\bar{x}_t^{(q_l^*)}(a). \tag{24}$$

Since $(1-3\epsilon)^2 = 1/\alpha$, putting (23) and (24) together, we have

$$\frac{1}{\sqrt{\alpha}}\sum_{a}\bar{r}_t(a)\bar{x}_t^{(q_l^*)}(a) \le \sum_{a} r_t(a)\hat{x}_t^{(q_l^*)}(a) \le \sqrt{\alpha}\sum_{a}\bar{r}_t(a)\bar{x}_t^{(q_l^*)}(a). \tag{25}$$

By (10) in Claim 3, we have

$$\frac{1}{\alpha}\sum_{a}\bar{r}_t(a)\bar{x}_t^{(q_l^*)}(a) \le \sum_{a} r^{(l)}(a)x_l^*(a) \le \sum_{a}\bar{r}_t(a)\bar{x}_t^{(q_l^*)}(a). \tag{26}$$

Given (25) and (26), we prove (17).

**Validating (18).** For $l \in \mathcal{L}$ such that $q_l^* = 0$, we have $\sum_{a}\bar{c}_t(a)\bar{x}_t^{(q_l^*)}(a) = \eta_{\min}$. Given (19), for all $t \in J_l$,

$$\frac{1}{\sqrt{\alpha}}\sum_{a}\bar{c}_t(a)\bar{x}_t^{(q_l^*)}(a) = \frac{\eta_{\min}}{\sqrt{\alpha}} \le \sum_{a}\hat{c}_t(a)\hat{x}_t^{(q_l^*)}(a) \le \eta_{\min} = \sum_{a}\bar{c}_t(a)\bar{x}_t^{(q_l^*)}(a). \tag{27}$$

Putting (23) and (24) together, we have

$$\frac{1}{\sqrt{\alpha}}\sum_{a}\bar{r}_t(a)\bar{x}_t^{(q_l^*)}(a) \le \sum_{a}\hat{r}_t(a)\hat{x}_t^{(q_l^*)}(a) \le \sqrt{\alpha}\sum_{a}\bar{r}_t(a)\bar{x}_t^{(q_l^*)}(a). \tag{28}$$

Inequalities (27) and (28) give

$$\frac{1}{\alpha}\cdot\frac{\sum_{a}\hat{r}_t(a)\hat{x}_t^{(q_l^*)}(a)}{\sum_{a}\hat{c}_t(a)\hat{x}_t^{(q_l^*)}(a)} \le \frac{\sum_{a}\bar{r}_t(a)\bar{x}_t^{(q_l^*)}(a)}{\sum_{a}\bar{c}_t(a)\bar{x}_t^{(q_l^*)}(a)} \le \sqrt{\alpha}\cdot\frac{\sum_{a}\hat{r}_t(a)\hat{x}_t^{(q_l^*)}(a)}{\sum_{a}\hat{c}_t(a)\hat{x}_t^{(q_l^*)}(a)}. \tag{29}$$

Recall from (8) in Claim 3 that for all $t \in J_l$ we have

$$\frac{\sum_{a}\bar{r}_t(a)\bar{x}_t^{(q_l^*)}(a)}{\sum_{a}\bar{c}_t(a)\bar{x}_t^{(q_l^*)}(a)} \le \frac{\sum_{a} r^{(l)}(a)x_l^*(a)}{\sum_{a} c^{(l)}(a)x_l^*(a)} \le \alpha\,\frac{\sum_{a}\bar{r}_t(a)\bar{x}_t^{(q_l^*)}(a)}{\sum_{a}\bar{c}_t(a)\bar{x}_t^{(q_l^*)}(a)}. \tag{30}$$

Putting (29) and (30) together, we establish (18).

### B.7 Bounding exploration rounds and failed exploitation rounds

To upper bound $|\mathcal{T}_T^R \cup (\bigcup_{m=-M}^{M-1}\hat{\mathcal{T}}^{I(m)})|$, it suffices to upper bound $|\mathcal{T}_T^R|$ (forthcoming Claim 4) and $|\bigcup_{m=-M}^{M-1}\hat{\mathcal{T}}^{I(m)}|$ (forthcoming Claim 5).

**Claim 4.** For any $\delta \in (0,1)$, $|\mathcal{T}_T^R| \le 2TN\gamma_T + N\sqrt{6T\gamma_T\log(2/\delta)}$ with probability at least $1 - \delta$.
**Proof of Claim 4.** Define the event $E_t^R = \{\text{conduct exploration in round } t\} = E_t^{R(1)} \cup E_t^{R(2)}$, where $E_t^{R(1)} = \{\exists a \in \mathcal{K} \text{ s.t. } t = \sigma_t(a)\}$ and $E_t^{R(2)} = \{\exists a \in \mathcal{K} \text{ s.t. } t \in \mathcal{T}_t^S(a)\setminus\{\sigma_t(a)\}\}$. We define random variables $Y_t^R = \mathbb{1}(E_t^R)$, $Y_t^{R(1)} = \mathbb{1}(E_t^{R(1)})$ and $Y_t^{R(2)} = \mathbb{1}(E_t^{R(2)})$. Define the event $E_t^I = \{\text{conduct exploitation in round } t\} = E_t^{I(1)} \cup E_t^{I(2)}$, where $E_t^{I(1)} = \{q_t = 0\}$ and $E_t^{I(2)} = \{q_t > 0\}$. We define random variables $Y_t^I = \mathbb{1}(E_t^I)$, $Y_t^{I(1)} = \mathbb{1}(E_t^{I(1)})$ and $Y_t^{I(2)} = \mathbb{1}(E_t^{I(2)})$. We further define random variables $Z_t$, where $Z_t = Y_t^{R(1)} \sim \mathrm{Bern}(\gamma_t)$ for all $t$ such that $Y_t^{R(1)} = 1$ or $Y_t^{I(1)} = 1$, and $Z_t \sim \mathrm{Bern}(\gamma_t)$ otherwise. It is evident that in each round $t$, $Z_t$ follows a Bernoulli distribution with mean $\gamma_t$ and $Z_t \ge Y_t^{R(1)}$. Therefore, the total number of forced exploration rounds over the planning horizon can be upper bounded as

$$\sum_{t=1}^{T}\big(Y_t^{R(1)} + Y_t^{R(2)}\big) = N\sum_{t=1}^{T} Y_t^{R(1)} \le N\sum_{t=1}^{T} Z_t.$$

Since the $Z_t$ are independent and $\mathbb{E}[\sum_{t=1}^T Z_t] = \sum_{t=1}^T \gamma_t \le 2T\gamma_T$, we can apply the Chernoff bound (a subcase of Lemma B.1) to the $Z_t$:

$$\Pr\Big(\sum_{t=1}^{T} Z_t > 2T\gamma_T + \sqrt{6T\gamma_T\log(2/\delta)}\Big) \le \delta.$$

Hence we have $|\mathcal{T}_T^R| = \sum_{t=1}^T Y_t^R \le 2TN\gamma_T + N\sqrt{6T\gamma_T\log(2/\delta)}$ with probability at least $1 - \delta$.

**Claim 5.** Fix any $\delta \in (0,1)$. Then $|\hat{\mathcal{T}}^{I(m)}| \le L|\mathcal{K}|\log(1/\delta)(\log(|\mathcal{K}|)+1)/\gamma_T$ with probability at least $1 - |\mathcal{K}|L\delta$.

**Proof of Claim 5.** If round $t$ is a change point, i.e., $(r_t, c_t) \ne (r_{t+1}, c_{t+1})$, we define the set $\mathcal{K}^{(t)} \subseteq \mathcal{K}$ such that $(r_t(a), c_t(a)) \ne (r_{t+1}(a), c_{t+1}(a))$ for all $a \in \mathcal{K}^{(t)}$. Notice that after a change happens, some exploitation rounds may run with condition (6) violated, resulting in consuming resources from the wrong reward-consumption interval $m$. Hence, $|\hat{\mathcal{T}}^{I(m)}|$ can be upper bounded by the total number of exploitation rounds run before all arms are updated after each change point; after all arms are updated following a change point, (6) is satisfied again. In the following Claim 6, we suppose round $t$ is a change point, and upper bound the number of exploitation rounds run before all arms $a \in \mathcal{K}^{(t)}$ are updated.

**Claim 6.** Suppose round $t$ is a change point. With probability at least $1 - |\mathcal{K}^{(t)}|\delta$, IRES-CM runs at most $|\mathcal{K}|\log(1/\delta)(\log(|\mathcal{K}^{(t)}|)+1)/\gamma_T$ exploitation rounds (Algorithm 2, Lines 17-25) before registering the change in exploration rounds (Algorithm 2, Lines 9-13).

**Proof of Claim 6.** For a set $\mathcal{K}' \subseteq \mathcal{K}$, let the random variable $Y(\mathcal{K}')$ denote the number of exploitation samples (i.e., Line 7 of Algorithm 2 giving $U(t) = 0$) between two nearest exploration samples (i.e., Line 7 of Algorithm 2 giving $U(t) = 1$) applied to arms $a \in \mathcal{K}'$. We denote by $\mathcal{K}^{(t)}(j)$ the $j$-th arm explored from the set $\mathcal{K}^{(t)}$. After each exploitation sample, IRES-CM runs one round for each value of $q$ (at most $2M$ rounds). Therefore, the number of exploitation rounds run before all $a \in \mathcal{K}^{(t)}$ are updated is at most $2M\big(Y(\mathcal{K}^{(t)}) + \sum_{j=2}^{|\mathcal{K}^{(t)}|} Y(\mathcal{K}^{(t)}\setminus\bigcup_{h=1}^{j-1}\mathcal{K}^{(t)}(h))\big)$.

For any subset $\mathcal{K}' \subseteq \mathcal{K}^{(t)}$, we further denote by $Z(\mathcal{K}')$ the value of $Y(\mathcal{K}')$ plus the number of times that exploration is triggered for any $a \in \mathcal{K}\setminus\mathcal{K}'$ between two nearest exploration rounds applied to arms $a \in \mathcal{K}'$. It is evident that $Z(\mathcal{K}')$ is a geometric random variable with time-varying success probability $p_t = \gamma_t|\mathcal{K}'|/|\mathcal{K}|$ in each round. Therefore, we have $\Pr(Z(\mathcal{K}') \ge n) = \prod_{s=t}^{t+n-1}(1 - p_s)$. Since $1 - x \le e^{-x}$ for any $x \in [0,1]$, by requiring $e^{-np_T} \le \delta$ we have $\prod_{s=t}^{t+n-1}(1 - p_s) \le (1 - p_T)^n \le e^{-np_T} \le \delta$. In this case,

$$e^{-np_T} \le \delta \iff np_T \ge \log(1/\delta) \iff n \ge \frac{\log(1/\delta)}{p_T} = \frac{|\mathcal{K}|\log(1/\delta)}{\gamma_T|\mathcal{K}'|}.$$

Therefore, we have

$$\Pr\Big(Y(\mathcal{K}') \ge \frac{|\mathcal{K}|\log(1/\delta)}{\gamma_T|\mathcal{K}'|}\Big) \le \Pr\Big(Z(\mathcal{K}') \ge \frac{|\mathcal{K}|\log(1/\delta)}{\gamma_T|\mathcal{K}'|}\Big) \le \delta,$$

which implies that with probability at least $1 - \delta$, we run at most $|\mathcal{K}|\log(1/\delta)/(\gamma_T|\mathcal{K}'|)$ exploitation rounds before updating $(\hat{r}_t(a), \hat{c}_t(a))$ for some arm $a \in \mathcal{K}'$.
Plugging $\mathcal K(t)\setminus\bigcup_{h=1}^{j-1}\mathcal K(t)(h)$ into $\mathcal K'$ and taking a union bound, with probability at least $1-|\mathcal K(t)|\delta$,
$$Y(\mathcal K(t)) + \sum_{j=2}^{|\mathcal K(t)|} Y\Big(\mathcal K(t)\setminus\bigcup_{h=1}^{j-1}\mathcal K(t)(h)\Big) \ \le\ \frac{|\mathcal K|\log(1/\delta)}{\gamma_T}\sum_{j=1}^{|\mathcal K(t)|}\frac{1}{j} \ \le\ \frac{|\mathcal K|\log(1/\delta)}{\gamma_T}\big(\log(|\mathcal K(t)|)+1\big). \qquad (31)$$
Inequality (31) holds since
$$\sum_{j=1}^{|\mathcal K(t)|}\frac 1j \ \le\ 1 + \int_1^{|\mathcal K(t)|}\frac{dj}{j} = 1 + \log(|\mathcal K(t)|) - \log(1). \qquad (32)$$
Notice that there are at most $L$ change points over the entire planning horizon, contributing to $\hat{\mathcal T}^I(m)$ for at most $2ML|\mathcal K|\log(1/\delta)(\log(|\mathcal K|)+1)/\gamma_T$ rounds, with probability at least $1-|\mathcal K|L\delta$.

Combining Claims 4 and 5, with probability at least $1-2M|\mathcal K|L\delta$,
$$\Big|\mathcal T^R_T\cup\bigcup_{m=-M}^{M-1}\hat{\mathcal T}^I(m)\Big| \ \le\ 2TN\gamma_T + N\sqrt{6T\gamma_T\log(2/\delta)} + \frac{4M^2L|\mathcal K|\log(1/\delta)(\log(|\mathcal K|)+1)}{\gamma_T} \ \le\ 4TN\gamma_T + \frac{4M^2L|\mathcal K|\log(1/\delta)(\log(|\mathcal K|)+1)}{\gamma_T}.$$
Recalling that $N = 27\log(2/\delta)/\big((1-1/\sqrt\alpha)^2\eta_{\min}\big)$ and $\gamma_t = M\sqrt{|\mathcal K|\log(1/\delta)(\log(|\mathcal K|)+1)/(Nt)}$, we further have
$$\Big|\mathcal T^R_T\cup\bigcup_{m=-M}^{M-1}\hat{\mathcal T}^I(m)\Big| \ \le\ 8ML\sqrt{|\mathcal K|NT\log(1/\delta)(\log(|\mathcal K|)+1)} = O\big(L\sqrt{|\mathcal K|NT}\big). \qquad (33)$$
Note that if the DM knows $L$ a priori, then we can set $\gamma_t = M\sqrt{L|\mathcal K|\log(1/\delta)(\log(|\mathcal K|)+1)/(Nt)}$, which results in $|\mathcal T^R_T\cup(\bigcup_{m=-M}^{M-1}\hat{\mathcal T}^I(m))| \le O(\sqrt{L|\mathcal K|NT})$.

C.1 Proof of Lemma 2.1

Let $\pi$ be a non-anticipatory feasible policy that achieves the expected optimum opt(DP) in DP, i.e., $\mathbb E[\sum_{t=1}^T\sum_{a\in\mathcal K}R_t(a)X^\pi_t(a)] = \mathbb E[\text{opt(DP)}]$, where $X^\pi_t$ is the decision variable under policy $\pi$. We let $x_l(a) = \frac{1}{t_l-t_{l-1}}\,\mathbb E\big[\sum_{t=t_{l-1}+1}^{t_l}X^\pi_t(a)\big]$ for each $l=1,\dots,L$ in FA. We claim that $\{x_l\}_{l=1}^L$ is feasible to FA, with objective value equal to $\mathbb E[\sum_{t=1}^T\sum_{a\in\mathcal K}R_t(a)X^\pi_t(a)] = \mathbb E[\text{opt(DP)}]$, which indicates that opt(FA) $\ge \mathbb E[\text{opt(DP)}]$. Thus, verifying the claims about the feasibility and the objective value proves the Lemma.

We first verify the feasibility to FA. Since the policy $\pi$ satisfies the resource constraint, the inequality $\sum_{t=1}^T\sum_{a\in\mathcal K}C_t(a)X^\pi_t(a)\le B$ holds. Taking expectation over $X^\pi_t(a)$ and $C_t(a)$ for $t = t_{l-1}+1,\dots,t_l$ gives
$$B \ \ge\ \mathbb E\Big[\sum_{t=1}^T\sum_{a\in\mathcal K}C_t(a)X^\pi_t(a)\Big] = \sum_{l=1}^L\sum_{t=t_{l-1}+1}^{t_l}\sum_{a\in\mathcal K}c^{(l)}(a)\,\mathbb E[X^\pi_t(a)] = \sum_{l=1}^L(t_l-t_{l-1})\sum_{a\in\mathcal K}c^{(l)}(a)\,x_l(a).$$
Similarly, by taking expectation over each of the reward terms, we have $\mathbb E[\sum_{t=1}^T\sum_{a\in\mathcal K}R_t(a)X^\pi_t(a)] = \sum_{l=1}^L(t_l-t_{l-1})\sum_{a\in\mathcal K}r^{(l)}(a)x_l(a) = \mathbb E[\text{opt(DP)}]$. Hence, the claim about the objective value is shown, and the Lemma is proved.
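To make the fluid benchmark FA concrete, here is a minimal sketch that solves the FA linear program for a toy two-piece, two-arm instance with scipy. The instance numbers are arbitrary, and the variable layout (one decision $x_l(a)$ per piece and arm, with a per-piece simplex constraint whose slack plays the role of the null arm) is our own simplified single-resource rendering, not the paper's code.

```python
import numpy as np
from scipy.optimize import linprog

# Toy piecewise-stationary instance: L = 2 pieces, 2 arms (plus implicit null arm).
lengths = np.array([50, 50])               # piece lengths t_l - t_{l-1}
r = np.array([[0.9, 0.3], [0.2, 0.8]])     # r[l][a]: mean reward of arm a on piece l
c = np.array([[0.5, 0.1], [0.4, 0.6]])     # c[l][a]: mean consumption of arm a on piece l
B = 30.0                                   # resource budget

L, A = r.shape
# Decision variables x[l*A + a] = probability of pulling arm a on piece l.
obj = -(lengths[:, None] * r).ravel()      # linprog minimizes, so negate rewards
A_ub = [(lengths[:, None] * c).ravel()]    # budget: sum_l len_l sum_a c[l][a] x_l(a) <= B
b_ub = [B]
for l in range(L):                         # per-piece simplex: sum_a x_l(a) <= 1
    row = np.zeros(L * A)
    row[l * A:(l + 1) * A] = 1.0
    A_ub.append(row)
    b_ub.append(1.0)

res = linprog(obj, A_ub=np.array(A_ub), b_ub=np.array(b_ub),
              bounds=[(0, 1)] * (L * A), method="highs")
print("opt(FA) =", -res.fun)
print("x =", res.x.round(3).reshape(L, A))
```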
C.2 Proof of Claim 1

Recall that in Algorithm 1, we try each $q\in\{0,1,\dots,M\}$ in a round-robin manner. Then on each stationary piece $l$, for at least $(t_l-t_{l-1})/(M+1)-1$ rounds, we choose $q = q^*_l$ and take the fractional decision $\bar x^{(q^*_l)}_l$ such that (7) and (8) hold. By Claim 3, inequality (8), resources consumed under the decision $\bar x^{(q^*_l)}_l$ are charged to the resources reserved for intervals $\{\max\{m^*_l-1,-M\}, m^*_l\}$. Therefore, for an interval $m$ with $m^*_l = m$,
$$\sum_{t\in\bigcup_{n=\max\{m-1,-M\}}^{m}\mathcal T^{(n)}\cap\{t_{l-1}+1,\dots,t_l\}}\mathbb 1(q_t=q^*_l) \ \ge\ \frac{t_l-t_{l-1}}{M+1}-1. \qquad (34)$$
Recall that if we chose the optimal decision $x^*_l$ on stationary piece $l$, then $B^*_l$ units of resources would be consumed, i.e., $(t_l-t_{l-1})\sum_{a\in\mathcal K}c^{(l)}(a)x^*_l(a) = B^*_l$. Hence, by Claim 3, inequality (7), we have
$$\sum_{a\in\mathcal K}c^{(l)}(a)\bar x^{(q^*_l)}_l(a) \ \ge\ \sum_{a\in\mathcal K}c^{(l)}(a)x^*_l(a) \ \ge\ \frac{B^*_l}{t_l-t_{l-1}}. \qquad (35)$$
Putting everything together, we have
$$\sum_{t\in\bigcup_{n=\max\{m-1,-M\}}^{m}\mathcal T^{(n)}\cap\{t_{l-1}+1,\dots,t_l\}}\sum_{a\in\mathcal K}r_t(a)x_t(a) \;=\; \sum_{t\in\bigcup_{n}\mathcal T^{(n)}\cap\{t_{l-1}+1,\dots,t_l\}}\mathbb 1(q_t=q^*_l)\sum_{a\in\mathcal K}c^{(l)}(a)\bar x^{(q^*_l)}_l(a)\cdot\frac{\sum_{a\in\mathcal K}r^{(l)}(a)\bar x^{(q^*_l)}_l(a)}{\sum_{a\in\mathcal K}c^{(l)}(a)\bar x^{(q^*_l)}_l(a)}$$
$$\ \ge\ \sum_{t\in\bigcup_{n}\mathcal T^{(n)}\cap\{t_{l-1}+1,\dots,t_l\}}\mathbb 1(q_t=q^*_l)\cdot\frac{B^*_l}{t_l-t_{l-1}}\cdot\frac{\sum_{a\in\mathcal K}r^{(l)}(a)x^*_l(a)}{\alpha\sum_{a\in\mathcal K}c^{(l)}(a)x^*_l(a)} \qquad (36)$$
$$\ \ge\ \Big(\frac{t_l-t_{l-1}}{M+1}-1\Big)\frac{B^*_l}{\alpha(t_l-t_{l-1})}\cdot\frac{\sum_{a\in\mathcal K}r^{(l)}(a)x^*_l(a)}{\sum_{a\in\mathcal K}c^{(l)}(a)x^*_l(a)} \qquad (37)$$
$$\ \ge\ \frac{B^*_l}{2\alpha(M+1)}\cdot\frac{\sum_{a\in\mathcal K}r^{(l)}(a)x^*_l(a)}{\sum_{a\in\mathcal K}c^{(l)}(a)x^*_l(a)}. \qquad (38)$$
Inequality (36) holds by plugging in (35) and (8). Inequality (37) holds by plugging in (34). Inequality (38) stands since we can assume $t_l-t_{l-1}\ge 2(M+1)$ without loss of generality: otherwise we can ignore the stationary pieces where $t_l-t_{l-1}\le 2(M+1)$, causing a reward loss of at most $O(M)$. Following (38), we have
$$\sum_{w=\max\{m-1,-M\}}^{\min\{m+1,M-1\}}\sum_{l\in\mathcal L}\mathbb 1(m^*_l=w)\sum_{t\in\bigcup_{n}\mathcal T^{(n)}\cap\{t_{l-1}+1,\dots,t_l\}}\sum_{a\in\mathcal K}r_t(a)x_t(a) \ \ge\ \sum_{l\in\mathcal L}\mathbb 1(m^*_l=m)\sum_{t\in\bigcup_{n}\mathcal T^{(n)}\cap\{t_{l-1}+1,\dots,t_l\}}\sum_{a\in\mathcal K}r_t(a)x_t(a)$$
$$\ \ge\ \frac{1}{2\alpha(M+1)}\sum_{l\in\mathcal L}\mathbb 1(m^*_l=m)\,\frac{\sum_{a\in\mathcal K}r^{(l)}(a)x^*_l(a)}{\sum_{a\in\mathcal K}c^{(l)}(a)x^*_l(a)}\,B^*_l.$$
Therefore, for $m\in\{-M,\dots,M-1\}$ such that $\mathcal T^{(n)} = \mathcal T^{(n)}\cap\mathcal T$ for $n\in\{\max\{m-1,-M\},m\}$ (case (i): the reserved resources never run out), we have
$$\frac{\sum_{w=\max\{m-1,-M\}}^{\min\{m+1,M-1\}}\sum_{l\in\mathcal L}\mathbb 1(m^*_l=w)\sum_{t\in\bigcup_{n}\mathcal T^{(n)}\cap\{t_{l-1}+1,\dots,t_l\}}\sum_{a\in\mathcal K}r_t(a)x_t(a)}{\sum_{l\in\mathcal L}\mathbb 1(m^*_l=m)\frac{\sum_{a\in\mathcal K}r^{(l)}(a)x^*_l(a)}{\sum_{a\in\mathcal K}c^{(l)}(a)x^*_l(a)}B^*_l} \ \ge\ \frac{1}{2\alpha(M+1)}. \qquad (39)$$

C.3 Proof of Claim 2

We let $\bar m = \min\{n\in\{\max\{m-1,-M\},m\} : \mathcal T^{(n)}\subsetneq\mathcal T^{(n)}\cap\mathcal T\}$. Then we have
$$\sum_{t\in\mathcal T^{(\bar m)}}\sum_{a\in\mathcal K}r_t(a)x_t(a) = \sum_{t\in\mathcal T^{(\bar m)}}\frac{\sum_{a\in\mathcal K}r_t(a)x_t(a)}{\sum_{a\in\mathcal K}c_t(a)x_t(a)}\sum_{a\in\mathcal K}c_t(a)x_t(a) \ \ge\ \alpha^{\bar m}\sum_{t\in\mathcal T^{(\bar m)}}\sum_{a\in\mathcal K}c_t(a)x_t(a) \qquad (40)$$
$$\ \ge\ \alpha^{\bar m}\Big(\frac{B-1}{2M}-1\Big). \qquad (41)$$
Inequality (40) stands since in rounds $t\in\mathcal T^{(\bar m)}$ we have $\sum_{a}r_t(a)x_t(a)/\sum_a c_t(a)x_t(a)\in[\alpha^{\bar m},\alpha^{\bar m+1}]$. Inequality (41) holds since $\sum_{t\in\mathcal T^{(\bar m)}}\sum_a c_t(a)x_t(a)\ge(B-1)/(2M)-1$ by the definition of $\mathcal T^{(\bar m)}$. Then we have
$$\sum_{w=\max\{m-1,-M\}}^{\min\{m+1,M-1\}}\sum_{l\in\mathcal L}\mathbb 1(m^*_l=w)\sum_{t\in\bigcup_{m'=-M}^{M-1}\mathcal T^{(m')}\cap\{t_{l-1}+1,\dots,t_l\}}\sum_{a\in\mathcal K}r_t(a)x_t(a) \ \ge\ \sum_{t\in\mathcal T^{(\bar m)}}\sum_{a\in\mathcal K}r_t(a)x_t(a). \qquad (42)$$
Inequality (42) holds since $m_t\in\{\max\{m^*_l-1,-M\},m^*_l\}$ for all $t\in\{t_{l-1}+1,\dots,t_l\}$: it is possible to consume resources reserved for interval $\bar m\in\{\max\{m-1,-M\},m\}$ under $\bar x^{(q^*_l)}_l$ only when $m^*_l\in\{\max\{m-1,-M\},m,\min\{m+1,M-1\}\}$. We also have
$$\sum_{l\in\mathcal L}\mathbb 1(m^*_l=m)\,\frac{\sum_{a\in\mathcal K}r^{(l)}(a)x^*_l(a)}{\sum_{a\in\mathcal K}c^{(l)}(a)x^*_l(a)}\,B^*_l \ \le\ \alpha^{m+1}\sum_{l\in\mathcal L}B^*_l \ \le\ \alpha^{m+1}B. \qquad (43)$$
Putting together (42) and (43), we know that for $m\in\{-M,\dots,M-1\}$ satisfying case (ii), we have
$$\frac{\sum_{w=\max\{m-1,-M\}}^{\min\{m+1,M-1\}}\sum_{l\in\mathcal L}\mathbb 1(m^*_l=w)\sum_{t\in\mathcal T^{(\bar m)}\cap\{t_{l-1}+1,\dots,t_l\}}\sum_{a\in\mathcal K}r_t(a)x_t(a)}{\sum_{l\in\mathcal L}\mathbb 1(m^*_l=m)\frac{\sum_a r^{(l)}(a)x^*_l(a)}{\sum_a c^{(l)}(a)x^*_l(a)}B^*_l} \ \ge\ \frac{\alpha^{\bar m}\big(\frac{B-1}{2M}-1\big)}{\alpha^{m+1}B} \ \ge\ \frac{1-o(1)}{4\alpha^{2}M}. \qquad (44)$$
Inequality (44) holds since we require $B\ge\Omega(M)$ (see Theorem 3.1) and $\bar m\ge m-1$. Combining (39) and (44), we show that
$$(14) \ \ge\ \frac{1-o(1)}{6\alpha^{2}M}. \qquad (45)$$

C.4 Proof of Theorem 4.2

Note that for reward-consumption ratio intervals $n\in\{-M,\dots,M-1\}$ where $\mathcal T^I(n) = \tilde{\mathcal T}^I(n)$, not all requests assigned to these intervals are necessarily satisfied. This is because resources could run out due to exploration before $\sum_{s\in\tilde{\mathcal T}^I(n)}C_s(a_s) > B/(2M)-1$, i.e., when
$$\tilde{\mathcal T}^I(n)\cap\Big\{t\in\mathcal T:\sum_{s=1}^t C_s(a_s)\le B-1\Big\} \ \subsetneq\ \tilde{\mathcal T}^I(n)\cap\mathcal T.$$
It can be seen that requests in rounds $\big(\bigcup_{n=-M}^{M-1}\mathcal T^I(n)\big)\cap\{t\in\mathcal T:\sum_{s=1}^t C_s(a_s)\le B-1\}$ are satisfied. Therefore, we decompose the ratio of the IRES-CM reward to opt(FA) as follows:
$$\frac{\sum_{t\in\mathcal T}R_t(a_t)}{\text{opt(FA)}} = \underbrace{\frac{\sum_{t\in(\bigcup_{n=-M}^{M-1}\mathcal T^I(n))\cap\{t\in\mathcal T:\sum_{s=1}^t C_s(a_s)\le B-1\}}\sum_{a\in\mathcal K}r_t(a)\hat x^{(q_t)}_t(a)}{\text{opt(FA)}}}_{\text{H(1)}} \cdot \underbrace{\frac{\sum_{t\in\mathcal T}R_t(a_t)}{\sum_{t\in(\bigcup_{n=-M}^{M-1}\mathcal T^I(n))\cap\{t\in\mathcal T:\sum_{s=1}^t C_s(a_s)\le B-1\}}\sum_{a\in\mathcal K}r_t(a)\hat x^{(q_t)}_t(a)}}_{\text{H(2)}}.$$
To establish Theorem 4.2, it suffices to show that $H(1)\ge(1-o(1))\big(\text{opt(FA)}-O(\sqrt{L|\mathcal K|NT})\big)/\big(10\alpha^{4}M\,\text{opt(FA)}\big)$ (the forthcoming Section C.4.1) and $H(2)\ge1-o(1)$ (the forthcoming Section C.4.2).

C.4.1 Bounding H(1)

Recall that $\sum_{a}\hat r_t(a)\hat x^{(q_t)}_t(a)/\sum_a\hat c_t(a)\hat x^{(q_t)}_t(a)\in[\alpha^{\hat m_t},\alpha^{\hat m_t+1}]$. We further define $\hat m^{(l)}_t$ such that $\sum_a\hat r_t(a)\hat x^{(q^*_l)}_t(a)/\sum_a\hat c_t(a)\hat x^{(q^*_l)}_t(a)\in[\alpha^{\hat m^{(l)}_t},\alpha^{\hat m^{(l)}_t+1}]$.
Likewise, we define $m^{(l)}_t$ such that $\sum_a r_t(a)\hat x^{(q^*_l)}_t(a)/\sum_a c_t(a)\hat x^{(q^*_l)}_t(a)\in[\alpha^{m^{(l)}_t},\alpha^{m^{(l)}_t+1}]$, and we define $\mathcal T' = \{t\in\mathcal T:\sum_{s=1}^t C_s(a_s)\le B-1\}$. Then H(1) can be further decomposed as:
$$H(1) = \frac{\sum_{m=-M}^{M-1}\sum_{l\in\mathcal L}\sum_{t\in(\bigcup_{n=-M}^{M-1}\mathcal T^I(n))\cap J_l\cap\mathcal T'}\mathbb 1(m^{(l)}_t=m)\sum_a r_t(a)\hat x^{(q_t)}_t(a)}{\sum_{m=-M}^{M-1}\sum_{l\in\mathcal L}\sum_{t\in J_l}\mathbb 1(m^{(l)}_t=m)\sum_a r^{(l)}(a)x^*_l(a)} \qquad (46)$$
$$\ \ge\ \frac15\cdot\frac{\sum_{m=-M}^{M-1}\sum_{l\in\mathcal L}\sum_{t\in(\bigcup_{n}\mathcal T^I(n))\cap J_l\cap\mathcal T'}\sum_{w=\max\{m-2,-M\}}^{\min\{m+2,M-1\}}\mathbb 1(m^{(l)}_t=w)\sum_a r_t(a)\hat x^{(q_t)}_t(a)}{\sum_{m=-M}^{M-1}\sum_{l\in\mathcal L}\sum_{t\in J_l}\mathbb 1(m^{(l)}_t=m)\sum_a r^{(l)}(a)x^*_l(a)}. \qquad (47)$$
Inequality (47) holds since, by summing over $w\in\{\max\{m-2,-M\},\dots,\min\{m+2,M-1\}\}$, we repeat the numerator of (46) at most 5 times.

We partition the set $\{-M,\dots,M-1\}$ into two disjoint sets $\mathcal M_1,\mathcal M_2$. An interval $m\in\mathcal M_1$ if for all $n\in\{\max\{m-1,-M\},m,\min\{m+1,M-1\}\}$ we have $\mathcal T^I(n) = \tilde{\mathcal T}^I(n)$, i.e., $\sum_{s\in\tilde{\mathcal T}^I(n)}C_s(a_s)\le B/(2M)-1$. An interval $m\in\mathcal M_2$ if for some $n\in\{\max\{m-1,-M\},m,\min\{m+1,M-1\}\}$ we have $\mathcal T^I(n)\subsetneq\tilde{\mathcal T}^I(n)$.

Regarding $m\in\mathcal M_1$. In the following analysis, we focus on the good event that (16), (17), (18) in Lemma B.3 hold for all $t\in\mathcal T$. We know that with probability at least $1-2|\mathcal K|T\delta$, the good event holds. On each stationary piece $l\in\mathcal L$ and all $t\in J_l$: for all rounds $t\in(\bigcup_{n=-M}^{M-1}\mathcal T^I(n))\cap J_l$ such that $m^{(l)}_t=m$, we know that $\hat m^{(l)}_t\in\{\max\{m-1,-M\},m,\min\{m+1,M-1\}\}$ (see (16) in Lemma B.3). Due to the round-robin technique, at least a $1/(M+1)$ fraction of all rounds $t\in\bigcup_{n=-M}^{M-1}\mathcal T^I(n)\cap J_l$, $l\in\mathcal L$, such that $m^{(l)}_t=m$ are allocated resources reserved for the intervals $n\in\{\max\{m-1,-M\},m,\min\{m+1,M-1\}\}$. Therefore, we have
$$\sum_{t\in\bigcup_{n=\max\{m-1,-M\}}^{\min\{m+1,M-1\}}\mathcal T^I(n)\cap J_l}\mathbb 1(m^{(l)}_t=m)\,\mathbb 1(q_t=q^*_l) \ \ge\ \frac{1}{M+1}\sum_{t\in\bigcup_{n=-M}^{M-1}\mathcal T^I(n)\cap J_l}\mathbb 1(m^{(l)}_t=m)-1. \qquad (48)$$
We ignore the stationary pieces $l$ where $|J_l|\le 2M$, since this causes a loss of at most $O(M)$.

For $m\in\mathcal M_1$, although we have $\bigcup_{n=\max\{m-1,-M\}}^{\min\{m+1,M-1\}}\mathcal T^I(n) = \bigcup_{n=\max\{m-1,-M\}}^{\min\{m+1,M-1\}}\tilde{\mathcal T}^I(n)$ (i.e., $\sum_{s\in\tilde{\mathcal T}^I(n)}C_s(a_s)\le B/(2M)-1$ for all $n\in\{\max\{m-1,-M\},m,\min\{m+1,M-1\}\}$), it is not necessary that all requests assigned to the intervals $\{\max\{m-1,-M\},m,\min\{m+1,M-1\}\}$ are satisfied. The resource units reserved for these intervals can run out due to exploration, i.e., when
$$\bigcup_{n=\max\{m-1,-M\}}^{\min\{m+1,M-1\}}\mathcal T^I(n)\cap\Big\{t\in\mathcal T:\sum_{s=1}^t C_s(a_s)\le B-1\Big\} \ \subsetneq\ \bigcup_{n=\max\{m-1,-M\}}^{\min\{m+1,M-1\}}\mathcal T^I(n)\cap\mathcal T.$$
We define
$$(\dagger)(m) = \sum_{l\in\mathcal L}\sum_{t\in(\bigcup_{n=-M}^{M-1}\mathcal T^I(n))\cap J_l\setminus\mathcal T'}\mathbb 1(m^{(l)}_t=m)\sum_{a\in\mathcal K}r_t(a)\hat x^{(q_t)}_t(a).$$
Then we have
$$\sum_{t\in(\bigcup_{n}\mathcal T^I(n))\cap J_l}\sum_{w=\max\{m-2,-M\}}^{\min\{m+2,M-1\}}\mathbb 1(m^{(l)}_t=w)\sum_a r_t(a)\hat x^{(q_t)}_t(a) \ \ge\ \sum_{t\in(\bigcup_{n}\mathcal T^I(n))\cap J_l}\mathbb 1(m^{(l)}_t=m)\sum_a r_t(a)\hat x^{(q_t)}_t(a)$$
$$\ \ge\ \sum_{t\in\bigcup_{n=\max\{m-1,-M\}}^{\min\{m+1,M-1\}}\mathcal T^I(n)\cap J_l}\mathbb 1(m^{(l)}_t=m)\,\mathbb 1(q_t=q^*_l)\sum_a r_t(a)\hat x^{(q^*_l)}_t(a) - (\dagger)(m)$$
$$\ \ge\ \frac{1}{\sqrt\alpha}\sum_{t\in\bigcup_{n=\max\{m-1,-M\}}^{\min\{m+1,M-1\}}\mathcal T^I(n)\cap J_l}\mathbb 1(m^{(l)}_t=m)\,\mathbb 1(q_t=q^*_l)\sum_a r^{(l)}(a)x^*_l(a) - (\dagger)(m) \qquad (49)$$
$$\ \ge\ \frac{1}{\alpha(M+1)}\sum_{t\in\bigcup_{n=-M}^{M-1}\mathcal T^I(n)\cap J_l}\mathbb 1(m^{(l)}_t=m)\sum_a r^{(l)}(a)x^*_l(a) - (\dagger)(m). \qquad (50)$$
Inequality (49) follows from (17) in Lemma B.3, and inequality (50) follows from inequality (48).
Hence, summing over $l\in\mathcal L$ and $m\in\mathcal M_1$, we have
$$\frac{\sum_{m\in\mathcal M_1}\sum_{l\in\mathcal L}\sum_{t\in(\bigcup_{n}\mathcal T^I(n))\cap J_l}\mathbb 1(m^{(l)}_t=m)\sum_a r_t(a)\hat x^{(q_t)}_t(a)}{\sum_{m\in\mathcal M_1}\sum_{l\in\mathcal L}\sum_{t\in J_l}\mathbb 1(m^{(l)}_t=m)\sum_a r^{(l)}(a)x^*_l(a)}$$
$$\ \ge\ \max\left\{\frac{\frac{1}{\alpha(M+1)}\sum_{m\in\mathcal M_1}\sum_{l}\sum_{t\in\bigcup_{n}\mathcal T^I(n)\cap J_l}\mathbb 1(m^{(l)}_t=m)\sum_a r^{(l)}(a)x^*_l(a) - |\mathcal T^R_T|}{\sum_{m\in\mathcal M_1}\sum_{l}\sum_{t\in J_l}\mathbb 1(m^{(l)}_t=m)\sum_a r^{(l)}(a)x^*_l(a)},\ 0\right\} \qquad (51)$$
$$\ \ge\ \max\left\{\frac{\frac{1}{\alpha(M+1)}\sum_{m\in\mathcal M_1}\sum_{l}\sum_{t\in J_l}\mathbb 1(m^{(l)}_t=m)\sum_a r^{(l)}(a)x^*_l(a) - \big|\mathcal T^R_T\cup\bigcup_{m=-M}^{M-1}\hat{\mathcal T}^I(m)\big|}{\sum_{m\in\mathcal M_1}\sum_{l}\sum_{t\in J_l}\mathbb 1(m^{(l)}_t=m)\sum_a r^{(l)}(a)x^*_l(a)},\ 0\right\} \qquad (52)$$
$$\ \ge\ \max\left\{\frac{\frac{1}{\alpha(M+1)}\sum_{m\in\mathcal M_1}\sum_{l}\sum_{t\in J_l}\mathbb 1(m^{(l)}_t=m)\sum_a r^{(l)}(a)x^*_l(a) - O(\sqrt{L|\mathcal K|NT})}{\sum_{m\in\mathcal M_1}\sum_{l}\sum_{t\in J_l}\mathbb 1(m^{(l)}_t=m)\sum_a r^{(l)}(a)x^*_l(a)},\ 0\right\} \quad\text{w.p. at least } 1-2M|\mathcal K|L\delta. \qquad (53)$$
Inequality (51) holds since $\sum_{m=-M}^{M-1}(\dagger)(m)\le|\mathcal T^R_T|$. Inequality (52) is valid since the rounds discarded in passing from (51) to (52) are contained in $\mathcal T^R_T\cup\bigcup_{n=-M}^{M-1}\hat{\mathcal T}^I(n)$, recalling that $\mathcal T^I(n) = \tilde{\mathcal T}^I(n)$ for the relevant $n$; inequality (53) then follows from the bound (33).

Regarding $m\in\mathcal M_2$. Suppose that in some interval $\hat n\in\{\max\{m-1,-M\},\dots,\min\{m+1,M-1\}\}$ we have $\sum_{t\in\mathcal T^I(\hat n)}C_t(a_t) > B/(2M)-1$. We aim to validate the following two inequalities respectively:
$$\sum_{l\in\mathcal L}\sum_{t\in(\bigcup_{n}\mathcal T^I(n))\cap J_l}\sum_{w=\max\{m-2,-M\}}^{\min\{m+2,M-1\}}\mathbb 1(m^{(l)}_t=w)\sum_a r_t(a)\hat x^{(q_t)}_t(a) \ \ge\ \alpha^{\hat n-1}\Big(\frac{B}{2M}-\sqrt{\frac{B}{2M}\log(1/\delta)}-4\log(1/\delta)-1\Big), \qquad (54)$$
$$\sum_{l\in\mathcal L}\sum_{t\in J_l}\mathbb 1(m^{(l)}_t=m)\sum_a r^{(l)}(a)x^*_l(a) \ \le\ \alpha^{m+5/2}B. \qquad (55)$$
Given (54) and (55), for any interval $m\in\mathcal M_2$,
$$\frac{\sum_{l}\sum_{t\in(\bigcup_{n}\mathcal T^I(n))\cap J_l}\sum_{w=\max\{m-2,-M\}}^{\min\{m+2,M-1\}}\mathbb 1(m^{(l)}_t=w)\sum_a r_t(a)\hat x^{(q_t)}_t(a)}{\sum_{l}\sum_{t\in J_l}\mathbb 1(m^{(l)}_t=m)\sum_a r^{(l)}(a)x^*_l(a)} \ \ge\ \frac{\alpha^{\hat n-1}\big(\frac{B}{2M}-\sqrt{\frac{B}{2M}\log(1/\delta)}-4\log(1/\delta)-1\big)}{\alpha^{m+5/2}B} \ \ge\ \frac{1-o(1)}{2\alpha^{4}M}, \qquad (56)$$
where the last inequality uses $\hat n\ge m-1$ and $B\ge\Omega(M)$.

Validating (54). In our algorithm, for any $t\in\bigcup_{m=-M}^{M-1}\mathcal T^I(m)$, we have $a_t\sim\hat x^{(q_t)}_t$, and hence $\mathbb E[R_t(a_t)\,|\,\mathcal F_{t-1}] = \sum_a r_t(a)\hat x^{(q_t)}_t(a)$ and $\mathbb E[C_t(a_t)\,|\,\mathcal F_{t-1}] = \sum_a c_t(a)\hat x^{(q_t)}_t(a)$. Therefore, by Lemma B.2, for any $\delta\in(0,1)$, with probability at least $1-3\delta$ we have
$$\sum_{t\in\mathcal T^I(\hat n)}\sum_a c_t(a)\hat x^{(q_t)}_t(a) \ \ge\ \sum_{t\in\mathcal T^I(\hat n)}C_t(a_t) - \sqrt{\sum_{t\in\mathcal T^I(\hat n)}C_t(a_t)\,\log(1/\delta)} - 4\log(1/\delta) \ \ge\ \frac{B}{2M}-\sqrt{\frac{B}{2M}\log(1/\delta)}-4\log(1/\delta)-1. \qquad (57)$$
Inequality (57) holds since $B/(2M)-1 < \sum_{t\in\mathcal T^I(\hat n)}C_t(a_t) \le B/(2M)$. Then we have
$$\sum_{t\in\mathcal T^I(\hat n)}\sum_a r_t(a)\hat x^{(q_t)}_t(a) = \sum_{t\in\mathcal T^I(\hat n)}\frac{\sum_a r_t(a)\hat x^{(q_t)}_t(a)}{\sum_a c_t(a)\hat x^{(q_t)}_t(a)}\sum_a c_t(a)\hat x^{(q_t)}_t(a) \ \ge\ \alpha^{\hat n-1}\sum_{t\in\mathcal T^I(\hat n)}\sum_a c_t(a)\hat x^{(q_t)}_t(a) \qquad (58)$$
$$\ \ge\ \alpha^{\hat n-1}\Big(\frac{B}{2M}-\sqrt{\frac{B}{2M}\log(1/\delta)}-4\log(1/\delta)-1\Big). \qquad (59)$$
Inequality (58) holds since for all $t\in\mathcal T^I(\hat n)$ we have $\sum_a\hat r_t(a)\hat x^{(q_t)}_t(a)/\sum_a\hat c_t(a)\hat x^{(q_t)}_t(a)\ge\alpha^{\hat n}$; then by inequality (16) in Lemma B.3, $\sum_a r_t(a)\hat x^{(q_t)}_t(a)/\sum_a c_t(a)\hat x^{(q_t)}_t(a)\ge\alpha^{\hat n-1}$. Inequality (59) follows from inequality (57). We further have
$$\sum_{l}\sum_{t\in(\bigcup_{n}\mathcal T^I(n))\cap J_l}\sum_{w=\max\{m-2,-M\}}^{\min\{m+2,M-1\}}\mathbb 1(m^{(l)}_t=w)\sum_a r_t(a)\hat x^{(q_t)}_t(a) \ \ge\ \sum_{l}\sum_{t\in\mathcal T^I(\hat n)\cap J_l}\sum_{w=\max\{m-2,-M\}}^{\min\{m+2,M-1\}}\mathbb 1(m^{(l)}_t=w)\sum_a r_t(a)\hat x^{(q_t)}_t(a) \qquad (60)$$
$$=\ \sum_{l}\sum_{t\in\mathcal T^I(\hat n)\cap J_l}\sum_a r_t(a)\hat x^{(q_t)}_t(a) \ =\ \sum_{t\in\mathcal T^I(\hat n)}\sum_a r_t(a)\hat x^{(q_t)}_t(a) \ \ge\ \alpha^{\hat n-1}\Big(\frac{B}{2M}-\sqrt{\frac{B}{2M}\log(1/\delta)}-4\log(1/\delta)-1\Big). \qquad (61)$$
Inequality (60) holds since the total $B$ resource units have not run out before the reserved $B/(2M)$ resource units for interval $\hat n$ run out, i.e., $\mathcal T^I(\hat n)\cap\{t\in\mathcal T:\sum_{s=1}^t C_s(a_s)\le B-1\} = \mathcal T^I(\hat n)\cap\mathcal T$. Equality (61) is valid since for $t\in\mathcal T^I(\hat n)$ we have $\hat m^{(l)}_t=\hat n$; hence $m^{(l)}_t\in\{\max\{\hat n-1,-M\},\hat n,\min\{\hat n+1,M-1\}\}\subseteq\{\max\{m-2,-M\},\dots,\min\{m+2,M-1\}\}$.

Validating (55). For $l\in\mathcal L$ such that $q^*_l=0$, by (18) in Lemma B.3 we have
$$\sum_{t\in J_l}\mathbb 1(m^{(l)}_t=m)\sum_a r^{(l)}(a)x^*_l(a) \ \le\ \sum_{t\in J_l}\mathbb 1\left(\frac{\sum_a r^{(l)}(a)x^*_l(a)}{\sum_a c^{(l)}(a)x^*_l(a)}\in[\alpha^{m-3/2},\alpha^{m+5/2}]\right)\frac{\sum_a r^{(l)}(a)x^*_l(a)}{\sum_a c^{(l)}(a)x^*_l(a)}\sum_a c^{(l)}(a)x^*_l(a) \ \le\ \alpha^{m+5/2}\sum_{t\in J_l}\sum_a c^{(l)}(a)x^*_l(a). \qquad (62)$$
For $l\in\mathcal L$ such that $q^*_l>0$, by (15) we have
$$\sum_a\hat c_t(a)\hat x^{(q^*_l)}_t(a) \ \le\ \eta_{\min}\,\alpha^{q^*_l} \ \le\ \alpha\sum_a c^{(l)}(a)x^*_l(a).$$
In this case,
$$\sum_{t\in J_l}\mathbb 1(m^{(l)}_t=m)\sum_a r^{(l)}(a)x^*_l(a) \ \le\ \sum_{t\in J_l}\mathbb 1(m^{(l)}_t=m)\sum_a\hat r_t(a)\hat x^{(q^*_l)}_t(a) \qquad (63)$$
$$=\ \sum_{t\in J_l}\mathbb 1(m^{(l)}_t=m)\,\frac{\sum_a\hat r_t(a)\hat x^{(q^*_l)}_t(a)}{\sum_a\hat c_t(a)\hat x^{(q^*_l)}_t(a)}\sum_a\hat c_t(a)\hat x^{(q^*_l)}_t(a) \ \le\ \alpha\sum_{t\in J_l}\mathbb 1(m^{(l)}_t=m)\,\frac{\sum_a\hat r_t(a)\hat x^{(q^*_l)}_t(a)}{\sum_a\hat c_t(a)\hat x^{(q^*_l)}_t(a)}\sum_a c^{(l)}(a)x^*_l(a). \qquad (64)$$
Putting together (62) and (64), we have
$$\sum_{t\in J_l}\mathbb 1(m^{(l)}_t=m)\sum_a r^{(l)}(a)x^*_l(a) \ \le\ \alpha^{m+5/2}\sum_{t\in J_l}\sum_a c^{(l)}(a)x^*_l(a) \ \le\ \alpha^{m+5/2}B.$$
Finally, let us combine the two cases $m\in\mathcal M_1$ and $m\in\mathcal M_2$. If $\sum_{m=-M}^{M-1}\sum_{l}\sum_{t\in J_l}\mathbb 1(m^{(l)}_t=m)\sum_a r^{(l)}(a)x^*_l(a)\le O(\sqrt{L|\mathcal K|NT})$, then $(53)=0$, and we have
$$H(1) = \frac{\sum_{l}\sum_{t\in(\bigcup_{n}\mathcal T^I(n))\cap J_l\cap\mathcal T'}\sum_a r_t(a)\hat x^{(q_t)}_t(a)}{\sum_{m=-M}^{M-1}\sum_{l}\sum_{t\in J_l}\mathbb 1(m^{(l)}_t=m)\sum_a r^{(l)}(a)x^*_l(a)} \ \ge\ \frac{(1-o(1))\big(\text{opt(FA)} - O(\sqrt{L|\mathcal K|NT})\big)}{10\alpha^{4}M\,\text{opt(FA)}}. \qquad (65)$$
If instead $\sum_{m=-M}^{M-1}\sum_{l}\sum_{t\in J_l}\mathbb 1(m^{(l)}_t=m)\sum_a r^{(l)}(a)x^*_l(a)\ge\Omega(\sqrt{L|\mathcal K|NT})$, then $(53)\ge(1-o(1))/(\alpha(M+1))$, and
$$H(1) \ \ge\ \min\left\{\frac{1-o(1)}{5\alpha(M+1)},\ \frac{1-o(1)}{10\alpha^{4}M}\right\}. \qquad (66)$$
Therefore, we conclude that
$$H(1) \ \ge\ \frac{1-o(1)}{10\alpha^{4}M}\cdot\frac{\text{opt(FA)}-O(\sqrt{L|\mathcal K|NT})}{\text{opt(FA)}}$$
with probability at least $1-2|\mathcal K|(ML+T)\delta$.

C.4.2 Bounding H(2)

H(2) bounds the gap between the stochastic reward achieved by the randomized decisions and its expectation. In IRES-CM, for $t\in\bigcup_{m=-M}^{M-1}\mathcal T^I(m)$ we have $a_t\sim\hat x^{(q_t)}_t$, and hence $\mathbb E[R_t(a_t)\,|\,\mathcal F_{t-1}] = \sum_a r_t(a)\hat x^{(q_t)}_t(a)$. Then by Lemma B.1, for any $\delta\in(0,1)$, with probability at least $1-\delta$ we have
$$\sum_{t\in\bigcup_{m=-M}^{M-1}\mathcal T^I(m)}R_t(a_t) \ \in\ \Big[\mu - \sqrt{2\mu\log(2/\delta)},\ \mu + \sqrt{3\mu\log(2/\delta)}\Big], \quad\text{where } \mu = \sum_{l\in\mathcal L}\sum_{t\in(\bigcup_{n}\mathcal T^I(n))\cap J_l}\sum_a r_t(a)\hat x^{(q_t)}_t(a) = H(1)\cdot\text{opt(FA)}.$$
Since $H(1)\cdot\text{opt(FA)} \ge (1-o(1))\big(\text{opt(FA)}-O(\sqrt{L|\mathcal K|NT})\big)/(10\alpha^{4}M) \ge \Omega(\text{opt(FA)}) \ge \Omega(\sqrt{L|\mathcal K|NT})$, it is evident that $H(2)\ge1-o(1)$ with probability at least $1-\delta$.

C.5 Proof of Lemma 2.3

C.5.1 Proof of part (a)

The horizon $T$ is partitioned into $L = T/B$ pieces with equal length $B$. We consider $L$ instances with two arms $\mathcal K = \{1\}$ and $a_{\text{null}}$, where instance $n$ happens with probability $p_n$. All instances have deterministic outcomes, and they share the same consumption model $C_t(1)=1$ for all $t\in\mathcal T$. Their reward functions are:
$$\text{Instance }1:\quad R^{(1)}(1) = (\underbrace{\alpha^{-L},\dots,\alpha^{-L}}_{B\text{ rounds}},\ \underbrace{\alpha^{-L+1},\dots,\alpha^{-L+1}}_{B\text{ rounds}},\ \dots,\ \underbrace{\alpha^{-1},\dots,\alpha^{-1}}_{B\text{ rounds}})$$
$$\text{Instance }2:\quad R^{(2)}(1) = (\underbrace{\alpha^{-L},\dots,\alpha^{-L}}_{B\text{ rounds}},\ \underbrace{\alpha^{-L+1},\dots,\alpha^{-L+1}}_{B\text{ rounds}},\ \dots,\ \underbrace{0,\dots,0}_{B\text{ rounds}})$$
$$\vdots$$
$$\text{Instance }L:\quad R^{(L)}(1) = (\underbrace{\alpha^{-L},\dots,\alpha^{-L}}_{B\text{ rounds}},\ \underbrace{0,\dots,0}_{B\text{ rounds}},\ \dots,\ \underbrace{0,\dots,0}_{B\text{ rounds}})$$
Denote by FA$(n)$ the FA for instance $n\in\{1,\dots,L\}$. It is clear that opt(FA$(n)$) $= B\alpha^{-n}$. Recall $X_t(1) = \mathbb 1(\text{pull arm 1 in round }t)$. By Yao's principle (Yao, 1977), the competitive ratio of any online algorithm is at most
$$\sum_{n=1}^L p_n\,\frac{\mathbb E^{(n)}\big[\sum_{t=1}^T R^{(n)}_t(1)X_t(1)\big]}{\text{opt(FA}(n))} \qquad (67)$$
for any $p_n\ge0$ with $\sum_{n=1}^L p_n=1$, where the expectation $\mathbb E^{(n)}$ is over the randomness of $X_t$ in instance $n$. The instances are crafted such that during piece $j\in\{1,\dots,L\}$ it is impossible to distinguish among instances $j,\dots,L$, meaning that the quantities $B^{(n)}_j = \mathbb E^{(n)}[\sum_{t\in\text{piece }j}X_t(1)]$ for $n\in\{j,\dots,L\}$ are all identical and equal to a common value $B_j$. Thus, for instance $n\in\{1,\dots,L\}$ we have $\mathbb E^{(n)}[\sum_{t=1}^T R^{(n)}_t(1)X_t(1)] \le \sum_{j=n}^{L}B_j\,\alpha^{-j}$. Consequently,
$$(67) \ \le\ \sum_{n=1}^L p_n\,\frac{\sum_{j=n}^L B_j\,\alpha^{-j}}{B\alpha^{-n}} = \frac1B\sum_{j=1}^L B_j\sum_{n=1}^{j}p_n\,\alpha^{n-j}.$$
By defining
$$p_1 = \frac{1}{L(1-1/\alpha)+1/\alpha},\qquad p_n = \Big(1-\frac1\alpha\Big)p_1\ \text{ for } n=2,\dots,L,$$
we have for every $j=1,\dots,L$,
$$\sum_{n=1}^{j}p_n\,\alpha^{n-j} \ \le\ \frac{1}{L(1-1/\alpha)+1/\alpha},$$
leading to
$$(67) \ \le\ \frac{1}{L(1-1/\alpha)+1/\alpha}$$
by the inventory constraint $\sum_{j=1}^L B_j\le B$ on instance 1. Since $L$ can generally be larger than $\log_\alpha(\eta_{\max}/\eta_{\min})$, we have shown that the CR can be significantly larger than $\log_\alpha(\eta_{\max}/\eta_{\min})$ when $\eta_{\min} = 0$.
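The choice of $\{p_n\}$ above telescopes exactly: every piece $j$ contributes the same weight $p_1$ to the bound on (67). A minimal numerical check of this fact (our own illustration, not part of the proof):

```python
import numpy as np

# Prior from part (a): p_1 = 1 / (L(1 - 1/alpha) + 1/alpha),
# p_n = (1 - 1/alpha) p_1 for n >= 2.
alpha, L = 2.0, 8
p1 = 1.0 / (L * (1 - 1 / alpha) + 1 / alpha)
p = np.array([p1] + [(1 - 1 / alpha) * p1] * (L - 1))
assert abs(p.sum() - 1.0) < 1e-12                  # valid distribution

for j in range(1, L + 1):
    s = sum(p[n - 1] * alpha ** (n - j) for n in range(1, j + 1))
    assert abs(s - p1) < 1e-12  # telescoping: each piece contributes exactly p_1
print(f"CR upper bound ~ p_1 = {p1:.4f}, i.e. Theta(1/L)")
```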
C.5.2 Proof of part (b)

While it is possible to derive a coarser bound without $\eta_{\max}$ by setting it to its trivial upper bound of 1, knowing the lower bound $\eta_{\min}$ is essential for our algorithm to function. To show that it is necessary to know $\eta_{\min}$, we suppose the DM is provided only with a looser lower range parameter $\underline\eta_{\min} < \eta_{\min} \le r_t(a), c_t(a)$ for all $a,t$, and show that this leads to a sub-optimal CR.

A general case construction. We first construct a case with $N+1$ instances when $\underline\eta_{\min} = \beta^{-N}$ for some absolute constant $\beta>1$. We consider $N+1$ instances with two arms $\mathcal K=\{1\}$ and $a_{\text{null}}$, where instance $n$ happens with probability $p_n$. All instances have deterministic outcomes, and they share the same reward model $R_t(1)=1$ for all $t$. Their consumption functions are:
$$\text{Instance }0:\quad C^{(0)}(1) = (\underbrace{1,\dots,1}_{\text{Piece 0: }B\text{ rounds}})$$
$$\text{Instance }1:\quad C^{(1)}(1) = (\underbrace{1,\dots,1}_{\text{Piece 0: }B\text{ rounds}},\ \underbrace{1/\beta,\dots,1/\beta}_{\text{Piece 1: }B\beta\text{ rounds}})$$
$$\text{Instance }2:\quad C^{(2)}(1) = (\underbrace{1,\dots,1}_{\text{Piece 0: }B\text{ rounds}},\ \underbrace{1/\beta,\dots,1/\beta}_{\text{Piece 1: }B\beta\text{ rounds}},\ \underbrace{1/\beta^2,\dots,1/\beta^2}_{\text{Piece 2: }B\beta^2\text{ rounds}})$$
$$\vdots$$
$$\text{Instance }N:\quad C^{(N)}(1) = (\underbrace{1,\dots,1}_{\text{Piece 0: }B\text{ rounds}},\ \underbrace{1/\beta,\dots,1/\beta}_{\text{Piece 1: }B\beta\text{ rounds}},\ \dots,\ \underbrace{1/\beta^N,\dots,1/\beta^N}_{\text{Piece }N:\ B\beta^N\text{ rounds}})$$
Denote by FA$(n)$ the FA for instance $n\in\{0,\dots,N\}$. It is clear that opt(FA$(n)$) $= B\beta^{n}$. Recall $X_t(1)=\mathbb 1(\text{pull arm 1 in round }t)$. By Yao's principle (Yao, 1977), the competitive ratio of any online algorithm is at most
$$\sum_{n=0}^N p_n\,\frac{\mathbb E^{(n)}\big[\sum_{t=1}^T R^{(n)}_t(1)X_t(1)\big]}{\text{opt(FA}(n))} \qquad (68)$$
for any $p_n\ge0$ with $\sum_{n=0}^N p_n=1$, where the expectation $\mathbb E^{(n)}$ is over the randomness of $X_t$ in instance $n$. The instances are crafted such that during piece $j\in\{0,\dots,N\}$ it is impossible to distinguish among instances $j,\dots,N$, meaning that the quantities $B^{(n)}_j = \mathbb E^{(n)}[\sum_{t\in\text{piece }j}C^{(n)}_t X_t(1)]$ for $n\in\{j,\dots,N\}$ are all identical and equal to a common value $B_j$. Thus, for instance $n\in\{0,\dots,N\}$ we have $\mathbb E^{(n)}[\sum_{t=1}^T R^{(n)}_t(1)X_t(1)] \le \sum_{j=0}^{n}B_j\,\beta^{j}$. Consequently,
$$(68) \ \le\ \sum_{n=0}^N p_n\,\frac{\sum_{j=0}^n B_j\,\beta^{j}}{B\beta^{n}} = \frac1B\sum_{j=0}^N B_j\sum_{n=j}^{N}p_n\,\beta^{j-n}.$$
By defining
$$p_0 = \frac{1}{(N+1)(1-1/\beta)+1/\beta},\qquad p_n = \Big(1-\frac1\beta\Big)p_0\ \text{ for } n=1,\dots,N,$$
we have for every $j=0,\dots,N$,
$$\sum_{n=j}^{N}p_n\,\beta^{j-n} \ \le\ \frac{2}{(N+1)(1-1/\beta)+1/\beta},$$
leading to
$$(68) \ \le\ \frac{2}{(N+1)(1-1/\beta)+1/\beta}$$
by the inventory constraint $\sum_{j=0}^N B_j\le B$ on instance $N$. Therefore, when the DM is provided with the information $\underline\eta_{\min} = \beta^{-N}$ for any $N\in\mathbb N$, a CR lower bound of $\Theta(N)$ is derived from the $N+1$ instances constructed above.

Not knowing $\eta_{\min}=\beta^{-\Lambda}$ but knowing $\underline\eta_{\min}=\beta^{-\kappa\Lambda}$. We suppose the real underlying $\eta_{\min}=\beta^{-\Lambda}$ for some constant $\Lambda$ and $\eta_{\max}=1$, but the DM only has the weaker prior information that $\underline\eta_{\min}=\beta^{-\kappa\Lambda}$ ($\kappa>1$ can be set arbitrarily large) and $\eta_{\max}=1$. The pattern of $(R_t,C_t)$ follows the construction above, so a different $\underline\eta_{\min}$ leads to a different number of instances $N$; the DM only knows that the number of instances is no larger than $\kappa\Lambda$. From the DM's point of view, the best CR she/he could derive is $\Theta(\kappa\Lambda)$, while from the perspective of a clairvoyant who knows the real $\eta_{\min}=\beta^{-\Lambda}$, the optimal CR is $\Theta(\Lambda)$. We first show that, given $\underline\eta_{\min}=\beta^{-\kappa\Lambda}$, the DM does not benefit from tightening the value range by blindly guessing a value of $\eta_{\min}$.
We suppose the DM blindly tightens the value range to $[\beta^{-(\kappa\Lambda-d)},1]$ for some $d\ge1$, without knowing the real $\eta_{\min}$. Then she/he derives a CR lower bound with $N+1 = \kappa\Lambda-d+1$ instances based on the above construction, and can expect to achieve a total reward of
$$\sum_{t=1}^T R_t(1)X^{\text{TIGHT}}_t(1) = \frac{\sum_{n=0}^N p_n\,\text{opt(FA}^{\text{TIGHT}}(n))}{(N+1)(1-1/\beta)+1/\beta} = \Theta\Big(\frac{B\beta^{\kappa\Lambda-d}}{\kappa\Lambda-d}\Big).$$
However, since the DM does not know the real $\eta_{\min}$, it is possible that in fact $\eta_{\min}=\beta^{-\kappa\Lambda}$. If this is indeed the case, the optimal reward can be as large as $\sum_{n}p_n\,\text{opt(FA}(n)) = \Omega(B\beta^{\kappa\Lambda})$ based on the above construction with $N=\kappa\Lambda$ instances. Hence, from the DM's perspective, she/he could suffer a sub-optimal CR of
$$\frac{\sum_{n}p_n\,\text{opt(FA}(n))}{\sum_{t=1}^T R_t(1)X^{\text{TIGHT}}_t(1)} = \Omega\Big(B\beta^{\kappa\Lambda}\cdot\frac{\kappa\Lambda-d}{B\beta^{\kappa\Lambda-d}}\Big) = \Omega\big(\beta^{d}(\kappa\Lambda-d)\big)$$
by blindly assuming $\eta_{\min}=\beta^{-(\kappa\Lambda-d)}$. This is significantly worse than the optimal CR $\Theta(\kappa\Lambda)$ (if in fact $\eta_{\min}=\beta^{-\kappa\Lambda}$). Thus, the DM has no motivation to assume a lower bound larger than the provided $\underline\eta_{\min}$.

Therefore, the DM must derive a CR on the full range $[\beta^{-\kappa\Lambda},1]$, which involves $N+1=\kappa\Lambda+1$ instances as constructed above, and the DM expects a reward of $\Theta(B\beta^{\kappa\Lambda}/(\kappa\Lambda))$. However, since in fact there are only $\Lambda+1$ instances, the DM wastes all the resources reserved for instances $\Lambda+2,\dots,\kappa\Lambda+1$, and she/he can only achieve a reward of $O(B\beta^{\Lambda}/(\kappa\Lambda))$. Compared with the actual optimal reward $\Omega(B\beta^{\Lambda})$ achievable with $\Lambda+1$ instances, the DM achieves a sub-optimal CR of $\Omega(\kappa\Lambda)$. Since $\kappa$ can be arbitrarily large, the CR derived without correct knowledge of $\eta_{\min}$ is significantly worse than the optimal CR $\Theta(\Lambda)$.

C.6 Proof of Theorem 4.5

We prove Theorem 4.5 by considering $2\nu+1$ instances, which share the same $\mathcal K=\{1\}$, $B\in\mathbb Z_{>0}$ and $T = B(2\nu+1)$ (all instances include the null arm, as stipulated by our model definition). All instances have deterministic outcomes, and they share the same consumption model $C_t(1)=1$ for all $t\in\mathcal T$; by contrast, they differ in the reward model. The horizon $T$ is partitioned into $2\nu+1$ pieces with equal length $B$. Their reward functions are:
$$\text{Instance }-\nu:\quad R^{(-\nu)}(1) = (\underbrace{\alpha^{-2\nu},\dots,\alpha^{-2\nu}}_{B\text{ rounds}},\ \underbrace{\alpha^{-2\nu+1},\dots,\alpha^{-2\nu+1}}_{B\text{ rounds}},\ \dots,\ \underbrace{\alpha^{0},\dots,\alpha^{0}}_{B\text{ rounds}})$$
$$\text{Instance }-\nu+1:\quad R^{(-\nu+1)}(1) = (\underbrace{\alpha^{-2\nu},\dots,\alpha^{-2\nu}}_{B\text{ rounds}},\ \underbrace{\alpha^{-2\nu+1},\dots,\alpha^{-2\nu+1}}_{B\text{ rounds}},\ \dots,\ \underbrace{\epsilon,\dots,\epsilon}_{B\text{ rounds}})$$
$$\vdots$$
$$\text{Instance }\nu:\quad R^{(\nu)}(1) = (\underbrace{\alpha^{-2\nu},\dots,\alpha^{-2\nu}}_{B\text{ rounds}},\ \underbrace{\epsilon,\dots,\epsilon}_{B\text{ rounds}},\ \dots,\ \underbrace{\epsilon,\dots,\epsilon}_{B\text{ rounds}})$$
where $\epsilon = \alpha^{-3\nu}$. Denote by FA$(n)$ the FA for instance $n\in\{-\nu,\dots,\nu\}$. It is clear that opt(FA$(n)$) $= B\alpha^{-\nu-n}$. Recall $X_t(1)=\mathbb 1(\text{pull arm 1 in round }t)$. By Yao's principle (Yao, 1977), the competitive ratio of any online algorithm is at most
$$\sum_{n=-\nu}^{\nu}p_n\,\frac{\mathbb E^{(n)}\big[\sum_{t=1}^T R^{(n)}_t(1)X_t(1)\big]}{\text{opt(FA}(n))} \qquad (69)$$
for any $p_n\ge0$ with $\sum_{n=-\nu}^\nu p_n=1$, where the expectation $\mathbb E^{(n)}$ is over the randomness of $X_t$ in instance $n$. The instances are crafted such that during piece $j\in\{-\nu,\dots,\nu\}$ it is impossible to distinguish among instances $-\nu,\dots,-j$, meaning that the quantities $B^{(n)}_j = \mathbb E^{(n)}[\sum_{t\in\text{piece }j}X_t(1)]$ for $n\in\{-\nu,\dots,-j\}$ are all identical and equal to a common value $B_j$. Thus, for instance $n\in\{-\nu,\dots,\nu\}$ we have
$$\mathbb E^{(n)}\Big[\sum_{t=1}^T R^{(n)}_t(1)X_t(1)\Big] \ \le\ B\epsilon + \sum_{j=-\nu}^{-n}B_j\,\alpha^{j-\nu} = B\alpha^{-3\nu} + \sum_{j=-\nu}^{-n}B_j\,\alpha^{j-\nu}.$$
Consequently,
$$(69) \ \le\ \sum_{n=-\nu}^{\nu}p_n\,\frac{B\alpha^{-3\nu}+\sum_{j=-\nu}^{-n}B_j\,\alpha^{j-\nu}}{B\alpha^{-\nu-n}} \ \le\ \frac{1}{\alpha^{\nu}(1-1/\alpha)} + \frac1B\sum_{j=-\nu}^{\nu}B_j\sum_{n=-\nu}^{-j}p_n\,\alpha^{j+n}.$$
By defining
$$p_{-\nu} = \frac{1}{(2\nu+1)(1-1/\alpha)+1/\alpha},\qquad p_n = \Big(1-\frac1\alpha\Big)p_{-\nu}\ \text{ for } n=-\nu+1,\dots,\nu,$$
we have for every $j=-\nu,\dots,\nu$,
$$\sum_{n=-\nu}^{-j}p_n\,\alpha^{j+n} \ \le\ \frac{1}{(2\nu+1)(1-1/\alpha)+1/\alpha},$$
leading to
$$(69) \ \le\ \frac{1}{\alpha^{\nu}(1-1/\alpha)} + \frac{1}{(2\nu+1)(1-1/\alpha)+1/\alpha} \qquad (70)$$
by the inventory constraint $\sum_{j=-\nu}^{\nu}B_j\le B$. Since $\nu = \log_\alpha(\eta_{\max}/\eta_{\min})/3$, the Theorem is proved.
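The same telescoping underlies (70). Below is a short numerical check of the C.6 prior (our own illustration, not part of the proof; it assumes the normalizer $(2\nu+1)(1-1/\alpha)+1/\alpha$ reconstructed above):

```python
import math

# Prior from C.6: p_{-nu} = 1 / ((2 nu + 1)(1 - 1/alpha) + 1/alpha),
# p_n = (1 - 1/alpha) p_{-nu} for n > -nu.
alpha, nu = 2.0, 5
p_lo = 1.0 / ((2 * nu + 1) * (1 - 1 / alpha) + 1 / alpha)
p = {n: (p_lo if n == -nu else (1 - 1 / alpha) * p_lo) for n in range(-nu, nu + 1)}
assert abs(sum(p.values()) - 1.0) < 1e-12          # valid distribution

for j in range(-nu, nu + 1):
    s = sum(p[n] * alpha ** (j + n) for n in range(-nu, -j + 1))
    assert s <= p_lo + 1e-12   # each piece j contributes at most p_{-nu} to (69)
print(f"per-piece weight <= p_(-nu) = {p_lo:.4f}, so the CR bound is O(1/nu)")
print(f"first term of (70): {1 / (alpha ** nu * (1 - 1 / alpha)):.4f}")
```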
NeurIPS Paper Checklist

1. Claims

Question: Do the main claims made in the abstract and introduction accurately reflect the paper's contributions and scope?

Answer: [Yes]

Justification: We have clearly stated the contributions made in the paper, as well as the important assumptions and limitations, in our abstract and introduction.

Guidelines:
- The answer NA means that the abstract and introduction do not include the claims made in the paper.
- The abstract and/or introduction should clearly state the claims made, including the contributions made in the paper and important assumptions and limitations. A No or NA answer to this question will not be perceived well by the reviewers.
- The claims made should match theoretical and experimental results, and reflect how much the results can be expected to generalize to other settings.
- It is fine to include aspirational goals as motivation as long as it is clear that these goals are not attained by the paper.

2. Limitations

Question: Does the paper discuss the limitations of the work performed by the authors?

Answer: [Yes]

Justification: We have discussed the important assumption and limitation of our work, namely $\eta_{\min} > 0$, in Section 2.2 "Assumption, limitation and discussion". When we establish theorems regarding the performance guarantees of our algorithms, we also state clearly the requirements for them to hold. We are honest about the limitations of our algorithm.

Guidelines:
- The answer NA means that the paper has no limitation, while the answer No means that the paper has limitations but they are not discussed in the paper.
- The authors are encouraged to create a separate "Limitations" section in their paper.
- The paper should point out any strong assumptions and how robust the results are to violations of these assumptions (e.g., independence assumptions, noiseless settings, model well-specification, asymptotic approximations only holding locally). The authors should reflect on how these assumptions might be violated in practice and what the implications would be.
- The authors should reflect on the scope of the claims made, e.g., if the approach was only tested on a few datasets or with a few runs. In general, empirical results often depend on implicit assumptions, which should be articulated.
- The authors should reflect on the factors that influence the performance of the approach. For example, a facial recognition algorithm may perform poorly when image resolution is low or images are taken in low lighting. Or a speech-to-text system might not be used reliably to provide closed captions for online lectures because it fails to handle technical jargon.
- The authors should discuss the computational efficiency of the proposed algorithms and how they scale with dataset size.
- If applicable, the authors should discuss possible limitations of their approach to address problems of privacy and fairness.
- While the authors might fear that complete honesty about limitations might be used by reviewers as grounds for rejection, a worse outcome might be that reviewers discover limitations that aren't acknowledged in the paper. The authors should use their best judgment and recognize that individual actions in favor of transparency play an important role in developing norms that preserve the integrity of the community. Reviewers will be specifically instructed to not penalize honesty concerning limitations.

3. Theory Assumptions and Proofs

Question: For each theoretical result, does the paper provide the full set of assumptions and a complete (and correct) proof?
Answer: [Yes]

Justification: We promise that our paper provides the full set of assumptions and complete (and correct) proofs. Due to the page limit, our proofs are mostly in the appendix. In the main paper, we provide high-level ideas and sketch proofs of our claims, while pointing to the locations of the formal proofs in the appendix.

Guidelines:
- The answer NA means that the paper does not include theoretical results.
- All the theorems, formulas, and proofs in the paper should be numbered and cross-referenced.
- All assumptions should be clearly stated or referenced in the statement of any theorems.
- The proofs can either appear in the main paper or the supplemental material, but if they appear in the supplemental material, the authors are encouraged to provide a short proof sketch to provide intuition.
- Inversely, any informal proof provided in the core of the paper should be complemented by formal proofs provided in appendix or supplemental material.
- Theorems and Lemmas that the proof relies upon should be properly referenced.

4. Experimental Result Reproducibility

Question: Does the paper fully disclose all the information needed to reproduce the main experimental results of the paper to the extent that it affects the main claims and/or conclusions of the paper (regardless of whether the code and data are provided or not)?

Answer: [Yes]

Justification: We first make it clear that ours is a theoretical paper; we only run simple experiments for sanity checks and for drawing insights. We are confident that our paper fully discloses all the information needed to reproduce the main experimental results to the extent that they affect the main claims and/or conclusions of the paper.

Guidelines:
- The answer NA means that the paper does not include experiments.
- If the paper includes experiments, a No answer to this question will not be perceived well by the reviewers: making the paper reproducible is important, regardless of whether the code and data are provided or not.
- If the contribution is a dataset and/or model, the authors should describe the steps taken to make their results reproducible or verifiable.
- Depending on the contribution, reproducibility can be accomplished in various ways. For example, if the contribution is a novel architecture, describing the architecture fully might suffice, or if the contribution is a specific model and empirical evaluation, it may be necessary to either make it possible for others to replicate the model with the same dataset, or provide access to the model. In general, releasing code and data is often one good way to accomplish this, but reproducibility can also be provided via detailed instructions for how to replicate the results, access to a hosted model (e.g., in the case of a large language model), releasing of a model checkpoint, or other means that are appropriate to the research performed.
- While NeurIPS does not require releasing code, the conference does require all submissions to provide some reasonable avenue for reproducibility, which may depend on the nature of the contribution. For example: (a) If the contribution is primarily a new algorithm, the paper should make it clear how to reproduce that algorithm. (b) If the contribution is primarily a new model architecture, the paper should describe the architecture clearly and fully.
(c) If the contribution is a new model (e.g., a large language model), then there should either be a way to access this model for reproducing the results or a way to reproduce the model (e.g., with an open-source dataset or instructions for how to construct the dataset). (d) We recognize that reproducibility may be tricky in some cases, in which case authors are welcome to describe the particular way they provide for reproducibility. In the case of closed-source models, it may be that access to the model is limited in some way (e.g., to registered users), but it should be possible for other researchers to have some path to reproducing or verifying the results.

5. Open access to data and code

Question: Does the paper provide open access to the data and code, with sufficient instructions to faithfully reproduce the main experimental results, as described in supplemental material?

Answer: [Yes]

Justification: Our data are numerically generated, and our code can be provided.

Guidelines:
- The answer NA means that the paper does not include experiments requiring code.
- Please see the NeurIPS code and data submission guidelines (https://nips.cc/public/guides/CodeSubmissionPolicy) for more details.
- While we encourage the release of code and data, we understand that this might not be possible, so "No" is an acceptable answer. Papers cannot be rejected simply for not including code, unless this is central to the contribution (e.g., for a new open-source benchmark).
- The instructions should contain the exact command and environment needed to run to reproduce the results. See the NeurIPS code and data submission guidelines (https://nips.cc/public/guides/CodeSubmissionPolicy) for more details.
- The authors should provide instructions on data access and preparation, including how to access the raw data, preprocessed data, intermediate data, and generated data, etc.
- The authors should provide scripts to reproduce all experimental results for the new proposed method and baselines. If only a subset of experiments are reproducible, they should state which ones are omitted from the script and why.
- At submission time, to preserve anonymity, the authors should release anonymized versions (if applicable).
- Providing as much information as possible in supplemental material (appended to the paper) is recommended, but including URLs to data and code is permitted.

6. Experimental Setting/Details

Question: Does the paper specify all the training and test details (e.g., data splits, hyperparameters, how they were chosen, type of optimizer, etc.) necessary to understand the results?

Answer: [Yes]

Justification: Our experiments do not involve training data, but we have specified the test details and settings.

Guidelines:
- The answer NA means that the paper does not include experiments.
- The experimental setting should be presented in the core of the paper to a level of detail that is necessary to appreciate the results and make sense of them.
- The full details can be provided either with the code, in appendix, or as supplemental material.

7. Experiment Statistical Significance

Question: Does the paper report error bars suitably and correctly defined or other appropriate information about the statistical significance of the experiments?

Answer: [Yes]

Justification: Though the experiments are not as important as the theory in our paper, we run each experiment repeatedly and plot the variation across all tests.

Guidelines:
- The answer NA means that the paper does not include experiments.
- The authors should answer "Yes" if the results are accompanied by error bars, confidence intervals, or statistical significance tests, at least for the experiments that support the main claims of the paper.
- The factors of variability that the error bars are capturing should be clearly stated (for example, train/test split, initialization, random drawing of some parameter, or overall run with given experimental conditions).
- The method for calculating the error bars should be explained (closed form formula, call to a library function, bootstrap, etc.).
- The assumptions made should be given (e.g., Normally distributed errors).
- It should be clear whether the error bar is the standard deviation or the standard error of the mean.
- It is OK to report 1-sigma error bars, but one should state it. The authors should preferably report a 2-sigma error bar rather than state that they have a 96% CI, if the hypothesis of Normality of errors is not verified.
- For asymmetric distributions, the authors should be careful not to show in tables or figures symmetric error bars that would yield results that are out of range (e.g., negative error rates).
- If error bars are reported in tables or plots, the authors should explain in the text how they were calculated and reference the corresponding figures or tables in the text.

8. Experiments Compute Resources

Question: For each experiment, does the paper provide sufficient information on the computer resources (type of compute workers, memory, time of execution) needed to reproduce the experiments?

Answer: [Yes]

Justification: Our experiments are very simple and can be run within a few minutes on any laptop.

Guidelines:
- The answer NA means that the paper does not include experiments.
- The paper should indicate the type of compute workers (CPU or GPU, internal cluster, or cloud provider), including relevant memory and storage.
- The paper should provide the amount of compute required for each of the individual experimental runs as well as estimate the total compute.
- The paper should disclose whether the full research project required more compute than the experiments reported in the paper (e.g., preliminary or failed experiments that didn't make it into the paper).

9. Code of Ethics

Question: Does the research conducted in the paper conform, in every respect, with the NeurIPS Code of Ethics (https://neurips.cc/public/EthicsGuidelines)?

Answer: [Yes]

Justification: The research conducted in the paper conforms, in every respect, with the NeurIPS Code of Ethics.

Guidelines:
- The answer NA means that the authors have not reviewed the NeurIPS Code of Ethics.
- If the authors answer No, they should explain the special circumstances that require a deviation from the Code of Ethics.
- The authors should make sure to preserve anonymity (e.g., if there is a special consideration due to laws or regulations in their jurisdiction).

10. Broader Impacts

Question: Does the paper discuss both potential positive societal impacts and negative societal impacts of the work performed?

Answer: [NA]

Justification: This is a theoretical paper; there is no direct societal impact of the work performed.

Guidelines:
- The answer NA means that there is no societal impact of the work performed.
- If the authors answer NA or No, they should explain why their work has no societal impact or why the paper does not address societal impact.
- Examples of negative societal impacts include potential malicious or unintended uses (e.g., disinformation, generating fake profiles, surveillance), fairness considerations (e.g., deployment of technologies that could make decisions that unfairly impact specific groups), privacy considerations, and security considerations.
- The conference expects that many papers will be foundational research and not tied to particular applications, let alone deployments. However, if there is a direct path to any negative applications, the authors should point it out. For example, it is legitimate to point out that an improvement in the quality of generative models could be used to generate deepfakes for disinformation. On the other hand, it is not needed to point out that a generic algorithm for optimizing neural networks could enable people to train models that generate Deepfakes faster.
- The authors should consider possible harms that could arise when the technology is being used as intended and functioning correctly, harms that could arise when the technology is being used as intended but gives incorrect results, and harms following from (intentional or unintentional) misuse of the technology.
- If there are negative societal impacts, the authors could also discuss possible mitigation strategies (e.g., gated release of models, providing defenses in addition to attacks, mechanisms for monitoring misuse, mechanisms to monitor how a system learns from feedback over time, improving the efficiency and accessibility of ML).

11. Safeguards

Question: Does the paper describe safeguards that have been put in place for responsible release of data or models that have a high risk for misuse (e.g., pretrained language models, image generators, or scraped datasets)?

Answer: [NA]

Justification: The paper poses no such risks.

Guidelines:
- The answer NA means that the paper poses no such risks.
- Released models that have a high risk for misuse or dual-use should be released with necessary safeguards to allow for controlled use of the model, for example by requiring that users adhere to usage guidelines or restrictions to access the model or implementing safety filters.
- Datasets that have been scraped from the Internet could pose safety risks. The authors should describe how they avoided releasing unsafe images.
- We recognize that providing effective safeguards is challenging, and many papers do not require this, but we encourage authors to take this into account and make a best faith effort.

12. Licenses for existing assets

Question: Are the creators or original owners of assets (e.g., code, data, models), used in the paper, properly credited and are the license and terms of use explicitly mentioned and properly respected?

Answer: [NA]

Justification: The paper does not use existing assets.

Guidelines:
- The answer NA means that the paper does not use existing assets.
- The authors should cite the original paper that produced the code package or dataset.
- The authors should state which version of the asset is used and, if possible, include a URL.
- The name of the license (e.g., CC-BY 4.0) should be included for each asset.
- For scraped data from a particular source (e.g., website), the copyright and terms of service of that source should be provided.
- If assets are released, the license, copyright information, and terms of use in the package should be provided. For popular datasets, paperswithcode.com/datasets has curated licenses for some datasets. Their licensing guide can help determine the license of a dataset.
- For existing datasets that are re-packaged, both the original license and the license of the derived asset (if it has changed) should be provided.
- If this information is not available online, the authors are encouraged to reach out to the asset's creators.

13. New Assets

Question: Are new assets introduced in the paper well documented and is the documentation provided alongside the assets?

Answer: [NA]

Justification: The paper does not release new assets.

Guidelines:
- The answer NA means that the paper does not release new assets.
- Researchers should communicate the details of the dataset/code/model as part of their submissions via structured templates. This includes details about training, license, limitations, etc.
- The paper should discuss whether and how consent was obtained from people whose asset is used.
- At submission time, remember to anonymize your assets (if applicable). You can either create an anonymized URL or include an anonymized zip file.

14. Crowdsourcing and Research with Human Subjects

Question: For crowdsourcing experiments and research with human subjects, does the paper include the full text of instructions given to participants and screenshots, if applicable, as well as details about compensation (if any)?

Answer: [NA]

Justification: The paper does not involve crowdsourcing nor research with human subjects.

Guidelines:
- The answer NA means that the paper does not involve crowdsourcing nor research with human subjects.
- Including this information in the supplemental material is fine, but if the main contribution of the paper involves human subjects, then as much detail as possible should be included in the main paper.
- According to the NeurIPS Code of Ethics, workers involved in data collection, curation, or other labor should be paid at least the minimum wage in the country of the data collector.

15. Institutional Review Board (IRB) Approvals or Equivalent for Research with Human Subjects

Question: Does the paper describe potential risks incurred by study participants, whether such risks were disclosed to the subjects, and whether Institutional Review Board (IRB) approvals (or an equivalent approval/review based on the requirements of your country or institution) were obtained?

Answer: [NA]

Justification: The paper does not involve human subjects.

Guidelines:
- The answer NA means that the paper does not involve crowdsourcing nor research with human subjects.
- Depending on the country in which research is conducted, IRB approval (or equivalent) may be required for any human subjects research. If you obtained IRB approval, you should clearly state this in the paper.
- We recognize that the procedures for this may vary significantly between institutions and locations, and we expect authors to adhere to the NeurIPS Code of Ethics and the guidelines for their institution.
- For initial submissions, do not include any information that would break anonymity (if applicable), such as the institution conducting the review.