# Online Resource Allocation with Non-Stationary Customers

Xiaoyue Zhang *1  Hanzhang Qin *1  Mabel C. Chou 1

Abstract

We propose a novel algorithm for online resource allocation with non-stationary customer arrivals and unknown click-through rates. We assume multiple types of customers arrive in a non-stationary stochastic fashion, with unknown arrival rates in each period. Additionally, customers' click-through rates are assumed to be unknown and only learnable online. By leveraging results from stochastic contextual bandits with knapsacks and online matching with adversarial arrivals, we develop an online scheme to allocate the resources to non-stationary customers. We prove that under mild conditions, our scheme achieves a best-of-both-worlds result: the scheme has sublinear regret when the customer arrivals are near-stationary, and enjoys an optimal competitive ratio under general (non-stationary) customer arrival distributions. Finally, we conduct extensive numerical experiments to show our approach generates near-optimal revenues for all different customer scenarios.

1. Introduction

The realm of online resource allocation, critical in fields ranging from online advertising to traffic management, poses a substantial challenge: how to effectively and dynamically distribute limited resources in response to ever-evolving consumer behaviors. This task is particularly arduous in environments where consumer preferences fluctuate rapidly, rendering traditional static allocation models ineffective. Addressing this challenge, we introduce the Unified Learning-while-Earning (ULwE) algorithm, a novel approach embedded within the Contextual Bandits with Knapsacks (CBwK) framework. The ULwE algorithm stands out for its real-time adaptability to changing consumer preferences and uncertain click-through rates, a notable departure from traditional methods that often fail to capture the non-stationary nature of customer arrivals.

*Equal contribution. Author ordering determined by coin flip. 1 Institute of Operations Research and Analytics, National University of Singapore, Singapore 117602. Correspondence to: Hanzhang Qin, Mabel C. Chou. Proceedings of the 41st International Conference on Machine Learning, Vienna, Austria. PMLR 235, 2024. Copyright 2024 by the author(s).

This paper is structured as follows: Section 1.1 summarizes our main contributions. Section 1.2 reviews relevant literature, setting the stage for our innovation. Section 2 introduces our models for online resource allocation, adapted for non-stationary environments. Section 3 discusses the ULwE algorithm in detail, presenting our unified theoretical framework. Finally, Section 4 validates our algorithm through empirical studies on simulated data and Section 5 concludes the paper.

1.1. Basic Setup and Contributions

1.1.1. BASIC SETUP

Our online resource allocation problem encompasses $n$ resources, each defined by a revenue parameter $r_i$, a capacity $c_i$, and a latent variable $\theta_i$ within a predefined space $\Theta$. The problem operates over $T$ time periods, with each period $t \in [T]$ witnessing the arrival of a customer characterized by a feature vector $x_t \in \mathbb{R}^d$. The likelihood of a customer purchasing from resource $i$ is modeled by $f_i(x_t, \theta_i^*)$, where $\theta_i^*$ is the true value of $\theta_i$, and $f_i(x_t, \theta_i) \le 1$ ensures model validity. We assume the $\theta_i^*$'s are unknown a priori, and we use $|\Theta|$ to denote the cardinality of the parameter space $\Theta$.
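To make the purchase-probability model concrete, the short sketch below instantiates $f_i(x, \theta)$ as a logistic model (one member of the parametric family discussed later in Section 2.2) and draws a finite candidate set $\Theta$ from a prior. The dimension, prior, and model form are illustrative assumptions of this sketch, not the paper's fixed specification.

```python
import numpy as np

rng = np.random.default_rng(0)

def f(x: np.ndarray, theta: np.ndarray) -> float:
    """Illustrative purchase/click probability f_i(x, theta): a logistic model,
    which automatically satisfies f(x, theta) <= 1 as required in Section 1.1.1."""
    return 1.0 / (1.0 + np.exp(-float(x @ theta)))

d = 3                                                # feature dimension (assumed)
Theta = [rng.normal(size=d) for _ in range(20)]      # finite candidate set sampled from a prior
theta_star = Theta[7]                                # unknown true parameter, hidden from the learner

x_t = rng.normal(size=d)                             # feature vector of the arriving customer
print("true purchase probability:", round(f(x_t, theta_star), 3))
```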
The context vector $x_t$ for each time period $t$ is independently drawn from an unknown distribution $\mathbb{P}(x_t = x^{(l)}) = \mu_l^t$ for each customer type $l \in \{1, 2, \ldots, L\}$. Given the variability of this distribution across different time periods, the customer arrival process is inherently non-stationary. For every customer type represented by $x^{(l)}$, the expected number of arrivals, denoted as $\lambda_l$, is computed over the time horizon as $\lambda_l = \sum_{t=1}^{T} \mu_l^t$. We assume all $\mu_l^t$'s are unknown but the $\lambda_l$'s are given at the beginning. In practice, $\lambda_l$ can be derived from historical data using machine learning techniques and can usually be estimated with high accuracy. For instance, a regression algorithm employed on traffic-prediction data for online advertising achieved a total prediction error rate of approximately 0.9% (Lai et al., 2016). This information can also be regarded as advice; Lyu & Cheung (2023) showed that without any advice, the non-stationary BwK problem (CBwK with identical contexts) admits a worst-case regret linear in $T$. We discuss the scenario where the knowledge of $\lambda_l$ is only approximately correct in Appendix E.

1.1.2. MAIN CONTRIBUTIONS

Our primary contribution lies in the development of the ULwE algorithm, a novel approach within the CBwK framework under non-stationary arrivals. This algorithm uniquely offers simultaneous guarantees of sublinear regret and a constant competitive ratio (CR), effectively addressing the complexities of unknown click-through rates and non-stationary customer arrivals in dynamic environments.

A Unified Learning-while-Earning Algorithm Framework with Sublinear Regret and Constant CR. We propose a unified study of two essential performance guarantees in online resource allocation: regret and CR. Achieving sublinear regret is crucial in online advertising and resource allocation systems, as it signifies that the regret incurred by the algorithm grows at a slower rate than the time horizon. Additionally, the constant CR ensures that our algorithm achieves a consistently high performance level in comparison to an optimal offline policy. While each guarantee has been explored separately, our main contribution lies in investigating them together within a unified algorithm framework and achieving the best of both worlds.

Figure 1. Problem categories

Figure 1 illustrates a graphical representation of three ellipsoids to visually depict this concept. The smallest ellipsoid represents the problem category of i.i.d. arrivals. The intermediate ellipsoid represents problems under the near-i.i.d. arrivals scenario, in which our algorithm achieves an expected regret bound of $O(\sqrt{n|\Theta|T})$. The largest ellipsoid encompasses all types of non-stationary customer arrivals, including the most challenging scenarios. In this case, our algorithm provides a unified guarantee, which offers sublinear regret with a constant CR performance, allowing for effective resource allocation even in highly dynamic environments.

Overall, under non-stationary arrivals, our ULwE algorithm recovers $O(\sqrt{n|\Theta|T})$ expected regret under near-i.i.d. arrivals (the exact condition is specified in Section 3) and in general guarantees
$$\mathrm{OPT} \;\le\; \Big(1 + \frac{(1 + \min_{i \in [n]} c_i)\big(1 - e^{-1/\min_{i \in [n]} c_i}\big)}{1 - 1/e}\Big)\,\mathbb{E}[\mathrm{ALG}] + O\big(\sqrt{n|\Theta|T}\big),$$
where OPT denotes the expected revenue achieved by the optimal online algorithm that knows the true latent variable, and ALG denotes the revenue achieved by our proposed online learning algorithm (for which the latent variable is unknown a priori). When $\min_{i \in [n]} c_i$ is sufficiently large, it can be shown that the above guarantee can be rewritten as $\mathrm{OPT} \le (1 + 1/(1 - 1/e))\,\mathbb{E}[\mathrm{ALG}] + O(\sqrt{n|\Theta|T})$, or equivalently, $\mathbb{E}[\mathrm{ALG}] \ge (1 - 1/(2 - 1/e))\,\mathrm{OPT} - O(\sqrt{n|\Theta|T})$. This guarantee is the tightest possible given that the underlying contextual arrival process is adversarial and the click-through rates are unknown (see Theorem 4, Cheung et al. 2022).
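The large-capacity form of this guarantee can be sanity-checked numerically. The snippet below evaluates the multiplicative factor in front of $\mathbb{E}[\mathrm{ALG}]$ as reconstructed above; the function name and the capacity values are ours, chosen only for illustration.

```python
import math

def guarantee_factor(c_min: float) -> float:
    """Factor multiplying E[ALG] in the general guarantee (reconstructed form):
    1 + (1 + c_min)(1 - e^{-1/c_min}) / (1 - 1/e)."""
    return 1.0 + (1.0 + c_min) * (1.0 - math.exp(-1.0 / c_min)) / (1.0 - 1.0 / math.e)

for c in [1, 10, 100, 10_000]:
    print(c, round(guarantee_factor(c), 4))
# As c_min grows, the factor approaches 1 + 1/(1 - 1/e) ≈ 2.582, i.e.
# E[ALG] >= (1 - 1/(2 - 1/e)) * OPT ≈ 0.387 * OPT, up to the O(sqrt(n|Θ|T)) term.
```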
Non-Stationary Customer Arrivals. Our approach to non-stationary customer arrivals encompasses two distinct scenarios: arrivals with stochastic distributions and adversarial arrivals. For stochastic distributions, we employ past data to create empirical distributions, aiding in future predictions. This is achieved using the Upper Confidence Bound (UCB) algorithm and a Deterministic Linear Programming (DLP) formulation, optimizing resource allocation by maximizing expected revenue within the confidence bounds of the demand rates. In contrast, for adversarial or unknown distribution scenarios, we adopt a greedy algorithm, as suggested by Cheung et al. (2022), which factors in discounted revenue and inventory constraints. This method ensures efficient allocation by considering both revenue maximization and resource limitations. Our ULwE algorithm amalgamates these strategies, seamlessly transitioning between them based on the nature of customer arrivals. This integration, alongside built-in condition checks, endows our algorithm with the versatility to operate effectively across varying arrival patterns without prior knowledge of their sequence types. Such adaptability makes it well-suited for dynamic real-world contexts, addressing the challenges posed by diverse customer arrival processes.

Unknown Click-Through Rates. Our algorithm addresses the complexities introduced by unknown click-through rates in online advertising and resource allocation systems. It dynamically estimates these rates by leveraging historical data alongside current information, ensuring adaptive and informed decision-making. Central to our approach is the use of a parametric model with a pre-defined prior distribution for the parameter $\theta \in \Theta$, allowing us to capture the nuanced relationships between different customer types and their purchasing probabilities. Initially, our algorithm samples a subset of $\theta$ values from existing data to construct the set $\Theta$, which is then continuously refined as new data becomes available. This process enables our algorithm to evolve with and respond to the changing click-through rate landscape, ensuring resource allocation decisions are both current and data-informed. This strategy of integrating historical insights with real-time updates provides a well-rounded perspective on customer behavior, crucial for optimizing resource allocation in the dynamic field of online advertising.

1.2. Literature Review

Contextual Bandit with Knapsack under Non-Stationary Arrivals. CBwK problems in non-stationary environments, integral to our research, have been a focal point in recent studies. Chen et al. (2013) were pioneers in CBwK within online advertising, proposing a novel greedy algorithm focusing on the reward-to-cost ratio. Badanidiyuru et al. (2014) introduced a probabilistic method, selecting optimal strategies from policy sets. Agrawal et al. (2015) followed with a computationally efficient variant. Complementing these, Agrawal & Devanur (2015) developed LinCBwK, which utilizes optimistic estimates from confidence ellipsoids to adjust rewards.
Recently, Slivkins & Foster (2022) integrated LagrangeBwK (Immorlica et al., 2019) with SquareCB, merging computational efficiency with statistical optimality. In tackling non-stationary problems, Wei & Luo (2018) proposed an algorithm that achieves $\sqrt{T}$ regret in the adversarial setting and $\log(T)$ regret in the stochastic setting. This result was further sharpened to optimality by Zimmert & Seldin (2019; 2021). Moreover, Jin et al. (2023) achieved similar results without relying on certain assumptions. Luo et al. (2023) investigated high-probability regret bounds in adversarial settings and applied them to contextual bandits. The adversarial version of BwK was explored by Immorlica et al. (2019). Unlike scenarios with diminishing regret, these algorithms are bound to achieve approximation ratios (Kesselheim & Singla, 2020; Fikioris & Tardos, 2023), even in environments with a single switch (Immorlica et al., 2022). The understanding of the approximation-ratio aspect is now well-established (Castiglioni et al., 2022; Sivakumar et al., 2022; Liu et al., 2022).

Online Resource Allocation. Our work intersects significantly with online resource allocation (also known as the AdWords problem; see Mehta et al. (2007)), particularly in the assignment of resources to real-time, random demands. The CR serves as a key performance metric in this domain. Karp et al. (1990) investigated one-sided bipartite arrivals and revealed the optimality of a simple 1/2-competitive greedy algorithm among deterministic approaches. For the AdWords problem, Mehta et al. (2005) modeled it as a linear program (LP), leading to an online algorithm with a $1 - 1/e$ CR based on primal-dual LP analysis. Fahrbach et al. (2022) further broke new ground with an online correlated selection subroutine, reaching a CR of at least 0.5086. The works most aligned with ours are those by Ferreira et al. (2018); Zhalechian et al. (2022); Cheung et al. (2022). Ferreira et al. (2018) proposed dynamic algorithms that highlight versatility in handling resource-constrained bandit problems. Zhalechian et al. (2022) addressed personalized resource allocation with online learning via a Bayesian regret analysis, well-suited to adversarial customer contexts. Cheung et al. (2022) focused on the allocation of limited resources over time to heterogeneous customers, a methodology that has significantly influenced our approach to CR analysis.

2. Model

Our objective is to maximize the total reward obtained from customers' clicks or purchases while respecting budget constraints for the available resources or advertisers. We consider a system with $n$ resources, where resource $i \in [n]$ has an initial budget $c_i$. Customers arrive at the system from period 1 to $T$. When a customer arrives, we observe the feature vector $x \in \mathbb{R}^d$ of the customer, where $d$ is the length of the feature vector. At this point, the system must make a decision: either reject the customer irrevocably or assign a resource to the customer immediately. If resource $j$ is assigned, the customer's likelihood of purchasing the resource is determined by the probability function $f_j(x)$ associated with resource $j$ and the customer's feature vector $x$. If the customer makes a purchase, the system earns a reward $r_j$, which is deducted from the budget of the assigned resource. If the customer does not make a purchase, no reward is earned. To address this complex problem, our model incorporates two key elements: non-stationary customer arrivals and unknown click-through rates.
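The decision protocol just described can be condensed into a few lines of simulation. The sketch below plays a single period: observe a feature vector, assign a resource (or reject), draw a Bernoulli purchase, and collect the reward while consuming capacity. The environment class, the logistic purchase model, and the one-unit capacity consumption per purchase are assumptions made for this sketch, not the paper's simulator.

```python
import numpy as np

rng = np.random.default_rng(1)

class OneStepEnv:
    """Minimal sketch of a single allocation period (not the paper's exact model)."""
    def __init__(self, r, c, theta_star):
        self.r = np.asarray(r, dtype=float)        # revenues r_i
        self.c = np.asarray(c, dtype=float)        # remaining budgets/capacities c_i
        self.theta_star = theta_star               # true latent parameters, one per resource

    def step(self, x, i):
        """Assign resource i (or None to reject) to a customer with features x."""
        if i is None or self.c[i] <= 0:
            return 0.0                             # rejected, or resource exhausted
        p = 1.0 / (1.0 + np.exp(-float(x @ self.theta_star[i])))  # purchase prob. f_i(x, theta*_i)
        if rng.random() < p:                       # customer purchases
            self.c[i] -= 1.0                       # consume one unit of capacity (assumption)
            return self.r[i]                       # earn revenue r_i
        return 0.0                                 # no purchase, no reward

env = OneStepEnv(r=[1.0, 1.5], c=[50, 50], theta_star=[rng.normal(size=3) for _ in range(2)])
x = rng.normal(size=3)
print("reward this period:", env.step(x, i=1))
```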
We recognize that customer behavior may change over time, requiring the system to adapt to these fluctuations. Additionally, the exact click-through rates, which represent the likelihood of customers clicking on a resource or making a purchase, are initially unknown. However, we utilize extensive data to estimate and update these rates in real time.

2.1. Handling Non-Stationary Customer Arrivals

In real-world systems, customer arrivals often exhibit non-stationary behavior, reflecting temporal variations in customer types. For example, students accessing a system are likely to show different patterns during mornings, evenings, and class hours. We categorize non-stationary customer arrivals into two types for our model:

- Stochastic Arrivals with Non-Stationary Distributions: These are regular scenarios without disruptive events. Past data can be used to learn distributions for future arrivals.
- Adversarial Arrivals with Agnostic Distributions: Such scenarios occur during unpredictable events (e.g., Black Friday), where arrivals are assumed to be controlled by an adversary.

We assume no prior knowledge of arrival patterns except for the total arrival rate $\lambda_j$ for each customer type $j \in [L]$. This approach caters to both i.i.d. and adversarial sequences, overcoming the limitations of prior models, which often cannot predict the nature of future arrivals.

2.2. Modeling Unknown Click-Through Rates

Addressing unknown click-through rates (CTR) is pivotal, particularly in contexts with ample historical data. While historical data provides insights into individual preferences, actual CTRs are influenced by recent factors like ad quality and competitor promotions. Our model adopts a parametric approach, beginning with a known prior distribution for $\theta \in \Theta$. This setup allows for a flexible, parametric formulation of the purchase probability function $f(x_t, \theta)$, where $x_t \in \mathbb{R}^d$ is the feature vector of the customer arriving at time $t \in [T]$. An example model choice is the polynomial logistic regression model (Richardson et al., 2007), with $\theta$ representing the polynomial term parameters. To utilize historical data, our algorithm samples a subset of $\theta$ values initially, forming the set $\Theta$. This set is dynamically updated over time, enabling the algorithm to adapt to changing conditions and continually refine its decision-making process. In practice, $\Theta$ can include any type of parameter from the parameter set associated with a particular machine learning model (e.g., a Large Language Model) for CTR prediction.

2.3. An Upper Bound on the Optimal Revenue

In this section, our primary objective is to establish a theoretical upper bound on the optimal revenue in our model, denoted as OPT. This upper bound provides a crucial benchmark for assessing the performance of various algorithms and understanding their efficiency and effectiveness in different resource allocation scenarios. To establish an upper bound, we introduce a deterministic linear programming model $J^D$, which aims to maximize the deterministic revenue given the capacities and customer arrival patterns:

$$J^D(c, t) = \max_{s^{\tau}_{ij},\; i \in [n],\, j \in [L]} \; \sum_{i \in [n]} \sum_{j \in [L]} \sum_{\tau=1}^{t} r_i\, s^{\tau}_{ij}\, f_i\big(x^{(j)}, \theta^*\big)$$
$$\text{s.t.} \quad \sum_{j \in [L]} \sum_{\tau=1}^{t} \mu^{\tau}_j\, s^{\tau}_{ij}\, f_i\big(x^{(j)}, \theta^*\big) \le c_i, \quad i \in [n],$$
$$\sum_{i \in [n]} \sum_{\tau=1}^{t} s^{\tau}_{ij} = \sum_{\tau=1}^{t} \mu^{\tau}_j, \quad j \in [L],$$
$$s^{\tau}_{ij} \ge 0, \quad i \in [n],\; j \in [L]. \quad (1)$$

We establish that $\mathrm{OPT} \le J^D(c, t)$, indicating that $J^D(c, t)$ serves as an upper bound for the optimal revenue. Consequently, we define regret as the difference between the optimal revenue and the revenue achieved by an algorithm, formulated as $\mathrm{Regret} = \mathrm{OPT} - \mathrm{ALG}$. This quantifies the efficiency of different algorithms in approaching the theoretical maximum revenue.
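A minimal sketch of this deterministic LP benchmark, written in its time-aggregated form with total rates $\lambda_j$ (the form also used by the LP in Section 3.1 and $\mathrm{LP}(\theta^*)$ in Appendix A), can be solved with an off-the-shelf solver. The instance numbers below are invented for illustration.

```python
import numpy as np
from scipy.optimize import linprog

# Toy instance (all numbers are illustrative).
n, L = 2, 2
r = np.array([1.0, 1.5])            # revenues r_i
c = np.array([40.0, 60.0])          # capacities c_i
lam = np.array([60.0, 40.0])        # total arrival rates lambda_j
f = np.array([[0.9, 0.5],           # f[i, j] = purchase prob. of type j offered resource i
              [0.9, 0.5]])

# Decision variables s_{ij} flattened row-major: index = i * L + j.
obj = -(r[:, None] * lam[None, :] * f).ravel()       # maximize revenue -> minimize its negative

A_ub = np.zeros((n, n * L))                          # capacity: sum_j lam_j * s_ij * f_ij <= c_i
for i in range(n):
    A_ub[i, i * L:(i + 1) * L] = lam * f[i]
A_eq = np.zeros((L, n * L))                          # sum_i s_ij = 1 for each customer type j
for j in range(L):
    A_eq[j, j::L] = 1.0

res = linprog(obj, A_ub=A_ub, b_ub=c, A_eq=A_eq, b_eq=np.ones(L), bounds=(0, None))
print("upper bound on revenue:", round(-res.fun, 2))
print("allocation probabilities s_ij:\n", res.x.reshape(n, L).round(3))
```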
For a comprehensive understanding of the problem formulation, constraints, and detailed derivations, please refer to Appendix B, which includes the full mathematical treatment and theoretical analysis supporting these findings.

3. Algorithm and Analysis

We propose the ULwE algorithm, displayed in Algorithm 1, which is applicable in both stochastic and adversarial environments. The algorithm design involves constructing a switching strategy to address the model uncertainty on $\mu$, as discussed in Section 3.1. In Section 3.2, we provide a regret upper bound for ULwE and demonstrate that the scheme has sublinear regret when the customer arrivals are near-stationary, and enjoys an optimal competitive ratio under general (non-stationary) customer arrival distributions. In Section 3.3 we provide a proof sketch of the regret upper bound; the complete proof is in Appendix D.

3.1. The ULwE Algorithm

In this section, we explore the ULwE algorithm (Algorithm 1), which is applicable in both stochastic and adversarial environments. ULwE is an evolution of the UCB paradigm, integrating aspects of inventory balancing with a penalty function (Golrezaei et al., 2014). A distinctive feature of ULwE is its dynamic approach: the algorithm continually updates the parameter set $\Theta$ and monitors capacity constraints to determine the necessity of a policy switch.

Algorithm 1 Unified Learning-while-Earning (ULwE)
  Input: resource capacities c, time horizon T, customer types L
  Initialize Ω_i^0 = Θ_i for all i ∈ [n]
  for t = 1 to T do
    if switch = FALSE then
      I_t = ALGLP(c, T, L, Ω_i^{t-1})
      Check if conditions (5) and (6) for switching are met
      if any condition is violated then switch = TRUE end if
    else
      I_t = ALGADV(c, T, L, Ω_i^{t-1})
    end if
    Check if conditions (3) and (4) for updating Ω_i are met
    if any condition is violated then
      Remove θ̃_t from Ω_{I_t}^t
    else
      Set Ω_i^t = Ω_i^{t-1} for all i ∈ [n]
    end if
  end for

Specifically, ULwE adapts its policy to best suit customer patterns, whether they follow i.i.d. or adversarial arrival sequences. This switching strategy sets ULwE apart from the traditional UCB-based methods used in the stochastic context (Agrawal & Devanur, 2015) and the strategies applied to adversarial BwK settings (Immorlica et al., 2019). ULwE nevertheless shares similarities with previous work on stochastic and adversarial bandits with knapsacks in its arm-selection methodology, which involves LP optimization and inventory balancing, and in its fundamental updating mechanism, which is based on the UCB approach. The algorithm operates through three crucial stages: selecting an arm to obtain rewards and consumption vectors, updating the confidence set $\Theta$, and assessing whether to switch states.
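A Python skeleton of this control flow might look as follows. The sub-policy calls and the two condition checks are passed in as stubs; `alg_lp`, `alg_adv`, `switch_conditions_hold`, and `elimination_conditions_hold` are hypothetical names, and sketches of the individual pieces appear later in this section.

```python
def ulwe(c, T, L, Theta, alg_lp, alg_adv,
         switch_conditions_hold, elimination_conditions_hold):
    """Sketch of Algorithm 1 (ULwE). All callables are placeholders supplied by the caller."""
    Omega = [set(Theta_i) for Theta_i in Theta]            # confidence sets, one per resource
    switched = False
    for t in range(1, T + 1):
        if not switched:
            i_t, theta_tilde = alg_lp(c, T, L, Omega, t)    # LP-based arm selection (Algorithm 2)
            if not switch_conditions_hold(t):               # conditions (5) and (6)
                switched = True                             # hand over to the adversarial policy
        else:
            i_t, theta_tilde = alg_adv(c, T, L, Omega, t)   # inventory-balancing greedy (Algorithm 3)
        if not elimination_conditions_hold(i_t, theta_tilde, t):   # conditions (3) and (4)
            Omega[i_t].discard(theta_tilde)                 # remove the implausible parameter
        # otherwise the confidence sets are carried over unchanged
    return Omega
```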
Arm Selection. In each period $t$, our algorithm offers a resource $I_t$ from the set of $n$ resources to the customer. This process constitutes the arm selection phase, which is executed differently under two distinct protocols, ALGLP and ALGADV, catering to different customer arrival patterns. The ALGLP protocol (Algorithm 2) is designed for environments with stationary customer arrivals, where we assume that the probability of each customer type $l$ arriving at time $t$ remains constant, i.e. $\mu_l^t = \mu_l$ for all $t \in [T]$. The algorithm solves an LP at each step to maximize revenue, constrained by inventory limits and the requirement that the probabilities of offering the resources to each customer sum to one.

Algorithm 2 ALGLP Protocol
  Input: resource capacities c, time horizon T, customer types L
  Initialize Ω_i^0 = Θ_i for all i ∈ [n]
  for t = 1 to T do
    Solve the LP in Equation (2) to obtain s^t, γ^t
    Select resource i with probability s^t_{i J_t}
  end for

Algorithm 3 ALGADV Protocol
  Input: resource capacities c, time horizon T, customer types L
  Initialize Ω_i^0 = Θ_i for all i ∈ [n]
  for t = 1 to T do
    Observe the context x_t of the new arrival in period t
    Select I_t = argmax_{i ∈ [n]} r̃_i^t · f̄_i(x_t, Ω_i^{t-1})
  end for

The LP formulation is as follows:

$$U^t := \max_{s_{ij},\; i \in [n],\, j \in [L]} \; \sum_{i \in [n]} \sum_{j \in [L]} r_i\, \lambda_j\, s_{ij}\, \bar{f}_i\big(x^{(j)}, \Omega_i^{t-1}\big)$$
$$\text{s.t.} \quad \sum_{j \in [L]} \lambda_j\, s_{ij}\, \bar{f}_i\big(x^{(j)}, \Omega_i^{t-1}\big) \le c_i, \quad i \in [n],$$
$$\sum_{i \in [n]} s_{ij} = 1, \quad j \in [L],$$
$$s_{ij} \ge 0, \quad i \in [n],\; j \in [L]. \quad (2)$$

Here, the function $\bar{f}_i(x, \Omega)$ calculates the maximum purchase probability for resource $i$ given a set $\Omega$ of valid latent variables, i.e. $\bar{f}_i(x, \Omega) = \max_{w \in \Omega} f_i(x, w)$.

In contrast, the ALGADV protocol (Algorithm 3) is designed to handle the dynamic and unpredictable nature of customer behavior and preferences. In this situation, the underlying distribution of $x_t$ no longer exists. Instead, the value of $x_t$ is chosen by an adversary in each period $t$. Algorithm 3 uses a real-valued function, $\Psi(x) := \frac{e^x - 1}{e - 1}$, defined over the interval $[0, 1]$. This function is instrumental in adjusting the revenue value $r_i$ of each resource $i$, taking into account the resource's past utilization and its capacity. The adjusted revenue, denoted by $\tilde{r}_i^t$, is calculated as $\tilde{r}_i^t = r_i\big(1 - \Psi\big(N_i^{t-1}/c_i\big)\big)$, where $N_i^{t-1}$ represents the previous consumption of resource $i$. The algorithm then selects the resource $I_t$ that maximizes the upper confidence bound of the single-period revenue, taking into account the modified revenue values and the confidence intervals defined by the set of valid latent variables $\Omega_i^{t-1}$.

Update Confidence Set. In the UCB strategy, a key aspect is the continuous updating of the confidence set. In our algorithm, this process begins with the formation of a parameter set $\Theta$ encompassing a set of possible parameter values. From this set, the confidence set $\Omega_i^t$ is constructed. After selecting the resource, the algorithm identifies the maximizer $\tilde{\theta}_t$ of the purchase probability for that resource. The updating mechanism involves two critical inequalities, required to hold for all $\theta \in \Omega^{t-1}$:

$$\Big|\sum_{t' \in D^t_{I_t}(\tilde{\theta}_t)} \big(f_{I_t}(x_{t'}, \tilde{\theta}_t) - a_{t'}\big)\Big| \;\le\; \sqrt{2t \log\big(2t/\beta(n, T)\big)}, \quad (3)$$

$$\Big|\sum_{t' \in D^t_{I_t}(\tilde{\theta}_t)} \big(f_{I_t}(x_{t'}, \tilde{\theta}_t) - f_{I_t}(x_{t'}, \theta)\big)\Big| \;\le\; \sqrt{2t \log\big(2t/\beta(n, T)\big)}. \quad (4)$$

Here, $\beta(n, T)$ is set as $1/(nT)$. The set $D^t_i$ tracks the periods up to time $t$ in which resource $i$ is offered. The customer's purchase decision is represented by $a_t \in \{0, 1\}$. Condition (3) ensures that the accumulated deviation between predicted and actual purchases remains within a predefined threshold. If this threshold is exceeded, $\tilde{\theta}_t$ is removed from the confidence set, refining the purchase probability estimates. Condition (4) is crucial in maintaining the stability of the algorithm under near-i.i.d. arrival patterns. It limits the occurrence of a switch, fostering a more consistent and effective resource allocation strategy and leading to sublinear regret in these situations. At its core, this iterative refinement process sharpens the focus of the confidence set on the most promising parameter values by discarding less probable $\theta$ values. This refined approach, supported by continuously updated data, significantly enhances the algorithm's ability to make efficient and accurate resource allocation decisions.
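A sketch of this elimination step, written against conditions (3) and (4) as reconstructed above, is shown below. The bookkeeping (a per-resource history of offer periods and purchase indicators) and the helper names are assumptions of this sketch, not the paper's reference implementation.

```python
import math

def elimination_threshold(t: int, n: int, T: int) -> float:
    """Right-hand side of conditions (3)-(4): sqrt(2 t log(2 t / beta(n, T))), with beta = 1/(nT)."""
    beta = 1.0 / (n * T)
    return math.sqrt(2.0 * t * math.log(2.0 * t / beta))

def should_eliminate(history, f, theta_tilde, Omega_i, t, n, T) -> bool:
    """history: list of (x, a) pairs for the periods in which resource i was offered;
    a is the observed purchase indicator in {0, 1}; f is the purchase-probability model."""
    thr = elimination_threshold(t, n, T)
    # Condition (3): predicted-minus-realized purchases must stay below the threshold.
    dev_obs = abs(sum(f(x, theta_tilde) - a for x, a in history))
    # Condition (4): theta_tilde must stay close to every surviving parameter in Omega_i.
    dev_param = max(
        (abs(sum(f(x, theta_tilde) - f(x, th) for x, _ in history)) for th in Omega_i),
        default=0.0,
    )
    return dev_obs > thr or dev_param > thr   # violated => remove theta_tilde from Omega_i
```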
In Section 1.1.1, we employ $\theta_i^*$ as a unique latent variable for each resource in the formulation $f_i(x_t, \theta_i^*)$. To simplify the analysis in subsequent sections, we transition to a model that employs a collective latent variable affecting all resources, enhancing analytical tractability.

Switch State Checking. The ULwE algorithm checks whether it has switched in each time step. When it has not switched, the algorithm checks two conditions. The switching conditions involve two key inequalities, which are checked at each time step $t$. For all $\theta \in \Omega^{t-1}$:

$$\Big|\sum_{l=1}^{t} \sum_{i=1}^{n} r_i\,\big(s_{i J_l}(\theta) - s^l_{i J_l}\big)\, f_i\big(x^{(J_l)}, \theta\big)\Big| \;\le\; \max_{i \in [n]} r_i \,\sqrt{32\, t \log\big(4|\Theta|t/\beta(n, T)\big)}, \quad (5)$$

and for all $i \in [n]$:

$$\sum_{l=1}^{t} s^l_{i J_l}\, \bar{f}_i\big(x^{(J_l)}, \Omega^{l-1}\big) \;\le\; \frac{c_i\, t}{T} + \sqrt{2t \log\big(2t/\beta(n, T)\big)}. \quad (6)$$

Condition (5) ensures that the estimated upper bound on the regret does not grow beyond a sublinear rate. This condition measures the algorithm's performance in terms of regret, which quantifies the deviation between the total expected reward obtained by the algorithm and the maximum achievable reward that an optimal offline algorithm could obtain. Condition (6) checks whether the remaining capacity of the resources is sufficient. This condition ensures that the cumulative resource allocation does not exceed the available capacity budget plus a certain tolerance level. It takes into account the capacity constraints imposed by the resources and ensures that the algorithm does not allocate more resources than are available. If both conditions hold, the algorithm continues to the next time step. However, if either condition is violated, indicating either excessive regret growth or insufficient resource capacity, the algorithm switches from ALGLP to ALGADV.

3.2. Performance Guarantees of ULwE

The following theorem provides a regret upper bound for Algorithm 1.

Theorem 3.1. In any case of non-stationary arrivals, the algorithm guarantees
$$\mathrm{OPT} \;\le\; \Big(1 + \frac{(1 + \min_{i \in [n]} c_i)\big(1 - e^{-1/\min_{i \in [n]} c_i}\big)}{1 - 1/e}\Big)\,\mathbb{E}[\mathrm{ALG}] + O\big(\sqrt{n|\Theta|T}\big).$$
When the arrivals are stationary, the algorithm guarantees
$$\mathrm{OPT} \;\le\; \mathbb{E}[\mathrm{ALG}] + O\big(\sqrt{n|\Theta|T}\big).$$

The complete proof of Theorem 3.1 is given in the Appendix, and we provide a proof sketch in Section 3.3. The theorem establishes an expected regret bound of $O(\sqrt{n|\Theta|T})$ for near-i.i.d. arrival scenarios and provides a unified guarantee for non-stationary arrivals, combining sublinear regret with a constant CR. In the context of resource allocation, these results are significant. The sublinear regret of our algorithm indicates that the gap between its performance and the optimal strategy narrows over time. This demonstrates the algorithm's ability to adapt and improve, becoming more effective with longer usage. The constant CR highlights that our algorithm's revenue is always within a fixed factor of the optimal strategy's revenue, regardless of the problem's scale. This underlines the algorithm's consistent effectiveness and robustness, even with changing customer preferences. A remark on the behavior of the CR when $\min_{i \in [n]} c_i$ approaches infinity is provided in Remark D.8.
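To make the switching rule of Section 3.1 concrete, the sketch below evaluates the two checks in the form reconstructed in (5) and (6). The running-sum bookkeeping and the function signature are assumptions of this sketch.

```python
import math

def switch_to_adv(regret_dev: float, consumption: list, c: list, t: int, T: int,
                  n: int, r_max: float, theta_card: int) -> bool:
    """Return True if ALGLP should hand over to ALGADV at time t.

    regret_dev     -- |running sum| on the left-hand side of condition (5)
    consumption[i] -- running sum of s^l_{i J_l} * f-bar_i(x^{(J_l)}, Omega^{l-1}) up to time t
    """
    beta = 1.0 / (n * T)
    # Condition (5): the estimated regret must stay within a sublinear envelope.
    regret_ok = regret_dev <= r_max * math.sqrt(32.0 * t * math.log(4.0 * theta_card * t / beta))
    # Condition (6): cumulative allocation must track the capacity budget c_i * t / T.
    tol = math.sqrt(2.0 * t * math.log(2.0 * t / beta))
    capacity_ok = all(consumption[i] <= c[i] * t / T + tol for i in range(n))
    return not (regret_ok and capacity_ok)
```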
3.3. Proof Sketch of Theorem 3.1

We provide an overview of the proof of Theorem 3.1, which is given in full in Appendix D. First, we establish that ALGLP always incurs a sublinear regret and a linear rate of resource consumption with high probability as long as no switch occurs (Theorem 3.2). Specifically, if ALGLP runs uninterrupted for $t$ iterations, the expected regret up to time $t$ is bounded.

Theorem 3.2. Suppose ALGLP runs for $t$ iterations without being switched. Then the expected regret up to time $t$ is upper bounded by
$$\sqrt{2|\Theta| n t \log\big(4|\Theta|t/\beta(n, T)\big)} + 5n \log\big(2t/\beta(n, T)\big) + (2/t + 2 + \log t)\,\beta(n, T)\,\big(\mathrm{LP}(\theta^*) + nt\big).$$
Moreover, each resource $i \in [n]$ has at least
$$\frac{T - t}{T}\, c_i - \sqrt{8 n t \log\big(2t/\beta(n, T)\big)}$$
capacity remaining with probability at least $1 - \beta(n, T)$.

We then demonstrate that under i.i.d. arrivals, the switch from ALGLP to ALGADV does not occur with high probability (see Proposition D.13 and Proposition D.14). Consequently, we can apply Theorem 3.2, which assures sublinear regret when no switch occurs. This leads to $O(\sqrt{|\Theta| n T})$ regret under i.i.d. arrivals. We further analyze our algorithm to address the case where the switch occurs, transitioning from ALGLP to ALGADV at time $t$. Theorem 3.3 quantifies the performance of the optimal algorithm from time $t + 1$ to $T$, given that the consumption of all resources is zero.

Theorem 3.3. It holds that
$$\mathrm{OPT} \;\le\; \frac{(1 + \min_{i \in [n]} c_i)\big(1 - e^{-1/\min_{i \in [n]} c_i}\big)}{1 - 1/e}\,\mathbb{E}[\mathrm{ALG}] + \max_{i \in [n]} r_i\,\big(\sqrt{|\Theta| n} + 1\big)\sqrt{2T \log\big(2T/\beta(n, T)\big)} + \max_{i \in [n]} r_i\,(1/T + \log T + 1)/n.$$

This theorem is crucial as it sets an upper bound on the regret incurred by ALGADV under adversarial arrival conditions. By examining the collective performance of both ALGLP and ALGADV, we derive a comprehensive regret bound in Theorem 3.1. Our analysis extends further to encompass non-stationary arrivals, showing that our algorithm not only achieves sublinear regret but also maintains a constant CR.

4. Numerical Studies

In this section, we conduct numerical experiments to assess the performance of ALGLP, ALGADV, and the ULwE algorithm in a controlled setting where resources are continuously sold. The main goal is to evaluate these algorithms under dynamic conditions with limited resource availability, using $J^D(c, t)$ (Equation (1)) as the benchmark. Based on insights from Golrezaei et al. (2014), we adjust our estimates of $\lambda_l$ every $h = 50$ periods to reflect shifts in customer arrivals while considering inventory levels and empirical data. Initially, based on the true rate $\mu$, we generate data for 500 customers to estimate the initial $\lambda_l$ values, which are used in the first cycle of $h$ periods. Subsequently, at every $h$-period interval, we update the $\hat{\lambda}_l$'s based on newly collected data, using the formula $\hat{\lambda}_l = \frac{T}{t} \sum_{\tau=1}^{t} \mathbb{I}\{\text{customer at } \tau \text{ is of type } l\}$, where $T$ is the total number of periods, $t$ is the current time, $\tau$ represents past time points, and $\mathbb{I}\{\cdot\}$ is an indicator function that equals 1 if the customer arriving at time $\tau$ is of type $l$, and 0 otherwise. The experiments involve two customer types ($L = 2$) and two resources ($n = 2$), with customer purchase probabilities modeled through a logistic function. The ULwE algorithm starts with historical data (500 observations) to estimate the initial set $\Theta$, and is compared against the maximum profit of an optimal offline algorithm that knows the true $\theta$ values. In this setup, Customer Type A exhibits a high purchase probability of 0.9, while Customer Type B maintains a consistent probability of 0.5, regardless of the resource. The total capacity of the two resources matches the time horizon $T$, with revenues set at 1 and 1.5 units for Resources 1 and 2, respectively. Our analysis primarily assesses the ULwE algorithm's performance, particularly its regret in both i.i.d. and adversarial customer arrival scenarios, highlighting its adaptability across various operational contexts.
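The experimental setup described above can be sketched as follows. The purchase probabilities, revenues, and the update cadence $h = 50$ follow the description in the text, while the variable names, the placeholder assignment rule, and the rescaling form of the $\hat{\lambda}_l$ update are assumptions of this sketch.

```python
import numpy as np

rng = np.random.default_rng(2)

T, L, n, h = 500, 2, 2, 50
r = np.array([1.0, 1.5])                 # revenues of Resources 1 and 2
purchase_prob = np.array([0.9, 0.5])     # Type A buys w.p. 0.9, Type B w.p. 0.5 (any resource)
arrival_prob_a = 0.6                     # near-i.i.d. scenario: Types A/B arrive at rates 0.6/0.4

counts = np.zeros(L)
lam_hat = np.array([0.6 * T, 0.4 * T])   # initial estimates from the 500 historical observations
revenue = 0.0
for t in range(1, T + 1):
    j = 0 if rng.random() < arrival_prob_a else 1      # type of the arriving customer
    counts[j] += 1
    i = int(np.argmax(r * purchase_prob[j]))           # naive myopic assignment (placeholder policy)
    if rng.random() < purchase_prob[j]:
        revenue += r[i]
    if t % h == 0:                                     # every h periods, rescale counts to the horizon
        lam_hat = (T / t) * counts                     # hat-lambda_l = (T/t) * sum of type-l indicators
print("estimated lambda:", lam_hat.round(1), "| toy revenue:", round(revenue, 1))
```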
The reported regret values are averages over 100 independent runs, ensuring robustness and reliability of the results.

4.1. Results under Near-IID Arrivals

In this subsection, we analyze experimental results under i.i.d. arrival conditions, where customer arrivals are stable and predictable. We simulate Customer Types A and B arriving at total rates of $0.6T$ and $0.4T$, respectively, representing a near-i.i.d. environment with consistent arrival probabilities. Figure 2 illustrates the regret trajectories over a total time period $T$ of 500.

Figure 2. Regret over Time under i.i.d. Arrivals

The ADV algorithm (yellow line) initially maintains a stable regret, suggesting an optimal initial resource allocation. However, as time advances, an increase in regret is observed, a characteristic trait of the algorithm's greedy nature. This increase indicates a shift from optimal decisions to suboptimal ones due to the depletion of resources and the consequent reduction in modified revenue. In contrast, the ALGLP and ULwE algorithms (blue and green lines, respectively) exhibit a more gradual and consistent increase in regret. The near-overlapping of these lines indicates a similarity in their performance, with both algorithms showing a steady increase in revenue regret over time. At $T = 500$, ALGADV has the highest cumulative regret, suggesting that the heuristic greedy algorithm encounters difficulties in adapting its resource allocation strategies effectively under i.i.d. arrivals. In contrast, both ALGLP and the ULwE algorithm exhibit similar and improved performance compared to ALGADV, showing effective adaptation to i.i.d. conditions with lower cumulative regret. More details are provided in Appendix F.

4.2. Results under Adversarial Arrivals

In this subsection, we analyze how the algorithms perform in scenarios with non-stationary arrival patterns, wherein the probabilities of customer arrivals change over time. Specifically, we simulate a scenario in which Customer Types A and B exhibit varying arrival rates throughout different phases of the time period $T$ to reflect dynamic shifts in customer behaviors.

Figure 3. Regret over Time under Adversarial Arrival (ADV1)

In our analysis of non-stationary environments, as shown in Figure 3, the regret trajectories of ALGLP (blue line) and ALGADV (yellow line) were consistent with their performance under i.i.d. conditions. ALGLP exhibited a steady increase in regret over time, while ALGADV showed initial stability followed by a sharp rise in the later stages. Interestingly, under non-stationary conditions, the regret of ALGADV did not surpass that of ALGLP, indicating better performance in dynamic settings. The ULwE algorithm (green line) displayed a distinct pattern. It initially follows the trend of ALGLP, but after a switch point at around $T/3$, it adopts a pattern similar to ALGADV: initially stable, then sharply increasing, indicating its ability to integrate the strengths of both the LP and ADV approaches. More details are provided in Appendix F.

4.3. Results under General Arrivals

Figure 4. Regret over Time under Adversarial Arrival (ADV2)

This subsection evaluates the performance of each algorithm under various customer arrival scenarios (illustrated in Table 3), which vary in their levels of nonstationarity. The primary focus is on the ADV2 setting, indicative of high nonstationarity.
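For concreteness, piecewise-constant non-stationary arrival sequences of the kind used in these scenarios can be generated as below. The phase boundaries and rates are invented for illustration and are not the ADV1/ADV2 schedules of Table 3.

```python
import numpy as np

rng = np.random.default_rng(3)

def arrival_sequence(T: int, schedule) -> np.ndarray:
    """Draw a sequence of customer types (0 = Type A, 1 = Type B).

    schedule: list of (end_period, prob_type_A) pairs defining piecewise-constant arrival rates.
    """
    types = np.empty(T, dtype=int)
    for t in range(T):
        p_a = next(p for end, p in schedule if t < end)
        types[t] = 0 if rng.random() < p_a else 1
    return types

T = 500
iid    = arrival_sequence(T, [(T, 0.6)])                                    # near-i.i.d.: constant 0.6/0.4
phased = arrival_sequence(T, [(T // 3, 0.9), (2 * T // 3, 0.1), (T, 0.6)])  # phased shifts, illustrative
print("type-A share, i.i.d.:", round(float((iid == 0).mean()), 2),
      "| phased:", round(float((phased == 0).mean()), 2))
```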
Key findings highlight the robustness of the ULwE algorithm, especially in low-nonstationarity settings like ADV1, attributed to its adaptive switching between ALGLP and ALGADV. Although its performance dips in more nonstationary situations (e.g., ADV2), it still outperforms in stationary contexts. In summary, ALGADV consistently sustains performance in nonstationary environments, while ALGLP, limited by its static resource allocation, encounters challenges with fluctuating arrival rates and increased nonstationarity. ULwE's flexibility in resource allocation and its ability to adapt to changing conditions effectively minimize regret, highlighting its effectiveness in addressing nonstationary uncertainties.

5. Conclusion

In conclusion, we proposed the ULwE algorithm, based on CBwK, to address the resource allocation problem under non-stationary environments. Our algorithm leverages contextual information to make informed decisions and adaptively balances exploration and exploitation to accommodate rapidly changing customer preferences. By assuming no prior knowledge of the per-period arrival process and considering the variability in arrival probabilities, our algorithm overcomes the constraints of stationary environments, thereby achieving efficient resource allocation. While bandit algorithms have been used for resource allocation problems, we extend the existing literature by explicitly considering the variability in customer purchase probabilities. Compared to related works, our algorithm achieves a regret bound of approximately $O(\sqrt{n|\Theta|T})$ in the case of near-i.i.d. arrivals, while providing sublinear regret and a constant CR under non-stationary arrivals. In addition to the theoretical analysis, our experiments compared the ULwE algorithm with ALGLP and ALGADV. The results consistently demonstrated that the ULwE algorithm outperforms these algorithms, achieving lower regret under non-stationary arrival patterns. Future research could explore the algorithm's applicability in other domains and enhance its performance in dynamic environments. In future work, we aim to develop a more computationally efficient, or even LP-free, version, which would significantly streamline the algorithm's practical application.

Acknowledgements

We would like to express our gratitude to the reviewing team for their constructive suggestions, which have significantly improved the paper's positioning and clarity of exposition.

Impact Statement

This paper presents work whose goal is to advance the field of Machine Learning. There are many potential societal consequences of our work, none of which we feel must be specifically highlighted here.

References

Agrawal, S. and Devanur, N. R. Linear contextual bandits with knapsacks. arXiv preprint arXiv:1507.06738, 2015.

Agrawal, S., Devanur, N. R., and Li, L. An efficient algorithm for contextual bandits with knapsacks, and an extension to concave objectives. arXiv preprint arXiv:1506.03374, 2015.

Badanidiyuru, A., Langford, J., and Slivkins, A. Resourceful contextual bandits. In Conference on Learning Theory, pp. 1109–1134, 2014.

Castiglioni, M., Celli, A., and Kroer, C. Online learning with knapsacks: the best of both worlds. In Proceedings of the 39th International Conference on Machine Learning, volume 162 of Proceedings of Machine Learning Research. PMLR, 2022.

Cetintas, S., Chen, D., Si, L., Shen, B., and Datbayev, Z.
Forecasting counts of user visits for online display advertising with probabilistic latent class models. pp. 1217 1218, 07 2011. doi: 10.1145/2009916.2010127. Chen, W., Wang, Y., and Yuan, Y. Combinatorial multiarmed bandit: General framework and applications. In International conference on machine learning, pp. 151 159. PMLR, 2013. Cheung, W. C., Ma, W., Simchi-Levi, D., and Wang, X. Inventory balancing with online learning. Management Science, 68(3):1776 1807, 2022. Fahrbach, M., Huang, Z., Tao, R., and Zadimoghaddam, M. Edge-weighted online bipartite matching. Journal of the ACM, 69(6):1 35, 2022. Ferreira, K. J., Simchi-Levi, D., and Wang, H. Online network revenue management using thompson sampling. Operations research, 66(6):1586 1602, 2018. Fikioris, G. and Tardos, E. Approximately stationary bandits with knapsacks. ar Xiv preprint ar Xiv:2302.14686, 2023. Golrezaei, N., Nazerzadeh, H., and Rusmevichientong, P. Real-time optimization of personalized assortments. Management Science, 60(6):1532 1551, 2014. doi: 10.1287/mnsc.2014.1939. Immorlica, N., Sankararaman, K. A., Schapire, R., and Slivkins, A. Adversarial bandits with knapsacks. In 2019 IEEE 60th Annual Symposium on Foundations of Computer Science (FOCS), 2019. Immorlica, N., Sankararaman, K., Schapire, R., and Slivkins, A. Adversarial bandits with knapsacks. J. ACM, 69(6), nov 2022. Online Resource Allocation with Non-Stationary Customers Jin, T., Liu, J., and Luo, H. Improved best-of-both-worlds guarantees for multi-armed bandits: Ftrl with general regularizers and multiple optimal arms. ar Xiv preprint ar Xiv:2302.13534, 2023. Karp, R. M., Vazirani, U. V., and Vazirani, V. V. An optimal algorithm for on-line bipartite matching. In Proceedings of the Twenty-Second Annual ACM Symposium on Theory of Computing, pp. 352 358, 1990. Kesselheim, T. and Singla, S. Online learning with vector costs and bandits with knapsacks. In Proceedings of Thirty Third Conference on Learning Theory, volume 125 of Proceedings of Machine Learning Research. PMLR, 09 12 Jul 2020. Lai, H.-C., Shih, W.-Y., Huang, J.-L., and Chen, Y.-C. Predicting traffic of online advertising in real-time bidding systems from perspective of demand-side platforms. In 2016 IEEE International Conference on Big Data (Big Data), pp. 3491 3498, 2016. doi: 10.1109/Big Data.2016. 7841012. Langley, P. Crafting papers on machine learning. In Langley, P. (ed.), Proceedings of the 17th International Conference on Machine Learning (ICML 2000), pp. 1207 1216, Stanford, CA, 2000. Morgan Kaufmann. Liu, S., Jiang, J., and Li, X. Non-stationary bandits with knapsacks. Ar Xiv, abs/2205.12427, 2022. Luo, H., Tong, H., Zhang, M., and Zhang, Y. Improved high-probability regret for adversarial bandits with timevarying feedback graphs, 2023. Lyu, L. and Cheung, W. C. Non-stationary bandits with knapsack problems with advice. Ar Xiv, abs/2302.04182, 2023. Mehta, A., Saberi, A., Vazirani, U., and Vazirani, V. Adwords and generalized on-line matching. In 46th Annual IEEE Symposium on Foundations of Computer Science (FOCS 05), pp. 264 273, 2005. doi: 10.1109/SFCS.2005. 12. Mehta, A., Saberi, A., Vazirani, U., and Vazirani, V. Adwords and generalized online matching. Journal of the ACM (JACM), 54(5):22 es, 2007. Richardson, M., Dominowska, E., and Ragno, R. Predicting clicks: estimating the click-through rate for new ads. In Proceedings of the 16th international conference on World Wide Web, pp. 521 530, 2007. Sivakumar, V., Zuo, S., and Banerjee, A. 
Smoothed adversarial linear contextual bandits with knapsacks. In International Conference on Machine Learning, 2022. Slivkins, A. and Foster, D. J. Efficient contextual bandits with knapsacks via regression. Ar Xiv, abs/2211.07484, 2022. Wei, C.-Y. and Luo, H. More adaptive algorithms for adversarial bandits. In Conference On Learning Theory, pp. 1263 1291. PMLR, 2018. Zhalechian, M., Keyvanshokooh, E., Shi, C., and Van Oyen, M. P. Online resource allocation with personalized learning. Operations Research, 70(4):2138 2161, 2022. Zimmert, J. and Seldin, Y. An optimal algorithm for stochastic and adversarial bandits. In The 22nd International Conference on Artificial Intelligence and Statistics, pp. 467 475. PMLR, 2019. Zimmert, J. and Seldin, Y. Tsallis-inf: An optimal algorithm for stochastic and adversarial bandits. The Journal of Machine Learning Research, 22(1):1310 1358, 2021. Online Resource Allocation with Non-Stationary Customers A. Appendix: LP formulations To capture the maximum achievable total reward, we define a static LP problem, denoted as LP(θ ): LP(θ ) = max sij,i [n],j [L] j [L] λjsijfi(x(j), θ i ) j [L] λjsijfi(x(j), θ i ) ci, i [n] i [n] sij = 1, j [L] sij 0, i [n], j [L]. The coefficients in the LP are determined by the latent variables θ . The objective of this LP is to maximize the total reward obtained by assigning resources to customers based on their features and latent variables. The LP is subject to constraints that ensure consistency between budget limitations and resource assignment. The optimal solution to this LP is denoted as s . We denote U the maximum optimal objective value of U t, i.e., U = max sij,θi Θi,i [n],j [L] j [L] λjsij fi(x(j), θi) j [L] λjsij fi(x(j), θi) ci, i [n] i [n] sij = 1, j [L] sij 0, i [n], j [L]. B. Appendix: Proof of the Deterministic Upper Bound on the Optimal Revenue In this subsection, our primary objective is to establish a theoretical upper bound on the optimal revenue that can be achieved. This involves a comprehensive analysis of the optimal revenue (denoted as OPT), which serves as a benchmark against which the performance of various algorithms can be measured. By exploring the upper limits of achievable revenue, we aim to provide a clear framework for evaluating the efficiency and effectiveness of different resource allocation strategies. This analysis not only aids in quantifying the regret incurred by these algorithms but also offers insights into the potential for revenue maximization under varying operational constraints and market conditions. Firstly, the resource allocation problem is formulated as follows: At time zero, the system has capacities c of n items and a finite time t > 0 to assign them. Let PL j=1 µjsf I(xjs, θ )) denote the probability of item I is purchased by customers to time s. The term µjt denotes the probability of encountering a customer of type j at time t, reflecting the non-stationary nature of customer arrivals. The function f I(xjt, θ ) captures the probability of a customer of type j with feature vector xjt purchasing resource I, given the latent variable θ associated with the resource. A consumption is realized at time s if PL j=1 µjsf I(xjs, θ )) = 1, in which case the system sells one item and receives revenue of r I. We introduce an indicator variable δI,s, which equals 1 if resource I is assigned at time s, and 0 otherwise. 
The class of all non-anticipating policies, denoted by Π, must satisfy j=1 µj sf I(xjs, θ )δI,s c I In addressing the challenge of optimizing resource allocation under constraints of customer behavior and time, we formulate a function that encapsulates the expected revenue over a continuous time horizon. The decision-making policy π guides the allocation of resources, determining each resource I being dynamically chosen based on customer features and the characteristics of the resources, as indicated by δI,s. Online Resource Allocation with Non-Stationary Customers Given a allocation policy π Π, an initial capacity c > 0, and a sales horizon t > 0, we denote the expected revenue by Jπ(c, t) .= Eπ j=1 µj sf I(xjs, θ )δI,s Here, r I represents the revenue earned from resource I upon a customer s purchase, as determined by policy π and the assignment indicated by δI,s. The problem is to find a decision-making policy π (if one exists) that maximizes the total expected revenue generated over [0, t], denoted OPT. Equivalently, OPT .= sup π Π Jπ(c, t) This formulation allows for a dynamic and adaptive approach to resource allocation, where the decision policy π can be optimized based on the varying probabilities of customer types and their purchasing behaviors over time. The objective OPT thus represents the total expected revenue, taking into account the fluctuating customer landscape and the inherent uncertainties in customer preferences and behaviors. In our pursuit to devise an optimal strategy for resource allocation in a customer interaction system, we propose a deterministic linear programming model, denoted as JD. This model is designed to maximize the deterministic revenue while adhering to the constraints of resource capacities and customer arrival patterns. The objective function aims to maximize the total deterministic revenue overall resources, customer types, and time periods. JD(c, t) = max sij,i [n],j [L] s=1 riss ijfi(xj, θ ) s=1 µj sss ijfi(xj, θ ) ci, i [n] s=1 ss ij = s=1 µj s, j [L] ss ij 0, i [n], j [L] All the variables and parameters in this LP are deterministic. They are either constants known a priori or decision variables whose values are to be determined by solving the LP without any randomness involved. The first constraint ensures that the expected number of resource allocations does not exceed the capacity ci of each resource i. This constraint is crucial for maintaining a balance between maximizing revenue and not overcommitting the available resources. The second constraint guarantees that the total allocations for each customer type j over the time horizon equal the expected number of arrivals λj of that customer type. This aligns the resource allocation with the anticipated customer demand. Finally, the non-negativity constraint ensures that the decision variables remain feasible, reflecting the reality that negative resource allocation is not possible. Proposition B.1. For all 0 c < + and 0 t < + , OPT JD(c, t) Proof. Initially, the problem is formulated as finding the best policy π to maximize the expected sum of returns, subject to a cost constraint: OPT .= max π Eπ j=1 µj sf I(xjs, θ )δI,s j=1 µj sf I(xjs, θ )) c I Online Resource Allocation with Non-Stationary Customers Here we introduce binary decision variables πij to represent the allocation of resources to each customer at each time point s > 0. We add a constraint Pn i=1 πs ij = 1, which represents each customer can only be assigned one item at each time point. 
The problem thus becomes a binary integer programming problem: OPT = max πij s=1 riµj sπs ijfi(xjs, θ ) j=1 µj sπs ijfi(xjs, θ )) ci i [n] i=1 πs ij = 1 πs ij = {0, 1} To establish the relationship between the original optimization problem OPT and JD(c, t), we begin to get a new LP problem OPT1 by relaxing the binary constraint on πs ij, allowing it to take any value between 0 and 1. This relaxation naturally leads to OPT1 OPT, as OPT1 includes all feasible solutions of OPT and potentially more. We then introduce a new variable ms ij = µjsπs ij, noting that since µjs is also between 0 and 1, it follows that 0 ms ij 1. The first constraint of OPT1, Pt s=0 PL j=1 ms ijfi(xjs, θ )) ci, is tighter than the first constraint of JD(c, t), Pt s=0 PL j=1 µjsss ijfi(xjs, θ )) ci, as Pt s=0 PL j=1 ms ijfi(xjs, θ )) is larger than Pt s=0 PL j=1 µjsss ijfi(xjs, θ )). The second constraint of OPT1, Pn i=1 ms ij = Pn i=1 µjsπs ij = µjs, applies to each time period, making it more stringent than the overall time constraint in JD(c, t),Pn i=1 Pt s=1 ss ij = Pt s=1 µjs. Consequently, we conclude that OPT OPT1 JD(c, t), completing the proof. Having rigorously established the inequality OPT JD(c, t), we have effectively demonstrated that JD(c, t) serves as an upper bound on the optimal revenue, which we denote as OPT . This upper bound is crucial as it provides a benchmark against which the performance of various algorithms can be measured. In light of this, we are now positioned to define the concept of regret in our context. Specifically, regret can be quantitatively expressed as the difference between the upper bound of the optimal revenue and the actual optimal revenue achieved, formulated as Regret = OPT ALG This definition of regret is instrumental in evaluating the efficiency of different algorithms. By measuring how closely an algorithm s performance approaches the theoretical upper limit of revenue, we can assess its effectiveness in resource allocation and decision-making under various operational scenarios. C. Appendix: Technical Lemmas Lemma C.1. (Azuma-Hoeffding inequality) Let G0 G1 . . . Gn be a filtration G, and X0, . . . , Xn a martingale associated with G, and |Xi Xi 1| ci, i = 1, . . . , n almost surely. Then, it holds that P [|Xn X0| > ϵ] 2 exp ϵ2 2 Pn k=1 c2 k for all ϵ > 0. Lemma C.2. ((Badanidiyuru et al., 2014)) Let G0 G1 . . . Gn be a filtration, and X1, . . . , Xn be real random variables such that Xi is Gi-measurable, E[Xi|Gi 1] = 0 and |Xi| c for all i [n] and some c > 0. Then with probability at least 1 δ it holds that i=1 E[X2 i |Gi 1] log(2n/δ) + 5c2 log2(2n/δ). Online Resource Allocation with Non-Stationary Customers Proof. First notice for all i [n], j [L], t [T], we have E h µj st ij fi(x(j), Ωt 1 i )|Ft 1 i =E h E [I(i, j, t)|Ft 1] fi(x(j), Ωt 1 i )|Ft 1 i =E [I(i, j, t)|Ft 1] E h fi(x(j), Ωt 1 i )|Ft 1 i =E h I(i, j, t) fi(x(j), Ωt 1 i )|Ft 1 i , j=1 (I(i, j, t) µj st ij) fi(x(j), Ωt 1 i ) Let X0 = 0, Xt Xt 1 = PL j=1(I(i, j, t) µj st ij) fi(x(j), Ωt 1 i ), then we know E [Xt|Ft 1] = 0 for all t [T]. To use Lemma C.2, we need to estimate VT := PT t=1 E X2 t |Ft 1 . 
To be specific, we have j=1 (I(i, j, t) µj st ij) fi(x(j), Ωt 1 i ) j=1 (I(i, j, t) µj st ij)2 fi(x(j), Ωt 1 i )2 j =j [L] (I(i, j, t) µj st ij)(I(i, j , t) µj st ij ) fi(x(j), Ωt 1 i ) fi(x(j ), Ωt 1 i ) j=1 (I(i, j, t) µj st ij)2 fi(x(j), Ωt 1 i )2 j=1 I(i, j, t)2 fi(x(j), Ωt 1 i )2 j=1 I(i, j, t)µj st ij fi(x(j), Ωt 1 i ) t=1 E h µ2 j( st ij)2 fi(x(j), Ωt 1 i )2 Ft 1 i j=1 I(i, j, t) fi(x(j), Ωt 1 i ) j=1 I(i, j, t)µj st ij fi(x(j), Ωt 1 i ) t=1 E h µ2 j( st ij)2 fi(x(j), Ωt 1 i ) Ft 1 i j=1 µj st ij(1 µj st ij) fi(x(j), Ωt 1 i ) j=1 µj st ij fi(x(j), Ωt 1 i ) Online Resource Allocation with Non-Stationary Customers The first inequality is due to the following calculation: given any j = j [L], E h (I(i, j, t) µj st ij)(I(i, j , t) µj st ij ) fi(x(j), Ωt 1 i ) fi(x(j ), Ωt 1 i ) Ft 1 i =(µj st ij(1 µj st ij)( µj st ij ) + µj st ij ( µj st ij)(1 µj st ij ) +(µj(1 st ij) + µj (1 st ij ) + (1 µj µj ))µj st ijµj st ij ) fi(x(j), Ωt 1 i ) fi(x(j ), Ωt 1 i ) = µj st ijµj st ij fi(x(j), Ωt 1 i ) fi(x(j ), Ωt 1 i ) 0. So, by using Lemma C.2, for any 0 < δ < 1, we obtain with probability at least 1 δ, I(i, j, t) µj st ij fi(x(j), Ωt 1 i ) j=1 µj st ij fi(x(j), Ωt 1 i ) log(2T/δ) + 5 log2(2T/δ) j=1 µj st ij fi(x(j), Ωt 1 i ) log(2T/δ) + 5 log(2T/δ), where in the second inequality we use the fact b for any a, b, > 0. Therefore, we conclude with probability at least 1 δ, I(i, j, t) µj st ij fi(x(j), Ωt 1 i ) j=1 µj st ij fi(x(j), Ωt 1 i ) log(2T/δ) + 5n log(2T/δ) v u u u t4n j=1 µj st ij fi(x(j), Ωt 1 i ) log(2T/δ) + 5n log(2T/δ) 4n T log(2T/δ) + 5n log(2T/δ), where in the second inequality we use the Cauchy-Schwarz inequality. Corollary C.3. After the modification in the removal process of θ, P[E] 1 (1 + log T)β(n, T). Corollary C.3 is directly turned out by Proposition D.3. Lemma C.4. For any 0 < δ < 1 and t [T], it holds that i=1 ri(si Jl(θ) sl i Jl j=1 µl j(sij(θ) sl ij))fij(x(j), θ) > max i [n] ri p 8t log(2|Θ|/δ) Proof. First note E[(si Jt(θ) st i Jt)fi Jt(x(Jt), θ)|Ft 1] = PL j=1 µt j(sij(θ) st ij)fi(x(Jt), θ) and Pn i=1 ri si Jl(θ) st i Jt PL j=1 µt j(sij(θ) st ij) fi(x(j), θ) 2 max i [n] ri for all t [T], so we define X0 = 0, Xl Online Resource Allocation with Non-Stationary Customers Xl 1 = Pn i=1 ri si Jl(θ) sl i Jl PL j=1 µl j(sij(θ) sl ij) fi(x(j), θ) so that X0, . . . , Xt is a martingale associated with the filtration F0, . . . , Ft 1. So by using the Azuma-Hoeffding inequality (c.f. Lemma C.1) we have for any θ Θ, si Jl(θ) sl i Jl j=1 µl j(sij(θ) sl ij) fi(x(j), θ) 2e ϵ2/(8t(max i [n] ri)2) . By letting δ/|Θ| = 2e ϵ2/(8t(max i [n] ri)2) and using the union bound we obtain the desired result. Lemma C.5. For any 0 < δ < 1 and t [T], it holds that sl i Jlfi(x(Jl), Ωl 1) j=1 µj sl ijfi(x(j), Ωl 1) 2t log(2/δ) D. Appendix: Omitted Proofs from Section 3 D.1. Proofs under Stationary Arrivals We point out the expected regret in period t is (for now ignore events at the end of the horizon) j=1 µj(s ij st ij)fi(x(j), θ i ). The following proposition states that the regret can be bounded from above by the size of the confidence intervals of purchase probabilities in period t. Proposition D.1. Suppose in each period t, if fi(x(j), Ωt 1 i ) fi(x(j), θ i ) for all i [n], j [L], then j=1 µj(s ij st ij)fi(x(j), θ i ) j=1 µj st ij( fi(x(j), Ωt 1 i ) fi(x(j), θ i )). 
j=1 µj(s ij st ij)fi(x(j), θ i ) j=1 µjs ijfi(x(j), θ i ) j=1 µj st ij fi(x(j), Ωt 1 i ) j=1 µj st ij( fi(x(j), Ωt 1 i ) fi(x(j), θ i )) =LP(θ ) U t + j=1 µj st ij( fi(x(j), Ωt 1 i ) fi(x(j), θ i )) j=1 µj st ij( fi(x(j), Ωt 1 i ) fi(x(j), θ i )). The last inequality follows from the condition that the values of purchase probabilities in U t (2) are at least those in LP(θ ) (7). To see this, we define sij = s ij ηij where ηij := fi(x(j), θ i )/fi(x(j), Ωt 1 i ) if fi(x(j), θ i ) = 0. Then, since ηij 1 for all i, j, sij is always a feasible solution for U t with revenue P j [L] λjs ijfi(x(j), θ i ). This is because by definition we have P j [L] λjs ijfi(x(j), θ i ) ci for all i and suppose some null resource exists (i.e., representing the no-click event) then we could allocate the additional allocation weights to a null resource with zero revenue). This proves U t LP(θ ). Online Resource Allocation with Non-Stationary Customers Proposition D.1 provides an alternative perspective on the regret incurred in period t. Specifically, the algorithm selects arm i under context x(j) with probability µj st ij. In this case, the algorithm experiences regret equal to fi(x(j), Ωt 1 i ) fi(x(j), θ i ). We next provide a direct result from the Azuma-Hoeffding inequality (see Lemma C.1). Lemma D.2. For any 0 < δ < 1, t [T], suppose all It, Jt are given, l=1 (f Il(x(Jl), θ Il) al) 2t log(2/δ) Proof. First note E al|Fl 1 = f Il(xl, θ Il) and |f Il(x(Jl), θ Il) al| 1 for all l [t]. Let X0 = 0, Xl Xl 1 = f Il(xl, θ Il) al for all l [t] we know X0, . . . , Xl is a martingale associated with the filtration F0, F1, . . . , Fl 1. Thus using the Azuma-Hoeffding inequality (c.f. Lemma C.1) we have for any ϵ > 0, l=1 r It(f Il(xl, θ Il) al) 2e ϵ2/(2t). Then the desired result is obtained by setting ϵ such that δ = 2e ϵ2/(2t). Proposition D.3 shows the probabilistic event E := i [n], j [L], t [T] : fi(x(j), Ωt 1 i ) fi(x(j), θ i ) indeed happens with high probability. The event E ensures that the maximum purchase probability for resource i based on a set Ω of valid latent variables is greater than the true probability. Once the probabilistic event E has been established to occur with high probability, analyzing the revenue regret under different conditions becomes more convenient. Proposition D.3. P [E] 1 (1 + log T)β(n, T). Proof. Since P h i [n], j [L], t [T] : fi(x(j), Ωt 1 i ) fi(x(j), θ i ) i P i [n], t [T] : θ i Ωt i , so it is sufficient to lower bound P [ i [n], t [T] : θ i Ωt i]. Consider the probabilistic event l=1 f Il(x(Jl), θ Il) al > p 2t log(2t/β(n, T)) We know from Lemma D.2 that P [A(t)] β(n, T)/t. From step 3 of the algorithm and then using the union bound, we have P i [n], t [T] : θ i Ωt i =P t Dt It(θ i ) f It (xt , θ i ) at p 2t log(2t/β(n, T)) l=1 f Il(x(Jl), θ Il) al p 2t log(2t/β(n, T)) 1 P T t=1A(t) t=1 β(n, T)/t 1 (log T + 1)β(n, T), Online Resource Allocation with Non-Stationary Customers where in the last inequality we use the fact PT t=1 1 t log T + 1. Proposition D.4 establishes an upper bound on the expected regret resulting from the deviation of revenue. It quantifies the relationship between the regret and the revenue shortfall caused by the algorithm s decisions. By considering this upper bound, we can gain insights into the potential regret incurred due to revenue variations. Proposition D.4. We have j=1 µj(s ij st ij)fi(x(j), θ i ) max i [n] ri n max i [n] |Θi| + 1 8T log(2T/β(n, T)) + U/Tβ(n, T) + LP(θ )β(n, T)(log T + 1). Proof. 
From Proposition D.1, given E, it is sufficient to upper bound the term j=1 µj st ij( fi(x(j), Ωt 1 i ) fi(x(j), θ i )) j=1 I(i, j, t)( fi(x(j), Ωt 1 i ) fi(x(j), θ i )) t=1 E h r It( f It(x(Jt), Ωt 1 It ) f It(x(Jt), θ It)) Ft 1 max i [n] ri E f It(x(Jt), Ωt 1 It ) at at f It(x(Jt), θ It) where the first inequality holds because of E[I(i, j, t)|Ft] = µt st ij and law of iterated expectation. Next, from step 3 of the algorithm and the Cauchy-Schwarz inequality, it holds almost surely t DT i (θi) fi(x(Jt ), θi) at 2|DT i (θi)| log(2T/β(n, T)) v u u tn max i [n] |Θi| θi Θi 2|DT i (θi)| log(2T/β(n, T)) 2n max i [n] |Θi|T log(2T/β(n, T)). Moreover, from D.2 we know P h |( )| p 2T log(2T/β(n, T)) i 1 β(n, T)/T. Online Resource Allocation with Non-Stationary Customers Therefore, let α = max i [n] ri n max i [n] |Θi| + 1 8T log(2T/β(n, T)), we obtain P[( ) α|E] 1 β(n, T)/T. Therefore, noticing P[( ) > α, E] P[( ) > α|E] β(n, T)/T, E[( )] =E[( )|E]P[E] + E[( )|Ec]P[Ec] E[( )|E] + LP(θ )β(n, T)(log T + 1) =E[( )|( ) α, E]P[( ) α, E] + E[( )|( ) > α, E]P[( ) > α, E] + LP(θ )β(n, T)(log T + 1) α + Uβ(n, T)/T + LP(θ )β(n, T)(log T + 1). Lemma D.5 provides a crucial concentration estimate in analyzing the regret incurred by the violation of the inventory constraints. Lemma D.5. For any 0 < δ < 1, it holds that, I(i, j, t) µj st ij fi(x(j), Ωt 1 i ) 4n T log(2T/δ) + 5n log(2T/δ) Proof. First notice for all i [n], j [L], t [T], we have E h µj st ij fi(x(j), Ωt 1 i )|Ft 1 i =E h E [I(i, j, t)|Ft 1] fi(x(j), Ωt 1 i )|Ft 1 i =E [I(i, j, t)|Ft 1] E h fi(x(j), Ωt 1 i )|Ft 1 i =E h I(i, j, t) fi(x(j), Ωt 1 i )|Ft 1 i , j=1 (I(i, j, t) µj st ij) fi(x(j), Ωt 1 i ) Let X0 = 0, Xt Xt 1 = PL j=1(I(i, j, t) µj st ij) fi(x(j), Ωt 1 i ), then we know E [Xt|Ft 1] = 0 for all t [T]. To use Lemma C.2, we need to estimate VT := PT t=1 E X2 t |Ft 1 . To be specific, we have Online Resource Allocation with Non-Stationary Customers j=1 (I(i, j, t) µj st ij) fi(x(j), Ωt 1 i ) j=1 (I(i, j, t) µj st ij)2 fi(x(j), Ωt 1 i )2 j =j [L] (I(i, j, t) µj st ij)(I(i, j , t) µj st ij ) fi(x(j), Ωt 1 i ) fi(x(j ), Ωt 1 i ) j=1 (I(i, j, t) µj st ij)2 fi(x(j), Ωt 1 i )2 j=1 I(i, j, t)2 fi(x(j), Ωt 1 i )2 j=1 I(i, j, t)µj st ij fi(x(j), Ωt 1 i ) t=1 E h µ2 j( st ij)2 fi(x(j), Ωt 1 i )2 Ft 1 i j=1 I(i, j, t) fi(x(j), Ωt 1 i ) j=1 I(i, j, t)µj st ij fi(x(j), Ωt 1 i ) t=1 E h µ2 j( st ij)2 fi(x(j), Ωt 1 i ) Ft 1 i j=1 µj st ij(1 µj st ij) fi(x(j), Ωt 1 i ) j=1 µj st ij fi(x(j), Ωt 1 i ) The first inequality is due to the following calculation: given any j = j [L], E h (I(i, j, t) µj st ij)(I(i, j , t) µj st ij ) fi(x(j), Ωt 1 i ) fi(x(j ), Ωt 1 i ) Ft 1 i =(µj st ij(1 µj st ij)( µj st ij ) + µj st ij ( µj st ij)(1 µj st ij ) +(µj(1 st ij) + µj (1 st ij ) + (1 µj µj ))µj st ijµj st ij ) fi(x(j), Ωt 1 i ) fi(x(j ), Ωt 1 i ) = µj st ijµj st ij fi(x(j), Ωt 1 i ) fi(x(j ), Ωt 1 i ) 0. Online Resource Allocation with Non-Stationary Customers So, by using Lemma C.2, for any 0 < δ < 1, we obtain with probability at least 1 δ, I(i, j, t) µj st ij fi(x(j), Ωt 1 i ) j=1 µj st ij fi(x(j), Ωt 1 i ) log(2T/δ) + 5 log2(2T/δ) j=1 µj st ij fi(x(j), Ωt 1 i ) log(2T/δ) + 5 log(2T/δ), where in the second inequality we use the fact b for any a, b, > 0. Therefore, we conclude with probability at least 1 δ, I(i, j, t) µj st ij fi(x(j), Ωt 1 i ) j=1 µj st ij fi(x(j), Ωt 1 i ) log(2T/δ) + 5n log(2T/δ) v u u u t4n j=1 µj st ij fi(x(j), Ωt 1 i ) log(2T/δ) + 5n log(2T/δ) 4n T log(2T/δ) + 5n log(2T/δ), where in the second inequality we use the Cauchy-Schwarz inequality. 
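As a sanity check on the scaling of this bound, the following small simulation (ours, not part of the paper) estimates the centered sum for a single resource (the n = 1 case of Lemma D.5) with made-up arrival probabilities mu, allocation probabilities s, and optimistic estimates fbar that are held fixed over time (in the algorithm, the allocation s^t and the estimates change from period to period), and compares it with sqrt(4 T log(2T/delta)) + 5 log(2T/delta).

import numpy as np

# Monte Carlo sanity check (not from the paper) of the single-resource case of Lemma D.5.
# mu, s, fbar, L, T and delta below are illustrative placeholders.
rng = np.random.default_rng(0)
L, T, delta = 3, 2000, 0.05
mu = np.array([0.5, 0.3, 0.2])        # arrival probabilities of the L customer types
s = np.array([0.6, 0.4, 0.8])         # allocation probabilities s_ij for the single resource i
fbar = np.array([0.7, 0.5, 0.9])      # optimistic purchase-probability estimates, in [0, 1]

def centered_sum():
    total = 0.0
    for _ in range(T):
        j = rng.choice(L, p=mu)                    # the arriving customer's type
        indicator = np.zeros(L)
        indicator[j] = float(rng.random() < s[j])  # I(i, j, t): type j arrived and resource i was offered
        total += float(np.dot(indicator - mu * s, fbar))
    return total

bound = np.sqrt(4 * T * np.log(2 * T / delta)) + 5 * np.log(2 * T / delta)
draws = np.array([centered_sum() for _ in range(200)])
print(f"bound = {bound:.1f}, largest |centered sum| over 200 runs = {np.abs(draws).max():.1f}")

In this toy run the empirical deviations stay well below the bound, which is expected since the bound must hold uniformly over the horizon with probability at least 1 - delta.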
Proposition D.6 calculates the regret incurred from exceeding the inventory constraints. Denote ( )+ = max{ , 0}. This proposition allows us to understand the relationship between resource allocation decisions and the incurred regret due to inventory limitations. Proposition D.6. We have j=1 I(i, j, t)at ci 16 max i n |Θi|n T log(2T/β(n, T)) + 5n log(2T/β(n, T)) + n(T log T + 2T)β(n, T). Online Resource Allocation with Non-Stationary Customers Proof. We rewrite j=1 I(i, j, t)(at fi(x(j), Ωt 1 i )) j=1 (I(i, j, t) µj st ij) fi(x(j), Ωt 1 i ) j=1 µj st ij fi(x(j), Ωt 1 i ) ci i=1 ri|( )| + i=1 ri|( )| + i=1 ri[( )]+. From step 3 of the algorithm and the Cauchy-Scharz inequality, we know given E, i=1 |( )| = t DT i (θi) I(i, Jt, t ) at fi(x(Jt ), θt ) 2|DT i (θi)| log(2T/β(n, T)) 2n max i [n] |Θi|T log(2T/β(n, T)) holds almost surely. From Lemma D.5 we know i=1 |( )| p 4n T log(2T/β(n, T)) + 5n log(2T/β(n, T)) Recall st ij is the optimal solution of LP (2). Thus, noticing λj = Tµj, it holds for all i [n] almost surely j=1 µj st ij fi(x(j), Ωt 1 i ) ci/T. Hence, we know ( ) 0holds almost surely. Therefore, by setting α = max i [n] ri r 16 max i [n] |Θi|n T log(2T/β(n, T)) + 5n log(2T/β(n, T)), we obtain P[( ) α|E] 1 β(n, T). Thus, noticing P[(( ) α, E)c] =1 P[( ) α, E] =1 P[( ) α|E]P[E] 1 (1 β(n, T))(1 (log T + 1)β(n, T)) (log T + 2)β(n, T), we then have E[( )] =E[( )|( ) α, E]P[( ) α, E] + E[( )|(( ) α, E)c]P[(( ) α, E)c] α + (log T + 2)β(n, T)n T = max i [n] ri r 16 max i [n] |Θi|n T log(2T/β(n, T)) + 5n log(2T/β(n, T)) + n(T log T + 2T)β(n, T). Online Resource Allocation with Non-Stationary Customers Theorem D.7 provides a summary of the expected regret bound for the ALGLP algorithm. Specifically, the regret bound, denoted by O p max i [n]|Θi|n T , indicates that the expected regret increases with the square root of the maximum parameter space size |Θi|, the number of resources n, and the time horizon T. This regret bound is particularly relevant in scenarios where customer arrivals exhibit a near-i.i.d. pattern, and it allows us to assess the scalability and performance of the ALGLP in different settings and make informed decisions regarding resource allocation. Theorem D.7. The expected regret of the algorithm is upper bounded by 8 max i [n] ri r max i [n] |Θi|n T log(2n T 2) + 5n log(2n T 2) + U/(n T 2) + LP(θ )(log T + 1)/(n T) + log T + 2. Proof. The expected regret is upper bounded by the sum of the regret incurred by the expected revenue and violation of inventory constraints, that is, j=1 µj(s ij st ij)fi(x(j), θ i ) j=1 I(i, j, t)at ci So by piecing together Proposition D.4 and D.6 we obtain the result. Moreover, a key feature of our regret bound is that it is independent of the number of contexts L, which means it might lead to efficient computation even when L is very large, as long as we have a moderate max i [n] |Θi|. This feature is also shared by (Badanidiyuru et al., 2014), where they have the cardinality of the policy space instead of the contextual space appear as a multiplicative factor in the regret bound (this allows them to treat resourceful contextual bandit problems with an infinite-dimensional contextual space). Remark D.8. When min i [n] ci is very large, i.e., tends to infinity, it is easy to verify (1 + min i [n] ci) 1 e 1/ min i [n] ci 1 , i.e., the CR has a limit 1 + 1 1 1/e. D.2. 
Proofs under Adversarial Arrivals Before analyzing the regret, we introduce an auxiliary online learning problem where we maintain the adversarial arrival setting but remove the inventory constraint that limits the consumption of each resource to at most ci units, for all i [n]. Suppose we have an algorithm Π that decides It in each period after observing xt. The regret in this auxiliary problem, denoted as REGAUX, is defined as the sum of regrets incurred in each period t: t [T ] (r It f It (xt, θ It ) r Itf It(xt, θ It)), where It represents the optimal resource selection that maximizes the reward rifi(xt, θ i ) among all resources i [n], denoted as It := arg max i [n] rifi(xt, θ i ). By studying the regret in this auxiliary problem, we can gain insights into the impact of resource allocation decisions on the incurred regret, without considering the inventory constraint. Our analysis is based on a crucial finding in (Cheung et al., 2022) (c.f., Lemma D.9), which says as long as we have an online algorithm for the auxiliary problem with low time regret, we can adapt the same algorithm to the complete problem within the same time regret plus some approximation error (unrelated to T). Lemma D.9. ((Cheung et al., 2022)) Let Π be some algorithm that incurs regret REGAUX for the auxiliary online learning problem (without inventory constraints), suppose we apply the same algorithm Π to the complete online learning problem (with inventory constraints), just with ri in period t changed to rt i = ri 1 Ψ N t 1 i ci , and denote OPT as the optimal Online Resource Allocation with Non-Stationary Customers revenue associated with some arrival sequence x1, . . . , x T and ALG the revenue gained from running Π with respect to the arrival sequence x1, . . . , x T . Then, we have OPT (1 + min i [n] ci) 1 e 1/ min i [n] ci 1 1/e E[ALG] + E[REGAUX]. So, the analysis reduces to upper bound E[REGAUX] properly. As shown in Proposition D.10, we derive E[REGAUX] = O( Proposition D.10. E[REGAUX] max i [n] ri( p |Θ|n + 1) p 2T log(2T/β(n, T)) + max i [n] ri(1/T + log T + 1)/n. Proof. Note given E, it holds almost surely t=1 (r It f It (x(Jt), θ It ) r Itf It(x(Jt), θ It)) t=1 (r It f It (x(Jt), θ It ) r Itf It(x(Jt), θt) + r Itf It(x(Jt), θt) r Itf It(x(Jt), θ It)) t=1 (r It f It (x(Jt), θ It ) r Itf It(x(Jt), θt) + r Itf It(x(Jt), θt) r Itf It(x(Jt), θ It)) t=1 (r Itf It(x(Jt), θt) r Itf It(x(Jt), θ It)) max i [n] ri t=1 (f It(x(Jt), θt) at) t=1 (at f It(x(Jt), θ It)) where in the first inequality we use the step 3 of the algorithm (i.e., It = arg max i [n] rt i fi(x(Jt), θt)), and in the second inequality we use the fact that f It (x(Jt), θt) f It (xt, θ It ) almost surely given E. From the step 4 of the algorithm, we know given E, it holds almost surely t DT i (θi) (f It (xt , θi) at ) 2Di(θi) log(2T/β(n, T)) 2|Θ|n T log(2T/β(n, T)), where in the last inequality we use the Cauchy-Schwarz inequality. Then, using D.2 we know P h ( ) > p 2T log(2T/β(n, T)) E i β(n, T)/T. Let α = max i [n] ri( p |Θ|n + 1) p 2T log(2T/β(n, T)), we know P[REGAUX α|E] 1 β(n, T)/T. Online Resource Allocation with Non-Stationary Customers E[REGAUX] =E[REGAUX|REGAUX α, E]P[REGAUX α, E] + E[REGAUX|(REGAUX α, E)c]P[(REGAUX α, E)c] α + max i [n] ri T(1/T + log T + 1)β(n, T) = max i [n] ri( p |Θ|n + 1) p 2T log(2T/β(n, T)) + max i [n] ri(1/T + log T + 1)/n. Using Lemma D.9 and Proposition D.10 we obtain the main result of this section immediately. Theorem D.11. 
It holds that OPT (1 + min i [n] ci) 1 e 1/ min i [n] ci 1 1/e E[ALG] + max i [n] ri( p |Θ|n + 1) p 2T log(2T/β(n, T)) + max i [n] ri(1/T + log T + 1)/n. Theorem D.11 provides a key result of ALGADV regret upper bound under adversarial arrivals. Specifically, it establishes that the regret incurred by the algorithm exhibits a sublinear T growth rate with a constant competitive ratio. Sublinear regret indicates that the corresponding algorithm s performance gradually approaches the performance level of the optimal offline policy as time progresses.The constant competitive ratio highlights the algorithm s performance relative to an optimal decision-making strategy. It provides a quantitative measure of the algorithm s effectiveness, indicating that it achieves regret that is at most a constant factor compared to the optimal strategy. Moreover, this result forms a crucial foundation for our subsequent analysis of the regret bound for the Unified Learning algorithm under nonstationary arrivals in Section D.3. D.3. Proofs under Nonstationary Arrivals D.3.1. REGRET ANALYSIS: SUBLINEAR REGRET AS LONG AS ALGLP IS NOT SWITCHED We first provide a proposition that states ALGLP always incurs a sublinear regret and a linear rate of resource consumption with high probability as long as the switch does not happen. Proposition D.12. Suppose ALGLP runs for t iterations without being switched, then the expected regret up to time t is upper bounded by 2|Θ|nt log(4|Θ|t/β(n, T)) + 5n log(2t/β(n, T)) + (2/t + 2 + log t)β(n, T)(LP(θ ) + nt). Moreover, each resource i [n] has at least T t 8nt log(2t/β(n, T)) + remaining with probability at least 1 β(n, T). Proof. Given E(t), we know the first condition in step 3 of the algorithm ensures i=1 ri(s i Jl sl i Jl)fi(x(Jl), θ ) max i [n] ri p 32t log(4|Θ|t/β(n, T)). And by using Lemma C.4 we derive with probability at least 1 β(n, T)/t, j=1 µl j(s ij sl ij)fi(x(j), θ ) max i [n] ri p 128t log(4|Θ|t/β(n, T)). Online Resource Allocation with Non-Stationary Customers The second condition in step 3 of the algorithm ensures l=1 sl i Jl fi(x(Jl), Ωl 1) t 2t log(2t/β(n, T)). And by using C.5 we derive with probability at least 1 β(n, T)/t, j=1 sl ij fi(x(j), Ωl 1) t 8nt log(2t/β(n, T)). Therefore, using the same logic of proving Proposition D.6, we derive with probability at least 1 β(n, T), j=1 I(i, j, t)at ci 64|Θ|nt log(2t/β(n, T)) + 5n log(2t/β(n, T)). Therefore, let α := 16 p 2|Θ|nt log(4|Θ|t/β(n, T)) + 5n log(2t/β(n, T))we have P[Regrett > α|E(t)] (2/t + 1)β(n, T). So we end up with E[Regrett] =E[Regrett|Regrett α, E(t)]P[Regrett α, E(t)] + E[Regrett|(Regrett α, E(t))c]P[(Regrett α, E(t))c] α + (2/t + 2 + log t)β(n, T)(LP(θ ) + nt). D.3.2. REGRET ANALYSIS: SUBLINEAR REGRET UNDER I.I.D. ARRIVALS In this section, we aim to demonstrate that under i.i.d. arrivals, the switch from ALGLP to ALGADV does not occur with high probability. This can be achieved by showing that the conditions for the switch to happen are unlikely to be satisfied. Once we establish that the switch does not occur under i.i.d. arrivals, we can then apply Proposition D.12 to prove that our algorithm still achieves a sublinear expected regret in this setting. To begin, we verify the first condition in step 3 of the algorithm is not violated for all t [T] with high probability. Proposition D.13. When xt arrives in a i.i.d. fashion, given E, we have t [T], θ Ωt 1 : i=1 ri(si Jl(θ) sl i Jl)fi(x(Jl), θ) max i [n] ri p 32t log(4|Θ|t/β(n, T)) 1 (log T + 1)β(n, T). Proof. 
Using the same logic in Proposition D.1, we have for all t [T], j=1 µj(sij(θ) st ij)fi(x(j), θ) j=1 µj st ij( fi(x(j), Ωt 1) fi(x(j), θ)). So from the θ-removal process of the algorithm we have, j=1 µj(sij(θ) sl ij)fi(x(j), θ) l=1 E[r Il( f Il(x(Jl), Ωl 1) f Il(x(Jl), θ))|Fl 1] max i [n] ri p 8t log(2t/β(n, T)) Online Resource Allocation with Non-Stationary Customers holds with probability at least 1 β(n, T)/(2t). Define probabilistic event i=1 ri(si Jl(θ) sl i Jl)fi(x(Jl), θ) > max i [n] ri p 32t log(4|Θ|t/β(n, T)) Then using C.4 and the triangle inequality we obtain P[An] β(n, T)/t. Thus by using the union bound we know the condition holds with probability at least 1 (log T + 1)β(n, T). We verify the second condition in step 3 of the algorithm is not violated for all t [T] with high probability. Proposition D.14. When xt arrives in a i.i.d. fashion, given E, we have t [T], i [n] : l=1 si Jlfi(x(Jl), Ωl 1) t 2t log(2t/β(n, T)) 1 (log T + 1)β(n, T). Proof. Using C.5, it is easy to show sl i Jlfi(x(Jl), Ωl 1) j=1 µj sl ijfi(x(j), Ωl 1) 2t log(2t/β(n, T)) Since sl ij is the solution of U t, we have for all i [n] it holds almost surely j=1 µj sl ijfi(x(j), Ωl 1) t Therefore the desired result holds by a union bound argument. Finally, we obtain O(|Θ|n T) regret under i.i.d. arrivals, as a corollary of Proposition D.12, Proposition D.13 and Proposition D.14 . Corollary D.15. The expected regret under i.i.d. arrivals of the algorithm is upper bounded by 2|Θ|n T log(4|Θ|n T 2) + 5n log(2n T 2) + (LP(θ )/(n T) + 1)(2/T + 2 + 3 log T). D.3.3. REGRET ANALYSIS: UNIFIED REGRET BOUND WHERE ALGLP IS SWITCHED TO ALGADV SOMETIME In this section, we further analyze our algorithm to address the case where the switch occurs, transitioning from ALGLP to ALGADV. We extend our analysis to nonstationary arrivals and demonstrate that our algorithm achieves sublinear regret and a constant competitive ratio. By considering the combined performance of both algorithms, we derive a unified regret bound in Proposition D.16 that provides an overall measure of our algorithm s performance. This analysis allows us to quantify the regret incurred and evaluate the effectiveness of our approach in dynamically allocating resources. Proposition D.16. In any case of nonstationary arrivals, the algorithm guarantees 1 + (1 + min i [n] ci) 1 e 1/ min i [n] ci E[ALG] + O( p Proof. Suppose the switch occurs in time epoch t. Denote OPT1 the expected revenue obtained by the optimal algorithm (who knows θ at the beginning) from time epoch 1 to t and OPT2 the rest of the expected revenue generated by the same Online Resource Allocation with Non-Stationary Customers algorithm. We use ALGt LP and ALGT t+1 ADV to denote the revenue obtained by our algorithm in the two phases, respectively. From Proposition D.12 we know OPT1 E[ALGt LP] + 16 2|Θ|nt log( 4t β(n, T)) + 5n log( 2t β(n, T)) + (2 t + 2 + log t)β(n, T)(LP(θ ) + nt). We define Empty-OPT2 as the revenue obtained by the optimal algorithm from time t + 1 to T, given the consumption of all resources is zero. We also define Empty-ALGT t+1 ADV the revenue obtained by our algorithm (after the switch) in the second phase, given the consumption of all resources is zero. So from the definition, it holds almost surely Empty-ALGT t+1 ADV ALGT t+1 ADV + ALGt LP. Thus, from Theorem D.11 we know Empty-OPT2 (1 + min i [n] ci) 1 e 1/ min i [n] ci 1 1/e E[Empty-ALGT t+1 ADV ] + max i [n] ri( p n|Θ| + 1) p 2(T t + 1) log(2(T t + 1)/β(n, T)) + max i [n] ri((1/(T t + 1) + log(T t + 1) + 1)/n. 
For an arbitrary 0 < α < 1, we have
$$\mathrm{ALG}_{\mathrm{LP}}^{t} + \mathrm{ALG}_{\mathrm{ADV}}^{T-t+1} = \alpha\,\mathrm{ALG}_{\mathrm{LP}}^{t} + (1-\alpha)\,\mathrm{ALG}_{\mathrm{LP}}^{t} + \mathrm{ALG}_{\mathrm{ADV}}^{T-t+1} \;\ge\; \alpha\,\mathrm{ALG}_{\mathrm{LP}}^{t} + (1-\alpha)\big(\mathrm{ALG}_{\mathrm{LP}}^{t} + \mathrm{ALG}_{\mathrm{ADV}}^{T-t+1}\big).$$
Plugging the previous two inequalities on $\mathrm{OPT}_1$ and $\text{Empty-}\mathrm{OPT}_2$ into the right-hand side, and using $\mathrm{OPT} = \mathrm{OPT}_1 + \mathrm{OPT}_2$ together with $\mathrm{OPT}_2 \le \text{Empty-}\mathrm{OPT}_2$, we obtain
$$\begin{aligned}
\mathrm{OPT} &= \mathrm{OPT}_1 + \mathrm{OPT}_2\\
&\le \mathbb{E}[\mathrm{ALG}_{\mathrm{LP}}^{t}] + \text{Empty-}\mathrm{OPT}_2 + O\big(\sqrt{n|\Theta|T}\big)\\
&\le \mathbb{E}[\mathrm{ALG}_{\mathrm{LP}}^{t}] + \frac{(1+\min_{i\in[n]} c_i)\big(1-e^{-1/\min_{i\in[n]} c_i}\big)}{1-1/e}\,\mathbb{E}\big[\text{Empty-}\mathrm{ALG}_{\mathrm{ADV}}^{T-t+1}\big] + O\big(\sqrt{n|\Theta|T}\big)\\
&\le \mathbb{E}[\mathrm{ALG}_{\mathrm{LP}}^{t}] + \frac{(1+\min_{i\in[n]} c_i)\big(1-e^{-1/\min_{i\in[n]} c_i}\big)}{1-1/e}\,\mathbb{E}\big[\mathrm{ALG}_{\mathrm{ADV}}^{T-t+1} + \mathrm{ALG}_{\mathrm{LP}}^{t}\big] + O\big(\sqrt{n|\Theta|T}\big)\\
&\le \left(1 + \frac{(1+\min_{i\in[n]} c_i)\big(1-e^{-1/\min_{i\in[n]} c_i}\big)}{1-1/e}\right)\mathbb{E}[\mathrm{ALG}] + O\big(\sqrt{n|\Theta|T}\big),
\end{aligned}$$
which completes the proof.

Through our analysis in Proposition D.16, we demonstrate that the ULwE algorithm achieves sublinear regret under nonstationary arrivals, indicating diminishing regret growth as the time horizon increases. This sublinear growth rate is a desirable property, as it reflects the algorithm's ability to adapt to changing arrival patterns and make informed resource allocation decisions. We also establish that our algorithm achieves a constant competitive ratio, highlighting its performance relative to an optimal decision-making strategy. This constant competitive ratio indicates that our algorithm earns revenue within a constant factor of the optimal strategy (up to the sublinear regret term), demonstrating its competitiveness in resource allocation even in the face of changing customer preferences.
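To give a feel for the size of this competitive-ratio constant, the short script below (ours, not part of the paper; the capacity values are arbitrary) evaluates the factor (1 + min_i c_i)(1 - e^{-1/min_i c_i}) and the resulting multiplicative constant 1 + (1 + min_i c_i)(1 - e^{-1/min_i c_i})/(1 - 1/e) from Proposition D.16. Consistent with Remark D.8, the constant decreases toward 1 + 1/(1 - 1/e), approximately 2.58, as the smallest capacity grows.

import math

# Illustration (not from the paper) of the competitive-ratio constant in Proposition D.16.
# The capacities below are arbitrary example values.
for c in (1, 2, 5, 10, 100, 1000):
    factor = (1 + c) * (1 - math.exp(-1.0 / c))  # (1 + min_i c_i)(1 - e^{-1/min_i c_i})
    ratio = 1 + factor / (1 - 1 / math.e)        # multiplicative constant in front of E[ALG]
    print(f"min capacity c = {c:5d}: factor = {factor:.4f}, ratio = {ratio:.4f}")

For example, with min_i c_i = 1 the constant equals exactly 3, and it drops below 2.66 once every resource has capacity at least 10.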
E. Appendix: Discussion of Imperfect Knowledge of Total Arrival Rates

In this section, we clarify that our results hold more generally in the scenario where the knowledge about λl is only approximately true. That is, suppose λl = λ*l + δl, where λ*l is the true total arrival rate and δl captures the estimation error. Under this definition, we regard λl as a quantity learned by an applied machine learning algorithm from historical data on total arrival rates. For instance, the estimated daily total arrival rate for Google.com exceeds 2 billion visits. In Lai et al. (2016), a regression algorithm was employed on prediction traffic data for online advertising, resulting in a total prediction error rate of approximately 0.9%. Moreover, Cetintas et al. (2011) introduce a series of probabilistic latent class models aimed at predicting online user visits for display advertising, with experiments conducted on visit logs from tens of millions of Yahoo users, resulting in an absolute percentage error of less than 0.5. This evidence demonstrates that, although complete knowledge of the total arrival rates may be hard to attain in practice, approximately accurate total arrival rates are typically easy to obtain.

Next, we show that our analysis is general enough to accommodate this additional error in our knowledge of the total arrival rates. Under the new definition of λl, we can follow the same lines as in the proof of Proposition D.4 to show that the algorithm's expected regret in each period splits into two parts: the term in the absence of error, denoted (*), and the error term introduced by the uncertainty in λl, denoted (**):
$$\underbrace{\sum_{i=1}^{n}\sum_{j=1}^{L} \mu_j\,(s^*_{ij}-s^t_{ij})\,f_i(x^{(j)},\theta^*_i)}_{(*)} \;+\; \underbrace{\sum_{i=1}^{n}\sum_{j=1}^{L} \delta_j\,(s^*_{ij}-s^t_{ij})\,f_i(x^{(j)},\theta^*_i)}_{(**)}.$$
We can see that, due to the uncertainty in λ, an additional error term (**) arises. Recall that f_i(x^{(j)}, θ*_i) is the purchase probability of resource i for customer type x^{(j)}. For each time period t, each resource i can only be allocated to one customer type j, so the magnitude of the error term (**) accumulated over the horizon is at most T max_{i,j} r_i δ_j; this quantity is thus an upper bound on the error term's impact on the expected cumulative regret across all periods. Consequently, after modifying the switching condition in ALG accordingly, the overall regret bound in Theorem 3.1 becomes (letting δ := max_j δ_j)
$$\mathrm{OPT} \le \left(1 + \frac{(1+\min_{i\in[n]} c_i)\big(1-e^{-1/\min_{i\in[n]} c_i}\big)}{1-1/e}\right)\mathbb{E}[\mathrm{ALG}] + O\big(\sqrt{n|\Theta|T}\big) + O(T\delta).$$
From this, we conclude that as long as the magnitude of δ remains O(T^{-1/2}), it does not significantly affect the overall regret. Therefore, if we have at least T historical observations of the total arrival rates (drawn from light-tailed distributions), we can still achieve the same regret order as under full information. We point out that this is a crude analysis, and better bounds may be attainable with a substantially different algorithmic design.
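As a quick back-of-the-envelope check (ours; the constants and problem sizes below are arbitrary), the following snippet compares the extra O(Tδ) term with the O(sqrt(n|Θ|T)) learning term when δ scales like T^{-1/2}, e.g., when the rates are estimated from on the order of T historical observations.

import math

# Rough arithmetic illustration (not from the paper): with delta ~ c / sqrt(T), the extra
# O(T * delta) term grows like sqrt(T) and stays of the same order as the learning term.
n, theta_size, c = 5, 10, 1.0        # hypothetical number of resources, |Theta|, error constant
for T in (10**3, 10**4, 10**5):
    delta = c / math.sqrt(T)                          # assumed estimation error of the arrival rates
    extra_term = T * delta                            # O(T * delta)
    learning_term = math.sqrt(n * theta_size * T)     # O(sqrt(n |Theta| T))
    print(f"T = {T:6d}: T*delta = {extra_term:7.1f}, sqrt(n|Theta|T) = {learning_term:7.1f}")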
F. Appendix: Additional Numerical Results

F.1. Regret Table under i.i.d. Arrivals

Table 1 presents a detailed comparative analysis of the solution trajectories and regret for each algorithm under i.i.d. arrivals. It details the resources allocated to Customer Types A and B and the corresponding regret values, demonstrating that ALGLP and ULwE outperform ALGADV. Notably, the ALGLP and ULwE algorithms maintain consistent performance without violating inventory constraints, as evidenced by the zero values in the inventory regret column. This consistency corroborates the earlier observations regarding their effective resource allocation strategies.

Table 1. Algorithm Solution Trajectories and Regret under i.i.d. Arrivals

                      Customer A                Customer B
Period      Resource 1   Resource 2    Resource 1   Resource 2    Regret
LP algorithm & ULwE algorithm
  100            0           66             12           22          4.6
  200            0          126             38           36         12.4
  300            0          174             71           55         20.3
  400            0          235            100           65         28.5
  500            0          291            132           77         36.6
ADV algorithm
  100            0           66              0           34          0
  200           19          107             10           64         10.7
  300           46          128             35           91         30.0
  400           76          159             53          112         47.7
  500           99          192             76          133         64.7

F.2. Regret Table under ADV Arrivals

Table 2 provides a detailed comparison of the algorithms' solution trajectories and regret under adversarial arrivals. The ULwE algorithm shows a distinct shift in its allocation patterns in the later stages, aligning more closely with ALGADV. Specifically, in the initial phase (T < 200), the ULwE algorithm's allocations closely resemble those of ALGLP. In the middle phase (200 < T < 300), the ULwE algorithm shifts its approach to mirror the early strategies of ALGADV, favoring Resource B predominantly. In the final phase (T > 300), the ULwE algorithm adopts a more greedy allocation strategy, akin to the later strategies of ALGADV. This progression, also supported by Figure 3, emphasizes the ULwE algorithm's capacity to dynamically adjust its allocation strategies in response to evolving conditions in nonstationary environments.

Table 2. Algorithm Solution Trajectories and Regret under Adversarial Arrivals

                      Customer A                Customer B
Period      Resource 1   Resource 2    Resource 1   Resource 2    Regret
LP algorithm
  100            0           13             77           10         20.5
  200            0           35            141           24         39.5
  300            0           75            193           32         55.8
  400            0          120            241           39         71.6
  500            0          153            250           77         88.1
ADV algorithm
  100            0           13              0           87          0
  200           14           21             21          144         18.0
  300           44           31             66          159         40.1
  400           79           44            106          174         61.3
  500          100           53            150          197         81.8
ULwE algorithm
  100            0           13             77           10         20.5
  200            0           35            121           44         37.1
  300            0           75            121          104         37.8
  400           27           93            155          125         55.7
  500           51          102            199          148         76.1

F.3. Regret Table under General Arrivals

In this section, our objective is to assess the performance of ALGLP, ALGADV, and the ULwE algorithm under general customer arrivals. To investigate the impact of general customer arrivals, we consider specific arrival rate settings. Table 3 describes the total arrival rates (λ), which represent the expected numbers of arrivals, for the different settings. Each row in the table corresponds to a specific setting and includes the time period and the total arrival rates for customer types A and B. In the i.i.d. setting, the total arrival rate for customer type A is 0.6T and for customer type B it is 0.4T over the entire time horizon T. In the ADV1 setting, the time horizon is divided into two parts, 0.33T and 0.67T; in the first part the arrival rate is 0.8T for customer type A and 0.2T for customer type B, and in the second part the arrival rates are reversed. Similarly, other settings, such as ADV2, follow a similar pattern in which the arrival rates change periodically.

Regarding the performance under the i.i.d. and ADV1 settings, we have provided detailed explanations in the previous two subsections. Here, our primary focus is the ADV2 setting, which introduces a heightened level of nonstationarity. We observe that ALGS consistently performs well throughout the entire period. Conversely, ALGADV initially exhibits strong performance, followed by a period in which its performance is similar to ALGLP, and eventually surpasses ALGLP in the later stages. This implies that under conditions of high nonstationarity, ALGLP demonstrates the poorest performance.

Indeed, the findings indicate that the ULwE algorithm consistently performs well across different settings, highlighting its robustness in handling various scenarios. It performs best when the arrival sequence has a lower level of nonstationarity, such as in the ADV1 setting, because its ability to switch from ALGLP to ALGADV allows it to leverage the advantages of each approach. In scenarios with higher levels of nonstationarity, such as the ADV2 setting, the performance of the ULwE algorithm may decrease, but it remains better than in the stationary setting. On the other hand, the performance of ALGADV is steady under nonstationary arrivals, while ALGLP performs well under stationary conditions but struggles when faced with changing arrival rates and higher levels of nonstationarity. This suggests that ALGLP's static resource allocation strategy may not be suitable for dynamically changing environments. The effectiveness of the ULwE algorithm in handling nonstationary arrival patterns and adapting its resource allocation strategies is evident: by dynamically switching between algorithms, it responds to changing conditions and achieves better performance while minimizing regret. This adaptability is a valuable characteristic that allows the ULwE algorithm to optimize resource allocation and cope with the uncertainties introduced by nonstationary arrival patterns.
Table 3. Description of different settings for customer arrival rates.

Setting    Time period t (× T)    Arrival rate λA (× T)    Arrival rate λB (× T)
I.I.D.            1                      0.6                      0.4
ADV1              0.33                   0.15                     0.85
                  0.67                   0.4                      0.6
ADV2              0.1                    0.2                      0.8
                  0.3                    0.8                      0.2
                  0.2                    0.2                      0.8
                  0.1                    0.4                      0.6
                  0.1                    0.2                      0.8
                  0.1                    0.02                     0.98
                  0.1                    0.2                      0.8
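To make the settings in Table 3 concrete, the sketch below (ours; the horizon T = 500 mirrors the tables above but is otherwise arbitrary) generates a non-stationary arrival sequence for each setting, interpreting each row's λA and λB as the per-period type probabilities within that segment (they sum to one in every row) and sampling a customer type independently in every period.

import numpy as np

# Sketch (not from the paper) of the arrival patterns in Table 3: each setting is a list of
# (fraction of horizon, P[type A], P[type B]) segments.
SETTINGS = {
    "IID":  [(1.00, 0.60, 0.40)],
    "ADV1": [(0.33, 0.15, 0.85), (0.67, 0.40, 0.60)],
    "ADV2": [(0.1, 0.20, 0.80), (0.3, 0.80, 0.20), (0.2, 0.20, 0.80), (0.1, 0.40, 0.60),
             (0.1, 0.20, 0.80), (0.1, 0.02, 0.98), (0.1, 0.20, 0.80)],
}

def sample_arrivals(setting, T, rng):
    """Return a length-T array of customer types (0 = type A, 1 = type B)."""
    types = []
    for frac, p_a, p_b in SETTINGS[setting]:
        length = int(round(frac * T))
        types.append(rng.choice(2, size=length, p=[p_a, p_b]))
    return np.concatenate(types)[:T]

rng = np.random.default_rng(0)
for name in SETTINGS:
    arr = sample_arrivals(name, T=500, rng=rng)
    print(name, "share of type A arrivals:", round(float(np.mean(arr == 0)), 3))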