# Online Learning of Delayed Choices

Recep Yusuf Bekci, University of Waterloo, Waterloo, Canada. recep.bekci@uwaterloo.ca

Choice models are essential for understanding decision-making processes in domains like online advertising, product recommendations, and assortment optimization. The Multinomial Logit (MNL) model is particularly versatile in selecting products or advertisements for display. However, challenges arise with unknown MNL parameters and delayed feedback, requiring sellers to learn customers' choice behavior and make dynamic decisions with biased knowledge due to delays. We address these challenges by developing an algorithm that handles delayed feedback, balancing exploration and exploitation using confidence bounds and optimism. We first consider a censored setting where a threshold for considering feedback is imposed by business requirements. Our algorithm achieves an $O(\sqrt{NT})$ regret, with a matching lower bound up to a logarithmic term. Furthermore, we extend our analysis to environments with non-thresholded delays, again achieving an $O(\sqrt{NT})$ regret. To validate our approach, we conduct experiments that confirm the effectiveness of our algorithm.

## 1 Introduction

The ability to model and understand consumer choices between discrete alternatives is critical for various business applications, such as online advertising, product recommendations, and assortment optimization. Businesses need to present the most appealing set of options to consumers to maximize engagement and revenue. However, the task of optimizing what content or products are shown to a customer during their browsing session is complex due to the interplay between the alternatives that customers face. Each alternative can act as a substitute or competitor to others, impacting the customer's final decision. Traditional multi-armed bandit (MAB) models, which are widely used for decision-making problems, fall short in scenarios where a subset of alternatives must be presented, and the customer's choice among these influences future decisions.

Multinomial logit (MNL) models have emerged as powerful tools for capturing and predicting consumer behavior among a finite set of alternatives. These models estimate the utilities of different options and the probabilities of their selection. However, when the MNL parameters are unknown and no historical data is available, as is often the case with newly introduced products or advertisements, the learning process becomes even more challenging. This complexity is further exacerbated when feedback on decisions is delayed, requiring the learner to dynamically adjust decisions based on limited and potentially biased information.

One of the fundamental challenges in this setting is the delay in receiving feedback from customers. Unlike immediate responses in classical MAB problems, customers in e-commerce environments often take hours or even days to make decisions, as highlighted by Vernade et al. [2020] and Chapelle [2014]. This delay in feedback complicates the learning process, as it must adapt to new information that arrives sporadically and potentially long after the initial interaction.

In this paper, we address the dual challenges of unknown MNL parameters and delayed feedback by developing algorithms that balance exploration and exploitation through the use of confidence bounds and the principle of optimism in the face of uncertainty.
We focus on two settings of delayed feedback: the thresholded and the non-thresholded setting. In the thresholded setting, feedback is only considered if it is received within a predetermined time frame set by business requirements. This constraint ensures operational stability and efficiency by ignoring excessively delayed responses that may no longer be relevant. For this setting, we introduce the Delayed Multinomial Logit Bandit (DEMBA) algorithm, specifically designed to handle potentially censored delayed feedback effectively. In contrast, the non-thresholded setting allows the learner to consider all feedback regardless of delay, potentially leading to better decision-making accuracy in the long term but at the cost of increased bias in the learning process. For such environments, we propose the Patient Delayed Multinomial Logit Bandit (PA-DEMBA) algorithm, which adapts its learning strategy to accommodate all feedback, irrespective of the delay.

Contributions. Our main contributions are two novel bandit algorithms: DEMBA for thresholded feedback settings and PA-DEMBA for non-thresholded feedback environments. Both algorithms effectively learn from delayed and censored choices using confidence intervals. We provide a comprehensive regret analysis for both DEMBA and PA-DEMBA, demonstrating an $O(\sqrt{NT})$ regret bound and a matching lower bound up to a logarithmic term. Additionally, through detailed computational experiments, we validate the performance and robustness of our algorithms in various scenarios.

Organization. The remainder of this paper is organized as follows. In Section 2, we review related work on choice modeling and delayed feedback in online learning. Section 3 details the problem formulation and the specific challenges addressed by our approach. Our main algorithm, DEMBA, is presented in Section 4. We analyze its regret bound in Section 5, providing theoretical guarantees for its performance. The PA-DEMBA algorithm for non-thresholded delays is presented in Section 6. In Section 7, we conduct experiments to demonstrate the effectiveness of our proposed algorithms. Finally, Section 8 concludes the paper and outlines potential directions for future research.

## 2 Related Work

Delayed feedback is a crucial aspect of online learning environments, especially in domains like online advertising and e-commerce, where decision making involves a consideration period, or in healthcare, where the effects of actions take time to manifest. Consequently, research interest in bandits with delayed feedback has surged in recent years. In their seminal work, Joulani et al. [2013] studied online learning scenarios with stochastic delayed feedback; their work laid the foundation for understanding how delays impact learning performance. Similarly, Chapelle [2014] examined delayed conversions in display advertising, highlighting the practical challenges faced in real-world applications. Vernade et al. [2017] focused on delayed conversions, providing insights into handling delays with known distributions. Expanding on this, Pike-Burke et al. [2018] explored bandits with delayed, aggregated, and anonymous feedback, which adds another layer of complexity by considering multiple types of delays. Zhou et al. [2019] extended this exploration to generalized linear contextual bandits with delayed feedback. Vernade et al. [2020] investigated linear bandits with stochastic thresholded delays.
Additionally, Gael et al. [2020] tackled multi-armed bandits with arm-dependent stochastic delays, focusing on the challenges of non-uniform delays across different choices. Moreover, Cesa-Bianchi et al. [2022] considered composite and anonymous delayed feedback within non-stochastic bandits, further enriching the literature on delayed feedback mechanisms. Tang et al. [2024] studied delayed multi-armed bandits (MAB) with reward-dependent delays in clinical trials, while Lancewicki et al. [2021] explored both reward-dependent and reward-independent delay settings. Flaspohler et al. [2021] investigated delayed learning in weather forecasting, and Grover et al. [2018] addressed the best arm identification problem with delayed feedback. Thune et al. [2019] examined non-stochastic MABs with unrestricted delays, and Cesa-Bianchi et al. [2016] considered cooperation between different agents in delayed settings. Tang et al. [2021] explored scenarios where past actions impact future arm rewards, and Yang et al. [2024] addressed general sequential decision-making problems with delayed feedback. Despite this rich body of work, the solutions developed for MABs do not directly apply to our setting, where assortment feedback in discrete choice models presents additional complexities. In particular, delayed feedback affects both item value estimation and assortment composition, making our problem significantly more challenging than those where only arm rewards are updated. When focusing specifically on generalized linear bandits with delays, we note key contributions such as Howson et al. [2023] and Blanchet et al. [2024] who explored generalized linear bandits with delayed feedback, demonstrating the efficacy of these models in more complex, non-linear settings. Multinomial bandits, which address decision making where multiple items are offered simultaneously, present a unique challenge due to the interactions between items. This complexity distinguishes our problem from other online learning models. Specifically, for generalized linear bandits, we note that when the assortment has more than one item, our problem cannot be addressed by solutions designed for generalized linear models due to the complexity of interactions among multiple choices which makes the action space more complicated. In terms of online learning with choice models, significant progress has been made in understanding and optimizing MNL parameters. Agrawal et al. [2017] developed a Thompson sampling algorithm for learning MNL parameters for assortment optimization, while Agrawal et al. [2019] proposed a UCB algorithm for the same purpose. Dong et al. [2020] adapted this problem by introducing switch costs, addressing practical constraints in dynamic environments. Further, Agarwal et al. [2020] studied the problem for best arm identification, extending the framework to multiple pulls, which is a generalization of the dueling bandits problem. Other notable works include Wang et al. [2018] and Chen et al. [2021], who worked on dynamic assortment allocation under uncapacitated MNL models, and Perivier and Goyal [2022], who investigated joint pricing and assortment optimization with MNL demand processes. To the best of our knowledge, our work is the first to address online learning of delayed choices in this context. Specifically, we propose the Delayed Multinomial Logit Bandit (DEMBA) algorithm for thresholded feedback settings and the Patient Delayed Multinomial Logit Bandit (PA-DEMBA) algorithm for non-thresholded settings. 
These algorithms effectively learn from delayed and censored choices using confidence intervals, providing robust solutions for dynamic and uncertain environments where feedback is not immediately available. Our approach not only advances the theoretical understanding of delayed feedback in online learning but also offers practical insights for applications in e-commerce and online advertising.

## 3 Problem Setup

We consider a capacitated selection problem faced by a seller over $T$ rounds. The items to be selected, referred to as products, can be new retail products or services such as advertisements. There are $N$ products available to potentially be shown to the customer. The set of products selected in a given round is called an assortment, denoted $S_t$ for round $t$, where $S_t \subseteq \{1, \ldots, N\}$ and $|S_t| \le K$. Here, $K$ represents the capacity, i.e., the maximum number of items the seller can show at any given time.

Upon encountering the assortment, the customer makes a choice $a_t$ among the options: (i) rejecting the browsing options provided ($a_t = 0$), (ii) browsing and purchasing/selecting an option ($a_t = i$, $i \in S_t$), and (iii) browsing but not purchasing/selecting an option ($a_t = 0_d$). If option $i$ is chosen, the seller earns a reward $r_i$, with $r_i \in [0, 1]$ for $i \in [N]$ and $r_0 = r_{0_d} = 0$. Customer choice probabilities are determined by the Multinomial Logit (MNL) model as follows:
$$P(a_t = i \mid S_t = S) = \begin{cases} \dfrac{v_i}{v_0 + v_{0_d} + \sum_{j \in S} v_j} & \text{if } i \in S \cup \{0, 0_d\}, \\ 0 & \text{otherwise,} \end{cases}$$
where $v_i$ denotes the (unknown) attraction parameter of option $i$. Without loss of generality, we assume that the attraction parameters are normalized such that $v_0 = 1$. In parallel with real applications, we also assume that not browsing is the most common choice, i.e., $v_i \le v_0$. It is important to note that not purchasing/selecting is different from tracking conversions. Specifically, if the customer browses the given assortment, we can track whether the customer decides not to purchase, just as we track a choice to purchase (e.g., by closing the pop-up or following specific behavioral click-through patterns).

If the customer's choice is $a_t = 0$, the seller observes this choice immediately. Otherwise, the customer's choice is revealed to the seller after a delay $d_t \in \mathbb{N}$, where $d_t$ is sampled from an unknown distribution $f_d$ independently of $a_t$. Moreover, delays longer than a certain threshold $\mu$ are censored, i.e., ignored by the seller in the learning process. This threshold is set based on the seller's operational requirements. We later extend our solution to the patient-learner setting without a threshold, i.e., $\mu \to \infty$.

We define $a_{i,t}$ as the demand for option $i$ at time $t$: $a_{i,t} = 1$ if $a_t = i$, and $0$ otherwise. We also define $c_{i,s,t} \in \{0, 1\}$ as the censoring variable at period $t$ of product $i$ sold at period $s$, determined as $c_{i,s,t} = \mathbb{1}(d_s \le t - s \text{ and } d_s \le \mu)$. The feedback observed by the seller is $o_{i,s,t} \in \{0, 1\}$ with $o_{i,s,t} = c_{i,s,t}\, a_{i,s}$. The expected fraction of observed feedback is denoted $\psi_\mu := \sum_{s=0}^{\mu} f_d(s)$.

The sequence of events at round $t$ can be summarized as follows:
1. The seller selects an assortment $S_t$.
2. The customer interacts with the medium (views the page or encounters the pop-up) and makes a decision $a_t$.
3. The environment returns a reward $r_i$, $i \in [N]$, and samples a delay $d_t$.
4. The outcomes of certain previous actions, and/or whether the customer rejected to browse, are revealed to the seller as $o_{i,s,t}$.
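To make the feedback mechanism concrete, below is a minimal Python sketch of one round of this model: a choice is drawn from the MNL probabilities over the assortment and the two no-purchase options, a delay is drawn independently, and the feedback is flagged as censored when the delay exceeds the threshold $\mu$. All function names and parameter values are illustrative, not taken from the paper's code.

```python
import numpy as np

def mnl_choice_probs(S, v, v0d):
    """MNL choice probabilities over {0 (reject), '0d' (browse, no purchase)} and the
    items in S, with the outside option normalized to v_0 = 1."""
    denom = 1.0 + v0d + sum(v[i] for i in S)
    probs = {0: 1.0 / denom, "0d": v0d / denom}
    probs.update({i: v[i] / denom for i in S})
    return probs

def simulate_round(S, v, v0d, delay_sampler, mu, rng):
    """One round: sample the choice a_t, then a delay d_t, and flag whether the
    feedback falls within the threshold mu."""
    probs = mnl_choice_probs(S, v, v0d)
    options = list(probs.keys())
    idx = rng.choice(len(options), p=[probs[o] for o in options])
    choice = options[idx]
    if choice == 0:                      # an immediate reject is observed at once
        return choice, 0, True
    d_t = delay_sampler(rng)             # delay drawn independently of the choice
    return choice, d_t, d_t <= mu        # censored (ignored by the seller) when d_t > mu

# Illustrative usage with hypothetical parameters
rng = np.random.default_rng(0)
v = {i: 0.25 for i in range(1, 11)}                    # attraction parameters
geometric_delay = lambda r: r.geometric(1 / 100) - 1   # support {0, 1, ...}, mean about 99
print(simulate_round({1, 2, 9, 10}, v, 0.3, geometric_delay, 100, rng))
```

Note that an immediate reject ($a_t = 0$) is observed right away; this is exactly the event that closes an epoch in the DEMBA algorithm introduced below.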
The expected reward of the seller given assortment $S$ and attraction parameter set $v$ is
$$R(S; v) = \sum_{i \in S} r_i\, \frac{v_i}{1 + v_{0_d} + \sum_{j \in S} v_j}.$$
The goal of the seller is to sequentially learn customer preferences and find a policy that minimizes the cumulative expected regret, defined as
$$\mathrm{Reg}(T, \pi) = \sum_{t=1}^{T} \Big( R(S^*; \psi_\mu v) - R(S^\pi_t; \psi_\mu v) \Big),$$
where $S^* = \arg\max_{S \subseteq \{1,\ldots,N\},\, |S| \le K} R(S; \psi_\mu v)$ maximizes the expected reward of the clairvoyant and $v = \{v_{0_d}, v_1, \ldots, v_N\}$ is the ground-truth attraction parameter set.

## 4 Delayed MNL Bandit (DEMBA) Algorithm

In this section, we introduce the Delayed MNL Bandit (DEMBA) algorithm. DEMBA leverages an epoch-based learning method, where epochs are explicitly defined by immediate no-purchase decisions. Specifically, when a no-purchase decision is made by the customer, the current epoch is closed and a new one begins. Throughout each epoch, customer selections are observed, and parameter updates occur upon encountering a no-purchase outcome. Our approach adopts the principle of optimism in the face of uncertainty [Auer et al., 2002] for parameter estimation, generating optimistic estimates using upper confidence bounds for the product attraction parameters and a lower confidence bound for the no-purchase option. This results in an optimistic revenue function, guiding decision-making under uncertainty by balancing exploration and exploitation.

We build our optimistic estimates on biased observations. The total observed preference given to product $i$ until epoch $\tau$ is denoted $\bar{v}_{i,\tau}$ and can be calculated as $\bar{v}_{i,\tau} = \sum_{s=1}^{t^{\text{end}}_\tau} o_{i,s,t^{\text{end}}_\tau}$, where $\tau$ is the current epoch and $t^{\text{end}}_\tau$ is the last period of epoch $\tau$. We count how many times a particular product $i$ has been offered in the assortment until epoch $\tau$ using the set $E_\tau(i)$ of epochs during which $i$ is offered, i.e., $E_\tau(i) = \{e \le \tau : i \in S_e\}$. We then estimate the attraction parameters by $\hat{v}_{i,\tau} = \bar{v}_{i,\tau} / |E_\tau(i)|$. Due to delay and thresholding, $\hat{v}_{i,\tau}$ is a biased estimator, i.e., $E[\hat{v}_{i,\tau}] \neq v_i$. We account for this bias and build our concentration around $\psi_\mu v_i$. We state our concentration result in the following lemma.

Lemma 4.1 With probability at least $1 - O(N^{-2} T^{-1})$, for every epoch $\tau \in \{1, 2, \ldots\}$ and option $i \in \{0_d, 1, \ldots, N\}$, we have $|\hat{v}_{i,\tau} - \psi_\mu v_i| \le \Delta_{i,\tau}$, where
$$\Delta_{i,\tau} = \sqrt{\frac{(48 \hat{v}_{i,\tau} + 1) \log(NT)}{|E_\tau(i)|}} + \frac{48 \log(NT) + \mu}{|E_\tau(i)|}.$$

Sketch of the proof. Note that $\hat{v}_{i,\tau}$ is a biased estimator, defined as
$$\hat{v}_{i,\tau} = \frac{\bar{v}_{i,\tau}}{|E_\tau(i)|} = \frac{\sum_{s=1}^{t^{\text{end}}_\tau} o_{i,s,t^{\text{end}}_\tau}}{|E_\tau(i)|},$$
where $|E_\tau(i)|$ is the total number of epochs during which product $i$ is shown to the customer and $\sum_{s=1}^{t^{\text{end}}_\tau} o_{i,s,t^{\text{end}}_\tau}$ is the total observed number of sales of product $i \neq 0_d$ (or, for $i = 0_d$, the total observed number of delayed no-selections). We analyze the concentration around $\psi_\mu v_i$. Splitting the observations into those from periods at least $\mu$ before $t^{\text{end}}_\tau$ (for which the censoring event is exactly $\{d_s \le \mu\}$) and those from the last $\mu$ periods, rearranging terms, and applying the triangle inequality yields
$$|\hat{v}_{i,\tau} - \psi_\mu v_i| \le \underbrace{\left| \frac{\sum_{s=1}^{t^{\text{end}}_\tau} a_{i,s}\, \mathbb{1}(d_s \le \mu)}{|E_\tau(i)|} - \psi_\mu v_i \right|}_{(1)} + \underbrace{\frac{\sum_{s=t^{\text{end}}_\tau - \mu + 1}^{t^{\text{end}}_\tau} a_{i,s}\, \big(\mathbb{1}(d_s \le \mu) - \mathbb{1}(d_s \le t^{\text{end}}_\tau - s)\big)}{|E_\tau(i)|}}_{(2)},$$
where we decompose the observations into those made before the last $\mu$ periods and those made within them, rearrange the terms, and apply the triangle inequality.
For the first term (1) of the decomposition, we have
$$\left| \frac{\sum_{s=1}^{t^{\text{end}}_\tau} a_{i,s}\, \mathbb{1}(d_s \le \mu)}{|E_\tau(i)|} - \psi_\mu v_i \right| \le \underbrace{\left| \frac{\sum_{s=1}^{t^{\text{end}}_\tau} a_{i,s}\, \mathbb{1}(d_s \le \mu)}{|E_\tau(i)|} - \psi_\mu\, \frac{\sum_{s=1}^{t^{\text{end}}_\tau} a_{i,s}}{|E_\tau(i)|} \right|}_{(a)} + \psi_\mu \underbrace{\left| \frac{\sum_{s=1}^{t^{\text{end}}_\tau} a_{i,s}}{|E_\tau(i)|} - v_i \right|}_{(b)}.$$
We bound term (a) using Hoeffding's inequality. For part (b), we make use of the Chernoff bound from Theorem A.1 and handle two cases according to the magnitude of a deviation parameter $\zeta$ that depends on $v_i$ and $|E_\tau(i)|$. The detailed proof is provided in Appendix A. The result can be summarized as follows: with probability at least $1 - \frac{4}{N^2 T}$,
$$\left| \frac{\sum_{s=1}^{t^{\text{end}}_\tau} a_{i,s}}{|E_\tau(i)|} - v_i \right| \le \sqrt{\frac{48\, \hat{v}_{i,\tau} \log(NT)}{|E_\tau(i)|}} + \frac{48 \log(NT)}{|E_\tau(i)|}.$$
Combining the bounds for both terms, together with the observation that term (2) contains at most $\mu$ summands and is therefore at most $\mu / |E_\tau(i)|$, establishes the concentration result stated in Lemma 4.1.

Using the concentration result for the attraction parameters, we construct an upper confidence bound for each product at each epoch, $\tilde{v}_{i,\tau} = \hat{v}_{i,\tau} + \Delta_{i,\tau}$, and a lower confidence bound for the delayed no-purchase option, $\tilde{v}_{0_d,\tau} = \hat{v}_{0_d,\tau} - \Delta_{0_d,\tau}$, where
$$\Delta_{i,\tau} = \sqrt{\frac{(48 \hat{v}_{i,\tau} + 1) \log(NT)}{|E_\tau(i)|}} + \frac{48 \log(NT) + \mu}{|E_\tau(i)|}.$$
We use the optimistic parameter estimates to construct an optimistic revenue function
$$\tilde{R}(S; \tilde{v}) = \sum_{i \in S} r_i\, \frac{\tilde{v}_{i,\tau}}{1 + \tilde{v}_{0_d,\tau} + \sum_{j \in S} \tilde{v}_{j,\tau}}.$$
Our algorithm DEMBA suggests the assortment that maximizes the optimistic revenue function $\tilde{R}(S; \tilde{v})$ and updates its parameters according to the feedback received with delay. The pseudocode of DEMBA is given in Algorithm 1.

Computational complexity. The computational complexity of the DEMBA algorithm involves several key components. The most intensive step is computing the assortment $S_\tau$ by maximizing the revenue function $\tilde{R}(S; \tilde{v})$. For this step, polynomial-time solutions with $O(N^2)$ complexity are available, as demonstrated by Rusmevichientong et al. [2010] and Davis et al. [2013]. Updating the observed preferences $\bar{v}_{i,\tau}$ involves summing over previous observations, with a complexity of $O(N(t^{\text{end}}_\tau - t^{\text{start}}_\tau))$. Updating the sets $E_\tau(i)$, the estimates $\hat{v}_{i,\tau}$, and the confidence bounds adds $O(N)$ operations per epoch. Given that $\tau \le t$ and $t \le T$, the overall computational complexity across all rounds is $O(TN^2 + NT^2)$.

Algorithm 1 Delayed MNL Bandit (DEMBA)
Initialize: $t = 0$, $\tau = 0$.
While $t < T$:
  Compute $S_\tau = \arg\max_{S \subseteq \{1,\ldots,N\},\, |S| \le K} \tilde{R}(S; \tilde{v})$.
  Offer assortment $S_\tau$; receive feedback.
  If $a_t = 0$ (immediate reject, closing the epoch):
    $\bar{v}_{i,\tau} = \sum_{s=1}^{t^{\text{end}}_\tau} o_{i,s,t^{\text{end}}_\tau}$ for all $i \in [N] \cup \{0_d\}$;
    $E_\tau(i) = \{e \le \tau : i \in S_e\}$ for all $i \in [N] \cup \{0_d\}$;
    $\hat{v}_{i,\tau} = \bar{v}_{i,\tau} / |E_\tau(i)|$ for all $i \in [N] \cup \{0_d\}$;
    $\tilde{v}_{i,\tau} = \hat{v}_{i,\tau} + \Delta_{i,\tau}$ for all $i \in [N]$;
    $\tilde{v}_{0_d,\tau} = \hat{v}_{0_d,\tau} - \Delta_{0_d,\tau}$;
    $\tau = \tau + 1$.
  $t = t + 1$.
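As a rough illustration of how the quantities used by Algorithm 1 could be computed, the sketch below implements the point estimate $\hat{v}_{i,\tau}$, the confidence radius $\Delta_{i,\tau}$ of Lemma 4.1, the resulting upper and lower confidence bounds, the optimistic revenue, and an exhaustive assortment search (the paper points to the $O(N^2)$ routines of Rusmevichientong et al. [2010] and Davis et al. [2013]; brute force is used here only for clarity). All identifiers are hypothetical.

```python
import math
from itertools import combinations

def confidence_radius(v_hat, n_epochs, N, T, mu):
    """Delta_{i,tau} of Lemma 4.1: sqrt((48*v_hat + 1)*log(NT)/|E|) + (48*log(NT) + mu)/|E|."""
    log_nt = math.log(N * T)
    return math.sqrt((48 * v_hat + 1) * log_nt / n_epochs) + (48 * log_nt + mu) / n_epochs

def optimistic_estimates(obs_counts, epoch_counts, N, T, mu):
    """UCBs for the products and an LCB for the delayed no-purchase option '0d'."""
    ucb, lcb_0d = {}, 0.0
    for i, n_i in epoch_counts.items():
        n_i = max(n_i, 1)
        v_hat = obs_counts.get(i, 0) / n_i
        delta = confidence_radius(v_hat, n_i, N, T, mu)
        if i == "0d":
            lcb_0d = max(v_hat - delta, 0.0)   # lower confidence bound, clipped at zero
        else:
            ucb[i] = v_hat + delta             # upper confidence bound
    return ucb, lcb_0d

def optimistic_revenue(S, ucb, lcb_0d, r):
    """Optimistic expected revenue: sum_i r_i * UCB_i / (1 + LCB_0d + sum_j UCB_j)."""
    denom = 1.0 + lcb_0d + sum(ucb[i] for i in S)
    return sum(r[i] * ucb[i] for i in S) / denom

def best_assortment(products, K, ucb, lcb_0d, r):
    """Exhaustive maximizer over assortments of size at most K (illustration only)."""
    best_S, best_val = (), 0.0
    for k in range(1, K + 1):
        for S in combinations(products, k):
            val = optimistic_revenue(S, ucb, lcb_0d, r)
            if val > best_val:
                best_S, best_val = S, val
    return best_S, best_val
```

With these pieces, each epoch of DEMBA amounts to calling `optimistic_estimates` on the counts accumulated so far and then `best_assortment` to obtain $S_\tau$.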
## 5 Regret Analysis

Our main result is given in the following theorem.

Theorem 5.1 Let $\pi^{\text{DEMBA}}$ be the policy produced by Algorithm 1 using $\Delta_{i,\tau} = \sqrt{\frac{(48 \hat{v}_{i,\tau} + 1)\log(NT)}{|E_\tau(i)|}} + \frac{48 \log(NT) + \mu}{|E_\tau(i)|}$. Then $\pi^{\text{DEMBA}}$ satisfies
$$\mathrm{Reg}(T, \pi^{\text{DEMBA}}) \le (1 + \mu)\log(T) + K\sqrt{73\, T \log(NT)} + 48 K \log(T)\big(\log(NT) + \mu\big) + 73\sqrt{NT \log(NT)} + 48 \log^2(NT).$$
The bound can be further simplified to $O(\sqrt{NT})$ by omitting logarithmic terms.

Sketch of the proof. The proof of Theorem 5.1 consists of several steps. First, we use the definition of the optimistic revenue function $\tilde{R}(S; \tilde{v})$ and show that it provides an upper bound on the true revenue function $R(S; \psi_\mu v)$. This is achieved by leveraging Lemma B.1, which ensures that the estimated attraction parameters are close to their true values with high probability. We start by expressing the regret in terms of the epochs:
$$\mathrm{Reg}(T, \pi) = E\left[ \sum_{\tau} |H_\tau| \big( R(S^*; \psi_\mu v) - R(S_\tau; \psi_\mu v) \big) \right],$$
where $H_\tau$ is the duration of epoch $\tau$. Given that the epoch duration follows a geometric distribution, we simplify the expected regret using the law of total expectation. Next, we decompose the regret into two parts: one that occurs with high probability (event $E^C_\tau$) and one that occurs with low probability (event $E_\tau$):
$$E[\bar{R}_\tau] = E\big[ \bar{R}_\tau \mathbb{1}(E_{\tau-1}) + \bar{R}_\tau \mathbb{1}(E^C_{\tau-1}) \big].$$
We bound the contribution of the low-probability event by $(N + 1) P(E_{\tau-1})$, which is small due to our concentration results. For the high-probability event, we show that the difference between the optimistic and true revenues is bounded in terms of $\Delta_{i,\tau}$. Applying Lemma B.1, we can then bound the regret of each epoch:
$$E[\bar{R}_\tau] \le (N + 1) P(E_{\tau-1}) + E\left[ \Big( 1 + \psi_\mu v_{0_d} + \sum_{j \in S_\tau} \psi_\mu v_j \Big) \big( \tilde{R}(S_\tau; \tilde{v}_\tau) - R(S_\tau; \psi_\mu v) \big)\, \mathbb{1}(E^C_{\tau-1}) \right].$$
Summing over all epochs and using the properties of the epoch duration, we show that the total regret is bounded by $O(\sqrt{NT})$ up to logarithmic terms. The full detailed proof is provided in Appendix B.

Next, we provide a lower bound for the regret in the following theorem.

Theorem 5.2 For any policy $\pi$, suppose $K \le N/4$, $T \ge 1$, and $\psi_\mu \in (0, 1)$. There exists a universal constant $c > 0$ such that
$$\mathrm{Reg}(T, \pi) \ge c \min\left\{ \sqrt{NT / \psi_\mu},\ T \right\}.$$
The proof of Theorem 5.2 is deferred to Appendix C. This theorem establishes a lower bound on the regret, showing that no policy can achieve a better regret rate than $\Omega(\sqrt{NT / \psi_\mu})$.

Remark 5.3 The effect of the threshold $\mu$ in the upper bound appears only in logarithmic terms, suggesting that the regret increases with larger $\mu$. In contrast, in the lower bound, $\psi_\mu$ appears in the square root and the denominator, indicating that the regret decreases with larger $\mu$. It is important to note that $\mu$ is determined by business conditions and is typically fixed. We conjecture that the upper bound is not tight with respect to $\mu$, indicating potential room for improvement in the analysis.

## 6 Non-Thresholded Setting: The Patient Learner

In this section, we modify our algorithm for environments that do not apply a threshold to delays. We refer to the seller in this setting as the patient learner. The patient learner is assumed to know the expected delay. This assumption is consistent with the existing literature (see, e.g., Joulani et al. [2013], Blanchet et al. [2024]). We build our concentration result as follows.

Lemma 6.1 With probability at least $1 - O(N^{-2} T^{-1})$, we have
$$|\hat{v}_{i,\tau} - v_i| \le \sqrt{\frac{48\, \hat{v}_{i,\tau} \log(NT)}{|E_\tau(i)|}} + \frac{48 \log(NT)}{|E_\tau(i)|} + \frac{E[d_s] + \sqrt{6\, E[d_s] \log(NT)}}{|E_\tau(i)|}.$$

The proof is deferred to the appendix. Following Lemma 6.1, we modify Algorithm 1 by replacing $\Delta_{i,\tau}$ with
$$\Delta'_{i,\tau} = \sqrt{\frac{48\, \hat{v}_{i,\tau} \log(NT)}{|E_\tau(i)|}} + \frac{48 \log(NT)}{|E_\tau(i)|} + \frac{E[d_s] + \sqrt{6\, E[d_s] \log(NT)}}{|E_\tau(i)|}.$$
We then state the regret result of the modified algorithm.

Theorem 6.2 Let $\pi^{\text{PA-DEMBA}}$ be the policy produced by Algorithm 1 using $\Delta'_{i,\tau}$. Then $\pi^{\text{PA-DEMBA}}$ satisfies
$$\mathrm{Reg}(T, \pi^{\text{PA-DEMBA}}) \le \log(T) + K\sqrt{48\, T \log(NT)} + (K + 1)\Big(48 + E[d_s] + \sqrt{6\, E[d_s]}\Big)\log^2(NT) + 72\sqrt{NT \log(NT)}.$$

Remark 6.3 In the non-thresholded setting, the term with $\psi_\mu$ disappears since $\lim_{\mu \to \infty} \psi_\mu = 1$. This implies that the regret in this setting does not depend on $\psi_\mu$, providing a potentially tighter bound compared to the thresholded case. However, new terms involving $E[d_s]$ are introduced. Asymptotically, both the thresholded and non-thresholded regret bounds simplify to $O(\sqrt{NT})$, with the differences primarily reflected in the constants, and both bounds approach the lower bound.

Remark 6.4 Incorporating the skewness or variance of the delay distribution could improve the regret upper bound in practice, particularly in the non-thresholded setting. For instance, distributions with faster decay rates, such as Gaussian distributions, may lead to better regret performance compared to long-tailed distributions. This would involve using techniques like Bernstein-type inequalities or making assumptions about tail behavior (e.g., sub-exponential tails). However, in the current analysis, we focus on the expectation of the delay, and further improvements based on skewness are left for future work. The asymptotic regret bound remains $O(\sqrt{NT})$, independent of skewness.
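For the patient learner, only the confidence radius changes. The following one-function sketch encodes $\Delta'_{i,\tau}$ as read from Lemma 6.1, with the expected delay assumed known as in the setting above; the constants mirror the lemma statement and the function name is illustrative.

```python
import math

def pa_confidence_radius(v_hat, n_epochs, N, T, mean_delay):
    """Delta'_{i,tau} as read from Lemma 6.1:
    sqrt(48*v_hat*log(NT)/|E|) + 48*log(NT)/|E| + (E[d] + sqrt(6*E[d]*log(NT)))/|E|."""
    log_nt = math.log(N * T)
    return (math.sqrt(48 * v_hat * log_nt / n_epochs)
            + 48 * log_nt / n_epochs
            + (mean_delay + math.sqrt(6 * mean_delay * log_nt)) / n_epochs)
```

In a simulation mirroring the experiments below, one would call this with, for example, `mean_delay=100` in place of `confidence_radius` inside the DEMBA update.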
Figure 1: Simulation results of the DEMBA algorithm with benchmarks. Top row: geometric delays. Bottom row: uniform delays. Left: $E[d_s] = 500$, $\mu = 100$; middle: $E[d_s] = 100$, $\mu = 100$; right: $E[d_s] = 100$, $\mu = 500$. Results are averaged over 100 independent runs.

## 7 Experiments

We conducted two sets of experiments to evaluate the performance of our algorithms. Our benchmark is an explore-then-exploit (EXP) algorithm, which explores by offering random assortments until a pre-specified time and then offers the optimized assortment based on the collected data. We tuned the exploration duration of the EXP algorithm and used the three best-performing durations in our comparisons. We used $N = 10$, $K = 4$, and $p_i = 1$ for all $i \in \{1, \ldots, N\}$. The attraction parameters were set as $v_i = 0.25 + \epsilon$ if $i \in \{1, 2, 9, 10\}$ and $v_i = 0.25$ otherwise, where $\epsilon$ represents the contrast between products. With this setting, the optimal assortment is $\{1, 2, 9, 10\}$.

In the first set of experiments, we tested our algorithm in two delay settings: geometrically distributed and uniformly distributed delays. We set $\epsilon = 0.05$ and used three cases for each distribution with increasing $\psi_\mu$ values: $E[d_s] = 500$, $\mu = 100$; $E[d_s] = 100$, $\mu = 100$; and $E[d_s] = 100$, $\mu = 500$. The results of this experiment are shown in Figure 1. We observed that the DEMBA algorithm learns effectively and performs better than our benchmarks in all settings. Learning becomes more difficult as $\psi_\mu$ decreases due to increased censoring and information loss from thresholding. With uniform delays, learning is more challenging due to the heavy tail of the distribution. The gaps between DEMBA and the benchmarks increase with higher $\psi_\mu$, suggesting better utilization of information by DEMBA.

In our second set of experiments, we tested how the contrast parameter $\epsilon$ affects learning and how the PA-DEMBA algorithm performs. We used geometric delays with $E[d_s] = 100$ and $\mu = 100$ for the first experiment and $E[d_s] = 100$ for the second experiment. The results are shown in Figure 2. On the left-hand side, we observe that when the number of rounds is low (and thus the amount of learning is limited), lower contrast values lead to better results. As the number of rounds (and therefore the amount of learning) increases, higher contrast simplifies the learning problem, as indicated by lower regret curves. Furthermore, on the right-hand side, the PA-DEMBA algorithm demonstrates robust performance, effectively managing the challenges posed by the non-thresholded setting.

[Figure 2 plots cumulative regret over 20,000 rounds; the left panel compares contrast levels $\epsilon = 0.05, 0.1, 0.15, 0.25$.]

Figure 2: Left: performance of the DEMBA algorithm with different contrast levels. Right: PA-DEMBA and benchmarks with non-thresholded delays. Results are averaged over 100 independent runs.

Figure 3: Comparison with MNL-Bandit. Left: no delay; middle: $E[d_s] = 50$; right: $E[d_s] = 100$. Results are averaged over 100 independent runs.
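As a hedged sketch of the setup used in these experiments, the snippet below reconstructs the stated attraction parameters and reports the expected per-round revenue gap between the optimal assortment $\{1, 2, 9, 10\}$ and a suboptimal one; the value of $v_{0_d}$ is not reported in the text and is set arbitrarily here, and all variable names are illustrative.

```python
# Hypothetical reconstruction of the experimental setup described above.
N, K, eps, mu = 10, 4, 0.05, 100
p = {i: 1.0 for i in range(1, N + 1)}                 # rewards, p_i = 1 for all products
v = {i: 0.25 + (eps if i in {1, 2, 9, 10} else 0.0) for i in range(1, N + 1)}
v0d = 0.25                                            # assumed value, not given in the paper
psi_mu = 1 - (1 - 1 / 100) ** (mu + 1)                # P(d <= mu) for geometric delays with mean ~99

def expected_revenue(S):
    denom = 1.0 + v0d + sum(v[i] for i in S)
    return sum(p[i] * v[i] for i in S) / denom

gap = expected_revenue({1, 2, 9, 10}) - expected_revenue({3, 4, 5, 6})
print(f"psi_mu = {psi_mu:.3f}, per-round revenue gap = {gap:.4f}")
```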
In our third set of experiments, we compare the DEMBA algorithm and the EXP benchmarks with the MNL-Bandit algorithm of Agrawal et al. [2019]. While MNL-Bandit learns customer preferences similarly to DEMBA, it does not account for potential delays in the feedback. In this experiment, we set $\mu = 500$ and applied a geometric delay distribution with no delay, $E[d_s] = 50$, and $E[d_s] = 100$. We observed that when there is no delay, the performance of MNL-Bandit and DEMBA is almost identical. However, as the delay increases, the performance of MNL-Bandit deteriorates, clearly indicating that it fails to handle delayed feedback.

## 8 Conclusion

This work provides the first solution and analysis for delayed choice modeling and online assortment optimization. We introduced two novel algorithms, DEMBA for thresholded feedback settings and PA-DEMBA for non-thresholded settings, and demonstrated their effectiveness through theoretical guarantees and comprehensive experiments. Our algorithms address the dual challenges of unknown Multinomial Logit (MNL) parameters and delayed feedback, achieving sub-linear regret bounds.

Lastly, we discuss future work. Our lower bound suggests that the regret could be improved by taking the delay distribution into account via $\psi_\mu$; this would require learning the delay distribution itself, adding further complexity. Moreover, it could be interesting to explore scenarios where no-purchase decisions are indistinguishable from delayed purchases, such as settings where tracking no-purchases is not possible. Additionally, it would be worthwhile to consider multi-level choice settings where customer preferences and rewards are revealed to the seller in multiple stages, with delays between stages.

Acknowledgments

I would like to thank the anonymous referees for their valuable feedback, which has helped to improve this paper.

References

A. Agarwal, N. Johnson, and S. Agarwal. Choice bandits. Advances in Neural Information Processing Systems, 33:18399-18410, 2020.
S. Agrawal, V. Avadhanula, V. Goyal, and A. Zeevi. Thompson sampling for the MNL-bandit. In Conference on Learning Theory, pages 76-78. PMLR, 2017.
S. Agrawal, V. Avadhanula, V. Goyal, and A. Zeevi. MNL-bandit: A dynamic learning approach to assortment selection. Operations Research, 67(5):1453-1485, 2019.
P. Auer, N. Cesa-Bianchi, and P. Fischer. Finite-time analysis of the multiarmed bandit problem. Machine Learning, 47:235-256, 2002.
J. Blanchet, R. Xu, and Z. Zhou. Delay-adaptive learning in generalized linear contextual bandits. Mathematics of Operations Research, 49(1):326-345, 2024.
N. Cesa-Bianchi, C. Gentile, Y. Mansour, and A. Minora. Delay and cooperation in nonstochastic bandits. In Conference on Learning Theory, pages 605-622. PMLR, 2016.
N. Cesa-Bianchi, T. Cesari, R. Colomboni, C. Gentile, and Y. Mansour. Nonstochastic bandits with composite anonymous feedback. The Journal of Machine Learning Research, 23(1):12713-12736, 2022.
O. Chapelle. Modeling delayed feedback in display advertising. In Proceedings of the 20th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 1097-1105, 2014.
X. Chen and Y. Wang. A note on a tight lower bound for capacitated MNL-bandit assortment selection models. Operations Research Letters, 46(5):534-537, 2018.
X. Chen, Y. Wang, and Y. Zhou. Optimal policy for dynamic assortment planning under multinomial logit models. Mathematics of Operations Research, 46(4):1639-1657, 2021.
J. Davis, G. Gallego, and H. Topaloglu.
Assortment planning under the multinomial logit model with totally unimodular constraint structures. 2013.
K. Dong, Y. Li, Q. Zhang, and Y. Zhou. Multinomial logit bandit with low switching cost. In International Conference on Machine Learning, pages 2607-2615. PMLR, 2020.
G. E. Flaspohler, F. Orabona, J. Cohen, S. Mouatadid, M. Oprescu, P. Orenstein, and L. Mackey. Online learning with optimism and delay. In International Conference on Machine Learning, pages 3363-3373. PMLR, 2021.
M. A. Gael, C. Vernade, A. Carpentier, and M. Valko. Stochastic bandits with arm-dependent delays. In International Conference on Machine Learning, pages 3348-3356. PMLR, 2020.
A. Grover, T. Markov, P. Attia, N. Jin, N. Perkins, B. Cheong, M. Chen, Z. Yang, S. Harris, W. Chueh, et al. Best arm identification in multi-armed bandits with delayed feedback. In International Conference on Artificial Intelligence and Statistics, pages 833-842. PMLR, 2018.
B. Howson, C. Pike-Burke, and S. Filippi. Delayed feedback in generalised linear bandits revisited. In International Conference on Artificial Intelligence and Statistics, pages 6095-6119. PMLR, 2023.
P. Joulani, A. Gyorgy, and C. Szepesvári. Online learning under delayed feedback. In International Conference on Machine Learning, pages 1453-1461. PMLR, 2013.
T. Lancewicki, S. Segal, T. Koren, and Y. Mansour. Stochastic multi-armed bandits with unrestricted delay distributions. In International Conference on Machine Learning, pages 5969-5978. PMLR, 2021.
T. Lattimore and C. Szepesvári. Bandit Algorithms. Cambridge University Press, 2020.
M. Mitzenmacher and E. Upfal. Probability and Computing: Randomization and Probabilistic Techniques in Algorithms and Data Analysis. Cambridge University Press, 2017.
N. Perivier and V. Goyal. Dynamic pricing and assortment under a contextual MNL demand. Advances in Neural Information Processing Systems, 35:3461-3474, 2022.
C. Pike-Burke, S. Agrawal, C. Szepesvari, and S. Grunewalder. Bandits with delayed, aggregated anonymous feedback. In International Conference on Machine Learning, pages 4105-4113. PMLR, 2018.
P. Rusmevichientong, Z.-J. M. Shen, and D. B. Shmoys. Dynamic assortment optimization with a multinomial logit choice model and capacity constraint. Operations Research, 58(6):1666-1680, 2010.
W. Tang, C.-J. Ho, and Y. Liu. Bandit learning with delayed impact of actions. Advances in Neural Information Processing Systems, 34:26804-26817, 2021.
Y. Tang, Y. Wang, and Z. Zheng. Stochastic multi-armed bandits with strongly reward-dependent delays. In International Conference on Artificial Intelligence and Statistics, pages 3043-3051. PMLR, 2024.
T. S. Thune, N. Cesa-Bianchi, and Y. Seldin. Nonstochastic multiarmed bandits with unrestricted delays. Advances in Neural Information Processing Systems, 32, 2019.
C. Vernade, O. Cappé, and V. Perchet. Stochastic bandit models for delayed conversions. arXiv preprint arXiv:1706.09186, 2017.
C. Vernade, A. Carpentier, T. Lattimore, G. Zappella, B. Ermis, and M. Brueckner. Linear bandits with stochastic delayed feedback. In International Conference on Machine Learning, pages 9712-9721. PMLR, 2020.
Y. Wang, X. Chen, and Y. Zhou. Near-optimal policies for dynamic multinomial logit assortment selection models. Advances in Neural Information Processing Systems, 31, 2018.
Y. Yang, H. Zhong, T. Wu, B. Liu, L. Wang, and S. S. Du. A reduction-based framework for sequential decision making with delayed feedback. Advances in Neural Information Processing Systems, 36, 2024.
Z. Zhou, R. Xu, and J.
Blanchet. Learning in generalized linear contextual bandits with stochastic delays. Advances in Neural Information Processing Systems, 32, 2019. A Proof of Lemma 4.1 We begin by providing an instrumental theorem that will be useful to establish concentration results. Theorem A.1 (Theorem 5 of Agrawal et al. [2019]) Consider n i.i.d. geometric random variables X1, . . . , Xn with parameter p, i.e. for any i P(Xi = m) = (1 p)mp m = {0, 1, 2, . . . }, and let µ = E(Xi) = 1 p p 1 and X = 1 n Pn i=1 Xi. For any ζ > 0, we have P( X > (1 + ζ)µ) exp nµζ2 2(1 + ζ)(1 + µ)2 We now provide the proof of Lemma 4.1 that is crucial for our subsequent analysis. Proof of Lemma 4.1. Note that ˆvi,τ is a biased estimator and defined as ˆvi,τ = vi,τ |Eτ(i)| = Ptend τ s=1 oi,s,tend τ |Eτ(i)| , where |Eτ(i)| is the total number of epochs that product i is shown to the customer and Ptend τ s=1 oi,s,tend τ is the total observed sales of product i = 0d or is the total observed delayed no selections. Let t be the current period, i.e. t = tend τ + 1. We have |ˆvi,τ ψµvi| = Pt 1 s=1 oi,s,t 1 |Eτ(i)| ψµvi Pt µ 1 s=1 ai,s I(ds µ) + Pt 1 s=t µ ai,s I(ds t s) |Eτ(i)| ψµvi Pt 1 s=1 ai,s I(ds µ) + Pt 1 s=t µ ai,s I(ds t s) I(ds µ) |Eτ(i)| ψµvi Pt 1 s=1 ai,s I(ds µ) |Eτ(i)| ψµvi Pt 1 s=t µ ai,s 1 I(ds µ) where the second equality follows from the decomposition of old observations before the threshold and recent observations, the third equality is rearrangement of the values and the last inequality follows from triangle inequality. For the first element of (5), we have Pt 1 s=1 ai,s I(ds µ) |Eτ(i)| ψµvi Pt 1 s=1 ai,s I(ds µ) |Eτ(i)| ψµ Pt 1 s=1 ai,s |Eτ(i)| + ψµ Pt 1 s=1 ai,s |Eτ(i)| ψµvi Pt 1 s=1 ai,s I(ds µ) |Eτ(i)| ψµ Pt 1 s=1 ai,s |Eτ(i)| ψµ Pt 1 s=1 ai,s |Eτ(i)| ψµvi Pt 1 s=1 I(ds µ) Pt 1 s=1 ai,s |Eτ(i)| vi |Eτ(i)| , (7) by Hoeffding s inequality. For part (b), we make use of the Chernoff bound from Theorem A.1 in two cases according to ζ = (vi + 1) q vi|Eτ (i)| . 2: We have from Theorem A.1 Pt 1 s=1 ai,s |Eτ(i)| vi > ζvi Pt 1 s=1 ai,s |Eτ(i)| vi < ζvi Pt 1 s=1 ai,s |Eτ(i)| vi 6vi log(NT) Therefore, we have Pt 1 s=1 ai,s |Eτ(i)| vi 24vi log(NT) 2 N 2T 2 . (8) We shall generalize the result at 8 in two cases for substituting vi with ˆvi,τ in the concentration radius and upper bounding the new bound utilizing vi. For ˆvi,τ, we have P(ˆvi,τ vi < vi 1 2) 1 N 2T 2 , P(2ˆvi,τ vi) 1 N 2T 2 . Combining this with 8, we get Pt 1 s=1 ai,s |Eτ(i)| vi 48ˆvi,τ log(NT) 3 N 2T 2 . (9) For upper bounding 9 using vi, we have P(ˆvi,τ vi > vi 1 2) 1 N 2T 2 , 2 ˆvi,τ) 1 N 2T 2 . Therefore, we conclude Pt 1 s=1 ai,s |Eτ(i)| vi 72vi log(NT) 3 N 2T 2 . (10) Case 2 ζ > 1 2: We have 2ζ2 1 2. Let ζ = 2ζ2. We have by Theorem A.1 Pt 1 s=1 ai,s |Eτ(i)| vi exp |Eτ(i)|viζ 2 2(1 + ζ )(1 + vi)2 exp |Eτ(i)|viζ substituting the value of ζ , we get Pt 1 s=1 ai,s |Eτ(i)| vi > 48 log(NT) 1 N 2T 2 . (11) Combining 9 and 11, and applying union bound we have Pt 1 s=1 ai,s |Eτ(i)| vi 48ˆvi,τ log(NT) |Eτ(i)| + 48 log(NT) Additionally, combining 9, 10 and 11, and applying union bound we have Pt 1 s=1 ai,s |Eτ(i)| vi 72vi log(NT) |Eτ(i)| + 48 log(NT) Pt 1 s=1 ai,s |Eτ(i)| vi 48ˆvi,τ log(NT) |Eτ(i)| + 48 log(NT) 1 4 N 2T . (13) We can bound the second element of (1) as Pt 1 s=t µ ai,s 1 I(ds µ) µ |Eτ(i)|. (14) Combining (7), (12) and (14) gives the result. B Proof of Theorem 5.1 We begin with establishing a key result that will be instrumental in our main proof. Lemma B.1 Let S represent the optimal assortment when the MNL parameters are given by v. 
Also, let Sτ denote the assortment applied by the policy at epoch τ. For any parameter set w distinct from v the following hold. i. If vi wi for all i 1, . . . , N, and v0d w0d, then R(S ; w) R(S ; v). ii. If vi wi for all i 1, . . . , N, and v0d = w0d, then R(Sτ; w) R(Sτ; v) P i Se(wi vi) 1+v0d+P Proof of Lemma B.1. Note that the proof of Part i. mostly resembles Lemma A.3 in Agrawal et al. [2019], and we write the whole proof here for being self-contained. Proof of Part i. Let vj be identical to v except for the jth element, which is increased to wj. We aim to show that for any j S , if vj is increased to wj, then R(S ; vj) R(S ; v). This suffices to prove R(S ; w) R(S ; v). Define T := P i S \j rivi and V := 1 + P i S \j vi. If there exists a j S such that rj < R(S ), removing product j from the assortment yields higher expected revenue, contradicting the optimality of S . Therefore, we have rj R(S ; v) i S rivi 1 + P i S vi i S \j rivi + rjvj 1 + P i S \j vi + vj for all j S . Rearranging terms, we get rj V T. (15) To have R(S ; vj) R(S ; v), we need to show that V + wj T + rjvj which is equivalent to Tvj + rj V wj Twj + rj V vj. Rearranging terms, we get rj V (wj vj) T(wj vj), which holds thanks to (15). Moreover, the case for i = 0d holds trivially which concludes our proof. Proof of Part ii. We have R(Sτ; w) R(Sτ; v) = X i Se ri wi 1 + v0d + P i Se ri vi 1 + v0d + P i Se ri wi 1 + v0d + P i Se ri vi 1 + v0d + P i Se ri(wi vi) 1 + v0d + P i Se(wi vi) 1 + v0d + P where the second inequality follows from ri 1, i [N]. Now, we provide the proof of Theorem 5.1. Proof of Theorem 5.1. We have R(Sτ; vτ) R(S ; vτ) R(S ; ψµv), (16) where the first inequality holds thanks to the definition of Sτ and the second inequality follows from Lemma B.1 with probability at least 1 O(N 1T 1). Reg(T, π) = E τ=1 |Hτ| (R(S ; ψµv) R(Sτ; ψµv)) where Hτ is the duration of epoch τ and |Hτ| Geom( 1 1+ψµv0d+P j Sτ ψµvj ), then E[|Hτ| | Sτ] = 1 + ψµv0d + P j Sτ ψµvj. Since Sτ is determined by the history of policy(Fτ 1), we have E[|Hτ| | Fτ 1] = 1 + ψµv0d + P j Sτ ψµvj. Therefore the regret is equal to Reg(T, π) = E τ=1 E [|Hτ| (R(S ; ψµv) R(Sτ; ψµv)) | Fτ 1] and by the law of total expectations Reg(T, π) = E 1 + ψµv0d + X (R(S ; ψµv) R(Sτ; ψµv)) Let the epoch based regret be Rτ, we can write it as 1 + ψµv0d + X (R(S ; ψµv) R(Sτ; ψµv)) , then the regret can be written in the form of Reg(T, π) = E We define event E, for all τ as i=1 {|vi,τ ψµvi| > τ} |ψµv0d v0d,τ| > τ . The event Eτ is a low probability event. We will analyze the regret according to the event separately: E[ Rτ] = E RτI(Eτ 1) + RτI(EC τ 1) . We have Rτ N + 1 and hence E[ Rτ] (N + 1)P(Eτ 1) + E RτI(EC τ 1) . We have by (16) that E[ Rτ] (N + 1)P(Eτ 1) + E 1 + ψµv0d + X (Rτ(Sτ; vτ) R(Sτ; ψµv)) I(EC τ 1) Then, by Lemma B.1 we have 1 + ψµv0d + X R(Sτ; vτ) R(Sτ; ψµv) I(EC τ 1) 1 + ψµv0d + X R(Sτ; vτ) R(Sτ; ψµv) 1 + ψµv0d + X i Sτ ri vi,τ 1 + v0d + P j Sτ vj i Sτ ri vi,τ 1 + ψµv0d + P j Sτ vj i Sτ ri vi,τ 1 + ψµv0d + P j Sτ vj i Sτ riψµvi 1 + ψµv0d + P j Sτ ψµvj 1 + ψµv0d + X i Sτ ri vi,τ(ψµv0d v0d) 1 + v0d + P j Sτ vj 1 + ψµv0d + P i Sτ ri( vi,τ ψµvi) 1 + ψµv0d + P 1 + ψµv0d + X K(ψµv0d v0d) 1 + v0d + P j Sτ vj 1 + ψµv0d + P i Sτ ( vi,τ ψµvi) 1 + ψµv0d + P j Sτ vj 1 + ψµv0d + X K(ψµv0d v0d) 1 + ψµv0d + P j Sτ ψµvj + i Sτ ( vi,τ ψµvi) 1 + ψµv0d + P j Sτ ψµvj = K(ψµv0d v0d) + X i Sτ ( vi,τ ψµvi,τ). Let Ei be the number of periods that the product i is offered. 
Then, the regret can be written as Reg(T, π) E (N + 1)P(Eτ 1) + K 0d,τ + X (72v0d + 1) log(NT) τ + 48K(log(NT) + µ) (72vi + 1) log(NT) |Eτ(i)| + 48 log(NT) + µ log(T) + K p 73T log(NT) + 48K log(T)(log(NT) + µ) + 73E vi Ei log(NT) + 48 log2(NT) + µ log(T) log(T) + K p 73T log(NT) + 48K log(T)(log(NT) + µ) + 73 vi E[Ei] log(NT) + 48 log2(NT) + µ log(T), where the third inequality holds due to τ T, Ei T and the fourth inequality holds due to Jensen s inequality. Since, PN i=1 vi E[Ei] T, we conclude Reg(T, π) log(T) + K p 73T log(NT) + 48K log(T)(log(NT) + µ) NT log(NT) + 48 log2(NT) + µ log(T). C Proof of Theorem 5.2 Before giving the proof of Theorem 5.2, we first give an instrumental Lemma in the main proof and its proof. Lemma C.1 Let P and Q be categorical distributions with PP (X = i) = pi and PQ(X = i) = qi where pi = qi + ϵi for all i = 1, . . . , I. Also, let the selected option be observed with bias ψµ. The Kullback-Leibler divergence between P and Q, denoted by DKL(P Q), is given by ψµϵ2 i qi . Proof of Lemma C.1. We have i=0 ψµ(qi + ϵi) log ψµ(qi + ϵi) i=0 ψµ(qi + ϵi)ϵi ψµϵ2 i qi , where the inequality follows since log(1 + x) x for any x > 1 and the last equality holds since PI i=0 ϵi = 0. Proof of Theorem 5.2. We set pi = 1 for all i [N]. Let S be a subset of N with |S| = K. For attraction parameters, we set v0 = 1, v0d arbitrary in [0, 1], vi = (1 + ϵ)/K for i S and vi = 1/K otherwise. We denote all subsets of product set [N] of size K as SK. We have max S SK R(S) = R(S ) = 1 + ϵ 2 + v0d + ϵ. We employ the neighboring argument for assortment sets as described in Chen and Wang [2018]. For any arbitrary assortment St, we have R(St) = 1 + (1 δ)ϵ 2 + v0d + (1 δ)ϵ, where δ is a measure of discrepancy from the optimal set S, i.e. δ := 1 |St S| Then, we have R(S ) R(St) = 1 + ϵ 2 + v0d + ϵ 1 + (1 δ)ϵ 2 + v0d + (1 δ)ϵ = δϵ + δϵv0d (2 + v0d + ϵ)(2 + v0d + (1 δ)ϵ) δϵ (1 + v0d) For any arbitrary subset St of |St| K, we can find St SK such that St St. We define Ni := PT t=1 1(i St). We have t=1 ES [R(S) R(St)] (17) t=1 ES h R(S) R( St) i (18) t=1 ES h R(S) R( St) i (19) i/ S ES h Ni i ϵ Let S(i) K 1 be all subsets of [N] of size K 1 that do not include i. We have i S ES1 S 2 h Ni i = 1 K|SK| S SK:i S ES h Ni i ES {i} h Ni i . (22) We define probability measures P and Q through parameterizations S and S {i} respectively. We denote Total Variation(TV) distance between two probability measures as TV (P, Q) := sup A |P(A) Q(A)| and Kullback-Leibler (KL) divergence as DKL(P Q). We have 0 Ni T |EP [ Ni] EQ[ Ni]| j=0 j P( Ni = j) Q( Ni = j) P( Ni = j) Q( Ni = j) T TV (P, Q) 1 2DKL(P Q), (23) where the last inequality follows from Pinsker s Inequality. We use (23) in (22) and we get i S ES h Ni i ES h Ni i + T 1 2DKL(PS PS {i}) For the first element of (24), we have ES h Ni i = 1 K|SK| i/ S ES h Ni i (25) i=1 ES h Ni i (26) K|SK| TK (27) = TK N K + 1 (28) where the last inequality follows from K N/4. For the second element of (24), we have that 1 2DKL(PS PS {i}) (30) i=1 max S S(i) K 1 1 2DKL(PS PS {i}) (31) = T N K + 1 i=1 max S S(i) K 1 1 2DKL(PS PS {i}) (32) max S SK 1 T 1 2(N K + 1) i/ S DKL(PS PS {i}) (33) where the first inequality is due to Hölder s inequality and the second inequality follows from Jansen s inequality. We have DKL(PS ( |St) PS {i}( |St)) = 0 for i / St. 
For analyzing i St, we define K = |St| K, J := |St S | K 1 and a = 1 + v0d + K |p0 q0| = 1 a + Jϵ/K 1 a + (J + 1)ϵ/K = ϵ K(a2 + a Jϵ/K + a(J + 1)ϵ/K + J(J + 1)ϵ2/K2) |p0d q0d| = v0d 1 a + Jϵ/K 1 a + (J + 1)ϵ/K = v0dϵ K(a2 + a Jϵ/K + a(J + 1)ϵ/K + J(J + 1)ϵ2/K2) |pj qj| 1 + ϵ 1 a + Jϵ/K 1 a + (J + 1)ϵ/K K2(a2 + a Jϵ/K + a(J + 1)ϵ/K + J(J + 1)ϵ2/K2) K2 for 1 j N and j = i; 1 a + Jϵ/K 1 a + (J + 1)ϵ/K K 1 a + Jϵ/K We also have q0 = 1 a + (J + 1)ϵ/K 1 3 + ϵ 1 3 + 1/2 1 qj = 1 + ϵ K(a + (J + 1)ϵ/K) 1 + ϵ Using these with Lemma C.1, we get DKL(PS ( |St) PS {i}( |St)) ψµ K2 + 4v2 0dϵ2 K2 + 4K J 4ϵ2 K4 + 4K 4ϵ2 K2 + 16ϵ2 1 K2 + 16ϵ2 1 K 40ψµϵ2 1 K . Integrating this into (33), we obtain max S SK 1 T 1 2(N K + 1) i/ S DKL(PS PS {i}) (34) max S SK 1 T 1 2(N K + 1) i/ S ES [ Ni]40ψµϵ2 v u u t 1 2(N K + 1) 40ψµϵ2 i=1 ES [ Ni] (36) 1 2(N K + 1) 40ψµϵ2 For the final bound, we have t=1 ES [R(S) R(St)] 1 2(N K + 1) 40ψµϵ2 where (38) follows from (21), (39) follows from (29) and (37). Setting ϵ1 = min{ p N/ψµT, 0.5} gives the final bound. D Proof of Theorem 6.2 We first provide an instrumental theorem that we will use to establish concentration results. Theorem D.1 (Theorem 4.4 of Mitzenmacher and Upfal [2017]) Let X1, . . . , Xn be independent Poisson trials such that P(Xi = 1) = pi. Let X = Pn i=1 Xi. Then, for 0 < δ < 1: P(X (1 + δ)E[X]) exp( E[X]δ2/3). Proof of Lemma 6.1. |ˆvi,τ vi| = Pt 1 s=1 oi,s,t 1 Pt 1 s=1 ai,s I(ds t s) Pt 1 s=1 ai,s Pt 1 s=1 ai,s I(ds > t s) |Ei(τ)| vi Pt 1 s=1 ai,s |Ei(τ)| vi Pt 1 s=1 I(ds > t s) The derivation of the bound for the first element of (41) parallels that of Lemma 4.1, thus, we omit the analysis, yielding Pt 1 s=1 ai,s |Ei(τ)| vi 48ˆvi,τ log(NT) |Ei(τ)| + 48 log(NT) |Ei(τ)| , (42) with probability at least 1 4 N2T . For the second element of (41), we first bound the expected number of unobserved feedback until time t. We have s=1 I(ds > t s) s=1 P(ds > t s) = n=1 P(dt n > n) n=0 P(d1 > n) = m=n+1 P(d1 = m) n=0 P(d1 = m) = m=0 m P(d1 = m) = E[ds]. (43) Then, we have Pt 1 s=1 I(ds > t s) = Pt 1 s=1 I(ds > t s) |Ei(τ)| (44) (1 + δ)E h Pt 1 s=1 I(ds > t s) i |Ei(τ)| (45) (1 + δ)E [ds] |Ei(τ)| (46) 6E [ds] log(NT) |Ei(τ)| (47) where the first inequality follows by Theorem D.1 with probability at least 1 1 N2T 2 by setting E[ds] and the second inequality follows from 43. Combining 42 and 47 gives the result. Proof of Theorem 6.2. We will proceed similarly to the proof of Theorem 5.1, but we provide a complete proof here for clarity and completeness. R(Sτ; vτ) R(S ; vτ) R(S ; v), (48) where the first inequality holds thanks to the definition of Sτ and the second inequality follows from Lemma B.1 with probability at least 1 O(N 1T 1). Similar to the proof of Theorem 5.1, we can write Reg(T, π) = E 1 + v0d + X (R(S ; v) R(Sτ; v)) And defining the epoch based regret be Rτ, we can write it as 1 + v0d + X (R(S ; v) R(Sτ; v)) , which will lead to Reg(T, π) = E We define event E, for all τ as i=1 {|vi,τ vi| > τ} |v0d v0d,τ| > τ . 
Conditioning on the event and by (48), we can write E[ Rτ] (N + 1)P(Eτ 1) + E 1 + v0d + X (Rτ(Sτ; vτ) R(Sτ; v)) I(EC τ 1) Then, by Lemma B.1 we have 1 + v0d + X R(Sτ; vτ) R(Sτ; v) I(EC τ 1) 1 + v0d + X R(Sτ; vτ) R(Sτ; v) 1 + v0d + X i Sτ ri vi,τ 1 + v0d + P j Sτ vj i Sτ ri vi,τ 1 + v0d + P j Sτ vj i Sτ ri vi,τ 1 + v0d + P j Sτ vj i Sτ rivi 1 + v0d + P j Sτ vj 1 + v0d + X i Sτ ri vi,τ(v0d v0d) 1 + v0d + P j Sτ vj 1 + v0d + P i Sτ ri( vi,τ vi) 1 + v0d + P 1 + v0d + X K(v0d v0d) 1 + v0d + P j Sτ vj 1 + v0d + P i Sτ ( vi,τ vi) 1 + v0d + P j Sτ vj 1 + v0d + X K(v0d v0d) 1 + v0d + P j Sτ vj + i Sτ ( vi,τ vi) 1 + v0d + P j Sτ vj = K(v0d v0d) + X i Sτ ( vi,τ vi,τ). Let Ei be the number of periods that the product i is offered. Then, the regret can be written as Reg(T, π) E (N + 1)P(Eτ 1) + K 0d,τ + X 48ˆvi,τ log(NT) |Eτ(i)| + 48K log(NT) |Eτ(i)| + KE [ds] 6E [ds] log(NT) 48ˆvi,τ log(NT) |Eτ(i)| + 48 log(NT) |Eτ(i)| + E [ds] 6E [ds] log(NT) log(T) + K p 48T log(NT) + K(48 + E [ds] + p 6E [ds]) log2(NT) vi Ei log(NT) + (48 + E[ds] + p 6E [ds]) log2(NT) log(T) + K p 48T log(NT) + (K + 1)(48 + E [ds] + p 6E [ds]) log2(NT) vi E[Ei] log(NT), where the third inequality holds due to τ T, Ei T and the fourth inequality holds due to Jensen s inequality. Since, PN i=1 vi E[Ei] T, we conclude Reg(T, π) log(T) + K p 48T log(NT) + (K + 1)(48 + E [ds] + p 6E [ds]) log2(NT) NT log(NT). E Experimental Details We numerically evaluate our algorithms over 100 independently generated problem instances, with error bars representing standard errors in both Figure 1 and Figure 2. The simulations were conducted on a server equipped with 4 Intel Xeon 6248 2.5GHz CPUs and 377 GB of RAM, running Cent OS 7. The simulation code was developed in Python version 3.9.6. Neur IPS Paper Checklist Question: Do the main claims made in the abstract and introduction accurately reflect the paper s contributions and scope? Answer: [Yes] Justification: [NA] Guidelines: The answer NA means that the abstract and introduction do not include the claims made in the paper. The abstract and/or introduction should clearly state the claims made, including the contributions made in the paper and important assumptions and limitations. A No or NA answer to this question will not be perceived well by the reviewers. The claims made should match theoretical and experimental results, and reflect how much the results can be expected to generalize to other settings. It is fine to include aspirational goals as motivation as long as it is clear that these goals are not attained by the paper. 2. Limitations Question: Does the paper discuss the limitations of the work performed by the authors? Answer: [Yes] Justification: [NA] Guidelines: The answer NA means that the paper has no limitation while the answer No means that the paper has limitations, but those are not discussed in the paper. The authors are encouraged to create a separate "Limitations" section in their paper. The paper should point out any strong assumptions and how robust the results are to violations of these assumptions (e.g., independence assumptions, noiseless settings, model well-specification, asymptotic approximations only holding locally). The authors should reflect on how these assumptions might be violated in practice and what the implications would be. The authors should reflect on the scope of the claims made, e.g., if the approach was only tested on a few datasets or with a few runs. In general, empirical results often depend on implicit assumptions, which should be articulated. 
The authors should reflect on the factors that influence the performance of the approach. For example, a facial recognition algorithm may perform poorly when image resolution is low or images are taken in low lighting. Or a speech-to-text system might not be used reliably to provide closed captions for online lectures because it fails to handle technical jargon. The authors should discuss the computational efficiency of the proposed algorithms and how they scale with dataset size. If applicable, the authors should discuss possible limitations of their approach to address problems of privacy and fairness. While the authors might fear that complete honesty about limitations might be used by reviewers as grounds for rejection, a worse outcome might be that reviewers discover limitations that aren t acknowledged in the paper. The authors should use their best judgment and recognize that individual actions in favor of transparency play an important role in developing norms that preserve the integrity of the community. Reviewers will be specifically instructed to not penalize honesty concerning limitations. 3. Theory Assumptions and Proofs Question: For each theoretical result, does the paper provide the full set of assumptions and a complete (and correct) proof? Answer: [Yes] Justification: [NA] Guidelines: The answer NA means that the paper does not include theoretical results. All the theorems, formulas, and proofs in the paper should be numbered and crossreferenced. All assumptions should be clearly stated or referenced in the statement of any theorems. The proofs can either appear in the main paper or the supplemental material, but if they appear in the supplemental material, the authors are encouraged to provide a short proof sketch to provide intuition. Inversely, any informal proof provided in the core of the paper should be complemented by formal proofs provided in appendix or supplemental material. Theorems and Lemmas that the proof relies upon should be properly referenced. 4. Experimental Result Reproducibility Question: Does the paper fully disclose all the information needed to reproduce the main experimental results of the paper to the extent that it affects the main claims and/or conclusions of the paper (regardless of whether the code and data are provided or not)? Answer: [Yes] Justification: The scripts for all experiments is provided. Guidelines: The answer NA means that the paper does not include experiments. If the paper includes experiments, a No answer to this question will not be perceived well by the reviewers: Making the paper reproducible is important, regardless of whether the code and data are provided or not. If the contribution is a dataset and/or model, the authors should describe the steps taken to make their results reproducible or verifiable. Depending on the contribution, reproducibility can be accomplished in various ways. For example, if the contribution is a novel architecture, describing the architecture fully might suffice, or if the contribution is a specific model and empirical evaluation, it may be necessary to either make it possible for others to replicate the model with the same dataset, or provide access to the model. In general. 
releasing code and data is often one good way to accomplish this, but reproducibility can also be provided via detailed instructions for how to replicate the results, access to a hosted model (e.g., in the case of a large language model), releasing of a model checkpoint, or other means that are appropriate to the research performed. While Neur IPS does not require releasing code, the conference does require all submissions to provide some reasonable avenue for reproducibility, which may depend on the nature of the contribution. For example (a) If the contribution is primarily a new algorithm, the paper should make it clear how to reproduce that algorithm. (b) If the contribution is primarily a new model architecture, the paper should describe the architecture clearly and fully. (c) If the contribution is a new model (e.g., a large language model), then there should either be a way to access this model for reproducing the results or a way to reproduce the model (e.g., with an open-source dataset or instructions for how to construct the dataset). (d) We recognize that reproducibility may be tricky in some cases, in which case authors are welcome to describe the particular way they provide for reproducibility. In the case of closed-source models, it may be that access to the model is limited in some way (e.g., to registered users), but it should be possible for other researchers to have some path to reproducing or verifying the results. 5. Open access to data and code Question: Does the paper provide open access to the data and code, with sufficient instructions to faithfully reproduce the main experimental results, as described in supplemental material? Answer: [Yes] Justification: The necessary information is provided in the Experiments section and in the appendix as well as the scripts for experiments are provided. Guidelines: The answer NA means that paper does not include experiments requiring code. Please see the Neur IPS code and data submission guidelines (https://nips.cc/ public/guides/Code Submission Policy) for more details. While we encourage the release of code and data, we understand that this might not be possible, so No is an acceptable answer. Papers cannot be rejected simply for not including code, unless this is central to the contribution (e.g., for a new open-source benchmark). The instructions should contain the exact command and environment needed to run to reproduce the results. See the Neur IPS code and data submission guidelines (https: //nips.cc/public/guides/Code Submission Policy) for more details. The authors should provide instructions on data access and preparation, including how to access the raw data, preprocessed data, intermediate data, and generated data, etc. The authors should provide scripts to reproduce all experimental results for the new proposed method and baselines. If only a subset of experiments are reproducible, they should state which ones are omitted from the script and why. At submission time, to preserve anonymity, the authors should release anonymized versions (if applicable). Providing as much information as possible in supplemental material (appended to the paper) is recommended, but including URLs to data and code is permitted. 6. Experimental Setting/Details Question: Does the paper specify all the training and test details (e.g., data splits, hyperparameters, how they were chosen, type of optimizer, etc.) necessary to understand the results? 
Answer: [Yes]
Justification: [NA]
Guidelines:
- The answer NA means that the paper does not include experiments.
- The experimental setting should be presented in the core of the paper to a level of detail that is necessary to appreciate the results and make sense of them.
- The full details can be provided either with the code, in appendix, or as supplemental material.

7. Experiment Statistical Significance
Question: Does the paper report error bars suitably and correctly defined or other appropriate information about the statistical significance of the experiments?
Answer: [Yes]
Justification: Error bars are reported and represent standard errors (an illustrative sketch of this computation is given after the checklist below).
Guidelines:
- The answer NA means that the paper does not include experiments.
- The authors should answer "Yes" if the results are accompanied by error bars, confidence intervals, or statistical significance tests, at least for the experiments that support the main claims of the paper.
- The factors of variability that the error bars are capturing should be clearly stated (for example, train/test split, initialization, random drawing of some parameter, or overall run with given experimental conditions).
- The method for calculating the error bars should be explained (closed form formula, call to a library function, bootstrap, etc.).
- The assumptions made should be given (e.g., Normally distributed errors).
- It should be clear whether the error bar is the standard deviation or the standard error of the mean.
- It is OK to report 1-sigma error bars, but one should state it. The authors should preferably report a 2-sigma error bar rather than state that they have a 96% CI, if the hypothesis of Normality of errors is not verified.
- For asymmetric distributions, the authors should be careful not to show in tables or figures symmetric error bars that would yield results that are out of range (e.g., negative error rates).
- If error bars are reported in tables or plots, the authors should explain in the text how they were calculated and reference the corresponding figures or tables in the text.

8. Experiments Compute Resources
Question: For each experiment, does the paper provide sufficient information on the computer resources (type of compute workers, memory, time of execution) needed to reproduce the experiments?
Answer: [Yes]
Justification: Information is provided in the appendix.
Guidelines:
- The answer NA means that the paper does not include experiments.
- The paper should indicate the type of compute workers (CPU or GPU, internal cluster, or cloud provider), including relevant memory and storage.
- The paper should provide the amount of compute required for each of the individual experimental runs as well as estimate the total compute.
- The paper should disclose whether the full research project required more compute than the experiments reported in the paper (e.g., preliminary or failed experiments that didn't make it into the paper).

9. Code of Ethics
Question: Does the research conducted in the paper conform, in every respect, with the NeurIPS Code of Ethics https://neurips.cc/public/EthicsGuidelines?
Answer: [Yes]
Justification: [NA]
Guidelines:
- The answer NA means that the authors have not reviewed the NeurIPS Code of Ethics.
- If the authors answer No, they should explain the special circumstances that require a deviation from the Code of Ethics.
- The authors should make sure to preserve anonymity (e.g., if there is a special consideration due to laws or regulations in their jurisdiction).
10. Broader Impacts
Question: Does the paper discuss both potential positive societal impacts and negative societal impacts of the work performed?
Answer: [NA]
Justification: There is no societal impact of the work performed. The main application of the presented work is theoretical sequential decision making.
Guidelines:
- The answer NA means that there is no societal impact of the work performed.
- If the authors answer NA or No, they should explain why their work has no societal impact or why the paper does not address societal impact.
- Examples of negative societal impacts include potential malicious or unintended uses (e.g., disinformation, generating fake profiles, surveillance), fairness considerations (e.g., deployment of technologies that could make decisions that unfairly impact specific groups), privacy considerations, and security considerations.
- The conference expects that many papers will be foundational research and not tied to particular applications, let alone deployments. However, if there is a direct path to any negative applications, the authors should point it out. For example, it is legitimate to point out that an improvement in the quality of generative models could be used to generate deepfakes for disinformation. On the other hand, it is not needed to point out that a generic algorithm for optimizing neural networks could enable people to train models that generate deepfakes faster.
- The authors should consider possible harms that could arise when the technology is being used as intended and functioning correctly, harms that could arise when the technology is being used as intended but gives incorrect results, and harms following from (intentional or unintentional) misuse of the technology.
- If there are negative societal impacts, the authors could also discuss possible mitigation strategies (e.g., gated release of models, providing defenses in addition to attacks, mechanisms for monitoring misuse, mechanisms to monitor how a system learns from feedback over time, improving the efficiency and accessibility of ML).

11. Safeguards
Question: Does the paper describe safeguards that have been put in place for responsible release of data or models that have a high risk for misuse (e.g., pretrained language models, image generators, or scraped datasets)?
Answer: [NA]
Justification: The paper poses no such risks.
Guidelines:
- The answer NA means that the paper poses no such risks.
- Released models that have a high risk for misuse or dual-use should be released with necessary safeguards to allow for controlled use of the model, for example by requiring that users adhere to usage guidelines or restrictions to access the model or implementing safety filters.
- Datasets that have been scraped from the Internet could pose safety risks. The authors should describe how they avoided releasing unsafe images.
- We recognize that providing effective safeguards is challenging, and many papers do not require this, but we encourage authors to take this into account and make a best faith effort.

12. Licenses for existing assets
Question: Are the creators or original owners of assets (e.g., code, data, models), used in the paper, properly credited and are the license and terms of use explicitly mentioned and properly respected?
Answer: [NA]
Justification: The paper does not use existing assets.
Guidelines:
- The answer NA means that the paper does not use existing assets.
- The authors should cite the original paper that produced the code package or dataset.
- The authors should state which version of the asset is used and, if possible, include a URL.
- The name of the license (e.g., CC-BY 4.0) should be included for each asset.
- For scraped data from a particular source (e.g., website), the copyright and terms of service of that source should be provided.
- If assets are released, the license, copyright information, and terms of use in the package should be provided. For popular datasets, paperswithcode.com/datasets has curated licenses for some datasets. Their licensing guide can help determine the license of a dataset.
- For existing datasets that are re-packaged, both the original license and the license of the derived asset (if it has changed) should be provided.
- If this information is not available online, the authors are encouraged to reach out to the asset's creators.

13. New Assets
Question: Are new assets introduced in the paper well documented and is the documentation provided alongside the assets?
Answer: [NA]
Justification: The paper does not release new assets.
Guidelines:
- The answer NA means that the paper does not release new assets.
- Researchers should communicate the details of the dataset/code/model as part of their submissions via structured templates. This includes details about training, license, limitations, etc.
- The paper should discuss whether and how consent was obtained from people whose asset is used.
- At submission time, remember to anonymize your assets (if applicable). You can either create an anonymized URL or include an anonymized zip file.

14. Crowdsourcing and Research with Human Subjects
Question: For crowdsourcing experiments and research with human subjects, does the paper include the full text of instructions given to participants and screenshots, if applicable, as well as details about compensation (if any)?
Answer: [NA]
Justification: The paper does not involve crowdsourcing nor research with human subjects.
Guidelines:
- The answer NA means that the paper does not involve crowdsourcing nor research with human subjects.
- Including this information in the supplemental material is fine, but if the main contribution of the paper involves human subjects, then as much detail as possible should be included in the main paper.
- According to the NeurIPS Code of Ethics, workers involved in data collection, curation, or other labor should be paid at least the minimum wage in the country of the data collector.

15. Institutional Review Board (IRB) Approvals or Equivalent for Research with Human Subjects
Question: Does the paper describe potential risks incurred by study participants, whether such risks were disclosed to the subjects, and whether Institutional Review Board (IRB) approvals (or an equivalent approval/review based on the requirements of your country or institution) were obtained?
Answer: [NA]
Justification: The paper does not involve crowdsourcing nor research with human subjects.
Guidelines:
- The answer NA means that the paper does not involve crowdsourcing nor research with human subjects.
- Depending on the country in which research is conducted, IRB approval (or equivalent) may be required for any human subjects research. If you obtained IRB approval, you should clearly state this in the paper.
- We recognize that the procedures for this may vary significantly between institutions and locations, and we expect authors to adhere to the NeurIPS Code of Ethics and the guidelines for their institution.
- For initial submissions, do not include any information that would break anonymity (if applicable), such as the institution conducting the review.
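As referenced in item 7, the following is a minimal, hypothetical sketch of how standard errors over independent experiment runs can be computed. The array shapes, variable names, and synthetic data below are assumptions made purely for illustration; they are not the paper's actual experiment scripts.

```python
import numpy as np

# Hypothetical illustration: summarizing per-round cumulative regret from
# several independent simulation runs with a mean curve and the standard
# error of the mean (SEM), i.e., error bars that "represent standard errors".

rng = np.random.default_rng(0)
num_runs, horizon = 20, 1000

# regret[i, t] = cumulative regret of run i at round t (synthetic placeholder data)
regret = np.cumsum(rng.uniform(0.0, 1.0, size=(num_runs, horizon)), axis=1)

mean_regret = regret.mean(axis=0)                             # average over runs
sem_regret = regret.std(axis=0, ddof=1) / np.sqrt(num_runs)   # standard error of the mean

# 1-sigma error bars would then be drawn as mean_regret +/- sem_regret.
print(mean_regret[-1], sem_regret[-1])
```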