# Nonstochastic Multiarmed Bandits with Unrestricted Delays

Tobias Sommer Thune, University of Copenhagen, Copenhagen, Denmark, tobias.thune@di.ku.dk
Nicolò Cesa-Bianchi, DSRC & Univ. degli Studi di Milano, Milan, Italy, nicolo.cesa-bianchi@unimi.it
Yevgeny Seldin, University of Copenhagen, Copenhagen, Denmark, seldin@di.ku.dk

Part of this work was done while visiting Università degli Studi di Milano, Milan, Italy.
33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.

We investigate multiarmed bandits with delayed feedback, where the delays need neither be identical nor bounded. We first prove that "delayed" Exp3 achieves the $O\big(\sqrt{(KT + D)\ln K}\big)$ regret bound conjectured by Cesa-Bianchi et al. [2019] in the case of variable, but bounded delays. Here, $K$ is the number of actions and $D$ is the total delay over $T$ rounds. We then introduce a new algorithm that lifts the requirement of bounded delays by using a wrapper that skips rounds with excessively large delays. The new algorithm maintains the same regret bound, but, like its predecessor, requires prior knowledge of $D$ and $T$. For this algorithm we then construct a novel doubling scheme that forgoes the prior knowledge requirement under the assumption that the delays are available at action time (rather than at loss observation time). This assumption is satisfied in a broad range of applications, including interaction with servers and service providers. The resulting oracle regret bound is of order $\min_\beta\big(|S_\beta| + \beta\ln K + (KT + D_\beta)/\beta\big)$, where $|S_\beta|$ is the number of observations with delay exceeding $\beta$, and $D_\beta$ is the total delay of observations with delay below $\beta$. The bound relaxes to $O\big(\sqrt{(KT + D)\ln K}\big)$, but we also provide examples where $D_\beta \ll D$ and the oracle bound has a polynomially better dependence on the problem parameters.

1 Introduction

Multiarmed bandits is an algorithmic paradigm for sequential decision making with a growing range of industrial applications, including content recommendation, computational advertising, and many more. In the multiarmed bandit framework an algorithm repeatedly takes actions (e.g., recommendation of content to a user) and observes outcomes of these actions (e.g., whether the user engaged with the content), whereas the outcome of alternative actions (e.g., alternative content that could have been recommended) remains unobserved. In many real-life situations the algorithm experiences a delay between the execution of an action and the observation of its outcome. Within the delay period the algorithm may be forced to make a series of other actions (e.g., interact with new users) before observing the outcomes of all the previous actions. This setup falls outside of the classical multiarmed bandit paradigm, where observations happen instantaneously after the actions, and motivates the study of bandit algorithms that are provably robust in the presence of delays.

We focus on the nonstochastic (a.k.a. oblivious adversarial) bandit setting, where the losses faced by the algorithm are generated by an unspecified deterministic mechanism. Though it might be of adversarial intent, the mechanism is oblivious to the internal randomization of the algorithm. In the delayed version, the loss of an action executed at time $t$ is observed at time $t + d_t$, where the delay $d_t$ is also chosen deterministically and obliviously. Thus, at time step $t$ the algorithm receives observations from time steps $s \le t$ for which $s + d_s = t$.
This delay is independent of the action chosen. The algorithm's performance is evaluated by the regret, which is the difference between the algorithm's cumulative loss and the cumulative loss of the best static action in hindsight. The regret definition is the same as in the ordinary setting without delays.

When all the delays are constant ($d_t = d$ for all $t$), the optimal regret is known to scale as $O\big(\sqrt{(K + d)T\ln K}\big)$, where $T$ is the time horizon and $K$ is the number of actions [Cesa-Bianchi et al., 2019]. Remarkably, this bound is achieved by delayed Exp3, which is a minor modification of the standard Exp3 algorithm performing updates as soon as the losses become available.

The case of variable delays has previously been studied in the full information setting by Joulani et al. [2016]. They prove a regret bound of order $\sqrt{(D + T)\ln K}$, where $D = \sum_{t=1}^T d_t$ is the total delay. Their proof is based on a generic reduction from delayed full information feedback to full information with no delay. The applicability of this technique to the bandit setting is unclear (see Appendix A). Cesa-Bianchi et al. [2019] conjecture an upper bound of order $\sqrt{(KT + D)\ln K}$ for the bandit setting with variable delays. Note that this bound cannot be improved in the general case because of the lower bound $\Omega\big(\sqrt{(K + d)T}\big)$, which holds for any $d$.

In a recent paper, Li et al. [2019] study a harder variant of bandits, where the delays $d_t$ remain unknown. As a consequence, if an action is played at time $s$ and then again between time steps $s$ and $s + d_s$, the learner cannot tell which specific round the loss observed at time $s + d_s$ refers to. In this harder setting, for known $T$, $D$, and $d_{\max}$, Li et al. [2019] prove a regret bound of $\tilde O\big(\sqrt{d_{\max}K(T + D)}\big)$. Cesa-Bianchi et al. [2018] further study an even harder setting of bandits with anonymous composite feedback. In this setting, at time step $t$ the learner observes a feedback which is a composition of partial losses of the actions taken in the last $d_{\max}$ rounds. In this setting Cesa-Bianchi et al. [2018] obtain an $O\big(\sqrt{d_{\max}KT\ln K}\big)$ regret bound (which is tight to within the $\ln K$ factor, and in fact tighter than the bound of Li et al. [2019], which is for an easier problem).

Our paper is structured in the following way. We start by investigating the regret of Exp3 in the variable delay setting. We prove that for known $T$, $D$, and $d_{\max}$, and assuming that $d_{\max}$ is at most of order $\sqrt{(KT + D)/\ln K}$, "delayed" Exp3 achieves the conjectured bound of $O\big(\sqrt{(KT + D)\ln K}\big)$. In order to remove the restriction on $d_{\max}$ and eliminate the need for its knowledge, we introduce a wrapper algorithm, Skipper. Skipper prevents the wrapped bandit algorithm from making updates using observations with delay exceeding a given threshold $\beta$. This threshold acts as a tunable upper bound on the delays observed by the underlying algorithm, so if $T$ and $D$ are known we can choose a $\beta$ that attains the desired $O\big(\sqrt{(KT + D)\ln K}\big)$ regret bound with "delayed" Exp3 wrapped within Skipper.

To dispense with the need for knowing $T$ and $D$, the first approach coming to mind is the doubling trick. However, applying standard doubling to $D$ is problematic, because the event that the actual total delay $d_1 + \dots + d_t$ exceeds an estimate of $D$ is observed at time $t + d_t$ rather than at time $t$. In order to address this issue, we consider a setting in which the algorithm observes the delay $d_t$ at time $t$ rather than at time $t + d_t$.
To distinguish between this setting and the previous one, we say that "the delay is observed at action time" if it is observed at time $t$, and "the delay is observed at observation time" if it is observed at time $t + d_t$. Observing the delay at action time is motivated by scenarios in which a learning agent depends on feedback from a third party, for instance a server or laboratory that processes the action in order to evaluate it. In such cases, the third party might partially control the delay and provide the agent with a delay estimate based on contingent and possibly private information. In the server example the delay could depend on workload, while the laboratory might have processing times and an order backlog. Other examples include medical imaging, where the availability of annotations depends on medical professionals' work schedules. Common to these examples is that the third party knows the delay before the action is taken.

Within the "delay at action time" setting we achieve a much stronger regret bound. We show that Skipper wrapping delayed Exp3, combined with a carefully designed doubling trick, enjoys an implicit regret bound of order $\min_\beta\big(|S_\beta| + \beta\ln K + (KT + D_\beta)/\beta\big)$, where $D_\beta$ is the total delay of observations with delay below $\beta$. This bound is attained without any assumptions on the sequence of delays $d_t$ and with no need for prior knowledge of $T$ and $D$. The implicit bound can be relaxed to an explicit bound of $O\big(\sqrt{(KT + D)\ln K}\big)$; however, if $D_\beta \ll D$ it can be much tighter. We provide an instance of such a problem in Example 8, where we get a polynomially tighter bound. Table 1 summarizes the spectrum of delayed feedback models in the bandit case and places our results in the context of prior work.

Table 1: Spectrum of delayed feedback settings and the corresponding regret bounds, progressing from easier to harder settings. Results marked by (*) have matching lower bounds up to the $\ln K$ factor. If all the delays are identical, then $D = dT$ and (**) has a lower bound following from Cesa-Bianchi et al. [2019] and matching up to the $\ln K$ factor. However, for non-identical delays the regret can be much smaller, as we show in Example 8.

| Setting | Regret Bound | Reference |
|---|---|---|
| Fixed delay | $O\big(\sqrt{(K + d)T\ln K}\big)$ (*) | Cesa-Bianchi et al. [2019] |
| Delay at action time | $O\big(\min_\beta\big\{|S_\beta| + \beta\ln K + \frac{KT + D_\beta}{\beta}\big\}\big)$ | This paper |
| Delay at observation time with known $T$, $D$ | $O\big(\sqrt{(KT + D)\ln K}\big)$ (**) | This paper |
| Anonymous, composite, with known $d_{\max}$ | $O\big(\sqrt{d_{\max}KT\ln K}\big)$ (*) | Cesa-Bianchi et al. [2018] |

1.1 Additional related work

Online learning with delays was pioneered by Mesterharm [2005]; see also Mesterharm [2007, Chapter 8]. More recent work in the full information setting includes Zinkevich et al. [2009], Quanrud and Khashabi [2015], and Ghosh and Ramchandran [2018]. The theme of large or unbounded delays in the full information setting was also investigated by Mann et al. [2018] and Garrabrant et al. [2016]. Other related approaches are the works by Shamir and Szlak [2017], who use a semi-adversarial model, and Chapelle [2014], who studies the role of delays in the context of online advertising. Chapelle and Li [2011] perform an empirical study of the impact of delay in bandit models, which is extended in Mandel et al. [2015]. The analysis of Exp3 in a delayed setting was initiated by Neu et al. [2014]. In the stochastic case, bandit learning with delayed feedback was studied by Dudík et al. [2011] and Vernade et al. [2017]. The results were extended to the anonymous setting by Pike-Burke et al. [2018] and by Garg and Akash [2019], and to the contextual setting by Arya and Yang [2019].
2 Setting and notation

We consider an oblivious adversarial multiarmed bandit setting, where $K$ sequences of losses are generated in an arbitrary way prior to the start of the game. The losses are denoted by $\ell^a_t$, where $t$ indexes the game rounds and $a \in \{1, \dots, K\}$ indexes the sequences. We assume that all losses are in the $[0, 1]$ interval. We use the notation $[K] = \{1, \dots, K\}$ for brevity. At each round of the game the learner picks an action $A_t$ and suffers the loss of that action. The loss $\ell^{A_t}_t$ is observed by the learner after $d_t$ rounds, where the sequence of delays $d_1, d_2, \dots$ is determined in an arbitrary way before the game starts. Thus, at round $t$ the learner observes the losses of prior actions $A_s$ for which $s + d_s = t$. We assume that the losses are observed "at the end of round $t$", after the action $A_t$ has been selected.

We consider two different settings for receiving information about the delays $d_t$:

Delay available at observation time. The delay $d_t$ is observed when the feedback $\ell^{A_t}_t$ arrives at the end of round $t + d_t$. This corresponds to the feedback being timestamped.

Delay available at action time. The delay $d_t$ is observed at the beginning of round $t$, prior to selecting the action $A_t$.

The following learning protocol provides a formal description of our setting.

Protocol for bandits with delayed feedback
For $t = 1, 2, \dots$
1. If the delay is available at action time, then $d_t \ge 0$ is revealed to the learner
2. The learner picks an action $A_t \in \{1, \dots, K\}$ and suffers the loss $\ell^{A_t}_t \in [0, 1]$
3. Pairs $\big(s, \ell^{A_s}_s\big)$ for all $s \le t$ such that $s + d_s = t$ are observed

We measure the performance of the learner by her expected regret $R_T$, which is defined as the difference between the expected cumulative loss of the learner and the loss of the best static strategy in hindsight:
$$R_T = \mathbb{E}\left[\sum_{t=1}^T \ell^{A_t}_t\right] - \min_{a \in [K]} \sum_{t=1}^T \ell^a_t.$$
This regret definition is the same as the one used in the standard multiarmed bandit setting without delay.

3 Delay available at observation time: Algorithms and results

This section deals with the first of our two settings, namely when delays are observed together with the losses. We first introduce a modified version of "delayed" Exp3, which we name Delayed Exponential Weights (DEW) and which is capable of handling variable delays. We then introduce a wrapper algorithm, Skipper, which filters out excessively large delays. The two algorithms also serve as the basis for the next section, where we provide yet another wrapper for tuning the parameters of Skipper.

3.1 Delayed Exponential Weights (DEW)

DEW is an extension of the standard exponential weights approach to handle delayed feedback. The algorithm, laid out in Algorithm 1, performs an exponential update using every individual feedback as it arrives, which means that between consecutive predictions either zero, one, or multiple updates might occur. The algorithm assumes that the delays are bounded and that an upper bound $d_{\max} \ge \max_t d_t$ on the delays is known.

Algorithm 1: Delayed Exponential Weights (DEW)
Input: learning rate $\eta$; upper bound on the delays $d_{\max}$
Truncate the learning rate: $\eta \leftarrow \min\{\eta, (4e d_{\max})^{-1}\}$
Initialize $w^a_0 = 1$ for all $a \in [K]$
for $t = 1, 2, \dots$ do
  Let $p^a_t = \frac{w^a_{t-1}}{\sum_b w^b_{t-1}}$ for all $a \in [K]$
  Draw an action $A_t \in [K]$ according to the distribution $p_t$ and play it
  Observe feedback $\big(s, \ell^{A_s}_s\big)$ for all $\{s : s + d_s = t\}$ and construct the estimators $\hat\ell^a_s = \frac{\ell^a_s \mathbf{1}(a = A_s)}{p^a_s}$
  Update $w^a_t = w^a_{t-1} \exp\big(-\eta \sum_{s : s + d_s = t} \hat\ell^a_s\big)$
end
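To make the update pattern of Algorithm 1 concrete, the following is a minimal Python sketch of DEW. The `act`/`update` interface, the `run_dew` driver, and the bookkeeping of play-time probabilities and arrival times are illustrative assumptions introduced here; only the truncated learning rate, the importance-weighted estimator, and the per-arrival weight update follow Algorithm 1 directly.

```python
import math
import random
from collections import defaultdict

class DEW:
    """Delayed Exponential Weights (Algorithm 1), minimal sketch."""

    def __init__(self, n_arms, eta, d_max):
        # Truncate the learning rate as in Algorithm 1.
        self.eta = min(eta, 1.0 / (4 * math.e * d_max))
        self.n_arms = n_arms
        self.w = [1.0] * n_arms            # weights w^a

    def act(self):
        total = sum(self.w)
        p = [wa / total for wa in self.w]
        arm = random.choices(range(self.n_arms), weights=p)[0]
        return arm, p[arm]                 # remember p^{A_t}_t for the estimator

    def update(self, feedback):
        """feedback: list of (arm_played, loss, prob_at_play_time) for all rounds s
        whose loss arrives at the end of the current round (s + d_s = t)."""
        for arm, loss, prob in feedback:
            est = loss / prob              # importance-weighted estimate \hat\ell^a_s
            self.w[arm] *= math.exp(-self.eta * est)

def run_dew(losses, delays, eta):
    """losses: per-round lists of per-arm losses in [0, 1]; delays: list of d_t."""
    T, K = len(losses), len(losses[0])
    alg = DEW(K, eta, d_max=max(delays))
    arrivals = defaultdict(list)           # arrival round -> buffered feedback
    total_loss = 0.0
    for t in range(T):
        arm, prob = alg.act()
        total_loss += losses[t][arm]
        arrivals[t + delays[t]].append((arm, losses[t][arm], prob))
        alg.update(arrivals.pop(t, []))     # losses observed at the end of round t
    return total_loss
```

Each arriving loss changes only the weight of the arm that was actually played in the corresponding round, since the importance-weighted estimate $\hat\ell^a_s$ is zero for every other arm; between consecutive actions the sketch therefore performs as many multiplicative updates as there are arrivals.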
The following theorem provides a regret bound for Algorithm 1. The bound is a generalization of a similar bound in Cesa-Bianchi et al. [2019].

Theorem 1. Under the assumption that an upper bound on the delays $d_{\max}$ is known, the regret of Algorithm 1 with a learning rate $\eta$ against an oblivious adversary satisfies
$$R_T \le \max\left\{\frac{\ln K}{\eta},\, 4e d_{\max} \ln K\right\} + \eta\left(\frac{KTe}{2} + D\right),$$
where $D = \sum_{t=1}^T d_t$. In particular, if $T$ and $D$ are known and $\eta = \sqrt{\frac{\ln K}{KTe/2 + D}} \le \frac{1}{4e d_{\max}}$, we have
$$R_T \le 2\sqrt{\left(\frac{KTe}{2} + D\right)\ln K}. \qquad (1)$$

The proof of Theorem 1 is based on proving the stability of the algorithm across rounds. The proof is sketched in Section 5. As Theorem 1 shows, Algorithm 1 performs well if $d_{\max}$ is small and we also have preliminary knowledge of $d_{\max}$, $T$, and $D$. However, a single delay of order $T$ increases $d_{\max}$ up to order $T$, which turns Theorem 1 into a linear regret bound. This is an undesired property, which we address with the skipping scheme presented next.

3.2 Skipping scheme

We introduce a wrapper for Algorithm 1, called Skipper, which disregards feedback from rounds with excessively large delays. The regret in the skipped rounds is trivially bounded by 1 (because the losses are assumed to be in $[0, 1]$), and these rounds are taken out of the analysis of the regret of DEW. Skipper operates with an externally provided threshold $\beta$ and skips all rounds where $d_t \ge \beta$. The advantage of skipping is that it provides a natural upper bound on the delays for the subset of rounds processed by DEW, namely $d_{\max} = \beta$. Thus, we eliminate the need for knowledge of the maximal delay in the original problem. The cost of skipping is the number of skipped rounds, denoted by $|S_\beta|$, as captured in Lemma 2. Below we provide a regret bound for the combination of Skipper and DEW.

Algorithm 2: Skipper
Input: threshold $\beta$; algorithm $\mathcal{A}$
for $t = 1, 2, \dots$ do
  Get prediction $A_t$ from $\mathcal{A}$ and play it
  Observe feedback $\big(s, \ell^{A_s}_s\big)$ for all $\{s : s + d_s = t\}$, and feed it to $\mathcal{A}$ for each $s$ with $d_s < \beta$
end

Lemma 2. The expected regret of Skipper with base algorithm $\mathcal{A}$ and threshold parameter $\beta$ satisfies
$$R_T \le |S_\beta| + R_{T \setminus S_\beta}, \qquad (2)$$
where $|S_\beta|$ is the number of skipped rounds (those for which $d_t \ge \beta$) and $R_{T \setminus S_\beta}$ is a regret bound for running $\mathcal{A}$ on the subset of rounds $[T] \setminus S_\beta$ (those for which $d_t < \beta$).

A proof of the lemma is found in Appendix C. When combined with the previous analysis for DEW, Lemma 2 gives us the following regret bound.

Theorem 3. The expected regret of Skipper($\beta$, DEW($\eta$, $\beta$)) against an oblivious adversary satisfies
$$R_T \le |S_\beta| + \max\left\{\frac{\ln K}{\eta},\, 4e\beta \ln K\right\} + \eta\left(\frac{KTe}{2} + D_\beta\right), \qquad (3)$$
where $D_\beta = \sum_{t \notin S_\beta} d_t$ is the cumulative delay experienced by DEW.

Proof. Theorem 1 holds with parameters $(\eta, \beta)$ for DEW run under Skipper. We then apply Lemma 2.

Corollary 4. Assume that $T$ and $D$ are known and take
$$\eta = \frac{1}{4e\beta}, \qquad \beta = \sqrt{\frac{KTe/2 + (1 + 4e)D}{(4e)^2 \ln K}}.$$
Then the expected regret of Skipper($\beta$, DEW($\eta$, $\beta$)) against an oblivious adversary satisfies
$$R_T \le 2\sqrt{\left(\frac{KTe}{2} + (1 + 4e)D\right)\ln K}.$$

Proof. Note that $D \ge \beta|S_\beta|$, hence $|S_\beta| \le D/\beta$. By substituting this into (3), observing that $D_\beta \le D$, and substituting the values of $\eta$ and $\beta$ we obtain the result.

Note that Corollary 4 recovers the regret scaling of equation (1) in Theorem 1 within constant factors in front of $D$, without the need for knowledge of $d_{\max}$. Similar to Theorem 1, Corollary 4 is tight in the worst case. The tuning of $\beta$ still requires the knowledge of $T$ and $D$. In the next section we get rid of this requirement.
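For concreteness, here is a minimal Python sketch of Skipper (Algorithm 2) as a wrapper around any base algorithm exposing the `act`/`update` interface used in the DEW sketch above. The feedback tuple layout and the `corollary4_threshold` helper are illustrative assumptions; the helper merely mirrors the tuning stated in Corollary 4 above and is not code from the paper.

```python
import math

class Skipper:
    """Skipper (Algorithm 2): drop feedback from rounds whose delay is >= beta."""

    def __init__(self, base_alg, beta):
        self.base = base_alg
        self.beta = beta

    def act(self):
        # Predictions are forwarded unchanged from the base algorithm.
        return self.base.act()

    def update(self, feedback):
        # feedback: list of (delay, arm, loss, prob) tuples arriving this round;
        # only observations with delay strictly below the threshold reach the base.
        kept = [(arm, loss, prob) for (d, arm, loss, prob) in feedback if d < self.beta]
        self.base.update(kept)

def corollary4_threshold(K, T, D):
    """Threshold beta in the spirit of Corollary 4, assuming T and D are known."""
    return math.sqrt((K * T * math.e / 2 + (1 + 4 * math.e) * D)
                     / ((4 * math.e) ** 2 * math.log(K)))

# Usage sketch: wrap DEW with d_max = beta and learning rate 1/(4e*beta).
# beta = corollary4_threshold(K, T, D)
# alg = Skipper(DEW(K, eta=1.0 / (4 * math.e * beta), d_max=beta), beta)
```

Because every observation that reaches DEW has delay below $\beta$, the wrapper makes $\beta$ play the role of $d_{\max}$ for the base algorithm, exactly as in the analysis above.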
4 Delay available at action time: Oracle tuning and results

This section deals with the second setting, where the delays are observed before taking an action. The combined algorithm introduced in the previous section relies on prior knowledge of $T$ and $D$ for tuning its parameters. In this section we eliminate this requirement by leveraging the added information about the delays at the time of action. The information is used in an implicit doubling scheme for tuning Skipper's threshold parameter $\beta$. Additionally, the new bound scales with the experienced delay $D_\beta$ rather than the full delay $D$ and is significantly tighter when $D_\beta \ll D$. This is achieved through direct optimization of the regret bound in terms of $|S_\beta|$ and $D_\beta$, as opposed to Corollary 4, which tunes $\beta$ using the potentially loose inequality $|S_\beta| \le D/\beta$.

4.1 The doubling scheme

Let $m$ index the epochs of the doubling scheme. In each epoch we restart the algorithm with new parameters and continually monitor the termination condition in equation (6). The learning rate within epoch $m$ is set to $\eta_m = \frac{1}{4e\beta_m}$, where $\beta_m$ is the threshold parameter of the epoch. Theorem 3 provides a regret bound for epoch $m$, denoted by
$$\mathrm{Bound}_m(\beta_m) := |S^m_{\beta_m}| + 4e\beta_m \ln K + \frac{\sigma(m)eK/2 + D^m_{\beta_m}}{4e\beta_m}, \qquad (4)$$
where $\sigma(m)$ denotes the length of epoch $m$, and $|S^m_{\beta_m}|$ and $D^m_{\beta_m}$ are, respectively, the number of skipped rounds and the experienced delay within epoch $m$. Let $\omega_m = 2^m$. In epoch $m$ we set
$$\beta_m = \frac{\sqrt{\omega_m}}{4e \ln K} \qquad (5)$$
and we stay in epoch $m$ as long as the following condition holds:
$$\max\big\{|S^m_{\beta_m}|^2,\; eK\sigma(m)\ln K\big\} \le \omega_m. \qquad (6)$$
Since $d_t$ is observed at the beginning of round $t$, we are able to evaluate condition (6) and start a new epoch before making the selection of $A_t$. This provides the desired tuning of $\beta_m$ for all rounds without the need for a separate treatment of epoch transition points.

While being more elaborate, this doubling scheme maintains the intuition of standard approaches. First of all, the condition for doubling (6) ensures that the regret bound in each epoch is optimized by explicitly balancing the contribution of each term in equation (4). Secondly, the geometric progression of the tuning (5), and thus of the resulting regret bounds, means that the total regret bound summed over the epochs can be bounded in relation to the bound in the final completed epoch. In the following we refer to the doubling scheme defined by (5) and (6) as Doubling. A Python sketch of the epoch logic is given at the end of Section 4.2.

4.2 Results

The following results show that the proposed doubling scheme works as well as an oracle tuning of $\beta$ when the learning rate is fixed at $\eta = 1/(4e\beta)$. We first compare our performance to the optimal tuning in a single epoch, where we let
$$\beta^*_m = \arg\min_{\beta_m} \mathrm{Bound}_m(\beta_m) \qquad (7)$$
be the minimizer of (4).

Lemma 5. The regret bound (4) for any non-final epoch $m$, with the epochs and $\beta_m$ controlled by Doubling, satisfies
$$\mathrm{Bound}_m(\beta_m) \le 3\sqrt{\omega_m} \le 3\,\mathrm{Bound}_m(\beta^*_m) + 2e^2 K \ln K + 1. \qquad (8)$$

The lemma is the main machinery of the analysis of Doubling, and its proof is provided in Appendix C. Applying it to Skipper($\beta$, DEW($\eta$, $\beta$)) leads to the following main result.

Theorem 6. The expected regret of Skipper($\beta$, DEW($\eta$, $\beta$)) tuned by Doubling satisfies, for any $T$,
$$R_T \le 15 \min_\beta \left\{|S_\beta| + 4e\beta\ln K + \frac{KTe/2 + D_\beta}{4e\beta}\right\} + 10e^2 K \ln K + 5.$$

The proof of Theorem 6 is based on Lemma 5 and is provided in Appendix C.

Corollary 7. The expected regret of Skipper($\beta$, DEW($\eta$, $\beta$)) tuned by Doubling can be relaxed, for any $T$, to
$$R_T \le 30\sqrt{\left(\frac{KTe}{2} + (1 + 4e)D\right)\ln K} + 10e^2 K \ln K + 5. \qquad (9)$$

Proof. The first term in the bound of Theorem 6 can be directly bounded using Corollary 4.

Note that both Theorem 6 and Corollary 7 require no knowledge of $T$ and $D$.
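The epoch logic of Doubling, as reconstructed in (5) and (6), can be summarized in a short Python sketch. The factory interface to the Skipper/DEW sketches above and all names are illustrative assumptions; in particular, feedback arriving from rounds of earlier epochs is simply forwarded to the current instance here, which is a simplification of the paper's restart bookkeeping.

```python
import math

class Doubling:
    """Doubling scheme for tuning Skipper's threshold when d_t is known at action time.
    Minimal sketch, assuming `make_algorithm(beta)` returns a fresh Skipper-wrapped
    DEW instance with eta = 1/(4e*beta) and d_max = beta."""

    def __init__(self, K, make_algorithm):
        self.K = K
        self.make_algorithm = make_algorithm
        self.m = 1
        self._start_epoch()

    def _start_epoch(self):
        self.omega = 2.0 ** self.m
        self.beta = math.sqrt(self.omega) / (4 * math.e * math.log(self.K))  # eq. (5)
        self.alg = self.make_algorithm(self.beta)
        self.skipped = 0      # |S^m_{beta_m}|, skipped rounds in the current epoch
        self.length = 0       # sigma(m), length of the current epoch

    def act(self, d_t):
        # d_t is revealed before the action: tentatively account for round t and
        # check the termination condition (6), doubling until it holds again.
        skipped = self.skipped + (1 if d_t >= self.beta else 0)
        length = self.length + 1
        while max(skipped ** 2, math.e * self.K * length * math.log(self.K)) > self.omega:
            self.m += 1
            self._start_epoch()
            skipped = 1 if d_t >= self.beta else 0
            length = 1
        self.skipped, self.length = skipped, length
        return self.alg.act()

    def update(self, feedback):
        # Delay-based skipping itself is handled inside the Skipper wrapper.
        self.alg.update(feedback)
```

The key point mirrored by the sketch is that the decision to open a new epoch, and hence the value of $\beta_m$ used for round $t$, depends only on quantities available before $A_t$ is chosen.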
4.3 Comparison of the oracle and explicit bounds

We finish the section with a comparison of the oracle bound in Theorem 6 and the explicit bound in Corollary 7. Ignoring the constant and additive terms, the bounds are
$$\text{explicit:}\quad O\Big(\sqrt{(KT + D)\ln K}\Big), \qquad \text{oracle:}\quad O\Big(\min_\beta \Big\{|S_\beta| + \beta\ln K + \frac{KT + D_\beta}{\beta}\Big\}\Big).$$
Note that the oracle bound is always at least as strong as the explicit bound. There are, however, cases where it is much tighter. Consider the following example.

Example 8. For $t < \sqrt{KT/\ln K}$ let $d_t = T - t$, and for $t \ge \sqrt{KT/\ln K}$ let $d_t = 0$. Take $\beta = \sqrt{KT/\ln K}$. Then $D = \Theta\big(T\sqrt{KT/\ln K}\big)$, but $D_\beta = 0$ (assuming that $T \ge K\ln K$) and $|S_\beta| < \sqrt{KT/\ln K}$. The corresponding regret bounds are
$$\text{explicit:}\quad O\left(\sqrt{KT\ln K + T\sqrt{KT\ln K}}\right) = O\big(T^{3/4}\big), \qquad \text{oracle:}\quad O\left(\sqrt{KT\ln K}\right) = O\big(T^{1/2}\big).$$
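To see the magnitudes in Example 8 concretely, the following short Python snippet computes $D$, $D_\beta$, and $|S_\beta|$ for the example's delay sequence; the particular values of $K$ and $T$ are arbitrary choices for illustration.

```python
import math

K, T = 10, 100_000
beta = math.sqrt(K * T / math.log(K))

# Delay sequence of Example 8: very large delays early on, none afterwards.
delays = [T - t if t < beta else 0 for t in range(1, T + 1)]

D = sum(delays)                               # total delay, order T*sqrt(KT/ln K)
S_beta = sum(1 for d in delays if d >= beta)  # skipped rounds |S_beta|
D_beta = sum(d for d in delays if d < beta)   # delay experienced after skipping

print(f"beta = {beta:.0f}, D = {D}, D_beta = {D_beta}, |S_beta| = {S_beta}")
# All early delays exceed beta, so D_beta = 0 and |S_beta| < sqrt(KT/ln K),
# while D is of order T*sqrt(KT/ln K); this is the source of the T^{3/4} vs
# T^{1/2} gap between the explicit and oracle bounds above.
```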
5 Analysis of Algorithm 1

This section contains the main points of the analysis of Algorithm 1 leading to the proof of Theorem 1, which were postponed from Section 3. Full proofs are found in Appendix B. The analysis is a generalization of the analysis of delayed Exp3 in Cesa-Bianchi et al. [2019], and consists of a general regret analysis and two stability lemmas.

5.1 Additional notation

We let $N_t = |\{s : s + d_s \in [t, t + d_t)\}|$ denote the stability-span of $t$, which is the amount of feedback that arrives between playing action $A_t$ and observing its feedback. Note that letting $N = \max_t N_t$ we have $N \le 2\max_t d_t \le 2d_{\max}$, since this may include feedback from up to $\max_s d_s$ rounds prior to round $t$ and up to $d_t$ rounds after round $t$. We introduce $Z = (z_1, \dots, z_T)$ as a permutation of $[T] = \{1, \dots, T\}$ sorted in ascending order according to the value of $z + d_z$, with ties broken randomly, and let $\Psi_i = (z_1, \dots, z_i)$ be its first $i$ elements. Similarly, we also introduce $Z'_t = (z'_1, \dots, z'_{N_t})$ as an enumeration of $\{s : s + d_s \in [t, t + d_t)\}$. For a subset $C$ of the integers, corresponding to time steps, we also introduce
$$q^a(C) = \frac{\exp\big(-\eta \sum_{s \in C} \hat\ell^a_s\big)}{\sum_{b=1}^K \exp\big(-\eta \sum_{s \in C} \hat\ell^b_s\big)}. \qquad (10)$$
The numerator and denominator in the above expression will also be denoted by $w^a(C)$ and $W(C)$, in correspondence with the definition of $p^a_t$. Finally, letting $C_{t-1} = \{s : s + d_s < t\}$, we have $p^a_t = q^a(C_{t-1})$.

5.2 Analysis of delayed exponential weights

The starting point is the following modification of the basic lemma within the Exp3 analysis, which takes care of the delayed updates of the weights.

Lemma 9. Algorithm 1 satisfies
$$\sum_{t=1}^T \sum_{a=1}^K p^a_{t+d_t} \hat\ell^a_t - \min_{a \in [K]} \sum_{t=1}^T \hat\ell^a_t \le \frac{\ln K}{\eta} + \frac{\eta}{2} \sum_{t=1}^T \sum_{a=1}^K p^a_{t+d_t} \big(\hat\ell^a_t\big)^2. \qquad (11)$$

To make use of Lemma 9, we need to figure out the relationship between $p^a_{t+d_t}$ and $p^a_t$. This is achieved by the following two lemmas, which are generalizations and refinements of Lemmas 1 and 2 in Cesa-Bianchi et al. [2019].

Lemma 10. When using Algorithm 1, the resulting probabilities fulfil, for every $t$ and $a$,
$$p^a_{t+d_t} \ge p^a_t - \eta \sum_{i=1}^{N_t} q^a\big(C_{t-1} \cup \{z'_j : j < i\}\big)\, \hat\ell^a_{z'_i}, \qquad (12)$$
where $z'_j$ is an enumeration of $\{s : s + d_s \in [t, t + d_t)\}$.

The above lemma allows us to bound $p^a_{t+d_t}$ from below in terms of $p^a_t$. We similarly need to be able to upper bound the probability, which is captured in the second probability drift lemma.

Lemma 11. The probabilities defined by (10) satisfy, for any $i$,
$$q^a(\Psi_i) \le \left(1 + \frac{1}{2N - 1}\right) q^a(\Psi_{i-1}). \qquad (13)$$

5.3 Proof sketch of Theorem 1

By using Lemma 10 to bound the left-hand side of (11) we have
$$\sum_{t=1}^T \sum_{a=1}^K p^a_t \hat\ell^a_t - \min_{a} \sum_{t=1}^T \hat\ell^a_t \le \frac{\ln K}{\eta} + \frac{\eta}{2} \sum_{t=1}^T \sum_{a=1}^K p^a_{t+d_t} \big(\hat\ell^a_t\big)^2 + \eta \sum_{t=1}^T \sum_{a=1}^K \hat\ell^a_t \sum_{i=1}^{N_t} q^a\big(C_{t-1} \cup \{z'_j : j < i\}\big)\, \hat\ell^a_{z'_i}.$$
Repeated use of Lemma 11 bounds the second term on the right-hand side by $\eta TKe/2$ in expectation. The third term on the right-hand side can be bounded by $\eta D$. Taking the maximum over the two possible values of the truncated learning rate finishes the proof.

6 Discussion

We have presented an algorithm for multiarmed bandits with variably delayed feedback, which achieves the $O\big(\sqrt{(KT + D)\ln K}\big)$ regret bound conjectured by Cesa-Bianchi et al. [2019]. The algorithm is based on a procedure for skipping rounds with excessively large delays and a refined analysis of the exponential weights algorithm with delayed observations. At the moment the skipping procedure requires prior knowledge of $T$ and $D$ for tuning the skipping threshold. However, if the delay information is available "at action time", as in the examples described in the introduction, we provide a sophisticated doubling scheme for tuning the skipping threshold that requires no prior knowledge of $T$ and $D$. Furthermore, the refined tuning also leads to a refined regret bound of order $O\big(\min_\beta\big\{|S_\beta| + \beta\ln K + \frac{KT + D_\beta}{\beta}\big\}\big)$, which is polynomially tighter when $D_\beta \ll D$. We provide an example of such a problem in the paper.

Our work leads to a number of interesting research questions. The main one is whether the two regret bounds are achievable when the delays are available "at observation time" without prior knowledge of $D$ and $T$. Alternatively, is it possible to derive lower bounds demonstrating the impossibility of further relaxation of the assumptions? More generally, it would be interesting to have refined lower bounds for problems with variably delayed feedback. Another interesting direction is the design of anytime algorithms, which do not rely on the doubling trick. Such algorithms can be used, for example, for achieving simultaneous optimality in stochastic and adversarial setups [Zimmert and Seldin, 2019a]. While a variety of anytime algorithms is available for non-delayed bandits, the extension to delayed feedback does not seem trivial. Some of these questions are addressed in a follow-up work by Zimmert and Seldin [2019b].

Acknowledgments

Nicolò Cesa-Bianchi gratefully acknowledges partial support by the Google Focused Award "Algorithms and Learning for AI" (ALL4AI) and by the MIUR PRIN grant "Algorithms, Games, and Digital Markets" (ALGADIMAR). Yevgeny Seldin acknowledges partial support by the Independent Research Fund Denmark, grant number 9040-00361B.

References

Sakshi Arya and Yuhong Yang. Randomized allocation with nonparametric estimation for contextual multi-armed bandits with delayed rewards. arXiv preprint, arXiv:1902.00819, 2019.

Nicolò Cesa-Bianchi, Claudio Gentile, and Yishay Mansour. Nonstochastic bandits with composite anonymous feedback. In Proceedings of the International Conference on Computational Learning Theory (COLT), 2018.

Nicolò Cesa-Bianchi, Claudio Gentile, and Yishay Mansour. Delay and cooperation in nonstochastic bandits. The Journal of Machine Learning Research, 20(1):613–650, 2019.

Olivier Chapelle. Modeling delayed feedback in display advertising. In Proceedings of the International Conference on Knowledge Discovery and Data Mining (ACM SIGKDD), 2014.

Olivier Chapelle and Lihong Li. An empirical evaluation of Thompson sampling. In Advances in Neural Information Processing Systems (NeurIPS), 2011.

Miroslav Dudík, Daniel J. Hsu, Satyen Kale, Nikos Karampatziakis, John Langford, Lev Reyzin, and Tong Zhang. Efficient optimal learning for contextual bandits. In Proceedings of the Conference on Uncertainty in Artificial Intelligence, 2011.

Siddhant Garg and Aditya Kumar Akash. Stochastic bandits with delayed composite anonymous feedback. arXiv preprint, arXiv:1910.01161, 2019.

Scott Garrabrant, Nate Soares, and Jessica Taylor. Asymptotic convergence in online learning with unbounded delays. arXiv preprint, arXiv:1604.05280, 2016.

Avishek Ghosh and Kannan Ramchandran. Online scoring with delayed information: A convex optimization viewpoint.
In Proceedings of the Annual Allerton Conference on Communication, Control, and Computing (Allerton), 2018.

Pooria Joulani, András György, and Csaba Szepesvári. Delay-tolerant online convex optimization: Unified analysis and adaptive-gradient algorithms. In Proceedings of the AAAI Conference on Artificial Intelligence, 2016.

Bingcong Li, Tianyi Chen, and Georgios B. Giannakis. Bandit online learning with unknown delays. In Proceedings of the International Conference on Artificial Intelligence and Statistics (AISTATS), 2019.

Travis Mandel, Yun-En Liu, Emma Brunskill, and Zoran Popović. The queue method: Handling delay, heuristics, prior data, and evaluation in bandits. In Proceedings of the AAAI Conference on Artificial Intelligence, 2015.

Timothy A. Mann, Sven Gowal, Ray Jiang, Huiyi Hu, Balaji Lakshminarayanan, and András György. Learning from delayed outcomes with intermediate observations. arXiv preprint, arXiv:1807.09387, 2018.

Chris Mesterharm. On-line learning with delayed label feedback. In Proceedings of the International Conference on Algorithmic Learning Theory (ALT), 2005.

Chris Mesterharm. Improving Online Learning. PhD thesis, Department of Computer Science, Rutgers University, 2007.

Gergely Neu, András György, Csaba Szepesvári, and András Antos. Online Markov decision processes under bandit feedback. IEEE Transactions on Automatic Control, 59(3), 2014.

Ciara Pike-Burke, Shipra Agrawal, Csaba Szepesvári, and Steffen Grünewälder. Bandits with delayed, aggregated anonymous feedback. In Proceedings of the International Conference on Machine Learning (ICML), 2018.

Kent Quanrud and Daniel Khashabi. Online learning with adversarial delays. In Advances in Neural Information Processing Systems (NeurIPS), 2015.

Ohad Shamir and Liran Szlak. Online learning with local permutations and delayed feedback. In Proceedings of the International Conference on Machine Learning (ICML), 2017.

Claire Vernade, Olivier Cappé, and Vianney Perchet. Stochastic bandit models for delayed conversions. In Proceedings of the Conference on Uncertainty in Artificial Intelligence, 2017.

Julian Zimmert and Yevgeny Seldin. An optimal algorithm for stochastic and adversarial bandits. In Proceedings of the International Conference on Artificial Intelligence and Statistics (AISTATS), 2019a.

Julian Zimmert and Yevgeny Seldin. An optimal algorithm for adversarial bandits with arbitrary delays. arXiv preprint, arXiv:1910.06054, 2019b.

Martin Zinkevich, John Langford, and Alex J. Smola. Slow learners are fast. In Advances in Neural Information Processing Systems (NeurIPS), 2009.