Stochastic bandits with arm-dependent delays

Anne Gael Manegueu 1, Claire Vernade 2, Alexandra Carpentier 1, Michal Valko 3

Significant work has recently been dedicated to stochastic delayed bandits because of their relevance in applications. The applicability of existing algorithms is, however, restricted by the fact that strong assumptions are often made on the delay distributions, such as full observability, restrictive shape constraints, or uniformity over arms. In this work, we weaken these assumptions significantly and only assume that there is a bound on the tail of the delay. In particular, we cover the important case where the delay distributions vary across arms, and the case where the delays are heavy-tailed. Addressing these difficulties, we propose a simple but efficient UCB-based algorithm called Patient Bandits. We provide both problem-dependent and problem-independent bounds on the regret, as well as performance lower bounds.

1. Introduction

In realistic applications of reinforcement learning (RL), rewards come delayed. In a game, for instance, the consequences of the agent's actions are only observed at the end. This issue is at the core of challenges in RL (Garcia et al., 1966), when the horizon is finite or geometrically discounted. Even much simpler (state-less) bandit setups, such as online advertising, suffer from delayed feedback (Chapelle & Li, 2011; Chapelle, 2014). In particular, most systems do not optimize for clicks but for conversions, which are events implying a stronger commitment from the customer. However, different ads trigger different customer response times. Typically, more expensive and rewarding products require more time to convert on the customer's side, and the system needs to be tuned to be robust to the delays.

1 Otto-von-Guericke University of Magdeburg, DE. 2 DeepMind, London, UK. 3 DeepMind, Paris, FR. Correspondence to: Anne Manegueu, Claire Vernade, Alexandra Carpentier, Michal Valko. Proceedings of the 37th International Conference on Machine Learning, Online, PMLR 119, 2020. Copyright 2020 by the author(s).

As a result, we study stochastic delayed bandits for which the delay distributions are arm-dependent and possibly heavy-tailed. We consider the realistic model for delayed conversions of Chapelle (2014) and Vernade et al. (2017) in which delays are only partially observable. Conversions are binary events that represent a strong commitment (buying, subscribing, ...). If a conversion happens, it is sent with some delay to the learner, who observes both its reward and the corresponding delay. Otherwise, the reward is null by default but it has no specific delay, only the current waiting time. This models a typical e-commerce application: if a customer does not buy the recommended product, the recommendation system will not be informed. The nature of this setup brings two main challenges: (1) the censoring due to partially observed delays, which forces the learner to deal with an unknown amount of missing feedback; and (2) the identifiability issue due to arm-dependent delays.

Prior work on delayed bandits has bypassed the challenges above by assuming that the delays are observed (Joulani et al., 2013; Dudik et al., 2011), which removes the ambiguity, or bounded by a fixed quantity (Pike-Burke et al., 2018; Garg & Akash, 2019; Cesa-Bianchi et al., 2018), which gives other possibilities to deal with them. Another approach, proposed by Vernade et al.
(2017), is to drop the artificial requirement of observability of the delays, and instead impose that all delays have the same distribution across arms and that this distribution is known. We further discuss the relevant related work in Section 3. While the known approaches yield good results under their strong assumptions on delays, none of them provides a solution to the realistic problem that we are tackling.

Contributions. This work is the first to consider a stochastic bandit setting with arm-dependent, unbounded, possibly heavy-tailed, and partially observable delays. We jointly address the challenges of Vernade et al. (2017), Zhou et al. (2019), and Thune et al. (2019). Unlike Vernade et al. (2017); Zhou et al. (2019), we make only mild assumptions on the delays. Furthermore, we give a more precise characterization of the impact of the delays on the regret than that given in the more difficult, non-stochastic setting of Thune et al. (2019). [Footnote 1: As an example of the identifiability issue, consider two instances for a given arm: (1) the reward follows a Bernoulli(1) and the delay is a Dirac at +∞, and (2) the reward follows a Bernoulli(0) and the delay is a Dirac at 0. Both instances produce the same data but have strictly different parameters.]

Our algorithmic solution, Patient Bandits, relies on the right calibration of upper confidence bounds, and we prove that it attains problem-dependent and minimax regret upper bounds. In particular, we prove that:

- In the asymptotic regime, the presence of delays does not affect the regret by more than a constant factor with respect to what is achieved in standard bandits. In other words, the loss of information due to the delays does not lead to a significant increase of the regret with respect to standard bandits. Our algorithm attains the problem-dependent upper bound of standard bandits up to a constant multiplicative factor in many cases, e.g., in the homoscedastic Gaussian case.
- On the other hand, we prove that there is a drop in performance with respect to problem-independent guarantees as compared to standard bandits. This is unavoidable and we prove a lower bound to support it.
- Finally, we study the impact of imperfect prior knowledge for Patient Bandits. Our algorithm takes a parameter that is related to an upper bound on the heaviness of the tails of the delay distributions. We provide a comprehensive study of the extent to which precise knowledge of this parameter can be avoided.

2. Bandits with delayed feedback

We define our stochastic delayed bandit setting. Consider a sequential game of $T \in \mathbb{N}$ rounds where an agent interacts with an environment characterized by a finite set of $K \in \mathbb{N}$ arms, which we denote $[K] \triangleq \{1, \ldots, K\}$. An instance is characterized by a tuple $(V_i, D_i)_{i \in [K]}$, where each arm $i \in [K]$ is associated with an unknown reward distribution $V_i$, with support in $[0, 1]$ and mean $\mu_i$, and an unknown delay distribution $D_i$ with cumulative distribution function (CDF) $\tau_i$ and support in $\mathbb{N}$, such that for any $d \ge 0$ and $t \le T$, if $D_t \sim D_i$, then $\mathbb{P}(D_t \le d) = \tau_i(d)$. At each round $t \le T$, the learner chooses (pulls) an arm $I_t \in [K]$. A reward $C_t \sim V_{I_t}$ and a delay $D_t \sim D_{I_t}$ are generated independently from each other. Neither the reward nor the delay is necessarily displayed at the current round $t$. However, at each upcoming round $t + u$ for $1 \le u \le T - t$, the learner observes the updated quantity

$X_{t,u} \triangleq C_t \mathbf{1}\{D_t \le u\}$,     (1)

corresponding to her pull at time $t$.
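To make Eq. (1) concrete, the following minimal simulation sketch reproduces this feedback model; the function name and the choice of Pareto-type integer delays are ours and purely illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

def simulate_observations(mu, alpha, pulls):
    """Sketch of the feedback model of Eq. (1): the pull at round t generates a
    Bernoulli reward C_t and an integer delay D_t, and what that pull reveals u
    rounds later is X_{t,u} = C_t * 1{D_t <= u}."""
    T = len(pulls)
    C = rng.binomial(1, mu[pulls])                        # rewards C_t ~ V_{I_t}
    # numpy's pareto(a) has P(D > m) = (1 + m)^(-a), so the delay tail decays like m^(-a)
    D = np.floor(rng.pareto(alpha[pulls])).astype(int)    # delays D_t ~ D_{I_t}
    X = np.zeros((T, T), dtype=int)                       # X[t, u]: feedback u rounds after pull t
    for t in range(T):
        for u in range(1, T - t):
            X[t, u] = C[t] * (D[t] <= u)
    return X

# toy example: two arms with hypothetical means and tail indices, pulled in alternation
X = simulate_observations(np.array([0.5, 0.55]), np.array([1.0, 0.3]),
                          pulls=np.arange(10) % 2)
```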
Note that, conversely, at time $t$, the learner only observes the updated quantities corresponding to its past actions, $(X_{s,t-s})_{s < t} = (C_s \mathbf{1}\{D_s \le t - s\})_{s < t}$. Therefore, at round $t$, it disposes of the entire history of information

$\mathcal{H}_t \triangleq (X_{u,v})_{u+v \le t}$.     (2)

The interaction protocol is summarized in Figure 1.

Figure 1. Delayed learning setting.
Setting: $K$ arms, horizon $T$, reward distributions $(V_i)_{i \in [K]}$, delay distributions $(D_i)_{i \in [K]}$.
for $t = 1$ to $T$:
  the learner observes the updated reward sequence $(X_{s,t-s})_{s < t}$, see Eq. (1)
  the learner chooses $I_t \in [K]$ based on $\mathcal{H}_t$, see Eq. (2)
  a reward $C_t \sim V_{I_t}$ and a delay $D_t \sim D_{I_t}$ are generated independently but not necessarily displayed
end for

Assumption 1. Let $\alpha > 0$ be some fixed quantity. We assume that for all $m \in \mathbb{N}$ and all $i \in \{1, \ldots, K\}$, it holds that $|1 - \tau_i(m)| \le m^{-\alpha}$.

The smaller $\alpha$, the more heavy-tailed the delay distribution, and the more difficult the setting. This assumption needs to hold uniformly across arms but does not impose that they all have the same distribution, unlike what is required by Vernade et al. (2017). This is an important weakening of the restricted setting of the prior work, which we generalize.

For $i \in [K]$, we denote by $T_i(t) \triangleq \sum_{s=1}^{t} \mathbf{1}\{I_s = i\}$ the number of times that arm $i$ has been drawn up to round $t$. As $\mu^\star \triangleq \max_i \mu_i$ denotes the mean of the best arm(s), $\Delta_i \triangleq \mu^\star - \mu_i$ is the gap between the mean of the optimal arm(s) and the mean of arm $i$. The goal of the agent is to maximize its expected cumulative reward (i.e., $\mathbb{E}[\sum_{t=1}^{T} C_t]$) after $T$ rounds, and therefore to minimize the expected regret,

$R_T \triangleq \sum_{i=1}^{K} \Delta_i\, \mathbb{E}[T_i(T)]$.     (3)

Remark 1. In this paper, we consider the same concept of regret as for standard bandits, unlike what is done by Vernade et al. (2017). We believe it is a more relevant approach that allows for comparison with the vast existing prior work on the topic (see Section 3).

3. Related work

The problem of learning with delayed feedback is ubiquitous in applications, including universal portfolios in finance (Cover, 2011), online advertising (Chapelle, 2014), and e-commerce (Yoshikawa & Imai, 2018). Therefore, there is a large body of theoretical results under different scenarios and assumptions on the delays. We review the contributions of prior work distinguishing between the full-information setting and the bandit setting.

Full-information. In online (convex) optimization, as opposed to our setting, the learner typically performs a gradient descent by estimating the gradient on the fly using the available information. Thus, delayed feedback forces the learner to make decisions in the face of additional uncertainty. This setting has been considered by Weinberger & Ordentlich (2002) in the case of prediction of individual sequences under the assumption that the delays were fixed and known. The study of online gradient-type algorithms under possibly random or adversarial delays is made under various hypotheses by Langford et al. (2009); Quanrud & Khashabi (2015); Joulani et al. (2016). In distributed learning, communication time between servers naturally induces delays, and this particular setting was studied by Agarwal & Duchi (2011); McMahan & Streeter (2014); Sra et al. (2015), who all proposed asynchronous or delay-tolerant versions of ADAGRAD. Finally, an attempt to reduce the impact of delays was made by Mann et al. (2018) by allowing the learner to observe intermediate signals.

Bandits with observed delays. Joulani et al. (2013) provide a clear overview of the impact of the delays for both the stochastic and the adversarial setting.
Using a method based on a non-delayed algorithm called BASE, they succeed in extending the work of Weinberger & Ordentlich (2002) to the case of non-constant delays. Following this work, Mandel et al. (2015) argued in favor of more randomization to improve exploration in delayed environments, although the guarantees remained unchanged. Taking a different path, closer to ours, the recent work of Zhou et al. (2019) relies on a biased estimator of the mean and corrects for it in the UCB using an estimator of the amount of missing information. They consider the contextual bandit setting and make a strong assumption on the delay distribution, imposing that 1) delays are fully observable, and 2) the delay distribution is the same for all arms and should concentrate nicely (bounded expectation). Due to these assumptions, their algorithm cannot be used for our setting and cannot be compared to ours. Thune et al. (2019) considered the adversarial bandit setting, where delays are observed right after (or before) sampling an arm. Unfortunately, their results and algorithms do not apply in our setting since we do not observe the delays before or right after sampling an arm.

Bandits with partially observed delays. Delayed bandits with censored observations were introduced by Vernade et al. (2017), who build on the real-data analysis of Chapelle & Li (2011). They rely on the major assumption that delays are the same across arms and have a finite expectation. In this setting, they prove an asymptotic problem-dependent lower bound that recovers the standard Lai & Robbins lower bound. They propose an algorithm that uses as input the CDF of the delay distribution and matches this asymptotic lower bound. In other words, they prove that asymptotically, well-behaved delays have no impact on the regret. Following this work, Vernade et al. (2018); Arya & Yang (2019) extend this setting to the linear and contextual stochastic ones. Two more results consider the case where the delays are not observed at all but are bounded by a constant $D > 0$. Garg & Akash (2019) analyze the stochastic setting and Cesa-Bianchi et al. (2018) the adversarial setting, and they achieve regrets of order $\sqrt{TK\log K} + KD\log T$ and $\sqrt{DTK}$, respectively. Pike-Burke et al. (2018) further consider unbounded delays in the adversarial setting, but this time under the assumption that only their expectation is bounded. Again, these results do not apply to our context as we do not assume that the delays are bounded; under Assumption 1 with $\alpha < 1$, the delays can even have infinite means.

4. The Patient Bandits algorithm

In this section, we describe an optimistic algorithm (Auer et al., 2002) that is able to cope with partially observed and potentially heavy-tailed delays. Patient Bandits estimates high-probability upper confidence bounds on the parameter of each arm. As opposed to the standard UCB approach, it is hopeless to design conditionally unbiased estimators in this delayed setting. Therefore, the algorithm needs to also properly bound the bias for each arm adaptively. Throughout the paper, we use the notation $A \wedge B \triangleq \min(A, B)$ and $A \vee B \triangleq \max(A, B)$.

A delay-corrected, high-probability UCB. Since delays are only partially observable, the learner must build its estimators with an unknown number of observations. Indeed, since rewards are delayed, a certain proportion of the feedback of each arm is missing, but it is impossible to know exactly how much because the zeros are ambiguous, as the small numerical sketch below illustrates.
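In the hypothetical single-arm example below (instance parameters are ours), a fraction of the conversions has not arrived yet at any finite time, so the plain average of the censored observations sits strictly below the true mean; this is the extra bias that the confidence bound constructed next has to absorb.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical single-arm illustration: true mean 0.6, heavy-tailed delays (tail index 0.3).
mu, alpha, t = 0.6, 0.3, 5000
C = rng.binomial(1, mu, size=t)                       # true, unobserved rewards
D = np.floor(rng.pareto(alpha, size=t)).astype(int)   # delays, possibly very long

# At round t, the pull made at round s has only revealed C_s * 1{D_s <= t - s};
# a zero may mean "no conversion" or "conversion still on its way".
waiting = np.arange(t, 0, -1)                         # elapsed time t - s for s = 1..t
censored_mean = (C * (D <= waiting)).mean()
print(f"censored mean {censored_mean:.3f} vs true mean {mu}")  # systematically below mu
```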
Nonetheless, we show that it is possible to prove high-probability confidence bounds for the parameters of the problem, provided that we correctly handle this extra bias due to the delays. For this purpose, we rely on Assumption 1, which gives us a loose global bound on the tails of the delay distributions. At a time $t \ge K + 1$, given the history of pulls and observed rewards $\mathcal{H}_t$, we define the mean estimator

$\hat\mu_i(t) \triangleq \frac{1}{T_i(t)} \sum_{u=1}^{t} X_{u,t-u}\,\mathbf{1}\{I_u = i\}$,     (4)

where $T_i(t) \triangleq \sum_{u=1}^{t} \mathbf{1}\{I_u = i\}$. The key ingredient of our algorithm is the upper confidence bound. Our first major result provides just that and is presented next.

Theorem 1. Let $i \in [K]$ and $\alpha > 0$ satisfy Assumption 1. Then for any $t > K$ and $\delta > 0$, with probability at least $1 - \delta$,

$|\hat\mu_i(t) - \mu_i| \le \left(\frac{2\log(2/\delta)}{T_i(t)}\right)^{1/2} + 2\,T_i(t)^{-(\alpha \wedge 1/2)}$.     (5)

Proof. The full proof is in Appendix A and relies on the following decomposition:

$|\hat\mu_i(t) - \mu_i| \le \left|\hat\mu_i(t) - \frac{1}{T_i(t)}\sum_{u=1}^{t} \tau_i(t-u)\,\mu_i\,\mathbf{1}\{I_u = i\}\right| + \left|\frac{1}{T_i(t)}\sum_{u=1}^{t} \tau_i(t-u)\,\mu_i\,\mathbf{1}\{I_u = i\} - \mu_i\right|$.

On the right-hand side, the first term is a standard deviation term. The probability that it is larger than $(2\log(2/\delta)/T_i(t))^{1/2}$ is uniformly bounded by $\delta/2$. The second term corresponds to the bias and is bounded by $2\,T_i(t)^{-(\alpha \wedge 1/2)}$, which comes from simply summing the $1 - \tau_i(t-u)$ terms in the worst case, i.e., when all pulls of arm $i$ are made in the last $T_i(t)$ rounds.

A clear benefit of the above result is a simple and easy-to-compute adaptive upper bound on the parameter $\mu_i$ for our estimator. This UCB is similar to the standard UCB of Auer et al. (2002), except for the extra bias term that goes to zero with the number of pulls. In fact, it adaptively trades off bias and variance as a function of $\alpha$: it is largest for small values of $\alpha \wedge 1/2$, that is, when delays have very large tails. Indeed, $\alpha$ plays an important role in our algorithm, presented in detail next.

Algorithm 1: Patient Bandits
Input: $\alpha > 0$, horizon $T$, number of arms $K$.
Initialisation: pull each arm once and set, for all $i \in [K]$, $T_i(t) = 1$; initialise $\hat\mu_i(t)$ according to Eq. (4).
for $t = K + 1, \ldots, T$ do
  pull arm $I_t \in \arg\max_{i \in [K]} \mathrm{UCB}_i(t)$
  observe all feedback updates $(X_{s,t-s})_{s < t}$
end for

Patient Bandits is described in Algorithm 1. It receives as input the parameter $\alpha > 0$, the horizon $T$, and the number of arms $K$, which we assume to be smaller than $T$. In the first phase of the game, all arms are pulled once. The player then pulls the arm from $[K]$ that has the highest UCB, as defined from Theorem 1,

$\mathrm{UCB}_i(t) \triangleq \hat\mu_i(t) + \left(\frac{2\log(2KT^3)}{T_i(t)}\right)^{1/2} + 2\,T_i(t)^{-(\alpha \wedge 1/2)}$.

That is, at each round $t > K$, the algorithm pulls an arm $I_t$ that maximises $\mathrm{UCB}_i(t)$. A minimal code sketch of this index is given below.
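The sketch assumes a hypothetical `pull(i)` environment call that returns the (reward, delay) pair of Eq. (1); variable names are ours, and the constants are those of the UCB displayed above.

```python
import numpy as np

def patient_bandits(pull, K, T, alpha):
    """Sketch of Algorithm 1. `pull(i)` is a hypothetical environment call returning
    (reward C, delay D) for arm i; a reward only becomes visible once its delay has
    elapsed, which is reproduced when recomputing the observed feedback below."""
    arms, rewards, delays = [], [], []                  # log of all past pulls
    for t in range(1, T + 1):
        if t <= K:
            it = t - 1                                  # initialisation: pull each arm once
        else:
            # feedback available at round t: X_{s, t-s} = C_s * 1{D_s <= t - s}
            X = np.array([c * (d <= t - s)
                          for s, (c, d) in enumerate(zip(rewards, delays), start=1)])
            pulled = np.array(arms)
            n = np.array([(pulled == i).sum() for i in range(K)])         # T_i(t)
            mu_hat = np.array([X[pulled == i].mean() for i in range(K)])  # Eq. (4)
            conf = np.sqrt(2 * np.log(2 * K * T ** 3) / n)                # deviation term
            bias = 2.0 * n ** (-min(alpha, 0.5))                          # delay-induced bias
            it = int(np.argmax(mu_hat + conf + bias))                     # UCB_i(t)
        c, d = pull(it)
        arms.append(it); rewards.append(c); delays.append(d)
    return arms
```

Plugging in a simulator that draws Bernoulli rewards and Pareto-type integer delays, as in Section 8, reproduces the delayed setting end-to-end.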
5. Analysis of Patient Bandits

We analyse Patient Bandits and provide a non-asymptotic lower bound for delayed bandits. We first give a problem-dependent upper bound on the regret of Patient Bandits. Its proof is in Appendix B and follows the lines of the usual analysis of UCB by Auer et al. (2002); see also Lattimore & Szepesvári (2019, Chapter 7).

Theorem 2. Let $T > K \ge 1$ and $\alpha > 0$. Let $(V_i, D_i)_{i \in [K]}$ be a problem as defined in Section 2 such that Assumption 1 holds. If Patient Bandits is run with parameters $P = (\alpha, T, K)$, its cumulative regret is bounded as

$R_T \le \sum_{i:\,\Delta_i > 0} \left[\frac{64\log(2T)}{\Delta_i} + \left(\frac{8}{\Delta_i}\right)^{\frac{1-\alpha}{\alpha} \vee 1}\right]$.

The only term in Theorem 2 that depends on $T$ is of the order of $\sum_{i:\,\Delta_i>0} \log(T)/\Delta_i$. It is of the same order as the classical bound for UCB, which is asymptotically optimal; see Lai & Robbins (1985). Note that this was expected, since Vernade et al. (2017) showed that delays should not have an asymptotic impact on the regret. [Footnote 2: Their result is stated for delays with finite expectation but remains valid in our setting.] Our bound has an additional term of order $\sum_{i:\,\Delta_i>0} (8/\Delta_i)^{\frac{1-\alpha}{\alpha} \vee 1}$. This term does not depend on $T$, so it is asymptotically negligible. But if $\alpha < 1/2$ and some of the gaps are very small, it can be large from a non-asymptotic perspective; see Section 7 for a discussion. We now provide a problem-independent upper bound.

Theorem 3. Let $T > K \ge 1$ and $\alpha > 0$. If Patient Bandits is run with parameters $P = (\alpha, T, K)$, then for any stochastic delayed instance such that Assumption 1 holds, it achieves

$R_T \le 2 \cdot 64^{(1-\alpha)\vee 1/2}\, T^{(1-\alpha)\vee 1/2}\, (K\log(2T))^{\alpha \wedge 1/2} + 2K$.

Up to logarithmic terms and multiplicative constants, the order of magnitude of this bound is $\max(\sqrt{KT},\, K^{\alpha} T^{1-\alpha})$. Whenever $\alpha \ge 1/2$, the order of the bound is $\sqrt{KT}$, as is the case for UCB (up to logarithmic terms) and for more refined algorithms like MOSS (Audibert & Bubeck, 2009) in the standard stochastic bandit setting (without delays). However, if the delays are allowed to be more heavy-tailed, i.e., $\alpha < 1/2$, then the regret starts degrading with $\alpha$, as the upper bound is of order $K^{\alpha} T^{1-\alpha}$. We prove that this degradation of the (problem-independent) regret is unavoidable.

Theorem 4. Consider $K = 2$, $T \ge K$, and $\alpha > 0$. There exists a Bernoulli stochastic delayed bandit problem satisfying Assumption 1 such that the expected regret $R_T$ of any algorithm on this problem is larger than $T^{1-\alpha}/8$.

Combining the above theorem with the standard problem-independent lower bound for standard bandits (Lattimore & Szepesvári, 2018), we get that the order of magnitude of the worst-case regret (over bandit instances satisfying Assumption 1) of any algorithm is larger than $\max(\sqrt{KT},\, T^{1-\alpha})$. This matches (up to logarithmic terms) the upper bound in Theorem 3 with respect to $T$ (though not with respect to $K$ whenever $\alpha < 1/2$).

6. Adaptation to α

Patient Bandits requires (a lower bound on) $\alpha$ as input. It is indeed natural to ask whether this prior information on the delays is necessary. In other words, can we design an algorithm that learns $\alpha$ as well, or adapts to the delays on-the-fly? How much would the regret be impacted? We now give a precise answer to these questions, both in the asymptotic and the non-asymptotic regime. In the latter, we prove a negative result in the general case. However, we propose a new assumption under which adaptivity is achievable.

6.1. Adaptation of the problem-dependent regret to α

We first study the possibility of adaptation in the asymptotic regime. An immediate corollary of Theorem 2 is as follows.

Corollary 1. Let $T > K \ge 1$ and consider a bandit problem with minimum gap $\Delta \triangleq \min_{k:\,\Delta_k>0} \Delta_k$, where all arms $k$ satisfy Assumption 1 for $\alpha_k$. Consider $T > e^{e^e}$ large enough so that (a) $\log\log(T)/\log(T) \le \min_i \alpha_i$ and (b) $8\Delta^{-1} \le \log T$. If Patient Bandits is run with parameters $((\log\log(t)/\log(t))_{t \le T}, T, K)$, then

$R_T \le \sum_{i:\,\Delta_i>0} \frac{128\log(2T)}{\Delta_i}$.

Therefore, for $T$ large enough depending on instance-dependent quantities, it is possible to run a slight variant of Patient Bandits that takes as input the sequence $(\alpha_t = \log\log(t)/\log(t))_{t \ge 1}$ instead of a fixed $\alpha$. As a result, for a fixed problem, asymptotically, knowing $\alpha$ is not necessary.

6.2. Impossibility result under Assumption 1 for adapting the problem-independent regret to α

For a fixed horizon $T < \infty$, however, this is a whole different story. Our second lower bound below states that if an input parameter $\alpha$ that is too small is given to a good algorithm, then it has a suboptimal regret, even in the simpler case where $K = 2$.
Specifically, we define the class $\mathcal{A}_\alpha$ of $\alpha$-optimal algorithms as the algorithms whose expected regret is smaller than $T^{1-\alpha}/8$ for all bandit instances satisfying Assumption 1 for a fixed $\alpha$.

Theorem 5. Consider $K = 2$, $T \ge K$, and fix $\alpha > 0$. For any $\beta \ge \alpha$, there exists a Bernoulli bandit instance satisfying Assumption 1 for $\beta$ such that the expected regret $R_T$ of any $\alpha$-optimal algorithm is larger than $T^{1-\alpha}/8 > T^{1-\beta}/8$.

In other words, an algorithm that performs optimally uniformly under Assumption 1 for a given $\alpha$ cannot at all adapt to $\beta \ge \alpha$.

6.3. New algorithm for adapting the problem-independent regret to α under more restrictive assumptions

Yet, is adaptivity a lost cause? Under the weak Assumption 1, Theorem 5 above is quite disheartening. A structural reason for this is that it is impossible to estimate $\alpha$ under this assumption. We show that under a slightly more restrictive assumption, this becomes possible.

Assumption 2. Assume that there exist $0 < c \le 1$ and $\underline\mu > 0$ such that $\min_k \mu_k > \underline\mu$ and, for all $i \in [K]$ and $m \in \mathbb{N}$,

$c\,m^{-\alpha} \le |1 - \tau_i(m)| \le m^{-\alpha}$.     (6)

Assume also that $\alpha \ge \underline\alpha$ for some $\underline\alpha > 0$.

This assumption does not mean that the delay distributions of the arms are all the same. It means that the parameter $\alpha$ now globally characterizes the tails of the delay distributions. The challenge for estimating it is that delays are only partially observable. To further explain Assumption 2, let us consider a small and a large delay, $d$ and $D$, such that $D > d > 0$. The conditional expectation of the difference of the same reward after respectively $d$ and $D$ time steps have passed is

$\mathbb{E}_{|I_t}[X_{t,D} - X_{t,d}] = \mu_{I_t}\tau_{I_t}(D) - \mu_{I_t}\tau_{I_t}(d) \in \left[c\,\mu_{I_t} d^{-\alpha} - \mu_{I_t} D^{-\alpha},\ \mu_{I_t} d^{-\alpha}\right]$,

where $\mathbb{E}_{|I_t}$ is the conditional expectation with respect to the arm $I_t$ pulled at time $t$, and where $c > 0$ comes from Assumption 2. Therefore, if $d \le (c/2)^{1/\alpha} D$, we now have that [Footnote 3: Note that $c \le 1$, so that with this definition of $d$ and $D$ we have $D \ge d$.]

$\mathbb{E}_{|I_t}[X_{t,D} - X_{t,d}] \in \left[\frac{c\,\underline\mu}{2}\, d^{-\alpha},\ d^{-\alpha}\right]$,

where $\underline\mu, \underline\alpha > 0$ are defined in Assumption 2. We can now see that it is possible to estimate $\alpha$ up to a logarithmic factor using the logarithm of an estimator of $\mu_i\tau_i(D) - \mu_i\tau_i(d)$, if we properly choose $d$ and $D$ for some arm $i$ sampled often enough. We formalize this idea next and start by introducing the necessary quantities, followed by the analysis.

In the rest of this section, we denote by $\bar I_t \triangleq \arg\max_k T_k(t)$ the most-pulled arm so far. We only use the samples of this arm to estimate $\alpha$ at each round. For a simpler notation, let $\overline T_t \triangleq T_{\bar I_t}(t)$ and, for some delay $D$, let

$m_{t,D} \triangleq \frac{1}{T_{\bar I_t}(t-D)} \sum_{s=1}^{t-D} X_{s,D}\,\mathbf{1}\{I_s = \bar I_t\}$

be the sample mean after waiting $D$ steps. For the estimate, we set $D_t \triangleq \overline T_t/2$ and $d_t \triangleq \lfloor (c/2)^{1/\underline\alpha} D_t \rfloor$. Subsequently, we define the estimator of $\alpha$ at round $t$ as

$\hat\alpha_t \triangleq \min\left(-\frac{\log\left(m_{t,D_t} - m_{t,d_t}\right)}{\log\left(\overline T_t\right)},\ 1\right)$.

Such an estimator is related to quantile- or CDF-based estimators used in extreme value theory (De Haan & Ferreira, 2007; Carpentier & Kim, 2015). We then set the lower confidence bound on it as

$\underline\alpha_t \triangleq \hat\alpha_t - \frac{\log\left(24\sqrt{\log(2KT^3)}/(c\,\underline\mu)\right)}{\log\left(\overline T_t\right)}\cdot$

Adapt-Patient Bandits then simply uses $\underline\alpha_t$ in place of $\alpha$ for the computation of the upper confidence bounds as in Eq. (5). The algorithm therefore does not need to know for which parameter $\alpha$ Assumption 2 is satisfied. We show the pseudocode of Adapt-Patient Bandits in Algorithm 2 and bound its expected regret in the following theorem.

Theorem 6. Let $T > K \ge 1$ and $\alpha, \underline\alpha, c, \underline\mu > 0$ be such that Assumption 2 holds. The expected regret of Adapt-Patient Bandits is bounded as

$R_T = \tilde O\!\left(K^{\alpha}\, T^{(1-\alpha)\vee 1/2}\right)$.

Algorithm 2: Adapt-Patient Bandits
Input: $c, \underline\alpha, \underline\mu, T, K$.
Initialisation: pull each arm twice.
for $t = 2K + 1, \ldots, T$ do
  pull the arm $I_t \in \arg\max_{i \in \{1,\ldots,K\}} \mathrm{UCB}_i(t-1)$, where the UCB uses the parameter $\underline\alpha_t$
  observe all individual feedback $(X_{s,t-s})_{s < t}$
end for

A minimal code sketch of the tail-index estimate $\hat\alpha_t$ and of its lower confidence bound $\underline\alpha_t$ is given below.
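In the sketch, the data layout and helper name are hypothetical (`X_rows[s][u]` stands for the observation $X_{s,u}$ of the $s$-th pull of the most-pulled arm), and the constants in the correction term simply mirror the formulas above; they should be read as indicative rather than as the authors' exact implementation.

```python
import numpy as np

def alpha_lower_bound(X_rows, c, alpha_min, mu_min, K, T):
    """Sketch of the tail-index estimate of Section 6.3, restricted to the most-pulled
    arm. The mean of X_rows[s][D] over old-enough pulls plays the role of m_{t,D}."""
    n = len(X_rows)                                    # number of pulls of that arm
    D = n // 2
    d = int((c / 2.0) ** (1.0 / alpha_min) * D)
    if d < 1:
        return alpha_min                               # not enough data yet
    m_D = np.mean([row[D] for row in X_rows[: n - D]])  # sample mean after waiting D steps
    m_d = np.mean([row[d] for row in X_rows[: n - d]])  # sample mean after waiting d steps
    gap = max(m_D - m_d, 1e-12)                        # guard against a non-positive estimate
    alpha_hat = min(-np.log(gap) / np.log(n), 1.0)
    correction = np.log(24 * np.sqrt(np.log(2 * K * T ** 3)) / (c * mu_min)) / np.log(n)
    # never go below the known lower bound alpha_min (a sketch choice)
    return max(alpha_hat - correction, alpha_min)      # used in place of alpha in the UCB
```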
Comparing the above with Theorem 3, notice that Adapt-Patient Bandits achieves a regret that depends on the unknown parameter $\alpha$ and is of the same order as the upper bound of Theorem 3 up to logarithmic terms. This algorithm is therefore minimax optimal, up to logarithmic terms, with respect to the lower bound derived in Section 5.

7. Discussion

Comparison to bandits without delays. As discussed in Section 4, the major difference between Patient Bandits and the standard UCB algorithm is the extra bias term. In standard bandits, we have strictly more information than in our setting. When rewards are delayed, the learner has to deal with temporarily missing data: some actions have been taken but their rewards are missing until the delay has passed. In particular, when $\alpha < 1$, the delays are heavy-tailed, so their expectation is infinite, and this buffer of missing data always keeps growing in size with $T$, creating a non-negligible bias in the estimators. This difference is even more important when $\alpha < 1/2$, which corresponds to the situation where the bias term is larger than the standard deviation term. For such an instance, it is clear that a standard UCB would have a linear regret, and we show this in Section 8.

We now discuss the optimality of Patient Bandits by commenting on both our problem-dependent (Theorem 2) and problem-independent (Theorem 3) guarantees. The problem-dependent bound for our setting is of the order of $\sum_{k:\,\Delta_k>0} \log(T)/\Delta_k$. Up to a term that depends only on the $(\Delta_k)_k$ (and not on $T$, see Section 5), this is of the same order as the problem-dependent bound in standard bandits (Lai & Robbins, 1985), and it does not depend on the delay distributions, e.g., it does not depend on $\alpha$. On the other hand, the problem-independent bound is of order $T^{(1-\alpha)\vee 1/2} K^{\alpha\wedge 1/2}$ up to logarithmic terms. This is light-years different from the problem-independent bound for standard bandits, which is of order $\sqrt{KT}$; see Lattimore & Szepesvári (2019, Chapter 7). In particular, the bound is of the same order when $\alpha \ge 1/2$, and is larger when $\alpha < 1/2$. However, Theorem 4 ensures that this rate is minimax optimal with respect to $T$, up to a multiplicative term $K^{\alpha\wedge 1/2}$ and some logarithmic terms. This means that this gap is the price to pay for having delays that are potentially long and can therefore pose strong bias issues.

Parameters. Patient Bandits needs $\alpha$ as input, which is the only external prior information on the delays. It is a more delicate question to decide whether this prior information is necessary. Section 6 treats this subject extensively. In a nutshell, from an asymptotic perspective, the knowledge of $\alpha$ is not needed, but from a non-asymptotic perspective it is. Nonetheless, under a slightly stronger hypothesis on the delay distributions, see Assumption 2, there exists a fully adaptive algorithm, Adapt-Patient Bandits, with sublinear regret (Theorem 6) that matches the regret of Patient Bandits up to a logarithmic term.

8. Experiments

We now evaluate the empirical performance of Patient Bandits. Throughout the section, we provide experiments where the delays of each arm $i$ follow a Pareto Type I distribution with tail index $\alpha_i$, and the rewards of each arm $i$ follow a Bernoulli distribution with parameter $\mu_i$ (a sketch of such an instance generator is given below).
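The helper below (names are ours) shows one way to draw such an instance; the inverse-CDF step is the standard sampler for Pareto Type I variables and the flooring makes the delays integer-valued.

```python
import numpy as np

rng = np.random.default_rng(2)

def pareto_type1_delays(alpha_i, size, x_min=1.0):
    """Integer Pareto Type I delays: for the continuous variable,
    P(D > m) = (x_min / m)^alpha_i for m >= x_min, so the tail bound of
    Assumption 1 holds with exponent alpha_i."""
    return np.floor(x_min * (1.0 - rng.random(size)) ** (-1.0 / alpha_i)).astype(int)

# hypothetical sampler for the two-arm instance of Section 8.1
mu = np.array([0.5, 0.55])          # Bernoulli reward means
alphas = np.array([1.0, 0.3])       # per-arm tail indices
T = 3000
rewards = [rng.binomial(1, m, size=T) for m in mu]
delays = [pareto_type1_delays(a, size=T) for a in alphas]
```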
First, we investigate the performance of Patient Bandits with respect to the parameters of the instance, $(\alpha_i, \Delta_i)_{i \in [K]}$, and to the hyperparameter of the algorithm, which we here denote by $\bar\alpha$ to avoid confusion. Second, we compare to the strongest baseline that deals with partially observed delays, which is the censored version of D-UCB of Vernade et al. (2017) with various threshold windows. Their algorithm takes two parameters, the threshold $m$ and the CDF $\tau$ of the delays. [Footnote 4: Vernade et al. (2017) assumed that the delay distributions are known and homogeneous across arms.] The threshold $m$ calibrates the time that the algorithm waits before updating the reward.

8.1. Impact of the instance properties

Study of $\bar\alpha$. Patient Bandits takes $\bar\alpha$ as an input. The choice of this parameter is key for the implementation of Patient Bandits. Ideally, we would like to take $\bar\alpha = \min_i \alpha_i$, but in the absence of information on the delay distributions we cannot. We therefore illustrate the sensitivity of our method to the miscalibration of $\bar\alpha$. We consider a two-arm setting with horizon $T = 3000$, with arm means $\mu = (0.5, 0.55)$ and with tail indices $\alpha_1 = 1$ and $\alpha_2 = 0.3$, respectively. We consider $\bar\alpha \in [0.02, 0.5]$ and in Figure 2 show the regret as a function of $\bar\alpha$.

Figure 2. Regret at round $T = 3000$ of Patient Bandits as a function of $\bar\alpha \in [0.02, 0.5]$ for the bandit instance $\mu = (0.5, 0.55)$, $\alpha_1 = 1$, and $\alpha_2 = 0.3$. Results are averaged over 400 runs.

Note that the regret first decreases with $\bar\alpha$ and then increases after approximately $\bar\alpha = 0.3$. This is precisely what is expected. For small $\bar\alpha$, the algorithm is consistent but explores too much, and this induces a large regret. For $\bar\alpha$ larger than $\min_i \alpha_i$, the regret starts to increase again, since the bias coming from the delays is not sufficiently taken into account by the UCB.

Study of $(\alpha_i, \Delta_i)_{i \in [K]}$. We now investigate the dependency of the regret of Patient Bandits on the delay parameters and arm gaps $(\alpha_i, \Delta_i)_{i \in [K]}$. We consider the following two-arm instances: we set $\mu = (0.4, 0.4 + \Delta)$ where we take $\Delta \in [0.02, 0.6]$, fixing the horizon to $T = 3000$. For each problem, we respectively choose $\alpha_1 = 1$ and $\alpha_2 \in \{0.2, 0.3, 0.4, 0.5, 0.8\}$, and run the Patient Bandits policy with the optimal parameter $\bar\alpha = \alpha_2$, so that we can see the impact of $\alpha_2$ and $\Delta$ independently from calibration issues. The results presented in Figure 3 display the influence of the arm gap on the regret, for various values of $\alpha_2$.

Figure 3 illustrates a standard phenomenon in stochastic bandits, which also holds for delayed bandits: for small values of the arm gap $\Delta$, the regret increases with $\Delta$. This corresponds to the fact that for small $\Delta$, the algorithm explores and is not able to focus on the most promising arms since the arm means are too close. Then, at some point, for larger values of $\Delta$ the regret starts decreasing, as predicted by the bound in Theorem 2. A phenomenon that is specific to delayed bandits is that the smaller $\alpha_2$, the larger the regret. This is expected from Theorem 3, since the smaller $\alpha_2$, the more delayed the rewards, and the harder the problem. A more subtle phenomenon, also illustrating Theorem 3, is that the smaller $\alpha_2$, the larger the value and the position of the maximum of each curve, the maximum of the regret being bounded by the problem-independent bound, which depends here on $\bar\alpha = \alpha_2$.
Figure 3. Regret of Patient Bandits as a function of the arm gap $\Delta \in [0.02, 0.6]$, where the bandit problem is characterized by $\mu = (0.5, 0.5 + \Delta)$ and $\alpha_1 = 1$, and where $\alpha_2 \in \{0.2, 0.3, 0.4, 0.5, 0.8\}$; each curve corresponds to a different value of $\alpha_2$. For each problem, Patient Bandits was run with horizon $T = 3000$ and with parameter $\bar\alpha = \alpha_2$.

8.2. Comparison with D-UCB

We compare here the regret of Patient Bandits with the one of D-UCB as a function of the horizon $T$, for different values of their respective parameters $\bar\alpha$ and $m$. Since D-UCB was designed for the context where the distribution of the delays is the same for all arms, i.e., $\alpha_i$ identical over arms, we consider that scenario as well as the more general case with heterogeneous $\alpha_i$'s.

Homogeneous delay distributions across arms. In the first scenario, we consider a two-armed bandit problem with means $\mu = (0.6, 0.8)$. We set the same tail index for both arms, i.e., $\alpha_1 = \alpha_2 = 0.7$. We run Patient Bandits for $\bar\alpha \in \{0.1, 0.5\}$. For D-UCB, we consider various threshold parameters $m \in \{10, 50, 100, 200\}$ and feed D-UCB with the exact delay distribution of the arms, which gives D-UCB an important edge over our algorithm. The results are displayed in Figure 4 (regret as a function of time). The performances of D-UCB and Patient Bandits are comparable, in particular in the case of good calibration of the parameters, i.e., respectively $\bar\alpha = 0.5$ and $m = 50$. Patient Bandits performs slightly worse in the best case, but note that D-UCB is tuned with full knowledge of the CDF of the delays. An observation coming from Figure 4 is the presence of long-lasting linear phases at the initial stage of learning of D-UCB for large $m$, which tend to be caught up over time. This comes from the structure of the algorithm and from the fact that it has to wait until $m + K$ time steps before it starts exploiting the observations, which is not the case for our strategy.

Figure 4. Regret of D-UCB and Patient Bandits for $\mu = (0.6, 0.8)$ and with homogeneous delay distributions characterised by $\alpha_1 = \alpha_2 = 0.7$. We plot results for Patient Bandits with parameters $\bar\alpha \in \{0.1, 0.5\}$, and for D-UCB with parameters $m \in \{10, 50, 100, 200\}$.

Non-homogeneous delay distributions. In the second scenario, we still consider a two-armed bandit problem with means $\mu = (0.6, 0.8)$, and we set the parameters of the tail distribution of the delays to $\alpha_1 = 1$ and $\alpha_2 = 0.3$. This is a difficult scenario, since arm 2, which has the highest mean, also has the lowest delay parameter. This means that its delays are more heavy-tailed. We consider as before Patient Bandits with parameters $\bar\alpha \in \{0.1, 0.5\}$, and D-UCB with threshold parameters $m \in \{10, 50, 100, 200\}$. Regarding the CDF parameter of D-UCB, we provide a Pareto distribution with parameter 0.7. The results for all policies are displayed in Figure 5 (regret as a function of the horizon $T$). We observe that D-UCB has a very high regret, increasing linearly with $T$.

Figure 5. Regret of D-UCB and Patient Bandits for $\mu = (0.6, 0.8)$ and with delay distributions that vary across arms, characterized by $\alpha_1 = 1$ and $\alpha_2 = 0.3$. We plot results for Patient Bandits with parameters $\bar\alpha \in \{0.1, 0.5\}$, and for D-UCB with parameters $m \in \{10, 50, 100, 200\}$.

This can be explained by the fact that D-UCB cannot adapt to delay distributions that vary across arms. In other words, it does not take the heterogeneity of the delays into account and can be confused by this difficult situation where the best arm also corresponds to the longest delays.
Consequently, D-UCB focuses only on observations that are substantially biased and misidentifies the best arm. On the other hand, Patient Bandits adapts to the heterogeneous delays and manages to identify the best arm, leading to a sub-linear regret.

8.3. Performance of Adapt-Patient Bandits

We now compare the performance of the adaptive Adapt-Patient Bandits to that of Patient Bandits run with several values of the parameter $\bar\alpha$. As before, we consider a two-arm bandit instance with mean parameters $\mu = (0.6, 0.8)$. The delay distribution of the second arm is Pareto with parameter $\alpha_2 = \alpha = 0.3$. The first arm has its delay distribution characterised by the CDF

$\tau_1(m) = \begin{cases} 1 - 0.7/2^{\alpha} & \text{if } m = 0, \\ 1 - 0.7/(1+m)^{\alpha} & \text{if } m \in \mathbb{N}^*, \end{cases}$

where we recall that $\alpha = 0.3$, so that Assumption 2 is satisfied with $c = 0.7$, but the delays of the optimal arm are stochastically dominated by those of the sub-optimal arm. We compare the algorithm Adapt-Patient Bandits launched with parameters $(\underline\alpha = 0.1, c = 0.7, \underline\mu = 0.6)$ with the algorithm Patient Bandits launched with three different parameters $\bar\alpha \in \{0.1, 0.3, 0.6\}$. The maximal horizon is here $T = 10000$ and the results are averaged over 100 runs. Figure 6 displays the regret of the algorithms as a function of time, and Figure 7 displays the estimated lower bound $\underline\alpha_t$ on $\alpha$ used by Adapt-Patient Bandits.

Figure 6. Regret of Adapt-Patient Bandits and Patient Bandits as a function of time, for $\mu = (0.6, 0.8)$ and the delay distributions described in Section 8.3. Patient Bandits was run for $\bar\alpha \in \{0.1, 0.3, 0.6\}$ and Adapt-Patient Bandits with parameters $(\underline\alpha = 0.1, c = 0.7, \underline\mu = 0.6)$.

Figure 7. Lower bound estimate $\underline\alpha_t$ on $\alpha$ from Adapt-Patient Bandits, in the same instance as in Figure 6.

The lower bound estimate $\underline\alpha_t$ eventually converges towards the right quantity, which is $\alpha = 0.3$. However, this takes some time, so that the performance of Adapt-Patient Bandits is slightly worse than that of Patient Bandits run with the oracle parameter $\bar\alpha = 0.3$. On the other hand, the performance of Adapt-Patient Bandits is better than that of Patient Bandits run with the non-oracle parameters $\bar\alpha \in \{0.1, 0.6\}$, indicating that Adapt-Patient Bandits is more flexible than Patient Bandits when Assumption 2 holds.

Conclusion

In this paper, we extend the problem of learning with bandit feedback and partially observable delays to arm-dependent delay distributions with possibly unbounded expectations. We close many existing open problems left by Vernade et al. (2017; 2018), either with positive answers (Theorem 2) or negative answers (Theorem 5). The major difficulty faced by the learner in this setting is the identifiability issue due to missing rewards, which induces a bias in the estimator of the real payoff. Under the assumption that the tail distribution of the delays is bounded - although it might be very heavy-tailed - we designed a very simple UCB-based algorithm, termed Patient Bandits. We proved that Patient Bandits performs almost as well as the standard UCB in the classical, non-delayed case, from a problem-dependent point of view. We also studied the problem of adaptivity to the delay distributions and concluded that it is not possible (Theorem 5) unless a global two-sided bound on the tails holds (Assumption 2). Closing the gap between the problem-independent upper bound and the lower bound may constitute the object of future studies.
Acknowledgements. The work of A. Carpentier is partially supported by the Deutsche Forschungsgemeinschaft (DFG) Emmy Noether grant MuSyAD (CA 1488/1-1), by the DFG GRK 2297 MathCoRe (314838170), by the DFG GRK 2433 DAEDALUS (384950143/GRK 2433), by the DFG CRC 1294 "Data Assimilation", Project A03, by the UFA-DFH through the French-German Doktorandenkolleg CDFA 01-18, and by the SFI Sachsen-Anhalt for the project RE-BCI. The work of A. Manegueu is supported by the Deutsche Forschungsgemeinschaft (DFG) CRC 1294 "Data Assimilation", Project A03.