# Stochastic Multi-Armed Bandits with Unrestricted Delay Distributions

Tal Lancewicki\*¹, Shahar Segal\*¹, Tomer Koren¹ ², Yishay Mansour¹ ²

\*Equal contribution. ¹Blavatnik School of Computer Science, Tel Aviv University, Israel. ²Google Research, Tel Aviv. Correspondence to: Tal Lancewicki, Shahar Segal. Proceedings of the 38th International Conference on Machine Learning, PMLR 139, 2021.

We study the stochastic Multi-Armed Bandit (MAB) problem with random delays in the feedback received by the algorithm. We consider two settings: the reward-dependent delay setting, where realized delays may depend on the stochastic rewards, and the reward-independent delay setting. Our main contribution is algorithms that achieve near-optimal regret in each of the settings, with an additional additive dependence on the quantiles of the delay distribution. Our results do not make any assumptions on the delay distributions: in particular, we do not assume they come from any parametric family of distributions and allow for unbounded support and expectation; we further allow for infinite delays where the algorithm might occasionally not observe any feedback.

## 1 Introduction

The stochastic Multi-Armed Bandit (MAB) problem is a theoretical framework for studying sequential decision making. Most of the literature on MAB assumes that the agent observes feedback immediately after taking an action. However, in many real-world applications, the feedback might be available only after a period of time. For instance, in clinical trials, the observed effect of a medical treatment often arrives with a delay that may vary between treatments. Another example is in targeted advertising on the web: when a user clicks a display ad the feedback is immediate, but if a user decides not to click, then the algorithm becomes aware of that only once the user has left the website or enough time has elapsed.

In this paper, we study the stochastic MAB problem with randomized delays (Joulani et al., 2013). The reward of the chosen action at time $t$ is sampled from some distribution, like in the classic stochastic MAB problem. However, the reward is observed only at time $t + d_t$, where $d_t$ is a random variable denoting the delay at step $t$. This problem has been studied extensively in the literature (Joulani et al., 2013; Vernade et al., 2017; Pike-Burke et al., 2018; Gael et al., 2020) under an implicit assumption that the delays are reward-independent: namely, that $d_t$ is sampled from an unknown delay distribution and may depend on the chosen arm, but not on the stochastic rewards of the same round. For example, Joulani et al. (2013) and Pike-Burke et al. (2018) show a regret bound of the form $O(R^{\text{MAB}}_T + K\,\mathbb{E}[D])$. Here $R^{\text{MAB}}_T$ denotes the optimal instance-dependent $T$-round regret bound for standard (non-delayed) MAB, $R^{\text{MAB}}_T = \sum_{i:\Delta_i>0} \log(T)/\Delta_i$, where $\Delta_i$ is the sub-optimality gap of arm $i$. In the second term, $K$ is the number of arms and $\mathbb{E}[D]$ is the expected delay.

A significantly more challenging setting, which to the best of our knowledge was not explicitly addressed previously in the literature, is that of reward-dependent delays. (Some of the results of Vernade et al. (2017) and Gael et al. (2020) can be viewed as having a specific form of reward-dependent delays; we discuss this in more detail in the related work section.) In this setting, the random delay at each round may also depend on the reward received on the same round; in other words, they are drawn together from a joint distribution over rewards and delays. This scenario is motivated by both of the examples mentioned earlier: e.g., in targeted advertisement the delay associated with a certain user is strongly correlated with the reward she generates (i.e., click or no click); and in clinical trials, the delay often depends on the effect of the applied treatment, as some side-effects take longer than others to surface.

In contrast to the reward-independent case, with reward-dependent delays the observed feedback might give a biased impression of the true rewards. Namely, the expectation of the observed reward can be very different from the actual expected reward. For example, consider Bernoulli rewards. If the delays given reward 0 are shorter than the delays given reward 1, then the observed reward will be biased towards 0. Even worse, the direction of the bias can be opposite between different arms. Hence, as long as the fraction of unobserved feedback is significant, the expected observed reward of the optimal arm can be smaller than the expected observed reward of a sub-optimal arm, which makes the learning task substantially more challenging.
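To see this bias numerically, here is a small simulation sketch (our own illustration; the horizon, the 5,000-round delay, and the Bernoulli mean are arbitrary choices, not values from the paper) in which feedback is delayed only when the realized reward is 1:

```python
import numpy as np

rng = np.random.default_rng(0)
T = 10_000
true_mean = 0.5          # a single Bernoulli arm
long_delay = 5_000       # delay applied only when the realized reward is 1

rewards = rng.binomial(1, true_mean, size=T)
# reward-dependent delays: feedback for reward 0 arrives immediately,
# feedback for reward 1 arrives only after `long_delay` rounds
delays = np.where(rewards == 1, long_delay, 0)
arrival = np.arange(T) + delays

observed = rewards[arrival < T]   # feedback that has arrived by round T
print(f"true mean: {true_mean:.2f}, observed mean: {observed.mean():.2f}")
```

With these numbers, roughly half of the reward-1 feedback has not arrived by round $T$, so the empirical mean of the observed feedback is close to $1/3$ rather than the true mean of $1/2$.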
### 1.1 Our contributions

We consider both the reward-independent and reward-dependent versions of stochastic MAB with delays. In the reward-independent case we give new algorithms whose regret bounds significantly improve upon the state of the art, and we also give instance-dependent lower bounds demonstrating that our algorithms are nearly optimal. In the reward-dependent setting, we give the first algorithm to handle such a delay structure and the potential bias in the observed feedback that it induces. We provide both an upper bound on the regret and a nearly matching general lower bound.

**Reward-independent delays:** We first consider the easier reward-independent case. In this case, we provide an algorithm whose regret bound has a second term that scales with a quantile of the delay distribution rather than with its expectation: the regret is bounded by $O(\min_q\{R^{\text{MAB}}_T/q + d(q)\})$, where $d(q)$ is the $q$-quantile of the delay distribution. Specifically, choosing the median (i.e., $q = 1/2$) yields a regret bound of $O(R^{\text{MAB}}_T + d(1/2))$. We thus improve over the $O(R^{\text{MAB}}_T + K\,\mathbb{E}[D])$ regret bound of Joulani et al. (2013) and Pike-Burke et al. (2018), since by Markov's inequality the median is always at most twice the expected delay, $d(1/2) \le 2\,\mathbb{E}[D]$. Moreover, the increase in regret due to delays in our bound does not scale with the number of arms, so the improvement is significant even with fixed delays (Dudik et al., 2011; Joulani et al., 2013). Our bound is achieved using a remarkably simple algorithm, based on a variant of Successive Elimination (Even-Dar et al., 2006). For this algorithm, we also prove a more delicate regret bound for arm-dependent delays that allows for choosing different quantiles $q_i$ for different arms $i$ (rather than a single quantile $q$ for all arms simultaneously).
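As a rough illustration of the flavor of such an approach, the following is a generic Successive-Elimination-style test that uses only the feedback that has already arrived. This is a minimal sketch; the function name, confidence widths, elimination schedule, and constants below are our own choices and not those of the paper's algorithm.

```python
import math

def surviving_arms(active, n_obs, mu_hat, t, delta=0.01):
    """One elimination test of a Successive-Elimination-style algorithm.

    active : set of arms still in play
    n_obs  : dict arm -> number of feedback samples that have *arrived* so far
    mu_hat : dict arm -> empirical mean of the arrived samples
    Keeps every arm whose upper confidence bound is at least the best
    lower confidence bound among the active arms.
    """
    def radius(i):
        n = max(n_obs[i], 1)  # guard: no feedback from arm i has arrived yet
        return math.sqrt(2 * math.log(4 * len(n_obs) * t * t / delta) / n)

    best_lcb = max(mu_hat[i] - radius(i) for i in active)
    return {i for i in active if mu_hat[i] + radius(i) >= best_lcb}
```

Because only arrived feedback enters the counts and empirical means, arms with heavily delayed feedback simply keep wider intervals and are eliminated later, which is, intuitively, where a quantile of the delay distribution enters the analysis.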
The intuition for why the increase in regret due to delays should scale with a certain quantile is fairly straightforward: consider for instance the median of the delay, $d_M$, and for simplicity assume that the delay value is known when we take the action. One can simulate a black-box algorithm for delays bounded by $d_M$ on the rounds in which the delay is smaller than $d_M$ (approximately half of the rounds), and on the rest of the rounds, imitate the last action of the black-box algorithm. Since rewards are stochastic, and independent of time and of the delay, the regret on rounds with delay larger than $d_M$ is similar to the regret of the black-box algorithm on the rest of the rounds, resulting in a total regret of at most twice the regret of the black-box algorithm. For example, when using the algorithm of Joulani et al. (2013), this would give us $O(R^{\text{MAB}}_T + K d_M)$. We stress that, unlike this reduction, our algorithm does not need to know the value of the delay at any time, nor the median or any other quantile. In addition, our bound is much stronger and does not depend on $K$ in the second term.

**Reward-dependent delays:** We then proceed to consider the more challenging reward-dependent setting. In this setting, the feedback reveals much less information on the true rewards due to the selection bias in the observed rewards; in other words, the distributions of the observed feedback and of the unobserved feedback might be very different. In order to deal with this uncertainty, we present another algorithm, also inspired by Successive Elimination, which widens the confidence bounds in order to handle the potential bias. We achieve a regret bound of the form $O(R^{\text{MAB}}_T + \log(K)\, d(1-\Delta_{\min}/4))$, where $\Delta_{\min}$ is the minimal sub-optimality gap and $d(\cdot)$ is the quantile function of the marginal delay distribution. We show that this bound is nearly optimal by presenting a matching lower bound, up to a multiplicative factor in the second term (and $\log(K)$ factors).

**Summary and comparison of bounds:** Our main results, along with a concise comparison to previous work, are presented in Table 1; $G_{T,i}$ denotes the maximal number of unobserved feedbacks from arm $i$. The results show that our algorithm works well even under heavy-tailed delay distributions, including some distributions with infinite expected value. For example, the arm-dependent delay distributions used by Gael et al. (2020) are all bounded by an $\alpha$-Pareto distribution (in terms of the delay distributions' CDFs); hence, their median is bounded by $2^{1/\alpha}$. Our algorithm suffers at most an additional $O(2^{1/\alpha})$ over the classical regret for MAB without delays (see the bounds for the $\alpha$-Pareto case in Table 1). In the packet loss setting, the delay is 0 with probability $p$, and $\infty$ (or $T$) otherwise. If $p$ is a constant (e.g., $p > 1/4$), our regret bound scales as the optimal regret bound for MAB without delays, up to constant factors. In contrast, the bound of Joulani et al. (2013) scales with the number of missing samples and is thus linear in this setting, and a Pareto distribution that bounds such delays would require a very small parameter $\alpha$, which also results in a linear regret bound via the result of Gael et al. (2020).

Table 1. Regret bounds comparison of this and previous works. The bounds in this table omit constant and $\log(K)$ factors.

| Setting | Previous work | This paper |
| --- | --- | --- |
| General, reward-independent | $R^{\text{MAB}}_T + \sum_i \mathbb{E}[G_{T,i}]$, $R^{\text{MAB}}_T + K\,\mathbb{E}[D]$ (Joulani et al., 2013) | $\min_q\{R^{\text{MAB}}_T/q + d(q)\}$ |
| Fixed delay $d$ | $R^{\text{MAB}}_T + Kd$ (Joulani et al., 2013); $Kd$ (Dudik et al., 2011) | $R^{\text{MAB}}_T + d$ |
| $\alpha$-Pareto delays | (Gael et al., 2020) | $R^{\text{MAB}}_T + 2^{1/\alpha}$ |
| Packet loss | $(1-p)T$ (Joulani et al., 2013) | $R^{\text{MAB}}_T/p$ |
| General, reward-dependent | | $R^{\text{MAB}}_T + d(1-\Delta_{\min})$ |
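To spell out these two examples, assume the standard $\alpha$-Pareto CDF $F(x) = 1 - x^{-\alpha}$ for $x \ge 1$ (this exact parametrization is our assumption) and the quantile convention $d(q) = \inf\{x : \Pr[D \le x] \ge q\}$. Then

$$
F(x) \ge \tfrac{1}{2} \iff x \ge 2^{1/\alpha},
$$

so any delay distribution whose CDF dominates $F$ has median $d(1/2) \le 2^{1/\alpha}$, and taking $q = 1/2$ in $\min_q\{R^{\text{MAB}}_T/q + d(q)\}$ gives at most $2R^{\text{MAB}}_T + 2^{1/\alpha}$. Likewise, under packet loss, $\Pr[D = 0] = p$ implies $d(q) = 0$ for every $q \le p$, so taking $q = p$ gives a bound of $R^{\text{MAB}}_T/p$, which is the non-delayed rate up to a constant whenever $p$ is bounded away from zero.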
### 1.2 Related work

To the best of our knowledge, Dudik et al. (2011) were the first to consider delays in stochastic MAB. They examine contextual bandits with a fixed delay $d$, and obtain a regret bound of $O\bigl(\sqrt{K\log(NT)(d+T)}\bigr)$, where $N$ is the number of possible policies. Joulani et al. (2013) use a reduction to non-delayed MAB; for their explicit bound they assume that the expected value of the delay is bounded (see Table 1 for their implicit bound). Pike-Burke et al. (2018) consider a more challenging setting in which the learner observes only the sum of the rewards that arrive at the same round. They assume that the expected delay is known, and obtain a bound similar to that of Joulani et al. (2013). Vernade et al. (2017) study partially observed feedback, where the learner cannot distinguish between a reward of 0 and feedback that has not yet arrived, which is a special form of reward-dependent delay. However, they assume a bounded expected delay and full knowledge of the delay distribution. Gael et al. (2020) also consider partially observed feedback, and aim to relax the bounded-expected-delay assumption. They consider delay distributions whose CDFs are bounded from below by the CDF of an $\alpha$-Pareto distribution, which might have infinite expected delay for $\alpha \le 1$. However, this assumption still limits the distribution; e.g., the commonly examined fixed delay falls outside their setting. Moreover, they assume that the parameter $\alpha$ is known to the learner. Other extensions include Gaussian Process bandit optimization (Desautels et al., 2014) and linear contextual bandits (Zhou et al., 2019). As opposed to most of these works, we place no assumptions on the delay distribution, and the learner has no prior knowledge of it.

Delays were also studied in the context of the non-stochastic MAB problem (Auer et al., 2002b). Generally, when rewards are chosen in an adversarial fashion, the regret increases by a multiplicative factor depending on the delay. Under full information, Weinberger & Ordentlich (2002) show a regret bound of $O(\sqrt{dT})$ with a fixed delay $d$. This was extended to bandit feedback by Cesa-Bianchi et al. (2019), with a near-optimal regret bound of $O(\sqrt{T(K+d)})$. Several works have studied the effect of adversarial delays, in which the regret scales with $O(\sqrt{D})$, where $D$ is the sum of delays (Thune et al., 2019; Bistritz et al., 2019; Zimmert & Seldin, 2020; György & Joulani, 2020). Lastly, Cesa-Bianchi et al. (2018) consider a setting similar to that of Pike-Burke et al. (2018), in which the learner observes only the sum of rewards; there too, the regret increases by a multiplicative factor that grows with the delay.

## 2 Problem Setup and Background

We consider a variant of the classical stochastic Multi-Armed Bandit (MAB) problem. In each round $t = 1, 2, \ldots, T$, an agent chooses an arm $a_t \in [K]$ and receives a reward $r_t(a_t)$, where $r_t(\cdot) \in [0,1]^K$ is a random vector. Unlike the standard MAB setting, the agent does not immediately observe $r_t(a_t)$ at the end of round $t$; rather, only after $d_t(a_t)$ rounds (namely, at the end of round $t + d_t(a_t)$) is the tuple $(a_t, r_t(a_t))$ received as feedback. We stress that neither the delay $d_t(a_t)$ nor the round number $t$ is observed as part of the feedback (so the delay cannot be deduced directly from the feedback). The delay is supported in $\mathbb{N} \cup \{\infty\}$; in particular, we allow $d_t(a_t)$ to be infinite, in which case the associated reward is never observed. The pairs of vectors $\{(r_t(\cdot), d_t(\cdot))\}_{t=1}^{T}$ are sampled i.i.d. from a joint distribution. Throughout the paper we sometimes abuse notation and denote $r_t(a_t)$ and $d_t(a_t)$ simply by $r_t$ and $d_t$, respectively. This protocol is summarized in Protocol 1.

**Protocol 1: MAB with stochastic delays.** For each round $t \in [T]$:
1. The agent picks an action $a_t \in [K]$.
2. The environment samples a pair $(r_t(\cdot), d_t(\cdot))$ from a joint distribution.
3. The agent gets a reward $r_t(a_t)$ and observes the feedback $\{(a_s, r_s(a_s)) : t = s + d_s(a_s)\}$.
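For concreteness, the following is a minimal Python simulator of the interaction in Protocol 1. The class, the sampler, and all numbers (arm means, delay law, horizon) are our own illustrative choices, not part of the paper.

```python
import numpy as np
from collections import defaultdict

class DelayedBanditEnv:
    """Minimal simulator of Protocol 1: feedback (a_s, r_s(a_s)) arrives at round s + d_s(a_s)."""

    def __init__(self, sampler, horizon):
        self.sampler = sampler          # callable: round t -> (reward_vec, delay_vec)
        self.horizon = horizon
        self.inbox = defaultdict(list)  # arrival round -> list of (arm, reward) pairs

    def step(self, t, arm):
        rewards, delays = self.sampler(t)       # joint sample (r_t(.), d_t(.))
        arrival = t + delays[arm]
        if np.isfinite(arrival) and arrival <= self.horizon:
            self.inbox[int(arrival)].append((arm, rewards[arm]))
        # the agent sees only (arm, reward) pairs whose delay has elapsed by round t;
        # neither the delay nor the originating round is revealed
        return self.inbox.pop(t, [])

# example: reward-independent delays with K = 2 Bernoulli arms (illustrative numbers)
def sampler(t, rng=np.random.default_rng(0)):
    rewards = rng.binomial(1, [0.4, 0.6]).astype(float)
    delays = rng.geometric(0.1, size=2).astype(float)   # any delay distribution works here
    return rewards, delays

env = DelayedBanditEnv(sampler, horizon=1000)
feedback = [env.step(t, arm=t % 2) for t in range(1, 1001)]
```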
We discuss two forms of stochastic delays: (i) reward-independent delays, where the vectors $r_t(\cdot)$ and $d_t(\cdot)$ are independent of each other; and (ii) reward-dependent delays, where there is no restriction on the joint distribution.

The performance of the agent is measured, as usual, by the difference between the algorithm's cumulative expected reward and the best possible total expected reward of any fixed arm. This is known as the expected pseudo-regret, formally defined by

$$
R_T = \max_{i \in [K]} \mathbb{E}\Bigg[\sum_{t=1}^{T} \bigl(r_t(i) - r_t(a_t)\bigr)\Bigg],
$$

where $\mu_i$ is the mean reward of arm $i$, $i^\star$ denotes the optimal arm, and $\Delta_i = \mu_{i^\star} - \mu_i$ for all $i \in [K]$.

For a fixed algorithm for the agent (the relevant algorithm will always be clear from the context), we denote by $m_t(i)$ the number of times it chose arm $i$ by the end of round $t-1$. Similarly, $n_t(i)$ denotes the number of observed feedbacks from arm $i$ by the end of round $t-1$. The two might differ, as some of the feedback is delayed. Let $\hat{\mu}_t(i)$ be the observed empirical average of arm $i$, defined as

$$
\hat{\mu}_t(i) = \frac{1}{n_t(i)} \sum_{s \,:\, s + d_s(a_s) < t,\ a_s = i} r_s(a_s).
$$
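As a small illustration of these definitions, the helper below (ours; the name `observed_stats` and the log format are assumptions) computes $n_t(i)$ and $\hat{\mu}_t(i)$ from a log of past plays, counting only feedback that has arrived by the end of round $t-1$.

```python
def observed_stats(history, t, arm):
    """Compute n_t(arm) and mu_hat_t(arm) from a play log.

    history : iterable of (s, a_s, r_s, d_s) tuples, one per round played so far.
    Only feedback with s + d_s < t (i.e., arrived by the end of round t - 1)
    is counted; still-delayed or infinite-delay feedback is ignored.
    """
    arrived = [r for (s, a, r, d) in history if a == arm and s + d < t]
    n = len(arrived)
    mu_hat = sum(arrived) / n if n > 0 else 0.0   # n_t(i) = 0 handled by a convention we pick here
    return n, mu_hat

# example (illustrative numbers): arm 0 was played at rounds 1 and 3,
# but the round-3 feedback is still delayed at t = 5
log = [(1, 0, 1.0, 2), (2, 1, 0.0, 0), (3, 0, 0.0, 10)]
print(observed_stats(log, t=5, arm=0))   # -> (1, 1.0)
```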