# Stochastic Multi-Armed Bandits with Unrestricted Delay Distributions

Tal Lancewicki\*¹, Shahar Segal\*¹, Tomer Koren¹ ², Yishay Mansour¹ ²

\*Equal contribution. ¹Blavatnik School of Computer Science, Tel Aviv University, Israel. ²Google Research, Tel Aviv. Correspondence to: Tal Lancewicki, Shahar Segal. Proceedings of the 38th International Conference on Machine Learning, PMLR 139, 2021.

We study the stochastic Multi-Armed Bandit (MAB) problem with random delays in the feedback received by the algorithm. We consider two settings: the reward-dependent delay setting, where realized delays may depend on the stochastic rewards, and the reward-independent delay setting. Our main contribution is algorithms that achieve near-optimal regret in each of the settings, with an additional additive dependence on the quantiles of the delay distribution. Our results do not make any assumptions on the delay distributions: in particular, we do not assume they come from any parametric family of distributions and allow for unbounded support and expectation; we further allow for infinite delays where the algorithm might occasionally not observe any feedback.

## 1 Introduction

The stochastic Multi-Armed Bandit (MAB) problem is a theoretical framework for studying sequential decision making. Most of the literature on MAB assumes that the agent observes feedback immediately after taking an action. However, in many real-world applications, the feedback might be available only after a period of time. For instance, in clinical trials, the observed effect of a medical treatment often arrives with a delay that may vary between treatments. Another example is in targeted advertising on the web: when a user clicks a display ad the feedback is immediate, but if a user decides not to click, then the algorithm becomes aware of that only once the user has left the website or enough time has elapsed.

In this paper, we study the stochastic MAB problem with randomized delays (Joulani et al., 2013). The reward of the chosen action at time $t$ is sampled from some distribution, like in the classic stochastic MAB problem. However, the reward is observed only at time $t + d_t$, where $d_t$ is a random variable denoting the delay at step $t$. This problem has been studied extensively in the literature (Joulani et al., 2013; Vernade et al., 2017; Pike-Burke et al., 2018; Gael et al., 2020) under an implicit assumption that the delays are reward-independent: namely, that $d_t$ is sampled from an unknown delay distribution and may depend on the chosen arm, but not on the stochastic rewards of the same round. For example, Joulani et al. (2013) and Pike-Burke et al. (2018) show a regret bound of the form $O(R^{\text{MAB}}_T + K\,\mathbb{E}[D])$. Here $R^{\text{MAB}}_T$ denotes the optimal instance-dependent $T$-round regret bound for standard (non-delayed) MAB, $R^{\text{MAB}}_T = \sum_{i:\Delta_i>0} \log(T)/\Delta_i$, where $\Delta_i$ is the sub-optimality gap of arm $i$. In the second term, $K$ is the number of arms and $\mathbb{E}[D]$ is the expected delay.

A significantly more challenging setting, which to the best of our knowledge was not explicitly addressed previously in the literature, is that of reward-dependent delays. (Some of the results of Vernade et al. (2017) and Gael et al. (2020) can be viewed as having a specific form of reward-dependent delays; we discuss this in more detail in the related work section.) In this setting, the random delay at each round may also depend on the reward received on the same round; in other words, they are drawn together from a joint distribution over rewards and delays. This scenario is motivated by both of the examples mentioned earlier: e.g., in targeted advertisement the delay associated with a certain user is strongly correlated with the reward she generates (i.e., click or no click); and in clinical trials, the delay often depends on the effect of the applied treatment, as some side-effects take longer than others to surface.

In contrast to the reward-independent case, with reward-dependent delays the observed feedback might give a biased impression of the true rewards. Namely, the expectation of the observed reward can be very different from the actual expected reward. For example, consider Bernoulli rewards. If the delays given reward 0 are shorter than the delays given reward 1, then the observed reward will be biased towards 0. Even worse, the direction of the bias can be opposite between different arms. Hence, as long as the fraction of unobserved feedback is significant, the expected observed reward of the optimal arm can be smaller than the expected observed reward of a sub-optimal arm, which makes the learning task substantially more challenging.
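To see this bias numerically, here is a small simulation sketch (our own illustration; the horizon, the 5,000-round delay, and the Bernoulli mean are arbitrary choices, not values from the paper) in which feedback is delayed only when the realized reward is 1:

```python
import numpy as np

rng = np.random.default_rng(0)
T = 10_000
true_mean = 0.5          # a single Bernoulli arm
long_delay = 5_000       # delay applied only when the realized reward is 1

rewards = rng.binomial(1, true_mean, size=T)
# reward-dependent delays: feedback for reward 0 arrives immediately,
# feedback for reward 1 arrives only after `long_delay` rounds
delays = np.where(rewards == 1, long_delay, 0)
arrival = np.arange(T) + delays

observed = rewards[arrival < T]   # feedback that has arrived by round T
print(f"true mean: {true_mean:.2f}, observed mean: {observed.mean():.2f}")
```

With these numbers, roughly half of the reward-1 feedback has not arrived by round $T$, so the empirical mean of the observed feedback is close to $1/3$ rather than the true mean of $1/2$.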
### 1.1 Our contributions

We consider both the reward-independent and reward-dependent versions of stochastic MAB with delays. In the reward-independent case we give new algorithms whose regret bounds significantly improve upon the state of the art, and we also give instance-dependent lower bounds demonstrating that our algorithms are nearly optimal. In the reward-dependent setting, we give the first algorithm to handle such a delay structure and the potential bias in the observed feedback that it induces. We provide both an upper bound on the regret and a nearly matching general lower bound.

**Reward-independent delays:** We first consider the easier reward-independent case. In this case, we provide an algorithm whose regret bound has a second term that scales with a quantile of the delay distribution rather than with its expectation: the regret is bounded by $O(\min_q\{R^{\text{MAB}}_T/q + d(q)\})$, where $d(q)$ is the $q$-quantile of the delay distribution. Specifically, choosing the median (i.e., $q = 1/2$) yields a regret bound of $O(R^{\text{MAB}}_T + d(1/2))$. We thus improve over the $O(R^{\text{MAB}}_T + K\,\mathbb{E}[D])$ regret bound of Joulani et al. (2013) and Pike-Burke et al. (2018), since by Markov's inequality the median is always at most twice the expected delay, $d(1/2) \le 2\,\mathbb{E}[D]$. Moreover, the increase in regret due to delays in our bound does not scale with the number of arms, so the improvement is significant even with fixed delays (Dudik et al., 2011; Joulani et al., 2013). Our bound is achieved using a remarkably simple algorithm, based on a variant of Successive Elimination (Even-Dar et al., 2006). For this algorithm, we also prove a more delicate regret bound for arm-dependent delays that allows for choosing different quantiles $q_i$ for different arms $i$ (rather than a single quantile $q$ for all arms simultaneously).
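As a rough illustration of the flavor of such an approach, the following is a generic Successive-Elimination-style test that uses only the feedback that has already arrived. This is a minimal sketch; the function name, confidence widths, elimination schedule, and constants below are our own choices and not those of the paper's algorithm.

```python
import math

def surviving_arms(active, n_obs, mu_hat, t, delta=0.01):
    """One elimination test of a Successive-Elimination-style algorithm.

    active : set of arms still in play
    n_obs  : dict arm -> number of feedback samples that have *arrived* so far
    mu_hat : dict arm -> empirical mean of the arrived samples
    Keeps every arm whose upper confidence bound is at least the best
    lower confidence bound among the active arms.
    """
    def radius(i):
        n = max(n_obs[i], 1)  # guard: no feedback from arm i has arrived yet
        return math.sqrt(2 * math.log(4 * len(n_obs) * t * t / delta) / n)

    best_lcb = max(mu_hat[i] - radius(i) for i in active)
    return {i for i in active if mu_hat[i] + radius(i) >= best_lcb}
```

Because only arrived feedback enters the counts and empirical means, arms with heavily delayed feedback simply keep wider intervals and are eliminated later, which is, intuitively, where a quantile of the delay distribution enters the analysis.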
The intuition for why the increase in regret due to delays should scale with a certain quantile is fairly straightforward: consider for instance the median of the delay, $d_M$, and for simplicity assume that the delay value is known when we take the action. One can simulate a black-box algorithm for delays bounded by $d_M$ on the rounds in which the delay is smaller than $d_M$ (approximately half of the rounds), and on the rest of the rounds, imitate the last action of the black-box algorithm. Since rewards are stochastic, and independent of time and of the delay, the regret on rounds with delay larger than $d_M$ is similar to the regret of the black-box algorithm on the rest of the rounds, resulting in a total regret of at most twice the regret of the black-box algorithm. For example, when using the algorithm of Joulani et al. (2013), this would give us $O(R^{\text{MAB}}_T + K d_M)$. We stress that, unlike this reduction, our algorithm does not need to know the value of the delay at any time, nor the median or any other quantile. In addition, our bound is much stronger and does not depend on $K$ in the second term.

**Reward-dependent delays:** We then proceed to consider the more challenging reward-dependent setting. In this setting, the feedback reveals much less information on the true rewards due to the selection bias in the observed rewards; in other words, the distributions of the observed feedback and of the unobserved feedback might be very different. In order to deal with this uncertainty, we present another algorithm, also inspired by Successive Elimination, which widens the confidence bounds in order to handle the potential bias. We achieve a regret bound of the form $O(R^{\text{MAB}}_T + \log(K)\, d(1-\Delta_{\min}/4))$, where $\Delta_{\min}$ is the minimal sub-optimality gap and $d(\cdot)$ is the quantile function of the marginal delay distribution. We show that this bound is nearly optimal by presenting a matching lower bound, up to a multiplicative factor in the second term (and $\log(K)$ factors).

**Summary and comparison of bounds:** Our main results, along with a concise comparison to previous work, are presented in Table 1; $G_{T,i}$ denotes the maximal number of unobserved feedbacks from arm $i$. The results show that our algorithm works well even under heavy-tailed delay distributions, including some distributions with infinite expected value. For example, the arm-dependent delay distributions used by Gael et al. (2020) are all bounded by an $\alpha$-Pareto distribution (in terms of the delay distributions' CDFs); hence, their median is bounded by $2^{1/\alpha}$. Our algorithm suffers at most an additional $O(2^{1/\alpha})$ over the classical regret for MAB without delays (see the bounds for the $\alpha$-Pareto case in Table 1). In the packet loss setting, the delay is 0 with probability $p$, and $\infty$ (or $T$) otherwise. If $p$ is a constant (e.g., $p > 1/4$), our regret bound scales as the optimal regret bound for MAB without delays, up to constant factors. In contrast, the bound of Joulani et al. (2013) scales with the number of missing samples and is thus linear in this setting, and a Pareto distribution that bounds such delays would require a very small parameter $\alpha$, which also results in a linear regret bound via the result of Gael et al. (2020).

Table 1. Regret bounds comparison of this and previous works. The bounds in this table omit constant and $\log(K)$ factors.

| Setting | Previous work | This paper |
| --- | --- | --- |
| General, reward-independent | $R^{\text{MAB}}_T + \sum_i \mathbb{E}[G_{T,i}]$, $R^{\text{MAB}}_T + K\,\mathbb{E}[D]$ (Joulani et al., 2013) | $\min_q\{R^{\text{MAB}}_T/q + d(q)\}$ |
| Fixed delay $d$ | $R^{\text{MAB}}_T + Kd$ (Joulani et al., 2013); $Kd$ (Dudik et al., 2011) | $R^{\text{MAB}}_T + d$ |
| $\alpha$-Pareto delays | (Gael et al., 2020) | $R^{\text{MAB}}_T + 2^{1/\alpha}$ |
| Packet loss | $(1-p)T$ (Joulani et al., 2013) | $R^{\text{MAB}}_T/p$ |
| General, reward-dependent | | $R^{\text{MAB}}_T + d(1-\Delta_{\min})$ |
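To spell out these two examples, assume the standard $\alpha$-Pareto CDF $F(x) = 1 - x^{-\alpha}$ for $x \ge 1$ (this exact parametrization is our assumption) and the quantile convention $d(q) = \inf\{x : \Pr[D \le x] \ge q\}$. Then

$$
F(x) \ge \tfrac{1}{2} \iff x \ge 2^{1/\alpha},
$$

so any delay distribution whose CDF dominates $F$ has median $d(1/2) \le 2^{1/\alpha}$, and taking $q = 1/2$ in $\min_q\{R^{\text{MAB}}_T/q + d(q)\}$ gives at most $2R^{\text{MAB}}_T + 2^{1/\alpha}$. Likewise, under packet loss, $\Pr[D = 0] = p$ implies $d(q) = 0$ for every $q \le p$, so taking $q = p$ gives a bound of $R^{\text{MAB}}_T/p$, which is the non-delayed rate up to a constant whenever $p$ is bounded away from zero.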
### 1.2 Related work

To the best of our knowledge, Dudik et al. (2011) were the first to consider delays in stochastic MAB. They examine contextual bandits with a fixed delay $d$, and obtain a regret bound of $O\bigl(\sqrt{K\log(NT)(d+T)}\bigr)$, where $N$ is the number of possible policies. Joulani et al. (2013) use a reduction to non-delayed MAB; for their explicit bound they assume that the expected value of the delay is bounded (see Table 1 for their implicit bound). Pike-Burke et al. (2018) consider a more challenging setting in which the learner observes only the sum of the rewards that arrive at the same round. They assume that the expected delay is known, and obtain a bound similar to that of Joulani et al. (2013). Vernade et al. (2017) study partially observed feedback, where the learner cannot distinguish between a reward of 0 and feedback that has not yet arrived, which is a special form of reward-dependent delay. However, they assume a bounded expected delay and full knowledge of the delay distribution. Gael et al. (2020) also consider partially observed feedback, and aim to relax the bounded-expected-delay assumption. They consider delay distributions whose CDFs are bounded from below by the CDF of an $\alpha$-Pareto distribution, which might have infinite expected delay for $\alpha \le 1$. However, this assumption still limits the distribution; e.g., the commonly examined fixed delay falls outside their setting. Moreover, they assume that the parameter $\alpha$ is known to the learner. Other extensions include Gaussian Process bandit optimization (Desautels et al., 2014) and linear contextual bandits (Zhou et al., 2019). As opposed to most of these works, we place no assumptions on the delay distribution, and the learner has no prior knowledge of it.

Delays were also studied in the context of the non-stochastic MAB problem (Auer et al., 2002b). Generally, when rewards are chosen in an adversarial fashion, the regret increases by a multiplicative factor depending on the delay. Under full information, Weinberger & Ordentlich (2002) show a regret bound of $O(\sqrt{dT})$ with a fixed delay $d$. This was extended to bandit feedback by Cesa-Bianchi et al. (2019), with a near-optimal regret bound of $O(\sqrt{T(K+d)})$. Several works have studied the effect of adversarial delays, in which the regret scales with $O(\sqrt{D})$, where $D$ is the sum of delays (Thune et al., 2019; Bistritz et al., 2019; Zimmert & Seldin, 2020; György & Joulani, 2020). Lastly, Cesa-Bianchi et al. (2018) consider a setting similar to that of Pike-Burke et al. (2018), in which the learner observes only the sum of rewards; there too, the regret increases by a multiplicative factor that grows with the delay.

## 2 Problem Setup and Background

We consider a variant of the classical stochastic Multi-Armed Bandit (MAB) problem. In each round $t = 1, 2, \ldots, T$, an agent chooses an arm $a_t \in [K]$ and receives a reward $r_t(a_t)$, where $r_t(\cdot) \in [0,1]^K$ is a random vector. Unlike the standard MAB setting, the agent does not immediately observe $r_t(a_t)$ at the end of round $t$; rather, only after $d_t(a_t)$ rounds (namely, at the end of round $t + d_t(a_t)$) is the tuple $(a_t, r_t(a_t))$ received as feedback. We stress that neither the delay $d_t(a_t)$ nor the round number $t$ is observed as part of the feedback (so the delay cannot be deduced directly from the feedback). The delay is supported in $\mathbb{N} \cup \{\infty\}$; in particular, we allow $d_t(a_t)$ to be infinite, in which case the associated reward is never observed. The pairs of vectors $\{(r_t(\cdot), d_t(\cdot))\}_{t=1}^{T}$ are sampled i.i.d. from a joint distribution. Throughout the paper we sometimes abuse notation and denote $r_t(a_t)$ and $d_t(a_t)$ simply by $r_t$ and $d_t$, respectively. This protocol is summarized in Protocol 1.

**Protocol 1: MAB with stochastic delays.** For each round $t \in [T]$:
1. The agent picks an action $a_t \in [K]$.
2. The environment samples a pair $(r_t(\cdot), d_t(\cdot))$ from a joint distribution.
3. The agent gets a reward $r_t(a_t)$ and observes the feedback $\{(a_s, r_s(a_s)) : t = s + d_s(a_s)\}$.
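For concreteness, the following is a minimal Python simulator of the interaction in Protocol 1. The class, the sampler, and all numbers (arm means, delay law, horizon) are our own illustrative choices, not part of the paper.

```python
import numpy as np
from collections import defaultdict

class DelayedBanditEnv:
    """Minimal simulator of Protocol 1: feedback (a_s, r_s(a_s)) arrives at round s + d_s(a_s)."""

    def __init__(self, sampler, horizon):
        self.sampler = sampler          # callable: round t -> (reward_vec, delay_vec)
        self.horizon = horizon
        self.inbox = defaultdict(list)  # arrival round -> list of (arm, reward) pairs

    def step(self, t, arm):
        rewards, delays = self.sampler(t)       # joint sample (r_t(.), d_t(.))
        arrival = t + delays[arm]
        if np.isfinite(arrival) and arrival <= self.horizon:
            self.inbox[int(arrival)].append((arm, rewards[arm]))
        # the agent sees only (arm, reward) pairs whose delay has elapsed by round t;
        # neither the delay nor the originating round is revealed
        return self.inbox.pop(t, [])

# example: reward-independent delays with K = 2 Bernoulli arms (illustrative numbers)
def sampler(t, rng=np.random.default_rng(0)):
    rewards = rng.binomial(1, [0.4, 0.6]).astype(float)
    delays = rng.geometric(0.1, size=2).astype(float)   # any delay distribution works here
    return rewards, delays

env = DelayedBanditEnv(sampler, horizon=1000)
feedback = [env.step(t, arm=t % 2) for t in range(1, 1001)]
```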
We discuss two forms of stochastic delays: (i) reward-independent delays, where the vectors $r_t(\cdot)$ and $d_t(\cdot)$ are independent of each other; and (ii) reward-dependent delays, where there is no restriction on the joint distribution.

The performance of the agent is measured, as usual, by the difference between the algorithm's cumulative expected reward and the best possible total expected reward of any fixed arm. This is known as the expected pseudo-regret, formally defined by

$$
R_T = \max_{i \in [K]} \mathbb{E}\Bigg[\sum_{t=1}^{T} \bigl(r_t(i) - r_t(a_t)\bigr)\Bigg],
$$

where $\mu_i$ is the mean reward of arm $i$, $i^\star$ denotes the optimal arm, and $\Delta_i = \mu_{i^\star} - \mu_i$ for all $i \in [K]$.

For a fixed algorithm for the agent (the relevant algorithm will always be clear from the context), we denote by $m_t(i)$ the number of times it chose arm $i$ by the end of round $t-1$. Similarly, $n_t(i)$ denotes the number of observed feedbacks from arm $i$ by the end of round $t-1$. The two might differ, as some of the feedback is delayed. Let $\hat{\mu}_t(i)$ be the observed empirical average of arm $i$, defined as

$$
\hat{\mu}_t(i) = \frac{1}{n_t(i)} \sum_{s \,:\, s + d_s(a_s) < t,\ a_s = i} r_s(a_s).
$$
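As a small illustration of these definitions, the helper below (ours; the name `observed_stats` and the log format are assumptions) computes $n_t(i)$ and $\hat{\mu}_t(i)$ from a log of past plays, counting only feedback that has arrived by the end of round $t-1$.

```python
def observed_stats(history, t, arm):
    """Compute n_t(arm) and mu_hat_t(arm) from a play log.

    history : iterable of (s, a_s, r_s, d_s) tuples, one per round played so far.
    Only feedback with s + d_s < t (i.e., arrived by the end of round t - 1)
    is counted; still-delayed or infinite-delay feedback is ignored.
    """
    arrived = [r for (s, a, r, d) in history if a == arm and s + d < t]
    n = len(arrived)
    mu_hat = sum(arrived) / n if n > 0 else 0.0   # n_t(i) = 0 handled by a convention we pick here
    return n, mu_hat

# example (illustrative numbers): arm 0 was played at rounds 1 and 3,
# but the round-3 feedback is still delayed at t = 5
log = [(1, 0, 1.0, 2), (2, 1, 0.0, 0), (3, 0, 0.0, 10)]
print(observed_stats(log, t=5, arm=0))   # -> (1, 1.0)
```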