Delay as Payoff in MAB

Ofir Schlisselberg*1, Ido Cohen*1, Tal Lancewicki1, Yishay Mansour1,2
1Tel Aviv University  2Google Research
{ofirs4, idoc, lancewicki}@mail.tau.ac.il, mansour.yishay@gmail.com

Abstract

In this paper, we investigate a variant of the classical stochastic Multi-armed Bandit (MAB) problem, where the payoff received by an agent (either cost or reward) is both delayed and directly corresponds to the magnitude of the delay. This setting faithfully models many real-world scenarios, such as the time it takes a data packet to traverse a network given a choice of route (where the delay serves as the agent's cost), or a user's time spent on a web page given a choice of content (where the delay serves as the agent's reward). Our main contributions are tight upper and lower bounds for both the cost and reward settings. For the case that delays serve as costs, which we are the first to consider, we prove optimal regret that scales as $\sum_{i:\Delta_i>0} \frac{\log T}{\Delta_i} + d^*$, where $T$ is the maximal number of steps, $\Delta_i$ are the sub-optimality gaps and $d^*$ is the minimal expected delay amongst arms. For the case that delays serve as rewards, we show optimal regret of $\sum_{i:\Delta_i>0} \frac{\log T}{\Delta_i} + \bar d$, where $\bar d$ is the second maximal expected delay. These improve over the regret in the general delay-dependent payoff setting, which scales as $\sum_{i:\Delta_i>0} \frac{\log T}{\Delta_i} + D$, where $D$ is the maximum possible delay. Our regret bounds highlight the difference between the cost and reward scenarios, showing that the improvement in the cost scenario is more significant than for the reward. Finally, we accompany our theoretical results with an empirical evaluation.

1 Introduction

Classical stochastic Multi-armed Bandit (MAB) is a well-studied theoretical framework for sequential decision making, where at every step an agent chooses an action and immediately receives some payoff, be it reward or cost.
A natural generalization of this framework considers the situation where the payoff is only received after a certain delay. This is known as the stochastic MAB problem with randomized delays (Joulani, György, and Szepesvári 2013), and was extensively researched in previous work under various variants (Vernade, Cappé, and Perchet 2017; Pike-Burke et al. 2018; Zhou, Xu, and Blanchet 2019; Manegueu et al. 2020; Vernade et al. 2020; Wu and Wager 2022; Howson, Pike-Burke, and Filippi 2023; Shi, Wang, and Wu 2023). In all these works the delay was considered reward-independent, namely, the reward and delay are sampled from independent distributions (more on this in the following sections). Later, Lancewicki et al. (2021) introduced reward-dependent delay, where the delay and reward are sampled from a joint distribution. This model is more challenging, as it introduces selection bias into the observed payoffs. Tang, Wang, and Zheng (2024) consider a special case of this that they call strongly-dependent, where for all intents and purposes the delay is exactly the reward that we are trying to maximize. We second their motivation to study this case, and additionally generalize it to delay as a cost that we are trying to minimize. This motivation is twofold. First, it models well many real-world scenarios. Second, from a performance perspective, it offers a significant gain in the regret. We show that the delay-as-payoff scenario is actually simpler than the general payoff-dependent setting by providing a tighter upper bound compared to (Tang, Wang, and Zheng 2024), and a significantly better bound for the cost setting. To motivate the cost scenario, consider a communication network where we route packets from node a to node b, and would like to do so in the fastest way possible.

*These authors contributed equally.
Copyright 2025, Association for the Advancement of Artificial Intelligence (www.aaai.org). All rights reserved.
We can model this as a stochastic MAB, where every route from a to b is an arm (action) and the time it takes the packet to arrive (formally known as Round Trip Time (RTT); see Postel (1981)) is our payoff. For routing we want to minimize the RTT, and so we call the payoff "cost". To motivate the reward scenario, consider a web page with some dynamic content. We wish to capture the viewer's attention for as long as possible by choosing the content wisely. A common metric used in advertising is Average Time on Page (ATP), used, for example, by Google Analytics. In this case we want to maximize the ATP, and so we call the payoff "reward". In both scenarios the payoff (RTT or ATP) is the time elapsed from choosing an action until its payoff is final, hence it can be modeled as delay. As far as performance, an immediate observation is that while the agent is waiting for an action's final payoff, it gains partial knowledge about this payoff as time progresses. More specifically, if at time $t_1$ an arm is played and by time $t_2$ the payoff has not been revealed, we can learn that the delay (and hence the payoff) is at least $t_2 - t_1$. This is a crucial observation that we use. Taking advantage of this knowledge, we can improve the regret bounds. Notice that this knowledge is one-sided, in the sense that every time step that passes provides an improved lower bound on the delay, but the same cannot be said for an upper bound. This is a challenging limitation that is the key difference between cost and reward, and explains how a better regret can be achieved for cost. (This issue is further discussed in Section 1.2.)

The Thirty-Ninth AAAI Conference on Artificial Intelligence (AAAI-25)

1.1 Our Contributions

We study both reward and cost as payoff in the special case of payoff-dependent delay where the delay serves as the payoff. This setting presents an opportunity to use the partial knowledge accumulated while waiting for the payoff to achieve better regret bounds.
In order to conform with the literature, we normalize the payoff to be in $[0, 1]$ by setting it to the actual delay divided by the maximum delay. This has no implications on the analysis, and is aimed to be in line with the existing regret bounds. Our main contributions are the following:

1. In the case of cost, we offer tight lower and upper bounds that scale as $\sum_{i:\Delta_i>0} \frac{\log T}{\Delta_i} + \min\{d^*, D\Delta_{\max}\}$, where $d^*$ is the minimal expected delay.
2. In the case of rewards, we offer tight lower and upper bounds that scale as $\sum_{i:\Delta_i>0} \frac{\log T}{\Delta_i} + \min\{\bar d, D\Delta_{\max}\}$, where $\bar d$ is the second maximal expected delay. Note that the cost regret bound can be significantly smaller than the reward regret bound on the same problem instance.
3. We complement our theoretical results with an experimental evaluation.

Our main results, along with a concise comparison to previous work, are presented in Table 1. The two bounds provided improve both on the general delay-dependent payoff setting, which scales as $\sum_{i:\Delta_i>0} \frac{\log T}{\Delta_i} + D$ (Lancewicki et al. 2021)¹, and on $\sum_{i:\Delta_i>0} \frac{\log T}{\Delta_i} + D\sum_{i:\Delta_i>0}\Delta_i$ by (Tang, Wang, and Zheng 2024). Note that $D\Delta_{\max}$ is already an improvement by a factor of $K$, but more significantly, if $\bar d \ll D\Delta_{\max}$ our bound is substantially lower. In the cost case this becomes even more clear, as $d^*$ is potentially much smaller than $\bar d$.

1.2 Cost vs Reward Intuition

The goal of a player in a MAB environment can be either to maximize his total payoff, in which case the payoff is called the "reward", or to minimize his total payoff, in which case the payoff is called the "cost". Normally, when considering a MAB setting without delay, the choice of using cost or reward is interchangeable by simply changing the sign of the payoff, and most algorithms would be oblivious to the change. We argue that in the case of delayed payoff, where the payoff is the delay, cost and reward are very different. Specifically, when minimizing cost we can make better use of the partial knowledge that we gain while waiting for payoff feedback.
We provide here some informal intuition for this, and we will make it more formal by providing lower bounds for both cases in Sections 4.3 and 5.3. Consider a scenario where we have $K$ arms with constant delays (and thus payoffs) sorted from low to high, $\{d_i\}_{i=1}^K$. If we maximize reward, the best arm is $i_K$ with delay $d_K$. No matter how we play, arm $i_{K-1}$ and arm $i_K$ are indistinguishable until after $d_{K-1}$ time steps, simply because no feedback is received from either arm before $d_{K-1}$. So the number of times we play sub-optimal arms depends on the second highest delay. In comparison, when minimizing cost, the best arm is $i_1$ with delay $d_1$. After $d_1$ time steps we can already start getting some information about the cost of $i_1$, and so we can hope to stop playing sub-optimal arms as early as $O(d_1)$.

¹In (Lancewicki et al. 2021) the additional additive term is formally the $(1-\Delta_{\min}/4)$-quantile, but in general this can be as large as $D$; see the discussion in Section 2.

Paper organization  The rest of the paper is organized as follows. In Section 2 we discuss related work, and in Section 3 we formally present our settings. In Section 4 we present our main algorithm and analysis for the delay-as-cost setting. In Section 5 we present our algorithm and results for the delay-as-reward setting. In Section 6 we present an empirical evaluation of our algorithms compared to previous related works. Section 7 is a discussion. Most of the proofs are deferred to the full version of the paper (Schlisselberg et al. 2024).

2 Related Work

The delayed payoff in MAB has recently gained significant attention. Most previous works have been devoted to payoff-independent delays, often treating them as some unknown distribution. This line of work started with (Dudik et al. 2011), who introduced a constant delay and offered a regret bound with linear dependence on the delay. Joulani, György, and Szepesvári (2013) extended this to stochastic, yet bounded, delay.
Later variations include Zhou, Xu, and Blanchet (2019), who made a distinction between arm-dependent and arm-independent delay, and Pike-Burke et al. (2018), who consider an aggregated rewards model. Delayed payoff was also studied from an adversarial perspective, where both the delay and rewards are adversarial. This includes works such as (Cesa-Bianchi, Gentile, and Mansour 2019; Thune, Cesa-Bianchi, and Seldin 2019; Bistritz et al. 2019; György and Joulani 2021). Masoudian, Zimmert, and Seldin (2022, 2023) presented a best-of-both-worlds algorithm for the delayed setting, which is a modification of Zimmert and Seldin (2020). Interestingly, in the adversarial setting, if the delay is adversarial and arm-dependent as in (Van Der Hoeven and Cesa-Bianchi 2022), the adversary can correlate the payoff and the delay, so that the payoff also depends on the delay. While this resembles the payoff-dependent setting, the resulting regret bounds are very different. In particular, the delay has a multiplicative effect on the regret, while in the stochastic case we only suffer an additive term. Only a few works considered regret in the reward-dependent setting. Lancewicki et al. (2021) consider the case where the delay and reward are sampled from a joint distribution. In their work, there is no assumption on the delay distribution; in particular, it may be unbounded. However, the reward is still bounded in $[0, 1]$. Note that in our special case, where the payoff directly corresponds to the magnitude of the delay, this bounded payoff implies that the delay is also bounded. Their regret bound has an additive term that

| | Reward | Cost |
|---|---|---|
| (Lancewicki et al. 2021) | $\sum_{i:\Delta_i>0} \frac{\log T}{\Delta_i} + D$ | $\sum_{i:\Delta_i>0} \frac{\log T}{\Delta_i} + D$ |
| (Tang, Wang, and Zheng 2024) | $\sum_{i:\Delta_i>0} \frac{\log T}{\Delta_i} + D\sum_{i:\Delta_i>0}\Delta_i$ | N/A |
| This work | $\sum_{i:\Delta_i>0} \frac{\log T}{\Delta_i} + \min\{\bar d, D\Delta_{\max}\}$ | $\sum_{i:\Delta_i>0} \frac{\log T}{\Delta_i} + \min\{d^*, D\Delta_{\max}\}$ |

Table 1: Pseudo-regret comparison for works on delay-dependent payoff.
$T$ is the total number of steps, $\Delta_i$ is the sub-optimality gap of arm $i$ and $\Delta_{\max} = \max_i \Delta_i$, $D$ is the maximum possible delay, $\bar d$ is the second maximal expected delay (across arms), and $d^*$ is the minimal expected delay. Bounds shown hide logarithmic factors that are independent of $T$.

scales as the $(1-\Delta_{\min}/4)$-quantile of the delay distribution. In general, this can easily be as large as the maximal delay $D$; for instance, when there is an arm with Bernoulli payoffs with $\mu(i) > 1/4$. Lancewicki et al. (2021) have also considered the case where the delay and reward are independent, which falls outside the scope of the reward-dependent case we consider here. Later, Tang, Wang, and Zheng (2024) consider the setting where the delay equals the reward. They consider general distributions, and their bound scales with a complex quantity dependent on these distributions. In the case where the distribution is bounded by $D$, they establish a regret bound with an additive term of $D\sum_i \Delta_i$. For the same setting we show an additive regret of $\bar d$, which we can also improve to $\min\{\bar d, D\max_i \Delta_i\}$.

3 Problem Setup

Our delay-as-payoff model is as follows. There is a set $[K]$ of $K$ arms. Each arm $i \in [K]$ has a distribution $\mathcal{D}_i$ with support $[D] \cup \{0\}$, where $D$ is the maximum delay. In each step $t = 1, 2, \ldots, T$, the agent chooses an arm $i_t \in [K]$ and incurs a delay $d_t \sim \mathcal{D}_{i_t}$. The agent observes the payoff of $i_t$ at time $t + d_t$. The payoff is $d_t/D$, which we denote by $r_t$ for rewards or $c_t$ for costs. Thus, the average payoff is $\mathbb{E}_{X \sim \mathcal{D}_i}[X/D]$, denoted by $\mu(i)$. Until step $t + d_t$, we refer to the payoff of arm $i_t$ as missing, since we do not yet know its actual delay, and thus its payoff. At step $t + d_t$, $d_t$ is revealed, and thus the agent observes the payoff. The interaction protocol is given in Algorithm 1.
Algorithm 1 Protocol
  for t ∈ [T] do
    Agent picks an action i_t ∈ [K]
    Environment samples d_t ∼ D_{i_t}
    Agent observes feedback {d_s : t = s + d_s}

The performance of the agent is measured by the expected pseudo-regret, which is the difference between the algorithm's cumulative expected payoff and the best expected payoff of any fixed arm. In the case of reward this is
$$R_T = \max_{i \in [K]} T\mu(i) - \mathbb{E}\left[\sum_{t=1}^T r_t\right],$$
and in the case of cost,
$$R_T = \mathbb{E}\left[\sum_{t=1}^T c_t\right] - \min_{i \in [K]} T\mu(i),$$
where $\Delta_i$ is the sub-optimality gap of arm $i$, i.e., $\Delta_i = |\mu(i) - \mu^*|$, and $\mu^* = \max_{i\in[K]} \mu(i)$ for rewards and $\mu^* = \min_{i\in[K]} \mu(i)$ for costs. Respectively, we define $i^* = i \in [K]$ s.t. $\mu(i) = \mu^*$, which, without loss of generality, we assume to be unique. For readability, we make use of the following additional notation: $d(i) = D\mu(i)$ is the mean delay of arm $i \in [K]$, and $\Delta_{\max} = \max_{i\in[K]} \Delta_i$.

4 Delay As Cost

In this section, we consider the case where the cost is proportional to the agent's delay. We introduce our main algorithm, Bounded Doubling Successive Elimination (BDSE, Algorithm 3), and its associated subroutine, Cost Successive Elimination (CSE, Algorithm 2). CSE builds on the well-known Successive Elimination (SE) algorithm (Even-Dar, Mannor, and Mansour 2006), and as discussed in Section 4.1, it introduces an improved lower confidence bound (LCB). This LCB leverages not only the observed payoffs but also takes into account the number of missing observations and their current duration. As we discuss later in this section, a similar improvement cannot be obtained for an upper confidence bound. Instead, BDSE employs a doubling scheme that upper-bounds $\mu^*$. This combination is a key component that enables us to achieve our optimal bounds. In the following subsections, we expand on these algorithms and their regret guarantees.

4.1 CSE Algorithm

Much like standard SE, CSE maintains a set of active arms, where initially all arms are active. The algorithm works in rounds, where in each round each active arm is selected once.
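The delayed-feedback protocol of Algorithm 1 can be simulated with a small environment class. The following is a minimal sketch (the class name, `pending` queue, and sampler interface are illustrative, not the authors' code): a pull at step t with delay d is revealed only when `observe` is called at step t + d.

```python
import random

class DelayAsPayoffEnv:
    """Minimal delay-as-payoff environment: the payoff of the arm pulled
    at step t is d_t / D, and it is revealed only at step t + d_t."""

    def __init__(self, delay_samplers, D, seed=0):
        self.delay_samplers = delay_samplers  # one sampler per arm, values in {0, ..., D}
        self.D = D
        self.rng = random.Random(seed)
        self.pending = {}  # reveal time -> list of (pull time, arm, delay)

    def pull(self, t, arm):
        # Sample a delay for this pull and schedule its reveal at t + d.
        d = self.delay_samplers[arm](self.rng)
        self.pending.setdefault(t + d, []).append((t, arm, d))

    def observe(self, t):
        # Feedback {d_s : s + d_s = t} revealed at the current step.
        return self.pending.pop(t, [])
```

For example, a pull of a deterministic-delay-3 arm at t = 0 yields nothing at t = 1 and the tuple (0, arm, 3) at t = 3; the agent then converts the delay to a payoff of 3/D.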
Unlike standard SE, which eliminates an arm only when there is confidence that it is suboptimal, CSE also eliminates an arm when there is confidence that it is worse than a specific threshold parameter $B$. In our following definitions, we distinguish between the following groups (note that they are not mutually exclusive):

1. $M_t(i)$ are the time steps with chosen arm $i$ that have not returned feedback by time $t$. Formally, $M_t(i) = \{s \in [t] \mid i_s = i \wedge s + d_s \ge t\}$. We denote the size of this group by $m_t(i) = |M_t(i)|$.
2. $O_t(i)$ are the time steps with chosen arm $i$ that have returned feedback by time $t$. Formally, $O_t(i) = \{s \in [t] \mid i_s = i \wedge s + d_s < t\}$.
3. $F_t(i)$ are the time steps with chosen arm $i$ that are at least $D$ time steps ago, hence their feedback must have returned. Formally, $F_t(i) = \{s \in [t] \mid i_s = i \wedge s \le t - D\}$.
4. Additionally, $n_t(i) = |\{s \mid 1 \le s \le t,\ i_s = i\}|$ is the number of plays of arm $i \in [K]$ before step $t \in [T]$.

Our lower confidence bound comprises three terms, $\mathrm{LCB}_t(i) = \max\{L^1_t(i), L^2_t(i), L^3_t(i)\}$; each bounds the expected cost with high probability:

1. $L^1_t(i)$ incorporates both observed and unobserved samples, optimistically assuming that the payoff of unobserved samples will be received in the next round. Formally,
$$L^1_t(i) = \hat\mu^-_t(i) - \sqrt{\frac{2\log T}{n_t(i)}}, \quad (1)$$
where $\hat\mu^-_t(i) = \frac{1}{n_t(i)}\left(\sum_{s\in M_t(i)} \frac{t-s}{D} + \sum_{s\in O_t(i)} c_s\right)$.
2. $L^2_t(i)$ uses only observed samples that were played up to time $t - D$:
$$L^2_t(i) = \hat\mu^F_t(i) - \sqrt{\frac{2\log T}{|F_t(i)| \vee 1}}, \quad (2)$$
where $\hat\mu^F_t(i) = \frac{1}{|F_t(i)| \vee 1}\sum_{s\in F_t(i)} c_s$ is the empirical average of those samples ($\vee$ indicates max). We take the maximum in the denominator for the case that some arm was not played by $t - D$; this can occur until $t = D + K$.
3. $L^3_t(i)$ directly leverages the fact that the cost corresponds to the magnitude of the delay. In particular, as we establish in Lemma 4.1, the number of missing samples cannot be much larger than a factor of the expected delay. We define
$$L^3_t(i) = \frac{|S_t|}{D}\left(\frac{m_t(i)}{2} - 8\log T - 1\right), \quad (3)$$
where $S_t$ is the set of active arms at time $t$.
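The three lower-confidence terms can be computed from simple per-arm bookkeeping of the sets $M_t$, $O_t$ and $F_t$. The sketch below follows Equations (1)-(3) as reconstructed here; the function name and argument names are illustrative, and the constants follow our reading of the equations rather than a verified implementation by the authors.

```python
import math

def lcb_terms(t, T, D, cost_obs, miss_times, far_obs, n, active_count):
    """Sketch of CSE's lower confidence bound LCB_t(i) = max{L1, L2, L3}.

    cost_obs:     observed costs of the arm (the set O_t(i))
    miss_times:   pull times whose feedback is still missing (M_t(i))
    far_obs:      costs of pulls made at least D steps ago (F_t(i))
    n:            total number of pulls n_t(i)
    active_count: number of currently active arms |S_t|
    """
    # L1: optimistic mean; a missing pull from step s has cost at least (t - s)/D.
    mu_minus = (sum((t - s) / D for s in miss_times) + sum(cost_obs)) / max(n, 1)
    L1 = mu_minus - math.sqrt(2 * math.log(T) / max(n, 1))
    # L2: plain Hoeffding bound using only fully resolved samples.
    f = max(len(far_obs), 1)
    L2 = (sum(far_obs) / f) - math.sqrt(2 * math.log(T) / f)
    # L3: many missing samples imply a large expected delay (Lemma 4.1, rearranged).
    m = len(miss_times)
    L3 = (active_count / D) * (m / 2 - 8 * math.log(T) - 1)
    return max(L1, L2, L3)
```

With few missing samples L3 is very negative and L1 or L2 dominates; L3 only activates once the backlog of unresolved pulls grows well past its logarithmic slack.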
With the use of Lemma 4.1, $L^3_t(i)$ serves as a valid lower confidence bound for $\mu(i)$. For the elimination step we require an upper confidence bound. We use a similar bound as in $L^2_t$:
$$\mathrm{UCB}_t(i) = \hat\mu^F_t(i) + \sqrt{\frac{2\log T}{|F_t(i)| \vee 1}} \quad (4)$$
The CSE algorithm is formally described in Algorithm 2, and as full pseudo-code in Algorithm 6 in the full version of the paper (Schlisselberg et al. 2024). Note that if the delays were deterministic, then we would have $m_t(i) \approx d(i)/R$ for every arm $i$. The following lemma handles the case that the delays are stochastic with expectation $d(i)$.

Algorithm 2 Cost Successive Elimination (CSE)
  Input: number of rounds T, number of arms K, maximum delay D, elimination threshold B.
  Initialization: t ← 1, S ← [K]
  Output: status (either Success or Fail) and t, the number of time steps performed.
  while t < T do
    Play each arm i ∈ S
    Observe any incoming feedback
    Set t ← t + |S|
    for i ∈ S do
      LCB_t(i) ← max{L¹_t(i), L²_t(i), L³_t(i)} as defined in Equations (1) to (3)
      Update UCB_t(i) as defined in Equation (4)
    Elimination step: remove from S any arm i if there exists j such that min{UCB_t(j), B} < LCB_t(i)
    if S = ∅ then Return (Fail, t)
  Return (Success, t)

Lemma 4.1  For every step $t$, if the last $\min\{D, t\}$ steps were played in a round robin over a set of size at least $R$:
$$\Pr\left[m_t(i) \le \frac{2d(i)}{R} + 16\log T + 2\right] \ge 1 - \frac{1}{T^2}.$$

Note that using the missing plays to upper-bound the mean delay results in a significantly weaker bound, and is thus unhelpful. With that in hand, and standard concentration bounds, we can define an event $G$ that happens with high probability.

Definition 4.2  Assume that the actions were played in a round-robin manner, and denote by $R_t$ the minimum size of the round robin by time $t \in [T]$. Let $G$ be the event that for every $t \in [T]$ and $i \in [K]$:
$$m_t(i) \le \frac{2d(i)}{R_t} + 16\log T + 2 \quad\text{and}\quad |\mu(i) - \hat\mu_t(i)| \le \sqrt{\frac{2\log T}{n_t(i)}},$$
where $\hat\mu_t(i) = \frac{1}{n_t(i)}\sum_{\{1\le s\le t \mid i_s=i\}} r_s$ is the empirical average payoff of the time steps with chosen arm $i$. Note that, due to missing plays, this quantity is likely unknown to the algorithm at time $t$.
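The deterministic-delay observation preceding Lemma 4.1 ($m_t(i) \approx d(i)/R$) can be checked directly: under a round robin over $R$ arms, only the pulls of a fixed arm from the last $d$ global steps are still unresolved. The helper below is purely illustrative.

```python
def missing_count(d, R, t):
    """Number of still-missing pulls of arm 0 at time t, when R arms are
    played round-robin and every pull of arm 0 has deterministic delay d.
    Arm 0 is pulled at global steps 0, R, 2R, ...; a pull at step s
    resolves at step s + d, so it is missing at t iff s + d >= t."""
    pulls = range(0, t, R)
    return sum(1 for s in pulls if s + d >= t)
```

For instance, with delay d = 20 over a round robin of R = 4 arms, exactly d/R = 5 pulls are unresolved at any time t well past the warm-up, matching the deterministic intuition that the lemma extends to stochastic delays.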
We show that $G$ holds with high probability.

Lemma 4.3  The event $G$ holds with probability at least $1 - 3/T^2$.

As previously mentioned, CSE adopts a less conservative elimination rule than standard SE, as it also eliminates arms that perform worse than a specified threshold $B$. Consequently, it might eliminate all arms, in which case it returns Fail. For this reason we have a main program, BDSE, that calls CSE with the threshold $B$. When CSE returns Fail to BDSE, BDSE doubles the threshold $B$ and calls CSE with the new threshold. The following lemma shows that CSE will not eliminate the optimal arm if $B \ge \mu^*$.

Lemma 4.4 (Safe Elimination)  Assuming $G$ holds and $B \ge \mu^*$, the procedure CSE will not fail and $i^*$ will not be eliminated.

The following theorem bounds the regret suffered in one call to the procedure CSE.

Theorem 4.5  The regret of CSE (Algorithm 2) with elimination threshold $B$ is bounded by
$$\sum_{i:\Delta_i>0} \frac{129\log T}{\Delta_i} + 8D\min\{B, \Delta_{\max}\}\log K.$$

The first term in the regret scales as the optimal instance-dependent, non-delayed MAB bound. The second term scales with the magnitude of $B$. Note that, by Lemma 4.4, $B$ will remain smaller than $2\mu^*$, and thus the second term is at most $O(d^*)$.

Proof sketch:  For the sake of simplicity, we provide the proof sketch only for the $B$ term in the min. Assume the good event $G$ holds. Let $S_t$ be the set $S$ at time $t$. Fix any sub-optimal arm $i \in [K]$, and let $\tau_i$ be the last elimination step at which arm $i$ remained active. Recall that $L^1_t$ is computed with an optimistic empirical average $\hat\mu^-_t(i)$; that is, any missing sample is assumed to be observed in the next round. At worst, such a missing sample would eventually return after $D$ steps and have a cost of 1. Thus, the difference between $\hat\mu^-_t(i)$ and the actual empirical mean $\hat\mu_t(i)$ is at most $m_t(i)/n_t(i)$.
Since the good event $G$ holds, and arm $i$ was not yet eliminated at time $\tau_i$,
$$\mu(i) \le \mathrm{LCB}_{\tau_i}(i) + 2\sqrt{\frac{2\log T}{n_{\tau_i}(i)}} + \frac{m_{\tau_i}(i)}{n_{\tau_i}(i)} \le \mu^* + 2\sqrt{\frac{2\log T}{n_{\tau_i}(i)}} + \frac{2D L^3_{\tau_i}(i)}{n_{\tau_i}(i)|S_{\tau_i}|} \le \mu^* + 2\sqrt{\frac{2\log T}{n_{\tau_i}(i)}} + \frac{2DB}{n_{\tau_i}(i)|S_{\tau_i}|}. \quad (6)$$

We consider three cases: (i) $\mu^* < B$ and $\mu(i) < 2B$, (ii) $B \le \mu^*$, and (iii) $2B \le \mu(i)$.

Case (i): By Lemma 4.4, we know that $i^*$ will not be eliminated (since $\mu^* < B$). Since $i$ was not eliminated at time $\tau_i$, $L^2_{\tau_i}(i) \le \mathrm{UCB}_{\tau_i}(i^*)$. Using standard arguments, this implies that $n^F_{\tau_i}(i) \le O(\log T/\Delta_i^2)$. Recall that $n^F_{\tau_i}(i)$ is the number of times we played $i$ until time $\tau_i - D$. In the last $D$ plays, $i$ was played approximately $D/|S_{\tau_i}|$ times due to the round robin, which causes additional regret of $\Delta_i \frac{D}{|S_{\tau_i}|} \le O\left(\frac{DB}{|S_{\tau_i}|}\right)$. This accumulates to a total regret of $O\left(\frac{\log T}{\Delta_i} + \frac{DB}{|S_{\tau_i}|}\right)$.

Case (ii): We use Equation (6) to show that $\Delta_i \le 2\sqrt{\frac{2\log T}{n_{\tau_i}(i)}} + \frac{2DB}{n_{\tau_i}(i)|S_{\tau_i}|}$. This implies that either $\Delta_i \le 4\sqrt{\frac{2\log T}{n_{\tau_i}(i)}}$ or $\Delta_i \le \frac{4DB}{n_{\tau_i}(i)|S_{\tau_i}|}$. In the first case, we get that the regret from arm $i$ is bounded by $O(\log T/\Delta_i)$. Similarly, in the second case the regret is bounded by $O(DB/|S_{\tau_i}|)$.

Case (iii): The third case assumes that $B \le \mu(i)/2$. By rearranging the terms of Equation (6), we have that $\Delta_i \ge \mu(i) - O\left(\sqrt{\frac{\log T}{n_{\tau_i}(i)}} + \frac{DB}{n_{\tau_i}(i)|S_{\tau_i}|}\right)$. Similarly to case (ii), the total regret is bounded by $O\left(\frac{\log T}{\Delta_i} + \frac{DB}{|S_{\tau_i}|}\right)$.

Now we can sum over all sub-optimal arms $i$ and bound the regret by $O\left(\sum_{i:\Delta_i>0} \frac{\log T}{\Delta_i} + DB\log K\right)$. Note that the regret when $G$ does not hold is only $O(1)$ in expectation.

4.2 Bounded Doubling Successive Elimination

CSE (Algorithm 2) requires a parameter $B$, which is not available to the agent. In this algorithm, we estimate $B$ using the doubling technique.

Algorithm 3 Bounded Doubling Successive Elimination (BDSE)
  Input: number of rounds T, number of arms K, maximum delay D.
  Initialization: B ← 1/D, t ← 0
  while t < T do
    Run (ret, τ) ← CSE(T − t, K, D, B)
    if ret = Fail then B ← 2B
    t ← t + τ
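The doubling wrapper can be sketched in a few lines. Here `cse` stands in for Algorithm 2 and is an assumption of this sketch: any callable returning a (status, steps_used) pair, where status "Fail" means all arms were eliminated under the current threshold.

```python
def bdse(T, K, D, cse):
    """Doubling wrapper over a CSE-style subroutine (Algorithm 3 sketch).
    `cse(budget, K, D, B)` must return ("Success" | "Fail", steps_used)."""
    B = 1.0 / D          # initial threshold: one unit of delay
    t = 0
    while t < T:
        status, used = cse(T - t, K, D, B)
        if status == "Fail":   # all arms eliminated: threshold was below mu*
            B = 2 * B          # double and retry with the remaining budget
        t += used
    return B
```

Since $B$ starts at $1/D$ and doubles only on failure, it never overshoots $2\mu^*$ (by Lemma 4.4, a run with $B \ge \mu^*$ cannot fail), which is what keeps the $DB$ term in Theorem 4.5 at $O(d^*)$.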
Corollary 4.6  Algorithm BDSE has a regret of at most
$$\sum_{i:\Delta_i>0} \frac{129\log T \log d^*}{\Delta_i} + 8\min\{d^*, D\Delta_{\max}\log d^*\}\log K.$$

Proof: From Lemma 4.4 we know that if $B \ge \mu^*$ then CSE will not fail, which means that the number of calls to CSE is at most $\log d^*$. Notice that on the $j$-th call of BDSE, $DB = 2^j$. The total regret will be at most
$$\sum_{j=1}^{\log d^*}\left(\sum_{i:\Delta_i>0} \frac{129\log T}{\Delta_i} + 8\min\{2^j, D\Delta_{\max}\}\log K\right) \le \sum_{i:\Delta_i>0} \frac{129\log T \log d^*}{\Delta_i} + 8\min\{d^*, D\Delta_{\max}\log d^*\}\log K.$$

4.3 Lower Bound

In this section, we show two lower bounds for the cost setting. The first is a general lower bound, which nearly matches the regret bound of our algorithm; the second is a lower bound for classical SE algorithms. The main challenge is to understand the impact of $d^*$ on the regret. We focus on the second term of the upper bound, as the first term, $\sum_{i:\Delta_i>0} \frac{\log T}{\Delta_i}$, is a well-known instance-dependent bound even when there are no delays (Bubeck and Cesa-Bianchi 2012).

Theorem 4.7  In the cost scenario, for every choice of $d^* \le D/2$, there is an instance for which any algorithm has a regret of $\Omega(d^*)$.

Proof: We consider two arms with deterministic delays, one with delay $d^*$ (and cost $\mu^* = d^*/D \le 1/2$) and the other with delay $D$ (and cost 1). We select at random which arm has delay $d^*$ and which $D$. Until time $d^*$ both arms are indistinguishable, and hence the regret is $(1 - \mu^*)d^*$. Since $\mu^* \le 1/2$, we have a regret of at least $d^*/2$.

Conservative SE algorithms:  We show, for a natural class of SE algorithms which are also conservative (w.h.p. they do not eliminate the optimal action), a lower bound of $\sqrt{D d^*}$. Interestingly, this bound is also tight: as we show in the full version of the paper (Schlisselberg et al. 2024), there is a conservative SE algorithm which attains it. For this impossibility result we use the following two problem instances. In the first problem instance, the delay of arm 1 is $\sqrt{D d^*}/2$ w.p. $2\sqrt{d^*/D}$ and otherwise 0. For arm 2 the delay is deterministically $D$. In the second problem instance, the delay of arm 1 is $D$ w.p. $2\sqrt{d^*/D}$ and otherwise 0. For arm 2 the delay is deterministically $\sqrt{D d^*}$.
In the first instance the best arm is arm 1, while in the second it is arm 2. Until time $\sqrt{D d^*}$ we cannot distinguish between the two instances, so a conservative SE algorithm will keep playing both arms in a round-robin manner, and will have a regret of $\Omega(\sqrt{D d^*})$ in the first instance. It is worth observing why our BDSE overcomes these two problem instances. Due to the doubling scheme, every time the number of missing plays reaches (roughly) the current threshold, both arms will be eliminated, until the threshold surpasses $\mu^*$. In the first instance, this will happen after $d^*$ steps, at which point only the optimal arm will remain, and thus the regret is at most $d^*$. Similarly, in the second instance, the threshold will surpass $\mu^*$ after $\sqrt{D d^*}$ steps. The gap here is $\sqrt{d^*/D}$, and so the regret is at most $\sqrt{d^*/D} \cdot \sqrt{D d^*} = d^*$.

5 Delay As Reward

In this section, we consider the case where the delay corresponds to the agent's reward. Similarly to the cost case, we have a main program, the Bounded Halving Successive Elimination algorithm (BHSE, Algorithm 5), and an associated subroutine, Reward Successive Elimination (RSE, Algorithm 4). Besides the transition from minimization to maximization, the main difference is that the missing feedbacks at time $t$ should be interpreted differently. In the following subsections, we include the details of these algorithms and their regret guarantees.

5.1 RSE Algorithm

As in the cost scenario, we start with a Reward Successive Elimination algorithm. Since we consider rewards, we would like the threshold $B$ to decrease with time (rather than increase, as was done in the cost scenario). Eventually, RSE expects $B \le \mu^*$ to guarantee success. As in the CSE algorithm, we eliminate arms based on suboptimality in comparison to other arms, or when there is confidence that they are worse than the parameter $B$.

Algorithm 4 Reward Successive Elimination (RSE)
  Input: number of rounds T, number of arms K, maximum delay D, elimination threshold B.
  Initialization: t ← 1, S ← [K]
  Output: status (either Success or Fail) and t, the number of time steps performed.
  while t < T do
    Play each arm i ∈ S
    Observe incoming payoff from {s : s + d_s = t}
    Set t ← t + |S|
    for i ∈ S do
      UCB_t(i) ← min{U¹_t(i), U²_t(i)} as defined in Equations (7) and (8)
      Update LCB_t(i) as defined in Equation (9)
    Elimination step: remove from S any arm i if there exists j such that max{LCB_t(j), B} > UCB_t(i)
    if S = ∅ then Return (Fail, t)
  Return (Success, t)

Our upper confidence bound comprises only two terms, $\mathrm{UCB}_t(i) = \min\{U^1_t(i), U^2_t(i)\}$, which are analogues of $L^1$ and $L^2$ in the cost case. Formally,
$$U^1_t(i) = \hat\mu^+_t(i) + \sqrt{\frac{2\log T}{n_t(i)}}, \quad (7)$$
where $\hat\mu^+_t(i) = \frac{1}{n_t(i)}\left(\sum_{s\in M_t(i)} 1 + \sum_{s\in O_t(i)} r_s\right)$ is an optimistic estimate of $\mu(i)$. (Recall that $r_s = d_s/D$.) Similarly, $U^2_t$ as well as $\mathrm{LCB}_t$ are defined by
$$U^2_t(i) = \hat\mu^F_t(i) + \sqrt{\frac{2\log T}{|F_t(i)| \vee 1}}, \quad (8)$$
$$\mathrm{LCB}_t(i) = \hat\mu^F_t(i) - \sqrt{\frac{2\log T}{|F_t(i)| \vee 1}}, \quad (9)$$
where $\hat\mu^F_t(i) = \frac{1}{|F_t(i)| \vee 1}\sum_{s\in F_t(i)} r_s$. We use the same good event $G$ as defined in Definition 4.2 in the previous section, which also holds here w.h.p.

Theorem 5.1 (Safe Elimination)  Assuming $G$ holds and $B \le \mu^*$, the procedure RSE will not return Fail and $i^*$ will not be eliminated.

The following theorem bounds RSE's regret.

Theorem 5.2  Assume $B \ge \mu^*/2$. The regret of RSE (Algorithm 4) with elimination threshold $B$ is bounded by
$$\sum_{i:\Delta_i>0} \frac{289\log T}{\Delta_i} + 12\min\{\bar d, D\Delta_{\max}\}\log K,$$
where $\bar d$ is the second highest expected delay.

The assumption that $B \ge \mu^*/2$ is satisfied under the main program BHSE (Algorithm 5) due to Theorem 5.1.

Figure 1: Results of experiments on different algorithms (color) and different distributions (line style).

5.2 Bounded Halving Successive Elimination

Similar to the main program in the cost case, BHSE estimates a lower bound for $\mu^*$. It starts with an over-estimate of $B = 1$ and this time halves it whenever RSE returns Fail.

Algorithm 5 Bounded Halving Successive Elimination (BHSE)
  Input: number of rounds T, number of arms K, maximum delay D.
  Initialization: B ← 1, t ← 0
  while t < T do
    Run (ret, τ) ← RSE(T − t, K, D, B)
    if ret = Fail then B ← B/2
    t ← t + τ

Corollary 5.3  Algorithm BHSE has a regret of at most
$$\sum_{i:\Delta_i>0} \frac{289\log T \log(1/\mu^*)}{\Delta_i} + 12\min\{\bar d, D\Delta_{\max}\}\log K \log\frac{1}{\mu^*}.$$

Proof: From Theorem 5.1 we know that if $B \le \mu^*$ then RSE will not fail, which means that the loop will run at most $\log(1/\mu^*)$ times. This also means that $B \ge \mu^*/2$, as needed. Therefore, the total regret is at most $\left(\sum_{i:\Delta_i>0} \frac{289\log T}{\Delta_i} + 12\min\{\bar d, D\Delta_{\max}\}\log K\right)\log\frac{1}{\mu^*}$. Note that without loss of generality we can assume that $\mu^* \ge 1/T$, since otherwise $\Delta_i \le 1/T$ for all arms and the regret is trivially bounded by 1. Therefore, the term $\log(1/\mu^*)$ is at most $\log T$.

Notice that, unlike Corollary 4.6, the regret bound depends on $\bar d$. On the one hand, this is better than the delay of the best arm (which has the maximal expected delay). On the other hand, this can be much larger than the regret in the cost scenario, which depends on the minimal expected delay.

5.3 Lower Bound

In this section we show a lower bound for the reward setting, which nearly matches our regret bound.

Theorem 5.4  In the reward scenario, for every choice of $\bar d \le D/2$, there is an instance for which any algorithm has a regret of $\Omega(\bar d)$.

Proof: Consider the case of $K$ arms. We have one arm with constant delay $D$, one arm with constant delay $\bar d \le D/2$, and the remaining arms with delay 0 (and hence reward 0). We select at random the identities of the arms. This implies that for any sub-optimal arm $i$ we have $\Delta_i \ge 1/2$. Clearly, until time $\bar d$ the best two arms are indistinguishable, hence the regret is at least of order $\bar d \min_i \Delta_i \ge \bar d/2$.

Conservative SE algorithms:  Using similar arguments as in the cost case, an analogous lower bound for conservative SE algorithms can be shown here as well.

6 Experiments

We conducted synthetic experiments for both the cost and reward settings, using the algorithms in Table 1 as baselines. We show results on two representative distributions: Truncated Normal (bounded in $[0, D]$) and Bernoulli.
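A Bernoulli delay instance of the kind used in the experiments (arm $i$ has delay 0 with probability $p_i$ and delay $D$ with probability $1-p_i$, with $p_i$ uniform) can be sampled as follows. This is an illustrative sketch of the instance generator only, not the authors' experimental harness; the function name and return layout are assumptions.

```python
import random

def bernoulli_instance(K, D, seed=0):
    """Sample a Bernoulli delay-as-payoff instance: arm i has delay 0
    w.p. p_i and delay D w.p. 1 - p_i, with p_i uniform in [0, 1]."""
    rng = random.Random(seed)
    ps = [rng.random() for _ in range(K)]
    # Bind p per arm via the outer lambda to avoid late-binding in the loop.
    samplers = [
        (lambda p: (lambda r: 0 if r.random() < p else D))(p) for p in ps
    ]
    means = [(1 - p) * D for p in ps]  # expected delay of each arm
    return samplers, means
```

Paired with an environment like the one sketched in Section 3, such samplers reproduce the heavy 0-or-D feedback pattern that makes the Bernoulli experiments harder than the Truncated Normal ones.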
Due to space constraints, we defer the cost experiments to the full version of the paper (Schlisselberg et al. 2024). All experiments use $T = 150{,}000$, $K = 30$ and $D = 5000$. For the Truncated Normal we sample $K$ means and standard deviations (stds), and adjust them to get a truncated version. Since our additive term in the regret is $\min\{\bar d, D\Delta_{\max}\}$, our contribution is mainly for instances where $\bar d < D\Delta_{\max}$. Hence, we show the results on such an instance, by using an exponential distribution to sample the means of the arms, creating sparsity in the higher regime. The stds are sampled uniformly in $[0, D]$. For the Bernoulli distribution, we sample $K$ probabilities $p_i$ uniformly in $[0, 1]$, so that arm $i$ gets 0 with probability $p_i$ and $D$ with probability $1 - p_i$. Figure 1 shows the average cumulative regret over 10 runs. The shaded region is the std of these runs. BHSE (our algorithm) outperforms OPSE (from Lancewicki et al. (2021)) and Censored UCB (from Tang, Wang, and Zheng (2024)) on both distributions. The Bernoulli distribution is more challenging, resulting in higher regret and std.

7 Discussion

In this paper, we explored a variant of the classical MAB problem where the payoff is both delayed and directly corresponds to the magnitude of the delay. For the delay-as-reward setting we introduced tighter upper and lower regret bounds compared to those established in previous works. We are the first to generalize the setting to cost as well, highlighting the inherent difference between cost and reward in this setting. There are several interesting future directions. First, as our motivation for the reward setting is online advertising, it is natural to ask whether we can expect similar results in a delay-as-payoff contextual bandit setting or in other variants of the MAB problem. Furthermore, it remains unclear whether our results can be generalized to more general delay distributions, which are potentially unbounded (but have a bounded expectation).
Finally, adopting this perspective in the adversarial setting, where delays serve as payoffs, is a challenging new problem.

Acknowledgments

We thank Batya Berzack and Yoav Nagel for their input and work during the initiation of this paper, and Ran Darshan for his constructive feedback. OS is supported by the Darshan lab. IC is supported by the Israeli Science Foundation (grant number 1186/18). TL and YM are supported by the European Research Council (ERC) under the European Union's Horizon 2020 research and innovation program (grant agreement No. 882396), by the Israel Science Foundation and the Yandex Initiative for Machine Learning at Tel Aviv University. IC, TL and YM are also supported by a grant from the Tel Aviv University Center for AI and Data Science (TAD).

References

Bistritz, I.; Zhou, Z.; Chen, X.; Bambos, N.; and Blanchet, J. 2019. Online EXP3 Learning in Adversarial Bandits with Delayed Feedback. In Wallach, H.; Larochelle, H.; Beygelzimer, A.; d'Alché-Buc, F.; Fox, E.; and Garnett, R., eds., Advances in Neural Information Processing Systems, volume 32. Curran Associates, Inc.
Bubeck, S.; and Cesa-Bianchi, N. 2012. Regret Analysis of Stochastic and Nonstochastic Multi-armed Bandit Problems. arXiv:1204.5721.
Cesa-Bianchi, N.; Gentile, C.; and Mansour, Y. 2019. Delay and cooperation in nonstochastic bandits. Journal of Machine Learning Research, 20(17): 1–38.
Cohen, A.; Efroni, Y.; Mansour, Y.; and Rosenberg, A. 2021. Minimax regret for stochastic shortest path. Advances in Neural Information Processing Systems, 34.
Dann, C.; Lattimore, T.; and Brunskill, E. 2017. Unifying PAC and regret: Uniform PAC bounds for episodic reinforcement learning. Advances in Neural Information Processing Systems, 30.
Dudik, M.; Hsu, D.; Kale, S.; Karampatziakis, N.; Langford, J.; Reyzin, L.; and Zhang, T. 2011. Efficient optimal learning for contextual bandits. arXiv preprint arXiv:1106.2369.
Even-Dar, E.; Mannor, S.; and Mansour, Y. 2006.
Action Elimination and Stopping Conditions for the Multi-Armed Bandit and Reinforcement Learning Problems. Journal of Machine Learning Research, 7(39): 1079–1105.
György, A.; and Joulani, P. 2021. Adapting to delays and data in adversarial multi-armed bandits. In International Conference on Machine Learning, 3988–3997. PMLR.
Howson, B.; Pike-Burke, C.; and Filippi, S. 2023. Delayed feedback in generalised linear bandits revisited. In International Conference on Artificial Intelligence and Statistics, 6095–6119. PMLR.
Joulani, P.; György, A.; and Szepesvári, C. 2013. Online learning under delayed feedback. In International Conference on Machine Learning, 1453–1461. PMLR.
Lancewicki, T.; Segal, S.; Koren, T.; and Mansour, Y. 2021. Stochastic multi-armed bandits with unrestricted delay distributions. In International Conference on Machine Learning, 5969–5978. PMLR.
Manegueu, A. G.; Vernade, C.; Carpentier, A.; and Valko, M. 2020. Stochastic bandits with arm-dependent delays. arXiv:2006.10459.
Masoudian, S.; Zimmert, J.; and Seldin, Y. 2022. A best-of-both-worlds algorithm for bandits with delayed feedback. Advances in Neural Information Processing Systems, 35: 11752–11762.
Masoudian, S.; Zimmert, J.; and Seldin, Y. 2023. An Improved Best-of-both-worlds Algorithm for Bandits with Delayed Feedback. arXiv preprint arXiv:2308.10675.
Pike-Burke, C.; Agrawal, S.; Szepesvári, C.; and Grünewälder, S. 2018. Bandits with delayed, aggregated anonymous feedback. In International Conference on Machine Learning, 4105–4113. PMLR.
Postel, J. 1981. Transmission Control Protocol. RFC 793, Internet Engineering Task Force (IETF).
Schlisselberg, O.; Cohen, I.; Lancewicki, T.; and Mansour, Y. 2024. Delay as Payoff in MAB. arXiv preprint arXiv:2408.15158.
Shi, L.; Wang, J.; and Wu, T. 2023. Statistical inference on multi-armed bandits with delayed feedback. In International Conference on Machine Learning, 31328–31352. PMLR.
Slivkins, A. 2024. Introduction to Multi-Armed Bandits.
arXiv:1904.07272.
Tang, Y.; Wang, Y.; and Zheng, Z. 2024. Stochastic Multi-Armed Bandits with Strongly Reward-Dependent Delays. In International Conference on Artificial Intelligence and Statistics, 3043–3051. PMLR.
Thune, T. S.; Cesa-Bianchi, N.; and Seldin, Y. 2019. Nonstochastic Multiarmed Bandits with Unrestricted Delays. arXiv:1906.00670.
Van Der Hoeven, D.; and Cesa-Bianchi, N. 2022. Nonstochastic Bandits and Experts with Arm-Dependent Delays. In Camps-Valls, G.; Ruiz, F. J. R.; and Valera, I., eds., Proceedings of The 25th International Conference on Artificial Intelligence and Statistics, volume 151 of Proceedings of Machine Learning Research, 2022–2044. PMLR.
Vernade, C.; Cappé, O.; and Perchet, V. 2017. Stochastic Bandit Models for Delayed Conversions. arXiv:1706.09186.
Vernade, C.; Carpentier, A.; Lattimore, T.; Zappella, G.; Ermis, B.; and Brueckner, M. 2020. Linear bandits with stochastic delayed feedback. In International Conference on Machine Learning, 9712–9721. PMLR.
Wu, H.; and Wager, S. 2022. Thompson sampling with unrestricted delays. In Proceedings of the 23rd ACM Conference on Economics and Computation, 937–955.
Zhou, Z.; Xu, R.; and Blanchet, J. 2019. Learning in generalized linear contextual bandits with stochastic delays. Advances in Neural Information Processing Systems, 32.
Zimmert, J.; and Seldin, Y. 2020. An Optimal Algorithm for Adversarial Bandits with Arbitrary Delays. arXiv:1910.06054.