Contextual Linear Bandits with Delay as Payoff

Mengxiao Zhang 1, Yingfei Wang 2, Haipeng Luo 3

1University of Iowa, 2University of Washington, 3University of Southern California. Correspondence to: Mengxiao Zhang. Proceedings of the 42nd International Conference on Machine Learning, Vancouver, Canada. PMLR 267, 2025. Copyright 2025 by the author(s).

Abstract

A recent work by Schlisselberg et al. (2025) studies a delay-as-payoff model for stochastic multi-armed bandits, where the payoff (either loss or reward) is delayed for a period that is proportional to the payoff itself. While this captures many real-world applications, the simple multi-armed bandit setting limits the practicality of their results. In this paper, we address this limitation by studying the delay-as-payoff model for contextual linear bandits. Specifically, we start from the case with a fixed action set and propose an efficient algorithm whose regret overhead compared to the standard no-delay case is at most $D\Delta_{\max}\log T$, where $T$ is the total horizon, $D$ is the maximum delay, and $\Delta_{\max}$ is the maximum suboptimality gap. When the payoff is a loss, we show a further improvement of the bound, demonstrating a separation between reward and loss similar to Schlisselberg et al. (2025). Contrary to standard linear bandit algorithms that construct a least squares estimator and a confidence ellipsoid, the main novelty of our algorithm is to apply a phased arm elimination procedure that only picks actions in a volumetric spanner of the action set, which addresses challenges arising from both payoff-dependent delays and large action sets. We further extend our results to the case with varying action sets by adopting the reduction of Hanna et al. (2023). Finally, we implement our algorithm and showcase its effectiveness and superior performance in experiments.

1. Introduction

Stochastic multi-armed bandit (MAB) is a well-studied theoretical framework for sequential decision making. In recent years, considerable attention has been devoted to the realistic situation where the agent observes the payoff (either reward or loss) of an arm only after a certain period of delay. However, most work assumes that the delays are payoff-independent: while the delay may depend on the chosen arm, it is sampled independently from the stochastic payoff of the chosen arm. Lancewicki et al. (2021) address this limitation by studying a setting where the delay and the reward are drawn together from a joint distribution. Later, Tang et al. (2024) consider a special case where the delay is exactly the reward. Their motivation stems from response-adaptive clinical trials that aim at maximizing survival outcomes. For example, progression-free survival (PFS), defined as the number of days after treatment until disease progression or death, is widely used to evaluate the effectiveness of a treatment. Notably, in this context, the delay in observing the PFS is the PFS itself. Schlisselberg et al. (2025) build on and refine this investigation, extending the study to the case where the delay is the loss itself. Taken together, this delay-as-payoff framework effectively captures many real-world scenarios involving time-to-event data across many domains. For example, postoperative length of stay (PLOS) is one example of time-to-event data that specifies the length of stay after surgery. Potential surgical procedures and postoperative care can be modeled as arms.
The delay defined as the time until the patient is discharged can be interpreted as the loss that we aim to minimize. As another example, in advertising, common metrics, including Average Time on Page (ATP) and Time to Re-engagement (that tracks the time elapsed between a user s initial interaction with an ad and subsequent engagement such as returning to the website), can be modeled as reward or loss inherently delayed by the same duration. Despite such recent progress, the current consideration of payoff-dependent delay remains limited to the simple multiarmed bandit (MAB) setting. While MAB frameworks are foundational in decision-making problems, they have notable practical limitations. Specifically, they fail to account for the influence of covariates that drive heterogeneous responses across different actions. This makes them less suitable for scenarios involving a large number of (potentially dynamically changing) actions and/or situations where context is crucial in shaping outcomes. Contextual Linear Bandits with Delay as Payoff Contributions. Motivated by this limitation, in this work, we extend the delay-as-payoff model from MAB to contextual linear bandits, a practical framework that is widely used in real-world applications. Specifically, our contributions are as follows. As a first step, in Section 3, we study stochastic linear bandits with a fixed action set (known as the noncontextual setting). We point out the difficulty of directly combining the standard Lin UCB algorithm with the idea of Schlisselberg et al. (2025), and propose a novel phased arm elimination algorithm that only selects actions from a volumetric spanner of the action set. In the delay-as-loss case, we prove that, compared to the standard regret in the delay-free setting, the overhead caused by the payoff-dependent delay for our algorithm is O(min{nd log(T/n)+D max, D max log(T/n)}), where n is the dimension of the action set, T is the total horizon, max is the maximum suboptimality gap, d is the expected delay of the optimal action, and D is the maximum possible delay (formal definitions are deferred to Section 2). This instance-dependent bound is in the same spirit as the one of Schlisselberg et al. (2025) and is small whenever the optimal action has a small loss. In the delay-as-reward case, a slightly worse bound is provided in Appendix B; such a separation between loss and reward is similar to the results of Schlisselberg et al. (2025). Next, in Section 4, we extend our results to the contextual case where the action set is varying and drawn i.i.d. from an unknown distribution. Using a variant of our noncontextual algorithm (that can handle misspecification) as a subroutine, we apply the contextual to non-contextual reduction recently proposed by (Hanna et al., 2023) and show that the resulting algorithm enjoys a similar regret guarantee despite having varying action sets, establishing the first regret guarantee for contextual linear bandits with delay-as-payoff. In Section 5, we implement our algorithm and test it on synthetic linear bandits instances, demonstrating its superior performance against a baseline that runs Lin UCB with only the currently available feedback. Related works. Recent research has investigated different problems of learning under bandit feedback with delayed payoff, addressing various new challenges caused by the combination of delay and bandit feedback. As mentioned, most studies assume payoff-independent delays. Among this line of research, Dud ık et al. 
(2011) are among the first to consider delays in stochastic MAB with a constant delay. Mandel et al. (2015) and Joulani et al. (2013) extend the consideration to stochastic delays, with the assumption that the delay is bounded. Subsequent studies on i.i.d. stochastic delays differentiate between arm-independent and arm-dependent delays. For arm-independent delays, Zhou et al. (2019); Vernade et al. (2020a); Blanchet et al. (2024) show regret characterizations for (generalized) linear stochastic contextual bandits. Pike-Burke et al. (2018) consider aggregated anonymous feedback, under the assumption that the expected delay is bounded and known to the learner. Arm-dependent stochastic delays have been explored in various settings, including Gael et al. (2020); Arya & Yang (2020); Lancewicki et al. (2021). Far less attention has been given to payoff-dependent stochastic delays. The setting in Vernade et al. (2017) implies a dependency between the reward and the delay, as a current non-conversion could be the result of a delayed reward of 1. Lancewicki et al. (2021) consider the case where the stochastic delay in each round and the reward are drawn from a general joint distribution. Tang et al. (2024) investigate strongly reward-dependent delays, specifically motivated by medical settings where the delay is equal to the reward. Schlisselberg et al. (2025) follow this investigation and extend the discussion to delay as loss, and provide a tighter regret bound. Although with a slightly different focus, Thune et al. (2019); Zimmert & Seldin (2020); Gyorgy & Joulani (2021); Van Der Hoeven & Cesa-Bianchi (2022); van der Hoeven et al. (2023) and several other works study non-stochastic bandits, where both the delay and rewards are adversarial. Nevertheless, the payoff-dependent (either loss or reward) delays are only studied under stochastic multi-armed bandits (MAB). In this work, we extend the study to contextual linear bandits, significantly broadening its practicality. 2. Preliminary Throughout this paper, we use [N] to denote {1, 2, . . . , N} for some positive integer N. Let Rn + be the n-dimensional Euclidean space in the positive orthant and Bn 2(1) = {v Rn : v 2 1} be the n-dimensional unit ball with respect to ℓ2 norm. Define U[a, b] to be the uniform distribution over [a, b]. For a real number a, define SGN(a) as the sign of a. For two real numbers a and b, define a b max{a, b}. For a finite set S, denote |S| as the cardinality of S. The notation e O( ) hides all logarithmic dependencies. In this paper, we consider the delay-as-payoff model proposed by Schlisselberg et al. (2025), in which the delay of the payoff is proportional to the payoff itself. Specifically, we study stochastic linear bandits in this model, and we start with a fixed action set as the first step (referred to as the non-contextual case) and then move on to the case with a time-varying action set (referred to as the contextual case). For conciseness, we mainly discuss the payoff-as-loss case, Contextual Linear Bandits with Delay as Payoff but our algorithm and results can be directly extended to the payoff-as-reward case (see Appendix B). Non-contextual stochastic linear bandits. In this problem, the learner is first given a fixed finite set of actions A Rn + Bn 2(1) with cardinality |A| = K. Let D > 0 be the maximum possible delay. 
At each round t [T], the learner selects an action at A and incurs a loss ut = µat + ηt [0, 1] where ηt is zero-mean random noise, µa = a, θ is the expected payoff of action a, and θ Rn + Bn 2(1) is the model parameter that is unknown to the learner.1 Then, the loss is received by the learner at the end of round t + dt where the delay dt = D ut (that is, proportional to the loss). The goal of the learner is to minimize the (expected) pseudo regret defined as follows: T min a A a, θ . (1) Let a argmina A a, θ be an optimal action, µ = µa be its expected loss, and d = Dµ be its expected delay. For an action a, define a = a a , θ as its suboptimality gap. Further define min = mina A, a>0 a and max = maxa A a to be the minimum and maximum non-zero sub-optimality gap respectively. We point out that the standard multi-armed bandit (MAB) setting considered in Schlisselberg et al. (2025) is a special case of our setting with A being the set of all standard basis vectors in Rn. Contextual stochastic linear bandits. In the contextual case, the main difference is that the action set is not fixed but changing over rounds. Specifically, at each round t, the learner first receives an action set At Rn + Bn 2(1) (which can be seen as a context), where we assume that At is drawn i.i.d. from an unknown distribution P. The rest of the protocol remains the same, and the goal of the learner is still to minimize the (expected) pseudo regret, defined as: t=1 min a t At a t , θ where the expectation is taken over both the internal randomness of the algorithm and the external randomness in the action sets and loss noises. 3. First Step: Non-Contextual Linear Bandits In this section, we focus on the non-contextual case, which serves as a building block for eventually solving the contextual case. Before introducing our algorithm, we first briefly introduce the successive arm elimination algorithm 1We enforce both A Rn + Bn 2 (1) and θ Rn + Bn 2 (1) to make sure that the payoff (and hence the delay) is non-negative. for the simpler MAB setting proposed by Schlisselberg et al. (2025) and their ideas of handling payoff-dependent delay. Specifically, their algorithm starts with a guess B = 1/D on the optimal action s loss, and maintains an active set of arms. The algorithm pulls each arm in the active set once, and constructs two LCB s (lower confidence bound) and one UCB (upper confidence bound) for each action in the active set as follows (supposing the current round being t): LCBt,1(a) = 1 kt(a) kt(a) , (2) LCBt,2(a) = 1 ct(a) 2 log T ct(a) 1, (3) UCBt(a) = 1 ct(a) τ Ct(a) uτ + 2 log T ct(a) 1, (4) where kt(a) = Pt τ=1 1{at = a} is the total number of pulls of action a till round t, Ot(a) = {τ : τ + dτ t and aτ = a} is the set of rounds where action a is chosen and its loss has been received by the end of round t, Ct(a) = {τ t D : aτ = a} is the set of rounds up to t D where action a is chosen (so its loss has for sure been received by the end of round t), and ct(a) = |Ct(a)|. Specifically, Eq. (2) constructs an LCB of action a assuming all the action s unobserved loss to be 0 (the smallest possible), while Eq. (3) and Eq. (4) construct an LCB and a UCB using only the losses no later than round t D (which must have been received by round t), making the empirical average a better estimate of the expected loss. With UCBt(a) and LCBt(a) = max{LCBt,1(a), LCBt,2(a)} constructed, the algorithm eliminates an action a if its LCBt(a) is larger than min{UCBt(a ), B} for some a in the active set. 
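As a concrete illustration of this construction, the following sketch (Python with NumPy) computes the bookkeeping sets and the three bounds of Eq. (2)-(4) from the learner's history. The variable names and the exact confidence widths (in particular the width subtracted in the Eq. (2) counterpart and the Hoeffding constant) are our own illustrative choices, not the paper's exact constants.

```python
import numpy as np

def mab_bounds(t, T, D, history, a):
    """Confidence bounds in the spirit of Eq. (2)-(4) for arm `a` at round t.

    history: list of (arm, loss, delay) tuples for rounds 1..t-1; the loss of
    round tau becomes available to the learner only once tau + delay <= t.
    """
    pulls = [(tau, u, d) for tau, (arm, u, d) in enumerate(history, start=1) if arm == a]
    k = len(pulls)                                      # k_t(a): total pulls of a
    O = [u for tau, u, d in pulls if tau + d <= t]      # O_t(a): losses already observed
    C = [u for tau, u, d in pulls if tau <= t - D]      # C_t(a): losses surely observed
    c = max(len(C), 1)
    width_c = np.sqrt(2 * np.log(T) / c)

    # Eq. (2)-style LCB: every unobserved loss is treated as 0, its smallest
    # possible value (the subtracted width here mirrors Eq. (3) and is assumed).
    lcb1 = sum(O) / max(k, 1) - np.sqrt(2 * np.log(T) / max(k, 1))
    # Eq. (3)-(4): mean over rounds no later than t - D, plus/minus a Hoeffding width.
    lcb2 = sum(C) / c - width_c
    ucb = sum(C) / c + width_c
    return max(lcb1, lcb2), ucb        # LCB_t(a) = max of the two LCBs, and UCB_t(a)
```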
If all the actions are eliminated, this means that the guess B on the optimal loss is too small, and the algorithm starts a new epoch with B doubled.2 Challenges However, this approach cannot be directly applied to linear bandits. Specifically, standard algorithms for stochastic linear bandits without delay (e.g., Li et al. (2010); Abbasi-Yadkori et al. (2011)) all construct UCB/LCB for each action by constructing an ellipsoidal confidence set for θ. In the delay-as-payoff model, while it is still viable to construct UCB/LCB similar to Eq. (3) and Eq. (4) via a standard confidence set of θ, it is difficult to construct an LCB counterpart similar to Eq. (2). This is because one action s loss is estimated using observations of all other actions in linear bandits, and naively treating the unobserved 2In fact, Schlisselberg et al. (2025) construct yet another LCB based on the number of unobserved losses. We omit this detail since we are not able to use this to further improve our bounds for linear bandits. Contextual Linear Bandits with Delay as Payoff loss of one action as zero might not necessarily lead to an underestimation of another action. Our ideas To bypass this barrier, we give up on estimating θ itself and propose to construct UCB/LCB for each action using the observed losses of the volumatric spanner of the action set. A volumetric spanner of an action set A is defined such that every action in A can be expressed as a linear combination of the spanner. Definition 3.1 (Volumetric Spanner (Hazan & Karnin, 2016)). Suppose that A = {a1, a2, . . . , a N} is a set of vectors in Rn. We say S A is a volumetric spanner of A if for any a A, we can write it as a = P b S λb b for some λ R|S| with λ 2 1. Due to the linear structure, it is clear that the loss µa of action a can be decomposed in a similar way as P b S λbµb, making it possible to estimate every action s loss by only estimating the loss of the spanner. Moreover, such a spanner can be efficiently computed: Proposition 3.2 (Bhaskara et al. (2023)). Given a finite set A of size K, there exists an efficient algorithm finding a volumetric spanner S of A with |S| = 3n within O(Kn3 log n) runtime. Equipped with the concept of volumetric spanner, we are now ready to introduce our algorithm (see Algorithm 1). Specifically, our algorithm also makes a guess B on the loss of the optimal action. With this guess, it proceeds to multiple epochs of arm elimination procedures, with the active action set initialized as A1 = A. In each epoch m, instead of picking every action in the active set Am, we first compute a volumetric spanner Sm of Am with |Sm| = 3n (Line 4), which can be done efficiently according to Proposition 3.2, and then pick each action in the spanner set Sm for 2m rounds in a round-robin way (Line 5). After that, we calculate two UCBs and two LCBs for actions in the spanner, in a way similar to the simpler MAB setting discussed earlier (Line 7). Specifically, the first one is in the same spirit of Eq. (2): we calculate bµ+ m(a) ( bµ m(a)) as an overestimation (underestimation) of the expected loss of action a by averaging over all observed losses from the rounds in Om(a) as well as the maximum (minimum) possible value of unobserved losses from the rounds in Em(a); see Eq. (5) and Eq. (6). The first UCB (LCB) bµ+ m,1(a) (bµ m,1(a)) is then computed based on bµ+ m(a) (bµ m(a)) by incorporating a standard confidence width β 2m a 2 for some coefficient β; see Eq. (7) and Eq. (8). 
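As a brief aside before the second pair of confidence bounds, Definition 3.1 can be made concrete with a small sketch. Computing a volumetric spanner itself requires the algorithm referenced in Proposition 3.2, which we do not reproduce; the sketch below simply assumes a spanner S is given (as a matrix whose columns are the spanner actions; the matrix generated here is a random stand-in rather than a true spanner) and recovers the weights λ for an action a. Since Definition 3.1 guarantees that some decomposition with ‖λ‖₂ ≤ 1 exists, the minimum-norm least-squares solution returned below also satisfies that bound.

```python
import numpy as np

def decompose(a, S):
    """Write action a as S @ lam, where the columns of S form a volumetric spanner.

    Definition 3.1 guarantees some lam with ||lam||_2 <= 1 and S @ lam = a; the
    pseudo-inverse returns the minimum-norm solution, so it inherits this bound.
    """
    lam, *_ = np.linalg.lstsq(S, a, rcond=None)
    assert np.allclose(S @ lam, a), "a must lie in the span of S"
    return lam

# Expected losses decompose the same way: with mu_S the losses of the spanner
# actions, mu_a = <a, theta> = sum_i lam_i <S[:, i], theta> = lam @ mu_S.
rng = np.random.default_rng(0)
theta = np.abs(rng.normal(size=4)); theta /= np.linalg.norm(theta)
S = np.abs(rng.normal(size=(4, 12)))          # stand-in for a spanner with 3n columns
S /= np.linalg.norm(S, axis=0)
a = S @ (rng.uniform(size=12) / 12)           # an action lying in the span of S
lam = decompose(a, S)
assert np.isclose(a @ theta, lam @ (S.T @ theta))   # mu_a equals the weighted spanner losses
```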
Then, to compute the second UCB/LCB, which is in the same spirit as Eq. (3) and Eq. (4), we calculate an unbiased estimation bµF m(a) of the expected loss of a by averaging losses from the rounds in Cm(a), that is, all the rounds where the observation must have been revealed; see Eq. (9). Note that the number of Algorithm 1: Phased Elimination via Volumetric Spanner for Linear Bandits with Delay-as-Loss 1 Input: maximum possible delay D, action set A, β > 0. 2 Initialization: optimal loss guess B = 1/D. 3 Initialization: active action set A1 = A. for m = 1, 2, . . . , do 4 Find Sm = {am,1, . . . , am,|Sm|}, a volumetric spanner of Am with |Sm| = 3n. 5 Pick each a Sm 2m times in a round-robin way. 6 Let Im contain all the rounds in this epoch. 7 For each a Sm, calculate the following quantities: bµ+ m(a) = 1 2m X τ Om(a) uτ + X τ Em(a) 1 , (5) bµ m(a) = 1 2m X τ Om(a) uτ, (6) bµ+ m,1(a) = bµ+ m(a) + β 2m/2 a 2, (7) bµ m,1(a) = bµ m(a) β 2m/2 a 2, (8) bµF m(a) = 1 cm(a) τ Cm(a) uτ, (9) bµ+ m,2(a) = bµF m(a) + β p cm(a) a 2, (10) bµ m,2(a) = bµF m(a) β p cm(a) a 2, (11) where cm(a) = |Cm(a)|, Cm(a) = {τ Im : τ + D Im, aτ = a}, Om(a) = {τ Im : τ + dτ Im, aτ = a}, and Em(a) = {τ Im : aτ = a} \ Om(a). for each a Am do 8 Decompose a as a = P|Sm| i=1 λ(a) m,iam,i with λ(a) m 2 1 and calculate i=1 λ(a) m,i bµ SGN(λ(a) m,i) m,2 (am,i), LCBm(a) = max j {1,2}{LCBm,j(a)} where LCBm,j(a) = i=1 λ(a) m,i bµ SGN( λ(a) m,i) m,j (am,i), 9 Set Am+1 = Am. for a Am do 10 if a Am, s.t. LCBm(a) min{UCBm(a ), B} then Eliminate a from Am+1. 11 if Am+1 = then Set B 2B and go to Line 3. Contextual Linear Bandits with Delay as Payoff such rounds, cm(a) = |Cm(a)|, is a fixed number, and thus bµF m(a) is indeed unbiased. Similarly, we incorporate a standard confidence width β cm(a) a 2 to arrive at the second UCB bµ+ m,2(a) and LCB bµ m,2(a); see Eq. (10) and Eq. (11). The next step of our algorithm is to use these UCBs/LCBs for the spanner to compute corresponding UCBs/LCBs for every active action in Am (Line 8). Specifically, for each action a Am, according to the definition of a volumetric spanner (Definition 3.1), we can write a as a linear combination of actions in Sm: P|Sm| i=1 λ(a) m,iam,i . As mentioned, due to the linear structure of losses, we also have µa = P|Sm| i=1 λ(a) m,iµam,i. Thus, when constructing a UCB (or similarly LCB) for a, based on whether λ(a) m,i is positive or not, we decide whether to use the UCB or LCB of am,i; see Eq. (12), a counterpart of Eq. (4), and Eq. (13), a counterpart of Eq. (2) and Eq. (3).3 At the end of an epoch, we eliminate all actions from the active action set if their LCB is either larger than the UCB of certain action or the guess B on the optimal loss (Line 10). If the active set becomes empty, this means that the guess B is too small, and the algorithm restarts with the guess doubled; otherwise, we continue to the next epoch. Computational complexity We analyze the computational complexity of Algorithm 1. Specifically, the total runtime of our algorithm over T rounds is O(n T + Kn3 log n log(T/n)), as we compute the volumetric spanner only once per epoch, and the total number of epochs is O(log(T/n)). Compared to the classic Lin UCB algorithm (Abbasi-Yadkori et al., 2011), our approach is in fact more computationally efficient: Lin UCB computes the UCB for each action at a cost of O(n2) per round, resulting in a total runtime of O(Kn2T). Theoretical performance We prove the following regret bound for our algorithm. Theorem 3.3. 
Algorithm 1 with β = p 2 log(KT 3) guarantees: Reg O (min {V1, V2}) + log(d ) O (min {W1, W2}) , where V1 = n2 log(KT ) log(T/n) log(d ) min , V2 = n p T log(d ) log(KT), W1 = nd log(T/n) + D max, and W2 = D max log(T/n). The first term in the regret bound O (min {V1, V2}) is of order e O(min{ n2 min , n T}), which matches the standard regret bound of Lin UCB in the case without delay (Abbasi Yadkori et al., 2011). The second term is the overhead 3This also explains why we need bµ+ m(a), a quantity not used in Schlisselberg et al. (2025). caused by delay and is in the same spirit as the result of Schlisselberg et al. (2025): focusing only on the part that grows in T, we see that W1 only depends on d , the expected delay of the optimal action (and hence the smallest delay among all actions), while W2 depends on the maximum possible delay D but scaled by max, the largest sub-optimality gap. Therefore, the delay overhead of our algorithm is small when either the shortest delay is small or all actions have similar losses. We remark again that in the delay-as-reward setting, we obtain similar results; see Appendix B for details. 3.1. Analysis In this section, we provide a proof sketch of Theorem 3.3. Detailed proofs are deferred to Appendix A. The proof starts by proving that UCBm(a) and LCBm(a) are indeed valid UCB and LCB respectively for all actions in Am. This follows from first using standard concentration inequalities to show that bµ+ m,1(a) and bµ+ m,2(a) (bµ m,1(a) and bµ m,2(a)) are valid UCBs (LCBs) for each action in the spanner, and then generalizing it to every action a Am according to its decomposition over the actions in the spanner. With this property, our analysis then proceeds to control the regret of Algorithm 1 for each guess of B separately. Let TB be the set of rounds when Algorithm 1 runs with guess B. In Step 1, we first show that the use of LCBm,2(a) and UCBm(a) ensures a regret bound of O (min{R1, R2} + D max log(T/n)) where R1 = n2 log(KT ) log(T/n) min and R2 = n p |TB| log(KT), and then in Step 2, we show that the use of LCBm,1(a) and UCBm(a) ensures a regret bound of O(min{R1, R2} + (nd + DB) log(T/n) + D max). Step 1 For notational convenience, we define rad F m,a = β i=1 |λ(a) m,i| a 2 p to be the total confidence radius of action a coming from the definition of LCBm,2(a) and UCBm(a). Via a standard analysis of arm elimination, we show that that if an action a is not eliminated at the end of epoch m, we have a 4 max a Am rad F m,a 4 where the second inequality uses Cauchy-Schwarz inequality and the properties of volumetric spanners, specifically that λ(a) m 2 1 and |Sm| = 3n. To provide a lower bound on cm(a ) for any a Sm, note that we pick each action Contextual Linear Bandits with Delay as Payoff a Sm 2m times in a round-robin manner, and thus cm(a ) 2m D |Sm| 1 = 2m D Rearranging the terms, we then obtain 3n + a. (14) Taking summation over all a Sm and m, and noticing that the total number of epochs is bounded by M = log2(|TB|/3n) , we arrive at the following O(R1 + D max log(T/n)) regret guarantee: a Sm, a>0 2 48nβ2 a Sm, a>0 O n log(KT) + O (D max log(T/n)) , O n2 log(T/n) log(KT) + O (D max log(T/n)) , where the first inequality is because a Sm is not eliminated in epoch m 1 and the last inequality is by lower bounding a by min. To obtain the other instance-independent regret bound O(R2 + D max log(T/n)), we bound the regret differently by considering a β p n/2m and a β p n/2m separately: n/2m (2m a) |TB| log(KT) + D max log(T/n)). 
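To make Step 1 easier to follow, the chain of inequalities it relies on can be collected in one place. This is only a cleaned-up summary of the displayed bounds above (with the constant $48n\beta^2$ and the definition of $\mathrm{rad}^F_{m,a}$ taken from the text, and under the reading that each spanner action is pulled $2^m$ times per epoch); it is not an additional result. If an action $a$ survives epoch $m$, then
$$
\Delta_a \;\le\; 4\max_{a'\in S_m}\mathrm{rad}^F_{m,a'} \;\le\; \frac{4\sqrt{3n}\,\beta}{\sqrt{\min_{a'\in S_m}c_m(a')}},
\qquad
c_m(a') \;\ge\; 2^m - \frac{D}{|S_m|} - 1 ,
$$
and rearranging the two displays yields Eq. (14) in the form $2^m\Delta_a \le \frac{48n\beta^2}{\Delta_a} + \frac{D\Delta_a}{3n} + \Delta_a$, which is then summed over $a \in S_m$ and over all epochs.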
Step 2 To obtain the other regret bound O(min{R1, R2}+ (nd + DB) log(T/n) + D max) with a different delay overhead, we similarly define rad N m,a = β i=1 |λ(a) m,i| a 2 as the total confidence radius of action a coming from the definition of LCBm,1(a). Further let bµm(a) = τ Om(a) Em(a) uτ be the empirical mean of action a s loss within epoch m (which is generally not available to the algorithm due to delay). According to the construction of bµ+ m(a) and bµ m(a), we know that for all a Sm, bµ+ m(a) bµm(a) + |Em(a)| 2m , bµ m(a) bµm(a) |Em(a)| Then, for any action a Am that is not eliminated at the end of epoch m, using the fact that a = P|Sm| i=1 λ(a) m,iam,i, we obtain with high probability: i=1 λ(a) m,i bµm(am,i) + rad N m,a LCBm,1(a) + rad N m,a + i=1 |λ(a) m,i| |Em(am,i)| LCBm,1(a) + rad N m,a i=1 |λ(a) m,i| 2Dµam,i 2m|Sm| + 16 log KT + 2 B + rad N m,a i=1 |λ(a) m,i| 2Dµam,i 2m|Sm| + 16 log KT + 2 where the first inequality is by standard Azuma-Hoeffding s inequality, the third inequality is by Lemma C.2 of Schlisselberg et al. (2025) (included as Lemma A.2 in the appendix for completeness), and the last inequality is because a is not eliminated at the end of epoch m. Now consider two cases. When B µa 2 , we know that a µa µ 2B. Using the previous Eq. (14), we know that Otherwise, when B < µa 2 , with some manipulation on Eq. (16), we show that a + P|Sm| i=1 Dµam,i Combining Eq. (17) and Eq. (18), we then obtain that within epoch m, the regret is bounded by i=1 µam 1,i since all active actions in epoch m are not eliminated in epoch m 1. The first term P a in Eq. (19) Contextual Linear Bandits with Delay as Payoff Algorithm 2: Reduction from Contextual Linear Bandits to Non-Contextual Linear Bandits (Hanna et al., 2023) Input: confidence level δ, an instance Algn-ctx of Algorithm 3 with β = p 2 log(KT 3). Let Θ be a 1 T -cover of Θ with size O(T n). for m = 1, 2, . . . do 1 Construct action set Xm = {g(m)(θ) | θ Θ } where g(m)(θ) = 1 2m 1 P2m 1 τ=1 argmina Aτ a, θ . 2 Initiate Algn-ctx with action set Xm and misspecification level εm = min{1, 2 p log(T|Θ |/δ)/2m}. 3 for t = 2m 1 + 1, . . . , 2m do 4 Algn-ctx outputs action g(m)(θt). 5 Observe At and select at = argmina At a, θt . 6 Observe the loss uτ for all τ such that τ + dτ (t 1, t] and send them to Algn-ctx. eventually leads to the min{R1, R2} term in the claimed regret bound, by the exact same reasoning as in Step 1. The second term explains the final DB log(T/n) term in the regret bound (recall that number of epoch is of order O(log(T/n))). Finally, the last term in Eq. (19) can be written as D P|Sm 1| i=1 am 1,i + 3n d , and the term D P|Sm 1| i=1 am 1,i is one half of the regret incurred in epoch m 1 as long as 2m 1 > 2D (otherwise, the epoch length is smaller than D, and we bound the regret trivially by D max). Summing over all epochs and rearranging the terms thus leads to the a term nd log(T/n) in the regret. This proves the goal of the second step. Combining everything Finally, note that the number of different values of B Algorithm 1 uses is upper bounded by log2(d ) = log2(Dµ ) since the optimal action a will never be eliminated when B µ . Summing up the regret over these different values of B arrives at the the final bound O(min{V1, V2}, log(d ) min{W1, W2}). 4. Extension to Contextual Linear Bandits In this section, we extend our results to the stochastic contextual setting where the action set at each round is drawn i.i.d. from a distribution P. 
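Before describing the reduction in detail, the following sketch illustrates the surrogate-action construction used in Algorithm 2 (shown above): for each parameter θ in the cover Θ′, the surrogate g^(m)(θ) is the average of the optimal actions of the action sets observed in the first 2^{m−1} rounds. This is a minimal illustration with assumed data layouts, not the authors' implementation.

```python
import numpy as np

def surrogate_actions(past_action_sets, cover):
    """g^(m)(theta) from Algorithm 2: the average, over the action sets
    A_1, ..., A_{2^{m-1}} seen so far, of the action minimizing <a, theta>.

    past_action_sets: list of arrays, each of shape (K_t, n)
    cover: array of shape (|Theta'|, n), a 1/T-cover of the parameter space
    Returns X_m as an array of shape (|Theta'|, n); row j is g^(m)(cover[j]).
    """
    X = []
    for theta in cover:
        best = [A[np.argmin(A @ theta)] for A in past_action_sets]  # argmin_{a in A_tau} <a, theta>
        X.append(np.mean(best, axis=0))
    return np.array(X)
```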
While the arm elimination procedure is critical in solving our problem in the non-contextual case with a fixed action set, it is not clear (if possible at all) to directly generalize it to the contextual setting due to the dynamic nature of the action set. Fortunately, a recent work by Hanna et al. (2023) proposes a reduction from contextual linear bandits to non-contextual linear bandits (both without delay). At a high level, this reduction constructs a fixed action for each possible parameter θ of the contextual bandit instance, where the expected payoff of this fixed action equals to the expected payoff of the optimal action assuming that the true parameter is θ. Since the distribution of the action set is unknown, this reduction estimates each action s expected payoff using the historical action sets. This estimation introduces a misspecification error, resulting in a misspecified model. Therefore, the subroutine used by this reduction needs to be able to handle an ε-misspecified model, where the loss of each a A is approximately linear: µa = a, θ + εa [0, 1], with ε maxa A |εa| indicating the misspecification level. It turns out that, a simple modification of our Algorithm 1 can indeed address such misspecification it only requires incorporating the misspecification level ε into the criteria of arm elimination; see Algorithm 3 and specifically its Line 10 for details. We then plugin this subroutine, denoted as Algn-ctx, into their reduction, as shown in Algorithm 2. Specifically, the algorithm first constructs a 1 T -cover Θ of the parameter space Θ = Rn + Bn 2(1) with size |Θ | = O(T n). It then proceeds in epochs with doubling length. At the start of epoch m, a new fixed action set Xm = {g(m)(θ) : θ Θ } is constructed, where g(m)(θ) is the averaged optimal action over the previous m 1 epochs, assuming the model parameter being θ. Then, a new instance of Algn-ctx with action set Xm and some misspecification level εm is initiated and run for the entire epoch. At each round t of this epoch, Algn-ctx outputs an action g(m)(θt) Xm, and the algorithm s final decision upon receiving the true action set At is at = argmina At a, θt . Finally, at the end of this round, all newly observed losses are sent to Algn-ctx. Guarantees and Analysis Even though our algorithm is a direct application of the reduction of Hanna et al. (2023), it is a priori unclear whether it enjoys any favorable regret guarantee in the delay-as-loss setting. By adopting and generalizing their analysis, we show that this is indeed the case. Before introducing our results, we define the following quantities: argmin a A a, θ , n-ctx min min θ Θ , g(θ ),θ = g(θ),θ E [ g(θ) g(θ ), θ ] , n-ctx max max θ Θ E [ g(θ) g(θ ), θ ] , Contextual Linear Bandits with Delay as Payoff 0 5000 10000 15000 Time steps Lin UCB OTFLin UCB OTFLin TS Our Algorithm 0 5000 10000 15000 Time steps Lin UCB OTFLin UCB OTFLin TS Our Algorithm 0 5000 10000 15000 Time steps Lin UCB OTFLin UCB OTFLin TS Our Algorithm 0 5000 10000 15000 Time steps Lin UCB OTFLin UCB OTFLin TS Our Algorithm 0 5000 10000 15000 Time steps Lin UCB OTFLin UCB OTFLin TS Our Algorithm 0 5000 10000 15000 Time steps Lin UCB OTFLin UCB OTFLin TS Our Algorithm Figure 1. Comparison of the empirical results of our algorithm, Lin UCB, OTFLin UCB, and OTFLin TS with K = 70. The top row is the delay-as-loss setting and the bottom row is the delay-as-reward setting. The left, middle, and right column correspond to n = 6, 8, 10 respectively. 
d D g(θ), θ = D EA P min a A a, θ , where g(θ) denotes the optimal action in expectation, n-ctx min ( n-ctx max) denotes the minimum (maximum) suboptimality gap for the reduced non-contextual linear bandit instance, and d denotes the expected delay of the optimal action. Theorem 4.1. Algorithm 2 with δ = 1/T 2 guarantees Reg = O n p T log T + min{V1, V2} + log(d ) min{W1, W2} , where V1 = n3 log2(T ) log(T/n) log(d ) n-ctx min , V2 = T log(d ) log(T), W1 = log T(nd log(T/n) + D n-ctx max), and W2 = D n-ctx max log T log(T/n). The proof is deferred to Appendix C. The regret bound is in the same spirit as the one for the non-contextual case (Theorem 3.3) and consists of a term for standard regret and a term for delay overhead. The standard part unfortunately suffers higher dependence on the dimension n, while the delay overhead is in a similar problem-dependent form. We remark that this is the first regret guarantee for contextual linear bandits with delay-as-payoff, resolving an open problem asked by (Schlisselberg et al., 2025). 5. Experiment In this section, we implement and evaluate our algorithm for both the delay-as-loss and delay-as-reward settings. For simplicity, we only consider the non-contextual setting. Since there are no existing algorithms for this problem (to the best of our knowledge), we consider three simple benchmarks. The first one applies the standard Lin UCB algorithm only using the currently available observations (see Algorithm 5 in Appendix D for details). We point out that this simple approach to handling delayed feedback is indeed very common in the literature and in fact enjoys favorable guarantees at least for some problems (Thune et al., 2019; van der Hoeven et al., 2023). Additionally, we include two benchmark algorithms: OTFLin UCB and OTFLin TS (Vernade et al., 2020a), which are designed for linear bandits but with payoff-independent stochastic delay. Experiment setup The experiment setup is as follows. We set the dimension n {6, 8, 10} and the size of the action set |A| = 70. The model parameter θ is set to be |ν| ν 2 where ν is drawn from the n-dimensional standard normal distribution and |ν| denotes the entry-wise absolute value of ν to make sure that θ Rn + Bn 2(1). Each action a A is constructed by first sampling ai uniformly from [0, 1] for all i [n] and then normalizing it to unit ℓ2-norm. When an action at is chosen at round t, the payoff ut is drawn from beta distribution with α = µat and β = 1 µat.The Contextual Linear Bandits with Delay as Payoff number of iterations T is 16000 and the maximal possible delay D is 1000. For simplicity, we also ignore the role of B in our algorithms. Results In Figure 1, we plot the mean and the standard deviation of the regret over 8 independent experiments with different random seeds, for each n {6, 8, 10} (the columns) and also both delay-as-loss and delay-as-reward (the rows). We observe that our algorithm consistently outperforms all the three benchmarks in all setups. Also, in all runs, after about 9 to 12 epochs, our algorithm eliminates a significant number of bad actions, leading to almost constant regret after that point (and explaining the phase transition in the plots). 6. Conclusion In this work, we initiate the study of the delay-as-payoff model for contextual linear bandits and develop provable algorithms that require novel ideas compared to standard linear bandits. 
Interesting future directions include proving matching regret lower bounds and extending our results to general payoff-dependent delays (Lancewicki et al., 2021), unbounded delay/payoff (Howson et al., 2023; Zhou et al., 2019), and other even more challenging settings, such as those with intermediate observations (Vernade et al., 2020b; Esposito et al., 2023) or evolving observations (Bar-On & Mansour, 2024). Impact Statement This paper presents work whose goal is to advance the field of Machine Learning. There are many potential societal consequences of our work, none which we feel must be specifically highlighted here. Acknowledgement HL is supported by NSF award IIS-1943607. Abbasi-Yadkori, Y., P al, D., and Szepesv ari, C. Improved algorithms for linear stochastic bandits. Advances in neural information processing systems, 24, 2011. Arya, S. and Yang, Y. Randomized allocation with nonparametric estimation for contextual multi-armed bandits with delayed rewards. Statistics & Probability Letters, 164:108818, 2020. Bar-On, Y. and Mansour, Y. Non-stochastic bandits with evolving observations. ar Xiv preprint ar Xiv:2405.16843, 2024. Bhaskara, A., Mahabadi, S., and Vakilian, A. Tight bounds for volumetric spanners and applications. In Thirtyseventh Conference on Neural Information Processing Systems, 2023. URL https://openreview.net/ forum?id=c4Xc0u TLXW. Blanchet, J., Xu, R., and Zhou, Z. Delay-adaptive learning in generalized linear contextual bandits. Mathematics of Operations Research, 49(1):326 345, 2024. Dud ık, M., Hsu, D. J., Kale, S., Karampatziakis, N., Langford, J., Reyzin, L., and Zhang, T. Efficient optimal learning for contextual bandits. In Conference on Uncertainty in Artificial Intelligence, 2011. Esposito, E., Masoudian, S., Qiu, H., Van Der Hoeven, D., Cesa-Bianchi, N., and Seldin, Y. Delayed bandits: when do intermediate observations help? International Conference on Machine Learning, 2023. Gael, M. A., Vernade, C., Carpentier, A., and Valko, M. Stochastic bandits with arm-dependent delays. In International Conference on Machine Learning, pp. 3348 3356. PMLR, 2020. Gyorgy, A. and Joulani, P. Adapting to delays and data in adversarial multi-armed bandits. In International Conference on Machine Learning, pp. 3988 3997. PMLR, 2021. Hanna, O. A., Yang, L., and Fragouli, C. Contexts can be cheap: Solving stochastic contextual bandits with linear bandit algorithms. In The Thirty Sixth Annual Conference on Learning Theory, pp. 1791 1821. PMLR, 2023. Hazan, E. and Karnin, Z. Volumetric spanners: an efficient exploration basis for learning. Journal of Machine Learning Research, 2016. Howson, B., Pike-Burke, C., and Filippi, S. Delayed feedback in generalised linear bandits revisited. In Ruiz, F., Dy, J., and van de Meent, J.-W. (eds.), Proceedings of The 26th International Conference on Artificial Intelligence and Statistics, volume 206 of Proceedings of Machine Learning Research, pp. 6095 6119. PMLR, 25 27 Apr 2023. Joulani, P., Gyorgy, A., and Szepesv ari, C. Online learning under delayed feedback. In International Conference on Machine Learning, pp. 1453 1461. PMLR, 2013. Lancewicki, T., Segal, S., Koren, T., and Mansour, Y. Stochastic multi-armed bandits with unrestricted delay distributions. In International Conference on Machine Learning, pp. 5969 5978. PMLR, 2021. Li, L., Chu, W., Langford, J., and Schapire, R. E. A contextual-bandit approach to personalized news article Contextual Linear Bandits with Delay as Payoff recommendation. 
In Proceedings of the 19th international conference on World wide web, pp. 661 670, 2010. Mandel, T., Liu, Y.-E., Brunskill, E., and Popovi c, Z. The queue method: Handling delay, heuristics, prior data, and evaluation in bandits. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 29, 2015. Pike-Burke, C., Agrawal, S., Szepesvari, C., and Grunewalder, S. Bandits with delayed, aggregated anonymous feedback. In International Conference on Machine Learning, pp. 4105 4113. PMLR, 2018. Schlisselberg, O., Cohen, I., Lancewicki, T., and Mansour, Y. Delay as payoff in mab. In Proceedings of the AAAI Conference on Artificial Intelligence, pp. 20310 20317, 2025. Tang, Y., Wang, Y., and Zheng, Z. Stochastic multi-armed bandits with strongly reward-dependent delays. In International Conference on Artificial Intelligence and Statistics, pp. 3043 3051. PMLR, 2024. Thune, T. S., Cesa-Bianchi, N., and Seldin, Y. Nonstochastic multiarmed bandits with unrestricted delays. Advances in Neural Information Processing Systems, 32, 2019. Van Der Hoeven, D. and Cesa-Bianchi, N. Nonstochastic bandits and experts with arm-dependent delays. In International Conference on Artificial Intelligence and Statistics. PMLR, 2022. van der Hoeven, D., Zierahn, L., Lancewicki, T., Rosenberg, A., and Cesa-Bianchi, N. A unified analysis of nonstochastic delayed feedback for combinatorial semibandits, linear bandits, and mdps. In The Thirty Sixth Annual Conference on Learning Theory, pp. 1285 1321. PMLR, 2023. Vernade, C., Capp e, O., and Perchet, V. Stochastic Bandit Models for Delayed Conversions. In Conference on Uncertainty in Artificial Intelligence, 2017. Vernade, C., Carpentier, A., Lattimore, T., Zappella, G., Ermis, B., and Brueckner, M. Linear bandits with stochastic delayed feedback. In International Conference on Machine Learning, pp. 9712 9721. PMLR, 2020a. Vernade, C., Gyorgy, A., and Mann, T. Non-stationary delayed bandits with intermediate observations. In International Conference on Machine Learning, pp. 9722 9732. PMLR, 2020b. Zhou, Z., Xu, R., and Blanchet, J. Learning in generalized linear contextual bandits with stochastic delays. Advances in Neural Information Processing Systems, 32, 2019. Zimmert, J. and Seldin, Y. An optimal algorithm for adversarial bandits with arbitrary delays. In International Conference on Artificial Intelligence and Statistics, pp. 3285 3294. PMLR, 2020. Contextual Linear Bandits with Delay as Payoff A. Omitted Details in Section 3 In this section, we provide the detailed proof for Theorem 3.3. Specifically, as mentioned in Section 4, we prove the guarantee of a modified algorithm (Algorithm 3) for the more general ε-misspecified linear bandits. Recall that in misspecified linear bandits, µa = a, θ + εa [0, 1] with |εa| ε for all a A. Due to this difference, we clarify on the definitions of a, a , µ , min, max, and d in misspecified linear bandits as follows. We still define a = a a, θ as the suboptimality gap of action a, where a argmina A a, θ , but µ mina A µa as the loss of the optimal action. Note that due to the misspecification, µ may not necessarily be µa . Define min = mina A, a>0 a and max = maxa A a to be the minimum non-zero, and maximum sub-optimality gap. The delay at round t is still defined as dt = D ut and d = D µ is the expected delay of the optimal action. As for the algorithm, Algorithm 3 differs from Algorithm 1 only in Line 10 where we add one misspecification term 4 3nε in the criteria of eliminating an action. 
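In code form, the modified elimination rule of Line 10 amounts to the following check. The sketch assumes the per-action UCB/LCB values of Eqs. (27)-(28) have already been computed, and all names are illustrative.

```python
import numpy as np

def eliminate_misspecified(active, lcb, ucb, B, n, eps):
    """Algorithm 3, Line 10: keep an action only if its LCB does not exceed the
    smallest active UCB (or the guess B) by more than the slack 4*sqrt(3n)*eps."""
    slack = 4 * np.sqrt(3 * n) * eps
    threshold = min(min(ucb[a] for a in active), B) + slack
    return {a for a in active if lcb[a] < threshold}
```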
The following theorem shows the guarantee of our algorithm in the misspecified linear bandits. Theorem A.1. Algorithm 3 with β = p 2 log(KT 3) guarantees that Reg O min n2 log(KT) log(T/n) log(d ) T log(d ) log(KT) + ε n T + log(d ) O (min {nd log(T/n) + D max, D max log(T/n)}) . To prove Theorem A.1, recall the following quantities τ Om(a) Em(a) uτ, a Sm, (29) i=1 λ(a) m,i bµm(am,i), a Am, (30) i=1 λ(a) m,i bµF m(am,i), a Am. (31) We then define the following event and show that the event holds with high probability. Event 1. For all action a Am, m [T], | a, θ bµm,1(a)| p 1 2m , (32) | a, θ bµm,2(a)| p 1 cm(am,i), (33) |Em(a)| 2Dµa |Sm| + 16 log KT + 2, (34) where β = p 2 log KT 3. Lemma A.2. Algorithm 3 guarantees that Event 1 holds with probability at least 1 2 T 2 . Proof. Fix an action a Sm in epoch m [T]. According to standard Azuma s inequality, we know that with probability at least 1 δ, |µa bµm,1(a)| |µa bµm,2(a)| Contextual Linear Bandits with Delay as Payoff Algorithm 3: Phased Elimination via Volumetric Spanner for Linear Bandits with Delay-as-Loss with misspecification 1 Input: maximum possible delay D, action set A, β > 0, a misspecification level ε. 2 Initialization: optimal loss guess B = 1/D. 3 Initialization: active action set A1 = A. for m = 1, 2, . . . , do 4 Find Sm = {am,1, . . . , am,|Sm|}, a volumetric spanner of Am with |Sm| = 3n. 5 Pick each a Sm 2m times in a round-robin way. 6 Let Im contain all the rounds in this epoch. 7 For each a Sm, calculate the following quantities: bµ+ m(a) = 1 2m X τ Om(a) uτ + X τ Em(a) 1 , (20) bµ m(a) = 1 2m X τ Om(a) uτ, (21) bµ+ m,1(a) = bµ+ m(a) + β 2m/2 a 2, (22) bµ m,1(a) = bµ m(a) β 2m/2 a 2, (23) bµF m(a) = 1 cm(a) τ Cm(a) uτ, (24) bµ+ m,2(a) = bµF m(a) + β p cm(a) a 2, (25) bµ m,2(a) = bµF m(a) β p cm(a) a 2, (26) where cm(a) = |Cm(a)|, Cm(a) = {τ Im : τ + D Im, aτ = a}, Om(a) = {τ Im : τ + dτ Im, aτ = a}, and Em(a) = {τ Im : aτ = a} \ Om(a). for each a Am do 8 Decompose a as a = P|Sm| i=1 λ(a) m,iam,i with λ(a) m 2 1 and calculate i=1 λ(a) m,i bµ SGN(λ(a) m,i) m,2 (am,i), (27) LCBm(a) = max j {1,2}{LCBm,j(a)} where LCBm,j(a) = i=1 λ(a) m,i bµ SGN( λ(a) m,i) m,j (am,i), (28) 9 Set Am+1 = Am. for a Am do 10 if a Am, s.t. LCBm(a) min{UCBm(a ), B} + 4 3nε then Eliminate a from Am+1. 11 if Am+1 = then Set B 2B and go to Line 3. Contextual Linear Bandits with Delay as Payoff Taking union bound over all possible a A and all m [T], we know that with probability at least 1 δ, for all a Sm and all m [T], |µa bµm,1(a)| 2 log(2TK/δ) |µa bµm,2(a)| 2 log(2TK/δ) Then, given that the above equation holds, for a Am, due to the property of volumetric spanners, we have µa = a, θ + εa = P|Sm| i=1 λ(a) m,i am,i, θ + εa. Therefore, we can obtain that | a, θ bµm,1(a)| i=1 λ(a) m,i( am,i, θ µam,i) λ(a) m,i µam,i bµm(am,i) 2 log(2TK/δ) 2 log(2TK/δ) where the last inequality uses λ(a) m 1 p |Sm| λ(a) m 2 p |Sm|. A similar analysis proves Eq. (33). Eq. (34) holds with probability at least 1 1 T 2 according to Lemma 4.1 of (Schlisselberg et al., 2025). Picking δ = 1 T 2 finishes the proof. The next lemma shows that if B µ , then Algorithm 3 will not reach an empty active set. Lemma A.3. Suppose that Event 1 holds. If B µ , then a Am for all m. Proof. Since Event 1 holds, we have, we know that for all a Am, LCBm(a) a, θ + p |Sm|ε and UCBm(a) a, θ p |Sm|ε. 
If B µ , then we have a never eliminated since for any a Am LCBm(a ) a , θ + ε p |Sm| µ + ε + ε p |Sm| µ + 2ε p LCBm(a ) a , θ + ε p |Sm| a, θ + 2ε p |Sm| UCBm(a) + 4ε p Therefore, a never satisfy the elimination condition. The following lemma shows that the regret within epoch m can be well-controlled. Lemma A.4. Suppose that Event 1 holds. Algorithm 3 guarantees that if a A is not eliminated at the end of epoch m (meaning that a Am+1), then 2m a 2m 24 nε + 256nβ2 Proof. For notational convenience, define rad N m,a = β 2m a 2 and rad F m,a = β cm(a) a 2 for all a Sm. In addition, we also define rad N m,a and rad F m,a for a / Sm as follows: rad N m,a = i=1 |λ(a) m,i| rad N m,am,i, rad F m,a = i=1 |λ(a) m,i| rad F m,am,i. Contextual Linear Bandits with Delay as Payoff Since Event 1 holds, we know that for all a Am, LCBm(a) a, θ + p |Sm|ε, UCBm(a) a, θ p |Sm|ε. Moreover, as LCBm(a) = max{LCBm,1(a), LCBm,2(a)}, we know that for all a Am LCBm,1(a) + 2rad N m,a + 2ε p |Sm| bµm,1(a) + rad N m,a + 2ε p |Sm| a, θ , LCBm,2(a) + 2rad F m,a + 2ε p |Sm| bµm,2(a) + rad F m,a + 2ε p |Sm| a, θ , UCBm(a) 2rad F m,a 2ε p |Sm| = bµm,2(a) rad F m,a 2ε p |Sm| a, θ . If B µ , then a Am according to Lemma A.3. Moreover, if a is not eliminated in epoch m, we have LCB(a) min{UCBm(a ), B} + 4 p |Sm|ε, meaning that a, θ 2rad F m,a 2ε p bµm,2(a) rad F m,a LCBm(a) min{UCBm(a ), B} + 4 p UCBm(a ) + 4 p = bµm,2(a ) + rad F m,a + 4 p a , θ + 2rad F m,a + 6 p Since rad F m,a = P|Sm| i=1 |λ(a) m,i| rad F m,am,i with λ(a) m 2 1, we have that λ(a) m 1 p |Sm| max a Sm rad F m,a + 2ε = 4 3n max a Sm rad F m,a + 8 cm(a ) + 16 nε. If B µ , then we have a , θ + ε µ B LCBm(a) 4 p |Sm|ε a, θ 2rad F m,a 5 p where the second inequality is because a is not eliminated in epoch m. Therefore, we always have a 2rad F m,a + 6 p cm(a ) + 12 nε. In addition, we know that for all a Sm, 2m cm(a) + D |Sm| + 1 cm(a) + 2D Therefore, if 12 nε a 2 , then we have 2m a 2m 24 nε; otherwise, we have a 8 nβ mina Sm cm(a) + 12 nε 8 nβ mina Sm and we can obtain that min a Sm cm(a ) a 256dβ2 Combining the above two cases, we know that for all a Am, 2m a 2m 24 nε + min a Sm cm(a ) a + 2D a |Sm| 2m 24 nε + 256nβ2 Contextual Linear Bandits with Delay as Payoff In fact, the bound above can be obtained by only using LCBm,1. Next, we provide yet-another regret bound within epoch m, which utilizes LCBm,2. Lemma A.5. Algorithm 3 guarantees that under Event 1, if action a is not eliminated at the end of epoch m (meaning that a Am+1), then a, θ B + rad N m,a + i=1 |λ(a) m,i| 2Dµam,i 2m|Sm| + 16 log T + 2 Proof. For all a Sm, since ut [0, 1], we know that bµ+ m(a) = 1 τ Om(a) uτ + X bµm,a + |Em(a)| bµ m(a) = 1 bµm,a |Em(a)| Then, under Event 1, we know that for all a Am, i=1 λ(a) m,i am,i, θ i=1 λ(a) m,i(µam,i εam,i) (since µa = a, θ + εa) i=1 λ(a) m,i µam,i + p |Sm|ε (since λ(a) m 1 p i=1 λ(a) m,i bµm(am,i) + rad N m,a + 3 p |Sm|ε (since Event 1 holds) i=1 λ(a) m,i bµ sgn( λ(a) m,i) m (am,i) + rad N m,a + i=1 |λ(a) m,i| |Em(am,i)| |Sm|ε (using Eq. (35) and Eq. (36)) = LCBm,1(a) + rad N m,a + i=1 |λ(a) m,i| |Em(am,i)| LCBm,1(a) + rad N m,a + i=1 |λ(a) m,i| 2Dµam,i 2m|Sm| + 16 log KT + 2 |Sm|ε. (since Event 1 holds) Since LCBm,1(a) B + 4 p |Sm|ε (as a is not eliminated at the end of epoch m), we have a, θ B + rad N m,a + i=1 |λ(a) m,i| 2Dµam,i 2m|Sm| + 16 log T + 2 Lemma A.6. 
If Event 1 holds, Algorithm 3 guarantees that if a is not eliminated at the end of epoch m, then we also have 2m a 256nβ2 a + 4DB + 12 P|Sm| i=1 |λ(a) m,i| Dµam,i |Sm| + (128 log T + 16) n + 2m 64 nε. Contextual Linear Bandits with Delay as Payoff Proof. If a, θ 2B, we know that a = a a , θ 2B. Using Lemma A.4, we can obtain that 2m a 2m 24 nε + 256nβ2 2m 24 nε + 256nβ2 If a, θ 2B, we have B a,θ 2 . Using Lemma A.5, we know that a a, θ 2 rad N m,a | {z } Term (1) i=1 |λ(a) m,i| 2Dµam,i 2m|Sm| + 16 log T + 2 | {z } Term (2) If Term (1) Term (2), we have a 4rad N m,aε 4 p |Sm| max am Sm rad N m,am 8β n meaning that 2m a 64nβ2 a . Otherwise, we have i=1 |λ(a) m,i| 2Dµam,i 2m|Sm| + 16 log T + 2 meaning that 2m a 8 P|Sm| i=1 |λ(a) m,i| Dµam,i |Sm| + (128 log T + 16) n + 2m 64 nε. Combining both cases, we know that 2m a 256nβ2 a + 4DB + 12 P|Sm| i=1 |λ(a) m,i| Dµam,i |Sm| + (128 log T + 16) n + 2m 64 nε. Now we are ready to prove our main result Theorem A.1. Proof of Theorem A.1. We analyze the regret when Event 1 holds, which happens with probability at least 1 2 T 2 . When Event 1 does not hold, the expected regret is bounded by 2 We then bound the regret with a fixed choice of B. Combining Lemma A.4 and Lemma A.5, if action a is not eliminated at the end of epoch m, we have 2m a 256nβ2 a + 4DB + 12 P|Sm| i=1 |λ(a) m,i| Dµam,i |Sm| + (128 log T + 16) n + 2m 64 nε, 2m a 2m 24 nε + 256nβ2 Therefore, we have 2m a + nε + n log T ( 4DB + 12 P|Sm| i=1 |λ(a) m,i| Dµam,i n , D a Contextual Linear Bandits with Delay as Payoff Denote TB to be the number of rounds Algorithm 3 proceeds with B and define Reg B be the expected regret within TB rounds. Then, for any αm 0, the overall regret is then upper bounded as follows: log2(|TB|/3n X log2(|TB|/3n X a Sm 1{ a > αm} O nβ2 a + 2m nε + n log T ( 4DB + 12 P|Sm 1| i=1 |λ(a) m 1,i| d(am 1,i) n , 2D a (since a is not eliminated in epoch m 1 for all a Sm) a Sm 1{ a αm}2m a. Picking αm = βp n 2m , we can obtain that log2(|TB|/3n X n 2m + 2m nε + n log T ( 4DB + 12 P|Sm 1| i=1 |λ(a) m 1,i| d(am 1,i) n , 2D a O |TB| nε + βn p |TB| + n log T log(T/n) log2(|TB|/3n) X ( 4DB + 12 P|Sm 1| i=1 |λ(a) m 1,i| d(am 1,i) n , 2D a On the other hand, picking αm = 0, we have log2(|TB|/3n X min + 2m nε + n log T ( 4DB + 12 P|Sm 1| i=1 |λ(a) m 1,i| d(am 1,i) n , 2D a O n2β2 log(T/n) min + ε n|TB| + n log T log(T/n) log2(|TB|/3n X ( 4DB + 12 P|Sm 1| i=1 |λ(a) m 1,i| d(am 1,i) n , 2D a Using the fact that β = p 2 log(KT 3) and combining both bounds, we can obtain that Reg B O min n2 log(KT) log(T/n) |TB| log(KT) + ε n|TB| log2(|TB|/3n X ( 4DB + 12 P|Sm 1| i=1 |λ(a) m 1,i| d(am 1,i) n , 2D a For notational convenience, let RB = O min n n2 log(KT ) log(T/n) |TB| log(KT) o + ε n|TB| . To further analyze this bound, we first upper bound min 4DB+12 P|Sm 1| i=1 |λ(a) m 1,i| d(am 1,i) n , 2D a n and obtain that Reg B RB + O (D max log(T/n)) . (38) Contextual Linear Bandits with Delay as Payoff On the other hand, we can also upper bound min 4DB+12 P|Sm 1| i=1 |λ(a) m 1,i| d(am 1,i) n , 2D a 12 P|Sm 1| i=1 |λ(a) m 1,i| d(am 1,i) n and obtain that log2(|TB|/3n X 4DB + 12 P|Sm 1| i=1 |λ(a) m 1,i| d(am 1,i) n Let Lm Alg = P a Sm 2mµa be the total expected loss within epoch m and Lm = |Sm| 2m µ be the total expected loss for the optimal action. Define Regm = Lm Alg Lm . Direct calculation shows that P|Sm 1| i=1 |λ(a) m 1,i| d(am 1,i) 3D 2m 1 2m 1 |Sm 1| X i=1 µam 1,i (since |λ(a) m 1,i| 1 and |Sm| = 3n) = 3D 2m 1 Lm 1 Alg . 
Using the fact that Reg B = P log(|TB|/3n) m=1 Regm, we know that log(|TB|/3n) X m=1 (Lm Alg Lm ) log(|TB|/3n) X m=1 Regm + 2ε |TB| log(|TB|/3n) X m= log2(72D) +1 36D 2m 1 Lm 1 Alg + log2(72D) X m=1 2m max + 12DB log(T/n) (2ε |TB| is subsumed in RB) log(|TB|/3n) X m= log2(72D) +1 36D 2m 1 Lm 1 Alg Lm 1 log(|TB|/3n) X m= log2(72D) +1 36D 2m 1 Lm 1 + log2(72D) X m=1 2m max + 12DB log(T/n) log(|TB|/3n) X m= log2(72D) +1 Lm 1 Alg Lm 1 + 36n Dµ log(T/(216n D)) + 144D max + 12DB log(T/n) log(|TB|/3n) X m= log2(72D) +1 Lm 1 Alg Lm 1 + 36nd log(T/(216n D)) + 144D max + 12DB log(T/n). Rearranging the terms, we can obtain that Reg B RB + 72nd log(T/(216n D)) + 288D max + 12DB log(T/n). (39) Combining Eq. (38) and Eq. (39), we know that Reg B O min n2 log(KT) log(T/n) |TB| log(KT) + ε n|TB| + O (min {nd log(T/n D) + D max + DB log(T/n), D max log(T/n)}) . (40) Contextual Linear Bandits with Delay as Payoff Finally, according to Lemma A.3, Algorithm 3 fails at most log2(Dµ )) = log2(d )) times. Summing up the regret over all rounds, we know that the overall regret is bounded as follows: log2(d )) X r=0 Reg2r/D O min n2 log(KT) log(T/n) log(d ) T log(d ) log(KT) + ε n T + log(d ) O (min {nd log(T/n) + D max, D max log(T/n)}) . which finishes the proof. Contextual Linear Bandits with Delay as Payoff B. Omitted Details for Delay-as-Reward In this section, we show our results for the delay-as-reward setting. The difference compared with the delay-as-loss setting is that now, µa = a, θ + εa [0, 1] represents the expected reward of picking action a, where |εa| ε for all a A. The learner s goal is to minimize the pseudo regret defined as follows: Reg T max a A a, θ E Define a = a a, θ as the suboptimality gap of action a, where a argmaxa A a, θ , and µ maxa A µa as the reward of the optimal action. Again, note that due to the misspecification, µ may not necessarily be µa . Define min = mina A, a>0 a to be the minimum non-zero sub-optimality gap. The delay at round t is still defined as dt = D ut, and d = D µ is the expected delay of the optimal action. Contextual Linear Bandits with Delay as Payoff B.1. Algorithm for Linear Bandits with Delay-as-Reward We list our algorithm for the reward case in Algorithm 4 for completeness. The algorithm shares the same idea as Algorithm 3. Algorithm 4: Phased Elimination for Linear Bandits with Delay-as-Reward 1 Input: maximum possible delay D, action set A, β > 0, a misspecification level ε. 2 Initialize optimal reward guess B = 1. 3 Initialize active action set A1 = A. 4 for m = 1, 2, . . . , do 5 Find Sm = {am,1, . . . , am,|Sm|} to be the volumetric spanner of Am, where |Sm| = 3n. 6 Pick each a Sm 2m times in a round-robin way. 7 Let Im contain all the rounds in this epoch. 8 For all a Sm, calculate the following quantities bµ+ m(a) = 1 τ Om(a) uτ + X τ Em(a) 1 , (42) bµ m(a) = 1 τ Om(a) uτ, (43) bµ+ m,1(a) = bµ+ m(a) + β 2m/2 a 2, (44) bµ m,1(a) = bµ m(a) β 2m/2 a 2, (45) bµF m(a) = 1 cm(a) τ Cm(a) uτ, (46) bµ+ m,2(a) = bµF m(a) + β p cm(a) a 2, (47) bµ m,2(a) = bµF m(a) β p cm(a) a 2, (48) where cm(a) = |Cm(a)|, Cm(a) = {τ Im : τ + D Im, aτ = a}, Om(a) = {τ Im : τ + dτ Im, aτ = a}, and Em(a) = {τ Im : aτ = a} \ Om(a). 9 for each a Am do 10 Decompose a as a = P|Sm| i=1 λ(a) m,iam,i with λ(a) m 2 1 and calculate i=1 λ(a) m,i bµ SGN(λ(a) m,i) m,2 (am,i), (49) UCBm(a) = max j {1,2}{UCBm,j(a)} where UCBm,j(a) = i=1 λ(a) m,i bµ SGN(λ(a) m,i) m,j (am,i), (50) 11 Set Am+1 = Am. 12 for a1 Am do 13 if a2 Am, such that max{LCBm(a2), B} UCBm(a1) + 4 14 Eliminate a1 from Am+1. 
15 if Am+1 is empty then Set B B/2 and go to Line 3. Contextual Linear Bandits with Delay as Payoff B.2. Regret Guarantees In this section, we show the theoretical guarantees for our algorithm in the delay-as-reward setting. Theorem B.1. Algorithm 4 with β = p 2 log(KT 3) guarantees that Reg O min n2 log(KT) log(T/n) log(1/µ ) T log(KT) log(1/µ ) + ε n T log2(1/µ ) X i=1 d(a(2 j) m 1,i), D max log(1/µ ) log(T/n) where {a B m,i}3n i=1 represents the set of volumetric spanner at epoch m with the optimal reward guess B. Similar to the analysis in Appendix A, our analysis is based on the condition that Event 1 holds, which happens with probability 1 2 T 2 according to Lemma A.2. The following lemma is a counterpart of Lemma A.3, providing an upper bound of the number of guesses on the optimal reward B. Lemma B.2. Suppose that Event 1 holds. If B µ , then a Am for all m. Proof. Since Event 1 holds, we have, we know that for all a Am, UCBm(a)+ p |Sm|ε a, θ , LCBm(a)+ p |Sm|ε a, θ If B µ , then we have a never eliminated since for any a Am, UCBm(a ) + 2ε p |Sm| max a A{ a, θ + εa} µ B, UCBm(a ) + 4ε p |Sm| µ + 2ε p |Sm| a, θ + ε p |Sm| LCBm(a). Therefore, a never satisfy the elimination condition. The following lemma is a counterpart of Lemma A.4. Lemma B.3. Suppose that Event 1 holds. Algorithm 4 guarantees that if a A is not eliminated at the end of epoch m (meaning that a Am+1), then 2m a 2m 24 nε + 256nβ2 Proof. Since Event 1 holds, we know that for all a Am, LCBm(a) µa + p |Sm|ε, UCBm(a) µa p |Sm|ε. Moreover, as UCBm(a) = min{UCBm,1(a), UCBm,2(a)}, we know that for all a Am UCBm,1(a) 2rad N m,a 2ε p |Sm| = bµm,1(a) rad N m,a 2ε p |Sm| a, θ , UCBm,2(a) 2rad F m,a 2ε p |Sm| = bµm,2(a) rad F m,a 2ε p |Sm| a, θ , LCBm(a) + 2rad F m,a + 2ε p |Sm| = bµm,2(a) + rad F m,a + 2ε p |Sm| a, θ . If B µ , then a Am according to Lemma B.2. Moreover, if a is not eliminated in epoch m, we have UCBm(a) + 4 p |Sm|ε max{LCBm(a ), B}, meaning that a, θ + 2rad F m,a + 2ε p bµm,2(a) + rad F m,a UCBm(a) max{LCBm(a ), B} 4 p LCBm(a ) 4 p = bµm,2(a ) rad F m,a 4 p Contextual Linear Bandits with Delay as Payoff a , θ 2rad F m,a 6 p Since rad F m,a = P|Sm| i=1 |λ(a) m,i| rad F m,am,i with λ(a) m 2 1, we have that λ(a) m 1 p |Sm| max a Sm rad F m,a + 2ε = 4 3n max a Sm rad F m,a + 8 cm(a ) + 16 nε. If B µ , then we have µ B UCBm(a) + 4 p |Sm|ε µa + 2rad F m,a + 6 p where the second inequality is because a is not eliminated in epoch m. Therefore, we always have a 2rad F m,a + 6 p cm(a ) + 12 nε. In addition, we know that for all a Sm, 2m = |Sm| cm(a) + D |Sm| + 1 cm(a) + 2D Therefore, if 12 nε a 2 , then we have 2m a 2m 24 nε; otherwise, we have a 8 nβ mina Sm cm(a) + 12 nε 8 nβ mina Sm and we can obtain that min a Sm cm(a ) a 256dβ2 Combining the above two cases, we know that for all a Am, 2m a 2m 24 nε + min a Sm cm(a ) a + 2D a |Sm| 2m 24 nε + 256nβ2 The following lemma is a counterpart of Lemma A.5. Lemma B.4. Algorithm 4 guarantees that under Event 1, if an action a is eliminated at the end of epoch m (meaning that a Am), then B a, θ + rad N m,a + i=1 |λ(a) m,i| 2d(am,i) 2m|Sm| + 16 log T + 2 where d(a) = Dµa. Proof. Under Event 1, we know that for all a Am, i=1 λ(a) m,i am,i, θ Contextual Linear Bandits with Delay as Payoff i=1 λ(a) m,i(µam,i εam,i) (since µa = a, θ + εa) i=1 λ(a) m,i µam,i p |Sm|ε (since λ(a) m 1 p i=1 λ(a) m,i bµm(am,i) rad N m,a 3 p |Sm|ε (since Event 1 holds) i=1 λ(a) m,i bµ sgn(λ(a) m,i) m (am,i) rad N m,a i=1 |λ(a) m,i| |Em(am,i)| |Sm|ε (using Eq. (35) and Eq. 
The following lemma is a counterpart of Lemma A.6.

Lemma B.5. If $B \ge \frac{\mu^\star}{2}$ and Event 1 holds, Algorithm 4 guarantees that if $a$ is not eliminated at the end of epoch $m$, then we also have
$$
2^m\Delta_a \le \frac{256n\beta^2}{\Delta_a} + \frac{12\sum_{i=1}^{|S_m|}|\lambda^{(a)}_{m,i}|\,d(a_{m,i})}{|S_m|} + (128\log T + 16)\sqrt{n} + 2^m\cdot 64\sqrt{n}\,\varepsilon,
$$
where $d(a) = D\mu_a$.

Proof. If $\langle a,\theta\rangle \ge \frac{B}{2}$, we know that $\Delta_a = \langle a^\star - a,\theta\rangle \le 3\langle a,\theta\rangle$. Using Lemma B.3 and writing $D\langle a,\theta\rangle$ in terms of the spanner delays through $\langle a,\theta\rangle = \sum_{i=1}^{|S_m|}\lambda^{(a)}_{m,i}\langle a_{m,i},\theta\rangle$, we can obtain that
$$
2^m\Delta_a \le 2^m\cdot 24\sqrt{n}\,\varepsilon + \frac{256n\beta^2}{\Delta_a} + \frac{2D\Delta_a}{|S_m|} \le 2^m\cdot 24\sqrt{n}\,\varepsilon + \frac{256n\beta^2}{\Delta_a} + \frac{12\sum_{i=1}^{|S_m|}|\lambda^{(a)}_{m,i}|\,d(a_{m,i})}{|S_m|}.
$$
Otherwise, if $\langle a,\theta\rangle \le \frac{B}{2}$, we have $3(B - \langle a,\theta\rangle) \ge \frac{3B}{2} \ge \Delta_a$. Using Lemma B.4, we know that
$$
\Delta_a \le 3\big(B - \langle a,\theta\rangle\big) \le \underbrace{3\,\mathrm{rad}^N_{m,a}}_{\text{Term (1)}} + \underbrace{3\sum_{i=1}^{|S_m|}|\lambda^{(a)}_{m,i}|\left(\frac{2d(a_{m,i})}{2^m|S_m|} + \frac{16\log T + 2}{2^m}\right)}_{\text{Term (2)}} + 21\sqrt{|S_m|}\,\varepsilon.
$$
If Term (1) $\ge$ Term (2), we have
$$
\Delta_a \le 6\,\mathrm{rad}^N_{m,a} \le 6\sqrt{|S_m|}\max_{a_m\in S_m}\mathrm{rad}^N_{m,a_m} \le \frac{12\beta\sqrt{n}}{2^{m/2}},
$$
meaning that $2^m\Delta_a \le \frac{144n\beta^2}{\Delta_a}$. Otherwise, we have
$$
\Delta_a \le 6\sum_{i=1}^{|S_m|}|\lambda^{(a)}_{m,i}|\left(\frac{2d(a_{m,i})}{2^m|S_m|} + \frac{16\log T + 2}{2^m}\right),
$$
meaning that
$$
2^m\Delta_a \le \frac{12\sum_{i=1}^{|S_m|}|\lambda^{(a)}_{m,i}|\,d(a_{m,i})}{|S_m|} + (96\log T + 12)\sqrt{n} + 2^m\cdot 96\sqrt{n}\,\varepsilon.
$$
Combining both cases, we know that
$$
2^m\Delta_a \le \frac{256n\beta^2}{\Delta_a} + \frac{12\sum_{i=1}^{|S_m|}|\lambda^{(a)}_{m,i}|\,d(a_{m,i})}{|S_m|} + (96\log T + 12)\sqrt{n} + 2^m\cdot 96\sqrt{n}\,\varepsilon.
$$

Now we are ready to prove our main result Theorem B.1.

Proof of Theorem B.1. Combining Lemma B.3 and Lemma B.5 and following the exact same process as in obtaining Eq. (37) in Theorem A.1, we can obtain that for a fixed value of $B$, Algorithm 4 guarantees that
$$
\begin{aligned}
\mathrm{Reg}_B &\le O\left(\min\left\{\frac{n^2\log(KT)\log(T/n)}{\Delta_{\min}},\ n\sqrt{|\mathcal{T}_B|\log(KT)}\right\} + \varepsilon\sqrt{n}\,|\mathcal{T}_B|\right) + O\left(\sum_{m=1}^{\log_2(|\mathcal{T}_B|/3n)}\sum_{i=1}^{|S_{m-1}|}|\lambda^{(a)}_{m-1,i}|\,d(a_{m-1,i})\right) \\
&\le O\left(\min\left\{\frac{n^2\log(KT)\log(T/n)}{\Delta_{\min}},\ n\sqrt{|\mathcal{T}_B|\log(KT)}\right\} + \varepsilon\sqrt{n}\,|\mathcal{T}_B|\right) + O\left(\min\left\{\sum_{m=1}^{\log_2(|\mathcal{T}_B|/3n)}\sum_{i=1}^{3n}d(a_{m-1,i}),\ D\Delta_{\max}\log(T/n)\right\}\right).
\end{aligned}
$$
According to Lemma B.2, there are at most $\log_2(1/\mu^\star)$ different values of $B$. With an abuse of notation, we define $S^{(B)}_m = \{a^{(B)}_{m,i}\}_{i\in[3n]}$ to be the volumetric spanner at epoch $m$ with the reward guess $B$. Taking a summation over all these values, we can obtain that
$$
\mathrm{Reg} \le O\left(\min\left\{\frac{n^2\log(KT)\log(T/n)\log(1/\mu^\star)}{\Delta_{\min}},\ n\sqrt{T\log(KT)\log(1/\mu^\star)}\right\} + \varepsilon\sqrt{n}\,T\right) + O\left(\min\left\{\sum_{j=0}^{\log_2(1/\mu^\star)}\sum_{m}\sum_{i=1}^{3n}d\big(a^{(2^{-j})}_{m-1,i}\big),\ D\Delta_{\max}\log(1/\mu^\star)\log(T/n)\right\}\right),
$$
completing the proof.

While we can further apply an analysis similar to the one in Theorem A.1 to bound the term $\sum_{j=0}^{\log_2(1/\mu^\star)}\sum_m\sum_{i=1}^{3n}d\big(a^{(2^{-j})}_{m-1,i}\big)$ and obtain a bound with respect to $d^\star$, since $d^\star = D\mu^\star \ge D\Delta_{\max} - 2D\varepsilon$, this $d^\star$-dependent bound does not provide a significantly better regret guarantee in the worst case. This difference between loss and reward is also pointed out in (Schlisselberg et al., 2025) in the MAB setting. We keep this term in the upper bound since this quantity can still potentially be smaller than $D\Delta_{\max}\log(1/\mu^\star)\log(T/n)$.

C. Omitted Details in Section 4

In this section, we provide the omitted details in Section 4. We start with the following lemma, which is a standard application of the Azuma-Hoeffding inequality.

Lemma C.1 (Proposition 2 in (Hanna et al., 2023)). For each epoch $m$, Algorithm 2 guarantees that with probability at least $1 - \frac{\delta}{T}$, the following holds:
$$
\left|\big\langle g(\theta),\theta'\big\rangle - \big\langle g^{(m)}(\theta),\theta'\big\rangle\right| \le \sqrt{\frac{2\log(2T|\Theta'|/\delta)}{2^{m-1}}}, \qquad \forall\,\theta,\theta'\in\Theta'.
$$
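To illustrate the quantity that Lemma C.1 controls, here is a small numpy sketch (our own toy setup, not the paper's configuration) that forms the empirical counterpart $g^{(m)}(\theta)$ of $g(\theta) = \mathbb{E}[\operatorname{argmin}_{a\in\mathcal{A}_t}\langle a,\theta\rangle]$ by averaging the greedy action over i.i.d. sampled action sets; Azuma-Hoeffding then bounds the deviation of $\langle g^{(m)}(\theta),\theta'\rangle$ from $\langle g(\theta),\theta'\rangle$ at the rate stated above.

```python
import numpy as np

# Toy illustration of the estimate g^{(m)}(theta) used by the contextual-to-non-contextual
# reduction: average the greedy action argmin_{a in A_t} <a, theta> over i.i.d. action sets.
# The sampler, dimensions, and sample count below are illustrative assumptions only.
rng = np.random.default_rng(0)
n, K, num_samples = 5, 20, 4096         # dimension, actions per set, number of sampled sets (~ 2^{m-1})

def sample_action_set():
    """One i.i.d. action set A_t with unit-norm actions, a stand-in for the unknown context distribution."""
    A = rng.normal(size=(K, n))
    return A / np.linalg.norm(A, axis=1, keepdims=True)

def g_hat(theta, num_samples):
    """Empirical version of g(theta): the average loss-minimizing action over sampled sets."""
    total = np.zeros(n)
    for _ in range(num_samples):
        A = sample_action_set()
        total += A[np.argmin(A @ theta)]
    return total / num_samples

theta = rng.normal(size=n); theta /= np.linalg.norm(theta)
theta_prime = rng.normal(size=n); theta_prime /= np.linalg.norm(theta_prime)
# <g_hat(theta), theta_prime> deviates from <g(theta), theta_prime> by O(sqrt(log(.)/num_samples))
print(g_hat(theta, num_samples) @ theta_prime)
```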
Next, we provide the proof for Theorem 4.1.

Proof of Theorem 4.1. Define $\theta_0 = \operatorname{argmin}_{\theta'\in\Theta'}\|\theta - \theta'\|_2$. Following the analysis of Hanna et al. (2023), we decompose the regret $\mathrm{Reg}_m$ within epoch $m$ as follows:
$$
\begin{aligned}
\mathrm{Reg}_m &= \sum_{t=2^{m-1}+1}^{2^m}\mathbb{E}\left[\Big\langle \operatorname*{argmin}_{a\in\mathcal{A}_t}\langle a,\theta_t\rangle,\ \theta\Big\rangle - \min_{a'\in\mathcal{A}_t}\langle a',\theta\rangle\right] \\
&\le \sum_{t=2^{m-1}+1}^{2^m}\mathbb{E}\left[\Big\langle \operatorname*{argmin}_{a\in\mathcal{A}_t}\langle a,\theta_t\rangle,\ \theta_0\Big\rangle - \min_{a'\in\mathcal{A}_t}\langle a',\theta_0\rangle\right] \\
&= \sum_{t=2^{m-1}+1}^{2^m}\big\langle g(\theta_t) - g(\theta_0),\ \theta_0\big\rangle \\
&= \sum_{t=2^{m-1}+1}^{2^m}\Big(\underbrace{\big\langle g(\theta_t) - g^{(m)}(\theta_t),\ \theta_0\big\rangle}_{\textsc{Err-Term}(1)} + \underbrace{\big\langle g^{(m)}(\theta_t) - g^{(m)}(\theta_0),\ \theta_0\big\rangle}_{\textsc{Reg-Nctx}} + \underbrace{\big\langle g^{(m)}(\theta_0) - g(\theta_0),\ \theta_0\big\rangle}_{\textsc{Err-Term}(2)}\Big),
\end{aligned}
$$
where the second equality is because $\mathbb{E}\big[\min_{a\in\mathcal{A}_t}\langle a,\theta_0\rangle\big] = \mathbb{E}\big[\big\langle\operatorname*{argmin}_{a\in\mathcal{A}_t}\langle a,\theta_0\rangle,\ \theta_0\big\rangle\big] = \langle g(\theta_0),\theta_0\rangle$.

For $\textsc{Err-Term}(1)$ and $\textsc{Err-Term}(2)$, we apply Lemma C.1 to bound both terms by $O\big(\sqrt{2^m\log(T|\Theta'|)}\big)$. As for $\textsc{Reg-Nctx}$, this is in fact the regret of a misspecified non-contextual linear bandit with action set $\mathcal{X}_m$ and misspecification level $\max_{\theta'\in\Theta'}\big|\big\langle g^{(m)}(\theta') - g(\theta'),\ \theta\big\rangle\big|$, since $\mathbb{E}[u_t] = \langle g(\theta_t),\theta\rangle$ for all $t$. Applying Lemma C.1 again, we know that the misspecification is of order $\varepsilon_m = O\big(\sqrt{\log(T|\Theta'|)/2^m}\big)$ with probability at least $1 - \frac{1}{T^2}$. Then, applying the regret guarantee of Algorithm 3 proven in Theorem A.1, we know that
$$
\textsc{Reg-Nctx} \le O\Big(\sqrt{n\,2^m\log(T|\Theta'|)}\Big) + O\Big(\min\{V_{m,1},V_{m,2}\} + \log(d^\star)\cdot\min\{W_{m,1},W_{m,2}\}\Big),
$$
where
$$
V_{m,1} = \frac{n^2\log(T|\Theta'|)\log(T/n)\log(d^\star)}{\Delta^{\text{n-ctx}}_{\min}}, \quad V_{m,2} = n\sqrt{2^m\log(d^\star)\log(T|\Theta'|)}, \quad W_{m,1} = nd^\star\log(T/n) + D\Delta^{\text{n-ctx}}_{\max}, \quad W_{m,2} = D\Delta^{\text{n-ctx}}_{\max}\log(T/n).
$$
Taking a summation over all $m\in[\lceil\log_2(T)\rceil]$ epochs and using the fact that $|\Theta'| \le O(T^n)$ finishes the proof.

D. Omitted Details in Section 5

For completeness, we include the pseudocode for the benchmark used in our experiments, that is, LinUCB using only the currently observed feedback; see Algorithm 5.

Algorithm 5: LinUCB with Delayed Feedback
Input: action set $\mathcal{A}$, a parameter $\lambda > 0$.
Initialize: $\hat\theta_1$ arbitrarily, $\beta_t = 2\log T + n\log\big(1 + \frac{t}{n\lambda}\big)$ for all $t\in[T]$, $H_1 = \lambda I$.
for $t = 1, 2, \ldots, T$ do
  Play
$$
a_t = \begin{cases} \operatorname{argmin}_{a\in\mathcal{A}}\ \langle a,\hat\theta_t\rangle - \beta_t\|a\|_{H_t^{-1}}, & \text{in the loss case},\\ \operatorname{argmax}_{a\in\mathcal{A}}\ \langle a,\hat\theta_t\rangle + \beta_t\|a\|_{H_t^{-1}}, & \text{in the reward case}.\end{cases}
$$
  Observe the payoff $u_\tau$ for all $\tau$ such that $\tau + d_\tau \in (t-1, t]$.
  Update $H_{t+1} = H_t + \sum_{\tau:\,\tau+d_\tau\in(t-1,t]}a_\tau a_\tau^\top$ and $\hat\theta_{t+1} = H_{t+1}^{-1}\sum_{\tau:\,\tau+d_\tau\le t}a_\tau u_\tau$.
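Below is a minimal numpy sketch of this baseline in the loss case, assuming a toy environment in which the delay of each pull is proportional to its realized loss; the action set, noise level, and horizon are illustrative choices of ours and not the paper's experimental settings, while the exploration bonus follows Algorithm 5 as stated above.

```python
import numpy as np

rng = np.random.default_rng(1)
T, n, lam, D = 2000, 5, 1.0, 50                       # horizon, dimension, regularizer, max delay
A = rng.normal(size=(20, n))
A /= np.linalg.norm(A, axis=1, keepdims=True)         # fixed action set with unit-norm actions
theta = rng.uniform(0.2, 0.8, size=n)
theta /= np.linalg.norm(theta)

H = lam * np.eye(n)                                   # H_1 = lambda * I
b = np.zeros(n)                                       # running sum of a_tau * u_tau over observed payoffs
pending = []                                          # (arrival_round, action, payoff) not yet observed

for t in range(1, T + 1):
    theta_hat = np.linalg.solve(H, b)                 # least-squares estimate from observed feedback only
    beta = 2 * np.log(T) + n * np.log(1 + t / (n * lam))
    H_inv = np.linalg.inv(H)
    width = np.sqrt(np.einsum('ij,jk,ik->i', A, H_inv, A))   # ||a||_{H_t^{-1}} for every action
    a = A[np.argmin(A @ theta_hat - beta * width)]    # optimistic action for losses
    loss = float(np.clip(a @ theta + 0.05 * rng.normal(), 0, 1))
    pending.append((t + int(np.ceil(D * loss)), a, loss))    # delay-as-loss: delay proportional to the loss
    arrived = [p for p in pending if p[0] <= t]       # feedback whose delay has elapsed by round t
    pending = [p for p in pending if p[0] > t]
    for _, a_tau, u_tau in arrived:                   # incorporate newly observed payoffs
        H += np.outer(a_tau, a_tau)
        b += u_tau * a_tau
```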