# biased_dueling_bandits_with_stochastic_delayed_feedback__aa1a084e.pdf

Published in Transactions on Machine Learning Research (08/2024)

Biased Dueling Bandits with Stochastic Delayed Feedback

Bongsoo Yi bongsoo@unc.edu Department of Statistics and Operations Research University of North Carolina at Chapel Hill

Yue Kang yuekang@ucdavis.edu Department of Statistics University of California, Davis

Yao Li yaoli@unc.edu Department of Statistics and Operations Research University of North Carolina at Chapel Hill

Reviewed on Open Review: https: // openreview. net/ forum? id= Hw AZDVxk LX

The dueling bandit problem, an essential variation of the traditional multi-armed bandit problem, has become significantly prominent recently due to its broad applications in online advertising, recommendation systems, information retrieval, and more. However, in many real-world applications, the feedback for actions is often subject to unavoidable delays and is not immediately available to the agent. This partially observable issue poses a significant challenge to existing dueling bandit literature, as it significantly affects how quickly and accurately the agent can update their policy on the fly. In this paper, we introduce and examine the biased dueling bandit problem with stochastic delayed feedback, revealing that this new practical problem will delve into a more realistic and intriguing scenario involving a preference bias between the selections. We present two algorithms designed to handle situations involving delay. Our first algorithm, requiring complete delay distribution information, achieves the optimal regret bound for the dueling bandit problem when there is no delay. The second algorithm is tailored for situations where the distribution is unknown, but only the expected value of delay is available. We provide a comprehensive regret analysis for the two proposed algorithms and then evaluate their empirical performance on both synthetic and real datasets.

1 Introduction

In recent years, the dueling bandit problem has gained significant attention, finding wide applications in practical domains such as online advertising and recommendation systems (Yue & Joachims, 2009; Ailon et al., 2014; Zoghi et al., 2014a; 2015b; Komiyama et al., 2015; Agarwal et al., 2021). The K-armed dueling bandit problem (Yue et al., 2012) is a variation of the classical K-armed bandit problem (Auer et al., 2002), providing only relative comparison instead of absolute feedback. In the dueling bandit problem, the learner is presented with K arms. At each trial, the learner selects a pair of arms and receives a stochastic feedback indicating which of the two chosen options is preferred, based on an underlying stochastic pairwise preference model. This setup enables users to express their preference for one item over another rather than assigning numerical scores. Recognizing the challenges of consistently offering accurate absolute reviews, leveraging preference feedback with dueling bandits has emerged as a practical approach, gaining widespread popularity in online learning (Radlinski et al., 2008).

While existing studies in dueling bandit problem (Zoghi et al., 2014a; Komiyama et al., 2015) assume immediate feedback observation, a major practical challenge in real-world applications is the issue of delayed

Published in Transactions on Machine Learning Research (08/2024)

(a) Biased dueling bandit with delayed feedback

(b) Traditonal dueling bandit

Figure 1: Case (a) illustrates the e-commerce example of the biased dueling bandit with delayed feedback: a delay occurs before the user can evaluate and potentially select Product B, and any data collected during this delay suggests A > B. Case (b) showcases the traditional dueling bandit with immediate feedback.

feedback. In practice, feedback on actions is often not observable until after a random period. For example, in recommendations within e-commerce (Vernade et al., 2017), it takes time for users to decide whether to buy a product. Moreover, if a user has chosen not to purchase, the system remains unaware of this decision. It cannot distinguish whether the decision is not to buy or if the user has not made a decision yet, a dilemma often referred to as the delayed conversion problem. This complexity introduces significant challenges in managing delayed feedback. A similar situation occurs with web advertisements (Chapelle, 2014; Yoshikawa & Imai, 2018). Users take time to consider whether to click on an ad after viewing it. However, the system must continue to display ads to other users before receiving this feedback. In particular, Chapelle (2014) analyzed data from the online advertising company Criteo and confirmed that only 35% of feedback was observed within an hour, while 50% was observed after 24 hours, and notably, 13% emerged two weeks later. Another instance can be seen in clinical trials (Chow & Chang, 2006), where the impact of medical treatment on a patient s health frequently encounters delays. Due to these real-world challenges, the past few years have witnessed extensive research on various types of bandit problems with delayed feedback (Vernade et al., 2017; Grover et al., 2018). However, to the best of our knowledge, all the existing literature merely focuses on the bandit problems with absolute numerical rewards, and how to handle the delayed feedback under the practical dueling bandit problem receiving relative preferences between pairs still remains unexplored.

In this work, we introduce a new problem of dueling bandits in the presence of delayed feedback, characterized by a stochastic delay between the selection of action pairs and the receipt of corresponding preference feedback. Different from other types of bandits with delayed feedback, we further delve into a more practical and challenging problem called biased dueling bandits under delayed feedback. Specifically, we keep the same formulation as other bandit literature with binary rewards (Vernade et al., 2017; 2020): the player observes a zero value when the reward for an action is inaccessible due to delay. However, unlike other bandits with absolute numerical rewards where pulling one arm does not provide direct information about the others, the dueling bandit outputs binary responses revealing relative preference between pairs of arms. Consequently, the zero rewards under dueling bandits indicate a biased preference for one arm, particularly when rewards are delayed, making the delay effect much more malicious and challenging to handle. This treatment naturally gives an advantage to the second arm in a pair. For example, if arm i is presented first and arm j second, and the reward is delayed, the player observes a zero value while making a decision. This results in arm j being favored, while arm i suffers from the delay-induced bias. We will further elaborate on this issue in Section 2. In practice, this bias in our problem setting is very common and can be characterized as the prior knowledge or location effect indicating that one arm has an intrinsic advantage over the other. For instance, in the e-commerce recommendation shown in Figure 1 (a), a user is browsing product A, and the platform suggests product B on the sidebar. Before the user makes the comparison after some delay, we can only observe the user stay at product A and hence assume the user likes A over B. In other words, due to the common delay before the user can compare and perhaps select product B, any data collected during that delay may inaccurately suggest a preference for product A. This comparative bias exists in many real-world applications, and it highlights a unique and practical challenge in implementing dueling bandit algorithms with delayed feedback, compared to other bandit algorithms with delayed feedback where each

Published in Transactions on Machine Learning Research (08/2024)

arm is evaluated based on independent selections and feedback, even if the feedback is delayed. Therefore, given the existing literature on bandits with delayed feedback, our work is highly novel and intriguing since we successfully address both the usual challenges of delayed feedback and the additional practical concerns of bias simultaneously.

Given the broad applicability of the dueling bandit problem and the significance of handling delayed feedback in real-world applications, our paper explores the dueling bandit with delayed feedback and presents two dueling bandit algorithms designed to handle stochastic delayed feedback. We note that the proposed algorithm can also be applied in general cases where there is no delay. Our main contributions can be summarized as follows:

We describe and formulate a novel biased dueling bandit framework that incorporates stochastic delayed feedback.

We present RUCB-Delay (Relative Upper Confidence Bound with delayed feedback), which deals with delayed feedback by employing the delay distribution information. We establish the regret bound for the algorithm and demonstrate that it matches the regret lower bound of the dueling bandit problem when all delays are zero.

We introduce and analyze another algorithm, MRR-DB-Delay (Multi Round-Robin Dueling Bandit with delayed feedback), suitable for situations where the delay distribution is unknown. The algorithm requires only the expected value of the delay, making it versatile and applicable in various cases.

We conduct an empirical evaluation of the performance of RUCB-Delay and MRR-DB-Delay using six synthetic and real-world datasets.

1.1 Related Work

Originating from the practical benefits of considering relative feedback over absolute feedback, the dueling bandit problem, initially introduced by Yue et al. (2009), has been widely investigated across various settings. Yue et al. (2009) began with strong assumptions, relying on strong stochastic transitivity and stochastic triangle inequality, which may frequently not align with real-world cases. Yue & Joachims (2011) suggested a setting on a relaxed form of strong stochastic transitivity, although this remained a fairly restrictive condition. In response, Urvoy et al. (2013) introduced the Condorcet winner setting, where we assume the existence of a unique best arm. Subsequent efforts have been made to generalize the concept of the Condorcet winner, including alternatives like Borda winner (Jamieson et al., 2015; Saha et al., 2021), Copeland winner (Zoghi et al., 2015a; Komiyama et al., 2016), and Von-Neumann winner (Dudík et al., 2015; Balsubramani et al., 2016). Nevertheless, our focus in this study remains on the Condorcet winner assumption, a premise commonly found in various other studies (Zoghi et al., 2014a;b; 2015b; Komiyama et al., 2015; Chen & Frazier, 2017; Haddenhorst et al., 2021; Agarwal et al., 2021; Saha & Gaillard, 2022).

Urvoy et al. (2013) and Zoghi et al. (2013) presented algorithms with a regret bound of O(K2 log T) for the dueling bandit problem in the Condorcet winner setting. Subsequently, Zoghi et al. (2014a) and Komiyama et al. (2015) introduced RUCB and RMED algorithms, respectively, achieving a bound of O(K2 + K log T) that matches a lower bound established in (Yue et al., 2012). RUCB is built on the estimation of pairwise probabilities and utilizes the Upper Confidence Bound (UCB) strategy to select the best arm, whereas RMED explores the likelihood of each arm being the Condorcet winner. Several other studies have examined the dueling bandit problem in specific settings, including investigations into robustness to corruptions (Agarwal et al., 2021), best-of-both-world scenarios (Saha & Gaillard, 2022), and adversarial settings (Gajane et al., 2015; Saha et al., 2021). To the best of our knowledge, our paper is the first to explore the dueling bandit problem with stochastic delayed feedback.

Delayed feedback, given its practical significance, has been a key area of research in multi-armed bandits and online learning (Joulani et al., 2013; Vernade et al., 2017; Grover et al., 2018; Gael et al., 2020; Lancewicki et al., 2021). A thorough analysis of online learning problems with delayed feedback including partial monitoring settings is given in Joulani et al. (2013). Additionally, Vernade et al. (2017) proposed algorithms

Published in Transactions on Machine Learning Research (08/2024)

for delayed conversions in multi-armed bandits, assuming full knowledge of the delay distribution. Pike-Burke et al. (2018) addressed a more complex problem related to delayed, aggregated anonymous feedback. Here, the player only observes the sum of rewards received in each round after certain delays, without information about which past actions contributed to this aggregated reward. Delays have also been extensively explored in the adversarial setting (Cesa-Bianchi et al., 2016; Bistritz et al., 2019; Thune et al., 2019; Zimmert & Seldin, 2020). Furthermore, recent studies have investigated delays in various bandit settings and objectives, including linear bandits (Vernade et al., 2020), generalized linear bandits (Zhou et al., 2019; Howson et al., 2023), kernel bandits (Vakili et al., 2023), adversarial Markov decision processes (Lancewicki et al., 2022; Jin et al., 2022), and best-of-both-worlds algorithms (Masoudian et al., 2022; Saha & Gaillard, 2022).

1.2 Organization

The structure of the remaining sections in this paper is as follows: Section 2 provides a formal description of our problem setting. Following that, in Section 3, we introduce the RUCB-Delay algorithm, applicable when complete knowledge of the delay distribution is known. Section 4 presents another algorithm, MRRDB-Delay, designed for scenarios where the delay distribution is unknown, with only the expected value known. The comprehensive regret analysis for the two algorithms, RUCB-Delay and MRR-DB-Delay, is also provided in Sections 3 and 4, respectively. In Section 5, we demonstrate the performance of our two algorithms using various simulated and real-world datasets. Finally, we conclude with a discussion of our results in Section 6.

2 Problem Setting

In this paper, we investigate a dueling bandit problem with K arms, denoted as {1, 2, ..., K}. At each time step, we select a pair of arms (i, j) and obtain the outcome of a pairwise comparison between the selected arms, which is subject to a delay. The outcome of the pairwise comparison between (i, j) is stochastically determined by a probability µij, where µij represents the likelihood of arm i beating arm j in a comparison between the two arms. The delay is decided by a discrete distribution D, which is supported on N and is independent of the selection of arm pairs1. We assume the existence of a Condorcet winner (Urvoy et al., 2013), specifically arm 1 without loss of generality. This implies that there exists a unique arm, i.e. arm 1, with µ1i = P(1 i) > 1

2 held for all i = 1.

In this work, we introduce a new problem of dueling bandits under the presence of delayed feedback. The exact procedure repeats the following steps:

1. At time step t, the player selects a pair of arms (ut, vt).

2. The environment stochastically samples an outcome Xt = I(ut vt) {0, 1} from the Bernoulli distribution B(µutvt). Additionally, it determines a delay Dt N from the delay distribution D.

3. At the beginning of time step t + 1, the player receives outcomes from earlier rounds. For s < t + 1, the censoring variable I(Ds t+1 s) determines whether the outcome from time step s is revealed by time step t + 1. Define Ys,t+1 = Xs I(Ds t + 1 s); the player then observes the collection (Ys,t+1)1 s t.

Our problem setting aligns with the classic delayed feedback formulation with binary observations (Vernade et al., 2017; 2020): the player observes zero value for delayed responses until they become available. However, as we mention in Section 1, different from other types of bandits, the dueling bandit is intrinsically more difficult with delayed rewards due to its feedback mechanism that involves two arms at once: on the one hand, when Ys,t = 1, it is straightforward to determine that Xs = 1. We say that the reward converted if Xs = 1 and the actual observation of this conversion event is indicated by Ys,t = 1. On the other hand, if Ys,t = 0, the status of Xs (Xs = 0 or Xs = 1) remains uncertain due to the potential delay since we always

1We note that our analysis can be extended to scenarios where the delay distribution depends on the pair of arms. This is advantageous because a user may hesitate longer when µij is close to 1/2, resulting in longer delays. However, for simplicity and clarity, we assume independence in our work.

Published in Transactions on Machine Learning Research (08/2024)

observe Ys,t = 0 when the delayed feedback is not available yet regardless of the pairwise comparison result Xs. Specifically, the scenarios Xs = 0 or Ds > t s both would result in our observation Ys,t = 0, making it impossible to distinguish between these two cases until the delay is resolved. Conclusively, this phenomenon leads to a preference bias on vt over ut, and this issue is practical under the dueling bandit application as we mention in Section 1. Therefore, our problem is intrinsically more challenging than other types of bandits with delayed feedback due to the additional bias effects on pairwise comparisons.

Our goal is to minimize the cumulative regret up to time step T, given by

where i = µ1i 1

2. This formulation aligns with the standard concept of regret used in other studies (Vernade et al., 2017; Pike-Burke et al., 2018; Vernade et al., 2020).

3 Algorithm: Known Delay Distribution

In this section, we introduce an algorithm designed for situations where the delay distribution D is known. Our approach begins with estimating the parameters µij in the pairwise preference model. Following this, we propose RUCB-Delay, an algorithm adapted from the RUCB (Relative Upper Confidence Bound) algorithm (Zoghi et al., 2014a). RUCB-Delay utilizes the upper confidence bound (UCB) methodology computed from the estimated parameters. Finally, we establish a regret bound for our proposed algorithm, demonstrating its effectiveness in addressing the K-armed dueling bandit problem with delayed feedback.

3.1 Parameter Estimation

Let τt = P(D1 t) represent the cumulative distribution function of the delay distribution. Assuming knowledge of the delay distribution, we possess information regarding the individual values of τt. Additionally, we incorporate an M-threshold delay, restricting the observation of conversions to occur within M time steps after the corresponding action. In other words, if the reward conversion has not been observed within the initial M rounds, the algorithm presumes it will never occur. This censored observation setting aligns with the study conducted in Vernade et al. (2017; 2020). This assumption has a practical advantage, as there is no need to keep track of observations {Ys,t} for t > s + M.

We first introduce some key notations. We utilize the notation Is i,j = I((us, vs) = (i, j)) for simplicity when there is no potential confusion in the context. Define Nij(t) as the exact count of times we select a pair of arms (i, j) up to time t, and Nij(t) as the delay-discounted count, which takes into account the probability of the reward not yet being observed:

s=1 ( Is i,j + Is j,i ), (2)

s=1 τM Is i,j + Is j,i +

s=t M+1 τt s Is i,j + Is j,i . (3)

Moreover, for practical benefit, as mentioned earlier, we introduce the censored observation Ys,t, which differs from Ys,t only when the reward Xs is one and the delay Ds exceeds M:

Ys,t = Ys,t I(Ds M) = Xs I(Ds min(M, t s)).

Lastly, introduce Sij as follows:

s=1 Ys,t Is i,j + (τM Ys,t) Is j,i +

s=t M+1 Ys,t Is i,j + (τt s Ys,t) Is j,i.

Published in Transactions on Machine Learning Research (08/2024)

Algorithm 1 RUCB-Delay Input: Time horizon T, α, M, {τd}M d=1, A = {1, 2, ..., K} Initialization:

1: for t = 1, 2, ..., T do 2: Compute uij(t) for all i = j based on Equation 5 3: uii(t) 1

2 for all i A

4: Select a pair of arms (ut, vt) according to RUCB 5: Observe the collection (Ys,t+1)1 s t. 6: Update Nij(t + 1), Nij(t + 1), and Sij(t + 1) 7: end for

Sij(t) captures the bias-corrected count of arm i beating arm j up to time t. When comparing (i, j) and if the current observation implies j > i, we add τ instead of 1. This adjustment accounts for the preference bias favoring the second arm over the first arm, which enables us to construct an unbiased estimator of µij. We define ˆµij as follows:

ˆµij(t) = Sij(t)

Nij(t), (4)

which serves as a conditionally unbiased estimator of µij when conditioned on the arm selections.

Proposition 1. ˆµij(t) is a conditionally unbiased estimator of µij, when conditioning on the selections of arms.

Proof. If (us, vs) = (i, j), E( Ys,t|us, vs) = µijτmin(M,t s).

E(Sij(t)|{us, vs}1 s t 1)

s=1 µijτM Is i,j + (τM µjiτM) Is j,i +

s=t M+1 µijτt s Is i,j + (τt s µjiτt s) Is j,i = µij Nij(t),

where the last equality holds from the definition of Nij(t) and the relation µji = 1 µij. Therefore,

E(ˆµij(t)|{us, vs}1 s t 1) = µij.

3.2 Algorithm

Utilizing the estimator and variables defined in Section 3.1, we propose an algorithm named RUCB-Delay designed for the dueling bandit problem in the presence of delayed feedback. RUCB-Delay is built upon the framework of the RUCB algorithm (Zoghi et al., 2014a), retaining a similar structure but with novel estimators, variables, and an altered upper confidence bound. The modified upper confidence bound is given by:

Uij(t) = ˆµij(t) +

αNij(t) log t

N 2 ij(t) , (5)

where α 1. Refer to Algorithm 1 for an outline of the RUCB-Delay procedure. With the novel upper confidence bound, we follow RUCB s approach of selecting a pair of arms (ut, vt). At each time step, we define a potential champion set C consisting of arms that optimistically win against all other arms, meaning Uij 1

2 for all j. We then update the current best arm set B, which will either have one element or be empty. The idea is twofold: first, the arm in B loses its top position as the best arm if it is optimistically beaten by another arm, and second, the arm ut will be selected from B with high probability, or from the potential

Published in Transactions on Machine Learning Research (08/2024)

champion set C. Specifically, if C contains exactly one element, we set B to C and select that unique element as ut. If C has multiple elements, we choose ut from B with a probability of 0.5 and from the remaining potential champion arms in C \ B with equal probability. After selecting the first arm ut, the second arm vt is chosen as the one that maximizes Uutvt.

3.3 Regret Analysis

Here, we present a regret analysis for the proposed RUCB-Delay algorithm. This is achieved by first establishing a general deviation inequality between µij and ˆµij in Lemma 2 and utilizing it to prove the high probability bound in Lemma 3.

Lemma 2. For any α > 0 and any pair of arms (i, j), the following inequality holds for all t,

P (|ˆµij(t) µij| > rij(t)) 2 t2α where rij(t) =

αNij(t) log t

N 2 ij(t) .

The proof of Lemma 2 is provided in Appendix A.1. With the upper confidence bound Uij defined in Equation 5 for i = j and Uii = 1

2 for all i, we define Lij(t) = 1 Uji(t). We now state the high probability concentration inequality:

Lemma 3. Let α > 1

2 and δ > 0. Then, with probability at least 1 δ, for any t > C(δ) and any pair of arms (i, j), the following holds: Lij(t) µij Uij(t),

where C(δ) = (4α 1)(M+1)K(K 1)

(2α 1)δ 1 2α 1 .

The lemma establishes a high probability inequality that involves all large time steps and pairs of arms. By leveraging symmetry, our analysis can be confined to cases where i < j. Additionally, if arms i and j are compared at time s1, then at time s2, Nij and Nij remain constant for all time steps between s1 +M and s2. This observation narrows down the range of time steps that need to be considered. Given these observations and Lemma 2, we present a comprehensive proof of Lemma 3 in Appendix A.1.

Given the high probability upper and lower bounds of µij from Lemma 3, we derive the following high probability upper bound on the regret for the RUCB-Delay algorithm and state the expected regret bound.

Theorem 1. Let α 1 and δ > 0. For any T 1, with probability at least 1 δ, the cumulative regret R(T) of RUCB-Delay is upper bounded by

α min( 2 i , 2 j)

α τ 2 1 2 j log T

Theorem 2. For α 1, the expected regret E[R(T))] of RUCB-Delay is upper bounded by

α min( 2 i , 2 j)

α τ 2 1 2 j log T

The detailed proofs for Theorems 1 and 2 are available in Appendix A.2. The components of the expected regret upper bound in Theorem 2 share a similar structure with the regret bounds obtained in a fully stochastic setting (Zoghi et al., 2014a), but with an additional multiplicative constant 1/τ 2 1 . We may easily verify that in the absence of delay, when τ1 = 1, our algorithm RUCB-Delay recovers the optimal regret bound in the standard stochastic dueling bandit problem (Zoghi et al., 2014a; Komiyama et al., 2015).

Remark 1. We have developed an algorithm applicable to the delayed feedback setting in the dueling bandit problem by introducing a series of new variables and modifying the upper confidence bound when the delay distribution is known. We anticipate that many bandit algorithms using the Upper Confidence Bound (UCB)

Published in Transactions on Machine Learning Research (08/2024)

Algorithm 2 Multi Round-Robin Dueling Bandit with Delayed Feedback (MRR-DB-Delay) Input: Time horizon T, {nm}m N Initialization: γ1 = 1

2, t = 1, m = 1, A1 = {1, 2, ..., K}, Tij(0) = for all i, j A1

1: while t T do

2: / Round m starts / 3: for i, j Am do 4: Let Tij(m) := Tij(m 1) 5: while |Tij(m)| nm and t T do 6: Play arms i and j

7: Tij(m) Tij(m) {t} 8: t t + 1 9: end while 10: end for 11: / Update mean estimates and active arm set / 12: for i, j Am do

13: Yij(m) := 1 |Tij(m)| P

s Tij(m) Ys,t 14: end for

15: Am+1 := Am \ i Am : j Am s.t. Yij(m) + γm < 1

16: γm+1 γm

2 17: m m + 1 18: end while

strategy can transition to scenarios with delayed feedback by employing novel variations similar to our upper bound.

For instance, the RR-DB algorithm (Saha & Gaillard, 2022) designed for stochastic dueling bandits can be transformed in a similar manner. RR-DB is a straightforward methodology that, in each round, compares all pairs of potentially optimal arms in a round-robin fashion, gradually removing significantly suboptimal arms. Although RR-DB achieves a regret bound of O(K2 log T) and is not optimal overall, it performs optimally concerning the order of min. In contrast to other dueling bandit algorithms (Zoghi et al., 2014a; Bengs et al., 2021) suffering from a regret of order 2 min, RR-DB relies only on 1 min, which leads to an improved worst-case regret. We adjust the upper bound uij of the RR-DB algorithm from

uij(t) := ˆpij(t) +

uij(t) := ˆµij(t) +

Nij(t) log(Kt/δ)

N 2 ij(t) ,

where ˆµij is defined as Equation 4. Then, the algorithm becomes suitable for the delayed feedback setting. We refer to this as RR-DB-Delay. Since RR-DB is not optimal, we do not delve into it in detail in this paper. However, we will evaluate and compare the performance of RR-DB-Delay in the experimental section (Section 5) later on.

Remark 2. We can formulate a non-asymptotic lower bound for the dueling bandit problem with stochastic delayed feedback. For any dueling bandit algorithm and T, there exists a K-armed dueling bandit such that R(T) c p

TK/τM, where c > 0 is a constant. The proof for this is deferred to Appendix D, while it remains an open problem whether this lower bound is tight.

Published in Transactions on Machine Learning Research (08/2024)

4 Algorithm: Known and Bounded Expected Delay

In practice, the exact delay distribution D is often unknown and can be challenging to estimate. This section presents a novel algorithm named Multi Round-Robin Dueling bandit with Delayed Feedback (MRR-DBDelay) that can be implemented when only the expected value of the delay is known and bounded.

4.1 Algorithm

MRR-DB-Delay employs a round-based elimination strategy that iteratively refines the set of active arms. In each round, all possible pairs of active arms are played multiple times, and based on the accumulated observations, suboptimal arms are eliminated at the end of the round.

Algorithm 2 provides a detailed overview of our algorithm, MRR-DB-Delay. In this algorithm, m denotes the current round, while t represents the current time step, with T being the total time steps until the algorithm concludes. Additionally, nm represents the number of times each active arm pair should be played by round m. In the collection Tij(m), we store all time steps when arm pair (i, j) were played in the first m rounds. The set Am identifies the active arms during round m, indicating the subset of arms currently under consideration.

In round m, each pair of arms (i, j), where i, j Am, is played nm nm 1 times. Then, we compute the mean estimates Yij(m) of arm pairs based on the observations at that point. Finally, arm i is eliminated if there exists another arm j in Am such that the mean estimate Yij(m) is less than 1

2 by γm. The gap tolerance γm decreases exponentially over rounds, and in our further analysis (Lemma 4), we establish an inequality between the difference of Yij(m) and µij. These provide a guarantee that, with high probability, all arms, except the best arm 1, will be removed.

Lastly, in each round, we need to determine the value of nm, indicating how many times we will play each pair of arms. The determination of nm is crucial as it plays a significant role in the algorithm s ability to efficiently eliminate suboptimal arms and has a substantial impact on regret performance. We choose nm so that the confidence bound established in Lemma 4 holds with high probability. This naturally results in the algorithm effectively removing arms with suboptimal performance. The selection of nm in Algorithm 4 is as follows:

nm = C1 log(Tγ2 m) γ2m + C2 p

E[D] log(Tγ2m)

γm + C3E[D]

where C1 and C2 are some constants, and E[D] represents the expected value of the delay. The complete formula for nm and further details on the derivation of nm are provided in Appendix B.1.

Remark 3. In fact, with a different value of nm, MRR-DB-Delay can also manage aggregated and anonymous delayed feedback (Pike-Burke et al., 2018). In this situation, the player only observes the reward summation in each round after some delay, without knowing which specific past actions led to this total reward. Detailed analysis on this aspect can be found in Appendix C.

4.2 Regret Analysis

Here, we conduct a regret analysis for MRR-DB-Delay under the choice of nm in Equation 6. We establish an error bound between µij and Yij(m), and subsequently derive the expected regret bound of the algorithm.

Let tm be the time step at the end of round m. The sum of discrepancies between µij and the observation Ys,tm during comparisons of arms i and j can be decomposed as follows:

s Tij(m) (µij Ys,tm) = X

s Tij(m) (µij Xs) + (Xs Ys,tm)

The first term represents the difference between the true parameter µij and the outcome Xs, capturing the error in the absence of delayed feedback. Since the average of Xs is an unbiased estimator of µij, the first term can be bounded using concentration inequalities. On the other hand, the second term is nonzero

Published in Transactions on Machine Learning Research (08/2024)

only when the reward is converted but remains unobserved due to the delay, thereby illustrating the impact of delay on unobserved converted output. This decomposition allows us to separately analyze the effect of preference bias and delay. We establish an upper bound for each term and present the following high probability bound, along with a comprehensive proof given in Appendix B.1. Lemma 4. For any round m 1 and any pair of active arms (i, j) Am, with probability at least 1 2 T γ2m ,

µij Yij(m) γm

We now provide the expected regret bound of MRR-DB-Delay in Theorem 3. Please refer to Appendix B.2 for a detailed proof. Theorem 3. The expected regret E[R(T))] of Algorithm 2 is upper bounded by

i=2 O K log(T 2 i ) i + K q

log(T 2 i )E[D] + KE[D]

While its regret bound is larger than that of RUCB-Delay proposed in Section 3, MRR-DB-Delay possesses a significant advantage in that it can be utilized even when the complete delay distribution is unknown. Moreover, MRR-DB-Delay is advantageous in terms of the order of min, as it only depends on 1 min, whereas RUCB-Delay relies on 2 min.

5 Experiments

Following the introduction of our algorithms and theoretical analysis, this section involves a comparison of the performance of our proposed algorithms through numerical experiments conducted on various synthetic and real-world datasets.

Algorithms. We evaluate the performance of three algorithms introduced in our paper with the baseline: 1. RUCB-Delay (Section 3), 2. RR-DB-Delay (Remark 1 in Section 3), and 3. MRR-DB-Delay (Section 4). RUCB-Delay and RR-DB-Delay are algorithms that necessitate knowledge of the complete delay distribution, whereas MRR-DB-Delay only requires the expected value of the delay. We note that the regret bound, considering only terms related to K and T, for RUCB-Delay is O(K log T), while RR-DB-Delay and MRRDB-Delay have a bound of O(K2 log T). To the best of our knowledge, there are no other known methods applicable to dueling bandit problems when there is a delay in feedback. However, we also included a baseline comparison using RUCB (Zoghi et al., 2014a) to effectively demonstrate the advantages of our proposed delayspecific algorithms. RUCB is an algorithm designed without accounting for delayed feedback; therefore, we update the output as soon as the delayed feedback is received.

Experimental Setup. We plot the regret performance of all three algorithms on six distinct datasets, which are described in the following paragraph. For all experiments, we set the time horizon to T = 200, 000. However, for datasets where all three algorithms converge earlier than this fixed horizon, we plot the results only up to the earlier convergence time for better visualization. The regret performance for all plots was assessed by averaging the cumulative regret, as defined in Equation 1, across 100 runs. Both the average and standard deviation are reported in the plots. Similar to Vernade et al. (2017; 2020), we assume that the delay distribution follows a geometric distribution with p = 0.01, implying a mean E[D] = 100. Also, based on our regret analysis in Theorem 2, we set α = 1.0 for RUCB-Delay. For the baseline comparison, RUCB uses the number of times arm i beats arm j to establish their upper confidence bound, which includes potentially delayed feedback. We employ the most up-to-date values and update this count as soon as delayed feedback is received. Lastly, as discussed in Section 3.1, we introduce a windowing parameter denoted as M for computational efficiency in RUCB-Delay and RR-DB-Delay. When dealing with an unbounded delay, we encounter practical storage issues since we need to keep track of previous actions to update Nij and Sij. The windowing framework allows us to address this with an array of size M. We set the windowing parameter M = 1000. Given that τ1000 = 0.999 with the geometric delay distribution, introducing the windowing parameter in this experiment is expected to have practically no impact on regret performance.

Published in Transactions on Machine Learning Research (08/2024)

(a) Six Rankers

(d) Arithmetic

(e) Car Preference

Figure 2: Average Cumulative Regret. The regret performances, averaged over 100 independent runs, along with their standard deviations, are reported for each dataset and algorithm.

Datasets. We perform experiments using the following six datasets. We provide a brief description of each dataset, including the number of arms:

Six rankers (K = 6): a preference matrix generated from the six retrieval functions within the full-text search engine of Ar Xiv.org (Yue & Joachims, 2011).

MSLR (K = 5): a 5 5 preference matrix introduced by Zoghi et al. (2015a) is extracted from a subset of rankers originating from the Microsoft Learning to Rank (MSLR) dataset (Qin & Liu, 2013). The MSLR dataset includes relevant information between queries and URLs, comprising more than 30,000 queries.

Tennis (K = 8): a dataset, constructed by Ramamohan et al. (2016), is based on the results of tennis matches organized by the Association of Tennis Professionals (ATP) among 8 international tennis players. The element (i, j) represents the proportion of times player i has defeated player j.

Arithmetic (K = 10): a synthetic preference matrix with ten arms where µij = 0.5 + 0.025(j i). A similar data was first used in (Komiyama et al., 2015).

Car Preference (K = 10): a dataset of car preferences (Abbasnejad et al., 2013) collected from 60 users in the United States. The preference matrix was constructed based on the pairwise comparison data of users evaluating 10 different cars. Each user expressed their preferences for all 45 pairs of cars.

Sushi (K = 16): a dataset derived from the sushi preference dataset (Kamishima, 2003), comprising the preferences of 5,000 Japanese users for 100 different types of sushi. Komiyama et al. (2015; 2016) selected 16 sushi types from the dataset and represented them in a preference matrix.

Published in Transactions on Machine Learning Research (08/2024)

Results. The regret evaluations of the three algorithms for all datasets are presented in Figure 2. Across all datasets except for Tennis, RUCB-Delay outperforms RR-DB-Delay and MRR-DB-Delay, providing empirical support for their respective regret bound guarantees. Additionally, upon comparing the average performance of RR-DB-Delay and MRR-DB-Delay, we observe a slight superiority of RR-DB-Delay. However, it is essential to highlight that MRR-DB-Delay holds a notable advantage as it can be implemented with only the knowledge of the expected value of delay, unlike RUCB-Delay and RR-DB-Delay, which require complete delay distribution. Another critical aspect to highlight is the baseline comparison: The baseline method, RUCB, becomes futile and fails to converge across all datasets. This observation distinctly highlights the superiority of our approaches, specifically tailored for settings with delayed rewards. Notably, the innovative estimator introduced in RUCB-Delay significantly enhances the algorithm s effectiveness and robustness to delayed observations.

When examining the performance of RUCB-Delay, it can be observed that the regret increases more slowly in the early stages compared to RR-DB-Delay and MRR-DB-Delay. This fact is aligned with our expectation since both RR-DB-Delay and MRR-DB-Delay initially attempt all possible arm pairs before eliminating suboptimal arms, leading to a rapid increase in regret at the beginning. Additionally, for datasets with many arms having small i, these arms tend to be eliminated later. As a result, comparably large differences in regret performance may arise between RUCB-Delay and both RR-DB-Delay and MRR-DB-Delay.

6 Discussion

We have studied the biased dueling bandit problem with stochastic delayed feedback and introduced two main algorithms, each accompanied by a comprehensive analysis. The first algorithm, RUCB-Delay, is designed for scenarios where the complete delay distribution is available. It leverages parameter estimation of the underlying model and incorporates a UCB strategy. This algorithm recovers the optimal regret bound for the dueling bandit problem in the absence of delay. The second algorithm, MRR-DB-Delay, is crafted for situations where information about the distribution is unavailable, and only the expected values are known. It employs a multi-round-robin fashion by playing all active arm pairs a predetermined number of times in each round. To the best of our knowledge, our methods are the first to delve into the delayed feedback setting within the dueling bandit framework. The efficiency of our proposed algorithms is then validated under numerical experiments.

There are several potential directions for future research. To begin with, it is intriguing to investigate algorithms that can be implemented without any knowledge of the delay, such as its expected value, while this problem remains challenging since existing algorithms for the simpler multi-armed bandit with delayed feedback still require additional side information. Furthermore, conducting theoretical studies to understand the impact of employing estimates of the delay distribution could be insightful. While our study assumed a Condorcet winner, it would be beneficial to extend the problem to consider Borda or Copeland winners. Last but not least, the biased dueling bandit presents a compelling area of study due to the pervasive nature of biases confounding user preferences in real-world applications. Despite its practical significance, this challenge remains largely underexplored in the existing literature.

Acknowledgments

We thank the anonymous reviewers and the Action Editor for their constructive feedback, which helped improve this work. This work was supported in part by the National Science Foundation under grants DMS-2152289, DMS-2134107, and Cisco Faculty Award.

Ehsan Abbasnejad, Scott Sanner, Edwin V Bonilla, and Pascal Poupart. Learning community-based preferences via dirichlet process mixtures of gaussian processes. In Twenty-third international joint conference on artificial intelligence, 2013.

Arpit Agarwal, Shivani Agarwal, and Prathamesh Patil. Stochastic dueling bandits with adversarial corruption. In Algorithmic Learning Theory, pp. 217 248. PMLR, 2021.

Published in Transactions on Machine Learning Research (08/2024)

Nir Ailon, Zohar Karnin, and Thorsten Joachims. Reducing dueling bandits to cardinal bandits. In International Conference on Machine Learning, pp. 856 864. PMLR, 2014.

Peter Auer, Nicolo Cesa-Bianchi, and Paul Fischer. Finite-time analysis of the multiarmed bandit problem. Machine learning, 47:235 256, 2002.

Akshay Balsubramani, Zohar Karnin, Robert E Schapire, and Masrour Zoghi. Instance-dependent regret bounds for dueling bandits. In Conference on Learning Theory, pp. 336 360. PMLR, 2016.

Viktor Bengs, Róbert Busa-Fekete, Adil El Mesaoudi-Paul, and Eyke Hüllermeier. Preference-based online learning with dueling bandits: A survey. The Journal of Machine Learning Research, 22(1):278 385, 2021.

Ilai Bistritz, Zhengyuan Zhou, Xi Chen, Nicholas Bambos, and Jose Blanchet. Online exp3 learning in adversarial bandits with delayed feedback. Advances in neural information processing systems, 32, 2019.

Nicol o Cesa-Bianchi, Claudio Gentile, Yishay Mansour, and Alberto Minora. Delay and cooperation in nonstochastic bandits. In Conference on Learning Theory, pp. 605 622. PMLR, 2016.

Olivier Chapelle. Modeling delayed feedback in display advertising. In Proceedings of the 20th ACM SIGKDD international conference on Knowledge discovery and data mining, pp. 1097 1105, 2014.

Bangrui Chen and Peter I Frazier. Dueling bandits with weak regret. In International Conference on Machine Learning, pp. 731 739. PMLR, 2017.

Shein-Chung Chow and Mark Chang. Adaptive design methods in clinical trials. Chapman and Hall/CRC, 2006.

Miroslav Dudík, Katja Hofmann, Robert E Schapire, Aleksandrs Slivkins, and Masrour Zoghi. Contextual dueling bandits. In Conference on Learning Theory, pp. 563 587. PMLR, 2015.

David A Freedman. On tail probabilities for martingales. The Annals of Probability, pp. 100 118, 1975.

Manegueu Anne Gael, Claire Vernade, Alexandra Carpentier, and Michal Valko. Stochastic bandits with arm-dependent delays. In International Conference on Machine Learning, pp. 3348 3356. PMLR, 2020.

Pratik Gajane, Tanguy Urvoy, and Fabrice Clérot. A relative exponential weighing algorithm for adversarial utility-based dueling bandits. In International Conference on Machine Learning, pp. 218 227. PMLR, 2015.

Aditya Grover, Todor Markov, Peter Attia, Norman Jin, Nicolas Perkins, Bryan Cheong, Michael Chen, Zi Yang, Stephen Harris, William Chueh, et al. Best arm identification in multi-armed bandits with delayed feedback. In International Conference on Artificial Intelligence and Statistics, pp. 833 842. PMLR, 2018.

Björn Haddenhorst, Viktor Bengs, Jasmin Brandt, and Eyke Hüllermeier. Testification of condorcet winners in dueling bandits. In Uncertainty in Artificial Intelligence, pp. 1195 1205. PMLR, 2021.

Benjamin Howson, Ciara Pike-Burke, and Sarah Filippi. Delayed feedback in generalised linear bandits revisited. In International Conference on Artificial Intelligence and Statistics, pp. 6095 6119. PMLR, 2023.

Kevin Jamieson, Sumeet Katariya, Atul Deshpande, and Robert Nowak. Sparse Dueling Bandits. In Proceedings of the Eighteenth International Conference on Artificial Intelligence and Statistics, pp. 416 424. PMLR, 2015.

Tiancheng Jin, Tal Lancewicki, Haipeng Luo, Yishay Mansour, and Aviv Rosenberg. Near-optimal regret for adversarial mdp with delayed bandit feedback. Advances in Neural Information Processing Systems, 35:33469 33481, 2022.

Pooria Joulani, Andras Gyorgy, and Csaba Szepesvári. Online learning under delayed feedback. In International Conference on Machine Learning, pp. 1453 1461. PMLR, 2013.

Published in Transactions on Machine Learning Research (08/2024)

Toshihiro Kamishima. Nantonac collaborative filtering: recommendation based on order responses. In Proceedings of the ninth ACM SIGKDD international conference on Knowledge discovery and data mining, pp. 583 588, 2003.

Junpei Komiyama, Junya Honda, Hisashi Kashima, and Hiroshi Nakagawa. Regret lower bound and optimal algorithm in dueling bandit problem. In Conference on learning theory, pp. 1141 1154. PMLR, 2015.

Junpei Komiyama, Junya Honda, and Hiroshi Nakagawa. Copeland dueling bandit problem: Regret lower bound, optimal algorithm, and computationally efficient algorithm. In International Conference on Machine Learning, pp. 1235 1244. PMLR, 2016.

Tal Lancewicki, Shahar Segal, Tomer Koren, and Yishay Mansour. Stochastic multi-armed bandits with unrestricted delay distributions. In International Conference on Machine Learning, pp. 5969 5978. PMLR, 2021.

Tal Lancewicki, Aviv Rosenberg, and Yishay Mansour. Learning adversarial markov decision processes with delayed feedback. Proceedings of the AAAI Conference on Artificial Intelligence, 36(7):7281 7289, 2022.

Saeed Masoudian, Julian Zimmert, and Yevgeny Seldin. A best-of-both-worlds algorithm for bandits with delayed feedback. Advances in Neural Information Processing Systems, 35:11752 11762, 2022.

Ciara Pike-Burke, Shipra Agrawal, Csaba Szepesvari, and Steffen Grunewalder. Bandits with delayed, aggregated anonymous feedback. In International Conference on Machine Learning, pp. 4105 4113. PMLR, 2018.

Tao Qin and Tie-Yan Liu. Introducing LETOR 4.0 datasets. Co RR, abs/1306.2597, 2013. URL http: //arxiv.org/abs/1306.2597.

Filip Radlinski, Madhu Kurup, and Thorsten Joachims. How does clickthrough data reflect retrieval quality? In Proceedings of the 17th ACM conference on Information and knowledge management, pp. 43 52, 2008.

Siddartha Y Ramamohan, Arun Rajkumar, and Shivani Agarwal. Dueling bandits: Beyond condorcet winners to general tournament solutions. Advances in Neural Information Processing Systems, 29, 2016.

Aadirupa Saha and Pierre Gaillard. Versatile dueling bandits: Best-of-both world analyses for learning from relative preferences. In International Conference on Machine Learning, pp. 19011 19026. PMLR, 2022.

Aadirupa Saha, Tomer Koren, and Yishay Mansour. Adversarial dueling bandits. In International Conference on Machine Learning, pp. 9235 9244. PMLR, 2021.

Tobias Sommer Thune, Nicolò Cesa-Bianchi, and Yevgeny Seldin. Nonstochastic multiarmed bandits with unrestricted delays. Advances in Neural Information Processing Systems, 32, 2019.

Tanguy Urvoy, Fabrice Clerot, Raphael Féraud, and Sami Naamane. Generic exploration and k-armed voting bandits. In International Conference on Machine Learning, pp. 91 99. PMLR, 2013.

Sattar Vakili, Danyal Ahmed, Alberto Bernacchia, and Ciara Pike-Burke. Delayed feedback in kernel bandits. In International Conference on Machine Learning, pp. 34779 34792. PMLR, 2023.

Claire Vernade, Olivier Cappé, and Vianney Perchet. Stochastic bandit models for delayed conversions. In Conference on Uncertainty in Artificial Intelligence, 2017.

Claire Vernade, Alexandra Carpentier, Tor Lattimore, Giovanni Zappella, Beyza Ermis, and Michael Brueckner. Linear bandits with stochastic delayed feedback. In International Conference on Machine Learning, pp. 9712 9721. PMLR, 2020.

Yuya Yoshikawa and Yusaku Imai. A nonparametric delayed feedback model for conversion rate prediction. ar Xiv preprint ar Xiv:1802.00255, 2018.

Published in Transactions on Machine Learning Research (08/2024)

Yisong Yue and Thorsten Joachims. Interactively optimizing information retrieval systems as a dueling bandits problem. In Proceedings of the 26th Annual International Conference on Machine Learning, pp. 1201 1208, 2009.

Yisong Yue and Thorsten Joachims. Beat the mean bandit. In Proceedings of the 28th international conference on machine learning (ICML-11), pp. 241 248, 2011.

Yisong Yue, Josef Broder, Robert Kleinberg, and Thorsten Joachims. The k-armed dueling bandits problem. In Proceedings of the 22nd Conference on Learning Theory, 2009.

Yisong Yue, Josef Broder, Robert Kleinberg, and Thorsten Joachims. The k-armed dueling bandits problem. Journal of Computer and System Sciences, 78(5):1538 1556, 2012.

Zhengyuan Zhou, Renyuan Xu, and Jose Blanchet. Learning in generalized linear contextual bandits with stochastic delays. Advances in Neural Information Processing Systems, 32, 2019.

Julian Zimmert and Yevgeny Seldin. An optimal algorithm for adversarial bandits with arbitrary delays. In International Conference on Artificial Intelligence and Statistics, pp. 3285 3294. PMLR, 2020.

Masrour Zoghi, Shimon Whiteson, Rémi Munos, and M. de Rijke. Relative upper confidence bound for the k-armed dueling bandit problem. Ar Xiv, abs/1312.3393, 2013.

Masrour Zoghi, Shimon Whiteson, Remi Munos, and Maarten Rijke. Relative upper confidence bound for the k-armed dueling bandit problem. In International conference on machine learning, pp. 10 18. PMLR, 2014a.

Masrour Zoghi, Shimon A Whiteson, Maarten De Rijke, and Remi Munos. Relative confidence sampling for efficient on-line ranker evaluation. In Proceedings of the 7th ACM international conference on Web search and data mining, pp. 73 82, 2014b.

Masrour Zoghi, Zohar S Karnin, Shimon Whiteson, and Maarten De Rijke. Copeland dueling bandits. Advances in neural information processing systems, 28, 2015a.

Masrour Zoghi, Shimon Whiteson, and Maarten de Rijke. Mergerucb: A method for large-scale online ranker evaluation. In Proceedings of the Eighth ACM International Conference on Web Search and Data Mining, pp. 17 26, 2015b.

Published in Transactions on Machine Learning Research (08/2024)

A Regret Analysis of RUCB-Delay

A.1 Proof of Lemmas 2 and 3

Lemma 2. For any α > 0 and any pair of arms (i, j), the following inequality holds for all t,

P (|ˆµij(t) µij| > rij(t)) 2 t2α where rij(t) =

αNij(t) log t

N 2 ij(t) .

s=1 Ys,t Is i,j + (1 Ys,t) Is j,i.

Then, we obtain the relation

Sij(t) = Z +

s=1 (τmin(M,t s) 1) Is j,i.

Let s first consider the case conditioning on the selection of arms {us, vs}1 s t 1. Then, we have

s=1 (τmin(M,t s) 1) Is j,i = µij Nij(t) E(Z).

Therefore, we can set an upper bound for the left-hand side conditioned on the selection of arms as:

P (|ˆµij(t) µij| > rij(t) | {us, vs}1 s t 1) = P(|Z E(Z)| > rij(t) Nij(t) | {us, vs}1 s t 1) 2 t2α .

The inequality holds by Hoeffding s Inequality since each censored observation Ys,t is independent given the selection of arms.

Finally, since the right-hand side term of the above inequality does not depend on the selection of arms, we may remove the conditioning by the law of total probability and obtain:

P (|ˆµij(t) µij| > rij(t)) 2 t2α .

Lemma 3. Let α > 1

2 and δ > 0. Then, with probability at least 1 δ, for any t > C(δ) and any pair of arms (i, j), the following holds: Lij(t) µij Uij(t),

where C(δ) = (4α 1)(M+1)K(K 1)

(2α 1)δ 1 2α 1 .

Proof. Let Gij(t) denote the good event where we have µij [Lij(t), Uij(t)] at time t, and let Bij(t) be the bad event where µij / [Lij(t), uij(t)] at time t.

For i < j, the relations µij = 1 µji, Uij(t) = 1 Lji(t), and Lij(t) = 1 Uji(t) hold. Thus, the event Gij(t) is true if and only if Gji(t) is true. Additionally, Gii(t) is always true because, as constructed, µii = lii(t) = uii(t) = 1

2. Hence, it is sufficient to focus only on Gij(t) for i < j.

We define σij n as the time step when arms i and j were compared for the n-th time. We note that Nij(σij n ) = n

by definition. For any n, if σij n + M < σij n+1, then r

αNij(t) log t

N 2 ij(t) is an increasing function of t for all

σij n + M t < σij n+1. Also, ˆµij(t) is a constant for all σij n + M t < σij n+1. Therefore, if Gij(σij n + k) holds true for all 0 k M, it implies that Gij(t) holds for all σij n t < σij n+1.

Published in Transactions on Machine Learning Research (08/2024)

Using the above observations, for any T > 0, we have

P ( t > T, i, j, Gij(t))

= P i < j, Gij(t) for T t max σij Nij(T ) + M, T

and Gij(σij n + k) for n Nij(T) + 1, 0 k M .

Next, by considering the complement of the events on both sides of the above equation and applying the union bound, we arrive at the following:

P ( t > T, i, j s.t. Bij(t))

P t h T, max σij Nij(T ) + M, T i s.t. Bij(t)

+P n Nij(T) + 1, 0 k M s.t. σij n + k < σij n+1 and Bij(σij n + k)

t h T, max σij Nij(T ) + M, T i : |µij ˆµij(t)| >

αNij(t) log t

n [Nij(T) + 1, T], k [0, M] : σij n + k < σij n+1 and µij ˆµij(σij n + k) >

v u u tαNij(σij n + k) log(σij n + k) N 2 ij(σij n + k))

n > T, k [0, M] : σij n + k < σij n+1 and µij ˆµij(σij n + k) >

v u u tαNij(σij n + k) log(σij n + k) N 2 ij(σij n + k))

Let s break down the expression on the right-hand side into three components, denoted as (a), (b), and (c), and then derive upper bounds for each of these components. The first part (a) can be upper bounded as:

t [T, T + M] : Nij(t) = Nij(T) and |µij ˆµij(t)| >

αNij(t) log t

t [T, T + M] : Nij(t) = Nij(T) and |µij ˆµij(t)| >

αNij(t) log T

Nij(T + k) = Nij(T) = n and |µij ˆµij(T + k)| >

αNij(T + k) log T

N 2 ij(T + k)

Published in Transactions on Machine Learning Research (08/2024)

where the last inequality holds from Lemma 2. Second, (b) can be upper bounded as:

Nij(T) + 1 n T, 0 k M : σij n + k < σij n+1 and µij ˆµij(σij n + k) >

v u u tαNij(σij n + k) log T N 2 ij(σij n + k))

σij n + k < σij n+1 and µij ˆµij(σij n + k) >

v u u tαNij(σij n + k) log T N 2 ij(σij n + k))

The first inequality holds because Nij(T) + 1 n and σij n + k > T, while the second inequality results from applying the union bound. The last inequality holds again from Lemma 2. Similarly, by using the fact that n σij n and Lemma 2, we can obtain an upper bound for (c) as follows:

n > T, 0 k M : σij n + k < σij n+1 and µij ˆµij(σij n + k) >

v u u tαNij(σij n + k) log n N 2 ij(σij n + k))

σij n + k < σij n+1 and µij ˆµij(σij n + k) >

v u u tαNij(σij n + k) log n N 2 ij(σij n + k))

Putting all these results together, we obtain

P ( t > T, i, j s.t. Bij(t))

= 2(M + 1)K(K 1) 1 T 2α 1 + (M + 1)K(K 1)

2(M + 1)K(K 1) 1 T 2α 1 + (M + 1)K(K 1) Z

= (M + 1)(4α 1)K(K 1)

(2α 1)T 2α 1

Therefore, when T = C(δ), we obtain P ( t > C(δ), i, j s.t. Bij(t)) δ which concludes the proof.

A.2 Proof of Theorems 1 and 2

Theorem 1. Let α 1 and δ > 0. For any T 1, with probability at least 1 δ, the cumulative regret R(T) of RUCB-Delay is upper bounded by

α min( 2 i , 2 j)

α τ 2 1 2 j log T

Proof. We first state a high probability bound for the number of comparisons for each arm.

With probability at least 1 δ, for any t > C(δ) and any pair of arms (i, j) = (1, 1), the following holds:

Nij(t) 4α log t τ 2 1 min( 2 i , 2 j). (7)

Published in Transactions on Machine Learning Research (08/2024)

Moreover, let N δ ij(t) denote the number of comparisons between arms i and j performed between time C(δ) and t. Then,

N δ ij(t) 4α log t τ 2 1 min( 2 i , 2 j) (8)

holds as well. The proof of this statement closely follows the analysis presented in Proposition 2 of Zoghi et al. (2014a), with a modification to Equation (3) of Zoghi et al. (2014a) as follows:

Uij(s) Lij(s) = 2

αNij(s) log s

N 2 ij(s) 2

α log s τ 2 1 Nij(s) 2

α log t τ 2 1 Nij(t) min{ i, j}.

Now, let ˆTδ be the smallest time that such that

+ D log ˆTδ,

where D := 1 τ 2 1 P

i<j 4α min( 2 i , 2 j) and the existence of such ˆTδ is guaranteed. Then, according to the above

statement and Lemma 3, we can assert with a probability of 1 δ

, i, j, µij [Lij(t), Uij(t)],

(i, j) = (1, 1), N δ/2 ij ( ˆTδ) 4α log ˆTδ τ 2 1 min( 2 i , 2 j).

In this case, when t > C δ

2 and i > 1, arm i cannot be compared against itself. This is due to the fact that uii(t) = 1

2 < µ1i u1i(t) which holds by Lemma 3. Additionally, we have ˆTδ C δ

i<j N δ/2 ij ( ˆTδ) ˆTδ C δ

4α log ˆTδ min( 2 i , 2 j) > 0

by the definition of ˆTδ. Therefore, this implies that there exists a time Tδ C δ

2 , ˆTδ i when arm 1 was

played against itself. At that time Tδ, we had uj1(Tδ) < 1

2 for all j > 1, indicating that B = {1}.

With the above insights, we follow the proof strategy outlined in Theorem 4 of Zoghi et al. (2014a). Equation 7 in Zoghi et al. (2014a) requires modification, taking the form:

τ 2 1 2 j =: ˆN1(T).

Furthermore, we can upper bound the cumulative regret similar to Equation 8 in Zoghi et al. (2014a) as follows:

R(T) Tδ max +

4α 1j log T

τ 2 1 2 j +

l=1 nl max. (9)

While the notations for C and D have undergone modifications compared to Zoghi et al. (2014a), we may verify that all subsequent statements remain valid, and we are able to upper bound each term in Equation 9 as follows:

R(T) Tδ max +

+ 2D log 2D max +

τ 2 1 j + 4 max log 2

8α τ 2 1 2 j max log T

which concludes the proof.

Published in Transactions on Machine Learning Research (08/2024)

Theorem 2. For α 1, the expected regret E[R(T))] of RUCB-Delay is upper bounded by

α min( 2 i , 2 j)

α τ 2 1 2 j log T

HT (1 δ) := 2C δ

+ 2D log 2D max +

τ 2 1 j + 4 max log 2

8α τ 2 1 2 j max log T.

For fixed T, we view R(T) as a random variable. By the relationship between the expected value of a random variable and its cumulative distribution function, we obtain:

E[R(T)] = Z 1

0 F 1 R(T )(q)dq < Z 1

0 HT (q)dq.

Then, for integration of HT , we only need to foucs on terms that depend on δ:

E[R(T)] < Z 1

+ 2D log 2D max +

τ 2 1 j + 4 max log 2 1 q +

8α τ 2 1 2 j max log Tdq

= 2D log 2D max +

8α τ 2 1 2 j max log T + max

+ 4 log 2 1 q

Moreover, the above integral can be computed as: Z 1

+ 4 log 2 1 q

dq = 2(4α 1)(M + 1)K(K 1)

1 2α 1 2α 1

α 1 + 4(log 2 + 1)

< 2(4α 1)(M + 1)K(K 1)

1 2α 1 2α 1

Therefore, the expected regret can be upper bounded by

8 + 2(4α 1)(M + 1)K(K 1)

1 2α 1 2α 1

+2D log 2D max +

2α( j + 4 max)

τ 2 1 2 j log T.

Remark 4. In studies focusing on delayed feedback (Vernade et al., 2017; 2020), certain analyses include τm in the regret bound instead of τ1. We note that in Theorem 1 and 2, τ1 can be replaced with τM M+1. This substitution is justified by the following inequality. If Nij(t M) 1, then

Nij(t) Nij(t) M + 1 + M M + 1

Nij(t) Nij(t M)

M + 1 τM + M M + 1τM

τM M + 1Nij(t).

The condition Nij(t M) 1 is satisfied when all pairs of arms are initially selected before the beginning of the algorithm. Therefore, in further analysis, τ1 can be substituted with τM M+1. However, for simplicity, we maintain the use of τ1 in subsequent analysis.

Published in Transactions on Machine Learning Research (08/2024)

B Regret Analysis of MRR-DB-Delay

B.1 Proof of Lemma 4

Lemma 4. For any round m 1 and any pair of active arms (i, j) Am, with probability at least 1 2 T γ2m ,

µij Yij(m) γm

Proof. Let tm denote the time step when round m concludes, and define Ps = Xs I(Ds > tm s)I(s Tij(m)). Additionally, denote Gt as the σ-algebra generated by actions, outcomes, delays, and observations up to time step t. We will start by proving two lemmas.

Lemma 5. With probability greater than 1 1 T γ2m ,

s Tij(m) (µij Xs)

nm log(Tγ2m)

Proof. Applying Hoeffding s inequality and Lemma 19 from Pike-Burke et al. (2018), we obtain the inequality:

s Tij(m) (µij Xs) λ

Then, choosing λ = q

nm log(T γ2m)

2 completes the proof.

Lemma 6. Let Qt = Pt s=1 (Ps E(Ps|Gs 1)). With probability at least 1 1 T γ2m ,

3 log(Tγ2 m) + p

2E[D] log(Tγ2m)

Proof. {Qt} t=0 is a martingale with respect to the filtration {Gt} t=0 with increments Qt Qt 1 = Pt E(Pt|Gt 1). Also, we have

s=1 E Qs 2|Gs 1 =

s=1 Var(Ps|Gs 1)

s=1 E(P 2 s |Gs 1)

s=1 P(Ds > tm s) E[D].

Then by Freedman s version of Bernstein s inequality (Theorem 1.6 of Freedman (1975)), we obtain the following inequality:

P(Qtm λ) exp λ2/2 E[D] + λ/3

Finally, pick λ = log(T γ2 m) 3 + q

log2(T γ2m)

9 + 2E[D] log(Tγ2m) and since

log(Tγ2 m) 3 +

9 + 2E[D] log(Tγ2m) 2

3 log(Tγ2 m) + p

2E[D] log(Tγ2m),

this concludes the proof.

Published in Transactions on Machine Learning Research (08/2024)

Now, we employ the following decomposition: X

s Tij(m) (µij Ys,tm) = X

s Tij(m) (µij Xs) + (Xs Ys,tm)

s Tij(m) (µij Xs) + Qtm +

s=1 E(Ps|Gs 1)

Both the first and second terms can be bounded with high probability using Lemma 5 and 6, respectively. The last term can be bounded as follows:

s=1 E(Ps|Gs 1) µij

s=1 P(Ds > tm s) E[D]

Therefore, with a probability greater than 1 2 T γ2m , the following holds:

µij Yij(m) = 1 nm

s Tij(m) (µij Ys,tm) 2 3nm log(Tγ2 m) +

log(Tγ2m) + E[D]

Defining nm as

3γm log(Tγ2m) + 2γm p

2E[D] log(Tγ2m) + 2γm E[D]

ensures that the right-hand side of Equation 10 is less than or equal to γm

2 , concluding the proof.

B.2 Proof of Theorem 3

Theorem 3. The expected regret E[R(T))] of Algorithm 2 is upper bounded by

i=2 O K log(T 2 i ) i + K q

log(T 2 i )E[D] + KE[D]

Proof. We follow the proof structure presented in the proof of Theorem 2 of Pike-Burke et al. (2018). For any arm i, let Mi denote the round when arm i is eliminated. Additionally, for any arm i, let s define

mi := min m 1 : γm < 2

Then by the definition of mi, we have i

Furthermore, we introduce Nij as the number of comparisons between arms i and j up to time T, and Ni as the total number of times arm i has been played by time T:

Nij := Nij(T) =

t=1 I((ut, vt) = (i, j)) + I((ut, vt) = (j, i))

t=1 I(ut = i) + I(vt = i)

Published in Transactions on Machine Learning Research (08/2024)

Also, for any arm i, let Ri(T) := Ni i

2 . Then the cumulative regret can be expressed as

i<j Nij i + j

Therefore, we decompose the expected regret as follows:

E[R(T)] = E

i=1 Ri(T)I(M1 mi)

i=1 Ri(T)I(M1 < mi)

To bound each component, we prove the following two lemmas.

Lemma 7. For any arm i = 1,

P(Mi > mi and M1 mi) 2 Tγ2mi .

Proof. Define an event A = n Yi1(mi) µi1 + γmi

which occurs with high probability, specifically at least 1 2 T γ2mi by Lemma 4. If the event A happens,

Yi1(mi) µi1 + γmi

Therefore, if M1 mi, we have Mi mi. This implies that if Mi > mi and M1 mi, the event A does not occur. Hence, we obtain

P(Mi > mi and M1 mi) P(Ac {i Ami}) 2 Tγ2mi .

Lemma 8. Let the event Fi(m) be the event that the optimal arm 1 is eliminated by arm i in round m. Then, for any arm i,

P(Fi(m)) 2 Tγ2m .

Proof. By the condition that an arm is eliminated,

P(Fi(m)) = P 1, i Am and Y1i(m) + γm < 1

Define an event A = n Y1i(m) > µ1i γm

If A occurs, Y1i(m) > µ1i γm

which implies arm 1 will not be removed by arm i in round m. Therefore, by Lemma 4, we obtain

P(Fi(m)) P(Ac {i Am}) 2 Tγ2m .

Published in Transactions on Machine Learning Research (08/2024)

Bounding Term (a). By Lemma 7, we can bound the first term as follows:

i=1 Ri(T)I(M1 mi)

i=1 E [Ri(T)I(Mi mi)] +

2 I(Mi > mi, M1 mi)

2 nmi K + T

Bounding Term (b). Let mmax = maxi =1 mi. By Lemma 7 and 8,

i=1 Ri(T)I(M1 < mi)

I(M1 = m) X

m=1 E I(M1 = m)T max i:mi>m i

m=1 3P(M1 = m)Tγm =

i=1 3P(Fi(m))Tγm

i:mi<m 3P(M1 mi, Mi > mi)Tγm + X

i:mi m 3P(Fi(m))Tγm

6 Tγ2mi T γmi 2m mi + X

6 γmi 2mi m +

where the last two inequalities hold from 2mi = 1 γmi 3 i .

Therefore, we have provided upper bound for terms (a) and (b), and the expected regret is overall upper bounded by:

Published in Transactions on Machine Learning Research (08/2024)

Lastly, our choice of nmi can be bounded as follows:

3γmi log(Tγ2mi) + 2γmi q

2E[D] log(Tγ2mi) + 2γmi E[D]

2 log(Tγ2 mi) + 8

3γmi log(Tγ2 mi) + 4γmi q

2E[D] log(Tγ2mi) + 4γmi E[D]

1 + 2 log(4T 2 i /9) γ2mi + 8 log(4T 2 i /9) 3γmi + 4 p

2E[D] log(4T 2 i /9) γmi + 4E[D]

1 + 18 log(4T 2 i /9) 2 i + 8 log(4T 2 i /9) i + 12 p

2E[D] log(4T 2 i /9) i + 12E[D]

Thus, the total expected regret is bounded by

9K log(4T 2 i /9) i + 4K log(4T 2 i /9) + 6K q

2E[D] log(4T 2 i /9) + 81

i + 6K E[D] + 1

C MRR-DB-Delay on delayed, aggregated, and anonymous feedback

In this section, we illustrate the regret guarantee of MRR-DB-Delay when handling delayed, aggregated, and anonymous feedback in the dueling bandit problem. Here we also assume a known and bounded delay as in Section 4. For a comprehensive description of the delayed, aggregated, and anonymous feedback setting, please refer to Pike-Burke et al. (2018). We conduct a regret analysis with the following choice of nm:

2 log(Tγ2m) +

2 log(Tγ2m) + 8

3γm log(Tγ2m) + 6γmm E[τ]

Lemma 9. Consider the delayed, aggregated, and anonymous feedback setting and the choice of nm given by equation 11. For any round m 1 and any pair of active arms (i, j) Am, with probability at least 1 3 T γ2m ,

µij Yij(m) γm

Proof. We omit the proof here since the analysis of Lemma 1 in Pike-Burke et al. (2018) can be similarly applied here.

Theorem 10. Under the delayed, aggregated, and anonymous feedback setting and the choice of nm given by equation 11, the expected regret E[R(T))] of Algorithm 2 is upper bounded by

i=2 O K log(T 2 i ) i + K log 1

Proof. We follow the organization in the proof of Theorem 3 in Appendix B.2. Similarly, for any arm i, let Mi be the round when arm i is removed and define mi := min m 1 : γm < 2

3 i . Then we know i

3 γmi < 2 3 i. Additionally, we use the same notation of Nij, Ni, R(T), and Ri(T) as employed in the proof of Theorem 3 in Appendix B.2. We once again break down the expected regret and bound each

Published in Transactions on Machine Learning Research (08/2024)

element using Lemma 11 and 12.

E[R(T)] = E

i=1 Ri(T)I(M1 mi)

i=1 Ri(T)I(M1 < mi)

Lemma 11. For any arm i = 1,

P(Mi > mi and M1 mi) 3 Tγ2mi .

Proof. The proof is identical to that of Lemma 7, with the only difference being the utilization of the upper bound probability from Lemma 9.

Lemma 12. Let the event Fi(m) be the event that the optimal arm 1 is eliminated by arm i in round m. Then, for any arm i,

P(Fi(m)) 3 Tγ2m .

Proof. The proof is identical to that of Lemma 8, differing only in the upper bound probability from Lemma 9.

Bounding Term (a). By Lemma 11, we have

i=1 Ri(T)I(M1 mi)

i=1 E [Ri(T)I(Mi mi)] +

2 I(Mi > mi, M1 mi)

2 nmi K + T

2 nmi K + 3

Published in Transactions on Machine Learning Research (08/2024)

Bounding Term (b). Let mmax = maxi =1 mi. Using Lemma 11 and 12, we obtain

i=1 Ri(T)I(M1 < mi)

I(M1 = m) X

m=1 E I(M1 = m)T max i:mi>m i

m=1 3P(M1 = m)Tγm

i:mi<m 3P(M1 mi, Mi > mi)Tγm + X

i:mi m 3P(Fi(m))Tγm

9 Tγ2mi T γmi 2m mi + X

9 γmi 2mi m +

where we also employed the relation 2mi = 1 γmi 3 i .

Consequently, by the upper bound proved for terms (a) and (b), the expected regret can be upper bounded by:

2 1 i + inmi K

and since nmi defined in equation 11 can be bounded as

2 log(Tγ2mi) +

2 log(Tγ2mi) + 8

3γmi log(Tγ2mi) + 6γmimi E[τ]

8 log(Tγ2 mi) + 16

3 γmi log(Tγ2 mi) + 12γmimi E[τ]

1 + 8 log(4T 2 i /9) γ2mi + 16 log(4T 2 i /9) 3γmi + 12 log2(3/ i)E[τ]

1 + 72 log(4T 2 i /9) 2 i + 16 log(4T 2 i /9) i + 72 log(3/ i)E[τ]

we conclude that the total expected regret is bounded by

36K log(4T 2 i /9) i + 8K log(4T 2 i /9) + 36K log(3/ i)E[τ] + 243

Published in Transactions on Machine Learning Research (08/2024)

D Lower Bound on the Regret

Proposition 13. For any dueling bandit algorithm and T, there exists a K-armed dueling bandit such that R(T) c p

TK/τM, where c > 0 is a constant.

Proof. The proof of this proposition closely follows the proof of Theorem 3 in (Vernade et al., 2020). Firstly, let s define a parameter µ1 RK K. For all i > 1, we set µ1 1i = 1

2 + and µ1 i1 = 1

2 , where 0 1

8. Additionally, for all i = 1 and j = 1, we define µ1 ij = 1

2. Now, let k = arg mini>1 Eµ1[Ni(T)]. Since we choose two arms at each time step, using the pigeonhole principle, we find that Eµ1[Nk(T)] 2T K 1. Next, we introduce another parameter µ2 RK K. We define µ2 k1 = 1

2 + and µ2 ki = 1

2 + 2 for all i > 2, while for all i = k and j = k, µ2 ij = µ1 ij.

Then, by the definition of regret and µ1, it follows that

2 Eµ1[N1(T)]

2 Eµ1[N1(T)I(N1(T) T) + N1(T)I(N1(T) > T)]

2 Pµ1(N1(T) T) T Pµ1(N1(T) > T)

2 Pµ1(N1(T) T).

Similarly, we have

2 Eµ2[N1(T)] T

2 Pµ2(N1(T) > T).

Then, we use Bretagnolle-Huber inequality and obtain Rµ1(T) + Rµ2(T) T

4 exp( KL(Pµ1, Pµ2)). Also, the relative entropy between Pµ1 and Pµ2 can be expressed as follows:

KL(Pµ1, Pµ2) =

i=2 Eµ1[Nki(T)]d τm

+ Eµ1[Nk1(T)]d τm

i=2 Eµ1[Nik(T)]d τm

+ Eµ1[N1k(T)]d τm

Eµ1[N1(T)] 32τm 2 64T K 1τm 2,

where d(p, q) = p log( p

q ) + (1 p) log( 1 p

1 q ) and the inequality holds by the relation between the relative entropy and χ2 distance. Therefore,

Rµ1(T) + Rµ2(T) T

Finally,by setting = q

K 1 128T τm , we achieve

Rµ1(T) + Rµ2(T) c p

which completes the proof.