# Batched Dueling Bandits

Arpit Agarwal 1, Rohan Ghuge 2, Viswanath Nagarajan 2

1 Data Science Institute, Columbia University, New York, USA. 2 Department of Industrial and Operations Engineering, University of Michigan, Ann Arbor, USA. Research supported in part by NSF grants CMMI-1940766 and CCF-2006778. Correspondence to: Rohan Ghuge.

Proceedings of the 39th International Conference on Machine Learning, Baltimore, Maryland, USA, PMLR 162, 2022. Copyright 2022 by the author(s).

The K-armed dueling bandit problem, where the feedback is in the form of noisy pairwise comparisons, has been widely studied. Previous works have focused only on the sequential setting, where the policy adapts after every comparison. However, in many applications such as search ranking and recommendation systems, it is preferable to perform comparisons in a limited number of parallel batches. We study the batched K-armed dueling bandit problem under two standard settings: (i) existence of a Condorcet winner, and (ii) strong stochastic transitivity and stochastic triangle inequality. For both settings, we obtain algorithms with a smooth trade-off between the number of batches and regret. Our regret bounds match the best known sequential regret bounds (up to poly-logarithmic factors), using only a logarithmic number of batches. We complement our regret analysis with a nearly matching lower bound. Finally, we also validate our theoretical results via experiments on synthetic and real data.

1. Introduction

The K-armed dueling bandits problem has been widely studied in machine learning due to its applications in search ranking, recommendation systems, sports ranking, etc. (Yue & Joachims, 2011; Yue et al., 2012; Urvoy et al., 2013; Ailon et al., 2014; Zoghi et al., 2014; 2015a;b; Dudik et al., 2015; Jamieson et al., 2015; Komiyama et al., 2015; 2016; Ramamohan et al., 2016; Chen & Frazier, 2017). It is a variation of the traditional stochastic bandit problem in which feedback is obtained in the form of pairwise preferences. This problem falls under the umbrella of preference learning (Wirth et al., 2017), where the goal is to learn from relative feedback (in our case: given two alternatives, which of the two is preferred). Designing learning algorithms for such relative feedback is crucial in domains where qualitative feedback is easily obtained, but real-valued feedback would be arbitrary or not interpretable. We illustrate this using the web-search ranking application.

Web-search ranking is an example of a complex information retrieval system, where the goal is to provide a (usually ranked) list of candidate documents to the user of the system in response to a query (Radlinski et al., 2008; Joachims, 2002; Yue & Joachims, 2009; Hofmann et al., 2013). Modern-day search engines comprise hundreds of parameters which are used to output a ranked list in response to a query. However, manually tuning these parameters can be infeasible, and online learning frameworks (based on user feedback) have been invaluable in automatically tuning these parameters (Liu, 2009). These methods do not affect the user experience, enable the system to continuously learn about user preferences, and thus continuously adapt to user behavior. For example, given two rankings ℓ1 and ℓ2, they can be interleaved and presented to the user in such a way that clicks indicate which of the two rankings is more preferable to the user (Radlinski et al., 2008).
The availability of such pairwise comparison data motivates the study of learning algorithms that exploit relative feedback. Previous learning algorithms have focused on a fully adaptive setting; in the web-ranking application this corresponds to the learning algorithm updating its parameters after each query. Such updates might be impractical in large systems. For example, if the parameters are fine-tuned for each user and a user makes multiple queries in a short time, such continuous updates require a lot of computational power. Even if users are assigned to a small number of classes (and parameters are fine-tuned for each user class), multiple users from the same class may simultaneously query the system, making it impractical to adapt after each interaction. Motivated by this, we introduce the batched K-armed dueling bandits problem (or, batched dueling bandits), where the learning algorithm is only allowed to adapt a limited number of times. Specifically, the algorithm uses at most B adaptive rounds and in each round it commits to a fixed batch of pairwise comparisons. The feedback for a batch is received simultaneously, and the algorithm chooses the next batch based on this (and previous) feedback.

1.1. Contributions

We design three algorithms, namely PCOMP, SCOMP, and SCOMP2, for batched dueling bandits under a finite time horizon T. We analyze the regret of PCOMP under the Condorcet assumption, and that of SCOMP and SCOMP2 under the strong stochastic transitivity (SST) and stochastic triangle inequality (STI) assumptions. In all cases, we obtain a smooth trade-off between the expected regret and the number of batches, B. Specifically, in log(T)+1 batches, SCOMP has expected regret nearly matching the instance-dependent regret bound due to Yue et al. (2012), up to a √K factor (K is the number of arms). Furthermore, in O(log(T)) batches, SCOMP2 achieves a worst-case regret matching the best known result in the sequential setting (Yue & Joachims, 2011) up to a logarithmic factor. To complement our upper bound results, we provide a lower bound showing that a T^{1/B} factor in the expected regret is necessary, where B is the number of batches. Finally, we run computational experiments to validate our theoretical results.

1.2. Preliminaries

The K-armed dueling bandits problem (Yue et al., 2012) is an online optimization problem, where the goal is to find the best among K bandits B = {b_1, . . . , b_K} using noisy pairwise comparisons with low regret. In the traditional multi-armed bandit problem (Auer et al., 2002), an arm (or, equivalently, bandit) b_j can be pulled at each time step t, which generates a random reward from an unknown stationary distribution with expected value µ_j. In the K-armed dueling bandits problem, by contrast, each iteration comprises a noisy comparison between two bandits (possibly the same), say (b_i, b_j). The outcome of the comparison is an independent random variable, and the probability of picking b_i over b_j is a constant denoted P_{i,j} = 1/2 + ϵ_{i,j}, where ϵ_{i,j} ∈ (−1/2, 1/2). Here ϵ_{i,j} can be thought of as a measure of distinguishability between the two bandits, and we write b_i ≻ b_j when ϵ_{i,j} > 0. We also refer to ϵ_{i,j} as the gap between b_i and b_j. Throughout the paper, we let b_1 denote the best bandit. To further simplify notation, we define ϵ_j = ϵ_{1,j}, i.e., the gap between b_1 and b_j. We define the regret per time step as follows: if bandits b_{t1} and b_{t2} are chosen in iteration t, then the regret is r(t) = (ϵ_{t1} + ϵ_{t2})/2.
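To make this feedback model concrete, the following minimal Python sketch (not from the paper) simulates a single noisy comparison and its per-step regret for an illustrative preference matrix P; the matrix values and arm indices are assumptions chosen only for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative preference matrix for K = 3 arms: P[i, j] = 1/2 + eps_{i,j}
# is the probability that arm i beats arm j.  Arm 0 is the best arm here.
P = np.array([
    [0.5, 0.6, 0.7],
    [0.4, 0.5, 0.6],
    [0.3, 0.4, 0.5],
])
eps = P[0, :] - 0.5          # gaps eps_j = eps_{1,j} with respect to the best arm

def duel(i, j):
    """Return 1 if arm i wins the noisy comparison against arm j, else 0."""
    return int(rng.random() < P[i, j])

def step_regret(i, j):
    """Per-step regret r(t) = (eps_{t1} + eps_{t2}) / 2 for playing the pair (i, j)."""
    return (eps[i] + eps[j]) / 2

outcome = duel(1, 2)
print(outcome, round(step_regret(1, 2), 3))   # e.g. 1 0.15
```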
The cumulative regret up to time T is R(T) = Σ_{t=1}^{T} r(t), where T is the time horizon; it is assumed that K ≤ T. The cumulative regret can be equivalently stated as R(T) = (1/2) Σ_{j=1}^{K} T_j ϵ_j, where T_j denotes the number of comparisons involving b_j. We define ϵ_min = min_{j: ϵ_j>0} ϵ_j to be the smallest non-zero gap of any bandit with b_1. We say that bandit b_i is a Condorcet winner if, and only if, P_{i,j} ≥ 1/2 for all b_j ∈ B \ {b_i}. Furthermore, we say that the probabilistic comparisons exhibit strong stochastic transitivity (SST) if there exists a total ordering, denoted by ≻, over arms such that for every triple b_i ≻ b_j ≻ b_k we have ϵ_{i,k} ≥ max{ϵ_{i,j}, ϵ_{j,k}}, and exhibit stochastic triangle inequality (STI) if for every triple b_i ≻ b_j ≻ b_k we have ϵ_{i,k} ≤ ϵ_{i,j} + ϵ_{j,k}.

1.3. Batch Policies

In traditional bandit settings, actions are performed sequentially, utilizing the results of all prior actions in determining the next action. In the batched setting, the algorithm must commit to a round (or batch) of actions to be performed in parallel, and can only observe the results after all actions in the batch have been performed. More formally, in round r = 1, 2, . . ., the algorithm must decide the comparisons to be performed; afterwards, all outcomes of the comparisons in batch r are received. The algorithm can then, adaptively, select the next batch of comparisons. However, it can use at most a given number, B, of batches. The batch sizes can be chosen non-adaptively (fixed upfront) or adaptively; in an adaptive policy the batch sizes may even depend on previous observations of the algorithm. An adaptive policy is more powerful than a non-adaptive policy, and may suffer a smaller regret. In this paper, we focus on such adaptive policies. Furthermore, note that the total number of comparisons (across all batches) must sum to T. We assume that the values of T and B are known. Observe that when T = B, we recover the fully sequential setting.

1.4. Results and Techniques

We provide a summary of our results in Table 1.

Table 1: A summary of our results.

| Setting | Fully adaptive regret (prior work) | Our regret (in B rounds) | Our rounds | Our lower bound (for B rounds) |
| --- | --- | --- | --- | --- |
| Condorcet | O(K log(T)/ϵ_min) | O(K² T^{1/B} log(T)/ϵ_min) | B | Ω(K T^{1/B}/(B² ϵ_min)) |
| SST + STI | O(K log(T)/ϵ_min) | O(K B T^{1/B} log(T)/ϵ_min) | 2B + 1 | Ω(K T^{1/B}/(B² ϵ_min)) |

Our first result is as follows.

Theorem 1.1. For any integer B > 1, there is an algorithm for batched dueling bandits that uses at most B rounds and, if the instance admits a Condorcet winner, has expected regret bounded by
E[R(T)] ≤ 3K T^{1/B} log(6TK²B) · Σ_{j: ϵ_j>0} (1/ϵ_j).

The above bound is an instance-dependent bound. To obtain an instance-independent bound, recall that ϵ_min = min_{j: ϵ_j>0} ϵ_j. We get that the expected worst-case regret is bounded by E[R(T)] ≤ 3K² T^{1/B} log(6TK²B)/ϵ_min. In the sequential setting, existing algorithms achieve a worst-case expected regret of O(K log(T)/ϵ_min) (Zoghi et al., 2014; Komiyama et al., 2015). When B = log(T), we have T^{1/B} = O(1), so our worst-case regret is at most O(K² log(6TK²B)/ϵ_min) = O(K² log(T)/ϵ_min), which nearly matches the best-known bound in the sequential setting. Our algorithm in Theorem 1.1 proceeds by performing all pairwise comparisons in an active set of bandits, and gradually eliminating sub-optimal bandits. This algorithm is straightforward, and its analysis follows that of (Esfandiari et al., 2021a) for batched stochastic multi-armed bandits. Although this is a simple result, it is an important step for our main results, described next. Our main results are for instances that satisfy the SST and STI conditions.
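As a concrete reference for these two conditions, the following sketch checks SST and STI for a given preference matrix whose rows and columns are assumed to be indexed in order of decreasing arm strength; it is an illustration of the definitions, not code from the paper.

```python
import numpy as np

def satisfies_sst_sti(P, tol=1e-12):
    """Check SST and STI for preference matrix P, assuming the total order
    b_0 > b_1 > ... > b_{K-1} (index = rank).  Here eps[i, j] = P[i, j] - 1/2."""
    K = P.shape[0]
    eps = P - 0.5
    sst, sti = True, True
    for i in range(K):
        for j in range(i + 1, K):
            for k in range(j + 1, K):
                # SST: eps_{i,k} >= max(eps_{i,j}, eps_{j,k})
                if eps[i, k] < max(eps[i, j], eps[j, k]) - tol:
                    sst = False
                # STI: eps_{i,k} <= eps_{i,j} + eps_{j,k}
                if eps[i, k] > eps[i, j] + eps[j, k] + tol:
                    sti = False
    return sst, sti

# The 3-arm matrix from the earlier sketch satisfies both conditions.
P = np.array([[0.5, 0.6, 0.7],
              [0.4, 0.5, 0.6],
              [0.3, 0.4, 0.5]])
print(satisfies_sst_sti(P))   # (True, True)
```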
These conditions impose a structure on the pairwise preference probabilities, and we are able to exploit this additional structure to obtain improved bounds.

Theorem 1.2. For any integer B > 1, there is an algorithm for batched dueling bandits that uses at most B + 1 rounds and, if the instance satisfies the SST and STI assumptions, has expected regret bounded by
E[R(T)] = Σ_{j: ϵ_j>0} O( √K T^{1/B} log(T) / ϵ_j ).

The idea behind this algorithm is to first sample a sufficiently small seed set, and then to perform all pairwise comparisons between the seed set and the active set to eliminate sub-optimal arms. The point is to exploit the structure of the pairwise probabilities so that we do not need to perform all pairwise comparisons. Additionally, if the seed set is found to be sub-optimal, we can construct a much smaller active set, which allows us to switch to the all-pairs comparison policy. In the sequential setting, (Yue et al., 2012) obtain instance-dependent regret bounded by Σ_{j: ϵ_j>0} O(log(T)/ϵ_j). Our result nearly matches this sequential bound (with an extra multiplicative factor of √K) when B = log(T). Observe that the worst-case regret of (Yue & Joachims, 2011) in the sequential setting is bounded by O(K log(T)/ϵ_min), whereas here we obtain E[R(T)] ≤ O(K^{3/2} T^{1/B} log(T)/ϵ_min).

Next, we improve the worst-case regret by reducing the comparisons performed as follows. We first perform pairwise comparisons amongst bandits in the seed set, and pick a candidate bandit. This candidate bandit is used to eliminate sub-optimal arms from the active set. Although selecting a candidate bandit each time requires additional adaptivity, we get a better bound on the worst-case expected regret by exploiting the fact that there can be at most B candidate bandits.

Theorem 1.3. For any integer B > 1, there is an algorithm for batched dueling bandits that uses at most 2B + 1 rounds and, if the instance satisfies the SST and STI assumptions, has expected worst-case regret bounded by
E[R(T)] = O( K B T^{1/B} log(T) / ϵ_min ).

Thus, in B = log(T) rounds, our expected worst-case regret is bounded by E[R(T)] ≤ O(K log²(T)/ϵ_min), matching the best known result in the sequential setting up to an additional logarithmic factor.

Finally, we complement our upper bound results with a lower bound for the batched K-armed dueling bandits problem, which holds even under the SST and STI assumptions.

Theorem 1.4. Given an integer B > 1 and any algorithm that uses at most B batches, there exists an instance of the K-armed batched dueling bandit problem that satisfies the SST and STI conditions such that the expected regret is
E[R(T)] = Ω( K T^{1/B} / (B² ϵ_min) ).

The above lower bound shows that the T^{1/B} dependence in our upper bounds is necessary. Note that the lower bound also applies to the more general Condorcet winner setting. The proof is similar to the lower bound proof in (Gao et al., 2019) for batched multi-armed bandits. The main novelty in our proof is the design of a family of hard instances with different values of ϵ_min that satisfy the SST and STI conditions. We defer further discussion and the proof of Theorem 1.4 to Appendix C.

2. Related Work

The dueling bandits problem has been widely studied in recent years; we mention the most relevant works here and refer the reader to (Sui et al., 2018) for a more comprehensive survey. This problem was first studied by (Yue et al., 2012) under the SST and STI setting. The authors gave a worst-case regret upper bound of Õ(K log T/ϵ_min) and provided a matching lower bound.
(Yue & Joachims, 2011) considered a slightly more general version of the SST and STI setting and achieved an instance-wise optimal regret upper bound of P j:ϵj>0 O log(T ) . (Urvoy et al., 2013) studied this problem under the Condorcet winner setting and proved a O(K2 log T/ϵmin) regret upper bound, which was improved by (Zoghi et al., 2014) to O(K2/ϵ2 min) + P j:ϵj>0 O(log T/ϵ2 j). (Komiyama et al., 2015) achieved a similar but tighter KL divergence-based bound, which is shown to be asymptotically instance-wise optimal (even in terms constant factors). There are also other works that improve the dependence on K in the upper bound, but suffer a worse dependence on ϵjs (Zoghi et al., 2015b). This problem has also been studied under other noise models such as utility based models (Ailon et al., 2014) and other notions of regret (Chen & Frazier, 2017). Alternate notions of winners such as Borda winner (Jamieson et al., 2015), Copeland winner (Zoghi et al., 2015a; Komiyama et al., 2016; Wu & Liu, 2016), and von Nuemann winner (Dudik et al., 2015) have also been considered. There are also several works on extensions of dueling bandits that allow multiple arms to be compared at once (Sui et al., 2017; Agarwal et al., 2020; Saha & Gopalan, 2019). All of the aforementioned works on the dueling bandits problem are limited to the sequential setting. To the best of our knowledge, ours is the first work that considers the batched setting for dueling bandits. However, batched processing for the stochastic multi-armed bandit problem has been investigated in the past few years. A special case when there are two bandits was studied by (Perchet et al., 2016). They obtain a worst-case regret bound of O T log(T ) 1/B log(T ) . (Gao et al., 2019) studied the general problem and obtained a worst-case regret bound of O K log(K)T 1/B log(T ) , which was later improved by (Es- fandiari et al., 2021a) to O KT 1/B log(T ) . Furthermore, (Esfandiari et al., 2021a) obtained an instance-dependent regret bound of P j:ϵj>0 T 1/BO log(T ) . Our results for batched dueling bandits are of a similar flavor; that is, we get a similar dependence on T and B. (Esfandiari et al., 2021a) also give batched algorithms for stochastic linear bandits and adversarial multi-armed bandits. Adaptivity and batch processing has been recently studied for stochastic submodular cover (Golovin & Krause, 2017; Agarwal et al., 2019; Esfandiari et al., 2021b; Ghuge et al., 2021), and for various stochastic maximization problems such as knapsack (Dean et al., 2008; Bhalgat et al., 2011), matching (Bansal et al., 2012; Behnezhad et al., 2020), probing (Gupta & Nagarajan, 2013) and orienteering (Guha & Munagala, 2009; Gupta et al., 2015; Bansal & Nagarajan, 2015). Recently, there have also been several results examining the role of adaptivity in (deterministic) submodular optimization; e.g. (Balkanski & Singer, 2018a; Balkanski et al., 2018; Balkanski & Singer, 2018b; Balkanski et al., 2019; Chekuri & Quanrud, 2019). 3. Algorithms for Batched Dueling Bandits In this section, we present three algorithms, namely PCOMP, SCOMP and SCOMP2, for the K-armed batched dueling bandits problem. Recall that given a set of K bandits (or arms) B = {b1, . . . , b K}, and a positive integer B T, we wish to find a sequence of B batches of noisy comparisons with low regret. Given bandits bi and bj, Pi,j = 1 2 + ϵi,j denotes the probability of bi winning over bj. 
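Before describing the algorithms, it may help to fix the batched interaction protocol itself. The sketch below is a schematic harness under assumed `policy.next_batch` / `policy.update` interfaces; these names, and the toy policy used to exercise the harness, are illustrative and not from the paper.

```python
import numpy as np

def run_batched_policy(policy, P, T, B, rng=None):
    """Schematic batched interaction loop: in each of at most B rounds the policy
    commits to a batch of pairs, all duels in the batch are resolved at once, and
    only then does the policy see the outcomes.  Budget accounting is coarse."""
    rng = rng or np.random.default_rng(0)
    eps = P[0, :] - 0.5              # gaps w.r.t. the best arm (assumed to be arm 0)
    used, regret = 0, 0.0
    for r in range(1, B + 1):
        batch = policy.next_batch(r, remaining=T - used)   # list of (i, j) pairs
        if not batch or used >= T:
            break
        outcomes = [(i, j, int(rng.random() < P[i, j])) for (i, j) in batch]
        used += len(batch)
        regret += sum((eps[i] + eps[j]) / 2 for (i, j) in batch)
        policy.update(outcomes)      # feedback arrives only after the whole batch
    return regret, used

class UniformOncePolicy:
    """Toy non-adaptive policy: a single batch comparing every ordered pair once."""
    def __init__(self, K):
        self.K = K
    def next_batch(self, r, remaining):
        if r > 1:
            return []
        return [(i, j) for i in range(self.K) for j in range(self.K) if i != j][:remaining]
    def update(self, outcomes):
        pass

P = np.array([[0.5, 0.6, 0.7],
              [0.4, 0.5, 0.6],
              [0.3, 0.4, 0.5]])
print(run_batched_policy(UniformOncePolicy(3), P, T=100, B=4))
```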
The first algorithm, termed PCOMP, proceeds by performing all-pairs comparisons amongst bandits in an active set, and gradually eliminating sub-optimal bandits. The other two algorithms, termed SCOMP and SCOMP2, first select a (sufficiently small) seed set S ⊆ B, and eliminate bandits in an active set by successively comparing them to (all or a few) bandits in S. If the seed set S is itself found to be sub-optimal in a subsequent round, then these algorithms call the all-pairs algorithm PCOMP over the remaining active arms.

Before describing our algorithms in detail, we set up some basic notation. We denote by A the set of active arms, i.e., arms that have not been eliminated. We use index r for rounds or batches. At the end of each round r, our algorithms compute a fresh estimate of the pairwise probabilities based on the feedback from comparisons in round r as:

P̂_{i,j} = (# of wins of b_i against b_j in round r) / (# of comparisons of b_i and b_j in round r).   (1)

If a pair (b_i, b_j) is compared in round r, it is compared c_r = q^r times. In round r, the parameter γ_r = √(log(1/δ)/(2c_r)) is used to eliminate bandits from the active set (the specific elimination criterion depends on the algorithm).

Algorithm 1 PCOMP (ALL PAIRS COMPARISONS)
1: Input: Bandits B, time-horizon T, rounds B, comparison parameters q and τ
2: K ← |B|, δ ← 1/(6TK²B), active bandits A ← B, c_r ← q^{r+τ−1}, γ_r ← √(log(1/δ)/(2c_r)), r ← 1
3: while number of comparisons ≤ T do
4:   for all (b_i, b_j) ∈ A², perform c_r comparisons and compute P̂_{i,j} using Eq. (1)
5:   if ∃ b_i, b_j such that P̂_{i,j} > 1/2 + γ_r then
6:     A ← A \ {b_j}
7:   end if
8:   r ← r + 1
9: end while

3.1. All Pairs Comparisons

We first describe the PCOMP algorithm. This algorithm takes as input the set of bandits B, time-horizon T, rounds B, and comparison parameters q and τ. We set the parameters q = T^{1/B} and τ = 1, unless otherwise specified. (We allow general parameters q and τ so that PCOMP can be used in conjunction with other policies.) In round r ∈ [B], this algorithm compares each pair (b_i, b_j) ∈ A² for c_r times. It then computes fresh estimates of the pairwise probabilities P̂_{i,j} for all (b_i, b_j) ∈ A². If, for some bandit b_j, there exists a bandit b_i such that P̂_{i,j} > 1/2 + γ_r, then bandit b_j is eliminated from A. We provide the pseudo-code in Algorithm 1. The following theorem (see Appendix B for proof) describes the regret bound obtained by PCOMP under the Condorcet assumption, and formalizes Theorem 1.1.

Theorem 3.1. Given any set B of K bandits, time-horizon T, rounds B, and parameters q = T^{1/B} and τ = 1, the expected regret of PCOMP for the batched K-armed dueling bandits problem under the Condorcet assumption is at most
E[R(T)] ≤ 3K T^{1/B} log(6TK²B) · Σ_{j: ϵ_j>0} (1/ϵ_j).
Setting ϵ_min := min_{j: ϵ_j>0} ϵ_j, we get E[R(T)] ≤ 3K² T^{1/B} log(6TK²B)/ϵ_min.

3.2. Seeded Comparisons Algorithms

In this section, we present two algorithms for the batched dueling bandits problem, namely SCOMP and SCOMP2. The algorithms work in two phases. In the first phase, the algorithms sample a seed set S by including each bandit from B independently with probability 1/√K. This seed set is used to eliminate bandits from the active set A. Under certain switching criteria, the algorithms enter the second phase, which involves running algorithm PCOMP on some of the remaining bandits. The algorithms differ in how the seed set is used to eliminate active bandits in the first phase. In SCOMP, all pairwise comparisons between S (seed set) and A (active bandits) are performed.
Specifically, in round r, every active bandit is compared with every bandit in S for c_r times. If, for some bandit b_j, there exists a bandit b_i such that P̂_{i,j} > 1/2 + 3γ_r, then bandit b_j is eliminated (from A as well as S); note that the elimination criterion here is stricter than in PCOMP. If, in some round r, there exists a bandit b_j such that b_j eliminates all bandits b_i ∈ S, then the algorithm constructs a set A' = {b_j ∈ A | P̂_{j,i} > 1/2 + γ_r for all b_i ∈ S}, and invokes PCOMP on bandits A' with starting batch r. This marks the beginning of the second phase, which continues until time T. We provide the pseudocode in Algorithm 2.

Algorithm 2 SCOMP (SEEDED COMPARISONS)
1: Input: Bandits B, time-horizon T, rounds B
2: q ← T^{1/B}, δ ← 1/(6TK²B), active bandits A ← B, c_r ← q^r, γ_r ← √(log(1/δ)/(2c_r)), r ← 1
3: S ← add elements from B into S w.p. 1/√K
4: while number of comparisons ≤ T do
5:   for all (b_i, b_j) ∈ S × A, compare b_i and b_j for c_r times and compute P̂_{i,j}
6:   if ∃ b_i ∈ S, b_j ∈ A with P̂_{i,j} > 1/2 + 3γ_r then
7:     A ← A \ {b_j}, S ← S \ {b_j}
8:   end if
9:   if ∃ b_j such that P̂_{j,i} > 1/2 + 3γ_r for all b_i ∈ S then
10:    construct set A' = {b_j ∈ A | P̂_{j,i} > 1/2 + γ_r for all b_i ∈ S}
11:    r' ← r, T' ← # comparisons until round r', break
12:  end if
13:  r ← r + 1
14: end while
15: run PCOMP(A', T − T', q, r')

We obtain the following result, which formalizes Theorem 1.2, when the given instance satisfies SST and STI.

Theorem 3.2. Given any set B of K bandits, time-horizon T, and parameter B, SCOMP uses at most B + 1 batches and has expected regret bounded by
E[R(T)] = Σ_{j: ϵ_j>0} O( √K T^{1/B} log(T) / ϵ_j )
under the strong stochastic transitivity and stochastic triangle inequality assumptions.

Observe that this gives a worst-case regret bound of O(K^{3/2} T^{1/B} log(T)/ϵ_min) for SCOMP under SST and STI. We can improve this by sampling each bandit from B independently into the seed set with probability K^{−2/3}; this gives us the following result.

Theorem 3.3. Given any set B of K bandits, time-horizon T, and parameter B, there is an algorithm that uses at most B + 1 batches and has worst-case regret bounded by
E[R(T)] = O( K^{4/3} T^{1/B} log(T) / ϵ_min )
under the strong stochastic transitivity and stochastic triangle inequality assumptions.

To further improve this worst-case bound, we add more rounds of adaptivity in SCOMP to obtain SCOMP2. Specifically, each round r in the first phase is divided into two rounds of adaptivity. In the first round r(1), pairwise comparisons among the bandits in S are performed, and an undefeated bandit b_{i_r} is selected as a candidate; we say that b_i defeats b_j if P̂_{i,j} > 1/2 + γ_r. In the second round r(2), the candidate b_{i_r} is used to eliminate active bandits: a bandit b_j is eliminated if P̂_{i_r,j} > 1/2 + 5γ_r. The switching criterion in SCOMP2 is different from that of SCOMP. Here, if in some round r there is a bandit b_j such that b_j eliminates b_{i_r}, then the algorithm constructs the set A' = {b_j ∈ A | P̂_{j,i_r} > 1/2 + 3γ_r}, and invokes PCOMP on bandits A' with starting batch r. See Algorithm 3 for a formal description. We show that SCOMP2 obtains an improved worst-case regret bound (at the cost of additional adaptivity) over SCOMP when the given instance satisfies SST and STI, thus proving Theorem 1.3.

Theorem 3.4. Given any set B of K bandits, time-horizon T, and parameter B, SCOMP2 uses at most 2B + 1 batches and has worst-case expected regret bounded by
E[R(T)] = O( K B T^{1/B} log(T) / ϵ_min )
under strong stochastic transitivity and stochastic triangle inequality, where ϵ_min := min_{j: ϵ_j>0} ϵ_j.

The proofs of Theorems 3.2 and 3.4 can be found in Appendix B.
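To complement the pseudocode, the following Python sketch walks through one phase-I round of SCOMP2 as just described (candidate selection within the seed set, then candidate-based elimination). It reflects our reading of the procedure rather than the implementation used in the experiments; the simulation helper `phat` and the function signature are assumptions.

```python
import numpy as np

def scomp2_phase1_round(P, active, seed, c_r, gamma_r, rng):
    """Sketch of one phase-I round of SCOMP2.  Sub-round r(1) picks an undefeated
    candidate within the seed set S; sub-round r(2) compares the candidate with all
    active arms and eliminates arms it beats by margin 5*gamma_r.  Returns
    (active, seed, a_prime); a_prime is non-None when the switch criterion fires."""
    phat = lambda i, j: rng.binomial(c_r, P[i, j]) / c_r   # empirical win rate of i over j

    # sub-round r(1): all pairwise comparisons within the seed set
    p_seed = {(i, j): phat(i, j) for i in seed for j in seed if i != j}
    undefeated = [i for i in seed
                  if all(p_seed[j, i] <= 0.5 + gamma_r for j in seed if j != i)]
    cand = undefeated[0] if undefeated else seed[0]        # candidate arm b_{i_r}

    # sub-round r(2): compare the candidate against every active arm
    p_cand = {j: phat(cand, j) for j in active if j != cand}

    # eliminate arms the candidate clearly beats (from both the active and seed sets)
    active = [j for j in active if j == cand or p_cand[j] <= 0.5 + 5 * gamma_r]
    seed = [i for i in seed if i in active]

    # switch criterion: some surviving arm clearly beats the candidate
    beats_cand = {j: 1.0 - p_cand[j] for j in p_cand if j in active}
    if any(v > 0.5 + 5 * gamma_r for v in beats_cand.values()):
        a_prime = [j for j, v in beats_cand.items() if v > 0.5 + 3 * gamma_r]
        return active, seed, a_prime       # hand a_prime to PCOMP for phase II
    return active, seed, None
```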
Algorithm 3 SCOMP2 (SEEDED COMPARISONS 2) 1: Input: Bandits B, time-horizon T, rounds B 2: q T 1/B, δ 1 6T K2B , active bandits A B, cr qr , γr p log(1/δ)/2cr, r 1 3: S add elements from B into S w.p. 1/ K 4: while number of comparisons T do 5: r(1): compare all pairs in S for cr times; get b Pi,j. 6: candidate bi r any bandit i S with maxj S b Pj,i 1 2 + γr. 7: r(2): for all bj A, compare bi r and bj for cr times and compute b Pi r,j. 8: if bj A, b Pi r,j > 1 2 + 5γr then 9: A A \ {bj}, S S \ {bj} 10: end if 11: if bj such that b Pj,i r > 1 2 + 5γr then 12: construct set A = {bj A | b Pj,i r > 1 2 + 3γr} 13: r r, T # comparisons until round r , break 14: end if 15: r r + 1 16: end while 17: run PCOMP(A , T T , q, r ) 4. Regret Analysis We present a sketch of the regret analysis for the algorithms described in 3 in this section. Refer to Appendix B for complete proofs. The following lemma follows from a direct application of Hoeffding s inequality. Lemma 4.1. For any batch r [B], and for any pair bi, bj that are compared cr times, we have P |Pi,j b Pi,j| > γr 2δ, where γr = q We analyze the regret of our algorithms under a good event, G. We show that the G occurs with high probability; in the event that G does not occur (denoted G), we incur a regret of T. Towards defining G, we say that an estimate b Pi,j at the end of batch r is correct if | b Pi,j Pi,j| γr. We say that G occurs if every estimate in every batch is correct. Lemma 4.2. The probability that every estimate in every batch of PCOMP, SCOMP, and SCOMP2 is correct is at least 1 1/T. Proof. Applying Lemma 4.1 and taking a union bound over all pairs and batches (note SCOMP2 has at most 2B+1 3B batches), we get that the probability that some estimate Batched Dueling Bandits is incorrect is at most K2 3B 2δ = 1 T where δ = 1/6K2BT. Thus, P(G) 1 Using Lemma 4.2, the expected regret (of any algorithm) can be written as follows: E[R(T)] = E[R(T) | G] P(G) + E[R(T) | G] P(G) E[R(T) | G] + T 1 T = E[R(T) | G] + 1 (2) The proof of Theorem 3.1 can be found in Appendix B. 4.1. Proofs of Theorems 3.2, 3.3 and 3.4 In this section, we discuss the proofs of Theorems 3.2, 3.3 and 3.4. Henceforth, we assume the SST and STI properties. We need the following definition. For a bandit bj, let Ej = {bi B : ϵi,j > 0}; that is, the set of bandits superior to bandit bj. We define rank(bj) = |Ej|. 2 As before, we analyze the regret of SCOMP and SCOMP2 under event G. By Lemma 4.2 and (2), we only need to bound the expected regret under G; that is, we need to bound E[R(T) | G]. Conditioned on event G, the following Lemmas 4.3,4.4 and 4.5 hold for both SCOMP and SCOMP2. Lemma 4.3. The best bandit b1 is never deleted from A in the elimination step of phase I. Lemma 4.4. When the algorithm switches to PCOMP on set A , we have b1 A and |A | rank(bi S) where bi S is the best bandit in S. Lemma 4.5. We have E[rank(bi S)] K and E[rank(bi S)2] 2K. Using Lemmas 4.3, 4.4 and 4.5, we complete the proof of Theorem 3.2. Proof of Theorem 3.2. We bound the expected regret of SCOMP conditioned on G. Let R1 and R2 denote the regret incurred in phase I and II respectively. Bounding R1. Fix a bandit bj. Let r denote the last round such that bj A and switching does not occur (at the end of round r). Let bi S be the best bandit in S. As j is not eliminated by bi S, we have b Pi S,j 1 2 + 3γr, which implies (by event G) Pi S,j 1 2 + 4γr. Moreover, as switching doesn t occur, we have mini S b P1,i 1 2 + 3γr (by Lemma 4.3, b1 is never deleted from A). 
We now claim that P1,i S 1 2 + 4γr. Otherwise, by SST we have mini S P1,i = P1,i S > 1 2 + 4γr, which (by event G) implies mini S b P1,i > 1 2 + 3γr, a contradiction! It now fol- 2Note that SST and STI imposes a linear ordering on the bandits. So, we can assume b1 b2 b K. Thus, rank(bj) j and is at most the number of bandits strictly preferred over bj. lows that ϵi S,j 4γr and ϵ1,i S 4γr. Consider now two cases: 1. b1 bi S bj. Then, by STI, ϵ1,j 8γr, and 2. b1 bj bi S. Then, by SST ϵ1,j ϵi S,j 4γr. In either case, we have ϵj = ϵ1,j 8γr, which implies cr log(1/δ) 2γ2r 32 log(1/δ) Now, let Tj be a random variable denoting the number of comparisons of bj with other bandits before switching. By definition of round r, bandit bj will participate in at most one round after r (in phase I). So, we have Tj |S| Pr+1 τ=1 cτ if bj S K Pr+1 τ=1 cτ if bj S Taking expectation over S, we get τ=1 cτ | bj S τ=1 cτ | bj S K + E[|S| | bj S] where the third inequality uses E[|S| | bj S] K. Moreover, τ=1 cτ 2T 1/B cr = O T 1/B log(1/δ) Thus, E[R1] = X j E [Tj] ϵj K log(6K2TB) Bounding R2. We now bound the regret after switching. From Lemmas 4.3 and 4.4, we know that b1 is never deleted, b1 A , and |A | rank(bi S). For any A , applying Theorem 3.1 we get, R2 3|A |T 1/B log(6T|A |2B) X 3|A |T 1/B log(6TK2B) X Batched Dueling Bandits By Lemma 4.5, E[|A |] KT 1/B log(6TK2B) X Combining (3) and (5), we get E[R(T)|G] = X K log(6K2TB) and by (2), this concludes the proof. Proof of Theorem 3.3. To prove Theorem 3.3, we use SCOMP with sampling probability (for sampling a bandit into the seed set) equal to K 2/3. Note that Lemmas 4.3 and 4.4 continue to hold. The following lemma can be proved exactly as Lemma 4.5, where the sampling probability p = K 2/3. Lemma 4.6. We have E[rank(bi S)] K2/3 and E[rank(bi S)2] K4/3. We can bound R1, as in the proof of Theorem 3.2, to obtain: E[R1] O K4/3T 1/B log(6K2TB) From (4), we have R2 3|A |T 1/B log(6TK2B) |A | which gives E[R2] 3 E[|A |2] T 1/B log(6TK2B) Using Lemma 4.6, we get E[R2] O K4/3T 1/B log(6TK2B) Combining (6) and (7) completes the proof. The proof of Theorem 3.4 follows along the same lines but requires additional ideas, and is deferred to Appendix B. 5. Experimental Results We provide a summary of computational results of our algorithms for the batched dueling bandits problem. We conducted our computations using C++ and Python 2.7 with a 2.3 Ghz Intel Core i5 processor and 16 GB 2133 MHz LPDDR3 memory. Experimental Setup. We compare all our algorithms, namely PCOMP, SCOMP, and SCOMP2 to a representative set of sequential algorithms for dueling bandits. Specifically, we use the dueling bandit library due to (Komiyama et al., 2015), and compare our algorithms to RUCB (Zoghi et al., 2014), RMED1 (Komiyama et al., 2015), and BEAT-THEMEAN (Yue & Joachims, 2011). Henceforth, we refer to BEAT-THE-MEAN as BTM. We plot the cumulative regret R(t) incurred by the algorithms against time t. Furthermore, to illustrate the dependence on B, we run another set of experiments on SCOMP2 and plot the cumulative regret R(t) incurred by SCOMP2 against time t for varying values of B.3 We perform these experiments using both real-world and synthetic data. We use the following datasets: Six rankers. This real-world dataset is based on the 6 retrieval functions used in the engine of Ar Xiv.org. Sushi. The Sushi dataset is based on the Sushi preference dataset (Kamishima, 2003) that contains the preference data regarding 100 types of Sushi. 
A preference dataset using the top-16 most popular types of sushi is obtained.

BTL-Uniform. We generate synthetic data using the Bradley-Terry-Luce (BTL) model. Under this model, each arm b_i ∈ B is associated with a weight w_i > 0 (sampled uniformly in the interval (0, 1]), and we set P_{i,j} = w_i/(w_i + w_j). We set the number of arms K = 100. Note that the data generated in this way satisfies SST and STI (Yue et al., 2012). We refer to this data as SYN-BTL.

Hard-Instance. The last dataset is a synthetic dataset inspired by the hard instances that we construct for proving our lower bound (see Theorem 1.4). Again, we set K = 100, and pick ℓ ∈ [K] uniformly at random as the Condorcet winner. We select ϵ uniformly in (0, 0.5), and set P_{ℓ,i} = 1/2 + ϵ for all i ≠ ℓ. Furthermore, for all i, j ≠ ℓ, we set P_{i,j} = 1/2. We refer to this data as SYN-CD.

Note that there exists a Condorcet winner in all datasets. Moreover, the SYN-BTL dataset satisfies SST and STI. We repeat each experiment 10 times and report the average regret. In our algorithms, we use the KL-divergence-based confidence bound (as in RMED1) for elimination, as it performs much better empirically (and our theoretical bounds continue to hold). In particular, we replace lines 5, 6 and 8 in PCOMP, SCOMP and SCOMP2, respectively, with a KL-divergence-based elimination criterion that eliminates an arm i if there exists another arm j such that P̂_{i,j} < 1/2 and N_{i,j} · D_KL(P̂_{i,j}, 1/2) > log(Tδ), where N_{i,j} is the number of times arms i and j are played together. We report the average cumulative regret at each time step.

3 We also conducted these experiments for PCOMP and SCOMP and the conclusions were similar.

Comparison with sequential dueling bandit algorithms. As mentioned earlier, we compare our algorithms against a representative set of sequential dueling bandits algorithms (RUCB (Zoghi et al., 2014), RMED1 (Komiyama et al., 2015), and BTM (Yue & Joachims, 2011)). Note that the purpose of these experiments is to perform a sanity check to ensure that our batched algorithms, using a small number of batches, perform well when compared with sequential algorithms. We set α = 0.51 for RUCB, f(K) = 0.3K^{1.01} for RMED1, and γ = 1.3 for BTM. We chose these parameters as they are known to perform well both theoretically and empirically (Komiyama et al., 2015). We set T = 10^5, δ = 1/(TK²) and B = log(T) = 16. We plot the results in Figure 1.

Figure 1: Regret vs. t plots of all algorithms (PCOMP, SCOMP, SCOMP2, RMED, RUCB, BTM) on the four datasets; panels include (a) Six rankers and (c) SYN-BTL.

We observe that SCOMP2 performs comparably to RMED1 in all datasets, even outperforms RUCB in 3 out of the 4 datasets, and always beats BTM. Notice that both PCOMP and SCOMP considerably outperform BTM on the six rankers and sushi data; however, their performance degrades on the synthetic data, demonstrating the dependence on K.

Trade-off with number of batches B. We study the trade-off of cumulative regret against the number of batches using SCOMP2. We set T = 10^5, and vary B ∈ {2, 8, 16}. We also plot the regret incurred by RMED1 as it performs the best amongst all sequential algorithms (and thus serves as a good benchmark). We plot the results in Figure 2 in Appendix A.
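To make the role of B concrete, the short snippet below prints the geometric per-round comparison counts c_r = q^r with q = T^{1/B} used by our algorithms, for the three values of B in this experiment (illustrative only; rounding up to integers is an assumption).

```python
import math

T = 10**5
for B in (2, 8, 16):
    q = T ** (1.0 / B)
    # per-round comparisons per pair: c_r = ceil(q^r) for r = 1, ..., B
    c = [math.ceil(q ** r) for r in range(1, B + 1)]
    print(f"B={B:2d}  q={q:8.2f}  per-round comparisons per pair: {c}")
```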
We observe that as we increase the number of batches, the (expected) cumulative regret decreases. Furthermore, we observe that on the synthetic datasets (where K = 100), the regret of SCOMP2 approaches that of RMED1; in fact, the regret incurred is almost identical for SYN-BTL dataset. 6. Conclusion We introduced and studied the batched dueling bandit problem, where the learning algorithm is only allowed to adapt a limited number of times. Our main contribution was an algorithm for this problem under the SST and STI setting. This algorithm s regret (in a logarithmic number of batches) nearly matches the regret of the best known sequential algorithms. We also provided a lower bound demonstrating the dependence of the regret on the number of batches. An avenue for future work is to obtain batched algorithms (with logarithmic number of batches) that exactly match the regret bounds of the best sequential algorithms for dueling bandits under SST and STI. Another direction concerns the batched dueling bandits problem under the more general Condorcet setting. Although we obtained a batched algorithm (PCOMP) for this setting, its regret (in a logarithmic number of batches) is still not asymptotically tight compared to known sequential algorithms. Agarwal, A., Assadi, S., and Khanna, S. Stochastic submodular cover with limited adaptivity. In Proceedings Batched Dueling Bandits of the 30th Annual ACM-SIAM Symposium on Discrete Algorithms, pp. 323 342, 2019. Agarwal, A., Johnson, N., and Agarwal, S. Choice bandits. In Neur IPS, 2020. Ailon, N., Karnin, Z., and Joachims, T. Reducing Dueling Bandits to Cardinal Bandits. In Proceedings of the 31st International Conference on Machine Learning, 2014. Auer, P., Cesa-Bianchi, N., and Fischer, P. Finite-time analysis of the multiarmed bandit problem. Machine Learning, 47:235 256, 05 2002. doi: 10.1023/A:1013689704352. Balkanski, E. and Singer, Y. The adaptive complexity of maximizing a submodular function. In Proceedings of the 50th Annual ACM SIGACT Symposium on Theory of Computing, pp. 1138 1151, 2018a. Balkanski, E. and Singer, Y. Approximation guarantees for adaptive sampling. In Proceedings of the 35th International Conference on Machine Learning, pp. 393 402, 2018b. Balkanski, E., Breuer, A., and Singer, Y. Non-monotone submodular maximization in exponentially fewer iterations. In Advances in Neural Information Processing Systems, pp. 2359 2370, 2018. Balkanski, E., Rubinstein, A., and Singer, Y. An exponential speedup in parallel running time for submodular maximization without loss in approximation. In Proceedings of the 30th Annual ACM-SIAM Symposium on Discrete Algorithms, pp. 283 302, 2019. Bansal, N. and Nagarajan, V. On the adaptivity gap of stochastic orienteering. Math. Program., 154(1-2):145 172, 2015. Bansal, N., Gupta, A., Li, J., Mestre, J., Nagarajan, V., and Rudra, A. When LP is the cure for your matching woes: Improved bounds for stochastic matchings. Algorithmica, 63(4):733 762, 2012. Behnezhad, S., Derakhshan, M., and Hajiaghayi, M. Stochastic matching with few queries: (1-ϵ) approximation. In Proccedings of the 52nd Annual ACM SIGACT Symposium on Theory of Computing, pp. 1111 1124, 2020. Bhalgat, A., Goel, A., and Khanna, S. Improved approximation results for stochastic knapsack problems. In Proceedings of the 22nd Annual ACM-SIAM Symposium on Discrete Algorithms, pp. 1647 1665, 2011. Chekuri, C. and Quanrud, K. Parallelizing greedy for submodular set function maximization in matroids and beyond. 
In Proceedings of the 51st Annual ACM SIGACT Symposium on Theory of Computing, pp. 78 89, 2019. Chen, B. and Frazier, P. I. Dueling Bandits with Weak Regret. In Proceedings of the 34th International Conference on Machine Learning, 2017. Dean, B. C., Goemans, M. X., and Vondr ak, J. Approximating the stochastic knapsack problem: The benefit of adaptivity. Math. Oper. Res., 33(4):945 964, 2008. Dudik, M., Hofmann, K., Schapire, R. E., Slivkins, A., and Zoghi, M. Contextual Dueling Bandits. In Proceedings of the 28th Conference on Learning Theory, 2015. Esfandiari, H., Karbasi, A., Mehrabian, A., and Mirrokni, V. Regret bounds for batched bandits. Proceedings of the AAAI Conference on Artificial Intelligence, 35(8): 7340 7348, May 2021a. Esfandiari, H., Karbasi, A., and Mirrokni, V. Adaptivity in adaptive submodularity. In Proceedings of 34th Conference on Learning Theory, volume 134, pp. 1823 1846. PMLR, 2021b. Gao, Z., Han, Y., Ren, Z., and Zhou, Z. Batched multiarmed bandits problem. In Wallach, H. M., Larochelle, H., Beygelzimer, A., d Alch e-Buc, F., Fox, E. B., and Garnett, R. (eds.), Advances in Neural Information Processing Systems 32: Annual Conference on Neural Information Processing Systems 2019, Neur IPS 2019, December 8-14, 2019, Vancouver, BC, Canada, pp. 501 511, 2019. Ghuge, R., Gupta, A., and Nagarajan, V. The power of adaptivity for stochastic submodular cover. In Meila, M. and Zhang, T. (eds.), Proceedings of the 38th International Conference on Machine Learning, volume 139 of Proceedings of Machine Learning Research, pp. 3702 3712. PMLR, 18 24 Jul 2021. Golovin, D. and Krause, A. Adaptive submodularity: A new approach to active learning and stochastic optimization. Co RR, abs/1003.3967, 2017. Guha, S. and Munagala, K. Multi-armed bandits with metric switching costs. In Automata, Languages and Programming, 36th Internatilonal Colloquium (ICALP), pp. 496 507, 2009. Gupta, A. and Nagarajan, V. A stochastic probing problem with applications. In Integer Programming and Combinatorial Optimization - 16th International Conference, pp. 205 216, 2013. Gupta, A., Krishnaswamy, R., Nagarajan, V., and Ravi, R. Running errands in time: Approximation algorithms for stochastic orienteering. Math. Oper. Res., 40(1):56 79, 2015. Hofmann, K., Whiteson, S., and Rijke, M. Balancing exploration and exploitation in listwise and pairwise Batched Dueling Bandits online learning to rank for information retrieval. Inf. Retr., 16(1):63 90, feb 2013. ISSN 1386-4564. doi: 10.1007/s10791-012-9197-9. Jamieson, K., Katariya, S., Deshpande, A., and Nowak, R. Sparse Dueling Bandits. In Proceedings of the 18th International Conference on Artificial Intelligence and Statistics, 2015. Joachims, T. Optimizing search engines using clickthrough data. In Proceedings of the Eighth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD 02, pp. 133 142, New York, NY, USA, 2002. Association for Computing Machinery. ISBN 158113567X. doi: 10.1145/775047.775067. Kamishima, T. Nantonac collaborative filtering: recommendation based on order responses. In Proceedings of the Ninth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, Washington, DC, USA, August 24 - 27, 2003, pp. 583 588, 2003. Komiyama, J., Honda, J., Kashima, H., and Nakagawa, H. Regret Lower Bound and Optimal Algorithm in Dueling Bandit Problem. In Proceedings of the 28th Conference on Learning Theory, 2015. Komiyama, J., Honda, J., and Nakagawa, H. 
Copeland Dueling Bandit Problem: Regret Lower Bound, Optimal Algorithm, and Computationally Efficient Algorithm. In Proceedings of the 33rd International Conference on Machine Learning, 2016. Liu, T.-Y. Learning to rank for information retrieval. Found. Trends Inf. Retr., 3(3):225 331, mar 2009. ISSN 15540669. doi: 10.1561/1500000016. Perchet, V., Rigollet, P., Chassang, S., and Snowberg, E. Batched bamdit problems. The Annals of Statistics, 44 (2):660 681, 2016. ISSN 00905364. Radlinski, F., Kurup, M., and Joachims, T. How does clickthrough data reflect retrieval quality? In Proceedings of the 17th ACM Conference on Information and Knowledge Management, CIKM 08, pp. 43 52, New York, NY, USA, 2008. Association for Computing Machinery. ISBN 9781595939913. doi: 10.1145/1458082.1458092. Ramamohan, S., Rajkumar, A., and Agarwal, S. Dueling Bandits : Beyond Condorcet Winners to General Tournament Solutions. In Advances in Neural Information Processing Systems 29, 2016. Saha, A. and Gopalan, A. Combinatorial bandits with relative feedback. In Wallach, H. M., Larochelle, H., Beygelzimer, A., d Alch e-Buc, F., Fox, E. B., and Garnett, R. (eds.), Advances in Neural Information Processing Systems 32: Annual Conference on Neural Information Processing Systems 2019, Neur IPS 2019, December 8-14, 2019, Vancouver, BC, Canada, pp. 983 993, 2019. Sui, Y., Zhuang, V., Burdick, J. W., and Yue, Y. Multidueling Bandits with Dependent Arms. In Proceedings of the 33rd Conference on Uncertainty in Artificial Intelligence, 2017. Sui, Y., Zoghi, M., Hofmann, K., and Yue, Y. Advancements in dueling bandits. In Lang, J. (ed.), Proceedings of the Twenty-Seventh International Joint Conference on Artificial Intelligence, IJCAI 2018, July 13-19, 2018, Stockholm, Sweden, pp. 5502 5510. ijcai.org, 2018. Urvoy, T., Clerot, F., Feraud, R., and Naamane, S. Generic Exploration and K-armed Voting Bandits. In Proceedings of the 30th International Conference on Machine Learning, 2013. Wirth, C., Akrour, R., Neumann, G., and F urnkranz, J. A survey of preference-based reinforcement learning methods. J. Mach. Learn. Res., 18(1):4945 4990, jan 2017. ISSN 1532-4435. Wu, H. and Liu, X. Double thompson sampling for dueling bandits. In Advances in Neural Information Processing Systems 29: Annual Conference on Neural Information Processing Systems 2016, December 5-10, 2016, Barcelona, Spain, pp. 649 657, 2016. Yue, Y. and Joachims, T. Interactively optimizing information retrieval systems as a dueling bandits problem. In Proceedings of the 26th Annual International Conference on Machine Learning, ICML 09, pp. 1201 1208, New York, NY, USA, 2009. Association for Computing Machinery. ISBN 9781605585161. doi: 10.1145/1553374. 1553527. Yue, Y. and Joachims, T. Beat the mean bandit. In Proceedings of the 28th International Conference on Machine Learning, 2011. Yue, Y., Broder, J., Kleinberg, R., and Joachims, T. The karmed dueling bandits problem. Journal of Computer and System Sciences, 78(5):1538 1556, 2012. ISSN 0022-0000. doi: https://doi.org/10.1016/j.jcss.2011.12. 028. JCSS Special Issue: Cloud Computing 2011. Zoghi, M., Whiteson, S., Munos, R., and de Rijke, M. Relative Upper Confidence Bound for the K-Armed Dueling Bandit Problem. In Proceedings of the 31st International Conference on Machine Learning, 2014. Zoghi, M., Karnin, Z., Whiteson, S., and de Rijke, M. Copeland Dueling Bandits. In Advances in Neural Information Processing Systems 28, 2015a. Batched Dueling Bandits Zoghi, M., Whiteson, S., and de Rijke, M. 
Merge RUCB: A method for large-scale online ranker evaluation. In Proceedings of the 8th ACM International Conference on Web Search and Data Mining, 2015b. Batched Dueling Bandits A. Additional Plots In this section, we provide the missing plots from 5. 0 20000 40000 60000 80000 100000 t Regret R(t) RMED B=2 B=8 B=16 (a) Six rankers 0 20000 40000 60000 80000 100000 t Regret R(t) RMED B=2 B=8 B=16 0 20000 40000 60000 80000 100000 t Regret R(t) RMED B=2 B=8 B=16 (c) SYN-BTL 0 20000 40000 60000 80000 100000 t Regret R(t) RMED B=2 B=8 B=16 Figure 2: Regret v/s B for SCOMP2. B. Regret Analysis We present the regret analysis for the algorithms described in 3 in this section. We first prove the following lemma which will be used in the analysis of all three algorithms. Lemma B.1. For any batch r [B], and for any pair bi, bj that are compared cr times, we have P |Pi,j b Pi,j| > γr 2δ, where γr = q Proof. Note that E[ b Pi,j] = Pi,j, and applying Hoeffding s inequality gives P | b Pi,j Pi,j| > γr 2 exp 2cr γ2 r = 2δ. Batched Dueling Bandits We analyze the regret of our algorithms under a good event, G. We show that the G occurs with high probability; in the event that G does not occur (denoted G), we incur a regret of T. Towards defining G, we say that an estimate b Pi,j at the end of batch r is correct if | b Pi,j Pi,j| γr. We say that G occurs if every estimate in every batch is correct. Lemma B.2. The probability that every estimate in every batch of PCOMP, SCOMP, and SCOMP2 is correct is at least 1 1/T. Proof. Applying Lemma B.1 and taking a union bound over all pairs and batches (note SCOMP2 has at most 2B + 1 3B batches), we get that the probability that some estimate is incorrect is at most K2 3B 2δ = 1 T where δ = 1/6K2BT. Thus, P(G) 1 Using Lemma B.2, the expected regret (of any algorithm) can be written as follows: E[R(T)] = E[R(T) | G] P(G) + E[R(T) | G] P(G) E[R(T) | G] + T 1 T = E[R(T) | G] + 1 (8) Proof of Theorem 3.1. First, recall that in each batch of PCOMP every pair of active arms is compared cr times where cr = qr with q = T 1/B. Since, q B = T, PCOMP uses at most B batches. Following Lemma B.2 and (8), we only need to bound E[R(T) | G]. Given G, whenever Pi,j > 1 2 +2γr (that is ϵi,j > 2γr), we have b Pi,j > 1 2 + γr: so bandit bj will be eliminated by bi. Furthermore, given bandits bi and bj such that bi bj, bi will never be eliminated by bj under event G. This implies that b1 is never eliminated: this is crucial as we use b1 as an anchor to eliminate sub-optimal bandits. Recall that the regret can be written as follows: where Tj is the number of comparisons that bj partakes in. We proceed by bounding Tj. Towards this end, let T1,j be a random variable denoting the number of comparisons performed between b1 and bj. As b1 is never eliminated, Tj K T1,j. Let r denote the last round such that bj survives round r, i.e., bj A at the end of round r. We can then conclude that ϵj := ϵ1,j 2γr (else b1 would eliminate bj in round r). We get which on squaring and re-arranging gives: Now, note that bj could have been played for at most one more round. Thus, we have τ=0 cτ 2q cr where the final inequality follows from summing up Pr 1 τ=1 cτ, and using B log(T). Then, we have Tj 2Kq cr. Using 9, and plugging in q = T 1/B and δ = 1/6TK2B we have E[R(T) | G] 1 2KT 1/B 2 log 6TK2B KT 1/B log 6TK2B = 2KT 1/B log 6TK2B X Batched Dueling Bandits Note that when ϵj = 0 for bj B, we exclude the corresponding term in the regret bound. Combining this with (8) gives the first bound of Theorem 3.1. 
Plugging in ϵmin = minj:ϵj>0 ϵj completes the proof. B.1. Proofs of Theorems 3.2 and 3.4 In this section, we provide the proofs of Theorem 3.2 and Theorem 3.4. Henceforth, we assume the SST and STI properties. We need the following definition. For a bandit bj, let Ej = {bi B : ϵi,j > 0}; that is, the set of bandits superior to bandit bj. We define rank(bj) = |Ej|. 4 As before, we analyze the regret of SCOMP and SCOMP2 under event G. By Lemma B.2 and (8), we only need to bound the expected regret under G; that is, we need to bound E[R(T) | G]. Conditioned on event G, the following Lemmas B.3,B.4 and B.5 hold for both SCOMP and SCOMP2. Lemma B.3. The best bandit b1 is never deleted from A in the elimination step of phase I. Proof. In SCOMP, bi deletes bj in batch r if b Pi,j > 1 2 + 3γr, and in SCOMP2 if b Pi,j > 1 2 + 5γr. If b1 is deleted due to some bandit bj, then by applying Lemma B.1 (in either case), we get Pj,1 > 1 2 + 2γr, a contradiction. Lemma B.4. When the algorithm switches to PCOMP on set A , we have b1 A and |A | rank(bi S) where bi S is the best bandit in S. Proof. We first consider algorithm SCOMP. Here, the switching occurs when, in some batch r, there exists bj A such that b Pj ,i > 1 2 + 3γr for all bi S, Moreover, A = {bj A | b Pj,i > 1 2 + γr for all bi S}. Consider any bi S. Given G, b Pj ,i > 1 2 + 3γr implies that Pj ,i > 1 2 + 2γr. By SST, P1,i Pj ,i, and again using event G, b P1,i > 1 2 + γr. Thus, b1 A . We now bound |A |. Let bi S be the best bandit in S, i.e., the bandit of smallest rank. Consider any bandit bj A . We have b Pj,i S > 1 2 + γr, which implies (by event G) that Pj,i S > 1 2. So, we must have bj bi S. Consequently, A {bj B : bj bi S}, which implies |A | rank(bi S). We now consider SCOMP2. Here, we select an undefeated candidate bandit bi r in batch r, and the algorithm switches if there exists bj A such that b Pj ,i r > 1 2 + 5γr. Moreover, A = {bj A | b Pj,i r > 1 2 + 3γr}. Given G, we have Pj ,i r > 1 2 + 4γr. By SST and again applying G, we obtain b P1,i r > 1 2 + 3γr. So, b1 A . We now argue that |A | rank(bi S). Again, let bi S be the best bandit in S. As bi r is undefeated after round r(1), we have b Pi S,i r 1 2 + γr, which implies Pi S,i r 1 2 + 2γr (by event G). Now, consider any bandit bj A . We have b Pj,i S > 1 2 + 3γr, which implies (by event G) that Pj,i S > 1 2 + 2γr. It follows that bj bi S for all bj A . Hence, |A | rank(bi S). Lemma B.5. We have E[rank(bi S)] K and E[rank(bi S)2] 2K. Proof. Let R be a random variable denoting rank(bi S). Note that R = k if, and only if, the first k 1 bandits are not sampled into S, and the kth bandit is sampled into S. Thus, R is a geometric random variable with success probability p := 1 K .5 Recall that the mean and variance of a geometric random variable are 1 p and 1 p2 1 p respectively. So, K. Moreover, E[R2] 2 p2 = 2K. Using Lemmas B.3, B.4 and B.5, we complete the proofs of Theorems 3.2 and 3.4. Proof of Theorem 3.2. We bound the expected regret of SCOMP conditioned on G. Let R1 and R2 denote the regret incurred in phase I and II respectively. 4 Note that SST and STI imposes a linear ordering on the bandits. So, we can assume b1 b2 b K. Thus, rank(bj) < j; that is, it is at most the number of bandits strictly preferred over bj. 5Strictly speaking, R is truncated at K. Batched Dueling Bandits Bounding R1. Fix a bandit bj. Let r denote the last round such that bj A and switching does not occur (at the end of round r). Let bi S be the best bandit in S. 
As bj is not eliminated by bi S, we have b Pi S,j 1 2 + 3γr, which implies (by event G) Pi S,j 1 2 + 4γr. Moreover, as switching doesn t occur, we have mini S b P1,i 1 2 + 3γr (by Lemma B.3, b1 is never deleted from A). We now claim that P1,i S 1 2 + 4γr. Otherwise, by SST we have mini S P1,i = P1,i S > 1 2 + 4γr, which (by event G) implies mini S b P1,i > 1 2 + 3γr, a contradiction! It now follows that ϵi S,j 4γr and ϵ1,i S 4γr. Consider now two cases: 1. b1 bi S bj. Then, by STI, ϵ1,j 8γr, and 2. b1 bj bi S. Then, by SST ϵ1,j ϵi S,j 4γr. In either case, we have ϵj = ϵ1,j 8γr, which implies cr log(1/δ) 2γ2r 32 log(1/δ) Now, let Tj be a random variable denoting the number of comparisons of bj with other bandits before switching. By definition of round r, bandit bj will participate in at most one round after r (in phase I). So, we have Tj |S| Pr+1 τ=1 cτ if bj S K Pr+1 τ=1 cτ if bj S Taking expectation over S, we get τ=1 cτ | bj S P(bj S) + E τ=1 cτ | bj S K + E[|S| | bj S] where the third inequality uses E[|S| | bj S] K. Moreover, τ=1 cτ 2T 1/B cr = O T 1/B log(1/δ) j E [Tj] ϵj = X K log(6K2TB) Bounding R2. We now bound the regret after switching. From Lemmas B.3 and B.4, we know that b1 is never deleted, b1 A , and |A | rank(bi S). For any A , applying Theorem 3.1 we get, R2 3|A |T 1/B log(6T|A |2B) X 1 ϵj 3|A |T 1/B log(6TK2B) X By Lemma B.5, E[|A |] KT 1/B log(6TK2B) X Combining (10) and (11), we get E[R(T)|G] = X K log(6K2TB) and by (8), this concludes the proof. Proof of Theorem 3.4. We bound the expected regret conditioned on G. Let R1 and R2 denote the regret incurred in phase I and II respectively. Batched Dueling Bandits Bounding R1. Fix a bandit bj. Let r denote any round such that bj A and switching does not occur (at the end of round r). As in the proof of Theorem 3.2, we first show that cr = O log(1/δ) . Recall that bi r is the candidate in round r. As bj is not eliminated by bi r, we have b Pi r,j 1 2 + 5γr, which implies (by event G) Pi S,j 1 2 + 6γr. Moreover, as switching doesn t occur, we have b P1,i r 1 2 + 5γr (by Lemma B.3, b1 is never deleted from A). By event G, we get P1,i r 1 2 + 6γr. It now follows that ϵi r,j 6γr and ϵ1,i r 6γr. Consider now two cases: 1. b1 bi r bj. Then, by STI, ϵ1,j 12γr, and 2. b1 bj bi S. Then, by SST ϵ1,j ϵi r,j 6γr. In either case, we have ϵj = ϵ1,j 12γr, which implies cr log(1/δ) 2γ2r = O log(1/δ) We further divide R1 into two kinds of regret: R(c) 1 and R(n) 1 where R(c) 1 refers to the regret incurred by candidate arms and R(n) 1 is the regret incurred by non-candidate arms. Bounding R(n) 1 . For any bandit bj, let Tj be a random variable denoting the number of comparisons of bj (in phase I) when bj is not a candidate. Also, let r be the last round such that bj A and switching doesn t occur. So, bj will participate in at most one round after r, and Tj Pr+1 τ=1 cτ if bj S |S| Pr+1 τ=1 cτ if bj S Taking expectation over S, we get τ=1 cτ | bj S P(bj S) + E τ=1 cτ | bj S K E[|S| | bj S] + 1 (2 + 1 where the third inequality uses E[|S| | bj S] 1 + Moreover, using cr = O log(1/δ) , we have Pr+1 τ=1 cτ = O T 1/B log(1/δ) E[R(n) 1 ] = X j E [Tj] ϵj X T 1/B log 1 T 1/BK log 1 Bounding R(c) 1 . Observe that if bj is a candidate in round r, then the regret incurred by bj in round r is at most Kcr ϵ1,j. Also, cr 1 O log( 1 because bj A and switching hasn t occurred at end of round r 1. Thus, we have cr = T 1/Bcr 1 O T 1/B log( 1 . 
We can thus write j Kcr ϵj I [i r = j] , where I [i r = j] is an indicator random variable denoting whether bj was the candidate bandit in round r. Observe that there is exactly one candidate bandit, bi r, in each round. So, r=1 crϵi r K T 1/B log 1 T 1/B log 1 T 1/BKB log 1 Batched Dueling Bandits Combining (12) and (13), we get T 1/BKB log 1 Bounding R2. Finally, we bound the regret in phase II where we only have bandits A . From Lemmas B.3 and B.4, we know that b1 A , and |A | rank(bi S). For any A , applying Theorem 3.1 we get, R2 3|A |T 1/B log(6T|A |2B) X 1 ϵj 3|A |2 T 1/B log(6TK2B) 1 ϵmin By Lemma B.5, E[|A |2] 2K, and so: E[R2] 6T 1/BK log(6TK2B) Finally, combining (14) and (15) completes the proof. C. Lower Bound In this section, we present a lower bound for the batched dueling bandits problem under the SST and STI setting. Note that this lower bound also applies to the more general Condorcet winner setting. The main result of this section is the following: Theorem C.1. Given an integer B > 1, and any algorithm that uses at most B batches, there exists an instance of the K-armed batched dueling bandit problem that satisfies the SST and STI conditions such that the expected regret E[RT ] = Ω KT 1/B where ϵmin is defined with respect to the particular instance. In order to prove this theorem, we will construct a family of instances such that any algorithm for batched dueling bandits cannot simultaneously beat the above regret lower bound over all instances in the family. We exploit the fact that the algorithm is unaware of the particular instance chosen from the family at run-time, and hence, is unaware of the gap ϵmin under that instance. Family of Instances : Let F be an instance where Pi,j = 1 2 for all i, j B. For j [B], let j = K 24B T (j 1)/2B. For j [B] and k [K], let Ej,k be an instance where bandit bk is the Condorcet winner such that Pk,l = 1 2 + j for all l [K] \ {k} and Pl,m = 1 2 for all l, m [K] \ {k}. The family of instances := {Ej,k}j [B],k [K] {F}. C.1. Proof of Theorem C.1 Let us fix an algorithm A for this problem. Let Tj = T j/B for j [B]. Let tj be the total (random) number of comparisons until the end of batch j during the execution of A. We will overload notation and denote by It the distribution of observations seen by the algorithm when the underlying instance is I. We will sometimes use Pi,j(I) for the probability of i beating j under an instance I to emphasize the dependence on I. We will also write ϵmin(I) to emphasize the dependence on the underlying instance I. We define event Aj as follows: Aj = {tj < Tj , j < j and tj Tj}, Batched Dueling Bandits and denote by Ej,k(Aj) the event that Aj occurs given that the instance selected is Ej,k. Similarly, F(Aj) denotes the event that Aj occurs when the instance selected is F. Now, define l=1 P(Ej,l(Aj)). Observe that pj is the average probability of event Aj conditional on the instance having gap j. Lemma C.2. PB j=1 pj 1 Proof. Note that the event Aj is determined by observations until Tj 1. This is because tj 1 < Tj 1, and once the observations until tj 1 are seen: the next batch j determines whether or not Aj occurs. Hence, in order to bound the probability of Aj under two different instances F and Ej,l we use the Pinsker s inequality as |P(F(Aj)) P(Ej,l(Aj))| 1 2DKL(F Tj 1||ETj 1 j,l ) for l [K]. Let τl be the random variable for the number of times arm l is played until Tj 1. 
C.1. Proof of Theorem C.1

Let us fix an algorithm $\mathcal{A}$ for this problem. Let $T_j = T^{j/B}$ for $j \in [B]$. Let $t_j$ be the total (random) number of comparisons until the end of batch $j$ during the execution of $\mathcal{A}$. We will overload notation and denote by $I^t$ the distribution of the observations seen by the algorithm up to time $t$ when the underlying instance is $I$. We will sometimes use $P_{i,j}(I)$ for the probability of $i$ beating $j$ under an instance $I$ to emphasize the dependence on $I$. We will also write $\epsilon_{\min}(I)$ to emphasize the dependence on the underlying instance $I$. We define event $A_j$ as follows:
$$A_j = \big\{\, t_{j'} < T_{j'} \ \text{for all } j' < j, \ \text{and} \ t_j \ge T_j \,\big\},$$
and denote by $E_{j,k}(A_j)$ the event that $A_j$ occurs given that the instance selected is $E_{j,k}$. Similarly, $F(A_j)$ denotes the event that $A_j$ occurs when the instance selected is $F$. Now, define
$$p_j := \frac{1}{K} \sum_{l=1}^{K} \mathbb{P}\big(E_{j,l}(A_j)\big).$$
Observe that $p_j$ is the average probability of event $A_j$ conditional on the instance having gap $\Delta_j$.

Lemma C.2. $\sum_{j=1}^{B} p_j \ge \frac{1}{2}$.

Proof. Note that the event $A_j$ is determined by the observations until $T_{j-1}$. This is because $t_{j-1} < T_{j-1}$, and once the observations until $t_{j-1}$ are seen, the next batch $j$ determines whether or not $A_j$ occurs. Hence, in order to bound the difference in the probability of $A_j$ under the two instances $F$ and $E_{j,l}$, we use Pinsker's inequality:
$$\big|\mathbb{P}(F(A_j)) - \mathbb{P}(E_{j,l}(A_j))\big| \le \sqrt{\tfrac{1}{2}\, D_{KL}\big(F^{T_{j-1}} \,\big\|\, E_{j,l}^{T_{j-1}}\big)} \qquad \text{for } l \in [K].$$
Let $\tau_l$ be the random variable for the number of times arm $l$ is played until $T_{j-1}$. We first bound $D_{KL}(F^{T_{j-1}} \| E_{j,l}^{T_{j-1}})$ as
$$D_{KL}\big(F^{T_{j-1}} \,\big\|\, E_{j,l}^{T_{j-1}}\big) \overset{(a)}{=} \sum_{t=1}^{T_{j-1}} D_{KL}\big(P_{t_1,t_2}(F) \,\big\|\, P_{t_1,t_2}(E_{j,l})\big) \overset{(b)}{=} \sum_{t=1}^{T_{j-1}} \Pr_F\big(\text{arm } l \text{ is played in trial } t\big)\, D_{KL}\big(\tfrac{1}{2} \,\big\|\, \tfrac{1}{2} + \Delta_j\big) \overset{(c)}{\le} 4\Delta_j^2\, \mathbb{E}_F[\tau_l], \qquad (16)$$
where (a) follows from the fact that, given $F$, the outcomes of the comparisons are independent across trials, (b) follows from the fact that the KL-divergence between $P_{t_1,t_2}(F)$ and $P_{t_1,t_2}(E_{j,l})$ is non-zero only when arm $l$ is played in trial $t$, and (c) follows from the fact that $D_{KL}(p \| q) \le \frac{(p-q)^2}{q(1-q)}$. Using the above bounds, we have
$$\frac{1}{K} \sum_{l=1}^{K} \big|\mathbb{P}(F(A_j)) - \mathbb{P}(E_{j,l}(A_j))\big| \le \frac{1}{K} \sum_{l=1}^{K} \sqrt{2\Delta_j^2\, \mathbb{E}_F[\tau_l]} \overset{(a)}{\le} \sqrt{\frac{2\Delta_j^2\, \mathbb{E}_F\big[\sum_{l=1}^{K} \tau_l\big]}{K}} \overset{(b)}{\le} \sqrt{\frac{2\Delta_j^2 \cdot 2T_{j-1}}{K}} \le \frac{1}{2B},$$
where (a) follows from the concavity of $\sqrt{x}$ and Jensen's inequality, and (b) follows from the fact that $\sum_{l=1}^{K} \tau_l \le 2T_{j-1}$ (each comparison involves two arms). We thus have
$$\big|\mathbb{P}(F(A_j)) - p_j\big| = \Big|\mathbb{P}(F(A_j)) - \frac{1}{K}\sum_{l=1}^{K} \mathbb{P}(E_{j,l}(A_j))\Big| \le \frac{1}{K}\sum_{l=1}^{K} \big|\mathbb{P}(F(A_j)) - \mathbb{P}(E_{j,l}(A_j))\big| \le \frac{1}{2B}.$$
Finally, since the events $A_1, \ldots, A_B$ are disjoint and one of them must occur under $F$ (as $t_B = T = T_B$), we can write
$$\sum_{j=1}^{B} p_j \ge \sum_{j=1}^{B} \Big(\mathbb{P}(F(A_j)) - \frac{1}{2B}\Big) = \sum_{j=1}^{B} \mathbb{P}(F(A_j)) - \frac{1}{2} = 1 - \frac{1}{2} = \frac{1}{2}.$$

As a consequence of this lemma, there exists some $j \in [B]$ such that $p_j \ge \frac{1}{2B}$. We focus on the event where the gap is $\Delta_j$, and prove that when $p_j \ge \frac{1}{2B}$, the algorithm $\mathcal{A}$ must suffer high regret. The next lemma formalizes this.

Lemma C.3. If, for some $j$, $p_j \ge \frac{1}{2B}$, then
$$\sup_{I:\, \epsilon_{\min}(I) = \Delta_j} \mathbb{E}[R_T(I)] = \Omega\Big(\frac{K\, T^{1/B}}{B^3\, \epsilon_{\min}}\Big).$$

Proof. Fix $k \in [K]$. We will construct a family of instances $\{Q_{j,k,l}\}_{l \ne k}$, where $Q_{j,k,l}$ is defined as follows.

Instance $Q_{j,k,l}$: Arm $l$ is the Condorcet winner and the pairwise preferences are defined as: $P_{l,m} = \frac{1}{2} + 2\Delta_j$ for $m \in [K] \setminus \{l\}$; $P_{k,m} = \frac{1}{2} + \Delta_j$ for $m \in [K] \setminus \{l, k\}$; and $P_{m,m'} = \frac{1}{2}$ for the remaining pairs $(m, m')$. We also let $Q_{j,k,k} := E_{j,k}$.

Note that the regret per time-step is at least $\Delta_j$ whenever the underlying instance is $Q_{j,k,l}$ and the pair played is not $(b_l, b_l)$. We have that
$$\sup_{I:\, \epsilon_{\min}(I) = \Delta_j} \mathbb{E}[R_T(I)] \ge \frac{\Delta_j}{K} \sum_{l \ne k} \sum_{t=1}^{T} Q^t_{j,k,l}\big((b_{t_1}, b_{t_2}) \ne (b_l, b_l)\big),$$
where $Q^t_{j,k,l}$ denotes the distribution of the observations available at time $t$ under instance $Q_{j,k,l}$, and $Q^t_{j,k,l}((b_{t_1}, b_{t_2}) \ne (b_l, b_l))$ is the probability that the algorithm does not play the pair $(b_l, b_l)$ at time $t$ under $Q^t_{j,k,l}$. In order to bound the above quantity, we will need the following lemma from (Gao et al., 2019).

Lemma C.4 (Lemma 3 of Gao et al. (2019)). Let $Q_1, \ldots, Q_K$ be probability measures on some common probability space $(\Omega, \mathcal{F})$, and let $\psi : \Omega \to [K]$ be any measurable function (i.e., test). Then, for any tree $\mathcal{T} = ([K], E)$ with vertex set $[K]$ and edge set $E$,
$$\sum_{i=1}^{K} Q_i(\psi \ne i) \ge \frac{1}{2} \sum_{(l, l') \in E} \int \min\{dQ_l,\, dQ_{l'}\}.$$

Using the above lemma for the star graph centered at $k$ (with the test at time $t$ that outputs $l$ if the pair played is $(b_l, b_l)$), we have that
$$\sup_{I:\, \epsilon_{\min}(I) = \Delta_j} \mathbb{E}[R_T(I)] \ge \frac{\Delta_j}{2K} \sum_{t=1}^{T} \sum_{l \ne k} \int \min\{dQ^t_{j,k,k},\, dQ^t_{j,k,l}\} \overset{(a)}{\ge} \frac{\Delta_j}{2K} \sum_{t=1}^{T_j} \sum_{l \ne k} \int \min\{dQ^t_{j,k,k},\, dQ^t_{j,k,l}\} \overset{(b)}{\ge} \frac{\Delta_j T_j}{2K} \sum_{l \ne k} \int \min\{dQ^{T_j}_{j,k,k},\, dQ^{T_j}_{j,k,l}\} \ge \frac{\Delta_j T_j}{2K} \sum_{l \ne k} \int_{A_j} \min\{dQ^{T_j}_{j,k,k},\, dQ^{T_j}_{j,k,l}\} \overset{(c)}{\ge} \frac{\Delta_j T_j}{2K} \sum_{l \ne k} \int_{A_j} \min\{dQ^{T_{j-1}}_{j,k,k},\, dQ^{T_{j-1}}_{j,k,l}\},$$
where (a) follows because $T_j \le T$, (b) follows because $\int \min\{dP, dQ\} = 1 - D_{TV}(P, Q)$ and $D_{TV}(Q^{T_j}_{j,k,k}, Q^{T_j}_{j,k,l})$ is at least $D_{TV}(Q^{t}_{j,k,k}, Q^{t}_{j,k,l})$ for $t \le T_j$ (the sigma-algebra of $Q^{t}_{j,k,k}$ is a subset of the sigma-algebra of $Q^{T_j}_{j,k,k}$), and (c) follows from the fact that the event $A_j$ is determined by the observations until $T_{j-1}$, as explained in the proof of Lemma C.2. We then have that
$$\int_{A_j} \min\{dQ^{T_{j-1}}_{j,k,k},\, dQ^{T_{j-1}}_{j,k,l}\} = \int_{A_j} \frac{dQ^{T_{j-1}}_{j,k,k} + dQ^{T_{j-1}}_{j,k,l}}{2} - \int_{A_j} \frac{\big|dQ^{T_{j-1}}_{j,k,k} - dQ^{T_{j-1}}_{j,k,l}\big|}{2} = \frac{Q^{T_{j-1}}_{j,k,k}(A_j) + Q^{T_{j-1}}_{j,k,l}(A_j)}{2} - \int_{A_j} \frac{\big|dQ^{T_{j-1}}_{j,k,k} - dQ^{T_{j-1}}_{j,k,l}\big|}{2} \overset{(a)}{\ge} Q^{T_{j-1}}_{j,k,k}(A_j) - \frac{1}{2}\, D_{TV}\big(Q^{T_{j-1}}_{j,k,k},\, Q^{T_{j-1}}_{j,k,l}\big) - D_{TV}\big(Q^{T_{j-1}}_{j,k,k},\, Q^{T_{j-1}}_{j,k,l}\big) = Q^{T_{j-1}}_{j,k,k}(A_j) - \frac{3}{2}\, D_{TV}\big(Q^{T_{j-1}}_{j,k,k},\, Q^{T_{j-1}}_{j,k,l}\big),$$
where (a) follows from the fact that $D_{TV}(P, Q) = \sup_{A} |P(A) - Q(A)|$.
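The calculations in Lemmas C.2 and C.3 repeatedly use three elementary facts: the chi-square upper bound $D_{KL}(p\|q) \le \frac{(p-q)^2}{q(1-q)}$ for Bernoulli distributions, Pinsker's inequality $D_{TV} \le \sqrt{\frac{1}{2} D_{KL}}$, and the identity $\int \min\{dP, dQ\} = 1 - D_{TV}(P,Q)$ with $D_{TV}(P,Q) = \sup_A |P(A)-Q(A)| = \frac{1}{2}\sum_\omega |P(\omega)-Q(\omega)|$ for discrete measures. The snippet below is a small numeric sanity check of these facts; it is illustrative only and not part of the paper's analysis.

```python
import numpy as np

rng = np.random.default_rng(0)

def kl_bernoulli(p: float, q: float) -> float:
    """KL divergence between Bernoulli(p) and Bernoulli(q)."""
    return p * np.log(p / q) + (1 - p) * np.log((1 - p) / (1 - q))

# Fact 1: the chi-square upper bound D_KL(p||q) <= (p-q)^2 / (q(1-q)),
# used in step (c) of Equation (16).
for delta in [0.01, 0.05, 0.1]:
    q = 0.5 + delta
    assert kl_bernoulli(0.5, q) <= delta**2 / (q * (1 - q)) + 1e-12

# Facts 2 and 3: for random discrete distributions P, Q on a finite space,
# check Pinsker's inequality and the identity  sum_w min(P,Q) = 1 - TV(P,Q).
for _ in range(100):
    P = rng.dirichlet(np.ones(6))
    Q = rng.dirichlet(np.ones(6))
    tv = 0.5 * np.abs(P - Q).sum()          # D_TV = (1/2) * L1 distance
    kl = np.sum(P * np.log(P / Q))          # D_KL(P || Q)
    assert tv <= np.sqrt(0.5 * kl) + 1e-12  # Pinsker's inequality
    assert abs(np.minimum(P, Q).sum() - (1 - tv)) < 1e-12
```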
Let $\tau_l$ again denote the random variable for the number of times arm $l$ is played until $T_{j-1}$. We also have that
$$\frac{1}{K} \sum_{l \ne k} D_{TV}\big(Q^{T_{j-1}}_{j,k,k},\, Q^{T_{j-1}}_{j,k,l}\big) \overset{(a)}{\le} \frac{1}{K} \sum_{l \ne k} \sqrt{\tfrac{1}{2}\, D_{KL}\big(Q^{T_{j-1}}_{j,k,k} \,\big\|\, Q^{T_{j-1}}_{j,k,l}\big)} \le \frac{1}{K} \sum_{l \ne k} \sqrt{\tfrac{1}{2} \cdot 16\Delta_j^2\, \mathbb{E}_{E_{j,k}}[\tau_l]} = \frac{1}{K} \sum_{l \ne k} \sqrt{8\Delta_j^2\, \mathbb{E}_{E_{j,k}}[\tau_l]} \overset{(b)}{\le} \sqrt{\frac{8\Delta_j^2\, \mathbb{E}_{E_{j,k}}\big[\sum_{l=1}^{K} \tau_l\big]}{K}} \overset{(c)}{\le} \sqrt{\frac{8\Delta_j^2 \cdot 2T_{j-1}}{K}} \le \frac{1}{6B},$$
where (a) follows from Pinsker's inequality, the second inequality follows from a calculation similar to Equation (16) in the proof of Lemma C.2, (b) follows from the concavity of $\sqrt{x}$ and Jensen's inequality, and (c) follows from the fact that $\sum_{l=1}^{K} \tau_l \le 2T_{j-1}$. Combining the above inequalities (recalling that $Q_{j,k,k} = E_{j,k}$, and absorbing absolute constants), we have that
$$\sup_{I:\, \epsilon_{\min}(I) = \Delta_j} \mathbb{E}[R_T(I)] = \Omega\Big(\Delta_j T_j \Big(\mathbb{P}\big(E_{j,k}(A_j)\big) - \frac{1}{4B}\Big)\Big).$$
Since the above inequality holds for all $k \in [K]$, by averaging over $k$ we get
$$\sup_{I:\, \epsilon_{\min}(I) = \Delta_j} \mathbb{E}[R_T(I)] = \Omega\Big(\Delta_j T_j \Big(\frac{1}{K}\sum_{k=1}^{K} \mathbb{P}\big(E_{j,k}(A_j)\big) - \frac{1}{4B}\Big)\Big) = \Omega\Big(\Delta_j T_j \Big(p_j - \frac{1}{4B}\Big)\Big) = \Omega\Big(\frac{\Delta_j T_j}{4B}\Big),$$
where the last step uses $p_j \ge \frac{1}{2B}$. Substituting the values of $\Delta_j$ and $T_j$, we get
$$\Omega\Big(\frac{\Delta_j T_j}{4B}\Big) = \Omega\Big(\frac{\sqrt{K}}{24B}\, T^{-(j-1)/(2B)}\, T^{j/B} \cdot \frac{1}{4B}\Big) = \Omega\Big(\frac{\sqrt{K}}{24B}\, T^{(j-1)/(2B)}\, T^{1/B} \cdot \frac{1}{4B}\Big) = \Omega\Big(\frac{K\, T^{1/B}}{B^3\, \epsilon_{\min}}\Big),$$
since $T^{(j-1)/(2B)} = \frac{\sqrt{K}}{24B\,\Delta_j}$ and $\epsilon_{\min} = \Delta_j$. This proves Lemma C.3.

Finally, by Lemma C.2, $\sum_{j=1}^{B} p_j \ge \frac{1}{2}$, which implies that there exists some $j \in [B]$ with $p_j \ge \frac{1}{2B}$. Combining the two lemmas above, the algorithm incurs expected regret $\Omega\big(\frac{K\, T^{1/B}}{B^3\, \epsilon_{\min}}\big)$ for this $j$: there must exist an instance with gap $\epsilon_{\min} = \Delta_j$ under which the regret of the algorithm is $\Omega\big(\frac{K\, T^{1/B}}{B^3\, \epsilon_{\min}}\big)$. This completes the proof of our lower bound.
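As a final sanity check of the closing substitution, the sketch below numerically verifies the exponent identity $T^{-(j-1)/(2B)}\, T^{j/B} = T^{(j-1)/(2B)}\, T^{1/B}$ and that $\Delta_j T_j / (4B)$ matches the $K T^{1/B} / (B^3 \epsilon_{\min})$ form, using the value of $\Delta_j$ reconstructed above (the explicit constant $2304 = 4 \cdot 24^2$ follows from that reconstruction and is not quoted from the paper).

```python
import math

# Sanity check of the closing substitution in the lower-bound proof, using the
# reconstructed gap Delta_j = sqrt(K)/(24B) * T^{-(j-1)/(2B)} and T_j = T^{j/B}.
# (Illustrative only; the constant 2304 = 4 * 24^2 follows from that reconstruction.)
K, B, T = 16, 4, 10**8

for j in range(1, B + 1):
    delta_j = math.sqrt(K) / (24 * B) * T ** (-(j - 1) / (2 * B))
    T_j = T ** (j / B)

    # Exponent algebra: T^{-(j-1)/(2B)} * T^{j/B} == T^{(j-1)/(2B)} * T^{1/B}.
    lhs = T ** (-(j - 1) / (2 * B)) * T ** (j / B)
    rhs = T ** ((j - 1) / (2 * B)) * T ** (1 / B)
    assert math.isclose(lhs, rhs, rel_tol=1e-9)

    # Delta_j * T_j / (4B) equals K * T^{1/B} / (2304 * B^3 * Delta_j),
    # i.e. the Omega(K T^{1/B} / (B^3 eps_min)) form with eps_min = Delta_j.
    lower_bound = delta_j * T_j / (4 * B)
    closed_form = K * T ** (1 / B) / (2304 * B**3 * delta_j)
    assert math.isclose(lower_bound, closed_form, rel_tol=1e-9)

print("substitution checks out for all j in [B]")
```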