# Adversarial Dueling Bandits

Aadirupa Saha¹, Tomer Koren², Yishay Mansour²

¹Microsoft Research, New York City. ²Blavatnik School of Computer Science, Tel Aviv University, and Google Research Tel Aviv. Correspondence to: Aadirupa Saha.

Proceedings of the 38th International Conference on Machine Learning, PMLR 139, 2021. Copyright 2021 by the author(s).

We introduce the problem of regret minimization in Adversarial Dueling Bandits. As in classic Dueling Bandits, the learner has to repeatedly choose a pair of items and observe only a relative binary win-loss feedback for this pair, but here this feedback is generated from an arbitrary preference matrix, possibly chosen adversarially. Our main result is an algorithm whose $T$-round regret compared to the Borda-winner from a set of $K$ items is $\tilde{O}(K^{1/3}T^{2/3})$, as well as a matching $\Omega(K^{1/3}T^{2/3})$ lower bound. We also prove a similar high probability regret bound. We further consider a simpler fixed-gap adversarial setup, which bridges between two extreme preference feedback models for dueling bandits: stationary preferences and an arbitrary sequence of preferences. For the fixed-gap adversarial setup we give an $O((K/\Delta^2)\log T)$ regret algorithm, where $\Delta$ is the gap in Borda scores between the best item and all other items, and show a lower bound of $\Omega(K/\Delta^2)$, indicating that our dependence on the main problem parameters $K$ and $\Delta$ is tight (up to logarithmic factors). Finally, we corroborate the theoretical results with empirical evaluations.

## 1. Introduction

Dueling Bandits is an online decision-making framework similar to the well-known (stochastic) multi-armed bandit (MAB) problem (Auer et al., 2002a; Slivkins, 2019), that has gained widespread attention in the machine learning community over the past decade (Yue et al., 2012; Zoghi et al., 2014b; 2015a). In Dueling Bandits, a learner repeatedly selects a pair of items to be compared to each other in a duel, and consequently observes a binary stochastic preference feedback, which can be interpreted as the winning item in this duel. The goal of the learner is to minimize the regret with respect to the best item in hindsight, according to a certain score function. Numerous real-world applications are naturally modelled as dueling bandit problems, including movie recommendations, tournament ranking, search engine optimization, retail management, etc. (see also Busa-Fekete & Hüllermeier, 2014; Yue & Joachims, 2009). Indeed, in many of these scenarios, users with whom the algorithm interacts find it more natural to provide binary feedback by comparing two alternatives rather than giving an absolute score for a single alternative.

Over the years, several algorithms have been proposed for addressing dueling bandit problems (Ailon et al., 2014; Zoghi et al., 2014a; Komiyama et al., 2015; Zoghi et al., 2014b), and there has been some work on extending pairwise preferences to more general subset-wise preferences (Sui et al., 2017; Brost et al., 2016; Saha & Gopalan, 2018; 2019; Ren et al., 2018). While almost all of the existing literature on dueling bandits focuses on stochastic stationary preferences, in reality preferences might vary significantly and unpredictably over time. For example, in video recommendation systems, user preferences may evolve according to daily and hourly viewing trends; in web-search optimization, the relevance of various websites may vary rather unpredictably.
In other words, many of the real-world applications of dueling bandits actually deviate from the stochastic feedback model, and would more faithfully be modelled in a robust worst-case (adversarial) model that alleviates the strong stochastic assumption and allows for an arbitrary sequence of preferences over time. For similar reasons, the MAB problem, and more generally, online learning, are frequently studied in a nonstochastic adversarial setup (Lattimore & Szepesvári, 2018; Bubeck & Cesa-Bianchi, 2012; Cesa-Bianchi & Lugosi, 2006; Seldin & Slivkins, 2014; Seldin & Lugosi, 2017; Neu, 2015; Bubeck & Slivkins, 2012). Surprisingly, however, a non-stochastic version of dueling bandits has not been well studied (with the only exception being Gajane et al., 2015, discussed below).

The first challenge in eschewing stationarity in dueling bandits lies in the performance benchmark compared to which regret is defined. Indeed, most works on stochastic dueling bandits rely on the existence of a Condorcet winner: an item being preferred (and often by a gap) when compared with any other item. In an adversarial environment, however, assuming a Condorcet winner makes little sense, as it would constrain the adversary to consistently prefer a certain item at all rounds, ultimately defeating the purpose of a nonstationary model in the first place. Another main challenge is the inherent disconnect between the feedback observed by the learner and her payoff at any given round; while this disparity already exists in stochastic models of dueling bandits, in an adversarial setup it becomes more tricky to attribute preferential information to the instantaneous quality of items.

### 1.1. Our contributions

In this paper, we introduce and study an adversarial version of dueling bandits. To mitigate the issues associated with Condorcet winner assumptions, and following recent literature on dueling bandits (e.g., Jamieson et al., 2015; Ramamohan et al., 2016; Falahatgar et al., 2017), we focus on the so-called Borda score criterion. The Borda score of an item is the probability that it is preferred over another item chosen uniformly at random. A Borda winner (i.e., an item with the highest Borda score) always exists for any preference matrix, and more generally, this notion naturally extends to any arbitrary sequence of preference matrices. However, the second challenge from above remains: the Borda score of an item is not directly related to the preferential feedback observed for this item on rounds where it is chosen for a duel.

The main contributions of this paper can be summarized as follows:

- We introduce and formalize an adversarial model for $K$-armed dueling bandits with standard binary win-loss preferential feedback (and where regret is measured with respect to Borda scores). To the best of our knowledge, we are the first to study such a setup.
- In the general adversarial model, where the sequence of preference matrices is allowed to be entirely arbitrary, we present an algorithm with expected regret bounded by $\tilde{O}(K^{1/3}T^{2/3})$.¹ We further demonstrate how to modify our algorithm so as to guarantee a similar bound with high probability. We also give a lower bound of $\Omega(K^{1/3}T^{2/3})$, showing our algorithm is nearly optimal.
- We consider a more specialized fixed-gap adversarial model, which bridges between the two extreme preference feedback models for dueling bandits: the well-studied stationary stochastic preferences, and fully adversarial preferences. Here, we assume that there is a fixed item whose average Borda score at any point in time exceeds that of any other item by at least $\Delta > 0$, where $\Delta$ is a gap parameter unknown to the learner. (Other than constraining this fixed gap, the preference assignment may change adversarially.) We present an algorithm that achieves regret $\tilde{O}(K/\Delta^2)$, and show that it is near-optimal by proving a regret lower bound of $\Omega(K/\Delta^2)$.
- Finally, we corroborate our theoretical findings with an empirical evaluation.

¹Throughout, the notation $\tilde{O}(\cdot)$ hides logarithmic factors.

Our results thus reveal an inherent gap in the achievable regret between dueling bandits and standard multi-armed bandits: in the adversarial model, the optimal regret in dueling bandits grows like $\Theta(T^{2/3})$, whereas in standard bandits $\Theta(\sqrt{T})$-type bounds are possible; likewise, in the fixed-gap model the optimal regret for dueling bandits is $\Theta(K/\Delta^2)$, versus the well-known $\Theta(K/\Delta)$ regret performance for standard fixed-gap (stochastic) bandits. The reason for this substantial gap, as we explain in more detail in our discussion of lower bounds, is the following. For gaining information about the identity of the best item in terms of Borda scores, the learner might be forced to choose items whose scores are already (or even initially) known to be suboptimal, and for which she would unavoidably suffer constant regret. Indeed, the Borda score of an item inherently depends on its relative performance compared to all other items, and it may be that the identity of the Borda winner is determined solely by its comparison to poorly-performing items.

### 1.2. Related work

Dueling bandits were investigated extensively in the stochastic setting. The most frequently used performance objective in this literature is the regret compared to the Condorcet Winner (Yue et al., 2012; Zoghi et al., 2014a; 2015b; Komiyama et al., 2015; Yue & Joachims, 2011). However, there are quite a few well-established shortcomings of this objective; most importantly, the Condorcet winner often fails to exist even for a fixed preference matrix. (See Jamieson et al., 2015 for a more detailed discussion.) In the absence of Condorcet winners, other preference notions have been studied in the literature, most notably the Borda Winner (Busa-Fekete & Hüllermeier, 2014; Jamieson et al., 2015; Ramamohan et al., 2016; Falahatgar et al., 2017), the Copeland Winner (Zoghi et al., 2015a; Komiyama et al., 2016; Wu & Liu, 2016),² and the Von-Neumann Winner (Dudík et al., 2015; Balsubramani et al., 2016). In this work, we focus on the Borda Winner, which appears to be the most common alternative.

²It is worth noting that for the Copeland winner to be at all learnable, a gap assumption is required.

The only previous treatment of dueling bandits in an adversarial setting is Gajane et al. (2015), which considers utility-based preferences and thereby imposes a complete ordering of the items in each time step rather than a general preference matrix. Further, their feedback model includes not only the winning item but also a transfer function, which is the difference in utilities between the compared items; this makes the setting more similar to standard MAB and largely departs from the original motivation of dueling bandits. For the identity transfer function, they show in their adversarial utility-based dueling bandit model a tight regret bound of $\Theta(\sqrt{KT})$. In contrast, we show for the adversarial dueling bandit model a tight regret bound of $\Theta(K^{1/3}T^{2/3})$.
This shows that when one does not have direct access to a transfer function and is faced with arbitrary preferences, the regret scales substantially differently, i.e., $\Theta(T^{2/3})$ versus $\Theta(T^{1/2})$.

Jamieson et al. (2015) show an instance-dependent $\Omega(K/\Delta^2)$ sample complexity lower bound for the Borda-winner identification problem in stochastic dueling bandits. In contrast, our lower bound, which is similar in magnitude, applies to the regret, which is always smaller (and often strictly smaller) than the sample complexity.

## 2. Problem Setup

We consider an online decision task over a finite set of items $[K] := \{1, 2, \ldots, K\}$ which spans $T$ decision rounds. Initially, and obliviously, the environment fixes a sequence of $T$ preference matrices $P_1, \ldots, P_T$, where each $P_t \in [0,1]^{K \times K}$ satisfies $P_t(i,j) = 1 - P_t(j,i)$ and $P_t(i,i) = \tfrac{1}{2}$ for all $i, j \in [K]$. The value of $P_t(i,j)$ is interpreted as the probability that item $i$ wins when matched against item $j$ at time $t$. Then, at each round $t$ the learner selects, possibly at random, two items $x_t, y_t \in [K]$, and a feedback $o_t \sim \mathrm{Ber}(P_t(x_t, y_t))$ for the selected pair is revealed, where $\mathrm{Ber}(p)$ denotes a Bernoulli random variable with parameter $p$. Here, feedback $o_t = 1$ implies that item $x_t$ wins the duel, while $o_t = 0$ corresponds to $y_t$ being the winner.

The Borda score of item $i \in [K]$ with respect to the preference matrix $P_t$ at time $t$ is defined as
$$\forall i \in [K]: \quad b_t(i) := \frac{1}{K-1} \sum_{j \neq i} P_t(i,j), \qquad \text{and} \qquad i^\star := \arg\max_{i \in [K]} \sum_{t=1}^{T} b_t(i),$$
i.e., $i^\star$ is the item with the highest cumulative Borda score at time $T$. The learner's $T$-round regret $R_T$ is then defined as follows:
$$R_T := \sum_{t=1}^{T} r_t, \qquad \text{where} \quad r_t := b_t(i^\star) - \tfrac{1}{2}\big(b_t(x_t) + b_t(y_t)\big). \tag{1}$$

We will consider two settings of preference assignments. In the general adversarial setting, $P_1, \ldots, P_T$ is an arbitrary sequence of preference matrices. In the fixed-gap setting, preferences are set so that there is an item $i^\star \in [K]$ for which, at all rounds $t \in [T]$, we have $\bar{b}_t(i^\star) \geq \bar{b}_t(j) + \Delta$ for any other $j \neq i^\star$, where $\bar{b}_t(j) := \frac{1}{t}\sum_{\tau=1}^{t} b_\tau(j)$ is the average Borda score of item $j \in [K]$ up to time $t$.

## 3. General Adversarial Dueling Bandits

We first consider the general adversarial setup for an arbitrary sequence of preference matrices. We give an algorithm, called Dueling-EXP3 (D-EXP3), which has an expected regret of $O((K \log K)^{1/3} T^{2/3})$. We also show how a simple modification of the D-EXP3 algorithm guarantees regret $O\big(K^{1/3} T^{2/3} \sqrt{\log(K/\delta)}\big)$ with probability at least $1-\delta$.

### 3.1. The Dueling-EXP3 Algorithm

Our algorithm, detailed in Algorithm 1, is motivated by the classical EXP3 algorithm for adversarial MAB (Auer et al., 2002a), and relies on constructing unbiased estimates of the scores of individual items at all rounds. However, in the dueling setup one has to establish such estimates using only the binary preference feedback corresponding to a choice of a pair of items. Technically, the algorithm will estimate a shifted version of the Borda score, defined as follows.

Definition 1. The shifted Borda score of item $i \in [K]$ at time $t \in [T]$ is $s_t(i) := \frac{1}{K} \sum_{j \in [K]} P_t(i,j)$.

The shifted regret is then defined as $R^s_T := \sum_{t=1}^{T} \big[s_t(i^\star) - \tfrac{1}{2}(s_t(x_t) + s_t(y_t))\big]$. Since all scores are shifted by the same value, this has no impact, and the differences between Borda scores are maintained (albeit multiplied by $\tfrac{K-1}{K}$). In particular, the best item is unchanged, i.e., $i^\star = \arg\max_{i \in [K]} \sum_{t=1}^{T} b_t(i) = \arg\max_{i \in [K]} \sum_{t=1}^{T} s_t(i)$, and for any $K \geq 2$ and $T > 0$ we have $R_T = \tfrac{K}{K-1} R^s_T$.
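To make the objective concrete, the following is a minimal NumPy sketch of the quantities defined above: the Borda score, the shifted Borda score of Definition 1, and the per-round regret of Eq. (1). The function names are ours and purely illustrative.

```python
import numpy as np

def borda_scores(P):
    """Borda scores b_t(i) = (1/(K-1)) * sum_{j != i} P_t(i, j) for one preference matrix."""
    K = P.shape[0]
    return (P.sum(axis=1) - np.diag(P)) / (K - 1)

def shifted_borda_scores(P):
    """Shifted Borda scores s_t(i) = (1/K) * sum_j P_t(i, j) (Definition 1)."""
    return P.mean(axis=1)

def instantaneous_regret(P, best_item, x, y):
    """Per-round regret r_t = b_t(i*) - 0.5*(b_t(x_t) + b_t(y_t)) from Eq. (1),
    where best_item is the cumulative Borda winner i* over the whole horizon."""
    b = borda_scores(P)
    return b[best_item] - 0.5 * (b[x] + b[y])
```

For a single fixed preference matrix the Borda winner is simply `np.argmax(borda_scores(P))`; for a sequence, $i^\star$ maximizes the sum of `borda_scores(P_t)` over $t$.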
At every round $t$, D-EXP3 maintains a weight distribution $q_t \in \Delta_{[K]}$ (where $\Delta_{[K]}$ is the $K$-simplex), and computes a score estimate $\tilde{s}_t(i)$ for each item $i$, which is an unbiased estimate of $s_t(i)$ (Lemma 4). Thus, the cumulative estimated score $\sum_{\tau=1}^{t} \tilde{s}_\tau(i)$ can be seen as the estimated cumulative reward of item $i$ at round $t$, and hence $q_{t+1}$ is simply obtained by running an exponential weights update on these estimated cumulative scores, along with a $\gamma$-uniform exploration.

Algorithm 1: Dueling-EXP3 (D-EXP3)
1: Input: item set indexed by $[K]$, learning rate $\eta > 0$, parameter $\gamma \in (0,1)$
2: Initialize: initial probability distribution $q_1(i) = 1/K$, $\forall i \in [K]$
3: for $t = 1, \ldots, T$ do
4: Sample $x_t, y_t \sim q_t$ i.i.d. (with replacement)
5: Receive preference $o_t(x_t, y_t) \sim \mathrm{Ber}(P_t(x_t, y_t))$
6: Estimate scores, for all $i \in [K]$:
$$\tilde{s}_t(i) = \frac{\mathbf{1}(x_t = i)}{K\, q_t(i)} \sum_{j \in [K]} \frac{\mathbf{1}(y_t = j)\, o_t(x_t, y_t)}{q_t(j)}$$
7: Update, for all $i \in [K]$:
$$\tilde{q}_{t+1}(i) = \frac{\exp\big(\eta \sum_{\tau=1}^{t} \tilde{s}_\tau(i)\big)}{\sum_{j=1}^{K} \exp\big(\eta \sum_{\tau=1}^{t} \tilde{s}_\tau(j)\big)}, \qquad q_{t+1}(i) = (1-\gamma)\, \tilde{q}_{t+1}(i) + \frac{\gamma}{K}$$

We now state the expected regret guarantee we establish for Algorithm 1.

Theorem 2. Let $\eta = \big(\log K / (\sqrt{K}\, T)\big)^{2/3}$ and $\gamma = \sqrt{\eta K}$. For any $T$, the expected regret of Algorithm 1 satisfies $\mathbb{E}[R_T] \leq 6 (K \log K)^{1/3} T^{2/3}$.

The proof of the expected regret bound crucially relies on the following key lemmas regarding the estimates of the shifted Borda scores. We bound their magnitude, show that they are unbiased estimates, bound their instantaneous regret, and bound their second moment.

We first bound the magnitude of the estimates $\tilde{s}_t(i)$, using the fact that $q_t(j) \geq \gamma/K$.

Lemma 3. For all $t \in [T]$ and $i \in [K]$ it holds that $\tilde{s}_t(i) \leq K/\gamma^2$.

Next, we show that $\tilde{s}_t(i)$ is an unbiased estimate of the shifted Borda score $s_t(i)$.

Lemma 4. For all $t \in [T]$ and $i \in [K]$ it holds that $\mathbb{E}[\tilde{s}_t(i)] = s_t(i)$.

Let $H_{t-1} := (q_1, P_1, (x_1,y_1), o_1, \ldots, q_t, P_t)$ denote the history up to time $t$. We compute the expected instantaneous regret at time $t$ as a function of the true shifted Borda scores at time $t$.

Lemma 5. For all $t \in [T]$ it holds that $\mathbb{E}_{H_t}\big[q_t^\top \tilde{s}_t\big] = \mathbb{E}_{H_{t-1}}\big[\mathbb{E}_{x \sim q_t}[s_t(x) \mid H_{t-1}]\big]$.

Finally, we bound the second moment of our estimates.

Lemma 6. For all $t \in [T]$ it holds that $\mathbb{E}\big[\sum_{i=1}^{K} q_t(i)\, \tilde{s}_t(i)^2\big] \leq K/\gamma$.

Proof overview. We upper bound $R^s_T$, the shifted Borda score regret, and recall that $R_T = \tfrac{K}{K-1} R^s_T$. Note that $\mathbb{E}_{H_T}[s_t(x_t) + s_t(y_t)] = \mathbb{E}_{H_{t-1}}\big[\mathbb{E}_{x \sim q_t}[2 s_t(x) \mid H_{t-1}]\big]$, since $x_t$ and $y_t$ are i.i.d. Further note that we can write
$$\mathbb{E}_{H_T}[R^s_T] = \mathbb{E}_{H_T}\Big[\sum_{t=1}^{T} \big[s_t(i^\star) - \tfrac{1}{2}(s_t(x_t) + s_t(y_t))\big]\Big] = \max_{k \in [K]} \mathbb{E}_{H_T}\Big[\sum_{t=1}^{T} \big[s_t(k) - \tfrac{1}{2}(s_t(x_t) + s_t(y_t))\big]\Big],$$
where the last equality holds since we assume the $P_t$ are chosen obliviously, and so $i^\star$ does not depend on the learning algorithm. Thus we can rewrite:
$$\mathbb{E}_{H_T}[R^s_T] = \sum_{t=1}^{T} s_t(i^\star) - \sum_{t=1}^{T} \mathbb{E}_{H_{t-1}}\big[\mathbb{E}_{x \sim q_t}[s_t(x) \mid H_{t-1}]\big].$$
Now, as $\eta\, \tilde{s}_t(i) \leq \eta K/\gamma^2$ (from Lemma 3), for any $\gamma \geq \sqrt{\eta K}$ and $\eta > 0$ we have $\eta\, \tilde{s}_t(i) \in [0,1]$. From the regret guarantee of the standard Multiplicative Weights algorithm (Bubeck & Cesa-Bianchi, 2012) over the completely observed fixed sequence of reward vectors $\tilde{s}_1, \tilde{s}_2, \ldots, \tilde{s}_T$, we have for any $k \in [K]$:
$$\sum_{t=1}^{T} \tilde{s}_t(k) - \sum_{t=1}^{T} \tilde{q}_t^\top \tilde{s}_t \leq \frac{\log K}{\eta} + \eta \sum_{t=1}^{T} \sum_{i=1}^{K} \tilde{q}_t(i)\, \tilde{s}_t(i)^2.$$
Note that $\tilde{q}_t := (q_t - \tfrac{\gamma}{K})/(1-\gamma)$. Let $i^\star = \arg\max_{k \in [K]} \sum_{t=1}^{T} s_t(k) = \arg\max_{k \in [K]} \sum_{t=1}^{T} b_t(k)$. Taking expectation on both sides of the above inequality for $k = i^\star$, we get:
$$\sum_{t=1}^{T} \mathbb{E}_{H_T}[\tilde{s}_t(i^\star)] - \sum_{t=1}^{T} \mathbb{E}_{H_T}\big[\tilde{q}_t^\top \tilde{s}_t\big] \leq \frac{\log K}{\eta} + \eta\, \mathbb{E}_{H_T}\Big[\sum_{t=1}^{T} \sum_{i=1}^{K} \tilde{q}_t(i)\, \tilde{s}_t(i)^2\Big],$$
which, by applying Lemma 4, Lemma 5 and Lemma 6 together with the facts that $s_t(k) \leq 1$ and $\gamma = \sqrt{\eta K}$, gives
$$\mathbb{E}_{H_T}[R^s_T] \leq \frac{\log K}{\eta} + 2T\sqrt{\eta K} \leq 3 (K \log K)^{1/3} T^{2/3},$$
where the second inequality follows by optimizing over $\eta$. The theorem follows since $R_T = \tfrac{K}{K-1} R^s_T \leq 2 R^s_T$. A complete proof is given in the supplementary material.
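The following is a minimal, non-authoritative Python sketch of D-EXP3, assuming the score-estimate normalization reconstructed in Algorithm 1 above; the tuning of $\eta$ and $\gamma$ follows Theorem 2, and the function name is ours.

```python
import numpy as np

def dueling_exp3(pref_matrices, rng=None):
    """A sketch of D-EXP3 (Algorithm 1): exponential weights over estimated shifted
    Borda scores, mixed with gamma-uniform exploration."""
    rng = np.random.default_rng(rng)
    T = len(pref_matrices)
    K = pref_matrices[0].shape[0]
    eta = (np.log(K) / (np.sqrt(K) * T)) ** (2.0 / 3.0)    # Theorem 2 tuning (as reconstructed)
    gamma = np.sqrt(eta * K)
    cum_scores = np.zeros(K)       # running sum of score estimates, sum_{tau <= t} s~_tau(i)
    duels = []
    for t in range(T):
        w = np.exp(eta * (cum_scores - cum_scores.max()))  # stabilized exponential weights
        q = (1.0 - gamma) * w / w.sum() + gamma / K        # gamma-uniform exploration mix
        x, y = rng.choice(K, size=2, replace=True, p=q)    # sample the duel i.i.d. from q_t
        o = float(rng.random() < pref_matrices[t][x, y])   # o_t = 1 iff x_t wins the duel
        s_hat = np.zeros(K)
        s_hat[x] = o / (K * q[x] * q[y])                   # unbiased estimate of s_t(x_t)
        cum_scores += s_hat
        duels.append((int(x), int(y)))
    return duels
```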
### 3.2. High Probability Regret Analysis

We can show that a slightly modified version of Dueling-EXP3 leads to a high probability regret bound for the same setup. (This is inspired by the EXP3.P algorithm (Auer et al., 2002b).) The modified algorithm runs almost identically to Algorithm 1, except that we now use a different score estimate $\tilde{s}'_t(i)$ in place of $\tilde{s}_t(i)$, where $\tilde{s}'_t(i) = \tilde{s}_t(i) + \beta/q_t(i)$ and $\beta \in (0,1)$ is a tuning parameter. The item weights $q_t \in \Delta_{[K]}$ are now similarly updated using an exponential weights update on these modified score estimates, along with a $\gamma$-uniform exploration. The complete algorithm is described in Algorithm 2.

Algorithm 2: Dueling-EXP3 (High Probability)
1: Input: item set $[K]$, learning rate $\eta > 0$, parameters $\beta \in (0,1)$, $\gamma \in (0,1)$
2: Initialize: initial distribution $q_1(i) = \tfrac{1}{K}$, $\forall i \in [K]$
3: for $t = 1, 2, \ldots$ do
4: Sample $x_t, y_t \sim q_t$ (i.i.d., with replacement)
5: Receive preference $o_t(x_t, y_t) \sim \mathrm{Ber}(P_t(x_t, y_t))$
6: Compute, $\forall i \in [K]$:
$$\tilde{s}'_t(i) = \frac{\mathbf{1}(x_t = i)}{K\, q_t(i)} \sum_{j \in [K]} \frac{\mathbf{1}(y_t = j)\, o_t(x_t, y_t)}{q_t(j)} + \frac{\beta}{q_t(i)}$$
7: Update, $\forall i \in [K]$:
$$\tilde{q}_{t+1}(i) = \frac{\exp\big(\eta \sum_{\tau=1}^{t} \tilde{s}'_\tau(i)\big)}{\sum_{j=1}^{K} \exp\big(\eta \sum_{\tau=1}^{t} \tilde{s}'_\tau(j)\big)}, \qquad q_{t+1}(i) = (1-\gamma)\, \tilde{q}_{t+1}(i) + \frac{\gamma}{K}$$
8: end for

We now prove a high probability regret bound for Algorithm 2:

Theorem 7. Given any $T$ and $\delta > 0$, there exists a setting of $\gamma$, $\beta$ and $\eta$ such that, with probability at least $1-\delta$, the regret of the modified D-EXP3 algorithm is $R_T = \tilde{O}(K^{1/3}T^{2/3})$.

The proof builds on the following steps. Similarly to our estimates $\tilde{s}_t(i)$ above, we can show the following properties.

Lemma 8. For any item $i$ and round $t \in [T]$, we have $\tilde{s}'_t(i) \leq K/\gamma^2 + K\beta/\gamma$.

Lemma 9. For any item $i$ and round $t \in [T]$, it holds that $\mathbb{E}[\tilde{s}'_t(i) \mid H_{t-1}] = s_t(i) + \beta/q_t(i)$.

However, unlike $\tilde{s}_t(i)$, the adjusted score estimates $\tilde{s}'_t(i)$ are no longer unbiased for the true scores $s_t(i)$, being larger in expectation by $\beta/q_t(i)$. Nevertheless, this does not hurt the regret analysis, as its key element lies in showing that for any item $i \in [K]$, the cumulative estimated scores are not too far from the accumulated true scores. Precisely, the next lemma ensures a high-confidence upper bound on the cumulative scores $\sum_{t=1}^{T} s_t(i)$, and thus we can upper bound the learner's performance in terms of the estimated scores $\tilde{s}'_t$ (instead of $s_t$).

Lemma 10. For any $i \in [K]$, $\delta \in (0,1)$ and $\beta, \gamma \in (0,1)$, with probability at least $1-\delta$, we have
$$\sum_{t=1}^{T} s_t(i) \;\leq\; \sum_{t=1}^{T} \tilde{s}'_t(i) + \frac{\log(1/\delta)}{\beta}.$$

Incorporating this idea, the rest of the analysis closely follows that of Theorem 2. See the complete proof in the supplementary material.
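As a small illustration (again assuming the estimator normalization reconstructed above), the only change in Algorithm 2 is the per-round score estimate, which adds an EXP3.P-style bonus $\beta/q_t(i)$ to every item; the helper name is ours.

```python
import numpy as np

def hp_score_estimate(x, y, o, q, beta):
    """Score estimate of Algorithm 2 as we read it: the Algorithm 1 estimate for the
    sampled item x, plus a bonus beta/q_t(i) for every item i, matching Lemma 9:
    E[s'_t(i) | H_{t-1}] = s_t(i) + beta / q_t(i)."""
    K = len(q)
    s_hat = np.zeros(K)
    s_hat[x] = o / (K * q[x] * q[y])   # unbiased part of the estimate
    return s_hat + beta / q            # optimistic inflation used for the high-probability bound
```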
## 4. Fixed-Gap Adversarial Dueling Bandits

In this section we study an adversarial setting with a fixed gap of $\Delta > 0$, and give an algorithm with regret $O((K \log(KT))/\Delta^2)$. In this case, our algorithm is based on confidence intervals of the estimated average Borda scores. The algorithm has two phases. In the first phase, it samples uniformly at random two different items and observes the outcome of their duel; in the second phase, it commits to a specific single item $\hat{i}$, which it uses in all remaining rounds (for both items). The algorithm moves to its second phase when it detects an item $\hat{i}$ whose lower confidence bound (LCB) is larger than the upper confidence bound (UCB) of every other item $j$. The complete description is given in Algorithm 3.

Because of the non-stationary nature of the item preferences, and unlike classical action-elimination algorithms (Auer, 2000; Even-Dar et al., 2006), we still need to maintain an unbiased estimate of the Borda score of every item at every round. (In contrast, in the stochastic dueling bandit problem (Zoghi et al., 2014a), for any fixed item $i \in [K]$, an unbiased estimate of its Borda score at round $t$ is also an unbiased estimate for any other round $s \neq t$; this simplifying condition does not hold in our fixed-gap adversarial model.) Towards this, we maintain an estimate of the Borda score of any item $i \in [K]$ at any round $t$ as $\hat{b}_t(i) := K\, \mathbf{1}(x_t = i)\, o_t(x_t, y_t)$, and show that it is an unbiased estimator.

Lemma 11. At any round $t$, we have $\mathbb{E}_{H_t}[\hat{b}_t(i)] = b_t(i)$ for all $i \in [K]$.

Thus, an unbiased estimate of the $t$-step average Borda score $\bar{b}_t(i)$ is $\tilde{b}_t(i) := \frac{1}{t}\sum_{\tau=1}^{t} \hat{b}_\tau(i)$. We further maintain confidence intervals $[\mathrm{LCB}(i;t), \mathrm{UCB}(i;t)]$ around each $\tilde{b}_t(i)$, within which the means $\bar{b}_t(i)$ lie with high probability.

Lemma 12. With probability $1-\delta$, we have $\bar{b}_t(i) \in [\mathrm{LCB}(i;t), \mathrm{UCB}(i;t)]$ for all $i \in [K]$ and $t \in [T]$.

The proof uses Bernstein's inequality to show that the estimates $\tilde{b}_t(i)$ are concentrated around their means $\bar{b}_t(i)$, within the respective confidence intervals. Assuming these confidence bounds hold, as soon as we find an item $\hat{i} \in [K]$ such that $\mathrm{LCB}(\hat{i};t) > \mathrm{UCB}(j;t)$ for any other item $j \neq \hat{i}$, we are guaranteed that $\hat{i}$ is the best item (in hindsight), i.e., $\hat{i} = i^\star$. In the remaining rounds $t+1, \ldots, T$, we play only item $\hat{i}$ (for both items) and suffer no regret. This results in the algorithm detailed in Algorithm 3.

Theorem 13. Given any $\delta > 0$, with probability at least $1-\delta$, the regret of Algorithm 3 (with parameter $\delta$) is upper bounded by $64 (K/\Delta^2) \log(2KT/\delta)$.

We remark that, unlike most MAB algorithms, we do not gain by incremental elimination. The reason is that we need to sample a second random item, $y_t$, whose expected Borda score equals the average Borda score. This random item implies a constant regret per round until we identify $\hat{i}$. After we identify $\hat{i}$, with high probability, we do not incur any regret.

Algorithm 3: Borda-Confidence-Bound (BCB)
1: Input: item set indexed by $[K]$, confidence $\delta > 0$
2: for $t = 1, \ldots, T$ do
3: Select $x_t, y_t \in [K]$, $x_t \neq y_t$, uniformly at random
4: Receive preference $o_t(x_t, y_t) \sim \mathrm{Ber}(P_t(x_t, y_t))$
5: Estimate: $\hat{b}_t(i) = K\, o_t(x_t, y_t)\, \mathbf{1}(x_t = i)$, $\forall i \in [K]$
6: Compute: $\tilde{b}_t(i) = \frac{1}{t}\sum_{\tau=1}^{t} \hat{b}_\tau(i)$, $\forall i \in [K]$
7: Compute: $\mathrm{LCB}(i;t) = \tilde{b}_t(i) - c_t$ and $\mathrm{UCB}(i;t) = \tilde{b}_t(i) + c_t$, where $c_t$ is a Bernstein-type confidence width of order $\sqrt{(K/t)\log(KT/\delta)}$
8: if $\exists\, \hat{i} \in [K]$ s.t. $\mathrm{LCB}(\hat{i};t) > \mathrm{UCB}(j;t)$ for all $j \neq \hat{i}$, then break
9: end for
10: Play $(\hat{i}, \hat{i})$ for the rest of the rounds $t+1, \ldots, T$.
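Below is a minimal Python sketch of the BCB loop under our reading of Algorithm 3. The explicit confidence width is only a guess of the stated order (the exact constants are not reproduced here), and the function name is ours.

```python
import numpy as np

def borda_confidence_bound(pref_matrices, delta, rng=None):
    """A sketch of Algorithm 3 (BCB): uniform exploration duels until one item's LCB
    exceeds every other item's UCB, then commit to that item for the remaining rounds."""
    rng = np.random.default_rng(rng)
    T = len(pref_matrices)
    K = pref_matrices[0].shape[0]
    cum_est = np.zeros(K)                                    # running sum of b^_tau(i) = K*1(x=i)*o
    for t in range(1, T + 1):
        x, y = rng.choice(K, size=2, replace=False)          # uniform duel with x != y
        o = float(rng.random() < pref_matrices[t - 1][x, y])
        cum_est[x] += K * o
        avg = cum_est / t                                    # b~_t(i), estimate of the average Borda score
        width = 2.0 * np.sqrt(K * np.log(2 * K * T / delta) / t)  # guessed Bernstein-style width
        lcb, ucb = avg - width, avg + width
        leader = int(np.argmax(lcb))
        if lcb[leader] > np.max(np.delete(ucb, leader)):     # leader separated: end phase 1
            return leader                                     # phase 2: play (leader, leader) onwards
    return int(np.argmax(cum_est))
```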
## 5. Lower Bounds

This section derives lower bounds for the adversarial dueling bandit settings. Theorem 15 and Theorem 16 respectively give the regret lower bounds for the fixed-gap and the general adversarial setting. We first prove the following key lemma before proceeding to the individual lower bounds.

Lemma 14. For the problem of Adversarial Dueling Bandits with the Borda score objective, for any learning algorithm $A$ and any $\epsilon \in (0, 0.1]$, there exists a problem instance (a sequence of preference matrices $P_1, P_2, \ldots, P_T$) such that the expected regret incurred by $A$ on that instance is at least $\Omega(\min(\epsilon T, K/\epsilon^2))$, for any $K \geq 4$.

Proof outline. The proof of the lemma has the following outline. We initially construct a stochastic preference matrix $P_0$, and later we consider perturbations of it. We start by describing $P_0$. We split the items into two equal-size subsets $K_g$ and $K_b$. Any two items $i, j \in K_g$ are equally likely to win or lose in $P_0$, i.e., $P_0(i,j) = 1/2$. Similarly, for any $i, j \in K_b$ we have $P_0(i,j) = 1/2$. When we pick item $i \in K_g$ and item $j \in K_b$, then item $i$ wins with probability $0.9$, i.e., $P_0(i,j) = 0.9$. This implies that the (shifted) Borda score of any $i \in K_g$ is $s(i) = 0.7$, and for any $j \in K_b$ it is $s(j) = 0.3$. Note that in $P_0$ all the items in $K_g$ have the highest Borda score.

The main idea of the proof is that we will introduce a perturbation that will make one item $i \in K_g$ have the highest Borda score. Formally, for each $i \in K_g$ we have a preference matrix $P_i$. The only difference between $P_i$ and $P_0$ is in the entries of $i \in K_g$, where for any $j \in K_b$ we have $P_i(i,j) = 0.9 + \epsilon$. We select our stochastic preference matrix at random from all the $P_i$ with $i \in K_g$, and denote by $i^\star$ the selected index. More explicitly, the following shows the form of $P_1$:
$$P_1 = \begin{pmatrix}
0.5 & \cdots & 0.5 & 0.9+\epsilon & \cdots & 0.9+\epsilon \\
\vdots & \ddots & \vdots & \vdots & \ddots & \vdots \\
0.5 & \cdots & 0.5 & 0.9 & \cdots & 0.9 \\
0.1-\epsilon & \cdots & 0.1 & 0.5 & \cdots & 0.5 \\
\vdots & \ddots & \vdots & \vdots & \ddots & \vdots \\
0.1-\epsilon & \cdots & 0.1 & 0.5 & \cdots & 0.5
\end{pmatrix}$$

A key observation is that in order to determine the item with the best Borda score, we need to match items $i \in K_g$ with items $j \in K_b$, since the expected outcome of other comparisons is known. However, each time we match an item $i \in K_g$ with an item $j \in K_b$ we suffer a constant regret of about $0.2 - O(\epsilon) = \Theta(1)$. We will need $\Omega(|K_g|/\epsilon^2)$ samples to distinguish a bias of $\epsilon$ in the Borda score of $i^\star \in K_g$ compared to the other items $i \in K_g$. This leads to a regret of $\Omega(K/\epsilon^2)$. If, with some constant probability, we do not identify the item with the best Borda score, we will have a regret of at least $\Omega(\epsilon T)$. This follows since any sub-optimal item has regret at least $\Omega(\epsilon)$ per time step.

We remark that the lower bound holds for $K = 3$ with an almost identical proof. (Technically, our lower bound requires that $K$ is even, but this is only for ease of presentation.) On the other hand, for $K = 2$ the true regret bound scales as $\Theta(1/\epsilon)$, since when we match the (only) two items we have a regret of only $\epsilon/2$. Finally, there is an additional logarithmic dependency on the time horizon, which our lower bound does not capture.

Lower bound for the fixed-gap setting. In this case, given any fixed $\Delta > 0$, Theorem 15 shows a lower bound of $\Omega(K/\Delta^2)$. The proof follows from Lemma 14 by setting $\epsilon = \Delta$.

Theorem 15. Fix any $\Delta \in (0, 0.1)$ and $K \geq 4$. For the fixed-gap setting, for any learning algorithm $A$, there exists an instance with fixed gap $\Delta$ such that the expected regret incurred by $A$ on that instance is at least $\Omega(\min(\Delta T, K/\Delta^2))$.

The regret bound in this setting scales as $K/\Delta^2$, compared to $K/\Delta$ for MAB. The reason is that in order to distinguish between near-optimal items, the learner must compare them to significantly suboptimal items, which leads to the increase in the regret. Essentially, the regret bound is identical to the sample complexity bound in our lower bound instance.

Lower bound for the general adversarial setup. In this general case, since $\{P_t\}_{t \in [T]}$ could be any arbitrary sequence, the adversary has the provision to tune $\epsilon$ based on $T$. Precisely, given any $K$ and $T$, the adversary here can set $\epsilon = \Theta(K^{1/3}/T^{1/3})$. For any $T \geq K$ we guarantee that $\epsilon \in (0, 0.1]$ and apply Lemma 14. For $T < K$ we clearly have a lower bound of $\Omega(T)$, since we need to sample each item at least once. Therefore, for this general setup, we derive the following lower bound of $\Omega(K^{1/3}T^{2/3})$.

Theorem 16. For the problem of Adversarial Dueling Bandits with the Borda score objective, for any learning algorithm $A$, there exists a problem instance Adv-Borda(K, T) with $T \geq K$, $K \geq 4$, and a sequence of preference matrices $P_1, P_2, \ldots, P_T$, such that the expected regret incurred by $A$ on Adv-Borda(K, T) is at least $\Omega(K^{1/3}T^{2/3})$.

Note that the lower bound of $\Omega(T^{2/3})$ stems from the fact that we essentially cannot mix exploration and exploitation, at least in our lower bound instance. Namely, while we are searching for the item with the best Borda score, we suffer a constant regret per time step. If we settle on any sub-optimal item, we get a regret of $\Omega(\epsilon T) = \Omega(T^{2/3})$, due to the selection of $\epsilon$.
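For concreteness, here is a small Python sketch of the perturbed preference matrix $P_{i^\star}$ from the construction above, as we read it; the function name and the `starred` argument are ours (with `starred = 0` giving $P_1$).

```python
import numpy as np

def lower_bound_instance(K, eps, starred=0):
    """Perturbed instance from Section 5: items split into K/2 'good' and K/2 'bad';
    good beats bad w.p. 0.9 (ties of 0.5 within each group), and the starred good item
    gets an extra eps edge over every bad item. Assumes K is even, as in the paper."""
    assert K % 2 == 0 and 0 <= starred < K // 2
    half = K // 2
    P = np.full((K, K), 0.5)
    P[:half, half:] = 0.9              # good (rows) vs bad (columns)
    P[half:, :half] = 0.1              # bad vs good, keeping P(i, j) = 1 - P(j, i)
    P[starred, half:] = 0.9 + eps      # the perturbed row of the starred good item
    P[half:, starred] = 0.1 - eps      # mirror the perturbation in the bad items' rows
    return P
```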
## 6. Experiments

In this section we present an empirical evaluation of our proposed algorithm Dueling-EXP3 and compare its performance with the only other existing adversarial dueling bandit algorithm, REX3, although the latter is known to work only under the restricted class of linear utility-based preferences (Gajane et al., 2015; Ailon et al., 2014). In more detail, we run our experiments with the following setup.

Algorithms. (1) Dueling-EXP3: as introduced in Section 3, with parameters tuned according to Theorem 2. (2) REX3: as introduced in Gajane et al. (2015). Note that their suggested optimal tuning of the parameters, i.e., the uniform exploration rate $\gamma$ as well as the learning rate $\eta$, requires knowledge of a problem-dependent parameter $\tau$, namely the algorithm's expected loss (regret) with respect to a random strategy (see Thm. 1 of Gajane et al., 2015), which is unknown to the learner. We used $T$ in place of $\tau$ henceforth; however, other settings of $\tau$ give similar outcomes. (3) Random: a naive baseline that plays an arbitrary duel at each round.

Performance Measures. In all cases, we report the cumulative regret of the algorithms averaged over 500 runs.

Environments: Adversarial preferences. We consider $K = 20$ and generate the sequence of adversarial preference matrices as follows:

(1) Switching Borda, or SB($\tau$). We generate the preference sequence such that the best-performing Borda winner changes after every epoch of length $\tau$, by appropriately tweaking the entries of the current preference matrix at time $t$. Precisely, we manipulate the entries carefully to make sure the new Borda winner is always selected from one of the first 10 arms and is different from the latest Borda winner (of the matrix $P_{t-1}$). Towards this, upon swapping the matrix entries if needed, we randomly select a row $i$ from $[10]$ (such that $i \neq$ Borda-winner($P_{t-1}$)), and iteratively increase the row entries $P_t(i,j)$ for all $j \neq i$ in a round-robin fashion (up to a threshold of 1), subsequently resetting $P_t(j,i) = 1 - P_t(i,j)$, until $i$ becomes the new Borda winner of $P_t$.

(2) Random-walk preferences, or RW($\nu$). In the literature on adversarial multi-armed bandits, one popular technique to generate an adversarial loss sequence is through random walks (Neu & Valko, 2014; Saha et al., 2020). Taking cues, we generate the sequence of preferences $P_t(i,j)$ for each pair of arms $(i,j)$ as random walks with increments $\pm\nu$, taking an up-step with some randomly chosen probability $q \in (0.2, 0.8)$, where each $P_1(i,j)$ is initialized uniformly on $[0,1]$ for all $i, j$. Any values that fall outside $[0,1]$ in the process are truncated back to $[0,1]$ (a small sketch of this generator appears below).

(3) Lower Bound instance, or LB($\epsilon$). Our lower bound preference instance $P_1$, parameterized by $\epsilon \in (0, 0.5)$ (see Section 5).

The explicit values used for $\tau$, $\nu$, $\epsilon$ are specified in the corresponding figures.
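A minimal sketch of the RW($\nu$) generator as we read the description above (the exact truncation and step-probability details may differ from the authors' implementation); the function name is ours.

```python
import numpy as np

def random_walk_preferences(K, T, nu, rng=None):
    """RW(nu) environment sketch: each upper-triangular entry starts uniform in [0,1]
    and performs a +/-nu random walk (per-pair up-step probability drawn from (0.2, 0.8)),
    truncated to [0,1]; the lower triangle mirrors it so that P(j, i) = 1 - P(i, j)."""
    rng = np.random.default_rng(rng)
    iu = np.triu_indices(K, k=1)
    vals = rng.uniform(0.0, 1.0, size=len(iu[0]))      # P_1(i, j) for i < j
    q = rng.uniform(0.2, 0.8, size=len(iu[0]))         # per-pair up-step probability
    matrices = []
    for _ in range(T):
        P = np.full((K, K), 0.5)
        P[iu] = vals
        P[(iu[1], iu[0])] = 1.0 - vals                 # enforce the skew-symmetry constraint
        matrices.append(P)
        steps = np.where(rng.random(len(vals)) < q, nu, -nu)
        vals = np.clip(vals + steps, 0.0, 1.0)         # truncate back to [0, 1]
    return matrices
```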
### 6.1. Cumulative regret over time

We first conduct a set of experiments to compare the regret performance of the three algorithms on the two problem instances SB(500) and RW(0.01), as shown in Fig. 1.

Figure 1. Averaged cumulative regret over time.

Remark. As shown in Fig. 1, our algorithm Dueling-EXP3 outperforms REX3 on both instances. This is expected, since the latter is guaranteed to work only under linear utility-based adversarial preference models, whereas we have constructed completely adversarial preference matrices through the SB and RW instances. Also, both of the above algorithms perform better than the naive Random duel-selection baseline.

### 6.2. Regret vs. Varying Item-size (K)

We also conduct a set of experiments varying the item set size $K$ over a range ($K = 10$ to $100$). We report the final cumulative regret of all algorithms vs. $K$ on the LB(0.1) instance, as specified in Fig. 2.

Figure 2. Final regret (at T = 10K) with increasing arm size K.

Remark. In terms of the comparative regret performance of the three algorithms, Fig. 2 shows the same trend as reflected in the first set of experiments (Fig. 1): Dueling-EXP3 performs best, followed by REX3, and Random is the worst. Additionally, Fig. 2 shows that with increasing $K$ but a fixed gap (which we ensured with our LB(0.1) instance construction, keeping the gap $\epsilon = 0.1$ fixed for all $K$), the regret of all the algorithms scales up with increasing $K$, as expected and as justified by Theorem 2.

## 7. Conclusion and Future Scope

We considered the problem of dueling bandits with adversarial preferences, i.e., adversarial dueling bandits. To the best of our knowledge, this work is the first to consider the dueling bandit problem in a fully adversarial setup. (The work of Gajane et al. (2015) introduced adversarial utility-based dueling bandits with a transfer function, which has very different characteristics, as we discussed earlier.) We proposed algorithms for online regret minimization with Borda scores. We gave an $\tilde{O}(K^{1/3}T^{2/3})$ regret algorithm (Dueling-EXP3) for the problem, and also showed the optimality of our bound with a matching $\Omega(K^{1/3}T^{2/3})$ lower bound analysis. We also proved a similar high probability regret bound. Finally, for an intermediate fixed-gap adversarial setup, which bridges the gap between stochastic and adversarial dueling bandits, we gave an $O((K/\Delta^2)\log T)$ regret algorithm, Borda-Confidence-Bound, and also a corresponding regret lower bound of $\Omega(K/\Delta^2)$.

Moving forward, one can potentially address many open threads along this direction; for example, considering other general notions of regret performance, considering the problem on larger (potentially infinite) arm spaces, or even analyzing dynamic regret for adversarial preferences (Besbes et al., 2019; Luo et al., 2017). A few more open questions are the following. In the case of more structured utility-based preferences (e.g., the Plackett-Luce preference model (Azari et al., 2012)), where the item utility scores are chosen adversarially at every round, is it possible to show an improved performance limit of $\Theta(\sqrt{KT})$? In such cases, how does the learning rate vary with $K$ and $T$ for general subset-wise preferences (i.e., where more than two items can be compared at every round and the learner receives winner feedback for the subset played) (Brost et al., 2016; Ren et al., 2018)? Another interesting direction would be to understand the connection of this problem with other bandit setups, e.g., learning with feedback graphs (Alon et al., 2015; 2017) or other side information (Mannor & Shamir, 2011; Kocak et al., 2014).
## Acknowledgments

The work of YM has received funding from the European Research Council (ERC) under the European Union's Horizon 2020 research and innovation program (grant agreement No. 882396), by the Israel Science Foundation (grant number 993/17), and by the Yandex Initiative for Machine Learning at Tel Aviv University. TK was supported in part by the Israeli Science Foundation (ISF) grant no. 2549/19, by the Len Blavatnik and the Blavatnik Family Foundation, and by the Yandex Initiative in Machine Learning.

## References

Ailon, N., Karnin, Z. S., and Joachims, T. Reducing dueling bandits to cardinal bandits. In ICML, volume 32, pp. 856-864, 2014.

Alon, N., Cesa-Bianchi, N., Dekel, O., and Koren, T. Online learning with feedback graphs: Beyond bandits. In Annual Conference on Learning Theory, volume 40. Microtome Publishing, 2015.

Alon, N., Cesa-Bianchi, N., Gentile, C., Mannor, S., Mansour, Y., and Shamir, O. Nonstochastic multi-armed bandits with graph-structured feedback. SIAM Journal on Computing, 46(6):1785-1826, 2017.

Auer, P. Using upper confidence bounds for online learning. In Proceedings of the 41st Annual Symposium on Foundations of Computer Science, pp. 270-279. IEEE, 2000.

Auer, P., Cesa-Bianchi, N., and Fischer, P. Finite-time analysis of the multiarmed bandit problem. Machine Learning, 47(2-3):235-256, 2002a.

Auer, P., Cesa-Bianchi, N., Freund, Y., and Schapire, R. E. The nonstochastic multiarmed bandit problem. SIAM Journal on Computing, 32(1):48-77, 2002b.

Azari, H., Parkes, D., and Xia, L. Random utility theory for social choice. In Advances in Neural Information Processing Systems, pp. 126-134, 2012.

Balsubramani, A., Karnin, Z., Schapire, R. E., and Zoghi, M. Instance-dependent regret bounds for dueling bandits. In Conference on Learning Theory, pp. 336-360, 2016.

Besbes, O., Gur, Y., and Zeevi, A. Optimal exploration-exploitation in a multi-armed bandit problem with non-stationary rewards. Stochastic Systems, 9(4):319-337, 2019.

Brost, B., Seldin, Y., Cox, I. J., and Lioma, C. Multi-dueling bandits and their application to online ranker evaluation. CoRR, abs/1608.06253, 2016.

Bubeck, S. and Cesa-Bianchi, N. Regret analysis of stochastic and nonstochastic multi-armed bandit problems. Foundations and Trends in Machine Learning, 5(1):1-122, 2012.

Bubeck, S. and Slivkins, A. The best of both worlds: Stochastic and adversarial bandits. In Conference on Learning Theory, pp. 42.1-42.23, 2012.

Busa-Fekete, R. and Hüllermeier, E. A survey of preference-based online learning with bandit algorithms. In International Conference on Algorithmic Learning Theory, pp. 18-39. Springer, 2014.

Cesa-Bianchi, N. and Lugosi, G. Prediction, Learning, and Games. Cambridge University Press, 2006.

Dudík, M., Hofmann, K., Schapire, R. E., Slivkins, A., and Zoghi, M. Contextual dueling bandits. In Conference on Learning Theory, pp. 563-587, 2015.

Even-Dar, E., Mannor, S., and Mansour, Y. Action elimination and stopping conditions for the multi-armed bandit and reinforcement learning problems. Journal of Machine Learning Research, 7:1079-1105, 2006.

Falahatgar, M., Hao, Y., Orlitsky, A., Pichapati, V., and Ravindrakumar, V. Maxing and ranking with few assumptions. In Advances in Neural Information Processing Systems, pp. 7063-7073, 2017.

Gajane, P., Urvoy, T., and Clérot, F. A relative exponential weighing algorithm for adversarial utility-based dueling bandits. In Proceedings of the 32nd International Conference on Machine Learning, pp. 218-227, 2015.

Jamieson, K. G., Katariya, S., Deshpande, A., and Nowak, R. D. Sparse dueling bandits. In AISTATS, 2015.

Kocak, T., Neu, G., Valko, M., and Munos, R. Efficient learning by implicit exploration in bandit problems with side observations. In Advances in Neural Information Processing Systems, pp. 613-621, 2014.
Komiyama, J., Honda, J., Kashima, H., and Nakagawa, H. Regret lower bound and optimal algorithm in dueling bandit problem. In COLT, pp. 1141-1154, 2015.

Komiyama, J., Honda, J., and Nakagawa, H. Copeland dueling bandit problem: Regret lower bound, optimal algorithm, and computationally efficient algorithm. arXiv preprint arXiv:1605.01677, 2016.

Lattimore, T. and Szepesvári, C. Bandit Algorithms. Preprint, 2018.

Luo, H., Wei, C.-Y., Agarwal, A., and Langford, J. Efficient contextual bandits in non-stationary worlds. arXiv preprint arXiv:1708.01799, 2017.

Mannor, S. and Shamir, O. From bandits to experts: On the value of side-observations. In Advances in Neural Information Processing Systems, pp. 684-692, 2011.

Neu, G. Explore no more: Improved high-probability regret bounds for non-stochastic bandits. In Advances in Neural Information Processing Systems, pp. 3168-3176, 2015.

Neu, G. and Valko, M. Online combinatorial optimization with stochastic decision sets and adversarial losses. In Advances in Neural Information Processing Systems, pp. 2780-2788, 2014.

Ramamohan, S. Y., Rajkumar, A., and Agarwal, S. Dueling bandits: Beyond Condorcet winners to general tournament solutions. In Advances in Neural Information Processing Systems, pp. 1253-1261, 2016.

Ren, W., Liu, J., and Shroff, N. B. PAC ranking from pairwise and listwise queries: Lower bounds and upper bounds. arXiv preprint arXiv:1806.02970, 2018.

Saha, A. and Gopalan, A. Battle of bandits. In Uncertainty in Artificial Intelligence, 2018.

Saha, A. and Gopalan, A. PAC battling bandits in the Plackett-Luce model. In Algorithmic Learning Theory, pp. 700-737, 2019.

Saha, A., Gaillard, P., and Valko, M. Improved sleeping bandits with stochastic action sets and adversarial rewards. In International Conference on Machine Learning, pp. 8357-8366. PMLR, 2020.

Seldin, Y. and Lugosi, G. An improved parametrization and analysis of the EXP3++ algorithm for stochastic and adversarial bandits. arXiv preprint arXiv:1702.06103, 2017.

Seldin, Y. and Slivkins, A. One practical algorithm for both stochastic and adversarial bandits. In ICML, pp. 1287-1295, 2014.

Slivkins, A. Introduction to multi-armed bandits. Foundations and Trends in Machine Learning, 12(1-2):1-286, 2019.

Sui, Y., Zhuang, V., Burdick, J., and Yue, Y. Multi-dueling bandits with dependent arms. In Conference on Uncertainty in Artificial Intelligence, UAI '17, 2017.

Wu, H. and Liu, X. Double Thompson sampling for dueling bandits. In Advances in Neural Information Processing Systems, pp. 649-657, 2016.

Yue, Y. and Joachims, T. Interactively optimizing information retrieval systems as a dueling bandits problem. In Proceedings of the 26th Annual International Conference on Machine Learning, pp. 1201-1208. ACM, 2009.

Yue, Y. and Joachims, T. Beat the mean bandit. In Proceedings of the 28th International Conference on Machine Learning (ICML-11), pp. 241-248, 2011.

Yue, Y., Broder, J., Kleinberg, R., and Joachims, T. The k-armed dueling bandits problem. Journal of Computer and System Sciences, 78(5):1538-1556, 2012.

Zoghi, M., Whiteson, S., Munos, R., de Rijke, M., et al. Relative upper confidence bound for the k-armed dueling bandit problem. In JMLR Workshop and Conference Proceedings, number 32, pp. 10-18. JMLR, 2014a.

Zoghi, M., Whiteson, S. A., De Rijke, M., and Munos, R. Relative confidence sampling for efficient on-line ranker evaluation. In Proceedings of the 7th ACM International Conference on Web Search and Data Mining, pp. 73-82. ACM, 2014b.
Zoghi, M., Karnin, Z. S., Whiteson, S., and De Rijke, M. Copeland dueling bandits. In Advances in Neural Information Processing Systems, pp. 307-315, 2015a.

Zoghi, M., Whiteson, S., and de Rijke, M. MergeRUCB: A method for large-scale online ranker evaluation. In Proceedings of the Eighth ACM International Conference on Web Search and Data Mining, pp. 17-26. ACM, 2015b.