# Dueling Bandits with Adversarial Sleeping

Aadirupa Saha (Microsoft Research, New York, US; aasa@microsoft.com)
Pierre Gaillard (Univ. Grenoble Alpes, Inria, CNRS, Grenoble INP, LJK, 38000 Grenoble, France; pierre.gaillard@inria.fr)

35th Conference on Neural Information Processing Systems (NeurIPS 2021).

We introduce the problem of sleeping dueling bandits with stochastic preferences and adversarial availabilities (DB-SPAA). In almost all dueling bandit applications, the decision space changes over time, e.g., in retail store management, online shopping, restaurant recommendation, search engine optimization, etc. Surprisingly, this sleeping aspect of dueling bandits has never been studied in the literature. As in dueling bandits, the goal is to compete with the best arm by sequentially querying preference feedback on item pairs. The non-triviality, however, stems from the non-stationary item space, which allows arbitrary subsets of items to become unavailable in each round. The goal is to find an optimal no-regret policy that can identify the best available item at each round, as opposed to the standard fixed best-arm regret objective of dueling bandits. We first derive an instance-specific lower bound for DB-SPAA of $\Omega\big(\sum_{i=1}^{K-1}\sum_{j=i+1}^{K} \frac{\log T}{\Delta(i,j)}\big)$, where $K$ is the number of items and $\Delta(i,j)$ is the gap between items $i$ and $j$. This indicates that the sleeping problem with preference feedback is inherently more difficult than its counterpart for classical multi-armed bandits (MAB). We then propose two algorithms with near-optimal regret guarantees. Our results are corroborated empirically.

1 Introduction

The problem of Dueling Bandits has gained much attention in the machine learning community [34, 38, 36]. It is an online learning framework that generalizes the standard multi-armed bandit (MAB) [6] setting to identifying a set of good arms from a fixed decision space (set of arms/items) by querying preference feedback on actively chosen item pairs. More formally, in dueling bandits the learning proceeds in rounds: at each round, the learner selects a pair of arms and observes stochastic preference feedback revealing the winner of the comparison (duel) between the selected arms; the objective of the learner is to minimize the regret with respect to a best arm (or set of best arms) in hindsight. Towards this, several algorithms have been proposed [2, 37, 21, 14]. Due to the inherent exploration-vs-exploitation tradeoff of the learning framework and the several advantages of preference feedback [9, 35], many real-world applications can be modeled as dueling bandits, including movie recommendation, retail management, search engine optimization, job scheduling, etc.

However, in reality the decision space might change over time due to the non-availability of some items, which are then considered to be sleeping. This sleeping aspect of online decision-making problems has been widely studied in the standard multi-armed bandit (MAB) literature [17, 24, 15, 19, 18, 11]. There the goal is to learn a no-regret policy that maps any available (non-sleeping) subset of items to its best awake item, and the learner's performance is measured with respect to the optimal policy in hindsight. This setting is famously known as Sleeping Bandits in MAB [17, 24, 15, 11]. More discussion is given in Related Works. Surprisingly, however, the sleeping problem is completely unaddressed in the preference bandits literature, even for the special case of pairwise preference feedback, which is famously studied
as Dueling Bandits [37, 34], even though the setup of changing decision spaces is quite relevant in almost every practical application: be it retail stores where some items might go out of production over time, search engine optimization where some websites could be down on certain days, recommender systems where some restaurants might be closed or movies could be outdated, clinical trials where certain drugs could be out of stock, and many more.

This work is the first to consider the problem of Sleeping Dueling Bandits, where we formulate the stochastic K-armed dueling bandit problem with adversarial item availabilities. Here, at each round $t \in \{1, 2, \ldots, T\}$ the item preferences are generated from a fixed underlying (and of course unknown) preference matrix $P \in [0,1]^{K \times K}$; however, the set of available actions $S_t \subseteq \{1, 2, \ldots, K\}$ is assumed to be adversarially chosen by the environment. We call the problem Sleeping Dueling Bandits with Stochastic Preferences and Adversarial Availabilities, or in brief DB-SPAA(P, $\mathcal{S}_T$), where $\mathcal{S}_T = \{S_1, S_2, \ldots, S_T\}$ denotes the sequence of available subsets over $T$ rounds. We also assume that the preference matrix $P$ follows a total-ordering assumption to ensure the existence of a best item per available subset $S_t$. We describe the setting in Fig. 1, with a formal description in Sec. 2.

Figure 1: Setting of DB-SPAA(P, $\mathcal{S}_T$).
Parameters. Item set: $[K]$ (known); preference matrix: $P$ (unknown); available item sets: $\mathcal{S}_T$ (observed sequentially).
For $t = 1, 2, \ldots, T$, the learner:
- Observes $S_t \subseteq [K]$, the set of available items
- Chooses $(x_t, y_t) \in S_t^2$
- Observes $o_t := \mathbf{1}(x_t \succ y_t) \sim \mathrm{Ber}(P(x_t, y_t))$
- Incurs $r_t := \frac{1}{2}\big[P(i_t^*, x_t) + P(i_t^*, y_t) - 1\big]$, where $i_t^*$ is such that $\min_{j \in S_t} P(i_t^*, j) \geq 1/2$

Our specific contributions are as follows:

1. We first analyze the fundamental performance limit of the DB-SPAA(P, $\mathcal{S}_T$) problem in Sec. 3: Thm. 1 gives an instance-specific regret lower bound of $\Omega\big(\sum_{i=1}^{K-1}\sum_{j=i+1}^{K} \frac{\log T}{\Delta(i,j)}\big)$, with $\Delta(i,j)$ being the preference gap of item $i$ vs. $j$ (see Eqn. (2)). Our lower bound, which can be of order $\Omega(K^2 \log T/\Delta)$, with $\Delta := \min_{i,j}\Delta(i,j)$ being the worst-case gap, indicates that the problem of sleeping dueling bandits is inherently more difficult than standard sleeping bandits (MAB), unlike the non-sleeping case where both dueling bandits (with a total-ordering assumption on $P$) and MAB are known to have the same fundamental performance limit of $\Omega(K \log T/\Delta)$ (Rem. 1).

2. We next design a fixed-confidence regret algorithm, SlDB-UCB (Alg. 1), inspired by the pairwise upper confidence bound (UCB) based algorithm of [37]. However, due to the fixed-confidence and adversarial-sleeping nature of the problem, we need to maintain the pairwise confidence bounds per item differently (based on the availability sequence $\{S_t\}$), which makes the resulting algorithm and its subsequent analysis significantly different from standard UCB-based dueling bandit algorithms. Precisely, given any $\delta > 0$, SlDB-UCB achieves a regret of $O\big(\frac{K^3 \log(1/\delta)}{\Delta^2}\big)$ with probability at least $1-\delta$ over any problem instance of DB-SPAA(P, $\mathcal{S}_T$) (Sec. 4).

3. In Sec. 5, we design another computationally efficient algorithm, SlDB-ED (Alg. 2), with an expected regret guarantee.
Unlike the previous algorithm (SlDB-UCB), SlDB-ED uses empirical-divergence (ED) based measures to filter out a good set of arms, inspired by the idea of the RMED algorithm of [21] for standard dueling bandits; however, due to the sleeping nature of the items, it requires a different maintenance of the good arms, and the regret analysis of the algorithm requires the derivation of new results (as described in Sec. 5). The algorithm is shown to perform near-optimally, with an expected non-asymptotic regret upper bound (Thm. 6). Note that for any problem instance with constant suboptimality gaps $\Delta(i,j) = \Delta$ for all $i < j$, the regret bound of SlDB-ED is tight and matches the lower bound, ensuring the near-optimality of SlDB-ED in the worst case. Furthermore, a novelty of our finite-time regret analysis lies in showing a cleaner tradeoff between the regret and the availability sequence $\mathcal{S}_T$, which automatically adapts to the inherent hardness of the sequence of available subsets $\mathcal{S}_T$; in contrast, existing work on sleeping bandits with adversarial availabilities in the MAB setting [19] only gives a worst-case regret bound over all possible availability sequences (Rem. 3).

4. Finally, we corroborate our theoretical results with extensive empirical evaluations (Sec. 6).

Related Works. The problem of regret minimization for stochastic multi-armed bandits (MAB) is extremely well studied in the online learning literature [6, 1, 22, 5, 16], where the learner gets to see a noisy draw of absolute reward feedback upon playing a single arm per round. A well-motivated generalization of the MAB framework is Sleeping Bandits [17, 24, 18, 15], much studied in the online learning community, where at any round the set of available actions could vary stochastically based on some unknown distribution over the decision space of $K$ items [24, 11] or adversarially [15, 19, 18]. Besides the availabilities, the reward model could also be stochastic or adversarial [17, 24]. The problem is NP-hard when both rewards and availabilities are adversarial [19, 18, 15]. For the case of stochastic rewards and adversarial availabilities, [19] proposed a UCB-based no-regret algorithm, which was also shown to be provably optimal. The case of adversarial rewards and stochastic availabilities has also been studied, where the regret lower bound is known to be $\Omega(\sqrt{KT})$, matched by the (computationally inefficient) EXP4 algorithm [19, 15]. On the other hand, over the last decade the relative-feedback variant of the stochastic MAB problem has seen a widespread resurgence in the form of the Dueling Bandit problem, where, instead of getting noisy feedback on the reward of the chosen arm, the learner only gets to see noisy feedback on the pairwise preference between two arms selected by the learner. The objective of the learner is to minimize the regret with respect to the best arm of the stochastic model. Several algorithms have been proposed to address this dueling bandit problem, for different notions of best arms or preference models [10, 31, 38, 37, 36, 21, 33, 13], or even extending pairwise preferences to subsetwise preferences [29, 8, 26, 27, 25]. However, surprisingly, unlike the sleeping bandits generalization of MAB, no parallel has been drawn for dueling bandits, which remains our main focus.

2 Problem Formulation

Notations. The decision space (or item/arm set) is $[K] := \{1, 2, \ldots, K\}$. The available set of items at round $t$ is denoted by $S_t \subseteq [K]$. For any matrix $M \in \mathbb{R}^{K \times K}$, we define $m_{ij} := M(i,j)$ for $i, j \in [K]$. We write $S_{\setminus i} = S \setminus \{i\}$ for any $S \subseteq [K]$ and $i \in S$.
$\mathbf{1}(\cdot)$ denotes the indicator random variable, which takes value 1 if the predicate is true and 0 otherwise, and $\lesssim$ denotes a rough inequality which holds up to universal constants. For any two items $x, y \in [K]$, we use the symbol $x \succ y$ to denote that $x$ is preferred over $y$. $\Sigma_K$ denotes the set of all permutations of the items in $[K]$. The KL-divergence between two Bernoullis with biases $p$ and $q$ is written $\mathrm{kl}(p, q) := p \log(p/q) + (1-p)\log((1-p)/(1-q))$. We assume $\frac{0}{0} := 0.5$ (in Alg. 1 and 2).

Setup. We consider the problem of stochastic $K$-armed dueling bandits with adversarial availabilities: at every iteration $t = 1, \ldots, T$, a set of available items (actions) $S_t \subseteq [K]$ is revealed, and the learner is asked to choose two items $x_t, y_t \in S_t$. Then the learner receives the preference feedback $o_t = \mathbf{1}(x_t \succ y_t) \sim \mathrm{Ber}(P(x_t, y_t))$, where $P \in [0,1]^{K \times K}$ is an underlying pairwise preference matrix, unknown to the learner. The setting is described in Figure 1. We assume that $P$ respects a total ordering, say $\sigma \in \Sigma_K$. Without loss of generality, we set $\sigma = (1, 2, \ldots, K)$ throughout the paper. This implies $P(i,j) \geq 0.5$ for all $i \leq j$. One possible pairwise probability model which respects a total ordering is Plackett-Luce [7], where the $K$ items are associated with positive score parameters $\theta_1, \ldots, \theta_K$, and $P(i,j) = \theta_i/(\theta_i + \theta_j)$ for all $i, j \in [K]$. In fact, any random utility model (RUM) based preference model has the above property, e.g., [7, 28]. Note also that our assumption corresponds to assuming the existence of a Condorcet winner for every subset $S_t \subseteq [K]$.

Objective. The objective of the learner is to minimize its regret over $T$ rounds with respect to the best policy in the policy class $\Pi = \{\pi : 2^{[K]} \to [K] \mid \forall t \in [T],\ \pi(S_t) \in S_t\}$, i.e., any $\pi \in \Pi$ is such that for any $t \in [T]$, $\pi(S_t) \in S_t$. More formally, we define the regret as follows:
$$R_T = \max_{\pi \in \Pi} \sum_{t=1}^{T} \frac{P(\pi(S_t), x_t) + P(\pi(S_t), y_t) - 1}{2}. \qquad (1)$$
We analyze both fixed-confidence and expected regret guarantees in this paper, respectively in Sec. 4 (see Thm. 3) and Sec. 5 (see Thm. 6). It is easy to note that under our preference-modelling assumptions the best policy, say $\pi^*$, turns out to be $\pi^*(S) = \min\{S\}$ for any $S \subseteq [K]$. We henceforth denote $i_t^* = \pi^*(S_t)$. We define the above problem to be Sleeping Dueling Bandits with Stochastic Preferences and Adversarial Availabilities over the stochastic preference matrix $P \in [0,1]^{K \times K}$ and the sequence of available subsets $\mathcal{S}_T = \{S_1, \ldots, S_T\}$, or in short DB-SPAA(P, $\mathcal{S}_T$). For ease of notation we respectively define the gaps and the non-zero gaps as
$$\Delta(i,j) := P(i,j) - 1/2, \qquad \Delta(i,j)_+ := \begin{cases} \Delta(i,j) & \text{if } \Delta(i,j) \neq 0, \\ +\infty & \text{if } \Delta(i,j) = 0. \end{cases} \qquad (2)$$
The regret can thus be rewritten as $R_T := \sum_{t=1}^{T} r_t$, where $r_t := \big(\Delta(i_t^*, x_t) + \Delta(i_t^*, y_t)\big)/2$ denotes the instantaneous regret. We also denote by $n_{ij}(t) := \sum_{\tau=1}^{t} \mathbf{1}\big(\{x_\tau, y_\tau\} = \{i, j\}\big)$ the number of times the pair $(i,j)$ is played until time $t$, and by $w_{ij}(t)$ the number of times $i$ beats $j$ in $t$ rounds.
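To make the protocol of Fig. 1 and the regret in Eqn. (1) concrete, the following is a minimal Python sketch of the DB-SPAA interaction loop under a Plackett-Luce preference matrix, as described above. It is an illustration under our own assumptions (the function names, the i.i.d. availability draw, and the random baseline learner are not part of the paper, where availabilities may be fully adversarial).

```python
import numpy as np

rng = np.random.default_rng(0)

def plackett_luce_matrix(theta):
    """P(i, j) = theta_i / (theta_i + theta_j); respects the total ordering of theta."""
    theta = np.asarray(theta, dtype=float)
    return theta[:, None] / (theta[:, None] + theta[None, :])

def dbspaa_run(P, T, select_pair, p_awake=0.5):
    """Simulate the DB-SPAA(P, S_T) protocol of Fig. 1 and return the cumulative regret R_T.
    `select_pair(S_t, history)` is any learner; here availabilities are drawn i.i.d. purely
    for illustration."""
    K = P.shape[0]
    history = []                # learner observations (x_t, y_t, o_t)
    regret = 0.0
    for t in range(1, T + 1):
        awake = np.flatnonzero(rng.random(K) < p_awake)
        if awake.size == 0:     # keep S_t non-empty (a simulation convenience)
            awake = np.array([rng.integers(K)])
        S_t = awake
        x_t, y_t = select_pair(S_t, history)
        o_t = int(rng.random() < P[x_t, y_t])          # o_t ~ Ber(P(x_t, y_t))
        history.append((x_t, y_t, o_t))
        i_star = S_t.min()                              # pi*(S_t) = min(S_t) under the total ordering
        regret += 0.5 * (P[i_star, x_t] + P[i_star, y_t] - 1.0)   # instantaneous r_t
    return regret

# Example: a (bad) baseline that duels two random awake items each round.
def random_learner(S_t, history):
    return rng.choice(S_t), rng.choice(S_t)

P = plackett_luce_matrix([1.0, 0.8, 0.6, 0.4, 0.2])
print(dbspaa_run(P, T=2000, select_pair=random_learner))
```

Any of the algorithms of Sec. 4 and 5 can be plugged in as `select_pair`; the random baseline above only serves to exercise the loop.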
3 Lower Bound

We first derive a worst-case regret lower bound over all possible sequences $\mathcal{S}_T$. The proof idea essentially lies in constructing hard enough availability sequences $\mathcal{S}_T$, for which no learner can escape learning the preferences of every distinct pair of items $(i,j)$. This leads to a potential lower bound of $\Omega(K^2 \log(T)/\Delta)$. For this section we denote $P$ by $P^K$ to make the dependency on $K$ more precise.

Theorem 1 (Lower Bound for DB-SPAA($P^K$, $\mathcal{S}_T$)). For any no-regret learning algorithm $A$, there exists a problem instance DB-SPAA($P^K$, $\mathcal{S}_T$) with $T \geq K^4$ such that its expected regret is lower bounded as
$$\mathbf{E}[R_T(A)] \geq \Omega\bigg( \sum_{i=1}^{K-1} \sum_{j=i+1}^{K} \frac{\log T}{\Delta(i,j)_+} \bigg).$$
Here, "no-regret learning algorithm" refers to the class of consistent algorithms which do not pull any suboptimal pair more than $O(T^\alpha)$ times, $\alpha \in [0,1]$ (see Def. 7, Appendix 8).

Proof (sketch). The main argument lies behind the fact that in the worst case the adversary can force the algorithm to learn the preference of every distinct pair $(i,j)$: for the worst-case sequences $\mathcal{S}_T$, knowledge of the already-learnt pairwise preferences does not disclose any information on the remaining pairs. For example, assuming $\sigma = (1, 2, \ldots, K)$, revealing the available subsets in the sequence $(1,2), (1,3), \ldots, (1,K), (2,3), (2,4), \ldots, (K-1,K)$ forces the learner to explore (learn the preferences of) all $\binom{K}{2}$ distinct pairs. The remaining proof establishes this formally; towards this we first show an $\Omega(\ln(T)/\Delta(1,2))$ regret lower bound for a DB-SPAA instance with just two items (i.e., $K = 2$), as stated in Lem. 2. The lower bound for general $K$ can then be derived by applying this bound on $\binom{K}{2}$ independent subintervals, with the availability sequence $(1,2), (1,3), \ldots, (1,K), (2,3), (2,4), \ldots, (K-1,K)$. The full proof is given in Appendix 8.

Lemma 2 (Lower Bound of DB-SPAA($P^K$, $\mathcal{S}_T$) for 2 items). For any no-regret learning algorithm $A$, there exists a problem instance DB-SPAA($P^2$, $\mathcal{S}_T$) such that the expected regret incurred by $A$ on it can be lower bounded as $\mathbf{E}[R_T(A)] \geq \Omega\big(\tfrac{1}{\Delta}\log T\big)$, $\Delta$ being the preference gap between the two items (i.e., $\Delta = P_{12} - 1/2$, assuming $P_{12} > 1/2$, or equivalently $\Delta > 0$).

Remark 1 (Implication of the lower bound). The above result indicates that in the preference-based learning setup, the fundamental problem complexity lies in distinguishing every pair of items $1 \leq i < j \leq K$. If the learner fails to learn the preference of any pair $(i,j)$, the adversary can make the learner suffer $O(T)$ regret by setting $S_t = \{i,j\}$ at all rounds henceforth. It is worth noting that in the non-sleeping case both dueling bandits and MAB are known to have the same fundamental performance limit of $\Omega(K \log(T)/\Delta)$ (assuming $P$ admits a Condorcet winner [6, 21]). Thus Thm. 1 shows that the sleeping aspect of dueling bandits makes the problem $K$-times harder than sleeping MAB, for which the regret lower bound is known to be only $\Omega\big(\sum_{i=1}^{K-1} \log(T)/\Delta(i, i+1)\big)$ [19].

Algorithm 1: SlDB-UCB
1: input: arm set $[K]$, parameter $\alpha > 0.5$, confidence parameter $\delta \in [0, 1)$
2: init: $w_{ij}(1) \leftarrow 0$, for all $i, j \in [K]$
3: define: $n_{ij}(t) := w_{ij}(t) + w_{ji}(t)$, $t \in [T]$
4: for $t = 1, 2, \ldots, T$ do
5:   Receive $S_t \subseteq [K]$
6:   $\hat p_{ij}(t) = \frac{w_{ij}(t)}{n_{ij}(t)}$, $c_{ij}(t) = \sqrt{\frac{\alpha \log a_{ij}(t)}{n_{ij}(t)}}$, for all $i, j \in S_t$ (assume $\frac{x}{0} := 0.5$ for all $x \in \mathbb{R}$)
7:   $u_{ij}(t) \leftarrow \hat p_{ij}(t) + c_{ij}(t)$, $u_{ii}(t) \leftarrow 1/2$, for all $i, j \in S_t$   // UCBs of empirical preferences
8:   for $k \in S_t$ do
9:     $\mathcal{C}_k(t) := \{j \in S_t \mid u_{kj}(t) > 1/2\}$   // potential losers to $k$
10:  end for
11:  $\mathcal{C}_t = \{i \in S_t \mid |\mathcal{C}_i(t)| = \max_{j \in S_t} |\mathcal{C}_j(t)|\}$   // potential best items
12:  Select a random $x_t$ from $\mathcal{C}_t$; choose $y_t \in \arg\max_{i \in \mathcal{C}_t} u_{i x_t}(t)$
13:  Play $(x_t, y_t)$; receive preference $o_t$
14:  Update: $w_{x_t y_t}(t+1) \leftarrow w_{x_t y_t}(t) + o_t$, $w_{y_t x_t}(t+1) \leftarrow w_{y_t x_t}(t) + (1 - o_t)$, and $a_{ij}(t+1) \leftarrow \max\{n_{ij}(t+1), C(K, \delta)\}$ for all $i, j \in [K]$
15: end for

4 SlDB-UCB: A Fixed-Confidence Algorithm

In this section, we design an efficient algorithm for the DB-SPAA problem with an instance-dependent regret guarantee.

Main ideas. Our algorithm, described in Alg. 1, depends on a hyper-parameter $\alpha > 0.5$ and a confidence parameter $\delta > 0$.
It maintains, for each item $k \in [K]$, its own record of the empirical pairwise estimates of the duels $(i,j) \in [K] \times [K]$ and their respective upper confidence bounds, defined as
$$\hat p_{ij}(t) := \frac{w_{ij}(t)}{n_{ij}(t)} \quad \text{and} \quad u_{ij}(t) := \hat p_{ij}(t) + c_{ij}(t), \quad \text{with } c_{ij}(t) := \sqrt{\frac{\alpha \log a_{ij}(t)}{n_{ij}(t)}},$$
where $w_{ij}(t)$ denotes the total number of times item $i$ beats $j$ up to round $t$, $n_{ij}(t) := w_{ij}(t) + w_{ji}(t)$, and for all $i, j \in [K]$ and $t \in [T]$,
$$a_{ij}(t) := \max\{C(K, \delta), n_{ij}(t)\} \quad \text{and} \quad C(K, \delta) := \Big(\frac{(4\alpha - 1)K^2}{(2\alpha - 1)\delta}\Big)^{\frac{1}{2\alpha - 1}}.$$
A key observation is that our careful choice of the confidence bounds $c_{ij}(t)$ ensures that, with high probability, $p_{ij} \in [\hat p_{ij}(t) - c_{ij}(t), \hat p_{ij}(t) + c_{ij}(t)]$ for any duel $i, j \in [K]$ and any $t \in [T]$ (Lem. 4). Now at any round $t \geq 1$, the algorithm first computes the set of potential winners of $S_t$ as $\mathcal{C}_t = \{k \in S_t \mid |\mathcal{C}_k(t)| = \max_{j \in S_t} |\mathcal{C}_j(t)|\}$, where $\mathcal{C}_k(t) := \{j \in S_t \mid u_{kj}(t) > \tfrac{1}{2}\}$ denotes the set of items that item $k$ dominates (optimistically). At each round, we play a random item from the set of potential winners $\mathcal{C}_t$ as the left arm $x_t$. Finally, the right arm $y_t$ is chosen to be the most competitive opponent of $x_t$ from the potential winners, i.e., $y_t \in \arg\max_{i \in \mathcal{C}_t} u_{i x_t}(t)$. Our arm-selection strategy ensures that eventually the algorithm plays the optimal pair $(i_t^*, i_t^*)$ frequently enough, as desired.

Theorem 3 (Fixed-confidence regret analysis: SlDB-UCB). Given any $\delta > 0$ and $\alpha > 1/2$, with probability at least $1 - \delta$, the regret incurred by SlDB-UCB (Alg. 1) is upper bounded as
$$R_T \lesssim \sum_{i=1}^{K-1} \sum_{j=i+1}^{K} M_{ij} \log\big(2\, C(K,\delta)\, M_{ij}\big), \quad \text{where } C(K, \delta) := \Big(\frac{(4\alpha - 1)K^2}{(2\alpha - 1)\delta}\Big)^{\frac{1}{2\alpha - 1}} \text{ and } M_{ij} = \max_{k \in [K]} \big[\min\{\Delta(k,i)_+, \Delta(k,j)_+\}\big]^{-2}.$$
The complete proof, with precise dependencies on the model parameters, is deferred to Appendix 9.

Remark 2. The dependency on $\Delta = \min_{i,j} \Delta(i,j)_+$ does not match the lower bound of Thm. 1, which is of order $\Omega(\log(T)/\Delta)$. Instead, Thm. 3 proves $O(\log(1/\delta)/\Delta^2)$. Yet the bounds are not directly comparable, because the lower bound is on the expected regret while the upper bound is a fixed-confidence guarantee in $\delta$ and is hence independent of $T$. All existing dueling bandit algorithms that minimize the expected regret suffer an additional constant term of order $O(1/\Delta^2)$; see for instance [37, 21]. Achieving an order $O(1/\Delta)$ dependence is an interesting question for future work.

Proof sketch of Thm. 3. The key steps lie in proving the following lemmas. The first lemma follows along the lines of Lem. 1 of the RUCB algorithm [37] and shows that all the pairwise estimates are contained within their respective confidence intervals with high probability.

Lemma 4. Let $\alpha > 0.5$ and $\delta > 0$. Then for any $i, j \in [K]$, with probability at least $1 - \delta/K^2$,
$$\hat p_{ij}(t) - c_{ij}(t) \leq p_{ij} \leq u_{ij}(t) := \hat p_{ij}(t) + c_{ij}(t), \quad \text{for all } t \in [T].$$

The lemma below shows that once this event holds, the algorithm cannot play any suboptimal pair too many times.

Lemma 5. Let $\alpha > 0.5$. Under the notations and the high-probability event of Lem. 4, for all $i, j, k \in [K]$ such that $\{i,j\} \neq \{k,k\}$ and for any $\tau \geq 1$,
$$\sum_{t=1}^{\tau} \mathbf{1}(i_t^* = k)\, \mathbf{1}\big(\{x_t, y_t\} = \{i,j\}\big) \leq \frac{4\alpha \log a_{ij}(\tau)}{\min\{\Delta(k,i)_+, \Delta(k,j)_+\}^2},$$
where recall $a_{ij}(\tau) = \max\{C(K,\delta), n_{ij}(\tau)\}$.

With probability at least $1 - \delta$, the event of Lem. 4 holds and thus so do the conclusions of Lem. 5. The regret of Alg. 1 then follows from applying the above lemmas with the following careful decomposition of the regret:
$$\sum_{t=1}^{T} \sum_{k=1}^{K} \mathbf{1}(i_t^* = k)\, \mathbf{1}\big(\{x_t, y_t\} = \{i,j\}\big) = n_{ij}(T),$$
and the proof is concluded by using Lemma 5 to upper bound $n_{ij}(T)$ for every pair $i < j$. The complete proof is given in Appendix 9.
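As a complement to the pseudocode of Alg. 1, here is a minimal sketch (not the authors' implementation) of one round of the SlDB-UCB selection rule, assuming the win counts $w_{ij}$ are stored in a NumPy array; the handling of unplayed pairs and tie-breaking are simplified.

```python
import numpy as np

rng = np.random.default_rng(1)

def sldb_ucb_step(W, S_t, alpha, C_Kdelta):
    """One round of SlDB-UCB (sketch of Alg. 1, lines 6-12). W[i, j] = wins of i over j.
    Returns the duel (x_t, y_t)."""
    N = W + W.T                                   # n_ij(t)
    A = np.maximum(N, C_Kdelta)                   # a_ij(t) = max{n_ij(t), C(K, delta)}
    p_hat = np.where(N > 0, W / np.maximum(N, 1), 0.5)                       # 0/0 := 0.5
    c = np.where(N > 0, np.sqrt(alpha * np.log(A) / np.maximum(N, 1)), 0.5)  # same convention
    U = p_hat + c                                 # optimistic preferences u_ij(t)
    np.fill_diagonal(U, 0.5)

    S = np.asarray(S_t)
    dominated = np.array([(U[k, S] > 0.5).sum() for k in S])   # |C_k(t)| within S_t
    winners = S[dominated == dominated.max()]                  # potential best items C_t
    x_t = rng.choice(winners)                                  # random left arm from C_t
    y_t = winners[np.argmax(U[winners, x_t])]                  # strongest opponent of x_t in C_t
    return x_t, y_t

def update_counts(W, x_t, y_t, o_t):
    """Record the preference feedback o_t = 1(x_t beats y_t) (Alg. 1, line 14)."""
    W[x_t, y_t] += o_t
    W[y_t, x_t] += 1 - o_t
```

When the potential-winner set is a singleton, the sketch returns the pair $(x_t, x_t)$, matching the behaviour described above where the optimal pair $(i_t^*, i_t^*)$ is eventually played frequently.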
5 SlDB-ED: An Expected Regret Algorithm

In this section, we propose another computationally efficient algorithm, SlDB-ED (Alg. 2), which achieves a near-optimal expected regret for the DB-SPAA problem and also performs competitively against SlDB-UCB empirically (see Sec. 6). Furthermore, a novelty of our finite-time regret analysis of SlDB-ED lies in showing a cleaner trade-off between the regret and the availability sequence $\mathcal{S}_T$, which automatically adapts to the inherent hardness of the sequence of available subsets $\mathcal{S}_T$, unlike previous attempts made in standard sleeping bandits for adversarial availabilities [19] (Rem. 3).

Main ideas. We again use the notations $w_{ij}(t)$ and $n_{ij}(t)$ as in SlDB-UCB (Alg. 1), with the same initializations. As for SlDB-UCB, this algorithm also maintains the empirical pairwise preferences $\hat p_{ij}(t)$ for each item pair $i, j \in [K]$. However, unlike the earlier case, here we need to ensure an initial $t_0$ rounds of exploration ($t_0 = 1$ in the theorem) for every distinct pair $(i,j)$, and instead of maintaining pairwise UCBs, the set of good items is now defined in terms of the empirical divergences, for all $i \in S_t$,
$$I_i(t) := \sum_{j \in \widehat B_i(t)} n_{ij}(t)\, \mathrm{kl}\big(\hat p_{ij}(t), 1/2\big), \qquad \widehat B_i(t) := \big\{ j \in [K] \setminus \{i\} \mid \hat p_{ij}(t) \leq 1/2 \big\} \cap S_t,$$
where $\widehat B_i(t)$ denotes the set of empirical winners over item $i$ in the set $S_t$. Now, intuitively, since $\exp(-I_i(t))$ can be interpreted as the likelihood of $i$ being the best item of $S_t$, we denote by $\hat b_t \in \arg\min_{i \in S_t} I_i(t)$ the empirical best item of round $t$ and define the set of near-best items $\mathcal{C}_t := \{ i \in S_t \mid I_i(t) - I_{\hat b_t}(t) \leq \alpha \log t \}$, whose likelihoods are close enough to that of $\hat b_t$. Finally, the algorithm selects an arm pair $(x_t, y_t)$ such that $x_t$ is a potential candidate for the best arm (which ensures the required exploration) and $y_t$ is the strongest challenger of $x_t$ with respect to the empirical preferences. The algorithm is given in Alg. 2.

Algorithm 2: SlDB-ED
1: input: arm set $[K]$, exploration parameter $t_0 > 0$, parameter $\alpha > 0$
2: for $t = 1, 2, \ldots, T$ do
3:   $\hat p_{ij}(t) := \frac{w_{ij}(t)}{n_{ij}(t)}$, $\hat p_{ii}(t) \leftarrow 1/2$, for all $i, j \in [K]$ (assume $\frac{x}{0} := 0.5$ for all $x \in \mathbb{R}$)
4:   Receive $S_t \subseteq [K]$
5:   if $|S_t| \geq 2$ and there exist $i, j \in S_t$, $i \neq j$, such that $n_{ij}(t) < t_0$ then
6:     Set $x_t \leftarrow i$, $y_t \leftarrow j$   // exploration rounds
7:   else
8:     $\widehat B_i(t) := \{ j \in [K] \setminus \{i\} \mid \hat p_{ij}(t) \leq 1/2 \} \cap S_t$, for all $i \in [K]$   // empirical winners over $i$
9:     $I_i(t) := \sum_{j \in \widehat B_i(t)} n_{ij}(t)\, \mathrm{kl}(\hat p_{ij}(t), 1/2)$, for all $i \in [K]$, and $\hat b_t \in \arg\min_{i \in S_t} I_i(t)$
10:    $\mathcal{C}_t := \{ i \in S_t \mid I_i(t) - I_{\hat b_t}(t) \leq \alpha \log t \}$   // potential good arms
11:    Select $x_t$ from $\mathcal{C}_t$ uniformly at random
12:    if $\hat b_t \in \widehat B_{x_t}(t)$ or $\widehat B_{x_t}(t) = \emptyset$: set $y_t \leftarrow \hat b_t$; else: $y_t \in \arg\max_{i \in S_t \setminus \{x_t\}} \hat p_{i x_t}(t)$
13:  end if
14:  Play $(x_t, y_t)$; receive preference feedback $o_t$
15: end for
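For concreteness, the snippet below is a rough Python sketch (under the same array conventions as the earlier SlDB-UCB sketch, and not the authors' code) of the non-exploration branch of Alg. 2: computing the empirical divergences $I_i(t)$, the near-best set $\mathcal{C}_t$, and the challenger $y_t$. The forced-exploration branch (lines 5-6 of Alg. 2) is omitted for brevity.

```python
import numpy as np

rng = np.random.default_rng(2)

def bernoulli_kl(p, q):
    """kl(p, q) between Bernoulli biases, with the usual 0*log(0) = 0 convention."""
    p = np.clip(p, 1e-12, 1 - 1e-12)
    return p * np.log(p / q) + (1 - p) * np.log((1 - p) / (1 - q))

def sldb_ed_step(W, S_t, t, alpha):
    """Non-exploration branch of SlDB-ED (sketch of Alg. 2, lines 8-12).
    W[i, j] = wins of i over j; rounds t are 1-indexed."""
    N = W + W.T
    p_hat = np.where(N > 0, W / np.maximum(N, 1), 0.5)    # 0/0 := 0.5
    np.fill_diagonal(p_hat, 0.5)
    S = np.asarray(S_t)

    # I_i(t) = sum over the empirical winners j of i (within S_t) of n_ij * kl(p_hat_ij, 1/2)
    I = np.zeros(len(S))
    for idx, i in enumerate(S):
        B_i = [j for j in S if j != i and p_hat[i, j] <= 0.5]
        I[idx] = sum(N[i, j] * bernoulli_kl(p_hat[i, j], 0.5) for j in B_i)

    b_hat = S[np.argmin(I)]                                # empirical best item of S_t
    C_t = S[I - I.min() <= alpha * np.log(t)]              # near-best items
    x_t = rng.choice(C_t)                                  # uniform draw from C_t

    B_x = [j for j in S if j != x_t and p_hat[x_t, j] <= 0.5]
    if b_hat in B_x or len(B_x) == 0:
        y_t = b_hat
    else:                                                  # strongest empirical challenger of x_t
        others = S[S != x_t]
        y_t = others[np.argmax(p_hat[others, x_t])]
    return x_t, y_t
```

Wins are recorded exactly as in the SlDB-UCB sketch, so the same `update_counts` helper can be reused after each duel.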
Theorem 6 (Expected regret analysis of SlDB-ED). Let $t_0 = 1$ and $\alpha = 4K$. Then, as $T \to \infty$, the expected regret incurred by SlDB-ED (Alg. 2) can be upper bounded as follows: for all $\varepsilon_2, \ldots, \varepsilon_K \geq 0$,
$$\mathbf{E}[R_T] \lesssim K^2 + \sum_{i < j} \bigg[ \frac{K\, \mathbf{1}\{\Delta(i,j) > \varepsilon_j\}}{\Delta(i,j)^2} + n_{ij}(T)\, \min\{\varepsilon_j, \Delta(i,j)\} \bigg] + \sum_{j=2}^{K} \frac{K \log T}{\max\{\varepsilon_j, \Delta(j-1,j)_+\}} \;\lesssim\; K^2 + \sum_{j=2}^{K} \min\bigg\{ \frac{K \log T}{\Delta(j-1,j)_+},\; K\, T^{2/3} \bigg\}.$$
The proof is deferred to Appendix 10. Although it borrows some high-level ideas from [19] for sleeping bandits and from [21] for RMED in standard dueling bandits, our analysis needs new ingredients in order to obtain $O(K^2 (\log T)/\Delta)$. This is especially the case for the proofs of the technical Lemmas 8 and 9, which differ significantly from the corresponding technical lemmas of [21]. Specifically, both the regret bound of RMED and ours need to control the length of an initial exploration phase $t_0$ after which the pairwise preferences are well estimated by $\hat p_{ij}(t)$. This is done respectively by our Lemma 8 and by Lemma 5 of [21]. Yet, RMED's original analysis is based on a union bound over all possible subsets $S \subseteq \{1, \ldots, K\}$ of items (see Equation (19) in [21]), whose number is exponential in $K$. Despite our efforts, we could not follow the proof of Lemma 5 of [21], which, to the best of our understanding, should yield an exploration length $t_0$ exponentially large in $K$, contrary to the $O(1)$ claimed in [21]. Instead, in our proof of Lem. 8, we carefully apply concentration inequalities to take union bounds over the items directly instead of over sets of items.

The upper bound of Thm. 6 is close to optimal: it suffers at most a suboptimal factor of $K$ and exactly matches the lower bound for some problems. The distribution-free upper bound of order $O(T^{2/3})$ matches what is obtainable for standard dueling bandit problems [21, 37], since the latter algorithms also suffer constant terms of order $\Delta^{-2}$. Yet, it is unclear whether it is optimal or whether $O(\sqrt{T})$ can be obtained.

Remark 3 (Sequence $\mathcal{S}_T$ adaptivity of Alg. 2). It is worth pointing out that the regret bound of Thm. 6 is finite-time and automatically adapts to the sequence of available sets $\mathcal{S}_T$. In the worst case, the complexity lies in identifying, for every item $j$, the gap with the preceding item $j-1$. Yet our regret bound, which holds for any $\varepsilon_j \geq 0$, automatically performs a trade-off for each $j$ between the gap term $\Delta(j-1,j)_+^{-1}$ and $\varepsilon_j \sum_{i=1}^{j-1} n_{ij}(T)$, where $\sum_{i=1}^{j-1} n_{ij}(T)$ is the number of times $j$ is played together with a better item $i < j$. In particular, $\sum_{i=1}^{j-1} n_{ij}(T)$ can be small if $j$ is rarely available in $S_t$ while not being the optimal item. Notably, this item-by-item adaptivity to $\mathcal{S}_T$ improves upon the regret guarantee of Thm. 10 of [19], which also addresses the problem of sleeping bandits with adversarial availabilities, but for the stochastic multi-armed bandit setup, and only provides worst-case guarantees over all $\mathcal{S}_T$ with a trade-off parameter $\varepsilon$ independent of $j$.

Remark 4 (SlDB-ED in standard dueling bandits). Even in the dueling bandit setting (without the sleeping component), SlDB-ED and Thm. 6 have advantages over the RMED algorithm and analysis of [21]. Our regret bound is valid for any number of items $K$, while that of Thm. 3 of [21] is only asymptotic as $K \to \infty$. This is because the algorithm of [21] depends on a hyper-parameter $f(K)$ which needs to be larger than $AK$, where $A$ is a constant with respect to $K$ and $T$ but depends on the unknown suboptimality gaps $\Delta(i,j)$. Thus, [21] chooses $f(K) \sim K^{1+\varepsilon}$ so that the requirement is eventually satisfied as $K \to \infty$. Instead, our algorithm only depends on easily tunable hyper-parameters $t_0$ and $\alpha$, whose optimal values are independent of the unknown parameters.

6 Experiments

In this section, we compare the empirical performances of our two proposed algorithms (Alg. 1 and 2). Note that there are no other existing algorithms for our problem (see Sec. 1).

Constructing preference matrices (P). We use three different utility-based Plackett-Luce($\theta$) preference models (see Sec. 2), which ensure a total ordering. We construct three types of problem instances, 1. Easy, 2. Medium, and 3. Hard, for any given $K$, with the items' $\theta$ parameters assigned as follows (a construction sketch is given after the list):

1. Easy: $\theta(1 : \lfloor K/2 \rfloor) = 1$, $\theta(\lfloor K/2 \rfloor + 1 : K) = 0.5$.
2. Medium: $\theta(1 : \lfloor K/3 \rfloor) = 1$, $\theta(\lfloor K/3 \rfloor + 1 : \lfloor 2K/3 \rfloor) = 0.7$, $\theta(\lfloor 2K/3 \rfloor + 1 : K) = 0.4$.
3. Hard: $\theta(i) = 1 - (i-1)/K$, for all $i \in [K]$.

Note that in each case the underlying total ordering is $\sigma = (1 \succ 2 \succ \cdots \succ K)$.
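A minimal sketch of how the three instances above can be instantiated; the exact rounding of the group boundaries is our assumption, and the Plackett-Luce helper from the earlier protocol sketch is repeated here so the block is self-contained.

```python
import numpy as np

def theta_easy(K):
    return np.where(np.arange(1, K + 1) <= K // 2, 1.0, 0.5)

def theta_medium(K):
    i = np.arange(1, K + 1)
    return np.select([i <= K // 3, i <= 2 * K // 3], [1.0, 0.7], default=0.4)

def theta_hard(K):
    i = np.arange(1, K + 1)
    return 1.0 - (i - 1) / K

def plackett_luce_matrix(theta):
    """P(i, j) = theta_i / (theta_i + theta_j)."""
    theta = np.asarray(theta, dtype=float)
    return theta[:, None] / (theta[:, None] + theta[None, :])

K = 10
P_easy, P_medium, P_hard = (plackett_luce_matrix(f(K)) for f in (theta_easy, theta_medium, theta_hard))
# Larger theta means a more preferred item, so sigma = (1 > 2 > ... > K) in every instance.
```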
In every experiment, we set the learning parameters $\alpha = 0.51$ and $\delta = 1/T$ for SlDB-UCB (Alg. 1), and as per Thm. 6 for SlDB-ED (Alg. 2). All results are averaged over 50 runs.

Regret over varying preference matrices. We first plot the cumulative regret of our two algorithms (Alg. 1 and 2) over time on the above three Plackett-Luce datasets for $K = 10$. We generate the availability sequence $\mathcal{S}_T$ randomly by sampling every item $i \in [K]$ independently with probability 0.5. Fig. 2 shows that, as the names suggest, instance Easy is the easiest to learn, since the best-vs-worst item preferences are well separated and the diversity of item preferences across the different groups is the least. Consequently, the algorithms incur slightly more regret on instance Medium due to the higher preference diversity, and the hardest instance is Hard, where the learner really needs to differentiate the ranking of every item for arbitrary set sequences $\mathcal{S}_T$. Empirically, SlDB-UCB is seen to slightly outperform SlDB-ED, though order-wise they perform competitively.

Regret over varying set availabilities. In this set of experiments, the idea is to understand how the regret improves over completely random subset availabilities when the learner may not have to distinguish all item preferences because some item combinations occur rarely. We choose $K = 10$ and, to enforce item dependencies, we generate each set $S_t$ by drawing a random sample from a Gaussian$(\mu, \Sigma)$ with $\mu_i = 0$ for all $i \in [10]$, where $\Sigma$ is a fixed $10 \times 10$ positive-definite matrix which controls the set dependencies. Precisely, we use two different block-diagonal matrices, Low-Correlation and High-Correlation, with the following correlations (a generation sketch is given below): 1. Low-Correlation: $\Sigma$ is a separated block-diagonal matrix on the item partitions $\{1,2,3\}$, $\{4,5,6\}$, $\{7,8,9,10\}$. 2. High-Correlation: $\Sigma$ is constructed by merging three all-ones matrices on the partitions $\{1,\ldots,5\}$, $\{2,\ldots,8\}$, and $\{6,\ldots,10\}$; however, as the resulting matrix need not be positive semi-definite, we further take its SVD and reconstruct the matrix after eliminating the negative eigenvalues. At every round we sample a random vector from Gaussian$(\mu, \Sigma)$, and $S_t$ is taken to be the set of items whose value exceeds 0.5. Both experiments are run on instance Hard. Fig. 3 shows, as expected, that on Low-Correlation both algorithms converge to $\sigma$ relatively faster and at lower regret compared to High-Correlation (as the latter induces higher variability of the available subsets).

Figure 2: Regret ($R_t$) vs. time ($t$) over three preference instances ($P$).
Figure 3: Regret ($R_t$) vs. time ($t$) over availability sequences $\mathcal{S}_T$.
Figure 4: Final regret ($R_T$) at $T = 10^4$ with varying set sizes ($K$).
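The correlated availability generation described above under "Regret over varying set availabilities" can be sketched as follows; the block layouts, the eigenvalue clipping used to repair $\Sigma$, and the non-empty-set fallback are our reading of the description, not the authors' exact code.

```python
import numpy as np

rng = np.random.default_rng(3)

def blocks_sigma(partitions, K=10, rho=1.0):
    """Unit-diagonal covariance with constant correlation `rho` inside each block
    (blocks may overlap); negative eigenvalues are clipped so Sigma stays PSD."""
    Sigma = np.eye(K)
    for block in partitions:
        idx = np.array(list(block)) - 1          # items are 1-indexed in the paper
        Sigma[np.ix_(idx, idx)] = rho
        Sigma[idx, idx] = 1.0
    eigval, eigvec = np.linalg.eigh(Sigma)
    eigval = np.clip(eigval, 0.0, None)          # eliminate negative eigenvalues (High-Correlation case)
    return eigvec @ np.diag(eigval) @ eigvec.T

def sample_available_set(Sigma, threshold=0.5):
    """Draw S_t: items whose correlated Gaussian coordinate exceeds the threshold."""
    z = rng.multivariate_normal(np.zeros(Sigma.shape[0]), Sigma, method="eigh")
    S_t = np.flatnonzero(z > threshold)
    # fall back to a random singleton if the draw leaves no item awake (a simulation convenience)
    return S_t if S_t.size > 0 else np.array([rng.integers(Sigma.shape[0])])

# Low-Correlation: disjoint blocks {1,2,3}, {4,5,6}, {7,...,10};
# High-Correlation: overlapping all-ones blocks {1,...,5}, {2,...,8}, {6,...,10}.
Sigma_low  = blocks_sigma([range(1, 4), range(4, 7), range(7, 11)])
Sigma_high = blocks_sigma([range(1, 6), range(2, 9), range(6, 11)])
S_t = sample_available_set(Sigma_high)
```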
Final regret vs. set size (K). We also compare the (averaged) final regret of the two algorithms over varying item-set sizes $K$. We additionally constructed two larger Plackett-Luce($\theta$) Easy and Hard instances for $K = 20$ and $40$, using $\theta$ assignments similar to those explained before. We set $T = 10{,}000$ and use the item-wise independent set-generation idea described for Fig. 2. As expected, Fig. 4 shows that the regret of both algorithms scales up with increasing $K$, the effect on SlDB-ED being slightly worse than on SlDB-UCB, though the latter generally exhibits a higher variance.

Worst-case regret vs. time. We run an experiment to analyse the regret of our two algorithms on worst-case problem instances. Towards this we use preference matrices $P_\Delta$ of the form $P_\Delta(i,j) = 0.5 + \Delta$ for all $1 \leq i < j \leq K$, i.e., all items are spaced with an equidistant gap $\Delta \in (0, 0.5]$. As before, we choose $T = 20{,}000$ and $K = 10$, and run the algorithms on the above problem instances with $\Delta$ varying in the range $[10/T, 0.5]$ with a uniform grid size of 0.005 (i.e., in total 100 values of $\Delta$, each corresponding to a separate problem instance $P_\Delta$ with a different gap-complexity $\Delta$). At the end we plot the worst-case regret of both algorithms over time, i.e., $\max_\Delta R_t(P_\Delta)$ vs. $t$. We run the experiments over three availability sequences: 1. Independent (as used in Fig. 2), 2. Low-Correlation, and 3. High-Correlation (as used in Fig. 3). The resulting plots thus reflect the worst-case (w.r.t. $\Delta$) performances of the algorithms, which seem to scale as $O(T^{2/3})$ for SlDB-ED, as conjectured to be its distribution-free upper bound (see the discussion after Thm. 6), and with a slightly lower rate for SlDB-UCB. Fig. 5 shows the comparative performances.

Figure 5: Worst-case regret ($\max_\Delta R_t(P_\Delta)$) vs. time ($t$) over three availability sequences $\mathcal{S}_T$.

7 Conclusion and Perspective

We introduce the problem of sleeping dueling bandits with stochastic preferences and adversarial availabilities, which, despite its great practical relevance, had been left unaddressed to date. Towards this we adapt two dueling bandit algorithms to the problem and give regret analyses for both. We also derive an instance-dependent regret lower bound for our problem setup, which shows that our second algorithm is asymptotically near-optimal (up to problem-dependent constants). Finally, we compare both our algorithms empirically, where the first algorithm is usually shown to outperform the second, although it comes with a relatively weaker regret guarantee.

Future Works. Moving forward, one can address many open questions along this direction, including relaxing the total-ordering assumption on the stochastic preferences by assuming more general ranking objectives based on Borda [32] or Copeland scores [36], or extending the framework to a general contextual scenario with subsetwise feedback. Another direction worth understanding is to analyze the connection of this problem with other bandit setups, e.g., learning with feedback graphs [3, 4] or other side information [23, 20]. It would also be interesting to consider the dueling bandit problem with adversarial preferences and stochastic availabilities [24, 17], and to analyze this class of problems for general subsetwise preferences [25, 30, 8].

References

[1] Shipra Agrawal and Navin Goyal. Analysis of Thompson sampling for the multi-armed bandit problem. In Conference on Learning Theory, pages 39.1–39.26, 2012.
[2] Nir Ailon, Zohar Shay Karnin, and Thorsten Joachims. Reducing dueling bandits to cardinal bandits. In ICML, volume 32, pages 856–864, 2014.
[3] Noga Alon, Nicolo Cesa-Bianchi, Ofer Dekel, and Tomer Koren. Online learning with feedback graphs: Beyond bandits. In JMLR Workshop and Conference Proceedings, volume 40. Microtome Publishing, 2015.
[4] Noga Alon, Nicolo Cesa-Bianchi, Claudio Gentile, Shie Mannor, Yishay Mansour, and Ohad Shamir. Nonstochastic multi-armed bandits with graph-structured feedback. SIAM Journal on Computing, 46(6):1785–1826, 2017.
[5] Jean-Yves Audibert and Sébastien Bubeck. Best arm identification in multi-armed bandits. In COLT - 23rd Conference on Learning Theory, 13 p., 2010.
[6] Peter Auer, Nicolo Cesa-Bianchi, and Paul Fischer. Finite-time analysis of the multiarmed bandit problem. Machine Learning, 47(2-3):235–256, 2002.
[7] Hossein Azari, David Parkes, and Lirong Xia. Random utility theory for social choice. In Advances in Neural Information Processing Systems, pages 126–134, 2012.
[8] Brian Brost, Yevgeny Seldin, Ingemar J. Cox, and Christina Lioma. Multi-dueling bandits and their application to online ranker evaluation. CoRR, abs/1608.06253, 2016.
[9] Róbert Busa-Fekete and Eyke Hüllermeier. A survey of preference-based online learning with bandit algorithms.
In International Conference on Algorithmic Learning Theory, pages 18–39. Springer, 2014.
[10] Róbert Busa-Fekete, Eyke Hüllermeier, and Balázs Szörényi. Preference-based rank elicitation using statistical models: The case of Mallows. In Proceedings of The 31st International Conference on Machine Learning, volume 32, 2014.
[11] Corinna Cortes, Giulia DeSalvo, Claudio Gentile, Mehryar Mohri, and Scott Yang. Online learning with sleeping experts and feedback graphs. In International Conference on Machine Learning, pages 1370–1378, 2019.
[12] Imre Csiszár. The method of types. IEEE Transactions on Information Theory, 44(6):2505–2523, 1998.
[13] Miroslav Dudík, Katja Hofmann, Robert E. Schapire, Aleksandrs Slivkins, and Masrour Zoghi. Contextual dueling bandits. In Conference on Learning Theory, pages 563–587, 2015.
[14] Pratik Gajane, Tanguy Urvoy, and Fabrice Clérot. A relative exponential weighing algorithm for adversarial utility-based dueling bandits. In Proceedings of the 32nd International Conference on Machine Learning, pages 218–227, 2015.
[15] Satyen Kale, Chansoo Lee, and Dávid Pál. Hardness of online sleeping combinatorial optimization problems. In Advances in Neural Information Processing Systems, pages 2181–2189, 2016.
[16] Shivaram Kalyanakrishnan, Ambuj Tewari, Peter Auer, and Peter Stone. PAC subset selection in stochastic multi-armed bandits. In ICML, volume 12, pages 655–662, 2012.
[17] Varun Kanade, H. Brendan McMahan, and Brent Bryan. Sleeping experts and bandits with stochastic action availability and adversarial rewards. 2009.
[18] Varun Kanade and Thomas Steinke. Learning hurdles for sleeping experts. ACM Transactions on Computation Theory (TOCT), 6(3):11, 2014.
[19] Robert Kleinberg, Alexandru Niculescu-Mizil, and Yogeshwer Sharma. Regret bounds for sleeping experts and bandits. Machine Learning, 80(2-3):245–272, 2010.
[20] Tomas Kocak, Gergely Neu, Michal Valko, and Rémi Munos. Efficient learning by implicit exploration in bandit problems with side observations. In Advances in Neural Information Processing Systems, pages 613–621, 2014.
[21] Junpei Komiyama, Junya Honda, Hisashi Kashima, and Hiroshi Nakagawa. Regret lower bound and optimal algorithm in dueling bandit problem. In COLT, pages 1141–1154, 2015.
[22] Tor Lattimore and Csaba Szepesvári. Bandit Algorithms. Preprint, 2018.
[23] Shie Mannor and Ohad Shamir. From bandits to experts: On the value of side-observations. In Advances in Neural Information Processing Systems, pages 684–692, 2011.
[24] Gergely Neu and Michal Valko. Online combinatorial optimization with stochastic decision sets and adversarial losses. In Advances in Neural Information Processing Systems, pages 2780–2788, 2014.
[25] Wenbo Ren, Jia Liu, and Ness B. Shroff. PAC ranking from pairwise and listwise queries: Lower bounds and upper bounds. arXiv preprint arXiv:1806.02970, 2018.
[26] Aadirupa Saha and Aditya Gopalan. Battle of bandits. In Uncertainty in Artificial Intelligence, 2018.
[27] Aadirupa Saha and Aditya Gopalan. PAC battling bandits in the Plackett-Luce model. In Algorithmic Learning Theory, pages 700–737, 2019.
[28] Hossein Azari Soufiani, David C. Parkes, and Lirong Xia. Preference elicitation for general random utility models. In Uncertainty in Artificial Intelligence, page 596. Citeseer, 2013.
[29] Yanan Sui, Vincent Zhuang, Joel Burdick, and Yisong Yue. Multi-dueling bandits with dependent arms. In Conference on Uncertainty in Artificial Intelligence, UAI'17, 2017.
[30] Yanan Sui, Masrour Zoghi, Katja Hofmann, and Yisong Yue.
Advancements in dueling bandits. In IJCAI, pages 5502–5510, 2018.
[31] Balázs Szörényi, Róbert Busa-Fekete, Adil Paul, and Eyke Hüllermeier. Online rank elicitation for Plackett-Luce: A dueling bandits approach. In Advances in Neural Information Processing Systems, pages 604–612, 2015.
[32] Tanguy Urvoy, Fabrice Clerot, Raphael Féraud, and Sami Naamane. Generic exploration and k-armed voting bandits. In International Conference on Machine Learning, pages 91–99, 2013.
[33] Huasen Wu and Xin Liu. Double Thompson sampling for dueling bandits. In Advances in Neural Information Processing Systems, pages 649–657, 2016.
[34] Yisong Yue, Josef Broder, Robert Kleinberg, and Thorsten Joachims. The k-armed dueling bandits problem. Journal of Computer and System Sciences, 78(5):1538–1556, 2012.
[35] Yisong Yue and Thorsten Joachims. Interactively optimizing information retrieval systems as a dueling bandits problem. In Proceedings of the 26th Annual International Conference on Machine Learning, pages 1201–1208. ACM, 2009.
[36] Masrour Zoghi, Zohar S. Karnin, Shimon Whiteson, and Maarten de Rijke. Copeland dueling bandits. In Advances in Neural Information Processing Systems, pages 307–315, 2015.
[37] Masrour Zoghi, Shimon Whiteson, Remi Munos, Maarten de Rijke, et al. Relative upper confidence bound for the k-armed dueling bandit problem. In JMLR Workshop and Conference Proceedings, number 32, pages 10–18. JMLR, 2014.
[38] Masrour Zoghi, Shimon A. Whiteson, Maarten de Rijke, and Remi Munos. Relative confidence sampling for efficient on-line ranker evaluation. In Proceedings of the 7th ACM International Conference on Web Search and Data Mining, pages 73–82. ACM, 2014.