# Bandits Dueling on Partially Ordered Sets

Julien Audiffren (CMLA, ENS Paris-Saclay, CNRS, Université Paris-Saclay, France; julien.audiffren@gmail.com)
Liva Ralaivola (Lab. Informatique Fondamentale de Marseille, CNRS, Aix-Marseille University, Institut Universitaire de France, F-13288 Marseille Cedex 9, France; liva.ralaivola@lif.univ-mrs.fr)

**Abstract.** We address the problem of dueling bandits defined on partially ordered sets, or posets. In this setting, arms may not be comparable, and there may be several (incomparable) optimal arms. We propose an algorithm, Unchained Bandits, that efficiently finds the set of optimal arms, i.e. the Pareto front, of any poset, even when pairs of comparable arms cannot be a priori distinguished from pairs of incomparable arms, under a set of minimal assumptions. This means that Unchained Bandits does not require information about comparability and can be used with limited knowledge of the poset. To achieve this, the algorithm relies on the concept of decoys, which stems from social psychology. We also provide theoretical guarantees on both the regret incurred and the number of comparisons required by Unchained Bandits, and we report compelling empirical results.

## 1 Introduction

Many real-life optimization problems pose the issue of dealing with a few, possibly conflicting, objectives: think for instance of the choice of a phone plan, where the right balance between the price, the network coverage/type, and the roaming options has to be found. Such multi-objective optimization problems may be studied from the multi-armed bandits perspective (see e.g. Drugan and Nowé [2013]), which is what we do here, from a dueling bandits standpoint.

**Dueling bandits on posets.** Dueling bandits [Yue et al., 2012] pertain to the K-armed bandit framework, with the assumption that there is no direct access to the reward provided by any single arm; the only information that can be gained comes from the simultaneous pull of two arms: when such a pull is performed, the agent is informed of the winner of the duel between the two arms. We extend the framework of dueling bandits to the situation where some pairs of arms are not comparable, that is, we study the case where there might be no natural order that could help decide the winner of a duel. This situation may show up, for instance, when the (hidden) values associated with the arms are multidimensional, as is the case in the multi-objective setting mentioned above. The notion of incomparability naturally links this problem with the theory of posets, and our approach takes inspiration from works dedicated to selecting and sorting on posets [Daskalakis et al., 2011].

**Chasing the Pareto front.** In this setting, the best arm may no longer be unique, and we consider the problem of identifying, among all available K arms, the set of maximal incomparable arms, or Pareto front, with minimal regret. This objective significantly differs from the usual objective of dueling bandit algorithms, which aim to find one optimal arm (such as a Condorcet winner, a Copeland winner or a Borda winner) and to pull it as frequently as possible to minimize the regret. Finding the entire Pareto front (denoted P) is more difficult, but pertains to many real-world applications.
For instance, in the phone plan setting discussed above, P will contain both the cheapest plan and the plan offering the largest coverage, as well as any non-dominated plan in between; every customer may then find a suitable plan in P in accordance with her personal preferences.

**Key issue: indistinguishability.** In practice, the incomparability information might be difficult to obtain. We therefore assume that the underlying incomparability structure is unknown and inaccessible. A pivotal issue that then arises is that of indistinguishability. In the assumed setting, the pull of two arms that are comparable and have close values (and hence a probability close to 0.5 for either arm to win a duel) is essentially driven by the same random process, i.e. an unbiased coin flip, as the pull of two arms that are not comparable. This induces the problem of indistinguishability: that of deciding from pulls whether a pair of arms is incomparable or is made of comparable arms of similar strengths.

**Contributions.** Our main contribution, the Unchained Bandits algorithm, implements a strategy based on a peeling approach (Section 3). We show that Unchained Bandits can find a nearly optimal approximation of the set of optimal arms of S with probability at least 1 − δ, while incurring a regret upper bounded, up to factors depending on the gaps $\Delta_i$ (where $\Delta_i$ is the regret associated with arm i), by $\mathcal{O}\big(K\,\mathrm{width}(S)\log K\big)$, where K is the size of the poset and width(S) its width; we also show that this regret is essentially optimal. Moreover, we show that with little additional information Unchained Bandits can recover the exact set of optimal arms, and that even when no additional information is available, Unchained Bandits can recover P by using decoy arms, an idea stemming from social psychology, where decoys are used to lure an agent (e.g. a customer) towards a specific good/action (e.g. a product) by presenting her with a choice between the targeted good and a degraded version of it (Section 4). Finally, we report results on the empirical performance of our algorithm in different settings (Section 5).

**Related works.** Since the seminal paper of Yue et al. [2012] on dueling bandits, numerous works have proposed settings where the total order assumption is relaxed but the existence of a Condorcet winner is assumed [Yue and Joachims, 2011, Ailon et al., 2014, Zoghi et al., 2014, 2015b]. More recent works [Zoghi et al., 2015a, Komiyama et al., 2016], which envision bandit problems from the social choice perspective, pursue the objective of identifying a Copeland winner. Finally, the works closest to our partial order setting are [Ramamohan et al., 2016] and [Dudík et al., 2015]. The former proposes a general algorithm that can recover many sets of winners, including the uncovered set, which is akin to the Pareto front; however, it assumes that the problems contain no ties, while in our framework any pair of incomparable arms is encoded as a tie. The latter proposes an extension of dueling bandits using contexts and introduces several algorithms to recover a von Neumann winner, i.e. a mixture of arms that is better than any other; in our setting, any mixture of arms from the Pareto front is a von Neumann winner. It is worth noting that the aforementioned works aim to identify a single winner, be it Condorcet, Copeland or von Neumann. This is significantly different from the task of identifying the entire Pareto front.
Moreover, the incomparability property is not addressed in previous works; while some algorithms may still be applied if incomparability is encoded as a tie, they are not designed to fully exploit this information, which is reflected in their performance in our experiments. Moreover, our lower bound illustrates the fact that our algorithm is essentially optimal for the task of identifying the Pareto front.

Regarding decoys, the idea originates from social psychology, where it was observed that the introduction of strictly dominated alternatives may influence the perceived value of items. This observation has generated an abundant literature studying decoys and their uses in various fields (see e.g. Tversky and Kahneman [1981], Huber et al. [1982], Ariely and Wallsten [1995], Sedikides et al. [1999]). From the computer science literature, we may mention the work of Daskalakis et al. [2011], which addresses the problem of selection and sorting on posets and provides relevant data structures and accompanying analyses.

## 2 Problem: Dueling Bandits on Posets

We briefly recall the basic notions and properties at the heart of our contribution.

**Definition 2.1** (Poset). Let S be a set of elements. (S, ⪰) is a partially ordered set, or poset, if ⪰ is a partial binary relation on S that is reflexive, antisymmetric and transitive.

**Transitivity relaxation.** Recent works on dueling bandits (see e.g. Zoghi et al. [2014]) have shown that the transitivity property is not required for the agent to successfully identify the maximal element (in that case, the Condorcet winner), provided it is assumed to exist. Similarly, most of the results we provide do not require transitivity. In the following, we dub social poset a transitivity-free poset, i.e. a partial binary relation that is solely reflexive and antisymmetric.

**Remark 2.2.** Throughout, we use S to denote indifferently the set S or the social poset (S, ⪰), the distinction being clear from the context. We make use of the additional notation, for all a, b ∈ S: a ∥ b if a and b are incomparable (neither a ⪰ b nor b ⪰ a); a ≻ b if a ⪰ b and a ≠ b.

**Definition 2.3** (Maximal element and Pareto front). An element a ∈ S is a maximal element of S if ∀b ∈ S, a ⪰ b or a ∥ b. We denote by P(S) ≐ {a : a ⪰ b or a ∥ b, ∀b ∈ S} the set of maximal elements, or Pareto front, of the social poset. Similarly to the problem of the existence of a Condorcet winner, P might be empty for a social poset (while a poset always has at least one maximal element). In the following, we assume that |P| > 0.

The notions of chain and antichain are key to identifying P.

**Definition 2.4** (Chain, antichain, width and height). C ⊆ S is a chain (resp. an antichain) if ∀a, b ∈ C, a ⪰ b or b ⪰ a (resp. a ∥ b). C is maximal if ∀a ∈ S \ C, C ∪ {a} is not a chain (resp. an antichain). The height (resp. width) of S is the size of its longest chain (resp. antichain).
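To make these definitions concrete, here is a small illustrative example of our own (the poset below is not taken from the paper): a toy social poset encoded by its strict dominance pairs, with the Pareto front of Definition 2.3 and the width of Definition 2.4 computed by brute force.

```python
from itertools import combinations

# Toy social poset on 6 arms; (a, b) in `dominates` means a beats b.
# This instance is purely illustrative (it is not from the paper).
dominates = {(0, 2), (0, 3), (1, 3), (1, 4), (2, 5), (3, 5), (4, 5)}
arms = range(6)

def comparable(a, b):
    return (a, b) in dominates or (b, a) in dominates

# Pareto front (Definition 2.3): arms dominated by no other arm.
pareto = [a for a in arms if not any((b, a) in dominates for b in arms)]

# Width (Definition 2.4): size of the largest antichain, by brute force.
width = max(
    len(sub)
    for r in range(1, len(arms) + 1)
    for sub in combinations(arms, r)
    if all(not comparable(a, b) for a, b in combinations(sub, 2))
)

print(pareto, width)  # [0, 1] 3: two incomparable maximal arms, width 3
```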
**K-armed dueling bandits on posets.** The K-armed dueling bandit problem on a social poset S = {1, . . . , K} of arms may be formalized as follows. For every maximal chain {i₁, . . . , iₘ} of m arms, there exists a family $\{\gamma_{i_p i_q}\}_{1 \leq p, q \leq m}$ of parameters such that $\gamma_{i_p i_q} \in (-1/2, 1/2)$, and the pull of a pair $(i_p, i_q)$ of arms from the same chain is the independent realization of a Bernoulli random variable $B_{i_p i_q}$ with expectation $\mathbb{E}(B_{i_p i_q}) = 1/2 + \gamma_{i_p i_q}$, where $B_{i_p i_q} = 1$ means that $i_p$ wins the duel against $i_q$, and conversely (note that ∀i, j, $\gamma_{ji} = -\gamma_{ij}$). In the situation where the pair of arms $(i_p, i_q)$ selected by the agent is such that $i_p \parallel i_q$, a pull is akin to the toss of an unbiased coin, that is, $\gamma_{i_p i_q} = 0$. This is summarized by the following assumption:

**Assumption 1** (Order compatibility). ∀i, j ∈ S, i ≻ j if and only if $\gamma_{ij} > 0$.

**Regret on posets.** In the total order setting, the regret incurred by pulling an arm i is defined as the difference in value between the best arm and arm i. In the poset framework, there might be multiple best arms, and we choose to define the regret of arm i as the maximum gap between arm i and the best arms comparable to i. Formally, the regret $\Delta_i$ is defined as $\Delta_i = \max\{\gamma_{ji} : j \in P \text{ such that } j \succcurlyeq i\}$. We then define the regret incurred by comparing two arms i and j as $\Delta_i + \Delta_j$. Note that the regret of a comparison is zero if and only if the agent is comparing two elements of the Pareto front.

**Problem statement.** The problem we want to tackle is to identify the Pareto front P(S) of S as efficiently as possible. More precisely, we want to devise pulling strategies ensuring that, for any given δ ∈ (0, 1), the agent is capable, with probability at least 1 − δ, of identifying P(S) with a controlled number of pulls and a bounded regret.

**ε-indistinguishability.** In our model, we assumed that if i ∥ j, then $\gamma_{ij} = 0$: if two arms cannot be compared, the outcome of their comparison depends only on circumstances extraneous to the arms (such as luck or personal taste). Our encoding of this framework assumes that, over many pulls, the effects of those circumstances cancel out, so that no specific arm is favored, whence $\gamma_{ij} = 0$. The limits of this hypothesis, and the robustness of our results when it is not satisfied, are discussed in Section 5.

This property entails the indistinguishability problem evoked previously. Indeed, given two arms i and j, regardless of the number of comparisons, an agent may never be sure whether the two arms are very close to each other ($\gamma_{ij} \approx 0$ with i and j comparable) or incomparable ($\gamma_{ij} = 0$). This raises two major difficulties. First, an empirical estimate $\hat{\gamma}_{ij}$ close to zero is no longer a sufficient condition to assert that i and j have similar values; insisting on pulling the pair (i, j) to decide whether they do may incur a very large regret if they are incomparable. Second, it is impossible to certify that two elements are incomparable; therefore, identifying the exact Pareto set is intractable if no additional information is provided. Indeed, the agent might never be sure that the candidate set no longer contains unnecessary additional elements, i.e. arms very close to the true maximal elements but nonetheless dominated. This problem motivates the following definition, which quantifies the notion of indistinguishability:

**Definition 2.5** (ε-indistinguishability). Let a, b ∈ S and ε > 0. a and b are ε-indistinguishable, noted $a \parallel_\varepsilon b$, if $|\gamma_{ab}| \leq \varepsilon$.

As the notation $\parallel_\varepsilon$ suggests, the ε-indistinguishability of two arms can be seen as a weaker form of incomparability; note that as ε decreases, previously indistinguishable pairs of arms become distinguishable, and the only 0-indistinguishable pairs of arms are the incomparable pairs.

**Algorithm 1** (Direct comparison).
Given a social poset (S, ⪰), δ, ε > 0, and a, b ∈ S:
- define $\hat{p}_{ab}$, the average number of victories of a over b, and $I_{ab}$ its 1 − δ confidence interval;
- compare a and b until $|I_{ab}| < \varepsilon$ or $0.5 \notin I_{ab}$;
- return $a \parallel_\varepsilon b$ if $|I_{ab}| < \varepsilon$; else a ≻ b if $\hat{p}_{ab} > 0.5$; else b ≻ a.
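The direct comparison process of Algorithm 1 can be sketched as follows. The sketch assumes an anytime Hoeffding-style confidence interval (the paper does not spell out its exact interval construction here, so this choice is ours) and a hypothetical `duel(a, b)` oracle that returns 1 when a wins.

```python
import math, random

def direct_compare(duel, a, b, delta, eps, max_pulls=10**6):
    """Sketch of Algorithm 1: duel a against b until the confidence
    interval around the win rate is shorter than eps, or excludes 0.5."""
    wins = 0
    for n in range(1, max_pulls + 1):
        wins += duel(a, b)
        # Anytime Hoeffding-style radius (union bound over n); an assumption,
        # not the paper's exact construction.
        radius = math.sqrt(math.log(4 * n * n / delta) / (2 * n))
        if 2 * radius < eps:                       # |I_ab| < eps
            return "indistinguishable"
        p_hat = wins / n
        if abs(p_hat - 0.5) > radius:              # 0.5 is outside I_ab
            return "a beats b" if p_hat > 0.5 else "b beats a"
    return "indistinguishable"                     # pull budget exhausted

def make_duel(gamma):
    """Hypothetical duel oracle; `gamma` maps ordered pairs to gamma_ij
    (with gamma_ji = -gamma_ij) and omits incomparable pairs (gamma = 0)."""
    return lambda i, j: int(
        random.random() < 0.5 + gamma.get((i, j), -gamma.get((j, i), 0.0))
    )
```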
The classical notions of a poset related to incomparability can easily be extended to fit ε-indistinguishability:

**Definition 2.6** (ε-antichain, ε-width and ε-approximation of P). Let ε > 0. C ⊆ S is an ε-antichain if ∀a ≠ b ∈ C, we have $a \parallel_\varepsilon b$. Additionally, P′ ⊆ S is an ε-approximation of P (noted P′ ∈ P_ε) if P ⊆ P′ and P′ is an ε-antichain. Finally, $\mathrm{width}_\varepsilon(S)$ is the size of the largest ε-antichain of S.

**Features of P_ε.** While the Pareto front is always unique, it might possess multiple ε-approximations. The interest of working with P_ε is threefold: (i) to find an ε-approximation of P, the agent only has to remove the elements of S that are not ε-indistinguishable from P; thus, even if P cannot be recovered in the partially observable setting, an ε-approximation of P can be obtained; (ii) any set in P_ε contains P, so no maximal element is discarded; (iii) for any B ∈ P_ε, all the elements of B are nearly optimal, in the sense that ∀i ∈ B, $\Delta_i \leq \varepsilon$. It is worth noting, though, that ε-approximations of P may structurally differ from P in some settings. For instance, if S includes an isolated cycle, an ε-approximation of the Pareto front may contain elements of the cycle, in which case approximating the Pareto front by an ε-approximation may lead to counterintuitive results. Finding an ε-approximation of P is the focus of the next subsection.

## 3 Chasing P_ε with Unchained Bandits

### 3.1 Peeling and the Unchained Bandits Algorithm

While deciding whether two arms are incomparable or merely very close is intractable, the agent can find out whether two arms a and b are ε-indistinguishable, using for instance the direct comparison process given by Algorithm 1. Our algorithm, Unchained Bandits, builds on this idea to efficiently retrieve an ε-approximation of the Pareto front. It is based on a peeling technique: given N > 0 and a decreasing sequence $(\varepsilon_t)_{1 \leq t \leq N}$, it computes and refines an $\varepsilon_t$-approximation $\hat{P}_t$ of the Pareto front, using UBSRoutine (Algorithm 3), which treats $\varepsilon_t$-indistinguishable arms as incomparable.

**Algorithm 2** (Unchained Bandits).
Given a social poset S = {s₁, . . . , s_K}, δ > 0, N > 0, and a decreasing sequence $(\varepsilon_t)_{t=1}^{N}$:
- set S₁ = S; maintain $\hat{p} = (\hat{p}_{ij})_{i,j=1}^{K}$, the average number of victories of i against j, and $(I_{ij})_{i,j=1}^{K}$ the corresponding $1 - \delta/(NK^2)$ confidence intervals;
- peel: for t = 1 to N, set $S_{t+1} = \mathrm{UBSRoutine}(S_t, \varepsilon_t, \delta/N;\ A = \text{Algorithm 1})$;
- return $\hat{P} = S_{N+1}$.

**Peeling S.** Peeling provides a way to control the time spent pulling indistinguishable arms, and it is used to upper bound the regret. Without peeling, i.e. if the algorithm were directly called with $\varepsilon_N$, the agent could spend a number of pulls proportional to $1/\varepsilon_N^2$ trying to distinguish two incomparable arms, even though one of them is a regret-inducing arm (e.g. an arm j with a large $|\gamma_{ij}|$ for some i ∈ P). The peeling strategy ensures that inefficient arms are eliminated in early epochs, before the agent focuses on the remaining arms with an affordable, larger number of comparisons.

**Algorithm subroutine.** At each epoch, UBSRoutine (Algorithm 3), called on $S_t$ with parameters ε > 0 and δ > 0, works as follows. It chooses a single initial pivot, i.e. an arm to which other arms are compared, and successively examines all the elements of $S_t$. The examined element c is compared to all the pivots (the current pivot and the previously collected ones), using Algorithm 1 with parameters ε and δ/K². Each pivot that is dominated by c is removed from the pivot set. If, after being compared to all the pivots, c has not been dominated, it is added to the pivot set. At the end, the set of remaining pivots is returned.

**Algorithm 3** (UBSRoutine).
Given a social poset $S_t$, a precision criterion $\varepsilon_t > 0$, and an error parameter δ′:
- initialisation: choose p ∈ $S_t$ at random and define the pivot set $\hat{P} = \{p\}$;
- for each c ∈ $S_t \setminus \{p\}$:
  - for every $c' \in \hat{P}$, compare c and c′ using Algorithm 1 with $(\delta = \delta'/|S_t|^2, \varepsilon = \varepsilon_t)$;
  - remove from $\hat{P}$ every c′ such that c ≻ c′;
  - if $c \parallel_{\varepsilon_t} c'$ for all $c' \in \hat{P}$, add c to $\hat{P}$;
- return $\hat{P}$.

**Reuse of information.** To optimize the efficiency of the peeling process, Unchained Bandits reuses previous comparison results: the empirical estimates $\hat{p}_{ab}$ and the corresponding confidence intervals $I_{ab}$ are initialized with the statistics collected during the previous pulls of a and b.
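Under the same assumptions as the previous sketch (hypothetical `duel` oracle, `direct_compare` as defined above), the peeling loop of Algorithm 2 and the pivot subroutine of Algorithm 3 can be sketched as follows; for brevity, the sketch omits the reuse of statistics across epochs described above.

```python
def ubs_routine(arms, eps, delta, duel):
    """Sketch of Algorithm 3 (UBSRoutine): one pass over the arms,
    maintaining a set of pivots that are pairwise eps-indistinguishable."""
    pivots = [arms[0]]                       # initial pivot, chosen arbitrarily
    for c in arms[1:]:
        dominated = False
        for p in list(pivots):
            verdict = direct_compare(duel, c, p, delta / len(arms) ** 2, eps)
            if verdict == "a beats b":       # c dominates the pivot p
                pivots.remove(p)
            elif verdict == "b beats a":     # c is dominated by a pivot
                dominated = True
                break
        if not dominated:                    # c survived every pivot
            pivots.append(c)
    return pivots

def unchained_bandits(arms, delta, n_epochs, beta, duel):
    """Sketch of Algorithm 2: peel with eps_t = beta ** t, t = 1..N."""
    candidates = list(arms)
    for t in range(1, n_epochs + 1):
        candidates = ubs_routine(candidates, beta ** t, delta / n_epochs, duel)
    return candidates                        # an eps_N-approximation of P
```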
### 3.2 Regret Analysis

In this part, we focus on geometrically decreasing peeling sequences, i.e. there exists β ∈ (0, 1) such that $\varepsilon_t = \beta^t$ for all t ≥ 0. We now introduce Theorem 1,¹ which gives an upper bound on the cost incurred by Unchained Bandits.

**Theorem 1.** Let R be the regret generated by Algorithm 2 applied to S with parameters δ, N and a decreasing sequence $(\varepsilon_t)_{t=1}^{N}$ such that $\varepsilon_t = \beta^t$ for all t. Then, with probability at least 1 − δ, Unchained Bandits successfully returns $\hat{P} \in P_{\varepsilon_N}$ after at most
$$T = \mathcal{O}\!\left(\frac{K\,\mathrm{width}_{\varepsilon_N}(S)\,\log(NK^2/\delta)}{\varepsilon_N^2}\right)$$
comparisons, with a corresponding upper bound on R that scales with $1/\beta^2$.

The $1/\beta^2$ factor reflects the fact that a careful peeling, i.e. β close to 1, is required to avoid unnecessarily expensive (regret-wise) comparisons: it prevents the algorithm from comparing two incomparable yet severely suboptimal arms for an extended period of time. Conversely, for a given approximation accuracy $\varepsilon_N = \varepsilon$, N grows as $\log(1/\varepsilon)/\log(1/\beta)$, since $\beta^N = \varepsilon$, which illustrates the fact that unnecessary peeling rounds, i.e. rounds that do not remove any arm, lead to a slightly increased regret. In general, β should be chosen close to 1 (e.g. 0.95), as the advantages tend to outweigh the drawbacks unless additional information about the poset structure is known.

**Influence of the complexity of S.** In the bound of Theorem 1, the complexity of S influences the result through its total size |S| = K and its width. One of the features of Unchained Bandits is that the dependency on S in Theorem 1 is |S|·width(S) and not |S|². For instance, if S is actually equipped with a total order, then width(S) = 1 and we recover the best possible dependency on |S|, as highlighted by the lower bound (see Theorem 2).

**Comparison lower bound.** We now show that the previous result is nearly optimal. Let A denote a dueling bandit algorithm on hidden posets. We first introduce the following assumption:

**Assumption 2.** For all $K > W \in \mathbb{N}^+$, all δ > 0 and all 1/8 > ε > 0, for any poset S such that |S| ≤ K and $\max_{P' \in P_\varepsilon(S)} |P'| \leq W$, A identifies an ε-approximation of the Pareto front $P_\varepsilon$ of S with probability at least 1 − δ using at most $T_A^{\delta,\varepsilon}(K, W)$ comparisons.

**Theorem 2.** Let A be a dueling bandit algorithm satisfying Assumption 2. Then for any δ > 0, any 1/8 > ε > 0 and any two positive integers K > W > 0, there exists a poset S with |S| = K, width(S) = |P(S)| = W and $\max_{P' \in P_\varepsilon(S)} |P'| \leq W$, such that
$$\mathbb{E}\big[T_A^{\delta,\varepsilon}(K, W) \,\big|\, A(S) = P(S)\big] \;=\; \Omega\big(KW \log(1/\delta)\big).$$

As for the usual dueling bandit upper and lower bounds (see e.g. [Komiyama et al., 2015]), the main discrepancy between our upper bound and this lower bound is essentially a factor logarithmic in K. It is worth noting that the KW complexity is directly related to the goal of finding the entire Pareto front, as can be seen in the proof of Theorem 2 (see the supplementary material).

¹ The complete proofs of all our results can be found in the supplementary material.
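Since the peeling sequence satisfies $\beta^N = \varepsilon_N$, the number of epochs needed to reach a target accuracy follows directly. A minimal helper (our own convenience function, not part of the paper's algorithms):

```python
import math

def peeling_schedule(beta, eps_target):
    """Smallest N with beta ** N <= eps_target, and the schedule (eps_t)."""
    n = math.ceil(math.log(eps_target) / math.log(beta))
    return n, [beta ** t for t in range(1, n + 1)]

# Example: beta = 0.9 and eps_target = 0.05 give N = 29 here; the MovieLens
# experiment of Section 5 takes the floor instead and uses N = 28.
```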
## 4 Finding P using Decoys

In this section, we discuss several methods for recovering the exact Pareto front from an ε-approximation, when S is a poset. First, note that P can be found if additional information on the poset is available. For instance, if a lower bound c > 0 is known on the minimum distance of any arm to the Pareto set, defined as $d(P) = \min\{\gamma_{ij} : i \in P,\ j \in S \setminus P,\ i \succ j\}$, then since $P_c = \{P\}$, Unchained Bandits used with $\varepsilon_N = c$ will produce the Pareto front of S. Alternatively, if the size k of the Pareto front is known, P can be found by peeling $S_t$ until the candidate set reaches this size; this can be achieved by successively calling UBSRoutine with parameters $S_t$, $\varepsilon_t = \beta^t$ and $\delta_t = 6\delta/(\pi^2 t^2)$, and by stopping as soon as $|S_t| = k$.

Such additional information may be unavailable in practice, so we propose an approach that does not rely on external information to solve the problem at hand. We devise a strategy that rests on the idea of decoys, which we now fully develop. First, we formally define decoys for posets, and we prove that they are a sufficient tool to solve the incomparability problem (Algorithm 4). We also present methods for building those decoys, both in the purely formal model of posets and in real-life problems. In the following, λ is a strictly positive real number.

**Definition 4.1** (λ-decoy). Let a ∈ S. Then b ∈ S is said to be a λ-decoy of a if:
1. a ≻ b and $\gamma_{a,b} \geq \lambda$;
2. ∀c ∈ S, a ∥ c implies b ∥ c;
3. ∀c ∈ S such that c ⪰ a, $\gamma_{c,b} \geq \lambda$.

The following proposition illustrates how decoys can be used to assess incomparability.

**Proposition 4.2** (Decoys and incomparability). Let a, b ∈ S, and let a′ (resp. b′) be a λ-decoy of a (resp. b). Then a and b are comparable if and only if $\max(\gamma_{b,a'}, \gamma_{a,b'}) \geq \lambda$.

Algorithm 4 is derived from this result.

**Algorithm 4** (Decoy comparison).
Given a poset (S, ⪰), δ, λ > 0, and a, b ∈ S:
- initialisation: create a′ and b′, the respective λ-decoys of a and b;
- maintain $\hat{p}_{ab'}$ (resp. $\hat{p}_{ba'}$), the average number of victories of a over b′ (resp. of b over a′), and $I_{ab'}$ (resp. $I_{ba'}$) its 1 − δ/2 confidence interval;
- compare a and b′, and b and a′, until $\max(|I_{ab'}|, |I_{ba'}|) < \lambda$, or $\hat{p}_{ab'} > 0.5$, or $\hat{p}_{ba'} > 0.5$;
- return a ∥ b if $\max(|I_{ab'}|, |I_{ba'}|) < \lambda$; else a ≻ b if $\hat{p}_{ab'} > 0.5$; else b ≻ a.

The next proposition, which is an immediate consequence of Proposition 4.2, gives a theoretical guarantee on its performance.

**Proposition 4.3.** Algorithm 4 returns the correct incomparability result with probability at least 1 − δ after at most T comparisons, where $T = 4\log(4/\delta)/\lambda^2$.

**Adding decoys to a poset.** A poset S may not contain all the necessary decoys. To alleviate this, the following proposition states that it is always possible to add relevant decoys to a poset.

**Proposition 4.4** (Extending a poset with a decoy). Let (S, ⪰, γ) be a dueling bandit problem on a poset S and a ∈ S. Define a′, S′, ⪰′, γ′ as follows:
- S′ = S ∪ {a′};
- ∀b, c ∈ S, b ⪰′ c if and only if b ⪰ c, and $\gamma'_{b,c} = \gamma_{b,c}$;
- ∀b ∈ S, if b ⪰ a then b ≻′ a′ and $\gamma'_{b,a'} = \max(\gamma_{b,a}, \lambda)$; otherwise, b ∥ a′.

Then (S′, ⪰′, γ′) defines a dueling bandit problem on a poset, $\gamma'|_S = \gamma$, and a′ is a λ-decoy of a.

Note that the addition of a decoy to a poset does not disqualify the previous decoys, so this proposition can be applied iteratively to produce the required number of decoys.
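A minimal sketch of the decoy comparison of Algorithm 4, under the same assumptions as the earlier sketches (anytime Hoeffding-style intervals of our choosing, hypothetical `duel` oracle); the arguments `a_decoy` and `b_decoy` are assumed to satisfy Definition 4.1.

```python
import math

def decoy_compare(duel, a, b, a_decoy, b_decoy, delta, lam, max_pulls=10**6):
    """Sketch of Algorithm 4: duel a against the decoy of b, and b against
    the decoy of a; by Proposition 4.2, a clear win reveals comparability."""
    wins_ab, wins_ba = 0, 0
    for n in range(1, max_pulls + 1):
        wins_ab += duel(a, b_decoy)
        wins_ba += duel(b, a_decoy)
        # Anytime Hoeffding-style 1 - delta/2 radius (our assumption).
        radius = math.sqrt(math.log(8 * n * n / delta) / (2 * n))
        if wins_ab / n - radius > 0.5:    # a beats b's decoy: a, b comparable
            return "a beats b"
        if wins_ba / n - radius > 0.5:    # b beats a's decoy: a, b comparable
            return "b beats a"
        if 2 * radius < lam:              # both intervals shorter than lambda
            return "incomparable"
    return "incomparable"
```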
**Decoys in real life.** The intended goal of a decoy a′ of a is to have at hand an arm that is known to be lesser than a. Creating such a decoy in real life can be done by using a degraded version of a: for an item in an online shop, for instance, a decoy can be obtained by increasing the price. Note that while Algorithm 4 requires fewer comparisons for large values of the decoy parameter λ (see Proposition 4.3), in real-life problems the second condition of Definition 4.1 then tends to become false: the new option is so much worse than the original that the decoy becomes comparable (and inferior) to all the other arms, including previously incomparable ones (for example, if the price becomes absurd). In that case, the use of decoys with arbitrarily large λ can lead to erroneous conclusions about the Pareto front and should be avoided. Given a specific decoy, the problem of estimating λ in a real-life problem may seem difficult. However, as decoys are not new (even though the use we make of them here is), a number of methods [Heath and Chatterjee, 1995] have been designed to estimate the quality of a decoy, which is directly related to λ; with limited work, this parameter may be estimated as well. We refer the interested reader to the aforementioned paper (and references therein) for more details on the available estimation methods.

**Using decoys.** As a consequence of Proposition 4.3, Algorithm 3 used with decoy comparisons (Algorithm 4) instead of direct comparisons, and with ε = λ, produces the exact Pareto front. But this process can be very costly, as the number of required comparisons is proportional to $1/\lambda^2$, even for strongly suboptimal arms. Therefore, when combined with decoys, Unchained Bandits first produces an ε-approximation $\hat{P}$ of P using the peeling approach and direct comparisons, before refining it into P by using Algorithm 3 together with decoys. The following theorems provide guarantees on the performance of this variant of Unchained Bandits.

**Theorem 3.** Unchained Bandits applied to S with decoys, parameters δ, N and a decreasing sequence $(\varepsilon_t)_{t=1}^{N-1}$ lower bounded by $\lambda\sqrt{K/\mathrm{width}(S)}$, returns the Pareto front P of S with probability at least 1 − δ after at most T comparisons, with
$$T = \mathcal{O}\!\left(\frac{K\,\mathrm{width}(S)\,\log(NK^2/\delta)}{\lambda^2}\right).$$

**Theorem 4.** Unchained Bandits applied to S with decoys, parameters δ, N and a decreasing sequence $(\varepsilon_t)_{t=1}^{N-1}$ such that $\varepsilon_{N-1} \leq \lambda\sqrt{K}$, returns the Pareto front P of S with probability at least 1 − δ while incurring a regret R whose bound adds to that of Theorem 1 an extra term of order
$$K\,\mathrm{width}(S)\,\log(NK^2/\delta) \sum_{i\,:\,\Delta_i < \varepsilon_{N-1},\ i \notin P} \frac{\Delta_i}{\lambda^2}.$$

This extra term is due to the regret incurred by the use of decoys. In this term, the dependency on S is slightly worse (K width(S) instead of K). However, this extra regret is limited to arms belonging to an ε-approximation of the Pareto front, i.e. nearly optimal arms.

**Constraints on ε.** Theorem 4 requires $\varepsilon_{N-1} \leq \lambda\sqrt{K}$, which ensures that only near-optimal arms remain during the decoy step. This is crucial to obtain a reasonable upper bound on the incurred regret, as the number of comparisons using decoys is large (proportional to $1/\lambda^2$) and is the same for every arm, regardless of its regret. Conversely, Theorem 3, which provides an upper bound on the number of comparisons required to find the Pareto front, requires the $\varepsilon_t$ to be lower bounded.
The latter bound is tight in the worst-case scenario where all the arms are λ-indistinguishable, i.e. where peeling cannot eliminate any arm. In that case, every comparison made during the peeling step is actually wasted, and the lower bound on the $\varepsilon_t$ controls the number of comparisons made during the peeling step. To satisfy both constraints, $\varepsilon_N$ must be chosen such that $\lambda\sqrt{K/\mathrm{width}(S)} \leq \varepsilon_N \leq \lambda\sqrt{K}$; in particular, $\varepsilon_N = \lambda\sqrt{K}$ satisfies both conditions and does not rely on the knowledge of width(S).

## 5 Numerical Simulations

### 5.1 Simulated Posets

Here, we test Unchained Bandits on randomly generated posets of different sizes, widths and heights. To evaluate the performance of Unchained Bandits, we compare it to three variants of dueling bandit algorithms, naively modified to handle partial orders and incomparability:

1. A simple algorithm, Uniform Sampling, inspired by the successive elimination algorithm [Even-Dar et al., 2006], which simultaneously compares all possible pairs of arms until one of the arms appears suboptimal, at which point it is removed from the set of selected arms. When only λ-indistinguishable elements remain, it uses λ-decoys.
2. A modified version of the single-pivot IF2 algorithm [Yue et al., 2012]. As in the regular IF2 algorithm, the agent maintains a pivot which is compared to every other element; suboptimal elements are removed, and better elements replace the pivot. This algorithm is useful to illustrate the consequences of the multi-pivot approach.
3. A modified version of RUCB [Zoghi et al., 2014]. This algorithm is useful to provide a non-pivot-based perspective.

More precisely, IF2 and RUCB were modified as follows: the algorithms were provided with the additional knowledge of d(P), the minimum gap between an arm of the Pareto front and any other comparable arm. When, during the execution of the algorithm, the empirical gap between two arms reaches this threshold, the arms are concluded to be incomparable. This allows the agent to retrieve the Pareto front iteratively, one element at a time.

The random posets are generated as follows (see the sketch after this subsection's parameters): a Pareto front of size p is created, and w disjoint chains of length h − 1 are added. Then, the top of each chain is connected to a random number of elements of the Pareto front. This creates the structure of the partial order. Finally, the exact values of the $\gamma_{ij}$ are drawn from a uniform distribution, conditioned on satisfying Assumption 1 and on having d(P) ≥ 0.01. When needed, λ-decoys are created according to Proposition 4.4. For each experiment, we change the value of one parameter and leave the others at their default values (p = 5, w = 2p, h = 10). Additionally, we provide one experiment studying the influence of the quality of the arms (the $\Delta_i$) on the incurred regret, by adding clearly suboptimal arms² to an existing poset. The results are averaged over ten runs and reported in Figure 1. By default, we use δ = 1/1000, λ = 1/100, β = 0.9 and $N = \lfloor \log(\lambda\sqrt{K})/\log\beta \rfloor$.

Figure 1 (caption): Regret incurred by Modified IF2, Modified RUCB, Uniform Sampling and Unchained Bandits as the structure of the poset varies. Dependence on (left) the height, (center) the size of the Pareto front, and (right) the addition of suboptimal arms.

² For this experiment, we say that an arm j is clearly suboptimal if there exists c ∈ P such that $\gamma_{cj} > 0.15$.
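The random poset generation described above can be sketched as follows (an illustrative rendering under our own conventions; the paper's exact conditioning of the γ values is only approximated here by the `gamma_min` constraint):

```python
import random

def random_poset(p, w, h, gamma_min=0.01):
    """Sketch of the Section 5.1 protocol: a Pareto front of size p, plus w
    disjoint chains of length h - 1 whose tops are attached to random
    subsets of the front. Returns strict dominance pairs and gamma values."""
    front = list(range(p))
    dominates, next_id = set(), p
    for _ in range(w):
        chain = [next_id + k for k in range(h - 1)]
        next_id += h - 1
        for i, hi in enumerate(chain):      # chain: each arm beats those below
            for lo in chain[i + 1:]:
                dominates.add((hi, lo))
        attached = random.sample(front, random.randint(1, p))
        for f in attached:                  # attached front arms beat the chain
            for c in chain:
                dominates.add((f, c))
    # Gammas satisfying Assumption 1, with every gap at least gamma_min
    # (a simplification of the paper's condition d(P) >= 0.01).
    gamma = {(i, j): random.uniform(gamma_min, 0.49) for (i, j) in dominates}
    return dominates, gamma
```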
**Result analysis.** While Uniform Sampling implements a naive approach, it does outperform the modified IF2. This can be explained as follows: in modified IF2, the pivot is constantly compared to all the remaining arms, including all the incomparable, and potentially strongly suboptimal, arms. These incomparable arms can only be eliminated after the pivot has changed, which can take a large number of comparisons and produces a large regret. Unchained Bandits and modified RUCB produce much better results than Uniform Sampling and modified IF2, and their advantage increases with the complexity of S. While Unchained Bandits performs better than modified RUCB in all the experiments, it is worth noting that the difference is particularly important when additional suboptimal arms are added. The general idea of RUCB is roughly to compare the best optimistic arm available to its closest opponent. While this approach works very well on a totally ordered set, on a poset it produces many comparisons between an optimal arm i and an incomparable arm j: in this case $\gamma_{ij} = 0$, so the probability that j wins a duel against i is 0.5, and j appears to be a close opponent to i, even though j can be clearly suboptimal.

### 5.2 MovieLens Dataset

To illustrate the application of Unchained Bandits on a concrete example, we use the 20-million-rating MovieLens dataset [Harper and Konstan, 2015], which contains movie evaluations. Movies can be seen as a poset, as two movies may be incomparable, for instance because they are from different genres (e.g. a horror movie and a documentary). To simulate a dueling bandit on a poset, we proceed as follows: we remove all films with fewer than 50,000 evaluations, thus obtaining 159 films, each represented as an arm. Then, to compare two arms, we pick at random a user who has evaluated both films and compare those evaluations, breaking ties with an unbiased coin toss (a code sketch of this oracle is given at the end of this subsection). Since the decoy tool cannot be used on an offline dataset, we restrict ourselves to finding an ε-approximation of the Pareto front, with ε = 0.05 and parameters β = 0.9, δ = 0.001 and $N = \lfloor \log\varepsilon / \log\beta \rfloor = 28$.

Due to the lack of a ground truth for this experiment, no regret estimate can be provided. Instead, the resulting ε-Pareto front, which contains 5 films, is listed in Table 1 and compared to the five films, among the original 159, with the highest average scores.

Table 1: Comparison between the five films of the computed ε-Pareto front (top row) and the five films with the highest average scores (bottom row).

| ε-Pareto front | Pulp Fiction | Fight Club | Shawshank Redemption | The Godfather | Star Wars Ep. V |
|---|---|---|---|---|---|
| Top five (average score) | Pulp Fiction | Usual Suspects | Shawshank Redemption | The Godfather | The Godfather II |

It is interesting to note that three films are present in both lists, which reflects the fact that the films with the best average scores have a high chance of belonging to the Pareto front. However, the films contained in the Pareto front are more diverse in terms of genre, which is expected of a Pareto front: for instance, the sequel of The Godfather is replaced by a film of a totally different genre. It is important to remember that Unchained Bandits does not have access to any information about the genre of a film: its results are based solely on the pairwise evaluations, which illustrates the effectiveness of our approach.

**Limits of the incomparability model.** The hypothesis that i ∥ j implies $\gamma_{ij} = 0$ might not hold in all real-life settings: for instance, movies of a niche genre will probably be dominated in user reviews by movies of a popular genre, even if they are theoretically incomparable, resulting in their elimination by Unchained Bandits. This might explain why only 5 movies are present in our ε-Pareto front. However, even in this case, the algorithm produces a subset of the Pareto front, made of incomparable movies from popular genres. Hence, while the algorithm may fail to find all the different genres, it still provides significant diversity.
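To make the comparison oracle of this experiment concrete, here is a sketch of the offline duel described above; the `ratings` layout (user to {movie: score}) is a hypothetical encoding of the MovieLens data, not the dataset's native format.

```python
import random
from collections import defaultdict

def make_movielens_duel(ratings):
    """Offline duel: pick a random user who rated both films and compare
    her two scores, breaking ties with an unbiased coin toss."""
    raters = defaultdict(set)                 # movie -> users who rated it
    for user, scores in ratings.items():
        for movie in scores:
            raters[movie].add(user)

    def duel(a, b):
        common = list(raters[a] & raters[b])  # assumes >= 1 common rater
        user = random.choice(common)
        sa, sb = ratings[user][a], ratings[user][b]
        return random.randint(0, 1) if sa == sb else int(sa > sb)

    return duel
```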
## 6 Conclusion

We introduced dueling bandits on posets and the problem of ε-indistinguishability, and we provided a new algorithm, Unchained Bandits, to identify the Pareto front, together with theoretical performance guarantees and compelling experiments. Future work might include the study of the influence of additional hypotheses on the structure of the social poset, and whether some of the ideas proposed here carry over to lattices or upper semi-lattices. Additionally, it is an interesting question whether different approaches to dueling bandits, such as Thompson sampling [Wu and Liu, 2016], could be applied to the partial order setting, and whether results for the von Neumann problem [Balsubramani et al., 2016] remain valid in the poset setting.

## Acknowledgements

We would like to thank the anonymous reviewers of this work for their useful comments, particularly regarding the future work section.

## References

Nir Ailon, Zohar Karnin, and Thorsten Joachims. Reducing dueling bandits to cardinal bandits. In Proceedings of the 31st International Conference on Machine Learning, pages 856–864, 2014.

Dan Ariely and Thomas S. Wallsten. Seeking subjective dominance in multidimensional space: An explanation of the asymmetric dominance effect. Organizational Behavior and Human Decision Processes, 63(3):223–232, 1995.

Robert B. Ash. Information Theory. 1990.

Akshay Balsubramani, Zohar Karnin, Robert E. Schapire, and Masrour Zoghi. Instance-dependent regret bounds for dueling bandits. In Conference on Learning Theory, pages 336–360, 2016.

Constantinos Daskalakis, Richard M. Karp, Elchanan Mossel, Samantha J. Riesenfeld, and Elad Verbin. Sorting and selection in posets. SIAM Journal on Computing, 40(3):597–622, 2011.

Madalina M. Drugan and Ann Nowé. Designing multi-objective multi-armed bandits algorithms: a study. In The 2013 International Joint Conference on Neural Networks (IJCNN), pages 1–8. IEEE, 2013.

Miroslav Dudík, Katja Hofmann, Robert E. Schapire, Aleksandrs Slivkins, and Masrour Zoghi. Contextual dueling bandits. In Conference on Learning Theory, pages 563–587, 2015.

Eyal Even-Dar, Shie Mannor, and Yishay Mansour. Action elimination and stopping conditions for the multi-armed bandit and reinforcement learning problems. Journal of Machine Learning Research, 7:1079–1105, 2006.

Uriel Feige, Prabhakar Raghavan, David Peleg, and Eli Upfal. Computing with noisy information. SIAM Journal on Computing, 23(5):1001–1018, 1994.

F. Maxwell Harper and Joseph A. Konstan. The MovieLens datasets: History and context. ACM Transactions on Interactive Intelligent Systems (TiiS), 5(4):19, 2015.

Timothy B. Heath and Subimal Chatterjee. Asymmetric decoy effects on lower-quality versus higher-quality brands: Meta-analytic and experimental evidence. Journal of Consumer Research, 22(3):268–284, 1995.

Joel Huber, John W. Payne, and Christopher Puto. Adding asymmetrically dominated alternatives: Violations of regularity and the similarity hypothesis. Journal of Consumer Research, pages 90–98, 1982.

Junpei Komiyama, Junya Honda, Hisashi Kashima, and Hiroshi Nakagawa. Regret lower bound and optimal algorithm in dueling bandit problem. In Conference on Learning Theory, pages 1141–1154, 2015.

Junpei Komiyama, Junya Honda, and Hiroshi Nakagawa. Copeland dueling bandit problem: Regret lower bound, optimal algorithm, and computationally efficient algorithm. arXiv preprint arXiv:1605.01677, 2016.

Siddartha Y. Ramamohan, Arun Rajkumar, and Shivani Agarwal. Dueling bandits: Beyond Condorcet winners to general tournament solutions. In Advances in Neural Information Processing Systems, pages 1253–1261, 2016.
Constantine Sedikides, Dan Ariely, and Nils Olsen. Contextual and procedural determinants of partner selection: Of asymmetric dominance and prominence. Social Cognition, 17(2):118–139, 1999.

Amos Tversky and Daniel Kahneman. The framing of decisions and the psychology of choice. Science, 211(4481):453–458, 1981.

Huasen Wu and Xin Liu. Double Thompson sampling for dueling bandits. In Advances in Neural Information Processing Systems, pages 649–657, 2016.

Yisong Yue and Thorsten Joachims. Beat the mean bandit. In Proceedings of the 28th International Conference on Machine Learning (ICML-11), pages 241–248, 2011.

Yisong Yue, Josef Broder, Robert Kleinberg, and Thorsten Joachims. The K-armed dueling bandits problem. Journal of Computer and System Sciences, 78(5):1538–1556, 2012.

Masrour Zoghi, Shimon Whiteson, Remi Munos, and Maarten de Rijke. Relative upper confidence bound for the K-armed dueling bandit problem. In Proceedings of the 31st International Conference on Machine Learning (ICML-14), pages 10–18, 2014.

Masrour Zoghi, Zohar S. Karnin, Shimon Whiteson, and Maarten de Rijke. Copeland dueling bandits. In Advances in Neural Information Processing Systems, pages 307–315, 2015a.

Masrour Zoghi, Shimon Whiteson, and Maarten de Rijke. MergeRUCB: A method for large-scale online ranker evaluation. In Proceedings of the Eighth ACM International Conference on Web Search and Data Mining, pages 17–26. ACM, 2015b.