# Dueling Bandits with Team Comparisons

Lee Cohen (Tel Aviv University) leecohencs@gmail.com · Ulrike Schmidt-Kraepelin (Technische Universität Berlin) u.schmidt-kraepelin@tu-berlin.de · Yishay Mansour (Tel Aviv University and Google Research) mansour.yishay@gmail.com

Abstract. We introduce the dueling teams problem, a new online-learning setting in which the learner observes noisy comparisons of disjoint pairs of k-sized teams from a universe of n players. The goal of the learner is to minimize the number of duels required to identify, with high probability, a Condorcet winning team, i.e., a team which wins against any other disjoint team (with probability at least 1/2). Noisy comparisons are linked to a total order on the teams. We formalize our model by building upon the dueling bandits setting (Yue et al., 2012) and provide several algorithms, both for stochastic and deterministic settings. For the stochastic setting, we provide a reduction to the classical dueling bandits setting, yielding an algorithm that identifies a Condorcet winning team within O((n + k log(k)) max(log log n, log k) / Δ²) duels, where Δ is a gap parameter. For deterministic feedback, we additionally present a gap-independent algorithm that identifies a Condorcet winning team within O(nk log(k) + k⁵) duels.

1 Introduction

Multi-armed bandits (MAB) is a classical model of decision making under uncertainty. Despite the simplicity of the model, it already incorporates the essential tradeoff between exploration and exploitation. In MAB, the learner performs actions and can only observe rewards of the actions performed. One of the main tasks in MAB is best arm identification, where the goal is to identify a near-optimal action while minimizing the number of actions executed. The MAB model has numerous practical applications, including online advertising, recommendation systems, clinical trials, and more.
(See Slivkins (2019); Lattimore & Szepesvári (2020) for more background.) One weakness of the MAB model is the assumption that real-valued rewards are always available. In many applications, it is more natural to compare two actions and observe which one of them is better rather than give every single action a numerical reward. For example, recommendation systems often suggest two items and obtain only their relative preference as feedback (e.g., by a click on one of them). This leads very naturally to the well-known model of dueling bandits (Yue et al., 2012), where the learner selects a pair of actions each time and observes the binary winner of a duel between the two. (See Busa-Fekete et al. (2018) for a survey on extensions of the dueling bandits model.) In this work we are interested in the case where the learner has to select two disjoint teams for a duel, which are k-sized subsets of the actions (which we call players). This appears naturally in sports or online games, where the goal is to pick one of the best teams from a set of players by observing the outcomes of matches (say, to be a school representative team, or to sponsor for tournaments). Examples include doubles tennis, basketball, and the online game League of Legends, where each match requires two disjoint teams of players to compete. Similar phenomena appear in working environments, where different R&D teams compete on implementing a project. Another example could be online advertisements where multiple products are bundled into a display ad and a customer can click on either of two presented bundles; e.g., some online games offer in-app bundle purchases, and the information regarding sales of different bundles can improve the bundle composition. Our basic model is the following.

These authors contributed equally to this work.
35th Conference on Neural Information Processing Systems (NeurIPS 2021).
We have a universe of n players, and at each iteration the learner selects two disjoint teams for a duel and observes the winner. For any two different teams, there exists an unknown stationary probability that determines the winner of a duel between them. The requirement that dueling teams be disjoint is in accordance with the situation in games, where a single person cannot play for both teams. The goal of the learner is to minimize the number of duels required to identify, with high probability, a Condorcet winning team, i.e., a team which wins against any other disjoint team (with a probability of at least 1/2). We assume these probabilities are linked to a strict total order on all teams, which implies the existence of a Condorcet winning team, yet it is typically not unique. We make two minimal and natural assumptions on this total order on teams: that it is consistent with some total order among the players, and that the probabilistic team comparisons satisfy strong stochastic transitivity, a common assumption in dueling bandits settings. Our consistency assumption implies that the best team is the team of the top-k players, which is also a Condorcet winning team. However, not all relations between players are deducible by the learner. In particular, even accurate estimates of the latent winning probabilities between all disjoint teams might not suffice to separate the top-k players from the rest. Consider for example an instance with four players 1 ≻ 2 ≻ 3 ≻ 4, where k = 2 and the total order among the teams is lexicographic, i.e., 12 ≻ 13 ≻ 14 ≻ 23 ≻ 24 ≻ 34. Assume that each of the three feasible duels is won by the team containing player 1 with a fixed probability γ > 1/2. Then, the learner has no chance of detecting the team 12 as the top-k team: the observable feedback is invariant under any relabeling of players 2, 3, and 4. However, any of the teams 12, 13 and 14 is a Condorcet winning team.
Our main target is to present algorithms for which the number of duels is bounded by a polynomial in the number of players n and the team size k, although the number of teams is exponential in k, i.e., Ω((n/k)^k), and the number of valid duels is Ω(2^k (n/2k)^{2k}). Even if one were to accept an exponential number of arms, a direct reduction to the standard dueling bandits setting would not be feasible, as not all pairs of teams are comparable in our model. In particular, duels of the form (S ∪ {a}, S ∪ {b}), which would yield a signal regarding the relation between players a and b, are forbidden. The inherent difficulty of our endeavor comes from two limitations: (1) not all relations between two single players are deducible (see the example above), and (2) even for pairs of players with a deducible relation, having Ω(2^k (n/2k)^{2k}) valid duels and the same number of (latent) winning probabilities makes the task of deducing their relations hard. We start by giving a full characterization of the deducible pairwise relations between players, namely relations that can be detected by an unlimited number of duels. Our characterization implies that every deducible player relation has one of two types of witnesses, which are constant-size sets of duels that prove the relation. Once we find a witness for one pair of players, it can often be transformed into a witness for other pairs of players. Building upon this characterization, we introduce a parameter Δ_{a,b} which captures the distinguishability of any two players a and b and takes the value 0 whenever the pair is not deducible. Assuming Δ := Δ_{k,k+1} > 0, where k and k + 1 are the kth and (k + 1)th best players, we give a reduction to the classic dueling bandits problem. Combining this reduction with a high-probability top-k identification algorithm for the dueling bandits setting (e.g., Mohajer et al. (2017); Ren et al.
(2020)) yields a similar sample complexity upper bound; e.g., this implies a high-probability top-k identification algorithm for dueling teams with O(Δ⁻²(n + k log(k)) max(log log n, log k)) duels. Interestingly, it turns out that the deterministic case, i.e., when winning probabilities are in {0, 1}, constitutes a challenging special case of our problem in which Δ can be particularly small, or even 0. To overcome this issue we design delicate algorithms which are independent of Δ. On a high level, a preprocessing procedure first excludes as many bad players as possible. To do so, it runs a method for identifying pairwise relations between players which performs only a small number of duels, but has little control over the pair for which the relation is uncovered. For general total orders this yields an algorithm requiring O(nk log(k) + 2^{O(k)}) duels. For the natural case of additive linear orders, we present a more elaborate approach for detecting a Condorcet winning team within the reduced instance, resulting in an algorithm that performs O(nk log(k) + k⁵) duels. We introduce our model in Section 2, give a characterization of deducible relations in Section 3, and discuss the stochastic and deterministic settings in Sections 4 and 5, respectively. In Section 6 we provide a discussion including a lower bound and a regret bound. Algorithms and (full) proofs are relegated to Sections A-D in the appendix, and Section E characterizes additive linear total orders.

1.1 Related Work

MAB best arm or subset identification. Single arm identification was initiated in Even-Dar et al. (2006) and later studied in many works including Bubeck et al. (2011); Kaufmann et al. (2016); Chen et al. (2017). This setting was extended by Kalyanakrishnan & Stone (2010) to multiple arms identification (i.e., top-k arms), using single arm samples. Other works that address the objective of top-k identification include Chen et al. (2014); Zhou et al. (2014); Bubeck et al. (2013).
Dueling bandits. The work of Yue et al. (2012) laid down the framework of non-parametric bandit feedback under the assumptions of a total order among arms, strong stochastic transitivity, and the stochastic triangle inequality, and was followed by many subsequent works (for a survey, see Busa-Fekete et al. (2018)). In particular, some subsequent works target the task of identifying the top-k players in this setting (Mohajer et al. (2017); Ren et al. (2020)).

Dueling bandits with sets of actions. One line of dueling bandits extensions considers the case where the learner selects a subset of actions and observes the outcomes of all duels between all pairs of actions in the subset (Brost et al., 2016; Sui et al., 2017), or the winner of the subset (Saha & Gopalan (2018); Ren et al. (2018)). As a consequence, these settings give the learner strictly more information than the dueling bandits setting. In contrast, feedback in our setting reveals less information.

MAB with multiple actions selection. There are works in which the learner selects a (sometimes fixed-sized) subset of actions at each iteration and observes either all of the rewards of the individual selected arms (semi-bandit feedback) or an aggregated form of the rewards (full-bandit feedback), and the task is to detect the best arm or the top k. These include combinatorial bandits (Cesa-Bianchi & Lugosi, 2012), top-k (Rejwan & Mansour, 2020), linear bandits and routing (Awerbuch & Kleinberg, 2008), and more. The main difference between combinatorial bandits and our setting is the feedback.

Comparison models. Noisy pairwise comparison models, especially for sorting and ranking, have a long history which dates back to the 1950s (for a survey, see Pelc (2002)).
Specifically, the mathematical problem of the counterfeit coin was introduced in the form of a puzzle (Grossman, 1945): given a pile of 12 coins, determine which coin has a different weight (and is therefore counterfeit) using balance scales while minimizing the number of measurements. The problem was followed by numerous generalizations (see Guy & Nowakowski (1995)). While this problem is restricted to coins with two different weights, our setting can be seen as a variant with multiple weights. Finally, a related PAC comparison-based model was studied in Balcan et al. (2016).

2 The Dueling Teams Problem

We formalize our problem as follows. Let n, k ∈ ℕ with 1 ≤ k ≤ n/2. We denote the set of players by [n] := {1, . . . , n} and call any set of k distinct players a team. Moreover, we assume the existence of an underlying strict total order among all teams and denote it by ≻. We also refer to ≻ as the ground truth order. In particular, for any two teams A and B, either A ≻ B holds, in which case we say that A is better than B, or vice versa, and this relation is transitive. Additionally, we require the total order among the teams to be consistent with a total order among the players, and formalize this in the consistency assumption at the end of this section. In each round, the learner selects an ordered pair of two disjoint teams, A and B, to perform a duel, and receives noisy binary feedback about which team is better. Note that in contrast to the usual dueling bandits setting, our setting does not allow duels between intersecting teams, as selecting teams with mutual players for a duel is not an option. We denote the (directly) observable part of ≻ by ≻_obs, i.e., A ≻_obs B iff A and B are disjoint teams and A ≻ B. Note that ≻_obs is not transitive. We write A > B if team A is the random winner of the duel (A, B). The probability Pr[A > B] is stationary and denoted by P_{A,B} = Pr[A > B].
In each duel of team A against team B, the outcome A > B is sampled independently from a Bernoulli distribution with parameter P_{A,B} = 1 − P_{B,A}. We assume that the probabilistic comparisons are linked to the total order among the teams, i.e., A ≻ B implies P_{A,B} > 1/2, and that P_{A,B} exists for every pair of teams (not only disjoint ones). In the deterministic setting, it holds that P_{A,B} ∈ {0, 1} for any teams A ≠ B. In other words, A ≻_obs B iff the outcome of each duel (A, B) is A > B, and for two disjoint teams A and B the learner can observe whether A ≻_obs B or B ≻_obs A by performing a single duel. A team A is a Condorcet winning team² if A ≻_obs B for all teams B such that A ∩ B = ∅. From our assumptions on ≻, there always exists a Condorcet winning team, but it is not necessarily unique. The learner's goal is to minimize the number of duels required to identify, with high probability in the stochastic setting and with probability 1 in the deterministic case, a Condorcet winning team. In the following we formalize two more assumptions we impose on our model; the former affects the linking of the probabilities to the strict total order ≻, the latter restricts the total order itself.

Strong stochastic transitivity (SST). Similarly to the dueling bandits setting in Yue et al. (2012), we assume strong stochastic transitivity. Namely, for every triplet of different teams A ≻ B ≻ C it holds that P_{A,C} ≥ max{P_{A,B}, P_{B,C}}.

Consistency. We assume that the total order ≻ is consistent with a total order among single players. More precisely, we say that ≻ satisfies consistency³ if for every two players a, b ∈ [n] exactly one of the following holds true: (i) S ∪ {a} ≻ S ∪ {b} for all S ⊆ [n] \ {a, b} with |S| = k − 1; (ii) S ∪ {b} ≻ S ∪ {a} for all S ⊆ [n] \ {a, b} with |S| = k − 1. The consistency assumption lets us derive a relation among the single players by defining a ≻ b iff S ∪ {a} ≻ S ∪ {b} holds for some S. By transitivity of the team relation, ≻ implies a total order on [n].
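To make the model concrete, the deterministic special case can be encoded as a tiny duel oracle. The following sketch is ours, not the paper's code, and the instance is illustrative: it uses an additive order (team value = sum of player values), which satisfies consistency, and, since outcomes are 0/1, SST holds trivially.

```python
def make_additive_duel(values):
    """Deterministic dueling-teams oracle built from additive player
    values: team A beats team B iff A's total value is larger.
    Swapping in a higher-valued player always helps, so the induced
    team order is consistent with the player order."""
    def duel(A, B):
        # only disjoint teams may duel in this model
        assert set(A).isdisjoint(B), "teams must be disjoint"
        return sum(values[p] for p in A) > sum(values[p] for p in B)
    return duel

# n = 6 players, k = 2; player 0 is best, player 5 worst.
# Powers of two avoid ties, so the team order is strict.
values = {0: 32, 1: 16, 2: 8, 3: 4, 4: 2, 5: 1}
duel = make_additive_duel(values)
assert duel({0, 1}, {2, 3})       # the top-k team wins
assert not duel({4, 5}, {2, 3})   # the bottom team loses
```

Here the team of the top-k players, {0, 1}, beats every disjoint team, i.e., it is a Condorcet winning team.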
Whenever we write a ≻ b for some players a, b ∈ [n], this is short-hand notation for S ∪ {a} ≻ S ∪ {b} for all subsets S ⊆ [n] \ {a, b} of size k − 1. Though the consistency assumption might appear strong at first sight, dropping it would render the model intractable, as the order among the teams would be independent of the players' identities; namely, there would be no linkage to the composition of the teams.⁴ For notational convenience, we assume without loss of generality that 1 ≻ 2 ≻ · · · ≻ n and write A*_m for the set containing the top m players, i.e., A*_m = [m]. In particular, the consistency assumption yields that A*_k is a Condorcet winning team. However, though the ground truth ranking induces a total order among the players, the learner might not be able to deduce the entire order. In the following we give a characterization of the deducible part of the ground truth order ≻.

3 Witnesses: A Characterization of Deducible Relations

In this section we provide a high-level description of the complete characterization of all pairwise relations between single players that can be deduced via team duels. We refer the reader to the full description of the characterization, which appears in Appendix A. Though single player relations cannot be sampled via feasible team duels directly, as intersecting teams cannot duel each other, we show a sufficient and necessary condition for deducible relations in the form of a constant number of winning probabilities of observable (feasible) duels. We refer to pairs of sets of players participating in such duels as witnesses. We denote by P_obs the set of all tuples (P′, ≻′), where each P′ is a team winning probability matrix that satisfies SST w.r.t. ≻′, which is a consistent strict total order on teams, and both P′ and ≻′ are compatible with the winning probabilities of observable duels and with each other, i.e., P′_{A,B} = P_{A,B} whenever A and B are disjoint teams, and P′_{A,B} = 1 − P′_{B,A} > 1/2 iff A ≻′ B.
We remark that it follows directly from the definition of P_obs that (P, ≻) ∈ P_obs, where P is the ground truth winning probability matrix and ≻ the ground truth total order. We denote by C_obs the set of strict total orders ≻′ for which there exists a tuple (P′, ≻′) ∈ P_obs. Intuitively, a total order ≻′ is in C_obs if there exists an instance (P′, ≻′) that has ≻′ as its ground truth total order, follows our assumptions, and is compatible with the observable part of (P, ≻). Lastly, we define A ⊐ B if and only if A ≻′ B for all ≻′ ∈ C_obs, where A and B are not necessarily disjoint. We refer to ⊐ as the deducible relation. For single player relations, we define a ⊐ b if and only if there exists S ⊆ [n] \ {a, b} such that S ∪ {a} ≻′ S ∪ {b} for all ≻′ ∈ C_obs. We stress that we only use P_obs and C_obs for analysis and never compute them.

²The name is motivated by the fact that such a team is a weak Condorcet winner for the relation ≻_obs.
³Consistency is also reminiscent of separable preferences from economic theory (Breton & Sen, 1999).
⁴For example, the following instance would be possible: players n − k + 1, . . . , n are the k intuitively worst players (i.e., they lose in almost all duels) but together they form the best team (and a unique Condorcet winning team), while players 1, . . . , k are the intuitively best players and form the second best team in the linear order among teams. In such a model any algorithm would need to iterate over all teams. In other words, the problem becomes a dueling bandits problem with Ω((n/k)^k) arms.
Next, we define two sets of potential witnesses that have a simple structure and, in some cases, allow us to deduce a single player relation: (1) a potential subsets witnesses set, denoted by S_{a,b}, that contains all pairs (S, S′) such that S and S′ are disjoint subsets of [n] \ {a, b}, both of size k − 1, and (2) a potential subset-team witnesses set, denoted by T_{a,b}, that contains all pairs (S, T) where S and T are disjoint subsets of [n] \ {a, b}, such that S is of size k − 1 and T is of size k (and is therefore a team). Below, we define under which conditions a potential witness is indeed a witness.

Definition 3.1. An element (S, S′) ∈ S_{a,b} is a subsets witness for a ⊐ b if P_{S∪{a}, S′∪{b}} > P_{S∪{b}, S′∪{a}}. An element (S, T) ∈ T_{a,b} is a subset-team witness for a ⊐ b if P_{S∪{a}, T} > P_{S∪{b}, T}.

We capture the set of elements of S_{a,b} that are subsets witnesses for a ⊐ b by S*_{a,b}, and analogously T*_{a,b} = {(S, T) ∈ T_{a,b} | (S, T) is a subset-team witness for a ⊐ b}. It might be the case that S*_{a,b} ∪ T*_{a,b} is empty; in particular this holds when b ≻ a. It is also possible that both S*_{a,b} ∪ T*_{a,b} and S*_{b,a} ∪ T*_{b,a} are empty, in which case we will show that the relation between players a and b cannot be deduced. The following theorem implies that the other direction is also true.

Theorem 3.2. Let a, b ∈ [n]. Then, a ⊐ b if and only if S*_{a,b} ∪ T*_{a,b} ≠ ∅.

Proof sketch. Assume that S*_{a,b} ∪ T*_{a,b} ≠ ∅. We show that a ⊐ b by using SST, the fact that ≻ is a consistent strict total order, and an exhaustive case analysis. For the sake of illustration we present only one case here, namely, that (S, S′) ∈ S*_{a,b} and that both (1) S ∪ {a} ≻ S′ ∪ {b} and (2) S ∪ {b} ≻ S′ ∪ {a} hold. Assume for contradiction that a ⊐ b does not hold. It thus follows that there exists an order ≻′ ∈ C_obs for which b ≻′ a holds. Let P′ be the corresponding winning probabilities. Then, using consistency of ≻′ and (1) respectively, we get S ∪ {b} ≻′ S ∪ {a} ≻′ S′ ∪ {b}, and from SST, P′_{S∪{b}, S′∪{b}} ≥ P′_{S∪{a}, S′∪{b}} > 1/2.
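Given (hypothetical) oracle access to the winning probabilities P_{A,B}, Definition 3.1 can be checked directly. The following sketch is ours; the function names and the additive toy instance are assumptions for illustration only.

```python
def is_subsets_witness(P, S, Sp, a, b):
    """(S, Sp) is a subsets witness for a over b iff
    P(S+{a}, Sp+{b}) > P(S+{b}, Sp+{a})  (Definition 3.1)."""
    return P(S | {a}, Sp | {b}) > P(S | {b}, Sp | {a})

def is_subset_team_witness(P, S, T, a, b):
    """(S, T) is a subset-team witness for a over b iff
    P(S+{a}, T) > P(S+{b}, T)  (Definition 3.1)."""
    return P(S | {a}, T) > P(S | {b}, T)

# Deterministic additive toy instance with n = 6, k = 2, so |S| = 1.
# Player order by value: 0 > 3 > 4 > 2 > 1 > 5.
values = {0: 10, 1: 3, 2: 5, 3: 8, 4: 6, 5: 1}
P = lambda A, B: float(sum(values[p] for p in A) > sum(values[p] for p in B))

assert is_subsets_witness(P, {2}, {3}, 0, 1)         # proves 0 over 1
assert is_subset_team_witness(P, {2}, {3, 4}, 0, 1)  # proves 0 over 1
assert not is_subsets_witness(P, {2}, {3}, 1, 0)     # no witness for 1 over 0
```

In the subset-team case above, the team {2, 0} (value 15) beats T = {3, 4} (value 14) while {2, 1} (value 8) loses to it, which is exactly the deterministic witness pattern discussed in Section 5.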
In addition, applying consistency again, it follows that S ∪ {b} ≻′ S′ ∪ {b} ≻′ S′ ∪ {a}. Applying SST once more, we get P′_{S∪{b}, S′∪{a}} ≥ P′_{S∪{b}, S′∪{b}} ≥ P′_{S∪{a}, S′∪{b}}, a contradiction to (S, S′) ∈ S*_{a,b} (since the latter implies P_{S∪{a}, S′∪{b}} > P_{S∪{b}, S′∪{a}}, and both probabilities are observable, so P′ agrees with P on them). For the other direction, we start by defining D_a as the set of observable duels (A, B) such that a ∈ A. Moreover, we define a permutation π on the set of teams which simply exchanges the players a and b when present. We then show that a ≻ b implies P_{A,B} ≥ P_{π(A),π(B)} for all (A, B) ∈ D_a. Moreover, we show that a ⊐ b implies that there exists (A, B) ∈ D_a with P_{A,B} > P_{π(A),π(B)}, as follows. Assume not. Then we show that the relation ≻′ defined by A ≻′ B iff π(A) ≻ π(B) is included in C_obs. However, a ⊐ b implies that for any S ⊆ [n] \ {a, b} of size k − 1 it holds that S ∪ {a} ⊐ S ∪ {b}, which implies (i) S ∪ {a} ≻ S ∪ {b} as well as (ii) S ∪ {a} ≻′ S ∪ {b}. Applying the definitions of ≻′ and π, statement (ii) implies S ∪ {b} = π(S ∪ {a}) ≻ π(S ∪ {b}) = S ∪ {a}, and hence yields a contradiction to (i). Finally, take some (A, B) ∈ D_a with P_{A,B} > P_{π(A),π(B)}. If b ∈ B, then (A \ {a}, B \ {b}) ∈ S*_{a,b}; otherwise, (A \ {a}, B) ∈ T*_{a,b}.

For the sake of brevity, we introduce the set X_{a,b}, which combines the pairs from S_{a,b} and T_{a,b} into a set of triples. Formally, X_{a,b} = {(S, S′, T) | (S, S′) ∈ S_{a,b}, (S, T) ∈ T_{a,b}}. We say that (S, S′, T) is a witness for a ⊐ b if (S, S′) ∈ S*_{a,b} or (S, T) ∈ T*_{a,b}, and denote (S, S′, T) ∈ X*_{a,b}.

4 Stochastic Setting

In this section we focus on algorithms identifying, with high probability, the top-k team, which is in particular a Condorcet winning team. The main idea is to utilize the results from the previous section to reduce the dueling teams setting to the classic dueling bandits setting (Yue et al. (2012)). To this end we will introduce our gap parameter Δ, which intuitively captures how easy it is to prove the relationship between the top-k and the top-(k + 1) player.
We start by defining, for any element (S, S′, T) ∈ X_{a,b}, a random variable X_{a,b}(S, S′, T) that combines the outcomes of four duels and helps determine whether (S, S′, T) is a witness for a ⊐ b. Formally,

X_{a,b}(S, S′, T) = (1[S ∪ {a} > S′ ∪ {b}] − 1[S ∪ {b} > S′ ∪ {a}] + 1[S ∪ {a} > T] − 1[S ∪ {b} > T]) / 4.

Observe that X_{a,b}(S, S′, T) can take values in {−1/2, −1/4, 0, 1/4, 1/2}, thus E[X_{a,b}(S, S′, T)] ∈ [−1/2, 1/2]. Moreover, we have the following properties:

1. For every (S, S′, T) ∈ X_{a,b}, it holds that E[X_{a,b}(S, S′, T)] > 0 if and only if (S, S′, T) ∈ X*_{a,b}.
2. If E[X_{a,b}(S, S′, T)] = 0 for every (S, S′, T) ∈ X_{a,b}, then Theorem 3.2 implies that the pairwise relation between players a and b cannot be deduced.

Building upon the random variables X_{a,b}(S, S′, T), which are defined for a fixed pair of players a, b and for each element in X_{a,b}, we define a single random variable X_{a,b} by picking a triple (S, S′, T) ∈ X_{a,b} uniformly at random and returning a realization of X_{a,b}(S, S′, T). For convenience, whenever we write E[X_{a,b}] we mean E_{(S,S′,T)∼X_{a,b}}[X_{a,b}]. Using the probabilistic method, we obtain the following theorem, which then brings us to the definition of a gap parameter for our problem.

Theorem 4.1. For every two players a, b ∈ [n] it holds that a ⊐ b if and only if E[X_{a,b}] > 0.

Gap parameter. We define our gap parameter by Δ := E[X_{k,k+1}]. Another interpretation of Δ is that Δ = (α + β − 1)/2, where α ∈ [0, 1] is the probability that the top-k player wins against the top-(k + 1) player when the two players are placed with a random potential subsets witness from S_{k,k+1} (i.e., disjoint random subsets of k − 1 players), and β ∈ [0, 1] captures the difference in winning probabilities when k and k + 1 are placed with a random potential subset-team witness from T_{k,k+1} (i.e., two disjoint random subsets of sizes k − 1 and k). In Appendix B we show that for k = 1, this gap is at least of the same order as the gap parameter for the classical dueling bandits setting.
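The definition above suggests a simple Monte Carlo estimate of E[X_{a,b}] (and hence of Δ when a, b are the top-k and top-(k + 1) players). The sketch below is ours; the deterministic additive instance is an assumption for illustration, and duel outcomes are 0/1, so each draw is one realization of X_{a,b}:

```python
import random

def sample_X(duel, n, k, a, b, rng):
    """One unbiased draw of X_{a,b}: pick a uniform triple (S, S', T)
    from X_{a,b} and combine the four duel outcomes as in the
    definition. duel(A, B) returns 1 if A wins, else 0."""
    others = [p for p in range(n) if p != a and p != b]
    S = set(rng.sample(others, k - 1))
    rest = [p for p in others if p not in S]
    Sp = set(rng.sample(rest, k - 1))  # disjoint from S (may overlap T)
    T = set(rng.sample(rest, k))       # disjoint from S
    return (duel(S | {a}, Sp | {b}) - duel(S | {b}, Sp | {a})
            + duel(S | {a}, T) - duel(S | {b}, T)) / 4

# Estimate Delta = E[X_{k,k+1}] on a small additive instance
# (players ranked 0 > 1 > ... > 5, k = 2, so the pair is (1, 2)).
values = [32, 16, 8, 4, 2, 1]
duel = lambda A, B: int(sum(values[p] for p in A) > sum(values[p] for p in B))
rng = random.Random(0)
est = sum(sample_X(duel, 6, 2, 1, 2, rng) for _ in range(2000)) / 2000
assert est > 0  # player 1 is deducibly better than player 2
```

Per draw, S is uniform and the counts of feasible (S′, T) do not depend on S, so the triple is indeed uniform over X_{a,b}.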
In the following we show that our gap parameter does not just help us distinguish between the top-k and top-(k + 1) players, but also allows us to distinguish any other player in A*_k from any player in [n] \ A*_k. To this end, we show in Lemma 4.2 that the expectations E[X_{a,b}] satisfy strong stochastic transitivity w.r.t. the ground truth total order on players. We note that most elements (S, S′, T) ∈ X_{a,b} satisfy E[X_{a,c}(π(S), π(S′), π(T))] ≥ E[X_{a,b}(S, S′, T)] (and analogously for X_{b,c}), where π is a permutation exchanging players b and c, but, surprisingly, this is not true in general. By carefully constructing a charging scheme, we manage to show that this holds in expectation over all elements of X_{a,b}, and derive strong stochastic transitivity for the distinguishabilities of players.

Lemma 4.2. For a triplet of players a ≻ b ≻ c it holds that E[X_{a,c}] ≥ max{E[X_{a,b}], E[X_{b,c}]}.

This also yields the following theorem, which paves the way for our reduction in what follows.

Theorem 4.3. For any a, b ∈ [n] with a ∈ A*_k, b ∉ A*_k it holds that E[X_{a,b}] ≥ E[X_{k,k+1}] = Δ. Thus, if Δ > 0 and for a team A it holds that E[X_{a,b}] ≥ Δ for all a ∈ A, b ∈ [n] \ A, then A = A*_k.

The reduction. We now outline the gap-dependent algorithm. The results we have derived in Section 3 allow us to decide, with high probability, whether the distinguishability of a given pair of players is at least Δ, and if so, to determine which is the better player. Intuitively, this is done by performing O(1/Δ²) team duels. We use E[X_{a,b}] as a proxy for the distinguishability between two single players a, b, taking advantage of the fact that if their relation is deducible, then E[X_{a,b}] ≠ 0, and in this case E[X_{a,b}] > 0 iff a ≻ b. Similarly to the dueling bandits setting, identifying A*_k with high probability is possible even though |E[X_{a,b}]| < Δ for some pairs of players. Since we cannot directly sample X_{a,b}, we will instead sample a triple of sets (S, S′, T) uniformly at random from X_{a,b}.
Using (S, S′) (∈ S_{a,b}) and (S, T) (∈ T_{a,b}), we can then perform all the duels required for an unbiased sample of X_{a,b}(S, S′, T), which is by itself a sample of X_{a,b}. Given any dueling teams instance, we define a dueling bandits instance as follows: for every two players a, b ∈ [n], we define the probability that a wins a (singles) duel against b as

P_{a,b} = 1/2 + E[X_{a,b}].    (1)

Clearly, 1 − P_{a,b} = P_{b,a}, P_{a,b} ∈ [0, 1], and P_{a,b} > 1/2 implies a ≻ b. In addition, Theorem 4.1 implies that a is better than b in this dueling bandits instance iff a ⊐ b. So whenever a dueling bandits algorithm asks for a duel query (a, b), we can make an independent sample of X_{a,b} by randomly drawing a triple (S, S′, T) ∈ X_{a,b} and sampling X_{a,b}(S, S′, T) + 1/2. In cases where the realization of X_{a,b}(S, S′, T) + 1/2 is in {1/4, 1/2, 3/4}, we assign a as the duel winner iff flipping a coin with bias X_{a,b}(S, S′, T) + 1/2 yields 1. We formalize this idea in the sub-procedure singlesDuel (in the appendix), which simulates a duel for the classical dueling bandits setting using team duels. Notice that, by Lemma 4.2, the probabilities P_{a,b} defined in (1) satisfy SST with respect to the total order among the players induced by the ground truth order ≻. In addition, the feedback of each single player duel we perform is time-invariant, thus all the non-parametric assumptions for dueling bandits settings apply here. The reduction allows us to identify the top-k players using any dueling bandits algorithm with the same goal whose assumptions are a total order on arms satisfying SST and a gap between the top-k and top-(k + 1) arms. We formalize this below.

Theorem 4.4. Given any dueling teams instance with n and k (namely, probabilities P_{A,B} for all pairs of teams satisfying a strict total order, SST, and consistency), the dueling bandits instance defined by (1) satisfies SST with respect to the ground truth order among players, and for any two players a ≻ b it holds that P_{a,b} ≥ 1/2.
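The simulation step can be sketched as follows. This is a stand-in of ours for the paper's singlesDuel sub-procedure (whose actual pseudocode is in their appendix); the additive toy instance is an assumption:

```python
import random

def singles_duel(duel, n, k, a, b, rng):
    """Simulate one classical dueling-bandits duel between players a
    and b out of four team duels: draw a uniform triple (S, S', T)
    from X_{a,b}, form the realization x + 1/2 in {0, 1/4, 1/2, 3/4, 1},
    and declare a the winner with exactly that probability."""
    others = [p for p in range(n) if p != a and p != b]
    S = set(rng.sample(others, k - 1))
    rest = [p for p in others if p not in S]
    Sp = set(rng.sample(rest, k - 1))
    T = set(rng.sample(rest, k))
    x = (duel(S | {a}, Sp | {b}) - duel(S | {b}, Sp | {a})
         + duel(S | {a}, T) - duel(S | {b}, T)) / 4
    return rng.random() < x + 0.5   # True iff a wins this simulated duel

# On an additive instance, player 1 should beat player 2 more often
# than not, since P_{1,2} = 1/2 + E[X_{1,2}] > 1/2.
values = [32, 16, 8, 4, 2, 1]
duel = lambda A, B: int(sum(values[p] for p in A) > sum(values[p] for p in B))
rng = random.Random(1)
wins = sum(singles_duel(duel, 6, 2, 1, 2, rng) for _ in range(4000))
assert wins > 2000
```

By construction, the probability that the simulated duel returns True is exactly 1/2 + E[X_{a,b}], matching equation (1).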
Moreover, P_{k,k+1} = 1/2 + Δ.

Using the above theorem we can apply any dueling bandits algorithm for top-k identification to solve our problem. Mohajer et al. (2017) provide an algorithm that returns the top-k players with probability exceeding 1 − (log n)^{−c₀}, with sample complexity at most c₁ Δ_{k,k+1}^{−2} (n + k log k) max(log log n, log k) in expectation, where c₀ and c₁ are universal positive constants and Δ_{k,k+1} is the distinguishability between the kth and the (k + 1)th best players (see Algorithm 2 and Theorem 1 in Mohajer et al. (2017)). Ren et al. (2020) show an algorithm that returns the top-k players with probability at least 1 − δ, with sample complexity O(Σ_{i∈[n]} Δ_i^{−2} (log(n/δ) + log log Δ_i^{−1})), where Δ_i = Δ_{i,k+1} if i ≤ k and Δ_i = Δ_{k,i} if i > k, and k, k + 1 denote the top-k and top-(k + 1) players, respectively (see Algorithm 5 and Theorem 8 in Ren et al. (2020))⁵. These algorithms, together with Theorem 4.1, allow us to derive the following theorem.

Theorem 4.5. There exists an algorithm that returns A*_k with probability exceeding 1 − (log n)^{−c₀}, with sample complexity at most c₁ Δ^{−2} (n + k log k) max(log log n, log k) in expectation, where c₀ and c₁ are universal positive constants. In addition, there exists an algorithm that returns A*_k with probability at least 1 − δ, with sample complexity O(Σ_{i∈[n]} Δ_i^{−2} (log(n/δ) + log log Δ_i^{−1})), where Δ_i = E[X_{i,k+1}] if i ≤ k and Δ_i = E[X_{k,i}] if i > k, and i denotes the top-i player; thus Δ_i ≥ Δ for all i ∈ [n].

5 Deterministic Setting

In Section 4 we showed the existence of algorithms that identify the top-k team with a number of duels that depends on Δ. But what if Δ is very small, or even 0? One reason for that can be that all relevant probabilities (e.g., P_{S∪{k}, T}, P_{S∪{k+1}, T}) are very close to 1/2 for all (S, S′, T) ∈ X_{k,k+1}.
Then, a high number of duels to identify A*_k is unavoidable, and giving upper bounds that depend on a gap parameter very much resembles the literature on dueling bandits, where a gap between the top-k and top-(k + 1) players is often a parameter of the sample complexity (e.g., Mohajer et al. (2017)). Another reason for Δ to be small is that the number of witnesses for k ⊐ k + 1 is small. We overcome this issue by designing Δ-independent algorithms, assuming deterministic feedback, i.e., P_{A,B} ∈ {0, 1} for all pairs of teams (A, B). In this setting, (S, T) ∈ T*_{a,b} if and only if S ∪ {a} ≻_obs T ≻_obs S ∪ {b}, and (S, S′) ∈ S*_{a,b} if and only if S ∪ {a} ≻_obs S′ ∪ {b} and S′ ∪ {a} ≻_obs S ∪ {b}. In the appendix (Section C) we show that our results extend to a slightly stochastic environment. The limitation of the set of witnesses makes the task of identifying a Condorcet winning team in the deterministic setting surprisingly hard. For general total orders, a crucial difficulty lies in proving that a given team is indeed Condorcet winning. Nevertheless, we are able to obtain the following result:

Theorem 5.1. For deterministic feedback, there exists an algorithm that performs O(kn log(k) + k² log(k) 2^{5k}) duels and outputs a Condorcet winning team.

For the natural special case of additive total orders we obtain a significantly better upper bound. A total order ≻ is additive if there exist values v(x) for the players x ∈ [n] such that A ≻ B iff Σ_{a∈A} v(a) > Σ_{b∈B} v(b). In Section E of the appendix we give a sufficient and necessary condition for a total order to be additive. For additive total orders we present an algorithm that identifies a Condorcet winning team after polynomially many duels and also outputs a proof.

Theorem 5.2. For deterministic feedback and additive total orders, there exists an algorithm that finds a Condorcet winning team within O(kn log(k) + k⁵) duels.

⁵We remark that Ren et al.
(2020) also assume the stochastic triangle inequality, which we do not; however, it is only used there to derive a lower bound.

Both algorithms rely on the same preprocessing procedure, called ReducePlayers, which reduces the number of players from n to O(k). At the heart of this procedure is a subroutine called Uncover. After describing Uncover and ReducePlayers, we prove Theorem 5.1. Towards proving Theorem 5.2, we introduce two more subroutines, namely NewCut and Compare, which are crucial for identifying and proving a Condorcet winning team within the smaller instance. Finally, the algorithm CondorcetWinning combines all components and proves Theorem 5.2. While all algorithms are formalized in Appendix C, we give sketches thereof, and theorems stating their input and output, below.

The Uncover subroutine. Given two disjoint teams A ≻ B, the Uncover subroutine finds a pair of players a ∈ A and b ∈ B together with a subsets witness for their relation, i.e., an element from S*_{a,b}. To understand the idea of the subroutine, consider some arbitrary ordering of the elements of A and B, respectively, i.e., A = {a₁, . . . , a_k} and B = {b₁, . . . , b_k}. Then, iteratively exchange the elements a₁ and b₁, a₂ and b₂, and so on, resulting in sets A₀ = A, B₀ = B, A₁ = {b₁, a₂, . . . , a_k}, B₁ = {a₁, b₂, . . . , b_k}, A₂ = {b₁, b₂, a₃, . . . , a_k}, and so on. Since A₀ ≻ B₀ but A₀ = B_k ≻ A_k = B₀, there needs to be some earliest point in time i ≤ k for which B_i ≻ A_i is true. This implies a_i ≻ b_i, as ({a₁, . . . , a_{i−1}, b_{i+1}, . . . , b_k}, {b₁, . . . , b_{i−1}, a_{i+1}, . . . , a_k}) is a subsets witness. While the above sketched subroutine is simple, it performs k duels in the worst case. We refine this idea with a binary search approach, decreasing the number of duels to log(k).

Lemma 5.3. Let A and B be two disjoint teams with A ≻ B. After performing O(log(k)) duels, Uncover returns (a, b) with a ∈ A, b ∈ B and (S, S′) ∈ S*_{a,b}, and thus a ≻ b.
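The binary-search refinement can be sketched as follows. This is our reconstruction under deterministic feedback, not the paper's pseudocode; duel(A, B) is assumed to return True iff A beats B. It maintains indices lo < hi with A_lo ≻ B_lo and B_hi ≻ A_hi, so any crossing it finds yields a subsets witness (not necessarily the earliest one, which suffices for the lemma).

```python
def uncover(duel, A, B):
    """Binary-search Uncover sketch: given disjoint teams A > B,
    return (a, b) with a in A, b in B such that a > b is proven by a
    subsets witness, using O(log k) duels."""
    A, B = list(A), list(B)
    k = len(A)
    def hybrid(i):
        # teams after exchanging the first i matched pairs
        return B[:i] + A[i:], A[:i] + B[i:]
    lo, hi = 0, k  # invariant: A_lo beats B_lo, and B_hi beats A_hi
    while hi - lo > 1:
        mid = (lo + hi) // 2
        Ai, Bi = hybrid(mid)
        if duel(Ai, Bi):
            lo = mid
        else:
            hi = mid
    # crossing at i = hi: A_{hi-1} beats B_{hi-1} while B_hi beats A_hi,
    # so swapping the pair (a_hi, b_hi) flipped the outcome.
    return A[hi - 1], B[hi - 1]

# Deterministic additive toy instance, k = 2.
values = {0: 10, 1: 8, 2: 5, 3: 1}
duel = lambda X, Y: sum(values[p] for p in X) > sum(values[p] for p in Y)
a, b = uncover(duel, [0, 1], [2, 3])
assert a in (0, 1) and b in (2, 3) and values[a] > values[b]
```

The invariant endpoints lo = 0 and hi = k need no duels, since A ≻ B is given; only the midpoint tests cost duels.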
We remark that Lemma C.2 in the appendix is a slightly stronger version of the above lemma, which allows us to partition A and B into two subsets each, A = A^(1) ∪ A^(2) and B = B^(1) ∪ B^(2). Under some circumstances, we can then guarantee that Uncover reveals the pairwise comparison between two players a ≻ b, where a is from A^(1) and b is from B^(1).

Reducing the Number of Players to O(k). The fact that we can eliminate some players from [n] and still find (and prove) a Condorcet winning team is due to the following observation.

Observation 5.4. Let R ⊆ [n] be such that A*_{2k} ⊆ R. Let Â ⊆ R be a team such that Â ≻ A for all teams A ⊆ R \ Â. Then, Â is a Condorcet winning team.

The procedure Reduce Players reduces the set of players [n] to some subset R ⊆ [n], guaranteeing that A*_{2k} ⊆ R and |R| < 6k. The algorithm maintains a dominance graph D = (V, E) on the set of players. More precisely, the nodes of D are the players, i.e., V = [n], and there exists an arc from node a to node b if the algorithm has proven that a ≻ b. The set V_{<2k} is the subset of the players having an indegree smaller than 2k in D. The high-level idea of the algorithm is the following: It starts with the empty dominance graph D = ([n], ∅). Then, it iteratively identifies pairwise relations of the players with the help of Uncover and adds the respective arcs to the graph. By adding more and more arcs to D, the set of nodes V_{<2k} shrinks, while A*_{2k} ⊆ V_{<2k} is always guaranteed. At some point, the algorithm cannot identify any more pairwise relations and returns V_{<2k}. How does the algorithm identify pairwise relations? At any point it tries to find a matching between 2k players, say {(a_1, b_1), . . . , (a_k, b_k)}, with the constraint that, for all i ∈ [k], neither of the arcs (a_i, b_i) or (b_i, a_i) is present in the graph D yet, and then applies Uncover on the sets A = {a_1, . . . , a_k} and B = {b_1, . . . , b_k}. In the proof of Lemma 5.5 we show that, as long as |V_{<2k}| ≥ 6k − 1, the algorithm can find such a matching.
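A highly simplified Python skeleton of this dominance-graph loop is sketched below. This is our illustration, not the paper's routine: `uncover_pair` stands in for one call to Uncover on the teams built from the matched pairs, and the greedy matching is a simplification that may fail where the counting argument behind Lemma 5.5 guarantees a cleverer search succeeds.

```python
from itertools import combinations

def greedy_unknown_matching(cand, arcs, k):
    """Greedily pick k disjoint pairs with no arc yet in either direction."""
    used, pairs = set(), []
    for a, b in combinations(cand, 2):
        if a in used or b in used or (a, b) in arcs or (b, a) in arcs:
            continue
        pairs.append((a, b))
        used.update((a, b))
        if len(pairs) == k:
            return pairs
    return None

def reduce_players_skeleton(players, k, uncover_pair):
    """Maintain arcs (a, b) meaning 'a was proven better than b'.  Since only
    true relations are ever recorded, each of the top 2k players keeps an
    indegree below 2k, so the returned candidate set always contains them."""
    arcs = set()
    while True:
        indeg = {p: 0 for p in players}
        for _, b in arcs:
            indeg[b] += 1
        cand = [p for p in players if indeg[p] < 2 * k]  # the set V_{<2k}
        matching = greedy_unknown_matching(cand, arcs, k)
        if matching is None:
            return cand  # no further matching found: stop and return V_{<2k}
        arcs.add(uncover_pair(matching))  # one new arc per (simulated) Uncover
```

For instance, with players 0..11 valued by their index and k = 2, a simulated `uncover_pair` that correctly orients one matched pair leaves the top four players {8, 9, 10, 11} inside the returned set.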
With the help of Lemma 5.5, we then prove Theorem 5.1.

Lemma 5.5. Given the set of players [n], Reduce Players returns R ⊆ [n] with |R| ≤ 6k − 2 and A*_{2k} ⊆ R. Reduce Players performs O(nk log(k)) duels and runs in time O(n^2 k^2).

Proof sketch (of Theorem 5.1). Let D be the dominance graph at the end of Reduce Players. The learner selects a k-sized subset of V_{<2k}, call it Â, with the property that there is no arc from any node in V_{<2k} \ Â towards some node in Â. Then, the learner tests Â against all possible teams consisting of players from V_{<2k} \ Â, of which there are O(2^{5k}) many. If Â wins all of these duels, then Â is a Condorcet winning team by Observation 5.4. However, if there exists A ≻ Â, then, by the choice of Â, there does not exist any arc from A towards Â. Hence, by calling the subroutine Uncover for two arbitrary orderings of A = {a_1, . . . , a_k} and Â = {â_1, . . . , â_k}, the learner will identify one additional arc. This procedure can be repeated at most O(k^2) times, which shows Theorem 5.1.

Figure 1: Illustration of the proof technique of algorithm Condorcet Winning1. In the left illustration, the solid black line indicates that all players to its left were proven to be better than all players to its right. The dashed line marked with k indicates that the sets to its left contain k players in total. However, this line does not indicate proven relations, i.e., players from X are not necessarily better than players from Y. The right figure illustrates the proof for X ∪ U_1 ∪ U_2 being Condorcet winning.

Subroutines New Cut and Compare. The New Cut subroutine takes as input a subset of the players R ⊆ [n], a pair a, b ∈ R, and a witness proving that a ≻ b, i.e., (S, T) ∈ S_{a,b} ∪ T_{a,b}. That means, T can be of size either k − 1 or k, and S and T are not required to be subsets of R. The subroutine outputs a partition of R into two non-empty sets U and L with U ≻ L, which is short-hand notation for u ≻ ℓ for any u ∈ U and ℓ ∈ L.
The subroutine starts by initiating the set U = {a} and redefining R := R \ {a, b}. At all times, U contains only players u for which the algorithm has found a witness for u ≻ b. These witnesses are stored in a list W, and it is checked whether they can be modified to become witnesses for x ≻ b for any other element x ∈ R. This modification is done by applying permutations on the set of subsets of the players, similarly as in the proofs of Theorem 3.2 and Lemma 4.2. If the algorithm finds a witness for x ≻ b, then x is added to U and removed from R. The new witness is also stored in W. This process ends when either R is empty or all witnesses in W have been checked. At this point it holds that U ≻ R ∪ {b}, and the algorithm returns (U, L := R ∪ {b}).

Lemma 5.6. Let R ⊆ [n], a, b ∈ R and (S, T) ∈ S_{a,b} ∪ T_{a,b}. Then, New Cut(R, (a, b), (S, T)) returns a partition of R into U and L such that U ≻ L, a ∈ U and b ∈ L. The number of duels performed by New Cut and its running time can be bounded by O(|R|^2).

From now on we assume additive total orders. The Compare subroutine is crucial for obtaining upper bounds on differences between values of player subsets, where we write v(S) for Σ_{x∈S} v(x). It is used in the following situation. Let (a, b) be a pair of players and (S, S′) ∈ S_{a,b} be a witness for a ≻ b. Then, it can easily be shown that v(a) − v(b) > |v(S) − v(S′)|: the two duels forming the witness give v(a) − v(b) > v(S′) − v(S) and v(a) − v(b) > v(S) − v(S′), and hence the bound on the absolute value. We are interested in the question whether a similar relation holds for two subsets of S and S′, namely, C ⊆ S and D ⊆ S′ of equal size. The Compare subroutine checks whether such a relation holds by performing two additional duels. If it returns True, then v(a) − v(b) > |v(C) − v(D)|. Otherwise, a pair c ∈ C and d ∈ D together with a witness for their relation can be found by one call to the Uncover subroutine. This observation is formalized below.

Lemma 5.7. Let a ≻ b be two players, (S, S′) ∈ S_{a,b} and C ⊆ S, D ⊆ S′ with |C| = |D|. If Compare((a, b), (S, S′), (C, D)) returns True, then v(a) − v(b) > |v(C) − v(D)|.
Otherwise, one call to Uncover returns c ∈ C and d ∈ D together with a witness for their relation.

Algorithm Condorcet Winning. The algorithm maintains a partition of the players into a weak ordering, i.e., T = {T_1, . . . , T_ℓ} with T_1 ≻ T_2 ≻ · · · ≻ T_ℓ. We introduce the short-hand notation T_{≤j} = ∪_{m∈[j]} T_m […] > |Z| ϵ_2 and v(W_1 ∪ W_2 ∪ Y) − v(B*) > |Z| ϵ_2, where B* is the best response⁷ towards U_1 ∪ U_2 ∪ X. See Figure 1 for an illustration of the argument. It remains to sketch how the algorithm defines ϵ_1, ϵ_2 and proves (i)–(iii). For simplicity, assume U_1 ≻ U_2. The algorithm then attempts the following steps: (1) Find a witness for players ũ ∈ U_2 and w̃ ∈ W_2, using Uncover. (2) Use Compare to prove that |v(X) − v(Y)| < v(ũ) − v(w̃) and |v(a) − v(b)| < v(ũ) − v(w̃) hold for all players a ∈ W_1 ∪ W_2 ∪ Y and b ∈ Z. (3) Repeat step (2), replacing w̃ with any player of W_1. If one of the steps (1)–(3) fails, we show that the partition T can be refined. Otherwise, we show that (i)–(iii) hold for ϵ_1 = v(ũ) − v(w*_1) and ϵ_2 = v(ũ) − v(w*_2), where w*_1 and w*_2 are the best and second best players from W_1 ∪ {w̃}, respectively. The following lemma concludes the proof sketch of Theorem 5.2.

Lemma 5.8. For every instance with O(k) players, after performing O(k^5) many duels, Condorcet Winning1 has identified a Condorcet winning team. Condorcet Winning2 identifies a Condorcet winning team after O(k^2 log(k)) duels.

6 Extensions and Discussion

Below, we discuss several implications of our results as well as directions for future work.

Regret bound. In this paper we provided algorithms to identify, with high probability, a Condorcet winning team. Another common performance metric for the dueling bandits setting is (e.g., weak) regret. We define a regret in our setting w.r.t. A*_k. This is reasonable since SST implies that P_{A*_k,B} ≥ P_{A,B} for every Condorcet winning team A and any team B. Then, for time horizon T we denote by (A_t, B_t) the selected duel at time t and define the regret to be R_T = Σ_{t=1}^{T} min{P_{A*_k,A_t} − 1/2, P_{A*_k,B_t} − 1/2}.
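As a small illustration of this definition (a Python sketch of ours; the preference matrix `P` and the team labels are hypothetical), the cumulative weak regret of a duel sequence can be computed as:

```python
def weak_regret(P, best, duels):
    """Weak regret w.r.t. the best team `best`: each duel (A, B) costs
    min(P[best][A] - 1/2, P[best][B] - 1/2), so a duel that includes `best`
    itself costs nothing (P[best][best] = 1/2 by convention)."""
    return sum(min(P[best][A] - 0.5, P[best][B] - 0.5) for A, B in duels)

# Hypothetical 3-team example in which T1 is the best team:
P = {'T1': {'T1': 0.5, 'T2': 0.8, 'T3': 0.9}}
# weak_regret(P, 'T1', [('T1', 'T2'), ('T2', 'T3')]): the first duel involves
# T1 and is free; the second contributes P['T1']['T2'] - 1/2.
```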
This definition is based on the weak regret for dueling bandits, as defined in Yue et al. (2012). Within Appendix D, we derive a regret bound of R_T = O(n ∆^{−2} (log(T) + log log ∆^{−1})).

Checking Condorcet winners beyond additive total orders. The question of how many duels are necessary to prove or disprove that a given team is Condorcet winning remains open for total orders that are not additive, even if n ∈ O(k). Nevertheless, our algorithm from Theorem 5.1 shows that a polynomial upper bound on this number would imply the existence of an algorithm with a polynomial number of duels. More formally: Let q be the number of duels required to check whether a given team is a Condorcet winning team within an instance with O(k) players. Then, there exists an algorithm that identifies a Condorcet winning team within O(kn log(k) + k^2 log(k) q) duels.

Lower bounds. For both settings, we can show a lower bound of n/(2k) duels in order to identify a Condorcet winning team. We refer to Appendix D for a formal statement and a proof.

⁷We call B* a best response towards U_1 ∪ U_2 ∪ X if B* contains the best k players from [n] \ (U_1 ∪ U_2 ∪ X).

Acknowledgments

This project has received funding from the European Research Council (ERC) under the European Union's Horizon 2020 research and innovation program (grant agreement No. 882396), the Israel Science Foundation (grant number 993/17), the Yandex Initiative for Machine Learning at Tel Aviv University, the Deutsche Forschungsgemeinschaft under grant BR 4744/2-1, and the Ariane de Rothschild Women Doctoral Program. This paper is dedicated to Hunter, a dear friend who passed away May 31, 2021.

References

Awerbuch, B., & Kleinberg, R. (2008). Online linear optimization and adaptive routing. Journal of Computer and System Sciences, 74(1), 97–114.

Balcan, M., Vitercik, E., & White, C. (2016). Learning combinatorial functions from pairwise comparisons. In Proceedings of the 29th Conference on Learning Theory (COLT).

Breton, M. L., & Sen, A. (1999).
Separable preferences, strategyproofness, and decomposability. Econometrica, 67(3), 605–628.

Brost, B., Seldin, Y., Cox, I. J., & Lioma, C. (2016). Multi-dueling bandits and their application to online ranker evaluation. In Proceedings of the 25th ACM International Conference on Information and Knowledge Management (CIKM), (pp. 2161–2166).

Bubeck, S., Munos, R., & Stoltz, G. (2011). Pure exploration in finitely-armed and continuous-armed bandits. Theoretical Computer Science, 412(19), 1832–1852.

Bubeck, S., Wang, T., & Viswanathan, N. (2013). Multiple identifications in multi-armed bandits. In Proceedings of the 30th International Conference on Machine Learning (ICML), (pp. 258–265). PMLR.

Busa-Fekete, R., Hüllermeier, E., & Mesaoudi-Paul, A. E. (2018). Preference-based online learning with dueling bandits: A survey. Tech. rep., arXiv:1807.11398.

Cesa-Bianchi, N., & Lugosi, G. (2012). Combinatorial bandits. Journal of Computer and System Sciences, 78(5), 1404–1422.

Chen, L., Li, J., & Qiao, M. (2017). Towards instance optimal bounds for best arm identification. In Proceedings of the 30th Conference on Learning Theory (COLT), (pp. 535–592). PMLR.

Chen, S., Lin, T., King, I., Lyu, M. R., & Chen, W. (2014). Combinatorial pure exploration of multi-armed bandits. In Advances in Neural Information Processing Systems (NIPS). Curran Associates, Inc.

Even-Dar, E., Mannor, S., & Mansour, Y. (2006). Action elimination and stopping conditions for the multi-armed bandit and reinforcement learning problems. Journal of Machine Learning Research, 7(39), 1079–1105.

Grossman, H. D. (1945). The twelve-coin problem. Scripta Mathematica, 11, 360–361.

Guy, R. K., & Nowakowski, R. J. (1995). Coin-weighing problems. The American Mathematical Monthly, 102(2), 164–167.

Kalyanakrishnan, S., & Stone, P. (2010). Efficient selection of multiple bandit arms: Theory and practice. In Proceedings of the 27th International Conference on Machine Learning (ICML), (pp. 511–518). PMLR.
Kaufmann, E., Cappé, O., & Garivier, A. (2016). On the complexity of best-arm identification in multi-armed bandit models. Journal of Machine Learning Research, 17(1), 1–42.

Lattimore, T., & Szepesvári, C. (2020). Bandit Algorithms. Cambridge University Press.

Mohajer, S., Suh, C., & Elmahdy, A. (2017). Active learning for top-k rank aggregation from noisy comparisons. In Proceedings of the 34th International Conference on Machine Learning (ICML), (pp. 2488–2497).

Pelc, A. (2002). Searching games with errors – fifty years of coping with liars. Theoretical Computer Science, 270(1-2), 71–109.

Rejwan, I., & Mansour, Y. (2020). Top-k combinatorial bandits with full-bandit feedback. In Proceedings of the 31st International Conference on Algorithmic Learning Theory (ALT), (pp. 752–776). PMLR.

Ren, W., Liu, J., & Shroff, N. (2020). The sample complexity of best-k items selection from pairwise comparisons. In Proceedings of the 37th International Conference on Machine Learning, (pp. 8051–8072). PMLR.

Ren, W., Liu, J., & Shroff, N. B. (2018). PAC ranking from pairwise and listwise queries: Lower bounds and upper bounds. Tech. rep., arxiv.org/abs/1806.02970.

Saha, A., & Gopalan, A. (2018). Battle of bandits. In Proceedings of the 34th Conference on Uncertainty in Artificial Intelligence (UAI), (pp. 805–814). AUAI Press.

Slivkins, A. (2019). Introduction to multi-armed bandits. Foundations and Trends in Machine Learning, 12(1-2), 1–286.

Sui, Y., Zhuang, V., Burdick, J., & Yue, Y. (2017). Multi-dueling bandits with dependent arms. In Proceedings of the 33rd Conference on Uncertainty in Artificial Intelligence (UAI).

Yue, Y., Broder, J., Kleinberg, R., & Joachims, T. (2012). The k-armed dueling bandits problem. Journal of Computer and System Sciences, 78(5), 1538–1556.

Zhou, Y., Chen, X., & Li, J. (2014). Optimal PAC multiple arm identification with applications to crowdsourcing. In Proceedings of the 31st International Conference on Machine Learning (ICML), (pp. 217–225).
PMLR.