# Dueling Bandits with Qualitative Feedback

The Thirty-Third AAAI Conference on Artificial Intelligence (AAAI-19)

Liyuan Xu,1,2 Junya Honda,1,2 Masashi Sugiyama1,2
1The University of Tokyo, 2RIKEN
liyuan@ms.k.u-tokyo.ac.jp, honda@stat.t.u-tokyo.ac.jp, sugi@k.u-tokyo.ac.jp

We formulate and study a novel multi-armed bandit problem called the qualitative dueling bandit (QDB) problem, where an agent observes not numeric but qualitative feedback by pulling each arm. We employ the same regret as the dueling bandit (DB) problem, where the duel is carried out by comparing the qualitative feedback. Although we can naively use classic DB algorithms to solve the QDB problem, this reduction significantly worsens the performance; indeed, in the QDB problem, the probability that one arm wins the duel over another arm can be directly estimated without carrying out actual duels. In this paper,1 we propose such direct algorithms for the QDB problem. Our theoretical analysis shows that the proposed algorithms significantly outperform DB algorithms by incorporating the qualitative feedback, and experimental results also demonstrate vast improvement over the existing DB algorithms.

1 The longer version including all appendices is available at https://arxiv.org/abs/1809.05274

## 1 Introduction

The stochastic multi-armed bandit (MAB) problem is a sequential decision-making problem, where an agent repeatedly chooses one option from $K$ alternatives (often called arms). At each round, the agent receives a random reward that depends on the selected arm, and the goal is to maximize the cumulative reward. This problem has been extensively studied for many years, from both theoretical and practical aspects. Numerous algorithms have been proposed for the problem (Thompson 1933; Auer 2003) and applied to various fields including the design of clinical trials (Villar, Bowden, and Wason 2015), economics (Rothschild 1974), and crowdsourcing (Zhou, Chen, and Li 2014).

The dueling bandit (DB) problem (Yue et al. 2012) is a variant of the MAB problem, where an agent only observes the result of a duel, a noisy comparison between the two selected arms. While the MAB problem assumes that the feedback is numeric, the DB problem only assumes that the arms are comparable based on the feedback. Therefore, it is useful when numeric feedback is not available, such as in information retrieval and clinical trials.

Even when numeric feedback is not available, we may still have direct access to qualitative feedback. For example, in information retrieval, users might report the relevance of a document returned by a system on a scale of Irrelevant ≺ Partially Relevant ≺ Relevant. In such a situation, we can consider a special kind of the DB problem first introduced by Busa-Fekete et al. (2013), which we call the qualitative DB (QDB) problem. In the QDB problem, an agent pulls one arm at each round and observes qualitative feedback. Although a duel is not conducted explicitly in the QDB problem, we consider algorithms that minimize the same regret as the DB problem, in which the probability of an arm winning a duel with another arm corresponds to the probability of the arm receiving higher qualitative feedback than the other.
Therefore, we can adapt any algorithm for the DB problem to the QDB problem by converting the feedback of every two rounds into the result of one duel. However, this reduction significantly worsens the performance because, in the QDB problem, the winning probability can be calculated from the estimated feedback distributions. Busa-Fekete et al. (2013) also partially considered this problem, and they improved the performance of the classic DB algorithms by constructing a tight confidence bound. However, they still use the same exploration strategy as the classic DB algorithms. In this paper, we show that we can further improve the performance by designing an exploration strategy specific to the QDB problem.

Several definitions of the best arm have been proposed for the DB problem. In this paper, we consider two types of winners, the Condorcet winner and the Borda winner, both of which are defined in Section 3, and we propose algorithms for each winner. The proposed algorithms are inspired by algorithms for the MAB problem, namely Thompson sampling (Thompson 1933) and the upper confidence bound (UCB) algorithm (Auer 2003). Interestingly, the algorithm based on Thompson sampling, one of the most popular algorithms for the MAB problem, only works for the criterion of the Condorcet winner and suffers polynomial regret in a specific instance under the criterion of the Borda winner.

The rest of the paper is structured as follows. After discussing related work in Section 2, we formulate the QDB problem in detail in Section 3. We introduce two formulations of the QDB problem and propose algorithms for these problems in Sections 4 and 5. Lastly, we show empirical results for an information retrieval setting in Section 6.

## 2 Related Work

There are two lines of research related to the QDB problem. The first is the DB problem (Yue et al. 2012), which is the MAB problem with feedback given in the form of a noisy comparison between two arms. Much research has been conducted on this problem, and some of it discusses specific comparison models. For example, Hofmann, Whiteson, and de Rijke (2011) discussed the case where a duel is carried out by interleaved comparison with some user model, and Yue et al. (2012) introduced the Bradley-Terry model. Among them, several models involve random variables corresponding to the utilities associated with arms, and the result of a duel is determined by the order of such variables. For example, a Gaussian model (Yue et al. 2012) is the case where the random variables follow a Gaussian distribution, and Busa-Fekete et al. (2013) considered the case where the random variables are on a partially ordered set, i.e., the QDB problem.

In the DB problem, the definition of the best arm is no longer straightforward because cyclic preferences may exist. Although early work on the DB assumes a total order on the arms to ensure the existence of the maximal element, recent work has mainly sought to design algorithms for finding the Condorcet winner (Urvoy et al. 2013), which is the arm that wins over all the other arms with probability larger than or equal to 1/2. This definition can be regarded as a natural generalization of the maximal element, since the Condorcet winner coincides with the maximal element when a total order exists. A number of algorithms have been proposed for finding the Condorcet winner, for example by Urvoy et al. (2013), Komiyama et al. (2015), and Wu and Liu (2016). A drawback of this formulation is that the Condorcet winner does not always exist.
In such cases, we may introduce other notions of winners, such as the Borda winner (Urvoy et al. 2013) and the Copeland set (Zoghi et al. 2015). Ramamohan, Rajkumar, and Agarwal (2016) introduced numerous notions of winners other than the Condorcet winner.

The other line of related work is the qualitative multi-armed bandit (QMAB) problem (Szorenyi et al. 2015), in which an agent also receives qualitative feedback according to the chosen arm. The difference between the QDB problem and the QMAB problem is that the QDB problem handles winners defined in the classic DB problem, while the QMAB problem introduces its own definition of a winner, namely the arm with the highest $\tau$-quantile of the feedback distribution for $\tau \in (0, 1)$. This definition is, however, sometimes problematic since it ignores differences in the feedback distribution below the $\tau$-quantile. Consider the case where we have two types of medicines, A and B, and want to figure out which has fewer side effects. To this end, we may perform clinical trials and obtain feedback from patients about the severity of side effects.

Table 1: An instance that requires a careful choice of $\tau$ in the QMAB problem.

|            | No side effect | Moderate | Severe |
|------------|----------------|----------|--------|
| Medicine A | 0.995          | 0.003    | 0.002  |
| Medicine B | 0.995          | 0.002    | 0.003  |

Assume that the feedback is reported on the scale of No side effect ≻ Moderate ≻ Severe and that the true probabilities of each feedback are as shown in Table 1. Then, we can clearly conclude that medicine A is preferable since it has a lower probability of causing a severe side effect, and in fact, medicine A becomes the winner in the formulation of the QDB problem. However, the QMAB problem regards these two medicines as equally good unless $\tau \leq 0.005$, since their $\tau$-quantile feedback is otherwise the same. Nevertheless, setting $\tau \leq 0.005$ is almost impossible in practice since we do not have access to the true probabilities beforehand. On the other hand, the definitions of winners considered in the QDB problem are well studied in the context of voting theory (see Charon and Hudry (2010) for a survey), and they do not have any hyper-parameter that defines the problem itself. This makes our algorithms more applicable to real-world problems.

## 3 Problem Formulation

We formulate the QDB problem in this section. As in the MAB problem, we consider $K$ arms associated with feedback distributions $\nu_1, \dots, \nu_K$, and at each round $t$, the agent chooses one arm $a_t \in [K] = \{1, \dots, K\}$ and receives feedback $r_t$ sampled from the distribution $\nu_{a_t}$. While the MAB problem assumes $\{\nu_i\}$ to be distributions on real values, the QDB problem considers qualitative feedback, which corresponds to the case where $\{\nu_i\}$ are distributions on a totally ordered set $(\mathcal{L}, \preceq)$, where $\mathcal{L}$ is the set of possible feedback and $\preceq$ denotes a total order between feedback. For simplicity, we assume that $\mathcal{L} = [L]$ and that the total order $\preceq$ corresponds to the order relation $\leq$, which means $1 \preceq 2 \preceq \cdots \preceq L$. Thus, the distributions $\{\nu_i\}_{i=1}^{K}$ are all categorical with support $[L]$. Note that although the feedback $r_t$ is encoded by integers for notational simplicity, the sum of the feedback has no meaning in the QDB setting.

The QDB problem aims to minimize the same regret as the classic DB problem, which is defined based on pairwise comparison. Following early work (Busa-Fekete et al. 2013), we characterize $\mu_{i,j}$, the probability of arm $i$ winning over arm $j$, as

$$\mu_{i,j} = P[X_i > X_j] + \tfrac{1}{2} P[X_i = X_j],$$

where $X_i$ and $X_j$ are mutually independent random variables following the distributions $\nu_i$ and $\nu_j$, respectively.
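To make the pairwise comparison concrete, the following is a minimal sketch (not from the paper) that evaluates $\mu_{i,j}$ for two categorical feedback distributions given as probability vectors over the ordered levels, and checks the medicine example of Table 1. The encoding of the levels (index 0 = Severe, index 2 = No side effect) is our own choice for illustration.

```python
import numpy as np

def win_prob(p, q):
    """mu_{i,j} = P[X_i > X_j] + 0.5 * P[X_i = X_j] for categorical p, q over ordered levels."""
    cum_q = np.concatenate(([0.0], np.cumsum(q)[:-1]))  # P[X_j < k] for each level k
    return float(np.sum(p * (cum_q + 0.5 * q)))

# Table 1, encoded so that a larger index is a better outcome:
# index 0 = Severe, index 1 = Moderate, index 2 = No side effect.
medicine_a = np.array([0.002, 0.003, 0.995])
medicine_b = np.array([0.003, 0.002, 0.995])
print(win_prob(medicine_a, medicine_b))  # slightly above 0.5, so A wins the duel against B
```

Because the two distributions differ only in their small tail probabilities, the winning probability exceeds 1/2 only marginally, which is exactly the kind of difference that the $\tau$-quantile criterion of the QMAB problem can miss.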
We consider two types of winners in this paper. The first one is the Condorcet winner, which is the arm that wins against all the other arms with probability larger than or equal to 1/2. Formally, arm $i^*$ is the Condorcet winner if $\mu_{i^*, j} \geq 1/2$ for all $j \neq i^*$. We denote the Condorcet winner by $a^{\mathrm{CW}}$, and the goal of the QDB problem when employing the Condorcet winner is to minimize the following regret:

$$R^{\mathrm{CW}}_T = \sum_{t=1}^{T} \Delta^{\mathrm{CW}}_{a_t}, \quad \text{where } \Delta^{\mathrm{CW}}_i = \mu_{a^{\mathrm{CW}}, i} - \tfrac{1}{2}.$$

The second winner is the Borda winner, which is the arm with the largest Borda score, the average of the winning probabilities against the other arms. Formally, the Borda score $B_i$ of arm $i$ is defined as

$$B_i = \frac{1}{K-1} \sum_{j \neq i} \mu_{i,j},$$

and thus the Borda winner $a^{\mathrm{BW}}$ is defined as $a^{\mathrm{BW}} = \arg\max_{i \in [K]} B_i$. The regret to minimize in this case is formulated as

$$R^{\mathrm{BW}}_T = \sum_{t=1}^{T} \Delta^{\mathrm{BW}}_{a_t}, \quad \text{where } \Delta^{\mathrm{BW}}_i = B_{a^{\mathrm{BW}}} - B_i.$$

The QDB problem can be solved by any algorithm for the classic DB problem since the same regret is used in both. Algorithms for the DB problem specify two arms $(i, j)$ to compare at each round and receive the result of a noisy comparison generated from $\mathrm{Ber}(\mu_{i,j})$, where $\mathrm{Ber}(p)$ is the Bernoulli distribution with success probability $p$. This comparison can be simulated in the QDB problem as follows: we observe $X_i$ and $X_j$ by pulling both arms and return whether $X_i > X_j$ or $X_i < X_j$ occurred, with ties broken at random. However, in the QDB problem, we can directly estimate $\mu_{i,j}$ from the feedback distribution of each arm, which significantly enhances exploration. Since $\{\nu_i\}$ are all categorical distributions on $[L]$, we have another representation of $\mu_{i,j}$ given by

$$\mu_{i,j} = \sum_{k=1}^{L} P^{(i)}_k \left( \sum_{l=1}^{k-1} P^{(j)}_l + \tfrac{1}{2} P^{(j)}_k \right),$$

where $P^{(i)}_k = P[X_i = k]$. Let $\mathcal{P}_L$ be the probability simplex $\mathcal{P}_L = \{x \in [0,1]^L \mid \sum_{i=1}^{L} x_i = 1\}$, and define the function $\mu : \mathcal{P}_L \times \mathcal{P}_L \to [0,1]$ as

$$\mu(P, Q) = \sum_{k=1}^{L} P_k \left( \sum_{l=1}^{k-1} Q_l + \tfrac{1}{2} Q_k \right).$$

Hence, $\mu(P^{(i)}, P^{(j)}) = \mu_{i,j}$ for $P^{(i)} = (P^{(i)}_1, \dots, P^{(i)}_L)^\top$.

## 4 Qualitative Dueling Bandit with the Condorcet Winner

In this section, we propose an algorithm for the QDB problem with the Condorcet winner. The algorithm is called Thompson Condorcet sampling, and it is based on Thompson sampling (Thompson 1933), an algorithm famous for its good performance in the standard MAB problem and its wide applicability to many other problems.

**Algorithm 1: Thompson Condorcet sampling**

1. Set $C^{(i)} = \mathbf{0}$ for all $i \in [K]$.
2. Pull all arms $t_0$ times and update $C^{(i)}$.
3. For each $t = Kt_0, Kt_0 + 1, \dots, T$:
4. For each arm $i$, sample $\theta^{(i)} \sim \mathrm{Dir}(C^{(i)}_1 + 1, \dots, C^{(i)}_L + 1)$.
5. If there exists $i$ such that $\mu(\theta^{(i)}, \theta^{(j)}) \geq \tfrac{1}{2}$ for all $j \in [K]$, then:
6. Pull arm $a_t = i$ and observe feedback $r_t$.
7. Set $C^{(a_t)}_{r_t} \leftarrow C^{(a_t)}_{r_t} + 1$.
8. Otherwise (there is no Condorcet winner in the sampled model):
9. Go to Line 4, i.e., sample $\{\theta^{(i)}\}_{i=1}^{K}$ again.

This algorithm maintains Bayesian posterior distributions of the $P^{(i)}$ defined in Section 3. We employ the Dirichlet distribution $\mathrm{Dir}(\alpha_1, \dots, \alpha_L)$ as the prior distribution, the probability density function of which is

$$f(\theta; \alpha_1, \dots, \alpha_L) = \frac{\Gamma\!\left(\sum_{i=1}^{L} \alpha_i\right)}{\prod_{i=1}^{L} \Gamma(\alpha_i)} \prod_{i=1}^{L} \theta_i^{\alpha_i - 1},$$

where $\Gamma(x)$ is the gamma function. A Dirichlet prior is a convenient choice when observations are sampled from a categorical distribution. Let $C^{(i)}(t) = (C^{(i)}_1(t), \dots, C^{(i)}_L(t))$ be the vector of observation counts up to the $t$-th round, where $C^{(i)}_k(t) \in \{0, 1, \dots\}$ is the number of times that feedback $k \in [L]$ has been observed when arm $i \in [K]$ was pulled. If we employ the prior $\mathrm{Dir}(1, \dots, 1)$, then the posterior distribution given the observations $C^{(i)}(t)$ is $\mathrm{Dir}(C^{(i)}_1(t) + 1, \dots, C^{(i)}_L(t) + 1)$. For notational simplicity, we sometimes write $C^{(i)}$ for $C^{(i)}(t)$ when the round $t$ is obvious from the context. The entire algorithm is shown in Algorithm 1.
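As a complement to the pseudocode, here is a compact sketch of Thompson Condorcet sampling in Python. It is only an illustration under our own assumptions: `pull(i)` is a hypothetical environment callback returning a feedback level in `{0, ..., L-1}`, the resampling loop is capped for safety (Algorithm 1 itself resamples until the sampled model has a Condorcet winner), and `win_prob` is the same helper as in the Section 3 sketch.

```python
import numpy as np

def win_prob(p, q):
    # same helper as in the Section 3 sketch
    cum_q = np.concatenate(([0.0], np.cumsum(q)[:-1]))
    return float(np.sum(p * (cum_q + 0.5 * q)))

def thompson_condorcet(pull, K, L, T, t0=10, max_resamples=1000, seed=0):
    """Sketch of Algorithm 1. `pull(i)` returns qualitative feedback in {0, ..., L-1}."""
    rng = np.random.default_rng(seed)
    C = np.zeros((K, L))                       # C[i, k]: times feedback k was observed on arm i
    for i in range(K):                         # initial phase: pull every arm t0 times
        for _ in range(t0):
            C[i, pull(i)] += 1
    for t in range(K * t0, T):
        winner = None
        for _ in range(max_resamples):         # resample until the sampled model has a Condorcet winner
            theta = [rng.dirichlet(C[i] + 1.0) for i in range(K)]
            for i in range(K):
                if all(win_prob(theta[i], theta[j]) >= 0.5 for j in range(K) if j != i):
                    winner = i
                    break
            if winner is not None:
                break
        a_t = winner if winner is not None else int(rng.integers(K))  # fallback, not part of Algorithm 1
        C[a_t, pull(a_t)] += 1                 # update the Dirichlet posterior of the pulled arm
```

The only nontrivial step is the resampling loop: exploration comes entirely from the posterior randomness, so no separate candidate set or elimination phase is needed.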
At each round $t$, the algorithm samples $\theta^{(i)}$ from the posterior distribution of $P^{(i)}$ for each arm. Then, the agent tries to pull the Condorcet winner under the assumption that $P^{(i)} = \theta^{(i)}$. If the Condorcet winner does not exist in the sampled model, the algorithm resamples $(\theta^{(1)}, \dots, \theta^{(K)})$ until it exists.

Let $\underline{P}^{(i)}$ be

$$\underline{P}^{(i)} = \arg\min_{P' \in \mathcal{P}_L} \mathrm{KL}(P^{(i)} \,\|\, P') \quad \text{s.t. } \mu(P', P^{(a^{\mathrm{CW}})}) \geq \tfrac{1}{2},$$

for the Kullback-Leibler (KL) divergence $\mathrm{KL}(x \| y) = \sum_{i=1}^{L} x_i \log \frac{x_i}{y_i}$. Then, the regret of Thompson Condorcet sampling is bounded as follows.

**Theorem 1.** If the Condorcet winner exists, then there exists $t_0 > 0$ such that the regret of Thompson Condorcet sampling is bounded by

$$\mathbb{E}[R^{\mathrm{CW}}_T] \leq \sum_{i \neq a^{\mathrm{CW}}} \frac{(1 + \varepsilon)\, \Delta^{\mathrm{CW}}_i}{\mathrm{KL}(P^{(i)} \,\|\, \underline{P}^{(i)})} \log T + O\!\left((\log \log T)^2\right) + O(1) \quad (2)$$

for any sufficiently small $\varepsilon > 0$.

The proof is given in the longer version, where the detailed condition on $t_0$ and the precise form of the bound are also provided. From the precise form of (2), which can be found in the longer version, one can see that this regret bound grows exponentially with the number of arms $K$. However, this is not an inherent limitation of Thompson Condorcet sampling but an artifact of pursuing the optimal asymptotic dependence of $O(\log T)$. As we will show in Section 6, this exponential increase in the regret does not occur in practice, and the algorithm works well for relatively large $K$.

The regret bound has a similar form to the information-theoretic lower bound for MAB problems with multi-parameter models (Burnetas and Katehakis 1996). Note that considering the distributions $\underline{P}^{(i)}$ is essential in these cases, whereas they are replaced with the distribution of the optimal arm in the regret bound of Thompson sampling for the MAB problem with the Bernoulli model given by Agrawal and Goyal (2013). For example, when $P^{(a^{\mathrm{CW}})} = (\varepsilon, 1 - 2\varepsilon, \varepsilon)^\top$ and $P^{(i)} = (0.5, 0.1, 0.4)^\top$, we have $\mathrm{KL}(P^{(i)} \| \underline{P}^{(i)}) / \mathrm{KL}(P^{(i)} \| P^{(a^{\mathrm{CW}})}) \to 0$ as $\varepsilon \to 0$.

Theorem 1 suggests that Thompson Condorcet sampling can perform drastically better than applying classic DB algorithms to the QDB problem in the way discussed in Section 3. The regret lower bound of such direct applications immediately follows from the lower bound for the classic DB problem given by Komiyama et al. (2015).

**Proposition 1** (Adapted from Komiyama et al., 2015). When we apply any consistent algorithm for the DB problem to the QDB problem, we have

$$\liminf_{T \to \infty} \frac{\mathbb{E}[R^{\mathrm{CW}}_T]}{\log T} \geq \sum_{i \neq a^{\mathrm{CW}}} \min_{j : \mu_{i,j} < \frac{1}{2}} \frac{\Delta^{\mathrm{CW}}_i + \Delta^{\mathrm{CW}}_j}{d(\mu_{i,j}, \tfrac{1}{2})}, \quad (3)$$

where $d(x, y) = x \log \frac{x}{y} + (1 - x) \log \frac{1 - x}{1 - y}$.

From the upper bound given in Theorem 1, we have

$$\limsup_{T \to \infty} \frac{\mathbb{E}[R^{\mathrm{CW}}_T]}{\log T} \leq (1 + \varepsilon) \sum_{i \neq a^{\mathrm{CW}}} \frac{\Delta^{\mathrm{CW}}_i}{\mathrm{KL}(P^{(i)} \,\|\, \underline{P}^{(i)})},$$

which can be arbitrarily smaller than (3), as stated in the next lemma.

**Lemma 1.** Assume that $a^{\mathrm{CW}} \neq 1$. For any fixed $0 < \varepsilon < 1/(4 - 4\log 2)$, there exist $P^{(a^{\mathrm{CW}})}, P^{(1)} \in \mathcal{P}_2$ such that

$$\frac{d\!\left(\mu(P^{(a^{\mathrm{CW}})}, P^{(1)}), 1/2\right)}{\mathrm{KL}(P^{(1)} \,\|\, \underline{P}^{(1)})} \leq \varepsilon. \quad (4)$$

The proof can be found in the longer version. From Lemma 1, we can say that there exists a case where Thompson Condorcet sampling performs arbitrarily better than the direct application of any DB algorithm. This implies that the algorithm successfully incorporates the qualitative information to reduce the regret compared with the DB.
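To get a numerical feel for why the constant in Theorem 1 can be far smaller than a single-parameter analysis would suggest, the following sketch evaluates the ratio $\mathrm{KL}(P^{(i)} \| \underline{P}^{(i)}) / \mathrm{KL}(P^{(i)} \| P^{(a^{\mathrm{CW}})})$ for the example above, $P^{(a^{\mathrm{CW}})} = (\varepsilon, 1-2\varepsilon, \varepsilon)^\top$ and $P^{(i)} = (0.5, 0.1, 0.4)^\top$. It is only an illustration under our own choices: the constrained projection defining $\underline{P}^{(i)}$ is computed with a generic SLSQP solver from SciPy rather than any routine from the paper, and the lower bound on the coordinates is a numerical safeguard.

```python
import numpy as np
from scipy.optimize import minimize

def win_prob(p, q):
    cum_q = np.concatenate(([0.0], np.cumsum(q)[:-1]))
    return float(np.sum(p * (cum_q + 0.5 * q)))

def kl(p, q):
    mask = p > 0
    return float(np.sum(p[mask] * np.log(p[mask] / q[mask])))

def kl_to_halfspace(p, p_cw):
    """min_{P'} KL(p || P') s.t. mu(P', p_cw) >= 1/2 and P' in the probability simplex."""
    L = len(p)
    cons = [{"type": "eq", "fun": lambda q: np.sum(q) - 1.0},
            {"type": "ineq", "fun": lambda q: win_prob(q, p_cw) - 0.5}]
    res = minimize(lambda q: kl(p, q), x0=np.full(L, 1.0 / L), method="SLSQP",
                   bounds=[(1e-9, 1.0)] * L, constraints=cons)
    return res.fun

p_i = np.array([0.5, 0.1, 0.4])                     # suboptimal arm from the example
for eps in [1e-1, 1e-2, 1e-3, 1e-4]:
    p_cw = np.array([eps, 1.0 - 2.0 * eps, eps])    # near-deterministic Condorcet winner
    print(eps, kl_to_halfspace(p_i, p_cw) / kl(p_i, p_cw))  # ratio shrinks as eps -> 0
```

The numerator is the constant that actually appears in Theorem 1, while the denominator is what one would obtain by replacing $\underline{P}^{(i)}$ with the winner's own distribution, as in the Bernoulli-model bound of Agrawal and Goyal (2013).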
## 5 Qualitative Dueling Bandit with the Borda Winner

In this section, we study two algorithms for the QDB problem with the Borda winner: one based on Thompson sampling, called Thompson Borda sampling, and the other based on the UCB algorithm (Auer 2003), called Borda-UCB. In spite of the success of Thompson Condorcet sampling, our theoretical analysis reveals that Thompson Borda sampling can have polynomial regret in some settings. On the other hand, Borda-UCB achieves logarithmic regret, which matches the regret lower bound of the classic DB problem.

**Algorithm 2: Thompson Borda sampling**

1. Set $C^{(i)} = \mathbf{0}$ for all $i \in [K]$.
2. Pull all arms $t_0$ times and update $C^{(i)}$.
3. For each $t = 1, 2, \dots$:
4. For each arm $i$, sample $\theta^{(i)} \sim \mathrm{Dir}(C^{(i)}_1 + 1, \dots, C^{(i)}_L + 1)$.
5. Compute $B_i \leftarrow \frac{1}{K-1} \sum_{j \neq i} \mu(\theta^{(i)}, \theta^{(j)})$ for each arm $i$.
6. Pull arm $a_t = \arg\max_{i \in [K]} B_i$.
7. Observe $r_t$ and set $C^{(a_t)}_{r_t} \leftarrow C^{(a_t)}_{r_t} + 1$.

Thompson Borda sampling, given in Algorithm 2, is similar to Thompson Condorcet sampling. The only difference is that Thompson Borda sampling pulls the Borda winner of the samples $(\theta^{(1)}, \dots, \theta^{(K)})$. Since the Borda winner always exists for any samples $(\theta^{(1)}, \dots, \theta^{(K)})$, we do not need resampling. Although it works surprisingly well empirically, as we will see in Section 6, we prove that it suffers polynomial regret in the worst case.

**Theorem 2.** Assume that there are $K = 3$ arms such that arm 1 is the Borda winner. Then, there exist $P^{(1)}, P^{(2)}, P^{(3)} \in \mathcal{P}_L$ such that, under Thompson Borda sampling with $\theta^{(1)} = P^{(1)}$, $\theta^{(2)} = P^{(2)}$, and $\theta^{(3)} \sim \mathrm{Dir}(C^{(3)}_1 + 1, \dots, C^{(3)}_L + 1)$,

$$\liminf_{T \to \infty} \frac{\mathbb{E}[R^{\mathrm{BW}}_T]}{T^{\xi}} \geq \eta$$

holds for some constants $\xi, \eta > 0$.

The proof can be found in the longer version. The situation considered in Theorem 2 may be somewhat unrealistic since we assume that $P^{(1)}$ and $P^{(2)}$ are known beforehand. However, in Section 6 we show by an experiment that Thompson Borda sampling actually suffers polynomial regret even without such an assumption.

The other proposed algorithm, Borda-UCB, is based on the UCB algorithm (Auer 2003) and is shown in Algorithm 3. As in the original UCB algorithm, we consider the upper confidence bound $\hat{B}_i + \beta_i$ for each arm $i \in [K]$, where $\hat{B}_i$ is the estimated Borda score and $\beta_i$ is the width of the confidence interval controlled by a positive parameter $\alpha$. Let $i^{\mathrm{UCB}}$ be the arm with the largest upper confidence bound. While the original UCB algorithm always pulls the arm with the largest upper confidence bound, Borda-UCB pulls all arms that do not belong to $I^{\mathrm{Count}}$, the set of arms that have been pulled the most, whenever $i^{\mathrm{UCB}}$ does not belong to $I^{\mathrm{Count}}$. This exploration strategy reflects the fact that we have to estimate all feedback distributions accurately in order to obtain a precise estimate of the Borda score.

**Algorithm 3: Borda-UCB**

1. Set $C^{(i)} = \mathbf{0}$ for all $i \in [K]$ and $N_i = 0$.
2. Pull all arms $\tau$ times and obtain initial estimates.
3. While $t \leq T$:
4. Set $\hat{P}^{(i)} \leftarrow C^{(i)} / N_i$ for each arm $i \in [K]$.
5. Set $\hat{B}_i \leftarrow \frac{1}{K-1} \sum_{k \in [K] \setminus \{i\}} \mu(\hat{P}^{(i)}, \hat{P}^{(k)})$.
6. Compute the confidence width $\gamma_i$ of each arm $i \in [K]$, controlled by the parameter $\alpha$.
7. Set $\beta_i \leftarrow \gamma_i + \frac{1}{K-1} \sum_{k \in [K] \setminus \{i\}} \gamma_k$.
8. Set $i^{\mathrm{UCB}} \leftarrow \arg\max_{i \in [K]} \hat{B}_i + \beta_i$.
9. Set $I^{\mathrm{Count}} \leftarrow \{i \in [K] \mid N_i = \max_{j \in [K]} N_j\}$.
10. If $i^{\mathrm{UCB}} \in I^{\mathrm{Count}}$, then:
11. Pull arm $a_t = i^{\mathrm{UCB}}$ and observe feedback $r_t$.
12. Set $N_{a_t} \leftarrow N_{a_t} + 1$ and $C^{(a_t)}_{r_t} \leftarrow C^{(a_t)}_{r_t} + 1$.
13. Otherwise:
14. Pull all arms in $[K] \setminus I^{\mathrm{Count}}$.
15. Update $N_i$ and $C^{(i)}_k$.
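For completeness, a sketch of Borda-UCB is given below. Two hedges apply: the exact confidence width $\gamma_i$ used in Algorithm 3 is not reproduced here, so the sketch plugs in a standard UCB-style width $\gamma_i = \sqrt{\alpha \log t / N_i}$ as a placeholder assumption that may differ from the paper's choice, and `pull(i)` is again a hypothetical environment callback returning a feedback level in `{0, ..., L-1}`.

```python
import numpy as np

def win_prob(p, q):
    # same helper as in the Section 3 sketch
    cum_q = np.concatenate(([0.0], np.cumsum(q)[:-1]))
    return float(np.sum(p * (cum_q + 0.5 * q)))

def borda_ucb(pull, K, L, T, alpha=2.5, tau=10):
    """Sketch of Algorithm 3. `pull(i)` returns qualitative feedback in {0, ..., L-1}."""
    C = np.zeros((K, L))
    N = np.zeros(K)
    t = 0

    def observe(i):
        nonlocal t
        C[i, pull(i)] += 1
        N[i] += 1
        t += 1

    for i in range(K):                                     # initial phase: pull every arm tau times
        for _ in range(tau):
            observe(i)
    while t < T:
        P_hat = C / N[:, None]
        B_hat = np.array([np.mean([win_prob(P_hat[i], P_hat[k])
                                   for k in range(K) if k != i]) for i in range(K)])
        gamma = np.sqrt(alpha * np.log(t) / N)             # placeholder confidence width (our assumption)
        beta = gamma + (gamma.sum() - gamma) / (K - 1)     # beta_i = gamma_i + average width of the others
        i_ucb = int(np.argmax(B_hat + beta))
        i_count = np.flatnonzero(N == N.max())             # arms pulled the most so far
        if i_ucb in i_count:
            observe(i_ucb)                                 # exploit the UCB-leading arm
        else:
            for i in range(K):                             # otherwise explore all under-pulled arms
                if i not in i_count and t < T:
                    observe(i)
```

Pulling every under-sampled arm whenever the UCB leader is not among the most-pulled arms keeps all estimates $\hat{P}^{(i)}$ accurate, which is necessary because the Borda score of one arm depends on every other arm's distribution.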
The regret of Borda-UCB is bounded as follows.

**Theorem 3.** Assume that $\alpha$ is set in $[2,\, 3(1 + 3\varepsilon')^2]$ for an arbitrarily taken $\varepsilon' > 0$. Then, for any $\varepsilon > 0$, the regret of Borda-UCB is bounded as

$$\mathbb{E}[R^{\mathrm{BW}}_T] \leq \Delta^{\mathrm{BW}}_{\mathrm{all}} \frac{4\alpha}{(\Delta^{\mathrm{BW}}_{\min} - 2\varepsilon)^2} \log T + C_{\varepsilon} + C'_{\varepsilon'}$$

for some constants $C_{\varepsilon} = O(1/\varepsilon^2)$ and $C'_{\varepsilon'} = O(1/(\varepsilon')^2)$, where $\Delta^{\mathrm{BW}}_{\mathrm{all}} = \sum_{i \neq a^{\mathrm{BW}}} \Delta^{\mathrm{BW}}_i$ and $\Delta^{\mathrm{BW}}_{\min} = \min_{i \neq a^{\mathrm{BW}}} \Delta^{\mathrm{BW}}_i$.

The proof is presented in the longer version, where the explicit forms of $C_{\varepsilon}$ and $C'_{\varepsilon'}$ are also provided. The regret bound in Theorem 3 simplifies to $O(K \Delta^{-2} \log T)$ when $\Delta^{\mathrm{BW}}_i = \Delta$ for all $i \neq a^{\mathrm{BW}}$, while the regret of the original UCB algorithm is $O(K \Delta^{-1} \log T)$ (Auer 2003), which is smaller by a factor of $O(1/\Delta)$. However, this difference is inevitable, as proved in the following theorem.

**Theorem 4.** Consider two instances of the QDB problem with $K = 3$, in which the feedback distributions of the arms are represented as $\Gamma = (P^{(1)}_{\Gamma}, P^{(2)}_{\Gamma}, P^{(3)}_{\Gamma})$ and $\Theta = (P^{(1)}_{\Theta}, P^{(2)}_{\Theta}, P^{(3)}_{\Theta})$. Let $R^{\Gamma}_T$ and $R^{\Theta}_T$ be the regret in each instance. Then, there exists a pair of instances $(\Gamma, \Theta)$ such that every algorithm achieving $\mathbb{E}[R^{\Gamma}_T] = o(T^a)$ for all constants $a > 0$ satisfies

$$\liminf_{T \to \infty} \frac{\mathbb{E}[R^{\Theta}_T]}{\log T} = \Omega\!\left(\frac{1}{(\Delta^{\mathrm{BW}}_{\min})^2}\right),$$

where $\Delta^{\mathrm{BW}}_{\min} = \min_{i \neq a^{\mathrm{BW}}} \Delta^{\mathrm{BW}}_i$ is defined on $\Theta$.

The proof is presented in the longer version. This theorem states that if an algorithm achieves sub-polynomial regret for all instances of the QDB problem with the Borda winner, there exists a case where it suffers $\Omega((\Delta^{\mathrm{BW}}_{\min})^{-2} \log T)$ regret. Therefore, we can conclude that the difference in the regret upper bound between the original UCB and Borda-UCB comes from a characteristic of the QDB problem.

The upper bound in Theorem 3 matches the regret lower bound in the classic DB problem, which was derived in the context of the $\delta$-PAC DB problem (Jamieson et al. 2015). An algorithm is called $\delta$-PAC if it finds the Borda winner with failure probability less than $\delta$. The minimum number of samples required by such $\delta$-PAC algorithms is bounded as follows.

**Proposition 2** (Theorem 1; Jamieson et al., 2015). Let $\tau$ be the total number of pulls. If $K \geq 4$ and $3/8 \leq \mu_{i,j} \leq 5/8$ for all $i, j \in [K]$, then any $\delta$-PAC DB algorithm with $\delta \leq 0.15$ satisfies

$$\mathbb{E}[\tau] = \Omega\!\left(\log\frac{1}{\delta} \sum_{i \neq a^{\mathrm{BW}}} \frac{1}{(\Delta^{\mathrm{BW}}_i)^2}\right).$$

Existing algorithms for the Borda winner (Busa-Fekete et al. 2013; Jamieson et al. 2015) use a $\delta$-PAC DB algorithm as a sub-routine. They first run such an algorithm with $\delta = 1/T$ and then pull the estimated Borda winner in the remaining rounds. Therefore, the regret of such algorithms is at least $\Omega\!\left((\log T) \sum_{i \neq a^{\mathrm{BW}}} (\Delta^{\mathrm{BW}}_i)^{-2}\right)$ by Proposition 2, and hence the regret upper bound of Borda-UCB is no worse than this lower bound. Although we were not able to prove that the regret of Borda-UCB is smaller than that of the direct application of classic DB algorithms, Borda-UCB performs better than them empirically, as we will see in Section 6. Furthermore, Borda-UCB has another advantage in that it does not require $T$ to be specified. Since existing algorithms run a $(1/T)$-PAC algorithm, they require the number of rounds $T$ to be known beforehand. However, it is often difficult to guess $T$ beforehand, and thus our algorithms are more useful in practice.

## 6 Experiments

We test the empirical performance of the proposed algorithms through experiments based on both a synthetic setting and real-world data. We first conduct experiments based on a real-world web search dataset that was also used in previous work. In these experiments, our methods significantly outperform the direct application of the existing algorithms for the classic DB. Then, we show the results of an experiment in a synthetic setting in which Thompson Borda sampling has polynomial regret.

Figure 1: The regret of Thompson Condorcet sampling and other classic DB algorithms.

### Experiments on a Real-World Dataset

We apply the proposed methods to the problem of ranker evaluation from the field of information retrieval, which was used for evaluating algorithms for the classic DB problem in Jamieson et al. (2015). The task is to identify the best ranker, which takes a user's search query as input and ranks the documents according to their relevance to that query.

We used two web search datasets.
The first is the MSLR-WEB10K dataset (Qin et al. 2010), which consists of 10,000 search queries over documents from search results. The data also contain the values of 136 features and a corresponding user-labeled relevance factor on a scale of one to five for each query-document pair. The other is the MQ2008 dataset (Qin and Liu 2013), which contains 46 features and a relevance factor labeled from one to three for each query-document pair. As in Jamieson et al. (2015), we only consider rankers that use one feature to rank documents. Therefore, the aim of the task is to determine which feature is the most capable of predicting the relevance of query-document pairs.

Although Jamieson et al. (2015) set up the classic DB problem from these datasets, we can naturally formulate the QDB problem as well since we have access to the relevance factors. The qualitative feedback is generated in the following way. At each round, the algorithm selects one ranker, which ranks the documents for a randomly chosen query. The relevance factor of the top-ranked document is then revealed to the algorithm as the qualitative feedback. Therefore, we have $L = 5$ in the MSLR-WEB10K dataset and $L = 3$ in the MQ2008 dataset.

We compare the regret of the proposed algorithms to the direct application of the classic DB algorithms, which corresponds to the experiments conducted in Jamieson et al. (2015). We repeat 100 runs for each instance and report the mean of the regret.

### Experiments for Condorcet Winner

We first show the experimental results for the QDB problem with the Condorcet winner. We compare Thompson Condorcet sampling with RUCB (Zoghi et al. 2014) and RMED1, RMED2, and RMED2F (Komiyama et al. 2015), which are all promising algorithms proposed for the classic DB problem with the Condorcet winner. We set $t_0 = 10$, and Figure 1 shows the experimental result when the number of rankers is $K = 5$.

Figure 2: The regret of Thompson Condorcet sampling and other DB algorithms when there are a relatively large number of arms ($K = 15$).

Figure 3: The regret of Thompson Borda sampling and Borda-UCB compared with other classic DB algorithms.

Figure 1 shows the superiority of Thompson Condorcet sampling. Furthermore, we can observe that all existing algorithms incur large regret in early rounds while Thompson Condorcet sampling does not. This is because most algorithms for the DB problem construct a set of candidates for the Condorcet winner and explore it in the first part of the rounds, whereas Thompson Condorcet sampling conducts exploration and exploitation at the same time and does not require such a set. In this sense, Thompson Condorcet sampling performs more stably than the existing methods.

To see how the performance of Thompson Condorcet sampling depends on the number of arms, we also tried a setting with a relatively large number of arms. The result is shown in Figure 2, in which Thompson Condorcet sampling still performs the best among the classic DB algorithms even though the regret upper bound proved in Theorem 1 grows exponentially with $K$. This result supports the argument that the exponential dependency on $K$ is just an artifact of pursuing the best regret bound in the asymptotic case, and that Thompson Condorcet sampling empirically performs much better than the theoretical analysis suggests.

### Experiments for Borda Winner

For the Borda setting, we compare our proposed methods, Thompson Borda sampling and Borda-UCB, with the existing classic DB algorithm SSSE (Busa-Fekete et al. 2013).
Furthermore, we also conduct a comparison with an extension of SSSE, which we call QSSSE, proposed by Busa-Fekete et al. (2013) to utilize the qualitative feedback explicitly. The result is shown in Figure 3, which shows the superiority of the proposed methods. As in the Condorcet case, SSSE and QSSSE suffer from large regret in the early stage, while the regret always increases logarithmically for the proposed algorithms. This is because the existing methods first only explore, while the proposed methods always balance exploration and exploitation. Although the existing methods achieve zero regret after the exploration, this does not mean that they perform better than Borda-UCB for large $T$, since they require a longer exploration phase. Surprisingly, Thompson Borda sampling works quite well in this setting, even though Theorem 2 states that it has polynomial regret in the worst case. We suspect that it is rare to encounter such a worst case in practice, but the condition for sub-polynomial regret is unknown and left for future work.

Figure 4: The regret of the proposed algorithms in the instance where Thompson Borda sampling suffers polynomial regret.

### Experiments on a Synthetic Setting

Theorem 2 proves that Thompson Borda sampling can incur polynomial regret for some instances, which we confirm through experiments in the following. We set up an instance with $K = 3$ and $L = 4$, in which the feedback distributions are $P^{(1)} = (0.0, 0.0, 1.0, 0.0)^\top$, $P^{(2)} = (0.0, 0.5, 0.0, 0.5)^\top$, and $P^{(3)} = (0.2, 0.4, 0.3, 0.1)^\top$. Here, the Borda winner is $a^{\mathrm{BW}} = 1$. We repeat running Thompson Borda sampling and Borda-UCB on this instance 10 times, and the mean of the regret is shown in Figure 4.

From Figure 4, we can clearly see that Thompson Borda sampling suffers from polynomial regret, while Borda-UCB still has sub-polynomial regret. However, it takes many rounds for Borda-UCB to achieve less regret than Thompson Borda sampling. This is because Thompson Borda sampling explores less than necessary: in early rounds, Borda-UCB pulls arm 3 many times, which is necessary for identifying the Borda winner but incurs large regret, whereas Thompson Borda sampling exploits arms 1 and 2 more, which leads to its superior performance in early rounds.

## 7 Conclusions

In this paper, we formulated and studied a novel type of dueling bandit, called the qualitative dueling bandit. In this problem, an agent receives qualitative feedback at each round and aims to minimize the same regret as the classic DB when the duel is carried out based on that feedback. We considered two notions of winners, the Condorcet winner and the Borda winner. For the Condorcet winner, we proposed an algorithm called Thompson Condorcet sampling, and we showed that its regret can be arbitrarily smaller than that of the direct application of algorithms for the classic DB. Thompson Condorcet sampling also exhibited superior performance in the experiments based on the real-world web search datasets. For the Borda winner, we studied two algorithms, Thompson Borda sampling and Borda-UCB. Although the theoretical analysis reveals that Thompson Borda sampling can have polynomial regret in some instances, the experiments showed that it performs surprisingly well empirically, especially when the number of rounds is not very large. On the other hand, we proved a logarithmic regret upper bound for Borda-UCB, which is no worse than the regret lower bound in the classic DB.
As future work, it is important to derive general algorithms that can handle various notions of winners, as in Ramamohan, Rajkumar, and Agarwal (2016). Another promising direction is to improve the algorithms for the Borda winner and achieve regret significantly smaller than the classic DB, as Thompson Condorcet sampling does in the Condorcet winner case.

## Acknowledgements

LX utilized the facility provided by the Masason Foundation. JH acknowledges support by KAKENHI 18K17998, and MS acknowledges support by KAKENHI 17H00757.

## References

Agrawal, S., and Goyal, N. 2013. Further optimal regret bounds for Thompson sampling. In Proceedings of the 16th International Conference on Artificial Intelligence and Statistics, 99–107.

Auer, P. 2003. Using confidence bounds for exploitation-exploration trade-offs. Journal of Machine Learning Research 3:397–422.

Burnetas, A. N., and Katehakis, M. N. 1996. Optimal adaptive policies for sequential allocation problems. Advances in Applied Mathematics 17(2):122–142.

Busa-Fekete, R.; Szörényi, B.; Weng, P.; Cheng, W.; and Hüllermeier, E. 2013. Top-k selection based on adaptive sampling of noisy preferences. In Proceedings of the 30th International Conference on Machine Learning, 1094–1102.

Charon, I., and Hudry, O. 2010. An updated survey on the linear ordering problem for weighted or unweighted tournaments. Annals of Operations Research 175(1):107–158.

Hofmann, K.; Whiteson, S.; and de Rijke, M. 2011. A probabilistic method for inferring preferences from clicks. In Proceedings of the 20th International Conference on Information and Knowledge Management, 249–258.

Jamieson, K.; Katariya, S.; Deshpande, A.; and Nowak, R. 2015. Sparse dueling bandits. In Proceedings of the 18th International Conference on Artificial Intelligence and Statistics, 416–424.

Komiyama, J.; Honda, J.; Kashima, H.; and Nakagawa, H. 2015. Regret lower bound and optimal algorithm in dueling bandit problem. In Proceedings of the 28th Conference on Learning Theory, 1141–1154.

Qin, T., and Liu, T. 2013. Introducing LETOR 4.0 datasets. arXiv:1306.2597.

Qin, T.; Liu, T.-Y.; Xu, J.; and Li, H. 2010. LETOR: A benchmark collection for research on learning to rank for information retrieval. Information Retrieval 13(4):346–374.

Ramamohan, S. Y.; Rajkumar, A.; and Agarwal, S. 2016. Dueling bandits: Beyond Condorcet winners to general tournament solutions. In Advances in Neural Information Processing Systems 29, 1253–1261.

Rothschild, M. 1974. A two-armed bandit theory of market pricing. Journal of Economic Theory 9:185–202.

Szorenyi, B.; Busa-Fekete, R.; Weng, P.; and Hüllermeier, E. 2015. Qualitative multi-armed bandits: A quantile-based approach. In Proceedings of the 32nd International Conference on Machine Learning, 1660–1668.

Thompson, W. R. 1933. On the likelihood that one unknown probability exceeds another in view of the evidence of two samples. Biometrika 25(3-4):285–294.

Urvoy, T.; Clerot, F.; Féraud, R.; and Naamane, S. 2013. Generic exploration and K-armed voting bandits. In Proceedings of the 30th International Conference on Machine Learning, 1191–1199.

Villar, S. S.; Bowden, J.; and Wason, J. 2015. Multi-armed bandit models for the optimal design of clinical trials: Benefits and challenges. Statistical Science 30:199–215.

Wu, H., and Liu, X. 2016. Double Thompson sampling for dueling bandits. In Advances in Neural Information Processing Systems 30, 649–657.

Yue, Y.; Broder, J.; Kleinberg, R.; and Joachims, T. 2012. The K-armed dueling bandits problem. Journal of Computer and System Sciences 78(5):1538–1556.
Zhou, Y.; Chen, X.; and Li, J. 2014. Optimal PAC multiple arm identification with applications to crowdsourcing. In Proceedings of the 31st International Conference on Machine Learning, 217–225.

Zoghi, M.; Whiteson, S.; Munos, R.; and de Rijke, M. 2014. Relative upper confidence bound for the K-armed dueling bandit problem. In Proceedings of the 31st International Conference on Machine Learning, 10–18.

Zoghi, M.; Karnin, Z.; Whiteson, S.; and de Rijke, M. 2015. Copeland dueling bandits. In Advances in Neural Information Processing Systems 28, 307–315.