# Qualitative Multi-Armed Bandits: A Quantile-Based Approach

Balázs Szörényi¹,⁵,⁶ (SZORENYI@INF.U-SZEGED.HU), Róbert Busa-Fekete² (BUSAROBI@UPB.DE), Paul Weng³,⁴ (PAWENG@CMU.EDU), Eyke Hüllermeier² (EYKE@UPB.DE)

¹INRIA Lille - Nord Europe, SequeL project, 40 avenue Halley, 59650 Villeneuve d'Ascq, France
²Department of Computer Science, University of Paderborn, Warburger Str. 100, 33098 Paderborn, Germany
³SYSU-CMU Joint Institute of Engineering, 132 East Waihuan Road, Guangzhou, 510006, China
⁴SYSU-CMU Shunde International Joint Research Institute, 9 Eastern Nanguo Road, Shunde, 528300, China
⁵MTA-SZTE Research Group on Artificial Intelligence, Tisza Lajos krt. 103., H-6720 Szeged, Hungary
⁶Department of Electrical Engineering, The Technion - Israel Institute of Technology, Haifa 32000, Israel

Proceedings of the 32nd International Conference on Machine Learning, Lille, France, 2015. JMLR: W&CP volume 37. Copyright 2015 by the author(s).

## Abstract

We formalize and study the multi-armed bandit (MAB) problem in a generalized stochastic setting, in which rewards are not assumed to be numerical. Instead, rewards are measured on a qualitative scale that allows for comparison but invalidates arithmetic operations such as averaging. Correspondingly, instead of characterizing an arm in terms of the mean of the underlying distribution, we opt for using a quantile of that distribution as a representative value. We address the problem of quantile-based online learning both for the case of a finite (pure exploration) and an infinite time horizon (cumulative regret minimization). For both cases, we propose suitable algorithms and analyze their properties. These properties are also illustrated by means of first experimental studies.

## 1. Introduction

The multi-armed bandit (MAB) problem (or simply bandit problem) refers to an iterative decision making problem in which an agent repeatedly chooses among K options, metaphorically corresponding to pulling one of K arms of a bandit machine. In each round, the agent receives a random reward that depends on the arm being selected. The agent's goal is to optimize an evaluation metric, e.g., the error rate (expected percentage of playing a suboptimal arm) or the cumulative regret (difference between the sum of rewards obtained and the (expected) rewards that could have been obtained by selecting the best arm in each round).

In the stochastic multi-armed bandit setup, the distributions can vary with the arms but do not change with time. To achieve the desired goal, the agent has to tackle the classical exploration/exploitation dilemma: It has to properly balance the pulling of arms that were found to yield high rewards in earlier rounds and the selection of arms that have not yet been tested often enough (Auer et al., 2002; Cesa-Bianchi & Lugosi, 2006; Lai & Robbins, 1985).

MAB algorithms have not only been studied quite thoroughly from a theoretical point of view but have also been used in many real applications, such as medical treatment allocation design (Kuleshov & Precup, 2014), feature selection (Gaudel & Sebag, 2010; Busa-Fekete & Kégl, 2010) and crowdsourced labeling (Zhou et al., 2014). In many practical applications, however, numerical rewards are not provided in a natural way. Consider the example of clinical trials for testing pain medication. Here, the patients are asked to value their pain on a scale such as "no pain", "mild", "moderate", "severe", which is of qualitative nature.
Computing averages on ordinal scales of that kind is clearly invalid and may lead to disputable conclusions. In this paper, we therefore propose a setting in which the arm distributions are defined over a complete totally ordered set (L, ⪯); the corresponding online learning framework will be introduced formally in Section 3, after reviewing related work in Section 2. The quality of an arm is expressed in terms of a τ-quantile of the arm's distribution over L. Thus, arms are compared in terms of their τ-quantiles, and an arm is considered to be τ-optimal if its τ-quantile coincides with the highest τ-quantile of all arms.

We consider two quantile-based learning frameworks that we refer to as the finite and the infinite horizon cases, respectively. The finite horizon case (Section 4) is formalized in the PAC framework: the goal of the learner is to find a τ-optimal arm with probability at least 1 − δ. As opposed to this, the infinite horizon case (Section 5) is formalized as a regret minimization problem, in which the regret depends on τ and the quantile functions of the arms. The difficulty of both setups stems from the fact that, when the probability of getting qualitative rewards lower than or equal to the optimal τ-quantile $x^*$ is close or equal to τ for all τ-optimal arms, it is hard (or impossible) to guess $x^*$, which is essential for deciding whether an arm is optimal or not.

## 2. Related Work

Pure exploration algorithms for the stochastic bandit problem sample the arms a certain number of times (not necessarily known in advance) and then output a recommendation, such as the best arm or the m best arms (Bubeck et al., 2009; Even-Dar et al., 2002; Bubeck et al., 2013; Gabillon et al., 2011; Cappé et al., 2013; Kalyanakrishnan et al., 2012). Since our quantile-based learning task in the finite horizon case is formulated in a PAC setting, it can be viewed as a pure exploration strategy, too. Yet, we do not assume that absolute numerical feedback can be generated for individual arms; instead, our feedback is of qualitative nature. Therefore, since averaging rewards is no longer meaningful, the preference relation over the arms is defined based on τ-quantiles instead of mean values of the underlying reward distributions.

Yu & Nikolova (2013) introduce a pure exploration setting where, instead of the means, the goodness value or payoff of the arms is defined based on some notion of risk, such as the value-at-risk (Schachter, 1997), a famous risk measure used in finance, which is a particular case of quantiles. Their setup is similar to the best arm identification problem (Bubeck et al., 2009), where the goal of the learner is to control the so-called simple regret, which is the difference between the payoff of the best arm and the expected payoff obtained by its recommendation. The algorithm proposed by Yu & Nikolova (2013) is based on their result concerning the concentration property of the estimators of various risk measures; these properties are preconditioned on the assumption that the density functions of the arms are continuously differentiable and their derivatives are bounded from above. The proposed algorithm is computationally demanding, since it solves a non-linear, constrained and integer-valued optimization task in each round; moreover, their results regarding the performance of the algorithm assume that the densities are bounded away from zero everywhere.
In addition to these limitations on the reward distributions, our learning setup also differs from theirs in that we assume a PAC setting with finite horizon, where the goal is to find a τ-optimal arm with high probability. Thus, since the error of the learner is controlled, the algorithms are evaluated in terms of their sample complexity (the number of samples taken prior to termination).

In the infinite horizon case, the most commonly optimized property is the regret with respect to the maximum mean reward (Bubeck & Cesa-Bianchi, 2012). Nevertheless, alternative targets have already been considered, too, which led to interesting formal tasks. In a recent study, Carpentier & Valko (2014) formulate the regret in terms of the extreme value of the arm distributions. The goal here is to optimize the maximal regret observed. To this end, the learner intends to identify the most "abnormal" arm, namely the one having the heaviest tail distribution, since the rewards received on a heavy-tailed arm are likely to deviate the most from its mean with highest probability. The learner is evaluated in terms of the so-called extreme regret, which is the most extreme value found, compared to the most extreme value possible. The authors devise an algorithm, called EXTREMEHUNTER, based on the optimism in the face of uncertainty principle, which can achieve logarithmic expected regret in this setting.

Sani et al. (2012) consider a MAB setting with a regret notion based on the principle of risk-aversion, where the risk is defined based on mean-variance risk. More specifically, there is a trade-off parameter that controls the influence of the mean and variance of the arm distributions. Thus, pulling an arm with high mean might result in a high regret if the variance of the rewards is high. The worst case regret of the proposed algorithm is $O(KT^{2/3})$, and it is not clear whether it can be improved. As it is known that the worst case regret for the standard mean-based regret is $O(\sqrt{KT})$, which is achieved by the MOSS algorithm of Audibert & Bubeck (2010), the optimization of regret based on risk aversion is conjectured to be a more complex problem.

In the preference-based bandit setup (Busa-Fekete & Hüllermeier, 2014), also known as dueling bandits (Yue et al., 2012), feedback about arms is not received in terms of absolute numerical rewards either. Instead, the learner is only allowed to compare the arms in a pairwise manner and receives binary feedback informing about the winner of the comparison. From this point of view, the feedback about the arms is even weaker than in our qualitative setting. Moreover, the notion of optimality of an arm can be defined in various ways in the preference-based setup. For example, a commonly used notion is that of a Condorcet winner, for which the probability of winning in a pairwise comparison is larger than 1/2 against each other arm (Yue & Joachims, 2011; Zoghi et al., 2014).

## 3. Qualitative Multi-Armed Bandits

Formally, a standard value-based multi-armed or K-armed bandit problem is specified by real-valued random variables $X_1, \ldots, X_K$ associated, respectively, with K arms (that we simply identify by the numbers $1, \ldots, K$). In each time step t, the online learner selects one or more arms (depending on the specific problem) and obtains a random sample of the corresponding distributions. These samples, which are called rewards, are assumed to be independent of all previous actions and rewards.
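To make the protocol concrete, the following is a minimal sketch of the qualitative bandit setup in Python. The names (`QualitativeArm`, `pull`, `prefers`) and the ten-element scale are ours and purely illustrative, not part of the paper; an arm is simply a categorical distribution over the ordered scale L, and only the ordering of the labels is ever used, never any numeric value attached to them.

```python
import random

# Illustrative encoding of the qualitative MAB setup (names are ours, not the
# paper's).  The scale L is ordered from worst to best, x1 ≺ x2 ≺ ... ≺ x10;
# an arm is a categorical distribution over L, and pulling it returns a label.
L = [f"x{i}" for i in range(1, 11)]        # x1 ≺ x2 ≺ ... ≺ x10
RANK = {x: i for i, x in enumerate(L)}     # used only to *compare* labels

class QualitativeArm:
    def __init__(self, probs, seed=0):
        assert len(probs) == len(L) and abs(sum(probs) - 1.0) < 1e-9
        self.probs = probs
        self.rng = random.Random(seed)

    def pull(self):
        """Draw one qualitative reward from the arm's distribution over L."""
        return self.rng.choices(L, weights=self.probs, k=1)[0]

def prefers(x, y):
    """True iff reward x is at least as good as reward y, i.e. x ⪰ y."""
    return RANK[x] >= RANK[y]

# Rewards can be compared but not meaningfully averaged:
arm = QualitativeArm([0.1] * 10, seed=42)
a, b = arm.pull(), arm.pull()
print(a, b, prefers(a, b))
```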
The goal of the learner can be defined in different ways, such as maximizing the sum of rewards over time (Lai & Robbins, 1985; Auer et al., 2002) or identifying, with high probability, an arm the expected value of which differs by at most ϵ from the highest one (Even-Dar et al., 2002; Kalyanakrishnan et al., 2012).

In the qualitative multi-armed bandit (QMAB) problem, the rewards are not necessarily real-valued. Instead, the arms $X_1, \ldots, X_K$ are random variables over a complete totally ordered set¹ (L, ⪯). Accordingly, when arm k is played in the t-th round, it yields a qualitative payoff $x \in L$. Independence is again assumed to hold across rounds. We will also use ⪰, the reverse order over L, and the associated asymmetric (hence irreflexive) relations ≺ and ≻ of ⪯ and ⪰, respectively. For simplicity, we shall assume that L is a subset of the real numbers and ⪯ denotes the ordinary ranking over the reals. However, we shall not make use of the nominal values of the elements in L, only of their ordering.

¹A totally ordered set is complete if every subset of it that has an upper bound also has a least upper bound.

### 3.1. Empirical CDF and Quantile Function

Let $F^X$ denote the cumulative distribution function (CDF) of a random variable X. The quantile function $Q^X : [0, 1] \to L$ of X is defined as

$$Q^X(\tau) = \inf\{x \in L : \tau \le F^X(x)\}.$$

We extend the domain of this function to the whole real line by defining $Q^X(\tau) = \inf L$ for $\tau < 0$ and $Q^X(\tau) = \sup L$ for $\tau > 1$.

As already mentioned, our aim is to compare arms in terms of their τ-quantiles, where τ ∈ [0, 1] is a (user-specified) parameter of the problem; the concrete learning tasks will be detailed in Sections 4 and 5, respectively. As will be seen, the highest τ-quantile of the arms, $x^* = \max_{1 \le k \le K} Q^{X_k}(\tau)$, will play a central role in both cases. The difficulty of our quantile-based approach is due to the fact that $x^*$ is unknown and, moreover, hard to guess.²

²If $x^*$ were known, the problem could be simplified to a standard value-based MAB with reward 1 in case the qualitative reward is at least as good as $x^*$ and 0 otherwise.

Denote the j-th sample of arm k by $X_{k,j}$. The empirical estimate of the CDF (or empirical CDF) of $X_k$ based on $X_{k,1}, \ldots, X_{k,m}$ is

$$\hat F^{X_k}_m(x) = \frac{1}{m} \sum_{j=1}^{m} \mathbb{I}\{X_{k,j} \preceq x\},$$

where $\mathbb{I}\{\cdot\}$ is the indicator function. Denoting by $T_t(k)$ the number of times arm k has been pulled up to time t, the empirical CDF of $X_k$ in round t is $\hat F^{X_k}_{T_t(k)}(x)$. The empirical estimator for the quantile function of arm k is based on the empirical distribution function:

$$\hat Q^{X_k}_m(\tau) = \inf\big\{x \in L : \tau \le \hat F^{X_k}_m(x)\big\}.$$

The accuracy of these empirical estimates can be quantified using a concentration result of Dvoretzky et al. (1956), which upper-bounds the tail distribution of the deviation of the empirical cumulative distribution function in supremum norm. Its improved version (Massart, 1990), having optimal constants, can be formulated in our case as follows:³

$$\mathbb{P}\Big( \big\| F^{X_k} - \hat F^{X_k}_m \big\|_\infty > c \Big) \le 2 \exp\big(-2 m c^2\big), \qquad (1)$$

where $\|\cdot\|_\infty$ denotes the supremum norm. For the sake of conciseness, we introduce an auxiliary function that determines the size of the confidence intervals:

$$c_m(\delta) = \sqrt{\frac{1}{2m} \log \frac{\pi^2 m^2}{3\delta}}. \qquad (2)$$

³Each analysis in this paper also goes through using the Chernoff-Hoeffding bounds, essentially without any modification, at the cost of having slightly worse multiplicative constants (see Appendix C).

Proposition 1. Fix some $1 \le k \le K$ and $\delta \in (0, 1)$. The following holds with probability at least $1 - \delta$: for all $m \ge 1$ and for every $0 \le \tau \le 1$,

$$\hat Q^{X_k}_m(\tau - c_m(\delta)) \preceq Q^{X_k}(\tau) \preceq \hat Q^{X_k}_m(\tau + c_m(\delta)). \qquad (3)$$

Proof. To simplify notation, denote $F^{X_k}$ by $F$ and $\hat F^{X_k}_m$ by $\hat F_m$. Combining the bound (1) with the union bound and the Basel problem, one obtains that, with probability at least $1 - \delta$, $\|F - \hat F_m\|_\infty \le c_m(\delta)$ for all $m > 0$. In addition, $\|F - \hat F_m\|_\infty \le c_m(\delta)$ implies

$$Q(\tau) = \inf\{x \in L : \tau \le F(x)\} \preceq \inf\big\{x \in L : \tau \le \hat F_m(x) - c_m(\delta)\big\} = \hat Q_m(\tau + c_m(\delta))$$

as well as

$$\hat Q_m(\tau - c_m(\delta)) = \inf\big\{x \in L : \tau \le \hat F_m(x) + c_m(\delta)\big\} \preceq \inf\{x \in L : \tau \le F(x)\} = Q(\tau).$$
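Continuing the illustrative encoding above, the snippet below sketches how the empirical CDF, the empirical quantile, and the confidence width of Eq. (2) can be computed; the function names and the probability vector in the demo are ours and purely illustrative.

```python
import math

def empirical_cdf(samples, x):
    """\\hat F_m(x): fraction of observed rewards that are ⪯ x."""
    return sum(RANK[s] <= RANK[x] for s in samples) / len(samples)

def empirical_quantile(samples, level):
    """\\hat Q_m(level) = inf{x in L : level <= \\hat F_m(x)} (sup L if level > 1)."""
    for x in L:                                   # L is ordered worst to best
        if empirical_cdf(samples, x) >= level:
            return x
    return L[-1]

def c_m(m, delta):
    """Confidence width of Eq. (2): sqrt(log(pi^2 m^2 / (3 delta)) / (2 m))."""
    return math.sqrt(math.log(math.pi ** 2 * m ** 2 / (3 * delta)) / (2 * m))

# Proposition 1: with prob. >= 1 - delta, the true tau-quantile lies between the
# empirical quantiles taken at the levels tau - c_m(delta) and tau + c_m(delta).
arm = QualitativeArm([0.05, 0.05, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.15, 0.15], seed=1)
samples = [arm.pull() for _ in range(2000)]
tau, delta = 0.65, 0.05
c = c_m(len(samples), delta)
print(empirical_quantile(samples, tau - c), empirical_quantile(samples, tau + c))
```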
## 4. Finite Horizon: A PAC Algorithm

In this section, we consider the task of determining a best arm. In accordance with the goal highlighted in the introduction, the optimality of an arm is defined as follows.

Definition 1. An arm k is said to be τ-optimal if $Q^{X_k}(\tau) = x^*$, where

$$x^* = \max_{1 \le k \le K} Q^{X_k}(\tau). \qquad (4)$$

Throughout this section, let $k^*$ denote the index of a τ-optimal arm. Requiring the learner to output such an arm might be hard or even impossible to achieve in cases where the probability of getting qualitative rewards lower than or equal to the optimal τ-quantile is close or equal to τ for all τ-optimal arms. Therefore, in the spirit of the PAC bandit setting introduced by Even-Dar et al. (2002), we are going to tolerate some approximation error.

Definition 2. An arm k is said to be (ϵ, τ)-optimal iff $Q^{X_k}(\tau + \epsilon) \succeq x^*$. Put in words, a slight negative perturbation of the distribution of an (ϵ, τ)-optimal arm yields a τ-quantile that is higher than or equal to $x^*$.

Figure 1. A qualitative MAB setup with three arms. The CDFs of the arms are plotted. The rewards come from $L = \{x_1, \ldots, x_{10}\}$, where $x_1 \prec x_2 \prec \cdots \prec x_{10}$.

Example 1. To illustrate the notion of (ϵ, τ)-optimality, consider the following simple qualitative MAB problem with three arms and parameters τ = 0.65, ϵ = 0.05. Each arm is a random variable over ten possible qualitative rewards $x_1 \prec x_2 \prec \cdots \prec x_{10}$. Figure 1 depicts their cumulative distributions $F^{X_1}$, $F^{X_2}$ and $F^{X_3}$. The τ-quantiles of arms 1, 2 and 3 are $x_8$, $x_7$ and $x_4$, respectively. The first arm (plotted in red) is τ-optimal, whence $k^* = 1$ and $x^* = x_8$. The second arm (plotted in blue) is not τ-optimal, since $F^{X_2}(x_7) > \tau$; yet, it is (ϵ, τ)-optimal, since $F^{X_2}(x_7) - \epsilon < \tau$. The third arm (plotted in black) is not (ϵ, τ)-optimal, since the 0.7-quantile of $X_3$ is still given by $x_4 \prec x_8$.

In practice, there may be several τ-optimal and several (ϵ, τ)-optimal arms. The goal of the learner is to identify one of them reliably.

Definition 3. An online learning algorithm is called an (ϵ, τ, δ)-quantile learner if it outputs an (ϵ, τ)-optimal arm with probability at least 1 − δ.

### 4.1. The QPAC Algorithm

In Algorithm 1, we present our QPAC (Qualitative Probably Approximately Correct) algorithm, an adaptive elimination strategy inspired by Even-Dar et al. (2002). The algorithm computes lower and upper bounds $x^-_t$ and $x^+_t$ of the optimal τ-quantile and exploits that (3) holds with high probability, which has several important consequences (as will be shown later in the analysis). One such consequence is that

$$x^-_t \preceq \hat Q^{X_k}_t\big(\tau + \epsilon + c_t(\delta/K)\big) \qquad (5)$$

for all (ϵ, τ)-optimal arms k, and thus every arm h with $\hat Q^{X_h}_t(\tau + \epsilon + c_t(\delta/K)) \prec x^-_t$ can be eliminated. Another important consequence is that an arm k is necessarily (ϵ, τ)-optimal if

$$x^+_t \preceq \hat Q^{X_k}_t\big(\tau + \epsilon - c_t(\delta/K)\big). \qquad (6)$$

The rest will be detailed in the analysis.

Algorithm 1 QPAC(δ, ϵ, τ)
1: Set A = {1, . . . , K}  ▷ Active arms
2: t = 1
3: while A ≠ ∅ do
4:   for k ∈ A do
5:     Pull arm k and observe X_{k,t}
6:   x⁺_t = max_{k ∈ A} Q̂^{X_k}_t(τ + c_t(δ/K))
7:   x⁻_t = max_{k ∈ A} Q̂^{X_k}_t(τ − c_t(δ/K))
8:   for k ∈ A do
9:     if Q̂^{X_k}_t(τ + ϵ + c_t(δ/K)) ≺ x⁻_t then
10:      A = A \ {k}  ▷ Discard k based on (5)
11:    if x⁺_t ⪯ Q̂^{X_k}_t(τ + ϵ − c_t(δ/K)) then
12:      k̂ = k  ▷ Select k according to (6)
13:      BREAK
14:  t = t + 1
15: return k̂
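The following is a compact sketch of Algorithm 1 in the same illustrative encoding (reusing `QualitativeArm`, `L`, `RANK`, `empirical_quantile` and `c_m` from the snippets above). It is meant to mirror the pseudocode, not to be an optimized or definitive implementation; the round budget and the `None` fallback are our additions.

```python
def qpac(arms, delta, eps, tau, max_rounds=100_000):
    """Sketch of Algorithm 1 (QPAC).  Returns the index of the selected arm,
    or None if the round budget (not part of the pseudocode) is exhausted."""
    K = len(arms)
    active = set(range(K))
    history = {k: [] for k in range(K)}
    for t in range(1, max_rounds + 1):
        for k in active:                           # pull every active arm once
            history[k].append(arms[k].pull())
        c = c_m(t, delta / K)
        q = lambda k, level: empirical_quantile(history[k], level)
        # pessimistic / optimistic bounds on the optimal tau-quantile x*
        x_minus = max((q(k, tau - c) for k in active), key=RANK.get)
        x_plus = max((q(k, tau + c) for k in active), key=RANK.get)
        # discard arms that cannot be (eps, tau)-optimal, cf. condition (5)
        for k in list(active):
            if RANK[q(k, tau + eps + c)] < RANK[x_minus]:
                active.discard(k)
        # return an arm certified to be (eps, tau)-optimal, cf. condition (6)
        for k in active:
            if RANK[x_plus] <= RANK[q(k, tau + eps - c)]:
                return k
    return None
```

Run on a list of `QualitativeArm` objects whose CDFs match Figure 1, such a sketch should, with probability at least 1 − δ, return an (ϵ, τ)-optimal arm, which in Example 1 means arm 1 or arm 2.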
Let us illustrate the algorithm on Example 1.

Example 2 (Example 1 continued). The non-(ϵ, τ)-optimal arm 3 cannot be eliminated by QPAC unless $x^-_t \succ x_4$. This happens when $\hat Q^{X_k}_t(\tau - c_t(\delta/K)) \succ x_4$ for arm 1 or arm 2 (see line 7 of Algorithm 1). Therefore, $x^-_t$ needs to be high enough to eliminate a non-(ϵ, τ)-optimal arm. Moreover, for eliminating the third arm (see line 10), we need $\hat Q^{X_3}_t(\tau + \epsilon + c_t(\delta/K)) \preceq x_4$ to hold, i.e., $\hat F^{X_3}_t(x_4) - c_t(\delta/K) - \epsilon > \tau$. Therefore, for eliminating a non-(ϵ, τ)-optimal arm, the estimate of its CDF needs to be tight enough as well. The selection mechanism of QPAC is based on a very similar argument as the one described above for elimination.

### 4.2. Analysis

The sample complexity of the qualitative PAC setting is very similar to the one of the value-based setting (see Even-Dar et al. (2002)). Before discussing the result in more detail, some further notation needs to be introduced. First of all, denote the set of (ϵ, τ)-optimal arms by $\mathcal{K}_{\epsilon,\tau}$, and define

$$\epsilon_k = \sup\Big\{ \Delta \in [0, 1] : Q^{X_k}(\tau + \epsilon + \Delta) \prec \max_{h \in \mathcal{K}_{\epsilon,\tau}} Q^{X_h}(\tau - \Delta) \Big\}$$

for $k = 1, \ldots, K$, where $\sup \emptyset = 0$ by definition. Finally, let ∨ denote the max operator.

Theorem 1. Assume that algorithm QPAC is run with parameters (ϵ, δ, τ) on a problem with K arms $X_1, \ldots, X_K$. Then, with probability at least 1 − δ, QPAC outputs an (ϵ, τ)-optimal arm after drawing

$$O\left( \sum_{k=1}^{K} \frac{1}{(\epsilon \vee \epsilon_k)^2} \log \frac{K}{(\epsilon \vee \epsilon_k)\,\delta} \right)$$

samples. Thus, QPAC is an (ϵ, τ, δ)-quantile learner.

The proof of Theorem 1 is deferred to Appendix A.

Remark 1. Note that the sample complexity of QPAC depends on the number of arms, K, but not on the number of different rewards (i.e., the size of L).

Remark 2. Lower bounds on the sample complexity of value-based PAC bandits had already been investigated by Mannor & Tsitsiklis (2004). A similar lower bound analysis also applies to the qualitative setting, resulting in a lower bound of the form $\Omega\big( \sum_{k=1}^{K} \frac{1}{(\epsilon \vee \epsilon_k)^2} \log \frac{1}{\delta} \big)$ (see Appendix A.1). This lower bound therefore shows that the sample complexity of the QPAC algorithm given in Theorem 1 is optimal up to a logarithmic factor.
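To get a feeling for the quantities appearing in Theorem 1, the snippet below computes the set $\mathcal{K}_{\epsilon,\tau}$ and a grid approximation of $\epsilon_k$ for categorical arms, and then evaluates the bound (up to the hidden constant). It continues the toy encoding above; `true_quantile`, `theorem1_bound`, and the grid approximation of the supremum are ours, and the whole thing is only a numerical illustration under our reading of the definition of $\epsilon_k$, not the paper's code.

```python
def true_quantile(probs, level):
    """Q(level) for a categorical distribution over the ordered scale L."""
    cdf = 0.0
    for p, x in zip(probs, L):
        cdf += p
        if cdf >= level:
            return x
    return L[-1]          # sup L, covering levels > 1

def theorem1_bound(dists, eps, tau, delta, grid=1000):
    """Evaluate sum_k (eps v eps_k)^(-2) * log(K / ((eps v eps_k) * delta)),
    with eps_k approximated by scanning Delta over a finite grid."""
    K = len(dists)
    Q = lambda k, level: RANK[true_quantile(dists[k], level)]
    x_star = max(Q(k, tau) for k in range(K))
    good = [h for h in range(K) if Q(h, tau + eps) >= x_star]   # (eps,tau)-optimal arms
    total = 0.0
    for k in range(K):
        eps_k = 0.0
        for i in range(1, grid + 1):
            d = i / grid
            if Q(k, tau + eps + d) < max(Q(h, tau - d) for h in good):
                eps_k = d                                        # largest feasible Delta
        gap = max(eps, eps_k)
        total += math.log(K / (gap * delta)) / gap ** 2
    return total
```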
## 5. Infinite Horizon

In this section, we analyze the infinite horizon setting, where the goal of the online learner is normally defined as minimizing the cumulative regret of its actions in the course of time. First of all, this of course presupposes an appropriate definition of the notion of regret. Preferably, in order to allow for a simple accumulation, regret should be defined in a quantitative way. In the standard value-based setting, the regret of choosing an arm is typically defined in terms of the difference $x^* - x$ between the reward observed, x, and the reward that would have been obtained (in expectation) by choosing the best arm, namely the arm with the highest expected reward $x^*$. In our setting, a quantification of regret in terms of differences is no longer possible, however, since arithmetic operations are not defined on our qualitative scale L. Instead, as explained before, we are only allowed to compare outcomes in terms of "better" or "worse".

This leads us quite naturally to the binary regret function $\mathrm{regret}(x, y) = \mathbb{I}\{x \in G\} - \mathbb{I}\{y \in G\}$ for obtaining reward y instead of x, where G is the subset of outcomes in L considered to be "good". Accordingly, the (expected) immediate regret of choosing arm $X_k$ is

$$\rho_k = \max_{k'=1,\ldots,K} \mathbb{E}[\mathrm{regret}(X_{k'}, X_k)] = \max_{k'=1,\ldots,K} \mathbb{P}[X_{k'} \in G] - \mathbb{P}[X_k \in G]. \qquad (7)$$

Now, the above definition of regret raises another question: What is the set G of good outcomes? In our setting, a natural answer is to consider an outcome x as good if $x \succeq x^*$, i.e., if x is at least as good as the optimal τ-quantile (4). However, in conjunction with the sharp discontinuity of the regret function (7), this definition renders regret minimization a truly hard problem. In fact, as shown by the following example, no algorithm with sublinear distribution-independent regret guarantees exists. A more formal explanation of the linear worst case regret is deferred to the supplementary material (see Appendix B.2). Our worst case analysis is based on the fact that the bandit instances given in Example 3 (a) and (b) are hard to distinguish.

Example 3. Consider the qualitative MAB settings illustrated in Figure 2. In case (a), it is clear that $x_3$ should be considered as the only good reward, and thus $x^* = x_3$ and $G = \{x_3\}$. The second arm thus never returns good rewards, whereas the first arm does with probability $1 - \tau + \delta$. Therefore, the regret of arm 2 is $1 - \tau + \delta$. On the other hand, in case (c) both $x_2$ and $x_3$ should be considered good, so $x^* = x_2$ and $G = \{x_2, x_3\}$. Thus, while arm 2 returns a good reward consistently, the first arm is doing so only with probability $1 - \tau - \delta$. The regret of the first arm is $\tau + \delta$. As long as one cannot distinguish between cases (a) and (c) with high enough probability, the choice of which one to optimize for (which is crucial, as arm 2 has at least constant regret in (a), and the same holds for arm 1 in (c)) will remain random. (The problem of the learner, therefore, is to find out whether $\mathbb{P}[X_1 = x_1] \ge \tau$ or $\mathbb{P}[X_1 = x_1] < \tau$ for the first arm.) Thus, until that point is reached, any learner necessarily incurs linear regret in at least one of these examples. Additionally, distinguishing between the examples becomes more and more difficult as δ approaches 0 (for formal results see Appendix B.1 and B.2). This suggests that one cannot hope for sublinear worst case regret bounds.

Figure 2. Synthetic qualitative MAB tasks with two arms; each panel (a)-(c) plots the CDFs and quantile functions $Q^{X_1}$, $Q^{X_2}$ of one synthetic problem.

Example 4 (Examples in Figure 2 continued). Now, consider cases (a) and (b), and the τ-quantile $x^*$. In case (b), $x^* = x_2$, thus $\mathbb{P}[X_1 \succeq x^*] = 1 - \tau$ and $\mathbb{P}[X_2 \succeq x^*] = 1$, while in case (a) (see Example 3), $\mathbb{P}[X_1 \succeq x^*] = 1 - \tau + \delta$ and $\mathbb{P}[X_2 \succeq x^*] = 0$. However, in order to distinguish the two cases, the learner needs to pull arm 1, leading to some non-negligible regret. This regret in (b), however, cannot be explained by any natural parameter (like the difference of the means in the quantitative case).

In order to avoid the problem in Example 4, we propose a slightly different definition of the set G of good outcomes. To this end, let

$$x^*(\tau') = \max_{k=1,\ldots,K} Q^{X_k}(\tau')$$

for $\tau' \in [0, 1]$ (thus $x^* = x^*(\tau)$), and define

$$G = L_\tau = \{x \in L : x \succeq x^*(\tau') \text{ for some } \tau' > \tau\}.$$

Correspondingly, the best arm with the minimal expected regret is defined by $k^* = \mathrm{argmax}_{1 \le k \le K} \mathbb{P}[X_k \in L_\tau]$, the (expected) immediate regret of arm k is $\rho_k = \mathbb{P}[X_{k^*} \in L_\tau] - \mathbb{P}[X_k \in L_\tau]$, and

$$R_t = t\, \mathbb{P}[X_{k^*} \in L_\tau] - \mathbb{E}\Big[ \sum_{t'=1}^{t} \mathbb{I}\{X_{k_{t'}} \in L_\tau\} \Big]$$

is the expected cumulative regret, where $k_{t'}$ is the index of the arm chosen in round $t'$. In our example, this approach renders the first arm optimal in both (a) and (b), since in both cases $L_\tau = \{x_3\}$. Note also that in case of a clear separation (i.e., when $x^*(\tau) = x^*(\tau + \epsilon)$ for some ϵ > 0) this regret is equivalent to the one based on $\mathbb{P}[X_k \succeq x^*]$.
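As a concrete illustration of these definitions in the toy encoding used throughout (the function name and the grid scan over τ' are ours; for a finite scale L the set $L_\tau$ contains its infimum, which the code exploits), the following computes inf $L_\tau$, the best arm $k^*$, and the immediate regrets $\rho_k$ for categorical arms.

```python
def regret_quantities(dists, tau, grid=1000):
    """Compute inf L_tau, k*, and the immediate regrets rho_k of Section 5 for
    categorical arms over L.  The levels tau' > tau are scanned on a grid."""
    K = len(dists)
    Q = lambda k, level: RANK[true_quantile(dists[k], level)]
    # x*(tau') is non-decreasing in tau', so inf L_tau is its smallest value
    inf_L_tau = min(max(Q(k, tau + i / grid) for k in range(K))
                    for i in range(1, grid + 1))
    # for finite L, P[X_k in L_tau] = P[X_k ⪰ inf L_tau]
    p_good = [sum(p for p, x in zip(dists[k], L) if RANK[x] >= inf_L_tau)
              for k in range(K)]
    k_star = max(range(K), key=lambda k: p_good[k])
    rho = [p_good[k_star] - p_good[k] for k in range(K)]
    return L[inf_L_tau], k_star, rho

# Two illustrative arms: one concentrated on good outcomes, one on bad ones.
dists = [[0.0] * 7 + [0.5, 0.25, 0.25],
         [0.25, 0.25, 0.5] + [0.0] * 7]
print(regret_quantities(dists, tau=0.5))   # -> ('x9', 0, [0.0, 0.5])
```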
### 5.1. Algorithm QUCB

In Algorithm 2, we present the pseudocode of our QUCB (short for Qualitative Upper Confidence Bound) algorithm. In each round t, it uses an estimate $\hat Q^{X_k}_{T_t(k)}$ of the τ-quantiles and pulls the arm which maximizes this estimate. The confidence term used is of the usual UCB form, $c(t, s) = \sqrt{2\log(t)/s}$. (The algorithm also needs to break ties, which is carried out in a similar fashion, but using the estimates of the p functions described below.) As a result, the accuracy of the estimate for the most promising arm will be increased. Suboptimal arms will be revealed as such as soon as the accuracy of the corresponding estimate is high enough; mathematically, this can be guaranteed thanks to the right-continuity of the quantile functions.

For selecting the most promising arms among those with $\mathbb{P}[X_k \in L_\tau] < \tau$, the algorithm additionally keeps track of an estimate of another function, $p^{X_k}(x) = \mathbb{P}[X_k \prec x]$, using $\hat p^{X_k}_m(x) = \frac{1}{m} \sum_{j=1}^{m} \mathbb{I}\{X_{k,j} \prec x\}$. (Thus $\hat F^{X_k}_m(x) = \hat p^{X_k}_m(x)$ for $x \in L \setminus \{X_{k,1}, \ldots, X_{k,m}\}$.)

Algorithm 2 QUCB(τ)
1: for rounds t = 1, . . . , K do
2:   set k_t = t and T(k_t) = 1
3:   pull arm k_t and observe sample X_{k_t,1}
4: for rounds t = K + 1, K + 2, . . . do
5:   x̂_t = sup_{k=1,...,K} Q̂^{X_k}_{T(k)}(τ + c(t, T(k)))
6:   k_t := argmin_{k=1,...,K} ( p̂^{X_k}_{T(k)}(x̂_t) − c(t, T(k)) )
7:   set T(k_t) = T(k_t) + 1
8:   pull arm k_t and observe sample X_{k_t,T(k_t)}

In order to state the regret bound of QUCB, some further notation needs to be introduced. But first we need a technical lemma (see Appendix B for the proof).

Lemma 1. If $\mathbb{P}[X_k \in L_\tau] < \tau$ for some $1 \le k \le K$, then $(\inf L_\tau) \in L_\tau$, $\tau > \min_k \mathbb{P}[X_k \prec \inf L_\tau]$, and $\tau < \min_k \mathbb{P}[X_k \preceq \inf L_\tau]$. Also, $Q^{X_{k^*}}(\tau) = x^*$.

For an arm k with $\mathbb{P}[X_k \in L_\tau] > \tau$, define $\Delta_k = \mathbb{P}[X_k \in L_\tau] - \tau$. Now, consider some arm k with $\mathbb{P}[X_k \in L_\tau] \le \tau$. In case $\mathbb{P}[X_k \in L_\tau] = \mathbb{P}[X_{k^*} \in L_\tau]$, $X_k$ is also optimal; it is thus only interesting to upper bound $T_k(t)$ in case $\rho_k = \mathbb{P}[X_{k^*} \in L_\tau] - \mathbb{P}[X_k \in L_\tau] > 0$. In that case, $\mathbb{P}[X_k \in L_\tau] < \tau$ and Lemma 1 applies. Therefore, $\Delta_0 = \min_k \mathbb{P}[X_k \preceq \inf L_\tau] - \tau > 0$. Based on these, for $X_k$ satisfying $\mathbb{P}[X_k \in L_\tau] \le \tau$, define $\Delta_k = \min(\rho_k, \Delta_0)$. Then, we have (see Appendix B for the proof):

Theorem 2. The expected cumulative regret of QUCB in round t is

$$R_t = O\Big( \sum_{k : \Delta_k > 0} \frac{\rho_k}{\Delta_k^2} \log t \Big).$$

For regret lower bounds see Appendix B.1 and B.2.
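A sketch of Algorithm 2 in the same toy encoding follows (reusing the arm class and estimators from the earlier snippets). The exact constant of the confidence radius is not taken from the paper; the standard UCB-style choice $\sqrt{2\log(t)/s}$ used below is our assumption for illustration only.

```python
def qucb(arms, tau, horizon):
    """Sketch of Algorithm 2 (QUCB); returns the sequence of pulled arm indices.
    The confidence radius is a standard UCB-style choice (our assumption)."""
    K = len(arms)
    history = [[arm.pull()] for arm in arms]       # initialization: one pull per arm
    pulls = [1] * K

    def conf(t, s):
        return math.sqrt(2.0 * math.log(t) / s)

    def p_hat(k, x):                               # estimate of P[X_k ≺ x]
        return sum(RANK[s] < RANK[x] for s in history[k]) / pulls[k]

    choices = list(range(K))
    for t in range(K + 1, horizon + 1):
        # optimistic estimate of the optimal tau-quantile
        x_hat = max((empirical_quantile(history[k], tau + conf(t, pulls[k]))
                     for k in range(K)), key=RANK.get)
        # pull the arm with the smallest lower-confidence estimate of P[X_k ≺ x_hat]
        k_t = min(range(K), key=lambda k: p_hat(k, x_hat) - conf(t, pulls[k]))
        history[k_t].append(arms[k_t].pull())
        pulls[k_t] += 1
        choices.append(k_t)
    return choices
```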
## 6. Experiments

### 6.1. Finite Horizon

The goal of the first experiment is to assess the impact of the parameters τ and ϵ on the sample complexity, that is, the number of samples taken by the algorithm prior to termination. We generated random bandit instances for which rewards are taken from a totally ordered discrete set $\{x_1, \ldots, x_{10}\}$. In other words, the arm distributions are multinomial distributions. The parameters of the distributions are drawn uniformly at random from (0, 1) and proportionally scaled so as to sum up to one. The sample complexities for various values of the parameters ϵ and τ are shown in Figure 3. As can be seen, the smaller ϵ, the higher the sample complexity; thus, our algorithm scales gracefully with the approximation error allowed. The second observation is that the parameter τ has only a weak influence on the sample complexity. This can be explained by the fact that our confidence intervals are derived for the empirical CDF as a whole, without being tailored to a specific quantile.

Figure 3. The sample complexity of QPAC for K = 10 and different values of the parameters τ and ϵ. The arm distributions are categorical distributions. The results are averaged over 100 repetitions. The confidence parameter δ was set to 0.05 for each run; accordingly, the average accuracy was significantly above 1 − δ = 0.95 in each case.

In the second experiment, we compare the performance of the QPAC algorithm with a standard PAC bandit algorithm on bandit problems for which the (ϵ, τ)-optimal arm coincides with the ϵ-best arm in terms of means, thereby assuring that both learners are seeking the same arm. As a baseline, we run the SUCCESSIVEELIMINATION learner (SE) by Even-Dar et al. (2002), which is an (ϵ, δ)-PAC learner (i.e., it returns an ϵ-optimal arm with probability at least 1 − δ). To guarantee a fair comparison, the confidence interval defined in (2) is used in our implementation of SE, which differs only in constants from the one of Even-Dar et al. (2002). We tested the algorithms on the Needle in the Haystack (NHS) problem, which consists of arms obeying a Bernoulli distribution with parameter p, except for one of them, the target, which has a slightly higher parameter p + ∆. Note that for τ = (1 − p) − ∆/2, we have $Q^{X_{i^*}}(\tau) = 1$ for the single τ-optimal arm $i^*$, and $Q^{X_i}(\tau) = 0$ for every other arm i. We ran the experiments with τ = 0.1, . . . , 0.9 and ∆ = 0.1; correspondingly, p was set to 0.85, 0.75, . . . , 0.05, respectively. The approximation error ϵ was set to 0.01, 0.03 and 0.05.

Figure 4. The sample complexity of SE and QPAC for the NHS problem with various parameter settings (ϵ ∈ {0.01, 0.03, 0.05}). The number K of arms was set to 15. The results are averaged over 100 repetitions. The confidence parameter δ was set to 0.05 for each run; accordingly, the average accuracy was significantly above 1 − δ = 0.95 in each case.

As can be seen from the results in Figure 4, QPAC slightly outperforms SE in terms of sample complexity for smaller values of ϵ. This can be explained by the fact that, although both algorithms are using similar elimination strategies, the statistics they use are of a different nature.
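The NHS construction above can be checked in a few lines. The snippet is standalone: the Bernoulli arms are encoded directly by their success probability over the two-element scale 0 ≺ 1, and the parameter values are the ones reported in the text.

```python
def bernoulli_quantile(p_one, tau):
    """tau-quantile of a Bernoulli arm over the two-element scale 0 ≺ 1."""
    return 0 if tau <= 1.0 - p_one else 1

delta_p = 0.1
for tau in [round(0.1 * i, 1) for i in range(1, 10)]:    # tau = 0.1, ..., 0.9
    p = 1.0 - tau - delta_p / 2                          # i.e. tau = (1 - p) - delta_p / 2
    target = bernoulli_quantile(p + delta_p, tau)        # the single better arm
    others = bernoulli_quantile(p, tau)                  # all remaining arms
    assert (target, others) == (1, 0)                    # only the target is tau-optimal
print("for every tau, the target arm is the unique tau-optimal arm")
```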
### 6.2. Infinite Horizon

In this section, we evaluate the QUCB algorithm on several numerical test cases. As a baseline, we run the standard UCB algorithm (Auer et al., 2002), which maximizes the sum of rewards. In each case, we plot the quantile-based cumulative regret and the average accuracy of the algorithms versus the number of iterations. By definition, the accuracy of an algorithm is 1 if it selects an arm from the set of optimal arms, and 0 otherwise. We run the algorithms on the following bandit problems:

1. Bandit instances defined in Figure 2(a) and 2(b). The results are shown in Figure 5(a) and 5(d), respectively. We set $x_1 = 1$, $x_2 = 2$ and $x_3 = 3$ in the case of UCB. The parameter δ was set to 0.1.
2. Multinomial arm distributions as described in the previous section, with parameters drawn uniformly at random. For the quantiles, we used τ ∈ {0.5, 0.9}. The results are shown in Figure 5(b) and 5(e), respectively.
3. The NHS problem with parameters p = 0.45, ∆ = 0.1, τ = 0.5 and p = 0.85, ∆ = 0.1, τ = 0.9. The results are shown in Figure 5(c) and 5(f).

Figure 5. Cumulative regret and accuracy for various test cases: (a) task from Figure 2(a), (b) random instances with τ = 0.5, (c) NHS with τ = 0.5, (d) task from Figure 2(b), (e) random instances with τ = 0.9, (f) NHS with τ = 0.9.

In the first test case, described in Figure 2(a), the mean of both arm distributions is 1/2. Therefore, since UCB cannot distinguish the arms, its accuracy fluctuates around 1/2. As opposed to this, QUCB is able to identify the optimal arm. In the second test case, defined in Figure 2(b), the best option is the second arm, both in terms of mean and τ-quantile. Accordingly, QUCB and UCB are both able to identify the optimal arm. On multinomial arm distributions, QUCB significantly outperforms the UCB algorithm. This can be explained by the fact that the median (τ = 1/2) does not necessarily coincide with the mean; the higher τ, the more different the goals of the two learners actually become. As expected, the performance of both algorithms is on par in the case of the NHS problem.

## 7. Conclusion

We have investigated the setting of quantile-based online bandit learning in the qualitative case, that is, when rewards come from a complete totally ordered set but are not necessarily numerical. We introduced and analyzed a PAC algorithm in the finite horizon setting. Moreover, for the infinite horizon setting, we proposed an algorithm whose (distribution-dependent) expected regret grows logarithmically with time. We have shown that sublinear regret in the qualitative setting is not achievable in the worst case (without using properties of the underlying distributions) in general. Since the standard reward expectation maximization problem has a known worst case lower bound of $\Omega(\sqrt{T})$ (Audibert & Bubeck, 2010) and the risk-aversion setup has $\Omega(T^{2/3})$ (Sani et al., 2012), our worst case result implies that minimizing the quantile-based regret in the qualitative setting is intrinsically more difficult than the standard value-based and the risk-aversion bandit problems.

## Acknowledgments

This work was supported by the European Community's 7th Framework Programme (FP7/2007-2013) under grant agreement no. 270327 (project CompLACS) and by the German Research Foundation under grant HU 1284/8-1.

## References

Audibert, Jean-Yves and Bubeck, Sébastien. Regret bounds and minimax policies under partial monitoring. Journal of Machine Learning Research, 11:2785-2836, 2010.

Auer, P., Cesa-Bianchi, N., and Fischer, P. Finite-time analysis of the multiarmed bandit problem. Machine Learning, 47:235-256, 2002.

Bubeck, S. and Cesa-Bianchi, N. Regret analysis of stochastic and nonstochastic multi-armed bandit problems. Foundations and Trends in Machine Learning, 5(1):1-122, 2012.

Bubeck, S., Munos, R., and Stoltz, G. Pure exploration in multi-armed bandits problems. In Proceedings of the 20th International Conference on Algorithmic Learning Theory (ALT'09), pp. 23-37, 2009.

Bubeck, S., Wang, T., and Viswanathan, N. Multiple identifications in multi-armed bandits. In Proceedings of the 30th International Conference on Machine Learning (ICML), pp. 258-265, 2013.

Busa-Fekete, R. and Hüllermeier, E. A survey of preference-based online learning with bandit algorithms. In Algorithmic Learning Theory (ALT), volume 8776, pp. 18-39, 2014.

Busa-Fekete, R. and Kégl, B. Fast boosting using adversarial bandits. In International Conference on Machine Learning (ICML), volume 27, pp. 143-150, Haifa, Israel, 2010. ACM.
Cappé, O., Garivier, A., Maillard, O.-A., Munos, R., and Stoltz, G. Kullback-Leibler upper confidence bounds for optimal sequential allocation. Annals of Statistics, 41(3):1516-1541, 2013.

Carpentier, A. and Valko, M. Extreme bandits. In Advances in Neural Information Processing Systems 27: Annual Conference on Neural Information Processing Systems, pp. 1089-1097, 2014.

Cesa-Bianchi, N. and Lugosi, G. Prediction, Learning and Games. Cambridge University Press, 2006.

Dvoretzky, A., Kiefer, J., and Wolfowitz, J. Asymptotic minimax character of the sample distribution function and of the classical multinomial estimator. The Annals of Mathematical Statistics, 27(3):642-669, 1956.

Even-Dar, E., Mannor, S., and Mansour, Y. PAC bounds for multi-armed bandit and Markov decision processes. In Proceedings of the 15th Conference on Learning Theory (COLT), pp. 255-270, 2002.

Gabillon, V., Ghavamzadeh, M., Lazaric, A., and Bubeck, S. Multi-bandit best arm identification. In Advances in Neural Information Processing Systems 24 (NIPS), pp. 2222-2230, 2011.

Gaudel, Romaric and Sebag, Michèle. Feature selection as a one-player game. In Proceedings of the 27th International Conference on Machine Learning (ICML-10), Haifa, Israel, pp. 359-366, 2010.

Kalyanakrishnan, S., Tewari, A., Auer, P., and Stone, P. PAC subset selection in stochastic multi-armed bandits. In Proceedings of the Twenty-Ninth International Conference on Machine Learning (ICML 2012), pp. 655-662, 2012.

Kuleshov, Volodymyr and Precup, Doina. Algorithms for multi-armed bandit problems. CoRR, abs/1402.6028, 2014.

Lai, T.L. and Robbins, H. Asymptotically efficient adaptive allocation rules. Advances in Applied Mathematics, 6(1):4-22, 1985.

Mannor, Shie and Tsitsiklis, John N. The sample complexity of exploration in the multi-armed bandit problem. Journal of Machine Learning Research, 5:623-648, 2004.

Massart, P. The tight constant in the Dvoretzky-Kiefer-Wolfowitz inequality. The Annals of Probability, 18(3):1269-1283, 1990.

Sani, Amir, Lazaric, Alessandro, and Munos, Rémi. Risk-aversion in multi-armed bandits. In 26th Annual Conference on Neural Information Processing Systems (NIPS), pp. 3284-3292, 2012.

Schachter, B. An irreverent guide to Value-at-Risk. Financial Engineering News, 1, 1997.

Yu, Jia Yuan and Nikolova, Evdokia. Sample complexity of risk-averse bandit-arm selection. In Proceedings of the Twenty-Third International Joint Conference on Artificial Intelligence (IJCAI), pp. 2576-2582. AAAI Press, 2013.

Yue, Y., Broder, J., Kleinberg, R., and Joachims, T. The k-armed dueling bandits problem. Journal of Computer and System Sciences, 78(5):1538-1556, 2012.

Yue, Yisong and Joachims, Thorsten. Beat the mean bandit. In Proceedings of the International Conference on Machine Learning (ICML), pp. 241-248, 2011.

Zhou, Yuan, Chen, Xi, and Li, Jian. Optimal PAC multiple arm identification with applications to crowdsourcing. In Proceedings of the 31st International Conference on Machine Learning (ICML-14), pp. 217-225, 2014.

Zoghi, M., Whiteson, S., Munos, R., and de Rijke, M. Relative upper confidence bound for the k-armed dueling bandit problem. In Proceedings of the Thirty-First International Conference on Machine Learning (ICML), pp. 10-18, 2014.