# Bandits with Ranking Feedback

Davide Maran, Politecnico di Milano, davide.maran@polimi.it
Francesco Bacchiocchi, Politecnico di Milano, francesco.bacchiocchi@polimi.it
Francesco Emanuele Stradi, Politecnico di Milano, francescoemanuele.stradi@polimi.it
Matteo Castiglioni, Politecnico di Milano, matteo.castiglioni@polimi.it
Nicola Gatti, Politecnico di Milano, nicola.gatti@polimi.it
Marcello Restelli, Politecnico di Milano, marcello.restelli@polimi.it

Equal contribution. 38th Conference on Neural Information Processing Systems (NeurIPS 2024).

Abstract

In this paper, we introduce a novel variation of multi-armed bandits called bandits with ranking feedback. Unlike traditional bandits, this variation provides feedback to the learner that allows them to rank the arms based on previous pulls, without quantifying numerically the difference in performance. This type of feedback is well-suited for scenarios where the arms' values cannot be precisely measured using metrics such as monetary scores, probabilities, or occurrences. Common examples include human preferences in matchmaking problems. Furthermore, its investigation answers the theoretical question of how crucial numerical rewards are in bandit settings. In particular, we study the problem of designing no-regret algorithms with ranking feedback both in the stochastic and adversarial settings. We show that, with stochastic rewards, differently from what happens with non-ranking feedback, no algorithm can suffer a logarithmic regret in the time horizon $T$ in the instance-dependent case. Furthermore, we provide two algorithms. The first, namely DREE, guarantees a superlogarithmic regret in $T$ in the instance-dependent case, thus matching our lower bound, while the second, namely R-LPE, guarantees a regret of $\widetilde{O}(\sqrt{T})$ in the instance-independent case. Remarkably, we show that no algorithm can have an optimal regret bound in both the instance-dependent and instance-independent cases. Finally, we prove that no algorithm can achieve a sublinear regret when the rewards are adversarial.

1 Introduction

Multi-armed bandits are well-known sequential decision-making problems where a learner is given a number of arms whose reward is unknown [Lattimore and Szepesvari, 2017]. At every round, the learner can pull an arm and observe a realization of the reward associated with that arm, which can be generated stochastically [Auer et al., 2002] or adversarially [Auer et al., 1995]. The central question in multi-armed bandits concerns how to address the exploration/exploitation tradeoff so as to minimize the regret between the reward provided by the learning policy and that of the optimal clairvoyant algorithm. In this paper, we introduce a novel variation of multi-armed bandits that, to the best of our knowledge, is unexplored so far. We name the model bandits with ranking feedback. This feedback provides the learner with a partial observation of the rewards given by the arms. More precisely, the learner can rank the arms based on the previous pulls they experienced, but they cannot quantify numerically the difference in performance. Thus, the learner is not allowed to assess how much an arm is better or worse than another.
This type of feedback is well-suited for scenarios where the arms' values cannot be precisely measured using metrics such as monetary scores, probabilities, or occurrences, and it naturally applies to various settings, e.g., when dealing with human preferences, as in matchmaking among humans, or when the scores cannot be revealed for privacy or security reasons. The latter case arises, e.g., in online advertising platforms offering automatic bidding services: such platforms have no information on the actual revenue of the advertising campaigns, since advertisers prefer not to reveal these values, which are sensitive data for their companies. Notice that a platform can observe the number of clicks received by an advertising campaign, but it cannot observe the revenue associated with that campaign. Remarkably, our model poses the interesting theoretical question of whether the lack of numerical scores precludes the design of sublinear-regret algorithms or worsens the regret bounds that are achievable when numerical scores are available.

Original contributions. We study the problem of designing no-regret algorithms for the bandits with ranking feedback problem, both in stochastic and adversarial settings. In the case of adversarial rewards, we prove that no algorithm can achieve sublinear regret. In contrast, with stochastic rewards, we show that the ranking feedback does not preclude such a possibility. In particular, in the instance-dependent case, we show that no algorithm can achieve logarithmic regret in the time horizon, and we provide an algorithm, namely DREE (Dynamical Ranking Exploration-Exploitation), guaranteeing a regret bound that matches this lower bound. In the instance-independent case, a crucial question is whether there exists an algorithm providing a better regret bound than the one achieved by the well-known Explore-then-Commit algorithm, which trivially guarantees an $O(T^{2/3})$ regret upper bound. We positively answer this question by designing an algorithm, namely R-LPE (Ranking Logarithmic Phased Elimination), which guarantees a regret of $\widetilde{O}(\sqrt{T})$ in the instance-independent case if the rewards are Gaussian. To achieve this result, we derive several non-standard results that allow us to discretize Brownian motions, and these are of independent interest. These two different approaches leave open the problem of whether there exists an algorithm achieving optimal performance in both the instance-dependent and instance-independent cases. We negatively answer this question by showing that no algorithm can achieve an optimal regret bound in both cases, confirming the need to design two distinct algorithms for the two cases. Finally, we numerically evaluate our DREE and R-LPE algorithms in a testbed, and we compare their performance with some baselines from the literature in different settings. We show that our algorithms dramatically outperform the baselines in terms of empirical regret.

Related works. The field most related to bandits with ranking feedback is preference learning, which aims at learning the preferences of one or more agents from observations [Fürnkranz and Hüllermeier, 2010]. Let us remark that preference learning has recently gained a lot of attention from the scientific community, as it enables the design of AI artifacts capable of interacting with human-in-the-loop (HITL) environments. Indeed, human feedback may be quite misleading when humans are asked to report numerical values, while they are far more effective at reporting ranking preferences.
The preference learning literature mainly focuses on two kinds of preference observations: pairwise preferences and rankings. In the first case, the data observed by the learner involve preferences between two objects, i.e., a partial preference is given to the learner. In the latter, a complete ranking of the available data is given as feedback. Our work belongs to the latter branch. Preference learning has been widely investigated by the online learning community, see, e.g., [Bengs et al., 2018]. Precisely, our work presents several similarities with the dueling bandits setting [Yue et al., 2012, Saha and Gaillard, 2022, Lekang and Lamperski, 2019], where, in each round, the learner pulls two arms and observes a ranking over them. Nevertheless, although dueling bandits share similarities with our setting, they present substantial differences. Specifically, in our model, the learner observes a ranking that depends on all the arms they have pulled so far. In dueling bandits, the learner observes an instantaneous comparison between the arms they have just pulled; thus, the outcome of such a comparison does not depend on the arms previously selected, as is instead the case in bandits with ranking feedback. As a consequence, while in bandits with ranking feedback the goal of the learner is to exploit the arm with the highest mean, in dueling bandits the goal of the learner is to select the arm winning with the highest probability. Furthermore, while we adopt the classical notion of regret used in the bandit literature to assess the theoretical properties of our algorithms, in dueling bandits the algorithms are often evaluated with a suitable notion of regret that differs from the classical one. Dueling bandits have their reinforcement learning (RL) counterpart in preference-based reinforcement learning (PbRL), see, e.g., [Novoseller et al., 2019] and [Wirth et al., 2017]. Interestingly, PbRL techniques differ from standard RL approaches in that they allow an algorithm to learn from non-numerical rewards; this is particularly useful when the environment encompasses human-like entities [Chen et al., 2022]. Furthermore, preference-based reinforcement learning provides a wide range of results, spanning from theory [Xu et al., 2020] to practice [Christiano et al., 2017, Lee et al., 2021]. In PbRL, preferences may concern both states and actions; by contrast, our framework is stateless, since the rewards depend only on the actions taken during the learning dynamic. Moreover, the differences outlined between dueling bandits and bandits with ranking feedback still hold for preference-based reinforcement learning, as preferences are considered between observations instead of the empirical means of the accumulated rewards.

Paper structure. The paper is structured as follows. In Section 2, we report the problem formulation, the setting, and the necessary notation. In Section 3, we study the stochastic setting, that is, when the rewards are sampled from a fixed distribution. In Section 3.1, we present an instance-dependent regret lower bound that characterizes our problem. Then, in Section 3.2, we propose our algorithm and show that it achieves a tight instance-dependent regret bound. In Section 3.3, we show a trade-off between the regret upper bounds achievable by any algorithm in the instance-dependent and instance-independent cases. Finally, in Section 3.4, we present an algorithm achieving an optimal instance-independent regret bound when the rewards are Gaussian.
In Section 4, we study the adversarial setting, that is, when no statistical assumptions are made about the rewards. In this case, we present our impossibility result, showing that no algorithm can achieve sublinear regret. In Appendix A and Appendix B, we report the omitted proofs for the stochastic setting. In Appendix C, we report the omitted proof for the adversarial setting. Finally, in Appendix D, we report the empirical evaluation of our algorithms.

2 Problem formulation

In this section, we formally state the model of bandits with ranking feedback and discuss the learner-environment interaction. Subsequently, we define policies and the notion of regret, both in the stochastic and in the adversarial setting.

Setting and interaction. Differently from the classical version of the multi-armed bandit problem (see, e.g., [Lattimore and Szepesvari, 2017]), in which the learner observes the reward associated with the pulled arm, in bandits with ranking feedback the learner can only observe a ranking over the arms based on the previous pulls. Formally, we assume the learner-environment interaction to unfold as follows.²

(i) At every round $t \in [T]$, where $T$ is the time horizon, the learner chooses an arm $i_t \in \mathcal{A} := [n]$, where $\mathcal{A}$ is the set of available arms and $n = |\mathcal{A}| < +\infty$.

(ii) We study both stochastic and adversarial settings. In the stochastic setting, the environment draws the reward $r_t(i_t)$ associated with arm $i_t$ from a probability distribution $\nu_{i_t}$, i.e., $r_t(i_t) \sim \nu_{i_t}$, whereas, in the adversarial setting, $r_t(i_t)$ is chosen adversarially by an opponent from a bounded set of reward functions.

(iii) There is bandit feedback on the reward of the arm $i_t \in \mathcal{A}$ pulled at round $t$, which updates the empirical mean of each arm $i \in \mathcal{A}$, defined as
$$\hat{r}_t(i) := \frac{1}{Z_i(t)} \sum_{j \in W_t(i)} r_j(i),$$
where $W_t(i) := \{\tau \in [t] \mid i_\tau = i\}$ and $Z_i(t) := |W_t(i)|$.³ The learner observes the ranking over the empirical means $\{\hat{r}_t(i)\}_{i \in \mathcal{A}}$. Formally, we assume that the ranking $R_t \in S_{\mathcal{A}}$ observed by the learner at round $t$ is such that:
$$\hat{r}_t(R_{t,i}) \ge \hat{r}_t(R_{t,j}) \qquad \forall t \in [T],\ \forall i, j \in [n] \text{ s.t. } i \le j,$$
where $R_{t,i} \in \mathcal{A}$ denotes the $i$-th element in the ranking $R_t$ at round $t \in [T]$. Ties are broken in favor of the lower index.

² Given $n \in \mathbb{N}$, we denote by $[n] := \{1, \ldots, n\}$. Furthermore, given a finite set $\mathcal{A}$, we denote by $S_{\mathcal{A}}$ the set containing all the possible permutations of its elements. Similarly, we let $\mathcal{P}(\mathcal{A})$ be the set containing all the probability distributions with support $\mathcal{A}$.

³ Note that the latter definition is well-posed as long as $|W_t(i)| > 0$. For each $i \in \mathcal{A}$ and $t \in [T]$ such that $|W_t(i)| = 0$, we let $\hat{r}_t(i) = 0$.

For the sake of clarity, we provide an example to illustrate our setting and the corresponding learner-environment interaction.

Example. We consider an instance with two arms, i.e., $\mathcal{A} = \{1, 2\}$, where the learner plays the first arm at rounds $t = 1$ and $t = 3$, and the second arm at round $t = 2$, so that $W_3(1) = \{1, 3\}$ and $W_3(2) = \{2\}$. Let $r_1(1) = 1$ and $r_3(1) = 5$ be the rewards obtained from playing the first arm at rounds $t = 1$ and $t = 3$, respectively, and let $r_2(2) = 5$ be the reward for playing the second arm at round $t = 2$. The empirical means of the two arms and the resulting rankings at each round $t \in [3]$ are given by:
$$\hat{r}_t(1) = 1,\ \hat{r}_t(2) = 0,\ R_t = \langle 1, 2 \rangle \quad (t = 1),$$
$$\hat{r}_t(1) = 1,\ \hat{r}_t(2) = 5,\ R_t = \langle 2, 1 \rangle \quad (t = 2),$$
$$\hat{r}_t(1) = 3,\ \hat{r}_t(2) = 5,\ R_t = \langle 2, 1 \rangle \quad (t = 3).$$

Policies and regret. At every round $t$, the arm played by the learner is prescribed by a policy $\pi$. In both the stochastic and adversarial settings, we let the policy $\pi$ be a randomized map from the history of the interaction $H_{t-1} = (R_1, i_1, R_2, i_2, \ldots, R_{t-1}, i_{t-1})$ to the set of all probability distributions with support $\mathcal{A}$. Formally, we let $\pi : H_{t-1} \to \mathcal{P}(\mathcal{A})$, for $t \in [T]$, such that $i_t \sim \pi(H_{t-1})$.
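To make the feedback model concrete, the following minimal Python sketch (our own illustration, not code from the paper) reproduces the two-arm example above: it tracks the pull sets $W_t(i)$, computes the empirical means, and returns the ranking with ties broken in favor of the lower index.

```python
from collections import defaultdict

class RankingBandit:
    """Environment that returns only a ranking of empirical means, never rewards."""

    def __init__(self, n_arms):
        self.n_arms = n_arms
        self.sums = defaultdict(float)   # cumulative reward per arm
        self.counts = defaultdict(int)   # Z_i(t): number of pulls per arm

    def update(self, arm, reward):
        # The learner never sees `reward`; the environment only uses it internally.
        self.sums[arm] += reward
        self.counts[arm] += 1

    def ranking(self):
        # Empirical mean; arms never pulled get 0, as in the paper's convention.
        def mean(i):
            return self.sums[i] / self.counts[i] if self.counts[i] > 0 else 0.0
        # Sort by decreasing mean, breaking ties in favor of the lower index.
        return sorted(range(1, self.n_arms + 1), key=lambda i: (-mean(i), i))

# Reproducing the worked example: rewards r_1(1)=1, r_2(2)=5, r_3(1)=5.
env = RankingBandit(n_arms=2)
for arm, reward in [(1, 1.0), (2, 5.0), (1, 5.0)]:
    env.update(arm, reward)
    print(env.ranking())   # -> [1, 2], then [2, 1], then [2, 1]
```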
As is customary in stochastic bandits, the learner's goal is to design a policy $\pi$ that minimizes the cumulative expected regret, which is formally defined as
$$R_T(\pi) := \mathbb{E}\left[\sum_{t=1}^{T} \big(r_t(i^*) - r_t(i_t)\big)\right],$$
where the expectation is taken over the randomness of both the policy and the environment, and we define $i^* \in \arg\max_{i \in \mathcal{A}} \mu_i$ with $\mu_i := \mathbb{E}[\nu_i]$ for each $i \in [n]$. In contrast, in the adversarial setting, the regret is defined as $R_T(\pi) = \sum_{t=1}^{T} \big(r_t(i^*) - r_t(i_t)\big)$, where $i^* \in \arg\max_{i \in \mathcal{A}} \sum_{t=1}^{T} r_t(i)$. Furthermore, from here on, we omit the dependence on the policy $\pi$, referring to $R_T(\pi)$ as $R_T$ whenever it is clear from the context. In the stochastic setting, we introduce the following additional notation. Given an arm $i \in [n]$, we let $\Delta_i := \mu_{i^*} - \mu_i$ denote the suboptimality gap of that arm. Furthermore, when $n = 2$, we simply refer to the suboptimality gap as $\Delta$. As we will further discuss, the impossibility of observing the reward realizations raises several technical challenges when designing no-regret algorithms, as the approaches adopted for standard (non-ranking) bandits may be difficult to apply within our framework. In the following sections, we discuss how the lack of this information degrades the performance of the algorithms when the feedback is provided as a ranking.

3 Analysis in the stochastic setting

As a preliminary observation, we note that optimistic approaches, such as the UCB1 algorithm, are challenging to apply within our framework. This is because the learner lacks the information needed to estimate the expected rewards of the different arms, making it difficult to build confidence bounds. Therefore, the most natural algorithm one can employ in bandits with ranking feedback is the explore-then-commit (EC) algorithm (see, e.g., Auer et al. [2002]), in which the learner explores the arms uniformly during the initial rounds and then commits to the one with the highest empirical mean. However, as we will also discuss in the following, such an algorithm achieves suboptimal regret guarantees. Thus, in the rest of this section, we present two algorithms specifically designed to achieve optimal instance-dependent and instance-independent regret bounds, respectively.

3.1 Instance-dependent lower bound

In the classical multi-armed bandit problem, it is well known that it is possible to achieve an instance-dependent regret bound that is logarithmic in the time horizon $T$. However, in this section, we show that such a result does not hold when the feedback is provided as a ranking. Our impossibility result leverages a connection between random walks and the cumulative rewards of the arms. Formally, we define an (asymmetric) random walk as follows.

Definition 1. A random walk is a stochastic process $\{G_t\}_{t \in \mathbb{N}}$ such that
$$G_t = \begin{cases} 0 & t = 0, \\ G_{t-1} + \epsilon_t & t \ge 1, \end{cases}$$
where $\{\epsilon_t\}_{t \in \mathbb{N}}$ is an i.i.d. sequence of integrable random variables, and $\mathbb{E}[\epsilon_t] = p$ is the drift of the random walk.

Before introducing our negative result, we state a lemma that characterizes the average behavior of two random walks with different drifts.

Lemma 1 (Separation lemma). Let $\{G_t\}_{t \in \mathbb{N}}$, $\{G'_t\}_{t \in \mathbb{N}}$ be two independent random walks defined as $G_{t+1} = G_t + \epsilon_t$ and $G'_{t+1} = G'_t + \epsilon'_t$, where $G_0 = G'_0 = 0$ and the drifts satisfy $\mathbb{E}[\epsilon_t] = p > q = \mathbb{E}[\epsilon'_t]$ for each $t \in \mathbb{N}$. Then, we have
$$\mathbb{P}\left(\forall t, t' \in \mathbb{N}\colon\ G_t/t \ge G'_{t'}/t'\right) \ge c(p, q) > 0.$$
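As a quick sanity check (our own illustration, not part of the paper), the following Python snippet estimates the probability in Lemma 1 for two Bernoulli random walks over a finite horizon; the estimate is non-increasing in the horizon and stays strictly positive, in line with the lemma.

```python
import random

def separation_estimate(p=0.6, q=0.4, horizon=2000, trials=2000, seed=0):
    """Estimate P(G_t / t >= G'_t' / t' for all t, t' <= horizon) by simulation.

    It suffices to compare the minimum running mean of the p-walk with the
    maximum running mean of the q-walk (Bernoulli increments in {0, 1}).
    """
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        g = gp = 0.0
        min_mean_p = float("inf")
        max_mean_q = float("-inf")
        for t in range(1, horizon + 1):
            g += 1.0 if rng.random() < p else 0.0
            gp += 1.0 if rng.random() < q else 0.0
            min_mean_p = min(min_mean_p, g / t)
            max_mean_q = max(max_mean_q, gp / t)
        if min_mean_p >= max_mean_q:
            hits += 1
    return hits / trials

print(separation_estimate())  # strictly positive fraction of "always separated" runs
```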
We remark that the exact value of the constant $c(p, q)$ in Lemma 1 depends only on the two drifts $p$ and $q$ and on the probability distributions of the increments. In the simpler case of Bernoulli distributions, the constant $c(p, q)$ can be derived in closed form, as shown in Lemma 3 in the appendix. The rationale behind Lemma 1 is that, given two random walks with different drifts, a separating line between them exists with strictly positive probability. Thus, with non-negligible probability, the empirical mean of the process with the higher drift always upper bounds the empirical mean of the process with the lower drift. We also observe that the cumulative reward collected by a specific arm during the learning process can be represented as a random walk whose drift corresponds to the expected reward of that arm. In the classical version of the multi-armed bandit problem, the learner fully observes the evolution of this random walk, while in our setting the learner only observes a ranking of the arms, making it impossible to quantify their performance numerically. Therefore, in bandits with ranking feedback, the only way for the learner to assess how close two arms are in terms of expected reward is to observe switches in their positions within the ranking. However, Lemma 1 implies that, even if two arms have very close expected rewards, the learner may never observe a switch in the ranking, as the empirical mean of the arm with the higher expected reward may upper bound that of the other arm throughout the entire learning process. As a result, the following negative result holds.

Theorem 1 (Instance-dependent lower bound). Let $\pi$ be any policy for the bandits with ranking feedback problem. Then, for any $C : [0, +\infty) \to [0, +\infty)$, there exist a gap $\Delta > 0$ and a time horizon $T > 0$ such that $R_T > C(\Delta)\log(T)$.

3.2 Instance-dependent upper bound

We introduce the Dynamical Ranking Exploration-Exploitation algorithm (DREE), whose pseudocode is provided in Algorithm 1. As usual in bandit algorithms, in the first $n$ rounds, each arm is pulled once (Lines 2-3). At every subsequent round $t > n$, the exploitation/exploration trade-off is addressed by playing the best arm according to the received feedback, unless there is at least one arm whose number of pulls at round $t$ is smaller than a superlogarithmic function $f(t) : (0, \infty) \to \mathbb{R}^+$.⁴ More precisely, the algorithm plays an arm $i$ at round $t$ if it has been pulled fewer than $f(t)$ times (Lines 4-5), where ties among multiple arms pulled fewer than $f(t)$ times are broken arbitrarily. Instead, if all arms have been pulled at least $f(t)$ times, the arm in the highest position of the last ranking feedback is pulled (Lines 6-8). Each round terminates with the learner receiving the updated ranking over the arms as feedback (Line 9). Let us observe that the exploration strategy of Algorithm 1 is deterministic, and the only source of randomness concerns the realization of the arms' rewards. We state the following result, providing the regret upper bound of Algorithm 1 as a function of $f$.

⁴ A function $f(t)$ is superlogarithmic when $\liminf_{t \to \infty} f(t)/\log(t) = +\infty$. Coherently, a function is subpolynomial when, for every $\alpha > 0$, $\limsup_{t \to \infty} f(t)/t^{\alpha} = 0$.

Theorem 2 (Instance-dependent upper bound). Assume that the reward distribution of every arm is 1-subgaussian. Let $f : (0, \infty) \to \mathbb{R}^+$ be a superlogarithmic nondecreasing function in $t$.
Then, there is a term $C(f, \Delta_i)$ for each suboptimal arm $i \in [n]$, not depending on $T$, such that Algorithm 1 satisfies
$$R_T \le (1 + f(T)) \sum_{i=1}^{n} \Delta_i + \log(T) \sum_{i=1}^{n} C(f, \Delta_i).$$

To minimize the asymptotic dependence on $T$ of the cumulative regret suffered by the algorithm, we can choose, for example, $f(t) = \log(t)^{1+\delta}$ with the parameter $\delta > 0$ as small as possible. However, if $\Delta_i < 1$, minimizing $\delta$ comes at the cost of increasing the term $C(f, \Delta_i)$, which grows exponentially as $\delta$ goes to zero. The terms $C(f, \Delta_i)$ are formally characterized in the following corollary.

Corollary 1. Let $\delta > 0$ and let $f(t) = \log(t)^{1+\delta}$ be the superlogarithmic function used in Algorithm 1. Then, we have:
$$C(f, \Delta_i) = 2\Delta_i \, \frac{e^{(2/\Delta_i^2)^{1/\delta}} + 1}{1 - e^{-\Delta_i^2/2}}.$$

Algorithm 1 Dynamical Ranking Exploration-Exploitation (DREE)
1: for $t \in [T]$ do
2:   if $t \le n$ then
3:     play arm $i_t = t$
4:   else if there is an arm $i$ with $Z_i(t-1) < f(t)$ then
5:     play arm $i_t = i$ (ties broken arbitrarily)
6:   else
7:     play the arm $i_t$ ranked first in $R_{t-1}$
8:   end if
9:   observe the ranking $R_t$
10: end for

3.3 Instance-dependent/independent trade-off

In this section, we show that there is an inherent trade-off between the instance-dependent and instance-independent regret guarantees achievable by any policy.

Theorem 3 (Instance-dependent/independent trade-off). Let $\pi$ be any policy for the bandits with ranking feedback problem. If $\pi$ satisfies both an instance-dependent regret upper bound of the form $R_T \le \sum_{i=1}^{n} C(\Delta_i) T^{\alpha}$ and an instance-independent regret upper bound of the form $R_T \le n\,\overline{C}\,T^{\beta}$, then $2\alpha + \beta \ge 1$, where $\alpha, \beta \ge 0$.

Proof. (Sketch) Let $p_1 = 0.5$, $p_2 = 0.5 - \Delta$, and $p'_2 = 0.5 + \Delta$, for some $\Delta > 0$ specified in the proof. We consider three instances:
$$\mathcal{P}: \nu_1 = \mathrm{Be}(p_1),\ \nu_2 = \mathrm{Be}(p_2), \qquad \mathcal{P}': \nu_1 = \mathrm{Be}(p_1),\ \nu_2 = \mathrm{Be}(p'_2), \qquad \mathcal{P}'': \nu_1 = \mathrm{Be}(1),\ \nu_2 = \mathrm{Be}(0).$$
Clearly, the first arm is optimal in instances $\mathcal{P}$ and $\mathcal{P}''$, while the second arm is optimal in $\mathcal{P}'$. We then define the event:
$$E_t := \bigcap_{\tau=1}^{t} \{R_\tau = \langle 1, 2 \rangle\}.$$
In the first and in the third instance, the event $E_t$ corresponds to the optimal arm being ranked in the first position for the first $t$ time steps, while the opposite holds for the second instance. We observe the following two key points. (1) Since the event $E_t$ contains the realization of all the rankings up to time $t$, no policy can distinguish between the three instances as long as this event holds. Therefore, we have:
$$\mathbb{E}_{\mathcal{P}}[Z_2(t) \mid E_t] = \mathbb{E}_{\mathcal{P}'}[Z_2(t) \mid E_t] = \mathbb{E}_{\mathcal{P}''}[Z_2(t) \mid E_t].$$
(2) In instance $\mathcal{P}''$, the event $E_t$ happens almost surely for every $t$. Therefore, to ensure that the policy $\pi$ achieves an instance-dependent regret bounded by $C(1)T^{\alpha}$ in this instance, we need $\mathbb{E}[Z_2(T) \mid E_T] \le C T^{\alpha}$ in all three instances. The main question at this point reduces to: are $C T^{\alpha}$ pulls of the last-ranked arm enough to distinguish between the two instances $\mathcal{P}$ and $\mathcal{P}'$? With a change-of-measure argument restricted to the first $C T^{\alpha}$ pulls of the last-ranked arm, we are able to show that, for a sufficiently small value of $\Delta > 0$, distinguishing between $\mathcal{P}$ and $\mathcal{P}'$ is impossible with strictly positive probability. Then, we can prove that, if the previous consideration holds, in the instance $\mathcal{P}'$ we have $R_T \ge \Omega\big(T^{1 - 2\rho\alpha}\big)$ for a constant $\rho > 0$ close to one. This lower bound on the instance-independent regret entails that $\beta \ge 1 - 2\alpha$ or, equivalently, $\beta + 2\alpha \ge 1$.

From Theorem 3, we can easily infer the following impossibility result.

Corollary 2. There is no policy $\pi$ for the bandits with ranking feedback problem achieving both subpolynomial regret in the instance-dependent case, i.e., for every $\alpha > 0$ there exists a function $C(\cdot)$ such that $R_T \le \sum_{i=1}^{n} C(\Delta_i) T^{\alpha}$, and sublinear regret in the instance-independent case.

To ease the interpretation of Corollary 2, in the following result we discuss the instance-independent regret bound achieved by Algorithm 1.

Corollary 3. For every choice of $\delta > 0$ in $f(t) = \log(t)^{1+\delta}$, there is no value of $\eta > 0$ for which Algorithm 1 achieves an instance-independent regret bound of the form $R_T \le O(T^{1 - \eta})$.

The above result shows that Algorithm 1 suffers from linear instance-independent regret in $T$, up to logarithmic terms.
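For concreteness, here is a minimal Python sketch of the DREE loop of Algorithm 1; it is our own illustration, where the `pull` and `get_ranking` callbacks stand in for the environment (they are not part of the paper), and we instantiate $f(t) = \log(t)^{1+\delta}$ as in Corollary 1.

```python
import math

def dree(n_arms, horizon, pull, get_ranking, delta=0.5):
    """Sketch of DREE (Algorithm 1).

    pull(arm)      -> plays the arm; the reward itself is never revealed.
    get_ranking()  -> current ranking of arms by empirical mean (best first).
    """
    def f(t):
        return math.log(t) ** (1 + delta)    # superlogarithmic exploration schedule
    pulls = [0] * n_arms
    ranking = list(range(n_arms))
    for t in range(1, horizon + 1):
        if t <= n_arms:
            arm = t - 1                      # initial round-robin over all arms
        else:
            under = [i for i in range(n_arms) if pulls[i] < f(t)]
            # play an under-explored arm if any, otherwise exploit the top-ranked arm
            arm = under[0] if under else ranking[0]
        pull(arm)
        pulls[arm] += 1
        ranking = get_ranking()              # ranking feedback for the next round
    return ranking
```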
Moreover, the following corollary of Theorem 3 answers the question raised in the previous section. Indeed, we can prove that the unpleasant dependence on the suboptimality gaps $\Delta_i$ is not a peculiarity of Algorithm 1; rather, it cannot be avoided as long as the instance-dependent regret has a good order in $T$.

Corollary 4. Let $\pi$ be any policy for the bandits with ranking feedback problem that satisfies an instance-dependent regret upper bound of the form $R_T \le \sum_{i=1}^{n} C(\Delta_i) f(T)$, where $f(\cdot)$ is a subpolynomial function. Then, $C(\cdot)$ is super-polynomial in $1/\Delta$.

Proof. We prove the opposite implication, namely that if $C(\cdot)$ is polynomial in $1/\Delta$, then the instance-dependent regret bound cannot be subpolynomial in $T$. By assumption, in the case of two arms with a single gap $\Delta$, we have:
$$R_T \le \sum_{\ell=1}^{L} C_\ell \Delta^{-\ell} f(T),$$
which implies that the instance-independent regret can be bounded as follows:
$$R_T \le \sup_{\Delta > 0} \min\left\{ \sum_{\ell=1}^{L} C_\ell \Delta^{-\ell} f(T),\ \Delta T \right\} \le \sum_{\ell=1}^{L} \sup_{\Delta > 0} \min\left\{ C_\ell \Delta^{-\ell} f(T),\ \Delta T \right\}.$$
Notice that, for $\Delta \ge T^{-1/(\ell+1)}$, the first term is at most $C_\ell T^{\frac{\ell}{\ell+1}} f(T)$, while, for $\Delta \le T^{-1/(\ell+1)}$, the second one is at most $T^{\frac{\ell}{\ell+1}}$. Therefore, the full instance-independent regret is bounded by
$$\sum_{\ell=1}^{L} C_\ell T^{\frac{\ell}{\ell+1}} f(T),$$
which is polynomial in $T$ with every exponent strictly smaller than one. If, by contradiction, $f(T)$ were subpolynomial, this bound would be sublinear in $T$, which contradicts the result of Corollary 2.

3.4 Instance-independent upper bound

The impossibility result stated in Corollary 2 calls for an algorithm guaranteeing sublinear regret in the instance-independent case. Initially, we observe that the standard Explore-then-Commit (EC) algorithm [Auer et al., 2002] can be applied within our framework, achieving an $O(T^{2/3})$ instance-independent regret bound. In the following, we provide a brief overview of how the EC algorithm works. It divides the time horizon into two phases: (i) exploration phase: the arms are pulled uniformly for the first $mn$ rounds, where $m$ is a parameter of the algorithm that can be tuned to minimize the regret; (ii) commitment phase: the arm maximizing the estimated reward is pulled. In the case of bandits with ranking feedback, the EC algorithm explores the arms in the first $mn$ rounds and subsequently pulls the arm in the first position of the ranking feedback received at round $t = mn$. As is customary in standard (non-ranking) bandits, the best regret bound is achieved by setting $m = \lceil T^{2/3} \rceil$, thus obtaining an $O(T^{2/3})$ regret upper bound.

We show that we can obtain a regret bound better than that of the EC algorithm. In particular, we provide the Ranking Logarithmic Phased Elimination (R-LPE) algorithm, which breaks the $O(T^{2/3})$ barrier, guaranteeing an $\widetilde{O}(\sqrt{T})$ regret bound, i.e., an $O(\sqrt{T})$ bound when neglecting logarithmic terms. Due to the mathematical instruments involved, the proof of this regret bound only holds for the case of Gaussian rewards, as in the similar setting studied by Garivier et al. [2016]. The pseudocode of R-LPE is reported in Algorithm 2.

Algorithm 2 Ranking Logarithmic Phased Elimination (R-LPE)
1: initialize $S = [n]$
2: initialize $L = LG(1/2, 1, T)$
3: for $t \in [T]$ do
4:   play $i_t \in \arg\min_{i \in S} Z_i(t)$
5:   update $Z_{i_t}(t)$, the number of times $i_t$ has been pulled
6:   observe ranking $R_t$
7:   if $\min_{i \in S} Z_i(t) \in L$ then
8:     $\alpha = \log(\min_{i \in S} Z_i(t)) / \log(T) - 1/2$
9:     $S = F_t(T^{2\alpha})$
10:  end if
11: end for

In order to properly analyze the algorithm, we need to introduce the following two definitions. First, we introduce the definition of the loggrid set.

Definition 2 (Loggrid). Given $a, b \in \mathbb{R}$ such that $a < b$ and a constant $T > 0$, we define:
$$LG(a, b, T) := \left\{ T^{\lambda_j b + (1 - \lambda_j) a} : \lambda_j = \frac{j}{\lceil \log(T) \rceil},\ j = 0, \ldots, \lceil \log(T) \rceil \right\}.$$
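As an illustration of Definition 2 (our own sketch, not code from the paper), the snippet below builds the loggrid $LG(1/2, 1, T)$ used by R-LPE; its elements are geometrically spaced pull counts between $\sqrt{T}$ and $T$, with a roughly constant ratio between consecutive points (cf. Proposition 9 in the appendix).

```python
import math

def loggrid(a, b, T):
    """Loggrid LG(a, b, T): exponents interpolate between a and b on a log scale."""
    m = math.ceil(math.log(T))
    return [T ** ((j / m) * b + (1 - j / m) * a) for j in range(m + 1)]

grid = loggrid(0.5, 1.0, T=10_000)
print(grid[0], grid[-1])   # ~100.0 (= sqrt(T)) and 10000.0 (= T)
print(grid[1] / grid[0])   # constant ratio T**(1 / (2 * ceil(log T))) between points
```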
Next, we give the notion of active set, which the algorithm employs to eliminate suboptimal arms.

Definition 3 (Filtering condition). Let $S$ be the active set of the algorithm at a given timestep. We say that a timestep $t \in [T]$ is fair if all active arms have been pulled the same number of times. In any fair timestep $t \in [T]$, we define the active set $F_t(\zeta)$ as the set of arms $j \in S$ such that:
$$\sum_{\tau = 1,\ \tau \text{ fair}}^{t} \mathbb{1}\{R_\tau(i) > R_\tau(j)\} \ge \zeta \qquad \forall i \in S,$$
where, with a slight abuse of notation, $R_\tau(i)$ denotes the position of arm $i$ in the ranking $R_\tau$. This condition will be called the filtering condition.

Initially, we observe that R-LPE differs from Algorithm 1 in that it takes into account the entire history of the process, not just the most recent ranking $R_t$; it also requires knowledge of $T$. We denote by $S$ the set of active arms used by the algorithm. Initially, the set $S$ comprises all the arms available in the problem (Line 1). Furthermore, the set that drives the updates of the decision space $S$, namely $L$, is initialized as the loggrid built on the parameters $1/2$, $1$, $T$ (Line 2). At every round $t \in [T]$, R-LPE chooses the arm in the active set $S$ with the minimum number of pulls, namely $i \in [n]$ such that $Z_i(t)$ is minimized (Line 4); ties are broken by index order. Next, the number of times arm $i_t$ has been pulled, namely $Z_{i_t}(t)$, is updated accordingly (Line 5). The peculiarity of the algorithm is that the set $S$ changes every time the condition $\min_{i \in S} Z_i(t) \in L$ is satisfied (Line 7). When this condition is met, the set of active arms $S$ is filtered to avoid further exploration of suboptimal arms. Precisely, $S$ is filtered according to the time-dependent parameter $\alpha$ (Lines 8-9). We state the following theorem, providing an instance-independent regret bound for Algorithm 2.

Theorem 4. In the stochastic bandits with ranking feedback setting, when the noise is Gaussian, Algorithm 2 achieves $R_T \le 62\, n^4 \log(T)^2\, T^{1/2}$.

Proof. (Sketch) To prove the theorem, we define, for every pair of indices $i, j \in [n]$, the event:
$$E^{\psi}_{ij} := \begin{cases} E_{ij} & \text{if } \mu_j - \mu_i > \psi, \\ \emptyset & \text{otherwise}, \end{cases} \tag{1}$$
where $E_{ij}$ corresponds to the event in which arm $i$ eliminates arm $j$ at some point in the process, while $\psi$ is a constant defined in the following. The probability that at least one of these events holds is bounded (by Lemma 6 and a union bound) as $\mathbb{P}(\Psi) := \mathbb{P}\big(\bigcup_{i \ne j,\ i, j \in [n]} E^{\psi}_{ij}\big) \le O\big(n^2 \log(T)^2 T^{-1/2} \psi^{-1}\big)$, since the number of such events is at most $n(n-1)/2$. Therefore, if the complement of the event $\Psi$ holds, there exists an arm $\bar{i}$ with a gap (w.r.t. the best arm) of at most $(n-1)\psi$ that is not eliminated until the last round. This is because at most $n-1$ eliminations can happen and, under the complement of $\Psi$, such eliminations only concern pairs of arms whose means differ by at most $\psi$. Since the arm $\bar{i} \in [n]$ is not eliminated until the last round under the event $\Psi^{C}$, the probability of the event $E'_{i\bar{i}}$, corresponding to a suboptimal arm $i \in [n]$ surviving for more than $O(\log(T) T^{1/2} \Delta^{-1}_{i\bar{i}})$ pulls, is bounded by $2T^{-1/2}$ thanks to Lemma 7. As a result, employing a union bound over all possible values of $i$, the probability that any event $E'_{i\bar{i}}$ with $i \in [n]$ occurs is at most $2(n-1)T^{-1/2}$. Thus, fixing $\psi = \Delta_i/(2(n-1))$ ensures that $\Delta_{i\bar{i}} \ge \Delta_i/2$, which entails $\mathbb{P}\big(Z_i(T) \ge O(\log(T) T^{1/2} \Delta_i^{-1})\big) \le O\big(n^3 \log(T)^2 T^{-1/2} \Delta_i^{-1}\big)$, and thus $\mathbb{E}[\Delta_i Z_i(T)] = O(n^3 \log(T)^2 T^{1/2})$. Finally, using the regret decomposition lemma [Lattimore and Szepesvari, 2017], we can conclude the proof. Proving the two lemmas, however, is not trivial, since it requires viewing the whole process as a discretization of a biased Brownian motion and then applying results for this kind of stochastic process.
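To make the elimination mechanism concrete, below is a minimal Python sketch of the R-LPE loop under the assumptions stated above; it is our own illustration, where `pull` and `get_ranking` stand in for the environment, and the filtering step uses the rule of Definition 3 with threshold $\zeta = T^{2\alpha}$ as reconstructed there.

```python
import math

def r_lpe(n_arms, horizon, pull, get_ranking):
    """Sketch of R-LPE (Algorithm 2).

    pull(arm)     -> plays the arm (the reward is not revealed).
    get_ranking() -> list of all arms ordered by empirical mean, best first.
    """
    T = horizon
    m = math.ceil(math.log(T))
    # loggrid LG(1/2, 1, T), rounded to integer pull counts (a practical adaptation)
    checkpoints = {round(T ** (0.5 + j / (2 * m))) for j in range(m + 1)}

    active = list(range(n_arms))
    pulls = [0] * n_arms
    # beats[i][j]: fair rounds (equal pull counts among active arms) where i ranked above j
    beats = [[0] * n_arms for _ in range(n_arms)]

    for _ in range(T):
        arm = min(active, key=lambda i: (pulls[i], i))   # least-pulled active arm
        pull(arm)
        pulls[arm] += 1
        ranking = get_ranking()
        pos = {a: p for p, a in enumerate(ranking)}

        counts = {pulls[i] for i in active}
        if len(counts) == 1:                             # fair timestep
            for i in active:
                for j in active:
                    if i != j and pos[i] < pos[j]:
                        beats[i][j] += 1
            z = counts.pop()
            if z in checkpoints:                         # filtering checkpoint
                alpha = math.log(z) / math.log(T) - 0.5
                zeta = T ** (2 * alpha)
                # keep arm j only if it has been ranked above every other active arm
                # in at least zeta fair rounds (filtering condition, Definition 3)
                active = [j for j in active
                          if all(beats[j][i] >= zeta for i in active if i != j)]
    return active
```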
At first glance, the result presented in Theorem 4 may seem unsurprising. Indeed, several elimination algorithms achieve $O(\sqrt{T})$ regret bounds in different bandit settings (see, for example, [Auer and Ortner, 2010, Lattimore et al., 2020, Li and Scarlett, 2022]). Nevertheless, our setting poses several additional challenges compared to existing ones. For instance, in our framework, it is not possible to rely on concentration bounds, as the current feedback is heavily correlated with the past ones. This is precisely the reason for the anomalous growth of the regret in terms of the number of arms $n$: due to the extremely correlated feedback, two union bounds are necessary throughout the proof above, and this increases the dependence on $n$. Since concentration inequalities cannot be used, our analysis employs novel arguments, drawing from recent results in the theory of Brownian motions, which allow us to properly model this particular ranking feedback.

4 Analysis in the adversarial setting

We now focus on bandits with ranking feedback in the adversarial setting. In particular, we show that no algorithm provides sublinear regret without statistical assumptions on the rewards.

Theorem 5. In adversarial bandits with ranking feedback, there exists a constant $\gamma \in (0, 1)$ such that no algorithm achieves $o(T)$ regret with respect to the best arm in hindsight with probability greater than $\gamma$.

Proof. (Sketch) The proof introduces three instances in an adversarial setting such that no algorithm can achieve sublinear regret in all three. The main reason behind this negative result is that ranking feedback obfuscates the value of the rewards, so that the algorithm cannot distinguish two or more instances where the rewards are non-stationary. The three instances employed in the proof are divided into three phases such that the instances are similar in terms of rewards during the first two phases, while they are extremely different in the third phase. In summary, if the learner receives the same ranking when playing in two instances with different best arms in hindsight, it is not possible to achieve a small regret in both of them.

Acknowledgments

This paper is supported by the FAIR (Future Artificial Intelligence Research) project, funded by the Next Generation EU program within the PNRR-PE-AI scheme (M4C2, Investment 1.3, Line on Artificial Intelligence), by the EU Horizon project ELIAS (European Lighthouse of AI for Sustainability, No. 101120237), and by the project SERICS (PE00000014) under the NRRP MUR program funded by the EU - NGEU.

References

P. Auer, N. Cesa-Bianchi, Y. Freund, and R. E. Schapire. Gambling in a rigged casino: The adversarial multi-armed bandit problem. In Proceedings of IEEE 36th Annual Foundations of Computer Science, pages 322–331, 1995. doi: 10.1109/SFCS.1995.492488.

Peter Auer and Ronald Ortner. UCB revisited: Improved regret bounds for the stochastic multi-armed bandit problem. Periodica Mathematica Hungarica, 61(1-2):55–65, 2010.

Peter Auer, Nicolo Cesa-Bianchi, and Paul Fischer. Finite-time analysis of the multiarmed bandit problem. Machine Learning, 47(2):235–256, 2002.

Paolo Baldi. Stochastic calculus. In Stochastic Calculus, pages 215–254. Springer, 2017.

Viktor Bengs, Robert Busa-Fekete, Adil El Mesaoudi-Paul, and Eyke Hüllermeier. Preference-based online learning with dueling bandits: A survey. 2018. doi: 10.48550/ARXIV.1807.11398. URL https://arxiv.org/abs/1807.11398.

Xiaoyu Chen, Han Zhong, Zhuoran Yang, Zhaoran Wang, and Liwei Wang.
Human-in-the-loop: Provably efficient preference-based reinforcement learning with general function approximation. In Kamalika Chaudhuri, Stefanie Jegelka, Le Song, Csaba Szepesvari, Gang Niu, and Sivan Sabato, editors, Proceedings of the 39th International Conference on Machine Learning, volume 162 of Proceedings of Machine Learning Research, pages 3773–3793. PMLR, 17–23 Jul 2022. URL https://proceedings.mlr.press/v162/chen22ag.html.

Paul Christiano, Jan Leike, Tom B. Brown, Miljan Martic, Shane Legg, and Dario Amodei. Deep reinforcement learning from human preferences, 2017. URL https://arxiv.org/abs/1706.03741.

Johannes Fürnkranz and Eyke Hüllermeier, editors. Preference Learning. Springer, 2010. ISBN 978-3-642-14124-9. doi: 10.1007/978-3-642-14125-6. URL https://doi.org/10.1007/978-3-642-14125-6.

Aurélien Garivier, Tor Lattimore, and Emilie Kaufmann. On explore-then-commit strategies. Advances in Neural Information Processing Systems, 29, 2016.

Tor Lattimore and Csaba Szepesvari. Bandit Algorithms. 2017. URL https://tor-lattimore.com/downloads/book/book.pdf.

Tor Lattimore, Csaba Szepesvari, and Gellert Weisz. Learning with good feature representations in bandits and in RL with a generative model. In International Conference on Machine Learning, pages 5662–5670. PMLR, 2020.

Kimin Lee, Laura Smith, Anca Dragan, and Pieter Abbeel. B-Pref: Benchmarking preference-based reinforcement learning, 2021. URL https://arxiv.org/abs/2111.03026.

Tyler Lekang and Andrew Lamperski. Simple algorithms for dueling bandits, 2019. URL https://arxiv.org/abs/1906.07611.

Zihan Li and Jonathan Scarlett. Gaussian process bandit optimization with few batches. In International Conference on Artificial Intelligence and Statistics, pages 92–107. PMLR, 2022.

Ellen R. Novoseller, Yibing Wei, Yanan Sui, Yisong Yue, and Joel W. Burdick. Dueling posterior sampling for preference-based reinforcement learning, 2019. URL https://arxiv.org/abs/1908.01289.

Aadirupa Saha and Pierre Gaillard. Versatile dueling bandits: Best-of-both-world analyses for online learning from preferences, 2022. URL https://arxiv.org/abs/2202.06694.

Lajos Takács. On a generalization of the arc-sine law. The Annals of Applied Probability, 6(3):1035–1040, 1996.

Christian Wirth, Riad Akrour, Gerhard Neumann, and Johannes Fürnkranz. A survey of preference-based reinforcement learning methods. Journal of Machine Learning Research, 18(1):4945–4990, January 2017. ISSN 1532-4435.

Yichong Xu, Ruosong Wang, Lin F. Yang, Aarti Singh, and Artur Dubrawski. Preference-based reinforcement learning with finite-time guarantees, 2020. URL https://arxiv.org/abs/2006.08910.

Yisong Yue, Josef Broder, Robert Kleinberg, and Thorsten Joachims. The k-armed dueling bandits problem. Journal of Computer and System Sciences, 78(5):1538–1556, 2012. ISSN 0022-0000. doi: 10.1016/j.jcss.2011.12.028. URL https://www.sciencedirect.com/science/article/pii/S0022000012000281.

A Proofs of the instance-dependent stochastic analysis

A.1 Proof of the instance-dependent lower bound and lemmas

Lemma 1 (Separation lemma). Let $\{G_t\}_{t \in \mathbb{N}}$, $\{G'_t\}_{t \in \mathbb{N}}$ be two independent random walks defined as $G_{t+1} = G_t + \epsilon_t$ and $G'_{t+1} = G'_t + \epsilon'_t$, where $G_0 = G'_0 = 0$ and the drifts satisfy $\mathbb{E}[\epsilon_t] = p > q = \mathbb{E}[\epsilon'_t]$ for each $t \in \mathbb{N}$. Then, we have
$$\mathbb{P}\left(\forall t, t' \in \mathbb{N}\colon\ G_t/t \ge G'_{t'}/t'\right) \ge c(p, q) > 0.$$

Proof.
Let us consider the random walk defined as follows: e Gt+1 = e Gt + ϵt p + q Since E ϵt p+q 2 > 0, and observing that a random walk with a non null drift is transient, we know that there exists a constant c1(p, q) > 0 such that: P e Gt > 0, t > 0 = c1(p, q) > 0. On the other hand, we observe that the random walk: e G t+1 = e G t + ϵ t p + q satisfies the opposite inequality since E[ϵ t p+q 2 ] < 0. Therefore, for the same reason as above, we have: P e G t < 0, t > 0 = c2(p, q) > 0, for some constant c2(p, q) > 0. Thus, since the two processes are independent, we have: t,t =1 { e Gt+1 > 0, e G t +1 < 0} > 0, which entails: c1(p, q)c2(p, q) = P \ t,t =1 { e Gt > 0, e G t < 0} t,t =1 {Gt tp + q 2 > 0, G t t p + q t,t =1 {Gt/t > p + q 2 , G t /t < p + q t,t =1 {Gt/t > G t /t } . The latter inequality can be easily reformulated as in the statement of the lemma, concluding the proof. In order to prove the lower bound, we will also show the following lemma. Lemma 2. Let {Xi}i [n] be a sequence of i.i.d. Bernoulli random variables. Then, for every event E Fn, where Fn is the filtration generated by X1, . . . Xn, we have: PX1,...Xn Be(p)(E) PX1,...Xn Be(p )(E) max p Proof. Without loss of generality, let us assume that p < p. Then, we have: PX1,...Xn Be(p)(E) = Z {0,1}n 1E(x) i=1 pxi(1 p)1 xidx {0,1}n 1E(x) i=1 (p/p )xip xi(1 p )1 xidx (2) {0,1}n 1E(x) i=1 p xi(1 p )1 xidx n PX1,...Xn Be(p )(E). where Equation (2) follows from the assumption that p < p. The other way round can be proved substituting p and p . This concludes the proof. Theorem 1 (Instance-dependent lower bound). Let π be any policy for the bandits with ranking feedback problem, then, for any C : [0, + ) [0, + ), there exists a > 0 and a time horizon T > 0 such that RT > C( ) log(T). Proof. Let p1 = 0.5, p2 = 0.5 , p 2 = 0.5 + , for some 0 < 1/4 (independent on T) that satisfies: 2C(1) log(1 4 ) 1/2. We consider the following three instances: P : ν1 = Be(p1) ν2 = Be(p2) P : ν1 = Be(p1) ν2 = Be(p 2) P : ν1 = Be(1) ν2 = Be(0) Clearly, the optimal arm is the first in instances P and P , while the optimal arm is the second in instance P . By contradiction, we assume that, for every time horizon T > 0, exists a policy πT ( |Ht) for each t < T satisfying, in all three cases, the following condition: RT (πT ) C( ) log(T). (3) We also define the following event about the rankings received by the learner: τ=1 {Rτ = 1, 2 }. The event Et can be interpreted as "up to time t, the learner has always observed the ranking 1, 2 ". Note that by applying Equation 3 to the particular case of instance P , we have: C(1) log(T) RT (πT |P ) = EP [Z2(T)] = EP [Z2(T)|ET ], where the last step holds since in instance P the event ET holds almost surely. Therefore, we have: EP [Z2(T)|ET ] C(1) log(T). Let h N. Then,we have: PP(Z2(T) < h|ET ) = PP (Z2(T) < h|ET ) 1 C(1) log(T) We note that the above equality holds since, under the event ET , no policy can distinguish between the two instances P and P. On the other hand, the above inequality follows from Markov s theorem. Furthermore, we notice that the event {Z2(T) < h} is contained in the σ algebra generated by the first h pulls of arm 2 (and all the pulls of arm 1, but this is irrelevant since ν1 corresponds to the same distribution in the first two instances). Therefore, for all h > 0, from Lemma 2 we have: PP (Z2(T) < h) 0.5 h PP(Z2(T) < h) h PP(Z2(T) < h, ET ) h PP(Z2(T) < h|ET )PP(ET ) h 1 C(1) log(T) Where κ( ) := inft>0 PP (Et) > 0 thanks to Lemma 1. 
Here, for every 1/4 x > 0, the inequality 0.5 x (1 4x)(0.5 + x) 0 holds. Thus, for all h > 0, we have: 0.5 Therefore, taking h = 2C(1) log(T), we have: PP (Z2(T) < h) 0.5 h 1 C(1) log(T) 1 4 2C(1) log(T ) 2 T 2C(1) log(1 4 ). Thus, we can lower bound on the regret in case of instance P as follows: RT (πT |P ) EP [(T Z2(T))] (T h)PP (Z2(T) < h) = (T 2C(1) log(T))PP (Z2(T) < 2C(1) log(T)) κ( ) (T 2C(1) log(T)) 2 T 2C(1) log(1 4 ) κ( ) (T 2C(1) log(T)) which grows polynomially with time. At this point, Equation (3) would give, for every T > 0, the following: C( ) log(T) κ( ) (T 2C(1) log(T)) 2 T 1/2 Ω κ( ) Clearly, for T sufficiently big, the above statement is false, concluding the proof. A.2 Proof of instance dependent upper bound Theorem 2 (Instance-dependent upper bound). Assume that the reward distribution of every arm is 1-subgaussian. Let f : (0, ) R+ be a superlogarithmic nondecreasing function in t. Then there is a term C(f, i) for each sub-optimal arm i [n] which does not depend on T, such that Algorithm 1 satisfies RT (1 + f(T)) Pn i=1 i + log(T) Pn i=1 C(f, i). Proof. Let i [n] be the optimal arm. For any suboptimal arm i [n], we have: τ=1 P(iτ = i) τ=n+1 P(iτ = i, Zi(τ 1) < f(τ)) + τ=n+1 P(iτ = i, Zi(τ 1) f(τ)), (4) where we let iτ [n] be the arm pulled at time τ [T]. We split the proof in two parts, providing a bound for each term defining Equation (4). Claim 1: The first term of Equation (4) is bounded by f(T). Indeed, notice that, being f( ) nondecreasing, we have: τ=1 1{iτ = i, Zi(τ 1) < f(τ)} ρ=1 1{iτ = i, Zi(τ 1) = ρ} ρ=1 1{iτ = i, Zi(τ 1) = ρ} τ=1 1{iτ = i, Zi(τ 1) = ρ} f(T). Claim 2: The second term in Equation (4) is bounded by C( i) log(T) for some C( i). We know, by design of the algorithm, that the arm i [n] can be pulled only if: 1. It has the highest empirical mean. 2. Every other arm has been pulled at least f(T) times, including arm i [n]. We define the event: Ei,t := {Zi(t) f(t)}. Then, we have: P(iτ = i, Zi(τ 1) f(τ)) P(ˆrτ(i) > ˆrτ(i ), Ei,t, Ei ,t) which can be true only if at least one of the following holds: 1. ˆrτ(i) > µi + i/2, which, when intersected with Ei,τ, holds thanks Hoeffding s inequality with probability at most: P(ˆrτ(i) > µi + i/2, Ei,τ) y f(τ) P(ˆrτ(i) > µi + i/2, Zi(τ) = y) y f(τ) e y 2 i 2 = e f(τ) 2 i 2 2. ˆrτ(i ) < µi i/2, which, when intersected to Ei ,τ, is also true with the same probability as above, thanks to Hoeffding s inequality. Therefore, we have proved that: τ=1 P(iτ = i, Zi(τ 1) f(τ)) 2(1 e 2 i /2) 1 T X τ=1 e f(τ) 2 i 2 2(1 e 2 i /2) 1 PT τ=1 e f(τ) 2 i 2 log(T) | {z } ( ) We show that there exists a constant C0( i) < + such that ( ) C0( i) in the equation above. We start observing that, since f( ) is superlogathmic, we have: lim sup t te f(t) 2 i 2 = lim sup t te log(t) f(t) 2 i 2 log(t) = lim sup t t 1 f(t) 2 i 2 log(t) = 0. Thus, we can find t0 such that e f(t) 2 i 2 1 t for all t t0, so that: Pt τ=1 e f(τ) 2 i 2 log(t) lim sup t Pt τ=1 e f(τ) 2 i 2 Pt τ=1 1 τ lim sup t Pt0 τ=1 e f(τ) 2 i 2 Pt τ=1 1 τ | {z } 0 Pt τ=t0 e f(τ) 2 i 2 Pt τ=t0 1 τ | {z } 1 Thus, since a sequence with finite limit superior is always bounded, we have: C0( i) := 2(1 e 2 i /2) 1 sup t>1 Pt τ=1 e f(τ) 2 i 2 log(t) < + , proving that for every suboptimal arm i [n] the following holds: E[Zi(T)] 1 + f(T) + C0( i) log(T). Finally, defining C( i) := i C0( i), we have: i=1 i E[Zi(T)], concluding the proof. Corollary 1. 
Let δ > 0 and f(t) = log(t)1+δ be the sperlogarithmic function used in Algorithm 1, then we have: C(f, i) = 2 i (2/ 2 i) 1/δ + 1 1 e 2 i /2 . Proof. Let t0 > 1 be smallest integer such that: e log(t)1+δ 2 i 2 1 for all t t0 > 1. By rearranging the latter inequality we have that: (2/ 2 i) 1/δ . From the proof of Theorem 2, we have C( i, f) := i C0( i), where: C0( i) = 2(1 e 2 i /2) 1 sup t>1 Pt τ=1 e f(τ) 2 i 2 log(t) . Furthermore, we let t > 1 be the integer satisfying: Pt τ=1 e f(τ) 2 i 2 log(t) Pt τ=1 e f(τ) 2 i 2 log(t ) , for all the integers t > 1. Notice that such a value always exists, as the supremum limit of the above quantity is finite. Therefore, we have: C0( i) log(t ) = 2(1 e 2 i /2) 1 t X τ=1 e f(τ) 2 i 2 = 2(1 e 2 i /2) 1 Pt τ=1 e f(τ) 2 i 2 log(t ) log(t ) 2(1 e 2 i /2) 1 Pt0 τ=1 e f(τ) 2 i 2 log(t ) + Pt τ=t0 e f(τ) 2 i 2 Pt τ=t0 1 τ Where the last step is due to the fact that for t0 > 1 we have Pt τ=t0 1 τ log(t ). Furthermore, thanks to the definition of t0, we have: Pt τ=t0 e f(τ) 2 i 2 Pt τ=t0 1 τ 1. With this consideration, we are able to conclude: C0( i) 2(1 e 2 i /2) 1 Pt0 τ=1 e f(τ) 2 i 2 log(t ) + Pt τ=t0 e f(τ) 2 i 2 Pt τ=t0 1 τ 2(1 e 2 i /2) 1 t0 log(t ) + 1 2(1 e 2 i /2) 1 + 2(1 e 2 i /2) 1e (2/ 2 i) 1/δ . Where in the last step we have used the fact that log(t) > 1 for t > 2. Recollecting all the terms we have: C( i, log(t)1+δ) = i C0( i, log(t)1+δ) 2 i(1 e 2 i /2) 1(e (2/ 2 i) 1/δ + 1), concluding the proof. B Proofs in the instance independent stochastic analysis B.1 Instance dependent/independent trade-off Lemma 3. Let us define a random walk Gt+1 = Gt + ϵt ϵt = 1 with prob. p 1 with prob. 1 p . with G0 = 1 and p = 1/2 + /2 > 0.5 for some (0, 1). Then, we have: Proof. We define: fn = P(G0 = n, t : Gt = 0) which satisfies, for n 0, the following recursive equation: fn = pfn 1 + (1 p)fn+1 with fn = 1 for n 0. The equation corresponding to the aforementioned dynamical system is: (1 p)λ2 λ + p = 0. The two solutions of the above equation are: 1 4p(1 p) 2(1 p) Thus, for all n > 0, we obtain: 1 4p(1 p) 2(1 p) 1 4p(1 p) 2(1 p) where A = 0 (otherwise, the equation does not define a probability) and B = 1 (since f1 1 for 0). Therefore, from the definition of p, we get: 1 4p(1 p) 2(1 p) 1 4(1/2 /2)(1/2 + /2) The lemma holds by setting n = 1. Theorem 3 (Instance Dependent/Independent Trade-off). Let π be any policy for the bandits with ranking feedback problem. If π satisfies the following properties: (instance-dependent regret upper bound) RT Pn i=1 C( i)T α (instance-independent regret upper bound) RT n CT β then, 2α + β 1, where α, β 0. Proof. Let p1 = 0.5, p2 = 0.5 , p 2 = 0.5 + . Let us consider three problems: P : ν1 = Be(p1) ν2 = Be(p2) P : ν1 = Be(p1) ν2 = Be(p 2) P : ν1 = Be(1) ν2 = Be(0) Clearly, the optimal arm is the first in instances P and P , while the optimal arm is the second in instance P . Let us now define the event: τ=1 {Rt = 1, 2 }. By assumption, the policy π has a sub-T α instance-dependent regret. Therefore, in instance P , we have for all η > 0 the following: EP [Z2(T)|ET ] T α+η = lim sup T T α+η = lim sup T RT T α+η = 0, since in the instance P , the event ET holds almost surely. Under this event, no policy can distinguish between the instances. Therefore, lim sup t EP[Z2(t)|Et] T α+η = lim sup T EP [Z2(t)|Et] = C > 0 t > 0 : EP[Z2(t)|Et] Ctα+η. Where EP[ ] is the expectation over the random variables of the rewards in instance P. 
Therefore, by Markov s inequality, we have: PP(Z2(T) > 2CT α+η|ET ) EP[Z2(T)|ET ] 2CT α+η CT α+η Now, note that for every h > 0, the event {Z2(T) < h} is contained in the σ algebra generated by the first h pulls of arm 2 (and all the pulls of arm 1, but this is irrelevant since arm 1 corresponds to the same distribution in the first two instances). Therefore, thanks to Lemma 2, we have: h > 0 PP (Z2(T) h) 0.5 h PP(Z2(T) h) h PP(Z2(T) h, ET ) h PP(Z2(T) h|ET )PP(ET ). By the previous step, we have PP(Z2(T) 2CT α+η|ET ) 1/2, so that: PP (Z2(T) 2CT α+η) 1 2CT α+η PP (ET ), while, thanks to Lemma 3, we have: PP(ET ) 1 1 meaning that: PP (Z2(T) 2CT α+η) 1 2CT α+η 2 1 + . We use this result to provide a lower bound for the regret in the instance-independent case. Analyzing the instance independent-regret, by definition, we have to fix T as time horizon and let the arm gap my depend on T. Let us now fix ρ > 1. With the choice = T ρα( 1/4 for sufficiently big T), we have: PP (Z2(T) 2CT α+η) 1 2CT α+η 2T ρα 1 4T ρα 2CT α+η 2T ρα 1 4T ρα 2CT α+η T ρα At this point if we choose η = α(ρ 1)/2, the following fact holds: 1 4T ρα 2CT α+η = lim y 0+ 1 4y Cy (α+η) 1 4y Cy α(1/2+ρ/2) 1 4y Cy (1/2+ρ/2) where in the first equality we substituted y = T ρα. Here, y 2ρ , where the second exponent is strictly positive. 1 4T ρα 2CT α+η = lim y 0+ 1 4y 1/y Cy ρ 1 = lim y 0+(1/e4)Cy ρ 1 This limit shows that there is cρ > 0 such that: 1 4T ρα 2CT α+η cρ , when T > 0 is sufficiently large. Substituting this property in Equation (5), we get: PP (Z2(T) 2CT α(1/2+ρ/2)) cρ holding for every ρ > 0 and sufficiently large T > 0. Thus, with = T ρα, we have: EP [RT ] (T 2CT α(1/2+ρ/2))PP (Z2(T) 2CT α(1/2+ρ/2)) 2 T 2ρα = cρ for all T > 0sufficiently big. Therefore, for β 1 2ρα it is not possible to have an instance-independent upper regret bound. Since this is valid for every ρ > 1, we can also extend the result to any β < 1 2α, which leads to the conclusion that the necessary condition to satisfy both: (instance-dependent regret bound) i=1 C( i)T α T > 0 (instance-independent regret bound) RT n CT β T > 0 for the same policy π is: 2α + β 1. B.2 Proofs of instance independent regret upper bound To derive the final instance-independent regret bound, we introduce some results from the theory of stochastic processes. The following subsections are therefore devoted to developing all the necessary results to prove the final regret bound of the algorithm. B.2.1 Discretizing the Brownian motion In this section, we prove some results about the relationship between random walks and Brownian motions, that will be crucial in the proof of the regret bound. For this scope, we will introduce this quantity: |Bt + tµ0 < η| = Z 1 0 1( ,η)(τ)dτ, corresponding to the Lebesgue measure of the set {t [0, 1] : Bt + tµ0 < η}. We start with a lemma that bounds the increments in a standard Brownian motion. Lemma 4. Let {Bt}t [0,1] be a standard Brownian motion. We define: Ii := [i/n, (i + 1)/n], for all i {0, . . . , n 1}. Then, for every η 0, we have: sup i {0,...,n 1} (sup t Ii Bt Bi/n) η = P inf i {0,...,n 1}(inf t Ii Bt Bi/n) η 2 n exp η2n Proof. We notice that the Brownian motion satisfies: sup i {0,...,n 1} (sup t Ii Bt Bi/n) η i=0 sup t Ii Bt Bi/n > η i=0 P sup t Ii Bt Bi/n > η i=0 2P(B(i+1)/n Bi/n > η) i=0 2P(N(0, σ2/n) > η) 2nπ 2 n exp η2n where the second equality holds from the reflection principle (see [Baldi, 2017]), and last inequality holds since it is well known that: P(N(0, β2) > y) exp( y2/2β2) for tail bound on Gaussian distributions. 
In the exact same way, we can prove that: P inf i {0,...,n 1}(inf t Ii Bt Bi/n) η 2 n exp η2n Together, the two results imply the thesis. We are now ready to prove a theorem that links Brownian motion and random walks in terms of the probability that each of them stays in the interval [0, ). Lemma 5 (Discretization lemma). Let {Gi}i {0,...n 1} be a Gaussian 0-mean unit variance random walk, µ R, and {Bt}t [0,1] a standard Brownian motion. Then, for every s (0, 1), we have: P (|Bt + tµ0 > η| > s) P(n, η) P i=0 1(0, ) (Gi + iµ) > sn P (|Bt + tµ0 > η| > s) + P(n, η), P (|Bt + tµ0 η| s) P(n, η) P i=0 1( ,0] (Gi + iµ) sn P (|Bt + tµ0 η| s) + P(n, η), with P(n, η) = 2 n exp( η2n/2) 2π and µ0 = nµ. Proof. We only prove the first part, as the second one follows trivially by substituting: s 1 s, µ µ, Gi Gi, Bt Bt. Let {Bt}t [0,1] be a standard Brownian motion. We define: Ii := [i/n, (i + 1)/n], for all i {0, . . . , n 1}. Furthermore, we set µ0 = nµ. With this definition, we have the following set of inclusions for any s [0, 1] and η > 0: {|Bt + tµ0 > η| > s} = Z 1 0 1(η, )(Bτ + τµ0) dτ > s Ii 1(η, )(Bτ + τµ0) dτ > s i=0 supτ Ii1(η, )(Bτ + τµ0) > sn sup i {0,...,n 1} sup t Ii Bt Bi/n Moreover, using the same steps above, it also holds: {|Bt + tµ0 > η| > s} = Z 1 0 1( η, )(Bτ + τµ0) dτ > s inf i {0,...,n 1}(inf t Ii Bt Bi/n) η . Now, note that the random variable Bi/n for each i [n] has the same distribution of Gi/ n, thus: n Bi/n + i nµ0 i=0 1(0, ) (Gi + iµ) > sn Therefore, by union bound, we have: P (|Bt + tµ0 > η| > s) P i=0 1(0, ) (Gi + iµ) > sn sup i {0,...,n 1} sup t Ii Bt Bi/n P (|Bt + tµ0 > η| > s) P i=0 1(0, ) (Gi + iµ) > sn P inf i {0,...,n 1}(inf t Ii Bt Bi/n) η . The proof is completed applying Lemma 4 and reordering the terms. Corollary 5. Let {Gi}i {0,...n 1} be a Gaussian 0-mean unit variance random walk, and µ R. Then, for every s (0, 1), we have: i=0 1( ,0] (Gi + iµ) sn P Bt + tµ0 2 log(n) n 2 πn log(n), P Bt + tµ0 2 log(n) n 2 πn log(n) where µ0 = nµ. Proof. It is sufficient to make the substitution: η = 2 log(n) n , in Lemma 5. Then, we have: P(n, η) = 2 n exp η2n/2 = 2 n exp log(n)2 = 2n exp log(n)2 2 πn log(n), concluding the proof. B.2.2 Proofs of filtering inequalities All the proof of this subsections will be based on the following very powerful result, which studies the time spent by a Brownian Motion with drift in the half-line [0, ). Theorem 6 (Takács [1996]). Let Bt be a standard Brownian motion on t [0, 1], and let us note as | | the Lebesgue measure of a set. For µ0 R and η > 0, we have P (|Bt + tµ0 η| s) = 2 Z s φ(µ0 1 τ) 1 τ + µ0Φ(µ0 φ(η/ τ µ0 τ) τ µ0e2µ0ηΦ( η/ τ µ0 τ) dτ, 2π e x2/2 Φ(x) := Z x Thanks to the previous theorem, we can prove the following crucial results. Theorem 7. Let T be a sufficiently large constant. Let {Gi}i {0,...n 1} be a Gaussian 0-mean unit variance random walk, and µ R. If µ CT α, for some α (0, 1/2) and C = 4 log(T), then setting n = T 1/2+α we have i=0 1( ,0] (Gi + iµ) T 2α ! Proof. In the rest of the proof, we will assume, for ease of notation, that T is such that T 1/2+α an integer, so that n = T 1/2+α. This is done without loss of generality, since substituting n with n + 1 leads to a negligble difference for T sufficiently big. Applying the discretization corollary 5, we have that for every s (0, 1) i=0 1( ,0] (Gi + iµ) sn P Bt + tµ0 2 log(n) n 2 πn log(n), (6) where µ0 = nµ. Therefore, by assumption, µ0 = nµ (T 1/2+α)1/2CT α = CT 1/4 α/2. 
At this point, we can apply Theorem 6 to have, for any η > 0, P (|Bt + tµ0 η| s) = 2 Z s ϕ(µ0 1 τ) 1 τ + µ0Φ(µ0 1 τ µ0e2µ0ηΦ η µ0τ τ which means that P (|Bt + tµ0 η| s) = 1 2 Z 1 ϕ(µ0 1 τ) 1 τ | {z } (1) 1 τ) | {z } (2) 1 τ | {z } (3) µ0e2µ0ηΦ η µ0τ τ Here, we have to consider that η = 2 log(n) n 2 log(T)T α/2 1/4 µ0 CT 1/4 α/2. Moreover, to have the thesis, we are interested in a value of s such that sn = T 2α, corresponding to T 1/2+α. Therefore, in the interval [T 1/2+α, 1], we have 1. Consider term (3): 1 τ ϕ η µ0T 1/2+α 1 T 1/4+α/2 = ϕ ηT 1/4 α/2 µ0T 1/4+α/2 1 T 1/4+α/2 . Here, since η = 2 log(n) n 2 log(T)T α/2 1/4, the part ηT 1/4 α/2 is bounded by 2 log(T). Instead, µ0T 1/4+α/2 CT 1/4 α/2T 1/4+α/2 = C. 2. Term (4) is non-negative. Therefore, for C = 4 log(T), we have that in the interval [T 1/2+α, 1] (3) + (4) ϕ (2 log(T)) 1 T 1/4+α/2 = 1 2πT 1/4+α/2 e 2 log(T )2 T 1 With this inequality, we have P |Bt + tµ0 η| T 1/2+α = 1 2 T 1 ϕ(µ0 1 τ) 1 τ + µ0Φ(µ0 T 1/2+α 1 p 2π(1 τ) + |µ0|dτ 2π(1 τ) + |µ0|dτ At this point, knownig from the assumptions that n < T, we have µ0 T, which implies P |Bt + tµ0 η| T 1/2+α 1 T 1/2 Substituting this result into Equation 6, we get, for s = T 1/2+α and n T 1/2+α i=0 1( ,0] (Gi + iµ) < T 2α ! 2 πT 1/2+α log(T 1/2+α) The second result is the following Theorem 8. Let T be a sufficiently large constant. Let {Gi}i {0,...n 1} be a Gaussian 0-mean unit variance random walk, and µ R such that µ CT θ, for some θ (0, 1/2) and C = 2 p log(T) + 2. Then, for any α (0, 1/2), setting n = T 1/2+α we have i=0 1( ,0] (Gi + iµ) T 2α ! Proof. In the rest of the proof, we will assume, for ease of notation, that T is such that T 1/2+α an integer, so that n = T 1/2+α. This is done without loss of generality, since substituting n with n + 1 leads to a negligble difference for T sufficiently large. Applying the discretization corollary 5, we have that for every s (0, 1) i=0 1( ,0] (Gi + iµ) sn P Bt + tµ0 2 log(n) n 2 πn log(n), (7) where µ0 = nµ. Therefore, by assumption, µ0 = nµ (T 1/2+α)1/2CT θ = CT 1/4+α/2 θ. Differently from the previous proof, here we cannot directly apply Theorem 6, since η = 2 log(n) n < 0. Still, we can say that P Bt + tµ0 2 log(n) n s = P Bt tµ0 > 2 log(n) n = P Bt tµ0 2 log(n) n At this point, we set η = 2 log(n) n , µ0 = µ0 and Bt = Bt (it is not necessary to rename it since its distribution is symmentric). In this way we can apply Theorem 6 having that the previous probability corresponds to P (|Bt + t µ0 η| > 1 s) = 2 Z 1 ϕ( µ0 1 τ) 1 τ | {z } (1) 1 τ) | {z } (2) 1 τ | {z } (3) µ0e2µ0ηΦ η µ0τ τ Here, we have to consider that η = 2 log(n) n 2 log(T)T α/2 1/4 µ0 CT 1/4+α/2 θ. Moreover, to have the thesis, we are interested in a value of s such that sn = T 2α, corresponding to T 1/2+α. Here, it is convenient to divide the proof in two cases, depending on the sign of 1/4 + α/2 θ. 1. Assume (1/4 + α/2 θ > 0). Then, considering term (3) we have that for τ [1/2, 1] (3) ϕ η µ0τ τ Moreover, since term (4) is nonnegative we also have 2 = 1 π e ( Being 1/4 + α/2 θ > 0 and η < 1, the exponent is less than ( 2)2/2. This means that for C = 2 p log(T) + 2 the full term is bounded by (3) + (4) 1 π e ( 2)2/2 = 1 π e ( 2 log(T ))2/2 = T 1 Substituting this inequality, we get P |Bt + t µ0 η| > 1 T 1/2+α 2T 1 ϕ( µ0 1 τ) 1 τ + µ0Φ( µ0 1 T 1/2+α 1 p 2π(1 τ) + | µ0|dτ π (2 + T 1/2+α µ0) 6T 1 This quantity is of course less than T θ, since θ (0, 1/2) by assumption 2. Assume (1/4 + α/2 θ < 0). 
In this case, we have, being µ0 0, the following inequality P (|Bt + t µ0 η| > 1 s) P (|Bt η| > 1 s) . This simplified form leads to P (|Bt + t µ0 η| > 1 s) 2 Z 1 ϕ(0) 1 τ ϕ η τ ϕ(0) 1 τ ϕ (0) 1 τ dτ Since in our case s = T 1/2+α < 1/2, this can be further simplified as P (|Bt + t µ0 η| > 1 s) 1 y=1 τ = 2 π This leads to P |Bt + t µ0 η| > 1 T 1/2+α 4 π T 1/4+α/2. By assumption, 1/4 + α/2 θ < 0 the exponent is 1/4 + α/2 < T 1/2+θ. Therefore, we have P |Bt + t µ0 η| > 1 T 1/2+α 4 Thus, we have proved that in both cases P (|Bt + t µ0 η| > 1 s) 4 Finally, applying Equation (7) and substituting the value of n, we get i=0 1( ,0] (Gi + iµ) T 2α ! π T 1/2+θ + 2 πT 1/2+α log(T 1/2+α), which implies i=0 1( ,0] (Gi + iµ) T 2α ! B.2.3 Regret bound Before the actual proof, we are stating a simple proposition about the structure of the loggrid, which will ease the next computations. Proposition 9. Let LG(1/2, 1, T) := T λj+(1 λj)/2 : λj = j log(T) , j = 0, . . . , log(T) . The following identities hold 1. LG(1/2, 1, T) can be equivalently defined as LG(1/2, 1, T) := n T 1/2+ j 2 log(T ) , j = 0, . . . , log(T) o . 2. Let ℓj the j th element of LG(1/2, 1, T), and αj = log(ℓj) log(T ) 1/2. Then αj = j 2 log(T ) + 3. The ratio of two consecutive values of ℓj is ℓj+1 1 2 log(T ) [ e, 2] for T 51. Next, we prove the following lemmas, which concern some features of our algorithm. Lemma 6. For any arm i, the probability of the event Eii0, corresponding to i eliminating the another i0 arm such that their gap is ii0 := µi0 µi > 0, is, at most P(Eii0) 6 log(T)(4 log(T) + 2)T 1/2 1 ii0 . Proof. Let us call: e ii0 = ii0 4 log(T) + 2. At this point, there are two possibilities, 1. e ii0 T 1/2: in this case, the statement of the lemma is vacuous. 2. e ii0 > T 1/2: in this case, by assumption, there are two consecutive ℓj , ℓj +1 L such that e ii0 T 1/2 ℓj +1 , T 1/2 This is true due to the fact that that the sequence ℓj spans from T 1/2 to T. By Proposition 9, this can be equivalently expressed by saying that e ii0 T j +1 2 log(T ) , T j 2 log(T ) . Let us define the following family of events: Eii0(z) := arm i eliminates arm i0 when both have been pulled z times. The probability of Eii0 is bounded by: P(Eii0) = P z=1 Eii0(z) z L Eii0(z) j=1 P(Eii0(ℓj)). Here, we have applied the fact that, by design of the algorithm, the arms can only be discarded in fair steps for which z L and then a union bound. Here, we notice that by definition of the filtering condition, defining tj as the fair time-step where both arms have been played ℓj times, this event can be again rewritten as: ( tj X τ=1:τ fair {Rτ(i) > Rτ(i0)} T 2αj ) where αj = log(ℓj) 2. If we call ˆµτ,i0, ˆµτ,i the empirical means of arms i0, i after τ pulls of each, the previous event can be interpreted as the time in which the random walk given by the difference of the rewards of the two arms stays in ( , 0]: τ=1:τ fair {Rτ(i) > Rτ(i0)} = τ=1 1 {ˆµτ,i ˆµτ,i0} where Pτ k=1 ri,k is the cumulative reward of arm i and Pτ k=1 r1,k is the cumulative reward of arm i0. Therefore, we have written this quantity as the time spent by the random walk Gτ in the interval ( , 0], for τ = 1, . . . ℓj. The drift term for this random walk is given by: E[ri,k r1,k] = µi0 µi = ii0. Therefore, we can apply Theorem 8 for the following choice of parameters, (a) α = αj = j 2 log(T ) + o(T 1/2) (Proposition 9), which implies n = T 1/2+αj = ℓj. (b) θ = j +1 2 log(T ) . 
We can use this choice since the drift is ii0 = (4 log(T) + 2) | {z } T j +1 2 log(T ) therefore the assumptions of the theorem are respected. Applying the theorem, we have: P(Ei(2ℓj)) 3T 1/2+θ = 3T 1/2+ j +1 2 log(T ) . Summing over j, we get, j=1 P(E1(2ℓj)) 3 log(T)T 1/2+ j +1 2 log(T ) . To conclude, consider that, by definition , ℓj = T 1/2+ j 2 log(T ) ; this means that: e ii0 T j +1 2 log(T ) , T j 2 log(T ) i , so that, in particular, e 1 ii0 T j 2 log(T ) and 1 ii0 (4 log(T) + 2)T j 2 log(T ) . Substituting in the bound we have just found results in, 3 log(T)T 1/2+ j +1 2 log(T ) 3 log(T)(4 log(T) + 2)T 1/2T 1 2 log(T ) 1 ii0 6 log(T)(4 log(T) + 2)T 1/2 1 ii0 . Lemma 7. For any arm i [n], the probability of the event E ii0, corresponding to arm i [n] not being eliminated after (8 log(T) + 4)T 1/2 1 ii0 pulls if there is an active arm i0 with ii0 := µi0 µi > 0 is such that: P(E ii0) 2T 1/2. Proof. Define ℓj, αj as in the previous lemma, so that αj = log(ℓj) 2. As in the previous lemma, we define: e ii0 = ii0 4 log(T) + 2, and j {1, . . . log(T) } such that: e ii0 T j +1 2 log(T ) , T j 2 log(T ) . Here, remember that by definition of the filtering condition, defining tj as the fair timestep where both arms have been played ℓj times, Ei i0 is included in the following event: ( tj X τ=1:τ fair {Rτ(i0) > Rτ(i)} T 2αj ) As before, this event can be interpreted as the difference between two random walks being negative, due to the fact that: τ=1:τ fair {Rτ(i0) > Rτ(i)} = τ=1 1 {ˆµτ,1 ˆµτ,i} In this formulation, we have written the quantity of interest for the filtering condition after ℓj +1 pulls as the time spent by the random walk Gτ in the interval ( , 0], for τ = 1, . . . ℓj +1. This time, the drift term is: E[ri0,k ri,k] = µi0 µi = ii0. Therefore, we can apply Theorem 7 for α = αj , since, by assumption, ii0 = (4 log(T) + 2) | {z } 4log(T ) T j +1 2 log(T ) This theorem leads to: P E ii0 1 P τ=1 1( ,0] (Gτ) T 2αj +1 Thm.7 2T 1/2. (8) To get the thesis, is sufficient to reformulate the critical time-step ℓj +1 in terms of ii0. By definition, e ii0 T j 2 log(T ) = T 1/2 Therefore, ii0 (8 log(T) + 4) T 1/2 ℓj +1 . From this, it immediately follows, ℓj +1 (8 log(T) + 4)T 1/2 1 ii0 , concluding the proof. We are finally able to prove our main result about the instance-independent regret of the algorithm. Theorem 4. In the stochastic bandits with ranking feedback setting, when the noise is Gaussian, Algorithm 2 achieves RT 62n4 log(T)2T 1/2. Proof. Fix a sub-optimal arm i with corresponding gap i with respect to the optimal arm and let ψ a parameter to be chosen later. Define, for every couple of indices i1, i0 the event: Eψ i1i0 := Ei1i0 µi0 µi1 > ψ else, where the event Ei1i0 is defined as arm i1 eliminating arm i0 in some point of the process. The probability that at least one of this events verifies is bounded by lemma 6 and union bound with: i1=1,i0=0 Eψ i1i0 3n(n 1) log(T)(4 log(T) + 2)T 1/2ψ 1, as their number is at most n(n 1)/2. Therefore, under the complementary of Ψ, as no elimination with gap larger than ψ happens, we are sure that an arm i with gap (w.r.t. the first arm) less than (n 1)ψ survives until the last: in fact, at most n 1 eliminations may happen, and all between pair of arms with a difference at most ψ. As the arm i is active until the last, the probability of event E ii that i survives for more than (8 log(T) + 4)T 1/2 1 ii pulls is bounded by Lemma 7 with 2T 1/2. 
Making the union bound over all possible values of i , which is a random variable, we can say that the probability that any of this event happens is at most 2(n 1)T 1/2. Summarizing, we have the following bound on Zi(T): 1. Zi(T) T if either Ψ verifies, which happens with probability, 3n(n 1) log(T)(4 log(T) + 2)T 1/2ψ 1, or if Eii verifies, which happens with probability at most 2(n 1)T 1/2 2. Zi(T) T(8 log(T) + 4)T 1/2( i (n 1)ψ) 1 otherwise. The last comes just from the fact that ii i (n 1)ψ, by definition of i . If we take ψ = 2(n 1), it results in, E[ i Zi(T)] i TP( n i1=1,i0=0Eψ i1i0) + T i P(Eii ) + i T(8 log(T) + 4)T 1/2( i (n 1)ψ) 1 i T 3n(n 1) log(T)(4 log(T) + 2)T 1/2ψ 1 + 2n T 1/2 + i T(8 log(T) + 4)T 1/2( i (n 1)ψ) 1 ψ= 2(n 1) i T 6n(n 1)2 log(T)(4 log(T) + 2)T 1/2 1 i + 2n T 1/2 + 2 i T(8 log(T) + 4)T 1/2 1 i = 6n(n 1)2 log(T)(4 log(T) + 2)T 1/2 + 2n i T 1/2 + 2T(8 log(T) + 4)T 1/2. Being i (0, 1) and n 2, the previous quantity is bounded by, E[ i Zi(T)] 62n3 log(T)2T 1/2. The proof is completed by using the the Regret Decomposition Lemma Lattimore and Szepesvari [2017]. C Proof for adversarial setting Theorem 5. In adversarial bandits with ranking feedback, there exists a constant γ (0, 1) such that no algorithm achieves o(T) regret with respect to the best arm in hindsight with probability greater than γ. Proof. This negative result follows from the impossibility to achieve RT CT regret by any algorithm, with C properly set constant and probability 1 ϵ, in all three instances reported next. Please notice that, this result implies that even the no-regret property cannot be achieved in the bandit with ranking feedback setting. Without loss of generality we consider rewards function bounded in [0, 10]. Consider three instances, with two arms a0, a1 for each and the associated rewards, defined as follows: Instance 1 : 2 t 1 , 1 2 t 2 , 1 2 t 3 a1 : 0 t 1 , 0 t 2 , 0 t 3 Instance 2 : ( a0 : δ t 1 , 0 t 2 , 0 t 3 a1 : 0 t 1 , 1 t 2 , 1 t 3 Instance 3 : ( a0 : δ t 1 , 0 t 2 , 10 t 3 a1 : 0 t 1 , 1 t 2 , 0 t 3 where phase 1 is made by the first T/4 rounds, phase 2 is made by the next T/4 rounds, phase 3 is made by the last T/2 rounds and δ is near to 0. In phase 1 all the instances have the same ranking feedback, as the first action gives higher rewards with respect to the second one. To make instance 1 receive RT CT, it is necessary: 1 2T 1 2E[na0] CT E[na0] (1 2C)T where na0 is the number of times the first arm has been pulled, and the expected value is taken on the randomization of the algorithm. From previous equation we obtain that in all instances: 4T = (1 C1)T/4 where C1 = 8C, na0 is the number of time the first arm has to be pulled in phase 1 and the inequality is computed considering that a0 is played in all the next phases. By reverse Markov inequality: P n 1 a0 > (1 C1)T/4 C1 C1 Setting the probability equal to 9/10 we obtain: C1 = 10C1 from which follow that with probability 9/10 we have: n 1 a0 > (1 10C1)T/4 and consequently: n 1 a1 10C1T/4. We observe that in the second phase, instances 2 and 3 have the same feedback. Proceeding as done before, to make instance 2 receive RT CT it is necessary: 3 4T E [na1] CT E[na1] 3 From previous equation we obtain that in instances 2 and 3 : 4 C T T/2 = (1 C2)T/4 where the inequality is computed considering that a1 is played in the next phases and C2 = 4C. 
By reverse Markov inequality, we obtain that, with probability 9/10: n 2 a1 > (1 10C2)T/4 and consequently: n 2 a0 10C2T/4 We neglect the δ value for now, as it can be chosen to be insignificant with respect to the previous computation. Now we focus on the third phase, in which instance 2 should play: 4 C T T/4 = (1 C3)T/2, where C3 = 2C. By reverse Markov inequality, we obtain that, with probability 9/10: n 3 a1 > (1 10C3)T/2 and consequently: n 3 a0 10C3T/2 Now, we compute the number of rounds needed in the third instance to switch the ranking in the third phase, namely q. Notice that, until this switch, the last two instances receive the same feedback. We compute q in the best-case scenario (that is, when small q value is sufficient to allow the switch) that satisfies the constraints previously shown. Precisely, q is computed so that the empirical mean of arm a0 is greater then the arm a1 one, given that n 1 a0 > (1 10C1)T/4 and n 2 a1 > (1 10C2)T/4. Formally: 0(1 10C1)T/4 + 10q q + (1 10C1)T/4 0C110T/4 + (1 10C2)T/4 + 0T/2 10C1T/4 + (1 10C2)T/4 + T/2 We now show that for proper C value we can lower bound the right side with 1 4. In particular: 0C110T/4 + (1 10C2)T/4 + 0T/2 10C1T/4 + (1 10C2)T/4 + T/2 > 1 4 C < 1/200 which means that, for C < 1 200, we can substitute the right side of the equation with 1 4 to simplify the computation. Moreover, notice that gap between 1 4 and 0C110T/4+(1 10C2)T/4+0T/2 10C1T/4+(1 10C2)T/4+T/2 allowed us to neglect the computations with δ. Then: 0(1 10C1)T/4 + 10q q + (1 10C1)T/4 1/4 q 4 To achieve a contradiction, it sufficient to find C so that q + n 3 a1 > T/2; indeed, the previous inequality shows the impossibility to gain enough rewards to make the ranking change and, at the same time, guarantee the minimum rewards to make instance 2 no-regret. Given that the ranking switch is a necessary condition to make instance 3 no-regret, the result of impossibility follows for: 4 20C T/4 + (1 20C)T/2 > T/2 C < 1 1640 To conclude the proof, we show that the intersection between the events derived by reverse Markov inequality (namely Ei with i [3]) holds with constant probability: i [3] P(Ec i ) where the inequality holds by Union Bound. Substituting all the previous results in the definition of regret we obtain, with probability 7 10 = 1 ϵ and C < 1 1640, RT CT = Ω(T) which concludes the proof. D Numerical evaluation This section presents a numerical evaluation of the algorithms proposed in the paper for the stochastic settings, namely, DREE and R-LPE. The goal of such a study is to show two crucial results: firstly, the comparison of our algorithms with a well-known bandit baseline, and secondly, the need to develop distinct algorithms tailored for instance-dependent and instance-independent scenarios. To establish a benchmark for comparison, we consider the EC (Explore-Then-Commit) algorithm, which is one of the most popular algorithms among the explore-then-commit class providing sublinear regret guarantees. In the following, we evaluate the DREE algorithm with different choices of the δ parameter in the function f(t) = log(t)1+δ; precisely, we choose δ {1.0, 1.5, 2.0}. Furthermore, we consider four stochastic instances whose specific parameters are discussed below. In all these instances, we assume the rewards to be drawn from Gaussian random variables with unit variance, i.e., σ2 = 1, and we let the time horizon be equal to T = 2 105. Finally, for each algorithm, we evaluate the cumulative regret averaged over 50 runs. 
We structure the presentation of the experimental results into two groups. In the first, the instances have a small min, while in the second, the instances have a large min. Small values of min We focus on two instances with min < 0.05. In the first of these two instances, we consider n = 4 arms, and a minimum gap of min = 0.03. In the second instance, we consider n = 6 arms, with min = 0.03. The expected values of the rewards of each arm are reported in Section D.1, while the experimental results in terms of average cumulative regret are reported in Figures 1 2. We observe that in the first instance (see Figure 1) all the DREE algorithms exhibits a linear regret bound, confirming the strong sensitivity of this family of algorithms on the parameter min in terms of regret bound. In contrast, the R-LPE algorithm exhibits better performances in terms of regret bound, as its theoretical guarantee are independent on the values of min. Furthermore, Figure 2 shows that the DREE algorithms (with δ 1.0, 1.5) achieve a better regret bound when the number of arms is increased. Indeed, these regret bounds are comparable to the ones achieved by the R-LPE algorithm. The previous result is reasonable as the presence of i-s in the regret bound lowers the dependence on the number of arms. It is worth noticing that all our algorithms outperform the baseline EC. 0 0.5 1 1.5 2 Cumulative Regret EC DREE (δ = 1) DREE (δ = 1.5) DREE (δ = 2) Figure 1: Instance with min = 0.03 and all the gaps small. 0 0.5 1 1.5 2 Cumulative Regret EC DREE (δ = 1) DREE (δ = 1.5) DREE (δ = 2) Figure 2: Instance with min = 0.03 and the other gaps big Large values of min We focus on two instances with min 0.25. In the first instance, we consider n = 4 arms with a minimum gap of min = 0.5 among their expected rewards. In the second instance, we instead consider a larger number of arms, specifically n = 8, with a minimum gap equal to min = 0.25. The expected values of the rewards are reported in Section D.1, while the experimental results in terms of average cumulative regret are provided in Figures 3 4. As it clear from both Figures 3 4 when min is sufficiently large, the DREE algorithms (with δ {1.0, 1.5}) achieves better performances with respect both the EC and R-PLE algorithms in terms of cumulative regret. Furthermore, there is empirical evidence that a small δ guarantees better performance, which is reasonable according to theory. Indeed, when δ is small, the function f(t), which drives the exploration, is closer to a logarithm. Also, as shown in Corollary 1, when min is large enough, the parameter δ affects the dimension of C(f, i) more weakly, which results in a better regret bound. 0 0.5 1 1.5 2 Cumulative Regret EC DREE (δ = 1) DREE (δ = 1.5) DREE (δ = 2) Figure 3: Instance with min = 0.5. 0 0.5 1 1.5 2 Cumulative Regret EC DREE (δ = 1) DREE (δ = 1.5) DREE (δ = 2) Figure 4: Instance with min = 0.25. D.1 Experiments For the sake of clarity, we report in the followings additional details on the four instances presented in Figures 1,2,3,4. Notice that, all the plots present the cumulative regret averaged for 50 runs, with 95% confidence interval. 
Furthermore: Instance of Figure 1: time horizon T = 2 105, arms n = 4, mean reward vector µ = [0.9, 1.05, 1.12, 1.15], unitary variance for each arm, min = 0.03; Instance of Figure 2: time horizon T = 2 105, arms n = 6, mean reward vector µ = [0.03, 0.07, 0.1, 0.08, 0.97, 1], unitary variance for each arm, min = 0.03; Instance of Figure 3: time horizon T = 2 105, arms n = 4, mean reward vector µ = [0.05, 0.25, 0.5, 1.0], unitary variance for each arm, min = 0.5; Instance of Figure 4: time horizon T = 2 105, arms n = 8, mean reward vector µ = [0.05, 0.05, 0.1, 0.15, 0.25, 0.5, 0.75, 1.0], unitary variance for each arm, min = 0.25; D.2 Detailed explanation of the experiments In this section, we report all the details of the experiments performed in the paper. These are important to ensure the truthfullness of the results and the claims based on empirical validation. Training details In the main paper we have presented four experiments, each corresponding to a different environment. Each experiment is performed for fifty random seeds, ad the computation is split in 10 parallel processes by the library joblib. The overall computational time for one experiment is around 337.92 seconds, that is roughly five minutes and one half. Compute As stated, the numerical simulations resulted to be very fast. For this reason, it was not necessary to run them on a server, and we used a personal computer with the following specifications: CPU: 11th Gen Intel(R) Core(TM) i7-1165G7 2.80 GHz RAM: 16,0 GB Operating system: Windows 11 System type: 64 bit Reproducibility Due to the stochastic nature of the bandit problem, all the simulations have been repeated several times. We have performed all the experiments with 50 different random seeds, corresponding precisely to the first 50 natural numbers. The seed influences the generation of the reward by the environment, while all algorithms proposed, being deterministic, are independent on the seed. Neur IPS Paper Checklist Question: Do the main claims made in the abstract and introduction accurately reflect the paper s contributions and scope? Answer: [Yes] Justification: All we say in the abstract is done in the work. Guidelines: The answer NA means that the abstract and introduction do not include the claims made in the paper. The abstract and/or introduction should clearly state the claims made, including the contributions made in the paper and important assumptions and limitations. A No or NA answer to this question will not be perceived well by the reviewers. The claims made should match theoretical and experimental results, and reflect how much the results can be expected to generalize to other settings. It is fine to include aspirational goals as motivation as long as it is clear that these goals are not attained by the paper. 2. Limitations Question: Does the paper discuss the limitations of the work performed by the authors? Answer: [Yes] Justification: We clarifiy that the contribution is theoretical, which is the only aspect that can be seen as a limitation of this work. Guidelines: The answer NA means that the paper has no limitation while the answer No means that the paper has limitations, but those are not discussed in the paper. The authors are encouraged to create a separate "Limitations" section in their paper. 
The paper should point out any strong assumptions and how robust the results are to violations of these assumptions (e.g., independence assumptions, noiseless settings, model well-specification, asymptotic approximations only holding locally). The authors should reflect on how these assumptions might be violated in practice and what the implications would be. The authors should reflect on the scope of the claims made, e.g., if the approach was only tested on a few datasets or with a few runs. In general, empirical results often depend on implicit assumptions, which should be articulated. The authors should reflect on the factors that influence the performance of the approach. For example, a facial recognition algorithm may perform poorly when image resolution is low or images are taken in low lighting. Or a speech-to-text system might not be used reliably to provide closed captions for online lectures because it fails to handle technical jargon. The authors should discuss the computational efficiency of the proposed algorithms and how they scale with dataset size. If applicable, the authors should discuss possible limitations of their approach to address problems of privacy and fairness. While the authors might fear that complete honesty about limitations might be used by reviewers as grounds for rejection, a worse outcome might be that reviewers discover limitations that aren t acknowledged in the paper. The authors should use their best judgment and recognize that individual actions in favor of transparency play an important role in developing norms that preserve the integrity of the community. Reviewers will be specifically instructed to not penalize honesty concerning limitations. 3. Theory Assumptions and Proofs Question: For each theoretical result, does the paper provide the full set of assumptions and a complete (and correct) proof? Answer: [Yes] Justification: We leave the complete and formal proofs in the appendix. Guidelines: The answer NA means that the paper does not include theoretical results. All the theorems, formulas, and proofs in the paper should be numbered and crossreferenced. All assumptions should be clearly stated or referenced in the statement of any theorems. The proofs can either appear in the main paper or the supplemental material, but if they appear in the supplemental material, the authors are encouraged to provide a short proof sketch to provide intuition. Inversely, any informal proof provided in the core of the paper should be complemented by formal proofs provided in appendix or supplemental material. Theorems and Lemmas that the proof relies upon should be properly referenced. 4. Experimental Result Reproducibility Question: Does the paper fully disclose all the information needed to reproduce the main experimental results of the paper to the extent that it affects the main claims and/or conclusions of the paper (regardless of whether the code and data are provided or not)? Answer: [Yes] Justification: In the appendix we report every detail of the experiments, even the amount of CPU time used. Guidelines: The answer NA means that the paper does not include experiments. If the paper includes experiments, a No answer to this question will not be perceived well by the reviewers: Making the paper reproducible is important, regardless of whether the code and data are provided or not. If the contribution is a dataset and/or model, the authors should describe the steps taken to make their results reproducible or verifiable. 
Depending on the contribution, reproducibility can be accomplished in various ways. For example, if the contribution is a novel architecture, describing the architecture fully might suffice, or if the contribution is a specific model and empirical evaluation, it may be necessary to either make it possible for others to replicate the model with the same dataset, or provide access to the model. In general. releasing code and data is often one good way to accomplish this, but reproducibility can also be provided via detailed instructions for how to replicate the results, access to a hosted model (e.g., in the case of a large language model), releasing of a model checkpoint, or other means that are appropriate to the research performed. While Neur IPS does not require releasing code, the conference does require all submissions to provide some reasonable avenue for reproducibility, which may depend on the nature of the contribution. For example (a) If the contribution is primarily a new algorithm, the paper should make it clear how to reproduce that algorithm. (b) If the contribution is primarily a new model architecture, the paper should describe the architecture clearly and fully. (c) If the contribution is a new model (e.g., a large language model), then there should either be a way to access this model for reproducing the results or a way to reproduce the model (e.g., with an open-source dataset or instructions for how to construct the dataset). (d) We recognize that reproducibility may be tricky in some cases, in which case authors are welcome to describe the particular way they provide for reproducibility. In the case of closed-source models, it may be that access to the model is limited in some way (e.g., to registered users), but it should be possible for other researchers to have some path to reproducing or verifying the results. 5. Open access to data and code Question: Does the paper provide open access to the data and code, with sufficient instructions to faithfully reproduce the main experimental results, as described in supplemental material? Answer: [Yes] Justification: We provide the code. Guidelines: The answer NA means that paper does not include experiments requiring code. Please see the Neur IPS code and data submission guidelines (https://nips.cc/ public/guides/Code Submission Policy) for more details. While we encourage the release of code and data, we understand that this might not be possible, so No is an acceptable answer. Papers cannot be rejected simply for not including code, unless this is central to the contribution (e.g., for a new open-source benchmark). The instructions should contain the exact command and environment needed to run to reproduce the results. See the Neur IPS code and data submission guidelines (https: //nips.cc/public/guides/Code Submission Policy) for more details. The authors should provide instructions on data access and preparation, including how to access the raw data, preprocessed data, intermediate data, and generated data, etc. The authors should provide scripts to reproduce all experimental results for the new proposed method and baselines. If only a subset of experiments are reproducible, they should state which ones are omitted from the script and why. At submission time, to preserve anonymity, the authors should release anonymized versions (if applicable). Providing as much information as possible in supplemental material (appended to the paper) is recommended, but including URLs to data and code is permitted. 6. 
Experimental Setting/Details Question: Does the paper specify all the training and test details (e.g., data splits, hyperparameters, how they were chosen, type of optimizer, etc.) necessary to understand the results? Answer: [Yes] Justification: These information are reported in the appendix. Guidelines: The answer NA means that the paper does not include experiments. The experimental setting should be presented in the core of the paper to a level of detail that is necessary to appreciate the results and make sense of them. The full details can be provided either with the code, in appendix, or as supplemental material. 7. Experiment Statistical Significance Question: Does the paper report error bars suitably and correctly defined or other appropriate information about the statistical significance of the experiments? Answer: [Yes] Justification: Every experiment reports, together with the regret curves, a shaded area indicating the uncertainty. Guidelines: The answer NA means that the paper does not include experiments. The authors should answer "Yes" if the results are accompanied by error bars, confidence intervals, or statistical significance tests, at least for the experiments that support the main claims of the paper. The factors of variability that the error bars are capturing should be clearly stated (for example, train/test split, initialization, random drawing of some parameter, or overall run with given experimental conditions). The method for calculating the error bars should be explained (closed form formula, call to a library function, bootstrap, etc.) The assumptions made should be given (e.g., Normally distributed errors). It should be clear whether the error bar is the standard deviation or the standard error of the mean. It is OK to report 1-sigma error bars, but one should state it. The authors should preferably report a 2-sigma error bar than state that they have a 96% CI, if the hypothesis of Normality of errors is not verified. For asymmetric distributions, the authors should be careful not to show in tables or figures symmetric error bars that would yield results that are out of range (e.g. negative error rates). If error bars are reported in tables or plots, The authors should explain in the text how they were calculated and reference the corresponding figures or tables in the text. 8. Experiments Compute Resources Question: For each experiment, does the paper provide sufficient information on the computer resources (type of compute workers, memory, time of execution) needed to reproduce the experiments? Answer: [Yes] Justification: These information are reported in the appendix. Guidelines: The answer NA means that the paper does not include experiments. The paper should indicate the type of compute workers CPU or GPU, internal cluster, or cloud provider, including relevant memory and storage. The paper should provide the amount of compute required for each of the individual experimental runs as well as estimate the total compute. The paper should disclose whether the full research project required more compute than the experiments reported in the paper (e.g., preliminary or failed experiments that didn t make it into the paper). 9. Code Of Ethics Question: Does the research conducted in the paper conform, in every respect, with the Neur IPS Code of Ethics https://neurips.cc/public/Ethics Guidelines? Answer: [Yes] Justification: Experiments are just Python simulation, no concerns about real data. 
Guidelines: The answer NA means that the authors have not reviewed the Neur IPS Code of Ethics. If the authors answer No, they should explain the special circumstances that require a deviation from the Code of Ethics. The authors should make sure to preserve anonymity (e.g., if there is a special consideration due to laws or regulations in their jurisdiction). 10. Broader Impacts Question: Does the paper discuss both potential positive societal impacts and negative societal impacts of the work performed? Answer: [NA] Justification: Only theoretical work. Guidelines: The answer NA means that there is no societal impact of the work performed. If the authors answer NA or No, they should explain why their work has no societal impact or why the paper does not address societal impact. Examples of negative societal impacts include potential malicious or unintended uses (e.g., disinformation, generating fake profiles, surveillance), fairness considerations (e.g., deployment of technologies that could make decisions that unfairly impact specific groups), privacy considerations, and security considerations. The conference expects that many papers will be foundational research and not tied to particular applications, let alone deployments. However, if there is a direct path to any negative applications, the authors should point it out. For example, it is legitimate to point out that an improvement in the quality of generative models could be used to generate deepfakes for disinformation. On the other hand, it is not needed to point out that a generic algorithm for optimizing neural networks could enable people to train models that generate Deepfakes faster. The authors should consider possible harms that could arise when the technology is being used as intended and functioning correctly, harms that could arise when the technology is being used as intended but gives incorrect results, and harms following from (intentional or unintentional) misuse of the technology. If there are negative societal impacts, the authors could also discuss possible mitigation strategies (e.g., gated release of models, providing defenses in addition to attacks, mechanisms for monitoring misuse, mechanisms to monitor how a system learns from feedback over time, improving the efficiency and accessibility of ML). 11. Safeguards Question: Does the paper describe safeguards that have been put in place for responsible release of data or models that have a high risk for misuse (e.g., pretrained language models, image generators, or scraped datasets)? Answer: [NA] Justification: We did not use data. Guidelines: The answer NA means that the paper poses no such risks. Released models that have a high risk for misuse or dual-use should be released with necessary safeguards to allow for controlled use of the model, for example by requiring that users adhere to usage guidelines or restrictions to access the model or implementing safety filters. Datasets that have been scraped from the Internet could pose safety risks. The authors should describe how they avoided releasing unsafe images. We recognize that providing effective safeguards is challenging, and many papers do not require this, but we encourage authors to take this into account and make a best faith effort. 12. Licenses for existing assets Question: Are the creators or original owners of assets (e.g., code, data, models), used in the paper, properly credited and are the license and terms of use explicitly mentioned and properly respected? 
Answer: [NA] Justification: We only use standard Python libraries to make the experiments. Guidelines: The answer NA means that the paper does not use existing assets. The authors should cite the original paper that produced the code package or dataset. The authors should state which version of the asset is used and, if possible, include a URL. The name of the license (e.g., CC-BY 4.0) should be included for each asset. For scraped data from a particular source (e.g., website), the copyright and terms of service of that source should be provided. If assets are released, the license, copyright information, and terms of use in the package should be provided. For popular datasets, paperswithcode.com/datasets has curated licenses for some datasets. Their licensing guide can help determine the license of a dataset. For existing datasets that are re-packaged, both the original license and the license of the derived asset (if it has changed) should be provided. If this information is not available online, the authors are encouraged to reach out to the asset s creators. 13. New Assets Question: Are new assets introduced in the paper well documented and is the documentation provided alongside the assets? Answer: [NA] Justification: Answered in previous questions. Guidelines: The answer NA means that the paper does not release new assets. Researchers should communicate the details of the dataset/code/model as part of their submissions via structured templates. This includes details about training, license, limitations, etc. The paper should discuss whether and how consent was obtained from people whose asset is used. At submission time, remember to anonymize your assets (if applicable). You can either create an anonymized URL or include an anonymized zip file. 14. Crowdsourcing and Research with Human Subjects Question: For crowdsourcing experiments and research with human subjects, does the paper include the full text of instructions given to participants and screenshots, if applicable, as well as details about compensation (if any)? Answer: [NA] Justification: We did not do any crowdsourcing. Guidelines: The answer NA means that the paper does not involve crowdsourcing nor research with human subjects. Including this information in the supplemental material is fine, but if the main contribution of the paper involves human subjects, then as much detail as possible should be included in the main paper. According to the Neur IPS Code of Ethics, workers involved in data collection, curation, or other labor should be paid at least the minimum wage in the country of the data collector. 15. Institutional Review Board (IRB) Approvals or Equivalent for Research with Human Subjects Question: Does the paper describe potential risks incurred by study participants, whether such risks were disclosed to the subjects, and whether Institutional Review Board (IRB) approvals (or an equivalent approval/review based on the requirements of your country or institution) were obtained? Answer: [NA] Justification: We have no study participants. Guidelines: The answer NA means that the paper does not involve crowdsourcing nor research with human subjects. Depending on the country in which research is conducted, IRB approval (or equivalent) may be required for any human subjects research. If you obtained IRB approval, you should clearly state this in the paper. 
We recognize that the procedures for this may vary significantly between institutions and locations, and we expect authors to adhere to the Neur IPS Code of Ethics and the guidelines for their institution. For initial submissions, do not include any information that would break anonymity (if applicable), such as the institution conducting the review.