# Active Bipartite Ranking

James Cheshire, Stephan Clémençon (Telecom Paris Tech, first.last@telecom-paris.fr); Vincent Laurent (ENS Paris Saclay, first.last@ens-paris-saclay.fr)

37th Conference on Neural Information Processing Systems (NeurIPS 2023).

Abstract. In this paper, we develop an active learning framework for the bipartite ranking problem. Motivated by numerous applications, ranging from supervised anomaly detection to credit scoring through the design of medical diagnosis support systems, and usually formulated as the problem of optimizing (a scalar summary of) the ROC curve, bipartite ranking has been the subject of much attention in the passive context. Various dedicated algorithms have been recently proposed and studied by the machine-learning community. In contrast, active bipartite ranking is poorly documented in the literature. Due to the global nature of the problem, a strategy for sequentially labelling the data points that are difficult to rank w.r.t. the others is required. This learning task is much more complex than binary classification, for which many active algorithms have been designed. It is the goal of this article to provide a rigorous formulation of such a selective sampling approach. We propose a dedicated algorithm, referred to as active-rank, which aims to minimise the distance, w.r.t. the sup norm, between the ROC curve of the ranking function built and the optimal one. We show that, for a fixed confidence level ε and probability δ, active-rank is PAC(ε, δ). In addition, we provide a problem-dependent upper bound on the expected sampling time of active-rank, and also demonstrate a problem-dependent lower bound on the expected sampling time of any PAC(ε, δ) algorithm. Beyond the theoretical analysis carried out, numerical results are presented, providing strong empirical evidence of the performance of the proposed algorithm, which compares favorably with more naive approaches.

1 Introduction

In bipartite ranking, the statistical framework is exactly the same as in standard binary classification, the flagship problem in statistical learning theory. One observes n ≥ 1 independent copies D_n = {(X_1, Y_1), ..., (X_n, Y_n)} of a generic random pair (X, Y) with (unknown) distribution P, where Y is a binary random label, valued in {−1, +1} say, and X is a high-dimensional random vector, taking its values in X ⊂ R^d with d ≥ 1, that models some information hopefully useful to predict Y. In contrast to binary classification, the goal pursued is of global (and not local) nature. It is not to assign a label, positive or negative, to any new input observation X, but to rank any new set of (temporarily unlabelled) observations X'_1, ..., X'_n by means of a (measurable) scoring function s : X → R, so that those with positive label appear on top of the list (i.e. are those with the highest scores) with high probability. More formally, the accuracy of any scoring rule can be evaluated through the ROC curve criterion or its popular scalar summary, the AUC (standing for the Area Under the ROC Curve), and, as expected, optimal scoring functions w.r.t. these performance measures can be shown to be increasing transforms of the posterior probability η(x) = P{Y = +1 | X = x}, x ∈ X. Though easy to formulate, this problem encompasses many applications, ranging from credit-risk screening to the design of decision support tools for medical diagnosis, through (supervised) anomaly detection.
Hence, motivated by a wide variety of applications, bipartite ranking has received much attention these last few years. Many approaches to this global learning problem (i.e. the problem of learning a preorder on the input space X based on a binary feedback) have been proposed and investigated. In Clémençon and Vayatis (2009), optimization of the (empirical) ROC curve in sup norm is considered via nonlinear approximation techniques, while this functional optimization problem is viewed as a superposition of cost-sensitive binary classification problems in Clémençon and Vayatis (2008) so as to propose an alternative method. In Clémençon et al. (2008) (see also Agarwal et al. (2005)), maximization of the empirical AUC criterion is considered, bipartite ranking being viewed as a pairwise classification problem; a plug-in approach to bipartite ranking is developed in Clémençon and Robbiano (2011), and scalar performance criteria other than the AUC have been recently reviewed in Menon and Williamson (2016). Whereas the vast majority of dedicated articles consider the batch situation solely, where the learning procedure fully relies on a set D_n of training examples given in advance, the goal of this paper is to develop an active learning framework for bipartite ranking, in other words to investigate this problem in an iterative context, where the learning procedure can formulate queries in a sequential manner, so as to observe the labels at new data points and refine the scoring/ranking model progressively. Precisely, the challenge consists in determining an incremental experimental design for labelling the data points in X that permits to improve the ROC curve progressively, with statistical guarantees.

Our contributions. We describe an algorithm, active-rank, which sequentially queries points of the feature space X. Given a confidence level ε > 0 and probability δ > 0, the goal of active-rank is, in as few queries as possible, to output a ranking of X such that the induced ROC curve of said ranking is within ε of the optimal ROC curve, in terms of the sup norm, with probability greater than 1 − δ. We restrict ourselves to dimension 1, i.e. X = [0, 1], and make a single key assumption, that the posterior η is piecewise constant on a grid of size K. Theorem 3.2 then shows that active-rank satisfies the above statistical guarantee and, furthermore, provides an upper bound on its expected total number of queries. In Theorem 3.3 we provide a lower bound on the expected number of queries of any policy which satisfies confidence level ε with probability greater than 1 − δ. The aforementioned bounds are problem dependent, in the sense that they depend on features of the posterior η. Finally, we conduct a practical analysis of active-rank on synthetic data, comparing it to several naive approaches.

The article is structured as follows. In Section 2 we formally define our setting, as well as recalling some key notions related to bipartite ranking and ROC analysis. In Section 2 we also cover some of the existing literature in active learning of relevance to bipartite ranking. We then describe the active-rank algorithm in Section 3.1. Following this, our theoretical results are presented in Section 3.2. Lastly, the experiments are displayed in Section 4. We then conclude and discuss some perspectives for future research in Section 5. Technical details and proofs are deferred to the Supplementary Material.
2 Background and preliminaries

2.1 Notation

Here we introduce several dedicated notions that will be extensively used in the subsequent analysis. For any integer n ≥ 1, we set [n] := {1, ..., n}, denote by S_n the symmetric group of permutations on {1, ..., n} and by I_n the identity map of S_n. By λ is meant the Lebesgue measure on [0, 1]. Given two probability distributions P and Q on a measurable space (Ω, F), we write P ≪ Q when P is absolutely continuous w.r.t. Q. For any a, b in [0, 1], Ber(a) refers to the Bernoulli distribution with mean a and kl(a, b) to the Kullback-Leibler divergence between the Bernoulli distributions Ber(a) and Ber(b). For any a, b ∈ [0, 1], the Chernoff information between the distributions Ber(a) and Ber(b) is defined as kl*(a, b) = kl(x*, a) = kl(x*, b), where x* is the unique x ∈ [0, 1] such that kl(x, a) = kl(x, b). The indicator function of any event E is denoted by 1{E}, the Dirac mass at any point x by δ_x, and the pseudo-inverse of any cdf κ(u) on R by κ^{-1}(t) = inf{v ∈ R : κ(v) ≥ t}.

2.2 Setting

The bipartite ranking problem. A rigorous formulation of bipartite ranking involves functional performance measures. Let S be the set of all scoring functions; any s ∈ S defines a preorder ⪯_s on X: for all (x, x') ∈ X², x ⪯_s x' ⇔ s(x) ≤ s(x'). From a quantitative perspective, the accuracy of any scoring rule can be evaluated through the ROC curve criterion, namely the PP-plot

    t ∈ R ↦ (1 − H_s(t), 1 − G_s(t)),

where H_s(t) = P{s(X) ≤ t | Y = −1} and G_s(t) = P{s(X) ≤ t | Y = +1}, for all t ∈ R. The curve can also be viewed as the graph of the càdlàg function α ∈ (0, 1) ↦ ROC(s, α) = 1 − G_s ∘ H_s^{-1}(1 − α). The notion of ROC curve defines a partial order on the set of all scoring functions (respectively, on the set of all preorders on X): s_1 is more accurate than s_2 when ROC(s_2, α) ≤ ROC(s_1, α) for all α ∈ (0, 1). As can be proved by a straightforward Neyman-Pearson argument, the set S* of optimal scoring functions is composed of increasing transforms of the posterior probability η(x) = P{Y = +1 | X = x}, x ∈ X. We have S* = {s* ∈ S : ∀(x, x') ∈ X², η(x) < η(x') ⇒ s*(x) < s*(x')} and, for all (s, s*) ∈ S × S* and α ∈ (0, 1), ROC(s, α) ≤ ROC*(α) := ROC(s*, α). The ranking performance of a candidate s ∈ S can thus be measured by the distance in sup norm between its ROC curve and ROC*, namely

    d_∞(s, s*) := sup_{α ∈ (0,1)} {ROC*(α) − ROC(s, α)}.

An alternative convention to represent the ROC of a scoring function s, which we will use for the remainder of this paper, is to consider the broken line ROC~(s, ·), which arises from connecting the PP-plot by line segments at each possible jump of the cdf H_s. From here on out, when referring to the ROC of a scoring function s, we refer to the broken line ROC~(s, ·).

The active learning setting. Whereas in batch mode the construction of a nearly optimal scoring function (i.e. a function s ∈ S such that d_∞(s, s*) is small with high probability) is based on a collection of independent training examples given in advance, the objective of an active learner is to formulate queries in order to recover sequentially the optimal preorder on the feature space X defined by the supposedly unknown function η. That is, the active learner plays a game with multiple time steps where, at each time step n, they must choose a point a_n ∈ X to query, so as to observe the random label Y_n ~ Ber(η(a_n)) and refine the scoring model incrementally. After a sufficient number of rounds has elapsed, chosen at the learner's discretion, a final scoring function ŝ is output.
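To make these definitions concrete, the following minimal numerical sketch (ours, not from the paper; all function names are illustrative) evaluates the empirical PP-plot (1 − H_s(t), 1 − G_s(t)) of a scoring function from labelled samples, here for the scorer s(x) = x, which is optimal under a toy increasing posterior.

```python
import numpy as np

def empirical_roc(scores, labels):
    """Empirical PP-plot (1 - H_s(t), 1 - G_s(t)) at every observed threshold t,
    where H_s / G_s are the cdfs of the score on the negative / positive class."""
    neg = np.sort(scores[labels == -1])
    pos = np.sort(scores[labels == +1])
    thresholds = np.unique(scores)
    # 1 - H_s(t): fraction of negatives with score > t (false positive rate)
    fpr = 1.0 - np.searchsorted(neg, thresholds, side="right") / len(neg)
    # 1 - G_s(t): fraction of positives with score > t (true positive rate)
    tpr = 1.0 - np.searchsorted(pos, thresholds, side="right") / len(pos)
    return fpr, tpr

rng = np.random.default_rng(0)
x = rng.uniform(size=2000)
eta = 0.2 + 0.6 * x                       # toy posterior eta(x), increasing in x
y = np.where(rng.uniform(size=x.size) < eta, 1, -1)
fpr, tpr = empirical_roc(x, y)            # s(x) = x is an optimal scorer here
```

Connecting the resulting points by line segments gives the broken-line convention ROC~(s, ·) used throughout the paper.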
Piecewise constant scoring functions. Here we consider the simplest scoring functions: measurable functions that are constant on the pieces of a partition of the input space X. As shown in Clémençon and Vayatis (2009) (see subsection 2.3 therein), when smooth enough, ROC* can be accurately approximated by the (stepwise) ROC curve of a piecewise constant scoring function. Because the goal of this paper is to highlight the nature of active bipartite ranking rather than treating the problem in full generality, various simplifying assumptions are made in the subsequent analysis. For simplicity, we suppose that X = [0, 1) and introduce the grid points G_i = (i − 1)/K for i ∈ [K + 1], where K ≥ 1, so that the cells [G_i, G_{i+1}), i ∈ [K], partition [0, 1). A preorder on X can then be naturally defined by means of a permutation σ ∈ S_K. Consider indeed the scoring function

    s_σ(x) = Σ_{i=1}^K i · 1{x ∈ [G_{σ(i)}, G_{σ(i)+1})}.   (1)

We denote by S_K the set of all functions of type (1). To avoid dealing with model bias here, we assume that the optimal preorder, namely that induced by η(x), can be defined by a scoring function in S_K.

Assumption 2.1. There exist a permutation σ ∈ S_K and distinct constants μ_1, ..., μ_K in (0, 1) such that

    η(x) = Σ_{i=1}^K μ_i · 1{x ∈ [G_{σ(i)}, G_{σ(i)+1})}  for all x ∈ [0, 1).

We write p = (1/K) Σ_{i ∈ [K]} μ_i.

We point out that, as η may remain constant over multiple sections of the grid, the permutation σ satisfying Assumption 2.1 is not necessarily unique. In the subsequent analysis, the parameter K is supposed to be known, in contrast with the μ_i's, which have to be learned by means of an active strategy. As we assume no structure between the grid points, one can easily consider Assumption 2.1 in higher dimensions: simply consider the d-dimensional grid of size K^d. As such, all results presented in this paper extend trivially to higher dimensions.

Policies and fixed confidence regime. We denote the scoring function output by the learner by ŝ ∈ S_K. The way the learner interacts with the environment - i.e. their choice of points to query, how many samples to draw in total and their choice of ŝ ∈ S_K - we term the policy of the learner. We write C for the set of all possible policies of the learner. For a policy π ∈ C and a problem ν ∈ B, where B denotes the class of problems satisfying Assumption 2.1, we denote by τ^π_ν the (random) stopping time of policy π. We write ŝ^π_ν for the scoring function output by policy π on problem ν. Where obvious, we may drop the dependency on π, ν in the notation, referring to the scoring function output by the learner simply as ŝ. We write P_{ν,π} for the distribution of all samples gathered by policy π on problem ν, and similarly define E_{ν,π}. For the duration of this paper we work in the fixed confidence regime. For a confidence level ε, define

    S^ε_K := {s ∈ S_K : d_∞(s, η) ≤ ε}.

A policy π is said to be PAC(δ, ε) (probably approximately correct) on the class of problems B if, for all ν ∈ B, P_{ν,π}[ŝ ∈ S^ε_K] ≥ 1 − δ. The goal of the learner is then to obtain a PAC(δ, ε) policy π such that the expected stopping time in the worst case, sup_{ν ∈ B} E_{ν,π}[τ^π_ν], is minimised.

Defining problem complexity. The expected minimum number of samples a policy must draw on a certain problem ν ∈ B to be PAC(δ, ε) is a quantity which depends upon the features of ν, specifically the shape of the posterior η. When defining our measure of problem complexity we must capture this dependence as succinctly as possible. To build intuition for our definition, we first explore a naive strategy and introduce some informative lemmas. A naive approach to the active bipartite ranking problem is to treat each pair of grid points i, j ∈ [K] as a separate classification problem.
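The setting above is straightforward to simulate. The sketch below (our illustration; the class and helper names are hypothetical) implements a piecewise-constant posterior on the grid, the query-and-observe interaction Y ~ Ber(η(a)), and a scoring function of type (1) induced by a permutation.

```python
import numpy as np

class PiecewiseConstantEnv:
    """Environment of Assumption 2.1: eta is constant on each of the K cells
    of [0, 1); querying a point a returns a label Y ~ Ber(eta(a))."""
    def __init__(self, mu, seed=0):
        self.mu = np.asarray(mu, dtype=float)   # cell means mu_1, ..., mu_K
        self.K = len(self.mu)
        self.rng = np.random.default_rng(seed)

    def cell(self, a):
        """Index of the grid cell [G_i, G_{i+1}) containing a in [0, 1)."""
        return min(int(a * self.K), self.K - 1)

    def query(self, a):
        """One active query: observe Y ~ Ber(eta(a))."""
        return self.rng.uniform() < self.mu[self.cell(a)]

def score_from_permutation(sigma, K):
    """Scoring function of type (1): cell sigma(i) receives rank i."""
    sigma = np.asarray(sigma)
    rank = np.empty(K, dtype=int)
    rank[sigma] = np.arange(1, K + 1)
    return lambda a: rank[min(int(a * K), K - 1)]

env = PiecewiseConstantEnv(mu=[0.1, 0.7, 0.4, 0.9])
y = env.query(0.3)   # Bernoulli draw with mean mu_2 = 0.7 (cell index 1)
```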
To correctly distinguish the situations H^{i,j}_0 : μ_i > μ_j and H^{i,j}_1 : μ_i < μ_j with probability greater than 1 − δ, it is well known (see e.g. Kaufmann et al. (2014)) that, for small δ, the minimum number of samples required is of the order log(1/δ)/kl*(μ_j, μ_i), where we remind the reader that kl* is the Chernoff information, closely related to the kl divergence; see Section 2.1. Thus, if the learner wished to output a scoring function in S*, the sample complexity would be of the order

    Σ_{i ∈ [K]} log(1/δ) / min_{j ∈ [K]} kl*(μ_j, μ_i).

Of course, distinguishing between H^{i,j}_0 and H^{i,j}_1 is impractical when μ_i and μ_j are very close, or even equal. However, in our regime the learner is not required to correctly rank every pair of points i, j ∈ [K], only to output a scoring function belonging to S^ε_K. Intuition indicates that the learner may be indifferent to the ranking within certain groups of points on the grid, as long as their posterior values are sufficiently close. For instance, consider a partition of [0, 1], P = {C_1, C_2, C_3}, an increasing sequence (β_1, β_2, β_3) ∈ [0, 1]³ and

    η(x) = Σ_{i=1}^3 1(x ∈ C_i) β_i,    s(x) = 1(x ∈ C_1) + 2 · 1(x ∈ C_2 ∪ C_3),   (2)

where the scoring function s, essentially, treats all points of C_2, C_3 as of the same rank. See Figure 1 for the ROC curves of η and s. Via a simple calculation, we have

    d_∞(η, s) = (λ(C_3)/p) · ((β_3 − β_2)/2) / (1 − (β_3 + β_2)/2),

which suggests that whether or not there exists a scoring function in S^ε_K which treats the groups C_2, C_3 as of the same rank depends upon two things: the size of the groups and also their position on the ROC curve. Specifically, if β_3 − β_2 ≥ 2εp(1 − (β_3 + β_2)/2)/λ(C_3), then s ∉ S^ε_K. The following lemma formalises this intuition; the proof follows via a direct generalisation of the above example.

Lemma 2.2. Let Δ > 0, i ∈ [K] and define S^{(i,Δ)}_K as the set of scoring functions such that, for all s ∈ S^{(i,Δ)}_K, one has that, for all j with |μ_j − μ_i| ≥ Δ, sign(s(i) − s(j)) = sign(μ_i − μ_j). There exist s ∈ S^{(i,Δ)}_K and ν ∈ B such that, on problem ν, d_∞(η, s) is of the order Δ |{j : |μ_i − μ_j| ≤ Δ}| / (Kp(1 − μ_i)).

Lemma 2.2 suggests that, for i ∈ [K], the learner must at least be able to distinguish H^{i,j}_0 vs H^{i,j}_1 for all j with |μ_i − μ_j| ≥ Δ_i, where

    Δ_i := max{x > 0 : Σ_{j ≠ i} x · 1{|μ_i − μ_j| ≤ x} < Kεp(1 − μ_i)}.

The following lemma shows that Δ_i is not only an upper bound on the necessary order of the confidence level around μ_i but also a lower bound; its proof can be found within the proof of Theorem 3.2.

Lemma 2.3. For a problem ν ∈ B, let s ∈ S_K be a scoring function such that the following holds: for all i ∈ [K] and all j with |μ_j − μ_i| ≥ Δ_i/4, sign(s(i) − s(j)) = sign(μ_i − μ_j). Then s ∈ S^ε_K.

Figure 1: The ROC curves of η and s, as defined in Equation (2), with (β_1, β_2, β_3) = (0.1, 0.2, 0.6) and C_1, C_2, C_3 equally sized.

Figure 2: The complexity of points i ∈ [K], for η as defined in Scenario 1 of the experiments; see Section 4.

Under the conditions where Kεp > 1 and, for a grid point i ∈ [K], μ_i is close to one, a certain phenomenon occurs. Specifically, when Δ_i ≥ 1 − μ_i, for the learner to know the value of μ_i up to an error of Δ_i is no longer sufficient, as in such a case, to the best of the learner's knowledge, the value of μ_i may be arbitrarily close to 1, and thus have an arbitrarily large effect on the regret. With this in mind, we define Δ̄_i as follows:

    Δ̄_i := max{x > 0 : Σ_{j ≠ i} x · 1{|μ_i − μ_j| ≤ x} < min(Kεp(1 − μ_i), 1 − μ_i)}.

Note that in the case where Kεp < 1, we have Δ̄_i = Δ_i. In the case where μ_i ≥ Δ̄_i, we thus define the complexity at a point i ∈ [K] as

    H^{(1)}_i = max(1/kl(μ_i, μ_i + Δ̄_i), 1/kl(μ_i, μ_i − Δ̄_i));

if μ_i ≤ Δ̄_i, H^{(1)}_i = 1/kl(μ_i, μ_i + Δ̄_i).
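Under our reading of the definitions above, the gaps Δ̄_i and complexities H^{(1)}_i can be computed directly from (μ_i) and ε. The sketch below does so numerically; the geometric shrinkage of x is a crude stand-in for the exact maximisation, and all names are ours.

```python
import numpy as np

def kl_bern(a, b, eps=1e-12):
    """kl divergence between Ber(a) and Ber(b), with clipping for stability."""
    a, b = np.clip(a, eps, 1 - eps), np.clip(b, eps, 1 - eps)
    return a * np.log(a / b) + (1 - a) * np.log((1 - a) / (1 - b))

def gaps_and_complexity(mu, epsilon):
    """Truncated gaps bar-Delta_i and complexities H^(1)_i for a
    piecewise-constant posterior (our reading of the definitions)."""
    mu = np.asarray(mu, dtype=float)
    K = len(mu)
    p = mu.mean()                                   # p = (1/K) sum_i mu_i
    bar_delta = np.empty(K)
    for i in range(K):
        budget = min(K * epsilon * p * (1 - mu[i]), 1 - mu[i])
        x = 1.0
        # approximate the largest x with x * #{j != i : |mu_i - mu_j| <= x} < budget
        while x * np.sum(np.abs(mu - mu[i])[np.arange(K) != i] <= x) >= budget:
            x *= 0.99
        bar_delta[i] = x
    # H^(1)_i: two-sided when mu_i >= bar-Delta_i, one-sided otherwise
    lo = np.where(mu > bar_delta, 1.0 / kl_bern(mu, mu - bar_delta), 0.0)
    hi = 1.0 / kl_bern(mu, np.minimum(mu + bar_delta, 1 - 1e-9))
    return bar_delta, np.maximum(lo, hi)
```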
For a problem ν ∈ B, the total problem complexity is then given as Σ_{i ∈ [K]} H^{(1)}_i. As an illustrative example, see Figure 2 for the complexity of each i ∈ [K], with η defined as in Scenario 1 of the experiments; see Section 4. There is a natural comparison to multi-armed bandits, where the problem complexity is typically given as the summation of the individual complexities of the arms. However, in most multi-armed settings the complexity of a single arm depends upon its distance to a single other arm, e.g. the optimal arm, whereas in our setting the complexity of a single grid point i ∈ [K] has a more complex dependency on the shape of the posterior around G_i.

2.3 Performance of the passive approach

At this point we can consider how a uniform sampling strategy would perform, that is, the learner simply draws a sample from each section of the grid in turn. This is essentially a passive approach, in line with the classical batch setting. For a uniform/passive sampling strategy to be PAC(ε, δ), one would have to draw samples until the width of the confidence interval at every point i is less than Δ̄_i. Therefore, a uniform sampling strategy, with an appropriate stopping rule, would have the following tight upper bound on its expected sampling time, up to log terms:

    cK max_{i ∈ [K]} 1/kl(μ_i, μ_i + c'Δ̄_i),

for some absolute constants c, c' > 0. The improvement we hope to gain in the active setting is to replace the max with a weighted summation across the grid. Thus, in settings where the Δ̄_i are relatively constant across large sections of the grid, the theoretical performance of a passive approach can be close to optimal. In contrast, in cases where a very small section of the interval is hard to rank and the rest is easy - i.e. K max_{i ∈ [K]} 1/kl(μ_i, μ_i + Δ̄_i) is much greater than Σ_{i ∈ [K]} 1/kl(μ_i, μ_i + Δ̄_i) - a passive approach will fail. Such settings involve large K where the gaps Δ̄_i on the majority of cells are large, with a relatively small number of cells with small gaps Δ̄_i. Incidentally, we point out that this corresponds to many situations of interest in practice (in information retrieval, for a specific request, the vast majority of the documents are equally irrelevant, while the ranking of a very small fraction of relevant documents is challenging; the same phenomenon is also observed in credit-risk screening). In these cases the benefit to the practitioner will be that they quickly focus on the interesting sections of the feature space.
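The claimed separation between passive and active sampling is easy to see numerically. The sketch below (a hypothetical gap profile of our choosing, not one of the paper's scenarios) compares K · max_i 1/kl(μ_i, μ_i + Δ̄_i) with Σ_i 1/kl(μ_i, μ_i + Δ̄_i) when a single cell is hard to rank.

```python
import numpy as np

def kl_bern(a, b):
    a, b = np.clip(a, 1e-12, 1 - 1e-12), np.clip(b, 1e-12, 1 - 1e-12)
    return a * np.log(a / b) + (1 - a) * np.log((1 - a) / (1 - b))

# Hypothetical gap profile: one hard cell (tiny gap), the rest easy.
K = 1000
gaps = np.full(K, 0.3)        # easy cells: large gaps bar-Delta_i
gaps[0] = 0.01                # a single hard-to-rank cell
mu = np.full(K, 0.5)
per_cell = 1.0 / kl_bern(mu, mu + gaps)

passive = K * per_cell.max()  # uniform sampling pays the worst cell K times over
active = per_cell.sum()       # the active bound pays each cell its own price
print(passive / active)       # ratio of a few hundred for this profile
```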
2.4 Related literature in active learning

While, to the best of our knowledge, bipartite ranking has not yet been considered under active learning, there are several related settings. Firstly, it is important to note that, as we assume the posterior η is piecewise constant on the grid of size K, we can view our problem as a K-armed bandit. In the case where K = 2, the bipartite ranking problem becomes akin to best arm identification (BAI) for the two-armed bandit, also known as A/B testing. In BAI for the two-armed bandit, the learner sequentially draws samples from two distributions ν_A, ν_B with respective means μ_A, μ_B. Their objective is then to carry out the hypothesis test H_0 : μ_A ≤ μ_B vs H_1 : μ_A > μ_B in as few samples as possible. A/B testing is considered in an active, fixed confidence regime in Kaufmann and Kalyanakrishnan (2013), wherein they prove a lower bound of the order log(1/δ)/kl(μ_A, μ_B) on the expected sampling time of any PAC(ε, δ) policy. Note that, in the case where K = 2, active-rank matches said lower bound up to logarithmic terms.

In the fixed confidence regime, the BAI problem has also been generalised to larger K > 2 - see Garivier and Kaufmann (2016), Jamieson and Talwalkar (2016) - along with the Top-M problem, where the learner must output the M best arms - see Kaufmann et al. (2014), Kalyanakrishnan et al. (2012); however, for K > 2, neither the BAI nor the Top-M problem is comparable to our setting. Aside from BAI and the Top-M problem, there are several other settings in active learning that, while not directly comparable to our own, are worth mentioning. The first is active clustering. Several works have considered clustering in an online framework; see Choromanska and Monteleoni (2012), Liberty et al. (2016), Cohen-Addad et al. (2021) and Khaleghi et al. (2012). In the above works, new observations from certain arms become available to the learner at each time step; however, the learner does not actively choose which arms to pull, and therefore the flavour of the above literature is very different to our setting. Much closer is the work of Yang et al. (2022), in which the authors consider an active clustering problem, represented as a K-armed d-dimensional bandit where the arms are split into M clusters. They work in a PAC(δ) setup where their goal is to recover the entire clustering of the arms, with probability greater than 1 − δ, in as few samples as possible. Comparing to our setting, if one views a section of the grid on which η is constant as a single cluster, by retrieving the clustering of the arms one can then easily do ranking. Their results differ from our own in several key ways though. Firstly, their algorithm takes the number of clusters M as a parameter; this highlights the main difference between their setting and our own. In the bipartite ranking problem, assuming ε is not very small, one does not have to recover exactly all the clusters to ensure the regret under the d_∞ norm is less than ε. Therefore we do not need to know the number of clusters, and our algorithm must be able to exploit larger ε to achieve smaller stopping times. The second key difference is that the results of Yang et al. (2022) hold only asymptotically, that is, as δ → 0. Their algorithm employs a forced exploration phase, which ensures each arm is pulled at at least a sub-linear rate. Essentially, this means that in such an asymptotic setting the means of the arms are known to the learner, which naturally drastically changes the nature of their results. An extension to bounds holding for fixed δ > 0 would be non-trivial - it is noted as potential future work in Yang et al. (2022) - and essential if one were to compare to our fixed confidence setting.

Also of note is active multi-class classification. Krishnamurthy et al. (2017) consider a cost-sensitive classification problem, where the learner receives input examples x ∈ X and cost vectors c ∈ R^K, where c(y) is the cost of predicting label y on x. For each input example received, the learner is able to query a subset of labels. The objective is then to train a classifier with minimal expected loss in as few queries as possible. The results of Krishnamurthy et al. (2017) cannot be directly compared to our own, as in our setting there is no such thing as the cost of a classifier at a given point x ∈ X: the cost of mis-ranking a section of [0, 1] depends on the ranking of the entire feature space.

3 Our results

3.1 The active-rank algorithm

Our algorithm active-rank maintains an active set of grid points across several rounds.
At the beginning of each round, active-rank draws one sample, uniformly, from each grid cell in the active set and, at the end of the round, eliminates points from the active set based on a specific criterion. We track the empirical mean of all samples drawn from the ith cell of the grid, [G_i, G_{i+1}), up to round t, as μ̂^t_i. At the beginning of each round t, for each grid point i ∈ [K] remaining in the active set, we maintain an upper and a lower confidence bound on μ_i, which we term the UCB and LCB index respectively. At time t, for each such grid point i and exploration parameter β(t, δ) : N × [0, 1] → R_+, we define the LCB index

    LCB(t, i) := min{q ∈ [0, μ̂^t_i] : kl(μ̂^t_i, q) ≤ β(t, δ)}

and the UCB index

    UCB(t, i) := max{q ∈ [μ̂^t_i, 1] : kl(μ̂^t_i, q) ≤ β(t, δ)}.

Let S_t denote the active set at the beginning of round t. Via a careful choice of exploration parameter, the following lemma holds; its proof can be found in Section A of the supplementary material.

Lemma 3.1. The event

    E := ∩_t ∩_{i ∈ S_t} {μ_i ∈ [LCB(t, i), UCB(t, i)]}

occurs with probability greater than 1 − δ.

For i ∈ [K], time t and z ∈ R, define

    U_{i,t}(z) := {j ∈ S_t : j ≠ i, |μ̂^t_i − μ̂^t_j| ≤ z}.

Following Lemma 2.2, our intuition would be to remove a point i from the active set as soon as

    |U_{i,t}(UCB(t, i) − LCB(t, i))| ≤ cKεp(1 − μ̂^t_i) / |UCB(t, i) − LCB(t, i)|,

for some well-chosen constant c > 0. However, due to the technical difficulty of the proof, we make the following concession. For t > 0, let S_t be the active set at time t, and let Δ(t) = max_{i ∈ S_t}(UCB(t, i) − LCB(t, i)). Furthermore, at time t, define the set of grid points

    Q_t := {i ∈ [K] : Δ(t) ≤ Kεp̂_t (1 − μ̂^t_i) / (24 |U_{i,t}(6Δ(t))|)}.   (5)

If a point belongs to Q_t, we remove it from the active set. Note that active-rank does not take the average p of the posterior as a parameter. Instead, we show that it is possible to use an estimate p̂_t, which is updated round by round. Elimination algorithms such as active-rank have seen wide usage in the BAI literature; see Paulson (1964), Mannor and Tsitsiklis (2004), Even-Dar et al. (2002) and Even-Dar et al. (2006). Closer to our work, however, is the Racing algorithm of Kaufmann and Kalyanakrishnan (2013), designed for the Top-M problem, where, as in our approach, the confidence bounds used are based on the kl divergence as opposed to Hoeffding's inequality. Their elimination criterion, however, differs considerably from our own. For simplicity, let us consider the Top-1 problem, that is, BAI - the following arguments extend to the Top-M case. The Racing algorithm of Kaufmann and Kalyanakrishnan (2013) eliminates an arm i ∈ [K] from the active set at time t when the positive gap between the lower confidence bound of the arm with the highest empirical mean and the upper confidence bound at point i is greater than ε. However, due to the global nature of the ranking problem, in our setting the decision to remove a point from the active set is not made based on the distance to another single point. We rather consider a condition on the local smoothness of the posterior around the point i. An additional difficulty that arises here is that the local smoothness around a point can potentially depend upon points no longer in the active set, and once a point has left the active set we essentially have no control over the width of its confidence interval.
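Since kl(μ̂, ·) is monotone on either side of μ̂, the LCB and UCB indices can be computed by bisection. The following is a minimal sketch (ours; the paper does not prescribe an implementation), where `level` stands for the exploration threshold β(t, δ), possibly rescaled by the sample count depending on convention.

```python
import numpy as np

def kl_bern(p, q):
    p = min(max(p, 1e-12), 1 - 1e-12)
    q = min(max(q, 1e-12), 1 - 1e-12)
    return p * np.log(p / q) + (1 - p) * np.log((1 - p) / (1 - q))

def kl_ucb(mu_hat, level, iters=50):
    """UCB(t, i): largest q in [mu_hat, 1] with kl(mu_hat, q) <= level."""
    lo, hi = mu_hat, 1.0
    for _ in range(iters):            # kl(mu_hat, .) is increasing on [mu_hat, 1]
        mid = (lo + hi) / 2
        lo, hi = (mid, hi) if kl_bern(mu_hat, mid) <= level else (lo, mid)
    return lo

def kl_lcb(mu_hat, level, iters=50):
    """LCB(t, i): smallest q in [0, mu_hat] with kl(mu_hat, q) <= level."""
    lo, hi = 0.0, mu_hat
    for _ in range(iters):            # kl(mu_hat, .) is decreasing on [0, mu_hat]
        mid = (lo + hi) / 2
        lo, hi = (lo, mid) if kl_bern(mu_hat, mid) <= level else (mid, hi)
    return hi
```

Compared with Hoeffding-style intervals of fixed width, these kl-based intervals tighten automatically for empirical means near 0 or 1, which is precisely the behaviour the elimination rule (5) exploits.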
Algorithm 1: active-rank
  Initialise: S = [K], t = 1
  repeat
    for i ∈ S do
      Sample a point drawn uniformly from [G_i, G_{i+1})
    end for
    Let p_t be the label of a point drawn uniformly from [0, 1] and update p̂_t = ((t − 1)p̂_{t−1} + p_t)/t
    for i ∈ S do
      if i ∈ Q_t and Δ(t) ≤ p̂_t/4 then
        S = S \ {i}
      end if
    end for
    t = t + 1
  until S = ∅
  Let σ̂_t ∈ S_K be the permutation sorting the list (μ̂^t_i)_{i ∈ [K]} into ascending order.
  Output: ŝ = Σ_{i ∈ [K]} i · 1{x ∈ [G_{σ̂(i)}, G_{σ̂(i)+1})}

3.2 Theoretical bounds

Proving active-rank is PAC(ε, δ) and upper bounding the expected sampling time. Theorem 3.2 demonstrates that our algorithm active-rank is PAC(ε, δ) and provides a problem-dependent upper bound on its expected sampling time. The proof can be found in Section A of the supplementary material. Theorem 3.2 makes no assumption on the posterior η, aside from it being piecewise constant on the grid of size K, i.e. Assumption 2.1. For i ∈ [K], set

    H^{(2)}_i = max_{j ∈ [K]} max(1/kl(μ_j, μ_j + Δ̄_i/8), 1/kl(μ_j, μ_j − Δ̄_i/8)).

Theorem 3.2. For ε, δ > 0 and γ > 480/log(K), with β(t, δ) = c_γ log(t²K²/δ), where c_γ is a constant depending only on γ, on all problems ν ∈ B, on execution of active-rank with output ŝ, we have that d_∞(ŝ, η) ≤ ε with probability greater than 1 − δ. Furthermore, the expected stopping time of active-rank is upper bounded by

    c'_γ Σ_{i ∈ [K]} H^{(2)}_i log(c''_γ H^{(2)}_i K/δ),

where c'_γ, c''_γ are constants depending only on γ.

Lower bound. Theorem 3.3 provides a problem-dependent lower bound on the expected sampling time of any PAC(ε, δ) policy. The proof of Theorem 3.3 can be found in Section B of the supplementary material.

Theorem 3.3. Let ε ∈ [0, 1/4), 0 < δ < 1 − exp(−1/8) and ν ∈ B such that Kεp < 1/8. For any PAC(ε, δ) policy π, there exists a problem ν' ∈ B such that, for all i ∈ [K], Δ'_i ≥ Δ_i/2, where Δ'_i is the gap of the ith grid point on problem ν', and such that the expected stopping time of policy π on problem ν' is bounded as follows:

    E_{ν',π}[τ^π_{ν'}] ≥ c Σ_{i ∈ [K]} H^{(1)'}_i,

where c > 0 is an absolute constant and H^{(1)'}_i is the complexity of point i on problem ν'. Furthermore, we have that

    Σ_{i ∈ [K]} H^{(1)'}_i ≥ c' Σ_{i ∈ [K]} H^{(1)}_i,

where c' > 0 is an absolute constant.

The proof of Theorem 3.3 follows from a novel application of a Fano-type inequality on a well-chosen set of problems.

Gap between upper and lower bound. There are essentially two components in the gap between the bounds of Theorems 3.3 and 3.2. The first is the additional logarithmic dependency upon K present in our upper bound. Despite being logarithmic, this dependence is potentially significant, as in practical situations the size of grid needed for the assumption that the posterior η is piecewise constant may be very large. The second component in the gap between upper and lower bounds is the difference between the H^{(1)}_i and H^{(2)}_i terms. The reason H^{(2)}_i appears in Theorem 3.2 is that the decision to remove a point i ∈ [K] from our active set is made based on the maximal width of the confidence intervals across the entire grid, Δ(t), as opposed to the local width at i, UCB(t, i) − LCB(t, i); see Equation (5). As we are dealing with Bernoulli distributions and kl-divergence-based confidence bounds, for a fixed number of samples, points close to zero or one will have tighter confidence bounds and thus may be sampled more than is necessary. If one were to assume that the posterior η takes values solely in the interval [γ, 1 − γ] for some γ > 0, then for all i ∈ [K], H^{(2)}_i and H^{(1)}_i would be within a constant factor of each other, with that constant depending on γ. In the authors' opinion, both the logarithmic dependency on K and the usage of Δ(t) may be removed. However, this would require several non-trivial modifications to the proof of Theorem 3.2; see the discussion in Section A of the supplementary material for details. As, to the best of our knowledge, this is the first work to consider the bipartite ranking problem in an active learning setting, we present Theorem 3.2 as it stands and leave the aforementioned improvements for future work.
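For completeness, here is a compact runnable sketch of Algorithm 1 in Python (our transcription, not the authors' code: the constant 24 in the elimination test, the Δ(t) ≤ p̂_t/4 check and the β(t, δ)/t scaling of the confidence level follow our reading of the text, and the theoretical constant c_γ is replaced by a free parameter c). It returns the permutation sorting the cells rather than the scoring function itself.

```python
import numpy as np

def kl_bern(p, q):
    p = np.clip(p, 1e-12, 1 - 1e-12); q = np.clip(q, 1e-12, 1 - 1e-12)
    return p * np.log(p / q) + (1 - p) * np.log((1 - p) / (1 - q))

def kl_bound(mu_hat, level, upper):
    """Bisection for the kl-based UCB (upper=True) or LCB (upper=False)."""
    lo, hi = (mu_hat, 1.0) if upper else (0.0, mu_hat)
    for _ in range(50):
        mid = (lo + hi) / 2
        ok = kl_bern(mu_hat, mid) <= level
        if upper: lo, hi = (mid, hi) if ok else (lo, mid)
        else:     lo, hi = (lo, mid) if ok else (mid, hi)
    return lo if upper else hi

def active_rank(mu, epsilon, delta, c=1.0, seed=0, max_rounds=100_000):
    """Sketch of Algorithm 1 under Assumption 2.1; `mu` plays the unknown posterior."""
    mu = np.asarray(mu, float)
    K = len(mu)
    rng = np.random.default_rng(seed)
    S = set(range(K))
    counts, sums = np.zeros(K), np.zeros(K)
    p_hat, t = 0.0, 0
    while S and t < max_rounds:
        t += 1
        for i in S:                            # one uniform draw per active cell
            sums[i] += rng.uniform() < mu[i]
            counts[i] += 1
        j = rng.integers(K)                    # label of a uniform point on [0, 1]
        p_hat += ((rng.uniform() < mu[j]) - p_hat) / t
        level = c * np.log(max(t, 2) ** 2 * K ** 2 / delta) / t
        mu_hat = sums / np.maximum(counts, 1)
        ucb = {i: kl_bound(mu_hat[i], level, True) for i in S}
        lcb = {i: kl_bound(mu_hat[i], level, False) for i in S}
        width = max(ucb[i] - lcb[i] for i in S)          # Delta(t)
        for i in list(S):
            U = sum(1 for k in S if k != i
                    and abs(mu_hat[i] - mu_hat[k]) <= 6 * width)
            if (width <= K * epsilon * p_hat * (1 - mu_hat[i]) / (24 * max(U, 1))
                    and width <= p_hat / 4):
                S.discard(i)                   # i is in Q_t: stop sampling it
    return np.argsort(mu_hat)                  # permutation sorting the cells
```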
4 Experiments

In this section we discuss practical cases based on synthetic data. For all experiments, δ is fixed at 0.01 and the constants used are smaller than their theoretical counterparts, which are typically overestimated; furthermore, p̂ is computed from all previous samples. As represented in Figure 3, each cell i is assigned a level value μ_i, so that η satisfies Assumption 2.1. Without loss of generality, we assume that η can be described by an increasing family (μ_i)_{i ∈ [K]}. Our study scenarios are then as follows:

Scenario 1: (μ_i)_{i ∈ [K]} takes the values (0, 0.28, 0.3, 0.38) and K = 16;
Scenario 2: (μ_i)_{i ∈ [K]} = (0.8((i − 1)/K)^4 + 0.1)_{i ∈ [K]} and K = 64;
Scenario 3: (μ_i)_{i ∈ [K]} is sub-sampled (with replacement) from (((i − 1)/100)^4)_{i ∈ [100]} and K = 64;
Scenario 4: μ_i = 0.8((i − 1)/K) + 0.1 for i ∈ [K] \ {7, 8}, with μ_7 = 0.8(6/K) + 0.3, μ_8 = 0.8(7/K) − 0.1, and K = 16.

The objective of these scenarios is to evaluate the capacity of the algorithm in different cases. As shown in Figure 3, Scenarios 1 and 3 have variable jumps and cell sizes.

Competing algorithms. To our knowledge there is no algorithm dealing with active learning for the bipartite ranking problem; we thus compare to the following naive approaches.
- Passive rank: each new point a_t is drawn uniformly on [0, 1].
- Naive rank: each new point a_t is sampled from the cell i_t := arg max_{i ∈ [K]} (UCB(t, i) − LCB(t, i)); this algorithm reduces the uncertainty in an undifferentiated way, without considering the problem as global (requiring peer-to-peer comparison).
- Active classification: for this algorithm we consider binary classification with threshold 0.5. The set of active cells, as defined in Algorithm 1, is then S_t = {i ∈ [K] : 0.5 ∈ [LCB(t, i), UCB(t, i)]} ∪ {arg min_{i ∈ [K]} |0.5 − μ̂^t_i|}.

As the competing algorithms do not output a stopping time, for a single η, algorithm active-rank is run across several values of ε, following a geometric sequence of common ratio 0.99 and initial value 1. The values of ε are then plotted against the respective stopping times of active-rank.

Figure 3: The different scenarios chosen for the experiments.

Figure 4: Regret estimated by Monte Carlo over 100 realizations of each algorithm, corresponding to the scenarios of Figure 3 respectively.

Interpretation of results. On Scenario 1, the simplest case, and to an extent Scenario 2, active-rank and passive rank suffer near-identical regret for larger sample sizes. On all other scenarios, active-rank outperforms all competitors. The fact that active-rank does not reach extremely small values of regret suggests that an analysis at much higher sample sizes may be interesting; however, this creates issues in computation time, and the same goes for larger K. Also, the uniform approach still performs relatively well and appears difficult to fool. Some more work may be needed to find a setting in which uniform sampling suffers considerably.
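For reproducibility, the four scenarios can be generated as follows. This is our reconstruction: in Scenario 1 we assume the four listed values are each repeated over K/4 = 4 cells, and in Scenario 3 we normalise by 100 so that the sub-sampled values stay in [0, 1]; both are assumptions, as the original specification is ambiguous.

```python
import numpy as np

rng = np.random.default_rng(0)

def scenario(k):
    """Posterior level values (mu_i) for the four synthetic scenarios (our reading)."""
    if k == 1:                          # K = 16: four plateau values, assumed repeated
        return np.repeat([0.0, 0.28, 0.3, 0.38], 4)
    if k == 2:                          # K = 64: smooth quartic ramp
        K = 64; i = np.arange(K)
        return 0.8 * (i / K) ** 4 + 0.1
    if k == 3:                          # K = 64: resampled quartic, sorted increasing
        base = (np.arange(100) / 100) ** 4
        return np.sort(rng.choice(base, size=64, replace=True))
    K = 16; i = np.arange(K)            # scenario 4: linear ramp with a local inversion
    mu = 0.8 * (i / K) + 0.1
    mu[6] = 0.8 * (6 / K) + 0.3        # cell 7 (index 6) pushed up
    mu[7] = 0.8 * (7 / K) - 0.1        # cell 8 (index 7) pushed down
    return mu
```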
5 Conclusion

To the best of our knowledge, we have developed the first rigorous framework for the active bipartite ranking problem, and our algorithm, active-rank, is the first to tackle said problem. Our upper bound on the performance of active-rank matches our lower bound up to logarithmic terms, in the case where the posterior η is not very close to 0 or 1 at any point. As well as theoretical guarantees, we have demonstrated good practical performance of active-rank on synthetic data, in various settings. We conclude with some perspectives for future research. An obvious path for future research is to replace Assumption 2.1 with a smoothness assumption on the posterior η, e.g. a Hölder condition. The setting would then be equivalent to a continuous-armed bandit as opposed to a finite-armed bandit. Assuming the learner has knowledge of the Hölder coefficient, a standard approach in continuous-armed bandits is to first discretise and then apply classic techniques from finite-armed bandits, carefully choosing the discretisation level to balance the discretisation error and the classical regret. It is our opinion that such an approach would not be sufficient in our case. We conjecture that, to achieve optimal or near-optimal performance, the learner must vary the level of discretisation across the feature space, based on the flatness of the posterior function η and the placement on the ROC curve. Also, under such a Hölder condition, the extension to higher dimensions would no longer be immediate.

References

Agarwal, S., Graepel, T., Herbrich, R., Har-Peled, S., and Roth, D. (2005). Generalization bounds for the area under the ROC curve. J. Mach. Learn. Res., 6:393-425.

Choromanska, A. and Monteleoni, C. (2012). Online clustering with experts. In Artificial Intelligence and Statistics, pages 227-235. PMLR.

Clémençon, S. and Robbiano, S. (2011). Minimax learning rates for bipartite ranking and plug-in rules. In Proceedings of ICML.

Clémençon, S. and Vayatis, N. (2008). Overlaying classifiers: a practical approach for optimal scoring. Constructive Approximation.

Clémençon, S. and Vayatis, N. (2009). Tree-based ranking methods. IEEE Transactions on Information Theory, 55(9):4316-4336.

Clémençon, S., Lugosi, G., and Vayatis, N. (2008). Ranking and empirical minimization of U-statistics. The Annals of Statistics, 36(2):844-874.

Clemencon, S. and Vayatis, N. (2009). On partitioning rules for bipartite ranking. In Proceedings of the Twelfth International Conference on Artificial Intelligence and Statistics, volume 5 of Proceedings of Machine Learning Research, pages 97-104. PMLR.

Cohen-Addad, V., Guedj, B., Kanade, V., and Rom, G. (2021). Online k-means clustering. In International Conference on Artificial Intelligence and Statistics, pages 1126-1134. PMLR.

Even-Dar, E., Mannor, S., and Mansour, Y. (2002). PAC bounds for multi-armed bandit and Markov decision processes. In International Conference on Computational Learning Theory, pages 255-270. Springer.

Even-Dar, E., Mannor, S., and Mansour, Y. (2006). Action elimination and stopping conditions for the multi-armed bandit and reinforcement learning problems. Journal of Machine Learning Research, 7(Jun):1079-1105.

Garivier, A. and Kaufmann, E. (2016). Optimal best arm identification with fixed confidence. In Conference on Learning Theory, pages 998-1027. PMLR.

Jamieson, K. and Talwalkar, A. (2016). Non-stochastic best arm identification and hyperparameter optimization. In Artificial Intelligence and Statistics, pages 240-248. PMLR.

Kalyanakrishnan, S., Tewari, A., Auer, P., and Stone, P. (2012). PAC subset selection in stochastic multi-armed bandits. In ICML, volume 12, pages 655-662.
Kaufmann, E., Cappé, O., and Garivier, A. (2014). On the complexity of A/B testing. In Conference on Learning Theory, pages 461-481. PMLR.

Kaufmann, E. and Kalyanakrishnan, S. (2013). Information complexity in bandit subset selection. In Conference on Learning Theory, pages 228-251. PMLR.

Khaleghi, A., Ryabko, D., Mary, J., and Preux, P. (2012). Online clustering of processes. In Artificial Intelligence and Statistics, pages 601-609. PMLR.

Krishnamurthy, A., Agarwal, A., Huang, T., Daumé III, H., and Langford, J. (2017). Active learning for cost-sensitive classification. CoRR, abs/1703.01014.

Liberty, E., Sriharsha, R., and Sviridenko, M. (2016). An algorithm for online k-means clustering. In 2016 Proceedings of the Eighteenth Workshop on Algorithm Engineering and Experiments (ALENEX), pages 81-89. SIAM.

Mannor, S. and Tsitsiklis, J. N. (2004). The sample complexity of exploration in the multi-armed bandit problem. Journal of Machine Learning Research, 5(Jun):623-648.

Menon, A. and Williamson, R. (2016). Bipartite ranking: A risk theoretic perspective. Journal of Machine Learning Research.

Paulson, E. (1964). A sequential procedure for selecting the population with the largest mean from k normal populations. The Annals of Mathematical Statistics, 35(1):174-180.

Yang, J., Zhong, Z., and Tan, V. Y. (2022). Optimal clustering with bandit feedback. arXiv preprint arXiv:2202.04294.

A Proof of Theorem 3.2

Before the proof of Theorem 3.2 we prove some initial lemmas, the first of which is Lemma 3.1, which lower bounds the probability of our favourable event E by 1 − δ. We prove a slightly extended version of Lemma 3.1. At time t, define the LCB index of p̂ as

    LCB(t, 0) := min{q ∈ [0, p̂_t] : kl(p̂_t, q) ≤ β(t, δ)}

and the UCB index as

    UCB(t, 0) := max{q ∈ [p̂_t, 1] : kl(p̂_t, q) ≤ β(t, δ)}.

Note that, as p̂_t ∈ [min_i(μ̂^t_i), max_i(μ̂^t_i)], we have that UCB(t, 0) − LCB(t, 0) ≤ Δ(t). The extended version of Lemma 3.1 is then as follows.

Lemma A.1. The event

    E := ∩_t ∩_{i ∈ S_t ∪ {0}} {μ_i ∈ [LCB(t, i), UCB(t, i)]},

with the convention μ_0 = p, occurs with probability greater than 1 − δ.

Proof. Via Chernoff's inequality, for i ∈ [K] ∪ {0}, at time t we have that

    P(LCB(t, i) ≥ μ_i) ≤ exp(−β(t, δ))  and  P(UCB(t, i) ≤ μ_i) ≤ exp(−β(t, δ)).

It then remains to note that, via our choice of exploration parameter and a union bound over all i ∈ [K] ∪ {0} and all rounds t, the sum of the terms exp(−β(t, δ)) is at most δ, and the result follows.

The following proposition shows that, on rounds in which a point is removed from the active set, our estimate p̂ remains within a constant factor of the true p.

Proposition A.2. On event E we have that, for all rounds t on which a point is removed from the active set, 2p/3 ≤ p̂_t ≤ 4p/3.

Proof. Let t be a round on which a point is removed from the active set. On event E, for all i ∈ [K], |μ_i − μ̂^t_i| ≤ Δ(t); we thus have |p − p̂_t| ≤ Δ(t) which, in combination with the fact that Δ(t) ≤ p̂_t/4 on such a round, implies Δ(t) ≤ (p + Δ(t))/4 and thus Δ(t) ≤ p/3; the claim follows since p̂_t ∈ [p − Δ(t), p + Δ(t)].

For the purposes of the proof we split Theorem 3.2 into two parts. The first part is covered by the following lemma, which states that active-rank is PAC(ε, δ).

Lemma A.3. For ε, δ > 0 with 1/K ≥ εp, on all problems ν ∈ B, on execution of active-rank with output ŝ, we have that d_∞(ŝ, η) ≤ ε with probability greater than 1 − δ.

Proof. For the remainder of this proof we work under the event

    E = ∩_t ∩_{i ∈ [K]} {μ_i ∈ [LCB(t, i), UCB(t, i)]}.

Proposition A.4. Let i ∈ [K] be removed from S_t at time t, such that {j : |μ_i − μ_j| ≤ Δ_i} ⊂ S_t. Then, for all j with |μ_i − μ_j| ≤ Δ_j, we have that, for all k ∈ S_t with |μ_k − μ_j| ≥ Δ_j, sign(μ̂^τ_j − μ̂^τ_k) = sign(μ_j − μ_k).

Proof. Firstly note that, for all s ≥ t, Δ(s) ≤ Δ(t).
Thus, for all j ∈ S_t, |μ̂^τ_j − μ_j| ≤ Δ(t), and therefore for any point j with 4Δ(t) ≤ Δ_j we have that, for all k ∈ S_t with |μ_j − μ_k| ≥ Δ_j, sign(μ̂^τ_j − μ̂^τ_k) = sign(μ_j − μ_k). Thus, to prove our result, we must show that, for all j with |μ_i − μ_j| ≤ Δ_j, we have Δ_j ≥ 4Δ(t). Now, if i is removed from S_t at time t, we have that

    Δ(t) ≤ Kεp̂_t |U_{i,t}(6Δ(t))|^{-1} (1 − μ̂^t_i)/24,   (8)

and therefore, via Proposition A.2,

    |U_{i,t}(6Δ(t))| ≤ Kεp̂_t(1 − μ̂^t_i)/(24Δ(t)) ≤ Kεp(1 − μ̂^t_i)/(12Δ(t)) ≤ Kεp(1 − μ_i)/(6Δ(t)),

and we have that 6Δ(t) ≤ Δ_i. Now let j be such that |μ_j − μ_i| ≤ 4Δ(t); we then have |μ̂^t_j − μ̂^t_i| ≤ 2Δ(t) and thus U_{j,t}(4Δ(t)) ⊂ U_{i,t}(6Δ(t)), and furthermore Δ(t) ≤ (1 − μ̂^t_j)/12. Therefore, via Proposition A.2,

    |U_{j,t}(4Δ(t))| ≤ Kεp̂_t(1 − μ̂^t_i)/(24Δ(t)) ≤ Kεp(1 − μ̂^t_i)/(12Δ(t)) ≤ Kεp(1 − μ_j)/(4Δ(t)),

and as a result Δ_j ≥ 4Δ(t). Now take j such that |μ_j − μ_i| ≥ 4Δ(t). If Δ_j ≥ 4Δ(t) we are done; thus assume Δ_j ≤ 4Δ(t). In this case we have |μ_i − μ_j| ≥ Δ_j. Thus we have shown that, for a point j ∈ S_t, if |μ_i − μ_j| ≤ Δ_j, then Δ(t) ≤ Δ_j/4, and the result follows.

Proposition A.5. At time t, let i ∈ S_t. We have that, for all j ∈ [K] \ S_t with |μ_i − μ_j| ≥ Δ_i,

    sign(μ̂^τ_j − μ̂^τ_i) = sign(μ_j − μ_i),

and furthermore, for all j such that {k : |μ_j − μ_k| ≤ Δ_j} ⊄ S_t, for all k ∈ [K] with |μ_k − μ_j| ≥ Δ_j,

    sign(μ̂^τ_j − μ̂^τ_k) = sign(μ_j − μ_k).

Proof. The proof follows by induction. Assume at time t that, for i ∈ S_{t−1}, for all j ∈ [K] \ S_{t−1} with |μ_i − μ_j| ≥ Δ_i, sign(μ̂^τ_j − μ̂^τ_i) = sign(μ_j − μ_i), and that, for all j ∈ [K] such that {k : |μ_j − μ_k| ≤ Δ_j} ⊄ S_{t−1}, for all k with |μ_k − μ_j| ≥ Δ_j, sign(μ̂^τ_j − μ̂^τ_k) = sign(μ_j − μ_k). Now let i ∈ S_t and j ∈ [K] \ S_t with |μ_i − μ_j| ≥ Δ_i; we must show that sign(μ̂^τ_j − μ̂^τ_i) = sign(μ_j − μ_i), together with the corresponding second statement. For the first statement, assume j ∈ S_{t−1} and that {k : |μ_j − μ_k| ≤ Δ_j} ⊂ S_{t−1}, as otherwise we are done via the inductive assumption. The point j is then removed from the active set at the end of round t − 1, and the statement thus follows via Proposition A.4. For the second statement, for j ∈ S_{t−1}, assume {k : |μ_j − μ_k| ≤ Δ_j} ⊂ S_{t−1}, as otherwise we are done via the inductive assumption. Therefore, there must exist a k, removed from the active set at the end of round t − 1, such that |μ_k − μ_j| ≤ Δ_k. The result then follows via Proposition A.4.

For a set C ⊂ [0, 1], define κ(C) := (1/λ(C)) ∫_C η(x) dx. Let α ∈ [0, 1] and define the subset Z_α ⊂ [0, 1] such that P(X ∈ Z_α | Y = −1) = α, that is, λ(Z_α)(1 − κ(Z_α)) = α(1 − p), and such that, for some i_α ∈ [K],

    ∀j : μ_j > μ_{i_α}, [G_j, G_{j+1}) ⊂ Z_α,    ∀j : μ_j < μ_{i_α}, [G_j, G_{j+1}) ∩ Z_α = ∅.   (9)

We then have ROC*(α) = P(X ∈ Z_α | Y = +1) = λ(Z_α)κ(Z_α)/p. The choice of Z_α is not necessarily unique and, as η may be constant across multiple sections of the grid, i_α is also not necessarily unique; in this case we take arbitrary Z_α, i_α. Now define the subset Ẑ_α ⊂ [0, 1] such that λ(Ẑ_α)(1 − κ(Ẑ_α)) = α(1 − p) and, for all x ∈ Ẑ_α and y ∉ Ẑ_α, ŝ(x) ≥ ŝ(y), so that ROC(ŝ, α) = λ(Ẑ_α)κ(Ẑ_α)/p. Again, Ẑ_α is not necessarily unique, in which case we choose arbitrarily. Via Propositions A.4 and A.5, we have that

    ∀j : |μ_j − μ_{i_α}| ≥ Δ_{i_α}, sign(μ̂^τ_j − μ̂^τ_{i_α}) = sign(μ_j − μ_{i_α}).   (10)

Set

    Z'_α = {x ∈ Z_α : |η(x) − μ_{i_α}| ≤ Δ_{i_α}},    Ẑ'_α = {x ∈ Ẑ_α : |η(x) − μ_{i_α}| ≤ Δ_{i_α}}.

Via Equation (10), we have that

    ROC(α, η) − ROC(α, ŝ) = (λ(Z'_α)κ(Z'_α) − λ(Ẑ'_α)κ(Ẑ'_α))/p.

Before finalising the proof we must bound λ(Ẑ'_α) and κ(Ẑ'_α). We first bound κ(Ẑ'_α):

    κ(Ẑ'_α) ≥ μ_{i_α} − Δ_{i_α},    κ(Z'_α) ≤ μ_{i_α} + Δ_{i_α}.   (11)

We now lower bound λ(Ẑ'_α):

    λ(Ẑ'_α)/λ(Z'_α) = (1 − κ(Z'_α))/(1 − κ(Ẑ'_α)) ≥ (1 − μ_{i_α} − Δ_{i_α})/(1 − μ_{i_α} + Δ_{i_α}).   (12)
Combining Equations (11) and (12), we have

    (λ(Z'_α)κ(Z'_α) − λ(Ẑ'_α)κ(Ẑ'_α))/p
        ≤ (1/p)(λ(Z'_α)(μ_{i_α} + Δ_{i_α}) − λ(Ẑ'_α)(μ_{i_α} − Δ_{i_α}))   (13)
        ≤ (1/p)(λ(Z'_α)(μ_{i_α} + Δ_{i_α}) − λ(Z'_α)(μ_{i_α} − Δ_{i_α})(1 − μ_{i_α} − Δ_{i_α})/(1 − μ_{i_α} + Δ_{i_α}))   (14)
        ≤ 2λ(Z'_α)Δ_{i_α}/(p(1 − μ_{i_α} − Δ_{i_α}))
        ≤ 2λ(Z'_α)Δ_{i_α}/(p(1 − μ_{i_α})),   (15)

up to an absolute constant. It remains to remark that Δ_{i_α}λ(Z'_α) ≤ εp(1 − μ_{i_α}), by definition of Δ_{i_α}, and thus ROC(α, η) − ROC(α, ŝ) ≤ ε. As we chose α w.l.o.g., the proof then follows.

A.1 Proof of the stopping time for active-rank

We now prove the second part of Theorem 3.2, the upper bound on the expected sampling time of active-rank.

Lemma A.6. For ε, δ > 0, γ > 480/log(K) and 1/K ≥ pε, with β(t, δ) = c_γ log(t²K²/δ), where c_γ is a constant depending only on γ, on all problems ν ∈ B such that Δ_i ≤ 1 − μ_i for all i ∈ [K], we have that, for γ > 1, the expected stopping time of active-rank is upper bounded by

    c'_γ Σ_{i ∈ [K]} H^{(2)}_i log(c''_γ H^{(2)}_i K²/δ),

where c'_γ, c''_γ are constants depending only on γ.

Proof. We upper bound E[τ] as follows. Denote by N_i(t) the number of times the grid cell [G_i, G_{i+1}) has been sampled by the learner, up to and including time t, and let τ_i = N_i(τ). Then

    E[τ_i] ≤ Σ_{t ≥ 1} P(τ_i ≥ t)   (16)
           ≤ Σ_{t ≥ 1} P(i ∉ Q_t).   (17)

Define the event

    ξ_{i,t} := {∀j ∈ S_t, UCB(t, j) ≤ μ_j + Δ_i/96} ∩ {∀j ∈ S_t, LCB(t, j) ≥ μ_j − Δ_i/96} ∩ {p − Δ(t) ≤ p̂_t ≤ p + Δ(t)}.

Proposition A.7. For i ∈ [K] such that Δ_i ≤ 2p and t > 0, we have that, on ξ_{i,t}, i ∈ Q_t.

Proof. First note that, under the event ξ_{i,t}, Δ(t) ≤ Δ_i/96, and thus, for j ∈ U_{i,t}(6Δ(t)),

    |μ_i − μ_j| ≤ Δ_i/96 + Δ_i/96 + 6Δ_i/96 ≤ Δ_i,

thus

    |U_{i,t}(6Δ(t))| ≤ |{j : |μ_i − μ_j| ≤ Δ_i}|,   (18)

and furthermore Δ(t) ≤ p/4, which implies p̂_t ≤ 2p. Now, via the definition of Δ_i,

    Δ_i ≤ Kpε |{j : |μ_i − μ_j| ≤ Δ_i}|^{-1} (1 − μ_i).   (19)

In the case Kpε/|{j : |μ_i − μ_j| ≤ Δ_i}| ≥ 1, we have Δ(t) ≤ ((1 − μ̂^t_i) + Δ(t))/96 ≤ (1 − μ̂^t_i)/48 and, since p̂_t ≥ p − Δ(t) ≥ 2p/3 and |U_{i,t}(6Δ(t))| ≤ |{j : |μ_i − μ_j| ≤ Δ_i}|, we get Kεp̂_t/|U_{i,t}(6Δ(t))| ≥ 2/3, finally leading to

    Δ(t) ≤ Kεp̂_t(1 − μ̂^t_i)/(24|U_{i,t}(6Δ(t))|).

Then let us assume Kpε/|{j : |μ_i − μ_j| ≤ Δ_i}| ≤ 1. In this case,

    Δ(t) ≤ Δ_i/96 ≤ Kεp(1 − μ_i)/(96|U_{i,t}(6Δ(t))|) ≤ Kεp̂_t((1 − μ̂^t_i) + Δ(t))/(48|U_{i,t}(6Δ(t))|) ≤ Kεp̂_t(1 − μ̂^t_i)/(24|U_{i,t}(6Δ(t))|),

where the third inequality comes from the fact that Δ(t) ≤ p/4, which implies p̂_t ≤ 2p.

In what follows, assume i is such that Δ_i ≤ 2p. Combining Proposition A.7 and Equation (17), we have E[τ_i] ≤ Σ_{t ≥ 1} P(ξ^c_{i,t}). We will now upper bound Σ_{t ≥ 1} P(ξ^c_{i,t}). For γ > 1, let

    T^i_0 = min{t : β(t, δ) ≥ t min_{i ∈ [K]} min(kl(μ_i, μ_i + Δ_i/96), kl(μ_i, μ_i − Δ_i/96)) / (log(K)γ)}.

We have that, for all t > T^i_0 and j ∈ S_t,

    P(UCB(t, j) ≥ μ_j + Δ_i/96) ≤ P(kl(μ̂^t_j, μ_j + Δ_i/96) ≤ kl(μ_j, μ_j + Δ_i/96)/(log(K)γ)).   (20)

Set

    r(γ) = {x ∈ (μ_j, μ_j + Δ_i/96) : kl(x, μ_j + Δ_i/96) = kl(μ_j, μ_j + Δ_i/96)/(log(K)γ)}.

Consider the function φ(x) = kl(μ_j + x, μ_j + Δ_i/96) on the interval [0, Δ_i/96]. Via the properties of the kl divergence, φ is convex and φ(Δ_i/96) = 0. As a result, φ(x) ≥ (1 − 96x/Δ_i) kl(μ_j, μ_j + Δ_i/96)/2, which implies

    r(γ) ≥ μ_j + (Δ_i/96)(1 − 2/(log(K)γ))

for log(K)γ > 384. We now have, for all t > T^i_0,

    P(kl(μ̂^t_j, μ_j + Δ_i/96) ≤ kl(μ_j, μ_j + Δ_i/96)/(log(K)γ))
        = P(kl(μ̂^t_j, μ_j + Δ_i/96) ≤ kl(r(γ), μ_j + Δ_i/96))
        = P(μ̂^t_j ≥ r(γ))
        ≤ exp(−t kl(μ_j, r(γ)))
        ≤ exp(−log(K)γ kl(μ_j, r(γ)) β(t, δ)/kl(μ_j, μ_j + Δ_i/96))
        ≤ exp(−log(K)β(t, δ)c_γ),

where c_γ is a constant depending only on γ. Thus, via Equation (20), for t ≥ T^i_0,

    P(UCB(t, j) ≥ μ_j + Δ_i/96) ≤ exp(−log(K)β(t, δ)c_γ);

via similar reasoning we have also that P(LCB(t, j) ≤ μ_j − Δ_i/96) ≤ exp(−log(K)β(t, δ)c_γ), and furthermore P(p̂_t ∉ [p − Δ(t), p + Δ(t)]) ≤ exp(−β(t, δ)). Now, via a union bound,

    Σ_{t > T^i_0} P(ξ^c_{i,t}) ≤ c K exp(−c_γ),   (21)

where c > 0 is an absolute constant. Thus we have shown that, for a point i with Δ_i ≤ 2p,

    E[τ_i] ≤ T^i_0 + c K exp(−c_γ).
It is then straightforward to note that, for any point i with Δ_i ≥ 2p,

    E[τ_i] ≤ T^{i*}_0 + c K exp(−c_γ),

where i* is any point such that Δ_{i*} ≤ 2p. It remains to upper bound T^i_0. Via the definition of the exploration parameter, we have that

    T^i_0 ≤ H^{(2)}_i log(c_γ H^{(2)}_i K/δ),

where c_γ > 0 is a constant depending only on γ. Before continuing, we demonstrate the following lemma.

Lemma A.8. On all problems ν ∈ B, we have that |{i : Δ_i < 2p}| ≥ K/4.

Proof. Firstly assume that |{i : Δ_i ≥ μ_i}| ≥ K/4 and define i* = arg max{Δ_i : Δ_i ≥ μ_i}. We then have that, for all j with Δ_j ≥ μ_j, |μ_{i*} − μ_j| ≤ Δ_{i*}, whence

    Δ_{i*} ≤ Kεp/|{j : |μ_{i*} − μ_j| ≤ Δ_{i*}}| ≤ 4εp ≤ p.

Thus we have |{i : Δ_i < p}| ≥ K/4. Now assume |{i : Δ_i ≥ μ_i}| ≤ K/4. As we must then have |{i : μ_i > 2p}| ≤ K/2, we thus have |{i : Δ_i < 2p}| ≥ K/4.

With Lemma A.8, we can then upper bound E[τ] as follows:

    E[τ] ≤ Σ_{i : Δ_i ≤ 2p} E[τ_i] ≤ 4 Σ_{i ∈ [K]} H^{(2)}_i log(c_γ H^{(2)}_i K/δ) + c'_γ K exp(−c_γ),

and the result follows.

B Proof of Theorem 3.3

Proof. The proof follows by application of a Fano-type inequality on a well-chosen set of problems.

Step 1: constructing our well-chosen set of problems. We define a set of grid points U_0, U_1, ... recursively as follows: U_0 = arg min_i(Δ_i) and, for m ≥ 0,

    U_{m+1} = arg min{Δ_i : ∀j ≤ m, |μ_{U_j} − μ_i| ≥ 3Δ_i + 3Δ_{U_j}}.

Let M be the largest integer for which U_M exists. Note that the sequence (Δ_{U_m})_{m ≤ M} is monotonically increasing and, furthermore, for all i ∈ [K],

    |{m ∈ [M] : |μ_{U_m} − μ_i| ≤ 3(Δ_{U_m} + Δ_i)}| ≤ 2.   (22)

We then define the corresponding set of groups D_0, D_1, ..., D_M as follows. For i ∈ [K], let m, n ∈ [M] be such that μ_{U_m} ≤ μ_i ≤ μ_{U_n}; set i_+ = m ∨ n and i_− = m ∧ n. If |μ_i − μ_{U_{i_+}}| ≤ Δ_{i_−} + Δ_{i_+}, then i ∈ D_{i_+}; otherwise i ∈ D_{i_−}.

Proposition B.1. For all m ∈ [M] we have that, for all i ∈ D_m, Δ_{U_m} ≤ Δ_i.

Proof. W.l.o.g. assume μ_i ≤ μ_{U_m}. Let j be such that |μ_{U_j} − μ_i| ≤ 3Δ_i + 3Δ_{U_j}. Via Equation (22) we have that μ_{U_m} lies between μ_i and μ_{U_j} and, by definition of D_m, D_j, m > j. Thus, for no j < m does |μ_i − μ_{U_j}| ≤ 3Δ_{U_j} + 3Δ_i hold, and therefore, if Δ_i < Δ_{U_m}, then

    arg min{Δ_k : ∀j ≤ m − 1, |μ_{U_j} − μ_k| ≥ 3Δ_k + 3Δ_{U_j}} ≠ U_m,

which is a contradiction via the definition of U_m.

For m ∈ [M], set W_m = {i : |μ_i − μ_{U_m}| ≤ 3Δ_{U_m}}.

Proposition B.2. For all m ∈ [M], |{i ∈ D_m : |μ_i − μ_{U_m}| ≥ 3Δ_{U_m}}| ≤ 36|W_m|.

Proof. Firstly, by definition we have that

    Δ_{U_m} ≥ Kεp(1 − μ_{U_m})/|W_m|.   (23)

Now, let i* = arg max_{i ∈ D_m \ W_m}(Δ_i). As we have that, for i ∈ D_m, |μ_i − μ_{U_m}| ≤ 3Δ_i + 3Δ_{U_m}, the following holds: for all i ∈ D_m \ W_m, |μ_i − μ_{i*}| ≤ 3Δ_{i*} + 3Δ_{U_m}. Now consider a set of adjacent grid points L ⊂ D_m \ W_m such that λ(∪_{i ∈ L}[G_i, G_{i+1})) ≥ Δ_{i*}; for j ∈ L,

    Δ_j ≤ Kεp(1 − μ_j) ≤ Kεp((1 − μ_{U_m}) + 3Δ_j + 3Δ_{U_m}),

where the final inequality comes from the fact that we assume Kεp < 1. Thus, if we assume |L| ≥ 12|W_m|, we have that Δ_j < Kεp(1 − μ_{U_m})/(2|W_m|), a contradiction via Proposition B.1 and Equation (23).

The proof of the following proposition follows via the same argument as in the proof of Proposition B.2.

Proposition B.3. For all m ∈ [M], |W_m| ≤ 3|{i ∈ W_m : |μ_i − μ_{U_m}| ≤ Δ_{U_m}}|.

We are now ready to construct our set of problems. Consider a family of problems ν_Q indexed by Q ∈ {0, 1}^K, where the target function η_Q corresponding to ν_Q is defined as follows. Set the coefficients (β^1_m, β^2_m)_{m ∈ [M]}: for m ∈ [M], β^1_m = μ_{U_m} + (4/3)Δ_{U_m} and β^2_m = μ_{U_m} − (4/3)Δ_{U_m}; also define the sets

    C^Q_{m,1} = ∪_{i ∈ W_m : Q_i = 1} [G_i, G_{i+1}),    C^Q_{m,2} = ∪_{i ∈ W_m : Q_i = 0} [G_i, G_{i+1}).

With C_m = ∪_{i ∈ W_m} [G_i, G_{i+1}), we then have

    η_Q(x) = Σ_{m ∈ [M]} (1(x ∈ C^Q_{m,1}) β^1_m + 1(x ∈ C^Q_{m,2}) β^2_m)  on ∪_m C_m.

Finally, let C^Q_0 = [0, 1] \ ∪_{m ∈ [M]} (C^Q_{m,1} ∪ C^Q_{m,2}), and note that, for all x ∈ C^Q_0, η_Q(x) = 0. The following lemma shows that, for a problem ν ∈ B, the gaps and complexities across our family of problems ν_Q, indexed by Q, do not differ too much from those of ν.

Lemma B.4.
Let Q ∈ {0, 1}^K. For all m ∈ [M] and all i such that [G_i, G_{i+1}) ⊂ C_m, we have that Δ_{U_m}/4 < Δ^{(Q)}_i < 3Δ_{U_m}, where Δ^{(Q)}_i is the gap of point i on problem ν_Q.

Proof. We have immediately that Δ^{(Q)}_i ≤ 3Δ_{U_m}, via the definition of U_m. Then, by Proposition B.3,

    Δ_{U_m}/3 ≤ Kεp(1 − μ_{U_m})/|W_m| = Kεp(1 − μ_{U_m} − 4Δ_{U_m}/3)/|W_m| + 4KεpΔ_{U_m}/(3|W_m|),

which implies, in combination with our assumption Kεp ≤ 1/8, that Δ_{U_m}/6 ≤ Kεp(1 − μ_{U_m} − 4Δ_{U_m}/3)/|W_m|, and thus that, for all i ∈ W_m, Δ^{(Q)}_i ≥ Δ_{U_m}/6.

Lemma B.5. Given ν ∈ B, we have that, for all ν_Q and all i ∈ [K], Δ^{(Q)}_i ≥ Δ_i/2, where Δ^{(Q)}_i is the gap of point i on problem ν_Q. Furthermore, Σ_i H^{(1)}_{i,Q} ≥ c Σ_i H^{(1)}_i, where H^{(1)}_{i,Q} is the complexity of point i on problem ν_Q, for some absolute constant c > 0.

Proof. We prove the first statement. For all i such that [G_i, G_{i+1}) ⊂ C^Q_0, we have that Δ^{(Q)}_i = 1, and otherwise the statement follows from Lemma B.4. The second statement then follows by combining Lemma B.4 with Propositions B.2 and B.1.

Step 2: showing that one suffers regret ε on a well-chosen event. Let Q^i be the transformation of Q that flips the ith coordinate:

    Q^i_a = Q_a if a ≠ i,    Q^i_a = 1 − Q_a if a = i.

We remind the reader that we denote the scoring function output by the learner by ŝ. Now define

    z_m := min{z : H_ŝ(z) ≥ λ(∪_{n=1}^{m−1} C_n ∪ C^Q_{m,1}) (1 − κ(∪_{n=1}^{m−1} C_n ∪ C^Q_{m,1}))/(1 − p)},

define the events

    ξ_{i,m} := {λ({x ∈ [G_i, G_{i+1}) : ŝ(x) > z_m}) ≥ 1/(2K)},

and then the events

    E^m_1 := {Σ_{i : [G_i, G_{i+1}) ⊂ C_m, Q_i = 1} 1(ξ_{i,m}) ≤ Kλ(C_{m,1})/4},
    E^m_0 := {Σ_{i : [G_i, G_{i+1}) ⊂ C_m, Q_i = 0} 1(ξ_{i,m}) ≥ Kλ(C_m)/4}.

For a set C ⊂ [0, 1], recall that κ(C) := (1/λ(C)) ∫_C η(x) dx. Let Ẑ_m ⊂ [0, 1] be the largest set such that, for all x ∈ Ẑ_m and y ∉ Ẑ_m, ŝ(x) ≥ ŝ(y), and

    λ(Ẑ_m)(1 − κ(Ẑ_m)) ≤ λ(∪_{n=1}^{m−1} C_n ∪ C_{m,1}) (1 − κ(∪_{n=1}^{m−1} C_n ∪ C_{m,1})).

Note that Ẑ_m is not necessarily unique, in which case we choose an arbitrary such Ẑ_m. Furthermore, define

    Ẑ^1_m = {x ∈ Ẑ_m : x ∈ C_{m,1}},    Ẑ^2_m = {x ∈ Ẑ_m : x ∈ C_{m,2}},

and let Ĝ_m = ∪_{n=1}^{m−1} C_n \ Ẑ^0_m, where Ẑ^0_m = {x ∈ Ẑ_m : x ∈ ∪_{n=1}^{m−1} C_n}. Note that, under the event E^m_1, we have

    λ(Ẑ^1_m) ≤ λ(C_{m,1})/4.   (27)

Now, via the definition of Ẑ_m, we have the following:

    λ(Ẑ^1_m)(1 − κ(C_{m,1})) + λ(Ẑ^2_m)(1 − κ(C_{m,2})) ≥ λ(C_{m,1})(1 − κ(C_{m,1})) + λ(Ĝ_m)(1 − κ(Ĝ_m)).   (28)

Thus, for a problem ν_Q with Σ_{i : [G_i, G_{i+1}) ⊂ C_m} Q_i ≥ Kλ(C_m)/2, on the event E^m_1, via combination of Equations (27) and (28),

    λ(Ẑ^2_m)(1 − κ(C_{m,2})) ≥ 3λ(C_{m,1})(1 − κ(C_{m,1}))/4 + λ(Ĝ_m)(1 − κ(Ĝ_m))   (29)
                             ≥ (3λ(C_{m,1})/4 + λ(Ĝ_m))(1 − κ(C_{m,1})),   (30)

where the final inequality comes from the fact that κ(C_{m,1}) ≥ κ(Ĝ_m). To complete Step 2 we now lower bound d_∞(ŝ, η) on the event E^m_1. Firstly note that

    κ(Ẑ_m)λ(Ẑ_m) ≤ λ(∪_{n=1}^{m−1} C_n ∪ C_{m,1}) κ(∪_{n=1}^{m−1} C_n ∪ C_{m,1}) = λ(∪_{n=1}^{m−1} C_n) κ(∪_{n=1}^{m−1} C_n) + λ(C_{m,1})κ(C_{m,1});

therefore, for a problem ν_Q with Σ_{i : [G_i, G_{i+1}) ⊂ C_m} Q_i ≥ Kλ(C_m)/2, on the event E^m_1,

    d_∞(ŝ, η) ≥ (λ(Ĝ_m)κ(Ĝ_m) + λ(C_{m,1})κ(C_{m,1}) − κ(C_{m,1})λ(Ẑ^1_m) − κ(C_{m,2})λ(Ẑ^2_m))/p
             ≥ (λ(Ĝ_m)κ(Ĝ_m) + 3λ(C_{m,1})κ(C_{m,1})/4 − κ(C_{m,2})λ(Ẑ^2_m))/p
             ≥ (λ(Ĝ_m)κ(Ĝ_m) + 3λ(C_{m,1})κ(C_{m,1})/4 − κ(C_{m,2})(3λ(C_{m,1})/4 + λ(Ĝ_m))(1 − κ(C_{m,1}))/(1 − κ(C_{m,2})))/p,

where the first inequality follows from Equation (27) and the second from (29); rearranging, the right-hand side is of the order λ(C_{m,1} ∪ C_{m,2})(κ(C_{m,1}) − κ(C_{m,2}))/(p(1 − κ(C_{m,2}))), with κ(C_{m,1}) − κ(C_{m,2}) = β^1_m − β^2_m = (8/3)Δ_{U_m}, and is therefore greater than ε by the definition of Δ_{U_m}. Thus, as we assume policy π is PAC(δ, ε) on all problems ν_Q, we must have that, on all problems ν_Q with Σ_{i : [G_i, G_{i+1}) ⊂ C_m} Q_i ≥ Kλ(C_m)/2,

    P_{ν_Q,π}(E^m_1) ≤ δ.   (31)

Via similar reasoning we can show that, on all problems ν_Q with Σ_{i : [G_i, G_{i+1}) ⊂ C_m} Q_i ≤ Kλ(C_m)/2, we must have that

    P_{ν_Q,π}(E^m_0) ≤ δ.   (32)
Step 4: bounding the probability of the sum of the ξ_{i,m}. Now, for m ∈ [M], via the Azuma-Hoeffding inequality applied to the martingale

    Σ_{i : [G_i, G_{i+1}) ⊂ C_m, Q_i = 0} (1(ξ_{i,m}) − P_Q(ξ_{i,m})),

we have that, with probability greater than 1 − δ,

    Σ_{i : [G_i, G_{i+1}) ⊂ C_m, Q_i = 0} (1(ξ_{i,m}) − P_Q(ξ_{i,m})) ≤ sqrt(Kλ(C_m) log(1/(1 − δ))).   (33)

Thus, via combination of Equations (32) and (33), we must have that, for all Q with Σ_{i : [G_i, G_{i+1}) ⊂ C_m} Q_i ≤ Kλ(C_m)/2,

    Σ_{i : [G_i, G_{i+1}) ⊂ C_m, Q_i = 0} P_Q(ξ_{i,m}) ≥ Kλ(C_m)/4 − sqrt(Kλ(C_m) log(1/(1 − δ))) ≥ Kλ(C_m)/8,

where the second inequality comes from our assumption δ ≤ 1 − exp(−1/8). Via similar reasoning we also have that, for all Q with Σ_{i : [G_i, G_{i+1}) ⊂ C_m} Q_i ≥ Kλ(C_m)/2,

    Σ_{i : [G_i, G_{i+1}) ⊂ C_m, Q_i = 1} P_Q(ξ_{i,m}) ≤ Kλ(C_m)/2 + sqrt(Kλ(C_m) log(1/(1 − δ))).

Step 5: applying a Fano-type inequality. We first define the class of problems upon which we will apply Fano:

    Q_0 = {Q : ∀m ∈ [M], Σ_{i : [G_i, G_{i+1}) ⊂ C_m} Q_i = ⌊Kλ(C_m)/2⌋},
    Q_1 = {Q : ∀m ∈ [M], Σ_{i : [G_i, G_{i+1}) ⊂ C_m} Q_i = ⌊Kλ(C_m)/2⌋ + 1}.

We remind the reader that, for i ∈ [K], we write τ_i = Σ_{t=1}^τ 1(a_t ∈ [G_i, G_{i+1})) and, for m ∈ [M], τ(m) = Σ_{i : [G_i, G_{i+1}) ⊂ C_m} τ_i. We see that, for all Q ∈ Q_0 and i with Q_i = 0, there exists a unique Q' ∈ Q_1 such that Q' = Q^i; therefore

    Σ_{i : [G_i, G_{i+1}) ⊂ C_m, Q_i = 0} P_{Q^i}(ξ_{i,m}) = Σ_{i : [G_i, G_{i+1}) ⊂ C_m, Q'_i = 1} P_{Q'}(ξ_{i,m}),

and thus, using the data-processing inequality and the convexity of the relative entropy, we have

    Σ_{i : Q_i = 0} P_{Q^i}(ξ_{i,m}) − Σ_{i : Q_i = 0} P_Q(ξ_{i,m}) ≤ sqrt(E_Q[τ(m)] kl(μ_{U_m} − 4Δ_{U_m}/3, μ_{U_m} + 4Δ_{U_m}/3)),

where, by Step 4, the left-hand side is of the order Kλ(C_m). Then, using the Pinsker inequality kl(x, y) ≥ 2(x − y)², we obtain

    Kλ(C_m) ≤ c' sqrt(max_{Q ∈ Q_0} E_Q[τ(m)] kl(μ_{U_m} − 4Δ_{U_m}/3, μ_{U_m} + 4Δ_{U_m}/3)),

and therefore

    max_{Q ∈ Q_0} E_Q[τ(m)] ≥ c Kλ(C_m)/kl(μ_{U_m} − 4Δ_{U_m}/3, μ_{U_m} + 4Δ_{U_m}/3),

so that

    Σ_{m ∈ [M]} E_Q[τ(m)] ≥ c Σ_{m ∈ [M]} Kλ(C_m)/kl(μ_{U_m} − 4Δ_{U_m}/3, μ_{U_m} + 4Δ_{U_m}/3),

where c > 0 is an absolute constant. The proof now follows, as for all m ∈ [M] and all i with [G_i, G_{i+1}) ⊂ C_m,

    H^{(1)}_i ≤ 1/kl(μ_{U_m} − 4Δ_{U_m}/3, μ_{U_m} + 4Δ_{U_m}/3).