# Fast Rates in Pool-Based Batch Active Learning

Journal of Machine Learning Research 25 (2024) 1-42. Submitted 12/22; Revised 3/24; Published 9/24.

Claudio Gentile (cgentile@google.com), Google Research, New York City, NY, USA
Zhilei Wang (zhileiwang92@gmail.com), WorldQuant LLC, New York City, NY, USA
Tong Zhang (tozhang@illinois.edu), University of Illinois Urbana-Champaign, IL

Editor: Sivan Sabato

We consider a batch active learning scenario where the learner adaptively issues batches of points to a labeling oracle. Sampling labels in batches is highly desirable in practice due to the smaller number of interactive rounds with the labeling oracle (often human beings). However, batch active learning typically pays the price of a reduced adaptivity, leading to suboptimal results. In this paper we propose a solution which requires a careful trade-off between the informativeness of the queried points and their diversity. We theoretically investigate batch active learning in the practically relevant scenario where the unlabeled pool of data is available beforehand (pool-based active learning). We analyze a novel stagewise greedy algorithm and show that, as a function of the label complexity, the excess risk of this algorithm matches the known minimax rates in a standard statistical learning setting with linear function spaces. Our results also exhibit a mild dependence on the batch size. These initial results are then extended to hold for general function spaces with similar algorithmics. These are the first theoretical results that employ careful trade-offs between informativeness and diversity to rigorously quantify the statistical performance of batch active learning in the pool-based scenario.

Keywords: Informativeness vs. diversity, Statistical learning, G-optimal design

© 2024 Claudio Gentile, Zhilei Wang and Tong Zhang. License: CC-BY 4.0, see https://creativecommons.org/licenses/by/4.0/. Attribution requirements are provided at http://jmlr.org/papers/v25/22-1409.html.

## 1. Introduction

The aim of active learning is to reduce the data requirement of training processes through the careful selection of informative subsets of the data across several interactive rounds. This increased interactive power enables the adaptation of the sampling process to the actual state of the learning algorithm at hand, yet this benefit comes at the price of frequent re-training of the model and increased interactions with the labeling oracle (which is often just a pool of human labelers). The batch mode of active learning is one where labels are queried in batches of suitable size, and the models are re-trained/updated either after each batch or even less frequently. This sampling mode often corresponds to the way labels are gathered in practical large-scale processing pipelines.

Batch active learning tries to strike a reasonable balance between the benefits of adaptivity and the costs associated with interaction and re-training. Yet, since the sampling is split into batches, and model updates can only be performed at the end of each batch, a batch active learning algorithm has to prevent, to the extent possible, the sampling of redundant points. The standard trade-off that arises is then to ensure that the sampled points are informative enough for the model, if taken in isolation, while at the same time being diverse enough so as to avoid sampling redundant labels. We study batch active learning in the pool-based model, where an unlabeled pool of data is made available to the algorithm beforehand, and the goal is to single out a subset of the data so as to achieve the same statistical performance as if training were carried out on the entire pool.
In this setting, we describe and analyze novel algorithms that obtain fast rates of convergence of their excess risk as a function of the number of requested labels. Interestingly enough, in the linear case, optimal rates are obtained even if we allow the batch size to grow with the pool size, the actual trade-off being ruled by the amount of noise in the data. Another appealing aspect is that our algorithms guarantee a number of re-training rounds which is at worst logarithmic, while being able to automatically adapt to the level of noise. We operate in specific realizable settings, starting with linear models as a warmup, and then extending our results to the more general non-linear setting. Unlike what is traditionally done by many algorithmic solutions to active learning available in the literature (e.g., Balcan et al. (2007); Balcan and Long (2013); Zhang and Li (2021)), we do not formulate strong assumptions on the input distribution. We establish careful trade-offs between the informativeness and the diversity of the queried labels, and rigorously quantify the statistical performance of batch active learning in a noisy pool-based setting. To our knowledge, these are the first guarantees of this kind that apply to a noisy (hence realistic) batch pool-based active learning scenario. See also the related work contained in Section 3. A short version of the current paper appeared in ICML 2022 (see Gentile et al., 2022b), and a longer version on arXiv (Gentile et al., 2022a). The ICML version only contains the linear model analysis. This extended version greatly generalizes the linear analysis of Gentile et al. (2022b) to a broader class of nonlinear functions. This nonlinear analysis introduces a novel concept of eluder-like dimension of a function space, and an associated eluder-like confidence interval, which can have other applications beyond active learning; see, e.g., the recent paper by Ye et al.
(2023), which built upon the arXiv version of this paper (Gentile et al., 2022a).

### 1.1 Content and contributions

Our contributions can be described as follows.

1. We first present an efficient algorithm for pool-based batch active learning for noisy linear models (Algorithm 1). This algorithm generates pseudo-labels by computing sequences of linear classifiers that restrict their attention to exponentially small regions of the margin space, and then trains a single model based on the pseudo-labels only. The design inspiring the sampling within each stage is a G-optimal design, computed through a greedy strategy. We show (Theorem 1) that under the standard i.i.d. assumption on the (input, label) pairs, the model so trained enjoys an excess risk bound with respect to the Bayes optimal predictor which is best possible, when expressed in terms of the total number of requested labels. The number of re-training stages (that is, the number of linear classifiers computed to generate pseudo-labels) is at most logarithmic in the pool size and, importantly, it automatically adapts to the noise level without knowing it in advance.

2. Since the above algorithm does not operate on a constant batch size $B$, we show in Section 4.2 an easy adaptation to the constant batch size case, and make the observation that $B$ therein may also scale as $T^\beta$, for some exponent $\beta < 1$ that depends on the amount of noise (see the comments surrounding Corollary 2), still retaining the abovementioned optimal rates.

3. Our algorithmic technique can be seen as a skeleton technique that can be applied to more general situations, provided the estimators employed at each stage and the diversity measure guiding the design have matching properties. We provide extensions to the general nonlinear case in Section 5, and give an analysis that covers general non-linear function spaces (Theorem 3).
The non-linear extension relies on a novel notion of eluder-like dimension of a function space, which allows us to show learnability whenever the function space has eluder dimension at level $\epsilon$ of the form $\mathrm{poly}(1/\epsilon)$. This algorithm, too, can be made to work in the constant batch size case, and the same considerations as in the linear case apply to the general nonlinear case.

## 2. Preliminaries and Notation

We denote by $\mathcal{X}$ the input space (e.g., $\mathcal{X} = \mathbb{R}^d$), by $\mathcal{Y}$ the output space, and by $\mathcal{D}$ an unknown distribution over $\mathcal{X} \times \mathcal{Y}$. The corresponding random variables will be denoted by $x$ and $y$. We also denote by $\mathcal{D}_X$ the marginal distribution of $\mathcal{D}$ over $\mathcal{X}$. Given a function $h$ (also called a hypothesis or a model) mapping $\mathcal{X}$ to $\mathcal{Y}$, the population loss (often referred to as risk) of $h$ is denoted by $L(h)$, and defined as $L(h) = \mathbb{E}_{(x,y)\sim\mathcal{D}}[\mathrm{loss}(h(x), y)]$, where $\mathrm{loss}(\cdot,\cdot) : \mathcal{Y} \times \mathcal{Y} \to [0,1]$ is a given loss function. For simplicity of presentation, we restrict ourselves to a binary classification setting with 0-1 loss, so that $\mathcal{Y} = \{-1, +1\}$ and $\mathrm{loss}(\hat y, y) = \mathbf{1}\{\hat y \neq y\} \in \{0, 1\}$, where $\mathbf{1}\{\cdot\}$ is the indicator function of the predicate at argument. When clear from the surrounding context, we will omit subscripts like $(x,y)\sim\mathcal{D}$ from probabilities and expectations. We are given a class of models $\mathcal{F} = \{f : \mathcal{X} \to [0,1]\}$ and the Bayes optimal predictor $h^*(x) = \mathrm{sgn}\big(f^*(x) - 1/2\big)$, where $f^*(x) = \mathbb{P}(y = 1\,|\,x)$ is assumed to belong to class $\mathcal{F}$ (the so-called realizability assumption). This assumption is reasonable whenever the model class $\mathcal{F}$ we operate on is wide enough. For instance, a realizability (or quasi-realizability) assumption seems natural in the overparameterized settings implemented by today's deep neural networks. A simple example is a generalized linear model

$$f^*(x) = \sigma(\langle w^*, x\rangle)\,, \qquad (1)$$

where $\sigma : \mathbb{R} \to [0,1]$ is a suitable sigmoidal function, e.g., $\sigma(z) = \frac{e^z}{1+e^z}$, $w^*$ is an unknown vector in $\mathbb{R}^d$ with bounded (Euclidean) norm $\|w^*\| \le R$ for some $R \ge 1$, and $\langle\cdot,\cdot\rangle$ denotes the usual inner product in $\mathbb{R}^d$.
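As a concrete illustration of the data model in (1), the following minimal sketch simulates the generalized linear setting with the logistic sigmoid. This is our own illustration, not part of the paper; the function names (`bayes_label`, `sample_label`) are ours.

```python
import numpy as np

def sigmoid(z):
    # sigma(z) = e^z / (1 + e^z), written in a numerically stable form
    return 1.0 / (1.0 + np.exp(-z))

def bayes_label(x, w_star):
    # Bayes optimal predictor h*(x) = sgn(f*(x) - 1/2), f*(x) = sigma(<w*, x>)
    f_star = sigmoid(np.dot(w_star, x))
    return 1 if f_star - 0.5 > 0 else -1

def sample_label(x, w_star, rng):
    # Draw a noisy label: y = +1 with probability f*(x) = P(y = +1 | x)
    f_star = sigmoid(np.dot(w_star, x))
    return 1 if rng.random() < f_star else -1
```

With `w_star = [1, 0]`, points with a large positive first coordinate get Bayes label `+1`, while sampled labels remain noisy near the decision boundary.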
Throughout this paper, we adopt the commonly used low-noise condition on the marginal distribution $\mathcal{D}_X$ of Mammen and Tsybakov (1999): there are a constant $c > 0$, $\epsilon_0 \in (0, 1]$, and an exponent $\alpha \ge 0$ such that for all $\epsilon \in (0, \epsilon_0]$ we have

$$\mathbb{P}\big(|f^*(x) - 1/2| < \epsilon/2\big) \le c\,\epsilon^\alpha\,. \qquad (2)$$

Note, in particular, that $\alpha = \infty$ gives the so-called hard margin condition $\mathbb{P}\big(|f^*(x) - 1/2| < \epsilon\big) = 0$ for all $\epsilon \le \epsilon_0$ while, at the opposite end of the spectrum, exponent $\alpha = 0$ (and $c = 1$) corresponds to making no assumptions whatsoever on $\mathcal{D}_X$. For simplicity, we shall assume throughout that the above low-noise condition holds with $c = 1$. The noise exponent $\alpha$ and range constant $\epsilon_0$ are typically unknown, and our algorithms will not rely on prior knowledge of them. We are given a class of models $\mathcal{F}$ and a pool $\mathcal{P}$ of $T$ unlabeled instances $x_1, \ldots, x_T \in \mathcal{X}$, drawn i.i.d. according to a marginal distribution $\mathcal{D}_X$ obeying condition (2) (with $c = 1$). The associated labels $y_1, \ldots, y_T \in \mathcal{Y}$ are such that the pairs $(x_t, y_t)$, $t = 1, \ldots, T$, are drawn i.i.d. according to $\mathcal{D}$, the labels being generated according to the conditional distribution determined by some $f^* \in \mathcal{F}$. The labels are not initially revealed to us, and the goal of the active learning algorithm is to come up at the end of training with a model $\hat h : \mathcal{X} \to \mathcal{Y}$ whose excess risk $L(\hat h) - L(h^*)$ is as small as possible, while querying as few labels as possible for points in $\mathcal{P}$. The way labels are queried follows the standard batch active learning protocol. We are given a batch size $B \ge 1$. Label acquisition and learning proceed in a sequence of stages, $\ell = 1, 2, \ldots$. At each stage $\ell$, the algorithm is allowed to query $B$-many labels by relying only on labels acquired in the past $\ell - 1$ stages. Note that each point $x_t$ in pool $\mathcal{P}$ can only be queried once, which is essentially equivalent to assuming that the noise in the corresponding label $y_t$ is persistent.
We shall henceforth denote by $N_T(\mathcal{P})$ the total number of labels (sometimes referred to as label complexity) queried by the algorithm at hand on pool $\mathcal{P}$, and by $N_{T,B}(\mathcal{P})$ the same quantity when we want to emphasize the dependence on $B$. The analysis of our algorithms hinges upon a suitable measure of diversity, $D(x, S)$, that quantifies how far off a data point $x \in \mathcal{X}$ is from a finite set of points $S \subseteq \mathcal{X}$. Though many diversity measures may be adopted for practical purposes (e.g., Wei et al., 2015; Sener and Savarese, 2018; Kirsch et al., 2019; Ash et al., 2020; Killamsetty et al., 2021; Kirsch et al., 2021; Citovsky et al., 2021), the one enabling tight theoretical analyses for our algorithms is one that is coupled with the estimators our active learning algorithms rely upon. Specifically, given diversity measure $D(x, S)$, an estimator $\hat f = \hat f(S)$ in a fixed design scenario is coupled with $D(x, S)$ if we can guarantee $L_\infty$ approximation bounds of the form

$$|\hat f_S(x) - f^*(x)| \le D(x, S) \qquad \forall x\,. \qquad (3)$$

In the case of linear function spaces over $\mathcal{X} = \mathbb{R}^d$, the estimators $\hat f = \hat f(S)$ will essentially be least-squares predictors, and a coupled diversity measure will be spectral-like:

$$D(x, S) = \|x\|_{A_S^{-1}} = \sqrt{x^\top A_S^{-1}\, x}\,,$$

that is, the Mahalanobis norm of $x$ w.r.t. the positive semi-definite matrix $A_S^{-1}$, where $A_S = I + \sum_{z\in S} z z^\top$, $I$ being the $d \times d$ identity matrix. Note that $D(x, S)$ is large when $x$ is aligned with eigenvectors of $A_S$ associated with small eigenvalues, while it is small if $x$ is aligned with eigenvectors of that matrix associated with large eigenvalues. In particular, $D(x, S)$ achieves its maximal value $\|x\|_2$ when $x$ is orthogonal to the space spanned by $S$. Hence, $x$ is very different from $S$ as measured by $D(x, S)$ if $x$ contributes a direction of the input space which is not already spanned by $S$. We denote by $|A_S|$ the determinant of matrix $A_S$.
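The spectral diversity measure above is straightforward to compute. The sketch below (ours, purely illustrative) builds $A_S$ explicitly and evaluates $D(x, S)$; an actual implementation would maintain $A_S^{-1}$ incrementally rather than re-solving each time.

```python
import numpy as np

def diversity(x, S):
    """D(x, S) = ||x||_{A_S^{-1}}, with A_S = I + sum_{z in S} z z^T."""
    d = len(x)
    A = np.eye(d)
    for z in S:
        A += np.outer(z, z)          # accumulate the rank-one terms z z^T
    return float(np.sqrt(x @ np.linalg.solve(A, x)))
```

As a sanity check, a point orthogonal to the span of `S` attains the maximal value `||x||_2`, while a point already spanned by `S` scores strictly less.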
In the more general nonlinear case, our diversity measure is similar in spirit to the eluder dimension of Russo and Van Roy (2013), as well as to the more recent online decoupling coefficient, as defined by Dann et al. (2021); see Section 5 for details. At an intuitive level, since the label requests are batched, and model updates are typically performed only at the end of each stage, a batch active learning algorithm is compelled to operate within each stage by trading off the (predicted) informativeness of the selected labels against the diversity of the data points whose labels are requested. Moreover, the larger the batch size $B$, the less adaptive the algorithm is forced to be; hence we expect $B$ to somehow play a role in the performance of the algorithm. From a practical standpoint, there are indeed two separate notions of adaptivity to consider. One is the number of interactive rounds with the labeling oracle; the other is the number of times we re-train (or update) a model based on the labels gathered during the interactive rounds. The two notions need not coincide. While the former essentially accounts for the cost of interacting with human labelers, the latter is more related to the cost of re-training/updating a (potentially very complex) learning system.

## 3. Related Work

While experimental studies on batch active learning have been reported since the early 2000s (see, e.g., Hoi et al., 2006), it is only with the deployment at scale of deep neural networks that we have seen a general resurgence of interest in active learning, and batch active learning in particular. The batch pool-based model studied here is the one that has spurred the widest attention, as it corresponds to the way labels are gathered in practice in large-scale processing pipelines. This interest has generated a flurry of recent investigations, mainly of an experimental nature, yet containing a lot of interesting and diverse approaches to batch active learning. Among these are the papers by Gu et al.
(2012, 2014); Sener and Savarese (2018); Kirsch et al. (2019); Zhdanov (2019); Shui et al. (2020); Ash et al. (2020); Kim et al. (2020); Killamsetty et al. (2021); Kirsch et al. (2021); Ghorbani et al. (2021); Citovsky et al. (2021); Kothawade et al. (2021). On the theoretical side, active learning is a well-studied sub-field of statistical learning. General references in pool-based active learning include the papers by Dasgupta (2004, 2005); Hanneke (2014); Nowak (2011); Tosh and Dasgupta (2017), and specific algorithms for half-spaces under classes of input distributions are contained, e.g., in Balcan et al. (2007); Balcan and Long (2013); Zhang and Li (2021). However, none of these papers tackles the practically relevant scenario of batch active learning. In fact, restricting attention to the theoretical aspects of batch active learning makes the research landscape far less populated. Below we briefly summarize what we think are among the papers most relevant to our work, as directly related to batch active learning, and then mention recent efforts in contiguous fields, like adaptive sampling and subset selection, which may serve as general reference and inspiration. Batch active learning in the pool-based scenario is one of the motivating applications in Chen and Krause (2013), where the main concern is to investigate general conditions under which a batch greedy policy achieves similar performance as the optimal policy that operates with the same batch size. Yet, the authors consider simple noise-free scenarios, and the important observation (Theorem 2 therein) that a batch greedy algorithm is also competitive with respect to an optimal fully sequential policy (batch size one) does not apply to active learning. Chen et al. (2015, 2017) are along similar lines, with the addition of persistent noise, but do not tackle batch active learning problems.
A paper with a similar aim as ours, yet operating in the streaming setting of active learning, is Amin et al. (2020). The authors show that some classes of fully sequential active algorithms can be turned into sequential algorithms that query labels in batches and suffer only an additive (times log factors) overhead in the label complexity. This transformation is essentially obtained by freezing the state of the fully sequential algorithm, but it is unclear whether any notion of diversity over the batch is enforced by the resulting batch algorithms. A recent stream-based active learning paper that is worth mentioning is Camilleri et al. (2021b), which shares similar methods and modeling assumptions with ours in leveraging optimal design, but does not deal with batch active learning. The main concern there is essentially to improve the performance of adaptive sampling by reducing the variance of the estimators. Moreover, a related work by Katz-Samuels et al. (2021) considered the batch active learning setting for binary classification, with a 0-1 valued hypothesis class trained via the binary classification loss instead of the regression loss of this paper. Confidence intervals resulting from binary classification loss minimization require a concept similar to the disagreement coefficient (Hanneke, 2014), which is quite different from the regression confidence intervals from statistics that we employ here. Therefore their problem setting and results are not comparable to ours. Another couple of very recent works in the streaming setting of active learning which are relevant to this paper (specifically, for the general function space setting) are Zhu and Nowak (2022) and Sekhari et al. (2023). Both papers assume a sequential i.i.d. setting, hence they are not batch pool-based. Being i.i.d.
sequential (and not pool-based), their algorithms are able to sidestep the dependence on our eluder dimension-like notion of complexity, and instead rely on the so-called value-function disagreement coefficient (Foster et al., 2020), which is likely to be smaller than eluder-like dimension complexity measures. Again, those results are incomparable to ours. Our algorithms borrow ideas from the classical experimental design literature (Pronzato and Pázman, 2013), and are also motivated by works in active learning, such as Chaudhuri et al. (2015); Mukherjee et al. (2022). However, these works consider different objective functions than ours. Specifically, they try to find designs with finite sample guarantees that can minimize the log-loss or regression loss. In such settings, as long as there is noise, they cannot achieve faster convergence rates than $O(1/N_T(\mathcal{P}))$. In contrast, our batch active learning setting directly considers the classification objective, which can potentially achieve faster rates than the corresponding regression problem when low-noise conditions are satisfied. Our algorithms employ stage-wise batch sample selection and elimination to achieve such faster rates. A learning problem somewhat similar to pool-based batch active learning is training subset selection (sometimes called dataset summarization), whose goal is to come up with a compressed version of a (big) dataset that offers to a given learning algorithm the same inference capabilities as if applied to the original dataset. The problem can be organized in rounds (as in batch active learning), and bridging one to the other can in practice be done by label hallucination/pseudo-labeling. Representative works include Wei et al. (2015); Killamsetty et al. (2021); Borsos et al. (2021). A final word before delving into the technical sections: Our work is theoretical in nature.
We try to understand the statistical sample efficiency of pool-based batch active learning, although some of the proposed methods in the paper may not pay specific attention to the computational aspects of the involved procedures. Specifically, whereas our methods are computationally efficient in the linear function case, they need not be in the general nonlinear case.

## 4. Warmup: The Linear Case

As a warmup, we start by considering a simple linear model of the form $f^*(x) = \frac{1 + \langle w^*, x\rangle}{2}$, where both $w^*$ and $x$ lie in the $d$-dimensional Euclidean unit ball (so that $\langle w^*, x\rangle \in [-1, 1]$ and $f^*(x) \in [0, 1]$). Algorithm 1 contains in a nutshell the main idea behind our algorithmic solutions, which is to greedily approximate a G-optimal design in the selection of points at each stage. The way it is formulated, Algorithm 1 does not operate with a constant batch size $B$ per stage. We will reduce to the constant batch size case in Section 4.2. The same algorithmic skeleton will be applied to the general nonlinear case in Section 5. The algorithm for the linear case takes as input a finite pool of points $\mathcal{P}$ of size $T$ and proceeds across stages $\ell = 1, 2, \ldots$ by generating at each stage $\ell$ a (linear-threshold) predictor $\mathrm{sgn}(\langle w_\ell, x\rangle)$, where $w_\ell$ is a ridge regression estimator computed only on the labeled pairs $(x_{\ell,1}, y_{\ell,1}), \ldots, (x_{\ell,T_\ell}, y_{\ell,T_\ell})$ collected during that stage. These predictors are used to trim the current pool $\mathcal{P}_{\ell-1}$ by eliminating both the points on which $w_\ell$ is itself confident (set $C_\ell$) and those whose labels have just been queried (set $Q_\ell$). At each stage $\ell$, the points $x_{\ell,t}$ to query are selected in a greedy fashion by maximizing¹ $D(x, Q_\ell) = \|x\|_{A_{\ell,t-1}^{-1}}$ over the current pool $\mathcal{P}_{\ell-1}$ (excluding the already selected points $Q_\ell$, which are contained in $A_{\ell,t-1}$), so as to make $x_{\ell,t}$ maximally different from $Q_\ell$. When stage $\ell$ terminates, we are guaranteed to have collected a set of points $Q_\ell$ such that all remaining points $x$ in the pool satisfy $D(x, Q_\ell) \le \epsilon_\ell$.
Threshold $\epsilon_\ell$, defined at the beginning of the stage, is exponentially decaying with $\ell$. It is this threshold that determines the actual length of the stage, and rules the elimination of unqueried points from the pool, along with the corresponding generation of pseudo-labels during the stage. Algorithm 1 stops generating new stages when the size $|\mathcal{P}_\ell|$ of pool $\mathcal{P}_\ell$ triggers the condition $d\,2^{\ell+1} > 2^{-(\ell+1)}|\mathcal{P}_\ell|$ (which is satisfied, in particular, when $\mathcal{P}_\ell$ becomes empty). In that case, the current stage $\ell$ becomes the final stage $L$. Finally, the algorithm uses the subset of points $\cup_{\ell=1}^L C_\ell$ and the associated pseudo-labels $\hat y$ generated during the $L$ stages to train a linear classifier $\hat w$ (e.g., a support vector machine) to zero empirical error on that subset. Our analysis (see Appendix A) shows that with high probability such a consistent linear classifier exists.

¹ As a matter of fact, the chosen $x_{\ell,t}$ need not be the maximizer of $D(x, Q_\ell)$; the analysis only requires $D(x_{\ell,t}, Q_\ell) > \epsilon_\ell$.

**Algorithm 1:** Pool-based batch active learning algorithm for linear models.

1. Input: Confidence level $\delta \in (0, 1]$, pool of instances $\mathcal{P} \subseteq \mathbb{R}^d$ of size $|\mathcal{P}| = T$
2. Initialize: $\mathcal{P}_0 = \mathcal{P}$
3. for $\ell = 1, 2, \ldots$:
4. &nbsp;&nbsp; Initialize within stage $\ell$: $\epsilon_\ell = 2^{-\ell}\big/\sqrt{2\log\big(2\ell(\ell+1)T/\delta\big)}$, $A_{\ell,0} = I$, $t = 0$, $Q_\ell = \emptyset$
   - while $\mathcal{P}_{\ell-1}\setminus Q_\ell \neq \emptyset$ and $\max_{x\in\mathcal{P}_{\ell-1}\setminus Q_\ell} \|x\|_{A_{\ell,t}^{-1}} > \epsilon_\ell$:
     - $t \leftarrow t + 1$; pick $x_{\ell,t} \in \arg\max_{x\in\mathcal{P}_{\ell-1}\setminus Q_\ell} \|x\|_{A_{\ell,t-1}^{-1}}$
     - Update $A_{\ell,t} = A_{\ell,t-1} + x_{\ell,t}x_{\ell,t}^\top$ and $Q_\ell = Q_\ell \cup \{x_{\ell,t}\}$
   - Set $T_\ell = t$, the number of queries made in stage $\ell$
   - if $Q_\ell \neq \emptyset$:
     - Query the labels $y_{\ell,1}, \ldots, y_{\ell,T_\ell}$ associated with the unlabeled data in $Q_\ell$, and compute $w_\ell = A_{\ell,T_\ell}^{-1}\sum_{t=1}^{T_\ell} y_{\ell,t}\,x_{\ell,t}$
     - Set $C_\ell = \{x \in \mathcal{P}_{\ell-1}\setminus Q_\ell : |\langle w_\ell, x\rangle| > 2^{-\ell}\}$
     - Compute pseudo-labels on each $x \in C_\ell$ as $\hat y = \mathrm{sgn}\,\langle w_\ell, x\rangle$
   - Set $\mathcal{P}_\ell = \mathcal{P}_{\ell-1}\setminus(C_\ell \cup Q_\ell)$
   - if $d\,2^{\ell+1} > 2^{-(\ell+1)}|\mathcal{P}_\ell|$: exit the for-loop ($L$ is the total number of stages)
5. Predict labels in pool $\mathcal{P}$: Train an SVM classifier $\hat w$ on $\cup_{\ell=1}^L C_\ell$ via the generated pseudo-labels $\hat y$; predict on each $x \in (\cup_{\ell=1}^L Q_\ell) \cup \mathcal{P}_L$ through $\mathrm{sgn}(\langle \hat w, x\rangle)$

Each point $x$ that remains in the pool, that is, each $x \in (\cup_{\ell=1}^L Q_\ell) \cup \mathcal{P}_L$, is assigned label $\mathrm{sgn}(\langle \hat w, x\rangle)$. Notice, in particular, that $\hat w$ is not trying to fit the queried labels of $\cup_{\ell=1}^L Q_\ell$, but only the pseudo-labels of $\cup_{\ell=1}^L C_\ell$. It is also worth observing how Algorithm 1 resolves the trade-off between informativeness and diversity we alluded to in previous sections. Once we reach stage $\ell$, what remains in the pool are only the points $x$ such that $|\langle w_{\ell-1}, x\rangle| \le 2^{-\ell+1}$ (this is because we have eliminated in stage $\ell-1$ all the points in $C_{\ell-1}$). Hence, the remaining points which the approximate G-optimal design operates with in stage $\ell$ are those which the previous model $w_{\ell-1}$ is not sufficiently confident on. The algorithm then puts all these low-confidence points on the same footing (that is, they are considered equally informative if taken in isolation), and then relies on the approximate G-optimal design scheme to maximize diversity among them. The set-wise diversity measure we end up maximizing is indeed a determinant-like diversity measure. This is easily seen from the fact that $\sum_{t=1}^{T_\ell} \|x_{\ell,t}\|^2_{A_{\ell,t-1}^{-1}} \le 2\log|A_{\ell,T_\ell}|$. On one hand, this careful selection of points contributes to keeping the variance of estimator $w_\ell$ under control. On the other hand, the fact that we stop accumulating labels when $\max_{x\in\mathcal{P}_{\ell-1}\setminus Q_\ell} \|x\|_{A_{\ell,T_\ell}^{-1}} \le \epsilon_\ell$ essentially implies that $\mathrm{sgn}(\langle w_\ell, x\rangle) = \mathrm{sgn}(\langle w^*, x\rangle)$ on all points $x$ we generate pseudo-labels for, which in turn ensures that these pseudo-labels are consistent with $w^*$.
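To make the stage-wise mechanics concrete, here is a simplified sketch of one stage: greedy selection via $\|x\|_{A^{-1}}$, the ridge estimate $w_\ell$, and pseudo-labeling of the confident set $C_\ell$. This is our own illustration, not the authors' code; the names (`run_stage`, `label_oracle`) and the Sherman-Morrison update are implementation choices, and bookkeeping details of Algorithm 1 (stage thresholds, stopping rule, final SVM) are omitted.

```python
import numpy as np

def run_stage(pool, eps, threshold, label_oracle):
    """One stage of the greedy scheme: query points maximizing ||x||_{A^{-1}}
    until every remaining point x has ||x||_{A^{-1}} <= eps, fit a ridge
    estimator w on the queried labels, then pseudo-label the points with
    |<w, x>| > threshold (the role of 2^{-l} in the paper's notation)."""
    d = pool.shape[1]
    A_inv = np.eye(d)                    # inverse of A = I + sum x x^T
    remaining = list(range(len(pool)))
    queried, labels = [], []
    while remaining:
        scores = [pool[i] @ A_inv @ pool[i] for i in remaining]
        j = int(np.argmax(scores))
        if np.sqrt(scores[j]) <= eps:    # all remaining points are covered
            break
        i = remaining.pop(j)
        x = pool[i]
        queried.append(i)
        labels.append(label_oracle(x))   # query the label of the chosen point
        # Sherman-Morrison rank-one update of A^{-1} after A += x x^T
        Ax = A_inv @ x
        A_inv -= np.outer(Ax, Ax) / (1.0 + x @ Ax)
    if queried:
        w = A_inv @ sum(y * pool[i] for y, i in zip(labels, queried))
    else:
        w = np.zeros(d)
    pseudo = {i: int(np.sign(w @ pool[i]))
              for i in remaining if abs(w @ pool[i]) > threshold}
    return queried, w, pseudo
```

On the pool of the three standard basis vectors in `R^3` with `eps = 0.5` and a constant `+1` oracle, the stage queries all three points and returns the ridge estimate `w = (0.5, 0.5, 0.5)`.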
Sequential experimental design has become popular, e.g., in the (contextual) bandit literature (see Ch. 22 in Lattimore and Szepesvári, 2020), and is explicitly contained in recent works on best arm identification (e.g., Fiez et al., 2019; Camilleri et al., 2021a). Notice that in those works a design is a distribution over the set of actions (which would correspond to pool $\mathcal{P}$ in our case), and the algorithm is allowed to sample a given action $x_t$ multiple times, obtaining each time a fresh reward value $y_t$ such that $\mathbb{E}[y_t\,|\,x_t] = \langle w^*, x_t\rangle$. This is not conceivable in a pool-based active learning scenario where label noise is persistent, and each action $x_t$ can only be played once. This explains why the design we rely upon here is necessarily more restrained than in those papers.

### 4.1 Analysis

The following result applies to the linear function case.²

**Theorem 1** *Let $T \ge d$ and assume that $\|x\|_2 \le 1$ for all $x \in \mathcal{P}$. Then with probability at least $1 - \delta$ over the random draw of $(x_1, y_1), \ldots, (x_T, y_T) \sim \mathcal{D}$ the excess risk $L(\hat w) - L(w^*)$, the label complexity $N_T(\mathcal{P})$, and the number of stages $L$ generated by Algorithm 1 are simultaneously upper bounded as follows:*

$$L(\hat w) - L(w^*) \le C\, C(\delta, T, \epsilon_0)\, \max\left\{\Big(\tfrac{d}{T}\Big)^{\frac{\alpha+1}{\alpha+2}},\ \tfrac{d}{T\epsilon_0}\right\},$$

$$N_T(\mathcal{P}) \le C\, C(\delta, T, \epsilon_0)\, \max\left\{d^{\frac{\alpha}{\alpha+2}}\, T^{\frac{2}{\alpha+2}},\ d\right\},$$

$$L \le \max\left\{\tfrac{\log_2(T/d)}{\alpha+2},\ \log_2\tfrac{4}{\epsilon_0}\right\} + \log_2\log T,$$

*for an absolute constant $C$ and a factor $C(\delta, T, \epsilon_0)$ that is polylogarithmic in $T$, $1/\delta$, and $1/\epsilon_0$ (of order $\log^2 T$).*

² Detailed proofs are deferred to the appendices.

### 4.2 Constant batch size

We now describe a simple modification to Algorithm 1 that makes it work in the constant batch size case. Let us denote by $T_\ell$ the length of stage $\ell$ in Algorithm 1. The modified algorithm simply runs Algorithm 1: if $T_\ell < B$, the modified algorithm relies on the model $w_\ell$ generated by Algorithm 1 without saturating the budget of $B$ labels at that stage.
On the contrary, if $T_\ell \ge B$, the modified algorithm splits stage $\ell$ of Algorithm 1 into $\lceil T_\ell/B\rceil$ stages of size $B$ (except, possibly, for the last one), and then uses the queried set $Q_\ell$ generated by Algorithm 1 across all those stages. Hence, in this case, the modified algorithm is not exploiting the potential benefit of updating the model every $B$ queried labels. For instance, if $B = 100$ and $T_\ell = |Q_\ell| = 240$, the modified algorithm will split this stage into three successive stages of size 100, 100, and 40, respectively, and then rely on the 240 labels queried by Algorithm 1 across the three stages. In particular, the update of the model $w_\ell$, and the associated pseudo-label computation on set $C_\ell$, is only performed at the end of the third stage. Notice that the modified algorithm we just described is a legitimate pool-based batch active learning algorithm operating on a constant batch size $B$, and its analysis is a direct consequence of the one in Theorem 1, after we take care of the possible over-counting that may arise in the reduction. Specifically, observe that the final hypothesis $\hat w$ produced by the modified algorithm is the same as the one computed by Algorithm 1, hence the same bound on the excess risk applies. As for label complexity, if we stipulate that a batch algorithm operating on a constant batch size $B$ will be billed $B$ labels at each stage even if it ends up querying fewer, then the label complexity of the modified algorithm will overcount the number of labels simply due to the rounding off in $\lceil T_\ell/B\rceil$. However, at each of the $L$ stages of Algorithm 1, the over-counting is bounded by $B$, so that, overall, the label complexity of the constant batch size variant exceeds the one of Algorithm 1 by at most an additive $BL$ term which, due to the bound on $L$ in Theorem 1, is of the form $\max\big\{\tfrac{B}{\alpha+2}\log\tfrac{T}{d},\ B\log\tfrac{1}{\epsilon_0}\big\}$. This is summarized in the following corollary.
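The splitting step of the reduction is elementary; here is a sketch (ours, for illustration) that chunks one stage's queries into batches of size at most $B$:

```python
def split_into_batches(queries, B):
    """Split the T_l queries of one stage into ceil(T_l / B) consecutive
    batches of size at most B; only the last batch may be shorter."""
    return [queries[i:i + B] for i in range(0, len(queries), B)]
```

With `B = 100` and a stage of 240 queries, this yields batches of sizes 100, 100, and 40, matching the example above.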
**Corollary 2** *With the same assumptions and notation as in Theorem 1, with probability at least $1 - \delta$ over the random draw of $(x_1, y_1), \ldots, (x_T, y_T) \sim \mathcal{D}$ the label complexity $N_{T,B}(\mathcal{P})$ achieved by the modified algorithm operating on a batch of size $B$ is bounded as follows:*

$$N_{T,B}(\mathcal{P}) \le C\, C(\delta, T, \epsilon_0)\, \max\left\{d^{\frac{\alpha}{\alpha+2}}\, T^{\frac{2}{\alpha+2}},\ d\right\} + B\left(\max\left\{\tfrac{\log_2(T/d)}{\alpha+2},\ \log_2\tfrac{4}{\epsilon_0}\right\} + \log_2\log T\right),$$

*where $C$, $C(\delta, T, \epsilon_0)$ are the same as in Theorem 1.*

A few comments are in order.

1. An important practical aspect of this modified algorithm (inherited from Algorithm 1) is the very mild number of re-trainings required to achieve the claimed performance. Even though the total number of labels can be as large as $T^{\frac{2}{\alpha+2}}$, the number $L$ of times the model is actually re-trained is not $T^{\frac{2}{\alpha+2}}/B$, but only logarithmic in $T$, irrespective of the noise level $\alpha$ (that is, even when the low-noise assumption (2) is vacuous). On the other hand, it is also important to observe that the bound on $L$ shrinks as $\alpha$ increases, that is, when the problem becomes easier. Overall, these properties make the algorithm attractive in practical learning scenarios where the re-training time turns out to be the main bottleneck in the data acquisition process, and a learning procedure is needed that automatically adapts the re-training effort to the hardness of the problem.

2. Let us disregard lower order terms and only consider the asymptotic behavior as $T \to \infty$. Comparing the excess risk bound in Theorem 1 to the label complexity bound in Corollary 2, one can see that when $B = O\big(T^{\frac{2}{\alpha+2}}\big)$ we have with high probability

$$L(\hat w) - L(w^*) \lesssim \left(\frac{1}{N_{T,B}(\mathcal{P})}\right)^{\frac{1+\alpha}{2}},$$

which is the minimax rate one can achieve for VC classes under the low-noise condition (2) with exponent $\alpha$ (e.g., Castro and Nowak, 2008; Hanneke, 2009; Koltchinskii, 2010; Dekel et al., 2012).
Hence, in order to achieve high-probability minimax rates, one need not try to make the algorithm more adaptive by having it operate with an even smaller $B$: any $B$ of order $T^{\frac{2}{\alpha+2}}$ will indeed suffice in our learning scenario.

3. Similar minimax bounds on excess risk against label complexity for linear function spaces have been shown in the streaming setting in Dekel et al. (2012); Wang et al. (2021), though their results only hold in the fully sequential case (that is, $B = 1$) and only hold in expectation over the random draw of the data, not with high probability. The fact that a batch greedy algorithm can be competitive with a fully sequential policy has also been observed in problems which are similar in spirit to active learning, like influence maximization (see, in particular, Chen and Krause, 2013). More recently, in the context of adaptive sequential decision making, Esfandiari et al. (2021) have proposed an efficient semi-adaptive policy that performs logarithmically-many rounds of interaction while achieving similar performance as the fully sequential policy. This paper improves on the original ideas contained in Golovin and Krause (2017). Yet, when adapted to active learning, these results turn out to apply to very stylized scenarios that assume lack of noise in the labels, and/or disregard the computational aspects associated with maintaining a posterior distribution or a version space (which would be of size $O(T^d)$ in our case).

## 5. The Nonlinear Case

We are now ready to consider the more general non-linear scenario described in Section 2. Given an arbitrary set of points $S = \{x_1, \ldots, x_T\}$ (not necessarily i.i.d. and possibly with repeated items), denote by $P(\cdot)$ a permutation over the indices $\{1, \ldots, T\}$, and by $x_{P(1)}, \ldots, x_{P(T)}$ the permutation of $S$ corresponding to $P$. We define the dimension $\mathrm{Dim}(\mathcal{F}, S)$ of the class of functions $\mathcal{F}$ projected onto $S$ as

$$\mathrm{Dim}(\mathcal{F}, S) = \sup_P\, \sum_{t=1}^T D^2\big(x_{P(t)};\, x_{P(1)}, \ldots, x_{P(t-1)}\big)\,, \quad \text{where} \quad D^2\big(x;\, x_1, \ldots, x_{t-1}\big) = \sup_{f,g\in\mathcal{F}}\ \frac{(f(x) - g(x))^2}{\sum_{i=1}^{t-1}(f(x_i) - g(x_i))^2 + 1}\,.$$

The function $D(x;\, x_1, \ldots, x_{t-1})$ defined above is our ($\mathcal{F}$-dependent) measure of the diversity of $x$ from $x_1, \ldots, x_{t-1}$, and will be used to select diverse points $x$ to query in each batch. We will give below notable examples where $\mathrm{Dim}(\mathcal{F}, S)$ is suitably bounded. For now, what is important to keep in mind is that our active learning analyses will be non-vacuous whenever

$$\mathrm{Dim}_T(\mathcal{F}) = \sup_{S \subseteq \mathcal{X}\,:\,|S| = T} \mathrm{Dim}(\mathcal{F}, S)$$

is sub-linear in $T$, e.g., $\mathrm{Dim}_T(\mathcal{F}) = O(\mathrm{poly}(\log T))$ or $\mathrm{Dim}_T(\mathcal{F}) = O(T^\zeta)$ for $\zeta \in [0, 1)$. A sufficient condition for this to happen is discussed next. The reader familiar with the literature on non-linear contextual bandits will recognize $\mathrm{Dim}(\mathcal{F}, S)$ as a close relative of the so-called eluder dimension (Russo and Van Roy, 2013), as well as of the more recent online decoupling coefficient, as defined in Dann et al. (2021). We recall that for a function space $\mathcal{F} = \{f : \mathcal{X} \to [0,1]\}$ and scale $\epsilon > 0$, the eluder dimension $\mathrm{dim}_E(\mathcal{F}, \epsilon)$ can be defined as follows. A point $x \in \mathcal{X}$ is $\epsilon$-dependent on points $\{x_1, \ldots, x_t\} \subseteq \mathcal{X}$ with respect to $\mathcal{F}$ if any pair of functions $f, g \in \mathcal{F}$ satisfying $\sum_{i=1}^t (f(x_i) - g(x_i))^2 \le \epsilon^2$ also satisfies $(f(x) - g(x))^2 \le \epsilon^2$. Point $x$ is $\epsilon$-independent of $\{x_1, \ldots, x_t\}$ with respect to $\mathcal{F}$ if $x$ is not $\epsilon$-dependent on $\{x_1, \ldots, x_t\}$. Then $\mathrm{dim}_E(\mathcal{F}, \epsilon)$ is the length of the longest sequence of elements in $\mathcal{X}$ such that, for some $\epsilon' \ge \epsilon$, every element in the sequence is $\epsilon'$-independent of its predecessors in the sequence. A number of papers provide bounds on the eluder dimension for various function classes. The original bounds for tabular, linear, and generalized linear functions are found in Russo and Van Roy (2013); Osband and Van Roy (2014). When the function class lies in a reproducing kernel Hilbert space (RKHS), Huang et al. (2021) show that the eluder dimension is equivalent to the notion of information gain (see also Remark 6 below). Further, Dong et al.
(2021) prove an exponential lower bound on the eluder dimension for ReLU networks.

As for the relationship between Dim_T(F) and dim_E(F, ϵ), we show in Appendix B (Lemma 23 therein) the following bound:³

    Dim_T(F) ≤ C inf_{η ≥ 1} ( dim_E(F, T^{−1/(2η)}) log²(T) + T^{1−1/η} ),   (4)

where C > 0 is a universal constant.

3. For a weighted version of our complexity measure, Ye et al. (2023) obtained a bound corresponding to η = 1, which does not enable a trade-off on η.

For instance, if dim_E(F, ϵ) = O((1/ϵ)^β) for some β ≥ 0, then selecting η in (4) so as to equalize the two terms on the right-hand side (and disregarding the log²(T) factor) yields η = (2 + β)/2, hence a bound of the form Dim_T(F) = O(T^{β/(2+β)}), which is achieved for eluder dimension dim_E(F, ϵ) at scale ϵ = (1/T)^{1/(2+β)}. As a consequence, our bounds will be non-vacuous if the function space F satisfies dim_E(F, ϵ) = O((1/ϵ)^β) for some β < ∞. We stress that this is not possible to achieve if we rely on the typical eluder dimension scale of ϵ = 1/T found in the non-linear bandit literature in the absence of gap assumptions (e.g., Russo and Van Roy, 2013), or the scale ϵ = ∆ in the presence of a gap ∆ > 0 among actions/policies (e.g., Foster et al., 2020).

In the case where F is a class of linear functions on a d-dimensional space, it is known that dim_E(F, ϵ) = O(d log(1/ϵ)), and we can set η = 1 in (4) to obtain Dim_T(F) = O(d log³ T), which is worse than the usual d log T bound one would expect. As a matter of fact, in the linear (or generalized linear) case a more direct argument is possible (see Remark 6 below), giving the desired Dim_T(F) = O(d log T) bound.

The algorithm for the non-linear case is given as Algorithm 2.
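Before presenting the algorithm, the role of the diversity measure D can be made concrete. The following is a minimal brute-force sketch (our own illustration, not the paper's implementation) of one stage of the greedy selection rule: keep querying the point maximizing D(x, Q) until every remaining point falls within the stage tolerance ϵ. The finite class F used in the test is an assumed toy class of threshold functions.

```python
import itertools

def d_sq(F, x, Q):
    # D^2(x; Q) = sup_{f,g in F} (f(x) - g(x))^2 / (sum_{q in Q} (f(q) - g(q))^2 + 1),
    # computed by brute force over a finite class F.
    return max(
        (f(x) - g(x)) ** 2 / (sum((f(q) - g(q)) ** 2 for q in Q) + 1.0)
        for f, g in itertools.product(F, repeat=2)
    )

def greedy_stage(F, pool, eps):
    """One stage of the greedy selection: repeatedly query the most diverse
    remaining point until max_x D(x, Q) <= eps."""
    Q, remaining = [], list(pool)
    while remaining:
        x = max(remaining, key=lambda z: d_sq(F, z, Q))
        if d_sq(F, x, Q) <= eps ** 2:
            break
        Q.append(x)
        remaining.remove(x)
    return Q
```

With two threshold functions 1{x ≥ 0.3} and 1{x ≥ 0.6}, only points lying between the thresholds carry any disagreement, so the greedy stage queries them until their accumulated diversity falls below ϵ² and ignores the rest.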
The pseudocode is similar to that of Algorithm 1, where we replace the diversity measure ||x||_{A^{−1}_{ℓ,t−1}} by D(x, Q_ℓ) and the margin ⟨ŵ_ℓ, x⟩ by f̂_ℓ(x) − 1/2, and define the confidence levels through the quantity K_T(δ, ℓ, γ) satisfying

    K_T(δ, ℓ, γ) = √( 8 log( (2ℓ(ℓ+1)/δ) (eT/(Vγ))^V ) ) + √( 8 log( 8ℓ(ℓ+1)T²/δ ) ),

where V is the VC-subgraph dimension (or Pollard's pseudo-dimension) of F, i.e., the VC-dimension of the 0/1-valued class of functions

    { (x, θ) ↦ 1{f(x) ≤ θ} : f ∈ F, θ ∈ [0, 1] }.

The above expression for K_T(δ, ℓ, γ) is itself the result of upper bounding, in the parametric case (V < ∞), the covering numbers N(F, Q_ℓ, γ) (with γ = 1/T) occurring in the analysis; see, in particular, Lemma 26 in Appendix C. This upper bound via the VC-subgraph dimension is not strictly necessary, and is only aimed at making the function K_T(δ, ℓ, 1/T) occurring in the algorithm's pseudocode easier to interpret.

5.1 Analysis

The following is the main result of this paper.

Theorem 3 Let V be the VC-subgraph dimension of F, and let T ≥ max{Dim(F, P), V}. Then with probability at least 1 − δ over the random draw of (x_1, y_1), ..., (x_T, y_T) ~ D, the excess risk L(f̂) − L(f*), the label complexity N_T(P), and the number of stages L generated by Algorithm 2 with γ = 1/T are simultaneously upper bounded as follows:

    L(f̂) − L(f*) ≤ C·C(δ, T, ϵ_0)·max{ (Dim(F, P)/T)^{(α+1)/(α+2)}, Dim(F, P)/(T ϵ_0) },

    N_T(P) ≤ C·C(δ, T, ϵ_0)·( max{ Dim(F, P)^{α/(α+2)} T^{2/(α+2)}, Dim(F, P)/ϵ_0² } + log² log T ),

    L ≤ C·( max{ (1/(α+2)) log(T/Dim(F, P)), log(1/ϵ_0) } + log log T ),

for an absolute constant C, and C(δ, T, ϵ_0) = V log²(T/δ).

A few remarks are in order.

Remark 4 In Algorithm 2 the stopping condition Dim(F, P)/2^{−ℓ+1} > 2^{−ℓ+1}|P_ℓ| involves Dim(F, P), which may be hard to compute in practice. In its stead, we can introduce a tuning parameter ω and replace Dim(F, P) therein by ω.
As a consequence, the bounds in Theorem 3 (ignoring constant factors and lower-order terms) become

    L(f̂) − L(f*) ≤ max{ (Dim(F, P)/T)^{(α+1)/(α+2)} (Dim(F, P)/ω)^{1/(α+2)}, (ω/T)^{(α+1)/(α+2)}, Dim(F, P)/(T ϵ_0) },

    N_T(P) ≤ max{ Dim(F, P)^{α/(α+2)} T^{2/(α+2)} (Dim(F, P)/ω)^{2/(α+2)}, Dim(F, P)/ϵ_0² },

    L ≤ max{ (1/(α+2)) log(T/ω), log(1/ϵ_0) }.

Note that the excess risk is optimized when ω = Dim(F, P).

Remark 5 For the sake of illustration, consider the case where Dim(F, P) ≤ Dim_T(F) = O(T^ζ), for some ζ ∈ [0, 1), as is the case when dim_E(F, ϵ) = O((1/ϵ)^β) for some β < ∞. In this case, one can easily see that with high probability (and disregarding constants and log factors)

    L(f̂) − L(f*) ≤ (1/T)^{(1−ζ)(α+1)/(α+2)},

which is similar to the bounds obtained in Wang et al. (2021) in the streaming setting under assumptions that are similar in spirit.

Algorithm 2: Pool-based batch active learning algorithm for general non-linear models.
1 Input: confidence level δ ∈ (0, 1], pool of instances P ⊆ R^d of size |P| = T, scale parameter γ
2 Initialize: P_0 = P
3 for ℓ = 1, 2, ...
4   Initialize within stage ℓ: ϵ_ℓ = 2^{−ℓ}/K_T(δ, ℓ, γ), Q_ℓ = ∅
    while P_{ℓ−1}\Q_ℓ ≠ ∅ and max_{x ∈ P_{ℓ−1}\Q_ℓ} D(x, Q_ℓ) > ϵ_ℓ:
        Pick x_{ℓ,t} ∈ argmax_{x ∈ P_{ℓ−1}\Q_ℓ} D(x, Q_ℓ)
        Update Q_ℓ = Q_ℓ ∪ {x_{ℓ,t}}
    Set T_ℓ = t, the number of queries made in stage ℓ
    if Q_ℓ ≠ ∅:
        Query the labels y_{ℓ,1}, ..., y_{ℓ,T_ℓ} associated with the unlabeled data in Q_ℓ, and compute
            f̂_ℓ = argmin_{f ∈ F} Σ_{t=1}^{T_ℓ} (y_{ℓ,t} − f(x_{ℓ,t}))²
        Set C_ℓ = {x ∈ P_{ℓ−1}\Q_ℓ : |f̂_ℓ(x) − 1/2| > 2^{−ℓ}}
        Compute pseudo-labels on each x ∈ C_ℓ as ŷ = sgn(f̂_ℓ(x) − 1/2)
    else:
        f̂_ℓ ≡ 1/2 (random guess), C_ℓ = ∅
    Set P_ℓ = P_{ℓ−1}\(C_ℓ ∪ Q_ℓ)
    if Dim(F, P)/2^{−ℓ+1} > 2^{−ℓ+1}|P_ℓ|:
        Exit the for-loop (L is the total number of stages)
5 Predict labels in pool P:
    Find the empirical 0-1 loss minimizer f̂ on ∪_{ℓ=1}^L C_ℓ via the generated pseudo-labels ŷ
    Predict on each x ∈ (∪_{ℓ=1}^L Q_ℓ) ∪ P_L through sgn(f̂(x) − 1/2)

Remark 6 One may wonder whether it is generally possible to directly bound Dim_T(F) in relevant cases.
The only relevant cases we are currently aware of are: (i) the linear and generalized linear cases considered in Section 4 of this paper and in Section 5 of Gentile et al. (2022b), where it is easy to show that Dim_T(F) = O(d log T); (ii) the case where F is made up of bounded functions in an rkhs. In this case,

    Dim(F, S) ≤ log|I + G(S)|,   and   Dim_T(F) ≤ max_{S ⊆ X : |S| = T} log|I + G(S)|,

where G(S) is the kernel Gram matrix built out of the data in pool S. Hence Dim_T(F) depends on the decay of the eigenvalues of the underlying kernel function. In this case, since F may not be a VC-class, we also need to resort to known results for bounding covering numbers of function classes within an rkhs (e.g., Zhou, 2002).

The argument for the linear case is particularly simple. Consider an arbitrary ordering x_1, ..., x_T of S, and write

    D²(x_t; x_1, ..., x_{t−1}) = sup_{w_1, w_2 ∈ R^d : ||w_1||, ||w_2|| ≤ 1} ⟨w_1 − w_2, x_t⟩² / ( Σ_{i=1}^{t−1} ⟨w_1 − w_2, x_i⟩² + 1 )
                              = sup_{w ∈ R^d : ||w|| ≤ 2} ⟨w, x_t⟩² / ( Σ_{i=1}^{t−1} ⟨w, x_i⟩² + 1 )
                              ≤ sup_{w ∈ R^d : ||w|| ≤ 2} (w^⊤ x_t x_t^⊤ w) / (w^⊤ A_{t−1} w),

where A_{t−1} = Σ_{i=1}^{t−1} x_i x_i^⊤ + (1/4) I, so that Σ_{i=1}^{t−1} ⟨w, x_i⟩² + 1 ≥ w^⊤ A_{t−1} w whenever ||w|| ≤ 2. Since w^⊤ x_t x_t^⊤ w ≤ (w^⊤ A_{t−1} w)(x_t^⊤ A_{t−1}^{−1} x_t), the above gives

    D²(x_t; x_1, ..., x_{t−1}) ≤ x_t^⊤ A_{t−1}^{−1} x_t,   hence   Dim(F, S) ≤ Σ_{t=1}^T x_t^⊤ A_{t−1}^{−1} x_t = O(d log T)

by known inequalities (e.g., Azoury and Warmuth, 2001; Cesa-Bianchi et al., 2005; Abbasi-Yadkori et al., 2011). A very similar argument can be carried out for the rkhs case by switching to dual variables, leading to the claimed bound Dim(F, S) ≤ log|I + G(S)|; see, e.g., Section 3.2 of Cesa-Bianchi et al. (2005).

Incidentally, the above also shows that the algorithm for the general non-linear case (Algorithm 2) essentially recovers the one for the linear case (Algorithm 1) up to logarithmic factors. To see this, observe that the quantity K_T(δ, ℓ, γ) at scale γ = 1/T defining ϵ_ℓ in Algorithm 2 is similar to the logarithmic factor in the denominator of the expression for ϵ_ℓ in Algorithm 1.
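The chain of inequalities in the linear argument can be sanity-checked numerically. This is a sketch under our reading of the display (ridge term I/4, chosen so that the +1 in the denominator dominates w^⊤(I/4)w for ||w|| ≤ 2): random probing over vectors with ||w|| = 2 never pushes the ratio above x_t^⊤ A_{t−1}^{−1} x_t.

```python
import numpy as np

rng = np.random.default_rng(0)
d, t = 3, 6
xs = rng.normal(size=(t, d))
prefix, x = xs[:-1], xs[-1]                 # x_1, ..., x_{t-1} and the candidate x_t
A = prefix.T @ prefix + 0.25 * np.eye(d)    # A_{t-1} = sum_i x_i x_i^T + I/4 (our reading)
bound = x @ np.linalg.solve(A, x)           # x_t^T A_{t-1}^{-1} x_t

def ratio(w):
    # <w, x_t>^2 / (sum_i <w, x_i>^2 + 1): the quantity whose sup over ||w|| <= 2 is D^2
    return (w @ x) ** 2 / (np.sum((prefix @ w) ** 2) + 1.0)

# probe the supremum by random sampling on the sphere of radius 2
best = max(ratio(2.0 * w / np.linalg.norm(w)) for w in rng.normal(size=(5000, d)))
assert best <= bound + 1e-9                 # D^2 never exceeds x_t^T A_{t-1}^{-1} x_t
```

The assertion holds for every sampled w by Cauchy-Schwarz, so the check cannot fail; it merely illustrates the two-step bound.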
More importantly, as argued above, in the linear case we have D(x, Q_ℓ) ≤ ||x||_{A^{−1}_{ℓ,t}} and Dim(F, S) = O(d) up to logarithmic factors, which are the corresponding quantities Algorithm 1 relies upon.

Remark 7 A constant batch size version of Algorithm 2 can also be devised, and the associated properties spelled out. The details are similar to those in Section 4.2 for the linear case, but the interplay with the batch size B now also depends on the behavior of Dim(F, P), or Dim_T(F), as a function of T = |P|. The details are easy extensions of the arguments we carried out for the linear case (see the comments surrounding Corollary 2), and are therefore omitted.

6. Conclusions and Open Questions

We have described and analyzed novel batch active learning algorithms in the pool-based setting that in relevant cases achieve minimax rates of convergence of the excess risk as a function of the number of queried labels. The minimax nature of our results is retained even when the batch size B is allowed to scale polynomially (B = T^β, for β ≤ 1) with the size T of the training set, the allowed exponent β depending on the actual level of noise in the data. The algorithms have a number of re-training rounds which is at worst logarithmic, and automatically adapt to the noise level. Our algorithms generate pseudo-labels by restricting to exponentially small regions of the margin space. The main content of this paper is the extension of the above results to the general non-linear case through a notion of dimension Dim(F, S) that takes the adaptive nature of active learning sampling into account. The resulting algorithm is a generalization of the one for the linear case, while the corresponding analysis greatly generalizes the linear-case analysis, replacing the greedy version of G-optimal design with a design guided by the diversity measure induced by the function space at hand.
We related Dim(F, S) to the eluder dimension dim_E(F, ϵ) of F at an appropriate scale ϵ, and pointed out that any function space F such that dim_E(F, ϵ) is polynomial in 1/ϵ is actually learnable in our realizable scenario. It would be interesting to see whether it is possible to replace Dim_T(F) by some substantially smaller notion of dimension, in both the algorithm and the analysis. For instance, it might be possible to leverage the fact that our algorithm selects points x_t by maximizing D(x, Q_ℓ), so that the sequence x_1, x_2, ... is not completely arbitrary. This would be similar in spirit to the arguments of Brukhim et al. (2023), where the authors introduce a notion of function complexity, called the dissimilarity measure, that is able to exploit the specific structure of optimistic bandit algorithms, leading to bounds that are in some cases substantially sharper than those based on the eluder dimension.

Acknowledgements

We thank the anonymous reviewers for their insightful comments that helped improve both the content and the presentation of this paper. We thank the action editor for their careful handling of our submission. Tong Zhang is partially supported by NSF IIS grant No. 2416897.

Appendix A. Proofs for Section 4

Consider Algorithm 1, and denote by T_ℓ the length of stage ℓ. For any ϵ > 0, define

    T_ϵ = {x ∈ P : |⟨w*, x⟩| ≤ ϵ}.

Recall that in Algorithm 1 the variable L counts the total number of stages (a random quantity), while the size of the original pool |P| is denoted by T. We first show that on the confident sets, that is, on the sets C_ℓ where pseudo-labels are generated, the learner incurs no regret with high probability. Before giving our key lemma, it will be useful to define the events

    E_ℓ = { max_{x ∈ P_{ℓ−1}\Q_ℓ} |⟨ŵ_ℓ − w*, x⟩| ≤ 2^{−ℓ} },   for ℓ = 1, ..., L.

Lemma 8 For any positive L, we have P( ∩_{ℓ=1}^L E_ℓ ) ≥ 1 − δ.

Proof We assume P_{ℓ−1}\Q_ℓ is not empty (it could be empty only in the final stage L).
We follow the material contained in Chapters 20 and 21 of the book of Lattimore and Szepesvari (2020). Let ξℓ,t = yℓ,t w , xℓ,t and notice that ξℓ,t are independent 1-sub Gaussian random variables conditioned on Pℓ 1. Also, observe that, conditioned on past stages 1, . . . , ℓ 1, we are in a fixed design scenario, where the xℓ,t are chosen without knowledge of the corresponding labels yℓ,t. We can write, for any x Pℓ 1, wℓ w , x = A 1 ℓ,Tℓ t=1 yℓ,txℓ,t t=1 xℓ,t w , xℓ,t + ξℓ,txℓ,t = A 1 ℓ,Tℓ(Aℓ,Tℓ I)w + A 1 ℓ,Tℓ t=1 ξℓ,txℓ,t = w , x A 1 ℓ,Tℓ+ t=1 xℓ,t, x A 1 ℓ,Tℓξℓ,t . Since {ξℓ,t}Tℓ t=1 are 1-sub-Gaussian and independent conditioned on {xℓ,t}, the variance term PTℓ t=1 xℓ,t, x A 1 ℓ,Tℓξℓ,t is r PTℓ t=1 xℓ,t, x 2 A 1 ℓ,Tℓ -sub-Gaussian. We apply Lemma 29 t=1 xℓ,t, x A 1 ℓ,Tℓξℓ,t t=1 xℓ,t, x 2 A 1 ℓ,Tℓ log 2ℓ(ℓ+ 1)T δ ℓ(ℓ+ 1)T . Fast Rates in Pool-Based Batch Active Learning Now observe that t=1 xℓ,t, x 2 A 1 ℓ,Tℓ = x 2 A 1 ℓ,Tℓ A 1 ℓ,Tℓx 2 x 2 A 1 ℓ,Tℓ . We plug back into the previous inequality to obtain t=1 xℓ,t, x A 1 ℓ,Tℓξℓ,t 2 x 2 A 1 ℓ,Tℓ log 2ℓ(ℓ+ 1)T δ ℓ(ℓ+ 1)T . Using a union bound, we get with probability at least 1 δ ℓ(ℓ+1), t=1 xℓ,t, x A 1 ℓ,Tℓξℓ,t 2 x 2 A 1 ℓ,Tℓ log 2ℓ(ℓ+ 1)T holds uniformly for all x Pℓ 1. For the bias term w , x A 1 ℓ,Tℓ, notice that Aℓ,Tℓ I | w , x A 1 ℓ,Tℓ| x A 1 ℓ,Tℓ w A 1 ℓ,Tℓ x A 1 ℓ,Tℓ. Hence with probability at least 1 δ ℓ(ℓ+1), | wℓ w , x | 2 log 2ℓ(ℓ+ 1)T x A 1 ℓ,Tℓ, holds uniformly for all x Pℓ 1. Notice that by the selection criterion in Algorithm 1, max x Pℓ 1\Qℓ x A 1 ℓ,Tℓ ϵ2 ℓ. As a consequence, with probability at least 1 δ ℓ(ℓ+1), max x Pℓ 1\Qℓ | wℓ w , x | 2 log 2ℓ(ℓ+ 1)T Recalling the definition of ϵℓin Algorithm 1 and using an union bound over ℓ, we get the desired result. As a simple consequence, we have the following lemma. Lemma 9 Assume TL ℓ=1 Eℓholds. Then Algorithm 1 generates pseudo-labels such that, on all points x L ℓ=1Cℓ, sgn( wℓ, x ) = sgn( w , x ). 
Proof Simply observe that if x ∈ ∪_{ℓ=1}^L C_ℓ is such that sgn(⟨ŵ_ℓ, x⟩) = 1, then ⟨ŵ_ℓ, x⟩ > 2^{−ℓ}, which implies ⟨w*, x⟩ > 0 by the assumption that E_ℓ holds. Similarly, sgn(⟨ŵ_ℓ, x⟩) = −1 implies ⟨w*, x⟩ < 0.

Lemma 10 The length T_ℓ of stage ℓ in Algorithm 1 is (deterministically) upper bounded as

    T_ℓ ≤ (3d/ϵ_ℓ²) log(1/ϵ_ℓ).

Proof Since stage ℓ of the algorithm terminates at T_ℓ, any round t < T_ℓ is such that ||x_{ℓ,t+1}||²_{A^{−1}_{ℓ,t}} > ϵ_ℓ². Denoting by |·| the determinant of the matrix at argument, we have the known identity

    |A_{ℓ,t+1}| = |A_{ℓ,t} + x_{ℓ,t+1} x_{ℓ,t+1}^⊤| = |A_{ℓ,t}| · |I + A^{−1}_{ℓ,t} x_{ℓ,t+1} x_{ℓ,t+1}^⊤| = (1 + ||x_{ℓ,t+1}||²_{A^{−1}_{ℓ,t}}) |A_{ℓ,t}|,

where the third equality holds since I + A^{−1/2}_{ℓ,t} x_{ℓ,t+1} x_{ℓ,t+1}^⊤ A^{−1/2}_{ℓ,t} has d − 1 eigenvalues equal to 1 and one eigenvalue equal to 1 + ||x_{ℓ,t+1}||²_{A^{−1}_{ℓ,t}}. Combining the above identity with the fact that log(1 + x) ≥ x/(1 + x) ≥ x/2 for 0 ≤ x ≤ 1, we get

    ||x_{ℓ,t+1}||²_{A^{−1}_{ℓ,t}} ≤ 2 ( log|A_{ℓ,t+1}| − log|A_{ℓ,t}| ).

Therefore 2 ( log|A_{ℓ,t+1}| − log|A_{ℓ,t}| ) > ϵ_ℓ² for every t < T_ℓ, and summing over t = 0, ..., T_ℓ − 1 yields

    2 log( |A_{ℓ,T_ℓ}| / |A_{ℓ,0}| ) ≥ ϵ_ℓ² T_ℓ.

Now, A_{ℓ,0} = I, so that |A_{ℓ,0}| = 1, and

    log|A_{ℓ,T_ℓ}| ≤ log( (trace(A_{ℓ,T_ℓ})/d)^d ) ≤ d log(1 + T_ℓ/d).

Let G(x) = x/log(1 + x), and notice that G(x) is increasing for x > 0. Applying this monotonicity to the resulting inequality, and using ϵ_ℓ ≤ ϵ_1 < 1, delivers the claimed bound on T_ℓ.

The proof then proceeds by bounding two relevant quantities associated with the behavior of Algorithm 1: the label complexity N_T(P) and the weighted cumulative regret over the pool P of size T, defined as

    R_T(P) = Σ_{x ∈ P} 1{ sgn(⟨ŵ, x⟩) ≠ sgn(⟨w*, x⟩) } |⟨w*, x⟩|.

We will first present intermediate bounds on R_T(P) and N_T(P) as a function of L, and then rely on the properties of the noise (hence the randomness of P) to complete the proofs. To simplify the math displays we denote

    K_T(δ, ℓ) = 1 + √( 2 log( 2ℓ(ℓ+1)T/δ ) ),   so that ϵ_ℓ = 1/(2^ℓ K_T(δ, ℓ)).
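The potential argument in the proof of Lemma 10 is easy to verify numerically: as long as ||x|| ≤ 1, each rank-one update inflates log|A| by at least half of ||x||²_{A^{−1}}, and the final log-determinant is at most d log(1 + T/d). A self-contained numerical sketch (not the paper's code):

```python
import numpy as np

rng = np.random.default_rng(1)
d, T = 4, 200
A = np.eye(d)                               # A_0 = I, as in Algorithm 1
potential = 0.0
for _ in range(T):
    x = rng.normal(size=d)
    x /= max(1.0, np.linalg.norm(x))        # ensure ||x|| <= 1, so ||x||^2_{A^{-1}} <= 1
    potential += x @ np.linalg.solve(A, x)  # accumulate ||x_t||^2_{A_{t-1}^{-1}}
    A += np.outer(x, x)

_, logdet = np.linalg.slogdet(A)
assert potential <= 2.0 * logdet            # from log(1 + u) >= u/2 for u in [0, 1]
assert logdet <= d * np.log(1.0 + T / d)    # from |A_T| <= (trace(A_T)/d)^d
```

Both assertions hold deterministically for any sequence of unit-norm points, which is exactly why the stage length bound holds for the adversarially chosen greedy sequence as well.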
Lemma 10 immediately delivers the following bound on NT (P): Theorem 11 For any pool realization P, the label complexity NT (P) of Algorithm 1 operating on a pool P of size T is bounded deterministically as 3 d log 2LKT (δ, L) K2 T (δ, L)4L Proof By definition K2 T (δ, L) 3d log 2LKT (δ, L) K2 T (δ, L)4L+1 , where the second inequality follows from the fact that both 1 ϵℓand KT (δ, ℓ) increase with ℓ, and the last inequality follows from PL ℓ=1 4ℓ< 4 As for the regret RT (P), we have the following high probability result. Gentile and Wang and Zhang Theorem 12 For any pool realization P, the weighted cumulative regret RT (P) of Algorithm 1 operating on a pool P of size T is bounded as RT (P) 64d log 2LKT (δ, L) K2 T (δ, L)2L + d 2L 1 , assuming TL ℓ=1 Eℓholds. Proof We decompose the pool P as the union of following disjoint sets P = L l=1Cℓ L l=1Qℓ PL and, correspondingly, the weighted cumulative regret RT (P) as the sum of the three components RT (P) = R( L l=1Cℓ) + R( L l=1Qℓ) + R(PL) . Assume TL ℓ=1 Eℓholds. First, notice that on Cℓ, sgn bw, x = sgn wℓ, x = sgn w , x under the assumption that Eℓholds, thus points in L ℓ=1Cℓdo not contribute weighted regret for bw, i.e., R( L l=1Cℓ) = 0 . Next, on PL, we have | w L, x | 2 L. Combining this with the assumption that EL holds, we get | w , x | 2 L+1, which implies that the weighted cumulative regret on PL is bounded as R(PL) 2 L+1|PL| < d 2L 1 , the second inequality deriving from the stopping condition defining L in Algorithm 1. Finally, on the queried points L l=1Qℓ, it is unclear whether sgn bw, x = sgn w , x or not, so we bound the weighted cumulative regret contribution of each data item x therein by | w , x |. Now, by construction, x Qℓ Pℓ 1, so that | wℓ 1, x | 2 ℓ+1 which, combined with the assumption that Eℓ 1 holds, yields | w , x | 2 ℓ+2. 
Since |Qℓ| = Tℓ, we have R( L ℓ=1Qℓ) 4 and Lemma 10 allows us to write R( L l=1Qℓ) 32d 64d log 2LKT (δ, L) K2 T (δ, L)2L , the last inequality following from a reasoning similar to the one that lead us to Theorem 11. Given any pool realization P, both the label complexity and weighted regret are bounded by a function of L. Adding the ingredient of the low noise condition (2) helps us leverage the randomness in P and further bound from above the number of stages L. Fast Rates in Pool-Based Batch Active Learning Specifically, assume the low noise condition (2) holds for f (x) = 1+ w ,x 2 , for some unknown exponent α 0 and unknown constant ϵ0 (0, 1]. Using a multiplicative Chernoff bound, it is easy to see that for any fixed ϵ , with probability at least 1 δ, 2 (Tϵα + log(1/δ)) , the probability being over the random draw of the initial pool P. Now, since ϵL is itself a random variable (since so is L), we need to resort to a covering argument. For any positive number M, consider the following set of fixed ϵ values 2i/α : i = 0, . . . , M o . Then with probability at least 1 δ, Tϵα + log M holds simultaneously over ϵ KM. Set M = log2 T and assume ϵ is the smallest value in KM that is bigger than or equal to ϵ . If ϵ is not the smallest value in KM, then by construction we have ϵα ϵα < 2ϵα so that, for all ϵ > ϵ0 2M/α , |Tϵ | |Tϵ| 3 Tϵα + log M < 3 Tϵα + log M On the other hand if ϵ ϵ0 2M/α we can write |Tϵ | T ϵ0 2M/α Tϵα 0 2M + log M making Eq. (5) hold in this case as well. We define the event |Tϵ | < 3 Tϵα + log M Then P E 1 δ , (6) for M = log2 T. We set ϵ to be the unique solution of the equation4 d/ϵ = 3 Tϵα+1 + ϵ log M Eq. (5) will be applied, in particular, to the margin value 2 L+2 when 2 L+2 ϵ0. Armed with Eqs. (5) and (7) with M = log2 T, we prove a lemma that upper bounds the number of stages L. 4. We need to further assume T > 2 3d so as to make sure the solution exists. Gentile and Wang and Zhang Lemma 13 Let ϵ be defined through (7), with T > 2 3d. 
Assume both E and TL ℓ=1 Eℓhold. Then the number of stages L of Algorithm 1 is upper bounded as 1 α+2 + 3 1 1 α+2 log(log2 T = max O 1 α + 2 log T + log log T Here the O-notation only omits absolute constants. Proof If at stage L 1 the algorithm has not stopped, then we must have d/2 L+2 2 L+2|PL 1| . Notice that if x PL 1 then | w L 1, x | 2 L+1. Combining it with the assumption that EL 1 holds, we have | w , x | 2 L+2 which implies |PL 1| |T2 L+2|. We split the analysis into two cases. On one hand, when 2 L+2 > ϵ0, this condition gives us directly On the other hand if 2 L+2 ϵ0, then given E holds, |T2 L+2| is upper bounded as |T2 L+2| 3 T 2( L+2)α + log M with M = log2 T. Plugging into the first display results in d/2 L+2 3 T2( L+2)(α+1) + 2 L+2 log(M which resembles (7) with 2 L+2 here playing the role of ϵ therein. Then, from the definition of ϵ in (7) we immediately obtain 2 L+2 ϵ , thus L log2( 1 ϵ ) + 2. Moreover, from (7) we see that d/ϵ 3Tϵα+1 , which is equivalent to ϵ ( d 3T ) 1 α+2 . Replacing this upper bound on ϵ back into the right-hand side of (7) and dividing by d yields 1 α+2 + 3 1 1 α+2 log(M which gives the claimed upper bound on L through L log2( 1 Corollary 14 Let T > d. Then with probability at least 1 2δ over the random draw of (x1, y1), . . . , (x T , y T ) D the label complexity NT (P) and the weighted cumulative regret Fast Rates in Pool-Based Batch Active Learning RT (P) of Algorithm 1 simultaneously satisfy the following: NT (P) = log2 T O max d α α+2 T 2 α+2 , d + log2 log T RT (P) = log2 T O max d α+1 α+2 T 1 α+2 , d + log log T where the O-notation only omits absolute constants . Proof Assume both E and TL ℓ=1 Eℓhold. Recalling the definition of KT (δ, L), we have KT (δ, L) = O Similar to lemma 13, we split the analysis into two cases depending on whether or not 2 L+2 is bigger than ϵ0. 
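As an aside, the threshold ϵ* has no closed form, but the defining equation is monotone (its left-hand side decreases and its right-hand side increases in ϵ), so it can be computed by bisection. A sketch, assuming Eq. (7) reads d/ϵ = 3(Tϵ^{α+1} + ϵ log M); the helper name eps_star is ours:

```python
import math

def eps_star(d, T, alpha, M, lo=1e-12, hi=1.0):
    """Bisection for the root of d/eps = 3*(T*eps**(alpha+1) + eps*log(M)).
    The left side decreases and the right side increases in eps, so the
    root in (lo, hi) is unique whenever d < 3*(T + log(M))."""
    gap = lambda e: d / e - 3.0 * (T * e ** (alpha + 1) + e * math.log(M))
    for _ in range(200):
        mid = 0.5 * (lo + hi)
        if gap(mid) > 0.0:
            lo = mid            # still below the root
        else:
            hi = mid
    return 0.5 * (lo + hi)
```

Consistently with the argument in the text, the root always sits below (d/(3T))^{1/(α+2)}, the value obtained by dropping the ϵ log M term.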
If 2 L+2 ϵ0, we have 1 α+2 + 3 1 1 α+2 log(log2 T 1 α+2 log log T Plugging the above bounds into Theorem 11 gives NT (P) = O d L + log K2 T (δ, L) K2 T (δ, L)4L = O L + K2 T (δ, L) K2 T (δ, L) d α α+2 T 2 α+2 + log2 log T 2 d α α+2 T 2 α+2 + log2 log T O d α α+2 T 2 α+2 + log2 log T Similarly applying them to Theorem 12, RT (P) = O d L + log K2 T (δ, L) K2 T (δ, L)2L = O L + K2 T (δ, L) K2 T (δ, L) d α+1 α+2 T 1 α+2 + log log T 2 d α+1 α+2 T 1 α+2 + log log T O d α+1 α+2 T 1 α+2 + log log T where in the second equality we used the assumption that d < T. Gentile and Wang and Zhang If 2 L+2 > ϵ0, then 2L 4 ϵ0 . Plugging these bounds into Theorem 11 and Theorem 12 gives NT (P) = O log2 T RT (P) = O log2 T Lastly, (6) and lemma 8 together yield which concludes the proof. We now turn the bound on the weighted cumulative regret RT (P) in the previous corollary into a bound on the excess risk. We can write L(bw) L(w ) = E(x,y) D h 1{y = sgn( bw, x )} 1{y = sgn( w , x )} i = Ex DX h Ey DY|X 1{y = sgn( bw, x )} 1{y = sgn( w , x )} i = Ex DX h 1{sgn( bw, x ) = sgn( w , x )} | w , x | i , where bw is the hypothesis returned by Algorithm 1. Now, simply observe that 1{sgn( bw, x ) = sgn( w , x )} | w , x | has the same form as the function φ(bw, x) in Appendix C on which the uniform convergence result of Theorem 28 applies, with bϵ(δ) therein replaced by the bound on RT (P) borrowed from Corollary 14. This allows us to conclude that with probability at least 1 δ L(bw) L(w ) = log2 T α+2 , d Tϵ0 + log log T as claimed in Theorem 1 in the main body of the paper. Appendix B. Proofs for Section 5 The proof has the very same structure as the one in Section A. Hence we only emphasize the main differences. The starting point of the analysis is Proposition 2 in Russo and Van Roy (2013) which, with our assumptions and notation, reads as follows. Fast Rates in Pool-Based Batch Active Learning Lemma 15 Let bfℓbe the estimator computed by Algorithm 2 at the end of stage ℓ, and Qℓ= {xℓ,1, . 
. . , xℓ,Tℓ} be the queried points in stage ℓ. Then with probability at least 1 δ ℓ(ℓ+1) we have t=1 (f (xℓ,t) bfℓ(xℓ,t))2 8 log 2ℓ(ℓ+ 1) N (F, Qℓ, γ) 8 log 8ℓ(ℓ+ 1)T 2 ℓ δ where N (F, Qℓ, γ) is the size of a γ-cover of function class F with respect to the infinity norm. Note that from Lemma 26 we can further bound N (F, Qℓ, γ) by e T V γ V , where V is the VC-subgraph dimension of F. To simplify the math display we denote KT (δ, ℓ, γ) = v u u u t8 log 2ℓ(ℓ+ 1) e T V γ V 8 log 8ℓ(ℓ+ 1)T 2 so that ϵℓ= 1 2ℓKT (δ,ℓ,γ). Moreover, we define, for any ϵ > 0, Tϵ = {x P : |f (x) 1/2| ϵ} . Recall that in Algorithm 2 variable L counts the total number of stages, while T denotes the size of the original pool P. Similar to Appendix A, we define the events Eℓ= max x Pℓ 1\Qℓ |f (x) bfℓ(x)| 2 ℓ , for ℓ= 1, . . . , L. Lemma 16 For any positive L, we have Proof We assume Pℓ 1\Qℓis not empty (it could be empty only in the final stage L). Then for any ℓand x Pℓ 1\Qℓwe can write (f (x) bfℓ(x))2 = (f (x) bfℓ(x))2 PTℓ t=1(f (xℓ,t) bfℓ(xℓ,t))2 + 1 t=1 (f (xℓ,t) bfℓ(xℓ,t))2 + 1 (f(x) g(x))2 PTℓ t=1(f(xℓ,t) g(xℓ,t))2 + 1 t=1 (f (xℓ,t) bfℓ(xℓ,t))2 + 1 = D2(x; Qℓ) t=1 (f (xℓ,t) bfℓ(xℓ,t))2 + 1 ϵ2 ℓK2 T (δ, ℓ, γ) , Gentile and Wang and Zhang holds with probability at least 1 δ ℓ(ℓ+1). This is due to Lemma 15 and the fact that max x Pℓ 1\Qℓ D(x, Qℓ) ϵℓ. Using the definition of ϵℓand union bound we get the desired result. Similar to Appendix A, this yields the following consequence. Lemma 17 Assume TL ℓ=1 Eℓholds. Then Algorithm 2 generates pseudo-labels such that, on all points x L ℓ=1Cℓ, sgn( bfℓ(x) 1/2) = sgn(f (x) 1/2). Lemma 18 The length Tℓof stage ℓin Algorithm 2 is (deterministically) upper bounded as Tℓ Dim(F, Qℓ) ϵ2 ℓ Dim(F, P) Proof Since in stage ℓthe algorithm terminates at Tℓ, any round t < Tℓis such that D2 (xℓ,t+1; xℓ,1, . . . , xℓ,t ) > ϵ2 ℓ. Summing this inequality up we get Dim(F, Qℓ) ϵ2 ℓTℓ, as a consequence Tℓ Dim(F, Qℓ) ϵ2 ℓ Dim(F, P) as claimed. 
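Lemma 18's counting argument can be replicated in code: every queried point enters with diversity gain above ϵ_ℓ², so T_ℓ · ϵ_ℓ² is strictly below the accumulated diversity of the queried set, which is in turn at most Dim(F, Q_ℓ). A brute-force sketch over an assumed toy class of threshold functions (our own illustration):

```python
import itertools

def d_sq(F, x, Q):
    # D^2(x; Q) for a finite class F, by brute force over pairs (f, g)
    return max(
        (f(x) - g(x)) ** 2 / (sum((f(q) - g(q)) ** 2 for q in Q) + 1.0)
        for f, g in itertools.product(F, repeat=2)
    )

def stage_with_gains(F, pool, eps):
    """Greedy stage; returns the queried set and the per-query diversity gains."""
    Q, gains, remaining = [], [], list(pool)
    while remaining:
        x = max(remaining, key=lambda z: d_sq(F, z, Q))
        g = d_sq(F, x, Q)
        if g <= eps ** 2:
            break
        Q.append(x)
        gains.append(g)        # each recorded gain exceeds eps^2 by construction
        remaining.remove(x)
    return Q, gains
```

Since each gain exceeds ϵ², we get len(Q) · ϵ² < sum(gains), and sum(gains) is exactly the cumulative diversity of Q in the order it was built, hence at most Dim(F, Q).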
We then bound the label complexity and the weighted cumulative regret over pool P of size T, x P 1 n sgn( bf(x) 1/2) = sgn(f (x) 1/2) o |f (x) 1/2| . We will first present intermediate bounds on RT (P) and NT (P) as a function of L, and then rely on the properties of the noise (hence the randomness on P) to complete the proofs. Lemma 18 immediately delivers the following bound on NT (P): Theorem 19 For any pool realization P, the label complexity NT (P) of Algorithm 2 operating on a pool P of size T is bounded deterministically as NT (P) 4L+1 3 K2 T (δ, L, γ)Dim(F, P) . As for the regret RT (P), we have the following high probability result. Fast Rates in Pool-Based Batch Active Learning Theorem 20 For any pool realization P, the weighted cumulative regret RT (P) of Algorithm 1 operating on a pool P of size T is bounded as RT (P) 2L+4K2 T (δ, L, γ)Dim(F, P) , assuming TL ℓ=1 Eℓholds. Proof We decompose the pool P as the union of following disjoint sets P = L l=1Cℓ L l=1Qℓ PL and, correspondingly, the weighted cumulative regret RT (P) as the sum of the three components RT (P) = R( L l=1Cℓ) + R( L l=1Qℓ) + R(PL) . Assume TL ℓ=1 Eℓholds. First, notice that on Cℓ, sgn( bf(x) 1/2) = sgn( bfℓ(x) 1/2) = sgn(f (x) 1/2) , where the second equality is due to Lemma 17 and the first equality follows from the fact that ˆf minimize the 0 1 loss on the generated pseudo-labels, thus incurs no loss since f already incurs zero loss. As a consequence points in L ℓ=1Cℓdo not contribute weighted regret for bf, i.e., R( L l=1Cℓ) = 0 . Next, on PL, we have | bf L(x) 1/2| 2 L. Combining this with the assumption that EL holds, we get |f (x) 1/2| 2 L+1, which implies that the weighted cumulative regret on PL is bounded as R(PL) 2 L+1|PL| < Dim(F, P) 2L 1 , the second inequality deriving from the stopping condition defining L in Algorithm 2. 
Finally, on the queried points L l=1Qℓ, it is unclear whether sgn( bf(x) 1/2) = sgn(f (x) 1/2) or not, so we bound the weighted cumulative regret contribution of each data item x therein by |f (x) 1/2|. Now, by construction, x Qℓ Pℓ 1, so that |fℓ 1(x) 1/2| 2 ℓ+1 which, combined with the assumption that Eℓ 1 holds, yields |f (x) 1/2| 2 ℓ+2. Since |Qℓ| = Tℓ, we have R( L ℓ=1Qℓ) 4 and Lemma 18 allows us to write R( L l=1Qℓ) 4 2 ℓDim(F, P) ϵ2 ℓ 2L+3K2 T (δ, L, γ)Dim(F, P) , the last inequality following from a reasoning similar to the one that lead us to Theorem 19. Gentile and Wang and Zhang Given any pool realization P, both the label complexity and weighted regret are bounded by a function of L. Adding the ingredient of the low noise condition (2) helps us leverage the randomness in P and further bound from above the number of stages L. Specifically, assume the low noise condition (2) holds for f (x), for some unknown exponent α 0 and unknown constant ϵ0 (0, 1]. We set ϵ to be the unique solution of the equation5 Dim(F, P)/ϵ = 3 Tϵα+1 + ϵ log M Eq. (5) holds in this case by low noise condition and will be applied, in particular, to the margin value 2 L+2 when 2 L+2 ϵ0. Armed with Eqs. (5) and (8) with M = log2 T, we prove a lemma that upper bounds the number of stages L. Recall the definition of event E: |Tϵ | < 3 Tϵα + log M Lemma 21 Let ϵ be defined through (7), with T > 2 3Dim(F, P) and γ = 1/T. Assume both E and TL ℓ=1 Eℓhold. Then the number of stages L of Algorithm 2 is upper bounded as " 3T Dim(F, P) 1 α+2 + 3 1 Dim(F, P) 1 α+2 log(log2 T = max O 1 α + 2 log T Dim(F, P) + log log T Here the O-notation only omits absolute constants. Proof If at stage L 1 the algorithm has not stopped, then we must have Dim(F, P)/2 L+2 2 L+2|PL 1| . Notice that if x PL 1 then | bf L 1(x) 1/2| 2 L+1. Combining it with the assumption that EL 1 holds, we have |f (x) 1/2| 2 L+2 which implies |PL 1| |T2 L+2|. 
The rest of the proof remains unchanged compared to Lemma 13, except that we replace d by Dim(F, P). 5. We need to further assume T > 2 3Dim(F, P) so as to make sure the solution exists. Fast Rates in Pool-Based Batch Active Learning Corollary 22 Let T > Dim(F, P) and γ = 1/T. Then with probability at least 1 2δ over the random draw of (x1, y1), . . . , (x T , y T ) D the label complexity NT (P) and the weighted cumulative regret RT (P) of Algorithm 2 simultaneously satisfy the following: NT (P) = V log2 T O max Dim(F, P) α α+2 T 2 α+2 , Dim(F, P) + log2 log T RT (P) = V log2 T O max Dim(F, P) α+1 α+2 T 1 α+2 , Dim(F, P) + log log T where the O-notation only omits absolute constants . Proof Assume both E and TL ℓ=1 Eℓhold. Recalling the definition of KT (δ, L, γ), we have KT (δ, L, γ) = v u u u t8 log 2ℓ(ℓ+ 1) e T V γ V 8 log 8ℓ(ℓ+ 1)T 2 v u u u u tlog log L + log T We choose γ = 1 T and simplify the above display as KT (δ, L, 1 Similar to Lemma 21, we split the analysis into two cases depending on whether or not 2 L+2 is bigger than ϵ0. If 2 L+2 ϵ0, we have " 3T Dim(F, P) 1 α+2 + 3 1 Dim(F, P) 1 α+2 log(log2 T T Dim(F, P) 1 α+2 + 1 Dim(F, P) 1 α+2 log log T Gentile and Wang and Zhang Plugging the above bounds into Theorem 19 gives NT (P) = O Dim(F, P) L + log KT K2 T (δ, L, 1 = O L + log KT K2 T (δ, L, 1 T ) Dim(F, P) α α+2 T 2 α+2 + log2 log T = O L + V log T Dim(F, P) α α+2 T 2 α+2 + log2 log T = L + V log T O Dim(F, P) α α+2 T 2 α+2 + log2 log T Similarly applying them to Theorem 20, RT (P) = O Dim(F, P) L + log KT K2 T (δ, L, 1 K2 T (δ, L, 1 T ) Dim(F, P) α+1 α+2 T 1 α+2 + log log T = O L + V log T Dim(F, P) α+1 α+2 T 1 α+2 + log log T O Dim(F, P) α+1 α+2 T 1 α+2 + log log T where in the second equality we used the assumption that Dim(F, P) < T. If 2 L+2 > ϵ0, then 2L 4 ϵ0 . Plugging these bounds into Theorem 19 and Theorem 20 gives NT (P) =O V log2 T O Dim(F, P) RT (P) =O V log2 T O Dim(F, P) Lastly, (6) and lemma 16 together yield which concludes the proof. 
We now turn the bound on the weighted cumulative regret $R_T(P)$ in the previous corollary into a bound on the excess risk. We can write
$$L(\widehat f) - L(f^*) = \mathbb{E}_{(x,y)\sim\mathcal{D}}\left[\mathbf{1}\big\{y \ne \mathrm{sgn}(\widehat f(x) - 1/2)\big\} - \mathbf{1}\big\{y \ne \mathrm{sgn}(f^*(x) - 1/2)\big\}\right]$$
$$= \mathbb{E}_{x\sim\mathcal{D}_X}\left[\mathbb{E}_{y\sim\mathcal{D}_{Y|X}}\left[\mathbf{1}\big\{y \ne \mathrm{sgn}(\widehat f(x) - 1/2)\big\} - \mathbf{1}\big\{y \ne \mathrm{sgn}(f^*(x) - 1/2)\big\}\right]\right]$$
$$= \mathbb{E}_{x\sim\mathcal{D}_X}\left[\mathbf{1}\big\{\mathrm{sgn}(\widehat f(x) - 1/2) \ne \mathrm{sgn}(f^*(x) - 1/2)\big\}\,|f^*(x) - 1/2|\right],$$
where $\widehat f$ is the hypothesis returned by Algorithm 2. Now, simply observe that $\mathbf{1}\{\mathrm{sgn}(\widehat f(x) - 1/2) \ne \mathrm{sgn}(f^*(x) - 1/2)\}\,|f^*(x) - 1/2|$ has the same form as the function $\varphi(\widehat f, x)$ in Appendix C, on which the uniform convergence result of Theorem 28 (with $\langle\widehat w, x\rangle$ replaced by $\widehat f(x) - 1/2$) applies, with $\widehat\epsilon(\delta)$ therein replaced by the bound on $R_T(P)$ borrowed from Corollary 22. This allows us to conclude that, with probability at least $1 - \delta$,
$$L(\widehat f) - L(f^*) = O\!\left(\frac{V\log^2 T}{T}\max\left\{\mathrm{Dim}(\mathcal{F}, P)^{\frac{\alpha+1}{\alpha+2}}\, T^{\frac{1}{\alpha+2}},\ \mathrm{Dim}(\mathcal{F}, P)\right\} + \frac{\log\frac{1}{\delta} + V\log(2eT)}{T}\right),$$
as claimed in Theorem 3 in the main body of the paper.

Lastly, we show that our notion of complexity is upper bounded by the eluder dimension at an appropriate scale.

Lemma 23 There exists a positive constant $C$ such that, for any $T > 0$,
$$\sup_{S\subseteq\mathcal{X}:\,|S|=T} \mathrm{Dim}(\mathcal{F}, S) \le C\,\inf_{\eta\ge 1}\left(\dim_E\!\big(\mathcal{F}, T^{-\frac{1}{2\eta}}\big)\log^2(T) + T^{1-\frac{1}{\eta}}\right).$$

Proof We only need to prove that, for any $\eta\ge 1$ and any fixed sequence $x_1,\ldots,x_T$,
$$\sum_{t=1}^T D^2(x_t; x_1,\ldots,x_{t-1}) \le C\left(\dim_E\!\big(\mathcal{F}, T^{-\frac{1}{2\eta}}\big)\log^2(T) + T^{1-\frac{1}{\eta}}\right).$$
Recall that
$$D^2(x_t; x_1,\ldots,x_{t-1}) = \sup_{f,g\in\mathcal{F}}\ \frac{(f(x_t)-g(x_t))^2}{\sum_{i=1}^{t-1}(f(x_i)-g(x_i))^2 + 1}\,.$$
Without loss of generality we can assume there exist $f_t$ and $g_t$ such that
$$\sup_{f,g\in\mathcal{F}}\ \frac{(f(x_t)-g(x_t))^2}{\sum_{i=1}^{t-1}(f(x_i)-g(x_i))^2 + 1} = \frac{(f_t(x_t)-g_t(x_t))^2}{\sum_{i=1}^{t-1}(f_t(x_i)-g_t(x_i))^2 + 1}\,.$$
Given $\eta\ge 1$, we split the $[0,1]$ interval into the $\lceil\log_2(T)\rceil + 1$ intervals
$$\big[0,\ 2^{-\lceil\log_2(T)\rceil/\eta}\big],\ \big(2^{-\lceil\log_2(T)\rceil/\eta},\ 2^{-(\lceil\log_2(T)\rceil-1)/\eta}\big],\ \ldots,\ \big(2^{-1/\eta},\ 1\big]\,.$$
Let $U_i = \{x_t\in S \mid (f_t(x_t)-g_t(x_t))^2 \in (2^{-i/\eta}, 2^{-(i-1)/\eta}]\}$ and $U_{\lceil\log_2(T)\rceil+1} = \{x_t\in S \mid (f_t(x_t)-g_t(x_t))^2 \le 2^{-\lceil\log_2(T)\rceil/\eta}\}$. Then
$$\sum_{t=1}^T D^2(x_t; x_1,\ldots,x_{t-1}) = \sum_{t=1}^T \frac{(f_t(x_t)-g_t(x_t))^2}{\sum_{i'=1}^{t-1}(f_t(x_{i'})-g_t(x_{i'}))^2 + 1} = \sum_{x_t\in U_{\lceil\log_2(T)\rceil+1}} \frac{(f_t(x_t)-g_t(x_t))^2}{\sum_{i'=1}^{t-1}(f_t(x_{i'})-g_t(x_{i'}))^2 + 1} + \sum_{i=1}^{\lceil\log_2(T)\rceil}\ \sum_{x_t\in U_i} \frac{(f_t(x_t)-g_t(x_t))^2}{\sum_{i'=1}^{t-1}(f_t(x_{i'})-g_t(x_{i'}))^2 + 1}$$
$$\le \sum_{i=1}^{\lceil\log_2(T)\rceil}\ \sum_{x_t\in U_i} \frac{(f_t(x_t)-g_t(x_t))^2}{\sum_{i'=1}^{t-1}(f_t(x_{i'})-g_t(x_{i'}))^2 + 1} + 2^{\frac{1}{\eta}}\, T^{1-\frac{1}{\eta}}\,.$$
Each $U_i$ we further decompose into $K_i = \lceil |U_i|/\dim_E(\mathcal{F}, 2^{-\frac{i}{2\eta}})\rceil + 1$ disjoint sets $\Omega_i^1,\ldots,\Omega_i^{K_i}$ in the following way. Initially all the $\Omega_i^j$ ($j = 1,\ldots,K_i$) are empty. We sort the $x_t$'s in $U_i$ according to their index and assign them one by one. Specifically, we assign $x_t$ to $\Omega_i^{j(t)}$, where $j(t)$ is the smallest index among $1,\ldots,K_i-1$ such that $x_t$ is $2^{-\frac{i}{2\eta}}$-independent of the elements in $\Omega_i^{j(t)}$; if $x_t$ is $2^{-\frac{i}{2\eta}}$-dependent on each of $\Omega_i^1,\ldots,\Omega_i^{K_i-1}$, we assign it to $\Omega_i^{K_i}$. As a consequence,
$$\frac{(f_t(x_t)-g_t(x_t))^2}{\sum_{i'=1}^{t-1}(f_t(x_{i'})-g_t(x_{i'}))^2 + 1} \le \frac{2^{-\frac{i-1}{\eta}}}{(j(t)-1)\,2^{-\frac{i}{\eta}} + 1} = \frac{2^{\frac{1}{\eta}}}{j(t)-1+2^{\frac{i}{\eta}}}\,,$$
where the inequality follows from $x_t$'s $2^{-\frac{i}{2\eta}}$-dependence on $\Omega_i^1,\ldots,\Omega_i^{j(t)-1}$. With the help of the above inequality,
$$\sum_{x_t\in U_i}\frac{(f_t(x_t)-g_t(x_t))^2}{\sum_{i'=1}^{t-1}(f_t(x_{i'})-g_t(x_{i'}))^2 + 1} \le \sum_{j=1}^{K_i-1}\frac{2^{\frac{1}{\eta}}\,|\Omega_i^j|}{j-1+2^{\frac{i}{\eta}}} + \frac{2\,|\Omega_i^{K_i}|}{K_i-1+2^{\frac{i}{\eta}}} \le 2\dim_E\!\big(\mathcal{F}, 2^{-\frac{i}{2\eta}}\big)\sum_{j=1}^{K_i-1}\frac{1}{j-1+2^{\frac{i}{\eta}}} + \frac{2\,|U_i|}{K_i-1} \le 2\dim_E\!\big(\mathcal{F}, 2^{-\frac{i}{2\eta}}\big)\big(\log(K_i)+1\big)\,,$$
where in the second inequality we used the fact that $|\Omega_i^j| \le \dim_E(\mathcal{F}, 2^{-\frac{i}{2\eta}})$ for $j = 1,\ldots,K_i-1$. Finally, combining all the equations and using the fact that $\dim_E(\mathcal{F}, \cdot)$ is monotone,
$$\sum_{t=1}^T D^2(x_t; x_1,\ldots,x_{t-1}) \le 2\dim_E\!\big(\mathcal{F}, T^{-\frac{1}{2\eta}}\big)\sum_{i=1}^{\lceil\log_2(T)\rceil}\big(\log(K_i)+1\big) + 2^{\frac{1}{\eta}}\, T^{1-\frac{1}{\eta}} \le C\left(\dim_E\!\big(\mathcal{F}, T^{-\frac{1}{2\eta}}\big)\log^2(T) + T^{1-\frac{1}{\eta}}\right)$$
as claimed.

Appendix C. Ancillary technical results

This section collects ancillary technical results that are used throughout the appendix. We first recall the following version of Hoeffding's bound.

Lemma 24 Let $a_1,\ldots,a_T$ be $T$ arbitrary real numbers, and $\{\sigma_1,\ldots,\sigma_T\}$ be $T$ i.i.d. Rademacher variables, each taking values $\pm 1$ with equal probability. Then, for any $\epsilon \ge 0$,
$$\mathbb{P}\left(\sum_{t=1}^T \sigma_t a_t \ge \epsilon\right) \le \exp\left(-\frac{\epsilon^2}{2\sum_{t=1}^T a_t^2}\right),$$
where the probability is with respect to $\{\sigma_1,\ldots,\sigma_T\}$.

Let us consider the linear case first.
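Before doing so, Lemma 24 is easy to sanity-check numerically. The sketch below (our illustration; the coefficients $a_t$ are arbitrary) compares a Monte Carlo estimate of the tail probability of a Rademacher sum against the bound $\exp\!\big(-\epsilon^2/(2\sum_t a_t^2)\big)$:

```python
import math
import random

random.seed(0)  # deterministic run

# Arbitrary fixed coefficients a_1, ..., a_T and a threshold eps.
a = [0.5, -1.2, 2.0, 0.7, -0.3, 1.1, -0.8, 1.5]
eps = 3.0
bound = math.exp(-eps ** 2 / (2 * sum(x * x for x in a)))  # Lemma 24 RHS

# Monte Carlo estimate of P( sum_t sigma_t * a_t >= eps ),
# with sigma_t i.i.d. Rademacher signs.
trials = 200_000
hits = sum(
    1
    for _ in range(trials)
    if sum(x if random.random() < 0.5 else -x for x in a) >= eps
)
empirical = hits / trials

print(f"empirical tail ~ {empirical:.4f}   Hoeffding bound = {bound:.4f}")
assert empirical <= bound  # the bound holds, with plenty of slack here
```

The Hoeffding bound is loose for small deviations but, crucially for the symmetrization argument below, it holds uniformly over the choice of the coefficients.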
Define the function $\varphi : \mathbb{R}^d \times P \to [0,1]$ as
$$\varphi(\widehat w, x) = \mathbf{1}\big\{\mathrm{sgn}\langle\widehat w, x\rangle \ne \mathrm{sgn}\langle w^*, x\rangle\big\}\,\rho(\langle w^*, x\rangle)\,,$$
where $\rho(\cdot)$ has range in $[0,1]$ and does not depend on $\widehat w$. We have the following standard covering result, which is a direct consequence of the Sauer–Shelah lemma (e.g., Sauer, 1972).

Lemma 25 Consider any given $S_T = \{x_1,\ldots,x_T\} \subseteq \mathbb{R}^d$, and let
$$\Phi(S_T) = \big\{[\varphi(\widehat w, x_1),\ldots,\varphi(\widehat w, x_T)] : \widehat w\in\mathbb{R}^d\big\}\,.$$
We have, when $T \ge d$, $|\Phi(S_T)| \le \left(\frac{eT}{d}\right)^d$.

The following result gives a bound on the $L_\infty$ covering number of a VC-subgraph class.

Lemma 26 Let $f\in\mathcal{F}$ be a $[0,1]$-valued function of $x\in\mathcal{X}$. Assume the binary function $\mathbf{1}\{z \le f(x)\}$ defined on $\mathbb{R}\times\mathcal{X}$ has VC-dimension $V$, i.e., $\mathcal{F}$ has VC-subgraph dimension $V$. Then, given $S_T = \{x_1,\ldots,x_T\}\subseteq\mathcal{X}$ with $T \ge V$,
$$N(\epsilon, \mathcal{F}, L_\infty(S_T)) \le \left(\frac{eT}{V\epsilon}\right)^V.$$

Proof As $N(\cdot, \mathcal{F}, L_\infty(S_T))$ is a decreasing function of $\epsilon$, we can assume that $1/\epsilon$ is an integer. Consider a discretization of $[0,1]$ by $m = 1/\epsilon$ points $U_1,\ldots,U_m$ so that, for any $U\in[0,1]$, $\min_i |U - U_i| \le \epsilon$. Considering the $mT$ points $\{(U_j, x_i) : i, j\}$, by Sauer's lemma the cardinality of $\{(\mathbf{1}\{U_j \le f(x_i)\})_{i,j} : f\in\mathcal{F}\}$ is at most $\left(\frac{eT}{V\epsilon}\right)^V$. Let $f_1,\ldots,f_N$ attain all the distinct values of $\{(\mathbf{1}\{U_j \le f(x_i)\})_{i,j} : f\in\mathcal{F}\}$. Given any $f\in\mathcal{F}$, there exists $f_k$ such that $\mathbf{1}\{U_j \le f(x_i)\} = \mathbf{1}\{U_j \le f_k(x_i)\}$ for all $i$ and $j$. It is easy to check that this implies $|f(x_i) - f_k(x_i)| \le \epsilon$ for all $x_i\in S_T$.

The next result follows from a standard symmetrization argument.

Lemma 27 Let $\mathcal{X} = \mathbb{R}^d$, let $S_T = (x_1,\ldots,x_T)$ be a sample drawn i.i.d. according to $\mathcal{D}_X$, and let $S'_T = (x'_1,\ldots,x'_T)$ be another i.i.d. sample drawn according to $\mathcal{D}_X$, with $T \ge d$. Then, with probability at least $1-\delta$,
$$\sum_{t=1}^T \varphi(\widehat w, x'_t) \le 3\sum_{t=1}^T \varphi(\widehat w, x_t) + 8\log(1/\delta) + 8d\log(2eT/d)\,,$$
uniformly over $\widehat w\in\mathbb{R}^d$.

Proof Let $\{\sigma_1,\ldots,\sigma_T\}$ be independent Rademacher variables as in Lemma 24.
We can write, for any $\epsilon \ge 0$,
$$\mathbb{P}\left(\exists\,\widehat w : \sum_{t=1}^T \varphi(\widehat w, x'_t) \ge 3\sum_{t=1}^T \varphi(\widehat w, x_t) + 2\epsilon\right) = \mathbb{P}\left(\exists\,\widehat w : \sum_{t=1}^T \big[\varphi(\widehat w, x'_t) - \varphi(\widehat w, x_t)\big] \ge \frac{1}{2}\sum_{t=1}^T \big[\varphi(\widehat w, x_t) + \varphi(\widehat w, x'_t)\big] + \epsilon\right)$$
$$= \mathbb{P}\left(\exists\,\widehat w : \sum_{t=1}^T \sigma_t\big[\varphi(\widehat w, x'_t) - \varphi(\widehat w, x_t)\big] \ge \frac{1}{2}\sum_{t=1}^T \big[\varphi(\widehat w, x_t) + \varphi(\widehat w, x'_t)\big] + \epsilon\right)$$
$$\le \mathbb{P}\left(\exists\,\widehat w : \sum_{t=1}^T \sigma_t\varphi(\widehat w, x'_t) \ge \frac{1}{2}\sum_{t=1}^T \varphi(\widehat w, x'_t) + \frac{\epsilon}{2}\right) + \mathbb{P}\left(\exists\,\widehat w : -\sum_{t=1}^T \sigma_t\varphi(\widehat w, x_t) \ge \frac{1}{2}\sum_{t=1}^T \varphi(\widehat w, x_t) + \frac{\epsilon}{2}\right)$$
$$= 2\,\mathbb{P}\left(\exists\,\widehat w : \sum_{t=1}^T \sigma_t\varphi(\widehat w, x_t) \ge \frac{1}{2}\sum_{t=1}^T \varphi(\widehat w, x_t) + \frac{\epsilon}{2}\right) \le 2\left(\frac{eT}{d}\right)^d \sup_{S_T,\,\widehat w\in\mathbb{R}^d} \mathbb{P}\left(\sum_{t=1}^T \sigma_t\varphi(\widehat w, x_t) \ge \frac{1}{2}\sum_{t=1}^T \varphi(\widehat w, x_t) + \frac{\epsilon}{2}\right)$$
(from the union bound and Lemma 25)
$$\le 2\left(\frac{eT}{d}\right)^d \sup_{S_T,\,\widehat w\in\mathbb{R}^d} \exp\left(-\frac{\big(\sum_{t=1}^T \varphi(\widehat w, x_t) + \epsilon\big)^2}{8\sum_{t=1}^T \varphi(\widehat w, x_t)^2}\right) \qquad \text{(from Lemma 24)}$$
$$\le 2\left(\frac{eT}{d}\right)^d \exp(-\epsilon/4)\,,$$
the last inequality deriving from the fact that, since $\varphi(\widehat w, x_t)\in[0,1]$,
$$\frac{\big(\sum_{t=1}^T \varphi(\widehat w, x_t) + \epsilon\big)^2}{\sum_{t=1}^T \varphi(\widehat w, x_t)^2} \ge \frac{\big(\sum_{t=1}^T \varphi(\widehat w, x_t) + \epsilon\big)^2}{\sum_{t=1}^T \varphi(\widehat w, x_t)} \ge 2\epsilon\,.$$
Take $\epsilon$ such that $\delta = 2\left(\frac{eT}{d}\right)^d \exp(-\epsilon/4)$ to obtain the claimed bound.

Theorem 28 With the same notation and assumptions as in Lemma 27, let $\widehat w\in\mathbb{R}^d$ be a function of $S_T$ such that
$$\frac{1}{T}\sum_{t=1}^T \varphi(\widehat w, x_t) \le \widehat\epsilon(\delta)$$
holds with probability at least $1-\delta$, for some $\widehat\epsilon(\delta)\in[0,1]$. Then, with probability at least $1-3\delta$,
$$\mathbb{E}_{x\sim\mathcal{D}}\,\varphi(\widehat w, x) \le 4\,\widehat\epsilon(\delta) + \frac{22\log\frac{1}{\delta} + 11\,d\log\frac{2eT}{d}}{T}\,.$$

Proof Use the multiplicative Chernoff bound
$$\mathbb{E}_{x\sim\mathcal{D}}\,\varphi(\widehat w, x) \le \frac{4}{3T}\sum_{t=1}^T \varphi(\widehat w, x'_t) + \frac{32}{3T}\log(1/\delta)\,,$$
and then apply Lemma 27 to further bound the right-hand side.

To control the noise terms, which are 1-subgaussian random variables, we provide the following lemma, a direct implication of the Chernoff bound.

Lemma 29 Suppose $\xi$ is a $\sigma$-subgaussian random variable. Then, for any $\delta > 0$,
$$\mathbb{P}\left(|\xi| \ge \sqrt{2\sigma^2\log(2/\delta)}\right) \le \delta\,.$$

Let us now consider the nonlinear case analyzed in Section B. Similarly to the linear case, define the function $\varphi : \mathcal{F}\times P\to[0,1]$ as
$$\varphi(\widehat f, x) = \mathbf{1}\big\{\mathrm{sgn}(\widehat f(x) - 1/2) \ne \mathrm{sgn}(f^*(x) - 1/2)\big\}\,\rho(f^*(x))\,,$$
where $\rho(\cdot)$ has range in $[0,1]$. As for the counterparts to Theorem 28, simply observe that Lemma 25 and Lemma 27 hold in the more general case with $d$ replaced by $V$, where $\mathcal{F}$ is a function class having finite VC-subgraph dimension $V$. In particular, given any $S_T = \{x_1,\ldots,x_T\}\subseteq\mathcal{X}$, define
$$\Phi(S_T) = \big\{[\varphi(\widehat f, x_1),\ldots,\varphi(\widehat f, x_T)] : \widehat f\in\mathcal{F}\big\}\,.$$
We then have, when $T \ge V'$,
$$|\Phi(S_T)| \le \left(\frac{eT}{V'}\right)^{V'},$$
where $V'$ is the VC dimension of the binary-valued class $\big\{\mathrm{sgn}(\widehat f(x) - 1/2) : \widehat f\in\mathcal{F}\big\}$. Now, it is easy to see that $V' \le V$, so that we also have
$$|\Phi(S_T)| \le \left(\frac{eT}{V'}\right)^{V'} \le \left(\frac{eT}{V}\right)^{V}.$$
With this modification, Theorem 28 holds with the factor $d\log\frac{2eT}{d}$ therein replaced by $V\log\frac{2eT}{V}$, once we also replace $\langle\widehat w, x\rangle$ by $\widehat f(x) - 1/2$.

References

Y. Abbasi-Yadkori, D. Pál, and C. Szepesvári. Improved algorithms for linear stochastic bandits. In Advances in Neural Information Processing Systems 24: 25th Annual Conference on Neural Information Processing Systems 2011, pages 2312–2320, 2011.

K. Amin, C. Cortes, G. De Salvo, and A. Rostamizadeh. Understanding the effects of batching in online active learning. In Proc. AISTATS, 2020.

J. Ash, C. Zhang, A. Krishnamurthy, J. Langford, and A. Agarwal. Deep batch active learning by diverse, uncertain gradient lower bounds. In International Conference on Learning Representations (ICLR), 2020.

K. S. Azoury and M. K. Warmuth. Relative loss bounds for on-line density estimation with the exponential family of distributions. Machine Learning, 43:211–246, 2001.

M.-F. Balcan, A. Broder, and T. Zhang. Margin based active learning. In 20th Annual Conference on Learning Theory (COLT), 2007.

N. Balcan and P. Long. Active and passive learning of linear separators under log-concave distributions. In Proceedings of the 26th Annual Conference on Learning Theory (COLT), 2013.

Z. Borsos, M. Mutny, M. Tagliasacchi, and A. Krause. Data summarization via bilevel optimization, 2021. URL https://arxiv.org/abs/2109.12534.

N. Brukhim, M. Dudik, A. Pacchiano, and R. Schapire. A unified model and dimension for interactive estimation, 2023. URL https://arxiv.org/abs/2306.06184.

R. Camilleri, J. Katz-Samuels, and K. Jamieson. High-dimensional experimental design and kernel bandits. In Proc. 38th International Conference on Machine Learning, PMLR 139, 2021a.

R. Camilleri, Z. Xiong, M. Fazel, L. Jain, and K. Jamieson. Selective sampling for online best-arm identification. In Advances in Neural Information Processing Systems, 2021b.

R. Castro and R. Nowak. Minimax bounds for active learning. IEEE Transactions on Information Theory, 54(5):2339–2353, 2008.

N. Cesa-Bianchi, A. Conconi, and C. Gentile. A second-order perceptron algorithm. SIAM J. Comput., 34(3):640–668, 2005.

K. Chaudhuri, S. M. Kakade, P. Netrapalli, and S. Sanghavi. Convergence rates of active learning for maximum likelihood estimation. Advances in Neural Information Processing Systems, 28, 2015.

Y. Chen and A. Krause. Near-optimal batch mode active learning and adaptive submodular optimization. In Proceedings of the 30th International Conference on Machine Learning, volume 28(1), pages 160–168. PMLR, 2013.

Y. Chen, S. H. Hassani, A. Karbasi, and A. Krause. Sequential information maximization: When is greedy near-optimal? In Proc. 28th Conference on Learning Theory, PMLR 40, pages 338–363, 2015.

Y. Chen, S. H. Hassani, and A. Krause. Near-optimal Bayesian active learning with correlated and noisy tests. In Proc. 20th International Conference on Artificial Intelligence and Statistics, 2017.

G. Citovsky, G. De Salvo, C. Gentile, L. Karydas, A. Rajagopalan, A. Rostamizadeh, and S. Kumar. Batch active learning at scale. In 35th Conference on Neural Information Processing Systems (NeurIPS), 2021.

C. Dann, M. Mohri, T. Zhang, and J. Zimmert. A provably efficient model-free posterior sampling method for episodic reinforcement learning. In 35th Conference on Neural Information Processing Systems, 2021.

S. Dasgupta. Analysis of a greedy active learning strategy. In Advances in Neural Information Processing Systems, volume 17, 2004.

S. Dasgupta. Coarse sample complexity bounds for active learning. In Advances in Neural Information Processing Systems, pages 235–242, 2005.

O. Dekel, C. Gentile, and K. Sridharan. Selective sampling and active learning from single and multiple teachers. J. Mach. Learn. Res., 13(1), 2012.

K. Dong, J. Yang, and T. Ma. Provable model-based nonlinear bandit and reinforcement learning: Shelve optimism, embrace virtual curvature, 2021. URL https://arxiv.org/abs/2102.04168.

H. Esfandiari, A. Karbasi, and V. Mirrokni. Adaptivity in adaptive submodularity. In Proc. 34th Annual Conference on Learning Theory, volume 134, pages 1–24. PMLR, 2021.

T. Fiez, L. Jain, K. G. Jamieson, and L. Ratliff. Sequential experimental design for transductive linear bandits. In Advances in Neural Information Processing Systems, volume 32. Curran Associates, Inc., 2019.

D. Foster, A. Rakhlin, D. Simchi-Levi, and Y. Xu. Instance-dependent complexity of contextual bandits and reinforcement learning: A disagreement-based perspective, 2020. URL https://arxiv.org/abs/2010.03104.

C. Gentile, Z. Wang, and T. Zhang. Fast rates in pool-based batch active learning, 2022a. URL https://arxiv.org/abs/2202.05448.

C. Gentile, Z. Wang, and T. Zhang. Achieving minimax rates in pool-based batch active learning. In Proceedings of the 39th International Conference on Machine Learning. PMLR, 2022b.

A. Ghorbani, J. Zou, and A. Esteva. Data Shapley valuation for efficient batch active learning, 2021. URL https://arxiv.org/abs/2104.08312v1.

D. Golovin and A. Krause. Adaptive submodularity: A new approach to active learning and stochastic optimization, 2017. URL https://arxiv.org/abs/1003.3967.

Q. Gu, T. Zhang, J. Han, and C. Ding. Selective labeling via error bound minimization. Advances in Neural Information Processing Systems, 25, 2012.

Q. Gu, T. Zhang, and J. Han. Batch-mode active learning via error bound minimization. In Proceedings of the Thirtieth Conference on Uncertainty in Artificial Intelligence (UAI), pages 300–309, 2014.

S. Hanneke. Adaptive rates of convergence in active learning. In Proc. of the 22nd Annual Conference on Learning Theory, 2009.

S. Hanneke. Theory of disagreement-based active learning. Foundations and Trends in Machine Learning, 7(2–3):131–309, 2014.

S. Hoi, R. Jin, J. Zhu, and M. R. Lyu. Batch mode active learning and its application to medical image classification. In 23rd International Conference on Machine Learning (ICML), 2006.

K. Huang, S. Kakade, J. Lee, and Q. Lei. A short note on the relationship of information gain and eluder dimension, 2021. URL https://arxiv.org/abs/2107.02377.

J. Katz-Samuels, J. Zhang, L. Jain, and K. Jamieson. Improved algorithms for agnostic pool-based active classification. In Proceedings of the 38th International Conference on Machine Learning, volume 139, pages 5334–5344. PMLR, 2021.

K. Killamsetty, D. Sivasubramanian, G. Ramakrishnan, and R. Iyer. GLISTER: Generalization based data subset selection for efficient and robust learning, 2021. URL https://arxiv.org/abs/2012.10630.

K. Kim, K. Park, D. Kim, and S. Chun. Task-aware variational adversarial active learning, 2020. URL https://arxiv.org/abs/2002.04709v2.

A. Kirsch, J. van Amersfoort, and Y. Gal. BatchBALD: Efficient and diverse batch acquisition for deep Bayesian active learning, 2019. URL https://arxiv.org/abs/1906.08158v2.

A. Kirsch, S. Farquhar, and Y. Gal. A simple baseline for batch active learning with stochastic acquisition functions, 2021. URL https://arxiv.org/abs/2106.12059.

V. Koltchinskii. Rademacher complexities and bounding the excess risk of active learning. Journal of Machine Learning Research, 11:2457–2485, 2010.

S. Kothawade, N. Beck, K. Killamsetty, and R. Iyer. SIMILAR: Submodular information measures based active learning in realistic scenarios. In Advances in Neural Information Processing Systems, 2021.

T. Lattimore and C. Szepesvári. Bandit Algorithms. Cambridge University Press, 2020.

E. Mammen and A. Tsybakov. Smooth discrimination analysis. The Annals of Statistics, 27(6):1808–1829, 1999.

S. Mukherjee, A. S. Tripathy, and R. Nowak. Chernoff sampling for active testing and extension to active regression. In International Conference on Artificial Intelligence and Statistics, pages 7384–7432. PMLR, 2022.

R. D. Nowak. The geometry of generalized binary search. IEEE Transactions on Information Theory, 57(12):7893–7906, 2011.

I. Osband and B. Van Roy. Model-based reinforcement learning and the eluder dimension. In Advances in Neural Information Processing Systems, volume 27, 2014.

L. Pronzato and A. Pázman. Design of experiments in nonlinear models. Lecture Notes in Statistics, 212(1), 2013.

D. Russo and B. Van Roy. Eluder dimension and the sample complexity of optimistic exploration. In Advances in Neural Information Processing Systems, volume 26, 2013.

N. Sauer. On the density of families of sets. Journal of Combinatorial Theory, Series A, 13:145–147, 1972.

A. Sekhari, K. Sridharan, W. Sun, and R. Wu. Selective sampling and imitation learning via online regression. In Advances in Neural Information Processing Systems, 2023.

O. Sener and S. Savarese. Active learning for convolutional neural networks: A core-set approach. In International Conference on Learning Representations, 2018. URL https://openreview.net/forum?id=H1aIuk-RW.

C. Shui, C. Zhou, F. Gagne, and B. Wang. Deep active learning: Unified and principled method for query and training. In 23rd International Conference on Artificial Intelligence and Statistics (AISTATS), 2020.

C. Tosh and S. Dasgupta. Diameter-based active learning. In 34th International Conference on Machine Learning (ICML), 2017.

Z. Wang, P. Awasthi, C. Dann, A. Sekhari, and C. Gentile. Neural active learning with performance guarantees. In Advances in Neural Information Processing Systems 34, 2021.

K. Wei, R. Iyer, and J. Bilmes. Submodularity in data subset selection and active learning. In International Conference on Machine Learning, pages 1954–1963. PMLR, 2015.

C. Ye, W. Xiong, Q. Gu, and T. Zhang. Corruption-robust algorithms with uncertainty weighting for nonlinear contextual bandits and Markov decision processes. In Proceedings of the 40th International Conference on Machine Learning, volume 202, pages 39834–39863. PMLR, 2023.

C. Zhang and Y. Li. Improved algorithms for efficient active learning halfspaces with Massart and Tsybakov noise. In 34th Annual Conference on Learning Theory (COLT), 2021.

F. Zhdanov. Diverse mini-batch active learning, 2019. URL https://arxiv.org/abs/1901.05954v1.

D.-X. Zhou. The covering number in learning theory. Journal of Complexity, 18:739–767, 2002.

Y. Zhu and R. Nowak. Efficient active learning with abstention. In Advances in Neural Information Processing Systems, volume 35, pages 35379–35391. Curran Associates, Inc., 2022.