# Neural Active Learning with Performance Guarantees

Zhilei Wang (New York University, New York, NY 10012) zhileiwang92@gmail.com
Pranjal Awasthi (Google Research, New York, NY 10011) pranjalawasthi@google.com
Christoph Dann (Google Research, New York, NY 10011) chrisdann@google.com
Ayush Sekhari (Cornell University, Ithaca, NY 14850) ayush.sekhari@gmail.com
Claudio Gentile (Google Research, New York, NY 10011) cgentile@google.com

35th Conference on Neural Information Processing Systems (NeurIPS 2021).

## Abstract

We investigate the problem of active learning in the streaming setting in non-parametric regimes, where the labels are stochastically generated from a class of functions on which we make no assumptions whatsoever. We rely on recently proposed Neural Tangent Kernel (NTK) approximation tools to construct a suitable neural embedding that determines both the feature space the algorithm operates on and the learned model computed atop it. Since the shape of the label-requesting threshold is tightly related to the complexity of the function to be learned, which is a priori unknown, we also derive a version of the algorithm which is agnostic to any prior knowledge. This algorithm relies on a regret balancing scheme to solve the resulting online model selection problem, and is computationally efficient. We prove joint guarantees on the cumulative regret and number of requested labels which depend on the complexity of the labeling function at hand. In the linear case, these guarantees recover known minimax results on the generalization error as a function of the label complexity in a standard statistical learning setting.

## 1 Introduction

Supervised learning is a fundamental paradigm in machine learning and is at the core of modern breakthroughs in deep learning [29]. A machine learning system trained via supervised learning requires access to labeled data collected via recruiting human experts, crowdsourcing, or running expensive experiments. Furthermore, as the complexity of current deep learning architectures grows, their requirement for labeled data increases significantly. The area of active learning aims to reduce this data requirement by studying the design of algorithms that can learn and generalize from a small, carefully chosen subset of the training data [13, 40].

The two common formulations of active learning are pool-based active learning and sequential (or streaming) active learning. In the pool-based setting [30], the learning algorithm has access to a large unlabeled set of data points, and the algorithm can ask for a subset of the data to be labeled. In contrast, in the sequential setting, data points arrive in a streaming manner, either adversarially or drawn i.i.d. from a distribution, and the algorithm must decide whether or not to query the label of a given point [14].

From a theoretical perspective, active learning has typically been studied under models inspired by the probably approximately correct (PAC) model of learning [41]. Here one assumes that there is a pre-specified class $H$ of functions such that the target function mapping examples to their labels either lies in $H$ or has a good approximation inside the class. Given access to unlabeled samples generated i.i.d. from the distribution, the goal is to query for a small number of labels and produce a hypothesis of low error.
In the parametric setting, namely, when the class of functions $H$ has finite VC-dimension (or finite disagreement coefficient) [21], the rate of convergence of active learning, i.e., the rate of decay of the regret as a function of the number of label queries $N$, is of the form $\nu N^{-1/2} + e^{-N}$, where $\nu$ is the population loss of the best function in the class $H$. This simple finding shows that active learning behaves like passive learning when $\nu > 0$, while very fast rates can only be achieved under low-noise ($\nu \approx 0$) conditions. This has been worked out in, e.g., [19, 15, 5, 4, 6, 38]. While the parametric setting comes with methodological advantages, the above shows that in order to unleash the true power of active learning, two properties are desirable: (1) a better interplay between the input distribution and the label noise and, (2) a departure from the parametric setting, leading us to consider wider classes of functions (so as to reduce the population loss $\nu$ to close to 0). To address the above, there has also been considerable theoretical work in recent years on non-parametric active learning [10, 33, 31]. However, these approaches suffer from the curse of dimensionality and do not lead to computationally efficient algorithms. A popular approach that has been explored empirically in recent works is to use Deep Neural Networks (DNNs) to perform active learning (e.g., [37, 26, 39, 3, 44]). While these works empirically demonstrate the power of the DNN-based approach to active learning, they do not come with provable guarantees. The above discussion raises the following question: Is provable and computationally efficient active learning possible in non-parametric settings?

We answer the above question in the affirmative by providing the first, to the best of our knowledge, computationally efficient algorithm for active learning based on Deep Neural Networks. Similar to non-parametric active learning, we avoid fixing a function class a priori. However, in order to achieve computational efficiency, we instead propose to use over-parameterized DNNs, where the amount of over-parameterization depends on the input data at hand. We work in the sequential setting, and propose a simple active learning algorithm that forms an uncertainty estimate for the current data point based on the output of a DNN, followed by a gradient descent step to update the network parameters if the data point is queried. We show that under standard low-noise assumptions [32] our proposed algorithm achieves fast rates of convergence. In order to analyze our algorithm, we use tools from the theory of Neural Tangent Kernel (NTK) approximation [24, 2, 18] that allow us to analyze the dynamics of gradient descent by considering a linearization of the network around random initialization. Since we study the non-parametric regime, the convergence rates of our algorithm depend on a data-dependent complexity term that is expected to be small in practical settings, but could be very large in worst-case scenarios. Furthermore, the algorithm itself needs an estimate of this complexity term in order to form accurate uncertainty estimates. We show that one can automatically adapt to the magnitude of the unknown complexity term by designing a novel model selection algorithm inspired by recent works on model selection in multi-armed bandit settings [36, 35].
Yet, several new insights are needed to ensure that the model selection algorithm can simultaneously achieve low generalization error without spending a significant amount of budget on label queries.

## 2 Preliminaries and Notation

Let $X$ denote the input space, $Y$ the output space, and $D$ an unknown distribution over $X \times Y$. We denote the corresponding random variables by $x$ and $y$. We also denote by $D_X$ the marginal distribution of $D$ over $X$, and by $D_{Y|x_0}$ the conditional distribution of the random variable $y$ given $x = x_0$. Moreover, given a function $f$ (sometimes called a hypothesis or a model) mapping $X$ to $Y$, the conditional population loss (often referred to as conditional risk) of $f$ is denoted by $L(f \mid x)$, and defined as $L(f \mid x) = \mathbb{E}_{y \sim D_{Y|x}}[\ell(f(x), y) \mid x]$, where $\ell \colon Y \times Y \to [0, 1]$ is a loss function. For ease of presentation, we restrict to a binary classification setting with 0-1 loss, whence $Y = \{-1, +1\}$ and $\ell(a, y) = \mathbb{1}\{a \neq y\} \in \{0, 1\}$, $\mathbb{1}\{\cdot\}$ being the indicator function of the predicate at argument. When clear from the surrounding context, we will omit subscripts like "$y \sim D_{Y|x}$" from probabilities and expectations.

We investigate a non-parametric setting of active learning where the conditional distribution of $y$ given $x$ is defined through an unknown function $h \colon X^2 \to [0, 1]$ such that

$$P(y = 1 \mid x) = h((x, \mathbf{0}))\,, \qquad P(y = -1 \mid x) = h((\mathbf{0}, x))\,, \qquad (1)$$

where $\mathbf{0} \in X$, $(x_1, x_2)$ denotes the concatenation (or pairing) of the two instances $x_1$ and $x_2$ (so that $(x, \mathbf{0})$ and $(\mathbf{0}, x)$ are in $X^2$) and, for all $x \in X$, we have $h((x, \mathbf{0})) + h((\mathbf{0}, x)) = 1$. We make no explicit assumptions on $h$, other than its well-behavedness w.r.t. the data $\{x_t\}_{t=1}^T$ at hand through the formalism of Neural Tangent Kernels (NTK), see below. As a simple example, in the linear case, $X$ is the $d$-dimensional unit ball, $h(\cdot, \cdot)$ is parametrized by an unknown unit vector $\theta \in \mathbb{R}^d$, and $h((x_1, x_2)) = \frac{1 + \langle(\theta, -\theta), (x_1, x_2)\rangle}{2}$, so that $h((x, \mathbf{0})) = \frac{1 + \langle\theta, x\rangle}{2}$ and $h((\mathbf{0}, x)) = \frac{1 - \langle\theta, x\rangle}{2}$, where $\langle\cdot, \cdot\rangle$ is the usual dot product in $\mathbb{R}^d$.

We consider a streaming setting of active learning where, at each round $t \in [T] = \{1, \ldots, T\}$, a pair $(x_t, y_t) \in X \times Y$ is drawn i.i.d. from $D$. The learning algorithm receives as input only $x_t$, and is compelled both to issue a prediction $a_t$ for $y_t$ and, at the same time, to decide on the fly whether or not to observe $y_t$. These decisions can only be based on past observations. Let $\mathbb{E}_t$ denote the conditional expectation $\mathbb{E}[\,\cdot \mid (x_1, y_1), \ldots, (x_{t-1}, y_{t-1}), x_t]$, and introduce the shorthand

$$x_{t,a} = \begin{cases} (x_t, \mathbf{0}) & \text{if } a = 1\,,\\ (\mathbf{0}, x_t) & \text{if } a = -1\,. \end{cases}$$

Notice that with this notation $\mathbb{E}[\ell(a, y_t) \mid x_t] = 1 - h(x_{t,a})$, for all $a \in Y$. We quantify the accuracy of the learner's predictions through its (pseudo-)regret, whose per-round value is

$$\mathbb{E}_t[\ell(a_t, y_t) \mid x_t] - \mathbb{E}_t[\ell(a^*_t, y_t) \mid x_t] = h(x_{t,a^*_t}) - h(x_{t,a_t})\,,$$

where $a^*_t$ is the Bayes-optimal classifier on instance $x_t$, that is, $a^*_t = \arg\max_{a \in Y} h(x_{t,a})$; the cumulative regret $R_T$ is the sum of these quantities over the $T$ rounds. Additionally, we are interested in bounding the number of labels $N_T$ the algorithm decides to request. Our goal is to simultaneously bound $R_T$ and $N_T$ with high probability over the generation of the sample $\{(x_t, y_t)\}_{t=1,\ldots,T}$.

Throughout this work, we consider the following common low-noise condition on the marginal distribution $D_X$ (Mammen-Tsybakov low-noise condition [32]): there exist absolute constants $c > 0$ and $\alpha \geq 0$ such that, for all $\epsilon \in (0, 1/2)$,

$$P\big(\,|h((x, \mathbf{0})) - \tfrac{1}{2}| < \epsilon\,\big) \leq c\,\epsilon^{\alpha}\,.$$

In particular, $\alpha = \infty$ gives the so-called hard-margin condition $P(|h((x, \mathbf{0})) - \tfrac{1}{2}| < \epsilon) = 0$, while, at the opposite extreme, exponent $\alpha = 0$ (and $c = 1$) results in no assumptions whatsoever on $D_X$.
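To make the protocol concrete, here is a minimal Python simulation of the setting above, instantiated with the linear-case example of $h$. The helper names (`augment`, `h`) and the constant predictor standing in for a learner are ours, purely for illustration; they are not part of the paper's algorithms.

```python
import numpy as np

rng = np.random.default_rng(0)
d, T = 5, 1000

# Unknown unit vector parametrizing h in the linear example of Section 2.
theta = rng.standard_normal(d)
theta /= np.linalg.norm(theta)

def augment(x, a):
    """Build x_{t,a} in X^2: (x, 0) for a = +1, (0, x) for a = -1."""
    zero = np.zeros_like(x)
    return np.concatenate([x, zero]) if a == 1 else np.concatenate([zero, x])

def h(x_aug):
    """Linear-case h: h((x1, x2)) = (1 + <(theta, -theta), (x1, x2)>) / 2."""
    w = np.concatenate([theta, -theta])
    return 0.5 * (1.0 + w @ x_aug)

regret = 0.0
for t in range(T):
    x = rng.standard_normal(d)
    x /= np.linalg.norm(x)                       # instances on the unit sphere
    p_plus = h(augment(x, 1))                    # P(y = +1 | x), per Eq. (1)
    y = 1 if rng.random() < p_plus else -1
    a_star = max((1, -1), key=lambda a: h(augment(x, a)))  # Bayes-optimal action
    a_t = 1                                      # placeholder for a learner's prediction
    regret += h(augment(x, a_star)) - h(augment(x, a_t))   # per-round (pseudo-)regret

print(f"cumulative regret of the constant predictor: {regret:.2f}")
```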
For simplicity, we shall assume throughout that the above low-noise condition holds with¹ $c = 1$.

¹A more general formulation requires the above to hold only for $\epsilon \leq \epsilon_0$, where $\epsilon_0 \in (0, 1/2)$ is a third parameter. We shall omit this extra parameter from our presentation.

Our techniques are inspired by the recent work [45], from which we also borrow some notation. We learn the class of functions $\{h\}$ by means of fully connected neural networks

$$f(x, \theta) = \sqrt{m}\, W_n\, \sigma\big(W_{n-1}\, \sigma(\cdots \sigma(W_1 x))\big)\,,$$

where $\sigma$ is the ReLU activation function $\sigma(x) = \max\{0, x\}$, $m$ is the width of the network, and $n \geq 2$ is its depth. In the above, $\theta \in \mathbb{R}^p$ collectively denotes the set of weights $\{W_1, W_2, \ldots, W_n\}$ of the network, where $p = m + 2md + m^2(n-2)$ is their number, and the input $x$ at training time should be thought of as some $x_{t,a} \in X^2$.

With any depth-$n$ network and data points $\{x_{t,a}\}_{t=1,\ldots,T,\, a=\pm 1}$ we associate a depth-$n$ NTK matrix as follows [24]. First, rename $\{x_{t,a}\}_{t=1,\ldots,T,\, a=\pm 1}$ as $\{x^{(i)}\}_{i=1,\ldots,2T}$. Then define the $2T \times 2T$ matrices

$$\widetilde H^{(1)} = \big[\widetilde H^{(1)}_{i,j}\big]_{i,j=1}^{2T}\,, \qquad \Sigma^{(1)} = \big[\Sigma^{(1)}_{i,j}\big]_{i,j=1}^{2T}\,,$$

with $\widetilde H^{(1)}_{i,j} = \Sigma^{(1)}_{i,j} = \langle x^{(i)}, x^{(j)}\rangle$ and then, for any $k < n$ and $i, j = 1, \ldots, 2T$, introduce the bivariate covariance matrix

$$A^{(k)}_{i,j} = \begin{pmatrix}\Sigma^{(k)}_{i,i} & \Sigma^{(k)}_{i,j}\\ \Sigma^{(k)}_{i,j} & \Sigma^{(k)}_{j,j}\end{pmatrix}\,,$$

by which we recursively define

$$\Sigma^{(k+1)}_{i,j} = 2\,\mathbb{E}_{(u,v)\sim N(0, A^{(k)}_{i,j})}[\sigma(u)\,\sigma(v)]$$

and

$$\widetilde H^{(k+1)}_{i,j} = 2\,\widetilde H^{(k)}_{i,j}\,\mathbb{E}_{(u,v)\sim N(0, A^{(k)}_{i,j})}\big[\mathbb{1}\{u \geq 0\}\,\mathbb{1}\{v \geq 0\}\big] + \Sigma^{(k+1)}_{i,j}\,.$$

The $2T \times 2T$-dimensional matrix $H = \frac{1}{2}\big(\widetilde H^{(n)} + \Sigma^{(n)}\big)$ is called the Neural Tangent Kernel (NTK) matrix of depth $n$ (and infinite width) over the set of points $\{x_{t,a}\}_{t=1,\ldots,T,\, a=\pm 1}$. The reader is referred to [24] for more details on NTKs. In order to avoid heavy notation, we assume $\|x_t\| = 1$ for all $t$. Matrix $H$ is positive semi-definite by construction but, as is customary in the NTK literature (e.g., [2, 9, 17]), we assume it is actually positive definite (hence invertible), with smallest eigenvalue $\lambda_0 > 0$. This is a mild assumption, which can be shown to hold if no two vectors $x_t$ are aligned with each other.

We measure the complexity of the function $h$ at hand in a way similar to [45]. Using the same rearrangement of $\{x_{t,a}\}_{t=1,\ldots,T,\, a=\pm 1}$ into $\{x^{(i)}\}_{i=1,\ldots,2T}$ as above, let $\mathbf{h}$ be the $2T$-dimensional (column) vector whose $i$-th component is $h(x^{(i)})$. Then we define the complexity $S_{T,n}(h)$ of $h$ over $\{x_{t,a}\}_{t=1,\ldots,T,\, a=\pm 1}$ w.r.t. an NTK of depth $n$ as

$$S_{T,n}(h) = \sqrt{\mathbf{h}^\top H^{-1}\mathbf{h}}\,.$$

Notice that this notion of (data-dependent) complexity is consistent with the theoretical findings of [2], who showed that for a two-layer network the bound on the generalization performance is dominated by $\sqrt{\mathbf{y}^\top H^{-1}\mathbf{y}}$, where $\mathbf{y}$ is the vector of labels. Hence, if $\mathbf{y}$ is aligned with the top eigenvectors of $H$, the learning problem becomes easier. In our case, the vector $\mathbf{h}$ plays the role of the vector $\mathbf{y}$. Also observe that $S^2_{T,n}(h)$ can in general be as big as linear in $T$ (in which case learning becomes hopeless with our machinery). In the special case where $h$ belongs to the RKHS induced by the NTK, one can upper bound $S_{T,n}(h)$ by the norm of $h$ in the RKHS.

The complexity term $S_{T,n}(h)$ is typically unknown to the learning algorithm, and it plays a central role in both regret and label complexity guarantees. Hence the algorithm needs to learn this value as well during its online functioning. Apparently, this aspect of the problem has been completely overlooked by [45] (as well as by earlier references on contextual bandits in RKHSs, like [12]), where a (tight) upper bound on $S_{T,n}(h)$ is assumed to be available in advance.
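Since for ReLU activations both Gaussian expectations in the recursion above admit closed forms (the arc-cosine kernel identities), the NTK matrix $H$ and the complexity $S_{T,n}(h)$ can be computed directly. The following is a minimal NumPy sketch of that computation under those closed forms; the function names are ours, and `h_vec` below is a stand-in for the (unknown) vector $\mathbf{h}$.

```python
import numpy as np

def ntk_matrix(X, depth):
    """Depth-`depth` infinite-width NTK matrix of Section 2, for rows of X
    (the 2T augmented points x^(i), assumed unit-norm).

    For ReLU, with correlation rho = Sigma_ij / sqrt(Sigma_ii * Sigma_jj):
      2*E[sigma(u)sigma(v)] = sqrt(Sigma_ii*Sigma_jj)
                              * (sqrt(1-rho^2) + rho*(pi - arccos(rho))) / pi
      2*E[1{u>=0} 1{v>=0}]  = (pi - arccos(rho)) / pi .
    """
    Sigma = X @ X.T                      # Sigma^(1)
    H_tilde = Sigma.copy()               # H~^(1)
    for _ in range(depth - 1):
        diag = np.sqrt(np.outer(np.diag(Sigma), np.diag(Sigma)))
        rho = np.clip(Sigma / diag, -1.0, 1.0)
        Sigma_dot = (np.pi - np.arccos(rho)) / np.pi   # 2*E[1{u>=0}1{v>=0}]
        Sigma = diag * (np.sqrt(1 - rho**2) + rho * (np.pi - np.arccos(rho))) / np.pi
        H_tilde = H_tilde * Sigma_dot + Sigma          # H~^(k+1)
    return 0.5 * (H_tilde + Sigma)       # H = (H~^(n) + Sigma^(n)) / 2

def complexity(h_vec, H):
    """S_{T,n}(h) = sqrt(h^T H^{-1} h)."""
    return float(np.sqrt(h_vec @ np.linalg.solve(H, h_vec)))

# toy usage: 2T = 8 unit-norm points in R^4, depth-3 NTK
rng = np.random.default_rng(1)
X = rng.standard_normal((8, 4))
X /= np.linalg.norm(X, axis=1, keepdims=True)
H = ntk_matrix(X, depth=3)
h_vec = rng.uniform(0, 1, size=8)        # stand-in for the vector h
print(complexity(h_vec, H))
```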
We will cast the above as a model selection problem in active learning, where we adapt, and largely generalize to active learning, the regret balancing technique from [36, 35]. In what follows, we use the shorthand $g(x; \theta) = \nabla_\theta f(x, \theta)$ and, for a vector $g \in \mathbb{R}^p$ and a matrix $Z \in \mathbb{R}^{p \times p}$, we often write $\sqrt{g^\top Z g}$ as $\|g\|_Z$, so that $S_{T,n}(h) = \|\mathbf{h}\|_{H^{-1}}$.

### 2.1 Related work

The main effort in theoretical works on active learning is to obtain rates of convergence of the population loss of the hypothesis returned by the algorithm as a function of the number $N$ of requested labels. We emphasize that most of these works, which heavily rely on approximation theory, are not readily comparable to ours, since our goal here is not to approximate $h$ through a DNN on the entire input domain, but only on the data at hand. As we recalled in the introduction, in the parametric setting the convergence rates of the regret are of the form $\nu N^{-1/2} + e^{-N}$, where $\nu$ is the population loss of the best function in the class $H$. Hence, active learning rates behave like the passive learning rate $N^{-1/2}$ when $\nu > 0$, while fast rates can only be achieved under very low-noise ($\nu \approx 0$) conditions. In this respect, relevant references include [20, 27] where, e.g., in the realizable case (i.e., when the Bayes-optimal classifier lies in $H$), minimax active learning rates of the form $N^{-\frac{\alpha+1}{2}}$ are shown to hold for adaptive algorithms that do not know beforehand the noise exponent $\alpha$. In non-parametric settings, a comprehensive set of results has been obtained by [31], which builds on and significantly improves over earlier results from [33]. Both papers work under smoothness (Hölder continuity/smoothness) assumptions. In addition, [33] requires $D_X$ to be (quasi-)uniform on $X = [0,1]^d$. In [31] the minimax active learning rate $N^{-\frac{\beta(\alpha+1)}{2\beta+d}}$ is shown to hold for $\beta$-Hölder classes, where the exponent $\beta$ plays the role of the complexity of the class of functions to learn, and $d$ is the input dimension. This algorithm is adaptive to the complexity parameter $\beta$, and is therefore performing a kind of model selection. Notice that minimax rates in the parametric regime are recovered by letting $\beta \to \infty$. Of a somewhat similar flavor is an earlier result by [27], where a convergence rate of the form $N^{-\frac{\alpha+1}{2+\kappa\alpha}}$ is shown, $\kappa$ being the metric entropy of the class (again, a notion of complexity). A refinement of the results in [31] has recently been obtained by [34] where, following [11], a more refined notion of smoothness for the Bayes classifier is adopted which, however, also implies more restrictive assumptions on the marginal distribution $D_X$. As opposed to those bounds, our bounds are data-dependent, in that all relevant quantities appearing in the bounds are random variables depending on the data at hand (which are themselves random). One may attempt to turn these into data-independent results (like in most of the papers cited above) by, e.g., establishing bounds on these quantities that hold in expectation or with high probability over the random draw of the data, but such a theory is currently unavailable in the NTK literature (as far as we know). Very recently some results have appeared for certain special cases, see [23] for example, but such results are too embryonic in nature to allow a full-fledged comparison. Model selection of the scale of a nearest-neighbor-based active learning algorithm is also performed in [28], whose main goal is to achieve data-dependent rates based on the noisy-margin properties of the random sample at hand, rather than those of the marginal distribution.
Their active learning rates are not directly comparable to ours and, unlike our paper, the authors work in a pool-based scenario, where all unlabeled points are available beforehand. Finally, an interesting investigation of active learning in over-parametrized and interpolating regimes is contained in [25]. The paper collects a number of interesting insights on active learning for two-layer neural networks and kernel methods, but it restricts to either uniform distributions on the input space or cases of well-clustered data points, with no specific regret and query complexity guarantees, apart from very special (though insightful) cases.

## 3 Basic Algorithm

Our first algorithm (Algorithm 1) uses randomly initialized, but otherwise frozen, network weights (a more refined algorithm where the network weights are updated incrementally is described and analyzed in the appendix). Algorithm 1 is an adaptation to active learning of the neural contextual bandit algorithm of [45], and shares similarities with an earlier selective sampling algorithm analyzed in [16] for the linear case. The algorithm generates network weights $\theta_0$ by independently sampling from Gaussian distributions of appropriate variance, and then uses $\theta_0$ to define a gradient feature mapping $\phi(\cdot)$, which is kept frozen from beginning to end. The algorithm also takes as input the complexity parameter $S = S_{T,n}(h)$ of the underlying function $h$ satisfying (1). We shall later on remove the assumption of prior knowledge of $S_{T,n}(h)$. In particular, removing the latter turns out to be quite challenging from a technical standpoint, and gives rise to a complex online model selection algorithm for active learning in non-parametric regimes.

At each round $t$, Algorithm 1 receives an instance $x_t \in X$, and constructs the two augmented vectors $x_{t,1} = (x_t, \mathbf{0})$ and $x_{t,-1} = (\mathbf{0}, x_t)$ (intuitively corresponding to the two "actions" of a contextual bandit algorithm). The algorithm predicts the label $y_t$ associated with $x_t$ by maximizing over $a \in Y$ an upper confidence index $U_{t,a}$ stemming from the linear approximation $h(x_{t,a}) \approx \sqrt{m}\,\langle\phi(x_{t,a}), \theta_{t-1} - \theta_0\rangle$ subject to ellipsoidal constraints $C_{t-1}$, as in standard contextual bandit algorithms operating with the frozen mapping $\phi(\cdot)$. In addition, in order to decide whether or not to query the label $y_t$, the algorithm estimates its own uncertainty by checking to what extent $U_{t,a_t}$ is close to $1/2$. This uncertainty level is ruled by the time-varying threshold $B_t$, which is expected to shrink to $0$ as time progresses. Notice that $B_t$ is a function of $\gamma_{t-1}$, which in turn includes in its definition the complexity parameter $S$. Finally, if $y_t$ is revealed, the algorithm updates its least-squares estimator $\theta_t$ through a rank-one adjustment of the matrix $Z_t$ and an additive update to the bias vector $b_t$. No update takes place if the label is not queried. The following is our initial building block.²

²All proofs are in the appendix.
**Algorithm 1: Frozen NTK Selective Sampler.**

Input: Confidence level $\delta$, complexity parameter $S$, network width $m$, and depth $n$.

Initialization:
- Generate each entry of $W_k$ independently from $N(0, 2/m)$, for $k \in [n-1]$, and each entry of $W_n$ independently from $N(0, 1/m)$;
- Define $\phi(x) = g(x; \theta_0)/\sqrt{m}$, where $\theta_0 = (W_1, \ldots, W_n) \in \mathbb{R}^p$ is the (frozen) weight vector of the neural network so generated;
- Set $Z_0 = I \in \mathbb{R}^{p \times p}$, $b_0 = \mathbf{0} \in \mathbb{R}^p$.

For $t = 1, 2, \ldots, T$:
- Observe instance $x_t \in X$ and build $x_{t,a} \in X^2$, for $a \in Y = \{-1, +1\}$
- Set $C_{t-1} = \{\theta : \|\theta - \theta_{t-1}\|_{Z_{t-1}} \leq \gamma_{t-1}/\sqrt{m}\}$, with $\gamma_{t-1} = \sqrt{\log\det Z_{t-1} + 2\log(1/\delta)} + S$
- Set $U_{t,a} = \sqrt{m}\,\max_{\theta \in C_{t-1}} \langle\phi(x_{t,a}), \theta - \theta_0\rangle = \sqrt{m}\,\langle\phi(x_{t,a}), \theta_{t-1} - \theta_0\rangle + \gamma_{t-1}\,\|\phi(x_{t,a})\|_{Z^{-1}_{t-1}}$
- Predict $a_t = \arg\max_{a \in Y} U_{t,a}$
- Set $I_t = \mathbb{1}\{|U_{t,a_t} - 1/2| \leq B_t\} \in \{0, 1\}$, with $B_t = B_t(S) = 2\gamma_{t-1}\,\|\phi(x_{t,a_t})\|_{Z^{-1}_{t-1}}$
- If $I_t = 1$:
  - Query $y_t \in Y$, and set loss $\ell_t = \ell(a_t, y_t)$
  - Update $Z_t = Z_{t-1} + \phi(x_{t,a_t})\,\phi(x_{t,a_t})^\top$, $\quad b_t = b_{t-1} + (1 - \ell_t)\,\phi(x_{t,a_t})$, $\quad \theta_t = Z_t^{-1} b_t/\sqrt{m} + \theta_0$
- Else: $Z_t = Z_{t-1}$, $b_t = b_{t-1}$, $\theta_t = \theta_{t-1}$, $\gamma_t = \gamma_{t-1}$, $C_t = C_{t-1}$.

**Theorem 1.** Let Algorithm 1 be run with parameters $\delta$, $S$, $m$, and $n$ on an i.i.d. sample $(x_1, y_1), \ldots, (x_T, y_T) \sim D$, where the marginal distribution $D_X$ fulfills the low-noise condition with exponent $\alpha \geq 0$ w.r.t. a function $h$ that satisfies (1) and such that $2S_{T,n}(h) \leq S$. If $m = \mathrm{poly}(T, n, \lambda_0^{-1}, \log(1/\delta))$, then with probability at least $1 - \delta$ the cumulative regret $R_T$ and the total number of queries $N_T$ are simultaneously upper bounded as follows:

$$R_T = O\Big(\big(L_H\,(L_H + \log(1/\delta) + S^2)\big)^{\frac{\alpha+1}{\alpha+2}}\, T^{\frac{1}{\alpha+2}} + \log(\log T/\delta)\Big)\,,$$
$$N_T = O\Big(\big(L_H\,(L_H + \log(1/\delta) + S^2)\big)^{\frac{\alpha}{\alpha+2}}\, T^{\frac{2}{\alpha+2}} + \log(\log T/\delta)\Big)\,,$$

where $L_H = \log\det(I + H)$, $H$ being the NTK matrix of depth $n$ over the set of points $\{x_{t,a}\}_{t=1,\ldots,T,\, a=\pm 1}$.

The above bounds depend, beyond the time horizon $T$, on three relevant quantities: the noise level $\alpha$, the complexity parameter $S$, and the log-determinant quantity $L_H$. Notice that, whereas $S$ essentially quantifies the complexity of the function $h$ to be learned, $L_H$ measures instead the complexity of the NTK itself, hence somehow quantifying the complexity of the function space we rely upon in learning $h$. It is indeed instructive to see how the bounds in the above theorem vary as a function of these quantities. First, as expected, when $\alpha = 0$ we recover the usual regret guarantee $R_T = O(\sqrt{T})$, more precisely a bound of the form $R_T = O\big((L_H + \sqrt{L_H}\,S)\sqrt{T}\big)$, with the trivial label complexity $N_T = O(T)$. At the other extreme, when $\alpha \to \infty$ we obtain the guarantees $R_T = N_T = O(L_H(L_H + S^2))$. In either case, if $h$ is "too complex" when projected onto the data, that is, if $S^2_{T,n}(h) = \Omega(T)$, then all bounds become vacuous.³ At the opposite end of the spectrum, if $\{h\}$ is simple, like a class of linear functions with bounded norm in a $d$-dimensional space, and the network depth $n$ is 2, then $S_{T,n}(h) = O(1)$ and $L_H = O(d \log T)$, hence recovering the rates reported in [16] for the linear case. The quantity $L_H$ is tightly related to the decay rate of the eigenvalues of the NTK matrix $H$, and is poly-logarithmic in $T$ in several important cases [42]. One relevant example is discussed in [43], which relies on the spectral characterization of the NTK in [7, 8]: if $n = 2$ and all points $x^{(i)}$ concentrate on a $d_0$-dimensional subspace of the RKHS spanned by the NTK, then $L_H = O(d_0 \log T)$.

³The same happens, e.g., to the regret bounds in [45].
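For concreteness, here is a minimal NumPy sketch of Algorithm 1 for depth $n = 2$, with the feature map $\phi(x) = g(x; \theta_0)/\sqrt{m}$ written out in closed form at the random initialization. Tracking $Z_t^{-1}$ and $\log\det Z_t$ incrementally via the Sherman-Morrison identity is an implementation convenience of ours, not something the pseudocode prescribes; all function names are likewise ours.

```python
import numpy as np

def make_phi(dim, m, rng):
    """Frozen feature map phi(x) = g(x; theta_0)/sqrt(m) for a depth-2 ReLU
    network f(x, theta) = sqrt(m) * W2 @ relu(W1 @ x) at its random init."""
    W1 = rng.normal(0.0, np.sqrt(2.0 / m), size=(m, dim))
    W2 = rng.normal(0.0, np.sqrt(1.0 / m), size=m)
    def phi(x):
        pre = W1 @ x
        dW2 = np.maximum(pre, 0.0)                  # grad wrt W2 (the /sqrt(m) cancels)
        dW1 = np.outer(W2 * (pre > 0), x)           # grad wrt W1
        return np.concatenate([dW2, dW1.ravel()])
    return phi

def frozen_ntk_selective_sampler(stream, dim, S, delta=0.05, m=64, seed=0):
    """Sketch of Algorithm 1. `stream` yields (x_t, y_t) with y_t in {-1,+1};
    y_t is looked at only when the query indicator fires."""
    rng = np.random.default_rng(seed)
    phi = make_phi(2 * dim, m, rng)                 # features live on X^2
    p = m + m * 2 * dim
    Zinv, b, logdetZ = np.eye(p), np.zeros(p), 0.0  # Z_0 = I
    queries, preds = 0, []
    for x, y in stream:
        zeros = np.zeros(dim)
        feats = {1: phi(np.concatenate([x, zeros])),
                 -1: phi(np.concatenate([zeros, x]))}
        gamma = np.sqrt(logdetZ + 2 * np.log(1 / delta)) + S
        # U_{t,a} = sqrt(m)<phi, theta_{t-1}-theta_0> + gamma*||phi||_{Z^{-1}}
        #         = phi^T Z^{-1} b + gamma * sqrt(phi^T Z^{-1} phi)
        U = {a: f @ (Zinv @ b) + gamma * np.sqrt(f @ Zinv @ f)
             for a, f in feats.items()}
        a_t = max(U, key=U.get)
        preds.append(a_t)
        f = feats[a_t]
        B_t = 2 * gamma * np.sqrt(f @ Zinv @ f)
        if abs(U[a_t] - 0.5) <= B_t:                # query y_t
            queries += 1
            loss = float(a_t != y)
            u = Zinv @ f
            logdetZ += np.log1p(f @ u)              # log det(Z + f f^T)
            Zinv -= np.outer(u, u) / (1.0 + f @ u)  # Sherman-Morrison update
            b += (1.0 - loss) * f
    return preds, queries
```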
It is also important to stress that, via a standard online-to-batch conversion, the result in Theorem 1 can be turned into a compelling guarantee in a traditional statistical learning setting, where the goal is to come up, at the end of the $T$ rounds, with a hypothesis $f$ whose population loss $L(f) = \mathbb{E}_{x \sim D_X}[L(f \mid x)]$ exceeds the Bayes-optimal population loss $\mathbb{E}_{x_t \sim D_X}[h(x_{t,a^*_t})] = \mathbb{E}_{x_t \sim D_X}[\max\{h(x_{t,1}), h(x_{t,-1})\}]$ by a vanishing quantity. Following [16], this online-to-batch algorithm will simply run Algorithm 1 by sweeping over the sequence $\{(x_t, y_t)\}_{t=1,\ldots,T}$ only once, and pick one function uniformly at random among the sequence of predictors generated by Algorithm 1 during its online functioning, that is, among the sequence $\{U_t(x)\}_{t=1,\ldots,T}$, where $U_t(x) = \arg\max_{a \in Y}\max_{\theta \in C_{t-1}} \langle\phi(x_{\cdot,a}), \theta - \theta_0\rangle$, with $x_{\cdot,1} = (x, \mathbf{0})$ and $x_{\cdot,-1} = (\mathbf{0}, x)$. This randomized algorithm enjoys the following high-probability excess risk guarantee:⁴

$$\mathbb{E}_{t \sim \mathrm{unif}(T)}[L(U_t)] - \mathbb{E}_{x_t \sim D_X}[h(x_{t,a^*_t})] = O\Bigg(\bigg(\frac{L_H\big(L_H + \log(1/\delta) + S^2\big)}{T}\bigg)^{\frac{\alpha+1}{\alpha+2}} + \frac{\log(\log T/\delta)}{T}\Bigg)\,.$$

⁴Observe that this is a data-dependent bound, in that the RHS is a random variable. This is because both $L_H$ and $S$ may depend on $x_1, \ldots, x_T$.

Combining with the guarantee on the number of labels $N_T$ from Theorem 1 (and disregarding log factors), this allows us to conclude that the above excess risk can be bounded as a function of $N_T$ as

$$\tilde O\Bigg(\bigg(\frac{L_H(L_H + S^2)}{N_T}\bigg)^{\frac{\alpha+1}{2}}\Bigg)\,, \qquad (2)$$

where $L_H(L_H + S^2)$ plays the role of a (compound) complexity term projected onto the data $x_1, \ldots, x_T$ at hand. When restricting to VC classes, the convergence rate $N_T^{-\frac{\alpha+1}{2}}$ is indeed the best rate (minimax rate) one can achieve under the Mammen-Tsybakov low-noise condition with exponent $\alpha$ (see, e.g., [10, 20, 27, 16]). Yet, since we are not restricting ourselves to the parametric case, both $L_H$ and, more importantly, $S^2$ can be functions of $T$. In such cases, the generalization bound (2) can still be expressed as a function of $N_T$ alone. For instance, when $L_H$ is poly-logarithmic in $T$ and $S^2 = O(T^\beta)$, for some $\beta \in [0, 1)$, one can easily verify that (2) takes the form $N_T^{-\frac{(1-\beta)(\alpha+1)}{2+\beta\alpha}}$ (again, up to log factors).

In Section A.3 of the appendix, we extend all our results to the case where the network weights are not frozen, but are updated on the fly according to a gradient descent procedure. In this case, in Algorithm 1 the gradient vector $\phi(x) = g(x; \theta_0)/\sqrt{m}$ is replaced by $\phi_t(x) = g(x; \theta_{t-1})/\sqrt{m}$, where $\theta_t$ is not the linear least-squares estimator $\theta_t = Z_t^{-1} b_t/\sqrt{m} + \theta_0$ of Algorithm 1, but the result of training the DNN on the labeled data $\{(x_k, y_k) : k \leq t,\ I_k = 1\}$ gathered so far.

## 4 Model Selection

Our model selection algorithm is described in Algorithm 2. The algorithm operates on a pool of base learners of Frozen NTK selective samplers, like those in Algorithm 1, each member of the pool being parametrized by a pair of parameters $(S_i, d_i)$, where $S_i$ plays the role of the (unknown) complexity parameter $S_{T,n}(h)$ (which was replaced by $S$ in Algorithm 1), and $d_i$ plays the role of an (a priori unknown) upper bound on the relevant quantity $\sum_{t \leq T:\, i_t = i} \min\{\tfrac{1}{2}, I_{t,i}B_{t,i}\}^2$ that is involved in the analysis (see Lemma 5 and Lemma 7 in Appendix A.1). This quantity will at the end be upper bounded by a term of the form $L_H(L_H + \log(T/\delta) + S^2_{T,n}(h))$, whose components $L_H$ and $S^2_{T,n}(h)$ are initially unknown to the algorithm.

Algorithm 2 maintains over time a set $M_t$ of active base learners, and a probability distribution $p_t$ over them. This distribution remains constant throughout the sequence of rounds between one change to $M_t$ and the next. We call such a sequence of rounds an epoch. Upon observing $x_t$, Algorithm 2 selects which base learner to rely upon in issuing its prediction $a_t$ and querying the label $y_t$, by drawing base learner $i_t \in M_t$ according to $p_t$. Then Algorithm 2 undergoes a series of carefully designed elimination tests, which are meant to rule out mis-specified base learners, that is, those whose associated parameter $S_i$ is likely to be smaller than $S_{T,n}(h)$, while retaining those such that $S_i \geq S_{T,n}(h)$. These tests will help keep both the regret bound and the label complexity of Algorithm 2 under control.
Whenever any such test triggers at the end of some round $t$, that is, whenever $|M_{t+1}| < |M_t|$ at the end of the round, a new epoch begins, and the algorithm starts over with a fresh distribution $p_{t+1} \neq p_t$, recomputed on the shrunken pool $M_{t+1}$.

The first test ("disagreement test") restricts attention to the active base learners that would not have requested the label if asked. Since our analysis of the base selective sampler (see Lemma 8 in Appendix A.1) shows that a well-specified base learner does not suffer (with high probability) any regret on non-queried rounds, any disagreement among these learners reveals mis-specification; we therefore eliminate, in a pairwise comparison, the base learner holding the smaller $S_i$ parameter. The second test ("observed regret test") considers the regret behavior of each pair of base learners $i, j \in M_t$ on the rounds $k \leq t$ on which $i$ was selected ($i_k = i$) and requested the label ($I_{k,i} = 1$), but $j$ would not have requested it if asked ($I_{k,j} = 0$), and the predictions of the two happened to disagree on that round ($a_{k,i} \neq a_{k,j}$). The goal here is to eliminate base learners whose cumulative regret is likely to exceed the regret of the smallest well-specified learner, while ensuring (with high probability) that any well-specified base learner $i$ is not removed from the pool. In a similar fashion, the third test ("label complexity test") is aimed at keeping under control the label complexity of the base learners in the active pool $M_t$. Finally, the last test ("$d_i$ test") simply checks whether or not the candidate value $d_i$ associated with base learner $i$ remains a valid (and tight) upper bound on $L_H(L_H + S^2_{T,n}(h))$.

Notice that the sampling distribution $p_t$ plays base learners with small $d_i$ more often than learners with large $d_i$. Note also that $d_i$ is exactly the (instance-dependent) factor in the cumulative regret and label complexity bounds for base learners that are well-specified. This means that base learners with lower regret are chosen more frequently than base learners that accumulate regret more quickly (and similarly for label complexity). In fact, the sampling distribution is chosen so that the total contribution of each base learner to the cumulative regret is roughly equal. As a consequence, the total cumulative regret of Algorithm 2 is at most $M$ (the number of base learners) times the regret of each base learner, and of the best base learner in particular, which is a key property for achieving the guarantees in Theorem 2 below. Of course, this only works when the base learners are well-specified, but the four tests in Algorithm 2 ensure that all other learners are eventually eliminated. We have the following result, whose proof is contained in Appendix A.2.

**Theorem 2.** Let Algorithm 2 be run with parameters $\delta$ and $\gamma \leq \alpha$ with a pool of base learners $M_1$ of size $M$ on an i.i.d. sample $(x_1, y_1), \ldots, (x_T, y_T) \sim D$, where the marginal distribution $D_X$ fulfills the low-noise condition with exponent $\alpha \geq 0$ w.r.t. a function $h$ that satisfies (1) and has complexity $S_{T,n}(h)$. Let also $M_1$ contain at least one base learner $i$ such that $2S_{T,n}(h) \leq S_i \leq 4S_{T,n}(h)$ and $d_i = \Theta\big(L_H(L_H + \log(M/\delta) + S^2_{T,n}(h))\big)$, where $L_H = \log\det(I + H)$, $H$ being the NTK matrix of depth $n$ over the set of points $\{x_{t,a}\}_{t=1,\ldots,T,\, a=\pm 1}$.
If $m = \mathrm{poly}(T, n, \lambda_0^{-1}, \log(1/\delta))$, then with probability at least $1 - \delta$ the cumulative regret $R_T$ and the total number of queries $N_T$ are simultaneously upper bounded as follows:

$$R_T = O\Big(M\,\big(L_H(L_H + \log(M/\delta) + S^2_{T,n}(h))\big)^{\frac{\gamma+1}{\gamma+2}}\, T^{\frac{1}{\gamma+2}} + M\,L(T, \delta)\Big)\,,$$
$$N_T = O\Big(M\,\big(L_H(L_H + \log(M/\delta) + S^2_{T,n}(h))\big)^{\frac{\gamma}{\gamma+2}}\, T^{\frac{2}{\gamma+2}} + M\,L(T, \delta)\Big)\,,$$

where $L(T, \delta)$ is the logarithmic term defined at the beginning of Algorithm 2's pseudocode.

We run Algorithm 2 with the pool $M_1 = \{(S_{i_1}, d_{i_2})\}$, where $S_{i_1} = 2^{i_1}$, $i_1 = 0, 1, \ldots, O(\log T)$, and $d_{i_2} = 2^{i_2}$, $i_2 = 0, 1, \ldots, O(\log T)$, ensuring⁵ the existence of a pair $(i_1, i_2)$ such that $2S_{T,n}(h) \leq S_{i_1} \leq 4S_{T,n}(h)$ and $L_H\big(L_H + \log(M/\delta) + S^2_{T,n}(h)\big) \leq d_{i_2} \leq 2L_H\big(L_H + \log(M/\delta) + S^2_{T,n}(h)\big)$. Hence the resulting error due to the discretization is just a constant factor, while the resulting number $M$ of base learners is $O(\log^2 T)$.

⁵Notice that the bounds in Theorem 2 become vacuous if either $S_{T,n}(h)$ or $L_H$ is $\Theta(\sqrt{T})$, hence we are only interested in making indices $i_1$ and $i_2$ reach a value which is at most logarithmic in $T$.

Theorem 2 allows us to conclude that running Algorithm 2 on the above pool of copies of Algorithm 1 yields guarantees that are similar to those obtained by running a single instance of Algorithm 1 with $S = 2S_{T,n}(h)$, that is, as if the complexity parameter $S_{T,n}(h)$ were known beforehand. Yet, this model selection guarantee comes at a price, since Algorithm 2 needs to receive as input the noise exponent $\alpha$ (through the parameter $\gamma \leq \alpha$) in order to correctly shape its label complexity test. The very same online-to-batch conversion mentioned in Section 3 can be applied to Algorithm 2. Again, combining with the bound on the number of labels and disregarding log factors, this gives us a high-probability excess risk bound of the form

$$\tilde O\Bigg(\bigg(\frac{\big(L_H(L_H + S^2_{T,n}(h))\big)^{\frac{3\alpha+2}{\alpha+2}}}{N_T}\bigg)^{\frac{\alpha+1}{2}}\Bigg)\,, \qquad (3)$$

provided $\gamma = \alpha$.

**Algorithm 2: Frozen NTK Selective Sampler with Model Selection.**

Input: Confidence level $\delta$; probability parameter $\gamma \geq 0$; pool of base learners $M_1$, each identified with a pair $(S_i, d_i)$; number of rounds $T$.

Set $L(t, \delta) = \log\big(5.2\,\log(2t)^{1.4}/\delta\big)$

For $t = 1, 2, \ldots, T$:
- Observe instance $x_t \in X$ and build $x_{t,a} \in X^2$, for $a \in Y = \{-1, +1\}$
- For each $i \in M_t$:
  - Set $I_{t,i} \in \{0, 1\}$ as the indicator of whether base learner $i$ would ask for the label of $x_t$
  - Set $a_{t,i} \in Y$ as the prediction of base learner $i$ on $x_t$
  - Let $B_{t,i} = B_{t,i}(S_i)$ denote the query threshold of base learner $i$ (from Algorithm 1)
- Select base learner $i_t \sim p_t = (p_{t,1}, p_{t,2}, \ldots, p_{t,|M_t|})$, where
  $$p_{t,i} = \begin{cases}\dfrac{d_i^{-(\gamma+1)}}{\sum_{j \in M_t} d_j^{-(\gamma+1)}}\,, & \text{if } i \in M_t\,,\\ 0\,, & \text{otherwise}\end{cases}$$
- Predict $a_t = a_{t,i_t}$
- If $I_{t,i_t} = 1$: query label $y_t \in Y$ and send $(x_t, y_t)$ to base learner $i_t$
- $M_{t+1} = M_t$
- // (1) Disagreement test
  - Set $N_t = \{i \in M_t : I_{t,i} = 0\}$
  - For all pairs of base learners $i, j \in N_t$ that disagree in their prediction ($a_{t,i} \neq a_{t,j}$): eliminate all learners with smaller $S$: $M_{t+1} = \{m \in M_{t+1} : S_m > \min\{S_i, S_j\}\}$
- // (2) Observed regret test
  - For all pairs of base learners $i, j \in M_t$: consider the rounds where the chosen learner $i$ requested the label but $j$ did not, and $i$ and $j$ disagree in their prediction: $V_{t,i,j} = \{k \in [t] : i_k = i,\ I_{k,i} = 1,\ I_{k,j} = 0,\ a_{k,i} \neq a_{k,j}\}$
  - If $\displaystyle\sum_{k \in V_{t,i,j}} \big(\mathbb{1}\{a_{k,i} \neq y_k\} - \mathbb{1}\{a_{k,j} \neq y_k\}\big) > \sum_{k \in V_{t,i,j}} \min\{1, B_{k,i}\} + 1.45\sqrt{|V_{t,i,j}|\,L(|V_{t,i,j}|, \delta)}$: eliminate base learner $i$: $M_{t+1} = M_{t+1} \setminus \{i\}$
- // (3) Label complexity test
  - For $i \in M_t$: consider the rounds where base learner $i$ was played: $T_{t,i} = \{k \in [t] : i_k = i\}$
  - If $\displaystyle\sum_{k \in T_{t,i}} I_{k,i} > \inf_{\epsilon \in (0, 1/2]}\Big(3\epsilon^{\gamma}|T_{t,i}| + \frac{1}{\epsilon^2}\sum_{k \in T_{t,i}} \min\{1, I_{k,i}B_{k,i}\}^2\Big) + 2L\big(|T_{t,i}|, \delta/(M \log_2(12t))\big)$: eliminate base learner $i$: $M_{t+1} = M_{t+1} \setminus \{i\}$
- // (4) $d_i$ test
  - For $i \in M_t$: if $\displaystyle\sum_{k \in T_{t,i}} \min\{\tfrac{1}{2}, I_{k,i}B_{k,i}\}^2 > 8d_i$: eliminate base learner $i$: $M_{t+1} = M_{t+1} \setminus \{i\}$
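To illustrate the balancing machinery, the following Python fragment sketches the sampling distribution $p_t$ of Algorithm 2 and its disagreement test (test (1)); the remaining tests follow the same bookkeeping pattern. The names and data structures are ours, for illustration only, under the pseudocode's conventions.

```python
import numpy as np

def balancing_distribution(active, d, gamma):
    """Sampling distribution of Algorithm 2: p_{t,i} proportional to
    d_i^{-(gamma+1)} over the currently active base learners."""
    weights = {i: d[i] ** -(gamma + 1) for i in active}
    total = sum(weights.values())
    return {i: w / total for i, w in weights.items()}

def disagreement_test(active, S, would_query, predictions):
    """Test (1): among active learners that would NOT query the label,
    any prediction disagreement exposes mis-specification; drop every
    learner whose S is <= the smaller S in a disagreeing pair."""
    quiet = [i for i in active if not would_query[i]]
    survivors = set(active)
    for i in quiet:
        for j in quiet:
            if predictions[i] != predictions[j]:
                cutoff = min(S[i], S[j])
                survivors = {m for m in survivors if S[m] > cutoff}
    return survivors

# toy usage with a geometric grid of complexity guesses, as in Section 4
rng = np.random.default_rng(2)
S = {i: float(2 ** i) for i in range(5)}          # S_i = 2^i
d = {i: float(2 ** i) for i in range(5)}          # d_i = 2^i
would_query = {i: i % 2 == 0 for i in S}          # stand-in for I_{t,i}
predictions = {i: 1 if i < 3 else -1 for i in S}  # stand-in for a_{t,i}
active = disagreement_test(set(S), S, would_query, predictions)
p = balancing_distribution(active, d, gamma=1.0)
ids = sorted(active)
i_t = rng.choice(ids, p=[p[i] for i in ids])
print("active:", ids, "-> selected base learner:", int(i_t))
```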
Following the same example as at the end of Section 3, when $L_H$ is poly-logarithmic in $T$ and $S^2 = O(T^\beta)$, for some $\beta \in [0, 1)$, one can verify that (3) is of the form $N_T^{-\frac{(1-\beta(\alpha+1))(\alpha+1)}{2+\beta\alpha}}$ (up to log factors), which converges only for $\beta < 1/(\alpha+1)$. Hence, compared to (2), we can ensure convergence in a more restricted set of cases. Section A.3 in the appendix contains the extension of our model selection procedure to the case where the network weights are themselves updated.

## 5 Conclusions and Work in Progress

We have presented a rigorous analysis of selective sampling and active learning in general non-parametric scenarios, where the complexity of the Bayes-optimal predictor is evaluated on the data at hand, as a fitting measure with respect to the NTK matrix of a given depth associated with the same data. This complexity measure plays a central role in the level of uncertainty the algorithm assigns to labels (the higher the complexity, the higher the uncertainty, hence the more labels are queried). Yet, since this is typically an unknown parameter of the problem, special attention is devoted to designing and analyzing a model selection technique that adapts to this unknown parameter. In doing so, we borrowed tools and techniques from neural bandits [45, 43], selective sampling (e.g., [16]), and online model selection in contextual bandits [36, 35], and combined them in an original and non-trivial manner. We proved regret and label complexity bounds that recover known minimax rates in the parametric case, and extended such results well beyond the parametric setting, achieving favorable guarantees that cannot easily be compared to available results in the literature on active learning in non-parametric settings. One distinctive feature of our proposed technique is that it gives rise to efficient and manageable algorithms for modular DNN architecture design and deployment.

We conclude by mentioning a few directions we are currently exploring:

1. We are trying to get rid of the prior knowledge of $\alpha$ in the model selection Algorithm 2. This may call for a slightly more refined balancing technique that jointly involves $S_{T,n}(h)$ and $\alpha$ itself.
2. Regardless of whether $\alpha$ is available, it would be nice to improve the dependence on $\gamma = \alpha$ in the regret bound of Theorem 2. This would ensure convergence of the generalization bound as $N_T \to \infty$ when $S_{T,n}(h)^2 = T^\beta$, for all $\beta \in [0, 1)$. We conjecture that the current limitation is due to a suboptimal design of our balancing mechanism for model selection in Algorithm 2.
3. We are investigating links between the complexity measure $S_{T,n}(h)$ and the smoothness properties of the (Bayes) regression function $h$ with respect to the NTK kernel (of a given depth $n$).

## References

[1] Y. Abbasi-Yadkori, D. Pál, and C. Szepesvári. Improved algorithms for linear stochastic bandits. In Advances in Neural Information Processing Systems 24, pages 2312-2320. Curran Associates, Inc., 2011.
[2] S. Arora, S. S. Du, W. Hu, Z. Li, R. Salakhutdinov, and R. Wang. On exact computation with an infinitely wide neural net. In Advances in Neural Information Processing Systems. Curran Associates, Inc., 2019.
[3] J. T. Ash, C. Zhang, A. Krishnamurthy, J. Langford, and A. Agarwal. Deep batch active learning by diverse, uncertain gradient lower bounds. arXiv preprint arXiv:1906.03671, 2019.
[4] M. Balcan, A. Beygelzimer, and J. Langford. Agnostic active learning. Journal of Computer and System Sciences, 75(1):78-89, 2009.
[5] N. Balcan, S. Hanneke, and J. Wortman. The true sample complexity of active learning. In COLT, 2008.
[6] A. Beygelzimer, S. Dasgupta, and J. Langford. Importance weighted active learning. In ICML, 2009.
[7] A. Bietti and J. Mairal. On the inductive bias of neural tangent kernels. In Advances in Neural Information Processing Systems. Curran Associates, Inc., 2019.
[8] Y. Cao, Z. Fang, Y. Wu, D. Zhou, and Q. Gu. Towards understanding the spectral bias of deep learning. arXiv preprint arXiv:1912.01198, 2019.
[9] Y. Cao and Q. Gu. Generalization bounds of stochastic gradient descent for wide and deep neural networks. In Advances in Neural Information Processing Systems. Curran Associates, Inc., 2019.
[10] R. Castro and R. Nowak. Minimax bounds for active learning. IEEE Transactions on Information Theory, 54(5):2339-2353, 2008.
[11] K. Chaudhuri and S. Dasgupta. Rates of convergence for nearest neighbor classification. In Advances in Neural Information Processing Systems, pages 3437-3445, 2014.
[12] S. R. Chowdhury and A. Gopalan. On kernelized multi-armed bandits. In Proceedings of the 34th International Conference on Machine Learning, 2017.
[13] D. Cohn, L. Atlas, and R. Ladner. Improving generalization with active learning. Machine Learning, 15(2):201-221, 1994.
[14] I. Dagan and S. P. Engelson. Committee-based sampling for training probabilistic classifiers. In Machine Learning Proceedings 1995, pages 150-157. Elsevier, 1995.
[15] S. Dasgupta, D. Hsu, and C. Monteleoni. A general agnostic active learning algorithm. In Advances in Neural Information Processing Systems, 2007.
[16] O. Dekel, C. Gentile, and K. Sridharan. Selective sampling and active learning from single and multiple teachers. Journal of Machine Learning Research, 13(1), 2012.
[17] S. Du, J. Lee, H. Li, L. Wang, and X. Zhai. Gradient descent finds global minima of deep neural networks. In International Conference on Machine Learning, pages 1675-1685, 2019.
[18] S. Du, J. Lee, Y. Tian, A. Singh, and B. Poczos. Gradient descent learns one-hidden-layer CNN: Don't be afraid of spurious local minima. In Proceedings of the 35th International Conference on Machine Learning, volume 80 of Proceedings of Machine Learning Research, pages 1339-1348. PMLR, 2018.
[19] S. Hanneke. A bound on the label complexity of agnostic active learning. In ICML, 2007.
[20] S. Hanneke. Adaptive rates of convergence in active learning. In Proceedings of the 22nd Annual Conference on Learning Theory, 2009.
[21] S. Hanneke et al. Theory of disagreement-based active learning. Foundations and Trends in Machine Learning, 7(2-3):131-309, 2014.
[22] S. R. Howard, A. Ramdas, J. McAuliffe, and J. Sekhon. Time-uniform, nonparametric, nonasymptotic confidence sequences. arXiv preprint arXiv:1810.08240, 2018.
[23] D. Hsu, C. Sanford, R. Servedio, and E. V. Vlatakis-Gkaragkounis. On the approximation power of two-layer networks of random ReLUs. arXiv preprint arXiv:2102.02336, 2021.
[24] A. Jacot, F. Gabriel, and C. Hongler. Neural tangent kernel: convergence and generalization in neural networks. In Advances in Neural Information Processing Systems, pages 8571-8580. MIT Press, 2018.
[25] M. Karzand and R. Nowak. Maximin active learning in overparameterized model classes. arXiv preprint arXiv:1905.12782v2, 2020.
[26] A. Kirsch, J. van Amersfoort, and Y. Gal. BatchBALD: Efficient and diverse batch acquisition for deep Bayesian active learning. arXiv preprint arXiv:1906.08158, 2019.
[27] V. Koltchinskii. Rademacher complexities and bounding the excess risk of active learning. Journal of Machine Learning Research, 11:2457-2485, 2010.
[28] A. Kontorovich, S. Sabato, and R. Urner.
Active nearest-neighbor learning in metric spaces. In Advances in Neural Information Processing Systems, pages 856-864, 2016.
[29] A. Krizhevsky, I. Sutskever, and G. E. Hinton. ImageNet classification with deep convolutional neural networks. Advances in Neural Information Processing Systems, 25:1097-1105, 2012.
[30] D. D. Lewis and W. A. Gale. A sequential algorithm for training text classifiers. In SIGIR '94, pages 3-12. Springer, 1994.
[31] A. Locatelli, A. Carpentier, and S. Kpotufe. Adaptivity to noise parameters in nonparametric active learning. In Proceedings of the 2017 Conference on Learning Theory, volume 65 of Proceedings of Machine Learning Research, pages 1383-1416, 2017.
[32] E. Mammen and A. Tsybakov. Smooth discrimination analysis. The Annals of Statistics, 27(6):1808-1829, 1999.
[33] S. Minsker. Plug-in approach to active learning. Journal of Machine Learning Research, 13:67-90, 2012.
[34] B. Njike and X. Siebert. Nonparametric adaptive active learning under local smoothness condition. arXiv preprint arXiv:2102.11077, 2021.
[35] A. Pacchiano, C. Dann, C. Gentile, and P. Bartlett. Regret bound balancing and elimination for model selection in bandits and RL. arXiv preprint arXiv:2012.13045, 2020.
[36] A. Pacchiano, M. Phan, Y. Abbasi-Yadkori, A. Rao, J. Zimmert, T. Lattimore, and C. Szepesvari. Model selection in contextual stochastic bandit problems. In Advances in Neural Information Processing Systems, volume 33, pages 10328-10337. Curran Associates, Inc., 2020.
[37] R. Pop and P. Fulop. Deep ensemble Bayesian active learning: Addressing the mode collapse issue in Monte Carlo dropout via ensembles. arXiv preprint arXiv:1811.03897, 2018.
[38] M. Raginsky and A. Rakhlin. Lower bounds for passive and active learning. In Advances in Neural Information Processing Systems, 2011.
[39] O. Sener and S. Savarese. Active learning for convolutional neural networks: A core-set approach. arXiv preprint arXiv:1708.00489, 2017.
[40] B. Settles. Active learning literature survey. 2009.
[41] L. G. Valiant. A theory of the learnable. Communications of the ACM, 27(11):1134-1142, 1984.
[42] M. Valko, N. Korda, R. Munos, I. Flaounas, and N. Cristianini. Finite-time analysis of kernelised contextual bandits. arXiv preprint arXiv:1309.6869, 2013.
[43] W. Zhang, D. Zhou, L. Li, and Q. Gu. Neural Thompson sampling. arXiv preprint arXiv:2010.00827, 2020.
[44] F. Zhdanov. Diverse mini-batch active learning. arXiv preprint arXiv:1901.05954, 2019.
[45] D. Zhou, L. Li, and Q. Gu. Neural contextual bandits with UCB-based exploration. In Proceedings of the 37th International Conference on Machine Learning, 2020.