# Adaptive Region-Based Active Learning

Corinna Cortes (Google Research, New York, NY), Giulia De Salvo (Google Research), Claudio Gentile (Google Research), Mehryar Mohri (Google Research and Courant Institute of Mathematical Sciences, New York, NY), Ningshan Zhang (Hudson River Trading, New York, NY). Correspondence to: Ningshan Zhang. Proceedings of the 37th International Conference on Machine Learning (ICML 2020), PMLR 119.

## Abstract

We present a new active learning algorithm that adaptively partitions the input space into a finite number of regions and subsequently seeks a distinct predictor for each region. We prove theoretical guarantees for both the generalization error and the label complexity of our algorithm, and analyze the number of regions defined by the algorithm under some mild assumptions. We also report the results of an extensive suite of experiments on several real-world datasets demonstrating substantial empirical benefits over existing single-region and non-adaptive region-based active learning baselines.

## 1. Introduction

In many learning problems, including document classification, image annotation, and speech recognition, large amounts of unlabeled data are at the learner's disposal at practically no cost. In contrast, reliable labeled data is often more costly to acquire, since it requires careful assessments by human labelers. To limit that cost, in active learning the learner seeks to request as few labels as possible while still learning an accurate predictor. This is an attractive learning scenario with significant practical benefits, but it remains a challenging theoretical and algorithmic setting.

The literature on active learning is very broad. We therefore give only a brief discussion of previous work here and refer the reader to (Dasgupta, 2011) for an in-depth survey of the main algorithmic and theoretical ideas, as well as the current challenges. For separable problems, Cohn et al. (1994) introduced the CAL algorithm, which only requires a logarithmic number of label requests, $O(\log(1/\epsilon))$, to achieve $\epsilon$-accuracy. Later, other on-line active learning algorithms for general hypothesis classes and distributions were designed with guarantees both for generalization and label complexity in the agnostic setting (Freund et al., 1997; Balcan et al., 2006; Hanneke, 2007; Dasgupta et al., 2008; Beygelzimer et al., 2009; 2010; Huang et al., 2015; Zhang & Chaudhuri, 2014), and in the separable setting (Dasgupta, 2004; Golovin & Krause, 2017; Nowak, 2011; Tosh & Dasgupta, 2017). The theoretical analysis of the label complexity of active learning for various hypothesis classes and data distributions has been discussed in several publications (Dasgupta, 2006; Castro & Nowak, 2008; Koltchinskii, 2010; Hanneke & Yang, 2015; Hanneke, 2014; Mussmann & Liang, 2018). In particular, for hypothesis sets consisting of linear separators, a series of publications gave margin-based on-line active learning algorithms that admit guarantees under specific distributional assumptions (Dasgupta et al., 2005; Balcan et al., 2007; Balcan & Long, 2013; Awasthi et al., 2015; Zhang, 2018).

For all these algorithms, the hypothesis set or version space $H$ is fixed beforehand and, over time, as more labeled information is acquired, it is gradually shrunk to rule out hypotheses too far from the best-in-class hypothesis. This paper initiates the study of an alternative family of algorithms where the hypothesis set $H$ is first expanded over time before shrinking.
Specifically, we consider active learning algorithms that adaptively partition the input space into a finite number of disjoint regions, each equipped with the hypothesis set $H$, and that subsequently seek a distinct predictor for each region. Such algorithms can achieve substantially better performance, as shown by our theoretical analysis and largely demonstrated by our experiments. The design of such algorithms raises several questions: How should the input space be partitioned to ensure an improvement in overall performance? How can labels be requested most effectively across regions to learn an accurate predictor per region? Can we provide generalization and label complexity guarantees? In this paper, we tackle these questions and devise an algorithm for this problem, called Adaptive Region-Based Active Learning (ARBAL), which benefits from favorable theoretical guarantees. From a theoretical standpoint, there are several challenging problems: ensuring that the region-specific best-in-class hypothesis is not discarded, the selection of the splitting criterion, and the dependency of the final generalization bound on that criterion.

Of course, if a beneficial partition of the input space is available to the learner, as assumed in the related work of Cortes et al. (2019b), then no further work is needed to adaptively seek one. In practice, however, such strong oracle information may not be available and, even when a natural pre-partitioning of the input space exists, without recourse to labeled data it is not guaranteed to help improve the generalization error. Furthermore, we do not assume that dividing the input space is always beneficial. However, if there indeed exists a partition such that a region-specific predictor performs significantly better than a global one, then, with high probability, ARBAL will find it. Otherwise, no split is made and ARBAL behaves exactly like a single-region active learning algorithm. In practice, in almost all cases we tested, ARBAL splits the input space into multiple regions and achieves a significant performance improvement.

Another line of work somewhat related to our paper is the hierarchical sampling approach of Dasgupta & Hsu (2008) in the pool-based setting of active learning, further analyzed by Urner et al. (2013) and Kpotufe et al. (2015), where the learner receives as input a batch of unlabeled points to select from. However, it is important to stress that the methods proposed in those publications rely on (hierarchical) clusterability assumptions of the data that help save labels, while, here, we are more concerned with a model selection problem for active learning, where splitting the input space is likely to improve generalization rather than reduce label complexity.

In summary, we present an active learning algorithm, ARBAL, that adaptively partitions the input space and performs region-based active learning. Our theoretical results (Theorem 3 and Theorem 9) show that, remarkably, when the algorithm splits the input space into $K$ regions, modulo a standard term in $O(1/\sqrt{T})$ decreasing with the number of rounds $T$, the generalization error of ARBAL is close to $R^* - \gamma(K-1)$, where $R^*$ is the best-in-class error for the unpartitioned original input space and $\gamma > 0$ is a parameter of the algorithm. Thus, when at least one split is made by ARBAL ($K > 1$), then, for $T$ sufficiently large, the error of the algorithm is close to a quantity strictly smaller than the original best-in-class error!
Moreover, we show that, under mild theoretical assumptions, ARBAL indeed splits the original input space into multiple subregions (Proposition 4 and Corollary 5). Our experiments confirm that this almost always occurs (Section 5). This significant theoretical improvement over even the original best-in-class error is further corroborated by our extensive experimental study with 25 datasets where, in most cases, ARBAL achieves better performance than the best active learning algorithm working with the original single region.

The rest of this paper is structured as follows. In Section 2, we introduce the preliminaries relevant to our discussion and give a more formal definition of the learning scenario. In Section 3, we present our new learning algorithm, ARBAL, and justify its splitting criterion via theoretical guarantees. In Section 4, we provide generalization and label complexity bounds for ARBAL in terms of a key parameter of the splitting criterion and the number of regions in the partition. Moreover, in Section 4.2, we show that, under some natural assumptions about the data distribution, ARBAL benefits from a guaranteed improvement over IWAL (Beygelzimer et al., 2009). In Section 5, we report the results of a series of experiments on multiple datasets, demonstrating the substantial benefits of ARBAL over existing non-region-based active learning algorithms, such as IWAL and margin-based uncertainty sampling, and over the non-adaptive region-based active learning baseline ORIWAL (Cortes et al., 2019b).

## 2. Learning scenario

We now discuss the learning scenario, starting with some preliminary definitions. Let $X \subseteq \mathbb{R}^D$ denote the input space, $Y = \{-1, +1\}$ the output space, and $D$ an unknown distribution over $X \times Y$. We denote by $D_X$ the marginal distribution of $D$ over $X$ and, given a prediction space $Z \subseteq \mathbb{R}$, we denote by $\ell\colon Z \times Y \to [0, 1]$ a loss function, which we assume to be $\mu$-Lipschitz with respect to its first argument, for some constant $\mu > 0$. Let $H$ be a family of hypotheses consisting of functions mapping $X$ to $Z$. Then, the generalization error or expected loss of a hypothesis $h \in H$ is denoted by $R(h)$ and defined by $R(h) = \mathbb{E}_{(x,y)\sim D}[\ell(h(x), y)]$.

We consider the on-line setting of active learning where, at each round $t \in [T] = \{1, \ldots, T\}$, the learner receives as input a point $x_t \in X$ drawn i.i.d. according to $D_X$ and must decide whether or not to request its label $y_t$. The decision is final and cannot be retroactively changed. At the end of $T$ rounds, the learner returns a hypothesis $\hat h_T \in H$. In this setting, two conflicting quantities determine the performance of an on-line active learning algorithm: its label complexity, that is, the number of labels it has requested over $T$ rounds, and the generalization error $R(\hat h_T)$ of the hypothesis it returns.

In the standard case where the hypothesis set $H$ is given beforehand, the learner seeks a single best predictor from $H$. Here, we consider instead the setup where the algorithm adaptively partitions the input space $X$ into $K$ regions $X_1, \ldots, X_K$, each equipped with a copy of the hypothesis set $H$, with $K$ upper-bounded by some parameter $\kappa \ge 1$. Given the partition $X_1, \ldots, X_K$, the hypothesis $\hat h_T$ returned by the algorithm after $T$ rounds admits the following form: $\hat h_T(x) = \sum_{k=1}^{K} 1_{x \in X_k}\,\hat h_{k,T}(x)$, where $\hat h_{k,T}$ is the hypothesis chosen after $T$ rounds by the algorithm for region $X_k$.
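To make this region-combined form concrete, the following is a minimal Python sketch (not from the paper; the region and hypothesis representations are illustrative assumptions of mine) of how such a returned predictor evaluates a point by routing it to its region's hypothesis.

```python
import numpy as np

class RegionCombinedPredictor:
    """Predicts with a distinct hypothesis per region:
    h_T(x) = sum_k 1[x in X_k] * h_{k,T}(x).

    `regions` is a list of membership tests (x -> bool) forming a partition of X,
    `hypotheses` is the list of per-region predictors chosen by the algorithm.
    """

    def __init__(self, regions, hypotheses):
        assert len(regions) == len(hypotheses)
        self.regions = regions
        self.hypotheses = hypotheses

    def predict(self, x):
        # Exactly one region contains x, so the sum reduces to a lookup.
        for in_region, h in zip(self.regions, self.hypotheses):
            if in_region(x):
                return h(x)
        raise ValueError("x does not belong to any region; the regions must cover X")

# Example with two axis-aligned regions split at x[0] = 0 and linear hypotheses.
predictor = RegionCombinedPredictor(
    regions=[lambda x: x[0] <= 0.0, lambda x: x[0] > 0.0],
    hypotheses=[lambda x: float(np.dot([1.0, -0.5], x)),
                lambda x: float(np.dot([-0.3, 2.0], x))],
)
print(predictor.predict(np.array([0.5, 1.0])))
```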
Let $p_k = \mathbb{P}(X_k)$ denote the probability of region $X_k$ with respect to $D_X$, $k \in [K]$, and let $R_k(h)$ denote the conditional expected loss of a hypothesis $h$ on region $X_k$, that is, $R_k(h) = \mathbb{E}_{(x,y)\sim D}[\ell(h(x), y) \mid x \in X_k]$. By definition, we have $R(h) = \sum_{k=1}^{K} p_k R_k(h)$ for any hypothesis $h$. We assume the learner has access to large amounts of unlabeled data, which can be used to accurately estimate the $p_k$s. In fact, our results can be easily adapted to the case where the $p_k$s are estimated via a collection of unlabeled examples requested on-the-fly. While this would not add much to our analysis in terms of technical difficulty, it would make the entire theoretical treatment unnecessarily cluttered.

We denote by $h^* \in H$ the overall best-in-class hypothesis over $X$ (the single region before any splitting) and by $h_k^* \in H$ the $k$-th region's best-in-class hypothesis, that is, $h^* = \operatorname{argmin}_{h \in H} R(h)$ and $h_k^* = \operatorname{argmin}_{h \in H} R_k(h)$. We will also use the shorthand $R^* = R(h^*)$ and $R_k^* = R_k(h_k^*)$. Observe that minimizing the generalization error within each region $X_k$ individually is equivalent to minimizing the overall error over the larger set $H^{[K]} = \big\{\, x \mapsto \sum_{k=1}^{K} 1_{x \in X_k}\, h_k(x) : h_k \in H \,\big\}$. Clearly, the performance of the best predictor in $H^{[K]}$ is always at least as favorable as that of the best predictor in $H$, but it can be considerably better, especially when the local performance of the $h_k^*$s with large $p_k$ is substantially superior to that of $h^*$ on the same regions.

## 3. Algorithm

Our algorithm, called ARBAL (Adaptive Region-Based Active Learning), is an on-line active learning algorithm that adaptively partitions the input space into subregions. ARBAL adopts a label-requesting policy similar to that of the single-region IWAL algorithm of Beygelzimer et al. (2009), which is based on the largest possible disagreement among the current set of hypotheses on the input: at round $t$, given the hypothesis set $H_t$ and input point $x_t \in X$, IWAL flips a coin $Q_t \in \{0,1\}$ with bias $p_t = p(x_t)$ defined by
$$p_t = \max_{h, h' \in H_t,\, y \in Y} \big(\ell(h(x_t), y) - \ell(h'(x_t), y)\big).$$
If $Q_t = 1$, then the label of $x_t$ is requested and the algorithm receives $y_t$; otherwise no label is revealed. Since the loss function $\ell$ takes values in $[0, 1]$, the requesting probability $p_t \in [0, 1]$ is well defined. IWAL then seeks to shrink the current set $H_t$ to reduce the querying probability $p_t$ for future inputs, while, at the same time, keeping (with high probability) the overall best-in-class hypothesis $h^*$ in this set. At the end of $T$ rounds, IWAL returns the importance-weighted empirical risk minimizer $\hat h_T = \operatorname{argmin}_{h \in H_T} \sum_{t=1}^{T} \frac{Q_t}{p_t}\, \ell(h(x_t), y_t)$.

Our techniques and ideas for splitting are illustrated with IWAL, since IWAL works with any hypothesis set and bounded loss function, and admits generalization guarantees with no distributional assumptions. In contrast, CAL (Cohn et al., 1994) assumes the separable case; DHM (Dasgupta et al., 2008) and A2 (Balcan et al., 2006) are designed for the 0-1 loss, and many other margin-based algorithms only work for linear classifiers. Furthermore, for the separable case ($R^* = 0$), the recent work of Cortes et al. (2019b) proposes an enhanced version of IWAL, called EIWAL, whose label complexity is on the order of $\log\frac{|H|}{\epsilon}$, thereby matching the bounds of CAL and DHM. Nevertheless, our techniques can be applied to other algorithms, so long as they benefit from theoretical guarantees, such as Corollary 1 of (Dasgupta et al., 2008) or Theorem 1 of (Cortes et al., 2019b). In that case, our splitting criterion can be modified accordingly and our theoretical analysis straightforwardly adapted to such guarantees.
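As a concrete illustration, here is a minimal Python sketch (an assumption-laden simplification of mine, not the authors' code) of IWAL's disagreement-based query probability and the importance-weighted empirical risk it minimizes; a finite hypothesis pool and a generic bounded loss are assumed.

```python
import numpy as np

def query_probability(hypotheses, x, loss, labels=(-1, +1)):
    """IWAL bias p_t = max over h, h' in H_t and y of loss(h(x), y) - loss(h'(x), y).
    With a loss bounded in [0, 1], this equals the largest spread of losses over y."""
    p = 0.0
    for y in labels:
        losses = [loss(h(x), y) for h in hypotheses]
        p = max(p, max(losses) - min(losses))
    return p

def importance_weighted_risk(h, history, loss):
    """Unbiased risk estimate from the query history: sum of Q_t * loss / p_t, normalized."""
    total, n = 0.0, 0
    for (x, y, queried, p) in history:
        n += 1
        if queried:
            total += loss(h(x), y) / p
    return total / max(n, 1)

# Tiny usage example with two linear hypotheses and an illustratively rescaled logistic
# loss (the rescaling constant assumes |h(x)| <= 5, which is my own choice).
loss = lambda z, y: np.log1p(np.exp(-y * z)) / np.log1p(np.exp(5.0))
H = [lambda x: float(np.dot([1.0, 0.0], x)), lambda x: float(np.dot([0.0, 1.0], x))]
x_t = np.array([0.2, -0.9])
p_t = query_probability(H, x_t, loss)
queried = np.random.rand() < p_t  # flip the coin Q_t
```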
Our algorithm can be viewed as an adaptive region-based version of IWAL, where the label-requesting policy and the shrinking procedure of IWAL are applied at the regional level. As already mentioned, the following questions arise when designing the algorithm: (1) How should we determine the regions? (2) Can we learn to adaptively partition the input space into favorable subregions using actively requested labels? We now explicitly address both questions and describe our algorithm in detail. The pseudocode of ARBAL is given in Algorithm 1, and the pseudocode of its splitting subroutine, discussed in Section 3.1, is given in Algorithm 2.

    Algorithm 1: ARBAL(H, τ, κ, (γ_t)_{t ∈ [T]})
      K ← 1, X_1 ← X, H_1 ← H
      for t ∈ [T] do
        Observe x_t; set k_t ← k such that x_t ∈ X_k
        p_t ← max_{h, h' ∈ H_{k_t}, y ∈ Y} ℓ(h(x_t), y) − ℓ(h'(x_t), y)
        Q_t ← BERNOULLI(p_t)
        if Q_t = 1 then y_t ← LABEL(x_t)
        if t ≤ τ and K < κ then                       # split phase
          X_l, X_r ← SPLIT(X_{k_t}, γ_t)
          if split then
            K ← K + 1, X_{k_t} ← X_l, X_K ← X_r
            H_K ← H, H_{k_t} ← H
        else
          H_{k_t} ← UPDATE(H_{k_t})                   # IWAL phase
      return ĥ_T ← Σ_{k=1}^{K} 1_{x ∈ X_k} ĥ_{k,T}

    Algorithm 2: SPLIT(X_k, γ)
      for d ∈ [D] and c ∈ R do
        (X_l, X_r) ← REGSPLIT(X_k, d, c)
        γ_{d,c} ← p_k [ L_{k,t}(ĥ_{k,t}) − L_{k,t}(ĥ_{lr,t}) − (confidence term of order sqrt(σ_T / T_{k,t}); see Appendix C) ]
      (d*, c*) ← argmax_{d ∈ [D], c ∈ R} γ_{d,c}
      if γ_{d*,c*} ≥ γ then
        X_l* ← {x ∈ X_k : x[d*] ≤ c*}                 # split
        X_r* ← {x ∈ X_k : x[d*] > c*}
        return X_l*, X_r*
      else
        return ∅                                      # no split

The algorithm admits two phases. In the first phase (split phase), the algorithm partitions the input space into $K$ disjoint regions while actively requesting labels according to IWAL's policy at the regional level. This phase is constrained by two input parameters: $\kappa$ limits the maximum number of regions generated ($K \le \kappa$), and $\tau$ caps the maximal number of on-line rounds of this phase. Section 3.1 describes in detail the main subroutine of this phase, SPLIT (Algorithm 2), including the splitting conditions that guarantee a significant improvement in generalization ability resulting from a split. Whenever the algorithm decides to split, each resulting region is equipped with a copy of the original hypothesis set $H$. Notice that the algorithm actively selects labels in this phase, even though it does not shrink the hypothesis set(s), and thus it still requests fewer labels than passive learning.

Thus far, we have not made any assumption about the definition of the subregions. Our splitting criterion is agnostic to their shape, so any hierarchical partitioning method could be used in the splitting phase. For simplicity, we adopt the axis-aligned splitting method commonly used for (binary) decision trees, though more elaborate splitting shapes are clearly possible (see Section 5).

In the second phase (IWAL phase), ARBAL runs IWAL separately on each of the regions defined by the first phase, to learn a good predictor per region. After $T$ rounds, ARBAL returns $\hat h_T$, which combines the region-specific importance-weighted empirical risk minimizers $\hat h_{k,T}$. In Section 3.2, we describe the IWAL phase and discuss its connections to ORIWAL (Cortes et al., 2019b).

One may ask why we break up the learning horizon into two phases, where we first determine the partition and then proceed with region-based active learning. Given all possible partitions of the input space, why not run IWAL with the family of hypotheses containing all possible partitions of $X$ with leaf predictors $h_k \in H$, that is, $\bar H = \big\{ \sum_{k=1}^{\kappa} 1_{x \in X_k} h_k : h_k \in H,\ \cup_{k=1}^{\kappa} X_k = X,\ X_k \cap X_{k'} = \emptyset \text{ for } k \ne k' \big\}$?
First, $\bar H$ is an exceedingly complex hypothesis set, whose complexity can lead to vacuous learning guarantees. Second, its computational cost makes it prohibitive to use with IWAL. Moreover, even if we fix the partition and only vary the predictors in the leaf nodes, as proven in Appendix B, running IWAL with $\bar H$ may cost up to $\kappa$ times more labels than running IWAL separately within each partitioned region. For all these reasons, we adopt the two-phase learning framework just described.

### 3.1. SPLIT phase

The advantage of region-based learning hinges on the improvement in the best-in-class error after each split, which motivates our splitting subroutine: SPLIT splits a region if and only if the best-in-class error is likely to improve by a strictly positive amount. We will show in Corollary 2 that, with high probability, the best-in-class error is guaranteed to decrease with each split.

The pseudocode of SPLIT is given in Algorithm 2. At time $t$, SPLIT searches for the most favorable choice of the splitting parameters $(d, c)$ as follows: for a fixed pair $(d, c)$, the algorithm calls the subroutine REGSPLIT$(X_k, d, c)$ to split $X_k$ into a left region $X_l$ ($x_d \le c$) and a right region $X_r$ ($x_d > c$), and then computes a confidence gap $\gamma_{d,c}$ as defined in Algorithm 2, where $L_{k,t}(h)$ denotes the importance-weighted empirical risk of hypothesis $h$ on region $X_k$,
$$L_{k,t}(h) = \frac{1}{T_{k,t}} \sum_{s \in [t]\colon x_s \in X_k} \frac{Q_s}{p_s}\, \ell(h(x_s), y_s),$$
where $T_{k,t} = |\{s \in [t] \colon x_s \in X_k\}|$ is the number of samples observed in region $X_k$ up to time $t$, and $\sigma_T = \kappa D \log\big[8 T^3 |H|^3 \kappa D\big]$ denotes the slack term. Furthermore, $\hat h_{k,t} = \operatorname{argmin}_{h \in H} L_{k,t}(h)$ denotes the empirical risk minimizer (ERM) on $X_k$. Similarly, $\hat h_{l,t}$ and $\hat h_{r,t}$ denote the ERMs of regions $X_l$ and $X_r$, respectively, and $\hat h_{lr,t} = 1_{x \in X_l}\,\hat h_{l,t} + 1_{x \in X_r}\,\hat h_{r,t}$ is the combination of the two region-specific ERMs. The confidence gap $\gamma_{d,c}$ serves as a conservative estimate of the improvement in the best-in-class error. SPLIT searches for the maximum confidence gap over all distinct pairs: $(d^*, c^*) = \operatorname{argmax}_{d,c}\, \gamma_{d,c}$. When $\gamma_{d^*,c^*}$ is larger than the pre-specified threshold parameter $\gamma$, it splits at $(d^*, c^*)$ and allocates to the two newly created regions the same initial hypothesis set $H$ (see Algorithm 1); otherwise it does not split.

To implement the SPLIT subroutine, we maintain an array of region labels of past samples, and $D$ sorted arrays of past samples, one per coordinate. At time $t$, for each coordinate $d \in [D]$, it takes $O(\log t)$ to insert $x_t$ into the sorted array, and $O(t)$ to compute the key term $L_{k,t}$ for all $t+1$ splitting thresholds on the sorted array. Here, we use the fact that, although there are infinitely many possible splitting threshold values, we only need to consider $t+1$ thresholds to distinguish the $t$ feature values $\{x_{s,d} : s \in [t]\}$. It also takes $O(t)$ to update the region labels of past samples after a split, and thus a total of $O(tD)$ to run SPLIT at time $t$. Furthermore, as already mentioned in Section 2, we assume access to a set $U$ of unlabeled samples to estimate all the $p_k$s. To do so, we maintain a binary tree corresponding to the splits. A new split converts a leaf node $u_i$ with $|u_i| \le |U|$ elements into an internal node at a cost of $O(|u_i|)$. The cost of updating the tree for the $\kappa$ splits in order to estimate all the $p_k$s is thus $O(\sum_i |u_i|)$, where the sum is over all the internal nodes of the tree. Alternatively, these probabilities can be estimated incrementally during the on-line execution of the algorithm, and our theoretical analysis can be extended along these lines using a union bound similar to the one invoked in the proof of Lemma 1.
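To illustrate the threshold search performed by SPLIT, here is a simplified Python sketch of my own (not the paper's implementation): it scans axis-aligned thresholds on one coordinate and computes the empirical gap between the single-region ERM risk and the combined left/right ERM risk. The hypothesis pool, loss, and `conf` argument are placeholder assumptions; the actual criterion also multiplies by the region probability $p_k$ and subtracts a $\sigma_T$-based confidence term, for which `conf` only stands in.

```python
import numpy as np

def iw_risk(h, xs, ys, qs, ps, loss):
    """Importance-weighted empirical risk: (1/n) * sum of Q_s / p_s * loss(h(x_s), y_s)."""
    n = len(xs)
    return sum(q / p * loss(h(x), y) for x, y, q, p in zip(xs, ys, qs, ps)) / max(n, 1)

def best_axis_split(xs, ys, qs, ps, H, loss, d, conf=0.0):
    """Scan thresholds on coordinate d; return (best_gap, best_threshold).

    gap(c) = min_h L_region(h) - combined ERM risk of the left/right split,
    minus a confidence term `conf` standing in for the sqrt slack of SPLIT.
    """
    erm_risk = min(iw_risk(h, xs, ys, qs, ps, loss) for h in H)
    best_gap, best_c = -np.inf, None
    for c in sorted({x[d] for x in xs}):
        left = [i for i, x in enumerate(xs) if x[d] <= c]
        right = [i for i, x in enumerate(xs) if x[d] > c]
        if not left or not right:
            continue
        sub = lambda idx: ([xs[i] for i in idx], [ys[i] for i in idx],
                           [qs[i] for i in idx], [ps[i] for i in idx])
        lx, ly, lq, lp = sub(left)
        rx, ry, rq, rp = sub(right)
        # Combined ERM risk of the split, weighted by each side's empirical share.
        risk_l = min(iw_risk(h, lx, ly, lq, lp, loss) for h in H)
        risk_r = min(iw_risk(h, rx, ry, rq, rp, loss) for h in H)
        split_risk = (len(left) * risk_l + len(right) * risk_r) / len(xs)
        gap = erm_risk - split_risk - conf
        if gap > best_gap:
            best_gap, best_c = gap, c
    return best_gap, best_c
```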
We now introduce some additional notation before discussing the theoretical guarantees of the SPLIT subroutine. Let $h_k^*$, $h_l^*$, $h_r^*$ be the best-in-class predictors on regions $X_k$, $X_l$, and $X_r$, respectively, and let $h_{lr}^* = 1_{x \in X_l}\, h_l^* + 1_{x \in X_r}\, h_r^*$. Then, the improvement in the best-in-class error resulting from this split is $p_k\big[R_k(h_k^*) - R_k(h_{lr}^*)\big]$. The following concentration lemma relates the improvement in the best-in-class error to its empirical counterpart, which leads to the theoretical guarantee for the SPLIT subroutine (Corollary 2). Its proof uses a martingale concentration bound, as well as covering number techniques, to guarantee that the high-probability bound holds uniformly over any possible sequence of splits. The proof is given in Appendix C.

**Lemma 1.** With probability at least $1 - \delta/4$, for all binary trees with (at most) $\kappa$ leaf nodes, the improvement in the minimal empirical error due to splitting concentrates around the improvement in the best-in-class error:
$$R_k(h_k^*) - R_k(h_{lr}^*) \;\ge\; L_{k,t}(\hat h_{k,t}) - L_{k,t}(\hat h_{lr,t}) \;-\; O\big(\sqrt{\sigma_T / T_{k,t}}\big),$$
where the exact form of the square-root slack term is given in Appendix C.

**Corollary 2.** With probability at least $1 - \delta/4$, for all splits made by ARBAL, the improvement in the best-in-class error is at least $\gamma_t$, where $\gamma_t$ is the threshold at the split time.

Corollary 2 guarantees that, with high probability, whenever ARBAL splits, the best-in-class error is strictly improved, by at least $\gamma_t > 0$. This is the fundamental advantage of region-based learning. One challenge ARBAL faces is that, whenever it chooses to split, it commits to competing against a more accurate predictor, namely the region-specific best-in-class hypothesis on the refined regions. To ensure success, we need to guarantee not only that the best-in-class hypothesis over the current region, and those over the subregions after the split, are not pruned out, but also that the best-in-class hypothesis over any future region produced by further splitting remains in the hypothesis space that will be given as input to ARBAL's second phase. One can show that, if ARBAL prunes out some hypotheses before the split phase has ended, it may lose the future best-in-class predictor, and thus fail dramatically.

Figure 1: The input space $X$ is a (2-dimensional) square. Left: the true classification boundary (in red) as a function of $x_1$ and $x_2$. Middle: (approximately) the current hypothesis space (shaded blue area) after trimming an initial set of linear separators given several labeled samples. Right: the best-in-class predictors (in blue) when the input space $X$ is split into two regions $X_l$ and $X_r$ at the midpoint of $x_1$.

As a simple example, consider the binary classification problem depicted in Figure 1, where the unlabeled data is uniformly distributed within a square and the true classification boundary has a zig-zag shape (left plot of Figure 1). If the learner uses the class of linear separators as the initial hypothesis set $H$, then, after receiving a certain number of labeled samples, it finds that the best performing hypothesis is approximately the diagonal separator from bottom left to top right. Suppose the algorithm were now to trim $H$ to maintain only separators performing similarly to the diagonal separator, with decision surfaces indicated by the shaded area in the middle plot of Figure 1.
If, later on, the learner splits the input space (the square) into two regions (left and right rectangles in the right plot of Figure 1), then the best-in-class separators for the two rectangles are horizontal separators. Clearly, the two horizontal best-in-class separators are not contained in the current $H$ (which is meant to apply to the entire input space). In summary, trimming $H$ before making splits introduces the risk of losing the best-in-class separators on the partitioned regions. This is the reason why ARBAL maintains the original hypothesis space $H$ throughout the split phase. The shrinkage of $H$ only takes place during the IWAL phase, presented next.

### 3.2. IWAL phase

In this phase, with the regions $X_1, \ldots, X_K$ fixed, ARBAL runs a separate IWAL subroutine on each of them, requesting labels and reducing the hypothesis space from $H$ to a region-specific $H_k$, $k \in [K]$. As the algorithm requests labels, $H_k$ shrinks towards the best-in-class hypothesis on region $X_k$. The hypothesis space is updated according to the IWAL update rule, which is derived from a concentration bound. Specifically, we update the hypothesis space $H_{k,t}$ sitting on region $X_k$ at time $t$ by
$$H_{k,t} = \Big\{ h \in H_{k,t-1} : L_{k,t}(h) \le \min_{h' \in H_{k,t-1}} L_{k,t}(h') + \Delta_{k,t} \Big\},$$
where $\Delta_{k,t}$ denotes the square-root confidence radius derived from the concentration bound (its exact form is given in Appendix A). To summarize, in this phase ARBAL freezes the regions $X_1, \ldots, X_K$, allowing no further splits, and requests labels and shrinks the set of hypotheses hosted by each such region.

Starting with a fixed partition makes the second phase of ARBAL very similar to the learning scenario recently investigated by Cortes et al. (2019b), who proposed the ORIWAL algorithm for it. In particular, during the second phase, we could also run the ORIWAL algorithm to achieve an additional improvement in generalization error. Since ORIWAL is orthogonal to the main contribution of this paper, we do not detail this further here.
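The following Python sketch illustrates this shrinking step under simplifying assumptions of mine (a finite hypothesis pool and an externally supplied confidence radius); it is not the authors' implementation, only a picture of the update rule above.

```python
def shrink_hypothesis_set(H_prev, region_history, loss, radius):
    """Keep only hypotheses whose importance-weighted empirical risk on this region
    is within `radius` of the best one: an IWAL-style version-space update.

    `region_history` is a list of (x, y, queried, p) tuples observed in the region;
    `radius` stands in for the confidence term of the concentration bound.
    """
    def iw_risk(h):
        n = len(region_history)
        total = sum(loss(h(x), y) / p for (x, y, queried, p) in region_history if queried)
        return total / max(n, 1)

    risks = {h: iw_risk(h) for h in H_prev}
    best = min(risks.values())
    return [h for h in H_prev if risks[h] <= best + radius]
```

In ARBAL's second phase, an update of this kind is applied per region, with the confidence radius shrinking as more labels are collected in that region.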
## 4. Theoretical analysis

In this section, we present generalization error and label complexity guarantees for the ARBAL algorithm. We first need some definitions and concepts from (Beygelzimer et al., 2009). Define the distance $\rho(f, g)$ between two hypotheses $f, g \in H$ as $\rho(f, g) = \mathbb{E}_{(x,y)\sim D}\big[|\ell(f(x), y) - \ell(g(x), y)|\big]$. (This definition of $\rho(f,g)$ slightly differs from the original definition in (Beygelzimer et al., 2009), and it improves the label complexity bound of IWAL by a constant; see Appendix A for more details.) The generalized disagreement coefficient $\theta(D, H)$ of a class of functions $H$ with respect to distribution $D$ is defined as the minimum value of $\theta$ such that, for all $r > 0$,
$$\mathbb{E}_{x \sim D_X}\Big[\sup_{h \in H\colon \rho(h, h^*) \le r}\ \sup_{y \in Y}\ \big|\ell(h(x), y) - \ell(h^*(x), y)\big|\Big] \le \theta\, r.$$
Since ARBAL calls IWAL as a subroutine, the theoretical results for ARBAL directly depend on those of IWAL, which are summarized in Theorem 6 in Appendix A. Recall the split threshold $\gamma$ in Algorithm 2, which is the minimum value of the confidence gap $\gamma_{d,c}$ that allows ARBAL to split a region. We discuss ARBAL under two settings: using a fixed threshold $\gamma$, and using a time-varying, data-dependent adaptive threshold $\gamma_t$.

### 4.1. ARBAL with a fixed γ

Suppose we run ARBAL with a fixed threshold $\gamma$. The label complexity of the algorithm depends on the region-based disagreement coefficients $\theta_k = \theta(D_k, H)$, where $D_k = D|_{X_k}$ is the conditional distribution given $x \in X_k$. Let $\theta_{\max} = \max_{k \in [K]} \theta_k$ denote the maximum disagreement coefficient across regions, and let $r_0 = \max_{h \in H} \rho(h, h^*)$. Let $F_t$ denote the σ-algebra generated by $(x_1, y_1, Q_1), \ldots, (x_t, y_t, Q_t)$.

**Theorem 3.** Assume that a run of ARBAL over $T$ rounds has split the input space into $K$ regions. Then, for any $\delta > 0$, with probability at least $1 - \delta$, the following inequality holds:
$$R(\hat h_T) \le R_U + O\big(\sqrt{\sigma_T / T}\big),$$
where $R_U = R^* - \gamma(K - 1)$ is an upper bound on the best-in-class error obtained by ARBAL. Moreover, with probability at least $1 - \delta$, the expected number of labels requested, $\tau_T = \sum_{t=1}^{T} \mathbb{E}_{x_t \sim D_X}\big[p_t \mid F_{t-1}\big]$, satisfies
$$\tau_T \le \min\{2\theta r_0,\, 1\}\,\tau + 4\,\theta_{\max}\,(T - \tau)\,\big[R_U + \cdots\big],$$
where the omitted term inside the brackets is a lower-order slack term given explicitly in Appendix C.

The proof is given in Appendix C. It combines the learning guarantee of IWAL (Theorem 6) with that of splitting (Corollary 2). Theorem 3 shows that, with high probability, the generalization error of the hypothesis returned by ARBAL is close to $R_U = R^* - \gamma(K-1)$, which is more favorable than the single-region best-in-class error $R^*$ by $\gamma(K-1)$. We will show later that, under natural assumptions, with high probability, there is at least one split, which implies $\gamma(K-1) > 0$ (Proposition 4). Furthermore, the reduction in the best-in-class error also improves the label complexity: when $T \gg \tau$, the label complexity of ARBAL is $O(R_U\, T)$, compared with IWAL's $O(R^*\, T)$. In practice, we set $\gamma = \Omega\big(\sqrt{\sigma_T / T}\big)$ to ensure that the generalization bound of Theorem 3 is more favorable than the generalization bound of IWAL (Theorem 6). We give more details on how to set this fixed $\gamma$ in Appendix C (see the comments following the proof of Theorem 3).

There is a critical trade-off when determining the key parameters $\tau$ and $\kappa$. With larger $\tau$ and $\kappa$, ARBAL is likely to split into more regions and thus admits a smaller $R_U$. On the other hand, a larger $\tau$ means a longer split phase, during which ARBAL requests labels more often than the original IWAL algorithm, since ARBAL does not shrink the hypothesis set $H$ in this phase. Furthermore, a larger $\kappa$ yields a larger $\sigma_T$, which slightly affects the generalization error. Nevertheless, our experimental results show that larger values of $\tau$ and $\kappa$ almost always improve the final excess risk, at the expense of a higher computational cost.

### 4.2. ARBAL with adaptive γt

The learning guarantees of Theorem 3 depend on the number of regions $K$ defined by the algorithm. Given any fixed value of $\gamma$, however, there is no guarantee on the number of times ARBAL splits within the first $\tau$ rounds (the duration of the first phase). In the worst case, when $K = 1$, ARBAL offers no improvement over IWAL, yet it requests more labels than IWAL during the initial $\tau$ rounds. In this section, we show that, by adopting a time-varying, data-dependent splitting threshold $\gamma_t$, we can enable SPLIT to split more often, and thus achieve an enhanced performance guarantee. To do so, we make additional assumptions on the potential gain of splitting. Let $X_k$ be an intermediate region created during the split phase, possibly the original input space $X$. Assume that, for any such $X_k$, there exists at least one way of splitting $X_k$ into $X_l \cup X_r$ such that the improvement in the conditional best-in-class error is at least $\rho$: $R_k(h_k^*) - R_k(h_{lr}^*) \ge \rho$, where $\rho > 0$ is a positive constant. With this assumption, we can derive upper bounds on the time at which ARBAL splits when run with the time-varying adaptive threshold $\gamma_t = \mathbb{P}(X_{k_t})\,\rho/2$.
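As a small illustration (my own sketch, not from the paper), the adaptive threshold only requires the estimated probability mass of the region currently being considered and the assumed per-split gain $\rho$:

```python
def adaptive_split_threshold(region_mass_estimate: float, rho: float) -> float:
    """Adaptive threshold gamma_t = P(X_{k_t}) * rho / 2, with P(X_{k_t}) estimated
    as the fraction of the unlabeled pool falling in the region."""
    return 0.5 * region_mass_estimate * rho

# Example: a region holding ~30% of the unlabeled pool, with assumed gain rho = 0.01.
gamma_t = adaptive_split_threshold(0.30, 0.01)  # -> 0.0015
```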
**Proposition 4.** Let ARBAL be run with $\gamma_t = \rho\,\mathbb{P}(X_{k_t})/2$. Then, for any $\delta > 0$, with probability at least $1 - \delta/2$, the first split occurs before round $\big\lceil 2\sigma_T \big(\tfrac{4}{\rho} + 1\big)^2 \big\rceil$.

Thus, when $\tau \ge 2\sigma_T \big(\tfrac{4}{\rho} + 1\big)^2$, with high probability, ARBAL will split $X$ for the first time and reduce the best-in-class error by at least $\rho/2$, according to Proposition 4 and Corollary 2. In Appendix C, we prove a more general version (Lemma 11) that upper bounds the time of split for all regions created during the split phase. If we further assume that any split yielding at least a $\rho$ improvement in the conditional best-in-class error results in regions that are not too small, i.e., $\min\{\mathbb{P}(X_l), \mathbb{P}(X_r)\} \ge c\,\mathbb{P}(X_k)$ with $0 < c < 0.5$, then we can also prove a lower bound on the number of splits.

**Corollary 5.** Let ARBAL be run with $\gamma_t = \mathbb{P}(X_{k_t})\,\rho/2$. Then, with probability at least $1 - \delta/2$, ARBAL splits more than $\min\Big\{ \log_{1/c}\Big[\frac{\tau}{2\sigma_T(4/\rho + 1)^2}\Big],\ \kappa - 1 \Big\}$ times by the end of the split phase.

Corollary 5 gives the minimal number of splits under the assumptions made in this section. It states that, as the duration $\tau$ of the split phase increases, or as the conditional improvement $\rho$ increases, or as the minimal subregion proportion $c$ increases, ARBAL tends to make more splits and therefore achieves a better generalization guarantee. The lower bound in Corollary 5 tends to be loose, as it assumes that ARBAL keeps splitting the smallest region, which is unlikely to be the case in practice. In Appendix C, we combine Proposition 4 and Corollary 5 to give an upper bound on the final best-in-class error after the splits made by ARBAL. Note that the true value of $\rho$ is a property of the underlying distribution, and accurately estimating $\rho$ is an open question beyond the scope of this paper. One practical solution is to explore values of $\rho$ over several orders of magnitude, for instance $(0.1, 0.01, 0.001)$, so that ARBAL makes a reasonable number of splits. We set $\rho = 0.01$ in our experiments.

## 5. Experiments

In this section, we report the results of a series of experiments. We tested 24 binary classification datasets from the UCI and openml repositories, as well as the MNIST dataset with 3 and 5 as the two classes, a standard binary classification task extracted from MNIST (e.g., (Crammer et al., 2009)). Table 1 in Appendix D lists summary statistics for these datasets. For ease of experimental comparison, for datasets with large input dimension $D$, we followed the preprocessing step of (Cortes et al., 2019b), retaining only the first 10 principal components of the original feature vectors. Due to space limitations, in this section we show the results on several medium-sized datasets; the results for the remaining datasets are provided in Appendix D.

For each experiment, we randomly shuffled the dataset, ran the algorithms on the first half of the data (so that the number of active learning rounds $T$ equals $N/2$), and tested the returned classifier on the remaining half to measure the misclassification loss. We only show results for the first $10^{3.5} \approx 3000$ requested labels, which are enough to differentiate the performance of the various algorithms. We repeated this process 50 times on each dataset, and report average results with standard errors across the 50 repetitions. We use the logistic loss function $\ell$ defined for all $(x, y) \in X \times Y$ and hypotheses $h\colon X \to \mathbb{R}$ by $\ell(h(x), y) = \log(1 + e^{-y h(x)})$, which we then rescale to $[0, 1]$. The initial hypothesis set $H$ consists of 3,000 randomly drawn hyperplanes with bounded norms.
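To make this setup concrete, here is a minimal sketch of such a hypothesis pool and rescaled loss, under my own assumptions about the norm bound and the rescaling constant (the paper does not specify them):

```python
import numpy as np

rng = np.random.default_rng(0)

def random_hyperplane_pool(dim, n_hypotheses=3000, norm_bound=1.0):
    """Draw random weight vectors with bounded norm; each row defines h(x) = <w, x>."""
    W = rng.normal(size=(n_hypotheses, dim))
    W *= norm_bound / np.linalg.norm(W, axis=1, keepdims=True)
    return W

def rescaled_logistic_loss(score, y, score_bound=1.0):
    """Logistic loss log(1 + exp(-y * score)), rescaled to [0, 1] assuming |score| <= score_bound."""
    max_loss = np.log1p(np.exp(score_bound))  # worst case attained at score = -y * score_bound
    return np.log1p(np.exp(-y * score)) / max_loss

# Usage: losses of all pool hypotheses on one labeled example (x, y).
x, y = rng.normal(size=10), +1
W = random_hyperplane_pool(dim=10)
losses = rescaled_logistic_loss(W @ x, y)
```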
As mentioned in Section 4.1, larger values of $\tau$ and $\kappa$ almost always yield a better final excess risk. We chose $\kappa = 20$ and allowed the first phase to run for at most $\tau = 800$ rounds, so as to make ARBAL fully split into the desired number of regions on almost all datasets. Since slack terms such as $\sigma_T$ derived from high-probability analyses are typically overly conservative, we simply use $0.01/\sqrt{T_k}$ in the SPLIT subroutine.

**ARBAL with fixed or adaptive γ.** We first compare ARBAL with a fixed $\gamma$ to ARBAL with an adaptive $\gamma_t$. Figure 2 plots the misclassification loss versus the number of labels requested on four datasets. The vertical lines indicate the label counts at which ARBAL transitions from the first to the second phase, and the legends give the average number of resulting regions $K$ each variant produces. As one can see, adaptive $\gamma_t$ tends to split into more regions and to exit the split phase earlier, and hence often results in superior prediction performance over fixed $\gamma$. Thus, in the rest of this section, we show the performance of adaptive $\gamma_t$. Results on other datasets (see Appendix D) show similar patterns. During the active learning split phase, even though ARBAL does not shrink the hypothesis set(s), both versions are observed to request labels in only 50%-90% of the rounds, which is far less than passive learning.

Figure 2: Misclassification loss of ARBAL with fixed and adaptive threshold $\gamma$ on held-out test data vs. number of labels requested (log10 scale), with $\kappa = 20$ and $\tau = 800$, on the egg, magic04, house16H, and nomao datasets. The vertical lines indicate the end of the first (split) phase.

**ARBAL vs. ORIWAL.** Since the key idea of ARBAL is its informed adaptive splitting criterion, we compare ARBAL with the ORIWAL algorithm of Cortes et al. (2019b), a non-adaptive splitting algorithm that first randomly generates $\kappa$ regions and then runs region-based active learning on these regions. The regions of ORIWAL are obtained from the terminal nodes of random binary trees, that is, binary trees with random splitting coordinates and thresholds (hence, they are axis-aligned rectangles, as for ARBAL).

Figure 3: Misclassification loss of ARBAL (with adaptive $\gamma_t$), ORIWAL, IWAL, and MARGIN on held-out test data vs. number of labels requested (log10 scale), with $\kappa = 20$ and $\tau = 800$, on the egg, magic04, house16H, nomao, fried, mv, shuttle, and electricity datasets. The ARBAL curves are repeated from Figure 2.
Figure 3 compares ARBAL with adaptive $\gamma_t$ to ORIWAL, and shows that ARBAL quickly takes over (recall that the x axis is on a log scale) and performs substantially better than ORIWAL on the eight datasets covered by Figure 3. Results on the other datasets show similar patterns, even though ORIWAL sometimes uses more regions than ARBAL, since ARBAL may not always fully split into $\kappa$ regions. These results empirically verify the advantage of an adaptive splitting criterion.

**ARBAL vs. non-splitting baselines.** We also compare ARBAL with the single-region IWAL algorithm, and with the single-region MARGIN algorithm, a standard uncertainty sampling algorithm that requests the label of the point closest to the decision boundary of the current empirical risk minimizer (note that MARGIN runs in a pool-based setting and thus sees more information than on-line algorithms). Figure 3 shows that MARGIN is a strong baseline that outperforms IWAL on almost all the datasets, sometimes even ORIWAL (e.g., house16H), but ARBAL remains more favorable than MARGIN. The difference in errors observed in these plots after consuming much of the sample is essentially due to the difference between the split-region and single-region best-in-class errors, that is, $R_U$ vs. $R^*$, which further corroborates our theory. The results for most other datasets show similar patterns. In fact, ARBAL can also be used with the MARGIN algorithm as a subroutine, which is likely to lead to even better performance; but, as with the MARGIN algorithm itself, that extension would not benefit from any general theoretical guarantee and might actually underperform in some instances where the MARGIN technique can fail.

Finally, as mentioned in Section 3.1, ARBAL is agnostic to the shape of the subregions. For instance, we can split a region via an arbitrary separating hyperplane, or via hierarchical clustering, that is, by determining two new centers and assigning points to the closest center. In Appendix D, we compare the axis-aligned binary tree splitting method with hierarchical clustering splitting, using adaptive $\gamma_t$. Our results suggest that, for most datasets, splitting via binary trees is more favorable than splitting via hierarchical clustering.

## 6. Conclusion

We presented a novel algorithm for adaptive region-based active learning, and proved that it benefits from favorable generalization and label complexity guarantees. We also studied the extent to which splitting the input space is likely to lead to improved prediction performance. We complemented our theoretical findings by reporting the results of several experiments with our algorithm on standard benchmarks. Our extensive experiments demonstrate substantial performance improvements over existing active learning algorithms, such as IWAL and margin-based uncertainty sampling, as well as over region-based baselines that do not rely on adaptive splitting of the input space. Our techniques were showcased through IWAL-like algorithms (Beygelzimer et al., 2009; Cortes et al., 2019b), but they can be straightforwardly combined with other base active learning algorithms, such as the DHM algorithm of Dasgupta et al. (2008), achieving similar generalization and label complexity guarantees. They can also be combined with the margin algorithm, resulting in even more substantial improvements in practice, as with the combination of ORIWAL with the margin algorithm in (Chuang et al., 2019).
However, such algorithms currently do not benefit from proven theoretical guarantees, due to the lack of general guarantees for the margin algorithm. Altogether, our theory, algorithms, and empirical results provide a new, promising solution to active learning, with very important practical benefits. These results also suggest further investigation of the general idea of adaptively refining and enriching the hypothesis set for active learning. From the theoretical standpoint, this work illustrates a significant specific instance of model selection in active learning that benefits from generalization and label complexity guarantees.

Our ARBAL algorithm uses a greedy splitting criterion that maximizes the benefit of the current split, without considering the potential benefit of subsequent splits. One direction for future research is to devise a less greedy, but perhaps computationally more intensive, splitting algorithm, possibly via lookahead search techniques, that yields an even better partitioning with a smaller best-in-class error. Both IWAL and ARBAL can be extended to multi-class classification tasks with a bounded loss, such as a margin loss, and enjoy similar learning guarantees. When the number of classes is very large, however, the current algorithm may not be sufficiently efficient; this is a topic of active research, as in standard multi-class classification and structured prediction theory. Our algorithm and techniques can also be extended to other learning tasks, such as regression and ranking, using similar methods.

### Acknowledgements

This work was partly supported by NSF CCF-1535987, NSF IIS-1618662, and a Google Research Award. Much of NZ's research was done during her Ph.D. studies at New York University. We thank the anonymous reviewers for their helpful comments and suggestions.

## References

Awasthi, P., Balcan, M.-F., Haghtalab, N., and Urner, R. Efficient learning of linear separators under bounded noise. In Proceedings of COLT, pp. 167-190, 2015.

Balcan, M.-F. and Long, P. Active and passive learning of linear separators under log-concave distributions. In Proceedings of COLT, pp. 288-316, 2013.

Balcan, M.-F., Beygelzimer, A., and Langford, J. Agnostic active learning. In Proceedings of ICML, 2006.

Balcan, M.-F., Broder, A., and Zhang, T. Margin based active learning. In International Conference on Computational Learning Theory, pp. 35-50. Springer, 2007.

Beygelzimer, A., Dasgupta, S., and Langford, J. Importance weighted active learning. In Proceedings of ICML, pp. 49-56. ACM, 2009.

Beygelzimer, A., Hsu, D. J., Langford, J., and Zhang, T. Agnostic active learning without constraints. In Proceedings of NIPS, pp. 199-207, 2010.

Castro, R. M. and Nowak, R. D. Minimax bounds for active learning. IEEE Transactions on Information Theory, 54(5):2339-2353, 2008.

Chuang, G., De Salvo, G., Karydas, L., Kagy, J., Rostamizadeh, A., and Theeraphol, A. Active learning empirical study. In NeurIPS 2019 LIRE Workshop, 2019.

Cohn, D., Atlas, L., and Ladner, R. Improving generalization with active learning. Machine Learning, 15(2):201-221, 1994.

Cortes, C., De Salvo, G., Gentile, C., Mohri, M., and Zhang, N. Active learning with disagreement graphs. In Proceedings of ICML, 2019a.

Cortes, C., De Salvo, G., Gentile, C., Mohri, M., and Zhang, N. Region-based active learning. In Proceedings of AISTATS, 2019b.

Crammer, K., Kulesza, A., and Dredze, M. Adaptive regularization of weight vectors. In Proceedings of NIPS, 2009.

Dasgupta, S. Analysis of a greedy active learning strategy.
In Advances in Neural Information Processing Systems, pp. 337-344, 2004.

Dasgupta, S. Coarse sample complexity bounds for active learning. In Proceedings of NIPS, pp. 235-242, 2006.

Dasgupta, S. Two faces of active learning. Theoretical Computer Science, 412(19):1767-1781, 2011.

Dasgupta, S. and Hsu, D. Hierarchical sampling for active learning. In Proceedings of ICML, pp. 208-215. ACM, 2008.

Dasgupta, S., Kalai, A. T., and Monteleoni, C. Analysis of perceptron-based active learning. In International Conference on Computational Learning Theory, pp. 249-263. Springer, 2005.

Dasgupta, S., Hsu, D. J., and Monteleoni, C. A general agnostic active learning algorithm. In Proceedings of NIPS, pp. 353-360, 2008.

Freund, Y., Seung, H. S., Shamir, E., and Tishby, N. Selective sampling using the query by committee algorithm. Machine Learning, 28(2-3):133-168, 1997.

Golovin, D. and Krause, A. Adaptive submodularity: A new approach to active learning and stochastic optimization. arXiv:1003.3967, 2017.

Hanneke, S. A bound on the label complexity of agnostic active learning. In Proceedings of ICML, pp. 353-360. ACM, 2007.

Hanneke, S. Theory of disagreement-based active learning. Foundations and Trends in Machine Learning, 7(2-3):131-309, 2014.

Hanneke, S. and Yang, L. Minimax analysis of active learning. The Journal of Machine Learning Research, 16(1):3487-3602, 2015.

Huang, T.-K., Agarwal, A., Hsu, D., Langford, J., and Schapire, R. E. Efficient and parsimonious agnostic active learning. In Proceedings of NIPS, 2015.

Koltchinskii, V. Rademacher complexities and bounding the excess risk in active learning. Journal of Machine Learning Research, 11:2457-2485, 2010.

Kpotufe, S., Urner, R., and Ben-David, S. Hierarchical label queries with data-dependent partitions. In Proceedings of COLT, pp. 1176-1189, 2015.

Mussmann, S. and Liang, P. On the relationship between data efficiency and error for uncertainty sampling. In Proceedings of ICML, PMLR 80, pp. 3674-3682, 2018.

Nowak, R. The geometry of generalized binary search. IEEE Transactions on Information Theory, 57(12):7893-7906, 2011.

Tosh, C. and Dasgupta, S. Diameter-based active learning. In Proceedings of ICML, 2017.

Urner, R., Wulff, S., and Ben-David, S. PLAL: Cluster-based active learning. In Proceedings of COLT, pp. 376-397, 2013.

Zhang, C. Efficient active learning of sparse halfspaces. In Proceedings of COLT, 2018.

Zhang, C. and Chaudhuri, K. Beyond disagreement-based agnostic active learning. In Proceedings of NIPS, pp. 442-450, 2014.