# Efficient Active Learning with Abstention

Yinglun Zhu
Department of Computer Sciences, University of Wisconsin-Madison, Madison, WI 53706
yinglun@cs.wisc.edu

Robert Nowak
Department of Electrical and Computer Engineering, University of Wisconsin-Madison, Madison, WI 53706
rdnowak@wisc.edu

## Abstract

The goal of active learning is to achieve the same accuracy as passive learning while using far fewer labels. Exponential savings in label complexity have been proved in very special cases, but fundamental lower bounds show that such improvements are impossible in general. This suggests the need to explore alternative goals for active learning. Learning with abstention is one such alternative. In this setting, the active learning algorithm may abstain from prediction and incur an error that is marginally smaller than random guessing. We develop the first computationally efficient active learning algorithm with abstention. Our algorithm provably achieves polylog(1/ε) label complexity, without any low noise conditions. This guarantee reduces the label complexity by an exponential factor, relative to passive learning and to active learning that is not allowed to abstain. Furthermore, our algorithm is guaranteed to only abstain on hard examples (where the true label distribution is close to a fair coin), a novel property we term *proper abstention* that also leads to a host of other desirable characteristics (e.g., recovering minimax guarantees in the standard setting, and avoiding the undesirable noise-seeking behavior often seen in active learning). We also provide novel extensions of our algorithm that achieve constant label complexity and deal with model misspecification.

## 1 Introduction

Active learning aims at learning an accurate classifier with a small number of labeled data points (Settles, 2009; Hanneke, 2014). Active learning has become increasingly important in modern applications of machine learning, where unlabeled data points are abundant yet the labeling process requires expensive time and effort. Empirical successes of active learning have been observed in many areas (Tong and Koller, 2001; Gal et al., 2017; Sener and Savarese, 2018). In noise-free or certain low-noise cases (i.e., under Massart noise (Massart and Nédélec, 2006)), active learning algorithms with provable exponential savings over their passive counterparts have been developed (Balcan et al., 2007; Hanneke, 2007; Dasgupta et al., 2009; Hsu, 2010; Dekel et al., 2012; Hanneke, 2014; Zhang and Chaudhuri, 2014; Krishnamurthy et al., 2019; Katz-Samuels et al., 2021). On the other hand, however, not much can be said in the general case. In fact, Kääriäinen (2006) provides an $\Omega(\frac{1}{\varepsilon^2})$ lower bound by reducing active learning to a simple mean estimation problem: it takes $\Omega(\frac{1}{\varepsilon^2})$ samples to distinguish $\eta(x) = \frac12 + \varepsilon$ from $\eta(x) = \frac12 - \varepsilon$. Even under the relatively benign Tsybakov noise (Tsybakov, 2004), Castro and Nowak (2006, 2008) derive an $\Omega(\mathrm{poly}(\frac{1}{\varepsilon}))$ lower bound, again indicating that exponential speedup over passive learning is not possible in general. These fundamental lower bounds lay out statistical barriers to active learning, and suggest a refinement of the label complexity goals in active learning (Kääriäinen, 2006). Inspecting these lower bounds, one can see that active learning suffers from classifying hard examples that are close to the decision boundary.
However, do we really require a trained classifier to do well on those hard examples? In high-risk domains such as medical imaging, it makes more sense for the classifier to abstain from making the decision and leave the problem to a human expert. This idea is formalized under Chow's error (Chow, 1970): whenever the classifier chooses to abstain, it incurs a loss that is barely smaller than that of random guessing, i.e., 1/2 − γ. The parameter γ should be thought of as a small positive quantity, e.g., γ = 0.01. The inclusion of abstention is not only practically interesting, but also provides a statistical refinement of the label complexity goal of active learning: achieving exponential improvement under Chow's excess error.

When abstention is allowed as an action, Puchkin and Zhivotovskiy (2021) show, for the first time, that exponential improvement in label complexity can be achieved by active learning in the general setting. However, the approach provided in Puchkin and Zhivotovskiy (2021) cannot be efficiently implemented. Their algorithm follows the disagreement-based approach and requires maintaining a version space and checking whether or not an example lies in the region of disagreement. It is not clear how to implement these operations in general, other than by enumeration (Beygelzimer et al., 2010). Moreover, their algorithm relies on an Empirical Risk Minimization (ERM) oracle, which is NP-hard to implement even for a simple linear hypothesis class (Guruswami and Raghavendra, 2009).

In this paper, we break the computational barrier and design an efficient active learning algorithm with exponential improvement in label complexity relative to conventional passive learning. The algorithm relies on a weighted square loss regression oracle, which can be efficiently implemented in many cases (Krishnamurthy et al., 2017, 2019; Foster et al., 2018, 2020). The algorithm also abstains properly, i.e., it abstains only when abstention is the optimal choice, which allows us to easily translate its guarantees to the standard excess error. Along the way, we propose new noise-seeking noise conditions and show that uncertainty-based active learners can be easily trapped by them, yet our algorithm provably overcomes them. As an extension, we also provide the first algorithm that enjoys constant label complexity for a general set of regression functions.

### 1.1 Problem setting

Let X denote the input space and Y denote the label space. We focus on the binary classification problem where Y = {+1, −1}. The joint distribution over X × Y is denoted by $\mathcal D_{XY}$. We use $\mathcal D_{\mathcal X}$ to denote the marginal distribution over the input space X, and $\mathcal D_{Y|x}$ to denote the conditional distribution of Y with respect to any x ∈ X. We define $\eta(x) := \mathbb P_{y\sim\mathcal D_{Y|x}}(y = +1)$ as the conditional probability of a positive label. We consider the standard active learning setup where $(x, y) \sim \mathcal D_{XY}$ but y is observed only after a label query. We consider a hypothesis class H of classifiers h : X → Y. For any classifier h ∈ H, its (standard) error is defined as $\mathrm{err}(h) := \mathbb P_{(x,y)\sim\mathcal D_{XY}}(h(x) \neq y)$.

**Function approximation.** We focus on the case where the hypothesis class H is induced by a set of regression functions F : X → [0, 1] that predict the conditional probability η(x). We write $\mathcal H = \mathcal H_{\mathcal F} := \{h_f : f \in \mathcal F\}$, where $h_f(x) := \mathrm{sign}(2f(x) - 1)$. The size of F is measured by a well-known complexity measure: the Pseudo dimension Pdim(F) (Pollard, 1984; Haussler, 1989, 1995).
We assume Pdim(F) < ∞ throughout the paper.¹ Following existing works in active learning (Dekel et al., 2012; Krishnamurthy et al., 2017, 2019) and contextual bandits (Agarwal et al., 2012; Foster et al., 2018; Foster and Rakhlin, 2020; Simchi-Levi and Xu, 2020), we make the following realizability assumption.

**Assumption 1** (Realizability). The learner is given a set of regressors F : X → [0, 1] such that there exists an $f^\ast \in \mathcal F$ characterizing the true conditional probability, i.e., $f^\ast = \eta$.

The realizability assumption allows rich function approximation, which strictly generalizes the setting with linear function approximation studied in active learning (e.g., in Dekel et al. (2012)). We relax Assumption 1 in Section 4.2 to deal with model misspecification.

**Regression oracle.** We consider a regression oracle over F, which is extensively studied in the active learning and contextual bandit literature (Krishnamurthy et al., 2017, 2019; Foster et al., 2018, 2020). Given any set S of weighted examples $(w, x, y) \in \mathbb R_+ \times \mathcal X \times \mathcal Y$ as input, the regression oracle outputs

$$\hat f = \operatorname*{arg\,min}_{f\in\mathcal F}\ \sum_{(w,x,y)\in S} w\,(f(x)-y)^2. \tag{1}$$

The regression oracle solves a convex optimization problem with respect to the regression function, and admits closed-form solutions in many cases; e.g., it reduces to least squares when f is linear. We view the implementation of the regression oracle as an efficient operation and quantify the computational complexity in terms of the number of calls to the regression oracle.

¹ See Appendix D for a formal definition of the Pseudo dimension. Many function classes of practical interest have finite Pseudo dimension: (1) when F is finite, we have Pdim(F) = O(log|F|); (2) when F is a set of linear functions or generalized linear functions with a non-decreasing link function, we have Pdim(F) = O(d); (3) when F is a set of degree-r polynomials in $\mathbb R^d$, we have $\mathrm{Pdim}(\mathcal F) = O\big(\binom{d+r}{r}\big)$.

**Chow's excess error (Chow, 1970).** Let $h^\ast := h_{f^\ast} \in \mathcal H$ denote the Bayes classifier. The standard excess error of a classifier h ∈ H is defined as $\mathrm{err}(h) - \mathrm{err}(h^\ast)$. Since achieving exponential improvement (of active over passive learning) with respect to the standard excess error is impossible in general (Kääriäinen, 2006), we introduce Chow's excess error next. We consider classifiers of the form $\hat h : \mathcal X \to \mathcal Y \cup \{\perp\}$, where ⊥ denotes the action of abstention. For any fixed 0 < γ < 1/2, Chow's error is defined as

$$\mathrm{err}_\gamma(\hat h) := \mathbb P_{(x,y)\sim\mathcal D_{XY}}\big(\hat h(x)\neq y,\ \hat h(x)\neq\perp\big) + (1/2-\gamma)\cdot\mathbb P_{(x,y)\sim\mathcal D_{XY}}\big(\hat h(x)=\perp\big). \tag{2}$$

The parameter γ can be chosen as a small constant, e.g., γ = 0.01, to avoid excessive abstention: the price of abstention is only marginally smaller than random guessing. Chow's excess error is then defined as $\mathrm{err}_\gamma(\hat h) - \mathrm{err}(h^\ast)$ (Puchkin and Zhivotovskiy, 2021). For any fixed accuracy level ε > 0, we aim at constructing a classifier $\hat h : \mathcal X \to \mathcal Y \cup \{\perp\}$ with ε Chow's excess error and polylog(1/ε) label complexity. We also relate Chow's excess error to the standard excess error in Section 3.

**Remark 1.** Competing against the optimal Chow's error, i.e., $\mathrm{err}_\gamma(\hat h) - \inf_{h:\mathcal X\to\{+1,-1,\perp\}} \mathrm{err}_\gamma(h)$, would eliminate active learning gains. As in Kääriäinen (2006), it suffices to consider a simple problem with X = {x}. In order to achieve ε excess error against the optimal Chow's classifier, we need to distinguish the cases $\eta(x) = \frac12 - \gamma - 2\varepsilon$ and $\eta(x) = \frac12 - \gamma + 2\varepsilon$, which inevitably requires $\Omega(\frac{1}{\varepsilon^2})$ samples.

We defer a detailed discussion (with pictorial explanations) of Chow's excess error to Appendix B.
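To make Eq. (2) concrete, here is a minimal Python sketch (ours, not from the paper) that estimates Chow's error of an abstaining classifier from labeled samples; encoding the abstention action ⊥ as `None` is an illustrative choice.

```python
import numpy as np

def chow_error(h, xs, ys, gamma):
    """Monte-Carlo estimate of Chow's error, Eq. (2).

    h(x) returns +1, -1, or None (None encodes the abstention action).
    A non-abstaining mistake costs 1; an abstention costs 1/2 - gamma.
    """
    preds = [h(x) for x in xs]
    mistakes = np.mean([p is not None and p != y for p, y in zip(preds, ys)])
    abstains = np.mean([p is None for p in preds])
    return mistakes + (0.5 - gamma) * abstains

# On points with eta(x) close to 1/2, abstaining (cost 1/2 - gamma) beats
# predicting a label (expected cost close to 1/2).
```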
### 1.2 Contributions and paper organization

We provide informal statements of our main results in this section. Our results depend on complexity measures such as the value function disagreement coefficient θ and the eluder dimension $\mathfrak e$ (formally defined in Section 2 and Appendix C). These complexity measures were previously analyzed in contextual bandits (Russo and Van Roy, 2013; Foster et al., 2020), and we import them into the active learning setup. They are well-bounded for many function classes of practical interest; e.g., we have $\theta, \mathfrak e = \widetilde O(d)$ for linear and generalized linear functions on $\mathbb R^d$.

Our first main contribution is the design of the first computationally efficient active learning algorithm (Algorithm 1) that achieves exponential labeling savings, without any low noise assumptions.

**Theorem 1** (Informal). There exists an algorithm that constructs a classifier $\hat h : \mathcal X \to \{+1,-1,\perp\}$ with Chow's excess error at most ε and label complexity $\widetilde O\big(\frac{\theta\,\mathrm{Pdim}(\mathcal F)}{\gamma^2}\,\mathrm{polylog}(\frac1\varepsilon)\big)$, without any low noise assumptions. The algorithm can be efficiently implemented via a regression oracle: it takes $\widetilde O\big(\frac{\theta\,\mathrm{Pdim}(\mathcal F)}{\varepsilon\gamma^3}\big)$ oracle calls for general F, and $\widetilde O\big(\frac{\theta\,\mathrm{Pdim}(\mathcal F)}{\varepsilon\gamma}\big)$ oracle calls for convex F.

The formal statements are provided in Section 2. The statistical guarantee (i.e., label complexity) in Theorem 1 is similar to the one achieved in Puchkin and Zhivotovskiy (2021), with one critical difference: the label complexity provided in Puchkin and Zhivotovskiy (2021) is in terms of the classifier-based disagreement coefficient ˇθ (Hanneke, 2014). Even for a set of linear classifiers, ˇθ is only known to be bounded in special cases, e.g., when $\mathcal D_{\mathcal X}$ is uniform over the unit sphere (Hanneke, 2007). On the other hand, we have θ ≤ d for any $\mathcal D_{\mathcal X}$ (Foster et al., 2020).

We say that a classifier $\hat h : \mathcal X \to \{+1,-1,\perp\}$ enjoys *proper abstention* if it abstains only when abstention is indeed the optimal choice (based on Eq. (2)). For any classifier that enjoys proper abstention, one can easily relate its standard excess error to its Chow's excess error under the commonly studied Massart/Tsybakov noise conditions (Massart and Nédélec, 2006; Tsybakov, 2004). The classifier obtained in Theorem 1 enjoys proper abstention, and achieves the following guarantees (formally stated in Section 3.1).

**Theorem 2** (Informal). Under Massart/Tsybakov noise, with appropriate adjustments, the classifier learned in Theorem 1 achieves the minimax optimal label complexity under the standard excess error.

We also propose new noise conditions that strictly generalize the usual Massart/Tsybakov noise conditions, which we call noise-seeking conditions. At a high level, the noise-seeking conditions allow abundant data points with η(x) equal or close to 1/2. These points are somewhat harmless, since it hardly matters (in terms of excess error) what label is predicted at such a point. These seemingly harmless data points can, however, cause trouble for any active learning algorithm that requests the label of any point that is uncertain, i.e., for which the algorithm cannot decide whether |η(x) − 1/2| is strictly greater than 0. We call such algorithms uncertainty-based active learners. These algorithms can wastefully sample in these harmless regions, ignoring other regions where erring is much more harmful. We derive the following proposition (formally stated in Section 3.2) under these noise-seeking conditions.

**Proposition 1** (Informal). For any labeling budget $B \ge \frac{1}{\gamma^2}\,\mathrm{polylog}(\frac1\varepsilon)$, there exists a learning problem such that (1) any uncertainty-based active learner suffers standard excess error $\Omega(B^{-1})$; yet (2) the classifier $\hat h$ learned in Theorem 1 achieves standard excess error at most ε.
The above result demonstrates the superiority of our algorithm over any uncertainty-based active learner. Moreover, we show that, under these strictly harder noise-seeking conditions, our algorithm still achieves guarantees similar to the ones stated in Theorem 2.

Before presenting our next main result, we first consider a simple active learning problem with X = {x}. Under Massart noise, we have |η(x) − 1/2| ≥ τ₀ for some constant τ₀ > 0. Thus, it takes no more than $O(\tau_0^{-2}\log\frac1\delta)$ labels to achieve ε standard excess error, no matter how small ε is. This example shows that, at least in simple cases, we can expect active learning to achieve a constant label complexity, with no dependence on 1/ε at all. To the best of our knowledge, our next result provides the first generalization of this phenomenon to a general (finite) set of regression functions, as long as its eluder dimension $\mathfrak e$ is bounded.

**Theorem 3** (Informal). Consider Massart noise with parameter τ₀ and a general (finite) set of regression functions F. There exists an algorithm that returns a classifier with standard excess error at most ε and label complexity $O\big(\frac{\mathfrak e\,\log(|\mathcal F|/\delta)}{\tau_0^2}\big)$, which is independent of 1/ε. A similar constant label complexity holds under Chow's excess error, without any low noise assumptions.

We also discuss why previous algorithms do not achieve such constant label complexity, even in the case of linear functions. We defer formal statements and discussion to Section 4.1. In Section 4.2, we relax Assumption 1 and propose an algorithm that can deal with model misspecification.

**Paper organization.** The rest of this paper is organized as follows. We present our main algorithm and its guarantees in Section 2. We further analyze our algorithm under the standard excess error in Section 3, and discuss other important properties of the algorithm. Extensions of our algorithm, e.g., achieving constant label complexity and dealing with model misspecification, are provided in Section 4. We defer the discussion of additional related work and all proofs to the Appendix due to lack of space.

## 2 Efficient active learning with abstention

We present our main algorithm (Algorithm 1) in this section. Algorithm 1 is an adaptation of the algorithm developed in Krishnamurthy et al. (2017, 2019), which studies active learning under the standard excess error (and Massart/Tsybakov noise). We additionally take the abstention option into consideration, and manually construct classifiers using the active set of (uneliminated) regression functions (these classifiers do not belong to the original hypothesis class). These new elements allow us to achieve ε Chow's excess error with polylog(1/ε) label complexity, without any low noise assumptions.

**Algorithm 1** Efficient Active Learning with Abstention

Input: accuracy level ε > 0, abstention parameter γ ∈ (0, 1/2), and confidence level δ ∈ (0, 1).

1: Define $T := \widetilde O\big(\frac{\theta\,\mathrm{Pdim}(\mathcal F)}{\varepsilon\gamma}\big)$, $M := \lceil\log_2 T\rceil$, and $C_\delta := O(\mathrm{Pdim}(\mathcal F)\log(T/\delta))$.
2: Define $\tau_m := 2^m$ for $m \ge 1$, $\tau_0 := 0$, and $\beta_m := (M - m + 1)\cdot C_\delta$.
3: **for** epoch $m = 1, 2, \ldots, M$ **do**
4:   Get $\hat f_m := \arg\min_{f\in\mathcal F}\sum_{t=1}^{\tau_{m-1}} Q_t(f(x_t) - y_t)^2$.  // We use $Q_t \in \{0,1\}$ to indicate whether the label of $x_t$ is queried.
5:   (Implicitly) construct the active set of regression functions $\mathcal F_m \subseteq \mathcal F$ as
$$\mathcal F_m := \Big\{f\in\mathcal F : \sum_{t=1}^{\tau_{m-1}} Q_t(f(x_t)-y_t)^2 \le \sum_{t=1}^{\tau_{m-1}} Q_t(\hat f_m(x_t)-y_t)^2 + \beta_m\Big\}.$$
6:   Construct the classifier $\hat h_m : \mathcal X \to \{+1,-1,\perp\}$ as
$$\hat h_m(x) := \begin{cases}\perp, & \text{if } [\mathrm{lcb}(x;\mathcal F_m),\,\mathrm{ucb}(x;\mathcal F_m)] \subseteq [\tfrac12-\gamma,\,\tfrac12+\gamma];\\[2pt] \mathrm{sign}(2\hat f_m(x)-1), & \text{otherwise},\end{cases}$$
and construct the query function $g_m(x) := \mathbb 1\{\tfrac12 \in (\mathrm{lcb}(x;\mathcal F_m),\,\mathrm{ucb}(x;\mathcal F_m))\}\cdot\mathbb 1\{\hat h_m(x)\neq\perp\}$.
7:   **if** epoch $m = M$ **then**
8:     Return classifier $\hat h_M$.
9:   **for** time $t = \tau_{m-1}+1, \ldots, \tau_m$ **do**
10:    Observe $x_t \sim \mathcal D_{\mathcal X}$. Set $Q_t := g_m(x_t)$.
11:    **if** $Q_t = 1$ **then**
12:      Query the label $y_t$ of $x_t$.

Algorithm 1 runs in epochs of geometrically increasing lengths. At the beginning of epoch m ∈ [M], Algorithm 1 first computes the empirical best regression function $\hat f_m$, which achieves the smallest cumulative square loss over previously labeled data points ($\hat f_1$ can be selected arbitrarily); it then (implicitly) constructs an active set of regression functions $\mathcal F_m$, in which the cumulative square loss of each $f \in \mathcal F_m$ is not much larger than that of the empirical best regression function $\hat f_m$. For any x ∈ X, based on the active set of regression functions, Algorithm 1 constructs a lower bound $\mathrm{lcb}(x;\mathcal F_m) := \inf_{f\in\mathcal F_m} f(x)$ and an upper bound $\mathrm{ucb}(x;\mathcal F_m) := \sup_{f\in\mathcal F_m} f(x)$ for the true conditional probability η(x). An empirical classifier $\hat h_m : \mathcal X \to \{+1,-1,\perp\}$ and a query function $g_m : \mathcal X \to \{0,1\}$ are then constructed from these confidence ranges and the abstention parameter γ. For any time step t within epoch m, Algorithm 1 queries the label of the observed data point $x_t$ if and only if $Q_t := g_m(x_t) = 1$. Algorithm 1 returns $\hat h_M$ as the learned classifier.

We now discuss the empirical classifier $\hat h_m$ and the query function $g_m$ in more detail. Consider the event that $f^\ast \in \mathcal F_m$ for all m ∈ [M], which can be shown to hold with high probability. The constructed confidence intervals are valid under this event, i.e., $\eta(x) \in [\mathrm{lcb}(x;\mathcal F_m), \mathrm{ucb}(x;\mathcal F_m)]$. First, let us examine the conditions that determine a label query. The label of x is not queried in either of the following cases.

Case 1: $\hat h_m(x) = \perp$. We have $\eta(x) \in [\mathrm{lcb}(x;\mathcal F_m), \mathrm{ucb}(x;\mathcal F_m)] \subseteq [\frac12-\gamma, \frac12+\gamma]$. Abstention leads to the smallest error (Herbei and Wegkamp, 2006), and no query is needed.

Case 2: $\frac12 \notin (\mathrm{lcb}(x;\mathcal F_m), \mathrm{ucb}(x;\mathcal F_m))$. We have $\mathrm{sign}(2\hat f_m(x)-1) = \mathrm{sign}(2f^\ast(x)-1)$. Thus, no excess error is incurred and there is no need to query.

The only case in which a label query is issued, and thus in which the classifier $\hat h_m$ may suffer excess error, is when

$$\tfrac12 \in \big(\mathrm{lcb}(x;\mathcal F_m),\,\mathrm{ucb}(x;\mathcal F_m)\big) \quad\text{and}\quad [\mathrm{lcb}(x;\mathcal F_m),\,\mathrm{ucb}(x;\mathcal F_m)] \not\subseteq [\tfrac12-\gamma,\,\tfrac12+\gamma] \tag{3}$$

hold simultaneously. Eq. (3) necessarily leads to the condition $w(x;\mathcal F_m) := \mathrm{ucb}(x;\mathcal F_m) - \mathrm{lcb}(x;\mathcal F_m) > \gamma$. Our theoretical analysis shows that the event $\mathbb 1\{w(x;\mathcal F_m) > \gamma\}$ happens infrequently, and its frequency is closely related to the so-called value function disagreement coefficient (Foster et al., 2020), which we introduce as follows.²

**Definition 1** (Value function disagreement coefficient). For any $f^\ast \in \mathcal F$ and $\gamma_0, \varepsilon_0 > 0$, the value function disagreement coefficient $\theta^{\mathrm{val}}_{f^\ast}(\mathcal F, \gamma_0, \varepsilon_0)$ is defined as

$$\sup_{\mathcal D_{\mathcal X}}\ \sup_{\gamma>\gamma_0,\,\varepsilon>\varepsilon_0}\ \bigg[\frac{\gamma^2}{\varepsilon^2}\cdot\mathbb P_{\mathcal D_{\mathcal X}}\Big(\exists f\in\mathcal F : |f(x)-f^\ast(x)| > \gamma,\ \|f-f^\ast\|_{\mathcal D_{\mathcal X}} \le \varepsilon\Big)\bigg]\vee 1,$$

where $\|f\|^2_{\mathcal D_{\mathcal X}} := \mathbb E_{x\sim\mathcal D_{\mathcal X}}[f^2(x)]$.

Combining the insights discussed above, we derive the following label complexity guarantee for Algorithm 1 (we use $\theta := \sup_{f\in\mathcal F,\,\iota>0}\theta^{\mathrm{val}}_f(\mathcal F, \gamma/2, \iota)$ and discuss its boundedness below).³

**Theorem 4.** With probability at least 1 − 2δ, Algorithm 1 returns a classifier with Chow's excess error at most ε and label complexity $O\big(\frac{\theta\,\mathrm{Pdim}(\mathcal F)}{\gamma^2}\log^2\big(\frac{\theta\,\mathrm{Pdim}(\mathcal F)}{\varepsilon\gamma}\big)\log\big(\frac{\theta\,\mathrm{Pdim}(\mathcal F)}{\varepsilon\gamma\delta}\big)\big)$.

Theorem 4 shows that Algorithm 1 achieves exponential label savings (i.e., polylog(1/ε)) without any low noise assumptions.

² Compared to the original definition studied in contextual bandits (Foster et al., 2020), our definition takes an additional sup over all possible marginal distributions $\mathcal D_{\mathcal X}$ to account for distributional shifts incurred by selective querying (which do not occur in contextual bandits). Nevertheless, as we show below, our disagreement coefficient is still well-bounded for many important function classes.

³ It suffices to take $\theta := \theta^{\mathrm{val}}_{f^\ast}(\mathcal F, \gamma/2, \iota)$ with ι of order γε to derive a slightly different guarantee. See Appendix E.
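To make the abstain/predict/query logic of line 6 concrete, here is a minimal Python sketch; representing the active set $\mathcal F_m$ as an explicit list of regressors is purely illustrative ($\mathcal F_m$ is only implicitly maintained, and the efficient implementation discussed below approximates the confidence bounds through the regression oracle instead).

```python
def decision_rule(x, active_fs, f_hat, gamma):
    """Classifier h_m and query function g_m from line 6 of Algorithm 1.

    active_fs: explicit list of regressors in the active set F_m (illustrative).
    Returns (prediction, query), where prediction is +1, -1, or None (abstain).
    """
    lcb = min(f(x) for f in active_fs)
    ucb = max(f(x) for f in active_fs)
    if 0.5 - gamma <= lcb and ucb <= 0.5 + gamma:
        return None, False            # abstain: every plausible eta(x) is near 1/2
    pred = 1 if 2 * f_hat(x) - 1 >= 0 else -1
    query = lcb < 0.5 < ucb           # query only if the sign of eta(x) - 1/2 is unresolved
    return pred, query
```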
We discuss the result in more detail next.

**Boundedness of θ.** The value function disagreement coefficient is well-bounded for many function classes of practical interest. For instance, we have θ ≤ d for linear functions on $\mathbb R^d$ and $\theta \le C_{\mathrm{link}}\cdot d$ for generalized linear functions (where $C_{\mathrm{link}}$ is a quantity related to the link function). Moreover, θ is always upper bounded by complexity measures such as the (squared) star number and the eluder dimension (Foster et al., 2020). See Appendix C for detailed definitions and bounds.

**Comparison to Puchkin and Zhivotovskiy (2021).** The label complexity bound derived in Theorem 4 is similar to the one derived in Puchkin and Zhivotovskiy (2021), with one critical difference: the bound derived in Puchkin and Zhivotovskiy (2021) is in terms of the classifier-based disagreement coefficient ˇθ (Hanneke, 2014). Even in the case of linear classifiers, ˇθ is only known to be bounded under additional assumptions, e.g., when $\mathcal D_{\mathcal X}$ is uniform over the unit sphere.

**Computational efficiency.** We discuss how to efficiently implement Algorithm 1 with the regression oracle defined in Eq. (1).⁴ Our implementation relies on subroutines developed in Krishnamurthy et al. (2017) and Foster et al. (2018), which allow us to approximate the confidence bounds $\mathrm{ucb}(x;\mathcal F_m)$ and $\mathrm{lcb}(x;\mathcal F_m)$ up to α approximation error with $O(\frac1\alpha)$ (or $O(\log\frac1\alpha)$ when F is convex and closed under pointwise convergence) calls to the regression oracle. To achieve the same theoretical guarantees as in Theorem 4 (up to changes in constant terms), we show that it suffices to (i) control the approximation error at level $O(\frac{\gamma}{\log T})$; (ii) construct the approximated confidence bounds $\widehat{\mathrm{lcb}}(x;\mathcal F_m)$ and $\widehat{\mathrm{ucb}}(x;\mathcal F_m)$ so that the confidence region is non-increasing with respect to the epoch m, i.e., $(\widehat{\mathrm{lcb}}(x;\mathcal F_m), \widehat{\mathrm{ucb}}(x;\mathcal F_m)) \subseteq (\widehat{\mathrm{lcb}}(x;\mathcal F_{m-1}), \widehat{\mathrm{ucb}}(x;\mathcal F_{m-1}))$ (this ensures that the sampling region is non-increasing even with approximated confidence bounds, which is important to our theoretical analysis); and (iii) use the approximated confidence bounds $\widehat{\mathrm{lcb}}(x;\mathcal F_m)$ and $\widehat{\mathrm{ucb}}(x;\mathcal F_m)$ to construct the classifier $\hat h_m$ and the query function $g_m$. We provide our guarantees as follows, and leave the details to Appendix E (we redefine $\theta := \sup_{f\in\mathcal F,\,\iota>0}\theta^{\mathrm{val}}_f(\mathcal F, \gamma/4, \iota)$ in Theorem 5 to account for the approximation error).

⁴ Recall that the implementation of the regression oracle should be viewed as an efficient operation, since it solves a convex optimization problem with respect to the regression function, and it even admits closed-form solutions in many cases; e.g., it reduces to least squares when f is linear. On the other hand, the ERM oracle used in Puchkin and Zhivotovskiy (2021) is NP-hard even for a set of linear classifiers (Guruswami and Raghavendra, 2009).

**Theorem 5.** Algorithm 1 can be efficiently implemented via the regression oracle and enjoys the same theoretical guarantees stated in Theorem 4. The number of oracle calls needed is $\widetilde O\big(\frac{\theta\,\mathrm{Pdim}(\mathcal F)}{\varepsilon\gamma^3}\big)$ for a general set of regression functions F, and $\widetilde O\big(\frac{\theta\,\mathrm{Pdim}(\mathcal F)}{\varepsilon\gamma}\big)$ when F is convex and closed under pointwise convergence. The per-example inference time of the learned $\hat h_M$ is $\widetilde O\big(\frac{1}{\gamma^2}\log^2\big(\frac{\theta\,\mathrm{Pdim}(\mathcal F)}{\varepsilon\gamma}\big)\big)$ for general F, and $\widetilde O(\log\frac1\gamma)$ when F is convex and closed under pointwise convergence.
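The confidence-bound subroutine behind Theorem 5 can be sketched as follows, loosely following the weighted-oracle binary search of Krishnamurthy et al. (2017) and Foster et al. (2018): to approximate ucb, binary-search the weight placed on a synthetic example (x, 1); heavier weights pull the fitted value at x upward while increasing the loss on the real data, and we keep the largest fitted value whose loss stays within β_m of the ERM loss (i.e., whose fit still belongs to F_m). Specializing to linear regressors, and the particular weight range and iteration count, are our own simplifying assumptions, not the paper's exact subroutine.

```python
import numpy as np

def ls_oracle(examples):
    """Weighted least-squares regression oracle (Eq. (1)) for linear f(x) = <theta, x>."""
    w = np.array([e[0] for e in examples])
    X = np.array([e[1] for e in examples])
    y = np.array([e[2] for e in examples])
    sw = np.sqrt(w)
    theta, *_ = np.linalg.lstsq(sw[:, None] * X, sw * y, rcond=None)
    return theta

def sq_loss(theta, examples):
    return sum(w * (x @ theta - y) ** 2 for w, x, y in examples)

def approx_ucb(x, data, beta, n_iters=30, w_max=1e6):
    """Approximate ucb(x; F_m) = sup{f(x) : f in F_m} via binary search over the
    weight on a synthetic example (x, 1). data: list of (weight, x_vec, label)
    triples for the queried examples. lcb is symmetric, with synthetic target 0."""
    erm_loss = sq_loss(ls_oracle(data), data)
    lo, hi = 0.0, w_max
    best = float(x @ ls_oracle(data))
    for _ in range(n_iters):  # iteration count is a heuristic choice
        mid = (lo + hi) / 2
        theta = ls_oracle(data + [(mid, x, 1.0)])
        if sq_loss(theta, data) <= erm_loss + beta:  # fit still inside the level set F_m
            best, lo = max(best, float(x @ theta)), mid
        else:
            hi = mid
    return best
```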
With Theorem 5, we provide the first computationally efficient active learning algorithm that achieves exponential label savings, without any low noise assumptions.

## 3 Guarantees under standard excess error

We now provide guarantees for Algorithm 1 under the standard excess error. In Section 3.1, we show that Algorithm 1 can be used to recover the usual minimax label complexity under Massart/Tsybakov noise; we also derive from Algorithm 1 a new learning paradigm under a limited budget. In Section 3.2, we show that Algorithm 1 provably avoids the undesirable noise-seeking behavior often seen in active learning.

### 3.1 Recovering minimax optimal label complexity

One way to convert an abstaining classifier $\hat h : \mathcal X \to \mathcal Y \cup \{\perp\}$ into a standard classifier $\check h : \mathcal X \to \mathcal Y$ is to randomize the prediction in its abstention region, i.e., if $\hat h(x) = \perp$, then its randomized version $\check h(x)$ predicts +1/−1 with equal probability (Puchkin and Zhivotovskiy, 2021). With such randomization, the standard excess error of $\check h$ can be characterized as

$$\mathrm{err}(\check h) - \mathrm{err}(h^\ast) = \mathrm{err}_\gamma(\hat h) - \mathrm{err}(h^\ast) + \gamma\cdot\mathbb P_{x\sim\mathcal D_{\mathcal X}}\big(\hat h(x) = \perp\big). \tag{4}$$

The standard excess error depends on the (random) abstention region of $\hat h$, which is difficult to quantify in general. To give a more practical characterization of the standard excess error, we introduce the concept of proper abstention.

**Definition 2** (Proper abstention). A classifier $\hat h : \mathcal X \to \mathcal Y \cup \{\perp\}$ enjoys proper abstention if and only if it abstains only in regions where abstention is indeed the optimal choice, i.e., $\{x \in \mathcal X : \hat h(x) = \perp\} \subseteq \{x \in \mathcal X : \eta(x) \in [\frac12-\gamma, \frac12+\gamma]\} =: \mathcal X_\gamma$.

**Proposition 2.** The classifier $\hat h$ returned by Algorithm 1 enjoys proper abstention. With randomization over the abstention region, we have the following upper bound on its standard excess error:

$$\mathrm{err}(\check h) - \mathrm{err}(h^\ast) \le \mathrm{err}_\gamma(\hat h) - \mathrm{err}(h^\ast) + \gamma\cdot\mathbb P_{x\sim\mathcal D_{\mathcal X}}(x \in \mathcal X_\gamma). \tag{5}$$

The proper abstention property of the $\hat h$ returned by Algorithm 1 is achieved via conservativeness: $\hat h$ avoids abstention unless it is absolutely sure that abstention is the optimal choice.⁵ To characterize the standard excess error of a classifier with proper abstention, we only need to upper bound the term $\mathbb P_{x\sim\mathcal D_{\mathcal X}}(x \in \mathcal X_\gamma)$, which does not depend on the (random) classifier $\hat h$; it depends only on the marginal distribution.

⁵ On the other hand, the algorithm provided in Puchkin and Zhivotovskiy (2021) is very unlikely to have this property. In fact, only a small but nonzero upper bound on the abstention rate is provided there (Proposition 3.6 therein) under Massart noise with γ ≤ τ₀/2; yet any classifier that enjoys proper abstention should have exactly zero abstention rate.
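The randomized conversion behind Eq. (4) is mechanical; here is a minimal sketch (ours; `None` again encodes ⊥, and the helper name is illustrative):

```python
import random

def randomize_abstention(h, rng=random):
    """Convert an abstaining classifier h into a standard classifier:
    wherever h abstains, predict +1/-1 with equal probability."""
    def h_check(x):
        pred = h(x)
        return rng.choice([+1, -1]) if pred is None else pred
    return h_check
```

On the abstention region, the randomized prediction errs with probability exactly 1/2, so replacing an abstention (cost 1/2 − γ) by a random guess (cost 1/2) adds exactly γ per abstained point; this is where the $\gamma\cdot\mathbb P_{x\sim\mathcal D_{\mathcal X}}(\hat h(x)=\perp)$ term in Eq. (4) comes from.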
We next introduce the common Massart/Tsybakov noise conditions.

**Definition 3** (Massart noise, Massart and Nédélec (2006)). A distribution $\mathcal D_{XY}$ satisfies the Massart noise condition with parameter τ₀ > 0 if $\mathbb P_{x\sim\mathcal D_{\mathcal X}}(|\eta(x) - 1/2| < \tau_0) = 0$.

**Definition 4** (Tsybakov noise, Tsybakov (2004)). A distribution $\mathcal D_{XY}$ satisfies the Tsybakov noise condition with parameter β ≥ 0 and a universal constant c > 0 if $\mathbb P_{x\sim\mathcal D_{\mathcal X}}(|\eta(x) - 1/2| \le \tau) \le c\,\tau^\beta$ for any τ > 0.

As in Balcan et al. (2007) and Hanneke (2014), we assume knowledge of the noise parameters (e.g., τ₀, β). Together with the active learning lower bounds established in Castro and Nowak (2006, 2008), and focusing on the dependence on ε, our next theorem shows that Algorithm 1 can be used to recover the minimax label complexity in active learning under the standard excess error.

**Theorem 6.** With an appropriate choice of the abstention parameter γ in Algorithm 1 and randomization over the abstention region, Algorithm 1 learns a classifier $\check h$ at the minimax optimal rates: to achieve ε standard excess error, it takes $\widetilde\Theta(\tau_0^{-2})$ labels under Massart noise and $\widetilde\Theta(\varepsilon^{-2/(1+\beta)})$ labels under Tsybakov noise.

**Remark 2.** In addition to recovering the minimax rates, the proper abstention property is desirable in practice: it guarantees that $\hat h$ will not abstain on easy examples, i.e., it will not mistakenly flag easy examples as "hard-to-classify", thus eliminating unnecessary human labeling effort.

Algorithm 1 can also be used to derive a new learning paradigm in the limited budget setting, which we introduce below. No prior knowledge of the noise parameters is required in this setup.

**New learning paradigm under limited budget.** Given any labeling budget B > 0, we can choose $\gamma \approx B^{-1/2}$ in Algorithm 1 to make sure the label complexity never exceeds B (with high probability). The learned classifier enjoys Chow's excess error (with parameter γ) at most ε; its standard excess error (with randomization over the abstention region) can be analyzed by relating the $\gamma\cdot\mathbb P_{x\sim\mathcal D_{\mathcal X}}(x\in\mathcal X_\gamma)$ term in Eq. (5) to the Massart/Tsybakov noise conditions, as discussed above.

### 3.2 Abstention to avoid noise-seeking

Active learning algorithms sometimes exhibit noise-seeking behavior, i.e., oversampling in regions where η(x) is close to the 1/2 level. Such noise-seeking behavior is known to be a fundamental barrier to achieving low label complexity (under the standard excess error); see, e.g., Kääriäinen (2006). We show in this section that abstention naturally helps avoid noise-seeking behavior and speeds up active learning. To better illustrate how properly abstaining classifiers avoid noise-seeking behavior, we first propose new noise conditions, which strictly generalize the usual Massart/Tsybakov noise conditions.

**Definition 5** (Noise-seeking Massart noise). A distribution $\mathcal D_{XY}$ satisfies the noise-seeking Massart noise condition with parameters 0 ≤ ζ₀ < τ₀ ≤ 1/2 if $\mathbb P_{x\sim\mathcal D_{\mathcal X}}(\zeta_0 < |\eta(x) - 1/2| < \tau_0) = 0$.

**Definition 6** (Noise-seeking Tsybakov noise). A distribution $\mathcal D_{XY}$ satisfies the noise-seeking Tsybakov noise condition with parameters 0 ≤ ζ₀ < 1/2, β ≥ 0, and a universal constant c > 0 if $\mathbb P_{x\sim\mathcal D_{\mathcal X}}(\zeta_0 < |\eta(x) - 1/2| \le \tau) \le c\,\tau^\beta$ for any τ > ζ₀.

Compared to the standard Massart/Tsybakov noise conditions, these newly proposed noise-seeking conditions allow arbitrary probability mass of data points whose conditional probability η(x) is equal or close to 1/2. As a result, they can trick standard active learning algorithms into exhibiting noise-seeking behavior (hence the name). We also mention that the parameter ζ₀ should be considered an extremely small quantity (e.g., ζ₀ ≤ ε), with the extreme case corresponding to ζ₀ = 0 (which still allows arbitrary probability mass on the region {x ∈ X : η(x) = 1/2}).
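As an illustration, the following sampler draws from a one-dimensional distribution satisfying Definition 5 with ζ₀ = 0; the mixture weights and margin values are arbitrary illustrative choices. Half of the mass is pure noise with η(x) = 1/2, while the remaining mass has margin exactly τ₀.

```python
import random

def sample_noise_seeking(tau0=0.2, rng=random):
    """One draw (x, y) from a distribution satisfying Definition 5 with zeta0 = 0:
    no x has 0 < |eta(x) - 1/2| < tau0, yet half the mass is pure noise."""
    if rng.random() < 0.5:
        x, eta = rng.uniform(0.0, 0.5), 0.5           # pure-noise region: eta(x) = 1/2
    else:
        x = rng.uniform(0.5, 1.0)
        eta = 0.5 + tau0 if x < 0.75 else 0.5 - tau0  # well-separated labels
    y = +1 if rng.random() < eta else -1
    return x, y
```

An uncertainty-based learner can burn its entire budget on the left half, where |η(x) − 1/2| can never be resolved to be nonzero, while a properly abstaining classifier simply abstains there at negligible cost.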
Ideally, an active learning algorithm should not be heavily affected by these noise conditions, since it hardly matters (in terms of excess error) what label is predicted over the region {x ∈ X : |η(x) − 1/2| ≤ ζ₀}. However, these seemingly benign noise-seeking conditions can cause trouble for any uncertainty-based active learner, i.e., any active learning algorithm that requests the label of any point that is uncertain (see Definition 10 in Appendix F for a formal definition). In particular, under a limited budget, we derive the following result.

**Proposition 3.** Fix ε, δ, γ > 0. For any labeling budget $B \ge \frac{1}{\gamma^2}\log^2\big(\frac{1}{\varepsilon\gamma}\big)\log\big(\frac{1}{\varepsilon\gamma\delta}\big)$, there exists a learning problem (with a set of linear regression functions) satisfying Definition 5/Definition 6 such that (1) any uncertainty-based active learner suffers expected standard excess error $\Omega(B^{-1})$; yet (2) with probability at least 1 − δ, Algorithm 1 returns a classifier with standard excess error at most ε.

The above result demonstrates the superiority of our Algorithm 1 over any uncertainty-based active learner. Moreover, we show that Algorithm 1 achieves guarantees similar to those in Theorem 6 under the strictly harder noise-seeking conditions. Specifically, we have the following guarantees.

**Theorem 7.** With an appropriate choice of the abstention parameter γ in Algorithm 1 and randomization over the abstention region, Algorithm 1 learns a classifier $\check h$ with ε + ζ₀ standard excess error after querying $\widetilde\Theta(\tau_0^{-2})$ labels under Definition 5, or $\widetilde\Theta(\varepsilon^{-2/(1+\beta)})$ labels under Definition 6.

The special case of the noise-seeking condition with ζ₀ = 0 was recently studied in Kpotufe et al. (2021), where the authors conclude that no active learner can outperform its passive counterpart in the nonparametric regime. Theorem 7 shows that, in the parametric setting (with function approximation), Algorithm 1 provably overcomes these noise-seeking conditions.

## 4 Extensions

We provide two adaptations of our main algorithm (Algorithm 1) that (1) achieve constant label complexity for a general set of regression functions (Section 4.1), and (2) adapt to model misspecification (Section 4.2). These two adaptations can also be efficiently implemented via the regression oracle and enjoy guarantees similar to those stated in Theorem 5. We defer the computational analysis to Appendix G and Appendix H.

### 4.1 Constant label complexity

We start by considering a simple problem instance with X = {x}, where active learning reduces to mean estimation of η(x). Consider the Massart noise case where $\eta(x) \notin [\frac12-\tau_0, \frac12+\tau_0]$. No matter how small the desired accuracy level ε > 0 is, the learner need not spend more than $O(\frac{\log(1/\delta)}{\tau_0^2})$ labels to correctly classify x with probability at least 1 − δ, which ensures 0 excess error. In the general setting, but with Chow's excess error, a similar result follows: it takes at most $O(\frac{\log(1/\delta)}{\gamma^2})$ samples to verify whether η(x) is contained in $[\frac12-\gamma, \frac12+\gamma]$ or not. Taking the optimal action within {+1, −1, ⊥} (based on Eq. (2)) then leads to 0 Chow's excess error. This reasoning shows that, at least in simple cases, one should be able to achieve a constant label complexity no matter how small ε is.
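A minimal sketch of this single-point argument, under our own illustrative Hoeffding-style stopping rule: query x repeatedly until the confidence interval for η(x) either excludes 1/2 (commit to the sign) or fits inside [1/2 − γ, 1/2 + γ] (commit to abstention). Once the Hoeffding radius drops below γ/2, one of the two conditions must fire, so the query count is $O(\log(1/\delta)/\gamma^2)$, independently of ε.

```python
import math

def classify_single_point(sample_label, gamma, delta, max_n=10**6):
    """Decide +1, -1, or abstain (None) for a single point x, where
    sample_label() draws a label in {+1, -1} with P(+1) = eta(x).
    Returns (action, number of queries used)."""
    pos = 0
    for n in range(1, max_n + 1):
        pos += (sample_label() == +1)
        mean = pos / n
        # Hoeffding radius; log(2*max_n/delta) union-bounds over all rounds.
        rad = math.sqrt(math.log(2 * max_n / delta) / (2 * n))
        lcb, ucb = mean - rad, mean + rad
        if 0.5 - gamma <= lcb and ucb <= 0.5 + gamma:
            return None, n                       # abstention is optimal per Eq. (2)
        if ucb < 0.5 or lcb > 0.5:
            return (+1 if lcb > 0.5 else -1), n  # sign of eta(x) - 1/2 resolved
    raise RuntimeError("budget exceeded")
```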
One natural question to ask is the following. Is it possible to achieve constant label complexity in the general case of active learning?

We provide the first affirmative answer to this question for a general (finite) set of regression functions F, under a general input space X and marginal distribution $\mathcal D_{\mathcal X}$. The positive result is achieved by Algorithm 2 (deferred to Appendix G.2), which differs from Algorithm 1 in two aspects: (1) we drop the epoch scheduling, and (2) we apply a tighter elimination step derived from an optimal stopping theorem. Another change comes in the analysis of the algorithm: instead of analyzing with respect to the disagreement coefficient, we work with the eluder dimension $\mathfrak e := \sup_{f\in\mathcal F}\mathfrak e_f(\mathcal F, \gamma/2)$.⁶ To do so, we analyze active learning from the perspective of regret minimization with selective querying (Dekel et al., 2012), which allows us to incorporate techniques developed in the field of contextual bandits (Russo and Van Roy, 2013; Foster et al., 2020). We defer a detailed discussion to Appendix G.1 and provide the following guarantees.

**Theorem 8.** With probability at least 1 − 2δ, Algorithm 2 returns a classifier with expected Chow's excess error at most ε and label complexity $O\big(\frac{\mathfrak e\,\log(|\mathcal F|/\delta)}{\gamma^2}\big)$, which is independent of 1/ε.

Based on the discussion in Section 3, we can immediately translate the above result into standard excess error guarantees under Massart noise (with γ replaced by τ₀). We next discuss why existing algorithms/analyses do not guarantee constant label complexity, even in the linear case.

1. Epoch scheduling. Many algorithms proceed in epochs and aim at halving the excess error after each epoch (Balcan et al., 2007; Zhang and Chaudhuri, 2014; Puchkin and Zhivotovskiy, 2021). One inevitably needs $\log\frac1\varepsilon$ epochs to achieve ε excess error.

2. Relating to the disagreement coefficient. The algorithm presented in Krishnamurthy et al. (2019) does not use epoch scheduling. However, its label complexity is analyzed via the disagreement coefficient, which incurs a $\sum_{t=1}^{1/\varepsilon}\frac1t = O(\log\frac1\varepsilon)$ term in the label complexity.

**Remark 3.** Algorithm 2 also provides guarantees when x is selected by an adaptive adversary (instead of sampled i.i.d. $x \sim \mathcal D_{\mathcal X}$). In that case, we simultaneously upper bound the regret and the label complexity (see Theorem 10 in Appendix G.2). Our results can be viewed as a generalization of the results developed in the linear case (Dekel et al., 2012).

⁶ We formally define the eluder dimension in Appendix C. As examples, we have $\mathfrak e = O(d\log\frac1\gamma)$ for linear functions in $\mathbb R^d$, and $\mathfrak e = O(C_{\mathrm{link}}\,d\log\frac1\gamma)$ for generalized linear functions (where $C_{\mathrm{link}}$ is a quantity related to the link function).

### 4.2 Dealing with model misspecification

Our main results are developed under realizability (Assumption 1), which assumes that there exists an $f^\ast \in \mathcal F$ such that $f^\ast = \eta$. In this section, we relax that assumption and allow model misspecification. We assume the learner is given a set of regression functions F : X → [0, 1] that may only approximate the conditional probability η. More specifically, we make the following assumption.

**Assumption 2** (Model misspecification). There exists an $f^\ast \in \mathcal F$ such that $f^\ast$ approximates η up to κ > 0 accuracy, i.e., $\sup_{x\in\mathcal X}|f^\ast(x) - \eta(x)| \le \kappa$.

We use a variation of Algorithm 1 to adapt to model misspecification (Algorithm 3, deferred to Appendix H.1). Compared to Algorithm 1, the main change in Algorithm 3 is a more conservative step in determining the active set $\mathcal F_m$ at each epoch: we maintain a larger active set of regression functions to ensure that $f^\ast$ is not eliminated throughout all epochs. Our algorithm proceeds without knowing the misspecification level κ. However, the excess error bound presented next holds under the condition κ ≤ ε (i.e., it requires that the misspecification be no larger than the desired accuracy). Abbreviating $\theta := \sup_{\iota>0}\theta^{\mathrm{val}}_{f^\ast}(\mathcal F, \gamma/2, \iota)$, we achieve the following guarantees.

**Theorem 9.** Suppose κ ≤ ε.
With probability at least 1 − 2δ, Algorithm 3 returns a classifier with Chow's excess error $O\big(\varepsilon\,\theta\log\big(\frac{\mathrm{Pdim}(\mathcal F)}{\varepsilon\gamma\delta}\big)\big)$ and label complexity $O\big(\frac{\theta\,\mathrm{Pdim}(\mathcal F)}{\gamma^2}\log^2\big(\frac{\mathrm{Pdim}(\mathcal F)}{\varepsilon\gamma}\big)\log\big(\frac{\mathrm{Pdim}(\mathcal F)}{\varepsilon\gamma\delta}\big)\big)$.

We only provide a guarantee when κ ≤ ε, since the learned classifier suffers an additive κ term in the excess error (see Appendix H.2 for more discussion). On the other hand, the (inefficient) algorithm provided in Puchkin and Zhivotovskiy (2021) works without any assumption on the approximation error. An interesting future direction is to study the relation between computational efficiency and learning with general approximation error.

## Acknowledgments and Disclosure of Funding

We thank the anonymous reviewers for their helpful comments. This work is partially supported by NSF grant 1934612 and AFOSR grant FA9550-18-1-0166.

## References

Alekh Agarwal, Miroslav Dudík, Satyen Kale, John Langford, and Robert Schapire. Contextual bandit learning with predictable rewards. In Artificial Intelligence and Statistics, pages 19–26. PMLR, 2012.

Alekh Agarwal, Daniel Hsu, Satyen Kale, John Langford, Lihong Li, and Robert Schapire. Taming the monster: A fast and simple algorithm for contextual bandits. In International Conference on Machine Learning, pages 1638–1646. PMLR, 2014.

Martin Anthony. Uniform Glivenko-Cantelli theorems and concentration of measure in the mathematical modelling of learning. Research Report LSE-CDAM-2002-07, 2002.

Maria-Florina Balcan, Andrei Broder, and Tong Zhang. Margin based active learning. In International Conference on Computational Learning Theory, pages 35–50. Springer, 2007.

Alina Beygelzimer, Daniel J. Hsu, John Langford, and Tong Zhang. Agnostic active learning without constraints. Advances in Neural Information Processing Systems, 23, 2010.

Olivier Bousquet and Nikita Zhivotovskiy. Fast classification rates without standard margin assumptions. Information and Inference: A Journal of the IMA, 10(4):1389–1421, 2021.

Rui M. Castro and Robert D. Nowak. Upper and lower error bounds for active learning. In The 44th Annual Allerton Conference on Communication, Control and Computing, volume 2, page 1, 2006.

Rui M. Castro and Robert D. Nowak. Minimax bounds for active learning. IEEE Transactions on Information Theory, 54(5):2339–2353, 2008.

C. K. Chow. On optimum recognition error and reject tradeoff. IEEE Transactions on Information Theory, 16(1):41–46, 1970.

Sanjoy Dasgupta, Adam Tauman Kalai, and Claire Monteleoni. Analysis of perceptron-based active learning. Journal of Machine Learning Research, 10(2), 2009.

Ofer Dekel, Claudio Gentile, and Karthik Sridharan. Selective sampling and active learning from single and multiple teachers. The Journal of Machine Learning Research, 13(1):2655–2697, 2012.

Dylan Foster and Alexander Rakhlin. Beyond UCB: Optimal and efficient contextual bandits with regression oracles. In International Conference on Machine Learning, pages 3199–3210. PMLR, 2020.

Dylan Foster, Alekh Agarwal, Miroslav Dudík, Haipeng Luo, and Robert Schapire. Practical contextual bandits with regression oracles. In International Conference on Machine Learning, pages 1539–1548. PMLR, 2018.

Dylan J. Foster, Alexander Rakhlin, David Simchi-Levi, and Yunzong Xu. Instance-dependent complexity of contextual bandits and reinforcement learning: A disagreement-based perspective. arXiv preprint arXiv:2010.03104, 2020.

David A. Freedman. On tail probabilities for martingales. The Annals of Probability, pages 100–118, 1975.

Yarin Gal, Riashat Islam, and Zoubin Ghahramani. Deep Bayesian active learning with image data. In International Conference on Machine Learning, pages 1183–1192. PMLR, 2017.
Venkatesan Guruswami and Prasad Raghavendra. Hardness of learning halfspaces with noise. SIAM Journal on Computing, 39(2):742–765, 2009.

Steve Hanneke. A bound on the label complexity of agnostic active learning. In Proceedings of the 24th International Conference on Machine Learning, pages 353–360, 2007.

Steve Hanneke. Theory of disagreement-based active learning. Foundations and Trends in Machine Learning, 7(2-3), 2014.

David Haussler. Decision theoretic generalizations of the PAC model for neural net and other learning applications. 1989.

David Haussler. Sphere packing numbers for subsets of the Boolean n-cube with bounded Vapnik-Chervonenkis dimension. Journal of Combinatorial Theory, Series A, 69(2):217–232, 1995.

Radu Herbei and Marten H. Wegkamp. Classification with reject option. The Canadian Journal of Statistics / La Revue Canadienne de Statistique, pages 709–721, 2006.

Daniel Joseph Hsu. Algorithms for active learning. PhD thesis, UC San Diego, 2010.

Tzu-Kuo Huang, Alekh Agarwal, Daniel J. Hsu, John Langford, and Robert E. Schapire. Efficient and parsimonious agnostic active learning. Advances in Neural Information Processing Systems, 28, 2015.

Matti Kääriäinen. Active learning in the non-realizable case. In International Conference on Algorithmic Learning Theory, pages 63–77. Springer, 2006.

Julian Katz-Samuels, Jifan Zhang, Lalit Jain, and Kevin Jamieson. Improved algorithms for agnostic pool-based active classification. In International Conference on Machine Learning, pages 5334–5344. PMLR, 2021.

Vladimir Koltchinskii. Rademacher complexities and bounding the excess risk in active learning. The Journal of Machine Learning Research, 11:2457–2485, 2010.

Samory Kpotufe, Gan Yuan, and Yunfan Zhao. Nuances in margin conditions determine gains in active learning. arXiv preprint arXiv:2110.08418, 2021.

Akshay Krishnamurthy, Alekh Agarwal, Tzu-Kuo Huang, Hal Daumé III, and John Langford. Active learning for cost-sensitive classification. In International Conference on Machine Learning, pages 1915–1924. PMLR, 2017.

Akshay Krishnamurthy, Alekh Agarwal, Tzu-Kuo Huang, Hal Daumé III, and John Langford. Active learning for cost-sensitive classification. Journal of Machine Learning Research, 20:1–50, 2019.

Tor Lattimore, Csaba Szepesvari, and Gellert Weisz. Learning with good feature representations in bandits and in RL with a generative model. In International Conference on Machine Learning, pages 5662–5670. PMLR, 2020.

Andrea Locatelli, Alexandra Carpentier, and Samory Kpotufe. Adaptivity to noise parameters in nonparametric active learning. In Proceedings of the 2017 Conference on Learning Theory. PMLR, 2017.

Pascal Massart and Élodie Nédélec. Risk bounds for statistical learning. The Annals of Statistics, 34(5):2326–2366, 2006.

Stanislav Minsker. Plug-in approach to active learning. Journal of Machine Learning Research, 13(1), 2012.

David Pollard. Convergence of Stochastic Processes. Springer, 1984.

Nikita Puchkin and Nikita Zhivotovskiy. Exponential savings in agnostic active learning through abstention. arXiv preprint arXiv:2102.00451, 2021.

Daniel Russo and Benjamin Van Roy. Eluder dimension and the sample complexity of optimistic exploration. In NIPS, pages 2256–2264, 2013.

Ozan Sener and Silvio Savarese. Active learning for convolutional neural networks: A core-set approach. In International Conference on Learning Representations, 2018.

Burr Settles. Active learning literature survey. 2009.
Shubhanshu Shekhar, Mohammad Ghavamzadeh, and Tara Javidi. Active learning for classification with abstention. IEEE Journal on Selected Areas in Information Theory, 2(2):705–719, 2021.

David Simchi-Levi and Yunzong Xu. Bypassing the monster: A faster and simpler optimal algorithm for contextual bandits under realizability. Available at SSRN 3562765, 2020.

Simon Tong and Daphne Koller. Support vector machine active learning with applications to text classification. Journal of Machine Learning Research, 2(Nov):45–66, 2001.

Alexander B. Tsybakov. Optimal aggregation of classifiers in statistical learning. The Annals of Statistics, 32(1):135–166, 2004.

Chicheng Zhang and Kamalika Chaudhuri. Beyond disagreement-based agnostic active learning. Advances in Neural Information Processing Systems, 27, 2014.

## Checklist

1. For all authors...
   (a) Do the main claims made in the abstract and introduction accurately reflect the paper's contributions and scope? [Yes]
   (b) Did you describe the limitations of your work? [Yes] Our results in the misspecification setting (i.e., in Section 4.2) are derived under an assumption. We provide discussion of this assumption in Appendix H.2, and leave a comprehensive study of the problem for future work.
   (c) Did you discuss any potential negative societal impacts of your work? [N/A] Our paper is theoretical in nature, and there is no negative societal impact of our work in the foreseeable future.
   (d) Have you read the ethics review guidelines and ensured that your paper conforms to them? [Yes]
2. If you are including theoretical results...
   (a) Did you state the full set of assumptions of all theoretical results? [Yes] Assumptions are clearly stated in the statement of each theorem.
   (b) Did you include complete proofs of all theoretical results? [Yes] Complete proofs are provided in the Appendix.
3. If you ran experiments...
   (a) Did you include the code, data, and instructions needed to reproduce the main experimental results (either in the supplemental material or as a URL)? [N/A]
   (b) Did you specify all the training details (e.g., data splits, hyperparameters, how they were chosen)? [N/A]
   (c) Did you report error bars (e.g., with respect to the random seed after running experiments multiple times)? [N/A]
   (d) Did you include the total amount of compute and the type of resources used (e.g., type of GPUs, internal cluster, or cloud provider)? [N/A]
4. If you are using existing assets (e.g., code, data, models) or curating/releasing new assets...
   (a) If your work uses existing assets, did you cite the creators? [N/A]
   (b) Did you mention the license of the assets? [N/A]
   (c) Did you include any new assets either in the supplemental material or as a URL? [N/A]
   (d) Did you discuss whether and how consent was obtained from people whose data you're using/curating? [N/A]
   (e) Did you discuss whether the data you are using/curating contains personally identifiable information or offensive content? [N/A]
5. If you used crowdsourcing or conducted research with human subjects...
   (a) Did you include the full text of instructions given to participants and screenshots, if applicable? [N/A]
   (b) Did you describe any potential participant risks, with links to Institutional Review Board (IRB) approvals, if applicable? [N/A]
   (c) Did you include the estimated hourly wage paid to participants and the total amount spent on participant compensation? [N/A]