# Learning with Labeling Induced Abstentions

Kareem Amin, Google Research, New York, NY (kamin@google.com)
Giulia DeSalvo, Google Research, New York, NY (giuliad@google.com)
Afshin Rostamizadeh, Google Research, New York, NY (rostami@google.com)

## Abstract

Consider a setting where we wish to automate an expensive task with a machine learning algorithm using a limited labeling resource. In such settings, examples routed for labeling are often out of scope for the machine learning algorithm. For example, in a spam detection setting, human reviewers not only provide labeled data but are such high-quality detectors of spam that examples routed to them no longer require machine evaluation. As a consequence, the distribution of examples routed to the machine is intimately tied to the process generating labels. We introduce a formalization of this setting, and give an algorithm that simultaneously learns a model and decides when to request a label by leveraging ideas from both the abstention and active learning literatures. We prove an upper bound on the algorithm's label complexity and a matching lower bound for any algorithm in this setting. We conduct a thorough set of experiments, including an ablation study, to test different components of our algorithm. We demonstrate the effectiveness of an efficient version of our algorithm over margin sampling on a variety of datasets.

## 1 Introduction

In this paper, we consider a system that relies on automated predictions made by a machine learning model. We assume this system has a limited budget for requesting ground-truth labels (e.g. from a domain expert). In practice, such requests can be used, among other purposes, to gather additional training data for the machine learning model. If the system asks for a label for an example, then it no longer needs the model's prediction for that particular example.
Thus, the pattern of label queries effectively defines the distribution that the model will learn with and predict on. Take for example a large-scale video-hosting website. The website wants to automatically detect videos that violate its community guidelines. In order to acquire labels for this task, some of these videos are evaluated by a finite pool of human reviewers. There are two consequences of this evaluation. Firstly, the reviewer provides a training label for the model. Secondly, the human intervention makes it so that the machine learning algorithm is no longer tasked with making predictions on these examples. As a result, the goal is then to optimize the model's performance in the domain where it will be executed. We take the perspective of a system designer who wants to understand (and optimize for) the performance of the automated system on the examples that it will be asked to evaluate. Let r : X → {0, 1} be a rule governing whether the system requests a label for x ∈ X. Let h : X → ℝ be a hypothesis describing the automated system. We seek to minimize E[L(h(x), y) | r(x) = 0], where L is a loss function. We can think of this setting as combining the objective studied in abstention learning (minimizing E[L(h(x), y) | r(x) = 0]) with the feedback model studied in active learning (labels are only available when r(x) = 1). We introduce a new framework, which we call the dual purpose learning framework, that combines these elements.

35th Conference on Neural Information Processing Systems (NeurIPS 2021).

We first analyze the proper dual purpose labeling framework in an online setting, which proceeds as follows. At the start of each round t, the learner selects both a requester function r_t from some class R_τ, where τ is an upper bound on the request-rate of functions in this class, and a hypothesis h_t from some class H, using all the feedback available from the past. The learner's expected loss on this round is given by E[L(h_t(x), y) | r_t(x) = 0].
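To make the objective concrete, here is a small illustrative sketch (our own toy example, not from the paper) of estimating the conditional loss E[L(h(x), y) | r(x) = 0] from a labeled sample: the hypothesis is only scored on points the requester does not route for labeling.

```python
import numpy as np

def conditional_loss(h, r, X, y, loss):
    """Empirical estimate of E[L(h(x), y) | r(x) = 0]: the average loss of h
    on the examples the requester r does NOT route for labeling."""
    keep = np.array([r(x) == 0 for x in X])
    if keep.sum() == 0:
        return 0.0  # r requests everything, so h is never evaluated
    return float(np.mean([loss(h(xv), yv) for xv, yv in zip(X[keep], y[keep])]))

# Toy 1-D example: h predicts sign(x); r requests labels inside the margin.
h = lambda x: 1 if x >= 0 else -1
r = lambda x: 1 if abs(x) < 0.5 else 0
zero_one = lambda pred, label: float(pred != label)

X = np.array([-2.0, -1.0, -0.2, 0.3, 1.0, 2.0])
y = np.array([-1, -1, 1, -1, 1, 1])   # the two points near 0 are noisy
print(conditional_loss(h, r, X, y, zero_one))  # → 0.0: the noisy region is requested away
```

A never-requesting rule over the same data would instead be charged for the two noisy points, which is exactly the gap between this objective and the standard expected loss.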
The function r_t also determines the feedback available to the algorithm. If r_t(x) = 1, the learner observes (x, y), which was drawn i.i.d. from an unknown distribution; otherwise, only x is revealed and y is censored. The goal of the learner is to compete with the optimal choice of hypothesis and requester by minimizing the excess loss L_{H,R_τ}(h_t, r_t) = E[L(h_t(x), y) | r_t(x) = 0] − inf_{(h*,r*) ∈ H×R_τ} E[L(h*(x), y) | r*(x) = 0]. Our first main result is the surprising fact that, under mild assumptions, bounds on the excess loss that match the O(1/√t) generalization rates of full-feedback passive learning are not possible in the proper dual purpose labeling framework. This lower bound also suggests a relaxation of the proper dual purpose labeling framework, which we call improper dual purpose labeling. In the improper setting, the learner is still interested in learning h_t, r_t that minimize L_{H,R_τ}. However, the learner is allowed to request labels using a more powerful requester class than R_τ during training. As in classical PAC-learning results, improper learning allows the circumvention of the impossibility result. As a practical matter, the improper setting is useful when the system designer is willing to spend more resources during training. In our motivating example, the designer of the abuse-detection system might be willing to implement a more complicated system during training, which might include a larger budget (in dollars or man-power) for human intervention. However, after a time horizon T, training stops, and the designer commits to some h_T, r_T with r_T ∈ R_τ. From then on, examples satisfying r_T(x) = 0 are routed to h_T, and thus, we wish to characterize L_{H,R_τ}(h_T, r_T). In the improper setting, we demonstrate that IWAL, an algorithm from the active learning literature [Beygelzimer et al., 2009], can be adapted to our setting into an algorithm which we call DPL-IWAL.
Since our objective is no longer an expectation of some loss function, but rather the conditional expectation evaluated on the event that r(x) = 0, IWAL's standard analysis does not apply. A key technical hurdle is proving that estimates of this conditional loss concentrate at the right rate in order to attain generalization guarantees for the pair (h, r) returned by DPL-IWAL. We show that over a time horizon T, the DPL-IWAL algorithm requests O((τ + ℓ*)T) examples, where ℓ* = min_{(h,r) ∈ H×R_τ} E[L(h(x), y) | r(x) = 0] is the optimum value of our objective. At the same time, our main lower bound demonstrates that Ω((τ + ℓ*)T) requests are in fact necessary to compete with the best policy in H × R_τ. Finally, we conduct a thorough exploration of these techniques on a number of datasets. We first undertake an ablation study, using a finite hypothesis class, showing that DPL-IWAL outperforms baselines that either ignore the active learning or the abstention learning aspects of the problem. DPL-IWAL is not computationally efficient to implement since, just like IWAL, it maintains a version space, that is, a set of candidate hypotheses, which is non-trivial to optimize over in general. We therefore also conduct a number of experiments with an efficient heuristic inspired by our results using continuous hypothesis classes, outperforming natural baselines including margin sampling [Lewis and Gale, 1994, Balcan et al., 2007], which admits state-of-the-art performance for active learning problems in practice [Yang and Loog, 2016, Mussmann and Liang, 2018, Chuang et al., 2019].

1.1 Related Work

Our setting encompasses an objective considered in abstention learning (sometimes called selective classification), where the learner controls its evaluation region, that is, the region of the domain where the learner is evaluated, by abstaining on the complement.
In our setting, the evaluation region is where the learner does not request a label, and the complementary abstention region is where the learner makes a request. At the same time, our setting considers the feedback mechanism from active learning. These are tied in a specific way in our framework: feedback is only available on the abstention (i.e. requesting) region and no information is revealed about the evaluation region. Neither an abstention algorithm nor a standard active learning algorithm would work in our setting, and the algorithms in these settings can lead to incorrect solutions (see Appendix C). Nevertheless, we survey some of the relevant literature. Learning with abstention was first studied by Chow [1957, 1970] for specific practical applications. Subsequently, several authors analyzed algorithms [Bartlett and Wegkamp, 2008, Grandvalet et al., 2008, Yuan and Wegkamp, 2010, 2011] for this setting with an emphasis on developing margin-based rules for abstention. Along a different line of work, El-Yaniv and Wiener [2010, 2011] analyze the theoretical trade-off between the coverage of an abstention function and a classifier's performance when not abstaining. All these works share in common the assumption that the learner has offline access to fully labeled samples, unlike the setting considered in this work. A more recent line of work, Cortes et al. [2016a,b, 2018], considers a setting where the algorithms learn over two classes of functions, a hypothesis class and an abstention class. These works either assume full feedback is always available (e.g. Cortes et al. [2016a,b]), or, as in Cortes et al. [2018], that feedback is only available in the evaluation region, which is the exact opposite feedback mechanism from that of our setting, since we receive feedback in the abstention region. Similarly, Shekhar et al. [2021] consider a setting where feedback is always available.
In the active learning literature, several authors focus on analyzing margin-based active learning, which requests the labels of points close to a learned model's classification surface [Dasgupta et al., 2005, Balcan et al., 2007, Balcan and Long, 2013, Awasthi et al., 2014, 2015, Zhang, 2018, Huang et al., 2019, Zhang et al., 2020]. Other algorithms admit generalization guarantees on the same order as passive learning while proving that their label complexity, i.e. the number of points requested during learning, is bounded by a favorable rate. Beygelzimer et al. [2009] derived an algorithm, called IWAL, for general loss functions with strong theoretical guarantees. We wish to use this algorithm for our setting with the abstention learning objective. However, the loss function we consider is in fact a conditional expectation evaluated on an event, which is not amenable to standard IWAL. Our analysis and corresponding algorithm, DPL-IWAL, describe how to apply IWAL to such a setting. Finally, ideas from the abstention framework have been applied to active learning previously. Zhang and Chaudhuri [2014] used confidence-based predictors as a subroutine of an active learning algorithm. El-Yaniv and Wiener [2012] applied an abstention strategy from El-Yaniv and Wiener [2010] to the CAL algorithm of Cohn et al. [1994], proving theoretical guarantees, but only under specific model and distributional assumptions, which we do not make in our setting.

2 Setting and Preliminaries

In the dual purpose labeling framework, a learner is given a hypothesis class H with finite VC-dimension d, and a class of deterministic requester functions R ⊆ {X → {0, 1}}. Nature fixes a distribution D over X × Y, unknown to the learner. Given the marginal over X, and τ > 0, we denote by R_τ ⊆ R the subset of R with bounded request-rate, E_x[r(x) = 1] ≤ τ for all r ∈ R_τ.
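As a concrete illustration (our own sketch, not from the paper), an empirical proxy for the bounded-rate subclass R_τ can be built by filtering candidate requesters against their measured request rate on an unlabeled pool; all names below are our own.

```python
import numpy as np

def empirical_request_rate(r, X_pool):
    """Estimate E_x[1[r(x) = 1]] from an unlabeled pool."""
    return float(np.mean([r(x) for x in X_pool]))

def restrict_to_rate(requesters, X_pool, tau):
    """Keep only requesters whose estimated request rate is <= tau,
    i.e. an empirical proxy for the bounded-rate subclass R_tau."""
    return [r for r in requesters if empirical_request_rate(r, X_pool) <= tau]

rng = np.random.default_rng(0)
X_pool = rng.uniform(-1, 1, size=10_000)
# Margin-style requesters: request iff |x| < c, for a grid of widths c,
# with request rates of roughly 0.05, 0.2, and 0.6 under this marginal.
requesters = [lambda x, c=c: int(abs(x) < c) for c in (0.05, 0.2, 0.6)]
print(len(restrict_to_rate(requesters, X_pool, tau=0.25)))  # → 2
```

No labels are consumed by this step, which is why a rate constraint (unlike a loss constraint) is cheap to enforce.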
E_x[r(x) = 1] can be well estimated using unlabeled data, which is generally readily available, allowing a bound on the request-rate to be enforced in practice. The interaction between the learner and nature proceeds through a sequence of rounds t. The learner first selects (h_t, r_t) ∈ H × R_τ as a function of the past. Nature then draws an independent sample (x_t, y_t) from D, where x_t is revealed to the learner. If r_t(x_t) = 1, the learner additionally observes y_t, but the performance of h_t is not evaluated. If r_t(x_t) = 0, the learner does not observe y_t, but the performance of h_t is evaluated. Thus, the choice of r_t serves dual purposes: it determines whether the algorithm receives feedback, and whether its performance will be evaluated. Formalizing this further, we suppose that the learner is given a loss function L : Y × Y → ℝ. The learner seeks to output h_t that generalizes well on the region where r_t dictates that h_t should be evaluated. We therefore seek to bound the (conditional) excess loss L_{H,R_τ}(h_t, r_t) = E[L(h_t(x), y) | r_t(x) = 0] − inf_{(h*,r*) ∈ H×R_τ} E[L(h*(x), y) | r*(x) = 0]. In particular, we seek O(1/√t) bounds on L_{H,R_τ}(h_t, r_t), matching the generalization rate of full-feedback passive learning. The conditional loss of the best pair in H × R_τ plays an important role in our lower bounds, and so it is useful to define ℓ* = inf_{(h*,r*) ∈ H×R_τ} E_{x,y}[L(h*(x), y) | r*(x) = 0]. Finally, we call this the proper dual purpose framework since the algorithm is attempting to generalize well with respect to the class H × R_τ and labels are generated according to functions in R_τ. In the subsequent section, we will see that O(1/√t) bounds on L_{H,R_τ}(h_t, r_t) are impossible in general. This motivates the improper dual purpose framework, as introduced in the following section.

3 Lower Bound

In this section, we present a lower bound stating that it is impossible for any algorithm to achieve an O(1/√t) generalization rate in the proper dual purpose setting.
Thus, we here introduce the notion of improper algorithms. On each round t, an improper algorithm selects h_t ∈ H and an R_t : X → {0, 1} that is unconstrained for the purposes of showing a lower bound (e.g. R_t is not necessarily in the set R_τ). As in the proper setting, label requests are tied to the evaluation region. That is, the algorithm sees y_t if and only if R_t(x_t) = 1, and wishes to minimize L_{H,R_τ}(h_t, R_t). Notice that L_{H,R_τ} is still defined with respect to the reference class R_τ, and that the proper setting is a special case of the improper setting where R_t is equal to some r_t ∈ R_τ. The lower bound below shows that any algorithm with the desired generalization rate must satisfy (1/T) Σ_{t=1}^T E[R_t(x_t)] > τ. Thus, since E[r(x)] ≤ τ for every r ∈ R_τ, no proper algorithm can attain the desired rate. We prove a bound that holds for almost any possible classes of functions H, R, with some restrictions on R. We say that R separates X if, given any finite set of examples x_0, . . . , x_n ∈ X, there exists a point x̂ outside this finite set of points and a requester r ∈ R such that r(x̂) = 1 while r(x_0) = ··· = r(x_n) = 0. This condition is much stronger than is necessary for the lower bound, but is already satisfied by simple classes, for example when R contains linear separators and X is a ball in any dimension ≥ 2, and it becomes easier to satisfy as R becomes more complex. We need only a weaker condition that requires the separation property to hold for some set shattered by H. Definition 1. R separates X with respect to H if, with d = VCD(H), there exists a set of d examples x_0, . . . , x_{d−1} shattered by H, a point x̂ ∈ X, and an r ∈ R such that r(x̂) = 1 and r(x_0) = ··· = r(x_{d−1}) = 0. We first describe a distribution that follows the basic construction used to demonstrate lower bounds for pure active learning [Beygelzimer et al., 2009]. We then augment this distribution to include a region of mass ε that contains random noise.
Intuitively, one can think of an algorithm as requesting a label either to solve the active learning problem or to avoid loss on examples that it is uncertain about in the region of mass ε. We argue that the optimal algorithm, when not solving the active learning problem, spends εT labels requesting on the random noise. While this is the basic idea, the proof needs to preclude the possibility that an algorithm with a suboptimal prediction h benefits from requesting labels outside the region of mass ε purely for the purpose of avoiding loss (and not to solve the active learning problem). An algorithm can also request labels at a rate greater than τ on any given round with the hope of decreasing its overall label complexity over T rounds. The proof shows that neither of these strategies benefits an algorithm enough to deviate from the optimal strategy outlined in the previous paragraph. Definition 2. Fix any Δ ≥ 0. We say that a round t is a failed round if the algorithm selects a requesting strategy R_t : X → {0, 1} and hypothesis h_t ∈ H satisfying E[L(h_t(x), y) | R_t(x) = 0] ≥ min_{(h,r) ∈ H×R_τ} E[L(h(x), y) | r(x) = 0] + Δ. As a practical matter, we will also require that an improper algorithm outputs a model (h_T, r_T) ∈ H × R_τ at the end of its time horizon, where r_T comes from the reference class. This paves the way for algorithms discussed in the subsequent sections, which are applicable in systems that are able to tolerate a higher labeling overhead during a finite training horizon, but eventually need to converge on a model with request rate τ. Crucially, the lower bound establishes the minimum additional overhead during training as ℓ* (defined in Section 2), which is matched by our upper bounds. Theorem 1. Let L(h(x), y) = 1[yh(x) ≤ 0] be the misclassification loss. Given R that separates X with respect to H with d = VCD(H), let R_τ ⊆ R consist of requesters with bounded request-rate τ.
For any ε ≤ 1/4 and τ ≤ 1/2, there exists a distribution on X × {−1, 1} such that ℓ* = min_{(h,r) ∈ H×R_τ} E[L(h(x), y) | r(x) = 0] ≤ ε. Furthermore, there exists a sufficiently large T such that, with probability at least 1/2, any algorithm that (A) outputs (h_T, r_T) ∈ H × R_τ such that E[L(h_T(x), y) | r_T(x) = 0] ≤ ℓ* + √(d/T), and (B) suffers no more than T/2 failed rounds, requires that E[Σ_t R_t(x_t)] ≥ Ω((τ + ℓ*)T).

Algorithm 1: DPL-IWAL
Require: max iteration T > 0, V_1 = H × R_τ; let t_0 be the first time t such that t ≥ 16 log(t/δ)
for t ∈ [1, t_0]:
    Observe x_t in order to construct the estimates (1/(t−1)) Σ_{s=1}^{t−1} 1[r(x_s) = 0]
for t ∈ [t_0 + 1, T]:
    (h_t, r_t) ← argmin_{(h,r) ∈ V_t} L̂_{t−1}(h, r)
    Receive x_t
    if r_t(x_t) = 1: request label y_t
    p_t(x_t) ← min(1, max_{(h,r),(h',r') ∈ V_t} max_y | L(h(x_t), y) 1[r(x_t)=0] / ((1/(t−1)) Σ_{s=1}^{t−1} 1[r(x_s)=0]) − L(h'(x_t), y) 1[r'(x_t)=0] / ((1/(t−1)) Σ_{s=1}^{t−1} 1[r'(x_s)=0]) |)
    q_t ← Bernoulli(p_t)
    if q_t = 1: request or re-use label y_t
    V_{t+1} ← {(h, r) ∈ V_t : L̂_t(h, r) ≤ min_{(h',r') ∈ V_t} L̂_t(h', r') + Δ_t}
    if r_t(x_t) = 0 ∧ q_t = 0: predict label using sgn(h_t(x_t))
Return: (h_T, r_T)

The above theorem states that an algorithm must request the labels of at least Ω((τ + ℓ*)T) examples (in expectation) if we require that the pair returned by the algorithm generalizes at a rate approximating that of standard supervised learning. This then directly implies that R_t must be selected outside of the class R_τ, since this class only contains functions with a requesting rate of at most τ, resulting in label complexity at most O(τT) in expectation. All proofs can be found in Appendix A. Although we state the lower bound for classification loss, it can be extended to any loss function where mispredicting the sign of an example, yh(x) ≤ 0, implies L(h(x), y) ≥ C for some constant C. This is true for the logistic, hinge, squared, and absolute losses.

4 Dual Purpose Labeling Algorithm

In this section, we present our algorithm, DPL-IWAL (see Algorithm 1 for the pseudo-code), in the improper dual purpose framework.
At a high level, DPL-IWAL finds the pair (h, r) ∈ H × R_τ that minimizes the conditional loss E[L(h(x), y) | r(x) = 0], i.e. the expected loss of h conditioned on the event that the label is not requested by r. Intuitively, the best pair (h, r) requests the label of a point whenever the prediction sgn(h(x)) is likely to be incorrect. To find such a pair, at each round t, the algorithm first constructs an importance weighted estimate, L̂_t(h, r), of E[L(h(x), y) | r(x) = 0] using as few labeled points as possible, and then chooses the pair (h_t, r_t) that minimizes L̂_t(h, r). Ideally, the importance weighted estimates L̂_t(h, r) could be constructed from the set of points whose labels have been requested by r_1, . . . , r_t, but this set of points is a non-trivially biased sample of the underlying distribution. Moreover, these points could reside in regions of the space that are not useful for calculating E[L(h(x), y) | r(x) = 0]. To see this more clearly, consider a simple case where R_τ contains just one function r, so that the algorithm is simply finding argmin_{h ∈ H} E[L(h(x), y) | r(x) = 0]. The points for which r requests the label, meaning points where r(x) = 1, do not reveal any information necessary to estimate E[L(h(x), y) | r(x) = 0] for h ∈ H. Thus, the algorithm must label other regions of the space. Below, we describe how the algorithm uses a subset of the points requested by the r_t ∈ R_τ in conjunction with some additional carefully chosen points, selected via a sampling rule q_t outside R_τ, to construct unbiased estimators L̂_t(h, r). This fact makes DPL-IWAL an improper dual purpose algorithm.
Similarly to IWAL [Beygelzimer et al., 2009], the DPL-IWAL algorithm constructs an importance weighted estimate, but instead of estimating the expected loss as is done in IWAL, we craft an estimate of the conditional losses: for t > t_0,

L̂_t(h, r) = (1/(t − t_0)) Σ_{s=t_0+1}^{t} (q_s / p_s(x_s)) · L(h(x_s), y_s) 1[r(x_s) = 0] / ((1/(s−1)) Σ_{s'=1}^{s−1} 1[r(x_{s'}) = 0]),

where the coin q_s ∈ {0, 1} is flipped with bias probability p_s(x_s), and where t_0 is the first time t such that t ≥ 16 log(t/δ). Note that for the first s ∈ [1, t_0], we simply observe the features x_s in order to construct the running averages (1/(s−1)) Σ_{s'=1}^{s−1} 1[r(x_{s'}) = 0], which are non-zero with high probability; these are needed in the denominator of L̂_t(h, r). Ignoring the q_s and p_s(x_s) for now, the numerator (1/(t − t_0)) Σ_{s=t_0+1}^{t} L(h(x_s), y_s) 1[r(x_s) = 0] is a measure of the joint expectation E[L(h(x), y) 1[r(x) = 0]], while the denominator contains running averages of Pr[r(x) = 0]. Roughly speaking, by considering the ratio of these two terms, we estimate the conditional expected loss, E[L(h(x), y) | r(x) = 0] = E[L(h(x), y) 1[r(x) = 0]] / Pr[r(x) = 0]. The algorithm maintains a version space, V_t, as defined in Algorithm 1, which it reduces at each round. We prove that it suffices to use a slack term Δ_t = O(√((1/t) log(1/δ))) in order to ensure, with high probability, that (h*, r*) remain within the version space as it shrinks. The O(·) hides constants and log(t|H × R_τ|) factors; see the appendix for exact constants. In order to reduce the number of labeled points used to construct the importance-weighted estimates, the probability of requesting a point, p_t, is defined by the (estimated) conditional loss difference between pairs of functions in this shrinking set V_t. Given the above, the algorithm's overall requesting rule, R_t, is thus defined by the following condition: R_t(x_t) = 1 if and only if r_t(x_t) = 1 ∨ q_t = 1, where r_t ∈ R_τ and the coin q_t is not governed by any function in R_τ.
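The structure of this estimator can be sketched as follows (our own simplification, not the paper's implementation: we use a single plug-in estimate of Pr[r(x) = 0] from all observed features rather than per-round running averages, and all names are our own):

```python
import numpy as np

def iw_conditional_loss(h, r, rounds, loss):
    """Importance-weighted estimate of E[L(h(x), y) | r(x) = 0].
    Each round is (x, y, q, p): the coin q ~ Bernoulli(p) decided whether y
    was observed. Rounds with q = 0 contribute 0 to the numerator, while
    observed rounds are reweighted by 1/p, keeping the numerator unbiased;
    the denominator is a plug-in estimate of Pr[r(x) = 0]."""
    p_keep = np.mean([1 - r(x) for x, _, _, _ in rounds])
    num = np.mean([(q / p) * loss(h(x), y) * (1 - r(x)) if q else 0.0
                   for x, y, q, p in rounds])
    return float(num / p_keep)

h = lambda x: 1 if x >= 0 else -1
r = lambda x: 1 if abs(x) < 0.5 else 0        # request near the boundary
zero_one = lambda pred, label: float(pred != label)

# Five logged rounds; with p = 1 every label was observed, so the estimate
# reduces to the plain conditional empirical loss: one error among four
# non-requested points, each with mass 1/5, divided by Pr[r(x)=0] = 4/5.
rounds = [(-2.0, -1, 1, 1.0), (-1.0, -1, 1, 1.0), (0.2, -1, 1, 1.0),
          (1.0, -1, 1, 1.0), (2.0, 1, 1, 1.0)]
print(iw_conditional_loss(h, r, rounds, zero_one))  # → 0.25
```

With p < 1 the same function remains an unbiased estimate of the numerator, which is the point of the q_s/p_s reweighting.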
5 Generalization and Label Complexity Guarantees

In this section, we present a series of theoretical guarantees that analyze the performance of our approach as compared to different baselines, as well as prove an upper bound on the expected number of label requests made by the DPL-IWAL algorithm. In our framework, we seek to select a hypothesis h_t which incurs minimal loss whenever a label request is not made. Below, we prove an upper bound on this type of loss in terms of the best pair of functions, (h*, r*) = argmin_{(h,r) ∈ H×R_τ} E[L(h(x), y) | r(x) = 0]. Theorem 2. Given any τ < 1/2, for any δ > 0, with probability at least 1 − δ, for all t ≥ 16 log(3t/δ), E[L(h_t(x), y) | r_t(x) = 0] ≤ E[L(h*(x), y) | r*(x) = 0] + O(√((1/t) log(1/δ))). This guarantee states that the pair (h_t, r_t) chosen by the algorithm converges to the best pair (h*, r*) as a function of the time t with respect to the conditional loss. The assumption t ≥ 16 log(3t/δ) is a mild condition; for example, if δ = 0.0001 we then require t > 104. Also, in most standard applications, only a small fraction of examples can be labeled, so it is natural to assume that τ < 1/2. Nevertheless, this assumption can be relaxed by increasing the constraint on t in the bound. Overall, the theoretical analysis, which is given in Appendix B, departs from standard derivations since we need to carefully deal with conditional losses and the constructed estimates L̂_t(h, r). More concretely, we first must ensure that the denominator of the estimate L̂_t(h, r) is non-zero with high probability, as otherwise the estimate would not be well defined. To do so, we start the labeling process of our algorithm only after t_0 examples have been observed. After t_0 examples have been observed, and using the fact that Pr[r(x) = 0] ≥ 1 − τ, we can then prove that the condition Σ_{s'=1}^{s−1} 1[r(x_{s'}) = 0] > 0 holds with high probability (Lemma 1).
Then, in order to apply Azuma's inequality to conditional losses, we need to prove that L(h(x_s), y_s) 1[r(x_s) = 0] / ((1/(s−1)) Σ_{s'=1}^{s−1} 1[r(x_{s'}) = 0]) is bounded by a favorable constant despite the variable denominator term (Lemma 2). Azuma's inequality implies that the estimates converge to E[L̂_t(h, r)], but this is not enough since we want to prove guarantees in terms of E[L(h(x), y) | r(x) = 0]. Thus, using a series of concentration inequalities, we prove that the expected value of the estimate, E[L̂_t(h, r)], converges to the expected conditional loss E[L(h(x), y) | r(x) = 0] at the desired rate (see proof of Theorem 2). Next, we compare the quality of the predictions of our approach to that of simply predicting according to the best-in-class as measured by the non-conditional loss, h_b = argmin_{h ∈ H} E[L(h(x), y)]. This comparison quantifies the potential benefit of our framework as compared to that of supervised learning, since h_b is the hypothesis that an algorithm in the supervised setting is attempting to learn. In the next corollary, we consider the never-requester function r̄, that is, 1[r̄(x) = 0] = 1 for all x ∈ X. The never-requester is in R_τ for any value of τ > 0. The never-requester trivially does not increase the label complexity, and since it is a single function, it also does not discernibly augment the complexity of the class R_τ, so it can be included in R_τ at effectively no cost. Corollary 1. Given any τ < 1/2, for any δ > 0, with probability at least 1 − δ, for all t ≥ 16 log(3t/δ), E[L(h_t(x), y) | r_t(x) = 0] ≤ E[L(h_b(x), y)] + γ + O(√((1/t) log(1/δ))), where γ = E[L(h*(x), y) | r*(x) = 0] − E[L(h_b(x), y)]. Furthermore, if r̄ ∈ R_τ, then γ ≤ 0. The above corollary states that the predictions of the chosen function h_t when not requesting admit strictly fewer mistakes than the predictions of h_b, the best-in-class in H, whenever γ < −O(√((1/t) log(1/δ))). The value γ characterizes the difference between the best-in-class in our setting and the best-in-class in supervised learning.
The more negative this term is, the fewer mistakes are made. In an empirical study in Appendix D, we show that typically γ is significantly smaller than 0. To derive label complexity guarantees, we define a disagreement coefficient for conditional losses, directly derived from the coefficient definitions in Hanneke [2007] and Beygelzimer et al. [2009]. Let ρ((h, r), (h', r')) = E[| L(h(x), y) 1[r(x)=0] / Pr[r(x)=0] − L(h'(x), y) 1[r'(x)=0] / Pr[r'(x)=0] |] be a measure of the distance between two pairs (h, r) and (h', r'). Based on this metric, we define the ball around the best pair (h*, r*) as B(h*, r*, ω) = {(h, r) ∈ H × R_τ : ρ((h, r), (h*, r*)) ≤ ω}. The disagreement coefficient θ is the infimum value of θ > 0 such that for all ω ≥ ℓ*: E[ max_{(h,r) ∈ B(h*,r*,ω)} max_y | L(h(x), y) 1[r(x)=0] / Pr[r(x)=0] − L(h*(x), y) 1[r*(x)=0] / Pr[r*(x)=0] | ] ≤ θω. The next theorem bounds the expected number of requested points needed to construct the estimates L̂_t(h, r) in terms of the coefficient θ. Theorem 3. Given any τ < 1/2, for all δ > 0, with probability at least 1 − δ, Σ_{s=1}^T E[p_s(x_s)] = O(θℓ*T + √T), where θ is the disagreement coefficient. Since the labeling rate of r_t is at most τ, the label complexity of the DPL-IWAL algorithm is then given by Σ_{s=1}^T E[1[r_s(x_s) = 1]] + E[p_s(x_s)] = O((τ + θℓ*)T + √T). Assume that E[L(h_t(x), y) | R_t(x) = 0] ≤ E[L(h_t(x), y) | r_t(x) = 0]; intuitively, this says that using q_t in addition to r_t to make requests helps reduce our conditional loss, and it holds in practice, as shown by our experiments in Appendix D. It then follows, by Theorem 2, that E[L(h_t(x), y) | R_t(x) = 0] ≤ ℓ* + O(√((1/t) log(1/δ))), i.e. with high probability failed rounds do not occur. Thus, in this case, DPL-IWAL exhibits an upper bound on the label complexity that matches the lower bound stated in the previous section, apart from o(T) terms, namely √T. An even sharper bound for the o(T) term is possible, using analysis similar to that of the EIWAL algorithm in Cortes et al. [2019], resulting in a rate of √(ℓ*T).
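For intuition, the pairwise distance used to define the disagreement coefficient (denoted ρ here) can be estimated empirically; the following is our own illustrative sketch, using plug-in estimates of the Pr[r(x) = 0] denominators and toy data:

```python
import numpy as np

def rho(h1, r1, h2, r2, X, y, loss):
    """Empirical version of the distance rho((h,r),(h',r')): the expected
    absolute difference of the two normalized conditional-loss integrands
    L(h(x), y) 1[r(x)=0] / Pr[r(x)=0], with Pr[r(x)=0] estimated from X."""
    d1 = np.mean([1 - r1(x) for x in X])
    d2 = np.mean([1 - r2(x) for x in X])
    return float(np.mean([abs(loss(h1(x), yx) * (1 - r1(x)) / d1
                            - loss(h2(x), yx) * (1 - r2(x)) / d2)
                          for x, yx in zip(X, y)]))

h = lambda x: 1 if x >= 0 else -1
r_narrow = lambda x: 1 if abs(x) < 0.3 else 0
r_wide = lambda x: 1 if abs(x) < 1.5 else 0
zero_one = lambda pred, label: float(pred != label)

X = np.array([-2.0, -1.0, -0.2, 0.2, 1.0, 2.0])
y = np.array([-1, 1, 1, -1, 1, 1])   # noisy labels near the boundary
# Same hypothesis, different requesters: they disagree on the noisy point
# at x = -1, which only the wide requester removes from evaluation.
print(rho(h, r_narrow, h, r_wide, X, y, zero_one))  # → 0.25
```

Identical pairs have distance 0, and pairs that evaluate the hypothesis on different noisy regions are pushed apart, which is what the ball B(h*, r*, ω) measures.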
The analysis that proves the label complexity bound carefully deals with the denominator term of the estimates of the conditional loss, as well as with the fact that we are working with two function classes. Similarly to the generalization bound analysis, we first prove that the denominator term is well behaved and is not too far from its mean. Then, we leverage the disagreement coefficient and the shrinking version space over the two classes. In this paper, we analyzed the scenario where the requesting functions r_t ∈ R_τ are constrained to request at most a τ fraction of the time, and we minimize the conditional loss over this set of requesters. One could instead consider a Lagrangian relaxation of this constraint and pay a fixed cost c for each request. That is, consider the cost-based loss defined by 1[r(x) = 0] L(h(x), y) + c · 1[r(x) = 1]. Both of these loss views are important and in fact have been analyzed in the abstention setting (e.g., see El-Yaniv and Wiener [2010] for conditional loss and Cortes et al. [2016a] for cost-based loss). Despite focusing on conditional loss over constrained requester functions, the theory and algorithms in this paper can be extended to the cost-based loss. In particular, for the algorithm not to incur too many mistakes during training and to return a pair of functions that generalizes well, the algorithm must select R_t outside of the class of requesters. The version of DPL-IWAL for the cost-based loss will thus require requesting according to both q_t and r_t.

6 Empirical Investigation

We start our empirical investigation by corroborating the theoretical insights made in the previous sections with an ablation study of DPL-IWAL.
Since IWAL needs to solve a computationally intractable constrained optimization problem over its version space, which is only made more challenging in our setting by the joint optimization over H × R_τ, we use finite classes for these initial experiments. Then, we turn to a more practical algorithm, inspired by the DPL-IWAL algorithm, that leverages readily available optimization routines and empirically effective heuristics.

DPL-Simplified Algorithm
Require: max iteration T > 0, classes H, R_τ
for t ∈ [1, T]:
    h_t ← argmin_{h ∈ H} Σ_{s=1}^{t−1} q_s L(h(x_s), y_s)
    r_t ← argmin_{r ∈ R_τ} Σ_{s=1}^{t−1} L(h_t(x_s), y_s) 1[r(x_s) = 0] / Σ_{s'=1}^{t−1} 1[r(x_{s'}) = 0]
    Receive x_t
    if r_t(x_t) = 1: q_t ← 1, request label y_t
    else: q_t ← 0, predict label using sgn(h_t(x_t))
Return: (h_T, r_T)

Figure 1: On the left, the number of online mistakes made while processing a stream of data, and the held-out conditional loss on non-requested points, for DPL-IWAL and baseline comparators. The plots show the mean and standard deviation over 10 trials. On the right, the pseudo-code of the DPL-Simplified Algorithm.

Ablation Study: We compare the performance of DPL-IWAL against baselines which ignore either the active learning and/or the abstention aspects of dual purpose labeling, in order to demonstrate that indeed both aspects are necessary. To do so, we compare our algorithm to several variants, defined as follows. The fully uniform baseline simply decides to request a label using a random biased coin-flip independent of the example, thereby mimicking standard passive supervised learning. The uniform r_t baseline uses the DPL-IWAL algorithm in conjunction with a trivial requester class that fixes its output uniformly at random, independent of the input x_t. The uniform q_t baseline is similar to DPL-IWAL, with outcomes q_t determined by coin-flips with a fixed bias, independent of x_t. For this experiment, we define a set of requester regions near the margin of the classification surface, each covering a mass τ of the overall distribution.
Specifically, for any set of real-valued hypotheses H, where the classification of point x made by h ∈ H is defined as sgn(h(x)), we define a margin-based requester function class as:

R_{τ,H} = { r_h : x ↦ 1[|h(x)| ≤ θ_{h,τ}(D_X)] : h ∈ H },

where D_X is the marginal distribution on X and θ_{h,τ}(D_X) is the largest threshold value that satisfies Pr_{D_X}[r_h(x) = 1] ≤ τ. Note that the θ_{h,τ} threshold can be estimated using unlabeled data. We test six publicly available datasets [Chang and Lin] and, for each, we use linear logistic regression models trained using the Python scikit-learn library. For all datasets, we construct a finite hypothesis class H and a matching finite margin-based requester set R_{τ,H}. We then stream unlabeled examples to each algorithm by sampling without replacement from a training split. The process is repeated 10 times, using a different random train/test split for each trial. For details, see Appendix D. In Figure 1 (left), we compare each method using two metrics. First, we consider the number of incorrect predictions (i.e., online mistakes) that the model makes while processing the stream of unlabeled training data, where label requests are spared mistakes. Second, we consider the conditional loss of the currently selected pair (h_t, r_t) by measuring the average misclassification loss on non-requested points using a held-out test split. See Appendix D for our results on all datasets, which all show a similar pattern. Overall, these results show that our DPL-IWAL algorithm, with both a non-trivial sampling function q_t and requesting function r_t, outperforms all baseline methods. This indicates that both the active learning and abstention aspects of DPL-IWAL are necessary, since naively applying an active learning algorithm without abstention (e.g., uniform r_t) or naively applying an abstention algorithm without active learning (e.g., uniform q_t) yields suboptimal results.
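Since θ_{h,τ} is the largest threshold whose request rate stays at or below τ, it can be estimated on unlabeled data as the empirical τ-quantile of the margin magnitudes |h(x)|. The following is a minimal numpy sketch of that estimate (the synthetic scores and variable names are ours, purely illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
scores = rng.normal(size=10_000)   # stand-in for h(x) on an unlabeled sample
tau = 0.2                          # target request rate

# Largest threshold with empirical request rate <= tau: the tau-quantile
# of the margin magnitudes |h(x)|.
theta = np.quantile(np.abs(scores), tau)

# Fraction of points the resulting requester would route to labeling.
request_rate = np.mean(np.abs(scores) <= theta)
```

By construction, `request_rate` comes out to roughly τ, up to a 1/n discretization error on the empirical sample.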
DPL-Simplified Algorithm: Here, we consider the DPL-Simplified Algorithm (Figure 1, right), where the joint optimization problem over H × R_τ is split into two separate optimizations. This piece-wise optimization may not arrive at the same solution as the joint optimization, but we find it to be an empirically effective proxy. The first optimization over H is a standard learning problem over the currently labeled examples, and any off-the-shelf hypothesis class and training algorithm can be used. The second optimization over R_τ is still non-trivial to solve. However, the objective does provide the intuition that an optimal requester seeks to cover all of classifier h_t's mistakes.

                    # Online Mistakes                         Conditional Test Error (%)
dataset     KNN          MAR          MIX            KNN           MAR           MIX
a9a         2017 ± 35    1531 ± 44    1713 ± 51      12.1 ± 0.39   8.9 ± 0.44    10.3 ± 0.30
acoustic    1654 ± 58    1458 ± 52    1418 ± 51      9.4 ± 0.35    8.1 ± 0.13    7.8 ± 0.28
cifar       904 ± 30     849 ± 38     701 ± 26       5.0 ± 0.23    5.1 ± 0.14    3.8 ± 0.09
cod-rna     644 ± 27     360 ± 13     329 ± 12       3.1 ± 0.11    2.2 ± 0.06    1.5 ± 0.07
covtype     3471 ± 89    3365 ± 89    3318 ± 78      18.6 ± 0.30   19.3 ± 0.18   17.9 ± 0.28
HIGGS       5862 ± 79    5500 ± 90    5689 ± 69      35.2 ± 0.31   32.7 ± 0.25   34.0 ± 0.27
ijcnn       654 ± 52     420 ± 20     310 ± 30       2.2 ± 0.28    2.4 ± 0.16    0.7 ± 0.14
mnist       1489 ± 69    1032 ± 26    958 ± 50       6.7 ± 0.31    5.1 ± 0.09    4.1 ± 0.12
shuttle     29 ± 16      122 ± 27     21 ± 8         0.1 ± 0.05    0.8 ± 0.09    0.1 ± 0.05
skin        61 ± 64      662 ± 110    36 ± 31        0.1 ± 0.02    4.0 ± 0.89    0.02 ± 0.01

Figure 2: The left side of the table displays the mean and standard deviation over 10 trials of the total number of online mistakes after processing 20,000 examples for each of the requester strategies (all limited to requesting labels for 20% of examples). The right side of the table shows the mean and standard deviation of the conditional misclassification loss Pr(h(x) ≠ y | r(x) = 0), measured on the test set, for (h_T, r_T). On the right, histograms show the number of errors made by a partially-trained linear hypothesis h as a function of the model confidence for two datasets.
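Concretely, the two separate argmins of DPL-Simplified can be sketched over small finite classes as follows (a toy 1-d example of our own; the data and all names are purely illustrative):

```python
# Minimal sketch of DPL-Simplified's two separate argmins over finite
# classes H and R (toy 1-d data; all names here are illustrative).
def zero_one(pred, y):
    return float((1 if pred > 0 else -1) != y)

# Finite hypothesis class: 1-d threshold classifiers h(x) = x - b.
H = [lambda x, b=b: x - b for b in (-0.5, 0.0, 0.5)]
# Finite margin-style requester class: request when |x| is small.
R = [lambda x, w=w: int(abs(x) < w) for w in (0.1, 0.3)]

# History of (x_s, y_s, q_s) triples; q_s = 1 iff the label was requested.
history = [(-1.0, -1, 1), (-0.2, 1, 1), (0.3, -1, 1), (1.2, 1, 1)]

# First optimization: h_t <- argmin over H of the importance-weighted loss.
h_t = min(H, key=lambda h: sum(q * zero_one(h(x), y) for x, y, q in history))

# Second optimization: r_t <- argmin over R of h_t's empirical conditional
# loss on the non-requested points (guarding against an empty denominator).
def cond_loss(r):
    kept = [(x, y) for x, y, _ in history if r(x) == 0]
    if not kept:
        return float("inf")
    return sum(zero_one(h_t(x), y) for x, y in kept) / len(kept)

r_t = min(R, key=cond_loss)
```

Note how the second objective is a ratio: its denominator counts the non-requested points, which is precisely the term the label complexity analysis has to control.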
This suggests an approximate solution, where we train a requester r that seeks to classify the incorrect predictions of a fixed hypothesis h_t over the set of labeled examples seen thus far. At the same time, notice that the simplified algorithm fixes q_t = r_t(x_t) (and no longer needs to solve IWAL's constrained optimization problem). This makes the requester function responsible not just for sampling regions of the space that H cannot correctly capture, but also for sampling examples that are effective for training. To cover the regions where the classifier is incorrect, we leverage what we call a KNN-Requester function. In particular, we use scikit-learn's KNeighborsClassifier to train a non-parametric model to predict hypothesis h_t's training mistakes. The resulting requester is

r_{KNN,h,θ}(x) = 1[|1 − Pr(h(x) ≠ y | x)| < θ],

where Pr(h(x) ≠ y | x) denotes the probability that h makes a mistake according to the KNN model and θ indicates that the classifier's threshold has been tuned so that the requester labels approximately a τ fraction of examples. To select points that are effective for training the classifier, we borrow intuition from the simple yet empirically very effective margin (or uncertainty) active learning algorithm, which samples the examples that the current model is least confident on (i.e., examples closest to the decision surface). This leads us to the Mixture-Requester (MIX) function, which merges the margin and KNN strategies. Specifically, we uniformly combine the probability score produced by the KNN model underlying r_{KNN,h,θ} and the margin score derived from h_t(x) as follows:

r_{MIX,h,τ}(x) = 1[|1 − Pr(h(x) ≠ y | x)| + |σ(h(x)) − 0.5| < θ_{h,τ}],

where σ is a normalizing function that maps the input into [0, 1], which in our case is the output of scikit-learn's predict_proba method. Thus, this requester seeks to sample points that both cover mistakes of the classifier and are effective for training.
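A rough scikit-learn sketch of how such a mixture score could be assembled is given below. This is our illustration, not the paper's code: the synthetic data and all names are ours, and it assumes h makes at least one training mistake so that the KNN model sees both classes.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 2))
y = (X[:, 0] + 0.3 * rng.normal(size=500) > 0).astype(int)  # noisy labels

# Train the classifier h, then a KNN model to predict where h errs.
h = LogisticRegression().fit(X, y)
mistakes = (h.predict(X) != y).astype(int)   # assumes h errs somewhere
knn = KNeighborsClassifier(n_neighbors=15).fit(X, mistakes)
mistake_col = int(np.where(knn.classes_ == 1)[0][0])

def mix_scores(X_new):
    """Smaller score => more worth requesting a label."""
    p_mistake = knn.predict_proba(X_new)[:, mistake_col]
    knn_term = np.abs(1.0 - p_mistake)                  # small where h errs
    margin_term = np.abs(h.predict_proba(X_new)[:, 1] - 0.5)
    return knn_term + margin_term

# Request labels for the tau fraction of points with the smallest score.
tau = 0.2
scores = mix_scores(X)
request = scores <= np.quantile(scores, tau)
```

Points flagged by `request` would be routed to the labeler; the remaining points receive the prediction sgn(h(x)).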
In addition to the above, we evaluate the simpler Margin-Requester to serve as a natural, yet effective, baseline:

r_{MAR,h,τ}(x) = 1[|σ(h(x)) − 0.5| < θ_{h,τ}].

This requester essentially mimics the behavior of uncertainty-based active learning without any regard to the DPL setting. In the following experiments, the hypothesis class H is the set of linear models with bounded L2 norm, trained using scikit-learn's LogisticRegression implementation, and we use 10 publicly available datasets, all of which we cast as binary classification problems (see Appendix D for details). We execute a batch variant of DPL-Simplified, where at each iteration we process a batch of 5,000 examples, querying 20% of the examples for their labels and making predictions for the rest. Upon completion of the iteration, we receive the requested labels and update the choice of (h, r). This larger batch size more closely reflects practical learning settings, where it is impractical to re-train the model after every single label query [Amin et al., 2020]. All methods are seeded with 500 randomly sampled initial examples, and each experiment is run for 10 trials. In Figure 2, we show both the number of online errors incurred during the iterative labeling/training procedure and the average conditional misclassification loss of the final (h, r) pair on a test set. In Appendix D, we plot the full learning curves (Figure 8) associated with Figure 2. Overall, MIX is very effective, outperforming the baseline methods on 8 out of 10 datasets. To better understand when MIX outperforms the Margin-Requester, we present a histogram of errors as a function of the confidence of a model trained with a set of 3,500 examples sampled uniformly at random (Figure 2, right). The histogram for the HIGGS dataset, where the margin baseline performs well, shows that most of the errors are highly concentrated around the minimum model certainty region, i.e., where the model prediction score is close to 0.5.
In contrast, the histogram for cod-rna, where the DPL-inspired Mixture-Requester excels, shows that while there are some errors concentrated around the 0.5 threshold, there are also a number of errors far away from the model decision boundary.

7 Conclusion

We introduced a new setting which models the relationship between labeling and learning in real systems. We derived a lower bound in terms of the abstention rate of a reference class and the optimum value of our objective. We presented an algorithm, DPL-IWAL, that admits strong generalization and label complexity guarantees, and a more efficient variant, DPL-Simplified. Finally, we reported experiments which corroborate our theoretical findings and demonstrate that our algorithm outperforms natural baselines, including the ubiquitous margin sampling algorithm.

References

Kareem Amin, Corinna Cortes, Giulia De Salvo, and Afshin Rostamizadeh. Understanding the effects of batching in online active learning. In The 23rd International Conference on Artificial Intelligence and Statistics, 2020.
Pranjal Awasthi, Maria-Florina Balcan, and Philip M. Long. The power of localization for efficiently learning linear separators with noise. In Proceedings of the Forty-Sixth Annual ACM Symposium on Theory of Computing, pages 449-458. ACM, 2014.
Pranjal Awasthi, Maria-Florina Balcan, Nika Haghtalab, and Ruth Urner. Efficient learning of linear separators under bounded noise. In Conference on Learning Theory, pages 167-190, 2015.
Maria-Florina Balcan and Phil M. Long. Active and passive learning of linear separators under log-concave distributions. In Conference on Learning Theory, pages 288-316, 2013.
Maria-Florina Balcan, Andrei Broder, and Tong Zhang. Margin based active learning. In Proceedings of COLT, pages 35-50. Springer, 2007.
Peter Bartlett and Marten Wegkamp. Classification with a reject option using a hinge loss. JMLR, pages 291-307, 2008.
Alina Beygelzimer, Sanjoy Dasgupta, and John Langford. Importance weighted active learning.
In Proceedings of the 26th International Conference on Machine Learning, pages 49-56. ACM, 2009.
Chih-Chung Chang and Chih-Jen Lin. LIBSVM datasets. https://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/. Accessed: 2021-05-28.
C. K. Chow. An optimum character recognition system using decision functions. IEEE T. C., 1957.
C. K. Chow. On optimum recognition error and reject trade-off. IEEE T. C., 1970.
Galen Chuang, Giulia De Salvo, Lazarus Karydas, Jean-Francois Kagy, Afshin Rostamizadeh, and A. Theeraphol. Active learning empirical study. In NeurIPS 2019 LIRE Workshop, 2019.
David Cohn, Les Atlas, and Richard Ladner. Improving generalization with active learning. Machine Learning, 15(2):201-221, 1994.
Corinna Cortes, Giulia De Salvo, and Mehryar Mohri. Learning with rejection. In ALT, pages 67-82. Springer, Heidelberg, Germany, 2016a.
Corinna Cortes, Giulia De Salvo, and Mehryar Mohri. Boosting with abstention. In NIPS. MIT Press, 2016b.
Corinna Cortes, Giulia De Salvo, and Mehryar Mohri. Online learning with abstention. In ICML, 2018.
Corinna Cortes, Giulia De Salvo, Claudio Gentile, Mehryar Mohri, and Ningshan Zhang. Region-based active learning. In International Conference on Artificial Intelligence and Statistics, 2019.
Ido Dagan and Sean P. Engelson. Committee-based sampling for training probabilistic classifiers. In Machine Learning Proceedings 1995, pages 150-157. Elsevier, 1995.
Sanjoy Dasgupta, Adam Tauman Kalai, and Claire Monteleoni. Analysis of perceptron-based active learning. In Conference on Learning Theory, pages 249-263. Springer Berlin Heidelberg, 2005.
Sanjoy Dasgupta, Daniel J. Hsu, and Claire Monteleoni. A general agnostic active learning algorithm. In Advances in Neural Information Processing Systems, pages 353-360, 2008.
Ran El-Yaniv and Yair Wiener. On the foundations of noise-free selective classification. JMLR, 2010.
Ran El-Yaniv and Yair Wiener. Agnostic selective classification. In NIPS, 2011.
Ran El-Yaniv and Yair Wiener.
Active learning via perfect selective classification. JMLR, 2012.
Yves Grandvalet, Joseph Keshet, Alain Rakotomamonjy, and Stephane Canu. Support vector machines with a reject option. In NIPS, 2008.
Steve Hanneke. A bound on the label complexity of agnostic active learning. In ICML, 2007.
Boshuang Huang, Sudeep Salgia, and Qing Zhao. Disagreement-based active learning in online settings. arXiv preprint arXiv:1904.09056, 2019.
David Lewis and William Gale. A sequential algorithm for training text classifiers. In Proceedings of SIGIR, 1994.
Stephen Mussmann and Percy Liang. Uncertainty sampling is preconditioned stochastic gradient descent on zero-one loss. In Proceedings of NeurIPS, 2018.
Shubhanshu Shekhar, Mohammad Ghavamzadeh, and Tara Javidi. Active learning for classification with abstention. IEEE Journal on Selected Areas in Information Theory, 2021.
Yazhou Yang and Marco Loog. A benchmark and comparison of active learning for logistic regression. arXiv preprint arXiv:1611.08618, 2016.
Ming Yuan and Marten Wegkamp. Classification methods with reject option based on convex risk minimization. Journal of Machine Learning Research, 2010.
Ming Yuan and Marten Wegkamp. Support vector machines with a reject option. Bernoulli, 2011.
Chicheng Zhang. Efficient active learning of sparse halfspaces. In Conference on Learning Theory, 2018.
Chicheng Zhang and Kamalika Chaudhuri. Beyond disagreement-based agnostic active learning. In Advances in Neural Information Processing Systems, pages 442-450, 2014.
Chicheng Zhang, Jie Shen, and Pranjal Awasthi. Efficient active learning of sparse halfspaces with arbitrary bounded noise. Advances in Neural Information Processing Systems, 33, 2020.