Online Active Learning with Surrogate Loss Functions

Giulia De Salvo, Google Research, New York, NY 10011, giuliad@google.com
Claudio Gentile, Google Research, New York, NY 10011, cgentile@google.com
Tobias Sommer Thune, University of Copenhagen, Copenhagen, DK, tobias.thune@gmail.com

We derive a novel active learning algorithm in the streaming setting for binary classification tasks. The algorithm leverages weak labels to minimize the number of label requests, and trains a model to optimize a surrogate loss on a resulting set of labeled and weak-labeled points. Our algorithm jointly admits two crucial properties: theoretical guarantees in the general agnostic setting and strong empirical performance. Our theoretical analysis shows that the algorithm attains favorable generalization and label complexity bounds, while our empirical study on 18 real-world datasets demonstrates that the algorithm outperforms standard baselines, including the Margin Algorithm (or Uncertainty Sampling), a high-performing active learning algorithm favored by practitioners.

1 Introduction

While unlabeled data is generated at increasing rates, supervised learning often requires large portions of this data to be annotated by human raters, which carries significant costs. Active learning seeks to significantly alleviate this problem through algorithms that attain performance comparable to passive learning (i.e., to full supervision) while using considerably fewer labels. Because of the ever-increasing use of labeled data in modern machine learning approaches, this calls for the development of more and more effective active learning techniques. In this paper, we are specifically interested in binary classification problems in the so-called streaming (or online) setting of active learning.
In this setting, learning proceeds in a sequence of rounds, where in each round an unlabeled sample is received, and the learner decides on-the-fly whether or not to observe the associated binary label. Now, it is common practice, in active learning or machine learning in general, that even for binary classification problems, the loss used both for training and for evaluating statistical performance is not the zero-one loss, but some surrogate function thereof. This results in more tractable (e.g., convex) optimization problems in the training phase, but also in more calibrated metrics in the evaluation phase. For instance, we may be compelled to output estimates of class probabilities, say, the success rate of a given set of decisions, and thus deploy a maximum posterior probability estimator for such probabilities. Then, given a training set with labels y ∈ {±1}, and if p̂(x) is an estimate of the conditional class probability P[y = +1 | x], a standard goal is to minimize the log-loss ((1+y)/2) log(1/p̂(x)) + ((1−y)/2) log(1/(1−p̂(x))). In these cases, the logistic link function p̂(x) = 1/(1 + e^{−h(x)}) is often adopted to model conditional class probabilities, where h is some (unknown) function in a given hypothesis class. In this and many other situations that arise in practice, it is thus advisable to resort to surrogate loss functions for both training and evaluation. The online active learning literature contains a number of proposed algorithms that admit theoretical guarantees for both generalization error and label complexity, that is, the expected number of label requests. The theoretical analyses found in these works are either tailored to the zero-one loss or to general surrogate losses, with much more emphasis historically placed on the former.

35th Conference on Neural Information Processing Systems (NeurIPS 2021).

The papers that
analyze the zero-one loss introduce novel ideas tackling important theoretical questions in active learning, but the theory does not easily extend to general loss functions (see the discussion in Hanneke and Yang [2019]). Nevertheless, the algorithmic solutions we propose borrow several ideas from one such work, namely Dasgupta et al. [2008]. In the agnostic setting, Dasgupta et al. [2008] developed an algorithm for the zero-one loss, based on the disagreement between functions, that carefully constructs pseudo-labels, that is, weak labels generated by the algorithm itself. We call this algorithm DHM, after the initials of its three authors. For general surrogate losses, Beygelzimer et al. [2009] designed an algorithm based on constructing importance-weighted predictors, with theoretical guarantees in the agnostic setting. Cortes et al. [2019a,b, 2020] further developed these importance-weighted algorithms, enhancing their guarantees, and used them as building blocks for meta-algorithms. We refer to these algorithms as IWAL (Importance Weighted Active Learning). On a different line of work, Hanneke and Yang [2019] present a theoretical analysis of an active learning algorithm that uses surrogate loss functions for binary classification. Among the theoretically-oriented contributions in active learning, this work appears to be the closest reference to our paper when it comes to motivations. Yet, it is fair to say that the authors make the very strong assumption that the Bayes optimal function for the surrogate loss at hand lies in the hypothesis class used for learning, and they explicitly state that this realizability condition cannot be relaxed without resorting to a significantly different approach. Moreover, unlike our work and that of IWAL-based algorithms, Hanneke and Yang [2019] focus only on deriving bounds on the zero-one risk.
And finally, their algorithm operates in a pool-based setting of active learning, where the whole set of unlabeled points is available to the learner, which is more flexible than the streaming setting analyzed here. In practice, the Margin Algorithm, or Uncertainty Sampling, of Lewis and Gale [1994] attains a substantial performance improvement compared to passive learning as well as to other active learning algorithms, and is thus often favored by practitioners (e.g., Schohn and Cohn [2000], Tong and Koller [2001], Brinker [2003], Culotta and McCallum [2005], Joshi et al. [2009], Mussmann and Liang [2018]). Moreover, the margin algorithm is flexible in that it can be used in both the pool and the streaming setting. In the pool setting, several papers have recently tailored active learning algorithms to neural networks, always comparing their algorithm's empirical performance to that of the margin algorithm [Geifman and El-Yaniv, 2017, Gal et al., 2017, Savarese et al., 2018, Ducoffe and Precioso, 2018, Ash et al., 2020, Moon et al., 2020, Schröder and Niekler, 2020]. Overall, the margin algorithm is often found to be a competitive contender to this new collection of practical active learning algorithms. Additionally, two extensive empirical studies found that the margin algorithm outperforms many recent active learning algorithms [Yang and Loog, 2016, Chuang et al., 2019]. On the theoretical side, several authors have developed active learning algorithms with theoretical guarantees for specific classes of functions (e.g., linear separators) based on sampling along the margin [Dasgupta et al., 2005, Balcan et al., 2007, Balcan and Long, 2013, Awasthi et al., 2014, 2015, Zhang, 2018, Zhang et al., 2020], but the guarantees of these margin-based algorithms only hold under strong assumptions on the realizability of the problem.
Our goal is thus to derive an algorithm that admits theoretical guarantees without relying on realizability assumptions and that, at the same time, attains in practice a performance that is better than that of the margin algorithm in the streaming setting.

1.1 Contributions

In this paper, we introduce a novel active learning algorithm, called ALPS (Actively Learning over Pseudo-labels for Surrogate losses), in the streaming setting for binary classification tasks, which trains a model by optimizing a surrogate loss on the joint set of labeled and self-constructed pseudo-labeled points (Section 3). Crucially, we prove that ALPS admits theoretical guarantees with respect to the surrogate loss in the general agnostic setting, while at the same time surpassing the empirical performance of the margin algorithm as well as other baselines. More concretely, we prove theoretical guarantees for ALPS in terms of both generalization and label complexity, under assumptions on the interplay between the function class and the data distribution which are not captured by the traditional low-noise assumptions often formulated in active learning (Section 4). We show that ALPS is able to leverage pseudo-labels in a principled manner even though the loss function of interest is not the zero-one loss, but a surrogate thereof. We then complement our theoretical findings with a thorough experimental investigation on 18 real-world datasets using a class of neural networks, Multi-layer Perceptrons. We test our algorithm against the IWAL algorithm of Beygelzimer et al. [2009], since it admits theoretical guarantees for general loss functions, and against passive learning, since we want to quantify the improvement over supervised methods. Additionally, we generate as competitors a pool of margin-based uncertainty samplers by varying their label request threshold, and show that on almost all datasets our algorithm outperforms the ex-post best among such algorithms (Section 5).
This paper follows the general research agenda of reducing the gap between theory and practice, a gap which has been widening in the recent active learning literature. As outlined above, we try to bridge this disconnect by deriving an algorithm that not only works well in practice, but also admits solid theoretical guarantees. In doing so, we also introduce novel algorithmic ideas, which could independently lead to new directions of research.

1.2 Main Ideas

The main mechanism behind ALPS is to query the label of the current instance based on the disagreement between hypotheses, and to construct pseudo-labels for the non-queried instances. The algorithm then trains a model on the joint set of labeled and pseudo-labeled data. The sample used for training will be unbiased with respect to the marginal distribution on the instance space, but label noise will be introduced whenever the pseudo-label differs from the true (unrevealed) label. This noise affects both the generalization ability of the algorithm and its label complexity. Controlling this noise is not only crucial, but also quite challenging, for it relies on assessing the expected reliability of the pseudo-labels. In order to control this noise, we introduce the novel idea of using requester functions: ALPS learns to query labels on instances with unreliable (that is, potentially noisy) pseudo-labels by using a requester function chosen to minimize an importance-weighted estimate of the noise. For the sake of our results, the requester functions are assumed to lie in some generic function class. One natural example of this class is that of margin-based functions, meaning functions that request along the margin of a given prediction model. In this case, our algorithm is related to the margin algorithm but, unlike the theoretical guarantees for the margin algorithm (e.g., Balcan et al. [2007]), we do not rely on specific distributional assumptions on the instance space.
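As a concrete illustration of the margin-based requester functions just mentioned, here is a minimal Python sketch; the function names and the plug-in estimate of the requesting rate are ours, not from the paper:

```python
def make_margin_requester(h, gamma):
    """Requester r(x) = gamma - |h(x)|: r(x) > 0 (i.e., 'request the label')
    exactly when x lies within margin gamma of the decision surface of h."""
    return lambda x: gamma - abs(h(x))

def empirical_request_rate(r, xs):
    """Plug-in estimate of the requesting rate E[I{r(x) > 0}] from a sample."""
    return sum(1 for x in xs if r(x) > 0) / len(xs)
```

For instance, with h the identity on the real line and gamma = 0.5, only instances with |x| < 0.5 trigger a request.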
The requester functions are reminiscent of the abstention functions introduced in Cortes et al. [2016], but serve an entirely different purpose in active learning. ALPS can be seen as an extension of the DHM algorithm [Dasgupta et al., 2008] to surrogate loss functions, and it seeks to address the open question posed by Dasgupta et al. [2008] of whether there are active learning algorithms that only require solving tractable optimization problems in agnostic scenarios. At the same time, these two algorithms differ in how they treat the label noise introduced by pseudo-labeling. In DHM, the label noise can be easily bypassed by a simple property of the zero-one loss. In contrast, for general surrogate loss functions, this noise is unavoidable and, if not controlled as is done in the ALPS algorithm, the learned hypothesis can be arbitrarily far from the best-in-class.

2 Preliminaries and Notation

We consider an active learning framework in the streaming setting for binary classification. Learning proceeds in a sequence of rounds. In each round n, the learner receives from the environment an instance (or feature vector) xn from an instance (or feature) space X. Based on what has been observed so far, the learner can then decide whether or not to request the true label yn associated with xn. The label yn is assumed to lie in the output space Y = {±1}. The pairs (x1, y1), (x2, y2), . . . are drawn i.i.d. from a joint distribution D over X × Y. In order to evaluate the statistical performance of the learner, we consider any bounded surrogate loss function ℓ : R × Y → [0, B] that upper bounds the zero-one loss, and let ℓB(·, ·) = ℓ(·, ·)/B be a [0, 1]-normalized version of this loss. Let H denote a hypothesis class of functions h : X → R. For simplicity of exposition, we assume that H is a finite class, but our analysis can be extended to classes with finite VC dimension by standard covering arguments.
The true (expected) risk of a hypothesis h ∈ H is defined as err(h) = E_{(x,y)∼D}[ℓ(h(x), y)], and h* = argmin_{h∈H} err(h) denotes the best-in-class hypothesis. Moreover, we denote by err(h, Z) = (1/n) Σ_{(x,y)∈Z} ℓ(h(x), y) the empirical risk of h over a training set Z = {(x1, y1), (x2, y2), . . . , (xn, yn)}. Given round n ≥ 1, we denote by Tn the set of labeled data points {(xs, ys)} from rounds s ∈ [n] := {1, . . . , n} where the label was requested by the learning algorithm, and denote by Sn its complement, that is, the set of labeled data points {(xs, ys)} from rounds s ∈ [n] where the label was not requested. Unlike Tn, the set Sn is not fully known to the learner. On those rounds, the learner replaces the true unobserved label ys by a pseudo-label ŷs, which is a suitable proxy to ys generated by the learner itself. We denote by Ŝn the set containing the data points {(xs, ŷs)} from rounds s ∈ [n] where the label is not requested and the true label gets replaced by the corresponding pseudo-label. For notational convenience (and unless otherwise specified), we set ŷs = ys in rounds s where the true label is known to the learner. Then the following two shorthands are used for the empirical risk up to round n using ground-truth labels and the empirical risk up to round n using pseudo-labels:

err_n(h) = err(h, Sn ∪ Tn) = (1/n) Σ_{s=1}^{n} ℓ(h(xs), ys) ;   êrr_n(h) = err(h, Ŝn ∪ Tn) = (1/n) Σ_{s=1}^{n} ℓ(h(xs), ŷs) .

In the sequel, we will also need the following definition of consistency of a hypothesis h ∈ H.

Definition 1. We say that h ∈ H is consistent with a labeled data point (x, y) ∈ X × Y if sgn(h(x)) = y and, by extension, that h is consistent with a set of labeled data points Z if it is consistent with all data points in Z. Two hypotheses h, h' ∈ H are consistent with one another on a set of (unlabeled) points if sgn(h(x)) = sgn(h'(x)) for all x in that set of points.
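The definitions above translate directly into code. The sketch below (our own naming) uses the log-loss with the logistic link from the introduction as the surrogate loss, and implements the empirical risk and the consistency check of Definition 1; for illustration we ignore the clipping needed to make the loss bounded as the paper assumes:

```python
import math

def log_loss(score, y):
    """Log-loss of the logistic-link estimate p(x) = 1/(1 + e^{-h(x)})
    against a label y in {-1, +1}."""
    p = 1.0 / (1.0 + math.exp(-score))
    return ((1 + y) / 2) * math.log(1 / p) + ((1 - y) / 2) * math.log(1 / (1 - p))

def sgn(v):
    return 1 if v >= 0 else -1  # sign convention for predictions

def empirical_risk(h, Z, loss=log_loss):
    """err(h, Z): average surrogate loss of hypothesis h over a labeled set Z."""
    return sum(loss(h(x), y) for x, y in Z) / len(Z)

def consistent(h, Z):
    """Definition 1: h is consistent with Z iff sgn(h(x)) = y for all (x, y) in Z."""
    return all(sgn(h(x)) == y for x, y in Z)
```

For example, a score of 0 yields p̂ = 1/2 and a log-loss of log 2 for either label.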
At each round n, the learner is compelled to output a hypothesis hn ∈ H, and we seek to bound its true risk, err(hn), in terms of the true risk, err(h*), of the best-in-class hypothesis h*, as a function of the total number n of points streamed thus far. In addition, we want a bound on the label complexity of the learner, that is, the number of labels requested by the learner during the course of training. This bound should be provably smaller than n, which is the label complexity of a passive learner observing all labels. Further ancillary definitions will be introduced in later sections.

3 The ALPS Algorithm

At a high level, at each round n ALPS either requests the label of the instance vector xn or assigns it a pseudo-label. The algorithm then selects a model hn that is consistent on the pseudo-labeled points, Ŝn, and that minimizes the empirical estimate of the risk, êrr_n(h), which is defined in terms of the pseudo-labeled and labeled points, Ŝn ∪ Tn, processed thus far. See the pseudo-code in Algorithm 1. More concretely, the learned hypothesis at round n is hn = LEARN(Ŝn, Ŝn ∪ Tn), where:

Definition 2. For sets of labeled data points S and Z, LEARN(S, Z) returns a hypothesis in H that minimizes the empirical risk err(h, Z) on the second argument Z, and is consistent with all labeled data points in the first argument S, where consistency is defined in Definition 1. LEARN raises a flag if no such hypothesis is found.

In the LEARN procedure, we only need to consider hypotheses consistent with the pseudo-labeled points, Ŝn, because we can show that, with high probability, the sign of the best-in-class hypothesis, h* = argmin_{h∈H} err(h), matches the sign of the pseudo-labeled points. In order to decide whether to query the label of xn, ALPS leverages the disagreement between hypotheses h+1, h−1 ∈ H returned by the LEARN procedure, where the label of the current instance xn is assumed to be +1 for learning hypothesis h+1 and −1 for learning hypothesis h−1.
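Over a finite hypothesis class, the LEARN subroutine of Definition 2 amounts to a constrained empirical risk minimization, as in the following sketch (names are ours; returning None plays the role of the flag):

```python
def sgn(v):
    return 1 if v >= 0 else -1

def learn(S, Z, H, loss):
    """LEARN(S, Z): among hypotheses in the finite class H consistent with every
    labeled point in S, return one minimizing the empirical risk on Z; return
    None (the 'flag' of Definition 2) if no consistent hypothesis exists."""
    consistent_hs = [h for h in H if all(sgn(h(x)) == y for x, y in S)]
    if not consistent_hs:
        return None
    return min(consistent_hs,
               key=lambda h: sum(loss(h(x), y) for x, y in Z) / len(Z))
```

In ALPS, the first argument S holds the pseudo-labeled points (plus the tentatively labeled current instance), and the second argument Z the union of pseudo-labeled and labeled points.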
Ideally, the algorithm would use the empirical difference err_n(h+1) − err_n(h−1) based on ground-truth labels to assess this disagreement but, because not all labels at time n have been disclosed to the algorithm, it instead uses the pseudo-labeled version êrr_n(h+1) − êrr_n(h−1), which is indeed observable. The algorithm compares this difference to a slack term, Δn, derived from our theoretical analysis. See Appendix A for all slack term definitions.

Algorithm 1: Actively Learning over Pseudo-labels for Surrogate losses (ALPS)
Input: LEARN(S, Z), hypothesis class H, requester class R, and slack terms Δn, Δ̄n.
Ŝ0 = ∅; T0 = ∅; F1 = H × R;
for n = 1, 2, . . . , N do
  Receive feature vector xn;
  For ŷ = ±1, let h_ŷ = LEARN(Ŝ_{n−1} ∪ {(xn, ŷ)}, Ŝ_{n−1} ∪ {(xn, ŷ)} ∪ T_{n−1});
  Define version space Fn and bias pn:
    Fn = {(h, r) ∈ F_{n−1} : l_{n−1}(h, r) − min_{(h',r')∈F_{n−1}} l_{n−1}(h', r') ≤ Δ̄_{n−1}}
    pn = max_{(h,r),(h',r')∈Fn} max_{y∈Y} ( ℓB(h(xn), y) I{r(xn) ≤ 0} − ℓB(h'(xn), y) I{r'(xn) ≤ 0} )
  Qn ∼ Bernoulli(pn);
  if êrr_{n−1}(h_{−ŷ}) − êrr_{n−1}(h_ŷ) > Δ_{n−1} (or if no such h_{−ŷ} is found) for some ŷ ∈ {±1} AND Qn = 0 then
    rn = argmin_{r∈Rn} E[I{r(x) > 0}], where Rn = {r : (h, r) ∈ Fn and h consistent with Ŝ_{n−1} ∪ {(xn, ŷ)}};
    if rn(xn) > 0 then
      Request yn and update Ŝn = Ŝ_{n−1}, Tn = T_{n−1} ∪ {(xn, yn)};
    else
      Do not request yn and use the pseudo-label ŷ¹ to update Ŝn = Ŝ_{n−1} ∪ {(xn, ŷ)}, Tn = T_{n−1};
    end if
  else
    Request yn and update Ŝn = Ŝ_{n−1}, Tn = T_{n−1} ∪ {(xn, yn)};
  end if
  hn = LEARN(Ŝn, Ŝn ∪ Tn);
end for
Output h_N = LEARN(Ŝ_N, Ŝ_N ∪ T_N);

¹The sign of the pseudo-label ŷ is determined by the condition êrr_{n−1}(h_{−ŷ}) − êrr_{n−1}(h_ŷ) > Δ_{n−1} and by whether h_{−ŷ} exists for ŷ = ±1. E.g., ŷ = +1 if êrr_{n−1}(h_{−1}) − êrr_{n−1}(h_{+1}) > Δ_{n−1} or if h_{−1} does not exist.

If the loss difference is smaller than the slack, then the label is requested, because the sign of the best-in-class h* cannot be inferred. Otherwise, a pseudo-label is constructed.
In more detail, the sign of the pseudo-label ŷ is determined by the condition êrr_{n−1}(h_{−ŷ}) − êrr_{n−1}(h_ŷ) > Δ_{n−1} and by whether h_{−ŷ} exists for ŷ = ±1. Note that, by this construction, the LEARN procedure can always return a consistent hypothesis on all pseudo-labeled points. To understand the next steps in the ALPS algorithm, we relate the ground-truth empirical difference err_n(h) − err_n(h') to its pseudo-labeled counterpart êrr_n(h) − êrr_n(h'). To this effect, we define the difference of differences of losses An(h, h') between two hypotheses h, h' ∈ H at time n as

An(h, h') := ( err_n(h) − err_n(h') ) − ( êrr_n(h) − êrr_n(h') ) .

This idea is also adopted by Dasgupta et al. [2008] in their analysis of the DHM algorithm. Yet, because they only deal with the zero-one loss, in their case An(h, h') = 0 for consistent h, h', even when the true label does not match the pseudo-label. This easily follows from the fact that if sgn(h) and sgn(h') agree on instance x, then

I{sgn(h(x)) ≠ y} − I{sgn(h'(x)) ≠ y} = I{sgn(h(x)) ≠ ŷ} − I{sgn(h'(x)) ≠ ŷ},

regardless of whether y = ŷ or not. (Throughout, I{·} denotes the indicator function of its argument.) In the surrogate loss setting analyzed here, whenever the pseudo-labels do not match their corresponding true labels, the term An(h, h') is typically non-zero even for consistent h, h'. As a consequence, the magnitude of An(h, h') must be carefully controlled since, otherwise, it would break both the generalization and label complexity bounds, as well as the guarantee that the best-in-class h* is consistent with Ŝn. In the sequel, we call An the noise term, as it captures the noise introduced during training due to the discrepancy between pseudo-labels and true labels. To control the noise term, we introduce the idea of using requester functions. Specifically, we consider a set R of functions r : X → R, where r(x) > 0 means that the label of x is requested by r, and otherwise it is not requested.
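Before continuing, the role of the noise term can be made concrete with a small numerical sketch (our own; a clipped hinge stands in for a generic bounded surrogate). On a single point whose pseudo-label flips the true label, An vanishes under the zero-one loss for sign-consistent h, h', but not under the surrogate:

```python
def sgn(v):
    return 1 if v >= 0 else -1

def noise_term(h, h2, true_pts, pseudo_pts, loss):
    """A_n(h, h'): difference between the empirical loss difference computed
    with true labels and the one computed with pseudo-labels."""
    n = len(true_pts)
    diff_true = sum(loss(h(x), y) - loss(h2(x), y) for x, y in true_pts) / n
    diff_pseudo = sum(loss(h(x), y) - loss(h2(x), y) for x, y in pseudo_pts) / n
    return diff_true - diff_pseudo

zero_one = lambda s, y: 1.0 if sgn(s) != y else 0.0
hinge_clipped = lambda s, y: min(2.0, max(0.0, 1.0 - y * s)) / 2.0  # bounded surrogate

h, h2 = (lambda x: 0.2 * x), (lambda x: 0.8 * x)  # sign-consistent on x = 1
true_pt, pseudo_pt = [(1.0, -1)], [(1.0, 1)]      # pseudo-label flips the truth
```

On this point, noise_term with zero_one is exactly 0, while with hinge_clipped it is not.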
At each round n, the algorithm picks a requester function rn out of the set R, and the algorithm's final decision of whether or not to request the label is made according to the sign of rn(xn). The requester function rn thus works as the final gatekeeper of whether to use the pseudo-label. Notice that, by definition, the noise term An is non-zero only when some true label does not match its pseudo-label, ys ≠ ŷs for some s ∈ [n]. By construction, since the final condition of whether to use a pseudo-label in round s is determined by rs, it follows that I{ys ≠ ŷs} = I{ys ≠ ŷs} I{rs(xs) ≤ 0}. In addition, for any h consistent with Ŝs, it holds that ŷs = sgn(h(xs)) and, since the surrogate loss ℓ upper bounds the zero-one loss, we have that for any h consistent with Ŝs,

I{ys ≠ ŷs} I{rs(xs) ≤ 0} ≤ ℓ(h(xs), ys) I{rs(xs) ≤ 0} = B ℓB(h(xs), ys) I{rs(xs) ≤ 0} .   (1)

The above can thus be used as an upper bound on the noise. To estimate this upper bound on An, the algorithm constructs an unbiased importance-weighted estimate, ln(h, r), of E[ℓB(h(x), y) I{r(x) ≤ 0}] as follows:

ln(h, r) = (1/n) Σ_{s=1}^{n} (Qs / ps(xs)) ℓB(h(xs), ys) I{r(xs) ≤ 0} ,

where Qs is a Bernoulli random variable with bias ps(xs), and where ps(·) is a (random) function depending only on the past history {(x_{s'}, y_{s'}), Q_{s'}}_{s'=1}^{s−1}. The algorithm will request the label ys whenever Qs = 1 in order to construct this unbiased estimate. In order not to request the label (Qs = 1) too frequently, ps(·) is based on a shrinking version space Fn that contains only the pairs (h, r) whose empirical estimate ln(h, r) is close to the smallest one. Given these estimates of E[ℓB(h(x), y) I{r(x) ≤ 0}], the algorithm chooses a requester function rn with the smallest request rate E[I{r(x) > 0}] among the requester functions in the version space Fn. In this way, the algorithm ensures that the noise term An is kept small.
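A minimal sketch of the importance-weighted estimate ln(h, r) follows (our naming; each history entry bundles the instance, label, query indicator Qs, and bias ps):

```python
def iw_estimate(history, h, r, loss_B):
    """Unbiased importance-weighted estimate of E[l_B(h(x), y) I{r(x) <= 0}].
    history: list of (x, y, Q, p) with Q ~ Bernoulli(p); the label y is only
    used on rounds where Q = 1, i.e. where it was actually requested."""
    n = len(history)
    total = 0.0
    for x, y, Q, p in history:
        if Q == 1:
            total += loss_B(h(x), y) * (1.0 if r(x) <= 0 else 0.0) / p
    return total / n
```

Since E[Q/p] = 1, each summand has the right expectation; when every ps = 1, the estimate reduces to the plain empirical average.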
This is due to the fact that the version space Fn contains pairs of functions (h, r) that are provably close to those minimizing E[ℓB(h(x), y) I{r(x) ≤ 0}], combined with the fact that ℓB(h(xn), yn) I{rn(xn) ≤ 0} is an upper bound on the noise at round n via (1). Putting the above together, ALPS will request the label of xn whenever one of the following three conditions holds: 1) the difference between the losses of h+1 and h−1 is smaller than the slack term, in which case the revealed label is used, as we cannot infer the sign of h* and cannot construct a pseudo-label; 2) Qn = 1, in which case the revealed label is used for the empirical estimates ln(h, r); 3) rn(xn) > 0, in which case the revealed label is used to control the noise term, An, introduced by pseudo-labeling. In all cases, the revealed labels are used in LEARN to find the best model at each round n. Notice that, for the sake of the argument, we have assumed in the above description to have access to a large unlabeled pool of examples x so as to construct a convenient set of requester functions, that is, to estimate E[I{r(x) > 0}] accurately. Yet, our algorithm and its associated analysis can be generalized straightforwardly to the case where such a pool is available only in a streaming fashion. Moreover, in our experiments in Section 5, we do not use an unlabeled pool to generate requester functions or to estimate the requesting rate. Instead, we define the set of requester functions to be margin-based functions with varying thresholds, and then estimate the requesting rate based on the data processed thus far. See Section 5 for details.

4 Theoretical Guarantees

We now present generalization and label complexity guarantees for our algorithm. These bounds will be in terms of the best-in-class hypothesis h*, whose risk err(h*) will be denoted by ν. For these guarantees, we consider a class R of requester functions that satisfies the following:

Assumption 1.
There exists r* ∈ R and a constant C ≥ 0 such that E[ℓ(y, h*(x)) I{r*(x) ≤ 0}] = 0 and E[I{r*(x) > 0}] ≤ Cν.

If given access to a large unlabeled pool of data, we can always construct a function r such that E[ℓ(y, h(x)) I{r(x) ≤ 0}] = 0 for any h, since r can always be augmented in such a way as to request more frequently (that is, making r(x) > 0 for most x). Moreover, below we will further relax the first constraint to E[ℓ(y, h(x)) I{r(x) ≤ 0}] ≤ ϵ for some ϵ ≥ 0. However, unless more information about the structure of the problem is known, proving that there exists a function r* in the set R such that E[I{r*(x) > 0}] ≤ Cν holds with a favorable (i.e., reasonably small) value of C is more difficult. Nevertheless, unfavorable C values, just like unfavorable disagreement coefficients, only impact the label complexity bound and not the generalization guarantee. After the theorems below, we present several natural examples of favorable C values. Note that Assumption 1 does not dictate conditions on the true conditional class probability η(x) = P[y = +1 | x]. Hence, we are not relying on the low-noise assumptions often adopted in active learning in realizable scenarios. Specifically, our assumption cannot be viewed as a surrogate loss counterpart to the Tsybakov [Mammen and Tsybakov, 1999, Tsybakov, 2004] low-noise condition in the realizable setting [Castro and Nowak, 2008, Hanneke, 2011, Koltchinskii, 2010, Dekel et al., 2012], nor as an incarnation of the Bernstein condition, often invoked in statistical learning settings to achieve fast rates in both passive and active learning [Massart and Nedelec, 2006, Bartlett et al., 2006, Koltchinskii, 2006, Van Erven et al., 2015, Hanneke and Yang, 2019]. In general, the richer the class R is, the more easily Assumption 1 is satisfied with a small C. At the same time, since ALPS is learning over H and R simultaneously, the complexity of R affects both the generalization bound and the label complexity, as is usual in learning guarantees.
Below, we first present the generalization bound of the ALPS algorithm. Here and in Theorem 2, the Õ notation hides logarithmic factors in n, 1/δ, and |H × R|.

Theorem 1. Under Assumption 1, for any δ > 0, with probability at least 1 − δ, for any n > 0,

err(hn) ≤ ν + Õ( √(1/n) ) ,

where hn is the hypothesis computed by ALPS in round n.

This guarantee states that the risk of the hypothesis returned by the algorithm converges to that of the best-in-class hypothesis at a rate that matches that of standard supervised learning, despite the fact that ALPS uses fewer labeled points. The analysis used to derive this theorem proves that the noise introduced by the pseudo-labels is, in fact, controlled by the learned requester function, so that the above bound can be attained. In passing, we observe that if the loss ℓ is convex and the Bayes optimal hypothesis is in the class H, then by using standard consistency techniques [Zhang, 2004, Bartlett et al., 2006] in conjunction with Theorem 1 above, we could also attain a bound on the excess risk of the zero-one loss. The next theorem gives a bound on the expected number of labels the algorithm requests. This bound will be in terms of a notion of disagreement coefficient derived from the one in Hanneke [2007]. Define a metric ρ on the space of hypotheses H as

ρ(h, h') := E_{(x,y)∼D} [ |ℓ(h(x), y) − ℓ(h'(x), y)| ] .

Using this metric, we consider the γ-ball around the best-in-class: Bγ(h*) = {h : ρ(h, h*) ≤ γ}. Then, let

θ := sup_{γ>0} { P_x[ ∃h ∈ Bγ(h*) : sgn(h(x)) ≠ sgn(h*(x)) ] / γ }

be the disagreement coefficient of H (for the given distribution D over X × Y).

Theorem 2. Under Assumption 1, for any δ > 0, with probability at least 1 − δ, the label complexity of the ALPS algorithm at time n > 0 is bounded as Õ( nν(θ + C) ).

Disregarding the constants θ and C, the above bound is Õ(nν). This label complexity guarantee of ALPS, in conjunction with its generalization bound, matches the known lower bound Ω(nν) from Beygelzimer et al. [2009] up to constants, and is therefore optimal.
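For intuition, the disagreement coefficient can be estimated empirically over a finite class and a grid of γ values; the following is a plug-in sketch of the definitions above, with all names ours:

```python
def sgn(v):
    return 1 if v >= 0 else -1

def rho(h, h2, sample, loss):
    """Empirical version of the metric rho(h, h') = E |l(h(x), y) - l(h'(x), y)|."""
    return sum(abs(loss(h(x), y) - loss(h2(x), y)) for x, y in sample) / len(sample)

def disagreement_coefficient(H, h_star, sample, loss, gammas):
    """Plug-in estimate of theta: for each gamma in the grid, form the ball
    B_gamma(h*) and measure the fraction of instances on which some ball
    member disagrees in sign with h*; return the largest ratio."""
    theta = 0.0
    for g in gammas:
        ball = [h for h in H if rho(h, h_star, sample, loss) <= g]
        p = sum(1 for x, _ in sample
                if any(sgn(h(x)) != sgn(h_star(x)) for h in ball)) / len(sample)
        theta = max(theta, p / g)
    return theta
```

The supremum over all γ > 0 is replaced here by a finite grid, so the sketch only lower-bounds the true coefficient.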
In Assumption 1, we considered the case where E[ℓ(y, h*(x)) I{r*(x) ≤ 0}] = 0, which is easy to satisfy if we have a rich enough class R with functions that request more frequently. Nevertheless, we can relax this assumption, if desired, as follows.

Assumption 2. There exists r* ∈ R and constants C ≥ 0 and ϵ ≥ 0 such that E[ℓ(y, h*(x)) I{r*(x) ≤ 0}] ≤ ϵ and E[I{r*(x) > 0}] ≤ Cν.

Under this assumption, the label complexity guarantee in Theorem 2 remains unchanged, while an ϵ term is added to the generalization bound in Theorem 1 when ALPS is run with slack Δ_{n−1} + ϵ. By definition, ϵ is at most ν and, if ϵ ≪ ν, then this term has essentially no effect on generalization. There are several natural instances where the conditions stated in Assumption 1 and Assumption 2 are satisfied with favorable C values.

Figure 1: Left: In yellow is a margin requesting region {x ∈ X : |h*(x)| < γ*}, where the hypothesis h*, depicted by the black line, admits a conditional error of, say, 1/2. Middle: A tapered sigmoid function that equals 1 or 0 outside the [−γ*, γ*] interval. Right: An example of a requesting region (in yellow) that is not margin-based and where Assumption 2 holds with C < 1.

Consider, for example, a scenario where, along the margin of the best-in-class hypothesis, there are hard-to-classify examples, while further away the best-in-class hypothesis mostly classifies the examples correctly. See Figure 1 for an illustration where there is such a margin γ* around the best-in-class h*. This example is often used to explain why the popular margin algorithm works, since this algorithm queries the label of points close to the classification surface. In our case, letting r*(x) = γ* − |h*(x)|, it holds that E[ℓ(y, h*(x)) I{|h*(x)| ≥ γ*}] ≤ ϵ for a small ϵ, and E[I{|h*(x)| < γ*}] ≤ ν / E[ℓ(y, h*(x)) | |h*(x)| < γ*].
If the expected loss of h* conditioned on the margin region is high, say 1/2, then it follows that E[I{r*(x) > 0}] = E[I{|h*(x)| < γ*}] ≤ 2ν, so that Assumption 2 is satisfied with C = 2. Specifically, in Figure 1, it holds that C = 2 and ϵ = 2/50. When h* performs well outside the margin, Assumption 2 holds with a small ϵ, while if h* correctly classifies all points outside the margin, then ϵ = 0 and Assumption 1 holds. The ϵ = 0 case is easily satisfied by generalized linear models, P[y = +1 | x] = σ(h*(x)), where σ is a tapered sigmoid function. Note that Assumption 1 is not placing specific restrictions on the marginal distribution over X, other than E[I{|h*(x)| < γ*}] ≤ 2ν. This does not imply conditions on η(x) when x falls in the region |h*(x)| < γ*, and hence we are not relying on realizability assumptions. With any given class of hypotheses H, one can always associate the margin-based requester function class R = {r : r(x) = γ − |h(x)|, γ ≥ 0, h ∈ H}. In this case, ALPS can be seen as trying to simultaneously approximate h* and the best threshold γ*. The resulting algorithm turns out to be related to margin-based approaches, since ALPS will seek to query the label of these more difficult points along the margin, but ALPS finds this optimal pair by using a different approach. At the same time, we would like to stress that, in our framework, the function class R need not be restricted to margin-based requesters: if there exists a small region {x ∈ X : r*(x) > 0}, for some function r* : X → R, such that most of the loss incurred by the best-in-class h* comes from this region, and R is rich enough to contain such a function r*, then Assumption 1 and Assumption 2 hold. For an illustration of this situation, see Figure 1 (right).

4.1 Comparisons to IWAL and DHM

The IWAL algorithm of Beygelzimer et al. [2009] constructs importance-weighted estimates of the expected loss and uses them to select the prediction function.
In contrast, ALPS leverages importance weighted estimators to select requester functions that minimize the noise term. Thus, even though both algorithms construct importance weighted estimators, these estimators serve entirely different purposes. Both IWAL and ALPS attain generalization and label complexity bounds that are effectively of the same order, but our empirical results show that IWAL considerably underperforms compared to other active learning algorithms on all 18 datasets we tested. These empirical results are consistent with other authors' findings (e.g., Figure 2 in Cortes et al. [2019b] and Figure 3 in Cortes et al. [2020]). Even though ALPS can be seen as a generalization of DHM [Dasgupta et al., 2008] to surrogate loss functions, there are several key differences. The noise introduced by the pseudo-labels does not affect DHM, since by definition An(h, h′) = 0 for the zero-one loss. Thus, DHM does not resort to requester functions or other techniques to deal with unreliable pseudo-labels. Moreover, unlike in DHM, the second argument in the LEARN(·, ·) subroutine over which ALPS minimizes error is Ŝn ∪ Tn, not just Tn. For the zero-one loss, minimizing the error over Ŝn ∪ Tn or only Tn is equivalent, since hypotheses consistent on Ŝn admit zero loss over the points in Ŝn.

Figure 2: The plots on the left show the performance of ALPS, IWAL, the ex-post best margin algorithm, and passive learning, with 4-layer networks for cifar and with 3-layer networks for ijcnn. For the 4-layer networks on cifar, the plot on the right shows the margin algorithms with thresholds in Γm ordered from smallest to largest, where Margin-1 corresponds to the smallest threshold value in Γm, Margin-2 to the second smallest, etc. As a function of the number of observed points (or rounds) n, the plots show the mean over the 50 repetitions, and twice its standard error, of the number of requested labels, as well as the accuracy and logistic loss on the test set of the model returned by each algorithm.

Note that ALPS's pseudo-code reduces to DHM's when ℓ is the zero-one loss and R is set to include only the never-requesting function, that is, r(x) = 0 for all x, in which case the importance-weighted estimates ℓn(h, r) are not constructed. We then attain the same guarantees as DHM's without resorting to Assumption 1 or Assumption 2.

5 Empirical Results

In this section, we present our experimental results in the streaming active learning setting, testing the ALPS algorithm, the IWAL algorithm, the margin algorithm (or uncertainty sampling), and a passive learning algorithm, which requests the label of every point and finds the hypothesis that attains the smallest empirical loss. We compare to IWAL since it is a principled algorithm in the general agnostic scenario with strong theoretical guarantees. We test the margin algorithm since, even though its theoretical guarantees do not hold in general, it performs strongly in practice. We run a passive learning algorithm to demonstrate the benefit of ALPS over standard supervised learning. We tested these algorithms on 18 publicly available datasets, using the logistic loss as the surrogate loss ℓ and feedforward artificial neural networks as our model class. Specifically, we used the Multi-layer Perceptron algorithm in the scikit-learn library and ran two diverse network architectures, one with 3 layers and another with 4 layers. For each type of neural network, we constructed a diverse finite set of hypotheses by pre-training on small random subsets while varying the l2 regularization parameter and the initial weights. For each experiment, we randomly shuffled the dataset, generated the finite hypothesis set, and ran all the algorithms.
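The hypothesis-set construction described above can be sketched roughly as follows. This is a minimal illustration of ours; the subset size, architecture, and regularization grid are placeholders, not the actual settings from Appendix D, and a toy dataset stands in for the real ones.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.neural_network import MLPClassifier

rng = np.random.default_rng(0)

# Toy stand-in for one of the real datasets.
X, y = make_classification(n_samples=2000, n_features=20, random_state=0)

# Build a small finite hypothesis set by pre-training MLPs on random
# subsets, varying the l2 penalty (alpha) and the weight initialization.
hypotheses = []
for alpha in (1e-4, 1e-2, 1.0):
    for seed in (1, 2):
        idx = rng.choice(len(X), size=200, replace=False)
        h = MLPClassifier(hidden_layer_sizes=(32, 32), alpha=alpha,
                          random_state=seed, max_iter=500)
        h.fit(X[idx], y[idx])
        hypotheses.append(h)

print(len(hypotheses))  # 6 pre-trained hypotheses
```

Each fitted model exposes class-probability estimates via `predict_proba`, which is what both the margin rule and ALPS's margin-based requesters consume.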
We repeated the experiment 50 times and averaged the results. See Appendix D for details on the datasets and model class settings. Recall that the margin algorithm in the streaming setting first selects the hypothesis h ∈ H with the smallest empirical loss and then requests the label of a point x if |Ph[y = +1 | x] − Ph[y = −1 | x]| ≤ γ, where Ph[y = ±1 | x] is the model h's (estimated) conditional probability of labeling the given point x either +1 or −1. We ran nine instances of the margin algorithm, one for each threshold γ ∈ Γm, where the set of thresholds Γm was chosen as a result of a tuning procedure consisting of a standard grid search with zooming in. For ALPS, the requester class R is also defined by margin-based functions, r(x) = I{|Ph[y = +1 | x] − Ph[y = −1 | x]| ≤ γ}, for each hypothesis h ∈ H and each γ ∈ Γr. Note that we run ALPS with ϵ = 0 for this class R. Unlike Γm, the set Γr was not tuned, since ALPS learns the best threshold in Γr. For details on Γm and Γr, please see Appendix D. To evaluate algorithm performance, both accuracy and logistic loss are of interest in different applications, as explained in the introduction. Thus, Figure 2 shows the accuracy and logistic loss on the test set of the model returned by each algorithm at each round n. Figure 2 also plots the number of points labeled by each algorithm as a function of the number of unlabeled points (or rounds) n. Notice, in particular, that for the passive learning algorithm, since the data is streamed in a random order, the value of the curves at time step n can be interpreted as running the algorithm until we gather n queries on randomly drawn points. So, for instance, at round n = 200 of the passive curve (orange line) for the ijcnn dataset, we effectively sample and reveal the labels of 200 random points from the training set, choose a model that corresponds to an empirical risk minimizer over these 200 points, and evaluate this model's accuracy on the test set.
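The streaming margin rule recalled above can be sketched as follows; this is a minimal illustration of ours, with variable names that are not from the paper.

```python
import numpy as np

def margin_requests(proba, gamma):
    """Streaming margin rule: request the label of x whenever
    |P_h[y=+1|x] - P_h[y=-1|x]| <= gamma, where proba holds the current
    best hypothesis' estimated P_h[y=+1|x] for each incoming point."""
    p_plus = np.asarray(proba)
    # |P[+1|x] - P[-1|x]| = |2 * p_plus - 1|
    return np.abs(p_plus - (1.0 - p_plus)) <= gamma

# Points near 0.5 are uncertain and get queried; confident ones are skipped.
p = np.array([0.95, 0.55, 0.50, 0.10, 0.62])
print(margin_requests(p, gamma=0.3))  # [False  True  True False  True]
```

A larger γ widens the uncertainty band and hence increases the number of label requests, which is exactly the trade-off the threshold tuning over Γm explores.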
The passive learning curve is then used as a baseline comparator for the active learning algorithms. That is, the best performing active learning algorithm is the one that attains accuracy and logistic loss curves on par with or better than those of passive learning while requesting the fewest labels. Figure 2 on the right shows that the accuracy and logistic loss curves of the margin algorithm vary with the threshold. In almost all datasets, there exist thresholds whose corresponding margin algorithm admits an accuracy curve below that of passive learning, and as the threshold increases, the accuracy curve improves, eventually matching that of passive learning. This indicates that a good set of threshold values was tested, since we seek an algorithm that requests the fewest points (i.e., has a small threshold) with an accuracy curve on par with that of passive learning. We then pick the ex-post best margin algorithm: the one with the smallest threshold that admits a learning curve area, with respect to accuracy, within one standard error of the passive learning curve area. If no algorithm admits such a curve, we pick the margin algorithm with the largest learning curve area. Choosing the ex-post best margin algorithm in this way, together with the tuning of Γm, gives the margin algorithm an unfair advantage, since both processes use learning curves based on the revealed labels. Nevertheless, it provides an upper bound on the best performance of these uncertainty samplers. The left-most plots of Figure 2 show that, for both the cifar and ijcnn datasets, the ALPS algorithm performs better than all baseline methods: among the algorithms whose accuracy and logistic loss curves are close to those of passive learning, it requests the fewest points. In Appendix D, we report these figures for all 18 datasets.
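The ex-post selection rule described above can be sketched as follows, under the simplifying assumption (ours, for illustration) that the learning-curve area is approximated by the mean accuracy over rounds.

```python
import numpy as np

def pick_ex_post_margin(acc_curves, passive_curve, passive_se):
    """Ex-post selection for the margin baseline: among instances ordered
    by ascending threshold, pick the first (smallest-threshold) one whose
    accuracy curve area is within one standard error of passive learning's
    curve area; otherwise fall back to the largest-area instance."""
    passive_area = np.mean(passive_curve)
    areas = [np.mean(c) for c in acc_curves]
    for i, area in enumerate(areas):  # thresholds sorted ascending
        if area >= passive_area - passive_se:
            return i
    return int(np.argmax(areas))

passive = np.array([0.70, 0.80, 0.90])          # accuracy per round
margins = [np.array([0.50, 0.60, 0.70]),        # Margin-1 (smallest gamma)
           np.array([0.70, 0.80, 0.88]),        # Margin-2
           np.array([0.71, 0.81, 0.91])]        # Margin-3
print(pick_ex_post_margin(margins, passive, passive_se=0.02))  # 1
```

Preferring the smallest qualifying threshold encodes the goal stated above: match passive learning's curve while requesting as few labels as possible.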
Overall, for the 3-layer networks, our results show that ALPS outperforms the ex-post best margin algorithm on all but one dataset. Similarly, for the 4-layer networks, ALPS performs better than the margin algorithm on all but two datasets. Thus, despite the advantage given by the ex-post tuning of the margin algorithm, ALPS attains better performance. Moreover, ALPS outperforms both passive learning and IWAL on all datasets. On average across all datasets, ALPS requests a mere 28% of the processed points. Since ALPS was run with ϵ = 0, these experiments suggest that Assumption 1 often holds with favorable C values in practice, or that Assumption 1 is simply an artifact of our analysis. Additionally, the ex-post best margin algorithm outperforms IWAL on the majority of the datasets, which is consistent with previous studies [Cortes et al., 2020]. IWAL attains a higher variance across trials on some datasets, and overall it performs better than passive learning. In general, the same empirical conclusions about the relative performance of the active learning algorithms hold for the logistic loss and accuracy curves for all algorithms except the margin algorithm. On a few datasets, the margin algorithm attains a logistic loss curve worse than that of passive learning while admitting an accuracy curve on par with that of passive learning (e.g., ijcnn).

6 Conclusion

We designed an active learning algorithm, called ALPS, for general loss functions that learns to leverage pseudo-labels by using requester functions in order to train a model over a joint set of pseudo-labeled and labeled points. ALPS operates in the general agnostic setting; its model is guaranteed to converge to the best-in-class prediction model at the same rate as passive learning, while achieving favorable label complexity guarantees under a mild assumption on the class of requester functions.
Our comprehensive empirical study on 18 datasets shows that ALPS outperforms relevant baselines, including the margin algorithm often used in practice. As a next step, we will investigate how to extend our algorithm and its associated analysis to the multi-class classification setting.

Acknowledgments. Most of this research was done while Tobias Sommer Thune was visiting Google Research in New York. Tobias Sommer Thune further acknowledges partial support from the Independent Research Fund Denmark, grant number 9040-00361B. We thank the NeurIPS anonymous reviewers whose comments helped us improve the presentation of this paper.

References

Kareem Amin, Giulia DeSalvo, and Afshin Rostamizadeh. Understanding the effects of batching in online active learning. In Proceedings of AISTATS, 2020.
Jordan Ash, Chicheng Zhang, Akshay Krishnamurthy, John Langford, and Alekh Agarwal. Deep batch active learning by diverse, uncertain gradient lower bounds. In Proceedings of ICLR, 2020.
Pranjal Awasthi, Maria Florina Balcan, and Philip M. Long. The power of localization for efficiently learning linear separators with noise. In Proceedings of STOC, pages 449–458. ACM, 2014.
Pranjal Awasthi, Maria-Florina Balcan, Nika Haghtalab, and Ruth Urner. Efficient learning of linear separators under bounded noise. In Proceedings of COLT, pages 167–190, 2015.
Maria-Florina Balcan and Phil Long. Active and passive learning of linear separators under log-concave distributions. In Proceedings of COLT, pages 288–316, 2013.
Maria-Florina Balcan, Andrei Broder, and Tong Zhang. Margin based active learning. In Proceedings of COLT, pages 35–50. Springer, 2007.
Peter Bartlett, Michael Jordan, and Jon McAuliffe. Convexity, classification, and risk bounds. Journal of the American Statistical Association, 101:138–156, 2006.
Alina Beygelzimer, Sanjoy Dasgupta, and John Langford. Importance weighted active learning. In Proceedings of ICML, pages 49–56. ACM, 2009.
Klaus Brinker. Incorporating diversity in active learning with support vector machines. In Proceedings of ICML, 2003.
Rui Castro and Robert Nowak. Minimax bounds for active learning. IEEE Transactions on Information Theory, 54:2339–2353, 2008.
Nicolò Cesa-Bianchi and Claudio Gentile. Improved risk tail bounds for on-line algorithms. IEEE Transactions on Information Theory, 54(1):386–390, 2008.
Chih-Chung Chang and Chih-Jen Lin. LIBSVM: a library for support vector machines. ACM Transactions on Intelligent Systems and Technology, 2011.
Galen Chuang, Giulia DeSalvo, Lazarus Karydas, Jean-Francois Kagy, Afshin Rostamizadeh, and A. Theeraphol. Active learning empirical study. In NeurIPS 2019 LIRE Workshop, 2019.
Corinna Cortes, Spencer Greenberg, and Mehryar Mohri. Relative deviation learning bounds and generalization with unbounded loss functions. arXiv:1310.5796, 2013.
Corinna Cortes, Giulia DeSalvo, and Mehryar Mohri. Learning with rejection. In Proceedings of ALT, 2016.
Corinna Cortes, Giulia DeSalvo, Claudio Gentile, Mehryar Mohri, and Ningshan Zhang. Active learning with disagreement graphs. In Proceedings of ICML, 2019a.
Corinna Cortes, Giulia DeSalvo, Claudio Gentile, Mehryar Mohri, and Ningshan Zhang. Region-based active learning. In Proceedings of AISTATS, 2019b.
Corinna Cortes, Giulia DeSalvo, Claudio Gentile, Mehryar Mohri, and Ningshan Zhang. Adaptive region-based active learning. In Proceedings of ICML, 2020.
Aron Culotta and Andrew McCallum. Reducing labeling effort for structured prediction tasks. In Proceedings of AAAI, 2005.
Sanjoy Dasgupta, Adam Tauman Kalai, and Claire Monteleoni. Analysis of perceptron-based active learning. In Proceedings of COLT, pages 249–263. Springer, 2005.
Sanjoy Dasgupta, Daniel Hsu, and Claire Monteleoni. A general agnostic active learning algorithm. In Proceedings of NeurIPS, pages 353–360, 2008.
Ofer Dekel, Claudio Gentile, and Karthik Sridharan. Selective sampling and active learning from single and multiple teachers. Journal of Machine Learning Research, 13:2655–2697, 2012.
Melanie Ducoffe and Frederic Precioso. Adversarial active learning for deep networks: a margin based approach. arXiv, 2018.
Yarin Gal, Riashat Islam, and Zoubin Ghahramani. Deep Bayesian active learning with image data. In Proceedings of ICML, 2017.
Yonatan Geifman and Ran El-Yaniv. Deep active learning over the long tail. arXiv, 2017.
Steve Hanneke. A bound on the label complexity of agnostic active learning. In Proceedings of ICML, pages 353–360. ACM, 2007.
Steve Hanneke. Rates of convergence in active learning. The Annals of Statistics, 39(1):333–361, 2011.
Steve Hanneke and Liu Yang. Surrogate losses in passive and active learning. Electronic Journal of Statistics, 2019.
Ajay Joshi, Fatih Porikli, and Nikolaos Papanikolopoulos. Multi-class active learning for image classification. In Proceedings of CVPR, 2009.
Sham Kakade and Ambuj Tewari. On the generalization ability of online strongly convex programming algorithms. In Proceedings of NeurIPS, pages 801–808, 2009.
Vladimir Koltchinskii. Local Rademacher complexities and oracle inequalities in risk minimization. The Annals of Statistics, 34(6):2593–2656, 2006.
Vladimir Koltchinskii. Rademacher complexities and bounding the excess risk in active learning. Journal of Machine Learning Research, 11(Sep):2457–2485, 2010.
Alex Krizhevsky. Learning multiple layers of features from tiny images. Technical report, 2009.
Yann LeCun and Corinna Cortes. MNIST handwritten digit database. 2010. URL http://yann.lecun.com/exdb/mnist/.
David Lewis and William Gale. A sequential algorithm for training text classifiers. In Proceedings of SIGIR, 1994.
Enno Mammen and Alexandre Tsybakov. Smooth discrimination analysis. The Annals of Statistics, 27:1808–1829, 1999.
Pascal Massart and Elodie Nedelec. Risk bounds for statistical learning. The Annals of Statistics, 34(5):2326–2366, 2006.
Jooyoung Moon, Jiho Kim, Younghak Shin, and Sangheum Hwang. Confidence-aware learning for deep neural networks. arXiv, 2020.
Stephen Mussmann and Percy Liang. Uncertainty sampling is preconditioned stochastic gradient descent on zero-one loss. In Proceedings of NeurIPS, 2018.
F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, and E. Duchesnay. Scikit-learn: Machine learning in Python. Journal of Machine Learning Research, 12:2825–2830, 2011.
Silvio Savarese, Ozan Sener, and Zoubin Ghahramani. Active learning for neural networks: a core-set approach. In Proceedings of ICLR, 2018.
Greg Schohn and David Cohn. Less is more: Active learning with support vector machines. In Proceedings of ICML, 2000.
Christopher Schroder and Andreas Niekler. A survey of active learning for text classification for deep neural networks. arXiv, 2020.
Simon Tong and Daphne Koller. Support vector machine active learning with applications to text classification. Journal of Machine Learning Research, 2001.
Alexandre Tsybakov. Optimal aggregation of classifiers in statistical learning. The Annals of Statistics, 32(1):135–166, 2004.
Tim van Erven, Peter Grünwald, Nishant Mehta, Mark Reid, and Robert Williamson. Fast rates in statistical and online learning. Journal of Machine Learning Research, 16:1793–1861, 2015.
Joaquin Vanschoren, Jan N. van Rijn, Bernd Bischl, and Luis Torgo. OpenML: Networked science in machine learning. SIGKDD Explorations, 15(2):49–60, 2013. doi: 10.1145/2641190.2641198. URL http://doi.acm.org/10.1145/2641190.2641198.
Yazhou Yang and Marco Loog. A benchmark and comparison of active learning for logistic regression. arXiv, 2016.
Chicheng Zhang. Efficient active learning of sparse halfspaces. In Proceedings of COLT, 2018.
Chicheng Zhang, Jie Shen, and Pranjal Awasthi. Efficient active learning of sparse halfspaces with arbitrary bounded noise. In Proceedings of NeurIPS, 2020.
Tong Zhang. Statistical behavior and consistency of classification methods based on convex risk minimization. The Annals of Statistics, 2004.