# Active Learning with Disagreement Graphs

Corinna Cortes¹, Giulia De Salvo¹, Claudio Gentile¹, Mehryar Mohri¹ ², Ningshan Zhang³

**Abstract.** We present two novel enhancements of an online importance-weighted active learning algorithm, IWAL, using the properties of disagreements among hypotheses. The first enhancement, IWAL-D, prunes the hypothesis set with a more aggressive strategy based on the disagreement graph. We show that IWAL-D improves the generalization performance and the label complexity of the original IWAL, and quantify the improvement in terms of a disagreement graph coefficient. The second enhancement, IZOOM, further improves IWAL-D by adaptively zooming into the current version space, thereby reducing the best-in-class error. We show that IZOOM admits favorable theoretical guarantees with a changing hypothesis set. We report experimental results on multiple datasets demonstrating that the proposed algorithms achieve better test performance than IWAL for the same labeling budget.

## 1. Introduction

In standard supervised learning, a significant amount of labeled data is often needed to learn an accurate predictor. But while unlabeled data is virtually unlimited and available at no cost in many applications, such as natural language processing or image recognition, labeled data is typically very costly, since it requires human inspection, often by domain experts. This challenge of learning with a limited labeling budget, or, equivalently, of minimizing the number of labels requested, motivates the scenario of active learning. The goal of active learning algorithms is to learn an accurate predictor with as few labels as possible. There are two standard settings. In the so-called pool setting, the learner is provided with a pool of i.i.d. unlabeled points, from which

¹Google Research, New York, NY, USA; ²Courant Institute, New York, NY, USA; ³Leonard N.
Stern School of Business, New York University, New York, NY, USA. Correspondence to: Ningshan Zhang.

*Proceedings of the 36th International Conference on Machine Learning*, Long Beach, California, PMLR 97, 2019. Copyright 2019 by the author(s).

the learner interactively selects points to label. In the online setting, the learner receives a sequence of i.i.d. unlabeled points and, at each round, decides whether to request the label of the current point. In both settings, once the labeling budget is reached, the learner returns a predictor chosen from a hypothesis set, which is hoped to admit a smaller generalization error than if the labeling budget had been spent at random. An active learning algorithm for the online setting can also be applied to the pool setting, since the learner can stream the pool of samples. Active learning algorithms are typically evaluated in terms of their label complexity, that is, how many labels are sufficient to obtain a small generalization error $\epsilon$ with high probability. In the online setting, Cohn et al. (1994) proposed a disagreement-based algorithm, CAL, with label complexity $O(\log \frac{1}{\epsilon})$ when the data is separable, as opposed to random sampling, whose label complexity is $O(\frac{1}{\epsilon})$. The CAL algorithm maintains a version space, which includes all the classifiers consistent with the observed labels, and requests labels only when there is some disagreement among the predictors in the current version space. Following the idea of disagreement-based active learning, several algorithms extended CAL to the non-separable case: A² (Balcan et al., 2006; Hanneke, 2007), DHM (Dasgupta et al., 2008), and IWAL (Beygelzimer et al., 2009; 2010), all of which admit theoretical guarantees on generalization error and label complexity.
In particular, the label complexities of these disagreement-based algorithms are bounded in terms of two critical quantities: the loss of the best predictor $h^*$ in the hypothesis set, and the so-called disagreement coefficient (Hanneke, 2007), which depends on the disagreement between $h^*$ and other hypotheses in the neighborhood of $h^*$. More recently, Zhang and Chaudhuri (2014) moved beyond disagreement-based active learning and instead combined confidence-rated predictors with disagreements to improve the label complexity. Besides disagreement-based active learning, another line of work (Dasgupta et al., 2005; Balcan et al., 2007; Balcan and Long, 2013; Awasthi et al., 2014; 2015; Zhang, 2018) studied learning linear separators by labeling samples close to the current estimate of the decision boundary. This type of algorithm admits favorable label complexity assuming the uniform distribution over the unit sphere or a log-concave distribution. Cortes et al. (2019) recently proposed a region-based active learning algorithm, where a favorable partition of the input space into sub-regions is assumed, and the algorithm optimally allocates unlabeled samples to active learning algorithms operating on each region. Although many active learning algorithms are based on the notion of disagreement between hypotheses at each round, we are not aware of any making use of the average disagreements between all hypotheses, which are quantities that can be accurately estimated without requiring labels. More generally, we will speak of a disagreement graph, by which we mean the fully connected graph whose vertices are the hypotheses and whose edges are labeled with the average disagreements between the vertices. One key idea in this paper is to make use of the disagreement graph to improve an existing disagreement-based active learning algorithm. Clearly, depending on the distribution, the graph may be more or less favorable.
One favorable scenario is where the best-in-class vertex lies within an isolated cluster of vertices and the average disagreements within this cluster are small. In that case, an active learning algorithm will be able to quickly locate this cluster, and then identify the best-in-class vertex while requesting only a few labels, since the disagreements are small. In this paper, we propose an active learning algorithm using the disagreement graph, and give guarantees in terms of a property of the disagreement graph that measures how favorable the graph is. While the learning bound is distribution-independent, the quality of the disagreement graph does depend on the distribution.

Another critical quantity that determines the performance of active learning algorithms is the loss of the best-in-class predictor: a smaller best-in-class error helps active learning algorithms achieve high accuracy with fewer labels. For most active learning algorithms with theoretical guarantees, the hypothesis set, and therefore the best-in-class error, are fixed. Thus, the second key contribution of this paper is to reduce the best-in-class error by adaptively enriching the original hypothesis set near the current best hypothesis, while running an active learning algorithm. The challenge in doing so is that the standard theoretical guarantees for existing active learning algorithms do not hold for a changing hypothesis set. Nevertheless, we will show that, by exploiting a key property of the disagreements, theoretical guarantees can be proven for the generalization error and the label complexity, both depending on a smaller best-in-class error.

The rest of this paper is organized as follows. In Section 2, we introduce the notation relevant to our analysis and define the learning scenario. In Section 3, we briefly describe the IWAL algorithm (Beygelzimer et al., 2009), which serves as an important baseline. In Section 4, we present a refined analysis of the label complexity bound of IWAL.
In Section 5, we devise a new hypothesis pruning strategy enhancing the original strategy of IWAL, using the disagreement graph. In Section 6, we further improve this disagreement-graph-based IWAL by adaptively zooming into the function space to decrease the best-in-class error. We prove favorable label complexity and generalization bounds for our new algorithms using the properties of the disagreement graph. Finally, we report in Section 7 the results of several experiments demonstrating the favorable empirical performance of our new algorithms.

## 2. Preliminaries

In this section, we introduce the relevant notation and describe the active learning scenario we examine. We denote by $\mathcal{X} \subseteq \mathbb{R}^d$ the input space and by $\mathcal{Y} = \{-1, +1\}$ the binary output space. We assume that training and test points are drawn i.i.d. according to an unknown distribution $\mathcal{D}$ over $\mathcal{X} \times \mathcal{Y}$, and we denote by $\mathcal{D}_{\mathcal{X}}$ the marginal distribution induced by $\mathcal{D}$ over the input space $\mathcal{X}$. Let $H$ be a hypothesis set of functions mapping from $\mathcal{X}$ to $Z \subseteq \mathbb{R}$, and let $\ell \colon Z \times \mathcal{Y} \to [0, 1]$ be a loss function. For any hypothesis $h$, we denote by $R(h)$ the generalization error or expected loss of $h$: $R(h) = \mathbb{E}_{(x,y) \sim \mathcal{D}}[\ell(h(x), y)]$. We denote by $h^* = \operatorname{argmin}_{h \in H} R(h)$ the best-in-class hypothesis in $H$ and by $R^* = R(h^*)$ its expected loss. We also denote by $L(h(x), h'(x))$ the maximum disagreement value between two hypotheses $h, h'$ on a point $x \in \mathcal{X}$:

$$L(h(x), h'(x)) = \max_{y \in \mathcal{Y}} \big|\ell(h(x), y) - \ell(h'(x), y)\big|.$$

With a slight abuse of notation, we denote by $L(h, h') = \mathbb{E}_{x \sim \mathcal{D}_{\mathcal{X}}}[L(h(x), h'(x))]$ the expected maximum disagreement value between $h$ and $h'$ over $\mathcal{D}_{\mathcal{X}}$. We consider the online active learning scenario where, at each round $t \in [T] = \{1, \ldots, T\}$, the learner receives an input point $x_t \in \mathcal{X}$ drawn i.i.d. according to $\mathcal{D}_{\mathcal{X}}$ and either requests its label, in which case she receives $y_t$, or simply chooses not to solicit its label.
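Since $L(h, h')$ depends only on the marginal distribution over $\mathcal{X}$, it can be estimated from unlabeled data alone. A minimal Monte Carlo sketch of such an estimate (the rescaled logistic loss and the score bound `M` are illustrative assumptions, not the paper's exact setup):

```python
import numpy as np

M = 5.0  # assumed bound on hypothesis outputs, so losses rescale to [0, 1]

def loss(z, y):
    """Logistic loss rescaled to [0, 1], assuming scores satisfy |z| <= M."""
    return np.log1p(np.exp(-y * z)) / np.log1p(np.exp(M))

def max_disagreement(z1, z2):
    """Pointwise L(h(x), h'(x)) = max_y |loss(h(x), y) - loss(h'(x), y)|."""
    return max(abs(loss(z1, y) - loss(z2, y)) for y in (-1.0, 1.0))

def estimate_L(h, h2, unlabeled_X):
    """Monte Carlo estimate of L(h, h') = E_x[L(h(x), h'(x))] from unlabeled points."""
    return float(np.mean([max_disagreement(h(x), h2(x)) for x in unlabeled_X]))
```

Because no labels enter the computation, such estimates can be precomputed for every pair of hypotheses from a large unlabeled sample, which is exactly what makes the disagreement graph cheap to build.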
The quality of an active learning algorithm is measured by two quantities in this setting: the generalization error of the hypothesis $\hat{h}_T$ returned by the algorithm after $T$ rounds, and the total number of labels requested within the $T$ rounds.

## 3. Background on the IWAL Algorithm

Here, we give a brief description of the importance-weighted active learning algorithm of Beygelzimer et al. (2009), IWAL, which we extend and improve upon in later sections. We will also state the guarantees previously shown for this algorithm. We consider the IWAL algorithm because it can be used with different loss functions and hypothesis sets, and is agnostic to noise levels.

Given a finite hypothesis set $H$, IWAL operates on a sample $(x_1, y_1), (x_2, y_2), \ldots, (x_T, y_T)$ drawn i.i.d. according to $\mathcal{D}$. The algorithm maintains at any time $t$ a version space $H_t \subseteq H$, with $H_1 = H$. At time $t$, the algorithm flips a coin $Q_t \in \{0, 1\}$ with bias $p_t$ defined by

$$p_t = \max_{f, g \in H_t} L(f(x_t), g(x_t)). \qquad (1)$$

If $Q_t = 1$, IWAL requests the label $y_t$ and trims $H_t$ to $H_{t+1}$ via an importance-weighted empirical risk minimization:

$$H_{t+1} = \big\{ h \in H_t : L_t(h) \le L_t(\hat{h}_t) + 2\Delta_t \big\}, \quad \Delta_t = \sqrt{(2/t) \log\big(2t(t+1)|H|^2/\delta\big)},$$

for some fixed confidence parameter $\delta > 0$, and where $L_t(h)$ denotes the importance-weighted empirical risk of hypothesis $h \in H$,

$$L_t(h) = \frac{1}{t} \sum_{s=1}^{t} \frac{Q_s}{p_s}\, \ell(h(x_s), y_s),$$

with $\hat{h}_t = \operatorname{argmin}_{h \in H_t} L_t(h)$ its minimizer. After $T$ rounds, IWAL returns the hypothesis $\hat{h}_T$. The theoretical guarantees for IWAL can be expressed in terms of the generalized disagreement coefficient.¹ Given $r > 0$, let $B_{\mathrm{IWAL}}(f, r)$ denote the ball of radius $r$ centered at $f \in H$: $B_{\mathrm{IWAL}}(f, r) = \{g \in H : L(f, g) \le r\}$, where the distance is measured by the expected disagreement value $L(f, g)$. The generalized disagreement coefficient $\theta_{\mathrm{IWAL}}$ is defined as the minimum value such that, for all $r > 0$,

$$\mathbb{E}_{x \sim \mathcal{D}_{\mathcal{X}}}\Big[\max_{h \in B_{\mathrm{IWAL}}(h^*, r)} L(h(x), h^*(x))\Big] \le \theta_{\mathrm{IWAL}}\, r. \qquad (2)$$

Note that, by definition, we have $\theta_{\mathrm{IWAL}} \ge 1$.
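The sampling rule (1) and the importance-weighted risk estimates can be put together in a short, simplified loop. This is a toy sketch, not the full algorithm: version-space pruning is omitted, and the rescaled logistic loss, score bound, and probability clamp are illustrative assumptions:

```python
import numpy as np

def loss(z, y, M=5.0):
    """Logistic loss rescaled to [0, 1], assuming scores satisfy |z| <= M."""
    return np.log1p(np.exp(-y * z)) / np.log1p(np.exp(M))

def iwal_pass(hyps, stream, label_oracle, seed=0):
    """One pass of a simplified IWAL loop: request a label with probability
    p_t from Eq. (1) and importance-weight requested losses by 1/p_t.
    Returns (importance-weighted risks, number of labels requested)."""
    rng = np.random.default_rng(seed)
    risks = np.zeros(len(hyps))
    t = n_labels = 0
    for x in stream:
        t += 1
        scores = [h(x) for h in hyps]
        # p_t = max_{f,g} L(f(x_t), g(x_t)): largest pointwise disagreement
        p_t = max(
            abs(loss(z1, y) - loss(z2, y))
            for z1 in scores for z2 in scores for y in (-1.0, 1.0)
        )
        p_t = min(max(p_t, 1e-3), 1.0)  # clamp away from 0 for bounded weights
        if rng.random() < p_t:          # Q_t = 1: request the label
            n_labels += 1
            y = label_oracle(x)
            risks += np.array([loss(z, y) for z in scores]) / p_t
    return risks / max(t, 1), n_labels
```

The $Q_t/p_t$ weighting keeps each per-round term an unbiased estimate of the expected loss, which is what makes $L_t(h)$ comparable across hypotheses even though only a fraction of labels is observed.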
The disagreement coefficient is a critical parameter widely used in the analysis of disagreement-based active learning: Hanneke (2007) proved upper and lower bounds for the label complexity of the A² algorithm in terms of the disagreement coefficient, and Dasgupta et al. (2008) also gave an upper bound for the DHM algorithm using it. See (Hanneke, 2014) for a more extensive analysis of the disagreement coefficient and active learning algorithms. Let $\mathcal{F}_t$ denote all the observations up to time $t$: $\mathcal{F}_t = \{(x_1, y_1, p_1, Q_1), \ldots, (x_t, y_t, p_t, Q_t)\}$, with $\mathcal{F}_0 = \emptyset$. Then, IWAL admits the following learning guarantee (Beygelzimer et al., 2009).

**Theorem 1 (IWAL).** For any $\delta > 0$, with probability at least $1 - \delta$, for any $t > 0$, the following holds: (1) $h^* \in H_t$; (2) for all $h \in H_t$, $R(h) \le R^* + 4\Delta_{t-1}$; (3) additionally,

$$R(\hat{h}_T) \le R^* + 2\Delta_T, \qquad (3)$$

and

$$\mathbb{E}[p_t \mid \mathcal{F}_{t-1}] \le 4 \theta_{\mathrm{IWAL}} K_\ell \big(R^* + 2\Delta_{t-1}\big), \qquad (4)$$

where $\Delta_t = \sqrt{(2/t) \log\big(2t(t+1)|H|^2/\delta\big)}$, and where $K_\ell$ is a constant that depends on the loss function $\ell$:

$$K_\ell = \sup_{z, z' \in Z} \frac{\max_{y \in \mathcal{Y}} \big|\ell(z, y) - \ell(z', y)\big|}{\min_{y \in \mathcal{Y}} \big|\ell(z, y) - \ell(z', y)\big|}.$$

¹The generalized disagreement coefficient coincides with the standard disagreement coefficient (Hanneke, 2007) when $\ell$ is the zero-one loss.

Note that we have $K_\ell \ge 1$. When $\ell$ is the logistic loss, which is the loss used in the experiments reported by the authors of IWAL, and the output of $H$ is bounded by $M$, that is, $Z \subseteq [-M, M]$, then $K_\ell$ can be as large as $1 + e^M$. Large values of $K_\ell$, in combination with $\theta_{\mathrm{IWAL}} \ge 1$, may result in $4 \theta_{\mathrm{IWAL}} K_\ell R^* > 1$, which would make the label complexity bound (4) vacuous, since $p_t \le 1$ by definition. In the next section, we introduce an improved label complexity bound for IWAL, which removes the dependency on $K_\ell$ and is strictly more favorable than (4).

## 4. Improved Guarantees for IWAL

In this section, we present a more favorable label complexity guarantee for IWAL. To do so, we first introduce a new definition of the generalized disagreement coefficient, $\theta$, that is a lower bound on $\theta_{\mathrm{IWAL}}$, and then prove a label complexity bound for IWAL in terms of $\theta$.
For any two hypotheses $h, h' \in H$, we denote by $\rho(h, h')$ their distance, defined as follows in terms of their losses:

$$\rho(h, h') = \mathbb{E}_{(x,y) \sim \mathcal{D}} \big|\ell(h(x), y) - \ell(h'(x), y)\big|.$$

We denote by $B(h^*, r)$ the ball of radius $r \ge 0$: $B(h^*, r) = \{h \in H : \rho(h, h^*) \le r\}$. Our new disagreement coefficient is the infimum value of $\theta > 0$ such that, for all $r \ge 0$,

$$\mathbb{E}_{x \sim \mathcal{D}_{\mathcal{X}}}\Big[\max_{h \in B(h^*, r)} L(h(x), h^*(x))\Big] \le \theta r. \qquad (5)$$

Although (5) is syntactically similar to (2), the definition of $\theta_{\mathrm{IWAL}}$, the distance metrics defining $B(h^*, r)$ and $B_{\mathrm{IWAL}}(h^*, r)$ are distinct: $B_{\mathrm{IWAL}}$ is defined in terms of the disagreement-based metric $L(\cdot, \cdot)$, while $B$ is defined in terms of the loss-based metric $\rho(\cdot, \cdot)$. We show that $\theta \le \theta_{\mathrm{IWAL}}$, and that one can derive a more favorable label complexity bound for IWAL using $\theta$.

**Theorem 2.** For any $\delta > 0$, with probability at least $1 - \delta$, for all $t \in [T]$, the following holds for the label requesting probability $p_t$ of IWAL:

$$\mathbb{E}[p_t \mid \mathcal{F}_{t-1}] \le 4\theta \big(R^* + 2\Delta_{t-1}\big), \qquad (6)$$

where $\Delta_t = \sqrt{(2/t) \log\big(2t(t+1)|H|^2/\delta\big)}$. Furthermore, the following inequality holds: $\theta \le \theta_{\mathrm{IWAL}}$.

Compared to Theorem 1, our new label complexity bound removes the dependency on $K_\ell$ and replaces $\theta_{\mathrm{IWAL}}$ with $\theta$. The inequalities $K_\ell \ge 1$ and $\theta \le \theta_{\mathrm{IWAL}}$ show that expression (6) is strictly more favorable than (4). Note that $K_\ell$ can in fact be very large for some losses or hypothesis sets. The proof of Theorem 2 is mostly similar to that of IWAL and is given in Appendix A.

## 5. Enhanced IWAL Using the Disagreement Graph

In this section, we present an enhanced version of the IWAL algorithm, IWAL-D, which exploits the disagreement graph. IWAL-D adopts the same label requesting policy as IWAL, but it prunes the hypothesis set more aggressively using disagreement-graph-based slack terms. We provide a theoretical analysis of IWAL-D in terms of a property of the disagreement graph, and show that IWAL-D admits a more favorable learning guarantee than IWAL, especially when the disagreement graph is favorable.
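To make the contrast with IWAL's fixed slack concrete, the toy step below compares a fixed slack of $2\Delta_t$ with a per-hypothesis slack of $(1 + L(h, \hat{h}_t))\Delta_t$, which is never larger since $L \le 1$. All risks and disagreement estimates here are hypothetical numbers, not outputs of the paper's experiments:

```python
import numpy as np

def prune(risks, L_best, delta_t):
    """Prune a version space two ways: with IWAL's fixed slack 2*delta_t and
    with a disagreement-based slack (1 + L(h, h_hat)) * delta_t.

    risks: importance-weighted empirical risks L_t(h) over the version space;
    L_best: estimated disagreements L(h, h_hat) with the empirical minimizer.
    Returns (indices kept with the disagreement slack, indices kept by IWAL)."""
    best = int(np.argmin(risks))
    thr_d = risks[best] + (1.0 + L_best) * delta_t
    thr_iwal = risks[best] + 2.0 * delta_t
    return np.where(risks <= thr_d)[0], np.where(risks <= thr_iwal)[0]

# Hypothetical numbers: four hypotheses, delta_t = 0.1.
risks = np.array([0.10, 0.25, 0.30, 0.45])
L_best = np.array([0.00, 0.20, 1.00, 0.50])   # L(h, h_hat) estimates in [0, 1]
keep_d, keep_iwal = prune(risks, L_best, 0.1)
```

In this example the disagreement-based rule keeps only hypotheses 0 and 2, while the fixed rule also keeps hypothesis 1: hypotheses that agree closely with the empirical minimizer get a tighter threshold, so the surviving set is always a subset of IWAL's.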
The motivation for IWAL-D comes from the following lemma, which relates the importance-weighted empirical error and the generalization error. Due to space limitations, the proofs of all theoretical results are given in Appendix A.

**Lemma 1 (IWAL-D).** For any $\delta > 0$, with probability at least $1 - \delta$, for all $t \in [T]$ and for all $f, g \in H_t$, the following inequality holds:

$$\big|L_t(f) - L_t(g) - R(f) + R(g)\big| \le \big(1 + L(f, g)\big)\Delta_t, \quad \Delta_t = \sqrt{(2/t) \log\big(2t(t+1)|H|^2/\delta\big)}.$$

Lemma 1 gives a customized concentration bound for every pair $(f, g)$ in terms of $L(f, g) \le 1$; it is thus more refined than the corresponding concentration bound for IWAL, which admits the term $2\Delta_t$ on the right-hand side. Using Lemma 1 to prune the hypothesis sets, we obtain the pruning strategy of the IWAL-D algorithm: $H_t$ is trimmed to $H_{t+1}$ according to

$$H_{t+1} = \big\{ h \in H_t : L_t(h) \le L_t(\hat{h}_t) + \big(1 + L(h, \hat{h}_t)\big)\Delta_t \big\}.$$

The full pseudocode of IWAL-D is given in Algorithm 1, with slack term $\Delta_t = \sqrt{(2/t) \log\big(2t(t+1)|H|^2/\delta\big)}$. Lemma 1 guarantees that, with high probability, IWAL-D always maintains the best-in-class predictor in $H_t$ for all $t \in [T]$, which is necessary for the active learning algorithm to succeed. On the other hand, since IWAL-D uses a smaller slack term than IWAL for pruning ($(1 + L(h, \hat{h}_t))\Delta_t \le 2\Delta_t$), it shrinks the hypothesis set more aggressively and therefore reduces the label complexity.

We now present the learning guarantee for IWAL-D, which is based on Lemma 1. The proof is essentially the same as that for IWAL, except that it uses Lemma 1 as the concentration lemma, and is thus omitted.

**Theorem 3 (IWAL-D).** Let $\hat{h}_T$ denote the hypothesis returned by IWAL-D after $T$ rounds.
Then, for any $\delta > 0$, with probability at least $1 - \delta$, for any $t \in [T]$, the following holds: (1) $h^* \in H_t$; (2) for all $h \in H_t$, $R(h) \le R^* + \big(2 + L(h, \hat{h}_{t-1}) + L(h, h^*)\big)\Delta_{t-1}$; (3) additionally,

$$R(\hat{h}_T) \le R^* + \big(1 + L(\hat{h}_T, h^*)\big)\Delta_T, \qquad (7)$$

and

$$\mathbb{E}[p_t \mid \mathcal{F}_{t-1}] \le 2\theta\Big(R^* + \max_{h \in H_t}\big(2 + L(h, \hat{h}_{t-1}) + L(h, h^*)\big)\Delta_{t-1}\Big). \qquad (8)$$

**Algorithm 1** IWAL-D($H$)

    H_1 ← H
    for t ∈ [T] do
        Receive x_t
        p_t ← max_{h,h' ∈ H_t} L(h(x_t), h'(x_t))
        Q_t ← BERNOULLI(p_t)
        if Q_t = 1 then
            y_t ← LABEL(x_t)
            ĥ_t ← argmin_{h ∈ H_t} L_t(h)
            H_{t+1} ← {h ∈ H_t : L_t(h) ≤ L_t(ĥ_t) + (1 + L(h, ĥ_t)) Δ_t}
        end if
    end for
    return ĥ_T

Comparing Theorem 3 to the guarantees of IWAL (Theorem 1), both the generalization bound (7) for $\hat{h}_T$ and the per-round label complexity (8) are improved, since $L(\cdot, \cdot) \le 1$ by definition. The improvement in the learning guarantees heavily depends on the behavior of the average disagreements near $h^*$. When the pair of hypotheses considered, in particular the pair made of the best-in-class hypothesis $h^*$ and the empirical risk minimizer $\hat{h}_T$, agree almost everywhere, the generalization bound on $R(\hat{h}_T)$ is reduced from $R^* + 2\Delta_T$ (IWAL) to approximately $R^* + \Delta_T$ (IWAL-D). Similarly, when the average disagreements between all pairs in $H_t$ are small, the label complexity bound is smaller than that of IWAL.

However, the result of Theorem 3 is not so straightforward to digest, as it involves disagreement values, which are not as transparent as, for example, differences in generalization error. To remove the disagreement values from the guarantees and to better analyze the improvement over IWAL, we introduce a new term, which we refer to as the disagreement graph coefficient $\gamma$, and present guarantees for IWAL-D in terms of $\gamma$ and the generalization error only. The disagreement graph coefficient $\gamma$ is the infimum value of $\gamma$ such that, for all $r > 0$,

$$\max_{h \in B(h^*, r)} L(h, h^*) \le \gamma r,$$

where $B(h^*, r)$ is the ball centered at $h^*$ with radius $r$, as defined in Section 4.
Note that we have $\gamma \ge 1$, since

$$\rho(h, h^*) = \mathbb{E}_{(x,y) \sim \mathcal{D}} \big|\ell(h(x), y) - \ell(h^*(x), y)\big| \le L(h, h^*).$$

Furthermore, by definition, $\gamma \le \theta$, since

$$\max_{h \in B(h^*, r)} L(h, h^*) \le \mathbb{E}_{x \sim \mathcal{D}_{\mathcal{X}}}\Big[\max_{h \in B(h^*, r)} L(h(x), h^*(x))\Big],$$

and thus existing upper bounds on the disagreement coefficient in various scenarios also apply to $\gamma$. In particular, when $\ell$ is the 0-1 loss, it is easy to show that $\gamma = 1$ under all distributions. For other loss functions, one favorable learning scenario in which $\gamma$ is small is where the best-in-class hypothesis $h^*$ is surrounded by an isolated cluster of hypotheses, each of which agrees with $h^*$ almost everywhere; then $L(h, h^*) \approx \rho(h, h^*)$ and $\gamma$ is close to 1. We now derive the following learning guarantees for IWAL-D in terms of $\gamma$ and the best-in-class error $R^*$ only.

**Corollary 1.** Let $\hat{h}_T$ denote the hypothesis returned by IWAL-D after $T$ rounds. Then, for all $\delta > 0$, with probability at least $1 - \delta$, when $2\gamma(R^* + \Delta_T) < 1$,

$$R(\hat{h}_T) \le R^* + \big(1 + 2\gamma(R^* + \Delta_T)\big)\Delta_T \le R^* + 2\Delta_T.$$

Moreover, with probability at least $1 - \delta$, for $t$ such that $3\gamma(R^* + 2\Delta_{t-1}) < 1$,

$$\mathbb{E}[p_t \mid \mathcal{F}_{t-1}] \le 4\gamma\big(R^* + 2\Delta_{t-1}\big).$$

Corollary 1 states that, when $R^*$ is small enough to meet the stated preconditions, IWAL-D admits improved guarantees over IWAL in terms of both the generalization error and the label complexity. Both improvements can be quantified in terms of the disagreement graph coefficient $\gamma$ and the best-in-class error $R^*$: the smaller $R^*$ is, the larger the improvement of IWAL-D's guarantees over those of IWAL. It is worth emphasizing again that IWAL-D always admits learning guarantees at least as favorable as IWAL's. When the preconditions of Corollary 1 do not hold, the improvement may not be captured by our disagreement graph coefficient analysis, but IWAL-D remains at least as favorable as IWAL.

## 6. Enhanced IWAL-D Using Zooming-in

In this section, we exploit another property of the average disagreements and describe an algorithm, IZOOM, that further improves upon IWAL-D by adaptively enriching the hypothesis set near the current best predictor.
We show that IZOOM provides an additional improvement over IWAL-D, both in terms of the generalization bound and in terms of the label complexity. For the IZOOM algorithm, we need additional assumptions on the loss function: (1) the loss takes the form $\ell(z, y) = f(zy)$ for some function $f$; (2) the function $f$ is non-increasing, convex, and 1-Lipschitz. Many binary classification loss functions meet these assumptions, including the logistic loss $\ell(z, y) = \log(1 + e^{-zy})$ and the hinge loss $\ell(z, y) = \max(0, 1 - yz)$. In this section, we assume that these properties hold for the loss function considered.

Recall that the IWAL-D algorithm works on a finite set of hypotheses $H$. Although IWAL-D returns a final predictor that is close to the best-in-class hypothesis of $H$, the best-in-class error, or equivalently the approximation error of $H$ itself, may be quite large. To reduce the approximation error, the learner can increase the size of $H$ and, without any prior knowledge of the data distribution, sample uniformly with a finer granularity from the function space to construct $H$. Now, with access to the average disagreements between arbitrary hypotheses, which can be accurately estimated from large amounts of unlabeled data, a first attempt the learner could make is to construct $H$ in a distribution-dependent way as follows: first, create an $\epsilon$-cover $G$ over the underlying (infinite) function space, that is, a set of hypotheses such that for every $h$ in that space there exists $g \in G$ with $L(h, g) \le \epsilon$, and thus $|R(h) - R(g)| \le L(h, g) \le \epsilon$; next, set $H = G$, which gives a finite hypothesis set with the desired resolution $\epsilon$ in generalization error. The learner can then apply any active learning algorithm to $H$ to achieve favorable learning guarantees. Depending on the data distribution, the cardinality of such an $H$ can be significantly smaller than that of a distribution-independent cover. In Appendix B, we present the formal definition of the $\epsilon$-cover and an example of applying IWAL to the $\epsilon$-cover, and prove its learning guarantees in terms of $\epsilon$.
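The distribution-dependent cover just described can be sketched with a simple greedy $\epsilon$-net. The scalar "hypotheses" and absolute-difference distance below are illustrative stand-ins for real hypotheses and estimated disagreements:

```python
def greedy_cover(items, dist, eps):
    """Greedy epsilon-net: keep an item only if it is more than eps away
    (w.r.t. `dist`) from everything already kept. By construction, every
    item then lies within eps of some member of the returned cover."""
    cover = []
    for h in items:
        if all(dist(h, g) > eps for g in cover):
            cover.append(h)
    return cover

# Toy stand-in: "hypotheses" are scalars and dist mimics an estimated L(h, g).
hyps = [0.0, 0.05, 0.5, 0.55, 1.0]
cover = greedy_cover(hyps, lambda a, b: abs(a - b), 0.1)
```

Here `cover` collapses near-duplicate hypotheses (0.05 and 0.55 are dropped), illustrating how a distribution-dependent metric can shrink the cover when many hypotheses agree almost everywhere.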
However, with or without access to the average disagreements, the size of $H$ in general increases exponentially with the dimension of the feature space, making it impractical to run any active learning algorithm with a hypothesis set $H$ of the desired resolution, especially on datasets with high-dimensional feature vectors. Furthermore, it is unclear how to efficiently construct an $\epsilon$-cover for the function space using the average disagreements as the distance metric. Can we adaptively increase the resolution of the hypothesis set while actively requesting labels? More importantly, can we still prove theoretical guarantees for the final output while changing the hypothesis set? In this section, we describe the IZOOM algorithm, which achieves this goal.

The pseudocode of the IZOOM algorithm is given in Algorithm 2. IZOOM uses label requesting and hypothesis pruning policies similar to those of IWAL-D, with a slightly different slack term $\Delta_t = \sqrt{(2/t) \log\big(4T^2N_{1,\epsilon}^2/\delta\big)}$, where $N_{1,\epsilon}$ is the $\epsilon$-covering number of $\mathrm{conv}(H)$ with respect to the $\ell_1$ norm.

**Algorithm 2** IZOOM($H$)

    H_1 ← H
    for t ∈ [T] do
        Receive x_t
        p_t ← max_{f,g ∈ H_t} L(f(x_t), g(x_t)) + 4/T
        Q_t ← BERNOULLI(p_t)
        if Q_t = 1 then
            y_t ← LABEL(x_t)
            ĥ_t ← argmin_{h ∈ H_t} L_t(h)
            H'_{t+1} ← {h ∈ H_t : L_t(h) ≤ L_t(ĥ_t) + (1 + L(h, ĥ_t) + 4/T) Δ_t}
            H''_{t+1} ← RESAMPLE(H'_{t+1}, |H| − |H'_{t+1}|)
            H_{t+1} ← H'_{t+1} ∪ H''_{t+1}
        end if
    end for
    return ĥ_T

However, after IWAL-D's pruning step has trimmed the hypothesis set from $H_t$ to $H'_{t+1}$, IZOOM further samples new hypotheses within the convex hull of $H'_{t+1}$ and combines them with $H'_{t+1}$ to define $H_{t+1}$. The subroutine for sampling new hypotheses is given in Algorithm 3 and illustrated in Figure 1. In this illustration, at the end of round $t$, IZOOM locates the empirical best predictor $\hat{h}_t$ in $H_t$ and prunes out three hypotheses that are far away from $\hat{h}_t$ to obtain $H'_{t+1}$.
Next, IZOOM randomly samples three new hypotheses from $\mathrm{conv}(H'_{t+1})$ by taking convex combinations of $\hat{h}_t$ with other hypotheses in $H'_{t+1}$, and combines them with $H'_{t+1}$ to obtain $H_{t+1}$. As IZOOM keeps zooming in, thereby enriching the current hypothesis set $H'_{t+1}$, we expect the final hypothesis set $H_T$, from which the final output $\hat{h}_T$ is learned, to admit a substantially smaller approximation error.

In many cases of interest, the covering number $N_{1,\epsilon}$ in the expression of $\Delta_t$ is polynomial in $\frac{1}{\epsilon}$. In particular, when $H$ is a family of functions with bounded norm in the reproducing kernel Hilbert space (RKHS) associated with Gaussian kernels, as shown by Guo et al. (1999) [Eq. (28)], we have $\log N_{1,\epsilon} = O(\log \frac{1}{\epsilon})$. To fix ideas, we set $\epsilon = \frac{1}{T}$, but note that $\epsilon$ can be chosen based on the covering number.

By design, IZOOM maintains a fixed number of hypotheses in $H_t$ for all $t \in [T]$, so IZOOM does not require more computational resources or storage than the original IWAL-D run with the same number of hypotheses. With $|H_t|$ fixed, the more hypotheses IWAL-D prunes out of $H_t$, the more hypotheses IZOOM samples and adds to $H_{t+1}$. Thus, the greedy pruning strategy of IWAL-D helps IZOOM both prune out and add in more hypotheses.

**Algorithm 3** RESAMPLE($H$, $n$)

    ĥ_t ← argmin_{h ∈ H} L_t(h)
    Sample λ_i ~ U[0, 1], ∀i ∈ [n]
    Sample h_i ~ H \ {ĥ_t}, ∀i ∈ [n]
    h̃_i ← λ_i ĥ_t + (1 − λ_i) h_i, ∀i ∈ [n]
    return {h̃_i : i ∈ [n]}

**Figure 1.** Illustration of IZOOM at time $t$. $H'_{t+1}$ (the red points) is obtained by pruning out three hypotheses (the black points) from $H_t$. IZOOM then samples three new hypotheses (the blue points) from $\mathrm{conv}(H'_{t+1})$ and combines them with $H'_{t+1}$ to get $H_{t+1}$.

A key step in providing guarantees for the IZOOM algorithm is the concentration bound of Lemma 1. In particular, one needs to show that, for all $t \in [T]$ and all pairs of hypotheses $f_t, g_t \in H_t$, the random variable $|\ell(f_t(x_s), y_s) - \ell(g_t(x_s), y_s)|\,Q_s/p_s$ is bounded for all past observations $\{(x_s, y_s, p_s, Q_s) : s \le t\}$.
This naturally holds for IWAL-D, since the hypothesis set $H_t$ is always shrinking. It is much less straightforward to show for IZOOM, since IZOOM augments $H_t$ with new hypotheses. To prove the guarantees for IZOOM, we rely on a novel property of the disagreement values, whose proof uses the non-increasing property of the loss function.

**Lemma 2.** Let $H_s$ be a set of hypotheses at time $s$. For any $t \ge s$, assume that $H_t$ is in the convex hull of $H_s$:

$$H_t \subseteq \mathrm{conv}(H_s) = \Big\{ \textstyle\sum_i \lambda_i h_i(x) : \lambda \in \Delta,\ h_i \in H_s \Big\},$$

where $\Delta$ is the probability simplex of dimension $|H_s|$. Then, for any $x \in \mathcal{X}$, the following inequality holds:

$$\max_{h, h' \in H_t} L(h(x), h'(x)) \le \max_{h, h' \in H_s} L(h(x), h'(x)). \qquad (9)$$

Lemma 2 states that, for the sequence of hypothesis sets $H_t$ defined by IZOOM, the maximum disagreement value at a fixed point $x$ is non-increasing, despite the fact that $H_t$ is augmented with newly sampled hypotheses. This property is the key to proving the desired concentration result for IZOOM, Lemma 3, from which we derive the generalization bounds and label complexities.

**Lemma 3 (IZOOM).** For all $\delta > 0$, with probability at least $1 - \delta$, for all $t \in [T]$ and for all $f, g \in H_t$,

$$\big|L_t(f) - L_t(g) - R(f) + R(g)\big| \le \Big(1 + L(f, g) + \frac{4}{T}\Big)\Delta_t.$$

Before presenting the learning guarantees of IZOOM, we introduce some additional notation. Let $\tilde{H}_t$ denote the set of hypotheses that have been considered up to round $t$, which includes the original $H$ and the newly sampled hypotheses: $\tilde{H}_t = H \cup \big(\bigcup_{s \le t} H''_s\big)$. Clearly $\tilde{H}_1 \subseteq \tilde{H}_2 \subseteq \cdots$, as IZOOM considers increasingly richer hypothesis sets. On the other hand, the sampling subroutine (Algorithm 3) creates random combinations of the current empirical best predictor $\hat{h}_t$ with other hypotheses in $H'_{t+1}$; thus $H''_{t+1} \subseteq \mathrm{conv}(H'_{t+1}) \subseteq \mathrm{conv}(H_t)$. It follows that $H_{t+1} \subseteq \mathrm{conv}(H_t)$ for all $t$, and therefore $H_t \subseteq \mathrm{conv}(H_s)$ for all $s \le t$. Let $h^*_t$ be the best-in-class hypothesis of $\tilde{H}_t$, and let $R^*_t = R(h^*_t)$ denote its error. We now present the learning guarantees for IZOOM.
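Lemma 2 can be sanity-checked numerically: for scores at a fixed point $x$, convex combinations of hypotheses yield convex combinations of scores, and the combinations below mimic the RESAMPLE step $\lambda \hat{h}_t + (1 - \lambda) h_i$. The rescaled logistic loss and the choice of "best" score are illustrative assumptions:

```python
import numpy as np

def f(u, M=5.0):
    """Non-increasing, convex surrogate: logistic loss rescaled to [0, 1]."""
    return np.log1p(np.exp(-u)) / np.log1p(np.exp(M))

def pointwise_L(z1, z2):
    """L(h(x), h'(x)) for a margin loss ell(z, y) = f(z * y)."""
    return max(abs(f(z1 * y) - f(z2 * y)) for y in (-1.0, 1.0))

rng = np.random.default_rng(1)
base = rng.uniform(-5.0, 5.0, size=6)   # scores h_i(x) of H_s at a fixed x

# RESAMPLE-style combinations: lam * (best score) + (1 - lam) * (other score)
best = base[np.argmin(base)]            # arbitrary stand-in for the empirical best
lams = rng.uniform(0.0, 1.0, size=10)
others = rng.choice(base, size=10)
combos = lams * best + (1.0 - lams) * others

aug = np.concatenate([base, combos])
max_base = max(pointwise_L(a, b) for a in base for b in base)
max_aug = max(pointwise_L(a, b) for a in aug for b in aug)
# Inequality (9): adding points of conv(H_s) cannot increase the max disagreement
assert max_aug <= max_base + 1e-12
```

Intuitively, the combined scores stay inside the interval spanned by the base scores, and since $f$ is monotone, the extreme pair of base scores already attains the maximum pointwise disagreement.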
Like IWAL and IWAL-D, IZOOM's label complexity bound also depends on the disagreement coefficient. However, for IZOOM the disagreement coefficient changes from round to round, since it is defined with respect to the changing $H_t$ and $h^*_t$. We denote by $\theta_t$ the disagreement coefficient at round $t$ and give guarantees in terms of $\theta_t$. The proof uses Lemma 3 and the convexity of the loss function.

**Theorem 4 (IZOOM).** For any $\delta > 0$, with probability at least $1 - \delta$, for all $t \in [T]$, the following holds: (1) $h^*_t \in H_t$; (2) for all $h \in H_{t+1}$,

$$R(h) \le R^*_t + \Big(2 + L(h, \hat{h}_t) + L(h, h^*_t) + \frac{4}{T}\Big)\Delta_t;$$

(3) additionally,

$$R(\hat{h}_T) \le R^*_T + \big(1 + L(\hat{h}_T, h^*_T)\big)\Delta_T\Big(1 + \frac{4}{T}\Big),$$

and

$$\mathbb{E}[p_t \mid \mathcal{F}_{t-1}] \le 2\theta_t\Big(R^*_t + \max_{h \in H_t}\big(2 + L(h, \hat{h}_t)\big)\Delta_t\Big) + \frac{4}{T}\big(1 + 4\theta_t + 4\theta_t\Delta_t\big).$$

Theorem 4 shows that, remarkably, the best-in-class predictor of the cumulative hypothesis set $\tilde{H}_t$ is always contained in $H_t$, which is only a subset of $\tilde{H}_t$. As IZOOM enriches $\tilde{H}_t$ over time by sampling more hypotheses near $\hat{h}_t$ (and thus near $h^*_t$ as well), the best-in-class error of $\tilde{H}_t$ decreases. It follows that $R^*_T \le R^*$, where $R^*$ is the best-in-class error of IWAL-D run with the initial $H$. In addition, it is reasonable to assume that, for smooth distributions, the disagreement coefficient $\theta_t$ is stable and does not change dramatically over the changing $H_t$. Therefore, IZOOM also decreases the label complexity relative to IWAL-D, since the label complexity scales with the decreasing best-in-class error $R(h^*_t)$. Overall, IZOOM simultaneously decreases the generalization bound and the label complexity relative to IWAL-D. To better understand the result of Theorem 4, we derive results similar to Corollary 1 for IZOOM. In the same way

**Table 1.** Binary classification dataset summary: number of observations ($N$), number of features ($d$), proportion of the minority class ($r$). Datasets are ordered by number of features. For high-dimensional datasets, we only use the first 10 principal components.
| Dataset | N | d | r |
|---|---|---|---|
| skin | 245,057 | 3 | 0.208 |
| codrna | 59,535 | 8 | 0.333 |
| shuttle | 43,500 | 9 | 0.216 |
| magic04 | 19,020 | 10 | 0.352 |
| ijcnn1 | 49,990 | 22 | 0.097 |
| covtype | 581,012 | 54 | 0.488 |
| nomao | 34,465 | 118 | 0.286 |
| a9a | 48,842 | 123 | 0.239 |

as for $\theta_t$, we denote by $\gamma_t$ the time-varying disagreement graph coefficient defined with respect to $H_t$ and $h^*_t$. The rest of the proof is syntactically the same as that of Corollary 1.

**Corollary 2 (IZOOM).** For all $\delta > 0$, with probability at least $1 - \delta$, when $2\gamma_T\big(R^*_T + \Delta_T(1 + \frac{4}{T})\big) < 1$,

$$R(\hat{h}_T) \le R^*_T + \Big(1 + 2\gamma_T\big(R^*_T + \Delta_T(1 + \tfrac{4}{T})\big)\Big)\Delta_T\Big(1 + \frac{4}{T}\Big) \le R^*_T + 2\Delta_T\Big(1 + \frac{4}{T}\Big).$$

Moreover, with probability at least $1 - \delta$, for $t$ such that $3\gamma_t\big(R^*_t + 2\Delta_t + \frac{4}{T}(1 + \Delta_t)\big) < 1$,

$$\mathbb{E}[p_t \mid \mathcal{F}_{t-1}] \le 4\gamma_t\big(R^*_t + 2\Delta_t\big) + \frac{4}{T}\big(1 + 4\gamma_t + 4\gamma_t\Delta_t\big).$$

In the same way as for $\theta_t$, it is reasonable to assume that, for smooth distributions, the coefficient $\gamma_t$ is stable over the changing $H_t$. On the other hand, $R^*_T$ can be significantly reduced by zooming in; thus $\gamma_T R^*_T$ is likely to be substantially smaller than $\gamma R^*$. Thus, IZOOM is likely to achieve further improvements over IWAL-ZOOM, a variant that uses IWAL as the underlying subroutine when zooming in.

## 7. Experiments

In this section, we report the results of several experiments comparing IWAL, IWAL-D, and IZOOM. We experimented with these algorithms on 8 binary classification datasets from the UCI repository. Table 1 summarizes the relevant statistics for these datasets. Due to space limitations, we only show the results for 4 datasets, and give the results for the remaining datasets in Appendix C. For high-dimensional datasets, we only kept the first 10 principal components of the original features. We used the standard logistic loss, defined for all $(x, y) \in \mathcal{X} \times \mathcal{Y}$ and hypothesis $h$ by $\log(1 + e^{-yh(x)})$, which we then rescaled to $[0, 1]$. For all algorithms, we randomly drew 3,000 hyperplanes with bounded norms as our base hypothesis set $H$.
Figure 2. Misclassification loss of IWAL and IWAL-D on held-out test data versus number of labels requested (log2 scale), on nomao, codrna, skin, and covtype. In some cases (nomao, skin, covtype), IWAL-D achieves a better test performance, sometimes mostly at the beginning; in others, the difference is not significant.

Figure 3. Misclassification loss of IWAL and IZOOM on held-out test data versus number of labels requested (log2 scale), on nomao, codrna, skin, and covtype. The figures show that IZOOM provides significant improvements in performance over IWAL, despite being run with only |H| = 3,000. The red curves for IWAL with |H| = 3,000 are repeated from Figure 2.

In addition, for the IZOOM algorithm, we kept |H| = 3,000 throughout the learning process. We found that, by the end, the total number of hypotheses ever considered by IZOOM was around 6,000. For a fair comparison, we therefore also experimented with IWAL using a larger set of |H| = 12,000 hypotheses (twice as many as the number of hypotheses considered by IZOOM), since starting off with a randomly sampled hypothesis set, with no adaptation, intuitively makes IWAL less effective. For every pair h, h′ ∈ H, we approximated L(h, h′) by the average disagreement value on 2,000 unlabeled samples.
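The pairwise approximation of L(h, h′) on unlabeled data can be sketched as follows. This is a minimal sketch assuming linear hypotheses, taking the per-point disagreement to be the IWAL-style quantity max over y ∈ {−1, +1} of |ℓ(h(x), y) − ℓ(h′(x), y)| (an assumption about how the disagreement values are computed), and rescaling the logistic loss by an assumed bound M.

```python
import numpy as np

M = np.log1p(np.exp(5.0))  # assumed bound on the raw loss, i.e. |y h(x)| <= 5

def rescaled_logistic_loss(margin):
    # log(1 + e^{-margin}) divided by the assumed bound M, clipped to [0, 1]
    # in case the bound is exceeded on a given sample.
    return np.clip(np.log1p(np.exp(-margin)) / M, 0.0, 1.0)

def avg_disagreement(h1, h2, X):
    """Monte Carlo estimate of L(h1, h2): the average over unlabeled points
    of max_{y in {-1,+1}} |loss(h1(x), y) - loss(h2(x), y)|."""
    s1, s2 = X @ h1, X @ h2
    d_pos = np.abs(rescaled_logistic_loss(s1) - rescaled_logistic_loss(s2))    # y = +1
    d_neg = np.abs(rescaled_logistic_loss(-s1) - rescaled_logistic_loss(-s2))  # y = -1
    return float(np.mean(np.maximum(d_pos, d_neg)))

rng = np.random.default_rng(0)
X = rng.normal(size=(2000, 10))              # 2,000 unlabeled samples
h1, h2 = rng.normal(size=10), rng.normal(size=10)
L_hh = avg_disagreement(h1, h2, X)
```

With |H| = 3,000, such estimates would be computed for every pair of hypotheses, so in practice one would vectorize over all hypotheses at once (a single matrix product X @ H.T) rather than calling the pairwise routine shown here.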
For each experiment, we randomly shuffled the dataset, ran the algorithms on the first 50% of the data, and tested the learned classifier on the remaining 50%. We repeated the process 50 times for each dataset and report the average results with standard errors.

We first compared IWAL-D with IWAL, with both algorithms starting from the same 3,000 initial hypotheses. Figure 2 shows the misclassification loss of IWAL and IWAL-D on held-out test data against the number of labels requested (on a log2 scale). We observe that, in some cases (nomao, skin, covtype), IWAL-D outperforms IWAL with a lower misclassification error. In other cases, the difference is not significant. Overall, there appear to be advantages to using the more aggressive disagreement-graph-based pruning strategy.

Next, we compared IZOOM to IWAL for various hypothesis set sizes. Figure 3 plots the misclassification loss on held-out test data against the number of labels (on a log2 scale). On almost all datasets, the performance of IWAL improves significantly from |H| = 3,000 to |H| = 12,000, as expected. However, on almost all datasets, IZOOM with |H| = 3,000 achieves, almost from the beginning, a significantly better prediction accuracy than the original IWAL even with the larger |H|. Meanwhile, IZOOM considered a total of about 6,000 hypotheses within each experiment, only half of the largest |H| used for IWAL. This again illustrates that IZOOM samples from the function space more effectively than uniform sampling. On several datasets, the learning curve of IZOOM is much steeper than that of IWAL at the beginning, which makes IZOOM a promising active learning algorithm, since performance in the early regime is of particular interest in active learning.

8. Conclusion

We presented two active learning algorithms exploiting average disagreements between hypotheses. We showed that they benefit from favorable generalization and label complexity guarantees.
We also reported the results of several experiments demonstrating that they can achieve substantial performance improvements over existing active learning algorithms such as IWAL. Altogether, our theory, algorithms, and empirical results provide a very effective solution for active learning.

Acknowledgments

We thank all anonymous reviewers for their comments. This work was partly funded by NSF CCF-1535987 and NSF IIS-1618662.

References

P. Awasthi, M.-F. Balcan, and P. Long. The power of localization for efficiently learning linear separators with noise. In Proceedings of the Forty-Sixth Annual ACM Symposium on Theory of Computing, pages 449-458. ACM, 2014.

P. Awasthi, M.-F. Balcan, N. Haghtalab, and R. Urner. Efficient learning of linear separators under bounded noise. In Conference on Learning Theory, pages 167-190, 2015.

M.-F. Balcan and P. Long. Active and passive learning of linear separators under log-concave distributions. In Conference on Learning Theory, pages 288-316, 2013.

M.-F. Balcan, A. Beygelzimer, and J. Langford. Agnostic active learning. In ICML, 2006.

M.-F. Balcan, A. Broder, and T. Zhang. Margin based active learning. In International Conference on Computational Learning Theory, pages 35-50. Springer, 2007.

A. Beygelzimer, S. Dasgupta, and J. Langford. Importance weighted active learning. In Proceedings of the 26th Annual International Conference on Machine Learning, pages 49-56. ACM, 2009.

A. Beygelzimer, D. Hsu, J. Langford, and T. Zhang. Agnostic active learning without constraints. In Advances in Neural Information Processing Systems, pages 199-207, 2010.

D. Cohn, L. Atlas, and R. Ladner. Improving generalization with active learning. Machine Learning, 15(2):201-221, 1994.

C. Cortes, G. De Salvo, C. Gentile, M. Mohri, and N. Zhang. Region-based active learning. In AISTATS, 2019.

S. Dasgupta, A. T. Kalai, and C. Monteleoni. Analysis of perceptron-based active learning.
In International Conference on Computational Learning Theory, pages 249-263. Springer, 2005.

S. Dasgupta, D. Hsu, and C. Monteleoni. A general agnostic active learning algorithm. In Advances in Neural Information Processing Systems, pages 353-360, 2008.

Y. Guo, P. L. Bartlett, J. Shawe-Taylor, and R. C. Williamson. Covering numbers for support vector machines. In COLT, 1999.

S. Hanneke. A bound on the label complexity of agnostic active learning. In Proceedings of the 24th International Conference on Machine Learning, pages 353-360. ACM, 2007.

S. Hanneke. Theory of disagreement-based active learning. Foundations and Trends in Machine Learning, 7(2-3):131-309, 2014.

C. Zhang. Efficient active learning of sparse halfspaces. In Conference on Learning Theory, 2018.

C. Zhang and K. Chaudhuri. Beyond disagreement-based agnostic active learning. In Advances in Neural Information Processing Systems, pages 442-450, 2014.