Diameter-Based Active Learning

Christopher Tosh (1)    Sanjoy Dasgupta (1)

(1) Department of Computer Science and Engineering, UC San Diego, La Jolla, CA, USA. Correspondence to: Christopher Tosh, Sanjoy Dasgupta.

Proceedings of the 34th International Conference on Machine Learning, Sydney, Australia, PMLR 70, 2017. Copyright 2017 by the author(s).

Abstract

To date, the tightest upper and lower bounds for the active learning of general concept classes have been in terms of a parameter of the learning problem called the splitting index. We provide, for the first time, an efficient algorithm that is able to realize this upper bound, and we empirically demonstrate its good performance.

1. Introduction

In many situations where a classifier is to be learned, it is easy to collect unlabeled data but costly to obtain labels. This has motivated the pool-based active learning model, in which a learner has access to a collection of unlabeled data points and is allowed to ask for individual labels in an adaptive manner. The hope is that choosing these queries intelligently will rapidly yield a low-error classifier, much more quickly than with random querying. A central focus of active learning is developing efficient querying strategies and understanding their label complexity.

Over the past decade or two, there has been substantial progress in developing such rigorously justified active learning schemes for general concept classes. For the most part, these schemes can be described as mellow: rather than focusing upon maximally informative points, they query any point whose label cannot reasonably be inferred from the information received so far. It is of interest to develop more aggressive strategies with better label complexity.

An exception to this general trend is the aggressive strategy of Dasgupta (2005), whose label complexity is known to be optimal in its dependence on a key parameter called the splitting index. However, this strategy has been primarily of theoretical interest because it is difficult to implement algorithmically. In this paper, we introduce a variant of the methodology that yields efficient algorithms. We show that it admits roughly the same label complexity bounds as well as having promising experimental performance.

As with the original splitting index result, we operate in the realizable setting, where data can be perfectly classified by some function $h^*$ in the hypothesis class $H$. At any given time during the active learning process, the remaining candidates, that is, the elements of $H$ consistent with the data so far, are called the version space. The goal of aggressive active learners is typically to pick queries that are likely to shrink this version space rapidly. But what is the right notion of size? Dasgupta (2005) pointed out that the diameter of the version space is what matters, where the distance between two classifiers is taken to be the fraction of points on which they make different predictions. Unfortunately, the diameter is a difficult measure to work with because it cannot, in general, be decreased at a steady rate. Thus the earlier work used a procedure that has quantifiable label complexity but is not conducive to implementation.

We take a fresh perspective on this earlier result. We start by suggesting an alternative, but closely related, notion of the size of a version space: the average pairwise distance between hypotheses in the version space, with respect to some underlying probability distribution $\pi$ on $H$.
This distribution $\pi$ can be arbitrary; that is, there is no requirement that the target $h^*$ is chosen from it. But it should be chosen so that it is easy to sample from. When $H$ consists of linear separators, for instance, a good choice would be a log-concave density, such as a Gaussian.

At any given time, the next query $x$ is chosen roughly as follows:

- Sample a collection of classifiers $h_1, h_2, \ldots, h_m$ from $\pi$ restricted to the current version space $V$.
- Compute the distances between them; this can be done using just the unlabeled points.
- Any candidate query $x$ partitions the classifiers $\{h_i\}$ into two groups: those that assign it a $+$ label (call these $V_x^+$) and those that assign it a $-$ label (call these $V_x^-$).
- Estimate the average diameter after labeling $x$ by the sum of the distances between classifiers $h_i$ within $V_x^+$, or those within $V_x^-$, whichever is larger.
- Out of the pool of unlabeled data, pick the $x$ for which this diameter estimate is smallest.

This is repeated until the version space has small enough average diameter that a random sample from it is very likely to have error less than a user-specified threshold $\epsilon$. We show how all these steps can be achieved efficiently, as long as there is a sampler for $\pi$.

Dasgupta (2005) pointed out that the label complexity of active learning depends on the underlying distribution, the amount of unlabeled data (since more data means greater potential for highly informative points), and also the target classifier $h^*$. That paper identifies a parameter called the splitting index $\rho$ that captures the relevant geometry, and gives upper bounds on label complexity that are proportional to $1/\rho$, as well as showing that this dependence is inevitable. For our modified notion of diameter, a different averaged splitting index is needed. However, we show that it can be bounded by the original splitting index, with an extra multiplicative factor of $\log(1/\epsilon)$; thus all previously obtained label complexity results translate immediately for our new algorithm.
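To make this procedure concrete, here is a minimal sketch, assuming the sampled hypotheses are represented by their $\pm 1$ prediction vectors on the unlabeled pool; the function names and this representation are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def pairwise_distances(preds):
    # preds: (m, n_pool) array of +/-1 predictions of m sampled hypotheses.
    # d(h, h') is the fraction of pool points on which the two disagree.
    return (preds[:, None, :] != preds[None, :, :]).mean(axis=2)  # (m, m)

def query_score(preds, D, j):
    # Score candidate pool point j: sum pairwise distances among the sampled
    # hypotheses that would survive a '+' answer, and among those surviving
    # a '-' answer; the larger of the two is the pessimistic estimate.
    plus = np.where(preds[:, j] == 1)[0]
    minus = np.where(preds[:, j] == -1)[0]
    s_plus = D[np.ix_(plus, plus)].sum() / 2.0    # each unordered pair once
    s_minus = D[np.ix_(minus, minus)].sum() / 2.0
    return max(s_plus, s_minus)

def choose_query(preds):
    # Pick the pool point whose pessimistic diameter estimate is smallest.
    D = pairwise_distances(preds)
    scores = [query_score(preds, D, j) for j in range(preds.shape[1])]
    return int(np.argmin(scores))
```

In practice the sampled hypotheses would be refreshed after every query, since the version space shrinks as labels arrive.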
2. Related Work

The theory of active learning has developed along several fronts.

One of these is nonparametric active learning, where the learner starts with a pool of unlabeled points, adaptively queries a few of them, and then fills in the remaining labels. The goal is to do this with as few errors as possible. (In particular, the learner does not return a classifier from some predefined parametrized class.) One scheme begins by building a neighborhood graph on the unlabeled data and propagating queried labels along the edges of this graph (Zhu et al., 2003; Cesa-Bianchi et al., 2009; Dasarathy et al., 2015). Another starts with a hierarchical clustering of the data and moves down the tree, sampling at random until it finds clusters that are relatively pure in their labels (Dasgupta & Hsu, 2008). The label complexity of such methods has typically been given in terms of smoothness properties of the underlying data distribution (Castro & Nowak, 2008; Kpotufe et al., 2015).

Another line of work has focused on active learning of linear separators, by querying points close to the current guess at the decision boundary (Balcan et al., 2007; Dasgupta et al., 2009; Balcan & Long, 2013). Such algorithms are close in spirit to those used in practice, but their analysis to date has required fairly strong assumptions to the effect that the underlying distribution on the unlabeled points is log-concave. Interestingly, regret guarantees for online algorithms of this sort can be shown under far weaker conditions (Cesa-Bianchi et al., 2006).

The third category of results, to which the present paper belongs, considers active learning strategies for general concept classes $H$. Some of these schemes (Cohn et al., 1994; Dasgupta et al., 2007; Beygelzimer et al., 2009; Balcan et al., 2009; Zhang & Chaudhuri, 2014) are fairly mellow in the sense described earlier, using generalization bounds to gauge which labels can be inferred from those obtained so far. The label complexity of these methods can be bounded in terms of a quantity known as the disagreement coefficient (Hanneke, 2007). In the realizable case, the canonical such algorithm is that of Cohn et al. (1994), henceforth referred to as CAL. Other methods use a prior distribution $\pi$ over the hypothesis class, sometimes assuming that the target classifier is a random draw from this prior. These methods typically aim to shrink the mass of the version space under $\pi$, either greedily and explicitly (Dasgupta, 2004; Guillory & Bilmes, 2009; Golovin et al., 2010) or implicitly (Freund et al., 1997). Perhaps the most widely used of these methods is the latter, query-by-committee, henceforth QBC.

As mentioned earlier, shrinking $\pi$-mass is not an optimal strategy if low misclassification error is the ultimate goal. In particular, what matters is not the prior mass of the remaining version space, but rather how different these candidate classifiers are from each other. This motivates using the diameter of the version space as a yardstick, which was first proposed by Dasgupta (2005) and is taken up again here.

3. Preliminaries

Consider a binary hypothesis class $H$, a data space $X$, and a distribution $D$ over $X$. For mathematical convenience, we will restrict ourselves to finite hypothesis classes. (We can do this without loss of generality when $H$ has finite VC dimension, since we only use the predictions of hypotheses on a pool of unlabeled points; however, we do not spell out the details of this reduction here.) The hypothesis distance induced by $D$ over $H$ is the pseudometric

\[ d(h, h') := \Pr_{x \sim D}(h(x) \neq h'(x)). \]

Given a point $x \in X$ and a subset $V \subset H$, denote

\[ V_x^+ = \{h \in V : h(x) = 1\} \quad \text{and} \quad V_x^- = V \setminus V_x^+. \]

Given a sequence of data points $x_1, \ldots, x_n$ and a target hypothesis $h^*$, the induced version space is the set of hypotheses that are consistent with the target hypothesis on the sequence, i.e.

\[ \{h \in H : h(x_i) = h^*(x_i) \text{ for all } i = 1, \ldots, n\}. \]

3.1. Diameter and the Splitting Index

The diameter of a set of hypotheses $V \subset H$ is the maximal distance between any two hypotheses in $V$, i.e.

\[ \mathrm{diam}(V) := \max_{h, h' \in V} d(h, h'). \]

Without any prior information, any hypothesis in the version space could be the target. Thus the worst-case error of any hypothesis in the version space is the diameter of the version space. The splitting index roughly characterizes the number of queries required for an active learning algorithm to reduce the diameter of the version space below $\epsilon$.

While reducing the diameter of a version space $V \subset H$, we will sometimes identify pairs of hypotheses $h, h' \in V$ that are far apart and therefore need to be separated. We will refer to $\{h, h'\}$ as an edge. Given a set of edges $E = \{\{h_1, h'_1\}, \ldots, \{h_n, h'_n\}\} \subset V \times V$, we say a data point $x$ $\rho$-splits $E$ if querying $x$ separates at least a $\rho$ fraction of the pairs, that is, if

\[ \max\{|E_x^+|, |E_x^-|\} \le (1 - \rho)\,|E|, \]

where $E_x^+ = \{\{h, h'\} \in E : h(x) = h'(x) = 1\}$ and similarly for $E_x^-$. When attempting to get accuracy $\epsilon > 0$, we need only eliminate edges of length greater than $\epsilon$. Define $E_\epsilon = \{\{h, h'\} \in E : d(h, h') > \epsilon\}$. The splitting index of a set $V \subset H$ is a tuple $(\rho, \epsilon, \tau)$ such that for all finite edge-sets $E \subset V \times V$,

\[ \Pr_{x \sim D}\bigl(x \text{ } \rho\text{-splits } E_\epsilon\bigr) \ge \tau. \]
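As an illustration of this definition, the sketch below checks whether a pool point $\rho$-splits the long edges of a finite edge set, again assuming hypotheses are represented by prediction vectors; the helper names are hypothetical.

```python
import numpy as np

def rho_splits(edges, dist, preds, x_idx, rho, eps):
    """Check whether pool point x_idx rho-splits the edges of length > eps.

    edges : list of (i, j) index pairs into the sampled hypotheses
    dist  : (m, m) matrix of hypothesis distances d(h_i, h_j)
    preds : (m, n_pool) +/-1 prediction matrix of the hypotheses
    """
    # Keep only the edges longer than eps (the ones relevant for accuracy eps).
    E_eps = [(i, j) for (i, j) in edges if dist[i, j] > eps]
    if not E_eps:
        return True
    # Edges whose two endpoints agree on x survive the corresponding answer.
    E_plus = [(i, j) for (i, j) in E_eps
              if preds[i, x_idx] == 1 and preds[j, x_idx] == 1]
    E_minus = [(i, j) for (i, j) in E_eps
               if preds[i, x_idx] == -1 and preds[j, x_idx] == -1]
    # x rho-splits E_eps if either answer eliminates at least a rho fraction.
    return max(len(E_plus), len(E_minus)) <= (1 - rho) * len(E_eps)
```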
The following theorem, due to Dasgupta (2005), bounds the sample complexity of active learning in terms of the splitting index. The $\tilde{O}$ notation hides polylogarithmic factors in $d$, $\rho$, $\tau$, $\log 1/\epsilon$, and the failure probability $\delta$.

Theorem 1 (Dasgupta, 2005). Suppose $H$ is a hypothesis class with splitting index $(\rho, \epsilon, \tau)$. Then to learn a hypothesis with error $\epsilon$,

(a) any active learning algorithm with $\le 1/\tau$ unlabeled samples must request at least $1/\rho$ labels, and

(b) if $H$ has VC dimension $d$, there is an active learning algorithm that draws $\tilde{O}\!\left(\frac{d}{\rho\tau}\log^2(1/\epsilon)\right)$ unlabeled data points and requests $\tilde{O}\!\left(\frac{d}{\rho}\log^2(1/\epsilon)\right)$ labels.

Unfortunately, the only known algorithm satisfying (b) above is intractable for all but the simplest hypothesis classes: it constructs an $\epsilon$-covering of the hypothesis space and queries points which whittle away at the diameter of this covering. To overcome this intractability, we consider a slightly more benign setting in which we have a samplable prior distribution $\pi$ over our hypothesis space $H$.

3.2. An Average Notion of Diameter

With a prior distribution, it makes sense to shift away from the worst case to the average case. We define the average diameter of a subset $V \subset H$ as the expected distance between two hypotheses in $V$ randomly drawn from $\pi$, i.e.

\[ \Phi(V) := \mathbb{E}_{h, h' \sim \pi|_V}[d(h, h')], \]

where $\pi|_V$ is the conditional distribution induced by restricting $\pi$ to $V$, that is, $\pi|_V(h) = \pi(h)/\pi(V)$ for $h \in V$.

Intuitively, a version space with very small average diameter ought to put high weight on hypotheses that are close to the true hypothesis. Indeed, given a version space $V$ with $h^* \in V$, the following lemma shows that if $\Phi(V)$ is small enough, then a low-error hypothesis can be found by two popular heuristics: random sampling and MAP estimation.

Lemma 2. Suppose $V \subset H$ contains $h^*$. Pick $\epsilon > 0$.

(a) (Random sampling) If $\Phi(V) \le \epsilon\, \pi|_V(h^*)$, then $\mathbb{E}_{h \sim \pi|_V}[d(h^*, h)] \le \epsilon$.

(b) (MAP estimation) Write $p_{\mathrm{map}} = \max_{h \in V} \pi|_V(h)$. Pick $0 < \Delta < p_{\mathrm{map}}$. If

\[ \Phi(V) \le 2\epsilon \left(\min\{\pi|_V(h^*),\, p_{\mathrm{map}} - \Delta\}\right)^2, \]

then $d(h^*, h) \le \epsilon$ for any $h$ with $\pi|_V(h) \ge p_{\mathrm{map}} - \Delta$.

Proof. Part (a) follows from

\[ \Phi(V) = \mathbb{E}_{h, h' \sim \pi|_V}[d(h, h')] \ge \pi|_V(h^*)\, \mathbb{E}_{h \sim \pi|_V}[d(h^*, h)]. \]

For (b), take $\delta = \min(\pi|_V(h^*), p_{\mathrm{map}} - \Delta)$ and define $V_{\epsilon,\delta} = \{h \in V : \pi|_V(h) \ge \delta\}$. Note that $V_{\epsilon,\delta}$ contains $h^*$ as well as any $h \in V$ with $\pi|_V(h) \ge p_{\mathrm{map}} - \Delta$. We claim $\mathrm{diam}(V_{\epsilon,\delta})$ is at most $\epsilon$. Suppose not. Then there exist $h_1, h_2 \in V_{\epsilon,\delta}$ satisfying $d(h_1, h_2) > \epsilon$, implying

\[ \Phi(V) = \mathbb{E}_{h, h' \sim \pi|_V}[d(h, h')] \ge 2\, \pi|_V(h_1)\, \pi|_V(h_2)\, d(h_1, h_2) > 2\delta^2 \epsilon. \]

But this contradicts our assumption on $\Phi(V)$. Since both $h, h^* \in V_{\epsilon,\delta}$, we have (b).

3.3. An Average Notion of Splitting

We now turn to defining an average notion of splitting. A data point $x$ $\rho$-average splits $V$ if

\[ \max\left\{ \frac{\pi(V_x^+)^2}{\pi(V)^2}\, \Phi(V_x^+),\; \frac{\pi(V_x^-)^2}{\pi(V)^2}\, \Phi(V_x^-) \right\} \le (1 - \rho)\, \Phi(V). \]

And we say a set $S \subset H$ has average splitting index $(\rho, \epsilon, \tau)$ if for any subset $V \subset S$ such that $\Phi(V) > \epsilon$,

\[ \Pr_{x \sim D}\bigl(x \text{ } \rho\text{-average splits } V\bigr) \ge \tau. \]

Intuitively, average splitting refers to the ability to significantly decrease the potential function

\[ \pi(V)^2\, \Phi(V) = \mathbb{E}_{h, h' \sim \pi}\bigl[\mathbf{1}(h, h' \in V)\, d(h, h')\bigr] \]

with a single query. While this potential function may seem strange at first glance, it is closely related to the original splitting index. The following lemma, whose proof is deferred to Section 5, shows that the splitting index bounds the average splitting index for any hypothesis class.

Lemma 3. Let $\pi$ be a probability measure over a hypothesis class $H$. If $H$ has splitting index $(\rho, \epsilon, \tau)$, then it has average splitting index $\left(\frac{\rho}{4\lceil \log(1/\epsilon) \rceil}, 2\epsilon, \tau\right)$.
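To ground these definitions, here is a minimal sketch that evaluates the average diameter and the $\rho$-average-split condition exactly for a small finite hypothesis class, with the prior given as an explicit array of masses; the data layout and names are assumptions made for illustration.

```python
import numpy as np

def avg_diameter(prior, dist, V):
    # Phi(V) = E_{h,h' ~ pi|V}[d(h,h')] for a finite class, computed exactly.
    V = np.asarray(V)
    p = prior[V] / prior[V].sum()              # conditional distribution pi|V
    return p @ dist[np.ix_(V, V)] @ p

def rho_average_splits(prior, dist, preds, V, x_idx, rho):
    # Does pool point x_idx rho-average split V?  Compares the normalized
    # potentials pi(V_x^{+/-})^2 Phi(V_x^{+/-}) / pi(V)^2 against (1-rho) Phi(V).
    V = np.asarray(V)
    mass_V = prior[V].sum()
    sides = []
    for label in (+1, -1):
        Vx = V[preds[V, x_idx] == label]
        if len(Vx) == 0:
            sides.append(0.0)
            continue
        sides.append((prior[Vx].sum() / mass_V) ** 2 * avg_diameter(prior, dist, Vx))
    return max(sides) <= (1 - rho) * avg_diameter(prior, dist, V)
```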
Dasgupta (2005) derived the splitting indices for several hypothesis classes, including intervals and homogeneous linear separators. Lemma 3 implies average splitting indices within a $\log(1/\epsilon)$ factor in these settings.

Moreover, given access to samples from $\pi|_V$, we can easily estimate the quantities appearing in the definition of average splitting. For an edge sequence $E = (\{h_1, h'_1\}, \ldots, \{h_n, h'_n\})$, define

\[ \Delta(E) := \sum_{i=1}^n d(h_i, h'_i) \quad \text{and} \quad \Delta(E_x^+) := \sum_{i=1}^n \mathbf{1}(h_i, h'_i \in V_x^+)\, d(h_i, h'_i), \]

and similarly for $\Delta(E_x^-)$. When $h_i, h'_i$ are i.i.d. draws from $\pi|_V$ for all $i = 1, \ldots, n$, which we denote $E \sim (\pi|_V)^{2n}$, the random variables $\frac{1}{n}\Delta(E)$, $\frac{1}{n}\Delta(E_x^-)$, and $\frac{1}{n}\Delta(E_x^+)$ are unbiased estimators of the quantities appearing in the definition of average splitting.

Lemma 4. Given $E \sim (\pi|_V)^{2n}$, we have

\[ \mathbb{E}\left[\frac{1}{n}\Delta(E)\right] = \Phi(V) \quad \text{and} \quad \mathbb{E}\left[\frac{1}{n}\Delta(E_x^+)\right] = \frac{\pi(V_x^+)^2}{\pi(V)^2}\, \Phi(V_x^+) \]

for any $x \in X$. Similarly for $E_x^-$.

Proof. From definitions and linearity of expectation, it is easy to observe $\mathbb{E}[\Delta(E)] = n\,\Phi(V)$. By the independence of $h_i, h'_i$, we additionally have

\[ \mathbb{E}[\Delta(E_x^+)] = \sum_{i=1}^n \mathbb{E}\bigl[\mathbf{1}(h_i, h'_i \in V_x^+)\, d(h_i, h'_i)\bigr] = \sum_{i=1}^n \Pr(h_i, h'_i \in V_x^+)\, \mathbb{E}\bigl[d(h_i, h'_i) \mid h_i, h'_i \in V_x^+\bigr] = n\, \frac{\pi(V_x^+)^2}{\pi(V)^2}\, \Phi(V_x^+). \]

Remark: It is tempting to define average splitting in terms of the average diameter as

\[ \max\{\Phi(V_x^+), \Phi(V_x^-)\} \le (1 - \rho)\, \Phi(V). \]

However, this definition does not satisfy a nice relationship with the splitting index. Indeed, there exist hypothesis classes $V$ for which there are many points which $1/4$-split $E$ for any $E \subset V \times V$ but for which every $x \in X$ satisfies $\max\{\Phi(V_x^+), \Phi(V_x^-)\} \approx \Phi(V)$. This observation is formally proven in the appendix.

4. An Average Splitting Index Algorithm

Suppose we are given a version space $V$ with average splitting index $(\rho, \epsilon, \tau)$. If we draw $O(1/\tau)$ points from the data distribution then, with high probability, one of these will $\rho$-average split $V$. Querying that point will result in a version space $V'$ with significantly smaller potential $\pi(V')^2\Phi(V')$. If we knew the value $\rho$ a priori, then Lemma 4 combined with standard concentration bounds (Hoeffding, 1963; Angluin & Valiant, 1977) would give us a relatively straightforward procedure to find a good query point:

1. Draw $E' \sim (\pi|_V)^{2M}$ and compute the empirical estimate $\widehat{\Phi}(V) = \frac{1}{M}\Delta(E')$.

2. Draw $E \sim (\pi|_V)^{2N}$ for $N$ depending on $\rho$ and $\widehat{\Phi}$.

3. For suitable $M$ and $N$, it will be the case that with high probability, for some $x$, the estimate $\frac{1}{N}\max\{\Delta(E_x^+), \Delta(E_x^-)\}$ certifies that $x$ average splits $V$. Querying that point will decrease the potential.

However, we typically would not know the average splitting index ahead of time. Moreover, it is possible that the average splitting index may change from one version space to the next. In the next section, we describe a query selection procedure that adapts to the splittability of the current version space.

4.1. Finding a Good Query Point

Algorithm 2, which we term SELECT, is our query selection procedure. It takes as input a sequence of data points $x_1, \ldots, x_m$, at least one of which $\rho$-average splits the current version space, and with high probability finds a data point that $\rho/8$-average splits the version space. SELECT proceeds by positing an optimistic estimate of $\rho$, which we denote $\widehat{\rho}_t$, and successively halving it until we are confident that we have found a point that $\widehat{\rho}_t$-average splits the version space.

Algorithm 1 DBAL
Input: hypothesis class $H$, prior distribution $\pi$
Initialize $V = H$
while $\frac{1}{n}\Delta(E) \ge \frac{3\epsilon}{4}$ for $E \sim (\pi|_V)^{2n}$ do
    Draw $m$ data points $x = (x_1, \ldots, x_m)$
    Query point $x_i = \mathrm{SELECT}(V, x)$ and set $V$ to be consistent with the result
end while
return the current version space $V$ in the form of the queried points $(x_1, h^*(x_1)), \ldots, (x_K, h^*(x_K))$

Algorithm 2 SELECT
Input: version space $V$, prior $\pi$, data $x = (x_1, \ldots, x_m)$
Set $\widehat{\rho}_1 = 1/2$
for $t = 1, 2, \ldots$ do
    Draw $E' \sim (\pi|_V)^{2m_t}$ and compute $\widehat{\Phi}_t = \frac{1}{m_t}\Delta(E')$
    Draw $E \sim (\pi|_V)^{2n_t}$
    If $\exists\, x_i$ s.t. $\frac{1}{n_t}\max\{\Delta(E_{x_i}^+), \Delta(E_{x_i}^-)\} \le (1 - \widehat{\rho}_t)\,\widehat{\Phi}_t$, then halt and return $x_i$
    Otherwise, let $\widehat{\rho}_{t+1} = \widehat{\rho}_t / 2$
end for
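The Python sketch below mirrors Algorithm 2 under a sampling-oracle abstraction: `sample_pairs` draws hypothesis pairs from $\pi$ restricted to $V$, and `delta_hat` evaluates the $\Delta$ estimator. Both names are illustrative assumptions, and the fixed sample sizes `m_t` and `n_t` are stand-ins for the round-dependent values prescribed by Lemma 5 below.

```python
def select(sample_pairs, delta_hat, candidates, m_t=2000, n_t=2000, max_rounds=20):
    """Sketch of SELECT: find a candidate that roughly rho_hat-average splits V.

    sample_pairs(k) -> list of k hypothesis pairs (h, h') drawn i.i.d. from pi|V
    delta_hat(pairs, x=None, label=None) -> sum of d(h, h') over the pairs,
        restricted (if x is given) to pairs with h(x) = h'(x) = label
    candidates -> pool points x_1, ..., x_m
    """
    rho_hat = 0.5
    for _ in range(max_rounds):
        # Empirical average diameter Phi_hat_t from a fresh sample of pairs.
        # (The paper grows m_t and n_t as rho_hat shrinks; fixed here for brevity.)
        E0 = sample_pairs(m_t)
        phi_hat = delta_hat(E0) / m_t
        # Fresh pairs for evaluating every candidate at this optimism level.
        E = sample_pairs(n_t)
        for x in candidates:
            worst = max(delta_hat(E, x, +1), delta_hat(E, x, -1)) / n_t
            if worst <= (1 - rho_hat) * phi_hat:
                return x          # certified to average split V at level ~rho_hat
        rho_hat /= 2              # be less optimistic and try again
    return None                   # no candidate certified within max_rounds
```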
In order for this algorithm to succeed, we need to choose $n_t$ and $m_t$ such that with high probability (1) $\widehat{\Phi}_t$ is an accurate estimate of $\Phi(V)$ and (2) our halting condition will be true if $\widehat{\rho}_t$ is within a constant factor of $\rho$ and false otherwise. The following lemma, whose proof is in the appendix, provides such choices for $n_t$ and $m_t$.

Lemma 5. Let $\rho, \epsilon, \delta_0 > 0$ be given. Suppose that the version space $V$ satisfies $\Phi(V) > \epsilon$. In SELECT, fix a round $t$ and a data point $x \in X$ that exactly $\rho'$-average splits $V$, that is,

\[ \max\left\{ \frac{\pi(V_x^+)^2}{\pi(V)^2}\, \Phi(V_x^+),\; \frac{\pi(V_x^-)^2}{\pi(V)^2}\, \Phi(V_x^-) \right\} = (1 - \rho')\, \Phi(V). \]

If $m_t$ and $n_t$ are large enough (polynomial in $1/\widehat{\rho}_t$, $1/\epsilon$, and $\log(1/\delta_0)$; the exact requirements are given in the appendix), then with probability $1 - \delta_0$, $\widehat{\Phi}_t \ge (1 - \widehat{\rho}_t/4)\,\Phi(V)$ and

(a) if $\rho' \le \widehat{\rho}_t/2$, then $\frac{1}{n_t}\max\{\Delta(E_x^+), \Delta(E_x^-)\} > (1 - \widehat{\rho}_t)\,\widehat{\Phi}_t$, and

(b) if $\rho' \ge 2\widehat{\rho}_t$, then $\frac{1}{n_t}\max\{\Delta(E_x^+), \Delta(E_x^-)\} \le (1 - \widehat{\rho}_t)\,\widehat{\Phi}_t$.

Given the above lemma, we can establish a bound on the number of rounds and the total number of hypotheses SELECT needs to find a data point that $\rho/8$-average splits the version space.

Theorem 6. Suppose that SELECT is called with a version space $V$ with $\Phi(V) \ge \epsilon$ and a collection of points $x_1, \ldots, x_m$ such that at least one of the $x_i$ $\rho$-average splits $V$. If $\delta_0 \le \delta / (2m(2 + \log(1/\rho)))$, then with probability at least $1 - \delta$, SELECT returns a point $x_i$ that $(\rho/8)$-average splits $V$, finishing in less than $\lceil \log(1/\rho) \rceil + 1$ rounds and drawing a total number of hypotheses that is polynomial in $1/\rho$ and $1/\epsilon$ and logarithmic in $1/\delta_0$.

Remark 1: It is possible to modify SELECT to find a point $x_i$ that $(c\rho)$-average splits $V$ for any constant $c < 1$ while only having to draw $O(1)$ more hypotheses in total. First note that by halving $\widehat{\rho}_t$ at each step, we immediately give up a factor of two in our approximation. This can be made smaller by taking narrower steps. Additionally, with a constant-factor increase in $m_t$ and $n_t$, the approximation ratios in Lemma 5 can be set to any constant.

Remark 2: At first glance, it appears that SELECT requires us to know $\rho$ in order to calculate $\delta_0$. However, a crude lower bound on $\rho$ suffices, and such a bound can always be found in terms of $\epsilon$. This is because any version space is $(\epsilon/2, \epsilon, \epsilon/2)$-splittable (Dasgupta, 2005, Lemma 1). By Lemma 3, we can therefore substitute $\epsilon / (8\lceil \log(2/\epsilon) \rceil)$ for $\rho$ when we compute $\delta_0$.

Proof of Theorem 6. Let $T := \lceil \log(1/\rho) \rceil + 1$. By Lemma 5, we know that with probability $1 - \delta/2$, for rounds $t = 1, \ldots, T$ we do not return any point that does worse than $\widehat{\rho}_t/2$-average split $V$. Moreover, in the $T$-th round it will be the case that $\rho/4 \le \widehat{\rho}_T \le \rho/2$, and therefore, with probability $1 - \delta/2$, we will select a point which does no worse than $\widehat{\rho}_T/2$-average split $V$, which in turn does no worse than $\rho/8$-average split $V$.

Note that we draw $m_t + n_t$ hypotheses at each round. By Lemma 5, for each round $\widehat{\Phi}_t \ge 3\Phi(V)/4 \ge 3\epsilon/4$. Thus the total number of hypotheses drawn is $\sum_{t=1}^T (m_t + n_t)$, which, given $\widehat{\rho}_t = 1/2^t$ and $T \le 2 + \log(1/\rho)$, yields the bound claimed above. Plugging in $\delta_0 \le \frac{\delta}{2m(2 + \log(1/\rho))}$, we recover the theorem statement.

4.2. Active Learning Strategy

Using the SELECT procedure as a subroutine, Algorithm 1, henceforth DBAL for Diameter-Based Active Learning, is our active learning strategy. Given a hypothesis class with average splitting index $(\rho, \epsilon/2, \tau)$, DBAL queries data points provided by SELECT until it is confident that $\Phi(V) < \epsilon$. Denote by $V_t$ the version space in the $t$-th round of DBAL.
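Putting the pieces together, here is a minimal sketch of the DBAL loop with its termination check, continuing the abstractions of the previous sketch (`sample_pairs`, `delta_hat`, and the `select` routine above); the pool, the oracle, and the sample sizes are illustrative assumptions.

```python
import numpy as np

def dbal(sample_pairs, delta_hat, pool, oracle, eps, n_check=1000, m=200):
    """Sketch of DBAL: query until the estimated average diameter is small.

    sample_pairs(k) -> k hypothesis pairs drawn i.i.d. from pi restricted to the
                       current version space; a real sampler would be conditioned
                       on the labeled data collected so far
    delta_hat       -> the Delta estimator, as in the SELECT sketch
    oracle(x)       -> the true label h*(x)
    """
    labeled = []   # queried points and their labels; these define V
    while True:
        # Termination check: estimate Phi(V) by Delta(E)/n and stop once it
        # drops below 3*eps/4, as in Algorithm 1.
        E = sample_pairs(n_check)
        if delta_hat(E) / n_check < 0.75 * eps:
            return labeled
        # Draw m candidate points and let SELECT choose which one to query.
        idx = np.random.choice(len(pool), size=m, replace=False)
        x = select(sample_pairs, delta_hat, [pool[i] for i in idx])
        if x is None:
            x = pool[int(idx[0])]     # fall back to an arbitrary candidate
        labeled.append((x, oracle(x)))  # the sampler should now condition on this
```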
The following lemma, which is proven in the appendix, demonstrates that the halting condition (that is, $\Delta(E) < 3\epsilon n/4$, where $E$ consists of $n$ pairs sampled from $(\pi|_V)^2$) guarantees that with high probability DBAL stops when $\Phi(V_t)$ is small.

Lemma 7. The following holds for DBAL:

(a) Suppose that $\Phi(V_t) > \epsilon$ for all $t = 1, 2, \ldots, K$. Then the probability that the termination condition is ever true in any of those rounds is bounded above by $K \exp(-c\, n\, \epsilon)$ for a universal constant $c > 0$.

(b) Suppose that $\Phi(V_t) \le \epsilon/2$ for some $t = 1, 2, \ldots, K$. Then the probability that the termination condition is not true in that round is bounded above by $K \exp(-c\, n\, \epsilon)$.

Given the guarantees on the SELECT procedure in Theorem 6 and on the termination condition provided by Lemma 7, we get the following theorem.

Theorem 8. Suppose that $H$ has average splitting index $(\rho, \epsilon/2, \tau)$. Then DBAL returns a version space $V$ satisfying $\Phi(V) \le \epsilon$ with probability at least $1 - \delta$ while using the following resources:

(a) $K = O\!\left(\frac{1}{\rho}\left(\log\frac{1}{\epsilon} + 2\log\frac{1}{\pi(h^*)}\right)\right)$ rounds, with one label per round,

(b) $m = \frac{1}{\tau}\log\frac{2K}{\delta}$ unlabeled data points sampled per round, and

(c) a number of hypotheses sampled per round that is polynomial in $1/\rho$ and $1/\epsilon$ and logarithmic in $m$, $K$, and $1/\delta$, combining the counts from Theorem 6 and Lemma 7.

Proof. From the definition of the average splitting index, if we draw $m = \frac{1}{\tau}\log\frac{2K}{\delta}$ unlabeled points per round, then with probability $1 - \delta/2$, each of the first $K$ rounds will have at least one data point that $\rho$-average splits the current version space. In each such round, if the version space has average diameter at least $\epsilon/2$, then with probability $1 - \delta/4$, SELECT will return a data point that $\rho/8$-average splits the current version space while sampling no more hypotheses per round than the bound given by Theorem 6.

By Lemma 7, if the termination check uses $n_0 = O\!\left(\frac{1}{\epsilon}\log\frac{K}{\delta}\right)$ hypotheses per round, then with probability $1 - \delta/4$, in the first $K$ rounds the termination condition will never be true when the current version space has average diameter greater than $\epsilon$ and will certainly be true if the current version space has average diameter less than $\epsilon/2$.

Thus it suffices to bound the number of rounds in which we can $\rho/8$-average split the version space before encountering a version space with average diameter less than $\epsilon/2$. Since the version space is always consistent with the true hypothesis $h^*$, we will always have $\pi(V_t) \ge \pi(h^*)$. After $K$ rounds of $\rho/8$-average splitting, we have

\[ \pi(h^*)^2\, \Phi(V_K) \le \pi(V_K)^2\, \Phi(V_K) \le (1 - \rho/8)^K, \]

which is at most $\frac{\epsilon}{2}\pi(h^*)^2$ once $K \ge \frac{8}{\rho}\left(\log\frac{2}{\epsilon} + 2\log\frac{1}{\pi(h^*)}\right)$, so that $\Phi(V_K) \le \epsilon/2$. Thus in the first $K$ rounds, we must terminate with a version space with average diameter less than $\epsilon$.

5. Proof of Lemma 3

In this section, we give the proof of the following relationship between the original splitting index and our average splitting index.

Lemma 3. Let $\pi$ be a probability measure over a hypothesis class $H$. If $H$ has splitting index $(\rho, \epsilon, \tau)$, then it has average splitting index $\left(\frac{\rho}{4\lceil\log(1/\epsilon)\rceil}, 2\epsilon, \tau\right)$.

The first step in proving Lemma 3 is to relate the splitting index to our estimator $\Delta(\cdot)$. Intuitively, splittability says that for any set of large edges there are many data points which remove a significant fraction of them. One may suspect that this should imply that if a set of edges is large on average, then there should be many data points which remove a significant fraction of their weight. The following lemma confirms this suspicion.

Lemma 9. Suppose that $V \subset H$ has splitting index $(\rho, \epsilon, \tau)$, and say $E = (\{h_1, h'_1\}, \ldots, \{h_n, h'_n\})$ is a sequence of hypothesis pairs from $V$ satisfying $\frac{1}{n}\Delta(E) > 2\epsilon$. Then if $x \sim D$, we have with probability at least $\tau$,

\[ \max\{\Delta(E_x^+), \Delta(E_x^-)\} \le \left(1 - \frac{\rho}{4\lceil\log(1/\epsilon)\rceil}\right)\Delta(E). \]

[Figure 1. Simulation results on homogeneous linear separators. Left: d = 10. Middle: d = 25. Right: d = 50. Each panel plots average diameter against the number of queries for the strategies CAL, DBAL, QBC, and Random.]

Proof. Consider partitioning $E$ as

\[ E_0 = \{\{h, h'\} \in E : d(h, h') < \epsilon\} \]

and

\[ E_k = \{\{h, h'\} \in E : d(h, h') \in [2^{k-1}\epsilon,\, 2^k \epsilon)\} \]

for $k = 1, \ldots, K$ with $K = \lceil \log \tfrac{1}{\epsilon} \rceil$.
Then $E_0, \ldots, E_K$ are all disjoint and their union is $E$. Define $E_{1:K} = \bigcup_{k=1}^K E_k$.

We first claim that $\Delta(E_{1:K}) > \Delta(E_0)$. This follows from the observation that because $\Delta(E) > 2\epsilon n$ and each edge in $E_0$ has length less than $\epsilon$, we must have

\[ \Delta(E_{1:K}) = \Delta(E) - \Delta(E_0) > 2\epsilon n - \epsilon n = \epsilon n > \Delta(E_0). \]

Next, observe that because each edge $\{h, h'\} \in E_k$ with $k \ge 1$ satisfies $d(h, h') \in [2^{k-1}\epsilon, 2^k\epsilon)$, we have

\[ \Delta(E_{1:K}) = \sum_{k=1}^K \Delta(E_k) \le \sum_{k=1}^K 2^k \epsilon\, |E_k|. \]

Since there are only $K$ summands on the right, at least one of these must be larger than $\Delta(E_{1:K})/K$. Let $k$ denote that index and let $x$ be a point which $\rho$-splits $E_k$. Then at least $\rho|E_k|$ edges of $E_k$, each of length at least $2^{k-1}\epsilon$, fail to survive a positive answer at $x$, so

\[ \Delta\bigl((E_{1:K})_x^+\bigr) \le \Delta(E_{1:K}) - \Delta\bigl(E_k \setminus (E_k)_x^+\bigr) \le \Delta(E_{1:K}) - \rho\, 2^{k-1}\epsilon\, |E_k| \le \left(1 - \frac{\rho}{2K}\right)\Delta(E_{1:K}). \]

Since $\Delta(E_{1:K}) > \Delta(E_0)$, and hence $\Delta(E_{1:K}) > \Delta(E)/2$, we have

\[ \Delta(E_x^+) \le \Delta(E_0) + \Delta\bigl((E_{1:K})_x^+\bigr) \le \Delta(E) - \frac{\rho}{2K}\Delta(E_{1:K}) \le \left(1 - \frac{\rho}{4\lceil\log(1/\epsilon)\rceil}\right)\Delta(E). \]

Symmetric arguments show the same holds for $E_x^-$. Finally, by the definition of splitting, the probability of drawing a point $x$ which $\rho$-splits $E_k$ is at least $\tau$, giving us the lemma.

With Lemma 9 in hand, we are now ready to prove Lemma 3.

Proof of Lemma 3. Let $V \subset H$ be such that $\Phi(V) > 2\epsilon$. Suppose that we draw $n$ edges $E$ i.i.d. from $\pi|_V$ and draw a data point $x \sim D$. Then Hoeffding's inequality (Hoeffding, 1963), combined with Lemma 4, tells us that there exist sequences $\epsilon_n, \delta_n \searrow 0$ such that with probability at least $1 - 3\delta_n$, the following hold simultaneously:

\[ \left|\frac{1}{n}\Delta(E) - \Phi(V)\right| \le \epsilon_n, \qquad \frac{1}{n}\Delta(E_x^+) \ge \frac{\pi(V_x^+)^2}{\pi(V)^2}\Phi(V_x^+) - \epsilon_n, \qquad \frac{1}{n}\Delta(E_x^-) \ge \frac{\pi(V_x^-)^2}{\pi(V)^2}\Phi(V_x^-) - \epsilon_n. \]

For $\epsilon_n$ small enough, we have that $\Phi(V) - \epsilon_n > 2\epsilon$, so Lemma 9 applies to $E$. Combining the above with Lemma 9, we have with probability at least $\tau - 3\delta_n$,

\[ \frac{\pi(V_x^+)^2}{\pi(V)^2}\Phi(V_x^+) \le \frac{1}{n}\Delta(E_x^+) + \epsilon_n \le \left(1 - \frac{\rho}{4\lceil\log(1/\epsilon)\rceil}\right)\frac{1}{n}\Delta(E) + \epsilon_n \le \left(1 - \frac{\rho}{4\lceil\log(1/\epsilon)\rceil}\right)(\Phi(V) + \epsilon_n) + \epsilon_n, \]

and similarly for $V_x^-$. By taking $n \to \infty$, we have $\epsilon_n, \delta_n \searrow 0$, giving us the lemma.

6. Simulations

We compared DBAL against the baseline passive learner as well as two other generic active learning strategies: CAL and QBC. CAL proceeds by randomly sampling a data point and querying it if its label cannot be inferred from previously queried data points. QBC uses a prior distribution $\pi$ and maintains a version space $V$. Given a randomly sampled data point $x$, QBC samples two hypotheses $h, h' \sim \pi|_V$ and queries $x$ if $h(x) \neq h'(x)$.

[Figure 2. Simulation results on k-sparse monotone disjunctions. In all cases k = 4. Top left: d = 75, p = 0.25. Top right: d = 75, p = 0.5. Bottom left: d = 100, p = 0.25. Bottom right: d = 100, p = 0.5. Each panel plots average diameter against the number of queries for CAL, DBAL, QBC, and Random.]

We tested on two hypothesis classes: homogeneous, or through-the-origin, linear separators and $k$-sparse monotone disjunctions. In each of our simulations, we drew our target $h^*$ from the prior distribution. After each query, we estimated the average diameter of the version space. We repeated each simulation several times and plotted the average performance of each algorithm.

Homogeneous linear separators. The class of $d$-dimensional homogeneous linear separators can be identified with elements of the $d$-dimensional unit sphere. That is, a hypothesis $h \in S^{d-1}$ acts on a data point $x \in \mathbb{R}^d$ via the sign of their inner product: $h(x) := \mathrm{sign}(\langle h, x \rangle)$.

In our simulations, both the prior distribution and the data distribution are uniform over the unit sphere. Although there is no known method to exactly sample uniformly from the version space, Gilad-Bachrach et al. (2005) demonstrated that using samples generated by the hit-and-run Markov chain works well in practice. We adopted this approach for our sampling tasks. Figure 1 shows the results of our simulations on homogeneous linear separators.
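As an illustration of this setup, here is a minimal sketch that draws homogeneous linear separators consistent with the labeled data; for brevity it uses simple rejection sampling from the uniform prior rather than the hit-and-run chain used in the paper, and all names are illustrative.

```python
import numpy as np

def sample_consistent_separators(X_labeled, y_labeled, d, k, rng, max_tries=100000):
    """Rejection-sample k unit vectors h with sign(<h, x_i>) = y_i for all i.

    Stands in for the hit-and-run sampler of Gilad-Bachrach et al. (2005);
    rejection sampling is only practical while the version space is still large.
    """
    samples = []
    for _ in range(max_tries):
        h = rng.standard_normal(d)
        h /= np.linalg.norm(h)            # uniform draw from the unit sphere
        if len(X_labeled) == 0 or np.all(np.sign(X_labeled @ h) == y_labeled):
            samples.append(h)
            if len(samples) == k:
                break
    return np.array(samples)

def separator_distance(h1, h2, X_pool):
    # d(h1, h2): fraction of pool points on which the two separators disagree.
    return np.mean(np.sign(X_pool @ h1) != np.sign(X_pool @ h2))
```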
Sparse monotone disjunctions. A $k$-sparse monotone disjunction is a disjunction of $k$ positive literals. Given a Boolean vector $x \in \{0, 1\}^n$, a monotone disjunction $h$ classifies $x$ as positive if and only if $x_i = 1$ for some positive literal $i$ in $h$.

In our simulations, each data point is a vector whose coordinates are i.i.d. Bernoulli random variables with parameter $p$. The prior distribution is uniform over all $k$-sparse monotone disjunctions. When $k$ is constant, it is possible to sample from the prior restricted to the version space in expected polynomial time using rejection sampling. The results of our simulations on $k$-sparse monotone disjunctions are in Figure 2.
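A minimal sketch of this rejection sampler, assuming hypotheses are represented as sets of coordinate indices and data points as 0/1 NumPy arrays; the function and parameter names are illustrative.

```python
import numpy as np

def sample_disjunctions(X_labeled, y_labeled, d, k, n_samples, rng, max_tries=10**6):
    """Rejection-sample k-sparse monotone disjunctions consistent with the data.

    A hypothesis is a set S of k coordinates; it labels x as +1 iff x_i = 1
    for some i in S.  The prior is uniform over all such sets.
    """
    def classify(S, x):
        return 1 if x[list(S)].any() else -1

    samples, tries = [], 0
    while len(samples) < n_samples and tries < max_tries:
        tries += 1
        S = tuple(rng.choice(d, size=k, replace=False))   # uniform prior draw
        if all(classify(S, x) == y for x, y in zip(X_labeled, y_labeled)):
            samples.append(S)                             # consistent: keep it
    return samples
```

When $k$ is a small constant, the acceptance probability stays polynomially bounded, which is what makes this simple sampler practical in the disjunction experiments.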
Acknowledgments

The authors are grateful to the reviewers for their feedback and to the NSF for support under grants IIS-1162581 and DGE-1144086. Part of this work was done at the Simons Institute for Theoretical Computer Science, Berkeley, as part of a program on the foundations of machine learning. CT additionally thanks Daniel Hsu and Stefanos Poulis for helpful discussions.

References

Angluin, Dana and Valiant, Leslie G. Fast probabilistic algorithms for Hamiltonian circuits and matchings. In Proceedings of the Ninth Annual ACM Symposium on Theory of Computing, pp. 30-41. ACM, 1977.

Balcan, Maria-Florina and Long, Phil. Active and passive learning of linear separators under log-concave distributions. In Proceedings of the 26th Conference on Learning Theory, pp. 288-316, 2013.

Balcan, Maria-Florina, Broder, Andrei, and Zhang, Tong. Margin based active learning. In International Conference on Computational Learning Theory, pp. 35-50. Springer, 2007.

Balcan, Maria-Florina, Beygelzimer, Alina, and Langford, John. Agnostic active learning. Journal of Computer and System Sciences, 75(1):78-89, 2009.

Beygelzimer, Alina, Dasgupta, Sanjoy, and Langford, John. Importance weighted active learning. In Proceedings of the 26th Annual International Conference on Machine Learning, pp. 49-56, 2009.

Castro, Rui M. and Nowak, Robert D. Minimax bounds for active learning. IEEE Transactions on Information Theory, 54(5):2339-2353, 2008.

Cesa-Bianchi, Nicolo, Gentile, Claudio, and Zaniboni, Luca. Worst-case analysis of selective sampling for linear classification. Journal of Machine Learning Research, 7:1205-1230, 2006.

Cesa-Bianchi, Nicolo, Gentile, Claudio, and Vitale, Fabio. Learning unknown graphs. In International Conference on Algorithmic Learning Theory, pp. 110-125. Springer, 2009.

Cohn, David, Atlas, Les, and Ladner, Richard. Improving generalization with active learning. Machine Learning, 15(2):201-221, 1994.

Dasarathy, Gautam, Nowak, Robert, and Zhu, Xiaojin. S2: An efficient graph based active learning algorithm with application to nonparametric classification. In Proceedings of the 28th Conference on Learning Theory, pp. 503-522, 2015.

Dasgupta, Sanjoy. Analysis of a greedy active learning strategy. In Advances in Neural Information Processing Systems, pp. 337-344, 2004.

Dasgupta, Sanjoy. Coarse sample complexity bounds for active learning. In Advances in Neural Information Processing Systems, pp. 235-242, 2005.

Dasgupta, Sanjoy and Hsu, Daniel. Hierarchical sampling for active learning. In Proceedings of the 25th International Conference on Machine Learning, pp. 208-215. ACM, 2008.

Dasgupta, Sanjoy, Monteleoni, Claire, and Hsu, Daniel J. A general agnostic active learning algorithm. In Advances in Neural Information Processing Systems, pp. 353-360, 2007.

Dasgupta, Sanjoy, Kalai, Adam Tauman, and Monteleoni, Claire. Analysis of perceptron-based active learning. Journal of Machine Learning Research, 10(Feb):281-299, 2009.

Freund, Yoav, Seung, H. Sebastian, Shamir, Eli, and Tishby, Naftali. Selective sampling using the query by committee algorithm. Machine Learning, 28(2-3):133-168, 1997.

Gilad-Bachrach, Ran, Navot, Amir, and Tishby, Naftali. Query by committee made real. In Proceedings of the 18th International Conference on Neural Information Processing Systems, pp. 443-450. MIT Press, 2005.

Golovin, Daniel, Krause, Andreas, and Ray, Debajyoti. Near-optimal Bayesian active learning with noisy observations. In Advances in Neural Information Processing Systems, pp. 766-774, 2010.

Guillory, Andrew and Bilmes, Jeff. Average-case active learning with costs. In International Conference on Algorithmic Learning Theory, pp. 141-155. Springer, 2009.

Hanneke, Steve. A bound on the label complexity of agnostic active learning. In Proceedings of the 24th International Conference on Machine Learning, pp. 353-360. ACM, 2007.

Hoeffding, Wassily. Probability inequalities for sums of bounded random variables. Journal of the American Statistical Association, 58(301):13-30, 1963.

Kpotufe, Samory, Urner, Ruth, and Ben-David, Shai. Hierarchical label queries with data-dependent partitions. In Proceedings of the 28th Conference on Learning Theory, pp. 1176-1189, 2015.

Zhang, Chicheng and Chaudhuri, Kamalika. Beyond disagreement-based agnostic active learning. In Advances in Neural Information Processing Systems, pp. 442-450, 2014.

Zhu, Xiaojin, Ghahramani, Zoubin, and Lafferty, John. Semi-supervised learning using Gaussian fields and harmonic functions. In Proceedings of the 20th International Conference on Machine Learning, 2003.