# Search Improves Label for Active Learning

Alina Beygelzimer (Yahoo Research, New York, NY; beygel@yahoo-inc.com) · Daniel Hsu (Columbia University, New York, NY; djhsu@cs.columbia.edu) · John Langford (Microsoft Research, New York, NY; jcl@microsoft.com) · Chicheng Zhang (UC San Diego, La Jolla, CA; chz038@cs.ucsd.edu)

30th Conference on Neural Information Processing Systems (NIPS 2016), Barcelona, Spain.

Abstract. We investigate active learning with access to two distinct oracles: LABEL (which is standard) and SEARCH (which is not). The SEARCH oracle models the situation where a human searches a database to seed or counterexample an existing solution. SEARCH is stronger than LABEL while being natural to implement in many situations. We show that an algorithm using both oracles can provide exponentially large problem-dependent improvements over LABEL alone.

1 Introduction

Most active learning theory is based on interacting with a LABEL oracle: an active learner observes unlabeled examples, each with a label that is initially hidden. The learner provides an unlabeled example to the oracle, and the oracle responds with the label. Using LABEL in an active learning algorithm is known to give (sometimes exponentially large) problem-dependent improvements in label complexity, even in the agnostic setting where no assumption is made about the underlying distribution [e.g., Balcan et al., 2006, Hanneke, 2007, Dasgupta et al., 2007, Hanneke, 2014].

A well-known deficiency of LABEL arises in the presence of rare classes in classification problems, frequently the case in practice [Attenberg and Provost, 2010, Simard et al., 2014]. Class imbalance may be so extreme that simply finding an example from the rare class can exhaust the labeling budget. Consider the problem of learning interval functions in [0, 1]. Any LABEL-only active learner needs at least Ω(1/ε) LABEL queries to learn an arbitrary target interval with error at most ε [Dasgupta, 2005]. Given any positive example from the interval, however, the query complexity of learning intervals collapses to O(log(1/ε)), as we can just do a binary search for each of the endpoints.

A natural approach used to overcome this hurdle in practice is to search for known examples of the rare class [Attenberg and Provost, 2010, Simard et al., 2014]. Domain experts are often adept at finding examples of a class by various, often clever means. For instance, when building a hate speech filter, a simple web search can readily produce a set of positive examples. Sending a random batch of unlabeled text to LABEL is unlikely to produce any positive examples at all.

Another form of interaction common in practice is providing counterexamples to a learned predictor. When monitoring the stream filtered by the current hate speech filter, a human editor may spot a clear-cut example of hate speech that seeped through the filter. The editor, using all the search tools available to her, may even be tasked with searching for such counterexamples. The goal of the learning system is then to interactively restrict the searchable space, guiding the search process to where it is most effective.

Counterexamples can be ineffective or misleading in practice as well. Reconsidering the intervals example above, a counterexample on the boundary of an incorrect interval provides no useful information about any other examples. What is a good counterexample? What is a natural way to restrict the searchable space? How can the intervals problem be generalized?
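Before formalizing answers to these questions, it may help to ground the earlier claim about intervals. The following is a minimal sketch (our illustration, not code from the paper): given one positive seed inside the target interval, each endpoint is located by binary search with O(log(1/ε)) noiseless LABEL queries, whereas without a seed, Ω(1/ε) queries can be spent merely finding a single positive point.

```python
def learn_interval(label, seed, eps):
    """Estimate the endpoints of a target interval [a, b] to within eps,
    given a noiseless LABEL oracle and one positive seed a <= seed <= b.
    Uses O(log(1/eps)) LABEL queries in total."""
    lo, hi = 0.0, seed          # left endpoint a lies in (lo, hi]
    while hi - lo > eps:
        mid = (lo + hi) / 2
        if label(mid) == +1:    # mid is inside the interval, so a <= mid
            hi = mid
        else:                   # mid is left of the interval, so a > mid
            lo = mid
    a_hat = hi

    lo, hi = seed, 1.0          # right endpoint b lies in [lo, hi]
    while hi - lo > eps:
        mid = (lo + hi) / 2
        if label(mid) == +1:    # mid is inside the interval, so b >= mid
            lo = mid
        else:                   # mid is right of the interval, so b < mid
            hi = mid
    b_hat = lo
    return a_hat, b_hat

# Example: target interval [0.30, 0.42] with seed 0.35.
print(learn_interval(lambda x: +1 if 0.30 <= x <= 0.42 else -1, 0.35, 1e-4))
```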
We define a new oracle, SEARCH, that provides counterexamples to version spaces. Given a set of possible classifiers H mapping unlabeled examples to labels, a version space V ⊆ H is the subset of classifiers still under consideration by the algorithm. A counterexample to a version space is a labeled example which every classifier in the version space classifies incorrectly. When there is no counterexample to the version space, SEARCH returns nothing.

How can a counterexample to the version space be used? We consider a nested sequence of hypothesis classes of increasing complexity, akin to Structural Risk Minimization (SRM) in passive learning [see, e.g., Vapnik, 1982, Devroye et al., 1996]. When SEARCH produces a counterexample to the version space, it gives a proof that the current hypothesis class is too simplistic to solve the problem effectively. We show that this guided increase in hypothesis complexity results in a radically lower LABEL complexity than directly learning on the complex space. Sample complexity bounds for model selection in LABEL-only active learning were studied by Balcan et al. [2010] and Hanneke [2011].

SEARCH can easily model the practice of seeding discussed earlier. If the first hypothesis class has just the constant always-negative classifier h(x) = -1, a seed example with label +1 is a counterexample to the version space. Our most basic algorithm uses SEARCH just once before using LABEL, but it is clear from inspection that multiple seeds are not harmful, and they may be helpful if they provide the proof required to operate with an appropriately complex hypothesis class.

Defining SEARCH with respect to a version space rather than a single classifier allows us to formalize "counterexample far from the boundary" in a general fashion which is compatible with the way LABEL-based active learning algorithms work.

Related work. The closest oracle considered in the literature is the Class Conditional Query (CCQ) oracle [Balcan and Hanneke, 2012]. A query to CCQ specifies a finite set of unlabeled examples and a label, and the oracle returns an example in the set with the specified label, if one exists. In contrast, SEARCH has an implicit query set that is an entire region of the input space rather than a finite set. Simple searches over this large implicit domain can more plausibly discover relevant counterexamples: when building a detector for penguins in images, the input to CCQ might be a set of images and the label "penguin". Even if we are very lucky and the set happens to contain a penguin image, a search amongst image tags may fail to find it in the subset because it is not tagged appropriately. SEARCH is more likely to discover counterexamples: surely there are many images correctly tagged as having penguins.

Why is it natural to define a query region implicitly via a version space? There is a practical reason: it is a concise description of a natural region with an efficiently implementable membership filter [Beygelzimer et al., 2010, 2011, Huang et al., 2015]. (Compare this to an oracle call that has to explicitly enumerate a large set of examples. The algorithm of Balcan and Hanneke [2012] uses samples of size roughly dν/ε².)

The use of SEARCH in this paper is also substantially different from the use of CCQ by Balcan and Hanneke [2012]. Our motivation is to use SEARCH to assist LABEL, as opposed to using SEARCH alone.
This is especially useful in any setting where the cost of SEARCH is significantly higher than the cost of LABEL: we hope to avoid using SEARCH queries whenever it is possible to make progress using LABEL queries. This is consistent with how interactive learning systems are used in practice. For example, the Interactive Classification and Extraction system of Simard et al. [2014] combines LABEL with search in a production environment.

The final important distinction is that we require SEARCH to return the label of the optimal predictor in the nested sequence. For many natural sequences of hypothesis classes, the Bayes optimal classifier is eventually in the sequence, in which case this is equivalent to assuming that the label in a counterexample is the most probable one, as opposed to a randomly drawn label from the conditional distribution (as in CCQ and LABEL). Is this a reasonable assumption? Unlike with LABEL queries, where the labeler has no choice of what to label, here the labeler chooses a counterexample. If a human editor finds an unquestionable example of hate speech that seeped through the filter, it is quite reasonable to assume that this counterexample is consistent with the Bayes optimal predictor for any sensible feature representation.

Organization. Section 2 formally introduces the setting. Section 3 shows that SEARCH is at least as powerful as LABEL. Section 4 shows how to use SEARCH and LABEL jointly in the realizable setting where a zero-error classifier exists in the nested sequence of hypothesis classes. Section 5 handles the agnostic setting where LABEL is subject to label noise, and shows an amortized approach to combining the two oracles with a good guarantee on the total cost.

2 Definitions and Setting

In active learning, there is an underlying distribution D over X × Y, where X is the instance space and Y := {-1, +1} is the label space. The learner can obtain independent draws from D, but the label is hidden unless explicitly requested through a query to the LABEL oracle. Let D_X denote the marginal of D over X.

We consider learning with a nested sequence of hypothesis classes H_0 ⊆ H_1 ⊆ ··· ⊆ H_k ⊆ ···, where H_k ⊆ Y^X has VC dimension d_k. For a set of labeled examples S ⊆ X × Y, let H_k(S) := {h ∈ H_k : ∀(x, y) ∈ S, h(x) = y} be the set of hypotheses in H_k consistent with S. Let err(h) := Pr_{(x,y)∼D}[h(x) ≠ y] denote the error rate of a hypothesis h with respect to distribution D, and err(h, S) be the error rate of h on the labeled examples in S. Let h*_k := arg min_{h ∈ H_k} err(h), breaking ties arbitrarily, and let k* := arg min_{k ≥ 0} err(h*_k), breaking ties in favor of the smallest such k. For simplicity, we assume the minimum is attained at some finite k*. Finally, define h* := h*_{k*}, the optimal hypothesis in the sequence of classes. The goal of the learner is to learn a hypothesis with error rate not much more than that of h*.

In addition to LABEL, the learner can also query SEARCH with a version space.

Oracle SEARCH_H(V) (where H ∈ {H_k}_{k≥0}):
input: a set of hypotheses V ⊆ H.
output: a labeled example (x, h*(x)) such that h(x) ≠ h*(x) for all h ∈ V, or ⊥ if there is no such example.

Thus if SEARCH_H(V) returns an example, this example is a systematic mistake made by all hypotheses in V. (If V = ∅, we expect SEARCH to return some example, i.e., not ⊥.)
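A minimal pool-based rendering of the SEARCH oracle may make this definition concrete. This is purely our illustration: the paper treats SEARCH abstractly, and the finite pool, the finite version space, and direct access to h* are assumptions of the sketch.

```python
def search(version_space, pool, h_star):
    """A pool-based SEARCH oracle (sketch): return a labeled example
    (x, h_star(x)) that *every* h in the version space gets wrong, i.e.,
    a systematic mistake; return None (standing in for ⊥) if none exists."""
    for x in pool:
        y = h_star(x)
        if all(h(x) != y for h in version_space):
            return (x, y)
    return None

# Example with thresholds h_w(x) = +1 iff x >= w (as in Section 3). The
# version space only contains thresholds w in {0.6, 0.7, 0.8, 0.9}, while
# the target threshold is 0.3, so every pool point in [0.3, 0.6) is a
# systematic mistake with label +1.
V = [lambda x, w=w: +1 if x >= w else -1 for w in (0.6, 0.7, 0.8, 0.9)]
h_star = lambda x: +1 if x >= 0.3 else -1
pool = [i / 100 for i in range(101)]
print(search(V, pool, h_star))  # -> (0.3, 1)
```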
Our analysis is given in terms of the disagreement coefficient of Hanneke [2007], which has been a central parameter for analyzing active learning algorithms. Define the region of disagreement of a set of hypotheses V as Dis(V) := {x ∈ X : ∃ h, h′ ∈ V s.t. h(x) ≠ h′(x)}. The disagreement coefficient of V at scale r is θ_V(r) := sup_{h ∈ V, r′ ≥ r} Pr_{x∼D_X}[x ∈ Dis(B_V(h, r′))]/r′, where B_V(h, r′) := {h′ ∈ V : Pr_{x∼D_X}[h′(x) ≠ h(x)] ≤ r′} is the ball of radius r′ around h.

The Õ(·) notation hides factors that are polylogarithmic in 1/δ and in the quantities that do appear, where δ is the usual confidence parameter.

3 The Relative Power of the Two Oracles

Although SEARCH cannot always implement LABEL efficiently, it is as effective at reducing the region of disagreement. The clearest example is learning threshold classifiers H := {h_w : w ∈ [0, 1]} in the realizable case, where h_w(x) = +1 if w ≤ x ≤ 1, and -1 if 0 ≤ x < w. A simple binary search with LABEL achieves an exponential improvement in query complexity over passive learning. The agreement region of any set of threshold classifiers with thresholds in [w_min, w_max] is [0, w_min) ∪ [w_max, 1]. Since SEARCH is allowed to return any counterexample in the agreement region, there is no mechanism for forcing SEARCH to return the label of a particular point we want. However, this is not needed to achieve logarithmic query complexity with SEARCH: if binary search starts with querying the label of x ∈ [0, 1], we can query SEARCH_H(V_x), where V_x := {h_w ∈ H : w < x}, instead. If SEARCH returns ⊥, we know that the target w* ≤ x and can safely reduce the region of disagreement to [0, x). If SEARCH returns a counterexample (x_0, -1) with x_0 ≥ x, we know that w* > x_0 and can reduce the region of disagreement to (x_0, 1].

This observation holds more generally. In the proposition below, we assume that LABEL(x) = h*(x) for simplicity. If LABEL(x) is noisy, the proposition holds for any active learning algorithm that doesn't eliminate any h ∈ H with h(x) = LABEL(x) from the version space.

Proposition 1. For any call x ∈ X to LABEL such that LABEL(x) = h*(x), we can construct a call to SEARCH that achieves no lesser reduction in the region of disagreement.

Proof. For any V ⊆ H, let H_SEARCH(V) be the hypotheses in H consistent with the output of SEARCH_H(V): if SEARCH_H(V) returns a counterexample (x, y) to V, then H_SEARCH(V) := {h ∈ H : h(x) = y}; otherwise, H_SEARCH(V) := V. Let H_LABEL(x) := {h ∈ H : h(x) = LABEL(x)}. Also, let V_x := H_{+1}(x) := {h ∈ H : h(x) = +1}. We will show that V_x is such that H_SEARCH(V_x) ⊆ H_LABEL(x), and hence Dis(H_SEARCH(V_x)) ⊆ Dis(H_LABEL(x)). There are two cases to consider. If h*(x) = +1, then SEARCH_H(V_x) returns ⊥; in this case, H_LABEL(x) = H_SEARCH(V_x) = H_{+1}(x), and we are done. If h*(x) = -1, then SEARCH_H(V_x) returns a valid counterexample (possibly (x, -1)) in the region of agreement of H_{+1}(x), eliminating all of H_{+1}(x). Thus H_SEARCH(V_x) ⊆ H \ H_{+1}(x) = H_LABEL(x), and the claim holds in this case as well.

As shown by the problem of learning intervals on the line, SEARCH can be exponentially more powerful than LABEL.
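The following self-contained sketch instantiates the binary-search construction above for thresholds. The hidden target W_STAR and the analytically simulated oracle are our toy assumptions, not part of the paper; the point is that the search for w* proceeds with SEARCH queries only.

```python
W_STAR = 0.3141                  # hidden target threshold (demo assumption)

def search_thresholds(w_max):
    """SEARCH on V_x = {h_w : w < w_max}, simulated analytically.
    All h in V_x predict +1 on [w_max, 1], where h* predicts -1 on
    [w_max, w_star) whenever w_star > w_max; that is then a systematic
    mistake. Otherwise there is none and the oracle returns None (⊥)."""
    if W_STAR > w_max:
        return (w_max, -1)       # a counterexample in V_x's agreement region
    return None

def learn_threshold(eps):
    """Binary search for w_star using SEARCH instead of LABEL."""
    lo, hi = 0.0, 1.0            # invariant: w_star lies in (lo, hi]
    while hi - lo > eps:         # O(log(1/eps)) SEARCH queries overall
        x = (lo + hi) / 2
        result = search_thresholds(x)
        if result is None:       # ⊥: w_star <= x, shrink from the right
            hi = x
        else:                    # counterexample (x0, -1): w_star > x0 >= x
            lo = result[0]
    return hi

print(learn_threshold(1e-4))     # ~ 0.3141
```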
4 Realizable Case

We now turn to general active learning algorithms that combine SEARCH and LABEL. We focus on algorithms using both SEARCH and LABEL, since LABEL is typically easier to implement than SEARCH and hence should be used wherever SEARCH has no significant advantage. (Whenever SEARCH is less expensive than LABEL, Section 3 suggests a transformation to a SEARCH-only algorithm.) This section considers the realizable case, in which we assume that the hypothesis h* = h*_{k*} ∈ H_{k*} has err(h*) = 0. This means that LABEL(x) returns h*(x) for any x in the support of D_X.

4.1 Combining LABEL and SEARCH

Our algorithm (shown as Algorithm 1) is called LARCH, because it combines LABEL and SEARCH. Like many selective sampling methods, LARCH uses a version space to determine its LABEL queries. For concreteness, we use (a variant of) the algorithm of Cohn et al. [1994], denoted by CAL, as a subroutine in LARCH. The inputs to CAL are a version space V, the LABEL oracle, a target error rate, and a confidence parameter; its output is a set of labeled examples (implicitly defining a new version space). CAL is described in Appendix B; its essential properties are specified in Lemma 1.

LARCH differs from LABEL-only active learners (like CAL) by first calling SEARCH in Step 3. If SEARCH returns ⊥, LARCH checks whether the last call to CAL resulted in a small enough error, halting if so in Step 6 and decreasing the allowed error rate if not in Step 8. If SEARCH instead returns a counterexample, the hypothesis class H_k must be impoverished, so in Step 12 LARCH increases the complexity of the hypothesis class to the minimum complexity sufficient to correctly classify all known labeled examples in S. After the SEARCH, CAL is called in Step 14 to discover a sufficiently low-error (or at least low-disagreement) version space with high probability.

Algorithm 1 LARCH
input: nested hypothesis classes H_0 ⊆ H_1 ⊆ ···; oracles LABEL and SEARCH; learning parameters ε, δ ∈ (0, 1)
1: initialize S ← ∅, (index) k ← 0, ℓ ← 0
2: for i = 1, 2, . . . do
3:   e ← SEARCH_{H_k}(H_k(S))
4:   if e = ⊥ then # no counterexample found
5:     if 2^{-ℓ} ≤ ε then
6:       return any h ∈ H_k(S)
7:     else
8:       ℓ ← ℓ + 1
9:     end if
10:  else # counterexample found
11:    S ← S ∪ {e}
12:    k ← min{k′ : H_{k′}(S) ≠ ∅}
13:  end if
14:  S ← S ∪ CAL(H_k(S), LABEL, 2^{-ℓ}, δ/(i² + i))
15: end for

When LARCH advances to index k (for any k ≤ k*), its set of labeled examples S may imply a version space H_k(S) ⊆ H_k that can be actively learned more efficiently than the whole of H_k. In our analysis, we quantify this through the disagreement coefficient of H_k(S), which may be markedly smaller than that of the full H_k.

The following theorem bounds the oracle query complexity of Algorithm 1 for learning with both SEARCH and LABEL in the realizable setting. The proof is in Section 4.2.

Theorem 1. Assume that err(h*) = 0. For each k ≥ 0, let θ_k(·) be the disagreement coefficient of H_k(S[k]), where S[k] is the set of labeled examples S in LARCH at the first time that its index is at least k. Fix any ε, δ ∈ (0, 1). If LARCH is run with inputs hypothesis classes {H_k}_{k≥0}, oracles LABEL and SEARCH, and learning parameters ε, δ, then with probability at least 1 - δ: LARCH halts after at most k* + log₂(1/ε) for-loop iterations and returns a classifier with error rate at most ε; furthermore, it draws at most Õ(k* d_{k*}/ε) unlabeled examples from D_X, makes at most k* + log₂(1/ε) queries to SEARCH, and makes at most Õ((k* + log(1/ε)) · (max_{k≤k*} θ_k(ε)) · d_{k*} · log²(1/ε)) queries to LABEL.
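To make the control flow concrete, here is a direct transcription of Algorithm 1 into Python. Everything is a sketch under stated assumptions: the oracles and the CAL subroutine of Appendix B are abstract callables supplied by the caller, and `classes` is a finite prefix of the nested sequence assumed to contain some consistent class.

```python
import itertools

def larch(classes, search, label, cal, eps, delta):
    """Sketch of Algorithm 1 (LARCH). Assumed interfaces: classes[k](S)
    returns the version space H_k(S) as a set of hypotheses; search(k, V)
    returns a counterexample (x, y) or None; cal(V, label, target, conf)
    returns a set of labeled examples."""
    S, k, ell = set(), 0, 0                        # Step 1
    for i in itertools.count(1):                   # Step 2
        e = search(k, classes[k](S))               # Step 3
        if e is None:                              # no counterexample found
            if 2.0 ** (-ell) <= eps:
                return next(iter(classes[k](S)))   # Step 6: any consistent h
            ell += 1                               # Step 8
        else:                                      # counterexample found
            S.add(e)                               # Step 11
            k = min(j for j in range(len(classes))
                    if classes[j](S))              # Step 12: smallest consistent class
        S |= cal(classes[k](S), label,
                 2.0 ** (-ell), delta / (i * i + i))  # Step 14
```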
Union-of-intervals example. We now show an implication of Theorem 1 in the case where the target hypothesis h* is a union of non-trivial intervals in X := [0, 1], assuming that D_X is uniform. For k ≥ 0, let H_k be the hypothesis class of unions of up to k intervals in [0, 1], with H_0 containing only the always-negative hypothesis. (Thus, h* is the union of k* non-empty intervals.) The disagreement coefficient of H_1 is Ω(1/ε), and hence LABEL-only active learners like CAL are not very effective at learning with such classes. However, the first SEARCH query by LARCH provides a counterexample to H_0, which must be a positive example (x_1, +1). Hence, H_1(S[1]) (where S[1] is defined in Theorem 1) is the class of intervals that contain x_1, with disagreement coefficient θ_1 ≤ 4.

Now consider the inductive case. Just before LARCH advances its index to a value k (for any k ≤ k*), SEARCH returns a counterexample (x, h*(x)) to the version space; every hypothesis in this version space (which could be empty) is a union of fewer than k intervals. If the version space is empty, then S must already contain positive examples from at least k different intervals in h* and at least k - 1 negative examples separating them. If the version space is not empty, then the point x is either a positive example belonging to a previously uncovered interval in h*, or a negative example splitting an existing interval. In either case, S[k] contains positive examples from at least k distinct intervals separated by at least k - 1 negative examples. The disagreement coefficient of the set of unions of k intervals consistent with S[k] is at most 4k, independent of ε. The VC dimension of H_k is O(k), so Theorem 1 implies that with high probability, LARCH makes at most k* + log(1/ε) queries to SEARCH and Õ((k*)³ log(1/ε) + (k*)² log³(1/ε)) queries to LABEL.
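For this class, the minimization in Step 12 of LARCH has a simple closed form, which is our observation and is easy to verify: after sorting the labeled points, the smallest k with H_k(S) ≠ ∅ equals the number of maximal runs of consecutive positive points, since any two runs are separated by a negative point (so no single interval can cover both), while covering each run by its own tight interval is always consistent.

```python
def min_consistent_k(S):
    """Smallest k such that some union of k intervals in [0, 1] is
    consistent with the labeled points S = {(x, y)}, y in {-1, +1}."""
    runs, inside = 0, False
    for x, y in sorted(S):
        if y == +1 and not inside:
            runs += 1           # a new positive run begins: one more interval
        inside = (y == +1)
    return runs

# Two positive runs separated by a negative point: 2 intervals are needed.
print(min_consistent_k({(0.1, -1), (0.2, +1), (0.3, +1), (0.5, -1), (0.7, +1)}))
```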
4.2 Proof of Theorem 1

The proof of Theorem 1 uses the following lemma regarding the CAL subroutine, proved in Appendix B. It is similar to a result of Hanneke [2011], but an important difference here is that the input version space V is not assumed to contain h*.

Lemma 1. Assume LABEL(x) = h*(x) for every x in the support of D_X. For any hypothesis set V ⊆ Y^X with VC dimension d < ∞, and any ε, δ ∈ (0, 1), the following holds with probability at least 1 - δ: CAL(V, LABEL, ε, δ) returns labeled examples T ⊆ {(x, h*(x)) : x ∈ X} such that for any h in V(T), Pr_{(x,y)∼D}[h(x) ≠ y ∧ x ∈ Dis(V(T))] ≤ ε; furthermore, it draws at most Õ(d/ε) unlabeled examples from D_X, and makes at most Õ(θ_V(ε) · d · log²(1/ε)) queries to LABEL.

We now prove Theorem 1. By Lemma 1 and a union bound, there is an event with probability at least 1 - Σ_{i≥1} δ/(i² + i) ≥ 1 - δ such that every call to CAL made by LARCH satisfies the high-probability guarantee from Lemma 1. We henceforth condition on this event.

We first establish the guarantee on the error rate of a hypothesis returned by LARCH. By the assumed properties of LABEL and SEARCH, and the properties of CAL from Lemma 1, the labeled examples S in LARCH are always consistent with h*. Moreover, the return property of CAL implies that at the end of any loop iteration, with the present values of S, k, and ℓ, we have Pr_{(x,y)∼D}[h(x) ≠ y ∧ x ∈ Dis(H_k(S))] ≤ 2^{-ℓ} for all h ∈ H_k(S). (The same holds trivially before the first loop iteration.) Therefore, if LARCH halts and returns a hypothesis h ∈ H_k(S), then there is no counterexample to H_k(S), and Pr_{(x,y)∼D}[h(x) ≠ y ∧ x ∈ Dis(H_k(S))] ≤ ε. Outside of Dis(H_k(S)), all hypotheses in H_k(S) agree, and the absence of a counterexample means they agree with h*, which makes no errors in the realizable case. These consequences and the law of total probability imply err(h) = Pr_{(x,y)∼D}[h(x) ≠ y ∧ x ∈ Dis(H_k(S))] ≤ ε.

We next consider the number of for-loop iterations executed by LARCH. Let S_i, k_i, and ℓ_i be, respectively, the values of S, k, and ℓ at the start of the i-th for-loop iteration. We claim that if LARCH does not halt in the i-th iteration, then one of k and ℓ is incremented by at least one. Clearly, if there is no counterexample to H_{k_i}(S_i) and 2^{-ℓ_i} > ε, then ℓ is incremented by one (Step 8). If, instead, there is a counterexample (x, y), then H_{k_i}(S_i ∪ {(x, y)}) = ∅, and hence k is incremented to some index larger than k_i (Step 12). This proves that k_{i+1} + ℓ_{i+1} ≥ k_i + ℓ_i + 1. We also have k_i ≤ k*, since h* ∈ H_{k*} is consistent with S, and ℓ_i ≤ log₂(1/ε), as long as LARCH does not halt in for-loop iteration i. So the total number of for-loop iterations is at most k* + log₂(1/ε). Together with Lemma 1, this bounds the number of unlabeled examples drawn from D_X.

Finally, we bound the number of queries to SEARCH and LABEL. The number of queries to SEARCH is the same as the number of for-loop iterations, which is at most k* + log₂(1/ε). By Lemma 1 and the fact that V(S ∪ S′) ⊆ V(S′) for any hypothesis space V and sets of labeled examples S, S′, the number of LABEL queries made by CAL in the i-th for-loop iteration is at most Õ(θ_{k_i}(ε) · d_{k_i} · ℓ_i² · polylog(i)). The claimed bound on the number of LABEL queries made by LARCH now readily follows by taking a max over i, and using the facts that k_i ≤ k*, ℓ_i ≤ log₂(1/ε), and d_k ≤ d_{k′} for all k ≤ k′.

4.3 An Improved Algorithm

LARCH is somewhat conservative in its use of SEARCH, interleaving just one SEARCH query between sequences of LABEL queries (from CAL). It is often advantageous to advance to higher-complexity hypothesis classes quickly, as long as there is justification for doing so. Counterexamples from SEARCH provide such justification, and a ⊥ result from SEARCH also provides useful feedback about the current version space: outside of its disagreement region, the version space is in complete agreement with h* (even if the version space does not contain h*). Based on these observations, we propose an improved algorithm for the realizable setting, which we call SEABEL. Due to space limitations, we present it in Appendix C. We prove the following performance guarantee for SEABEL.

Theorem 2. Assume that err(h*) = 0. Let θ_k(·) denote the disagreement coefficient of V_i^{k_i} at the first iteration i in SEABEL where k_i ≥ k. Fix any ε, δ ∈ (0, 1). If SEABEL is run with inputs hypothesis classes {H_k}_{k≥0}, oracles SEARCH and LABEL, and learning parameters ε, δ, then with probability at least 1 - δ: SEABEL halts and returns a classifier with error rate at most ε; furthermore, it draws at most Õ((d_{k*} + log k*)/ε) unlabeled examples from D_X, makes at most k* + O(log(d_{k*}/ε) + log log k*) queries to SEARCH, and at most Õ(max_{k≤k*} θ_k(2ε) · (d_{k*} log²(1/ε) + log k*)) queries to LABEL.

It is not generally possible to directly compare Theorems 1 and 2 on account of the algorithm-dependent disagreement-coefficient bounds. However, in cases where these disagreement coefficients are comparable (as in the union-of-intervals example), the SEARCH complexity in Theorem 2 is slightly higher (by additive log terms), but the LABEL complexity is smaller than that from Theorem 1 by roughly a factor of k*. For the union-of-intervals example, SEABEL would learn a target union of k* intervals with k* + O(log(k*/ε)) queries to SEARCH and Õ((k*)² log²(1/ε)) queries to LABEL.

5 Non-Realizable Case

In this section, we consider the case where the optimal hypothesis h* may have a non-zero error rate, i.e., the non-realizable (or agnostic) setting. In this case, the algorithm LARCH, which was designed for the realizable setting, is no longer applicable. First, examples obtained by LABEL and SEARCH are of different quality: those returned by SEARCH always agree with h*, whereas the labels given by LABEL need not agree with h*. Moreover, the version spaces (even when k = k*) as defined by LARCH may always be empty due to the noisy labels.

Another complication arises in our SRM setting that differentiates it from the usual agnostic active learning setting. When working with a specific hypothesis class H_k in the nested sequence, we may observe high error rates because (i) the finite-sample error is too high (but additional labeled examples could reduce it), or (ii) the current hypothesis class H_k is impoverished.
In case (ii), the best hypothesis in H_k may have a much larger error rate than h*, and hence lower bounds [Kääriäinen, 2006] imply that active learning on H_k instead of H_{k*} may be substantially more difficult. These difficulties in the SRM setting are circumvented by an algorithm that adaptively estimates the error of h*. The algorithm, A-LARCH (Algorithm 5), is presented in Appendix D.

Theorem 3. Assume err(h*) = ν. Let θ_k(·) denote the disagreement coefficient of V_i^{k_i} at the first iteration i in A-LARCH where k_i ≥ k. Fix any ε, δ ∈ (0, 1). If A-LARCH is run with inputs hypothesis classes {H_k}_{k≥0}, oracles SEARCH and LABEL, learning parameter δ, and unlabeled example budget Õ((d_{k*} + log k*)(ν + ε)/ε²), then with probability at least 1 - δ: A-LARCH returns a classifier with error rate at most ν + ε; it makes at most k* + O(log(d_{k*}/ε) + log log k*) queries to SEARCH, and Õ(max_{k≤k*} θ_k(2ν + 2ε) · (d_{k*} log²(1/ε) + log k*) · (1 + ν²/ε²)) queries to LABEL.

The proof is in Appendix D. The LABEL query complexity is at least a factor of k* better than that in Hanneke [2011], and sometimes exponentially better, thanks to the reduced disagreement coefficient of the version space when consistency constraints are incorporated.

5.1 AA-LARCH: An Opportunistic Anytime Algorithm

In many practical scenarios, termination conditions based on quantities like a target excess error rate are undesirable. The target is unknown, and we instead prefer an algorithm that performs as well as possible until a cost budget is exhausted. Fortunately, when the primary cost being considered is LABEL queries, there are many LABEL-only active learning algorithms that readily work in such an anytime setting [see, e.g., Dasgupta et al., 2007, Hanneke, 2014]. The situation is more complicated when we consider both SEARCH and LABEL: we can often make substantially more progress with SEARCH queries than with LABEL queries (as the error rate of the best hypothesis in H_{k′} for k′ > k can be far lower than in H_k). AA-LARCH (Algorithm 2) shows that although these queries come at a higher cost, the cost can be amortized.

AA-LARCH relies on several subroutines: SAMPLE-AND-LABEL, ERROR-CHECK, PRUNE-VERSION-SPACE, and UPGRADE-VERSION-SPACE (Algorithms 6, 7, 8, and 9); their detailed descriptions are deferred to Appendix E. SAMPLE-AND-LABEL performs standard disagreement-based selective sampling using the LABEL oracle: labels of examples in the disagreement region are queried, and labels outside it are inferred. PRUNE-VERSION-SPACE prunes the version space given the labeled examples collected, based on standard generalization error bounds. ERROR-CHECK checks if the best hypothesis in the version space has large error; SEARCH is used to find a systematic mistake for the version space; if either event happens, AA-LARCH calls UPGRADE-VERSION-SPACE to increase k, the level of our working hypothesis class.

Theorem 4. Assume err(h*) = ν. Let θ_k(·) denote the disagreement coefficient of V_i at the first iteration i after which k_i ≥ k. Fix any ε ∈ (0, 1). Let n_ε = Õ(max_{k≤k*} θ_k(2ν + 2ε) · d_{k*} · (1 + ν²/ε²)) and define C_ε = 2(n_ε + k*τ). Run Algorithm 2 with a nested sequence of hypothesis classes {H_k}_{k≥0}, oracles LABEL and SEARCH, confidence parameter δ, cost ratio τ ≥ 1, and upper bound N = Õ(d_{k*}/ε²). If the cost spent is at least C_ε, then with probability at least 1 - δ, the current hypothesis ĥ has error at most ν + ε.

The proof is in Appendix E. A comparison to Theorem 3 shows that AA-LARCH is adaptive: for any cost complexity C, the excess error rate is roughly at most twice that achieved by A-LARCH.
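The selective-sampling rule inside SAMPLE-AND-LABEL can be sketched as follows; the signature and the unanimity test are our rendering of the one-sentence description above (the actual subroutine is Algorithm 6 in Appendix E). An example costs a LABEL query only if it falls in the disagreement region of the current version space; otherwise its label is inferred from the unanimous prediction.

```python
def sample_and_label(V, label, L, c, draw_x):
    """One step of disagreement-based selective sampling (a sketch).
    V: current version space (finite, for the sketch); label: LABEL oracle;
    L: working labeled dataset; c: LABEL-query counter used for cost
    accounting against the SEARCH-to-LABEL ratio tau; draw_x: draws a
    fresh unlabeled example from D_X."""
    x = draw_x()
    predictions = {h(x) for h in V}
    if len(predictions) > 1:      # x is in Dis(V): pay for a LABEL query
        y, c = label(x), c + 1
    else:                         # outside Dis(V): infer the unanimous label
        y = predictions.pop()
    return L | {(x, y)}, c
```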
Algorithm 2 AA-LARCH
input: nested hypothesis classes H_0 ⊆ H_1 ⊆ ···; oracles LABEL and SEARCH; learning parameter δ ∈ (0, 1); SEARCH-to-LABEL cost ratio τ; dataset size upper bound N.
output: hypothesis ĥ.
1: Initialize: consistency constraints S ← ∅, counter c ← 0, k ← 0, verified labeled dataset L ← ∅, working labeled dataset L_0 ← ∅, unlabeled examples processed i ← 0, V_i ← H_k(S).
2: loop
3:   Reset counter c ← 0.
4:   repeat
5:     if ERROR-CHECK(V_i, L_i, δ_i) then
6:       (k, S, V_i) ← UPGRADE-VERSION-SPACE(k, S, ∅)
7:       V_i ← PRUNE-VERSION-SPACE(V_i, L, δ_i)
8:       L_i ← L
9:       continue loop
10:    end if
11:    i ← i + 1
12:    (L_i, c) ← SAMPLE-AND-LABEL(V_{i-1}, LABEL, L_{i-1}, c)
13:    V_i ← PRUNE-VERSION-SPACE(V_{i-1}, L_i, δ_i)
14:  until c = τ or |L_i| = N
15:  e ← SEARCH_{H_k}(V_i)
16:  if e ≠ ⊥ then
17:    (k, S, V_i) ← UPGRADE-VERSION-SPACE(k, S, {e})
18:    V_i ← PRUNE-VERSION-SPACE(V_i, L, δ_i)
19:    L_i ← L
20:  else
21:    Update verified dataset L ← L_i.
22:    Store temporary solution ĥ ← arg min_{h′ ∈ V_i} err(h′, L).
23:  end if
24: end loop

6 Discussion

The SEARCH oracle captures a powerful form of interaction that is useful for machine learning. Our theoretical analyses of LARCH and variants demonstrate that SEARCH can substantially improve LABEL-based active learners, while being plausibly cheaper to implement than oracles like CCQ.

Are there examples where CCQ is substantially more powerful than SEARCH? This is a key question, because a good active learning system should use minimally powerful oracles. Another key question is: can the benefits of SEARCH be provided in a computationally efficient, general-purpose manner?

References

Josh Attenberg and Foster J. Provost. Why label when you can search? Alternatives to active learning for applying human resources to build classification models under extreme class imbalance. In Proceedings of the 16th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 423-432, 2010.

Maria-Florina Balcan and Steve Hanneke. Robust interactive learning. In COLT, 2012.

Maria-Florina Balcan, Alina Beygelzimer, and John Langford. Agnostic active learning. In ICML, 2006.

Maria-Florina Balcan, Steve Hanneke, and Jennifer Wortman Vaughan. The true sample complexity of active learning. Machine Learning, 80(2-3):111-139, 2010.

Alina Beygelzimer, Daniel Hsu, John Langford, and Tong Zhang. Agnostic active learning without constraints. In Advances in Neural Information Processing Systems 23, 2010.

Alina Beygelzimer, Daniel Hsu, Nikos Karampatziakis, John Langford, and Tong Zhang. Efficient active learning. In ICML Workshop on Online Trading of Exploration and Exploitation, 2011.

David A. Cohn, Les E. Atlas, and Richard E. Ladner. Improving generalization with active learning. Machine Learning, 15(2):201-221, 1994.

Sanjoy Dasgupta. Coarse sample complexity bounds for active learning. In Advances in Neural Information Processing Systems 18, 2005.

Sanjoy Dasgupta, Daniel Hsu, and Claire Monteleoni. A general agnostic active learning algorithm. In Advances in Neural Information Processing Systems 20, 2007.

Luc Devroye, László Györfi, and Gábor Lugosi. A Probabilistic Theory of Pattern Recognition. Springer-Verlag, 1996.

Steve Hanneke. A bound on the label complexity of agnostic active learning. In ICML, pages 249-278, 2007.

Steve Hanneke. Rates of convergence in active learning. The Annals of Statistics, 39(1):333-361, 2011.

Steve Hanneke. Theory of disagreement-based active learning. Foundations and Trends in Machine Learning, 7(2-3):131-309, 2014.
Tzu-Kuo Huang, Alekh Agarwal, Daniel Hsu, John Langford, and Robert E. Schapire. Efficient and parsimonious agnostic active learning. In Advances in Neural Information Processing Systems 28, 2015.

Matti Kääriäinen. Active learning in the non-realizable case. In Algorithmic Learning Theory (ALT), pages 63-77, 2006.

Patrice Y. Simard, David Maxwell Chickering, Aparna Lakshmiratan, Denis Xavier Charles, Léon Bottou, Carlos Garcia Jurado Suarez, David Grangier, Saleema Amershi, Johan Verwey, and Jina Suh. ICE: Enabling non-experts to build models interactively for large-scale lopsided problems. CoRR, abs/1409.4814, 2014. URL http://arxiv.org/abs/1409.4814.

Vladimir N. Vapnik. Estimation of Dependences Based on Empirical Data. Springer-Verlag, 1982.

Vladimir N. Vapnik and Alexey Ya. Chervonenkis. On the uniform convergence of relative frequencies of events to their probabilities. Theory of Probability and Its Applications, 16(2):264-280, 1971.