# Active Learning with Oracle Epiphany

Tzu-Kuo Huang (Uber Advanced Technologies Group, Pittsburgh, PA 15201)*
Lihong Li (Microsoft Research, Redmond, WA 98052)
Ara Vartanian (University of Wisconsin-Madison, Madison, WI 53706)
Saleema Amershi (Microsoft Research, Redmond, WA 98052)
Xiaojin Zhu (University of Wisconsin-Madison, Madison, WI 53706)

*Part of this work was done while the author was with Microsoft Research.

30th Conference on Neural Information Processing Systems (NIPS 2016), Barcelona, Spain.

We present a theoretical analysis of active learning with more realistic interactions with human oracles. Previous empirical studies have shown oracles abstaining on difficult queries until accumulating enough information to make label decisions. We formalize this phenomenon with an oracle epiphany model and analyze active learning query complexity under such oracles for both the realizable and the agnostic cases. Our analysis shows that active learning is possible with oracle epiphany, but incurs an additional cost depending on when the epiphany happens. Our results suggest new, principled active learning approaches with realistic oracles.

1 Introduction

There is currently a wide gap between theory and practice of active learning with oracle interaction. Theoretical active learning assumes an omniscient oracle: given a query $x$, the oracle simply answers its label $y$ by drawing from the conditional distribution $p(y \mid x)$. This oracle model is motivated largely by its convenience for analysis. However, there is mounting empirical evidence from psychology and human-computer interaction research that humans behave in far more complex ways. The oracle may abstain on some queries [Donmez and Carbonell, 2008] (note this is distinct from classifier abstention [Zhang and Chaudhuri, 2014, El-Yaniv and Wiener, 2010]), or their answers can be influenced by the identity and order of previous queries [Newell and Ruths, 2016, Sarkar et al., 2016, Kulesza et al., 2014] and by incentives [Shah and Zhou, 2015]. Theoretical active learning has yet to account for such richness in human behaviors, which are critical to designing principled algorithms that effectively learn from human annotators. This paper takes a step toward bridging this gap.

Specifically, we formalize and analyze the phenomenon of oracle epiphany. Consider active learning from a human oracle to build a webpage classifier on basketball sport vs. others. It is well known in practice that no matter how simple the task looks, the oracle can encounter difficult queries. The oracle may easily answer webpage queries that are obviously about basketball or obviously not about the sport, until she encounters a webpage on basketball jerseys. Here, the oracle cannot immediately decide how to label (does this jersey webpage qualify as a webpage about basketball?). One solution is to allow the oracle to abstain by answering with a special "I-don't-know" label [Donmez and Carbonell, 2008]. More interestingly, Kulesza et al. [2014] demonstrated that with proper user interface support, the oracle may temporarily abstain on similar queries but then have an epiphany: she may suddenly decide how to label all basketball apparel-related webpages. Empirical evidence in [Kulesza et al., 2014] suggests that epiphany may be induced by the cumulative effect of seeing multiple similar queries. If a future basketball-jersey webpage query arrives, the oracle will no longer abstain but will answer with the label she determined during epiphany.
In this way, the oracle improves herself on the subset of the input space that corresponds to basketball apparel-related webpages. Empirical evidence also suggests that oracle abstention, and subsequent epiphany, may happen separately on different subsets of the input space. When building a cooking vs. others text classifier, Kulesza et al. [2014] observed oracle epiphany on a subset of cooking-supplies documents, and separately on the subset of culinary-service documents; on gardening vs. others, they observed separate oracle epiphanies on plant-information and on local-garden documents; on travel vs. others, they observed separate oracle epiphanies on photography, rental-car, and medical-tourism documents.

Our contributions are three-fold: (i) we formalize oracle epiphany in Section 2; (ii) we analyze EPICAL, a variant of the CAL algorithm [Cohn et al., 1994], for realizable active learning with oracle epiphany in Section 3; (iii) we analyze Oracular-EPICAL, a variant of the Oracular-CAL algorithm [Hsu, 2010, Huang et al., 2015], for agnostic active learning in Section 4. Our query complexity bounds show that active learning is possible with oracle epiphany, although we may incur a penalty waiting for epiphany to happen. This is verified with simulations in Section 5, which highlight the nuanced dependency between query complexity and epiphany parameters.

2 Problem Setting

As in standard active learning, we are given a hypothesis class $H \subseteq Y^X$ for some input space $X$ and a binary label set $Y \triangleq \{-1, 1\}$. There is an unknown distribution $\mu$ over $X \times Y$, from which examples are drawn IID. The marginal distribution over $X$ is $\mu_X$. Define the expected classification error, or risk, of a classifier $h \in H$ to be $\mathrm{err}(h) \triangleq \mathbb{E}_{(x,y)\sim\mu}[\mathbf{1}(h(x) \neq y)]$. As usual, the active learning goal is as follows: given any fixed $\epsilon, \delta \in (0, 1)$, we seek an active learning algorithm which, with probability at least $1-\delta$, returns a hypothesis with classification error at most $\epsilon$ after sending a small number of queries to the oracle.

What is unique here is the oracle epiphany model. The input space consists of two disjoint sets, $X = K \cup U$. The oracle knows the label for items in $K$ (for "known") but initially does not know the labels in $U$ (for "unknown"). The oracle will abstain if a query comes from $U$ (unless epiphany happens; see below). Furthermore, $U$ is partitioned into $K$ disjoint subsets, $U = U_1 \cup U_2 \cup \dots \cup U_K$. These correspond to the photography / rental-cars / medical-tourism subsets in the travel task mentioned earlier. The active learner knows neither the partitions nor $K$. When the active learner submits a query $x \in X$ to the oracle, the learner will receive one of three outcomes in $Y_+ \triangleq \{-1, 1, \bot\}$, where $\bot$ indicates the "I-don't-know" abstention. Importantly, we assume that epiphany is modeled by $K$ Markov chains: whenever a unique $x \in U_k$ is queried in some unknown region $k \in \{1, \dots, K\}$ that has not yet experienced epiphany, the oracle has probability $\beta \in [0, 1]$ of an epiphany on that region. If epiphany happens, the oracle then understands how to label everything in $U_k$. In effect, the state of $U_k$ is flipped from unknown to known. Epiphany is irrevocable: $U_k$ will stay known from then on and the oracle will answer accordingly for all future $x$ therein. Thus the oracle will only answer $\bot$ if $U_k$ remains unknown. The requirement for a unique $x$ is to prevent a trivial active learning algorithm which repeatedly queries the same item in an attempt to induce oracle epiphany. This requirement does not pose difficulty for analysis if $\mu_X$ is continuous on $X$, since all queries will be unique with probability one. Therefore, our oracle epiphany model is parameterized by $(\beta, K, U_1, \dots, U_K)$. All our analyses below will be based on this epiphany model. Of course, the model is only an approximation to real human oracle behaviors; in Section 6 we will discuss more sophisticated epiphany models for future work.
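To make the interaction concrete, the following is a minimal simulation sketch of this oracle model. The class name, the representation of regions as membership tests, and the random-seed handling are our own illustrative choices and are not part of the paper.

```python
import random

class EpiphanyOracle:
    """Simulates the (beta, K, U_1..U_K) oracle epiphany model of Section 2.

    `target` labels every x; `regions` is a list of membership tests for the
    unknown subsets U_k.  Queries in a still-unknown region get the abstention
    answer (None); each *unique* such query triggers epiphany on that region
    with probability beta, after which the region stays known for good.
    """

    def __init__(self, target, regions, beta, seed=0):
        self.target = target               # x -> {-1, +1}, the true labeling
        self.regions = regions             # list of predicates x -> bool
        self.known = [False] * len(regions)
        self.beta = beta
        self.seen = set()                  # unique queries already counted
        self.rng = random.Random(seed)

    def query(self, x):
        for k, in_region in enumerate(self.regions):
            if in_region(x) and not self.known[k]:
                if x not in self.seen:     # only unique queries can induce epiphany
                    self.seen.add(x)
                    if self.rng.random() < self.beta:
                        self.known[k] = True        # irrevocable epiphany on U_k
                        return self.target(x)
                return None                # abstention (the "I-don't-know" answer)
        return self.target(x)              # x lies in the known part of X

# Tiny demo: X = [0, 1], two unknown regions, target threshold at 0.5.
oracle = EpiphanyOracle(target=lambda x: 1 if x >= 0.5 else -1,
                        regions=[lambda x: 0.4 <= x <= 0.6,
                                 lambda x: 0.7 <= x <= 0.9],
                        beta=0.25)
print([oracle.query(x) for x in (0.1, 0.45, 0.52, 0.75, 0.95)])
```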
3 The Realizable Case

In this section, we study the realizable active learning case, where we assume there exists some $h^* \in H$ such that the label of an example $x \in X$ is $y = h^*(x)$. It follows that $\mathrm{err}(h^*) = 0$. Although the realizability assumption is strong, the analysis is insightful on the role of epiphany. We will show that the worst-case query complexity has an additional $1/\beta$ dependence. We also discuss nice cases where this $1/\beta$ can be avoided, depending on how $U$ interacts with the disagreement region. Furthermore, our analysis focuses on the $K = 1$ case; that is, the oracle has only one unknown region $U = U_1$. This case is the simplest but captures the essence of the algorithm we propose in this section. For convenience, we will simply write $U$. In the next section, we will eliminate both assumptions, and present and analyze an algorithm for the agnostic case with an arbitrary $K \geq 1$.

We modify the standard CAL algorithm [Cohn et al., 1994] to accommodate oracle epiphany. The modified algorithm, which we call EPICAL for "epiphany CAL", is given in Alg. 1. Like CAL, EPICAL receives a stream of unlabeled items; it maintains a version space; if the unlabeled item falls into the disagreement region of the version space, the oracle is queried. The essential difference from CAL is that if the oracle answers $\bot$, no update to the version space happens. The stopping criterion ensures that the true risk of any hypothesis in the version space is at most $\epsilon$, with high probability.

Algorithm 1 EPICAL
Input: $\epsilon$, $\delta$, oracle, $X$, $H$
  Version space $V \leftarrow H$
  Disagreement region $D \leftarrow \{x \in X \mid \exists h, h' \in V,\ h(x) \neq h'(x)\}$
  for $t = 1, 2, 3, \dots$ do
    Sample an unlabeled example from the marginal distribution restricted to $D$: $x_t \sim \mu_{X|D}$
    Query the oracle with $x_t$ to get $y_t$
    if $y_t \neq \bot$ then
      $V \leftarrow \{h \in V \mid h(x_t) = y_t\}$
      $D \leftarrow \{x \in X \mid \exists h, h' \in V,\ h(x) \neq h'(x)\}$
    end if
    if $\mu_X(D) \leq \epsilon$ then
      Return any $h \in V$
    end if
  end for

Our analysis is based on the following observation: before oracle epiphany, and ignoring all queries that result in $\bot$, EPICAL behaves exactly the same as CAL on an induced active-learning problem. The induced problem has input space $K$, but with a projected hypothesis space we detail below. Hence, standard CAL analysis bounds the number of queries to find a good hypothesis in the induced problem. Now consider the sequence of probabilities of getting a $\bot$ in each step of EPICAL. If these probabilities tend to be small, EPICAL will terminate with an $\epsilon$-risk hypothesis without even having to wait for epiphany. If these probabilities tend to be large, we may often hit the unknown region $U$; but the number of such steps is bounded because epiphany will happen with high probability.

Formally, we define the induced active-learning problem as follows. The input space is $\widetilde X \triangleq K$, and the output space is still $Y$. The sampling distribution is $\widetilde\mu_X(x) \triangleq \mu_X(x)\,\mathbf{1}(x \in K)/\mu_X(K)$. The hypothesis space is the projection of $H$ onto $\widetilde X$: $\widetilde H \triangleq \{\tilde h \in Y^{\widetilde X} \mid \exists h \in H,\ \forall x \in \widetilde X: \tilde h(x) = h(x)\}$. Clearly, the induced problem is still realizable; let $\tilde h^*$ be the projected target hypothesis.
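As a concrete illustration of Alg. 1, here is a minimal sketch of the EPICAL loop for the 1D threshold class used later in Section 5. For thresholds, the version space is an interval of candidate thresholds and its disagreement region coincides with that interval, so both can be tracked with two numbers. The oracle interface (returning None for $\bot$) and all names are our own illustrative choices under these assumptions.

```python
import random

_rng_oracle = random.Random(1)
_known = [False]

def toy_oracle(x, beta=0.25):
    """Epiphany oracle: target threshold 0.5, one unknown region U = [0.4, 0.6]."""
    if 0.4 <= x <= 0.6 and not _known[0]:
        if _rng_oracle.random() < beta:
            _known[0] = True          # epiphany: U becomes known for good
        else:
            return None               # abstain (the "I-don't-know" answer)
    return 1 if x >= 0.5 else -1

def epical_1d(oracle, eps, lo=0.0, hi=1.0, seed=0, max_steps=100_000):
    """EPICAL (Alg. 1) specialized to thresholds h_a(x) = 1(x >= a) on [0, 1].

    The version space is the interval of candidate thresholds [lo, hi]; its
    disagreement region is exactly that interval, so mu_X(D) = hi - lo."""
    rng = random.Random(seed)
    queries = 0
    for _ in range(max_steps):
        if hi - lo <= eps:                  # stopping rule: mu_X(D) <= eps
            return (lo + hi) / 2, queries   # any hypothesis left in V is eps-good
        x = rng.uniform(lo, hi)             # sample from mu_X restricted to D
        y = oracle(x)
        queries += 1
        if y is None:                       # abstention: no update to V or D
            continue
        if y > 0:
            hi = min(hi, x)                 # label +1  =>  threshold a <= x
        else:
            lo = max(lo, x)                 # label -1  =>  threshold a >  x
    return (lo + hi) / 2, queries

print(epical_1d(toy_oracle, eps=0.05))
```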
Let $\theta$ be the disagreement coefficient [Hanneke, 2014] for the original problem without unknown regions. The induced problem potentially has a different disagreement coefficient:

$$\tilde\theta \triangleq \sup_{r>0}\ r^{-1}\,\mathbb{E}_{x\sim\widetilde\mu_X}\Big[\mathbf{1}\Big(\exists\,\tilde h \in \widetilde H\ \text{s.t.}\ \tilde h(x) \neq \tilde h^*(x),\ \mathbb{E}_{x'\sim\widetilde\mu_X}\big[\mathbf{1}(\tilde h(x') \neq \tilde h^*(x'))\big] \leq r\Big)\Big].$$

Let $\tilde m$ be the number of queries required for the CAL algorithm to find a hypothesis of $\epsilon/2$ risk with probability $1-\delta/4$ in the induced problem. It is known [Hanneke, 2014, Theorem 5.1] that

$$\tilde m \leq \widetilde M \triangleq O\!\left(\tilde\theta\left(\mathrm{dim}(\widetilde H)\ln\tilde\theta + \ln\tfrac{4}{\delta}\right)\ln\tfrac{2}{\epsilon}\right),$$

where $\mathrm{dim}(\cdot)$ is the VC dimension. Similarly, let $m_{\mathrm{CAL}}$ be the number of queries required for CAL to find a hypothesis of $\epsilon$ risk with probability $1-\delta/4$ in the original problem, and we have

$$m_{\mathrm{CAL}} \leq M_{\mathrm{CAL}} \triangleq O\!\left(\theta\left(\mathrm{dim}(H)\ln\theta + \ln\tfrac{4}{\delta}\right)\ln\tfrac{1}{\epsilon}\right).$$

Furthermore, define $m_\bot \triangleq |\{t \mid y_t = \bot\}|$ to be the number of queries in EPICAL for which the oracle returns $\bot$. We define $U_t$ to be $U$ for an iteration $t$ before epiphany, and $\emptyset$ after that. We define $D_t$ to be the disagreement region $D$ at iteration $t$. Finally, define the unknown fraction within disagreement as $\alpha_t \triangleq \mu_X(D_t \cap U_t)/\mu_X(D_t)$. We are now ready to state the main result of this section.

Theorem 1. Given any $\epsilon$ and $\delta$, EPICAL will, with probability at least $1-\delta$, return an $\hat h \in H$ with $\mathrm{err}(\hat h) \leq \epsilon$, after making at most $M_{\mathrm{CAL}} + \widetilde M + \frac{3}{\beta}\ln\frac{4}{\delta}$ queries.

Remark. The bound above consists of three terms. The first is the standard CAL query complexity bound with an omniscient oracle. The other two are the price we pay when the oracle is imperfect. The second term is the query complexity for finding a low-risk hypothesis in the induced active-learning problem. In situations where $\mu_X(U) \leq \epsilon/2$ and $\beta \ll 1$, it is hard to induce epiphany, but it suffices to find a hypothesis from $\widetilde H$ with $\epsilon/2$ risk in the induced problem (which implies at most $\epsilon$ risk under the original distribution $\mu_X$); this indicates that $\widetilde M$ is unavoidable in some cases. The third term is roughly the extra query complexity required to induce epiphany. It is unavoidable in the worst case: when $U = X$, one has to wait for oracle epiphany to start collecting labeled examples to infer $h^*$; the average number of steps until epiphany is on the order of $1/\beta$. Finally, note that not all three terms contribute simultaneously to the query complexity of EPICAL. As we will see in the analysis and in the experiments, usually one or two of them will dominate, depending on how $U$ interacts with the disagreement region. Summing them up simplifies our exposition without changing the order of the worst-case bounds.

Our analysis starts with the definition of the following two events. Lemmas 2 and 3 show that they hold with high probability when running EPICAL; the proofs are delegated to Appendix A. Define:

$$E_\bot \triangleq \left\{ m_\bot \leq \tfrac{1}{\beta}\ln\tfrac{4}{\delta} \right\} \qquad\text{and}\qquad E_\alpha \triangleq \left\{ |\{t \mid \alpha_t > 1/2\}| \leq \tfrac{2}{\beta}\ln\tfrac{4}{\delta} \right\}.$$

Lemma 2. $\Pr\{E_\bot\} \geq 1-\delta/4$.

Lemma 3. $\Pr\{E_\alpha\} \geq 1-\delta/4$.

Lemma 4. Assume event $E_\alpha$ holds. Then, the number of queries from $K$ before oracle epiphany or before EPICAL terminates, whichever happens first, is at most $\tilde m + \frac{2}{\beta}\ln\frac{4}{\delta}$.

Proof (sketch). Denote the quantity by $m$. Before epiphany, $V$ and $D$ in EPICAL behave in exactly the same way as in CAL on $K$. It takes $\tilde m$ queries to get to $\epsilon/2$ accuracy in $K$, by the definition of $\tilde m$. If $m \leq \tilde m$, then $m < \tilde m + \frac{2}{\beta}\ln\frac{4}{\delta}$ trivially, and we are done. Otherwise, it must be the case that $\alpha_t > 1/2$ for every step after $V$ reaches $\epsilon/2$ accuracy on $K$. Suppose not; then there is a step $t$ where $\alpha_t \leq 1/2$. Note that $V$ reaching $\epsilon/2$ accuracy on $K$ implies $\mu_X(D_t) - \mu_X(D_t \cap U_t) \leq \epsilon/2$. Together with $\alpha_t = \mu_X(D_t \cap U_t)/\mu_X(D_t) \leq 1/2$, we have $\mu_X(D_t) \leq \epsilon$. But this would have triggered termination of EPICAL at step $t$, a contradiction. Since we assume $E_\alpha$ holds, we have $m \leq \tilde m + \frac{2}{\beta}\ln\frac{4}{\delta}$.
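The proofs of Lemmas 2 and 3 are in Appendix A; for intuition, quantities of the form $\frac{1}{\beta}\ln\frac{4}{\delta}$ arise from a standard geometric tail bound, since each unique query that lands in a still-unknown region triggers epiphany independently with probability $\beta$. A back-of-the-envelope version of this argument (ours, not the appendix proof) is:

$$\Pr[\text{no epiphany after } n \text{ unique queries in } U] = (1-\beta)^n \leq e^{-\beta n}, \qquad\text{so } n = \tfrac{1}{\beta}\ln\tfrac{4}{\delta} \;\Longrightarrow\; (1-\beta)^n \leq \tfrac{\delta}{4}.$$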
Proof of Theorem 1. We will prove the query complexity bound, assuming that (i) events $E_\bot$ and $E_\alpha$ hold, and (ii) $\widetilde M$ and $M_{\mathrm{CAL}}$ successfully upper bound the corresponding query complexity of standard CAL. By Lemmas 2 and 3 and a union bound, the above holds with probability at least $1-\delta$. Suppose epiphany happens before EPICAL terminates. By event $E_\bot$ and Lemma 4, the total number of queried examples before epiphany is at most $\tilde m + \frac{3}{\beta}\ln\frac{4}{\delta}$. After epiphany, the total number of queries is no more than that of running CAL from scratch, which is at most $M_{\mathrm{CAL}}$. Therefore, the total query complexity is at most $\widetilde M + M_{\mathrm{CAL}} + \frac{3}{\beta}\ln\frac{4}{\delta}$. Suppose instead that epiphany does not happen before EPICAL terminates. In this case, the number of queries in the unknown region is at most $\frac{1}{\beta}\ln\frac{4}{\delta}$ (event $E_\bot$), and the number of queries in the known region is at most $\tilde m + \frac{2}{\beta}\ln\frac{4}{\delta}$ (Lemma 4). Thus, the total number of queries is at most $\widetilde M + \frac{3}{\beta}\ln\frac{4}{\delta}$.

4 The Agnostic Case

In the agnostic setting the best hypothesis, $h^* \triangleq \arg\min_{h \in H} \mathrm{err}(h)$, has nonzero error. We want an active learning algorithm that, for a given accuracy $\epsilon > 0$, returns a hypothesis $h$ with small regret $\mathrm{reg}(h, h^*) \triangleq \mathrm{err}(h) - \mathrm{err}(h^*) \leq \epsilon$ while making a small number of queries. Among existing agnostic active learning algorithms, we choose to adapt the Oracular-CAL algorithm, first proposed by Hsu [2010] and later improved by Huang et al. [2015]. Oracular-CAL makes no assumption on $H$ or $\mu$, and can be implemented solely with an empirical risk minimization (ERM) subroutine, which is often well approximated by convex optimization over a surrogate loss in practice. This is a significant advantage over several existing agnostic algorithms, which either explicitly maintain a version space, as done in A² [Balcan et al., 2006], or require a constrained ERM routine [Dasgupta et al., 2007] that may not be well approximated efficiently in practice. IWAL [Beygelzimer et al., 2010] and Active Cover [Huang et al., 2015] are agnostic algorithms that are implementable with an ERM routine, both using importance weights to correct for querying bias. But in the presence of $\bot$'s, choosing proper importance weights becomes challenging. Moreover, the improved Oracular-CAL [Huang et al., 2015] we use² has stronger guarantees than IWAL and, in fact, the best known worst-case guarantees among efficient, agnostic active learning algorithms.

² This improved version of Oracular-CAL defines the version space using a tighter threshold than the one used by Hsu [2010], and has the same worst-case guarantees as Active Cover [Huang et al., 2015].

Our proposed algorithm, Oracular-EPICAL, is given in Alg. 2. Note that $t$ here counts unlabeled examples, while in Alg. 1 it counts queries. Roughly speaking, Oracular-EPICAL also has an additive factor of $O(K/\beta)$ compared to Oracular-CAL's query complexity. It keeps a growing set $Z$ of labeled examples. If the unlabeled example $x_t$ falls in the disagreement region, the algorithm queries its label: when the oracle returns a label $y_t$, the algorithm adds $(x_t, y_t)$ to $Z$; when the oracle returns $\bot$, no update to $Z$ happens. If $x_t$ is outside the disagreement region, the algorithm adds $x_t$ together with the label predicted by the current ERM hypothesis, $h_t(x_t)$, to $Z$. Alg. 2 keeps an indicator $u_t$, which records whether $\bot$ was returned on $x_t$, and it always updates the ERM and the version space after every new $x_t$. For simplicity we assume a finite $H$; this can be extended to $H$ with finite VC dimension. The critical modification we make here to accommodate oracle abstention is that the threshold $\Delta_t$ defining the version space additively depends on the average number of $\bot$'s received up to round $t$. This allows us to show that Oracular-EPICAL retains the favorable bias guarantee of Oracular-CAL: with high probability, all of the imputed labels are consistent with the classifications of $h^*$, so imputation never pushes the algorithm away from $h^*$. Oracular-EPICAL only uses the version space in the disagreement test. With the same technique used by Oracular-CAL, summarized in Appendix B, the algorithm is able to perform the test solely with an ERM routine.

Algorithm 2 Oracular-EPICAL
1: Set $c_1 \triangleq 4$ and $c_2 \triangleq 2\sqrt{6}+9$. Let $\eta_0 \triangleq 1$ and $\eta_t \triangleq \frac{12}{t}\ln\frac{32\,t\,|H|\ln t}{\delta}$.
2: Initialize labeled data $Z_0 \leftarrow \emptyset$, the version space $V_1 \leftarrow H$, and the ERM $h_1$ as any $h \in H$.
3: for $t = 1, 2, \dots$ do
4:   Observe a new example $x_t$, where $(x_t, y_t) \sim \mu$ i.i.d.
5:   if $x_t \in D_t \triangleq \{x \in X \mid \exists (h, h') \in V_t^2 \text{ s.t. } h(x) \neq h'(x)\}$ then
6:     Query the oracle with $x_t$.
7:     $Z_t \leftarrow Z_{t-1} \cup \{(x_t, y_t)\}$ if the oracle returns $y_t$; $Z_t \leftarrow Z_{t-1}$ if the oracle returns $\bot$.
8:     $u_t \leftarrow \mathbf{1}(\text{oracle returns } \bot)$.
9:   else
10:     $Z_t \leftarrow Z_{t-1} \cup \{(x_t, h_t(x_t))\}$.   // update the labeled data with the current ERM's prediction
11:     $u_t \leftarrow 0$.
12:   end if
13:   $\widehat{\mathrm{err}}(h, Z_t) \triangleq \frac{1}{t}\sum_{i=1}^{t}\big[\mathbf{1}(x_i \in D_i)(1-u_i)\mathbf{1}(h(x_i) \neq y_i) + \mathbf{1}(x_i \notin D_i)\,\mathbf{1}(h(x_i) \neq h_i(x_i))\big]$.
14:   $h_{t+1} \leftarrow \arg\min_{h \in H} \widehat{\mathrm{err}}(h, Z_t)$.
15:   $b_t \leftarrow \frac{1}{t}\sum_{i=1}^{t} u_i$.
16:   $\Delta_t \leftarrow c_1\sqrt{\eta_t\,\widehat{\mathrm{err}}(h_{t+1}, Z_t)} + c_2(\eta_t + b_t)$.
17:   $V_{t+1} \leftarrow \{h \in H \mid \widehat{\mathrm{err}}(h, Z_t) \leq \widehat{\mathrm{err}}(h_{t+1}, Z_t) + \Delta_t\}$.
18: end for
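To make the bookkeeping in Alg. 2 concrete, here is a minimal sketch of the update for a small finite hypothesis class, following lines 5-17 as reconstructed above. The constants and the form of $\eta_t$ are taken from the listing; everything else (the names, the brute-force version space, the toy oracle interface, the $\max(\ln t, 1)$ guard at $t=1$) is our own illustrative scaffolding, not the efficient ERM-only implementation described in Appendix B.

```python
import math
import random

def emp_err(h, records):
    """Line 13 of Alg. 2: empirical error of h on Z_t.  Rounds where the oracle
    abstained (u_i = 1) contribute nothing; rounds outside the disagreement
    region use the label imputed from the ERM at that time."""
    bad = 0
    for r in records:
        if r["in_D"]:
            if r["u"] == 0 and h(r["x"]) != r["y"]:
                bad += 1
        elif h(r["x"]) != r["y_imputed"]:
            bad += 1
    return bad / len(records)

def oracular_epical(stream, oracle, H, delta=0.05, c1=4.0, c2=2 * math.sqrt(6) + 9):
    """Minimal finite-H sketch of Alg. 2 with an explicit (brute-force) version space."""
    V = list(H)
    erm = H[0]
    records, queries = [], 0
    for t, x in enumerate(stream, start=1):
        in_D = any(h(x) != hp(x) for h in V for hp in V)        # line 5: disagreement test
        if in_D:
            y = oracle(x)                                       # lines 6-8: query; u marks abstention
            queries += 1
            records.append({"x": x, "in_D": True, "u": int(y is None), "y": y, "y_imputed": None})
        else:                                                   # lines 10-11: impute the ERM's prediction
            records.append({"x": x, "in_D": False, "u": 0, "y": None, "y_imputed": erm(x)})
        eta = (12.0 / t) * math.log(32 * t * len(H) * max(math.log(t), 1.0) / delta)
        erm = min(H, key=lambda h: emp_err(h, records))          # line 14: ERM
        b = sum(r["u"] for r in records) / t                     # line 15: fraction of abstentions
        best = emp_err(erm, records)
        thr = c1 * math.sqrt(eta * best) + c2 * (eta + b)        # line 16: threshold
        V = [h for h in H if emp_err(h, records) <= best + thr]  # line 17: version space
    return erm, queries

# Toy usage: 21 threshold classifiers on [0, 1] and a perfect oracle.
H = [(lambda a: (lambda x: 1 if x >= a else -1))(i / 20) for i in range(21)]
rng = random.Random(0)
stream = [rng.random() for _ in range(200)]
erm, q = oracular_epical(stream, oracle=lambda x: 1 if x >= 0.5 else -1, H=H)
print("queries:", q)
```

With only a few hundred examples the threshold $\Delta_t$ is loose and the version space barely shrinks, so the sketch mainly illustrates the bookkeeping rather than the query savings the analysis promises at larger $t$.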
We now state Oracular-EPICAL's general theoretical guarantees, which hold for any oracle model, and then specialize them to the epiphany model of Section 2. We start with a consistency result.

Theorem 5 (Consistency Guarantee). Pick any $0 < \delta < 1/e$ and let $\Delta^*_t := c_1\sqrt{\eta_t\,\mathrm{err}(h^*)} + c_2(\eta_t + b_t)$. With probability at least $1-\delta$, the following holds for all $t \geq 1$:

$$\mathrm{err}(h) - \mathrm{err}(h^*) \leq 4\Delta^*_t \quad\text{for all } h \in V_{t+1}, \tag{1}$$

$$\widehat{\mathrm{err}}(h^*, Z_t) - \widehat{\mathrm{err}}(h_{t+1}, Z_t) \leq \Delta_t. \tag{2}$$

All hypotheses in the current version space, including the current ERM, have controlled expected regrets. Compared with Oracular-CAL's consistency guarantee, this is worse by an additive factor of $O(b_t)$, the average number of $\bot$'s over $t$ examples. Importantly, $h^*$ always remains in the version space, as implied by (2). This guarantees that all predicted labels used by the algorithm are consistent with $h^*$, since the entire version space makes the same prediction. The query complexity bound is:

Theorem 6 (Query Complexity Bound). Let $Q_t \triangleq \sum_{i=1}^{t}\mathbf{1}(x_i \in D_i)$ denote the total number of queries Alg. 2 makes after observing $t$ examples. Under the conditions of Theorem 5, with probability at least $1-\delta$ the following holds for all $t > 0$: $Q_t$ is bounded by

$$4\theta\,\mathrm{err}(h^*)\,t \;+\; \theta\cdot O\!\left(\sqrt{t\,\mathrm{err}(h^*)\ln(t|H|/\delta)}\,\ln^2 t + \ln(t|H|/\delta)\ln t + t\,b_t\ln t + 8\ln(8t^2\ln t/\delta)\right),$$

where $\theta$ denotes the disagreement coefficient [Hanneke, 2014].

Again, this result is worse than Oracular-CAL's query complexity [Huang et al., 2015] by an additive factor. The magnitude of this factor is less trivial than it seems: since the algorithm increases the threshold by $b_t$, it includes more hypotheses in the version space, which may cause the algorithm to query a lot more. However, our analysis shows that the number of queries only increases by $O(t\,b_t\ln t)$, i.e., $\ln t$ times the total number of $\bot$'s received over $t$ examples. The full proofs of both theorems are in Appendix C.
Here we provide the key ingredient. Consider an imaginary dataset $Z'_t$ in which all the labels queried by the algorithm but not returned by the oracle are imputed, and define the error on this imputed data:

$$\widehat{\mathrm{err}}(h, Z'_t) \triangleq \frac{1}{t}\sum_{i=1}^{t}\big[\mathbf{1}(x_i \in D_i)\,\mathbf{1}(h(x_i) \neq y_i) + \mathbf{1}(x_i \notin D_i)\,\mathbf{1}(h(x_i) \neq h_i(x_i))\big]. \tag{3}$$

Note that the version space $V_t$, and therefore the disagreement region $D_t$, are still defined in terms of $\widehat{\mathrm{err}}(h, Z_t)$, not $\widehat{\mathrm{err}}(h, Z'_t)$. Also define the empirical regret between two hypotheses $h$ and $h'$ as $\mathrm{reg}(h, h', Z_t) \triangleq \widehat{\mathrm{err}}(h, Z_t) - \widehat{\mathrm{err}}(h', Z_t)$, and define $\mathrm{reg}(h, h', Z'_t)$ on $Z'_t$ in the same way. The empirical error and regret on $Z'_t$ are not observable, but can be easily bounded by observable quantities: the two datasets differ only on the $t\,b_t$ rounds where the oracle returned $\bot$, and each such round changes the empirical error by at most $1/t$, so

$$\widehat{\mathrm{err}}(h, Z_t) \leq \widehat{\mathrm{err}}(h, Z'_t) \leq \widehat{\mathrm{err}}(h, Z_t) + b_t, \tag{4}$$

$$|\mathrm{reg}(h, h', Z_t) - \mathrm{reg}(h, h', Z'_t)| \leq b_t, \tag{5}$$

where $b_t = \frac{1}{t}\sum_{i=1}^{t} u_i$ is also observable. Using a martingale analysis resembling Huang et al. [2015]'s for Oracular-CAL, we prove concentration of the empirical regret $\mathrm{reg}(h, h^*, Z'_t)$ to its expectation. For every $h \in V_{t+1}$, the algorithm controls its empirical regret on $Z_t$, which bounds $\mathrm{reg}(h, h^*, Z'_t)$ by the above. This leads to a bound on the expected regret of $h$. The query complexity analysis follows the standard framework of Hsu [2010] and Huang et al. [2015].

Next, we specialize the guarantees to the oracle epiphany model in Section 2:

Corollary 7. Assume the epiphany model in Section 2. Fix $\epsilon > 0$, $\delta > 0$. Let $d \triangleq \ln\frac{|H|}{\epsilon\delta}$, $\widetilde K \triangleq K\ln\frac{K}{\delta}$ and $\tilde e \triangleq \mathrm{err}(h^*)$. With probability at least $1-\delta$, the following holds: the ERM hypothesis $h_{t_\epsilon+1}$ satisfies $\mathrm{err}(h_{t_\epsilon+1}) \leq \tilde e + \epsilon$, where $t_\epsilon = O\!\left(\frac{d(\tilde e+\epsilon)}{\epsilon^2} + \frac{\widetilde K}{\beta\epsilon}\right)$, and the total number of queries made up to round $t_\epsilon$ is

$$\theta\cdot\widetilde O\!\left(\left(\frac{\tilde e}{\epsilon}\right)^2 d + \frac{\widetilde K\,\tilde e}{\epsilon\beta} + \left(\frac{\tilde e}{\epsilon}+1\right)\left(d + \frac{\widetilde K}{\beta}\right)\right).$$

The proof is in Appendix D. This corollary reveals how the epiphany parameters $K$ and $\beta$ affect query complexity. Setting $\widetilde K = 0$ recovers the result for a perfect oracle, showing that the (unlabeled) sample complexity $t_\epsilon$ worsens by an additive factor of $\widetilde K/(\beta\epsilon)$ in both the realizable and the agnostic settings. For query complexity, in the realizable setting the bound becomes $\theta\cdot O\!\left(\left(d + \widetilde K/\beta\right)\ln\frac{d + \widetilde K/\beta}{\epsilon}\right)$. In the agnostic setting, the leading term in our bound is $\theta\cdot O\!\left((\tilde e/\epsilon)^2 d + \widetilde K\tilde e/(\beta\epsilon)\right)$. In both cases, our bounds are worse by roughly an additive factor of $O(\widetilde K/\beta)$ than bounds for perfect oracles. As for the effect of $U$, the above corollary is a worst-case result: it uses an upper bound on $t\,b_t$ that holds even for $U = X$. For certain $U$'s the upper bound can be much tighter. For example, if $U \cap D_t = \emptyset$ for all sufficiently large $t$, then $t\,b_t$ will be $O(1)$ for all $\beta$, with or without epiphany.

5 Experiments

To complement our theoretical results, we present two simulated experiments on active learning with oracle epiphany: learning a 1D threshold classifier, and handwritten digit recognition (OCR). Specifically, we highlight how query complexity depends on the epiphany parameter $\beta$ and on $U$.

EPICAL on 1D Threshold Classifiers. Take $\mu_X$ to be the uniform distribution over the interval $X = [0, 1]$. Our hypothesis space is the set of threshold classifiers $H = \{h_a : a \in [0, 1]\}$ where $h_a(x) = \mathbf{1}(x \geq a)$. We choose $h^* = h_{1/2}$ and set the target classification error at $\epsilon = 0.05$. We illustrate epiphany with a single unknown region: $K = 1$, $U = U_1$. However, we contrast two shapes of $U$: in one set of experiments we set $U = [0.4, 0.6]$, which contains the decision boundary 0.5. In this case, the active learner EPICAL must induce oracle epiphany in order to achieve $\epsilon$ risk. In another set of experiments $U = [0.7, 0.9]$, where we expect the learner to be able to bypass the need for epiphany.
Intuitively, this latter $U$ could soon be excluded from the disagreement region. For both choices of $U$, we systematically vary the oracle epiphany parameter $\beta \in \{2^{-6}, 2^{-5}, \dots, 2^{0}\}$. A small $\beta$ means epiphany is less likely per query, so we expect the learner to spend more queries trying to induce epiphany in the case of $U = [0.4, 0.6]$. In contrast, $\beta$ may not matter much in the case of $U = [0.7, 0.9]$, since epiphany may not be required. Note that $\beta = 2^0 = 1$ reverts to the standard active learning oracle, since epiphany always happens immediately. We run each combination of $\beta$ and $U$ for 10,000 trials.

[Figure 1: EPICAL results on 1D threshold classifiers over 10,000 trials. Panels (a) $U = [0.4, 0.6]$ and (b) $U = [0.7, 0.9]$ compare the number of queries for EPICAL vs. passive learning; panel (c) plots excess queries against $1/\beta$ for $U = [0.4, 0.6]$.]

The results are shown in Figure 1. As expected, (a) shows a clear dependency on $\beta$. This indicates that epiphany is necessary in the case $U = [0.4, 0.6]$ for learning to be successful. In contrast, the dependence on $\beta$ vanishes in (b), where $U$ is shifted sufficiently away from the target threshold (and thus from later disagreement regions): the oracle need not reach epiphany for learning to happen. Note that (b) does not contradict the EPICAL query complexity analysis, since Theorem 1 is a worst-case bound that must hold for all $U$. To further clarify the role of $\beta$, note that the EPICAL query complexity bound predicts an additive term of $O(1/\beta)$ on top of the standard CAL query complexities (i.e., both $\widetilde M$ and $M_{\mathrm{CAL}}$). This term represents the excess queries needed to induce epiphany. In Figure 1(c) we plot this excess against $1/\beta$ for $U = [0.4, 0.6]$. Excess is computed as the number of EPICAL queries minus the average number of queries for $\beta = 1$. Indeed, we see a near-linear relationship between excess queries and $1/\beta$. Finally, as a baseline we compare EPICAL to passive learning. In passive learning, $x_1, x_2, \dots$ are chosen randomly according to $\mu_X$ instead of adaptively. Note that passive learning here is also subject to oracle epiphany: the labels $y_t$ are produced by the same oracle epiphany model, so some of them can initially be $\bot$. Our passive learner simply maintains a version space; if it encounters $\bot$, it does not update the version space. All EPICAL results are better than passive learning.

Oracular-EPICAL on OCR. We consider the binary classification task of "5" vs. other digits on MNIST [Le Cun et al., 1998]. This allows us to design the unknown regions $\{U_k\}$ as certain other digits, making the experiments more interpretable. Furthermore, we can control how confusable the $U$ digits are with "5" to observe the influence on oracle epiphany. Although Alg. 2 is efficiently implementable with an ERM routine, it still requires two calls to a supervised learning algorithm on every new example. To scale it up, we implement an approximate version of Alg. 2 that uses online optimization in place of the ERM; more details are in Appendix E. While efficient in practice, this online algorithm may not retain Alg. 2's theoretical guarantees.

[Figure 2: Oracular-EPICAL vs. passive learning on OCR; number of queries to reach a 4% test error as a function of $\beta \in \{0, 10^{-4}, 10^{-3}, 10^{-2}, 10^{-1}, 1\}$, for the two choices of $U$.]

We use epiphany parameters $\beta \in \{1, 10^{-1}, 10^{-2}, 10^{-3}, 10^{-4}, 0\}$, $K = 1$, and $U$ is either the digit "3" or the digit "1". By using $\beta = 1$ and $\beta = 0$, we include the boundary cases where the oracle is perfect or never has an epiphany. The two different $U$'s correspond to two contrasting scenarios: "3" is among the nearest digits to "5" as measured by the binary classification error between "5" and every other single digit, while "1" is the farthest. The two $U$'s are about the same size, each covering roughly 10% of the data. More details and experimental results with other choices of $U$ can be found in Appendix E.
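For readers who want to reproduce the flavor of this setup, here is a minimal sketch of how the "5 vs. rest" task with one unknown-digit region can be wired to the epiphany oracle of Section 2. The data loading via scikit-learn, the variable names, and the simple passive logistic-regression baseline are our own choices under stated assumptions; the actual online approximation of Alg. 2 follows Appendix E and is not reproduced here.

```python
import numpy as np
from sklearn.datasets import fetch_openml
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# MNIST: binary task "5" vs. the rest; unknown region U = all images of the digit "3".
X, digits = fetch_openml("mnist_784", version=1, return_X_y=True, as_frame=False)
X = X / 255.0
digits = digits.astype(int)
y = np.where(digits == 5, 1, -1)
in_U = (digits == 3)

beta = 1e-2
region_known = False

def oracle(i):
    """Epiphany oracle: abstains (None) on U until an epiphany flips U to known.
    Each index is queried at most once below, so every query is unique."""
    global region_known
    if in_U[i] and not region_known:
        if rng.random() < beta:
            region_known = True            # epiphany: "3" becomes labelable
        else:
            return None
    return y[i]

# Passive baseline, also subject to oracle epiphany: query a random stream,
# drop the abstentions, and fit on whatever labels were obtained.
order = rng.permutation(len(y))[:5000]
labeled_X, labeled_y = [], []
for i in order:
    ans = oracle(i)
    if ans is not None:
        labeled_X.append(X[i])
        labeled_y.append(ans)
clf = LogisticRegression(max_iter=200).fit(np.array(labeled_X), np.array(labeled_y))
print("labels obtained:", len(labeled_y), "train acc:", clf.score(np.array(labeled_X), np.array(labeled_y)))
```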
For each combination of $\beta$ and $U$, we perform 100 random trials. In each trial, we run both the online version of Alg. 2 and online passive logistic regression (also subject to oracle epiphany) over a randomly permuted training set of 60,000 examples, and check the error of the online ERM on the 10,000 test examples every 10 queries, from 200 up to our query budget of 13,000. In each trial we record the smallest number of queries needed to achieve a test error of 4%. Fig. 2(a) and Fig. 2(b) show the median of this number over the 100 random trials, with error bars being the 25th and 75th quantiles. The effect of $\beta$ on query complexity is dramatic for the near $U$ = "3" but subdued for the far $U$ = "1". In particular, for $U$ = "3", small $\beta$'s force active learning to query as many labels as passive learning. The flattening at 13,000 means no algorithm could achieve a 4% test error within our query budget. For $U$ = "1", active learning is always much better than passive learning regardless of $\beta$. Again, this illustrates that both $\beta$ and $U$ affect the query complexity. As performance references, passive learning on the entire labeled training data achieves a test error of 2.6%, while predicting the majority class (non-5) has a test error of 8.9%.

6 Discussions

Our analysis reveals a worst-case $O(1/\beta)$ term in query complexity due to the wait for epiphany, and we hypothesize $\Omega(K/\beta)$ to be the tight lower bound. This immediately raises the question: can we decouple active learning queries from epiphany induction? What if the learner can quickly induce epiphany by showing the oracle a screenful of unlabeled items at a time, without the oracle labeling them? This possibility is hinted at in empirical studies; for example, Kulesza et al. [2014] observed epiphanies resulting from merely seeing items. There is then a tradeoff between two learner actions toward the oracle: asking a query (getting a label, or a small contribution toward epiphany), or showing several items (getting no labels, but potentially a large contribution toward epiphany). One must formalize the cost and benefit of this tradeoff. Of course, real human behaviors are even richer. Epiphanies may be reversible on certain queries, where the oracle begins to have doubts about her previous labeling. Extending our model under more relaxed assumptions is an interesting open question for future research.

Acknowledgments

This work is supported in part by NSF grants IIS-0953219, IIS-1623605, DGE-1545481, CCF-1423237, and by the University of Wisconsin-Madison Graduate School with funding from the Wisconsin Alumni Research Foundation.

References

Maria-Florina Balcan, Alina Beygelzimer, and John Langford. Agnostic active learning. In Proceedings of the 23rd International Conference on Machine Learning, pages 65-72. ACM, 2006.

Alina Beygelzimer, John Langford, Tong Zhang, and Daniel J. Hsu. Agnostic active learning without constraints. In Advances in Neural Information Processing Systems, pages 199-207, 2010.

David Cohn, Les Atlas, and Richard Ladner. Improving generalization with active learning. Machine Learning, 15(2):201-221, 1994.
Sanjoy Dasgupta, Claire Monteleoni, and Daniel J. Hsu. A general agnostic active learning algorithm. In Advances in Neural Information Processing Systems, pages 353-360, 2007.

Pinar Donmez and Jaime G. Carbonell. Proactive learning: Cost-sensitive active learning with multiple imperfect oracles. In Proceedings of the 17th ACM Conference on Information and Knowledge Management, pages 619-628. ACM, 2008.

Ran El-Yaniv and Yair Wiener. On the foundations of noise-free selective classification. Journal of Machine Learning Research, 11:1605-1641, 2010.

Steve Hanneke. Theory of disagreement-based active learning. Foundations and Trends in Machine Learning, 7(2-3):131-309, 2014.

Daniel J. Hsu. Algorithms for Active Learning. PhD thesis, University of California, San Diego, 2010.

Tzu-Kuo Huang, Alekh Agarwal, Daniel J. Hsu, John Langford, and Robert E. Schapire. Efficient and parsimonious agnostic active learning. In NIPS, pages 2737-2745, 2015.

Sham M. Kakade and Ambuj Tewari. On the generalization ability of online strongly convex programming algorithms. In Advances in Neural Information Processing Systems 21, 2009.

Nikos Karampatziakis and John Langford. Online importance weight aware updates. In UAI, pages 392-399, 2011.

Todd Kulesza, Saleema Amershi, Rich Caruana, Danyel Fisher, and Denis Xavier Charles. Structured labeling for facilitating concept evolution in machine learning. In CHI, pages 3075-3084, 2014.

Yann Le Cun, Léon Bottou, Yoshua Bengio, and Patrick Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278-2324, 1998.

Edward Newell and Derek Ruths. How one microtask affects another. In Proceedings of the 2016 CHI Conference on Human Factors in Computing Systems, pages 3155-3166, 2016.

Advait Sarkar, Cecily Morrison, Jonas F. Dorn, Rishi Bedi, Saskia Steinheimer, Jacques Boisvert, Jessica Burggraaff, Marcus D'Souza, Peter Kontschieder, Samuel Rota Bulò, et al. Setwise comparison: Consistent, scalable, continuum labels for computer vision. In CHI, 2016.

Nihar Bhadresh Shah and Denny Zhou. Double or nothing: Multiplicative incentive mechanisms for crowdsourcing. In Advances in Neural Information Processing Systems, pages 1-9, 2015.

Chicheng Zhang and Kamalika Chaudhuri. Beyond disagreement-based agnostic active learning. In Advances in Neural Information Processing Systems, pages 442-450, 2014.