# Efficient and Parsimonious Agnostic Active Learning

Tzu-Kuo Huang (Microsoft Research, NYC) tkhuang@microsoft.com
Alekh Agarwal (Microsoft Research, NYC) alekha@microsoft.com
Daniel Hsu (Columbia University) djhsu@cs.columbia.edu
John Langford (Microsoft Research, NYC) jcl@microsoft.com
Robert E. Schapire (Microsoft Research, NYC) schapire@microsoft.com

We develop a new active learning algorithm for the streaming setting satisfying three important properties: 1) it provably works for any classifier representation and classification problem, including those with severe noise; 2) it is efficiently implementable with an ERM oracle; 3) it is more aggressive than all previous approaches satisfying 1 and 2. To do this, we create an algorithm based on a newly defined optimization problem and analyze it. We also conduct the first experimental analysis of all efficient agnostic active learning algorithms, evaluating their strengths and weaknesses in different settings.

1 Introduction

Given a label budget, what is the best way to learn a classifier? Active learning approaches to this question are known to yield exponential improvements over supervised learning under strong assumptions [7]. Under much weaker assumptions, streaming-based agnostic active learning [2, 4, 5, 9, 18] is particularly appealing, since it is known to work for any classifier representation and any label distribution with an i.i.d. data source.¹ Here, a learning algorithm decides for each unlabeled example in sequence whether or not to request a label, never revisiting this decision.

¹ See the monograph of Hanneke [11] for an overview of the existing literature, including alternative settings where additional assumptions are placed on the data source (e.g., separability) [8, 3, 1].

Restated then: What is the best possible active learning algorithm that works for any classifier representation, any label distribution, and is computationally tractable? Computational tractability is a critical concern, because most known algorithms for this setting [e.g., 2, 16, 18] require explicit enumeration of classifiers, implying exponentially worse computational complexity than typical supervised learning algorithms. Active learning algorithms based on empirical risk minimization (ERM) oracles [4, 5, 13] can overcome this intractability by using passive classification algorithms as the oracle to achieve a computationally acceptable solution.

Achieving generality, robustness, and acceptable computation has a cost. For the above methods [4, 5, 13], a label is requested on nearly every unlabeled example where two empirically good classifiers disagree. This results in a poor label complexity, far from the information-theoretic limits [6] even for general robust solutions [18]. Until now.

In Section 3, we design a new algorithm called ACTIVE COVER (AC) for constructing query probability functions that minimize the probability of querying inside the disagreement region (the set of points where good classifiers disagree) and never query otherwise. This requires a new algorithm that maintains a parsimonious cover of the set of empirically good classifiers. The cover is the result of solving an optimization problem (in Section 4) specifying the properties of a desirable query probability function. The cover size provides a practical knob between computation and label complexity, as demonstrated by the complexity analysis we present in Section 4.
Also in Section 3, we prove that AC effectively maintains a set of good classifiers, achieves good generalization error, and has a label complexity bound tighter than previous approaches. The label complexity bound depends on the disagreement coefficient [10], which does not completely capture the advantage of the algorithm. At the end of Section 3, we provide an example of a hard active learning problem where AC is substantially superior to previous tractable approaches. Together, these results show that AC is better, and sometimes substantially better, in theory.

Do agnostic active learning algorithms work in practice? No previous work has addressed this question empirically. Doing so is important because analysis cannot reveal the degree to which existing classification algorithms effectively provide an ERM oracle. We conduct an extensive study in Section 5 by simulating the interaction of the active learning algorithm with a streaming supervised dataset. Results on a wide array of datasets show that agnostic active learning typically outperforms passive learning, and the magnitude of improvement depends on how carefully the active learning hyper-parameters are chosen. More details (theory, proofs and empirical evaluation) are in the long version of this paper [14].

2 Preliminaries

Let P be a distribution over X × {±1}, and let H ⊆ {±1}^X be a set of binary classifiers, which we assume is finite for simplicity.² Let E_X[·] denote expectation with respect to X ∼ P_X, the marginal of P over X. The expected error of a classifier h ∈ H is err(h) := Pr_{(X,Y)∼P}(h(X) ≠ Y), and the error minimizer is denoted by h* := argmin_{h∈H} err(h). The (importance-weighted) empirical error of h ∈ H on a multiset S of importance-weighted labeled examples drawn from X × {±1} × R₊ is

$$\mathrm{err}(h, S) := \frac{1}{|S|}\sum_{(x,y,w) \in S} w\,\mathbf{1}(h(x) \ne y).$$

The disagreement region for a subset of classifiers A ⊆ H is DIS(A) := {x ∈ X | ∃ h, h′ ∈ A such that h(x) ≠ h′(x)}. The regret of a classifier h ∈ H relative to another h′ ∈ H is reg(h, h′) := err(h) − err(h′), and the analogous empirical regret on S is reg(h, h′, S) := err(h, S) − err(h′, S). When the second classifier h′ in the (empirical) regret is omitted, it is taken to be the (empirical) error minimizer in H.

A streaming-based active learner receives i.i.d. labeled examples (X₁, Y₁), (X₂, Y₂), ... from P one at a time; each label Y_i is hidden unless the learner decides on the spot to query it. The goal is to produce a classifier h ∈ H with low error err(h) while querying as few labels as possible. In the IWAL framework [4], the decision whether or not to query a label is made randomly: the learner picks a probability p ∈ [0, 1] and queries the label with that probability. Whenever p > 0, an unbiased error estimate can be produced using inverse probability weighting [12]. Specifically, for any classifier h, an unbiased estimator E of err(h) based on (X, Y) ∼ P and p is as follows: if Y is queried, then E = 1(h(X) ≠ Y)/p; else, E = 0. It is easy to check that E[E] = err(h). Thus, when the label is queried, we produce the importance-weighted labeled example (X, Y, 1/p).³

² The assumption that H is finite can be relaxed to VC-classes using standard arguments.
³ If the label is not queried, we produce an ignored example of weight zero; its only purpose is to maintain the correct count of querying opportunities. This ensures that 1/|S| is the correct normalization in err(h, S).
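To make the estimator above concrete, here is a minimal sketch in Python (our own illustration; `classifier`, `stream`, and `query_prob` are hypothetical stand-ins for whatever representation an implementation uses):

```python
import random

def ipw_error_estimate(classifier, stream, query_prob):
    """Inverse-probability-weighted (Horvitz-Thompson) estimate of err(classifier).

    `stream` yields (x, y) pairs; `query_prob(x)` returns the probability
    p in (0, 1] of querying the label of x.  Each queried example gets
    importance weight 1/p; unqueried examples contribute weight 0 but
    still count toward the normalization |S| (see Footnote 3).
    """
    total, n = 0.0, 0
    for x, y in stream:
        n += 1
        p = query_prob(x)
        if random.random() < p:              # label is queried
            total += (classifier(x) != y) / p
        # else: ignored example of weight zero
    return total / n
```

Averaging this estimate over many runs recovers err(h) regardless of how aggressively `query_prob` avoids querying, which is exactly the property the algorithm below exploits.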
3 Algorithm and Statistical Guarantees

Our new algorithm, shown as Algorithm 1, breaks the example stream into epochs. The algorithm admits any epoch schedule so long as the epoch lengths satisfy τ_{m+1} ≤ 2τ_m. For technical reasons, we always query the first 3 labels to kick-start the algorithm. At the start of epoch m, AC computes a query probability function P_m : X → [0, 1], which will be used for sampling the data points to query during the epoch. This is done by maintaining a few objects of interest during each epoch in Step 4: (1) the best classifier h_{m+1} on the sample Z_m collected so far, where Z_m has a mix of queried and predicted labels; (2) a radius Δ_m, which is based on the level of concentration we want various empirical quantities to satisfy; and (3) the set A_{m+1} consisting of all the classifiers with empirical regret at most γΔ_m on Z_m. Within the epoch, P_m determines the probability of querying an example in the disagreement region for this set A_m of good classifiers; examples outside this region are not queried but given labels predicted by h_m (so error estimates are not unbiased). AC computes P_m by solving the optimization problem (OP), which is further discussed below.

Algorithm 1 ACTIVE COVER (AC)
input: Constants c1, c2, c3, confidence δ, error radius γ, parameters α, β, ξ for (OP), epoch schedule 0 = τ_0 < 3 = τ_1 < τ_2 < ... < τ_M satisfying τ_{m+1} ≤ 2τ_m for m ≥ 1.
initialize: epoch m = 0, Z_0 := ∅, Δ_0 := c1 √ϵ_1 + c2 ϵ_1 log 3, where ϵ_m := 32 log(|H|τ_m/δ)/τ_m.
1: Query the labels {Y_i} of the first three unlabeled examples {X_i}, and set A_1 := H, P_1 ≡ P_min,1 = 1, and S = {(X_j, Y_j, 1)}_{j=1}^3.
2: for i = 4, ..., n do
3:   if i = τ_m + 1 then
4:     Set Z_m = Z_{m−1} ∪ S and S = ∅. Let
         h_{m+1} := argmin_{h∈H} err(h, Z_m),
         Δ_m := c1 √(ϵ_m err(h_{m+1}, Z_m)) + c2 ϵ_m log τ_m,
         A_{m+1} := {h ∈ H | err(h, Z_m) − err(h_{m+1}, Z_m) ≤ γΔ_m}.
5:     Compute the solution P_{m+1}(·) to the problem (OP) and increment m := m + 1.
6:   end if
7:   if the next unlabeled point X_i ∈ D_m := DIS(A_m) then
8:     Toss a coin with bias P_m(X_i); add the example (X_i, Y_i, 1/P_m(X_i)) to S if the outcome is heads, otherwise add (X_i, 1, 0) to S (see Footnote 3).
9:   else
10:    Add the example with predicted label (X_i, h_m(X_i), 1) to S.
11:  end if
12: end for
13: return h_{M+1} := argmin_{h∈H} err(h, Z_M).

The objective function of (OP) encourages small query probabilities in order to minimize the label complexity. The constraints (1) in (OP) bound the variance of our importance-weighted regret estimates for every h ∈ H. This is key to ensuring good generalization, as we will later use Bernstein-style bounds, which rely on our random variables having small variance. More specifically, the LHS of the constraint measures the variance of our empirical regret estimate for h, measured only on the examples in the disagreement region D_m. This is because importance weights of the form 1/P_m(X) are applied only to these examples; outside this region we use the predicted labels with an importance weight of 1. The RHS of the constraint consists of three terms. The first term ensures the feasibility of the problem, as P(X) ≥ 1/(2α²) for X ∈ D_m will always satisfy the constraints. The second, empirical-regret term makes the constraints easy to satisfy for bad hypotheses; this is crucial to rule out large label complexities in case there are bad hypotheses that disagree very often with h_m. A benefit of this is easily seen when −h_m ∈ H: this classifier might have a terrible regret, but would force a near-constant query probability on the disagreement region if β = 0. Finally, the third term is of the same order as the second one for hypotheses in A_m, and is included only to capture the allowed level of slack in our constraints, which will be exploited for the efficient implementation in Section 4.
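In code, checking the variance constraint for a single classifier h amounts to comparing the importance-weighted disagreement rate against the bound b_m(h) just discussed; below is a sketch under our own notation (all function and variable names are ours), with the formal statement of (OP) following right after:

```python
import numpy as np

def constraint_gap(h, h_m, in_dis, P, xs,
                   alpha, beta, gamma, xi, emp_regret, tau_prev, delta_prev):
    """LHS minus RHS of constraint (1) for classifier h; > 0 means violated.

    xs:         unlabeled sample standing in for the marginal P_X
    in_dis(x):  membership test for the disagreement region D_m
    P(x):       candidate query probability function
    emp_regret: reg(h, h_m, Z_{m-1}), h's empirical regret (>= 0 since h_m is the ERM)
    """
    I = np.array([float(h(x) != h_m(x) and in_dis(x)) for x in xs])
    p = np.array([max(P(x), 1e-12) for x in xs])
    lhs = np.mean(I / p)                                    # E_X[I_h^m(X)/P(X)]
    rhs = (2 * alpha**2 * np.mean(I)                        # feasibility term
           + 2 * beta**2 * gamma * emp_regret * tau_prev * delta_prev
           + xi * tau_prev * delta_prev**2)                 # slack term
    return lhs - rhs
```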
In addition to controlled variance, good concentration also requires the random variables of interest to be appropriately bounded. This is ensured through the constraints (2), which impose a minimum query probability on the disagreement region. Outside the disagreement region, we use the predicted label with an importance weight of 1, so that our estimates are always bounded (albeit biased) in this region. Note that this optimization problem is written with respect to the marginal distribution P_X of the data points, meaning that we might have infinitely many of the latter constraints. In Section 4, we describe how to solve this optimization problem efficiently, using access only to unlabeled examples drawn from P_X.

Algorithm 1 requires several input parameters, which must satisfy:

$$\alpha \ge 1, \quad \xi \ge \frac{1}{8n\epsilon_M \log n}, \quad \beta^2 \ge \frac{1}{\gamma\, n\epsilon_M \log n}, \quad \gamma \ge 2^{16}, \quad c_1 \ge 2\alpha\sqrt{6}, \quad c_2 \ge 2^{16} c_1^2, \quad c_3 \ge 1.$$

The first three parameters, α, β and ξ, control the tightness of the variance constraints (1). The next three parameters, γ, c1 and c2, control the threshold that defines the set of empirically good classifiers; c3 is used in the minimum probability (4) and can simply be set to 1.

Optimization Problem (OP) to compute P_m:

$$\min_{P}\;\; \mathbb{E}_X\left[\frac{1}{1 - P(X)}\right]$$

$$\text{s.t.}\quad \forall h \in H:\quad \mathbb{E}_X\left[\frac{I_h^m(X)}{P(X)}\right] \le b_m(h), \tag{1}$$

$$\forall x \in X:\; 0 \le P(x) \le 1, \quad\text{and}\quad \forall x \in D_m:\; P(x) \ge P_{\min,m}, \tag{2}$$

where

$$I_h^m(x) = \mathbf{1}(h(x) \ne h_m(x) \wedge x \in D_m), \qquad b_m(h) = 2\alpha^2\,\mathbb{E}_X[I_h^m(X)] + 2\beta^2\gamma\,\mathrm{reg}(h, h_m, Z_{m-1})\,\tau_{m-1}\Delta_{m-1} + \xi\tau_{m-1}\Delta_{m-1}^2, \tag{3}$$

$$P_{\min,m} = \min\left\{\frac{1}{c_3\left(\sqrt{\tau_{m-1}\,\mathrm{err}(h_m, Z_{m-1})/(n\epsilon_M)} + \log\tau_{m-1}\right)},\; 1\right\}. \tag{4}$$

Epoch Schedules: The algorithm takes an arbitrary epoch schedule subject to τ_m < τ_{m+1} ≤ 2τ_m. Two natural extremes are unit-length epochs, τ_m = m, and doubling epochs, τ_{m+1} = 2τ_m. The main difference lies in the number of times (OP) is solved, which is a substantial computational consideration. Unless otherwise stated, we assume the doubling epoch schedule, under which the query probability and ERM classifier are recomputed only O(log n) times.
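To fix intuition for these quantities, here is a small sketch (our own, with the constants left as arguments and using our reconstruction of the minimum probability (4) above) of the doubling epoch schedule, the radius Δ_m, and P_min,m:

```python
import math

def doubling_epochs(n):
    """Epoch boundaries 3, 6, 12, ..., satisfying tau_m < tau_{m+1} <= 2*tau_m."""
    taus, t = [3], 3
    while 2 * t <= n:
        t *= 2
        taus.append(t)
    return taus

def radius(tau_m, emp_err, H_size, delta, c1, c2):
    """Delta_m = c1*sqrt(eps_m * err(h_{m+1}, Z_m)) + c2*eps_m*log(tau_m)."""
    eps_m = 32.0 * math.log(H_size * tau_m / delta) / tau_m
    return c1 * math.sqrt(eps_m * emp_err) + c2 * eps_m * math.log(tau_m)

def p_min(tau_prev, emp_err_hm, n, eps_M, c3=1.0):
    """Minimum query probability on the disagreement region, per our eq. (4)."""
    denom = c3 * (math.sqrt(tau_prev * emp_err_hm / (n * eps_M))
                  + math.log(tau_prev))
    return min(1.0 / denom, 1.0)
```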
Generalization and Label Complexity. We present guarantees on the generalization error and label complexity of Algorithm 1, assuming a solver for (OP), which we provide in the next section. Our first theorem provides a bound on the generalization error. Define

$$\mathrm{err}_m(h) := \frac{1}{\tau_m}\sum_{j=1}^{m}(\tau_j - \tau_{j-1})\,\mathbb{E}_{(X,Y)\sim P}\left[\mathbf{1}(h(X) \ne Y \wedge X \in \mathrm{DIS}(A_j))\right],$$

with Δ̃_0 := 0 and Δ̃_m := c1 √(ϵ_m err_m(h*)) + c2 ϵ_m log τ_m for m ≥ 1. Essentially, Δ̃_m is a population counterpart of the quantity Δ_m used in Algorithm 1, and it crucially relies on err_m(h*), the true error of h* restricted to the disagreement region at epoch m. This quantity captures the inherent noisiness of the problem, and modulates the transition between O(1/√n)- and O(1/n)-type error bounds, as we see next.

Theorem 1. Pick any 0 < δ < 1/e such that |H|/δ > 192. Then, recalling that h* = argmin_{h∈H} err(h), we have for all epochs m = 1, 2, ..., M, with probability at least 1 − δ:

$$\mathrm{reg}(h, h^*) \le 16\gamma\tilde\Delta_m \;\text{ for all } h \in A_{m+1}, \text{ and} \tag{5}$$

$$\mathrm{reg}(h^*, h_{m+1}, Z_m) \le 2^{16}\Delta_m. \tag{6}$$

The proof is in Section 7.2.2 of [14]. Since we use γ ≥ 2^16, the bound (6) implies that h* ∈ A_m for all epochs m. This also means that all the predicted labels used by our algorithm are identical to those of h*, since no disagreement amongst classifiers in A_m was observed on those examples. This observation will be critical to our proofs, where we will exploit the fact that using labels predicted by h* instead of observed labels on certain examples only introduces a bias in favor of h*, thereby ensuring that we never mistakenly drop the optimal classifier from A_m.

The bound (5) shows that every classifier in A_{m+1} has a small regret relative to h*. Since the ERM classifier h_{m+1} is always in A_{m+1}, this yields our main generalization error bound on the classifier h_{M+1} output by Algorithm 1. Additionally, it clarifies the definition of the sets A_m as sets of good classifiers: these are classifiers which indeed have small population regret relative to h*. In a realizable setting where h* has zero error, Δ̃_m = Õ(1/τ_m), leading to an Õ(1/n) regret after n unlabeled examples are presented to the algorithm. At the other extreme, if err_m(h*) is a constant, then the regret is Õ(1/√n). There are also interesting regimes in between, where err(h*) might be a constant but err_m(h*), measured over the disagreement region, decreases rapidly. More specifically, we show in Appendix E of [14] that the expected regret of the classifier returned by Algorithm 1 achieves the optimal rate [6] under the Tsybakov [17] noise condition.

Next, we provide a label complexity guarantee in terms of the disagreement coefficient [11]:

$$\theta = \theta(h^*) := \sup_{r > 0} \frac{1}{r}\,\Pr_X\left\{x \;\middle|\; \exists h \in H \text{ s.t. } h(x) \ne h^*(x) \text{ and } \Pr_X\{x' \mid h(x') \ne h^*(x')\} \le r\right\}.$$

Theorem 2. With probability at least 1 − δ, the number of label queries made by Algorithm 1 after n examples over M epochs is at most

$$4\theta\,\mathrm{err}_M(h^*)\,n + \theta\cdot O\!\left(\sqrt{n\,\mathrm{err}_M(h^*)\log(|H|/\delta)} + \log(|H|/\delta)\right).$$

The theorem is proved in Appendix D of [14]. The first term of the label complexity bound is linear in the number of unlabeled examples, but can be quite small if θ is small or if err_M(h*) ≈ 0; it is indeed 0 in the realizable setting. The second term grows at most as O(√n), but also becomes a constant for realizable problems. Consequently, we attain a logarithmic label complexity in the realizable setting. In noisy settings, our label complexity improves upon that of predecessors such as [5, 13]. Beygelzimer et al. [5] obtain a label complexity of θ√n, which is exponentially worse for realizable problems. A related algorithm, Oracular CAL [13], has a label complexity scaling with √(n err(h*)) but a worse dependence on θ. In all comparisons, the use of err_M(h*) provides a qualitatively superior analysis to all previous results depending on err(h*), since it captures the fact that noisy labels outside the disagreement region do not affect the label complexity. Finally, as in our regret analysis, we show in Appendix E of [14] that the label complexity of Algorithm 1 achieves the information-theoretic lower bound [6] under Tsybakov's low-noise condition [17].

Section 4.2.2 of [14] gives an example where the label complexity of Algorithm 1 is significantly smaller than that of both IWAL and Oracular CAL, by virtue of rarely querying in the disagreement region. The example considers a distribution and a classifier space with the following structure: (i) for most examples, a single good classifier predicts differently from the remaining classifiers; (ii) on a few examples, half the classifiers predict one way and half the other. In the first case, little advantage is gained from a label because it provides evidence against only a single classifier. ACTIVE COVER queries the disagreement region with a probability close to P_min in case (i) and probability 1 in case (ii), while the other methods query with probability Ω(1) everywhere, implying O(√n) times more queries.
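Since the disagreement coefficient governs both bounds above, it may help to see how θ(h*) could be estimated empirically for a small, finite classifier pool; a Monte Carlo sketch (our own construction, not part of the paper's method):

```python
import numpy as np

def theta_estimate(h_star, pool, xs, radii):
    """Monte Carlo estimate of the disagreement coefficient theta(h_star).

    Implements sup_r Pr_X[x in DIS(B(h_star, r))] / r over a grid of radii,
    where B(h_star, r) = {h in pool : Pr_X[h != h_star] <= r} and the
    sample xs stands in for P_X.
    """
    preds = np.array([[h(x) for h in pool] for x in xs])    # (n, |pool|)
    star = np.array([h_star(x) for x in xs])[:, None]       # (n, 1)
    dist = (preds != star).mean(axis=0)                     # Pr_X[h != h_star]
    best = 0.0
    for r in radii:
        ball = dist <= r                                    # classifiers within radius r
        dis_mass = (preds[:, ball] != star).any(axis=1).mean()
        best = max(best, dis_mass / r)
    return best
```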
4 Efficient implementation

The computation of h_m is an ERM operation, which can be performed efficiently whenever an efficient passive learner is available. However, several other hurdles remain. Testing whether x ∈ DIS(A_m) in the algorithm, as well as finding a solution to (OP), are considerably more challenging. The epoch schedule helps, but (OP) is still solved O(log n) times, necessitating an extremely efficient solver.

Starting with the first issue, we follow Dasgupta et al. [9], who cleverly observed that x ∈ D_m := DIS(A_m) can be determined efficiently using a single call to an ERM oracle. Specifically, to apply their method, we use the oracle to find⁴

$$h' = \arg\min\{\mathrm{err}(h, Z_{m-1}) \mid h \in H,\; h(x) \ne h_m(x)\}.$$

It can then be argued that x ∈ D_m = DIS(A_m) if and only if the easily measured regret of h′ (that is, reg(h′, h_m, Z_{m−1})) is at most γΔ_{m−1}.

⁴ See Appendix F of [15] for how to deal with one constraint with an unconstrained oracle.

Solving (OP) efficiently is a much bigger challenge because the problem is enormous: there is one variable P(x) for every point x ∈ X, one constraint (1) for each classifier h, and bound constraints (2) on P(x) for every x. This leads to infinitely many variables and constraints, with an ERM oracle as the only computational primitive available.

We eliminate the bound constraints using barrier functions. Notice that the objective E_X[1/(1 − P(X))] is already a barrier at P(x) = 1. To enforce the lower bound (2), we modify the objective to

$$\mathbb{E}_X\left[\frac{1}{1 - P(X)} + \frac{\mu^2\,\mathbf{1}(X \in D_m)}{P(X)}\right], \tag{7}$$

where μ is a parameter chosen momentarily to ensure P(x) ≥ P_min,m for all x ∈ D_m. Thus, the modified goal is to minimize (7) over non-negative P subject only to (1). We solve the problem in the dual, where we have a large but finite number of optimization variables, and we efficiently maximize the dual using coordinate ascent with access to an ERM oracle over H. Let λ_h ≥ 0 denote the Lagrange multiplier for the constraint (1) corresponding to classifier h. Then, for any λ, we can minimize the Lagrangian over each primal variable P(x), yielding the solution

$$P_\lambda(x) = \mathbf{1}(x \in D_m)\,\frac{q_\lambda(x)}{1 + q_\lambda(x)}, \quad\text{where}\quad q_\lambda(x) = \sqrt{\mu^2 + \sum_{h \in H} \lambda_h I_h^m(x)} \tag{8}$$

and I_h^m(x) = 1(h(x) ≠ h_m(x) ∧ x ∈ D_m). Clearly, μ/(1 + μ) ≤ P_λ(x) ≤ 1 for all x ∈ D_m, so all the bound constraints (2) in (OP) are satisfied if we choose μ = 2P_min,m. Plugging the solution P_λ into the Lagrangian, we obtain the dual problem of maximizing the dual objective

$$D(\lambda) = \mathbb{E}_X\left[\mathbf{1}(X \in D_m)(1 + q_\lambda(X))^2\right] - \sum_{h \in H} \lambda_h b_m(h) + C_0 \tag{9}$$

over λ ≥ 0. The constant C_0 is equal to 1 − Pr(D_m), where Pr(D_m) := Pr(X ∈ D_m).

Algorithm 2 Coordinate ascent algorithm to solve (OP)
input: accuracy parameter ε > 0.
initialize: λ ← 0.
1: loop
2:   Rescale: λ ← s*·λ, where s* = argmax_{s∈[0,1]} D(s·λ).
3:   Find h̃ = argmax_{h∈H} ( E_X[I_h^m(X)/P_λ(X)] − b_m(h) ).
4:   if E_X[I_{h̃}^m(X)/P_λ(X)] ≤ b_m(h̃) + ε then
5:     return λ
6:   else
7:     Update λ_{h̃} ← λ_{h̃} + 2( E_X[I_{h̃}^m(X)/P_λ(X)] − b_m(h̃) ) / E_X[I_{h̃}^m(X)/q_λ(X)³].
8:   end if
9: end loop

An algorithm to approximately solve this problem is presented as Algorithm 2. It takes a parameter ε > 0 specifying the degree to which all of the constraints (1) are to be approximated. Since D is concave, the rescaling step can be solved using a straightforward numerical line search. The main implementation challenge lies in finding the most violated constraint (Step 3). Fortunately, this step can be reduced to a single call to an ERM oracle.
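For intuition, here is a compact sketch of Algorithm 2 over an empirical sample, with the ERM oracle of Step 3 replaced by explicit enumeration over a finite pool (fine for illustration, but intractable for the large classifier sets the oracle reduction targets) and the rescaling line search of Step 2 omitted; the array layout and precomputed inputs are our own assumptions:

```python
import numpy as np

def solve_op_dual(I, b, mu, eps, max_iters=10000):
    """Coordinate ascent on the dual (9) of (OP), over an empirical sample.

    I:  (n, K) 0/1 array with I[i, k] = I_{h_k}^m(x_i) for a sample x_1..x_n ~ P_X
    b:  (K,) array of bounds b_m(h_k) from (3)
    mu: barrier parameter (mu = 2 * P_min,m in the text)
    Returns the dual variables and the query probabilities P_lambda(x_i).
    """
    n, K = I.shape
    lam = np.zeros(K)
    for _ in range(max_iters):
        q = np.sqrt(mu**2 + I @ lam)               # q_lambda(x), eq. (8)
        P = q / (1.0 + q)                          # P_lambda(x) on D_m
        viol = (I / P[:, None]).mean(axis=0) - b   # E[I_h^m/P_lambda] - b_m(h)
        k = int(np.argmax(viol))                   # most violated constraint (Step 3)
        if viol[k] <= eps:                         # Step 4: all met up to eps
            break
        # Step 7: Newton-style coordinate update on lambda_k
        lam[k] += 2.0 * viol[k] / np.mean(I[:, k] / q**3)
    q = np.sqrt(mu**2 + I @ lam)
    return lam, q / (1.0 + q)
```

Note how the closed form (8) means only the dual variables with non-zero entries (the "cover" of classifiers) ever need to be stored to represent the query probability function.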
To see why a single oracle call suffices, note that the constraint violation of classifier h can be written as

$$\mathbb{E}_X\left[\left(\frac{1}{P(X)} - 2\alpha^2\right) I_h^m(X)\right] - 2\beta^2\gamma\tau_{m-1}\Delta_{m-1}\,\mathrm{err}(h, Z_{m-1}) + 2\beta^2\gamma\tau_{m-1}\Delta_{m-1}\,\mathrm{err}(h_m, Z_{m-1}) - \xi\tau_{m-1}\Delta_{m-1}^2.$$

The second term of this expression is simply the scaled risk (classification error) of h with respect to the actual labels. The first term is the risk of h in predicting samples which have been labeled according to h_m, with importance weights of 1/P(x) − 2α² if x ∈ D_m and 0 otherwise; note that these weights may be positive or negative. The last two terms do not depend on h. Thus, given access to P_X (or samples approximating it, as discussed shortly), the most violated constraint can be found by solving an ERM problem defined on the labeled samples in Z_{m−1} and on samples drawn from P_X labeled by h_m, with appropriate importance weights detailed in Appendix F.1 of [14]. When all primal constraints are approximately satisfied, the algorithm stops. We have the following guarantee on the convergence of the algorithm.

Theorem 3. When run on the m-th epoch, Algorithm 2 halts in at most Pr(D_m)/(8P³_min,m ε²) iterations and outputs a solution λ̂ ≥ 0 such that P_λ̂ satisfies the simple bound constraints in (2) exactly, the variance constraints in (1) up to an additive factor of ε, and

$$\mathbb{E}_X\left[\frac{1}{1 - P_{\hat\lambda}(X)}\right] \le \mathbb{E}_X\left[\frac{1}{1 - P^*(X)}\right] + 4P_{\min,m}\Pr(D_m), \tag{10}$$

where P* is the solution to (OP). Furthermore, ‖λ̂‖₁ ≤ Pr(D_m)/ε.

If ε is set to ξτ_{m−1}Δ²_{m−1}/2, an amount of constraint violation tolerable in our analysis, the number of iterations (and hence the number of ERM oracle calls) in Theorem 3 is at most O(τ²_{m−1}). The proof is in Appendix F.2 of [14].

Solving (OP) with expectation over samples: So far we have considered solving (OP) defined on the unlabeled data distribution P_X, which is unavailable in practice. A natural substitute for P_X is an i.i.d. sample drawn from it. In Appendix F.3 of [14] we show that solving a properly defined sample variant of (OP) leads to a solution to the original (OP) with guarantees similar to those in Theorem 3.

5 Experiments with Agnostic Active Learning

While AC is efficient in the number of ERM oracle calls, it needs to store all past examples, resulting in a large space complexity. As Theorem 3 suggests, the query probability function (8) may need as many as O(τ_i²) classifiers, further increasing the storage demand. Aiming at a scalable implementation, we consider an online approximation of AC, given in Section 6.1 of [14]. The main differences from AC are: (1) instead of a batch ERM oracle, it invokes an online oracle; and (2) instead of repeatedly solving (OP) from scratch, it maintains a fixed-size set of classifiers (and hence non-zero dual variables), called the cover, for representing the query probability, and it updates the cover with every new example in a manner similar to the coordinate ascent algorithm for solving (OP).

We conduct an empirical comparison of the following efficient agnostic active learning algorithms:

- OAC: the online approximation of ACTIVE COVER (Algorithm 3 in Section 6.1 of [14]).
- IWAL0 and IWAL1: the algorithm of [5] and a variant that uses a tighter threshold.
- ORA-OAC, ORA-IWAL0, and ORA-IWAL1: Oracular-CAL [13] versions of OAC, IWAL0 and IWAL1.
- PASSIVE: passive learning on a labeled sub-sample drawn uniformly at random.

Details about these algorithms are in Section 6.2 of [14].
The high-level differences among these algorithms are best explained in the context of the disagreement region: OAC does importance-weighted querying of labels with an optimized query probability in the disagreement region, while using predicted labels outside; IWAL0 and IWAL1 maintain a non-zero minimum query probability everywhere; ORA-OAC, ORA-IWAL0 and ORA-IWAL1 query labels in their respective disagreement regions with probability 1, and use predicted labels otherwise.

We implemented these algorithms in Vowpal Wabbit (http://hunch.net/~vw/), a fast learning system based on online convex optimization, using logistic regression as the ERM oracle. We performed experiments on 22 binary classification datasets with varying sizes (10^3 to 10^6 examples) and diverse feature characteristics. Details about the datasets are in Appendix G.1 of [14].

Our goal is to evaluate the improvement in test error per label query achieved by the different algorithms. To simulate the streaming setting, we randomly permuted each dataset, ran the active learning algorithms through the first 80% of the data, and evaluated the learned classifiers on the remaining 20%. We repeated this process 9 times to reduce the variance due to random permutation. For each active learning algorithm, we obtained the test error rates of classifiers trained at doubling numbers of label queries, starting from 10 up to 10240. Formally, let error_{a,p}(d, j, q) denote the test error of the classifier returned by algorithm a using hyper-parameter setting p on the j-th permutation of dataset d, immediately after hitting the q-th label budget 10·2^{q−1}, 1 ≤ q ≤ 11. Let query_{a,p}(d, j, q) be the actual number of label queries made, which can be smaller than 10·2^{q−1} when algorithm a reaches the end of the training data before hitting that label budget. To evaluate an algorithm, we consider the area under its curve of test error against the logarithm of the number of label queries:

$$\mathrm{AUC}_{a,p}(d, j) = \sum_{q=1}^{10} \frac{\mathrm{error}_{a,p}(d, j, q+1) + \mathrm{error}_{a,p}(d, j, q)}{2}\,\log_2\!\left(\frac{\mathrm{query}_{a,p}(d, j, q+1)}{\mathrm{query}_{a,p}(d, j, q)}\right).$$

A good active learning algorithm has a small AUC value, which indicates that the test error decreases quickly as the number of label queries increases. We use a logarithmic scale for the number of label queries to focus on the performance with few label queries, where active learning is the most relevant. More details about hyper-parameters are in Appendix G.2 of [14].

[Figure 1: Average relative improvement in test error vs. number of label queries, for OAC, IWAL0, IWAL1, ORA-OAC, ORA-IWAL0, ORA-IWAL1 and PASSIVE against the PASSIVE baseline; one panel uses the best fixed hyper-parameter setting and the other the best hyper-parameter setting per dataset.]

We measure the performance of each algorithm a by the following two aggregated metrics:

$$\mathrm{AUC\text{-}GAIN}^*(a) := \underset{d}{\mathrm{mean}}\;\max_p\;\underset{1\le j\le 9}{\mathrm{median}}\;\frac{\mathrm{AUC}_{\mathrm{base}}(d, j) - \mathrm{AUC}_{a,p}(d, j)}{\mathrm{AUC}_{\mathrm{base}}(d, j)},$$

$$\mathrm{AUC\text{-}GAIN}(a) := \max_p\;\underset{d}{\mathrm{mean}}\;\underset{1\le j\le 9}{\mathrm{median}}\;\frac{\mathrm{AUC}_{\mathrm{base}}(d, j) - \mathrm{AUC}_{a,p}(d, j)}{\mathrm{AUC}_{\mathrm{base}}(d, j)},$$

where AUC_base denotes the AUC of PASSIVE using a default hyper-parameter setting, i.e., a learning rate of 0.4 (see Appendix G.2 of [14]). The first metric, AUC-GAIN*, shows the maximal gain each algorithm achieves with the best hyper-parameter setting for each dataset, while the second shows the gain from using the single hyper-parameter setting that performs best on average across datasets.
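As a concrete reference, here is a short sketch (our own harness code, not from the paper) of the AUC and AUC-GAIN computations; `auc_algo` and `auc_base` are hypothetical containers for per-permutation AUC values:

```python
import math
import statistics

def auc(errors, queries):
    """Trapezoidal area under the test-error vs. log2(#queries) curve.

    errors[q] and queries[q] are the test error and actual number of label
    queries at the q-th budget (budgets 10 * 2^(q-1), q = 1..11).
    """
    return sum((errors[q + 1] + errors[q]) / 2.0
               * math.log2(queries[q + 1] / queries[q])
               for q in range(len(errors) - 1))

def auc_gain_fixed(auc_algo, auc_base):
    """AUC-GAIN with the best single hyper-parameter across datasets.

    auc_algo[p][d] and auc_base[d] are lists of AUCs over the 9 random
    permutations of dataset d (auc_base uses the PASSIVE default).
    """
    def mean_gain(p):
        gains = [statistics.median((b - a) / b for a, b in
                                   zip(auc_algo[p][d], auc_base[d]))
                 for d in auc_algo[p]]
        return sum(gains) / len(gains)
    return max(mean_gain(p) for p in auc_algo)
```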
Results and Discussions. Table 1 gives a summary of the performance of the different algorithms.

Table 1: Summary of performance metrics

|            | OAC   | IWAL0 | IWAL1 | ORA-OAC | ORA-IWAL0 | ORA-IWAL1 | PASSIVE |
|------------|-------|-------|-------|---------|-----------|-----------|---------|
| AUC-GAIN*  | 0.151 | 0.150 | 0.142 | 0.125   | 0.115     | 0.121     | 0.095   |
| AUC-GAIN   | 0.065 | 0.085 | 0.081 | 0.078   | 0.073     | 0.075     | 0.072   |

When using hyper-parameters optimized on a per-dataset basis (top row of Table 1), OAC achieves the largest improvement over the PASSIVE baseline, with IWAL0 achieving almost the same improvement and IWAL1 improving slightly less. The Oracular-CAL variants perform worse, but still do better than PASSIVE with the best learning rate for each dataset, which yields an average improvement of 9.5% in AUC over the default learning rate. When using the best fixed hyper-parameter setting across all datasets (bottom row of Table 1), all active learning algorithms achieve less improvement compared with PASSIVE (7% improvement with the best fixed learning rate). In particular, OAC gets only a 6.5% improvement. This suggests that careful tuning of hyper-parameters is critical for OAC and an important direction for future work.

Figure 1(a) describes the behaviors of the different algorithms in more detail. For each algorithm a, we identify the best fixed hyper-parameter setting

$$p^* := \arg\max_p\;\underset{d}{\mathrm{mean}}\;\underset{1\le j\le 9}{\mathrm{median}}\;\frac{\mathrm{AUC}_{\mathrm{base}}(d, j) - \mathrm{AUC}_{a,p}(d, j)}{\mathrm{AUC}_{\mathrm{base}}(d, j)}$$

and plot the relative test error improvement of a using p*, averaged across all datasets at the 11 label budgets 10·2^{q−1}:

$$\underset{d}{\mathrm{mean}}\;\underset{1\le j\le 9}{\mathrm{median}}\;\frac{\mathrm{error}_{\mathrm{base}}(d, j, q) - \mathrm{error}_{a,p^*}(d, j, q)}{\mathrm{error}_{\mathrm{base}}(d, j, q)}.$$

All algorithms, including PASSIVE, perform similarly during the first few hundred label queries. IWAL0 performs the best at label budgets larger than 80, with IWAL1 doing almost as well. ORA-OAC is the next best, followed by ORA-IWAL1 and ORA-IWAL0. OAC performs worse than PASSIVE except at label budgets between 320 and 1280.

In Figure 1(b), we plot the results obtained by each algorithm a using the best hyper-parameter setting for each dataset d:

$$p^*_d := \arg\max_p\;\underset{1\le j\le 9}{\mathrm{median}}\;\frac{\mathrm{AUC}_{\mathrm{base}}(d, j) - \mathrm{AUC}_{a,p}(d, j)}{\mathrm{AUC}_{\mathrm{base}}(d, j)}.$$

As expected, all algorithms perform better, but OAC benefits the most from using the best hyper-parameter setting per dataset. Appendix G.3 of [14] gives more detailed results, including test error rates obtained by all algorithms at different label query budgets for individual datasets. In sum, when using the best fixed hyper-parameter setting, IWAL0 outperforms the other algorithms. When using the best hyper-parameter setting tuned for each dataset, OAC and IWAL0 perform equally well and better than the other algorithms.

References

[1] Maria-Florina Balcan and Phil Long. Active and passive learning of linear separators under log-concave distributions. In Conference on Learning Theory, pages 288-316, 2013.
[2] Maria-Florina Balcan, Alina Beygelzimer, and John Langford. Agnostic active learning. In Proceedings of the 23rd International Conference on Machine Learning, pages 65-72. ACM, 2006.
[3] Maria-Florina Balcan, Andrei Broder, and Tong Zhang. Margin based active learning. In Proceedings of the 20th Annual Conference on Learning Theory, pages 35-50. Springer-Verlag, 2007.
[4] A. Beygelzimer, S. Dasgupta, and J. Langford. Importance weighted active learning. In ICML, 2009.
[5] A. Beygelzimer, D. Hsu, J. Langford, and T. Zhang. Agnostic active learning without constraints. In NIPS, 2010.
[6] R. M. Castro and R. D. Nowak. Minimax bounds for active learning. IEEE Transactions on Information Theory, 54(5):2339-2353, 2008.
[7] D. Cohn, L. Atlas, and R. Ladner. Improving generalization with active learning. Machine Learning, 15:201-221, 1994.
[8] S. Dasgupta. Coarse sample complexity bounds for active learning. In Advances in Neural Information Processing Systems 18, 2005.
[9] S. Dasgupta, D. Hsu, and C. Monteleoni. A general agnostic active learning algorithm. In NIPS, 2007.
[10] S. Hanneke. Theoretical Foundations of Active Learning. PhD thesis, Carnegie Mellon University, 2009.
[11] Steve Hanneke. Theory of disagreement-based active learning. Foundations and Trends in Machine Learning, 7(2-3):131-309, 2014.
[12] D. G. Horvitz and D. J. Thompson. A generalization of sampling without replacement from a finite universe. Journal of the American Statistical Association, 47:663-685, 1952.
[13] Daniel J. Hsu. Algorithms for Active Learning. PhD thesis, University of California at San Diego, 2010.
[14] Tzu-Kuo Huang, Alekh Agarwal, Daniel J. Hsu, John Langford, and Robert E. Schapire. Efficient and parsimonious agnostic active learning. arXiv preprint arXiv:1506.08669, 2015.
[15] Nikos Karampatziakis and John Langford. Online importance weight aware updates. In Proceedings of the Twenty-Seventh Conference on Uncertainty in Artificial Intelligence (UAI), pages 392-399, 2011.
[16] Vladimir Koltchinskii. Rademacher complexities and bounding the excess risk in active learning. Journal of Machine Learning Research, 11:2457-2485, 2010.
[17] A. B. Tsybakov. Optimal aggregation of classifiers in statistical learning. Annals of Statistics, 32:135-166, 2004.
[18] Chicheng Zhang and Kamalika Chaudhuri. Beyond disagreement-based agnostic active learning. In Advances in Neural Information Processing Systems, pages 442-450, 2014.