# Efficient and Parsimonious Agnostic Active Learning

Tzu-Kuo Huang (Microsoft Research, NYC) tkhuang@microsoft.com
Alekh Agarwal (Microsoft Research, NYC) alekha@microsoft.com
Daniel Hsu (Columbia University) djhsu@cs.columbia.edu
John Langford (Microsoft Research, NYC) jcl@microsoft.com
Robert E. Schapire (Microsoft Research, NYC) schapire@microsoft.com

We develop a new active learning algorithm for the streaming setting satisfying three important properties: 1) it provably works for any classifier representation and classification problem, including those with severe noise; 2) it is efficiently implementable with an ERM oracle; 3) it is more aggressive than all previous approaches satisfying 1 and 2. To do this, we create an algorithm based on a newly defined optimization problem and analyze it. We also conduct the first experimental analysis of all efficient agnostic active learning algorithms, evaluating their strengths and weaknesses in different settings.

1 Introduction

Given a label budget, what is the best way to learn a classifier? Active learning approaches to this question are known to yield exponential improvements over supervised learning under strong assumptions [7]. Under much weaker assumptions, streaming-based agnostic active learning [2, 4, 5, 9, 18] is particularly appealing, since it is known to work for any classifier representation and any label distribution with an i.i.d. data source.¹ Here, a learning algorithm decides for each unlabeled example in sequence whether or not to request a label, never revisiting this decision.

¹ See the monograph of Hanneke [11] for an overview of the existing literature, including alternative settings where additional assumptions are placed on the data source (e.g., separability) [8, 3, 1].

Restated then: What is the best possible active learning algorithm that works for any classifier representation, any label distribution, and is computationally tractable? Computational tractability is a critical concern, because most known algorithms for this setting [e.g., 2, 16, 18] require explicit enumeration of classifiers, implying exponentially worse computational complexity than typical supervised learning algorithms. Active learning algorithms based on empirical risk minimization (ERM) oracles [4, 5, 13] can overcome this intractability by using passive classification algorithms as the oracle to achieve a computationally acceptable solution.

Achieving generality, robustness, and acceptable computation has a cost. For the above methods [4, 5, 13], a label is requested on nearly every unlabeled example where two empirically good classifiers disagree. This results in a poor label complexity, far from the information-theoretic limits [6] even for general robust solutions [18]. Until now.

In Section 3, we design a new algorithm called ACTIVE COVER (AC) for constructing query probability functions that minimize the probability of querying inside the disagreement region (the set of points where good classifiers disagree) and never query otherwise. This requires a new algorithm that maintains a parsimonious cover of the set of empirically good classifiers. The cover is the result of solving an optimization problem (in Section 4) specifying the properties of a desirable query probability function. The cover size provides a practical knob between computation and label complexity, as demonstrated by the complexity analysis we present in Section 4.
Also in Section 3, we prove that AC effectively maintains a set of good classifiers, achieves good generalization error, and has a label complexity bound tighter than previous approaches. The label complexity bound depends on the disagreement coefficient [10], which does not completely capture the advantage of the algorithm. At the end of Section 3, we provide an example of a hard active learning problem where AC is substantially superior to previous tractable approaches. Together, these results show that AC is better, and sometimes substantially better, in theory.

Do agnostic active learning algorithms work in practice? No previous work has addressed this question empirically. Doing so is important because analysis cannot reveal the degree to which existing classification algorithms effectively provide an ERM oracle. We conduct an extensive study in Section 5 by simulating the interaction of the active learning algorithm with a streaming supervised dataset. Results on a wide array of datasets show that agnostic active learning typically outperforms passive learning, and the magnitude of improvement depends on how carefully the active learning hyper-parameters are chosen. More details (theory, proofs and empirical evaluation) are in the long version of this paper [14].

2 Preliminaries

Let P be a distribution over X × {±1}, and let H ⊆ {±1}^X be a set of binary classifiers, which we assume is finite for simplicity.² Let E_X[·] denote expectation with respect to X ∼ P_X, the marginal of P over X. The expected error of a classifier h ∈ H is err(h) := Pr_{(X,Y)∼P}(h(X) ≠ Y), and the error minimizer is denoted by h* := argmin_{h∈H} err(h). The (importance-weighted) empirical error of h ∈ H on a multiset S of importance-weighted labeled examples drawn from X × {±1} × R₊ is

$$\mathrm{err}(h, S) := \frac{1}{|S|}\sum_{(x,y,w) \in S} w\,\mathbf{1}(h(x) \ne y).$$

The disagreement region for a subset of classifiers A ⊆ H is DIS(A) := {x ∈ X | ∃ h, h′ ∈ A such that h(x) ≠ h′(x)}. The regret of a classifier h ∈ H relative to another h′ ∈ H is reg(h, h′) := err(h) − err(h′), and the analogous empirical regret on S is reg(h, h′, S) := err(h, S) − err(h′, S). When the second classifier h′ in the (empirical) regret is omitted, it is taken to be the (empirical) error minimizer in H.

A streaming-based active learner receives i.i.d. labeled examples (X₁, Y₁), (X₂, Y₂), ... from P one at a time; each label Y_i is hidden unless the learner decides on the spot to query it. The goal is to produce a classifier h ∈ H with low error err(h) while querying as few labels as possible. In the IWAL framework [4], the decision whether or not to query a label is made randomly: the learner picks a probability p ∈ [0, 1] and queries the label with that probability. Whenever p > 0, an unbiased error estimate can be produced using inverse probability weighting [12]. Specifically, for any classifier h, an unbiased estimator E of err(h) based on (X, Y) ∼ P and p is as follows: if Y is queried, then E = 1(h(X) ≠ Y)/p; else, E = 0. It is easy to check that E[E] = err(h). Thus, when the label is queried, we produce the importance-weighted labeled example (X, Y, 1/p).³

² The assumption that H is finite can be relaxed to VC-classes using standard arguments.
³ If the label is not queried, we produce an ignored example of weight zero; its only purpose is to maintain the correct count of querying opportunities. This ensures that 1/|S| is the correct normalization in err(h, S).
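To make the estimator above concrete, here is a minimal sketch in Python (our own illustration; `classifier`, `stream`, and `query_prob` are hypothetical stand-ins for whatever representation an implementation uses):

```python
import random

def ipw_error_estimate(classifier, stream, query_prob):
    """Inverse-probability-weighted (Horvitz-Thompson) estimate of err(classifier).

    `stream` yields (x, y) pairs; `query_prob(x)` returns the probability
    p in (0, 1] of querying the label of x.  Each queried example gets
    importance weight 1/p; unqueried examples contribute weight 0 but
    still count toward the normalization |S| (see Footnote 3).
    """
    total, n = 0.0, 0
    for x, y in stream:
        n += 1
        p = query_prob(x)
        if random.random() < p:              # label is queried
            total += (classifier(x) != y) / p
        # else: ignored example of weight zero
    return total / n
```

Averaging this estimate over many runs recovers err(h) regardless of how aggressively `query_prob` avoids querying, which is exactly the property the algorithm below exploits.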
3 Algorithm and Statistical Guarantees

Our new algorithm, shown as Algorithm 1, breaks the example stream into epochs. The algorithm admits any epoch schedule so long as the epoch lengths satisfy τ_{m+1} ≤ 2τ_m. For technical reasons, we always query the first 3 labels to kick-start the algorithm. At the start of epoch m, AC computes a query probability function P_m : X → [0, 1], which will be used for sampling the data points to query during the epoch. This is done by maintaining a few objects of interest during each epoch in Step 4: (1) the best classifier h_{m+1} on the sample Z_m collected so far, where Z_m has a mix of queried and predicted labels; (2) a radius Δ_m, which is based on the level of concentration we want various empirical quantities to satisfy; and (3) the set A_{m+1} consisting of all the classifiers with empirical regret at most γΔ_m on Z_m. Within the epoch, P_m determines the probability of querying an example in the disagreement region for this set A_m of good classifiers; examples outside this region are not queried but given labels predicted by h_m (so error estimates are not unbiased). AC computes P_m by solving the optimization problem (OP), which is further discussed below.

Algorithm 1 ACTIVE COVER (AC)
input: Constants c1, c2, c3, confidence δ, error radius γ, parameters α, β, ξ for (OP), epoch schedule 0 = τ_0 < 3 = τ_1 < τ_2 < ... < τ_M satisfying τ_{m+1} ≤ 2τ_m for m ≥ 1.
initialize: epoch m = 0, Z_0 := ∅, Δ_0 := c1 √ϵ_1 + c2 ϵ_1 log 3, where ϵ_m := 32 log(|H|τ_m/δ)/τ_m.
1: Query the labels {Y_i} of the first three unlabeled examples {X_i}, and set A_1 := H, P_1 ≡ P_min,1 = 1, and S = {(X_j, Y_j, 1)}_{j=1}^3.
2: for i = 4, ..., n do
3:   if i = τ_m + 1 then
4:     Set Z_m = Z_{m−1} ∪ S and S = ∅. Let
         h_{m+1} := argmin_{h∈H} err(h, Z_m),
         Δ_m := c1 √(ϵ_m err(h_{m+1}, Z_m)) + c2 ϵ_m log τ_m,
         A_{m+1} := {h ∈ H | err(h, Z_m) − err(h_{m+1}, Z_m) ≤ γΔ_m}.
5:     Compute the solution P_{m+1}(·) to the problem (OP) and increment m := m + 1.
6:   end if
7:   if the next unlabeled point X_i ∈ D_m := DIS(A_m) then
8:     Toss a coin with bias P_m(X_i); add the example (X_i, Y_i, 1/P_m(X_i)) to S if the outcome is heads, otherwise add (X_i, 1, 0) to S (see Footnote 3).
9:   else
10:    Add the example with predicted label (X_i, h_m(X_i), 1) to S.
11:  end if
12: end for
13: return h_{M+1} := argmin_{h∈H} err(h, Z_M).

The objective function of (OP) encourages small query probabilities in order to minimize the label complexity. The constraints (1) in (OP) bound the variance of our importance-weighted regret estimates for every h ∈ H. This is key to ensuring good generalization, as we will later use Bernstein-style bounds, which rely on our random variables having small variance. More specifically, the LHS of the constraint measures the variance of our empirical regret estimate for h, measured only on the examples in the disagreement region D_m. This is because importance weights of the form 1/P_m(X) are applied only to these examples; outside this region we use the predicted labels with an importance weight of 1. The RHS of the constraint consists of three terms. The first term ensures the feasibility of the problem, as P(X) ≥ 1/(2α²) for X ∈ D_m will always satisfy the constraints. The second, empirical-regret term makes the constraints easy to satisfy for bad hypotheses; this is crucial to rule out large label complexities in case there are bad hypotheses that disagree very often with h_m. A benefit of this is easily seen when −h_m ∈ H: this classifier might have a terrible regret, but would force a near-constant query probability on the disagreement region if β = 0. Finally, the third term is of the same order as the second one for hypotheses in A_m, and is included only to capture the allowed level of slack in our constraints, which will be exploited for the efficient implementation in Section 4.
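In code, checking the variance constraint for a single classifier h amounts to comparing the importance-weighted disagreement rate against the bound b_m(h) just discussed; below is a sketch under our own notation (all function and variable names are ours), with the formal statement of (OP) following right after:

```python
import numpy as np

def constraint_gap(h, h_m, in_dis, P, xs,
                   alpha, beta, gamma, xi, emp_regret, tau_prev, delta_prev):
    """LHS minus RHS of constraint (1) for classifier h; > 0 means violated.

    xs:         unlabeled sample standing in for the marginal P_X
    in_dis(x):  membership test for the disagreement region D_m
    P(x):       candidate query probability function
    emp_regret: reg(h, h_m, Z_{m-1}), h's empirical regret (>= 0 since h_m is the ERM)
    """
    I = np.array([float(h(x) != h_m(x) and in_dis(x)) for x in xs])
    p = np.array([max(P(x), 1e-12) for x in xs])
    lhs = np.mean(I / p)                                    # E_X[I_h^m(X)/P(X)]
    rhs = (2 * alpha**2 * np.mean(I)                        # feasibility term
           + 2 * beta**2 * gamma * emp_regret * tau_prev * delta_prev
           + xi * tau_prev * delta_prev**2)                 # slack term
    return lhs - rhs
```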
In addition to controlled variance, good concentration also requires the random variables of interest to be appropriately bounded. This is ensured through the constraints (2), which impose a minimum query probability on the disagreement region. Outside the disagreement region, we use the predicted label with an importance weight of 1, so that our estimates are always bounded (albeit biased) in this region. Note that this optimization problem is written with respect to the marginal distribution P_X of the data points, meaning that we might have infinitely many of the latter constraints. In Section 4, we describe how to solve this optimization problem efficiently, using access only to unlabeled examples drawn from P_X.

Algorithm 1 requires several input parameters, which must satisfy:

$$\alpha \ge 1, \quad \xi \ge \frac{1}{8n\epsilon_M \log n}, \quad \beta^2 \ge \frac{1}{\gamma\, n\epsilon_M \log n}, \quad \gamma \ge 2^{16}, \quad c_1 \ge 2\alpha\sqrt{6}, \quad c_2 \ge 2^{16} c_1^2, \quad c_3 \ge 1.$$

The first three parameters, α, β and ξ, control the tightness of the variance constraints (1). The next three parameters, γ, c1 and c2, control the threshold that defines the set of empirically good classifiers; c3 is used in the minimum probability (4) and can simply be set to 1.

Optimization Problem (OP) to compute P_m:

$$\min_{P}\;\; \mathbb{E}_X\left[\frac{1}{1 - P(X)}\right]$$

$$\text{s.t.}\quad \forall h \in H:\quad \mathbb{E}_X\left[\frac{I_h^m(X)}{P(X)}\right] \le b_m(h), \tag{1}$$

$$\forall x \in X:\; 0 \le P(x) \le 1, \quad\text{and}\quad \forall x \in D_m:\; P(x) \ge P_{\min,m}, \tag{2}$$

where

$$I_h^m(x) = \mathbf{1}(h(x) \ne h_m(x) \wedge x \in D_m), \qquad b_m(h) = 2\alpha^2\,\mathbb{E}_X[I_h^m(X)] + 2\beta^2\gamma\,\mathrm{reg}(h, h_m, Z_{m-1})\,\tau_{m-1}\Delta_{m-1} + \xi\tau_{m-1}\Delta_{m-1}^2, \tag{3}$$

$$P_{\min,m} = \min\left\{\frac{1}{c_3\left(\sqrt{\tau_{m-1}\,\mathrm{err}(h_m, Z_{m-1})/(n\epsilon_M)} + \log\tau_{m-1}\right)},\; 1\right\}. \tag{4}$$

Epoch Schedules: The algorithm takes an arbitrary epoch schedule subject to τ_m < τ_{m+1} ≤ 2τ_m. Two natural extremes are unit-length epochs, τ_m = m, and doubling epochs, τ_{m+1} = 2τ_m. The main difference lies in the number of times (OP) is solved, which is a substantial computational consideration. Unless otherwise stated, we assume the doubling epoch schedule, under which the query probability and ERM classifier are recomputed only O(log n) times.
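To fix intuition for these quantities, here is a small sketch (our own, with the constants left as arguments and using our reconstruction of the minimum probability (4) above) of the doubling epoch schedule, the radius Δ_m, and P_min,m:

```python
import math

def doubling_epochs(n):
    """Epoch boundaries 3, 6, 12, ..., satisfying tau_m < tau_{m+1} <= 2*tau_m."""
    taus, t = [3], 3
    while 2 * t <= n:
        t *= 2
        taus.append(t)
    return taus

def radius(tau_m, emp_err, H_size, delta, c1, c2):
    """Delta_m = c1*sqrt(eps_m * err(h_{m+1}, Z_m)) + c2*eps_m*log(tau_m)."""
    eps_m = 32.0 * math.log(H_size * tau_m / delta) / tau_m
    return c1 * math.sqrt(eps_m * emp_err) + c2 * eps_m * math.log(tau_m)

def p_min(tau_prev, emp_err_hm, n, eps_M, c3=1.0):
    """Minimum query probability on the disagreement region, per our eq. (4)."""
    denom = c3 * (math.sqrt(tau_prev * emp_err_hm / (n * eps_M))
                  + math.log(tau_prev))
    return min(1.0 / denom, 1.0)
```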
Generalization and Label Complexity. We present guarantees on the generalization error and label complexity of Algorithm 1, assuming a solver for (OP), which we provide in the next section. Our first theorem provides a bound on the generalization error. Define

$$\mathrm{err}_m(h) := \frac{1}{\tau_m}\sum_{j=1}^{m}(\tau_j - \tau_{j-1})\,\mathbb{E}_{(X,Y)\sim P}\left[\mathbf{1}(h(X) \ne Y \wedge X \in \mathrm{DIS}(A_j))\right],$$

with Δ̃_0 := 0 and Δ̃_m := c1 √(ϵ_m err_m(h*)) + c2 ϵ_m log τ_m for m ≥ 1. Essentially, Δ̃_m is a population counterpart of the quantity Δ_m used in Algorithm 1, and it crucially relies on err_m(h*), the true error of h* restricted to the disagreement region at epoch m. This quantity captures the inherent noisiness of the problem, and modulates the transition between O(1/√n)- and O(1/n)-type error bounds, as we see next.

Theorem 1. Pick any 0 < δ < 1/e such that |H|/δ > 192. Then, recalling that h* = argmin_{h∈H} err(h), we have for all epochs m = 1, 2, ..., M, with probability at least 1 − δ:

$$\mathrm{reg}(h, h^*) \le 16\gamma\tilde\Delta_m \;\text{ for all } h \in A_{m+1}, \text{ and} \tag{5}$$

$$\mathrm{reg}(h^*, h_{m+1}, Z_m) \le 2^{16}\Delta_m. \tag{6}$$

The proof is in Section 7.2.2 of [14]. Since we use γ ≥ 2^16, the bound (6) implies that h* ∈ A_m for all epochs m. This also means that all the predicted labels used by our algorithm are identical to those of h*, since no disagreement amongst classifiers in A_m was observed on those examples. This observation will be critical to our proofs, where we will exploit the fact that using labels predicted by h* instead of observed labels on certain examples only introduces a bias in favor of h*, thereby ensuring that we never mistakenly drop the optimal classifier from A_m.

The bound (5) shows that every classifier in A_{m+1} has a small regret relative to h*. Since the ERM classifier h_{m+1} is always in A_{m+1}, this yields our main generalization error bound on the classifier h_{M+1} output by Algorithm 1. Additionally, it clarifies the definition of the sets A_m as sets of good classifiers: these are classifiers which indeed have small population regret relative to h*. In a realizable setting where h* has zero error, Δ̃_m = Õ(1/τ_m), leading to an Õ(1/n) regret after n unlabeled examples are presented to the algorithm. At the other extreme, if err_m(h*) is a constant, then the regret is Õ(1/√n). There are also interesting regimes in between, where err(h*) might be a constant but err_m(h*), measured over the disagreement region, decreases rapidly. More specifically, we show in Appendix E of [14] that the expected regret of the classifier returned by Algorithm 1 achieves the optimal rate [6] under the Tsybakov [17] noise condition.

Next, we provide a label complexity guarantee in terms of the disagreement coefficient [11]:

$$\theta = \theta(h^*) := \sup_{r > 0} \frac{1}{r}\,\Pr_X\left\{x \;\middle|\; \exists h \in H \text{ s.t. } h(x) \ne h^*(x) \text{ and } \Pr_X\{x' \mid h(x') \ne h^*(x')\} \le r\right\}.$$

Theorem 2. With probability at least 1 − δ, the number of label queries made by Algorithm 1 after n examples over M epochs is at most

$$4\theta\,\mathrm{err}_M(h^*)\,n + \theta\cdot O\!\left(\sqrt{n\,\mathrm{err}_M(h^*)\log(|H|/\delta)} + \log(|H|/\delta)\right).$$

The theorem is proved in Appendix D of [14]. The first term of the label complexity bound is linear in the number of unlabeled examples, but can be quite small if θ is small or if err_M(h*) ≈ 0; it is indeed 0 in the realizable setting. The second term grows at most as O(√n), but also becomes a constant for realizable problems. Consequently, we attain a logarithmic label complexity in the realizable setting. In noisy settings, our label complexity improves upon that of predecessors such as [5, 13]. Beygelzimer et al. [5] obtain a label complexity of θ√n, which is exponentially worse for realizable problems. A related algorithm, Oracular CAL [13], has a label complexity scaling with √(n err(h*)) but a worse dependence on θ. In all comparisons, the use of err_M(h*) provides a qualitatively superior analysis to all previous results depending on err(h*), since it captures the fact that noisy labels outside the disagreement region do not affect the label complexity. Finally, as in our regret analysis, we show in Appendix E of [14] that the label complexity of Algorithm 1 achieves the information-theoretic lower bound [6] under Tsybakov's low-noise condition [17].

Section 4.2.2 of [14] gives an example where the label complexity of Algorithm 1 is significantly smaller than that of both IWAL and Oracular CAL, by virtue of rarely querying in the disagreement region. The example considers a distribution and a classifier space with the following structure: (i) for most examples, a single good classifier predicts differently from the remaining classifiers; (ii) on a few examples, half the classifiers predict one way and half the other. In the first case, little advantage is gained from a label because it provides evidence against only a single classifier. ACTIVE COVER queries the disagreement region with a probability close to P_min in case (i) and probability 1 in case (ii), while the other methods query with probability Ω(1) everywhere, implying O(√n) times more queries.
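Since the disagreement coefficient governs both bounds above, it may help to see how θ(h*) could be estimated empirically for a small, finite classifier pool; a Monte Carlo sketch (our own construction, not part of the paper's method):

```python
import numpy as np

def theta_estimate(h_star, pool, xs, radii):
    """Monte Carlo estimate of the disagreement coefficient theta(h_star).

    Implements sup_r Pr_X[x in DIS(B(h_star, r))] / r over a grid of radii,
    where B(h_star, r) = {h in pool : Pr_X[h != h_star] <= r} and the
    sample xs stands in for P_X.
    """
    preds = np.array([[h(x) for h in pool] for x in xs])    # (n, |pool|)
    star = np.array([h_star(x) for x in xs])[:, None]       # (n, 1)
    dist = (preds != star).mean(axis=0)                     # Pr_X[h != h_star]
    best = 0.0
    for r in radii:
        ball = dist <= r                                    # classifiers within radius r
        dis_mass = (preds[:, ball] != star).any(axis=1).mean()
        best = max(best, dis_mass / r)
    return best
```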
4 Efficient implementation

The computation of h_m is an ERM operation, which can be performed efficiently whenever an efficient passive learner is available. However, several other hurdles remain. Testing whether x ∈ DIS(A_m) in the algorithm, as well as finding a solution to (OP), are considerably more challenging. The epoch schedule helps, but (OP) is still solved O(log n) times, necessitating an extremely efficient solver.

Starting with the first issue, we follow Dasgupta et al. [9], who cleverly observed that x ∈ D_m := DIS(A_m) can be determined efficiently using a single call to an ERM oracle. Specifically, to apply their method, we use the oracle to find⁴

$$h' = \arg\min\{\mathrm{err}(h, Z_{m-1}) \mid h \in H,\; h(x) \ne h_m(x)\}.$$

It can then be argued that x ∈ D_m = DIS(A_m) if and only if the easily measured regret of h′ (that is, reg(h′, h_m, Z_{m−1})) is at most γΔ_{m−1}.

⁴ See Appendix F of [15] for how to deal with one constraint with an unconstrained oracle.

Solving (OP) efficiently is a much bigger challenge because the problem is enormous: there is one variable P(x) for every point x ∈ X, one constraint (1) for each classifier h, and bound constraints (2) on P(x) for every x. This leads to infinitely many variables and constraints, with an ERM oracle as the only computational primitive available.

We eliminate the bound constraints using barrier functions. Notice that the objective E_X[1/(1 − P(X))] is already a barrier at P(x) = 1. To enforce the lower bound (2), we modify the objective to

$$\mathbb{E}_X\left[\frac{1}{1 - P(X)} + \frac{\mu^2\,\mathbf{1}(X \in D_m)}{P(X)}\right], \tag{7}$$

where μ is a parameter chosen momentarily to ensure P(x) ≥ P_min,m for all x ∈ D_m. Thus, the modified goal is to minimize (7) over non-negative P subject only to (1). We solve the problem in the dual, where we have a large but finite number of optimization variables, and we efficiently maximize the dual using coordinate ascent with access to an ERM oracle over H. Let λ_h ≥ 0 denote the Lagrange multiplier for the constraint (1) corresponding to classifier h. Then, for any λ, we can minimize the Lagrangian over each primal variable P(x), yielding the solution

$$P_\lambda(x) = \mathbf{1}(x \in D_m)\,\frac{q_\lambda(x)}{1 + q_\lambda(x)}, \quad\text{where}\quad q_\lambda(x) = \sqrt{\mu^2 + \sum_{h \in H} \lambda_h I_h^m(x)} \tag{8}$$

and I_h^m(x) = 1(h(x) ≠ h_m(x) ∧ x ∈ D_m). Clearly, μ/(1 + μ) ≤ P_λ(x) ≤ 1 for all x ∈ D_m, so all the bound constraints (2) in (OP) are satisfied if we choose μ = 2P_min,m. Plugging the solution P_λ into the Lagrangian, we obtain the dual problem of maximizing the dual objective

$$D(\lambda) = \mathbb{E}_X\left[\mathbf{1}(X \in D_m)(1 + q_\lambda(X))^2\right] - \sum_{h \in H} \lambda_h b_m(h) + C_0 \tag{9}$$

over λ ≥ 0. The constant C_0 is equal to 1 − Pr(D_m), where Pr(D_m) := Pr(X ∈ D_m).

Algorithm 2 Coordinate ascent algorithm to solve (OP)
input: accuracy parameter ε > 0.
initialize: λ ← 0.
1: loop
2:   Rescale: λ ← s*·λ, where s* = argmax_{s∈[0,1]} D(s·λ).
3:   Find h̃ = argmax_{h∈H} ( E_X[I_h^m(X)/P_λ(X)] − b_m(h) ).
4:   if E_X[I_{h̃}^m(X)/P_λ(X)] ≤ b_m(h̃) + ε then
5:     return λ
6:   else
7:     Update λ_{h̃} ← λ_{h̃} + 2( E_X[I_{h̃}^m(X)/P_λ(X)] − b_m(h̃) ) / E_X[I_{h̃}^m(X)/q_λ(X)³].
8:   end if
9: end loop

An algorithm to approximately solve this problem is presented as Algorithm 2. It takes a parameter ε > 0 specifying the degree to which all of the constraints (1) are to be approximated. Since D is concave, the rescaling step can be solved using a straightforward numerical line search. The main implementation challenge lies in finding the most violated constraint (Step 3). Fortunately, this step can be reduced to a single call to an ERM oracle.
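For intuition, here is a compact sketch of Algorithm 2 over an empirical sample, with the ERM oracle of Step 3 replaced by explicit enumeration over a finite pool (fine for illustration, but intractable for the large classifier sets the oracle reduction targets) and the rescaling line search of Step 2 omitted; the array layout and precomputed inputs are our own assumptions:

```python
import numpy as np

def solve_op_dual(I, b, mu, eps, max_iters=10000):
    """Coordinate ascent on the dual (9) of (OP), over an empirical sample.

    I:  (n, K) 0/1 array with I[i, k] = I_{h_k}^m(x_i) for a sample x_1..x_n ~ P_X
    b:  (K,) array of bounds b_m(h_k) from (3)
    mu: barrier parameter (mu = 2 * P_min,m in the text)
    Returns the dual variables and the query probabilities P_lambda(x_i).
    """
    n, K = I.shape
    lam = np.zeros(K)
    for _ in range(max_iters):
        q = np.sqrt(mu**2 + I @ lam)               # q_lambda(x), eq. (8)
        P = q / (1.0 + q)                          # P_lambda(x) on D_m
        viol = (I / P[:, None]).mean(axis=0) - b   # E[I_h^m/P_lambda] - b_m(h)
        k = int(np.argmax(viol))                   # most violated constraint (Step 3)
        if viol[k] <= eps:                         # Step 4: all met up to eps
            break
        # Step 7: Newton-style coordinate update on lambda_k
        lam[k] += 2.0 * viol[k] / np.mean(I[:, k] / q**3)
    q = np.sqrt(mu**2 + I @ lam)
    return lam, q / (1.0 + q)
```

Note how the closed form (8) means only the dual variables with non-zero entries (the "cover" of classifiers) ever need to be stored to represent the query probability function.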
To see why a single oracle call suffices, note that the constraint violation of classifier h can be written as

$$\mathbb{E}_X\left[\left(\frac{1}{P(X)} - 2\alpha^2\right) I_h^m(X)\right] - 2\beta^2\gamma\tau_{m-1}\Delta_{m-1}\,\mathrm{err}(h, Z_{m-1}) + 2\beta^2\gamma\tau_{m-1}\Delta_{m-1}\,\mathrm{err}(h_m, Z_{m-1}) - \xi\tau_{m-1}\Delta_{m-1}^2.$$

The second term of this expression is simply the scaled risk (classification error) of h with respect to the actual labels. The first term is the risk of h in predicting samples which have been labeled according to h_m, with importance weights of 1/P(x) − 2α² if x ∈ D_m and 0 otherwise; note that these weights may be positive or negative. The last two terms do not depend on h. Thus, given access to P_X (or samples approximating it, as discussed shortly), the most violated constraint can be found by solving an ERM problem defined on the labeled samples in Z_{m−1} and on samples drawn from P_X labeled by h_m, with appropriate importance weights detailed in Appendix F.1 of [14]. When all primal constraints are approximately satisfied, the algorithm stops. We have the following guarantee on the convergence of the algorithm.

Theorem 3. When run on the m-th epoch, Algorithm 2 halts in at most Pr(D_m)/(8P³_min,m ε²) iterations and outputs a solution λ̂ ≥ 0 such that P_λ̂ satisfies the simple bound constraints in (2) exactly, the variance constraints in (1) up to an additive factor of ε, and

$$\mathbb{E}_X\left[\frac{1}{1 - P_{\hat\lambda}(X)}\right] \le \mathbb{E}_X\left[\frac{1}{1 - P^*(X)}\right] + 4P_{\min,m}\Pr(D_m), \tag{10}$$

where P* is the solution to (OP). Furthermore, ‖λ̂‖₁ ≤ Pr(D_m)/ε.

If ε is set to ξτ_{m−1}Δ²_{m−1}/2, an amount of constraint violation tolerable in our analysis, the number of iterations (and hence the number of ERM oracle calls) in Theorem 3 is at most O(τ²_{m−1}). The proof is in Appendix F.2 of [14].

Solving (OP) with expectation over samples: So far we have considered solving (OP) defined on the unlabeled data distribution P_X, which is unavailable in practice. A natural substitute for P_X is an i.i.d. sample drawn from it. In Appendix F.3 of [14] we show that solving a properly defined sample variant of (OP) leads to a solution to the original (OP) with guarantees similar to those in Theorem 3.

5 Experiments with Agnostic Active Learning

While AC is efficient in the number of ERM oracle calls, it needs to store all past examples, resulting in a large space complexity. As Theorem 3 suggests, the query probability function (8) may need as many as O(τ_i²) classifiers, further increasing the storage demand. Aiming at a scalable implementation, we consider an online approximation of AC, given in Section 6.1 of [14]. The main differences from AC are: (1) instead of a batch ERM oracle, it invokes an online oracle; and (2) instead of repeatedly solving (OP) from scratch, it maintains a fixed-size set of classifiers (and hence non-zero dual variables), called the cover, for representing the query probability, and it updates the cover with every new example in a manner similar to the coordinate ascent algorithm for solving (OP).

We conduct an empirical comparison of the following efficient agnostic active learning algorithms:

- OAC: the online approximation of ACTIVE COVER (Algorithm 3 in Section 6.1 of [14]).
- IWAL0 and IWAL1: the algorithm of [5] and a variant that uses a tighter threshold.
- ORA-OAC, ORA-IWAL0, and ORA-IWAL1: Oracular-CAL [13] versions of OAC, IWAL0 and IWAL1.
- PASSIVE: passive learning on a labeled sub-sample drawn uniformly at random.

Details about these algorithms are in Section 6.2 of [14].
The high-level differences among these algorithms are best explained in the context of the disagreement region: OAC does importance-weighted querying of labels with an optimized query probability in the disagreement region, while using predicted labels outside; IWAL0 and IWAL1 maintain a non-zero minimum query probability everywhere; ORA-OAC, ORA-IWAL0 and ORA-IWAL1 query labels in their respective disagreement regions with probability 1, and use predicted labels otherwise.

We implemented these algorithms in Vowpal Wabbit (http://hunch.net/~vw/), a fast learning system based on online convex optimization, using logistic regression as the ERM oracle. We performed experiments on 22 binary classification datasets with varying sizes (10^3 to 10^6 examples) and diverse feature characteristics. Details about the datasets are in Appendix G.1 of [14].

Our goal is to evaluate the improvement in test error per label query achieved by the different algorithms. To simulate the streaming setting, we randomly permuted each dataset, ran the active learning algorithms through the first 80% of the data, and evaluated the learned classifiers on the remaining 20%. We repeated this process 9 times to reduce the variance due to random permutation. For each active learning algorithm, we obtained the test error rates of classifiers trained at doubling numbers of label queries, starting from 10 up to 10240. Formally, let error_{a,p}(d, j, q) denote the test error of the classifier returned by algorithm a using hyper-parameter setting p on the j-th permutation of dataset d, immediately after hitting the q-th label budget 10·2^{q−1}, 1 ≤ q ≤ 11. Let query_{a,p}(d, j, q) be the actual number of label queries made, which can be smaller than 10·2^{q−1} when algorithm a reaches the end of the training data before hitting that label budget. To evaluate an algorithm, we consider the area under its curve of test error against the logarithm of the number of label queries:

$$\mathrm{AUC}_{a,p}(d, j) = \sum_{q=1}^{10} \frac{\mathrm{error}_{a,p}(d, j, q+1) + \mathrm{error}_{a,p}(d, j, q)}{2}\,\log_2\!\left(\frac{\mathrm{query}_{a,p}(d, j, q+1)}{\mathrm{query}_{a,p}(d, j, q)}\right).$$

A good active learning algorithm has a small AUC value, which indicates that the test error decreases quickly as the number of label queries increases. We use a logarithmic scale for the number of label queries to focus on the performance with few label queries, where active learning is the most relevant. More details about hyper-parameters are in Appendix G.2 of [14].

[Figure 1: Average relative improvement in test error vs. number of label queries, for OAC, IWAL0, IWAL1, ORA-OAC, ORA-IWAL0, ORA-IWAL1 and PASSIVE against the PASSIVE baseline; one panel uses the best fixed hyper-parameter setting and the other the best hyper-parameter setting per dataset.]

We measure the performance of each algorithm a by the following two aggregated metrics:

$$\mathrm{AUC\text{-}GAIN}^*(a) := \underset{d}{\mathrm{mean}}\;\max_p\;\underset{1\le j\le 9}{\mathrm{median}}\;\frac{\mathrm{AUC}_{\mathrm{base}}(d, j) - \mathrm{AUC}_{a,p}(d, j)}{\mathrm{AUC}_{\mathrm{base}}(d, j)},$$

$$\mathrm{AUC\text{-}GAIN}(a) := \max_p\;\underset{d}{\mathrm{mean}}\;\underset{1\le j\le 9}{\mathrm{median}}\;\frac{\mathrm{AUC}_{\mathrm{base}}(d, j) - \mathrm{AUC}_{a,p}(d, j)}{\mathrm{AUC}_{\mathrm{base}}(d, j)},$$

where AUC_base denotes the AUC of PASSIVE using a default hyper-parameter setting, i.e., a learning rate of 0.4 (see Appendix G.2 of [14]). The first metric, AUC-GAIN*, shows the maximal gain each algorithm achieves with the best hyper-parameter setting for each dataset, while the second shows the gain from using the single hyper-parameter setting that performs best on average across datasets.
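As a concrete reference, here is a short sketch (our own harness code, not from the paper) of the AUC and AUC-GAIN computations; `auc_algo` and `auc_base` are hypothetical containers for per-permutation AUC values:

```python
import math
import statistics

def auc(errors, queries):
    """Trapezoidal area under the test-error vs. log2(#queries) curve.

    errors[q] and queries[q] are the test error and actual number of label
    queries at the q-th budget (budgets 10 * 2^(q-1), q = 1..11).
    """
    return sum((errors[q + 1] + errors[q]) / 2.0
               * math.log2(queries[q + 1] / queries[q])
               for q in range(len(errors) - 1))

def auc_gain_fixed(auc_algo, auc_base):
    """AUC-GAIN with the best single hyper-parameter across datasets.

    auc_algo[p][d] and auc_base[d] are lists of AUCs over the 9 random
    permutations of dataset d (auc_base uses the PASSIVE default).
    """
    def mean_gain(p):
        gains = [statistics.median((b - a) / b for a, b in
                                   zip(auc_algo[p][d], auc_base[d]))
                 for d in auc_algo[p]]
        return sum(gains) / len(gains)
    return max(mean_gain(p) for p in auc_algo)
```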
Results and Discussions. Table 1 gives a summary of the performance of the different algorithms.

Table 1: Summary of performance metrics

|            | OAC   | IWAL0 | IWAL1 | ORA-OAC | ORA-IWAL0 | ORA-IWAL1 | PASSIVE |
|------------|-------|-------|-------|---------|-----------|-----------|---------|
| AUC-GAIN*  | 0.151 | 0.150 | 0.142 | 0.125   | 0.115     | 0.121     | 0.095   |
| AUC-GAIN   | 0.065 | 0.085 | 0.081 | 0.078   | 0.073     | 0.075     | 0.072   |

When using hyper-parameters optimized on a per-dataset basis (top row of Table 1), OAC achieves the largest improvement over the PASSIVE baseline, with IWAL0 achieving almost the same improvement and IWAL1 improving slightly less. The Oracular-CAL variants perform worse, but still do better than PASSIVE with the best learning rate for each dataset, which yields an average improvement of 9.5% in AUC over the default learning rate. When using the best fixed hyper-parameter setting across all datasets (bottom row of Table 1), all active learning algorithms achieve less improvement compared with PASSIVE (7% improvement with the best fixed learning rate). In particular, OAC gets only a 6.5% improvement. This suggests that careful tuning of hyper-parameters is critical for OAC and an important direction for future work.

Figure 1(a) describes the behaviors of the different algorithms in more detail. For each algorithm a, we identify the best fixed hyper-parameter setting

$$p^* := \arg\max_p\;\underset{d}{\mathrm{mean}}\;\underset{1\le j\le 9}{\mathrm{median}}\;\frac{\mathrm{AUC}_{\mathrm{base}}(d, j) - \mathrm{AUC}_{a,p}(d, j)}{\mathrm{AUC}_{\mathrm{base}}(d, j)}$$

and plot the relative test error improvement of a using p*, averaged across all datasets at the 11 label budgets 10·2^{q−1}:

$$\underset{d}{\mathrm{mean}}\;\underset{1\le j\le 9}{\mathrm{median}}\;\frac{\mathrm{error}_{\mathrm{base}}(d, j, q) - \mathrm{error}_{a,p^*}(d, j, q)}{\mathrm{error}_{\mathrm{base}}(d, j, q)}.$$

All algorithms, including PASSIVE, perform similarly during the first few hundred label queries. IWAL0 performs the best at label budgets larger than 80, with IWAL1 doing almost as well. ORA-OAC is the next best, followed by ORA-IWAL1 and ORA-IWAL0. OAC performs worse than PASSIVE except at label budgets between 320 and 1280.

In Figure 1(b), we plot the results obtained by each algorithm a using the best hyper-parameter setting for each dataset d:

$$p^*_d := \arg\max_p\;\underset{1\le j\le 9}{\mathrm{median}}\;\frac{\mathrm{AUC}_{\mathrm{base}}(d, j) - \mathrm{AUC}_{a,p}(d, j)}{\mathrm{AUC}_{\mathrm{base}}(d, j)}.$$

As expected, all algorithms perform better, but OAC benefits the most from using the best hyper-parameter setting per dataset. Appendix G.3 of [14] gives more detailed results, including test error rates obtained by all algorithms at different label query budgets for individual datasets. In sum, when using the best fixed hyper-parameter setting, IWAL0 outperforms the other algorithms. When using the best hyper-parameter setting tuned for each dataset, OAC and IWAL0 perform equally well and better than the other algorithms.

References

[1] Maria-Florina Balcan and Phil Long. Active and passive learning of linear separators under log-concave distributions. In Conference on Learning Theory, pages 288-316, 2013.
[2] Maria-Florina Balcan, Alina Beygelzimer, and John Langford. Agnostic active learning. In Proceedings of the 23rd International Conference on Machine Learning, pages 65-72. ACM, 2006.
[3] Maria-Florina Balcan, Andrei Broder, and Tong Zhang. Margin based active learning. In Proceedings of the 20th Annual Conference on Learning Theory, pages 35-50. Springer-Verlag, 2007.
[4] A. Beygelzimer, S. Dasgupta, and J. Langford. Importance weighted active learning. In ICML, 2009.
[5] A. Beygelzimer, D. Hsu, J. Langford, and T. Zhang. Agnostic active learning without constraints. In NIPS, 2010.
[6] R. M. Castro and R. D. Nowak. Minimax bounds for active learning. IEEE Transactions on Information Theory, 54(5):2339-2353, 2008.
[7] D. Cohn, L. Atlas, and R. Ladner. Improving generalization with active learning. Machine Learning, 15:201-221, 1994.
[8] S. Dasgupta. Coarse sample complexity bounds for active learning. In Advances in Neural Information Processing Systems 18, 2005.
[9] S. Dasgupta, D. Hsu, and C. Monteleoni. A general agnostic active learning algorithm. In NIPS, 2007.
[10] S. Hanneke. Theoretical Foundations of Active Learning. PhD thesis, Carnegie Mellon University, 2009.
[11] Steve Hanneke. Theory of disagreement-based active learning. Foundations and Trends in Machine Learning, 7(2-3):131-309, 2014.
[12] D. G. Horvitz and D. J. Thompson. A generalization of sampling without replacement from a finite universe. Journal of the American Statistical Association, 47:663-685, 1952.
[13] Daniel J. Hsu. Algorithms for Active Learning. PhD thesis, University of California at San Diego, 2010.
[14] Tzu-Kuo Huang, Alekh Agarwal, Daniel J. Hsu, John Langford, and Robert E. Schapire. Efficient and parsimonious agnostic active learning. arXiv preprint arXiv:1506.08669, 2015.
[15] Nikos Karampatziakis and John Langford. Online importance weight aware updates. In Proceedings of the Twenty-Seventh Conference on Uncertainty in Artificial Intelligence (UAI), pages 392-399, 2011.
[16] Vladimir Koltchinskii. Rademacher complexities and bounding the excess risk in active learning. Journal of Machine Learning Research, 11:2457-2485, 2010.
[17] A. B. Tsybakov. Optimal aggregation of classifiers in statistical learning. Annals of Statistics, 32:135-166, 2004.
[18] Chicheng Zhang and Kamalika Chaudhuri. Beyond disagreement-based agnostic active learning. In Advances in Neural Information Processing Systems, pages 442-450, 2014.