# Active Learning for Cost-Sensitive Classification

Akshay Krishnamurthy¹, Alekh Agarwal², Tzu-Kuo Huang³, Hal Daumé III⁴, John Langford²

¹University of Massachusetts, Amherst, MA. ²Microsoft Research, New York, NY. ³Uber Advanced Technology Center, Pittsburgh, PA. ⁴University of Maryland, College Park, MD. Correspondence to: Akshay Krishnamurthy.

Proceedings of the 34th International Conference on Machine Learning, Sydney, Australia, PMLR 70, 2017. Copyright 2017 by the author(s).

We design an active learning algorithm for cost-sensitive multiclass classification: problems where different errors have different costs. Our algorithm, COAL, makes predictions by regressing to each label's cost and predicting the label with the smallest one. On a new example, it uses a set of regressors that perform well on past data to estimate the possible costs for each label. It queries only the labels that could be the best, ignoring the sure losers. We prove that COAL can be efficiently implemented for any regression family that admits squared loss optimization; it also enjoys strong guarantees with respect to predictive performance and labeling effort. Our experiments with COAL show significant improvements in labeling effort and test cost over passive and active baselines.

1. Introduction

The field of active learning studies how to efficiently elicit relevant information so learning algorithms can make good decisions. Almost all active learning algorithms are designed for binary classification problems, leading to the natural question: how can active learning address more complex prediction problems? Multiclass and importance-weighted classification require only minor modifications, but we know of no active learning algorithms that enjoy theoretical guarantees for more complex problems.

One such problem is cost-sensitive multiclass classification (CSMC). In CSMC with K classes, passive learners receive input examples x and cost vectors c ∈ R^K, where c(y) is the cost of predicting label y on x.¹ A natural design for an active CSMC learner then is to adaptively query the costs of only a (possibly empty) subset of labels on each x. Since measuring label complexity is more nuanced in CSMC (e.g., is it more expensive to query three costs on a single example or one cost on three examples?), we track both the number of examples for which at least one cost is queried and the total number of cost queries issued. The first corresponds to a fixed human effort for inspecting x. The second captures the additional effort for judging the cost of each prediction, which depends on the number of labels queried. (By querying a label, we mean querying the cost of predicting that label given an example.)

¹ Cost here refers to prediction cost and not labeling effort or the cost of acquiring different labels.

In this setup, we develop a new active learning algorithm for CSMC called Cost Overlapped Active Learning (COAL). COAL assumes access to a set of regression functions, and, when processing an example x, it uses the functions with good past performance to compute the range of possible costs that each label might take. Naturally, COAL only queries labels with a large cost range, but furthermore, it only queries labels y that could possibly have the smallest cost, avoiding the uncertain but surely suboptimal labels. The key algorithmic innovation is an efficient way to compute the cost range realized by good regressors.
This computation, and COAL as a whole, only requires that the regression set admits efficient squared loss optimization, in contrast with prior algorithms that require 0/1 loss optimization (Beygelzimer et al., 2009; Hanneke, 2014). Among our results, we prove that when processing n (unlabeled) examples with K classes and N regressors,

1. The algorithm needs to solve O(Kn² log n) regression problems over the function class (Cor. 2), which can be done in polynomial time for convex regression sets.

2. With no assumptions on the noise in the problem, the algorithm achieves generalization error O(√(K ln N / n)) and requests Õ(θ₂ √(nK ln N)) costs from Õ(θ₁ √(nK ln N)) examples (Thms. 3 and 5), where θ₁, θ₂ are the disagreement coefficients (Def. 1).² The worst case offers minimal improvement over passive learning, akin to binary classification.

3. With a favorable noise assumption (As. 2), the algorithm achieves generalization error Õ(K ln N / n) while requesting Õ(K c^{1/β} n^β θ₂ ln N) labels from Õ(c^{1/β} n^β θ₁ K ln N) examples (Cor. 4, Thm. 6), where β ∈ (0, 1) is a safety parameter and c is a constant.

² Õ(·) suppresses logarithmic dependence on n and K.

[Figure 1. Empirical evaluation of COAL on the Reuters text categorization dataset. Active learning achieves better test cost than passive, with a factor of 16 fewer queries. See Section 6 for details.]

We also discuss some intuitive examples highlighting the benefits of using COAL.

CSMC provides a more expressive language for success and failure than multiclass classification, which allows algorithms to better trade off errors and broadens potential applications. For example, CSMC can naturally express partial failure in hierarchical classification (Silla Jr. & Freitas, 2011). Experimentally, we show that COAL substantially outperforms the passive learning baseline, with orders of magnitude savings in labeling effort on a number of hierarchical classification datasets (see Fig. 1 for a comparison between passive learning and COAL on Reuters text categorization). CSMC also forms the basis of learning to avoid cascading failures in joint prediction tasks (Daumé III et al., 2009; Ross & Bagnell, 2014; Chang et al., 2015) like structured prediction and reinforcement learning. As our second application, we consider learning-to-search algorithms for joint (or structured) prediction, which operate by a reduction to CSMC. In this reduction, evaluating the cost of a class often involves a computationally expensive roll-out, so using an active learning algorithm inside such a (passive) joint prediction method can lead to significant computational savings. We show that using COAL within the AGGRAVATE algorithm (Ross & Bagnell, 2014; Chang et al., 2015) reduces the number of roll-outs by a factor of 1–4 on several joint prediction tasks.

Related Work. Active learning is a thriving research area with many theoretical and empirical studies. We recommend the survey of Settles (2012) for an overview of the more empirical research; we focus here on theoretical results. Castro & Nowak (2008) study active learning with nonparametric decision sets, while Balcan et al. (2007) and Balcan & Long (2013) focus on linear representations under distributional assumptions. The online learning community has also studied active learning of linear separators under adversarial assumptions (Cavallanti et al., 2011; Dekel et al., 2010; Orabona & Cesa-Bianchi, 2011; Agarwal, 2013).
Our work falls into the framework of disagreement-based active learning, which studies general hypothesis spaces, typically in an agnostic setup (see Hanneke (2014) for an excellent survey). Existing results study binary classification, while our work generalizes to CSMC, assuming that we can accurately predict costs using regression functions from our class. The other main differences are that our query rule checks the range of predicted costs for a label, and we use a square loss oracle to search the version space. In contrast, prior work either explicitly enumerates the version space (Balcan et al., 2006; Zhang & Chaudhuri, 2014) or uses a 0/1 loss classification oracle for the search (Dasgupta et al., 2007; Beygelzimer et al., 2009; 2010; Huang et al., 2015). In most instantiations, the oracle solves an NP-hard problem and so does not directly lead to an efficient algorithm, although practical implementations using heuristics are still quite effective. Our approach instead uses a squared-loss regression oracle, which can often be implemented efficiently via convex optimization and leads to a polynomial time algorithm.

Supervised learning oracles that solve NP-hard optimization problems in the worst case have been used in other problems, including contextual bandits (Agarwal et al., 2014; Syrgkanis et al., 2016) and structured prediction (Daumé III et al., 2009). Thus we hope that our work can inspire new algorithms for these settings as well. Lastly, we mention that square loss regression has been used to estimate costs for passive CSMC (Langford & Beygelzimer, 2005), but, to our knowledge, using a square loss oracle for active CSMC is new.

2. Problem Setting and Notations

We study cost-sensitive multiclass classification problems with K classes, where there is an instance space X, a label space Y = {1, . . . , K}, and a distribution D supported on X × [0, 1]^K.³ If (x, c) ∼ D, we refer to c as the cost vector, where c(y) is the cost of predicting y ∈ Y. A classifier h : X → Y has expected cost E_{(x,c)∼D}[c(h(x))], and we aim to find a classifier with minimal expected cost.

³ In general, labels just serve as indices for the cost vector in CSMC, and the data distribution is over (x, c) pairs instead of (x, y) pairs as in binary and multiclass classification.

Let G ≜ {g : X → [0, 1]} denote a set of base regressors and let F ≜ G^K denote a set of vector regressors, where the yth coordinate of f ∈ F is written as f(·; y). The set of classifiers under consideration is H ≜ {h_f | f ∈ F}, where each f defines a classifier h_f : X → Y by

    h_f(x) ≜ argmin_y f(x; y).    (1)

Given a set of examples and queried costs, we often restrict attention to regression functions that predict these costs well, and assess the uncertainty in their predictions given a new example x. For a subset of regressors G′ ⊆ G, we measure uncertainty over the possible cost values for x with

    γ(x, G′) ≜ c⁺(x, G′) − c⁻(x, G′), where c⁺(x, G′) ≜ max_{g∈G′} g(x) and c⁻(x, G′) ≜ min_{g∈G′} g(x).    (2)

For vector regressors F′ ⊆ F, we define the cost range for a label y given x as γ(x, y, F′) ≜ γ(x, G_{F′}(y)), where G_{F′}(y) ≜ {f(·; y) | f ∈ F′} is the set of base regressors induced by F′ for y.
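These quantities are easy to state in code. The following minimal sketch is our own illustration, not from the paper: it computes the induced classifier of Eq. (1) and the per-label cost range of Eq. (2) by brute force over a small finite class of vector regressors; the random linear regressors and the clipping to [0, 1] are arbitrary choices for the example.

```python
import numpy as np

# A vector regressor f maps an example x to a length-K vector of predicted
# costs in [0, 1]; here a finite class F is simply a list of such callables.

def h_f(f, x):
    """Induced classifier of Eq. (1): predict the label with the smallest predicted cost."""
    return int(np.argmin(f(x)))

def cost_range(F, x, y):
    """gamma(x, y, F) of Eq. (2): max minus min predicted cost for label y over the class F."""
    preds = np.array([f(x)[y] for f in F])
    return float(preds.max() - preds.min())

# Toy class: K = 3 labels, d = 5 features, random linear regressors clipped to [0, 1].
def make_linear(W):
    return lambda x: np.clip(W @ x, 0.0, 1.0)

rng = np.random.default_rng(0)
F = [make_linear(rng.random((3, 5))) for _ in range(20)]
x = rng.random(5)
print(h_f(F[0], x), [round(cost_range(F, x, y), 3) for y in range(3)])
```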
When using a set of regression functions for a classification task, it is natural to assume that the expected costs under D can be predicted well by some function in the set. This motivates the following realizability assumption.

Assumption 1 (Realizability). Define the Bayes-optimal regressor f*, which has f*(x; y) = E_c[c(y) | x] for all x ∈ X (with D(x) > 0) and y ∈ Y. We assume that f* ∈ F.

While f* is always well defined, note that the cost itself may be noisy. In comparison with our assumption, the existence of a zero-cost classifier in H (which is often assumed in active learning) is stronger, while the existence of h_{f*} in H is weaker but has not been leveraged in active learning.

In typical settings, the set G is extremely large, which introduces the computational challenge of managing this set. To address this challenge, we leverage existing algorithmic research on supervised learning and assume access to a regression oracle for G. Given an importance-weighted dataset D = {(x_i, c_i, w_i)}_{i=1}^n, the regression oracle computes

    ORACLE(D) ∈ argmin_{g∈G} Σ_i w_i (g(x_i) − c_i)².    (3)

In many cases this is a convex problem and can be solved efficiently. In the special case of linear functions, this is just least squares and can be computed in closed form.

To measure the labeling effort, we track the number of examples for which even a single cost is queried as well as the total number of queries. This bookkeeping captures settings where the editorial effort for inspecting an example is high but each cost requires minimal further effort, as well as those where each cost requires substantial effort. Formally, we define Q_i(y) to be the indicator that the algorithm queries label y on the ith example and measure

    L1 ≜ Σ_{i=1}^n max_{y∈Y} Q_i(y), and L2 ≜ Σ_{i=1}^n Σ_{y∈Y} Q_i(y).    (4)

Algorithm 1 Cost Overlapped Active Learning (COAL)
1: Input: Regressors G, failure probability δ ≤ 1/e, safety parameter β ∈ (0, 1).
2: Set ψ_i = 1/√i, κ = 80, ν_n = log(2n²|G|K/δ).
3: Set Δ_i = κ (i − 1)^{β−1} ν_n.
4: for i = 1, 2, . . . , n do
5:   g_{i,y} ← argmin_{g∈G} R̂_i(g; y). (See Eq. (5).)
6:   Define f_i ← (g_{i,y})_{y=1}^K.
7:   G_i(y) ← {g ∈ G | R̂_i(g; y) ≤ R̂_i(g_{i,y}; y) + Δ_i}.
8:   Receive new example x. Q_i(y) ← 0, ∀y ∈ Y.
9:   for every y ∈ Y do
10:     ĉ⁺(y) ← MAXCOST((x, y), Δ_i, ψ_i/4, R̂_i(·; y)).
11:     ĉ⁻(y) ← MINCOST((x, y), Δ_i, ψ_i/4, R̂_i(·; y)).
12:   end for
13:   Y′ ← {y ∈ Y | ĉ⁻(y) ≤ min_{y′} ĉ⁺(y′)}.
14:   if |Y′| > 1 then
15:     Q_i(y) ← 1 if y ∈ Y′ and ĉ⁺(y) − ĉ⁻(y) > ψ_i.
16:   end if
17:   Query costs of each y with Q_i(y) = 1.
18: end for

3. Cost Overlapped Active Learning

The pseudocode for our algorithm, Cost Overlapped Active Learning (COAL), is given in Algorithm 1. Given an example x, COAL queries the costs of some of the labels y for x. These labels are chosen by (1) computing a set of good regression functions based on the past data (i.e., the version space), (2) computing the range of predictions achievable by these functions for each y, and (3) querying each y that could be the best label and has substantial uncertainty. We now detail each step.

To compute an approximate version space we first find the regression function that minimizes the empirical risk for each label y, which at round i is

    R̂_i(g; y) = (1/(i − 1)) Σ_{j=1}^{i−1} (g(x_j) − c_j(y))² Q_j(y).    (5)

Recall that Q_j(y) is the indicator that we query label y on the jth example. Computing the minimizer requires one oracle call. We implicitly construct the version space G_i(y) in Line 7 as the regressors with low square loss regret to the empirical risk minimizer. The tolerance on this regret is Δ_i at round i, which depends on the safety parameter β ∈ (0, 1) in the algorithm. When β is large, the tolerance is also large and the algorithm issues many queries. Conversely, when β is small the algorithm is more aggressive. However, for any strictly positive β, the definition of Δ_i ensures that f*(·; y) ∈ G_i(y) for all i, y.
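For intuition about the oracle and the version space, here is a minimal sketch for the linear case, where the weighted least-squares problem of Eq. (3) has a closed form. The helper names, the ridge term, and the membership test are our own illustration of Eq. (5) and Line 7 of Algorithm 1, not the paper's implementation.

```python
import numpy as np

def regression_oracle(X, c, w, reg=1e-8):
    """ORACLE(D) of Eq. (3) for linear regressors: weighted least squares,
    theta = (X^T W X)^-1 X^T W c (a small ridge term keeps the solve stable)."""
    W = np.diag(w)
    return np.linalg.solve(X.T @ W @ X + reg * np.eye(X.shape[1]), X.T @ W @ c)

def empirical_risk(theta, X, c, q, i):
    """R_hat_i(g; y) of Eq. (5): average squared error over the first i-1 examples,
    counting only those whose label-y cost was queried (q is the 0/1 query mask)."""
    resid = (X @ theta - c) ** 2
    return float((resid * q).sum() / max(i - 1, 1))

def in_version_space(theta, theta_erm, X, c, q, i, delta_i):
    """Membership test of Line 7 in Algorithm 1: regret to the per-label ERM is at most Delta_i."""
    return empirical_risk(theta, X, c, q, i) <= empirical_risk(theta_erm, X, c, q, i) + delta_i
```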
COAL then computes the maximum and minimum costs predicted by the version space G_i(y) on the new example x. Since the true expected cost is f*(x; y) and f*(·; y) ∈ G_i(y), these quantities serve as a confidence bound for this value. The computation is done by the MAXCOST and MINCOST subroutines, which produce approximations to c⁺(x, G_i(y)) and c⁻(x, G_i(y)) (Eq. (2)), respectively.

Finally, using the predicted costs, COAL issues (possibly zero) queries. The algorithm queries any non-dominated label that has a large cost range, where a label is non-dominated if its estimated minimum cost is smaller than the smallest maximum cost (among all labels), and the cost range is the difference between the label's estimated maximum and minimum costs. Intuitively, COAL queries the cost of every label which cannot be ruled out as having the smallest cost on x, but only if there is sufficient ambiguity about the actual value of the cost. The idea is that labels with little disagreement do not provide much information for further reducing the version space, since by construction all functions would suffer similar loss. Moreover, only the labels that could be the best need to be queried at all, since the cost-sensitive performance of a hypothesis h_f depends only on the label that it predicts. Hence, labels that are dominated or have small cost range need not be queried.

Similar query rules appear in prior works on binary and multiclass classification (Orabona & Cesa-Bianchi, 2011; Dekel et al., 2010; Agarwal, 2013), but specialized to linear representations. The key advantage of the linear case is that the set G_i(y) (formally, a different set with similar properties) along with ĉ⁺(y) and ĉ⁻(y) have closed form expressions, so that the algorithms are easily implemented. However, with a general set G and a regression oracle, computing these confidence intervals is less straightforward. We use the MAXCOST and MINCOST subroutines, and discuss this aspect of our algorithm next.

3.1. Efficient Computation of Cost Range

In this section, we describe the MAXCOST subroutine, which uses the oracle to approximate the maximum cost on label y realized by G_i(y) (recall the definition in Eq. (2)).⁴ Describing the algorithm requires some additional notation. Given the empirical risk functional R̂(g; y) over a set of examples (we suppress the subscript as the number of examples is fixed here), we define a weighted risk functional incorporating a fresh unlabeled example x as

    R̃(g, w, c; y) = R̂(g; y) + w (g(x) − c)².    (6)

Finding argmin_g R̃(g, w, c; y) involves a single oracle call. We also define a set of near-optimal regressors

    G(Δ; y) ≜ {g ∈ G | R̂(g; y) ≤ min_{g′} R̂(g′; y) + Δ}.    (7)

⁴ MINCOST is similar and omitted due to space constraints.

Algorithm 2 MAXCOST
1: Input: (x, y), Δ, ε, risk functional R̂(·; y).
2: g_min = argmin_{g∈G} R̂(g; y).
3: ℓ = 0, h = 1, c = 1.
4: while |h − ℓ| ≥ 2ε/3 do
5:   g_c ← argmin_{g∈G} R̃(g, Δ/ε², c; y) (see Eq. (6)).
6:   If g_c ∈ G(Δ; y) (see Eq. (7)), output g_c(x) + ε.
7:   (g_ℓ, g_h) ← BSEARCH((x, y, c), Δ, ε, R̂(·; y)).
8:   If g_h ∈ G(4Δ; y), output g_h(x).
9:   Else ℓ ← max{g_ℓ(x), ℓ}, h ← g_h(x), c ← (h + ℓ)/2.
10: end while
11: return c.

Algorithm 3 BINARYSEARCH (BSEARCH)
1: Input: (x, y, c), Δ, ε, risk functional R̂(·; y).
2: w_{1,ℓ} = 0, w_{1,h} = Δ/ε², t = 1.
3: while |w_{t,ℓ} − w_{t,h}| ≥ ε² do
4:   w_t ← (w_{t,ℓ} + w_{t,h})/2, g_t = argmin_{g∈G} R̃(g, w_t, c; y).
5:   If g_t ∈ G(Δ; y), w_{t+1,ℓ} ← w_t, w_{t+1,h} ← w_{t,h}.
6:   Else w_{t+1,ℓ} ← w_{t,ℓ}, w_{t+1,h} ← w_t.
7:   t ← t + 1.
8: end while
9: return g_ℓ = argmin_g R̃(g, w_{t,ℓ}, c; y) and g_h = argmin_g R̃(g, w_{t,h}, c; y).
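To illustrate the control flow of Algorithm 3 against the linear oracle sketched above, here is a simplified binary search over the importance weight. It is our own sketch: it keeps the structure (grow the weight while the weighted minimizer stays near-optimal, shrink it otherwise) but not the paper's exact weight range, tolerances, or constants.

```python
import numpy as np

def weighted_oracle(X, c, w, x_new, c_new, w_new, reg=1e-8):
    """argmin_g R_tilde(g, w_new, c_new; y) of Eq. (6): past weighted squared loss
    plus an extra term w_new * (g(x_new) - c_new)^2 for the fresh example."""
    Xa, ca, wa = np.vstack([X, x_new]), np.append(c, c_new), np.append(w, w_new)
    W = np.diag(wa)
    return np.linalg.solve(Xa.T @ W @ Xa + reg * np.eye(Xa.shape[1]), Xa.T @ W @ ca)

def bsearch(X, c, w, x_new, target_c, w_max, is_near_optimal, tol=1e-3):
    """Simplified BSEARCH: find the largest importance weight on (x_new, target_c)
    whose weighted minimizer is still in the near-optimal set G(Delta; y)."""
    lo, hi = 0.0, w_max
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        g_mid = weighted_oracle(X, c, w, x_new, target_c, mid)
        if is_near_optimal(g_mid):   # still near-optimal: the weight can grow
            lo = mid
        else:                        # over-constrained toward target_c: shrink it
            hi = mid
    g_lo = weighted_oracle(X, c, w, x_new, target_c, lo)
    g_hi = weighted_oracle(X, c, w, x_new, target_c, hi)
    return g_lo, g_hi   # their predictions on x_new bracket the extreme cost
```

With target_c = 1 the returned regressors bracket the maximum cost that MAXCOST refines; target_c = 0 plays the analogous role for MINCOST.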
Thus at round i, the set G_i(y) in COAL is equivalent to G(Δ_i; y), although we will use different radii here. The algorithm for the maximum cost approximation, displayed in Algorithm 2, is based on a form of binary search. When invoked with a radius parameter Δ, the algorithm maintains an interval [ℓ, h] that contains c⁺(x, G(Δ; y)) and uses a binary search to refine the interval. Using a fixed cost c and starting with some initial weight w, at each iteration, the binary search computes argmin_g R̃(g, w, c; y) and verifies whether the resulting regressor belongs to G(Δ; y). If it does, it increases w, and otherwise it shrinks w. Once a termination criterion is reached, the BINARYSEARCH routine outputs two regressors (g_ℓ, g_h) that provide new upper and lower bounds on c⁺(x, G(Δ; y)). The MAXCOST routine terminates and outputs g_h(x) if g_h has reasonable empirical regret. Otherwise, it updates the parameters for the next binary search based on g_ℓ(x) and g_h(x). Our main algorithmic result guarantees that this procedure produces an adequate approximation to c⁺(x, G(Δ; y)) without requiring too many oracle calls.

Theorem 1. For any (x, y), Δ, and ε, the MAXCOST algorithm outputs ĉ satisfying

    c⁺(x, G(Δ; y)) − ε ≤ ĉ ≤ c⁺(x, G(4Δ; y)) + ε.

Further, the algorithm uses O(ε⁻² log(1/ε)) oracle calls.

An immediate consequence of the theorem is a bound on the oracle complexity of COAL.

Corollary 2. Over the course of n examples, COAL makes O(Kn² log(n)) calls to the square loss oracle.

Thus COAL can be implemented in polynomial time for any set G that admits efficient square loss optimization. However, in practice, the number of oracle calls and the oracle itself are too computationally demanding to scale to larger problems. Our implementation alleviates this with an alternative heuristic approximation based on a sensitivity analysis of the oracle, which we detail in Section 6.

4. Generalization Analysis

In this section, we derive generalization guarantees for COAL. Our analysis assumes that the regressor set G is large, but finite. We study two different settings: one with minimal assumptions and one low-noise setting. Our low-noise assumption is related to the Massart noise condition (Massart & Nédélec, 2006), which in binary classification posits that the Bayes optimal predictor is bounded away from 1/2 for all x. Our condition generalizes this to CSMC and posits that the expected cost of the best label is separated from the expected cost of all other labels.

Assumption 2. A distribution D supported over (x, c) pairs satisfies the Massart noise condition if there exists τ > 0 such that for all x (with D(x) > 0),

    f*(x; y*(x)) ≤ min_{y ≠ y*(x)} f*(x; y) − τ, where y*(x) = argmin_y f*(x; y).

The Massart noise condition describes favorable prediction problems that lead to sharper generalization and label complexity bounds for COAL. COAL can also be analyzed under a milder assumption inspired by the Tsybakov noise condition, an analysis that we defer to an extended version. Our results depend on the noise level in the problem, which we define using the following quantity, for any ρ > 0:

    P_ρ ≜ P_x[ min_{y ≠ y*(x)} f*(x; y) − f*(x; y*(x)) ≤ ρ ].    (8)

P_ρ describes the probability that the expected cost of the best label, which is y*(x), is close to the expected cost of the second best label. When P_ρ is small for large ρ, the labels are well-separated, so learning is easier. For instance, under a Massart condition, P_ρ = 0 for all ρ < τ.
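The noise quantity P_ρ is straightforward to estimate when the Bayes-optimal costs are known, for example on synthetic data; the following small Monte Carlo sketch (our own illustration, not from the paper) does exactly that.

```python
import numpy as np

def estimate_P_rho(f_star, rho):
    """Empirical estimate of Eq. (8): the fraction of examples whose second-best
    expected cost is within rho of the best. f_star is an (n, K) array of
    Bayes-optimal expected costs on n sampled x's."""
    sorted_costs = np.sort(f_star, axis=1)
    margins = sorted_costs[:, 1] - sorted_costs[:, 0]   # second best minus best
    return float(np.mean(margins <= rho))

# A well-separated synthetic problem: the margin is about 0.5, so P_rho jumps
# from 0 to 1 as rho crosses that Massart-style threshold.
rng = np.random.default_rng(0)
f_star = np.tile([0.1, 0.6, 0.9], (1000, 1)) + 0.01 * rng.random((1000, 3))
print(estimate_P_rho(f_star, 0.2), estimate_P_rho(f_star, 0.6))
```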
We now state our generalization guarantee.

Theorem 3. For any δ < 1/e and for all i ∈ [n], with probability at least 1 − δ, we have

    E_{x,c}[c(h_{f_{i+1}}(x)) − c(h_{f*}(x))] ≤ min_{ρ>0} { ρ P_ρ + 2κKν_n / (ρ i) },

where κ = 80, ν_n = log(2n²|G|K/δ), f_i is as defined in Line 6 of Algorithm 1, and h_{f_i} is defined in Equation (1).

In the worst case, we bound P_ρ by 1 and optimize for ρ to obtain an O(√(K log(|G|/δ)/i)) bound after i samples. To compare, the standard generalization bound is O(√(log(|F|/δ)/i)) (Littlestone & Warmuth, 1989), which agrees with our bound since |F| = |G|^K in our case. However, since the bound captures the difficulty of the CSMC problem as measured by P_ρ, we can obtain a sharper result under Assumption 2 by setting ρ = τ.

Corollary 4. Under Assumption 2, for any δ < 1/e and for all i ∈ [n], with probability at least 1 − δ, we have

    E_{x,c}[c(h_{f_{i+1}}(x)) − c(h_{f*}(x))] ≤ 2κKν_n / (τ i).

Thus, Massart-type conditions lead to a faster O(1/n) convergence rate. This agrees with the literature on active learning for classification (Massart & Nédélec, 2006) and can be viewed as a generalization to CSMC. Importantly, both generalization bounds recover the optimal rates and are independent of the safety parameter β.

5. Label Complexity Analysis

Without distributional assumptions, the label complexity of COAL can be O(n), just as in the binary classification case, since there may always be confusing labels that force querying. In line with prior work, we introduce two disagreement coefficients that characterize favorable distributional properties. We first define a set of good classifiers, the cost-sensitive regret ball:

    F_csr(r) ≜ {f ∈ F | E[c(h_f(x)) − c(h_{f*}(x))] ≤ r}.

We may now define the disagreement coefficients.

Definition 1 (Disagreement coefficients). Define γ_r(x, y) = γ(x, y, F_csr(r)), and DIS(r, y) = {x | ∃f, f′ ∈ F_csr(r), h_f(x) = y ≠ h_{f′}(x)}. Then the disagreement coefficients are defined as:

    θ₁ ≜ sup_{r, η > 0} (η/r) P_x(∃y : γ_r(x, y) > η ∧ x ∈ DIS(r, y)),
    θ₂ ≜ sup_{r, η > 0} (η/r) Σ_{y∈Y} P_x(γ_r(x, y) > η ∧ x ∈ DIS(r, y)).

Intuitively, the conditions in both coefficients correspond to the checks on the domination and cost range of a label in Lines 13 and 15 of Algorithm 1. Specifically, when x ∈ DIS(r, y), there is confusion about whether y is the optimal label or not, and hence y is not dominated. The condition on γ_r(x, y) additionally captures the fact that a small cost range provides little information, even when y is non-dominated. Collectively, the coefficients capture the probability of drawing an example x on which the good classifiers disagree substantially, in both predicted costs and labels. Importantly, the notion of good classifiers is via the algorithm-independent set F_csr(r), and is only a property of F and D.

The definition is a natural adaptation from binary classification (Hanneke, 2014), where a similar disagreement region to DIS(r, y) is used. Our definition asks for confusion about the optimality of a specific label y, which provides more detailed information about the cost structure than simply asking for any confusion among the good classifiers. The 1/r scaling leads to bounded coefficients in many examples (Hanneke, 2014), and we also scale by the cost range parameter η, so that the favorable settings for active learning can be concisely expressed as having θ₁, θ₂ bounded, as opposed to a complex function of η.

The next two results bound the labeling effort (Eq. (4)) in the high noise and low noise cases respectively. The low noise assumption enables a significantly sharper bound.
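Since the disagreement coefficients mirror the checks in Lines 13 and 15 of Algorithm 1, it may help to see those two checks in isolation. The sketch below is our own, with made-up numbers: it takes per-label confidence bounds and returns the set of labels COAL would query.

```python
def coal_query_set(c_minus, c_plus, psi):
    """Query rule of Algorithm 1: keep the non-dominated labels (Line 13) and,
    if more than one survives, query those with cost range above psi (Line 15).
    c_minus[y], c_plus[y] approximate the min / max cost over the version space."""
    best_upper = min(c_plus)
    candidates = [y for y in range(len(c_minus)) if c_minus[y] <= best_upper]
    if len(candidates) <= 1:
        return []                      # one undisputed winner: nothing to query
    return [y for y in candidates if c_plus[y] - c_minus[y] > psi]

# Label 0 is almost certainly best and its cost is pinned down, label 2 is
# dominated, and label 1 is a plausible competitor with a wide cost range,
# so only label 1 is queried.
print(coal_query_set([0.10, 0.11, 0.70], [0.12, 0.55, 0.90], psi=0.1))
```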
Theorem 5. With probability at least 1 − 2δ, the label complexity of the algorithm over n examples is bounded as

    L1 ≤ O( θ₁ √(κ K ν_n n) + log(1/δ) ) and L2 ≤ O( θ₂ √(κ K ν_n n) + K log(1/δ) ),

where ν_n = log(2n²|G|K/δ).

Theorem 6. Assume the Massart noise condition holds. With probability at least 1 − 2δ, the label complexity of the algorithm over n examples is at most

    L1 ≤ O( c^{1/β} n^β K θ₁ ν_n log(n) / τ² + log(1/δ) ) and L2 ≤ O( c^{1/β} n^β [K θ₁ + θ₂] ν_n log(n) / τ² + log(1/δ) ).

In the high-noise case, the bounds scale with √n times the respective coefficients. This agrees with results in binary classification, where at best constant-factor savings over passive learning are possible. On the other hand, in the low noise case, the label complexity scales with c^{1/β} n^β / τ², which is a polynomial improvement over passive learning. However, the constant scales exponentially with 1/β, so β should not be chosen to be too small. Note that θ₂ can be much smaller than Kθ₁, as demonstrated through an example in the next section. In such cases, only a few labels are ever queried, and the L2 bound in the high noise case reflects this additional savings over passive learning. Unfortunately, under Massart noise, the L2 bound depends directly on Kθ₁, so that we do not benefit when θ₂ ≪ Kθ₁. This can be resolved by letting ψ_i depend on the noise level τ, but we prefer to use the more robust choice ψ_i = 1/√i, which still allows COAL to partially adapt to low noise and achieve low label complexity.

Unfortunately, comparing our bound to binary classification reveals suboptimality here. Under Massart noise and bounded coefficients, the label complexity for binary classification is typically log(n)/τ², which contrasts with our n^β/τ² rate. This loss in rate arises from setting β > 0. However, with β = 0, a suboptimal regressor that left the version space may re-enter at a later round and cause us to issue more queries, since we may not accumulate evidence against this regressor unless it is in the version space. Binary classification methods address this issue by hallucinating labels for unqueried examples (Dasgupta et al., 2007), but hallucinating costs does not seem applicable to CSMC, since the only safe choice that avoids eliminating f* appears to be f*(x; y), which is unknown. Our solution uses β > 0 to induce a shrinking radius so that bad regressors cannot re-enter the version space. However, to avoid eliminating f*, the initial radius Δ_1 must be larger than standard concentration arguments require, so the algorithm is conservative. Information-theoretically (by enumerating the version space), the logarithmic rate is possible in CSMC, but we do not know of efficient algorithms for this.

5.1. Two Examples

Our first example shows the benefits of using the domination criterion in querying, in addition to the cost range condition. Consider a problem under Assumption 2, where the optimal cost is predicted perfectly, the second best cost is worse, and all the other costs are substantially worse, but with variability in the predictions. Since all classifiers predict the right label, we get θ₁ = θ₂ = 0, so our label complexity bound is O(1). Intuitively, since every regressor is certain of the optimal label and its cost, we actually make zero queries. On the other hand, all of the suboptimal labels have large cost ranges, so querying based solely on a cost range criterion leads to a large label complexity.

A related example demonstrates the improvement of our query rule over more naïve approaches where we query either no label or all labels, which is the natural generalization of query rules from multiclass classification (Agarwal, 2013).
In the above example, if the best and second best labels are confused occasionally, θ₁ may be large, but we expect θ₂ ≪ Kθ₁ since only the second best label can be confused with the best. Thus, the L2 bound in Theorem 5 is a factor of K smaller than with a naïve query rule, since COAL only queries the best and second best labels.

6. Experiments

For computational efficiency, we implemented an approximate version of COAL using online optimization, based on online linear least-squares regression. The algorithm processes the data in one pass, and the idea is to (1) replace g_{i,y}, the ERM, with an approximation g°_{i,y} obtained by online updates, and (2) compute the minimum and maximum costs via a sensitivity analysis of the online update. Specifically, we define a sensitivity value s(x, c, g°_{i,y}) ≥ 0, which is the derivative of the prediction on x as a function of the importance weight w, for the new example x and cost c = 0 or c = 1 (for ĉ⁻ and ĉ⁺ respectively). Then we approximate ĉ⁻ via g°_{i,y}(x) − w° s(x, 0, g°_{i,y}), where w° is the largest weight w satisfying g°_{i,y}(x)² − (g°_{i,y}(x) − w s(x, 0, g°_{i,y}))² ≤ Δ_i, and Δ_i is the radius used at round i. We similarly approximate the maximum cost. See Appendix A for details.
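As a rough sketch of this heuristic for plain online linear regression, the lower confidence bound can be computed as follows. This is our own simplification: the learning-rate-based sensitivity, the weight cap w_max, and the clipping are illustrative choices, and the exact rule from Appendix A is not reproduced here.

```python
import numpy as np

def sensitivity(theta, x, c, lr):
    """Approximate derivative of the prediction on x with respect to the importance
    weight of one squared-loss gradient step toward cost c (a stand-in for the
    exact online-update sensitivity of Appendix A)."""
    return lr * abs(float(theta @ x) - c) * float(x @ x)

def approx_min_cost(theta, x, lr, delta_i, w_max=1e4):
    """Heuristic lower bound on the cost of a label, in the spirit of Section 6:
    shift the prediction toward cost 0 by w * s, with the largest weight w such
    that the squared loss on (x, 0) decreases by at most the radius delta_i."""
    pred = float(np.clip(theta @ x, 0.0, 1.0))
    s = sensitivity(theta, x, 0.0, lr)
    # Largest admissible shift d = w * s solves pred^2 - (pred - d)^2 <= delta_i,
    # additionally capped by what a weight of at most w_max can achieve.
    d = pred - np.sqrt(max(pred ** 2 - delta_i, 0.0))
    d = min(d, w_max * s)
    return max(0.0, pred - d)
```

The maximum cost is approximated symmetrically with target cost c = 1.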
6.1. Simulated Active Learning

We performed simulated active learning experiments with three datasets. ImageNet 20 and 40 are sub-trees of the ImageNet hierarchy covering the 20 and 40 most frequent classes, where each example has a single zero-cost label and the cost for incorrect labels is the tree-distance to the correct one. The feature vectors are the top layer of the Inception neural network (Szegedy et al., 2015). The third dataset, RCV1-v2 (Lewis et al., 2004), is a multilabel text-categorization dataset with 103 topic labels, organized as a tree with a similar tree-distance cost structure as the ImageNet data. Some dataset statistics are in Table 1.

Table 1. Dataset statistics (ℓ is the average sequence length).

| Dataset | K   | #examples | #features | ℓ  |
|---------|-----|-----------|-----------|----|
| INet 20 | 20  | 38k       | 6k        | –  |
| INet 40 | 40  | 71k       | 6k        | –  |
| RCV1-v2 | 103 | 781k      | 47k       | –  |
| POS     | 45  | 38k       | 40k       | 24 |
| NER     | 9   | 15k       | 15k       | 14 |
| Wiki    | 9   | 132k      | 89k       | 25 |

We compare our online version of COAL to passive online learning. We use the cost-sensitive one-against-all (CSOAA) implementation in Vowpal Wabbit⁵, which performs online linear regression for each label separately. There are two tuning parameters in our implementation. First, instead of Δ_i, we set the radius of the version space to c₀ ν_i / (i − 1) (i.e., β = 0, and the log factor ν_i = log(2(i − 1)²|G|K) scales with i) and instead tune the constant c₀. This alternate mellowness parameter controls how aggressive the query strategy is. The second parameter is the learning rate used by online linear regression⁶.

⁵ http://hunch.net/~vw
⁶ We use the default online learning algorithm in Vowpal Wabbit, which is a scale-free (Ross et al., 2013), importance weight invariant (Karampatziakis & Langford, 2011) form of AdaGrad (Duchi et al., 2010).

For each parameter setting and each dataset, we make one pass through the training set and check the test cost (which is just the normalized expected cost) of the model after every doubling of the number of queries. We repeat this on 100 random permutations of the training data and plot the results in Figures 1 and 2. For each mellowness, we show the results of the best learning rate, which maximizes a notion of AUC that reflects the trade-off between test cost and number of queries (see Eq. (11) in Appendix A). The figures show, for each dataset and mellowness, the number of queries against the median test cost, along with bars extending from the 15th to 85th quantile. Overall, COAL achieves a better trade-off between performance and queries.

With a proper mellowness parameter, active learning achieves similar test cost to passive learning with a factor of 8 to 32 fewer queries. On ImageNet 40 and RCV1-v2 (recall Figure 1), active learning achieves better test cost with a factor of 16 fewer queries. On RCV1-v2, COAL queries like passive up to around 256k queries, since the data is very sparse, and linear regression has the property that the cost range is maximal when an example has a new, unseen feature. Once COAL sees all features a few times, it queries much more efficiently than passive. Note that these plots correspond to the label complexity L2, with similar results for L1 in Appendix A.3.

While not always the best, we recommend a mellowness setting of 0.01, as it achieves reasonable performance on all three datasets. This is also confirmed by the learning-to-search experiments, which we discuss in the next section.

We also compare COAL with two active learning baselines. Both algorithms differ from COAL only in their query rule. ALLORNONE queries either all labels or no labels using both the domination and cost-range conditions, and is an adaptation of existing multiclass active learners (Agarwal, 2013). NODOM just uses the cost-range condition, inspired by active regression (Castro et al., 2005). The results for ImageNet 40 are displayed in Figure 2 (see also Appendix A.3), where we use the AUC strategy to choose the learning rate. We choose the mellowness by visual inspection for the baselines and use 0.01 for COAL⁷. COAL substantially outperforms both baselines, which provide minimal improvement over passive learning.

⁷ We use 0.01 for ALLORNONE and 10⁻³ for NODOM.

[Figure 2. Experiments with COAL. Top row shows test cost vs. number of queries for simulated active learning experiments. Bottom row shows accuracy vs. number of roll-outs for active and passive learning as the CSMC algorithm in learning-to-search.]

6.2. Learning to Search

We also experiment with COAL as the base learner in learning-to-search (Daumé III et al., 2009; Chang et al., 2015), which reduces joint prediction problems to CSMC. Here, a joint prediction example defines a search space, where a sequence of decisions is made to generate the structured label. We focus here on sequence labeling tasks, where the input is a sentence and the output is a sequence of labels (specifically, parts of speech or named entities). Learning-to-search solves joint prediction problems by generating the output one label at a time, conditioning on the input x and all past decisions. Since mistakes may lead to compounding errors, it is natural to represent the decision space as a CSMC problem, where the classes are the actions available (possible labels for a word) and the costs reflect the long-term loss of each choice. Intuitively, we should be able to avoid expensive computation of long-term loss on decisions like "is 'the' a DETERMINER?" once we are quite sure of the answer. Similar ideas motivate adaptive sampling for structured prediction (Shi et al., 2015).

We specifically use AGGRAVATE (Ross & Bagnell, 2014; Chang et al., 2015), which runs a learned policy to produce a backbone sequence of labels. For each position in the input, it then considers all possible deviation actions and executes an oracle for the rest of the sequence. The loss on this complete output is used as the cost for the deviating action. Run in this way, AGGRAVATE requires ℓK roll-outs when the input sentence has ℓ words and each word can take one of K possible labels. Since each roll-out takes O(ℓ) time, this can be computationally prohibitive, so we use active learning to reduce the number of roll-outs.
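To make the roll-out savings concrete, here is a schematic sketch (our own; `policy`, `rollout_loss`, and `cost_bounds` are assumed interfaces, not the paper's code) of one AGGRAVATE-style pass in which deviation actions are rolled out only when the COAL-style query rule asks for their costs.

```python
def aggravate_active_pass(sentence, policy, rollout_loss, cost_bounds, psi):
    """One pass over a sentence: roll in with the learned policy, then, at each
    position, roll out only the deviation actions that the query rule selects.
    Returns (position, {action: cost}) pairs to feed the CSMC learner."""
    backbone = [policy(sentence, t) for t in range(len(sentence))]   # roll-in
    csmc_examples = []
    for t in range(len(sentence)):
        lo, hi = cost_bounds(sentence, backbone, t)   # per-action confidence bounds
        best_upper = min(hi)
        cand = [a for a in range(len(lo)) if lo[a] <= best_upper]    # non-dominated actions
        to_query = [a for a in cand if hi[a] - lo[a] > psi] if len(cand) > 1 else []
        costs = {}
        for a in to_query:
            # Expensive step that active learning avoids when possible: deviate to
            # action a at position t, complete the sequence, and record the loss.
            costs[a] = rollout_loss(sentence, backbone, t, a)
        if costs:
            csmc_examples.append((t, costs))
    return csmc_examples
```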
We use COAL and a passive learning baseline inside AGGRAVATE on three joint prediction datasets (statistics are in Table 1). As above, we use several mellowness values and the same AUC criterion to select the best learning rate. The results are in Figure 2, and again our recommended mellowness is 0.01. Overall, active learning reduces the number of roll-outs required, but the improvements vary across the three datasets. On the Wikipedia data, COAL performs a factor of 4 fewer roll-outs to achieve similar performance to passive learning, and it achieves substantially better test performance. A similar, but less dramatic, behavior arises on the NER task. On the other hand, COAL offers minimal improvement over passive learning on the POS-tagging task. This agrees with our theory and prior empirical results (Hsu, 2010), which show that active learning may not always improve upon passive.

7. Discussion

This paper presents a new active learning algorithm for cost-sensitive multiclass classification. The algorithm enjoys strong theoretical guarantees and also outperforms passive baselines both in CSMC and structured prediction. We close with some intriguing questions:

1. Can we use a square loss oracle in other partial information problems like contextual bandits?
2. Can we avoid the safety parameter to achieve the optimal complexity in the low noise case?
3. Can we analyze the online approximation to COAL?

We hope to answer these questions in future work.

References

Agarwal, A. Selective sampling algorithms for cost-sensitive multiclass prediction. In International Conference on Machine Learning, 2013.

Agarwal, A., Hsu, D., Kale, S., Langford, J., Li, L., and Schapire, R. E. Taming the monster: A fast and simple algorithm for contextual bandits. In International Conference on Machine Learning, 2014.

Balcan, M.-F. and Long, P. Active and passive learning of linear separators under log-concave distributions. In Conference on Learning Theory, 2013.

Balcan, M.-F., Beygelzimer, A., and Langford, J. Agnostic active learning. In International Conference on Machine Learning, 2006.

Balcan, M.-F., Broder, A., and Zhang, T. Margin based active learning. In Conference on Learning Theory, 2007.

Beygelzimer, A., Dasgupta, S., and Langford, J. Importance weighted active learning. In International Conference on Machine Learning, 2009.

Beygelzimer, A., Hsu, D., Langford, J., and Zhang, T. Agnostic active learning without constraints. In Advances in Neural Information Processing Systems, 2010.

Beygelzimer, A., Langford, J., Li, L., Reyzin, L., and Schapire, R. E. Contextual bandit algorithms with supervised learning guarantees. In Artificial Intelligence and Statistics, 2011.

Castro, R., Willett, R., and Nowak, R. D. Faster rates in regression via active learning. In Advances in Neural Information Processing Systems, 2005.

Castro, R. M. and Nowak, R. D. Minimax bounds for active learning. Transactions on Information Theory, 2008.

Cavallanti, G., Cesa-Bianchi, N., and Gentile, C. Learning noisy linear classifiers via adaptive and selective sampling. Machine Learning, 2011.
Chang, K.-W., Krishnamurthy, A., Agarwal, A., Daumé III, H., and Langford, J. Learning to search better than your teacher. In International Conference on Machine Learning, 2015.

Dasgupta, S., Hsu, D., and Monteleoni, C. A general agnostic active learning algorithm. In Advances in Neural Information Processing Systems, 2007.

Daumé III, H., Langford, J., and Marcu, D. Search-based structured prediction. Machine Learning, 2009.

Dekel, O., Gentile, C., and Sridharan, K. Robust selective sampling from single and multiple teachers. In Conference on Learning Theory, 2010.

Duchi, J., Hazan, E., and Singer, Y. Adaptive subgradient methods for online learning and stochastic optimization. In Conference on Learning Theory, 2010.

Hanneke, S. Theory of disagreement-based active learning. Foundations and Trends in Machine Learning, 2014.

Hsu, D. Algorithms for Active Learning. PhD thesis, University of California at San Diego, 2010.

Huang, T.-K., Agarwal, A., Hsu, D., Langford, J., and Schapire, R. E. Efficient and parsimonious agnostic active learning. In Advances in Neural Information Processing Systems, 2015.

Karampatziakis, N. and Langford, J. Online importance weight aware updates. In Uncertainty in Artificial Intelligence, 2011.

Langford, J. and Beygelzimer, A. Sensitive error correcting output codes. In Conference on Learning Theory, 2005.

Lewis, D. D., Yang, Y., Rose, T. G., and Li, F. RCV1: A new benchmark collection for text categorization research. Journal of Machine Learning Research, 2004. Data available at http://www.jmlr.org/papers/volume5/lewis04a/lyrl2004_rcv1v2_README.htm.

Littlestone, N. and Warmuth, M. K. The weighted majority algorithm. In Foundations of Computer Science, 1989.

Massart, P. and Nédélec, É. Risk bounds for statistical learning. The Annals of Statistics, 2006.

Orabona, F. and Cesa-Bianchi, N. Better algorithms for selective sampling. In International Conference on Machine Learning, 2011.

Ross, S. and Bagnell, J. A. Reinforcement and imitation learning via interactive no-regret learning. arXiv:1406.5979, 2014.

Ross, S., Mineiro, P., and Langford, J. Normalized online learning. In Uncertainty in Artificial Intelligence, 2013.

Settles, B. Active learning. Synthesis Lectures on Artificial Intelligence and Machine Learning, 2012.

Shi, T., Steinhardt, J., and Liang, P. Learning where to sample in structured prediction. In Artificial Intelligence and Statistics, 2015.

Silla Jr., C. N. and Freitas, A. A. A survey of hierarchical classification across different application domains. Data Mining and Knowledge Discovery, 2011.

Syrgkanis, V., Krishnamurthy, A., and Schapire, R. E. Efficient algorithms for adversarial contextual learning. In International Conference on Machine Learning, 2016.

Szegedy, C., Liu, W., Jia, Y., Sermanet, P., Reed, S. E., Anguelov, D., Erhan, D., Vanhoucke, V., and Rabinovich, A. Going deeper with convolutions. In Computer Vision and Pattern Recognition, 2015.

Zhang, C. and Chaudhuri, K. Beyond disagreement-based agnostic active learning. In Advances in Neural Information Processing Systems, 2014.